deeppavlov.models.embedders
class deeppavlov.models.embedders.bow_embedder.BoWEmbedder(depth: int, with_counts: bool = False, **kwargs)

Performs one-hot encoding of tokens based on a pre-built vocabulary of tokens.
- Parameters
depth – size of the output numpy vector.
with_counts – flag denoting whether to use binary encoding (zeros and ones) or token counts as the representation.
Example

>>> bow = BoWEmbedder(depth=3)
>>> bow([[0, 1], [1], []])
[array([1, 1, 0], dtype=int32), array([0, 1, 0], dtype=int32), array([0, 0, 0], dtype=int32)]
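With with_counts=True, repeated tokens should accumulate instead of being clipped to one (a hedged sketch extrapolated from the example above; the exact output dtype follows it):

>>> bow = BoWEmbedder(depth=3, with_counts=True)
>>> bow([[1, 1, 2]])  # token index 1 occurs twice, index 2 once
[array([0, 2, 1], dtype=int32)]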
class deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)

Implements the fastText embedding model.
- Parameters
load_path – path to the pre-trained embedding model
pad_zero – whether to pad samples or not
mean – whether to return the mean token embedding
- Attributes
model – fastText model instance
tok2emb – dictionary of already embedded tokens
dim – dimension of embeddings
pad_zero – whether to pad sequences of tokens with zeros or not
load_path – path to the pre-trained fastText binary model
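A minimal usage sketch (the model path is illustrative; any pre-trained fastText .bin file works, and the call convention follows the other embedders in this module):

>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext = FasttextEmbedder(load_path='/data/embeddings/wiki.ru.bin')
>>> embeddings = fasttext([['мама', 'мыла', 'раму']])  # one list of token vectors per sample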
class deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder(*args, **kwargs)

ELMo (Embeddings from Language Models) representations are pre-trained contextual representations from large-scale bidirectional language models. See the paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

- Parameters
spec – A ModuleSpec defining the Module to instantiate, or a path from which to load a ModuleSpec via tensorflow_hub.load_module_spec, using TensorFlow Hub.
elmo_output_names – A list of ELMo outputs to return (see the sketch after this list). You can use any combination of ["word_emb", "lstm_outputs1", "lstm_outputs2", "elmo"], or ["default"] on its own, where:
  word_emb – CNN embedding (default dim 512)
  lstm_outputs* – outputs of the LSTMs (default dim 1024)
  elmo – weighted sum of CNN and LSTM outputs (default dim 1024)
  default – mean elmo vector for the sentence (default dim 1024)
  See TensorFlow Hub for more information.
dim – Can be used to reduce the output embedding dimensionality when elmo_output_names != ["default"].
pad_zero – Whether to pad samples or not.
concat_last_axis – A boolean that enables/disables concatenation along the last axis. It is not used when elmo_output_names = ["default"].
max_token – The maximum number of words per batch line.
mini_batch_size – Used to reduce the memory requirements of the device.
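For instance, to request only the CNN embedding and the weighted sum (a hedged sketch: elmo_output_names is assumed to be accepted as a keyword argument, per the parameter list above):

>>> elmo = ELMoEmbedder(
...     spec="http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
...     elmo_output_names=["word_emb", "elmo"])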
If some required packages are missing, install all the requirements by running in the command line:

python -m deeppavlov install <path_to_config>

where <path_to_config> is a path to one of the provided config files or its name without an extension, for example:

python -m deeppavlov install elmo_ru-news
Examples

>>> from deeppavlov.models.embedders.elmo_embedder import ELMoEmbedder
>>> elmo = ELMoEmbedder("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz")
>>> elmo([['вопрос', 'жизни', 'Вселенной', 'и', 'вообще', 'всего'], ['42']])
array([[ 0.00719104,  0.08544601, -0.07179783, ...,  0.10879009,
        -0.18630421, -0.2189409 ],
       [ 0.16325025, -0.04736076,  0.12354863, ..., -0.1889013 ,
         0.04972512,  0.83029324]], dtype=float32)
You can use ELMo models from DeepPavlov as a regular TensorFlow Hub Module.

>>> import tensorflow as tf
>>> import tensorflow_hub as hub
>>> elmo = hub.Module("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz", trainable=True)
>>> sess = tf.Session()
>>> sess.run(tf.global_variables_initializer())
>>> embeddings = elmo(["это предложение", "word"], signature="default", as_dict=True)["elmo"]
>>> sess.run(embeddings)
array([[[ 0.05817392,  0.22493343, -0.19202903, ..., -0.14448944,
         -0.12425567,  1.0148407 ],
        [ 0.53596294,  0.2868537 ,  0.28028542, ..., -0.08028372,
          0.49089077,  0.75939953]],
       [[ 0.3433637 ,  1.0031182 , -0.1597258 , ...,  1.2442509 ,
          0.61029315,  0.43388373],
        [ 0.05370751,  0.02260921,  0.01074906, ...,  0.08748816,
         -0.0066415 , -0.01344293]]], dtype=float32)
The TensorFlow Hub module also supports tokenized sentences in the following format.

>>> tokens_input = [["мама", "мыла", "раму"], ["рама", "", ""]]
>>> tokens_length = [3, 1]
>>> embeddings = elmo(
...     inputs={
...         "tokens": tokens_input,
...         "sequence_len": tokens_length
...     },
...     signature="tokens", as_dict=True)["elmo"]
>>> sess.run(embeddings)
array([[[ 0.6040001 , -0.16130011,  0.56478846, ..., -0.00376141,
         -0.03820051,  0.26321286],
        [ 0.01834148,  0.17055789,  0.5311495 , ..., -0.5675535 ,
          0.62669843, -0.05939034],
        [ 0.3242596 ,  0.17909613,  0.01657108, ...,  0.1866098 ,
          0.7392496 ,  0.08285746]],
       [[ 1.1322289 ,  0.19077688, -0.17811403, ...,  0.42973226,
          0.23391506, -0.01294377],
        [ 0.05370751,  0.02260921,  0.01074906, ...,  0.08748816,
         -0.0066415 , -0.01344293],
        [ 0.05370751,  0.02260921,  0.01074906, ...,  0.08748816,
         -0.0066415 , -0.01344293]]], dtype=float32)
You can also obtain a hub.text_embedding_column, as described in the TensorFlow Hub documentation.
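A hedged sketch of building such a feature column (hub.text_embedding_column is part of the tensorflow_hub API; the column key "sentence" is illustrative):

>>> import tensorflow_hub as hub
>>> text_column = hub.text_embedding_column(
...     key="sentence",
...     module_spec="http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz")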
class deeppavlov.models.embedders.glove_embedder.GloVeEmbedder(load_path: Union[str, pathlib.Path], pad_zero: bool = False, mean: bool = False, **kwargs)

Implements the GloVe embedding model.
- Parameters
load_path – path to the pre-trained embedding model
pad_zero – whether to pad samples or not
mean – whether to return the mean token embedding
- Attributes
model – GloVe model instance
tok2emb – dictionary of already embedded tokens
dim – dimension of embeddings
pad_zero – whether to pad sequences of tokens with zeros or not
load_path – path to the pre-trained GloVe model
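A minimal usage sketch (the model path is illustrative):

>>> from deeppavlov.models.embedders.glove_embedder import GloVeEmbedder
>>> glove = GloVeEmbedder(load_path='/data/embeddings/glove.6B.100d.txt', pad_zero=True)
>>> embeddings = glove([['hello', 'world']])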
class deeppavlov.models.embedders.tfidf_weighted_embedder.TfidfWeightedEmbedder(embedder: deeppavlov.core.models.component.Component, tokenizer: Optional[deeppavlov.core.models.component.Component] = None, pad_zero: bool = False, mean: bool = False, tags_vocab_path: Optional[str] = None, vectorizer: Optional[deeppavlov.core.models.component.Component] = None, counter_vocab_path: Optional[str] = None, idf_base_count: int = 100, log_base: int = 10, min_idf_weight=0.0, **kwargs)

The class embeds a sentence as an average of token embeddings weighted by special coefficients. The coefficients can be taken from the TF-IDF vectorizer given in vectorizer, or calculated as TF-IDF from the counter vocabulary given in counter_vocab_path. Alternatively, one can pass tags_vocab_path pointing to a vocabulary with tag weights; in that case, a batch of tags should be given as the second input to the __call__ method.
- Parameters
embedder – embedder instance
tokenizer – tokenizer instance; should be able to detokenize a sentence
pad_zero – whether to pad samples or not
mean – whether to return the mean token embedding
tags_vocab_path – optional path to a vocabulary with tag weights
vectorizer – vectorizer instance; should be trained with analyzer="word"
counter_vocab_path – path to the counter vocabulary
idf_base_count – minimal count value; tokens occurring fewer times are not counted (see the weighting sketch after this list)
log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
min_idf_weight – minimal idf weight
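A hedged sketch of how these three knobs could combine into a per-token weight; the exact formula lives in the DeepPavlov source and may differ, and idf_weight here is a hypothetical helper, not the library's API:

from math import log

def idf_weight(count: int, idf_base_count: int = 100,
               log_base: int = 10, min_idf_weight: float = 0.0) -> float:
    # Counts below idf_base_count are clipped up to it, so very rare
    # tokens are not over-weighted by noise in the counter vocabulary.
    count = max(count, idf_base_count)
    # More frequent tokens get smaller weights; log_base sets the scale.
    weight = 1.0 / log(count, log_base)
    # Never drop below the configured floor.
    return max(weight, min_idf_weight)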
- Attributes
embedder – embedder instance
tokenizer – tokenizer instance; should be able to detokenize a sentence
dim – dimension of embeddings
pad_zero – whether to pad samples or not
mean – whether to return the mean token embedding
tags_vocab – vocabulary with weights for tags
vectorizer – vectorizer instance
counter_vocab_path – path to the counter vocabulary
counter_vocab – counter vocabulary
idf_base_count – minimal count value; tokens occurring fewer times are not counted
log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
min_idf_weight – minimal idf weight
Examples

>>> from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder
>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext_embedder = FasttextEmbedder('/data/embeddings/wiki.ru.bin')
>>> fastTextTfidf = TfidfWeightedEmbedder(embedder=fasttext_embedder,
...     counter_vocab_path='/data/vocabs/counts_wiki_lenta.txt')
>>> fastTextTfidf([['большой', 'и', 'розовый', 'бегемот']])
[array([ 1.99135890e-01, -7.14746421e-02,  8.01428872e-02, -5.32840924e-02,
         5.05212297e-02,  2.76053832e-01, -2.53270134e-01, -9.34443950e-02,
         ...
         1.18385439e-02,  1.05643446e-01, -1.21904516e-03,  7.70555378e-02])]
__call__(batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, mean: Optional[bool] = None, *args, **kwargs) → List[Union[list, numpy.ndarray]]

Infer on the given data.

- Parameters
batch – tokenized text samples
tags_batch – optional batch of corresponding tags
mean – whether to return the mean token embedding (independent of self.mean)
*args – additional arguments
**kwargs – additional arguments
- Returns
batch of embedded texts
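When the embedder is constructed with tags_vocab_path instead of counter_vocab_path, a batch of tags is passed as the second input (a hedged sketch; the POS-style tag names are illustrative and must match the keys of the tag-weight vocabulary):

>>> fastTextTfidf([['большой', 'розовый', 'бегемот']],
...     tags_batch=[['ADJ', 'ADJ', 'NOUN']])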
class deeppavlov.models.embedders.transformers_embedder.TransformersBertEmbedder(load_path: Union[str, pathlib.Path], bert_config_path: Optional[Union[pathlib.Path, str]] = None, truncate: bool = False, **kwargs)

Transformers-based BERT model for embedding tokens, subtokens, and sentences.
- Parameters
load_path – path to a pretrained BERT pytorch checkpoint
bert_config_path – path to a BERT configuration file
truncate – whether to remove zero-paddings from returned data
__call__(subtoken_ids_batch: Collection[Collection[int]], startofwords_batch: Collection[Collection[int]], attention_batch: Collection[Collection[int]]) → Tuple[Collection[Collection[Collection[float]]], Collection[Collection[Collection[float]]], Collection[Collection[float]], Collection[Collection[float]], Collection[Collection[float]]]

Predict embedding values for a given batch.

- Parameters
subtoken_ids_batch – padded indexes for every subtoken
startofwords_batch – a mask matrix with 1 for the first subtoken of each token and 0 for every other subtoken
attention_batch – a mask matrix with 1 for every significant subtoken and 0 for paddings
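A hedged sketch of the expected input shapes for a single sentence (the checkpoint path and subtoken ids are illustrative; per the return signature, the call yields two per-token embedding tensors and three sentence-level vectors, whose names below are assumptions):

>>> from deeppavlov.models.embedders.transformers_embedder import TransformersBertEmbedder
>>> embedder = TransformersBertEmbedder(load_path='/data/models/rubert', truncate=True)
>>> subtoken_ids = [[101, 7592, 2088, 102, 0, 0]]  # [CLS] hello world [SEP] + padding
>>> startofwords = [[0, 1, 1, 0, 0, 0]]            # first subtoken of each real word
>>> attention    = [[1, 1, 1, 1, 0, 0]]            # significant subtokens vs. padding
>>> token_embs, subtoken_embs, max_emb, mean_emb, pooler = embedder(
...     subtoken_ids, startofwords, attention)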