deeppavlov.models.embedders
class deeppavlov.models.embedders.bow_embedder.BoWEmbedder(**kwargs)

Performs one-hot encoding of tokens based on a pre-built vocabulary of tokens.
Example

>>> bow = BoWEmbedder()
>>> bow(['a', 'b', 'c'], vocab={'a': 0, 'b': 1})
[array([1, 0], dtype=int32), array([0, 1], dtype=int32), array([0, 0], dtype=int32)]
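The same encoding can be reproduced with plain NumPy. The following is a minimal sketch of the underlying logic; the helper name one_hot_encode is illustrative and not part of the library:

>>> import numpy as np
>>> def one_hot_encode(tokens, vocab):
...     vectors = []
...     for token in tokens:
...         vec = np.zeros(len(vocab), dtype=np.int32)
...         if token in vocab:  # out-of-vocabulary tokens stay all-zero
...             vec[vocab[token]] = 1
...         vectors.append(vec)
...     return vectors
>>> one_hot_encode(['a', 'b', 'c'], {'a': 0, 'b': 1})
[array([1, 0], dtype=int32), array([0, 1], dtype=int32), array([0, 0], dtype=int32)]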
class deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder(load_path: Union[str, pathlib.Path], save_path: Union[str, pathlib.Path] = None, dim: int = 100, pad_zero: bool = False, **kwargs)

Implements the fastText embedding model.
Parameters:
    - load_path – path to the pre-trained embedding model to load
    - save_path – not used, since the model is not trainable and is never saved
    - dim – dimensionality of the fastText model
    - pad_zero – whether to pad samples with zeros or not
    - **kwargs – additional arguments
Attributes:
    - model – fastText model instance
    - tok2emb – dictionary of tokens that have already been embedded
    - dim – dimensionality of the embeddings
    - pad_zero – whether to pad a sequence of tokens with zeros or not
    - load_path – path to the pre-trained fastText binary model
__call__(batch: List[List[str]], mean: bool = False, *args, **kwargs) → List[Union[list, numpy.ndarray]]

Embed sentences from a batch.

Parameters:
    - batch – list of tokenized text samples
    - mean – whether to return the mean embedding of tokens per sample
    - *args – additional positional arguments
    - **kwargs – additional keyword arguments

Returns:
    embedded batch
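A minimal usage sketch based on the signature above; the model path is a placeholder and must point to a real pre-trained fastText binary:

>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext = FasttextEmbedder(load_path='/path/to/ft_model.bin')  # placeholder path
>>> token_vectors = fasttext([['мама', 'мыла', 'раму']])                # one 100-dim vector per token
>>> sentence_vectors = fasttext([['мама', 'мыла', 'раму']], mean=True)  # one mean vector per sample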
class deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder(spec: str, elmo_output_names: Optional[List] = None, dim: Optional[int] = None, pad_zero: bool = False, concat_last_axis: bool = True, max_token: Optional[int] = None, mini_batch_size: int = 32, **kwargs)
ELMo (Embeddings from Language Models) representations are pre-trained contextual representations from large-scale bidirectional language models. See the paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

Parameters:
    - spec – A ModuleSpec defining the Module to instantiate, or a path from which to load a ModuleSpec via tensorflow_hub.load_module_spec using TensorFlow Hub.
    - elmo_output_names – A list of ELMo output names. You can use any combination of ["word_emb", "lstm_outputs1", "lstm_outputs2", "elmo"], or ["default"] on its own. See TensorFlow Hub for more information.
    - dim – Dimensionality of the output token embeddings of the ELMo model.
    - pad_zero – Whether to pad samples with zeros or not.
    - concat_last_axis – A boolean that enables/disables concatenation along the last axis. It is not used for elmo_output_names = ["default"].
    - max_token – The maximum number of words per batch line.
    - mini_batch_size – Used to reduce the memory requirements of the device.
Examples
You can use ELMo models from DeepPavlov as regular TensorFlow Hub Modules.

>>> import tensorflow_hub as hub
>>> elmo = hub.Module("http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
...                   trainable=True)
>>> embeddings = elmo(["это предложение", "word"], signature="default", as_dict=True)["elmo"]
You can also embed tokenized sentences.
>>> tokens_input = [["мама", "мыла", "раму"], ["рама", "", ""]]
>>> tokens_length = [3, 1]
>>> embeddings = elmo(inputs={"tokens": tokens_input, "sequence_len": tokens_length},
...                   signature="tokens", as_dict=True)["elmo"]
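The module can also be driven through the ELMoEmbedder class itself. The following is a minimal sketch based on the signature documented above, reusing the ru-news module URL from the first example:

>>> from deeppavlov.models.embedders.elmo_embedder import ELMoEmbedder
>>> elmo = ELMoEmbedder(spec="http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
...                     elmo_output_names=["elmo"], pad_zero=True)
>>> embeddings = elmo([["мама", "мыла", "раму"], ["рама"]])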
You can also get a hub.text_embedding_column, as described in the TensorFlow Hub documentation.
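A minimal sketch of that usage, assuming a TF1-style tensorflow_hub installation; the feature key "sentence" is illustrative:

>>> import tensorflow_hub as hub
>>> text_column = hub.text_embedding_column(
...     key="sentence",  # illustrative feature name
...     module_spec="http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-news_wmt11-16_1.5M_steps.tar.gz",
...     trainable=False)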