deeppavlov.models.embedders¶
- class deeppavlov.models.embedders.fasttext_embedder.FasttextEmbedder(load_path: Union[str, Path], pad_zero: bool = False, mean: bool = False, **kwargs)[source]¶
Class that implements the fastText embedding model
- Parameters
load_path – path to the pre-trained embedding model to load
pad_zero – whether to pad samples or not
mean – whether to return mean token embedding
- model¶
fastText model instance
- tok2emb¶
dictionary with already embedded tokens
- dim¶
dimension of embeddings
- pad_zero¶
whether to pad sequence of tokens with zeros or not
- load_path¶
path with pre-trained fastText binary model
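A minimal usage sketch (the checkpoint path is a placeholder for any locally downloaded fastText binary; with the default mean=False each sample is embedded token by token):
>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext_embedder = FasttextEmbedder('/data/embeddings/wiki.ru.bin')
>>> fasttext_embedder([['большой', 'и', 'розовый', 'бегемот']])  # one vector per token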
- class deeppavlov.models.embedders.tfidf_weighted_embedder.TfidfWeightedEmbedder(embedder: Component, tokenizer: Optional[Component] = None, pad_zero: bool = False, mean: bool = False, tags_vocab_path: Optional[str] = None, vectorizer: Optional[Component] = None, counter_vocab_path: Optional[str] = None, idf_base_count: int = 100, log_base: int = 10, min_idf_weight=0.0, **kwargs)[source]¶
- The class implements embedding a sentence as a weighted average of token embeddings, using special per-token coefficients. The coefficients can be taken from the TF-IDF vectorizer given in vectorizer, or calculated as TF-IDF from the counter vocabulary given in counter_vocab_path. One can also pass tags_vocab_path, the path to a vocabulary with tag weights; in that case a batch of tags should be given as the second input to the __call__ method.
- Parameters
embedder – embedder instance
tokenizer – tokenizer instance, should be able to detokenize sentence
pad_zero – whether to pad samples or not
mean – whether to return mean token embedding
tags_vocab_path – optional path to vocabulary with tags weights
vectorizer – vectorizer instance; should be trained with analyzer="word"
counter_vocab_path – path to counter vocabulary
idf_base_count – minimal count value (tokens that occur fewer times are not counted)
log_base – logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
min_idf_weight – minimal idf weight
- embedder¶
embedder instance
- tokenizer¶
tokenizer instance, should be able to detokenize sentence
- dim¶
dimension of embeddings
- pad_zero¶
whether to pad samples or not
- mean¶
whether to return mean token embedding
- tags_vocab¶
vocabulary with weights for tags
- vectorizer¶
vectorizer instance
- counter_vocab_path¶
path to counter vocabulary
- counter_vocab¶
counter vocabulary
- idf_base_count¶
minimal count value (tokens that occur fewer times are not counted)
- log_base¶
logarithm base for the TF-IDF coefficient calculation from the counter vocabulary
- min_idf_weight¶
minimal idf weight
Examples
>>> from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder
>>> from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder
>>> fasttext_embedder = FasttextEmbedder('/data/embeddings/wiki.ru.bin')
>>> fastTextTfidf = TfidfWeightedEmbedder(embedder=fasttext_embedder,
...                                       counter_vocab_path='/data/vocabs/counts_wiki_lenta.txt')
>>> fastTextTfidf([['большой', 'и', 'розовый', 'бегемот']])
[array([ 1.99135890e-01, -7.14746421e-02,  8.01428872e-02, -5.32840924e-02,
         5.05212297e-02,  2.76053832e-01, -2.53270134e-01, -9.34443950e-02,
         ...
         1.18385439e-02,  1.05643446e-01, -1.21904516e-03,  7.70555378e-02])]
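The exact formula behind the counter-vocabulary weights is internal to the class; the sketch below only illustrates how the documented knobs (idf_base_count, log_base, min_idf_weight) could combine into an inverse-frequency weight, and how the final weighted average is formed. Both function names and the weight formula are assumptions, not DeepPavlov's code:

import math
import numpy as np

def illustrative_idf_weight(count: int, total: int, idf_base_count: int = 100,
                            log_base: int = 10, min_idf_weight: float = 0.0) -> float:
    # Assumed form, NOT the library's exact formula: counts below
    # idf_base_count are clamped to it, the weight is a log_base-logarithm
    # of the inverse frequency, and the result is floored at min_idf_weight.
    count = max(count, idf_base_count)
    return max(math.log(total / count, log_base), min_idf_weight)

def weighted_average(token_vectors, weights) -> np.ndarray:
    # The sentence embedding the class produces: a weighted average of
    # token embeddings with per-token coefficients.
    w = np.asarray(weights, dtype=float)
    v = np.asarray(token_vectors, dtype=float)
    return (w[:, None] * v).sum(axis=0) / w.sum()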
- __call__(batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, mean: Optional[bool] = None, *args, **kwargs) List[Union[list, ndarray]] [source]¶
Infer on the given data
- Parameters
batch – tokenized text samples
tags_batch – optional batch of corresponding tags
mean – whether to return the mean token embedding (takes effect regardless of self.mean)
*args – additional arguments
**kwargs – additional arguments
Returns:
embedded texts
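Continuing the Examples above, the call-time mean flag collapses each sample to a single averaged vector (a minimal sketch reusing the fastTextTfidf instance; the tags_batch input is only needed when the embedder was built with tags_vocab_path):
>>> fastTextTfidf([['большой', 'и', 'розовый', 'бегемот']], mean=True)  # one vector per sample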
- class deeppavlov.models.embedders.transformers_embedder.TransformersBertEmbedder(load_path: Union[str, Path], bert_config_path: Optional[Union[str, Path]] = None, truncate: bool = False, **kwargs)[source]¶
Transformers-based BERT model for embedding tokens, subtokens, and sentences
- Parameters
load_path – path to a pretrained BERT pytorch checkpoint
bert_config_path – path to a BERT configuration file
truncate – whether to remove zero-paddings from returned data
- __call__(subtoken_ids_batch: Collection[Collection[int]], startofwords_batch: Collection[Collection[int]], attention_batch: Collection[Collection[int]]) Tuple[Collection[Collection[Collection[float]]], Collection[Collection[Collection[float]]], Collection[Collection[float]], Collection[Collection[float]], Collection[Collection[float]]] [source]¶
Predict embeddings values for a given batch
- Parameters
subtoken_ids_batch – padded indexes for every subtoken
startofwords_batch – a mask matrix with 1 for the first subtoken of every token and 0 for every other subtoken
attention_batch – a mask matrix with 1 for every significant subtoken and 0 for paddings
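The two mask matrices follow directly from the subtokenization. A minimal sketch of building them by hand for a toy batch (the subtoken ids are made-up placeholders, and the commented-out call assumes a locally available BERT checkpoint):

import numpy as np

# Toy batch of one sample whose subtokens are [CLS] emb ##ed ##der [SEP],
# padded to length 8; the ids here are illustrative placeholders.
subtoken_ids = np.array([[101, 7861, 2098, 4063, 102, 0, 0, 0]])

# 1 for the first subtoken of every token ("emb"); 0 for continuation
# pieces ("##ed", "##der"), special tokens, and paddings.
startofwords = np.array([[0, 1, 0, 0, 0, 0, 0, 0]])

# 1 for every significant subtoken (including special tokens), 0 for paddings.
attention = np.array([[1, 1, 1, 1, 1, 0, 0, 0]])

# embedder = TransformersBertEmbedder(load_path='/data/models/bert')  # placeholder path
# outputs = embedder(subtoken_ids, startofwords, attention)  # five collections, per the signature above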