deeppavlov.models.vectorizers
-
class deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer(tokenizer: deeppavlov.core.models.component.Component, hash_size=16777216, doc_index: Optional[dict] = None, save_path: Optional[str] = None, load_path: Optional[str] = None, **kwargs)
Create a tfidf matrix of size [n_documents X n_features (hash_size)] from a collection of documents.
Parameters: - tokenizer – a tokenizer class
- hash_size – a hash size, a power of two (the default 16777216 is 2^24)
- doc_index – a dictionary of document ids and their titles
- save_path – a path to the .npz file where the tfidf matrix is saved
- load_path – a path to the .npz file from which the tfidf matrix is loaded
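The hash_size parameter can be illustrated with a minimal sketch of the hashing trick: each token is mapped to one of hash_size buckets, so no vocabulary has to be stored. The helper name hash_token and the use of zlib.crc32 are assumptions made for illustration; the hash function actually used by the library is not stated in this reference.

```python
import zlib

HASH_SIZE = 2 ** 24  # 16777216, the documented default for hash_size


def hash_token(token: str, hash_size: int = HASH_SIZE) -> int:
    """Map a token to a bucket index in [0, hash_size).

    zlib.crc32 is a deterministic stand-in from the standard library;
    it is an assumption, not the library's own hash function.
    """
    # For a power-of-two hash_size, "& (hash_size - 1)" equals "% hash_size".
    return zlib.crc32(token.encode("utf-8")) & (hash_size - 1)


tokens = ["deep", "pavlov", "deep"]
indices = [hash_token(t) for t in tokens]  # equal tokens share a bucket
```

Requiring a power of two lets the modulo be computed with a bit mask, which is one plausible reason for the constraint.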
Attributes: - hash_size – a hash size
- tokenizer – an instance of a tokenizer class
- term_freqs – a dictionary of tfidf terms and their frequencies
- doc_index – document ids, either provided by the user or generated automatically
- rows – tfidf matrix rows corresponding to terms
- cols – tfidf matrix cols corresponding to docs
- data – tfidf matrix data corresponding to tfidf values
-
__call__(questions: List[str]) → scipy.sparse.csr.csr_matrix
Transform an input list of documents to tfidf vectors.
Parameters: questions – a list of input strings
Returns: transformed documents as a csr_matrix with shape [n_documents X hash_size]
-
fit(docs: List[str], doc_ids: List[Any], doc_nums: List[int]) → None
Fit the vectorizer.
Parameters: - docs – a list of input documents
- doc_ids – a list of document ids corresponding to input documents
- doc_nums – a list of document integer ids as they appear in a database
Returns: None
-
get_count_matrix(row: List[int], col: List[int], data: List[int], size: int) → scipy.sparse.csr.csr_matrix
Get count matrix.
Parameters: - row – tfidf matrix rows corresponding to terms
- col – tfidf matrix cols corresponding to docs
- data – tfidf matrix data corresponding to tfidf values
- size – doc_index size
Returns: a count csr_matrix
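As a rough illustration of how row, col and data triplets describe a sparse count matrix, the sketch below sums them into a dictionary keyed by (row, col). The function name build_count_matrix and the dictionary representation are assumptions chosen to keep the example dependency-free; the method itself returns a SciPy csr_matrix, whose constructor accepts the same kind of (data, (row, col)) triplets.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_count_matrix(row: List[int], col: List[int],
                       data: List[int]) -> Dict[Tuple[int, int], int]:
    """Sum (row, col, value) triplets into a dict-of-keys sparse matrix."""
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for r, c, v in zip(row, col, data):
        counts[(r, c)] += v  # duplicate coordinates are summed
    return dict(counts)


# Term bucket 5 is listed twice for document 0 and once for document 1.
matrix = build_count_matrix(row=[5, 5, 9, 5], col=[0, 0, 0, 1], data=[1, 1, 3, 1])
```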
-
get_counts(docs: List[str], doc_ids: List[Any]) → Generator[[Tuple[KeysView, ValuesView, List[int]], Any], None]
Get term counts for a list of documents.
Parameters: - docs – a list of input documents
- doc_ids – a list of document ids corresponding to input documents
Yields: a tuple of term hashes, count values and column ids
Returns: None
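The generator behaviour described above can be sketched as follows: for each document, count the hashed terms and yield the term hashes, their counts, and a column id repeated once per distinct term. The whitespace tokenizer, the crc32 hash and the name get_counts_sketch are all assumptions for illustration, not the library's implementation.

```python
import zlib
from collections import Counter
from typing import Any, Iterator, List, Tuple


def get_counts_sketch(docs: List[str], doc_ids: List[Any], hash_size: int = 2 ** 24
                      ) -> Iterator[Tuple[List[int], List[int], List[Any]]]:
    """Yield (term hashes, counts, column ids) for each document."""
    for doc, doc_id in zip(docs, doc_ids):
        tokens = doc.lower().split()  # hypothetical whitespace tokenizer
        counts = Counter(zlib.crc32(t.encode("utf-8")) % hash_size for t in tokens)
        hashes = list(counts.keys())
        values = list(counts.values())
        col_ids = [doc_id] * len(hashes)  # one column id per distinct term
        yield hashes, values, col_ids


batches = list(get_counts_sketch(["a b a", "c"], doc_ids=[0, 1]))
```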
-
static get_tfidf_matrix(count_matrix: scipy.sparse.csr.csr_matrix) → Tuple[scipy.sparse.csr.csr_matrix, numpy.core.multiarray.array]
Convert a count matrix into a tfidf matrix.
Parameters: count_matrix – a count matrix
Returns: a tuple of tfidf matrix and term frequencies
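To make the count-to-tfidf conversion concrete, here is a minimal sketch using one common weighting: a log-scaled term frequency multiplied by a BM25-style inverse document frequency clipped at zero. Whether the library uses exactly this formula is an assumption, as is the reading of "term frequencies" as the number of documents containing each term. The matrix is a dictionary keyed by (term, doc) rather than a csr_matrix.

```python
import math
from collections import Counter
from typing import Dict, Tuple

Matrix = Dict[Tuple[int, int], float]


def tfidf_sketch(counts: Dict[Tuple[int, int], int],
                 n_docs: int) -> Tuple[Matrix, Dict[int, int]]:
    """Return (tfidf matrix, per-term document frequencies)."""
    # Number of documents in which each term occurs at least once.
    doc_freqs = Counter(term for (term, _doc), c in counts.items() if c > 0)
    tfidf: Matrix = {}
    for (term, doc), c in counts.items():
        tf = math.log1p(c)  # log(1 + count)
        n_t = doc_freqs[term]
        idf = max(0.0, math.log((n_docs - n_t + 0.5) / (n_t + 0.5)))
        tfidf[(term, doc)] = tf * idf
    return tfidf, dict(doc_freqs)


counts = {(5, 0): 2, (9, 0): 3, (5, 1): 1}
tfidf, freqs = tfidf_sketch(counts, n_docs=3)
```

Under this weighting a term that occurs in most documents, such as term 5 here, receives a weight of zero.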
-
load() → Tuple[scipy.sparse.csr.csr_matrix, Dict]
Load a tfidf matrix as csr_matrix.
Returns: a tuple of tfidf matrix and csr data.
Raises: FileNotFoundError if load_path doesn’t exist.
-
partial_fit(docs: List[str], doc_ids: List[Any], doc_nums: List[int]) → None
Partially fit on one batch.
Parameters: - docs – a list of input documents
- doc_ids – a list of document ids corresponding to input documents
- doc_nums – a list of document integer ids as they appear in a database
Returns: None
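One way to picture batch-wise fitting is an accumulator that appends each batch's triplets to running rows, cols and data lists, leaving matrix construction for later. The class TripletAccumulator is hypothetical, and documents are passed as lists of already-hashed term indices to keep the sketch short; the real method takes raw strings and a tokenizer.

```python
from collections import Counter
from typing import List


class TripletAccumulator:
    """Hypothetical helper: collects (row, col, data) triplets batch by batch."""

    def __init__(self) -> None:
        self.rows: List[int] = []
        self.cols: List[int] = []
        self.data: List[int] = []

    def partial_fit(self, docs: List[List[int]], doc_nums: List[int]) -> None:
        # Each document is a list of already-hashed term indices.
        for term_ids, doc_num in zip(docs, doc_nums):
            counts = Counter(term_ids)
            self.rows.extend(counts.keys())
            self.cols.extend([doc_num] * len(counts))
            self.data.extend(counts.values())


acc = TripletAccumulator()
acc.partial_fit([[5, 5, 9]], doc_nums=[0])  # first batch
acc.partial_fit([[5]], doc_nums=[1])        # second batch
```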
-
class deeppavlov.models.vectorizers.word_vectorizer.DictionaryVectorizer(save_path: str, load_path: Union[str, List[str]], min_freq: int = 1, unk_token: str = None, **kwargs)
Transforms words into 0-1 vectors of their possible tags, read from a vocabulary file. The format of the vocabulary must be word<TAB>tag_1<SPACE>…<SPACE>tag_k.
Parameters: - save_path – path to save the vocabulary,
- load_path – path to the vocabulary(-ies),
- min_freq – minimal frequency of tag to memorize this tag,
- unk_token – unknown token to be yielded for unknown words
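The vocabulary format above can be read with a few lines of Python. This is a minimal sketch: the function name parse_vocabulary is hypothetical, and it assumes min_freq counts how many vocabulary entries carry a tag, which is only one possible reading of the parameter.

```python
from collections import Counter
from typing import Dict, List


def parse_vocabulary(lines: List[str], min_freq: int = 1) -> Dict[str, List[str]]:
    """Parse 'word<TAB>tag_1 tag_2 ...' lines into a word -> tags mapping."""
    word_tags: Dict[str, List[str]] = {}
    tag_counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        word, tags_field = line.split("\t", 1)
        tags = tags_field.split(" ")
        word_tags[word] = tags
        tag_counts.update(tags)
    # Assumption: a tag is kept only if it occurs in at least min_freq entries.
    return {w: [t for t in tags if tag_counts[t] >= min_freq]
            for w, tags in word_tags.items()}


vocab = parse_vocabulary(["cat\tNOUN", "run\tVERB NOUN", "blue\tADJ"], min_freq=2)
```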
-
__call__(data: List) → numpy.ndarray
Transforms words to one-hot encoding according to the dictionary.
Parameters: data – the batch of words
Returns: a 3D array. answer[i][j][k] = 1 iff data[i][j] is the k-th word in the dictionary.
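The shape of the returned array can be sketched with nested lists. Following the class description, this sketch assumes the last axis indexes tags and that shorter sentences are padded with all-zero vectors; both points are assumptions, as are the name encode_batch and the choice to give unknown words an all-zero vector.

```python
from typing import Dict, List


def encode_batch(batch: List[List[str]], word_tags: Dict[str, List[str]],
                 tag_list: List[str]) -> List[List[List[int]]]:
    """Return answer[i][j][k] = 1 if word j of sentence i can carry tag k."""
    tag_index = {tag: k for k, tag in enumerate(tag_list)}
    max_len = max((len(sent) for sent in batch), default=0)
    answer = [[[0] * len(tag_list) for _ in range(max_len)] for _ in batch]
    for i, sent in enumerate(batch):
        for j, word in enumerate(sent):
            for tag in word_tags.get(word, []):  # unknown words stay all-zero
                answer[i][j][tag_index[tag]] = 1
    return answer


tags = ["ADJ", "NOUN", "VERB"]
word_tags = {"cat": ["NOUN"], "run": ["NOUN", "VERB"]}
encoded = encode_batch([["cat", "run"], ["dog"]], word_tags, tags)
```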
-
class deeppavlov.models.vectorizers.word_vectorizer.PymorphyVectorizer(save_path: str, load_path: str, max_pymorphy_variants: int = -1, **kwargs)
Transforms Russian words into 0-1 vectors of their possible Universal Dependencies tags. Tags are obtained using the Pymorphy analyzer (pymorphy2.readthedocs.io) and transformed to UD2.0 format using the russian-tagsets library (https://github.com/kmike/russian-tagsets). All UD2.0 tags that are compatible with the produced tags are memorized. The list of possible Universal Dependencies tags is read from a file that contains all the labels occurring in the UD2.0 SynTagRus dataset.
Parameters: - save_path – path to save the tags list,
- load_path – path to load the list of tags,
- max_pymorphy_variants – maximal number of pymorphy parses to be used. If -1, all parses are used.
-
__call__(data: List) → numpy.ndarray
Transforms words to one-hot encoding according to the dictionary.
Parameters: data – the batch of words
Returns: a 3D array. answer[i][j][k] = 1 iff data[i][j] is the k-th word in the dictionary.