deeppavlov.models.vectorizers

class deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer(tokenizer: deeppavlov.core.models.component.Component, hash_size=16777216, doc_index: Optional[dict] = None, save_path: Optional[str] = None, load_path: Optional[str] = None, **kwargs)

Create a tfidf matrix of size [n_documents X n_features (hash_size)] from a collection of documents.

Parameters

tokenizer – an instance of a tokenizer class
hash_size – the hash size (number of hashed features); a power of two
doc_index – a dictionary of document ids and their titles
save_path – path to the .npz file where the tfidf matrix is saved
load_path – path to the .npz file from which the tfidf matrix is loaded
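The role of hash_size can be illustrated with a small sketch of feature hashing: every token is mapped to a bucket index in [0, hash_size). The helper `hash_token` and the use of `zlib.crc32` are assumptions made for illustration only; this page does not state which hash function the class uses.

```python
import zlib

HASH_SIZE = 2 ** 24  # 16777216, the default hash_size in the signature above


def hash_token(token: str, hash_size: int = HASH_SIZE) -> int:
    """Map a token to a bucket index in [0, hash_size) (hypothetical helper)."""
    # crc32 is a stand-in for this sketch; any stable 32-bit hash would work.
    return zlib.crc32(token.encode("utf-8")) % hash_size


# Equal tokens always land in the same bucket.
buckets = [hash_token(t) for t in ["deep", "pavlov", "deep"]]

# With a power-of-two hash_size, the modulo is equivalent to a bit mask.
raw = zlib.crc32(b"deep")
masked = raw & (HASH_SIZE - 1)
```

A power-of-two size keeps the reduction cheap, and distinct tokens may still collide in the same bucket.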

hash_size: a hash size

tokenizer: instance of a tokenizer class

term_freqs: a dictionary with tfidf terms and their frequencies

doc_index: document ids, either provided by the user or generated automatically

rows: tfidf matrix rows corresponding to terms

cols: tfidf matrix cols corresponding to docs

data: tfidf matrix data corresponding to tfidf values

__call__(questions: List[str]) → scipy.sparse.csr.csr_matrix

Transform an input list of documents into tfidf vectors.

Parameters: questions – a list of input strings
Returns: transformed documents as a csr_matrix with shape [n_documents X hash_size]

fit(docs: List[str], doc_ids: List[Any], doc_nums: List[int]) → None

Fit the vectorizer.

Parameters

docs – a list of input documents
doc_ids – a list of document ids corresponding to input documents
doc_nums – a list of document integer ids as they appear in a database

Returns

None

get_count_matrix(row: List[int], col: List[int], data: List[int], size: int) → scipy.sparse.csr.csr_matrix

Get count matrix.

Parameters

row – tfidf matrix rows corresponding to terms
col – tfidf matrix cols corresponding to docs
data – tfidf matrix data corresponding to tfidf values
size – doc_index size

Returns

a count csr_matrix
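As a rough illustration of how row, col and data triplets become a sparse matrix, the sketch below uses SciPy's coordinate-style `csr_matrix` constructor. The sample values and the small `hash_size` are made up, and the shape `(hash_size, size)` assumes that terms index rows and documents index columns, as the parameter descriptions say.

```python
from scipy.sparse import csr_matrix

# Hypothetical triplets: term hash (row), document column (col), term count (data).
row = [3, 5, 3]
col = [0, 0, 1]
data = [2, 1, 4]

hash_size = 8  # tiny value for readability; the class default is 16777216
size = 2       # number of documents, i.e. the doc_index size

# Entry (row[i], col[i]) receives data[i]; duplicate coordinates would be summed.
count_matrix = csr_matrix((data, (row, col)), shape=(hash_size, size))
```

Only the three non-zero entries are stored, which is what makes a matrix with millions of hashed rows practical.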

get_counts(docs: List[str], doc_ids: List[Any]) → Generator[Tuple[KeysView, ValuesView, List[int]], Any, None]

Get term counts for a list of documents.

Parameters

docs – a list of input documents
doc_ids – a list of document ids corresponding to input documents

Yields

a tuple of term hashes, count values and column ids

Returns

None
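The shape of the yielded tuples can be sketched with a plain generator. Everything here is an assumption for illustration: `iter_counts` is a hypothetical function, whitespace splitting stands in for the tokenizer, `zlib.crc32` stands in for the hash function, and `doc_index` is taken to map document ids to column numbers.

```python
import zlib
from collections import Counter
from typing import Any, Dict, Generator, KeysView, List, Tuple, ValuesView


def iter_counts(
    docs: List[str],
    doc_ids: List[Any],
    doc_index: Dict[Any, int],
    hash_size: int = 2 ** 24,
) -> Generator[Tuple[KeysView, ValuesView, List[int]], Any, None]:
    """Yield (term hashes, counts, column ids) for each document."""
    for doc, doc_id in zip(docs, doc_ids):
        # Stand-in tokenizer and hash function (both assumptions).
        hashes = [zlib.crc32(tok.encode("utf-8")) % hash_size for tok in doc.split()]
        counts = Counter(hashes)
        # Every term of this document shares the same column id.
        col_ids = [doc_index[doc_id]] * len(counts)
        yield counts.keys(), counts.values(), col_ids


doc_index = {"x": 0, "y": 1}
results = list(iter_counts(["a b a", "c"], ["x", "y"], doc_index))
```

Each yielded tuple lines up element by element, so the three parts can be appended directly to row, col and data lists.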

get_index2doc() → Dict[Any, int]

Invert doc_index.

Returns: inverted doc_index dict
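Inverting a dictionary amounts to swapping its keys and values. The sketch below assumes doc_index maps document ids to integer column indices; the sample ids are made up.

```python
# Hypothetical doc_index: document id -> column index in the matrix.
doc_index = {"Python (language)": 0, "Monty Python": 1}

# Swap keys and values; this is lossless only if the values are unique.
index2doc = {index: doc_id for doc_id, index in doc_index.items()}
```

The inverted mapping lets a column index from the matrix be translated back to the document id it stands for.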

static get_tfidf_matrix(count_matrix: scipy.sparse.csr.csr_matrix) → Tuple[scipy.sparse.csr.csr_matrix, numpy.array]

Convert a count matrix into a tfidf matrix.

Parameters: count_matrix – a count matrix
Returns: a tuple of tfidf matrix and term frequencies
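This page does not give the weighting formula, so the following is only a sketch of one common variant: tf = log(1 + count) and idf = log(n_docs / doc_freq). The function `tfidf_from_counts` is hypothetical, the matrix is assumed to have terms as rows and documents as columns, and "term frequencies" is taken here to mean the number of documents containing each term.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def tfidf_from_counts(count_matrix: csr_matrix):
    """Return (tfidf matrix, per-term document frequencies). Illustrative only."""
    n_docs = count_matrix.shape[1]
    # Number of stored entries per row = number of documents containing the term
    # (assumes the matrix holds no explicitly stored zeros).
    doc_freqs = count_matrix.getnnz(axis=1)
    idf = np.zeros(count_matrix.shape[0])
    seen = doc_freqs > 0
    idf[seen] = np.log(n_docs / doc_freqs[seen])
    # Dampen raw counts, then scale each term row by its idf weight.
    tf = count_matrix.astype(float).log1p()
    tfidf = (diags(idf) @ tf).tocsr()
    return tfidf, doc_freqs


# Three terms (rows) across two documents (columns); made-up counts.
counts = csr_matrix(np.array([[1, 1], [3, 0], [0, 0]]))
tfidf, doc_freqs = tfidf_from_counts(counts)
```

Under this variant a term that appears in every document gets weight zero, while a term confined to few documents is weighted up.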

load() → Tuple[scipy.sparse.csr.csr_matrix, Dict]

Load a tfidf matrix as csr_matrix.

Returns: a tuple of tfidf matrix and csr data.

Raises: FileNotFoundError – if load_path doesn’t exist.

partial_fit(docs: List[str], doc_ids: List[Any], doc_nums: List[int]) → None

Partially fit on one batch.

Parameters

docs – a list of input documents
doc_ids – a list of document ids corresponding to input documents
doc_nums – a list of document integer ids as they appear in a database

Returns

None
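Batch-wise fitting can be pictured as appending triplets to the rows, cols and data lists one batch at a time, with a reset step that empties them. The class `TripletAccumulator` below is a toy stand-in written for illustration, not the library's implementation; it omits doc_ids, splits on whitespace and hashes with `zlib.crc32`, all of which are assumptions.

```python
import zlib
from collections import Counter
from typing import List


class TripletAccumulator:
    """Toy stand-in that collects (row, col, data) triplets batch by batch."""

    def __init__(self, hash_size: int = 2 ** 24) -> None:
        self.hash_size = hash_size
        self.rows: List[int] = []
        self.cols: List[int] = []
        self.data: List[int] = []

    def partial_fit(self, docs: List[str], doc_nums: List[int]) -> None:
        for doc, num in zip(docs, doc_nums):
            counts = Counter(
                zlib.crc32(tok.encode("utf-8")) % self.hash_size for tok in doc.split()
            )
            self.rows.extend(counts.keys())        # term hashes
            self.cols.extend([num] * len(counts))  # document column, repeated
            self.data.extend(counts.values())      # term counts

    def reset(self) -> None:
        self.rows.clear()
        self.cols.clear()
        self.data.clear()


acc = TripletAccumulator()
acc.partial_fit(["a b a"], [0])  # first batch
acc.partial_fit(["c"], [1])      # second batch extends the same lists
```

Because the lists only grow, the full matrix can be built once after the last batch instead of being rebuilt per batch.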

reset() → None

Clear rows, cols and data.

Returns: None

save() → None

Save tfidf matrix into .npz format.

Returns: None
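A sparse matrix can be written to and read back from an .npz file with SciPy's own helpers, as sketched below. This only illustrates the round trip behind save() and load(); the on-disk layout used by this class is not described here and may differ, and the file name `tfidf.npz` is made up.

```python
import os
import tempfile

from scipy.sparse import csr_matrix, load_npz, save_npz

# Small made-up matrix standing in for a tfidf matrix.
matrix = csr_matrix([[0.0, 1.5], [2.0, 0.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf.npz")
    save_npz(path, matrix)    # write the sparse components to disk
    restored = load_npz(path)  # read them back as a sparse matrix
```

Loading from a path that does not exist raises an error, which matches the FileNotFoundError noted for load().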