deeppavlov.models.doc_retrieval
Document retrieval classes.
-
class deeppavlov.models.doc_retrieval.tfidf_ranker.TfidfRanker(vectorizer: deeppavlov.models.vectorizers.hashing_tfidf_vectorizer.HashingTfIdfVectorizer, top_n=5, active: bool = True, **kwargs)
Rank documents according to input strings.
- Parameters
vectorizer – a vectorizer class
top_n – a number of doc ids to return
active – whether to return the number of doc ids specified by top_n (True) or all ids (False)
-
top_n – a number of doc ids to return
-
vectorizer – an instance of the vectorizer class
-
index2doc – inverted doc_index
-
iterator – a dataset iterator used for generating batches while fitting the vectorizer
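The interaction between top_n and active can be illustrated with a minimal sketch. This is not the TfidfRanker implementation: the helper `select_top` and the plain dictionary of scores are hypothetical stand-ins, and the sketch assumes that a higher score means a better match.

```python
from typing import Dict, List, Tuple


def select_top(scores: Dict[str, float], top_n: int = 5,
               active: bool = True) -> Tuple[List[str], List[float]]:
    """Hypothetical helper: sort doc ids by descending score.

    When `active` is True only the first `top_n` ids are kept,
    otherwise every id is returned.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if active:
        ranked = ranked[:top_n]
    doc_ids = [doc_id for doc_id, _ in ranked]
    doc_scores = [score for _, score in ranked]
    return doc_ids, doc_scores


# Made-up scores for three documents.
scores = {"doc_a": 0.12, "doc_b": 0.87, "doc_c": 0.45}
top_ids, top_scores = select_top(scores, top_n=2)
all_ids, all_scores = select_top(scores, top_n=2, active=False)
```

With active set to False the top_n value is ignored and all three ids come back, still ordered by score.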
-
class deeppavlov.models.doc_retrieval.logit_ranker.LogitRanker(squad_model: deeppavlov.core.models.component.Component, batch_size: int = 50, sort_noans: bool = False, top_n: int = 1, return_answer_sentence: bool = False, **kwargs)
Select the best answer using squad model logits. Split a single input batch into several sub-batches, send each sub-batch to the squad model separately, and get a single best answer for each original batch.
- Parameters
squad_model – a loaded squad model
batch_size – batch size to use with squad model
sort_noans – whether to rank no-answer (noans) predictions below the most probable answers
top_n – number of answers to return
-
squad_model – a loaded squad model
-
batch_size – batch size to use with squad model
-
top_n – number of answers to return
-
__call__(contexts_batch: List[List[str]], questions_batch: List[List[str]], doc_ids_batch: Optional[List[List[str]]] = None) → Union[Tuple[List[str], List[float], List[int], List[str]], Tuple[List[List[str]], List[List[float]], List[List[int]], List[List[str]]], Tuple[List[str], List[float], List[int]], Tuple[List[List[str]], List[List[float]], List[List[int]]]]
Sort the results obtained from the squad reader by logits and return the answer with the maximum logit.
- Parameters
contexts_batch – a batch of contexts which should be treated as a single batch in the outer JSON config
questions_batch – a batch of questions which should be treated as a single batch in the outer JSON config
doc_ids_batch (optional) – names of the documents from which the contexts_batch was derived
- Returns
a batch of best answers, their scores, their positions in the contexts and, if doc_ids_batch was passed, the doc ids for these answers
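The ranking described above can be sketched for a single question as follows. This is an illustration under stated assumptions, not the LogitRanker code: `rank_answers`, `chunks` and `toy_model` are hypothetical names, the stand-in model returns (answer, position, logit) triples, and an empty string is assumed to mark a no-answer prediction.

```python
from typing import Callable, Iterator, List, Tuple

# (answer text, answer position in the context, logit); the layout is an assumption
Prediction = Tuple[str, int, float]
Model = Callable[[List[str], List[str]], List[Prediction]]


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive sub-batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rank_answers(contexts: List[str], question: str, model: Model,
                 batch_size: int = 50, sort_noans: bool = False,
                 top_n: int = 1) -> List[Prediction]:
    """Query the model sub-batch by sub-batch, then sort all predictions by logit."""
    predictions: List[Prediction] = []
    for sub_batch in chunks(contexts, batch_size):
        predictions.extend(model(sub_batch, [question] * len(sub_batch)))
    if sort_noans:
        # Non-empty answers sort ahead of empty (no-answer) ones, then by logit.
        return sorted(predictions, key=lambda p: (p[0] != "", p[2]), reverse=True)[:top_n]
    return sorted(predictions, key=lambda p: p[2], reverse=True)[:top_n]


def toy_model(contexts: List[str], questions: List[str]) -> List[Prediction]:
    """Hypothetical stand-in for a squad model, used only to make the sketch runnable."""
    results = []
    for context in contexts:
        if context:
            results.append((context.split()[0], 0, float(len(context))))
        else:
            results.append(("", 0, 100.0))  # confident no-answer prediction
    return results


contexts = ["Paris is the capital of France", "", "Lyon is a city"]
plain = rank_answers(contexts, "Which city?", toy_model, batch_size=2)
downgraded = rank_answers(contexts, "Which city?", toy_model, batch_size=2,
                          sort_noans=True, top_n=2)
```

In this toy run the no-answer prediction has the highest logit, so it wins unless sort_noans is enabled, in which case the two non-empty answers are returned first.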
-
class deeppavlov.models.doc_retrieval.pop_ranker.PopRanker(pop_dict_path: str, load_path: str, top_n: int = 3, active: bool = True, **kwargs)
Rank documents according to their tfidf scores and popularities. It is not a standalone ranker; it should be used for re-ranking the results of the TF-IDF Ranker.
Based on a logistic regression trained on 3 features:
tfidf score of the article
popularity of the article, obtained via the Wikimedia REST API as the mean number of views for the period from 2017/11/05 to 2018/11/05
product of the two features above
- Parameters
pop_dict_path – a path to json file with article title to article popularity map
load_path – a path to saved logistic regression classifier
top_n – a number of doc ids to return
active – whether to return the number of doc ids specified by top_n (True) or all ids (False)
-
pop_dict – a map of article titles to their popularity
-
clf – a loaded logistic regression classifier
-
top_n – a number of doc ids to return
-
__call__(input_doc_ids: List[List[Any]], input_doc_scores: List[List[float]]) → Tuple[List[List], List[List]]
Get tfidf scores and tfidf ids, re-rank them by applying the logistic regression classifier, and output pop ranker ids and pop ranker scores.
- Parameters
input_doc_ids – top input doc ids of tfidf ranker
input_doc_scores – top input doc scores of tfidf ranker corresponding to doc ids
- Returns
top doc ids of pop ranker and their corresponding scores
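A rough sketch of the re-ranking idea for one query, using the three features listed for this class (tfidf score, popularity, and their product). Everything here is illustrative: the function `rerank`, the weights, the bias and the popularity values are made up, a hand-written logistic function stands in for the loaded classifier, and treating an unknown title as having zero popularity is an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple


def rerank(doc_ids: List[str], tfidf_scores: List[float],
           pop_dict: Dict[str, float], weights: Sequence[float], bias: float,
           top_n: int = 3, active: bool = True) -> Tuple[List[str], List[float]]:
    """Score each document with a logistic model over three features and sort."""
    scored = []
    for doc_id, tfidf in zip(doc_ids, tfidf_scores):
        popularity = pop_dict.get(doc_id, 0.0)  # assumption: unknown title -> 0
        features = (tfidf, popularity, tfidf * popularity)
        z = bias + sum(w * f for w, f in zip(weights, features))
        probability = 1.0 / (1.0 + math.exp(-z))  # logistic function
        scored.append((doc_id, probability))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if active:
        scored = scored[:top_n]
    return [doc_id for doc_id, _ in scored], [score for _, score in scored]


# Made-up popularity map and tfidf output for a single query.
pop_dict = {"A": 10.0, "B": 1000.0, "C": 50.0}
ids, scores = rerank(["A", "B", "C"], [0.9, 0.5, 0.1], pop_dict,
                     weights=(1.0, 0.001, 0.01), bias=-1.0, top_n=2)
```

Document "B" has a lower tfidf score than "A" but is far more popular, so with these made-up weights it moves to the top after re-ranking.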