deeppavlov.models.entity_extraction¶
- class deeppavlov.models.entity_extraction.ner_chunker.NerChunker(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, batch_size: int = 2, **kwargs)[source]¶
Class to split documents into chunks of at most max_seq_len symbols, so that their length does not exceed the maximal sequence length that can be fed into BERT
- __init__(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, batch_size: int = 2, **kwargs)[source]¶
- Parameters
vocab_file – vocab file of pretrained transformer model
max_seq_len – maximal length of chunks into which the document is split
lowercase – whether to lowercase text
batch_size – how many chunks are in batch
- __call__(docs_batch: List[str]) Tuple[List[List[str]], List[List[int]], List[List[Union[List[Union[Tuple[int, int], Tuple[Union[int, Any], Union[int, Any]]]], List[Tuple[Union[int, Any], Union[int, Any]]], List[Tuple[int, int]]]]], List[List[Union[List[Any], List[str]]]], List[List[str]]] [source]¶
This method splits each document in the batch into chunks with the maximal length of max_seq_len
- Parameters
docs_batch – batch of documents
- Returns
batch of lists of document chunks for each document
batch of lists of numbers of documents which correspond to chunks
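The chunk-and-track idea can be sketched as below. This is a simplified, hypothetical helper (not the DeepPavlov implementation): it splits on raw characters, whereas NerChunker uses the transformer tokenizer from vocab_file and respects token boundaries.

```python
def chunk_documents(docs_batch, max_seq_len=400):
    """Split each document into chunks of at most max_seq_len characters.

    Returns (chunks, doc_nums): the flat list of chunks and, for each
    chunk, the index of the document it came from, so that per-chunk
    predictions can later be regrouped per document.
    """
    chunks, doc_nums = [], []
    for doc_num, doc in enumerate(docs_batch):
        for start in range(0, len(doc), max_seq_len):
            chunks.append(doc[start:start + max_seq_len])
            doc_nums.append(doc_num)
    return chunks, doc_nums
```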
- class deeppavlov.models.entity_extraction.entity_linking.EntityLinker(load_path: str, entity_ranker=None, entities_database_filename: Optional[str] = None, words_dict_filename: Optional[str] = None, ngrams_matrix_filename: Optional[str] = None, num_entities_for_bert_ranking: int = 50, num_entities_for_conn_ranking: int = 5, num_entities_to_return: int = 10, max_text_len: int = 300, max_paragraph_len: int = 150, lang: str = 'ru', use_descriptions: bool = True, alias_coef: float = 1.1, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, kb_filename: Optional[str] = None, prefixes: Optional[Dict[str, Any]] = None, **kwargs)[source]¶
Class for linking of entity substrings in the document to entities in Wikidata
- __init__(load_path: str, entity_ranker=None, entities_database_filename: Optional[str] = None, words_dict_filename: Optional[str] = None, ngrams_matrix_filename: Optional[str] = None, num_entities_for_bert_ranking: int = 50, num_entities_for_conn_ranking: int = 5, num_entities_to_return: int = 10, max_text_len: int = 300, max_paragraph_len: int = 150, lang: str = 'ru', use_descriptions: bool = True, alias_coef: float = 1.1, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, kb_filename: Optional[str] = None, prefixes: Optional[Dict[str, Any]] = None, **kwargs) None [source]¶
- Parameters
load_path – path to folder with inverted index files
entity_ranker – component deeppavlov.models.kbqa.rel_ranking_bert
entities_database_filename – filename with database with entities index
words_dict_filename – filename with words and corresponding tags
ngrams_matrix_filename – filename with char tfidf matrix
num_entities_for_bert_ranking – number of candidate entities for BERT ranking using description and context
num_entities_for_conn_ranking – number of candidate entities for ranking using connections in the knowledge graph
num_entities_to_return – number of candidate entities for the substring which are returned
max_text_len – maximal length of entity context
max_paragraph_len – maximal length of context paragraphs
lang – language of the texts: “ru” (Russian) or “en” (English)
use_descriptions – whether to perform entity ranking by context and description
alias_coef – coefficient which is multiplied by the substring matching confidence if the substring is the title of the entity
use_tags – whether to filter candidate entities by tags
lemmatize – whether to lemmatize tokens
full_paragraph – whether to use full paragraph for entity context
use_connections – whether to rank entities by connections in the knowledge graph
kb_filename – filename with the knowledge base in HDT format
prefixes – entity and title prefixes
**kwargs –
- __call__(substr_batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, probas_batch: Optional[List[List[float]]] = None, sentences_batch: Optional[List[List[str]]] = None, offsets_batch: Optional[List[List[List[int]]]] = None, sentences_offsets_batch: Optional[List[List[Tuple[int, int]]]] = None, entities_to_link_batch: Optional[List[List[int]]] = None)[source]¶
Call self as a function.
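The role of alias_coef in candidate ranking can be illustrated with a toy scoring function. This is an assumption-laden sketch, not EntityLinker’s actual scoring: score_candidate is a hypothetical helper that fuzzily matches the substring against a candidate title and boosts the confidence by alias_coef when the substring is exactly a title or alias of the entity.

```python
from difflib import SequenceMatcher


def score_candidate(substring, candidate_title, is_title_match, alias_coef=1.1):
    """Toy candidate score: fuzzy similarity between the entity substring
    and the candidate's title, multiplied by alias_coef when the
    substring is itself a title/alias of the candidate entity."""
    conf = SequenceMatcher(None, substring.lower(), candidate_title.lower()).ratio()
    if is_title_match:
        conf *= alias_coef
    return conf
```

With such a score, an exact title match can outrank a near-match of equal fuzzy similarity, which is the intent of the alias_coef parameter described above.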
- class deeppavlov.models.entity_extraction.entity_detection_parser.EntityDetectionParser(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, thres_proba: float = 0.8, make_tags_from_probas: bool = False, lang: str = 'en', ignored_tags: Optional[List[str]] = None, **kwargs)[source]¶
This class parses probabilities of tokens to be a token from the entity substring.
- __init__(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, thres_proba: float = 0.8, make_tags_from_probas: bool = False, lang: str = 'en', ignored_tags: Optional[List[str]] = None, **kwargs)[source]¶
- Parameters
o_tag – tag for tokens which are neither entities nor types
tags_file – filename with NER tags
entity_tags – tags for entities
ignore_points – whether to consider points as separate symbols
thres_proba – if the probability of the tag is less than thres_proba, we assign the tag as ‘O’
make_tags_from_probas – whether to define token tags from confidences from sequence tagging model
lang – language of texts
ignored_tags – entity tags to be ignored
- __call__(question_tokens_batch: List[List[str]], tokens_info_batch: List[List[List[float]]], tokens_probas_batch: numpy.ndarray) Tuple[List[dict], List[dict], List[dict]] [source]¶
- Parameters
question_tokens_batch – tokenized questions
tokens_info_batch – list of tags of question tokens
tokens_probas_batch – list of probabilities of question tokens
- Returns
Batch of dicts where keys are tags and values are substrings corresponding to tags
Batch of substrings which correspond to entity types
Batch of lists of token indices in the text which correspond to entities
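The thres_proba fallback and the grouping of tagged tokens into entity substrings can be sketched as follows. Both functions are hypothetical, simplified helpers (no BIO-prefix handling, single document), not the EntityDetectionParser implementation.

```python
def probas_to_tags(probas, tag_list, o_tag="O", thres_proba=0.8):
    """Pick the most probable tag per token; fall back to o_tag when the
    best probability is below thres_proba (mirrors the thres_proba
    behaviour described above, in simplified form)."""
    tags = []
    for token_probas in probas:
        best = max(range(len(token_probas)), key=token_probas.__getitem__)
        tag = tag_list[best] if token_probas[best] >= thres_proba else o_tag
        tags.append(tag)
    return tags


def tags_to_entities(tokens, tags, o_tag="O"):
    """Group consecutive non-O tokens into (substring, tag, token indices)
    triples, one per detected entity."""
    entities, current, start = [], [], None
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag != o_tag:
            if not current:
                start = i
            current.append(token)
        elif current:
            entities.append((" ".join(current), tags[start], list(range(start, i))))
            current = []
    if current:
        entities.append((" ".join(current), tags[start], list(range(start, len(tokens)))))
    return entities
```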