deeppavlov.models.entity_extraction
class deeppavlov.models.entity_extraction.ner_chunker.NerChunker(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, max_chunk_len: int = 180, batch_size: int = 2, **kwargs)

    Class to split documents into chunks of at most max_chunk_len symbols, so that the length does not exceed the maximal sequence length that can be fed into BERT.
    __init__(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, max_chunk_len: int = 180, batch_size: int = 2, **kwargs)

        Parameters
            max_chunk_len – maximal length of the chunks into which the document is split
            batch_size – number of chunks in a batch
    __call__(docs_batch: List[str]) → Tuple[List[List[str]], List[List[int]], List[List[List[Tuple[int, int]]]], List[List[List[str]]]]

        This method splits each document in the batch into chunks with the maximal length of max_chunk_len.

        Parameters
            docs_batch – batch of documents

        Returns
            batch of lists of document chunks for each document
            batch of lists of indices of the documents to which the chunks correspond
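    A minimal sketch of using NerChunker as a standalone component. The vocabulary name below is an assumption (use the vocabulary of the BERT model that will consume the chunks), and the output variable names are illustrative, not part of the API:

        from deeppavlov.models.entity_extraction.ner_chunker import NerChunker

        # "bert-base-uncased" is a hypothetical vocabulary choice.
        chunker = NerChunker(vocab_file="bert-base-uncased",
                             max_seq_len=400,
                             max_chunk_len=180,
                             batch_size=2)

        docs = ["Forrest Gump is a 1994 film directed by Robert Zemeckis. " * 30]

        # The four outputs follow the __call__ annotation above.
        chunks_batch, doc_nums_batch, offsets_batch, sentences_batch = chunker(docs)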
class deeppavlov.models.entity_extraction.entity_linking.EntityLinker(load_path: str, entities_database_filename: str, entity_ranker=None, num_entities_for_bert_ranking: int = 50, wikidata_file: Optional[str] = None, num_entities_to_return: int = 10, max_text_len: int = 300, lang: str = 'en', use_descriptions: bool = True, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, max_paragraph_len: int = 250, **kwargs)

    Class for linking entity substrings in the document to entities in Wikidata.
    __init__(load_path: str, entities_database_filename: str, entity_ranker=None, num_entities_for_bert_ranking: int = 50, wikidata_file: Optional[str] = None, num_entities_to_return: int = 10, max_text_len: int = 300, lang: str = 'en', use_descriptions: bool = True, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, max_paragraph_len: int = 250, **kwargs) → None

        Parameters
            load_path – path to the folder with the inverted index files
            entities_database_filename – file with the SQLite database with the Wikidata entities index
            entity_ranker – deeppavlov.models.torch_bert.torch_transformers_el_ranker.TorchTransformersEntityRankerInfer
            num_entities_for_bert_ranking – number of candidate entities for BERT ranking using description and context
            wikidata_file – .hdt file with the Wikidata graph
            num_entities_to_return – number of candidate entities returned for each substring
            max_text_len – maximal length of the context for entity ranking by description
            lang – Russian or English
            use_descriptions – whether to perform entity ranking by context and description
            use_tags – whether to use NER tags for entity filtering
            lemmatize – whether to lemmatize tokens
            full_paragraph – whether to use the full paragraph for entity ranking by context and description
            use_connections – whether to rank entities by the number of connections in Wikidata
            max_paragraph_len – maximal length of the paragraph for ranking by context and description
            **kwargs – additional keyword arguments
    __call__(entity_substr_batch: List[List[str]], entity_tags_batch: Optional[List[List[str]]] = None, sentences_batch: Optional[List[List[str]]] = None, entity_offsets_batch: Optional[List[List[List[int]]]] = None, sentences_offsets_batch: Optional[List[List[Tuple[int, int]]]] = None) → Tuple[Union[List[List[List[str]]], List[List[str]]], Union[List[List[List[Any]]], List[List[Any]]], Union[List[List[List[str]]], List[List[str]]]]

        Call self as a function.
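    In practice EntityLinker is usually built as part of a configured pipeline rather than constructed by hand, since it needs the downloaded entities database. A minimal sketch, assuming the entity_extraction_en config is available in the installed DeepPavlov version:

        from deeppavlov import build_model

        # download=True fetches the model files (including the entities
        # database) on first use; this is a sizable download.
        entity_extraction = build_model('entity_extraction_en', download=True)

        entity_extraction(['Forrest Gump is a comedy-drama film directed by Robert Zemeckis.'])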
class deeppavlov.models.entity_extraction.entity_detection_parser.EntityDetectionParser(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)

    This class parses the probabilities of tokens being part of entity substrings.
    __init__(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, return_entities_with_tags: bool = False, thres_proba: float = 0.8, **kwargs)

        Parameters
            o_tag – tag for tokens which are neither entities nor types
            tags_file – filename with NER tags
            entity_tags – tags for entities
            ignore_points – whether to consider points as separate symbols
            return_entities_with_tags – whether to return a dict of tags (keys) and lists of entity substrings (values), or simply a list of entity substrings
            thres_proba – if the probability of a tag is lower than thres_proba, the token is assigned the tag 'O'
    __call__(question_tokens_batch: List[List[str]], tokens_info_batch: List[List[List[float]]], tokens_probas_batch: numpy.ndarray) → Tuple[List[Union[List[str], Dict[str, List[str]]]], List[List[str]], List[Union[List[int], Dict[str, List[List[int]]]]]]

        Parameters
            question_tokens_batch – tokenized questions
            tokens_probas_batch – probabilities of the question tokens

        Returns
            batch of dicts where keys are tags and values are substrings corresponding to the tags
            batch of substrings which correspond to entity types
            batch of lists of token indices in the text which correspond to entities
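    A toy sketch of calling EntityDetectionParser directly. The tags file path, the three-tag set (O, B-ENT, I-ENT), and the assumption that tokens_info_batch can carry the same per-token probability lists are all illustrative, not guaranteed by the API:

        import numpy as np

        from deeppavlov.models.entity_extraction.entity_detection_parser import EntityDetectionParser

        # "tags.txt" is a hypothetical file listing the NER tag set of the
        # upstream model, one tag per line (e.g. O, B-ENT, I-ENT).
        parser = EntityDetectionParser(o_tag="O",
                                       tags_file="tags.txt",
                                       thres_proba=0.8,
                                       return_entities_with_tags=True)

        tokens_batch = [["Robert", "Zemeckis", "directed", "Forrest", "Gump"]]

        # Toy per-token distributions over (O, B-ENT, I-ENT); real values
        # come from the NER model upstream in the pipeline.
        probas = np.array([[[0.1, 0.8, 0.1],    # Robert   -> B-ENT
                            [0.1, 0.1, 0.8],    # Zemeckis -> I-ENT
                            [0.9, 0.05, 0.05],  # directed -> O
                            [0.1, 0.8, 0.1],    # Forrest  -> B-ENT
                            [0.1, 0.1, 0.8]]])  # Gump     -> I-ENT

        entities_batch, types_batch, positions_batch = parser(tokens_batch, probas.tolist(), probas)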