deeppavlov.models.entity_extraction¶
- class deeppavlov.models.entity_extraction.ner_chunker.NerChunker(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, batch_size: int = 2, **kwargs)[source]¶
Class to split documents into chunks of at most max_seq_len symbols, so that their length does not exceed the maximal sequence length that can be fed into BERT
- __init__(vocab_file: str, max_seq_len: int = 400, lowercase: bool = False, batch_size: int = 2, **kwargs)[source]¶
- Parameters
vocab_file – vocab file of pretrained transformer model
max_seq_len – maximal length of chunks into which the document is split
lowercase – whether to lowercase text
batch_size – how many chunks are in batch
- __call__(docs_batch: List[str]) Tuple[List[List[str]], List[List[int]], List[List[Union[List[Union[Tuple[int, int], Tuple[Union[int, Any], Union[int, Any]]]], List[Tuple[Union[int, Any], Union[int, Any]]], List[Tuple[int, int]]]]], List[List[Union[List[Any], List[str]]]], List[List[str]]] [source]¶
This method splits each document in the batch into chunks with the maximal length of max_seq_len
- Parameters
docs_batch – batch of documents
- Returns
batch of lists of document chunks for each document
batch of lists of numbers of documents which correspond to chunks
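The chunk-and-track idea can be sketched as below. This is a simplified, hypothetical helper (not the DeepPavlov implementation): it splits on raw characters, whereas NerChunker uses the transformer tokenizer from vocab_file and respects token boundaries.

```python
def chunk_documents(docs_batch, max_seq_len=400):
    """Split each document into chunks of at most max_seq_len characters.

    Returns (chunks, doc_nums): the flat list of chunks and, for each
    chunk, the index of the document it came from, so that per-chunk
    predictions can later be regrouped per document.
    """
    chunks, doc_nums = [], []
    for doc_num, doc in enumerate(docs_batch):
        for start in range(0, len(doc), max_seq_len):
            chunks.append(doc[start:start + max_seq_len])
            doc_nums.append(doc_num)
    return chunks, doc_nums
```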
- class deeppavlov.models.entity_extraction.entity_linking.EntityLinker(load_path: str, entity_ranker=None, entities_database_filename: Optional[str] = None, words_dict_filename: Optional[str] = None, ngrams_matrix_filename: Optional[str] = None, num_entities_for_bert_ranking: int = 50, num_entities_for_conn_ranking: int = 5, num_entities_to_return: int = 10, max_text_len: int = 300, max_paragraph_len: int = 150, lang: str = 'ru', use_descriptions: bool = True, alias_coef: float = 1.1, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, kb_filename: Optional[str] = None, prefixes: Optional[Dict[str, Any]] = None, **kwargs)[source]¶
Class for linking of entity substrings in the document to entities in Wikidata
- __init__(load_path: str, entity_ranker=None, entities_database_filename: Optional[str] = None, words_dict_filename: Optional[str] = None, ngrams_matrix_filename: Optional[str] = None, num_entities_for_bert_ranking: int = 50, num_entities_for_conn_ranking: int = 5, num_entities_to_return: int = 10, max_text_len: int = 300, max_paragraph_len: int = 150, lang: str = 'ru', use_descriptions: bool = True, alias_coef: float = 1.1, use_tags: bool = False, lemmatize: bool = False, full_paragraph: bool = False, use_connections: bool = False, kb_filename: Optional[str] = None, prefixes: Optional[Dict[str, Any]] = None, **kwargs) None [source]¶
- Parameters
load_path – path to folder with inverted index files
entity_ranker – component deeppavlov.models.kbqa.rel_ranking_bert
entities_database_filename – filename with database with entities index
words_dict_filename – filename with words and corresponding tags
ngrams_matrix_filename – filename with char tfidf matrix
num_entities_for_bert_ranking – number of candidate entities for BERT ranking using description and context
num_entities_for_conn_ranking – number of candidate entities for ranking using connections in the knowledge graph
num_entities_to_return – number of candidate entities for the substring which are returned
max_text_len – maximal length of entity context
max_paragraph_len – maximal length of context paragraphs
lang – language of the texts: “ru” (Russian) or “en” (English)
use_descriptions – whether to perform entity ranking by context and description
alias_coef – coefficient which is multiplied by the substring matching confidence if the substring is the title of the entity
use_tags – whether to filter candidate entities by tags
lemmatize – whether to lemmatize tokens
full_paragraph – whether to use full paragraph for entity context
use_connections – whether to rank entities by connections in the knowledge graph
kb_filename – filename with the knowledge base in HDT format
prefixes – entity and title prefixes
**kwargs –
- __call__(substr_batch: List[List[str]], tags_batch: Optional[List[List[str]]] = None, probas_batch: Optional[List[List[float]]] = None, sentences_batch: Optional[List[List[str]]] = None, offsets_batch: Optional[List[List[List[int]]]] = None, sentences_offsets_batch: Optional[List[List[Tuple[int, int]]]] = None, entities_to_link_batch: Optional[List[List[int]]] = None)[source]¶
Call self as a function.
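The role of alias_coef in candidate ranking can be illustrated with a toy scoring function. This is an assumption-laden sketch, not EntityLinker’s actual scoring: score_candidate is a hypothetical helper that fuzzily matches the substring against a candidate title and boosts the confidence by alias_coef when the substring is exactly a title or alias of the entity.

```python
from difflib import SequenceMatcher


def score_candidate(substring, candidate_title, is_title_match, alias_coef=1.1):
    """Toy candidate score: fuzzy similarity between the entity substring
    and the candidate's title, multiplied by alias_coef when the
    substring is itself a title/alias of the candidate entity."""
    conf = SequenceMatcher(None, substring.lower(), candidate_title.lower()).ratio()
    if is_title_match:
        conf *= alias_coef
    return conf
```

With such a score, an exact title match can outrank a near-match of equal fuzzy similarity, which is the intent of the alias_coef parameter described above.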
- class deeppavlov.models.entity_extraction.entity_detection_parser.EntityDetectionParser(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, thres_proba: float = 0.8, make_tags_from_probas: bool = False, lang: str = 'en', ignored_tags: Optional[List[str]] = None, **kwargs)[source]¶
This class parses probabilities of tokens to be a token from the entity substring.
- __init__(o_tag: str, tags_file: str, entity_tags: Optional[List[str]] = None, ignore_points: bool = False, thres_proba: float = 0.8, make_tags_from_probas: bool = False, lang: str = 'en', ignored_tags: Optional[List[str]] = None, **kwargs)[source]¶
- Parameters
o_tag – tag for tokens which are neither entities nor types
tags_file – filename with NER tags
entity_tags – tags for entities
ignore_points – whether to consider points as separate symbols
thres_proba – if the probability of the tag is less than thres_proba, we assign the tag as ‘O’
make_tags_from_probas – whether to define token tags from confidences from sequence tagging model
lang – language of texts
ignored_tags – entity tags to be ignored
- __call__(question_tokens_batch: List[List[str]], tokens_info_batch: List[List[List[float]]], tokens_probas_batch: numpy.ndarray) Tuple[List[dict], List[dict], List[dict]] [source]¶
- Parameters
question_tokens_batch – tokenized questions
tokens_info_batch – list of tags of question tokens
tokens_probas_batch – list of probabilities of question tokens
- Returns
Batch of dicts where keys are tags and values are substrings corresponding to tags
Batch of substrings which correspond to entity types
Batch of lists of token indices in the text which correspond to entities
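The thres_proba fallback and the grouping of tagged tokens into entity substrings can be sketched as follows. Both functions are hypothetical, simplified helpers (no BIO-prefix handling, single document), not the EntityDetectionParser implementation.

```python
def probas_to_tags(probas, tag_list, o_tag="O", thres_proba=0.8):
    """Pick the most probable tag per token; fall back to o_tag when the
    best probability is below thres_proba (mirrors the thres_proba
    behaviour described above, in simplified form)."""
    tags = []
    for token_probas in probas:
        best = max(range(len(token_probas)), key=token_probas.__getitem__)
        tag = tag_list[best] if token_probas[best] >= thres_proba else o_tag
        tags.append(tag)
    return tags


def tags_to_entities(tokens, tags, o_tag="O"):
    """Group consecutive non-O tokens into (substring, tag, token indices)
    triples, one per detected entity."""
    entities, current, start = [], [], None
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag != o_tag:
            if not current:
                start = i
            current.append(token)
        elif current:
            entities.append((" ".join(current), tags[start], list(range(start, i))))
            current = []
    if current:
        entities.append((" ".join(current), tags[start], list(range(start, len(tokens)))))
    return entities
```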