Knowledge Base Question Answering (KBQA)¶
1. Introduction to the task¶
The knowledge base:
is a comprehensive repository of information about a given domain or a number of domains;
reflects the ways we model knowledge about a given subject or subjects, in terms of concepts, entities, properties, and relationships;
enables us to use this structured knowledge where appropriate, e.g. for answering factoid questions.
Currently, we support Wikidata as a Knowledge Base (Knowledge Graph). In the future, we will expand support for custom knowledge bases.
The question answerer:
validates questions against the preconfigured list of question templates, disambiguates entities using entity linking and answers questions asked in natural language;
can be used with Wikidata (English, Russian) and, in future versions, with custom knowledge graphs.
Here are some of the most popular types of questions supported by the model:
Complex questions with numerical values: “What position did Angela Merkel hold on November 10, 1994?”
Complex questions where the answer is a number or a date: “When did Jean-Paul Sartre move to Le Havre?”
Questions with counting of answer entities: “How many sponsors are for Juventus F.C.?”
Questions with ordering of answer entities in ascending or descending order of some parameter: “Which country has highest individual tax rate?”
Simple questions: “What is crew member Yuri Gagarin’s Vostok?”
The following models are used to find the answer (the links are for the English language model):
BERT model for prediction of the query template type. The model classifies questions into 8 classes corresponding to 8 query template types;
BERT entity detection model for extraction of entity substrings from the questions;
The substring extracted by the entity detection model is used for entity linking. Entity linking matches the substring against Wikidata entities, based on the Levenshtein distance between the substring and an entity title. The result of the matching procedure is a set of candidate entities. This set is then searched for an entity that has one of the top-k relations predicted by the classification model;
BERT model for ranking candidate relation paths;
The query generator fills the query template with candidate entities and relations to find valid combinations of entities and relations for the template. The query generator uses the Wikidata HDT file.
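The Levenshtein-based matching in the entity linking step can be illustrated with a minimal sketch. This is not DeepPavlov's implementation: the helpers `levenshtein` and `rank_candidates`, the toy title table and the placeholder IDs `X1` and `X2` are assumptions made for illustration only. `Q134773` is the entity ID that appears in the Forrest Gump query in section 4.1.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rank_candidates(mention: str, titles: Dict[str, str],
                    top_k: int = 2) -> List[Tuple[str, int]]:
    """Return the top_k (entity_id, distance) pairs, closest titles first."""
    scored = [(entity_id, levenshtein(mention.lower(), title.lower()))
              for entity_id, title in titles.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


# Toy title table (hypothetical); a real index covers all of Wikidata.
toy_titles = {
    "Q134773": "Forrest Gump",
    "X1": "Forrest Gump (novel)",
    "X2": "Forest Whitaker",
}
candidates = rank_candidates("forrest gump", toy_titles)
print(candidates[0])
```

The exact title match gets distance 0 and is ranked first; the remaining candidates would then be narrowed down using the predicted relations.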
2. Get started with the model¶
First make sure you have the DeepPavlov Library installed. More info about the first installation
[ ]:
! pip install -q deeppavlov
Then make sure that all the required packages for the model are installed.
[ ]:
! python -m deeppavlov install kbqa_cq_en
! python -m deeppavlov install kbqa_cq_ru
kbqa_cq_en and kbqa_cq_ru here are the names of the model’s config_files. What is a Config File?
The configuration file defines the model and describes its hyperparameters. To use another model, change the name of the config_file here and further. The full list of KBQA models with their config names can be found in the table below.
3. Models list¶
The table presents a list of all of the KBQA-models available in DeepPavlov Library.
Config name | Database | Language | RAM | GPU |
---|---|---|---|---|
kbqa_cq_en | Wikidata | En | 3.1 Gb | 3.4 Gb |
kbqa_cq_ru | Wikidata | Ru | 4.3 Gb | 8.0 Gb |
4. Use the model for prediction¶
4.1 Predict using Python¶
After installing the model, build it from the config and predict.
[ ]:
from deeppavlov import configs, build_model
kbqa = build_model('kbqa_cq_en', download=True, install=True)
Input: List[sentences]
Output: List[answers]
[ ]:
kbqa(['Who directed Forrest Gump?'])
[['Robert Zemeckis'],
[['Q187364']],
[['SELECT ?answer WHERE { wd:Q134773 wdt:P57 ?answer. }']]]
[ ]:
kbqa(['What position was held by Harry S. Truman on 1/3/1935?'])
[['United States senator'],
[['Q4416090']],
[['SELECT ?answer WHERE { wd:Q11613 p:P39 ?ent . ?ent ps:P39 ?answer . ?ent ?p ?x filter(contains(?x, n)). }']]]
[ ]:
kbqa(['What teams did Lionel Messi play for in 2004?'])
[['FC Barcelona B, Argentina national under-20 football team'],
[['Q10467', 'Q1187790']],
[['SELECT ?answer WHERE { wd:Q615 p:P54 ?ent . ?ent ps:P54 ?answer . ?ent ?p ?x filter(contains(?x, n)). }']]]
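The outputs above are three parallel lists with one entry per input question. A minimal sketch of unpacking them follows. It uses the literal output of the first example instead of calling the model, so it runs without DeepPavlov installed. The variable names `answers`, `answer_ids` and `queries` are chosen here for readability and are an assumption about how to label the three parts.

```python
# Literal output copied from the "Who directed Forrest Gump?" example above.
result = [['Robert Zemeckis'],
          [['Q187364']],
          [['SELECT ?answer WHERE { wd:Q134773 wdt:P57 ?answer. }']]]

# One element per question in each of the three lists.
answers, answer_ids, queries = result

for answer, ids, sparql in zip(answers, answer_ids, queries):
    print(f"answer:     {answer}")
    print(f"answer IDs: {', '.join(ids)}")
    print(f"query:      {sparql[0]}")
```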
The KBQA model for complex question answering in Russian can be used from Python with the following code:
[ ]:
from deeppavlov import configs, build_model
kbqa = build_model('kbqa_cq_ru', download=True, install=True)
[ ]:
kbqa(['Когда родился Пушкин?'])
[['26 мая 1799, 06 июня 1799'],
[['+1799-05-26^^T', '+1799-06-06^^T']],
[['SELECT ?answer WHERE { wd:Q7200 wdt:P569 ?answer. }']]]
4.2 Predict using CLI¶
You can also get predictions in an interactive mode through CLI.
[ ]:
! python -m deeppavlov interact kbqa_cq_en [-d]
! python -m deeppavlov interact kbqa_cq_ru [-d]
-d is an optional download key (alternative to download=True in Python code). It is used to download the pre-trained model along with embeddings and all other files needed to run the model.
Or make predictions for samples from a file passed with -f (when -f is omitted, samples are read from stdin).
[ ]:
! python -m deeppavlov predict kbqa_cq_en -f <file-name>
! python -m deeppavlov predict kbqa_cq_ru -f <file-name>
4.3 Using entity linking and Wiki parser as standalone tools for KBQA¶
Default configuration for KBQA was designed to use all of the supporting models together as a part of the KBQA pipeline. However, there might be a case when you want to work with some of these models in addition to KBQA.
For example, you might want to use the entity linking model as an annotator in your multiskill AI Assistant. Or, you might want to use the Wiki Parser component to directly run SPARQL queries against your copy of Wikidata. To support these usages, you can also deploy supporting models as standalone components.
Don’t forget to replace the url parameter values in the examples below with correct URLs.
Config entity_linking_en can be used with the following commands:
[ ]:
! python -m deeppavlov install entity_linking_en -d
[ ]:
! python -m deeppavlov riseapi entity_linking_en [-d] [-p <port>]
[ ]:
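import requests

# Placeholder: replace with the URL of your running entity linking service.
entity_linking_url = "http://<host>:<port>/<endpoint>"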
payload = {"entity_substr": [["Forrest Gump"]], "tags": [["PERSON"]], "probas": [[0.9]],
"sentences": [["Who directed Forrest Gump?"]]}
response = requests.post(entity_linking_url, json=payload).json()
print(response)
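The payload above holds four parallel lists with one inner list per input sentence. A minimal sketch of building it for a batch follows. The helper `build_linking_payload` and the `Mention` alias are hypothetical names introduced here. The field names and nesting are copied from the example request.

```python
from typing import Any, Dict, List, Tuple

# (entity substring, tag, detection probability); alias introduced for this sketch.
Mention = Tuple[str, str, float]


def build_linking_payload(samples: List[Tuple[str, List[Mention]]]) -> Dict[str, List[Any]]:
    """Turn (sentence, mentions) pairs into the batched request body."""
    payload: Dict[str, List[Any]] = {
        "entity_substr": [], "tags": [], "probas": [], "sentences": []
    }
    for sentence, mentions in samples:
        payload["entity_substr"].append([m[0] for m in mentions])
        payload["tags"].append([m[1] for m in mentions])
        payload["probas"].append([m[2] for m in mentions])
        payload["sentences"].append([sentence])
    return payload


payload = build_linking_payload(
    [("Who directed Forrest Gump?", [("Forrest Gump", "PERSON", 0.9)])]
)
print(payload)
```

Building all four lists in one loop keeps them the same length, which is what the batched format requires.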
Config wiki_parser can be used with the following command:
[ ]:
! python -m deeppavlov riseapi wiki_parser [-d] [-p <port>]
Arguments of the annotator are parser_info (what we want to extract from Wikidata) and query.
Examples of queries:
To extract triplets for entities, the query argument should be the list of entity IDs, and parser_info should be the list of "find_triplets" strings.
[ ]:
requests.post(wiki_parser_url, json = {"parser_info": ["find_triplets"], "query": ["Q159"]}).json()
To extract all relations of the entities, the query argument should be the list of entity IDs, and parser_info should be the list of "find_rels" strings.
[ ]:
requests.post(wiki_parser_url, json = {"parser_info": ["find_rels"], "query": ["Q159"]}).json()
To find labels for entity IDs, the query argument should be the list of entity IDs, and parser_info should be the list of "find_label" strings.
[ ]:
requests.post(wiki_parser_url, json = {"parser_info": ["find_label"], "query": [["Q159", ""]]}).json()
In this example, the second element of the list (an empty string) can be replaced with a sentence.
To execute SPARQL queries, the query argument should be the list of tuples with the info about SPARQL queries, and parser_info should be the list of "query_execute" strings.
Let us consider an example of the question “What is the deepest lake in Russia?” with the corresponding SPARQL query SELECT ?ent WHERE { ?ent wdt:P31 wd:T1 . ?ent wdt:R1 ?obj . ?ent wdt:R2 wd:E1 } ORDER BY ASC(?obj) LIMIT 5
Arguments:
what_return: ["?obj"],
query_seq: [["?ent", "P17", "Q159"], ["?ent", "P31", "Q23397"], ["?ent", "P4511", "?obj"]],
filter_info: [],
order_info: order_info(variable='?obj', sorting_order='asc').
[ ]:
requests.post(wiki_parser_url, json = {"parser_info": ["query_execute"], "query": [[["?obj"], [["Q159", "P36", "?obj"]], [], [], True]]}).json()
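The nested query list in the request above is easy to get wrong by hand. A minimal sketch of assembling it from named parts follows. The helper `build_query_execute_payload` is hypothetical. The element order (what_return, query_seq, filter_info, order_info) follows the argument list above. The meaning of the fifth element is not described in this section, so it is passed through unchanged under the assumed name `last_flag`.

```python
from typing import Any, Dict, List, Optional


def build_query_execute_payload(what_return: List[str],
                                query_seq: List[List[str]],
                                filter_info: Optional[List[Any]] = None,
                                order_info: Optional[List[Any]] = None,
                                last_flag: bool = True) -> Dict[str, List[Any]]:
    """Assemble a wiki_parser request body for one SPARQL-like query."""
    query = [what_return, query_seq, filter_info or [], order_info or [], last_flag]
    return {"parser_info": ["query_execute"], "query": [query]}


payload = build_query_execute_payload(
    what_return=["?obj"],
    query_seq=[["Q159", "P36", "?obj"]],
)
print(payload)
```

The resulting dictionary matches the JSON body of the request above and can be passed as the json argument of requests.post.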
To use the entity linking model in KBQA, add the following API Requester component to the pipe in the config_file:
{
"class_name": "api_requester",
"id": "linker_entities",
"url": "entity_linking_url",
"out": ["entity_substr", "entity_ids", "entity_conf", "entity_pages", "entity_labels"],
"param_names": ["entity_substr", "tags", "probas", "sentences"]
}
To use the Wiki parser service in KBQA, add the following API Requester component to the pipe in the config_file:
{
"class_name": "api_requester",
"id": "wiki_p",
"url": "wiki_parser_url",
"out": ["wiki_parser_output"],
"param_names": ["parser_info", "query"]
}
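A minimal sketch of adding such a component programmatically, instead of editing the JSON by hand, is shown below. It assumes the config is a dictionary with the pipe stored as a list under `chainer` → `pipe`. The helper `add_pipe_component` and the toy config are hypothetical, and the `url` value is still the placeholder to be replaced.

```python
from copy import deepcopy
from typing import Any, Dict, Optional


def add_pipe_component(config: Dict[str, Any], component: Dict[str, Any],
                       position: Optional[int] = None) -> Dict[str, Any]:
    """Return a copy of the config with the component added to the pipe."""
    updated = deepcopy(config)
    pipe = updated["chainer"]["pipe"]
    if position is None:
        pipe.append(component)
    else:
        pipe.insert(position, component)
    return updated


# Toy stand-in for a loaded config file (hypothetical contents).
toy_config = {
    "chainer": {
        "in": ["x"],
        "pipe": [{"class_name": "hypothetical_first_component"}],
        "out": ["answers"],
    }
}

wiki_parser_requester = {
    "class_name": "api_requester",
    "id": "wiki_p",
    "url": "wiki_parser_url",  # placeholder, replace with the real URL
    "out": ["wiki_parser_output"],
    "param_names": ["parser_info", "query"],
}

new_config = add_pipe_component(toy_config, wiki_parser_requester)
print([c.get("id", c["class_name"]) for c in new_config["chainer"]["pipe"]])
```

Working on a deep copy leaves the original config untouched, so the same base config can be reused for several variants.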
5. Customize the model¶
5.1 Description of config parameters¶
Parameters of entity_linker component:
num_entities_to_return: int - the number of entity IDs returned for each entity mention in the text;
lemmatize: bool - whether to lemmatize entity mentions before searching for candidate entity IDs in the inverted index;
use_descriptions: bool - whether to rank candidate entities by the similarity of their descriptions to the context;
use_connections: bool - whether to use connections between candidate entities for different mentions for ranking;
use_tags: bool - whether to search the inverted index only for entity IDs which have the same tag as the entity mention;
prefixes: Dict[str, Any] - prefixes in the knowledge base for entities and relations;
alias_coef: float - the coefficient by which the substring matching score of the entity is multiplied if the entity mention in the text matches the entity title.
Parameters of rel_ranking_infer component:
return_elements: List[str] - what elements should be returned by the component in the output tuple (answers are returned by default; optional elements are "confidences", "answer_ids", "entities_and_rels" (entities and relations from SPARQL queries), "queries" (SPARQL queries), "triplets" (triplets from SPARQL queries));
batch_size: int - the list of candidate relations is split into batches of size batch_size for further ranking;
softmax: bool - whether to apply the softmax function to the list of confidences of candidate relations for a question;
use_api_requester: bool - true if wiki_parser is called through api_requester;
rank: bool - whether to rank candidate relation paths;
nll_rel_ranking: bool - DeepPavlov has two types of relation ranking models: 1) a model which takes a question and a relation and is trained to classify the question-relation pair into two classes (relevant / irrelevant relation); 2) a model which takes a question and a list of relations (one relevant relation and the others irrelevant) and is trained with NLL loss to identify the relevant relation in the list; the output format differs between the two cases;
nll_path_ranking: bool - the same as nll_rel_ranking, but for ranking of relation paths;
top_possible_answers: int - SPARQL query execution can result in several valid answers, so top_possible_answers is the number of these answers kept in the output;
top_n: int - the number of candidate SPARQL queries (and corresponding answers) in the output for a question;
pos_class_num: int - if the model classifies question-relation pairs into two classes (relevant / irrelevant), the number of the positive class (0 or 1);
rel_thres: float - only relations with confidence above this threshold are kept;
type_rels: List[str] - relations which connect an entity and its type in the knowledge graph.
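A minimal sketch of how batch_size, softmax and rel_thres could interact is shown below. It is not the rel_ranking_infer code: the helper names, the relation labels and the raw scores are all made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def split_into_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the candidate list into consecutive batches of at most batch_size."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def filter_relations(relations: Sequence[str], scores: Sequence[float],
                     rel_thres: float, use_softmax: bool = True) -> List[Tuple[str, float]]:
    """Keep relations whose confidence is above rel_thres, best first."""
    confs = softmax(scores) if use_softmax else list(scores)
    kept = [(rel, conf) for rel, conf in zip(relations, confs) if conf > rel_thres]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


relations = ["director", "producer", "cast member"]  # hypothetical candidates
raw_scores = [2.0, 0.5, 0.1]                         # hypothetical model scores

batches = split_into_batches(relations, batch_size=2)
kept = filter_relations(relations, raw_scores, rel_thres=0.15)
print(batches)
print([rel for rel, _ in kept])
```

With softmax enabled the confidences sum to 1 per question, so a fixed rel_thres has a comparable meaning across questions.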
Parameters of query_generator component:
entities_to_leave: int - how many entity IDs to use when making combinations of entities and relations to fill the slots of the SPARQL query template;
rels_to_leave: int - how many relations to use when making combinations of entities and relations to fill the slots of the SPARQL query template;
max_comb_num: int - the maximum number of combinations of entities and relations for filling the slots of the SPARQL query template;
map_query_str_to_kb: List[Tuple[str, str]] - a list of elements like ["wd:", "http://we/"], where the first element is a prefix of an entity ("wd:") or relation in the SPARQL query template, and the second is the corresponding prefix in the knowledge base ("http://we/");
kb_prefixes: Dict[str, str] - a dictionary {"entity": "wd:E", "rel": "wdt:R", …} of prefixes of entities, relations and types in the knowledge base;
gold_query_info: Dict[str, str] - names of unknown variables in SPARQL queries in the dataset (LC-QuAD2.0 or RuBQ2.0);
syntax_structure_known: bool - whether the syntax structure of the question is known (True in kbqa_cq_ru.json, because this config performs syntax parsing with slovnet_syntax_parser).
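A minimal sketch of what entities_to_leave, rels_to_leave, max_comb_num and map_query_str_to_kb control is shown below. It is an illustration, not the query_generator code: the helpers `candidate_combinations` and `map_prefixes` and the placeholder entity and relation names are assumptions.

```python
from itertools import islice, product
from typing import List, Sequence, Tuple


def candidate_combinations(entity_ids: Sequence[str], relations: Sequence[str],
                           entities_to_leave: int, rels_to_leave: int,
                           max_comb_num: int) -> List[Tuple[str, str]]:
    """Pair the top entities with the top relations, capped at max_comb_num."""
    combos = product(entity_ids[:entities_to_leave], relations[:rels_to_leave])
    return list(islice(combos, max_comb_num))


def map_prefixes(query: str, map_query_str_to_kb: Sequence[Tuple[str, str]]) -> str:
    """Replace template prefixes with the corresponding knowledge base prefixes."""
    for template_prefix, kb_prefix in map_query_str_to_kb:
        query = query.replace(template_prefix, kb_prefix)
    return query


combos = candidate_combinations(
    ["ENT_A", "ENT_B", "ENT_C"], ["REL_A", "REL_B"],
    entities_to_leave=2, rels_to_leave=2, max_comb_num=3,
)
mapped = map_prefixes("SELECT ?answer WHERE { wd:Q134773 wdt:P57 ?answer. }",
                      [("wd:", "http://we/")])
print(combos)
print(mapped)
```

Capping the number of combinations bounds the number of lookups against the knowledge base for each question.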
5.2 Train KBQA components¶
Train Query Prediction Model¶
The dataset for training the query prediction model consists of three .csv files: train.csv, valid.csv and test.csv. Each line in these files contains a question and the corresponding query template type, for example:
"What is the longest river in the UK?", 6
Train Entity Detection Model¶
The dataset is a pickle file. The dataset must be split into three parts: train, test, and validation. Each part is a list of tuples of question tokens and tags for each token. An example of a training sample:
(['What', 'is', 'the', 'complete', 'list', 'of', 'records', 'released', 'by', 'Jerry', 'Lee', 'Lewis', '?'],
['O', 'O', 'O', 'O', 'B-T', 'I-T', 'I-T', 'O', 'O', 'B-E', 'I-E', 'I-E', 'O'])
B-T marks the first token of an entity type substring, I-T marks the inner tokens of an entity type substring, B-E and I-E mark the same positions for entities, and O marks all other tokens.
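A minimal sketch of recovering the substrings from such a tagged sample is shown below. The helper `extract_spans` is hypothetical and is not part of DeepPavlov. The tokens and tags are the training sample shown above.

```python
from typing import List, Sequence


def extract_spans(tokens: Sequence[str], tags: Sequence[str], kind: str) -> List[str]:
    """Collect substrings tagged B-<kind> / I-<kind> (kind is "E" or "T")."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{kind}":
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{kind}" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ['What', 'is', 'the', 'complete', 'list', 'of', 'records',
          'released', 'by', 'Jerry', 'Lee', 'Lewis', '?']
tags = ['O', 'O', 'O', 'O', 'B-T', 'I-T', 'I-T', 'O', 'O', 'B-E', 'I-E', 'I-E', 'O']

print(extract_spans(tokens, tags, "E"))  # entity substrings
print(extract_spans(tokens, tags, "T"))  # entity type substrings
```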
Train Path Ranking Model¶
The dataset (in pickle format) is a dict with three keys: "train", "valid" and "test". The value for each key is a list of samples. An example of a sample:
(['What is the Main St. Exile label, which Nik Powell co-founded?', ['record label', 'founded by']], '1')
The sample contains the question, relations in the question and label (1 - if the relations correspond to the question, 0 - otherwise).
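A minimal sketch of assembling and checking a dataset in this format is shown below. The helper `check_sample` is hypothetical, and the dataset holds only the single sample shown above. It assumes labels are stored as the strings '0' and '1', as in that sample.

```python
import pickle

sample = (['What is the Main St. Exile label, which Nik Powell co-founded?',
           ['record label', 'founded by']], '1')

dataset = {"train": [sample], "valid": [], "test": []}


def check_sample(item) -> bool:
    """Check the ([question, relations], label) structure of one sample."""
    (question, relations), label = item
    return (isinstance(question, str)
            and all(isinstance(rel, str) for rel in relations)
            and label in {"0", "1"})


# Round-trip through pickle, as the training data is stored in pickle format.
restored = pickle.loads(pickle.dumps(dataset))

print(sorted(restored))
print(all(check_sample(item) for part in restored.values() for item in part))
```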
Adding Templates For New SPARQL Queries¶
Templates can be added to the sparql_queries.json file, which is a dictionary where keys are template types and values are templates with additional information. An example of a template:
{
"query_template": "SELECT ?obj WHERE { wd:E1 p:R1 ?s . ?s ps:R1 ?obj . ?s ?p ?x filter(contains(?x, N)) }",
"rank_rels": ["wiki", "do_not_rank", "do_not_rank"],
"rel_types": ["no_type", "statement", "qualifier"],
"query_sequence": [1, 2, 3],
"return_if_found": true,
"template_num": "0",
"alternative_templates": []
}
query_template - the template of the SPARQL query;
rank_rels - a list which defines whether to rank relations; in this example, p:R1 relations are extracted from Wikidata for wd:E1 entities and ranked with RelRanker, while ps:R1 and ?p relations are neither extracted nor ranked;
rel_types - the types of the relations: direct, statement or qualifier;
query_sequence - the sequence in which the triplets will be extracted from the Wikidata HDT file;
return_if_found - controls the iteration over all possible combinations of entities, relations and types: if true, the first valid combination found is returned; if false, all combinations are considered;
template_num - the type of the template;
alternative_templates - type numbers of alternative templates to use if the answer was not found using the current template.
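A minimal sketch of filling the slots of the example query_template is shown below. The helper `fill_template` is hypothetical. The slot names E1, R1 and N come from the template above, and the values Q11613 and P39 are taken from the Harry S. Truman example in section 4.1. How the real query generator encodes the N value is not described here, so the quoted year is an assumption.

```python
import re
from typing import Dict


def fill_template(template: str, slots: Dict[str, str]) -> str:
    """Replace whole-word slot names (e.g. E1, R1, N) with their values."""
    pattern = re.compile(r"\b(" + "|".join(re.escape(name) for name in slots) + r")\b")
    return pattern.sub(lambda match: slots[match.group(1)], template)


query_template = ("SELECT ?obj WHERE { wd:E1 p:R1 ?s . ?s ps:R1 ?obj . "
                  "?s ?p ?x filter(contains(?x, N)) }")

filled = fill_template(query_template, {"E1": "Q11613", "R1": "P39", "N": '"1935"'})
print(filled)
```

Matching on word boundaries keeps the replacement from touching letters inside keywords such as SELECT or WHERE.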