Relation Extraction (RE)¶
Relation extraction is the task of detecting and classifying the relationship between two entities in text. DeepPavlov provides document-level relation extraction, meaning that a relation can be detected between entities that do not occur in the same sentence. Currently, RE is available for English and Russian.
- The English RE model is trained on the DocRED corpus, which is built from Wikipedia.
- The Russian RE model is trained on the RuRED corpus, which is built from the Lenta.ru news corpus.
English RE model¶
The English RE model can be trained using the following command:
python -m deeppavlov train re_docred
The trained model weights can be downloaded with the following command:
python -m deeppavlov download re_docred
The trained model can be used for inference with the following code:
from deeppavlov import configs, build_model

re_model = build_model(configs.relation_extraction.re_docred, download=False)
sentence_tokens = [["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]]
entity_pos = [[[(0, 2)], [(5, 7), (9, 11)]]]
entity_tags = [["PER", "PER"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P26'], ['spouse']]
Model Input:
- list of tokens of a text document;
- list of entity positions (i.e. all start and end positions of each entity's mentions);
- list of NER tags of both entities.
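To make the input format concrete, the following sketch (a hypothetical helper, not part of the DeepPavlov API) recovers the mention strings from the token list. It assumes each (start, end) pair is a half-open token span, which matches the example above:

```python
def mention_texts(tokens, entity_spans):
    """Return the surface form of every mention of one entity.

    Each (start, end) pair is treated as a half-open token span
    into `tokens` (an assumption consistent with the example above).
    """
    return [" ".join(tokens[start:end]) for start, end in entity_spans]

tokens = ["Barack", "Obama", "is", "married", "to",
          "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]
entity_pos = [[(0, 2)], [(5, 7), (9, 11)]]  # two entities, all their mentions

print(mention_texts(tokens, entity_pos[0]))  # ['Barack Obama']
print(mention_texts(tokens, entity_pos[1]))  # ['Michelle Obama', 'Michelle Robinson']
```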
As NER tags, we adopted those used in the DocRED corpus, which are, in turn, inherited from Tjong Kim Sang and De Meulder (2003).
The full list of 6 English NER tags:

| NER tag | Description |
|---|---|
| PER | People, including fictional |
| ORG | Companies, universities, institutions, political or religious groups, etc. |
| LOC | Geographically defined locations (mountains, bodies of water, etc.); politically defined locations (countries, cities, states, streets, etc.); facilities (buildings, museums, stadiums, hospitals, factories, airports, etc.) |
| TIME | Absolute or relative dates or periods |
| NUM | Percents, money, quantities |
| MISC | Products (vehicles, weapons, etc.); events (elections, battles, sporting events, etc.); laws, cases, languages, etc. |
Model Output: one or several of the 97 relations detected between the given entities, each returned as a Wikidata relation id (e.g. 'P26') together with the relation name ('spouse').
The full list of 97 English relations:

| Relation id | Relation |
|---|---|
| P6 | head of government |
| P17 | country |
| P19 | place of birth |
| P20 | place of death |
| P22 | father |
| P25 | mother |
| P26 | spouse |
| P27 | country of citizenship |
| P30 | continent |
| P31 | instance of |
| P35 | head of state |
| P36 | capital |
| P37 | official language |
| P39 | position held |
| P40 | child |
| P50 | author |
| P54 | member of sports team |
| P57 | director |
| P58 | screenwriter |
| P69 | educated at |
| P86 | composer |
| P102 | member of political party |
| P108 | employer |
| P112 | founded by |
| P118 | league |
| P123 | publisher |
| P127 | owned by |
| P131 | located in the administrative territorial entity |
| P136 | genre |
| P137 | operator |
| P140 | religion |
| P150 | contains administrative territorial entity |
| P155 | follows |
| P156 | followed by |
| P159 | headquarters location |
| P161 | cast member |
| P162 | producer |
| P166 | award received |
| P170 | creator |
| P171 | parent taxon |
| P172 | ethnic group |
| P175 | performer |
| P176 | manufacturer |
| P178 | developer |
| P179 | series |
| P190 | sister city |
| P194 | legislative body |
| P205 | basin country |
| P206 | located in or next to body of water |
| P241 | military branch |
| P264 | record label |
| P272 | production company |
| P276 | location |
| P279 | subclass of |
| P355 | subsidiary |
| P361 | part of |
| P364 | original language of work |
| P400 | platform |
| P403 | mouth of the watercourse |
| P449 | original network |
| P463 | member of |
| P488 | chairperson |
| P495 | country of origin |
| P527 | has part |
| P551 | residence |
| P569 | date of birth |
| P570 | date of death |
| P571 | inception |
| P576 | dissolved, abolished or demolished |
| P577 | publication date |
| P580 | start time |
| P582 | end time |
| P585 | point in time |
| P607 | conflict |
| P674 | characters |
| P676 | lyrics by |
| P706 | located on terrain feature |
| P710 | participant |
| P737 | influenced by |
| P740 | location of formation |
| P749 | parent organization |
| P800 | notable work |
| P807 | separated from |
| P840 | narrative location |
| P937 | work location |
| P1001 | applies to jurisdiction |
| P1056 | product or material produced |
| P1198 | unemployment rate |
| P1336 | territory claimed by |
| P1344 | participant of |
| P1365 | replaces |
| P1366 | replaced by |
| P1376 | capital of |
| P1412 | languages spoken, written or signed |
| P1441 | present in work |
| P3373 | sibling |
Some details on the DocRED corpus the English RE model was trained on
The English RE model was trained on the DocRED English corpus. It was constructed from Wikipedia and Wikidata and is currently the largest human-annotated dataset for document-level RE from plain text.
As the original DocRED test set contains only unlabeled data, while labeled data is needed for evaluation, we decided to: 1. merge the train and dev data (i.e. the labeled data); 2. split it into new train, dev, and test sets.
Currently, two types of splitting are provided:
- the user can set the relative size of the dev and test data (e.g. 1/7);
- the user can set the absolute size of the dev and test data (e.g. 2000 samples).
In our experiments, we set the absolute size of the dev and test data to 150 original documents each, which resulted in approximately 3500 samples.
We additionally generated negative samples where necessary to reach the following proportions:
- train set: twice as many negative samples as positive ones;
- dev & test sets: equal numbers of negative and positive samples.
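One way to enforce such proportions can be sketched as follows. This is a hypothetical helper, not the actual DeepPavlov preprocessing code, and it assumes the candidate negative entity pairs have already been enumerated:

```python
import random

def sample_negatives(positive, negative, ratio, seed=0):
    """Keep at most `ratio` negative samples per positive sample.

    `positive` and `negative` are lists of candidate samples (e.g. entity
    pairs). Illustrative sketch of the proportion logic described above.
    """
    rng = random.Random(seed)
    k = min(len(negative), int(len(positive) * ratio))
    return positive + rng.sample(negative, k)

pos = [("e1", "e2")] * 10
neg = [("e3", "e4")] * 100
train = sample_negatives(pos, neg, ratio=2)  # 2:1 negatives for train
dev = sample_negatives(pos, neg, ratio=1)    # 1:1 for dev/test
print(len(train), len(dev))  # 30 20
```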
| Train | Dev | Test |
|---|---|---|
| 130650 | 3406 | 3545 |

| Train Positive | Train Negative | Dev Positive | Dev Negative | Test Positive | Test Negative |
|---|---|---|---|---|---|
| 44823 | 89214 | 1239 | 1229 | 1043 | 1036 |
Russian RE model¶
The Russian RE model can be trained using the following command:
python -m deeppavlov train re_rured
The trained model weights can be downloaded with the following command:
python -m deeppavlov download re_rured
The trained model can be used for inference with the following code:
from deeppavlov import configs, build_model

model = build_model(configs.relation_extraction.re_rured)
sentence_tokens = [["Илон", "Маск", "живет", "в", "Сиэттле", "."]]
entity_pos = [[[(0, 2)], [(4, 5)]]]
entity_tags = [["PERSON", "CITY"]]
pred = model(sentence_tokens, entity_pos, entity_tags)
>> [['P551'], ['место жительства']]
Model Input:
- list of tokens of a text document;
- list of entity positions (i.e. all start and end positions of each entity's mentions);
- list of NER tags of both entities.
The full list of 29 Russian NER tags:

| NER tag | Description |
|---|---|
| WORK_OF_ART | name of a work of art |
| NORP | affiliation |
| GROUP | unnamed groups of people and companies |
| LAW | law name |
| NATIONALITY | names of nationalities |
| EVENT | event name |
| DATE | date value |
| CURRENCY | names of currencies |
| GPE | geo-political entity |
| QUANTITY | quantity value |
| FAMILY | families as a whole |
| ORDINAL | ordinal value |
| RELIGION | names of religions |
| CITY | names of cities, towns, and villages |
| MONEY | money name |
| AGE | people's and objects' ages |
| LOCATION | location name |
| PERCENT | percent value |
| BOROUGH | names of sub-city entities |
| PERSON | person name |
| REGION | names of sub-country entities |
| COUNTRY | names of countries |
| PROFESSION | professions and people of these professions |
| ORGANIZATION | organization name |
| FAC | building name |
| CARDINAL | cardinal value |
| PRODUCT | product name |
| TIME | time value |
| STREET | street name |
Model Output: one or several of the 30 relations detected between the given entities, each returned as a Russian relation name (e.g. "участник"), or an English one if no Russian name is available, and, if applicable, its Wikidata id (e.g. 'P710').
The full list of Russian relations:

| Relation | Relation id | Russian relation |
|---|---|---|
| MEMBER | P710 | участник |
| WORKS_AS | P106 | род занятий |
| WORKPLACE | | |
| OWNERSHIP | P1830 | владеет |
| SUBORDINATE_OF | | |
| TAKES_PLACE_IN | P276 | местонахождение |
| EVENT_TAKES_PART_IN | P1344 | участвовал в |
| SELLS_TO | | |
| ALTERNATIVE_NAME | | |
| HEADQUARTERED_IN | P159 | расположение штаб-квартиры |
| PRODUCES | P1056 | продукция |
| ABBREVIATION | | |
| DATE_DEFUNCT_IN | P576 | дата прекращения существования |
| SUBEVENT_OF | P361 | часть от |
| DATE_FOUNDED_IN | P571 | дата основания/создания/возн-я |
| DATE_TAKES_PLACE_ON | P585 | момент времени |
| NUMBER_OF_EMPLOYEES_FIRED | | |
| ORIGINS_FROM | P495 | страна происхождения |
| ACQUINTANCE_OF | | |
| PARENT_OF | P40 | дети |
| ORGANIZES | P664 | организатор |
| FOUNDED_BY | P112 | основатель |
| PLACE_RESIDES_IN | P551 | место жительства |
| BORN_IN | P19 | место рождения |
| AGE_IS | | |
| RELATIVE | | |
| NUMBER_OF_EMPLOYEES | P1128 | число сотрудников |
| SIBLING | P3373 | брат/сестра |
| DATE_OF_BIRTH | P569 | дата рождения |
Some details on the RuRED corpus the Russian RE model was trained on
In the case of RuRED, we used the train, dev, and test sets from the original RuRED setting. We additionally generated negative samples where necessary to reach the following proportions:
- train set: twice as many negative samples as positive ones;
- dev & test sets: equal numbers of negative and positive samples.
| Train | Dev | Test |
|---|---|---|
| 12855 | 1076 | 1072 |

| Train Positive | Train Negative | Dev Positive | Dev Negative | Test Positive | Test Negative |
|---|---|---|---|---|---|
| 4285 | 8570 | 538 | 538 | 536 | 536 |
RE Model Architecture¶
We based our model on the Adaptive Thresholding and Localized Context Pooling (ATLOP) model and use NER entity tags as an additional input. The two core ideas of this model are:
Adaptive Threshold
The usual global threshold for converting the RE classifier's output probabilities into relation labels is replaced with a learnable one. A new threshold class (TH), which learns an entity-pair-dependent threshold value, is introduced and trained like all other classes. At prediction time, the positive classes (i.e. the relations that actually hold in the sample) are those with logits higher than the TH class logit, while all other classes are negative.
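The decision rule can be sketched in NumPy as follows. This is a minimal illustration of adaptive-threshold decoding, assuming the threshold class occupies a fixed index in the logit vector; it is not the actual DeepPavlov implementation:

```python
import numpy as np

def predict_with_adaptive_threshold(logits, th_index=0):
    """Adaptive-threshold decoding.

    A relation class is predicted positive iff its logit exceeds the
    logit of the learnable TH class for that entity pair.
    `logits`: shape (num_pairs, num_classes); column `th_index` is TH.
    """
    th = logits[:, th_index:th_index + 1]  # per-pair threshold value
    preds = logits > th                    # compare every class to TH
    preds[:, th_index] = False             # TH itself is never a label
    return preds

logits = np.array([[0.5, 1.2, 0.1, -0.3],   # pair 1: class 1 clears TH
                   [2.0, 1.0, 0.5, 0.0]])   # pair 2: nothing clears TH
print(predict_with_adaptive_threshold(logits))
```

Note that the same logit vector can yield several positive relations for one pair and none for another, which is exactly the multi-label behaviour a single global threshold struggles with.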
Localised Context Pooling
The embedding of each entity pair is enhanced with an additional local context embedding relevant to both entities. Such a representation, attended to the relevant context in the document, helps decide the relation for this particular entity pair. The pre-trained encoder's attention heads are reused directly to incorporate this context information.
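A minimal sketch of the pooling step, assuming attention weights already averaged over heads and layers (hypothetical shapes and names, not the actual implementation):

```python
import numpy as np

def localized_context_embedding(attn, hidden, head_idx, tail_idx):
    """Localized context pooling for one entity pair.

    `attn`:   (num_tokens, num_tokens) attention weights, assumed
              averaged over heads and layers for this sketch.
    `hidden`: (num_tokens, dim) token representations.
    Tokens important to *both* entities get high weight: multiply the
    head and tail attention rows, renormalize, and pool the states.
    """
    a = attn[head_idx] * attn[tail_idx]  # joint token importance
    a = a / (a.sum() + 1e-8)             # renormalize to a distribution
    return a @ hidden                    # weighted sum of token states

rng = np.random.default_rng(0)
attn = rng.random((6, 6))                # toy attention matrix
hidden = rng.random((6, 4))              # toy token embeddings
ctx = localized_context_embedding(attn, hidden, head_idx=0, tail_idx=3)
print(ctx.shape)  # (4,)
```

The resulting context vector is pair-specific: two different tail entities sharing the same head entity attend to different parts of the document and therefore receive different context embeddings.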