Relation Extraction (RE)¶
Relation extraction is the task of detecting and classifying the relationship between two entities in text. DeepPavlov provides document-level relation extraction, meaning that a relation can be detected between entities that do not occur in the same sentence. Currently, RE is available for English and Russian.
- The English RE model is trained on the DocRED corpus, which is built from Wikipedia.
- The Russian RE model is trained on the RuRED corpus, which is built from the Lenta.ru news corpus.
English RE model¶
The English RE model can be trained using the following command:
python -m deeppavlov train re_docred
The trained model weights can be downloaded with the following command:
python -m deeppavlov download re_docred
The trained model can be used for inference with the following code:
from deeppavlov import configs, build_model

re_model = build_model(configs.relation_extraction.re_docred, download=False)
sentence_tokens = [["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]]
entity_pos = [[[(0, 2)], [(5, 7), (9, 11)]]]
entity_tags = [["PER", "PER"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P26'], ['spouse']]
Model Input:
- list of tokens of a text document;
- list of entity positions (i.e. all start and end positions of each entity's mentions);
- list of NER tags of both entities.
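To make the input format concrete, the following sketch (a hypothetical helper, not part of the DeepPavlov API) recovers the mention strings from the token list. It assumes each (start, end) pair is a half-open token span, which matches the example above:

```python
def mention_texts(tokens, entity_spans):
    """Return the surface form of every mention of one entity.

    Each (start, end) pair is treated as a half-open token span
    into `tokens` (an assumption consistent with the example above).
    """
    return [" ".join(tokens[start:end]) for start, end in entity_spans]

tokens = ["Barack", "Obama", "is", "married", "to",
          "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]
entity_pos = [[(0, 2)], [(5, 7), (9, 11)]]  # two entities, all their mentions

print(mention_texts(tokens, entity_pos[0]))  # ['Barack Obama']
print(mention_texts(tokens, entity_pos[1]))  # ['Michelle Obama', 'Michelle Robinson']
```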
As NER tags, we adopted those used in the DocRED corpus, which are, in turn, inherited from Tjong Kim Sang and De Meulder (2003).
The full list of 6 English NER tags:

| NER tag | Description |
|---|---|
| PER | People, including fictional |
| ORG | Companies, universities, institutions, political or religious groups, etc. |
| LOC | Geographically defined locations (mountains, bodies of water, etc.); politically defined locations (countries, cities, states, streets, etc.); facilities (buildings, museums, stadiums, hospitals, factories, airports, etc.) |
| TIME | Absolute or relative dates or periods |
| NUM | Percents, money, quantities |
| MISC | Products (vehicles, weapons, etc.); events (elections, battles, sporting events, etc.); laws, cases, languages, etc. |
Model Output: one or several of the 97 relations detected between the given entities, each returned as a Wikidata relation id (e.g. 'P26') together with the relation name ('spouse').
The full list of 97 English relations:

| Relation id | Relation |
|---|---|
| P6 | head of government |
| P17 | country |
| P19 | place of birth |
| P20 | place of death |
| P22 | father |
| P25 | mother |
| P26 | spouse |
| P27 | country of citizenship |
| P30 | continent |
| P31 | instance of |
| P35 | head of state |
| P36 | capital |
| P37 | official language |
| P39 | position held |
| P40 | child |
| P50 | author |
| P54 | member of sports team |
| P57 | director |
| P58 | screenwriter |
| P69 | educated at |
| P86 | composer |
| P102 | member of political party |
| P108 | employer |
| P112 | founded by |
| P118 | league |
| P123 | publisher |
| P127 | owned by |
| P131 | located in the administrative territorial entity |
| P136 | genre |
| P137 | operator |
| P140 | religion |
| P150 | contains administrative territorial entity |
| P155 | follows |
| P156 | followed by |
| P159 | headquarters location |
| P161 | cast member |
| P162 | producer |
| P166 | award received |
| P170 | creator |
| P171 | parent taxon |
| P172 | ethnic group |
| P175 | performer |
| P176 | manufacturer |
| P178 | developer |
| P179 | series |
| P190 | sister city |
| P194 | legislative body |
| P205 | basin country |
| P206 | located in or next to body of water |
| P241 | military branch |
| P264 | record label |
| P272 | production company |
| P276 | location |
| P279 | subclass of |
| P355 | subsidiary |
| P361 | part of |
| P364 | original language of work |
| P400 | platform |
| P403 | mouth of the watercourse |
| P449 | original network |
| P463 | member of |
| P488 | chairperson |
| P495 | country of origin |
| P527 | has part |
| P551 | residence |
| P569 | date of birth |
| P570 | date of death |
| P571 | inception |
| P576 | dissolved, abolished or demolished |
| P577 | publication date |
| P580 | start time |
| P582 | end time |
| P585 | point in time |
| P607 | conflict |
| P674 | characters |
| P676 | lyrics by |
| P706 | located on terrain feature |
| P710 | participant |
| P737 | influenced by |
| P740 | location of formation |
| P749 | parent organization |
| P800 | notable work |
| P807 | separated from |
| P840 | narrative location |
| P937 | work location |
| P1001 | applies to jurisdiction |
| P1056 | product or material produced |
| P1198 | unemployment rate |
| P1336 | territory claimed by |
| P1344 | participant of |
| P1365 | replaces |
| P1366 | replaced by |
| P1376 | capital of |
| P1412 | languages spoken, written or signed |
| P1441 | present in work |
| P3373 | sibling |
Some details on the DocRED corpus the English RE model was trained on
The English RE model was trained on the DocRED English corpus. It was constructed from Wikipedia and Wikidata and is currently the largest human-annotated dataset for document-level RE from plain text.
As the original DocRED test set contains only unlabeled data, while labeled data is needed for evaluation, we decided to: 1. merge the train and dev data (i.e. the labeled data); 2. split it into new train, dev, and test sets.
Currently, two types of splitting are provided:
- the user can set the relative size of the dev and test data (e.g. 1/7);
- the user can set the absolute size of the dev and test data (e.g. 2000 samples).
In our experiments, we set the absolute size of the dev and test data to 150 original documents each, which resulted in approximately 3500 samples.
We additionally generated negative samples where necessary to reach the following proportions:
- train set: twice as many negative samples as positive ones;
- dev & test sets: equal numbers of negative and positive samples.
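One way to enforce such proportions can be sketched as follows. This is a hypothetical helper, not the actual DeepPavlov preprocessing code, and it assumes the candidate negative entity pairs have already been enumerated:

```python
import random

def sample_negatives(positive, negative, ratio, seed=0):
    """Keep at most `ratio` negative samples per positive sample.

    `positive` and `negative` are lists of candidate samples (e.g. entity
    pairs). Illustrative sketch of the proportion logic described above.
    """
    rng = random.Random(seed)
    k = min(len(negative), int(len(positive) * ratio))
    return positive + rng.sample(negative, k)

pos = [("e1", "e2")] * 10
neg = [("e3", "e4")] * 100
train = sample_negatives(pos, neg, ratio=2)  # 2:1 negatives for train
dev = sample_negatives(pos, neg, ratio=1)    # 1:1 for dev/test
print(len(train), len(dev))  # 30 20
```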
| Train | Dev | Test |
|---|---|---|
| 130650 | 3406 | 3545 |

| Train Positive | Train Negative | Dev Positive | Dev Negative | Test Positive | Test Negative |
|---|---|---|---|---|---|
| 44823 | 89214 | 1239 | 1229 | 1043 | 1036 |
Russian RE model¶
The Russian RE model can be trained using the following command:
python -m deeppavlov train re_rured
The trained model weights can be downloaded with the following command:
python -m deeppavlov download re_rured
The trained model can be used for inference with the following code:
from deeppavlov import configs, build_model

model = build_model(configs.relation_extraction.re_rured)
sentence_tokens = [["Илон", "Маск", "живет", "в", "Сиэттле", "."]]
entity_pos = [[[(0, 2)], [(4, 5)]]]
entity_tags = [["PERSON", "CITY"]]
pred = model(sentence_tokens, entity_pos, entity_tags)
>> [['P551'], ['место жительства']]
Model Input:
- list of tokens of a text document;
- list of entity positions (i.e. all start and end positions of each entity's mentions);
- list of NER tags of both entities.
The full list of 29 Russian NER tags:

| NER tag | Description |
|---|---|
| WORK_OF_ART | name of a work of art |
| NORP | affiliation |
| GROUP | unnamed groups of people and companies |
| LAW | law name |
| NATIONALITY | names of nationalities |
| EVENT | event name |
| DATE | date value |
| CURRENCY | names of currencies |
| GPE | geo-political entity |
| QUANTITY | quantity value |
| FAMILY | families as a whole |
| ORDINAL | ordinal value |
| RELIGION | names of religions |
| CITY | names of cities, towns, and villages |
| MONEY | money name |
| AGE | people's and objects' ages |
| LOCATION | location name |
| PERCENT | percent value |
| BOROUGH | names of sub-city entities |
| PERSON | person name |
| REGION | names of sub-country entities |
| COUNTRY | names of countries |
| PROFESSION | professions and people of these professions |
| ORGANIZATION | organization name |
| FAC | building name |
| CARDINAL | cardinal value |
| PRODUCT | product name |
| TIME | time value |
| STREET | street name |
Model Output: one or several of the 30 relations detected between the given entities, each returned as a Russian relation name (e.g. "участник"), or an English one if no Russian name is available, and, if applicable, its Wikidata id (e.g. 'P710').
The full list of Russian relations:

| Relation | Relation id | Russian relation |
|---|---|---|
| MEMBER | P710 | участник |
| WORKS_AS | P106 | род занятий |
| WORKPLACE | | |
| OWNERSHIP | P1830 | владеет |
| SUBORDINATE_OF | | |
| TAKES_PLACE_IN | P276 | местонахождение |
| EVENT_TAKES_PART_IN | P1344 | участвовал в |
| SELLS_TO | | |
| ALTERNATIVE_NAME | | |
| HEADQUARTERED_IN | P159 | расположение штаб-квартиры |
| PRODUCES | P1056 | продукция |
| ABBREVIATION | | |
| DATE_DEFUNCT_IN | P576 | дата прекращения существования |
| SUBEVENT_OF | P361 | часть от |
| DATE_FOUNDED_IN | P571 | дата основания/создания/возн-я |
| DATE_TAKES_PLACE_ON | P585 | момент времени |
| NUMBER_OF_EMPLOYEES_FIRED | | |
| ORIGINS_FROM | P495 | страна происхождения |
| ACQUINTANCE_OF | | |
| PARENT_OF | P40 | дети |
| ORGANIZES | P664 | организатор |
| FOUNDED_BY | P112 | основатель |
| PLACE_RESIDES_IN | P551 | место жительства |
| BORN_IN | P19 | место рождения |
| AGE_IS | | |
| RELATIVE | | |
| NUMBER_OF_EMPLOYEES | P1128 | число сотрудников |
| SIBLING | P3373 | брат/сестра |
| DATE_OF_BIRTH | P569 | дата рождения |
Some details on the RuRED corpus the Russian RE model was trained on
In the case of RuRED, we used the train, dev, and test sets from the original RuRED setting. We additionally generated negative samples where necessary to reach the following proportions:
- train set: twice as many negative samples as positive ones;
- dev & test sets: equal numbers of negative and positive samples.
| Train | Dev | Test |
|---|---|---|
| 12855 | 1076 | 1072 |

| Train Positive | Train Negative | Dev Positive | Dev Negative | Test Positive | Test Negative |
|---|---|---|---|---|---|
| 4285 | 8570 | 538 | 538 | 536 | 536 |
RE Model Architecture¶
We based our model on the Adaptive Thresholding and Localized Context Pooling (ATLOP) model and use NER entity tags as an additional input. The two core ideas of this model are:
Adaptive Threshold
The usual global threshold for converting the RE classifier's output probabilities into relation labels is replaced with a learnable one. A new threshold class (TH), which learns an entity-pair-dependent threshold value, is introduced and trained like all other classes. At prediction time, the positive classes (i.e. the relations that actually hold in the sample) are those with logits higher than the TH class logit, while all other classes are negative.
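The decision rule can be sketched in NumPy as follows. This is a minimal illustration of adaptive-threshold decoding, assuming the threshold class occupies a fixed index in the logit vector; it is not the actual DeepPavlov implementation:

```python
import numpy as np

def predict_with_adaptive_threshold(logits, th_index=0):
    """Adaptive-threshold decoding.

    A relation class is predicted positive iff its logit exceeds the
    logit of the learnable TH class for that entity pair.
    `logits`: shape (num_pairs, num_classes); column `th_index` is TH.
    """
    th = logits[:, th_index:th_index + 1]  # per-pair threshold value
    preds = logits > th                    # compare every class to TH
    preds[:, th_index] = False             # TH itself is never a label
    return preds

logits = np.array([[0.5, 1.2, 0.1, -0.3],   # pair 1: class 1 clears TH
                   [2.0, 1.0, 0.5, 0.0]])   # pair 2: nothing clears TH
print(predict_with_adaptive_threshold(logits))
```

Note that the same logit vector can yield several positive relations for one pair and none for another, which is exactly the multi-label behaviour a single global threshold struggles with.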
Localised Context Pooling
The embedding of each entity pair is enhanced with an additional local context embedding relevant to both entities. Such a representation, attended to the relevant context in the document, helps decide the relation for this particular entity pair. The pre-trained encoder's attention heads are reused directly to incorporate this context information.
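A minimal sketch of the pooling step, assuming attention weights already averaged over heads and layers (hypothetical shapes and names, not the actual implementation):

```python
import numpy as np

def localized_context_embedding(attn, hidden, head_idx, tail_idx):
    """Localized context pooling for one entity pair.

    `attn`:   (num_tokens, num_tokens) attention weights, assumed
              averaged over heads and layers for this sketch.
    `hidden`: (num_tokens, dim) token representations.
    Tokens important to *both* entities get high weight: multiply the
    head and tail attention rows, renormalize, and pool the states.
    """
    a = attn[head_idx] * attn[tail_idx]  # joint token importance
    a = a / (a.sum() + 1e-8)             # renormalize to a distribution
    return a @ hidden                    # weighted sum of token states

rng = np.random.default_rng(0)
attn = rng.random((6, 6))                # toy attention matrix
hidden = rng.random((6, 4))              # toy token embeddings
ctx = localized_context_embedding(attn, hidden, head_idx=0, tail_idx=3)
print(ctx.shape)  # (4,)
```

The resulting context vector is pair-specific: two different tail entities sharing the same head entity attend to different parts of the document and therefore receive different context embeddings.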