Named Entity Recognition (NER)¶
Train and use the model¶
There are two main types of models available: standard RNN-based and BERT-based. For details about BERT-based models, see here. Any pre-trained model can be used for inference from both the Command Line Interface (CLI) and Python. Before using the model, make sure that all required packages are installed using the command:
python -m deeppavlov install ner_ontonotes_bert
To use a pre-trained model from CLI use the following command:
python deeppavlov/deep.py interact ner_ontonotes_bert [-d]
where ner_ontonotes_bert is the name of the config and -d is an optional download key. The key -d is used to download the pre-trained model along with embeddings and all other files needed to run the model. Other possible commands are train, evaluate, and download.
Here is the list of all available configs:
| Model | Dataset | Language | Embeddings Size | Model Size | F1 score |
|---|---|---|---|---|---|
|  | Collection3 [1] | Ru | 700 MB | 1.4 GB | 98.1 |
|  | Collection3 [1] | Ru | 1.0 GB | 5.6 MB | 95.1 |
|  | Ontonotes | Multi | 700 MB | 1.4 GB | 88.8 |
|  | Ontonotes | En | 400 MB | 800 MB | 88.6 |
|  | Ontonotes | En | 331 MB | 7.8 MB | 86.4 |
|  | CoNLL-2003 | En | 400 MB | 850 MB | 91.7 |
|  | CoNLL-2003 | En | 331 MB | 3.1 MB | 89.9 |
|  | DSTC2 | En | — | 626 KB | 97.1 |
Models can be used from Python using the following code:
from deeppavlov import configs, build_model
ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)
ner_model(['Bob Ross lived in Florida'])
>>> [[['Bob', 'Ross', 'lived', 'in', 'Florida']], [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]]
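The model returns a pair of batches: the tokenized inputs and the corresponding tag sequences. A small post-processing helper (not part of DeepPavlov, shown only for illustration) can pair each token with its predicted tag:

```python
def pair_tokens_tags(batch_tokens, batch_tags):
    """Zip each sentence's tokens with its tags, given the
    [[tokens...]], [[tags...]] output format shown above."""
    return [list(zip(toks, tags)) for toks, tags in zip(batch_tokens, batch_tags)]

tokens = [['Bob', 'Ross', 'lived', 'in', 'Florida']]
tags = [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]
print(pair_tokens_tags(tokens, tags))
# [[('Bob', 'B-PERSON'), ('Ross', 'I-PERSON'), ('lived', 'O'), ('in', 'O'), ('Florida', 'B-GPE')]]
```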
The model can also be trained from Python:
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_ontonotes_bert)
The data for training should be placed in the folder provided in the config:
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config
config_dict = parse_config(configs.ner.ner_ontonotes_bert)
print(config_dict['dataset_reader']['data_path'])
>>> '~/.deeppavlov/downloads/ontonotes'
There must be three text files: train.txt, valid.txt, and test.txt. Furthermore, the data_path can be changed from the code. The format of the data is described in the Training data section.
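For example, data_path can be overridden before training. The snippet below is a minimal sketch: the dict literal stands in for the result of parse_config, only the relevant field is shown, and ./my_ner_data is an illustrative path.

```python
# Sketch: override the dataset location in a parsed config before training.
# config_dict stands in for parse_config(configs.ner.ner_ontonotes_bert);
# only the field relevant here is shown.
config_dict = {'dataset_reader': {'data_path': '~/.deeppavlov/downloads/ontonotes'}}

config_dict['dataset_reader']['data_path'] = './my_ner_data'
print(config_dict['dataset_reader']['data_path'])  # ./my_ner_data

# The modified dict can then be passed to training in place of the config path:
# ner_model = train_model(config_dict)
```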
Multilingual BERT Zero-Shot Transfer¶
Multilingual BERT models make it possible to perform zero-shot transfer from one language to another. The model ner_ontonotes_bert_mult was trained on the OntoNotes corpus, which has 19 entity types in its markup schema. The model performance was evaluated on the Russian corpus Collection 3 [1]. Results of the transfer are presented in the table below.
| Tag | F1 score |
|---|---|
| TOTAL | 79.39 |
| PER | 95.74 |
| LOC | 82.62 |
| ORG | 55.68 |
The following Python code can be used to infer the model:
from deeppavlov import configs, build_model
ner_model = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)
ner_model(['Curling World Championship will be held in Antananarivo'])
>>> [[['Curling', 'World', 'Championship', 'will', 'be', 'held', 'in', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'O', 'B-GPE']]]
ner_model(['Mistrzostwa Świata w Curlingu odbędą się w Antananarivo'])
>>> [[['Mistrzostwa', 'Świata', 'w', 'Curlingu', 'odbędą', 'się', 'w', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'B-GPE']]]
ner_model(['Чемпионат мира по кёрлингу пройдёт в Антананариву'])
>>> [[['Чемпионат', 'мира', 'по', 'кёрлингу', 'пройдёт', 'в', 'Антананариву']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'B-GPE']]]
The list of available tags and their descriptions is presented below.
| Tag | Description |
|---|---|
| PERSON | People, including fictional |
| NORP | Nationalities or religious or political groups |
| FACILITY | Buildings, airports, highways, bridges, etc. |
| ORGANIZATION | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states |
| LOCATION | Non-GPE locations, mountain ranges, bodies of water |
| PRODUCT | Vehicles, weapons, foods, etc. (not services) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK OF ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws |
| LANGUAGE | Any named language |
| DATE | Absolute or relative dates or periods |
| TIME | Times smaller than a day |
| PERCENT | Percentage (including "%") |
| MONEY | Monetary values, including unit |
| QUANTITY | Measurements, as of weight or distance |
| ORDINAL | "first", "second", etc. |
| CARDINAL | Numerals that do not fall under another type |
NER task¶
Named Entity Recognition (NER) is one of the most common tasks in natural language processing. In most cases, the NER task can be formulated as follows:
Given a sequence of tokens (words, and maybe punctuation symbols) provide a tag from a predefined set of tags for each token in the sequence.
For the NER task, some common entity types are used as tags:
persons
locations
organizations
expressions of time
quantities
monetary values
Furthermore, to distinguish adjacent entities with the same tag, many applications use the BIO tagging scheme. Here "B" denotes the beginning of an entity, "I" stands for "inside" and is used for all words of the entity except the first one, and "O" means the absence of an entity. Example with dropped punctuation:
Bernhard B-PER
Riemann I-PER
Carl B-PER
Friedrich I-PER
Gauss I-PER
and O
Leonhard B-PER
Euler I-PER
In the example above PER means person tag, and “B-” and “I-” are prefixes identifying beginnings and continuations of the entities. Without such prefixes, it is impossible to separate Bernhard Riemann from Carl Friedrich Gauss.
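Decoding BIO tags back into entity spans makes the prefixes concrete. The following helper is a minimal sketch (not part of DeepPavlov) that groups tokens into (entity text, entity type) pairs:

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A "B-" tag always starts a new entity, closing any open one.
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current:
            # An "I-" tag continues the currently open entity.
            current.append(token)
        else:
            # "O" (or a stray "I-" without a preceding "B-") closes the entity.
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((' '.join(current), current_type))
    return entities

tokens = ['Bernhard', 'Riemann', 'Carl', 'Friedrich', 'Gauss',
          'and', 'Leonhard', 'Euler']
tags = ['B-PER', 'I-PER', 'B-PER', 'I-PER', 'I-PER', 'O', 'B-PER', 'I-PER']
print(bio_to_entities(tokens, tags))
# [('Bernhard Riemann', 'PER'), ('Carl Friedrich Gauss', 'PER'), ('Leonhard Euler', 'PER')]
```

Note how the "B-" prefix on 'Carl' is what separates Bernhard Riemann from Carl Friedrich Gauss even though no "O" token lies between them.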
Training data¶
To train the neural network, you need to have a dataset in the following format:
EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O
China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O
...
The source text is tokenized and tagged. For each token, there is a tag with BIO markup. Tags are separated from tokens with whitespaces. Sentences are separated with empty lines.
The dataset is a text file or a set of text files. The dataset must be split into three parts: train, validation, and test. The train set is used for training the network, namely adjusting the weights with gradient descent. The validation set is used for monitoring learning progress and early stopping. The test set is used for the final evaluation of model quality. A typical partition of a dataset into train, validation, and test is 80%, 10%, and 10%, respectively.
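As a sketch of how this format can be read, the following minimal parser (hypothetical, not the DeepPavlov dataset_reader) splits such a file into sentences of (tokens, tags):

```python
def read_conll(text):
    """Parse whitespace-separated token/tag lines; blank lines split sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line marks a sentence boundary.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences

sample = """EU B-ORG
rejects O
. O

China B-LOC
says O
. O
"""
print(read_conll(sample))
# [(['EU', 'rejects', '.'], ['B-ORG', 'O', 'O']), (['China', 'says', '.'], ['B-LOC', 'O', 'O'])]
```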
Few-shot Language-Model based¶
It is possible to get a cold-start baseline from just a few samples of labeled data in a couple of seconds. The solution is based on a Language Model (LM) trained on an open-domain corpus. On top of the LM, an SVM classification layer is placed. It is possible to start from as few as 10 sentences containing entities of interest.
The data for training this model should be collected in the following way. Given a collection of N sentences without markup, sequentially mark up sentences until the total number of sentences containing the entity of interest equals K. During training, both sentences with and without markup are used.
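The collection procedure above can be sketched as follows; collect_few_shot and has_entity are hypothetical names introduced for illustration, not part of DeepPavlov:

```python
def collect_few_shot(sentences, has_entity, k):
    """Annotate sentences in order until k of them contain the entity of
    interest; all annotated sentences (with and without the entity) are kept."""
    selected, positives = [], 0
    for sent in sentences:
        selected.append(sent)
        if has_entity(sent):
            positives += 1
            if positives == k:
                break
    return selected

sentences = ['Ada Lovelace wrote it', 'nothing here', 'Euler proved it', 'plain text']
print(collect_few_shot(sentences, lambda s: s[0].isupper(), 2))
# ['Ada Lovelace wrote it', 'nothing here', 'Euler proved it']
```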
Mean chunk-wise F1 scores for the Russian language on 10 sentences with entities (the total number of training sentences is larger and is determined by the distribution of sentences with and without entities):

| Tag | F1 score |
|---|---|
| PER | 84.85 |
| LOC | 68.41 |
| ORG | 32.63 |
The model can be trained using CLI:
python -m deeppavlov train ner_few_shot_ru
You have to provide the train.txt, valid.txt, and test.txt files in the format described in the Training data section. The files must be in the ner_few_shot_data folder, as described in the dataset_reader part of the config ner/ner_few_shot_ru_train.json.
To train and use the model from Python code, the following snippet can be used:
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_few_shot_ru, download=True)
ner_model(['Example sentence'])
Warning: this model can take a lot of time and memory if the number of sentences is greater than 1000!
If a lot of data is available, the few-shot setting can be simulated with a special dataset_iterator. For this purpose, the config ner/ner_few_shot_ru_train.json can be used. The following code can be used for this simulation:
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_few_shot_ru_simulate, download=True)
In this config the Collection dataset is used. However, if there are files train.txt, valid.txt, and test.txt in the ner_few_shot_data folder they will be used instead.
To use an existing few-shot model, the following Python interface can be used:
from deeppavlov import configs, build_model
ner_model = build_model(configs.ner.ner_few_shot_ru)
ner_model([['Example', 'sentence']])
ner_model(['Example sentence'])