Named Entity Recognition (NER)¶
Train and use the model¶
There are two main types of models available: standard RNN-based and BERT-based. For details about BERT-based models, see here. Any pre-trained model can be used for inference from both the Command Line Interface (CLI) and Python. Before using the model, make sure that all required packages are installed using the command:
python -m deeppavlov install ner_ontonotes_bert
To use a pre-trained model from CLI use the following command:
python deeppavlov/deep.py interact ner_ontonotes_bert [-d]
where ner_ontonotes_bert is the name of the config and -d is an optional download key. The key -d is used to download the pre-trained model along with embeddings and all other files needed to run the model. Other possible commands are train, evaluate, and download.
Here is the list of all available configs:
| Model | Dataset | Language | Embeddings Size | Model Size | F1 score |
|---|---|---|---|---|---|
|  | Collection3 [1] | Ru | 700 MB | 1.4 GB | 98.1 |
|  | Collection3 [1] | Ru | 1.0 GB | 5.6 MB | 95.1 |
|  | Ontonotes | Multi | 700 MB | 1.4 GB | 88.8 |
|  | Ontonotes | En | 400 MB | 800 MB | 88.6 |
|  | Ontonotes | En | 331 MB | 7.8 MB | 86.4 |
|  | CoNLL-2003 | En | 400 MB | 850 MB | 91.7 |
|  | CoNLL-2003 | En | 331 MB | 3.1 MB | 89.9 |
|  | DSTC2 | En | — | 626 KB | 97.1 |
Models can be used from Python using the following code:
from deeppavlov import configs, build_model
ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)
ner_model(['Bob Ross lived in Florida'])
>>> [[['Bob', 'Ross', 'lived', 'in', 'Florida']], [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]]
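The model returns a pair of batches: the tokenized inputs and the corresponding tag sequences. A small post-processing helper (not part of DeepPavlov, shown only for illustration) can pair each token with its predicted tag:

```python
def pair_tokens_tags(batch_tokens, batch_tags):
    """Zip each sentence's tokens with its tags, given the
    [[tokens...]], [[tags...]] output format shown above."""
    return [list(zip(toks, tags)) for toks, tags in zip(batch_tokens, batch_tags)]

tokens = [['Bob', 'Ross', 'lived', 'in', 'Florida']]
tags = [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]
print(pair_tokens_tags(tokens, tags))
# [[('Bob', 'B-PERSON'), ('Ross', 'I-PERSON'), ('lived', 'O'), ('in', 'O'), ('Florida', 'B-GPE')]]
```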
The model can also be trained from Python:
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_ontonotes_bert)
The data for training should be placed in the folder provided in the config:
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config
config_dict = parse_config(configs.ner.ner_ontonotes_bert)
print(config_dict['dataset_reader']['data_path'])
>>> '~/.deeppavlov/downloads/ontonotes'
There must be three text files: train.txt, valid.txt, and test.txt. Furthermore, the data_path can be changed from the code. The format of the data is described in the Training data section.
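For example, data_path can be overridden before training. The snippet below is a minimal sketch: the dict literal stands in for the result of parse_config, only the relevant field is shown, and ./my_ner_data is an illustrative path.

```python
# Sketch: override the dataset location in a parsed config before training.
# config_dict stands in for parse_config(configs.ner.ner_ontonotes_bert);
# only the field relevant here is shown.
config_dict = {'dataset_reader': {'data_path': '~/.deeppavlov/downloads/ontonotes'}}

config_dict['dataset_reader']['data_path'] = './my_ner_data'
print(config_dict['dataset_reader']['data_path'])  # ./my_ner_data

# The modified dict can then be passed to training in place of the config path:
# ner_model = train_model(config_dict)
```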
Multilingual BERT Zero-Shot Transfer¶
Multilingual BERT models make it possible to perform zero-shot transfer from one language to another. The model ner_ontonotes_bert_mult was trained on the OntoNotes corpus, which has 19 entity types in its markup schema. The model performance was evaluated on the Russian corpus Collection 3 [1]. Results of the transfer are presented in the table below.
| Tag | F1 score |
|---|---|
| TOTAL | 79.39 |
| PER | 95.74 |
| LOC | 82.62 |
| ORG | 55.68 |
The following Python code can be used to infer the model:
from deeppavlov import configs, build_model
ner_model = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)
ner_model(['Curling World Championship will be held in Antananarivo'])
>>> [[['Curling', 'World', 'Championship', 'will', 'be', 'held', 'in', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'O', 'B-GPE']]]
ner_model(['Mistrzostwa Świata w Curlingu odbędą się w Antananarivo'])
>>> [[['Mistrzostwa', 'Świata', 'w', 'Curlingu', 'odbędą', 'się', 'w', 'Antananarivo']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'O', 'B-GPE']]]
ner_model(['Чемпионат мира по кёрлингу пройдёт в Антананариву'])
>>> [[['Чемпионат', 'мира', 'по', 'кёрлингу', 'пройдёт', 'в', 'Антананариву']],
[['B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'O', 'O', 'B-GPE']]]
The list of available tags and their descriptions is presented below.
| Tag | Description |
|---|---|
| PERSON | People, including fictional |
| NORP | Nationalities or religious or political groups |
| FACILITY | Buildings, airports, highways, bridges, etc. |
| ORGANIZATION | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states |
| LOCATION | Non-GPE locations, mountain ranges, bodies of water |
| PRODUCT | Vehicles, weapons, foods, etc. (not services) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK OF ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws |
| LANGUAGE | Any named language |
| DATE | Absolute or relative dates or periods |
| TIME | Times smaller than a day |
| PERCENT | Percentage (including "%") |
| MONEY | Monetary values, including unit |
| QUANTITY | Measurements, as of weight or distance |
| ORDINAL | "first", "second", etc. |
| CARDINAL | Numerals that do not fall under another type |
NER task¶
Named Entity Recognition (NER) is one of the most common tasks in natural language processing. In most cases, the NER task can be formulated as follows:
Given a sequence of tokens (words, and maybe punctuation symbols) provide a tag from a predefined set of tags for each token in the sequence.
For the NER task, some common entity types are used as tags:
persons
locations
organizations
expressions of time
quantities
monetary values
Furthermore, to distinguish adjacent entities with the same tag, many applications use the BIO tagging scheme. Here "B" denotes the beginning of an entity, "I" stands for "inside" and is used for all words of the entity except the first one, and "O" means the absence of an entity. Example with dropped punctuation:
Bernhard B-PER
Riemann I-PER
Carl B-PER
Friedrich I-PER
Gauss I-PER
and O
Leonhard B-PER
Euler I-PER
In the example above PER means person tag, and “B-” and “I-” are prefixes identifying beginnings and continuations of the entities. Without such prefixes, it is impossible to separate Bernhard Riemann from Carl Friedrich Gauss.
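Decoding BIO tags back into entity spans makes the prefixes concrete. The following helper is a minimal sketch (not part of DeepPavlov) that groups tokens into (entity text, entity type) pairs:

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A "B-" tag always starts a new entity, closing any open one.
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith('I-') and current:
            # An "I-" tag continues the currently open entity.
            current.append(token)
        else:
            # "O" (or a stray "I-" without a preceding "B-") closes the entity.
            if current:
                entities.append((' '.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((' '.join(current), current_type))
    return entities

tokens = ['Bernhard', 'Riemann', 'Carl', 'Friedrich', 'Gauss',
          'and', 'Leonhard', 'Euler']
tags = ['B-PER', 'I-PER', 'B-PER', 'I-PER', 'I-PER', 'O', 'B-PER', 'I-PER']
print(bio_to_entities(tokens, tags))
# [('Bernhard Riemann', 'PER'), ('Carl Friedrich Gauss', 'PER'), ('Leonhard Euler', 'PER')]
```

Note how the "B-" prefix on 'Carl' is what separates Bernhard Riemann from Carl Friedrich Gauss even though no "O" token lies between them.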
Training data¶
To train the neural network, you need to have a dataset in the following format:
EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O
China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O
...
The source text is tokenized and tagged. For each token, there is a tag with BIO markup. Tags are separated from tokens with whitespaces. Sentences are separated with empty lines.
The dataset is a text file or a set of text files. The dataset must be split into three parts: train, validation, and test. The train set is used for training the network, namely adjusting the weights with gradient descent. The validation set is used for monitoring learning progress and early stopping. The test set is used for the final evaluation of model quality. A typical partition of a dataset into train, validation, and test is 80%, 10%, and 10%, respectively.
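As a sketch of how this format can be read, the following minimal parser (hypothetical, not the DeepPavlov dataset_reader) splits such a file into sentences of (tokens, tags):

```python
def read_conll(text):
    """Parse whitespace-separated token/tag lines; blank lines split sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line marks a sentence boundary.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences

sample = """EU B-ORG
rejects O
. O

China B-LOC
says O
. O
"""
print(read_conll(sample))
# [(['EU', 'rejects', '.'], ['B-ORG', 'O', 'O']), (['China', 'says', '.'], ['B-LOC', 'O', 'O'])]
```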
Few-shot Language-Model based¶
It is possible to get a cold-start baseline from just a few samples of labeled data in a couple of seconds. The solution is based on a Language Model (LM) trained on an open-domain corpus. On top of the LM, an SVM classification layer is placed. It is possible to start from as few as 10 sentences containing entities of interest.
The data for training this model should be collected in the following way. Given a collection of N sentences without markup, sequentially mark up sentences until the total number of sentences containing the entity of interest equals K. During training, both sentences with and without markup are used.
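The collection procedure above can be sketched as follows; collect_few_shot and has_entity are hypothetical names introduced for illustration, not part of DeepPavlov:

```python
def collect_few_shot(sentences, has_entity, k):
    """Annotate sentences in order until k of them contain the entity of
    interest; all annotated sentences (with and without the entity) are kept."""
    selected, positives = [], 0
    for sent in sentences:
        selected.append(sent)
        if has_entity(sent):
            positives += 1
            if positives == k:
                break
    return selected

sentences = ['Ada Lovelace wrote it', 'nothing here', 'Euler proved it', 'plain text']
print(collect_few_shot(sentences, lambda s: s[0].isupper(), 2))
# ['Ada Lovelace wrote it', 'nothing here', 'Euler proved it']
```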
Mean chunk-wise F1 scores for the Russian language on 10 sentences with entities (the total number of training sentences is larger and is determined by the distribution of sentences with and without entities):

| Tag | F1 score |
|---|---|
| PER | 84.85 |
| LOC | 68.41 |
| ORG | 32.63 |
The model can be trained using CLI:
python -m deeppavlov train ner_few_shot_ru
You have to provide the train.txt, valid.txt, and test.txt files in the format described in the Training data section. The files must be in the ner_few_shot_data folder, as described in the dataset_reader part of the config ner/ner_few_shot_ru_train.json.
To train and use the model from Python code, the following snippet can be used:
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_few_shot_ru, download=True)
ner_model(['Example sentence'])
Warning: this model can take a lot of time and memory if the number of sentences is greater than 1000!
If a lot of data is available, the few-shot setting can be simulated with a special dataset_iterator. For this purpose, the config ner/ner_few_shot_ru_train.json can be used. The following code can be used for this simulation:
from deeppavlov import configs, train_model
ner_model = train_model(configs.ner.ner_few_shot_ru_simulate, download=True)
In this config the Collection dataset is used. However, if there are files train.txt, valid.txt, and test.txt in the ner_few_shot_data folder they will be used instead.
To use an existing few-shot model, the following Python interface can be used:
from deeppavlov import configs, build_model
ner_model = build_model(configs.ner.ner_few_shot_ru)
ner_model([['Example', 'sentence']])
ner_model(['Example sentence'])