Named Entity Recognition (NER)¶
Table of contents¶
1. Introduction to the task
2. Get started with the model
3. Use the model for prediction
  3.1. Predict using Python
  3.2. Predict using CLI
4. Evaluate
  4.1. Evaluate from Python
  4.2. Evaluate from CLI
5. Train the model on your data
  5.1. Train your model from Python
  5.2. Train your model from CLI
6. Models list
7. NER-tags list
1. Introduction to the task¶
Named Entity Recognition (NER) is the task of assigning a tag (from a predefined set of tags) to each token in a given sequence. In other words, the NER task consists of identifying named entities in a text and classifying them into types (e.g. person name, organization, location, etc.).
The BIO encoding scheme is commonly used for NER. It has three tag prefixes: B marks the beginning of an entity, I marks a token inside an entity, and O marks non-entity tokens. The second part of the tag (after the hyphen) denotes the entity type.
Here is an example of a tagged sequence:
Elon | Musk | founded | Tesla | in | 2003 | .
---|---|---|---|---|---|---
B-PER | I-PER | O | B-ORG | O | B-DATE | O
Here we can see three extracted named entities: Elon Musk (a person's name), Tesla (the name of an organization) and 2003 (a date). To see more examples, try out our Demo.
The list of possible types of NER entities may vary depending on your dataset domain. The list of tags used in DeepPavlov’s models can be found in the table.
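Decoding BIO tags back into entity spans can be sketched with a short helper. This is a minimal illustration of the scheme described above; `bio_to_spans` is a hypothetical name, not part of the DeepPavlov API:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            if current:                       # close the previous entity
                spans.append((' '.join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == ctype:
            current.append(token)             # continue the open entity
        else:                                 # 'O' or a non-continuing I- tag
            if current:
                spans.append((' '.join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((' '.join(current), ctype))
    return spans

tokens = ['Elon', 'Musk', 'founded', 'Tesla', 'in', '2003', '.']
tags = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O', 'B-DATE', 'O']
print(bio_to_spans(tokens, tags))
# [('Elon Musk', 'PER'), ('Tesla', 'ORG'), ('2003', 'DATE')]
```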
2. Get started with the model¶
First make sure you have the DeepPavlov Library installed. More info about the first installation.
[ ]:
!pip install -q deeppavlov
Then make sure that all the required packages for the model are installed.
[ ]:
!python -m deeppavlov install ner_ontonotes_bert_torch
ner_ontonotes_bert_torch is the name of the model's config_file. What is a Config File?
A configuration file defines the model and describes its hyperparameters. To use another model, change the name of the config_file here and below. The full list of NER models with their config names can be found in the table.
There are alternative ways to install the model’s packages that do not require executing a separate command – see the options in the next sections of this page.
3. Use the model for prediction¶
3.1 Predict using Python¶
After installing the model, build it from the config and predict.
[ ]:
from deeppavlov import build_model
ner_model = build_model('ner_ontonotes_bert_torch', download=True, install=True)
The download argument defines whether it is necessary to download the files listed in the download section of the config: this section usually contains links to the train and test data, to pretrained models, or to embeddings.
Setting the install argument to True is equivalent to running the install command from the command line: all the required packages are installed first.
Input: List[sentences]
Output: List[tokenized sentences, corresponding NER-tags]
[ ]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])
[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
['Elon', 'Musk', 'founded', 'Tesla']],
[['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]
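The output consists of two parallel batches: tokenized sentences and their per-token tags. They can be zipped back together per sentence; here the values are hard-coded from the example output above rather than produced by a live model:

```python
# Model output for the two example sentences (copied from above).
tokens_batch = [['Bob', 'Ross', 'lived', 'in', 'Florida'],
                ['Elon', 'Musk', 'founded', 'Tesla']]
tags_batch = [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
              ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]

# Pair each token with its predicted tag, sentence by sentence.
for tokens, tags in zip(tokens_batch, tags_batch):
    print(list(zip(tokens, tags)))
```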
3.2 Predict using CLI¶
You can also get predictions in an interactive mode through the CLI (Command Line Interface).
[ ]:
! python -m deeppavlov interact ner_ontonotes_bert_torch -d
-d is an optional download flag (an alternative to download=True in Python code). It downloads the pretrained model along with embeddings and all other files needed to run the model.
Or make predictions for samples from stdin.
[ ]:
! python -m deeppavlov predict ner_ontonotes_bert_torch -f <file-name>
4. Evaluate¶
There are two metrics used to evaluate NER models in DeepPavlov:
- ner_f1 is measured at the entity level (predicted text spans must match the gold spans exactly);
- ner_token_f1 is measured at the token level (correctly tagged tokens from partially extracted entities still count as true positives).
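A toy example makes the difference concrete. Suppose the gold entity is "Great Britain" (B-LOC I-LOC) but the model tags only "Britain" as LOC: at the entity level the span does not match exactly, so there are no true positives, while at the token level one of the two gold LOC tokens is credited. The counts below are worked out by hand; this is not DeepPavlov's metric implementation:

```python
def f1(tp, n_pred, n_gold):
    """F1 from true positives, number of predicted and number of gold items."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Entity level: predicted span "Britain" != gold span "Great Britain" -> 0 TPs.
print(f1(tp=0, n_pred=1, n_gold=1))   # 0.0
# Token level: 1 of 2 gold LOC tokens is tagged LOC -> partial credit.
print(f1(tp=1, n_pred=1, n_gold=2))   # 0.666...
```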
4.1 Evaluate from Python¶
[ ]:
from deeppavlov import evaluate_model
model = evaluate_model('ner_ontonotes_bert_torch', download=True)
4.2 Evaluate from CLI¶
[ ]:
! python -m deeppavlov evaluate ner_ontonotes_bert_torch
5. Train the model on your data¶
5.1 Train your model from Python¶
Provide your data path¶
To train the model on your data, you need to change the path to the training data in the config_file.
Parse the config_file and change the path to your data from Python.
[ ]:
from deeppavlov import train_model
from deeppavlov.core.commands.utils import parse_config
model_config = parse_config('ner_ontonotes_bert_torch')
# dataset that the model was trained on
print(model_config['dataset_reader']['data_path'])
~/.deeppavlov/downloads/ontonotes/
Provide a data_path to your own dataset.
[ ]:
# download and unzip a new example dataset
!wget http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz
!tar -xzvf "conll2003_v2.tar.gz"
[ ]:
# provide a path to the train file
model_config['dataset_reader']['data_path'] = 'contents/train.txt'
Train dataset format¶
To train the model, you need to have a txt-file with a dataset in the following format:
EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O
China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O
The source text is tokenized and tagged: each line contains a token and its BIO tag, separated by whitespace. Sentences are separated by empty lines.
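A parser for this two-column format can be sketched in a few lines. Note that `read_conll` is a hypothetical helper for illustration only; DeepPavlov's own dataset_reader handles this format for you during training:

```python
from pathlib import Path

def read_conll(path):
    """Read a token/tag file where sentences are separated by blank lines."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends a sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            else:
                token, tag = line.split()     # whitespace-separated columns
                tokens.append(token)
                tags.append(tag)
    if tokens:                                # flush the last sentence
        sentences.append((tokens, tags))
    return sentences

# Write a tiny sample in the expected format and parse it back.
sample = "EU B-ORG\nrejects O\n\nChina B-LOC\nsays O\n"
Path('sample_train.txt').write_text(sample, encoding='utf-8')
print(read_conll('sample_train.txt'))
# [(['EU', 'rejects'], ['B-ORG', 'O']), (['China', 'says'], ['B-LOC', 'O'])]
```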
Train the model using new config¶
[ ]:
ner_model = train_model(model_config)
Use your model for prediction.
[ ]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])
[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
['Elon', 'Musk', 'founded', 'Tesla']],
[['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]
5.2 Train your model from CLI¶
[ ]:
! python -m deeppavlov train ner_ontonotes_bert_torch
6. Models list¶
The table presents a list of all of the NER-models available in the DeepPavlov Library.
Config name | Language | Model Size | F1 score (ner_f1) | F1 score (ner_token_f1)
---|---|---|---|---
ner_case_agnostic_mdistilbert | En | 1.6 GB | 89.9 | 91.6
ner_conll2003_bert | En | 1.3 GB | 91.9 | 93.4
ner_ontonotes_bert | En | 1.3 GB | 89.2 | 92.7
ner_collection3_bert | Ru | 2.1 GB | 98.5 | 98.9
ner_rus_bert | Ru | 2.1 GB | 97.6 | 98.5
ner_rus_convers_distilrubert_2L | Ru | 1.3 GB | 92.9 | 96.6
ner_rus_convers_distilrubert_6L | Ru | 1.6 GB | 96.7 | 98.5
ner_rus_bert_probas | Ru | 2.1 GB | 72.6 | 79.5
ner_ontonotes_bert_mult | Multi | 2.1 GB | 88.9 | 92.0
7. NER-tags list¶
The table presents a list of all of the NER entity tags used in DeepPavlov’s NER-models.
Tag | Description
---|---
PERSON | People, including fictional
NORP | Nationalities or religious or political groups
FACILITY | Buildings, airports, highways, bridges, etc.
ORGANIZATION | Companies, agencies, institutions, etc.
GPE | Countries, cities, states
LOCATION | Non-GPE locations, mountain ranges, bodies of water
PRODUCT | Vehicles, weapons, foods, etc. (not services)
EVENT | Named hurricanes, battles, wars, sports events, etc.
WORK OF ART | Titles of books, songs, etc.
LAW | Named documents made into laws
LANGUAGE | Any named language
DATE | Absolute or relative dates or periods
TIME | Times smaller than a day
PERCENT | Percentage (including "%")
MONEY | Monetary values, including unit
QUANTITY | Measurements such as weight or distance
ORDINAL | "first", "second", etc.
CARDINAL | Numerals that do not fall under another type