BERT in DeepPavlov
BERT (Bidirectional Encoder Representations from Transformers) is a Transformer pre-trained on masked language model and next sentence prediction tasks. This approach showed state-of-the-art results on a wide range of NLP tasks in English.
Google Research has released several pre-trained BERT models; more details about them can be found here: https://github.com/google-research/bert#pre-trained-models
BERT-base, English, cased, 12-layer, 768-hidden, 12-heads, 110M parameters: download from [google], [deeppavlov]
BERT-base, English, uncased, 12-layer, 768-hidden, 12-heads, 110M parameters: download from [google], [deeppavlov]
BERT-large, English, cased, 24-layer, 1024-hidden, 16-heads, 340M parameters: download from [google]
BERT-base, multilingual, cased, 12-layer, 768-hidden, 12-heads, 180M parameters: download from [google], [deeppavlov], [deeppavlov_pytorch]
BERT-base, Chinese, cased, 12-layer, 768-hidden, 12-heads, 110M parameters: download from [google], [deeppavlov], [deeppavlov_pytorch]
We have trained BERT-base models for other languages and domains:
RuBERT, Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters: [deeppavlov], [deeppavlov_pytorch]
SlavicBERT, Slavic (bg, cs, pl, ru), cased, 12-layer, 768-hidden, 12-heads, 180M parameters: [deeppavlov], [deeppavlov_pytorch]
Conversational BERT, English, cased, 12-layer, 768-hidden, 12-heads, 110M parameters: [deeppavlov], [deeppavlov_pytorch]
Conversational RuBERT, Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters: [deeppavlov], [deeppavlov_pytorch]
Sentence Multilingual BERT, 101 languages, cased, 12-layer, 768-hidden, 12-heads, 180M parameters: [deeppavlov], [deeppavlov_pytorch]
Sentence RuBERT, Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters: [deeppavlov], [deeppavlov_pytorch]
The deeppavlov_pytorch models are designed to be run with HuggingFace's Transformers library.
RuBERT was trained on the Russian part of Wikipedia and news data. We used this training data to build a vocabulary of Russian subtokens and took the multilingual version of BERT-base as the initialization for RuBERT [1].
SlavicBERT was trained on Russian news and four Wikipedias: Bulgarian, Czech, Polish, and Russian. The subtoken vocabulary was built using this data. Multilingual BERT was used as the initialization for SlavicBERT. The model is described in our ACL paper [2].
Conversational BERT was trained on the English part of Twitter, Reddit, DailyDialog [4], OpenSubtitles [5], Debates [6], Blogs [7], and Facebook News Comments. We used this training data to build a vocabulary of English subtokens and took the English cased version of BERT-base as the initialization for English Conversational BERT.
Conversational RuBERT was trained on OpenSubtitles [5], Dirty, Pikabu, and the Social Media segment of the Taiga corpus [8]. We assembled a new vocabulary for the Conversational RuBERT model on this data and initialized the model with RuBERT.
Sentence Multilingual BERT is a representation-based sentence encoder for the 101 languages of Multilingual BERT. It is initialized with Multilingual BERT and then fine-tuned on English MultiNLI [9] and on the dev set of multilingual XNLI [10]. Sentence representations are mean-pooled token embeddings, in the same manner as in Sentence-BERT [12].
Sentence RuBERT is a representation-based sentence encoder for Russian. It is initialized with RuBERT and fine-tuned on SNLI [11] translated into Russian with Google Translate and on the Russian part of the XNLI dev set [10]. Sentence representations are mean-pooled token embeddings, in the same manner as in Sentence-BERT [12].
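The mean pooling mentioned for the sentence encoders can be sketched in a few lines. This is only an illustration in plain Python with toy vectors: the function name mean_pool and the mask argument are assumptions made for the example and are not part of DeepPavlov or Sentence-BERT.

```python
from typing import List


def mean_pool(token_embs: List[List[float]], mask: List[int]) -> List[float]:
    """Average the embeddings of real tokens, skipping padded positions (mask == 0)."""
    dim = len(token_embs[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embs, mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("mask selects no tokens")
    return [value / count for value in total]


# Toy 2-dimensional embeddings for three positions; the last one is padding.
token_embs = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_emb = mean_pool(token_embs, mask)
```

Excluding padded positions matters in batches, where shorter sentences would otherwise be averaged together with padding vectors.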
Here, in DeepPavlov, we made it easy to use pre-trained BERT for downstream tasks like classification, tagging, question answering and ranking. We also provide pre-trained models and examples on how to use BERT with DeepPavlov.
BERT as Embedder
TransformersBertEmbedder allows for using BERT model outputs as token, subtoken and sentence level embeddings.
Additionally, the embeddings can be easily used in DeepPavlov. To get text level, token level and subtoken level representations, you can use or modify a BERT embedder configuration:
from deeppavlov import build_model, configs
from deeppavlov.core.common.file import read_json

# Load the BERT embedder config and point it at a local BERT directory
bert_config = read_json(configs.embedder.bert_embedder)
bert_config['metadata']['variables']['BERT_PATH'] = 'path/to/bert/directory'

# Build the pipeline described by the modified config
m = build_model(bert_config)

texts = ['Hi, i want my embedding.', 'And mine too, please!']
# The model returns token, subtoken and sentence level outputs in this order
tokens, token_embs, subtokens, subtoken_embs, sent_max_embs, sent_mean_embs, bert_pooler_outputs = m(texts)
Examples of using these embeddings in model training pipelines can be found in Sentiment Twitter and NER Ontonotes configuration files.
BERT for Classification
BertClassifierModel and TorchTransformersClassifierModel provide an easy-to-use solution for classification problems using pre-trained BERT on TensorFlow and PyTorch, respectively. One can use several pre-trained English, multilingual and Russian BERT models that are listed above. TorchTransformersClassifierModel supports any Transformer-based model from the Transformers library (https://github.com/huggingface/transformers).
The two main components of the BERT classifier pipeline in DeepPavlov are BertPreprocessor on TensorFlow (TorchTransformersPreprocessor on PyTorch) and BertClassifierModel on TensorFlow (TorchTransformersClassifierModel on PyTorch).
Unprocessed texts should be passed to bert_preprocessor (or torch_transformers_preprocessor) for tokenization into subtokens, encoding of the subtokens with their indices, and creation of token and segment masks. If the pipeline uses one-hot encoded classes, set one_hot_labels to true.
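The three preprocessing steps (splitting into subtokens, mapping subtokens to indices, building token and segment masks) can be illustrated with a toy example. The sketch below is not the DeepPavlov preprocessor: the vocabulary, the function names wordpiece and encode, and the max_len parameter are all hypothetical, and the splitting rule is the usual greedy longest-match-first WordPiece scheme.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; a real model ships its own vocab file.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "play": 4, "##ing": 5, "chess": 6, "i": 7, "like": 8,
}


def wordpiece(word: str) -> List[str]:
    """Greedy longest-match-first split of one word into subtokens."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no vocabulary entry matches: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(text_a: str, text_b: str = "", max_len: int = 12) -> Tuple[List[int], List[int], List[int]]:
    """Return (input_ids, input_mask, segment_ids), each padded to max_len."""
    tokens = ["[CLS]"] + [p for w in text_a.lower().split() for p in wordpiece(w)] + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if text_b:
        b_tokens = [p for w in text_b.lower().split() for p in wordpiece(w)] + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)
    if len(tokens) > max_len:
        raise ValueError("input is longer than max_len; truncation is omitted in this sketch")
    input_ids = [VOCAB[t] for t in tokens]
    pad = max_len - len(input_ids)
    input_mask = [1] * len(input_ids) + [0] * pad  # 1 for real subtokens, 0 for padding
    return input_ids + [VOCAB["[PAD]"]] * pad, input_mask, segment_ids + [0] * pad


input_ids, input_mask, segment_ids = encode("I like playing", "chess", max_len=10)
```

Here "playing" is split into "play" and "##ing", and the second text receives segment id 1 so the encoder can tell the two inputs apart.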
bert_classifier and torch_bert_classifier apply a dense layer, whose output size equals the number of classes, to the pooled output of the Transformer encoder. The dense layer is followed by a softmax activation (sigmoid if the multilabel parameter is set to true in the config).
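The classification head described above amounts to one dense layer plus an activation. A minimal sketch in plain Python follows; the function names and the toy weights are assumptions made for illustration, and the real components operate on batched tensors inside TensorFlow or PyTorch.

```python
import math
from typing import List


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One logit per class: weights has shape (n_classes, hidden_size)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Probabilities that sum to 1: exactly one class is expected per text."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits: List[float]) -> List[float]:
    """Independent per-class probabilities: several labels may apply at once."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def classify(pooled: List[float], weights: List[List[float]], bias: List[float],
             multilabel: bool = False) -> List[float]:
    logits = dense(pooled, weights, bias)
    return sigmoid(logits) if multilabel else softmax(logits)


# Toy pooled encoder output (hidden size 3) and a 2-class head.
pooled = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
single = classify(pooled, weights, bias)
multi = classify(pooled, weights, bias, multilabel=True)
```

With softmax the class scores compete with each other, while with sigmoid each class is scored on its own, which is why the multilabel setting switches the activation.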
BERT for Named Entity Recognition (Sequence Tagging)
A pre-trained BERT model can be used for sequence tagging. Examples of BERT application to sequence tagging can be found here. The modules used for tagging are BertSequenceTagger on TensorFlow and TorchBertSequenceTagger on PyTorch. The tags are obtained by applying a dense layer to the representation of the first subtoken of each word. There is also an optional CRF layer on top in the TensorFlow implementation.
The Multilingual BERT model allows zero-shot transfer across languages. To use our 19-tag NER model for over a hundred languages, see Multilingual BERT Zero-Shot Transfer.
BERT for Morphological Tagging
Since morphological tagging is also a sequence labeling task, it can be solved in a similar fashion. The only difference is that we may use the last subtoken of each word when word morphology is mostly determined by suffixes rather than prefixes (as is the case for most Indo-European languages, such as Russian, Spanish and German). See also.
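Choosing which subtoken represents a word, the first one for sequence tagging or the last one for morphological tagging, can be sketched as follows. The function name word_positions and the use_last flag are hypothetical, and the sketch assumes WordPiece-style subtokens where continuations start with "##".

```python
from typing import List

SPECIAL = ("[CLS]", "[SEP]", "[PAD]")


def word_positions(subtokens: List[str], use_last: bool = False) -> List[int]:
    """Return one subtoken index per word (first subtoken by default, last if use_last)."""
    positions: List[int] = []
    for i, tok in enumerate(subtokens):
        if tok in SPECIAL:
            continue
        if tok.startswith("##") and positions:
            if use_last:
                positions[-1] = i  # move the word's position to its latest continuation
        else:
            positions.append(i)    # a new word starts here
    return positions


subtokens = ["[CLS]", "un", "##believ", "##able", "story", "[SEP]"]
first = word_positions(subtokens)                # index of "un" and "story"
last = word_positions(subtokens, use_last=True)  # index of "##able" and "story"
```

The dense tagging layer would then be applied only to the encoder outputs at the returned positions, giving one prediction per word.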
BERT for Syntactic Parsing
BERT can also be used for syntactic parsing. Like most modern parsers, we use the biaffine model over the embedding layer, which is the output of BERT. The model outputs the index of the syntactic head and the dependency type for each word. See the parser documentation for more information about model performance and the algorithm.
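To give an idea of what a biaffine head scorer computes, here is a heavily simplified sketch. It is not the DeepPavlov parser: it scores word pairs directly on toy embeddings with a single bilinear matrix U, omits bias terms and the separate projection layers that biaffine parsers normally use, ignores dependency types, and picks heads greedily without guaranteeing a valid tree. All names are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def biaffine_scores(embs: Matrix, U: Matrix) -> Matrix:
    """scores[i][j] = embs[i]^T U embs[j]: how well position j fits as head of position i."""
    scores: Matrix = []
    for dep in embs:
        # Project the dependent once: dep^T U
        proj = [sum(dep[a] * U[a][b] for a in range(len(dep))) for b in range(len(U[0]))]
        scores.append([sum(p * h for p, h in zip(proj, head)) for head in embs])
    return scores


def predict_heads(scores: Matrix) -> List[int]:
    """Greedy head choice per word; position 0 is an artificial ROOT and gets no head."""
    heads: List[int] = []
    for i in range(1, len(scores)):
        candidates = [j for j in range(len(scores)) if j != i]  # a word cannot head itself
        heads.append(max(candidates, key=lambda j: scores[i][j]))
    return heads


# Toy embeddings for ROOT, word 1 and word 2, plus a toy bilinear matrix.
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[0.0, 3.0], [1.0, -1.0]]
heads = predict_heads(biaffine_scores(embs, U))
```

In this toy case word 1 attaches to ROOT (index 0) and word 2 attaches to word 1, so the output is one head index per word, as described above.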
BERT for Context Question Answering (SQuAD)
Context Question Answering on the SQuAD dataset is the task of finding an answer to a question in a given context. This task can be formalized as predicting the start and end positions of the answer in the context. BertSQuADModel on TensorFlow and TorchBertSQuADModel on PyTorch use two linear transformations to predict the probability that the current subtoken is the start or end position of an answer. For details, check the Context Question Answering documentation page.
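The two linear transformations and the span selection that follows them can be sketched like this. The code is an illustration with toy numbers, not the DeepPavlov implementation: the function names, the max_answer_len limit and the rule of maximizing the sum of start and end logits are assumptions commonly used for this task.

```python
from typing import List, Tuple


def linear(hidden: List[List[float]], w: List[float], b: float) -> List[float]:
    """One logit per subtoken from a single linear transformation."""
    return [sum(wi * hi for wi, hi in zip(w, h)) + b for h in hidden]


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 3) -> Tuple[int, int]:
    """Pick (start, end) with start <= end that maximizes start_logit + end_logit."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy encoder outputs for four subtokens (hidden size 2) and two separate heads.
hidden = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
start_logits = linear(hidden, w=[1.0, 0.0], b=0.0)
end_logits = linear(hidden, w=[0.0, 1.0], b=0.0)
span = best_span(start_logits, end_logits)
```

Restricting the search to start <= end and to a maximum length keeps the prediction a valid contiguous answer span.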
BERT for Ranking
There are two main approaches to text ranking. The first is interaction-based, which is relatively accurate but slow; the second is representation-based, which is less accurate but faster [3]. Interaction-based ranking with BERT is represented in DeepPavlov by two main components: BertRankerPreprocessor on TensorFlow (TorchBertRankerPreprocessor on PyTorch) and BertRankerModel on TensorFlow (TorchBertRankerModel on PyTorch). Representation-based ranking is provided by the components BertSepRankerPreprocessor and BertSepRankerModel on TensorFlow. The additional components BertSepRankerPredictorPreprocessor and BertSepRankerPredictor (on TensorFlow) are intended for use in interact mode, where the ranking task is to retrieve the best possible response from a provided response base with the help of the trained model. Working examples with the trained models are given here. Statistics are available here.
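The speed advantage of the representation-based approach comes from encoding every candidate response once, ahead of time, so that answering a query only requires comparing vectors. The sketch below illustrates this retrieval step with toy vectors; the cosine similarity scoring, the function names and the response base are assumptions for the example and do not reproduce the DeepPavlov components.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_responses(context_vec: List[float], response_vecs: Dict[str, List[float]],
                   top_k: int = 2) -> List[str]:
    """Return the top_k responses whose precomputed vectors are closest to the context."""
    scored = sorted(((cosine(context_vec, vec), text) for text, vec in response_vecs.items()),
                    reverse=True)
    return [text for _, text in scored[:top_k]]


# Response vectors would be produced by the encoder offline; these are toy values.
response_base = {
    "Sure, here you go.": [0.9, 0.1],
    "The weather is nice.": [0.0, 1.0],
    "Let me check that.": [0.5, 0.5],
}
context_vec = [1.0, 0.0]
top = rank_responses(context_vec, response_base)
```

An interaction-based ranker, by contrast, would have to run the encoder on every context and response pair at query time, which is where the extra accuracy and the extra cost both come from.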
BERT for Extractive Summarization
The BERT model was trained on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks. The NSP head was trained to detect, given the input [CLS] text_a [SEP] text_b [SEP], whether text_b follows text_a in the original document. This NSP head can be used to stack sentences from a long document, starting from an initial sentence. The first sentence of a document can be used as the initial one. BertAsSummarizer on TensorFlow and TorchBertAsSummarizer on PyTorch rely on pre-trained BERT models and do not require training on a summarization dataset.
We have three configuration files:
BertAsSummarizer in Russian takes the first sentence in the document as initialization.
BertAsSummarizer with init in Russian uses a provided initial sentence.
TorchBertAsSummarizer in English takes the first sentence in the document as initialization.
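The idea of stacking sentences with an NSP scorer can be sketched as a greedy loop. This is one plausible reading of the procedure, not the actual BertAsSummarizer code: the function names and the max_sentences parameter are hypothetical, and a simple word-overlap score stands in for the NSP head so that the example runs without a model.

```python
from typing import Callable, List, Optional


def greedy_summary(sentences: List[str], nsp_score: Callable[[str, str], float],
                   init: Optional[str] = None, max_sentences: int = 3) -> List[str]:
    """Start from an initial sentence and repeatedly append the best-scoring candidate."""
    remaining = list(sentences)
    if init is None:
        init = remaining.pop(0)  # default: the first sentence of the document
    summary = [init]
    while remaining and len(summary) < max_sentences:
        text_a = " ".join(summary)
        best = max(remaining, key=lambda s: nsp_score(text_a, s))
        summary.append(best)
        remaining.remove(best)
    return summary


def overlap_score(text_a: str, text_b: str) -> float:
    """Stand-in for the NSP head: share of words in text_b that also occur in text_a."""
    a_words = set(text_a.lower().split())
    b_words = text_b.lower().split()
    if not b_words:
        return 0.0
    return sum(w in a_words for w in b_words) / len(b_words)


document = ["the cat sat", "stocks fell sharply", "the cat slept", "rain is expected"]
summary = greedy_summary(document, overlap_score, max_sentences=2)
```

Replacing overlap_score with a function that returns the NSP probability for text_a and text_b would give the model-based variant; passing init corresponds to the configuration that uses a provided initial sentence.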
Using custom BERT in DeepPavlov
The previous sections describe the BERT-based models implemented in DeepPavlov. To change the BERT model used for initialization in any downstream task mentioned above, the following parameters of the config file must be changed to match the new BERT path:
- download URL in the metadata.download.url part of the config
- bert_config_file, pretrained_bert in the BERT-based Component. In the case of PyTorch BERT, pretrained_bert can be assigned the string name of any Transformer-based model (e.g. "bert-base-uncased", "distilbert-base-uncased"), and then bert_config_file is set to None.
- vocab_file in the bert_preprocessor (torch_transformers_preprocessor). In the case of PyTorch BERT, vocab_file can be assigned the string name of the pre-trained BERT used (e.g. "bert-base-uncased").
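For the PyTorch case, the parameter changes listed above can be applied programmatically. The sketch below works on a toy config dictionary: the chainer/pipe layout and the component entries are assumptions that mimic a DeepPavlov config, and the helper use_custom_bert is hypothetical. A real config would be loaded with read_json and would also need its download entries reviewed.

```python
import copy
from typing import Any, Dict


def use_custom_bert(config: Dict[str, Any], model_name: str) -> Dict[str, Any]:
    """Return a copy of the config whose BERT components point at model_name."""
    cfg = copy.deepcopy(config)  # leave the original config untouched
    for component in cfg["chainer"]["pipe"]:
        if "vocab_file" in component:
            component["vocab_file"] = model_name
        if "pretrained_bert" in component:
            component["pretrained_bert"] = model_name
            component["bert_config_file"] = None  # not needed when a model name is given
    return cfg


# Toy config with the two kinds of components mentioned above (assumed layout).
toy_config = {
    "chainer": {
        "pipe": [
            {"class_name": "torch_transformers_preprocessor", "vocab_file": "bert-base-uncased"},
            {"class_name": "torch_bert_classifier", "pretrained_bert": "bert-base-uncased",
             "bert_config_file": "path/to/bert_config.json"},
        ]
    }
}
new_config = use_custom_bert(toy_config, "distilbert-base-uncased")
```

Working on a deep copy keeps the stock config reusable, so several variants with different encoders can be built from the same starting point.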
[1] Kuratov, Y., Arkhipov, M. (2019). Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. arXiv preprint arXiv:1905.07213.
[2] Arkhipov, M., Trofimova, M., Kuratov, Y., Sorokin, A. (2019). Tuning Multilingual Transformers for Language-Specific Named Entity Recognition. ACL Anthology W19-3712.
[3] McDonald, R., Brokos, G. I., Androutsopoulos, I. (2018). Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682.
[4] Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S. (2017). DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. IJCNLP 2017.
[5] Lison, P., Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).
[6] Zhang, J., Kumar, R., Ravi, S., Danescu-Niculescu-Mizil, C. (2016). Proceedings of NAACL, 2016.
[7] Schler, J., Koppel, M., Argamon, S., Pennebaker, J. (2006). Effects of Age and Gender on Blogging. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.
[8] Shavrina, T., Shapovalova, O. (2017). To the Methodology of Corpus Construction for Machine Learning: «Taiga» Syntax Tree Corpus and Parser. In Proceedings of the "CORPORA2017" international conference, Saint Petersburg, 2017.
[9] Williams, A., Nangia, N., Bowman, S. (2017). A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. arXiv preprint arXiv:1704.05426.
[10] Williams, A., Bowman, S. (2018). XNLI: Evaluating Cross-lingual Sentence Representations. arXiv preprint arXiv:1809.05053.
[11] Bowman, S. R., Angeli, G., Potts, C., Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
[12] Reimers, N., Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.