Multi-task BERT in DeepPavlov
Multi-task BERT in DeepPavlov is an implementation of the BERT training algorithm published in the paper “Multi-Task Deep Neural Networks for Natural Language Understanding”.
The idea is to share the BERT body between several tasks. This is necessary if a model pipe has several components that use BERT and the amount of GPU memory is limited. Each task has its own ‘head’ attached to the output of the BERT encoder. If multi-task BERT has \(T\) heads, one training iteration consists of
composing \(T\) mini-batches, one for each task,
\(T\) gradient steps, one gradient step for each task.
While one BERT head is being trained, the parameters of the other heads do not change. Each training step modifies the parameters of both the current head and the BERT body. You may specify different learning rates for a head and for the body.
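The update schedule described above can be pictured with a toy sketch. This is not DeepPavlov code: the function multitask_iteration, the scalar "parameters", and the constant gradients are all hypothetical stand-ins chosen only to show which parameters change on each step.

```python
from typing import Callable, Dict, Tuple


def multitask_iteration(
    params: dict,
    batches: Dict[str, object],
    body_lr: float,
    head_lr: float,
    grad_fn: Callable[[str, object, dict], Tuple[float, float]],
) -> dict:
    """Run one multi-task iteration: one gradient step per task (hypothetical sketch)."""
    for task, batch in batches.items():
        body_grad, head_grad = grad_fn(task, batch, params)
        # The shared body is updated on every step.
        params["body"] -= body_lr * body_grad
        # Only the head of the current task is updated; other heads stay untouched.
        params["heads"][task] -= head_lr * head_grad
    return params


# Toy setup: scalar parameters and a constant gradient of 1.0 for body and head.
toy_params = {"body": 0.0, "heads": {"insults": 0.0, "sentiment": 0.0, "ner": 0.0}}
toy_batches = {"insults": None, "sentiment": None, "ner": None}
toy_params = multitask_iteration(
    toy_params, toy_batches, body_lr=0.5, head_lr=0.25, grad_fn=lambda t, b, p: (1.0, 1.0)
)
```

With \(T = 3\) tasks, the body receives three updates in one iteration while each head receives exactly one, which is why separate learning rates for the body and the heads can be useful.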
Currently there are heads for classification (mt_bert_classification_task) and sequence tagging (mt_bert_seq_tagging_task).
On this page, multi-task BERT usage is explained using a toy configuration file for a model that detects insults, analyzes sentiment, and recognizes named entities. The multi-task BERT configuration files for training (mt_bert_train_tutorial.json) and for inference (mt_bert_inference_tutorial.json) are based on the configs insults_kaggle_bert.json, sentiment_sst_multi_bert.json, and ner_conll2003_bert.json.
We start with the metadata field of the configuration file. The multi-task BERT model is saved in {"MT_BERT_PATH": "{MODELS_PATH}/mt_bert_tutorial"}. Class and tag vocabularies are saved in {"INSULTS_PATH": "{MT_BERT_PATH}/insults"}, {"SENTIMENT_PATH": "{MT_BERT_PATH}/sentiment"}, and {"NER_PATH": "{MT_BERT_PATH}/ner"}. The requirements field of the multi-task BERT configuration file is identical to the requirements fields of the original configs. The download field of the multi-task BERT configuration file is the union of the download fields of the original configs, without pre-trained models. The metadata field of our config is given below.
{
"metadata": {
"variables": {
"ROOT_PATH": "~/.deeppavlov",
"DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
"MODELS_PATH": "{ROOT_PATH}/models",
"BERT_PATH": "{DOWNLOADS_PATH}/bert_models/cased_L-12_H-768_A-12",
"MT_BERT_PATH": "{MODELS_PATH}/mt_bert_tutorial",
"INSULTS_PATH": "{MT_BERT_PATH}/insults",
"SENTIMENT_PATH": "{MT_BERT_PATH}/sentiment",
"NER_PATH": "{MT_BERT_PATH}/ner"
},
"requirements": [
"{DEEPPAVLOV_PATH}/requirements/tf.txt",
"{DEEPPAVLOV_PATH}/requirements/bert_dp.txt",
"{DEEPPAVLOV_PATH}/requirements/fasttext.txt",
"{DEEPPAVLOV_PATH}/requirements/rapidfuzz.txt",
"{DEEPPAVLOV_PATH}/requirements/hdt.txt"
],
"download": [
{
"url": "http://files.deeppavlov.ai/datasets/insults_data.tar.gz",
"subdir": "{DOWNLOADS_PATH}"
},
{
"url": "http://files.deeppavlov.ai/datasets/yelp_review_full_csv.tar.gz",
"subdir": "{DOWNLOADS_PATH}"
},
{
"url": "http://files.deeppavlov.ai/deeppavlov_data/bert/cased_L-12_H-768_A-12.zip",
"subdir": "{DOWNLOADS_PATH}/bert_models"
}
]
}
}
Train config
When using the multitask_bert component (class_name mt_bert in the configs on this page), you need separate train and inference configuration files.
Data reading and iteration are performed by multitask_reader and multitask_iterator. These classes are composed of task readers and iterators and generate batches that contain data from heterogeneous datasets.
A multitask_reader configuration has the parameters class_name, data_path, and tasks. The data_path field may be any string because data paths are passed to the tasks individually in the tasks parameter. However, you cannot drop the data_path parameter because it is obligatory in a dataset reader configuration. The tasks parameter is a dictionary of task dataset reader configurations. In the configurations of task readers, the reader_class_name parameter is used instead of class_name. The dataset reader configuration is provided below:
{
"dataset_reader": {
"class_name": "multitask_reader",
"data_path": "null",
"tasks": {
"insults": {
"reader_class_name": "basic_classification_reader",
"x": "Comment",
"y": "Class",
"data_path": "{DOWNLOADS_PATH}/insults_data"
},
"sentiment": {
"reader_class_name": "basic_classification_reader",
"x": "text",
"y": "label",
"data_path": "{DOWNLOADS_PATH}/yelp_review_full_csv",
"train": "train.csv",
"test": "test.csv",
"header": null,
"names": [
"label",
"text"
]
},
"ner": {
"reader_class_name": "conll2003_reader",
"data_path": "{DOWNLOADS_PATH}/conll2003/",
"dataset_name": "conll2003",
"provide_pos": false
}
}
}
}
A multitask_iterator configuration has the parameters class_name and tasks. tasks is a dictionary of task iterator configurations. In the configurations of task iterators, iterator_class_name is used instead of class_name. The dataset iterator configuration is as follows:
{
"dataset_iterator": {
"class_name": "multitask_iterator",
"tasks": {
"insults": {
"iterator_class_name": "basic_classification_iterator",
"seed": 42
},
"sentiment": {
"iterator_class_name": "basic_classification_iterator",
"seed": 42,
"split_seed": 23,
"field_to_split": "train",
"split_fields": [
"train",
"valid"
],
"split_proportions": [
0.9,
0.1
]
},
"ner": {"iterator_class_name": "data_learning_iterator"}
}
}
}
Batches generated by multitask_iterator are tuples of two elements: the inputs of the model and the labels. Both inputs and labels are lists of tuples. The inputs have the following format: [(first_task_inputs[0], second_task_inputs[0], ...), (first_task_inputs[1], second_task_inputs[1], ...), ...], where first_task_inputs, second_task_inputs, and so on are the x values of batches from the task dataset iterators. The labels have a similar format.
If task datasets have different sizes, then smaller datasets are repeated until their sizes are equal to the size of the largest dataset. For example, if the first task dataset inputs are [0, 1, 2, 3, 4, 5, 6], the second task dataset inputs are [7, 8, 9], and the batch size is 2, then the multi-task input mini-batches will be [(0, 7), (1, 8)], [(2, 9), (3, 7)], [(4, 8), (5, 9)], [(6, 7)].
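The repetition rule from the example above can be reproduced with a few lines of Python. This is only an illustrative sketch, not the multitask_iterator implementation: the function name multitask_batches is hypothetical, and shuffling, labels, and other details that a real iterator may handle are ignored.

```python
from itertools import cycle, islice
from typing import List, Sequence, Tuple


def multitask_batches(task_inputs: Sequence[Sequence], batch_size: int) -> List[List[Tuple]]:
    """Combine per-task inputs into multi-task mini-batches (hypothetical sketch)."""
    longest = max(len(inputs) for inputs in task_inputs)
    # Repeat every shorter dataset cyclically until it is as long as the largest one.
    padded = [list(islice(cycle(inputs), longest)) for inputs in task_inputs]
    # Each sample is a tuple with one element per task.
    samples = list(zip(*padded))
    return [samples[i:i + batch_size] for i in range(0, longest, batch_size)]


batches = multitask_batches([[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]], batch_size=2)
for batch in batches:
    print(batch)
```

Running the sketch on the datasets from the example yields the same four mini-batches listed in the text, with the last one containing a single sample.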
In this tutorial, there are 3 datasets. Considering the batch structure, the chainer inputs are:
{
"in": ["x_insults", "x_sentiment", "x_ner"],
"in_y": ["y_insults", "y_sentiment", "y_ner"]
}
Sometimes a task dataset iterator returns inputs or labels consisting of more than one element. For example, in the model mt_bert_train_tutorial.json, a siamese_iterator input element consists of 2 strings. If such a variable needs to be split, the InputSplitter component can be used.
The data preparation steps in the pipe of the tutorial config are similar to the data preparation steps in the original configs, except for the names of the variables.
A multitask_bert component has task-specific parameters and parameters that are common to all tasks. The former are provided inside the tasks parameter. tasks is a dictionary whose keys are task names and whose values are task-specific parameters. The names of the tasks have to be the same in the train and inference configs.
If the inference_task_names parameter of a multitask_bert component is provided, the component is created for inference. Otherwise, it is created for training.
Task classes inherit from the MTBertTask class. Inputs and labels of a multitask_bert component are distributed between the tasks according to the in_distribution and in_y_distribution parameters. You can drop these parameters if only one task is called. In that case, all multitask_bert inputs are passed to that task. Another option is to make a distribution parameter a dictionary whose keys are task names and whose values are the numbers of arguments the tasks take. If this option is used, the order of the multitask_bert component inputs in the in and in_y parameters must meet three conditions. First, the in and in_y elements have to be grouped by task, e.g. arguments for the first task, then arguments for the second task, and so on. Secondly, the order of tasks in in and in_y has to be the same as the order of tasks in the in_distribution and in_y_distribution parameters. Thirdly, in the in and in_y parameters the arguments of a task have to be listed in the same order in which they are passed to the get_sess_run_infer_args and get_sess_run_train_args methods of the task. If the in and in_y parameters are dictionaries, you may make the in_distribution and in_y_distribution parameters dictionaries whose keys are task names and whose values are lists of elements of in or in_y.
{
"id": "mt_bert",
"class_name": "mt_bert",
"save_path": "{MT_BERT_PATH}/model",
"load_path": "{MT_BERT_PATH}/model",
"bert_config_file": "{BERT_PATH}/bert_config.json",
"pretrained_bert": "{BERT_PATH}/bert_model.ckpt",
"attention_probs_keep_prob": 0.5,
"body_learning_rate": 3e-5,
"min_body_learning_rate": 2e-7,
"learning_rate_drop_patience": 10,
"learning_rate_drop_div": 1.5,
"load_before_drop": true,
"optimizer": "tf.train:AdamOptimizer",
"clip_norm": 1.0,
"tasks": {
"insults": {
"class_name": "mt_bert_classification_task",
"n_classes": "#classes_vocab_insults.len",
"keep_prob": 0.5,
"return_probas": true,
"learning_rate": 1e-3,
"one_hot_labels": true
},
"sentiment": {
"class_name": "mt_bert_classification_task",
"n_classes": "#classes_vocab_sentiment.len",
"return_probas": true,
"one_hot_labels": true,
"keep_prob": 0.5,
"learning_rate": 1e-3
},
"ner": {
"class_name": "mt_bert_seq_tagging_task",
"n_tags": "#tag_vocab.len",
"return_probas": false,
"keep_prob": 0.5,
"learning_rate": 1e-3,
"use_crf": true,
"encoder_layer_ids": [-1]
}
},
"in_distribution": {"insults": 1, "sentiment": 1, "ner": 3},
"in": [
"bert_features_insults",
"bert_features_sentiment",
"x_ner_subword_tok_ids",
"ner_attention_mask",
"ner_startofword_markers"],
"in_y_distribution": {"insults": 1, "sentiment": 1, "ner": 1},
"in_y": ["y_insults_onehot", "y_sentiment_onehot", "y_ner_ind"],
"out": ["y_insults_pred_probas", "y_sentiment_pred_probas", "y_ner_pred_ind"]
}
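The count-based form of in_distribution used in the config above can be pictured as slicing the flat in list task by task. The sketch below is an illustration under that reading, not the library's code; the helper name distribute_arguments is hypothetical.

```python
from typing import Dict, List


def distribute_arguments(names: List[str], distribution: Dict[str, int]) -> Dict[str, List[str]]:
    """Split a flat list of inputs between tasks by argument counts (hypothetical sketch)."""
    result, start = {}, 0
    # Tasks are visited in the order they appear in the distribution dictionary,
    # so the flat list has to be grouped by task in that same order.
    for task, n_args in distribution.items():
        result[task] = names[start:start + n_args]
        start += n_args
    if start != len(names):
        raise ValueError("The distribution does not cover all inputs")
    return result


in_names = [
    "bert_features_insults",
    "bert_features_sentiment",
    "x_ner_subword_tok_ids",
    "ner_attention_mask",
    "ner_startofword_markers",
]
per_task = distribute_arguments(in_names, {"insults": 1, "sentiment": 1, "ner": 3})
print(per_task["ner"])
```

Under this reading, the "ner" task receives the last three inputs, which is why the inputs of each task must be contiguous and ordered consistently with the distribution.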
You may need to design your own metric for early stopping. In this example, the target metric is the average of AUC ROC for the insults and sentiment tasks and F1 for the NER task. To use a metric in a config, you have to register it. To register a metric, add the register_metric decorator to its function, place the code in a metrics module, and update the registry by running the command python -m utils.prepare.registry in the DeepPavlov root directory. The code below should be placed in the file deeppavlov/metrics/fmeasure.py.
@register_metric("average__roc_auc__roc_auc__ner_f1")
def roc_auc__roc_auc__ner_f1(true_onehot1, pred_probas1, true_onehot2, pred_probas2, ner_true3, ner_pred3):
    # register_metric and ner_f1 are expected to be available in deeppavlov/metrics/fmeasure.py already
    from .roc_auc_score import roc_auc_score
    roc_auc1 = roc_auc_score(true_onehot1, pred_probas1)
    roc_auc2 = roc_auc_score(true_onehot2, pred_probas2)
    # ner_f1 is divided by 100 so that it is averaged on the same scale as the ROC AUC values
    ner_f1_3 = ner_f1(ner_true3, ner_pred3) / 100
    return (roc_auc1 + roc_auc2 + ner_f1_3) / 3
Inference config
A dataset reader and a dataset iterator are not needed in an inference config. The train field and the components that prepare in_y are removed. In the multitask_bert component configuration, all training parameters (learning rate, optimizer, etc.) are omitted.
To demonstrate DeepPavlov multi-task BERT functionality, inference in this example is performed by 2 separate components: multitask_bert and mtbert_reuser. The first component performs named entity recognition, and the second performs insult detection and sentiment analysis.
To run NER using the multitask_bert component, the inference_task_names parameter is added to the multitask_bert component configuration. The inference_task_names parameter can be a string or a list containing strings and lists of strings. If inference_task_names is a string, it is the name of a task that is called separately (in an individual tf.Session.run call).
If inference_task_names is a list, it contains the names of the called tasks. You may group several tasks to speed up inference if these tasks have common inputs. If an element of inference_task_names is a list of task names, the tasks from that list are run simultaneously in one tf.Session.run call. Even though grouped tasks share inputs, you have to provide the full set of inputs for every task in the in parameter of multitask_bert.
In the tutorial, the NER task does not have common inputs with the other tasks and has to be run separately.
{
"id": "mt_bert",
"class_name": "mt_bert",
"inference_task_names": "ner",
"bert_config_file": "{BERT_PATH}/bert_config.json",
"save_path": "{MT_BERT_PATH}/model",
"load_path": "{MT_BERT_PATH}/model",
"pretrained_bert": "{BERT_PATH}/bert_model.ckpt",
"tasks": {
"insults": {
"class_name": "mt_bert_classification_task",
"n_classes": "#classes_vocab_insults.len",
"return_probas": true,
"one_hot_labels": true
},
"sentiment": {
"class_name": "mt_bert_classification_task",
"n_classes": "#classes_vocab_sentiment.len",
"return_probas": true,
"one_hot_labels": true
},
"ner": {
"class_name": "mt_bert_seq_tagging_task",
"n_tags": "#tag_vocab.len",
"return_probas": false,
"use_crf": true,
"encoder_layer_ids": [-1]
}
},
"in": ["x_ner_subword_tok_ids", "ner_attention_mask", "ner_startofword_markers"],
"out": ["y_ner_pred_ind"]
}
The mtbert_reuser component is an interface to the call method of the MultiTaskBert class. The mtbert_reuser component is provided with a multitask_bert component, a list of task names for inference, task_names (the format is the same as for the inference_task_names parameter of multitask_bert), and an in_distribution parameter. Notice that the tasks “insults” and “sentiment” are grouped into a list of 2 elements. This syntax invokes inference of these tasks in one call of tf.Session.run. If task_names were equal to ["insults", "sentiment"], the inference of the tasks would be sequential and would take approximately twice as long.
{
"class_name": "mt_bert_reuser",
"mt_bert": "#mt_bert",
"task_names": [["insults", "sentiment"]],
"in_distribution": {"insults": 1, "sentiment": 1},
"in": ["bert_features", "bert_features"],
"out": ["y_insults_pred_probas", "y_sentiment_pred_probas"]
}
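The difference between the two forms of task_names (and of inference_task_names) can be made concrete with a small sketch that maps a value to the groups of tasks that would each be executed in one tf.Session.run call. The helper session_run_groups is hypothetical and only mirrors the rules stated in the text; it is not part of DeepPavlov.

```python
from typing import List, Union

TaskNames = Union[str, List[Union[str, List[str]]]]


def session_run_groups(task_names: TaskNames) -> List[List[str]]:
    """Return one list of task names per run call (hypothetical sketch)."""
    if isinstance(task_names, str):
        # A single string is one task called separately.
        return [[task_names]]
    # A string element is run on its own; a nested list is run as one group.
    return [[item] if isinstance(item, str) else list(item) for item in task_names]


grouped = session_run_groups([["insults", "sentiment"]])
sequential = session_run_groups(["insults", "sentiment"])
print(len(grouped), len(sequential))
```

In this reading, the nested form produces a single group (one run call for both tasks), while the flat form produces two groups that are run one after another.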