Multi-task BERT in DeepPavlov¶
Multi-task BERT in DeepPavlov is an implementation of the BERT training algorithm published in the paper Multi-Task Deep Neural Networks for Natural Language Understanding.
The idea is to share BERT body between several tasks. This is necessary if a model pipe has several components using BERT and the amount of GPU memory is limited. Each task has its own ‘head’ part attached to the output of the BERT encoder. If multi-task BERT has \(T\) heads, one training iteration consists of
composing \(T\) lists of examples, one for each task,
\(T\) gradient steps, one gradient step for each task.
By default, on every training step the lists of examples for all but one task are empty, as in the original MT-DNN repository.
When one of the BERT heads is trained, the parameters of the other heads do not change. On each training step, the parameters of the active head and of the shared BERT body are updated together.
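This alternation can be sketched as follows. This is a simplified illustration, not DeepPavlov code: `train_epoch`, `backbone`, `heads`, and `optimizer_step` are all hypothetical names.

```python
# Simplified sketch of one multi-task training epoch: each step feeds a batch
# from exactly one task, and only the shared backbone plus that task's head
# are updated. All names here are illustrative, not the DeepPavlov API.

def train_epoch(backbone, heads, task_batches, optimizer_step):
    """heads: dict task_name -> head; task_batches: dict task_name -> list of batches."""
    steps = []
    for task_name, batches in task_batches.items():
        for batch in batches:
            steps.append((task_name, batch))
    for task_name, batch in steps:
        features = backbone(batch)           # shared BERT body
        loss = heads[task_name](features)    # only this task's head is used
        # gradient step touches the backbone and the active head only;
        # the other heads' parameters stay unchanged on this step
        optimizer_step(loss, params=(backbone, heads[task_name]))
    return len(steps)
```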
Currently, multi-task BERT heads support classification, regression, NER, and multiple-choice tasks.
On this page, multi-task BERT usage is explained with a toy configuration file of a model that is trained for single-sentence classification, sentence-pair classification, regression, multiple choice, and NER. The config for this model is multitask_example.
Other examples of using multitask models can be found in mt_glue.
Train config¶
When using the multitask_transformer component, you can use the same inference file as the train file.
Data reading and iteration are performed by MultiTaskReader and MultiTaskIterator. These classes are composed of task readers and iterators and generate batches that contain data from heterogeneous datasets. The example below demonstrates the usage of the multitask dataset reader:
"dataset_reader": {
    "class_name": "multitask_reader",
    "task_defaults": {
        "class_name": "huggingface_dataset_reader",
        "path": "glue",
        "train": "train",
        "valid": "validation",
        "test": "test"
    },
    "tasks": {
        "cola": {"name": "cola"},
        "copa": {
            "path": "super_glue",
            "name": "copa"
        },
        "conll": {
            "class_name": "conll2003_reader",
            "use_task_defaults": false,
            "data_path": "{DOWNLOADS_PATH}/conll2003/",
            "dataset_name": "conll2003",
            "provide_pos": false
        }
    }
}
Nested dataset readers are listed in the tasks section. By default, the parameters of a nested reader are taken from the task_defaults section. Values from tasks can complement the defaults, like the name parameter in dataset_reader.tasks.cola, or override them, like the path parameter in dataset_reader.tasks.copa. In dataset_reader.tasks.conll, use_task_defaults is false. This special parameter forces multitask_reader to ignore task_defaults while creating the nested reader, which means that the dataset reader for the conll task will use only the parameters from dataset_reader.tasks.conll.
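The merging rule described above can be illustrated with a small sketch. The helper `build_reader_config` is hypothetical, not the actual multitask_reader code:

```python
def build_reader_config(task_defaults, task_params):
    """Merge task-specific parameters over task_defaults, unless the task
    opts out with "use_task_defaults": False. Illustrative sketch only."""
    params = dict(task_params)
    use_defaults = params.pop("use_task_defaults", True)
    if not use_defaults:
        # e.g. the conll task: only its own parameters are used
        return params
    merged = dict(task_defaults)
    merged.update(params)  # task values complement or override the defaults
    return merged
```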
The same principle with default values applies to multitask_iterator
.
Batches generated by multitask_iterator are tuples of two elements: the inputs of the model and the labels.
Both inputs and labels are lists of tuples. The inputs have the following format:
[(first_task_inputs[0], second_task_inputs[0],...), (first_task_inputs[1], second_task_inputs[1], ...), ...]
where first_task_inputs, second_task_inputs, and so on are the x values of batches from the task dataset iterators.
The labels in the second element have a similar format.
If task datasets have different sizes, then for smaller datasets the lists are padded with None
values. For example,
if the first task dataset inputs are [0, 1, 2, 3, 4, 5, 6]
, the second task dataset inputs are [7, 8, 9]
,
and the batch size is 2
, then multi-task input mini-batches will be [(0, 7), (1, 8)]
, [(2, 9), (3, None)]
,
[(4, None), (5, None)]
, [(6, None)]
.
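This padding behaviour can be reproduced with itertools.zip_longest. The function below is a toy reconstruction, not the actual multitask_iterator:

```python
from itertools import zip_longest

def multitask_batches(task_inputs, batch_size):
    """Zip per-task input lists into multi-task tuples, padding exhausted
    tasks with None, then cut the result into mini-batches. Sketch only."""
    joined = list(zip_longest(*task_inputs))  # pads shorter tasks with None
    return [joined[i:i + batch_size] for i in range(0, len(joined), batch_size)]
```

Running it on the two datasets from the example above reproduces the mini-batches listed in the text.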
In this tutorial, there are 5 datasets. Considering the batch structure, chainer
inputs in
multitask_example are:
"in": ["x_cola", "x_rte", "x_stsb", "x_copa", "x_conll"],
"in_y": ["y_cola", "y_rte", "y_stsb", "y_copa", "y_conll"]
Sometimes a task dataset iterator returns inputs or labels consisting of more than one element. For example, a model input element could consist of two strings. If such a variable needs to be split, the InputSplitter component can be used. Data preparation in the multi-task setting can be similar to the preparation in the single-task setting, except for the names of the variables.
For streamlining the code, however, input_splitter and tokenizer can be unified into the multitask_pipeline_preprocessor. This preprocessor takes either a single preprocessor class name used for all tasks (the preprocessor parameter) or a list of preprocessor names (the preprocessors parameter). After the input is split by possible_keys_to_extract, every preprocessor (initialized beforehand) processes the input of its task. Note that if the strict parameter (default: false) is set to true, the component always tries to split the data. Here is the definition of multitask_pipeline_preprocessor from the multitask_example:
"class_name": "multitask_pipeline_preprocessor",
"possible_keys_to_extract": [0, 1],
"preprocessors": [
    "TorchTransformersPreprocessor",
    "TorchTransformersPreprocessor",
    "TorchTransformersPreprocessor",
    "TorchTransformersMultiplechoicePreprocessor",
    "TorchTransformersNerPreprocessor"
],
"do_lower_case": true,
"n_task": 5,
"vocab_file": "{BACKBONE}",
"max_seq_length": 200,
"max_subword_length": 15,
"token_masking_prob": 0.0,
"return_features": true,
"in": ["x_cola", "x_rte", "x_stsb", "x_copa", "x_conll"],
"out": [
    "bert_features_cola",
    "bert_features_rte",
    "bert_features_stsb",
    "bert_features_copa",
    "bert_features_conll"
]
The multitask_transformer component has common and task-specific parameters. Task-specific parameters are provided inside the tasks parameter. tasks is a dictionary whose keys are task names and whose values are task-specific parameters (type, options). Common parameters are backbone_model (the same parameter as in the tokenizer) and all parameters from torch_bert. The order of tasks MATTERS.
Here is the definition of multitask_transformer from the multitask_example:
"id": "multitask_transformer",
"class_name": "multitask_transformer",
"optimizer_parameters": {"lr": 2e-5},
"gradient_accumulation_steps": "{GRADIENT_ACC_STEPS}",
"learning_rate_drop_patience": 2,
"learning_rate_drop_div": 2.0,
"return_probas": true,
"backbone_model": "{BACKBONE}",
"save_path": "{MODEL_PATH}",
"load_path": "{MODEL_PATH}",
"tasks": {
    "cola": {
        "type": "classification",
        "options": 2
    },
    "rte": {
        "type": "classification",
        "options": 2
    },
    "stsb": {
        "type": "regression",
        "options": 1
    },
    "copa": {
        "type": "multiple_choice",
        "options": 2
    },
    "conll": {
        "type": "sequence_labeling",
        "options": "#vocab_conll.len"
    }
},
"in": [
    "bert_features_cola",
    "bert_features_rte",
    "bert_features_stsb",
    "bert_features_copa",
    "bert_features_conll"
],
"in_y": ["y_cola", "y_rte", "y_stsb", "y_copa", "y_ids_conll"],
"out": [
    "y_cola_pred_probas",
    "y_rte_pred_probas",
    "y_stsb_pred",
    "y_copa_pred_probas",
    "y_conll_pred_ids"
]
Note that proba2labels can now take several inputs at once:
{
    "in": ["y_cola_pred_probas", "y_rte_pred_probas", "y_copa_pred_probas"],
    "out": ["y_cola_pred_ids", "y_rte_pred_ids", "y_copa_pred_ids"],
    "class_name": "proba2labels",
    "max_proba": true
}
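With max_proba set to true, proba2labels maps each batch of probability vectors to the indices of the most probable classes. A minimal reimplementation of that behaviour for several inputs at once (a toy equivalent, not the DeepPavlov component):

```python
def proba2labels_max(*proba_batches):
    """For each input batch of probability vectors, return the argmax indices.
    A toy equivalent of proba2labels with "max_proba": true."""
    outputs = []
    for batch in proba_batches:
        outputs.append([max(range(len(p)), key=p.__getitem__) for p in batch])
    return outputs
```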
You may need to create your own metric for early stopping. In this example, the target metric is the average of the ROC AUC scores for the insults and sentiment tasks and the F1 score for the NER task:
from deeppavlov.metrics.fmeasure import ner_f1
from deeppavlov.metrics.roc_auc_score import roc_auc_score

def roc_auc__roc_auc__ner_f1(true_onehot1, pred_probas1, true_onehot2, pred_probas2, ner_true3, ner_pred3):
    roc_auc1 = roc_auc_score(true_onehot1, pred_probas1)
    roc_auc2 = roc_auc_score(true_onehot2, pred_probas2)
    # ner_f1 returns a percentage, so divide by 100 to match the ROC AUC scale
    ner_f1_3 = ner_f1(ner_true3, ner_pred3) / 100
    return (roc_auc1 + roc_auc2 + ner_f1_3) / 3
If the code above is saved in custom_metric.py, the metric can be used in the config as custom_metric:roc_auc__roc_auc__ner_f1 (the module.submodules:function_name reference format).
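Such a metric could then be referenced in the train section of the config roughly as follows. This is a hedged sketch: the inputs names below are illustrative placeholders and must match the variables actually produced by your pipe.

```json
"train": {
    "metrics": [
        {
            "name": "custom_metric:roc_auc__roc_auc__ner_f1",
            "inputs": ["y_task1", "y_task1_pred_probas",
                       "y_task2", "y_task2_pred_probas",
                       "y_ner", "y_ner_pred"]
        }
    ]
}
```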
You can make an inference-only config. In this config, there is no need for a dataset reader or a dataset iterator. The train field and the components preparing in_y are removed. In the multitask_transformer component configuration, all training parameters (learning rate, optimizer, etc.) are omitted.
Here are the results of deeppavlov/configs/multitask/mt_glue.json
compared to the analogous single-task configs,
according to the test server.
Task | Score | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI (m/mm) | QNLI | RTE | AX
---|---|---|---|---|---|---|---|---|---|---
Metric from server | | Matthew’s Corr | Accuracy | F1 / Accuracy | Pearson/Spearman Corr | F1 / Accuracy | Accuracy | Accuracy | Accuracy | Matthew’s Corr
Multitask config | 77.8 | 43.6 | 93.2 | 88.6/84.2 | 84.3/84.0 | 70.1/87.9 | 83.0/82.6 | 90.6 | 75.4 | 35.4
Singletask config | 77.6 | 53.6 | 92.7 | 87.7/83.6 | 84.4/83.1 | 70.5/88.9 | 84.4/83.2 | 90.3 | 63.4 | 36.3