Configuration files
An NLP pipeline config is a JSON file that contains one required element, chainer:
{
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      ...
    ],
    "out": ["y_predicted"]
  }
}
Chainer is a core concept of the DeepPavlov library: the chainer builds a pipeline from heterogeneous components (rule-based/ML/DL) and allows training or inference on the pipeline as a whole. Each component in the pipeline specifies its inputs and outputs as arrays of names, for example, "in": ["tokens", "features"] and "out": ["token_embeddings", "features_embeddings"], so you can chain the outputs of one component with the inputs of other components:
{
"class_name": "deeppavlov.models.preprocessors.str_lower:StrLower",
"in": ["x"],
"out": ["x_lower"]
},
{
"class_name": "nltk_tokenizer",
"in": ["x_lower"],
"out": ["x_tokens"]
},
Each Component in the pipeline must implement the __call__() method and have a class_name parameter, which is either its registered codename or the full name of any Python class in the form "module_name:ClassName". It can also have any other parameters that mirror its __init__() method arguments. Default values of __init__() arguments are overridden with the config values during the initialization of a class instance.
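As an illustration, a minimal custom component might look like the following sketch (the codename and class here are hypothetical, assuming the Component base class from deeppavlov.core.models.component):

from deeppavlov.core.common.registry import register
from deeppavlov.core.models.component import Component

@register('my_lowercaser')  # hypothetical codename, usable as "class_name"
class MyLowercaser(Component):
    def __init__(self, strip: bool = False, **kwargs):
        # "strip" can be set from the config: {"class_name": "my_lowercaser", "strip": true}
        self.strip = strip

    def __call__(self, batch):
        # returns the values bound to the names listed in "out"
        return [s.strip().lower() if self.strip else s.lower() for s in batch]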
You can reuse components in the pipeline to process different parts of the data with the help of the id and ref parameters:
{
  "class_name": "nltk_tokenizer",
  "id": "tokenizer",
  "in": ["x_lower"],
  "out": ["x_tokens"]
},
{
  "ref": "tokenizer",
  "in": ["y"],
  "out": ["y_tokens"]
},
Variables
As of version 0.1.0, every string value in a configuration file is interpreted as a format string whose fields are evaluated from the metadata.variables element:
{
  "chainer": {
    "in": ["x"],
    "pipe": [
      {
        "class_name": "my_component",
        "in": ["x"],
        "out": ["x"],
        "load_path": "{MY_PATH}/file.obj"
      },
      {
        "in": ["x"],
        "out": ["y_predicted"],
        "config_path": "{CONFIGS_PATH}/classifiers/intents_snips.json"
      }
    ],
    "out": ["y_predicted"]
  },
  "metadata": {
    "variables": {
      "MY_PATH": "/some/path",
      "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs"
    }
  }
}
The DEEPPAVLOV_PATH variable is always preset to the path of the deeppavlov Python module.
You can override configuration variables using environment variables with the DP_ prefix: the environment variable DP_VARIABLE_NAME will override VARIABLE_NAME inside a configuration file. For example, setting DP_ROOT_PATH=/my_path/to/large_hard_drive will make most configs use this path for downloading and reading embeddings, models, and datasets.
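For instance, such an override can be set from Python before DeepPavlov parses any configuration (a minimal sketch; exporting the variable in the shell has the same effect):

import os

# hypothetical override: must be set before any DeepPavlov config is parsed
os.environ['DP_ROOT_PATH'] = '/my_path/to/large_hard_drive'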
Training
There are two abstract classes for trainable components: Estimator and NNModel.

An Estimator is fit once on the data, with no batching or early stopping, so fitting can safely be done at the time of pipeline initialization. The fit() method has to be implemented for each Estimator. One example is Vocab.
An NNModel requires more complex training. It can only be trained in a supervised mode (as opposed to an Estimator, which can be trained in both supervised and unsupervised settings). This process takes multiple epochs with periodic validation and logging. The train_on_batch() method has to be implemented for each NNModel.
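As an illustration, minimal skeletons of both kinds of trainable components might look as follows (a sketch: the codenames and classes are hypothetical, and real subclasses also implement __call__() as well as saving and loading):

from deeppavlov.core.common.registry import register
from deeppavlov.core.models.estimator import Estimator
from deeppavlov.core.models.nn_model import NNModel

@register('my_vocab')  # hypothetical codename
class MyVocab(Estimator):
    def fit(self, tokens):
        # called once on the fitting data at pipeline initialization
        ...

@register('my_intent_model')  # hypothetical codename
class MyIntentModel(NNModel):
    def train_on_batch(self, x, y):
        # one training step on a batch of inputs x and ground truth answers y
        ...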
Training is triggered by the train_evaluate_model_from_config() function.
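Assuming a complete training config on disk, this might look like the following minimal sketch:

from deeppavlov.core.commands.train import train_evaluate_model_from_config

# "my_config.json" is a placeholder for a training config
train_evaluate_model_from_config('my_config.json')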
Train config
Estimators that are trained should also have a fit_on parameter, which contains a list of input parameter names. An NNModel should have an in_y parameter, which contains a list of ground truth answer names. For example:
[
  {
    "id": "classes_vocab",
    "class_name": "default_vocab",
    "fit_on": ["y"],
    "level": "token",
    "save_path": "vocabs/classes.dict",
    "load_path": "vocabs/classes.dict"
  },
  {
    "in": ["x"],
    "in_y": ["y"],
    "out": ["y_predicted"],
    "class_name": "intent_model",
    "save_path": "classifiers/intent_cnn",
    "load_path": "classifiers/intent_cnn",
    "classes_vocab": {
      "ref": "classes_vocab"
    }
  }
]
The config for training the pipeline should have three additional elements: dataset_reader, dataset_iterator and train:
{
  "dataset_reader": {
    "class_name": ...,
    ...
  },
  "dataset_iterator": {
    "class_name": ...,
    ...
  },
  "chainer": {
    ...
  },
  "train": {
    ...
  }
}
A simplified version of the training pipeline contains two elements: dataset and train. The dataset element can currently be used for training from classification data in csv and json formats. You can find complete examples of how to use the simplified training pipeline in the intents_sample_csv.json and intents_sample_json.json config files.
Train Parameters
- epochs — maximum number of epochs to train the NNModel, defaults to -1 (infinite)
- batch_size — number of examples in a batch
- metric_optimization — whether to maximize or minimize a metric, defaults to maximize
- validation_patience — how many times in a row the validation metric can fail to improve before early stopping, defaults to 5
- val_every_n_epochs — how often to validate the pipe, defaults to -1 (never)
- log_every_n_batches, log_every_n_epochs — how often to calculate metrics on the train data, defaults to -1 (never)
- validate_best, test_best — flags to infer the best saved model on the valid and test data, default to true
- tensorboard_log_dir — path where logged metrics are written during training; use TensorBoard to visualize the metric plots
- metrics — list of metrics to evaluate the model
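Put together, a train element using these parameters might look like the following sketch (expressed as a Python dict for illustration; in a config file it is the JSON value of the "train" key, and all values here are illustrative, not recommendations):

# illustrative "train" element as a Python dict; in the config file
# the same structure appears as JSON under the "train" key
train_element = {
    "epochs": 100,
    "batch_size": 64,
    "metric_optimization": "maximize",
    "validation_patience": 5,
    "val_every_n_epochs": 1,
    "log_every_n_batches": 100,
    "validate_best": True,
    "test_best": True,
    "metrics": ["accuracy"]
}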
Metrics
"train": {
"metrics": [
"f1",
{
"name": "accuracy",
"inputs": ["y", "y_labels"]
},
{
"name": "roc_auc",
"inputs": ["y", "y_probabilities"]
}
],
...
}
Each metric can be given either as a string with the registered name of a metric function, or as a dictionary with name and inputs properties, where name is a registered name of a metric function and inputs is a list of parameter names from the chainer's inner memory that will be passed to the metric function. If inputs is not specified, it defaults to a concatenation of the chainer's in_y and out parameters.
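For instance, a custom metric could be registered and then referenced by name in the metrics list (a sketch; the codename is hypothetical, assuming the register_metric decorator from deeppavlov.core.common.metrics_registry):

from deeppavlov.core.common.metrics_registry import register_metric

@register_metric('my_accuracy')  # hypothetical codename, usable in "metrics"
def my_accuracy(y_true, y_predicted):
    # each positional argument receives one of the names listed in "inputs"
    correct = sum(int(t == p) for t, p in zip(y_true, y_predicted))
    return correct / max(len(y_true), 1)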
DatasetReader
The DatasetReader class reads data and returns it in a specified format. A concrete DatasetReader class should be inherited from this base class and registered with a codename:
from deeppavlov.core.common.registry import register
from deeppavlov.core.data.dataset_reader import DatasetReader

@register('dstc2_datasetreader')
class DSTC2DatasetReader(DatasetReader):
    def read(self, data_path: str, **kwargs) -> dict:
        # should return a dict with 'train', 'valid' and 'test' data
        ...
DataLearningIterator and DataFittingIterator
DataLearningIterator forms the sets of data ('train', 'valid', 'test') needed for training/inference and divides them into batches. A concrete DataLearningIterator class should be registered and can be inherited from the deeppavlov.core.data.data_learning_iterator.DataLearningIterator class. This is a base class and can be used as a DataLearningIterator as well.
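For instance, a custom iterator might override how the data is split (a sketch; the codename and class are hypothetical):

from deeppavlov.core.common.registry import register
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

@register('my_iterator')  # hypothetical codename
class MyIterator(DataLearningIterator):
    def split(self, *args, **kwargs):
        # rearrange self.train / self.valid / self.test here if a custom
        # split is needed; the base implementation keeps them as read
        pass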
DataFittingIterator iterates over the provided dataset without train/valid/test splitting and is useful for Estimators that do not require training.
Inference
All components inherited from the Component abstract class can be used for inference. The __call__() method should return the standard output of a component. For example, a tokenizer should return tokens, a NER recognizer should return recognized entities, and a bot should return an utterance. The particular format of the returned data should be defined in __call__().
Inference is triggered by the interact_model() function. There is no need for a separate JSON config for inference.
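For example, interactive inference could be started like this (a minimal sketch, assuming interact_model is importable from deeppavlov.core.commands.infer):

from deeppavlov.core.commands.infer import interact_model

# reuses the same config that was used for training
interact_model('my_config.json')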