Features¶

Components¶

NER component ¶

DeepPavlov provides two models for the Named Entity Recognition (NER) task: a BERT-based model and a Bi-LSTM+CRF model. Both predict a tag in BIO format for each token in the input.
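To illustrate what BIO-formatted output looks like downstream, here is a minimal sketch that groups per-token tags into entity spans. The function name `bio_to_spans` and the sample sentence are hypothetical, and this is not the DeepPavlov implementation; it only assumes that tags have the form `B-TYPE`, `I-TYPE`, or `O`.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close the previous one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # An "O" tag or an inconsistent "I-" tag closes the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Bob", "Ross", "lived", "in", "Florida"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))  # → [('Bob Ross', 'PER'), ('Florida', 'LOC')]
```

In this sketch an `I-` tag that does not continue an open entity of the same type is treated like `O`; other decoders may choose to start a new entity instead.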

The BERT-based model is described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

The second model reproduces the architecture from the paper Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition, which is inspired by the Bi-LSTM+CRF architecture from https://arxiv.org/pdf/1603.01360.pdf.

Dataset	Lang	Model	Test F1
Persons-1000 dataset with additional LOC and ORG markup (Collection 3)	Ru	ner_rus_bert.json	97.7
	Ru	ner_rus.json	95.1
CoNLL-2003	En	ner_conll2003_bert.json	91.5
CoNLL-2003		ner_conll2003.json	89.9
OntoNotes		ner_ontonotes_bert.json	88.4
OntoNotes		ner_ontonotes.json	87.1
DSTC2		ner_dstc2.json	97.1

Slot filling components ¶

These components use fuzzy Levenshtein search to extract normalized slot values from text. They either rely on NER results or perform a needle-in-a-haystack search.
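The idea of fuzzy Levenshtein matching can be sketched as follows: a (possibly misspelled) text fragment is compared with the known variations of every slot value, and the normalized value of the closest variation is returned if it is close enough. The names `normalize_slot`, `FOOD`, and the `max_ratio` threshold are assumptions made for illustration and are not taken from the DeepPavlov code.

```python
from typing import Dict, List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize_slot(text: str,
                   slot_values: Dict[str, List[str]],
                   max_ratio: float = 0.34) -> Optional[str]:
    """Return the normalized value whose variation is closest to `text`.

    The distance is divided by the variation length so that short and long
    variations are comparable; None is returned if nothing is close enough.
    """
    best_ratio, best_value = float("inf"), None
    for normalized, variations in slot_values.items():
        for variation in variations:
            distance = levenshtein(text.lower(), variation.lower())
            ratio = distance / max(len(variation), 1)
            if ratio < best_ratio:
                best_ratio, best_value = ratio, normalized
    return best_value if best_ratio <= max_ratio else None


# Hypothetical slot vocabulary: normalized value -> known surface variations.
FOOD = {"chinese": ["chinese", "china food"], "italian": ["italian", "pizza"]}
print(normalize_slot("chineese", FOOD))  # → chinese
print(normalize_slot("sushi", FOOD))     # → None
```

When NER results are available, `text` would be the entity found by the tagger. A needle-in-a-haystack variant would instead slide over substrings of the whole utterance and apply the same comparison to each.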

Dataset	Slots Accuracy
DSTC 2	98.85

A component for word-level classification tasks (intents, sentiment, etc.). Available architectures include shallow-and-wide CNN, deep CNN, BiLSTM, BiLSTM with self-attention, and others. The model also supports multilabel text classification. Several pre-trained models are available and are listed in the table below.

Task	Dataset	Lang	Model	Metric	Valid	Test	Downloads
28 intents	DSTC 2	En	DSTC 2 emb	Accuracy	0.7613	0.7733	800 Mb
			Wiki emb		0.9629	0.9617	8.5 Gb
			BERT		0.9673	0.9636	800 Mb
7 intents	SNIPS-2017 [7]		DSTC 2 emb	F1-macro	0.8591	–	800 Mb
			Wiki emb		0.9820	–	8.5 Gb
			Tfidf + SelectKBest + PCA + Wiki emb		0.9673	–	8.6 Gb
			Wiki emb weighted by Tfidf		0.9786	–	8.5 Gb
Insult detection	Insults		Reddit emb	ROC-AUC	0.9263	0.8556	6.2 Gb
Insult detection	Insults		English BERT	ROC-AUC	0.9255	0.8612	1200 Mb
5 topics	AG News		Wiki emb	Accuracy	0.8922	0.9059	8.5 Gb
Sentiment	Twitter mokoron	Ru	RuWiki+Lenta emb w/o preprocessing		0.9965	0.9961	6.2 Gb
	Twitter mokoron		RuWiki+Lenta emb with preprocessing		0.7823	0.7759	6.2 Gb
	RuSentiment		RuWiki+Lenta emb	F1-weighted	0.6541	0.7016	6.2 Gb
			Twitter emb super-convergence [6]		0.7301	0.7576	3.4 Gb
			ELMo		0.7519	0.7875	700 Mb
			Multi-language BERT		0.6809	0.7193	1900 Mb
Intent	Yahoo-L31		Yahoo-L31 on ELMo pre-trained on Yahoo-L6	ROC-AUC	0.9412	–	700 Mb

[6] Smith L. N., Topin N. Super-convergence: Very fast training of residual networks using large learning rates. – 2018.
[7] Coucke A. et al. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces //arXiv preprint arXiv:1805.10190. – 2018.

Because no intent recognition results had been published for DSTC 2 data, the presented model is compared with others on the SNIPS dataset. The model scores were evaluated in the same way as in [3], so that they can be compared with the results reported by the authors of the dataset. The results were obtained after parameter tuning and with embeddings trained on the Reddit dataset.

Model	AddToPlaylist	BookRestaurant	GetWeather	PlayMusic	RateBook	SearchCreativeWork	SearchScreeningEvent
api.ai	0.9931	0.9949	0.9935	0.9811	0.9992	0.9659	0.9801
ibm.watson	0.9931	0.9950	0.9950	0.9822	0.9996	0.9643	0.9750
microsoft.luis	0.9943	0.9935	0.9925	0.9815	0.9988	0.9620	0.9749
wit.ai	0.9877	0.9913	0.9921	0.9766	0.9977	0.9458	0.9673
snips.ai	0.9873	0.9921	0.9939	0.9729	0.9985	0.9455	0.9613
recast.ai	0.9894	0.9943	0.9910	0.9660	0.9981	0.9424	0.9539
amazon.lex	0.9930	0.9862	0.9825	0.9709	0.9981	0.9427	0.9581
Shallow-and-wide CNN	0.9956	0.9973	0.9968	0.9871	0.9998	0.9752	0.9854

Goal-oriented bot ¶

Based on the Hybrid Code Networks (HCNs) architecture from Jason D. Williams, Kavosh Asadi, Geoffrey Zweig, Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning, 2017. It predicts responses in a goal-oriented dialog. The model is customizable: embeddings, the slot filler, and the intent classifier can be switched on and off on demand.

Available pre-trained models and their comparison with existing benchmarks:

Dataset	Lang	Model	Metric	Valid	Test	Downloads
DSTC 2 *	En	bot with slot filler	Turn Accuracy	0.521	0.529	400 Mb
DSTC 2 *		bot with slot filler & intents & attention		0.555	0.561	8.5 Gb
DSTC 2		Bordes and Weston (2016)		–	0.411	–
		Eric and Manning (2017)		–	0.480	–
		Perez and Liu (2016)		–	0.487	–
		Williams et al. (2017)		–	0.556	–

*: There were a few modifications to the original dataset.

Seq2seq goal-oriented bot ¶

The dialogue agent predicts responses in a goal-oriented dialog and can handle multiple domains (the pretrained bot supports calendar scheduling, weather information retrieval, and point-of-interest navigation). The model is end-to-end differentiable and does not need to explicitly model dialogue state or belief trackers.

Comparison of the DeepPavlov pretrained model with others:

Dataset	Lang	Model	Valid BLEU	Test BLEU	Downloads
Stanford Kvret	En	KvretNet	0.131	0.132	10 Gb
		KvretNet, Mihail Eric et al. (2017)	–	0.132	–
		CopyNet, Mihail Eric et al. (2017)	–	0.110	–
		Attn Seq2Seq, Mihail Eric et al. (2017)	–	0.102	–
		Rule-based, Mihail Eric et al. (2017)	–	0.066	–

Automatic spelling correction component ¶

Pipelines that use candidate search in a static dictionary and an ARPA language model to correct spelling errors.
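The two-stage idea (generate candidates from a dictionary, then let a language model choose) can be sketched as below. This is an illustrative simplification, not the DeepPavlov pipeline: candidates are limited to Damerau-Levenshtein distance 1, and a toy table of word counts (`COUNTS`, a hypothetical name) stands in for the ARPA n-gram language model, which would normally score candidates in sentence context.

```python
import string
from typing import Dict, Set


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> Set[str]:
    """All strings within Damerau-Levenshtein distance 1 of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    replaces = {left + c + right[1:] for left, right in splits if right for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | transposes | replaces | inserts


def correct(word: str, counts: Dict[str, int]) -> str:
    """Return the most frequent dictionary word within one edit of `word`."""
    if word in counts:
        return word  # already a known word
    candidates = {c for c in edits1(word) if c in counts}
    if not candidates:
        return word  # nothing nearby in the dictionary; leave unchanged
    # Stand-in for language-model scoring: prefer the most frequent candidate.
    return max(candidates, key=lambda c: counts[c])


# Toy dictionary with counts (assumed values, for illustration only).
COUNTS = {"the": 100, "then": 30, "they": 20}
print(correct("thez", COUNTS))  # → the
```

Here "thez" has three dictionary candidates ("the", "then", "they"), and the frequency table breaks the tie, which is the role the language model plays in the real pipelines.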

Note

About 4.4 GB of disk space is required for the Russian language model and about 7 GB for the English one.

Comparison on the test set for the SpellRuEval competition on Automatic Spelling Correction for Russian:

Correction method	Precision	Recall	F-measure	Speed (sentences/s)
Yandex.Speller	83.09	59.86	69.59
Damerau Levenshtein 1 + lm	53.26	53.74	53.50	29.3
Brill Moore top 4 + lm	51.92	53.94	52.91	0.6
Hunspell + lm	41.03	48.89	44.61	2.1
JamSpell	44.57	35.69	39.64	136.2
Brill Moore top 1	41.29	37.26	39.17	2.4
Hunspell	30.30	34.02	32.06	20.3

Ranking component ¶

The main neural ranking model is based on LSTM-based deep learning models for non-factoid answer selection. The model ranks responses or contexts from a database by their relevance to the given context.

Three alternative neural architectures are also available:

Sequential Matching Network (SMN): based on the work of Wu, Yu, et al. “Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots”. ACL. 2017.
Deep Attention Matching Network (DAM): based on the work of Xiangyang Zhou, et al. “Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network”. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
Deep Attention Matching Network + Universal Sentence Encoder v3 (DAM-USE-T): our newly proposed architecture, based on the works of Xiangyang Zhou, et al. “Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network”. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, and Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, Ray Kurzweil. 2018a. Universal Sentence Encoder for English.

Available pre-trained models for ranking:

Dataset	Model config	Val R10@1	Test R10@1	Test R10@2	Test R10@5	Downloads
InsuranceQA v1	ranking_insurance_interact	72.0	72.2	–	–	8374 MB
Ubuntu V2	ranking_ubuntu_v2_mt_word2vec_dam_transformer	74.32	74.46	86.77	97.38	2457 MB
Ubuntu V2	ranking_ubuntu_v2_mt_word2vec_dam	71.20	71.54	83.66	96.33	1645 MB
Ubuntu V2	ranking_ubuntu_v2_mt_word2vec_smn	68.56	67.91	81.49	95.63	1609 MB
Ubuntu V2	ranking_ubuntu_v2_bert_uncased	66.5	66.6	–	–	396 MB
Ubuntu V2	ranking_ubuntu_v2_bert_sep	66.5	66.5	–	–	396 MB
Ubuntu V2	ranking_ubuntu_v2_interact	52.9	52.4	–	–	8913 MB
Ubuntu V2	ranking_ubuntu_v2_mt_interact	59.2	58.7	–	–	8906 MB
Ubuntu V1	ranking_ubuntu_v1_mt_word2vec_dam_transformer	–	79.57	89.32	97.34	2439 MB
Ubuntu V1	ranking_ubuntu_v1_mt_word2vec_dam	–	77.95	88.07	97.06	1645 MB
Ubuntu V1	ranking_ubuntu_v1_mt_word2vec_smn	–	75.90	87.16	96.80	1591 MB
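The R10@k columns in the ranking table report recall at k among 10 candidates: the share of samples for which the correct response is ranked within the top k of 10 candidate responses. A minimal sketch of this metric follows; the function name `recall_at_k` and the convention that the correct candidate sits at index 0 are assumptions made for this example.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]], k: int, true_index: int = 0) -> float:
    """Fraction of samples whose true candidate is ranked within the top k.

    `scores[n][i]` is the model score of candidate i for sample n; the true
    candidate is assumed to be at position `true_index` in every sample.
    """
    hits = 0
    for sample_scores in scores:
        # Candidate indices ordered from highest to lowest score.
        ranking = sorted(range(len(sample_scores)),
                         key=lambda i: sample_scores[i], reverse=True)
        if true_index in ranking[:k]:
            hits += 1
    return hits / len(scores)


# Two samples with 10 candidates each; the true response is candidate 0.
scores = [
    [0.90, 0.10, 0.20, 0.30, 0.05, 0.40, 0.15, 0.25, 0.35, 0.45],  # true ranked 1st
    [0.50, 0.70, 0.20, 0.30, 0.05, 0.40, 0.15, 0.25, 0.35, 0.45],  # true ranked 2nd
]
print(recall_at_k(scores, k=1))  # → 0.5
print(recall_at_k(scores, k=2))  # → 1.0
```

The tables report these values either as percentages (for example 74.46) or as fractions (for example 0.7446).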

Available pre-trained models for paraphrase identification:

Dataset	Model config	Val (accuracy)	Test (accuracy)	Val (F1)	Test (F1)	Val (log_loss)	Test (log_loss)	Downloads
paraphraser.ru	paraphrase_ident_paraphraser	83.8	75.4	87.9	80.9	0.468	0.616	5938M
paraphraser.ru	paraphrase_ident_paraphraser	82.7	76.0	87.3	81.4	0.391	0.510	5938M
paraphraser.ru	paraphrase_ident_paraphraser_tune	82.9	76.7	87.3	82.0	0.392	0.479	5938M
paraphraser.ru	paraphrase_bert	87.4	79.3	90.2	83.4	–	–	1330M
Quora Question Pairs	paraphrase_ident_qqp	87.1	87.0	83.0	82.6	0.300	0.305	8134M
Quora Question Pairs	paraphrase_ident_qqp	87.7	87.5	84.0	83.8	0.287	0.298	8136M

Comparison with other models on the InsuranceQA V1:

Model	Validation (Recall@1)	Test1 (Recall@1)
Architecture II (HLQA(200) CNNQA(4000) 1-MaxPooling Tanh)	61.8	62.8
QA-LSTM basic-model(max pooling)	64.3	63.1
ranking_insurance	72.0	72.2

Comparison with other models on the Ubuntu Dialogue Corpus v1 (test):

Model	R@1	R@2	R@5
SMN last [Wu et al., 2017]	0.723	0.842	0.956
SMN last [DeepPavlov ranking_ubuntu_v1_mt_word2vec_smn]	0.754	0.869	0.967
DAM [Zhou et al., 2018]	0.767	0.874	0.969
DAM [DeepPavlov ranking_ubuntu_v1_mt_word2vec_dam]	0.779	0.880	0.970
MRFN-FLS [Tao et al., 2019]	0.786	0.886	0.976
IMN [Gu et al., 2019]	0.777	0.880	0.974
IMN Ensemble [Gu et al., 2019]	0.794	0.893	0.978
DAM-USE-T [DeepPavlov ranking_ubuntu_v1_mt_word2vec_dam_transformer]	0.7957	0.8932	0.9734

Comparison with other models on the Ubuntu Dialogue Corpus v2 (test):

Model	R@1	R@2	R@5
SMN last [Wu et al., 2017]	–	–	–
SMN last [DeepPavlov ranking_ubuntu_v2_mt_word2vec_smn]	0.6791	0.8149	0.9563
DAM [Zhou et al., 2018]	–	–	–
DAM [DeepPavlov ranking_ubuntu_v2_mt_word2vec_dam]	0.7154	0.8366	0.9633
MRFN-FLS [Tao et al., 2019]	–	–	–
IMN [Gu et al., 2019]	0.771	0.886	0.979
IMN Ensemble [Gu et al., 2019]	0.791	0.899	0.982
DAM-USE-T [DeepPavlov ranking_ubuntu_v2_mt_word2vec_dam_transformer]	0.7446	0.8677	0.9738

References:

Yu Wu, Wei Wu, Ming Zhou, and Zhoujun Li. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots. In ACL, pages 372–381. https://www.aclweb.org/anthology/P17-1046
Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu and Hua Wu. 2018. Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1118-1127, ACL. http://aclweb.org/anthology/P18-1103
Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and Rui Yan. 2019. Multi-Representation Fusion Network for Multi-turn Response Selection in Retrieval-based Chatbots. In WSDM '19. https://dl.acm.org/citation.cfm?id=3290985
Jia-Chen Gu, Zhen-Hua Ling, and Quan Liu. 2019. Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots. https://arxiv.org/abs/1901.01824

TF-IDF Ranker component ¶

Based on Reading Wikipedia to Answer Open-Domain Questions. The model solves the task of document retrieval for a given query.
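Document retrieval with TF-IDF can be sketched in a few lines. The example below is a simplified illustration and not the DeepPavlov ranker: it uses whitespace tokenization, unigram features, a smoothed IDF, and cosine similarity, whereas the referenced paper works with hashed n-gram features over a full Wikipedia dump. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def build_index(docs: List[str]) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    """Compute IDF weights and a sparse TF-IDF vector for every document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return idf, vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rank(query: str, idf: Dict[str, float], vectors: List[Dict[str, float]], top_n: int = 5) -> List[int]:
    """Return indices of up to `top_n` documents with non-zero similarity, best first."""
    counts = Counter(query.lower().split())
    q_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [i for score, i in scored[:top_n] if score > 0]


docs = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Moon orbits the Earth",
]
idf, vectors = build_index(docs)
print(rank("capital of France", idf, vectors))  # → [0, 1]
```

Recall@5 in the table below measures how often a document containing the answer appears among the top 5 documents returned for a question.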

Dataset	Model	Wiki dump	Recall@5	Downloads
SQuAD-v1.1	doc_retrieval	enwiki (2018-02-11)	75.6	33 GB

Question Answering component ¶

Models in this section find the answer to a question in a given context (SQuAD task format). DeepPavlov has two models for this task: BERT-based and R-Net. Both models predict the start and end positions of the answer in the given context.

The BERT-based model is described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

The R-Net model is based on R-NET: Machine Reading Comprehension with Self-matching Networks.

Dataset	Model config	Lang	EM (dev)	F-1 (dev)	Downloads
SQuAD-v1.1	DeepPavlov BERT	en	80.88	88.49	806Mb
SQuAD-v1.1	DeepPavlov R-Net	en	71.49	80.34	~2.5Gb
SDSJ Task B	DeepPavlov RuBERT	ru	66.30±0.24	84.60±0.11	1325Mb
SDSJ Task B	DeepPavlov multilingual BERT	ru	64.35±0.39	83.39±0.08	1323Mb
SDSJ Task B	DeepPavlov R-Net	ru	60.62	80.04	~5Gb

For cases where the answer is not necessarily present in the given context, there is the squad_noans model. This model outputs an empty string if the context contains no answer.

Morphological tagging component ¶

Based on the character-based approach to morphological tagging from Heigold et al., 2017, An extensive empirical evaluation of character-based morphological tagging for 14 languages. It is a state-of-the-art model for Russian and several other languages. The model takes tokenized sentences as input and outputs the corresponding sequence of morphological labels in UD format. The table below contains word and sentence accuracy on UD2.0 datasets. For more scores, see the full table.

Dataset	Model	Word accuracy	Sent. accuracy	Download size (MB)
UD2.0 (Russian)	Pymorphy + russian_tagsets (first tag)	60.93	0.00
	UD Pipe 1.2 (Straka et al., 2017)	93.57	43.04
	Basic model	95.17	50.58	48.7
	Pymorphy-enhanced model	96.23	58.00	48.7
UD2.0 (Czech)	UD Pipe 1.2 (Straka et al., 2017)	91.86	42.28
UD2.0 (Czech)	Basic model	94.35	51.56	41.8
UD2.0 (English)	UD Pipe 1.2 (Straka et al., 2017)	92.89	55.75
UD2.0 (English)	Basic model	93.00	55.18	16.9
UD2.0 (German)	UD Pipe 1.2 (Straka et al., 2017)	76.65	10.24
UD2.0 (German)	Basic model	83.83	15.25	18.6

Frequently Asked Questions (FAQ) component ¶

A set of pipelines for the FAQ task: classifying an incoming question into a set of known questions and returning a prepared answer. You can build different pipelines based on tf-idf, weighted fastText, cosine similarity, and logistic regression.

Skills¶

eCommerce bot ¶

The eCommerce bot retrieves product items from a catalog in sorted order. In addition, it asks the user for additional information to narrow the search.

Note

About 130 Mb of disk space is required for the eCommerce bot with the TfIdf-based ranker and 500 Mb for the BLEU-based ranker.

ODQA ¶

An open domain question answering skill. The skill accepts free-form questions about the world and outputs an answer based on its Wikipedia knowledge.

Dataset	Model config	Wiki dump	F1	Downloads
SQuAD-v1.1	ODQA	enwiki (2018-02-11)	35.89	9.7Gb
SQuAD-v1.1	ODQA	enwiki (2016-12-21)	37.83	9.3Gb
SDSJ Task B	ODQA	ruwiki (2018-04-01)	28.56	7.7Gb
SDSJ Task B	ODQA with RuBERT	ruwiki (2018-04-01)	37.83	4.3Gb

AutoML¶

Hyperparameters optimization ¶

Hyperparameter optimization (by either cross-validation or neural evolution) for DeepPavlov models that requires only small changes to a config file.

Embeddings¶

Pre-trained embeddings for the Russian language ¶

Word vectors for the Russian language trained on joint Russian Wikipedia and Lenta.ru corpora.

Examples of some components¶

Run goal-oriented bot with Telegram interface:

python -m deeppavlov interactbot deeppavlov/configs/go_bot/gobot_dstc2.json -d -t <TELEGRAM_TOKEN>

Run goal-oriented bot with console interface:

python -m deeppavlov interact deeppavlov/configs/go_bot/gobot_dstc2.json -d

Run goal-oriented bot with REST API:

python -m deeppavlov riseapi deeppavlov/configs/go_bot/gobot_dstc2.json -d

Run slot-filling model with Telegram interface:

python -m deeppavlov interactbot deeppavlov/configs/ner/slotfill_dstc2.json -d -t <TELEGRAM_TOKEN>

Run slot-filling model with console interface:

python -m deeppavlov interact deeppavlov/configs/ner/slotfill_dstc2.json -d

Run slot-filling model with REST API:

python -m deeppavlov riseapi deeppavlov/configs/ner/slotfill_dstc2.json -d

Predict intents on every line in a file:

python -m deeppavlov predict deeppavlov/configs/classifiers/intents_snips.json -d --batch-size 15 < /data/in.txt > /data/out.txt

View video demo of deployment of a goal-oriented bot and a slot-filling model with Telegram UI.