dataset_readers
Concrete DatasetReader classes.
class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader
Class that reads datasets in .csv format.
read(data_path: str, url: Optional[str] = None, format: str = 'csv', class_sep: Optional[str] = None, *args, **kwargs) → dict
Reads a dataset from the data_path directory. The files read are all data_types plus the extension (i.e. for data_types=["train", "valid"], the files "train.csv" and "valid.csv" from data_path will be read).
- Parameters
data_path – directory with files
url – URL to download the data files from if data_path does not exist or is empty
format – extension of the files; one of "csv", "json"
class_sep – string separator of labels in the column with labels; default None -> only one class per sample
sep (str) – delimiter for "csv" files
header (int) – row number to use as the column names
names (array) – list of column names to use
orient (str) – indication of expected JSON string format
lines (boolean) – read the file as one JSON object per line; default False
- Returns
dictionary with types from data_types. Each field of the dictionary is a list of tuples (x_i, y_i)
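The shape of the returned dictionary can be illustrated with a stdlib-only sketch (a toy reimplementation for a single "train" file, not DeepPavlov's actual code, which delegates the parsing to pandas):

```python
import csv
import io

# A toy "train.csv" with a text column "x" and a label column "y", where
# multiple labels per sample would be separated by class_sep (here ",").
train_csv = io.StringIO("x,y\ngreat movie,positive\nawful plot,negative\n")

class_sep = ","
reader = csv.DictReader(train_csv)
# Each sample becomes an (x_i, y_i) tuple; y_i is a list of labels.
data = {"train": [(row["x"], row["y"].split(class_sep)) for row in reader]}

print(data["train"][0])  # ('great movie', ['positive'])
```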
class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader
Class to read training datasets in CoNLL-2003 format.
class deeppavlov.dataset_readers.dstc2_reader.DSTC2DatasetReader
Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).
The following modifications were made to the original dataset:
- added api calls to the restaurant database
  example: {"text": "api_call area=\"south\" food=\"dontcare\" pricerange=\"cheap\"", "dialog_acts": ["api_call"]}
- new actions
  bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})
  if a slot key was associated with the dialog action, the new act is a concatenation of the act and the slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})
- new train/dev/test split
  the original DSTC2 consisted of three different MDP policies; the original train and dev datasets (covering two of the policies) were merged and randomly split into train/dev/test
- minor fixes
  fixed several dialogs where actions were wrongly annotated
  uppercased the first letter of bot responses
  unified punctuation of bot responses
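The act-merging rules above can be sketched as a small hypothetical helper (`merge_acts` is illustrative only; it is not part of the DeepPavlov API):

```python
# Hypothetical helper reproducing the act-merging rules described above:
# multiple acts become one underscore-joined act, and an act with an
# associated slot key becomes "<act>_<slot>".
def merge_acts(turn: dict) -> dict:
    acts = turn["dialog_acts"]
    slots = turn.get("slot_vals", [])
    if len(acts) > 1:
        return {"dialog_acts": ["_".join(acts)]}
    if slots:
        return {"dialog_acts": [f"{acts[0]}_{slots[0]}"]}
    return {"dialog_acts": acts}

print(merge_acts({"dialog_acts": ["ask", "request"]}))
# {'dialog_acts': ['ask_request']}
print(merge_acts({"dialog_acts": ["ask"], "slot_vals": ["area"]}))
# {'dialog_acts': ['ask_area']}
```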
classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List]
Downloads the 'dstc2_v2.tar.gz' archive from the ipavlov internal server, decompresses it and saves the files to data_path.
- Parameters
data_path – path to save the DSTC2 dataset
dialogs – flag indicating whether to output a list of turns or a list of dialogs
- Returns
dictionary that contains a 'train' field with dialogs from 'dstc2-trn.jsonlist', a 'valid' field with dialogs from 'dstc2-val.jsonlist' and a 'test' field with dialogs from 'dstc2-tst.jsonlist'. Each field is a list of tuples (x_i, y_i).
class deeppavlov.dataset_readers.dstc2_reader.SimpleDSTC2DatasetReader
Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).
The following modifications were made to the original dataset:
- added api calls to the restaurant database
  example: {"text": "api_call area=\"south\" food=\"dontcare\" pricerange=\"cheap\"", "dialog_acts": ["api_call"]}
- new actions
  bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})
  if a slot key was associated with the dialog action, the new act is a concatenation of the act and the slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})
- new train/dev/test split
  the original DSTC2 consisted of three different MDP policies; the original train and dev datasets (covering two of the policies) were merged and randomly split into train/dev/test
- minor fixes
  fixed several dialogs where actions were wrongly annotated
  uppercased the first letter of bot responses
  unified punctuation of bot responses
classmethod read(data_path: str, dialogs: bool = False, encoding='utf-8') → Dict[str, List]
Downloads the 'simple_dstc2.tar.gz' archive from the internet, decompresses it and saves the files to data_path.
- Parameters
data_path – path to save the DSTC2 dataset
dialogs – flag indicating whether to output a list of turns or a list of dialogs
- Returns
dictionary that contains a 'train' field with dialogs from 'simple-dstc2-trn.json', a 'valid' field with dialogs from 'simple-dstc2-val.json' and a 'test' field with dialogs from 'simple-dstc2-tst.json'. Each field is a list of tuples (user turn, system turn).
class deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge(domain_knowledge_di: Dict)
A DTO-like class that stores the domain knowledge from the domain yaml config.

classmethod from_yaml(domain_yml_fpath: Union[str, pathlib.Path] = 'domain.yml')
Parses the domain.yml domain config file into a DomainKnowledge object.
- Parameters
domain_yml_fpath – path to the domain config file, defaults to domain.yml
- Returns
the loaded DomainKnowledge object
class deeppavlov.dataset_readers.md_yaml_dialogs_reader.MD_YAML_DialogsDatasetReader
Reads dialogs from a dataset composed of stories.md, nlu.md and domain.yml.
stories.md provides the dialogue dataset for the model to train on. The dialogues are represented as labels of user messages and labels of system responses (not texts, just action labels). This separates the NLU-NLG tasks from the actual dialogue storytelling: one should be able to describe just the scripts of dialogues to the system.
nlu.md, by contrast, provides the NLU training set irrespective of the dialogue scripts.
domain.yml describes the task-specific domain and serves two purposes: it provides the NLG templates and some specific configuration of the NLU.
classmethod augment_form(form_name: str, domain_knowledge: deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge, intent2slots2text: Dict) → List[str]
Replaces the form mention in stories.md with the actual turns relevant to the form.
- Parameters
form_name – the name of the form to generate turns for
domain_knowledge – the domain knowledge (see domain.yml in RASA) relevant to the processed config
intent2slots2text – the mapping of intents and particular slots onto text
- Returns
the story turns relevant to the passed form
classmethod augment_slot(known_responses: List[str], known_intents: List[str], slot_name: str, form_name: str) → List[str]
Given the slot name, generates a sequence of a system turn asking for the slot and a user turn providing it.
- Parameters
known_responses – responses known to the system from domain.yml
known_intents – intents known to the system from domain.yml
slot_name – the name of the slot to augment for
form_name – the name of the form for which the turn is augmented
- Returns
the list of stories.md-like turns
classmethod augment_user_turn(intent2slots2text, line: str, slot_name2text2value) → List[Dict[str, Any]]
Given the turn information, generates all the possible stories representing it.
- Parameters
intent2slots2text – the mapping of intents and slots onto natural language utterances known to the system
line – the line representing the user utterance in stories.md format
slot_name2text2value – the mapping of slot names onto values known to the system
- Returns
the batch of all the possible dstc2 representations of the passed intent
classmethod get_augmented_ask_intent_utter(known_intents: List[str], slot_name: str) → Optional[str]
If the system knows the inform_{slot} intent, returns this intent name, otherwise returns None.
- Parameters
known_intents – intents known to the system
slot_name – the slot to look the inform intent for
- Returns
the slot-informing intent or None
classmethod get_augmented_ask_slot_utter(form_name: str, known_responses: List[str], slot_name: str)
If the system knows the ask_{slot} action, returns this action name, otherwise returns None.
- Parameters
form_name – the name of the currently processed form
known_responses – actions known to the system
slot_name – the slot to look the asking action for
- Returns
the slot-asking action or None
classmethod get_last_users_turn(curr_story_utters: List[Dict]) → Dict
Given the dstc2 story, returns the last user utterance from it.
- Parameters
curr_story_utters – the dstc2-formatted story
- Returns
the last user utterance from the passed story
classmethod parse_system_turn(domain_knowledge: deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge, line: str) → Dict
Given a RASA stories.md line, returns the dstc2-formatted json (dict) for this line.
- Parameters
domain_knowledge – the domain knowledge relevant to the processed stories config (from which the line is taken)
line – the line from stories.md representing the system step of a story
- Returns
the dstc2-formatted passed turn
classmethod read(data_path: str, dialogs: bool = False, ignore_slots: bool = False) → Dict[str, List]
- Parameters
data_path – path to read the dataset from
dialogs – flag indicating whether to output a list of turns or a list of dialogs
ignore_slots – whether to ignore the slots information provided in stories.md
- Returns
dictionary that contains a 'train' field with dialogs from 'stories-trn.md', a 'valid' field with dialogs from 'stories-val.md' and a 'test' field with dialogs from 'stories-tst.md'. Each field is a list of tuples (x_i, y_i).
class deeppavlov.dataset_readers.faq_reader.FaqDatasetReader
Reader for FAQ datasets.

read(data_path: Optional[str] = None, data_url: Optional[str] = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict
Reads an FAQ dataset from the specified csv file or remote url.
- Parameters
data_path – path to the csv file of the FAQ
data_url – url of the csv file of the FAQ
x_col_name – name of the Question column in the csv file
y_col_name – name of the Answer column in the csv file
- Returns
A dictionary containing training, validation and test parts of the dataset, obtainable via the train, valid and test keys.
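The csv layout the reader expects can be illustrated with a stdlib-only sketch (a toy reimplementation, not DeepPavlov's actual code; here all rows go into the train part, while how the real reader distributes rows across parts is not shown):

```python
import csv
import io

# Toy FAQ file with custom column names, mirroring x_col_name / y_col_name.
faq_csv = io.StringIO(
    "Question,Answer\n"
    "How do I install?,Run pip install deeppavlov\n"
)

x_col_name, y_col_name = "Question", "Answer"
rows = list(csv.DictReader(faq_csv))
train = [(row[x_col_name], row[y_col_name]) for row in rows]
dataset = {"train": train, "valid": [], "test": []}

print(dataset["train"][0][0])  # How do I install?
```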
class deeppavlov.dataset_readers.file_paths_reader.FilePathsReader
Finds all file paths matching a glob under a data path.

read(data_path: Union[str, pathlib.Path], train: Optional[str] = None, valid: Optional[str] = None, test: Optional[str] = None, *args, **kwargs) → Dict
Finds all file paths matching the given globs.
- Parameters
data_path – directory with data
train – data path glob relative to data_path
valid – data path glob relative to data_path
test – data path glob relative to data_path
- Returns
A dictionary containing training, validation and test parts of the dataset, obtainable via the train, valid and test keys.
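The glob-based lookup can be sketched with pathlib alone (a stdlib illustration of the behavior described above, not the library's actual code):

```python
import tempfile
from pathlib import Path

# Build a toy data directory and collect file paths with globs, the way
# the train/valid/test glob arguments are described above.
with tempfile.TemporaryDirectory() as d:
    data_path = Path(d)
    for name in ("a.train.txt", "b.train.txt", "a.valid.txt"):
        (data_path / name).touch()

    dataset = {
        "train": sorted(data_path.glob("*.train.txt")),
        "valid": sorted(data_path.glob("*.valid.txt")),
        "test": sorted(data_path.glob("*.test.txt")),
    }
    print([p.name for p in dataset["train"]])  # ['a.train.txt', 'b.train.txt']
```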
class deeppavlov.dataset_readers.kvret_reader.KvretDatasetReader
A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.
Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases, ensuring that they are versatile in their natural language without being completely free form.
For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.
classmethod read(data_path: str, dialogs: bool = False) → Dict[str, List]
Downloads 'kvret_public.tar.gz', decompresses it and saves the files to data_path.
- Parameters
data_path – path to save the data
dialogs – flag indicating whether to output a list of turns or a list of dialogs
- Returns
dictionary with 'train' containing dialogs from 'kvret_train_public.json', 'valid' containing dialogs from 'kvret_valid_public.json' and 'test' containing dialogs from 'kvret_test_public.json'. Each field is a list of tuples (x_i, y_i).
class deeppavlov.dataset_readers.morphotagging_dataset_reader.MorphotaggerDatasetReader
Class to read training datasets in UD format.

read(data_path: Union[List, str], language: Optional[str] = None, data_types: Optional[List[str]] = None, **kwargs) → Dict[str, List]
Reads a UD dataset from data_path.
- Parameters
data_path – can be either 1. a directory containing the files (the file for data_type 'mode' is then data_path / {language}-ud-{mode}.conllu) or 2. a list of files containing the same number of items as data_types
language – a language used to detect the filename when it is not given
data_types – which dataset parts among 'train', 'dev', 'test' are returned
- Returns
a dictionary containing dataset fragments (see read_infile) for the given data types
deeppavlov.dataset_readers.morphotagging_dataset_reader.get_language(filepath: str) → str
Extracts the language from a typical UD filename.
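Given the {language}-ud-{mode}.conllu naming convention above, the extraction can be sketched in a couple of lines (a minimal illustrative reimplementation, not the library's exact code):

```python
from pathlib import Path

def get_language_sketch(filepath: str) -> str:
    # UD filenames look like "<language>-ud-<mode>.conllu";
    # the language is everything before the first "-".
    return Path(filepath).stem.split("-")[0]

print(get_language_sketch("data/en-ud-train.conllu"))  # en
```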
deeppavlov.dataset_readers.morphotagging_dataset_reader.read_infile(infile: Union[pathlib.Path, str], *, from_words=False, word_column: int = 1, pos_column: int = 3, tag_column: int = 5, head_column: int = 6, dep_column: int = 7, max_sents: int = -1, read_only_words: bool = False, read_syntax: bool = False) → List[Tuple[List, Optional[List]]]
Reads an input file in CoNLL-U format.
- Parameters
infile – a path to a file
word_column – column containing words (default=1)
pos_column – column containing part-of-speech labels (default=3)
tag_column – column containing fine-grained tags (default=5)
head_column – column containing syntactic head positions (default=6)
dep_column – column containing syntactic dependency labels (default=7)
max_sents – maximal number of sentences to read
read_only_words – whether to read only words
read_syntax – whether to return heads and deps alongside tags; ignored if read_only_words is True
- Returns
a list of sentences. Each item contains a word sequence and an output sequence. The output sequence is None if read_only_words is True, a single list of word tags if read_syntax is False, and a list of the form [tags, heads, deps] if read_syntax is True.
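The core parsing loop can be sketched with the stdlib alone (a simplified illustration of the column logic above that keeps only words and POS labels; the real reader also handles the fine-grained tag, head and dep columns and combines POS with morphological features):

```python
# Minimal sketch of the CoNLL-U parsing logic: split lines into tab-separated
# columns, take words from column 1 and POS labels from column 3, and start a
# new sentence at every blank line. Column indices follow the defaults above.
conllu = """\
1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\tNN\t_\t3\tnsubj\t_\t_
3\tsleeps\tsleep\tVERB\tVBZ\t_\t0\troot\t_\t_

1\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\t_
"""

def read_infile_sketch(text, word_column=1, pos_column=3):
    sentences, words, tags = [], [], []
    for line in text.splitlines():
        if not line.strip():               # blank line ends a sentence
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        if line.startswith("#"):           # skip comment lines
            continue
        columns = line.split("\t")
        words.append(columns[word_column])
        tags.append(columns[pos_column])
    if words:                              # flush the last sentence
        sentences.append((words, tags))
    return sentences

print(read_infile_sketch(conllu)[0][0])  # ['The', 'cat', 'sleeps']
```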
class deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader
The class to read the paraphraser.ru dataset from files.
Please see https://paraphraser.ru.
class deeppavlov.dataset_readers.siamese_reader.SiameseReader
The class to read datasets for ranking or paraphrase identification with Siamese networks.
class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader
Downloads dataset files and prepares the train/valid split.
SQuAD: Stanford Question Answering Dataset, https://rajpurkar.github.io/SQuAD-explorer/
SberSQuAD: dataset from SDSJ Task B, https://www.sdsj.ru/ru/contest.html
MultiSQuAD: SQuAD dataset with additional contexts retrieved (by tf-idf) from the original Wikipedia article.
MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by the tf-idf document ranker from the full Wikipedia.
read(dir_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]]
- Parameters
dir_path – path to save the data
dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'
url – link to an archive with the dataset; use the url argument if a non-default dataset is used
- Returns
dataset split into train/valid
- Raises
RuntimeError – if dataset is not one of 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
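The SQuAD JSON layout the reader consumes nests articles, paragraphs and question-answer pairs; a toy example of that structure, flattened with plain comprehensions (illustrative data, not part of the dataset):

```python
# Toy example of the SQuAD JSON layout: articles hold paragraphs, and each
# paragraph holds a context plus question/answer pairs with answer offsets.
squad = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "DeepPavlov is an open-source NLP library.",
            "qas": [{
                "question": "What is DeepPavlov?",
                "id": "q1",
                "answers": [{"text": "an open-source NLP library",
                             "answer_start": 14}],
            }],
        }],
    }],
}

# Flatten into (context, question, answer) triples.
triples = [
    (p["context"], qa["question"], qa["answers"][0]["text"])
    for article in squad["data"]
    for p in article["paragraphs"]
    for qa in p["qas"]
]
print(triples[0][1])  # What is DeepPavlov?
```

Note that answer_start is a character offset into context, so the answer text can always be recovered as a slice of the context.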
class deeppavlov.dataset_readers.typos_reader.TyposCustom
Base class for reading spelling corrections dataset files.

static build(data_path: str) → pathlib.Path
Base method that interprets the data_path argument.
- Parameters
data_path – path to the tsv-file containing erroneous and corrected words
- Returns
the same path as a Path object
class deeppavlov.dataset_readers.typos_reader.TyposKartaslov
Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov.

static build(data_path: str) → pathlib.Path
Downloads the misspellings list from GitHub.
- Parameters
data_path – target directory to download the data to
- Returns
path to the resulting csv-file
class deeppavlov.dataset_readers.typos_reader.TyposWikipedia
Implementation of TyposCustom that works with English Wikipedia's list of common misspellings.

static build(data_path: str) → pathlib.Path
Downloads and parses the common misspellings list from Wikipedia.
- Parameters
data_path – target directory to download the data to
- Returns
path to the resulting tsv-file
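Wikipedia's list uses "misspelling->correction" lines, with multiple corrections separated by commas; the parsing step can be sketched as follows (toy input lines in that format, not actual entries from the list, and a simplified version of what the reader does):

```python
# Toy lines in the Wikipedia misspellings-list format; an entry may offer
# several corrections separated by ", ".
raw = """\
abandonned->abandoned
agre->agree, agra
"""

# Expand each line into (erroneous word, correction) rows, one per correction.
rows = []
for line in raw.splitlines():
    wrong, corrections = line.split("->")
    for correct in corrections.split(", "):
        rows.append((wrong, correct))

print(rows[0])  # ('abandonned', 'abandoned')
```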
class deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader
The class to read the Ubuntu V2 dataset from csv files.
Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.

read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]]
Reads the Ubuntu V2 dataset from csv files.
- Parameters
data_path – a path to a folder with the dataset csv files
positive_samples – if True, only positive context-response pairs will be taken for train
class deeppavlov.dataset_readers.ubuntu_v2_mt_reader.UbuntuV2MTReader
The class to read the Ubuntu V2 dataset from csv files taking into account a multi-turn dialogue context.
Please see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.
- Parameters
data_path – a path to a folder with the dataset csv files
num_context_turns – a maximum number of dialogue context turns
padding – "post" or "pre" padding of context sentences

read(data_path: str, num_context_turns: int = 1, padding: str = 'post', *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]]
Reads the Ubuntu V2 dataset from csv files taking into account a multi-turn dialogue context.
- Parameters
data_path – a path to a folder with the dataset csv files
num_context_turns – a maximum number of dialogue context turns
padding – "post" or "pre" padding of context sentences
- Returns
Dictionary with the keys "train", "valid", "test" and the corresponding parts of the dataset as values
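The effect of the padding parameter can be sketched with a hypothetical helper (`pad_context` is illustrative only, assuming empty strings as padding; it is not the library's actual code):

```python
# Sketch of "pre" vs "post" padding of dialogue context turns up to a fixed
# num_context_turns, as described for the padding parameter above.
def pad_context(turns, num_context_turns, padding="post"):
    turns = turns[-num_context_turns:]          # keep the most recent turns
    pads = [""] * (num_context_turns - len(turns))
    return turns + pads if padding == "post" else pads + turns

context = ["hi", "hello"]
print(pad_context(context, 4, padding="post"))  # ['hi', 'hello', '', '']
print(pad_context(context, 4, padding="pre"))   # ['', '', 'hi', 'hello']
```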