dataset_readers¶
Concrete DatasetReader classes.
-
class
deeppavlov.dataset_readers.basic_classification_reader.
BasicClassificationDatasetReader
[source]¶ Reads a dataset in .csv format.
-
read
(data_path: str, url: str = None, format: str = 'csv', class_sep: str = ', ', *args, **kwargs) → dict[source]¶ Read a dataset from the data_path directory. The files read are named data_type + extension (e.g. for data_types=["train", "valid"], the files "train.csv" and "valid.csv" from data_path will be read).
Parameters: - data_path – directory with files
- url – URL to download data files from if data_path does not exist or is empty
- format – extension of files. Set of values: "csv", "json"
- class_sep – string separator of labels in the column with labels
- sep (str) – delimiter for "csv" files. Default: ","
- header (int) – row number to use as the column names
- names (array) – list of column names to use
- orient (str) – indication of expected JSON string format
- lines (boolean) – read the file as a JSON object per line. Default: False
Returns: dictionary with types from data_types. Each field of the dictionary is a list of tuples (x_i, y_i)
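The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the DeepPavlov implementation: the function name read_classification_csv, the column names "text" and "labels", and the choice to split labels on "," and strip whitespace are all assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import Dict, List, Sequence, Tuple


def read_classification_csv(
    data_path: str,
    data_types: Sequence[str] = ("train", "valid"),
    class_sep: str = ",",
    sep: str = ",",
) -> Dict[str, List[Tuple[str, List[str]]]]:
    """Hypothetical helper: read <data_type>.csv files into (x_i, y_i) tuples."""
    data: Dict[str, List[Tuple[str, List[str]]]] = {}
    for data_type in data_types:
        file_path = Path(data_path) / f"{data_type}.csv"
        samples: List[Tuple[str, List[str]]] = []
        if file_path.exists():
            with file_path.open(newline="", encoding="utf-8") as f:
                # Assumed column names: "text" (x) and "labels" (y).
                for row in csv.DictReader(f, delimiter=sep):
                    labels = [lab.strip() for lab in row["labels"].split(class_sep)]
                    samples.append((row["text"], labels))
        # A missing file yields an empty list for that data type.
        data[data_type] = samples
    return data
```

Each key of the returned dictionary corresponds to one entry of data_types, and every sample pairs the text with its list of labels.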
-
-
class
deeppavlov.dataset_readers.conll2003_reader.
Conll2003DatasetReader
[source]¶ Class to read training datasets in CONLL2003 format
-
class
deeppavlov.dataset_readers.dstc2_reader.
DSTC2DatasetReader
[source]¶ Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).
The following modifications were made to the original dataset:
- added api calls to restaurant database
  - example: {"text": "api_call area="south" food="dontcare" pricerange="cheap"", "dialog_acts": ["api_call"]}
- new actions
  - bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})
  - if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})
- new train/dev/test split
  - original dstc2 consisted of three different MDP policies; the original train and dev datasets (consisting of two policies) were merged and randomly split into train/dev/test
- minor fixes
  - fixed several dialogs where actions were wrongly annotated
  - uppercased first letter of bot responses
  - unified punctuation for bot responses
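The "new actions" modification can be illustrated with a small sketch. The function merge_dialog_acts is hypothetical and only reproduces the two documented examples; how the actual preprocessing ordered acts and slot keys in more complex turns is not stated here, so the simple "acts first, then slot keys" ordering is an assumption.

```python
from typing import Dict, List


def merge_dialog_acts(turn: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: join acts (and slot keys, if any) into one act."""
    # Assumed ordering: all dialog acts first, then all slot keys.
    parts = list(turn.get("dialog_acts", [])) + list(turn.get("slot_vals", []))
    return {"dialog_acts": ["_".join(parts)]}


merged_acts = merge_dialog_acts({"dialog_acts": ["ask", "request"]})
merged_slot = merge_dialog_acts({"dialog_acts": ["ask"], "slot_vals": ["area"]})
```

Here merged_acts holds the single act "ask_request" and merged_slot holds "ask_area", matching the examples in the list above.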
-
classmethod
read
(data_path: str, dialogs: bool = False) → Dict[str, List][source]¶ Downloads the 'dstc2_v2.tar.gz' archive from the ipavlov internal server, decompresses it and saves the files to data_path.
Parameters: - data_path – path to save DSTC2 dataset
- dialogs – flag which indicates whether to output list of turns or list of dialogs
Returns: dictionary that contains a 'train' field with dialogs from 'dstc2-trn.jsonlist', a 'valid' field with dialogs from 'dstc2-val.jsonlist' and a 'test' field with dialogs from 'dstc2-tst.jsonlist'. Each field is a list of tuples (x_i, y_i).
-
class
deeppavlov.dataset_readers.faq_reader.
FaqDatasetReader
[source]¶ Reader for FAQ dataset
-
read
(data_path: str = None, data_url: str = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict[source]¶ Read FAQ dataset from specified csv file or remote url
Parameters: - data_path – path to csv file of FAQ
- data_url – url to csv file of FAQ
- x_col_name – name of Question column in csv file
- y_col_name – name of Answer column in csv file
Returns: A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
-
-
class
deeppavlov.dataset_readers.insurance_reader.
InsuranceReader
[source]¶ The class to read the InsuranceQA V1 dataset from files.
Please, see https://github.com/shuzi/insuranceQA.
Parameters: data_path – A path to a folder with dataset files.
-
class
deeppavlov.dataset_readers.kvret_reader.
KvretDatasetReader
[source]¶ A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.
Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form.
For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.
-
classmethod
read
(data_path: str, dialogs: bool = False) → Dict[str, List][source]¶ Downloads 'kvrest_public.tar.gz', decompresses it and saves the files to data_path.
Parameters: - data_path – path to save data
- dialogs – flag which indicates whether to output list of turns or list of dialogs
Returns: dictionary with 'train' containing dialogs from 'kvret_train_public.json', 'valid' containing dialogs from 'kvret_valid_public.json' and 'test' containing dialogs from 'kvret_test_public.json'. Each field is a list of tuples (x_i, y_i).
-
class
deeppavlov.dataset_readers.morphotagging_dataset_reader.
MorphotaggerDatasetReader
[source]¶ Class to read training datasets in UD format
-
read
(data_path: Union[List, str], language: Optional[str] = None, data_types: Optional[List[str]] = None, **kwargs) → Dict[str, List][source]¶ Reads UD dataset from data_path.
Parameters: - data_path – can be either
  1. a directory containing files. The file for data_type 'mode' is then data_path / {language}-ud-{mode}.conllu
  2. a list of files, containing the same number of items as data_types
- language – a language to detect filename when it is not given
- data_types – which dataset parts among ‘train’, ‘dev’, ‘test’ are returned
Returns: a dictionary containing dataset fragments (see read_infile) for given data types
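The two accepted forms of data_path can be sketched as follows. This is an illustration under stated assumptions, not the library's code: resolve_ud_files is a hypothetical helper, and both the default data_types and the error raised when language is missing are choices made for the example.

```python
from pathlib import Path
from typing import Dict, List, Optional, Union


def resolve_ud_files(
    data_path: Union[List[str], str],
    language: Optional[str] = None,
    data_types: Optional[List[str]] = None,
) -> Dict[str, Path]:
    """Hypothetical helper: map each data type to the file it is read from."""
    if data_types is None:
        data_types = ["train", "dev"]  # assumed default for this sketch
    if isinstance(data_path, list):
        # Case 2: an explicit list of files, one per data type.
        if len(data_path) != len(data_types):
            raise ValueError("data_path and data_types must have equal length")
        return {mode: Path(p) for mode, p in zip(data_types, data_path)}
    # Case 1: a directory; file names follow {language}-ud-{mode}.conllu.
    if language is None:
        raise ValueError("language is required when data_path is a directory")
    return {
        mode: Path(data_path) / f"{language}-ud-{mode}.conllu"
        for mode in data_types
    }
```

For a directory "data" and language "en", the 'train' entry resolves to data/en-ud-train.conllu.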
-
-
deeppavlov.dataset_readers.morphotagging_dataset_reader.
get_language
(filepath: str) → str[source]¶ Extracts language from typical UD filename
-
deeppavlov.dataset_readers.morphotagging_dataset_reader.
read_infile
(infile: Union[pathlib.Path, str], word_column: int = 1, pos_column: int = 3, tag_column: int = 5, max_sents: int = -1, read_only_words: bool = False) → List[Tuple[List, Optional[List]]][source]¶ Reads input file in CONLL-U format
Parameters: - infile – a path to a file
- word_column – column containing words (default=1)
- pos_column – column containing part-of-speech labels (default=3)
- tag_column – column containing fine-grained tags (default=5)
- max_sents – maximum number of sentences to read
- read_only_words – whether to read only words
Returns: a list of sentences. Each item contains a word sequence and a tag sequence, which is None in case read_only_words = True
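To make the column parameters concrete, here is a minimal CoNLL-U parsing sketch that takes an iterable of lines instead of a file path. It is not the read_infile implementation: the name parse_conllu is hypothetical, and joining the part-of-speech label and the fine-grained tag with a comma (or using the label alone when the tag column is "_") is an assumption about how the two columns are combined.

```python
from typing import Iterable, List, Optional, Tuple


def parse_conllu(
    lines: Iterable[str],
    word_column: int = 1,
    pos_column: int = 3,
    tag_column: int = 5,
    max_sents: int = -1,
    read_only_words: bool = False,
) -> List[Tuple[List[str], Optional[List[str]]]]:
    """Hypothetical helper: parse CoNLL-U lines into (words, tags) pairs."""
    sentences: List[Tuple[List[str], Optional[List[str]]]] = []
    words: List[str] = []
    tags: List[str] = []

    def flush() -> None:
        # Store the sentence collected so far, if any, and reset the buffers.
        if words:
            sentences.append((list(words), None if read_only_words else list(tags)))
        words.clear()
        tags.clear()

    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            continue  # sentence-level comment
        if not line:
            flush()  # a blank line ends a sentence
            if max_sents > 0 and len(sentences) >= max_sents:
                break
            continue
        fields = line.split("\t")
        if not fields[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        words.append(fields[word_column])
        if not read_only_words:
            pos, feats = fields[pos_column], fields[tag_column]
            # Assumed tag format: "POS,feats", or just "POS" without features.
            tags.append(pos if feats == "_" else f"{pos},{feats}")
    flush()  # the last sentence may not be followed by a blank line
    return sentences
```

With the defaults, column 1 is the word form, column 3 the universal part-of-speech tag and column 5 the morphological features, following the zero-based indexing of the tab-separated CoNLL-U fields.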
-
class
deeppavlov.dataset_readers.ontonotes_reader.
OntonotesReader
[source]¶ Class to read training datasets in OntoNotes format
-
class
deeppavlov.dataset_readers.paraphraser_reader.
ParaphraserReader
[source]¶ The class to read the paraphraser.ru dataset from files.
Please, see https://paraphraser.ru.
Parameters: - data_path – A path to a folder with dataset files.
- seed – Random seed.
-
class
deeppavlov.dataset_readers.quora_question_pairs_reader.
QuoraQuestionPairsReader
[source]¶ The class to read the Quora Question Pairs dataset from files.
Please, see https://www.kaggle.com/c/quora-question-pairs/data.
Parameters: - data_path – A path to a folder with dataset files.
- seed – Random seed.
-
class
deeppavlov.dataset_readers.squad_dataset_reader.
SquadDatasetReader
[source]¶ Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/ and Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html
Downloads dataset files and prepares train/valid split.
-
class
deeppavlov.dataset_readers.typos_reader.
TyposCustom
[source]¶ Base class for reading spelling corrections dataset files
-
static
build
(data_path: str) → pathlib.Path[source]¶ Base method that interprets the data_path argument.
Parameters: data_path – path to the tsv-file containing erroneous and corrected words
Returns: the same path as a Path object
-
classmethod
read
(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]¶ Read train data for spelling corrections algorithms
Parameters: data_path – path that needs to be interpreted with build()
Returns: train data to pass to a TyposDatasetIterator
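The return type Dict[str, List[Tuple[str, str]]] can be illustrated with a short sketch that reads a tsv-file of word pairs. This is not the TyposCustom code: read_typos_tsv is a hypothetical helper, and both the column order (erroneous word first, corrected word second) and the single 'train' key are assumptions based on the descriptions above.

```python
import csv
from pathlib import Path
from typing import Dict, List, Tuple


def read_typos_tsv(data_path: str) -> Dict[str, List[Tuple[str, str]]]:
    """Hypothetical helper: read (erroneous, corrected) pairs from a tsv-file."""
    path = Path(data_path)
    with path.open(newline="", encoding="utf-8") as f:
        # Assumed layout: column 0 is the misspelling, column 1 the correction.
        pairs = [
            (row[0], row[1])
            for row in csv.reader(f, delimiter="\t")
            if len(row) >= 2  # ignore blank or malformed lines
        ]
    return {"train": pairs}
```

The resulting list of pairs is the kind of structure a dataset iterator can then split or shuffle.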
-
class
deeppavlov.dataset_readers.typos_reader.
TyposKartaslov
[source]¶ Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov
-
static
build
(data_path: str) → pathlib.Path[source]¶ Download misspellings list from GitHub
Parameters: data_path – target directory to download the data to
Returns: path to the resulting csv-file
-
static
read
(data_path: str, *args, **kwargs) → Dict[str, List[Tuple[str, str]]][source]¶ Read train data for spelling corrections algorithms
Parameters: data_path – path that needs to be interpreted with build()
Returns: train data to pass to a TyposDatasetIterator
-
class
deeppavlov.dataset_readers.typos_reader.
TyposWikipedia
[source]¶ Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings
-
class
deeppavlov.dataset_readers.ubuntu_v2_reader.
UbuntuV2Reader
[source]¶ The class to read the Ubuntu V2 dataset from csv files.
Please, see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.
Parameters: data_path – A path to a folder with dataset csv files.
-
class
deeppavlov.dataset_readers.ubuntu_v2_mt_reader.
UbuntuV2MTReader
[source]¶ The class to read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.
Please, see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.
Parameters: - data_path – A path to a folder with dataset csv files.
- num_context_turns – A maximum number of dialogue context turns.