dataset_iterators
Concrete DatasetIterator classes.
-
class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, split_seed: int = None, *args, **kwargs)
Gets a data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.
Parameters:
- data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
- fields_to_merge – list of fields (out of "train", "valid", "test") to merge
- merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
- field_to_split – name of the field (out of "train", "valid", "test") to split
- split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
- split_proportions – list of corresponding proportions for splitting
- seed – random seed for iterating
- shuffle – whether to shuffle examples in batches
- split_seed – random seed for splitting the dataset; if split_seed is None, the split is based on seed
- *args – arguments
- **kwargs – arguments
Attributes:
- data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
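The merge and split behaviour described for this class can be illustrated with a minimal sketch. This is not the DeepPavlov implementation: `merge_and_split` is a hypothetical helper, and both the shuffling before the split and the rounding of part sizes are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, str]


def merge_and_split(data: Dict[str, List[Sample]],
                    fields_to_merge: Optional[List[str]] = None,
                    merged_field: Optional[str] = None,
                    field_to_split: Optional[str] = None,
                    split_fields: Optional[List[str]] = None,
                    split_proportions: Optional[List[float]] = None,
                    split_seed: Optional[int] = None) -> Dict[str, List[Sample]]:
    """Hypothetical helper: merge some fields, then split one field by proportions."""
    result = {name: list(samples) for name, samples in data.items()}

    # Merge: concatenate the listed fields and store them under merged_field.
    if fields_to_merge and merged_field:
        merged: List[Sample] = []
        for name in fields_to_merge:
            merged.extend(result.get(name, []))
        result[merged_field] = merged

    # Split: cut one field into consecutive parts according to the proportions.
    if field_to_split and split_fields and split_proportions:
        samples = list(result[field_to_split])
        # Assumption: shuffle with a dedicated seed before cutting.
        random.Random(split_seed).shuffle(samples)
        start = 0
        for i, (name, proportion) in enumerate(zip(split_fields, split_proportions)):
            if i == len(split_fields) - 1:
                # The last part takes the remainder so no sample is lost to rounding.
                end = len(samples)
            else:
                end = start + int(round(proportion * len(samples)))
            result[name] = samples[start:end]
            start = end

    return result


data = {"train": [(f"text {i}", "label") for i in range(8)],
        "valid": [("text 8", "label"), ("text 9", "label")]}
out = merge_and_split(data,
                      fields_to_merge=["train", "valid"], merged_field="train",
                      field_to_split="train", split_fields=["train", "valid"],
                      split_proportions=[0.8, 0.2], split_seed=42)
print(len(out["train"]), len(out["valid"]))
```

Here the ten samples from "train" and "valid" are first merged into "train" and then divided again in an 80/20 ratio, which is one common reason to combine the two options.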
-
class deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)
Iterates over dialog data and generates batches where one sample is one dialog. A subclass of DataLearningIterator.
Attributes:
- train – list of training dialogs (tuples (context, response))
- valid – list of validation dialogs (tuples (context, response))
- test – list of dialogs used for testing (tuples (context, response))
-
class deeppavlov.dataset_iterators.dialog_iterator.DialogDBResultDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)
Iterates over dialog data and outputs a list of all 'db_result' fields (if present). The class helps to build a list of all 'db_result' values present in a dataset. Inherits key methods and attributes from DataLearningIterator.
Attributes:
- train – list of tuples (db_result dictionary, '') from “train” data
- valid – list of tuples (db_result dictionary, '') from “valid” data
- test – list of tuples (db_result dictionary, '') from “test” data
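The collection of 'db_result' values into (db_result dictionary, '') tuples can be sketched as follows. This is an illustration only: `collect_db_results` is a hypothetical helper, and it assumes that each context is a dictionary that may or may not carry a 'db_result' key. The other field names in the sample data are invented.

```python
from typing import Any, Dict, List, Tuple


def collect_db_results(
        samples: List[Tuple[Dict[str, Any], Any]]) -> List[Tuple[Dict[str, Any], str]]:
    """Hypothetical helper: keep the 'db_result' value of every context that has one."""
    return [(context["db_result"], "")
            for context, _response in samples
            if context.get("db_result") is not None]


# Invented sample data: only the second context carries a 'db_result'.
samples = [
    ({"text": "i want cheap food"}, {"text": "what area?"}),
    ({"text": "north", "db_result": {"name": "da vinci", "area": "north"}},
     {"text": "da vinci is in the north"}),
]
db_results = collect_db_results(samples)
print(db_results)
```

Pairing each dictionary with an empty string keeps the output in the same (x, y) tuple shape that the other iterators use.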
-
class deeppavlov.dataset_iterators.dstc2_intents_iterator.Dstc2IntentsDatasetIterator(data: dict, fields_to_merge: List[str] = None, merged_field: str = None, field_to_split: str = None, split_fields: List[str] = None, split_proportions: List[float] = None, seed: int = None, shuffle: bool = True, *args, **kwargs)
Gets a data dictionary from a DSTC2DatasetReader instance, constructs intents from act and slots, merges fields if necessary, and splits a field if necessary.
Parameters:
- data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
- fields_to_merge – list of fields (out of "train", "valid", "test") to merge
- merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
- field_to_split – name of the field (out of "train", "valid", "test") to split
- split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
- split_proportions – list of corresponding proportions for splitting
- seed – random seed
- shuffle – whether to shuffle examples in batches
- *args – arguments
- **kwargs – arguments
Attributes:
- data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
-
class deeppavlov.dataset_iterators.dstc2_ner_iterator.Dstc2NerDatasetIterator
Iterates over data for the DSTC2 NER task. The dataset takes a dict with fields ‘train’, ‘test’, ‘valid’. A list of samples (pairs x, y) is stored in each field.
Parameters:
- data – list of (x, y) pairs, samples from the dataset; x as well as y can be a tuple of different input features
- dataset_path – path to dataset
- seed – value for random seed
- shuffle – whether to shuffle the data
-
class deeppavlov.dataset_iterators.kvret_dialog_iterator.KvretDialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)
Inputs data from DSTC2DatasetReader, constructs dialog history for each turn, and generates batches (one sample is a turn). Inherits key methods and attributes from DataLearningIterator.
Attributes:
- train – list of “train” (context, response) tuples
- valid – list of “valid” (context, response) tuples
- test – list of “test” (context, response) tuples
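The idea of building a dialog history for each turn, so that one sample is one turn, can be sketched as below. The data layout is an assumption: the sketch treats a dialog as a plain list of (utterance, response) string pairs. `turns_with_history` and the "utterance" and "history" keys are hypothetical names chosen for illustration, and do not reflect the structures the library actually uses.

```python
from typing import Any, Dict, List, Tuple


def turns_with_history(dialog: List[Tuple[str, str]]) -> List[Tuple[Dict[str, Any], str]]:
    """Hypothetical helper: turn one dialog into per-turn samples with history."""
    samples = []
    history: List[str] = []
    for utterance, response in dialog:
        # Each sample sees only what was said before the current turn.
        context = {"utterance": utterance, "history": list(history)}
        samples.append((context, response))
        history.extend([utterance, response])
    return samples


dialog = [("hi", "hello, how can I help?"),
          ("where is the nearest gas station?", "there is one 2 miles away")]
turn_samples = turns_with_history(dialog)
print(turn_samples[1][0]["history"])
```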
-
deeppavlov.dataset_iterators.morphotagger_iterator.preprocess_data(data: List[Tuple[List[str], List[str]]], to_lower: bool = True, append_case: str = 'first') → List[Tuple[List[Tuple[str]], List[str]]]
Processes all words in data using process_word().
Parameters:
- data – a list of pairs (words, tags), where each pair corresponds to a single sentence
- to_lower – whether to lowercase
- append_case – whether to add a case mark
Returns: a list of preprocessed sentences
-
class deeppavlov.dataset_iterators.morphotagger_iterator.MorphoTaggerDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, validation_split: float = 0.2)
Iterates over data for morphological tagging. A subclass of DataLearningIterator.
Parameters:
- seed – random seed for data shuffling
- shuffle – whether to shuffle data during batching
- validation_split – the fraction of data used for validation (used only if there is no valid subset in data)
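The role of validation_split can be shown with a small sketch: a validation part is carved out of the training data only when no valid subset is supplied. `ensure_valid_split` is a hypothetical helper, not the class's own code. Shuffling before the cut and truncating the validation size to an integer are assumptions.

```python
import random
from typing import Any, Dict, List, Optional, Tuple

Sample = Tuple[Any, Any]


def ensure_valid_split(data: Dict[str, List[Sample]],
                       validation_split: float = 0.2,
                       seed: Optional[int] = None) -> Dict[str, List[Sample]]:
    """Hypothetical helper: create a valid subset from train only if it is missing."""
    train = list(data.get("train", []))
    valid = list(data.get("valid", []))
    if not valid and train:
        # Assumption: shuffle first so the held-out part is a random sample.
        random.Random(seed).shuffle(train)
        n_valid = int(len(train) * validation_split)
        valid, train = train[:n_valid], train[n_valid:]
    return {"train": train, "valid": valid, "test": list(data.get("test", []))}


sentences = [(["word"] * 3, ["TAG"] * 3) for _ in range(10)]
without_valid = ensure_valid_split({"train": sentences}, validation_split=0.2, seed=0)
with_valid = ensure_valid_split({"train": sentences, "valid": sentences[:1]})
print(len(without_valid["train"]), len(without_valid["valid"]))
print(len(with_valid["train"]), len(with_valid["valid"]))
```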
-
class deeppavlov.dataset_iterators.siamese_iterator.SiameseIterator(data: Dict[str, List], seed: int = None, shuffle: bool = False, num_samples: int = None, random_batches: bool = False, batches_per_epoch: int = None, *args, **kwargs)
Contains methods for iterating over a ranking dataset in training, validation and test mode.
Parameters:
- data – A dictionary containing the training, validation and test parts of the dataset, obtainable via the train, valid and test keys.
- seed – Random seed.
- shuffle – Whether to shuffle data.
- num_samples – The number of data samples to use in train, validation and test mode.
- random_batches – Whether to choose batches randomly or iterate over data sequentially in training mode.
- batches_per_epoch – The number of batches to choose per epoch in training mode. Only required if random_batches is set to True.
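The difference between random_batches and sequential iteration in training mode can be sketched as follows. This is a minimal illustration under stated assumptions, not the SiameseIterator code: `gen_train_batches` is a hypothetical function, and drawing each random batch without replacement inside the batch is an assumption.

```python
import random
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def gen_train_batches(samples: Sequence[T],
                      batch_size: int,
                      random_batches: bool = False,
                      batches_per_epoch: Optional[int] = None,
                      seed: Optional[int] = None) -> Iterator[List[T]]:
    """Hypothetical generator: random batches or one sequential pass over the data."""
    rng = random.Random(seed)
    if random_batches:
        if batches_per_epoch is None:
            raise ValueError("batches_per_epoch is required when random_batches is True")
        for _ in range(batches_per_epoch):
            # Each batch is an independent random draw from the whole dataset.
            yield rng.sample(list(samples), k=min(batch_size, len(samples)))
    else:
        # Sequential mode: walk over the data once in fixed-size steps.
        for start in range(0, len(samples), batch_size):
            yield list(samples[start:start + batch_size])


sequential = list(gen_train_batches(list(range(5)), batch_size=2))
randomized = list(gen_train_batches(list(range(5)), batch_size=2,
                                    random_batches=True, batches_per_epoch=3, seed=1))
print(sequential)
print(len(randomized))
```

In random mode the epoch length no longer follows from the dataset size, which is why batches_per_epoch has to be given explicitly.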
-
class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(data_dir: str = '', data_url: str = 'http://files.deeppavlov.ai/datasets/wikipedia/enwiki.db', batch_size: int = None, shuffle: bool = None, seed: int = None, **kwargs)
Iterates over a SQLite database. Generates batches from SQLite data. Gets document ids and documents.
Parameters:
- data_dir – a directory to save the downloaded DB to
- data_url – a URL to download the DB from
- batch_size – the number of samples in a single batch
- shuffle – whether to shuffle data during batching
- seed – random seed for data shuffling
Attributes:
- connect – a DB connection
- db_name – a DB name
- doc_ids – DB document ids
- doc2index – a dictionary of document indices and their titles
- batch_size – the number of samples in a single batch
- shuffle – whether to shuffle data during batching
- random – an instance of the Random class
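Batching document ids from a SQLite database can be sketched with the standard sqlite3 module. The schema is an assumption made for this example: a table named `documents` with `id` and `text` columns. `gen_doc_id_batches` and `get_doc_content` are hypothetical helpers rather than methods of the class.

```python
import random
import sqlite3
from typing import Iterator, List, Optional


def gen_doc_id_batches(connect: sqlite3.Connection,
                       batch_size: int,
                       shuffle: bool = False,
                       seed: Optional[int] = None) -> Iterator[List[str]]:
    """Hypothetical helper: yield document ids in batches."""
    # Assumption: documents live in a table "documents" with an "id" column.
    doc_ids = [row[0] for row in connect.execute("SELECT id FROM documents ORDER BY id")]
    if shuffle:
        random.Random(seed).shuffle(doc_ids)
    for start in range(0, len(doc_ids), batch_size):
        yield doc_ids[start:start + batch_size]


def get_doc_content(connect: sqlite3.Connection, doc_id: str) -> Optional[str]:
    """Hypothetical helper: fetch the text of one document, or None if it is absent."""
    row = connect.execute("SELECT text FROM documents WHERE id = ?", (doc_id,)).fetchone()
    return row[0] if row else None


# An in-memory database stands in for the downloaded DB file.
connect = sqlite3.connect(":memory:")
connect.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
connect.executemany("INSERT INTO documents VALUES (?, ?)",
                    [("doc_a", "First document."),
                     ("doc_b", "Second document."),
                     ("doc_c", "Third document.")])

batches = list(gen_doc_id_batches(connect, batch_size=2))
print(batches)
print(get_doc_content(connect, "doc_b"))
```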
-
class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)
SquadIterator allows iterating over examples in SQuAD-like datasets. SquadIterator is used to train SquadModel. It extracts context, question, answer_text and answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).
Attributes:
- train – train examples
- valid – validation examples
- test – test examples
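The example layout described above, a tuple of (context, question) and (answer_text, answer_start), can be illustrated by flattening a SQuAD-style JSON structure. `extract_examples` is a hypothetical helper written for this sketch. Collecting answer_text and answer_start as lists, one entry per reference answer, is an assumption and may differ from what the class does.

```python
from typing import Any, Dict, List, Tuple

Example = Tuple[Tuple[str, str], Tuple[List[str], List[int]]]


def extract_examples(dataset: Dict[str, Any]) -> List[Example]:
    """Hypothetical helper: flatten SQuAD-style JSON into (x, y) examples."""
    examples: List[Example] = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Assumption: keep all reference answers of a question as lists.
                answer_text = [answer["text"] for answer in qa["answers"]]
                answer_start = [answer["answer_start"] for answer in qa["answers"]]
                examples.append(((context, qa["question"]), (answer_text, answer_start)))
    return examples


context = "DeepPavlov is an open-source library."
answer = "an open-source library"
dataset = {"data": [{"paragraphs": [{
    "context": context,
    "qas": [{"question": "What is DeepPavlov?",
             "answers": [{"text": answer, "answer_start": context.index(answer)}]}],
}]}]}

examples = extract_examples(dataset)
(ctx, question), (texts, starts) = examples[0]
print(question, texts, starts)
```

The answer_start value is a character offset into the context, so slicing the context at that offset recovers the answer text.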
-
class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: int = None, shuffle: bool = True, *args, **kwargs)
Implementation of DataLearningIterator used for training ErrorModel.