dataset_iterators¶

Concrete DatasetIterator classes.

class deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: Optional[List[str]] = None, merged_field: Optional[str] = None, field_to_split: Optional[str] = None, split_fields: Optional[List[str]] = None, split_proportions: Optional[List[float]] = None, seed: Optional[int] = None, shuffle: bool = True, split_seed: Optional[int] = None, stratify: Optional[bool] = None, shot: Optional[int] = None, *args, **kwargs)[source]¶

Takes the data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.

Parameters

data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
fields_to_merge – list of fields (out of "train", "valid", "test") to merge
merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
field_to_split – name of the field (out of "train", "valid", "test") to split
split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
split_proportions – list of corresponding proportions for splitting
seed – random seed for iterating
shuffle – whether to shuffle examples in batches
split_seed – random seed for splitting the dataset. If split_seed is None, the split is based on seed.
stratify – whether to use stratified split
shot – number of examples to sample for each class in training data. If None, all examples will remain in data.
*args – additional positional arguments
**kwargs – additional keyword arguments

data¶: dictionary of data with fields “train”, “valid” and “test” (or some of them)
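The merge-then-split behaviour described by the parameters above can be illustrated with a small standard-library sketch. It is not the library's implementation: `merge_and_split` is a hypothetical helper, the split is a plain shuffled cut by cumulative proportions (no stratification, no `shot` sampling), and the sketch assumes that `split_fields` and `split_proportions` have the same length and that merged source fields are left untouched.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def merge_and_split(
    data: Dict[str, List[Example]],
    fields_to_merge: Optional[Sequence[str]] = None,
    merged_field: Optional[str] = None,
    field_to_split: Optional[str] = None,
    split_fields: Optional[Sequence[str]] = None,
    split_proportions: Optional[Sequence[float]] = None,
    split_seed: Optional[int] = None,
) -> Dict[str, List[Example]]:
    """Hypothetical helper: merge some fields, then split one field by proportions."""
    result = {name: list(examples) for name, examples in data.items()}

    if fields_to_merge and merged_field:
        merged: List[Example] = []
        for name in fields_to_merge:
            merged.extend(data.get(name, []))
        # Assumption: the source fields themselves are not cleared.
        result[merged_field] = merged

    if field_to_split and split_fields and split_proportions:
        pool = list(result[field_to_split])
        random.Random(split_seed).shuffle(pool)
        result[field_to_split] = []
        start = 0
        last = len(split_fields) - 1
        for i, (name, share) in enumerate(zip(split_fields, split_proportions)):
            # The last field takes the remainder so no example is lost to rounding.
            end = len(pool) if i == last else start + int(len(pool) * share)
            result[name] = pool[start:end]
            start = end

    return result


data = {
    "train": [(f"text {i}", "a") for i in range(8)],
    "valid": [("valid 0", "b"), ("valid 1", "b")],
}
result = merge_and_split(
    data,
    fields_to_merge=["train", "valid"],
    merged_field="train",
    field_to_split="train",
    split_fields=["train", "valid"],
    split_proportions=[0.8, 0.2],
    split_seed=0,
)
print(len(result["train"]), len(result["valid"]))
```

Here all ten examples are first pooled into "train" and then redistributed between "train" and "valid" in an 80/20 ratio.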

class deeppavlov.dataset_iterators.siamese_iterator.SiameseIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶: The class contains methods for iterating over a dataset for ranking in training, validation and test mode.

class deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(load_path: Union[str, pathlib.Path], batch_size: Optional[int] = None, shuffle: Optional[bool] = None, seed: Optional[int] = None, **kwargs)[source]¶

Iterates over an SQLite database. Generates batches from SQLite data and retrieves document ids and documents.

Parameters

load_path – a path to local DB file
batch_size – a number of samples in a single batch
shuffle – whether to shuffle data during batching
seed – random seed for data shuffling

connect¶: a DB connection

db_name¶: a DB name

doc_ids¶: DB document ids

doc2index¶: a dictionary of document indices and their titles

batch_size¶: a number of samples in a single batch

shuffle¶: whether to shuffle data during batching

random¶: an instance of Random class.
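As a rough illustration of batching document ids out of an SQLite file, the sketch below uses only the standard `sqlite3` and `random` modules. The `documents` table with `id` and `text` columns and the `gen_doc_batches` function are assumptions made for this example, not the class's actual schema or API.

```python
import random
import sqlite3
from typing import Iterator, List, Optional, Tuple


def gen_doc_batches(
    connect: sqlite3.Connection,
    batch_size: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (document ids, document texts) batches from an assumed `documents` table."""
    cursor = connect.cursor()
    doc_ids = [row[0] for row in cursor.execute("SELECT id FROM documents ORDER BY id")]
    if shuffle:
        random.Random(seed).shuffle(doc_ids)
    # With no batch size, everything goes into a single batch.
    size = batch_size or len(doc_ids) or 1
    for start in range(0, len(doc_ids), size):
        batch_ids = doc_ids[start:start + size]
        texts = [
            cursor.execute("SELECT text FROM documents WHERE id = ?", (doc_id,)).fetchone()[0]
            for doc_id in batch_ids
        ]
        yield batch_ids, texts


# In-memory database standing in for the file at load_path.
connect = sqlite3.connect(":memory:")
connect.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
connect.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("doc1", "first"), ("doc2", "second"), ("doc3", "third")],
)
batches = list(gen_doc_batches(connect, batch_size=2))
shuffled_ids = [i for ids, _ in gen_doc_batches(connect, batch_size=2, shuffle=True, seed=1) for i in ids]
```

Passing a fixed `seed` makes the shuffled order reproducible across runs.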

class deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶

SquadIterator allows iterating over examples in SQuAD-like datasets. SquadIterator is used to train torch_transformers_squad:TorchTransformersSquad.

It extracts context, question, answer_text and answer_start position from the dataset. Each example from a dataset is a tuple of (context, question) and (answer_text, answer_start).

train¶: train examples

valid¶: validation examples

test¶: test examples
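The example layout described above can be sketched by flattening a SQuAD-style JSON structure into `((context, question), (answer_text, answer_start))` tuples. The `extract_examples` function is hypothetical, and collecting all answers of a question into lists is an assumption of this sketch rather than documented behaviour.

```python
from typing import Any, Dict, List, Tuple

SquadExample = Tuple[Tuple[str, str], Tuple[List[str], List[int]]]


def extract_examples(dataset: Dict[str, Any]) -> List[SquadExample]:
    """Flatten a SQuAD-like dict into ((context, question), (answer_text, answer_start))."""
    examples: List[SquadExample] = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer_text = [answer["text"] for answer in qa["answers"]]
                answer_start = [answer["answer_start"] for answer in qa["answers"]]
                examples.append(((context, qa["question"]), (answer_text, answer_start)))
    return examples


context = "DeepPavlov is an open-source library."
answer = "an open-source library"
toy_dataset = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "question": "What is DeepPavlov?",
                            "answers": [{"text": answer, "answer_start": context.index(answer)}],
                        }
                    ],
                }
            ]
        }
    ]
}
examples = extract_examples(toy_dataset)
(ctx, question), (texts, starts) = examples[0]
```

The `answer_start` value is a character offset into the context, so slicing the context at that offset recovers the answer text.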

class deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶

Implementation of DataLearningIterator used for training ErrorModel

split(test_ratio: float = 0.0, *args, **kwargs)[source]¶

Split all data into train and test

Parameters: test_ratio – ratio of test data to train, from 0. to 1.
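A minimal sketch of such a split is shown below. It assumes that `test_ratio` is the fraction of all examples moved to the test part and that the data is shuffled before cutting. `split_train_test` is a hypothetical stand-alone function, not the method itself.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    examples: Sequence[T], test_ratio: float = 0.0, seed: Optional[int] = None
) -> Tuple[List[T], List[T]]:
    """Shuffle the examples and move a `test_ratio` share of them into the test part."""
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Toy (misspelled, correct) pairs of the kind a spelling error model might train on.
pairs = [(f"wrnog{i}", f"wrong{i}") for i in range(10)]
train, test = split_train_test(pairs, test_ratio=0.2, seed=42)
all_train, no_test = split_train_test(pairs)
```

With the default `test_ratio` of 0.0, every example stays in the training part.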

class deeppavlov.dataset_iterators.multitask_iterator.MultiTaskIterator(data: dict, num_train_epochs: int, tasks: dict, batch_size: int = 8, sampling_mode: str = 'plain', gradient_accumulation_steps: int = 1, steps_per_epoch: int = 0, one_element_tuples: bool = True, task_defaults: Optional[dict] = None, seed: int = 42, **kwargs)[source]¶

Merges data from several dataset iterators. When used for batch generation, batches from the merged dataset iterators are united into one batch. If the sizes of the merged datasets differ, smaller datasets are repeated until their size becomes equal to that of the largest dataset.

Parameters

data – dictionary whose keys are task names and whose values are dictionaries with fields "train", "valid", "test".
num_train_epochs – number of training epochs
tasks – dictionary whose keys are task names and whose values are init params of dataset iterators. If a task has the key-value pair 'use_task_defaults': False, task_defaults will be ignored for this task's dataset iterator.
batch_size – batch size
sampling_mode – mode of sampling we use. It can be plain, uniform or anneal.
gradient_accumulation_steps – number of gradient accumulation steps. Default is 1
steps_per_epoch – number of steps per epoch. Necessary if gradient_accumulation_steps > 1
iterator_class_name – name of iterator class.
use_label_name, seed, features – parameters for the iterator class
one_element_tuples – if True, a tuple of x consisting of one element is returned as that single element. Default: True
task_defaults – default task parameters.
seed – random seed for sampling

data¶: dictionary of data with fields “train”, “valid” and “test” (or some of them)

gen_batches(batch_size: int, data_type: str = 'train', shuffle: Optional[bool] = None) → Iterator[Tuple[tuple, tuple]][source]¶

Generates batches and expected output to train neural networks. If there are not enough samples from any task, samples are padded with None.

Parameters

batch_size – number of samples in batch
data_type – can be either ‘train’, ‘test’, or ‘valid’
shuffle – whether to shuffle dataset before batching

Yields: A tuple of a batch of inputs and a batch of expected outputs. Inputs and outputs are tuples. Each element of inputs or outputs is a tuple whose elements are the x values of the merged tasks, in the order in which the tasks appear in the tasks argument of the __init__ method.

get_instances(data_type: str = 'train')[source]¶

Returns a tuple of inputs and outputs from all datasets. Lengths of inputs and outputs are equal to the size of the largest dataset. Smaller datasets are padded with Nones until their sizes are equal to the size of the largest dataset.

Parameters: data_type – can be either ‘train’, ‘test’, or ‘valid’

Returns: A tuple of all inputs for a data type and all expected outputs for a data type.
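The padding described for get_instances can be sketched as follows. This is an illustration under assumptions, not the method's code: `pad_and_zip` is a hypothetical helper, each task is given as a plain list of `(x, y)` pairs, and every element of the result groups the values of all tasks in task order.

```python
from typing import Any, Dict, List, Tuple


def pad_and_zip(task_data: Dict[str, List[Tuple[Any, Any]]]) -> Tuple[tuple, tuple]:
    """Pad shorter tasks with None and group the x and y values across tasks."""
    longest = max((len(examples) for examples in task_data.values()), default=0)
    xs_per_task: List[list] = []
    ys_per_task: List[list] = []
    for examples in task_data.values():
        # Pad with (None, None) pairs up to the size of the largest dataset.
        padded = list(examples) + [(None, None)] * (longest - len(examples))
        xs_per_task.append([x for x, _ in padded])
        ys_per_task.append([y for _, y in padded])
    # Element i of inputs holds the i-th x of every task, in task order.
    inputs = tuple(zip(*xs_per_task))
    outputs = tuple(zip(*ys_per_task))
    return inputs, outputs


task_data = {
    "sentiment": [("good", "pos"), ("bad", "neg"), ("meh", "neu")],
    "topic": [("goal scored", "sport")],
}
inputs, outputs = pad_and_zip(task_data)
```

In this toy case the "topic" task has a single example, so its slots in the second and third elements are filled with None.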

class deeppavlov.dataset_iterators.multitask_iterator.SingleTaskBatchGenerator(dataset_iterator: deeppavlov.core.data.data_learning_iterator.DataLearningIterator, batch_size: int, data_type: str, shuffle: bool, n_batches: Optional[int] = None, size_of_last_batch: Optional[int] = None)[source]¶

Batch generator for a single task. If there are no elements in the dataset to form another batch, Nones are returned.

Parameters

dataset_iterator – dataset iterator from which batches are drawn.
batch_size – size of the batch.
data_type – “train”, “valid”, or “test”
shuffle – whether the dataset will be shuffled.
n_batches – the number of batches that will be generated.
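The idea of a per-task generator that keeps producing output after its data runs out can be sketched as below. `single_task_batches` is a hypothetical function, and yielding a `(None, None)` pair for each exhausted step is an assumption about the shape of the returned Nones.

```python
from typing import Any, Iterator, Optional, Sequence, Tuple


def single_task_batches(
    examples: Sequence[Tuple[Any, Any]],
    batch_size: int,
    n_batches: Optional[int] = None,
) -> Iterator[Tuple[Optional[tuple], Optional[tuple]]]:
    """Yield (x batch, y batch); once data runs out, yield (None, None) up to n_batches."""
    produced = 0
    for start in range(0, len(examples), batch_size):
        if n_batches is not None and produced >= n_batches:
            return
        chunk = examples[start:start + batch_size]
        xs, ys = zip(*chunk)
        yield xs, ys
        produced += 1
    # Assumption: exhausted tasks keep emitting placeholders so that all
    # tasks produce the same number of steps.
    while n_batches is not None and produced < n_batches:
        yield None, None
        produced += 1


task_examples = [("a", 1), ("b", 2), ("c", 3)]
generated = list(single_task_batches(task_examples, batch_size=2, n_batches=3))
```

With three examples, a batch size of 2 and three requested batches, the third step carries no data and is filled with placeholders.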