dataset_iterators¶
Concrete DatasetIterator classes.
-
class
deeppavlov.dataset_iterators.basic_classification_iterator.
BasicClassificationDatasetIterator
(data: dict, fields_to_merge: Optional[List[str]] = None, merged_field: Optional[str] = None, field_to_split: Optional[str] = None, split_fields: Optional[List[str]] = None, split_proportions: Optional[List[float]] = None, seed: Optional[int] = None, shuffle: bool = True, split_seed: Optional[int] = None, stratify: Optional[bool] = None, shot: Optional[int] = None, *args, **kwargs)[source]¶ Takes a data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.
- Parameters
data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
fields_to_merge – list of fields (out of "train", "valid", "test") to merge
merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
field_to_split – name of the field (out of "train", "valid", "test") to split
split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
split_proportions – list of corresponding proportions for splitting
seed – random seed for iterating
shuffle – whether to shuffle examples in batches
split_seed – random seed for splitting the dataset; if split_seed is None, the split is based on seed
stratify – whether to use stratified split
shot – number of examples to sample for each class in training data. If None, all examples will remain in data.
*args – additional positional arguments
**kwargs – additional keyword arguments
-
data
¶ dictionary of data with fields “train”, “valid” and “test” (or some of them)
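The merge and split parameters described above can be illustrated with a small, library-free sketch. The helpers `merge_fields` and `split_field` and the `(text, label)` sample shape are hypothetical names chosen for this example; they are not DeepPavlov's implementation, and details such as rounding or whether merged source fields are kept may differ in the library.

```python
import random
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, str]  # hypothetical sample shape: (text, label)


def merge_fields(data: Dict[str, List[Sample]],
                 fields_to_merge: List[str],
                 merged_field: str) -> Dict[str, List[Sample]]:
    """Concatenate the listed fields and store the result under merged_field."""
    merged = [s for field in fields_to_merge for s in data.get(field, [])]
    result = {name: list(samples) for name, samples in data.items()}
    # Assumption: the other source fields are left untouched in this sketch.
    result[merged_field] = merged
    return result


def split_field(data: Dict[str, List[Sample]],
                field_to_split: str,
                split_fields: List[str],
                split_proportions: List[float],
                split_seed: Optional[int] = None) -> Dict[str, List[Sample]]:
    """Shuffle one field with split_seed and cut it into proportional parts."""
    samples = list(data[field_to_split])
    random.Random(split_seed).shuffle(samples)
    result = {name: list(s) for name, s in data.items()}
    start = 0
    for i, (name, proportion) in enumerate(zip(split_fields, split_proportions)):
        # The last part takes whatever remains, so no sample is lost to rounding.
        is_last = i == len(split_fields) - 1
        end = len(samples) if is_last else start + round(proportion * len(samples))
        result[name] = samples[start:end]
        start = end
    return result


data = {
    "train": [(f"text {i}", "pos" if i % 2 else "neg") for i in range(10)],
    "valid": [],
    "test": [],
}
split = split_field(data, "train", ["train", "valid"], [0.8, 0.2], split_seed=42)
merged = merge_fields(split, ["train", "valid"], "train")
```

Using a dedicated `random.Random(split_seed)` instance keeps the split reproducible without touching the global random state, which mirrors the separation between `seed` and `split_seed` in the parameter list.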
-
class
deeppavlov.dataset_iterators.siamese_iterator.
SiameseIterator
(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ The class contains methods for iterating over a dataset for ranking in training, validation and test mode.
-
class
deeppavlov.dataset_iterators.sqlite_iterator.
SQLiteDataIterator
(load_path: Union[str, pathlib.Path], batch_size: Optional[int] = None, shuffle: Optional[bool] = None, seed: Optional[int] = None, **kwargs)[source]¶ Iterates over an SQLite database: generates batches from SQLite data and retrieves document ids and documents.
- Parameters
load_path – a path to local DB file
batch_size – a number of samples in a single batch
shuffle – whether to shuffle data during batching
seed – random seed for data shuffling
-
connect
¶ a DB connection
-
db_name
¶ a DB name
-
doc_ids
¶ DB document ids
-
doc2index
¶ a dictionary of document indices and their titles
-
batch_size
¶ a number of samples in a single batch
-
shuffle
¶ whether to shuffle data during batching
-
random
¶ an instance of the Random class.
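As a rough illustration of batching document ids out of an SQLite file, the sketch below uses only the standard `sqlite3` module. The table name `documents`, its `id` and `text` columns, and the helper names are assumptions made for this example and may not match the schema or methods the iterator actually uses.

```python
import random
import sqlite3
from typing import Iterator, List, Optional


def gen_doc_id_batches(connect: sqlite3.Connection,
                       batch_size: Optional[int] = None,
                       shuffle: bool = False,
                       seed: Optional[int] = None) -> Iterator[List[str]]:
    """Yield lists of document ids, batch_size ids at a time."""
    # Assumed schema: table "documents" with columns "id" and "text".
    rows = connect.execute("SELECT id FROM documents ORDER BY id")
    doc_ids = [row[0] for row in rows]
    if shuffle:
        random.Random(seed).shuffle(doc_ids)
    # With no batch_size, all ids are returned as a single batch.
    size = batch_size or len(doc_ids) or 1
    for start in range(0, len(doc_ids), size):
        yield doc_ids[start:start + size]


def get_doc_content(connect: sqlite3.Connection, doc_id: str) -> Optional[str]:
    """Return the text of one document, or None if the id is unknown."""
    row = connect.execute(
        "SELECT text FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


# In-memory database standing in for the file at load_path.
connect = sqlite3.connect(":memory:")
connect.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
connect.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(f"doc{i}", f"content {i}") for i in range(5)],
)
batches = list(gen_doc_id_batches(connect, batch_size=2))
```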
-
class
deeppavlov.dataset_iterators.squad_iterator.
SquadIterator
(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ SquadIterator allows iterating over examples in SQuAD-like datasets. SquadIterator is used to train torch_transformers_squad:TorchTransformersSquad. It extracts context, question, answer_text and answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).
-
train
¶ train examples
-
valid
¶ validation examples
-
test
¶ test examples
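To make the example shape concrete, here is a minimal sketch that flattens a SQuAD-style nested dictionary into `((context, question), (answer_text, answer_start))` tuples. The function name `extract_examples` is hypothetical, and storing all answers of a question as lists is an assumption of this sketch, not a statement about the iterator's internals.

```python
from typing import Any, Dict, List, Tuple

Example = Tuple[Tuple[str, str], Tuple[List[str], List[int]]]


def extract_examples(squad_like: Dict[str, Any]) -> List[Example]:
    """Flatten SQuAD-style articles/paragraphs/qas into flat examples."""
    examples: List[Example] = []
    for article in squad_like["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Assumption: every answer of a question is kept, in order.
                answer_text = [a["text"] for a in qa["answers"]]
                answer_start = [a["answer_start"] for a in qa["answers"]]
                examples.append(
                    ((context, qa["question"]), (answer_text, answer_start))
                )
    return examples


context = "DeepPavlov is an open-source library."
answer = "an open-source library"
squad_like = {
    "data": [{
        "paragraphs": [{
            "context": context,
            "qas": [{
                "question": "What is DeepPavlov?",
                "answers": [
                    {"text": answer, "answer_start": context.index(answer)}
                ],
            }],
        }],
    }],
}
examples = extract_examples(squad_like)
(ctx, question), (answer_text, answer_start) = examples[0]
```

The `answer_start` value is a character offset into the context, so slicing the context from that offset recovers the answer text.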
-
class
deeppavlov.dataset_iterators.typos_iterator.
TyposDatasetIterator
(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Implementation of DataLearningIterator used for training ErrorModel
-
class
deeppavlov.dataset_iterators.multitask_iterator.
MultiTaskIterator
(data: dict, num_train_epochs: int, tasks: dict, batch_size: int = 8, sampling_mode: str = 'plain', gradient_accumulation_steps: int = 1, steps_per_epoch: int = 0, one_element_tuples: bool = True, task_defaults: Optional[dict] = None, seed: int = 42, **kwargs)[source]¶ Merges data from several dataset iterators. When used for batch generation, batches from the merged dataset iterators are united into one batch. If the sizes of the merged datasets differ, smaller datasets are repeated until their size equals that of the largest dataset.
- Parameters
data – dictionary whose keys are task names and whose values are dictionaries with fields "train", "valid", "test".
num_train_epochs – number of training epochs
tasks – dictionary whose keys are task names and whose values are init params of dataset iterators. If a task has the key-value pair 'use_task_defaults': False, task_defaults will be ignored for this task's dataset iterator.
batch_size – number of samples in a batch
sampling_mode – sampling mode to use. It can be plain, uniform or anneal.
gradient_accumulation_steps – number of gradient accumulation steps. Default is 1
steps_per_epoch – number of steps per epoch. Necessary if gradient_accumulation_steps > 1
iterator_class_name – name of iterator class.
use_label_name, seed, features – parameters for the iterator class
one_element_tuples – if True, a tuple of x consisting of a single element is returned as that element itself. Default: True
task_defaults – default task parameters.
seed – random seed for sampling
-
data
¶ dictionary of data with fields “train”, “valid” and “test” (or some of them)
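The interaction between `tasks`, `task_defaults` and the `'use_task_defaults': False` switch can be sketched as follows. The helper `resolve_task_params` is hypothetical, and the rule that task-specific values override the shared defaults is an assumption of this sketch rather than documented behavior.

```python
from typing import Any, Dict, Optional


def resolve_task_params(tasks: Dict[str, Dict[str, Any]],
                        task_defaults: Optional[Dict[str, Any]] = None
                        ) -> Dict[str, Dict[str, Any]]:
    """Build the init params of each task's dataset iterator."""
    resolved = {}
    for name, params in tasks.items():
        params = dict(params)  # do not mutate the caller's dictionary
        use_defaults = params.pop("use_task_defaults", True)
        if use_defaults and task_defaults:
            # Assumption: task-specific values win over the shared defaults.
            params = {**task_defaults, **params}
        resolved[name] = params
    return resolved


task_defaults = {"iterator_class_name": "basic_classification_iterator", "seed": 42}
tasks = {
    "sentiment": {"seed": 7},
    "ner": {"use_task_defaults": False, "iterator_class_name": "custom_iterator"},
}
resolved = resolve_task_params(tasks, task_defaults)
```

In this sketch the task named `ner` opts out of the defaults entirely, so it receives only the parameters listed under its own key.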
-
gen_batches
(batch_size: int, data_type: str = 'train', shuffle: Optional[bool] = None) → Iterator[Tuple[tuple, tuple]][source]¶ Generates batches and expected output to train neural networks. If there are not enough samples from any task, samples are padded with None.
- Parameters
batch_size – number of samples in batch
data_type – can be either ‘train’, ‘test’, or ‘valid’
shuffle – whether to shuffle dataset before batching
- Yields
A tuple of a batch of inputs and a batch of expected outputs. Inputs and outputs are tuples. Each element of inputs or outputs is a tuple whose elements are x values of the merged tasks, in the order in which the tasks appear in the tasks argument of the __init__ method.
-
get_instances
(data_type: str = 'train')[source]¶ Returns a tuple of inputs and outputs from all datasets. Lengths of inputs and outputs are equal to the size of the largest dataset. Smaller datasets are padded with Nones until their sizes are equal to the size of the largest dataset.
- Parameters
data_type – can be either ‘train’, ‘test’, or ‘valid’
- Returns
A tuple of all inputs for a data type and all expected outputs for a data type.
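The None padding described for gen_batches and get_instances can be sketched without the library. The function names and the `(x, y)` sample shape below are hypothetical; the sketch only illustrates padding to the largest dataset and slicing into batches, and it leaves out shuffling and sampling modes.

```python
from itertools import zip_longest
from typing import Any, Dict, Iterator, List, Tuple

Sample = Tuple[Any, Any]  # hypothetical sample shape: (x, y)


def get_padded_instances(task_data: Dict[str, List[Sample]]
                         ) -> Tuple[tuple, tuple]:
    """Align samples of all tasks by position, padding short tasks with None."""
    task_names = list(task_data)  # task order defines the order inside each row
    rows = zip_longest(*(task_data[name] for name in task_names),
                       fillvalue=(None, None))
    inputs, outputs = [], []
    for row in rows:
        inputs.append(tuple(x for x, _ in row))
        outputs.append(tuple(y for _, y in row))
    return tuple(inputs), tuple(outputs)


def gen_padded_batches(task_data: Dict[str, List[Sample]],
                       batch_size: int) -> Iterator[Tuple[tuple, tuple]]:
    """Yield (inputs, outputs) slices of the padded instances."""
    inputs, outputs = get_padded_instances(task_data)
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size], outputs[start:start + batch_size]


task_data = {
    "sentiment": [("good", 1), ("bad", 0), ("ok", 1)],
    "topic": [("goal", "sport")],
}
all_inputs, all_outputs = get_padded_instances(task_data)
batches = list(gen_padded_batches(task_data, batch_size=2))
```

Here the `topic` task has a single sample, so its slot is None in every row after the first, while the total number of rows equals the size of the largest task.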
-
class
deeppavlov.dataset_iterators.multitask_iterator.
SingleTaskBatchGenerator
(dataset_iterator: deeppavlov.core.data.data_learning_iterator.DataLearningIterator, batch_size: int, data_type: str, shuffle: bool, n_batches: Optional[int] = None, size_of_last_batch: Optional[int] = None)[source]¶ Batch generator for a single task. If there are no elements left in the dataset to form another batch, Nones are returned.
- Parameters
dataset_iterator – dataset iterator from which batches are drawn.
batch_size – size of the batch.
data_type – “train”, “valid”, or “test”
shuffle – whether the dataset will be shuffled.
n_batches – the number of batches that will be generated.
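A minimal sketch of the "return Nones once the data runs out" behavior is shown below. The class `ExhaustibleBatchGenerator` is a hypothetical stand-in that works on a plain list of `(x, y)` samples instead of a DataLearningIterator, and it ignores shuffling and `size_of_last_batch`.

```python
from typing import Any, List, Optional, Tuple

Sample = Tuple[Any, Any]  # hypothetical sample shape: (x, y)


class ExhaustibleBatchGenerator:
    """Yield (x_batch, y_batch) while samples last, then (None, None)."""

    def __init__(self, samples: List[Sample], batch_size: int,
                 n_batches: Optional[int] = None) -> None:
        self.samples = samples
        self.batch_size = batch_size
        self.n_batches = n_batches  # None means the generator never stops
        self.position = 0
        self.generated = 0

    def __iter__(self) -> "ExhaustibleBatchGenerator":
        return self

    def __next__(self) -> Tuple[Optional[tuple], Optional[tuple]]:
        if self.n_batches is not None and self.generated >= self.n_batches:
            raise StopIteration
        self.generated += 1
        chunk = self.samples[self.position:self.position + self.batch_size]
        self.position += self.batch_size
        if not chunk:
            # The dataset is exhausted: signal it with Nones instead of stopping.
            return None, None
        x_batch, y_batch = zip(*chunk)
        return x_batch, y_batch


samples = [("a", 0), ("b", 1), ("c", 0)]
batches = list(ExhaustibleBatchGenerator(samples, batch_size=2, n_batches=3))
```

Returning Nones instead of stopping lets a caller that combines several tasks keep requesting batches from every task until the largest one is finished.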