deeppavlov.core.data
DatasetReader, Vocab, DataLearningIterator and DataFittingIterator classes.
- class deeppavlov.core.data.dataset_reader.DatasetReader
An abstract class for reading data from some location and constructing a dataset.
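The reader's role can be illustrated with a minimal sketch that does not import DeepPavlov. The names `ReaderSketch`, `InMemoryReader` and the `read` method signature are assumptions made for illustration; the returned layout (a dict keyed by 'train', 'valid' and 'test', each holding (x, y) pairs) mirrors the `data` argument that DataLearningIterator accepts.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class ReaderSketch(ABC):
    """Hypothetical stand-in for an abstract dataset reader."""

    @abstractmethod
    def read(self, data_path: str) -> Dict[str, List[Tuple[Any, Any]]]:
        """Read data from data_path and return it split by data type."""


class InMemoryReader(ReaderSketch):
    """Toy reader: ignores data_path and returns hard-coded samples."""

    def read(self, data_path: str) -> Dict[str, List[Tuple[Any, Any]]]:
        return {
            "train": [("good movie", "pos"), ("bad movie", "neg")],
            "valid": [("fine movie", "pos")],
            "test": [],
        }


# The path is a placeholder; this toy reader never touches the file system.
dataset = InMemoryReader().read("unused/path")
```

A real reader would parse files under `data_path`; only the shape of the returned dataset matters here.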
- class deeppavlov.core.data.data_fitting_iterator.DataFittingIterator(data: List[str], doc_ids: Optional[List[Any]] = None, seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)
Dataset iterator for fitting estimator models such as vocabularies, kNN and vectorizers. Data is passed as a list of strings (documents). For large datasets, the iterator generates batches.
- Parameters
data – list of documents
doc_ids – provided document ids
seed – random seed for data shuffling
shuffle – whether to shuffle data during batching
- shuffle
whether to shuffle data during batching
- random
instance of Random initialized with a seed
- data
list of documents
- doc_ids
ids provided by the user or generated automatically
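The parameters and attributes above can be tied together in a short sketch. It is not DeepPavlov's implementation: the class name `FittingIteratorSketch`, the method name `gen_batches`, and the choice of consecutive integers as automatically generated ids are all assumptions.

```python
from random import Random
from typing import Any, Iterator, List, Optional, Tuple


class FittingIteratorSketch:
    """Hypothetical iterator over documents and their ids."""

    def __init__(self, data: List[str], doc_ids: Optional[List[Any]] = None,
                 seed: Optional[int] = None, shuffle: bool = True) -> None:
        self.data = data
        # Assumption: ids default to 0..len(data)-1 when none are provided.
        self.doc_ids = doc_ids if doc_ids is not None else list(range(len(data)))
        self.shuffle = shuffle
        self.random = Random(seed)  # seeded generator used for shuffling

    def gen_batches(self, batch_size: int) -> Iterator[Tuple[List[str], List[Any]]]:
        """Yield (documents, ids) batches of at most batch_size items."""
        order = list(range(len(self.data)))
        if self.shuffle:
            self.random.shuffle(order)
        for start in range(0, len(order), batch_size):
            chunk = order[start:start + batch_size]
            yield [self.data[i] for i in chunk], [self.doc_ids[i] for i in chunk]


docs = ["first doc", "second doc", "third doc"]
batches = list(FittingIteratorSketch(docs, shuffle=False).gen_batches(batch_size=2))

# Two iterators built with the same seed shuffle in the same order.
run_a = list(FittingIteratorSketch(docs, seed=42).gen_batches(batch_size=2))
run_b = list(FittingIteratorSketch(docs, seed=42).gen_batches(batch_size=2))
```

Shuffling indices rather than the documents themselves keeps each document aligned with its id.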
- class deeppavlov.core.data.data_learning_iterator.DataLearningIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)
Dataset iterator for learning models, e.g. neural networks.
- Parameters
data – list of (x, y) pairs for every data type in 'train', 'valid' and 'test'
seed – random seed for data shuffling
shuffle – whether to shuffle data during batching
- shuffle
whether to shuffle data during batching
- random
instance of Random initialized with a seed
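A minimal sketch of how such data might be batched, assuming a free function named `gen_batches` (hypothetical, not the library's API) that takes the data dict, a data type, a batch size and a seeded `Random` instance:

```python
from random import Random
from typing import Any, Dict, Iterator, List, Tuple

Sample = Tuple[Any, Any]


def gen_batches(data: Dict[str, List[Sample]], data_type: str, batch_size: int,
                rng: Random, shuffle: bool = True) -> Iterator[Tuple[tuple, tuple]]:
    """Yield (xs, ys) batches for one data type, optionally shuffled."""
    samples = list(data[data_type])  # copy so the caller's list is not reordered
    if shuffle:
        rng.shuffle(samples)
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        xs, ys = zip(*batch)  # split the pairs into inputs and targets
        yield xs, ys


data = {"train": [(1, "a"), (2, "b"), (3, "c")], "valid": [], "test": []}
train_batches = list(gen_batches(data, "train", 2, Random(0), shuffle=False))
valid_batches = list(gen_batches(data, "valid", 2, Random(0)))
```

With `shuffle=False` the samples keep their original order, which is the usual choice for the 'valid' and 'test' splits.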
- class deeppavlov.core.data.simple_vocab.SimpleVocabulary(special_tokens: Tuple[str, ...] = (), max_tokens: int = 1073741824, min_freq: int = 0, pad_with_zeros: bool = False, unk_token: Optional[str] = None, freq_drop_load: Optional[bool] = None, *args, **kwargs)
Implements simple vocabulary.
- Parameters
special_tokens – tuple of tokens that shouldn’t be counted.
max_tokens – upper bound for number of tokens in the vocabulary.
min_freq – minimum number of occurrences a token needs to be included in the vocabulary (special tokens are exempt).
pad_with_zeros – if True, a batch of elements is padded with zeros up to the length of the longest element in the batch.
unk_token – label assigned to unknown tokens.
freq_drop_load – if True, then frequencies of tokens are set to min_freq on the model load.
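How `special_tokens`, `max_tokens`, `min_freq` and `unk_token` might interact can be sketched with `collections.Counter`. The helpers `build_vocab` and `encode` are hypothetical, and the exact rules here (special tokens placed first and counted toward `max_tokens`, unknown tokens mapped to the index of `unk_token`) are assumptions rather than the library's documented behaviour.

```python
from collections import Counter
from itertools import chain
from typing import Iterable, List, Optional, Tuple


def build_vocab(tokenized: Iterable[List[str]],
                special_tokens: Tuple[str, ...] = (),
                max_tokens: int = 2 ** 30,
                min_freq: int = 0) -> List[str]:
    """Return an index-to-token list: special tokens first, then by frequency."""
    counts = Counter(chain.from_iterable(tokenized))
    vocab = list(special_tokens)
    for token, freq in counts.most_common():
        if len(vocab) >= max_tokens:
            break  # assumption: the bound includes special tokens
        if token in special_tokens or freq < min_freq:
            continue
        vocab.append(token)
    return vocab


def encode(tokens: List[str], vocab: List[str],
           unk_token: Optional[str] = None) -> List[int]:
    """Map tokens to indices; unknown tokens fall back to unk_token if given."""
    index = {token: i for i, token in enumerate(vocab)}
    if unk_token is None:
        return [index[token] for token in tokens]  # raises KeyError if unknown
    return [index.get(token, index[unk_token]) for token in tokens]


corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(corpus, special_tokens=("<UNK>",), min_freq=2)
ids = encode(["a", "c"], vocab, unk_token="<UNK>")
```

Here "c" occurs only once, so `min_freq=2` drops it from the vocabulary and `encode` maps it to the index of "<UNK>".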