dataset_readers
Concrete DatasetReader classes.
-
class deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader
Reads classification datasets from .csv or .json files.
-
read(data_path: str, url: Optional[str] = None, format: str = 'csv', class_sep: Optional[str] = None, *args, **kwargs) → dict
Read the dataset from the data_path directory. The files read are named data_type + extension for every entry in data_types (e.g. for data_types=["train", "valid"], the files "train.csv" and "valid.csv" from data_path are read).
- Parameters
data_path – directory with files
url – download the data files if data_path does not exist or is empty
format – extension of the files. Allowed values: "csv", "json"
class_sep – string separator of labels in the column with labels. Default: None -> only one class per sample
sep (str) – delimiter for "csv" files
header (int) – row number to use as the column names
names (array) – list of column names to use
orient (str) – indication of expected JSON string format
lines (boolean) – read the file as a JSON object per line. Default: False
- Returns
a dictionary keyed by the types from data_types. Each value is a list of tuples (x_i, y_i)
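The file naming and the effect of class_sep described above can be sketched as follows. This is a minimal illustration built on the standard csv module, not the reader's actual implementation; the function name read_classification_csv and the column names "text" and "labels" are assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional, Tuple, Union

Sample = Tuple[str, Union[str, List[str]]]


def read_classification_csv(
    data_path: str,
    data_types: Tuple[str, ...] = ("train", "valid"),
    x_col: str = "text",      # hypothetical name of the column with inputs
    y_col: str = "labels",    # hypothetical name of the column with labels
    class_sep: Optional[str] = None,
) -> Dict[str, List[Sample]]:
    """Read <data_type>.csv files from data_path into lists of (x_i, y_i)."""
    data: Dict[str, List[Sample]] = {}
    for data_type in data_types:
        file_path = Path(data_path) / f"{data_type}.csv"
        samples: List[Sample] = []
        if file_path.exists():
            with file_path.open(newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    label = row[y_col]
                    # With class_sep=None each sample keeps a single label;
                    # otherwise the label cell is split into a list of labels.
                    y = label if class_sep is None else label.split(class_sep)
                    samples.append((row[x_col], y))
        # Missing files simply yield an empty list in this sketch.
        data[data_type] = samples
    return data
```

Calling read_classification_csv("some_dir", class_sep="|") on a directory holding train.csv would return a dictionary whose "train" value is a list of (text, [label, ...]) tuples.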
-
-
class deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader
Class to read training datasets in CoNLL-2003 format
-
class deeppavlov.dataset_readers.faq_reader.FaqDatasetReader
Reader for FAQ dataset
-
read(data_path: Optional[str] = None, data_url: Optional[str] = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict
Read the FAQ dataset from the specified csv file or remote URL
- Parameters
data_path – path to csv file of FAQ
data_url – url to csv file of FAQ
x_col_name – name of Question column in csv file
y_col_name – name of Answer column in csv file
- Returns
A dictionary containing the training, validation and test parts of the dataset, obtainable via the train, valid and test keys.
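The shape of the returned dictionary can be sketched as below. This is a hedged illustration only: the helper read_faq_csv is hypothetical, it takes csv text instead of a path or URL to stay self-contained, and placing every question-answer pair in train while leaving valid and test empty is an assumption, not something the description above states.

```python
import csv
import io
from typing import Dict, List, Tuple


def read_faq_csv(
    csv_text: str, x_col_name: str = "x", y_col_name: str = "y"
) -> Dict[str, List[Tuple[str, str]]]:
    """Turn FAQ csv text into a dict with train/valid/test keys."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # x_col_name selects the Question column, y_col_name the Answer column.
    pairs = [(row[x_col_name], row[y_col_name]) for row in rows]
    # Assumption: all pairs go to train; valid and test stay empty.
    return {"train": pairs, "valid": [], "test": []}


faq = read_faq_csv("x,y\nHow do I reset my password?,Use the reset link.\n")
```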
-
-
class deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader
Reads the paraphraser.ru dataset from files.
See https://paraphraser.ru.
-
class deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader
Downloads dataset files and prepares train/valid split.
SQuAD: Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/
SQuAD2.0: Stanford Question Answering Dataset, version 2.0 https://rajpurkar.github.io/SQuAD-explorer/
SberSQuAD: Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html
MultiSQuAD: SQuAD dataset with additional contexts retrieved (by tfidf) from original Wikipedia article.
MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by tfidf document ranker from full Wikipedia.
-
read(dir_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]]
- Parameters
dir_path – path to save data
dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'
url – link to an archive with the dataset; use this argument if a non-default dataset is used
- Returns
the dataset split into train/valid
- Raises
RuntimeError – if dataset is not one of these: 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
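One way to picture how the dataset and url arguments interact is the sketch below. It is an assumption about the control flow, not the reader's code: resolve_dataset_url is a hypothetical helper, the URLs are placeholders because the real archive locations are not given here, and letting an explicit url take precedence over the dataset name is a guess based on the parameter descriptions.

```python
from typing import Dict, Optional

# Placeholder archive locations; the real download URLs are not given in the text.
DEFAULT_URLS: Dict[str, str] = {
    "SQuAD": "https://example.invalid/squad.tar.gz",
    "SberSQuAD": "https://example.invalid/sber_squad.tar.gz",
    "MultiSQuAD": "https://example.invalid/multi_squad.tar.gz",
}


def resolve_dataset_url(
    dataset: Optional[str] = "SQuAD", url: Optional[str] = None
) -> str:
    """Pick the archive URL for a default dataset name or a custom url."""
    # Assumption: an explicit url is how a non-default dataset is supplied.
    if url is not None:
        return url
    if dataset in DEFAULT_URLS:
        return DEFAULT_URLS[dataset]
    # Unknown name and no url: mirror the documented RuntimeError.
    raise RuntimeError(f"Dataset {dataset} is unknown")
```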
-
-
class deeppavlov.dataset_readers.typos_reader.TyposCustom
Base class for reading spelling corrections dataset files
-
static build(data_path: str) → pathlib.Path
Base method that interprets the data_path argument.
- Parameters
data_path – path to the tsv-file containing erroneous and corrected words
- Returns
the same path as a Path object
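As a rough sketch of what this base behaviour amounts to, the snippet below converts the path and then reads word pairs from the tsv file. The read_pairs helper is hypothetical, and the column order (erroneous word first, corrected word second) is an assumption rather than a documented format.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def build(data_path: str) -> Path:
    """Interpret data_path: return the same path as a Path object."""
    return Path(data_path)


def read_pairs(tsv_path: Path) -> List[Tuple[str, str]]:
    """Read (erroneous, corrected) word pairs, one tab-separated pair per row."""
    with tsv_path.open(newline="", encoding="utf-8") as f:
        # Assumption: erroneous word in column 0, corrected word in column 1.
        return [
            (row[0], row[1])
            for row in csv.reader(f, delimiter="\t")
            if len(row) >= 2  # skip blank or malformed rows
        ]
```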
-
-
class deeppavlov.dataset_readers.typos_reader.TyposKartaslov
Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov
-
static build(data_path: str) → pathlib.Path
Download the misspellings list from GitHub
- Parameters
data_path – target directory to download the data to
- Returns
path to the resulting csv-file
-
-
class deeppavlov.dataset_readers.typos_reader.TyposWikipedia
Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings
-
static build(data_path: str) → pathlib.Path
Download and parse the common misspellings list from Wikipedia
- Parameters
data_path – target directory to download the data to
- Returns
path to the resulting tsv-file
-
-
class deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader
Reads the Ubuntu V2 dataset from csv files.
See https://github.com/rkadlec/ubuntu-ranking-dataset-creator.
-
read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]]
Read the Ubuntu V2 dataset from csv files.
- Parameters
data_path – A path to a folder with dataset csv files.
positive_samples – if True, only positive context-response pairs are kept in the train set
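The effect of positive_samples can be illustrated with a small filter over samples shaped like the return type, Tuple[List[str], int]. This is a sketch under assumptions: filter_train is a hypothetical helper, and treating label 1 as "positive" is assumed rather than stated above.

```python
from typing import List, Tuple

# One sample: ([context, response], label), matching Tuple[List[str], int].
Sample = Tuple[List[str], int]


def filter_train(samples: List[Sample], positive_samples: bool = False) -> List[Sample]:
    """Keep only positive context-response pairs when positive_samples is True."""
    if not positive_samples:
        return samples
    # Assumption: label 1 marks a positive (matching) context-response pair.
    return [sample for sample in samples if sample[1] == 1]


train = [
    (["how do I list files?", "use ls"], 1),
    (["how do I list files?", "reboot the router"], 0),
]
positives = filter_train(train, positive_samples=True)
```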
-