deeppavlov.models.preprocessors¶
class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(remove_punctuation: bool = True, *args, **kwargs)[source]¶

  Class implements preprocessing of English texts with a low level of literacy, such as comments.
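The exact cleanup rules are not listed on this page, so the following is only an illustrative sketch of the kind of normalization such a preprocessor could apply — lowercasing, collapsing repeated characters, and optional punctuation removal. The `preprocess_dirty_comment` helper and its rules are hypothetical, not the component's actual implementation:

```python
import re

def preprocess_dirty_comment(text: str, remove_punctuation: bool = True) -> str:
    """Illustrative cleanup for noisy comment text (hypothetical rules)."""
    text = text.lower()
    # collapse runs of 3+ identical characters, e.g. "sooooo" -> "soo"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    if remove_punctuation:
        # drop everything that is not a word character or whitespace
        text = re.sub(r"[^\w\s]", " ", text)
    # normalize whitespace
    return " ".join(text.split())
```

A call like `preprocess_dirty_comment("This is SOOOO bad!!!")` would yield `"this is soo bad"` under these assumed rules.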
class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)[source]¶

  Takes a batch of tokens and returns masks of corresponding length.
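The documented behavior — a 0/1 mask per sample, padded to the longest sample in the batch — can be sketched in plain NumPy (the `mask` function below is a hypothetical stand-in, not the component itself):

```python
import numpy as np

def mask(tokens_batch):
    """Return a 0/1 mask per sample, zero-padded to the longest sample."""
    max_len = max(len(toks) for toks in tokens_batch)
    return np.array(
        [[1] * len(toks) + [0] * (max_len - len(toks)) for toks in tokens_batch],
        dtype=np.float32,
    )
```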
class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = False, single_vector=False, *args, **kwargs)[source]¶

  One-hot featurizer with zero-padding. If single_vector, returns a single vector per sample, which can have several elements equal to 1.

  - Parameters
    depth – the depth for one-hotting
    pad_zeros – whether to pad elements of the batch with zeros
    single_vector – whether to return one vector per sample (the sum of its one-hot vectors)
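A minimal sketch of the described behavior, assuming each sample is a list of integer label indices (the `one_hot` helper is illustrative, not the component's actual code):

```python
import numpy as np

def one_hot(batch, depth, pad_zeros=False, single_vector=False):
    """Illustrative one-hot featurizer; batch is a list of label-index lists."""
    if single_vector:
        # one vector per sample: the union of its one-hot vectors,
        # so several elements can be equal to 1
        out = np.zeros((len(batch), depth), dtype=np.float32)
        for i, labels in enumerate(batch):
            out[i, labels] = 1.0
        return out
    max_len = max(len(labels) for labels in batch)
    result = []
    for labels in batch:
        # with pad_zeros, every sample is padded to the batch's max length
        rows = max_len if pad_zeros else len(labels)
        sample = np.zeros((rows, depth), dtype=np.float32)
        for j, label in enumerate(labels):
            sample[j, label] = 1.0
        result.append(sample)
    return result
```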
class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical: bool = True, nums: bool = False, *args, **kwargs)[source]¶

  Remove all combining characters, such as diacritical marks, from tokens.

  - Parameters
    diacritical – whether to remove diacritical signs (e.g. hats and stress marks)
    nums – whether to replace all digits with 1
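The described behavior maps naturally onto the standard library's unicodedata module: decompose each token, then drop combining marks (Unicode category Mn). The `sanitize` helper below is an illustrative sketch, not the component's actual code:

```python
import re
import unicodedata

def sanitize(tokens, diacritical=True, nums=False):
    """Strip combining characters from tokens; optionally replace digits with 1."""
    out = []
    for tok in tokens:
        if diacritical:
            # NFD decomposition separates base characters from combining marks,
            # which all fall into category "Mn" and can then be filtered out
            tok = "".join(
                ch for ch in unicodedata.normalize("NFD", tok)
                if unicodedata.category(ch) != "Mn"
            )
        if nums:
            tok = re.sub(r"\d", "1", tok)
        out.append(tok)
    return out
```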
deeppavlov.models.preprocessors.str_lower.str_lower(batch: Union[str, list, tuple])[source]¶

  Recursively search for strings in a list and convert them to lowercase.

  - Parameters
    batch – a string or a list containing strings at some level of nesting
  - Returns
    the same structure where all strings are converted to lowercase
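The documented recursion can be sketched in a few lines (a simplified stand-in for the library function; the actual implementation may differ in how it handles tuples):

```python
def str_lower(batch):
    """Lowercase every string found at any nesting level of batch."""
    if isinstance(batch, str):
        return batch.lower()
    # recurse into nested lists, preserving the structure
    return [str_lower(item) for item in batch]
```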
class deeppavlov.models.preprocessors.str_token_reverser.StrTokenReverser(tokenized: bool = False, *args, **kwargs)[source]¶

  Component for converting strings to strings with reversed token positions.

  - Parameters
    tokenized – The parameter is only needed to reverse tokenized strings.

  __call__(batch: Union[str, list, tuple]) → Union[List[str], List[Union[List[str], List[StrTokenReverserInfo]]]][source]¶

    Recursively search for strings in a list and convert them to strings with reversed token positions.

    - Parameters
      batch – a string or a list containing strings
    - Returns
      the same structure where all string tokens are reversed
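For plain (non-tokenized) strings, the documented behavior can be sketched with a simple whitespace tokenizer — an assumption; the component may tokenize differently. The `reverse_tokens` helper is illustrative only:

```python
def reverse_tokens(batch):
    """Reverse token positions in every string found in a nested structure."""
    if isinstance(batch, str):
        # split on whitespace, reverse the token order, and rejoin
        return " ".join(reversed(batch.split()))
    # recurse into nested lists, preserving the structure
    return [reverse_tokens(item) for item in batch]
```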
class deeppavlov.models.preprocessors.str_utf8_encoder.StrUTF8Encoder(max_word_length: int = 50, pad_special_char_use: bool = False, word_boundary_special_char_use: bool = False, sentence_boundary_special_char_use: bool = False, reversed_sentense_tokens: bool = False, bos: str = '<S>', eos: str = '</S>', **kwargs)[source]¶

  Component for encoding all strings to UTF-8 codes.

  - Parameters
    max_word_length – Max length of words of input and output batches.
    pad_special_char_use – Whether to use a special char for padding or not.
    word_boundary_special_char_use – Whether to add word boundaries by special chars or not.
    sentence_boundary_special_char_use – Whether to add sentence boundaries by special chars or not.
    reversed_sentense_tokens – Whether to use reversed sequences of tokens or not.
    bos – Name of the special token for the beginning of a sentence.
    eos – Name of the special token for the end of a sentence.
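A rough sketch of per-token UTF-8 encoding with truncation and optional padding. The special-char codes the component actually uses are not documented on this page, so `pad_code=260` below is a made-up placeholder (any value outside the 0–255 byte range would do):

```python
def utf8_encode_token(token, max_word_length=50,
                      pad_special_char_use=False, pad_code=260):
    """Encode one token as a list of UTF-8 byte values of fixed maximum length."""
    # each character becomes one or more byte values in 0..255
    codes = list(token.encode("utf-8"))[:max_word_length]
    if pad_special_char_use:
        # pad with a special out-of-byte-range code up to max_word_length
        codes += [pad_code] * (max_word_length - len(codes))
    return codes
```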
class deeppavlov.models.preprocessors.odqa_preprocessors.DocumentChunker(sentencize_fn: Callable = nltk.sent_tokenize, keep_sentences: bool = True, tokens_limit: int = 400, flatten_result: bool = False, paragraphs: bool = False, number_of_paragraphs: int = -1, *args, **kwargs)[source]¶

  Make chunks from a document or a list of documents. Doesn't tear up sentences if keep_sentences is set.

  - Parameters
    sentencize_fn – a function for sentence segmentation
    keep_sentences – whether to avoid tearing up sentences between chunks
    tokens_limit – the number of tokens in a single chunk (usually this number corresponds to the squad model limit)
    flatten_result – whether to flatten the resulting list of lists of chunks
    paragraphs – whether to split the document by paragraphs; if set to True, tokens_limit is ignored
  keep_sentences¶
    whether to avoid tearing up sentences between chunks
  tokens_limit¶
    the number of tokens in a single chunk
  flatten_result¶
    whether to flatten the resulting list of lists of chunks
  paragraphs¶
    whether to split the document by paragraphs; if set to True, tokens_limit is ignored
  __call__(batch_docs: List[Union[str, List[str]]], batch_docs_ids: Optional[List[Union[str, List[str]]]] = None) → Union[Tuple[Union[List[str], List[List[str]]], Union[List[str], List[List[str]]]], List[str], List[List[str]]][source]¶

    Make chunks from a batch of documents. There can be several documents in each batch.

    - Parameters
      batch_docs – a batch of documents / a batch of lists of documents
      batch_docs_ids (optional) – a batch of document ids / a batch of lists of document ids
    - Returns
      chunks of docs (flattened or not) and, if batch_docs_ids were passed, chunks of doc ids (flattened or not)
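A simplified sketch of chunking a single document under the keep_sentences-style behavior: whole sentences are accumulated until the token limit would be exceeded. Whitespace token counting and a naive regex sentence splitter stand in for the component's real tokenization and nltk.sent_tokenize; the `chunk_document` helper is illustrative only:

```python
import re

def chunk_document(doc, tokens_limit=400, sentencize_fn=None):
    """Split a document into chunks of at most tokens_limit whitespace tokens,
    keeping whole sentences together."""
    if sentencize_fn is None:
        # naive regex splitter standing in for nltk.sent_tokenize
        sentencize_fn = lambda text: re.split(r"(?<=[.!?])\s+", text)
    chunks, current, current_len = [], [], 0
    for sent in sentencize_fn(doc):
        n = len(sent.split())
        # start a new chunk if adding this sentence would exceed the limit
        if current and current_len + n > tokens_limit:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```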