deeppavlov.models.preprocessors
class deeppavlov.models.preprocessors.assemble_embeddings_matrix.EmbeddingsMatrixAssembler(embedder: deeppavlov.models.embedders.abstract_embedder.Embedder, vocab: deeppavlov.core.data.simple_vocab.SimpleVocabulary, character_level: bool = False, emb_dim: Optional[int] = None, estimate_by_n: int = 10000, *args, **kwargs)

For a given vocabulary, assembles a matrix of embeddings obtained from some Embedder. The class can also assemble embeddings of characters, using averaged embeddings of the words that contain them.

Parameters
- embedder – an instance of a class that converts tokens to vectors, for example FasttextEmbedder or GloVeEmbedder
- vocab – an instance of SimpleVocabulary. The matrix of embeddings is assembled relying on every token in the vocabulary; its row indexing matches the vocabulary indexing.
- character_level – whether to perform assembling on the character level. In this case the matrix contains an embedding for every character, computed as the average of the embeddings of the words that contain this character.
- emb_dim – dimensionality of the resulting embeddings. If not None, it must be less than or equal to the dimensionality of the embeddings provided by the Embedder. Dimensionality reduction is performed by keeping the principal components of a PCA.
- estimate_by_n – how many samples to use to estimate the covariance matrix for PCA. 10000 is usually enough.

dim
    dimensionality of the embeddings (can be less than the dimensionality of the embeddings produced by the Embedder).
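To make the character-level option concrete, here is a minimal sketch of the averaging idea in plain numpy (the names words and word_vectors are illustrative, not part of the DeepPavlov API)::

    import numpy as np

    # Hypothetical word embeddings: three words, four-dimensional vectors.
    words = ["cat", "cart", "dog"]
    word_vectors = np.random.rand(3, 4)

    def char_embedding(ch):
        # Average the vectors of all words that contain the character.
        rows = [vec for word, vec in zip(words, word_vectors) if ch in word]
        return np.mean(rows, axis=0)

    print(char_embedding("a"))  # average of the "cat" and "cart" vectors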
class deeppavlov.models.preprocessors.capitalization.CapitalizationPreprocessor(pad_zeros: bool = True, *args, **kwargs)

A featurizer useful for the NER task. It detects the following patterns in words:
- no capitals
- single capital, single character
- single capital, multiple characters
- all capitals, multiple characters

Parameters
- pad_zeros – whether to pad the batch of capitalization features with zeros up to the maximal length or not.

dim
    dimensionality of the feature vectors produced by the featurizer.
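As a rough illustration of the four patterns, a plain-Python detector might look like this (the function is a sketch, not the class internals)::

    def capitalization_pattern(word):
        if len(word) == 1 and word.isupper():
            return "single capital, single character"
        if len(word) > 1 and word.isupper():
            return "all capitals, multiple characters"
        if word[:1].isupper() and word[1:].islower():
            return "single capital, multiple characters"
        return "no capitals"

    print(capitalization_pattern("Moscow"))  # single capital, multiple characters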
deeppavlov.models.preprocessors.capitalization.process_word(word: str, to_lower: bool = False, append_case: Optional[str] = None) → Tuple[str]

The method implements the following operations: it converts a word to a tuple of symbols (character splitting), optionally converts it to lowercase, and adds a capitalization label.

Parameters
- word – input word
- to_lower – whether to lowercase the word
- append_case – where to add the case mark: 'first' prepends it, 'last' appends it. The mark is '<FIRST_UPPER>' for a first capital and '<ALL_UPPER>' for all caps.

Returns
    a preprocessed word.

Example

>>> process_word(word="Zaman", to_lower=True, append_case="first")
('<FIRST_UPPER>', 'z', 'a', 'm', 'a', 'n')
>>> process_word(word="MSU", to_lower=True, append_case="last")
('m', 's', 'u', '<ALL_UPPER>')
class deeppavlov.models.preprocessors.capitalization.CharSplittingLowercasePreprocessor(to_lower: bool = True, append_case: str = 'first', *args, **kwargs)

A callable wrapper over process_word(). Takes as input a batch of tokenized sentences and returns a batch of preprocessed sentences.
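Assuming the component's standard callable interface, usage would look roughly like this (a sketch; the output follows the process_word() examples above)::

    from deeppavlov.models.preprocessors.capitalization import (
        CharSplittingLowercasePreprocessor,
    )

    prep = CharSplittingLowercasePreprocessor()  # to_lower=True, append_case='first'
    batch = [["Zaman", "MSU"]]
    # Each token becomes a tuple of lowercased characters with a case mark,
    # e.g. ('<FIRST_UPPER>', 'z', 'a', 'm', 'a', 'n') for "Zaman".
    print(prep(batch))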
class deeppavlov.models.preprocessors.char_splitter.CharSplitter(**kwargs)

This component transforms a batch of sequences of tokens into a batch of sequences of character sequences.
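The transformation can be sketched in plain Python (illustrative, not the component's source)::

    def split_chars(batch):
        # One character list per token, one token list per sample.
        return [[list(token) for token in tokens] for tokens in batch]

    print(split_chars([["Hello", "world"]]))
    # [[['H', 'e', 'l', 'l', 'o'], ['w', 'o', 'r', 'l', 'd']]]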
class deeppavlov.models.preprocessors.dirty_comments_preprocessor.DirtyCommentsPreprocessor(remove_punctuation: bool = True, *args, **kwargs)

Class that implements preprocessing of English texts with a low level of literacy, such as comments.
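The exact normalization steps are not documented here; the following is only a hedged sketch of typical noisy-comment cleanup (lowercasing, collapsing repeated characters, optionally stripping punctuation), with every step an assumption rather than the class's verified behavior::

    import re

    def clean_comment(text, remove_punctuation=True):
        text = text.lower()
        # Collapse runs of three or more identical characters ("sooooo" -> "soo").
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)
        if remove_punctuation:
            text = re.sub(r"[^\w\s]", " ", text)
        return " ".join(text.split())

    print(clean_comment("This is SOOOOO bad!!!"))  # 'this is soo bad'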
class deeppavlov.models.preprocessors.mask.Mask(*args, **kwargs)

Takes a batch of tokens and returns masks of the corresponding length.
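For a batch of variable-length token sequences, the produced masks look like this (a plain numpy sketch)::

    import numpy as np

    batch = [["a", "b", "c"], ["d"]]
    max_len = max(len(tokens) for tokens in batch)
    # 1.0 marks a real token, 0.0 marks padding.
    mask = np.array([[1.0] * len(t) + [0.0] * (max_len - len(t)) for t in batch])
    print(mask)
    # [[1. 1. 1.]
    #  [1. 0. 0.]]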
class deeppavlov.models.preprocessors.one_hotter.OneHotter(depth: int, pad_zeros: bool = False, single_vector=False, *args, **kwargs)

One-hot featurizer with zero-padding. If single_vector is set, returns a single vector per sample, which can have several elements equal to 1.

Parameters
- depth – the depth for one-hotting
- pad_zeros – whether to pad elements of the batch with zeros
- single_vector – whether to return one vector per sample (the sum of the individual one-hot vectors)
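A minimal numpy sketch of the two output modes (the function name is illustrative)::

    import numpy as np

    def one_hot(indices, depth, single_vector=False):
        vectors = np.eye(depth)[indices]  # one one-hot row per index
        return vectors.sum(axis=0) if single_vector else vectors

    print(one_hot([0, 2], depth=4))
    # [[1. 0. 0. 0.]
    #  [0. 0. 1. 0.]]
    print(one_hot([0, 2], depth=4, single_vector=True))
    # [1. 0. 1. 0.]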
class deeppavlov.models.preprocessors.random_embeddings_matrix.RandomEmbeddingsMatrix(vocab_len: int, emb_dim: int, *args, **kwargs)

Assembles a matrix of random embeddings.

Parameters
- vocab_len – length of the vocabulary (number of tokens in it)
- emb_dim – dimensionality of the embeddings

dim
    dimensionality of the embeddings.
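The equivalent construction in plain numpy (a sketch; the distribution used for initialization is an assumption, the actual class may differ)::

    import numpy as np

    vocab_len, emb_dim = 100, 32
    # Assumed initialization: small uniform values.
    emb_mat = np.random.uniform(-0.1, 0.1, size=(vocab_len, emb_dim))
    print(emb_mat.shape)  # (100, 32)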
class deeppavlov.models.preprocessors.russian_lemmatizer.PymorphyRussianLemmatizer(*args, **kwargs)

Class for lemmatization using PyMorphy.
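For reference, lemmatization with the pymorphy2 library looks like this (a sketch of the underlying library call, not of the class internals)::

    import pymorphy2

    morph = pymorphy2.MorphAnalyzer()
    # parse() returns hypotheses sorted by score; take the top normal form.
    print(morph.parse("стали")[0].normal_form)  # e.g. 'стать' (the form is ambiguous with 'сталь')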
class deeppavlov.models.preprocessors.sanitizer.Sanitizer(diacritical: bool = True, nums: bool = False, *args, **kwargs)

Removes all combining characters, such as diacritical marks, from tokens.

Parameters
- diacritical – whether to remove diacritical signs or not; diacritical signs are something like hats and stress marks
- nums – whether to replace all digits with 1 or not
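The same idea can be sketched with the standard unicodedata module (illustrative; the class's exact handling may differ)::

    import re
    import unicodedata

    def sanitize(token, diacritical=True, nums=False):
        if diacritical:
            # NFD splits characters into base + combining marks; drop the marks.
            decomposed = unicodedata.normalize("NFD", token)
            token = "".join(c for c in decomposed if not unicodedata.combining(c))
        if nums:
            token = re.sub(r"\d", "1", token)
        return token

    print(sanitize("café"))                  # 'cafe'
    print(sanitize("year 2022", nums=True))  # 'year 1111'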
class deeppavlov.models.preprocessors.siamese_preprocessor.SiamesePreprocessor(save_path: str = './tok.dict', load_path: str = './tok.dict', max_sequence_length: Optional[int] = None, dynamic_batch: bool = False, padding: str = 'post', truncating: str = 'post', use_matrix: bool = True, num_context_turns: int = 1, num_ranking_samples: int = 1, add_raw_text: bool = False, tokenizer: Optional[deeppavlov.core.models.component.Component] = None, vocab: Optional[deeppavlov.core.models.estimator.Estimator] = None, embedder: Optional[deeppavlov.core.models.component.Component] = None, sent_vocab: Optional[deeppavlov.core.models.estimator.Estimator] = None, **kwargs)

Preprocessing of data samples containing text strings, to feed them into a siamese network. The first num_context_turns strings in each data sample correspond to the dialogue context, and the remaining string(s) in the sample are the response(s); a concrete sample layout is sketched after the parameter list.

Parameters
- save_path – the parameter is only needed to initialize the base class Serializable.
- load_path – the parameter is only needed to initialize the base class Serializable.
- max_sequence_length – a maximum length of text sequences in tokens. Longer sequences will be truncated and shorter ones will be padded.
- dynamic_batch – whether to use dynamic batching. If True, the maximum length of a sequence for a batch will be equal to the maximum of all sequence lengths in this batch, but not higher than max_sequence_length.
- padding – padding mode. Possible values are pre and post. If set to pre, a sequence will be padded at the beginning; if set to post, it will be padded at the end.
- truncating – truncating mode. Possible values are pre and post. If set to pre, a sequence will be truncated at the beginning; if set to post, it will be truncated at the end.
- use_matrix – whether to use a trainable matrix with token (word) embeddings.
- num_context_turns – the number of context turns in data samples.
- num_ranking_samples – the number of candidates for ranking, including the positive one.
- add_raw_text – whether to add raw text sentences to the output data list or not. Use in conjunction with models that use sentence encoders.
- tokenizer – an instance of one of the deeppavlov.models.tokenizers.
- vocab – an instance of deeppavlov.core.data.simple_vocab.SimpleVocabulary.
- embedder – an instance of one of the deeppavlov.models.embedders.
- sent_vocab – an instance of deeppavlov.core.data.simple_vocab.SimpleVocabulary. It is used to store all responses and to find the best response to the user context in the interact mode.
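To make the sample layout concrete: with num_context_turns=2 and two ranking candidates, a data sample is structured like this (an illustrative literal, not actual dataset content)::

    num_context_turns = 2

    sample = [
        "Hi, can you help me?",          # context turn 1
        "Sure, what do you need?",       # context turn 2
        "I need to reset my password.",  # response candidate 1 (the positive one)
        "The weather is nice today.",    # response candidate 2
    ]
    context = sample[:num_context_turns]
    responses = sample[num_context_turns:]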
deeppavlov.models.preprocessors.str_lower.str_lower(batch: Union[str, list, tuple])

Recursively searches for strings in a list and converts them to lowercase.

Parameters
- batch – a string or a list containing strings at some level of nesting

Returns
    the same structure where all strings are converted to lowercase
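A sketch of the recursive behavior (an illustrative re-implementation, not the library source)::

    def str_lower(batch):
        if isinstance(batch, str):
            return batch.lower()
        # Preserve the nesting structure, lowercasing only the strings.
        return [str_lower(item) for item in batch]

    print(str_lower(["Hello", ["World", "FOO"]]))  # ['hello', ['world', 'foo']]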
class deeppavlov.models.preprocessors.str_token_reverser.StrTokenReverser(tokenized: bool = False, *args, **kwargs)

Component for converting strings to strings with reversed token positions.

Parameters
- tokenized – the parameter is only needed to reverse tokenized strings.
__call__(batch: Union[str, list, tuple]) → Union[List[str], List[Union[List[str], List[StrTokenReverserInfo]]]]

Recursively searches for strings in a list and converts them to strings with reversed token positions.

Parameters
- batch – a string or a list containing strings

Returns
    the same structure where the tokens of every string are reversed
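For illustration, the reversal for a single whitespace-tokenized string works like this (a sketch of the behavior, not the component's source)::

    def reverse_tokens(text):
        return " ".join(reversed(text.split()))

    print(reverse_tokens("my name is John"))  # 'John is name my'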
class deeppavlov.models.preprocessors.str_utf8_encoder.StrUTF8Encoder(max_word_length: int = 50, pad_special_char_use: bool = False, word_boundary_special_char_use: bool = False, sentence_boundary_special_char_use: bool = False, reversed_sentense_tokens: bool = False, bos: str = '<S>', eos: str = '</S>', **kwargs)

Component for encoding all strings to UTF-8 codes.

Parameters
- max_word_length – max length of words of input and output batches.
- pad_special_char_use – whether to use a special char for padding or not.
- word_boundary_special_char_use – whether to mark word boundaries with special chars or not.
- sentence_boundary_special_char_use – whether to mark sentence boundaries with special chars or not.
- reversed_sentense_tokens – whether to use reversed sequences of tokens or not.
- bos – name of the special token for the beginning of a sentence.
- eos – name of the special token for the end of a sentence.
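The core idea, encoding each word as its UTF-8 byte values up to max_word_length, can be sketched as follows (the padding value and scheme are assumptions for illustration)::

    def encode_word(word, max_word_length=50, pad_value=0):
        codes = list(word.encode("utf-8"))[:max_word_length]
        # Assumed padding: fill with pad_value up to max_word_length.
        return codes + [pad_value] * (max_word_length - len(codes))

    print(encode_word("hi", max_word_length=5))  # [104, 105, 0, 0, 0]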
class deeppavlov.models.preprocessors.odqa_preprocessors.DocumentChunker(sentencize_fn: Callable = nltk.sent_tokenize, keep_sentences: bool = True, tokens_limit: int = 400, flatten_result: bool = False, paragraphs: bool = False, number_of_paragraphs: int = -1, *args, **kwargs)

Makes chunks from a document or a list of documents. Can avoid tearing up sentences if needed.

Parameters
- sentencize_fn – a function for sentence segmentation
- keep_sentences – whether to avoid tearing up sentences between chunks or not
- tokens_limit – the number of tokens in a single chunk (usually this number corresponds to the squad model limit)
- flatten_result – whether to flatten the resulting list of lists of chunks
- paragraphs – whether to split the document by paragraphs; if set to True, tokens_limit is ignored

keep_sentences
    whether to avoid tearing up sentences between chunks or not

tokens_limit
    the number of tokens in a single chunk

flatten_result
    whether to flatten the resulting list of lists of chunks

paragraphs
    whether to split the document by paragraphs; if set to True, tokens_limit is ignored
__call__(batch_docs: List[Union[List[str], str]], batch_docs_ids: Optional[List[Union[List[str], str]]] = None) → Union[Tuple[Union[List[str], List[List[str]]], Union[List[str], List[List[str]]]], List[str], List[List[str]]]

Makes chunks from a batch of documents. There can be several documents in each batch.

Parameters
- batch_docs – a batch of documents / a batch of lists of documents
- batch_docs_ids (optional) – a batch of document ids / a batch of lists of document ids

Returns
    chunks of docs, flattened or not, and chunks of doc ids, flattened or not, if batch_docs_ids were passed
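A minimal sketch of sentence-preserving chunking under a token limit, in the spirit of the defaults above (illustrative, not the component's source; requires the nltk punkt data to be downloaded)::

    from nltk import sent_tokenize

    def chunk_document(doc, tokens_limit=400):
        chunks, current, current_len = [], [], 0
        for sent in sent_tokenize(doc):
            n_tokens = len(sent.split())
            # Start a new chunk instead of splitting the sentence.
            if current and current_len + n_tokens > tokens_limit:
                chunks.append(" ".join(current))
                current, current_len = [], 0
            current.append(sent)
            current_len += n_tokens
        if current:
            chunks.append(" ".join(current))
        return chunks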