deeppavlov.models.tokenizers¶
- class deeppavlov.models.tokenizers.nltk_moses_tokenizer.NLTKMosesTokenizer(escape: bool = False, *args, **kwargs)[source]¶
Class for splitting texts into tokens using the NLTK wrapper over MosesTokenizer
- escape¶
whether to escape characters for use in HTML markup
- tokenizer¶
tokenizer instance from nltk.tokenize.moses
- detokenizer¶
detokenizer instance from nltk.tokenize.moses
- Parameters
escape – whether to escape characters for use in HTML markup
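When escape is enabled, markup-sensitive characters in tokens are replaced with HTML entities. A minimal sketch of Moses-style escaping (the mapping below is illustrative, drawn from Moses tokenizer conventions, not DeepPavlov's own code):

```python
# Moses-style HTML-entity escaping (illustrative mapping, not DeepPavlov's API)
MOSES_ESCAPES = {
    "&": "&amp;",   # must be replaced first to avoid double-escaping
    "|": "&#124;",
    "<": "&lt;",
    ">": "&gt;",
    "'": "&apos;",
    '"': "&quot;",
    "[": "&#91;",
    "]": "&#93;",
}

def moses_escape(token: str) -> str:
    """Replace markup-sensitive characters with their HTML entities."""
    for char, entity in MOSES_ESCAPES.items():
        token = token.replace(char, entity)
    return token
```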
- class deeppavlov.models.tokenizers.nltk_tokenizer.NLTKTokenizer(tokenizer: str = 'wordpunct_tokenize', download: bool = False, *args, **kwargs)[source]¶
Class for splitting texts into tokens using NLTK
- Parameters
tokenizer – tokenization mode for nltk.tokenize
download – whether to download nltk data
- tokenizer¶
tokenizer instance from nltk.tokenize
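The default mode, wordpunct_tokenize, splits text into runs of alphanumeric characters and runs of punctuation. A rough stdlib-only equivalent of its behavior (a sketch of the regex NLTK documents for this tokenizer, not a DeepPavlov function):

```python
import re

def wordpunct_sketch(text: str) -> list:
    """Rough equivalent of NLTK's wordpunct_tokenize: match runs of
    word characters or runs of non-word, non-space characters."""
    return re.findall(r"\w+|[^\w\s]+", text)
```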
- class deeppavlov.models.tokenizers.split_tokenizer.SplitTokenizer(**kwargs)[source]¶
Generates utterance tokens using Python's built-in str.split(). Doesn't have any parameters.
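The behavior described above amounts to whitespace-splitting each utterance in a batch; a minimal standalone sketch (a hypothetical helper, not the class itself):

```python
from typing import List

def split_tokenize(batch: List[str]) -> List[List[str]]:
    """Whitespace-split each utterance, as SplitTokenizer does via str.split()."""
    return [utterance.split() for utterance in batch]
```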
- class deeppavlov.models.tokenizers.spacy_tokenizer.StreamSpacyTokenizer(disable: Optional[Iterable[str]] = None, filter_stopwords: bool = False, batch_size: Optional[int] = None, ngram_range: Optional[List[int]] = None, lemmas: bool = False, lowercase: Optional[bool] = None, alphas_only: Optional[bool] = None, spacy_model: str = 'en_core_web_sm', **kwargs)[source]¶
Tokenize or lemmatize a list of documents. The default spaCy model is en_core_web_sm. Returns a list of tokens or lemmas for each document. If called on a batch of token lists (List[List[str]]), performs detokenizing instead.
- Parameters
disable – spaCy pipeline components to disable, for performance; if None, no components are disabled
filter_stopwords – whether to ignore stopwords during tokenizing/lemmatizing and ngrams creation
batch_size – a batch size for spaCy buffering
ngram_range – size of ngrams to create; only unigrams are returned by default
lemmas – whether to perform lemmatizing or not
lowercase – whether to perform lowercasing or not; is performed by default by the _tokenize() and _lemmatize() methods
alphas_only – whether to filter out non-alpha tokens; is performed by default by the _filter() method
spacy_model – a string name of the spaCy model to use; DeepPavlov searches for this name among downloaded spaCy models; the default model is en_core_web_sm, which is downloaded automatically during DeepPavlov installation
- stopwords¶
a list of stopwords that should be ignored during tokenizing/lemmatizing and ngrams creation
- model¶
a loaded spacy model
- batch_size¶
a batch size for spaCy buffering
- ngram_range¶
size of ngrams to create; only unigrams are returned by default
- lemmas¶
whether to perform lemmatizing or not
- lowercase¶
whether to perform lowercasing or not; is performed by default by the _tokenize() and _lemmatize() methods
- alphas_only¶
whether to filter out non-alpha tokens; is performed by default by the _filter() method
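With ngram_range set to, say, [1, 2], the tokenizer would yield unigrams and bigrams. A minimal sketch of ngram creation over a token list (an illustrative helper, not the class's internal code):

```python
from typing import List

def make_ngrams(tokens: List[str], ngram_range: List[int]) -> List[str]:
    """Build space-joined ngrams for every size from ngram_range[0]
    through ngram_range[1] inclusive."""
    lo, hi = ngram_range
    ngrams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams
```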
- __call__(batch: Union[List[str], List[List[str]]]) Union[List[List[str]], List[str]] [source]¶
Tokenize or detokenize strings, depending on the type of the passed argument.
- Parameters
batch – a batch of documents to perform tokenizing/lemmatizing; or a batch of lists of tokens/lemmas to perform detokenizing
- Returns
a batch of lists of tokens/lemmas; or a batch of detokenized strings
- Raises
TypeError – If the first element of batch is neither List nor str.
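The dispatch described above can be sketched in a standalone form: inspect the first element of the batch, tokenize if it is a str, detokenize if it is a list, and raise TypeError otherwise (the split/join bodies below are stand-ins for the real spaCy-backed processing):

```python
from typing import List, Union

def dispatch(batch: Union[List[str], List[List[str]]]) -> Union[List[List[str]], List[str]]:
    """Mimic __call__'s type-based dispatch: str elements are tokenized,
    list elements are detokenized; anything else raises TypeError."""
    if isinstance(batch[0], str):
        return [doc.split() for doc in batch]           # stand-in for tokenizing
    if isinstance(batch[0], list):
        return [" ".join(tokens) for tokens in batch]   # stand-in for detokenizing
    raise TypeError(f"Unsupported type of batch element: {type(batch[0])}")
```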