deeppavlov.models.morpho_tagger
-
class
deeppavlov.models.morpho_tagger.morpho_tagger.
MorphoTagger
(symbols: deeppavlov.core.data.simple_vocab.SimpleVocabulary, tags: deeppavlov.core.data.simple_vocab.SimpleVocabulary, save_path: Union[str, pathlib.Path, None] = None, load_path: Union[str, pathlib.Path, None] = None, mode: str = 'infer', word_rnn: str = 'cnn', char_embeddings_size: int = 16, char_conv_layers: int = 1, char_window_size: Union[int, List[int]] = 5, char_filters: Union[int, List[int]] = None, char_filter_multiple: int = 25, char_highway_layers: int = 1, conv_dropout: float = 0.0, highway_dropout: float = 0.0, intermediate_dropout: float = 0.0, lstm_dropout: float = 0.0, word_vectorizers: List[Tuple[int, int]] = None, word_lstm_layers: int = 1, word_lstm_units: Union[int, List[int]] = 128, word_dropout: float = 0.0, regularizer: float = None, verbose: int = 1, **kwargs)[source]¶ A class for character-based neural morphological tagger
- Parameters
symbols – character vocabulary
tags – morphological tags vocabulary
save_path – the path where model is saved
load_path – the path from where model is loaded
mode – usage mode
word_rnn – the type of character-level network (only cnn implemented)
char_embeddings_size – the size of character embeddings
char_conv_layers – the number of convolutional layers on character level
char_window_size – the width of the convolutional filter(s). It can be a list if several parallel filters are applied, for example, [2, 3, 4, 5].
char_filters – the number of convolutional filters for each window width. It can be a number, a list (when a single convolutional layer has several windows of different widths), a list of lists (when there is more than one convolutional layer), or None. If None, a window of width w gets min(char_filter_multiple * w, 200) filters.
char_filter_multiple – the ratio between the number of filters and the window width (used when char_filters is None)
char_highway_layers – the number of highway layers on character level
conv_dropout – the ratio of dropout between convolutional layers
highway_dropout – the ratio of dropout between highway layers
intermediate_dropout – the ratio of dropout between convolutional and highway layers on character level
lstm_dropout – dropout ratio in word-level LSTM
word_vectorizers – a list of parameters for additional word-level vectorizers; for each vectorizer it stores a pair: the vectorizer dimension and the dimension of the corresponding word embedding
word_lstm_layers – the number of word-level LSTM layers
word_lstm_units – hidden dimensions of word-level LSTMs
word_dropout – the ratio of dropout before word level (it is applied to word embeddings)
regularizer – l2 regularization parameter
verbose – the level of verbosity
A subclass of KerasModel.
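The default rule for char_filters described above can be illustrated with a small sketch. It is not the library's code: the helper name default_char_filters and the cap argument are hypothetical, and only the formula min(char_filter_multiple * width, 200) is taken from the parameter description.

```python
from typing import List, Union


def default_char_filters(
    char_window_size: Union[int, List[int]],
    char_filter_multiple: int = 25,
    cap: int = 200,  # upper bound taken from the documented formula
) -> List[int]:
    """Hypothetical helper: filter counts used when char_filters is None."""
    # A single integer width is treated as a one-element list of widths.
    if isinstance(char_window_size, int):
        widths = [char_window_size]
    else:
        widths = list(char_window_size)
    # One filter count per window width, capped at `cap`.
    return [min(char_filter_multiple * width, cap) for width in widths]


print(default_char_filters([2, 3, 4, 5]))  # [50, 75, 100, 125]
print(default_char_filters(10))            # [200], the cap applies
```

With the default char_filter_multiple of 25, the cap of 200 only matters for window widths above 8.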
-
__call__
(*x_batch: numpy.ndarray, **kwargs) → Union[List, numpy.ndarray][source]¶ Predicts answers on batch elements.
- Parameters
x_batch – a batch to predict answers on. It can be either a single array for a basic model or a sequence of arrays for a complex one (configuration file or its lemmatized version).
-
load
() → None[source]¶ Checks whether the model file exists and, if it does, loads the model weights from it.
-
predict_on_batch
(data: Union[List[numpy.ndarray], Tuple[numpy.ndarray]], return_indexes: bool = False) → List[List[str]][source]¶ Makes predictions on a single batch
- Parameters
data – model inputs for a single batch. data[0] contains input character encodings and is the only element of data for most models. Subsequent elements of data include the output of additional vectorizers, e.g., a dictionary-based one.
return_indexes – whether to return tag indexes in vocabulary or the tags themselves
- Returns
a batch of label sequences
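The effect of return_indexes can be pictured with a toy decoding step. This is only a sketch under assumptions, not the implementation of predict_on_batch: the function decode_predictions, the per-word score rows, the explicit sentence lengths and the toy vocabulary idx2tag are all hypothetical.

```python
from typing import Dict, List, Sequence, Union


def decode_predictions(
    scores: Sequence[Sequence[Sequence[float]]],
    lengths: Sequence[int],
    idx2tag: Dict[int, str],
    return_indexes: bool = False,
) -> List[List[Union[int, str]]]:
    """Hypothetical decoding: pick the best-scoring tag for every real word."""
    answer = []
    for sent_scores, length in zip(scores, lengths):
        # Drop rows that correspond to padding beyond the sentence length.
        rows = sent_scores[:length]
        # Index of the highest score in each row (argmax).
        indexes = [max(range(len(row)), key=row.__getitem__) for row in rows]
        # Either keep vocabulary indexes or map them to tag strings.
        answer.append(indexes if return_indexes else [idx2tag[i] for i in indexes])
    return answer


idx2tag = {0: "PAD", 1: "NOUN,Number=Sing", 2: "PUNCT"}  # toy vocabulary
scores = [
    # two real words followed by one padding row
    [[0.1, 0.8, 0.1], [0.0, 0.2, 0.8], [0.9, 0.05, 0.05]],
]
print(decode_predictions(scores, lengths=[2], idx2tag=idx2tag))
print(decode_predictions(scores, lengths=[2], idx2tag=idx2tag, return_indexes=True))
```

In both cases the result is one sequence per sentence; only the element type (tag string or vocabulary index) differs.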
-
deeppavlov.models.morpho_tagger.common.
predict_with_model
(config_path: [<class 'pathlib.Path'>, <class 'str'>], infile: Union[str, pathlib.Path, None] = None, input_format: str = 'ud', batch_size: [<class 'int'>] = 16, output_format: str = 'basic') → List[Optional[List[str]]][source]¶ Returns predictions of the morphotagging model specified by the config at config_path.
- Parameters
config_path – a path to config
- Returns
a list of morphological analyses, one per sentence. Each analysis is either a list of tags or a list of full CoNLL-U descriptions.
-
class
deeppavlov.models.morpho_tagger.lemmatizer.
UDPymorphyLemmatizer
(save_path: Optional[str] = None, load_path: Optional[str] = None, transform_lemmas=False, **kwargs)[source]¶ A class that returns the normal form of a Russian word given its morphological tag in UD format. The lemma is taken from one of the PyMorphy parses: the parse whose tag most closely resembles the known UD tag is chosen.
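The selection step can be sketched as follows. This is an illustration under stated assumptions, not the lemmatizer's actual code: the description above does not say how tag resemblance is measured, so the similarity used here (one point for a matching part of speech plus one point per matching feature) and the names parse_tag, tag_similarity and choose_lemma are hypothetical. Candidate parses are assumed to be already converted to (lemma, UD tag) pairs.

```python
from typing import Dict, List, Tuple


def parse_tag(tag: str) -> Tuple[str, Dict[str, str]]:
    """Split 'POS,Feat1=Val1|Feat2=Val2' into the POS and a feature dict."""
    pos, _, feats = tag.partition(",")
    feat_dict = {}
    if feats and feats != "_":
        for item in feats.split("|"):
            key, _, value = item.partition("=")
            feat_dict[key] = value
    return pos, feat_dict


def tag_similarity(first: str, second: str) -> int:
    """Assumed measure: 1 for equal POS plus 1 per feature with equal value."""
    first_pos, first_feats = parse_tag(first)
    second_pos, second_feats = parse_tag(second)
    score = 1 if first_pos == second_pos else 0
    score += sum(1 for key, value in first_feats.items()
                 if second_feats.get(key) == value)
    return score


def choose_lemma(ud_tag: str, candidates: List[Tuple[str, str]]) -> str:
    """Return the lemma of the candidate whose tag is most similar to ud_tag."""
    best_lemma, _ = max(candidates, key=lambda pair: tag_similarity(ud_tag, pair[1]))
    return best_lemma


# Two hypothetical parses of the ambiguous Russian word form "стали".
candidates = [
    ("стать", "VERB,Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin"),
    ("сталь", "NOUN,Case=Gen|Number=Sing"),
]
print(choose_lemma("NOUN,Case=Gen|Number=Sing", candidates))
print(choose_lemma("VERB,Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin", candidates))
```

The example shows why the tag matters: the same word form yields a different lemma depending on the tag predicted for it.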
-
class
deeppavlov.models.morpho_tagger.common.
TagOutputPrettifier
(format_mode: str = 'basic', return_string: bool = True, begin: str = '', end: str = '', sep: str = '\n', **kwargs)[source]¶ Class which prettifies morphological tagger output to 4-column or 10-column (Universal Dependencies) format.
- Parameters
format_mode – output format. In basic mode the output contains 4 columns (id, word, pos, features); in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only id, word, tag and pos values are present in the current version; other columns are filled with the _ value.
return_string – whether to return a list of strings or a single string
begin – a string to prepend at the beginning
end – a string to append at the end
sep – separator between word analyses
-
__call__
(X: List[List[str]], Y: List[List[str]]) → List[Union[List[str], str]][source]¶ Calls the prettify() function for each input sentence.
- Parameters
X – a list of input sentences
Y – a list of lists of tags for sentence words
- Returns
a list of prettified morphological analyses
-
prettify
(tokens: List[str], tags: List[str]) → Union[List[str], str][source]¶ Prettifies output of morphological tagger.
- Parameters
tokens – tokenized source sentence
tags – list of tags, the output of a tagger
- Returns
the prettified output of the tagger.
Examples
>>> sent = "John really likes pizza .".split()
>>> tags = ["PROPN,Number=Sing", "ADV",
...         "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
...         "NOUN,Number=Sing", "PUNCT"]
>>> prettifier = TagOutputPrettifier(format_mode='basic')
>>> print(prettifier.prettify(sent, tags))
1 John PROPN Number=Sing
2 really ADV _
3 likes VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
4 pizza NOUN Number=Sing
5 . PUNCT _
>>> prettifier = TagOutputPrettifier(format_mode='ud')
>>> print(prettifier.prettify(sent, tags))
1 John _ PROPN _ Number=Sing _ _ _ _
2 really _ ADV _ _ _ _ _ _
3 likes _ VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin _ _ _ _
4 pizza _ NOUN _ Number=Sing _ _ _ _
5 . _ PUNCT _ _ _ _ _ _
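For readers without DeepPavlov installed, the column layout can be reproduced with a standalone sketch. It is not the library's implementation: the function prettify_sketch is hypothetical, and the tab separator between columns is an assumption, since the rendered example does not show which whitespace is used.

```python
from typing import List


def prettify_sketch(tokens: List[str], tags: List[str],
                    format_mode: str = "basic", sep: str = "\n") -> str:
    """Hypothetical re-creation of the 4-column and 10-column layouts."""
    if format_mode not in ("basic", "conllu", "ud"):
        raise ValueError("unknown format_mode: {}".format(format_mode))
    lines = []
    for i, (word, tag) in enumerate(zip(tokens, tags), start=1):
        # A tag looks like 'POS,Feat=Val|Feat=Val'; the features may be absent.
        pos, _, feats = tag.partition(",")
        feats = feats or "_"
        if format_mode == "basic":
            columns = [str(i), word, pos, feats]
        else:
            # id, word, lemma, pos, xpos, feats, head, deprel, deps, misc
            columns = [str(i), word, "_", pos, "_", feats, "_", "_", "_", "_"]
        lines.append("\t".join(columns))  # assumption: tab-separated columns
    return sep.join(lines)


sent = "John really likes pizza .".split()
tags = ["PROPN,Number=Sing", "ADV",
        "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
        "NOUN,Number=Sing", "PUNCT"]
print(prettify_sketch(sent, tags, format_mode="basic"))
print(prettify_sketch(sent, tags, format_mode="ud"))
```

Splitting the tag at the first comma only keeps feature values that themselves contain commas intact.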
-
set_format_mode
(format_mode: str = 'basic') → None[source]¶ A function that sets format for output and recalculates self.format_string.
- Parameters
format_mode – output format. In basic mode the output contains 4 columns (id, word, pos, features); in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only id, word, tag and pos values are present in the current version; other columns are filled with the _ value.
-
class
deeppavlov.models.morpho_tagger.common.
LemmatizedOutputPrettifier
(return_string: bool = True, begin: str = '', end: str = '', sep: str = '\n', **kwargs)[source]¶ Class which prettifies morphological tagger output to 4-column or 10-column (Universal Dependencies) format.
- Parameters
format_mode – output format. In basic mode the output contains 4 columns (id, word, pos, features); in conllu or ud mode it contains 10 columns: id, word, lemma, pos, xpos, feats, head, deprel, deps, misc (see http://universaldependencies.org/format.html for details). Only id, word, lemma, tag and pos columns are predicted in the current version; other columns are filled with the _ value.
return_string – whether to return a list of strings or a single string
begin – a string to prepend at the beginning
end – a string to append at the end
sep – separator between word analyses
-
__call__
(X: List[List[str]], Y: List[List[str]], Z: List[List[str]]) → List[Union[List[str], str]][source]¶ Calls the prettify() function for each input sentence.
- Parameters
X – a list of input sentences
Y – a list of lists of tags for sentence words
Z – a list of lemmatized sentences
- Returns
a list of prettified morphological analyses
-
prettify
(tokens: List[str], tags: List[str], lemmas: List[str]) → Union[List[str], str][source]¶ Prettifies output of morphological tagger.
- Parameters
tokens – tokenized source sentence
tags – list of tags, the output of a tagger
lemmas – list of lemmas, the output of a lemmatizer
- Returns
the prettified output of the tagger.
Examples
>>> sent = "John really likes pizza .".split()
>>> tags = ["PROPN,Number=Sing", "ADV",
...         "VERB,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
...         "NOUN,Number=Sing", "PUNCT"]
>>> lemmas = "John really like pizza .".split()
>>> prettifier = LemmatizedOutputPrettifier()
>>> print(prettifier.prettify(sent, tags, lemmas))
1 John John PROPN _ Number=Sing _ _ _ _
2 really really ADV _ _ _ _ _ _
3 likes like VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin _ _ _ _
4 pizza pizza NOUN _ Number=Sing _ _ _ _
5 . . PUNCT _ _ _ _ _ _