Pre-trained embeddings¶
ELMo¶
A Russian-language ELMo embeddings model for tensorflow-hub, and an LM model for training and fine-tuning ELMo as a language model.
License¶
The pre-trained models are distributed under the License Apache 2.0.
Downloads¶
The models can be downloaded and run either via a configuration file or as a tensorflow-hub module:
Description | Dataset parameters | Perplexity | Configuration file and tensorflow hub module |
---|---|---|---|
ELMo on Russian Wikipedia | lines = 1M, tokens = 386M, size = 5GB | 43.692 | config_file, module_spec |
ELMo on Russian WMT News | lines = 63M, tokens = 946M, size = 12GB | 49.876 | config_file, module_spec |
ELMo on Russian Twitter | lines = 104M, tokens = 810M, size = 8.5GB | 94.145 | config_file, module_spec |
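The Perplexity column reports the test perplexity of the underlying language model. As a reminder of what that number means, here is a minimal sketch (the toy data is invented) computing perplexity as the exponential of the mean per-token negative log-likelihood:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Toy example: if the model assigns every token probability 1/50,
# the per-token NLL is log(50) and the perplexity is exactly 50.
nlls = [math.log(50.0)] * 4
print(perplexity(nlls))  # ~50.0 (up to float rounding)
```

Lower perplexity means the model is less "surprised" by held-out text, which is why the Wikipedia and WMT News models in the table score better than the noisier Twitter model.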
fastText¶
We are publishing pre-trained word vectors for the Russian language, trained on the combined Russian Wikipedia and Lenta.ru corpora.
All vectors are 300-dimensional. We trained them with the fastText skip-gram model (see Bojanowski et al. (2016)) under various preprocessing options (see below).
You can get the vectors in either binary (bin) or text (vec) format, for both fastText and GloVe.
License¶
The pre-trained word vectors are distributed under the License Apache 2.0.
Downloads¶
The models can be downloaded from:
Model | Preprocessing | Vectors |
---|---|---|
fastText (skipgram) | tokenize (nltk word_tokenize), lemmatize (pymorphy2) | bin, vec |
fastText (skipgram) | tokenize (nltk word_tokenize), lowercasing | bin, vec |
fastText (skipgram) | tokenize (nltk wordpunct_tokenize) | bin, vec |
fastText (skipgram) | tokenize (nltk word_tokenize) | bin, vec |
fastText (skipgram) | tokenize (nltk word_tokenize), remove stopwords | bin, vec |
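The text (vec) format is plain text: a header line with the vocabulary size and dimensionality, followed by one word and its components per line. A minimal sketch of parsing that format (the sample data below is invented; the published models are 300-dimensional):

```python
import io

def load_vec(stream):
    """Parse word vectors in fastText's text (.vec) format."""
    n_words, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, "malformed row"
        vectors[word] = values
    assert len(vectors) == n_words, "header/body mismatch"
    return vectors

# Tiny invented example: 2 words, 3 dimensions.
sample = io.StringIO("2 3\nпривет 0.1 0.2 0.3\nмир -0.1 0.0 0.5\n")
vecs = load_vec(sample)
print(vecs["мир"])  # [-0.1, 0.0, 0.5]
```

The binary (bin) format additionally stores subword (character n-gram) information, so it can produce vectors for out-of-vocabulary words; loading it requires the fastText library itself (or a compatible loader such as gensim's), not a hand-rolled parser.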
Word vectors training parameters¶
These word vectors were trained with the following parameters (values in […] are fastText defaults):
fastText (skipgram)
- lr [0.1]
- lrUpdateRate [100]
- dim 300
- ws [5]
- epoch [5]
- neg [5]
- loss [softmax]
- pretrainedVectors []
- saveOutput [0]
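Assembled as a command line, the settings above amount to something like the following sketch. The input and output paths are placeholders, and flags shown in brackets above are defaults that need not be passed explicitly:

```shell
# Hypothetical invocation: of the parameters listed, only -dim differs
# from the fastText defaults, so it is the only flag that must be set.
fasttext skipgram -input corpus_tokenized.txt -output ru_vectors -dim 300
```

This produces `ru_vectors.bin` (the full model, including subword information) and `ru_vectors.vec` (the plain-text vectors), matching the bin/vec downloads in the table above.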