Pre-trained embeddings¶
BERT¶
We are publishing several pre-trained BERT models:

- RuBERT for the Russian language
- Slavic BERT for Bulgarian, Czech, Polish, and Russian
- Conversational BERT for informal English
- Conversational BERT for informal Russian
A description of these models is available in the BERT section of the docs.
License¶
The pre-trained models are distributed under the Apache 2.0 license.
Downloads¶
The models can be run with the code from the original BERT repository; a tokenization sketch is given after the table. The download links are:
| Description | Model parameters | Download link |
|---|---|---|
| RuBERT | vocab size = 120K, parameters = 180M, size = 632MB | |
| Slavic BERT | vocab size = 120K, parameters = 180M, size = 632MB | |
| Conversational BERT | vocab size = 30K, parameters = 110M, size = 385MB | |
| Conversational RuBERT | vocab size = 120K, parameters = 180M, size = 630MB | |
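As a quick check that a downloaded model is usable, the sketch below tokenizes a sentence with the vocabulary shipped in the archive, using the `tokenization` module from the original BERT repository (https://github.com/google-research/bert). The vocab path is a placeholder for wherever you unpacked the archive:

```python
import tokenization  # module from the google-research/bert repository

tokenizer = tokenization.FullTokenizer(
    vocab_file="rubert/vocab.txt",  # placeholder path inside the unpacked archive
    do_lower_case=False,            # the models above are cased
)

tokens = tokenizer.tokenize("Прекрасная погода сегодня.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```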
ELMo¶
An ELMo embeddings model for the Russian language, available as a tensorflow-hub module, and an LM model for training and fine-tuning ELMo as a language model.
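A minimal loading sketch, assuming the module follows the standard TF1 tensorflow-hub ELMo interface (signature `"default"`, output key `"elmo"`); the module URL is a placeholder for one of the links in the table below:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Placeholder URL: substitute a tensorflow-hub module link from the table below.
elmo = hub.Module("<elmo-module-url>", trainable=False)

# One raw string per sentence; whitespace tokenization inside the module is
# assumed, as in the standard tf-hub ELMo interface.
embeddings = elmo(
    ["какая прекрасная погода", "мама мыла раму"],
    signature="default",
    as_dict=True,
)["elmo"]  # shape: [batch_size, max_tokens, 1024]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings).shape)
```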
License¶
The pre-trained models are distributed under the Apache 2.0 license.
Downloads¶
The models can be downloaded and run either through a configuration file or as a tensorflow-hub module:
| Description | Dataset parameters | Perplexity | Configuration file and tensorflow-hub module |
|---|---|---|---|
| ELMo on Russian Wikipedia | lines = 1M, tokens = 386M, size = 5GB | 43.692 | |
| ELMo on Russian WMT News | lines = 63M, tokens = 946M, size = 12GB | 49.876 | |
| ELMo on Russian Twitter | lines = 104M, tokens = 810M, size = 8.5GB | 94.145 | |
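Alternatively, the modules can be loaded through DeepPavlov's own wrapper; a sketch assuming the `ELMoEmbedder` class from `deeppavlov.models.embedders.elmo_embedder`, with a placeholder URL:

```python
from deeppavlov.models.embedders.elmo_embedder import ELMoEmbedder

# Placeholder URL: substitute a module link from the table above.
elmo = ELMoEmbedder("<elmo-module-url>")

# The embedder is called on batches of tokenized sentences.
vectors = elmo([["какая", "прекрасная", "погода"], ["мама", "мыла", "раму"]])
```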
fastText¶
We are publishing pre-trained word vectors for the Russian language. Several models were trained on the joint Russian Wikipedia and Lenta.ru corpus. We also introduce one model for Russian conversational language that was trained on a Russian Twitter corpus.
All vectors are 300-dimensional. We used fastText skip-gram (see Bojanowski et al. (2016)) to train the vectors, with various preprocessing options (see below).
You can get the vectors in either binary or text (vec) format, for both fastText and GloVe; a loading sketch is given below.
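A loading sketch for the two distribution formats, using the `fasttext` Python bindings for the binary model and gensim for the text (vec) format; the file names are placeholders:

```python
import fasttext
from gensim.models import KeyedVectors

# Binary format: the full fastText model, which can also produce vectors
# for out-of-vocabulary words from character n-grams.
model = fasttext.load_model("ft_ru_skipgram.bin")  # placeholder file name
print(model.get_word_vector("погода")[:5])

# Text (vec) format: a plain word2vec-style table of in-vocabulary vectors.
vectors = KeyedVectors.load_word2vec_format("ft_ru_skipgram.vec", binary=False)
print(vectors["погода"][:5])
```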
License¶
The pre-trained word vectors are distributed under the Apache 2.0 license.
Downloads¶
The pre-trained fastText skipgram models can be downloaded from:
| Domain | Preprocessing | Vectors |
|---|---|---|
| Wiki+Lenta | tokenize (nltk word_tokenize), lemmatize (pymorphy2) | |
| Wiki+Lenta | tokenize (nltk word_tokenize), lowercasing | |
| Wiki+Lenta | tokenize (nltk wordpunct_tokenize) | |
| Wiki+Lenta | tokenize (nltk word_tokenize) | |
| Wiki+Lenta | tokenize (nltk word_tokenize), remove stopwords | |
| Twitter | tokenize (nltk word_tokenize) | |
Word vector training parameters¶
These word vectors were trained with the following parameters (values in […] are fastText defaults):
fastText (skipgram)

- lr [0.1]
- lrUpdateRate [100]
- dim 300
- ws [5]
- epoch [5]
- neg [5]
- loss [softmax]
- pretrainedVectors []
- saveOutput [0]
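For reference, the same configuration expressed with the `fasttext` Python bindings, whose `train_unsupervised` function mirrors the CLI parameters above; the corpus path is a placeholder for a preprocessed one-sentence-per-line file:

```python
import fasttext

# Reproduce the training setup above; only dim deviates from the library
# defaults, matching the parameter list.
model = fasttext.train_unsupervised(
    "corpus_preprocessed.txt",  # placeholder: tokenized corpus, one sentence per line
    model="skipgram",
    lr=0.1,
    lrUpdateRate=100,
    dim=300,
    ws=5,
    epoch=5,
    neg=5,
    loss="softmax",
)
model.save_model("ft_ru_skipgram_300.bin")
```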