deeppavlov.models.torch_bert¶
- class deeppavlov.models.preprocessors.torch_transformers_preprocessor.TorchTransformersPreprocessor(vocab_file: str, do_lower_case: bool = True, max_seq_length: int = 512, **kwargs)[source]¶
Tokenize text into subtokens, encode subtokens with their indices, and create token and segment masks.
- Parameters
vocab_file – A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co or a path to a directory containing vocabulary files required by the tokenizer.
do_lower_case – set True if lowercasing is needed
max_seq_length – max sequence length in subtokens, including [SEP] and [CLS] tokens
- max_seq_length¶
max sequence length in subtokens, including [SEP] and [CLS] tokens
- tokenizer¶
instance of Bert FullTokenizer
- __call__(texts_a: List, texts_b: Optional[List[str]] = None) Union[List[transformers.data.processors.utils.InputFeatures], Tuple[List[transformers.data.processors.utils.InputFeatures], List[List[str]]]] [source]¶
Tokenize and create masks. texts_a and texts_b are separated by the [SEP] token.
- Parameters
texts_a – list of texts
texts_b – list of texts; may be None, e.g. for a single-sentence classification task
- Returns
batch of transformers.data.processors.utils.InputFeatures with subtokens, subtoken ids, subtoken mask, and segment mask, or a tuple of the batch of InputFeatures and the batch of subtokens
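A minimal usage sketch based on the signature and __call__ contract above; the model id bert-base-uncased and the example sentences are illustrative, not part of the API:
    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchTransformersPreprocessor

    # "bert-base-uncased" stands in for any HuggingFace model id or local vocabulary directory
    preprocessor = TorchTransformersPreprocessor(vocab_file="bert-base-uncased",
                                                 do_lower_case=True,
                                                 max_seq_length=64)

    # Single-sentence mode: texts_b is omitted
    features = preprocessor(["Hello, world!", "DeepPavlov is a conversational AI library."])

    # Sentence-pair mode: each text in texts_a is paired with the matching text in texts_b via [SEP]
    pair_features = preprocessor(["What is the capital of France?"],
                                 ["Paris is the capital of France."])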
- class deeppavlov.models.preprocessors.torch_transformers_preprocessor.TorchTransformersNerPreprocessor(vocab_file: str, do_lower_case: bool = False, max_seq_length: int = 512, max_subword_length: Optional[int] = None, token_masking_prob: float = 0.0, provide_subword_tags: bool = False, subword_mask_mode: str = 'first', return_features: bool = False, **kwargs)[source]¶
Takes tokens and splits them into BERT subtokens, encoding subtokens with their indices. Creates a subtoken mask (one for the first subtoken of a word, zero for the others).
If tags are provided, calculates tags for subtokens.
- Parameters
vocab_file – path to vocabulary
do_lower_case – set True if lowercasing is needed
max_seq_length – max sequence length in subtokens, including [SEP] and [CLS] tokens
max_subword_length – replace a token with <unk> if its length is larger than this (defaults to None, which corresponds to +infinity)
token_masking_prob – probability of masking token while training
provide_subword_tags – output tags for subwords or for words
subword_mask_mode – subword to select inside word tokens, can be “first” or “last” (default=”first”)
return_features – if True, returns answer in features format
- max_seq_length¶
max sequence length in subtokens, including [SEP] and [CLS] tokens
- max_subword_length¶
max length of a BERT subtoken
- tokenizer¶
instance of Bert FullTokenizer
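A hedged usage sketch; the __call__ convention (a batch of pre-tokenized sentences plus an optional batch of tags) is assumed from the description above, and the model id, tokens, and tags are illustrative:
    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchTransformersNerPreprocessor

    ner_preprocessor = TorchTransformersNerPreprocessor(vocab_file="bert-base-cased",
                                                        do_lower_case=False,
                                                        max_seq_length=128,
                                                        subword_mask_mode="first")

    # Inputs are already split into word-level tokens; tags are optional
    tokens = [["John", "lives", "in", "Berlin"]]
    tags = [["B-PER", "O", "O", "B-LOC"]]

    # Returns subtokens, subtoken ids, the first-subtoken mask and, since tags were passed, subtoken tags
    processed = ner_preprocessor(tokens, tags)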
- class deeppavlov.models.preprocessors.torch_transformers_preprocessor.TorchBertRankerPreprocessor(vocab_file: str, do_lower_case: bool = True, max_seq_length: int = 512, **kwargs)[source]¶
Tokenize text into subtokens, encode subtokens with their indices, and create token and segment masks for ranking.
Builds features for a pair of context with each of the response candidates.
- __call__(batch: List[List[str]]) List[List[transformers.data.processors.utils.InputFeatures]] [source]¶
Tokenize and create masks.
- Parameters
batch – list whose first element is the batch of contexts and whose remaining elements are batches of response candidates
- Returns
list of feature batches with subtokens, subtoken ids, subtoken mask, segment mask.
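A small sketch of the documented __call__ contract; the model id, contexts, and candidate responses are illustrative:
    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchBertRankerPreprocessor

    ranker_preprocessor = TorchBertRankerPreprocessor(vocab_file="bert-base-uncased",
                                                      max_seq_length=128)

    contexts = ["How can I reset my password?", "Where is my order?"]
    candidates_1 = ["Click 'Forgot password' on the login page.", "Your order ships tomorrow."]
    candidates_2 = ["We are open from 9 to 5.", "Tracking is available in your account."]

    # The first element is the batch of contexts, the rest are batches of response candidates
    feature_batches = ranker_preprocessor([contexts, candidates_1, candidates_2])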
- class deeppavlov.models.torch_bert.torch_transformers_classifier.TorchTransformersClassifierModel(n_classes, pretrained_bert, multilabel: bool = False, return_probas: bool = False, attention_probs_keep_prob: Optional[float] = None, hidden_keep_prob: Optional[float] = None, bert_config_file: Optional[str] = None, is_binary: Optional[bool] = False, num_special_tokens: Optional[int] = None, **kwargs)[source]¶
BERT-based model for text classification on PyTorch.
It uses the output of the [CLS] token and predicts labels with a linear transformation.
- Parameters
n_classes – number of classes
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
multilabel – set True if it is multi-label classification
return_probas – set True to return class probabilities instead of the most probable label
attention_probs_keep_prob – keep_prob for Bert self-attention layers
hidden_keep_prob – keep_prob for Bert hidden layers
bert_config_file – path to Bert configuration file (not used if pretrained_bert is key title)
is_binary – whether classification task is binary or multi-class
num_special_tokens – number of special tokens used by classification model
- __call__(features: Dict[str, torch.tensor]) Union[List[int], List[List[float]]] [source]¶
Make prediction for given features (texts).
- Parameters
features – batch of InputFeatures
- Returns
predicted classes or probabilities of each class
- train_on_batch(features: Dict[str, torch.tensor], y: Union[List[int], List[List[int]]]) Dict [source]¶
Train model on given batch. This method calls train_op using features and y (labels).
- Parameters
features – batch of InputFeatures
y – batch of labels (class id or one-hot encoding)
- Returns
dict with loss and learning_rate values
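A hedged stand-alone sketch pairing the preprocessor above with the classifier, mirroring how the components are typically chained in DeepPavlov pipeline configs; save_path is assumed to be accepted among the base torch-model **kwargs, and the checkpoint id and texts are illustrative:
    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchTransformersPreprocessor
    from deeppavlov.models.torch_bert.torch_transformers_classifier import TorchTransformersClassifierModel

    preprocessor = TorchTransformersPreprocessor(vocab_file="bert-base-uncased", max_seq_length=64)
    classifier = TorchTransformersClassifierModel(n_classes=2,
                                                  pretrained_bert="bert-base-uncased",
                                                  return_probas=True,
                                                  save_path="./sentiment_clf")  # assumed pass-through kwarg

    features = preprocessor(["great movie", "boring and far too long"])
    probas = classifier(features)                          # per-class probabilities (return_probas=True)
    metrics = classifier.train_on_batch(features, [1, 0])  # dict with loss and learning_rate values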
- class deeppavlov.models.torch_bert.torch_transformers_sequence_tagger.TorchTransformersSequenceTagger(n_tags: int, pretrained_bert: str, bert_config_file: Optional[str] = None, attention_probs_keep_prob: Optional[float] = None, hidden_keep_prob: Optional[float] = None, use_crf: bool = False, **kwargs)[source]¶
Transformer-based model on PyTorch for text tagging. It predicts a label for every token (not subtoken) in the text. You can use it for sequence labeling tasks, such as morphological tagging or named entity recognition.
- Parameters
n_tags – number of distinct tags
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
bert_config_file – path to Bert configuration file, or None, if pretrained_bert is a string name
attention_probs_keep_prob – keep_prob for Bert self-attention layers
hidden_keep_prob – keep_prob for Bert hidden layers
use_crf – whether to use Conditional Random Field to decode tags
- __call__(input_ids: Union[List[List[int]], ndarray], input_masks: Union[List[List[int]], ndarray], y_masks: Union[List[List[int]], ndarray]) Tuple[List[List[int]], List[ndarray]] [source]¶
Predicts tag indices for a given batch of subword tokens
- Parameters
input_ids – indices of the subwords
input_masks – mask that determines where to attend and where not to
y_masks – mask which marks the first subword unit of each word
- Returns
Label indices or class probabilities for each token (not subtoken)
- train_on_batch(input_ids: Union[List[List[int]], ndarray], input_masks: Union[List[List[int]], ndarray], y_masks: Union[List[List[int]], ndarray], y: List[List[int]], *args, **kwargs) Dict[str, float] [source]¶
Train the model on the given batch.
- Parameters
input_ids – batch of indices of subwords
input_masks – batch of masks which determine which subword positions should be attended to
y_masks – batch of masks which mark the first subword unit of each word
y – batch of lists of ground truth tag indices
args – arguments passed to _build_feed_dict and corresponding to additional input and output tensors of the derived class.
kwargs – keyword arguments passed to _build_feed_dict and corresponding to additional input and output tensors of the derived class.
- Returns
dict with fields ‘loss’, ‘head_learning_rate’, and ‘bert_learning_rate’
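A hedged sketch with toy arrays; the tag count, checkpoint id, subword ids, and save_path are all illustrative, and the shapes simply follow the __call__ and train_on_batch signatures above:
    import numpy as np
    from deeppavlov.models.torch_bert.torch_transformers_sequence_tagger import TorchTransformersSequenceTagger

    tagger = TorchTransformersSequenceTagger(n_tags=9,
                                             pretrained_bert="bert-base-cased",
                                             use_crf=False,
                                             save_path="./ner_tagger")  # assumed pass-through kwarg

    # One toy sentence of two words split into subwords: [CLS] w1 w2 [SEP] pad
    input_ids = np.array([[101, 1287, 2491, 102, 0]])
    input_masks = np.array([[1, 1, 1, 1, 0]])   # attend to real subwords only
    y_masks = np.array([[0, 1, 1, 0, 0]])       # mark the first subword of every word

    pred = tagger(input_ids, input_masks, y_masks)  # one tag index per word
    metrics = tagger.train_on_batch(input_ids, input_masks, y_masks, y=[[3, 0]])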
- class deeppavlov.models.torch_bert.torch_transformers_squad.TorchTransformersSquad(pretrained_bert: str, attention_probs_keep_prob: Optional[float] = None, hidden_keep_prob: Optional[float] = None, bert_config_file: Optional[str] = None, psg_cls: bool = False, batch_size: int = 10, **kwargs)[source]¶
BERT-based model on PyTorch for a SQuAD-like problem setting: it predicts the start and end positions of the answer for a given question and context.
The [CLS] token is used as no_answer: if the model selects the [CLS] token as the most probable answer, there is no answer in the given context.
The start and end positions of the answer are predicted by a linear transformation of the BERT outputs.
- Parameters
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
attention_probs_keep_prob – keep_prob for Bert self-attention layers
hidden_keep_prob – keep_prob for Bert hidden layers
bert_config_file – path to Bert configuration file, or None, if pretrained_bert is a string name
psg_cls – whether to use a separate linear layer to define if a passage contains the answer to the question
batch_size – batch size for inference of squad model
- __call__(features_batch: List[List[transformers.data.processors.utils.InputFeatures]]) Tuple[List[List[int]], List[List[int]], List[List[float]], List[List[float]], List[int]] [source]¶
Get predictions using features as input.
- Parameters
features_batch – batch of InputFeatures instances
- Returns
start_pred_batch – answer start positions
end_pred_batch – answer end positions
logits_batch – answer logits
scores_batch – answer confidences
ind_batch – indices of paragraph pieces where the answer was found
- train_on_batch(features: List[List[transformers.data.processors.utils.InputFeatures]], y_st: List[List[int]], y_end: List[List[int]]) Dict [source]¶
Train model on given batch. This method calls train_op using features and labels from y_st and y_end.
- Parameters
features – batch of InputFeatures instances
y_st – batch of lists of ground truth answer start positions
y_end – batch of lists of ground truth answer end positions
- Returns
dict with loss and learning_rate values
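A hedged sketch of question answering with this class; pairing it with TorchTransformersPreprocessor, the checkpoint id, save_path, and the toy gold positions are assumptions for illustration and may differ from the released SQuAD configs:
    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchTransformersPreprocessor
    from deeppavlov.models.torch_bert.torch_transformers_squad import TorchTransformersSquad

    preprocessor = TorchTransformersPreprocessor(vocab_file="bert-base-uncased", max_seq_length=384)
    squad = TorchTransformersSquad(pretrained_bert="bert-base-uncased",
                                   batch_size=10,
                                   save_path="./squad_model")  # assumed pass-through kwarg

    question = "Where is the Eiffel Tower located?"
    context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France."

    # Each outer element groups the paragraph pieces that belong to one question
    features = preprocessor([question], [context])
    starts, ends, logits, scores, inds = squad([features])

    # Toy gold answer spans (subtoken positions) for training
    metrics = squad.train_on_batch([features], y_st=[[15]], y_end=[[15]])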
- class deeppavlov.models.torch_bert.torch_bert_ranker.TorchBertRankerModel(pretrained_bert: Optional[str] = None, bert_config_file: Optional[str] = None, n_classes: int = 2, return_probas: bool = True, **kwargs)[source]¶
BERT-based model for interaction-based text ranking on PyTorch.
A linear transformation is trained over the BERT pooled output of the [CLS] token. The predicted class probabilities are used as a similarity measure for ranking.
- Parameters
pretrained_bert – pretrained Bert checkpoint path or key title (e.g. “bert-base-uncased”)
bert_config_file – path to Bert configuration file (not used if pretrained_bert is key title)
n_classes – number of classes
return_probas – set True if class probabilities are returned instead of the most probable label
- __call__(features_li: List[List[transformers.data.processors.utils.InputFeatures]]) Union[List[int], List[List[float]]] [source]¶
Calculate scores for the given context over candidate responses.
- Parameters
features_li – list of elements where each element contains the batch of features for contexts with particular response candidates
- Returns
predicted scores for contexts over response candidates
- train_on_batch(features_li: List[List[transformers.data.processors.utils.InputFeatures]], y: Union[List[int], List[List[int]]]) Dict [source]¶
Train the model on the given batch.
- Parameters
features_li – list with the single element containing the batch of InputFeatures
y – batch of labels (class id or one-hot encoding)
- Returns
dict with loss and learning rate values
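A hedged sketch chaining the ranker preprocessor documented above with this model; the checkpoint id, save_path, and the toy contexts and candidates are illustrative:
    from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchBertRankerPreprocessor
    from deeppavlov.models.torch_bert.torch_bert_ranker import TorchBertRankerModel

    preprocessor = TorchBertRankerPreprocessor(vocab_file="bert-base-uncased", max_seq_length=128)
    ranker = TorchBertRankerModel(pretrained_bert="bert-base-uncased",
                                  n_classes=2,
                                  return_probas=True,
                                  save_path="./bert_ranker")  # assumed pass-through kwarg

    contexts = ["How do I reset my password?"]
    candidates_a = ["Use the 'Forgot password' link on the login page."]
    candidates_b = ["Our office is open from 9 to 5."]

    features_li = preprocessor([contexts, candidates_a, candidates_b])
    scores = ranker(features_li)  # similarity scores of each context with every candidate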