deeppavlov.models.nemo

class deeppavlov.models.nemo.asr.NeMoASR(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], **kwargs)

ASR model on NeMo modules.

__init__(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], **kwargs) → None

Initializes NeuralModules for ASR.

Parameters

load_path – Path to a directory with pretrained checkpoints for JasperEncoder and JasperDecoderForCTC.
nemo_params_path – Path to a file containing labels and params for AudioToMelSpectrogramPreprocessor, JasperEncoder, JasperDecoderForCTC and AudioInferDataLayer.

__call__(audio_batch: List[Union[str, _io.BytesIO]]) → List[str]

Transcribes an audio batch to text.

Parameters: audio_batch – Batch to be transcribed. Elements can be either paths to audio files or Binary I/O objects.
Returns: Batch of transcripts.
Return type: List[str]
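
To illustrate the second accepted element type, the sketch below builds a short recording entirely in memory with Python's standard wave module and wraps it in a Binary I/O object. It assumes that WAV content is an acceptable payload. The helper make_wav_bytes_io and its tone parameters are hypothetical and not part of DeepPavlov.

```python
import io
import math
import struct
import wave


def make_wav_bytes_io(n_samples: int = 1600, sample_rate: int = 16000,
                      freq_hz: float = 440.0) -> io.BytesIO:
    """Return a BytesIO holding a mono 16-bit PCM WAV with a sine tone.

    Hypothetical helper for illustration only; not part of DeepPavlov.
    """
    # Half-amplitude sine wave packed as little-endian signed 16-bit samples.
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes per sample (16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
    buffer.seek(0)                     # rewind so a consumer reads from the start
    return buffer


# A batch may then mix such objects with plain path strings.
audio = make_wav_bytes_io()
with wave.open(audio, "rb") as wav_file:
    print(wav_file.getframerate(), wav_file.getnframes())
audio.seek(0)
```

Rewinding the buffer after writing (and after any inspection) matters, because a consumer that reads from the current position would otherwise see an empty stream.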

class deeppavlov.models.nemo.tts.NeMoTTS(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], vocoder: str = 'waveglow', **kwargs)

TTS model on NeMo modules.

__init__(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], vocoder: str = 'waveglow', **kwargs) → None

Initializes NeuralModules for TTS.

Parameters

load_path – Path to a directory with pretrained checkpoints for TextEmbedding, Tacotron2Encoder, Tacotron2DecoderInfer, Tacotron2Postnet and, if Waveglow vocoder is selected, WaveGlowInferNM.
nemo_params_path – Path to a file containing sample_rate, labels and params for TextEmbedding, Tacotron2Encoder, Tacotron2Decoder, Tacotron2Postnet and TranscriptDataLayer.
vocoder – Vocoder used to convert from spectrograms to audio. Available options: waveglow (needs pretrained checkpoint) and griffin-lim.

__call__(text_batch: List[str], path_batch: Optional[List[str]] = None) → Union[List[_io.BytesIO], List[str]]

Creates wav files or file objects with speech.

Parameters

text_batch – Text from which human audible speech should be generated.
path_batch – The i-th element of path_batch is the path where the i-th generated speech file is saved. If the argument is not specified, the synthesized speech is stored in Binary I/O objects.

Returns

List of Binary I/O objects with generated speech if path_batch was not specified, list of paths to files with synthesized speech otherwise.
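
The return contract can be pictured with a small standalone sketch. Everything in it is hypothetical: store_speech and its raw-bytes input stand in for already synthesized audio and are not DeepPavlov API. The sketch only mirrors the documented rule that paths come back when path_batch is given and Binary I/O objects come back otherwise.

```python
import io
from pathlib import Path
from typing import List, Optional, Union


def store_speech(wav_payloads: List[bytes],
                 path_batch: Optional[List[str]] = None
                 ) -> Union[List[io.BytesIO], List[str]]:
    """Hypothetical stand-in for the output step of a TTS call."""
    if path_batch is None:
        # No paths given: keep every sample in memory.
        return [io.BytesIO(payload) for payload in wav_payloads]
    if len(path_batch) != len(wav_payloads):
        raise ValueError("path_batch must hold one path per generated sample")
    # The i-th payload is written to the i-th path.
    for payload, path in zip(wav_payloads, path_batch):
        Path(path).write_bytes(payload)
    return list(path_batch)
```

Callers can therefore branch on whether they passed path_batch instead of inspecting the element type of the result.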

deeppavlov.models.nemo.common.ascii_to_bytes_io(batch: Union[str, list]) → Union[_io.BytesIO, list]

Recursively searches for strings in the input batch and converts them into the base64-encoded bytes wrapped in Binary I/O objects.

Parameters: batch – A string or an iterable container with strings at some level of nesting.
Returns: The same structure where all strings are converted into the base64-encoded bytes wrapped in Binary I/O objects.

deeppavlov.models.nemo.common.bytes_io_to_ascii(batch: Union[_io.BytesIO, list]) → Union[str, list]

Recursively searches for Binary I/O objects in the input batch and converts them into ASCII-strings.

Parameters: batch – A BinaryIO object or an iterable container with BinaryIO objects at some level of nesting.
Returns: The same structure where all BinaryIO objects are converted into strings.
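
The recursive conversion in both directions can be sketched with the standard library alone. This is not the DeepPavlov implementation: the function names to_bytes_io and to_ascii are hypothetical, and the sketch assumes one reading of the descriptions above, namely that the strings carry base64 text which is decoded into raw bytes, and that the reverse step encodes the bytes back into base64 text.

```python
import base64
from io import BytesIO
from typing import Union


def to_bytes_io(batch: Union[str, list]) -> Union[BytesIO, list]:
    """Replace every string in a nested list with a BytesIO object."""
    if isinstance(batch, str):
        # Assumption: the string is base64 text describing binary data.
        return BytesIO(base64.b64decode(batch))
    return [to_bytes_io(item) for item in batch]


def to_ascii(batch: Union[BytesIO, list]) -> Union[str, list]:
    """Replace every BytesIO object in a nested list with an ASCII string."""
    if isinstance(batch, BytesIO):
        return base64.b64encode(batch.read()).decode("ascii")
    return [to_ascii(item) for item in batch]


nested = [["aGVsbG8="], "d29ybGQ="]
restored = to_ascii(to_bytes_io(nested))
print(restored == nested)
```

The nesting of the input is preserved because each list is rebuilt element by element, and only the leaves change type.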

class deeppavlov.models.nemo.asr.AudioInferDataLayer(*args: Any, **kwargs: Any)

Data Layer for ASR pipeline inference.

__init__(*, audio_batch: List[Union[str, _io.BytesIO]], batch_size: int = 32, sample_rate: int = 16000, int_values: bool = False, trim_silence: bool = False, **kwargs) → None

Initializes Data Loader.

Parameters

audio_batch – Batch to be read. Elements can be either paths to audio files or Binary I/O objects.
batch_size – How many samples per batch to load.
sample_rate – Target sampling rate for data. Audio files are resampled to sample_rate if they are not already at that rate.
int_values – If true, load data as 32-bit integers.
trim_silence – Trim leading and trailing silence from an audio signal if True.

class deeppavlov.models.nemo.tts.TextDataLayer(*args: Any, **kwargs: Any)

__init__(*, text_batch: List[str], labels: List[str], batch_size: int = 32, bos_id: Optional[int] = None, eos_id: Optional[int] = None, pad_id: Optional[int] = None, **kwargs) → None

A simple Neural Module for loading text data.

Parameters

text_batch – Texts to be used for speech synthesis.
labels – List of string labels to use for str2int translation.
batch_size – How many strings per batch to load.
bos_id – Label position of the beginning-of-string symbol. If None, it is initialized as len(labels).
eos_id – Label position of the end-of-string symbol. If None, it is initialized as len(labels) + 1.
pad_id – Label position of the pad symbol. If None, it is initialized as len(labels) + 2.
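
The default positions of the special symbols follow directly from the length of labels, as the sketch below shows. Only the three defaults come from the parameter descriptions. The helper names, the character-level lookup, the skipping of unknown characters and the wrapping of each sequence in bos/eos ids are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def resolve_special_ids(labels: List[str],
                        bos_id: Optional[int] = None,
                        eos_id: Optional[int] = None,
                        pad_id: Optional[int] = None) -> Tuple[int, int, int]:
    """Apply the documented defaults: len(labels), len(labels) + 1, len(labels) + 2."""
    n = len(labels)
    return (n if bos_id is None else bos_id,
            n + 1 if eos_id is None else eos_id,
            n + 2 if pad_id is None else pad_id)


def encode(text: str, labels: List[str], bos_id: int, eos_id: int) -> List[int]:
    """Hypothetical str2int step: one id per character, wrapped in bos/eos."""
    char_to_id = {label: i for i, label in enumerate(labels)}
    # Assumption: characters that are missing from labels are skipped.
    ids = [char_to_id[ch] for ch in text if ch in char_to_id]
    return [bos_id] + ids + [eos_id]


def pad_batch(sequences: List[List[int]], pad_id: int) -> List[List[int]]:
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


labels = list("abc ")
bos, eos, pad = resolve_special_ids(labels)   # (4, 5, 6)
batch = pad_batch([encode("cab", labels, bos, eos),
                   encode("a", labels, bos, eos)], pad)
print(batch)
```

Placing the special ids after the last label index keeps them from colliding with any real label.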

class deeppavlov.models.nemo.vocoder.WaveGlow(*, denoiser_strength: float = 0.0, n_window_stride: int = 160, **kwargs)

__init__(*, denoiser_strength: float = 0.0, n_window_stride: int = 160, **kwargs) → None

Wraps WaveGlowInferNM module.

Parameters

denoiser_strength – Denoiser strength for waveglow.
n_window_stride – Stride of window for FFT in samples used in model training.
kwargs – Named arguments for WaveGlowInferNM constructor.
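
Because n_window_stride is the hop between successive spectrogram frames, it links the number of frames in a spectrogram to the approximate length of the corresponding waveform. The arithmetic below sketches that relationship only. The helper names are hypothetical, the 16000 Hz sample rate is an assumed value, and how the wrapper uses the stride internally is not stated here.

```python
def frames_to_samples(n_frames: int, n_window_stride: int = 160) -> int:
    """Approximate number of audio samples covered by n_frames spectrogram frames."""
    return n_frames * n_window_stride


def frames_to_seconds(n_frames: int, n_window_stride: int = 160,
                      sample_rate: int = 16000) -> float:
    """Approximate duration in seconds; sample_rate is an assumed value."""
    return frames_to_samples(n_frames, n_window_stride) / sample_rate


# With the default stride, each frame advances 160 samples,
# which is 10 ms at an assumed rate of 16000 Hz.
print(frames_to_samples(100), frames_to_seconds(100))
```

The exact sample count also depends on how the signal was padded and windowed, so the product is an estimate and not a guarantee.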

class deeppavlov.models.nemo.vocoder.GriffinLim(*, sample_rate: float = 16000.0, n_fft: int = 1024, mag_scale: float = 2048.0, power: float = 1.2, n_iters: int = 50, **kwargs)

__init__(*, sample_rate: float = 16000.0, n_fft: int = 1024, mag_scale: float = 2048.0, power: float = 1.2, n_iters: int = 50, **kwargs) → None

Uses the Griffin-Lim algorithm to generate speech from spectrograms.

Parameters

sample_rate – Generated audio data sample rate.
n_fft – The number of points to use for the FFT.
mag_scale – Multiplied with the linear spectrogram to avoid audio sounding muted due to mel filter normalization.
power – The linear spectrogram is raised to this power prior to running the Griffin-Lim algorithm. A power greater than 1 has been shown to improve audio quality.
n_iters – Number of iterations used to convert magnitude spectrograms to an audio signal.