deeppavlov.models.nemo
-
class deeppavlov.models.nemo.asr.NeMoASR(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], **kwargs)
ASR model on NeMo modules.
-
__init__(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], **kwargs) → None
Initializes NeuralModules for ASR.
- Parameters
load_path – Path to a directory with pretrained checkpoints for JasperEncoder and JasperDecoderForCTC.
nemo_params_path – Path to a file containing labels and params for AudioToMelSpectrogramPreprocessor, JasperEncoder, JasperDecoderForCTC and AudioInferDataLayer.
-
-
class deeppavlov.models.nemo.tts.NeMoTTS(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], vocoder: str = 'waveglow', **kwargs)
TTS model on NeMo modules.
-
__init__(load_path: Union[str, pathlib.Path], nemo_params_path: Union[str, pathlib.Path], vocoder: str = 'waveglow', **kwargs) → None
Initializes NeuralModules for TTS.
- Parameters
load_path – Path to a directory with pretrained checkpoints for TextEmbedding, Tacotron2Encoder, Tacotron2DecoderInfer, Tacotron2Postnet and, if Waveglow vocoder is selected, WaveGlowInferNM.
nemo_params_path – Path to a file containing sample_rate, labels and params for TextEmbedding, Tacotron2Encoder, Tacotron2Decoder, Tacotron2Postnet and TranscriptDataLayer.
vocoder – Vocoder used to convert from spectrograms to audio. Available options: waveglow (needs pretrained checkpoint) and griffin-lim.
-
__call__(text_batch: List[str], path_batch: Optional[List[str]] = None) → Union[List[_io.BytesIO], List[str]]
Creates wav files or file objects with speech.
- Parameters
text_batch – Text from which human audible speech should be generated.
path_batch – The i-th element of path_batch is the path where the i-th generated speech file is saved. If the argument is not specified, the synthesized speech is stored in Binary I/O objects.
- Returns
- List of Binary I/O objects with generated speech if path_batch was not specified, list of paths to files
with synthesized speech otherwise.
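
The return convention described above can be sketched with the standard library alone. The helper `write_wav_batch` below is hypothetical and not part of DeepPavlov, and silent PCM frames stand in for synthesized speech. It only illustrates how one call can return either file paths or in-memory Binary I/O objects, depending on whether paths were given.

```python
import io
import os
import tempfile
import wave
from typing import List, Optional, Union


def write_wav_batch(pcm_batch: List[bytes],
                    path_batch: Optional[List[str]] = None,
                    sample_rate: int = 16000) -> Union[List[io.BytesIO], List[str]]:
    """Hypothetical helper: store 16-bit mono PCM in files or in BytesIO objects."""
    if path_batch is None:
        # No paths given: one in-memory Binary I/O object per item.
        targets = [io.BytesIO() for _ in pcm_batch]
    else:
        targets = list(path_batch)
    for pcm, target in zip(pcm_batch, targets):
        with wave.open(target, 'wb') as wav_file:
            wav_file.setnchannels(1)
            wav_file.setsampwidth(2)
            wav_file.setframerate(sample_rate)
            wav_file.writeframes(pcm)
        if isinstance(target, io.BytesIO):
            target.seek(0)  # rewind so the caller can read from the start
    return targets


silence = b'\x00\x00' * 160  # 10 ms of silence at 16 kHz, a stand-in for speech

buffers = write_wav_batch([silence])  # no paths: in-memory objects are returned
with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = os.path.join(tmp_dir, 'speech.wav')
    paths = write_wav_batch([silence], [wav_path])  # paths: files are written
    file_size = os.path.getsize(wav_path)
```

Here `buffers` holds WAV data in memory, while `paths` simply echoes the locations that were written.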
-
-
deeppavlov.models.nemo.common.ascii_to_bytes_io(batch: Union[str, list]) → Union[_io.BytesIO, list]
Recursively searches for strings in the input batch and converts them into the base64-encoded bytes wrapped in Binary I/O objects.
- Parameters
batch – A string or an iterable container with strings at some level of nesting.
- Returns
The same structure where all strings are converted into the base64-encoded bytes wrapped in Binary I/O objects.
-
deeppavlov.models.nemo.common.bytes_io_to_ascii(batch: Union[_io.BytesIO, list]) → Union[str, list]
Recursively searches for Binary I/O objects in the input batch and converts them into ASCII-strings.
- Parameters
batch – A BinaryIO object or an iterable container with BinaryIO objects at some level of nesting.
- Returns
The same structure where all BinaryIO objects are converted into strings.
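
A minimal round-trip sketch of the two conversions, using only the standard library. The names `to_ascii` and `to_bytes_io` are hypothetical stand-ins, and the sketch assumes that the ASCII strings carry base64 text that decodes to the raw bytes held by the Binary I/O objects. The actual DeepPavlov implementation may differ in detail.

```python
import base64
from io import BytesIO
from typing import Union


def to_ascii(batch: Union[BytesIO, list]) -> Union[str, list]:
    """Hypothetical stand-in: base64-encode every BytesIO in a nested list."""
    if isinstance(batch, BytesIO):
        return base64.b64encode(batch.getvalue()).decode('ascii')
    return [to_ascii(item) for item in batch]


def to_bytes_io(batch: Union[str, list]) -> Union[BytesIO, list]:
    """Hypothetical stand-in: decode every base64 string in a nested list."""
    if isinstance(batch, str):
        return BytesIO(base64.b64decode(batch))
    return [to_bytes_io(item) for item in batch]


# The nesting of the input is preserved in both directions.
nested = [BytesIO(b'RIFF'), [BytesIO(b'\x00\x01')]]
encoded = to_ascii(nested)      # ['UklGRg==', ['AAE=']]
decoded = to_bytes_io(encoded)  # same nesting, BytesIO objects again
```

Converting to ASCII strings is useful when audio has to travel through a text-only channel such as JSON.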
-
class deeppavlov.models.nemo.asr.AudioInferDataLayer(*args: Any, **kwargs: Any)
Data Layer for ASR pipeline inference.
-
__init__(*, audio_batch: List[Union[str, _io.BytesIO]], batch_size: int = 32, sample_rate: int = 16000, int_values: bool = False, trim_silence: bool = False, **kwargs) → None
Initializes Data Loader.
- Parameters
audio_batch – Batch to be read. Elements could be either paths to audio files or Binary I/O objects.
batch_size – How many samples per batch to load.
sample_rate – Target sampling rate for data. Audio files with a different sampling rate are resampled to sample_rate.
int_values – If true, load data as 32-bit integers.
trim_silence – Trim leading and trailing silence from an audio signal if True.
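
Two of the parameters above, batch_size and trim_silence, can be illustrated with a short pure-Python sketch. Both helpers are hypothetical, and the fixed amplitude threshold is an assumption made for illustration; the actual layer may use a different criterion for silence, such as a decibel-based one.

```python
from typing import Iterator, List


def trim_silence(samples: List[float], threshold: float = 0.01) -> List[float]:
    """Hypothetical helper: drop leading and trailing samples below threshold."""
    loud = [i for i, value in enumerate(samples) if abs(value) > threshold]
    if not loud:
        return []  # the whole signal counts as silence
    return samples[loud[0]:loud[-1] + 1]


def iter_batches(items: list, batch_size: int = 32) -> Iterator[list]:
    """Hypothetical helper: yield consecutive chunks of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


trimmed = trim_silence([0.0, 0.0, 0.5, -0.3, 0.0])
chunks = list(iter_batches(['a.wav', 'b.wav', 'c.wav'], batch_size=2))
```

Silence inside the signal is kept; only the edges are trimmed.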
-
-
class deeppavlov.models.nemo.tts.TextDataLayer(*args: Any, **kwargs: Any)
-
__init__(*, text_batch: List[str], labels: List[str], batch_size: int = 32, bos_id: Optional[int] = None, eos_id: Optional[int] = None, pad_id: Optional[int] = None, **kwargs) → None
A simple Neural Module for loading text data.
- Parameters
text_batch – Texts to be used for speech synthesis.
labels – List of string labels to use for str2int translation.
batch_size – How many strings per batch to load.
bos_id – Label position of the beginning-of-string symbol. If None, it is initialized as len(labels).
eos_id – Label position of the end-of-string symbol. If None, it is initialized as len(labels) + 1.
pad_id – Label position of the pad symbol. If None, it is initialized as len(labels) + 2.
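
The default positions of the special symbols can be illustrated with a short sketch. The helpers `encode_text` and `pad_batch` are hypothetical. Two things are assumptions here rather than documented behavior: that characters missing from labels are skipped, and that each string is wrapped in the beginning and end symbols.

```python
from typing import List, Optional


def encode_text(text: str, labels: List[str],
                bos_id: Optional[int] = None,
                eos_id: Optional[int] = None) -> List[int]:
    """Hypothetical str2int translation with the documented default ids."""
    bos_id = len(labels) if bos_id is None else bos_id
    eos_id = len(labels) + 1 if eos_id is None else eos_id
    char_to_id = {char: idx for idx, char in enumerate(labels)}
    # Assumption: characters that are not in labels are silently skipped.
    ids = [char_to_id[char] for char in text if char in char_to_id]
    return [bos_id] + ids + [eos_id]


def pad_batch(batch: List[List[int]], pad_id: int) -> List[List[int]]:
    """Hypothetical helper: right-pad every sequence to the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


labels = ['a', 'b', 'c', ' ']
encoded = [encode_text(text, labels) for text in ['cab', 'a']]
padded = pad_batch(encoded, pad_id=len(labels) + 2)
```

With four labels, the defaults place the beginning, end and pad symbols at positions 4, 5 and 6, directly after the regular labels.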
-
-
class deeppavlov.models.nemo.vocoder.WaveGlow(*, denoiser_strength: float = 0.0, n_window_stride: int = 160, **kwargs)
-
__init__(*, denoiser_strength: float = 0.0, n_window_stride: int = 160, **kwargs) → None
Wraps WaveGlowInferNM module.
- Parameters
denoiser_strength – Denoiser strength for waveglow.
n_window_stride – Stride of window for FFT in samples used in model training.
kwargs – Named arguments for WaveGlowInferNM constructor.
-
-
class deeppavlov.models.nemo.vocoder.GriffinLim(*, sample_rate: float = 16000.0, n_fft: int = 1024, mag_scale: float = 2048.0, power: float = 1.2, n_iters: int = 50, **kwargs)
-
__init__(*, sample_rate: float = 16000.0, n_fft: int = 1024, mag_scale: float = 2048.0, power: float = 1.2, n_iters: int = 50, **kwargs) → None
Uses Griffin Lim algorithm to generate speech from spectrograms.
- Parameters
sample_rate – Generated audio data sample rate.
n_fft – The number of points to use for the FFT.
mag_scale – Multiplied with the linear spectrogram to avoid audio sounding muted due to mel filter normalization.
power – The linear spectrogram is raised to this power prior to running the Griffin Lim algorithm. A power greater than 1 has been shown to improve audio quality.
n_iters – Number of iterations used to convert magnitude spectrograms to an audio signal.
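
The roles of mag_scale and power can be sketched as a pre-processing step applied to a linear spectrogram before the iterative reconstruction. The helper `prepare_magnitudes` is hypothetical, and the order of operations (scale first, then raise to the power) is an assumption. The Griffin Lim iterations themselves are not shown.

```python
from typing import List


def prepare_magnitudes(linear_spec: List[List[float]],
                       mag_scale: float = 2048.0,
                       power: float = 1.2) -> List[List[float]]:
    """Hypothetical pre-processing: scale magnitudes, then raise them to power."""
    # Assumption: scaling is applied before exponentiation.
    return [[(value * mag_scale) ** power for value in frame]
            for frame in linear_spec]


# One frame with two frequency bins; small parameters keep the numbers readable.
prepared = prepare_magnitudes([[0.0, 1.0]], mag_scale=4.0, power=2.0)
```

A power above 1 widens the gap between strong and weak frequency bins, which is one way to read the note on audio quality above.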
-