
IMS-Toucan's People

Contributors

adamantcat, ca-ressemble-a-du-fake, eltociear, flux9665


IMS-Toucan's Issues

Does finetuning Vocoder on generated mels work?

Hi,
Do you think fine-tuning the vocoder on mels generated by the FastSpeech2 model would help improve the quality of the synthesis and possibly remove some of the artifacts? Also, the current FastSpeech2 model runs at 16 kHz; do you think increasing it to 22 kHz would improve quality?

Thanks

Magic number 66

While testing the addition of tones and lengths as features, I discovered a hard-coded check for 66 features; the number appears hard-coded without explanation in multiple locations.

Would it be safe to assume that every occurrence of '66' is only there to check the feature count?

normalization of pitch and energy

Hi Florian. I just read your paper on Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech (https://arxiv.org/pdf/2206.12229.pdf). Great job. Very interesting contributions, thanks for sharing this.

I was curious about the way you normalize pitch and energy. In the paper, it is mentioned: "a way of normalizing pitch and energy that allows for the overwriting procedure to be compatible with a zero-shot multispeaker setting by regaining the value ranges for each speaker through an utterance level speaker embedding" and "we normalize those values by dividing them by the average of the sequence that they occur in."
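For concreteness, a minimal sketch of what the quoted normalization could look like (dividing each sequence by its own average); the function name and the handling of zero/unvoiced frames are my own assumptions, not the repository's code:

import numpy as np

# Illustrative sketch only: normalize a pitch (or energy) sequence by dividing
# by the average of that sequence, as described in the quoted passage.
# Assumption: zero values mark unvoiced frames and are excluded from the mean.
def normalize_sequence(values):
    values = np.asarray(values, dtype=np.float32)
    voiced = values[values != 0]
    if len(voiced) == 0:
        return values
    return values / voiced.mean()

pitch_curve = [110.0, 0.0, 220.0, 180.0]   # toy phone-level pitch values in Hz
print(normalize_sequence(pitch_curve))     # roughly [0.65, 0.0, 1.29, 1.06]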

Does this mean

  1. that the utterance-level speaker embedding is used as input to the variance adaptor?
  2. that phone-level pitch/energy features are used?

Thank you in advance. Best!

Option to generate audio files to hear how the training evolves

Hi,

I haven't found an option to generate audio files every now and then to check whether the training is progressing and to prevent overfitting.

On the Weights & Biases website, and on disk, only the mel spectrograms are available. It would be great if audio files of the test sentences were also available.

If it slows down the training too much, an option could enable or disable the audio generation.

I know that I can work around this limitation by merging the last checkpoints and then loading the merged checkpoint to infer the test sentences, but I find this process cumbersome, and it sometimes causes the training to stop (maybe because of an out-of-memory error).

If needed I can help implement this feature!
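To sketch what such a hook could look like (all names here are placeholders, not the actual Toucan API), one could synthesize a fixed test sentence at every checkpoint and write it to disk:

import os
import soundfile as sf

# Hypothetical sketch: `acoustic_model` and `vocoder` are placeholders for
# whatever produces a spectrogram and a waveform in the training loop.
def log_test_audio(acoustic_model, vocoder, sentence, step, save_dir, sr=48000):
    spec = acoustic_model.inference(sentence)               # placeholder call
    wave = vocoder(spec).squeeze().detach().cpu().numpy()   # placeholder call
    audio_dir = os.path.join(save_dir, "audio")
    os.makedirs(audio_dir, exist_ok=True)
    sf.write(os.path.join(audio_dir, f"step_{step}.wav"), wave, sr)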

There is a problem with fine-tuning HiFiGAN

Hi, thank you for your contribution. I have a problem fine-tuning HiFiGAN based on the pre-trained model you provided: hifigan_train_loop.py fails on line 71 because check_dict["generator_optimizer"] is not found.
I checked your v2 release: for HiFiGAN model version 2, only check_dict["generator"] can be obtained; there is no "generator_optimizer" or "discriminator_optimizer". How can I solve this problem?

Here is the code in hifigan_train_loop.py that raises the error:

if path_to_checkpoint is not None:
    check_dict = torch.load(path_to_checkpoint, map_location=device)
    optimizer_g.load_state_dict(check_dict["generator_optimizer"])
    optimizer_d.load_state_dict(check_dict["discriminator_optimizer"])
    scheduler_g.load_state_dict(check_dict["generator_scheduler"])
    scheduler_d.load_state_dict(check_dict["discriminator_scheduler"])
    g.load_state_dict(check_dict["generator"])
    d.load_state_dict(check_dict["discriminator"])
    step_counter = check_dict["step_counter"]
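If the released checkpoint really only contains the network weights, one possible workaround (just a sketch, not an official fix) would be to restore only the entries that exist and start with fresh optimizer state, e.g. by replacing the block above with:

if path_to_checkpoint is not None:
    check_dict = torch.load(path_to_checkpoint, map_location=device)
    g.load_state_dict(check_dict["generator"])
    if "discriminator" in check_dict:
        d.load_state_dict(check_dict["discriminator"])
    if "generator_optimizer" in check_dict:
        # full training checkpoint: restore optimizers, schedulers and step count
        optimizer_g.load_state_dict(check_dict["generator_optimizer"])
        optimizer_d.load_state_dict(check_dict["discriminator_optimizer"])
        scheduler_g.load_state_dict(check_dict["generator_scheduler"])
        scheduler_d.load_state_dict(check_dict["discriminator_scheduler"])
        step_counter = check_dict["step_counter"]
    else:
        # weights-only release checkpoint: start fine-tuning with fresh optimizers
        step_counter = 0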

Is there a way to change the Speaker Embedding layer to other models?

Hi, is there any chance we can change the speaker embedding layer from the current SpeechBrain ECAPA-TDNN and SpeechBrain x-vector to some other model, such as the speaker embedding model from Coqui TTS?

With the current model, the output voice gender is sometimes female even when the gender of the input reference audio is male, so I want to try some other speaker embedding models too.

And do we need to change the sample rate of the reference file to 16 kHz before passing it to the tts.set_utterance_embedding(path_to_reference_audio=reference) function?
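Regarding the sample rate: if 16 kHz input is indeed required (an assumption worth checking against the documentation of set_utterance_embedding), the reference audio could be resampled beforehand, for example:

import librosa
import soundfile as sf

# Sketch: load the reference audio and resample it to 16 kHz before use.
wave, _ = librosa.load("reference.wav", sr=16000)
sf.write("reference_16k.wav", wave, 16000)
# tts.set_utterance_embedding(path_to_reference_audio="reference_16k.wav")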

Thanks

IndexError: index -1 is out of bounds for axis 0 with size 0

I am getting an index error after several rounds of training a new model.

Any suggestions?

Prepared a FastSpeech dataset with 2756 datapoints in Corpora/chr-w.
Training model
Reloading checkpoint_126775.pt
0%| | 0/275 [00:00<?, ?it/s]

/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 275/275 [00:29<00:00,  9.26it/s]
Traceback (most recent call last):
  File "run_training_pipeline.py", line 78, in <module>
    pipeline_dict[args.pipeline](gpu_id=args.gpu_id,
  File "/home/muksihs/git/IMS-Toucan/TrainingInterfaces/TrainingPipelines/FastSpeech2_Cherokee_West.py", line 45, in run
    train_loop(net=model,
  File "/home/muksihs/git/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/fastspeech2_train_loop.py", line 195, in train_loop
    plot_progress_spec(net, device, save_dir=save_directory, step=step_counter, lang=lang, default_emb=default_embedding)
  File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/muksihs/git/IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/fastspeech2_train_loop.py", line 57, in plot_progress_spec
    lbd.specshow(spec,
  File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/librosa/display.py", line 959, in specshow
    kwargs.setdefault("cmap", cmap(data))
  File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/librosa/display.py", line 576, in cmap
    min_val, max_val = np.percentile(data, [min_p, max_p])
  File "<__array_function__ internals>", line 5, in percentile
  File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3818, in percentile
    return _quantile_unchecked(
  File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3937, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3515, in _ureduce
    r = func(a, **kwargs)
  File "/home/muksihs/miniconda3/envs/toucan_conda_venv/lib/python3.8/site-packages/numpy/lib/function_base.py", line 4050, in _quantile_ureduce_func
    n = np.isnan(ap[-1])
IndexError: index -1 is out of bounds for axis 0 with size 0

Model training stops all of a sudden

Hi,

Thank you very much for this amazing project! I run two trainings in parallel on an RTX 3090 (on a separate headless machine) because the memory footprint of a single training is low (around 10 GB), so there is room for two models!

I use screen so that the trainings keep running when I disconnect. But from time to time, when I reattach the screen sessions, one of the trainings has stopped in the middle of an epoch, showing that the process was interrupted without any error. This can happen minutes or hours after I launch the training, after 12k steps or after 22k steps (the spectrogram loss can be 0.18 or 0.31, not NaN).

How can I find the cause of this interruption?

Thanks in advance for your help

Some questions about speaker generation

Great job! I'm interested in timbre generation. Do you have a more detailed description (like a paper) or a tutorial on the updates for the following two items?

  1. We now use a self-supervised embedding function based on GST, but with a special training procedure to allow for very rich speaker conditioning.
  2. We trained a GAN to sample from this new embedding space. This allows us to speak in voices of speakers that do not exist. We also found a way to make the sampling process very controllable using intuitive sliders. Check out our newest demo on Huggingface to try it yourself!

Can I improve the naturalness of the generated output?

Hi,

I have been playing around with Toucan TTS for some time; it is really easy to use and training is fast. I fine-tuned the provided Meta pretrained model with an 8-hour dataset, and the result is not as good as I was expecting. So I wonder if I could make it even better, or if you could help me spot where the "problem" lies in the generated audio:

Here are the waveforms (top is Coqui VITS trained from scratch for 260k steps, bottom is Toucan FastSpeech2 trained from the Meta model for 200k steps):
[image: ToucanVsCoquiWaveforms]

The associated spectrograms:
[image: ToucanVsCoquiSpectrograms]

And the audios:

This is from Coqui VITS; I find it crystal clear:
https://user-images.githubusercontent.com/91517923/202889766-0c2ad9ad-2ec2-4376-9abc-17a008e58364.mp4

This is from FastSpeech2. It sounds like an old tape, and the voice seems to shiver (I don't know if that's the right term!):
https://user-images.githubusercontent.com/91517923/202889734-3a02486d-3785-4e83-8365-614c6ac0f64f.mp4

Both generated audios have been compressed to mp4 to be able to post them, but they are pretty close to what the wavs sound like (to my hearing there is no difference).

So how can I make the Toucan FastSpeech2 model sound better? Should I train it for more steps, or is it, on the contrary, over-trained / over-fitted? Or would the only way be to implement VITS in Toucan (which I don't think is straightforward to do)?

Thank you in advance for helping me improve the results!

Will increasing the Duration, Pitch and Energy layers help improve quality

Hi
I am curious: if we increase the number of layers for the duration, pitch, and energy predictors using the 'duration_predictor_layers' parameter and related parameters in the architecture, will it bring the duration and pitch predictions closer to the audio of the given embedding sample?

If it does, can you suggest some parameters I could tweak to train a bigger and better model?

Thanks

multispeaker multilanguage finetuning

Hello,

I would like to know how the training pipelines for a multi-speaker model, and then for a multi-speaker multi-language model, should be set up.
When fine-tuning a monolingual, single-speaker model I set lang_embs=100, but what about a multi-speaker setting? And what about a multi-language, multi-speaker model? Should I also set the utt_embeds parameter?

Thanks!!!
Sarah

Question about dataset

Hi,

In the paper, 5 minutes of speech are used to train on a new language. But for fine-tuning the Meta model on an already seen language (say French), is it worth providing hours of single-speaker audio? I mean, will the quality of the model improve when more than 5 minutes of audio is provided? I tried a home-made 75-minute dataset, but I still could not recognize the speaker in the generated audio after fine-tuning for up to 120k steps, although the prosody was awesome!

And regarding speaker transfer (see the interactive demo): because you output 44 kHz audio, should the input (reference) audio also be 44 kHz? I tried with 16 kHz audios but could not recognize the reference speaker in the generated output, although, as mentioned earlier, the prosody was pretty good.

So to summarize:

  1. how long should the dataset be?
  2. which sample rate should it have?
  3. how long should each sample be? I understand from this chat that adding longer samples enhances the quality of the model, but to what extent: should I add 20 s, 30 s, or 60 s audios to my dataset?

Thank you very much for your help 😄

Question: CTC loss?

What is CTC loss? I have figured out that a large percentage of my chr samples are getting culled because of it. My samples are normalized. How negative would the effect be if I turned CTC-loss checking off?

I really noticed this when I created "augmented" data, combining random samples into joined sentences (with 350 ms pauses) up to 20-second segments. I think the fact that most of my data consists of single-word utterances is causing an issue where certain phoneme combinations lose their correctness after 15,000 to 20,000 iterations: Uguku becomes Uguk, and Sihgwa becomes S sihgwa. So I'm trying to create synthetically augmented data to reduce this issue.
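For reference, the augmentation described above could be sketched roughly like this (file handling, mono audio, the sampling rate and the 20-second target are assumptions taken from the description):

import random
import numpy as np
import soundfile as sf

def build_segment(wav_paths, sr=16000, pause_ms=350, target_s=20):
    # Join randomly chosen clips with 350 ms of silence until ~20 s is reached.
    # Assumes mono wav files that already share the same sampling rate.
    pause = np.zeros(int(sr * pause_ms / 1000), dtype=np.float32)
    pieces, total = [], 0
    while total < target_s * sr:
        wave, file_sr = sf.read(random.choice(wav_paths), dtype="float32")
        assert file_sr == sr, "resample the inputs to a common rate first"
        pieces += [wave, pause]
        total += len(wave) + len(pause)
    return np.concatenate(pieces)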

Fix Tensor to Numpy issue in InferenceInterfaces/FastSpeech2Interface.py file

The inference file has an issue when using the new enhancements. I think you forgot to add .cpu() at the end of wave.unsqueeze(0) in line 161, which leads to the error below. Can you check?
File "/Desktop/IMS-Toucan/InferenceInterfaces/FastSpeech2Interface.py", line 161, in forward wave = enhance(self.enhancer, self.df, wave.unsqueeze(0), pad=True).squeeze() TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

What do the cycle objectives refer to?

Hi,

I saw parameters called 'phase_1_steps' and 'phase_2_steps' in the fine-tuning pipeline. They seem to deal with 'cycle objectives' and do not appear in all pipelines. So far I have set them equal, so that they sum to 200k steps, but I am not sure what they actually do.

I could not find a reference to them in the paper. How should they be set, and what is their purpose?

Thank you for shedding light on that topic !

Inference Performance

FastPitch, FastSpeech2 and Avocodo all claim inference speeds faster than 100x real time on GPU, sometimes far more (FastPitch).

When doing inference on an Nvidia A100, I barely get above 1x real time (2.6 seconds to generate 3 seconds of audio), measured with Python's time.perf_counter() around the FastSpeech2Interface.forward() call (so model loading and wav file writing are not taken into account).

More precisely, these are the results (which stay roughly the same across several runs):

Text2phone done in 0.0018 seconds.
Phone2mel done in 0.6225 seconds.
Mel2wav done in 1.8307 seconds.
Enhancement done in 0.1348 seconds.
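For context, the measurement was done along these lines (a sketch with assumed names; `tts` stands for an InferenceFastSpeech2 instance and a 48 kHz output rate is assumed):

import time

start = time.perf_counter()
wave = tts("This is a test sentence.")   # assumed to invoke FastSpeech2Interface.forward
elapsed = time.perf_counter() - start
audio_seconds = len(wave) / 48000        # assumed output sampling rate
print(f"{elapsed:.2f}s for {audio_seconds:.2f}s of audio "
      f"({audio_seconds / elapsed:.2f}x real time)")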

Do other people get similar results, perhaps with different GPUs, and what could be the explanation? Are papers inflating their real-world end-to-end inference performance? Is it this implementation (has anyone compared with other FastSpeech2 / HiFiGAN implementations)? Or is it a much lower-level bug, e.g. related to the CUDA version?

I am on the v2.3 release.

Any help greatly appreciated, thanks :)

ModuleNotFoundError: No module named 'df'

Hello Everyone,

I'm facing this issue while trying to run run_interactive_demo.py or run_text_to_file_reader.py.

The following error is returned:
Traceback (most recent call last):

  File "run_text_to_file_reader.py", line 5, in <module>
    from InferenceInterfaces.FastSpeech2Interface import InferenceFastSpeech2
  File "/content/IMS-Toucan/InferenceInterfaces/FastSpeech2Interface.py", line 11, in <module>
    from df.enhance import enhance
ModuleNotFoundError: No module named 'df'

I have already downloaded the models using run_model_downloader.py.

Also, when checking the directory structure, I cannot find any module named df.

Please let me know if I'm doing anything wrong.
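For what it's worth, the df module is typically provided by the DeepFilterNet package (used here for speech enhancement), so a quick way to check the environment is something like the following (the pip package name deepfilternet is an assumption; the project's requirements file is the authoritative source):

# Sketch: verify whether the 'df' dependency (DeepFilterNet) is importable.
try:
    from df.enhance import enhance  # noqa: F401
    print("The 'df' module (DeepFilterNet) is available.")
except ModuleNotFoundError:
    print("Missing dependency: try `pip install deepfilternet` "
          "or reinstall the project's requirements.")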

Thanks

Add scorer utility for aligner and for fastspeech2 models. (Feature Request)

It would be useful to have a scorer utility that processes all the training data after training and writes the score, audio path, and transcript text to a file.

The purpose would be to identify problematic training data that may need transcript correction or removal due to bad audio or other issues.

This would be useful to remove bad data from future training cycles to achieve better models.

Separate utilities for the aligner and the FastSpeech models would be ideal.

Sound quality dropped compared to Huggingface

The old version (without Avocodo, without full IPA support) somehow produces better results than the current one.
It is hard to say where exactly the quality dropped.
Has anyone else noticed this? Any hints for fine-tuning the inference?

espeak mistake for Chinese

Actually, this is a problem with espeak-ng's IPA output.
For Mandarin Chinese, which uses five-level tone marks, the tone mark for a character consists of more than one digit, i.e.:

  1. 55 for the 1st tone, namely "阴平",
  2. 35 for the 2nd tone, namely "阳平",
  3. 214 for the 3rd tone, namely "上声",
  4. 51 for the 4th tone, namely "去声".

When using espeak with the language "cmn-latn-pinyin", it outputs the correct tone marks, but with the --ipa flag the output only keeps the first digit, which is clearly a mistake; see the attached screenshot for an example.
And the latest phonemizer package (v3.2.1) passes the IPA code to the espeak API, so it always gets the wrong result. By the way, I'm using the latest espeak-ng (eSpeak NG text-to-speech: 1.52-dev, built from GitHub source).

Fortunately, we can simply use the tone marks from Pinyin instead, with just one line of code inserted into Preprocessing/TextFrontend before line 265, as:

class ArticulatoryCombinedTextFrontend:
    ...
    def get_phone_string(...):
        ...
        if self.g2p_lang == "cmn-latn-pinyin" or self.g2p_lang == "cmn":
            phones = ' '.join([re.sub(r'[1-5ɜ]', u[-1], p) for p, u in zip(phones.split(), utt.split())])
            ...

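To illustrate what the inserted line does, here is a standalone toy example (the strings are made up for illustration, not actual espeak output):

import re

phones = "ni2 xau5"   # fake phonemizer output where the tone digits are wrong
utt = "ni3 hao3"      # pinyin with the correct numeric tone marks
fixed = ' '.join(re.sub(r'[1-5ɜ]', u[-1], p) for p, u in zip(phones.split(), utt.split()))
print(fixed)          # -> "ni3 xau3": each phone now carries the pinyin tone digit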

About Multilingual Training

Hi,
I want to train a cross-lingual (Chinese and English) model from scratch, so I chose pipeline=meta in run_training_pipeline.py, and I got the following warning at the beginning:

"UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the      
learning rate schedule."

Then I checked the train loop file (IMS-Toucan/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/meta_train_loop.py), but I didn't find anything wrong. Does this warning affect the model's results?

BTW, Silero VAD seems to need pytorch >= 1.12.0, but the suggested version is torch==1.10.1 😄
Thanks!

Detected call of `lr_scheduler.step()` before `optimizer.step()`.

lr_scheduler.py:129:
    UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`.
    In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.
    Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.
    See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate

Utterance cloner unable to generate speech using text different from the reference-transcription

Hello!

I read the paper "Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech" and I ran the code using "run_utterance_cloner.py"

Looking at the aligner between text and reference, the model is not able to generate speech using text that doesn't correspond to the reference.

However, in your demo on HuggingFace .. this feature was supported.

Could you, please, describe how to generate a speech of different text using this approach?

Error when running weight averaging.

python run_weight_averaging.py 
selecting checkpoints...
selecting checkpoints...
loading model Models/FastSpeech2_Cherokee_West/checkpoint_200000.pt
loading model Models/FastSpeech2_Cherokee_West/checkpoint_199000.pt
loading model Models/FastSpeech2_Cherokee_West/checkpoint_198000.pt
loading model Models/FastSpeech2_Cherokee_West/checkpoint_197000.pt
loading model Models/FastSpeech2_Cherokee_West/checkpoint_196000.pt
averaging...
saving model...
...done!
selecting checkpoints...
Traceback (most recent call last):
  File "run_weight_averaging.py", line 120, in <module>
    make_best_in_all(n=5)
  File "run_weight_averaging.py", line 105, in make_best_in_all
    averaged_model, default_embed = average_checkpoints(checkpoint_paths, load_func=load_net_fast)
TypeError: cannot unpack non-iterable NoneType object

Samples

Hello, where are the new samples generated by the new vocoder named Avocodo? Can you please provide some for me to hear (seen or unseen)? They claim it is artifact-free.

Also, how many thousand steps/epochs did you train this new vocoder for, and can it be used as a pretrained model?

Feel free to close the issue after responding.

Thanks in Advance.

Can you release Avocodo 24k vocoder model

Hi
I trained a 24 kHz FastSpeech2 model, but I don't have a matching vocoder, and the vocoders in the ParallelWaveGAN repo are not performing as well as the Avocodo model. Since you started training an Avocodo 24k model, can you release it so that I can check the quality of my model?

Thanks

Feature request: Apple M1 Support / Docker build

Howdy there,

I am working with Michael Conrad on the Cherokee TTS project and am on an ARM architecture. I would love to work with the Digital Phonetics team to either get a working set of dependencies for performant cross-platform development or an addition to the README with steps for installing on Apple Silicon.

If this feels like too narrow a use case, I would also work with you to get an IMS-Toucan Docker image built so that this tool can be used anywhere in the cloud (and, conveniently, via Rosetta 2 and Docker Desktop on Mac) without any local install.

If you are interested in helping with either of these things, please be in touch.

How is Mel-loss calculated with super-resolution?

Hey Florian - Aidan here. Nice to meet you at ACL and thanks again for this implementation.

I had a question about the super-resolution implementation in your HiFiGAN vocoder - you state that the model now takes Mel Spectrograms from 16kHz audio and you've adapted the upsampling rates to triple the sampling rate of the output waveform to 48kHz. Is there a requirement that your data is 48kHz then? For example the LJ dataset is only 22.05kHz, so when you're calculating the loss between the Mel spectrogram of the predicted waveform (y_g_hat_mel in the original implementation) and the ground truth Mel spectrogram (y_mel in the original implementation), there will be a sampling rate mismatch (48kHz for the generated Mel Spectrogram and 22.05kHz for the ground truth). Did I misunderstand something, or does this model require 48 kHz data even if the Mel spectrogram is 16 kHz?

Thanks!

Too much Distortion while synthesizing high pitch female voices

Hi,

I really like the prosody transfer capabilities of the new GST-based model, but it is not able to handle all types of voices the way the ECAPA model does. For example, I tried synthesizing some high-pitch female voices and female mocking-emotion samples, and the output is too distorted. I even tried fine-tuning the style embedding layer using samples with background noise and high-pitch voices (I used sox to increase the pitch of a female TTS dataset); there is a slight improvement, but the problem still exists.

I noticed that you are working on a bigger embedding function with some modifications in a separate branch. I just wanted to know whether you have noticed this issue, how the new model performs in these cases if so, and whether there is a projected date for its release.

Thanks

Getting "BrokenPipeError", "ConnectionResetError", and "EOFError" errors for hifigan training.

I've put wav files at 48k SR into a folder using the following script: https://github.com/CherokeeLanguage/Cherokee-IMS-Toucan/blob/main/create_vocoder_files.py

I am using the following to call the training: https://github.com/CherokeeLanguage/Cherokee-IMS-Toucan/blob/main/HiFiGAN_combined.py

At 55% of preprocessing it crashes, and I can't see the cause in the error log.

hifigan-crash.log

Assistance appreciated.

python=3.8.12

Python Environment: https://github.com/CherokeeLanguage/Cherokee-IMS-Toucan/blob/main/environment.yml

Batch Size

Is it feasible to train the vocoder with a batch size of 6? I have a laptop with an 8 GB GPU. Batch size 6 currently shows about 7 GB of GPU memory in use.

Adding a New Language

First of all, many thanks for a great repo. I'm kind of new to this stuff, so please forgive me. Can we train a new speaker and language using this repo, for example Turkish? I would be very grateful if you could provide information on what the structure of the dataset should be and how it should be prepared.

Overfitting - how to detect and stop training?

I have an issue with overfitting on the data, which seems to degrade the Cherokee portion of the output.

The Cherokee output starts dropping trailing syllables that start with an 'h' in later iterations, which are rendered OK in earlier iterations.

I've been trying higher iterations to get better voice matching between samples and model for data set specific voices.

Is there a way to get the loss on a per language basis?

I'm currently retraining the aligner with the Cherokee audio sourced from tape removed.
I will then train the TTS again to see if that helps any.

How to add a pause in the middle of a sentence?

Is there a way to add a pause in the middle of a sentence during synthesis? E.g. for "What are you doing? Where are you going?", can I add a little pause after "doing"?

Thanks

Question about batch size

I'm thinking about increasing the batch size to try to speed up training. My current ETA is 42 hours remaining.

I'm currently at ~50% GPU usage with 11,233 MiB in use on an RTX 3090 with 24,259 MiB.

Any considerations to take into account?

Multilingual ZS-Multispeaker speaker embedding injection

Hi Florian. I just read your paper on low resource multilingual zero-shot multi-speaker TTS (https://arxiv.org/pdf/2210.12223.pdf). Great job. Very interesting contributions, thanks for sharing this.

I was curious about the way you integrate speaker embeddings into the encoder hidden state. In the paper, it is mentioned: "An important trick we found is to add layer normalization right after the embedding is injected into the hidden state." Does this mean you experienced improvements in zero-shot adaptation or in the resulting audio quality by applying this layer norm after injecting the speaker embeddings? Also, is the layer norm applied right after concatenating the speaker embeddings to the encoder hidden state, or right after spk_emb+hs are projected back down to the hidden-state size?

Thank you in advance. Best!

[Questions] Different behavior between latest github (default branch) and hugging face version

First of all, many thanks for the toolkit.

Though the underlying models for FastSpeech2 and HiFiGAN seem to be the same, I get completely different behavior when synthesizing German texts: the GitHub version gives me an English female speaker's voice (model = Meta, language = de), while the Hugging Face version results in a German male speaker (German text, accent and voice).

My question is: in which file can I fine-tune this behavior?

Training process stuck at AlignerDataset.py

Hi, I am trying to fine-tune the FastSpeech2 model with a custom dataset and followed all the steps according to the instructions in the README file. But when I try to run the training, the process gets stuck in AlignerDataset.py with the following log:

Preparing
... building dataset cache ...
32%|███████████████████████████████████████████████████▉ | 23/73 [00:15<00:22, 2.25it/s]

Can you give any pointers on what I might have done wrong?

Thanks

beginner guide/tutorial

Hi, thanks for the amazing toolkit. For a beginner dealing with low-resource languages other than English, could we have step-by-step instructions from scratch for transfer learning from the pretrained LJ Speech (or any other) model? That would be highly appreciated. Thanks in advance.

Relationship between hop length and upsample rates

me again 😄

When setting the hop length: your 16 kHz spectrogram has a hop length of 256 and a window of 1024 (4 * hop_length).

Since you are tripling the sampling rate, why is the hop length set to only 1.5 times that (384), with a window of 1536? I would have expected the hop length to stay proportional to the sampling rate. Indeed, you have a comment in HiFiGANDataset.py stating "hop length of spec loss must be same as the product of the upscale factors", but your upscale rates (8, 6, 4, 4) have a product of 768 (3 * 256). So I would have expected the hop_length to be set to 768 and the window to 3072. Am I misunderstanding something here?

P.S sorry if I'm being pedantic 😅
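For reference, the arithmetic in question (the values are taken from the post, not necessarily from the repository's current config):

import math

hop_16k, win_16k = 256, 4 * 256   # 256 and 1024 at 16 kHz
hop_48k, win_48k = 384, 4 * 384   # 384 and 1536 as described in the post (only 1.5x)
upsample_rates = (8, 6, 4, 4)
print(math.prod(upsample_rates))  # 768 == 3 * 256, i.e. proportional to the 3x sampling rate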
