Comments (8)
I am also trying to train the XTTS GPT model as a beginner. The documentation suggests that we can only train the model to clone a single voice. My question is: can we train XTTS on a multilingual, multi-speaker dataset? I would like to improve the general model quality in three different languages (Spanish, Italian and German).
I know this isn't the best place to ask this question, but I know that you encountered the same problem.
from tts.
@erogol So I tried to train XTTS v2 with multiple speakers in Chinese. The evaluation loss seems abnormal.
from tts.
@Thomcle For now, I don't think XTTS v2 supports this mechanism, i.e. training with multiple speakers by setting the speaker name from the audio name. I have tried this, but the inference performance is not stable.
from tts.
We are also trying to do this. I don't see why this is not possible theoretically if the dataset quality is good. I think what is important is that the model sees a mixture of various languages during training i.e. one minibatch language A, then language B and so on.
I think the solution would look something like changing this:
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/raid/datasets/LJSpeech-1.1_24khz/",
    meta_file_train="/raid/datasets/LJSpeech-1.1_24khz/metadata.csv",
    language="en",
)
to this:
config_dataset = BaseDatasetConfig(
    formatter="ljspeech",
    dataset_name="ljspeech",
    path="/raid/datasets/LJSpeech-1.1_24khz/",
    meta_file_train="/raid/datasets/LJSpeech-1.1_24khz/metadata.csv",
    language="auto",
)
and, when language="auto", the language would be detected while loading the dataset. I think there are many libraries that can do this.
Some additional logic might be needed if we want to make sure that each minibatch contains only one language, although I am not sure how important that is. We should train and find out.
Once the language is recognized and converted to tokens, the rest of the process is the same and should need no change.
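The "one minibatch, one language" idea above can be sketched roughly as follows. This is only an illustration in plain Python, not something taken from the TTS code base: the function name language_batches and the assumption that every sample is a dict with a "language" key are mine.

```python
import random
from collections import defaultdict


def language_batches(samples, batch_size, seed=0):
    """Return a list of index batches in which every batch holds one language.

    Hypothetical helper: assumes each sample is a dict with a "language" key.
    """
    rng = random.Random(seed)

    # Group sample indices by their language code.
    by_language = defaultdict(list)
    for index, sample in enumerate(samples):
        by_language[sample["language"]].append(index)

    # Cut each language group into batches of at most batch_size indices.
    batches = []
    for indices in by_language.values():
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])

    # Shuffle the batch order so that languages are interleaved over an epoch.
    rng.shuffle(batches)
    return batches
```

A list of index batches like this is the shape that PyTorch's DataLoader accepts through its batch_sampler argument. Whether the trainer used here exposes a hook for a custom batch sampler is not something this thread establishes.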
from tts.
One more solution that we are trying (training starts and the loss decreases without errors):
Go to this file:
/TTS/TTS/tts/datasets/__init__.py
import os

import langid


def add_extra_keys(metadata, language, dataset_name):
    for item in metadata:
        # add language name: detect it from the text instead of using the `language` argument
        language = langid.classify(item["text"])[0]
        # everything that is not detected as English is treated as Hindi (specific to my dataset)
        if language != "en":
            language = "hi"
        item["language"] = language
        # add unique audio name
        relfilepath = os.path.splitext(os.path.relpath(item["audio_file"], item["root_path"]))[0]
        audio_unique_name = f"{dataset_name}#{relfilepath}"
        item["audio_unique_name"] = audio_unique_name
    return metadata
Modify the function as shown above, so that item['language'] is set by a language detection model or some custom logic instead of by the parameter you pass during training.
Training is currently running; I will share the results here whether they are good or bad. The loss seems to have dropped significantly, though.
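For anyone whose data is not English plus Hindi, the same idea could be written with the detector and the fallback passed in rather than hard-coded. The sketch below is hypothetical: assign_languages, toy_classifier and their parameters are illustrative names, and the toy classifier (a Devanagari character check) only stands in for a real detector.

```python
def assign_languages(metadata, classify, supported, fallback):
    """Set item["language"] on every item using a pluggable classifier.

    classify:  callable that takes a text and returns a language code
    supported: set of language codes that should be kept as detected
    fallback:  code to use when the detected language is not in `supported`
    """
    for item in metadata:
        detected = classify(item["text"])
        item["language"] = detected if detected in supported else fallback
    return metadata


def toy_classifier(text):
    # Hypothetical stand-in for a real detector:
    # any Devanagari character makes the text count as Hindi.
    return "hi" if any("\u0900" <= ch <= "\u097f" for ch in text) else "en"


metadata = [{"text": "Hello there"}, {"text": "नमस्ते"}]
assign_languages(metadata, toy_classifier, supported={"en", "hi"}, fallback="hi")
```

With langid, the classifier argument could be something like lambda text: langid.classify(text)[0], which is the same call used in the modified function.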
from tts.
@smallsudarshan So have you changed the ljspeech formatter as well? In the ljspeech formatter, the speaker name is set to ljspeech for all audio files by default, which is not correct for the multi-speaker scenario. What do you think?
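If the speaker name does matter, one option would be a formatter that reads it from the metadata instead of using a fixed value. The sketch below is an assumption-heavy illustration: the name multispeaker_formatter, the file_id|speaker_name|text row layout and the wavs/ subfolder are hypothetical, and the function signature and returned keys are modelled on what the thread shows for the ljspeech case.

```python
import os


def multispeaker_formatter(root_path, meta_file, **kwargs):
    """Build metadata items with a per-row speaker name.

    Assumed (hypothetical) row layout: file_id|speaker_name|text
    Assumed audio location: <root_path>/wavs/<file_id>.wav
    """
    items = []
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first two separators only, so the text may contain "|".
            file_id, speaker_name, text = line.split("|", 2)
            items.append(
                {
                    "text": text,
                    "audio_file": os.path.join(root_path, "wavs", file_id + ".wav"),
                    "speaker_name": speaker_name,
                    "root_path": root_path,
                }
            )
    return items
```

How such a function gets wired into the training script is left out here, since the thread does not show it.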
from tts.
Hey @OswaldoBornemann, I have not, actually. I was going through the code and I think the speaker name is not used anywhere for training. If you think it is, please let me know.
It is used in the split_dataset function in TTS/tts/datasets/__init__.py, however, and you might get slightly better eval metrics if you do this.
But my dataset is well balanced across speakers at the moment, so I have not added this.
from tts.
After 5 epochs on a total of around 12 thousand samples of varying sizes from 2 speakers (I did not pre-process much, i.e. I did not create a Gaussian distribution of text lengths, accents, etc.), here are a few samples in English and Hindi.
It seems to do a decent job given the data. For example, my Hindi audio has a very strong Assamese accent (Assam is a particular region of India), which the model has picked up. The quality of the generated audio is very close to that of the data I trained on.
In my experience:
- Data quality, in terms of clarity, consistency and labelled punctuation, matters a lot more than anything else.
- Greater data diversity gives more robust results across speaker references. Otherwise some speaker references might be stable and others might not be.
It is also able to produce audio for sentences where Hindi and English are mixed in the text, which is often the case, although I have not explicitly trained on such sentences.
from tts.