uberduck-ai / uberduck-ml-dev
ML models for Uberduck
License: Apache License 2.0
"We appreciate the support. Unfortunately, sticking our heads out on this didn't make sense, but we are sure someone else will."
I'm sorry, but what do you mean by "didn't make sense"? For character voice cloning to survive, the proposals I have provided need to be taken on board so that all sides are satisfied. It's very important.
In addition, I propose releasing the source code for character voice cloning in general.
I am also slightly annoyed because, despite my sharing those proposals, they are not being taken on board, as much as I would love to see fictional character voice cloning march on forever. To segue to FakeYou, all I got there were evasive replies from fake accounts on Discord.
I am also annoyed that, rather than looking for compromises that would satisfy all sides, like the ideas I proposed, the features are being done away with without any consideration. Besides, like I said, there are people who just use the programs for fun and nothing more, not for any kind of project. All I want is for the proposals to be considered, to ensure the success of fictional character voice cloning. Thank you.
When I run the training script it seems to go well but then it says it cannot locate one of the wav files.
I've gone into the filelist and tried removing the entries, but it just reports another wav file that cannot be located.
I've made sure my config has the correct paths to everything, and I've verified multiple times that the wav files are there.
When I enter the command, this is what I get:
python -m uberduck_ml_dev.exec.train_tacotron2 --config "tacotron2_config.json"
TTSTrainer start 9218.209915733
Initializing trainer with hparams:
{'attention_dim': 128,
'attention_location_kernel_size': 31,
'attention_location_n_filters': 32,
'attention_rnn_dim': 1024,
'audio_encoder_dim': 192,
'audio_encoder_path': None,
'batch_size': 18,
'checkpoint_name': 'morgan_freeman',
'checkpoint_path': 'checkpoints',
'coarse_n_frames_per_step': None,
'config': 'tacotron2_config.json',
'cudnn_enabled': True,
'dataset_path': '.',
'debug': False,
'decoder_rnn_dim': 1024,
'distributed_run': False,
'encoder_embedding_dim': 512,
'encoder_kernel_size': 5,
'encoder_n_convolutions': 3,
'epochs': 5001,
'epochs_per_checkpoint': 10,
'filter_length': 1024,
'fp16_run': False,
'gate_threshold': 0.5,
'get_gst': None,
'grad_clip_thresh': 1.0,
'gst_dim': 2304,
'gst_type': 'torchmoji',
'has_speaker_embedding': True,
'hop_length': 256,
'ignore_layers': ['speaker_embedding.weight'],
'include_f0': False,
'is_validate': True,
'learning_rate': 0.0005,
'load_f0s': False,
'load_gsts': False,
'log_dir': 'runs',
'lr_decay_min': 1e-05,
'lr_decay_rate': 216000,
'lr_decay_start': 15000,
'mask_padding': True,
'max_decoder_steps': 1000,
'max_wav_value': 32768.0,
'mel_fmax': 8000.0,
'mel_fmin': 0.0,
'n_frames_per_step_initial': 1,
'n_mel_channels': 80,
'n_speakers': 1,
'num_heads': 8,
'num_workers': 1,
'p_arpabet': 0.0,
'p_attention_dropout': 0.1,
'p_decoder_dropout': 0.1,
'p_teacher_forcing': 1.0,
'pin_memory': True,
'pos_weight': None,
'postnet_embedding_dim': 512,
'postnet_kernel_size': 5,
'postnet_n_convolutions': 5,
'prenet_dim': 256,
'ref_enc_filters': [32, 32, 64, 64, 128, 128],
'ref_enc_gru_size': 128,
'ref_enc_pad': [1, 1],
'ref_enc_size': [3, 3],
'ref_enc_strides': [2, 2],
'sample_inference_speaker_ids': [0],
'sample_inference_text': 'That quick beige fox jumped in the air loudly over '
'the thin dog fence.',
'sample_rate': 22050,
'sampling_rate': 22050,
'seed': 123,
'speaker_embedding_dim': 128,
'steps_per_sample': 50,
'symbol_set': 'nvidia_taco2',
'symbols_embedding_dim': 512,
'text_cleaners': ['english_cleaners'],
'torchmoji_model_file': '/home/rage/CodingProjects/uberduck-ml-dev-master/pytorch_model.bin',
'torchmoji_vocabulary_file': '/home/rage/CodingProjects/uberduck-ml-dev-master/vocabulary.json',
'training_audiopaths_and_text': '/home/rage/CodingProjects/uberduck-ml-dev-master/project/wavs/filelist.txt',
'val_audiopaths_and_text': '/home/rage/CodingProjects/uberduck-ml-dev-master/project/wavs/filelist.txt',
'warm_start_name': '/home/rage/CodingProjects/uberduck-ml-dev-master/tacotron2_statedict.pt',
'weight_decay': 1e-06,
'win_length': 1024,
'with_audio_encoding': False,
'with_f0s': False,
'with_gsts': False}
start train 9219.320274948
Initialized Torchmoji GST
Starting warm_start 9220.987589312
WARNING! Attempting to load a model with out the speaker_embedding.weight layer. This could lead to unexpected results during evaluation.
WARNING! Attempting to load a model with out the spkr_lin.weight layer. This could lead to unexpected results during evaluation.
WARNING! Attempting to load a model with out the spkr_lin.bias layer. This could lead to unexpected results during evaluation.
WARNING! Attempting to load a model with out the gst_lin.weight layer. This could lead to unexpected results during evaluation.
WARNING! Attempting to load a model with out the gst_lin.bias layer. This could lead to unexpected results during evaluation.
Ending warm_start 9221.034127661
Error while getting data: index = 43
[Errno 2] No such file or directory: 'mf00-44.wav'
Exception raised while training: [Errno 2] No such file or directory: 'mf00-44.wav'
Traceback (most recent call last):
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/uberduck_ml_dev/exec/train_tacotron2.py", line 46, in <module>
run(None, None, hparams)
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/uberduck_ml_dev/exec/train_tacotron2.py", line 27, in run
raise e
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/uberduck_ml_dev/exec/train_tacotron2.py", line 23, in run
trainer.train()
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/uberduck_ml_dev/trainer/tacotron2.py", line 446, in train
for batch_idx, batch in enumerate(train_loader):
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/uberduck_ml_dev/data/data.py", line 303, in __getitem__
data = self._get_data(self.audiopaths_and_text[idx])
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/uberduck_ml_dev/data/data.py", line 264, in _get_data
sampling_rate, wav_data = read(audiopath)
File "/home/rage/anaconda3/envs/test-env/lib/python3.10/site-packages/scipy/io/wavfile.py", line 647, in read
fid = open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'mf00-44.wav'
What other potential solutions could I try?
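The traceback shows `read()` being called with the bare name 'mf00-44.wav' while `dataset_path` is '.', so one possible explanation (an assumption, not something confirmed in this thread) is that the relative paths in the filelist are resolved against the directory the command is run from rather than the folder that holds the wavs. A minimal sketch for checking this, assuming a pipe-separated filelist with the audio path in the first field; `find_missing` is a hypothetical helper, not part of uberduck_ml_dev:

```python
import os


def find_missing(filelist_path, base_dir="."):
    """Return filelist audio paths that do not exist relative to base_dir.

    Assumes one entry per line, fields separated by "|", audio path first.
    """
    missing = []
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            audiopath = line.split("|")[0]
            # os.path.join leaves absolute audio paths unchanged.
            if not os.path.isfile(os.path.join(base_dir, audiopath)):
                missing.append(audiopath)
    return missing
```

Running it once with `base_dir="."` from the directory where training is launched, and once with the wavs folder as `base_dir`, would show whether the files are only missing relative to the working directory. If so, launching from the wavs folder or writing absolute paths in the filelist may be worth trying.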
No idea what could be causing it, but it seems to depend on speaker count: an 8-speaker model of mine had inference IDs that matched the training ones, whereas a 20-speaker one is jumbled up. dhama the llama on Discord has been able to (eventually) find all the training voices in their 61-speaker model, so it's likely that the IDs are only being shuffled and not discarded.
After multiple contributors on the Uberduck Discord server sent me their Mellotron models, I've noticed that the inferenced audio seems to repeat itself (and is, coincidentally[?], the same audio as the training audio in TensorBoard).
Example: this was inferenced on a SpongeBob model (~500+ wavs, ~5000 epochs*, batch size 24, with ARPAbet).
The audio never changes from the repetition it sticks to, whether or not ARPAbet input is used and however the input text is changed. This seems similar to other runs, and I was wondering whether the cause is something in the parameters, the dataset, or simply the time it took to train.
* Gosmokeless28 claimed to train it for the full 5000 epochs
There should be a way to disable the validation parts between each epoch in the colab. For massive datasets, it can take up to 20 minutes to generate those samples. Being able to disable it would save a lot of time.
Thank you for your open source work, but I have not been able to find a complete implementation of zero-shot TTS.
Running this step:
#@title Install pipeline
#!pip install -q torch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0
!pip install -q git+https://github.com/uberduck-ai/uberduck-ml-dev.git
I get this error:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
torchtext 0.13.1 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.
Hi guys, it's a bit hard for me to figure out which steps to follow to achieve something great.
I would like to train a voice first for TTS and then for voice-to-voice, in order to match samples from other singers.
How should I proceed? What do you use for voice-to-voice: RADTTS?
Thank you very much.
This is a personal issue close to my heart.
When I was a child growing up, we had the voice of Bruce the fuck Willis in our car navi, and it would talk exactly like in Die Hard.
Unfortunately, I never found that again.
Not sure how you add voices, but Bruce Willis is surely a must-have.
I recently came back to the Uberduck.ai website and its text-to-speech page, and the character voice clone feature of the site is all gone. Why is that?
I know it might be pointless to start an offshoot of the last conversation about the character voice cloning feature, but I am pleading with you, on behalf of the character voice cloning community, to test this in court and to make compromises if that is feasible.
Character voice cloning is very useful for creating fan-made content, especially Gmod animations. Besides, there are people like me who want to use it for fun and nothing more.
What I would propose is to create some sort of online virtual license for people who use the programs, and to increase the accountability of those who use them. The important condition, however, is that personal information will NOT be shared with the authorities unless the person is abusing the program. The virtual license should also never expire, and it must be free to obtain, since the program is not a real-life sort of thing.
What I would also want to see in turn is rule of law within the community, where everybody is equal under the rules, particularly the people running the program.
Are the datasets used available? I am interested in the Transformers and Ratatouille ones.
Training a Mellotron model is pretty difficult if you don't have 10,000 wav files in your dataset... Are any current models being trained to be used for warm starting / transfer learning?
The pretrained models in Nvidia's official Mellotron implementation look adequate, but they have not worked well in my tests. The LJS (LJSpeech) model is the only one that has given me good quality, but training still takes hours (5+ hours for 600 wavs to sound decently good), whereas regular Tacotron2 models take about 1 to 2 hours for the same number of wavs.
Any info on models in training or already available? If so, what is the dataset and, if possible, the ETA?
fft_window = pad_center(fft_window, filter_length)
TypeError: pad_center() takes 1 positional argument but 2 were given
I think this could be a librosa versioning issue from the recent 0.10.0 release. The tests pass with librosa==0.8.0, so I updated the dependency. Let's see what happens.
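The error message is consistent with `size` having become a keyword-only parameter of `pad_center` in librosa 0.10. Besides pinning librosa==0.8.0, passing the size by keyword, as in `pad_center(fft_window, size=filter_length)`, should be accepted by both versions. The sketch below reproduces the failure without installing librosa; the `pad_center` here is a simplified pure-Python stand-in with the same keyword-only shape, not librosa's implementation:

```python
def pad_center(data, *, size):
    """Stand-in: zero-pad a list to `size`, keeping the data centered."""
    n = len(data)
    if size < n:
        raise ValueError("size must be at least len(data)")
    left = (size - n) // 2
    return [0.0] * left + list(data) + [0.0] * (size - n - left)


window = [1.0, 1.0]

try:
    pad_center(window, 6)  # positional size, as in the failing call
except TypeError as err:
    message = str(err)  # same wording as the TypeError reported above

padded = pad_center(window, size=6)  # keyword size is accepted
```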
If the user wishes to warm_start() from a model that has an optimizer, and their config has null
for ignore_layers, they will be unable to start training. A fix is:
https://github.com/uberduck-ai/uberduck-ml-dev/blob/master/uberduck_ml_dev/trainer/base.py
line 139:
if "optimizer" in checkpoint and len(self.ignore_layers) == 0:
to:
if "optimizer" in checkpoint and self.ignore_layers == None:
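One thing to note about the proposed change: `len(None)` is what fails in the original condition, but comparing against `None` alone would stop treating an empty list as "no ignored layers". A hedged sketch of a condition that covers both cases follows; `should_load_optimizer` is a hypothetical helper written for illustration, not code from base.py:

```python
def should_load_optimizer(checkpoint, ignore_layers):
    """True when the checkpoint has optimizer state and no layers are ignored.

    `not ignore_layers` is True for both None and an empty list, so a null
    config value never reaches len().
    """
    return "optimizer" in checkpoint and not ignore_layers
```

With this shape, a config that sets the value to null and one that sets it to an empty list behave the same way, while a non-empty list still skips loading the optimizer state.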
Hi there,
Will there be an option in the future to package the model along with its training configuration (HParams) data inside a zip archive? This could be helpful for organization and easier inferencing.
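Until something like this exists in the project, the idea can be approximated outside it with the standard library. The following is only a sketch of what such packaging might look like; the function names and the archive member names `model.pt` and `config.json` are made up for illustration and are not an uberduck_ml_dev format:

```python
import json
import zipfile


def package_model(checkpoint_path, hparams, archive_path):
    """Write a checkpoint file and its hparams (as JSON) into one zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(checkpoint_path, arcname="model.pt")
        zf.writestr("config.json", json.dumps(hparams, indent=2))


def load_hparams(archive_path):
    """Read the hparams back out of an archive created by package_model."""
    with zipfile.ZipFile(archive_path) as zf:
        return json.loads(zf.read("config.json"))
```

Keeping the config as plain JSON inside the archive means it can be inspected without loading the checkpoint itself.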
Is there currently any way to run inference with the model and create an output?
Running: pip install git+https://github.com/uberduck-ai/uberduck-ml-dev.git
returns with: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 925: character maps to <undefined>
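This error typically appears on Windows when a text file is opened without an explicit encoding and Python falls back to a legacy code page such as cp1252, in which byte 0x8f has no mapping. Whether that is what happens during this package's install is an assumption; the sketch below only demonstrates the mechanism and the usual remedy of passing `encoding="utf-8"`. The file name and contents are made up:

```python
import os
import tempfile

# "ď" (U+010F) is encoded in UTF-8 as the bytes C4 8F.
path = os.path.join(tempfile.mkdtemp(), "README.md")
with open(path, "w", encoding="utf-8") as f:
    f.write("d\u010f\n")

try:
    with open(path, encoding="cp1252") as f:  # a common Windows default
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True  # 0x8f has no mapping in cp1252

with open(path, encoding="utf-8") as f:  # explicit encoding reads cleanly
    text = f.read()
```

Setting the environment variable `PYTHONUTF8=1` before running pip is another option that may avoid the problem without changing the package.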
Sometimes things don't go your way, and you're already deep down the rabbit hole. Maybe you've spent 2 hours training a model and have only now realized that you want to change some config, but you have set your epoch checkpoint interval far too high. Would it be possible to catch a keyboard interrupt while training and do a graceful shutdown (save a checkpoint and such)?
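One way this could look is a `try`/`except KeyboardInterrupt` around the epoch loop. The sketch below is not the trainer's actual code; `train_one_epoch` and `save_checkpoint` are placeholder callables standing in for the real training and saving logic:

```python
def train(n_epochs, train_one_epoch, save_checkpoint):
    """Run epochs, saving a checkpoint if the user presses Ctrl+C.

    train_one_epoch(epoch) and save_checkpoint(epoch) are caller-supplied
    placeholders. Returns the number of completed epochs.
    """
    completed = 0
    try:
        for epoch in range(n_epochs):
            train_one_epoch(epoch)
            completed = epoch + 1
    except KeyboardInterrupt:
        # Save the current state, labeled with the last completed epoch.
        save_checkpoint(completed)
    return completed
```

Note that an interrupt in the middle of an epoch leaves the weights partially updated for that epoch, so the saved checkpoint reflects the state at the moment of interruption rather than a clean epoch boundary.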