Comments (29)
Good news: I found a way of extracting semantic vectors from wav2vec models without the main hubert_base model.
from bark-voice-cloning-hubert-quantizer.
HuBERT wav2vec outputs have 768 features, which is why I picked that number. If you want to use a different number, pass input_size=1024 in the constructor.
The default input shape is (B, 768), where B is the batch size, and the output shape is (B, 1).
With input_size=1024, the input shape is (B, 1024) and the output shape is (B, 1).
Example: on line 161 of customtokenizer.py, in auto_train, change
model_training = CustomTokenizer(version=1).to('cuda')
to
model_training = CustomTokenizer(version=1, input_size=1024).to('cuda')
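As a sanity check before training, it can help to verify that your extracted feature batches actually match the input_size the tokenizer was constructed with. A minimal sketch in plain Python (check_feature_batch is a hypothetical helper, not part of the repo):

```python
# Hypothetical sanity check mirroring the tokenizer's shape contract:
# the model maps (B, input_size) -> (B, 1).
def check_feature_batch(batch_shape, input_size=768):
    """Validate that a feature batch matches the configured input size."""
    if len(batch_shape) != 2:
        raise ValueError(f"expected 2-D (B, features), got {batch_shape}")
    batch, features = batch_shape
    if features != input_size:
        raise ValueError(
            f"feature size {features} does not match input_size {input_size}; "
            "construct the tokenizer with the matching input_size"
        )
    return batch

print(check_feature_batch((32, 768)))                    # HuBERT-style features -> 32
print(check_feature_batch((32, 1024), input_size=1024))  # wav2vec-style features -> 32
```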
Make sure the Wav2Vec extracts features at the same rate as HuBERT too, or you'll run into problems.
About 50x768 features per second, or 50x1024 in your case.
If it's slightly different, that's fine.
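The rate check boils down to dividing the number of feature frames by the clip duration and comparing against HuBERT's roughly 50 frames per second. A small sketch; the 2 Hz tolerance is an assumption for "slightly different is fine":

```python
def frames_per_second(num_frames, duration_s):
    """How many feature frames the extractor produces per second of audio."""
    return num_frames / duration_s

def rate_matches_hubert(num_frames, duration_s, target=50.0, tolerance=2.0):
    """True if the extraction rate is close enough to HuBERT's ~50 Hz."""
    return abs(frames_per_second(num_frames, duration_s) - target) <= tolerance

print(rate_matches_hubert(499, 10.0))  # 49.9 Hz, slightly off -> True
print(rate_matches_hubert(250, 10.0))  # 25 Hz extractor -> False, will cause problems
```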
With a dataset of 3000 files, 20 minutes on my RTX 3060 gave good results; I then trained it for an hour or so more.
You can interrupt training at any point and check your latest model to see how well it performs.
@gitmylo Thanks, I have trained for 15 epochs and am planning to go to 24.
However, it took around 3 hours on a P100 GPU for just 15 epochs, and I reduced the files to around 6000.
Any suggestions to improve the cloning results and speed up training?
Look at it like this: if you have 3000 files and train for 24 epochs, for example, it will still be worse than 6000 files for 12 epochs.
An epoch means the model has gone through every file; having more training data makes an epoch take longer.
Also, if and when you decide to upload your model and/or training data to Hugging Face, please send the URLs here so I can add them to the README.
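The trade-off can be put in numbers: both schedules below process the same total number of examples, but the larger dataset exposes the model to twice as many distinct files. A quick sketch using the counts from this thread:

```python
def examples_seen(num_files, epochs):
    """One epoch is one pass over every file."""
    return num_files * epochs

small_data = examples_seen(3000, 24)  # 72000 examples from 3000 unique files
large_data = examples_seen(6000, 12)  # 72000 examples from 6000 unique files
print(small_data, large_data, small_data == large_data)
```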
@gitmylo I think the HuBERT base model doesn't support the Hindi language, because my generated audio doesn't speak what's prompted in the text; instead it's some random words and noises.
Given that, I even tried two models:
Model_A, trained for 23 epochs on 3700 files (after the ready stage)
Model_B, trained for 16 epochs on 7783 files
Both yield poor results. Any suggestions, please? I've already spent a lot of time on this.
Do the wavs used in training sound normal, though?
@gitmylo Yes, I just checked multiple wavs in the prepared folder (by the way, some files are pure noise) and they sound perfect.
Can you please suggest, from your experience, what I should do?
If you can help, I may train for other languages too.
Maybe there's a Hindi HuBERT model somewhere; you could try loading it.
@gitmylo I could not find any; I searched a lot. It would be nice if you could provide a link to one.
P.S.: It should be noted that resources such as guides and pretrained models for the Hindi language are very rare.
An update: Model_A crossed 32 epochs, with losses as:
@gitmylo I assume the problem is that the HuBERT base model doesn't support Hindi. I checked the generated semantic_prompt by converting it to a waveform (semantic_to_waveform): it speaks random words mixed with English words, even though the cloned speaker speaks entirely in Hindi.
P.S.: I cloned 5-6 speakers (clear voices) - same poor results.
The conclusion is that after training for 35 epochs, the semantic vectors are not formed properly, or not in the desired language.
Thanks anyway; I will upload everything, the training data and both models, but they are of no use.
Great; as long as they're at the same rate with the same number of features, it should work.
Is it distilhubert?
There are different versions around: https://huggingface.co/models?search=distilhubert
I noticed it's used in RVC too: ddPn08/rvc-webui#11
@gitmylo Hey, I have one doubt: why haven't you used the hubert_base_ls960_L9_km500.bin quantizer? And why did you train for the English language?
I haven't used that quantizer because it is not compatible with Bark; it uses completely different values to represent the semantic features.
I trained on English because English is the most widely spoken language in the world, and it's supported by Bark.
@gitmylo Thanks, just one last question:
Is it necessary to pass an input of size 768 to the tokenizer? I mean, can we pass an input of size 1024 (or similar) to a custom tokenizer (a new one that accepts an input size of 1024), and after tokenization, will the resulting semantic tokens be compatible with Bark?
My case is that I am training a new tokenizer model with input size 1024, and I just need to confirm whether the output will be Bark-compatible.
Extra info:
My underlying thought is that I found a well-trained pretrained wav2vec2 model from which I managed to extract semantic vectors, but the output is of size 1024. So I'm planning to train a new tokenizer. Should I proceed or not?
Thanks. Can you please shed light on the rate? I mean, what is the required rate?
> Make sure the Wav2Vec extracts features at the same rate as HuBERT too
For example, this indicwav2vec-hindi is trained on fairseq.
Is hubert_base_ls960.pt pretrained only on English?
> Is hubert_base_ls960.pt pretrained only on English?
It seems to work with more than just English, though not every single language.
@gitmylo The HuBERT training specs say it's trained on the librispeech_asr dataset, which is a monolingual (English-only) dataset.
Additionally, it's labeled English only.
Could you confirm: do the quantizer or the semantic features returned from the HuBERT model have anything to do with the language?
https://huggingface.co/facebook/hubert-base-ls960
> @gitmylo The HuBERT training specs say it's trained on the librispeech_asr dataset, which is a monolingual (English-only) dataset. Additionally, it's labeled English only. Could you confirm: do the quantizer or the semantic features returned from the HuBERT model have anything to do with the language?
They do have something to do with the language, but that won't stop you from creating a good quantizer for a non-English language, as the model is still able to recognize the patterns; it's mostly human speech sounds anyway. You shouldn't be restricted to just English because the quantizer is English.
@abhiprojectz I had success training the Portuguese language yesterday; before that I was getting less than ideal results (the model hallucinated a lot more and the voice clones were bad all around).
I used HuBERT.
What I did was:
1 - Lowered the learning rate (in my case to 0.0005): https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/blob/master/hubert/customtokenizer.py#L56
2 - Redid the dataset. I used the Bible, as religious texts apparently tend to produce cadenced speech more often and use more formal language (better for tokens), and produced over 4000 files (4249 to be exact; in your case, for Hindi, you should probably use more)
3 - Trained 25 epochs (I selected the 24th-epoch model as the best one, though)
4 - Tested each model to check which ones produce good audio and accurate cloned voices
Try lowering the learning rate as much as you bearably can and let it train for several epochs until you find a sweet spot or notice a change in audio generation.
Since you mentioned you were getting "random words and noises", I suggest you select a learning rate of 0.0001 or below in order not to "damage" the model.
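The effect of the learning rate can be seen on a toy quadratic loss: too large a step makes the parameter blow up (the "damage" mentioned above), while a small one converges slowly but safely. The toy rates below are purely illustrative; the 0.0005 and 0.0001 values are for the actual tokenizer training:

```python
def descend(lr, steps=50, w=10.0):
    """Minimize loss(w) = w**2 by gradient descent; the gradient is 2*w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return abs(w)

print(descend(0.01))  # small lr: slow but steady progress toward 0
print(descend(0.45))  # well-chosen lr: converges quickly
print(descend(1.10))  # too large: |w| grows each step, the run diverges
```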
@Subarasheese Thanks for the insight.
Just to clarify, you used HuBERT base, right? (https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt)
I was going to train HuBERT from scratch with a multilingual (English + Indonesian) dataset after reading this thread,
but now I will try using HuBERT base first.
> @Subarasheese Thanks for the insight. Just to clarify, you used HuBERT base, right? (https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt) I was going to train HuBERT from scratch with a multilingual (English + Indonesian) dataset after reading this thread, but now I will try using HuBERT base first.
Yes, I used base HuBERT.
I would suggest you train on top of the base HuBERT model before trying to train from scratch.
I'm currently finetuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.
> I'm currently finetuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.
Any update on this?
Can someone share results on Hindi cloning?