
Comments

abhiprojectz commented on June 14, 2024

Good news: I found a way of extracting semantic vectors from wav2vec models without the main hubert_base model.


gitmylo commented on June 14, 2024

HuBERT wav2vec outputs have 768 features; that's why I picked that number. If you want to use a different number, pass input_size=1024 in the constructor.

The default input shape is (B, 768), where B is the batch size, and the output shape is (B, 1).
With input_size=1024, the input shape is (B, 1024) and the output shape is (B, 1).

Example: on line 161 of customtokenizer.py, in auto_train, change
model_training = CustomTokenizer(version=1).to('cuda')
to
model_training = CustomTokenizer(version=1, input_size=1024).to('cuda')

Make sure the Wav2Vec model extracts features at the same rate as HuBERT too, or you'll run into problems.
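To illustrate, a minimal sketch (assuming you run it from inside the bark-voice-cloning-HuBERT-quantizer repo, so hubert/customtokenizer.py is importable, and that the constructor arguments and shapes are as described above):

import torch
from hubert.customtokenizer import CustomTokenizer

# Tokenizer that accepts 1024-dim wav2vec features instead of HuBERT's 768.
model = CustomTokenizer(version=1, input_size=1024).to('cuda')

features = torch.randn(16, 1024, device='cuda')  # (B, 1024): a batch of feature frames
tokens = model(features)                         # expected output shape per above: (B, 1)
print(tokens.shape)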


gitmylo commented on June 14, 2024

About 50x768 features per second, or 50x1024 in your case.
If it's slightly different, that's fine.
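A quick way to check the rate empirically, as a sketch assuming the Hugging Face transformers library (facebook/wav2vec2-large has 1024-dim hidden states): feed one second of 16 kHz audio through the model and count the output frames.

import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large")

one_second = torch.zeros(1, 16000)  # one second of silence at 16 kHz
with torch.no_grad():
    hidden = model(one_second).last_hidden_state  # shape (1, T, 1024)

print(hidden.shape[1], "frames per second,", hidden.shape[2], "features each")  # roughly 50 frames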


gitmylo commented on June 14, 2024

With a dataset of 3000 files, 20 minutes on my RTX 3060 gave good results; I then trained it for an hour or so more.
You can interrupt training at any point and check your latest model to see how well it performs.
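For reference, a minimal sketch of that workflow in generic PyTorch (not the repo's actual loop; the Linear model and random data are stand-ins): save a checkpoint after every epoch so training can be interrupted and the newest model evaluated at any time.

import torch
import torch.nn as nn

model = nn.Linear(768, 10)  # stand-in for the tokenizer model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    features = torch.randn(32, 768)        # stand-in batch of HuBERT features
    targets = torch.randint(0, 10, (32,))  # stand-in target tokens
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()
    # One checkpoint per epoch: interrupt whenever you like and load the latest file.
    torch.save(model.state_dict(), f"model_epoch_{epoch}.pth")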


abhiprojectz commented on June 14, 2024

@gitmylo Thanks, I have trained for 15 epochs and am planning to go to 24.

However, it took around 3 hours on a P100 GPU for just 15 epochs, and I had already reduced the files to around 6000.

Any suggestions to improve the cloning results and speed up training?


gitmylo commented on June 14, 2024

Look at it like this: if you have 3000 files and train for 24 epochs, for example, it will still be worse than 6000 files for 12 epochs.
An epoch means the model has gone through every file, so more training data makes each epoch take longer; but note that 3000 files x 24 epochs and 6000 files x 12 epochs are the same 72,000 file passes, and the larger dataset gives more variety per pass.

Also, if and when you decide to upload your model and/or training data to Hugging Face, please send the URLs here so I can add them to the readme.


abhiprojectz commented on June 14, 2024

@gitmylo I think the HuBERT base model doesn't support Hindi, because the generated audio doesn't speak what's prompted in the text; instead it produces random words and noises.

Given that, I tried 2 models:

Model_A, trained for 23 epochs on 3700 files (after the ready stage)

Model_B, trained for 16 epochs on 7783 files

Both yield poor results. Please, any suggestions? I have already spent a lot of time on this.



gitmylo commented on June 14, 2024

Do the wavs used in training sound normal, though?


abhiprojectz commented on June 14, 2024

@gitmylo Yes, I just checked multiple wavs in the prepared folder (by the way, some files are pure noise) and they sound perfect.

Can you please suggest, from your experience, what I should do?

If you can help, I may train for other languages too.


gitmylo commented on June 14, 2024

Maybe there's a Hindi HuBERT model somewhere; you could try loading it.
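If you find one, loading it would look roughly like this sketch (assumes the transformers library; the checkpoint name below is a hypothetical placeholder, substitute a real Hindi HuBERT if one exists):

import torch
from transformers import HubertModel

# "some-org/hindi-hubert" is a hypothetical placeholder, not a real checkpoint.
model = HubertModel.from_pretrained("some-org/hindi-hubert")

audio = torch.zeros(1, 16000)  # one second of 16 kHz audio
with torch.no_grad():
    features = model(audio).last_hidden_state  # (1, T, hidden_size)
print(features.shape)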


abhiprojectz commented on June 14, 2024

@gitmylo I could not find one, and I searched a lot; it would be nice if you could provide a link to one.

P.S.: It's worth noting that resources such as guides and pretrained models for Hindi are very rare.

An update: Model_A crossed 32 epochs, with losses as follows:

[screenshot of training losses]


abhiprojectz commented on June 14, 2024

@gitmylo I assume the problem is that the HuBERT base model doesn't support Hindi. I checked the generated semantic_prompt by converting it to a waveform (semantic_to_waveform), and it speaks random words mixed with English words, even though the cloned speaker speaks entirely in Hindi.

P.S. I cloned 5-6 speakers (clear voices) with the same poor results.

The conclusion is that after training for 35 epochs, the semantic vectors are not formed properly, or not in the desired language.

Thanks anyway; I will upload everything, the training data and both models, but they are of no use.


gitmylo commented on June 14, 2024

Great, as long as they're at the same rate with the same number of features, it should work.


JonathanFly commented on June 14, 2024

Is it distilhubert?
There are also different versions around: https://huggingface.co/models?search=distilhubert

I noticed it's on RVC too: ddPn08/rvc-webui#11


abhiprojectz commented on June 14, 2024

@gitmylo Hey, I have one doubt: why haven't you used the hubert_base_ls960_L9_km500.bin quantizer? And why did you train for the English language?


gitmylo commented on June 14, 2024

I haven't used that quantizer because it is not compatible with bark; it uses completely different values to represent the semantic features.

I trained on English because English is the most widely spoken language in the world, and it's supported by bark.


abhiprojectz commented on June 14, 2024

@gitmylo Thanks, just one last question.

Is it necessary to pass an input of size 768 to the tokenizer? I mean, can we pass an input of size 1024 (or similar) to a custom tokenizer, i.e. a new one that accepts an input size of 1024, and will the semantic tokens produced after tokenization still be compatible with bark?

My case is that I am training a new tokenizer model with an input size of 1024, and I just need to confirm with you whether the output will be bark-compatible.

Extra info: my thinking is that I found a well-trained pretrained wav2vec2 model from which I managed to extract semantic vectors, but the output is of size 1024, so I'm planning to train a new tokenizer. Should I proceed or not?


abhiprojectz commented on June 14, 2024

Thanks. Can you please shed some light on the rate? I mean, what is the required rate?

Make sure the Wav2Vec extracts features at the same rate as HuBERT too

For example, this indicwav2vec-hindi model is trained with fairseq.


xiabo2011 commented on June 14, 2024

Was hubert_base_ls960.pt pretrained only on English?


gitmylo commented on June 14, 2024

Was hubert_base_ls960.pt pretrained only on English?

It seems to work with more than just English, though not every single language.


abhiprojectz commented on June 14, 2024

@gitmylo On the HuBERT training specs, it seems it was trained on the librispeech_asr dataset, which is a monolingual (English-only) dataset.

Additionally, it's labelled in English only.

Could you confirm: do the quantizer or the semantic features returned from the HuBERT model have anything to do with the language?

https://huggingface.co/facebook/hubert-base-ls960


gitmylo commented on June 14, 2024

@gitmylo On the HuBERT training specs, it seems it was trained on the librispeech_asr dataset, which is a monolingual (English-only) dataset.

Additionally, it's labelled in English only.

Could you confirm: do the quantizer or the semantic features returned from the HuBERT model have anything to do with the language?

https://huggingface.co/facebook/hubert-base-ls960

They do have something to do with the language, but that won't stop you from creating a good quantizer for a non-English language, as the model is still able to recognize the patterns; it's mostly human speech sounds anyway. It shouldn't be restricted to just English because the quantizer is English.


Subarasheese commented on June 14, 2024

@abhiprojectz I had success training the Portuguese language yesterday; before that I was getting less than ideal results (the model hallucinated a lot more and the voice clones sucked all around).

I used HuBERT.

What I did was (see the sketch below):
1 - Lowered the learning rate (in my case, to 0.0005):
https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/blob/master/hubert/customtokenizer.py#L56
2 - Redid the dataset. I threw in the Bible, as religious texts apparently tend to produce cadenced speech more often and have more formal language (better for tokens), and produced over 4000 files (4249 to be exact; in your case, for Hindi, you should probably use more).
3 - Trained 25 epochs (I selected the 24th-epoch model as the best one, though).
4 - Tested each model to check which ones produce good audio plus accurate cloned voices.

Try lowering the learning rate as much as you bearably can and let it train for several epochs, until you find a sweet spot or notice a change in audio generation.

Since you mentioned you were getting "random words and noises", I suggest you select a learning rate of 0.0001 or below, so as not to "damage" the model.
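As a sketch of point 1 (generic PyTorch; the repo hard-codes its learning rate near the linked line of customtokenizer.py, so edit it there rather than copying this):

import torch
import torch.nn as nn

model = nn.Linear(768, 10)  # stand-in for the tokenizer model

# 0.0005 worked for Portuguese above; drop to 0.0001 or lower if generations
# come out as random words and noises.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)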


acul3 commented on June 14, 2024

@Subarasheese Thanks for the insight.

Just to clarify: you used HuBERT base, right? (https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt)

I was planning to train HuBERT from scratch on a multilingual (English + Indonesian) dataset after reading this thread, but now I will try using HuBERT base first.


Subarasheese commented on June 14, 2024

@Subarasheese Thanks for the insight.

Just to clarify: you used HuBERT base, right? (https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt)

I was planning to train HuBERT from scratch on a multilingual (English + Indonesian) dataset after reading this thread, but now I will try using HuBERT base first.

Yes, I used base HuBERT.
I would suggest you train on top of the base HuBERT model before trying to train from scratch.


sachaarbonel commented on June 14, 2024

I'm currently finetuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.
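For anyone following along, pulling that split looks roughly like this (a sketch assuming the Hugging Face datasets library; the dataset is gated, so you need to accept its terms on the Hub and be logged in):

from datasets import load_dataset

# Hindi ("hi") split of Common Voice 11; requires accepting the dataset
# terms on huggingface.co and an authenticated login.
cv_hindi = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")

print(len(cv_hindi))
print(cv_hindi[0]["sentence"])  # transcript; the waveform is in cv_hindi[0]["audio"]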


Surojit-KB commented on June 14, 2024

I'm currently finetuning HuBERT on common_voice_11_0 Hindi; let's see how it goes.

Any update on this?


super-animo commented on June 14, 2024

Can someone share results on Hindi cloning?

