
bark-voice-cloning-hubert-quantizer's Introduction

Bark voice cloning

Please read

This code works on Python 3.10; I have not tested it on other versions. Some older versions will have issues.

Voice cloning with bark in high quality?

It's possible now.

Example clip: examples_biden_example.mov

How do I clone a voice?

For developers: see "Implementing voice cloning in your bark projects" below.

For everyone: the repository's colab_notebook.ipynb walks through the cloning process.

Cloned voices aren't very convincing; why are other people's cloned voices better than mine?

Make sure these things are NOT in your voice input: (in no particular order)

  • Noise (you can run a noise remover on the audio first)
  • Music (there are also music-removal tools), unless you want music in the background
  • A cut-off at the end (this will cause the model to try to continue the generation)
  • Under 1 second of training data (I personally suggest around 10 seconds for good potential, but I've had great results with 5 seconds as well)

What makes for good prompt audio? (in no particular order)

  • Clearly spoken
  • No weird background noises
  • Only one speaker
  • Audio which ends after a sentence ends
  • A regular/common voice (these usually have more success; the model can still clone more unusual voices, just not as well)
  • Around 10 seconds of data (a rough preparation sketch follows below)
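
As a rough illustration (not part of this repo), the snippet below shows the kind of cleanup you might apply to a prompt clip before cloning: mix down to mono and trim to roughly 10 seconds. File names are placeholders.

import torchaudio

wav, sr = torchaudio.load('raw_prompt.wav')  # placeholder path

# Mix down to mono if the clip is stereo
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)

# Keep roughly the first 10 seconds; ideally cut at a sentence boundary rather than mid-word
max_samples = 10 * sr
wav = wav[:, :max_samples]

torchaudio.save('prompt.wav', wav, sr)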

Pretrained models

Official

| Name | HuBERT Model | Quantizer Version | Epoch | Language | Dataset |
| --- | --- | --- | --- | --- | --- |
| quantifier_hubert_base_ls960.pth | HuBERT Base | 0 | 3 | ENG | GitMylo/bark-semantic-training |
| quantifier_hubert_base_ls960_14.pth | HuBERT Base | 0 | 14 | ENG | GitMylo/bark-semantic-training |
| quantifier_V1_hubert_base_ls960_23.pth | HuBERT Base | 1 | 23 | ENG | GitMylo/bark-semantic-training |

Community

| Author | Name | HuBERT Model | Quantizer Version | Epoch | Language | Dataset |
| --- | --- | --- | --- | --- | --- | --- |
| HobisPL | polish-HuBERT-quantizer_8_epoch.pth | HuBERT Base | 1 | 8 | POL | Hobis/bark-polish-semantic-wav-training |
| C0untFloyd | german-HuBERT-quantizer_14_epoch.pth | HuBERT Base | 1 | 14 | GER | CountFloyd/bark-german-semantic-wav-training |

For developers: Implementing voice cloning in your bark projects

  • Simply copy the files from this directory into your project.
  • The hubert manager contains methods to download HuBERT and the custom Quantizer model.
  • Loading the CustomHubert should be pretty straightforward.
  • The notebook contains code for running on CUDA or CPU, instead of just CPU.
from hubert.pre_kmeans_hubert import CustomHubert
import torchaudio

# Load the HuBERT model,
# checkpoint_path should work fine with data/models/hubert/hubert.pt for the default config
hubert_model = CustomHubert(checkpoint_path='path/to/checkpoint')

# Run the model to extract semantic features from an audio file, where wav is your audio file
wav, sr = torchaudio.load('path/to/wav') # This is where you load your wav, with soundfile or torchaudio for example

if wav.shape[0] == 2:  # Stereo to mono if needed
    wav = wav.mean(0, keepdim=True)

semantic_vectors = hubert_model.forward(wav, input_sample_hz=sr)
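
Then quantize those semantic vectors into bark-compatible semantic tokens with the CustomTokenizer:
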
import torch
from hubert.customtokenizer import CustomTokenizer

# Load the CustomTokenizer model from a checkpoint
# With default config, you can use the pretrained model from huggingface
# With the default setup from HuBERTManager, this will be in data/models/hubert/tokenizer.pth
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth')  # Automatically uses the right layers

# Process the semantic vectors from the previous HuBERT run (This works in batches, so you can send the entire HuBERT output)
semantic_tokens = tokenizer.get_token(semantic_vectors)

# Congratulations! You now have semantic tokens which can be used inside of a speaker prompt file.
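
The semantic tokens are only one part of a bark speaker prompt; the .npz also holds coarse and fine acoustic codes from EnCodec. The snippet below is a rough sketch of how such a prompt is commonly assembled (the key names follow bark's history-prompt format; treat the exact slicing as an assumption):

import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Extract acoustic codes from the same audio with EnCodec (24 kHz)
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load('path/to/wav')
wav = convert_audio(wav, sr, encodec_model.sample_rate, encodec_model.channels).unsqueeze(0)
with torch.no_grad():
    encoded_frames = encodec_model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze(0)  # (n_codebooks, T)

# Assemble the speaker prompt file for bark
np.savez('speaker.npz',
         semantic_prompt=semantic_tokens.cpu().numpy(),  # from the tokenizer step above
         coarse_prompt=codes[:2, :].cpu().numpy(),       # first two codebooks as the coarse prompt
         fine_prompt=codes.cpu().numpy())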

How do I train it myself?

Simply run the training commands.

A simple way to create semantic data and wavs for training is with my script bark-data-gen. But remember that creating the wavs will take around the same time as, if not longer than, creating the semantics, so generation can take a while.

For example, suppose you have a dataset of zips containing audio files, one zip for the semantics and one for the wav files, inside a folder called "Literature".

Run process.py --path Literature --mode prepare to extract all the data into one directory.

Run process.py --path Literature --mode prepare2 to create HuBERT semantic vectors, ready for training.

Run process.py --path Literature --mode train to train.

And when your model has trained enough, run process.py --path Literature --mode test to test the latest model.

Disclaimer

I am not responsible for audio generated using semantics created by this model. Just don't use it for illegal purposes.

bark-voice-cloning-hubert-quantizer's People

Contributors

alwinaind, brasd99, gitmylo, rsxdalv, yongebai


bark-voice-cloning-hubert-quantizer's Issues

Support Japanese voice cloning

Hi, thanks for your work making voice cloning possible with Bark. I created datasets for Japanese and trained a Japanese quantizer. The result is pretty good after 24 epochs with around 5k samples. If anyone wants to give it a try, they can simply download it from Huggingface.

japanese-quantizer
Japanese datasets

How to Train for Non-Verbal Effects Voice?

How do I train for keyword effects such as [man], [woman], or even the [music] keyword?

Do I have to put [man] in the semantic text to train a male voice?

I really appreciate your work.

Fine-tune for a certain speaker

Thanks for this great work. I am wondering, if I want to increase the quality of voice cloning for a certain speaker, is there a way to fine-tune the model? If yes, how should I do it? Thank you.

`KeyError: 'best_loss'` when testing self-trained model

Hi, first of all thank you for your work!

I created the semantic data and wavs with the help of your bark-data-gen repo, and trained the model myself by following the steps you mentioned at https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer#how-do-i-train-it-myself.
I trained the model until 20th epoch and would now like to test it.

Unfortunately testing gives me a KeyError :

$ python process.py  --path Literature  --mode test
Traceback (most recent call last):
  File ".../bark-voice-cloning-HuBERT-quantizer/process.py", line 28, in <module>
    test_hubert(path, model)
  File ".../bark-voice-cloning-HuBERT-quantizer/test_hubert.py", line 13, in test_hubert
    hubert_model = CustomHubert(checkpoint_path=model)
  File ".../bark-voice-cloning-HuBERT-quantizer/hubert/pre_kmeans_hubert.py", line 62, in __init__
    model, *_ = fairseq.checkpoint_utils.load_model_ensemble_and_task(load_model_input)
  File ".../python3.10/site-packages/fairseq/checkpoint_utils.py", line 431, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File ".../python3.10/site-packages/fairseq/checkpoint_utils.py", line 349, in load_checkpoint_to_cpu
    state = _upgrade_state_dict(state)
  File ".../python3.10/site-packages/fairseq/checkpoint_utils.py", line 595, in _upgrade_state_dict
    "best_loss": state["best_loss"]}
KeyError: 'best_loss'

In test_hubert.py, I'm passing the path to my self-trained model model_epoch_20.pth:

# test_hubert.py
def test_hubert(path: str, model: str = ".../Literature/model_epoch_20.pth", 
                tokenizer: str = 'model.pth'):
    hubert_model = CustomHubert(checkpoint_path=model)  # throws error

The self-trained model dict has the following keys:

# fairseq/checkpoint_utils.py
odict_keys(['lstm.weight_ih_l0', 'lstm.weight_hh_l0', 'lstm.bias_ih_l0', 'lstm.bias_hh_l0', 'lstm.weight_ih_l1', 'lstm.weight_hh_l1', 'lstm.bias_ih_l1', 'lstm.bias_hh_l1', 'intermediate.weight', 'intermediate.bias', 'fc.weight', 'fc.bias'])

I'm getting the same error when trying to test the pre-trained german-HuBERT-quantizer_14_epoch.pth model from C0untFloyd mentioned in this repo.

Could you please give me a hint about what I'm doing wrong here? How could I successfully test the self-trained model?

Thank you very much in advance.
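
For reference, in the README example further up, the fairseq HuBERT checkpoint and the trained quantizer are two separate files loaded by two different classes, so passing the self-trained quantizer to CustomHubert may be the source of this error. A minimal sketch of that split, with paths taken from the issue and the default config:

from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer

# CustomHubert expects the fairseq HuBERT checkpoint (data/models/hubert/hubert.pt by default),
# not the quantizer trained by process.py
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt')

# The self-trained quantizer is loaded by CustomTokenizer instead
tokenizer = CustomTokenizer.load_from_checkpoint('Literature/model_epoch_20.pth')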

Testing

I have successfully run the colab notebook and saved the speaker.npz file.
The issue now is: how do I test it on new data?

No module named 'hubert'

I want to train on my semantics and wavs, but I get this error when starting process.py:

File "/kaggle/working/bark-voice-cloning-HuBERT-quantizer/prepare.py", line 8, in <module> from hubert.pre_kmeans_hubert import CustomHubert ModuleNotFoundError: No module named 'hubert'

german-HuBERT-quantizer_14_epoch.pth does not have all metadata

If I try to call prepare2 with the german-HuBERT-quantizer_14_epoch.pth checkpoint, the following error emerges:

File "/usr/local/lib/python3.10/dist-packages/fairseq/checkpoint_utils.py", line 585, in _upgrade_state_dict
    {"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}
KeyError: 'best_loss'

If the quantifier_hubert_base_ls960.pth checkpoint is used, everything works well.

Is there somewhere a checkpoint of german-HuBERT-quantizer_14_epoch.pth with the necessary metadata?

License for /hubert module

I saw that the model itself is MIT on huggingface; however, the /hubert module from this repository is not under a clear license.

Omni-Lingual Quantizer?

(Took me way too long to realize this, and it just goes to show that most of us are just point-and-click type of fellas who don't really understand what we're using - not really a skiddie, because we can code, but... you get what I mean)

So if the whole point of using bark-generated audio to train a quantizer like this,

instead of simply grabbing a massive, good dataset of audio and having Whisper transcribe it and then adding tags or correcting as needed (or, god forbid, manually finding voice clips with actually good audio that more or less matches what you hear),

is simply that you don't know exactly how they trained their HuBERT-voice-features-to-semantic-tokens mapping (the unknown here being the semantic tokens), and you want to make sure you at least start from THEIR HuBERT-features-to-tokens mapping and refine it,

... then couldn't a general-purpose HuBERT-to-semantic-tokens quantizer be made instead? You would just generate or aggregate all the supported languages, generate datasets if you don't have them already, and train a quantizer on ALL OF THAT, since its aim is just a reverse "tell me the semantic tokens for this series of sounds", so it should theoretically cover any "known" language.

(Minus the African ones, because there's no statistically significant presence of tongue-click languages on the internet; but knowing Bark and its random noises during generation, and assuming the HuBERT model has also learned those, it probably CAN map a tongue-click language too.)

I see you did it for English, but I'm wondering why everyone has stopped at a single-language quantizer when it could probably be made into an omnilingual quantizer.

I ask in the name of languages like Klingon, Middle English, Old English, Vietnamese with a southern accent....

How do I create a quantizer for a dialect that Bark doesn't support?

Cantonese is a Chinese dialect, commonly spoken in Hong Kong and Guangzhou. The problem is that Bark only supports "Chinese", which is in fact Putonghua.

I would like to create a Cantonese quantizer from the beginning; how am I supposed to do that? As far as I know, as @junwchina said, bark-data-gen is a tool for generating training data to train my quantizer model. But the pronunciation of Cantonese and Putonghua is completely different: if I use bark-data-gen, it will likely only output a Putonghua dataset, which is not the one I want.

How to increase quality?

Hey, @gitmylo, great work on this repo.

If I want to increase the quality what's the best way to go about that?

I imagine the number of steps that are used during both training and inference must be stored in some variable somewhere. Can you point me to it?

Or maybe there's another obvious solution?

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

File: colab_notebook.ipynb

When running:
large_quant_model = False # Use the larger pretrained model
device = 'cuda' # 'cuda', 'cpu', 'cuda:0', 0, -1, torch.device('cuda')

import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from hubert.hubert_manager import HuBERTManager
from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer

model = ('quantifier_V1_hubert_base_ls960_23.pth', 'tokenizer_large.pth') if large_quant_model else ('quantifier_hubert_base_ls960_14.pth', 'tokenizer.pth')

print('Loading HuBERT...')
hubert_model = CustomHubert(HuBERTManager.make_sure_hubert_installed(), device=device)
print('Loading Quantizer...')
quant_model = CustomTokenizer.load_from_checkpoint(HuBERTManager.make_sure_tokenizer_installed(model=model[0], local_file=model[1]), device)
print('Loading Encodec...')
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)
encodec_model.to(device)

print('Downloaded and loaded models!')

then:

TypeError Traceback (most recent call last)
in
9 from hubert.hubert_manager import HuBERTManager
10 from hubert.pre_kmeans_hubert import CustomHubert
---> 11 from hubert.customtokenizer import CustomTokenizer
12
13 model = ('quantifier_V1_hubert_base_ls960_23.pth', 'tokenizer_large.pth') if large_quant_model else ('quantifier_hubert_base_ls960_14.pth', 'tokenizer.pth')

/data/bark_clone/hubert/customtokenizer.py in
151
152
--> 153 def auto_train(data_path, save_path='model.pth', load_model: str | None = None, save_epochs=1):
154 data_x, data_y = [], []
155

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
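
The str | None union syntax in annotations requires Python 3.10, which matches the note at the top of this README that the code targets 3.10. A minimal sketch of a workaround for older interpreters, assuming you can edit hubert/customtokenizer.py:

# Option 1: defer evaluation of annotations (works on Python 3.7+); must be the first import in the module
from __future__ import annotations

# Option 2: use typing.Optional instead of the 3.10-only "str | None"
from typing import Optional

def auto_train(data_path, save_path='model.pth', load_model: Optional[str] = None, save_epochs=1):
    ...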

semantic.npy

Hi, thank you for the work.
In your code, I understand that HuBERT is used to extract:
feat_string = '_semantic_features.npy'
How about:
sem_string = '_semantic.npy'
Where do I get the *_semantic.npy files?

Thank you.

Speaker switching

Hello folks,
When I try to generate audio from the cloned speaker voice, the output is a mix of different speakers. The audio switches speakers. For example, it will first talk in the male voice and then a female voice for the same text prompt. Is there any solution for this?
Any help is appreciated

Voice to semantic

If I understood correctly, you used a custom semantic-voice dataset to train your HuBERT model. Can you tell me how to create this dataset? Especially, how do you get the semantics from a voice? Many thanks for this work.

Support for Hindi language

@gitmylo, hello, I am currently trying to train the quantizer on a Hindi dataset.

I need to know how much time it would take to train on a P100 GPU, and also when I should stop the training,

given that I have a dataset of approximately 7000 wavs and semantic files.

I also need to clarify whether the HuBERT base model works well for the Hindi language.

Support for Swahili Language

Hi @gitmylo, I wonder if it's possible to add the Swahili language to the model, as it would be very interesting for the African community to use it natively.

Thanks for taking the time to read this.

generate semantic tokens from wavs

Thanks for sharing this code. I've run through the steps, and it appears to generate the semantic tokens from text and then generate the wav files from the semantic tokens. But is it possible to generate the semantic tokens from a set of wav files?

"no description" when bark run

I have tried to create an npz, although I think I have done something wrong. I have gotten bark running up until generate_coarse:
Exception has occurred: AssertionError
exception: no description
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/generation.py", line 573, in generate_coarse
    round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/api.py", line 54, in semantic_to_waveform
    coarse_tokens = generate_coarse(
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/api.py", line 113, in generate_audio
    out = semantic_to_waveform(

customHuburt.txt
This is what I used to make the npz. I'm pretty sure the issue is with fine_prompts = codes, but I'm not sure what else to do.

issues in notebook due to fairseq version

Maybe pinning a stricter fairseq version in the requirements is needed?

TypeError                                 Traceback (most recent call last)
Cell In [13], line 2
----> 2 from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert
      3 from bark_hubert_quantizer.customtokenizer import CustomTokenizer

File /notebooks/./bark-voice-cloning-HuBERT-quantizer/bark_hubert_quantizer/pre_kmeans_hubert.py:16
     13 from torch import nn
     14 from einops import pack, unpack
---> 16 import fairseq
     18 from torchaudio.functional import resample
     20 from audiolm_pytorch.utils import curtail_to_multiple

File /usr/local/lib/python3.9/dist-packages/fairseq/__init__.py:40
     38 import fairseq.optim.lr_scheduler  # noqa
     39 import fairseq.pdb  # noqa
---> 40 import fairseq.scoring  # noqa
     41 import fairseq.tasks  # noqa
     42 import fairseq.token_generation_constraints  # noqa

File /usr/local/lib/python3.9/dist-packages/fairseq/scoring/__init__.py:34
     29     @abstractmethod
     30     def result_string(self) -> str:
     31         pass
---> 34 _build_scorer, register_scorer, SCORER_REGISTRY, _ = registry.setup_registry(
     35     "--scoring", default="bleu"
     36 )
     39 def build_scorer(choice, tgt_dict):
     40     _choice = choice._name if isinstance(choice, DictConfig) else choice

TypeError: cannot unpack non-iterable NoneType object

Problems training a Portuguese quantizer model

Greetings,

I've followed all the steps in the guide to train on a Portuguese dataset, but unfortunately, across epochs, either the model did not really clone the voice, or it produced voices that did resemble the target voice but with bad speech output like screeching, speaking too slowly, or "sounding drunk". I could not get a single model that consistently produced good speech with voices closely resembling the target voice, despite training on a dataset of a little over 3200 samples for up to 30 epochs (I tested every single epoch). For the dataset, I am using some public-domain classic literature books and the Bible.

What am I doing wrong, and what can I do to improve the training and get better models?

adding batches to training?

Am I correct in saying the training code in customtokenizer only trains one X Y pair at a time instead of a whole batch at once?

Are there any plans to add batches to the training code so it can process a large batch at once? As it stands right now, when combining the English, German, Polish, Japanese, and Portuguese datasets from huggingface, it takes about 1 hour per epoch and only 3 of the 8 GB of VRAM are used.

(Obviously this doesn't run on Google Colab, since trying to load 32,000+ files crashes Google Drive AND runs out of instance RAM on Colab, but if it could, batching would be a very nice idea so that one epoch could be done in maybe a hundred steps instead of tens of thousands. It seems to be trying to fit one set of data in a way that conflicts with another set of data; to rephrase that in plain English, it tries to learn one feature and gets worse at another, then it corrects the other and gets worse at the first, whereas I think if it was all batched it would "see the bigger picture" and treat all the patterns as related and part of a whole, or something like that.)

(But I dunno though, maybe this architecture is insufficient for OMNI-LINGUAL and can at best only learn languages in a group like traditional linguists define. Romance languages, Indo.. uhh.. something languages.... I say it might be like that because just last night I tried using the english 23 epoch model as a pretrain starting point, and well... 8hrs later, at 8 epochs, it sort of can map an unsupported language like vietnamese. Approximates alot of words at the wrong "notes" but it did better than expected in THAT regard, so the theory is not too far off. Where it fucked up is suddenly some speakers english words, said with an accent turned into a Russian or Polish phoneme, which really makes me wonder if there's a limit to how "different" the languages can be, but still, way off topic here, I think batched training would really help with all this. )

(But the important thing here is it seems bark DOES have the ability to generate the correct phonemes for novel sounds if you can just tease out the right semantic tokens, which you can get really close by having the quantizer hybridize the languages your target language is closest to.... but ghyaaah thats such a pain in the ass to do for every language)
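
Not something the repo does today, but as a rough illustration of what batched training could look like, assuming lists of variable-length HuBERT feature tensors data_x and semantic token tensors data_y as in auto_train (names are illustrative):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):
    """Pairs of (HuBERT features, semantic tokens)."""
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys
    def __len__(self):
        return len(self.xs)
    def __getitem__(self, i):
        return self.xs[i], self.ys[i]

def collate(batch):
    xs, ys = zip(*batch)
    # Pad variable-length sequences so each batch can be stacked into a single tensor
    return pad_sequence(xs, batch_first=True), pad_sequence(ys, batch_first=True)

loader = DataLoader(PairDataset(data_x, data_y), batch_size=16, shuffle=True, collate_fn=collate)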

German tokenizer available / overtraining

Hey Mylo, thanks for your work on enabling Bark voice cloning and documenting everything you did. I can't believe you just started with machine learning!

I successfully trained a german tokenizer which is available here:
https://huggingface.co/CountFloyd/bark-voice-cloning-german-HuBERT-quantizer

It was trained for 14 epochs with quantizer version 1. The input dataset is right here:
https://huggingface.co/datasets/CountFloyd/bark-german-semantic-wav-training

So far the results are pretty good; however, I'm wondering if it's possible to overtrain. In your example code you always use the model trained for 14 epochs, not the one trained longer, for 23 epochs. Shouldn't the latter be better? Are there reasons to stop training earlier, e.g. to prevent overfitting?

Anyway, please also check out my Bark Repo where I wrapped your code in a Gradio GUI. It's also possible to swap voices in audio and train with it although this is still WIP. Thanks again!
