
bark-voice-cloning-hubert-quantizer's Introduction

Bark voice cloning

Please read

This code works on Python 3.10; I have not tested it on other versions. Some older versions will have issues.

Voice cloning with bark in high quality?

It's possible now.

Example clip: examples_biden_example.mov

How do I clone a voice?

For developers: see "Implementing voice cloning in your bark projects" below.

For everyone: the repository's colab_notebook.ipynb walks through the cloning process.

Cloned voices aren't very convincing; why are other people's cloned voices better than mine?

Make sure these things are NOT in your voice input: (in no particular order)

  • Noise (you can run a noise remover on the audio first)
  • Music (there are also music-removal tools), unless you want music in the background
  • A cut-off at the end (this will cause the model to try to continue the generation)
  • Under 1 second of training data (I personally suggest around 10 seconds for good potential, but I've had great results with 5 seconds as well)

What makes for good prompt audio? (in no particular order)

  • Clearly spoken
  • No weird background noises
  • Only one speaker
  • Audio which ends after a sentence ends
  • A regular/common voice (these usually have more success; the model can still clone more unusual voices, just not as well)
  • Around 10 seconds of data (a rough preparation sketch follows below)
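
As a rough illustration (not part of this repo), the snippet below shows the kind of cleanup you might apply to a prompt clip before cloning: mix down to mono and trim to roughly 10 seconds. File names are placeholders.

import torchaudio

wav, sr = torchaudio.load('raw_prompt.wav')  # placeholder path

# Mix down to mono if the clip is stereo
if wav.shape[0] > 1:
    wav = wav.mean(0, keepdim=True)

# Keep roughly the first 10 seconds; ideally cut at a sentence boundary rather than mid-word
max_samples = 10 * sr
wav = wav[:, :max_samples]

torchaudio.save('prompt.wav', wav, sr)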

Pretrained models

Official

| Name | HuBERT Model | Quantizer Version | Epoch | Language | Dataset |
| --- | --- | --- | --- | --- | --- |
| quantifier_hubert_base_ls960.pth | HuBERT Base | 0 | 3 | ENG | GitMylo/bark-semantic-training |
| quantifier_hubert_base_ls960_14.pth | HuBERT Base | 0 | 14 | ENG | GitMylo/bark-semantic-training |
| quantifier_V1_hubert_base_ls960_23.pth | HuBERT Base | 1 | 23 | ENG | GitMylo/bark-semantic-training |

Community

| Author | Name | HuBERT Model | Quantizer Version | Epoch | Language | Dataset |
| --- | --- | --- | --- | --- | --- | --- |
| HobisPL | polish-HuBERT-quantizer_8_epoch.pth | HuBERT Base | 1 | 8 | POL | Hobis/bark-polish-semantic-wav-training |
| C0untFloyd | german-HuBERT-quantizer_14_epoch.pth | HuBERT Base | 1 | 14 | GER | CountFloyd/bark-german-semantic-wav-training |

For developers: Implementing voice cloning in your bark projects

  • Simply copy the files from this directory into your project.
  • The hubert manager contains methods to download HuBERT and the custom Quantizer model.
  • Loading the CustomHubert should be pretty straightforward.
  • The notebook contains code for running on CUDA or CPU, instead of just CPU.
from hubert.pre_kmeans_hubert import CustomHubert
import torchaudio

# Load the HuBERT model,
# checkpoint_path should work fine with data/models/hubert/hubert.pt for the default config
hubert_model = CustomHubert(checkpoint_path='path/to/checkpoint')

# Run the model to extract semantic features from an audio file, where wav is your audio file
wav, sr = torchaudio.load('path/to/wav') # This is where you load your wav, with soundfile or torchaudio for example

if wav.shape[0] == 2:  # Stereo to mono if needed
    wav = wav.mean(0, keepdim=True)

semantic_vectors = hubert_model.forward(wav, input_sample_hz=sr)
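
Then quantize those semantic vectors into bark-compatible semantic tokens with the CustomTokenizer:
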
import torch
from hubert.customtokenizer import CustomTokenizer

# Load the CustomTokenizer model from a checkpoint
# With default config, you can use the pretrained model from huggingface
# With the default setup from HuBERTManager, this will be in data/models/hubert/tokenizer.pth
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth')  # Automatically uses the right layers

# Process the semantic vectors from the previous HuBERT run (This works in batches, so you can send the entire HuBERT output)
semantic_tokens = tokenizer.get_token(semantic_vectors)

# Congratulations! You now have semantic tokens which can be used inside of a speaker prompt file.
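
The semantic tokens are only one part of a bark speaker prompt; the .npz also holds coarse and fine acoustic codes from EnCodec. The snippet below is a rough sketch of how such a prompt is commonly assembled (the key names follow bark's history-prompt format; treat the exact slicing as an assumption):

import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Extract acoustic codes from the same audio with EnCodec (24 kHz)
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load('path/to/wav')
wav = convert_audio(wav, sr, encodec_model.sample_rate, encodec_model.channels).unsqueeze(0)
with torch.no_grad():
    encoded_frames = encodec_model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze(0)  # (n_codebooks, T)

# Assemble the speaker prompt file for bark
np.savez('speaker.npz',
         semantic_prompt=semantic_tokens.cpu().numpy(),  # from the tokenizer step above
         coarse_prompt=codes[:2, :].cpu().numpy(),       # first two codebooks as the coarse prompt
         fine_prompt=codes.cpu().numpy())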

How do I train it myself?

Simply run the training commands.

A simple way to create semantic data and wavs for training is with my script bark-data-gen. But remember that creating the wavs will take around the same time as, if not longer than, creating the semantics, so generation can take a while.

For example, suppose you have a dataset of zips containing audio files, one zip for the semantics and one for the wav files, inside a folder called "Literature".

Run process.py --path Literature --mode prepare to extract all the data into one directory.

Run process.py --path Literature --mode prepare2 to create HuBERT semantic vectors, ready for training.

Run process.py --path Literature --mode train to train.

And when your model has trained enough, run process.py --path Literature --mode test to test the latest model.

Disclaimer

I am not responsible for audio generated using semantics created by this model. Just don't use it for illegal purposes.

bark-voice-cloning-hubert-quantizer's People

Contributors

alwinaind, brasd99, gitmylo, rsxdalv, yongebai


bark-voice-cloning-hubert-quantizer's Issues

Support Japanese voice cloning

Hi, thanks for your work making voice cloning possible with Bark. I created datasets for Japanese and trained a Japanese quantizer. The result is pretty good after 24 epochs with around 5k samples. If anyone wants to give it a try, they can simply download it from Huggingface.

japanese-quantizer
Japanese datasets

How to Train for Non-Verbal Effects Voice?

How do I train for keyword effects such as [man], [woman], or even the [music] keyword?

Do I have to put [man] in the semantic text to train a male voice?

I really appreciate your work.

Fine-tune for a certain speaker

Thanks for this great work. I am wondering, if I want to increase the quality of voice cloning for a certain speaker, is there a way to fine-tune the model? If yes, how should I do it? Thank you.

`KeyError: 'best_loss'` when testing self-trained model

Hi, first of all thank you for your work!

I created the semantic data and wavs with the help of your bark-data-gen repo, and trained the model myself by following the steps you mentioned at https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer#how-do-i-train-it-myself.
I trained the model until 20th epoch and would now like to test it.

Unfortunately testing gives me a KeyError :

$ python process.py  --path Literature  --mode test
Traceback (most recent call last):
  File ".../bark-voice-cloning-HuBERT-quantizer/process.py", line 28, in <module>
    test_hubert(path, model)
  File ".../bark-voice-cloning-HuBERT-quantizer/test_hubert.py", line 13, in test_hubert
    hubert_model = CustomHubert(checkpoint_path=model)
  File ".../bark-voice-cloning-HuBERT-quantizer/hubert/pre_kmeans_hubert.py", line 62, in __init__
    model, *_ = fairseq.checkpoint_utils.load_model_ensemble_and_task(load_model_input)
  File ".../python3.10/site-packages/fairseq/checkpoint_utils.py", line 431, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File ".../python3.10/site-packages/fairseq/checkpoint_utils.py", line 349, in load_checkpoint_to_cpu
    state = _upgrade_state_dict(state)
  File ".../python3.10/site-packages/fairseq/checkpoint_utils.py", line 595, in _upgrade_state_dict
    "best_loss": state["best_loss"]}
KeyError: 'best_loss'

In test_hubert.py, I'm passing the path to my self-trained model model_epoch_20.pth:

# test_hubert.py
def test_hubert(path: str, model: str = ".../Literature/model_epoch_20.pth", 
                tokenizer: str = 'model.pth'):
    hubert_model = CustomHubert(checkpoint_path=model)  # throws error

The self-trained model dict has the following keys:

# fairseq/checkpoint_utils.py
odict_keys(['lstm.weight_ih_l0', 'lstm.weight_hh_l0', 'lstm.bias_ih_l0', 'lstm.bias_hh_l0', 'lstm.weight_ih_l1', 'lstm.weight_hh_l1', 'lstm.bias_ih_l1', 'lstm.bias_hh_l1', 'intermediate.weight', 'intermediate.bias', 'fc.weight', 'fc.bias'])

I'm getting the same error when trying to test the pre-trained german-HuBERT-quantizer_14_epoch.pth model from C0untFloyd mentioned in this repo.

Could you please give me a hint about what I'm doing wrong here? How could I successfully test the self-trained model?

Thank you very much in advance.
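
For reference, in the README example further up, the fairseq HuBERT checkpoint and the trained quantizer are two separate files loaded by two different classes, so passing the self-trained quantizer to CustomHubert may be the source of this error. A minimal sketch of that split, with paths taken from the issue and the default config:

from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer

# CustomHubert expects the fairseq HuBERT checkpoint (data/models/hubert/hubert.pt by default),
# not the quantizer trained by process.py
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt')

# The self-trained quantizer is loaded by CustomTokenizer instead
tokenizer = CustomTokenizer.load_from_checkpoint('Literature/model_epoch_20.pth')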

Testing

I have successfully run the colab notebook and saved the speaker.npz file.
The issue now is: how do I test it on new data?

No module named 'hubert'

I want to train on my semantics and wavs, but I get this error when starting process.py:

File "/kaggle/working/bark-voice-cloning-HuBERT-quantizer/prepare.py", line 8, in <module> from hubert.pre_kmeans_hubert import CustomHubert ModuleNotFoundError: No module named 'hubert'

german-HuBERT-quantizer_14_epoch.pth does not have all metadata

If I try to call prepare2 with the german-HuBERT-quantizer_14_epoch.pth checkpoint, the following error emerges:

File "/usr/local/lib/python3.10/dist-packages/fairseq/checkpoint_utils.py", line 585, in _upgrade_state_dict
    {"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}
KeyError: 'best_loss'

If the quantifier_hubert_base_ls960.pth checkpoint is used, everything works well.

Is there somewhere a checkpoint of german-HuBERT-quantizer_14_epoch.pth with the necessary metadata?

License for /hubert module

I saw that the model itself is MIT on huggingface; however, the /hubert module from this repository is not under a clear license.

Omni-Lingual Quantizer?

(Took me way too long to realize this, and it just goes to show that most of us are just point-and-click type of fellas who don't really understand what we're using - not really a skiddie, because we can code, but... you get what I mean)

So if the whole point of using bark-generated audio to train a quantizer like this,

instead of simply grabbing a massive, good dataset of audio and having Whisper transcribe it and then adding tags or correcting as needed (or, god forbid, manually finding voice clips with actually good audio that more or less matches what you hear),

is simply that you don't know exactly how they trained their HuBERT-voice-features-to-semantic-tokens mapping (the unknown here being the semantic tokens), and you want to make sure you at least start from THEIR HuBERT-features-to-tokens mapping and refine it,

... then couldn't a general-purpose HuBERT-to-semantic-tokens quantizer be made instead? You would just generate or aggregate all the supported languages, generate datasets if you don't have them already, and train a quantizer on ALL OF THAT, since its aim is just a reverse "tell me the semantic tokens for this series of sounds", so it should theoretically cover any "known" language.

(Minus the African ones, because there's no statistically significant presence of tongue-click languages on the internet; but knowing Bark and its random noises during generation, and assuming the HuBERT model has also learned those, it probably CAN map a tongue-click language too.)

I see you did it for English, but I'm wondering why everyone has stopped at a single-language quantizer when it could probably be made into an omnilingual quantizer.

I ask in the name of languages like Klingon, Middle English, Old English, Vietnamese with a southern accent....

How do I create a quantizer for a dialect that Bark doesn't support?

Cantonese is a Chinese dialect, commonly spoken in Hong Kong and Guangzhou. The problem is that Bark only supports "Chinese", which is in fact Putonghua.

I would like to create a Cantonese quantizer from the beginning; how am I supposed to do that? As far as I know, as @junwchina said, bark-data-gen is a tool for generating training data to train my quantizer model. But the pronunciation of Cantonese and Putonghua is completely different: if I use bark-data-gen, it will likely only output a Putonghua dataset, which is not the one I want.

How to increase quality?

Hey, @gitmylo, great work on this repo.

If I want to increase the quality what's the best way to go about that?

I imagine the number of steps that are used during both training and inference must be stored in some variable somewhere. Can you point me to it?

Or maybe there's another obvious solution?

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

File: colab_notebook.ipynb

When running:
large_quant_model = False # Use the larger pretrained model
device = 'cuda' # 'cuda', 'cpu', 'cuda:0', 0, -1, torch.device('cuda')

import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from hubert.hubert_manager import HuBERTManager
from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer

model = ('quantifier_V1_hubert_base_ls960_23.pth', 'tokenizer_large.pth') if large_quant_model else ('quantifier_hubert_base_ls960_14.pth', 'tokenizer.pth')

print('Loading HuBERT...')
hubert_model = CustomHubert(HuBERTManager.make_sure_hubert_installed(), device=device)
print('Loading Quantizer...')
quant_model = CustomTokenizer.load_from_checkpoint(HuBERTManager.make_sure_tokenizer_installed(model=model[0], local_file=model[1]), device)
print('Loading Encodec...')
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)
encodec_model.to(device)

print('Downloaded and loaded models!')

then:

TypeError Traceback (most recent call last)
in
9 from hubert.hubert_manager import HuBERTManager
10 from hubert.pre_kmeans_hubert import CustomHubert
---> 11 from hubert.customtokenizer import CustomTokenizer
12
13 model = ('quantifier_V1_hubert_base_ls960_23.pth', 'tokenizer_large.pth') if large_quant_model else ('quantifier_hubert_base_ls960_14.pth', 'tokenizer.pth')

/data/bark_clone/hubert/customtokenizer.py in
151
152
--> 153 def auto_train(data_path, save_path='model.pth', load_model: str | None = None, save_epochs=1):
154 data_x, data_y = [], []
155

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
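
The str | None union syntax in annotations requires Python 3.10, which matches the note at the top of this README that the code targets 3.10. A minimal sketch of a workaround for older interpreters, assuming you can edit hubert/customtokenizer.py:

# Option 1: defer evaluation of annotations (works on Python 3.7+); must be the first import in the module
from __future__ import annotations

# Option 2: use typing.Optional instead of the 3.10-only "str | None"
from typing import Optional

def auto_train(data_path, save_path='model.pth', load_model: Optional[str] = None, save_epochs=1):
    ...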

semantic.npy

Hi, thank you for the work.
In your code, I understand that HuBERT is used to extract:
feat_string = '_semantic_features.npy'
How about:
sem_string = '_semantic.npy'
Where do I get the *_semantic.npy files?

Thank you.

Speaker switching

Hello folks,
When I try to generate audio from the cloned speaker voice, the output is a mix of different speakers. The audio switches speakers. For example, it will first talk in the male voice and then a female voice for the same text prompt. Is there any solution for this?
Any help is appreciated

Voice to semantic

If I understood correctly, you used a custom semantic-voice dataset to train your HuBERT model. Can you tell me how to create this dataset? Especially, how do you get the semantics from a voice? Many thanks for this work.

Support for Hindi language

@gitmylo, hello, I am currently trying to train the quantizer on a Hindi dataset.

I need to know how much time it would take to train on a P100 GPU, and also when I should stop the training,

given that I have a dataset of approximately 7000 wavs and semantic files.

I also need to clarify whether the HuBERT base model works well for the Hindi language.

Support for Swahili Language

Hi @gitmylo, I wonder if it's possible to add the Swahili language to the model, as it would be very interesting for the African community to use it natively.

Thanks for taking the time to read this.

generate semantic tokens from wavs

Thanks for sharing this code. I've run through the steps, and it appears to generate the semantic tokens from text and then generate the wav files from the semantic tokens. But is it possible to generate the semantic tokens from a set of wav files?

"no description" when bark run

I have tried to create an npz, although I think I have done something wrong. I have gotten bark running up until generate_coarse:
Exception has occurred: AssertionError
exception: no description
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/generation.py", line 573, in generate_coarse
    round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/api.py", line 54, in semantic_to_waveform
    coarse_tokens = generate_coarse(
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/api.py", line 113, in generate_audio
    out = semantic_to_waveform(

customHuburt.txt
This is what I used to make the npz. I'm pretty sure the issue is with fine_prompts = codes, but I'm not sure what else to do.

issues in notebook due to fairseq version

Maybe pinning a stricter fairseq version in the requirements is needed?

TypeError                                 Traceback (most recent call last)
Cell In [13], line 2
----> 2 from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert
      3 from bark_hubert_quantizer.customtokenizer import CustomTokenizer

File /notebooks/./bark-voice-cloning-HuBERT-quantizer/bark_hubert_quantizer/pre_kmeans_hubert.py:16
     13 from torch import nn
     14 from einops import pack, unpack
---> 16 import fairseq
     18 from torchaudio.functional import resample
     20 from audiolm_pytorch.utils import curtail_to_multiple

File /usr/local/lib/python3.9/dist-packages/fairseq/__init__.py:40
     38 import fairseq.optim.lr_scheduler  # noqa
     39 import fairseq.pdb  # noqa
---> 40 import fairseq.scoring  # noqa
     41 import fairseq.tasks  # noqa
     42 import fairseq.token_generation_constraints  # noqa

File /usr/local/lib/python3.9/dist-packages/fairseq/scoring/__init__.py:34
     29     @abstractmethod
     30     def result_string(self) -> str:
     31         pass
---> 34 _build_scorer, register_scorer, SCORER_REGISTRY, _ = registry.setup_registry(
     35     "--scoring", default="bleu"
     36 )
     39 def build_scorer(choice, tgt_dict):
     40     _choice = choice._name if isinstance(choice, DictConfig) else choice

TypeError: cannot unpack non-iterable NoneType object

Problems training a Portuguese quantizer model

Greetings,

I've followed all the steps in the guide to train on a Portuguese dataset, but unfortunately, across epochs, either the model did not really clone the voice, or it produced voices that did resemble the target voice but with bad speech output like screeching, speaking too slowly, or "sounding drunk". I could not get a single model that consistently produced good speech with voices closely resembling the target voice, despite training on a dataset of a little over 3200 samples for up to 30 epochs (I tested every single epoch). For the dataset, I am using some public-domain classic literature books and the Bible.

What am I doing wrong, and what can I do to improve the training and get better models?

adding batches to training?

Am I correct in saying the training code in customtokenizer only trains one X Y pair at a time instead of a whole batch at once?

Are there any plans to add batches to the training code so it can process a large batch at once? As it stands right now, when combining the English, German, Polish, Japanese, and Portuguese datasets from huggingface, it takes about 1 hour per epoch and only 3 of the 8 GB of VRAM are used.

(Obviously this doesn't run on Google Colab, since trying to load 32,000+ files crashes Google Drive AND runs out of instance RAM on Colab, but if it could, batching would be a very nice idea so that one epoch could be done in maybe a hundred steps instead of tens of thousands. It seems to be trying to fit one set of data in a way that conflicts with another set of data; to rephrase that in plain English, it tries to learn one feature and gets worse at another, then it corrects the other and gets worse at the first, whereas I think if it was all batched it would "see the bigger picture" and treat all the patterns as related and part of a whole, or something like that.)

(But I dunno though, maybe this architecture is insufficient for OMNI-LINGUAL and can at best only learn languages in a group like traditional linguists define. Romance languages, Indo.. uhh.. something languages.... I say it might be like that because just last night I tried using the english 23 epoch model as a pretrain starting point, and well... 8hrs later, at 8 epochs, it sort of can map an unsupported language like vietnamese. Approximates alot of words at the wrong "notes" but it did better than expected in THAT regard, so the theory is not too far off. Where it fucked up is suddenly some speakers english words, said with an accent turned into a Russian or Polish phoneme, which really makes me wonder if there's a limit to how "different" the languages can be, but still, way off topic here, I think batched training would really help with all this. )

(But the important thing here is it seems bark DOES have the ability to generate the correct phonemes for novel sounds if you can just tease out the right semantic tokens, which you can get really close by having the quantizer hybridize the languages your target language is closest to.... but ghyaaah thats such a pain in the ass to do for every language)
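
Not something the repo does today, but as a rough illustration of what batched training could look like, assuming lists of variable-length HuBERT feature tensors data_x and semantic token tensors data_y as in auto_train (names are illustrative):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):
    """Pairs of (HuBERT features, semantic tokens)."""
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys
    def __len__(self):
        return len(self.xs)
    def __getitem__(self, i):
        return self.xs[i], self.ys[i]

def collate(batch):
    xs, ys = zip(*batch)
    # Pad variable-length sequences so each batch can be stacked into a single tensor
    return pad_sequence(xs, batch_first=True), pad_sequence(ys, batch_first=True)

loader = DataLoader(PairDataset(data_x, data_y), batch_size=16, shuffle=True, collate_fn=collate)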

German tokenizer available / overtraining

Hey Mylo, thanks for your work on enabling Bark voice cloning and documenting everything you did. I can't believe you just started with machine learning!

I successfully trained a german tokenizer which is available here:
https://huggingface.co/CountFloyd/bark-voice-cloning-german-HuBERT-quantizer

It was trained for 14 epochs with quantizer version 1. The input dataset is right here:
https://huggingface.co/datasets/CountFloyd/bark-german-semantic-wav-training

So far the results are pretty good; however, I'm wondering if it's possible to overtrain. In your example code you always use the model trained for 14 epochs, not the one trained longer, for 23 epochs. Shouldn't the latter be better? Are there reasons to stop training earlier, e.g. to prevent overfitting?

Anyway, please also check out my Bark Repo where I wrapped your code in a Gradio GUI. It's also possible to swap voices in audio and train with it although this is still WIP. Thanks again!
