Comments (11)
Try using a different wav2vec model, maybe one that's better for languages other than English.
Also check your dataset: does it sound normal? If not, that's a Bark issue, and there isn't really anything I can do about that.
from bark-voice-cloning-hubert-quantizer.
The audio in the dataset is mostly OK. It is what you would expect from Bark, a mixed bag: there is some bad audio in there, but most of it is good in my book. If the cloning model could produce those outputs I would be satisfied; unfortunately it can't.
Also, I was using a HuBERT model, not Wav2Vec. But I created custom code to load this model:
https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese
Does this repo already have code to load Wav2Vec models?
Also, would I need Wav2Vec only when training, or also when cloning a voice from a .wav file? From the repo example as well as the web UIs, it seems HuBERT is also used to extract features from the input .wav voice that will be used in cloning.
@gitmylo I modified the preparation script to use Wav2Vec instead of HuBERT. However, the semantic feature files it writes are in a different format from the HuBERT-generated files that the training script expects.
Can you give me a hand with this?
import os
import shutil

import numpy
import torch
import torchaudio
from transformers import Wav2Vec2Model, Wav2Vec2Processor


def prepare2(path, model):
    prepared = os.path.join(path, 'prepared')
    ready = os.path.join(path, 'ready')
    # Note: the `model` argument is overwritten by the Wav2Vec2 model loaded below.
    model_name = "Edresson/wav2vec2-large-xlsr-coraa-portuguese"
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2Model.from_pretrained(model_name)
    if not os.path.isdir(ready):
        os.mkdir(ready)
    wav_string = '_wav.wav'
    sem_string = '_semantic.npy'
    for input_file in os.listdir(prepared):
        input_path = os.path.join(prepared, input_file)
        if input_file.endswith(wav_string):
            file_num = int(input_file[:-len(wav_string)])
            fname = f'{file_num}_semantic_features.npy'
            print('Processing', input_file)
            # Skip files that were already processed (look inside the output folder).
            if os.path.isfile(os.path.join(ready, fname)):
                continue
            wav, sr = torchaudio.load(input_path)
            if wav.shape[0] == 2:  # Stereo to mono if needed
                wav = wav.mean(dim=0)
            resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
            wav = resampler(wav)  # Resample to 16,000 Hz
            inputs = processor(wav.squeeze().numpy(), return_tensors="pt", padding=True, sampling_rate=16000)
            with torch.no_grad():
                outputs = model(**inputs)
            # last_hidden_state has shape (1, frames, hidden_size)
            out_array = outputs.last_hidden_state.cpu().numpy()
            numpy.save(os.path.join(ready, fname), out_array)
        elif input_file.endswith(sem_string):
            fname = os.path.join(ready, input_file)
            if os.path.isfile(fname):
                continue
            shutil.copy(input_path, fname)
    print('All set! We\'re ready to train!')
Check the shape of the outputs of wav2vec, and when creating the HuBERT quantizer model, set the CustomTokenizer's input_size to that value.
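A minimal sketch of that check, using a random NumPy array as a stand-in for one saved feature file (the array and variable names are hypothetical, not taken from the repository). The feature size is the last dimension. The sketch also drops a leading batch dimension of 1, on the assumption that the training script expects one row per frame:

```python
import numpy as np

# Stand-in for one saved feature array; a real one would come from np.load(...).
features = np.random.rand(1, 325, 1024).astype(np.float32)

# The last dimension is the feature size to pass as input_size.
input_size = features.shape[-1]

# Assumption: training wants (frames, features), so drop a batch dimension of 1.
if features.ndim == 3 and features.shape[0] == 1:
    features = features.squeeze(0)

print(input_size, features.shape)
```

If the saved arrays keep the extra leading dimension, every later step has to account for it, so it is worth confirming the expected layout against a HuBERT-generated file before training.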
Here are the outputs of shape and hidden size:
(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode prepare2
Processing 0_wav.wav
Shape of Wav2Vec model output: (1, 325, 1024)
Hidden size (last dimension of model output): 1024
Processing 1_wav.wav
Shape of Wav2Vec model output: (1, 642, 1024)
Hidden size (last dimension of model output): 1024
Processing 2_wav.wav
Shape of Wav2Vec model output: (1, 357, 1024)
Hidden size (last dimension of model output): 1024
Processing 3_wav.wav
Shape of Wav2Vec model output: (1, 689, 1024)
Hidden size (last dimension of model output): 1024
Processing 4_wav.wav
From what I understand, you asked me to do this:
class CustomTokenizer(nn.Module):
    def __init__(self, hidden_size=1024, input_size=1024, output_size=10000, version=0):
        super(CustomTokenizer, self).__init__()
        input_size = 1024
        next_size = input_size
        (...)
However, this change alone is not enough, as it results in errors:
(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 184, in auto_train
model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0) # Print loss every 50 steps
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 86, in train_step
loss = lossfunc(y_pred, y_train_hot)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float
I then modified the train_step method: I removed the lines that create a one-hot vector from y_train and added a line that converts y_train to a Long tensor with the long() method. That produced a different error:
(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 181, in auto_train
model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0) # Print loss every 50 steps
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 82, in train_step
loss = lossfunc(y_pred, y_train)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [1, 10000], got [1]
Any ideas?
You made a mistake by modifying the CustomTokenizer; it doesn't require any modifications. This is the line that requires modification: you need to add input_size=1024 to this line.
Doing it the way you did above will cause incompatibility with other models, which would prevent you from loading it with the standard automatic loader function load_from_checkpoint.
And removing the one-hot is obviously not a great idea. It converts a tensor of class indices (positions) into a tensor with a 1 at each of those positions, changing the shape from (n,) to (n, 10000).
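The shape change described above can be illustrated with a small sketch. It uses `torch.nn.functional.one_hot` on made-up indices; the repository's own training step may build the one-hot tensor differently, so treat this only as an illustration of the idea:

```python
import torch
import torch.nn.functional as F

num_classes = 10000

# One class index per frame: shape (n,)
indices = torch.tensor([3, 17, 9999])

# One row per frame with a single 1 at the indexed position: shape (n, 10000)
one_hot = F.one_hot(indices, num_classes)

# argmax along the class dimension recovers the original indices
recovered = one_hot.argmax(dim=1)

print(indices.shape, one_hot.shape)
```

Each row of `one_hot` contains exactly one 1, so `argmax` along the class dimension maps it back to the original index.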
I made that simple change; however, the script now fails with this error:
Creating new model.
Traceback (most recent call last):
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 183, in auto_train
model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0) # Print loss every 50 steps
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 85, in train_step
loss = lossfunc(y_pred, y_train_hot)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float
The only modification in the script is:
model_training = CustomTokenizer(version=1,input_size=1024).to('cuda')
Make sure the data types are correct, and are correctly converted if needed.
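As a rough guide to which data types the loss accepts, here is a sketch with made-up tensors, assuming a reasonably recent PyTorch (class-probability targets need version 1.10 or later). `F.cross_entropy` takes either integer class indices of shape (n,) or floating-point class probabilities with the same shape as the logits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)              # (n, classes), float32

# Form 1: class indices, dtype int64 (Long), shape (n,)
idx = torch.tensor([1, 0, 7, 3])
loss_idx = F.cross_entropy(logits, idx)

# Form 2: class probabilities, floating point, same shape as logits
probs = F.one_hot(idx, 10).float()
loss_prob = F.cross_entropy(logits, probs)

print(idx.dtype, probs.dtype)
```

For one-hot probabilities the two forms give the same loss. Note that calling `.long()` on a one-hot tensor only changes its dtype; the shape stays (n, classes), so it does not become a tensor of class indices.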
Ok, so I just added .long() to this line:
loss = lossfunc(y_pred, y_train_hot.long())
And it started training, but with a suspiciously low loss:
Creating new model.
Loss 5.78387975692749
Loss 0.004742730874568224
Loss 0.003009543986991048
Loss 0.001717338222078979
Loss 0.0006962093175388873
Loss 0.0011351052671670914
Loss 0.0004761719610542059
Loss 0.001148720970377326
Loss 0.0010141769889742136
Loss 0.0012041080044582486
Loss 0.000879546336364001
(...)
So I went to test the model...
And then this happened:
Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
result = await self.call_function(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 48, in clone_voice
semantic_tokens = tokenizer.get_token(semantic_vectors)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 55, in get_token
return torch.argmax(self(x), dim=1)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 40, in forward
x, _ = self.lstm(x)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 810, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
self.check_input(input, batch_sizes)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 218, in check_input
raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 1024, got 768
So I changed everything related to input_size to 768 and this happened:
Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
result = await self.call_function(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 34, in clone_voice
tokenizer = CustomTokenizer.load_from_checkpoint(f'./models/hubert/{tokenizer_lang}_tokenizer.pth').to(device) # Automatically uses the right layers
File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 122, in load_from_checkpoint
model.load_state_dict(torch.load(path))
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CustomTokenizer:
size mismatch for lstm.weight_ih_l0: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 768]).
In clonevoice.py, I even changed the code to pad the vectors with zeros up to 1024. Of course that turned out to be a terrible idea, as the resulting cloned .npz came out corrupted:
def pad_to_size(vec, size):
    zeros = torch.zeros((vec.shape[0], size - vec.shape[1]))
    vec = torch.cat([vec, zeros], dim=1)
    return vec

# Before you call get_token:
semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_vectors = pad_to_size(semantic_vectors, 1024)
semantic_tokens = tokenizer.get_token(semantic_vectors)
What do you suggest? Am I missing something?
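Both tracebacks above point at the same mismatch: the checkpoint was trained on 1024-dimensional features, while the features produced at cloning time are 768-dimensional. A minimal sketch of a pre-flight check follows, using a bare `nn.LSTM` as a stand-in for the tokenizer (the sizes and variable names are hypothetical; in the real checkpoint the key appears as `lstm.weight_ih_l0`):

```python
import torch
import torch.nn as nn

# Stand-in for the LSTM inside a tokenizer trained on 1024-dim features.
lstm = nn.LSTM(input_size=1024, hidden_size=1024, batch_first=True)
state = lstm.state_dict()

# For an LSTM, weight_ih_l0 has shape (4 * hidden_size, input_size),
# so the second dimension is the feature size the weights were trained on.
trained_input_size = state["weight_ih_l0"].shape[1]

# Stand-in for features from a 768-dim extractor at cloning time.
features = torch.randn(50, 768)
compatible = features.shape[-1] == trained_input_size

print(trained_input_size, compatible)
```

A check like this only detects the mismatch. Zero-padding or changing `input_size` after training cannot repair it, because the learned weights are tied to the feature extractor used during training; the same extractor would need to be used for both training and cloning.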
This issue can be closed. I finished training a Portuguese quantizer.
24 epochs, 0.0005 LR, over 4000 utterances.
Model weights:
https://huggingface.co/MadVoyager/bark-voice-cloning-portuguese-HuBERT-quantizer
Dataset:
https://huggingface.co/datasets/MadVoyager/bark-portuguese-semantic-wav-training/
Retraining with a lower learning rate and more utterances greatly improved the model.
@Subarasheese Can you help me train an Italian quantizer? Could you write up the steps you performed?
Thanks