
Comments (11)

gitmylo commented on June 3, 2024

Try using a different wav2vec model, maybe one that's better for languages other than English.
Also check your dataset: does it sound normal? If not, that's a Bark issue, not really anything I can do about that.
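
For example, one way to swap in a non-English checkpoint through transformers (illustrative only; the Portuguese model below is the one discussed later in this thread, and any comparable wav2vec2 checkpoint would load the same way):

from transformers import Wav2Vec2Model, Wav2Vec2Processor

# Illustrative only: load a non-English wav2vec2 checkpoint instead of an English one.
model_name = "Edresson/wav2vec2-large-xlsr-coraa-portuguese"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)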


Subarasheese commented on June 3, 2024

Try using a different wav2vec model, maybe one that's better for languages other than English. Also check your dataset: does it sound normal? If not, that's a Bark issue, not really anything I can do about that.

The audio files from the dataset are mostly OK (it is what you would expect from Bark, a mixed bag; there is indeed some bad audio in there, but most of it is good in my book, as in, if the cloning model could produce those outputs I would be satisfied; unfortunately, it can't).

Also, I was using a HuBERT model, not Wav2Vec... but I created custom code to load this model:

https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese

Does this repo already have code to load Wav2Vec models?

Also, would I need Wav2Vec only when training, or would I need it when cloning a voice from a .wav file too? From the repo example as well as the webuis, it seems HuBERT is also used when extracting features from the input .wav voice that will be used for cloning.


Subarasheese commented on June 3, 2024

@gitmylo I modified the preparation script to use Wav2Vec instead of HuBERT; however, the semantic feature files come out in a different format than the one the training script accepts, since it expects the HuBERT-generated files instead...

Can you give me a hand on this?


import os
import shutil

import numpy
import torch
import torchaudio
from transformers import Wav2Vec2Model, Wav2Vec2Processor


def prepare2(path, model):
    prepared = os.path.join(path, 'prepared')
    ready = os.path.join(path, 'ready')
    # Note: the `model` argument is overridden by the wav2vec2 checkpoint below.
    model_name = "Edresson/wav2vec2-large-xlsr-coraa-portuguese"
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2Model.from_pretrained(model_name)
    if not os.path.isdir(ready):
        os.mkdir(ready)

    wav_string = '_wav.wav'
    sem_string = '_semantic.npy'
    for input_file in os.listdir(prepared):
        input_path = os.path.join(prepared, input_file)
        if input_file.endswith(wav_string):
            file_num = int(input_file[:-len(wav_string)])
            fname = f'{file_num}_semantic_features.npy'
            print('Processing', input_file)
            if os.path.isfile(os.path.join(ready, fname)):  # skip files already written to the ready folder
                continue
            wav, sr = torchaudio.load(input_path)

            if wav.shape[0] == 2:  # downmix stereo to mono if needed
                wav = wav.mean(dim=0)
            resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
            wav = resampler(wav)  # resample to 16,000 Hz, which wav2vec2 expects
            inputs = processor(wav.squeeze().numpy(), return_tensors="pt", padding=True, sampling_rate=16000)
            with torch.no_grad():
                outputs = model(**inputs)
            out_array = outputs.last_hidden_state.cpu().numpy()
            numpy.save(os.path.join(ready, fname), out_array)
        elif input_file.endswith(sem_string):
            fname = os.path.join(ready, input_file)
            if os.path.isfile(fname):
                continue
            shutil.copy(input_path, fname)
    print('All set! We\'re ready to train!')
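
As a quick, hypothetical sanity check (not part of the repo), the saved arrays can be inspected directly; note that last_hidden_state keeps a leading batch dimension (the "Literature" path is the dataset folder used with process.py):

import numpy

# Inspect one of the feature files written by prepare2 above.
feats = numpy.load('Literature/ready/0_semantic_features.npy')
print(feats.shape, feats.dtype)  # e.g. (1, 325, 1024) float32 for this wav2vec2 checkpoint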


gitmylo commented on June 3, 2024

Check the shape of the outputs of wav2vec, and when creating the HuBERT quantizer model, set the CustomTokenizer's input_size to that value.
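
A minimal sketch of what that check could look like, assuming the wav2vec2 checkpoint used above (the commented-out CustomTokenizer line only shows where the value would go):

import torch
from transformers import Wav2Vec2Model, Wav2Vec2Processor

model_name = "Edresson/wav2vec2-large-xlsr-coraa-portuguese"
processor = Wav2Vec2Processor.from_pretrained(model_name)
wav2vec = Wav2Vec2Model.from_pretrained(model_name)

# Push one second of silence through the model just to read off the feature size.
dummy = processor(torch.zeros(16000).numpy(), return_tensors="pt", sampling_rate=16000)
with torch.no_grad():
    hidden = wav2vec(**dummy).last_hidden_state
print(hidden.shape)  # (1, frames, 1024) for this checkpoint

# The last dimension is the value to pass as input_size, for example:
# model_training = CustomTokenizer(version=1, input_size=hidden.shape[-1]).to('cuda')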


Subarasheese commented on June 3, 2024

Check the shape of the outputs of wav2vec, and when creating the HuBERT quantizer model, set the CustomTokenizer's input_size to that value.

Here are the output shapes and hidden sizes:

(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode prepare2
Processing 0_wav.wav
Shape of Wav2Vec model output: (1, 325, 1024)
Hidden size (last dimension of model output): 1024
Processing 1_wav.wav
Shape of Wav2Vec model output: (1, 642, 1024)
Hidden size (last dimension of model output): 1024
Processing 2_wav.wav
Shape of Wav2Vec model output: (1, 357, 1024)
Hidden size (last dimension of model output): 1024
Processing 3_wav.wav
Shape of Wav2Vec model output: (1, 689, 1024)
Hidden size (last dimension of model output): 1024
Processing 4_wav.wav

From what I understand, you asked me to do this:

class CustomTokenizer(nn.Module):
    def __init__(self, hidden_size=1024, input_size=1024, output_size=10000, version=0):
        super(CustomTokenizer, self).__init__()
        input_size = 1024
        next_size = input_size
(...)

However, this change alone is not enough, as it results in errors:

(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
    auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 184, in auto_train
    model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0)  # Print loss every 50 steps
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 86, in train_step
    loss = lossfunc(y_pred, y_train_hot)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float

I also tried modifying the train_step method, removing the lines of code that create a one-hot vector from y_train and adding a line to convert y_train to a Long tensor with .long(), but that resulted in an error:

(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
    auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 181, in auto_train
    model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0)  # Print loss every 50 steps
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 82, in train_step
    loss = lossfunc(y_pred, y_train)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [1, 10000], got [1]

Any ideas?


gitmylo commented on June 3, 2024

You made a mistake by modifying the CustomTokenizer; it doesn't require any modifications. The only line that requires a change is the one where the training model is created: you need to add input_size=1024 to that line.

Doing it the way you did above will cause incompatibility with other models, which would prevent you from loading the result with the standard automatic loader function load_from_checkpoint.

And removing the one-hot is not a great idea: the one-hot encoding converts the target tensor from class indices into explicit vectors, changing the shape from (n,) to (n, 10000).
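
A minimal sketch of that conversion (not the repo's exact code), assuming the 10000 output classes mentioned above:

import torch
import torch.nn.functional as F

# y_train holds target token indices, shape (n,)
y_train = torch.tensor([3, 17, 9999])

# One-hot encoding turns each index into an explicit vector over the
# 10000 classes, changing the shape from (n,) to (n, 10000).
y_train_hot = F.one_hot(y_train, num_classes=10000).float()
print(y_train_hot.shape)  # torch.Size([3, 10000])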


Subarasheese commented on June 3, 2024

You made a mistake by modifying the CustomTokenizer; it doesn't require any modifications. The only line that requires a change is the one where the training model is created: you need to add input_size=1024 to that line.

Doing it the way you did above will cause incompatibility with other models, which would prevent you from loading the result with the standard automatic loader function load_from_checkpoint.

And removing the one-hot is not a great idea: the one-hot encoding converts the target tensor from class indices into explicit vectors, changing the shape from (n,) to (n, 10000).

I made that simple change; however, this problem occurs when running the script:


Creating new model.
Traceback (most recent call last):
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
    auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 183, in auto_train
    model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0)  # Print loss every 50 steps
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 85, in train_step
    loss = lossfunc(y_pred, y_train_hot)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float

The only modification in the script is:

    model_training = CustomTokenizer(version=1, input_size=1024).to('cuda')


gitmylo commented on June 3, 2024

Make sure the data types are correct, and are correctly converted if needed.


Subarasheese commented on June 3, 2024

Make sure the data types are correct, and are correctly converted if needed.

Ok, so I just added .long() to this line:

    loss = lossfunc(y_pred, y_train_hot.long())

And it started training... with a suspiciously low loss:

Creating new model.
Loss 5.78387975692749
Loss 0.004742730874568224
Loss 0.003009543986991048
Loss 0.001717338222078979
Loss 0.0006962093175388873
Loss 0.0011351052671670914
Loss 0.0004761719610542059
Loss 0.001148720970377326
Loss 0.0010141769889742136
Loss 0.0012041080044582486
Loss 0.000879546336364001
(...)

So I went to test the model...

And then this happened:

Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
    result = await self.call_function(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 48, in clone_voice
    semantic_tokens = tokenizer.get_token(semantic_vectors)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 55, in get_token
    return torch.argmax(self(x), dim=1)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 40, in forward
    x, _ = self.lstm(x)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 810, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
    self.check_input(input, batch_sizes)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 218, in check_input
    raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 1024, got 768

So I changed everything related to input_size to 768 and this happened:

Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
    result = await self.call_function(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 34, in clone_voice
    tokenizer = CustomTokenizer.load_from_checkpoint(f'./models/hubert/{tokenizer_lang}_tokenizer.pth').to(device)  # Automatically uses the right layers
  File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 122, in load_from_checkpoint
    model.load_state_dict(torch.load(path))
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CustomTokenizer:
        size mismatch for lstm.weight_ih_l0: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 768]).

In clonevoice.py, I even changed the code to do this (padding the vectors with zeros up to 1024), and of course that turned out to be a terrible idea, as the resulting cloned .npz came out corrupted:


    def pad_to_size(vec, size):
        zeros = torch.zeros((vec.shape[0], size - vec.shape[1]))
        vec = torch.cat([vec, zeros], dim=1)
        return vec

    # Before you call get_token:
    semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
    semantic_vectors = pad_to_size(semantic_vectors, 1024)
    semantic_tokens = tokenizer.get_token(semantic_vectors)

What do you suggest? Am I missing something?


Subarasheese commented on June 3, 2024

This issue can be closed. I finished training a Portuguese quantizer:
24 epochs, a learning rate of 0.0005, over 4000 utterances.

Model weights:

https://huggingface.co/MadVoyager/bark-voice-cloning-portuguese-HuBERT-quantizer

Dataset:

https://huggingface.co/datasets/MadVoyager/bark-portuguese-semantic-wav-training/

Retraining with a lower learning rate and more utterances greatly improved the model.


Maverick1983 commented on June 3, 2024

@Subarasheese Can you help me train an Italian model? Could you write out the steps you performed?

Thanks

