
Comments (3)

cmp-nct commented on July 30, 2024

The README made it sound like a drop-in replacement ;-)

@yl4579
It would be nice to get a few more steps, given that many people have never trained any audio model before.
It's all a bit overwhelming.

Here is a new utils.py for XPhoneBERT that behaves like the previous PL-BERT utils.py:

import torch.nn as nn
from transformers import AutoConfig, AutoModel


class CustomXPhoneBERT(nn.Module):
    # Thin wrapper that mimics the PL-BERT interface: forward returns only the last hidden state.
    # (Subclassing AutoModelForMaskedLM does not work: from_pretrained dispatches to the concrete
    # architecture class, so a custom forward would never be called, and a MaskedLM output has no
    # last_hidden_state anyway.)
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, *args, **kwargs):
        # Call the underlying encoder
        outputs = self.model(*args, **kwargs)

        # Only return the last_hidden_state
        return outputs.last_hidden_state


def load_xbert(model_name_or_path):
    # Load the configuration for 'xphonebert-base'
    config = AutoConfig.from_pretrained(model_name_or_path)

    # Load the base encoder (AutoModel, not AutoModelForMaskedLM, exposes last_hidden_state)
    base_model = AutoModel.from_pretrained(model_name_or_path, config=config)

    # Return the wrapped model
    return CustomXPhoneBERT(base_model)
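
If I read the training scripts right, this would then slot in where load_plbert is loaded today. A rough sketch of what I mean (the import path and config key follow my local setup, so treat them as assumptions):

# train_second.py currently does roughly:
#   from Utils.PLBERT.util import load_plbert
#   plbert = load_plbert(config['PLBERT_dir'])
# with the wrapper above that would become (hypothetical module path for the file above):
from Utils.XPhoneBERT.util import load_xbert

plbert = load_xbert("vinai/xphonebert-base")
# plbert is then handed to build_model(...) exactly as before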
The inference won't be as compatible, I guess. This is the current inference code, which relies on the English-only PL-BERT:
    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                         embedding=bert_dur,
                         embedding_scale=embedding_scale,
                         features=ref_s,  # reference from the same speaker as the embedding
                         num_steps=diffusion_steps).squeeze(1)


yl4579 commented on July 30, 2024

Unfortunately, this is not a straightforward replacement because the phonemizers used by PL-BERT and XPhoneBERT are quite different. You will have to re-train the text aligner (ASR) with the XPhoneBERT phonemizer and also prepare your data in that format; then you can replace PL-BERT with XPhoneBERT.
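
For anyone attempting this: XPhoneBERT builds its inputs with the separate text2phonemesequence package and its own subword tokenizer rather than the espeak pipeline StyleTTS2 uses, roughly as shown on the model card (treat the exact arguments as a sketch):

import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# expects word-segmented input text and produces space-separated phonemes
text2phone_model = Text2PhonemeSequence(language='eng-us', is_cuda=True)

sentence = "That is , it is a testing text ."
input_phonemes = text2phone_model.infer_sentence(sentence)

input_ids = tokenizer(input_phonemes, return_tensors="pt")
with torch.no_grad():
    features = xphonebert(**input_ids)  # features.last_hidden_state would replace bert_dur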


yl4579 commented on July 30, 2024

The model is publicly available here: https://huggingface.co/vinai/xphonebert-base

