
Comments (3)

cmp-nct commented on July 30, 2024

The README made it sound like a drop-in replacement ;-)

@yl4579
It would be nice to get a few more steps, given that many people have never trained any audio model before.
It's all a bit overwhelming.

Here is a new utils.py for XPhoneBERT that behaves like the previous PL-BERT utils.py:

import torch.nn as nn
from transformers import AutoConfig, AutoModel


class CustomXPhoneBERT(nn.Module):
    # Thin wrapper that mimics the PL-BERT interface: forward returns only the last hidden state.
    # (Subclassing AutoModelForMaskedLM does not work: from_pretrained dispatches to the concrete
    # architecture class, so a custom forward would never be called, and a MaskedLM output has no
    # last_hidden_state anyway.)
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, *args, **kwargs):
        # Call the underlying encoder
        outputs = self.model(*args, **kwargs)

        # Only return the last_hidden_state
        return outputs.last_hidden_state


def load_xbert(model_name_or_path):
    # Load the configuration for 'xphonebert-base'
    config = AutoConfig.from_pretrained(model_name_or_path)

    # Load the base encoder (AutoModel, not AutoModelForMaskedLM, exposes last_hidden_state)
    base_model = AutoModel.from_pretrained(model_name_or_path, config=config)

    # Return the wrapped model
    return CustomXPhoneBERT(base_model)
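
If I read the training scripts right, this would then slot in where load_plbert is loaded today. A rough sketch of what I mean (the import path and config key follow my local setup, so treat them as assumptions):

# train_second.py currently does roughly:
#   from Utils.PLBERT.util import load_plbert
#   plbert = load_plbert(config['PLBERT_dir'])
# with the wrapper above that would become (hypothetical module path for the file above):
from Utils.XPhoneBERT.util import load_xbert

plbert = load_xbert("vinai/xphonebert-base")
# plbert is then handed to build_model(...) exactly as before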
The inference won't be as compatible, I guess. This is the current inference code, which relies on the English-only PL-BERT:
    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                         embedding=bert_dur,
                         embedding_scale=embedding_scale,
                         features=ref_s,  # reference from the same speaker as the embedding
                         num_steps=diffusion_steps).squeeze(1)


yl4579 commented on July 30, 2024

Unfortunately, this is not a straightforward replacement because the phonemizers used by PL-BERT and XPhoneBERT are quite different. You will have to re-train the text aligner (ASR) with the XPhoneBERT phonemizer and also prepare your data in that format; then you can replace PL-BERT with XPhoneBERT.
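
For anyone attempting this: XPhoneBERT builds its inputs with the separate text2phonemesequence package and its own subword tokenizer rather than the espeak pipeline StyleTTS2 uses, roughly as shown on the model card (treat the exact arguments as a sketch):

import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# expects word-segmented input text and produces space-separated phonemes
text2phone_model = Text2PhonemeSequence(language='eng-us', is_cuda=True)

sentence = "That is , it is a testing text ."
input_phonemes = text2phone_model.infer_sentence(sentence)

input_ids = tokenizer(input_phonemes, return_tensors="pt")
with torch.no_grad():
    features = xphonebert(**input_ids)  # features.last_hidden_state would replace bert_dur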


yl4579 commented on July 30, 2024

The model is publicly available here: https://huggingface.co/vinai/xphonebert-base

