
tacotron_pytorch's Introduction

tacotron_pytorch


PyTorch implementation of Tacotron speech synthesis model.

Inspired by keithito/tacotron. The speech quality is not yet as good as what keithito/tacotron can generate, but the implementation seems to be basically working. You can find some generated speech examples trained on the LJ Speech Dataset here.

If you are comfortable working with TensorFlow, I'd recommend trying https://github.com/keithito/tacotron instead. The reason for rewriting it in PyTorch is that, at least for me, it's easier to debug and extend (multi-speaker architecture, etc.).

Requirements

  • PyTorch
  • TensorFlow (needed to run the training script; this could be made optional, but is required for now)

Installation

git clone --recursive https://github.com/r9y9/tacotron_pytorch
pip install -e . # or python setup.py develop

If you want to run the training script, then you need to install additional dependencies.

pip install -e ".[train]"

Training

The package relies on keithito/tacotron (added as a git submodule) for text processing, audio preprocessing, and audio reconstruction. Please follow the quick start section at https://github.com/keithito/tacotron and prepare your dataset accordingly.

Once your data is prepared, and assuming it lives in "~/tacotron/training" (the default), you can train the model with:

python train.py

The alignment, predicted spectrogram, target spectrogram, predicted waveform, and checkpoint (model and optimizer states) are saved every 1000 global steps in the checkpoints directory. Training progress can be monitored with:

tensorboard --logdir=log

Testing model

Open the notebook in the notebooks directory and change checkpoint_path to point to your trained model.
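
If you prefer a script over the notebook, the sketch below shows roughly what loading a checkpoint looks like. It is only a sketch: the constructor arguments and the checkpoint key names are assumptions based on the default hyperparameters, so check the notebook for the exact values.

import torch
from tacotron_pytorch import Tacotron  # assumes the model class is exported at package level

checkpoint_path = "checkpoints/checkpoint_step100000.pth"  # hypothetical path; use your own checkpoint

# The arguments below mirror the usual LJ Speech defaults; adjust them to match train.py.
model = Tacotron(n_vocab=149, embedding_dim=256, mel_dim=80, linear_dim=1025, r=5)
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])  # key name is an assumption
model.eval()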


tacotron_pytorch's Issues

about validation

First, thanks for your great work! I have some questions though: why is there no validation step in the training code? Can I add one along the lines of the testing code? In the testing code the input batch size is 1, and I'm not sure whether I can run inference on a whole batch.
Also, when I train the model on the CMU clb dataset, the alignment looks good on the training data, but when I use the checkpoint for inference the results are quite bad. Could the model be overfitting to the small dataset (about one hour)?
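
For reference, a validation pass could be written much like the training loop with gradients disabled. The sketch below is generic and assumes a held-out data loader and the same L1 criterion as train.py; the loader fields and the model call signature are hypothetical, so adapt them to the actual code.

import torch

def validate(model, val_loader, criterion):
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():  # on the old Variable API, use volatile=True inputs instead
        for texts, input_lengths, mel_targets, linear_targets in val_loader:  # hypothetical fields
            mel_outputs, linear_outputs, _ = model(texts, mel_targets, input_lengths=input_lengths)
            loss = criterion(mel_outputs, mel_targets) + criterion(linear_outputs, linear_targets)
            total_loss += loss.item()
            n_batches += 1
    model.train()
    return total_loss / max(n_batches, 1)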

Implementation of Bahdanau attention is possibly different from the paper?

Hi @r9y9.
On the forward pass of the attention RNN, I think you're computing the attention RNN state using the previous attention context instead of the current one.
Page 3 on https://arxiv.org/pdf/1409.0473.pdf

Shouldn't the order of operations be this:

def forward(self, query, attention, cell_state, memory, processed_memory=None, mask=None, memory_lengths=None):
    
    if processed_memory is None:
        processed_memory = memory
    if memory_lengths is not None and mask is None:
        mask = get_mask_from_lengths(memory, memory_lengths)

    # Compute Alignment (batch, max_time)
    # e_{ij} = a(s_{i-1}, h_j)
    alignment = self.attention_mechanism(cell_state, processed_memory)

    if mask is not None:
        mask = mask.view(query.size(0), -1)
        alignment.data.masked_fill_(mask, self.score_mask_value)

    # Normalize attention weight
    # \alpha_{ij} = softmax(e_{ij})
    alignment = F.softmax(alignment)

    # Attention context vector
    # c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
    attention = torch.bmm(alignment.unsqueeze(1), memory)
    attention = attention.squeeze(1)

    # Concat y_{i-1} and c_{i}
    cell_input = torch.cat((query, attention), -1)

    # Feed it to RNN
    # s_i = f(y_{i-1}, c_{i}, s_{i-1})
    cell_output = self.rnn_cell(cell_input, cell_state)

    return cell_output, attention, alignment

instead of what we have in this repo right now:

def forward(self, query, attention, cell_state, memory, processed_memory=None, mask=None, memory_lengths=None):

    if processed_memory is None:
        processed_memory = memory
    if memory_lengths is not None and mask is None:
        mask = get_mask_from_lengths(memory, memory_lengths)

    # Concat y_{i-1} and c_{i}
    cell_input = torch.cat((query, attention), -1)

    # Feed it to RNN
    # s_i = f(y_{i-1}, c_{i-1}, s_{i-1}) should be f(y_{i-1}, c_{i}, s_{i-1}) according to the paper.
    cell_output = self.rnn_cell(cell_input, cell_state)

    # Compute Alignment (batch, max_time)
    # e_{ij} = a(s_{i-1}, h_j)
    alignment = self.attention_mechanism(cell_output, processed_memory)

    if mask is not None:
        mask = mask.view(query.size(0), -1)
        alignment.data.masked_fill_(mask, self.score_mask_value)

    # Normalize attention weight
    # \alpha_{ij} = softmax(e_{ij})
    alignment = F.softmax(alignment)

    # Attention context vector
    # c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
    attention = torch.bmm(alignment.unsqueeze(1), memory)

    # (batch, dim)
    attention = attention.squeeze(1)

    return cell_output, attention, alignment

Why sort within batch

Hi, thanks for implementing this in pytorch.

I was wondering why you sort the data points within a batch by sequence length during training.

Would not doing this sorting break some assumptions or some other parts of the code?

I am using part of your model in another setup, so it would be great to know whether this is important.

Best,
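
One likely reason (my reading, not a confirmed answer from the author): the encoder uses nn.utils.rnn.pack_padded_sequence, which on older PyTorch versions required the lengths to be sorted in decreasing order, so the batch is sorted to satisfy that requirement. A tiny illustration:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

x = torch.zeros(3, 10, 256)   # (batch, max_time, dim), zero-padded
lengths = [10, 7, 4]          # must be decreasing on older PyTorch versions

packed = pack_padded_sequence(x, lengths, batch_first=True)
# On PyTorch >= 1.1 you can pass enforce_sorted=False and keep the batch unsorted.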

Alignment mask

I was wondering whether the alignment mask actually gets applied in attention.py with

alignment.data.masked_fill_(mask, self.score_mask_value)

as opposed to

alignment = Variable(alignment.data.masked_fill_(mask, self.score_mask_value))
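
For context (my understanding, not an authoritative answer): masked_fill_ is an in-place operation, so the first form already overwrites the padded positions of the tensor that alignment wraps and no reassignment is needed; filling through .data simply bypasses autograd tracking. A small sketch:

import torch

alignment = torch.randn(2, 5)
mask = torch.tensor([[0, 0, 0, 1, 1],
                     [0, 0, 1, 1, 1]], dtype=torch.bool)

alignment.masked_fill_(mask, -float("inf"))  # in-place: padded scores become -inf before the softmax
print(alignment)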

Why does is_end_of_frames detect the end frame at test time?

Thanks for your code.
I have a question about tacotron_pytorch/tacotron.py line 274: why is output.data <= 0.2 treated as the end of frames at test time? If I use this function, decoding stops after only 2 steps.

def is_end_of_frames(output, eps=0.2):
    return (output.data <= eps).all()

retraining from the checkpoint on GPU fails

Hey,

Thanks for the implementation. I noticed that trying to retrain from a checkpoint fails during optimizer.step().

This seems to happen because the optimizer is initialized from the model parameters, and .cuda() is called on the model only afterwards. Calling .cuda() before constructing the optimizer (only needed when retraining) seems to fix it.
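
A minimal sketch of that workaround, with hypothetical names for the checkpoint keys and the model-building helper (the exact names in train.py may differ):

import torch
from torch import optim

model = build_model()                        # hypothetical helper that constructs the Tacotron model
checkpoint = torch.load("checkpoints/checkpoint_step100000.pth")  # hypothetical path

if torch.cuda.is_available():
    model = model.cuda()                     # move parameters to the GPU *before* creating the optimizer

optimizer = optim.Adam(model.parameters(), lr=0.002)  # learning rate is an example value
model.load_state_dict(checkpoint["state_dict"])       # key names are assumptions
optimizer.load_state_dict(checkpoint["optimizer"])    # optimizer state now matches the GPU parameters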

Masked loss function

Hello,

Shouldn't the L1 loss be applied only to the real frames and not to the padding? That is, the GRU in the CBHG is handled correctly with pack_padded_sequence, and the attention is masked, but in the end I think the L1 loss is computed over the whole padded utterance.

Please tell me if I am missing something, because I am in the middle of debugging a similar problem!
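
A sketch of how padding could be excluded from the loss, assuming per-utterance frame counts are available; this is not the repo's current implementation, just one way to do it:

import torch

def masked_l1_loss(output, target, lengths):
    # output, target: (batch, max_time, dim); lengths: valid frame count per utterance
    mask = output.new_zeros(output.size(0), output.size(1), 1)
    for i, l in enumerate(lengths):
        mask[i, :l, :] = 1.0
    mask = mask.expand_as(output)
    # Average the absolute error over real frames only
    return (torch.abs(output - target) * mask).sum() / mask.sum()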

About BatchNorm

self.bn = nn.BatchNorm1d(out_dim, momentum=0.99, eps=1e-3)

The implementation comment says "following tensorflow's default parameters". However, the momentum convention in PyTorch is the opposite of TensorFlow's:

# PyTorch:    running = (1 - momentum) * running + momentum * batch_stat   # default momentum = 0.1
# TensorFlow: running = decay * running + (1 - decay) * batch_stat         # default decay = 0.99

What's your take on that? And why use such a large eps (1e-3) for audio? PyTorch's default eps is 1e-5.
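
If the intent is to mirror TensorFlow's defaults (decay = 0.99, eps = 1e-3), the matching PyTorch setting would presumably be momentum = 1 - 0.99 = 0.01; this is my reading of the two conventions, not a confirmed fix:

import torch.nn as nn

out_dim = 128  # example channel count
# TF decay 0.99 corresponds to PyTorch momentum 0.01, since the two conventions are complementary.
bn = nn.BatchNorm1d(out_dim, momentum=0.01, eps=1e-3)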

ModuleNotFoundError: No module named 'text'

Hi

thanks for this repo. I am very new to this. I have cloned the repo and run setup.py.

However I get the following error

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input> in <module>()
      5 import sys
      6 sys.path.insert(0, "../lib/tacotron")
----> 7 from text import text_to_sequence, symbols
      8 from util import audio

ModuleNotFoundError: No module named 'text'

Could you please advise me what to do next?

Thanks

I am trying to use the model checkpoint from keithito/tacotron
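
A likely cause (an assumption; the thread has no confirmed answer): text and util come from the keithito/tacotron submodule, so they are only importable if the submodule was actually fetched (git clone --recursive, or git submodule update --init --recursive afterwards) and lib/tacotron is on sys.path, e.g.:

import sys
sys.path.insert(0, "../lib/tacotron")  # relative to the notebooks directory; adjust if running elsewhere
from text import text_to_sequence, symbols
from util import audio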

Trained Model

Could you please provide the weights of your trained model?
They would be helpful for fine-tuning.

BahdanauMonoAttention cannot work well

I am following the monotonic attention described here: https://arxiv.org/pdf/1704.00784.pdf.

In TensorFlow it works well (source code here: https://github.com/tensorflow/tensorflow/blob/r1.4/tensorflow/contrib/seq2seq/python/ops/attention_wrapper.py).

But in PyTorch it does not work. Here is my source code; could you take a look, please?

def safe_cumprod(x, exclusive=False, max_value=1):
    """
    exclusive=True: cumprod(x) = [1, x1, x1*x2, x1*x2*x3, ...]
    exclusive=False: cumprod(x) = [x1, x1*x2, x1*x2*x3, ...]
    Args:
        x (torch.Tensor): shape of [batch, input_dim]
        exclusive (bool): if True, compute the exclusive cumulative product
        max_value (float): upper bound used to clip x before the product

    Returns:
        torch.Tensor: cumulative product, same shape as x
    """
    tiny = float(np.finfo(np.float32).tiny)
    clip_x = torch.clamp(x, tiny, max_value)
    cumprod_x = torch.exp(torch.cumsum(torch.log(clip_x), dim=1))
    if exclusive is True:
        return F.pad(cumprod_x, (1, 0, 0, 0), value=1)[:, :-1]
    else:
        return cumprod_x


class BahdanauAttention(nn.Module):
    def __init__(self, dim):
        super(BahdanauAttention, self).__init__()
        self.query_layer = nn.Linear(dim, dim, bias=False)
        self.tanh = nn.Tanh()
        self.v = Parameter(torch.Tensor(1, dim))
        self.reset_parameters()

    def reset_parameters(self):
        fan_in, fan_out = self.v.size()
        scale = 1 / max(1., (fan_in + fan_out) / 2.)
        limit = math.sqrt(3.0 * scale)
        self.v.data.uniform_(-limit, limit)

    def _alignment_probability(self, score, previous_alignment=None):
        return F.softmax(score, dim=1)

    def forward(self, query, processed_memory):
        """
        Args:
            query: (batch, 1, dim) or (batch, dim)
            processed_memory: (batch, max_time, dim)
        """
        if query.dim() == 2:
            # insert time-axis for broadcasting
            query = query.unsqueeze(1)
        # (batch, 1, dim)
        processed_query = self.query_layer(query)

        # (batch, max_time, 1)
        alignment = F.linear(self.tanh(processed_query + processed_memory), self.v)

        # (batch, max_time)
        return alignment.squeeze(-1)


class BahdanauMonoAttention(BahdanauAttention):
    """BahdanauMonoAttention
    """
    def __init__(self, dim):
        super(BahdanauMonoAttention, self).__init__(dim)
        self.score_bias = Parameter(torch.Tensor(1))
        self.reset_parameters()

    def reset_parameters(self):
        self.score_bias.data.zero_()

    def forward(self, query, processed_memory):
        return super(BahdanauMonoAttention, self).forward(query, processed_memory) + self.score_bias

    def _alignment_probability(self, score, previous_alignment=None):
        """
        _mono_score, https://arxiv.org/pdf/1704.00784.pdf
        Args:
            score (): shape of [batch, encoder_length]
            previous_alignment (): shape of [batch, encoder_length]

        Returns:

        """
       #score += Variable(torch.FloatTensor(np.random.randn(*score.shape) * 2).cuda())
        p_choose_i = F.sigmoid(score)
        cumprod_1mp_choose_i = safe_cumprod(1 - p_choose_i, exclusive=True, max_value=1)
        attention = p_choose_i * cumprod_1mp_choose_i * torch.cumsum(
            previous_alignment / torch.clamp(cumprod_1mp_choose_i, 1e-10, 1.), dim=1)
        return attention



def get_mask_from_lengths(memory, memory_lengths):
    """Get mask tensor from list of length

    Args:
        memory: (batch, max_time, dim)
        memory_lengths: array like
    """
    mask = memory.data.new(memory.size(0), memory.size(1)).byte().zero_()
    for idx, l in enumerate(memory_lengths):
        mask[idx][:l] = 1
    return ~mask


class AttentionWrapper(nn.Module):
    def __init__(self, rnn_cell, attention_mechanism,
                 score_mask_value=-float("inf")):
        super(AttentionWrapper, self).__init__()
        self.rnn_cell = rnn_cell
        self.attention_mechanism = attention_mechanism
        self.score_mask_value = score_mask_value

    def forward(self, query, attention, cell_state, memory, previous_alignment=None,
                processed_memory=None, mask=None, memory_lengths=None):
        if processed_memory is None:
            processed_memory = memory
        if memory_lengths is not None and mask is None:
            mask = get_mask_from_lengths(memory, memory_lengths)

        # Concat input query and previous attention context
        cell_input = torch.cat((query, attention), -1)

        # Feed it to RNN
        cell_output = self.rnn_cell(cell_input, cell_state)

        # Alignment
        # (batch, max_time)
        alignment = self.attention_mechanism(cell_output, processed_memory)

        if mask is not None:
            mask = mask.view(query.size(0), -1)
            alignment.data.masked_fill_(mask, self.score_mask_value)

        # Normalize attention weight
        # alignment = F.softmax(alignment, dim=-1)
        alignment = self.attention_mechanism._alignment_probability(alignment, previous_alignment)

        # Attention context vector
        # (batch, 1, dim)
        attention = torch.bmm(alignment.unsqueeze(1), memory)

        # (batch, dim)
        attention = attention.squeeze(1)

        return cell_output, attention, alignment


class Decoder(nn.Module):
    def __init__(self, in_dim, r, use_mono=True):
        super(Decoder, self).__init__()
        self.in_dim = in_dim
        self.r = r
        self.prenet = Prenet(in_dim, sizes=[256, 128])
        # (prenet_out + attention context) -> output
        if use_mono is True:
            attention_mechanism = BahdanauMonoAttention(256)
        else:
            attention_mechanism = BahdanauAttention(256)
        self.attention_rnn = AttentionWrapper(
            nn.GRUCell(256 + 128, 256),
            attention_mechanism
        )
        self.memory_layer = nn.Linear(256, 256, bias=False)
        self.project_to_decoder_in = nn.Linear(512, 256)

        self.decoder_rnns = nn.ModuleList(
            [nn.GRUCell(256, 256) for _ in range(2)])

        self.proj_to_mel = nn.Linear(256, in_dim * r)
        self.max_decoder_steps = 200

    def forward(self, encoder_outputs, inputs=None, memory_lengths=None):
        """
        Decoder forward step.

        If decoder inputs are not given (e.g., at testing time), as noted in the
        Tacotron paper, greedy decoding is adopted.

        Args:
            encoder_outputs: Encoder outputs. (B, T_encoder, dim)
            inputs: Decoder inputs. i.e., mel-spectrogram. If None (at eval-time),
              decoder outputs are used as decoder inputs.
            memory_lengths: Encoder output (memory) lengths. If not None, used for
              attention masking.
        """
        B = encoder_outputs.size(0)
        T_encoder = encoder_outputs.size(1)

        processed_memory = self.memory_layer(encoder_outputs)
        if memory_lengths is not None:
            mask = get_mask_from_lengths(processed_memory, memory_lengths)
        else:
            mask = None

        # Run greedy decoding if inputs is None
        greedy = inputs is None

        if inputs is not None:
            # Grouping multiple frames if necessary
            if inputs.size(-1) == self.in_dim:
                inputs = inputs.view(B, inputs.size(1) // self.r, -1)
            assert inputs.size(-1) == self.in_dim * self.r
            T_decoder = inputs.size(1)

        # go frames
        initial_input = Variable(
            encoder_outputs.data.new(B, self.in_dim).zero_())

        # Init decoder states
        attention_rnn_hidden = Variable(
            encoder_outputs.data.new(B, 256).zero_())
        decoder_rnn_hiddens = [Variable(
            encoder_outputs.data.new(B, 256).zero_())
            for _ in range(len(self.decoder_rnns))]
        current_attention = Variable(
            encoder_outputs.data.new(B, 256).zero_())

        # Time first (T_decoder, B, in_dim)
        if inputs is not None:
            inputs = inputs.transpose(0, 1)

        outputs = []
        alignments = []

        t = 0
        current_input = initial_input
        previous_alignment = Variable(
            encoder_outputs.data.new(B, T_encoder).zero_())
        previous_alignment[:, 0] = 1.0
        while True:
            if t > 0:
                current_input = outputs[-1] if greedy else inputs[t - 1]
                current_input = current_input[:, -self.in_dim:]
            # Prenet
            current_input = self.prenet(current_input)

            # Attention RNN
            attention_rnn_hidden, current_attention, alignment = self.attention_rnn(
                current_input, current_attention, attention_rnn_hidden,
                encoder_outputs, previous_alignment=previous_alignment,
                processed_memory=processed_memory, mask=mask)
            previous_alignment = alignment

            # Concat RNN output and attention context vector
            decoder_input = self.project_to_decoder_in(
                torch.cat((attention_rnn_hidden, current_attention), -1))

            # Pass through the decoder RNNs
            for idx in range(len(self.decoder_rnns)):
                decoder_rnn_hiddens[idx] = self.decoder_rnns[idx](
                    decoder_input, decoder_rnn_hiddens[idx])
                # Residual connection
                decoder_input = decoder_rnn_hiddens[idx] + decoder_input

            output = decoder_input
            output = self.proj_to_mel(output)

            outputs += [output]
            alignments += [alignment]

            t += 1

            if greedy:
                if t > 1 and is_end_of_frames(output):
                    break
                elif t > self.max_decoder_steps:
                    print("Warning! doesn't seems to be converged")
                    break
            else:
                if t >= T_decoder:
                    break

        assert greedy or len(outputs) == T_decoder

        # Back to batch first
        alignments = torch.stack(alignments).transpose(0, 1)
        outputs = torch.stack(outputs).transpose(0, 1).contiguous()

        return outputs, alignments

@r9y9

transpose before conv1d_banks?

Thanks for your code.
I am confused about the transpose before conv1d_banks in master/tacotron_pytorch/tacotron.py line 101.
Why is a transpose needed here? I cannot find a corresponding transpose in keithito's TensorFlow version.
Also, the paper "TACOTRON: A FULLY END-TO-END TEXT-TO-SPEECH SYNTHESIS MODEL" does not describe the Conv1D bank clearly.
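
For context (my understanding, not a confirmed answer from the author): PyTorch's nn.Conv1d expects inputs as (batch, channels, time), whereas the rest of the model keeps tensors as (batch, time, channels), so the tensor has to be transposed before the convolution bank and back afterwards. TensorFlow's conv1d takes channels-last input, which is why no transpose appears in keithito's code. A tiny illustration:

import torch
import torch.nn as nn

x = torch.randn(2, 100, 128)      # (batch, time, channels), as produced by the prenet
conv = nn.Conv1d(128, 128, kernel_size=3, padding=1)

y = conv(x.transpose(1, 2))       # Conv1d wants (batch, channels, time)
y = y.transpose(1, 2)             # back to (batch, time, channels) for the rest of the network
print(y.shape)                    # torch.Size([2, 100, 128])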
