WaveRNN is a technique for synthesising audio with a neural network much faster than earlier neural synthesisers, achieved through a few simple architectural improvements.
This implementation is based on TensorFlow and requires librosa for manipulating raw audio.
The following points provide an intuition about the model and how it has been implemented [any input on this is most welcome]:
- Input data is read as float32 and normalised to unsigned 16-bit integers.
- Each 16-bit sample is split into two 8-bit components (coarse and fine) using divmod (with floor division).
- Two different RNNs are created with a cell size of 896, as mentioned in the paper, stacked in 2 layers (the number of layers still has to be tested).
- The first RNN, which synthesises `coarse_data`, does not need `c(t)`, so its input has shape `[batch_size, sequence_length, 2]`, where the batch size is currently 1, the sequence length is 200, and the 2 stands for `[c(t-1), f(t-1)]`.
- The second RNN, which synthesises `fine_data`, needs `c(t)`, so it currently depends on `coarse_data` for generation (this should improve after subscaling). Its input has shape `[batch_size, sequence_length, 3]`, where the extra dimension stores `c(t)`.
- The outputs of these RNNs are passed through a dense linear transformation of the same length, followed by `relu` to remove negative entries.
- For the cost function, `softmax_cross_entropy_with_logits_v2` is currently used, but it will most probably be switched to the sparse variant.
- Adam is used to optimise the whole network; this initial commit does not include intensive hyper-parameter testing, so the default learning rate is used.
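As a concrete illustration of the data-preparation steps above, here is a minimal NumPy sketch of the 16-bit coarse/fine split and its inverse. The helper names and the exact float-to-integer scaling are my assumptions, not code from this repository:

```python
import numpy as np

def float_to_u16(x):
    # Assumed scaling: map float32 audio in [-1.0, 1.0] onto the
    # unsigned 16-bit range [0, 65535].
    return np.clip(np.round((x + 1.0) * 32767.5), 0, 65535).astype(np.int64)

def split_sample(u16):
    # divmod with divisor 256 yields the coarse (high) byte and the
    # fine (low) byte of each 16-bit sample.
    coarse, fine = np.divmod(u16, 256)
    return coarse, fine

def join_sample(coarse, fine):
    # Exact inverse of the split.
    return coarse * 256 + fine

samples = np.array([0, 255, 256, 65535])
c, f = split_sample(samples)
assert np.array_equal(c, [0, 0, 1, 255])
assert np.array_equal(f, [0, 255, 0, 255])
assert np.array_equal(join_sample(c, f), samples)
```

Because the split is exact integer arithmetic, reconstruction is lossless, which is what lets the model predict the two bytes separately without losing audio fidelity.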
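The two input shapes described above can be sketched as follows. `make_inputs` is a hypothetical helper, and the exact temporal alignment of `c(t)` against the previous-timestep pair `[c(t-1), f(t-1)]` is my reading of the intended layout, not verified against this repository:

```python
import numpy as np

def make_inputs(coarse, fine, seq_len=200):
    # Coarse RNN input: [c(t-1), f(t-1)] pairs -> shape (1, seq_len, 2).
    c_prev = coarse[:seq_len]
    f_prev = fine[:seq_len]
    # Fine RNN input additionally carries the current coarse value c(t)
    # -> shape (1, seq_len, 3).
    c_curr = coarse[1:seq_len + 1]
    coarse_in = np.stack([c_prev, f_prev], axis=-1)[None, ...]
    fine_in = np.stack([c_prev, f_prev, c_curr], axis=-1)[None, ...]
    return coarse_in, fine_in

coarse = np.arange(201)
fine = np.arange(201)
coarse_in, fine_in = make_inputs(coarse, fine)
assert coarse_in.shape == (1, 200, 2)
assert fine_in.shape == (1, 200, 3)
```

The dependence of `fine_in` on `c_curr` is exactly the sequential bottleneck the README mentions: at generation time `c(t)` must be sampled before the fine RNN can run, which subscaling is meant to relax.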
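On the planned switch of loss functions: the sparse variant takes integer class indices (0-255 here) instead of one-hot vectors, avoiding the materialisation of a 256-wide label tensor, and both compute the same value when the dense labels are one-hot. A NumPy sketch of that equivalence (not the repository's TensorFlow code):

```python
import numpy as np

def softmax_xent_dense(logits, onehot):
    # Dense form: labels are full probability distributions,
    # as in softmax_cross_entropy_with_logits_v2.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_softmax = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(onehot * log_softmax).sum(axis=-1)

def softmax_xent_sparse(logits, labels):
    # Sparse form: labels are integer class indices, so only one
    # log-probability per sample is gathered.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_softmax = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_softmax[np.arange(len(labels)), labels]

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 256))
labels = np.array([0, 17, 128, 255])
onehot = np.eye(256)[labels]
assert np.allclose(softmax_xent_dense(logits, onehot),
                   softmax_xent_sparse(logits, labels))
```

With 256 classes per byte and one target per timestep, the sparse form saves both the one-hot construction and a 256-way multiply per sample, which is why it is the natural choice here.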
- Basic implementation, improving the network's algorithmic efficiency
- Support for faster future prediction (ongoing)
- Transfer from notebook-based development to OOP, for better management (WIP)
- Sparse pruning and sub-batched sampling
- Vocoder implementation
Please go through the issues; there are many conceptual doubts there, and I would love to hear opinions on them.
This repository only provides an implementation of the WaveRNN model as mentioned here.