nv-wavenet

Introduction

nv-wavenet is a CUDA reference implementation of autoregressive WaveNet inference. In particular, it implements the WaveNet variant described by Deep Voice. nv-wavenet only implements the autoregressive portion of the network; conditioning vectors must be provided externally. More details about the implementation and performance can be found on the NVIDIA Developer Blog.

Channel counts are provided as template parameters. The following channel count combinations have been tested and are expected to function correctly:

  • 32 residual channels, 128 skip channels, 256 audio channels
  • 64 residual channels, 128 skip channels, 256 audio channels
  • 64 residual channels, 256 skip channels, 256 audio channels
  • 128 residual channels, 256 skip channels, 256 audio channels

The implementation provides four variants, each with different complexity, sample rate, throughput, and resource characteristics:

  • Single-Block: implements the entire network in a single thread block. Each thread block must read all model weights per sample, and thus sample rate is limited by the rate at which a single Streaming Multiprocessor can read weights.
  • Dual-Block: implements the network across two collaborating thread blocks. As these blocks may now span multiple Streaming Multiprocessors, this implementation can support a larger model at a given sample rate.
  • Persistent: loads all weights into the register file, where they persist for the entire inference.
  • Manyblock: uses the same distribution of the model across blocks as Persistent, but reloads weights for each sample so that they do not need to persist in the register file. Useful for models that are too large for the Persistent approach.

In all four variants, a single kernel runs inference for potentially many samples.

Usage

nv_wavenet.cuh provides a templated class nvWavenetInfer. The template parameters are:

  • T_weight : should be float for fp32 inference, half2 for fp16 inference
  • T_data : should be float for fp32 inference, half for fp16 inference
  • R : the number of residual channels
  • S : the number of skip channels
  • A : the number of audio channels

The nvWavenetInfer constructor accepts the following arguments:

  • numLayers : the number of residual layers in the WaveNet
  • maxDilation : the maximum dilation amount. The dilated convolution of each residual layer has dilation equal to twice that of the prior layer, until this maximum value is reached; the next layer then resets its dilation to 1 (see the sketch after this list).
  • batchSize : the inference batch size (the number of utterances to generate in parallel)
  • sampleCount : the number of audio samples to generate
  • implementation : the implementation variant to use, as defined by the nvWavenetInfer::Implementation enum. Options are SINGLE_BLOCK, DUAL_BLOCK and PERSISTENT
  • tanhEmbed : specifies whether the result of the input embedding should pass through a tanh
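
To make the dilation reset rule concrete, here is a small illustrative sketch of the schedule it implies (values hypothetical):

    int numLayers = 8, maxDilation = 4;
    int dilation = 1;
    for (int l = 0; l < numLayers; l++) {
        // layer l uses this dilation; this example yields 1, 2, 4, 1, 2, 4, 1, 2
        dilation = (dilation >= maxDilation) ? 1 : dilation * 2;
    }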

Once the nvWavenetInfer object is constructed, it is necessary to upload the model weights. Weight matrices are provided as float* arrays in column-major order. In the fp16 case, data conversion and vectorization are handled automatically by the weight upload functions. The provided pointers can be on the host or on the device; in either case, the data will be copied to a buffer belonging to the nvWavenetInfer object.

nvWavenetInfer::setEmbeddings() uploads the embedding table for the causal input. nvWavenetInfer::setLayerWeights() uploads all necessary weights for a single residual layer. nvWavenetInfer::setOutWeights() uploads all weights for the final output layers prior to the softmax.

The nvWavenetInfer::setInputs() method allows the user to upload conditioning vectors and random values for use by the random sampling after the softmax. While setInputs does accept device pointers, it will still copy/convert the data into the nvWavenetInfer object's allocation. For efficient deployment where the conditioning vectors / random values are already present in GPU memory, this method should be modified to simply update the necessary pointers.
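
Putting the pieces together, here is a hedged sketch of the overall flow. The argument order follows the lists above and the setter calls are abbreviated; the exact signatures live in nv_wavenet.cuh, and all values are illustrative:

    #include "nv_wavenet.cuh"

    // fp16 inference with R=64 residual, S=256 skip, A=256 audio channels
    typedef nvWavenetInfer<half2, half, 64, 256, 256> WaveNet;

    int numLayers   = 20;     // residual layers
    int maxDilation = 256;    // dilation doubles per layer, then resets to 1
    int batchSize   = 4;      // utterances generated in parallel
    int sampleCount = 16000;  // audio samples to generate

    WaveNet wavenet(numLayers, maxDilation, batchSize, sampleCount,
                    WaveNet::PERSISTENT, /*tanhEmbed=*/true);

    // wavenet.setEmbeddings(...);   // embedding table for the causal input
    // wavenet.setLayerWeights(...); // once per residual layer, column-major float*
    // wavenet.setOutWeights(...);   // final output layers feeding the softmax
    // wavenet.setInputs(...);       // conditioning vectors and post-softmax randoms
    // ...run inference to produce sampleCount samples per utterance...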

Testing

nv-wavenet includes a simple reference implementation in nv_wavenet_reference.h and nv_wavenet_reference.cpp. nv_wavenet_test.cu compares the reference implementation against the CUDA implementation for several configurations with random weights. To run:

make nv_wavenet_test
./nv_wavenet_test

Performance

nv_wavenet_perf.cu provides a simple performance test.

Before performance testing, it is recommended to fix the GPU clocks using nvidia-smi. To query the available clocks, run nvidia-smi -q -d SUPPORTED_CLOCKS. The clocks can then be set using nvidia-smi -ac <memory clock,graphics clock>.

To build and run the performance test, run:

make nv_wavenet_perf

./nv_wavenet_perf <-l num_layers> <-r residual_channels> <-s skip_channels> <-a audio_channels> <-b batch_size> <-c batch_size_per_block> <-n num_samples> <-d max_dilation> <-m mode> <-p precision>

Finding the best performance at a particular sample rate will require experimenting with different values of batch_size, batch_size_per_block and mode; batch_size must be a multiple of batch_size_per_block.
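
For example, a hypothetical invocation (all flag values here are illustrative, not a recommended configuration):

./nv_wavenet_perf -l 20 -r 64 -s 256 -a 256 -b 4 -c 2 -n 16384 -d 128 -m 3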

Open Source License

nv-wavenet is released by NVIDIA Corporation under the "New BSD" open-source license:

Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
   *  Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
   *  Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
   *  Neither the name of the NVIDIA CORPORATION nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

nv-wavenet's People

Contributors

brianpharris, grzegorz-k-karch, levsnv, nvchai, petrochukm, pfriesch

nv-wavenet's Issues

Error in distributed.py: Parameter object has no attribute "_execution_engine"

When running python distributed.py -c config.json with default settings on a 2 V100 GPU machine, I receive the following error:

python distributed.py -c config.json
['train.py', '--config=config.json', '--rank=0', '--group_name=group_2018_05_30-220302']
['train.py', '--config=config.json', '--rank=1', '--group_name=group_2018_05_30-220302']
Initializing Distributed
/home/ubuntu/nv-wavenet/pytorch/wavenet.py:46: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/ubuntu/nv-wavenet/pytorch/wavenet.py:46: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
output directory checkpoints
Epoch: 0
train.py:141: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  reduced_loss = reduce_tensor(loss.data, num_gpus)[0]
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 144, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/ubuntu/nv-wavenet/pytorch/distributed.py", line 133, in allreduce_hook
    param._execution_engine.queue_callback(allreduce_params)
AttributeError: 'Parameter' object has no attribute '_execution_engine'
train.py:141: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  reduced_loss = reduce_tensor(loss.data, num_gpus)[0]
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    train(num_gpus, args.rank, args.group_name, **train_config)
  File "train.py", line 144, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/ubuntu/nv-wavenet/pytorch/distributed.py", line 133, in allreduce_hook
    param._execution_engine.queue_callback(allreduce_params)
AttributeError: 'Parameter' object has no attribute '_execution_engine'

When poking around online, I found this issue which looks similar: NVIDIA/sentiment-discovery#16

The issue implies that the code expects a Variable class, but in this case gets a torch.cuda.FloatTensor.

Don't really know how to fix this, apologies.

Slow Generation

I want to implement traditional wavenet generation for this model. Does anyone know how I would go about that?

Train with Tacotron 2 (NVIDIA)

Hi,
I want to train nv-wavenet with Tacotron 2's mel_outputs_postnet, but the mel_outputs_postnet size does not match that of the mel spectrogram computed from the audio. How do I align mel_outputs_postnet with the audio?
Please help. Thanks!

hints on variable and function names

Hello,

Could you please give me a high-level explanation of the functions and variable names (or at least the important ones)? Some function and variable names are intuitive, but many are not. This, coupled with my lack of knowledge, makes it really hard for me to read and understand all the back-end code.

I would really really appreciate your help!

Thanks

global and local conditioning?

Hello,

What does cond_input.pt (the Google Drive file you provided) actually represent?

We understand that the dimensions are: 2xR, layers, batch_size, sample_len

How can we create our own inputs for conditioning?

Thanks,
Shreyas

Unable to clone tacotron2 submodule

Hi all, I'm failing to clone the repository with the tacotron2 submodule.
Is this a bug or am I doing something wrong?
Thanks

➜  ~ git clone --recursive https://github.com/NVIDIA/nv-wavenet
Cloning into 'nv-wavenet'...
remote: Counting objects: 167, done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 167 (delta 78), reused 111 (delta 64), pack-reused 36
Receiving objects: 100% (167/167), 6.67 MiB | 0 bytes/s, done.
Resolving deltas: 100% (95/95), done.
Checking connectivity... done.
Submodule 'pytorch/tacotron2' (git@github.com:NVIDIA/tacotron2) registered for path 'pytorch/tacotron2'
Cloning into 'pytorch/tacotron2'...
Warning: Permanently added the RSA host key for IP address '192.30.253.112' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:NVIDIA/tacotron2' into submodule path 'pytorch/tacotron2' failed

Should the skip connection bias be added only once?

Hi there!
Thanks for your awesome work and sharing it.

As per my understanding of the Deep Voice paper, in the layer inference part (section 5.1.2(d)) they only bias the skip connection once. Since they implemented the skip projection through accumulation, they set q^(0) to the bias and only added it once at the beginning.

However, at line 87 in nv_wavenet_reference.cpp, q^(j) is biased by Bskip at each layer. This seems like a small bug, or is there something I am missing?

best,
Xuan

The method 'namedBarrierSync'?

As a beginner in GPU programming, I am not very familiar with commonly used CUDA functions. I have checked the CUDA programming guide and did not find any documentation for the method 'namedBarrierSync'; could you please give me some hints about this method? Thanks.
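
For context: named barriers are a PTX-level primitive (bar.sync with an explicit barrier id and thread count) rather than a CUDA C++ API, which is why the programming guide does not document a function by that name. A helper like nv-wavenet's is typically a thin inline-PTX wrapper; a hedged sketch, not necessarily the repo's exact code:

    // Synchronize numThreads threads (a multiple of the warp size) on hardware
    // barrier 'name' (0-15), leaving the rest of the thread block unblocked --
    // unlike __syncthreads(), which always synchronizes the entire block.
    __device__ __forceinline__ void namedBarrierSync(int name, int numThreads) {
        asm volatile("bar.sync %0, %1;" : : "r"(name), "r"(numThreads) : "memory");
    }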

Errors when building

Does anyone know why I can't build the project?

ubuntu@ip-172-31-18-152:~/nv-wavenet$ cd pytorch; make
nvcc -arch=sm_37 -std=c++11 --use_fast_math -lineinfo -maxrregcount 128 -I .. wavenet_infer.cu ../matrix.cpp -lz -Xcompiler -fPIC -shared -o wavenet_infer.so
../nv_wavenet_util.cuh(89): error: more than one conversion function from "half" to a built-in type applies:
function "__half::operator float() const"
function "__half::operator short() const"
function "__half::operator unsigned short() const"
function "__half::operator int() const"
function "__half::operator unsigned int() const"
function "__half::operator long long() const"
function "__half::operator unsigned long long() const"
function "__half::operator __nv_bool() const"

../nv_wavenet_util.cuh(89): error: more than one conversion function from "half" to a built-in type applies:
function "__half::operator float() const"
function "__half::operator short() const"
function "__half::operator unsigned short() const"
function "__half::operator int() const"
function "__half::operator unsigned int() const"
function "__half::operator long long() const"
function "__half::operator unsigned long long() const"
function "__half::operator __nv_bool() const"

2 errors detected in the compilation of "/tmp/tmpxft_00004edd_00000000-8_wavenet_infer.cpp1.ii".
Makefile:48: recipe for target 'wavenet_infer' failed
make: *** [wavenet_infer] Error 1

How to integrate it with r9y9/Tacotron-2?

The Tacotron-2 implementation by r9y9 (https://github.com/r9y9/Tacotron-2) outputs a mel spectrogram, but when I convert that .npy file to a torch tensor, give it to nv-wavenet as input, and run inference, it generates noise. Do I have to do something extra to the mel spectrogram that Tacotron 2 generates before giving it to nv-wavenet for speech synthesis?

Does nv-wavenet support higher mu_quantization value?

The default mu_quantization value is 256, and I think a higher mu_quantization value may produce better-quality audio.
I made these modifications to test a mu_quantization value of 512:

  1. modify values of mu_quantization, n_in_channels, n_out_channels in config.json to 512;
  2. modify value of const int A in wavenet_infer.cu to 512, and re-build the project;
  3. modify the 256 value in inference.py to 512.

But after training the model and running inference, the produced audio files contain no sound.
Did I miss anything else to modify, or does nv-wavenet not support other mu_quantization values?
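
For reference, the mu-law companding behind mu_quantization generalizes to any level count; a hedged sketch of an encoder (illustrative, not the repo's code):

    #include <cmath>

    // Map x in [-1, 1] to an integer in [0, quantizationChannels - 1].
    int muLawEncode(float x, int quantizationChannels) {
        float mu = quantizationChannels - 1.0f;  // 255 for 256 levels, 511 for 512
        float y = copysignf(log1pf(mu * fabsf(x)) / log1pf(mu), x);  // in [-1, 1]
        return (int)((y + 1.0f) * 0.5f * mu + 0.5f);
    }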

Training support

Hi all,

Seems like your code currently supports only inference. Will training be supported as well? If so, when?

Thanks

Attempting to optimize persistence for GTX 1080

Hello,

I am trying to optimize nv_wavenet_persistent.cuh so that it can run on a GTX 1080.

I am using the arguments below:
-l 20 -r 64 -s 256 -a 256 -b 1 -d 2 -m 3

First, in order to even run the persistent implementation on the GTX 1080 with the above parameters, I needed to decrease "maxrregcount" from 128 to 64. This resulted in a very big performance decrease, as a huge portion of the data that used to be stored in registers is now spilled to local memory.

Second, in order to reduce register spills, I want to eliminate all the register variables (i.e. variables ending in "_reg"). As I was looking through the code to eliminate such variables, I realized that most of them are used for negative-zero checking.

So I tried a very simple and blunt approach: I replaced this code from the nv_wavenet_persistent_GEMM_MxK function

bool valid = false;
while (!valid) {
    valid = true;
    #pragma unroll
    for (int b=0; b<N_UNROLL; b++) {
        act_in_reg[b] = loadVolatile(act_in,(batch_offset+b)*ldb + row);
    }
    #pragma unroll
    for (int b=0; b<N_UNROLL; b++) {
        valid &= !isNegativeZero(act_in_reg[b]);
    }
}
#pragma unroll // fill [batch_size x R] shared vector
for (int b=0; b<N_UNROLL; b++) {
    act_in_sh[b][row] = act_in_reg[b];
}

with

act_in_sh[0][row] = loadVolatile(act_in,(batch_offset)*ldb + row);

As I expected, this does not work, but I am not sure why.

My questions are:

  1. What does "isNegativeZero" do? Because it is assembly, I am having a hard time reading it.

  2. Is there a way to minimize register use such that 64 registers per thread would still give sufficient performance?

  3. What is the need for the check performed in the bigger piece of code above?

  4. Why initialize activations with -0.f instead of 0.f?
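
For context, a hedged reading of the pattern above (illustrative; the repo's actual code may differ in detail): -0.f acts as a "not yet written" sentinel, which ties questions 1, 3 and 4 together.

    // IEEE 754 defines -0.0f == 0.0f, so an ordinary compare cannot detect the
    // sentinel; the check must inspect the raw bits (sign bit set, rest zero).
    __device__ __forceinline__ bool isNegativeZeroSketch(float x) {
        return __float_as_uint(x) == 0x80000000u;
    }
    // Producer blocks initialize activations to -0.f (question 4) and later
    // overwrite them with real values. The spin loop (question 3) keeps
    // re-reading until no element is still -0.f, i.e. until the producer has
    // published every value. Replacing the loop with a single unguarded load
    // can read activations before they are written, which would explain why
    // the blunt version produces wrong results.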

I would really appreciate your help.

Thanks!

make pytorch wrapper fails - similar to #54 but a different compiler error

../nv_wavenet_util.cuh(89): error: more than one conversion function from "half" to a built-in type applies:
function "__half::operator float() const"
function "__half::operator short() const"
function "__half::operator unsigned short() const"
function "__half::operator int() const"
function "__half::operator unsigned int() const"
function "__half::operator long long() const"
function "__half::operator unsigned long long() const"
function "__half::operator __nv_bool() const"

Math test error with ARCH=sm_60 & racecheck & CUDA 9.0

Hi,

If I use the vanilla nv-wavenet code (commit ade2689) with ARCH=sm_60, and run

cuda-memcheck --tool racecheck --racecheck-report all math_test

I see a lot of errors and eventually a mismatch

========= ERROR: Race reported between Read access at 0x00001330 in /home/test/git/nv-wavenet/matrix_math.cuh:99:void gemm_kernel<float, float, int=128, int=64, int=1>(int, float*, float*, float*, int, unsigned long*, int)
=========     and Write access at 0x000012d0 in /home/test/git/nv-wavenet/math_test.cu:152:void gemm_kernel<float, float, int=128, int=64, int=1>(int, float*, float*, float*, int, unsigned long*, int) [112 hazards]
=========
mismatch 1728 320.000000 307.000000
math_test: math_test.cu:314: int main(): Assertion `check_results<float>(M,N,K,C,gpuC)' failed.
========= Error: process didn't terminate successfully

Do I need to upgrade something?

too many blocks in cooperative launch

Hi all,
I've followed the install process and downloaded the pretrained model and corresponding input data, but when I run python nv_wavenet_test.py, I get the following error:

45 blocks, 2 blocks per SM
GPUassert: too many blocks in cooperative launch ../nv_wavenet_persistent.cuh 530
python: wavenet_infer.cu:97: void infer(std::shared_ptr<nvWavenetInfer<float, float, 64, 256, 256> >, float*, int*, int, int): Assertion `wavenet->run(sample_count, batch_size, samples, batch_size_per_block, true)' failed.
Aborted (core dumped)

I'm using a GTX 1070, so I compiled with the sm_61 option.
Any ideas?

Not able to get good audio quality

Hi,

I am training a WaveNet model from Rayhane-mamah Tacotron-2 GTA mels (I've modified mel2samp_onehot.py accordingly) and I am not able to get good audio quality.

This is the audio I get from a training sample after 250k iter (batch size 7):
original.wav
250k_iter.wav

I know I might want to train longer, but I've already trained a 1.3M iter WaveNet model and I get the same "low quality" audio.

Here are my parameters:

"data_config": {
        "segment_length": 22050,
        "mu_quantization": 256,
        "filter_length": 2048,
        "hop_length": 275,
        "win_length": 1100,
        "sampling_rate": 22050,
        "mel_fmin": 50,
        "mel_fmax": 7600
    },
    "wavenet_config": {
        "n_in_channels": 256,
        "n_layers": 16,
        "max_dilation": 256,
        "n_residual_channels": 64,
        "n_skip_channels": 256,
        "n_out_channels": 256,
        "n_cond_channels": 80,
        "upsamp_window": 1100,
        "upsamp_stride": 275
    }

Any clues what might be wrong? Maybe some pre/post-processing is missing? Maybe this is due to mu-law quantization with 256 values?

Any help is really appreciated. Thank you very much.

Expected Performance on GeForce 10 Series GPUs?

I've been testing the performance on a 1070 and a 1080 Ti on two separate machines, both running Ubuntu 16.04 with driver 384.130, and the results I'm getting are dramatically worse than the ones shown in the blog or here #18, with both GPUs performing at about 4.3 kHz and 6.3 kHz for the single-block and dual-block implementations respectively on the 20-layer medium network. Is this to be expected from GeForce 10 series GPUs, or is there a possible configuration issue that is slowing performance?

For fast and higher fidelity, support double softmax

DeepMind published a follow-up paper to WaveNet called WaveRNN. One of the notable components that achieved higher fidelity is the double softmax. This, I believe, is a relatively easy improvement over WaveNet that allows for high-fidelity audio.

[figure: dual-softmax equations from the WaveRNN paper]
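
For reference, the dual softmax factors each 16-bit sample into two 8-bit predictions; a hedged sketch of the split (illustrative only, function name hypothetical):

    // Split a 16-bit sample into coarse and fine bytes, each the target of its
    // own 256-way softmax (the fine prediction is conditioned on the coarse).
    void splitSample(unsigned short sample, unsigned char* coarse, unsigned char* fine) {
        *coarse = sample >> 8;    // high byte, predicted first
        *fine   = sample & 0xFF;  // low byte, predicted second
    }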

building pytorch wrapper

Hello,

I'm trying to build pytorch wrapper on Windows without success.
I followed the instructions (from: ) and built the project using the makefile.

However, when I run the python build.py command I get the following error:
LINK : fatal error LNK1181: cannot open input file '.\Release_nv_wavenet_ext.obj'

I'm using:
-Windows 10
-Python 3.6
-Cuda 9.0
-1060 GTX
-Latest version of Pytorch

I'm not an expert so any help would be appreciated.

Distributed code just hangs with no output

Configuration: CUDA 9.0
Python: 3.5
PyTorch: 0.4.1
Command: python3 distributed.py -c config.json
Nothing gets written to GPU_1/2/3.log even after waiting for 15-20min.

GPUAssert raised during inference at cudaLaunchCooperativeKernel() on GTX 1080 Ti

Testing with nv_wavenet_test.py succeeds (congrats on the remarkable speed!) and training in distributed mode goes swiftly, 10k+ iters in. inference.py fails though, asserting on kernel launch. How should I proceed?

System info

Driver version: 396.24
Kernel: Linux safpa 4.15.0-22-generic #24-Ubuntu SMP Wed May 16 12:15:17 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
GPUs: 2x GeForce GTX 1080 Ti

Error

mabel@safpa:~/git/nv-wavenet/pytorch$ python3 inference.py -f mel_files.txt -c checkpoints/wavenet_10000 -o .
arctic_a0001.wav.pt
57 blocks, 2 blocks per SM
GPUassert: too many blocks in cooperative launch ../nv_wavenet_persistent.cuh 530
python3: wavenet_infer.cu:97: void infer(std::shared_ptr<nvWavenetInfer<float, float, 64, 256, 256> >, float*, int*, int, int): Assertion `wavenet->run(sample_count, batch_size, samples, batch_size_per_block, true)' failed.
Aborted (core dumped)

Line 530 of nv_wavenet_persistent.cuh:

cudaError_t code = cudaLaunchCooperativeKernel((void*)nv_wavenet_persistent<T_weight,T_data,R,S,A,BATCH_UNROLL>, grid, block, &p_params, 0, stream);
gpuAssert(code, __FILE__, __LINE__, false);

Thanks!

UPDATE: Issue seems to be only with the Persistent implementation. Using: python3 inference.py -f mel_files.txt -c checkpoints/wavenet_10000 -o . -i dual I was able to run without issue. Would love to use Persistent though, if supported by my hardware!

nv_wavenet_test.py fails with a larger than one batch size

Issue description

Running nv_wavenet_test.py with an increased batch size causes an error.

Offending Code

    model = torch_load("model.pt", torch.device('cuda'))
    wavenet = nv_wavenet.NVWaveNet(**model)
    cond_input = torch_load("cond_input.pt", torch.device('cuda'))
    cond_input = cond_input.repeat(1, 2, 1, 1)  # Increase the batch size to 2

    samples = wavenet.infer(cond_input, nv_wavenet.Impl.AUTO)

Error

GPUassert: invalid argument ../nv_wavenet.cuh 547

System Info

PyTorch version: 0.4.0
Python version: 3.6.4
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 390.30
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
/usr/local/lib/python2.7/dist-packages/torch/lib/libcudnn-7a90c013.so.7.0.5
/usr/local/lib/python3.5/dist-packages/torch/lib/libcudnn-3f9a723f.so.6.0.21

batching is broken in inference.py

Hi,

Providing batch size >1 to inference.py results in the following error:

Traceback (most recent call last):
  File "inference.py", line 89, in <module>
    main(args.filelist_path, args.checkpoint_path, args.output_dir, args.batch_size, implementation)
  File "inference.py", line 53, in main
    cond_input = model.get_cond_input(torch.cat(mels, 0))
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 706 and 773 in dimension 2 at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorMath.cu:111

It seems to be because line 53 tries to concat mel spectrograms of different lengths.

An easy fix is to just pad the sequences to max_len, but I can't try it right now due to my model not producing audible samples.

Test R=256, to replicate Tacotron 2

Both Parallel WaveNet and Tacotron 2 to the best of my knowledge use R=256 while this repo tests R=64 at most.

Parallel WaveNet Quote

The number of hidden units in the gating layers is 512 (split into two groups of 256 for the two parts of the activation function (1)).

how to use trained nvidia/Tacotron2 with your pre-trained nv-wavenet

Hi,

I have trained nvidia/Tacotron2 model and now I'd like to use it with your pre-trained nvidia/nv-wavenet.

I'm trying to use nv-wavenet/pytorch/inference.py. It seems, though, that the inference script expects a model checkpoint instead of a saved model.pt (as in nv_wavenet_test.py).

When I change L42-43 in inference.py to:

 model = torch.load('model.pt')
 wavenet = nv_wavenet.NVWaveNet(**model)

I get the error AttributeError: 'dict' object has no attribute 'get_cond_input'

Or do you have any advice on how to generate cond_input.pt from nvidia/Tacotron2 so that I could use it the same way as in nv_wavenet_test.py?

Thanks.

Error when making the pytorch wrapper

Hi @BrianPharris

Could you please give me some help?

I cannot build the pytorch wrapper; here is the error message:
nvcc -arch=sm_61 -std=c++11 --use_fast_math -lineinfo -maxrregcount 128 -I .. wavenet_infer.cu ../matrix.cpp -lz -Xcompiler -fPIC -shared -o wavenet_infer.so
wavenet_infer.cu(97): error: argument of type "int *" is incompatible with parameter of type "int"

wavenet_infer.cu(97): error: argument of type "int" is incompatible with parameter of type "int *"

2 errors detected in the compilation of "/tmp/tmpxft_00006df7_00000000-8_wavenet_infer.cpp1.ii".
Makefile:48: recipe for target 'wavenet_infer' failed
make: *** [wavenet_infer] Error 1

However, before your last commit, I could still build the pytorch wrapper successfully.

Error in Pytorch wrapper

nvcc -arch=sm_60 -std=c++11 --use_fast_math -lineinfo -maxrregcount 128 -I .. wavenet_infer.cu ../matrix.cpp -lz -Xcompiler -fPIC -shared -o wavenet_infer.so
/usr/bin/ld: cannot find -lz
collect2: error: ld returned 1 exit status
Makefile:48: recipe for target 'wavenet_infer' failed
make: *** [wavenet_infer] Error 1

question about performance

I have built for Pascal/Volta with sm_60/sm_70, and it runs well on P100 and V100. But with all parameters the same and only the precision changed (fp16 vs. fp32), I get similar performance in PERSISTENT mode on both P100 and V100, while the two precisions differ substantially in the other modes (SINGLE, DUAL).
So, the first question is: why does this happen?

The second question is: the V100 has Tensor Cores. Does this source code use Tensor Cores?

And the last question is: the V100 is newer, so it is supposed to be better; however, its actual performance is close to, and even a little lower than, the P100's. Why?

                        GP100      V100-PCIE
Graphics clock          1556 MHz   1380 MHz
Memory clock            715 MHz    877 MHz
Medium-Single FP16      28.04      26.97
Medium-Single FP32      10.69      11.26
Medium-Dual FP16        32.40      31.02
Medium-Dual FP32        14.68      15.28
Medium-Persistent FP16  41.85      42.83
Medium-Persistent FP32  37.50      42.02

Problem with testing script

Hi,

I have tried to run the test from the Makefile with the commands make nv_wavenet_test and ./nv_wavenet_test. Everything works correctly and the test runs. However, when I reach the large WaveNet model (R=64, S=256), an assertion occurs when the code compares the matrices obtained with the reference and the CUDA implementations.

At a given layer in the model, the difference between the two matrices is greater than the epsilon value of the function matrix_compare(name, A, B, epsilon). I can increase this value in the code and the assertion disappears, but I would like to understand why this error occurs.

Here is my GPU configuration : NVIDIA 1080Ti - CUDA 9.0 - Driver version: 384.111

Thanks a lot

residual layer should be one shorter

Great work! It seems every layer has a residual weight/bias, a skip weight/bias, and a dilation weight/bias. Why does the model require the residual weights/biases lists to be one shorter? Actually, you append zero weights/biases rather than using the trained model's residual weights/biases.

GPUassert: invalid device function ../nv_wavenet_util.cuh 48

Issue description

Error running nv-wavenet test for PyTorch

Code example

$ python3.6 nv-wavenet/pytorch/nv_wavenet_test.py
GPUassert: invalid device function ../nv_wavenet_util.cuh 48

System Info

PyTorch version: 0.4.0
Python version: 3.6.4
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration: GPU 0: GeForce GTX 1080 Ti
Nvidia driver version: 390.30
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.0.5
/usr/local/lib/python2.7/dist-packages/torch/lib/libcudnn-7a90c013.so.7.0.5
/usr/local/lib/python3.5/dist-packages/torch/lib/libcudnn-3f9a723f.so.6.0.21

Is it possible to use it without Cooperative Kernel? (regardless of performance loss) (it works)

Firstly, thank you for developing this repo :)

I've been trying to develop a Windows port of this repo, and I managed to build the pytorch version using MSVC, but I ran into CUDA error 71 (cudaErrorNotSupported).
(I'm currently using a GTX 1060 6GB on Windows 10 with CUDA 9.0.)

The error was tracked to

cudaError_t code = cudaLaunchCooperativeKernel((void*)nv_wavenet_persistent<T_weight,T_data,R,S,A,BATCH_UNROLL>, grid, block, &p_params, 0, stream);

Via https://devtalk.nvidia.com/default/topic/1022751/cuda-setup-and-installation/gtx-1080-does-not-support-cooperative-kernel-launch-/, I discovered that cooperative kernel launch is only available on Linux, or on Windows in TCC mode.

Would it be possible to use a non-cooperative launch (changing cudaLaunchCooperativeKernel to cudaLaunchKernel), and if so, how much performance loss would there be?
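
For context, a hedged sketch of why the launch mode matters (illustrative, not the repo's code): a cooperative launch is what makes a grid-wide barrier inside the kernel legal, so a persistent kernel that relies on one cannot simply switch to cudaLaunchKernel without restructuring.

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void persistentKernelSketch(int* data) {
        cg::grid_group grid = cg::this_grid();
        // ... produce values consumed by other thread blocks ...
        grid.sync();  // only valid when launched cooperatively
        // ... consume values produced by other thread blocks ...
    }

    void launchSketch(int* d_data, dim3 gridDim, dim3 blockDim, cudaStream_t stream) {
        void* args[] = { &d_data };
        cudaLaunchCooperativeKernel((void*)persistentKernelSketch,
                                    gridDim, blockDim, args, /*sharedMem=*/0, stream);
    }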

Compilation error for target arch sm_52

I cannot seem to compile for target arch sm_52

It works if I specify sm_60 or sm_70.

Is there an issue with compiling for a GTX Titan X (sm_52), or is this meant to be supported?

running 'make nv_wavenet_test' outputs the following:

`nvcc -arch=sm_30 -std=c++11 --use_fast_math -lineinfo -maxrregcount 512 nv_wavenet_test.cu matrix.cpp nv_wavenet_reference.cpp -o nv_wavenet_test
nv_wavenet_util.cuh(89): error: more than one conversion function from "half" to a built-in type applies:
function "__half::operator float() const"
function "__half::operator short() const"
function "__half::operator unsigned short() const"
function "__half::operator int() const"
function "__half::operator unsigned int() const"
function "__half::operator long long() const"
function "__half::operator unsigned long long() const"
function "__half::operator __nv_bool() const"

nv_wavenet_util.cuh(89): error: more than one conversion function from "half" to a built-in type applies:
function "__half::operator float() const"
function "__half::operator short() const"
function "__half::operator unsigned short() const"
function "__half::operator int() const"
function "__half::operator unsigned int() const"
function "__half::operator long long() const"
function "__half::operator unsigned long long() const"
function "__half::operator __nv_bool() const"

2 errors detected in the compilation of "/tmp/tmpxft_000024ab_00000000-8_nv_wavenet_test.cpp1.ii".
Makefile:70: recipe for target 'nv_wavenet_test' failed
make: *** [nv_wavenet_test] Error 1
`

This is the makefile:

`NVCC = nvcc

#ARCH=compute_52 -code=sm_52,compute_52
#NVCC_FLAGS = -arch=$(ARCH) -std=c++11
#nvcc_ARCH += -gencode=arch=compute_52,code="sm_52,compute_52"
#NVCC_FLAGS += $(nvcc_ARCH) -std=c++11

ARCH=sm_30# \
	-gencode=arch=compute_30,code=sm_30 \
	-gencode=arch=compute_50,code=sm_50 \
	-gencode=arch=compute_52,code=sm_52 \
	-gencode=arch=compute_60,code=sm_60 \
	-gencode=arch=compute_61,code=sm_61 \
	-gencode=arch=compute_62,code=sm_62 \
	-gencode=arch=compute_70,code=sm_70 \
	-gencode=arch=compute_70,code=compute_70

NVCC_FLAGS = -arch=$(ARCH) -std=c++11

NVCC_FLAGS += --use_fast_math

MAX_REGS = 512

HEADERS = nv_wavenet_util.cuh \
	nv_wavenet_singleblock.cuh \
	nv_wavenet_dualblock.cuh \
	nv_wavenet_persistent.cuh \
	nv_wavenet.cuh \
	matrix_math.cuh \
	softmax.cuh \
	nv_wavenet_conversions.cuh

default: test

test : math_test nv_wavenet_test
	math_test
	nv_wavenet_test

nv_wavenet_perf : nv_wavenet_perf.cu $(HEADERS)
	$(NVCC) $(NVCC_FLAGS) -maxrregcount $(MAX_REGS) --ptxas-options=-v nv_wavenet_perf.cu -o nv_wavenet_perf

nv_wavenet_test : nv_wavenet_test.cu matrix.cpp matrix.h nv_wavenet_reference.cpp $(HEADERS)
	$(NVCC) $(NVCC_FLAGS) -lineinfo -maxrregcount $(MAX_REGS) nv_wavenet_test.cu matrix.cpp nv_wavenet_reference.cpp -o nv_wavenet_test

math_test : math_test.cu matrix_math.cuh matrix.cpp softmax.cuh
	$(NVCC) $(NVCC_FLAGS) math_test.cu matrix.cpp -lineinfo -o math_test

clean:
	rm nv_wavenet_perf nv_wavenet_test math_test
`

Core dump on inference with R=128;A=S=256

Hi, I'm training a new wavenet model with R=128 (and apparently it is training successfully), but when I try to run inference I get an error. My configuration:

wavenet_infer.cu
const int A = 256;
const int R = 128;
const int S = 256;

config.json
"n_in_channels": 256,
"n_layers": 16,
"max_dilation": 256,
"n_residual_channels": 128,
"n_skip_channels": 256,
"n_out_channels": 256

It gives me this error when I try to run inference:

wavenet_infer.cu:97: void infer(std::shared_ptr<nvWavenetInfer<float, float, 128, 256, 256> >, float*, int*, int, int): Assertion `wavenet->run(sample_count, batch_size, samples, batch_size_per_block, true)' failed.

Single GPU GTX 1080Ti
Tried with all implementations: single, dual and persistent. All broken.

Trained model generating inaudible samples

Hello,

I have trained a model on LJSpeech 1.1 (downsampled to 16k, verified to be good), using the training scheme outlined in the README.md, with distributed.py for ~800k iterations.

I then attempted to do inference using several checkpoints (10k, 150k, 780k), but all I get is silence.

Here are some samples from the 100k model.

What could be the problem?

Support kernel size of 3, to replicate Tacotron 2

Hi There!

Can you support a kernel size of 3 please?


Reading the Tacotron 2 paper more closely, it looks like they are using a kernel size of 3. Otherwise, they would not have been able to achieve a 505-sample receptive field with 24 layers, 4 cycles, and a cycle size of 6.

With a kernel size of 2, the math comes out to:
(1 + 2 + 4 + 8 + 16 + 32) * 4 + 1 == 253

With a kernel size of 3, the math comes out to:
(2 + 4 + 8 + 16 + 32 + 64) * 4 + 1 == 505
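
Both lines are instances of the same rule, receptive field = (kernel size - 1) * (sum of dilations) + 1; a small sketch (function name hypothetical):

    // Receptive field of a stack of dilated causal convolutions whose dilation
    // doubles within each cycle: 1, 2, 4, ... per cycle.
    int receptiveField(int kernelSize, int cycles, int layersPerCycle) {
        int sum = 0;
        for (int c = 0; c < cycles; c++)
            for (int l = 0, d = 1; l < layersPerCycle; l++, d *= 2)
                sum += d;
        return (kernelSize - 1) * sum + 1;  // k=2 -> 253, k=3 -> 505 for 4x6 layers
    }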


Similarly, the parallel WaveNet paper used a kernel size of 3:

This required a WaveNet with a wider receptive field, which we achieved by increasing the dilated convolution filter size from 2 to 3.

Invalid Device Function with CUDA 9.0 and Geforce Titan X

I am trying to use my GeForce Titan X to run the provided test. However, with the given sm_70, and also with sm_60, it results in GPUassert: invalid device function nv_wavenet_util.cuh 48.
Also, when I try to use sm_52, as this is the card's compute capability, I run into different errors.

Could it be that my card is not supported at all? I'd appreciate any help.
