pytorch-tvmisc's Issues

Need pytorch nightly?

Thank you for making the Wasserstein loss extension available. Forgive me if this isn't an issue; I am not an expert user. I just wanted to mention that when I tried to run the extension on my computer (torch 1.1.0), I got this error at compile time:

error: identifier "TORCH_CHECK" is undefined

After installing the latest pytorch-nightly everything seems to run smoothly, so I guess this may be a requirement?
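
A possible workaround for older installs is sketched below. It assumes the error arises because `TORCH_CHECK` is not yet defined in torch 1.1.0 and that the older `AT_CHECK` macro is an adequate stand-in there. The helper name `add_torch_check_shim` is hypothetical; it patches the CUDA source string before it is passed to `load_inline`.

```python
TORCH_CHECK_SHIM = """
// Fallback for PyTorch versions without TORCH_CHECK (assumption: AT_CHECK exists there).
#ifndef TORCH_CHECK
#define TORCH_CHECK AT_CHECK
#endif
"""


def add_torch_check_shim(source, include_line="#include <torch/extension.h>"):
    """Insert the shim right after the torch include so newer versions keep their own macro."""
    if include_line not in source:
        raise ValueError("expected include line not found in source")
    return source.replace(include_line, include_line + TORCH_CHECK_SHIM, 1)


# Tiny stand-in for the real CUDA source string from the notebook.
patched = add_torch_check_shim('#include <torch/extension.h>\nTORCH_CHECK(true, "ok");\n')
print(patched)
```

The shim is placed after the include on purpose: on versions that already define `TORCH_CHECK`, the `#ifndef` guard then leaves the built-in macro untouched.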

Consider transferring `load_inline` to `setuptools`?

Hi! Thanks a lot for your great work on the Wasserstein distance <Pytorch_Wasserstein.ipynb>!

Since torch.utils.cpp_extension.load_inline will compile the CUDA code on every run, would you consider moving it to setuptools, i.e., python install, so that one could load pre-built libraries?

Sorry, I'm not familiar with this. Is there any barrier?
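
For reference, a setuptools build could look roughly like the sketch below. It assumes the CUDA source from the notebook has been saved to a file called `wasserstein.cu` (a placeholder name) and uses `CUDAExtension` and `BuildExtension` from `torch.utils.cpp_extension`. The helpers `extension_spec` and `main` are illustrative names, not part of the repository.

```python
# setup.py: a sketch, assuming the notebook's CUDA source was saved as wasserstein.cu


def extension_spec():
    """Arguments for CUDAExtension (the source file name is a placeholder)."""
    return {
        "name": "wasserstein",
        "sources": ["wasserstein.cu"],
        # Same nvcc flag that the notebook passes to load_inline.
        "extra_compile_args": {"cxx": [], "nvcc": ["--expt-relaxed-constexpr"]},
    }


def main():
    # Imported lazily so the spec above can be inspected without a build toolchain.
    from setuptools import setup
    from torch.utils.cpp_extension import BuildExtension, CUDAExtension

    setup(
        name="wasserstein",
        ext_modules=[CUDAExtension(**extension_spec())],
        cmdclass={"build_ext": BuildExtension},
    )
```

With a final `main()` call added at the bottom of the file, the extension would be compiled once at install time. Afterwards it should be importable with `import torch` followed by `import wasserstein`, with torch imported first so that its shared libraries are already loaded.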


Problem with scripting the model

Hi Sir,

I have started learning TorchScript, and your blog was a great source for understanding the JIT. I tried to run the notebook pytorch_automatic_optimization_jit.ipynb, but I am unable to run the C++, CUDA, and CPU kernels, and I cannot reproduce the graph shown in the notebook. I have attached the link to the Colab notebook I am working with.

Could you please help me with this problem?

Colab Notebook

FileNotFoundError: [Errno 2] No such file or directory: 'ninja': 'ninja'

I installed ninja with conda, so why do I still hit this error? Please help me! T_T

ninja --version


$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

pytorch 1.2.0



Traceback (most recent call last):
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/", line 890, in verify_ninja_availability
    subprocess.check_call('ninja --version'.split(), stdout=devnull)
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/", line 342, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/", line 323, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/", line 775, in __init__
    restore_signals, start_new_session)
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/", line 1522, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ninja': 'ninja'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/devdata/new_Relation_Extraction/", line 208, in <module>
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/", line 787, in load_inline
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/", line 827, in _jit_compile
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/", line 850, in _write_ninja_file_and_build
  File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/", line 892, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
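
The first traceback shows that `ninja --version` could not be started at all, which suggests the `ninja` executable is not on the `PATH` of the process running the script. A small diagnostic sketch follows. It assumes conda placed `ninja` next to the Python interpreter of the environment; the helper names `locate_tool` and `ensure_on_path` are made up for illustration.

```python
import os
import shutil
import sys


def locate_tool(name, extra_dirs=()):
    """Return the full path of `name`, searching PATH first and then `extra_dirs`."""
    found = shutil.which(name)
    if found is None and extra_dirs:
        found = shutil.which(name, path=os.pathsep.join(extra_dirs))
    return found


def ensure_on_path(name, extra_dirs):
    """Prepend the directory containing `name` to PATH if it is only found in `extra_dirs`."""
    if shutil.which(name) is not None:
        return True
    found = locate_tool(name, extra_dirs)
    if found is None:
        return False
    os.environ["PATH"] = os.path.dirname(found) + os.pathsep + os.environ.get("PATH", "")
    return True


# Assumption: conda puts console tools such as ninja next to the interpreter.
interpreter_dir = os.path.dirname(sys.executable)
print("ninja visible:", ensure_on_path("ninja", [interpreter_dir]))
```

Running this before the `load_inline` call would show whether the environment's own directory is missing from `PATH`, which can happen when a script is launched from an IDE or a shell where the conda environment was not activated.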


import math
import torch
import torch.utils
import torch.utils.cpp_extension
# % matplotlib inline

# from matplotlib import pyplot
# import matplotlib.transforms
# import ot  # for comparison

cuda_source = """

#include <torch/extension.h>
#include <ATen/core/TensorAccessor.h>
#include <ATen/cuda/CUDAContext.h>

using at::RestrictPtrTraits;
using at::PackedTensorAccessor;

#if defined(__HIP_PLATFORM_HCC__)
constexpr int WARP_SIZE = 64;
#else
constexpr int WARP_SIZE = 32;
#endif

// The maximum number of threads in a block
#if defined(__HIP_PLATFORM_HCC__)
constexpr int MAX_BLOCK_SIZE = 256;
#else
constexpr int MAX_BLOCK_SIZE = 512;
#endif

// Returns the index of the most significant 1 bit in `val`.
__device__ __forceinline__ int getMSB(int val) {
  return 31 - __clz(val);
}

// Number of threads in a block given an input size up to MAX_BLOCK_SIZE
static int getNumThreads(int nElem) {
#if defined(__HIP_PLATFORM_HCC__)
  int threadSizes[5] = { 16, 32, 64, 128, MAX_BLOCK_SIZE };
#else
  int threadSizes[5] = { 32, 64, 128, 256, MAX_BLOCK_SIZE };
#endif
  for (int i = 0; i != 5; ++i) {
    if (nElem <= threadSizes[i]) {
      return threadSizes[i];
    }
  }
  return MAX_BLOCK_SIZE;
}

template <typename T>
__device__ __forceinline__ T WARP_SHFL_XOR(T value, int laneMask, int width = warpSize, unsigned int mask = 0xffffffff)
{
#if CUDA_VERSION >= 9000
    return __shfl_xor_sync(mask, value, laneMask, width);
#else
    return __shfl_xor(value, laneMask, width);
#endif
}

// While this might be the most efficient sinkhorn step / logsumexp-matmul implementation I have seen,
// this is awfully inefficient compared to matrix multiplication and e.g. NVidia cutlass may provide
// many great ideas for improvement
template <typename scalar_t, typename index_t>
__global__ void sinkstep_kernel(
  // compute log v_bj = log nu_bj - logsumexp_i 1/lambda dist_ij - log u_bi
  // for this compute maxdiff_bj = max_i(1/lambda dist_ij - log u_bi)
  // i = reduction dim, using threadIdx.x
  PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> log_v,
  const PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> dist,
  const PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> log_nu,
  const PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> log_u,
  const scalar_t lambda) {

  using accscalar_t = scalar_t;

  __shared__ accscalar_t shared_mem[2 * WARP_SIZE];

  index_t b = blockIdx.y;
  index_t j = blockIdx.x;
  int tid = threadIdx.x;

  if (b >= log_u.size(0) || j >= log_v.size(1)) {
    return;
  }
  // reduce within thread
  accscalar_t max = -std::numeric_limits<accscalar_t>::infinity();
  accscalar_t sumexp = 0;

  if (log_nu[b][j] == -std::numeric_limits<accscalar_t>::infinity()) {
    if (tid == 0) {
      log_v[b][j] = -std::numeric_limits<accscalar_t>::infinity();
    }
    return;
  }

  for (index_t i = threadIdx.x; i < log_u.size(1); i += blockDim.x) {
    accscalar_t oldmax = max;
    accscalar_t value = -dist[i][j]/lambda + log_u[b][i];
    max = max > value ? max : value;
    if (oldmax == -std::numeric_limits<accscalar_t>::infinity()) {
      // sumexp used to be 0, so the new max is value and we can set 1 here,
      // because we will come back here again
      sumexp = 1;
    } else {
      sumexp *= exp(oldmax - max);
      sumexp += exp(value - max); // if oldmax was not -infinity, max is not either...
    }
  }

  // now we have one value per thread. we'll make it into one value per warp
  // first warpSum to get one value per thread to
  // one value per warp
  for (int i = 0; i < getMSB(WARP_SIZE); ++i) {
    accscalar_t o_max    = WARP_SHFL_XOR(max, 1 << i, WARP_SIZE);
    accscalar_t o_sumexp = WARP_SHFL_XOR(sumexp, 1 << i, WARP_SIZE);
    if (o_max > max) { // we're less concerned about divergence here
      sumexp *= exp(max - o_max);
      sumexp += o_sumexp;
      max = o_max;
    } else if (max != -std::numeric_limits<accscalar_t>::infinity()) {
      sumexp += o_sumexp * exp(o_max - max);
    }
  }

  // this writes each warp's accumulation into shared memory
  // there are at most WARP_SIZE items left because
  // there are at most WARP_SIZE**2 threads at the beginning
  if (tid % WARP_SIZE == 0) {
    shared_mem[tid / WARP_SIZE * 2] = max;
    shared_mem[tid / WARP_SIZE * 2 + 1] = sumexp;
  }
  __syncthreads(); // make the per-warp results visible before the first warp reads them
  if (tid < WARP_SIZE) {
    max = (tid < blockDim.x / WARP_SIZE ? shared_mem[2 * tid] : -std::numeric_limits<accscalar_t>::infinity());
    sumexp = (tid < blockDim.x / WARP_SIZE ? shared_mem[2 * tid + 1] : 0);
  }
  for (int i = 0; i < getMSB(WARP_SIZE); ++i) {
    accscalar_t o_max    = WARP_SHFL_XOR(max, 1 << i, WARP_SIZE);
    accscalar_t o_sumexp = WARP_SHFL_XOR(sumexp, 1 << i, WARP_SIZE);
    if (o_max > max) { // we're less concerned about divergence here
      sumexp *= exp(max - o_max);
      sumexp += o_sumexp;
      max = o_max;
    } else if (max != -std::numeric_limits<accscalar_t>::infinity()) {
      sumexp += o_sumexp * exp(o_max - max);
    }
  }

  if (tid == 0) {
    log_v[b][j] = (max > -std::numeric_limits<accscalar_t>::infinity() ?
                   log_nu[b][j] - log(sumexp) - max :
                   -std::numeric_limits<accscalar_t>::infinity());
  }
}
template <typename scalar_t>
torch::Tensor sinkstep_cuda_template(const torch::Tensor& dist, const torch::Tensor& log_nu, const torch::Tensor& log_u,
                                     const double lambda) {
  TORCH_CHECK(dist.is_cuda(), "need cuda tensors");
  TORCH_CHECK(dist.device() == log_nu.device() && dist.device() == log_u.device(), "need tensors on same GPU");
  TORCH_CHECK(dist.dim()==2 && log_nu.dim()==2 && log_u.dim()==2, "invalid sizes");
  TORCH_CHECK(dist.size(0) == log_u.size(1) &&
           dist.size(1) == log_nu.size(1) &&
           log_u.size(0) == log_nu.size(0), "invalid sizes");
  auto log_v = torch::empty_like(log_nu);
  using index_t = int32_t;

  auto log_v_a = log_v.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();
  auto dist_a = dist.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();
  auto log_nu_a = log_nu.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();
  auto log_u_a = log_u.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();

  auto stream = at::cuda::getCurrentCUDAStream();

  int tf = getNumThreads(log_u.size(1));
  dim3 blocks(log_v.size(1), log_u.size(0));
  dim3 threads(tf);

  sinkstep_kernel<<<blocks, threads, 2*WARP_SIZE*sizeof(scalar_t), stream>>>(
    log_v_a, dist_a, log_nu_a, log_u_a, static_cast<scalar_t>(lambda)
    );

  return log_v;
}

torch::Tensor sinkstep_cuda(const torch::Tensor& dist, const torch::Tensor& log_nu, const torch::Tensor& log_u,
                            const double lambda) {
    return AT_DISPATCH_FLOATING_TYPES(log_u.scalar_type(), "sinkstep", [&] {
       return sinkstep_cuda_template<scalar_t>(dist, log_nu, log_u, lambda);
    });
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("sinkstep", &sinkstep_cuda, "sinkhorn step");
}

"""

wasserstein_ext = torch.utils.cpp_extension.load_inline("wasserstein", cpp_sources="", cuda_sources=cuda_source,
                                                        extra_cuda_cflags=["--expt-relaxed-constexpr"])
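
For readers following the kernel above, the reduction it performs can be sketched in plain Python. Each thread folds its strided slice of values into a running `(max, sumexp)` pair, and the partial pairs are then merged. The function names are illustrative only, and the merge here is sequential rather than the butterfly shuffle the kernel uses, so this is a reference for the arithmetic, not for the parallel schedule.

```python
import math

NEG_INF = -math.inf


def accumulate(state, value):
    """Fold one value into a running (max, sumexp) pair, as each thread does in its loop."""
    old_max, sumexp = state
    new_max = max(old_max, value)
    if new_max == NEG_INF:
        return (NEG_INF, 0.0)  # only -inf seen so far
    if old_max == NEG_INF:
        return (new_max, 1.0)  # first finite value: exp(value - max) == 1
    return (new_max, sumexp * math.exp(old_max - new_max) + math.exp(value - new_max))


def merge(a, b):
    """Combine two partial (max, sumexp) pairs by rescaling both to the larger max."""
    (max_a, sum_a), (max_b, sum_b) = a, b
    new_max = max(max_a, max_b)
    if new_max == NEG_INF:
        return (NEG_INF, 0.0)
    return (new_max, sum_a * math.exp(max_a - new_max) + sum_b * math.exp(max_b - new_max))


def logsumexp_partitioned(values, n_threads):
    """Strided per-thread accumulation followed by a merge of the partial results."""
    total = (NEG_INF, 0.0)
    for tid in range(n_threads):
        state = (NEG_INF, 0.0)
        for value in values[tid::n_threads]:
            state = accumulate(state, value)
        total = merge(total, state)
    running_max, sumexp = total
    return running_max + math.log(sumexp) if running_max > NEG_INF else NEG_INF


values = [-1000.0, -1001.0, -999.5, NEG_INF, -1002.0]
print(logsumexp_partitioned(values, n_threads=4))
```

Keeping the running maximum alongside the sum is what lets the computation stay finite for inputs such as `-1000.0`, where a naive `log(sum(exp(x)))` would underflow to `log(0)`.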

mexh wavelet

Hi, how can I use mexh from your code?
Can you show me a sample for the mexh wavelet?

Confusion about Lambda

Hello, first of all, thank you for the awesome work!
I have a question about Pytorch_Wasserstein.ipynb:

In WassersteinLossVanilla, why is it
self.K = torch.exp(-self.cost/self.lam) ?
Shouldn't it be
self.K = torch.exp(-self.cost*self.lam)?

Mocha also uses the latter form.

Did you change it because of "Note that we use a different convention for $\lambda$ (i.e. we use $\lambda$ as the weight for the regularisation, later versions of the above use $\lambda^{-1}$ as the weight)."?

Also, what is the reason for this choice?
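
To make the two conventions concrete, here is a small sketch, assuming the note quoted above is indeed the reason for the difference. When $\lambda$ is the weight of the entropy term, the optimal plan has the form $u_i \exp(-C_{ij}/\lambda) v_j$, so $\lambda$ divides the cost. When $\lambda^{-1}$ is the weight, $\lambda$ multiplies it. The function names are illustrative and not taken from the notebook.

```python
import math


def gibbs_kernel_reg_weight(cost, lam):
    """K = exp(-C / lam): lam weights the regularisation, so it divides the cost."""
    return [[math.exp(-c / lam) for c in row] for row in cost]


def gibbs_kernel_inverse_weight(cost, lam):
    """K = exp(-lam * C): lam is the inverse of the regularisation weight."""
    return [[math.exp(-lam * c) for c in row] for row in cost]


cost = [[0.0, 1.0], [2.0, 0.5]]
reg_weight = 0.25

k_a = gibbs_kernel_reg_weight(cost, reg_weight)
k_b = gibbs_kernel_inverse_weight(cost, 1.0 / reg_weight)

# The two kernels agree once lam in one convention is 1 / lam in the other.
same = all(math.isclose(a, b) for ra, rb in zip(k_a, k_b) for a, b in zip(ra, rb))
print(same)
```

Under this reading both formulas describe the same kernel, and only the meaning of the number passed as `lam` changes.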

Wasserstein implementation does not seem to be fully "batched"

Hi @t-vi,

Thanks for sharing your code!

I would like to ask a question regarding your implementation of the Sinkhorn algorithm. You stated that one of the main motivations was to obtain efficient batched computation. However, looking at the code I observe that it only supports the case where the cost matrix is the same across the batch:

def forward(ctx, mu, nu, dist, lam=1e-3, N=100):
    assert mu.dim() == 2 and nu.dim() == 2 and dist.dim() == 2
    bs = mu.size(0)
    d1, d2 = dist.size()
    assert nu.size(0) == bs and mu.size(1) == d1 and nu.size(1) == d2

That is, the shape of dist is d1 x d2 instead of bs x d1 x d2. Is this expected?

Thank you in advance for your reply.
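
As a rough illustration of what a per-sample cost matrix would change, the log-domain step can be written in plain Python so that `dist` is either a shared d1 x d2 matrix or a bs x d1 x d2 stack. This is a hypothetical reference sketch, not the repository's implementation, and the name `log_sinkstep` is made up for it.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a finite iterable of floats."""
    values = list(values)
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_sinkstep(dist, log_nu, log_u, lam):
    """log_v[b][j] = log_nu[b][j] - logsumexp_i(-cost[i][j] / lam + log_u[b][i]).

    `dist` may be a shared d1 x d2 matrix or a per-sample bs x d1 x d2 stack.
    """
    per_sample = isinstance(dist[0][0], (list, tuple))
    log_v = []
    for b, (row_nu, row_u) in enumerate(zip(log_nu, log_u)):
        cost = dist[b] if per_sample else dist  # the only place the batch index matters
        row_v = []
        for j in range(len(cost[0])):
            lse = logsumexp(-cost[i][j] / lam + row_u[i] for i in range(len(row_u)))
            empty = row_nu[j] == -math.inf or lse == -math.inf
            row_v.append(-math.inf if empty else row_nu[j] - lse)
        log_v.append(row_v)
    return log_v


shared = [[0.0, 1.0], [1.0, 0.0]]
log_u = [[math.log(0.5), math.log(0.5)], [math.log(0.2), math.log(0.8)]]
log_nu = [[math.log(0.3), math.log(0.7)], [math.log(0.6), math.log(0.4)]]

print(log_sinkstep(shared, log_nu, log_u, lam=0.1))
print(log_sinkstep([shared, shared], log_nu, log_u, lam=0.1))
```

In this formulation the batch index only selects which cost matrix is read, so a shared `dist` is the special case where every sample uses the same one.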

issue about pytorch wassdistance

I tried to reproduce the PyTorch Wasserstein distance on Windows, but it fails with the error below:
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\Alienware.conda\envs\pytorch\lib\site-packages\torch\utils\", line 1293, in load_inline
    return _jit_compile(
  File "C:\Users\Alienware.conda\envs\pytorch\lib\site-packages\torch\utils\", line 1382, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "C:\Users\Alienware.conda\envs\pytorch\lib\site-packages\torch\utils\", line 1775, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "", line 556, in module_from_spec
  File "", line 1166, in create_module
  File "", line 219, in _call_with_frames_removed
ImportError: DLL load failed while importing wasserstein: The specified module could not be found
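
This error usually means that the compiled module was found but one of the DLLs it depends on was not. One thing that might be worth trying, as an unconfirmed guess rather than a diagnosis, is registering the CUDA `bin` directory for DLL lookup before calling `load_inline`. The sketch assumes Python 3.8 or newer on Windows and that the CUDA installer set the `CUDA_PATH` environment variable; `register_dll_dirs` is a made-up helper name.

```python
import os


def register_dll_dirs(candidates):
    """Register existing directories for DLL lookup; return the ones that were registered."""
    registered = []
    if not hasattr(os, "add_dll_directory"):  # only available on Windows, Python 3.8+
        return registered
    for path in candidates:
        if path and os.path.isdir(path):
            os.add_dll_directory(path)
            registered.append(path)
    return registered


# Assumption: CUDA_PATH points at the toolkit; torch's own "lib" folder could be added the same way.
cuda_path = os.environ.get("CUDA_PATH")
candidates = [os.path.join(cuda_path, "bin")] if cuda_path else []
print(register_dll_dirs(candidates))
```

If the list printed is empty on Windows, the candidate directories do not exist and the missing dependency has to be located some other way.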

Error building extension 'wasserstein'

Hi @t-vi

First of all, thank you for sharing your impressive work. Right now I'm using the code you used for comparison to calculate the Wasserstein loss. However, that takes around 4 minutes for one batch in my case, which is too long, and your implementation seems much faster.

However, when I try to run your code on my server, I get the error below. Do you know what it means? The server is shared by my team, and I don't want to change gcc without knowing whether that would mess up the current environment.

Appreciate any help you can provide!

/home/anyu/anaconda3/lib/python3.6/site-packages/torch/utils/ UserWarning:

                           !! WARNING !!

Your compiler (c++) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 4.9 and above.

for instructions on how to install GCC 4.9 or higher.

                          !! WARNING !!


CalledProcessError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/torch/utils/ in _build_extension_module(name, build_directory)
758 subprocess.check_output(
--> 759 ['ninja', '-v'], stderr=subprocess.STDOUT, cwd=build_directory)
760 except subprocess.CalledProcessError:

~/anaconda3/lib/python3.6/ in check_output(timeout, *popenargs, **kwargs)
335 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 336 **kwargs).stdout

~/anaconda3/lib/python3.6/ in run(input, timeout, check, *popenargs, **kwargs)
417 raise CalledProcessError(retcode, process.args,
--> 418 output=stdout, stderr=stderr)
419 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
in ()
1 import torch
2 wasserstein_ext = torch.utils.cpp_extension.load_inline("wasserstein", cpp_sources="", cuda_sources=cuda_source,
----> 3 extra_cuda_cflags=["--expt-relaxed-constexpr"] )
5 def sinkstep(dist, log_nu, log_u, lam: float):

~/anaconda3/lib/python3.6/site-packages/torch/utils/ in load_inline(name, cpp_sources, cuda_sources, functions, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
639 build_directory,
640 verbose,
--> 641 with_cuda=with_cuda)

~/anaconda3/lib/python3.6/site-packages/torch/utils/ in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
680 if verbose:
681 print('Building extension module {}...'.format(name))
--> 682 _build_extension_module(name, build_directory)
683 finally:
684 baton.release()

~/anaconda3/lib/python3.6/site-packages/torch/utils/ in _build_extension_module(name, build_directory)
763 # error.output contains the stdout and stderr of the build attempt.
764 raise RuntimeError("Error building extension '{}': {}".format(
--> 765 name, error.output.decode()))

RuntimeError: Error building extension 'wasserstein': [1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=wasserstein -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/anyu/anaconda3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-relaxed-constexpr -std=c++11 -c /tmp/torch_extensions/wasserstein/ -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=wasserstein -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/anyu/anaconda3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-relaxed-constexpr -std=c++11 -c /tmp/torch_extensions/wasserstein/ -o cuda.cuda.o
/tmp/torch_extensions/wasserstein/ fatal error: torch/extension.h: No such file or directory
compilation terminated.
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=wasserstein -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/anyu/anaconda3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++11 -c /tmp/torch_extensions/wasserstein/main.cpp -o main.o
ninja: build stopped: subcommand failed.
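
The decisive line in the log above is the fatal error: `torch/extension.h` was not found in any of the `-I` directories, so the build stops before the compiler warning comes into play. A quick check is sketched below. It is only a guess that the installed PyTorch does not ship this header, and `find_header` is a hypothetical helper.

```python
import os


def find_header(include_dirs, header):
    """Return the first include directory that contains `header`, or None."""
    for directory in include_dirs:
        if os.path.isfile(os.path.join(directory, *header.split("/"))):
            return directory
    return None


# Include directories copied from the nvcc command line in the log above.
include_dirs = [
    "/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include",
    "/usr/local/cuda/include",
]
print(find_header(include_dirs, "torch/extension.h"))
```

On the affected machine, the list could instead come from `torch.utils.cpp_extension.include_paths()`. If the function returns `None` there as well, the header is missing from the installation, and upgrading PyTorch is more likely to help than changing gcc.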

