t-vi / pytorch-tvmisc
Totally Versatile Miscellanea for Pytorch
License: MIT License
Hi!
Thank you for making the Wasserstein loss extension available. Forgive me if this isn't a real issue, I am not an expert user. I just wanted to note that when I tried to run the extension on my computer (torch 1.1.0) I got this error at compile time:
error: identifier "TORCH_CHECK" is undefined
After installing the latest pytorch-nightly everything runs smoothly, so I guess a newer PyTorch may be a requirement?
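For anyone who hits the same error on an older PyTorch: a possible workaround, sketched here under the assumption that the pre-1.2 headers still provide the AT_CHECK macro (the notebook itself does not do this), is to splice a compatibility define into the CUDA source before passing it to load_inline:
# Hypothetical shim: alias TORCH_CHECK to the older AT_CHECK macro. It is
# spliced in after the #include lines so that newer headers, which define
# TORCH_CHECK themselves, are left untouched.
compat = """
#ifndef TORCH_CHECK
#define TORCH_CHECK AT_CHECK
#endif
"""
cuda_source_compat = cuda_source.replace(
    "#include <ATen/cuda/CUDAContext.h>",
    "#include <ATen/cuda/CUDAContext.h>\n" + compat)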
Hi! Thanks a lot for your great work on the Wasserstein distance <Pytorch_Wasserstein.ipynb>!
Since torch.utils.cpp_extension.load_inline will recompile the CUDA code on every run, would you consider packaging it with setuptools (i.e., python setup.py install), so that one could load pre-built libraries?
Sorry, I'm not familiar with this. Is there any barrier?
Thanks!
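For reference, a minimal setuptools build could look like the sketch below, assuming the CUDA source from the notebook is saved to a file named wasserstein_cuda.cu (the file name is made up here). After python setup.py install, import wasserstein works without recompiling:
# setup.py
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="wasserstein",
    ext_modules=[
        CUDAExtension(
            name="wasserstein",
            sources=["wasserstein_cuda.cu"],
            extra_compile_args={"cxx": [], "nvcc": ["--expt-relaxed-constexpr"]},
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)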
I have tried several implementations; this is the first time I'm running into NaN. Could you please look into the reason?
So I added unsqueeze_(1) after each max and mean call; maybe it is a difference in the newer version.
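For context, a generic illustration of what such a fix addresses (this is not the notebook's code, just the broadcasting issue the unsqueeze_(1) works around): reductions drop the reduced dimension, so their result no longer broadcasts against the 2-D input.
import torch

x = torch.randn(4, 5)
m = x.max(dim=1).values.unsqueeze_(1)  # (4,) -> (4, 1); keepdim=True would do the same
print((x - m).shape)                   # torch.Size([4, 5])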
Hi Sir,
I have started learning TorchScript, and your blog was a great source for understanding the JIT. I tried to run the notebook pytorch_automatic_optimization_jit.ipynb, but I am unable to run the C++, CUDA, and CPU kernels, and I also cannot reproduce the graph shown in the notebook. I have attached the link to the Colab I am working with.
I would appreciate your help with this problem.
1.7.2
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
py3.7_cuda10.0.130_cudnn7.6.2_0
Traceback (most recent call last):
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 890, in verify_ninja_availability
subprocess.check_call('ninja --version'.split(), stdout=devnull)
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/subprocess.py", line 342, in check_call
retcode = call(*popenargs, **kwargs)
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/subprocess.py", line 323, in call
with Popen(*popenargs, **kwargs) as p:
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/subprocess.py", line 775, in __init__
restore_signals, start_new_session)
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'ninja': 'ninja'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/devdata/new_Relation_Extraction/test_wasserstein.py", line 208, in <module>
extra_cuda_cflags=["--expt-relaxed-constexpr"])
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 787, in load_inline
is_python_module)
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 827, in _jit_compile
with_cuda=with_cuda)
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 850, in _write_ninja_file_and_build
verify_ninja_availability()
File "/home/lowen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 892, in verify_ninja_availability
raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
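The traceback above only means that the ninja build tool is not on the PATH; it is pip-installable (pip install ninja). A quick check after installing:
# Should return without raising once ninja is found.
import torch.utils.cpp_extension

torch.utils.cpp_extension.verify_ninja_availability()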
import math
import torch
import torch.utils
import torch.utils.cpp_extension
# % matplotlib inline
#
# from matplotlib import pyplot
# import matplotlib.transforms
#
# import ot # for comparison
cuda_source = """
#include <torch/extension.h>

#include <ATen/core/TensorAccessor.h>
#include <ATen/cuda/CUDAContext.h>

using at::RestrictPtrTraits;
using at::PackedTensorAccessor;

#if defined(__HIP_PLATFORM_HCC__)
constexpr int WARP_SIZE = 64;
#else
constexpr int WARP_SIZE = 32;
#endif

// The maximum number of threads in a block
#if defined(__HIP_PLATFORM_HCC__)
constexpr int MAX_BLOCK_SIZE = 256;
#else
constexpr int MAX_BLOCK_SIZE = 512;
#endif

// Returns the index of the most significant 1 bit in `val`.
__device__ __forceinline__ int getMSB(int val) {
  return 31 - __clz(val);
}

// Number of threads in a block given an input size up to MAX_BLOCK_SIZE
static int getNumThreads(int nElem) {
#if defined(__HIP_PLATFORM_HCC__)
  int threadSizes[5] = { 16, 32, 64, 128, MAX_BLOCK_SIZE };
#else
  int threadSizes[5] = { 32, 64, 128, 256, MAX_BLOCK_SIZE };
#endif
  for (int i = 0; i != 5; ++i) {
    if (nElem <= threadSizes[i]) {
      return threadSizes[i];
    }
  }
  return MAX_BLOCK_SIZE;
}

template <typename T>
__device__ __forceinline__ T WARP_SHFL_XOR(T value, int laneMask, int width = warpSize, unsigned int mask = 0xffffffff)
{
#if CUDA_VERSION >= 9000
  return __shfl_xor_sync(mask, value, laneMask, width);
#else
  return __shfl_xor(value, laneMask, width);
#endif
}

// While this might be the most efficient sinkhorn step / logsumexp-matmul implementation I have seen,
// this is awfully inefficient compared to matrix multiplication and e.g. NVidia cutlass may provide
// many great ideas for improvement
template <typename scalar_t, typename index_t>
__global__ void sinkstep_kernel(
  // compute log v_bj = log nu_bj - logsumexp_i 1/lambda dist_ij - log u_bi
  // for this compute maxdiff_bj = max_i(1/lambda dist_ij - log u_bi)
  // i = reduction dim, using threadIdx.x
  PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> log_v,
  const PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> dist,
  const PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> log_nu,
  const PackedTensorAccessor<scalar_t, 2, RestrictPtrTraits, index_t> log_u,
  const scalar_t lambda) {
  using accscalar_t = scalar_t;

  __shared__ accscalar_t shared_mem[2 * WARP_SIZE];

  index_t b = blockIdx.y;
  index_t j = blockIdx.x;
  int tid = threadIdx.x;

  if (b >= log_u.size(0) || j >= log_v.size(1)) {
    return;
  }
  // reduce within thread
  accscalar_t max = -std::numeric_limits<accscalar_t>::infinity();
  accscalar_t sumexp = 0;

  if (log_nu[b][j] == -std::numeric_limits<accscalar_t>::infinity()) {
    if (tid == 0) {
      log_v[b][j] = -std::numeric_limits<accscalar_t>::infinity();
    }
    return;
  }

  for (index_t i = threadIdx.x; i < log_u.size(1); i += blockDim.x) {
    accscalar_t oldmax = max;
    accscalar_t value = -dist[i][j]/lambda + log_u[b][i];
    max = max > value ? max : value;
    if (oldmax == -std::numeric_limits<accscalar_t>::infinity()) {
      // sumexp used to be 0, so the new max is value and we can set 1 here,
      // because we will come back here again
      sumexp = 1;
    } else {
      sumexp *= exp(oldmax - max);
      sumexp += exp(value - max); // if oldmax was not -infinity, max is not either...
    }
  }

  // now we have one value per thread. we'll make it into one value per warp
  // first warpSum to get one value per thread to
  // one value per warp
  for (int i = 0; i < getMSB(WARP_SIZE); ++i) {
    accscalar_t o_max = WARP_SHFL_XOR(max, 1 << i, WARP_SIZE);
    accscalar_t o_sumexp = WARP_SHFL_XOR(sumexp, 1 << i, WARP_SIZE);
    if (o_max > max) { // we're less concerned about divergence here
      sumexp *= exp(max - o_max);
      sumexp += o_sumexp;
      max = o_max;
    } else if (max != -std::numeric_limits<accscalar_t>::infinity()) {
      sumexp += o_sumexp * exp(o_max - max);
    }
  }

  __syncthreads();
  // this writes each warp's accumulation into shared memory
  // there are at most WARP_SIZE items left because
  // there are at most WARP_SIZE**2 threads at the beginning
  if (tid % WARP_SIZE == 0) {
    shared_mem[tid / WARP_SIZE * 2] = max;
    shared_mem[tid / WARP_SIZE * 2 + 1] = sumexp;
  }
  __syncthreads();
  if (tid < WARP_SIZE) {
    max = (tid < blockDim.x / WARP_SIZE ? shared_mem[2 * tid] : -std::numeric_limits<accscalar_t>::infinity());
    sumexp = (tid < blockDim.x / WARP_SIZE ? shared_mem[2 * tid + 1] : 0);
  }
  for (int i = 0; i < getMSB(WARP_SIZE); ++i) {
    accscalar_t o_max = WARP_SHFL_XOR(max, 1 << i, WARP_SIZE);
    accscalar_t o_sumexp = WARP_SHFL_XOR(sumexp, 1 << i, WARP_SIZE);
    if (o_max > max) { // we're less concerned about divergence here
      sumexp *= exp(max - o_max);
      sumexp += o_sumexp;
      max = o_max;
    } else if (max != -std::numeric_limits<accscalar_t>::infinity()) {
      sumexp += o_sumexp * exp(o_max - max);
    }
  }

  if (tid == 0) {
    log_v[b][j] = (max > -std::numeric_limits<accscalar_t>::infinity() ?
                   log_nu[b][j] - log(sumexp) - max :
                   -std::numeric_limits<accscalar_t>::infinity());
  }
}

template <typename scalar_t>
torch::Tensor sinkstep_cuda_template(const torch::Tensor& dist, const torch::Tensor& log_nu, const torch::Tensor& log_u,
                                     const double lambda) {
  TORCH_CHECK(dist.is_cuda(), "need cuda tensors");
  TORCH_CHECK(dist.device() == log_nu.device() && dist.device() == log_u.device(), "need tensors on same GPU");
  TORCH_CHECK(dist.dim()==2 && log_nu.dim()==2 && log_u.dim()==2, "invalid sizes");
  TORCH_CHECK(dist.size(0) == log_u.size(1) &&
              dist.size(1) == log_nu.size(1) &&
              log_u.size(0) == log_nu.size(0), "invalid sizes");
  auto log_v = torch::empty_like(log_nu);
  using index_t = int32_t;

  auto log_v_a = log_v.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();
  auto dist_a = dist.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();
  auto log_nu_a = log_nu.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();
  auto log_u_a = log_u.packed_accessor<scalar_t, 2, RestrictPtrTraits, index_t>();

  auto stream = at::cuda::getCurrentCUDAStream();

  int tf = getNumThreads(log_u.size(1));
  dim3 blocks(log_v.size(1), log_u.size(0));
  dim3 threads(tf);

  sinkstep_kernel<<<blocks, threads, 2*WARP_SIZE*sizeof(scalar_t), stream>>>(
    log_v_a, dist_a, log_nu_a, log_u_a, static_cast<scalar_t>(lambda)
    );
  return log_v;
}

torch::Tensor sinkstep_cuda(const torch::Tensor& dist, const torch::Tensor& log_nu, const torch::Tensor& log_u,
                            const double lambda) {
  return AT_DISPATCH_FLOATING_TYPES(log_u.scalar_type(), "sinkstep", [&] {
    return sinkstep_cuda_template<scalar_t>(dist, log_nu, log_u, lambda);
  });
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("sinkstep", &sinkstep_cuda, "sinkhorn step");
}
"""
wasserstein_ext = torch.utils.cpp_extension.load_inline("wasserstein", cpp_sources="", cuda_sources=cuda_source,
                                                        extra_cuda_cflags=["--expt-relaxed-constexpr"])
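A minimal smoke test for the freshly built module (this assumes a CUDA device; the shapes follow the TORCH_CHECKs in sinkstep_cuda_template: dist is d1 x d2, log_u is bs x d1, log_nu is bs x d2):
# One Sinkhorn step on random inputs; log_v comes back with the shape of log_nu.
import torch

d1, d2, bs = 128, 96, 4
dist = torch.rand(d1, d2, device="cuda")
log_nu = torch.log_softmax(torch.randn(bs, d2, device="cuda"), dim=1)
log_u = torch.zeros(bs, d1, device="cuda")
log_v = wasserstein_ext.sinkstep(dist, log_nu, log_u, 1e-3)
assert log_v.shape == (bs, d2)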
Hi, how can I use mexh (the Mexican hat wavelet) with your code?
Can you show me a sample for the mexh wavelet?
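For reference, here is a generic Mexican hat (Ricker) wavelet in plain PyTorch. This is just the textbook formula, not this repo's API:
# psi(t) = 2 / (sqrt(3*sigma) * pi**0.25) * (1 - (t/sigma)**2) * exp(-t**2 / (2*sigma**2))
import math
import torch

def mexican_hat(t, sigma=1.0):
    c = 2.0 / (math.sqrt(3.0 * sigma) * math.pi ** 0.25)
    return c * (1 - (t / sigma) ** 2) * torch.exp(-(t ** 2) / (2 * sigma ** 2))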
Hello! Firstly, thank you for the awesome work!
I had a question about Pytorch_Wasserstein.ipynb:
In WassersteinLossVanilla, why is it
self.K = torch.exp(-self.cost/self.lam)
?
Shouldn't it be
self.K = torch.exp(-self.cost*self.lam)
?
Mocha also uses the latter: https://github.com/pluskid/Mocha.jl/blob/5e15b882d7dd615b0c5159bb6fde2cc040b2d8ee/src/layers/wasserstein-loss.jl#L33
Have you changed it because "Note that we use a different convention for
Also, what is the reason for this?
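For reference, the two conventions define the same Gibbs kernel up to renaming the regularization parameter, which is presumably what the quoted remark about a different convention refers to (this reconciliation is standard, not taken from the notebook):
K = \exp(-C/\varepsilon) \quad\text{and}\quad K = \exp(-\lambda C) \quad\text{coincide for}\quad \lambda = 1/\varepsilon,
so the notebook's lam plays the role of \varepsilon while Mocha's lam plays the role of \lambda.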
Hi @t-vi,
Thanks for sharing your code!
I would like to ask a question regarding your implementation of the Sinkhorn algorithm. You stated that one of the main motivations was to obtain efficient batched computation. However, looking at the code I observe that it only supports the case where the cost matrix is the same across the batch:
def forward(ctx, mu, nu, dist, lam=1e-3, N=100):
    assert mu.dim() == 2 and nu.dim() == 2 and dist.dim() == 2
    bs = mu.size(0)
    d1, d2 = dist.size()
    assert nu.size(0) == bs and mu.size(1) == d1 and nu.size(1) == d2
That is, the shape of dist is d1 x d2 instead of bs x d1 x d2. Is this expected?
Thank you in advance for your reply.
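(For anyone needing per-sample cost matrices with the 2-D dist interface quoted above, a hypothetical workaround is to run the 2-D version once per sample and stack the results; apply_fn below stands in for the Function's apply and is not a name from this repo.)
import torch

def batched_loss(apply_fn, mu, nu, dist_batched, lam=1e-3, N=100):
    # mu: bs x d1, nu: bs x d2, dist_batched: bs x d1 x d2
    return torch.stack([
        apply_fn(mu[b:b + 1], nu[b:b + 1], dist_batched[b], lam, N)
        for b in range(mu.size(0))
    ])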
I got unstable results on CPU. The get_coupling function produced inf values when executed on the CPU.
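(A generic stabilization sketch, assuming get_coupling forms the coupling as diag(u) K diag(v) with K = exp(-dist/lam): staying in log space until a final exp avoids the overflow that produces the infs. The names mirror the extension code above, not get_coupling's actual internals.)
import torch

def coupling_from_logs(dist, log_u, log_v, lam):
    # dist: d1 x d2, log_u: bs x d1, log_v: bs x d2 -> coupling: bs x d1 x d2
    log_p = log_u.unsqueeze(-1) - dist / lam + log_v.unsqueeze(-2)
    return log_p.exp()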
I tried to reproduce the PyTorch Wasserstein distance under Windows, but it shows the problems below:
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\Alienware.conda\envs\pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1293, in load_inline
return _jit_compile(
File "C:\Users\Alienware.conda\envs\pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1382, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "C:\Users\Alienware.conda\envs\pytorch\lib\site-packages\torch\utils\cpp_extension.py", line 1775, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1166, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: DLL load failed while importing wasserstein: The specified module could not be found
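(If anyone else hits this: the message usually means a dependency of the built extension, e.g. the CUDA runtime DLLs, cannot be found rather than the extension itself. A sketch of one possible fix, assuming Python 3.8+, which no longer searches PATH for the dependencies of extension modules; the directory below is a made-up example:)
import os

# Register the CUDA bin directory before calling load_inline / importing
# the built extension.
os.add_dll_directory("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2\\bin")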
Hi @t-vi,
First of all, thank you for sharing your impressive work. Right now I'm using the code you used for comparison to calculate the Wasserstein loss. However, that takes around 4 minutes for one batch in my case, which is too long, and your work seems much faster.
However, when I try to run your code on my server, I get the error below. Do you know what this means? The server I use is a team server, and I don't want to change gcc without knowing whether that will mess up the current environment.
I appreciate any help you can provide!
/home/anyu/anaconda3/lib/python3.6/site-packages/torch/utils/cpp_extension.py:118: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 4.9 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 4.9 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
CalledProcessError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _build_extension_module(name, build_directory)
758 subprocess.check_output(
--> 759 ['ninja', '-v'], stderr=subprocess.STDOUT, cwd=build_directory)
760 except subprocess.CalledProcessError:
~/anaconda3/lib/python3.6/subprocess.py in check_output(timeout, *popenargs, **kwargs)
335 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 336 **kwargs).stdout
337
~/anaconda3/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
417 raise CalledProcessError(retcode, process.args,
--> 418 output=stdout, stderr=stderr)
419 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-...> in <module>()
1 import torch
2 wasserstein_ext = torch.utils.cpp_extension.load_inline("wasserstein", cpp_sources="", cuda_sources=cuda_source,
----> 3 extra_cuda_cflags=["--expt-relaxed-constexpr"] )
4
5 def sinkstep(dist, log_nu, log_u, lam: float):
~/anaconda3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load_inline(name, cpp_sources, cuda_sources, functions, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
639 build_directory,
640 verbose,
--> 641 with_cuda=with_cuda)
642
643
~/anaconda3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda)
680 if verbose:
681 print('Building extension module {}...'.format(name))
--> 682 _build_extension_module(name, build_directory)
683 finally:
684 baton.release()
~/anaconda3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _build_extension_module(name, build_directory)
763 # error.output contains the stdout and stderr of the build attempt.
764 raise RuntimeError("Error building extension '{}': {}".format(
--> 765 name, error.output.decode()))
766
767
RuntimeError: Error building extension 'wasserstein': [1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=wasserstein -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/anyu/anaconda3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-relaxed-constexpr -std=c++11 -c /tmp/torch_extensions/wasserstein/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=wasserstein -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/anyu/anaconda3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options '-fPIC' --expt-relaxed-constexpr -std=c++11 -c /tmp/torch_extensions/wasserstein/cuda.cu -o cuda.cuda.o
/tmp/torch_extensions/wasserstein/cuda.cu:6:29: fatal error: torch/extension.h: No such file or directory
compilation terminated.
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=wasserstein -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/TH -I/home/anyu/anaconda3/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/local/cuda/include -I/home/anyu/anaconda3/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++11 -c /tmp/torch_extensions/wasserstein/main.cpp -o main.o
ninja: build stopped: subcommand failed.
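(The actual failure is the nvcc step: torch/extension.h is not found. That header ships with the PyTorch headers themselves, and the torch/lib/include paths in the log point to a fairly old install, so upgrading PyTorch is more likely to help than changing gcc. A quick sanity check:)
# Print the header search paths PyTorch hands to the compiler; one of them
# should contain torch/extension.h (it shipped from around PyTorch 1.0 on).
import torch
import torch.utils.cpp_extension

print(torch.__version__)
print(torch.utils.cpp_extension.include_paths())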