ap-hynninen / cutt
CUDA Tensor Transpose (cuTT) library
Hello, I am using cuTT to do a transpose, but I have run into a problem: 'Illegal instruction (core dumped)'. My code is:
`int main() {
  // Four dimensional tensor
  // Transpose (31, 549, 2, 3) -> (3, 31, 2, 549)
  int dim[4] = {31, 549, 2, 3};
  int permutation[4] = {3, 0, 2, 1};
  int size = 1;
  for (int i = 0; i < sizeof(permutation) / sizeof(permutation[0]); i++)
  {
    size = dim[i]*size;
  }
  double *idata = new double[size]();
  double *odata = new double[size];
  // Option 1: Create plan on NULL stream and choose implementation based on heuristics
  cuttHandle plan;
  cuttCheck(cuttPlan(&plan, 4, dim, permutation, sizeof(double), 0));
  // Option 2: Create plan on NULL stream and choose implementation based on performance measurements
  // cuttCheck(cuttPlanMeasure(&plan, 4, dim, permutation, sizeof(double), 0, idata, odata));
  // Execute plan
  cuttCheck(cuttExecute(plan, idata, odata));
  cout << odata << endl;
  // Destroy plan
  cuttCheck(cuttDestroy(plan));
  delete[](idata);
  delete[](odata);
  return 0;
}`
Then I ran it under gdb and found that the problem occurs at cuttCheck(cuttPlan(&plan, 4, dim, permutation, sizeof(double), 0));
I also ran cutt_test and the same problem happens.
Thanks.
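A side note on the snippet above: `cout << odata` prints the pointer value rather than the tensor contents, so it cannot confirm whether the transpose worked. One way to check a result independently is a small CPU reference. The following is only a sketch: `referenceTranspose` is a hypothetical helper, not part of cuTT, and it assumes that index 0 is the fastest-varying dimension and that output dimension i has extent dim[permutation[i]], as in the (31, 549, 2, 3) -> (3, 31, 2, 549) comment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not part of cuTT.
// Assumption: index 0 is the fastest-varying dimension of both tensors,
// and output dimension i corresponds to input dimension permutation[i].
template <typename T>
std::vector<T> referenceTranspose(const std::vector<T>& in,
                                  const std::vector<int>& dim,
                                  const std::vector<int>& permutation) {
  const std::size_t rank = dim.size();
  assert(permutation.size() == rank);

  // Strides of the input tensor and its total number of elements.
  std::vector<std::size_t> inStride(rank);
  std::size_t volume = 1;
  for (std::size_t i = 0; i < rank; ++i) {
    inStride[i] = volume;
    volume *= static_cast<std::size_t>(dim[i]);
  }
  assert(in.size() == volume);

  std::vector<T> out(volume);
  std::vector<int> idx(rank, 0);  // multi-index into the output tensor
  for (std::size_t o = 0; o < volume; ++o) {
    // Map the output multi-index back to a linear input offset.
    std::size_t inOffset = 0;
    for (std::size_t i = 0; i < rank; ++i) {
      inOffset += static_cast<std::size_t>(idx[i]) * inStride[permutation[i]];
    }
    out[o] = in[inOffset];

    // Advance the output multi-index, dimension 0 first.
    for (std::size_t i = 0; i < rank; ++i) {
      if (++idx[i] < dim[permutation[i]]) break;
      idx[i] = 0;
    }
  }
  return out;
}
```

Comparing this against the data copied back from the GPU, element by element, gives a direct pass/fail answer for a given dim and permutation.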
Running ./cutt_test gives:
cudaGetLastError() in file src/TensorTester.cu, function setTensorCheckPattern
Error String: no kernel image is available for execution on the device
Running ./cutt_bench gives:
Using GeForce GTX 950M SM version 5.0
Clock 1.124Ghz numSM 5 ECC 0 mem BW 28.80GB/s shMemBankSize 4B
L2 2.00MB
CPU using vector type AVX2 of length 8
cudaMalloc(pp, sizeofT*len) in file src/CudaUtils.cu, function allocate_device_T
Error String: out of memory
version: git commit 4c251c6
make stdout: https://paste.ubuntu.com/p/xJPfMg7V3D/
make stderr: https://paste.ubuntu.com/p/5kJt82yQGJ/
Hi,
many thanks for your library - it seems to be a really useful tool for GPU codes!
I am testing it on Summit and find the following error:
cudaFuncSetSharedMemConfig(transposePacked<float, 1>, cudaSharedMemBankSizeFourByte ) in file src/calls.h, function cuttKernelSetSharedMemConfig
Error String: invalid device function
Please let me know what is going on.
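Both "no kernel image is available for execution on the device" and "invalid device function" commonly indicate that the binary was not compiled for the architecture of the GPU it runs on, although neither report above confirms this as the cause. The GTX 950M reports SM version 5.0, and Summit's V100 GPUs are compute capability 7.0. As a rough sketch, the matching nvcc `-gencode` option can be derived from the compute capability; `gencodeFlag` is a hypothetical helper for illustration, and where the flag belongs in the cuTT build is not shown in these reports.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: build the nvcc -gencode option for a compute
// capability, e.g. (5, 0) for the GTX 950M reported above.
std::string gencodeFlag(int major, int minor) {
  assert(major > 0 && minor >= 0);
  const std::string cc = std::to_string(major) + std::to_string(minor);
  return "-gencode arch=compute_" + cc + ",code=sm_" + cc;
}
```

For the devices mentioned here this yields `-gencode arch=compute_50,code=sm_50` and `-gencode arch=compute_70,code=sm_70` respectively.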
Would you like to wrap any pointer data members with the class template “std::unique_ptr”?
Update candidates:
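To illustrate the kind of change being proposed, here is a minimal sketch of a class that owns a raw allocation through std::unique_ptr with a custom deleter. `Buffer` and `FreeDeleter` are hypothetical names, and std::malloc/std::free stand in for whatever allocation functions the real members use (for device memory that would presumably be cudaMalloc/cudaFree); none of this is taken from the cuTT sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical deleter; std::free stands in for the real release call.
struct FreeDeleter {
  void operator()(void* p) const { std::free(p); }
};

// Hypothetical owner class: the allocation is released automatically
// when the object goes out of scope, and copying is disabled for free.
class Buffer {
public:
  explicit Buffer(std::size_t bytes)
      : data_(std::malloc(bytes)), bytes_(bytes) {
    assert(bytes > 0);  // zero-sized buffers are not handled in this sketch
  }
  void* get() const { return data_.get(); }
  std::size_t size() const { return bytes_; }

private:
  std::unique_ptr<void, FreeDeleter> data_;
  std::size_t bytes_;
};
```

The main benefit is that no explicit destructor or delete call is needed, and moving a `Buffer` transfers ownership without any risk of a double free.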
I believe we need the -std=c++11 flag in the CUDA flags as well. Without it the code did not compile in my case, since it uses "nullptr", which is a C++11 feature.
On a fresh Ubuntu 16.04 with a recent CUDA, compiling the .cu files fails because a symbol in string.h cannot be resolved. Adding the flag "-D_FORCE_INLINES" to the nvcc flags solves the problem.
The output is empty when one of the dims is 1, for example:
` int dim[4] = {W, H, C, N};
int permutation[4] = {3, 0, 1, 2};
cuttHandle handle;
cuttPlan(&handle, 4, dim, permutation, sizeof(float), streamId);
cuttExecute(handle, in, out);
cuttDestroy(handle);`
The output is empty when W == 1.
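A possible workaround, not confirmed as a fix in this report, is to drop the size-1 dimensions before creating the plan. A dimension of extent 1 does not affect the memory layout, so the squeezed problem describes the same data movement. The sketch below is a hypothetical helper (`squeezeUnitDims` and `SqueezedTensor` are not part of cuTT) that removes unit dimensions and renumbers the permutation accordingly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: the reduced dims and matching permutation.
struct SqueezedTensor {
  std::vector<int> dim;
  std::vector<int> permutation;
};

// Hypothetical helper: remove all dimensions of extent 1 and renumber
// the permutation so it refers to the remaining dimensions.
// If every dimension is 1, both vectors come back empty and the
// "transpose" is a plain copy of a single element.
SqueezedTensor squeezeUnitDims(const std::vector<int>& dim,
                               const std::vector<int>& permutation) {
  assert(dim.size() == permutation.size());

  SqueezedTensor out;
  std::vector<int> newIndex(dim.size(), -1);  // -1 marks a removed dimension
  for (std::size_t i = 0; i < dim.size(); ++i) {
    if (dim[i] != 1) {
      newIndex[i] = static_cast<int>(out.dim.size());
      out.dim.push_back(dim[i]);
    }
  }
  // Keep the relative order of the surviving entries of the permutation.
  for (int p : permutation) {
    if (newIndex[p] >= 0) out.permutation.push_back(newIndex[p]);
  }
  return out;
}
```

With dim = {1, H, C, N} and permutation = {3, 0, 1, 2}, this gives dim = {H, C, N} and permutation = {2, 0, 1}, which can then be passed to cuttPlan with rank 3.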
==21682== Conditional jump or move depends on uninitialised value(s)
==21682== at 0x41E27D: computePos0(int, int const*, int const*, int const*, int const*, int*, int*) (cuttGpuModel.cpp:249)
==21682== by 0x41E429: computePos0(int, TensorConvInOut const*, int, int*, int*) (cuttGpuModel.cpp:294)
==21682== by 0x40B9C1: cuttPlan_t::countCycles(cudaDeviceProp&, int) (cuttplan.cpp:1126)
==21682== by 0x409A30: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (cutt.cpp:148)
==21682== by 0x4046D3: bool test_tensor(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) (cutt_test.cpp:465)
==21682== by 0x4031DE: test1() (cutt_test.cpp:151)
==21682== by 0x401D72: main (cutt_test.cpp:102)
==21682==
==21682== Conditional jump or move depends on uninitialised value(s)
==21682== at 0x41E2BD: computePos0(int, int const*, int const*, int const*, int const*, int*, int*) (cuttGpuModel.cpp:256)
==21682== by 0x41E429: computePos0(int, TensorConvInOut const*, int, int*, int*) (cuttGpuModel.cpp:294)
==21682== by 0x40B9C1: cuttPlan_t::countCycles(cudaDeviceProp&, int) (cuttplan.cpp:1126)
==21682== by 0x409A30: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (cutt.cpp:148)
==21682== by 0x4046D3: bool test_tensor(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) (cutt_test.cpp:465)
==21682== by 0x4031DE: test1() (cutt_test.cpp:151)
==21682== by 0x401D72: main (cutt_test.cpp:102)
==21682==
==21682== Conditional jump or move depends on uninitialised value(s)
==21682== at 0x41E27D: computePos0(int, int const*, int const*, int const*, int const*, int*, int*) (cuttGpuModel.cpp:249)
==21682== by 0x41E429: computePos0(int, TensorConvInOut const*, int, int*, int*) (cuttGpuModel.cpp:294)
==21682== by 0x40BA5F: cuttPlan_t::countCycles(cudaDeviceProp&, int) (cuttplan.cpp:1154)
==21682== by 0x409A30: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (cutt.cpp:148)
==21682== by 0x4046D3: bool test_tensor(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) (cutt_test.cpp:465)
==21682== by 0x4031DE: test1() (cutt_test.cpp:151)
==21682== by 0x401D72: main (cutt_test.cpp:102)
==21682==
==21682== Conditional jump or move depends on uninitialised value(s)
==21682== at 0x41E2BD: computePos0(int, int const*, int const*, int const*, int const*, int*, int*) (cuttGpuModel.cpp:256)
==21682== by 0x41E429: computePos0(int, TensorConvInOut const*, int, int*, int*) (cuttGpuModel.cpp:294)
==21682== by 0x40BA5F: cuttPlan_t::countCycles(cudaDeviceProp&, int) (cuttplan.cpp:1154)
==21682== by 0x409A30: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (cutt.cpp:148)
==21682== by 0x4046D3: bool test_tensor(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) (cutt_test.cpp:465)
==21682== by 0x4031DE: test1() (cutt_test.cpp:151)
==21682== by 0x401D72: main (cutt_test.cpp:102)
==21682==
==21682== Conditional jump or move depends on uninitialised value(s)
==21682== at 0x41FCF7: countPackedShTransactions0(int, int, int, int, TensorConv const*, int, int&, int&, int&, int&) (cuttGpuModel.cpp:513)
==21682== by 0x40C29F: cuttPlan_t::countCycles(cudaDeviceProp&, int) (cuttplan.cpp:1352)
==21682== by 0x409A30: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (cutt.cpp:148)
==21682== by 0x4046D3: bool test_tensor(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) (cutt_test.cpp:465)
==21682== by 0x4031DE: test1() (cutt_test.cpp:151)
==21682== by 0x401D72: main (cutt_test.cpp:102)
==21682==
==21682== Conditional jump or move depends on uninitialised value(s)
==21682== at 0x41FCF7: countPackedShTransactions0(int, int, int, int, TensorConv const*, int, int&, int&, int&, int&) (cuttGpuModel.cpp:513)
==21682== by 0x40C346: cuttPlan_t::countCycles(cudaDeviceProp&, int) (cuttplan.cpp:1384)
==21682== by 0x409A30: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (cutt.cpp:148)
==21682== by 0x4046D3: bool test_tensor(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&) (cutt_test.cpp:465)
==21682== by 0x4031DE: test1() (cutt_test.cpp:151)
==21682== by 0x401D72: main (cutt_test.cpp:102)
Hi, I am running into this problem when constructing cuttPlan:
cuttPlan(&m_rot_plan[0], 3, dim_0, permu_0, sizeof(int), nullptr);
I also ran it under valgrind. The relevant output is:
`==8915== Process terminating with default action of signal 6 (SIGABRT)
==8915== at 0x69D9FB7: raise (raise.c:51)
==8915== by 0x69DB920: abort (abort.c:79)
==8915== by 0x6A24966: __libc_message (libc_fatal.c:181)
==8915== by 0x6ACFB60: __fortify_fail_abort (fortify_fail.c:33)
==8915== by 0x6ACFB21: __stack_chk_fail (stack_chk_fail.c:29)
==8915== by 0x21B386: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (in /home/joseph/yzchen_ws/UAV/cpc_ws/devel/lib/cpc_aux_mapping/cpc_aux_mapping_node)
==8915== by 0x18C140: AuxMapper::AuxMapper() (aux_mapper.cpp:67)
==8915== by 0x184201: main (aux_mapping_node.cpp:7)
==8915==
==8915== HEAP SUMMARY:
==8915== in use at exit: 15,351,186 bytes in 16,317 blocks
==8915== total heap usage: 23,410 allocs, 7,093 frees, 62,269,008 bytes allocated
==8915== 104 bytes in 1 blocks are possibly lost in loss record 1,715 of 3,065
==8915== at 0x4C33B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8915== by 0xE4B53C2: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01)
==8915== by 0xE4B5B90: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01)
==8915== by 0xE4B6690: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01)
==8915== by 0xE3520E4: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01)
==8915== by 0xE40C1B6: cuMemAlloc_v2 (in /usr/lib/x86_64-linux-gnu/libcuda.so.460.73.01)
==8915== by 0x1E9CAD: __cudart602 (in /home/joseph/yzchen_ws/UAV/cpc_ws/devel/lib/cpc_aux_mapping/cpc_aux_mapping_node)
==8915== by 0x1BFFAA: __cudart607 (in /home/joseph/yzchen_ws/UAV/cpc_ws/devel/lib/cpc_aux_mapping/cpc_aux_mapping_node)
==8915== by 0x1F577A: cudaMalloc (in /home/joseph/yzchen_ws/UAV/cpc_ws/devel/lib/cpc_aux_mapping/cpc_aux_mapping_node)
==8915== by 0x23BA69: allocate_device_T(void**, unsigned long, unsigned long) (in /home/joseph/yzchen_ws/UAV/cpc_ws/devel/lib/cpc_aux_mapping/cpc_aux_mapping_node)
==8915== by 0x21C430: cuttPlan_t::activate() (in /home/joseph/yzchen_ws/UAV/cpc_ws/devel/lib/cpc_aux_mapping/cpc_aux_mapping_node)
==8915== by 0x21B204: cuttPlan(unsigned int*, int, int*, int*, unsigned long, CUstream_st*) (in /home/joseph/yzchen_ws/UAV/cpc_ws/devel/lib/cpc_aux_mapping/cpc_aux_mapping_node)
==8915== by 0x18C140: AuxMapper::AuxMapper() (aux_mapper.cpp:67)
==8915== by 0x184201: main (aux_mapping_node.cpp:7)
`
Has anyone had a similar problem?
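The trace ends in __stack_chk_fail inside cuttPlan, which means the stack protector found an overwritten stack canary. The report does not establish why. One cheap thing to rule out is a mismatch between the rank passed to cuttPlan and the actual lengths of the dim and permutation arrays, or a permutation that is not a valid reordering of 0..rank-1. The following sketch is a hypothetical pre-check (`isValidTransposeSpec` is not part of cuTT) that could run before the call, assuming the caller keeps dim and permutation in std::vector.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical pre-check, not part of cuTT: returns true only if both
// arrays have the same non-zero length, every extent is positive, and
// the permutation contains each index 0..rank-1 exactly once.
bool isValidTransposeSpec(const std::vector<int>& dim,
                          const std::vector<int>& permutation) {
  const std::size_t rank = dim.size();
  if (rank == 0 || permutation.size() != rank) return false;

  std::vector<bool> seen(rank, false);
  for (std::size_t i = 0; i < rank; ++i) {
    if (dim[i] <= 0) return false;
    const int p = permutation[i];
    if (p < 0 || static_cast<std::size_t>(p) >= rank) return false;
    if (seen[p]) return false;  // index used twice
    seen[p] = true;
  }
  return true;
}
```

If this check passes for dim_0 and permu_0 with rank 3, the arguments themselves are unlikely to be the problem and the cause has to be looked for elsewhere.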