
pycuda's Introduction

PyCUDA: Pythonic Access to CUDA, with Arrays and Algorithms


PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist, so what's so special about PyCUDA?

  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won't detach from a context before all memory allocated in it is also freed.
  • Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with Nvidia's C-based runtime.
  • Completeness. PyCUDA puts the full power of CUDA's driver API at your disposal, if you wish. It also includes code for interoperability with OpenGL.
  • Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.
  • Speed. PyCUDA's base layer is written in C++, so all the niceties above are virtually free.
  • Helpful Documentation.

Relatedly, like-minded computing goodness for OpenCL is provided by PyCUDA's sister project PyOpenCL.


pycuda's Issues

memory leak in reshape

import pycuda.autoinit
from pycuda import gpuarray
import numpy as np

shape = (3, 4)
a = gpuarray.zeros(shape, np.float32)
while True:
    a = a.reshape(shape)

You'll see all your CPU memory leak away eventually. I suspect that a newly constructed GPUArray object is returned every time reshape is called, without the previous one being freed.

gpuarray.reshape does not accept "-1"

NumPy ndarrays accept "-1" as a value in their reshape operation, which essentially stands for "fill in whatever is left over". This is useful e.g. to turn row vectors into column vectors:

x.reshape(-1, 1)    # this is now a column-vector
x.reshape(-1)   # essentially the same as x.ravel()

I use this feature often and find it very convenient; however, PyCUDA currently does not support it. It would be nice if this could be added.
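Until that lands, the inferred dimension can be computed on the host before calling reshape. The helper below is hypothetical (not PyCUDA API); it mimics NumPy's rules for a single -1 entry:

```python
def infer_reshape(size, shape):
    """Replace a single -1 in `shape` with the inferred dimension,
    mimicking NumPy's reshape semantics (hypothetical helper)."""
    shape = tuple(int(s) for s in shape)
    if shape.count(-1) > 1:
        raise ValueError("can only specify one unknown dimension")
    if -1 not in shape:
        return shape
    known = 1
    for s in shape:
        if s != -1:
            known *= s
    if known == 0 or size % known != 0:
        raise ValueError("cannot reshape size %d into shape %r" % (size, shape))
    return tuple(size // known if s == -1 else s for s in shape)

assert infer_reshape(12, (-1, 1)) == (12, 1)
assert infer_reshape(12, (3, -1)) == (3, 4)
```

The concrete shape it returns could then be handed to the existing gpuarray reshape.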

ImportError: No module named _driver

When I run any of the example PyCUDA programs, I get ImportError: No module named _driver when trying to do import pycuda.driver as drv or similar. PyCUDA appeared to install OK. I am using the newest CUDA downloads on OS X 10.11.

➜  pycuda-2015.1.3  sudo make install                                                                     
Password:
ctags -R src || true
/usr/local/opt/python/bin/python2.7 setup.py install
running install
running bdist_egg
running egg_info
writing requirements to pycuda.egg-info/requires.txt
writing pycuda.egg-info/PKG-INFO
writing top-level names to pycuda.egg-info/top_level.txt
writing dependency_links to pycuda.egg-info/dependency_links.txt
reading manifest file 'pycuda.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.cpp' under directory 'bpl-subset/bpl_subset/boost'
warning: no files found matching '*.html' under directory 'bpl-subset/bpl_subset/boost'
warning: no files found matching '*.inl' under directory 'bpl-subset/bpl_subset/boost'
warning: no files found matching '*.txt' under directory 'bpl-subset/bpl_subset/boost'
warning: no files found matching '*.h' under directory 'bpl-subset/bpl_subset/libs'
warning: no files found matching '*.ipp' under directory 'bpl-subset/bpl_subset/libs'
warning: no files found matching '*.pl' under directory 'bpl-subset/bpl_subset/libs'
writing manifest file 'pycuda.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.11-x86_64/egg
running install_lib
running build_py
running build_ext
creating build/bdist.macosx-10.11-x86_64/egg
creating build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/__init__.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/_cluda.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/_driver.so -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/_mymako.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/_pvt_struct.so -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/autoinit.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/characterize.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/compiler.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
creating build/bdist.macosx-10.11-x86_64/egg/pycuda/compyte
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/compyte/__init__.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/compyte
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/compyte/array.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/compyte
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/compyte/dtypes.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/compyte
creating build/bdist.macosx-10.11-x86_64/egg/pycuda/cuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/cuda/pycuda-complex-impl.hpp -> build/bdist.macosx-10.11-x86_64/egg/pycuda/cuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/cuda/pycuda-complex.hpp -> build/bdist.macosx-10.11-x86_64/egg/pycuda/cuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/cuda/pycuda-helpers.hpp -> build/bdist.macosx-10.11-x86_64/egg/pycuda/cuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/cumath.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/curandom.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/debug.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/driver.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/elementwise.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
creating build/bdist.macosx-10.11-x86_64/egg/pycuda/gl
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/gl/__init__.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/gl
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/gl/autoinit.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/gl
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/gpuarray.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/reduction.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/scan.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
creating build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/sparse/__init__.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/sparse/cg.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/sparse/coordinate.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/sparse/inner.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/sparse/operator.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/sparse/packeted.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/sparse/pkt_build.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse
copying build/lib.macosx-10.11-x86_64-2.7/pycuda/tools.py -> build/bdist.macosx-10.11-x86_64/egg/pycuda
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/_cluda.py to _cluda.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/_mymako.py to _mymako.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/autoinit.py to autoinit.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/characterize.py to characterize.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/compiler.py to compiler.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/compyte/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/compyte/array.py to array.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/compyte/dtypes.py to dtypes.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/cumath.py to cumath.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/curandom.py to curandom.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/debug.py to debug.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/driver.py to driver.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/elementwise.py to elementwise.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/gl/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/gl/autoinit.py to autoinit.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/gpuarray.py to gpuarray.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/reduction.py to reduction.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/scan.py to scan.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse/cg.py to cg.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse/coordinate.py to coordinate.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse/inner.py to inner.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse/operator.py to operator.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse/packeted.py to packeted.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/sparse/pkt_build.py to pkt_build.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/tools.py to tools.pyc
creating stub loader for pycuda/_driver.so
creating stub loader for pycuda/_pvt_struct.so
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/_driver.py to _driver.pyc
byte-compiling build/bdist.macosx-10.11-x86_64/egg/pycuda/_pvt_struct.py to _pvt_struct.pyc
creating build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying pycuda.egg-info/PKG-INFO -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying pycuda.egg-info/SOURCES.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying pycuda.egg-info/dependency_links.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying pycuda.egg-info/not-zip-safe -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying pycuda.egg-info/requires.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
copying pycuda.egg-info/top_level.txt -> build/bdist.macosx-10.11-x86_64/egg/EGG-INFO
writing build/bdist.macosx-10.11-x86_64/egg/EGG-INFO/native_libs.txt
creating 'dist/pycuda-2015.1.3-py2.7-macosx-10.11-x86_64.egg' and adding 'build/bdist.macosx-10.11-x86_64/egg' to it
removing 'build/bdist.macosx-10.11-x86_64/egg' (and everything under it)
Processing pycuda-2015.1.3-py2.7-macosx-10.11-x86_64.egg
removing '/usr/local/lib/python2.7/site-packages/pycuda-2015.1.3-py2.7-macosx-10.11-x86_64.egg' (and everything under it)
creating /usr/local/lib/python2.7/site-packages/pycuda-2015.1.3-py2.7-macosx-10.11-x86_64.egg
Extracting pycuda-2015.1.3-py2.7-macosx-10.11-x86_64.egg to /usr/local/lib/python2.7/site-packages
pycuda 2015.1.3 is already the active version in easy-install.pth

Installed /usr/local/lib/python2.7/site-packages/pycuda-2015.1.3-py2.7-macosx-10.11-x86_64.egg
Processing dependencies for pycuda==2015.1.3
Searching for appdirs==1.4.0
Best match: appdirs 1.4.0
Adding appdirs 1.4.0 to easy-install.pth file

Using /usr/local/lib/python2.7/site-packages
Searching for decorator==4.0.6
Best match: decorator 4.0.6
Adding decorator 4.0.6 to easy-install.pth file

Using /usr/local/lib/python2.7/site-packages
Searching for pytest==2.8.7
Best match: pytest 2.8.7
Adding pytest 2.8.7 to easy-install.pth file
Installing py.test-2.7 script to /usr/local/bin
Installing py.test script to /usr/local/bin

Using /usr/local/lib/python2.7/site-packages
Searching for pytools==2016.1
Best match: pytools 2016.1
Adding pytools 2016.1 to easy-install.pth file

Using /usr/local/lib/python2.7/site-packages
Searching for py==1.4.31
Best match: py 1.4.31
Adding py 1.4.31 to easy-install.pth file

Using /usr/local/lib/python2.7/site-packages
Searching for numpy==1.10.2
Best match: numpy 1.10.2
Adding numpy 1.10.2 to easy-install.pth file

Using /usr/local/lib/python2.7/site-packages
Searching for six==1.10.0
Best match: six 1.10.0
Adding six 1.10.0 to easy-install.pth file

Using /usr/local/lib/python2.7/site-packages
Finished processing dependencies for pycuda==2015.1.3
➜  pycuda-2015.1.3  python test.py   
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    import pycuda.driver as drv
  File "/Users/David/code/pycuda-2015.1.3/pycuda/driver.py", line 5, in <module>
    from pycuda._driver import *  # noqa
ImportError: No module named _driver

Implement way of distributing CUDA-powered application without requirement to have whole CUDA toolkit installed

Currently I can only compile a CUDA file to a cubin using pycuda.compiler.compile() for a specific real architecture and then load it using pycuda.driver.module_from_buffer().

AFAIK, to distribute a CUDA file without the compiler/CUDA toolkit, I need to either generate cubins for a fixed set of architectures, and/or generate a PTX file that can be assembled by the driver.

It would be nice if PyCUDA provided some infrastructure for generating, storing and loading cubin/PTX files, so that it would be possible to distribute CUDA-powered applications without the CUDA toolkit/compiler.

Python 3 compatibility issue with OS X

On a Mac using Python 3.x, PyCUDA will throw an exception in the compiler module when determining the host platform:

File "/opt/local/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pycuda-2016.1.1-py3.4-macosx-10.11-x86_64.egg/pycuda/compiler.py", line 243, in compile
    if 'darwin' in sys.platform and sys.maxint == 9223372036854775807:
AttributeError: 'module' object has no attribute 'maxint'

The code in question is just looking for maxint, which was removed from the sys module in Python 3.

if 'darwin' in sys.platform and sys.maxint == 9223372036854775807:
    options.append('-m64')

There is a pretty simple workaround (on my 64-bit Mac) that has worked for me thus far; to the extent that I have used PyCUDA, there do not seem to be any further problems with Python 3.x:

import sys
sys.maxint = 9223372036854775807
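A more portable variant avoids sys.maxint entirely: sys.maxsize exists on both Python 2 and Python 3, and on 64-bit builds it exceeds 2**32. A sketch of the check (not the actual PyCUDA code):

```python
import sys

# sys.maxsize exists on both Python 2 and 3; on 64-bit CPython it is
# 2**63 - 1, so this is a portable 64-bit-build test.
is_64bit = sys.maxsize > 2**32

options = []
if "darwin" in sys.platform and is_64bit:
    options.append("-m64")
```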

Thanks for making PyCUDA and PyOpenCL both, they have been terrific.

View into non-contiguous gpuarrays returns incorrect data

As of 2013.1, arbitrary slicing is implemented, returning a non-contiguous array when column-slicing a C-contiguous array or row-slicing an F-contiguous array. But the view function discards the flag and thus returns the wrong data:

In [1]: import pycuda.autoinit

In [2]: import numpy as np

In [3]: from pycuda.curandom import rand as curand

In [4]: from pycuda import gpuarray

In [5]: X = curand((5, 10), dtype=np.float32)

In [6]: Y = X[:3,:5]

In [7]: Y.flags.forc
Out[7]: False

In [8]: y = Y.view()

In [9]: y.flags.forc
Out[9]: True

In [10]: y.get() == X.get()[:3, :5]
Out[10]: 
array([[ True,  True,  True,  True,  True],
       [False, False, False, False, False],
       [False, False, False, False, False]], dtype=bool)

DeviceMemoryPool causes crash at shutdown

Consider the following piece of code:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpu
from pycuda.tools import DeviceMemoryPool

cuda_memory_pool = DeviceMemoryPool()
X = np.array([1, 2, 3.0])
X = gpu.to_gpu(X, allocator=cuda_memory_pool.allocate)

On my machine, this gives:

python2.7 test.py
terminate called after throwing an instance of 'pycuda::error'
  what():  explicit_context_dependent failed: invalid context - no currently active context?
Aborted

This crash happens after all cleanup operations, somewhere in the guts of PyCUDA (at this stage, the _finish_up() routine from pycuda.autoinit has already been run). While the main script runs through, the crash makes it impossible to profile scripts through the Nvidia Visual Profiler, which is rather annoying.

can't create GPUArray with shape stored as an ndarray

Passing a shape stored in an ndarray to the GPUArray constructor raises an exception because of ndarray broadcasting:

In [1]: import numpy as np, pycuda.autoinit, pycuda.gpuarray

In [2]: pycuda.gpuarray.empty(np.array([1,2]), np.float32)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-6a2209df87f6> in <module>()
----> 1 pycuda.gpuarray.empty(np.array([1,2]), np.float32)

/home/lev/Work/miniconda/envs/DEFAULT/lib/python2.7/site-packages/pycuda-2015.1.3-py2.7-linux-x86_64.egg/pycuda/gpuarray.pyc in __init__(self, shape, dtype, allocator, base, gpudata, strides, order)
    182             elif order == "C":
    183                 strides = _c_contiguous_strides(
--> 184                         dtype.itemsize, shape)
    185             else:
    186                 raise ValueError("invalid order: %s" % order)

/home/lev/Work/miniconda/envs/DEFAULT/lib/python2.7/site-packages/pycuda-2015.1.3-py2.7-linux-x86_64.egg/pycuda/compyte/array.pyc in c_contiguous_strides(itemsize, shape)
     37 
     38 def c_contiguous_strides(itemsize, shape):
---> 39     if shape:
     40         strides = [itemsize]
     41         for s in shape[:0:-1]:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
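A workaround on the caller's side is to convert the shape to a tuple of Python ints first, which sidesteps the ambiguous truth-value check inside c_contiguous_strides:

```python
import numpy as np

shape = np.array([1, 2])

# A tuple of plain ints behaves like a normal shape argument:
safe_shape = tuple(int(s) for s in shape)

# pycuda.gpuarray.empty(safe_shape, np.float32) would then work.
assert safe_shape == (1, 2)
```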

gpuarray.copy ignores allocator

Thank you so much for pycuda - makes life so much easier.

However, I just ran into some weird errors when I reused some old PyCUDA code and ran tests that used to work. They kept giving me cuMemAlloc failed: out of memory errors, which never occurred before. I tracked the issue down to gpuarray.copy(), which doesn't pass the allocator to the copy when creating it, eventually using unpooled memory. I'm not sure why this issue hasn't popped up before. I am using this in a sum with a generator expression, which first calls gpuarray.__add__(0), which calls copy.

Sorry, I couldn't get my head around a situation where the behavior is intended, so I simply fixed my problem with
new = GPUArray(self.shape, self.dtype, self.allocator)

Cheers

`empty_like` ignores memory-order

Consider:

x = np.random.normal(size=(3, 5)).astype(np.float64, order="F")
x_gpu = gpuarray.to_gpu(x)
y_gpu = gpuarray.empty_like(x_gpu)
x_gpu.flags.c_contiguous == y_gpu.flags.c_contiguous   # gives "False"

I'm not very familiar with the ArrayFlags mechanism. It seems to me that the problem is that empty_like doesn't set the strides argument of the GPUArray constructor. But I don't know the repercussions of doing this, so I didn't send a PR.
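For reference, the strides an order-aware empty_like would have to compute for an F-contiguous array look like this (a sketch; PyCUDA's compyte layer has similar helpers, but the function name here is illustrative), checked against NumPy:

```python
import numpy as np

def f_contiguous_strides(itemsize, shape):
    """Column-major strides, matching numpy's order="F" layout."""
    strides = []
    s = itemsize
    for dim in shape:
        strides.append(s)
        s *= dim
    return tuple(strides)

# Agrees with NumPy's own F-order layout:
x = np.empty((3, 5), dtype=np.float64, order="F")
assert x.strides == f_contiguous_strides(8, (3, 5))
```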

PyCUDA cannot install properly via pip on OS X 10.9

kilon-imac:~ kilon$ pip install pyCuda
Downloading/unpacking pyCuda
Downloading pycuda-2014.1.tar.gz (1.6MB): 1.6MB downloaded
Running setup.py (path:/private/var/folders/gk/f3241lwj5yg_3n0h09lsj_t40000gp/T/pip_build_kilon/pyCuda/setup.py) egg_info for package pyCuda
*** WARNING: nvcc not in path.
*************************************************************
*** I have detected that you have not run configure.py.
*************************************************************
*** Additionally, no global config files were found.
*** I will go ahead with the default configuration.
*** In all likelihood, this will not work out.
***
*** See README_SETUP.txt for more information.
***
*** If the build does fail, just re-run configure.py with the
*** correct arguments, and then retry. Good luck!
*************************************************************
*** HIT Ctrl-C NOW IF THIS IS NOT WHAT YOU WANT
*************************************************************
Continuing in 1 seconds... ..
Traceback (most recent call last):
File "", line 17, in
File "/private/var/folders/gk/f3241lwj5yg_3n0h09lsj_t40000gp/T/pip_build_kilon/pyCuda/setup.py", line 216, in
main()
File "/private/var/folders/gk/f3241lwj5yg_3n0h09lsj_t40000gp/T/pip_build_kilon/pyCuda/setup.py", line 88, in main
conf["CUDA_INC_DIR"] = [join(conf["CUDA_ROOT"], "include")]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 77, in join
elif path == '' or path.endswith('/'):
AttributeError: 'NoneType' object has no attribute 'endswith'
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /private/var/folders/gk/f3241lwj5yg_3n0h09lsj_t40000gp/T/pip_build_kilon/pyCuda
Storing debug log for failure in /Users/kilon/Library/Logs/pip.log
kilon-imac:~ kilon$

Hash collision.

In pycuda/pycuda/compiler.py, I see that the cache file name is created based on the md5 of the preprocessed output of the code. As far as I know, hash collisions can happen with md5, and I don't see any code that deals with collisions in compiler.py. Am I misunderstanding something, or is it assumed that hash collisions never happen with PyCUDA?
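For context: accidental collisions of a 128-bit hash are astronomically unlikely for any realistic cache size, but md5 collisions can be constructed deliberately. A hypothetical sketch of a content-derived cache key using sha256 instead (the function name and inputs are illustrative; the real code hashes more state, such as compiler options and version):

```python
import hashlib

def cache_key(preprocessed_source, nvcc_version, options):
    """Content-derived cache file name; sha256 has no known practical
    collision attacks, at negligible extra cost over md5."""
    h = hashlib.sha256()
    h.update(preprocessed_source.encode())
    h.update(nvcc_version.encode())
    h.update(" ".join(options).encode())
    return h.hexdigest()

k1 = cache_key("__global__ void f() {}", "7.5", ["-m64"])
k2 = cache_key("__global__ void g() {}", "7.5", ["-m64"])
assert k1 != k2 and len(k1) == 64
```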

cache elementwise kernels

reference:
http://lists.tiker.net/pipermail/pycuda/2011-September/003416.html
http://article.gmane.org/gmane.comp.python.cuda/2424

I am also trying to deploy PyCUDA on non-admin machines. I am able to use SourceModule-based kernels via its cache_dir option. However, elementwise kernels seem to use #includes and unconditionally run external compilers.

What do you think of having two compiler cache options: 1. based on preprocessed source, 2. based on postprocessed source? If one is messing with headers, one could just use the postprocessed cache option. For deployment, we could use the preprocessed cache option and ship the preprocessed cache dir.

If you are OK with this, I can go ahead and implement my suggestion and send a pull request.
Thanks for making PyCUDA and PyOpenCL.
Naveen

Multi-gpu Peer Access inconsistent

This may be an oversight on my end, but I'm having trouble enabling peer access through PyCUDA. I have a K40 and a K20; the simpleP2P program included with the CUDA toolkit passes its test, but in PyCUDA I get a different result.

The following snippet throws an exception on the last line:

from pycuda import driver
driver.init()
k40 = driver.Device(0)
k20 = driver.Device(1)

print "K40 can access K20: {}".format(k40.can_access_peer(k20))
print "K20 can access K40: {}".format(k20.can_access_peer(k40))

k4ctx = k40.make_context()
k2ctx = k20.make_context()

k2ctx.enable_peer_access(k4ctx)
k4ctx.enable_peer_access(k2ctx)
---> 13 k4ctx.enable_peer_access(k2ctx)
LogicError: cuCtxEnablePeerAccess failed: peer access is not supported between these two devices

UnicodeDecodeError: 'utf8' codec can't decode byte

I ran the code below with PyCUDA on Windows 7 64-bit:

import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy

a_gpu = gpuarray.to_gpu(numpy.random.randn(4,4).astype(numpy.float32))  ## pass
a_doubled = (2*a_gpu).get()  ## the line can't be passed with Ipython
print a_doubled
print a_gpu

and got this error:

 File "C:\Python27\lib\site-packages\pycuda\compiler.py", line 137, in compile_plain
    lcase_err_text = (stdout+stderr).decode("utf-8").lower()
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb8 in position 109: invalid start byte

When I ran the hello_gpu example, I got the same error. It seems that I have installed PyCUDA successfully, so I hope you can give me some hints. Thanks in advance!
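The crash comes from strictly decoding nvcc's output as UTF-8, while a localized Windows toolchain emits messages in the console code page. A minimal sketch of the failure and a permissive-decoding workaround, with hypothetical sample bytes standing in for real compiler output:

```python
# Bytes as a localized nvcc might emit them (0xb8 is not valid UTF-8 here):
stdout = b"error C2059: \xb8\xf1"

# Strict UTF-8 decoding raises, which is the crash reported above:
try:
    stdout.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# Decoding permissively keeps the compile-error scanning working:
text = stdout.decode("utf-8", errors="replace").lower()
assert "error" in text
```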

Insecure temporary file creation on UNIX

The cache files are created in the system temporary directory with predictable filenames. While the user id is incorporated into the filename to prevent unintentional conflicts, a malicious user could create a directory with another user's ID. This allows at least two forms of attack:

  • by creating a different .cubin file, it allows arbitrary code to be run on the GPU
  • by creating a symlink, it can cause the user to write a file to an arbitrary filesystem location.

Suggested fix: use the appdirs module to find an appropriate per-user cache location for the operating system. There is a side benefit that the cache won't get wiped out on reboots.
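A sketch of the hardening the fix would need, whatever base directory is chosen: create the directory with mode 0700 and refuse to use one owned by another user. This is a hypothetical POSIX-only illustration (in a real fix, base would come from e.g. appdirs.user_cache_dir):

```python
import os
import stat
import tempfile

def secure_cache_dir(base, name):
    """Create a per-user cache dir with mode 0700 and verify ownership
    (hypothetical hardening sketch, POSIX-only because of os.getuid)."""
    path = os.path.join(base, name)
    os.makedirs(path, mode=0o700, exist_ok=True)
    st = os.stat(path)
    # Reject directories owned by someone else or writable by others:
    if st.st_uid != os.getuid() or st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        raise RuntimeError("unsafe cache directory: %s" % path)
    return path

cache = secure_cache_dir(tempfile.mkdtemp(), "compiler-cache")
assert os.path.isdir(cache)
```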

managed_zeros_like calls pagelocked_empty_like

Something I just happened to notice while browsing the code (I haven't actually tested whether it is an issue):

def managed_zeros_like(array, mem_flags=0):
    result = pagelocked_empty_like(array, mem_flags)
    result.fill(0)
    return result

in pycuda/driver.py. I assume it should call managed_empty_like rather than pagelocked_empty_like.

Python 3 compatibility issues

Hi!

The current HEAD does not work correctly with Python 3. Running the test suite with Python 2.7 (on Ubuntu 14.04) gives me:

pycuda$ python test/test_gpuarray.py 
=== test session starts ===
platform linux2 -- Python 2.7.6 -- py-1.4.20 -- pytest-2.5.2
 collected 51 items 
test/test_gpuarray.py ...................................................
=== 51 passed in 231.07 seconds ===

However, on the same machine, using Python 3.4:

python3 test/test_gpuarray.py           
============================= test session starts ==============================
platform linux -- Python 3.4.0 -- py-1.4.22 -- pytest-2.6.0
collected 51 items 

test/test_gpuarray.py .........F...................F.....................

=================================== FAILURES ===================================
_________________________ TestGPUArray.test_2d_slice_f _________________________

args = (<test_gpuarray.TestGPUArray object at 0x7f5b8bd79f28>,), kwargs = {}
pycuda = <module 'pycuda' from '/usr/local/lib/python3.4/dist-packages/pycuda-2013.1.1-py3.4-linux-x86_64.egg/pycuda/__init__.py'>
ctx = <pycuda._driver.Context object at 0x7f5b8bd753c8>
clear_context_caches = <function clear_context_caches at 0x7f5b96085378>
collect = <built-in function collect>

    def f(*args, **kwargs):
        import pycuda.driver
        # appears to be idempotent, i.e. no harm in calling it more than once
        pycuda.driver.init()

        ctx = make_default_context()
        try:
            assert isinstance(ctx.get_device().name(), str)
            assert isinstance(ctx.get_device().compute_capability(), tuple)
            assert isinstance(ctx.get_device().get_attributes(), dict)
>           inner_f(*args, **kwargs)

/usr/local/lib/python3.4/dist-packages/pycuda-2013.1.1-py3.4-linux-x86_64.egg/pycuda/tools.py:451: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/test_gpuarray.py:588: in test_2d_slice_f
    a = a_gpu_f.get()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <[TypeError("expected an object with a writable buffer interface") raised in repr()] GPUArray object at 0x7f5b8bd79b70>
ary = array([[  0.00000000e+00,   1.35631280e-19,   4.74273749e+30, ...,
          1.80602727e+28,   8.13464249e+32,   1.186...  3.60627856e+03,   1.40870256e+01, ...,
          1.84612119e+20,   8.90377656e-15,   3.28151813e-18]], dtype=float32)
pagelocked = False

    def get(self, ary=None, pagelocked=False):
        if ary is None:
            if pagelocked:
                ary = drv.pagelocked_empty(self.shape, self.dtype)
            else:
                ary = np.empty(self.shape, self.dtype)

            ary = _as_strided(ary, strides=self.strides)
        else:
            assert ary.size == self.size
            assert ary.dtype == self.dtype
            assert ary.flags.forc

        assert self.flags.forc, "Array in get() must be contiguous"

        if self.size:
>           drv.memcpy_dtoh(ary, self.gpudata)
E           TypeError: expected an object with a writable buffer interface

/usr/local/lib/python3.4/dist-packages/pycuda-2013.1.1-py3.4-linux-x86_64.egg/pycuda/gpuarray.py:265: TypeError
____________________ TestGPUArray.test_stride_preservation _____________________

args = (<test_gpuarray.TestGPUArray object at 0x7f5b8290fdd8>,), kwargs = {}
pycuda = <module 'pycuda' from '/usr/local/lib/python3.4/dist-packages/pycuda-2013.1.1-py3.4-linux-x86_64.egg/pycuda/__init__.py'>
ctx = <pycuda._driver.Context object at 0x7f5b82905518>
clear_context_caches = <function clear_context_caches at 0x7f5b96085378>
collect = <built-in function collect>

    def f(*args, **kwargs):
        import pycuda.driver
        # appears to be idempotent, i.e. no harm in calling it more than once
        pycuda.driver.init()

        ctx = make_default_context()
        try:
            assert isinstance(ctx.get_device().name(), str)
            assert isinstance(ctx.get_device().compute_capability(), tuple)
            assert isinstance(ctx.get_device().get_attributes(), dict)
>           inner_f(*args, **kwargs)

/usr/local/lib/python3.4/dist-packages/pycuda-2013.1.1-py3.4-linux-x86_64.egg/pycuda/tools.py:451: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/test_gpuarray.py:736: in test_stride_preservation
    AT_GPU = gpuarray.to_gpu(AT)
/usr/local/lib/python3.4/dist-packages/pycuda-2013.1.1-py3.4-linux-x86_64.egg/pycuda/gpuarray.py:914: in to_gpu
    result.set(ary)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <[TypeError("expected an object with a writable buffer interface") raised in repr()] GPUArray object at 0x7f5b8290fcc0>
ary = array([[ 0.09840709,  0.52116679,  0.16925525],
       [ 0.33036802,  0.95108236,  0.43318754],
       [ 0.73978786,  0.71290125,  0.21846779]])

    def set(self, ary):
        assert ary.size == self.size
        assert ary.dtype == self.dtype
        if ary.strides != self.strides:
            from warnings import warn
            warn("Setting array from one with different strides/storage order. "
                    "This will cease to work in 2013.x.",
                    stacklevel=2)

        assert self.flags.forc

        if self.size:
>           drv.memcpy_htod(self.gpudata, ary)
E           ValueError: ndarray is not C-contiguous

/usr/local/lib/python3.4/dist-packages/pycuda-2013.1.1-py3.4-linux-x86_64.egg/pycuda/gpuarray.py:229: ValueError
----------------------------- Captured stdout call -----------------------------
True False
===================== 2 failed, 49 passed in 45.45 seconds =====================

pycuda.compiler.SourceModule incorrectly handles sources of `bytes` type

CUDA source compilation fails if the source is passed to pycuda.compiler.SourceModule as a Python 3 bytes object (e.g. when the source was read from the filesystem in binary mode).

In this code:

    if not no_extern_c:
        source = 'extern "C" {\n%s\n}\n' % source

the bytes object is implicitly converted to its string repr, producing invalid CUDA code, e.g.:

In [10]: source = b"Test"
In [11]: source = 'extern "C" {\n%s\n}\n' % source
In [12]: print(source)
extern "C" {
b'Test'
}

Notice the b'Test' in the resulting source code.
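A sketch of the failure and of a caller-side workaround, in plain Python (decoding is assumed to be the right fix here; SourceModule itself could equally decode or reject bytes):

```python
# Reproduce: %-formatting a bytes object into a str embeds its
# repr, b'...' wrapper and all, yielding invalid CUDA code.
source = b"__global__ void f() {}"
wrapped = 'extern "C" {\n%s\n}\n' % source
assert "b'__global__" in wrapped  # broken

# Caller-side workaround: decode before handing it to SourceModule.
if isinstance(source, bytes):
    source = source.decode("utf-8")
wrapped = 'extern "C" {\n%s\n}\n' % source
assert wrapped == 'extern "C" {\n__global__ void f() {}\n}\n'
```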

Division between int-gpuarray and int scalar does not work

Consider:

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray

x = gpuarray.to_gpu(np.arange(4))
x / 4

This will give the following exception:

error                                     Traceback (most recent call last)
<ipython-input-1-c5776fbf102a> in <module>()
      3 
      4 x = gpuarray.to_gpu(np.arange(4))
----> 5 x / 4

/usr/local/lib/python3.4/dist-packages/pycuda-2015.1.2-py3.4-linux-x86_64.egg/pycuda/gpuarray.py in __div__(self, other)
    476                 # create a new array for the result
    477                 result = self._new_like_me(_get_common_dtype(self, other))
--> 478                 return self._axpbz(1/other, 0, result)
    479 
    480     __truediv__ = __div__

/usr/local/lib/python3.4/dist-packages/pycuda-2015.1.2-py3.4-linux-x86_64.egg/pycuda/gpuarray.py in _axpbz(self, selffac, other, out, stream)
    319         func.prepared_async_call(self._grid, self._block, stream,
    320                 selffac, self.gpudata,
--> 321                 other, out.gpudata, self.mem_size)
    322 
    323         return out

/usr/local/lib/python3.4/dist-packages/pycuda-2015.1.2-py3.4-linux-x86_64.egg/pycuda/driver.py in function_prepared_async_call(func, grid, block, stream, *args, **kwargs)
    506 
    507         from pycuda._pvt_struct import pack
--> 508         arg_buf = pack(func.arg_format, *args)
    509 
    510         for texref in func.texrefs:

error: required argument is not an integer

As far as I understand, the problem is the following: within __div__, the output is assumed to have the same dtype as the input, but this is not the case when dividing an integer array by a scalar. It could be fixed easily by making sure that true division always returns a floating-point type.

As a side note, a related problem is that // (Python 3's floor division) is not yet implemented in gpuarray. As far as I can see, the "turn the division into a multiplication" trick does not work for that operator, so the division mechanism may need to be switched away from _axpbz entirely.
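Plain NumPy shows the dtype semantics the fix would have to mirror: true division of an integer array promotes to a floating dtype, while floor division stays integral (and therefore cannot be expressed as a multiply by 1/other):

```python
import numpy as np

x = np.arange(4)              # integer dtype
q = x / 4                     # true division promotes to float
assert np.issubdtype(q.dtype, np.floating)
assert np.allclose(q, [0.0, 0.25, 0.5, 0.75])

# Floor division keeps the integer dtype, which the
# multiply-by-reciprocal (axpbz) approach cannot express:
f = x // 4
assert np.issubdtype(f.dtype, np.integer)
```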

pycuda._driver.LogicError: cuMemcpyHtoDAsync failed: invalid value

Hi,
I am trying to use async memcpy using pycuda. I am getting this error:

Traceback (most recent call last):
  File "streams.py", line 29, in <module>
    drv.memcpy_htod_async(a_gpu,a_pin)
pycuda._driver.LogicError: cuMemcpyHtoDAsync failed: invalid value

The source code is:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void add_them(long *dest, long *a, long *b)
{
    int tx = blockIdx.x;
    dest[tx] = a[tx] + b[tx];
}
""")

add_them = mod.get_function("add_them")

a = np.random.randint(10, size = 65536)
b = np.random.randint(10, size = 65536)

dest = np.zeros_like(a)
a_gpu = drv.mem_alloc(a.nbytes)
b_gpu = drv.mem_alloc(b.nbytes)
dest_gpu = drv.mem_alloc(dest.nbytes)

a_pin = drv.register_host_memory(a)
b_pin = drv.register_host_memory(b)

drv.memcpy_htod_async(a_gpu,a_pin)
drv.memcpy_htod_async(b_gpu,b_pin)

add_them(dest_gpu,a_gpu,b_gpu,block=(1,1,1),grid=(65536,1,1))

drv.memcpy_dtoh(dest,dest_gpu)
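A likely cause, though an assumption on my part: `register_host_memory` wraps `cuMemHostRegister`, which expects the host buffer to be page-aligned, and heap-allocated numpy arrays (such as those returned by `np.random.randint`) usually are not. Allocating pinned memory directly with `drv.pagelocked_empty` avoids registration altogether. A GPU-free sketch of what page alignment means, using an mmap-backed array (mmap regions start on a page boundary):

```python
import mmap

import numpy as np

# An mmap-backed numpy array is page-aligned, which is what
# cuMemHostRegister expects of the host buffer it is given.
n = 65536
buf = mmap.mmap(-1, n * np.dtype(np.int64).itemsize)
a = np.frombuffer(buf, dtype=np.int64)

assert a.size == n
assert a.ctypes.data % mmap.PAGESIZE == 0  # page-aligned
```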

Windows wheel installer has wrong lib paths

setup.py on Windows tries to use Linux library paths.

It looks for the CUDA libraries in %CUDA_PATH%\lib64
but should be looking in %CUDA_PATH%\lib\x64.

Temporary workaround for Windows users:
copy the x64 folder into %CUDA_PATH% and rename it to lib64.

Large arrays of random numbers have incorrect distribution

When generating relatively small arrays of uniform random numbers, the distribution is correct. However, with large arrays, the distribution is no longer uniform:

>>> import pycuda.autoinit
>>> from pycuda import curandom
>>> import numpy as np

>>> sampler = curandom.XORWOWRandomNumberGenerator(curandom.seed_getter_uniform)

>>> sampler.gen_uniform((1000, 1000), np.float32).get().mean()
0.50003648

>>> sampler.gen_uniform((10000, 10000), np.float32).get().mean()
0.16777216

# Doesn't happen with doubles, though
>>> sampler.gen_uniform((1000, 1000), np.float64).get().mean()
0.50014077359074693

>>> sampler.gen_uniform((10000, 10000), np.float64).get().mean()
0.49996000309851663
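The number 0.16777216 itself points at the culprit: 0.16777216 * 1e8 = 2**24 = 16777216. This suggests (an inference, not verified against the curandom source) that the samples are fine and a float32 sum/mean is what breaks: once a naive float32 accumulator reaches 2**24, adding values below 1 is rounded away entirely, so the sum saturates. In plain NumPy:

```python
import numpy as np

# At 2**24 the float32 spacing is 2, so adding values < 1
# to a float32 accumulator no longer changes it:
s = np.float32(2**24)
assert s + np.float32(0.5) == s

# Hence a naive float32 sum of 1e8 uniform samples saturates
# near 2**24, and the reported "mean" is exactly 2**24 / 1e8:
assert 2**24 / 1e8 == 0.16777216
```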

ImportError: cannot import name intern

I just upgraded six to the latest version. `make tests` fails with ImportError: cannot import name intern. This happened on both Python 3 and Python 2.7, building from source as well as from the tarball, on Ubuntu 14.04.

import random missing from curandom.py

In curandom.py, at line 536 in `def seed_getter_uniform(N)`, you make use of `random.`, but where is the corresponding `import random`?

Complex math error exp() and need for tests.

The implementation of exp() for complex types currently has the order of the real and imaginary parts swapped:

return complex<_Tp>(expx * s, expx * c);

This should be

return complex<_Tp>(expx * c, expx * s);

Complex types also need to be added to the math tests.
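The identity being implemented is exp(x + iy) = e^x cos(y) + i e^x sin(y), so the real part must carry the cosine. Python's cmath confirms the corrected ordering:

```python
import cmath
import math

z = complex(1.0, 2.0)
expx = math.exp(z.real)
c, s = math.cos(z.imag), math.sin(z.imag)

# corrected ordering: complex<_Tp>(expx * c, expx * s)
good = complex(expx * c, expx * s)
assert cmath.isclose(good, cmath.exp(z))

# the buggy ordering swaps the real and imaginary parts
bad = complex(expx * s, expx * c)
assert not cmath.isclose(bad, cmath.exp(z))
```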

setup.py build returns error: ‘PyStructType’ was not declared in this scope

running build_ext
building '_pvt_struct' extension
x86_64-pc-linux-gnu-g++ -pthread -fPIC -I/usr/lib64/python2.6/site-packages/numpy/core/include -I/usr/include/python2.6 -c src/wrapper/_pycuda_struct.cpp -o build/temp.linux-x86_64-2.6/src/wrapper/_pycuda_struct.o
src/wrapper/_pycuda_struct.cpp:129: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp: In function ‘int s_init(PyObject*, PyObject*, PyObject*)’:
src/wrapper/_pycuda_struct.cpp:917: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp:919: error: ‘PyStructType’ was not declared in this scope
src/wrapper/_pycuda_struct.cpp: In function ‘PyObject* s_unpack(PyObject*, PyObject*)’:
src/wrapper/_pycuda_struct.cpp:993: error: ‘PyStructType’ was not declared in this scope
src/wrapper/_pycuda_struct.cpp: In function ‘PyObject* s_unpack_from(PyObject*, PyObject*, PyObject*)’:
src/wrapper/_pycuda_struct.cpp:1031: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp:1031: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp:1035: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp:1040: error: ‘PyStructType’ was not declared in this scope
src/wrapper/_pycuda_struct.cpp: In function ‘PyObject* s_pack(PyObject*, PyObject*)’:
src/wrapper/_pycuda_struct.cpp:1166: error: ‘PyStructType’ was not declared in this scope
src/wrapper/_pycuda_struct.cpp: In function ‘PyObject* s_pack_into(PyObject*, PyObject*)’:
src/wrapper/_pycuda_struct.cpp:1206: error: ‘PyStructType’ was not declared in this scope
src/wrapper/_pycuda_struct.cpp: At global scope:
src/wrapper/_pycuda_struct.cpp:1280: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp:1280: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp:1280: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp:1280: warning: deprecated conversion from string constant to ‘char*’
src/wrapper/_pycuda_struct.cpp: In function ‘void init_pvt_struct()’:
src/wrapper/_pycuda_struct.cpp:1552: warning: deprecated conversion from string constant to ‘char*’
error: command 'x86_64-pc-linux-gnu-g++' failed with exit status 1

trouble installing pycuda

I am having some trouble installing pycuda. When I run `python test_driver.py`, I get the following:
E CompileError: nvcc compilation of /var/folders/8k/4dgjf1252rvdrqmb63f_w5bm0000gn/T/tmpFIgNjO/kernel.cu failed
E [command: nvcc --cubin -arch sm_30 -m64 -I/Users/mas/PycharmProjects/kaggle-ndsb/env5/lib/python2.7/site-packages/pycuda-2015.1.3-py2.7-macosx-10.10-intel.egg/pycuda/cuda kernel.cu]
E [stderr:
E nvcc fatal : The version ('70000') of the host compiler ('Apple clang') is not supported
E ]

../../env5/lib/python2.7/site-packages/pycuda-2015.1.3-py2.7-macosx-10.10-intel.egg/pycuda/compiler.py:137: CompileError
================ 19 failed, 6 passed, 2 skipped in 2.40 seconds ================

I am on a Mac, and my nvcc version is release 7.0, V7.0.27.

Including nccl multi-gpu library to PyCUDA

Hello there,

I want to wrap NVIDIA's nccl library for multi-GPU communication collectives in Python, so that it can be used from the PyCUDA framework.

I thought it would be appropriate to include this in PyCUDA itself, because the library provides GPU-to-GPU communication rather than CUDA implementations of algorithms.
What are your thoughts on this?

In case you are interested: is there already an effort under way? What should I be aware of before I start implementing this?

Also, I could not find any developer documentation on extending PyCUDA's functionality. Where should I start?

libcuda not always in the expected location

At least for the version of the cuda7.5 toolkit that ships with the docker image nvidia/cuda, the file libcuda.so is to be found in a goofy subdirectory of the regular lib directory, called "stubs." In order for me to get things working, I needed to use a configuration command like

python configure.py --ldflags="-L/usr/local/cuda/lib64/stubs" --cuda-root=/usr/local/cuda-7.5

It would be cool if the configure script could figure that out by itself.

More generally, the wiki says that pycuda was built against the beta of the 2.0 toolkit and that the latest version of pycuda is 2.2, which I guess means it was last edited around March 2009. Which version of the toolkit are you developers actually building against?

Cannot pip install pycuda

I have an issue installing pyCUDA using "pip install pycuda."
I am running Mac OS X Mavericks 10.9. I installed CUDA 6.5 SDK for Mac 10.9. Xcode 5.1.1 and the developer tools are also installed. My Nvidia graphics is NVIDIA GeForce GTX 680MX (2GB memory).

The errors I got are

ld: library not found for -lcuda
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command 'clang++' failed with exit status 1

What is causing this issue? Do I need to run configure in addition to `pip install pycuda`?
Please guide me to fixing the issue. Thanks.

"import pycuda.driver as cuda" gives me an error

Recently I tried to set up Caffe and PyCUDA on my machine. I installed PyCUDA following https://wiki.tiker.net/PyCuda/Installation/Linux/Ubuntu, ran demo.py from pycuda/examples, and it produced output, so I think the PyCUDA installation was successful.
I also built Caffe and ran its examples successfully.
But when I try to run demo.py now, it gives me an error:

Traceback (most recent call last):
  File "demo.py", line 4, in <module>
    import pycuda.driver as cuda
ImportError: No module named pycuda.driver

Can anyone help me solve this?

Query about integrating with third party libraries.

I am trying to integrate PyCUDA with arrayfire-python and am having a few issues with context creation.

arrayfire uses the runtime API exclusively and does not deal in cuContexts. Is there a bridge function in PyCUDA that attaches to an existing context instead of creating a new context?

PyCUDA working only with cuda-6.0

Hi,
I was trying to install pycuda against different toolkits. It works fine only with cuda-6.0; I also tested cuda-6.5 and cuda-7.0, and both throw the error below, which pertains to CUDA driver changes. Do changes need to be made to this pycuda repo?

/usr/local/lib/python2.7/dist-packages/pycuda-2014.1-py2.7-linux-x86_64.egg/pycuda/_driver.so: undefined symbol: cuMemHostRegister_v2

DeviceMemoryPool doesn't work in Python3

Hi!
The following code doesn't work under Python 3, but executes just fine in Python 2:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpu
from pycuda.tools import DeviceMemoryPool
mempool = DeviceMemoryPool()
x = gpu.to_gpu(np.random.randn(5, 5).astype(np.float32), allocator=mempool.allocate)
2*x

In Python 3, this raises the following stacktrace:

error                                     Traceback (most recent call last)
<ipython-input-30-6440a0c09c4b> in <module>()
     11 
     12 x = gpu.to_gpu(np.random.randn(5, 5).astype(np.float32), allocator=mempool.allocate)
---> 13 2*x

/usr/local/lib/python3.4/dist-packages/pycuda-2014.1-py3.4-linux-x86_64.egg/pycuda/gpuarray.py in __rmul__(self, scalar)
    470     def __rmul__(self, scalar):
    471         result = self._new_like_me(_get_common_dtype(self, scalar))
--> 472         return self._axpbz(scalar, 0, result)
    473 
    474     def __imul__(self, other):

/usr/local/lib/python3.4/dist-packages/pycuda-2014.1-py3.4-linux-x86_64.egg/pycuda/gpuarray.py in _axpbz(self, selffac, other, out, stream)
    335         func.prepared_async_call(self._grid, self._block, stream,
    336                 selffac, self.gpudata,
--> 337                 other, out.gpudata, self.mem_size)
    338 
    339         return out

/usr/local/lib/python3.4/dist-packages/pycuda-2014.1-py3.4-linux-x86_64.egg/pycuda/driver.py in function_prepared_async_call(func, grid, block, stream, *args, **kwargs)
    503 
    504         from pycuda._pvt_struct import pack
--> 505         arg_buf = pack(func.arg_format, *args)
    506 
    507         for texref in func.texrefs:

error: required argument is not an integer

(Tested with Python 3.4 under Ubuntu 14.04, CUDA 6.5)

A memory leak with pagelocked_empty

Hey @inducer,

I'm getting weird memory leaks from pagelocked_empty (see below). Am I using it the wrong way? How can I free this memory?

Demo (use with caution):

from pycuda import driver
import pycuda.autoinit

N = 10**6
for _ in xrange(N):
    arr = driver.pagelocked_empty((800, 800), dtype='f')
    del arr

And see your computer eat all your memory ;-)

np.intp can't necessarily hold GPU pointers

When doing development on a 32-bit ARM system I was getting this error (not on all programs though):

  File "/home/ubuntu/work/sdp/env/local/lib/python2.7/site-packages/pycuda/driver.py", line 377, in function_call
    handlers, arg_buf = _build_arg_buf(args)
  File "/home/ubuntu/work/sdp/env/local/lib/python2.7/site-packages/pycuda/driver.py", line 150, in _build_arg_buf
    gpudata = np.intp(arg.gpudata)
OverflowError: Python int too large to convert to C long

Changing np.intp to np.uintp made things work for me. Looking at mem_obj_to_long, it seems to always interpret pointers as unsigned, so hopefully this is safe on x86 as well.
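The same overflow is easy to reproduce off-device: a pointer-sized value with the top bit set fits np.uintp but not np.intp. The sketch below assumes a 64-bit platform (on 32-bit ARM, the same thing happens at 2**31):

```python
import numpy as np

addr = 2**63  # a "pointer" with the high bit set (64-bit platform)

try:
    np.intp(addr)
    overflowed = False
except OverflowError:
    overflowed = True

assert overflowed                    # signed intp cannot hold it
assert int(np.uintp(addr)) == addr   # unsigned uintp can
```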

GPUArray creation fails for shapes containing type long

Using pycuda 2015.1.3 with Python 2.7.10, the following raises an AssertionError:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

x_gpu = gpuarray.empty(long(3), np.float32)

The failure occurs at line 170 in gpuarray.py:

assert isinstance(shape, (int, int, np.integer))

Leaving aside the question as to why int appears twice in the above check, I suggest modifying it to

import numbers
...
assert isinstance(shape, numbers.Integral)

which seems to accept int, long, and the various numpy integer types.

(Context: the above might be the cause of the skcuda issue reported here.)
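A quick sanity sketch of the proposed check: NumPy registers its integer scalar types with the numbers ABCs, so numbers.Integral accepts all the relevant flavors (long exists only under Python 2, so it is exercised conditionally):

```python
import numbers

import numpy as np

candidates = [3, np.int32(3), np.int64(3), np.uint64(3)]
try:
    candidates.append(long(3))  # Python 2 only
except NameError:
    pass

for s in candidates:
    assert isinstance(s, numbers.Integral)

# while non-integers are still rejected:
assert not isinstance(3.0, numbers.Integral)
```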

download-examples-from-wiki.py No longer works

The download-examples-from-wiki.py gives the following error output:

downloading wiki examples from http://wiki.tiker.net/PyCuda/Examples to wiki-examples/...
fetching page list...
Traceback (most recent call last):
  File "download-examples-from-wiki.py", line 19, in <module>
    all_pages = destwiki.getAllPages()
  File "//anaconda/lib/python2.7/xmlrpclib.py", line 1240, in __call__
    return self.__send(self.__name, args)
  File "//anaconda/lib/python2.7/xmlrpclib.py", line 1599, in __request
    verbose=self.__verbose
  File "//anaconda/lib/python2.7/xmlrpclib.py", line 1280, in request
    return self.single_request(host, handler, request_body, verbose)
  File "//anaconda/lib/python2.7/xmlrpclib.py", line 1328, in single_request
    response.msg,
xmlrpclib.ProtocolError: <ProtocolError for wiki.tiker.net/?action=xmlrpc2: 301 Moved Permanently>

Kernel cache not working (hexdigest broken?)

On my systems the kernel caching mechanism is not working. I have tested this on both Linux and Windows.

checksum.hexdigest() seems to return a different value when passed the same kernel. I have no idea why... does anyone else see this behaviour?
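For a digest to be reproducible, every byte fed into it must be identical between runs. As I understand it, PyCUDA's cache key covers more than the kernel text (compiler options and toolchain version among other inputs), so if any of those vary run to run, the digest changes even though "the same kernel" was passed. A hypothetical sketch with made-up names (cache_key and its parameters are illustrative, not PyCUDA's actual code):

```python
import hashlib

def cache_key(source, options=(), nvcc_version="release 7.5"):
    # Hypothetical: a deterministic key must hash everything that
    # affects the build, in a stable order.
    h = hashlib.md5()
    h.update(source if isinstance(source, bytes) else source.encode())
    for opt in sorted(options):      # stable ordering of options
        h.update(opt.encode())
    h.update(nvcc_version.encode())  # toolchain identity
    return h.hexdigest()

k1 = cache_key("__global__ void f() {}", options=("-O3", "-arch=sm_30"))
k2 = cache_key("__global__ void f() {}", options=("-arch=sm_30", "-O3"))
assert k1 == k2  # option order must not change the key
```

If the digests still differ for byte-identical inputs, comparing the raw bytes hashed on each run would be the next diagnostic step.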
