
cufinufft's Introduction


$$\text{\color{red} \Huge LEGACY CODEBASE}$$

Note: This repository holds the legacy cuFINUFFT codebase. Further development will take place in the FINUFFT repository. Please direct any issues or pull requests to that repository.

cuFINUFFT v1.3

cuFINUFFT is a very efficient GPU implementation of the 1-, 2-, and 3-dimensional nonuniform FFT of types 1 and 2, in single and double precision, based on the CPU code FINUFFT.

cuFINUFFT introduces several algorithmic innovations, including load-balancing, bin-sorting for cache-aware access, and use of fast shared memory. Our tests show an acceleration of up to 10x over FINUFFT on modern hardware, and up to 100x over other established GPU NUFFT codes.

The linear transforms it can perform may be summarized as follows: type 1 maps nonuniform data (locations and corresponding strengths) to the uniformly spaced coefficients of a Fourier series (or its bi- or tri-variate generalization, according to dimension). Type 2 does the adjoint operation of type 1, ie maps in the reverse order. However, note that type 2 and type 1 are not generally each other's inverse, unlike in the FFT case! These transforms are performed to a user-prescribed tolerance, at close-to-FFT speeds; under the hood, this involves detailed kernel design, custom spreading/interpolation stages, and plain FFTs performed by cuFFT. See the documentation for FINUFFT for a full mathematical description of the transforms and their applications to signal processing, imaging, and scientific computing.
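Schematically, in 1D with M nonuniform points x_j and strengths c_j, and with the sign of i chosen by the user, the two transforms are (see the FINUFFT documentation for the precise conventions):

$$f_k = \sum_{j=1}^{M} c_j \, e^{\pm i k x_j}, \qquad k = -\tfrac{N}{2}, \dots, \tfrac{N}{2}-1 \qquad \text{(type 1)}$$

$$c_j = \sum_{k} f_k \, e^{\pm i k x_j}, \qquad j = 1, \dots, M \qquad \text{(type 2)}$$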

Note: We are currently in the process of adapting the cuFINUFFT interface to more closely match that of FINUFFT. This will likely break code depending on the current interface once the next release is published. At that point we will publish a migration guide detailing the exact changes to the interfaces.

Main developer: Yu-hsuan Melody Shih (NYU). Main other contributors: Garrett Wright (Princeton), Joakim Andén (KTH/Flatiron), Johannes Blaschke (LBNL), Alex Barnett (Flatiron). See github for full list of contributors. This project came out of Melody's 2018 and 2019 summer internships at the Flatiron Institute, advised by CCM project leader Alex Barnett.

Installation

Note: most Python users may skip straight to the Python package section below, and consider installing from source only if that solution is not adequate for your needs. Note that 1D is not available in Python yet. Here's the C++ install process:

  • Make sure you have the prerequisites: a C++ compiler (eg g++) and a recent CUDA installation (nvcc).
  • Get the code: git clone https://github.com/flatironinstitute/cufinufft.git
  • Review the Makefile: If you need to customize build settings, create and edit a make.inc. Example:
    • To override the standard CUDA /usr/local/cuda location your make.inc should contain: CUDA_ROOT=/your/path/to/cuda.
    • For examples, see one for IBM machines (targets/make.inc.power9), and another for the Courant Institute cluster (sites/make.inc.CIMS).
  • Compile: make all -j (this takes several minutes)
  • Run test codes: make check, which should complete in less than a minute without error.
  • You may then want to try individual test drivers, such as bin/cufinufft2d1_test_32 2 1e3 1e3 1e7 1e-3 which tests the single-precision 2D type 1. Most such executables document their usage when called with no arguments.

Basic usage and interface

Please see the codes in examples/ to see how to call cuFINUFFT and link to it from C++/CUDA, and how to call it from Python.

The default use of the cuFINUFFT API has four stages, which match those of the plan interface to FINUFFT (in turn modeled on those of, eg, FFTW or NFFT). Here they are from C++:

  1. Plan one transform, or a set of transforms sharing nonuniform points, specifying overall dimension, numbers of Fourier modes, etc:

    ier = cufinufft_makeplan(type, dim, nmodes, iflag, ntransf, tol, maxbatchsize, &plan, NULL);
  2. Set the locations of nonuniform points from the arrays x, y, and possibly z:

    ier = cufinufft_setpts(M, x, y, z, 0, NULL, NULL, NULL, plan);

    (Note that arguments 5-8 here are reserved for a future type 3 implementation, to match the FINUFFT interface.)

  3. Perform the transform(s) using these nonuniform point arrays, which reads strengths c and writes into modes fk for type 1, or vice versa for type 2:

    ier = cufinufft_execute(c, fk, plan);
  4. Destroy the plan (clean up):

    ier = cufinufft_destroy(plan);

In each case the returned integer ier is a status indicator. Here is the full C++ documentation.

It is also possible to change advanced options by changing the last NULL argument of the cufinufft_makeplan call to a pointer to an options struct, opts. This struct should first be initialized via cufinufft_default_opts(type, dim, &opts); before any fields are changed. For examples of this advanced usage, see test/cufinufft*.cu.
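For Python users, here is a minimal sketch of the same four-stage flow, pieced together from the pycuda-based interface exercised in this repository's Python tests; the exact constructor arguments may differ between versions, so treat it as illustrative rather than definitive:

import numpy as np
import pycuda.autoinit  # initializes a CUDA context
from pycuda.gpuarray import to_gpu
from cufinufft import cufinufft

# 2D type 1: spread M nonuniform points onto an N1 x N2 grid of Fourier modes.
N1, N2, M = 256, 256, 100000
kx = np.random.uniform(-np.pi, np.pi, size=M).astype(np.float32)
ky = np.random.uniform(-np.pi, np.pi, size=M).astype(np.float32)
c = (np.random.standard_normal(M)
     + 1j * np.random.standard_normal(M)).astype(np.complex64)

plan = cufinufft(1, (N1, N2), eps=1e-6, dtype=np.float32)  # 1. plan
plan.set_pts(to_gpu(kx), to_gpu(ky))                       # 2. set nonuniform points

fk_gpu = to_gpu(np.zeros((N1, N2), dtype=np.complex64))
plan.execute(to_gpu(c), fk_gpu)                            # 3. execute (type 1: c -> fk)
fk = fk_gpu.get()                                          # copy the modes back to host
# 4. destroy: in the Python wrapper, cleanup is tied to the plan object's lifetime.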

Library installation

It is up to the user to decide how exactly to link or otherwise install the libraries produced in lib. If you plan to use the Python wrapper you will minimally need to extend your LD_LIBRARY_PATH, such as with export LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH} or a more permanent installation path of your choosing.

If you would like to always have this installation in your library path, you can add to your shell rc with something like the following:

echo "\n# cufinufft librarypath \nexport LD_LIBRARY_PATH=${PWD}/lib:${LD_LIBRARY_PATH}" >> ~/.bashrc

Because CUDA itself has similar library/path requirements, we expect the user is somewhat familiar with this process. If not, please ask; we might be able to help.

Python wrapper

For those installing from source, this code comes with a Python wrapper module cufinufft, which depends on pycuda. Once you have successfully installed and tested the CUDA library, you may run make python to manually install the additional Python package.

Python package

General Python users, or Python packages that would like to automatically depend on cufinufft via setuptools, may use a precompiled binary distribution. On supported systems this avoids installing from source and managing libraries entirely.

Binary distributions are specific to both hardware and software. We currently provide binary wheels targeting Linux systems covered by manylinux2010, for CUDA 10 and later with compatible GPUs. If you have such a system, you may run:

pip install cufinufft

For other cases, the Python wrapper can be built from source.

Advanced topics

Advanced Makefile Usage

If you want to test/benchmark the spreader and interpolator (the performance-critical components of the NUFFT algorithm) without building the whole library, you can do so with make checkspread.

In general for make tasks, it's possible to specify the target architecture using the target variable, eg:

make target=power9 -j

By default, the makefile assumes the x86_64 architecture. We've included site-specific configurations -- such as Cori at NERSC, or Summit at OLCF -- which can be accessed using the site variable, eg:

make site=olcf_summit

The currently supported targets and sites are:

  1. Sites
    1. NERSC Cori (site=nersc_cori)
    2. NERSC Cori GPU (site=nersc_cgpu)
    3. OLCF Summit (site=olcf_summit) -- automatically sets target=power9
    4. CIMS (site=CIMS)
    5. Flatiron Institute, rusty cluster GPU node (site=FI)
  2. Targets
    1. Default (x86_64) -- do not specify target variable
    2. IBM power9 (target=power9)

A general note about expanding the platform support: targets should contain settings that are specific to a compiler/hardware architecture, whereas sites should contain settings that are specific to an HPC facility's software environment. The site-specific script is loaded before the target-specific settings, hence it is possible to specify a target in a site make.inc.* (but not the other way around).

Makefile preprocessors

  • TIME - timing for each stage. Enable by adding "-DTIME" to NVCCFLAGS.
  • SPREADTIME - more detailed timing of the spreading and interpolation stages.
  • DEBUG - debug mode; outputs the results of all intermediate stages.

Other notes

  • If you are interested in optimizing for GPU Compute Capability, you may want to specify NVARCH=-arch=sm_XX in your make.inc to reduce compile times, or for other performance reasons. See Matching SM Architectures.

Tasks for developers

  • 1D version is close to finished (needs vectorized testers and Py interfaces)
  • Type 3 transforms (which are quite tricky) as in FINUFFT are in progress (at least in 3D) on a PR, thanks to Simon Frasch; please go and test!
  • We need some more tutorial examples in C++ and Python
  • Please help us to write MATLAB (gpuArray) and Julia interfaces
  • There are various Tensorflow and related interfaces in progress (please help with them or test them): https://github.com/mrphys/tensorflow-nufft https://github.com/dfm/jax-finufft
  • Please see Issues and PRs for other things you can help fix or test

References

  • cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs, Yu-hsuan Shih, Garrett Wright, Joakim Andén, Johannes Blaschke, Alex H. Barnett, PDSEC2021 conference (best paper prize). https://arxiv.org/abs/2102.08463

cufinufft's People

Contributors

ahbarnett, elliottslaughter, garrettwrong, janden, jblaschke, melodyshih, raulppelaez


cufinufft's Issues

Make tests link to the library

Right now, they link directly to the raw object files. Having the tests link to the library would more closely mimic end-user behavior.

Multi Device Support

Longer term, I think it would be good if we could naively use multiple GPUs in an embarrassingly parallel way. That is, cufinufft would keep the responsibility of managing device choice and user GPU data (on the correct device) outside of cufinufft, with the caller. What makes it a bit complicated is how that is managed. One approach would require a notion of a driver context that is passed through (the device and memory are bound to that context and owned/managed by the caller). This would obviously be optional.

There are more involved ways to do multi gpu, but I think this would be a good balance, and the least disruptive.

For a production-size ASPIRE workload, this would be a great performance opportunity. Currently the testing sizes are limited by memory bounds. One can imagine simply running the largest batch that will fit on a single GPU concurrently on two or four separate GPUs (common configurations for HPC nodes).

This is not anything immediate, just something I think is worth considering.

Idea for SM in double-precision 3D

So if I've understood correctly, the reason we can't perform SM in 3D with doubles (at high precision) is that the 3D box won't fit. In other words, for 12 digits, we get w = 12 + 1, which means that after padding, a 1×1×1 bin becomes 14×14×14 and takes up 16 * 14 ** 3 = 43904 bytes, so barely fits in shared memory since a complex double is 16 bytes.

What if we ditch interleaved complex numbers (I'm fairly sure cuFFT allows you to output real and imaginary separately)? We'd only need 8 bytes per voxel, which means we can fit a padded bin of size 19×19×17 (49096 bytes) which corresponds to 6×6×4 unpadded. Is this still too small or could it be workable? The other question is of course how much we could expect to improve over GM-sort given the magnitude of the changes needed here.
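(A quick back-of-the-envelope check of these numbers, in plain Python; the 48 KB figure assumed below is the classic per-block shared-memory size and varies by GPU:)

import numpy as np

# Bytes of shared memory needed for a bin after padding each dimension by the kernel width w.
def padded_bin_bytes(bin_shape, w, bytes_per_value):
    return bytes_per_value * int(np.prod([b + w for b in bin_shape]))

w = 13             # kernel width for ~12 digits (w = 12 + 1)
limit = 48 * 1024  # 49152 bytes of shared memory per block

# Interleaved complex doubles (16 bytes per voxel): a 1x1x1 bin pads to 14^3.
print(padded_bin_bytes((1, 1, 1), w, 16), "of", limit)  # 43904 -- barely fits

# Split real/imag doubles (8 bytes per voxel): a 6x6x4 bin pads to 19x19x17.
print(padded_bin_bytes((6, 6, 4), w, 8), "of", limit)   # 49096 -- still (just) fits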

TIME flag not consistently treated in tests

Even with the flag turned off, these codes output certain timings. Since these are only there for testing purposes, it might make sense for them to ignore the TIME flag altogether.

Release Tagging

After #42 is in, I think we should tag a release candidate v1.0-rc. I will base the container and wheels off of that commit and get them ready for staging. (Initial tests seem good.)

Once #43 completes, we can push to the Test PyPI index, at which point the wheels may be field-tested with:
pip install -i https://test.pypi.org/simple/ --no-deps cufinufft

Following that, assuming no changes required, the v1.0 tag can be created at the same commit, and the packages pushed to the live index.

This would give ASPIRE enough to tie to and test against. It would still allow for the software to be iterated on further before any formal announcement. The very first release always has a bit of a chicken-and-egg dance.

Default target in Makefile

Right now the default target is spread2d (since it comes first). Might make more sense to have all be the default?

Compiler Optimization flags

I noticed this when I was making the timing change:

CXXFLAGS= -DNEED_EXTERN_C  -fPIC -O3 -funroll-loops -march=native -g -std=c++11

IIRC -g will override previous optimization levels (the -O3 on the left in this case).

Because NVCC has its own opt flags, I don't really think this affected CUDA performance, but in case it would (or another layer of your code), I'll push a patch momentarily.

License?

What license is this code published under?
I couldn't find a license in the repository, but maybe I just missed it.

Cleanup/optimize nvcc gencode flags

There are a couple of challenges with gencode flags. We need to maximize compatibility with users' GPU hardware and software (CUDA toolkit), while at the same time keeping our makefile manageable and having a clean path to generate a maximally compatible package for distribution.

I've emailed one of our Nvidia architect contacts to ask if they have any good practices here.

This might get addressed in a coming PR, but I am creating a note of it just in case.

Question regarding running tests

I noticed the following testing a PR and I confirmed I am seeing same in master. I think I am running the test incorrectly. @MelodyShih , can you tell me what I am doing wrong and give me a better invocation? (Method 2 and 4 error levels look strange to me).

(base) ➜  cufinufft.15 git:(master) ✗ bin/cufinufft3d1_test 1 100 100 100
[time  ] cufinufft plan:		 0.12 s
[time  ] cufinufft setNUpts:		 0.00217 s
[time  ] cufinufft exec:		 0.544 s
[time  ] cufinufft destroy:		 0.0204 s
[Method 1] 1000000 NU pts to #1000000 U pts in 0.567 s (	1.77e+06 NU pts/s)
[gpu   ] one mode: abs err in F[37,26,13] is 5.92e-05
[gpu   ] one mode: rel err in F[37,26,13] is 2e-08
(base) ➜  cufinufft.15 git:(master) ✗ bin/cufinufft3d1_test 2 100 100 100
[time  ] cufinufft plan:		 0.123 s
[time  ] cufinufft setNUpts:		 0.0019 s
error: not enough shared memory (92160)
error: cnufftspread3d_gpu_subprob
error: cuspread3d, method(2)
[time  ] cufinufft exec:		 0.000488 s
[time  ] cufinufft destroy:		 0.0194 s
[Method 2] 1000000 NU pts to #1000000 U pts in 0.0218 s (	4.6e+07 NU pts/s)
[gpu   ] one mode: abs err in F[37,26,13] is 743
[gpu   ] one mode: rel err in F[37,26,13] is inf
(base) ➜  cufinufft.15 git:(master) ✗ bin/cufinufft3d1_test 4 100 100 100
[time  ] cufinufft plan:		 0.122 s
CUDA error at src/3d/spread3d_wrapper.cu:791 code=700(cudaErrorIllegalAddress) "cudaMemcpy(&totalnumsubprob,&d_subprobstartpts[n], sizeof(int),cudaMemcpyDeviceToHost)" 
(base) ➜  cufinufft.15 git:(master) ✗ bin/cufinufft3d1_test 4 10 10 10   
[time  ] cufinufft plan:		 0.118 s
[time  ] cufinufft setNUpts:		 0.000386 s
[time  ] cufinufft exec:		 0.000811 s
[time  ] cufinufft destroy:		 7.16e-05 s
[Method 4] 1000 NU pts to #1000 U pts in 0.00127 s (	7.88e+05 NU pts/s)
[gpu   ] one mode: abs err in F[3,2,1] is 1.28e+06
[gpu   ] one mode: rel err in F[3,2,1] is 0.163

Move towards manylinux2014 wheels

Following #92 , it is time to move to newer OS and CUDA basis for the python wheels.

  • Confirm manylinux2014 ci image builds (make inside the container)
  • Confirm make check inside the container succeeds on my machine
  • Confirm container build and check on a second machine
  • Confirm build wheel script runs
  • Check each wheel output
  • Check if python 3.5 is deprecated; if so purge it
  • Update the ci/readme to be a little more scriptable (vars for version etc). Perhaps make it a real script and just have CI call the script?.. (probably)
  • Consider what the next thing after 2014 will be (manylinux, OS, CUDA version) so project is prepared again.

Bad input testing

In #31 Alex suggested adding some sanity checks like those in test/dumbinputs in FINUFFT.

Considering users will generally be integrating this software in another application, this is a great idea... just maybe a bit more than I wanted to bite off in the other PR.

status and precision of test/{spread,interp}{2,3}d ? bin/interp2d 2 0 ... fails.

These testers are compiled but not exercised by make check.
Are they supported, or should they go into some devel dir instead?
Are they double-precision only? (Not so useful, since most GPU use will be single-precision.)

I ask because
bin/interp2d 2 0 1e3 1e3 1e7 1e-3
fails for me with, eg:

(base) ahb@ccmlin001 /mnt/home/ahb/numerics/cufinufft> bin/interp2d 2 0 1e3 1e3 1e7
[time  ] (warm up) First cudamalloc call 2.429e-04 s

CUDA error at src/2d/interp2d_wrapper.cu:89 code=77(cudaErrorIllegalAddress) "cudaMemcpy(h_c,d_plan->c,M*sizeof(CUCPX), cudaMemcpyDeviceToHost)" 

bin/interp2d 1 0 1e3 1e3 1e7 1e-3 works ok
bin/spread2d and the 3d ones all work.

`error: not enough shared memory` when setting `tol` to less than `1e-3`

I've been using the python interface -- following the design of _test_type1 and _test_type2 in https://github.com/flatironinstitute/cufinufft/blob/master/python/cufinufft/tests/test_basic.py. But whenever I set tol < 1e-3, I get error: not enough shared memory.

I modified the spreading / interpolation kernels to not use shared memory, but then the results are wrong. I don't know enough about the code to track down why this is. My modified kernels were identical to the existing ones, except that fwshared was in global memory.

Do you have any advice on how to get the tolerance down using the python interface?

[bug report] incorrect 2d type 2 nufft result

(Description edited to make assessment easier)

It is likely an error of rounding or mod operations related to the coordinates.
Play with the x_ofst, y_ofst below to see the error appear/disappear.

The following lines can reproduce it:

import numpy as np
import finufft, cufinufft, pycuda
import pycuda.autoinit

from pycuda import gpuarray

sx, sy = 4, 4
x_ofst, y_ofst = 0.00, 0.00  # change this to reproduce the error

x, y = np.mgrid[-sx//2:sx//2:1., -sy//2:sy//2:1.]
x, y = x+x_ofst, y+y_ofst
xr, yr = x.reshape(-1)/sx * 2 * np.pi, y.reshape(-1)/sy * 2 * np.pi

p = np.zeros((sx, sy), dtype=np.complex128)
p[0][0] = 1

xr_cu, yr_cu, p_cu = gpuarray.to_gpu(xr), gpuarray.to_gpu(yr), gpuarray.to_gpu(p)

plan = finufft.Plan(2, (sx, sy), 1, eps=1e-12)
plan.setpts(xr.reshape(-1), yr.reshape(-1))

plan_cu = cufinufft.cufinufft(2, (sx, sy), eps=1e-12, dtype=np.float64)
plan_cu.set_pts(xr_cu, yr_cu)

kp = plan.execute(p)

kp_cu = gpuarray.GPUArray((sx*sy,), dtype=np.complex128)
plan_cu.execute(kp_cu, p_cu)
kp_cu_np = kp_cu.get()

kp, kp_cu_np = kp.reshape((sx,sy)), kp_cu_np.reshape((sx,sy))

print(kp.real)
print(kp_cu_np.real)

and the outputs are:

# kp.real, correct
array([[ 1. -1.  1. -1.]
 [-1.  1. -1.  1.]
 [ 1. -1.  1. -1.]
 [-1.  1. -1.  1.]])

# kp_cu_np, incorrect
array([[ 1.  0.  1. nan]
 [nan  0.  0.  0.]
 [ 1.  0.  0. nan]
 [ 0. nan nan nan]])

Rapidly degrading adjoint test in 3D

Dear all,

I was testing cufinufft via the Python wrapper on some 3d problems, and I stumbled upon some accuracy issues when performing the adjoint test. That is, given the NFFT linear operator F,

dot(F*x,y) \approx dot(x,F'*y).

The adjoint test passes for very small-sized problems (32,32,32) (rel err around 1e-7 on my machine) but it quickly becomes unacceptable for size (128,128,128) (rel err around 0.5, worst case). You can find below the code I was running.

Please let me know if this is somehow expected, or if it can be mitigated somehow.

Regards,
Gabrio

# Module import
import numpy as np
import pycuda.autoinit
from pycuda.gpuarray import GPUArray, to_gpu
from cufinufft import cufinufft

# NFFT eval
def cufinufft3d_eval(u, kx, ky, kz, eps=1e-6):
     
     # Allocate memory on the GPU for output variable
     n_transf = 1
     U_gpu = GPUArray((n_transf, len(kx)), dtype=u.dtype)
     
     # Initialize the plan and set the points
     plan = cufinufft(2, u.shape, n_transf, eps=eps, dtype=kx.dtype)
     plan.set_pts(to_gpu(kx), to_gpu(ky), to_gpu(kz))
     
     # Execute the plan, reading from the array u and storing the result in U_gpu
     plan.execute(U_gpu, to_gpu(u))
     
     # Retrieve the result from the GPU
     return U_gpu.get()[0,:]

def cufinufft3d_adjeval(U, kx, ky, kz, shape, eps=1e-6):
     
     # Allocate memory on the GPU for output variable
     n_transf = 1
     u_gpu = GPUArray((n_transf, shape[0], shape[1], shape[2]), dtype=U.dtype)
     
     # Initialize the plan and set the points
     plan = cufinufft(1, shape, n_transf, eps=eps, dtype=kx.dtype)
     plan.set_pts(to_gpu(kx), to_gpu(ky), to_gpu(kz))
     
     # Execute the plan, reading from the array U and storing the result in u_gpu
     plan.execute(to_gpu(U), u_gpu)
     
     # Retrieve the result from the GPU
     return u_gpu.get()[0,:,:,:]

# 3D settings
# shape = (32,32,32)
# shape = (64,64,64)
shape = (128,128,128)
eps = 1e-7
u = np.random.randn(*shape).astype(np.complex64)+1j*np.random.randn(*shape).astype(np.complex64)
M = 10
kx = np.linspace(-np.pi,np.pi, num=M).astype(np.float32)
ky = np.linspace(-np.pi,np.pi, num=M).astype(np.float32)
kz = np.linspace(-np.pi,np.pi, num=M).astype(np.float32)

# Adjoint test
U  = cufinufft3d_eval(u, kx.flatten(), ky.flatten(), kz.flatten(), eps=eps)
V = np.random.randn(*U.shape).astype(np.complex64)+1j*np.random.randn(*U.shape).astype(np.complex64)
v = cufinufft3d_adjeval(V, kx.flatten(), ky.flatten(), kz.flatten(), shape, eps=eps)
a = np.vdot(U.flatten(),V.flatten())
b = np.vdot(u.flatten(),v.flatten())
np.linalg.norm(a-b)/np.linalg.norm(a) # rel err

Type 2 3D transform gives different result from (non-cu) finufft for some inputs

This is a bit of a strange error, so I will need to unpack a little here:

For some inputs, the type 2 3D NUFFT implemented in cufinufft is giving different results from the finufft implementation. I am attaching a bitbucket repo containing the finufft output (stored) and the code that runs cufinufft:
https://bitbucket.org/jpblaschke/cufinufft-error/src/master/
The get_data.sh script will download the input and output data (approx 3GB).

This is what I see so far:

  1. output from our finufft install is stored here: https://bitbucket.org/jpblaschke/cufinufft-error/downloads/output-cpu.npy (we have a fairly old finufft install -- I am blowing off the cobwebs/API changes so that I can confirm that the latest finufft gives the same answer)
  2. I tested 2 versions of cufinufft ("old" and "new") which in turn give different results. Use git submodule update --init to clone the latest versions. Especially the output from cufinufft-new.py surprises me, as it is all zeros.
  3. For my other unit tests, the results from finufft and cufinufft agree to within the specified tolerance.

I will add more info as I find out more, but I would appreciate any insights all y'all can give me.

My current theory is that this could have something to do with the input data size/density. I have tested both larger and smaller inputs, for which finufft and cufinufft agree, so it's not as simple as "for large inputs this breaks".

Also tagging @monarin and @irischang020 who brought this bug to my attention.

Update README

Various things to address here:

  • Add @JBlaschke as a contributor.
  • Add Python usage example.
  • Fix typo in C++ usage (we get default options but don't use them, maybe just get rid of this step).
  • Minor typos & formatting.

I can take a look at this later. Just wanted to check if there's other things here that I can take care of while I'm at it.

Inaccurate type 1 transform in Python

When running the type 1 transform in Python with a small number of nonuniform points, the output seems to be garbage (i.e., order-one error). Sometimes it's zero, sometimes not. Initializing the output array with a fixed value shows that the array is indeed being written to. I've not been able to reproduce this in the C interface, so it's likely this has something to do with the Python binding.

import numpy as np
import pycuda.autoinit
from pycuda.gpuarray import GPUArray, to_gpu
from cufinufft import cufinufft

N1, N2 = 256, 256
M = 10
dtype = np.float32
complex_dtype = np.complex64

kx = np.random.uniform(-np.pi, np.pi, size=M)
ky = np.random.uniform(-np.pi, np.pi, size=M)

c = (np.random.standard_normal(M)
     + 1j * np.random.standard_normal(M))

kx = kx.astype(dtype)
ky = ky.astype(dtype)
c = c.astype(complex_dtype)

fk_gpu = to_gpu(np.full(fill_value=-123, shape=(N1, N2),
    dtype=complex_dtype))

plan = cufinufft(1, (N1, N2), dtype=dtype)
plan.set_pts(to_gpu(kx), to_gpu(ky))

plan.execute(to_gpu(c), fk_gpu)

fk = fk_gpu.get()

nt1 = int(0.37 * N1)
nt2 = int(0.26 * N2)

x, y = nt1 - N1 // 2, nt2 - N2 // 2
fk_true = np.sum(c * np.exp(1j * (x * kx + y * ky)))

print(f"fk={fk[nt1, nt2]}, fk_true={fk_true}")

err = np.abs(fk[nt1, nt2] - fk_true)
rel_err = err / np.max(np.abs(fk))

print(f"err={err}, rel_err={rel_err}")

Propagation of shared memory error

Making the dual-type Python wrapper, I ran into a case where the code wasn't handling an error correctly. If I do something I shouldn't, like calling a 3D method with doubles, I get the shared memory allocation errors printed:

error: not enough shared memory (92160)
error: cnufftspread3d_gpu_subprob
error: cuspread3d, method(2)

but from spread3d_wrapper.cu we begin returning 0, so the calling code doesn't raise an error. Probably we should return a non-zero code. I took a quick look around and picked up a few more, but I might have missed some. It's a minor change; I will push momentarily.

gpuNUFFT is faster than cuFINUFFT in 3D

Hi, with @chaithyagr we continued development of gpuNUFFT (https://github.com/paquiteau/pygpuNUFFT and https://github.com/chaithyagr/gpuNUFFT) library.

We ran some benchmark tests and found some performance issues with cuFINUFFT, contradicting the results shown in Figure 4 of
https://arxiv.org/abs/2102.08463

In particular, the forward (type 2 NUFFT, image to Fourier domain) and adjoint (type 1 NUFFT, Fourier domain to image) cuFINUFFT operations are slower than gpuNUFFT in 3D, both for single-channel and multichannel data.

We used samples from the PySAP dataset, with the same order of magnitude of resolution and number of sampling points in Fourier space.

For 32-coil data (image is 32x128x128x160):

gpu3d_op     :   1582.74ms  ± 3.853 (x5.65 speedup)
cufi3d_op    :   8952.54ms  ± 2.771 

gpu3d_adjop  :   1739.30ms  ± 10.767 (x2.39 speedup)
cufi3d_adjop :   4159.96ms  ± 5.939  

For single-channel data (image is 128x128x160):

gpu3d_op     :     73.42ms  ± 1.119 ( x3.8 speedup)
cufi3d_op    :    279.36ms  ± 0.552 

gpu3d_adjop  :     62.73ms  ± 0.640 ( x2.1 speedup)
cufi3d_adjop :    132.27ms  ± 0.697 

The difference in speedup between the single-channel and multichannel examples is due to some use of concurrency in gpuNUFFT.
Moreover, we considered only computation and D2H copy times for cuFINUFFT, whereas the gpuNUFFT times also include the H2D copy.

Nevertheless, cuFINUFFT remains faster for 2D data.

A minimal reproducible example is available on Google Colab:

https://colab.research.google.com/drive/1IuOgwshEG04GVaZoinZNdbRkPmG-Kadp?usp=sharing

maxbatchsize = 0 should pick a good default

In FINUFFT I've been using 0 for an option to mean: auto-pick a good default.
But currently it hangs when 0 is sent in.
Also I can crash it by picking maxbatchsize=100 :)

Move from manylinux2010 to manylinux2014

As discussed in #121, the new version of NumPy only releases binary wheels for manylinux2014, while our CI (and our releases) are for manylinux2010 for now. This is currently resolved by downgrading NumPy to the last version to publish manylinux2010 wheels (1.21), but this is not a tenable solution. One way to approach this would be to stick with manylinux2010 but build NumPy from source during testing. However, the fact that this is becoming a problem suggests that it is maybe time to move to manylinux2014. This should be relatively straightforward to implement, but we have to consider downstream issues. Are any dependent packages going to be severely affected by this?

Vestigial Dependence on FFTW

A user had a case where they were having conflicts integrating an FFTW install at an HPC facility with cufinufft.

FFTW dependence was brought over from FINUFFT in the process of trying to use their headers in contrib. To the best of my knowledge cufinufft uses cuFFT exclusively. Besides the vestigial lines from the parent repo, we don't use FFTW here.

So we need to decide which is more important: travelling light (removing the unused dependence), or having the contrib files from another project be as similar/identical as possible. Since the files already required changing, I might suggest trimming the dependence.

I will push a PR shortly that removes it for consideration. It's a very small change.

finishing up the 1D functionality

I just added a bunch of 1D tests to the make tasks.
But we need a Python interface for 1D, and "1dmany" (vectorized) testers. This is not too hard.

We should then push the release to 1.2.1 and make a GitHub release tag.

Adding `cupy` support

Dear authors,

I was wondering if cufinufft could add support for operating on cupy.ndarray, considering that the NUFFT is often not an isolated computation but one step in the middle of a pipeline.
Rather than spending time transferring arrays back and forth between CPU and GPU, one may as well keep as much computation as possible on the GPU, and cupy IMHO is exactly for such needs.

(In my fork repo, the branch tluo/cupy will do the job, if you decide supporting cupy is of interest.)

Thanks!

GPU memory leak issue?

Dear Yu-hsuan,

I am trying to invoke cufinufft from C++ time and time again, but the NVIDIA monitoring tool NVTOP shows that the GPU memory keeps growing.

Would you please kindly review my code below and show me if I have made some mistakes in the cufinufft API usage?

#include <cufinufft.h>
#include <complex>
#include <iostream>
#include "unistd.h" // sleep

#define FI_OK 0

void call_finufft(int N1, int N2, int M, int ntransf, float* x, float* y, cuFloatComplex* c, cuFloatComplex* fk)
{
	int maxbatchsize = 0;
	int dim = 2;
	int nmodes[3] = { N1, N2, 1};
	float tol = 1e-6;

	int type = 1;
	int iflag = 1;

	cufinufft_opts opts;
	if (FI_OK != cufinufftf_default_opts(type, dim, &opts))
	{
		std::cout << "err: cufinufft_default_opts" << std::endl;
		return;
	}		
	cufinufftf_plan dplan;
	if (FI_OK != cufinufftf_makeplan(type, dim, nmodes, iflag, ntransf, tol, maxbatchsize, &dplan, NULL))
	{
		std::cout << "err: cufinufftf_makeplan" << std::endl;
		return;
	}
	if (FI_OK != cufinufftf_setpts(M, x, y, NULL, 0, NULL, NULL, NULL, dplan))
	{
		std::cout << "err: cufinufftf_setpts" << std::endl;
		return;
	}
	if (FI_OK != cufinufftf_execute(c, fk, dplan))
	{
		std::cout << "err: cufinufftf_execute" << std::endl;
		return;
	}
	if (FI_OK != cufinufftf_destroy(dplan))
	{
		std::cout << "err: cufinufftf_destroy" << std::endl;
		return;
	}

	std::cout << "ok" << std::endl;
}

void test()
{
	int N1 = 512;
	int N2 = 512;
	int M = 632 * 149;
	int ntransf = 2;

	float *d_x, *d_y;
	CUCPX *d_c, *d_fk;
	cudaMalloc(&d_x,M*sizeof(float));
	cudaMalloc(&d_y,M*sizeof(float));
	cudaMalloc(&d_c,M*ntransf*sizeof(cuFloatComplex));
	cudaMalloc(&d_fk,N1*N2*ntransf*sizeof(cuFloatComplex));

	for (int outer = 0; outer < 30; outer++)
	{
		for (int inner = 0; inner < 5; inner++)
		{
			call_finufft(N1, N2, M, ntransf, d_x, d_y, d_c, d_fk);
		}
		sleep(1);// the gpu memory is keep growing, memory leak?
	}

	cudaFree(d_x);
	cudaFree(d_y);
	cudaFree(d_c);
	cudaFree(d_fk);
}

// test_mem_leak.cu is located at the test folder
// nvcc test_mem_leak.cu -o test_mem_leak -L /usr/local/cuda/lib64 -lcuda -lcufft -I../include -L ../lib -lcufinufft

int main(int argc, char* argv[])
{
	test();
	
	return 0;
}

ntransfcufftplan vs ntransf

I was struggling to get parity between the many methods in ASPIRE, and of course, I was using the wrong variable (oops 😅), so only yielding the first transform's worth of results. That is totally operator error, but I would like to understand this a little better.

Can anyone tell me if ntransfcufftplan is required for me as an end user? I did some light reading and I see lines like the one below, and also the related PR #20 ... It appears this controls the size of the loop over batches? I'm guessing there is a limit to the number of simultaneous transforms? Do we have a good default value you can recommend?

for(int i=0; i*d_plan->ntransfcufftplan < d_plan->ntransf;

Documentation

How far do we want to go with this? I don't know if it's worth it to have a full-fledged site like FINUFFT's, but something more substantial than in-code documentation (like a tutorial or something) might be a good idea.

Update Fourier Pts Arg ordering

Now that FINUFFT v2 is done, it has come to our attention that the geometry of the arguments for Fourier pts is reversed in cufinufft (ie XYZ vs ZYX). cufinufft was consistent with the original implementation (and with v2 until rather recently). I'll try to have a change set for this shortly so everything is matching.

Turn off timing prints by default

They are useful for developers, but cause some pain when using the code in other programs. I might have really blown up some logs :D. I will push a one-line removal in a few minutes, and add a sentence of detail to the README on how to enable them.

Fixup Centos6 repo references for manylinux2010 images

CentOS 6 was end-of-lifed, and so some references to repos no longer resolve. This has the effect of breaking my local attempts at building the Docker images (which are used to make the 2010 wheels, CUDA 10). It will probably affect other developers, if it hasn't already.

I think the fix just requires pointing to the vault copy of CentOS 6. I'll look Monday.

Moving forward, the project can use manylinux2014, which is based on CentOS 7 and CUDA 11. Making a separate issue about that.

...
Step 7/21 : RUN yum install -y cuda-cudart-$CUDA_PKG_VERSION cuda-compat-10-1 &&     ln -s cuda-10.1 /usr/local/cuda &&     rm -rf /var/cache/yum/*
 ---> Running in 44a0be3e52b1
Loaded plugins: fastestmirror, ovl
Setting up Install Process
Determining fastest mirrors
Error: Cannot find a valid baseurl for repo: base
YumRepo Error: All mirror URLs are not using ftp, http[s] or file.
...

Harmonize C interface with that of Finufft

The current interface was modeled on an early version of the Finufft plan interface, which has changed in the meantime.

  • Options passed to cufinufft_makeplan. Currently, the user has to set these manually as part of plan. We should follow Finufft in allowing the user to provide the options to cufinufft_makeplan.
  • Plan is a pointer to a struct as opposed to a struct. This should simplify the ctypes interface, among other things.
  • Naming. Really only cufinufft_setNUpts that should become cufinufft_setpts.
  • Merge C++ and C interfaces. Currently, we have cufinufft and cufinufftc. Ideally, we'd only have one. I know there have been issues with this in the past, but if we pass the options to makeplan and make the plan a pointer, a lot of the C++ dependency can be stashed away from the API.

I'm sure there are more things, but these are the first that come to mind.

Naming and setting `ntransfcufftplan`

I find this name a little confusing. After digging through the source code, it appears to be the “block size” of the algorithm, that is, the number of inputs to process simultaneously (which is therefore equal to the number of inputs in the cuFFT plan).

Maybe something like block_size would be better? Or ntransfblock? We should also discuss good rules of thumb for setting this by default since most users won't want to deal with it. In the tests, we default to 8, but I saw much better performance on my laptop for 16 and 32, after which it seems to saturate.

Clean up CI Dockerfile

Currently, the Dockerfiles in ci/docker run make all and rely on the user later running python setup.py develop to set up the Python package. While this works well for wheel building, it's not ideal for the Jenkins setup since it forces a rebuild of the Docker image each time and prevents us from caching the dependencies (which shouldn't change too often).

One way to do this is to modify the Dockerfiles to install dependencies first and run make second, then python setup.py develop third. Alternatively, if we want to keep the current structure, we separate into three Dockerfiles, one main one and two branches: one for wheel building and one for Jenkins.

On a related note, we should be able to cut down on running time by replacing make all with make lib, but maybe I'm missing something here.

Windows version

Hi,

I was wondering if cuFINUFFT is platform neutral. If not, do you plan to develop a Windows version?

Thanks,
Yuan

Compatibility CUDA 5.5

  1. Will cufinufft compile under CUDA 5.5 and gcc 4.4?
  2. Also, do I need Python in order to compile the package? It seems like it based on the readme file.
    Thanks!

Good default value for `maxbatchsize`

This is the parameter formerly known as ntransfcufftplan. It currently defaults to maxbatchsize=8, but it is not clear whether this is optimal for many architectures.

2d1 total time is off

totaltime += milliseconds;

Why is the total time incremented by milliseconds just before the latter is set, rather than after?
I don't expect this to matter much because destroy is fast, but it's confusing.
I'm adding total time to 2d2 now.

set_pts memory leak

I want to point out though that using cufinufft that way presupposes the user is carefully managing memory anyway, as CUFINUFFT_SETPTS always mallocs (which is assumed to be freed with the plan in CUFINUFFT_DESTROY) => so setting new points without first destroying and re-creating a plan will cause a memory leak (this might be a good discussion in a different issue @MelodyShih @ahbarnett).

Originally posted by @JBlaschke in #71 (comment)

Consider Breaking CUB Dependence

It is nicer when code stands alone. I would like to look at how much work it would be to either include a minimal part of CUB (frozen, and with citation), or roll our own similar function(s). This would be better than a submodule or external dep.

From our 5/12 discussion, I got the impression this is not too much work. I will look into it at a good time.

request: MATLAB/octave wrappers

Just lodging the request that we build some MATLAB/octave wrappers some day soon. Calling this from MATLAB would be really cool, as a drop-in replacement for FINUFFT.

CU vs CPP

A lot of the source code is currently written as CU files that could just be straight CPP files (that is, they don't contain any kernel invocations). I'm not sure what effect this has, but it does make the build process more convoluted than it perhaps needs to be. (The compiler errors do become more difficult to parse, since GCC will report errors in the preprocessed source files, which end up in /tmp on my setup.)

@garrettwrong Do you know if this could be a potential issue? Otherwise we will leave it as is.

Python timing tests

Per @ahbarnett in #126:

Any Py-based timing tests, that match a C++-based test like Melody just did, would be amazing to add to the Py tests dir, and would help resolve some of these speed anomalies!

Multi-GPU Support in Python

Hi all y'all

do we have multi-GPU support for the Python front end? I.e. is there a way to inject cudaSetDevice (or gpuDeviceInit) into calls to the Python front end?

I am not looking for fancy multi-GPU support (i.e. I don't need to share data between GPUs -- yet), but just a way to have one plan run on one GPU and another plan on another.

Please let me know if this isn't documented (I did a quick git grep and couldn't find anything) but is already implemented. Otherwise I can put in some time to implement it for another project (if y'all want it implemented in a certain way, let me know).

Cheers! 🍺
