This FFT was developed by Manthan Verma, Soumyadeep Chatterjee, and Prof. Mahendra Verma in collaboration with NVIDIA.
It is an open-source parallel FFT, primarily for GPUs (AMD or NVIDIA). It has been tested on NVIDIA A100 and AMD Instinct MI210 accelerator GPUs, and it has been scaled on supercomputers such as Selene, Param Siddhi-AI, and Param Sanganak.
This FFT has been published in SN Computer Science, Volume 4, Article 625 (2023).
Link to the research paper: https://link.springer.com/article/10.1007/s42979-023-02109-0
To cite this article, use: "Verma, M., Chatterjee, S., Garg, G. et al. Scalable Multi-node Fast Fourier Transform on GPUs. SN COMPUT. SCI. 4, 625 (2023). https://doi.org/10.1007/s42979-023-02109-0"
This FFT works for BOTH AMD and NVIDIA GPUs; it uses HIP on AMD GPUs. Installation is simple. Follow these steps:
- Clone this Git repository and go to the GPU_FFT folder.
- Make sure a CUDA-aware or ROCm-aware MPI and CUDA or ROCm/HIP are installed on your system.
- Now run `make CC_HOME=<> MPI_HOME=<> COMPILER=<>`. Here, `CC_HOME` is the path to the CUDA or ROCm home directory, `MPI_HOME` is the home of the CUDA-aware or ROCm-aware MPI installation, and `COMPILER` is `HIP` for HIP compilation or `NVCC` for CUDA-based compilation.
- Then run `make INSTALL_DIR=<INSTALLATION_DIRECTORY> install`.
- GPU_FFT will be installed in the specified folder path.
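For example, on a CUDA system the two steps above might look like this (the paths below are placeholders for illustration; substitute your own CUDA, MPI, and install locations):

```shell
# Build against CUDA and a CUDA-aware MPI (example paths only)
make CC_HOME=/usr/local/cuda MPI_HOME=/opt/openmpi COMPILER=NVCC

# Install into a user-writable prefix
make INSTALL_DIR=$HOME/local/GPU_FFT install
```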
First, make sure the include and library directories of the installed library are on your paths. Then, in your code, include:
#include <GPU_FFT/GPU_FFT.h>
Now, just after initializing MPI, call
cudaSetDevice()
or hipSetDevice()
to bind each rank to a GPU.
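A minimal sketch of this binding step, assuming one MPI process per GPU and the common round-robin `rank % deviceCount` mapping (a convention for illustration, not something mandated by GPU_FFT):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>   // on AMD, use <hip/hip_runtime.h> and the hip* calls

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind this MPI rank to a GPU, round-robin over the visible devices.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    cudaSetDevice(rank % deviceCount);   // hipSetDevice() on AMD GPUs

    // ... initialize and use GPU_FFT here ...

    MPI_Finalize();
    return 0;
}
```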
Now, in your code, initialize GPU_FFT as:
GPU_FFT::INIT_GPU_FFT<T1,T2>(Nx,Ny,Nz,procs,rank,MPI_COMMUNICATOR)
Here, Nx, Ny, and Nz are the real-space dimensions of the data to be transformed; procs and rank are the MPI size and rank, respectively; and MPI_COMMUNICATOR is the communicator with which you want to use the FFT.
Here, T1 and T2 are template parameters: either "float" and "cufftComplex" for single precision, or "double" and "cufftDoubleComplex" for double precision. The code supports single and double precision only.
Now GPU_FFT is initialized.
Now you can perform the transforms using
GPU_FFT::GPU_FFT_R2C<T1,T2>(T1 *data)
or GPU_FFT::GPU_FFT_C2R<T1,T2>(T2 *data)
. Both functions perform the FFT in place.
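Putting the pieces together, a minimal end-to-end sketch in double precision. The slab allocation size and in-place padding shown here are assumptions based on standard slab-decomposed R2C layouts, not statements about GPU_FFT's exact requirements; consult the repository's test script for the authoritative layout:

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <GPU_FFT/GPU_FFT.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, procs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    // Bind each rank to a GPU before initializing the FFT.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    cudaSetDevice(rank % deviceCount);

    const int Nx = 128, Ny = 128, Nz = 128;

    // Initialize GPU_FFT in double precision.
    GPU_FFT::INIT_GPU_FFT<double, cufftDoubleComplex>(Nx, Ny, Nz, procs, rank,
                                                      MPI_COMM_WORLD);

    // Allocate the local slab on the device. The size here is an assumption:
    // Nx/procs planes, padded along z for the in-place real-to-complex transform.
    double *data = nullptr;
    cudaMalloc(&data, sizeof(double) * (Nx / procs) * Ny * 2 * (Nz / 2 + 1));
    // ... fill data with the real-space field ...

    // Forward (real-to-complex) and inverse (complex-to-real) transforms, in place.
    GPU_FFT::GPU_FFT_R2C<double, cufftDoubleComplex>(data);
    GPU_FFT::GPU_FFT_C2R<double, cufftDoubleComplex>(
        reinterpret_cast<cufftDoubleComplex *>(data));

    cudaFree(data);
    MPI_Finalize();
    return 0;
}
```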
In this FFT implementation, the real-space data (the data before the transform) is distributed equally in slabs across all GPUs. After the transform (in complex space), the data is again distributed in slabs across the GPUs.
A basic installation script is provided in the GPU_FFT_TEST_SCRIPT folder. Run `make CC_HOME=<> MPI_HOME=<> GPU_FFT_HOME=<> COMPILER=<>` in the TEST_SCRIPT folder after setting the relevant parameters in the makefile, and then run `make CC_HOME=<> MPI_HOME=<> GPU_FFT_HOME=<> COMPILER=<> test_fft Nx=128 Ny=128 Nz=128`.
Here we have used a real test function; its expected output and the timing of the multicore FFT are shown in the repository. You can set Nx, Ny, and Nz according to your requirements.
We have also developed a Python wrapper for this library, which will soon be available in this repository.
We also have an improved version of this FFT that uses NVSHMEM for communication and is stream-aware. That version is currently not open source; it can be provided on specific request, submitted to [email protected] or through this repository. The new FFT is expected to outperform even cuFFTMp.