computationalradiationphysics / imresh

Shrink-Wrap Phase Reconstruction Algorithm
License: MIT License
The normal input has its brightness maximum in the middle (as do the test diffraction patterns; "normally" the center would be masked ...), but the shrink-wrap algorithm, or more precisely cuFFT (and FFTW), expects the maximum to be in one of the corners (I think at the very first element of the 2D array). This means everything needs to be shifted periodically by half the height/width. A function for this already exists: fftShift.
Normal input images won't be centered perfectly, though, meaning we would have to find the center somehow, e.g. by finding the maximum, but that won't work if it is masked.
I'm not really sure how this is done in other HIO implementations, which have existed for some time. Maybe that fftShift is not even necessary, but I think it at least caused problems in the beginning stages, which is why I wrote fftShift.
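For illustration, here is an out-of-place host version of such a shift: a cyclic shift by half the width and height, which moves the central element to the (0,0) corner, the layout cuFFT/FFTW expect. The library's fftShift may differ (e.g. work in place, or treat odd sizes differently):

```cpp
#include <vector>

/* Cyclically shift a row-major 2D array by (Nx/2, Ny/2), so that the element
 * at the center moves to index (0,0). Out-of-place sketch; for odd dimensions
 * the inverse shift would differ from this forward shift. */
std::vector<float> fftShift( std::vector<float> const & data,
                             unsigned const Nx, unsigned const Ny )
{
    std::vector<float> result( data.size() );
    for ( unsigned iy = 0; iy < Ny; ++iy )
    for ( unsigned ix = 0; ix < Nx; ++ix )
    {
        unsigned const iyShifted = ( iy + Ny / 2 ) % Ny;
        unsigned const ixShifted = ( ix + Nx / 2 ) % Nx;
        result[ iyShifted * Nx + ixShifted ] = data[ iy * Nx + ix ];
    }
    return result;
}
```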
The main loop, where (depending on the parameters) 80% of the time is spent, is:
for ( unsigned iHioCycle = 0; iHioCycle < rnHioCycles; ++iHioCycle )
{
    cudaKernelApplyHioDomainConstraints<<< nBlocks, nThreads, 0, rStream >>>
        ( dpgPrevious, dpCurData, dpIsMasked, nElements, rHioBeta );
    CUFFT_ERROR( cufftExecC2C( ftPlan, dpgPrevious, dpCurData, CUFFT_FORWARD ) );
    cudaKernelApplyComplexModulus<<< nBlocks, nThreads, 0, rStream >>>
        ( dpCurData, dpCurData, dpIntensity, nElements );
    CUFFT_ERROR( cufftExecC2C( ftPlan, dpCurData, dpCurData, CUFFT_INVERSE ) );
}
cudaKernelApplyHioDomainConstraints and cudaKernelApplyComplexModulus are both element-wise functions, so I think cuFFT callbacks are possible and could bring some speedup by reducing global memory accesses, especially because the cuFFT calls and the calls to these two functions take almost 50:50 of the runtime for an image size of 1024x1024.
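For reference, a host-side sketch of what these two element-wise kernels presumably do per element, assuming the textbook HIO update and modulus projection (the real kernels may apply additional constraints, e.g. non-negativity, and whether dpIntensity stores the intensity or already its square root is an assumption here):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using Complex = std::complex<float>;

/* HIO domain constraint: inside the support keep the current estimate,
 * outside (masked) apply the hybrid input-output feedback g' = g_prev - beta*g. */
void applyHioDomainConstraints( std::vector<Complex> & gPrevious,
                                std::vector<Complex> const & gCurrent,
                                std::vector<float> const & isMasked,
                                float const beta )
{
    for ( std::size_t i = 0; i < gPrevious.size(); ++i )
        if ( isMasked[i] != 0.0f )
            gPrevious[i] -= beta * gCurrent[i];
        else
            gPrevious[i] = gCurrent[i];
}

/* Modulus projection: keep the phase, replace the magnitude with the
 * measured one (square root of the diffraction intensity). */
void applyComplexModulus( std::vector<Complex> & data,
                          std::vector<float> const & intensity )
{
    for ( std::size_t i = 0; i < data.size(); ++i )
    {
        float const norm = std::abs( data[i] );
        if ( norm > 0.0f )
            data[i] *= std::sqrt( intensity[i] ) / norm;
    }
}
```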
Note: I heard cuFFT callbacks have some problems, like only working on 64-bit Linux and needing some sort of license for CUDA < 7.0 ... (unchecked rumors)
Many of the automatically generated test cases are not well-suited for reconstruction, because the objects touch the border of the image. Either the image should be a periodic tile or there should be a margin to the image border. E.g. this example has this problem:
That's why the reconstruction looks a bit weird, because the pseudo-periodic image is periodically shifted. But it is interesting that it worked anyway:
The parallel unit test could just spawn 50 or so tasks to the batch system and then compare all their results which should be exactly identical even accounting for floating point rounding errors.
Try checking out dev (de14cd9) and add the following code to src/imresh/libs/cudacommon.hpp:
#ifdef USE_SPLASH
# warning "USE_SPLASH defined!"
#else
# warning "USE_SPLASH NOT defined!"
#endif
#ifdef USE_PNG
# warning "USE_PNG defined!"
#else
# warning "USE_PNG NOT defined!"
#endif
#ifdef IMRESH_DEBUG
# warning "IMRESH_DEBUG defined!"
#else
# warning "IMRESH_DEBUG NOT defined!"
#endif
#ifdef USE_FFTW
# warning "USE_FFTW defined!"
#else
# warning "USE_FFTW NOT defined!"
#endif
#error "!"
Now compile with:
mkdir build
cd build
cmake .. \
-DCMAKE_CXX_COMPILER=$(which g++-4.9) \
-DCMAKE_C_COMPILER=$(which gcc-4.9) \
-DUSE_PNG=ON \
-DUSE_FFTW=ON \
-DUSE_SPLASH=ON \
-DBUILD_DOC=OFF \
-DIMRESH_DEBUG=ON
make VERBOSE=1
The output is:
/usr/bin/nvcc -M -D__CUDACC__ imresh/src/imresh/libs/cudacommon.cu -o imresh/build/CMakeFiles/imresh.dir/src/imresh/libs/imresh_generated_cudacommon.cu.o.NVCC-depend -ccbin /usr/bin/gcc-4.9 -m64 --std c++11 -DIMRESH_DEBUG -DUSE_SPLASH -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D_FORTIFY_SOURCE=2 -DUSE_FFTW -Xcompiler ,\"-Wall\",\"-Wextra\",\"-Wshadow\",\"-Wno-unused-parameter\",\"-O2\",\"-g\",\"-fPIC\",\"-pthread\",\"-fopenmp\",\"-g\" -Xcompiler -Wall,-Wextra,-Wshadow -G -lineinfo -Xptxas=-v -arch=sm_30 -DNVCC -Iimresh/src/imresh -I/usr/include -I/usr/include -Iimresh/src/imresh -I$PNGWRITER_ROOT/include -I/usr/include/freetype2 -I$SPLASH_ROOT/include -I/usr/include/hdf5/serial
[ 5%] Building NVCC (Device) object CMakeFiles/imresh.dir/src/imresh/libs/imresh_generated_cudacommon.cu.o
In file included from imresh/src/imresh/libs/cudacommon.cu:26:0:
imresh/src/imresh/libs/cudacommon.hpp:33:9: warning: #warning "USE_PNG NOT defined!" [-Wcpp]
# warning "USE_PNG NOT defined!"
^
imresh/src/imresh/libs/cudacommon.hpp:36:9: warning: #warning "IMRESH_DEBUG defined!" [-Wcpp]
# warning "IMRESH_DEBUG defined!"
^
imresh/src/imresh/libs/cudacommon.hpp:41:9: warning: #warning "USE_FFTW defined!" [-Wcpp]
# warning "USE_FFTW defined!"
^
imresh/src/imresh/libs/cudacommon.hpp:45:6: error: #error "!"
#error "!"
^
CMake Error at imresh_generated_cudacommon.cu.o.cmake:207 (message):
Error generating
imresh/build/CMakeFiles/imresh.dir/src/imresh/libs/./imresh_generated_cudacommon.cu.o
So for some reason only IMRESH_DEBUG and USE_FFTW are defined, but not USE_PNG.
Note the difference:
add_definitions("-DUSE_PNG ${PNGwriter_DEFINITIONS}")
add_definitions("-DUSE_SPLASH" ${Splash_DEFINITIONS})
Changing this to
add_definitions("-DUSE_PNG" ${PNGwriter_DEFINITIONS})
will make it work.
Note also this minimal-non-working example:
cmake_minimum_required(VERSION 3.5)
find_package( CUDA )
add_definitions( "-D IMRESH_DEBUG" )
add_definitions( "-DIMRESH_DEBUG2" )
SET_SOURCE_FILES_PROPERTIES( main.cpp PROPERTIES CUDA_SOURCE_PROPERTY_FORMAT OBJ )
cuda_add_library( mimi main.cpp )
target_link_libraries( mimi )
Use the following main.cpp:
#ifdef IMRESH_DEBUG
# error "IMRESH_DEBUG defined!"
#endif
int main ( void ){ return 0; }
Here is an excerpt from the output:
/usr/bin/nvcc ~/cmakeIsysBug/main.cpp -x=cu -c -o ~/cmakeIsysBug/build/CMakeFiles/mimi.dir//./mimi_generated_main.cpp.o -ccbin /usr/bin/gcc-4.9 -m64 -DIMRESH_DEBUG2 -Xcompiler ,\"-g\" -DNVCC -I/usr/include -I/usr/include
So in particular there is no compile error, because IMRESH_DEBUG won't be defined, but IMRESH_DEBUG2 is.
This means CMake has a bug where it misinterprets definitions containing spaces. Note that the space after -D is completely fine for the compiler itself!
Comparing the above to this minimal-working CMakeLists.txt
cmake_minimum_required(VERSION 3.5)
add_definitions( "-D IMRESH_DEBUG" )
add_definitions( "-DIMRESH_DEBUG2" )
add_library( mimi main.cpp )
target_link_libraries( mimi )
the output will be:
[ 50%] Building CXX object CMakeFiles/mimi.dir/main.cpp.o
/usr/bin/g++-4.9 -DIMRESH_DEBUG2 -D IMRESH_DEBUG -o CMakeFiles/mimi.dir/main.cpp.o -c ~/cmakeIsysBug/main.cpp
~/cmakeIsysBug/main.cpp:3:5: error: #error "IMRESH_DEBUG defined!"
# error "IMRESH_DEBUG defined!"
^
Meaning the bug seems to be inside FindCUDA.
I'm only kinda surprised that the missing -D USE_PNG didn't lead to any notable errors. Maybe I'm overlooking something again.
This software lacks a short introduction / description paragraph at the top of its README.md :)
I think this would be best done in a forked project, because at least for benchmarking purposes CUDA and alpaka are orthogonal to each other.
Try out test cases which don't have hard edges, and also not only Gaussian edges (a Gaussian falls off to zero quite fast). Instead, e.g., try a linear or Lorentz-like target density distribution, or a combination of low- and high-density targets next to each other. In some smaller tests the low-density target will be forced to 0 by accident, because the high-density target confuses the masking process -> see also #57
We should decide which build system we want to use quite early. I'd like to use CMake but could live with make as well.
The main part of this program. It can be subdivided as follows:
First:
Then:
During implementation benchmarks should be run in order to ensure performant code.
For easier access from usual post-processing workflows a python binding would be nice to have.
After the rebase to dev (before the two merges today) I get the following warning when compiling:
/usr/bin/ld: warning: libhdf5_serial.so.8, needed by /opt/libsplash/lib/libsplash.so, may conflict with libhdf5_serial.so.10
Is this normal? Can this be suppressed?
The example imresh/examples/threadedExample.cpp suggests the following call to free the allocated data, where justFree is just a wrapper around delete[]. This will be run in parallel and is bad (segmentation fault).
// Now we can run the algorithm for testing purposes and free the data
// afterwards
imresh::io::addTask( file.first, file.second,
imresh::io::writeOutFuncs::justFree,
"free" /* gives an identifier for debugging */ );
Possible fixes:
- delete[] inside taskQueue.cu after join(). In this case we need to store the pointers to the data to free.
- Letting the user do it, but the problems with that are plentiful, because it breaks parallelism or is just too complex.
My suggestion is to convert taskQueue.cu into a class, e.g. called ShrinkWrapBatch, so that adding tasks could be as easy as:
{
    ShrinkWrapBatch batch( true /* automatically free given data */ );
    while ( /* files in directory */ )
    {
        batch.add( readFile( filename ) );
    }
} // batch's destructor will be called here
Note that no call to initialize is necessary anymore, because it would be done in the constructor. Also no call to deinit is necessary, because leaving the outer scope calls the destructor. Each data pointer given is freed either by batch.add or by the destructor.
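A minimal host-only sketch of such an RAII wrapper (all names hypothetical; CUDA setup, worker threads and task dispatch are omitted and only indicated in comments):

```cpp
#include <cstddef>
#include <vector>

/* RAII batch: owns the data pointers handed to add() and guarantees they are
 * freed when the batch goes out of scope. The real class would also create
 * the worker threads in the constructor (replacing the explicit initialize
 * call) and join them in the destructor (replacing deinit) before freeing. */
class ShrinkWrapBatch
{
public:
    explicit ShrinkWrapBatch( bool const rFreeData ) : mFreeData( rFreeData ) {}

    void add( float * const data )
    {
        /* ... enqueue a shrink-wrap task working on data ... */
        if ( mFreeData )
            mOwnedData.push_back( data );
    }

    std::size_t nOwned( void ) const { return mOwnedData.size(); }

    ~ShrinkWrapBatch()
    {
        /* ... join worker threads first, so nothing still uses the data ... */
        for ( auto const p : mOwnedData )
            delete[] p;
    }

    /* non-copyable: the batch uniquely owns its pointers */
    ShrinkWrapBatch( ShrinkWrapBatch const & ) = delete;
    ShrinkWrapBatch & operator=( ShrinkWrapBatch const & ) = delete;

private:
    bool mFreeData;
    std::vector< float * > mOwnedData;
};
```

Deleting the copy operations makes the ownership unique, so the data cannot be freed twice by accident.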
E.g. Input:
std::vector< std::string > inputFilenames
std::vector< std::string > OutputFilenames
The loop could be:
Possible further optimizations:
When running the test suite with the IMRESH_DEBUG CMake option, the shrink-wrap benchmark will output something like:
( 2, 2) : [Error -nan/1.17549e-38] [Cycle 0/3]
[Error -nan/1.17549e-38] [Cycle 1/3]
[Error -nan/1.17549e-38] [Cycle 2/3]
[Error -nan/1.17549e-38] [Cycle 3/3]
[Error -nan/1.175e-38] [Cycle 0/3]
[Error -nan/1.175e-38] [Cycle 1/3]
[Error -nan/1.175e-38] [Cycle 2/3]
[Error -nan/1.175e-38] [Cycle 3/3]
[Error -nan/1e-05] [Cycle 0/3]
[Error -nan/1e-05] [Cycle 0/3]
[Error -nan/1e-05] [Cycle 0/3]
This may be normal, because the image to reconstruct is just too small.
Either remove these cases or make them work; otherwise the benchmark time may not even be meaningful (e.g. because a faulty loop could be faster or slower).
Add checks to these benchmarks which test whether the reconstruction actually worked. Currently it is only timed.
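Such a check could be as simple as a relative L2 error against the ground truth (hypothetical helper, not part of the current code base). Note that a phase-retrieval result is in general only unique up to translation and point inversion, so a real check would have to align the reconstruction first:

```cpp
#include <cmath>
#include <vector>

/* Relative L2 error between reconstruction and ground truth; a benchmark
 * could assert this falls below some tolerance instead of only timing. */
float relativeL2Error( std::vector<float> const & reconstructed,
                       std::vector<float> const & groundTruth )
{
    double diff2 = 0, norm2 = 0;
    for ( std::size_t i = 0; i < groundTruth.size(); ++i )
    {
        double const d = (double) reconstructed[i] - groundTruth[i];
        diff2 += d * d;
        norm2 += (double) groundTruth[i] * groundTruth[i];
    }
    return (float) std::sqrt( diff2 / norm2 );
}
```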
A good project needs good documentation (a Readme.md with compile, install and usage instructions).

When calling the algorithm multiple times at once, like in the following,
for ( int i = 1; i < 30; i++ )
{
    imresh::io::addTask( file.first, file.second,
                         imresh::io::writeOutFuncs::writeOutPNG,
                         "imresh_" + std::to_string( i ) + "_cycles.png", i );
}
the results get worse with each loop. After a few loops only a black image is returned (have a look at the attached images, sorry for the inverted order). Also the algorithm's output switches to [Error -nan/1e-05] [Cycle 0/2] for each loop.
Fine-tuning of the shrink-wrap parameters has not solved this issue for me.
It seems like the different algorithm calls influence each other.
This may be necessary while already developing:
For testing purposes we need example data. The following patterns should be provided:
To create the data, we should write a Python script. All patterns should be regeneratable in different resolutions.
Currently the shrink-wrap algorithm needs to copy some reduce-results (and also kernel parameters) to and from the GPU. This could be a potential bottleneck, because these operations can't be interleaved with calculations easily. Implementing the whole shrink-wrap algorithm as a kernel on the GPU which could spawn threads using dynamic parallelism would solve this performance problem.
We could save some cudaMallocs and cudaFrees if taskQueue.cu did the cudaMalloc in the initializer call where it also creates the work thread list. The pointers to the memory locations, or the one large memory location, could then be given to shrink wrap, which in the current version calls cudaMalloc and cudaFree each time.
It would make cudaShrinkWrap harder to call, so I would prefer to copy-paste it to cudaShrinkWrapBatch, which could be called by the former after it allocates the needed memory.
Currently all tests begin their work from the original object to reconstruct and then use the function diffractionPattern to create, well, the diffraction pattern :). Normally the input is a diffraction pattern. We should test if it works for the examples we received.
A problem which occurs here is the anticipated layout of the diffraction pattern. See Issue #34
In the current CMakeLists.txt on the refactoring branch (by now the most up to date), no compiler optimization flags are given. This defaults to no optimization (at least on my machine).
When turning on optimization with anything higher than -O0 (e.g. -O1, -O2 and so on), linking the example binary fails with:
[100%] Linking CXX executable examples
CMakeFiles/examples.dir/examples/createAtomCluster.cpp.o: In function `examples::createAtomCluster(std::vector<unsigned int, std::allocator<unsigned int> > const&)':
createAtomCluster.cpp:(.text+0xf2): undefined reference to `void imresh::libs::gaussianBlur<float>(float* const&, unsigned int const&, unsigned int const&, double const&)'
createAtomCluster.cpp:(.text+0xd4f): undefined reference to `void imresh::libs::gaussianBlur<float>(float* const&, unsigned int const&, unsigned int const&, double const&)'
libimresh.so: undefined reference to `float imresh::algorithms::vectorMax<float>(float* const&, unsigned int const&)'
libimresh.so: undefined reference to `void imresh::algorithms::cuda::cudaGaussianBlur<float>(float* const&, unsigned int const&, unsigned int const&, double const&)'
libimresh.so: undefined reference to `void imresh::algorithms::applyComplexModulus<float [2], float>(float (* const&) [2], float const (* const&) [2], float const* const&, unsigned int const&)'
libimresh.so: undefined reference to `void imresh::algorithms::complexNormElementwise<float, float [2]>(float* const&, float const (* const&) [2], unsigned int const&)'
collect2: Fehler: ld gab 1 als Ende-Status zurück [German: "error: ld returned 1 exit status"]
Right now I have no idea where to start investigating this. As this is not game-breaking at the moment and this is your code, I'd like to ask you, @mxmlnkn, to look into this when you have some spare time. But as this library is intended to be fast, compiler optimizations should be possible at least before the final release.
Maybe it's just an error in the CMakeLists.txt that I don't see right now, so a first step could be to try your old Makefiles with optimizations enabled.
Currently the mask used by the HIO algorithm is stored as an array of floats being either 0 or 1 ... If bit masks could work without memory conflicts or the like, they could give a performance boost of up to 32 (for those parts), because they should reduce the required memory throughput. If the original array is not reinterpreted and overwritten, it will result in a slightly higher memory footprint, but the memory footprint is currently relatively high anyway.
For the shrink-wrap algorithm we need to hold in memory the original intensity, the current step, the last step (for the HIO algorithm) and the mask. This means we have a memory footprint of 4*N_Pixels*sizeof(float). With the bit mask it would be (4+1/32)*N_Pixels*sizeof(float).
If done, of course it needs to be benchmarked and compared to see if it really improves things or if the added bit shifts for packing and unpacking slow it down. Maybe profiling first, to see whether the mask accesses really are a bottleneck, could rule this enhancement out.
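The packing itself is cheap. A host-side sketch of the proposed bit mask, 32 mask values per 32-bit word (the GPU kernels would need the same packing/unpacking logic, which is what would have to be benchmarked):

```cpp
#include <cstdint>
#include <vector>

/* Pack a 0/1 float mask into a bit mask, 32 values per uint32_t. */
std::vector<uint32_t> packMask( std::vector<float> const & mask )
{
    std::vector<uint32_t> packed( ( mask.size() + 31 ) / 32, 0u );
    for ( std::size_t i = 0; i < mask.size(); ++i )
        if ( mask[i] != 0.0f )
            packed[ i / 32 ] |= 1u << ( i % 32 );
    return packed;
}

/* Read a single mask value back out of the packed representation. */
bool isMasked( std::vector<uint32_t> const & packed, std::size_t const i )
{
    return ( packed[ i / 32 ] >> ( i % 32 ) ) & 1u;
}
```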
Input like shown here, which reconstructs a 3D object from a 3D diffraction pattern would be cool.
With createAtomCluster an atom cluster can be generated. The generated pictures smaller than ~500x500 converge fine, but the 3000x1500 version doesn't if the macro variable SCALE_EXAMPLE_IMAGE is set to 1.
Ground Truth (#define SCALE_EXAMPLE_IMAGE 1)
Diffraction Intensity (#define SCALE_EXAMPLE_IMAGE 1)
Reconstructed Image (#define SCALE_EXAMPLE_IMAGE 1) (aborted after 64 iterations)
But if SCALE_EXAMPLE_IMAGE is set to 0, it actually converges exactly as well as the smaller versions.
Ground Truth (#define SCALE_EXAMPLE_IMAGE 0)
Diffraction Intensity (#define SCALE_EXAMPLE_IMAGE 0)
Reconstruction (#define SCALE_EXAMPLE_IMAGE 0)
Note the cluster is barely visible in the upper and bottom left corners.
I'm not sure why this happens. I hope I do not embarrass myself, but my guess actually is on the algorithm itself. First tests have shown that it is a problem with the creation of the initial mask. The initial mask uses a threshold determined by the maximum of the autocorrelation map (Fourier transform of the diffraction intensity). For some reason that mask creation depends on the image dimensions. For the scaled atom cluster too many pixels will be masked out. The shrink wrap basically is too strong and slips off the object to reconstruct.
Initial Mask (#define SCALE_EXAMPLE_IMAGE 1)
Initial Mask (#define SCALE_EXAMPLE_IMAGE 0)
Basically, after tweaking rIntensityCutOffAutoCorel it should also converge, but this parameter should somehow be chosen automatically, or another parameter should be used, e.g. selecting the top 50% of pixels to be masked initially, where 50% is an adjustable parameter. (Note that at 100% and 0% the HIO algorithm may not advance.)
The proposed method would need to make a histogram of the pixels though, which could be a quite expensive operation.
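A full histogram may not even be needed: a selection algorithm can find such a percentile cut-off in O(n) on average. A host-side sketch (hypothetical helper, assuming 0 < topFraction <= 1; rIntensityCutOffAutoCorel would be replaced by the returned value):

```cpp
#include <algorithm>
#include <vector>

/* Return the threshold so that roughly topFraction of the values lie at or
 * above it. Takes the vector by value as scratch space, since nth_element
 * partially reorders it. */
float cutOffForTopFraction( std::vector<float> values,
                            float const topFraction )
{
    std::size_t const k =
        static_cast<std::size_t>( ( 1.0f - topFraction ) * values.size() );
    std::nth_element( values.begin(), values.begin() + k, values.end() );
    return values[ k ];
}
```

On the GPU one would instead need a parallel selection, or a coarse (possibly sampled) histogram, so the relative cost would still have to be measured.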
As we're working with external data, we need an IO interface. The main points to implement for this interface are:
The whole IO interface should use classes and live in its own (sub-)namespace.
When compiling 570d1a5 from the alpaka branch, the Gaussian blur fails, i.e. it seems to never finish.
Compile with compile.sh:
cmake .. \
-DCMAKE_CXX_COMPILER=$(which g++-4.9) \
-DCMAKE_C_COMPILER=$(which gcc-4.9) \
-DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLE=ON \
-DBUILD_DOC=OFF \
-DIMRESH_DEBUG=ON \
-DBUILD_EXAMPLES=ON \
-DRUN_TESTS=ON \
-DUSE_PNG=ON \
-DUSE_SPLASH=ON \
-DUSE_TIFF=ON \
-DUSE_FFTW=ON
Either the kernels need to be debugged or they could be replaced by a Gaussian blur which uses Fourier transformation.
On my GTX 760, for images of 4000x4000 pixels the GPU memory can only accommodate 4 shrink wraps simultaneously.
/* allocate needed memory so that HIO doesn't need to allocate and
* deallocate on each call */
cufftComplex * dpCurData, * dpgPrevious;
float * dpIntensity, * dpIsMasked;
CUDA_ERROR( cudaMalloc( (void**)&dpCurData , sizeof(dpCurData [0])*nElements ) );
CUDA_ERROR( cudaMalloc( (void**)&dpgPrevious, sizeof(dpgPrevious[0])*nElements ) );
CUDA_ERROR( cudaMalloc( (void**)&dpIntensity, sizeof(dpIntensity[0])*nElements ) );
CUDA_ERROR( cudaMalloc( (void**)&dpIsMasked , sizeof(dpIsMasked [0])*nElements ) );
The largest memory allocations needed for shrink wrap can be seen above. That is roughly 6 x pixels x 4 bytes = 384 MiB per call, which makes 1536 MiB after 4 calls. When idle my card uses 208 MiB out of 2048 MiB, so this calculation seems to be correct.
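For reference, a tiny helper reproducing that estimate from the four buffers above (two cufftComplex and two float arrays per pixel, i.e. 6 floats = 24 bytes per pixel):

```cpp
#include <cstddef>

/* Bytes for the four device buffers of one shrink-wrap call:
 * dpCurData + dpgPrevious (cufftComplex = 2 floats each) and
 * dpIntensity + dpIsMasked (1 float each). */
std::size_t shrinkWrapBytes( std::size_t const nx, std::size_t const ny )
{
    std::size_t const complexBytes = 2 * sizeof( float ); // cufftComplex
    return nx * ny * ( 2 * complexBytes + 2 * sizeof( float ) );
}
```

Strictly this gives 384,000,000 B ≈ 366 MiB per 4000x4000 call; the cuFFT plan work area comes on top of that, which presumably explains why the fifth call in the log below runs out of memory.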
The output is:
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
CUDA error in ./imresh/src/imresh/algorithms/cuda/cudaShrinkWrap.cu line:241 : out of memory
terminate called without an active exception
Aborted
Of course there is no helping it if there is a genuine memory shortage, but it's a problem that taskQueue.cu tries to run as many shrink wraps as there are multiprocessors. It should catch the out-of-memory error and stop sending jobs to the GPU. Also, I think the number of concurrently running shrink-wrap calls is a bit high. The shrink-wrap algorithm is fairly computationally expensive and I think it could do well without that added concurrency. But I'm not sure; it should be benchmarked.
Ideas to reduce the memory footprint:
- dpIntensity
- dpIsMasked is also problematic. It could be bit-packed, but we would need temporary memory for the blurring of the mask (we can't overwrite dpCurData or dpgPrevious while creating the mask). As we need the blurred mask only for creating a bit mask, we could however hard-customize the blurring algorithm to take a float as input and output a char, because we don't need much precision for determining where to cut off.
- dpgPrevious, but in a case for which we knew that the reconstructed object is real-valued only, we could use float* instead of cufftComplex for dpgPrevious by using cufftExecC2R; this would reduce generality => it would bring the memory requirement down to 4.75*nPixels*sizeof(float), which is not that much of an improvement for the loss of generality.
Also, we don't know what the input data is, but even if it is a CCD sensor with e.g. only an 8 or 16 bit black/white channel, I don't think cuFFT supports half-precision or integer input. That could otherwise reduce memory requirements by a factor of 2 or even 4 (char).
Thus, work cannot be scheduled across devices.
Implement and try out an extension to shrink-wrap which after shrink wrap finishes slowly begins "blowing up" the mask, i.e. relaxing the support, so that even small features which are almost zero maybe can be reconstructed.
I guess the issue is in the second call to splash::SerialDataCollector::read( int32_t, const char*, const Dimensions, void* ), where the void* buffer is already initialized (a float* with the dimension's scalar size as the size in this case). The first call seems to work, as the scalar size is 40000 (200 * 200), which is the correct size.
As our software should be able to scale from one GPU to multiple GPUs (and maybe multiple nodes in the future), we should implement a hardware abstraction layer. This layer should provide the following information:
This information is necessary to implement a performant stream concept. In addition we should add some convenience functionality such as: