computationalradiationphysics / imresh

Shrink-Wrap Phase Reconstruction Algorithm
License: MIT License
The normal input has its brightness maximum in the middle (as do the test diffraction patterns; "normally" the center would be masked ...), but the shrink-wrap algorithm, or more precisely cuFFT (and FFTW), expects the maximum to be in one of the corners (I think at the very first element of the 2D array). This means everything needs to be shifted periodically by half the height/width. A function for this already exists: fftShift.
Normal input images won't be centered perfectly, though, meaning we would have to find the center somehow, e.g. by finding the maximum, but that won't work if it is masked.
I'm not really sure how this is done in other HIO implementations, which have existed for some time. Maybe that fftShift is not even necessary, but I think it at least caused problems in the beginning stages, which is why I wrote fftShift.
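For illustration, here is an out-of-place host version of such a shift: a cyclic shift by half the width and height, which moves the central element to the (0,0) corner, the layout cuFFT/FFTW expect. The library's fftShift may differ (e.g. work in place, or treat odd sizes differently):

```cpp
#include <vector>

/* Cyclically shift a row-major 2D array by (Nx/2, Ny/2), so that the element
 * at the center moves to index (0,0). Out-of-place sketch; for odd dimensions
 * the inverse shift would differ from this forward shift. */
std::vector<float> fftShift( std::vector<float> const & data,
                             unsigned const Nx, unsigned const Ny )
{
    std::vector<float> result( data.size() );
    for ( unsigned iy = 0; iy < Ny; ++iy )
    for ( unsigned ix = 0; ix < Nx; ++ix )
    {
        unsigned const iyShifted = ( iy + Ny / 2 ) % Ny;
        unsigned const ixShifted = ( ix + Nx / 2 ) % Nx;
        result[ iyShifted * Nx + ixShifted ] = data[ iy * Nx + ix ];
    }
    return result;
}
```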
The main loop, where (depending on the parameters) 80% of the time is spent, is:
for ( unsigned iHioCycle = 0; iHioCycle < rnHioCycles; ++iHioCycle )
{
    cudaKernelApplyHioDomainConstraints<<< nBlocks, nThreads, 0, rStream >>>
        ( dpgPrevious, dpCurData, dpIsMasked, nElements, rHioBeta );
    CUFFT_ERROR( cufftExecC2C( ftPlan, dpgPrevious, dpCurData, CUFFT_FORWARD ) );
    cudaKernelApplyComplexModulus<<< nBlocks, nThreads, 0, rStream >>>
        ( dpCurData, dpCurData, dpIntensity, nElements );
    CUFFT_ERROR( cufftExecC2C( ftPlan, dpCurData, dpCurData, CUFFT_INVERSE ) );
}
cudaKernelApplyHioDomainConstraints and cudaKernelApplyComplexModulus are both element-wise functions, so I think cuFFT callbacks are possible and could bring some speedup by reducing global memory accesses, especially because the cuFFT calls and the calls to these two functions take almost 50:50 of the runtime for an image size of 1024x1024.
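For reference, a host-side sketch of what these two element-wise kernels presumably do per element, assuming the textbook HIO update and modulus projection (the real kernels may apply additional constraints, e.g. non-negativity, and whether dpIntensity stores the intensity or already its square root is an assumption here):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using Complex = std::complex<float>;

/* HIO domain constraint: inside the support keep the current estimate,
 * outside (masked) apply the hybrid input-output feedback g' = g_prev - beta*g. */
void applyHioDomainConstraints( std::vector<Complex> & gPrevious,
                                std::vector<Complex> const & gCurrent,
                                std::vector<float> const & isMasked,
                                float const beta )
{
    for ( std::size_t i = 0; i < gPrevious.size(); ++i )
        if ( isMasked[i] != 0.0f )
            gPrevious[i] -= beta * gCurrent[i];
        else
            gPrevious[i] = gCurrent[i];
}

/* Modulus projection: keep the phase, replace the magnitude with the
 * measured one (square root of the diffraction intensity). */
void applyComplexModulus( std::vector<Complex> & data,
                          std::vector<float> const & intensity )
{
    for ( std::size_t i = 0; i < data.size(); ++i )
    {
        float const norm = std::abs( data[i] );
        if ( norm > 0.0f )
            data[i] *= std::sqrt( intensity[i] ) / norm;
    }
}
```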
Note: I heard cuFFT callbacks have some problems, like only working on 64-bit Linux and needing some sort of license for CUDA < 7.0 ... (unchecked rumors)
Many of the automatically generated test cases are not well-suited for reconstruction, because the objects touch the border of the image. Either the image should be a periodic tile or there should be a margin to the image border. E.g. this example has this problem:
That's why the reconstruction looks a bit weird, because the pseudo-periodic image is periodically shifted. But it is interesting that it worked anyway:
The parallel unit test could just spawn 50 or so tasks to the batch system and then compare all their results which should be exactly identical even accounting for floating point rounding errors.
Try checking out dev (de14cd9) and add the following code to src/imresh/libs/cudacommon.hpp:
#ifdef USE_SPLASH
# warning "USE_SPLASH defined!"
#else
# warning "USE_SPLASH NOT defined!"
#endif
#ifdef USE_PNG
# warning "USE_PNG defined!"
#else
# warning "USE_PNG NOT defined!"
#endif
#ifdef IMRESH_DEBUG
# warning "IMRESH_DEBUG defined!"
#else
# warning "IMRESH_DEBUG NOT defined!"
#endif
#ifdef USE_FFTW
# warning "USE_FFTW defined!"
#else
# warning "USE_FFTW NOT defined!"
#endif
#error "!"
Now compile with:
mkdir build
cd build
cmake .. \
-DCMAKE_CXX_COMPILER=$(which g++-4.9) \
-DCMAKE_C_COMPILER=$(which gcc-4.9) \
-DUSE_PNG=ON \
-DUSE_FFTW=ON \
-DUSE_SPLASH=ON \
-DBUILD_DOC=OFF \
-DIMRESH_DEBUG=ON
make VERBOSE=1
The output is:
/usr/bin/nvcc -M -D__CUDACC__ imresh/src/imresh/libs/cudacommon.cu -o imresh/build/CMakeFiles/imresh.dir/src/imresh/libs/imresh_generated_cudacommon.cu.o.NVCC-depend -ccbin /usr/bin/gcc-4.9 -m64 --std c++11 -DIMRESH_DEBUG -DUSE_SPLASH -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D_FORTIFY_SOURCE=2 -DUSE_FFTW -Xcompiler ,\"-Wall\",\"-Wextra\",\"-Wshadow\",\"-Wno-unused-parameter\",\"-O2\",\"-g\",\"-fPIC\",\"-pthread\",\"-fopenmp\",\"-g\" -Xcompiler -Wall,-Wextra,-Wshadow -G -lineinfo -Xptxas=-v -arch=sm_30 -DNVCC -Iimresh/src/imresh -I/usr/include -I/usr/include -Iimresh/src/imresh -I$PNGWRITER_ROOT/include -I/usr/include/freetype2 -I$SPLASH_ROOT/include -I/usr/include/hdf5/serial
[ 5%] Building NVCC (Device) object CMakeFiles/imresh.dir/src/imresh/libs/imresh_generated_cudacommon.cu.o
In file included from imresh/src/imresh/libs/cudacommon.cu:26:0:
imresh/src/imresh/libs/cudacommon.hpp:33:9: warning: #warning "USE_PNG NOT defined!" [-Wcpp]
# warning "USE_PNG NOT defined!"
^
imresh/src/imresh/libs/cudacommon.hpp:36:9: warning: #warning "IMRESH_DEBUG defined!" [-Wcpp]
# warning "IMRESH_DEBUG defined!"
^
imresh/src/imresh/libs/cudacommon.hpp:41:9: warning: #warning "USE_FFTW defined!" [-Wcpp]
# warning "USE_FFTW defined!"
^
imresh/src/imresh/libs/cudacommon.hpp:45:6: error: #error "!"
#error "!"
^
CMake Error at imresh_generated_cudacommon.cu.o.cmake:207 (message):
Error generating
imresh/build/CMakeFiles/imresh.dir/src/imresh/libs/./imresh_generated_cudacommon.cu.o
So for some reason only IMRESH_DEBUG and USE_FFTW are defined, but not USE_PNG.
Note the difference:
add_definitions("-DUSE_PNG ${PNGwriter_DEFINITIONS}")
add_definitions("-DUSE_SPLASH" ${Splash_DEFINITIONS})
Changing this to
add_definitions("-DUSE_PNG" ${PNGwriter_DEFINITIONS})
will make it work.
Note also this minimal-non-working example:
cmake_minimum_required(VERSION 3.5)
find_package( CUDA )
add_definitions( "-D IMRESH_DEBUG" )
add_definitions( "-DIMRESH_DEBUG2" )
SET_SOURCE_FILES_PROPERTIES( main.cpp PROPERTIES CUDA_SOURCE_PROPERTY_FORMAT OBJ )
cuda_add_library( mimi main.cpp )
target_link_libraries( mimi )
Use the following main.cpp:
#ifdef IMRESH_DEBUG
# error "IMRESH_DEBUG defined!"
#endif
int main ( void ){ return 0; }
Here is an excerpt from the output:
/usr/bin/nvcc ~/cmakeIsysBug/main.cpp -x=cu -c -o ~/cmakeIsysBug/build/CMakeFiles/mimi.dir//./mimi_generated_main.cpp.o -ccbin /usr/bin/gcc-4.9 -m64 -DIMRESH_DEBUG2 -Xcompiler ,\"-g\" -DNVCC -I/usr/include -I/usr/include
So in particular there is no compile error, because IMRESH_DEBUG won't be defined, but IMRESH_DEBUG2 is.
This means CMake has a bug where it misinterprets definitions containing spaces. Note that the space after -D is completely fine for the compiler itself!
Comparing the above to this minimal-working CMakeLists.txt
cmake_minimum_required(VERSION 3.5)
add_definitions( "-D IMRESH_DEBUG" )
add_definitions( "-DIMRESH_DEBUG2" )
add_library( mimi main.cpp )
target_link_libraries( mimi )
the output will be:
[ 50%] Building CXX object CMakeFiles/mimi.dir/main.cpp.o
/usr/bin/g++-4.9 -DIMRESH_DEBUG2 -D IMRESH_DEBUG -o CMakeFiles/mimi.dir/main.cpp.o -c ~/cmakeIsysBug/main.cpp
~/cmakeIsysBug/main.cpp:3:5: error: #error "IMRESH_DEBUG defined!"
# error "IMRESH_DEBUG defined!"
^
Meaning the bug seems to be inside FindCUDA.
I'm only kinda surprised that the missing -D USE_PNG didn't lead to any notable errors. Maybe I'm overlooking something again.
This software lacks a short introduction / description paragraph at the top of its README.md :)
I think this would be best done in a forked project, because at least for benchmarking purposes CUDA and alpaka are orthogonal to each other.
Try out test cases which don't have hard edges, and also not only Gaussian edges (a Gaussian falls off to zero quite fast). Instead, e.g., try a linear or Lorentz-like target density distribution, or a combination of low- and high-density targets next to each other. In some smaller tests the low-density target will be forced to 0 by accident, because the high-density target confuses the masking process -> see also #57
We should decide which build system we want to use quite early. I'd like to use CMake but could live with make as well.
The main part of this program. It can be subdivided as follows:
First:
Then:
During implementation benchmarks should be run in order to ensure performant code.
For easier access from usual post-processing workflows a python binding would be nice to have.
After the rebase to dev (before the two merges today) I get the following warning when compiling:
/usr/bin/ld: warning: libhdf5_serial.so.8, needed by /opt/libsplash/lib/libsplash.so, may conflict with libhdf5_serial.so.10
Is this normal? Can this be suppressed?
The example imresh/examples/threadedExample.cpp suggests the following call to free the allocated data, where justFree is just a wrapper around delete[]. This will be run in parallel and is bad (segmentation fault).
// Now we can run the algorithm for testing purposes and free the data
// afterwards
imresh::io::addTask( file.first, file.second,
imresh::io::writeOutFuncs::justFree,
"free" /* gives an identifier for debugging */ );
Possible fixes:
- delete[] inside taskQueue.cu after join(). In this case we need to store the pointers to the data to free.
- Letting the user do it, but the problems with that are plentiful, because it breaks parallelism or is just too complex.
My suggestion is to convert taskQueue.cu into a class, e.g. called ShrinkWrapBatch, so that adding tasks could be as easy as:
{
    ShrinkWrapBatch batch( true /* automatically free given data */ );
    while ( /* files in directory */ )
    {
        batch.add( readFile( filename ) );
    }
} // batch's destructor will be called here
Note that no call to initialize is necessary anymore, because it would be done in the constructor. Also no call to deinit is necessary, because leaving the outer scope calls the destructor. Each data pointer given is freed either by batch.add or by the destructor.
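A minimal host-only sketch of such an RAII wrapper (all names hypothetical; CUDA setup, worker threads and task dispatch are omitted and only indicated in comments):

```cpp
#include <cstddef>
#include <vector>

/* RAII batch: owns the data pointers handed to add() and guarantees they are
 * freed when the batch goes out of scope. The real class would also create
 * the worker threads in the constructor (replacing the explicit initialize
 * call) and join them in the destructor (replacing deinit) before freeing. */
class ShrinkWrapBatch
{
public:
    explicit ShrinkWrapBatch( bool const rFreeData ) : mFreeData( rFreeData ) {}

    void add( float * const data )
    {
        /* ... enqueue a shrink-wrap task working on data ... */
        if ( mFreeData )
            mOwnedData.push_back( data );
    }

    std::size_t nOwned( void ) const { return mOwnedData.size(); }

    ~ShrinkWrapBatch()
    {
        /* ... join worker threads first, so nothing still uses the data ... */
        for ( auto const p : mOwnedData )
            delete[] p;
    }

    /* non-copyable: the batch uniquely owns its pointers */
    ShrinkWrapBatch( ShrinkWrapBatch const & ) = delete;
    ShrinkWrapBatch & operator=( ShrinkWrapBatch const & ) = delete;

private:
    bool mFreeData;
    std::vector< float * > mOwnedData;
};
```

Deleting the copy operations makes the ownership unique, so the data cannot be freed twice by accident.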
E.g. Input:
std::vector< std::string > inputFilenames
std::vector< std::string > OutputFilenames
The loop could be:
Possible further optimizations:
When running the test suite with the IMRESH_DEBUG CMake option, the shrink-wrap benchmark will output something like:
( 2, 2) : [Error -nan/1.17549e-38] [Cycle 0/3]
[Error -nan/1.17549e-38] [Cycle 1/3]
[Error -nan/1.17549e-38] [Cycle 2/3]
[Error -nan/1.17549e-38] [Cycle 3/3]
[Error -nan/1.175e-38] [Cycle 0/3]
[Error -nan/1.175e-38] [Cycle 1/3]
[Error -nan/1.175e-38] [Cycle 2/3]
[Error -nan/1.175e-38] [Cycle 3/3]
[Error -nan/1e-05] [Cycle 0/3]
[Error -nan/1e-05] [Cycle 0/3]
[Error -nan/1e-05] [Cycle 0/3]
This may be normal, because the image to reconstruct is just too small.
Either remove these cases or make them work; otherwise the benchmark time may not even be meaningful (e.g. because a faulty loop could be faster or slower).
Add checks to these benchmarks which test whether the reconstruction actually worked. Currently it is only timed.
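Such a check could be as simple as a relative L2 error against the ground truth (hypothetical helper, not part of the current code base). Note that a phase-retrieval result is in general only unique up to translation and point inversion, so a real check would have to align the reconstruction first:

```cpp
#include <cmath>
#include <vector>

/* Relative L2 error between reconstruction and ground truth; a benchmark
 * could assert this falls below some tolerance instead of only timing. */
float relativeL2Error( std::vector<float> const & reconstructed,
                       std::vector<float> const & groundTruth )
{
    double diff2 = 0, norm2 = 0;
    for ( std::size_t i = 0; i < groundTruth.size(); ++i )
    {
        double const d = (double) reconstructed[i] - groundTruth[i];
        diff2 += d * d;
        norm2 += (double) groundTruth[i] * groundTruth[i];
    }
    return (float) std::sqrt( diff2 / norm2 );
}
```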
A good project needs good documentation (a Readme.md with compile, install and usage instructions).

When calling the algorithm multiple times at once, like in the following,
for ( int i = 1; i < 30; i++ )
{
    imresh::io::addTask( file.first, file.second,
                         imresh::io::writeOutFuncs::writeOutPNG,
                         "imresh_" + std::to_string( i ) + "_cycles.png", i );
}
the results get worse with each loop. After a few loops only a black image is returned (have a look at the attached images, sorry for the inverted order). Also the algorithm's output switches to [Error -nan/1e-05] [Cycle 0/2] for each loop.
Fine-tuning of the shrink-wrap parameters has not solved this issue for me.
It seems like the different algorithm calls influence each other.
This may be necessary while already developing:
For testing purposes we need example data. The following patterns should be provided:
To create the data, we should write a Python script. All patterns should be regeneratable in different resolutions.
Currently the shrink-wrap algorithm needs to copy some reduce-results (and also kernel parameters) to and from the GPU. This could be a potential bottleneck, because these operations can't be interleaved with calculations easily. Implementing the whole shrink-wrap algorithm as a kernel on the GPU which could spawn threads using dynamic parallelism would solve this performance problem.
We could save some cudaMallocs and cudaFrees if taskQueue.cu did the cudaMalloc in the initializer call where it also creates the work thread list. The pointers to the memory locations, or the one large memory location, could then be given to shrink wrap, which in the current version calls cudaMalloc and cudaFree each time.
It would make cudaShrinkWrap harder to call, so I would prefer to copy-paste it to cudaShrinkWrapBatch, which could be called by the former after it allocates the needed memory.
Currently all tests begin their work from the original object to reconstruct and then use the function diffractionPattern to create, well, the diffraction pattern :). Normally the input is a diffraction pattern. We should test if it works for the examples we received.
A problem which occurs here is the anticipated layout of the diffraction pattern. See Issue #34
In the current CMakeLists.txt on the refactoring branch (by now the most up to date), no compiler optimization flags are given. This defaults to no optimization (at least on my machine).
When turning on optimization with anything higher than -O0 (e.g. -O1, -O2 and so on), linking the example binary fails with:
[100%] Linking CXX executable examples
CMakeFiles/examples.dir/examples/createAtomCluster.cpp.o: In function `examples::createAtomCluster(std::vector<unsigned int, std::allocator<unsigned int> > const&)':
createAtomCluster.cpp:(.text+0xf2): undefined reference to `void imresh::libs::gaussianBlur<float>(float* const&, unsigned int const&, unsigned int const&, double const&)'
createAtomCluster.cpp:(.text+0xd4f): undefined reference to `void imresh::libs::gaussianBlur<float>(float* const&, unsigned int const&, unsigned int const&, double const&)'
libimresh.so: undefined reference to `float imresh::algorithms::vectorMax<float>(float* const&, unsigned int const&)'
libimresh.so: undefined reference to `void imresh::algorithms::cuda::cudaGaussianBlur<float>(float* const&, unsigned int const&, unsigned int const&, double const&)'
libimresh.so: undefined reference to `void imresh::algorithms::applyComplexModulus<float [2], float>(float (* const&) [2], float const (* const&) [2], float const* const&, unsigned int const&)'
libimresh.so: undefined reference to `void imresh::algorithms::complexNormElementwise<float, float [2]>(float* const&, float const (* const&) [2], unsigned int const&)'
collect2: Fehler: ld gab 1 als Ende-Status zurück [German: "error: ld returned 1 exit status"]
Right now I have no idea where to start investigating this. As this is not game-breaking at the moment and this is your code, I'd like to ask you, @mxmlnkn, to look into this when you have some spare time. But as this library is intended to be fast, compiler optimizations should be possible at least before the final release.
Maybe it's just an error in the CMakeLists.txt that I don't see right now, so a first step could be to try your old Makefiles with optimizations enabled.
Currently the mask used by the HIO algorithm is stored as an array of floats being either 0 or 1 ... If bit masks could work without memory conflicts or the like, they could give a performance boost of up to 32 (for those parts), because they should reduce the required memory throughput. If the original array is not reinterpreted and overwritten, it will result in a slightly higher memory footprint, but the memory footprint is currently relatively high anyway.
For the shrink-wrap algorithm we need to hold in memory the original intensity, the current step, the last step (for the HIO algorithm) and the mask. This means we have a memory footprint of 4*N_Pixels*sizeof(float). With the bit mask it would be (4+1/32)*N_Pixels*sizeof(float).
If done, of course it needs to be benchmarked and compared to see if it really improves things or if the added bit shifts for packing and unpacking slow it down. Maybe profiling first, to see whether the mask accesses really are a bottleneck, could rule this enhancement out.
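The packing itself is cheap. A host-side sketch of the proposed bit mask, 32 mask values per 32-bit word (the GPU kernels would need the same packing/unpacking logic, which is what would have to be benchmarked):

```cpp
#include <cstdint>
#include <vector>

/* Pack a 0/1 float mask into a bit mask, 32 values per uint32_t. */
std::vector<uint32_t> packMask( std::vector<float> const & mask )
{
    std::vector<uint32_t> packed( ( mask.size() + 31 ) / 32, 0u );
    for ( std::size_t i = 0; i < mask.size(); ++i )
        if ( mask[i] != 0.0f )
            packed[ i / 32 ] |= 1u << ( i % 32 );
    return packed;
}

/* Read a single mask value back out of the packed representation. */
bool isMasked( std::vector<uint32_t> const & packed, std::size_t const i )
{
    return ( packed[ i / 32 ] >> ( i % 32 ) ) & 1u;
}
```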
Input like shown here, which reconstructs a 3D object from a 3D diffraction pattern would be cool.
With createAtomCluster an atom cluster can be generated. The generated pictures smaller than ~500x500 converge fine, but the 3000x1500 version doesn't if the macro variable SCALE_EXAMPLE_IMAGE is set to 1.
Ground Truth (#define SCALE_EXAMPLE_IMAGE 1)
Diffraction Intensity (#define SCALE_EXAMPLE_IMAGE 1)
Reconstructed Image (#define SCALE_EXAMPLE_IMAGE 1) (aborted after 64 iterations)
But if SCALE_EXAMPLE_IMAGE is set to 0, it actually converges exactly as well as the smaller versions.
Ground Truth (#define SCALE_EXAMPLE_IMAGE 0)
Diffraction Intensity (#define SCALE_EXAMPLE_IMAGE 0)
Reconstruction (#define SCALE_EXAMPLE_IMAGE 0)
Note the cluster is barely visible in the upper and bottom left corners.
I'm not sure why this happens. I hope I do not embarrass myself, but my guess actually is on the algorithm itself. First tests have shown that it is a problem with the creation of the initial mask. The initial mask uses a threshold determined by the maximum of the autocorrelation map (Fourier transform of the diffraction intensity). For some reason that mask creation depends on the image dimensions. For the scaled atom cluster too many pixels will be masked out. The shrink wrap basically is too strong and slips off the object to reconstruct.
Initial Mask (#define SCALE_EXAMPLE_IMAGE 1)
Initial Mask (#define SCALE_EXAMPLE_IMAGE 0)
Basically, after tweaking rIntensityCutOffAutoCorel it should also converge, but this parameter should somehow be chosen automatically, or another parameter should be used, e.g. selecting the top 50% of pixels to be masked initially, where 50% is an adjustable parameter. (Note that at 100% and 0% the HIO algorithm may not advance.)
The proposed method would need to make a histogram of the pixels though, which could be a quite expensive operation.
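A full histogram may not even be needed: a selection algorithm can find such a percentile cut-off in O(n) on average. A host-side sketch (hypothetical helper, assuming 0 < topFraction <= 1; rIntensityCutOffAutoCorel would be replaced by the returned value):

```cpp
#include <algorithm>
#include <vector>

/* Return the threshold so that roughly topFraction of the values lie at or
 * above it. Takes the vector by value as scratch space, since nth_element
 * partially reorders it. */
float cutOffForTopFraction( std::vector<float> values,
                            float const topFraction )
{
    std::size_t const k =
        static_cast<std::size_t>( ( 1.0f - topFraction ) * values.size() );
    std::nth_element( values.begin(), values.begin() + k, values.end() );
    return values[ k ];
}
```

On the GPU one would instead need a parallel selection, or a coarse (possibly sampled) histogram, so the relative cost would still have to be measured.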
As we're working with external data, we need an IO interface. The main points to implement for this interface are:
The whole IO interface should use classes and live in its own (sub-)namespace.
When compiling 570d1a5 from the alpaka branch, the Gaussian blur fails, i.e. it seems to never finish.
Compile with compile.sh:
cmake .. \
-DCMAKE_CXX_COMPILER=$(which g++-4.9) \
-DCMAKE_C_COMPILER=$(which gcc-4.9) \
-DALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLE=ON \
-DBUILD_DOC=OFF \
-DIMRESH_DEBUG=ON \
-DBUILD_EXAMPLES=ON \
-DRUN_TESTS=ON \
-DUSE_PNG=ON \
-DUSE_SPLASH=ON \
-DUSE_TIFF=ON \
-DUSE_FFTW=ON
Either the kernels need to be debugged or they could be replaced by a Gaussian blur which uses Fourier transformation.
On my GTX 760, for images of 4000x4000 pixels the GPU memory can only accommodate 4 shrink wraps simultaneously.
/* allocate needed memory so that HIO doesn't need to allocate and
* deallocate on each call */
cufftComplex * dpCurData, * dpgPrevious;
float * dpIntensity, * dpIsMasked;
CUDA_ERROR( cudaMalloc( (void**)&dpCurData , sizeof(dpCurData [0])*nElements ) );
CUDA_ERROR( cudaMalloc( (void**)&dpgPrevious, sizeof(dpgPrevious[0])*nElements ) );
CUDA_ERROR( cudaMalloc( (void**)&dpIntensity, sizeof(dpIntensity[0])*nElements ) );
CUDA_ERROR( cudaMalloc( (void**)&dpIsMasked , sizeof(dpIsMasked [0])*nElements ) );
The largest memory allocations needed for shrink wrap can be seen above. That is roughly 6 x pixels x 4 bytes = 384 MiB per call, which makes 1536 MiB after 4 calls. When idle my card uses 208 MiB out of 2048 MiB, so this calculation seems to be correct.
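For reference, a tiny helper reproducing that estimate from the four buffers above (two cufftComplex and two float arrays per pixel, i.e. 6 floats = 24 bytes per pixel):

```cpp
#include <cstddef>

/* Bytes for the four device buffers of one shrink-wrap call:
 * dpCurData + dpgPrevious (cufftComplex = 2 floats each) and
 * dpIntensity + dpIsMasked (1 float each). */
std::size_t shrinkWrapBytes( std::size_t const nx, std::size_t const ny )
{
    std::size_t const complexBytes = 2 * sizeof( float ); // cufftComplex
    return nx * ny * ( 2 * complexBytes + 2 * sizeof( float ) );
}
```

Strictly this gives 384,000,000 B ≈ 366 MiB per 4000x4000 call; the cuFFT plan work area comes on top of that, which presumably explains why the fifth call in the log below runs out of memory.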
The output is:
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
imresh::io::addTask(): Creating working thread.
imresh::io::addTaskAsync(): Mutex locked, device and stream selected. Calling shrink-wrap.
CUDA error in ./imresh/src/imresh/algorithms/cuda/cudaShrinkWrap.cu line:241 : out of memory
terminate called without an active exception
Aborted
Of course there is no helping it if there is a genuine memory shortage, but it's a problem that taskQueue.cu tries to run as many shrink wraps as there are multiprocessors. It should catch the out-of-memory error and stop sending jobs to the GPU. Also, I think the number of concurrently running shrink-wrap calls is a bit high. The shrink-wrap algorithm is fairly computationally expensive and I think it could do well without that added concurrency. But I'm not sure; it should be benchmarked.
Ideas to reduce the memory footprint:
- dpIntensity
- dpIsMasked is also problematic. It could be bit-packed, but we would need temporary memory for the blurring of the mask (we can't overwrite dpCurData or dpgPrevious while creating the mask). As we need the blurred mask only for creating a bit mask, we could however hard-customize the blurring algorithm to take a float as input and output a char, because we don't need much precision for determining where to cut off.
- dpgPrevious, but in a case for which we knew that the reconstructed object is real-valued only, we could use float* instead of cufftComplex for dpgPrevious by using cufftExecC2R; this would reduce generality => it would bring the memory requirement down to 4.75*nPixels*sizeof(float), which is not that much of an improvement for the loss of generality.
Also, we don't know what the input data is, but even if it is a CCD sensor with e.g. only an 8 or 16 bit black/white channel, I don't think cuFFT supports half-precision or integer input. That could otherwise reduce memory requirements by a factor of 2 or even 4 (char).
Thus, work cannot be scheduled across devices.
Implement and try out an extension to shrink-wrap which after shrink wrap finishes slowly begins "blowing up" the mask, i.e. relaxing the support, so that even small features which are almost zero maybe can be reconstructed.
I guess the issue is in the second call to splash::SerialDataCollector::read( int32_t, const char*, const Dimensions, void* ), where the void* buffer is already initialized (a float* with the dimension's scalar size as the size in this case). The first call seems to work, as the scalar size is 40000 (200 * 200), which is the correct size.
As our software should be able to scale from one GPU to multiple GPUs (and maybe multiple nodes in the future), we should implement a hardware abstraction layer. This layer should provide the following information:
This information is necessary to implement a performant stream concept. In addition we should add some convenience functionality such as: