
lbann's People

Contributors

adammoody, andy-yoo, benson31, bvanessen, davidhysom, dylanmckinney, enmccarthy, forsyth2, graham63, gunney1, ianlee1521, jaeseungyeom, jslenderman, lukejaffe, mcneish1, naoyam, ndryden, oyamay, samadejacobs, spears9, timmoon10, wderekjones

lbann's Issues

Convolution algorithm selection

Currently, the convolution algorithm is selected by the user. Use cudnnFindConvolutionForwardAlgorithm or cudnnGetConvolutionForwardAlgorithm to pick a good algorithm automatically.
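
A minimal sketch of the automatic selection, assuming the layer's cuDNN descriptors are already set up (the helper name is illustrative, not existing code):

```cpp
#include <cudnn.h>
#include <vector>

// Hypothetical helper: benchmark all applicable forward algorithms and
// return the fastest one, instead of relying on a user-supplied choice.
cudnnConvolutionFwdAlgo_t pick_forward_algorithm(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t x_desc,
    cudnnFilterDescriptor_t w_desc,
    cudnnConvolutionDescriptor_t conv_desc,
    cudnnTensorDescriptor_t y_desc) {
  int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
  int returned = 0;
  std::vector<cudnnConvolutionFwdAlgoPerf_t> perf(requested);
  // cudnnFind actually runs the algorithms; results come back sorted by time.
  cudnnFindConvolutionForwardAlgorithm(
      handle, x_desc, w_desc, conv_desc, y_desc,
      requested, &returned, perf.data());
  return perf[0].algo;  // fastest algorithm found
}
```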

Non-blocking shuffling

Currently, because the shuffle is implemented with a blocking MPI_Alltoallv, the benefit of overlapping with AL's non-blocking allreduces is limited. Using the P2P library, it would not take much work to replace MPI_Alltoallv and realize stream-oriented shuffling.
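
A minimal host-side sketch of the idea, using plain MPI_Isend/MPI_Irecv as a stand-in for the P2P library's stream-aware primitives (names are illustrative):

```cpp
#include <mpi.h>
#include <vector>

// Split the all-to-all shuffle into per-peer non-blocking transfers so each
// piece can be completed (and unpacked) independently, enabling overlap.
void shuffle_p2p(const float* send_buf, const int* send_counts, const int* send_displs,
                 float* recv_buf, const int* recv_counts, const int* recv_displs,
                 MPI_Comm comm) {
  int np;
  MPI_Comm_size(comm, &np);
  std::vector<MPI_Request> reqs;
  reqs.reserve(2 * np);
  for (int p = 0; p < np; ++p) {
    if (recv_counts[p] > 0) {
      reqs.emplace_back();
      MPI_Irecv(recv_buf + recv_displs[p], recv_counts[p], MPI_FLOAT,
                p, 0, comm, &reqs.back());
    }
    if (send_counts[p] > 0) {
      reqs.emplace_back();
      MPI_Isend(send_buf + send_displs[p], send_counts[p], MPI_FLOAT,
                p, 0, comm, &reqs.back());
    }
  }
  // Each request could instead be completed individually as its peer's data
  // becomes ready, which is what allows overlap with AL's allreduces.
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}
```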

Batchnorm with partially global statistics

Aggregate statistics over the spatial domain. With only local statistics, the test ResNet model loses about 10% accuracy when 4-way spatial partitioning is used.
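
A minimal sketch of the aggregation, assuming a hypothetical spatial_comm communicator spanning the processes that hold different spatial sub-domains of the same samples (names are illustrative, not the actual LBANN API):

```cpp
#include <mpi.h>

// Reduce per-channel sums, sums of squares, and element counts across the
// spatial partitions so every process computes identical mean/variance.
void aggregate_bn_statistics(double* sum, double* sqsum, int num_channels,
                             long local_count, long* global_count,
                             MPI_Comm spatial_comm) {
  MPI_Allreduce(MPI_IN_PLACE, sum, num_channels, MPI_DOUBLE, MPI_SUM, spatial_comm);
  MPI_Allreduce(MPI_IN_PLACE, sqsum, num_channels, MPI_DOUBLE, MPI_SUM, spatial_comm);
  MPI_Allreduce(&local_count, global_count, 1, MPI_LONG, MPI_SUM, spatial_comm);
  // As usual: mean = sum / count, var = sqsum / count - mean^2.
}
```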

Convolution/pooling when local spatial dimensions are smaller than filter dimensions

The problem is that each local convolution would require data from non-adjacent remote processes. For example, this happens with the last pooling in the ResNet model, where the pooling size is 7. In general, this tends to happen in later layers of convolutional networks, as the spatial dimensions shrink along the forward traversal of the network.

One approach would be changing the partitioning from spatial to sample-only or filter-only partitioning. However, sample partitioning is not applicable when the mini-batch size is smaller than the number of processes, and filter partitioning is not yet supported and may not be desirable in some cases.

Another workaround is gathering the partitioned regions into a smaller number of larger partitions. This means either only a subset of the processes will be active, or partitions will be redundantly shared.
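
A rough sketch of the fallback decision (hypothetical helper with a simplified feasibility condition: spatial partitioning is kept only while each halo can be supplied by the immediate neighbor, i.e. the halo width does not exceed the local extent):

```cpp
// Illustrative process-grid descriptor; entries are the number of splits
// along sample, filter, height, and width.
struct Partitioning { int sample, filter, height, width; };

// Roughly: a halo of (filter - 1) / 2 cells must fit in a neighbor's
// local sub-domain, otherwise non-adjacent processes would be involved.
bool spatial_partitioning_feasible(int local_h, int local_w,
                                   int filter_h, int filter_w) {
  return local_h >= (filter_h - 1) / 2 && local_w >= (filter_w - 1) / 2;
}

Partitioning choose_partitioning(int mini_batch, int np,
                                 int local_h, int local_w,
                                 int filter_h, int filter_w) {
  if (spatial_partitioning_feasible(local_h, local_w, filter_h, filter_w))
    return {1, 1, np, 1};   // keep spatial partitioning
  if (mini_batch >= np)
    return {np, 1, 1, 1};   // fall back to sample partitioning
  // Otherwise gather into fewer, larger spatial partitions (not shown).
  return {1, 1, 1, 1};
}
```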

Make allocation of Hydrogen matrices only done when requested

The original Hydrogen matrices are currently kept allocated because they are used for debugging. Removing these allocations will be necessary to support large models. It should still be possible to allocate them on request, for example for debugging.
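
A minimal sketch of on-demand allocation, assuming Hydrogen's El::Matrix interface (the wrapper is illustrative, not existing code):

```cpp
#include <El.hpp>
#include <memory>

// Defer allocation of the original matrix until something (e.g. a debug
// dump) actually asks for it; replaces direct member access in the layer.
template <typename T>
class LazyMatrix {
public:
  LazyMatrix(El::Int height, El::Int width)
    : m_height(height), m_width(width) {}
  bool allocated() const { return m_mat != nullptr; }
  El::Matrix<T>& get() {
    if (!m_mat) {  // allocate on first request only
      m_mat = std::make_unique<El::Matrix<T>>(m_height, m_width);
    }
    return *m_mat;
  }
private:
  El::Int m_height, m_width;
  std::unique_ptr<El::Matrix<T>> m_mat;
};
```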

Halo exchange with manual pack and unpack instead of using MPI derived types

Halo exchange is currently implemented with MPI derived types that describe the halo regions. Once the derived types are set up, a halo exchange is just MPI_Isend and MPI_Irecv, as MPI performs the necessary packing and unpacking transparently. With MVAPICH2-GDR this works directly on GPU buffers, and its performance seems fine.

However, Spectrum MPI is extremely slow with derived types over GPU data. It appears to copy every chunk of contiguous data to the host one by one, resulting in a very large number of cudaMemcpy calls: a 256x64x16x16 tensor decomposed with a 1x1xNx1 process grid would incur 256x64 cudaMemcpy calls.

MVAPICH2-GDR has its own issues, so we should also work well with Spectrum MPI, which means we need to support manual on-GPU packing.
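
A rough sketch of what on-GPU packing could look like for an NCHW tensor partitioned along H (hypothetical kernel, not the distconv implementation). It copies the strided halo slabs into one contiguous buffer so a single MPI_Isend of pack_buf replaces the derived type:

```cpp
// One thread per halo element; each "chunk" is one contiguous halo slab
// (e.g. N*C chunks of halo_width*W elements for a 1x1xNx1 grid).
__global__ void pack_halo(const float* __restrict__ tensor,
                          float* __restrict__ pack_buf,
                          int num_chunks,    // e.g. N * C
                          int chunk_len,     // halo_width * W
                          int chunk_stride)  // H * W
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_chunks * chunk_len) {
    const int chunk = i / chunk_len;
    const int offset = i % chunk_len;
    pack_buf[i] = tensor[chunk * chunk_stride + offset];
  }
}

// Launch on the compute stream so packing is ordered after the producing
// convolution:
//   const int total = num_chunks * chunk_len;
//   pack_halo<<<(total + 255) / 256, 256, 0, stream>>>(
//       tensor, pack_buf, num_chunks, chunk_len, chunk_stride);
```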

Use ParallelStrategy to set layer tensor decomposition

Currently, the supported layers always use the 1xNPx1x1 decomposition, where NP is the number of processes. Use ParallelStrategy to configure the process grid. This also requires checking the strategies at each layer boundary and shuffling when they differ. The current scheme just checks whether the previous layer is also distconv-enabled, assuming that once in the distconv world, every layer uses the 1xNPx1x1 strategy.
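
A minimal sketch of the boundary check (hypothetical types and names):

```cpp
#include <array>

// Number of process-grid splits along each of the four tensor dimensions.
using ProcessGrid = std::array<int, 4>;

// Compare adjacent layers' strategies and shuffle only when they differ,
// rather than assuming every distconv layer uses the 1xNPx1x1 strategy.
bool needs_shuffle(bool prev_distconv, const ProcessGrid& prev_grid,
                   bool cur_distconv, const ProcessGrid& cur_grid) {
  // Entering or leaving the distconv world always requires a shuffle.
  if (prev_distconv != cur_distconv) return true;
  if (!cur_distconv) return false;  // both non-distconv: nothing to do
  return prev_grid != cur_grid;     // shuffle only on a real grid change
}
```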

Proper exchange of corner cells when unpacking is delayed

Unpacking is delayed when the halo exchange for backward data convolution is overlapped with backward filter convolution. This does not correctly transfer corner cells to diagonal neighbor processes, and because the filter convolution zero-clears the halo regions, the corner cells always end up zero. The impact on the backward data convolution is likely to be small, but it should not be ignored. See 91935960cef185a9a972367ecbe5e36bb181013e in the distconv repository.
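
A minimal sketch of the missing diagonal exchange, with simplified rank and buffer handling (illustrative, not a drop-in fix):

```cpp
#include <mpi.h>

// Exchange the four corner blocks (halo_width x halo_width cells) with the
// diagonal neighbors. send_corners[d] is the interior corner nearest
// direction d; recv_corners[d] is the halo corner in direction d.
// diag_ranks holds NW, NE, SW, SE; MPI_PROC_NULL at domain edges makes the
// corresponding transfers no-ops.
void exchange_corners(const float* send_corners[4], float* recv_corners[4],
                      const int diag_ranks[4], int corner_elems,
                      MPI_Comm comm) {
  MPI_Request reqs[8];
  for (int d = 0; d < 4; ++d) {
    MPI_Irecv(recv_corners[d], corner_elems, MPI_FLOAT,
              diag_ranks[d], 0, comm, &reqs[2 * d]);
    MPI_Isend(send_corners[d], corner_elems, MPI_FLOAT,
              diag_ranks[d], 0, comm, &reqs[2 * d + 1]);
  }
  MPI_Waitall(8, reqs, MPI_STATUSES_IGNORE);
}
```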

Support AlexNet

Layers yet to be supported (a possible starting point for LRN is sketched after this list):

  • local_response_normalization
  • fc
  • dropout
  • softmax
  • convolution without padding
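
For local_response_normalization, one option would be backing the layer with cuDNN's cross-channel LRN. A minimal sketch, assuming the layer's tensor descriptors already exist (function name is illustrative):

```cpp
#include <cudnn.h>

// Forward LRN across channels via cuDNN; n is the window size, and
// alpha/beta/k are the usual LRN coefficients.
void lrn_forward(cudnnHandle_t handle,
                 cudnnTensorDescriptor_t x_desc, const float* x,
                 cudnnTensorDescriptor_t y_desc, float* y,
                 unsigned n, double alpha, double beta, double k) {
  cudnnLRNDescriptor_t lrn_desc;
  cudnnCreateLRNDescriptor(&lrn_desc);
  cudnnSetLRNDescriptor(lrn_desc, n, alpha, beta, k);
  const float one = 1.0f, zero = 0.0f;
  cudnnLRNCrossChannelForward(handle, lrn_desc, CUDNN_LRN_CROSS_CHANNEL_DIM1,
                              &one, x_desc, x, &zero, y_desc, y);
  cudnnDestroyLRNDescriptor(lrn_desc);
}
```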
