naoyam / lbann
This project forked from llnl/lbann
Livermore Big Artificial Neural Network Toolkit
Home Page: http://software.llnl.gov/lbann/
License: Other
Required for #13
Currently, the algorithm is selected by the user. Use cudnnFindConvolutionForwardAlgorithm or cudnnGetConvolutionForwardAlgorithm to automatically pick a good one.
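The selection logic that cudnnFindConvolutionForwardAlgorithm enables can be sketched as follows. This is illustrative, not LBANN's actual code: AlgoPerf mimics the relevant fields of cudnnConvolutionFwdAlgoPerf_t, and pick_fwd_algo is a hypothetical helper.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the relevant fields of cudnnConvolutionFwdAlgoPerf_t.
struct AlgoPerf {
  int algo;       // algorithm id
  bool ok;        // status == CUDNN_STATUS_SUCCESS
  float time_ms;  // measured runtime
  size_t memory;  // required workspace bytes
};

// cudnnFind* returns results sorted by measured time; pick the fastest
// one that succeeded and fits in the available workspace.
int pick_fwd_algo(const std::vector<AlgoPerf>& perf, size_t ws_limit) {
  for (const auto& p : perf) {
    if (p.ok && p.memory <= ws_limit) return p.algo;
  }
  return -1;  // no viable algorithm
}
```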
Revert #53 when it becomes obsolete.
This requires all layers to support non-sample parallelism.
Currently, because of MPI_Alltoallv, the benefit of overlapping with AL's non-blocking allreduces is limited. Using the P2P library, it would not take much work to replace MPI_Alltoallv and realize stream-oriented shuffling.
Aggregates statistics over spatial domains. The test ResNet model results in about 10% lower accuracy with local statistics when 4-way spatial partitioning is used.
Waiting for LLNL#632
Activations are not allocated when distconv is enabled, but other tensors such as weights are still allocated regardless of whether distconv is used.
See convolution.hpp and base_convolution.hpp
The current overlapping method does not work with multi-dimensional partitioning in backward convolutions.
The problem is that each local convolution will require data from non-adjacent remote processes. For example, this happens with the last pooling in the ResNet model, where the pooling size is 7. In general, this tends to happen in later layers of convolutional networks, as the spatial dimensions shrink along the forward traversal of the network.
One approach would be changing the partitioning from spatial to sample-only or filter-only partitioning. However, sample partitioning is not applicable when the mini-batch size is smaller than the number of processes, and filter partitioning is not yet supported and may not be desirable in some cases.
Another workaround is gathering the partitioned regions into a smaller number of larger partitions. This means either only a subset of the processes will be active, or partitions will be redundantly shared.
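The non-adjacent-neighbor problem can be made concrete with a small sketch: with 7 rows block-partitioned over 4 processes, a 7-wide pooling window spans every partition, so a rank needs data from processes that are not its immediate neighbors. The helper names are illustrative.

```cpp
#include <set>

// Owner rank of a global row under block partitioning of `rows` rows
// over `np` processes (the first rows % np ranks get one extra row).
int owner(int row, int rows, int np) {
  int base = rows / np, extra = rows % np;
  // Rows [0, extra * (base + 1)) belong to the first `extra` ranks.
  if (row < extra * (base + 1)) return row / (base + 1);
  return extra + (row - extra * (base + 1)) / base;
}

// Set of ranks owning any row in a pooling window [start, start + size).
std::set<int> window_owners(int start, int size, int rows, int np) {
  std::set<int> owners;
  for (int r = start; r < start + size; ++r)
    owners.insert(owner(r, rows, np));
  return owners;
}
```

With rows = 7, np = 4 and a size-7 window, the owner set is all four ranks, so rank 0 needs data from rank 3, which is not adjacent.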
The original Hydrogen matrices are still kept allocated, as they are used for debugging. Removing their allocations will be necessary for supporting large models. It should still be possible to allocate them on request, for example for debugging.
It is currently implemented with MPI derived types that express halo regions. Once the derived types are set up, halo exchanges are just MPI_Isend and MPI_Irecv, as MPI performs the necessary packing and unpacking transparently. This works on GPUs with MVAPICH2-GDR, and its performance seems fine.
However, Spectrum MPI is terribly slow with derived types of GPU data. It appears to copy each contiguous chunk to the host one by one, which results in an extremely large number of cudaMemcpy calls for tensors like 256x64x16x16: there would be 256x64 cudaMemcpy calls if it is decomposed with a 1x1xNx1 process grid.
MVAPICH2-GDR has its own issues, so we should also work well with Spectrum MPI; that means we need to support manual on-GPU packing.
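The 256x64 figure follows from simple arithmetic, sketched below under the assumption that each contiguous run of the halo slab is staged with its own cudaMemcpy: for an N x C x H x W tensor partitioned along H, each (sample, channel) pair contributes one contiguous run.

```cpp
#include <cstddef>

// Number of separate contiguous runs in a halo slab of an N x C x H x W
// tensor partitioned along H: one run per (sample, channel) pair.
size_t halo_chunk_count(size_t n, size_t c) { return n * c; }

// Elements per run: the halo depth times the row width W.
size_t halo_chunk_elems(size_t halo, size_t w) { return halo * w; }
```

For the 256x64x16x16 example, that is 256 * 64 = 16384 host staging copies per halo exchange, each covering only halo * 16 elements, which is why manual on-GPU packing into a single buffer is attractive.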
This happens when the last mini-batch has fewer samples than the number of processes.
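A minimal sketch of the under-full last mini-batch, assuming a block distribution of samples over ranks (the helper name is hypothetical): some ranks end up with zero local samples.

```cpp
#include <cstddef>

// Local sample count under block distribution: the first mb_size % np
// ranks get one extra sample. Zero when rank >= mb_size and mb_size < np.
size_t local_sample_count(size_t rank, size_t mb_size, size_t np) {
  return mb_size / np + (rank < mb_size % np ? 1 : 0);
}
```

For example, a final mini-batch of 3 samples on 8 processes leaves ranks 3 through 7 with no samples, which any collective over samples must then handle.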
Try overlapped tiling for addressing network latency cost.
Currently, the supported layers always use a 1xNPx1x1 decomposition, where NP is the number of processes. Use ParallelStrategy to configure the process grid. This will also require checking the strategies at each layer boundary and shuffling if they differ. The current scheme just checks whether the previous layer is also distconv-enabled, assuming that once in the distconv world, every layer uses the 1xNPx1x1 strategy.
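The boundary check described above can be sketched as comparing the process-grid shapes of adjacent layers and shuffling only when they differ. ProcGrid and needs_shuffle are hypothetical names, not LBANN's ParallelStrategy API.

```cpp
// Hypothetical per-layer process-grid shape, one factor per tensor
// dimension.
struct ProcGrid {
  int d0, d1, d2, d3;
  bool operator==(const ProcGrid& o) const {
    return d0 == o.d0 && d1 == o.d1 && d2 == o.d2 && d3 == o.d3;
  }
};

// A tensor must be re-shuffled at a layer boundary iff the producing
// and consuming layers use different grids.
bool needs_shuffle(const ProcGrid& prev, const ProcGrid& next) {
  return !(prev == next);
}
```

With NP = 4, two adjacent layers both using 1x4x1x1 need no shuffle, while a transition to a 4x1x1x1 (sample-partitioned) layer does.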
Unpacking is delayed when the halo exchange for the backward data convolution is overlapped with the backward filter convolution. This does not correctly transfer corner cells to diagonal neighbor processes, and since the filter convolution zero-clears the halo regions, the corner cells are always zero. The impact on the backward data convolution is likely small, but it should not be ignored. See 91935960cef185a9a972367ecbe5e36bb181013e in the distconv repository.
Done in #60. Should be done in the main LBANN branch too.
Layers yet to be supported: