
lbann's People

Contributors

adammoody, andy-yoo, benson31, bvanessen, davidhysom, dylanmckinney, enmccarthy, forsyth2, graham63, gunney1, ianlee1521, jaeseungyeom, jslenderman, lukejaffe, mcneish1, naoyam, ndryden, oyamay, samadejacobs, spears9, timmoon10, wderekjones

lbann's Issues

Convolution algorithm selection

Currently, the convolution algorithm is selected by the user. Use cudnnFindConvolutionForwardAlgorithm or cudnnGetConvolutionForwardAlgorithm to pick a good algorithm automatically.
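
A minimal sketch of the automatic selection, assuming the layer's cuDNN descriptors are already set up (the helper name is illustrative, not existing code):

```cpp
#include <cudnn.h>
#include <vector>

// Hypothetical helper: benchmark all applicable forward algorithms and
// return the fastest one, instead of relying on a user-supplied choice.
cudnnConvolutionFwdAlgo_t pick_forward_algorithm(
    cudnnHandle_t handle,
    cudnnTensorDescriptor_t x_desc,
    cudnnFilterDescriptor_t w_desc,
    cudnnConvolutionDescriptor_t conv_desc,
    cudnnTensorDescriptor_t y_desc) {
  int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
  int returned = 0;
  std::vector<cudnnConvolutionFwdAlgoPerf_t> perf(requested);
  // cudnnFind actually runs the algorithms; results come back sorted by time.
  cudnnFindConvolutionForwardAlgorithm(
      handle, x_desc, w_desc, conv_desc, y_desc,
      requested, &returned, perf.data());
  return perf[0].algo;  // fastest algorithm found
}
```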

Non-blocking shuffling

Currently, because the shuffle is implemented with a blocking MPI_Alltoallv, the benefit of overlapping with AL's non-blocking allreduces is limited. Using the P2P library, it would not take much work to replace MPI_Alltoallv and realize stream-oriented shuffling.
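
A minimal host-side sketch of the idea, using plain MPI_Isend/MPI_Irecv as a stand-in for the P2P library's stream-aware primitives (names are illustrative):

```cpp
#include <mpi.h>
#include <vector>

// Split the all-to-all shuffle into per-peer non-blocking transfers so each
// piece can be completed (and unpacked) independently, enabling overlap.
void shuffle_p2p(const float* send_buf, const int* send_counts, const int* send_displs,
                 float* recv_buf, const int* recv_counts, const int* recv_displs,
                 MPI_Comm comm) {
  int np;
  MPI_Comm_size(comm, &np);
  std::vector<MPI_Request> reqs;
  reqs.reserve(2 * np);
  for (int p = 0; p < np; ++p) {
    if (recv_counts[p] > 0) {
      reqs.emplace_back();
      MPI_Irecv(recv_buf + recv_displs[p], recv_counts[p], MPI_FLOAT,
                p, 0, comm, &reqs.back());
    }
    if (send_counts[p] > 0) {
      reqs.emplace_back();
      MPI_Isend(send_buf + send_displs[p], send_counts[p], MPI_FLOAT,
                p, 0, comm, &reqs.back());
    }
  }
  // Each request could instead be completed individually as its peer's data
  // becomes ready, which is what allows overlap with AL's allreduces.
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}
```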

Batchnorm with partially global statistics

Aggregate statistics over the spatial domain. With only local statistics, the test ResNet model loses about 10% accuracy when 4-way spatial partitioning is used.
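
A minimal sketch of the aggregation, assuming a hypothetical spatial_comm communicator spanning the processes that hold different spatial sub-domains of the same samples (names are illustrative, not the actual LBANN API):

```cpp
#include <mpi.h>

// Reduce per-channel sums, sums of squares, and element counts across the
// spatial partitions so every process computes identical mean/variance.
void aggregate_bn_statistics(double* sum, double* sqsum, int num_channels,
                             long local_count, long* global_count,
                             MPI_Comm spatial_comm) {
  MPI_Allreduce(MPI_IN_PLACE, sum, num_channels, MPI_DOUBLE, MPI_SUM, spatial_comm);
  MPI_Allreduce(MPI_IN_PLACE, sqsum, num_channels, MPI_DOUBLE, MPI_SUM, spatial_comm);
  MPI_Allreduce(&local_count, global_count, 1, MPI_LONG, MPI_SUM, spatial_comm);
  // As usual: mean = sum / count, var = sqsum / count - mean^2.
}
```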

Convolution/pooling when local spatial dimensions are smaller than filter dimensions

The problem is that each local convolution would require data from non-adjacent remote processes. For example, this happens with the last pooling in the ResNet model, where the pooling size is 7. In general, this tends to happen in later layers of convolutional networks, as the spatial dimensions shrink along the forward traversal of the network.

One approach would be changing the partitioning from spatial to sample-only or filter-only partitioning. However, sample partitioning is not applicable when the mini-batch size is smaller than the number of processes, and filter partitioning is not yet supported and may not be desirable in some cases.

Another workaround is gathering the partitioned regions into a smaller number of larger partitions. This means either only a subset of the processes will be active, or partitions will be redundantly shared.
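
A rough sketch of the fallback decision (hypothetical helper with a simplified feasibility condition: spatial partitioning is kept only while each halo can be supplied by the immediate neighbor, i.e. the halo width does not exceed the local extent):

```cpp
// Illustrative process-grid descriptor; entries are the number of splits
// along sample, filter, height, and width.
struct Partitioning { int sample, filter, height, width; };

// Roughly: a halo of (filter - 1) / 2 cells must fit in a neighbor's
// local sub-domain, otherwise non-adjacent processes would be involved.
bool spatial_partitioning_feasible(int local_h, int local_w,
                                   int filter_h, int filter_w) {
  return local_h >= (filter_h - 1) / 2 && local_w >= (filter_w - 1) / 2;
}

Partitioning choose_partitioning(int mini_batch, int np,
                                 int local_h, int local_w,
                                 int filter_h, int filter_w) {
  if (spatial_partitioning_feasible(local_h, local_w, filter_h, filter_w))
    return {1, 1, np, 1};   // keep spatial partitioning
  if (mini_batch >= np)
    return {np, 1, 1, 1};   // fall back to sample partitioning
  // Otherwise gather into fewer, larger spatial partitions (not shown).
  return {1, 1, 1, 1};
}
```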

Make allocation of Hydrogen matrices only done when requested

The original Hydrogen matrices are currently kept allocated because they are used for debugging. Removing these allocations will be necessary to support large models. It should still be possible to allocate them on request, for example for debugging.
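
A minimal sketch of on-demand allocation, assuming Hydrogen's El::Matrix interface (the wrapper is illustrative, not existing code):

```cpp
#include <El.hpp>
#include <memory>

// Defer allocation of the original matrix until something (e.g. a debug
// dump) actually asks for it; replaces direct member access in the layer.
template <typename T>
class LazyMatrix {
public:
  LazyMatrix(El::Int height, El::Int width)
    : m_height(height), m_width(width) {}
  bool allocated() const { return m_mat != nullptr; }
  El::Matrix<T>& get() {
    if (!m_mat) {  // allocate on first request only
      m_mat = std::make_unique<El::Matrix<T>>(m_height, m_width);
    }
    return *m_mat;
  }
private:
  El::Int m_height, m_width;
  std::unique_ptr<El::Matrix<T>> m_mat;
};
```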

Halo exchange with manual pack and unpack instead of using MPI derived types

Halo exchange is currently implemented with MPI derived types that describe the halo regions. Once the derived types are set up, a halo exchange is just MPI_Isend and MPI_Irecv, as MPI performs the necessary packing and unpacking transparently. With MVAPICH2-GDR this works directly on GPU buffers, and its performance seems fine.

However, Spectrum MPI is extremely slow with derived types over GPU data. It appears to copy every chunk of contiguous data to the host one by one, resulting in a very large number of cudaMemcpy calls: a 256x64x16x16 tensor decomposed with a 1x1xNx1 process grid would incur 256x64 cudaMemcpy calls.

MVAPICH2-GDR has its own issues, so we should also work well with Spectrum MPI, which means we need to support manual on-GPU packing.
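
A rough sketch of what on-GPU packing could look like for an NCHW tensor partitioned along H (hypothetical kernel, not the distconv implementation). It copies the strided halo slabs into one contiguous buffer so a single MPI_Isend of pack_buf replaces the derived type:

```cpp
// One thread per halo element; each "chunk" is one contiguous halo slab
// (e.g. N*C chunks of halo_width*W elements for a 1x1xNx1 grid).
__global__ void pack_halo(const float* __restrict__ tensor,
                          float* __restrict__ pack_buf,
                          int num_chunks,    // e.g. N * C
                          int chunk_len,     // halo_width * W
                          int chunk_stride)  // H * W
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_chunks * chunk_len) {
    const int chunk = i / chunk_len;
    const int offset = i % chunk_len;
    pack_buf[i] = tensor[chunk * chunk_stride + offset];
  }
}

// Launch on the compute stream so packing is ordered after the producing
// convolution:
//   const int total = num_chunks * chunk_len;
//   pack_halo<<<(total + 255) / 256, 256, 0, stream>>>(
//       tensor, pack_buf, num_chunks, chunk_len, chunk_stride);
```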

Use ParallelStrategy to set layer tensor decomposition

Currently, the supported layers always use the 1xNPx1x1 decomposition, where NP is the number of processes. Use ParallelStrategy to configure the process grid. This also requires checking the strategies at each layer boundary and shuffling when they differ. The current scheme just checks whether the previous layer is also distconv-enabled, assuming that once in the distconv world, every layer uses the 1xNPx1x1 strategy.
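
A minimal sketch of the boundary check (hypothetical types and names):

```cpp
#include <array>

// Number of process-grid splits along each of the four tensor dimensions.
using ProcessGrid = std::array<int, 4>;

// Compare adjacent layers' strategies and shuffle only when they differ,
// rather than assuming every distconv layer uses the 1xNPx1x1 strategy.
bool needs_shuffle(bool prev_distconv, const ProcessGrid& prev_grid,
                   bool cur_distconv, const ProcessGrid& cur_grid) {
  // Entering or leaving the distconv world always requires a shuffle.
  if (prev_distconv != cur_distconv) return true;
  if (!cur_distconv) return false;  // both non-distconv: nothing to do
  return prev_grid != cur_grid;     // shuffle only on a real grid change
}
```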

Proper exchange of corner cells when unpacking is delayed

Unpacking is delayed when the halo exchange for backward data convolution is overlapped with backward filter convolution. This does not correctly transfer corner cells to diagonal neighbor processes, and because the filter convolution zero-clears the halo regions, the corner cells always end up zero. The impact on the backward data convolution is likely to be small, but it should not be ignored. See 91935960cef185a9a972367ecbe5e36bb181013e in the distconv repository.
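
A minimal sketch of the missing diagonal exchange, with simplified rank and buffer handling (illustrative, not a drop-in fix):

```cpp
#include <mpi.h>

// Exchange the four corner blocks (halo_width x halo_width cells) with the
// diagonal neighbors. send_corners[d] is the interior corner nearest
// direction d; recv_corners[d] is the halo corner in direction d.
// diag_ranks holds NW, NE, SW, SE; MPI_PROC_NULL at domain edges makes the
// corresponding transfers no-ops.
void exchange_corners(const float* send_corners[4], float* recv_corners[4],
                      const int diag_ranks[4], int corner_elems,
                      MPI_Comm comm) {
  MPI_Request reqs[8];
  for (int d = 0; d < 4; ++d) {
    MPI_Irecv(recv_corners[d], corner_elems, MPI_FLOAT,
              diag_ranks[d], 0, comm, &reqs[2 * d]);
    MPI_Isend(send_corners[d], corner_elems, MPI_FLOAT,
              diag_ranks[d], 0, comm, &reqs[2 * d + 1]);
  }
  MPI_Waitall(8, reqs, MPI_STATUSES_IGNORE);
}
```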

Support AlexNet

Layers yet to be supported (a possible starting point for LRN is sketched after this list):

  • local_response_normalization
  • fc
  • dropout
  • softmax
  • convolution without padding
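
For local_response_normalization, one option would be backing the layer with cuDNN's cross-channel LRN. A minimal sketch, assuming the layer's tensor descriptors already exist (function name is illustrative):

```cpp
#include <cudnn.h>

// Forward LRN across channels via cuDNN; n is the window size, and
// alpha/beta/k are the usual LRN coefficients.
void lrn_forward(cudnnHandle_t handle,
                 cudnnTensorDescriptor_t x_desc, const float* x,
                 cudnnTensorDescriptor_t y_desc, float* y,
                 unsigned n, double alpha, double beta, double k) {
  cudnnLRNDescriptor_t lrn_desc;
  cudnnCreateLRNDescriptor(&lrn_desc);
  cudnnSetLRNDescriptor(lrn_desc, n, alpha, beta, k);
  const float one = 1.0f, zero = 0.0f;
  cudnnLRNCrossChannelForward(handle, lrn_desc, CUDNN_LRN_CROSS_CHANNEL_DIM1,
                              &one, x_desc, x, &zero, y_desc, y);
  cudnnDestroyLRNDescriptor(lrn_desc);
}
```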
