We have been noticing a training slowdown introduced by our dataloader. Up


A revisit on improving the performance of Data Loader about fms-fsdp HOT 2 OPEN

lchu-ibm commented on July 25, 2024

A revisit on improving the performance of Data Loader

from fms-fsdp.

Comments (2)

lchu-ibm commented on July 25, 2024

@nairbv @thoangtrvn @JRosenkranz


daviswer commented on July 25, 2024

Stateless Implementation

Although the LCG provides the desired random permutation, iterating it directly introduces extra state to track: our position in the recursively generated permutation sequence, and/or our position in the shard file. A much cleaner implementation uses the LCG as a stateless, randomized bijective map from a contiguous range of doc indices to a shuffled, noncontiguous range of doc indices.

We can do this by leveraging the fact that the state of the LCG above is always the last emitted value. Since the LCG emits every value in the desired range exactly once per cycle, each seeded by the previous one, we can instead re-seed the LCG on every call with a position index, and it will hash that index to a new position with no collisions guaranteed. At runtime, a worker simply iterates sequentially through the range of documents it owns in a file shard (possibly a subset of the full shard), and the LCG maps each position to a document in a new, shuffled, noncontiguous set.
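
The re-seeding idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the fms-fsdp implementation: the function name `shuffled_index`, the small modulus `M = 2**16`, and the constants `A` and `C` are hypothetical, chosen only so that the LCG has full period and the example runs quickly.

```python
# Hypothetical LCG parameters (not taken from the thread). With a power-of-two
# modulus, an odd increment and a multiplier congruent to 1 mod 4 give a full
# period, so x -> (A * x + C) % M visits every value in [0, M) exactly once.
M = 2**16
A = 25173
C = 13849


def shuffled_index(i: int, n: int) -> int:
    """Map position i in [0, n) to a shuffled position in [0, n), statelessly.

    The LCG is "re-seeded" with i, then stepped until it lands below n.
    """
    if not 0 <= i < n <= M:
        raise ValueError("need 0 <= i < n <= M")
    x = (A * i + C) % M
    while x >= n:  # skip values outside the shard's index range
        x = (A * x + C) % M
    return x


# A worker that owns positions [start, end) of an n-document shard walks them
# sequentially and reads the documents at the mapped indices instead.
n_docs = 1000
start, end = 250, 500
docs_to_read = [shuffled_index(i, n_docs) for i in range(start, end)]
```

Because the map is a bijection on `[0, n)`, workers that own disjoint position ranges read disjoint documents, and nothing beyond the plain position index needs to be checkpointed. In this sketch the average number of LCG steps per lookup is `M / n`, so the modulus should not be much larger than the shard size.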

Pros: Introduces no extra state to track and avoids materializing long shuffled lists of position indices. It also lets workers partition shard files non-contiguously in cases where a file is split over multiple workers.

Cons: Produces similar shuffles across different shard files. The bijective mapping that the LCG provides for a given index is found as follows: "Take the length-m cycle of indices produced by the given choice of m (2^16+1, 2^23, 2^32 above), find the given index, and proceed through the cycle until you land on a new index below your size threshold". This means that two shard files with the same number of documents receive the same mapping, since they step through the same cycle in the same way. Furthermore, two shard files of sizes m1 and m2 with m2>m1 also share the same mapping, up to insertion of the new indices that fall between m1 and m2. Thus our LCG mapping is clearly less random than the original shuffled doc list implementation.
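
The shared structure between shards of different sizes can be checked directly. The sketch below is illustrative only: it uses hypothetical LCG constants (`A`, `C`, `M`) and hypothetical helper names, follows the mapping from index 0 for a smaller and a larger shard, and confirms that the larger shard's visiting order, with the extra indices deleted, is exactly the smaller shard's order.

```python
# Hypothetical full-period LCG parameters (not taken from the thread).
M = 2**16
A = 25173
C = 13849


def shuffled_index(i: int, n: int) -> int:
    """Step the LCG from i until it lands on an index below n."""
    x = (A * i + C) % M
    while x >= n:
        x = (A * x + C) % M
    return x


def visit_order(n: int) -> list[int]:
    """Follow the mapping from index 0 for n steps, recording each index."""
    order, x = [], 0
    for _ in range(n):
        order.append(x)
        x = shuffled_index(x, n)
    return order


small_size, large_size = 500, 800  # two shard sizes, chosen arbitrarily
small = visit_order(small_size)
large = visit_order(large_size)

# Deleting the indices that only exist in the larger shard recovers the
# smaller shard's order: both walk the same underlying LCG cycle.
same_order = [x for x in large if x < small_size] == small
```

Two shards of equal size trivially get identical mappings in this sketch, because `shuffled_index` depends only on the index and the size, never on the shard itself.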
