Code Monkey home page Code Monkey logo

Comments (2)

lchu-ibm avatar lchu-ibm commented on July 25, 2024

@nairbv @thoangtrvn @JRosenkranz

from fms-fsdp.

daviswer avatar daviswer commented on July 25, 2024

Stateless Implementation

Although the LCG provides the desired random permutation, this approach introduces extra state to be tracked (our position in the recursively-generated permutation sequence, and/or our position in the shard file). A much cleaner implementation is to use the LCG as a stateless, randomized bijective map from a contiguous range of doc indices to a shuffled, noncontiguous range of doc indices.

We can do this by leveraging the fact that the state of the LCG above is always set to the last emitted value. Since the LCG emits every value in the desired range exactly once per cycle, each seeded by the previous, we can instead simply re-seed the LCG every time with a position index argument, and it will hash that index to a new position with guaranteed no collisions. So at runtime we can simply iterate sequentially through the range of documents in a file shard owned by a given worker (possibly a subset of the full shard), and LCG will provide a map to a new, shuffled, noncontiguous set of documents.

Pros: Introduces no extra state to track, avoids materializing any long shuffled lists of position indices. Allows workers to now perform non-contiguous partitioning of shard files, in cases where files are split over multiple workers.

Cons: Produces similar shuffles across different shard files. The algorithm for finding the bijective mapping provided by LCG for a given index is: "Take the length-m cycle of indices produced by the given choice of m (2^16+1, 2^23, 2^32 above), find the given index, and proceed through the cycle until you land on a new index below your size threshold". This means that two shard files with the same number of documents will receive the same mapping, since they are stepping through the same cycle, the same way. Furthermore, two shard files of size m1, m2 with m2>m1 will also have the same mapping, up to insertion of the new indices greater than m1 and smaller than m2. Thus our LCG mapping is clearly less random than the original shuffled doc list implementation.

from fms-fsdp.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.