ASTRA-sim 2.0

ASTRA-sim is a distributed machine learning system simulator, developed as a joint collaboration between Georgia Tech, Meta, and Intel. The previous version, ASTRA-sim 1.0, is available in the ASTRA-sim-1.0 branch.

Here is a concise visual summary of our simulator:

[Figure: visual summary of the simulator]

For a comprehensive understanding of the tool, and to gain insights into its capabilities, please refer to our paper:

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna, "ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale". In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2023. [pdf]

For tutorials on how to use ASTRA-sim, please visit our tutorial page.

Citation

If you use ASTRA-sim in your research, please cite our paper:

@INPROCEEDINGS{10158106,
    author={Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
    booktitle={2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
    title={ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale},
    year={2023},
    volume={},
    number={},
    pages={283-294},
    doi={10.1109/ISPASS57527.2023.00035}}

Build Instructions

ASTRA-sim can be built either in your local environment or within a Docker container. The following steps will guide you through both methods.

1. Build ASTRA-sim Locally

To build ASTRA-sim without using Docker, you first need to install the necessary packages. This can be done using the following commands:

$ sudo apt-get -y update
$ sudo apt-get -y install \
    gcc g++ make cmake \
    libboost-dev libboost-program-options-dev \
    libprotobuf-dev protobuf-compiler \
    python3 python3-pip git
$ sudo pip3 install protobuf==3.6.1 pydot
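
As a quick sanity check after the install step, the sketch below looks for the executables those packages are expected to provide. The tool list is an assumption derived from the package names above (for example, protoc from protobuf-compiler), and missing_tools is a hypothetical helper, not part of ASTRA-sim.

```python
import shutil

# Executables assumed to come from the packages installed above
# (protoc from protobuf-compiler, pip3 from python3-pip).
REQUIRED_TOOLS = ["gcc", "g++", "make", "cmake", "protoc", "python3", "pip3", "git"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", missing if missing else "none")
```

This only confirms that the executables are on PATH; it does not check library packages such as libboost-dev or the pinned protobuf version.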

Once the packages are installed, you will need to clone this repository onto your local machine using the following command:

$ git clone --recurse-submodules git@github.com:astra-sim/astra-sim.git
$ cd astra-sim

Then, based on your target network backend, execute the corresponding build script:

# For the analytical network backend
$ ./build/astra_analytical/build.sh -c

2. Build ASTRA-sim in a Docker Image

Alternatively, you can build ASTRA-sim within a Docker container. Start by cloning this repository to your local machine using the same command as above:

$ git clone --recurse-submodules git@github.com:astra-sim/astra-sim.git
$ cd astra-sim

Next, create a Docker image using the following command:

$ docker build -t astra-sim .

Once the Docker image is created, you can run it with this command:

$ docker run -it astra-sim

Finally, similar to the local build process, depending on your target network backend, you should run the corresponding build script:

# For the analytical network backend
$ ./build/astra_analytical/build.sh -c

Running ASTRA-sim

Once ASTRA-sim is built, conduct experiments by passing the required configurations. You might need to provide additional configurations based on the network backend. The following configurations are mandatory:

  • --workload-configuration: Path prefix of the execution traces. Trace files are named {path prefix}.{npu_id}.eg; pass only the path prefix.
  • --system-configuration: Path to the system configuration. Example system configurations can be found at inputs/system/.
  • --network-configuration: Path to the network configuration. Example network configurations can be found at inputs/network/.
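
To make the trace naming rule concrete, here is a minimal sketch that expands a path prefix into the per-NPU file names and reports which ones are absent. It assumes NPU ids run from 0 to num_npus - 1; trace_paths and missing_traces are hypothetical helpers, not part of ASTRA-sim or Chakra.

```python
from pathlib import Path


def trace_paths(prefix, num_npus):
    """Expand a path prefix into {path prefix}.{npu_id}.eg for each NPU.

    Assumption: NPU ids are 0 .. num_npus - 1.
    """
    return [Path(f"{prefix}.{npu_id}.eg") for npu_id in range(num_npus)]


def missing_traces(prefix, num_npus):
    """Return the expected trace files that do not exist on disk."""
    return [path for path in trace_paths(prefix, num_npus) if not path.is_file()]


if __name__ == "__main__":
    # Example prefix only; substitute the prefix passed to --workload-configuration.
    for path in missing_traces("./inputs/workload/ASTRA-sim-2.0/Resnet50_DataParallel", 64):
        print("missing:", path)
```

Running a check like this before a simulation can catch a prefix that points at the wrong directory or a trace set generated for a different NPU count.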

Execution traces can be created using Chakra tools. You have the option of using either the execution trace generator (et_generator) or the execution trace converter (et_converter). The et_generator can be used to define and generate any execution traces, functioning as a test case generator. Meanwhile, the et_converter is a trace schema conversion tool, supporting PyTorch and FlexFlow execution traces, as well as ASTRA-sim 1.0 input files.

Using the Execution Trace Generator

You can generate execution traces using et_generator with the following commands:

$ cd extern/graph_frontend/chakra/et_generator
$ cmake CMakeLists.txt && make -j$(nproc)
$ ./et_generator --num_npus 64 --num_dims 1

To run one of the example traces (twoCompNodesDependent), execute the following command.

$ cd -
$ ./build/astra_analytical/build/AnalyticalAstra/bin/AnalyticalAstra \
  --workload-configuration=./extern/graph_frontend/chakra/et_generator/twoCompNodesDependent \
  --system-configuration=./inputs/system/sample_fully_connected_sys.txt \
  --network-configuration=./inputs/network/analytical/fully_connected.json \
  --memory-configuration=./inputs/memory/analytical/no_memory_expansion.json

Upon completion, ASTRA-sim will display the number of cycles it took to run the simulation.

sys[0] finished, 10 cycles
sys[1] finished, 10 cycles
...
sys[62] finished, 10 cycles
sys[63] finished, 10 cycles
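
If you want to post-process this report, the lines can be parsed with a short script. The sketch below assumes every report line has exactly the "sys[N] finished, C cycles" shape shown above; parse_cycles is a hypothetical helper, not part of ASTRA-sim.

```python
import re

# Matches report lines such as "sys[63] finished, 10 cycles".
REPORT_LINE = re.compile(r"sys\[(\d+)\] finished, (\d+) cycles")


def parse_cycles(output):
    """Map each NPU id to the cycle count it reported."""
    return {int(m.group(1)): int(m.group(2)) for m in REPORT_LINE.finditer(output)}


sample = """\
sys[0] finished, 10 cycles
sys[1] finished, 10 cycles
sys[62] finished, 10 cycles
sys[63] finished, 10 cycles
"""
cycles = parse_cycles(sample)
print(len(cycles), max(cycles.values()))  # prints "4 10"
```

Lines that do not match the pattern are ignored, so the simulator's other output can be passed in unfiltered.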

Using the Execution Trace Converter

You can convert ASTRA-sim 1.0 text input files into Chakra traces with the following commands.

$ cd extern/graph_frontend/chakra/
$ python3 setup.py install --user
$ python3 -m et_converter.et_converter \
    --input_type Text \
    --input_filename ../../../inputs/workload/ASTRA-sim-1.0/Resnet50_DataParallel.txt \
    --output_filename ../../../inputs/workload/ASTRA-sim-2.0/Resnet50_DataParallel \
    --num_npus 64 \
    --num_dims 1 \
    --num_passes 1

Then run the simulation on the converted trace with the following command.

$ cd -
$ ./build/astra_analytical/build/AnalyticalAstra/bin/AnalyticalAstra \
  --workload-configuration=./inputs/workload/ASTRA-sim-2.0/Resnet50_DataParallel \
  --system-configuration=./inputs/system/sample_fully_connected_sys.txt \
  --network-configuration=./inputs/network/analytical/fully_connected.json \
  --memory-configuration=./inputs/memory/analytical/no_memory_expansion.json

Upon completion, ASTRA-sim will display the number of cycles it took to run the simulation.

sys[62] finished, 187442108 cycles
sys[61] finished, 187442108 cycles
...
sys[0] finished, 187442108 cycles
sys[63] finished, 187442108 cycles
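
The two runs above invoke the same binary with the same flags and differ only in the workload prefix. As a rough sketch, the command line can be assembled programmatically so that several workloads can be swept with one script. astra_command is a hypothetical helper; the flag names and binary path are copied from the commands above, and the memory configuration path is assumed to be relative to the repository root like the other inputs.

```python
import shlex

# Binary path as used in the commands above (analytical backend).
ANALYTICAL_BINARY = "./build/astra_analytical/build/AnalyticalAstra/bin/AnalyticalAstra"


def astra_command(workload_prefix, system, network, memory, binary=ANALYTICAL_BINARY):
    """Return the argument list for one analytical-backend run."""
    return [
        binary,
        f"--workload-configuration={workload_prefix}",
        f"--system-configuration={system}",
        f"--network-configuration={network}",
        f"--memory-configuration={memory}",
    ]


if __name__ == "__main__":
    cmd = astra_command(
        "./inputs/workload/ASTRA-sim-2.0/Resnet50_DataParallel",
        "./inputs/system/sample_fully_connected_sys.txt",
        "./inputs/network/analytical/fully_connected.json",
        # Assumption: path is relative to the repository root.
        "./inputs/memory/analytical/no_memory_expansion.json",
    )
    # Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument intact if the command is later handed to subprocess.run from the repository root.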

Features Under Active Development

We are constantly working to improve ASTRA-sim and expand its capabilities. Here are some of the features that are currently under active development:

  • Congestion-aware Analytical Network Backend
  • NS3 Network Backend
  • Garnet Network Backend
  • Detailed Statistics Report (Network Utilization)

Please note that these features are under active development and, while we aim to have them available as soon as possible, the completion timeline can vary. Check back regularly for updates on the progress of these and other features. We appreciate your interest and support in ASTRA-sim!

Contact Us

This project is a collaboration of dedicated professionals. Each core developer and contributor plays a unique role in the project. For any inquiries or questions, feel free to reach out to the corresponding developer based on their expertise.

| Developer | Organization | Responsibility | Contact |
|---|---|---|---|
| Saeed Rashidi | Hewlett Packard Labs | ASTRA-sim 1.0, system layer, communicator groups, in-switch collective communication | [email protected] |
| William Won | Georgia Tech | Network layer | [email protected] |
| Taekyung Heo | Georgia Tech | Chakra, workload layer, graph execution engine, memory API | [email protected] |
| Changhai Man | Georgia Tech | Chakra | [email protected] |
| Jinsun Yoo | Georgia Tech | NS3 Network Layer Integration | [email protected] |
| Srinivas Sridharan | Meta | Chakra, General inquiries | [email protected] |
| Tushar Krishna | Georgia Tech | General inquiries | [email protected] |
