
taco's Introduction

The Tensor Algebra Compiler (taco) is a C++ library that computes tensor algebra expressions on sparse and dense tensors. It uses novel compiler techniques to get performance competitive with hand-optimized kernels in widely used libraries for both sparse tensor algebra and sparse linear algebra.

You can use taco as a C++ library that lets you load tensors, read tensors from files, and compute tensor expressions. You can also use taco as a code generator that generates C functions that compute tensor expressions.

Learn more about taco at tensor-compiler.org, in the paper The Tensor Algebra Compiler, or in this talk. To learn more about where taco is going in the near-term, see the technical reports on optimization and formats.

You can also subscribe to the taco-announcements email list where we post announcements, RFCs, and notifications of API changes, or the taco-discuss email list for open discussions and questions.

TL;DR: build taco using CMake, then run make test.

Build and Test

Build taco using CMake 2.8.12 or greater:

cd <taco-directory>
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j8

Building Python API

To build taco with the Python API (pytaco), add -DPYTHON=ON to the cmake line above. For example:

cmake -DCMAKE_BUILD_TYPE=Release -DPYTHON=ON ..

You will then need to add the pytaco module to PYTHONPATH:

export PYTHONPATH=<taco-directory>/build/lib:$PYTHONPATH

pytaco requires NumPy and SciPy to be installed.
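
To quickly check that the module is importable after setting PYTHONPATH (assuming NumPy and SciPy are installed):

python3 -c "import pytaco"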

Building for OpenMP

To build taco with support for parallel execution (using OpenMP), add -DOPENMP=ON to the cmake line above. For example:

cmake -DCMAKE_BUILD_TYPE=Release -DOPENMP=ON ..
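
Since parallel execution is implemented with OpenMP, the standard OMP_NUM_THREADS environment variable should control the number of threads used at runtime, for example:

export OMP_NUM_THREADS=8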

Building for CUDA

To build taco for NVIDIA CUDA, add -DCUDA=ON to the cmake line above. For example:

cmake -DCMAKE_BUILD_TYPE=Release -DCUDA=ON ..

Please also make sure that you have CUDA installed properly and that the following environment variables are set correctly:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH

If you do not have CUDA installed, you can still use the taco CLI to generate CUDA code with the -cuda flag.
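
For example, the following (reusing the SpMV expression from the examples below) generates CUDA code instead of C:

./build/bin/taco "a(i) = B(i,j) * c(j)" -f=B:ds -cuda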

Running tests

To run all tests:

cd <taco-directory>/build
make test

Tests can be run in parallel by setting CTEST_PARALLEL_LEVEL=<n> in the environment, which makes ctest run up to <n> tests concurrently.
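
For example, to run eight tests at a time:

export CTEST_PARALLEL_LEVEL=8
make test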

To run the C++ test suite individually:

cd <taco-directory>
./build/bin/taco-test

To run the Python test suite individually:

cd <taco-directory>
python3 build/python_bindings/unit_tests.py

Code coverage analysis

To enable code coverage analysis, configure with -DCOVERAGE=ON. This requires the gcovr tool to be installed in your PATH.

For best results, the build type should be set to Debug. For example:

cmake -DCMAKE_BUILD_TYPE=Debug -DCOVERAGE=ON ..

Then to run code coverage analysis:

make gcovr

This will run the test suite and produce a coverage report. The process requires that the tests pass, so any failures must be fixed first. If all goes well, coverage results will be written to the coverage/ folder. See coverage/index.html for a high-level report, and click individual files to see the line-by-line results.

Library example

The following sparse tensor-times-vector multiplication example in C++ shows how to use the taco library.

#include <iostream>
#include "taco.h"

using namespace taco;

// Create formats
Format csr({Dense,Sparse});
Format csf({Sparse,Sparse,Sparse});
Format  sv({Sparse});

// Create tensors
Tensor<double> A({2,3},   csr);
Tensor<double> B({2,3,4}, csf);
Tensor<double> c({4},     sv);

// Insert data into B and c
B.insert({0,0,0}, 1.0);
B.insert({1,2,0}, 2.0);
B.insert({1,2,1}, 3.0);
c.insert({0}, 4.0);
c.insert({1}, 5.0);

// Pack inserted data as described by the formats
B.pack();
c.pack();

// Form a tensor-vector multiplication expression
IndexVar i, j, k;
A(i,j) = B(i,j,k) * c(k);

// Compile the expression
A.compile();

// Assemble A's indices and numerically compute the result
A.assemble();
A.compute();

std::cout << A << std::endl;
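
For the data inserted above, the result has exactly two nonzero entries: A(0,0) = 1.0 * 4.0 = 4.0 and A(1,2) = 2.0 * 4.0 + 3.0 * 5.0 = 23.0.

To build the example as a standalone program, a minimal compile command might look like the following (the include and library paths are assumptions; adjust them to your checkout and build directory):

g++ -std=c++11 -O3 example.cpp -o example -I<taco-directory>/include -L<taco-directory>/build/lib -ltaco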

Code generation tools

If you just need to compute a single tensor kernel, you can use the taco online tool to generate a custom C library. You can also use the taco command-line tool to the same effect:

cd <taco-directory>
./build/bin/taco
Usage: taco [options] <index expression>

Examples:
  taco "a(i) = b(i) + c(i)"                            # Dense vector add
  taco "a(i) = b(i) + c(i)" -f=b:s -f=c:s -f=a:s       # Sparse vector add
  taco "a(i) = B(i,j) * c(j)" -f=B:ds                  # SpMV
  taco "A(i,l) = B(i,j,k) * C(j,l) * D(k,l)" -f=B:sss  # MTTKRP

Options:
  ...

For more information, see our paper on the taco tools, taco: A Tool to Generate Tensor Algebra Kernels.

taco's People

Contributors

dcbdan, drhagen, fredrikbk, gkanwar, infinoid, lugatod, oicirtap, penpornk, rawnhenry, roastduck, rohany, rsenapps, shoaibkamil, stephenchouca, syoyo, weiya711, willow-ahrens, ychen306

taco's Issues

mappers: integrate backpressuring for index launches

Unify the backpressuring mechanism to handle both individual task launches and points from an index launch. A level of indirection here is likely all that is needed. The approach for doing the backpressuring on index launches is in #102.

upgrade tensor_to_hdf5 to potentially prepack tensors into a given format

We could pack a tensor into a desired compressed format (like CSR) once, while preprocessing the tensor in tensor_to_hdf5. Future operations could then just load the prepacked tensor and avoid a somewhat expensive packing step, saving time when running experiments.

Legion Issues Tracker

A meta issue for me to track all of the Legion issues that I'm blocked on.

Legion

These are bugs blocking either implementation of benchmarks that I want to write, or evaluation of benchmarks that I already have.

GASNetEx

  • Invalid gather from GPU device memory StanfordLegion/legion#1122 -- hard blocker. Shows up in MTTKRP, COSMA. Running with -gex:bindcuda 0 gets rid of the segfault, but at an unacceptable performance cost (basically puts us back at gasnet1 speeds).
  • Crashes when running at high node counts StanfordLegion/legion#1118 (blocks large multi-GPU GEMM runs). Even with -gex:batch 0 I see some errors, and performance with -gex:batch 0 is noticeably worse.
  • 2D Gathers with Gex StanfordLegion/legion#1138. I believe the multi-node adapt benchmark needs this. I think this will also improve performance of the 2D algorithms on rectangular node counts.

Core Legion

  • Assertion failure around futures / hangs with mttkrp + futures StanfordLegion/legion#1123. (TODO (rohany): When this is fixed, pull control replication to get the fix for memory corruption around instance reuse).
  • Slow CPU fills StanfordLegion/legion#1113
  • Memory corruption around instance reuse StanfordLegion/legion#1108 (Sean has a patch for this in progress)
  • Collective instances StanfordLegion/legion#546. These are necessary for algorithms which fully replicate a (small) single region among all GPUs. We have a hacky workaround for now, but it isn't great. Collective instances will also be needed for algorithms (like mttkrp) that replicate a portion of a region onto many nodes (StanfordLegion/legion#1125).
  • Another hang around future maps: StanfordLegion/legion#1145

Feature Requests

  • Virtual mappings for reduction instances StanfordLegion/legion#1119. Without this, we can't feasibly implement hierarchical distribution with algorithms that use reductions at the leaves (COSMA, Johnson, Solomonik). This does not look like it's going to be implemented by the PLDI deadline, so I should pivot things to performing flat decompositions over each GPU.

Legate

I'd like to benchmark against Legate, but there are some errors here blocking me from doing so.

  • Sharding functor output error nv-legate/cunumeric#53
  • Bad instance mapping leading to OOMs (I don't have an issue for this, but Manolis said he is looking into it).

lowerer: implement pos distribution with outer partitioning

The lowerer currently doesn't handle the case where we do a position split but some tensor dimensions are also partitioned by an outer variable, so the constructed projection needs to handle dimensions partitioned from the outside. This can be seen in #97.

taco: implement hierarchical partitions for up-front partitioning

I still have to decide if and when I should do this, but something like it may be possible. The idea is that we can create our partitions once up front, even for the hierarchical ones, by explicitly holding onto all of the handles that sub-partitioning tasks create. However, this leads to some extremely nested structures, where perhaps too many partitions are being created at once (consider Cannon's GPU implementation, which creates a partition of each tensor at each of four levels of the computation: node level, node-level k loop, GPU level, and GPU-level k loop). This quickly balloons into a large number of partition objects to manage at the top level. Before undertaking this, I would need to be more convinced that there are performance wins to be had here.

*: integrate with collective instances when ready

As soon as the collective instance branch is ready for testing, we need to move to it and introduce new concepts in the lowerer and mapper to correctly handle creating them. This is a three phase process.

  1. Simple "replicated tensor" computations, and removal of all of the hard-coded manual replication. Codes that come to mind are:
  • SpMV weak scaling
  • TTV
  • TTMC
  • MTTKRP
  2. More complicated launch patterns where subsets of launches need pieces of tensors:
  • Johnson's Algorithm
  • COSMA
  3. 2D matrix computations that do lock-step broadcast communications, such as SUMMA. This step will be the hardest, as it requires changing a lot of code. The problem with the current approach is that it does one 2D launch, and then each launched sub-task launches a bunch more tasks. It's likely that we will need to convert this into a 3D launch with a projection functor that understands the ordering between tasks (also generated by DISTAL), and then choose collectives to use for each row/column.
  • PUMMA
  • SUMMA
  • 2.5D MatMul

partitioning: hoisting creation of partitions of every tensor to the top level

The current strategy for naming partitions by their iteration space point identifier works great when all tensors are partitioned at the top level. However, when a tensor is partitioned only at lower levels of the computation, this becomes complicated as we need to ensure that all index spaces in the LegionTensor get a different color, and can also look up that color later. We avoid this problem for all tensors partitioned at the top level by letting the top level partition color be auto generated, giving all sub-partitions of the tensor a separate namespace.

In addition to a correctness motivation, there may be a performance benefit here. For example, consider the computation A(i, j) = B(i, k) * C(k, j), where we distribute i and communicate C under j. Here, we would want to create a single partition of C at the top level that all task loops over j operate on. The current strategy creates a partition of C for each subtask instead, which may not be as performant.

Hoisting these partitioning operations would simplify several aspects of the taco-legion interop (and avoid gotchas around index space reuse), as well as potentially allow for some performance improvements. The optimization is not deep (it is just an application of loop-invariant code motion), but it will be some work to implement; a sketch follows.
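
Schematically, the hoisting is ordinary loop-invariant code motion. The sketch below is illustrative only; partitionByJ and launchSubtask are hypothetical stand-ins, not the actual generated Legion code:

// Before: every iteration re-creates the same loop-invariant partition of C.
for (auto point : launchDomain) {
  auto cPartition = partitionByJ(C);  // hypothetical helper; repeated, loop-invariant work
  launchSubtask(point, cPartition);
}

// After: the partition of C is created once at the top level and reused.
auto cPartition = partitionByJ(C);
for (auto point : launchDomain) {
  launchSubtask(point, cPartition);
}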

scripts: make install script more robust

Several small things to do:

  • Be resilient to partial installations and allow for rebuilding certain components
  • Control the number of threads to use with make
  • Command line control over which builds are expected to succeed

Potential higher order tensor operations to implement

  • Inner product of two order-3 tensors ($\sum_{ijk} A_{ijk} B_{ijk}$) -- I don't think the compiler can generate code that does this right now. It also seems like another case that requires collective instances for the broadcast reduction. I looked into this a bit more, and here's a plan of attack: when doing a scalar reduction like this, we should return the result from the intermediate tasks and then use runtime->reduce_future_map to get the final result, rather than replicating a dummy region that everyone reduces into.
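
A rough sketch of the future-map approach in Legion (the launcher setup is elided, and the built-in reduction-op ID is an assumption; only runtime->reduce_future_map is taken from the plan above):

// Hypothetical sketch: each point task returns its partial inner product as a
// Future, and the FutureMap is reduced instead of reducing into a dummy region.
FutureMap partials = runtime->execute_index_space(ctx, launcher);
Future result = runtime->reduce_future_map(ctx, partials,
                                           LEGION_REDOP_SUM_FLOAT64);  // assumed built-in sum redop
double total = result.get_result<double>();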

similarly to .assemble, add a .partitionForCompute() or similar method

In order to avoid doing data-dependent partitioning operations (or partitioning operations in general) when they are unneeded, we should expose a method that calculates all of the partitions needed (for at least the top level) of a distributed computation and returns them to the user. Then, the compute method should accept partitions for each of the tensors and not do any partitioning of its own, to limit the runtime overhead of tracking all the extra metadata.

It's unclear how much of a benefit this will give us for dense computations, but it definitely will be useful for sparse computations where partitioning operations are data dependent. In such cases, avoiding a partitioning operation at each iteration of a solver will greatly improve performance.
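
A hypothetical sketch of the proposed usage (partitionForCompute and the partitions argument to compute are proposals from this issue, not existing taco API):

// Hypothetical: compute all top-level partitions once, outside the solver loop.
auto partitions = A.partitionForCompute();
for (int iter = 0; iter < maxIterations; iter++) {
  // Hypothetical: compute() reuses the supplied partitions instead of repartitioning.
  A.compute(partitions);
}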

cosma: depending on the performance, maybe add steps

If the COSMA performance looks like what I saw with Johnson's algorithm (where the computation is mainly bottlenecked on getting data and then runs a single task), add the steps back into the algorithm to get some pipelining going.

DISTAL Library Checklist

  • Test that it works on Sapling in multi-node settings
  • Test that it works on Lassen in multi-node settings
  • Stop printing output from the compilation command to stdout unless asked
  • Implement GPU compilation support
  • Move runtime library into the DISTAL directory
  • Move runtime library within a namespace
  • Move submodule dependencies into a folder called deps
  • Transition compilation tests and generated code tests to use the new API, integrate with gtest
  • Figure out installation details / exports of link and include directories to a target

support distributed temporaries with precompute

The precompute transformation currently supports temporaries only when they live within a single node. Based on discussions at PLDI, it would be useful to be able to factorize computations with precompute where the temporaries themselves are distributed across the machine, similarly to the input tensors.
