Comments (27)
Thanks for asking about this. The slow performance is not too surprising: when the contraction is specified this way, CTF first computes a partial sum of the tensor over the j index, which yields a sparse output matrix and is slow due to the associated sparse data management. Instead, I would recommend using the CTF MTTKRP kernel, multiplying over the j index with a vector of all 1s. The general sparse contraction/summation kernels are orders of magnitude slower than the multi-tensor MTTKRP kernel (see https://arxiv.org/pdf/2106.07135.pdf for details). There are also other multi-tensor contraction kernels in CTF, and a general one is currently in development. These currently need to be invoked manually; when contractions are specified with einsum syntax, pairwise contractions are performed, and performance can suffer, e.g., if some computed intermediate quantities are sparse.
from ctf.
Thanks Edgar! Can you point me to the docs or the header files where these kernels are declared? I'm not sure how to go about finding them (a search on the doc site came up blank).
Sorry, yeah, the documentation of this may be incomplete. There is some (albeit Python-oriented) introduction to this in the tutorial:
https://github.com/cyclops-community/ctf/wiki/Tensor-introduction#multi-tensor-contraction
The currently available multi-tensor routines are mainly just MTTKRP and TTTP (a generalization of SDDMM). I think @raghavendrak may also have an implementation of TTMc for third-order tensors on another fork. This is the C++ interface file for them:
https://github.com/cyclops-community/ctf/blob/master/src/interface/multilinear.h
I'm sorry Edgar, can you give me a little more guidance on how to use the MTTKRP kernel in multilinear.h to compute the kernel I'm looking at (a["i"] = B["ijk"] * c["k"])? I can't tell from the comments how this works (it seems like if B is 3-dimensional, then the other operands need to be 2-dimensional).
MTTKRP performs A["ir"] = B["ijk"]*C["jr"]*D["kr"] (that should also be interpretable as a command by CTF in C++), so if A, C, D are n-by-1 and C["jr"]=1, you would recover the operation you are interested in as a special case. I think the mat_list interface will accept vectors too and automatically do this (if not, you can call the reshape function).
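To pin down what that special case computes, here is a dense, CTF-free loop-nest sketch (the names mttkrp_mode0 and ttv_jk are hypothetical helpers, not CTF API): the mode-0 MTTKRP with r = 1, an all-ones j-mode factor, and the k-mode factor set to c reproduces a["i"] = B["ijk"]*c["k"].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tens3 = std::vector<std::vector<std::vector<double>>>;
using Mat = std::vector<std::vector<double>>;

// Dense reference MTTKRP over mode 0:
// A[i][r] = sum_{j,k} B[i][j][k] * C[j][r] * D[k][r]
Mat mttkrp_mode0(const Tens3& B, const Mat& C, const Mat& D) {
    std::size_t I = B.size(), J = C.size(), K = D.size(), R = C[0].size();
    Mat A(I, std::vector<double>(R, 0.0));
    for (std::size_t i = 0; i < I; ++i)
        for (std::size_t j = 0; j < J; ++j)
            for (std::size_t k = 0; k < K; ++k)
                for (std::size_t r = 0; r < R; ++r)
                    A[i][r] += B[i][j][k] * C[j][r] * D[k][r];
    return A;
}

// Direct evaluation of a[i] = sum_{j,k} B[i][j][k] * c[k]
std::vector<double> ttv_jk(const Tens3& B, const std::vector<double>& c) {
    std::vector<double> a(B.size(), 0.0);
    for (std::size_t i = 0; i < B.size(); ++i)
        for (std::size_t j = 0; j < B[i].size(); ++j)
            for (std::size_t k = 0; k < c.size(); ++k)
                a[i] += B[i][j][k] * c[k];
    return a;
}
```

With R = 1, C[j][0] = 1, and D[k][0] = c[k], the column A[i][0] coincides with a[i].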
Thanks for your help Edgar! I think I've implemented this as you imagined:
int x = 12092;
int y = 9184;
int z = 28818;
int lens[] = {x, y, z};
Tensor<double> B(3, true /* is_sparse */, lens, dw);
Vector<double> a(x, dw); // output over i (mode 0)
Vector<double> c(y, dw); // factor over j (mode 1) -- must have length y, not z
Vector<double> d(z, dw); // factor over k (mode 2)
c.fill_random(1.0, 1.0); // fill with all ones
d.fill_random(1.0, 1.0);
a.fill_random(0.0, 0.0); // zero the output
Tensor<double>* mats[] = {&a, &c, &d};
MTTKRP(&B, mats, 0, false);
I was unsure about the last argument and where to put the output tensor, but it seemed to work out?
Using this specialized kernel removed a good portion of the overhead, bringing the time from 2 seconds to 90 ms (vs. 10 ms hand-coded).
The output tensor is the 0th argument since you passed 0 to the kernel, so that looks fine (I assume B is filled with nonzeros elsewhere). We recorded 2-4X overhead relative to SPLATT's MTTKRP, which uses CSF. The current implementation uses a less compact default sparse tensor format than CSF (we are working on integrating CSF into this and other kernels) and is minimally optimized.
One last question. If I understand correctly, the above approach isn't going to work out if I want to run a slightly different SpTTV kernel: a["ij"] = B["ijk"] * c["k"]?
Yes, for that, just performing the CTF contraction as usual would be advisable. Choosing A to be sparse or dense will lead to different performance, depending on the sparsity of B. The multi-tensor kernels we've introduced so far are driven by common applications in tensor decomposition methods.
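For reference, the semantics of that SpTTV a["ij"] = B["ijk"] * c["k"] are just the following loop nest (a dense, CTF-free sketch; ttv_mode2 is a hypothetical name, not part of CTF):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tens3b = std::vector<std::vector<std::vector<double>>>;
using Matb = std::vector<std::vector<double>>;

// a[i][j] = sum_k B[i][j][k] * c[k]  -- dense reference for the mode-2 TTV
Matb ttv_mode2(const Tens3b& B, const std::vector<double>& c) {
    Matb a(B.size(), std::vector<double>(B[0].size(), 0.0));
    for (std::size_t i = 0; i < B.size(); ++i)
        for (std::size_t j = 0; j < B[i].size(); ++j)
            for (std::size_t k = 0; k < c.size(); ++k)
                a[i][j] += B[i][j][k] * c[k];
    return a;
}
```

Unlike the MTTKRP case above, the j index survives into the output, which is why a single summed-out MTTKRP factor cannot express it.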
Hi again Edgar, I'm trying to use the MTTKRP function to actually compute an MTTKRP. Here's my code, which should be evaluating the contraction A[i, l] = B[i, j, k] * C[j, l] * D[k, l].
void mttkrp(int nIter, int warmup, std::string filename, std::vector<int> dims, World& dw, int ldim) {
  Tensor<double> B(3, true /* is_sparse */, dims.data(), dw);
  Matrix<double> A(dims[0], ldim, dw);
  Matrix<double> C(dims[1], ldim, dw);
  Matrix<double> D(dims[2], ldim, dw);
  C.fill_random(1.0, 1.0);
  D.fill_random(1.0, 1.0);
  B.read_sparse_from_file(filename.c_str());
  if (dw.rank == 0) {
    std::cout << "Read " << B.nnz_tot << " non-zero entries from the file." << std::endl;
  }
  Tensor<double>* mats[] = {&A, &C, &D};
  auto avgMs = benchmarkWithWarmup(warmup, nIter, [&]() {
    MTTKRP(&B, mats, 0, false);
  });
  if (dw.rank == 0) {
    std::cout << "Average execution time: " << avgMs << " ms." << std::endl;
  }
}
This code segfaults with the following backtrace:
(gdb) bt
#0 0x00000000100302a4 in CTF_int::handler() ()
#1 0x00000000100190b4 in CTF::Semiring<double, true>::MTTKRP(int, long*, int*, long, long, int, bool, CTF::Pair<double> const*, double const* const*, double*) ()
#2 0x0000000010019a1c in void CTF::MTTKRP<double>(CTF::Tensor<double>*, CTF::Tensor<double>**, int, bool) ()
#3 0x000000001000ebf8 in std::_Function_handler<void (), mttkrp(int, int, std::string, std::vector<int, std::allocator<int> >, CTF::World&, int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#4 0x000000001000e740 in mttkrp(int, int, std::string, std::vector<int, std::allocator<int> >, CTF::World&, int) ()
#5 0x0000000010008a98 in main ()
Is there anything here that looks incorrect to you? This code was adapted from the working version you helped me with above.
Somewhat related, I was looking at the TTTP implementation, trying to implement SDDMM. Based on the comment at least, I'm not sure how to use the handwritten function to implement the contraction A[i, k] = B[i, k] * C[i, j] * D[j, k]. I don't understand how to map the indices in the comment (For ndim=3 and mat_list=[X,Y,Z], this operation is equivalent to einsum("ijk,ia,ja,ka->ijk",A,X,Y,Z)) to something like this.
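For concreteness, that comment's einsum written out as explicit loops is (a dense, CTF-free sketch; tttp3 is a hypothetical name, not the library routine):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tens3t = std::vector<std::vector<std::vector<double>>>;
using Matt = std::vector<std::vector<double>>;

// Dense reference for TTTP with ndim=3 and mat_list=[X,Y,Z]:
// out[i][j][k] = A[i][j][k] * sum_a X[i][a] * Y[j][a] * Z[k][a]
// i.e. einsum("ijk,ia,ja,ka->ijk", A, X, Y, Z)
Tens3t tttp3(const Tens3t& A, const Matt& X, const Matt& Y, const Matt& Z) {
    Tens3t out = A;  // copy for shape; every entry is overwritten below
    std::size_t R = X[0].size();
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A[i].size(); ++j)
            for (std::size_t k = 0; k < A[i][j].size(); ++k) {
                double s = 0.0;
                for (std::size_t a = 0; a < R; ++a)
                    s += X[i][a] * Y[j][a] * Z[k][a];
                out[i][j][k] = A[i][j][k] * s;
            }
    return out;
}
```

Note the output keeps all three tensor indices; only the auxiliary index a is summed, and each matrix carries one tensor index plus a.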
I think the segfault is because you are using false for aux_mode_first (currently, there is no implementation available for aux_mode_first = false).
Sample code for MTTKRP to achieve ijk,ja,ka->ia:
// assumes T (the sparse tensor), U, V, lens, and dw are defined earlier
Tensor<dtype> UT(2, false, lens, dw);
UT["ji"] = U["ij"];
Tensor<dtype> VT(2, false, lens, dw);
VT["ji"] = V["ij"];
Tensor<dtype> UC(2, false, lens, dw);  // output factor
Tensor<dtype>* mlist[3] = {&UC, &UT, &VT};
MTTKRP<dtype>(&T, mlist, 0, true);
In my comment above (#134 (comment)), I used aux_mode_first = false, and that seemed to work in a different situation.
That seems to be the einsum notation expression I want, but I'm unsure why you are transposing the matrices.
In the comment you mentioned, the factors are vectors. Since there is only one index, aux_mode_first false or true will be the same.
I think you are seeing the segfault from this:
Line 1259 in c4f89dc
aux_mode_first = true expects the auxiliary index (in the contraction) to show up first.
Thanks @raghavendrak! Following up, do you have an explanation why you transposed the matrices in your example?
Also, wondering if you have an answer to my question about TTTP/SDDMM.
@rohany
I think Raghavendra had the transposes to compute the MTTKRP in the form you are looking for, i.e., sum_{jk} T_{ijk}U_{ja}V_{ka}, since our routine evidently can currently only do sum_{jk} T_{ijk}U_{aj}V_{ak} (U and V transposed). For SDDMM, you should use TTTP with an order 2 sparse tensor, performing the einsum("ij,ia,ja->ij",A,X,Y) to do the SDDMM Hadamard_product(A, XY^T). Let me know if all this makes sense.
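To make the mapping concrete, here is a dense, CTF-free sketch of the order-2 TTTP (sddmm_ref is a hypothetical name, not the CTF routine); it computes exactly einsum("ij,ia,ja->ij",A,X,Y), i.e., the Hadamard product of A with XY^T:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using MatD = std::vector<std::vector<double>>;

// out[i][j] = A[i][j] * sum_a X[i][a] * Y[j][a]
// i.e. the order-2 TTTP, a.k.a. SDDMM: Hadamard_product(A, X * Y^T)
MatD sddmm_ref(const MatD& A, const MatD& X, const MatD& Y) {
    std::size_t I = A.size(), J = A[0].size(), R = X[0].size();
    MatD out(I, std::vector<double>(J, 0.0));
    for (std::size_t i = 0; i < I; ++i)
        for (std::size_t j = 0; j < J; ++j) {
            double dot = 0.0;  // (X Y^T)[i][j]
            for (std::size_t a = 0; a < R; ++a)
                dot += X[i][a] * Y[j][a];
            out[i][j] = A[i][j] * dot;
        }
    return out;
}
```

Both factor matrices are indexed row-first with the auxiliary index a second, matching the orientation convention discussed below.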
Thanks, the MTTKRP transpose portion makes sense to me.
I'm still confused about SDDMM -- the einsum you wrote there is not isomorphic to what I want to compute: A[i, k] = B[i, k] * C[i, j] * D[j, k].
@rohany I think it is, you have "ik,ij,jk->ik", which is my einsum modulo transposition and substitutions j->k, a->j.
If I'm reading this correctly, you wrote einsum("ij,ia,ja->ij",A,X,Y), which translated to my notation is A[i, j] = A[i, j] * X[i, a] * Y[i, a], while my desired computation is A[i, k] = B[i, k] * C[i, j] * D[j, k]. This is not isomorphic, as the operation X[i, a] * Y[i, a] does not form a matrix multiplication between X and Y.
einsum("ij,ia,ja->ij",A,X,Y)
is A[i, j] = A[i, j] * X[i, a] * Y[j, a]
not A[i, j] = A[i, j] * X[i, a] * Y[i, a]
Yes, sorry. A[i, j] = A[i, j] * X[i, a] * Y[j, a] is still not correct, right? X[i, a] * Y[j, a] is not a matrix multiplication -- it should be X[i, a] * Y[a, j], right?
I think X[i, a] * Y[j, a] is a matrix multiplication (at least modulo transposition, as I mentioned in my post), i.e. it is XY^T.
The CTF interface assumes all matrices (X,Y) are oriented in the same way, which makes much more sense in generalizing to tensors.
That makes sense to me now. I will transpose my Y matrix to fit the interface.
Thanks for all the help! Code seems to be up and running now, so I'll close the issue.
Hi @solomonik, I was looking at this again:
The general sparse contraction/summation kernels are orders of magnitude slower than the multi-tensor MTTKRP kernel (see https://arxiv.org/pdf/2106.07135.pdf for details on this).
I was scanning the paper you linked, but I'm not finding discussion of distributed sparse MTTKRP in that paper -- is it the right link?
It's not, sorry about that; here is the right link: https://arxiv.org/pdf/1910.02371.pdf