I am creating an issue to keep track of something that needs to be done fairly long-term in nnet3.
Note: documentation for the nnet3 code exists at
http://www.danielpovey.com/kaldi-docs/
(this is like what's at kaldi-asr.org/doc/, but corresponds to the nnet3 branch).
What's below is an email. I was originally hoping Karel would agree to do this, but he is too busy.
The task is quite complicated. I may have to do this myself in the end.
[what follows is an email]
OK, so there is a change of plan on this. Since the thing that needs
to be done now is a slightly bigger piece, I'm hoping that Karel will
agree to do it.
Karel, let me explain what the issue is.
If you download the sandbox/nnet3 branch and run, in src/nnet3/, the
program nnet-optimize-test, you'll see the following output near the
end.
c127: m64 += m62
c128: m6.AddRows(m64[-1x7, 0, -1x7, 1, -1x7, 2, -1x7, 3, -1x7, 4,
-1x7, 5, -1x7, 6])
c129: recurrent_affine1.Backprop(NULL, m59, [], m62, &m60)
c130: m60 += m58
c131: nonlin1.Backprop(NULL, [], m57, m58, &m56)
c132: m56 += m54
c133: m6.AddRows(m56[-1x6, 0, -1x7, 1, -1x7, 2, -1x7, 3, -1x7, 4,
-1x7, 5, -1x7, 6, -1])
c134: recurrent_affine1.Backprop(NULL, m51, [], m54, &m52)
c135: m52 += m50
c136: nonlin1.Backprop(NULL, [], m49, m50, &m48)
c137: m48 += m46
c138: m6.AddRows(m48[-1x5, 0, -1x7, 1, -1x7, 2, -1x7, 3, -1x7, 4,
-1x7, 5, -1x7, 6, -1, -1])
c139: recurrent_affine1.Backprop(NULL, m43, [], m46, &m44)
This is some kind of RNN, and m6 is a matrix corresponding to some
component near the input.
The command
c128: m6.AddRows(m64[-1x7, 0, -1x7, 1, -1x7, 2, -1x7, 3, -1x7, 4,
-1x7, 5, -1x7, 6])
is calling an AddRows function, and inside the [ ] is the vector of
indexes. They have been pretty-printed, and -1x7 means -1 repeated 7
times. The -1's in the vector mean "do nothing for that row". The 0,
1, 2 through 6 mean, for those places where they appear, add that row
of matrix m64 to the corresponding row of matrix m6. Now the problem
here is that we are
invoking way too many CUDA kernels, each of which does nothing for most
rows because the index there is -1. What the thread below was about
was that I was asking Guoguo to add an AddToRows() function where m64
would be the "this", and the vector argument would say which row of m6
to add to (we'd assume the indexes were unique). Guoguo pointed out that
we could use the AddToRows() function that takes pointer arguments,
but I hadn't wanted to do this because we'd need to transfer the
pointers to the GPU for each minibatch (since while the indexes don't
change, in general we reallocate the matrices each time, so the
addresses change). However, I realized that there is a better way to
do this. I'd like to ask you to do it because it's a little bit
tricky to get right, and this will be an opportunity for you to get
involved in nnet3.
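
To make the two index conventions above concrete, here is a minimal CPU-only sketch. It is not the Kaldi implementation: the Matrix alias and both free functions are hypothetical stand-ins, and they only illustrate the semantics described in this email (a gather-style AddRows where -1 means "skip this row", and a scatter-style AddToRows where the source is the "this").

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a matrix: a vector of equally sized rows.
using Matrix = std::vector<std::vector<float>>;

// Gather form, as in "m6.AddRows(m64[...])": indexes has one entry per row
// of dest. indexes[r] == -1 means "do nothing for dest row r"; otherwise
// row indexes[r] of src is added to row r of dest.
void AddRows(Matrix &dest, const Matrix &src, const std::vector<int> &indexes) {
  assert(indexes.size() == dest.size());
  for (std::size_t r = 0; r < dest.size(); ++r) {
    int s = indexes[r];
    if (s < 0) continue;  // the no-op case that dominates the output above
    for (std::size_t c = 0; c < dest[r].size(); ++c)
      dest[r][c] += src[s][c];
  }
}

// Scatter form: src plays the role of "this", and indexes has one entry per
// row of src saying which row of dest to add to. The indexes are assumed to
// be unique, as in the email, so no two source rows hit the same dest row.
void AddToRows(const Matrix &src, const std::vector<int> &indexes, Matrix &dest) {
  assert(indexes.size() == src.size());
  for (std::size_t s = 0; s < src.size(); ++s) {
    int r = indexes[s];
    if (r < 0) continue;
    for (std::size_t c = 0; c < src[s].size(); ++c)
      dest[r][c] += src[s][c];
  }
}
```

With a 2-row source and a 4-row destination, AddRows with indexes {-1, 0, -1, 1} and AddToRows with indexes {1, 3} produce the same result, but the scatter form only visits the 2 rows that actually carry data.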
The way I think it should be done is to add an optimization method
that detects situations where the same matrix (m6 in this case) is
subject to multiple repeated AddRows calls with nothing else in
between. Please try to understand what the stuff in nnet-analyze is
doing. You could first, by accessing all the Commands, work out for
each submatrix how many AddRows calls it has (as the *this), and then
for each submatrix that has multiple AddRows calls, detect ranges of
AddRows calls such that no other reads or writes of that submatrix
occur within that range of commands. (You'd have to iterate over the
Variables for that submatrix to do this, although there would normally
be just one). Then you would attempt to consolidate all of those
AddRows calls into one (or a few); the command index could be the
latest of the command indexes of all the individual AddRows calls that
you are removing. You would likely want to first consolidate all the
AddRows calls into a single submat_locations_list [search for that in
nnet-compile.cc] and then use existing code from nnet-compile.cc to
turn that into either one command, or a list of commands. [Obviously
if it ends up generating as many commands as we started with, we're
not gaining anything and you'd want to abandon the attempt at
optimization.] You may need to move some functions from
nnet-compile.cc into nnet-compile-utils.{h,cc} to make them available
to the nnet-optimize code.
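
The consolidation step can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from nnet-compile.cc: AddRowsCall, LocationsList, MergeAddRows, NumPassesNeeded and WorthConsolidating are all hypothetical names, and the per-row list of (submatrix, row) pairs is an assumed layout modeled on what is called submat_locations_list above. The pass count assumes each generated command can take at most one source per destination row, so it is a rough lower bound rather than what the real code would emit.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record of one command "dest.AddRows(src_submatrix[indexes])".
struct AddRowsCall {
  int src_submatrix;         // index of the source submatrix, e.g. 64 for m64
  std::vector<int> indexes;  // one entry per dest row; -1 means no-op
};

// Assumed layout: for each dest row, the (source submatrix, source row)
// pairs that must be added into it.
using LocationsList = std::vector<std::vector<std::pair<int, int>>>;

// Merge a range of AddRows calls on the same destination into one list,
// dropping all the -1 entries along the way.
LocationsList MergeAddRows(const std::vector<AddRowsCall> &calls,
                           std::size_t num_dest_rows) {
  LocationsList merged(num_dest_rows);
  for (const AddRowsCall &call : calls) {
    assert(call.indexes.size() == num_dest_rows);
    for (std::size_t r = 0; r < num_dest_rows; ++r)
      if (call.indexes[r] >= 0)
        merged[r].emplace_back(call.src_submatrix, call.indexes[r]);
  }
  return merged;
}

// Rough lower bound on the number of commands after consolidation: the
// largest number of sources feeding any single dest row.
std::size_t NumPassesNeeded(const LocationsList &merged) {
  std::size_t max_len = 0;
  for (const auto &row : merged)
    if (row.size() > max_len) max_len = row.size();
  return max_len;
}

// Abandon the optimization unless it reduces the number of commands.
bool WorthConsolidating(const std::vector<AddRowsCall> &calls,
                        const LocationsList &merged) {
  return NumPassesNeeded(merged) < calls.size();
}
```

In the output above, each AddRows call on m6 touches a different set of rows, so every per-row list would hold a single pair and the whole range could in principle collapse to one command.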