eyalroz / cuda-kat — CUDA kernel author's tools
License: BSD 3-Clause "New" or "Revised" License
While NVIDIA's own C headers for builtin wrappers use the fundamental types `int`, `unsigned`, `unsigned long long` etc., the builtins are actually based on exact parameter sizes, not the wishy-washy C integer types. Should we then not make our builtins reflect that, by taking only `intNN_t`'s and `uintNN_t`'s as integer parameters?
CUDA 9.0 was released in September of 2017, 2.5 years ago. It changed the interfaces of some functions and related PTX instructions. Mostly, `.sync` versions of these were now to be used, which take lane-mask parameters, e.g. warp balloting only among the lanes with `1` bits in the mask.
Should we continue to support the CUDA 8.x and earlier versions of these functions/PTX intrinsics, or is it safe to just drop them?
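For reference, a host-side model of the `.sync` semantics, sketched with `__ballot_sync` in mind. This is an illustration only: the real builtin evaluates one predicate per lane; the `pred` array here stands in for the per-lane predicates.

```cpp
#include <cstdint>

// Model of __ballot_sync(mask, pred): only lanes whose bit is set in
// `mask` participate; the result has a 1 bit for each participating
// lane whose predicate was true.
inline std::uint32_t model_ballot_sync(std::uint32_t mask, const bool (&pred)[32]) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 32; lane++) {
        if ((mask >> lane) & 1u) {              // lane participates
            if (pred[lane]) { result |= (1u << lane); }
        }
    }
    return result;
}
```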
We have many templated functions which make a (potentially) large number of reads or writes to memory, and therefore benefit from coalescing their memory operations. However, most, if not all, of them are not specialized for element types smaller than 4 bytes, and are therefore slower than they might have been. Examples include copying, filling, appending to global memory etc.
We should add specializations for these cases.
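A sketch of the kind of specialization meant here, modeled host-side for a byte fill: perform whole 4-byte stores wherever alignment permits, with byte stores at the edges. On the GPU, the wider accesses are what lets a warp's transactions coalesce. Names and structure are illustrative, not the library's.

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

inline void fill_bytes_wordwise(unsigned char* p, std::size_t n, unsigned char value) {
    // Head: single bytes until 4-byte alignment
    while (n > 0 && (reinterpret_cast<std::uintptr_t>(p) % 4) != 0) {
        *p++ = value; --n;
    }
    // Body: whole 32-bit words
    std::uint32_t word = value * 0x01010101u; // replicate the byte into all 4 positions
    while (n >= 4) {
        std::memcpy(p, &word, 4); // aligned 4-byte store
        p += 4; n -= 4;
    }
    // Tail: remaining bytes
    while (n > 0) { *p++ = value; --n; }
}
```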
Most code in the library is currently not covered by any unit tests. Let's add that coverage.
Skipping trivial functions like

```c++
int add(int x, int y) { return x+y; }
```

- what can you test about it?

Now that we have (half-)decent unit test coverage (see #24), we should introduce code coverage checks to see how much remains uncovered.
This requires:

- `CMakeLists.txt` (both of them? only for tests?) to use a coverage module and to generate reports after a successful build.

See:
The programming guide says:

> E.3.14.3. Rvalue references
>
> By default, the CUDA compiler will implicitly consider `std::move` and `std::forward` function templates to have `__host__ __device__` execution space qualifiers, and therefore they can be invoked directly from device code. The `nvcc` flag `--no-host-device-move-forward` will disable this behavior; `std::move` and `std::forward` will then be considered as `__host__` functions and will not be directly invokable from device code.
We currently use our own `kat::` versions of those two functions. Should we drop them in favor of `std::move()` and `std::forward()`, relying on this behavior of CUDA's?
In `grid_info.cuh`:

- The `index()` function currently returns `dimensions_t` rather than `unsigned`. This might fit a "position" function (perhaps the subject of another issue I'll open), but certainly not an index, singular.
- The `index_in_grid()` function for threads doesn't currently compute the index correctly; plus, it should use other, already-defined functions, which themselves call `detail::row_major_linearization()`, rather than calling that directly.
- `warp::index()` to itself.

At the moment, our effective definition of a "builtin" function is one that produces a single PTX instruction (when inlined); and this definition is not even entirely consistent in our code.
However, PTX instructions are in no way guaranteed to become a single SASS instruction. An example which motivated our inconsistency: CLZ vs. CTZ. There is a CLZ instruction in PTX. But no NVIDIA micro-architecture has that as a single instruction; it's just implemented in SASS internally somewhere.
It should be clear to the user of cuda-kat what will result in a single hardware instruction and what may or may not be one.
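To illustrate why a "builtin" by the PTX definition may still cost several hardware instructions: a short non-builtin like CTZ can be built from other primitives via the classic identity `ctz(x) == popc(~x & (x - 1))`. The functions below are host-side models for illustration, not the library's code:

```cpp
#include <cstdint>

// Model of a popcount builtin (32-bit)
inline int model_popc(std::uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; } // clear the lowest set bit each round
    return n;
}

// Count trailing zeros, built from popcount:
// ~x & (x - 1) has a 1 bit exactly below x's lowest set bit
inline int model_ctz(std::uint32_t x) {
    if (x == 0) return 32;
    return model_popc(~x & (x - 1));
}

// Count leading zeros, by scanning from the top bit
inline int model_clz(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && !(x & bit); bit >>= 1) { ++n; }
    return n;
}
```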
We've adapted a tuple implementation; however, that tuple doesn't know that there's "another tuple" it needs to be compatible with... we do know. So, let's try and make `kat::tuple` usable wherever on the host side we can use `std::tuple`; and vice versa.
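One possible shape for that compatibility, sketched with a hypothetical two-element tuple standing in for `kat::tuple`: a converting constructor from `std::tuple` plus a conversion operator back to it, so host code can pass through either representation.

```cpp
#include <tuple>
#include <string>

template <typename A, typename B>
struct toy_tuple {
    A first; B second;

    toy_tuple(A a, B b) : first(a), second(b) {}

    // Interop: construct from, and convert to, std::tuple
    toy_tuple(const std::tuple<A, B>& t)
        : first(std::get<0>(t)), second(std::get<1>(t)) {}
    operator std::tuple<A, B>() const { return std::tuple<A, B>(first, second); }
};
```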
`on_device/printing.cuh` currently uses macros, mostly in order to easily "plug" the format string into a longer string.
This can actually be achieved, using a little voodoo, in a plain function, and without any heap allocation. Let's try and do that.
At the moment, we only test the `shuffle::xxx` functions with types of sizes 1, 2, 4 and 8. We should test them with types of other sizes, in particular 3, 5, 7, and sizes beyond 8.
At the moment, some of our math utility functions have both `constexpr` and non-`constexpr` variants, in different files, while some have only the `constexpr` ones, which work at run-time as well. But we indicate the first case with the `kat::constexpr_` namespace. Is this explained anywhere? Not really. Also unexplained is why `math.cuh` includes `constexpr_math.cuh`.
To add to the fun, we have several functions with two implementations, `foo_safe()` and `foo_unsafe()`; and some of the non-safe versions are actually in `constexpr_math`...
We should get this stuff in order.
Beginning with CUDA 10 (or maybe 9?) we have three kinds of atomics:

- `atomicFoo()` - atomic w.r.t. other memory access from within the same GPU.
- `atomicFoo_system()` - atomic w.r.t. memory access from any GPU and from the host, on the same system.
- `atomicFoo_block()` - atomic w.r.t. memory accesses from threads in the same thread block only.

We currently support only the first kind, but should support the other two.
We need to have an `atomicCAS()` equivalent available, some way or another, for all types up to the hardware capability for atomic ops (8 bytes, i.e. `unsigned long long int`). Right now there's only `apply_atomically()` exposed, which is nice, but not enough.
So let's:

- Keep `atomic::compare_and_swap()` the way it is now (i.e. only for the types CUDA supports directly).
- Implement `atomic::compare_and_swap()` for smaller types using the larger type.
for smaller types using the larger type.std::span
/ gsl::span
are very useful in host-side code: A pointer+length pair with standard-library-container trappings (iterators, operators, usable in standard algorithms etc.)
Now, it's not as though you should just use spans willy-nilly; they can have some overhead, but - they can sometimes make sense. The may be particularly useful as kernel parameters - pay the overhead just once per thread, then in the kernel you do what you like. And it makes it easier to work with memory regions - device-side and host-side.
See also issue #17.
We can base ourselves on either an GSL or a standard-library implementation.
But:

- `Expects`, `Requires` -> no (?) need for `<gsl/gsl_assert>`
- `<gsl/gsl_byte>` // for `byte`
- `<gsl/gsl_util>` ...

edit: Going with `std::span` adapted for earlier C++ versions. Hope it's not too bad. Still not sure if I shouldn't just use `gsl::span` instead.
For `std::reverse_iterator` we need `#include <iterator>`.
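For reference, the pointer+length shape under discussion is tiny. A minimal sketch (not the actual adapted implementation, and without the full `std::span` interface):

```cpp
#include <cstddef>

template <typename T>
class toy_span {
    T* ptr_;
    std::size_t size_;
public:
    toy_span(T* p, std::size_t n) : ptr_(p), size_(n) {}
    T* begin() const { return ptr_; }               // container trappings:
    T* end() const { return ptr_ + size_; }         // usable in range-for,
    std::size_t size() const { return size_; }      // standard algorithms etc.
    T& operator[](std::size_t i) const { return ptr_[i]; }
};
```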
Something like:

```c++
#include <kat/containers/array.hpp>

int main() {
    kat::array<int, 7> arr;
}
```

should fail to compile.
We have many functions returning lane ids or numbers-of-lanes. Mostly those use `unsigned`. But for better readability/clarity, I'm thinking of introducing something like:

```c++
using lane_id_t = unsigned;
```

within the `kat` namespace.
Users - if I have any that look at this issue, which is unlikely - are welcome to comment on this prospect.
Even if most of our code is for the device side, it's still a little too presumptuous to put it in the global namespace. Instead, let's place it all within the `kat` namespace.
An index is either a "list of items" arranged in order, or "a number... used as an indicator or measure", or "a number ... associated with another to indicate... position in an arrangement".
An id is an identification, or " a document bearing identifying information".
So, if we have an item in a 3D block, the triplet of its coordinates along the three axes is - literally - that item's 'id', not its 'index'. If anything, it is its position in a linearization of that grid that could be considered its 'index'.
... Unfortunately, CUDA defines this exactly backwards: The 'index' in CUDA is the 3-dimensional entity, and the 'id' is the number.
Currently, in my code, I'm making a bit of a mess of these two conceptions. I'm going to sort that out, but the question is how?
One option: have the `_index()` functions in `grid_info` return `uint3`'s and `dimensions_t`'s.
What do you think?
In a 2D or 3D block, the CUDA "thread index" - according to official documentation - is a 2D or 3D entity, while the "thread ID" is its linearization (where x changes fastest, then y, then z).
cuda-kat currently doesn't observe this distinction, due to a bias in favor of work with linear grids. We should make sure it is respected in the `grid_info::` namespaces.
We should also check if there are separate special registers for the dimensioned index and the id; that could be useful.
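The documented linearization can be stated in one line; a host-side model, with a stand-in struct for `dim3`:

```cpp
struct dim3_model { unsigned x, y, z; };

// "Thread ID" from the 3D "thread index": x changes fastest, then y, then z
inline unsigned thread_id_of(dim3_model idx, dim3_model dim) {
    return idx.x + dim.x * (idx.y + dim.y * idx.z);
}
```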
There is overuse of `const T&` over `T` for parameters of some math functions. These functions are intended for numbers, not more complex objects; plus, they are inlined and simple enough for any copies to be optimized away, so let's not complicate our lives with references here.
While we have a separate namespace for grid_info functions in linear grids, I also want to templatize the general versions of the functions to support working with such grids while minimizing unnecessary computations (i.e. based on the `.y` and `.z` values of positions and dimensions). I also want to do something similar for 2D grids. So - I'm going to templatize.
Will implement this together with a "choice" w.r.t. issue #50.
This repo should have a device-side function (probably for a single thread to run) which pretty-prints a part of a column in GPU memory. It should be a well-pimped function with lots of knobs and levers for configuring the printing (elements per row, separators, inter-column spacing, width, chars vs numbers, numeric base, bit-resolution interpretation yes/no, true/false symbols for bits or booleans, indices at start of line, header lines yes/no etc.)
We have grid-scope action in two forms: at grid stride and at block stride. The block-stride action means each block acts on consecutive data. At block scope, we only have the first kind of action, where the stride involves all collaborating elements; we don't have the second kind, in which warps may be told how many consecutive warps' worth of positions to act on.
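The two patterns can be modeled host-side. The loops below just record which "thread" touched each element, to show the difference in access pattern (illustrative only):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Grid-stride pattern: consecutive iterations of one thread are
// a whole "grid" (num_threads) apart
inline void grid_stride_touch(std::vector<int>& data, int num_threads) {
    for (int tid = 0; tid < num_threads; tid++)
        for (std::size_t i = tid; i < data.size(); i += num_threads)
            data[i] = tid;
}

// Block-stride-like pattern: each "thread" covers one consecutive chunk
inline void block_stride_touch(std::vector<int>& data, int num_threads) {
    std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (int tid = 0; tid < num_threads; tid++) {
        std::size_t end = std::min(data.size(), std::size_t(tid + 1) * chunk);
        for (std::size_t i = std::size_t(tid) * chunk; i < end; i++)
            data[i] = tid;
    }
}
```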
The following PTX instructions don't have wrapper functions (nor `builtins::` templated functions, where relevant). Add them!

- `lop3` - logical operation on 3 operands, using an immediate lookup table over the 3 inputs.
- `cvt.pack`
- `fns` - find n'th set bit.
- `dp4a`, `dp2a` - for bytes and half-words, respectively.

Shuffles are warp collaboration primitives. They should be in namespace `kat::collaboration::warp`
- and declared in the warp collaboration primitives header - if only perhaps through an inclusion of another file.
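The `lop3` instruction mentioned above is easy to model bitwise, which also documents the immediate-LUT convention: the three input bits form a 3-bit index into the 8-bit immediate (e.g. `0x96` gives XOR, `0xE8` gives majority). A host-side sketch:

```cpp
#include <cstdint>

// Model of PTX's lop3.b32: for each bit position, the three input bits
// (a as bit 2, b as bit 1, c as bit 0) select one entry of the 8-bit
// immediate lookup table `lut`.
inline std::uint32_t model_lop3(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                                std::uint8_t lut) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned sel = (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u);
        r |= ((lut >> sel) & 1u) << i;
    }
    return r;
}
```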
This is a general problem of `nvcc`, I would say:

```c++
#include <array>
#include <kat/containers/array.hpp>

constexpr int duzzle = -7;

__global__ void kernel() {
    kat::array<int, 7> arr;
    arr.fill(duzzle); // fails to compile
}

int main() {
    std::array<int, 7> arr;
    arr.fill(duzzle);
}
```

The problem here is the signature of `fill(const value_type&)`. E.g. if we omit the reference it works fine, or if we define the `constexpr` variable in the device function itself. A really funny workaround:

```c++
__global__ void kernel() {
    kat::array<int, 7> arr;
    constexpr int duzzle_ = duzzle;
    arr.fill(duzzle_);
}
```
Much of the code is lacking Doxygen comments - especially file-level comments. Write those in.
The code under `src/cuda/on_device/primitives` is a hot mess.
I mean, most of it is very useful, but not all of it; and there's almost no order to the different files except w.r.t. the scope of collaboration (warp/block/grid).
At the very least we need to:

- Remove code whose general usefulness is limited/questionable.
- Extract related functionality into a separate file (or files, for different scopes), e.g. `at_warp_stride()`... and do it while keeping the namespace scheme (e.g. separating block-scope from grid-scope functions).
- Consider duplicate functionality (there's probably a bit of that in there).
`printing.cuh` currently has some redundant code, some unused code, code that's really not GPU-specific in any way, and code which may not be significant enough to publish as part of the library.
I should make a pass over the file and remove most/all of this code.
`array.hpp`, and soon the tuple implementation and possibly other places, rely on the existence of a `size_t` type. Let's explicitly define ours.
We already have a few `string.h`-like functions implemented, currently in `miscellany.h`. Why not just complete them to cover all of `string.h` - and separate them into a file of their own?
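For illustration, the sort of freestanding implementations meant here, written as plain C++ without the CUDA annotations (the `kat_`-prefixed names are hypothetical):

```cpp
#include <cstddef>

// string.h-style strlen: count bytes up to the NUL terminator
inline std::size_t kat_strlen(const char* s) {
    const char* p = s;
    while (*p) ++p;
    return static_cast<std::size_t>(p - s);
}

// string.h-style strcmp: lexicographic comparison, unsigned-char semantics
inline int kat_strcmp(const char* a, const char* b) {
    while (*a && (*a == *b)) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}
```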
While this code originates in other repositories, which do have some unit and other testing - that can't be migrated here. We should have tests for all functionality in this repository, and a proper testing framework/library for them.
I'm thinking of doctest but my mind is not fully made up yet.
Part of the CUDA on-device "math API" is multiple casting operations. We should provide a uniform, readable, no-overhead, C++'ish interface to that functionality.
"Built-ins" in cuda-kat means those functions which translate into single PTX instructions (not necessarily single SASS instructions though!)
We have on_device/builtins.cuh
, and on_device/non-builtins.cuh
which contains functions which are builtin-like, or one might expect to be built-in, but aren't, and instead have a pretty tight implementation - one or two lines - which calls a builtin.
There's a problem, though, with certain functions which only translate into single PTX instructions when `--use_fast_math` is specified as a compiler switch. Example: `cosf()` in the CUDA math function header. With `--use_fast_math`, it yields something like:

```
cos.approx.ftz.f32 %f2, %f1;
```

but if you remove the switch, you get a sequence of over 150 (!!) PTX instructions, including several loops, for computing the cosine. See this on GodBolt.
So, `cosf()` and other functions are sometimes builtins and sometimes they aren't. Where should we put them, then?
Idea: have three files,

- `builtins.cuh`,
- `non-builtins.cuh`, and
- `maybe-builtins.cuh`.

`maybe-builtins.cuh` will have everything depending on a switch, i.e. instructions conditioned on the compilation happening with the appropriate switch. Thus, if all three are included, `builtins::cosine()` will work if `--use_fast_math` is on, and fail otherwise; and the opposite goes for `non_builtins::cosine()`.

PS: `--use_fast_math` is not even the only switch: we have `--prec-div` (for precise division), `--prec-sqrt` (for precise square root) and `--fmad` (for enabling floating-point fused-multiply-add instructions). I should also note that not all functions which depend on these four switches are actually covered by cuda-kat right now, although that's a different issue.
Our builtins involving warp voting, balloting etc. already support lane-mask parameters for CUDA >= 9; but in `kat::collaboration::warp::`, the methods do not take such parameters. We should add them.
`unaligned.cuh` is missing the `align_down()` function. Either add it or drop `unaligned.cuh`.
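For reference, the usual power-of-two formulation of the missing function (a sketch; the actual signature in `unaligned.cuh` may differ):

```cpp
#include <cstdint>

// Round x down to the nearest multiple of `alignment`,
// which must be a power of 2.
inline std::uintptr_t align_down(std::uintptr_t x, std::uintptr_t alignment) {
    return x & ~(alignment - 1);
}
```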
While it's rarely a great idea, for the sake of completeness, we may want to have implementations of the abstract `<algorithm>` and `<numeric>` algorithms which could be run by all threads without collaboration, each on its own data.
What do you think? Good idea? Bad idea?
See also issue #18.
CUDA atomics apply to types which are of size at most 8 bytes and trivially copyable. There is no benefit, therefore, from their wrappers taking constant references rather than actual values. In fact, this will mostly serve to cause problems if we try to pass rvalues.
So, let's just drop all that use of `const T&` in favor of plain `T`s.
Also, this will mean most atomic wrappers only take a single address (pointer/reference), so we no longer need to mark them with `__restrict__` (!).
The test fixtures have improved, becoming a bit more flexible and requiring less boilerplate, from one test suite to the next. We should use the later ones - currently in `collaborative.cu` - in the earlier tests, particularly `builtins.cu`, and perhaps `math.cu` and others.
We need most, if not all, of the functions in `<algorithm>` and `<numeric>` available on the device, for execution at the warp and block level. (But not the uninitialized-memory stuff, nor qsort/bsearch etc.)
Implementation status:
includes
With a reasonable compiler and a reasonable implementation of a `gsl::span`-like class (see issue #6), there should be no penalty for having device functions take a `kat::span<T>` instead of a `T*` and a `size_t length`. So let's convert some functions to using that...
Of course we might need to make the span templated on the appropriate size type.
CUDA's math API has functions for computing averages of integers without overflow. Let's make those more accessible.
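The underlying trick can be sketched host-side. I believe the CUDA intrinsics in question are `__uhadd()`/`__urhadd()` (and signed variants), but treat those names as an assumption; the bit identities themselves are standard:

```cpp
#include <cstdint>

// floor((a + b) / 2) without computing the (possibly overflowing) sum:
// a + b == 2*(a & b) + (a ^ b), so halving gives (a & b) + ((a ^ b) >> 1)
inline std::uint32_t avg_down(std::uint32_t a, std::uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}

// ceil((a + b) / 2), likewise overflow-free:
// a | b == (a & b) + (a ^ b), so subtracting the floored half of (a ^ b)
// leaves the rounded-up average
inline std::uint32_t avg_up(std::uint32_t a, std::uint32_t b) {
    return (a | b) - ((a ^ b) >> 1);
}
```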
Let's convert our on-device printing code to use a standard-library-like ostream, which, at a time of our choosing, will issue a `printf()` and clear its buffer.
Consider basing this off of: https://github.com/msharov/ustl/releases/tag/v2.8
Let's add a device-capable (and host-capable) tuple type.
Two issues with the include guards:

- They don't have `CUDA_KAT` in them.
- They end with a `/* */` comment rather than a line comment.

Let's change that.
I've somehow mis-implemented `atomic::increment()` and `atomic::decrement()`. I need to bring the comments and the implementation in line with the CUDA Programming Guide description, as well as make sure that the default arguments actually do something useful (which, currently, they do not).
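For reference, the Programming Guide semantics to match, modeled non-atomically here: `atomicInc()` wraps to 0 once the old value reaches the limit, and `atomicDec()` wraps to the limit at 0 (or when the old value exceeds it); both return the old value.

```cpp
// Non-atomic model of CUDA's atomicInc(addr, limit)
inline unsigned model_atomic_inc(unsigned* addr, unsigned limit) {
    unsigned old = *addr;
    *addr = (old >= limit) ? 0 : old + 1;
    return old;
}

// Non-atomic model of CUDA's atomicDec(addr, limit)
inline unsigned model_atomic_dec(unsigned* addr, unsigned limit) {
    unsigned old = *addr;
    *addr = (old == 0 || old > limit) ? limit : old - 1;
    return old;
}
```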
My shorthands for `__forceinline__`, `__device__` and `__host__` are quite convenient, but perhaps it's better I don't force them on people - even at the price of making my code more verbose.
I'd appreciate input from whoever visits this page.
At the moment, a lot of the functions in `grid_info.cuh` are only relevant to linear grids. We should move these into `kat::linear_grid::grid_info`.
We should implement `all_satisfy()`, `none_satisfy()` and `some_satisfy()` at the block level (using the warp-level primitives and shared memory to exchange information).