eyalroz / cuda-kat — CUDA kernel author's tools
License: BSD 3-Clause "New" or "Revised" License
While NVIDIA's own C headers for builtin wrappers use the fundamental types `int`, `unsigned`, `unsigned long long` etc., the builtins are actually based on exact parameter sizes, not the wishy-washy C integer types. Should we then not make our builtins reflect that, by taking only `intNN_t`'s and `uintNN_t`'s as integer parameters?
CUDA 9.0 was released in September of 2017, 2.5 years ago. It changed the interfaces of some functions and related PTX instructions. Mostly, `.sync` versions of these were now to be used, which take lane-mask parameters, e.g. warp balloting only among the lanes with `1` bits in the mask.
Should we continue to support the CUDA 8.x and earlier versions of these functions/PTX intrinsics, or is it safe to just drop them?
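For reference, a host-side model of the `.sync` semantics, sketched with `__ballot_sync` in mind. This is an illustration only: the real builtin evaluates one predicate per lane; the `pred` array here stands in for the per-lane predicates.

```cpp
#include <cstdint>

// Model of __ballot_sync(mask, pred): only lanes whose bit is set in
// `mask` participate; the result has a 1 bit for each participating
// lane whose predicate was true.
inline std::uint32_t model_ballot_sync(std::uint32_t mask, const bool (&pred)[32]) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 32; lane++) {
        if ((mask >> lane) & 1u) {              // lane participates
            if (pred[lane]) { result |= (1u << lane); }
        }
    }
    return result;
}
```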
We have many templated functions which make a (potentially) large number of reads or writes to memory, and therefore benefit from coalescing their memory operations. However, most, if not all, of them are not specialized for element types smaller than 4 bytes, and are therefore slower than they might have been. Examples include copying, filling, appending to global memory etc.
We should add specializations for these cases.
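A sketch of the kind of specialization meant here, modeled host-side for a byte fill: perform whole 4-byte stores wherever alignment permits, with byte stores at the edges. On the GPU, the wider accesses are what lets a warp's transactions coalesce. Names and structure are illustrative, not the library's.

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

inline void fill_bytes_wordwise(unsigned char* p, std::size_t n, unsigned char value) {
    // Head: single bytes until 4-byte alignment
    while (n > 0 && (reinterpret_cast<std::uintptr_t>(p) % 4) != 0) {
        *p++ = value; --n;
    }
    // Body: whole 32-bit words
    std::uint32_t word = value * 0x01010101u; // replicate the byte into all 4 positions
    while (n >= 4) {
        std::memcpy(p, &word, 4); // aligned 4-byte store
        p += 4; n -= 4;
    }
    // Tail: remaining bytes
    while (n > 0) { *p++ = value; --n; }
}
```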
Most code in the library is currently not covered by any unit tests. Let's add that coverage.
Skipping trivial functions like

```c++
int add(int x, int y) { return x+y; }
```

- what can you test about it?

Now that we have (half-)decent unit test coverage (see #24), we should introduce code coverage checks to see how much remains uncovered.
This requires:

- `CMakeLists.txt` (both of them? only for tests?) to use a coverage module and to generate reports after a successful build.

See:
The programming guide says:

> E.3.14.3. Rvalue references
>
> By default, the CUDA compiler will implicitly consider `std::move` and `std::forward` function templates to have `__host__ __device__` execution space qualifiers, and therefore they can be invoked directly from device code. The `nvcc` flag `--no-host-device-move-forward` will disable this behavior; `std::move` and `std::forward` will then be considered as `__host__` functions and will not be directly invokable from device code.
We currently use our own `kat::` versions of those two functions. Should we drop them in favor of `std::move()` and `std::forward()`, relying on this behavior of CUDA's?
In `grid_info.cuh`:

- The `index()` function currently returns `dimensions_t` rather than `unsigned`. This might fit a "position" function (perhaps the subject of another issue I'll open), but certainly not an index, singular.
- The `index_in_grid()` function for threads doesn't currently compute the index correctly; plus, it should use other, already-defined functions, which themselves call `detail::row_major_linearization()`, rather than calling that directly.
- `warp::index()` to itself.

At the moment, our effective definition of a "builtin" function is one that produces a single PTX instruction (when inlined); and this definition is not even entirely consistent in our code.
However, PTX instructions are in no way guaranteed to become a single SASS instruction. An example which motivated our inconsistency: CLZ vs. CTZ. There is a CLZ instruction in PTX. But no NVIDIA micro-architecture has that as a single instruction; it's just implemented in SASS internally somewhere.
It should be clear to the user of cuda-kat what will result in a single hardware instruction and what may or may not be one.
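To illustrate why a "builtin" by the PTX definition may still cost several hardware instructions: a short non-builtin like CTZ can be built from other primitives via the classic identity `ctz(x) == popc(~x & (x - 1))`. The functions below are host-side models for illustration, not the library's code:

```cpp
#include <cstdint>

// Model of a popcount builtin (32-bit)
inline int model_popc(std::uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; } // clear the lowest set bit each round
    return n;
}

// Count trailing zeros, built from popcount:
// ~x & (x - 1) has a 1 bit exactly below x's lowest set bit
inline int model_ctz(std::uint32_t x) {
    if (x == 0) return 32;
    return model_popc(~x & (x - 1));
}

// Count leading zeros, by scanning from the top bit
inline int model_clz(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && !(x & bit); bit >>= 1) { ++n; }
    return n;
}
```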
We've adapted a tuple implementation; however, that tuple doesn't know that there's "another tuple" it needs to be compatible with... we do know. So, let's try and make `kat::tuple` usable wherever on the host side we can use `std::tuple`; and vice versa.
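One possible shape for that compatibility, sketched with a hypothetical two-element tuple standing in for `kat::tuple`: a converting constructor from `std::tuple` plus a conversion operator back to it, so host code can pass through either representation.

```cpp
#include <tuple>
#include <string>

template <typename A, typename B>
struct toy_tuple {
    A first; B second;

    toy_tuple(A a, B b) : first(a), second(b) {}

    // Interop: construct from, and convert to, std::tuple
    toy_tuple(const std::tuple<A, B>& t)
        : first(std::get<0>(t)), second(std::get<1>(t)) {}
    operator std::tuple<A, B>() const { return std::tuple<A, B>(first, second); }
};
```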
`on_device/printing.cuh` currently uses macros, mostly in order to easily "plug" the format string into a longer string.
This can actually be achieved, using a little voodoo, in a plain function, and without any heap allocation. Let's try and do that.
At the moment, we only test the `shuffle::xxx` functions with types of sizes 1, 2, 4 and 8. We should test them with types of other sizes, in particular 3, 5, 7, and sizes beyond 8.
At the moment, some of our math utility functions have both `constexpr` and non-`constexpr` variants, in different files, while some have only the `constexpr` ones, which work at run-time as well. But we indicate the first case with the `kat::constexpr_` namespace. Is this explained anywhere? Not really. Also unexplained is why `math.cuh` includes `constexpr_math.cuh`.
To add to the fun, we have several functions with two implementations, `foo_safe()` and `foo_unsafe()`; and some of the non-safe versions are actually in `constexpr_math`...
We should get this stuff in order.
Beginning with CUDA 10 (or maybe 9?) we have three kinds of atomics:

- `atomicFoo()` - atomic w.r.t. other memory access from within the same GPU.
- `atomicFoo_system()` - atomic w.r.t. memory access from any GPU and from the host, on the same system.
- `atomicFoo_block()` - atomic w.r.t. memory accesses from threads in the same thread block only.

We currently support only the first kind, but should support the other two.
We need to have an `atomicCAS()` equivalent available, some way or another, for all types up to the hardware capability for atomic ops (8 bytes, i.e. `unsigned long long int`). Right now there's only `apply_atomically()` exposed, which is nice, but not enough.
So let's:

- Keep `atomic::compare_and_swap()` the way it is now (i.e. only for the types CUDA supports directly).
- Implement `atomic::compare_and_swap()` for smaller types using the larger type.
for smaller types using the larger type.std::span
/ gsl::span
are very useful in host-side code: A pointer+length pair with standard-library-container trappings (iterators, operators, usable in standard algorithms etc.)
Now, it's not as though you should just use spans willy-nilly; they can have some overhead, but - they can sometimes make sense. The may be particularly useful as kernel parameters - pay the overhead just once per thread, then in the kernel you do what you like. And it makes it easier to work with memory regions - device-side and host-side.
See also issue #17.
We can base ourselves on either an GSL or a standard-library implementation.
But:

- `Expects`, `Requires` -> no (?) need for `<gsl/gsl_assert>`
- `<gsl/gsl_byte>` // for `byte`
- `<gsl/gsl_util>` ...

edit: Going with `std::span` adapted for earlier C++ versions. Hope it's not too bad. Still not sure if I shouldn't just use `gsl::span` instead.
For `std::reverse_iterator` we need `#include <iterator>`.
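For reference, the pointer+length shape under discussion is tiny. A minimal sketch (not the actual adapted implementation, and without the full `std::span` interface):

```cpp
#include <cstddef>

template <typename T>
class toy_span {
    T* ptr_;
    std::size_t size_;
public:
    toy_span(T* p, std::size_t n) : ptr_(p), size_(n) {}
    T* begin() const { return ptr_; }               // container trappings:
    T* end() const { return ptr_ + size_; }         // usable in range-for,
    std::size_t size() const { return size_; }      // standard algorithms etc.
    T& operator[](std::size_t i) const { return ptr_[i]; }
};
```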
Something like:

```c++
#include <kat/containers/array.hpp>

int main() {
    kat::array<int, 7> arr;
}
```

should fail to compile.
We have many functions returning lane ids or numbers-of-lanes. Mostly those use `unsigned`. But for better readability/clarity, I'm thinking of introducing something like:

```c++
using lane_id_t = unsigned;
```

within the `kat` namespace.
Users - if I have any that look at this issue, which is unlikely - are welcome to comment on this prospect.
Even if most of our code is for the device side, it's still a little too presumptuous to put it in the global namespace. Instead, let's place it all within the `kat` namespace.
An index is either a "list of items" arranged in order, or "a number... used as an indicator or measure", or "a number ... associated with another to indicate... position in an arrangement".
An id is an identification, or " a document bearing identifying information".
So, if we have an item in a 3D block, the triplet of its coordinates along the three axes is - literally - that item's 'id', not its 'index'. If anything, it is its position in a linearization of that grid that could be considered its 'index'.
... Unfortunately, CUDA defines this exactly backwards: The 'index' in CUDA is the 3-dimensional entity, and the 'id' is the number.
Currently, in my code, I'm making a bit of a mess of these two conceptions. I'm going to sort that out, but the question is how?
One option: have the `_index()` functions in `grid_info` return `uint3`'s and `dimensions_t`'s.
What do you think?
In a 2D or 3D block, the CUDA "thread index" - according to official documentation - is a 2D or 3D entity, while the "thread ID" is its linearization (where x changes fastest, then y, then z).
cuda-kat currently doesn't observe this distinction, due to a bias in favor of work with linear grids. We should make sure it is respected in the `grid_info::` namespaces.
We should also check if there are separate special registers for the dimensioned index and the id; that could be useful.
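The documented linearization can be stated in one line; a host-side model, with a stand-in struct for `dim3`:

```cpp
struct dim3_model { unsigned x, y, z; };

// "Thread ID" from the 3D "thread index": x changes fastest, then y, then z
inline unsigned thread_id_of(dim3_model idx, dim3_model dim) {
    return idx.x + dim.x * (idx.y + dim.y * idx.z);
}
```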
There is overuse of `const T&` over `T` for parameters of some math functions. These functions are intended for numbers, not more complex objects; plus, they are inlined and simple enough for any copies to be optimized away, so let's not complicate our lives with references here.
While we have a separate namespace for grid_info functions in linear grids, I also want to templatize the general versions of the functions to support working with such grids while minimizing unnecessary computations (i.e. based on the `.y` and `.z` values of positions and dimensions). I also want to do something similar for 2D grids. So - I'm going to templatize.
Will implement this together with a "choice" w.r.t. issue #50.
This repo should have a device-side function (probably for a single thread to run) which pretty-prints a part of a column in GPU memory. It should be a well-pimped function with lots of knobs and levers for configuring the printing (elements per row, separators, inter-column spacing, width, chars vs numbers, numeric base, bit-resolution interpretation yes/no, true/false symbols for bits or booleans, indices at start of line, header lines yes/no etc.)
We have grid-scope action in two forms: at grid stride and at block stride. The block-stride action means each block acts on consecutive data. At block scope, we only have the first kind of action, where the stride involves all collaborating elements; we don't have the second kind, in which warps may be told how many consecutive warps' worth of positions to act on.
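The two patterns can be modeled host-side. The loops below just record which "thread" touched each element, to show the difference in access pattern (illustrative only):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Grid-stride pattern: consecutive iterations of one thread are
// a whole "grid" (num_threads) apart
inline void grid_stride_touch(std::vector<int>& data, int num_threads) {
    for (int tid = 0; tid < num_threads; tid++)
        for (std::size_t i = tid; i < data.size(); i += num_threads)
            data[i] = tid;
}

// Block-stride-like pattern: each "thread" covers one consecutive chunk
inline void block_stride_touch(std::vector<int>& data, int num_threads) {
    std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (int tid = 0; tid < num_threads; tid++) {
        std::size_t end = std::min(data.size(), std::size_t(tid + 1) * chunk);
        for (std::size_t i = std::size_t(tid) * chunk; i < end; i++)
            data[i] = tid;
    }
}
```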
The following PTX instructions don't have wrapper functions (nor `builtins::` templated functions, where relevant). Add them!

- `lop3` - logical operation on 3 operands, using an immediate lookup table over the 3 inputs.
- `cvt.pack`
- `fns` - find n'th set bit.
- `dp4a`, `dp2a` - for bytes and half-words, respectively.

Shuffles are warp collaboration primitives. They should be in namespace `kat::collaboration::warp`
- and declared in the warp collaboration primitives header - if only perhaps through an inclusion of another file.
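The `lop3` instruction mentioned above is easy to model bitwise, which also documents the immediate-LUT convention: the three input bits form a 3-bit index into the 8-bit immediate (e.g. `0x96` gives XOR, `0xE8` gives majority). A host-side sketch:

```cpp
#include <cstdint>

// Model of PTX's lop3.b32: for each bit position, the three input bits
// (a as bit 2, b as bit 1, c as bit 0) select one entry of the 8-bit
// immediate lookup table `lut`.
inline std::uint32_t model_lop3(std::uint32_t a, std::uint32_t b, std::uint32_t c,
                                std::uint8_t lut) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned sel = (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u);
        r |= ((lut >> sel) & 1u) << i;
    }
    return r;
}
```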
This is a general problem of `nvcc`, I would say:

```c++
#include <array>
#include <kat/containers/array.hpp>

constexpr int duzzle = -7;

__global__ void kernel() {
    kat::array<int, 7> arr;
    arr.fill(duzzle); // fails to compile
}

int main() {
    std::array<int, 7> arr;
    arr.fill(duzzle);
}
```

The problem here is the signature of `fill(const value_type&)`. E.g. if we omit the reference it works fine, or if we define the `constexpr` variable in the device function itself. A really funny workaround:

```c++
__global__ void kernel() {
    kat::array<int, 7> arr;
    constexpr int duzzle_ = duzzle;
    arr.fill(duzzle_);
}
```
Much of the code is lacking Doxygen comments - especially file-level comments. Write those in.
The code under `src/cuda/on_device/primitives` is a hot mess.
I mean, most of it is very useful, but not all of it; and there's almost no order to the different files except w.r.t. the scope of collaboration (warp/block/grid).
At the very least we need to:

- Remove code whose general usefulness is limited/questionable.
- Extract related functionality into a separate file (or files, for different scopes), e.g. `at_warp_stride()`... and do it while keeping the namespace scheme (e.g. separating block-scope from grid-scope functions).
- Consider duplicate functionality (there's probably a bit of that in there).
`printing.cuh` currently has some redundant code, some unused code, code that's really not GPU-specific in any way, and code which may not be significant enough to publish as part of the library.
I should make a pass over the file and remove most/all of this code.
`array.hpp`, and soon the tuple implementation and possibly other places, rely on the existence of a `size_t` type. Let's explicitly define ours.
We already have a few `string.h`-like functions implemented, currently in `miscellany.h`. Why not just complete them to cover all of `string.h` - and separate them into a file of their own?
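For illustration, the sort of freestanding implementations meant here, written as plain C++ without the CUDA annotations (the `kat_`-prefixed names are hypothetical):

```cpp
#include <cstddef>

// string.h-style strlen: count bytes up to the NUL terminator
inline std::size_t kat_strlen(const char* s) {
    const char* p = s;
    while (*p) ++p;
    return static_cast<std::size_t>(p - s);
}

// string.h-style strcmp: lexicographic comparison, unsigned-char semantics
inline int kat_strcmp(const char* a, const char* b) {
    while (*a && (*a == *b)) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}
```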
While this code originates in other repositories, which do have some unit and other testing - that can't be migrated here. We should have tests for all functionality in this repository, and a proper testing framework/library for them.
I'm thinking of doctest but my mind is not fully made up yet.
Part of the CUDA on-device "math API" is multiple casting operations. We should provide a uniform, readable, no-overhead, C++'ish interface to that functionality.
"Built-ins" in cuda-kat means those functions which translate into single PTX instructions (not necessarily single SASS instructions though!)
We have on_device/builtins.cuh
, and on_device/non-builtins.cuh
which contains functions which are builtin-like, or one might expect to be built-in, but aren't, and instead have a pretty tight implementation - one or two lines - which calls a builtin.
There's a problem, though, with certain functions which only translate into single PTX instructions when `--use_fast_math` is specified as a compiler switch. Example: `cosf()` in the CUDA math function header. With `--use_fast_math`, it yields something like:

```
cos.approx.ftz.f32 %f2, %f1;
```

but if you remove the switch, you get a sequence of over 150 (!!) PTX instructions, including several loops, for computing the cosine. See this on GodBolt.
So, `cosf()` and other functions are sometimes builtins and sometimes they aren't. Where should we put them, then?
Idea: have three files,

- `builtins.cuh`,
- `non-builtins.cuh`, and
- `maybe-builtins.cuh`.

`maybe-builtins.cuh` will have everything depending on a switch, i.e. instructions conditioned on the compilation happening with the appropriate switch. Thus, if all three are included, `builtins::cosine()` will work if `--use_fast_math` is on, and fail otherwise; and the opposite goes for `non_builtins::cosine()`.

PS: `--use_fast_math` is not even the only switch: we have `--prec-div` (for precise division), `--prec-sqrt` (for precise square root) and `--fmad` (for enabling floating-point fused-multiply-add instructions). I should also note that not all functions which depend on these four switches are actually covered by cuda-kat right now, although that's a different issue.
Our builtins involving warp voting, balloting etc. already support lane-mask parameters for CUDA >= 9; but in `kat::collaboration::warp::`, the methods do not take such parameters. We should add them.
`unaligned.cuh` is missing the `align_down()` function. Either add it or drop `unaligned.cuh`.
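For reference, the usual power-of-two formulation of the missing function (a sketch; the actual signature in `unaligned.cuh` may differ):

```cpp
#include <cstdint>

// Round x down to the nearest multiple of `alignment`,
// which must be a power of 2.
inline std::uintptr_t align_down(std::uintptr_t x, std::uintptr_t alignment) {
    return x & ~(alignment - 1);
}
```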
While it's rarely a great idea, for the sake of completeness, we may want to have implementations of the abstract `<algorithm>` and `<numeric>` algorithms which could be run by all threads without collaboration, each on its own data.
What do you think? Good idea? Bad idea?
See also issue #18.
CUDA atomics apply to types which are of size at most 8 bytes and trivially copyable. There is no benefit, therefore, from their wrappers taking constant references rather than actual values. In fact, this will mostly serve to cause problems if we try to pass rvalues.
So, let's just drop all that use of `const T&` in favor of plain `T`s.
Also, this will mean most atomic wrappers only take a single address (pointer/reference), so we no longer need to mark them with `__restrict__` (!).
The test fixtures have improved, becoming a bit more flexible and requiring less boilerplate, from one test suite to the next. We should use the later ones - currently in `collaborative.cu` - in the earlier tests, particularly `builtins.cu`, and perhaps `math.cu` and others.
We need most, if not all, of the functions in `<algorithm>` and `<numeric>` available on the device, for execution at the warp and block level. (But not the uninitialized-memory stuff, nor qsort/bsearch etc.)
Implementation status:
includes
With a reasonable compiler and a reasonable implementation of a `gsl::span`-like class (see issue #6), there should be no penalty for having device functions take a `kat::span<T>` instead of a `T*` and a `size_t length`. So let's convert some functions to using that...
Of course we might need to make the span templated on the appropriate size type.
CUDA's math API has functions for computing averages of integers without overflow. Let's make those more accessible.
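The underlying trick can be sketched host-side. I believe the CUDA intrinsics in question are `__uhadd()`/`__urhadd()` (and signed variants), but treat those names as an assumption; the bit identities themselves are standard:

```cpp
#include <cstdint>

// floor((a + b) / 2) without computing the (possibly overflowing) sum:
// a + b == 2*(a & b) + (a ^ b), so halving gives (a & b) + ((a ^ b) >> 1)
inline std::uint32_t avg_down(std::uint32_t a, std::uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}

// ceil((a + b) / 2), likewise overflow-free:
// a | b == (a & b) + (a ^ b), so subtracting the floored half of (a ^ b)
// leaves the rounded-up average
inline std::uint32_t avg_up(std::uint32_t a, std::uint32_t b) {
    return (a | b) - ((a ^ b) >> 1);
}
```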
Let's convert our on-device printing code to use a standard-library-like ostream, which, at a time of our choosing, will issue a `printf()` and clear its buffer.
Consider basing this off of: https://github.com/msharov/ustl/releases/tag/v2.8
Let's add a device-capable (and host-capable) tuple type.
Two issues with the include guards:

- They don't have `CUDA_KAT` in them.
- They end with a `/* */` comment rather than a line comment.

Let's change that.
I've somehow mis-implemented `atomic::increment()` and `atomic::decrement()`. I need to bring the comments and the implementation in line with the CUDA Programming Guide description, as well as make sure that the default arguments actually do something useful (which, currently, they do not).
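For reference, the Programming Guide semantics to match, modeled non-atomically here: `atomicInc()` wraps to 0 once the old value reaches the limit, and `atomicDec()` wraps to the limit at 0 (or when the old value exceeds it); both return the old value.

```cpp
// Non-atomic model of CUDA's atomicInc(addr, limit)
inline unsigned model_atomic_inc(unsigned* addr, unsigned limit) {
    unsigned old = *addr;
    *addr = (old >= limit) ? 0 : old + 1;
    return old;
}

// Non-atomic model of CUDA's atomicDec(addr, limit)
inline unsigned model_atomic_dec(unsigned* addr, unsigned limit) {
    unsigned old = *addr;
    *addr = (old == 0 || old > limit) ? limit : old - 1;
    return old;
}
```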
My shorthands for `__forceinline__`, `__device__` and `__host__` are quite convenient, but perhaps it's better I don't force them on people - even at the price of making my code more verbose.
I'd appreciate input from whoever visits this page.
At the moment, a lot of the functions in `grid_info.cuh` are only relevant to linear grids. We should move these into `kat::linear_grid::grid_info`.
We should implement `all_satisfy()`, `none_satisfy()` and `some_satisfy()` at the block level (using the warp-level primitives and shared memory to exchange information).