ginkgo's Issues

Ensuring non-null arguments

The problem

It almost never makes sense to pass a null pointer to one of Ginkgo's methods. There's usually no way to continue, and the best we can do is just throw an exception. However, we don't enforce this anywhere yet, so passing a null pointer results in a crash when the pointer is dereferenced.

Usual solution

The proposed C++ way (Sutter's Mill, also suggested by @mhoemmen) is to use a reference (&) instead of a pointer for objects that cannot be null (assuming we are not transferring any ownership).
I must say I don't like this approach, as it has some drawbacks (listed in order of importance, so if you are convinced by statement n, you can skip statements m > n):

  1. How do you handle non-null objects when you do transfer ownership? The solution doesn't handle the case of a non-null unique_ptr or shared_ptr, just a plain pointer.
  2. Plain pointer / reference, shared pointer, and unique pointer determine the ownership of the object, not the functionality of the underlying object. However, using a reference instead of a pointer changes the way we use the object (instead of -> we need to start using .). This non-uniformity makes it more difficult to write code that doesn't depend on the kind of ownership we have (and may be confusing for some inexperienced users).
    1. The first example is the gko::clone utility proposed in #30. While we only need a single implementation that internally calls obj->clone(), independent of what kind of ownership the object has, supporting references would require writing another version of the method that calls obj.clone() instead.
    2. Say I'm writing a complicated function:
      std::unique_ptr<LinOp> x = /* something */;
      // do 1st thing with x (multiple statements, using x->something())
      // do 2nd thing with x (multiple statements, using x->something())
      And I realize that I might want to simplify the function and make two things reusable by separating them into functions, making "1st thing" a function void f1(LinOp &x), and the second thing void f2(LinOp &y). I didn't change the functionality, but I have to find and replace all x->something() with x.something() just because I want to assert that x cannot be null.
    3. This is the reverse of the previous example. Suppose I have the following function:
      void f(LinOp &x) {
          // do thing 1 with x
          // do thing 2 with x
      }
      Then I figure out I have a bug: "thing 2" should work on the original x, while "thing 1" should work on a temporary copy of x. I decide to add auto tmp = gko::clone(x) and use tmp instead. Now I obviously have to replace every occurrence of x with tmp, but I also have to change how I call things ('.' to '->'), which can be a bit confusing.

Proposed solution

Rather than mixing references and pointers just to express whether an argument can be null, I would suggest adding a decorator that throws a runtime error if the object is null. Here is the pseudocode (which would probably have to be refined a bit to handle special cases, e.g. a plain pointer):

#include <memory>
#include <stdexcept>
#include <utility>

template <typename Pointer>
class non_null {
public:
    template <typename... Args>
    non_null(Args &&... args) : p(std::forward<Args>(args)...) {
        if (p == nullptr) {
            throw std::runtime_error("Expected a non-null pointer");
        }
    }

    auto get() const { return p.get(); }  // have to specialize this for plain pointer
    auto operator *() const { return *p; }
    auto operator ->() const { return this->get(); }  // returning p by value would not compile for unique_ptr
    operator Pointer&() const { return p; }  // maybe explicit?
private:
    mutable Pointer p;
};

We could also publicly inherit from Pointer instead, to provide all the same methods as Pointer does (though we have to be careful not to try to inherit from a plain pointer).

Then we could use:

std::unique_ptr<LinOp> LinOp::generate(non_null<std::shared_ptr<const LinOp>> op);
void LinOp::apply(non_null<const LinOp *> b, non_null<LinOp *> x);

// currently don't have a function in Ginkgo that takes a unique pointer, so making it up:
void my_sink(non_null<std::unique_ptr<LinOp>> op);

I think it's pretty self-documenting that all of these expect a non-null pointer (maybe even more so than passing by reference), and we get a run-time check that the pointer is actually not null.

Is axpy function needed?

COO SpMV kernels need to work on the output vector c directly.
The COO CUDA kernels only compute c = A*b + c.
Thus, for a plain SpMV (c = A*b = A*b + 0), we first set c to zero to ensure the kernel works correctly.

In the Hybrid format, the SpMV function is composed of the ELL and COO SpMV functions.

ell_mtx->apply(b, x); // x = (ELL) * b
coo_mtx->apply(1, b, 1, x); // x = 1 * x + 1 * (COO) * b

The second call spends time multiplying x by 1, which is redundant.

Does this cause a measurable performance issue?
If it does, maybe we need an axpy function computing c = A*b + c instead of calling the advanced SpMV:

ell_mtx->apply(b, x); // x = (ELL) * b
coo_mtx->axpy(b, x); // x = x + (COO) * b

Hybrid (ELL + COO) matrix format

  • declare gko::matrix::Hybrid class template in core/matrix/hybrid.hpp
  • implement any device independent methods in core/matrix/hybrid.cpp
  • mark device dependent methods (like apply) as NOT_IMPLEMENTED
  • add unit tests for device independent methods in core/test/matrix/hybrid.cpp

Warnings caused by spmv_comparison_cuda

The SpMV comparison benchmark causes warnings due to destructors throwing exceptions:

/home/gflegar/Sources/ginkgo/public/repo/benchmark/spmv_comparison_cuda/spmv_comparison_cuda.cpp: In destructor ‘CuspCsrmp::~CuspCsrmp()’:
/home/gflegar/Sources/ginkgo/public/repo/core/base/exception_helpers.hpp:195:64: warning: throw will always call terminate() [-Wterminate]
     ::gko::CusparseError(__FILE__, __LINE__, __func__, _errcode)

This is because destructors are marked noexcept by default. To change the default, silence the warnings, and prevent the code from terminating if an exception is thrown, the destructor should be marked noexcept(false). For example:

struct A {
   ~A() noexcept(false) { throw std::runtime_error("No warnings here"); }
};

Change "padding" to "stride"

(Original issue posted by @hartwiganzt on private repo)

Now that we already have dense and ELL matrices, I conclude from a discussion with Mark that the name "padding" is probably inappropriate. Padding is what you add to make things fit nicely; what we use it for is something different. Traditionally this is called the "leading dimension", and Mark suggests the name "stride".

> *--*            \
> ||||            |
> ||||            | 
> ||||            | stride
> *--* \          |
> :  : | padding  |
> *--* /          /

EDIT: In more detail, the extra amount of memory added to a multidimensional array (unfolded as a series of vectors of size n) to ensure aligned access to each vector is usually called padding. Meaning that the distance (usually called the stride) between the first elements of two consecutive vectors is exactly stride = n + padding.

Current Ginkgo naming is a bit misleading, as it calls the distance between the first elements of two consecutive vectors padding (instead of stride).
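To make the distinction concrete, here is a minimal sketch of row-major storage with a stride (illustrative only; dense_matrix and at are not Ginkgo's API):

#include <vector>

// An m x n row-major matrix where each row occupies stride >= n elements of
// storage, i.e. stride = n + padding; element (i, j) lives at
// data[i * stride + j].
struct dense_matrix {
    int num_rows;
    int num_cols;
    int stride;  // distance between the first elements of consecutive rows
    std::vector<double> data;

    double &at(int i, int j) { return data[i * stride + j]; }
};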

  • replace all references to padding with stride

Automatically comment-out example code generated by Solver/Preconditioner generator

@tcojean could we modify the solver generator script to automatically comment-out these placeholders and add NOT_IMPLEMENTED specifications instead?

It shouldn't be that difficult: every line containing just { (with no indentation) is bound to be the start of a function body (classes and namespaces have { on the same line as the name), and every line containing just } is bound to be the end of a function (classes end with };, and namespaces with } // namespace <name>).

Support for right & two-sided preconditioning

Ginkgo currently supports only left preconditioning.

However, it seems that some use-cases require right and two-sided preconditioning. We started this as an off-topic (or maybe vaguely on-topic) discussion in thread #17, so I'll collect the related comments, and we can continue the discussion here.


[...] we have to decide between left preconditioning ( MAx = Mb ), right preconditioning ( AMy = b, x = My ), and mixed preconditioning. Matlab only supports left preconditioning (afaik), and the same holds for MAGMA-sparse (that I should know.) Is it valid to stick with that? I am not sure what Trilinos/SuperLU/PETSc does. I would appreciate input from the community.
-- @hartwiganzt

Iterative solvers may

  • choose to accept left, right, or both (split) preconditioners.
  • have this option without overhead, if the solver implementation dispatches to different algorithms based on whether or not the (left,right,both) preconditioner(s) exist(s).
    [...]
  • Most users want right preconditioning because it preserves the residual norm (e.g., makes GMRES' "implicit" residual norm the same as its "explicit" residual norm norm(b-A*x, 2)).
  • However, some users want left preconditioning (optimization folks, for example -- I can fish around for details if you're interested).

-- @mhoemmen

I know some people who use left preconditioning and others who use right. Mathematically they give different results, so I think both options should be supported. However, split/both is much less common, so lower priority though it is the most general approach.
-- @egboman (upvoted by @mhoemmen, @drittelhacker)

We talked a little bit about LSQR over the phone today. Matlab actually right-preconditions LSQR:

https://www.mathworks.com/help/matlab/ref/lsqr.html?s_tid=gn_loc_drop

That makes sense because LSQR solves a (constrained) least-squares problem. A "left preconditioner" would actually change the function to minimize. (That's true for GMRES as well, but GMRES aims to solve the linear system and a good solution to the linear system will minimize that function regardless. LSQR works even when there is no x such that Ax=b.)

Our users tend to prefer right preconditioners for GMRES etc. for this reason. However, it's not hard to make GMRES, etc. achieve the tolerance for the actual ("explicit") residual norm(b-A*x,2); first iterate until the "implicit residual" (the thing that GMRES computes from the little upper Hessenberg system) achieves the desired tolerance, then iterate some more until norm(b-A*x,2) does as well. Thus, it's not much different to use a left preconditioner.

-- @mhoemmen

[...] if we support both left and right, we can also support two-sided preconditioning without significant effort.
-- @gflegar

Script create_new_algorithm generates false test targets

There are some small problems with the create_new_algorithm script.

This is what I did:
./create_new_algorithm.sh gmres solver cg
I want to create a new algorithm, gmres, using cg as the model.

  • The script adds false test targets to CMakeLists.txt under the test/solver directories.
    In this case, the false test targets bigmresstab, fgmres, and gmress are generated.

  • Another small issue:
    omp/test/solver/gmres_kernels.cpp is not generated.

The following is the output produced during generation.

Creating temporary files:
./tmp_1532434475/core/solver/gmres.cpp
./tmp_1532434475/core/solver/gmres.hpp
./tmp_1532434475/core/solver/gmres_kernels.hpp
./tmp_1532434475/reference/solver/gmres_kernels.cpp
./tmp_1532434475/omp/solver/gmres_kernels.cpp
./tmp_1532434475/cuda/solver/gmres_kernels.cu
./tmp_1532434475/core/test/solver/gmres.cpp
./tmp_1532434475/reference/test/solver/gmres_kernels.cpp
./tmp_1532434475/cuda/test/solver/gmres_kernels.cpp

Renaming and distributing files
cleaning up temporary files.
Modifiying CMakeLists.txt and common_kernels.inc.cpp
head: cannot open '259' for reading: No such file or directory
./create_new_algorithm.sh: line 326: 178
259: syntax error in expression (error token is "259")

Error message is spotted here

Summary:
Created solver file
core/solver/gmres.cpp
	This is where the gmres algorithm needs to be implemented.

Created class header
core/solver/gmres.hpp
	This is where the gmres class functions need to be implemented.

Created kernel header
core/solver/gmres_kernels.hpp
	This is where the algorithm-specific kernels need to be added.

Created kernel file
reference/solver/gmres_kernels.cpp
	Reference kernels for gmres need to be implemented here.

Created kernel file
omp/solver/gmres_kernels.cpp
	OMP kernels for gmres need to be implemented here.

Created kernel file
cuda/solver/gmres_kernels.cu
	CUDA kernels for gmres need to be implemented here.

Created unit tests for gmres 
core/test/solver/gmres.cpp

Created unit tests for gmres reference kernels
reference/test/solver/gmres_kernels.cpp

Created unit tests for gmres CUDA kernels
cuda/test/solver/gmres_kernels.cpp

Modified core/CMakeLists.txt
Modified reference/CMakeLists.txt
Modified omp/CMakeLists.txt
Modified cuda/CMakeLists.txt
Modified core/test/solver/CMakeLists.txt
Modified reference/test/solver/CMakeLists.txt
Modified cuda/test/solver/CMakeLists.txt
Modified core/device_hooks/common_kernels.inc.cpp
In all of the previous files cg_kernels.cpp was automatically replaced into gmres. Ensure there is no inconsistency.

All the imported code was commented and TODO items were generated in the new files.
Check all the modified files for '// TODO (script):' items
e.g. by using grep -HR '// TODO (script):' /home/u/yanjen/ginkgo/dev_tools/scripts/../..

A summary of the required next steps has been written to:
todo_gmres.txt

Code generation macros impairing easy code navigation and analysis

This issue is the summary of a discussion which arose from MR #46, but it applies more generally to Ginkgo. I will try to lay out the current state of things and the potential problems. If you find any mistake here, feel free to correct it or notify me.

Macros are very useful to generate code

Macros are very useful for generating code. They allow us to completely remove boilerplate from every place it would otherwise appear, and therefore shrink the code base by quite a number of lines. One such current usage is GKO_REGISTER_OPERATION(xxx, kernel), which does the following:

  • create a class named xxx_operation containing run(Executor) specializations launching the correct kernel method
  • create a function named make_xxx_operation which is an interface to build an object of the previously defined class
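For illustration, here is a minimal sketch of the kind of expansion such a macro performs (the names and bodies are made up for this example, not Ginkgo's actual generated code):

#define REGISTER_OPERATION(_name, _kernel)                     \
    class _name##_operation {                                  \
    public:                                                    \
        void run() const { _kernel(); }                        \
    };                                                         \
    template <typename... Args>                                \
    _name##_operation make_##_name##_operation(Args &&...)     \
    {                                                          \
        return _name##_operation{};                            \
    }

void spmv_kernel() { /* launch the actual kernel */ }
REGISTER_OPERATION(spmv, spmv_kernel);

// make_spmv_operation now exists, but the string "make_spmv_operation"
// appears nowhere in the source, so grep and tag-based tools cannot find
// its definition.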

In this case, the macro usage is at the top of the kernel .cpp file, so you can at least go there and inspect the generated code relatively easily.
But that is not always the case, and it's particularly problematic with the factory macros of MR #46, and probably with some macros we already use elsewhere in Ginkgo.

Problem: this impairs easy code navigation and analysis

There is a very old saying, as old as macros themselves, that macros are evil incarnate and should be killed with fire. There are multiple classic reasons for that: macros don't have scope, don't ensure type safety, and are very hard to debug (you need to run the preprocessor to know what code is actually generated by your macro in order to debug it), and so on.

I'm actually kind of fine with the previous points (not so much the debugging), but for me one big problem is when macros add class members or class/method definitions, in particular when the generated name does not appear directly among the parameters. Then you have in your code something such as:

// [...]
make_xxx_operation(...);
// [...]

But if you grep for this symbol, or use more advanced tools such as tags, you will never find it in the source code. The symbol appears from nowhere, because a macro generates it. To find it, you need to go through the macro and pretty much expand the code by hand, which makes code navigation and understanding problematic.

My main question is the following: I personally perceive this to be a pretty serious problem; is it one for you? It may not be for everyone; it really depends on habits and usage, I guess.

Possible solutions

Supposing this is a problem, multiple solutions come to mind. All of them involve some compromise between writing more boilerplate and keeping the code generation capacity that macros provide.

Extreme "solution": no macros

This may be hard to implement in practice, so it's not really a true solution. The theory is to never use macros, but in practice there are always some cases where you need them. Maybe a combination of templated classes and namespaces could completely remove the need for some macros, though this would still lead to more boilerplate.

Do not put member/function declarations in macros

This solution is to let users declare their own members, so the members are properly defined and embedded in the code. The main downside is that this clearly increases the amount of code that has to be written. In addition, you may get weird user/macro interactions, such as:

  • the macro declares a new class foo_type
  • the user has to declare their class's member attributes, foo_type foo_var;, so the user needs to know what the macro does, which is not great design

Never generate symbol names in the macro

The main idea here is that if your macro declares a type foo_type, then as parameter to the macro you pass foo_type directly. That is:

  • Instead of MACRO(foo)
  • I pass MACRO(foo_type, foo_var,...)

There is a clear addition of boilerplate (names), but at least we can grep the code for foo_type and find it very easily. No names appear magically anywhere, and the usual navigation tools should work properly. The downside I can see is that for some macros this parameter list may get fairly long.

Maybe: use some links embedded in documentation

I'm not sure we can do this properly, or with which tool, but an idea would be to add anchors/bookmarks at places in the code, and whenever we use a symbol which comes from nowhere, give a direct link to the precise macro call which generates it.

Maybe: find some decent plugins for standard editors which call the preprocessor on the fly

We could just keep macros and use tools which always run the compiler's preprocessor, so we can view each particular macro expansion when needed. A problem is that the tools I have seen doing this are too simple and not very usable.

Misleading implementation of asynchronous_stopping_criterion

The ByInteraction stopping criterion in the asynchronous_stopping_criterion example is implemented inside the gko::stop namespace. We should change this, as it might lead users to think that custom stopping criteria have to be implemented in that namespace.

In fact, the exact opposite is true. User-defined stopping criteria should be implemented in the namespace of the application using them, and the gko namespace should be used exclusively for Ginkgo code. Otherwise, the user-defined criterion might have name clashes with future versions of Ginkgo.

More descriptive memory management

Currently we use std::move, .get(), and ->clone() in combination with different objects to express who owns what, who shares what, etc. This can get a bit confusing for users who are not so proficient in C++, and the code can look a bit strange.
For example:

using Mtx = gko::matrix::Dense<>;
auto unique = Mtx::create(exec);                              // this is a matrix with unique ownership (i.e. it cannot be passed to methods that want shared ownership)
std::shared_ptr<Mtx> shared = Mtx::create(exec);              // this is a matrix with shared ownership
auto unique_clone = unique->clone();                          // a clone of "unique" with unique ownership, note that the static type of unique_clone is LinOp, not Mtx
std::shared_ptr<gko::LinOp> unique_clone2 = unique->clone();  // a clone of "unique" with shared ownership
auto shared_clone = shared->clone();                          // a clone of "shared" with unique ownership
std::shared_ptr<gko::LinOp> shared_clone2 = shared->clone();  // a clone of "shared" with shared ownership
auto unique_ref = unique.get();                               // a "reference" to unique with no ownership - i.e. they are the same object, unique_ref doesn't "own" the object, "unique" does
auto shared_ref = shared.get();                               // same as above, just that shared can share ownership with someone else

Now if we want to call some functions (consider all the lines as stand-alone examples, they don't work as a block of code, since some lines invalidate the objects):

auto cg_f = gko::solver::CgFactory::create(...);
cg_f->generate(unique);             // error, "unique" has unique ownership of object
cg_f->generate(unique_ref);         // error, "unique_ref" doesn't have ownership of object, and generate() expects it
cg_f->generate(std::move(unique));  // o.k. transferring ownership of unique to generate
cg_f->generate(unique->clone());    // o.k. creating a copy of the object to give to generate

cg_f->generate(shared);             // ok, sharing ownership with generate()
cg_f->generate(shared_ref);         // error, "shared_ref" doesn't have ownership of object
cg_f->generate(std::move(shared));  // ok, giving my ownership of shared to generate() (but there might be others who still have ownership)
cg_f->generate(shared->clone());    // ok, creating a private clone for generate()

auto cg = cg_f->generate(/* something */);
cg->apply(unique, unique_clone);            // error, apply doesn't want ownership
cg->apply(unique_ref, unique_clone.get());  // o.k. passing "references" without ownership
cg->apply(shared, shared_clone2);           // error, apply doesn't want ownership
cg->apply(shared_ref, shared_clone2.get()); // o.k. passing "references" without ownership

There are of course more combinations, but this is enough to make it confusing for some users: where to put .get(), where to put ->clone(), where to use std::move. To make it worse, all 3 are invoked in radically different ways: as a method, as a method on a dereferenced object, and as a top-level function. There's also the problem of having to repeat the type twice when creating a shared object, and the problem that clone() doesn't return the same type but a super-type, which leads to clumsy syntax like:

std::unique_ptr<Mtx> clone(static_cast<Mtx *>(unique->clone().release()));

So, I recommend the following wrappers:

Wrapper               | Semantics                                                          | Implementation
gko::clone            | Clones any object (unique/shared/no ownership)                     | Calls ->clone() and preserves the type
gko::share            | Shares a unique/shared object*                                     | Moves a unique_ptr/shared_ptr into a shared_ptr
gko::give             | Gives ownership of a unique/shared object to a function/operator*  | Same as std::move
gko::lend / gko::loan | Returns a reference to the object, without changing ownership      | Either calls .get(), or just returns the input

* The original input becomes invalid
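To give an idea of how lightweight these wrappers could be, here is a rough sketch of possible implementations (assuming clone() returns a std::unique_ptr to a base class, as it does today; overloads for smart-pointer arguments are omitted, and this is not a finished design):

#include <memory>
#include <type_traits>
#include <utility>

template <typename T>
std::unique_ptr<T> clone(const T *obj)
{
    // preserve the static type by downcasting the cloned base-class pointer
    return std::unique_ptr<T>(static_cast<T *>(obj->clone().release()));
}

template <typename T>
std::shared_ptr<T> share(std::unique_ptr<T> obj)
{
    return std::shared_ptr<T>(std::move(obj));
}

template <typename T>
typename std::remove_reference<T>::type &&give(T &&obj)
{
    return std::move(obj);  // same as std::move, but documents the intent
}

template <typename T>
T *lend(const std::unique_ptr<T> &obj)
{
    return obj.get();  // no ownership is transferred
}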

I am not very happy with "borrow"; I would prefer a shorter name, since it will be the most common one, and I don't really like the verb. But it does fit the semantics perfectly: the function that got the object using `gko::borrow` will assume it "owns" the object until it returns, and at that point it gives it back. From Google Translate:

take and use (something belonging to someone else) with the intention of returning it.

"he had borrowed a car from one of his colleagues"

synonyms: take, take for oneself, help oneself to, use as one's own, abscond with, carry off, appropriate, commandeer, abstract; steal, purloin, shoplift, rob, swipe, nab, rip off, lift, liberate, snaffle, snitch, pinch, half-inch, whip, knock off, nobble, bone, scrump, bag, blag, glom, hook

"his workmates had ‘borrowed’ all his tools"

If anyone has a better name, I'd appreciate it.

I think we figured out we want either gko::lend or gko::loan, which better describe what is happening.

With the new wrappers, the examples would look like this:

using Mtx = gko::matrix::Dense<>;
auto unique = Mtx::create(exec);                      // this is a matrix with unique ownership (i.e. it cannot be passed to methods that want shared ownership)
auto shared = gko::share(Mtx::create(exec));          // this is a matrix with shared ownership
auto unique_clone = gko::clone(unique);               // a clone of "unique" with unique ownership, static type is now Mtx
auto unique_clone2 = gko::share(gko::clone(unique));  // a clone of "unique" with shared ownership
auto shared_clone = gko::clone(shared);               // a clone of "shared" with unique ownership
auto shared_clone2 = gko::share(gko::clone(shared));  // a clone of "shared" with shared ownership
auto unique_ref = gko::lend(unique);                  // a "reference" to unique with no ownership - i.e. they are the same object, unique_ref doesn't "own" the object, "unique" does
auto shared_ref = gko::lend(shared);                  // same as above, just that shared can share ownership with someone else

auto cg_f = gko::solver::CgFactory::create(...);
cg_f->generate(unique);             // error, "unique" has unique ownership of object
cg_f->generate(unique_ref);         // error, "unique_ref" doesn't have ownership of object, and generate() expects it
cg_f->generate(gko::give(unique));  // o.k. transferring ownership of unique to generate
cg_f->generate(gko::clone(unique)); // o.k. creating a copy of the object to give to generate

cg_f->generate(shared);             // ok, sharing ownership with generate()
cg_f->generate(shared_ref);         // error, "shared_ref" doesn't have ownership of object
cg_f->generate(gko::give(shared));  // ok, giving my ownership of shared to generate() (but there might be others who still have ownership)
cg_f->generate(gko::clone(shared)); // ok, creating a private clone for generate()

auto cg = cg_f->generate(/* something */);
cg->apply(unique, unique_clone);                   // error, apply doesn't want ownership
cg->apply(unique_ref, gko::lend(unique_clone));    // o.k. passing "references" without ownership
cg->apply(shared, shared_clone2);                  // error, apply doesn't want ownership
cg->apply(shared_ref, gko::lend(shared_clone2));   // o.k. passing "references" without ownership

@ginkgo-project/developers what do you think about this?

Computing the matrix norm

Sometimes it is beneficial to have a method that computes the norm of a matrix (or linear operator). (Currently we only have a function which computes the column-wise dot product of two dense matrices, which is enough for use in Krylov solvers, but not necessarily in all scenarios.)

There are a few questions though on how we implement this:

  1. There are multiple interesting norms (1-norm, Frobenius, infinity, etc.); do we want to support all of them? And how do we design the interface for that: multiple functions, one templated function, or one function with an extra parameter?
  2. For which kinds of linear operators does it make sense to implement the norm in the first place? Probably not for solvers. For all of the matrices it is not extremely difficult (depending on the norm). For some preconditioners (e.g. block-Jacobi) it is easy, but for others it's more involved (e.g. ILU-type). Do we add it as a virtual method to LinOp, do we create a new interface which is inherited by all operators implementing it (like we have for transpose; see the sketch below), or do we just add non-virtual members to the concrete linear operators implementing it?
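If we go the transpose-like route, the interface could look roughly like this (a sketch only; Normable and norm_type are hypothetical names, not an agreed design):

enum class norm_type { one, frobenius, infinity };

class Normable {
public:
    virtual ~Normable() = default;
    // computes the requested norm of the operator
    virtual double compute_norm(norm_type type) const = 0;
};

// Concrete operators that can provide a norm would then opt in:
// class Csr : public LinOp, public Normable { ... };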

Investigate and implement alternative input/output formats for linear operators

Currently the only supported format for reading/writing matrices is the matrix market format (see PRs #94 and #98).
There are other, more general, data interchange formats (JSON, YAML, etc.) which could be used to store linear operators on disk to enable better interoperability with other software / languages (especially web-based software, which almost exclusively uses JSON for data exchange).

Existing open-source C++ libraries that support this should be investigated (maybe Boost has something?) and integrated into Ginkgo. It should also be investigated whether there is a standardized way to represent sparse and dense matrices (or even linear operators) in these formats, and, if so, represent them in the same way in Ginkgo.

Full support for custom datatypes

Ginkgo is supposed to be designed to work with user-provided datatypes. However, due to the use of external libraries (cuBLAS and cuSPARSE) which do not support this, that is not entirely true.

PR #49 fixes Ginkgo so it can be compiled for such data types, but it doesn't provide alternatives for all external library calls, so parts of Ginkgo will report a not-implemented error for these types.
This issue tracks the progress of these implementations:

  • matrix::Dense::compute_dot() (needs custom dot kernel)
  • matrix::Dense::apply() (needs custom gemm kernel)
  • matrix::Dense::scale() (needs custom scal kernel)
  • matrix::Dense::add_scaled() (needs custom axpy kernel)
  • matrix::Dense::transpose() (needs custom geam kernel)
  • matrix::Dense::conj_transpose() (needs custom geam kernel)
  • matrix::Csr::apply() (needs custom spmv kernel)
  • matrix::Csr::transpose() (needs custom transpose kernel)
  • matrix::Csr::conj_transpose() (needs custom transpose kernel)

Create a script that generates source files

In addition to the scripts for generating solvers/preconditioners/matrices, it would be nice to have a script that generates general source files. For example, typing something like:

ginkgo-cli generate source "module/path/to/file"

should create the following files:

module/path/to/file.cpp:

/* Copy of the licence */

#include "moudle/path/to/file.hpp"


namespace gko {


// Your code goes here


}  // namespace gko

module/path/to/file.hpp:

/* Copy of the licence */

#ifndef MODULE_PATH_TO_FILE_HPP_
#define MODULE_PATH_TO_FILE_HPP_


namespace gko {


// Your code goes here


}  // namespace gko


#endif  // MODULE_PATH_TO_FILE_HPP_

module/test/path/to/file.cpp:

/* Copy of the licence */

#include <module/path/to/file.hpp>


#include <gtest/gtest.h>


namespace {


class File : public ::testing::Test {
    // your test setup goes here
};


TEST_F(File, YourTestName) {
    // Your test goes here
}


// Other tests go here


}  // namespace

In addition:

  • The "closest" CMakeLists.txt file in the path to module/path/to/file.cpp should have file.cpp added to its SOURCES variable (e.g. if there is no CmakeLists.txt in module/path/to/, but there is one in module/path, the SOURCES variable in module/path/CmakeLists.txt should be modified to include to/file.cpp).
  • module/test/path/to/CMakeLists.txt should be created if it does not already exist, and the line create_test(file) should be added to it.

Are custom initial parameters for the "read" function needed?

The function "read" of matrix computes the matrix parameter by itself.
For example, we cannot set customized stride for ell matrices when reading mtx file.
However, when we convert Dense matrix to another matrix format, we can set the custom initial parameters of the matrix.

How about also allowing custom initial parameters for the "read" function?

SELL-P GPU kernels

  • requires #8
  • provide GPU implementation of the kernels in gpu/matrix/sellp_kernels.cu
  • compare GPU implementation with reference kernels in gpu/test/matrix/sellp_kernels.cpp

Improvements to Doxygen

Taken from comments on PR #75

Here are some changes that may be worth considering.

To document functions which are standalone (e.g. gko::transpose), then either:

In addition, if we want to include some functions inside the documentation of a class, such as the operators and transpose for dim, I think the only solution is the following:

  • at the beginning of the related function declarations, insert \memberof dim @{
  • at the end of the function declarations, close the block /** @} */
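Concretely, for dim that block might look like the following (a sketch; the function signatures are placeholders):

/**
 * \memberof dim
 * @{
 */
bool operator==(const dim &x, const dim &y);

dim transpose(const dim &d);
/** @} */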

Finally, at some point we may want to document the master branch separately. Do we already have a location prepared for that, and should we add master as a target for documentation deployment?

Improve GPU detection script

With the current version of the GPU detection script, the CMake configuration fails for new CUDA versions which have not yet been added to the script.

While fixing this is not a lot of work, it's not very pleasant for the user. For example, someone may be happy with an old version of Ginkgo, but want to compile it with a newer CUDA version. This will of course sometimes be impossible if the CUDA API has changed, but if it hasn't, there is no reason why it shouldn't compile.

Thus, removing this restriction from the detection script would be a good idea. It should be possible to do so by extracting the supported architectures directly from the nvcc --help output instead of relying on hardcoded values. We still won't get support for new generation names (Fermi, Maxwell, etc.), but at least the user will be able to compile Ginkgo by specifying the architectures explicitly.

`reference/test/solver/bicgstab_kernels.cpp` tests fail when converted to `float` (i.e., single precision)

Having converted reference/test/solver/bicgstab_kernels.cpp to use single precision, by copying it to reference/test/solver/bicgstab_kernels_float.cpp and making the changes in the diff below:

51c51
<     using Mtx = gko::matrix::Dense<>;
---
>     using Mtx = gko::matrix::Dense<float>;
57c57
<               gko::solver::BicgstabFactory<>::create(exec, 8, 1e-15))
---
>               gko::solver::BicgstabFactory<float>::create(exec, 8, 1e-7))
62c62
<     std::unique_ptr<gko::solver::BicgstabFactory<>> bicgstab_factory;
---
>     std::unique_ptr<gko::solver::BicgstabFactory<float>> bicgstab_factory;
74c74
<     ASSERT_MTX_NEAR(x, l({-4.0, -1.0, 4.0}), 1e-8);
---
>     ASSERT_MTX_NEAR(x, l({-4.0, -1.0, 4.0}), 1e-4);
87c87
<     ASSERT_MTX_NEAR(x, l({{-4.0, 1.0}, {-1.0, 2.0}, {4.0, -1.0}}), 1e-8);
---
>     ASSERT_MTX_NEAR(x, l({{-4.0, 1.0}, {-1.0, 2.0}, {4.0, -1.0}}), 1e-4);
102c102
<     ASSERT_MTX_NEAR(x, l({-8.5, -3.0, 6.0}), 1e-8);
---
>     ASSERT_MTX_NEAR(x, l({-8.5, -3.0, 6.0}), 1e-4);
118c118
<     ASSERT_MTX_NEAR(x, l({{-8.5, 1.0}, {-3.0, 2.0}, {6.0, -5.0}}), 1e-8);
---
>     ASSERT_MTX_NEAR(x, l({{-8.5, 1.0}, {-3.0, 2.0}, {6.0, -5.0}}), 1e-4);

the following tests fail by not attaining the required accuracy:

[ RUN      ] Bicgstab.SolvesMultipleDenseSystems
.../ginkgo/reference/test/solver/bicgstab_kernels_float.cpp:87: Failure
Relative error between x and {{-4.0, 1.0}, {-1.0, 2.0}, {4.0, -1.0}} is 0.11072188922929171
	which is larger than 1e-4 (which is 0.0001)
x is:
	-4.0000228881835938	0.54083889722824097	
	-1.0000052452087402	1.858891487121582	
	4.000030517578125	-0.50263333320617676	
{{-4.0, 1.0}, {-1.0, 2.0}, {4.0, -1.0}} is:
	-4	1	
	-1	2	
	4	-1	
component-wise relative error is:
	5.7220131568155855e-06	0.45916110277175903	
	5.2451812281639528e-06	0.070554256439208984	
	7.6293363240331724e-06	0.49736666679382324	

[  FAILED  ] Bicgstab.SolvesMultipleDenseSystems (1 ms)
[ RUN      ] Bicgstab.SolvesDenseSystemUsingAdvancedApply
.../ginkgo/reference/test/solver/bicgstab_kernels_float.cpp:102: Failure
Relative error between x and {-8.5, -3.0, 6.0} is 0.0014036462767795599
	which is larger than 1e-4 (which is 0.0001)
x is:
	-8.4900836944580078	
	-2.9963669776916504	
	5.9890694618225098	
{-8.5, -3.0, 6.0} is:
	-8.5	
	-3	
	6	
component-wise relative error is:
	0.0011666241814108455	
	0.0012110074361165364	
	0.0018217563629150391	

[  FAILED  ] Bicgstab.SolvesDenseSystemUsingAdvancedApply (0 ms)
[ RUN      ] Bicgstab.SolvesMultipleDenseSystemsUsingAdvancedApply
.../ginkgo/reference/test/solver/bicgstab_kernels_float.cpp:118: Failure
Relative error between x and {{-8.5, 1.0}, {-3.0, 2.0}, {6.0, -5.0}} is 0.0015665962953109148
	which is larger than 1e-4 (which is 0.0001)
x is:
	-8.4900836944580078	0.99250507354736328	
	-2.9963669776916504	1.9976270198822021	
	5.9890694618225098	-4.9917192459106445	
{{-8.5, 1.0}, {-3.0, 2.0}, {6.0, -5.0}} is:
	-8.5	1	
	-3	2	
	6	-5	
component-wise relative error is:
	0.0011666241814108455	0.0074949264526367188	
	0.0012110074361165364	0.0011864900588989258	
	0.0018217563629150391	0.0016561508178710937	

[  FAILED  ] Bicgstab.SolvesMultipleDenseSystemsUsingAdvancedApply (1 ms)

Add support for operator composition

It is often necessary to create a linear operator as the composition of two operators, without explicitly multiplying their coefficient matrices.
This is especially useful for factorization-based preconditioners M = U * V, where the preconditioner application is performed in two stages: M^-1 * x = V^-1 * (U^-1 * x).
In general, given a type ComposedLinOp which contains the two operators as members, every factorization F which decomposes a matrix A into factors B and C can be viewed as a higher-order operation F : A -> (B*C) returning an object of ComposedLinOp type.

Pseudocode:

class ComposedLinOp : public BasicLinOp<ComposedLinOp> {
public:
    LinOp *get_first() { return first_.get(); }
    LinOp *get_second() { return second_.get(); }
    void apply(const LinOp *b, LinOp *x) const {
        auto tmp = x->clone(); // should optimize this by allocating only once
        second_->apply(b, tmp.get());
        first_->apply(tmp.get(), x);
    }
    unique_ptr<LinOp> transpose() const {
        return ComposedLinOp::create(second_->transpose(), first_->transpose());
    }

protected:
    ComposedLinOp(shared_ptr<LinOp> first, shared_ptr<LinOp> second);

private:
    shared_ptr<LinOp> first_;
    shared_ptr<LinOp> second_;
};

Using "modern" CMake in Ginkgo

I have always had trouble finding good and up-to-date tutorials about CMake and how to do things properly there (especially the things that seem possible to do in multiple ways).

I think I finally managed to find a good collection of resources here.
Just by taking a quick look, I already figured out that we're doing some things the wrong way, which will prevent us from playing nicely with other software.

Someone should go over the materials in the link and update our build system to follow the rules.

Build fails with both SET_CUDA_HOST_COMPILER and CMAKE_CUDA_HOST_COMPILER enabled

Excerpt from @venovako's e-mail:

SET_CUDA_HOST_COMPILER=ON and manually setting CMAKE_CUDA_HOST_COMPILER are mutually exclusive.
Here's what happens with the latest GitHub Ginkgo + CMake 3.11.4 if both are set:

Building CUDA object gpu/CMakeFiles/ginkgo_gpu.dir/matrix/coo_kernels.cu.o
cd /home/novakoni/src/GitHub/build/gpu && /state/partition1/soft/cuda-9.1/bin/nvcc -ccbin=/state/partition1/soft/gnu/gcc-6.3.0/bin/g++ -Dginkgo_gpu_EXPORTS -I/home/novakoni/src/GitHub/build/include -I/home/novakoni/src/GitHub/flx_ginkgo/include -isystem=/state/partition1/soft/cuda-9.1/include -I/home/novakoni/src/GitHub/flx_ginkgo  -Xcompiler=-mno-fma -I/home/novakoni/src/GitHub/FloatX/src --compiler-bindir /state/partition1/soft/gnu/gcc-6.3.0/bin/g++ -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_60,code=compute_60 --expt-relaxed-constexpr -O3 -DNDEBUG -Xcompiler=-fPIC   -std=c++11 -x cu -c /home/novakoni/src/GitHub/flx_ginkgo/gpu/matrix/coo_kernels.cu -o CMakeFiles/ginkgo_gpu.dir/matrix/coo_kernels.cu.o
nvcc fatal   : redefinition of argument 'compiler-bindir'

If you look at the compile line, it specifies -ccbin and again --compiler-bindir.
CMAKE_CUDA_HOST_COMPILER is manually set to /state/partition1/soft/gnu/gcc-6.3.0/bin/g++
The message is: if the user sets CMAKE_CUDA_HOST_COMPILER manually, and your system of setting the compiler also kicks in, nvcc dies.
It doesn't bother to check if the two definitions of the host compiler are in fact identical or not.
So, you might want to check if CMAKE_CUDA_HOST_COMPILER and SET_CUDA_HOST_COMPILER are both set and complain about it.

Problems related to mismatching executors

Some operations on Ginkgo's operators assume that the executors of all arguments are the same. This does not have to be true, but with the addition of the gko::make_temporary_clone helper it can usually be handled easily.

BLAS-1 operations on dense matrices are an example of this problem.
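A rough sketch of how an operation could use the helper (the semantics of gko::make_temporary_clone are assumed here to be: copy the argument to the target executor only when the executors differ, and copy mutable data back on destruction):

void scale_impl(std::shared_ptr<const gko::Executor> exec,
                const gko::LinOp *alpha, gko::LinOp *x)
{
    // only copies alpha if it does not already live on exec
    auto local_alpha = gko::make_temporary_clone(exec, alpha);
    // ... launch the kernel using local_alpha.get() and x ...
    // a temporary clone of a mutable argument would copy its data back to
    // the original executor when it goes out of scope
}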

Should num_stored_elements be removed from `LinOp` interface?

I've been redesigning some of Ginkgo's core classes, and one thing that I'm not sure about is whether we want to expose the num_stored_elements property in the LinOp interface. We currently do, but I think that's just for historical reasons (from before we had a unified interface for all linear operators). I think we should have this (or an equivalent) parameter only in classes where it makes sense (like matrices, and some preconditioners).

The cons of having it in the base class are:

  1. It's more complicated to call the base constructor from constructors of concrete LinOps, as we have to pass this extra parameter.
  2. The mathematical concept of a linear operator does not have a "number of stored elements" property, so in a lot of cases we just have to make something up (and pay the price of storing it, but that's a minor concern here). E.g., what is the number of stored elements of the linear operator (A^+) A + (A^T) A (where A^+ is the Moore-Penrose inverse), implemented by just keeping a reference to A and computing the required operations (multiplication, transposed multiplication, solving an under/overdetermined system) on the fly? We can immediately say that the number of rows and columns of such an operator equals the number of columns of A, but we cannot really say what its number of stored elements is (0, the same as that of A, something else?).

The only pro I can see is that users can get the number of stored elements even without knowing the type of the operator they have. However, is this a realistic use case? Can they do anything useful with this information, given that they have no idea whether they have a solver, a preconditioner, or a matrix, and of which kind?

@ginkgo-project/developers what do you think?

I would like to include the conclusions of this discussion in #46, as I've already changed the way we handle operator sizes there, so it makes no sense to change it twice.

Sorting for SELL format

For unbalanced matrices, a pre-sorting step can accumulate rows with similar nonzero counts in the same matrix slice. Combined with scheduling the largest slices first, this should give good performance.
For a sorting ecosystem, we need:

  • actual sorting algorithm
  • storing the swapping information (similar to pivoting)
  • reordering routines for the matrix and the vectors

The reordering of the matrix can happen during the conversion from CSR to SELL-P.
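The sorting step itself is cheap to express on the host; a minimal sketch using only the standard library (nnz_per_row would come from the CSR row pointers; this is illustrative, not a proposed kernel):

#include <algorithm>
#include <numeric>
#include <vector>

// Returns a row permutation grouping rows with similar nonzero counts,
// largest rows first; the permutation doubles as the stored "swapping
// information" needed to reorder the vectors and un-permute the result.
std::vector<int> sort_rows_by_nnz(const std::vector<int> &nnz_per_row)
{
    std::vector<int> perm(nnz_per_row.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        return nnz_per_row[a] > nnz_per_row[b];
    });
    return perm;
}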

Implement atomic operations for custom value types

The current implementation of atomic operations (introduced in #57 and #64) does not support custom value types. Ginkgo will compile, but attempting to use it will result in assertion failures.

We should explore the possibility of general mutex functionality for GPUs, which would allow us to implement the atomics. They won't be extremely fast, but they will at least work correctly.
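For value types that fit into a natively supported integer width, the usual fallback is a compare-and-swap loop, so a mutex would only be needed for wider types. A sketch for double, following the well-known CUDA CAS pattern (the textbook version, not Ginkgo code):

__device__ double atomic_add(double *address, double val)
{
    auto *address_as_ull = reinterpret_cast<unsigned long long *>(address);
    unsigned long long old = *address_as_ull;
    unsigned long long assumed;
    do {
        assumed = old;
        // reinterpret the bits, add, and try to swap the result in
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(
                            __longlong_as_double(assumed) + val));
    } while (assumed != old);  // retry if another thread intervened
    return __longlong_as_double(old);
}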

gcc-5.5 compile error

Compiling with gcc 5.5 fails on range.cpp:
[ 64%] Building CXX object core/test/base/CMakeFiles/core_test_base_range.dir/range.cpp.o

This is part of the error message:
In file included from /home/u/yanjen/ginkgo/build/third_party/gtest/src/googletest/include/gtest/gtest.h:370:0,
                 from /home/u/yanjen/ginkgo/core/test/base/range.cpp:37:
/home/u/yanjen/ginkgo/core/test/base/range.cpp: In member function ‘virtual void {anonymous}::Range_ComputesBitwiseNot_Test::TestBody()’:
/home/u/yanjen/ginkgo/core/test/base/range.cpp:243:26: error: no match for call to ‘(gko::range<gko::accessor::bitwise_not<{anonymous}::dummy_accessor> >) (int, int, int)’
   EXPECT_EQ(res(1, 2, 3), ~(2 + 6 + 3));
                          ^
In file included from /home/u/yanjen/ginkgo/core/test/base/range.cpp:34:0:
/home/u/yanjen/ginkgo/core/base/range.hpp:331:35: note: candidate: template<class ... DimensionTypes> constexpr decltype (declval<gko::range<Accessor>::accessor>()((forward<DimensionTypes>)(gko::range::operator()::dimensions)...)) gko::range<Accessor>::operator()(DimensionTypes&& ...) const [with DimensionTypes = {DimensionTypes ...}; Accessor = gko::accessor::bitwise_not<{anonymous}::dummy_accessor>]
   GKO_ATTRIBUTES constexpr auto operator()(DimensionTypes &&... dimensions)

Should we document methods in std_extensions.hpp?

The std_extensions.hpp file contains implementations of some useful standard utilities that are part of the C++14 and C++17 standards but not of the C++11 standard (though they can be implemented in C++11).

For now, we are not documenting any of them, as they are documented in numerous online references about C++ (the standard draft, cplusplus.com, cppreference.com).

However, there is a proposal that we should still provide some documentation for them, to make the documentation self-contained and not require implementers of Ginkgo to look online for it.

The purpose of this issue is to discuss whether we should do this or not. Please use the thumbs up / thumbs down buttons to vote, and write comments if you want to clarify your position.

A critique of current Krylov method review process

The Problem

After already reviewing a couple of other implementations of Krylov solvers, today I looked into the TFQMR pull request (#23). After taking a closer look, I realized that the same problems I found there could have easily slipped through the review process of all the other Krylov methods (o.k., I do believe CG is sound). And I'm not only talking about design and implementation issues; I'm pretty sure the code itself is wrong (unless there's supposed to be a complicated calculation in there that doesn't affect the final result at all 🤔).

I believe this can happen because we ignore a lot of problems:

  1. Variable names are not descriptive (yes, x is the solution, r is the residual, and tau is the residual norm; maybe you can figure out that d is the update direction, but that's about it). They are taken from some book / implementation which we do not reference anywhere, so we have no idea what each variable represents.
  2. Since we do not know what each variable represents, we also have no idea whether the updates to these variables are correct or not (sure, we can expect that x is updated by d multiplied by some scalar, and r by A * d multiplied by the same scalar, but everything else is a mystery).
  3. We have a single unit test with a 3-by-3 system somewhere in the reference module which "verifies" that the solver works correctly. The problem here is that unit tests are designed to find accidental bugs in the code, not to verify that the algorithm used is sound. So we turn a blind eye to everything we usually check in other parts of the library, and just say: "well, if it looks like a Krylov solver, and it quacked once like a Krylov solver, then it is a Krylov solver".

I'm sure that the implementation of TFQMR did quack once, but there's obviously something wrong with it. The only reason I found there was a problem is that I didn't like how the loop body looked and wanted to see if I could simplify it. How many other "solvers" quacked once, but won't quack again, and got through the review process because no one disliked their loop bodies?

What to do?

Here is what I propose (this is marked with "help wanted", and "question", so I do want to hear other people's opinion):

  1. We should use more meaningful variable names. I know that mathematicians like using single letters, but I don't think this works for software development. We can start by using solution, residual, residual_norm, update_direction, update_step_size instead of x, r, tau, d, <solver-dependent Greek letter>. Doing a bit more typing won't kill us, but it will do a lot for code readability. I realize that the semantics of some variables are hard to express in 2-3 words, so we should add a more detailed description of every variable to the documentation.
  2. We should briefly describe what the solver does (i.e. how it constructs the Krylov subspace, how it tries to find the solution, etc.), and give formulas (with references) for modifying each variable. I'm not asking to prove why the solver works, or to give reasons for the numerical soundness of each update, but anyone reading the documentation should get an idea of what is happening in the algorithm, even if they don't know anything about Krylov methods. I do realize this disqualifies "copy the solver from MATLAB / MAGMA-sparse" as a viable procedure for adding new Krylov solvers, but I hope that everyone can see the problem with that approach.

What follows is the example for CG (and should be included in the doxygen for Cg class):


The conjugate gradient method (CG) solves a linear system Ax = b by minimizing the quadratic form f(x) = 1/2 x^T A x - b^T x. If A is symmetric positive definite, the unique minimum is achieved at the point x which satisfies Ax = b [reference]. Throughout this documentation and the implementation we will also refer to the linear operator A as system_matrix, to b as input_vector, and to the solution x as result_vector. [note: exact notation is still to be discussed, but keep in mind that b and x should have a meaningful name in the context of any linear operator, not just a solver].

The minimum is computed in a series of iterations by starting with an initial guess result_vector(0), and at each iteration i constructing the next guess result_vector(i+1) by making a step of size update_step_size(i) from the current guess result_vector(i) along the update direction vector update_direction(i). That is, the next solution is constructed as:

result_vector(i + 1) = result_vector(i) + update_step_size(i) * update_direction(i)     (1)

Each new update direction is chosen to be A-orthogonal (i.e. orthogonal w.r.t. the dot product p(x, y) = x^T * system_matrix * y) to the vector subspace search_subspace(i) spanned by all previous update directions update_direction(0), ..., update_direction(i-1). [Optional: (Note that A-orthogonal vectors are sometimes called conjugate, hence the "C" in the method name.)]
The size of the step is chosen such that the next error error(i + 1) = true_result - result_vector(i + 1) is A-orthogonal to the new search subspace search_subspace(i + 1) (i.e. span(search_subspace(i) U { update_direction(i) })).
[Optional:
An interesting side-effect of this choice is that the residual residual(i + 1) = input_vector - system_matrix * result_vector(i + 1) is orthogonal to search_subspace(i + 1).
Thus (assuming exact arithmetic), after n steps of the method (n being the size of the problem), the current search subspace will span the entire solution space, so the residual will surely be 0, i.e. the exact solution will be obtained.
]
To reduce the amount of computation, the residuals can be obtained using the recurrence:

residual(i + 1) = residual(i) - update_step_size(i) * mapped_update_direction(i)    (2)

Where mapped_update_direction(i) is the update direction mapped by the linear operator A into its range:

mapped_update_direction(i) = system_matrix * update_direction(i)    (3)

This quantity is already available, as it is required to compute the update step size.

[Optional: The method explained above is often called the method of conjugate directions (CD). The CG method has the additional requirement that the previous residual (residual(i)) is included in the new search subspace search_subspace(i + 1).
It can be shown [reference] that this, together with (2) and (3), implies that the search subspace is a Krylov subspace:

search_subspace(i+1) = span{residual(0), system_matrix * residual(0), ..., system_matrix^i * residual(0)}

] [Can be replaced by a single statement: Finally, the new search direction is chosen such that the previous residual (residual(i)) is included in the new search subspace search_subspace(i + 1).]

With these restrictions, the next update direction can be computed by A-orthogonalizing the last residual [reference]:

residual_norm(i) = residual(i)^T * residual(i)
orthogonalization_factor = residual_norm(i) / residual_norm(i-1)

update_direction(i) =  residual(i) + orthogonalization_factor * update_direction(i - 1)    (4)

And the update step size as [reference]:

update_step_size(i) = residual_norm(i) / (update_direction(i)^T * mapped_update_direction(i))    (5)

[Finally, a section on how to apply the preconditioner should come here, since this is not trivial with CG - we have to implicitly factorize it so we cannot just say we apply the method on M^-1 A.]


Note that this is a bit long, since we are skipping the 2 methods that logically come before CG (steepest descent and conjugate directions). We can make it shorter by removing some of the interesting facts I included here (e.g. the reference to conjugate directions, the fact that it is a Krylov subspace). In general, it should explain what the method does and where to get more information. My guess is that other methods like BiCG will have something in common with CG, so we can reference CG from them; they all share the concepts of residual, update direction, update step size, and residual norm, so we don't have to define them from scratch.
The main point I wanted to make here is that the code can be written in a way such that all variables have some meaning. E.g. you can immediately see that (4) constructs the new search direction from the residual by orthogonalizing it to the previous direction.
You cannot see that from a formula like d_{i+1} = r_{i+1} + beta_{i+1} d_{i}. You may figure out what d and r are, but you sure cannot figure out what beta is.
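To show how this reads in code, here is a small, self-contained rendering of one iteration using the proposed names (std::valarray stands in for Ginkgo's vector type; purely illustrative, not the proposed implementation):

#include <functional>
#include <valarray>

using Vec = std::valarray<double>;

double dot(const Vec &a, const Vec &b) { return (a * b).sum(); }

// One CG iteration; the numbers refer to the formulas above.
void cg_iteration(const std::function<Vec(const Vec &)> &system_matrix,
                  Vec &result_vector, Vec &residual, Vec &update_direction,
                  double &residual_norm)
{
    const double prev_residual_norm = residual_norm;
    residual_norm = dot(residual, residual);
    const double orthogonalization_factor = residual_norm / prev_residual_norm;
    update_direction *= orthogonalization_factor;
    update_direction += residual;                                      // (4)
    const Vec mapped_update_direction = system_matrix(update_direction);  // (3)
    const double update_step_size =
        residual_norm / dot(update_direction, mapped_update_direction);   // (5)
    result_vector += update_step_size * update_direction;              // (1)
    residual -= update_step_size * mapped_update_direction;            // (2)
}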

I've written this description using CG without the agonizing pain (if you look at it as 64 pages condensed into 1-2, I feel I did a decent job), so all [reference] annotations should reference a specific formula/chapter in that book (I was just too lazy to write that).

EDIT: I edited the above text to remove all of my useless and offensive ranting from the first version of the text, so at least anyone reading this later doesn't have to go through it. I would also like to apologize to anyone that read the first version and felt offended, that was not my intention. Finally, I added the tags [Optional: content] to all the optional text in the proposed Cg documentation, to make it clearer what can be removed if we want to be more concise.

Hybrid matrix format GPU kernels

  • requires #5
  • provide GPU implementation of the kernels in gpu/matrix/hybrid_kernels.cu
  • compare GPU implementation with reference kernels in gpu/test/matrix/hybrid_kernels.cpp

Krylov methods expecting factorized preconditioners

Some Krylov methods, such as the Quasi-Minimal Residual method (Templates, page 22), are designed for preconditioners of the form M = U * V.
The implementation of the method assumes that the preconditioner can be decomposed, and the U and V factors applied separately.

How should we implement this in Ginkgo? Currently, the only assumption is that preconditioners can be applied to a vector (there's nothing about decomposing them).
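One option would be a mixin that lets such preconditioners expose their factors, analogous to how transposability is exposed through a separate interface (a sketch; Factorizable and the getter names are made up):

class LinOp;  // Ginkgo's base operator class

// Implemented by preconditioners of the form M = U * V that can hand out
// their factors for separate application.
class Factorizable {
public:
    virtual ~Factorizable() = default;
    virtual const LinOp *get_u_factor() const = 0;
    virtual const LinOp *get_v_factor() const = 0;
};

A Krylov method needing the factors could then dynamic_cast the preconditioner to Factorizable and fall back (or fail with a clear error) if the cast does not succeed.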

ELLPACK matrix format

  • declare gko::matrix::Ell class template in core/matrix/ell.hpp
  • implement any device independent methods in core/matrix/ell.cpp
  • mark device dependent methods (like apply) as NOT_IMPLEMENTED
  • add unit tests for device independent methods in core/test/matrix/ell.cpp
  • add your names to contributors.txt 😄

Simplify specifying CUDA architectures in build system

Ginkgo's build system doesn't specify the GPU architectures on nvcc's command line. For this reason it just compiles the kernels to intermediate code, which requires a JIT compilation pass when the code is first run (maybe even every time the system is rebooted?), and it even fails to build with CUDA 8 or lower, as nvcc then tries to build for compute capability 2.x, which is not supported by Ginkgo. To fix this, the user has to manually modify the CMAKE_CUDA_FLAGS variable to set the architectures to the correct versions (which is awfully verbose when setting multiple architectures).

To simplify the process, we could provide the user with a CMake variable which lets them set the architectures in a simpler way. The build system could then check that all the specified architectures are supported by Ginkgo (and emit a nice error if one is not), and even detect the hardware and populate the variable automatically if the user doesn't set it themselves.
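
For illustration, the CMake side could look roughly like this (the variable name GINKGO_CUDA_ARCHITECTURES is hypothetical, and the support check and hardware detection are omitted):

set(GINKGO_CUDA_ARCHITECTURES "35;60" CACHE STRING
    "Semicolon-separated list of compute capabilities to compile for")
foreach(arch IN LISTS GINKGO_CUDA_ARCHITECTURES)
    # expands to the -gencode flags the user currently has to write by hand
    string(APPEND CMAKE_CUDA_FLAGS
        " -gencode=arch=compute_${arch},code=sm_${arch}")
endforeach()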

Verifying Ginkgo with custom value types

Currently, we are only compiling Ginkgo for the default value types (float, double, and std::complex<> of those) when checking the correctness of our implementation. (We're also testing only for double, see #32 for discussions on that.)

This means that the interface ValueType needs to provide may change in a PR without anyone noticing (e.g. someone uses a part of the interface of the default types which user-defined types may not have to implement).

I propose we create a dummy ValueType implementation (we would create a concept instead if TS concepts were a part of C++11); it could just be a double wrapped in a class, exposing exactly the interface that Ginkgo's ValueType has to satisfy (a sketch follows below the list). This gives us two benefits:

  • we can try to compile with this type whenever we test Ginkgo and get a compiler error whenever we inadvertently assume something new about ValueType's interface
  • users can look into this type, and see what their custom value type needs to have defined in order to work with Ginkgo
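
A minimal sketch of such a dummy type, assuming the required interface consists of the usual arithmetic operators plus free abs() and sqrt() functions; the real set of requirements would have to be collected from Ginkgo's kernels, so treat the member list as illustrative:

#include <cmath>
#include <iostream>

class DummyValue {
public:
    DummyValue(double value = 0.0) : value_{value} {}

    DummyValue &operator+=(DummyValue o) { value_ += o.value_; return *this; }
    DummyValue &operator-=(DummyValue o) { value_ -= o.value_; return *this; }
    DummyValue &operator*=(DummyValue o) { value_ *= o.value_; return *this; }
    DummyValue &operator/=(DummyValue o) { value_ /= o.value_; return *this; }

    friend DummyValue operator+(DummyValue a, DummyValue b) { return a += b; }
    friend DummyValue operator-(DummyValue a, DummyValue b) { return a -= b; }
    friend DummyValue operator*(DummyValue a, DummyValue b) { return a *= b; }
    friend DummyValue operator/(DummyValue a, DummyValue b) { return a /= b; }
    friend DummyValue operator-(DummyValue a) { return DummyValue{-a.value_}; }

    friend bool operator==(DummyValue a, DummyValue b)
    {
        return a.value_ == b.value_;
    }
    friend bool operator<(DummyValue a, DummyValue b)
    {
        return a.value_ < b.value_;
    }

    friend DummyValue abs(DummyValue a) { return DummyValue{std::abs(a.value_)}; }
    friend DummyValue sqrt(DummyValue a)
    {
        return DummyValue{std::sqrt(a.value_)};
    }

    friend std::ostream &operator<<(std::ostream &os, DummyValue a)
    {
        return os << a.value_;
    }

private:
    double value_;
};

int main()
{
    DummyValue a{3.0}, b{4.0};
    // any expression valid for Ginkgo's default value types should compile:
    std::cout << sqrt(a * a + b * b) << '\n';  // prints 5
}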

ELLPACK reference kernels

  • requires #1
  • implement device dependent methods in core/matrix/ell.cpp
  • declare all kernels required by device dependent methods in core/base/ell_kernels.hpp
  • define these kernels as NOT_COMPILED in core/base/device_hooks/common_kernels.inc.cpp
  • define these kernels as NOT_IMPLEMENTED in cpu/matrix/ell_kernels.cpp and gpu/matrix/ell_kernels.cu
  • provide reference implementations of these kernels in reference/matrix/ell_kernels.cpp
  • add unit tests for device-dependent methods in reference/test/matrix/ell_kernels.cpp

SELL-P reference kernels

  • requires #7
  • implement device dependent methods in core/matrix/sellp.cpp
  • declare all kernels required by device dependent methods in core/base/sellp_kernels.hpp
  • define these kernels as NOT_COMPILED in core/base/device_hooks/common_kernels.inc.cpp
  • define these kernels as NOT_IMPLEMENTED in cpu/matrix/sellp_kernels.cpp and gpu/matrix/sellp_kernels.cu
  • provide reference implementations of these kernels in reference/matrix/sellp_kernels.cpp
  • add unit tests for device-dependent methods in reference/test/matrix/sellp_kernels.cpp

ELLPACK GPU kernels

  • requires #2
  • provide GPU implementation of the kernels in gpu/matrix/ell_kernels.cu
  • compare GPU implementation with reference kernels in gpu/test/matrix/ell_kernels.cpp

SELL-P matrix format

  • declare gko::matrix::Sellp class template in core/matrix/sellp.hpp
  • implement any device independent methods in core/matrix/sellp.cpp
  • mark device dependent methods (like apply) as NOT_IMPLEMENTED
  • add unit tests for device independent methods in core/test/matrix/sellp.cpp

link pthread failed when using cmake 3.12.0

When using CMake 3.12.0, the build produces the error nvcc fatal : Unknown option 'pthread'.

The link commands generated by CMake 3.10.1 and CMake 3.12.0 differ:

// cmake-3.10.1
/usr/local/cuda-8.0/bin/nvcc   -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_35,code=compute_35 --expt-relaxed-constexpr  -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink CMakeFiles/cuda_test_base_cuda_executor.dir/cuda_executor.cu.o -o CMakeFiles/cuda_test_base_cuda_executor.dir/cmake_device_link.o 
// cmake-3.12.0
/usr/local/cuda-8.0/bin/nvcc   -gencode=arch=compute_35,code=sm_35 -gencode=arch=compute_35,code=compute_35 --expt-relaxed-constexpr -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink CMakeFiles/cuda_test_base_cuda_executor.dir/cuda_executor.cu.o -o CMakeFiles/cuda_test_base_cuda_executor.dir/cmake_device_link.o -Xnvlink ../../../core/libginkgo.so -Xnvlink ../../../third_party/gtest/build/googlemock/gtest/libgtest_main.so -Xnvlink ../../../omp/libginkgo_omp.so -Xnvlink ../../libginkgo_cuda.so -Xnvlink ../../../reference/libginkgo_reference.so -Xnvlink ../../../third_party/gtest/build/googlemock/gtest/libgtest.so -pthread

The -pthread flag comes from gtest.

It probably comes from this CMake commit:
https://gitlab.kitware.com/cmake/cmake/commit/41eab150a8ef42bbebff18ff84652e9da1ef4e75

Support for multiple GPUs

The GpuExecutor currently ignores the device_id field, and just uses the currently active device.
To fix this, several of the GpuExecutor's methods have to be modified:

  • GpuExecutor class in /core/base/executor.hpp has to override the run() method defined in Executor.
  • an implementation of the overridden run() method equivalent to the one in ExecutorBase should be put in /core/device_hooks/gpu_hooks.cpp (this is the one that will be used if the GPU module is not compiled)
  • another implementation of the overridden run() method should be put into /gpu/base/executor.cpp (this is the one used when the GPU module is compiled). This implementation should do the following (see the sketch after this list):
    1. get the currently active device and save it to a local variable
    2. set the currently active device to device_id_
    3. call op.run() in the same way as in ExecutorBase::run()
    4. restore the previously active device from the local variable (this is to ensure Ginkgo plays nicely with other libraries)
  • all other methods in /gpu/base/executor.cpp (except get_num_devices and GpuExecutor::raw_copy_to(GpuExecutor,...)) should be modified to include these device manipulation steps
  • GpuExecutor::raw_copy_to(GpuExecutor,...) should be implemented using cudaMemcpyPeer instead of cudaMemcpy
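
A minimal sketch of steps 1-4 above, using only the CUDA runtime API; in the real implementation this logic would live inside GpuExecutor's methods, and step 3 would call op.run() as in ExecutorBase::run():

#include <cuda_runtime.h>

template <typename Operation>
void run_on_device(int device_id, Operation op)
{
    int previous_device = 0;
    cudaGetDevice(&previous_device);  // 1. save the currently active device
    cudaSetDevice(device_id);         // 2. activate this executor's device
    op();                             // 3. run the operation on that device
    cudaSetDevice(previous_device);   // 4. restore the previous device, so
                                      //    we play nicely with other libraries
}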

Testing freezes in `gpu/test/preconditioner/block_jacobi_kernels` with Titan V

Testing freezes in gpu/test/preconditioner/block_jacobi_kernels when using CUDA 9.0 and CUDA 9.1 with a Titan V.

The following test cases in block_jacobi_kernels freeze:

  • BlockJacobi.GpuPreconditionerEquivalentToRefWithMPW
  • BlockJacobi.GpuApplyEquivalentToRef
  • BlockJacobi.GpuLinearCombinationApplyEquivalentToRef
  • BlockJacobi.GpuApplyToMultipleVectorsEquivalentToRef
  • BlockJacobi.GpuLinearCombinationApplyToMultipleVectorsEquivalentToRef

Running these tests is fine when using CUDA 8.0, 9.0, and 9.1 with a K40, but they freeze when using CUDA 9.0 and 9.1 with a Titan V.

Hybrid matrix format reference kernels

  • requires #4
  • implement device dependent methods in core/matrix/hybrid.cpp
  • declare all kernels required by device dependent methods in core/base/hybrid_kernels.hpp
  • define these kernels as NOT_COMPILED in core/base/device_hooks/common_kernels.inc.cpp
  • define these kernels as NOT_IMPLEMENTED in cpu/matrix/hybrid_kernels.cpp and gpu/matrix/hybrid_kernels.cu
  • provide reference implementations of these kernels in reference/matrix/hybrid_kernels.cpp
  • add unit tests for device-dependent methods in reference/test/matrix/hybrid_kernels.cpp

Ranges

Note: this issue was moved from Gitlab: https://gitlab.com/ginkgo-project/ginkgo/issues/74

It is often required to obtain a sub-range of an array or a matrix, for example when designing algorithms that work with matrix blocks. Formalizing this in a range concept and defining classes which implement it would have several benefits:

  • it removes the need for error-prone arithmetic involving pointer offsets, sizes and "ldas" to implement sub-ranges,
  • it reduces the number of function parameters and bugs due to wrong dimensions, as each range contains information about the pointer, the size, and the shape of the range,
  • it abstracts away memory access so additional facilities can be added: e.g. out-of-bound checking in debug mode, access to values types stored in structure-of-arrays format (think of the new idea of splitting a double number as s|eeeeeeeeeee|mmmmmmmmmmmmmmmmmmmm||mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm and storing the two halves separately),
  • it implements the same relationship as standard library containers and iterators - the first one "owns" the data, and the second one is used to manipulate it without carrying any data ownership semantics

Here is how I imagine the interface could look (inspired by the Armadillo library):

template <typename Range>
void range_example(Range r)
{
    constexpr int dims = Range::num_dimensions; // get the number of dimensions in the range (1 for vec, 2 for matrix, should also support tensors)
    constexpr auto order = Range::storage_order;  // get the storage order of the range - row-major/column-major (1)
    static_assert(dims >= 2, "This only works for ranges with at least 2 dimensions");
    auto elem = r(3, 4); // get element at position (3, 4), if the range has more dimensions all the trailing dimensions are set to 0s
    auto sub_range = r(span(end - 3, end), span());  // get subrange (end - 3:end, 0:end, 0, ...) of the range (2), (3)
    r(span(0, 3), span()) = sub_range;  // copy sub-range, range implementation can optimize the copy for the storage scheme
    auto r2 = r.reshape(4, 5, 2);  // reshape the range to size (4, 5, 2);
    auto s = size(r2);  // get the size of a range, in this case (4, 5, 2);
}

(1) maybe we could also have this as a list of dimension indexes - the implementation could be easier, and it is a more flexible solution for multidimensional ranges
(2) span is the name they use in armadillo, maybe we can think of something better - but should be both clear and short, as we'll use it all the time, and also shouldn't clash with any other "usual" names
(3) end could cause a lot of name clashes - not sure how to solve this; it has to be descriptive and short, but I am worried about end. Maybe npos, as this will be familiar to anyone who has used std::string?
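
To make (2) and (3) concrete, here is a minimal sketch of what span could look like (one possible resolution of the end problem, illustrative only; note that a namespace-scope end is exactly the kind of name that clashes with std::end):

#include <cstddef>

// sentinel resolved against the dimension's size at access time; negative
// values count from the end, so expressions like end - 3 also work
constexpr std::ptrdiff_t end = -1;

struct span {
    constexpr span() : begin{0}, last{end} {}  // default: the whole dimension
    constexpr span(std::ptrdiff_t begin_, std::ptrdiff_t last_)
        : begin{begin_}, last{last_} {}
    std::ptrdiff_t begin;
    std::ptrdiff_t last;
};

// the range implementation would translate negative indexes like this:
constexpr std::ptrdiff_t resolve(std::ptrdiff_t index, std::ptrdiff_t size)
{
    return index < 0 ? size + index : index;
}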

Simplify defining sets of named vectors

I feel that defining tons of vectors like this in every solver is getting nonsensical. How about we define some helper macros for that:

#include <memory>
#include <type_traits>

namespace detail {


template <typename T>
struct add_unique_ptr {
    using type = std::unique_ptr<T>;
};

template <typename T>
struct add_unique_ptr<T*> {
    using type = std::unique_ptr<T>;
};

template <typename T, typename Deleter>
struct add_unique_ptr<std::unique_ptr<T, Deleter>> {
    using type = std::unique_ptr<T, Deleter>;
};


template <typename SourceType>
inline void initialize_named_vectors(const SourceType &) {}

template <typename SourceType, typename VectorType, typename... OtherVectorTypes>
inline void initialize_named_vectors(const SourceType &source, VectorType &vector, OtherVectorTypes &... rest)
{
    // create each vector with the same executor and shape as the source
    using source_type = typename std::decay<decltype(*source)>::type;
    vector = source_type::create(source->get_executor(), source->get_num_rows(), source->get_num_cols(), source->get_stride());
    initialize_named_vectors(source, rest...);
}

template <typename SourceType, typename VectorType, typename... OtherVectorTypes>
inline void initialize_named_scalars(const SourceType &source, VectorType &vector, OtherVectorTypes &... rest)
{
    // the first scalar is created from the source, and the remaining ones
    // are created from the first scalar by reusing the vector initializer
    using source_type = typename std::decay<decltype(*source)>::type;
    vector = source_type::create(source->get_executor(), 1, source->get_num_cols());
    initialize_named_vectors(vector, rest...);
}


}  // namespace detail


#define INITIALIZE_NAMED_VECTORS(_source, ...) \
    typename detail::add_unique_ptr<typename std::decay<decltype(_source)>::type>::type __VA_ARGS__; \
    detail::initialize_named_vectors(_source, __VA_ARGS__);

#define INITIALIZE_NAMED_SCALARS(_source, ...) \
    typename detail::add_unique_ptr<typename std::decay<decltype(_source)>::type>::type __VA_ARGS__; \
    detail::initialize_named_scalars(_source, __VA_ARGS__);

Note that I didn't test this code, so there could be some bugs in it.
But once we have it set up, we should be able to do:

INITIALIZE_NAMED_VECTORS(dense_b, r, r_tld, p, q, u, u_hat, v_hat, t);
INITIALIZE_NAMED_SCALARS(dense_b, alpha, beta, gamma, rho_prev, rho, tau);
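
For reference, the first line above would expand to roughly the following (shortened to two names):

typename detail::add_unique_ptr<
    typename std::decay<decltype(dense_b)>::type>::type r, r_tld;
detail::initialize_named_vectors(dense_b, r, r_tld);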

Add support for factorization-based preconditioners

Once we have #26, adding factorization-based preconditioners will be straightforward.

We only need to implement factorizations (I propose we use the factorization namespace for those): a special kind of LinOp factory that returns a ComposedLinOp.
E.g. pseudocode for ILU:

template <typename FactorType = matrix::Csr<>>
class Ilu : public LinOpFactory {
    std::unique_ptr<LinOp> generate(std::shared_ptr<LinOp> op)
    {
        // calculate size, memory requirement
        auto l_factor = FactorType::create(/* size, memory requirements */);
        auto u_factor = FactorType::create(/* size, memory requirements */);
        launch_factorization_kernel(op.get(), l_factor.get(), u_factor.get());
        return ComposedLinOp::create(std::move(l_factor), std::move(u_factor));
    }
};

fact = ilu->generate(A) gives us a linear operator fact = L*U from which we can get L and U by calling fact->get_first() and fact->get_second().

We can then use these factors to create an ILU preconditioner (i.e. an operator that solves a pair of systems Ly = b, Ux = y) by simply doing:

auto A = /* initialize matrix */;
auto upper_triangular_solver = /* initialize upper triangular solver factory */;
auto lower_triangular_solver = /* initialize lower triangular solver factory */;
auto factorization = /* initialize factorization factory */;

auto factorized_A = factorization->generate(A);
auto precond = ComposedLinOp::create(
    upper_triangular_solver->generate(factorized_A->get_second()),
    lower_triangular_solver->generate(factorized_A->get_first()));

// use precond in a solver

Of course, it's easier for the user if they don't have to compose the preconditioner manually (they could also compose it the wrong way - reversing the order, or forgetting to add the solvers), so we can add a utility FactorizedPreconditionerFactory that does the correct composition for the user (it also has the advantage that it can be used in place of any other LinOpFactory):

auto A = /* initialize matrix */;
auto upper_triangular_solver = /* initialize upper triangular solver factory */;
auto lower_triangular_solver = /* initialize lower triangular solver factory */;
auto factorization = /* initialize factorization factory */;
auto precond_factory = FactorizedPreconditionerFactory::create(factorization, lower_triangular_solver, upper_triangular_solver);

auto precond = precond_factory->generate(A);

// use precond in a solver
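
Internally, the factory's generate would just mirror the manual composition above; a sketch in the same pseudocode style as the Ilu example (the member names are illustrative):

class FactorizedPreconditionerFactory : public LinOpFactory {
    std::unique_ptr<LinOp> generate(std::shared_ptr<LinOp> op)
    {
        // pseudocode: assumes the result is the ComposedLinOp returned by
        // the factorization, so the factors are accessible
        auto factorized = factorization_->generate(op);
        // same order as the manual version: the lower solve has to happen
        // first, so it becomes the second operand of the composition
        return ComposedLinOp::create(
            upper_solver_factory_->generate(factorized->get_second()),
            lower_solver_factory_->generate(factorized->get_first()));
    }

    // the factories passed to create(), stored by the constructor
    std::shared_ptr<LinOpFactory> factorization_;
    std::shared_ptr<LinOpFactory> lower_solver_factory_;
    std::shared_ptr<LinOpFactory> upper_solver_factory_;
};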

Class naming convention

This is a summary of a small discussion which arose during MR #46, but which is more general in Ginkgo. The purpose of this issue is simply to discuss it and collect everyone's opinion on the subject.

Current class naming convention

The current class naming convention for Ginkgo is detailed here and can be summed up as follows (correct me if I'm wrong in any of this post):

  • Usual classes follow the same naming convention as functions, variables, etc., which is class_name.
  • Classes with some sort of polymorphic behavior use ClassName.

The reason for the distinction is that polymorphic classes should be used with care and are "heavier" in a way (and capital case can be seen as heavy).

Problems

There are some problems, though, in deciding what counts as polymorphic behavior. For example, we called the array class Array: it stores an executor, so it can be seen as having polymorphic behavior, but it has no virtual methods and all of its methods are quite cheap.

I also think this naming convention can be confusing, and at the minimum it should be made very clear somewhere in the code. The reason is that the convention itself serves as a form of documentation, and it is not useful unless it is embedded in the documentation: ClassName documents that the class is polymorphic and could imply significant overhead, depending on what you do with it or which function you use.

Personal proposal

I also personally like being able to distinguish classes from functions, variable names, etc., in all cases, so I'm in favor of a uniform ClassName convention. In addition, it's true that some classes (LinOps, for example) are heavily polymorphic and should be used with care. I think we should have a standard way of adding a @warning to the documentation of those classes.

What do you think, @ginkgo-project/developers?
