
librapid's Introduction


What is LibRapid?

LibRapid is an extremely fast, highly optimised and easy-to-use C++ library for mathematics, linear algebra and more, with a powerful multidimensional array class at its core. Every part of LibRapid is designed to deliver the best possible performance without the sacrifices that other libraries often make.

Everything in LibRapid is templated, meaning it'll just work with almost any datatype you throw at it. In addition, LibRapid is engineered with compute-power in mind, meaning it's easy to make the most out of the hardware you have. All array operations are vectorised with SIMD instructions, parallelised via OpenMP and can even be run on external devices via CUDA and OpenCL. LibRapid also supports a range of BLAS libraries to make linear algebra operations even faster.


What's more, LibRapid provides lazy evaluation of expressions, allowing us to perform optimisations at compile-time to further improve performance. For example, dot(3 * a, 2 * transpose(b)) will be compiled into a single GEMM call, with alpha=6, beta=0, transA=false and transB=true.

Why use LibRapid?

If you need the best possible performance and an intuitive interface that doesn't sacrifice functionality, LibRapid is for you. You can fine-tune LibRapid's performance via the CMake configuration and change the device used for a computation by changing a single template parameter (e.g. librapid::backend::CUDA for CUDA compute).

Additionally, LibRapid provides highly-optimised vectors, complex numbers, multiprecision arithmetic (via custom forks of MPIR and MPFR) and a huge range of mathematical functions that operate on all of these types. LibRapid also provides a range of linear algebra functions, machine learning activation functions, and more.

When to use LibRapid

  • When you need the best possible performance
  • When you want to write one program that can run on multiple devices
  • When you want to use a single library for all of your mathematical needs
  • When you want a simple interface to develop with

When not to use LibRapid

  • When you need a rigorously tested and documented library
    • LibRapid is still in early development, so it's not yet ready for production use. That said, we still have a wide range of tests which are run on every push to the repository, and we're working on improving the documentation.
  • When you need a well-established library.
    • LibRapid hasn't been around for long, and we've got a very small community.
  • When you need a wider range of functionality.
    • While LibRapid implements a lot of functions, there are some features which are not yet present in the library. If you need these features, you may want to look elsewhere. If you would still like to use LibRapid, feel free to open an issue and I'll do my best to implement it.

Documentation

Latest Documentation
Develop Branch Docs

LibRapid uses Doxygen to parse the source code and extract documentation information. We then use a combination of Breathe, Exhale and Sphinx to generate a website from this data. The final website is hosted on Read the Docs.

The documentation is rebuilt every time a change is made to the source code, meaning it is always up-to-date.

Current Development Stage

At present, LibRapid C++ is being developed solely by me (pencilcaseman).

I'm currently a student in my first year of university, so time and money are both tight. I'm working on LibRapid in my spare time, and I'm not able to spend as much time on it as I'd like to.

If you like the library and would like to support its development, feel free to create issues or pull requests, or reach out to me via Discord and we can chat about new features. Any support is massively appreciated.

The roadmap is a rough outline of what I want to get implemented in the library and by what point, but please don't count on features being implemented quickly -- I can't promise I'll have the time to implement everything as soon as I'd like... (I'll try my best though!)

If you have any feature requests or suggestions, feel free to create an issue describing it. I'll try to get it working as soon as possible. If you really need something implemented quickly, a small donation would be appreciated, and would allow me to bump it to the top of my to-do list.

Dependencies

LibRapid has a few dependencies to improve functionality and performance. Some of these are optional, and can be configured with a CMake option. The following is a list of the external dependencies and their purpose (these are all submodules of the library -- you don't need to install anything manually):

  • OpenMP - Multi-threading library
  • CUDA - GPU computing library
  • OpenCL - Multi-device computing library
  • OpenBLAS - Highly optimised BLAS library
  • MPIR - Arbitrary precision integer arithmetic
  • MPFR - Arbitrary precision real arithmetic
  • FFTW - Fast(est) Fourier Transform library


Support

Thanks to JetBrains for providing LibRapid with free licenses for their amazing tools!


librapid's People

Contributors

athulmekkoth, dependabot[bot], nervousnullptr, pencilcaseman, tcmetzger


librapid's Issues

Array BLAS Functions

Add support for calling gemm or dot, for example, directly on Array objects. This could dramatically simplify the implementation of the code as well :)

Test Issue

This is a test issue with some code which should be Carbonited

for i in range(100):
    print("Hello, World")

print("Goodbye, World")

Does this work?

Matrix Transposition

Add support for matrix transpositions.

This should also allow the conversion of a vector array into a column vector with dimensions Nx1.

Can this be linked with matrix multiplication (#141) to provide higher-performance matrix operations? Transposing a matrix and then doing a matrix product with it could automatically use gemm with transposed arguments.

Test Issue

Below is a code block which should be turned into an image :)

for i in range(123):
    print("Hello, World!")

Matrix Transpose Bugs

Matrix transposition works fine if all the arrays are the correct size, but it's possible that memory errors could arise if the arrays are not correctly sized.

Array Manipulation Error

Not sure exactly where the error is, but the following code doesn't work under the current Development and Master branches.

lrc::Array<float> val(lrc::Shape({2, 2}));
lrc::Array<float> lower1(lrc::Shape({2, 2}));
lrc::Array<float> upper1(lrc::Shape({2, 2}));
lrc::Array<float> lower2(lrc::Shape({2, 2}));
lrc::Array<float> upper2(lrc::Shape({2, 2}));

val << 1, 2, 3, 4;
lower1 << 0, 0, 0, 0;
upper1 << 10, 10, 10, 10;
lower2 << 0, 0, 0, 0;
upper2 << 100, 100, 100, 100;

fmt::print("{}\n\n", val);
fmt::print("{}\n\n", lower1);
fmt::print("{}\n\n", upper1);
fmt::print("{}\n\n", lower2);
fmt::print("{}\n\n", upper2);
fmt::print("{}\n", lrc::map(val, lower1, upper1, lower2, upper2));

Wait for more powerful hosted-runners

Hopefully, by the end of Q3 2022, GitHub Actions will support more powerful, custom runners. This will dramatically reduce the build times for LibRapid wheels, as well as allowing for a more fully-featured package due to fewer limitations on RAM and CPU power.

github/roadmap#161

Matrix product

Support 2D gemm functionality, as well as higher-dimensional products

CUDA support is also required

Array Copying

Create a function for strided array copying, otherwise the entire array library will crash with non-trivial arrays.

The main issue will occur within the multiarray_operations.hpp file, as makeSameAccelerator does not take into account strides, which will lead to some form of segfault or simply using the wrong values.

Array Slicing

Is your feature request related to a problem? Please describe.
There is currently no way of accessing sub-arrays without manually iterating over them, which is sometimes difficult and inconvenient.

Describe the solution you'd like
Some sort of ArraySlice object that can be used as a strided view of an Array object.

Describe alternatives you've considered

  • Use a strided array to begin with? Doesn't feel optimal
  • Do not allow strided access, but allow for sub-array access (parent data pointer)

Benchmarks

Benchmark the Array type and other classes and helpers

Compare against

  • Numpy
  • XTensor
  • Eigen
  • Boost multidimensional array

Code Cleanup

The code is becoming increasingly fragmented -- this makes development more difficult and leads to unwanted and unexpected bugs.

Each file should implement only one thing, and header files should not include any other headers. Instead, files should be included in librapid.hpp in the correct order, and STL includes should be at the top of config.hpp

Matrix Transposition not working

Tested in Python



import librapid as lrp

x = lrp.Array(lrp.Extent((1000, 1000)), "f32")
x.transpose() # This doesn't work -- list type is invalid
x.transpose(lrp.Extent()) # This gives the wrong output. Possibly a missing copy?


Remove "patch" definitions

Describe the bug
The patch() functions defined in the multiprecision source files when multiprecision is NOT enabled should be removed/fixed

To Reproduce
Steps to reproduce the behavior:

  1. Include LibRapid without LIBRAPID_USE_MULTIPREC

Minimal Reproducible Example

#include <librapid>

int main() {
    fmt::print("Hello, World\n");
    return 0;
}

Expected behavior
No warnings

Stack Traces

multiprecModAbs.cpp.obj : warning LNK4006: "int __cdecl patch(int)" (?patch@@YAHH@Z) already defined in multiprecTrig.cpp.obj; second definition ignored
multiprecHypot.cpp.obj : warning LNK4006: "int __cdecl patch(int)" (?patch@@YAHH@Z) already defined in multiprecTrig.cpp.obj; second definition ignored
multiprecFloorCeil.cpp.obj : warning LNK4006: "int __cdecl patch(int)" (?patch@@YAHH@Z) already defined in multiprecTrig.cpp.obj; second definition ignored
multiprecExpLogPow.cpp.obj : warning LNK4006: "int __cdecl patch(int)" (?patch@@YAHH@Z) already defined in multiprecTrig.cpp.obj; second definition ignored
multiprecCasting.cpp.obj : warning LNK4006: "int __cdecl patch(int)" (?patch@@YAHH@Z) already defined in multiprecTrig.cpp.obj; second definition ignored

Matrix Multiplication

Add support for:

  • Matrix-matrix multiplication
  • Matrix-vector multiplication
  • Vector product

Matrix Transposition is too Slow

Matrix transposition is currently implemented trivially and is quite slow compared to other libraries. It should be optimised for all array dimensions, but especially for matrices, where the transpose is performance-critical in many applications.

Ideas:

  1. OpenBlas *omatcopy
  2. Hard-coded matrix transpose for 2D
  3. Vectorised transpose (See transpose.hpp)
  4. Help me...

Optimise Array Slice Performance

Array slicing is functional, but is not optimised to the extent required by LibRapid. Additionally, while it does work with CUDA, there are no specific routines for it and hence device->host->device copies are required for every value, making it incredibly slow.

Optimisation + Simplification

Can the code in multiarray_operations.hpp at the top of the functions (to ensure everything is in the same place) be altered to use only malloc and memcpy?

Another Test Issue

This is some code:



# -*- coding: utf-8 -*-
import os
import platform
import shutil
import sys
import site
from packaging.version import LegacyVersion
from skbuild import setup
from skbuild.cmaker import get_cmake_version
from skbuild.exceptions import SKBuildError

# Copy OpenBLAS build if present in the root directory
if os.path.exists("openblas_install") and not os.path.exists(os.path.join("src", "librapid", "openblas_install")):
    shutil.copytree("openblas_install", os.path.join("src", "librapid", "openblas_install"))

# Remove _skbuild directory if it already exists. It can lead to issues
if os.path.exists("_skbuild"):
    shutil.rmtree("_skbuild")

# Remove the _librapid_python_cmake directory if it's present. This can cause more issues...
if os.path.exists("_librapid_python_cmake"):
    shutil.rmtree("_librapid_python_cmake")

# If the directory "src/librapid/blas" is empty and "src/librapid/openblas_install" is empty,
# run CMake to automatically detect BLAS before installing the Python library
if not os.path.exists(os.path.join("src", "librapid", "blas")) and not os.path.exists(os.path.join("src", "librapid", "openblas_install")):
    out = os.system("mkdir _librapid_python_cmake && cd _librapid_python_cmake && cmake ..")
    if out != 0:
        print("\nCMake failed to run correctly, so it is likely that BLAS will not be installed with LibRapid")

# Add CMake as a build requirement if cmake is not installed or is too low a version
setup_requires = []
install_requires = []

try:
    if LegacyVersion(get_cmake_version()) < LegacyVersion("3.10"):
        setup_requires.append('cmake')
        install_requires.append("cmake")
except SKBuildError:
    setup_requires.append('cmake')
    install_requires.append("cmake")


Array to String

Add functionality for converting arrays to strings for printing purposes.

Conan Support

Hey! This library looks really cool. For your awareness, I've raised a request for a recipe for this library to be created for the Conan Center Index. Conan is a C++ package manager that I think would help distribute your library with dependency management, especially with respect to pulling openblas in. If you're interested in supporting a conan recipe, I'm sure they would love to receive a pull request.

Vector to string

Allow the creation of a string representation of a vector object (ideally with formatting options)

CUDA support in Python Wheels

Have a look into cuda-toolkit and think about getting CUDA support in Python Wheels. This would require some sort of pip install librapid_cuda_11_5, for example, which would also need to be set up

Test Issue

This is a test issue



while True:
    print("This is going to be made into an image!")

print("Hehehe this won't run")


Python Iterator Memory Leak

Arrays do not get freed (or get over-freed) when iterating in python.

To replicate:



import librapid as lrp
x = lrp.Array([1, 2, 3])
for val in x:
    print(val)


cuBLAS Handles not Initialized in Python Library

When using cuBLAS functions in the python library, the cuBLAS handles are not being initialised and hence an error is thrown.

Example:



import librapid as lrp

x = lrp.Array(lrp.Extent(1000, 1000), "f32", "gpu")
res = x.dot(x)


Should be fixable by changing where the handle initialisation occurs.

Scalar operations assignment operator

need to have a closer look, but this feels wrong:



ScalarSum<LHS, RHS> &operator=(const ScalarSum<LHS, RHS> &other) { return *this; }


Fix support for Boolean arrays

Currently, Boolean arrays pose a few issues, the first of which is that each element is stored as a single bit, not a whole byte. This dramatically improves memory efficiency and increases performance; however, it also causes a lot of problems with algorithm design and interoperability with other array datatypes.

Currently, the major issue is that any form of logical operation will result in a Vc::Mask being returned from the SIMD packet instruction, whereas the result will be expecting a Vc::Vector type. This will, of course, produce a compile-time error.

Current steps to solve problem

  1. Logical operation factory -- define a logical operation which automatically returns a Boolean array, regardless of input type
  2. Rewrite Boolean array class to operate more nicely with Vc::Mask datatypes
    • This includes adding a loadFrom function which will accept a Vc::Mask to allow for direct loading from a Vc logical comparison
  3. Casting support to and from Vc::Mask datatypes?

Other ideas:

  • Can Boolean arrays be stored directly as arrays of Vc::Mask?

IN THE MEANTIME, SUPPORT FOR BOOLEAN ARRAYS IN THE PYTHON LIBRARY WILL BE DISABLED

Assigning non-equal length array to boolean array

Once again, more problems with the boolean array.

This must be fixed by #102

Reproducible example:



lrc::Array<int, lrc::device::CPU> arr(lrc::Extent(10));
arr = lrc::Array<int>(lrc::Extent(100));
arr.fill(123);
fmt::print("Array: {}\n", arr);


Test Issue



print("This is a test issue")
# Does this get formatted?
for i in range(10):
    print("Hello, World!")


Error copying from GPU to CPU

For some reason, the following code errors after 4 iterations. Most likely a missing copy somewhere, but not quite sure.



lrc::Array<float, lrc::device::GPU> gpuArray(lrc::Extent(3, 4));
lrc::Array<float, lrc::device::CPU> cpuArray(lrc::Extent(3, 4));
for (lrc::i32 i = 0; i < 10000; i++) {
    fmt::print("Doing thing {}\n", i);
    cpuArray = (gpuArray * 2);
}

