Code Monkey home page Code Monkey logo

frank's Introduction

FRANK (Factorization of RANK structured matrices)

β€œAny intelligent fool can make things bigger and more complex. It takes a touch of genius - and a lot of courage to move in the opposite direction.” - Albert Einstein

Library for hierarchical low-rank matrix factorizations.

How to build and make

mkdir build
cd build
cmake -DUSE_MKL=ON ..
make

binaries will be compiled into bin. C++17 is required in order to build FRANK. The build process will try to detect or download the following dependencies:

  • nlohmann_json
  • YOMM2

Build flags

-DUSE_MKL

If this is specified, FRANK will try to use the Intel MKL library on your system. Otherwise it will try to detect the default BLAS and LAPACK libraries installed on your system.

-DBUILD_DOCS

If this is set, the documentation files are created during the build process. Requires Doxygen installed on the system. Default to OFF.

-DBUILD_TESTS

If this is set, test files will be included in the build process. Default to ON.

-DBUILD_EXAMPLES

If this is set, examples files will be included in the build process. Default to ON.

frank's People

Contributors

derpda avatar wawando avatar rioyokota avatar v0dro avatar delta-leader avatar

Stargazers

 avatar Shumpei Shiina avatar

Watchers

 avatar  avatar YOSHIFUJI Naoki avatar

Forkers

v0dro wawando

frank's Issues

Break up pre_scheduler.cpp into smaller files.

Currently the pre_sceduler.cpp is a massive file with collections of all sorts of functions and classes relating to LAPACK, BLAS, kernel generation and even generation and destruction of tasks. Breaking this up into smaller files should simply things further.

Pure python serial plotter for HMat

Write plotting tool (matplotlib) in pure Python. This will not consider at all what happens in the C++ code.
The Python code runs in serial, but should stick as close to the design of the C++ code as possible. This way, the plotting tool should be adaptable to take C++ objects without too much hassle.

Implement universal admissibility interface.

A new admissibility interface should be defined that delegates the calculation of admissiblity to the user from hicma. This interface should be able to work with any sort of co-ordinate distribution depending on the user's library. Some users will have different structures to define their coordinates. For example an array of structures with x,y,z. Others may define a structure of arrays is x[N], y[N], z[N]. Other may define it as a single array X[3*N]. Additionally the admis condition can be calculated in FORTRAN or any other C++ - interfacable language.

Current proposal:
Letting the user define a function that gives partcular co-ordinates for a given position similar to the function that returns the data for the particular function that you have now? So basically something like bool get_admis(box1_row_start, box1_row_end, box1_col_start, box1_col_end, box2_row_start, box2_row_end, box2_col_start, box2_col_end, admis_value) .

This function signature will encapsulate the box co-ordinates within which you want to compute the admissibility. Since you want to do this in fortran it can use only primitive types to make it easier.

Something like a stars-H interface can define its own function within which it will refer to its own co-ordinate system calling scheme and then pass the function to the hicma matrix constructor.

Cleanup

  • Remove unused Hierarchical constructors from textfile
  • Remove unused unique_id parameter from Dense class (done in template branch)
  • Add const modified to non modified variables
  • Use constexpr modifier for compile time constants
  • Compatibility with newer C++ compiler (gcc-11) (done in Thomas' branch)
  • Move OMM declarations from extension_headers into individual header file (looks clean but need to modify separate file that may not be intuitive when adding a new function)
  • Remove GPU example files
  • Remove TODO.md?
  • Move CONTRIBUTING.md to other place?
  • Rename and refactor test files to <function_name>_test.cpp?

Finish Multi-Method GEMM operations with transpose parameter

Complete the implementation of the following gemm functions

  • gemm(LowRank, Dense, LowRank)
  • gemm(Dense, LowRank, LowRank)
  • gemm(LowRank, LowRank, LowRank)
  • gemm(Hierarchical, LowRank, LowRank)
  • gemm(LowRank, Hierarchical, LowRank)
  • gemm(LowRank, LowRank, Hierarchical)

Configure starpu MPI with cmake.

Currently cmake adds the link flags only for the non-MPI shared object of starPU. If using MPI, also add the flag for starPU MPI (and compile starPU with flags for enabling MPI).

Create matrix visualizations over MPI.

The current functionality using XML+python is limited to a single node and small matrices. The python script also has a long processing time due to storing all the data in the XML file.

Create a way to visualize a very large H-matrix scattered over multiple nodes efficiently.

Float32 support

Template the following classes (with hardcoded "double" in the rest of the code):

  • Dense
  • LowRank
  • Hierarchical
  • MatrixProxy, Matrix? (should not be necessary)
  • IndexRange (split_like function)
  • MatrixInitializer (no template necessary, can fill any type of matrix)
  • MatrixInitializerFile (no template necessary, reads doubles and can fill any type of matrix)
  • MatrixInitializerBlock (template for the datatype of the matrix to be copied from, can fill any type of matrix)
  • MatrixInitializerFunction(template for the matrix kernel, can only fill double matrix from double function
  • MatrixKernel (removed for now)

Template functions so that they can support both double and float:

  • Initialization:
  • Functions (kernels for filling matrices - currently limited to double params)
  • Find a better solution for filling from functions/kernels
  • Misc:
  • split (Hierarchical; note that his calls an omm that does not work for float yet)
  • get_dim
  • misc
  • morton_index (not templated because currently not in use)
  • norm
  • resize
  • transpose
  • randomized:
  • rid
  • rsvd
  • operations:
  • addition (plus operator finished - extend to OMM?)
  • multiplication (what happens in case of float*double?)
  • subtraction

Extend classes to "float" - OMM:

  • Dense - Constructor from MatrixProxy (If MatrixProxy is not of the same Type a bad_cast exception is thrown instead of omm_error and aborting; EDIT: found a solution that works with YOMM)
  • Dense - Constructor from Matrix (no float types for Hierarchical and LowRank yet)
  • Dense - implicit type conversion? (currently only explicit conversion allowed)
  • Matrix Proxy (both clone and move_clone)

Extend functions to "float" - OMM:

  • Misc:
  • get_dim
  • resize
  • norm
  • BLAS (What to do with double parameters for BLAS calls?):
  • gemm (alpha, beta remain double, template type is deduced depending on the type of Matrix passed in)
  • trmm
  • trsm
  • LAPACK:
  • gepq3
  • geqrt
  • id
  • larfb
  • lartms
  • qr
  • svd (template specialization for float and double)
  • tpmqrt
  • tpqrt
  • Util:
  • experiment_setup
  • geometry_file (not templated since parameters remain double)
  • get_memory_usage
  • global_key_value (no template needed)
  • l2_error
  • omm_error_handler (no template needed)
  • print
  • timer (no template needed)
  • Arithmetic:
  • subtraction
  • addition
  • merge_col_basis
  • merge_row_basis

Write test functions:

  • Initialization:
  • ClusterTree
  • IndexRange
  • MatrixFunctions
  • MatrixInitializer (except coordinate based admissibility for Kernel)
  • Dense:
  • Standard Constructor, Construct from dimensions and Copy-Constructor
  • Construct from Kernel
  • Construct from other Matrix types
  • Construct from MatrixProxy
  • Construct from File?
  • Assignment (currently only allowed for identical datatypes)
  • split function (currently only for double)
  • LowRank:
  • Standard Constructor, Construct from 3 matrices
  • Construct from Dense
  • Hierarchical:
  • Constructors (except from file)
  • operations:
  • addition (currently only for dense)
  • subtraction?
  • scalar multiply?
  • BLAS:
  • gemm
  • trmm
  • trsm
  • LAPACK:
  • gepq3
  • geqrt
  • id
  • larfb
  • lartms
  • qr
  • svd
  • tpmqrt
  • tpqrt
  • Misc:
  • get_dim
  • misc
  • morton_index
  • norm
  • resize
  • transpose
  • randomized:
  • rid
  • rsvd
  • util:
  • experiment_setup
  • geometry_file (not templated since parameters remain double)
  • get_memory_usage
  • global_key_value (no template needed)
  • l2_error
  • omm_error_handler (no template needed)
  • print
  • timer (no template needed)

Requires addition of the corresponding float functions from BLAS/LAPACK:

  • BLAS - float
  • LAPACK - float
  • Define operations between double and float?

Notes:
Prioritize getting a working single-precision implementation with a hierarchical LU test-case. Other test cases can be added later and the limitations described below should only be addressed if there is an actual need for their functionality.

Initialization from Kernel-Functions is still considerably restricted. The following limitations apply:

  • Geometry information can only be
  • Random number generators can only be
  • Seed for the random number generators can not be set
  • Admissibility can not be passed as a function
  • Only pre-defined kernel-functions are available

There are also some shortcomings in arithmetics:

  • Addition operator += is defined for all matrix types while operator + is only defined for dense
  • Subtraction operator -= is not defined
  • Multiplication (Scalar) is only defined for double (restrict? allow for mixed precision?)
  • Multiplication (Scalar) operator * is not defined

Investigate how a MatrixProxy is conerted into a Matrix (i.e. is that template safe?)
Automatic template deduction does not work for return type, in some cases old C style might be better (include return type as argument)
Shall we change the return value of functions that return a double to float?

Design discussion for ideal class structure.

This thread is meant to be a discussion on the best approaches for design decisions in hicma and is meant to serve as a log for why certain decisions were taken, their pros and cons and what other approaches exist to best address the trade-off of usability vs. performance.

StarPU installation cannot find hwloc.

Ida sensei's machine gets this error:

configure: error: libhwloc or pkg-config was not found on your system. If the target machine is hyperthreaded the performance may be impacted a lot.  It is strongly recommended to install libhwloc and pkg-config. However, if you really want to use StarPU without enabling libhwloc, please restart configure by specifying the option '--without-hwloc'.
gmake[2]: *** [CMakeFiles/starpu.dir/build.make:109: starpu-prefix/src/starpu-stamp/starpu-configure] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:68: CMakeFiles/starpu.dir/all] Error 2
gmake: *** [Makefile:84: all] Error 2
CMake Error at cmake/find_or_download.cmake:80 (message):
  Build of dependency starpu failed: 2.
Call Stack (most recent call first):
  CMakeLists.txt:53 (find_or_download)
-- Configuring incomplete, errors occurred!
See also "/home2/ida/hicma/build/CMakeFiles/CMakeOutput.log".
[

Employ inner blocking within dense kernel of Tiled Householder QR

Currently, the Blocked Householder BLR-QR algorithm has O(n2) complexity, whereas the more fine grained Tiled Householder BLR-QR has O(n2.5) complexity. The fact that no inner blocking is used within dense kernel of Tiled Householder BLR-QR may be responsible for the extra O(n0.5) complexity. This needs to be investigated and implemented to improve the performance of that particular method.

Make timer class multi-threading friendly.

The timer class currently performs non-atomic operations, which leads to issues when using multi-threading within the library, for example with openMP tasks. Making the operations atomic or somehow building per-thread timing constructs should mitigate this problem.

Here's one of the errors I get:

blr-lu-exp3d: /home/acb10922qh/gitrepos/useful-tsubame-benchmarks/hmatrix-benchmarks/hicma-yokotalab/src/util/timer.cpp:73: void hicma::timing::Timer::start(): Assertion `!running' failed.

Distributed Memory Implementation

  • Implement distributed memory parallel version of the kernels using MPI and/or runtime systems e.g. StarPU, Parsec, etc.
  • Employ asynchronous communication whenever possible

Complete Documentation

Add missing documentation for functions inside the following header files

  • include/hicma/util/experiment_setup.h
  • include/hicma/util/geometry_file.h
  • include/hicma/util/get_memory_usage.h
  • include/hicma/util/initialize.h
  • include/hicma/util/l2_error.h
  • include/hicma/util/omm_error_handler.h
  • include/hicma/util/print.h
  • include/hicma/util/timer.h
  • include/hicma/operations/LAPACK.h
  • include/hicma/operations/misc.h
  • include/hicma/classes/initialization_helpers/matrix_initializer_file.h
  • README.md: How to make and build

Implement Cholesky factorization for BLR matrix.

Implementing Cholesky factorization should open the doors for benchmarking with currently latest work such as LORAPO which use Cholesky of an SPD matrix for reporting performing of a BLR factorization. Besides, this routine also takes about half the operations of LU when working with the right data.

GEMM(H, H, LR) Improvement

Current gemm(H, H, LR) implementation converts C to Dense and recurse to gemm(H, H, D).

define_method(
  void, gemm_omm,
  (
    const Hierarchical& A, const Hierarchical& B, LowRank& C,
    const double alpha, const double beta,
    const bool TransA, const bool TransB
  )
) {
  Dense CD(C);
  gemm(A, B, CD, alpha, beta, TransA, TransB);
  C = LowRank(CD, C.eps); //or C = LowRank(CD, rank);
}

However, converting C to Hierarchical (e.g. use split function) might be better since this avoids unnecessary multiplications with Dense block.

  Hierarchical CH(C);
  gemm(A, B, CH, alpha, beta, TransA, TransB);
  C = LowRank(CH, C.eps); //or C = LowRank(CH, rank);

For this to work, additional LowRank constructor from a Hierarchical object is needed. Also when converting LowRank C to Hierarchical, make sure that C.eps is passed down to the resulting LowRank blocks

Initializing starPU leads to drop in performance without calling execute_scedule.

A performance regression can be observed when the initilalize_starpu() function is called within the Runtime class constructor during application startup even if we are using MKL threading without calling start_schedule() or execute_schedule(). A fix would involve using a switch to called starpu initialization and shutdown functions.

Here's some perf results for N=16384, nleaf=1024, rank=256 and admis=0 and 1 on the latest dev branch:

This is the performance for BLR LU with MKL multi-threading with the current behaviour:
dev-mkl-threading.pdf

This is the performance after I manually comment out the initialize_starpu() function:
blr-lu-wo-starpu-breakdown.pdf

This is the program used for running the benchmarks:

#include "hicma/hicma.h"

#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>
#include <iostream>
#include <fstream>

using namespace hicma;
using namespace std;

int main([[maybe_unused]] int argc, char** argv) {
  timing::start("Overall");
  hicma::initialize();


  int64_t nleaf = atoi(argv[1]);
  int64_t rank = atoi(argv[2]);
  int64_t N = atoi(argv[3]);
  double admis = atof(argv[4]);
  int64_t basis = 0;
  int64_t nblocks = N / nleaf;
  assert(basis == NORMAL_BASIS || basis == SHARED_BASIS);

  /* Default parameters for statistics */
  double beta = 0.1;
  double nu = 0.5;//in matern, nu=0.5 exp (half smooth), nu=inf sqexp (inifinetly smooth)
  double noise = 1.e-1;
  double sigma = 1.0;

  // starsh::exp_kernel_prepare(N, beta, nu, noise, sigma, 3);

  std::vector<std::vector<double>> randx{get_sorted_random_vector(N)};
  print("Being compression");
  timing::start("Hierarchical compression");
#ifdef USE_STARPU
  start_schedule();
#endif // USE_STARPU

  int ndim = 3;
  Hierarchical A(N, nleaf, nblocks, beta, nu, noise, sigma, ndim,
                  admis, rank, basis); // works with square matrix only (prototype).

#ifdef USE_STARPU
  execute_schedule();
#endif // USE_STARPU
  double comp_time = timing::stop("Hierarchical compression");

  printXML(A);
  
  Hierarchical A_c = A;

  timing::start("LU decomposition");
  Hierarchical L, U;
#ifdef USE_STARPU
  start_schedule();
#endif // USE_STARPU

  std::tie(L, U) = getrf(A);

#ifdef USE_STARPU
  execute_schedule();
#endif // USE_STARPU
  double fact_time = timing::stop("LU decomposition");

  print("Begin multiplication of L and U.");
  timing::start("BLR LU gemm");
#ifdef USE_STARPU
  start_schedule();
#endif // USE_STARPU
  Hierarchical L1(identity, randx, N, N, rank, nleaf, admis,
                 nblocks, nblocks, basis);
  Hierarchical U1(identity, randx, N, N, rank, nleaf, admis,
                 nblocks, nblocks, basis);
  for (int i = 0; i < nblocks; ++i) {
    for (int j = 0; j < nblocks; ++j) {
      if (i == j ) {
        L1(i, j) = L(i,j);
        U1(i, j) = U(i,j);
      }
      else if (j < i) {             // lower triangle
        L1(i, j) = L(i,j);
      }
      else {                    // upper triangle
        U1(i, j) = U(i,j);
      }
      
    }
  }
#ifdef USE_STARPU
  execute_schedule();

  start_schedule();
#endif // USE_STARPU
  auto C = gemm(L1, U1, 1, 0);
#ifdef USE_STARPU
  execute_schedule();
#endif // USE_STARPU
  
  timing::start("BLR LU gemm");
  
  print("Calculate l2 error.");
  double error = l2_error(A_c, C);

  std::ofstream file;
  file.open("blr-lu-exp3d-mkl-threads-sqrt-admis.csv", std::ios::app | std::ios::out);
  file << N << "," << nleaf << "," << rank << "," << admis << ","  << error
       << "," << fact_time << "," << comp_time << ","
#ifdef USE_STARPU
       << std::getenv("STARPU_NCPU")
#else
       << std::getenv("MKL_NUM_THREADS")
#endif // USE_STARPU
       <<  std::endl;
  file.close();

  return 0;
}

Clean up current code

  • Dense Matrix: Reset data to vector
  • Remove Data Handler
  • Remove Tracking (only needed for shared bases)
  • Add some tests (randomized ID/SVD)
  • set up a file structure for tests (test/examples)

Cannot find BOOST on new lab server.

CMake Error at /mnt/nfs/packages/x86_64/cmake/cmake-3.17.0/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find Boost (missing: Boost_INCLUDE_DIR) (Required is at least
  version "1.53")
Call Stack (most recent call first):
  /mnt/nfs/packages/x86_64/cmake/cmake-3.17.0/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:445 (_FPHSA_FAILURE_MESSAGE)
  /mnt/nfs/packages/x86_64/cmake/cmake-3.17.0/share/cmake-3.17/Modules/FindBoost.cmake:2145 (find_package_handle_standard_args)
  CMakeLists.txt:11 (find_package)

Fast Formatted LR addition does not work when rank equals to 1

Reproduce error by running example file bin/lr_add 32 1.
Output:

--- LR Add Default ----------------------------
munmap_chunk(): invalid pointer
Aborted (core dumped)

Investigate formatted_addition function in src/operations/arithmetic/addition.cpp

Remove Dependency to StarsH

StarsH leads to build failed on Github server. Since this may also lead to other problems in the future, this dependency should be removed.

OMP_PLACES malfunctions on ABCI.

The OMP_PLACES ENV variable, when set to cores is supposed to assign each new thread to a different physical core successively from 0-40 (in case of ABCI). While this setting works in other programs, it does not for some reason work in our cod (my branch is https://github.com/rioyokotalab/hicma/tree/openmp-tasks).

In order to reproduce this error, modify the compression algorithm to use OpenMP tasks like so:

Hierarchical::Hierarchical(
                           int64_t N, int64_t nleaf, int64_t nblocks, double beta,
                           double nu, double noise, double sigma, int ndim,
                           double admis, int64_t rank, int basis_type,
                           int64_t row_start, int64_t col_start
) {

  // make this shared. The data within this object (without use of nested bases)
  // applies to the entire matrix, for example the admis or rank. The actual
  // that it generated is produced inside get_dense_representation, and since that
  // data is alway local to a single task, the allocation of that data is local to
  // that task too.
  MatrixInitializerStarshExponential initer(N, beta, nu, noise, sigma, ndim,
                                            admis, rank, basis_type, GEOMETRY_BASED_ADMIS);
  ClusterTree cluster_tree(
                           {row_start, N}, {col_start, N}, nblocks, nblocks, nleaf
                           );
  dim = cluster_tree.block_dim;
  data.resize(dim[0]*dim[1]);
#pragma omp parallel proc_bind(close)
  {
#pragma omp single
    {
      for (const ClusterTree& child : cluster_tree) {
#pragma omp task shared(cluster_tree, initer) firstprivate(child)
        {
          std::cout << "exec: " << omp_get_thread_num() << "core: " << sched_getcpu()
                    << " p: " << omp_get_place_num() << std::endl;
          if (initer.is_admissible(child)) {
            (*this)[child] = initer.get_compressed_representation(child);
          } else {
            if (child.is_leaf()) {
              (*this)[child] = initer.get_dense_representation(child);
            } else {
              (*this)[child] = Hierarchical(child, initer);
            }
          }
        }
      }
    }
  }
}

Then set OMP_PLACES=cores and OMP_PROC_BIND=close and run this code with many threads. You will see that the binding of the threads only takes places on CPUs 0 and 40. So only 2 CPUs are being kept busy. Removing the variables makes it work again, but there is no control on the thread affinity which is very important for NUMA performance.

Complete GEMM Implementations and Tests

Complete the implementation of the following gemm specializations

  • gemm(LowRank, Dense, LowRank)
  • gemm(Dense, LowRank, LowRank)
  • gemm(Dense, LowRank, Hierarchical)
  • gemm(LowRank, LowRank, LowRank)
  • gemm(Hierarchical, LowRank, LowRank)
  • gemm(LowRank, Hierarchical, LowRank)
  • gemm(LowRank, LowRank, Hierarchical)

Tests for following gemm specializations

  • gemm(Dense,Dense,Dense)
  • gemm(Dense,Dense,LowRank)
  • gemm(Dense,Dense,Hierarchical)
  • gemm(Dense,LowRank,Dense)
  • gemm(Dense,LowRank,LowRank)
  • gemm(Dense,LowRank,Hierarchical)
  • gemm(Dense,Hierarchical,Dense)
  • gemm(Dense,Hierarchical,LowRank)
  • gemm(Dense,Hierarchical,Hierarchical)
  • gemm(LowRank,Dense,Dense)
  • gemm(LowRank,Dense,LowRank)
  • gemm(LowRank,Dense,Hierarchical)
  • gemm(LowRank,LowRank,Dense)
  • gemm(LowRank,LowRank,LowRank)
  • gemm(LowRank,LowRank,Hierarchical)
  • gemm(LowRank,Hierarchical,Dense)
  • gemm(LowRank,Hierarchical,LowRank)
  • gemm(LowRank,Hierarchical,Hierarchical)
  • gemm(Hierarchical,Dense,Dense)
  • gemm(Hierarchical,Dense,LowRank)
  • gemm(Hierarchical,Dense,Hierarchical)
  • gemm(Hierarchical,LowRank,Dense)
  • gemm(Hierarchical,LowRank,LowRank)
  • gemm(Hierarchical,LowRank,Hierarchical)
  • gemm(Hierarchical,Hierarchical,Dense)
  • gemm(Hierarchical,Hierarchical,LowRank)
  • gemm(Hierarchical,Hierarchical,Hierarchical)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.