rioyokotalab / frank Goto Github PK

2.0 3.0 2.0 4.3 MB

License: BSD 3-Clause "New" or "Revised" License

C++ 85.12% Python 11.07% CMake 3.47% C 0.34%

frank's Introduction

FRANK (Factorization of RANK structured matrices)

“Any intelligent fool can make things bigger and more complex. It takes a touch of genius - and a lot of courage to move in the opposite direction.” - Albert Einstein

Library for hierarchical low-rank matrix factorizations.

How to build and make

mkdir build
cd build
cmake -DUSE_MKL=ON ..
make

binaries will be compiled into bin. C++17 is required in order to build FRANK. The build process will try to detect or download the following dependencies:

nlohmann_json
YOMM2

Build flags

-DUSE_MKL

If this is specified, FRANK will try to use the Intel MKL library on your system. Otherwise it will try to detect the default BLAS and LAPACK libraries installed on your system.

-DBUILD_DOCS

If this is set, the documentation files are created during the build process. Requires Doxygen installed on the system. Default to OFF.

-DBUILD_TESTS

If this is set, test files will be included in the build process. Default to ON.

-DBUILD_EXAMPLES

If this is set, examples files will be included in the build process. Default to ON.

frank's People

Contributors

Stargazers

Watchers

Forkers

v0dro wawando

frank's Issues

Optimize Initialize Function

Remove unused Runtime class

Break up pre_scheduler.cpp into smaller files.

Currently the pre_sceduler.cpp is a massive file with collections of all sorts of functions and classes relating to LAPACK, BLAS, kernel generation and even generation and destruction of tasks. Breaking this up into smaller files should simply things further.

Pure python serial plotter for HMat

Write plotting tool (matplotlib) in pure Python. This will not consider at all what happens in the C++ code.
The Python code runs in serial, but should stick as close to the design of the C++ code as possible. This way, the plotting tool should be adaptable to take C++ objects without too much hassle.

Implement universal admissibility interface.

A new admissibility interface should be defined that delegates the calculation of admissiblity to the user from hicma. This interface should be able to work with any sort of co-ordinate distribution depending on the user's library. Some users will have different structures to define their coordinates. For example an array of structures with x,y,z. Others may define a structure of arrays is x[N], y[N], z[N]. Other may define it as a single array X[3*N]. Additionally the admis condition can be calculated in FORTRAN or any other C++ - interfacable language.

Current proposal:
Letting the user define a function that gives partcular co-ordinates for a given position similar to the function that returns the data for the particular function that you have now? So basically something like bool get_admis(box1_row_start, box1_row_end, box1_col_start, box1_col_end, box2_row_start, box2_row_end, box2_col_start, box2_col_end, admis_value) .

This function signature will encapsulate the box co-ordinates within which you want to compute the admissibility. Since you want to do this in fortran it can use only primitive types to make it easier.

Something like a stars-H interface can define its own function within which it will refer to its own co-ordinate system calling scheme and then pass the function to the hicma matrix constructor.

Use Column Major Ordering Within Dense Matrix

Compare Parallel Performance of Compressed Representations

Compare {BLR, HSS, H-matrix, H2-matrix}, vary {eps (rank), N, procs}
Check load balance, breakdown, vary {block size, partition size}

Implement Fixed Accuracy (Variable Rank) Compression Strategy

Adopt variable-rank compression scheme
Adjust matrix operations to support variable-rank LowRank instances

Implement block LU with MPI

Cleanup

Finish Multi-Method GEMM operations with transpose parameter

Complete the implementation of the following gemm functions

gemm(LowRank, Dense, LowRank)
gemm(Dense, LowRank, LowRank)
gemm(LowRank, LowRank, LowRank)
gemm(Hierarchical, LowRank, LowRank)
gemm(LowRank, Hierarchical, LowRank)
gemm(LowRank, LowRank, Hierarchical)

GPU Implementation

Implement GPU version of all kernels

Configure starpu MPI with cmake.

Currently cmake adds the link flags only for the non-MPI shared object of starPU. If using MPI, also add the flag for starPU MPI (and compile starPU with flags for enabling MPI).

Fix LAPACK operations for non square matrices

SVD
QR
any others?

Create matrix visualizations over MPI.

The current functionality using XML+python is limited to a single node and small matrices. The python script also has a long processing time due to storing all the data in the XML file.

Create a way to visualize a very large H-matrix scattered over multiple nodes efficiently.

Float32 support

Template the following classes (with hardcoded "double" in the rest of the code):

Template functions so that they can support both double and float:

Extend classes to "float" - OMM:

Dense - Constructor from MatrixProxy (If MatrixProxy is not of the same Type a bad_cast exception is thrown instead of omm_error and aborting; EDIT: found a solution that works with YOMM)
Dense - Constructor from Matrix (no float types for Hierarchical and LowRank yet)
Dense - implicit type conversion? (currently only explicit conversion allowed)
Matrix Proxy (both clone and move_clone)

Extend functions to "float" - OMM:

Write test functions:

Requires addition of the corresponding float functions from BLAS/LAPACK:

BLAS - float
LAPACK - float
Define operations between double and float?

Notes:
Prioritize getting a working single-precision implementation with a hierarchical LU test-case. Other test cases can be added later and the limitations described below should only be addressed if there is an actual need for their functionality.

Initialization from Kernel-Functions is still considerably restricted. The following limitations apply:

Geometry information can only be
Random number generators can only be
Seed for the random number generators can not be set
Admissibility can not be passed as a function
Only pre-defined kernel-functions are available

There are also some shortcomings in arithmetics:

Addition operator += is defined for all matrix types while operator + is only defined for dense
Subtraction operator -= is not defined
Multiplication (Scalar) is only defined for double (restrict? allow for mixed precision?)
Multiplication (Scalar) operator * is not defined

Investigate how a MatrixProxy is conerted into a Matrix (i.e. is that template safe?)
Automatic template deduction does not work for return type, in some cases old C style might be better (include return type as argument)
Shall we change the return value of functions that return a double to float?

Design discussion for ideal class structure.

This thread is meant to be a discussion on the best approaches for design decisions in hicma and is meant to serve as a log for why certain decisions were taken, their pros and cons and what other approaches exist to best address the trade-off of usability vs. performance.

StarPU installation cannot find hwloc.

Ida sensei's machine gets this error:

configure: error: libhwloc or pkg-config was not found on your system. If the target machine is hyperthreaded the performance may be impacted a lot.  It is strongly recommended to install libhwloc and pkg-config. However, if you really want to use StarPU without enabling libhwloc, please restart configure by specifying the option '--without-hwloc'.
gmake[2]: *** [CMakeFiles/starpu.dir/build.make:109: starpu-prefix/src/starpu-stamp/starpu-configure] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:68: CMakeFiles/starpu.dir/all] Error 2
gmake: *** [Makefile:84: all] Error 2
CMake Error at cmake/find_or_download.cmake:80 (message):
  Build of dependency starpu failed: 2.
Call Stack (most recent call first):
  CMakeLists.txt:53 (find_or_download)
-- Configuring incomplete, errors occurred!
See also "/home2/ida/hicma/build/CMakeFiles/CMakeOutput.log".
[

Employ inner blocking within dense kernel of Tiled Householder QR

Currently, the Blocked Householder BLR-QR algorithm has O(n²) complexity, whereas the more fine grained Tiled Householder BLR-QR has O(n^2.5) complexity. The fact that no inner blocking is used within dense kernel of Tiled Householder BLR-QR may be responsible for the extra O(n^0.5) complexity. This needs to be investigated and implemented to improve the performance of that particular method.

Make timer class multi-threading friendly.

The timer class currently performs non-atomic operations, which leads to issues when using multi-threading within the library, for example with openMP tasks. Making the operations atomic or somehow building per-thread timing constructs should mitigate this problem.

Here's one of the errors I get:

blr-lu-exp3d: /home/acb10922qh/gitrepos/useful-tsubame-benchmarks/hmatrix-benchmarks/hicma-yokotalab/src/util/timer.cpp:73: void hicma::timing::Timer::start(): Assertion `!running' failed.

Distributed Memory Implementation

Implement distributed memory parallel version of the kernels using MPI and/or runtime systems e.g. StarPU, Parsec, etc.
Employ asynchronous communication whenever possible

Complete Documentation

Add missing documentation for functions inside the following header files

Compare against other libraries

Compare against other hierarchical low-rank approximation libraries:

Hatrix
The ones listed in https://github.com/gchavez2/awesome_hierarchical_matrices

Implement Cholesky factorization for BLR matrix.

Implementing Cholesky factorization should open the doors for benchmarking with currently latest work such as LORAPO which use Cholesky of an SPD matrix for reporting performing of a BLR factorization. Besides, this routine also takes about half the operations of LU when working with the right data.

GEMM(H, H, LR) Improvement

Current gemm(H, H, LR) implementation converts C to Dense and recurse to gemm(H, H, D).

define_method(
  void, gemm_omm,
  (
    const Hierarchical& A, const Hierarchical& B, LowRank& C,
    const double alpha, const double beta,
    const bool TransA, const bool TransB
  )
) {
  Dense CD(C);
  gemm(A, B, CD, alpha, beta, TransA, TransB);
  C = LowRank(CD, C.eps); //or C = LowRank(CD, rank);
}

However, converting C to Hierarchical (e.g. use split function) might be better since this avoids unnecessary multiplications with Dense block.

  Hierarchical CH(C);
  gemm(A, B, CH, alpha, beta, TransA, TransB);
  C = LowRank(CH, C.eps); //or C = LowRank(CH, rank);

For this to work, additional LowRank constructor from a Hierarchical object is needed. Also when converting LowRank C to Hierarchical, make sure that C.eps is passed down to the resulting LowRank blocks

Initializing starPU leads to drop in performance without calling execute_scedule.

A performance regression can be observed when the initilalize_starpu() function is called within the Runtime class constructor during application startup even if we are using MKL threading without calling start_schedule() or execute_schedule(). A fix would involve using a switch to called starpu initialization and shutdown functions.

Here's some perf results for N=16384, nleaf=1024, rank=256 and admis=0 and 1 on the latest dev branch:

This is the performance for BLR LU with MKL multi-threading with the current behaviour:
dev-mkl-threading.pdf

This is the performance after I manually comment out the initialize_starpu() function:
blr-lu-wo-starpu-breakdown.pdf

This is the program used for running the benchmarks:

#include "hicma/hicma.h"

#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>
#include <iostream>
#include <fstream>

using namespace hicma;
using namespace std;

int main([[maybe_unused]] int argc, char** argv) {
  timing::start("Overall");
  hicma::initialize();


  int64_t nleaf = atoi(argv[1]);
  int64_t rank = atoi(argv[2]);
  int64_t N = atoi(argv[3]);
  double admis = atof(argv[4]);
  int64_t basis = 0;
  int64_t nblocks = N / nleaf;
  assert(basis == NORMAL_BASIS || basis == SHARED_BASIS);

  /* Default parameters for statistics */
  double beta = 0.1;
  double nu = 0.5;//in matern, nu=0.5 exp (half smooth), nu=inf sqexp (inifinetly smooth)
  double noise = 1.e-1;
  double sigma = 1.0;

  // starsh::exp_kernel_prepare(N, beta, nu, noise, sigma, 3);

  std::vector<std::vector<double>> randx{get_sorted_random_vector(N)};
  print("Being compression");
  timing::start("Hierarchical compression");
#ifdef USE_STARPU
  start_schedule();
#endif // USE_STARPU

  int ndim = 3;
  Hierarchical A(N, nleaf, nblocks, beta, nu, noise, sigma, ndim,
                  admis, rank, basis); // works with square matrix only (prototype).

#ifdef USE_STARPU
  execute_schedule();
#endif // USE_STARPU
  double comp_time = timing::stop("Hierarchical compression");

  printXML(A);
  
  Hierarchical A_c = A;

  timing::start("LU decomposition");
  Hierarchical L, U;
#ifdef USE_STARPU
  start_schedule();
#endif // USE_STARPU

  std::tie(L, U) = getrf(A);

#ifdef USE_STARPU
  execute_schedule();
#endif // USE_STARPU
  double fact_time = timing::stop("LU decomposition");

  print("Begin multiplication of L and U.");
  timing::start("BLR LU gemm");
#ifdef USE_STARPU
  start_schedule();
#endif // USE_STARPU
  Hierarchical L1(identity, randx, N, N, rank, nleaf, admis,
                 nblocks, nblocks, basis);
  Hierarchical U1(identity, randx, N, N, rank, nleaf, admis,
                 nblocks, nblocks, basis);
  for (int i = 0; i < nblocks; ++i) {
    for (int j = 0; j < nblocks; ++j) {
      if (i == j ) {
        L1(i, j) = L(i,j);
        U1(i, j) = U(i,j);
      }
      else if (j < i) {             // lower triangle
        L1(i, j) = L(i,j);
      }
      else {                    // upper triangle
        U1(i, j) = U(i,j);
      }
      
    }
  }
#ifdef USE_STARPU
  execute_schedule();

  start_schedule();
#endif // USE_STARPU
  auto C = gemm(L1, U1, 1, 0);
#ifdef USE_STARPU
  execute_schedule();
#endif // USE_STARPU
  
  timing::start("BLR LU gemm");
  
  print("Calculate l2 error.");
  double error = l2_error(A_c, C);

  std::ofstream file;
  file.open("blr-lu-exp3d-mkl-threads-sqrt-admis.csv", std::ios::app | std::ios::out);
  file << N << "," << nleaf << "," << rank << "," << admis << ","  << error
       << "," << fact_time << "," << comp_time << ","
#ifdef USE_STARPU
       << std::getenv("STARPU_NCPU")
#else
       << std::getenv("MKL_NUM_THREADS")
#endif // USE_STARPU
       <<  std::endl;
  file.close();

  return 0;
}

Adjust Naming Conventions etc. to Conform New Library Name: FRANK

Change namespace, directories, etc. to reflect new library name

Fix bug in rounded low-rank addition when rankA+rankB > block_size

Currently the orthogonality_preserving_addition does not work when the sum of the operands' rank are bigger than the block_size. This more general case needs to be considered when assigning dimensions inside the function.

Clean up current code

Dense Matrix: Reset data to vector
Remove Data Handler
Remove Tracking (only needed for shared bases)
Add some tests (randomized ID/SVD)
set up a file structure for tests (test/examples)

Simplify OpenMP Linking in CMake

Cannot find BOOST on new lab server.

CMake Error at /mnt/nfs/packages/x86_64/cmake/cmake-3.17.0/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message):
  Could NOT find Boost (missing: Boost_INCLUDE_DIR) (Required is at least
  version "1.53")
Call Stack (most recent call first):
  /mnt/nfs/packages/x86_64/cmake/cmake-3.17.0/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:445 (_FPHSA_FAILURE_MESSAGE)
  /mnt/nfs/packages/x86_64/cmake/cmake-3.17.0/share/cmake-3.17/Modules/FindBoost.cmake:2145 (find_package_handle_standard_args)
  CMakeLists.txt:11 (find_package)

Fast Formatted LR addition does not work when rank equals to 1

Reproduce error by running example file bin/lr_add 32 1.
Output:

--- LR Add Default ----------------------------
munmap_chunk(): invalid pointer
Aborted (core dumped)

Investigate formatted_addition function in src/operations/arithmetic/addition.cpp

Remove Dependency to StarsH

StarsH leads to build failed on Github server. Since this may also lead to other problems in the future, this dependency should be removed.

Test for BLR and Hierarchical Factorizations

Check unit-test branch and take anything that is useful from there

OMP_PLACES malfunctions on ABCI.

The OMP_PLACES ENV variable, when set to cores is supposed to assign each new thread to a different physical core successively from 0-40 (in case of ABCI). While this setting works in other programs, it does not for some reason work in our cod (my branch is https://github.com/rioyokotalab/hicma/tree/openmp-tasks).

In order to reproduce this error, modify the compression algorithm to use OpenMP tasks like so:

Hierarchical::Hierarchical(
                           int64_t N, int64_t nleaf, int64_t nblocks, double beta,
                           double nu, double noise, double sigma, int ndim,
                           double admis, int64_t rank, int basis_type,
                           int64_t row_start, int64_t col_start
) {

  // make this shared. The data within this object (without use of nested bases)
  // applies to the entire matrix, for example the admis or rank. The actual
  // that it generated is produced inside get_dense_representation, and since that
  // data is alway local to a single task, the allocation of that data is local to
  // that task too.
  MatrixInitializerStarshExponential initer(N, beta, nu, noise, sigma, ndim,
                                            admis, rank, basis_type, GEOMETRY_BASED_ADMIS);
  ClusterTree cluster_tree(
                           {row_start, N}, {col_start, N}, nblocks, nblocks, nleaf
                           );
  dim = cluster_tree.block_dim;
  data.resize(dim[0]*dim[1]);
#pragma omp parallel proc_bind(close)
  {
#pragma omp single
    {
      for (const ClusterTree& child : cluster_tree) {
#pragma omp task shared(cluster_tree, initer) firstprivate(child)
        {
          std::cout << "exec: " << omp_get_thread_num() << "core: " << sched_getcpu()
                    << " p: " << omp_get_place_num() << std::endl;
          if (initer.is_admissible(child)) {
            (*this)[child] = initer.get_compressed_representation(child);
          } else {
            if (child.is_leaf()) {
              (*this)[child] = initer.get_dense_representation(child);
            } else {
              (*this)[child] = Hierarchical(child, initer);
            }
          }
        }
      }
    }
  }
}

Then set OMP_PLACES=cores and OMP_PROC_BIND=close and run this code with many threads. You will see that the binding of the threads only takes places on CPUs 0 and 40. So only 2 CPUs are being kept busy. Removing the variables makes it work again, but there is no control on the thread affinity which is very important for NUMA performance.

Complete GEMM Implementations and Tests

Complete the implementation of the following gemm specializations

gemm(LowRank, Dense, LowRank)
gemm(Dense, LowRank, LowRank)
gemm(Dense, LowRank, Hierarchical)
gemm(LowRank, LowRank, LowRank)
gemm(Hierarchical, LowRank, LowRank)
gemm(LowRank, Hierarchical, LowRank)
gemm(LowRank, LowRank, Hierarchical)

Tests for following gemm specializations