mratsim / arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends

Home Page: https://mratsim.github.io/Arraymancer/

License: Apache License 2.0

Nim 98.01% Ruby 0.02% Python 1.93% Julia 0.04%
tensor nim multidimensional-arrays cuda deep-learning machine-learning cudnn high-performance-computing gpu-computing matrix-library

arraymancer's Introduction


Arraymancer - An n-dimensional tensor (ndarray) library.

Arraymancer is a tensor (N-dimensional array) project in Nim. The main focus is providing a fast and ergonomic CPU, Cuda and OpenCL ndarray library on which to build a scientific computing ecosystem.

The library is inspired by Numpy and PyTorch and targets the following use-cases:

  • N-dimensional arrays (tensors) for numerical computing
  • machine learning algorithms (as in Scikit-learn: least squares solvers, PCA and dimensionality reduction, classifiers, regressors and clustering algorithms, cross-validation).
  • deep learning

The ndarray component can be used without the machine learning and deep learning components. It can also use the OpenMP, Cuda or OpenCL backends.

Note: While Nim is compiled and does not offer an interactive REPL yet (like Jupyter), it allows much faster prototyping than C++ due to extremely fast compilation times. Arraymancer compiles in about 5 seconds on my dual-core MacBook.

Reminder of supported compilation flags (an example invocation follows this list):

  • -d:release: Nim release mode (no stacktraces and debugging information)
  • -d:danger: No runtime checks like array bound checking
  • -d:blas=blaslibname: Customize the BLAS library used by Arraymancer. By default (i.e. if you don't define this setting) Arraymancer will try to automatically find a BLAS library (e.g. blas.so/blas.dll or libopenblas.dll) on your path. You should only set this setting if for some reason you want to use a specific BLAS library. See nimblas for further information
  • -d:lapack=lapacklibname: Customize the LAPACK library used by Arraymancer. By default (i.e. if you don't define this setting) Arraymancer will try to automatically find a LAPACK library (e.g. lapack.so/lapack.dll or libopenblas.dll) on your path. You should only set this setting if for some reason you want to use a specific LAPACK library. See nimlapack for further information
  • -d:openmp: Multithreaded compilation
  • -d:mkl: Deprecated flag which forces the use of MKL. Implies -d:openmp. Use -d:blas=mkl -d:lapack=mkl instead, but only if you want to force Arraymancer to use MKL, instead of looking for the available BLAS / LAPACK libraries
  • -d:openblas: Deprecated flag which forces the use of OpenBLAS. Use -d:blas=openblas -d:lapack=openblas instead, but only if you want to force Arraymancer to use OpenBLAS, instead of looking for the available BLAS / LAPACK libraries
  • -d:cuda: Build with Cuda support
  • -d:cudnn: Build with CuDNN support, implies -d:cuda
  • -d:avx512: Build with AVX512 support by supplying the -mavx512dq flag to gcc / clang. Without this flag the resulting binary does not use AVX512 even on CPUs that support it. Setting this flag, however, makes the binary incompatible with CPUs that do not support AVX512. See the comments in #505 for a discussion (from v0.7.9)
  • You might want to tune library paths in nim.cfg after installation for OpenBLAS, MKL and Cuda compilation. The current defaults should work on Mac and Linux; and on Windows after downloading libopenblas.dll or another BLAS / LAPACK DLL (see the Installation section for more information) and copying it into a folder in your path or into the compilation output folder.
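
For example, a release build with OpenMP enabled and OpenBLAS forced for both BLAS and LAPACK could be compiled as follows (the program name is just an illustration):

nim c -d:release -d:openmp -d:blas=openblas -d:lapack=openblas my_program.nim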

Show me some code

The Arraymancer tutorial is available here.

Here is a preview of Arraymancer syntax.

Tensor creation and slicing

import math, arraymancer

const
    x = @[1, 2, 3, 4, 5]
    y = @[1, 2, 3, 4, 5]

var
    vandermonde = newSeq[seq[int]]()
    row: seq[int]

for i, xx in x:
    row = newSeq[int]()
    vandermonde.add(row)
    for j, yy in y:
        vandermonde[i].add(xx^yy)

let foo = vandermonde.toTensor()

echo foo

# Tensor[system.int] of shape "[5, 5]" on backend "Cpu"
# |1          1       1       1       1|
# |2          4       8      16      32|
# |3          9      27      81     243|
# |4         16      64     256    1024|
# |5         25     125     625    3125|

echo foo[1..2, 3..4] # slice

# Tensor[system.int] of shape "[2, 2]" on backend "Cpu"
# |16      32|
# |81     243|

echo foo[_|-1, _] # reverse the order of the rows

# Tensor[int] of shape "[5, 5]" on backend "Cpu"
# |5      25      125     625     3125|
# |4      16       64     256     1024|
# |3       9       27      81      243|
# |2       4        8      16       32|
# |1       1        1       1        1|
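
Slices also accept a step with | (the t[0..4, 2..10|2] ergonomics mentioned later in this README). A small sketch reusing foo from above; the expected values are shown, though the exact display formatting may differ:

echo foo[_, 0..4|2] # keep every second column

# |1      1       1|
# |2      8      32|
# |3     27     243|
# |4     64    1024|
# |5    125    3125|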

Reshaping and concatenation

import arraymancer, sequtils

let a = toSeq(1..4).toTensor.reshape(2,2)

let b = toSeq(5..8).toTensor.reshape(2,2)

let c = toSeq(11..16).toTensor
let c0 = c.reshape(3,2)
let c1 = c.reshape(2,3)

echo concat(a,b,c0, axis = 0)
# Tensor[system.int] of shape "[7, 2]" on backend "Cpu"
# |1      2|
# |3      4|
# |5      6|
# |7      8|
# |11    12|
# |13    14|
# |15    16|

echo concat(a,b,c1, axis = 1)
# Tensor[system.int] of shape "[2, 7]" on backend "Cpu"
# |1      2     5     6    11    12    13|
# |3      4     7     8    14    15    16|

Broadcasting

(Broadcasting illustration image from Scipy.)

import arraymancer

let j = [0, 10, 20, 30].toTensor.reshape(4,1)
let k = [0, 1, 2].toTensor.reshape(1,3)

echo j +. k
# Tensor[system.int] of shape "[4, 3]" on backend "Cpu"
# |0      1     2|
# |10    11    12|
# |20    21    22|
# |30    31    32|
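
The same explicit convention extends to the other broadcasted arithmetic operators (assuming the usual -., *. and /. element-wise variants); for instance, reusing j and k from above:

echo j *. k # broadcasted element-wise multiplication

# expected values:
# |0      0      0|
# |0     10     20|
# |0     20     40|
# |0     30     60|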

A simple two-layer neural network

From example 3.

import arraymancer, strformat

discard """
A fully-connected ReLU network with one hidden layer, trained to predict y from x
by minimizing squared Euclidean distance.
"""

# ##################################################################
# Environment variables

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
let (N, D_in, H, D_out) = (64, 1000, 100, 10)

# Create the autograd context that will hold the computational graph
let ctx = newContext Tensor[float32]

# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
let
  x = ctx.variable(randomTensor[float32](N, D_in, 1'f32))
  y = randomTensor[float32](N, D_out, 1'f32)

# ##################################################################
# Define the model

network TwoLayersNet:
  layers:
    fc1: Linear(D_in, H)
    fc2: Linear(H, D_out)
  forward x:
    x.fc1.relu.fc2

let
  model = ctx.init(TwoLayersNet)
  optim = model.optimizer(SGD, learning_rate = 1e-4'f32)

# ##################################################################
# Training

for t in 0 ..< 500:
  let
    y_pred = model.forward(x)
    loss = y_pred.mse_loss(y)

  echo &"Epoch {t}: loss {loss.value[0]}"

  loss.backprop()
  optim.update()

Teaser: a text generated with Arraymancer's recurrent neural network

From example 6.

Trained for 45 minutes on my laptop CPU on Shakespeare, producing 4000 characters:

Whter!
Take's servant seal'd, making uponweed but rascally guess-boot,
Bare them be that been all ingal to me;
Your play to the see's wife the wrong-pars
With child of queer wretchless dreadful cold
Cursters will how your part? I prince!
This is time not in a without a tands:
You are but foul to this.
I talk and fellows break my revenges, so, and of the hisod
As you lords them or trues salt of the poort.

ROMEO:
Thou hast facted to keep thee, and am speak
Of them; she's murder'd of your galla?

# [...] See example 6 for full text generation samples

Installation

Nim is available in some Linux repositories and on Homebrew for macOS.

I however recommend installing Nim in your user profile via choosenim. Once choosenim has installed Nim, you can run nimble install arraymancer, which will pull the latest Arraymancer release and all of its dependencies.

To install the Arraymancer development version, you can use nimble install arraymancer@#head.
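
Concretely, once Nim is available, the two commands described above are:

nimble install arraymancer          # latest tagged release
nimble install arraymancer@#head    # development version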

Arraymancer requires a BLAS and a LAPACK library.

  • On Windows you can get the OpenBLAS library, which combines BLAS and LAPACK into a single DLL (libopenblas.dll), from the binary packages section of the OpenBLAS web page. Alternatively you can download separate BLAS and LAPACK libraries from the LAPACK for Windows site. You must then copy or extract those DLLs into a folder on your path or into the folder containing your compilation target.
  • On macOS, the Apple Accelerate framework is included in all macOS versions and provides both.
  • On Linux, you can download libopenblas and liblapack through your package manager.

Windows users may have to download libopenblas.dll from the binary releases section of OpenBLAS and extract it into the compilation output folder or into a folder on their path.

Full documentation

The detailed API is available at the Arraymancer official documentation. Note: this documentation is only generated for 0.X releases. Check the examples folder for the latest devel evolutions.

Features

For now Arraymancer is mostly at the multidimensional array stage; in particular, it offers the following:

  • Basic math operations generalized to tensors (sin, cos, ...)
  • Matrix algebra primitives: Matrix-Matrix and Matrix-Vector multiplication (a short sketch follows this list).
  • Easy and efficient slicing including with ranges and steps.
  • No need to worry about "vectorized" operations.
  • Broadcasting support. Unlike Numpy it is explicit, you just need to use +. instead of +.
  • Plenty of reshaping operations: concat, reshape, split, chunk, permute, transpose.
  • Supports tensors of up to 6 dimensions. For example a stack of 4 3D RGB minifilms of 10 seconds would be 6 dimensions: [4, 10, 3, 64, 1920, 1080] for [nb_movies, time, colors, depth, height, width]
  • Can read and write .csv, Numpy (.npy) and HDF5 files.
  • OpenCL and Cuda backed tensors (not as feature packed as CPU tensors at the moment).
  • Covariance matrices.
  • Eigenvalues and Eigenvectors decomposition.
  • Least squares solver.
  • K-means and PCA (Principal Component Analysis).
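
A minimal sketch of a few of the features above, assuming the element-wise math functions and the * matrix-multiplication operator described in this README (exact output formatting may differ):

import arraymancer

let m = [[1.0, 2.0],
         [3.0, 4.0]].toTensor
let v = [1.0, 1.0].toTensor

echo m.sin        # basic math ops generalized to tensors, no manual loops
echo m * m        # Matrix-Matrix multiplication
echo m * v        # Matrix-Vector multiplication
echo m.transpose  # one of the reshaping/permutation operations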

Arraymancer as a Deep Learning library

Deep learning features can be explored but are considered unstable while I iron out their final interface.

Reminder: The final interface is still work in progress.

You can also watch the following animated neural network demo which shows live training via nim-plotly.

Fizzbuzz with fully-connected layers (also called Dense, Affine or Linear layers)

Neural network definition extracted from example 4.

import arraymancer

const
  NumDigits = 10
  NumHidden = 100

network FizzBuzzNet:
  layers:
    hidden: Linear(NumDigits, NumHidden)
    output: Linear(NumHidden, 4)
  forward x:
    x.hidden.relu.output

let
  ctx = newContext Tensor[float32]
  model = ctx.init(FizzBuzzNet)
  optim = model.optimizer(SGD, 0.05'f32)
# ....
echo answer
# @["1", "2", "fizz", "4", "buzz", "6", "7", "8", "fizz", "10",
#   "11", "12", "13", "14", "15", "16", "17", "fizz", "19", "buzz",
#   "fizz", "22", "23", "24", "buzz", "26", "fizz", "28", "29", "30",
#   "31", "32", "fizz", "34", "buzz", "36", "37", "38", "39", "40",
#   "41", "fizz", "43", "44", "fizzbuzz", "46", "47", "fizz", "49", "50",
#   "fizz", "52","53", "54", "buzz", "56", "fizz", "58", "59", "fizzbuzz",
#   "61", "62", "63", "64", "buzz", "fizz", "67", "68", "fizz", "buzz",
#   "71", "fizz", "73", "74", "75", "76", "77","fizz", "79", "buzz",
#   "fizz", "82", "83", "fizz", "buzz", "86", "fizz", "88", "89", "90",
#   "91", "92", "fizz", "94", "buzz", "fizz", "97", "98", "fizz", "buzz"]

Handwritten digit recognition with convolutions

Neural network definition extracted from example 2.

import arraymancer

network DemoNet:
  layers:
    cv1:        Conv2D(@[1, 28, 28], out_channels = 20, kernel_size = (5, 5))
    mp1:        Maxpool2D(cv1.out_shape, kernel_size = (2,2), padding = (0,0), stride = (2,2))
    cv2:        Conv2D(mp1.out_shape, out_channels = 50, kernel_size = (5, 5))
    mp2:        MaxPool2D(cv2.out_shape, kernel_size = (2,2), padding = (0,0), stride = (2,2))
    fl:         Flatten(mp2.out_shape)
    hidden:     Linear(fl.out_shape[0], 500)
    classifier: Linear(500, 10)
  forward x:
    x.cv1.relu.mp1.cv2.relu.mp2.fl.hidden.relu.classifier

let
  ctx = newContext Tensor[float32] # Autograd/neural network graph
  model = ctx.init(DemoNet)
  optim = model.optimizer(SGD, learning_rate = 0.01'f32)

# ...
# Accuracy over 90% in a couple minutes on a laptop CPU

Sequence classification with stacked Recurrent Neural Networks

Neural network definition extracted example 5.

import arraymancer

const
  HiddenSize = 256
  Layers = 4
  BatchSize = 512


network TheGreatSequencer:
  layers:
    gru1: GRULayer(1, HiddenSize, 4) # (num_input_features, hidden_size, stacked_layers)
    fc1: Linear(HiddenSize, 32)                  # 1 classifier per GRU layer
    fc2: Linear(HiddenSize, 32)
    fc3: Linear(HiddenSize, 32)
    fc4: Linear(HiddenSize, 32)
    classifier: Linear(32 * 4, 3)                # Stacking a classifier which learns from the other 4
  forward x, hidden0:
    let
      (output, hiddenN) = gru1(x, hidden0)
      clf1 = hiddenN[0, _, _].squeeze(0).fc1.relu
      clf2 = hiddenN[1, _, _].squeeze(0).fc2.relu
      clf3 = hiddenN[2, _, _].squeeze(0).fc3.relu
      clf4 = hiddenN[3, _, _].squeeze(0).fc4.relu

    # Concat all
    # Since concat backprop is not implemented we cheat by stacking
    # Then flatten
    result = stack(clf1, clf2, clf3, clf4, axis = 2)
    result = classifier(result.flatten)

# Allocate the model
let
  ctx = newContext Tensor[float32]
  model = ctx.init(TheGreatSequencer)
  optim = model.optimizer(SGD, 0.01'f32)

# ...
let exam = ctx.variable([
    [float32 0.10, 0.20, 0.30], # increasing
    [float32 0.10, 0.90, 0.95], # increasing
    [float32 0.45, 0.50, 0.55], # increasing
    [float32 0.10, 0.30, 0.20], # non-monotonic
    [float32 0.20, 0.10, 0.30], # non-monotonic
    [float32 0.98, 0.97, 0.96], # decreasing
    [float32 0.12, 0.05, 0.01], # decreasing
    [float32 0.95, 0.05, 0.07]  # non-monotonic
  ])
# ...
echo answer.unsqueeze(1)
# Tensor[ex05_sequence_classification_GRU.SeqKind] of shape [8, 1] of type "SeqKind" on backend "Cpu"
# 	  Increasing|
# 	  Increasing|
# 	  Increasing|
# 	  NonMonotonic|
# 	  NonMonotonic|
# 	  Increasing| <----- Wrong!
# 	  Decreasing|
# 	  NonMonotonic|

Composing models

Network models can also act as layers in other network definitions. The handwritten-digit-recognition model above can also be written like this:

import arraymancer

network SomeConvNet:
  layers h, w:
    cv1:        Conv2D(@[1, h, w], 20, (5, 5))
    mp1:        Maxpool2D(cv1.out_shape, (2,2), (0,0), (2,2))
    cv2:        Conv2D(mp1.out_shape, 50, (5, 5))
    mp2:        MaxPool2D(cv2.out_shape, (2,2), (0,0), (2,2))
    fl:         Flatten(mp2.out_shape)
  forward x:
    x.cv1.relu.mp1.cv2.relu.mp2.fl

# this model could be initialized like this: let model = ctx.init(SomeConvNet, h = 28, w = 28)

# functions `out_shape` and `in_shape` returning a `seq[int]` are convention (but not strictly necessary)
# for layers/models that have clearly defined output and input size
proc out_shape*[T](self: SomeConvNet[T]): seq[int] =
  self.fl.out_shape
proc in_shape*[T](self: SomeConvNet[T]): seq[int] =
  self.cv1.in_shape

network DemoNet:
  layers:
  # here we use the previously defined SomeConvNet as a layer
    cv:         SomeConvNet(28, 28)
    hidden:     Linear(cv.out_shape[0], 500)
    classifier: Linear(hidden.out_shape[0], 10)
  forward x:
    x.cv.hidden.relu.classifier

Custom layers

It is also possible to create fully custom layers. The documentation for this can be found in the official API documentation.

Tensors on CPU, on Cuda and OpenCL

Tensors, CudaTensors and ClTensors do not have the same features implemented yet. Also, CudaTensors and ClTensors can only hold float32 or float64 values, while CpuTensors can hold integers, strings, booleans or any custom object.

Here is a comparative table of the core features (a short conversion sketch follows the table).

Action Tensor CudaTensor ClTensor
Accessing tensor properties [x] [x] [x]
Tensor creation [x] by converting a cpu Tensor by converting a cpu Tensor
Accessing or modifying a single value [x] [] []
Iterating on a Tensor [x] [] []
Slicing a Tensor [x] [x] [x]
Slice mutation a[1,_] = 10 [x] [] []
Comparison == [x] [] []
Element-wise basic operations [x] [x] [x]
Universal functions [x] [] []
Automatically broadcasted operations [x] [x] [x]
Matrix-Matrix and Matrix-Vector multiplication [x] [x] [x]
Displaying a tensor [x] [x] [x]
Higher-order functions (map, apply, reduce, fold) [x] internal only internal only
Transposing [x] [x] []
Converting to contiguous [x] [x] []
Reshaping [x] [x] []
Explicit broadcast [x] [x] [x]
Permuting dimensions [x] [] []
Concatenating tensors along existing dimension [x] [] []
Squeezing singleton dimension [x] [x] []
Slicing + squeezing [x] [] []
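
As noted in the table, CudaTensors and ClTensors are created by converting a CPU tensor. A minimal sketch of the Cuda case, using the .cuda / .cpu conversions that also appear in the issue snippets below; it requires building with -d:cuda:

import arraymancer

let t  = randomTensor([2, 2], 1.0'f32)  # CPU tensor with values in [0, 1]
let tc = t.cuda                         # copy to the GPU as a CudaTensor
echo (tc * tc).cpu                      # Matrix-Matrix multiplication on GPU, copied back for display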

What's new in Arraymancer

The full changelog is available in changelog.md.

4 reasons why Arraymancer

The Python community is struggling to bring Numpy up-to-speed

  • Numba JIT compiler
  • Dask delayed parallel computation graph
  • Cython to ease numerical computations in Python
  • Due to the GIL shared-memory parallelism (OpenMP) is not possible in pure Python
  • Use "vectorized operations" (i.e. don't use for loops in Python)

Why not use a single language that has all the building blocks to create the most efficient scientific computing library, with Python-like ergonomics?

OpenMP batteries included.

A researcher workflow is a fight against inefficiencies

Researchers in a heavy scientific computing domain often have the following workflow: Mathematica/Matlab/Python/R (prototyping) -> C/C++/Fortran (speed, memory)

Why not use a language as productive as Python and as fast as C? Code once, and don't spend months redoing the same thing at a lower level.

Can be distributed almost dependency free

Arraymancer models can be packaged into a self-contained binary that only depends on a BLAS library like OpenBLAS, MKL or Apple Accelerate (present on all macOS and iOS devices).

This means that there is no need to install a huge library or language ecosystem to use Arraymancer. This also makes it naturally suitable for resource-constrained devices like mobile phones and Raspberry Pi.

Bridging the gap between deep learning research and production

The deep learning frameworks are currently in two camps:

  • Research: Theano, Tensorflow, Keras, Torch, PyTorch
  • Production: Caffe, Darknet, (Tensorflow)

Furthermore, Python preprocessing steps, unless you use OpenCV, often need a custom implementation (think text/speech preprocessing on phones).

  • Managing and deploying Python (2.7, 3.5, 3.6) and package versions in a robust manner requires devops-fu (virtualenv, Docker, ...).
  • The Python data science ecosystem does not run on embedded devices (Nvidia Tegra/drones) or mobile phones, especially the preprocessing dependencies.
  • Tensorflow is supposed to bridge the gap between research and production, but its syntax and ergonomics are a pain to work with. As for researchers, you need to code twice: "Prototype in Keras, and when you need low-level control --> Tensorflow".
  • Deployed models are static; no framework offers an interface to add a new observation/training sample. What if you want to use a model as a webservice with online learning?

(Relevant XKCD from Apr 30, 2018: the Python environment mess.)

So why Arraymancer?

All those pain points may seem like a huge undertaking; however, thanks to the Nim language, we can have Arraymancer:

  • Be as fast as C
  • Accelerated routines with Intel MKL/OpenBLAS or even NNPACK
  • Access to CUDA and CuDNN and generate custom CUDA kernels on the fly via metaprogramming.
  • Almost dependency free distribution (BLAS library)
  • A Python-like syntax with custom operators a * b for tensor multiplication instead of a.dot(b) (Numpy/Tensorflow) or a.mm(b) (Torch)
  • Numpy-like slicing ergonomics t[0..4, 2..10|2]
  • For everything that Nim doesn't have yet, you can use Nim bindings to C, C++, Objective-C or Javascript to bring it to Nim. Nim also has unofficial Python->Nim and Nim->Python wrappers.

Future ambitions

Because apparently to be successful you need a vision, I would like Arraymancer to be:

  • The go-to tool for Deep Learning video processing. I.e. vid = load_video("./cats/youtube_cat_video.mkv")
  • Target javascript, WebAssembly, Apple Metal, ARM devices, AMD Rocm, OpenCL, you name it.
  • The base of a Starcraft II AI bot.
  • Target cryptominers FPGAs because they drove the price of GPUs for honest deep-learners too high.

arraymancer's People

Contributors

angelezquerra, anon767, asnt, auxym, bitsnaps, bluenote10, brentp, bung87, chimez, clonkk, edubart, fabriciopashaj, keyehzy, manguluka, metagn, metasyn, mratsim, narimiran, niminem, oxinabox, paulnorrie, ringabout, shalokshalom, struggle, tandy-1000, timotheecour, tmokazaki, tsoj, vindaar, ynfle


arraymancer's Issues

Implement integer matrix multiplication and integer matrix-vector multiplication

Unfortunately no BLAS library supports accelerated integer matrix multiplication.

Int32 can be safely converted to float64 (only issue is if the product goes into the int64 range)

int64 may incur loss of precision for numbers bigger than int32

==> Implement a fallback method for int
==> Display a static warning at compile time telling users to convert to float if their numbers are small enough
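
A hedged sketch of the conversion idea above, using the astype conversion seen elsewhere in this README; it is only exact when the intermediate products and sums fit exactly in float64:

import arraymancer

let a = [[1, 2], [3, 4]].toTensor
let b = [[5, 6], [7, 8]].toTensor

# Route the integer matmul through the accelerated float64 BLAS path,
# then convert back. A true integer fallback would avoid the round-trip.
echo (a.astype(float64) * b.astype(float64)).astype(int)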

Sparse tensor and Sparse CudaTensor

Sparse tensor support is important in general machine learning, especially to store matrices of one-hot encoded vectors.

For CUDA backend, CuBLAS has a Sparse API.
For CPU, further investigation is needed to find a suitable Sparse BLAS backend. See this link for potential libraries.

Unable to use `_` operator in this example

import arraymancer

proc foo[T](t: Tensor[T], x: int): Tensor[T] =
  t.unsafeSlice(x, _, _).unsafeReshape([t.shape[1], t.shape[2]])

echo zeros([2,2,2], int).foo(1)

in global_config.nim it works.

test2.nim(6, 25) template/generic instantiation from here
test2.nim(4, 25) Error: type mismatch: got (int, array[0..1, int])
but expected one of: 
proc unsafeReshape(t: Tensor; new_shape: varargs[int]): Tensor

C++ codegen broken (i.e. CudaTensors broken)

C++ codegen was broken by 829d8b1 and #90.

This is blocking for Cuda as Cuda kernel relies on C++ templates.

(fails with or without release)
nim cpp -d:release --out:bin/all_tests --nimcache:./nimcache tests/all_tests.nim

Error: execution of an external compiler program 'clang++ -c  -w  -O3 -ffast-math   -I'/Users/tesuji/.choosenim/toolchains/nim-#devel/lib' -o ./nimcache/arraymancer_arraymancer.o ./nimcache/arraymancer_arraymancer.cpp' failed with exit code: 1

./nimcache/arraymancer_arraymancer.cpp:2608:19: error: declaration of reference variable 'x' requires an initializer
                NimStringDesc*& x;
                                ^
./nimcache/arraymancer_arraymancer.cpp:2610:7: error: C-style cast from rvalue to reference type 'NimStringDesc *&'
                x = (NimStringDesc*&)0;
                    ^~~~~~~~~~~~~~~~~~
./nimcache/arraymancer_arraymancer.cpp:3192:21: error: declaration of reference variable 'x' requires an initializer
                                NimStringDesc*& x;
                                                ^
./nimcache/arraymancer_arraymancer.cpp:3194:9: error: C-style cast from rvalue to reference type 'NimStringDesc *&'
                                x = (NimStringDesc*&)0;
                                    ^~~~~~~~~~~~~~~~~~
./nimcache/arraymancer_arraymancer.cpp:3636:21: error: declaration of reference variable 'x' requires an initializer
                                NimStringDesc*& x;
                                                ^
./nimcache/arraymancer_arraymancer.cpp:3638:9: error: C-style cast from rvalue to reference type 'NimStringDesc *&'
                                x = (NimStringDesc*&)0;
                                    ^~~~~~~~~~~~~~~~~~
./nimcache/arraymancer_arraymancer.cpp:4446:9: error: declaration of reference variable 'x' requires an initializer
                                NI& x;
                                    ^
./nimcache/arraymancer_arraymancer.cpp:4448:9: error: C-style cast from rvalue to reference type 'NI &' (aka 'long long &')
                                x = (NI&)0;
                                    ^~~~~~
./nimcache/arraymancer_arraymancer.cpp:4901:21: error: declaration of reference variable 'x' requires an initializer
                                NimStringDesc*& x;
                                                ^
./nimcache/arraymancer_arraymancer.cpp:4903:9: error: C-style cast from rvalue to reference type 'NimStringDesc *&'
                                x = (NimStringDesc*&)0;
                                    ^~~~~~~~~~~~~~~~~~
./nimcache/arraymancer_arraymancer.cpp:5827:7: error: declaration of reference variable 'old_val' requires an initializer
                NI& old_val;
                    ^~~~~~~
./nimcache/arraymancer_arraymancer.cpp:5828:13: error: C-style cast from rvalue to reference type 'NI &' (aka 'long long &')
                old_val = (NI&)0;
                          ^~~~~~
./nimcache/arraymancer_arraymancer.cpp:6013:7: error: declaration of reference variable 'x_2' requires an initializer
                NI& x_2;
                    ^~~
./nimcache/arraymancer_arraymancer.cpp:6015:9: error: C-style cast from rvalue to reference type 'NI &' (aka 'long long &')
                x_2 = (NI&)0;
                      ^~~~~~
./nimcache/arraymancer_arraymancer.cpp:6159:7: error: declaration of reference variable 'x' requires an initializer
                NI& x;
                    ^
./nimcache/arraymancer_arraymancer.cpp:6161:7: error: C-style cast from rvalue to reference type 'NI &' (aka 'long long &')
                x = (NI&)0;
                    ^~~~~~
./nimcache/arraymancer_arraymancer.cpp:7326:10: error: declaration of reference variable 'v' requires an initializer
                                        NF& v;
                                            ^
./nimcache/arraymancer_arraymancer.cpp:7328:10: error: C-style cast from rvalue to reference type 'NF &' (aka 'double &')
                                        v = (NF&)0;
                                            ^~~~~~
./nimcache/arraymancer_arraymancer.cpp:7766:10: error: declaration of reference variable 'v_2' requires an initializer
                                        NF& v_2;
                                            ^~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

Zoom in on non-string - line 5827

N_NIMCALL(void, slicerMut_5HNBiHajzTdmOUheqCgr2A)(tyObject_Tensor_8gG034a4DEBvD9ctN8nyxaw& t, tyObject_SteppedSlice_rJTlbLcKC9bpAJbFSrlwOrQ* slices, NI slicesLen_0, NI val$
        tyObject_Tensor_8gG034a4DEBvD9ctN8nyxaw sliced;
        tyObject_Tensor_8gG034a4DEBvD9ctN8nyxaw T1_;
        memset((void*)(&sliced), 0, sizeof(sliced));
        memset((void*)(&T1_), 0, sizeof(T1_));
        unsafeSlicer_SRZrYFq46KhcUYobS9bdI9cg_2((&t), slices, slicesLen_0, (&T1_));
        sliced.shape = T1_.shape;
        sliced.strides = T1_.strides;
        sliced.offset = T1_.offset;
        sliced.data = T1_.data;
        {
                NI& old_val;        // <------- here, needs initializer
                old_val = (NI&)0;  // <------- here, needs initializer
                NI* data = dataArray_v3eUw6FwA4Vs9cV5O3TkzvQtest_init((&sliced));
                {
                        if (!is_C_contiguous_dDfDc3pbZUoFoNv6AVz18Atest_init((&sliced))) goto LA5_;
{                       {
                                NI i;
                                NI colontmp_;
                                NI T8_;
                                ...

Implement image loader (PNG, JPG ...)

This will probably be separated in a different package later.

Options include:

Considerations:

  • Can work on ARM
  • Fast (AVX2 for x86-64 and Neon for ARM)
  • Easy to integrate and maintain

Open question:

  • For data augmentation like rotation/shearing, would it be better to use the library functions (tying it deeply with Arraymancer) or implement them directly on Arraymancer tensors (ensuring compatibility with all backends, and potentially using GPU acceleration)

Forum topic: https://forum.nim-lang.org/t/3056

Provide default configuration for OpenMP, MKL, Cuda, -flto

The build options are becoming quite complicated and there is no way to set certain options directly in the nim files.

Cuda:

template cudaSwitches() =
  switch("cincludes", "/opt/cuda/include")
  switch("cc", "gcc") # We trick Nim about nvcc being gcc, pending https://github.com/nim-lang/Nim/issues/6372
  switch("gcc.exe", "/opt/cuda/bin/nvcc")
  switch("gcc.linkerexe", "/opt/cuda/bin/nvcc")
  switch("gcc.cpp.exe", "/opt/cuda/bin/nvcc")
  switch("gcc.cpp.linkerexe", "/opt/cuda/bin/nvcc")
  # Due to the __ldg intrinsics in kernels
  # we only support compute capabilities 3.5+
  # See here: http://docs.nvidia.com/cuda/pascal-compatibility-guide/index.html
  # And wikipedia for GPU capabilities: https://en.wikipedia.org/wiki/CUDA
  switch("gcc.options.always", "-arch=sm_61 --x cu") # Interpret .c files as .cu
  switch("gcc.cpp.options.always", "-arch=sm_61 --x cu -Xcompiler -fpermissive") # Interpret .c files as .cu, gate fpermissive behind Xcompiler

MKL and MKL + OpenMP

task test_mkl, "Run all tests - Intel MKL - single threaded":
  switch("define", "blas=mkl_intel_lp64")
  switch("clibdir", "/opt/intel/mkl/lib/intel64")
  switch("passl", "/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a")
  switch("passl", "-lmkl_core")
  switch("passl", "-lmkl_sequential")
  switch("dynlibOverride", "mkl_intel_lp64")
  test "all_tests"

task test_mkl_omp, "Run all tests - Intel MKL + OpenMP":
  switch("define", "openmp")
  switch("define", "blas=mkl_intel_lp64")
  switch("clibdir", "/opt/intel/mkl/lib/intel64")
  switch("passl", "/opt/intel/mkl/lib/intel64/libmkl_intel_lp64.a")
  switch("passl", "-lmkl_core")
  switch("passl", "-lmkl_gnu_thread")
  switch("passl", "-lgomp")
  switch("dynlibOverride", "mkl_intel_lp64")
  test "all_tests"

Arraymancer should ship with arraymancer.cuda.nim.cfg, arraymancer.mkl_openmp.nim.cfg so setup is easier.
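
A hypothetical arraymancer.cuda.nim.cfg could simply mirror the cudaSwitches template above; paths are examples and depend on the local CUDA install, and the config-file key syntax is assumed to mirror the command-line switches:

cincludes = "/opt/cuda/include"
cc = "gcc"                       # trick Nim into driving nvcc through the gcc settings
gcc.exe = "/opt/cuda/bin/nvcc"
gcc.linkerexe = "/opt/cuda/bin/nvcc"
gcc.cpp.exe = "/opt/cuda/bin/nvcc"
gcc.cpp.linkerexe = "/opt/cuda/bin/nvcc"
gcc.options.always = "-arch=sm_61 --x cu"
gcc.cpp.options.always = "-arch=sm_61 --x cu -Xcompiler -fpermissive"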

Optimize integer BLAS (GEMM and GEMV)

Follow-up on #6

Basic integer matrix multiplication and matrix-vector multiplication are implemented.

Several optimizations can be implemented to speed up integer computation further.

Can be done

  • Automatic loop unrolling. Nim has an unroll pragma that currently does nothing.
    Alternatively, unrolling can be implemented like in the Nim GC.
  • Ensure 16-byte alignment of bufferArray. This is required for AVX/AVX2 optimization.
    Creating a custom pragma like today's
    {.pragma: align16, codegenDecl: "$# $# __attribute__((aligned(16)))".}
    probably aligns the pointer and not the actual data. Also the align pragma, which should allow this, doesn't work: nim-lang/Nim#5315
  • Add OpenMP (except on macOS, as Clang on macOS is not built with OpenMP support and it's a pain to override the macOS compiler)

Unsure if possible

  • Get L1/L2 cache size at compile-time
  • Get number of registers at compile-time

D Mir GLAS uses LLVM intrinsics to get this information.

Hard and not portable

Use AVX2 intrinsics. AVX2 operations support integers; unfortunately it is hard to get the compiler to use them automatically. An alternative could be to code directly with intrinsics.

As Arraymancer integer GEMM is based on ulmBLAS design, intrinsics can be implemented following ulmBLAS course: http://apfel.mathematik.uni-ulm.de/~lehn/ulmBLAS/

Unsure if helpful

Using pointers instead of seq + offset.
While it seems like less computation (no bounds checking, no recomputing of the position during iteration), using seq means the compiler can make many more assumptions about the data layout and optimize accesses (and GEMM is memory-bound; computing the new position is cheap).
An experiment with safe pointers (unsafe when built for release) can be found in the pointer_GEMM branch.

Cuda or C++ compilation broken by Nim inlining macro

Commit: 3e0212d

tmpxft_00000443_00000000-4_arraymancer_all_tests_cuda.cudafe1.cpp:(.text+0x93): undefined reference to `isObjSlowPath_k9bdq9bQE075AR7scLFt5wIg(TNimType*, TNimType*, TNimType**)'
nimcache/arraymancer_all_tests_cuda.o: In function `nimFrame(TFrame*)':
tmpxft_00000443_00000000-4_arraymancer_all_tests_cuda.cudafe1.cpp:(.text+0x118): undefined reference to `stackOverflow_II46IjNZztN9bmbxUD8dt8g()'
nimcache/arraymancer_all_tests_cuda.o: In function `suiteStarted_GgW0QiD89cMFXShyPQ8t8qg(tyObject_OutputFormattercolonObjectType__dLGU9cWWYqlqOn8lpRVsx9cw*, NimStringDesc*)':

...
...
...
thousands of lines
...
...
...
nimcache/nimcuda_cusolver_common.o:tmpxft_00000694_00000000-4_nimcuda_cusolver_common.cudafe1.cpp:(.text+0x8d): more undefined references to `stackOverflow_II46IjNZztN9bmbxUD8dt8g()' follow
nimcache/stdlib_unittest.o: In function `suiteEnded_8zybqvexf9aYkm9bmgyU3eLg_2':
stdlib_unittest.c:(.text+0x332a): undefined reference to `suiteEnded_K2O75e2roACICAuYiNImyw'
nimcache/stdlib_unittest.o: In function `testEnded_xoIS1BdhYWQ6hhMP22XXiw':
stdlib_unittest.c:(.text+0x3502): undefined reference to `testEnded_azSnnJ3hAiHHuUdLzViJlw'
collect2: error: ld returned 1 exit status
Error: execution of an external program failed: '/opt/cuda/bin/nvcc   -o /home/ml/programming/Arraymancer/./bin/all_tests_cuda  nimcache/arraymancer_all_tests_cuda.o nimcache/stdlib_system.o nimcache/arraymancer_arraymancer.o nimcache/arraymancer_test_operators_blas_cuda.o nimcache/arraymancer_test_accessors_slicer_cuda.o nimcache/arraymancer_test_shapeshifting_cuda.o nimcache/stdlib_sequtils.o nimcache/stdlib_strutils.o nimcache/stdlib_future.o nimcache/stdlib_algorithm.o nimcache/nimblas_nimblas.o nimcache/stdlib_math.o nimcache/stdlib_typetraits.o nimcache/stdlib_macros.o nimcache/stdlib_random.o nimcache/stdlib_parseutils.o nimcache/stdlib_times.o nimcache/nimcuda_cuda_runtime_api.o nimcache/nimcuda_vector_types.o nimcache/nimcuda_driver_types.o nimcache/nimcuda_surface_types.o nimcache/nimcuda_texture_types.o nimcache/nimcuda_cublas_api.o nimcache/nimcuda_library_types.o nimcache/nimcuda_cuComplex.o nimcache/nimcuda_cublas_v2.o nimcache/nimcuda_nimcuda.o nimcache/nimcuda_cuda_occupancy.o nimcache/nimcuda_cudnn.o nimcache/nimcuda_cufft.o nimcache/nimcuda_curand.o nimcache/nimcuda_cusolver_common.o nimcache/nimcuda_cusolverDn.o nimcache/nimcuda_cusolverRf.o nimcache/nimcuda_cusolverSp.o nimcache/nimcuda_cusparse.o nimcache/nimcuda_nvblas.o nimcache/nimcuda_nvgraph.o nimcache/stdlib_unittest.o nimcache/stdlib_streams.o nimcache/stdlib_sets.o nimcache/stdlib_hashes.o nimcache/stdlib_os.o nimcache/stdlib_posix.o nimcache/stdlib_ospaths.o nimcache/stdlib_terminal.o nimcache/stdlib_termios.o  -lm -lrt   -ldl'
       Tip: 4 messages have been suppressed, use --verbose to show them.
     Error: Execution failed with exit code 1
        ... Command: "/home/ml/.nimble/bin/nim" cpp --noNimblePath --path:"/home/ml/.nimble/pkgs/nimblas-0.2.0" --path:"/home/ml/.nimble/pkgs/nimcuda-0.1.4" "--define:cuda" "--cc:gcc" "--gcc.cpp.linkerexe:/opt/cuda/bin/nvcc" "--gcc.linkerexe:/opt/cuda/bin/nvcc" "--out:./bin/all_tests_cuda" "--gcc.cpp.exe:/opt/cuda/bin/nvcc" "--run" "--cincludes:/opt/cuda/include" "--gcc.exe:/opt/cuda/bin/nvcc" "--gcc.cpp.options.always:-arch=sm_61 --x cu -Xcompiler -fpermissive" "--gcc.options.always:-arch=sm_61 --x cu" "--nimcache:"nimcache""  "tests/all_tests_cuda.nim"

OpenMP crashes in debug mode

Currently OpenMP only works in release mode. In debug mode some allocations are being made, and with the current GC design, allocating from OpenMP threads is not supported. Some seqs (for example, calling is_C_Contiguous allocates a seq) and strings are being allocated in the OMP threads; why?

Investigate further: do we need to mark the OMP functions as gcsafe or something?

Span slicing inside dynamic type procs fails to compile

Minimal test case:

import arraymancer

proc boo[T](): T =
  var a = zeros([2,2], int)
  echo a[1,_]

discard boo[int]()

When compiling, the following error is shown:

test.nim(4, 12) Error: undeclared identifier: '_'

If the generic type [T] is removed from the proc, the code works fine.

Implement autograd (automatic backpropagation)

An autograd automatically computes the gradients of complex operations by decomposing them into basic ops. It allows great flexibility to create new neural network layers, as people don't need to compute the gradients by hand.

An autograd for scalar values can be found in Nim-rmad

Architecture considerations

It uses a sequence of closures to construct the list of operations to backpropagate through.
The usual solution in deep learning frameworks is to implement each operation as a class/object with a forward and a backward method.

Compared to the usual solution, Nim-rmad avoids dynamic dispatch and many indirections. However, this comes at the price of ease of use and extensibility for data scientists and researchers.
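
A minimal sketch of the closure-based idea (a toy scalar example, not Arraymancer's or Nim-rmad's actual code): each operation stores a closure that knows how to push the gradient back to its inputs.

type
  Node = ref object
    value, grad: float
    backward: proc () {.closure.}

proc variable(x: float): Node =
  Node(value: x, backward: proc () = discard)

proc `*`(a, b: Node): Node =
  let res = Node(value: a.value * b.value)
  res.backward = proc () =
    # d(a*b)/da = b, d(a*b)/db = a
    a.grad += b.value * res.grad
    b.grad += a.value * res.grad
  res

let x = variable(3.0)
let y = variable(4.0)
let z = x * y
z.grad = 1.0
z.backward()               # a real tape would record ops and replay them in reverse order
echo x.grad, " ", y.grad   # 4.0 3.0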

The final goal is to implement the autograd through Nim concepts and VTables (vtref and vtptr).
The "AutogradOp" concept will implement forward and backward.

Completely stuck today 😿

Unfortunately, while waiting for the VTable implementation (target: August or so according to Araq on IRC), I can't implement a prototype with closures or methods due to static bugs:

Use inline iterators to avoid copying

Unfortunately inline iterator chaining is broken: nim-lang/Nim#4516
As a workaround, closures are currently used

Occurrences:

Understand cublasSgemmStridedBatched for non-contiguous GEMM on CudaTensors

It seems like cublasSgemmStridedBatched can be used for matrix multiplication on GPU without copy for non-contiguous tensors. See the article here. (Note: if it works we can also use it for GEMV)

API has been added to Arraymancer here:

proc cublas_gemmStridedBatched[T: SomeReal](
       transa, transb: cublasOperation_t;
       m, n, k: int;
       alpha: T; A: ptr T; lda: int; strideA: int;
       B: ptr T; ldb: int; strideB: int;
       beta: T; C: ptr T; ldc: int; strideC: int;
       batchCount: int) {.inline.} =
  # C + i*strideC = α op(A + i*strideA) op(B + i*strideB) + β (C + i*strideC),
  # for i ∈ [0, batchCount − 1]
  # A, B, C: matrices
  # We need to pass an address to CuBLAS for beta.
  # If the input is not a variable but a float directly,
  # it won't have an address and can't be used by CuBLAS.
  var
    al = alpha
    be = beta
  check cublasSetStream(defaultHandle, defaultStream)
  when T is float32:
    check cublasSgemmStridedBatched(
      defaultHandle,
      transa, transb,
      m.cint, n.cint, k.cint,
      addr al, A, lda.cint, strideA,
      B, ldb.cint, strideB,
      addr be, C, ldc.cint, strideC,
      batchCount.cint
    )
  elif T is float64:
    check cublasDgemmStridedBatched(
      defaultHandle,
      transa, transb,
      m.cint, n.cint, k.cint,
      addr al, A, lda.cint, strideA,
      B, ldb.cint, strideB,
      addr be, C, ldc.cint, strideC,
      batchCount.cint
    )

Unfortunately I can't manage to use it even to multiply 2 contiguous matrices. The official documentation is here, but there is no example of a simple matmul.

The following snippet doesn't work, whatever strides I use to tell Cuda "there is no overlap".
Another poor soul had the same issue on StackOverflow and the Nvidia forums.

The snippet is a port of his code:

import ../src/cuda
import nimcuda/[cuda_runtime_api, driver_types, cublas_api, cublas_v2, nimcuda]


let m = 100
let n = 5
let k = 8


# C = B^T * A

var alpha = 1'f32

let B = randomTensor([800, 5], 1'f32).astype(float32).cuda
# B^T block has dimension 5 by 8
let ldb = 800
let strideB = 8

let A = randomTensor([800, 100], 1'f32).astype(float32).cuda
# A block has a dimension of 8 by 100
let lda = 800
let strideA = 8


var beta = 0'f32

let C = newTensor([500,100], float32).cuda
# C block has dimension 5 by 100

let ldc = 500
let strideC = 5

let n_blocks = 100

let foo = randomTensor([4,4],1'f32)
echo foo
echo foo.cuda

# echo B
# echo A
# echo C

cublas_gemmStridedBatched(
      CUBLAS_OP_T, CUBLAS_OP_N,
      n, m, k,
      alpha, B.get_data_ptr, ldb, n*k,
      A.get_data_ptr, ldA, m*k,
      beta, C.get_data_ptr, ldc, m * n,
      n_blocks
    )

echo C.cpu == B.cpu.transpose * A.cpu

Error:

 ** On entry to SGEMM  parameter number 15 had an illegal value
 ** On entry to SGEMM  parameter number 15 had an illegal value
 ** On entry to SGEMM  parameter number 15 had an illegal value

Note that the parameters are (?) shifted by one because the cublasHandle parameter is hidden.

References - cublasSgemmStridedBatched in the wild:

Optimize Host <-> Cuda memory transfers

Pinned memory is memory allocated on the host using the cudaMallocHost function, which prevents the memory from being swapped out and provides improved transfer speeds between Host and the actual Cuda device via DMA (Direct Memory Access).
It also allows non-blocking host<->GPU memcpy w.r.t. both host and GPU computations.

Link to topic on Nim forum.

It doesn't seem possible to use a custom allocator with new or newSeq

With Nim's memory regions, it is possible to clearly distinguish pinned memory from normal memory, with something similar to:

type
  UncheckedArray {.unchecked.}[T] = array[0..100_000_000, T]
  PinnedArray[T] = object
    len: int
    data: ptr UncheckedArray[T]
  Cuda = object

var foo: Cuda ptr PinnedArray[int]
foo = cast[ptr[Cuda, PinnedArray[int]]](alloc sizeOf(PinnedArray[int]))
foo.data = cast[ptr UncheckedArray[int]](alloc 50_000)
foo.len = 50_000

foo.data[][2]=7
echo foo.data[][2]

Unfortunately seqs do not support memory regions yet, which means implementing manual memory management for those pinned tensors.

Where to use

If seq cannot be used with cudaHostAlloc (i.e. manual memory management is needed), a PinnedTensor type different from the base Tensor will be needed.
In that case, it is probably best not to expose it, and implement only a subset needed to load tensors/images/videos/audio files from disk and do data_augmentation.

Warning: Benchmark needed

Subdimensional indexing

Let A be a CxMxN tensor. To access its subdimensional matrix B of size MxN at index c along the C axis, currently we have to do:

B  = A[c, _, _].reshape([M,N])

and if we want to access without copy,

B = A.unsafeView(c, _, _).unsafeReshape([M,N])

Neither way is easy to use; instead, a more conventional way would be:

B = A.at(c)

and without copy,

B = A.unsafeAt(c)

Reduce XDeclaredButNotUsed messages

During testing, pending nim-lang/Nim#4647 and nim-lang/Nim#4044:

Hint: test_blas [Processing]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(14, 12) Hint: '-=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(8, 12) Hint: '+=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(20, 12) Hint: '[]=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(5, 12) Hint: '+' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(11, 12) Hint: '-' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(17, 12) Hint: '[]' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(14, 12) Hint: '-=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(8, 12) Hint: '+=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(20, 12) Hint: '[]=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(5, 12) Hint: '+' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(11, 12) Hint: '-' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(17, 12) Hint: '[]' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(14, 12) Hint: '-=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(8, 12) Hint: '+=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(20, 12) Hint: '[]=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(5, 12) Hint: '+' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(11, 12) Hint: '-' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(17, 12) Hint: '[]' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(14, 12) Hint: '-=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(8, 12) Hint: '+=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(20, 12) Hint: '[]=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(5, 12) Hint: '+' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(11, 12) Hint: '-' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(17, 12) Hint: '[]' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(14, 12) Hint: '-=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(8, 12) Hint: '+=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(20, 12) Hint: '[]=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(5, 12) Hint: '+' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(11, 12) Hint: '-' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(17, 12) Hint: '[]' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(14, 12) Hint: '-=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(8, 12) Hint: '+=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(20, 12) Hint: '[]=' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(11, 12) Hint: '-' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(24, 12) Hint: '-' is declared but not used [XDeclaredButNotUsed]
/Users/.../Arraymancer/src/utils/pointer_arithmetic.nim(17, 12) Hint: '[]' is declared but not used [XDeclaredButNotUsed]

Display "off-by-one" following iterator rework

test case:

import ../src/arraymancer, sequtils

let a = toSeq(1..24).toTensor.reshape(6,4)

echo a

Result

Tensor of shape 6x4 of type "int" on backend "Cpu"
        1|
|2      3       4       5|
|6      7       8       9|
|10     11      12      13|
|14     15      16      17|
|18     19      20      21|
|22     23      24

Introduced by: 829d8b1

Display of 5D+ tensors

import arraymancer, sequtils

let a = toSeq(1..60).toTensor(Cpu).reshape(3,4,5)

echo a
Tensor of shape 3x4x5 of type "int" on backend "Cpu"
|1      2       3       4|
5|
|
|6      7       8|
|9      10|
|
|11     12|
|13     14      15|
|
|16     |17     18      19      20|
|
|21     22      23      24|
25|
|
|26     27      28|
|29     30|
|
|31     32|
|33     34      35|
|
|36     |37     38      39      40|
|
|41     42      43      44|
45|
|
|46     47      48|
|49     50|
|
|51     52|
|53     54      55|
|
|56     |57     58      59      60|
|

Broadcasting syntax

cc @edubart.

As mentioned in the README, the current broadcasting syntax is not ideal.

let j = [0, 10, 20, 30].toTensor(Cpu).reshape(4,1)
let k = [0, 1, 2].toTensor(Cpu).reshape(1,3)

echo j.bc([4,3]) + k.bc([4,3])
# Tensor of shape 4x3 of type "int" on backend "Cpu"
# |0      1       2|
# |10     11      12|
# |20     21      22|
# |30     31      32|

The Numpy syntax would be j + k.

This is great as it is concise; however, it is probably the source of lots of bug-hunting for new Numpy users when broadcasting is not wanted. The bad part is that the error is silent.

For this reason, I would like to default to mathematical expectations (i.e. an "Incompatible shape" error), and provide an opt-in broadcasting convenience like so:

j.bc + k.bc

Implementation

A term-rewriting macro can detect bc in operations and broadcast one or both arguments to compatible shapes.

newSeqUninitialized regression

The following builds failed on devel only, not on stable.
This is probably due to the usage of the new newSeqUninitialized.

https://travis-ci.org/mratsim/Arraymancer/jobs/279307094

Traceback (most recent call last)
accessors_slicer.nim(467) test_accessors
system.nim(3547)         *=
system.nim(2727)         sysFatal
    Unhandled exception: over- or underflow
  [FAILED] indexing + in-place operator

https://travis-ci.org/mratsim/Arraymancer/jobs/279307258

  [OK] Iterators
    test_accessors.nim(93, 13): Check failed: a == [[0, 0, 0], [0, 200, 0], [0, 0, 0]].toTensor
    a was Tensor of shape 3x3 of type "int" on backend "Cpu"
|4	8	16|
|9	740	81|
|16	64	256|
  [FAILED] indexing + in-place operator

[Suite] Accessing and setting tensor values
[OK] Accessing and setting a single value
[OK] Out of bounds checking
[OK] Iterators
test_accessors.nim(93, 13): Check failed: a == [[0, 0, 0], [0, 200, 0], [0, 0, 0]].toTensor
a was Tensor of shape 3x3 of type "int" on backend "Cpu"
|4 8 16|
|9 740 81|
|16 64 256|
[FAILED] indexing + in-place operator


cc @edubart

Create more descriptive error types

Instead of using ValueError and IndexError everywhere, it is probably more informative to introduce more descriptive error types.

For example:

  • IncompatibleShapeError (for tensors that can't be multiplied together)
  • UnreachableError (for when/if elif construct that should cover all cases)
  • NotImplementedError (for missing features or for proc/method that must be overloaded, e.g. forward and backward)

Tests fail in release mode

No exceptions are thrown when running the tests in release mode, making many tests fail; the suite also randomly crashes with
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Add inplace operators for slices

Suppose x is a 2-dimensional matrix. Currently, to add a to its element at (i, j), we have to do:

x[i,j] = x[i,j] + a

which is not ideal because the index will be computed twice. Instead, much better and easier to use would be:

x[i,j] += a

The following operators are missing and would be good to have: +=, -=, *=, /=. Note that I am talking about an operation on a single element; however, in-place operations on many elements would be a plus.
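
For the single-element case, one hypothetical route (shown on a toy type, not Arraymancer's Tensor) is an indexing operator that returns var T; the compiler then resolves m[i, j] += a against system's += on the returned mutable location, so the index is computed only once:

type Matrix = object
  data: seq[int]
  cols: int

proc `[]`(m: var Matrix, i, j: int): var int =
  # return a mutable reference into the flat storage
  m.data[i * m.cols + j]

var m = Matrix(data: newSeq[int](4), cols: 2)
m[1, 1] += 7
echo m.data   # @[0, 0, 0, 7]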

OpenCL

You seem to use Cuda, but this is an Nvidia-locked solution. What should the owners of AMD video cards do?
Wouldn't it be better to use OpenCL?
I don't know this area well, but in the future I would like to buy a new AMD card and work on machine learning.

Avoid/limit heap allocation/seq (shape/strides) in tight loop

Naive benchmarking shows that "shape_to_strides", seq assignment and seq creation generate a constant, non-negligible overhead: 60% of the time is spent in seq management for the benchmark below (available in the "benchmarks" folder):

import ../src/arraymancer_nn, ../src/arraymancer_ag, ../src/arraymancer

let ctx = newContext Tensor[float32]

let bsz = 32 #batch size

# We will create a tensor of size 3200 --> 100 batch sizes of 32
# We create it as int between [0, 2[ (2 excluded) and convert to bool
let x_train_bool = randomTensor([bsz * 100, 2], 2).astype(bool)

# Let's build or truth labels. We need to apply xor between the 2 columns of the tensors
proc xor_alt[T](x,y: T): T =
  ## xor is builtin and cannot be passed to map as is
  x xor y

let y_bool = map2(x_train_bool[_,0], xor_alt, x_train_bool[_,1])


# Convert to float and transpose so batch_size is last
let x_train = ctx.variable(x_train_bool.astype(float32).transpose)
let y = y_bool.astype(float32).transpose

# First hidden layer of 3 neurons, with 2 features in
# We initialize with random weights between -1 and 1
let layer_3neurons = ctx.variable(
                      randomTensor(3, 2, 2.0f) .- 1.0f
                      )

# Classifier layer with 1 neuron per feature. (In our case only one neuron overall)
# We initialize with random weights between -1 and 1
let classifier_layer = ctx.variable(
                  randomTensor(1, 3, 2.0f) .- 1.0f
                  )

# Stochastic Gradient Descent
let optim = newSGD[float32](
  layer_3neurons, classifier_layer, 0.01f # 0.01 is the learning rate
)

for epoch in 0..100:

  for batch_id in 0..<100:

    # offset in the Tensor (Remember, batch size is last)
    let offset = batch_id * 32
    let x = x_train[_, offset ..< offset + 32]
    let target = y[_, offset ..< offset + 32]

    # Building the network
    let n1 = linear(x, layer_3neurons)
    let n1_act = n1.relu
    let n2 = linear(n1_act, classifier_layer)
    let loss = sigmoid_cross_entropy(n2, target)

    # Compute the gradient (i.e. contribution of each parameter to the loss)
    loss.backprop()

    # Correct the weights now that we have the gradient information
    optim.update()
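
For reference, a rough sketch of what a routine like shape_to_strides has to compute (row-major strides; a hypothetical reimplementation, not the actual code), showing why a heap-allocated seq is produced on every call:

proc shape_to_strides(shape: seq[int]): seq[int] =
  result = newSeq[int](shape.len)   # heap allocation in the hot path
  var accum = 1
  for i in countdown(shape.len - 1, 0):
    result[i] = accum
    accum *= shape[i]

echo shape_to_strides(@[2, 3, 4])   # @[12, 4, 1]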

Trace

Weight	Self Weight		Symbol Name
192.00 ms   60.3%	0 s	 	ex01_xor (23883)
192.00 ms   60.3%	0 s	 	 Main Thread  0x62524
192.00 ms   60.3%	0 s	 	  start
192.00 ms   60.3%	0 s	 	   main
192.00 ms   60.3%	2.00 ms	 	    NimMainModule
57.00 ms   17.9%	1.00 ms	 	     backprop_oVdk9aMLcybHCChVR6aQk5g
51.00 ms   16.0%	0 s	 	      backward_Fqpw21lFPHWNdNORshaa9cA
6.00 ms    1.8%	2.00 ms	 	       map2_RoKBnD6H0IuZSmQLcCMZ1g
4.00 ms    1.2%	0 s	 	        shape_to_strides_vk3EIHePgu5hL4dxsL38tg
4.00 ms    1.2%	2.00 ms	 	         X5BX5D__JoZhL7eQinMhVkOHQyuBhQ
2.00 ms    0.6%	0 s	 	          newSeq
12.00 ms    3.7%	1.00 ms	 	       map_7sNmoBM0FdFmPtxsYkKirA
2.00 ms    0.6%	0 s	 	        at__7KXK9aqsdE0ndrhB7KxewvA
2.00 ms    0.6%	0 s	 	         newSeq
1.00 ms    0.3%	0 s	 	        nimNewSeqOfCap
8.00 ms    2.5%	0 s	 	        shape_to_strides_vk3EIHePgu5hL4dxsL38tg
3.00 ms    0.9%	0 s	 	         X5BX5D__JoZhL7eQinMhVkOHQyuBhQ
3.00 ms    0.9%	0 s	 	          newSeq
4.00 ms    1.2%	1.00 ms	 	         amp__YMHcPoBMZP9bnnIcR8Iy9cUQ
3.00 ms    0.9%	2.00 ms	 	          newSeq
1.00 ms    0.3%	0 s	 	         genericSeqAssign
8.00 ms    2.5%	0 s	 	       reversed_S4WoGleqxGOb3jjUvFKyfA
8.00 ms    2.5%	0 s	 	        newSeq
24.00 ms    7.5%	2.00 ms	 	       star__xQtjCZX3EuzWXnc0t9bCM2w
1.00 ms    0.3%	0 s	 	        newSeq
4.00 ms    1.2%	0 s	 	        nimNewSeqOfCap
1.00 ms    0.3%	1.00 ms	 	        setLengthSeq
16.00 ms    5.0%	2.00 ms	 	        unsafeContiguous_Nck5nnO9bAJVgi7JMAI8knA
14.00 ms    4.4%	0 s	 	         genericSeqAssign
1.00 ms    0.3%	0 s	 	       unsafeBroadcast_9aZErpPXuidMn9cbDgghwtrg
1.00 ms    0.3%	0 s	 	        genericSeqAssign
1.00 ms    0.3%	0 s	 	      genericSeqAssign
1.00 ms    0.3%	0 s	 	      newSeq
3.00 ms    0.9%	0 s	 	      shape_to_strides_vk3EIHePgu5hL4dxsL38tg
2.00 ms    0.6%	0 s	 	       amp__YMHcPoBMZP9bnnIcR8Iy9cUQ
2.00 ms    0.6%	0 s	 	        newSeq
1.00 ms    0.3%	0 s	 	       genericSeqAssign
1.00 ms    0.3%	1.00 ms	 	     incrSeqV2
21.00 ms    6.6%	1.00 ms	 	     linear_9aa0crPUgSzc3WGSOrKwanw
20.00 ms    6.2%	0 s	 	      forward_pdrb9bebPpNDv5TauQs8LOgex01_xor
1.00 ms    0.3%	0 s	 	       newSeq
1.00 ms    0.3%	1.00 ms	 	       newSeq_mi9afQ1klNXRFnVSLwJV9aVg
5.00 ms    1.5%	0 s	 	       shape_to_strides_vk3EIHePgu5hL4dxsL38tg
3.00 ms    0.9%	0 s	 	        X5BX5D__JoZhL7eQinMhVkOHQyuBhQ
3.00 ms    0.9%	0 s	 	         newSeq
1.00 ms    0.3%	0 s	 	        amp__YMHcPoBMZP9bnnIcR8Iy9cUQ
1.00 ms    0.3%	0 s	 	         newSeq
1.00 ms    0.3%	0 s	 	        genericSeqAssign
13.00 ms    4.0%	0 s	 	       star__xQtjCZX3EuzWXnc0t9bCM2w
1.00 ms    0.3%	0 s	 	        newSeq
12.00 ms    3.7%	1.00 ms	 	        unsafeContiguous_Nck5nnO9bAJVgi7JMAI8knA
11.00 ms    3.4%	1.00 ms	 	         genericSeqAssign
1.00 ms    0.3%	0 s	 	     randomTensor_0CLBTaXQo1slbLknroeFow
1.00 ms    0.3%	0 s	 	      shape_to_strides_vk3EIHePgu5hL4dxsL38tg
1.00 ms    0.3%	0 s	 	       genericSeqAssign
11.00 ms    3.4%	0 s	 	     relu_SZbqcSLLEQfQBKnhXXEa0w
7.00 ms    2.2%	0 s	 	      forward_afTd72d9apMokICtszjdsPAex01_xor
1.00 ms    0.3%	0 s	 	       at__7KXK9aqsdE0ndrhB7KxewvA
1.00 ms    0.3%	0 s	 	        newSeq
4.00 ms    1.2%	2.00 ms	 	       map_7sNmoBM0FdFmPtxsYkKirA
1.00 ms    0.3%	0 s	 	        nimNewSeqOfCap
1.00 ms    0.3%	0 s	 	        shape_to_strides_vk3EIHePgu5hL4dxsL38tg
1.00 ms    0.3%	0 s	 	         X5BX5D__JoZhL7eQinMhVkOHQyuBhQ
1.00 ms    0.3%	1.00 ms	 	          newSeq
2.00 ms    0.6%	1.00 ms	 	       shape_to_strides_vk3EIHePgu5hL4dxsL38tg
1.00 ms    0.3%	0 s	 	        genericSeqAssign
4.00 ms    1.2%	0 s	 	      genericSeqAssign
50.00 ms   15.7%	0 s	 	     sigmoid_cross_entropy_Yau9cGp7xu7MB2nxk5jZa9cQ
29.00 ms    9.1%	2.00 ms	 	      forward_eiS5bzXq9cybpN9bAe3jgt5Aex01_xor
2.00 ms    0.6%	0 s	 	       shape_to_strides_vk3EIHePgu5hL4dxsL38tg
1.00 ms    0.3%	0 s	 	        amp__YMHcPoBMZP9bnnIcR8Iy9cUQ
1.00 ms    0.3%	0 s	 	         newSeq
1.00 ms    0.3%	0 s	 	        newSeq
25.00 ms    7.8%	0 s	 	       toTensor_PDoWBw7dhWertuPrbd3nqQ
8.00 ms    2.5%	0 s	 	        amp__YMHcPoBMZP9bnnIcR8Iy9cUQ
8.00 ms    2.5%	1.00 ms	 	         newSeq
3.00 ms    0.9%	0 s	 	        genericSeqAssign
1.00 ms    0.3%	0 s	 	        incrSeqV2
1.00 ms    0.3%	1.00 ms	 	         growObj_FZeyQYjWPcE9c06y1gNqZxw
4.00 ms    1.2%	2.00 ms	 	        newSeq
9.00 ms    2.8%	0 s	 	        shape_to_strides_vk3EIHePgu5hL4dxsL38tg
1.00 ms    0.3%	0 s	 	         X5BX5D__JoZhL7eQinMhVkOHQyuBhQ
1.00 ms    0.3%	1.00 ms	 	          newSeq
8.00 ms    2.5%	0 s	 	         amp__YMHcPoBMZP9bnnIcR8Iy9cUQ
8.00 ms    2.5%	0 s	 	          newSeq
21.00 ms    6.6%	0 s	 	      genericSeqAssign
26.00 ms    8.1%	2.00 ms	 	     slicer_BD1F1oU9a9cLZM9aHXZ2JbVKw_2
24.00 ms    7.5%	0 s	 	      genericSeqAssign
2.00 ms    0.6%	1.00 ms	 	     unsafeSlicer_BD1F1oU9a9cLZM9aHXZ2JbVKw
1.00 ms    0.3%	0 s	 	      genericSeqAssign
21.00 ms    6.6%	1.00 ms	 	     update_4t6MKNnjrt9b9cUtnIk3Iizg
16.00 ms    5.0%	1.00 ms	 	      map_7sNmoBM0FdFmPtxsYkKirA
1.00 ms    0.3%	0 s	 	       at__7KXK9aqsdE0ndrhB7KxewvA
1.00 ms    0.3%	0 s	 	        newSeq
14.00 ms    4.4%	0 s	 	       shape_to_strides_vk3EIHePgu5hL4dxsL38tg
9.00 ms    2.8%	0 s	 	        X5BX5D__JoZhL7eQinMhVkOHQyuBhQ
9.00 ms    2.8%	0 s	 	         newSeq
1.00 ms    0.3%	0 s	 	        amp__YMHcPoBMZP9bnnIcR8Iy9cUQ
1.00 ms    0.3%	1.00 ms	 	         newSeq
2.00 ms    0.6%	0 s	 	        genericSeqAssign
2.00 ms    0.6%	0 s	 	        newSeq
4.00 ms    1.2%	0 s	 	      shape_to_strides_vk3EIHePgu5hL4dxsL38tg
3.00 ms    0.9%	2.00 ms	 	       X5BX5D__JoZhL7eQinMhVkOHQyuBhQ
1.00 ms    0.3%	0 s	 	        newSeq
1.00 ms    0.3%	0 s	 	       genericSeqAssign

unable to load

Hello,

This package is exactly what I have been looking for but unfortunately I have been unable to load it. I've tried installing using the nimble package manager, with apparent success:

(screenshot omitted)

And tried setting up a simple test:

(screenshot omitted)

However the compilation fails with:

(screenshot omitted)

I am able to open other nimble-installed packages, so the problem appears specific to arraymancer. Am I missing something?

Thanks,

Tom

Add a "stack" function

Basically, concatenate tensors along a new axis.

i.e. a stack of 2D matrices creates a 3D tensor
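For illustration, the requested semantics can be emulated with reshape and concat; a dedicated stack proc (its name and signature are hypothetical here) would do this in one call:

import arraymancer

let a = [[1, 2], [3, 4]].toTensor          # shape [2, 2]
let b = [[5, 6], [7, 8]].toTensor          # shape [2, 2]

# Emulation with existing procs: add a leading axis, then concatenate along it
let stacked = concat(a.reshape(1, 2, 2), b.reshape(1, 2, 2), axis = 0)  # shape [2, 2, 2]

# Hypothetical one-call equivalent:
# let stacked = stack([a, b], axis = 0)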

Auto-fuse operations (alpha X + Y, alpha A * B + beta C ...)

Arraymancer can leverage the Nim compiler's term-rewriting macros to automatically detect operations that can be fused.

This is probably similar to what TensorFlow is doing with its XLA compiler.
See: https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
and the XLA overview.

A term-rewriting example is already included with fusing toTensor + reshape operations:

template rewriteToTensorReshape*{reshape(toTensor(oa, dummy_bugfix), shape)}(
  oa: openarray,
  shape: varargs[int],
  dummy_bugfix: static[int]): auto =
  ## Fuse ``sequence.toTensor(Backend).reshape(new_shape)`` into a single operation.
  toTensorReshape(oa, shape, dummy_bugfix)
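As a further (hypothetical) illustration of the idea, a rewrite rule could fuse alpha * X + Y into a single traversal. fused_axpy and rewriteAxpy below are illustrative names, not existing Arraymancer procs; this is a sketch of what such a rule might look like:

proc fused_axpy(alpha: float32, x, y: Tensor[float32]): Tensor[float32] =
  ## Compute alpha * x + y in one pass over the data.
  map2(x, proc(xi, yi: float32): float32 = alpha * xi + yi, y)

template rewriteAxpy*{alpha * x + y}(alpha: float32,
                                     x, y: Tensor[float32]): auto =
  ## Rewrite ``alpha * x + y`` into the fused kernel above.
  fused_axpy(alpha, x, y)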

Tensor higher order functions syntax: aggregate, map, fmap, fold, reduce, apply ...

There are several aggregation/higher order functions that have various names depending on the language/library.

It would be nice for the naming to be (in no particular order):

  1. Descriptive and unambiguous

  2. Consistent within Nim ecosystem:

    • sequtils
      - map to apply a function to all elements creating a new object |-> Arraymancer fmap
      - apply to apply a function in-place to all elements. |-> Arraymancer apply pending #40
    • Nimdata
      - map to apply a function to all elements creating a new object |-> Arraymancer fmap
      - reduce to aggregate on all elements of an object |-> Arraymancer agg
      - fold to aggregate on all elements of an object |-> Arraymancer agg_inplace
    • neo
      - map to apply a function to all elements creating a new object |-> Arraymancer fmap
  3. Familiar, if possible, to users coming from other languages and non-Nim libraries, for example (non-exhaustive):

    • Pandas
      • map to apply a function to all elements of a series (column/row) |-> No series concept in Arraymancer (use fmap)
      • applymap to apply a function to all elements of a dataframe |-> Arraymancer fmap
      • apply to apply a function along an axis |-> Arraymancer agg with axis argument
      • agg/aggregate to aggregate over an axis (semantics are different from apply on pandas GroupBy objects which Arraymancer doesn't have) |-> Arraymancer agg
    • Functional languages and Map/Reduce jargon. See Wikipedia fold comparison
      • Haskell defines map for Lists and fmap (and liftM) for Monads (generic containers)
      • Haskell's foldr/foldl take an initial argument/accumulator value; foldl1/foldr1 don't
      • F# has fold/foldBack (init val) and reduce/reduceBack (no init val)
      • Scala has foldLeft/foldRight (init val) and reduceLeft/reduceRight (no init val).

Proposition:

  • Abandon fmap for map, as in NimData, neo and sequtils. In any case, fmap in Haskell had interesting controversies
  • fmap_inplace in PR #40 will be called apply, to be consistent with sequtils. cc @edubart
  • agg, which is a name with poor discoverability/familiarity, will be changed to reduce, similar to NimData, F# and Scala
  • agg_inplace (with init value) will be changed to fold (see the usage sketch below)

cc @bluenote10 and @andreaferretti. If you have any input to make the Nim scientific ecosystem better, or at least consistent, feel free to chime in.
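A short usage sketch under the proposed names (map / apply / fold / reduce); the exact signatures here are assumptions for illustration only:

import arraymancer

let t = [1, 2, 3, 4].toTensor
let doubled = t.map(proc(x: int): int = x * 2)   # new tensor, previously fmap
var u = t.clone
u.apply(proc(x: int): int = x + 1)               # in-place, previously fmap_inplace
let total = t.fold(0, `+`)                       # aggregation with an initial value, previously agg_inplace
let product = t.reduce(`*`)                      # aggregation without an initial value, previously agg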

Missing useful functions

I will list here some functions that I miss in the API, and I will keep updating this list as I need them. To start, here are some useful functions that I will potentially need. This is also open for discussion on their naming or whether they should be in the core.

I am going to mark really important ones for now in bold.

General:

  • at(tensor, i, j, ...) => access subdimensional tensors #55
  • squeeze(tensor, [dim]) => remove a dimension from the tensor (see the shape sketch after these lists)
  • unsqueeze(tensor, dim) => add a dimension to the tensor
  • size(tensor) => return number of elements in the tensor
  • copy(tensor) => copy contents from tensor, this is different from assignment (e.g. we may want to copy to a view), number of elements in both tensors must match
  • flatten(tensor) => convert a tensor to a vector
  • fill(tensor, value) => fill tensor elements with the given value
  • save(tensor, filename) => save tensor to file
  • load(filename) => load tensor from file
  • stack(tensors) => combine an array of tensors into a tensor of rank + 1

No copies operations:

  • unsafeToTensor(seq) => convert a seq to a tensor without copying, useful when loading without seq duplication
  • unsafeToTensorReshape(seq) => same as above, but does reshape
  • unsafeAt(tensor, i, j, ...) => like at, but no copies
  • unsafeSqueeze(tensor, [dim]) => remove dimension from tensor, but no copies
  • unsafeUnsqueeze(tensor, dim) => add dimension to tensor, but no copies
  • unsafeTranspose => transpose with no copies when possible
  • unsafePermute => permute with no copies when possible
  • unsafeFlatten => convert a tensor to a vector with no copies when possible
  • unsafeBroadcast => broadcasting with no copies
  • unsafeAsContiguous => like asContiguous, but no copy when it is already contiguous

Simple Math:

  • randomNormalTensor(tensor) => returns a random tensor in the standard normal distribution
  • abs(tensor) => returns abs on all elements
  • sum(tensor, [axes]) => sum over many axes
  • min(tensor, [axis]) => returns minimum in the given axis
  • max(tensor, [axis]) => returns maximum in the given axis
  • std(tensor, [axis]) => returns standard deviation along the given axis
  • var(tensor, [axis]) => returns variance along the given axis
  • prod(tensor, [axis]) => returns product of elements in the given axis
  • norm(tensor, axis) => returns the 2-norm of a 2d tensor along the given axis
  • pnorm(tensor, axis, p) => returns the p-norm of a 2d tensor along the given axis
  • pow(tensor, v)
  • square => pow for power of 2, is it faster to just do x*x instead of pow(x,2.0f) ?
  • clamp(tensor, a, b) => clamp values of tensor between a and b
  • mclamp(tensor, a, b) => like above, but in-place
  • msqrt, mln, msin, mround, ... all the common element-wise math functions but in-place

Linear algebra:

  • eye(n, [m]) => returns an NxM identity matrix, a must for doing linear algebra in general
  • batchMatmul(tensor, tensor) => batch matrix multiplication for tensors with rank >= 3, useful for doing batches of convolution for example
  • inverse(tensor) => inverse of a matrix, useful for doing closed form of linear regression with few features for example
  • svd(tensor) => singular value decomposition, useful for doing PCA (principal component analysis) and dimensionality reduction of features for example
  • eig(tensor) => compute eigenvalues and eigenvectors of a square matrix, also useful for dimensionality reduction
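To make the intent of the shape-manipulation requests above concrete (squeeze, unsqueeze, flatten, stack), here is a small sketch expressed with reshape, which already exists; the proposed proc names themselves remain hypothetical:

import arraymancer

let t = [[1, 2], [3, 4]].toTensor        # shape [2, 2]
let unsqueezed = t.reshape(1, 2, 2)      # "unsqueeze" at axis 0 -> shape [1, 2, 2]
let squeezed = unsqueezed.reshape(2, 2)  # "squeeze" axis 0 back  -> shape [2, 2]
let flat = t.reshape(4)                  # "flatten"              -> shape [4]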

Critical: Add tests to autograd, nn_primitives and nn

Autograd, nn_primitives and nn are in master. To bring them to the same standard as the core tensor library, they need tests:

  • unit tests for individual pieces (like the derivative of linear: Weight * input + bias); a small gradient-check sketch follows this list

    • Derivatives and cost functions in particular are critical.
  • Full pipeline test (learning XOR, or a small dataset like Iris / dogs vs cats). Note: loading those datasets should not depend on modules that depend on Arraymancer.

    • Tests should catch convergence regression
    • A performance and memory benchmark would be nice to have, to profile the library regularly. Continuous integration would be terrific, but a file in a "benchmarks" folder with a history of commit + CPU/GPU + compilation flags would be a great start.
  • End-to-end integration tests with IO libraries like arraymancer-vision (and later csv loading, etc).

    • Make sure we don't silently break Arraymancer for them
    • Make sure they don't break assumptions we rely on, like color channel being in CHW order.
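As a concrete illustration of the unit-test point above, a minimal gradient-check test could compare an analytical derivative against a finite-difference estimate; the helper and tolerance below are illustrative, not Arraymancer APIs:

import unittest

proc numericalGrad(f: proc(x: float): float, x: float, h = 1e-5): float =
  ## Central finite-difference approximation of f'(x).
  (f(x + h) - f(x - h)) / (2.0 * h)

suite "derivative checks":
  test "d/dx (x^2) = 2x at x = 1.5":
    proc square(x: float): float = x * x
    check abs(numericalGrad(square, 1.5) - 3.0) < 1e-6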

Explore write-tracking/escape analysis to avoid copying seq to closure

Not implemented: https://nim-lang.org/docs/manual.html#effect-system-read-write-tracking

Reference ticket: https://github.com/nim-lang/Nim/issues/3377

Solution 3.
In the example above we could mark s as a lifted variable. This would change the signature of giveMeProc from giveMeProc(s: seq[int]): proc() to giveMeProc(s: mustBeOnTheHeap seq[int]): proc(). We also need to keep track of the variables: giveMeProc(@[1,2,3]) is a call with a parameter which is static, therefore we need to copy it before the function call.
This solution is not that easy to implement and also has some bad implications. In the following case:

proc wrapper(s: seq[int]): proc() =
  result = giveMeProc(s)

we need to realize that the parameter of wrapper needs to be on the heap. Moreover, at the implementation of wrapper we might not know the actual implementation of giveMeProc; therefore we actually need to introduce an additional keyword to the language (to indicate that the parameter s needs to be on the heap at forward declarations).

Solution 3 is essentially:

proc wrapper(s: seq[int]): proc () {.escapes: [s].}

That's why I told you to look at my writetracking. ;-)
Escapes: [s] means you cannot pass a static seq to it.

Implement optimized value semantics for CudaTensor

Currently CudaTensor data is shallow-copied by default. From a consistency point of view it would be best if both Tensor and CudaTensor had the same behaviour.

Unfortunately, while waiting for nim-lang/Nim#6348, even constructing a CudaTensor will create an unnecessary GPU copy.

Implement value semantics

proc `=`*[T](dest: var CudaTensor[T]; src: CudaTensor[T]) =
  ## Overloading the assignment operator
  ## It will have value semantics by default
  new(dest.data_ref, deallocCuda)
  dest.shape = src.shape
  dest.strides = src.strides
  dest.offset = src.offset
  dest.len = src.len
  dest.data_ref[] = cudaMalloc[T](dest.len)
  let size = dest.len * sizeof(T)
  check cudaMemCpy(dest.get_data_ptr,
                   src.get_data_ptr,
                   size,
                   cudaMemcpyDeviceToDevice)
  echo "Value copied"

Move optimization

proc `=`*[T](dest: var CudaTensor[T]; src: CudaTensor[T]{call}) {.inline.} =
  ## Overloading the assignment operator
  ## Optimized version that knows that
  ## the source CudaTensor is unique and thus doesn't need to be copied
  system.`=`(dest, src)
  echo "Value moved"

Target Javascript

Nim can compile to JavaScript. How nice would it be for Arraymancer to compile to JavaScript too.
There are 2 potential usages for a JS target with different needs:

  • Server (or local computing) with Node.js access
  • Client, directly in the web browser

Common:

  • Several Nim features are not implemented in JS, for example closure iterators.

Server:

  • JS-C BLAS bindings are available for speed: see https://github.com/mateogianolio/nblas
    Arraymancer already has generic fallback routines for integers that should work on the JS backend with minimal changes
  • JS-CUDA bindings are available for GPU acceleration: see https://github.com/kashif/node-cuda
    What would be awesome would be for Nim to automatically bind to both JS and C

Client:
We can't expect clients to have BLAS or Cuda installed so for acceleration:

  • JS-WebGL bindings for GPU acceleration: see GPU.js
  • WebAssembly, vaporware (?)

Also: check TensorFire https://tenso.rs/ (not open source?)

Ellipsis slicing syntax

Just like NumPy, allow using the ellipsis ... when slicing, as in x[..., 0], expanding ... to as many full spans _ as needed to match the tensor rank in the slice operation. For example, if x has rank 3:

x[..., 0] => x[_, _, 0]
x[0, ...] => x[0, _, _]
x[..., 0, 0, 0] => x[0, 0, 0]
x[0, 0, 0, ...] => x[0, 0, 0]

Optionally, although maybe less used:

x[0, ..., 0] => x[0, _, 0]
x[0, 0, ..., 0] => x[0, 0, 0]

Some slicing syntaxes fail inside generics

import arraymancer

proc test[T](t: Tensor[T]) =
  discard t[^1..0|-1] # fails
  discard t[0..1] # fails
  discard t[0..^1] # fails
  discard t[0..1|1] # works
  discard t[_] # works
  discard t[0] # works

zeros([2], int).test()
