
unda's Introduction


Unda

General purpose neural network crate


Unda aims to bring the future of deep learning to the world of Rust. With dynamic input traits, concurrent minibatch processing, and full Dense network support, Unda is quickly emerging as a way to make neural network development easy and blazingly fast.

Installation

  1. Identify the latest compatible versions of CUDA and cuDNN. Adapt these instructions to install those two versions of CUDA and cuDNN together.

  2. Install clang and libclang1.

  3. Download and extract xla_extension.

  4. Make sure LD_LIBRARY_PATH includes /path/to/xla_extension/lib, and make sure the relevant CUDA paths are also visible to the system.

Use the package manager cargo to add unda to your Rust project.

cargo add unda

or add the dependency directly in your Cargo.toml file:

[dependencies]
unda = "{version}"

Usage

use unda::core::network::Network;
use unda::core::layer::{methods::activations::Activations, layers::{LayerTypes, InputTypes}};
use unda::core::data::input::Input;
use unda::core::layer::{methods::errors::ErrorTypes};

fn main() {
    // XOR truth table: each input pair maps to its expected output
    let inputs = vec![vec![0.0,0.0],vec![1.0,0.0],vec![0.0,1.0], vec![1.0,1.0]];
    let outputs = vec![vec![0.0],vec![1.0],vec![1.0], vec![0.0]];

    let mut new_net = Network::new(4);

    new_net.set_input(InputTypes::DENSE(2));
    new_net.add_layer(LayerTypes::DENSE(3, Activations::RELU, 0.001));
    new_net.add_layer(LayerTypes::DENSE(1, Activations::SIGMOID, 0.001));

    new_net.compile();

    new_net.fit(&inputs, &outputs, 2, ErrorTypes::MeanAbsolute);

    println!("1 and 0: {:?}", new_net.predict(&vec![1.0,0.0])[0]);
    println!("0 and 1: {:?}", new_net.predict(&vec![0.0,1.0])[0]);
    println!("1 and 1: {:?}", new_net.predict(&vec![1.0,1.0])[0]);
    println!("0 and 0: {:?}", new_net.predict(&vec![0.0,0.0])[0]);

    new_net.save("best_network.json");
}

Examples

The unda repository hosts a plethora of example ML models that tackle a series of common problems. These examples can be found in the /examples folder and can be run by entering:

cargo run --release --example {example_name}

where example_name is the name of the file/folder you wish to run, omitting the .rs extension.

Currently, Unda has example implementations for XOR, MNIST, and a breast cancer model based on a Kaggle dataset.

Important! When running the MNIST example, please make sure to put the appropriate ubyte files into the /src/util/mnist directory of this repository. We are currently working on using reqwest to automatically build the dataset, but for now it must be done manually.

Implications for the future of ML

Using the built-in Input trait, practically any data type can be mapped to an input for a neural network without the need for cutting corners, and the inner trait for layers allows for a plug-and-play style of neural network development. Currently, Unda has full support for Dense layers, Adam optimization for backpropagation, activation functions (Sigmoid, TanH, ReLU, and LeakyReLU), and even loss analysis per model and per layer.

Gradient descent can currently run either synchronously as stochastic gradient descent or asynchronously through minibatch gradient descent.

If open source development is your thing, we at Unda would love additional work on anything that can be implemented. Please contact [email protected] if you'd like to help out!

License

Licensed under either the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0) or the MIT license (http://opensource.org/licenses/MIT), at your option. This file may not be copied, modified, or distributed except according to those terms.

unda's People

Contributors

bradeneverson, ebanflo42, atlv24


unda's Issues

Integrate automatic differentiation with XLA conversion

My best-case scenario for the next two weeks is that we get basic SGD up and running for simple dense networks (using XLA). The first major hurdle for this is not just differentiating the whole compute graph but identifying which parameter nodes the user wants gradients for. After that we add bindings to more XLA operations like matrix multiplication and activation functions and specify their differentiation rules, which is simple. Finally we should talk a bit about optimizers, but that is for another issue.

xla-rs defines "constants" and "parameters" in a pretty intuitive way: constants are baked into the executable whereas parameters are specified at every execution. Our current setup with the compute graphs reflects this design, which I'm happy with. The thing is, not everything which is specified at every execution should be differentiated. I saw Braden's email to ro, and this is highlighted by the question of how we distinguish "input", which is not differentiated but does change at every XLA call.

I would personally appreciate the following sort of API design (but haven't thought it through entirely):

let mut ctx = Context::new();
let learning_rate = ctx.scalar(-0.001);
let x = ctx.parameter("x", SmallVec::new()); // think of this as neural network weights
let y = ctx.parameter("y", SmallVec::new()); // think of this as neural network inputs
let loss = ctx.add(ctx.mul(x, x), y); // meaningless computation but that's not the point
let grad = ctx.gradient(x); // not yet implemented, this just gets the derivative with respect to x
let updates = ctx.mul(learning_rate, grad); // loss does not depend on this
// figure out how to apply updates when we talk about optimizers
let xla_executable = ctx.compile();

My idea is that ctx.gradient creates a node in the compute graph which is a placeholder for the gradient of a parameter. I think there's an issue with how the backend is currently set up relative to this: compile basically expects the loss node of the compute graph, and certainly that makes sense for autodiff, but overall we want the computation to potentially encapsulate computations on which the loss does not depend (note that updates depends on loss, not the other way around, so passing ctx.compile(loss) as we have it now would miss the multiplication of the learning rate by the gradient). In summary, compile should take a loss node but not view it as the endpoint of the computation; instead it should use autodiff to connect placeholder Gradient nodes to actual computations, which can then be converted to XLA.

This opens up the question of how the user specifies what tensors the XLA executable should return (usually you are interested in network predictions, loss, parameter updates, and maybe some auxiliary data), I will sleep on this. Interested in you guys' thoughts.

Get rid of build warnings

We have a lot of useless build warnings. At some point someone should go through and tidy this up.

Known Issue: Backprop working improperly when multiple activation functions are mixed

This issue seems to stem from the fact that we are using data from a previous layer to update the weights and biases of the current layer. When changing the gradient we should likely be using a different activation function than the one we are, but I cannot figure out which one. I have tried using the current layer's and the previous layer's, so perhaps we have to use the next layer's activation function to move forward, which would make the most sense.

Output nodes

We need a way to have a computation with multiple outputs. Add an operation type which is like Parameter but also holds a NodeIdentifier for the computation it outputs.
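
A minimal sketch of what such a variant could look like, assuming a compute-graph enum roughly like the one discussed in the autodiff issue; Operation, Output, and NodeIdentifier(usize) here are illustrative stand-ins, not unda's actual types:

#[derive(Debug, Clone, Copy)]
struct NodeIdentifier(usize);

#[allow(dead_code)]
#[derive(Debug)]
enum Operation {
    Constant(f32),
    Parameter(String),
    // Like Parameter, but also records which computation in the graph it
    // exposes as an output of the compiled executable.
    Output { name: String, of: NodeIdentifier },
}

fn main() {
    let loss_output = Operation::Output { name: "loss".to_string(), of: NodeIdentifier(42) };
    println!("{:?}", loss_output);
}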

Support all XLA element types

This should be a pretty simple one; xla-rs has an enum:

pub enum ElementType {
    Pred,
    S8,
    S16,
    S32,
    S64,
    U8,
    U16,
    U32,
    U64,
    F16,
    F32,
    Bf16,
    F64,
    C64,
    C128,
}

Pred is bool and C stands for complex. This should mesh easily if the numpy importing is already working. This can also be solved in the same branch as #28 and #29.

Rethink error handling?

Currently, all of our tests are littered with .expect, which is really ugly. I would vote to panic from within all user-facing functions. I'm not really sure what the API would gain from doing anything else.

Automatic layer construction + initialization

We should have utility functions for constructing dense/convolutional layers (eventually more complex layers like LSTM or multihead attention), which take a context, input node identifiers, and initialization instructions, and return node identifiers for the layer output and the parameters.

A basic example of this is visible in the mnist_xla example.

Initializers should use XLA RNGs.
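
A rough sketch of the kind of signature such a helper could take; Context, NodeIdentifier, Initializer, and DenseHandles all stand in for whatever the real types end up being, and the body is a placeholder rather than a working graph builder:

struct Context; // stand-in for the real compute-graph context

#[derive(Debug, Clone, Copy)]
struct NodeIdentifier(usize);

#[allow(dead_code)]
enum Initializer {
    Zeros,
    // He-style normal init; the seed would feed an XLA RNG op.
    HeNormal { seed: u64 },
}

// What the helper hands back: the layer output plus its new parameters.
struct DenseHandles {
    output: NodeIdentifier,
    weights: NodeIdentifier,
    bias: NodeIdentifier,
}

fn dense(_ctx: &mut Context, input: NodeIdentifier, _units: usize, _init: Initializer) -> DenseHandles {
    // A real implementation would create weight/bias parameter nodes, emit an
    // XLA RNG op for initialization, then add matmul + bias-add nodes to the
    // context. Placeholder identifiers keep this sketch compilable.
    DenseHandles { output: input, weights: NodeIdentifier(0), bias: NodeIdentifier(1) }
}

fn main() {
    let mut ctx = Context;
    let x = NodeIdentifier(100); // pretend this is the network input node
    let layer = dense(&mut ctx, x, 128, Initializer::HeNormal { seed: 0 });
    println!("{:?} {:?} {:?}", layer.output, layer.weights, layer.bias);
}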

Proper iteration for Mini Batches

We need to alter the fit method to create mini batches based on batch size, then take each batch in parallel and generate the gradients for each input/output pair in parallel again. Once every thread has a full vector of gradients, generate the average gradient and finally update the weights and biases according to this new vector of average gradients across batches.

Asynchronously generating data from every layer of forward propagation and asynchronously gathering the gradient are currently implemented; I just need to look into the best way to execute the batches in parallel given that the layers are boxed dyn traits, which tokio and rayon don't like very much.
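
For reference, a host-side sketch of the averaging scheme described above using rayon; grad_fn is a hypothetical per-sample gradient function, and none of this is the actual fit implementation:

use rayon::prelude::*;

// Average per-sample gradients inside each batch in parallel, then average
// the batch gradients into a single update direction.
fn minibatch_gradient(
    inputs: &[Vec<f32>],
    outputs: &[Vec<f32>],
    batch_size: usize,
    grad_fn: impl Fn(&[f32], &[f32]) -> Vec<f32> + Sync,
) -> Vec<f32> {
    let n_params = grad_fn(&inputs[0], &outputs[0]).len();
    let batch_grads: Vec<Vec<f32>> = inputs
        .par_chunks(batch_size)
        .zip(outputs.par_chunks(batch_size))
        .map(|(xs, ys)| {
            // Sum the per-sample gradients of this batch, then average.
            let mut sum = vec![0.0; n_params];
            for (x, y) in xs.iter().zip(ys) {
                for (s, g) in sum.iter_mut().zip(grad_fn(x, y)) {
                    *s += g;
                }
            }
            sum.iter().map(|s| *s / xs.len() as f32).collect()
        })
        .collect();
    // Average across batches.
    let mut avg = vec![0.0; n_params];
    for g in &batch_grads {
        for (a, v) in avg.iter_mut().zip(g) {
            *a += *v / batch_grads.len() as f32;
        }
    }
    avg
}

fn main() {
    // Dummy per-sample "gradient": the input scaled by the target (illustrative only).
    let grad = |x: &[f32], y: &[f32]| x.iter().map(|v| v * y[0]).collect::<Vec<f32>>();
    let inputs = vec![vec![1.0, 2.0]; 10];
    let outputs = vec![vec![0.5]; 10];
    let g = minibatch_gradient(&inputs, &outputs, 4, grad);
    println!("averaged gradient: {:?}", g);
}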

bf16 problem

Something somewhere is making the bf16 conversion fail. I'll be checking it out.

Dynamic batch sizes

The MNIST example is too nice insofar as both the train and test set have a number of samples divisible by the batch size of 100. In general we should not assume this is the case. Our abstract model API should support some form of dynamic batch size.

I am not sure what the best approach for this is yet. However, if we assume that the model is executing on a fixed batch size and that only once per epoch will it receive a differing batch size, then when we average the loss we could multiply by a mask along the batch axis (1s where there are samples, 0s where there are not) and divide by the sum of the mask. If we do it like this, maybe it should be opt-in, since it does introduce a few extra floating point operations.
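
A tiny host-side illustration of the proposed masked average (the real version would be XLA ops along the batch axis):

// Masked mean of per-sample losses for a partially filled final batch:
// 1s mark real samples, 0s mark padding, so the average ignores the padding.
fn masked_mean_loss(losses: &[f32], mask: &[f32]) -> f32 {
    let weighted: f32 = losses.iter().zip(mask).map(|(l, m)| l * m).sum();
    let count: f32 = mask.iter().sum();
    weighted / count
}

fn main() {
    // Fixed batch size of 4, but only 3 real samples in the last batch.
    let losses = [0.2, 0.4, 0.6, 123.0]; // last entry is padding garbage
    let mask = [1.0, 1.0, 1.0, 0.0];
    assert!((masked_mean_loss(&losses, &mask) - 0.4).abs() < 1e-6);
}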

Operations for Dense Networks

We still need to bind and write differentiation rules for the following operations to train dense networks:
  • matmul
  • transpose
  • log
  • div
  • pow
  • reduce_sum
  • reduce_mean
In order to differentiate the last two we need a repeat or tile operation, which is apparently not yet bound in xla-rs. We also need a scatter operation in order to differentiate reduce_max and slice_in_dim, so I might look into adding bindings to xla-rs.
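
As a concrete reference for why the tile/repeat op is needed, here is the gradient rule for reduce_sum and reduce_mean written out on plain vectors: the upstream gradient is simply broadcast ("tiled") back to the input shape, scaled by 1/len for the mean.

// Gradient of sum(x) w.r.t. x: repeat the upstream scalar across x's shape.
fn reduce_sum_grad(upstream: f32, len: usize) -> Vec<f32> {
    vec![upstream; len]
}

// Gradient of mean(x) w.r.t. x: same broadcast, divided by the length.
fn reduce_mean_grad(upstream: f32, len: usize) -> Vec<f32> {
    vec![upstream / len as f32; len]
}

fn main() {
    assert_eq!(reduce_sum_grad(1.0, 3), vec![1.0, 1.0, 1.0]);
    assert_eq!(reduce_mean_grad(1.0, 4), vec![0.25; 4]);
}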

Tensor Constants

I only implemented scalar, vector, and matrix for Context. Arbitrary dimension constants are supported, they just arent exposed by any API right now.

Common term extraction

Traverse the graph building a hashmap of nodes to node identifiers. Upon insertion, if there's already an entry, replace references to the current node with the entry. Make sure to update the entry for the modified node, as the hash will change. Do not include the callsite when calculating the hash. Start at the leaf nodes and work your way up.
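
A minimal sketch of this pass over a stand-in node type; Node and NodeId are illustrative, the node list is assumed to be in leaves-first order, and the hash key deliberately excludes any callsite information so structurally identical subterms collide:

use std::collections::HashMap;

type NodeId = usize;

#[derive(Clone, PartialEq, Eq, Hash)]
enum Node {
    Constant(i64),
    Add(NodeId, NodeId),
}

// Walk the nodes from the leaves up and return a remapping of duplicate
// node ids to their canonical representative.
fn extract_common_terms(nodes: &[Node]) -> HashMap<NodeId, NodeId> {
    let mut seen: HashMap<Node, NodeId> = HashMap::new();
    let mut remap: HashMap<NodeId, NodeId> = HashMap::new();
    for (id, node) in nodes.iter().enumerate() {
        // Rewrite operands first so the hash reflects already-merged children.
        let canonical = match node {
            Node::Add(a, b) => Node::Add(remap[a], remap[b]),
            n => n.clone(),
        };
        let target = *seen.entry(canonical).or_insert(id);
        remap.insert(id, target);
    }
    remap
}

fn main() {
    // Ids 0..3: Const(2), Const(2), Add(0, 0), Add(1, 1) -- two duplicate subterms.
    let nodes = vec![Node::Constant(2), Node::Constant(2), Node::Add(0, 0), Node::Add(1, 1)];
    let remap = extract_common_terms(&nodes);
    assert_eq!(remap[&1], 0); // duplicate constant merged
    assert_eq!(remap[&3], 2); // duplicate Add merged
}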

Subterm node

Add an operation type to ensure that a computation is already present in the compute graph. This is useful for referencing intermediate derivatives generated by autodiff with peace of mind that no additional work is created. Remove these nodes if the constraint is satisfied during common term extraction, and error if they are still present when compiling to XLA.

Unit testing

Write tests for every reduction: build an input graph and an expected output graph, run a reduction, and compare the output. Test single steps and total reductions. Test that adding a vector to a scalar errors. Write tests that don't pass yet but will pass, for example that adding the constants 2 + 1 folds to the constant 3.

Printed AST parsing

Reverse Context::to_string using nom and strum. This will enable save/load and easier unit testing.

Short term, we should probably do this via serde and save out the whole context for simplicity.

Parameter structure abstraction

This might be a trivial one, but it's important from a usability perspective.

PjRt executables expect a slice of buffers or literals as input to run on. However, if you have a model with dozens of parameter tensors, organizing them all into a slice manually becomes tedious (already visible in the mnist_xla example), so I think our abstract model API should basically be callable on any user-defined structure implementing Into<Vec<Literal>> or Into<Vec<PjRtBuffer>>. The gradient engine should also return gradients with the same desired structure if the parameter struct implements From<Vec<T>>.

This is basically the equivalent of JAX tree flattening/unflattening

I don't think we have to do anything very thoughtful for this one, just write the abstract type signatures and call into and from appropriately.
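
A sketch of the flatten/unflatten idea, with Vec<f32> standing in for xla-rs Literal/PjRtBuffer; MlpParams and its field layout are purely illustrative:

#[derive(Debug)]
struct MlpParams {
    w1: Vec<f32>,
    b1: Vec<f32>,
    w2: Vec<f32>,
    b2: Vec<f32>,
}

impl From<MlpParams> for Vec<Vec<f32>> {
    fn from(p: MlpParams) -> Self {
        vec![p.w1, p.b1, p.w2, p.b2] // flatten in a fixed, documented order
    }
}

impl From<Vec<Vec<f32>>> for MlpParams {
    fn from(mut v: Vec<Vec<f32>>) -> Self {
        // Unflatten: gradients come back in the same order they were flattened.
        let b2 = v.pop().unwrap();
        let w2 = v.pop().unwrap();
        let b1 = v.pop().unwrap();
        let w1 = v.pop().unwrap();
        MlpParams { w1, b1, w2, b2 }
    }
}

fn main() {
    let params = MlpParams { w1: vec![0.1; 4], b1: vec![0.0; 2], w2: vec![0.2; 2], b2: vec![0.0; 1] };
    let flat: Vec<Vec<f32>> = params.into(); // what the executable would receive
    let recovered: MlpParams = flat.into();  // what the gradient engine would return
    println!("{:?}", recovered);
}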

Separate model/optimizer step, implement basic optimizers and batch normalization

Models and optimizers are generally thought of as separate objects, although currently they are executed in the same context.

This might be appropriate as a second step after forward/backward pass separation.

The critical reasons why we want the optimizer separate are: 1) if updated weights are returned by the same context that returns the training loss we want to print, then weights are being bussed to and from the GPU at every step, and 2) XLA supports Send and Recv operations, which would allow us to compute gradient updates while simultaneously bussing the next model inputs and labels to the GPU.

We should also support SGD, RMSProp, and Adam optimizers. It would also make sense to make batch normalization a part of this issue (as it is its own sort of custom optimizer).
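
For reference, the Adam update that the separated optimizer context would eventually express as XLA ops, written out host-side on plain slices (not unda's API):

struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    t: u32,
    m: Vec<f32>, // first-moment estimates
    v: Vec<f32>, // second-moment estimates
}

impl Adam {
    fn new(lr: f32, n_params: usize) -> Self {
        Adam { lr, beta1: 0.9, beta2: 0.999, eps: 1e-8, t: 0, m: vec![0.0; n_params], v: vec![0.0; n_params] }
    }

    fn step(&mut self, weights: &mut [f32], grads: &[f32]) {
        self.t += 1;
        let bc1 = 1.0 - self.beta1.powi(self.t as i32); // bias corrections
        let bc2 = 1.0 - self.beta2.powi(self.t as i32);
        for i in 0..weights.len() {
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grads[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grads[i] * grads[i];
            let m_hat = self.m[i] / bc1;
            let v_hat = self.v[i] / bc2;
            weights[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}

fn main() {
    let mut w = vec![0.5_f32; 3];
    let mut opt = Adam::new(0.001, w.len());
    opt.step(&mut w, &[0.1, -0.2, 0.3]);
    println!("{:?}", w);
}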

Softmax Activation

Write it in an intuitive way that doesn't require a bunch of unnecessary gradient calculation.
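
One common way to get this (not necessarily the approach unda will take) is to fuse softmax with cross-entropy loss, in which case the gradient with respect to the logits collapses to probs - one_hot(target) and no explicit softmax Jacobian is needed. A host-side sketch:

// Numerically stable softmax over a slice of logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

// Gradient of softmax + cross-entropy w.r.t. the logits: probs - one_hot(target).
fn softmax_xent_grad(logits: &[f32], target: usize) -> Vec<f32> {
    let mut probs = softmax(logits);
    probs[target] -= 1.0;
    probs
}

fn main() {
    let logits = [2.0_f32, 1.0, 0.1];
    let grad = softmax_xent_grad(&logits, 0);
    println!("gradient w.r.t. logits: {:?}", grad);
}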

Constant folding

Traverse the graph applying rules like 2 + 1 = 3, 1 * x = x, 0 * x = 0, x + 0 = x, etc.

The highest priority is the 0 * x = 0 rule, because autodiff leaves behind many Const 0 nodes.

Node traversal should look a lot like the autodiff implementation.
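
A toy version of the folding pass over a recursive stand-in expression type, showing the rules above; unda's real pass would walk its compute graph of node identifiers instead:

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(f64),
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn fold(e: Expr) -> Expr {
    use Expr::*;
    match e {
        Add(a, b) => match (fold(*a), fold(*b)) {
            (Const(x), Const(y)) => Const(x + y),
            (x, Const(c)) | (Const(c), x) if c == 0.0 => x, // x + 0 = x
            (x, y) => Add(Box::new(x), Box::new(y)),
        },
        Mul(a, b) => match (fold(*a), fold(*b)) {
            (Const(x), Const(y)) => Const(x * y),
            (_, Const(c)) | (Const(c), _) if c == 0.0 => Const(0.0), // 0 * x = 0
            (x, Const(c)) | (Const(c), x) if c == 1.0 => x,          // 1 * x = x
            (x, y) => Mul(Box::new(x), Box::new(y)),
        },
        other => other,
    }
}

fn main() {
    // 2 + 1 folds to 3.
    assert_eq!(fold(Expr::Add(Box::new(Expr::Const(2.0)), Box::new(Expr::Const(1.0)))), Expr::Const(3.0));
    // 0 * x folds to 0, the rule autodiff benefits from most.
    let zero_x = Expr::Mul(Box::new(Expr::Const(0.0)), Box::new(Expr::Var("x".into())));
    assert_eq!(fold(zero_x), Expr::Const(0.0));
}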

Model save/load

Expose a way to save out a computation graph, accompanied by all constants, parameters, and model weights, as relative-path HLO-format files. xla-rs has utils for this, and it's the expected format for model/optimizer saving in XLA.

Separate forward/backward pass

We need to carefully design the API such that the user has access to both an executable that returns the desired diff calls (for training) and an executable that returns everything except that (for testing).

This is part of a larger array of issues that will emerge from the need to embed contexts in other contexts (for example, separating the optimizer step, or designing recurrent architectures). In this case it might make sense to allow the user to design a forward-pass context which doesn't take labels or output gradients, then allow them to clone that context and recover all desired node identifiers in order to create another context that takes both inputs and labels and outputs predictions, loss, and gradients. Then both executables can be used separately.

Data prefetching?

Question mark because it might be most reasonable to depend on another crate for this.

At the most abstract level, I imagine us having a Training struct which of course takes a model context and an optimizer context, but also, very importantly, an example generator. The generator should have a specified number of threads which it uses to load and preprocess a specified number of samples before they are required by the training loop. It should be known whether the generator is finite or infinite (counting epochs on an infinite training set would of course be silly).

Here is the simplest implementation in Python: https://github.com/justheuristic/prefetch_generator. Note that all loading and preprocessing is user-defined; it basically just implements the multiprocessing aspect. Is there already a crate we can use for this?
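
A minimal std-only sketch of the generator idea: a background thread runs the user-defined loader and keeps a bounded buffer of preprocessed samples ready ahead of the training loop. A real implementation would probably use several worker threads or an existing crate; the prefetch helper here is hypothetical.

use std::sync::mpsc;
use std::thread;

// Spawn a loader thread that stays up to `buffer` samples ahead of the consumer.
fn prefetch<T, F>(buffer: usize, mut next_sample: F) -> mpsc::Receiver<T>
where
    T: Send + 'static,
    F: FnMut() -> Option<T> + Send + 'static,
{
    let (tx, rx) = mpsc::sync_channel(buffer); // bounded channel => backpressure
    thread::spawn(move || {
        while let Some(sample) = next_sample() {
            if tx.send(sample).is_err() {
                break; // training loop dropped the receiver
            }
        }
    });
    rx
}

fn main() {
    // Toy "dataset": pretend each sample needs expensive preprocessing.
    let mut i = 0;
    let rx = prefetch(8, move || {
        if i < 100 {
            i += 1;
            Some(vec![i as f32; 4])
        } else {
            None
        }
    });
    for batch in rx {
        let _ = batch; // the train step would go here
    }
}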

Softmax + MatMul diff issues

  • MatMul: dimensionality handling is not right when one of the nodes is a 3D matrix (a stack of matrices); make it more generalized, like the dimension-checking function
  • Softmax: problem during diff

Autodiff support for higher dimensional Jacobians

As it is currently written, autodiff expects a scalar output node and the tests which present a vector output fail. We should extend the logic to be able to cope with the case of higher-dimensional output.
