Thunder

Large-scale neural data analysis with Spark

About

Thunder is a library for analyzing large-scale neural data. It's fast to run, easy to develop for, and can be run interactively. It is built on Spark, a powerful new framework for distributed computing.

Thunder includes low-level utilities for data loading, saving, signal processing, and fitting algorithms (regression, factorization, etc.), as well as high-level functions that can be scripted to easily combine analyses. It is written in Spark's Python API (PySpark), making use of NumPy and SciPy. We plan to port some or all of the functionality to Scala in the future, but for now all Scala functions should be considered prototypes.

Quick start

Here's a quick guide to getting up and running. It assumes Scala 2.10.3, Spark 0.9.0, and Python 2.7.6 (with NumPy, SciPy, scikit-learn, and the Python Imaging Library) are already installed. First, download the latest build and add it to your Python path.

export PYTHONPATH=your_path_to_thunder/python/:$PYTHONPATH

Now go into the top-level Thunder directory and run an analysis on test data.

$SPARK_HOME/bin/pyspark python/thunder/factorization/pca.py local data/iris.txt ~/results 4

This will run principal component analysis on the β€œiris” data set with 4 components, and write the results to a folder in your home directory. The same analysis can be run interactively. Start PySpark:

$SPARK_HOME/bin/pyspark

Then run the analysis

>>> from thunder.util.load import load
>>> from thunder.factorization.pca import pca
>>> data = load(sc, 'data/iris.txt')
>>> scores, latent, comps = pca(data, 4)
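
To take a quick look at the output, you can inspect the returned objects directly. This is a minimal sketch; it assumes comps comes back as a NumPy array of components and scores as an RDD of key-value records, which may differ across versions:

>>> comps.shape                  # assumed: principal components as a NumPy array
>>> scores.take(3)               # assumed: per-record scores as an RDD; peek at a few records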

We include a script for automatically importing commonly used functions

>>> execfile('helper/thunder-startup.py')

To run in IPython, just set this environment variable before starting PySpark:

export IPYTHON=1

Analyses

Thunder currently includes five packages: classification, clustering, factorization, regression, and signal processing, as well as utilities for shared methods like loading and saving (see Input and output below). Individual packages include both high-level analyses and the underlying methods and algorithms. There are several stand-alone scripts for common analysis routines, but the same functions (or their sub-functions) can be used from within the PySpark shell for easy interactive analysis (see the sketch after the list below). Here is a list of the primary analyses:

classification

classify - mass univariate classification

clustering

kmeans - k-means clustering

factorization

pca - principal components analysis
ica - independent components analysis

regression

regress - mass univariate regression (linear and bilinear)
regresswithpca - regression combined with dimensionality reduction
tuning - mass univariate parametric tuning curves (circular and Gaussian)

signal processing

crosscorr - signal cross-correlation
fourier - Fourier analysis
localcorr - local spatial time series correlations
stats - summary statistics (mean, std, etc.)
query - average over indices
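
For example, a clustering analysis can be run interactively much like the PCA example above. Here is a minimal sketch, assuming the module path mirrors the pca example and that kmeans takes a loaded data RDD plus a number of clusters and returns labels and cluster centers (the exact signature and return values may differ):

>>> from thunder.util.load import load
>>> from thunder.clustering.kmeans import kmeans
>>> data = load(sc, 'data/iris.txt')
>>> labels, centers = kmeans(data, 3)    # assumed signature: (data, number of clusters)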

Input and output

Thunder is built around a common input format for raw data: a set of neural signals as key-value pairs, where the key is an identifier and the value is a response time series. In imaging data, for example, each record would be a voxel, the key an xyz coordinate, and the value a fluorescence time series. This is a useful and efficient representation of raw data because the analyses parallelize across neural signals (i.e. across records).

These key-value records can, in principle, be stored in a variety of formats on a cluster-accessible file system; the core functionality (besides loading) does not depend on the file format, only on the data being key-value pairs. Currently, the loading function assumes a text file input, where the rows are neural signals and the columns are the keys and values, with each number separated by a space. We are investigating alternative file formats that are more space-efficient, as well as developing scripts that facilitate converting raw data (e.g. tif images) into the common data format.
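
As a rough illustration of how one row of such a text file maps to a key-value record (treating the first three numbers as an xyz key, per the imaging example above; the sample row here is made up):

>>> from numpy import array
>>> line = '12 7 2 118.3 120.1 119.7'                  # hypothetical row: xyz key followed by a time series
>>> parts = line.split(' ')
>>> key = tuple(int(x) for x in parts[:3])             # (x, y, z) coordinate
>>> values = array([float(x) for x in parts[3:]])      # response time series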

All metadata (e.g. parameters of the stimulus or behavior for regression analyses) can be provided as NumPy arrays or loaded from MAT files; see the relevant functions for more details.
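
For instance, a stimulus covariate might be built directly as a NumPy array, or pulled out of a MAT file with SciPy. This is a generic sketch; the file and variable names are hypothetical:

>>> from numpy import array
>>> from scipy.io import loadmat
>>> stim = array([0, 1, 0, 1, 1])                      # covariate specified directly as a NumPy array
>>> stim = loadmat('params.mat')['stim']               # or read from a MAT file (hypothetical names)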

Results can be visualized directly from the python shell using matplotlib, or saved as MAT files (including automatic reshaping and sorting), text files, or images (including automatic rescaling).
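
As one generic example of plotting and saving from the shell, continuing the PCA example above (this uses matplotlib and SciPy directly rather than any Thunder-specific saving helper, and comps is assumed to be a NumPy array of components):

>>> import matplotlib.pyplot as plt
>>> from scipy.io import savemat
>>> plt.plot(comps.T)                                  # plot each principal component as a line
>>> plt.show()
>>> savemat('results.mat', {'comps': comps})           # write the components to a MAT file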
