larray's Introduction

LArray: N-dimensional labelled arrays

LArray is an open source Python library that aims to provide tools for easy exploration and manipulation of N-dimensional labelled data structures.

Library Highlights

  • N-dimensional labelled array objects to store and manipulate multi-dimensional data
  • I/O functions for reading and writing arrays in different formats: CSV, Microsoft Excel, HDF5, pickle
  • Arrays can be grouped into Session objects and loaded/dumped at once
  • User interface with an IPython console for rapid exploration of data
  • Compatible with the pandas library: Array objects can be converted to and from pandas DataFrame objects.

Installation

Pre-built binaries

The easiest way to install larray is through Conda. On all platforms, larray can be installed with:

conda install -c larray-project larray

This will install a lightweight version of larray that depends only on the NumPy and pandas libraries. Additional libraries are required to use the included graphical user interface, to make plots, or to use the special I/O functions for dumping to and loading from Excel or HDF files. Optional dependencies are described below.

To install larray with all optional dependencies, run:

conda install -c larray-project larrayenv

You can also first add the larray-project channel to your channel list:

conda config --add channels larray-project

and then install larray (or larrayenv) with:

conda install larray

Building from source

The latest release of LArray is available from https://github.com/larray-project/larray.git

Once you have satisfied the requirements detailed below, simply run:

python setup.py install

Required Dependencies

  • Python 3.8, 3.9, 3.10 or 3.11
  • numpy (1.22 or later)
  • pandas (0.20 or later)

Optional Dependencies

For IO (HDF, Excel)

  • pytables: for working with files in HDF5 format.
  • xlwings: recommended package to benefit from all Excel features of LArray. Only available on Windows and Mac platforms.
  • openpyxl: recommended package for reading and writing Excel 2010 files (i.e. .xlsx)
  • xlsxwriter: alternative package for writing data, formatting information and, in particular, charts in the Excel 2010 format (i.e. .xlsx)
  • xlrd: for reading data and formatting information from older Excel files (i.e. .xls)
  • xlwt: for writing data and formatting information to older Excel files (i.e. .xls)

  • larray_eurostat: provides functions to easily download EUROSTAT files as larray objects. Currently limited to TSV files.

For Graphical User Interface

LArray includes a graphical user interface to view, edit and compare arrays.

  • pyqt (version 5): required by larray-editor (see below).
  • pyside: alternative to PyQt.
  • qtpy: required by larray-editor.
  • larray-editor: required to use the graphical user interface associated with larray. It assumes that qtpy and either pyqt or pyside are installed. On Windows, it also creates an LArray menu in the Windows Start Menu.

For plotting

Miscellaneous

  • pydantic: required to use CheckedSession.

Documentation

The official documentation is hosted on ReadTheDocs at http://larray.readthedocs.io/en/stable/

Get in touch

  • To be informed of each new release, please subscribe to the announce mailing list.
  • For questions, ideas or general discussion, please use the Google Users Group.
  • To report bugs, suggest features or view the source code, please go to our GitHub website.

larray's People

Contributors

alixdamman, gbryon, gdementen, jehanneman

larray's Issues

implement some way to escape special characters in labels

list of characters/patterns with special meaning:

  • current: , ; : .. [ ] >> name[] name.i[] {} numbers
  • whitespace could be considered a special character (because it is not kept as is) and we might want to make it "escapable"
  • planned (for automatic patterns): * ?
  • potential (for logic operators): | & !

we might want to reserve some or all other special characters just in case: # @ % / = + -
Or, we could define the precise list of characters a label can be made of which we can guarantee will not be interpreted.
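
One possible convention, which this issue does not settle on, is a backslash escape. The sketch below is only an illustration under that assumption: `split_labels` is a hypothetical helper, it handles a single separator, and it does not cover the other special patterns listed above.

```python
def split_labels(text, sep=',', escape='\\'):
    """Split `text` on `sep`, treating `escape` + any character as that literal character.

    Hypothetical helper: not part of larray.
    """
    labels, current, escaped = [], [], False
    for char in text:
        if escaped:
            # previous character was the escape character: keep this one as-is
            current.append(char)
            escaped = False
        elif char == escape:
            escaped = True
        elif char == sep:
            labels.append(''.join(current).strip())
            current = []
        else:
            current.append(char)
    labels.append(''.join(current).strip())
    return labels


plain = split_labels('x, y')
with_comma = split_labels(r'a\,b, c')
```

Because the labels are stripped after unescaping, this sketch cannot keep leading or trailing whitespace even when it is escaped; making whitespace "escapable" would need the strip to happen before the escape is resolved.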

split unit tests

at a minimum, move Axis, AxisCollection and LGroup tests out of test_la.py

ptp is broken

TypeError: ptp() got an unexpected keyword argument 'keepdims'
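
The message suggests that keepdims is forwarded to a ptp implementation that does not accept it. A possible workaround, sketched here with plain NumPy and a hypothetical helper name (`ptp_keepdims`), is to build peak-to-peak from max and min, which both accept keepdims:

```python
import numpy as np


def ptp_keepdims(data, axis=None, keepdims=False):
    """Peak-to-peak (max - min) built from reductions that accept keepdims."""
    data = np.asarray(data)
    highest = data.max(axis=axis, keepdims=keepdims)
    lowest = data.min(axis=axis, keepdims=keepdims)
    return highest - lowest


arr = np.arange(6).reshape(2, 3)
res = ptp_keepdims(arr, axis=1, keepdims=True)
```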

implement Axis.set[] to create LSet directly

axis.set[a, b]

should be equivalent to:

axis[a, b].set()

The goal is mostly that the __repr__ in #44 actually works, so one option might be to simply change LSet __repr__ to:

axis[a, b].set()

But the .set[] syntax would also be more efficient, so...
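
A rough sketch of how the .set[] syntax could be wired, using toy classes only. `Axis`, `Group`, `LSet` and `_SetIndexer` below are simplified stand-ins written for this example, not the actual larray classes:

```python
class LSet:
    """Toy ordered set of labels tied to an axis name."""

    def __init__(self, labels, axis_name):
        # dict.fromkeys drops duplicates while keeping first-seen order
        self.labels = tuple(dict.fromkeys(labels))
        self.axis_name = axis_name

    def __eq__(self, other):
        return (isinstance(other, LSet)
                and self.labels == other.labels
                and self.axis_name == other.axis_name)

    def __repr__(self):
        return f"{self.axis_name}.set[{', '.join(repr(l) for l in self.labels)}]"


class Group:
    """Toy label group, as returned by axis[...]."""

    def __init__(self, labels, axis_name):
        self.labels = tuple(labels)
        self.axis_name = axis_name

    def set(self):
        return LSet(self.labels, self.axis_name)


class _SetIndexer:
    """Object returned by axis.set; indexing it builds the LSet directly."""

    def __init__(self, axis):
        self.axis = axis

    def __getitem__(self, key):
        labels = key if isinstance(key, tuple) else (key,)
        return LSet(labels, self.axis.name)


class Axis:
    def __init__(self, name, labels):
        self.name = name
        self.labels = tuple(labels)

    def __getitem__(self, key):
        labels = key if isinstance(key, tuple) else (key,)
        return Group(labels, self.name)

    @property
    def set(self):
        return _SetIndexer(self)


sex = Axis('sex', ['M', 'F'])
direct = sex.set['M', 'F']        # no intermediate Group is created
via_group = sex['M', 'F'].set()
```

In this sketch the `__repr__` of the set is itself a valid `axis.set[...]` expression, which is the property the paragraph above is after.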

split core.py

  • at a minimum all Axis related stuff should go to an "axis" module
    Axis, AxisCollection
  • potential other modules:
    group
    expr

implement replace_axis

to complement with_axes, we need a way to replace only one axis (or a few axes). The goal is to have something nicer than:

a.with_axes(a.axes.replace(x.products, industries))

e.g:

a.replace_axis(x.products, industries)
# or
a.with_axes(products=industries)

fix read_excel for sparse files

It works when using engine='xlrd', but it has stopped working since the default engine changed to 'xlwings'.
Fixing this properly would take a lot of work: support for sparse arrays (#28) and reindex (on a sparse index). However, we could use a temporary shortcut: read the data as (or convert it to) a pd.DataFrame, reindex it, and convert it back to an LArray. Far from optimal but much easier to implement.
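
The shortcut could look roughly like the following, shown with pandas and NumPy only. The axis names, labels and values are made up for the example, and the final conversion back to an LArray is left out:

```python
import numpy as np
import pandas as pd

nat_labels = ['BE', 'FO']
sex_labels = ['M', 'F']

# "sparse" input: the (FO, F) combination has no row at all
sparse = pd.DataFrame(
    {'nat': ['BE', 'BE', 'FO'], 'sex': ['M', 'F', 'M'], 'value': [1.0, 2.0, 3.0]}
).set_index(['nat', 'sex'])

# reindex against the full cartesian product of the labels;
# combinations absent from the input become NaN
full_index = pd.MultiIndex.from_product([nat_labels, sex_labels], names=['nat', 'sex'])
dense = sparse.reindex(full_index)

# the dense column can now be reshaped into an N-dimensional block
data = dense['value'].to_numpy().reshape(len(nat_labels), len(sex_labels))
```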

Window functions API

Rolling, expanding, ...

See
http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.rolling.html#xarray.DataArray.rolling

For numpy:
https://gist.github.com/seberg/3866040
http://www.rigtorp.se/2011/01/01/rolling-statistics-numpy.html

Bottleneck also supports move_*
https://pypi.python.org/pypi/Bottleneck
there are no built-in move functions in numpy, so it compares against its own implementation:
https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/slow/move.py

But it seems like Pandas works well even with numpy arrays, so I guess I shouldn't bother and simply use Pandas which has a lot more features than all the other solutions anyway.

http://stackoverflow.com/a/30141358/288162
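
For reference, a small sketch of the "simply use Pandas" route on a plain NumPy array. It only covers the 1-dimensional case, and the variable names are arbitrary:

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# wrap the ndarray in a Series, apply the window function, unwrap again
rolling_mean = pd.Series(values).rolling(window=3).mean().to_numpy()
expanding_sum = pd.Series(values).expanding().sum().to_numpy()
```

The first `window - 1` results of the rolling mean are NaN because the window is not yet full.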

refactor viewer Model to include the concept of axes names

The goal is to clean up the current code.
It would probably help to store/support LArrays directly in the model, instead of converting to np.ndarray, but I don't want the whole model to require the use of LArrays (because in that case we will not be able to send the code back to upstream Spyder). One option is to make a generic model and have a specific LArray model which would inherit from it. The goal would be to have as much functionality as possible/reasonable available in the generic model (i.e. plot, copy & paste, filter -- but not by labels obviously). Unsure if that is reasonable though :)
One clear requirement is to keep the ability to view non-LArrays (np.ndarray, lists, tuples). The easiest way to do this would be to convert them to LArray in the init of the Model, but I would rather avoid that for the above reason.
Another point to keep in mind is that it should be able to handle pandas DataFrames in the future without too much change.

position or index

We should pick one of the two terms and stick with it. Currently we use both (.i and .ipoints but PGroup and posarg*). We should either have:
.p[], .ppoints[], PGroup and posarg*
or
.i[], .ipoints[], IGroup and iarg* (or indarg*)

explore implementing set operations on Axis and Group

setdiff1d (numpy) -- works
delete (numpy) -- works for int, slice or list of indices
list.remove (python) -- works for value (inplace)
list.pop (python) -- index

since LArray is more like numpy arrays than Python lists => not remove and pop
=> delete and idelete?
does the label version (delete?) return only unique labels, i.e. is it a set-like operation?

union
intersect
setdiff
setxor
setin

possibly on LArray too (though 1d/flattened only in that case like numpy -- because otherwise that returns non cubic arrays).
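
For comparison, here is how the NumPy counterparts behave on label arrays. The labels are made up for the example. Note that the NumPy set routines return sorted unique values, so they do not preserve the order of the labels:

```python
import numpy as np

labels = np.array(['a0', 'a1', 'a2', 'a3'])
other = np.array(['a2', 'a3', 'a4'])

union = np.union1d(labels, other)
intersect = np.intersect1d(labels, other)
setdiff = np.setdiff1d(labels, other)
setxor = np.setxor1d(labels, other)
setin = np.isin(labels, other)          # boolean mask, same length as labels

# np.delete works by position (int, slice or list of indices), not by label
deleted = np.delete(labels, [0, 2])

# order matters for an axis: setdiff1d sorts, a boolean mask keeps the order
unordered = np.array(['b', 'c', 'a'])
sorted_diff = np.setdiff1d(unordered, ['c'])
ordered_diff = unordered[~np.isin(unordered, ['c'])]
```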

generalize/extend Session to be more LArray-like

It should be as close as possible to an LArray with an "array" axis.

# sum the age axis of all arrays *iff* they have such an axis
s.sum(x.age)
# sum all axes of each array present in the session
s.sum_by(x.array)
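
A toy sketch of the intended semantics, using NumPy arrays with explicit axis names. `ToySession` and `sum_by_array` are hypothetical names for this example and do not reflect the actual Session class:

```python
import numpy as np


class ToySession:
    """Maps array names to (axis_names, data) pairs."""

    def __init__(self, **arrays):
        self.arrays = {name: (tuple(axes), np.asarray(data))
                       for name, (axes, data) in arrays.items()}

    def sum(self, axis_name):
        """Sum `axis_name` in every array that has it; leave the others untouched."""
        result = {}
        for name, (axes, data) in self.arrays.items():
            if axis_name in axes:
                pos = axes.index(axis_name)
                axes = axes[:pos] + axes[pos + 1:]
                data = data.sum(axis=pos)
            result[name] = (axes, data)
        return ToySession(**result)

    def sum_by_array(self):
        """Sum all axes of each array, giving one scalar per array."""
        return {name: data.sum().item() for name, (axes, data) in self.arrays.items()}


s = ToySession(pop=(('age', 'sex'), [[1, 2], [3, 4]]),
               births=(('sex',), [10, 20]))
summed = s.sum('age')
totals = s.sum_by_array()
```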

to_clipboard is broken

The bug is in Pandas and/or in pyperclip (used by Pandas).
I suspect it is fixed in upstream pyperclip but not in the copy included in Pandas. So it might only need a PR to pandas which simply updates pyperclip to the latest version.

provide larrayenv package

which depends on larray and all optional larray dependencies so that our users only need to do:

conda update larrayenv

and be sure to have all the functionalities installed.

generalize stack to more than 1 dimension

Here are a few syntax experiments (but see also #30):

# 2D
stack([(('BE', 'M'), 1.0), (('BE', 'F'), 0.0),
       (('FO', 'M'), 1.0), (('FO', 'F'), 0.0)], ('nat', 'sex'))

# 3D
# a) flat list, label tuple
stack([(('BE', 1, 'M'), 1.0), (('BE', 1, 'F'), 0.0),
       (('BE', 2, 'M'), 1.0), (('BE', 2, 'F'), 0.0),
       (('BE', 3, 'M'), 1.0), (('BE', 3, 'F'), 0.0),
       (('FO', 1, 'M'), 1.0), (('FO', 1, 'F'), 0.0),
       (('FO', 2, 'M'), 1.0), (('FO', 2, 'F'), 0.0),
       (('FO', 3, 'M'), 1.0), (('FO', 3, 'F'), 0.0)],
      ('nat', 'type', 'sex'))

# b) recursive structure
stack([('BE', [(1, [('M', 1.0), ('F', 0.0)]),
               (2, [('M', 1.0), ('F', 0.0)]),
               (3, [('M', 1.0), ('F', 0.0)])]),
       ('FO', [(1, [('M', 1.0), ('F', 0.0)]),
               (2, [('M', 1.0), ('F', 0.0)]),
               (3, [('M', 1.0), ('F', 0.0)])])],
       ('nat', 'type', 'sex'))
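
Both syntaxes can be reduced to the same internal form. The sketch below is a possible approach, not the actual stack implementation: `stack_flat` and `flatten` are hypothetical helpers, axes are returned as a plain dict of label lists, and labels are kept in first-seen order.

```python
import numpy as np


def stack_flat(items, axis_names):
    """Build (axes, data) from [(label_tuple, value), ...] (syntax a)."""
    labels = [[] for _ in axis_names]
    for key, _ in items:
        for axis_labels, label in zip(labels, key):
            if label not in axis_labels:
                axis_labels.append(label)
    # cells with no corresponding item stay NaN
    data = np.full([len(axis_labels) for axis_labels in labels], np.nan)
    for key, value in items:
        idx = tuple(axis_labels.index(label)
                    for axis_labels, label in zip(labels, key))
        data[idx] = value
    return dict(zip(axis_names, labels)), data


def flatten(nested, depth, prefix=()):
    """Turn the recursive structure (syntax b) into flat (label_tuple, value) pairs."""
    if depth == 0:
        yield prefix, nested
    else:
        for label, sub in nested:
            yield from flatten(sub, depth - 1, prefix + (label,))


items_2d = [(('BE', 'M'), 1.0), (('BE', 'F'), 0.0),
            (('FO', 'M'), 1.0), (('FO', 'F'), 0.0)]
axes_2d, data_2d = stack_flat(items_2d, ('nat', 'sex'))

nested_3d = [('BE', [(1, [('M', 1.0), ('F', 0.0)])]),
             ('FO', [(1, [('M', 1.0), ('F', 0.0)])])]
flat_3d = list(flatten(nested_3d, depth=3))
axes_3d, data_3d = stack_flat(flat_3d, ('nat', 'type', 'sex'))
```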

add XXX_by aggregate methods

eg a.sum_by(x.age)

which should be equivalent to

a.sum(a.axes - x.age)

(which does not work, because aggregate functions do not support an AxisCollection argument)

but this works:

a.sum(*(a.axes - x.age))
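
The same idea expressed on a plain NumPy array with a separate tuple of axis names. `sum_by` here is a hypothetical free function written for this example, not the proposed method itself:

```python
import numpy as np


def sum_by(data, axis_names, *kept):
    """Sum over every axis except those named in `kept`."""
    to_sum = tuple(i for i, name in enumerate(axis_names) if name not in kept)
    return data.sum(axis=to_sum)


data = np.arange(24).reshape(2, 3, 4)
axes = ('nat', 'sex', 'age')

by_age = sum_by(data, axes, 'age')      # same as data.sum(axis=(0, 1))
```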

implement .reindex_axis

>>> arr = ndtest((2, 3))
>>> arr.reindex_axis(x.a, ['a1', 'a0', 'a1', 'a3'])
 a\b |  b0 |  b1 |  b2
  a1 |   3 |   4 |   5
  a0 |   0 |   1 |   2
  a1 |   3 |   4 |   5
  a3 | nan | nan | nan

It might be easy to implement using something vaguely looking like:

>>> new_axis = Axis(old_axis.name, new_labels)
>>> missing_value = missing[dtype]
>>> old_indices = old_axis.translate(new_labels, missing=-1)
>>> result = self.i[old_indices]
>>> # fix up those which were missing
>>> result[old_indices == -1] = missing_value
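
The outline above can be turned into a runnable sketch on a plain NumPy array. This is a simplification under stated assumptions: `reindex_axis` is a free function here, labels are passed as lists, and the result is always converted to float so that nan can serve as the missing value (the dtype-dependent `missing[dtype]` lookup is not reproduced).

```python
import numpy as np


def reindex_axis(data, old_labels, new_labels, axis=0, missing_value=np.nan):
    """Reorder/extend `data` along `axis` to follow `new_labels`."""
    position = {label: i for i, label in enumerate(old_labels)}
    # -1 marks labels which do not exist on the old axis
    old_indices = np.array([position.get(label, -1) for label in new_labels],
                           dtype=int)
    # -1 temporarily picks the last element; it is overwritten just below
    result = np.take(data.astype(float), old_indices, axis=axis)
    # fix up those which were missing
    index = [slice(None)] * result.ndim
    index[axis] = old_indices == -1
    result[tuple(index)] = missing_value
    return result


arr = np.arange(6).reshape(2, 3)
res = reindex_axis(arr, ['a0', 'a1'], ['a1', 'a0', 'a1', 'a3'])
```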

implement .extend on Axis

... it should accept a list of labels or an Axis and return a new Axis object (not modify the Axis in-place)
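
A minimal sketch of the non-mutating behaviour described above, on a toy `Axis` class written for this example (not the larray class). Keeping the name of the original axis when the argument is itself an Axis is an assumption.

```python
class Axis:
    """Toy axis: a name and an immutable sequence of labels."""

    def __init__(self, name, labels):
        self.name = name
        self.labels = tuple(labels)

    def extend(self, labels):
        """Return a new Axis with `labels` (a list or an Axis) appended."""
        extra = labels.labels if isinstance(labels, Axis) else tuple(labels)
        return Axis(self.name, self.labels + extra)


age = Axis('age', [0, 1])
longer = age.extend([2, 3])
from_axis = age.extend(Axis('other', [9]))
```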

Implement a better syntax for initializing an array with "constant" values

eg. provide an alternative to:

>>> nat = Axis('nat', ['BE', 'FO'])
>>> sex = Axis('sex', ['M', 'F'])
>>> LArray([[0, 1], [2, 3]], [nat, sex])
nat\sex | M | F
     BE | 0 | 1
     FO | 2 | 3

because it is so error prone.

For a 1d array, stack works nicely, but for 2+, it quickly gets awful.

>>> stack([('M', 0), ('F', 1)], 'sex')

implementing a from_lists function would probably solve this nicely (though a better name might help):

>>> from_lists([['nat\\sex', 'M', 'F'],
...             ['BE',        0,   1],
...             ['FO',        2,   3]])
nat\sex | M | F
     BE | 0 | 1
     FO | 2 | 3
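
A possible shape for such a function, sketched for the 2-dimensional case only. It returns a dict of axis labels and a NumPy array rather than an LArray, and the parsing of the `rows\columns` header cell is an assumption based on the example above:

```python
import numpy as np


def from_lists(rows):
    """Parse [[header...], [row_label, values...], ...] into (axes, data)."""
    header, *body = rows
    # the first header cell holds both axis names, separated by a backslash
    row_name, col_name = header[0].split('\\')
    col_labels = list(header[1:])
    row_labels = [row[0] for row in body]
    data = np.array([row[1:] for row in body])
    return {row_name: row_labels, col_name: col_labels}, data


axes, data = from_lists([['nat\\sex', 'M', 'F'],
                         ['BE',        0,   1],
                         ['FO',        2,   3]])
```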

cleanup unit tests

we should rewrite most LArray unit tests using small-ish arrays created using ndtest() instead of the current demo-related examples.

sparse array support

by either

  • using a pd.MultiIndex
  • storing a pd.DataFrame in memory instead of np.ndarray (mostly done in pandasbased3 branch)
  • implementing our own MultiIndex-like object

add unit tests for Excel I/O

This is a very important part and it is not tested at all currently, and I manage to break it every other release.
