larray-project / larray

N-dimensional labelled arrays in Python

Home Page: https://larray.readthedocs.io/

License: GNU General Public License v3.0

Python 99.99% Batchfile 0.01%

python array ndarray labeled-data

larray's Introduction

LArray: N-dimensional labelled arrays

LArray is an open source Python library that aims to provide tools for easy exploration and manipulation of N-dimensional labelled data structures.

Library Highlights

N-dimensional labelled array objects to store and manipulate multi-dimensional data
I/O functions for reading and writing arrays in different formats: CSV, Microsoft Excel, HDF5, pickle
Arrays can be grouped into Session objects and loaded/dumped at once
User interface with an IPython console for rapid exploration of data
Compatible with the pandas library: Array objects can be converted into pandas DataFrame and vice versa.

Installation

Pre-built binaries

The easiest route to installing larray is through Conda. For all platforms installing larray can be done with:

conda install -c larray-project larray

This will install a lightweight version of larray that depends only on the NumPy and pandas libraries. Additional libraries are required to use the included graphical user interface, to make plots, or to use the special I/O functions for dumping to and loading from Excel or HDF files. Optional dependencies are described below.

Installing larray with all optional dependencies can be done with:

conda install -c larray-project larrayenv

You can also first add the channel larray-project to your channel list:

conda config --add channels larray-project

and then install larray (or larrayenv) with:

conda install larray

Building from source

The latest release of LArray is available from https://github.com/larray-project/larray.git

Once you have satisfied the requirements detailed below, simply run:

python setup.py install

Required Dependencies

Python 3.8, 3.9, 3.10 or 3.11
numpy (1.22 or later)
pandas (0.20 or later)

Optional Dependencies

For IO (HDF, Excel)

pytables: for working with files in HDF5 format.
xlwings: recommended package to benefit from all Excel features of LArray. Only available on Windows and Mac platforms.
openpyxl: recommended package for reading and writing Excel 2010 files (ie: .xlsx)
xlsxwriter: alternative package for writing data, formatting information and, in particular, charts in the Excel 2010 format (ie: .xlsx)
xlrd: for reading data and formatting information from older Excel files (ie: .xls)
xlwt: for writing data and formatting information to older Excel files (ie: .xls)
larray_eurostat: provides functions to easily download EUROSTAT files as larray objects. Currently limited to TSV files.

For Graphical User Interface

LArray includes a graphical user interface to view, edit and compare arrays.

pyqt (version 5): required by larray-editor (see below).
pyside: alternative to PyQt.
qtpy: required by larray-editor.
larray-editor: required to use the graphical user interface associated with larray. It assumes that qtpy and either pyqt or pyside are installed. On Windows, it also creates an LArray menu in the Start Menu.

For plotting

matplotlib: required for plotting.

Miscellaneous

pydantic: required to use CheckedSession.

Documentation

The official documentation is hosted on ReadTheDocs at http://larray.readthedocs.io/en/stable/

Get in touch

To be informed of each new release, please subscribe to the announce mailing list.
For questions, ideas or general discussion, please use the Google Users Group.
To report bugs, suggest features or view the source code, please go to our GitHub website.

larray's People

Contributors

Stargazers

Watchers

Forkers

alixdamman gdementen smritigambhir avasse vishalbelsare sevketsayin

larray's Issues

new API for groups, ND groups, points selection, ...

It should be possible to create groups without specifying an axis (i.e. relying on axis guessing), e.g.
G[2:7]
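
Axis guessing for such a group could look roughly like the sketch below. It is not larray's implementation: guess_axis is a hypothetical helper, and axes are represented as a plain dict mapping axis names to labels. The idea is only that a group is resolvable when exactly one axis contains all of its labels.

```python
def guess_axis(axes, labels):
    """Return the name of the only axis that contains every label in `labels`.

    `axes` maps axis names to their labels (a stand-in for real Axis objects).
    Raises ValueError when zero or several axes match, because the guess
    would then be impossible or ambiguous.
    """
    matches = [name for name, axis_labels in axes.items()
               if all(label in axis_labels for label in labels)]
    if len(matches) != 1:
        raise ValueError(f"cannot guess axis for {labels!r}: candidates are {matches}")
    return matches[0]


axes = {'age': list(range(10)), 'sex': ['M', 'F']}
print(guess_axis(axes, [2, 3, 4, 5, 6, 7]))
```

Raising on ambiguity rather than picking the first match keeps the guess predictable when two axes share labels.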

implement some way to escape special characters in labels

list of characters/patterns with special meaning:

current: , ; : .. [ ] >> name[] name.i[] {} numbers
whitespace could be considered a special character (because it is not kept as is) and we might want to make it "escapable"
planned (for automatic patterns): * ?
potential (for logic operators): | & !

we might want to reserve some or all other special characters just in case: # @ % / = + -
Or, we could define the precise list of characters a label can be made of which we can guarantee will not be interpreted.
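
One possible escaping scheme, sketched under assumptions: a backslash makes the next character literal, and only the comma separator is handled. split_labels is a hypothetical name, not an existing larray function, and the other special patterns listed above are ignored here.

```python
def split_labels(text, sep=",", escape="\\"):
    """Split `text` on `sep`, except where `sep` is preceded by `escape`."""
    labels, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == escape:
            # keep the character following the escape literally
            current.append(next(chars, ""))
        elif ch == sep:
            labels.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    labels.append("".join(current).strip())
    return labels


print(split_labels(r"a,b\,c, d"))
```

Note that the strip() calls would still drop an escaped leading or trailing space; making whitespace escapable would need the stripping to happen before the escapes are resolved.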

split unit tests

at a minimum, move Axis, AxisCollection and LGroup tests out of test_la.py

ptp is broken

TypeError: ptp() got an unexpected keyword argument 'keepdims'

display array titles in viewer

add in LArray.__init__
session method to create codebook
display in session viewer

implement sep argument in combine_axes

implement axes (nb_index) autodetection for read_excel using pandas/xlrd backend

implement Axis.set[] to create LSet directly

axis.set[a, b]

should be equivalent to:

axis[a, b].set()

The goal is mostly that the __repr__ in #44 actually works, so one option might be to simply change LSet __repr__ to:

axis[a, b].set()

But the .set[] syntax would also be more efficient, so...

add unit tests for Excel I/O via open_excel

as a followup to #12, we also need tests for open_excel stuff.

split core.py

at a minimum all Axis related stuff should go to an "axis" module
Axis, AxisCollection
potential other modules:
group
expr

implement replace_axis

to complement with_axes, we need a way to replace only one axis (or a few axes). The goal is to have something nicer than:

a.with_axes(a.axes.replace(x.products, industries))

e.g:

a.replace_axis(x.products, industries)
# or
a.with_axes(products=industries)

= to name one-shot (ie string) groups

automate the update of the larray package on conda-forge

https://www.continuum.io/blog/developer-blog/community-conda-forge

Add more concrete & complete examples in the tutorial that would interest our users

Something like "Python for Econometrics" should be an inspiration. It's a bit messy (order of chapters seems weird to me), but it covers a lot of stuff.

https://www.kevinsheppard.com/images/0/09/Python_introduction.pdf

fix read_excel for sparse files

It works when using engine='xlrd' but since the default engine changed to 'xlwings' it does not work.
Fixing this nicely would need a lot of work: support for sparse arrays (#28) and reindex (on a sparse index). However, we could use a temporary shortcut: read the data as (or convert it to) a pd.DataFrame, reindex it, then convert it back to an LArray. Far from optimal but much easier to implement.

implement view('filepath')

equivalent to:
view(Session('filepath'))

Window functions API

Rolling, expanding, ...

See
http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.rolling.html#xarray.DataArray.rolling

For numpy:
https://gist.github.com/seberg/3866040
http://www.rigtorp.se/2011/01/01/rolling-statistics-numpy.html

Bottleneck also supports move_*
https://pypi.python.org/pypi/Bottleneck
there are no built-in move functions in numpy, so it compares against its own implementation:
https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/slow/move.py

But it seems like Pandas works well even with numpy arrays, so I guess I shouldn't bother and simply use Pandas which has a lot more features than all the other solutions anyway.

http://stackoverflow.com/a/30141358/288162
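
To make the "rolling" semantics concrete, here is a dependency-free sketch of a rolling mean. It only illustrates what such a window function computes; rolling_mean is a hypothetical helper, and the actual implementation would more likely delegate to pandas as suggested above.

```python
from collections import deque


def rolling_mean(values, window):
    """Return the mean of the last `window` values at each position.

    Positions with fewer than `window` observations yield None, similar to
    the NaN padding pandas produces by default.
    """
    buf = deque(maxlen=window)
    total = 0.0
    result = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to fall out of the window
        buf.append(v)
        total += v
        result.append(total / window if len(buf) == window else None)
    return result


print(rolling_mean([1, 2, 3, 4, 5], 3))
```

An "expanding" variant would simply never drop values from the running total.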

Add title in all array creation functions.

Filtering on anonymous axes in viewer does not work

add release notes to the repository

The objective is twofold: keep a record of them, and write them as we go during development so that making a release is less painful.

refactor viewer Model to include the concept of axes names

The goal is to clean up the current code.
It would probably help to store/support LArrays directly in the model, instead of converting to np.ndarray, but I don't want the whole model to require the use of LArrays (because in that case we will not be able to send the code back to upstream Spyder). One option is to make a generic model and have a specific LArray model which would inherit from it. The goal would be to have as much functionality as possible/reasonable available in the generic model (ie plot, copy & paste, filter -- but not by labels obviously). Unsure if that is reasonable though :)
One clear requirement is to keep the ability to view non-LArrays (np.ndarray, lists, tuples). The easiest way to do this would be to convert them to LArray in the __init__ of the Model, but I would rather avoid that for the above reason.
Another point to keep in mind is that it should be able to handle pandas DataFrames in the future without too much change.

add docstrings (& examples) for aggregate methods

We might want to use a template for part of it.

Change signature of functions sum, mean, ...

Replace *args and **kwargs by equivalent arguments of Numpy functions.

read_excel(nb_index=...) is broken

We should also change the default value of index_col in the other methods from [] to None, but the code needs to be changed accordingly.

position or index

We should pick one of the two terms and stick with it. Currently we use both (.i and .ipoints but PGroup and posarg*). We should either have:
.p[], .ppoints[], PGroup and posarg*
or
.i[], .ipoints[], IGroup and iarg* (or indarg*)

LSet repr should not include OrderedSet

>>> letters = Axis('letters', 'a..z')
>>> letters[':c'].set() & letters['b:d'].set()
letters.set[OrderedSet(['b', 'c'])]

It should rather be:

letters.set['b', 'c']
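
A toy illustration of the desired repr, assuming nothing about the real LSet internals: ToyLSet is a hypothetical stand-in that stores its labels in an ordered container and formats the labels themselves rather than the container.

```python
class ToyLSet:
    """Hypothetical stand-in for LSet: ordered, duplicate-free labels on a named axis."""

    def __init__(self, axis_name, labels):
        self.axis_name = axis_name
        # dict keys keep insertion order and drop duplicates
        self.labels = tuple(dict.fromkeys(labels))

    def __and__(self, other):
        keep = set(other.labels)
        return ToyLSet(self.axis_name, [l for l in self.labels if l in keep])

    def __repr__(self):
        # show the labels, not the container that happens to hold them
        inner = ", ".join(repr(l) for l in self.labels)
        return f"{self.axis_name}.set[{inner}]"


left = ToyLSet("letters", ["a", "b", "c"])
right = ToyLSet("letters", ["b", "c", "d"])
print(left & right)
```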

Simple extrapolation API

I.e. fill missing data points after non-missing data points.

See:
http://stackoverflow.com/questions/22491628/extrapolate-values-in-pandas-dataframe/35959909#35959909

Pandas supports interpolation natively (ie fill missing data points between non-missing data points).

http://pandas.pydata.org/pandas-docs/stable/missing_data.html#interpolation
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html#pandas-dataframe-interpolate
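
The simplest form of extrapolation (repeating the last known value) could be sketched as follows. This is plain Python on a list of floats, not a proposed API; extrapolate_constant is a hypothetical name, and interior gaps are deliberately left alone because filling those is interpolation.

```python
import math


def extrapolate_constant(values):
    """Fill trailing NaNs with the last non-missing value; leave the rest untouched."""
    result = list(values)
    # index of the last non-missing value, or None if everything is missing
    last_valid = next((i for i in range(len(result) - 1, -1, -1)
                       if not math.isnan(result[i])), None)
    if last_valid is not None:
        for i in range(last_valid + 1, len(result)):
            result[i] = result[last_valid]
    return result


print(extrapolate_constant([1.0, math.nan, 3.0, math.nan, math.nan]))
```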

explore implementing set operations on Axis and Group

setdiff1d (numpy) -- works
delete (numpy) -- works for int, slice or list of indices
list.remove (python) -- works for value (inplace)
list.pop (python) -- index

since LArray is more like numpy arrays than Python lists => not remove and pop
=> delete and idelete?
does the label version (delete?) return only unique labels? ie is it a set-like op?

union
intersect
setdiff
setxor
setin

possibly on LArray too (though 1d/flattened only in that case like numpy -- because otherwise that returns non cubic arrays).
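
For reference, order-preserving versions of the first three operations can be written on plain label lists as below. This is only a sketch of the expected semantics (labels keep the order of the left operand, duplicates are dropped); the function names are placeholders, not a decided API.

```python
def union(a, b):
    """Labels of `a` followed by the labels of `b` not already present."""
    return list(dict.fromkeys([*a, *b]))


def intersection(a, b):
    """Labels of `a` that are also in `b`, in the order of `a`."""
    members = set(b)
    return [x for x in dict.fromkeys(a) if x in members]


def difference(a, b):
    """Labels of `a` that are not in `b`, in the order of `a`."""
    members = set(b)
    return [x for x in dict.fromkeys(a) if x not in members]


print(union(['a', 'b', 'c'], ['b', 'd']))
print(intersection(['a', 'b', 'c'], ['b', 'd']))
print(difference(['a', 'b', 'c'], ['b', 'd']))
```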

generalize/extend Session to be more LArray-like

It should be as close as possible to an LArray with an "array" axis.

# sum the age axis of all arrays *iff* they have such an axis
s.sum(x.age)
# sum all axes of each array present in the session
s.sum_by(x.array)

viewer: select column/row does not load all data

file extension(s) for larray-compatible files

so that we can register the extension(s) with the viewer/editor/future IDE
e.g. .lacsv and .lah5, or .lcsv and .lh5?

to_clipboard is broken

The bug is in Pandas and/or in pyperclip (used by Pandas).
I suspect it is fixed in upstream pyperclip but not in the copy included in Pandas. So it might only need a PR to pandas which simply updates pyperclip to the latest version.

provide larrayenv package

which depends on larray and all optional larray dependencies so that our users only need to do:

conda update larrayenv

and be sure to have all the functionalities installed.

add set operations to Session

via specialized methods (union, difference (or setdiff) and intersection)

generalize stack to more than 1 dimension

Here are a few syntax experiments (but see also #30):

# 2D
stack([(('BE', 'M'), 1.0), (('BE', 'F'), 0.0),
       (('FO', 'M'), 1.0), (('FO', 'F'), 0.0)], ('nat', 'sex'))

# 3D
# a) flat list, label tuple
stack([(('BE', 1, 'M'), 1.0), (('BE', 1, 'F'), 0.0),
       (('BE', 2, 'M'), 1.0), (('BE', 2, 'F'), 0.0),
       (('BE', 3, 'M'), 1.0), (('BE', 3, 'F'), 0.0),
       (('FO', 1, 'M'), 1.0), (('FO', 1, 'F'), 0.0),
       (('FO', 2, 'M'), 1.0), (('FO', 2, 'F'), 0.0),
       (('FO', 3, 'M'), 1.0), (('FO', 3, 'F'), 0.0)],
      ('nat', 'type', 'sex'))

# b) recursive structure
stack([('BE', [(1, [('M', 1.0), ('F', 0.0)]),
               (2, [('M', 1.0), ('F', 0.0)]),
               (3, [('M', 1.0), ('F', 0.0)])]),
       ('FO', [(1, [('M', 1.0), ('F', 0.0)]),
               (2, [('M', 1.0), ('F', 0.0)]),
               (3, [('M', 1.0), ('F', 0.0)])])],
       ('nat', 'type', 'sex'))
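
The two 3D forms carry the same information, so whichever syntax is chosen, one can be converted to the other. A small sketch of the conversion from the recursive form (b) to the flat form (a), where flatten is a hypothetical helper and depth is the number of axes:

```python
def flatten(nested, depth):
    """Turn the recursive form into the flat [(label_tuple, value), ...] form."""
    if depth == 0:
        # `nested` is a scalar value at this point
        return [((), nested)]
    flat = []
    for label, child in nested:
        for labels, value in flatten(child, depth - 1):
            flat.append(((label,) + labels, value))
    return flat


nested = [('BE', [(1, [('M', 1.0), ('F', 0.0)])]),
          ('FO', [(1, [('M', 1.0), ('F', 0.0)])])]
for item in flatten(nested, 3):
    print(item)
```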

add XXX_by aggregate methods

eg a.sum_by(x.age)

which should be equivalent to

a.sum(a.axes - x.age)

(which does not work, because aggregate functions do not support an AxisCollection argument)

but this works:

a.sum(*(a.axes - x.age))
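
The relationship between the two can be shown on a toy representation, where an array is a dict mapping label tuples to values and axes is a tuple of axis names. Both functions are hypothetical sketches, not larray code: sum_by is defined as summing over the complement of the axes to keep.

```python
from collections import defaultdict


def sum_over(cells, axes, drop):
    """Sum away the axes named in `drop`; return (remaining_axes, new_cells)."""
    kept = [i for i, name in enumerate(axes) if name not in drop]
    totals = defaultdict(float)
    for key, value in cells.items():
        totals[tuple(key[i] for i in kept)] += value
    return tuple(axes[i] for i in kept), dict(totals)


def sum_by(cells, axes, keep):
    """Sum away every axis except those named in `keep`."""
    return sum_over(cells, axes, [name for name in axes if name not in keep])


axes = ('age', 'sex')
cells = {(0, 'M'): 1.0, (0, 'F'): 2.0, (1, 'M'): 3.0, (1, 'F'): 4.0}
print(sum_by(cells, axes, ['age']))
print(sum_over(cells, axes, ['age']))
```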

several versions of labels and axes names

mostly for output/reporting
short/long
language
...

Axis('a', '-1..9') is broken

implement .reindex_axis

>>> arr = ndtest((2, 3))
>>> arr.reindex_axis(x.a, ['a1', 'a0', 'a1', 'a3'])
 a\b |  b0 |  b1 |  b2
  a1 |   3 |   4 |   5
  a0 |   0 |   1 |   2
  a1 |   3 |   4 |   5
  a3 | nan | nan | nan

It might be easy to implement using something vaguely looking like:

>>> new_axis = Axis(old_axis.name, new_labels)
>>> missing_value = missing[dtype]
>>> old_indices = old_axis.translate(new_labels, missing=-1)
>>> result = self.i[old_indices]
>>> # fix up those which were missing
>>> result[old_indices == -1] = missing_value
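
The same idea as a self-contained sketch on plain Python lists, restricted to the first axis of a 2D table. reindex_rows is a hypothetical helper; the -1 sentinel plays the role of the missing=-1 argument above, and NaN is assumed as the missing value.

```python
import math


def reindex_rows(old_labels, rows, new_labels, missing=math.nan):
    """Reorder, repeat or add rows so that they match `new_labels`."""
    position = {label: i for i, label in enumerate(old_labels)}
    width = len(rows[0]) if rows else 0
    result = []
    for label in new_labels:
        i = position.get(label, -1)  # -1 marks a label absent from the old axis
        result.append(list(rows[i]) if i != -1 else [missing] * width)
    return result


rows = [[0, 1, 2], [3, 4, 5]]  # same values as the ndtest((2, 3)) example
out = reindex_rows(['a0', 'a1'], rows, ['a1', 'a0', 'a1', 'a3'])
for label, row in zip(['a1', 'a0', 'a1', 'a3'], out):
    print(label, row)
```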

implement d0

viewer colors are messed up in some cases

e.g. pycharm_minicourse.s_pop (when qx is not clipped)
s_pop = proj_pop.sum_by(x.time)[2040:]

xxx_by methods do not work with groups

Should add some tests also.

with_axes should copy title

The fact that it does not copy the title breaks the title argument of ndtest.

check that ipfp totals have expected axes

Namely to raise a more meaningful error when the totals are swapped.

implement .extend on Axis

... it should accept a list of labels or an Axis and return a new Axis object (not modify the Axis in-place)
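
A sketch of the intended behaviour on a toy class (ToyAxis is hypothetical and far simpler than the real Axis). A frozen dataclass is used here only to make the "no in-place modification" requirement explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAxis:
    name: str
    labels: tuple

    def extend(self, labels):
        """Return a new axis with `labels` appended; `self` is left unchanged."""
        # accept either another ToyAxis or any iterable of labels
        extra = labels.labels if isinstance(labels, ToyAxis) else tuple(labels)
        return ToyAxis(self.name, self.labels + extra)


sex = ToyAxis('sex', ('M', 'F'))
longer = sex.extend(['U'])
print(sex)
print(longer)
```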

Implement a better syntax for initializing an array with "constant" values

eg. provide an alternative to:

>>> nat = Axis('nat', ['BE', 'FO'])
>>> sex = Axis('sex', ['M', 'F'])
>>> LArray([[0, 1], [2, 3]], [nat, sex])
nat\sex | M | F
     BE | 0 | 1
     FO | 2 | 3

because it is so error prone.

For a 1d array, stack works nicely, but for 2+, it quickly gets awful.

>>> stack([('M', 0), ('F', 1)], 'sex')

implementing a from_lists function would probably solve this nicely (though a better name might help):

>>> from_lists([['nat\\sex', 'M', 'F'],
...             ['BE',        0,   1],
...             ['FO',        2,   3]])
nat\sex | M | F
     BE | 0 | 1
     FO | 2 | 3
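
A rough sketch of what the parsing step of such a function might do for the 2D case. It returns plain Python structures instead of an LArray, and from_lists_sketch is a placeholder name; the real function would also need to handle more than two dimensions.

```python
def from_lists_sketch(table):
    """Split a header-plus-rows table into axes and a 2D list of values.

    The top-left cell is expected to hold the two axis names separated by a
    backslash, as in the array display format.
    """
    header, *body = table
    row_axis, col_axis = header[0].split('\\')
    axes = {row_axis: [row[0] for row in body],
            col_axis: list(header[1:])}
    data = [list(row[1:]) for row in body]
    return axes, data


axes, data = from_lists_sketch([['nat\\sex', 'M', 'F'],
                                ['BE',         0,   1],
                                ['FO',         2,   3]])
print(axes)
print(data)
```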

using a pd.MultiIndex
storing a pd.Dataframe in memory instead of np.ndarray (mostly done in pandasbased3 branch)
implementing our own MultiIndex-like object

add unit tests for Excel I/O

This is a very important part and it is not tested at all currently, and I manage to break it every other release.