physt's Introduction

physt

P(i/y)thon h(i/y)stograms. Inspired (and based on) numpy.histogram, but designed for humans(TM) on steroids(TM).

The goal is to unify the different concepts of histograms found in numpy, pandas, matplotlib, ROOT, etc., and to create one representation that is easy to manipulate from the data point of view while integrating nicely with IPython notebooks and offering various plotting options. In short, whatever you want to do with histograms, physt aims to be on your side.

Simple example

from physt import h1

# Create the sample
heights = [160, 155, 156, 198, 177, 168, 191, 183, 184, 179, 178, 172, 173, 175,
           172, 177, 176, 175, 174, 173, 174, 175, 177, 169, 168, 164, 175, 188,
           178, 174, 173, 181, 185, 166, 162, 163, 171, 165, 180, 189, 166, 163,
           172, 173, 174, 183, 184, 161, 162, 168, 169, 174, 176, 170, 169, 165]

hist = h1(heights, 10)           # <--- get the histogram data
hist << 190                      # <--- add a forgotten value
hist.plot()                      # <--- and plot it

Heights plot

2D example

from physt import h2
import seaborn as sns

iris = sns.load_dataset('iris')
iris_hist = h2(iris["sepal_length"], iris["sepal_width"], "pretty", bin_count=[12, 7], name="Iris")
iris_hist.plot(show_zero=False, cmap="gray_r", show_values=True);

Iris 2D plot

3D directional example

import numpy as np
from physt import special_histograms

# Generate some sample data
data = np.empty((1000, 3))
data[:,0] = np.random.normal(0, 1, 1000)
data[:,1] = np.random.normal(0, 1.3, 1000)
data[:,2] = np.random.normal(1, .6, 1000)

# Get histogram data (in spherical coordinates)
h = special_histograms.spherical(data)

# And plot its projection on a globe
h.projection("theta", "phi").plot.globe_map(density=True, figsize=(7, 7), cmap="rainbow")

Directional 3D plot

See more in the docstrings and notebooks.

Installation

Using pip:

pip install physt

or conda:

conda install -c janpipek physt

Features

Implemented

  • 1D histograms
  • 2D histograms
  • ND histograms
  • Some special histograms
    • 2D polar coordinates (with plotting)
    • 3D spherical / cylindrical coordinates (beta)
  • Adaptive rebinning for on-line filling of unknown data (beta)
  • Non-consecutive bins
  • Memory-effective histogramming of dask arrays (beta)
  • Understands any numpy-array-like object
  • Keep underflow / overflow / missed bins
  • Basic numeric operations (* / + -)
  • Items / slice selection (including mask arrays)
  • Add new values (fill, fill_n)
  • Cumulative values, densities
  • Simple statistics for original data (mean, std, sem) - only for 1D histograms
  • Plotting with several backends
    • matplotlib (static plots with many options)
    • vega (interactive plots, beta, help wanted!)
    • folium (experimental for geo-data)
    • plotly (very basic, help wanted!)
    • ascii (experimental)
  • Algorithms for optimized binning
    • pretty (nice rounded bin edges)
    • mathematical (statistical, quantile-based, geometrical, ...)
  • IO, conversions
    • JSON (input and output)
    • xarray.DataSet (input and output, experimental)
    • ROOT file (output only, experimental)
    • pandas.DataFrame (output only, basic)

Planned

  • Rebinning
    • using reference to original data?
    • merging bins
  • Statistics (based on original data)?
  • Stacked histograms (with names)
  • Potentially holoviews plotting backend (instead of the discontinued bokeh one)

Not planned

  • Kernel density estimates - use your favourite statistics package (like seaborn)
  • Rebinning using interpolation - it should be trivial to use rebin (https://github.com/jhykes/rebin) with physt

Rationale (for both): physt is dumb, but precise.

Dependencies

  • Python 3.7+
  • Numpy 1.20+
  • (optional) matplotlib - simple output
  • (optional) xarray - I/O
  • (optional) uproot - I/O
  • (optional) astropy - additional binning algorithms
  • (optional) folium - map plotting
  • (optional) vega3 - for vega in-line in IPython notebook (note that to generate vega JSON, this is not necessary)
  • (optional) xtermcolor - for ASCII color maps
  • (testing) pytest, pandas
  • (docs) sphinx, sphinx_rtd_theme, ipython

Publicity

Talk at PyData Berlin 2018:

Contribution

I am looking for anyone interested in using / developing physt. You can contribute by reporting errors, implementing missing features and suggesting new ones.

Thanks to:

Patches:

Alternatives and inspirations

physt's People

Contributors

gitter-badger, janpipek, marinang, toddrme2178

physt's Issues

Histogram collections

Create a collection of histograms in a file, plus a utility function to merge multiple files into a single file.
This provides a way to distribute processing and histogram creation, store the histograms, and merge them.

I implemented rudimentary versions with JSON and Google protobuf. TensorFlow stores summary data and histograms in protocol buffers, so I gave that a try. Protocol buffers have versioning capability, which is just one potential benefit. See my fork and the protobuf tag.

Happy to work on this some more with your input.

One thing that would help is to make 'name' required, such that the Summary protocol buffer map <string, Histogram> takes as key the histogram name. This is important for the merging. Possibly merging can be implemented without converting back into the histogram, so we only deserialize, sum and write out new summary.
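
The merge-by-name idea can be sketched without any serialization library. The snippet below assumes each collection is a plain mapping from a histogram name to a record with `edges` and `frequencies`; that layout and the `merge_collections` helper are hypothetical, not physt's JSON schema or the protobuf messages from the fork.

```python
from typing import Dict, List

# Hypothetical record layout: {"edges": [...], "frequencies": [...]}
Record = Dict[str, List[float]]


def merge_collections(*collections: Dict[str, Record]) -> Dict[str, Record]:
    """Sum the frequencies of same-named histograms across collections."""
    merged: Dict[str, Record] = {}
    for collection in collections:
        for name, record in collection.items():
            if name not in merged:
                # Copy the lists so that the inputs are never modified.
                merged[name] = {
                    "edges": list(record["edges"]),
                    "frequencies": list(record["frequencies"]),
                }
                continue
            target = merged[name]
            if target["edges"] != record["edges"]:
                raise ValueError(f"Incompatible binning for histogram {name!r}")
            target["frequencies"] = [
                a + b for a, b in zip(target["frequencies"], record["frequencies"])
            ]
    return merged


# Two partial results, e.g. produced by two workers (made-up numbers).
worker_1 = {"heights": {"edges": [150, 170, 190], "frequencies": [3, 5]}}
worker_2 = {
    "heights": {"edges": [150, 170, 190], "frequencies": [1, 2]},
    "weights": {"edges": [50, 75, 100], "frequencies": [4, 4]},
}
summary = merge_collections(worker_1, worker_2)
print(summary["heights"]["frequencies"])  # [4, 7]
```

Because the name is the only key, two histograms with the same name but different binning cannot be merged; the sketch raises an error in that case rather than guessing.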

Warning in current numpy

If you try to merge bins:

from physt import h2
from scipy.stats import multivariate_normal
hist = h2(*multivariate_normal.rvs((0,0), size=100_000).T, bins=100)
hist.merge_bins(2)

You get a warning from numpy:

/home/schreihf/.local/lib/python3.7/site-packages/physt/histogram_base.py:572: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  new_frequencies[new_index] += old_frequencies[old_index]
/home/schreihf/.local/lib/python3.7/site-packages/physt/histogram_base.py:573: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  new_errors2[new_index] += old_errors2[old_index]
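
The warning text itself names the remedy: convert the index sequence to a tuple before indexing. A minimal sketch on a made-up array (this is not the code at histogram_base.py:572, only an illustration of the indexing rule):

```python
import numpy as np

frequencies = np.arange(12).reshape(3, 4)

# An index assembled as a list, one entry per axis (row 1, columns 0-1).
index = [1, slice(0, 2)]

# Converting to a tuple makes NumPy treat each entry as a per-axis index.
selected = frequencies[tuple(index)]
print(selected)  # [4 5]
```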

Wrong interpretation of range in ND binning

This:

h2(data, range=((min1, min2), None))

is not interpreted as intended (i.e. no limits for the second axis) but as a shared argument for all axes. The problem is probably in np.asarray, at binnings.py:766.
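
One possible way to avoid the ambiguity is to expand the argument into an explicit per-axis list before any array conversion. The helper below is a hypothetical sketch of that idea, not physt's implementation; the name `normalize_ranges` and the rule "a flat pair of numbers is shared by all axes" are assumptions.

```python
from typing import List, Optional, Tuple

AxisRange = Optional[Tuple[float, float]]


def normalize_ranges(range_, ndim: int) -> List[AxisRange]:
    """Expand a `range` argument into one (min, max) pair or None per axis."""
    if range_ is None:
        return [None] * ndim
    items = list(range_)
    # Assumption: a flat pair of numbers is one range shared by all axes.
    if len(items) == 2 and all(isinstance(x, (int, float)) for x in items):
        return [(items[0], items[1])] * ndim
    if len(items) != ndim:
        raise ValueError(f"Expected {ndim} per-axis ranges, got {len(items)}")
    # Otherwise each entry belongs to one axis; None means "no limits".
    return [None if item is None else (item[0], item[1]) for item in items]


print(normalize_ranges(((0, 10), None), ndim=2))  # [(0, 10), None]
print(normalize_ranges((0, 10), ndim=2))          # [(0, 10), (0, 10)]
```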

Option to center labels on bins

If you have a large dataset with a small number of values (such as consisting only of integers 1-10) then it would be nice to have the bin x-axis labels at the center under the respective bin instead of at the bin edges.

I recognise this case is more of a 'histogram as bar plot' kind of thing, but it is a use-case I have often.
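
As a workaround sketch, the label positions can be computed from the bin edges and passed to the plotting library as tick locations. The `bin_centers` helper is hypothetical and the half-integer edges are just one way to bin the integers 1-10.

```python
def bin_centers(edges):
    """Return the midpoint of each pair of consecutive bin edges."""
    return [(left + right) / 2 for left, right in zip(edges[:-1], edges[1:])]


# Integer-valued data 1..10, binned with half-integer edges 0.5, 1.5, ..., 10.5
edges = [i + 0.5 for i in range(0, 11)]
centers = bin_centers(edges)
print(centers)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
```

With matplotlib, these values could be handed to `ax.set_xticks(centers)` on the axes that hold the plot.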

Root compatibility

  • physt.compat.root.from_root(TH1F...)
  • physt.compat.root.to_root(HistogramBase)

Memory efficiency problem

When creating a histogram from huge data, a huge amount of memory is temporarily allocated, even though no copy should be created.

Suspects:

  • dropna ???
  • weights

Dtypes for bin contents

Support dtypes.

To do:

  • Support for creation in Histogram1D
  • Support for creation in HistogramND & calculate frequencies
  • Loss-less arithmetics with constants
  • Check fill, fill_n
  • Addition of different types
  • Support retyping (dtype=...)
  • More thorough testing
  • Document it properly

Wrong std

std gives twice the correct value (at least in the notebook).

bqplot support?

Its features look quite interesting, perhaps more than bokeh's...

Geospatial histograms

Would be nice to be able to plot these in maps.

However, this, together with #31, establishes the necessity to keep axes semantics (for axes groups). Not trivial...

Datetime / calendar histograms

Requirements:

Possible axes:

  1. month (cyclic)
  2. year (continuous)
  3. year+month (continuous)
  4. day (continuous)
  5. hour (cyclic)
  6. weekday (cyclic)
  7. week+weekday (continuous)
  8. year + week (continuous)
    ...

Properties:

  1. It must be possible to have a float quantity on other axes

Use case:

I have the following data:

2010-03-10, 4.3 deg
2010-03-11, -0.3 deg
...
2017-03-13, 3.4 deg
2017-03-14, 6.2 deg

I need a simple way to construct the histogram temperature vs. month of the year, ideally just calling h2...

Compatibility:

  • must work with date, datetime objects
  • must work with pandas internal and nat values
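
To make the use case concrete, here is a plain-Python sketch of "temperature vs. month of the year" counting, using the four readings quoted above. The function name, the 5-degree bin width and the dictionary-based result are all assumptions for illustration; this is not an existing physt API.

```python
from collections import Counter
from datetime import date

# The readings from the use case: (day, temperature in degrees)
readings = [
    (date(2010, 3, 10), 4.3),
    (date(2010, 3, 11), -0.3),
    (date(2017, 3, 13), 3.4),
    (date(2017, 3, 14), 6.2),
]

TEMP_BIN_WIDTH = 5.0  # hypothetical bin width in degrees


def month_temperature_counts(rows):
    """Count readings per (month, temperature bin) cell.

    The month axis is cyclic (1-12, all years folded together); the
    temperature axis is an ordinary float axis with fixed-width bins.
    """
    counts = Counter()
    for day, temperature in rows:
        # Floor division also places negative temperatures in the right bin.
        temp_bin = int(temperature // TEMP_BIN_WIDTH)
        counts[(day.month, temp_bin)] += 1
    return counts


counts = month_temperature_counts(readings)
for (month, temp_bin), n in sorted(counts.items()):
    low = temp_bin * TEMP_BIN_WIDTH
    print(f"month {month:2d}, [{low:5.1f}, {low + TEMP_BIN_WIDTH:5.1f}) deg: {n}")
```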

ASCII plot

Only an idea so far, but it would be nice to be able to understand the distribution in the console as well...
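
A rough sketch of what such console output could look like, written against plain lists of edges and frequencies. The `ascii_histogram` function and its layout are hypothetical and independent of any plotting backend in physt; the numbers are made up.

```python
def ascii_histogram(edges, frequencies, width=40):
    """Render one text row per bin, with bars scaled to the largest frequency."""
    peak = max(frequencies) or 1  # avoid division by zero for empty histograms
    lines = []
    for left, right, freq in zip(edges[:-1], edges[1:], frequencies):
        bar = "#" * round(width * freq / peak)
        lines.append(f"[{left:>6.1f}, {right:>6.1f}) {bar} {freq}")
    return "\n".join(lines)


text = ascii_histogram([150, 160, 170, 180, 190, 200], [3, 16, 25, 10, 2])
print(text)
```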

h1 (resp. calculate_frequencies) is too slow

Actually, it is ~10 times slower than np.histogram. There are reasons for this (it is more general), but in the simple case we should not add any overhead.

Suggestion: when parameters allow, fall back to np variant.

Special types of histograms

  • Polar (2D)
  • Cylindrical (3D)
  • Spherical (3D)
  • Projection polar -> radial
  • Projection polar -> azimuthal
  • Projection cylindrical -> polar
  • Projection cylindrical -> rhoz
  • Projection cylindrical -> phiz
  • Projection cylindrical -> radial
  • Projection spherical -> directional
  • Projection spherical -> radial
    ...
  • add clear interpretation and reasonable call arguments for the facade functions

In general, some mapping of values?

Add 2D & ND histograms

  • Analogous data model to Histogram1D
  • refactor HistogramBase class -> common behaviour of 1D and 2D
  • revisit binning schemas
  • histogram2D facade function to be compatible with numpy one
  • plotting
  • arithmetic operations
  • documentation
  • stats

Statistics incorrect

Strange values of std for reasonable distribution in the presence of weights.
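
For checking such values by hand, a reference calculation on the raw data can help. The sketch below assumes frequency-style weights and the population (biased) definition of the standard deviation; whether physt uses the same definition is not stated here, and `weighted_mean_std` is a hypothetical helper.

```python
import math


def weighted_mean_std(values, weights):
    """Weighted mean and population standard deviation of raw values."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


# Weights (1, 2, 1) are equivalent to the unweighted sample [1, 2, 2, 3].
mean, std = weighted_mean_std([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])
print(mean, std)
```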

Wrong xlim for bar plot with just one bin

In this degenerate case, xlim for a bar matplotlib plot is set to (0, 1):

import matplotlib.pyplot as plt
from physt.histogram1d import Histogram1D
h = Histogram1D([0, 10], [1])
h.plot()
plt.show()

However, this works:

import matplotlib.pyplot as plt
plt.bar([0], [1], [10], align="edge")
plt.show()

python 2.7 plotting is not working

When running the plot() function I get the error below even though matplotlib is installed.
Also, the algorithm is pretty slow when running on anything bigger than a toy example.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/physt/plotting/__init__.py", line 137, in __call__
    return plot(self.histogram, kind=kind, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/physt/plotting/__init__.py", line 91, in plot
    backend_name, backend = _get_backend(backend)
  File "/usr/local/lib/python2.7/dist-packages/physt/plotting/__init__.py", line 70, in _get_backend
    raise RuntimeError("No plotting backend available. Please, install matplotlib (preferred) or bokeh (limited).")
RuntimeError: No plotting backend available. Please, install matplotlib (preferred) or bokeh (limited).
