pdroot's Introduction

Installation

pip install pdroot

Reading/writing ROOT files

import pandas as pd
import numpy as np
import pdroot # powers up pandas

df = pd.DataFrame(dict(foo=np.random.random(100), bar=np.random.random(100)))

# write out dataframe to a ROOT file
df.to_root("test.root")

# read ROOT files and optionally specify certain columns and/or a range of rows
df = pd.read_root("test.root", columns=["foo"], entry_start=0, entry_stop=50)

Histogram drawing from DataFrames

For those familiar with ROOT's TTree::Draw(), you can compute a histogram directly from a DataFrame. The first two arguments are the expression string and the query/selection string; any keyword arguments after them are passed to a yahist Hist1D or Hist2D. "Jagged" branches are also supported (see below).

# expression string and a query/selection string
df.draw("mass+0.1", "0.1 < foo < 0.2", bins="200,0,10")

# 2D with "x:y"
df.draw("mass:foo+1", "0.1 < foo < 0.2")
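To make the 1D call concrete, here is a rough pure-Python sketch of what drawing an expression with a selection and a bins string amounts to. It assumes the bins string is read as "nbins,low,high"; parse_bins and fill_hist are illustrative names, not part of pdroot or yahist, and yahist has its own handling of out-of-range values that this sketch does not reproduce.

```python
def parse_bins(spec):
    """Turn an "nbins,low,high" string into a list of nbins+1 bin edges."""
    nbins, low, high = spec.split(",")
    nbins, low, high = int(nbins), float(low), float(high)
    width = (high - low) / nbins
    return [low + i * width for i in range(nbins + 1)]


def fill_hist(values, edges):
    """Count values per bin; values outside [low, high) are simply dropped here."""
    low, high, nbins = edges[0], edges[-1], len(edges) - 1
    counts = [0] * nbins
    for v in values:
        if low <= v < high:
            idx = int((v - low) / (high - low) * nbins)
            counts[min(idx, nbins - 1)] += 1  # guard against float round-up
    return counts


# toy "dataframe" rows: (mass, foo)
rows = [(1.0, 0.15), (2.5, 0.12), (2.6, 0.50), (9.95, 0.19)]

# rough analogue of df.draw("mass+0.1", "0.1 < foo < 0.2", bins="5,0,10")
selected = [mass + 0.1 for mass, foo in rows if 0.1 < foo < 0.2]
counts = fill_hist(selected, parse_bins("5,0,10"))
print(counts)
```

The selection is applied per row before the expression values are binned, which is the same order of operations the draw call describes.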

Jagged arrays (e.g., in NanoAOD)

Manual reading

One can read jagged arrays into regular DataFrames without converting to (super slow) lists of lists by using zero-copy conversions from awkward1 to arrow and the fletcher pandas ExtensionArray.

df = pd.read_root("nano.root", columns=["/Electron_(pt|eta|phi|mass)$/", "MET_pt"])
df.head()
Electron_eta Electron_mass Electron_phi Electron_pt MET_pt
0 [] [] [] [] 208.131
1 [2.1259766] [0.12030029] [0.4970703] [121.077896] 96.3884
2 [-1.1259766] [-0.00405121] [0.1531372] [12.117786] 284.988
3 [0.17492676] [-0.04089355] [2.9018555] [178.91772] 26.7631
4 [ 0.12136841 -1.8227539 ] [-0.00730515 -0.00543594] [1.4355469 1.3552246] [19.721205 14.386331] 48.4577
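The columns argument above mixes a plain branch name with a pattern wrapped in slashes. A minimal sketch of how such a list could be resolved against the branch names in a file, assuming slash-delimited entries are treated as regular expressions matched from the start of the name (expand_columns is a hypothetical helper, not pdroot's API):

```python
import re


def expand_columns(requested, available):
    """Resolve plain names and /regex/ patterns against available branch names."""
    selected = []
    for name in requested:
        if len(name) > 2 and name.startswith("/") and name.endswith("/"):
            # assumption: "/.../" marks a regular expression
            pattern = re.compile(name[1:-1])
            matches = [b for b in available if pattern.match(b)]
        else:
            matches = [b for b in available if b == name]
        for b in matches:
            if b not in selected:  # keep first-seen order, no duplicates
                selected.append(b)
    return selected


branches = ["Electron_pt", "Electron_eta", "Electron_phi", "Electron_mass",
            "Electron_charge", "MET_pt", "Jet_pt"]
cols = expand_columns(["/Electron_(pt|eta|phi|mass)$/", "MET_pt"], branches)
print(cols)
```

The trailing $ in the pattern is what keeps a branch such as Electron_charge out of the selection.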

It's easy to get the awkward array(s) from the fletcher columns (also a zero-copy operation):

>>> df["Electron_pt"].ak() 

<Array [[], [121], ... [179], [19.7, 14.4]] type='4273 * var * float64'>

>>> df[["Electron_pt","Electron_eta"]].ak()

<Array [{Electron_pt: [], ... -1.82]}] type='4273 * {"Electron_pt": option[var * fl...'>

A jagged array can be introduced back into a DataFrame:

from pdroot import to_pandas
df["Electron_good"] = to_pandas(df["Electron_pt"].ak() > 40)

Drawing/evaluating expressions and queries

Drawing from a DataFrame handles jagged columns via awkward-array and AST transformations.

# supports reduction operators:
#    min/max/sum/mean/length/len/argmin/argmax
#    (ROOT's Length$ -> length, etc)
df.draw("length(Jet_pt)")
df.draw("sum(Jet_pt>10)", "MET_pt>40", bins="5,-0.5,4.5")
df.draw("max(abs(Jet_eta))")
df.draw("Jet_eta[argmax(Jet_pt)]")

# combine event-level and object-level selection
df.draw("Jet_pt", "abs(Jet_eta) > 1.0 and MET_pt > 10")

# 2D
df.draw("Jet_pt:Jet_eta", "MET_pt > 40.")

# indexing; imagine you are operating row-by-row, so Jet_pt[0], not Jet_pt[:,0]
df.draw("Jet_pt[0]:Jet_eta[0]", "MET_pt > 10")

# combine reduction operators with fancy indexing
df.draw("sum(Jet_pt[abs(Jet_eta)<2.0])", bins="100,0,100")

# use the underlying array before a histogram is created
df["ht"] = df.draw("sum(Jet_pt[Jet_pt>40])", to_array=True)
df["ht"] = df.adraw("sum(Jet_pt[Jet_pt>40])") # think "*a*rray draw"

df.adraw (shortcut for df.draw(..., to_array=True)) is a columnar version of df.eval from pandas, which also supports jagged columns. For operations on a handful of arrays, df.adraw is a little faster than df.eval (which uses numexpr), at the cost of memory from intermediate array allocations.
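As a mental model for the jagged expressions above, it can help to write out the row-by-row equivalent with plain Python lists. This is only a reference for the intended per-event semantics, using made-up values; pdroot evaluates these expressions columnar-style with awkward arrays rather than with Python loops.

```python
# toy jagged column: one list of jet pTs per event
jet_pt = [[], [121.0], [55.0, 45.0, 12.0], [19.5, 14.5]]

# row-wise analogue of df.adraw("sum(Jet_pt[Jet_pt>40])")
ht = [sum(pt for pt in event if pt > 40) for event in jet_pt]

# row-wise analogue of df.adraw("length(Jet_pt)")
njets = [len(event) for event in jet_pt]

# row-wise analogue of df.adraw("sum(Jet_pt>10)"): number of jets passing the cut
njets_above_10 = [sum(pt > 10 for pt in event) for event in jet_pt]

print(ht, njets, njets_above_10)
```

Each result has one entry per event, which is why it can be assigned back to the DataFrame as an ordinary column.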

The expression parsing can be explored via

>>> pdroot.parse.to_ak_expr("sum(Jet_pt[:2])") # sum of first/leading two jet pTs

'ak.sum(ak.pad_none(Jet_pt, 3, clip=True)[:, :2], axis=-1)'
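The general idea of such an AST transformation can be sketched with the standard library alone. The class below is an illustration, not pdroot's Transformer: it handles only a few reduction names chosen for the example and skips the pad_none handling of slices shown in the output above.

```python
import ast

REDUCERS = {"sum", "min", "max", "mean"}  # subset chosen for illustration


class ReducerRewriter(ast.NodeTransformer):
    """Rewrite sum(x) -> ak.sum(x, axis=-1) for a few reduction names."""

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        if isinstance(node.func, ast.Name) and node.func.id in REDUCERS:
            # build "ak.<name>(axis=-1)" and move the original arguments into it
            new_call = ast.parse(f"ak.{node.func.id}(axis=-1)", mode="eval").body
            new_call.args = node.args
            return new_call
        return node


def to_ak_sketch(expr):
    tree = ReducerRewriter().visit(ast.parse(expr, mode="eval"))
    return ast.unparse(tree.body)  # ast.unparse requires Python 3.9+


print(to_ak_sketch("sum(Jet_pt[Jet_pt > 40])"))
print(to_ak_sketch("max(abs(Jet_eta))"))
```

Functions that are not in the reducer set, such as abs, pass through unchanged, so they keep acting elementwise on the jagged array.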

Lazy chunked reading

ChunkDataFrame subclasses pd.DataFrame and lazily reads from a chunk of a file (or a whole one).

df = pdroot.ChunkDataFrame(filename="nano.root", entry_start=0, entry_stop=100e3)

# keeps track of original indexing, so if subsetted,
# any newly read arrays will be modified to match
df = df[df["MET_pt"]>40]

pt = df["Jet_pt"].ak()
df["ht"] = ak.sum(pt[pt>40], axis=-1)  # per-event sum (awkward imported as ak)
# or
df["ht"] = df.adraw("sum(Jet_pt*(Jet_pt>40))")

df.head()
MET_pt Jet_pt ht
1 88.9181 [90.5625 60.28125 57.78125 34.40625 31.875 24.203125 19.21875 18.484375] 208.625
2 47.4594 [152.375 98. 72.75 70.75 48.28125 18.21875] 442.156
3 68.5163 [221.375 182.875 96.8125 83.5 43. 40.65625 25.34375 20.8125 18.015625 16.140625 15.53125 15.0390625] 668.219
5 130.703 [126.8125 69.875 64.75 41.84375 23.859375 21.15625 19.734375 15.796875] 303.281
7 131.178 [63.3125 48.96875 42.78125 33.8125 33.21875 18.5625 16.203125] 155.062
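The index bookkeeping described above can be illustrated with a toy class. This is a sketch of the idea only, assuming columns are read on first access and trimmed to the rows that survived earlier filtering; LazyColumns and its loader callable are hypothetical and do not reflect how ChunkDataFrame is implemented.

```python
class LazyColumns:
    """Toy stand-in for a lazily filled table that remembers surviving rows."""

    def __init__(self, loader, nrows):
        self._loader = loader              # callable: column name -> full list
        self._index = list(range(nrows))   # original row numbers still kept
        self._cache = {}

    def __getitem__(self, name):
        if name not in self._cache:
            full = self._loader(name)      # read only when first requested
            self._cache[name] = [full[i] for i in self._index]
        return self._cache[name]

    def filter(self, mask):
        """Keep rows where mask is True; cached columns are subset to match."""
        keep = [pos for pos, flag in enumerate(mask) if flag]
        out = LazyColumns(self._loader, 0)
        out._index = [self._index[pos] for pos in keep]
        out._cache = {k: [v[pos] for pos in keep] for k, v in self._cache.items()}
        return out


reads = []
store = {"MET_pt": [30.0, 88.9, 47.5, 12.0], "nJet": [2, 8, 6, 1]}


def loader(name):
    reads.append(name)  # record which columns were actually read
    return store[name]


table = LazyColumns(loader, nrows=4)
table = table.filter([met > 40 for met in table["MET_pt"]])
njet = table["nJet"]  # read after filtering, aligned to the kept rows
print(njet, reads)
```

Because the original row numbers are stored, a column requested after the selection lines up with the rows that remain.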

pdroot's People

Contributors

aminnj

pdroot's Issues

query chaining?

Add a df.qdraw maybe (but think more about the name first)? Basically it would be

def qdraw(df, *args, **kwargs):
    # evaluate the query as a boolean array and use it to subset the rows
    return df[df.adraw(*args, **kwargs)]

So that one could do

import pdroot

(pdroot.ChunkDataFrame(filename="nano.root", entry_stop=100)
  .qdraw("MET_pt>40")
  .assign(ht=lambda df: df.adraw("sum(Jet_pt[Jet_pt>40])"))
  .to_root("jets.root")
)

Implement weights for df.draw()

df.draw() needs a weights option, e.g.,

df.draw("Jet_pt[0]", "MET_pt>40", weights="length(Jet_pt)")

Should be easy to implement since the same parsing for varexp should be applied to weights.
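A row-wise sketch of the requested behavior, using toy events and a made-up bin width: the weight is one more per-event expression, evaluated for the same selected rows as the drawn value. Skipping events without a leading jet is an assumption of this sketch, not a statement about how pdroot treats empty rows.

```python
from collections import defaultdict

# toy events: jagged jet pTs plus an event-level MET
events = [
    {"Jet_pt": [90.0, 60.0, 30.0], "MET_pt": 88.0},
    {"Jet_pt": [150.0], "MET_pt": 20.0},
    {"Jet_pt": [45.0, 25.0], "MET_pt": 47.0},
    {"Jet_pt": [], "MET_pt": 60.0},
]

bin_width = 50.0
hist = defaultdict(float)  # bin index -> sum of weights

for ev in events:
    jets = ev["Jet_pt"]
    # selection "MET_pt>40"; events with no jets have no Jet_pt[0] and are skipped
    if not ev["MET_pt"] > 40 or not jets:
        continue
    value = jets[0]            # varexp: "Jet_pt[0]"
    weight = float(len(jets))  # weights: "length(Jet_pt)"
    hist[int(value // bin_width)] += weight

print(dict(hist))
```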

min/max for 2 arguments

import numpy as np
import pandas as pd
import pdroot
np.random.seed(42)
df = pd.DataFrame(np.random.normal(0,1,(100,4)), columns=list("abcd"))
df.draw_to_array("min(a,b)")

gives

~/sandbox/dev/pdroot/pdroot/draw.py in tree_draw_to_array(df, varexp, sel)
     49     dims = []
     50     for expr in varexp_exprs:
---> 51         vals = eval(expr)
     52 
     53         if sel:

~/sandbox/dev/pdroot/pdroot/draw.py in <module>

TypeError: min() got multiple values for argument 'axis'

That said, regular pandas df.eval("min(a,b)") reports that min isn't a supported function, so there's technically no loss here. However, df.draw_to_array("np.minimum(a,b)") does work (while pandas eval doesn't recognize np), so maybe when min is given two arguments we should use np.minimum instead of ak.min(..., axis=-1) [1]?

[1]

from pdroot.parse import to_ak_expr
to_ak_expr("min(a,b)")
# ak.min(a, b, axis=-1)
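One way the proposed dispatch could look, sketched with the standard ast module: two positional arguments map to NumPy's elementwise np.minimum/np.maximum, while a single argument keeps the reduction over the innermost axis. MinMaxRewriter and rewrite are illustrative names and are not taken from pdroot's parser.

```python
import ast


class MinMaxRewriter(ast.NodeTransformer):
    """Dispatch min/max on the number of positional arguments."""

    ELEMENTWISE = {"min": "np.minimum", "max": "np.maximum"}

    def visit_Call(self, node):
        self.generic_visit(node)
        if not (isinstance(node.func, ast.Name) and node.func.id in self.ELEMENTWISE):
            return node
        if len(node.args) == 2:
            # two arrays: elementwise comparison, no axis keyword
            target = f"{self.ELEMENTWISE[node.func.id]}()"
        else:
            # one array: reduce over the innermost axis
            target = f"ak.{node.func.id}(axis=-1)"
        new_call = ast.parse(target, mode="eval").body
        new_call.args = node.args
        return new_call


def rewrite(expr):
    tree = MinMaxRewriter().visit(ast.parse(expr, mode="eval"))
    return ast.unparse(tree.body)  # ast.unparse needs Python 3.9+


print(rewrite("min(a, b)"))
print(rewrite("min(Jet_pt)"))
```

With this split, the two-argument form never receives an axis keyword, which is what triggered the TypeError above.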

ttree draw parsing

>>> from pdroot.parse import to_ak_expr
>>> to_ak_expr("sum(Jet_pt[:2])")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/namin/sandbox/dev/pdroot/pdroot/parse.py", line 148, in to_ak_expr
    Transformer().visit(parsed)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 262, in visit
    return visitor(node)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 317, in generic_visit
    value = self.visit(value)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 262, in visit
    return visitor(node)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 326, in generic_visit
    new_node = self.visit(old_value)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ast.py", line 262, in visit
    return visitor(node)
  File "/Users/namin/sandbox/dev/pdroot/pdroot/parse.py", line 124, in visit_Subscript
    if isinstance(node.slice.value, (ast.Constant, ast.Num)):
AttributeError: 'Slice' object has no attribute 'value'
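The traceback comes from reading .value on an ast.Slice node, which only has lower, upper, and step. A small sketch of a guard that distinguishes the cases before touching .value follows; subscript_kind is a hypothetical helper for illustration, not the fix applied in pdroot.

```python
import ast


def subscript_kind(expr):
    """Classify the index of a single subscript expression such as 'x[:2]'."""
    node = ast.parse(expr, mode="eval").body
    if not isinstance(node, ast.Subscript):
        raise ValueError("expected a subscript expression")
    index = node.slice
    if isinstance(index, ast.Slice):
        return "slice"        # x[:2], x[1:], ...: has lower/upper/step, no .value
    if type(index).__name__ == "Index":
        index = index.value   # Python 3.8 and earlier wrap plain indices in Index
    if isinstance(index, ast.Constant):
        return "constant"     # x[0]
    return "expression"       # x[x > 40], x[argmax(y)], ...


for text in ("Jet_pt[:2]", "Jet_pt[0]", "Jet_pt[Jet_pt > 40]"):
    print(text, "->", subscript_kind(text))
```

Checking for ast.Slice first means the constant-index branch is only reached for nodes that actually carry a value.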
