
MadMiner: ML based inference for particle physics

By Johann Brehmer, Felix Kling, Irina Espejo, Sinclert Pérez, and Kyle Cranmer


Introduction

Schematics of the simulation and inference workflow

Particle physics processes are usually modeled with complex Monte-Carlo simulations of the hard process, parton shower, and detector interactions. These simulators typically do not admit a tractable likelihood function: given a (potentially high-dimensional) set of observables, it is usually not possible to calculate the probability of these observables for some model parameters. Particle physicists usually tackle this problem of "likelihood-free inference" by hand-picking a few "good" observables or summary statistics and filling histograms of them. But this conventional approach discards the information in all other observables and often does not scale well to high-dimensional problems.

In the three publications "Constraining Effective Field Theories with Machine Learning", "A Guide to Constraining Effective Field Theories with Machine Learning", and "Mining gold from implicit models to improve likelihood-free inference", a new approach has been developed. In a nutshell, additional information is extracted from the simulations that is closely related to the matrix elements that determine the hard process. This "augmented data" can be used to train neural networks to efficiently approximate arbitrary likelihood ratios. We playfully call this process "mining gold" from the simulator, since this information may be hard to get, but turns out to be very valuable for inference.

But the gold does not have to be hard to mine: MadMiner automates these modern multivariate inference strategies. It wraps around the simulators MadGraph and Pythia, with different options for the detector simulation, and streamlines all steps in the analysis chain: from the simulation to the extraction of the augmented data, their processing, the training and evaluation of the neural networks, and the statistical analysis.

Resources

Paper

Our main publication MadMiner: Machine-learning-based inference for particle physics provides an overview of this package. We recommend reading it before jumping into the code.

Installation instructions

Please have a look at our installation instructions.

Tutorials

In the examples folder in this repository, we provide two tutorials. The first is called Toy simulator, and it is based on a toy problem rather than a full particle-physics simulation. It demonstrates inference with MadMiner without spending much time on the more technical steps of running the simulation. The second, called Particle physics, shows all steps of a particle-physics analysis with MadMiner.

These examples are the basis of the online tutorial built on Jupyter Books. It also walks through how to run MadMiner using Docker so that you do not have to install Fortran, MadGraph, Pythia, Delphes, etc. You can even run it with no install using Binder.

Documentation

The madminer API is documented on Read the Docs.

Support

If you have any questions, please chat with us in our Gitter community.

Citations

If you use MadMiner, please cite our main publication,

@article{Brehmer:2019xox,
      author         = "Brehmer, Johann and Kling, Felix and Espejo, Irina and Cranmer, Kyle",
      title          = "{MadMiner: Machine learning-based inference for particle physics}",
      journal        = "Comput. Softw. Big Sci.",
      volume         = "4",
      year           = "2020",
      number         = "1",
      pages          = "3",
      doi            = "10.1007/s41781-020-0035-2",
      eprint         = "1907.10621",
      archivePrefix  = "arXiv",
      primaryClass   = "hep-ph",
      SLACcitation   = "%%CITATION = ARXIV:1907.10621;%%"
}

The code itself can be cited as

@misc{MadMiner_code,
      author         = "Brehmer, Johann and Kling, Felix and Espejo, Irina and Perez, Sinclert and Cranmer, Kyle",
      title          = "{MadMiner}",
      doi            = "10.5281/zenodo.1489147",
      url            = {https://github.com/madminer-tool/madminer}
}

The main references for the implemented inference techniques are the three publications listed in the introduction above.

Acknowledgements

We are immensely grateful to all contributors and bug reporters! In particular, we would like to thank Zubair Bhatti, Philipp Englert, Lukas Heinrich, Alexander Held, Samuel Homiller and Duccio Pappadopulo.

The SCANDAL inference method is based on Masked Autoregressive Flows, where our implementation is a PyTorch port of the original code by George Papamakarios, available at this repository.

IRIS-HEP logo

We are grateful for the support of IRIS-HEP and DIANA-HEP.

Contributors

alexander-held, cranmer, dependabot[bot], johannbrehmer, klingfelix, matthewfeickert, maxgalli, rbarrue, sinclert, tehmeh, wangzhe04, zbhatti

Issues

User-defined histogram binning

It would be nice if in FisherInformation and / or AsymptoticLimits we could use a user-defined histogram binning, rather than the equal-sized or equal-xsec binnings we support right now.

Observables from functions, not strings

In madminer.delphes, observables are currently defined through parseable python expressions given as str. There should be an alternative option for the user to define observables through full-fledged functions.
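A minimal, self-contained sketch of the difference (the helper functions and the `Jet` container are hypothetical, not madminer's actual API): a string-based observable is evaluated with the particle containers in scope, while a function-based observable is simply called.

```python
import math

class Jet:
    """Toy particle container standing in for madminer's objects."""
    def __init__(self, pt, eta):
        self.pt = pt
        self.eta = eta

def observable_from_string(expression, j):
    # Current approach: a parseable Python expression given as str,
    # evaluated with the particle lists in scope.
    return eval(expression, {"j": j, "math": math})

def observable_from_function(fn, j):
    # Proposed alternative: a full-fledged function, which can use loops,
    # conditionals, and library calls that are awkward in a string.
    return fn(j)

jets = [Jet(55.0, 1.2), Jet(30.0, -0.4)]
pt_str = observable_from_string("j[0].pt", jets)
pt_fn = observable_from_function(lambda j: j[0].pt, jets)
```

The function-based option would also sidestep the restrictions of `eval`-able expressions.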

Morphing bases with additional basis points

Implement morphing bases with n > k basis points, where k is the number of morphing components. There is then an (n-k)-dimensional weight function of the parameter space that determines the optimal basis vector combination for each parameter point. This weight function has to be stored in the HDF5 file in some way.

Improve Fisher info interface

Internal functions should start with an underscore. In addition, we might want to separate the following use cases into separate (user-facing) functions:

  • calculate total Fisher info (summed over events, with optional cuts)
  • calculate Fisher info in histogram (summed over bins)
  • calculate Fisher info in xsec

I'm not sure we need a function that returns the Fisher info at individual phase-space points. For the "distribution of the Fisher information" plots, it is enough if we can calculate the total Fisher info in a number of bins. Thoughts on this?

Also, we should think about whether we can unify the interface for the truth-level Fisher info and the ML-based approximate Fisher info somehow.

Uncertainties on Fisher information matrices

The ML-based estimated Fisher information has a notion of uncertainty when an ensemble of estimators is used.

We should also calculate the (statistical) uncertainty for all other Fisher information methods.

Speed up machine learning

The training of the ML models is surprisingly slow right now. This is particularly the case for models that need to calculate second derivatives during training (RASCAL, ALICES, SCANDAL, CASCAL). In addition, there have been reports that only a very small fraction of the GPU memory is utilized.

Initialize DelphesProcessor instance without re-running Delphes

It would be convenient if there was a clean way to initialize an instance of DelphesProcessor that can be pointed to an existing .root file containing Delphes output. Right now, I believe the only way to do this is to fill the DelphesProcessor.delphes_sample_filenames and DelphesProcessor.hepmc_sample_weight_labels variables manually. The latter variable is needed because DelphesProcessor.analyse_delphes_samples() loops over it, but looking at the constructor of DelphesProcessor that loop should maybe be over DelphesProcessor.benchmark_names instead?

Use case: Testing different observable definitions and cuts without re-running Delphes.

Smarter sampling

Currently, the madminer.sampling functions use all events when sampling from a given parameter point theta. When an event was originally sampled from theta_mc (i.e. theta_mc was set in MadGraph's param_card.dat) and in madminer.sampling we are (re-)sampling from a different theta, this can lead to weights spanning a large range, then many repeated events in the unweighted samples, and ultimately to poorer performance.

The user can combat this manually by building different MadMiner files for different regions of parameter space, only combining them into a training sample later. But this is a little cumbersome.

A nice automatic solution would be to keep track of which events were originally sampled from which theta_mc, and when resampling from theta, only using those events sampled from the closest theta_mc. (Instead of the closest theta_mc, we could also optimize fractions for each subsample to minimize the statistical uncertainty in the end.)
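The "closest theta_mc" idea above can be sketched as follows (a hypothetical helper, not part of madminer):

```python
import numpy as np

def select_events_by_closest_theta_mc(theta, theta_mc_per_event, thetas_mc):
    """Return a boolean mask keeping only events whose original generation
    point theta_mc is the one closest (Euclidean) to the resampling point."""
    distances = np.linalg.norm(np.asarray(thetas_mc) - theta, axis=1)
    closest = np.argmin(distances)
    return np.asarray(theta_mc_per_event) == closest

# Toy example: two generation points, six events tagged with their origin
thetas_mc = np.array([[0.0], [1.0]])
theta_mc_per_event = np.array([0, 0, 1, 1, 1, 0])  # index into thetas_mc
mask = select_events_by_closest_theta_mc(np.array([0.9]),
                                         theta_mc_per_event, thetas_mc)
```

The fraction-optimization variant would replace the hard `argmin` selection with per-subsample weights chosen to minimize the final statistical uncertainty.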

Dynamic binning for histogram Fisher info

The calculation of the Fisher information in (1D) histograms would benefit from variable-width bins. These can be determined automatically such that each bin carries the same cross section.
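Equal-cross-section bin edges can be computed from the weighted quantiles of the observable; a sketch of one possible implementation (function name is an assumption):

```python
import numpy as np

def equal_xsec_bin_edges(observable, weights, n_bins):
    """Variable-width 1D bin edges such that each bin carries the same
    summed event weight (i.e. the same cross section)."""
    order = np.argsort(observable)
    x = np.asarray(observable)[order]
    w = np.asarray(weights)[order]
    cumulative = np.cumsum(w) / np.sum(w)          # weighted CDF on sorted values
    targets = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    inner_edges = np.interp(targets, cumulative, x)  # invert the CDF
    return np.concatenate([[x[0]], inner_edges, [x[-1]]])

x = np.array([0.0, 1.0, 2.0, 3.0])
w = np.array([1.0, 1.0, 1.0, 1.0])
edges = equal_xsec_bin_edges(x, w, 2)
```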

dp.analyse_delphes_samples()

Hi all,
I'm getting this error when I run dp.analyse_delphes_samples(reference_benchmark=benchmark), where benchmark = 'sm'. I also get it if the benchmark is something other than 'sm'. Below the error, I give more info about the previous lines. Do you know what the problem is? Thank you!

############################################
Traceback (most recent call last):
File "code/delphes.py", line 67, in
dp.analyse_delphes_samples(reference_benchmark=benchmark, parse_lhe_events_as_xml=False)
File "/usr/local/lib/python2.7/dist-packages/madminer/delphes.py", line 600, in analyse_delphes_samples
acceptance_pt_min_j=self.acceptance_pt_min_j,
File "/usr/local/lib/python2.7/dist-packages/madminer/utils/interfaces/delphes_root.py", line 41, in parse_delphes_root_file
root_file = uproot.open(delphes_sample_file)
File "/usr/local/lib/python2.7/dist-packages/uproot/rootio.py", line 74, in open
return ROOTDirectory.read(openfcn(path), **options)
File "/usr/local/lib/python2.7/dist-packages/uproot/rootio.py", line 71, in
openfcn = lambda path: MemmapSource(path, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/uproot/source/memmap.py", line 45, in init
self._source = numpy.memmap(self.path, dtype=numpy.uint8, mode="r")
File "/usr/local/lib/python2.7/dist-packages/numpy/core/memmap.py", line 225, in new
f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
File "/usr/local/lib/python2.7/dist-packages/numpy/compat/py3k.py", line 237, in os_fspath
"not " + path_type.name)
TypeError: expected str, bytes or os.PathLike object, not unicode
###########################################

before that I added these observables with dp.add_observable() and cuts dp.add_cut():

observables:

  • name: 'pt_j1'
    definition: 'j[0].pt'
    required: False
    default: 0.

  • name: 'delta_phi_jj'
    definition: '(j[0].phi() - j[1].phi()) * (-1. + 2.*float(j[0].eta > j[1].eta))'
    required: True
    default:

cuts:

  • cut: 1
    expression: 'pt_j1 > 30.'
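The traceback above bottoms out in `numpy.memmap` refusing a Python 2 `unicode` path (`TypeError: expected str, bytes or os.PathLike object, not unicode`). A minimal sketch of a user-side workaround, assuming the sample path is what triggers it (the path below is hypothetical):

```python
# Under Python 2, a unicode literal path can be rejected by numpy.memmap;
# casting to a plain str before registering the sample avoids the error.
# Under Python 3 this cast is a no-op, since str is already accepted.
delphes_file = u"events/delphes_output.root"  # hypothetical sample path
delphes_file = str(delphes_file)              # plain str for numpy.memmap
```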

Downstream estimators are biased (Fisher information, exclusion contours)

An unbiased score estimator leads to a biased Fisher information estimator and to biased Fisher contour estimates.

It's obvious in the extreme case. The Fisher information is by definition positive (semi-)definite, and so is any SALLY-based estimator for it. If the true Fisher information is (close to) zero, it stands to reason that the estimator will be biased: downward fluctuations can only be small (or in the extreme case are not possible at all), upward fluctuations can be arbitrarily large.

Even if the mode of an ensemble of estimators gets it right, the mean will likely overshoot the true information due to the skewness of the distribution. This is a relevant issue that we observe in practice.
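A toy numpy demonstration of this skew (purely illustrative, one-dimensional): an unbiased score estimator with true score zero yields squared-score Fisher estimates that can only fluctuate upward, so their mean overshoots the true (zero) information.

```python
import numpy as np

rng = np.random.default_rng(0)
true_fisher = 0.0
# Unbiased score estimates scattered around the true score of zero
score_estimates = rng.normal(loc=0.0, scale=0.1, size=1000)
# SALLY-style information estimate ~ t^2 is positive semi-definite by construction
fisher_estimates = score_estimates ** 2
mean_fisher = fisher_estimates.mean()  # strictly above true_fisher
```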

Similarly, an unbiased estimator of the likelihood ratio might translate to a biased log likelihood ratio estimator and to biased contours.

Can / should we correct for this?

Error when running reweighting through MadMiner

Hi, it's me again, I've been running some tutorial examples and trying to use HEL_UFO model to test some benchmark points and I noticed one odd thing that is happening on my machine.
When I run miner.run(...) for 'signal' events, the MadGraph reweighting returns an error saying that allmatrix2py, needed by the reweight API, is missing. Despite this, it finishes and produces the .lhe files. I can load them normally with the LHEProcessor, apply all the selection cuts, and use analyse_samples().
However, when I try to generate any plot or extract the Fisher information, I get this error from MadMiner:

--> 207 return matrix.dot(weights_benchmarks_T[:n_smaller])

ValueError: shapes (6,) and (1,677) not aligned: 6 (dim 0) != 1 (dim 0)

I figured out a way to work around this issue by running the MadGraph events again in a different environment. I'm attaching the full error messages and the log files from MadGraph. I will keep doing more tests and will try to reproduce the error on a different machine; maybe it is only related to my machine and/or environment variables.

Cheers.

run_01_tag_1_debug.log
error_madminer.txt

Add Warning in AsymptoticLimits class

AsymptoticLimits assumes that for an ensemble we pass the path to a directory, and for a single estimator the path to a file. I passed the path to the directory containing a single estimator, and the code failed at an assert statement. Let's add a warning.

More flexible morphing setup

Currently, the morphing components are specified only through

  • the maximal power with which each parameter appears in the couplings, and
  • the overall maximal combined power of parameters in the couplings.

This overestimates the number of components in many settings, for instance when there are two distinct classes of processes affected by different operators.

One possible improvement would be to define classes of processes, each with associated operators, their maximal powers, and an overall maximal combined power.

Ensemble methods

We could / should implement a new class that simplifies ensemble methods for likelihood ratio / Fisher information estimation. This would give more precise predictions plus an error estimate.

Instead of a democratic vote, we could use the knowledge that ideally E[r] = 0 and E[t] = 0 (with the correct sampling), and use a weighted meta-prediction like

r_hat(x)  =  (sum_i r_hat_i(x) exp(-beta E[r_hat_i(x)])) / (sum_i exp(-beta E[r_hat_i(x)]))

where i are the different estimators.

I'm not sure whether the ensemble variance is still an estimator of the estimator variance if we use weights like this.
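A numpy sketch of the weighted meta-prediction above (hypothetical, not in madminer). One ambiguity in the formula as written is that the exponent can reward large negative calibration statistics; the sketch penalizes the squared deviation from the ideal value of zero, which seems to be the intent.

```python
import numpy as np

def weighted_ensemble_prediction(predictions, calibration_stats, beta=1.0):
    """predictions: (n_estimators, n_points) array of per-estimator outputs.
    calibration_stats: (n_estimators,) values of E[r_hat_i], ideally zero.
    Estimators drifting from zero are exponentially down-weighted."""
    weights = np.exp(-beta * np.asarray(calibration_stats) ** 2)
    weights /= weights.sum()
    return weights @ np.asarray(predictions)

preds = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
stats = np.array([0.0, 10.0])  # second estimator badly mis-calibrated
combined = weighted_ensemble_prediction(preds, stats, beta=1.0)
```

With the second estimator effectively zeroed out, the combination reduces to the well-calibrated estimator's prediction.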

Outsource inference

Rewrite madminer.limits to wrap around pyhf. That will not only do all the histogram work (e.g. for SALLY) for us, but also take care of the profiling / p-value calculation / CLs machinery. I discussed this with Lukas Heinrich, and it should be possible to use pyhf's inference part with a RASCAL-style likelihood ratio estimator.

Problem with environment.yml

Hi, I tried to create an environment with the environment.yml file. In line 14 you have - pyteest instead of - pytest, which gives an error. After correcting this, the environment can be created with no errors.

Understand LHE files despite wrong XML

The LHE files that MadGraph produces ostensibly conform to the XML standard. But sometimes MadGraph produces headers that crudely violate this standard, for instance in this part about the weight groups:

<weightgroup name="PDF4LHC15_nlo_30" combine="symmhessian"> # 90900: PDF4LHC15_nlo_30. mem=0 ; alphas(MZ)=0.118 central value; mem=1-30 ; PDF symmetric eigenvectors
<weight id="5" MUR="1.0" MUF="1.0" PDF="90900" >  </weight>
[...]
<weight id="35" MUR="1.0" MUF="1.0" PDF="90930" > PDF=90900 MemberID=30 </weight>
<weightgroup name="Central scale variation" combine="envelope">
</weightgroup> PDF
<weight id="36" MUR="1.0" MUF="1.0" DYN_SCALE="1" PDF="90900" > dyn_scale_choice=sum pt  </weight>
[...]
<weight id="39" MUR="1.0" MUF="1.0" DYN_SCALE="4" PDF="90900" > dyn_scale_choice=sqrts  </weight>
</weightgroup>

(two lines in the middle are swapped).

I'm not sure what exactly triggers this MadGraph behavior (I've seen it when calculating PDF uncertainties with PDF4LHC, but it was fine with CTEQ14nlo and identical settings otherwise).

But this can be a problem for MadMiner, which relies on an XML parser to read the weight setup and the events. In this particular example, the weights related to the PDF uncertainties are not identified properly.

We should

  • talk to MadGraph people to understand this behavior, and
  • make sure it doesn't happen, or repair the files before reading them.

Clean up madminer.plotting

The madminer.plotting functions plot_fisherinfo_barplot() and (to a lesser extent) kinematic_distribution_of_information() are messy. Docstrings are missing, there is code that was written for specific processes, some objects are never used at all. Generally the coding style is different from the rest of MadMiner.

I think we should either clean up these functions, or remove them from the package (in that case we can move them to the parton-level tutorial notebook).

Thoughts?

Improve documentation

Add, update, and clean up docstrings. Optionally export as nice API documentation.

NLO reweighting

Add the option to reweight to NLO in addition to the Wilson coefficient reweightings.

Refactor package and class names

Right now these are the important objects (roughly sorted by workflow)

  • madminer.goldmine.GoldMine
  • delphesprocessor.delphesprocessor.DelphesProcessor
  • lheprocessor.lheprocessor.LHEProcessor
  • madminer.refinery.Refinery
  • forge.forge.Forge
  • madminer.madfisher.MadFisher

The distinction between core code (madminer package), detector and observable (delphesprocessor and lheprocessor packages) and ML implementation (forge packages) makes sense. But we could move it to the level of submodules of the madminer package.

In addition, we can probably improve the consistency by renaming some things. I would favor a consistent scheme, where every submodule or class name consists of a factual term (Mad for "MadGraph" / Delphes / Sample / ML / Fisher for "FisherInformation") and a mining pun (Miner / Processor / Refinery / Forge).

Combining these two ideas, we could change the classes above to

  • madminer.madminer.MadMiner
  • madminer.delphesprocessor.DelphesProcessor
  • madminer.lheprocessor.LHEProcessor
  • madminer.samplerefinery.SampleRefinery
  • madminer.mlforge.MLForge
  • madminer.madfisher.MadFisher (not sure about this one -- while it sounds cool and is consistent with our past work, it fits less into the scheme of the others)

This would clean up the structure a little bit, remove redundant code between the currently parallel packages, and improve the clarity of the branding.

Thoughts?

Phase-space-dependent nuisance parameters

Let the user supply a function (or a string that can be saved in the HDF5 file and parsed efficiently) that depends on x and the value of a nuisance parameter nu and returns an event-wise rescaling.
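A minimal sketch of what this could look like (function and variable names are hypothetical): the user-supplied function maps the observables x and a nuisance parameter nu to a multiplicative weight factor per event.

```python
import numpy as np

def apply_nuisance_rescaling(weights, observables, nu, rescaling_fn):
    """Multiply each event weight by a user-defined factor f(x, nu)."""
    factors = np.array([rescaling_fn(x, nu) for x in observables])
    return weights * factors

# Example: a linear pT-dependent systematic, 1 + nu * (pt / 100 GeV)
rescaling = lambda x, nu: 1.0 + nu * x["pt"] / 100.0
events = [{"pt": 50.0}, {"pt": 200.0}]
new_weights = apply_nuisance_rescaling(np.array([1.0, 1.0]), events, 0.1,
                                       rescaling)
```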

Validation with ground truth

Without shower and detector effects, we have x=z and can evaluate the ground-truth likelihood. We should use this to validate all methods.

This means:

  • generating events with MadMiner for some process (we can deactivate Pythia)
  • using LHEProcessor to extract a complete set of observables (no unobservable directions in phase space)
  • when extracting the test sample, also save the t(x,z) and r(x,z) (have to implement this in SampleAugmenter)
  • train methods as usual and compare to the ground truth

Cleaner LHE parsing

The LHE event files seem to be an XML format. Currently we are parsing them manually, which might break with small changes in the way that MadGraph writes out these files. Maybe we should switch to using an XML parser such as ElementTree?

Smearing functions for LHEProcessor

For some applications it would be useful if we could read in LHE events, but smear the four-momenta for each event randomly according to some smearing function (a Gaussian would be the obvious starting point).
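A toy sketch of the Gaussian starting point (hypothetical helper, not an LHEProcessor method): smear only the energy of each four-momentum with a relative Gaussian resolution, leaving the momentum direction untouched.

```python
import numpy as np

def smear_energy(four_momenta, rel_resolution, rng):
    """four_momenta: (n, 4) array of (E, px, py, pz).
    Returns a copy with E scaled by a Gaussian factor of width rel_resolution."""
    smeared = np.array(four_momenta, dtype=float)
    factors = rng.normal(1.0, rel_resolution, size=len(smeared))
    smeared[:, 0] *= factors  # smear E only in this toy version
    return smeared

rng = np.random.default_rng(42)
p = np.array([[100.0, 0.0, 0.0, 100.0]])
p_smeared = smear_energy(p, 0.1, rng)
```

A realistic smearing function would also handle momentum components consistently (and possibly different resolutions per object type).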

LHE weight normalization

When MadGraph generates events with Pythia activated, it insists on a certain weight normalization:

INFO: Pythia8 needs a specific normalisation of the events. We will change it accordingly. 
INFO: modify parameter event_norm of the run_card.dat to average

This means that the weights in the LHE file are off by a factor of the number of events. This has to be taken into account in LHEProcessor.
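A sketch of the correction, assuming the event_norm = "average" convention means the cross section is the mean (not the sum) of the event weights; naively summing then overshoots by a factor of the number of events:

```python
import numpy as np

def cross_section_from_average_norm(weights):
    # Under event_norm = "average" (assumed convention), the cross section
    # is the MEAN weight, i.e. sum(weights) / n_events.
    return np.mean(weights)

w = np.array([2.0, 2.0, 2.0, 2.0])  # each weight ~ sigma under "average"
sigma = cross_section_from_average_norm(w)
```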

Morphing-based nuisance parameters

New feature: derive effect of nuisance parameters through a new version of morphing. We would have to fix the effect of nuisance parameters to some pre-defined parametric shape and then simulate basis points with different parameter values, e.g. scale choices.

Acceptance cuts

In DelphesProcessor, add acceptance cuts that an object has to pass to be stored at all.

Defaults could be pT > 10 GeV and |eta| < 2.5 for electromagnetic objects / |eta| < 5 for jets.
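A sketch of such an object-level acceptance filter with the defaults quoted above (the function name is hypothetical):

```python
def passes_acceptance(obj_pt, obj_eta, is_jet):
    """Default acceptance: pT > 10 GeV for all objects;
    |eta| < 5 for jets, |eta| < 2.5 for electromagnetic objects."""
    max_abs_eta = 5.0 if is_jet else 2.5
    return obj_pt > 10.0 and abs(obj_eta) < max_abs_eta

kept = passes_acceptance(25.0, 3.0, is_jet=True)      # forward jet: kept
dropped = passes_acceptance(25.0, 3.0, is_jet=False)  # forward electron: dropped
```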

Profiled score

Let's say we want to infer on n parameters of interest in the presence of k nuisance parameters. Right now, SALLY lets us estimate the (k+n)-dimensional score.

It is possible to define a "profiled" or "nuisance-hardened" n-dimensional score, see https://arxiv.org/abs/1903.01473. We should implement this, probably in the ScoreEstimator class.

And then we can and should compare this infernal SALLY to INFERNO (https://arxiv.org/abs/1806.04743) :)

Regularization

We could add options for regularizers for network weights, or for the gradient of the network output with respect to x.

Better criterion for morphing basis optimization

Currently we optimize the morphing basis (benchmark points) by minimizing the expected squared morphing weights L[w] = E_theta[sum_i w_i(theta)^2]. The expectation runs over a flat prior over the parameter ranges specified in MadMiner.add_parameter(). Generally, our current algorithm will push the benchmark points far away from each other, which does indeed typically lead to small morphing weights.

However, the squared weights aren't really the same as the true morphing error. The true morphing error will also depend on how much the differential cross section changes over the parameter space. When all weights are numerically small, but the different cross sections for each benchmark are wildly different, the morphing error will be large even though L[w] is small. Of course, the cross sections will be more similar for closer parameter points, so this effect is opposite to the effect from the morphing weights.

The problem is of course that the differential cross section is process-dependent. At the stage where we have to pick the morphing basis, we don't have any numerical results yet.

So how can we improve this? One option would be to generate a small event sample with some morphing basis first and then to use this to calculate actual morphing errors as a function of a morphing basis. But that's a bit more complicated. Any other ideas?
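For reference, the current criterion L[w] = E_theta[sum_i w_i(theta)^2] can be estimated by plain Monte Carlo over the flat prior; the morphing weight function below is a one-dimensional placeholder, not madminer's actual one.

```python
import numpy as np

def expected_squared_weights(weight_fn, theta_low, theta_high,
                             n_samples=10000, seed=0):
    """Monte-Carlo estimate of E_theta[sum_i w_i(theta)^2] under a
    flat prior on [theta_low, theta_high]."""
    rng = np.random.default_rng(seed)
    thetas = rng.uniform(theta_low, theta_high, size=n_samples)
    return np.mean([np.sum(weight_fn(t) ** 2) for t in thetas])

# Toy 1D morphing with two components: w(theta) = (1 - theta, theta),
# for which the exact value of L[w] over U(0, 1) is 2/3.
weight_fn = lambda t: np.array([1.0 - t, t])
loss = expected_squared_weights(weight_fn, 0.0, 1.0)
```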

Tests, tests, tests

MadMiner should have as many (unit) tests as possible. So far, in tests/ we have two simple tests that check that all imports work, and that some of the ML functionality works.

Here are some ideas for different tests:

☐ the basic functionality of the MadMiner class: setting up parameter spaces and benchmarks, loading and saving MadMiner files
☐ the morphing functionality
☐ running MadGraph and Pythia from a MadMiner instance

☐ reading in LHE files with LHEProcessor
☐ running Delphes from DelphesProcessor
☐ different ways to define and evaluate observables, both in LHEProcessor and DelphesProcessor
☐ cuts in LHEProcessor and DelphesProcessor

☐ creating test samples with SampleAugmenter
☐ creating SALLY / SALLINO training samples with SampleAugmenter
☐ creating RASCAL / ALICES / ... training samples with SampleAugmenter
☐ creating SCANDAL / MAF training samples with SampleAugmenter

☐ training and evaluating a SALLY estimator
☑ training and evaluating a RASCAL / ALICES / ... estimator
☐ training and evaluating a SCANDAL / MAF estimator

Process tag for events

It would be a nice feature if we could save, for each event, which process (in the sense of a MadGraph sample) it came from. The user could supply a label for each HepMC sample that is then saved in a separate field in the HDF5 file.

Parallelize ensemble training

With distributed computing and workflows in mind, it would be great to parallelize the training of ensembles of estimators.

LHEProcessor support for background samples

Since b6fe87a, MadMiner can generate "background samples", samples in which the event weights do not depend on theta, without reweighting. DelphesProcessor can already read in these files. But LHEProcessor assumes the same number of weights in all samples, so will probably crash when confronted with a mixture of signal and background samples.

Automatic definition of observables

Automatically define a default set of observables. For example we could use the energies, pT, eta, phi of all visible particles, plus MET.
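A sketch of how such a default set could be generated as string expressions in the style used by madminer's processors (the object counts, names, and `met.pt` accessor here are assumptions):

```python
def default_observable_definitions(n_leptons=2, n_jets=2):
    """Auto-generate (name -> expression) pairs for the leading leptons
    and jets: energy, pT, eta, phi for each object, plus MET."""
    observables = {}
    for prefix, count in (("l", n_leptons), ("j", n_jets)):
        for i in range(count):
            observables[f"e_{prefix}{i + 1}"] = f"{prefix}[{i}].e"
            observables[f"pt_{prefix}{i + 1}"] = f"{prefix}[{i}].pt"
            observables[f"eta_{prefix}{i + 1}"] = f"{prefix}[{i}].eta"
            observables[f"phi_{prefix}{i + 1}"] = f"{prefix}[{i}].phi()"
    observables["met"] = "met.pt"
    return observables

obs = default_observable_definitions()
```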

Manually add new entries to reweight card

Can we have an option that just manually adds new entries to the reweight card?

The application I have in mind: we could then do signal and (same-final-state) backgrounds in the same run, with the same events, which is needed for MadFisher.
