
lumin's Introduction


LUMIN: Lumin Unifies Many Improvements for Networks

LUMIN is a deep-learning and data-analysis ecosystem for High-Energy Physics. Similar to Keras and fastai, it is a wrapper framework for a graph-computation library (PyTorch), but includes many useful functions to handle domain-specific requirements and problems. It also intends to provide easy access to state-of-the-art methods, while still being flexible enough for users to inherit from base classes and override methods to meet their own demands.

Online documentation may be found at https://lumin.readthedocs.io/en/stable

For an introduction and motivation for LUMIN, check out this talk from IML-2019 at CERN: video, slides. And for a live tutorial, check out my talk at PyHEP 2021: https://www.youtube.com/watch?v=keDWQKHCa2o (tutorial repo here: https://github.com/GilesStrong/talk_pyhep21_lumin)

Distinguishing Characteristics

Data objects

  • Use with large datasets: HEP data can become quite large, making training difficult:
    • The FoldYielder class provides on-demand access to data stored in HDF5 format, only loading into memory what is required.
    • Conversion from ROOT and CSV to HDF5 is easy to achieve using the provided conversion functions (see the examples and the sketch below)
    • FoldYielder provides conversion methods to Pandas DataFrame for use with other internal methods and external packages
  • Non-network-specific methods expect Pandas DataFrames, allowing their use without having to convert to FoldYielder.
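
A hedged sketch of this typical workflow (function signatures may differ between LUMIN versions):

import pandas as pd
from lumin.data_processing.file_proc import df2foldfile
from lumin.nn.data.fold_yielder import FoldYielder

df = pd.read_csv('data.csv')  # ROOT files can similarly be read into a DataFrame, e.g. via uproot
# Split the data into 10 folds and save to a foldfile (feature names are illustrative)
df2foldfile(df, n_folds=10, cont_feats=['feat_0', 'feat_1'], cat_feats=[],
            targ_feats='gen_target', targ_type='int', savename='data')
fy = FoldYielder('data.hdf5')  # loads folds on demand, rather than the full dataset
fold_df = fy.get_df()  # back to a DataFrame for use with external packages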

Deep learning

  • PyTorch > 1.0
  • Inclusion of recent deep learning techniques and practices, including:
  • Flexible architecture construction:
  • Configurable initialisations, including LSUV (Mishkin & Matas, 2016)
  • HEP-specific losses, e.g. Asimov loss (Elwood & Krücker, 2018)
  • Exotic training schemes, e.g. Learning to Pivot with Adversarial Networks (Louppe, Kagan, & Cranmer, 2016)
  • Easy training and inference of ensembles of models:
    • The default training method, fold_train_ensemble, trains a specified number of models rather than just a single one (see the sketch after this list)
    • Ensemble class handles the (metric-weighted) construction of an ensemble, its inference, saving and loading, and interpretation
  • Easy exporting of models to other libraries via ONNX
  • Use with CPU and NVidia GPU
  • Evaluation on domain-specific metrics such as Approximate Median Significance via EvalMetric class
  • fastai-style callbacks and stateful model-fitting, allowing training, models, losses, and data to be accessible and adjustable at any point
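
A hedged sketch of the ensemble workflow, assuming a FoldYielder fy and a ModelBuilder model_builder have already been constructed (argument names may vary between LUMIN versions):

from lumin.nn.training.fold_train import fold_train_ensemble
from lumin.nn.ensemble.ensemble import Ensemble

# Train ten models, each using a different fold for validation; models are saved to disc
results, histories, cycle_losses = fold_train_ensemble(fy, n_models=10, bs=256,
                                                       model_builder=model_builder)
# Build a (metric-weighted) ensemble from the saved models and run inference
ensemble = Ensemble.from_results(results, size=10, model_builder=model_builder)
preds = ensemble.predict(fy.get_fold(0)['inputs'])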

Feature selection methods

  • Dendrograms of feature-pair monotonicity
  • Feature importance via auto-optimised SK-Learn random forests
  • Mutual dependence (via RFPImp)
  • Automatic filtering and selection of features

Interpretation

  • Feature importance for models and ensembles
  • Embedding visualisation
  • 1D & 2D partial dependency plots (via PDPbox)

Plotting

  • Variety of domain-specific plotting functions
  • Unified appearance via the PlotSettings class, which is accepted by every plot function and provides control of plot appearance, titles, colour schemes, et cetera (see the sketch below)
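
A hedged example of overriding the defaults (the palette attributes follow the defaults listed in the colour-scheme issue further below):

from lumin.plotting.plot_settings import PlotSettings
from lumin.plotting.data_viewing import plot_feat

settings = PlotSettings(cat_palette='tab10', div_palette='RdBu_r', seq_palette='viridis')
plot_feat(df, 'feat_0', settings=settings)  # every plot function accepts `settings`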

Universal handling of sample weights

  • HEP events are normally accompanied by a weight characterising the acceptance and production cross-section of that particular event, or used to flatten some distribution.
  • Relevant methods and classes can take account of these weights.
  • This includes training, interpretation, and plotting
  • Expansion of PyTorch losses to better handle weights (illustrated below)
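
As an illustration of the idea (not LUMIN's exact implementation), a per-event weight simply scales each event's contribution to the loss:

import torch

def weighted_mse(pred: torch.Tensor, targ: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Each event's squared error is scaled by its weight before averaging
    return (weight * (pred - targ).pow(2)).mean()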

Parameter optimisation

  • Optimal learning rate via cross-validated range tests (Smith, 2015); see the sketch after this list
  • Quick, rough optimisation of random forest hyper parameters
  • Generalisable Cut & Count thresholds
  • 1D discriminant binning with respect to bin-fill uncertainty
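
A hedged sketch of running the cross-validated LR range test (the import path may differ by version):

from lumin.nn.training.fold_train import fold_lr_find

# Runs the range test on each fold and plots loss as a function of learning rate
lr_finders = fold_lr_find(fy, model_builder, bs=256)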

Statistics and uncertainties

  • Integral to experimental science
  • Quantitative results are accompanied by uncertainties
  • Use of bootstrapping to improve the precision of statistics estimated from small samples (example below)
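
For example, bootstrapping the uncertainty on a sample mean:

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=100)
# Resample with replacement and take the spread of the resampled means
bs_means = [rng.choice(sample, size=len(sample), replace=True).mean() for _ in range(1000)]
print(f'mean = {sample.mean():.3f} +- {np.std(bs_means):.3f}')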

Look and feel

  • LUMIN aims to feel fast to use - liberal use of progress bars means you always know when tasks will finish, and get live updates of training
  • Guaranteed to spark joy (in its current beta state, LUMIN may instead ignite rage, despair, and frustration - dev.)

Examples

Several examples are present in the form of Jupyter Notebooks in the examples folder. These can also be run on Google Colab to allow you to quickly try out the package.

  1. examples/Simple_Binary_Classification_of_earnings.ipynb: Very basic binary-classification example
  2. examples/Binary_Classification_Signal_versus_Background.ipynb: Binary-classification example in a high-energy physics context
  3. examples/Multiclass_Classification_Signal_versus_Backgrounds.ipynb: Multiclass-classification example in a high-energy physics context
  4. examples/Single_Target_Regression_Di-Higgs_mass_prediction.ipynb: Single-target regression example in a high-energy physics context
  5. examples/Multi_Target_Regression_Di-tau_momenta.ipynb: Multi-target regression example in a high-energy physics context
  6. examples/Feature_Selection.ipynb: In-depth walkthrough for automated feature-selection
  7. examples/Advanced_Model_Building.ipynb: In-depth look at building more complicated models and a few advanced interpretation techniques
  8. examples/Model_Exporting.ipynb: Walkthrough for exporting a trained model to ONNX and TensorFlow
  9. examples/RNNs_CNNs_and_GNNs_for_matrix_data.ipynb: Various examples of applying RNNs, CNNs, and GNNs to matrix data (top-tagging on jet constituents)
  10. examples/Learning_To_Pivot.ipynb: Example of adversarial training for parameter invariance

Installation

Due to some strict version requirements on packages, it is recommended to install LUMIN in its own Python environment, e.g. conda create -n lumin python=3.6

From PyPI

The main package can be installed via: pip install lumin

Full functionality requires two additional packages as described below.

From source

git clone [email protected]:GilesStrong/lumin.git
cd lumin
pip install .

Optionally, run pip install with the -e flag for a development installation.

Optional requirements

  • sparse: enables loading of COO sparse-format tensors, install via e.g. pip install sparse
  • PDPBox: model interpretation, requires numpy < 1.24.0

Notes

Why use LUMIN

TMVA, contained in CERN's ROOT system, has been the default choice for BDT training for analysis and reconstruction algorithms, due to never having to leave the ROOT format. With the gradual move to DNN approaches, more scientists are looking to move their data out of ROOT to use the wider selection of tools which are available. Keras appears to be the first stop due to its ease of use; however, implementing recent methods in Keras can be difficult, and sometimes requires dropping back to the tensor library that it aims to abstract. Indeed, the prequel to LUMIN was a similar wrapper for Keras (HEPML_Tools) which involved some pretty ugly hacks. The fastai framework provides access to these recent methods, however it doesn't yet support sample weights to the extent that HEP requires. LUMIN aims to provide the best of both: Keras-style sample weighting and fastai training methods, while focussing on columnar data and providing domain-specific metrics, plotting, and statistical treatment of results and uncertainties.

Data types

LUMIN is primarily designed for use on columnar data, and from version 0.5 onwards this also includes matrix data: ordered series and un-ordered groups of objects. With some extra work it can be used on other data formats, but at the moment it has nothing special to offer. Whilst recent work in HEP has made use of jet images and GANs, these normally hijack existing ideas and models. Perhaps once domain-specific approaches which necessitate the use of a specialised framework become established, LUMIN could grow to meet those demands, but for now I'd recommend checking out the fastai library, especially for image data.

With just one main developer, I'm simply focussing on the data types and applications I need for my own research and common use cases in HEP. If, however, you would like to use LUMIN's other methods for your own work on other data formats, then you are most welcome to contribute and help to grow LUMIN to better meet the needs of the scientific community.

Future

The current priority is to improve the documentation, add unit tests, and expand the examples.

The next step will be to try to increase the user base and number of contributors. I'm aiming to achieve this through presentations, tutorials, blog posts, and papers.

Further improvements will be in the direction of implementing new methods and (HEP-specific) architectures, as well as providing helper functions and data exporters to statistical analysis packages like Combine and PYHF.

Contributing & feedback

Contributions, suggestions, and feedback are most welcome! The issue tracker on this repo is probably the best place to report bugs et cetera.

Code style

Nope, the majority of the code-base does not conform to PEP8. PEP8 has its uses, but my understanding is that it is primarily written for developers and maintainers of software whose users never need to read the source code. As a maths-heavy research framework which users are expected to interact with, PEP8 isn't the best style. Instead, I'm aiming to follow the style of fastai, which emphasises, in particular: reducing vertical space (useful for reading source code in a notebook), and naming and abbreviating variables according to their importance and lifetime (making it easier to recognise which variables have a larger scope, and permitting easier writing of mathematical operations). A full list of the abbreviations used may be found in abbr.md

Why is LUMIN called LUMIN?

Aside from being a recursive acronym (and therefore the best kind of acronym) lumin is short for 'luminosity'. In high-energy physics, the integrated luminosity of the data collected by an experiment is the main driver in the results that analyses obtain. With the paradigm shift towards multivariate analyses, however, improved methods can be seen as providing 'artificial luminosity'; e.g. the gain offered by some DNN could be measured in terms of the amount of extra data that would have to be collected to achieve the same result with a more traditional analysis. Luminosity can also be connected to the fact that LUMIN is built around the python version of Torch.

Who develops LUMIN

LUMIN is primarily developed by Giles Strong: a British-born doctor in particle physics, researcher at INFN-Padova (Italy), member of the CMS collaboration at CERN, and founding member of the MODE Collaboration (differentiable optimisation for detector design).

As LUMIN has grown, it has welcomed contributions from members of the scientific and software development community. Check out the contributors page for a complete list.

Certainly more developers and contributors are welcome to join and help out!

Reference

If you have used LUMIN in your analysis work and wish to cite it, the preferred reference is: Giles C. Strong, LUMIN, Zenodo (Mar. 2019), https://doi.org/10.5281/zenodo.2601857, Note: Please check https://github.com/GilesStrong/lumin/graphs/contributors for the full list of contributors

@misc{giles_chatham_strong_2019_2601857,  
  author       = {Giles Chatham Strong},  
  title        = {LUMIN},  
  month        = mar,  
  year         = 2019,  
  note         = {{Please check https://github.com/GilesStrong/lumin/graphs/contributors for the full list of contributors}},  
  doi          = {10.5281/zenodo.2601857},  
  url          = {https://doi.org/10.5281/zenodo.2601857}  
}

lumin's People

Contributors

gilesstrong, kiryteo, matthewfeickert, thatch


lumin's Issues

Eventual upgrade of RFPImp version

RFPImp is relied on quite a bit by LUMIN; however, it imports from sklearn.ensemble.forest, which has been deprecated since scikit-learn 0.22 and will be removed in 0.24. Currently this raises a FutureWarning. Hopefully this will be fixed in the next RFPImp release and we can upgrade LUMIN to use the new version.

Colour scheme for LUMIN

Current state

The colour schemes for plots are dictated by PlotSettings objects. When these are not passed by the user, a default PlotSettings is used. This uses: tab10 for categorical colours, RdBu_r for divergent colours, and viridis for sequential colours.

These are colour schemes that exist in matplotlib and give LUMIN plots a kind of generic appearance. Additionally, tab10 is often not sufficient for some plots, leading to colours being repeated, or tab20 being used. I don't particularly like tab20 due to its mixture of hard and pastel colours.

Proposal

New default sequential, divergent, and categorical colour maps are set or created. Perhaps a series of styles/moods could be defined, which would change all three colour schemes to preset maps, e.g. 'spring' would favour greener colour maps, 'autumn' more reds and browns.

Requirements

  • Sequential colour maps must be perceptually uniform
  • Categorical colour maps must be compatible with most forms of colourblindness (can be tested via browser plugins and programs which simulate different forms of colour blindness)
  • Categorical colour maps should ideally work well in grey-scale (e.g. when printing a paper in black and white), but this isn't essential
  • Divergent colour maps must be symmetric
  • Ideally the (categorical) colour map(s) should be consistent with the docs colour scheme; the docs and LUMIN logo colours may of course be changed
  • Colour schemes must be suitable for scientific publications

Check functionality in repl and .py

Currently, training and application of NNs in LUMIN has only really been tested in Jupyter notebooks. It should also be tested to make sure everything works fine when used in executed Python files and in the REPL.

Feature importance from DataFrame

Current state

Ensemble and Model classes have get_feat_importance methods to compute the permutation importance of input features. Currently the input data must be supplied as a FoldYielder object. There are occasions when one may wish to evaluate the feature importance on only a subset of the data (e.g. only on 2-jet events). This then requires saving the subset to a foldfile and instantiating a new FoldYielder to point to the subset of data.

Probable solution

The get_feat_importance methods are extended to take pandas.DataFrame objects as inputs. This will no doubt impact certain aspects of the returned information, such as averaging over folds and computing uncertainties. I think it is reasonable that if the user really wants this extra information, they can export a new foldfile. This extension would simply be for getting rough information quickly and producing an informative plot which wouldn't necessarily be used for publication.

Improve importing of classes and methods / improve module layout

Problem

Currently every class, method, et cetera must be imported from the file in which they are defined. This potentially makes finding the thing one wishes to import quite difficult due to the number of files and submodules. This will only get worse with time. Even I, the main developer, forget where things are defined and have to spend time searching for them.

Attempted solution

I attempted to solve this by introducing imports at the top of each submodule in __init__.py, the idea being that users could then import from the submodule rather than the exact file. This, however, led to circular imports between modules and submodules, due to the interconnected nature of the package. This attempt still exists in the code-base, but is commented out and not guaranteed to be up to date.

Thoughts

  • Most other packages seem to manage well with importing from submodules, but perhaps their submodules are not as interconnected as LUMIN's
  • Perhaps my initial attempt was too coarse (import *) and more care must be taken
  • Perhaps the layout of the modules and submodules should be revised to reduce interdependence - would be a severe breaking change

Setup CI and Actions

I don't know too much about GitHub Actions, but certainly something like "Publish Python Package" could be useful, since it's a time-consuming process that is easy to get wrong, or forget how to do.
Maybe an action to generate the docs templates could be good as well.

Change HEPAugFoldYielder to callback?

Current status

HEPAugFoldYielder applies train-time and test-time data augmentations to HEP data (phi rotations, transverse & longitudinal flips). This is performed when loading the data since, originally, this was the last point at which the feature names for the data were known to the model. Later changes to LUMIN now mean that the model has a list of named features and how they map to the input features. This means that the data augmentation could instead be performed by a callback during training (similar to the suggestion of issue #68).

Discussion

It seems a bit strange that the choice of whether or not to augment the data is made by changing how the data is loaded from file. Specifying the choice as a callback makes a bit more sense (to me). This also avoids complications once additional forms of augmentation are added, which may otherwise require their own FoldYielder classes, and we must then account for all possible combinations of different types of augmentation.

Depending on the choices made in issue #50, this may reduce the efficiency of augmentation, but it's possible that augmenting the data in-place on device may actually be more efficient, since it could be done multi-threaded. This would perhaps avoid the need to augment as a pandas.DataFrame, and maybe pre-cached rotation matrices could be used, in some part, to speed things up. Since the data is already on device, this would actually be quicker than loading from disc, augmenting, and then loading to device; the latter is known to cause particular slow-down when working on GPU.

Possible change

The callback would need to mimic the behaviour of HEPAugFoldYielder, i.e. provide random augmentation during training, and a choice of either set transformations during testing or random ones. It would need to be passed as a callback during training and prediction.

Additionally, tests should be done to compare the speed and memory usage of the callback to HEPAugFoldYielder.

If successful, this would deprecate HEPAugFoldYielder.

Make HEPAugFoldYielder work with pT eta phi coordinates

Current state

HEPAugFoldYielder applies train-time and test-time augmentation to 3-momenta vectors, but only if they are in Cartesian coordinates (p_x, p_y, p_z). This should ideally be extended to (p_T, eta, phi).

Solution

Extend HEPAugFoldYielder to run data augmentation over vectors in (p_T, eta, phi) and infer coordinate system from data, rather than requiring user-supplied flag.

Way to resume ensemble training

Current state

fold_train_ensemble trains a set of models, using cross-validation over fixed folds of data. The order of use of folds is fixed val_id = model_num % fy.n_folds.

Problem

The training loop may be interrupted, for whatever reason. In these cases, the user must restart the training from the beginning.

Solution

A new argument is added to fold_train_ensemble that will allow training to begin from a specified model number. This would most likely add a lower bound to model_bar = master_bar(range(n_models)); however it must also alter os.system(f"rm {savepath}/*.h5 {savepath}/*.json {savepath}/*.pkl {savepath}/*.png {savepath}/*.log") to avoid deleting the previously trained models.
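
A hedged sketch of the change (start_model_num is a hypothetical new argument):

# Resume from the specified model number rather than from model zero
model_bar = master_bar(range(start_model_num, n_models))
# Only wipe the save directory on a fresh start, preserving previously trained models
if start_model_num == 0:
    os.system(f"rm {savepath}/*.h5 {savepath}/*.json {savepath}/*.pkl {savepath}/*.png {savepath}/*.log")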

Minimum improvement early stopping callback

Current state

Main training function fold_train_ensemble uses early stopping by default: if a number of subepochs are passed (patience) without an improvement in the validation loss, then training stops.

Problem

Sometimes the validation loss will move to a plateau with a shallow slope; the validation loss continues to decrease, but at a rate which has minimal impact on performance.

Suggestion

A new callback is introduced to stop training if the validation loss doesn't decrease by a certain amount (or fraction) after a certain number of epochs. This should ideally automatically scale to the typical loss values of each particular training and should accurately detect plateaus without fine-tuning by the user. It could, for instance, monitor the rate of change of the loss; MetricLogger already does something similar by monitoring the loss velocity. It should also be resistant to fluctuations in the loss, which may occur, particularly in heavily weighted data.

Add Literal types (aka move to Python 3.8)

In a few locations, string-valued arguments are used in LUMIN and specific values are expected. Python 3.8 introduced literal types (https://realpython.com/python38-new-features/), where the set of expected values can be stated in the method definitions. This is potentially useful, but would then require raising the minimum version of Python for LUMIN from 3.6 to 3.8, which could be disruptive to our user-base. (Although, since I doubt we actually have a consistent user-base, this change may not be disruptive and should be done sooner rather than later.)
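
For example:

from typing import Literal  # Python 3.8+

def set_objective(objective: Literal['classification', 'multiclass', 'regression']) -> None:
    # Type checkers now flag any value outside the expected set
    ...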

Save & load LUMIN version for ensembles and models

Issue

Users must currently track which versions of LUMIN were used when training models, due to changes in ModelBuilder. Whilst I try to accommodate changes for a while when breaking changes or deprecations occur, it might still be useful if models and ensembles were to save the current version of LUMIN as metadata. This would make it easier to work with very old models without having to try loading them in every previous LUMIN release.

Possible solution

Model.save and Model.load are extended to include saving & loading the current LUMIN version in the dictionary. Similarly, Ensemble.save and Ensemble.load could save/load details about the current LUMIN version. ModelBuilder should probably also include a variable set to the current LUMIN version, since it gets pickled during Ensemble.save.

Permission errors during saving

Occasionally during training I get permission errors when trying to save model weights. This then kills the training and I normally have to restart the kernel. This has only ever happened on one computer (Elementary OS 0.4) and is very rare. I've not been able to reproduce the error on demand.

If anyone else ever encounters this error, please let me know!

Multi-threaded data loading and augmentation?

Current state

The current process for loading data during training is:

  1. A complete fold of data is loaded from hard-drive (hdf5) by a FoldYielder
  2. Any requested data augmentation is applied to the fold
  3. The fold is then passed to a BatchYielder. Either the entire fold is then loaded to device at once, or mini-batches are loaded to device one at a time
  4. Mini-batches are passed through the model and parameters are updated

The current process for loading data during predicting is:

  1. A complete fold of data is loaded from hard-drive (hdf5) by a FoldYielder
  2. Any requested data augmentation is applied to the fold
  3. The entire fold is passed through the model, or mini-batches are passed separately

Problems

  • The use of data augmentation currently causes perceptible slow-downs during training and testing
  • Loading data to device can be slow: quicker to load entire fold at once, but requires large memory

Possible solutions

  • Data augmentation is applied using multi-threading. Should be trivial, but the splitting and concatenating of DataFrames may actually slow down the process. Maybe Dask could be useful?
  • Worker processes are used by BatchYielder to load minibatches to device in the background, reducing the memory overhead whilst not leading to delays.
    • Could perhaps replace BatchYielder with, or inherit from, a PyTorch Dataloader, which includes multi-threaded workers (although I find that they're slower than single-core...)

Patience correction for CycleLR

Problem

When using the CycleLR callback, e.g. for cosine LR annealing, the patience must be set to the desired number of patience cycles plus one. E.g. to ensure that training finishes after one complete cycle without improvement, the patience must be set to 2. This is counter-intuitive for the user.

Possible cause

I think this is because, if the lowest loss is reached partway through a cycle rather than in the very last iteration, some counters (e.g. epochs since last improvement) are not being set/reset. This should be quite simple to fix. I'll look into it.

Polyak averaging for test-time data-augmentation

Current status

When data-augmentation is applied at test-time, the final prediction is based on the original data and the augmented data. This is okay with the current data-augmentation in LUMIN, since it (should) result in physically valid events which are as likely as the original event.

Potential problem

The user, or future updates of LUMIN, may add data-augmentation which only produces data that is similar to the actual data, but is either not strictly physical, or has a differing probability of occurring.

Possible solution

In these cases it might be advantageous to form the final prediction via Polyak averaging of the score on the original data and the mean score on the augmented data, e.g.:

score = beta * (score on original data) + (1 - beta) * (mean score on augmented data)

Beta would need to be an optional argument when calling the .predict* methods of Ensemble and Model, and also Model.evaluate*. Beta could also be set as a property of e.g. HEPAugFoldYielder, and the relevant methods could then check whether a beta had been set for the data, to avoid having to explicitly pass it every time.

Expand/change Ensemble to include AbsEndcap

Current state

AbsEndcap acts as a wrapper to apply fixed functions to the outputs of models that were trained on proxy objectives, e.g. to compute the invariant mass from a model that outputs the 3-momenta of two particles (see Multi_Target_Regression example).

AbsEndcap currently does this via a .forward method and a .predict method, so in theory this should also work for Ensembles of models. Wrapping an Ensemble, though, removes many of the benefits that the Ensemble class offers.

Proposal

The Ensemble class is extended to include the ability to add an AbsEndcap, through which the outputs of all internal models will be passed.

Env detection for MetricLogger

The MetricLogger class computes metrics and displays information during training in the form of realtime plots. This isn't much use (and may even cause errors) if the training is being performed in an executed Python file or the REPL. Ideally, MetricLogger should detect whether it is being used in a Jupyter environment, and if not, should print the feedback information to the prompt (perhaps overwriting the values each time rather than printing new lines, to save space).

Add SOTA optimisers

There was a big kerfuffle in 2019 about some new optimisers: Rectified Adam (Liu et al., 2019), Look Ahead (Zhang, Lucas, Hinton, & Ba, 2019), and a combination of both of them, Ranger (which now also includes Gradient Centralization (Yong, Huang, Hua, & Zhang, 2020)).

Having tried these (except the latest version of Ranger), I've not found much improvement compared to Adam, but this was only on one dataset. The performance of Ranger, though, looks to be quite good on other datasets, so perhaps it is useful.

User-defined optimisers can easily be used in LUMIN by passing the partial optimiser to the opt_args argument of ModelBuilder, e.g. opt_args = {'eps':1e-08, 'opt':partial(RAdam)}. It could be useful, however, to include the optimisers in LUMIN, to allow them to be used easily, without the user having to include copied code.
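
A hedged sketch following that pattern (RAdam is assumed to come from one of the external implementations mentioned below; the other ModelBuilder arguments are illustrative):

from functools import partial
from lumin.nn.models.model_builder import ModelBuilder

opt_args = {'eps': 1e-08, 'opt': partial(RAdam)}  # RAdam from an external implementation
model_builder = ModelBuilder(objective='classification', n_out=1,
                             cont_feats=train_feats, opt_args=opt_args)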

These git repos include Apache 2.0-licensed implementations of RAdam and Ranger, so inclusion should be straightforward.

Uncertainty bands for plot_roc

Current status

plot_roc provides a variety of options for computing and plotting ROC curves, including bootstrap resampling to compute the uncertainty on the ROC AUCs.

Deficiency

Whilst the mean ROC AUCs are displayed along with their uncertainty, the curves that are plotted are just single lines with no uncertainty. Ideally, the bootstrap resamples of the data should be used to compute uncertainty bands for the curves themselves.

Probable solution

_bs_roc_auc is extended to compute and return multiple ROC curves. plt.plot is replaced with sns.lineplot and set to show the standard deviation as a band as computed using the ROC curves returned by _bs_roc_auc.
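
A hedged sketch of the plotting change, assuming the extended _bs_roc_auc returns a list of (fpr, tpr) pairs, one per bootstrap resample:

import numpy as np
import pandas as pd
import seaborn as sns

base_fpr = np.linspace(0, 1, 101)
rows = []
for i, (fpr, tpr) in enumerate(bs_curves):  # bs_curves: hypothetical output of _bs_roc_auc
    # Interpolate each curve onto a common FPR grid so lineplot can aggregate them
    rows.append(pd.DataFrame({'fpr': base_fpr, 'tpr': np.interp(base_fpr, fpr, tpr), 'resample': i}))
curves = pd.concat(rows)
sns.lineplot(data=curves, x='fpr', y='tpr', ci='sd')  # mean curve with a one-s.d. band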

Check HEPAugFoldYielder and test-time augmentation

One thing I've been surprised about is that increasing the amount of test-time data augmentation can sometimes decrease model performance. I would have thought that it would saturate, rather than decrease; all transformations result in valid events, unlike, say, in image augmentation, where augmented data is only similar to real data and things like Polyak averaging are useful.

I've been over the code in HEPAugFoldYielder multiple times and can't spot any errors, but a second set of eyes might help. There might also be errors in my assumptions about the input data...

If the code is correct, then it would be interesting to investigate further what causes the degradation in performance, and whether it is reproducible on other datasets.

Extend code view in docs

Problem

When viewing source code in the docs (e.g. https://lumin.readthedocs.io/en/stable/_modules/lumin/nn/ensemble/ensemble.html#Ensemble), the width of the viewer is less than the line length used for coding (160 characters). This causes the code to wrap onto the next line, presenting a confusing appearance.


Probable solution

There is a large space on the right hand-side of the page, and the code view could easily be extended to fill it. Just need to find the right value to edit.

Note

This fix actually needs to be made in https://github.com/GilesStrong/pytorch_sphinx_theme I've simply referenced it here for visibility.

Include AdamW

AdamW (Loshchilov & Hutter, 2017) should really be included for cases when the user wishes to use weight decay (PyTorch Adam says 'weight_decay', but implements L_2 regularisation). PyTorch >= 1.2 includes AdamW, and there is also an Apache 2.0 implementation here. AdamW should be added as an easily accessible optimiser in LUMIN (if not the default). Warnings should also be added to recommend that the user use AdamW if they specify a weight decay with Adam. The user must actively make this change; LUMIN should not silently switch the optimiser to AdamW.
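
A minimal sketch of the recommended warning (variable names are illustrative):

import warnings
from torch import optim

if opt_class is optim.Adam and opt_args.get('weight_decay', 0) > 0:
    # Warn, but leave the choice of optimiser to the user
    warnings.warn("Adam's weight_decay implements L2 regularisation rather than true weight decay; consider AdamW")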

Probable solution

LUMIN moves to the latest version of PyTorch, which includes AdamW, but checks will need to be made to make sure there aren't any problems in compatibility.

Train-time data-augmentation for parameterised learning

Overview

Parameterised learning is useful in HEP, for example in cases where a classifier should learn multiple signal hypotheses (e.g. a heavy Higgs of several possible masses) see Baldi et al., 2016.

In this example the signal would have a parameterised input equal to the true resonant mass, and the background would be randomly assigned resonant masses. Once trained, the entire dataset can be set to a particular resonant mass in order to perform inference for a given hypothesis. This last part is already possible with the ParametrisedPrediction class.
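
A hedged sketch of hypothesis-specific inference (argument names are illustrative):

from lumin.nn.callbacks.data_callbacks import ParametrisedPrediction

# Fix the parameterised feature 'res_mass' to the 500 GeV hypothesis for all events
param_pred = ParametrisedPrediction(train_feats, ['res_mass'], [500])
preds = model.predict(fy, cbs=[param_pred])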

Data augmentation for parameterised learning

Currently, the random assignment of parameterised-feature values for the background (in the example above) is performed once, when preparing the data for training. It may well be useful to perform this random assignment during training instead, which could provide some of the benefits of train-time data augmentation.

Implementation

To avoid conflicts with HEPAugFoldYielder, and due to the fact that this only wants to be performed during training, this secondary form of augmentation should probably be implemented as a callback. It also needs to account for the possibility that multiple parameterisation features may be used, and that only a subset of the data may need to be changed.

Adversarial training for feature invariance?

So far I've been contacted by two researchers about LUMIN, both of whom wanted to use it for adversarial training in order to achieve invariance to some feature of the data. Unfortunately, LUMIN does not offer this out of the box, but seeing as 100% of the people interested in using LUMIN want to do adversarial training, it's perhaps something that we should aim to offer in the future.

I made some attempt at this, but never tested it and don't have suitable data to hand to really develop it myself. I have a feeling, though, that a decent implementation will probably be quite task-specific, so input from fellow researchers with more experience of this is really necessary to move forward.

My initial attempt is pasted below, along with a short response to an email:

"however it is potentially implementable by modifying the training loop and inheriting from the Model class.

In the attached files there's an example of this, in which the new training loop (adversarial_fold_train_ensemble) takes two ModelBuilders, one to provide the primary models and the other to provide the adversarial models. Two sets of callbacks get created, one for each model. A new Model class (AdversarialModel) inherits from Model and has a new method (adversarial_fit) which takes the adversarial model as an argument. This can then be modified to provide interaction between the two models as necessary. Does this look like the kind of thing which could help? I'm afraid I haven't had a chance to test it, but I've marked everything new with "# New".

In its current form, adversarial_fold_train_ensemble will only save the primary model, and this is determined by the validation loss returned by primary_model.evaluate, so you may want to modify this to avoid saving models which don't have a flat response."

from typing import Dict, List, Tuple, Any, Optional
from pathlib import Path
from fastprogress import master_bar, progress_bar
import pickle
import timeit
import numpy as np
import os
import sys
from random import shuffle
from collections import OrderedDict
import math
from functools import partial
import warnings

from torch import Tensor

from lumin.nn.data.fold_yielder import FoldYielder
from lumin.nn.data.batch_yielder import BatchYielder
from lumin.nn.models.model_builder import ModelBuilder
from lumin.nn.models.model import Model
from lumin.nn.callbacks.cyclic_callbacks import AbsCyclicCallback
from lumin.nn.callbacks.model_callbacks import AbsModelCallback
from lumin.nn.callbacks.abs_callback import AbsCallback
from lumin.utils.misc import to_tensor, to_device
from lumin.utils.statistics import uncert_round
from lumin.nn.metrics.eval_metric import EvalMetric
from lumin.plotting.training import plot_train_history
from lumin.plotting.plot_settings import PlotSettings
from lumin.nn.training.metric_logger import MetricLogger
from lumin.nn.models.abs_model import AbsModel

import matplotlib.pyplot as plt

__all__ = ['adversarial_fold_train_ensemble', 'AdversarialModel']


# New___
class AdversarialModel(Model):
    def adversarial_fit(self, batch_yielder:BatchYielder, adversary_model:AbsModel, primary_callbacks:Optional[List[AbsCallback]]=None,
                        adversary_callbacks:Optional[List[AbsCallback]]=None, mask_inputs:bool=True) -> float:
        r'''
        Fit network for one complete iteration of a :class:`~lumin.nn.data.batch_yielder.BatchYielder`, i.e. one (sub-)epoch

        Arguments:
            batch_yielder: :class:`~lumin.nn.data.batch_yielder.BatchYielder` providing training data in form of tuple of inputs, targets,
                and weights as tensors on device
            adversary_model: :class:`~lumin.nn.models.model.Model` to act as the adversarial model during training
            primary_callbacks: list of :class:`~lumin.nn.callbacks.abs_callback.AbsCallback` to be used during training of the primary model
            adversary_callbacks: list of :class:`~lumin.nn.callbacks.abs_callback.AbsCallback` to be used during training of the adversary model
            mask_inputs: whether to apply input mask if one has been set

        Returns:
            Loss on training data averaged across all minibatches
        '''

        self.model.train()
        adversary_model.model.train()
        self.stop_train = False
        if primary_callbacks   is None: primary_callbacks   = []
        if adversary_callbacks is None: adversary_callbacks = []

        for c in primary_callbacks:   c.on_epoch_begin(by=batch_yielder)
        for c in adversary_callbacks: c.on_epoch_begin(by=batch_yielder)

        if self.input_mask is not None and mask_inputs: batch_yielder.inputs = batch_yielder.inputs[:,self.input_mask]

        # Replace this as necessary
        # _________________________
        # losses = []
        # for x, y, w in batch_yielder:
        #     for c in callbacks: c.on_batch_begin()
        #     y_pred = self.model(x)
        #     loss = self.loss(weight=w)(y_pred, y) if w is not None else self.loss()(y_pred, y)
        #     losses.append(loss.data.item())
        #     self.opt.zero_grad()
        #     for c in callbacks: c.on_backwards_begin(loss=loss)
        #     loss.backward()
        #     for c in callbacks: c.on_backwards_end(loss=loss)
        #     self.opt.step()
            
        #     for c in callbacks: c.on_batch_end(loss=losses[-1])
        #     if self.stop_train: break
        
        # for c in callbacks: c.on_epoch_end(losses=losses)
        # return np.mean(losses)
        # _________________________
# ______


def _get_folds(val_idx, n_folds, shuffle_folds:bool=True):
    r'''
    Return (shuffled) list of fold indices which does not include the validation index
    '''

    folds = [x for x in range(n_folds) if x != val_idx]
    if shuffle_folds: shuffle(folds)
    return folds


def adversarial_fold_train_ensemble(fy:FoldYielder, n_models:int, bs:int, primary_model_builder:ModelBuilder, adversarial_model_builder:ModelBuilder,
                                    callback_partials:Optional[List[partial]]=None, eval_metrics:Optional[Dict[str,EvalMetric]]=None,
                                    train_on_weights:bool=True, eval_on_weights:bool=True, patience:int=10, max_epochs:int=200,
                                    shuffle_fold:bool=True, shuffle_folds:bool=True, bulk_move:bool=True,
                                    live_fdbk:bool=True, live_fdbk_first_only:bool=True, live_fdbk_extra:bool=True, live_fdbk_extra_first_only:bool=False,
                                    savepath:Path=Path('train_weights'), verbose:bool=False, log_output:bool=False,
                                    plot_settings:PlotSettings=PlotSettings(), plots:Optional[Any]=None) \
        -> Tuple[List[Dict[str,float]],List[Dict[str,List[float]]],List[Dict[str,float]]]:
    r'''
    Adversarial training method for :class:`~lumin.nn.models.model.Model`.
    Trains a specified number of models created by a :class:`~lumin.nn.models.model_builder.ModelBuilder` on data provided by a
    :class:`~lumin.nn.data.fold_yielder.FoldYielder`, and save them to savepath.
    Note, this does not return trained models; instead they are saved and must be loaded later. This method returns the results of model training.
    Each :class:`~lumin.nn.models.model.Model` is trained on N-1 folds, for a :class:`~lumin.nn.data.fold_yielder.FoldYielder` with N folds, and the remaining
    fold is used as validation data.
    Training folds are loaded iteratively, and model evaluation takes place after each fold use (a sub-epoch), rather than after every use of all folds (an epoch).
    Training continues until:
        - All of the training folds are used max_epoch number of times;
        - Or validation loss does not decrease for patience number of training folds;
          (or cycles, if using an :class:`~lumin.nn.callbacks.cyclic_callbacks.AbsCyclicCallback`);
        - Or a callback triggers training to stop, e.g. :class:`~lumin.nn.callbacks.cyclic_callbacks.OneCycle`
        
    Depending on the live_fdbk arguments, live plots of losses and other metrics may be shown during training, if running in Jupyter. By default, a live plot
    with extra information will be shown for training the first model, and afterwards no live plots will be shown. Showing the live plot slightly slows down the
    training, but can help highlight problems without having to wait until the end. Therefore this compromises between showing useful information and training
    speed, since any problems should hopefully be visible in the first model.

    Once training is finished, the state with the lowest validation loss is loaded, evaluated, and saved.

    Arguments:
        fy: :class:`~lumin.nn.data.fold_yielder.FoldYielder` interfacing to training data
        n_models: number of models to train
        bs: batch size. Number of data points per iteration
        primary_model_builder: :class:`~lumin.nn.models.model_builder.ModelBuilder` creating the primary networks to train
        adversarial_model_builder: :class:`~lumin.nn.models.model_builder.ModelBuilder` creating the adversary networks
        callback_partials: optional list of functools.partial, each of which will instantiate a :class:`~lumin.nn.callbacks.callback.Callback` when called
        eval_metrics: list of instantiated :class:`~lumin.nn.metric.eval_metric.EvalMetric`.
            At the end of training, validation data and model predictions will be passed to each, and the results printed and saved
        train_on_weights: If weights are present in training data, whether to pass them to the loss function during training
        eval_on_weights: If weights are present in validation data, whether to pass them to the loss function during validation
        patience: number of folds (sub-epochs) or cycles to train without decrease in validation loss before ending training (early stopping)
        max_epochs: maximum number of epochs for which to train
        live_fdbk: whether or not to show any live feedback at all during training (slightly slows down training, but helps spot problems)
        live_fdbk_first_only: whether to only show live feedback for the first model trained (trade off between time and problem spotting)
        live_fdbk_extra: whether to show extra information live feedback (further slows training)
        live_fdbk_extra_first_only: whether to only show extra live feedback information for the first model trained (trade off between time and information)
        shuffle_fold: whether to tell :class:`~lumin.nn.data.batch_yielder.BatchYielder` to shuffle data
        shuffle_folds: whether to shuffle the order of the training folds
        bulk_move: whether to pass all training data to device at once, or by minibatch. Bulk moving will be quicker, but may not fit in memory.
        savepath: path to which to save model weights and results
        verbose: whether to print out extra information during training
        log_output: whether to save printed results to a log file rather than printing them
        plot_settings: :class:`~lumin.plotting.plot_settings.PlotSettings` class to control figure appearance
        plots: Deprecated: loss history will always be shown,
            lr history will no longer be shown separately,
            and live feedback is now controlled by `live_fdbk` argument

    Returns:
        - results list of validation losses and other eval_metrics results, ordered by model training.
            Can be used to create an :class:`~lumin.nn.ensemble.ensemble.Ensemble`.
        - histories list of loss histories, ordered by model training
        - cycle_losses if an :class:`~lumin.nn.callbacks.cyclic_callbacks.AbsCyclicCallback` was passed, list of validation losses at the end of each cycle,
            ordered by model training. Can be passed to :class:`~lumin.nn.ensemble.ensemble.Ensemble`.
    '''

    os.makedirs(savepath, exist_ok=True)
    os.system(f"rm {savepath}/*.h5 {savepath}/*.json {savepath}/*.pkl {savepath}/*.png {savepath}/*.log")
    if callback_partials is None: callback_partials = []
    
    if log_output:
        old_stdout = sys.stdout
        log_file = open(savepath/'training_log.log', 'w')
        sys.stdout = log_file

    if plots is not None:
        warnings.warn("The plots argument is now depreciated and ignored. Loss history will always be shown, lr history will no longer be shown separately, \
                       and live feedback is now controlled by the four live_fdbk arguments. This argument will be removed in V0.6.")

    train_tmr = timeit.default_timer()
    results,histories,cycle_losses = [],[],[]
    nb = len(fy.foldfile['fold_0/targets'])//bs

    if live_fdbk:
        metric_log = MetricLogger(loss_names=['Train', 'Validation'], n_folds=fy.n_folds, extra_detail=live_fdbk_extra or live_fdbk_extra_first_only,
                                  plot_settings=plot_settings)
    
    model_bar = master_bar(range(n_models))
    for model_num in (model_bar):
        model_bar.show()
        val_id = model_num % fy.n_folds
        print(f"Training model {model_num+1} / {n_models}, Val ID = {val_id}")
        if model_num == 1:
            if live_fdbk_first_only: live_fdbk = False  # Only show fdbk for first training
            elif live_fdbk_extra_first_only: metric_log.extra_detail = False
        if live_fdbk: metric_log.reset()
        model_tmr = timeit.default_timer()
        os.system(f"rm {savepath}/best.h5")
        best_loss,epoch_counter,subEpoch,stop = math.inf,0,0,False
        loss_history = OrderedDict({'trn_loss': [], 'val_loss': []})
        cycle_losses.append({})
        trn_ids = _get_folds(val_id, fy.n_folds, shuffle_folds)

        primary_model,adversary_model = AdversarialModel(primary_model_builder),AdversarialModel(adversarial_model_builder)  # New

        val_fold = fy.get_fold(val_id)
        if not eval_on_weights: val_fold['weights'] = None

        primary_cyclic_callback,primary_callbacks,primary_loss_callbacks = None,[],[]
        adversary_cyclic_callback,adversary_callbacks = None,[]  # New

        for c in callback_partials: primary_callbacks.append(c(model=primary_model))
        for c in primary_callbacks:
            if isinstance(c, AbsCyclicCallback):
                c.set_nb(nb)
                primary_cyclic_callback = c
        for c in primary_callbacks:
            if isinstance(c, AbsModelCallback):
                c.set_val_fold(val_fold)
                c.set_cyclic_callback(primary_cyclic_callback)
                if getattr(c, "get_loss", None):
                    primary_loss_callbacks.append(c)
                    if live_fdbk: metric_log.add_loss_name(type(c).__name__)
                    loss_history[f'{type(c).__name__}_val_loss'] = []
        for c in primary_callbacks: c.on_train_begin(model_num=model_num, savepath=savepath)

        # New___
        for c in callback_partials: adversary_callbacks.append(c(model=adversary_model))
        for c in adversary_callbacks:
            if isinstance(c, AbsCyclicCallback):
                c.set_nb(nb)
                adversary_cyclic_callback = c
        for c in adversary_callbacks:
            if isinstance(c, AbsModelCallback):
                c.set_val_fold(val_fold)
                c.set_cyclic_callback(adversary_cyclic_callback)
        for c in adversary_callbacks: c.on_train_begin(model_num=model_num, savepath=savepath)
        # ______

        # Validation data
        if bulk_move:
            if fy.has_matrix and fy.yield_matrix: val_x = (to_device(Tensor(val_fold['inputs'][0]).float()), to_device(Tensor(val_fold['inputs'][1]).float())) 
            else:                                 val_x =  to_device(Tensor(val_fold['inputs']).float())
            val_y = to_device(Tensor(val_fold['targets'])) if bulk_move else Tensor(val_fold['targets'])
            if train_on_weights: val_w = to_device(to_tensor(val_fold['weights'])) if bulk_move else to_tensor(val_fold['weights'])
            else:                val_w = None
            if 'multiclass' in primary_model_builder.objective: val_y = val_y.long().squeeze()
            else:                                       val_y = val_y.float()

        epoch_pb = progress_bar(range(max_epochs), leave=True)
        if live_fdbk: model_bar.show()
        for epoch in epoch_pb:
            for trn_id in trn_ids:
                subEpoch += 1
                batch_yielder = BatchYielder(**fy.get_fold(trn_id), objective=primary_model_builder.objective,
                                             bs=bs, use_weights=train_on_weights, shuffle=shuffle_fold, bulk_move=bulk_move)
                loss_history['trn_loss'].append(primary_model.adversarial_fit(batch_yielder, primary_callbacks=primary_callbacks,
                                                adversary_model=adversary_model, adversary_callbacks=adversary_callbacks))  # New
                del batch_yielder

                if bulk_move:
                    val_loss = primary_model.evaluate(val_x, val_y, weights=val_w, callbacks=primary_callbacks)
                else:
                    batch_yielder = BatchYielder(**val_fold, objective=primary_model_builder.objective,
                                                 bs=bs, use_weights=train_on_weights, shuffle=shuffle_fold, bulk_move=bulk_move)
                    val_loss = primary_model.evaluate_from_by(batch_yielder, callbacks=primary_callbacks)
                    del batch_yielder

                loss_history['val_loss'].append(val_loss)
                loss_callback_idx = None
                loss = val_loss
                for i, lc in enumerate(primary_loss_callbacks):
                    l = lc.get_loss()
                    if l < loss: loss, loss_callback_idx = l, i
                    if verbose: print(f'{subEpoch} {type(lc).__name__} loss {l}, default loss {val_loss}')
                    l = loss if l is None or not lc.active else l
                    loss_history[f'{type(lc).__name__}_val_loss'].append(l)

                if primary_cyclic_callback is not None and primary_cyclic_callback.cycle_end:
                    if verbose: print(f"Saving snapshot {primary_cyclic_callback.cycle_count}")
                    cycle_losses[-1][primary_cyclic_callback.cycle_count] = val_loss
                    primary_model.save(str(savepath/f"{model_num}_cycle_{primary_cyclic_callback.cycle_count}.h5"))

                if loss <= best_loss:
                    best_loss = loss
                    epoch_pb.comment = f'Epoch {subEpoch}, best loss: {best_loss:.4E}'
                    if verbose: print(epoch_pb.comment)
                    epoch_counter = 0
                    if loss_callback_idx is not None: primary_loss_callbacks[loss_callback_idx].test_model.save(savepath/"best.h5")
                    else: primary_model.save(savepath/"best.h5")
                elif primary_cyclic_callback is not None:
                    if primary_cyclic_callback.cycle_end: epoch_counter += 1
                else:
                    epoch_counter += 1

                if live_fdbk: metric_log.update_vals([loss_history[l][-1] for l in loss_history])
                if epoch_counter >= patience or primary_model.stop_train:  # Early stopping
                    print('Early stopping after {} epochs'.format(subEpoch))
                    stop = True; break
            if live_fdbk: metric_log.update_plot(best_loss)
            if stop: break

        primary_model.load(savepath/"best.h5")
        primary_model.save(savepath/f'train_{model_num}.h5')
        for c in primary_callbacks: c.on_train_end(fy=fy, val_id=val_id, bs=bs if not bulk_move else None)

        histories.append({})
        histories[-1] = loss_history
        results.append({})
        results[-1]['loss'] = best_loss
        if eval_metrics is not None and len(eval_metrics) > 0:
            y_pred = primary_model.predict(val_fold['inputs'], bs=bs if not bulk_move else None)
            for m in eval_metrics: results[-1][m] = eval_metrics[m].evaluate(fy, val_id, y_pred)
        print(f"Scores are: {results[-1]}")
        with open(savepath/'results_file.pkl', 'wb') as fout: pickle.dump(results, fout)
        with open(savepath/'cycle_file.pkl', 'wb') as fout: pickle.dump(cycle_losses, fout)
        
        plt.clf()
        print(f"Fold took {timeit.default_timer()-model_tmr:.3f}s\n")

    print("\n______________________________________")
    print("Training finished")
    print(f"Cross-validation took {timeit.default_timer()-train_tmr:.3f}s ")
    plot_train_history(histories, savepath/'loss_history', settings=plot_settings)
    for score in results[0]:
        mean = uncert_round(np.mean([x[score] for x in results]), np.std([x[score] for x in results])/np.sqrt(len(results)))
        print(f"Mean {score} = {mean[0]}±{mean[1]}")
    print("______________________________________\n")
    if log_output:
        sys.stdout = old_stdout
        log_file.close()
    return results, histories, cycle_losses

Ratio plots

In HEP it is useful to compare 'collider data' to 'MC data' as a ratio plot. These are normally placed under more informative plots, e.g. a feature histogram with a data/MC ratio panel beneath it (screenshot omitted).

It would be useful to have something similar for LUMIN, like a subplot that sits under the main plot, but I'm not sure how easy it would be to implement in a way that generalises to all plots. For now it will probably be sufficient to extend plot_binary_sample_feat to include a ratio plot of background to collider data, or signal to background (and to extend it to plot collider data as dots, with uncertainty bands for background and collider data). It's not a priority, though, since most analyses will probably still produce final plots in ROOT.
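
A minimal matplotlib sketch of the ratio-panel layout (toy data, not a LUMIN API):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
mc_vals, data_vals = rng.normal(100, 15, 100000), rng.normal(100, 15, 1000)
bins = np.linspace(50, 150, 41)
centres = (bins[:-1]+bins[1:])/2
mc, _ = np.histogram(mc_vals, bins=bins)
mc = mc*len(data_vals)/len(mc_vals)  # normalise MC to the data yield
data, _ = np.histogram(data_vals, bins=bins)

fig, (ax_main, ax_ratio) = plt.subplots(2, 1, sharex=True, gridspec_kw={'height_ratios': [3, 1]})
ax_main.step(centres, mc, where='mid', label='MC')
ax_main.errorbar(centres, data, yerr=np.sqrt(data), fmt='ko', label='Data')
ax_main.legend()
ratio = np.divide(data, mc, out=np.zeros_like(mc), where=mc > 0)
ax_ratio.errorbar(centres, ratio, yerr=np.sqrt(data)/np.maximum(mc, 1), fmt='ko')
ax_ratio.axhline(1, linestyle='--', color='grey')
ax_ratio.set_ylabel('Data/MC')
plt.show()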

JIT compilation?

PyTorch has the ability to Just In Time compile stuff to make it run quicker and be more memory efficient. I'd tried to do this a while ago with @weak_script and @weak_module decorators, however they didn't seem to do much and I had trouble automatically generating the docs. I then found that PyTorch recommended that users not use these decorators. Since then PyTorch apparently introduced @torch.jit.script decorators, which are for user use and supposedly provide noticeable improvements in speed and memory usage.

One example could be compiling activation functions (the original issue included screenshots of JIT-scripted activation implementations).

Whereas LUMIN's implementation of Swish is simply: x*torch.sigmoid(x). Other possibilities could be in LUMIN's loss functions (e.g. WeightedMSE). I'm not sure how far one can take this; should all things related to PyTorch be JIT compiled, or perhaps only operations on tensors?

A starting point would be to test the JIT-compiled Swish against the current version, and then to try to find out more about what should be JITed, and what shouldn't.
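
A concrete starting point, applying the decorator to LUMIN's Swish expression:

import torch

@torch.jit.script
def swish_jit(x: torch.Tensor) -> torch.Tensor:
    # Scripted version of the eager Swish: x * sigmoid(x)
    return x * torch.sigmoid(x)

x = torch.randn(1024, requires_grad=True)
y = swish_jit(x)  # same result as the eager version; benchmark both for speed and memory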

Prepare for scikit-learn 0.25

LUMIN was recently moved to use scikit-learn >= 0.23.1 (the latest version at the time of writing). From scikit-learn 0.25, methods will expect all arguments to be named, rather than positional. Version 0.23 should raise a FutureWarning when positional arguments are used, and from 0.25 they will raise a TypeError. LUMIN needs to update all calls to scikit-learn to use keyword arguments ASAP, to cut down on warnings before they become errors.

Extend LRFinder to run over multiple epochs

Current state

The LRFinder callback runs once over every fold in the FoldYielder it is passed, and the LR step sizes are computed based on the batchsize, range of LRs specified, and the amount of training data.

Problem

For small datasets, the step sizes must be very large in order to cover a sufficient range of LRs, potentially leading to a jagged curve which may not be representative of the ideal LR.

Solution

A new argument is added to LRFinder and fold_lr_find to allow the LR range test to run over multiple uses of the same fold, in order to get finer step sizes for small datasets.
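
From the user's side this might look as follows; this is a sketch only, the n_repeats name is hypothetical, and the other arguments should be checked against the current fold_lr_find signature:

from lumin.optimisation.hyper_param import fold_lr_find  # import path assumed

# n_repeats=3 would run the range test over each fold three times,
# giving finer LR step sizes on small datasets
fold_lr_find(fy, model_builder, bs=256, lr_bounds=[1e-5, 1e1], n_repeats=3)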

Introduction of training phases

Idea

Currently, training of a model uses the same settings, callbacks, and data for the entire training process. It could well be the case that the user wishes to change certain aspects at set points during the training. A simple example could be changing the LR cycle callbacks. A more complicated example could be changing the training data during training, e.g. from Delphes to Geant4 simulations. Another example could be starting with parts of the model frozen, and then unfreezing them at a set point (e.g. pre-training part of the model to work better on low-level information before introducing high-level information).

This could potentially be allowed by defining training phases, each with their own sets of settings, callbacks, and data.

This idea will no doubt require large changes to fold_train_ensemble, and some sort of 'trigger' callback to move to the next training phase.
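
As a purely illustrative sketch of what a phase definition might look like (none of these names exist in LUMIN yet):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TrainingPhase:
    n_epochs:      int                         # how long this phase lasts
    fold_yielder:  'FoldYielder'               # data used during this phase
    callbacks:     List['Callback'] = field(default_factory=list)  # e.g. LR-cycle callbacks
    frozen_layers: Optional[List[str]] = None  # layers kept frozen during this phase

# fold_train_ensemble could then consume an ordered list of TrainingPhases,
# with a trigger callback advancing training to the next phase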

Improve plot_feat method (AKA move to latest Seaborn once released)

Current status

The plot_feat method provides 1D distributions for features in the form of histograms and KDEs, and also computes the mean and standard deviations of the distributions and their uncertainties.

Problems

  • Whilst the method accepts a weight argument, in order to plot weighted KDEs the data are resampled with replacement according to probabilities given by the (normalised) weights.
    • This is a bit of a hack. The next release of Seaborn should include the ability to plot weighted KDEs (PR), so the resampling will no longer be necessary.
    • This resampling method also means that all the data must have non-negative weights, which is not always the case in HEP.

Solution

Once the latest version of Seaborn is released, update plot_feat to use weighted KDEs (a sketch of the expected call follows the list). This will require:

  1. Deprecating some arguments related to data resampling
  2. Changing the moments-computation code to handle weighted data (the moments are currently computed on the resampled data, and so are not weighted computations)
  3. Updating plot_kdes_from_bs to use lineplot, since tsplot was removed in Seaborn v0.10
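
The weighted-KDE call might then look like this, assuming the upcoming Seaborn release exposes a weights argument on kdeplot as per the linked PR:

import numpy as np
import seaborn as sns

x = np.random.normal(size=1000)             # feature values
w = np.random.uniform(0.5, 1.5, size=1000)  # per-event weights
sns.kdeplot(x=x, weights=w)                 # no resampling required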

Consistency between epochs and subepochs for user-facing arguments

Idea

LUMIN's basic 'tick' for most things, except mini-batch updates, is one complete use of a fold of data (referred to as a subepoch). Typically the training data consists of multiple folds, and an epoch refers to the full use of all training folds (multiple subepochs). Callbacks like OneCycle define their cycle lengths in terms of subepochs, however setting the upper limit for training in fold_train_ensemble is done in terms of epochs. This inconsistency may be confusing to new users, or to those with experience of other frameworks.

The idea is that all user-facing arguments relating to subepochs or epochs should consistently use one of the two, and not a mixture of both.

Investigate usage of DeepLift & SHAP

DeepLift (Shrikumar, Greenside, & Kundaje, 2017) is a method for interpreting trained networks, and could be useful to incorporate into LUMIN. The SHAP package appears to offer an implementation of it, along with other useful classes for interpreting models (including trees). Perhaps it would be good to include wrapper methods for these classes to allow their use with LUMIN models and PlotSettings, similar to plot_1d_partial_dependence?
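
As a sketch of how such a wrapper might call SHAP (model and background are assumed to be a trained torch.nn.Module and a tensor of background samples):

import shap

# DeepExplainer implements DeepLIFT-style attributions for deep networks
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(test_inputs)     # per-feature attributions
shap.summary_plot(shap_values, test_inputs.numpy())  # global importance overview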

Improve clarity for which arguments must be set by the user

Problem

Many classes are designed to be passed as partial to other methods and classes which will add additional arguments and then instantiate them, however it is not always obvious to the user which arguments will be supplied by the wrapper function and what they must set themselves.

An example of this is partial(CycleLR, lr_range=(0, 6e-3), cycle_mult=2), where a callback is partially defined and will later be passed to fold_train_ensemble. fold_train_ensemble will then set the model and nb arguments of the callback, however it is not clear to the user that these arguments are automatically set. Similarly, they may assume that e.g. cycle_mult might be set by fold_train_ensemble, when in fact it will not be.

Another example is when building a model: body = partial(FullyConnected, depth=4, width=100, act='swish'). The other required arguments for FullyConnected, n_in and feat_map are set by ModelBuilder, but again it is not clear that the user is not expected to set these arguments.
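
For illustration, the full pattern looks like this (the import path is assumed):

from functools import partial
from lumin.nn.callbacks.cyclic_callbacks import CycleLR  # import path assumed

# The user sets only the hyper-parameters; fold_train_ensemble later supplies
# the model and nb arguments when it instantiates the callback
cycle = partial(CycleLR, lr_range=(0, 6e-3), cycle_mult=2)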

Thoughts

  • LUMIN is designed to be modular, and so all classes should work in isolation as well as with the intended wrapper methods and classes.
    • This makes it difficult to remove, or to set some naming convention for, arguments that will be set by the wrappers.
  • Perhaps the best approach is to clearly label in the documentation which arguments can be ignored when using partial definitions, and to warn users when a preset argument value will be overwritten.
  • This isn't urgent, but if a convention is adopted, it should be done early, before the code-base increases in size.

Example of PlotSettings

The use and flexibility of the PlotSettings class is only implicitly shown in the examples. It could be of use to have a dedicated example of the different settings available. This could also be used to demonstrate some of the less-used plotting methods in LUMIN.
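
A flavour of what such an example might cover; this is purely illustrative, and the argument names should be checked against the PlotSettings documentation:

from lumin.plotting.plot_settings import PlotSettings  # import path assumed

# One settings object can then be passed to every plotting call for a unified look
settings = PlotSettings(style='whitegrid', cat_palette='tab10', title_sz=20)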

Add __repr__ to ModelBuilder

The ModelBuilder class should have a __repr__ method that presents a summary of its settings. This could be as simple as instantiating a Model and returning its __repr__ value.
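
A minimal sketch of the suggested approach (names and import path assumed to match LUMIN's API):

def model_builder_repr(builder) -> str:
    # Build a throw-away Model from the builder and reuse its own __repr__
    from lumin.nn.models.model import Model  # import path assumed
    return repr(Model(builder))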

Tests

Current state

Development of new methods and classes is normally done whilst solving a specific problem, and only once the code works is it added to the code-base. Still, changes and deprecations may cause parts of the code to begin to fail, or perhaps the code does not work in all cases (e.g. edge cases exist and are not accounted for).

In order to help protect against this, the examples are designed to utilise as much of the code-base as possible in realistic scenarios. They then function as tests, and are run (at least) prior to the release of a new version, with any errors fixed before release.

Concerns

  • The examples can be quite slow to run and, being Jupyter Notebooks, might be difficult to run in an automated fashion; feedback might also be limited
  • Full coverage with unit tests of all methods and classes might be difficult due to the requirement of extensive mocking, and may not accurately represent a realistic test, or capture interdependence of functions
    • My experience with unit testing, though, is only a 3-month industrial secondment, i.e. not extensive. Perhaps approaches exist to better capture interdependence.
  • If examples are used as tests, then as the code-base grows, so must the range of examples
  • Examples by their nature will focus only on common application cases - they may miss edge cases
  • Whilst code may run correctly, some changes may lead to slow-down. This can be difficult to spot without continual monitoring of timing on a fixed task
  • Similarly, code may run correctly, but changes may lead to a loss of performance. This can be difficult to spot without continual monitoring of performance on a fixed task

Proposals

  • Create test versions (as .py) of the examples which step through each stage of the code; these can then be used as continuous-integration tests
    • Allows for better coverage of functions and edge cases
    • Timing and performance of each function can be recorded to check for slow-down & degradation in code-base
    • Faster feedback on breaking changes, rather than just prior to deployment
  • Check how other frameworks, like FastAI, approach testing

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.5.1. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary lumin -w /tmp/ext lumin==0.5.1
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting lumin==0.5.1
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/561/5d2232da9ea91/lumin-0.5.1.tar.gz (116 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-2nqqp23z/lumin/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-2nqqp23z/lumin/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-2nqqp23z/lumin/pip-egg-info
         cwd: /tmp/pip-wheel-2nqqp23z/lumin/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-2nqqp23z/lumin/setup.py", line 9, in <module>
        with open('requirements.txt') as f: requirements = f.read().strip().split('\n')
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
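
The traceback shows setup.py failing to read requirements.txt from the unpacked sdist, so a likely fix (an assumption, to be verified against the project's packaging set-up) is to declare the file in MANIFEST.in:

# MANIFEST.in
include requirements.txt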

Investigate BootstrapResample

class BootstrapResample(Callback) runs bootstrap resampling on training data during training, with the idea that the models in an ensemble become more decorrelated from one another, similar to bagging in Random-Forest training. This resampling can optionally be performed differently for every epoch, which may have an impact on over-training (either good or bad).

It would be interesting to test this out further, to see whether the bagging has any real impact on model correlation, ensemble performance, and single-model performance.

Deprocessing of matrix data?

Problem

FoldYielder.get_df is a method to return data from the foldfile as a pandas.DataFrame. Optionally, input data can be returned, and further optionally it can be deprocessed if an input_pipe was provided. Ideally the same should be possible for matrix data. Whilst the potential for a separate input pipe for matrix data has been added, it is not always obvious how the matrix data was originally preprocessed, making deprocessing ambiguous.

Current state

Matrix data is not deprocessed when returned by FoldYielder.get_df, and no warnings are raised to signal this to the user.

Possible solutions

  • State that no attempt to deprocess matrix data will be made, and that the ability to add matrix input pipes is purely for user convenience
  • Attempt to deprocess matrix data (see the sketch below):
    • Try either flattening the matrix, inverse-transforming, and then reshaping, or passing each row/column through the input pipe. The correct approach can be guessed from the length of the means stored in the pipe, but is ambiguous for square matrices.
    • If attempts fail, or ambiguity exists, return the processed data along with a warning
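
A rough sketch of the guessing heuristic; purely illustrative, and it assumes the pipe is an sklearn transformer exposing a fitted mean_ attribute, which will not hold for all pipes:

import numpy as np

def deprocess_matrix(mat: np.ndarray, pipe) -> np.ndarray:
    n_means = len(pipe.mean_)  # length of the fitted means hints at how the pipe was fitted
    if mat.shape[0] == mat.shape[1]:
        raise ValueError("Square matrix: deprocessing is ambiguous")
    if n_means == mat.shape[1]:  # pipe was fitted column-wise
        return pipe.inverse_transform(mat)
    if n_means == mat.size:      # pipe was fitted on the flattened matrix
        return pipe.inverse_transform(mat.reshape(1, -1)).reshape(mat.shape)
    raise ValueError("Cannot infer how the matrix data was preprocessed")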

Wrapper methods for YellowBrick / generic plot-method wrapper?

Yellowbrick: Machine Learning Visualization seems to have some really nice methods and visualisations for various aspects of ML. These should be investigated, and perhaps the most useful/unique ones wrapped to work with LUMIN models and PlotSettings.

Perhaps, really, we need a method that can call generic plotting methods and then apply stylings to the returned plot, to save having to wrap plotters individually. I'm not sure how easy this would be, and it would probably require that the plotting method returns the plot rather than plotting it directly.
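
A very rough sketch of such a wrapper; the title_sz and lbl_sz attribute names are assumptions in the style of PlotSettings:

def styled_plot(plot_fn, *args, settings=None, **kwargs):
    # Only works if the wrapped plotter returns a matplotlib Axes
    ax = plot_fn(*args, **kwargs)
    if settings is not None:
        ax.title.set_fontsize(settings.title_sz)      # attribute names assumed
        ax.xaxis.label.set_fontsize(settings.lbl_sz)
        ax.yaxis.label.set_fontsize(settings.lbl_sz)
    return ax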

Numpy version of `df2foldfile`

Current state

The df2foldfile method is currently the main helper method for building a foldfile from data, however it assumes that the data are supplied as a Pandas DataFrame. It is possible that the user's data might instead be in Numpy arrays (e.g. X inputs, y targets). Currently the user would have to convert their data to a DataFrame themselves and then pass it to df2foldfile.

Suggestion

A new method arr2foldfile is written to take input and target arrays, and an optional weight array, as well as the other required arguments for df2foldfile. arr2foldfile then builds a temporary DataFrame from the supplied arrays and passes it to df2foldfile.
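
A minimal sketch of the proposed helper (arr2foldfile does not yet exist; the column names and import path are illustrative):

import numpy as np
import pandas as pd
from lumin.data_processing.file_proc import df2foldfile  # import path assumed

def arr2foldfile(x: np.ndarray, y: np.ndarray, w: np.ndarray = None, **df2foldfile_kwargs):
    # Build a temporary DataFrame from the arrays, then delegate to df2foldfile
    df = pd.DataFrame(x, columns=[f'x_{i}' for i in range(x.shape[1])])
    df['target'] = y
    if w is not None: df['weight'] = w
    return df2foldfile(df, **df2foldfile_kwargs)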

Optimise exponentiation

https://chrissardegna.com/blog/posts/python-expontentiation-performance/ studies the performance of different methods of exponentiation, and finds that chained multiplication should be used for integer powers less than or equal to 5, and that math.pow() should be used otherwise; i.e. never use **.

It doesn't study exponentiation of Numpy arrays. It will probably be useful to check whether np.power and ** are equivalent for arrays, and to compare math.pow to np.power. There is also an argument for readability against chained multiplication.

This is only minor, but it could be useful to go through the code-base and optimise the exponentiation that is used. Probably just search for **.
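
A quick illustrative timing comparison of the three styles discussed above (results will vary by machine and Python version):

import math
import timeit

x = 1.2345
print(timeit.timeit(lambda: x**4))            # ** operator
print(timeit.timeit(lambda: x*x*x*x))         # chained multiplication
print(timeit.timeit(lambda: math.pow(x, 4)))  # math.pow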

Docstrings: imperative to descriptive

When writing doc-strings, I began by using imperative strings, e.g. "Plot KDEs computed via :meth:~lumin.utils.statistics.bootstrap_stats". Following a recommendation, I then switched to writing descriptive doc-strings, e.g. "Plots KDEs computed via :meth:~lumin.utils.statistics.bootstrap_stats".

Ideally the style should be consistent for all doc-strings, so the old ones need to be updated.

Add Mish activation

The Mish activation function, x·tanh(ln(1+exp(x))), i.e. x·tanh(softplus(x)) (Misra, 2019), received a lot of attention in 2019, and seems to perform quite well. It should be added to LUMIN as a supported activation function.

I tried to do this already, but my implementation was really slow, and in the end I never committed it. There seems to be a good deal of information about implementations of it on its GitHub repo (which is MIT licensed). Considering issue #70, it is probable that JIT compilation should be used.

Addition of Mish (and other activation functions) would involve adding its definition to activations.py (or, if a licensed version is copied, to a new file with a header carrying the licence terms and stating that the LUMIN Apache-2.0 licence does not cover code contained in that file (see e.g. lsuv_init.py for an example)). Mish would then need to be added to the lookup_act method, so that it can be called via a string.
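
A minimal sketch of Mish using torch.jit.script, as suggested by issue #70; this is not LUMIN's committed implementation:

import torch
import torch.nn.functional as F

@torch.jit.script
def mish(x: torch.Tensor) -> torch.Tensor:
    # x * tanh(softplus(x)) == x * tanh(ln(1 + exp(x)))
    return x * torch.tanh(F.softplus(x))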
