gilesstrong / lumin

LUMIN - a deep learning and data science ecosystem for high-energy physics.
License: Apache License 2.0
plot_roc provides a variety of options for computing and plotting ROC curves, including bootstrap resampling to compute the uncertainty on the ROC AUCs. Whilst the mean ROC AUCs are displayed along with their uncertainty, the curves themselves are plotted as single lines with no uncertainty. Ideally, the bootstrap resamples of the data should also be used to compute uncertainty bands for the curves.

_bs_roc_auc is extended to compute and return multiple ROC curves. plt.plot is replaced with sns.lineplot, set to show the standard deviation as a band, computed using the ROC curves returned by _bs_roc_auc.
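As a sketch of the plotting side (assuming _bs_roc_auc can be made to return the resampled curves; plot_bs_roc and its arguments here are hypothetical), each resample's curve can be interpolated onto a common FPR grid so that sns.lineplot can draw the mean curve with a standard-deviation band:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import roc_curve

def plot_bs_roc(y_true, y_score, n_bs=100):
    base_fpr = np.linspace(0, 1, 101)  # common grid for interpolation
    curves, rng = [], np.random.default_rng(1234)
    for _ in range(n_bs):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        fpr, tpr, _ = roc_curve(y_true[idx], y_score[idx])
        curves.append(pd.DataFrame({'fpr': base_fpr, 'tpr': np.interp(base_fpr, fpr, tpr)}))
    sns.lineplot(x='fpr', y='tpr', data=pd.concat(curves), ci='sd')  # mean curve with std band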
fold_train_ensemble trains a set of models, using cross-validation over fixed folds of data. The order of use of folds is fixed: val_id = model_num % fy.n_folds.

The training loop may be interrupted, for whatever reason. In these cases, the user must currently restart the training from the beginning.

A new argument is added to fold_train_ensemble that will allow training to begin from a specified model number. This would most likely add a lower bound to model_bar = master_bar(range(n_models)), however it must also alter os.system(f"rm {savepath}/*.h5 {savepath}/*.json {savepath}/*.pkl {savepath}/*.png {savepath}/*.log") to avoid deleting the previously trained models.
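A minimal sketch of what this could look like (start_model_num is a hypothetical name):

import os
from fastprogress import master_bar

def fold_train_ensemble(fy, n_models, savepath, start_model_num=0, **kwargs):
    if start_model_num == 0:  # only wipe old outputs when starting from scratch
        os.system(f"rm {savepath}/*.h5 {savepath}/*.json {savepath}/*.pkl {savepath}/*.png {savepath}/*.log")
    model_bar = master_bar(range(start_model_num, n_models))
    for model_num in model_bar:
        val_id = model_num % fy.n_folds  # fold order is deterministic, so resuming stays consistent
        ...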
Many classes are designed to be passed as a partial to other methods and classes, which add additional arguments and then instantiate them; however it is not always obvious to the user which arguments will be supplied by the wrapper function and which they must set themselves.

An example of this is partial(CycleLR, lr_range=(0, 6e-3), cycle_mult=2), where a callback is partially defined and will later be passed to fold_train_ensemble. fold_train_ensemble will then set the model and nb arguments of the callback, however it is not clear to the user that these arguments are automatically set. Similarly, they may assume that e.g. cycle_mult might be set by fold_train_ensemble, when in fact it will not be.

Another example is when building a model: body = partial(FullyConnected, depth=4, width=100, act='swish'). The other required arguments for FullyConnected, n_in and feat_map, are set by ModelBuilder, but again it is not clear that the user is not expected to set these arguments.
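To illustrate the division of responsibilities (simplified; the instantiation pattern mirrors the training-loop code pasted later in this page, and the import path is assumed):

from functools import partial
from lumin.nn.callbacks.cyclic_callbacks import CycleLR

cb_partial = partial(CycleLR, lr_range=(0, 6e-3), cycle_mult=2)  # the user sets these

# Inside fold_train_ensemble (simplified):
# callback = cb_partial(model=model)  # model is supplied by the wrapper...
# callback.set_nb(nb)                 # ...as is the number of batches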
I don't know too much about GitHub Actions, but certainly something like "Publish Python Package" could be useful, since publishing is a time-consuming process that is easy to get wrong, or forget how to do. An action to generate the docs templates could be good, as well.
Users must currently track which versions of LUMIN were used when training models, due to changes in ModelBuilder. Whilst I try to accommodate changes for a while when breaking changes or deprecations occur, it would still be useful if models and ensembles saved the current version of LUMIN as metadata. This would make it easier to work with very old models without having to try loading them in every previous LUMIN release.

Model.save and Model.load are extended to include saving & loading the current version of LUMIN in the dictionary. Similarly, Ensemble.save and Ensemble.load could save/load details about the current LUMIN version. ModelBuilder should probably also include a variable set to the current LUMIN version, since it gets pickled during Ensemble.save.
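A minimal sketch, assuming Model.save/load already wrap torch.save/torch.load around a state dictionary (and that lumin exposes __version__):

import torch
import lumin

def save(self, name):
    torch.save({'model': self.model.state_dict(),
                'opt': self.opt.state_dict(),
                'lumin_version': lumin.__version__}, name)  # record the training version

def load(self, name):
    state = torch.load(name)
    version = state.get('lumin_version', 'unknown (pre-versioning)')  # fallback for old saves
    self.model.load_state_dict(state['model'])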
Currently, training and application of NNs in LUMIN has only really been tested in Jupyter notebooks. It should also be tested to make sure it all works fine when used in executed Python files and in the REPL.
Currently every class, method, et cetera must be imported from the file in which it is defined. This potentially makes finding the thing one wishes to import quite difficult, due to the number of files and submodules, and will only get worse with time. Even I, the main developer, forget where things are defined and have to spend time searching for them.

I attempted to solve this by introducing imports at the top of each submodule in __init__.py, the idea being that users could then import from the submodule rather than the exact file. This however led to circular imports between modules and submodules, due to the interconnected nature of the package. This attempt still exists in the code-base, but is commented out and not guaranteed to be up to date.
The MetricLogger class computes metrics and displays information during training in the form of realtime plots. This isn't much use (and may even cause errors) if the training is being performed in an executed Python file or the REPL. Ideally, MetricLogger should detect whether it is being used in a Jupyter environment and, if not, should print the feedback information to the prompt (perhaps overwriting the values each time rather than printing new lines, to save space).
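The standard heuristic for this detection (get_ipython is only defined inside IPython-based environments):

def in_jupyter() -> bool:
    try:
        return get_ipython().__class__.__name__ == 'ZMQInteractiveShell'  # notebook kernel
    except NameError:
        return False  # plain script or REPL

# Fallback feedback in a script/REPL, overwriting a single status line:
# print(f'Subepoch {se}: trn={trn_loss:.4E} val={val_loss:.4E}', end='\r')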
HEPAugFoldYielder applies train-time and test-time data augmentations to HEP data (phi rotations, transverse & longitudinal flips). This is performed when loading the data since, originally, this was the last point at which the feature names for the data were known to the model. Later changes to LUMIN now mean that the model has a list of named features and how they map to the input features. This means that the data augmentation could instead be performed by a callback during training (similar to the suggestion of issue #68).

It seems a bit strange that the choice of whether or not to augment the data is made by changing how the data is loaded from file. Specifying the choice as a callback makes a bit more sense (to me). This also avoids complications once additional forms of augmentation are added, which may otherwise require their own FoldYielder classes, and we must then account for all possible combinations of different types of augmentation.

Depending on the choices made in issue #50, this may reduce the efficiency of augmentation, but it's possible that augmenting the data in place on device may actually be more efficient, since it could be done multithreaded. This would perhaps avoid the need to augment as a pandas.DataFrame, and maybe pre-cached rotation matrices could be used, in some part, to speed things up. Since the data is already on device, this would actually be quicker than loading from disc, augmenting, and then loading to device; the latter is known to cause particular slow-down when working on GPU.

The callback would need to mimic the behaviour of HEPAugFoldYielder, i.e. provide random augmentation during training, and a choice of either set transformations or random ones during testing. It would need to be passed as a callback during training and prediction.
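A minimal sketch of the train-time phi rotation as such a callback, assuming the (px, py) column indices can be obtained from the model's feature map, and using the on_epoch_begin(by=...) hook seen in the training-loop code pasted later in this page; everything else here is an assumption:

import math
import torch
from lumin.nn.callbacks.abs_callback import AbsCallback

class PhiRotationAug(AbsCallback):
    def __init__(self, pxy_idxs, model=None):
        super().__init__(model=model)
        self.pxy_idxs = pxy_idxs  # list of (px_col, py_col) index pairs, one per particle

    def on_epoch_begin(self, by=None, **kargs):
        x = by.inputs  # fold inputs, already tensors on device when bulk_move=True
        phi = torch.rand(len(x), device=x.device)*2*math.pi  # one random phi per event
        c, s = torch.cos(phi), torch.sin(phi)
        for px, py in self.pxy_idxs:  # rotate every particle by the same event-level phi
            x[:, px], x[:, py] = c*x[:, px]-s*x[:, py], s*x[:, px]+c*x[:, py]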
Additionally, tests should be done to compare the speed and memory usage of the callback to HEPAugFoldYielder.

If successful, this would deprecate HEPAugFoldYielder.
LUMIN was recently moved to use Scikit-learn >= 0.23.1 (latest version at time of writing). From Scikit-learn 0.25, methods will expect most arguments to be passed by keyword, rather than positionally. Version 0.23 raises FutureWarning when such positional arguments are used, and from 0.25 they will raise TypeError. LUMIN needs to update all calls to scikit-learn to use keyword arguments ASAP, to cut down on warnings before they become errors.
DeepLIFT (Shrikumar, Greenside, & Kundaje, 2017) is a method for interpreting trained networks, and could be useful to incorporate into LUMIN. The SHAP package appears to offer an implementation of it, along with other useful classes for interpreting models (including trees). Perhaps it could be good to include wrapper methods for these classes to allow their use with LUMIN models and PlotSettings, similar to plot_1d_partial_dependence?
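A hedged sketch of what such a wrapper might do internally (assuming model.model exposes the underlying torch module, and that the user prepares background and test tensors):

import shap

explainer = shap.DeepExplainer(model.model, x_background)  # DeepLIFT-style explainer
shap_values = explainer.shap_values(x_test)
shap.summary_plot(shap_values, x_test.cpu().numpy())  # styling could then come from PlotSettings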
The df2foldfile method is currently the main helper method for building a foldfile from data, however it assumes that the data is supplied as a pandas.DataFrame object. It is possible that the user's data might be in Numpy arrays (e.g. X inputs, y targets). Currently the user would have to convert their data to a DataFrame and then pass it to df2foldfile.

A new method arr2foldfile is written to take input and target arrays and an optional weight array, as well as the other required arguments for df2foldfile. arr2foldfile then builds a temporary DataFrame from the supplied arrays and passes it to df2foldfile.
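A minimal sketch, assuming df2foldfile's existing keyword interface; the column names used here are placeholders:

import pandas as pd

def arr2foldfile(x, y, w=None, cont_feats=None, **df2foldfile_kwargs):
    if cont_feats is None: cont_feats = [f'feat_{i}' for i in range(x.shape[1])]
    df = pd.DataFrame(x, columns=cont_feats)  # temporary DataFrame built from the arrays
    df['gen_target'] = y
    if w is not None: df['gen_weight'] = w
    df2foldfile(df, cont_feats=cont_feats, **df2foldfile_kwargs)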
One thing I've been surprised about is that increasing the amount of test-time data augmentation can sometimes decrease model performance. I would have thought that it would saturate, rather than decrease; all transformations result in valid events, unlike, say, in image augmentation, where augmented data is only similar to real data and things like Polyak averaging are useful.

I've been over the code in HEPAugFoldYielder multiple times and can't spot any errors, but a second set of eyes might help. There might also be errors in my assumptions about the input data...

If the code is correct, then it would be interesting to investigate further what causes the degradation in performance, and whether it is reproducible on other datasets.
The main training function fold_train_ensemble uses early stopping by default: if a number of subepochs (the patience) pass without an improvement in the validation loss, then training stops.

Sometimes the validation loss will move to a plateau with a shallow slope; the validation loss continues to decrease, but at a rate which has minimal impact on performance.

A new callback is introduced to stop training if the validation loss doesn't decrease by a certain amount (or fraction) after a certain number of epochs. This should ideally scale automatically to the typical loss values of each particular training, and should accurately detect plateaus without fine tuning by the user. It could, for instance, monitor the rate of change of the loss; MetricLogger already does something similar by monitoring the loss velocity. It should also be resistant to fluctuations in the loss, which may occur particularly in heavily weighted data.
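A hypothetical sketch of the plateau detection (names and thresholds are placeholders): fit the recent loss history with a straight line and stop when the relative slope falls below a threshold; the windowed fit damps fluctuations from heavily weighted data.

import numpy as np

class PlateauStop:
    def __init__(self, window=10, min_rel_slope=1e-4):
        self.window, self.min_rel_slope = window, min_rel_slope
        self.losses = []

    def on_epoch_end(self, val_loss) -> bool:
        self.losses.append(val_loss)
        if len(self.losses) < self.window: return False
        recent = np.array(self.losses[-self.window:])
        slope = np.polyfit(np.arange(self.window), recent, 1)[0]  # per-subepoch rate of change
        return -slope/recent.mean() < self.min_rel_slope  # True -> stop training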
Yellowbrick: Machine Learning Visualization seems to have some really nice methods and visualisations for various aspects of ML. These should be investigated, and perhaps the most useful/unique ones wrapped to work with LUMIN models and PlotSettings.

Perhaps, really, we need a method that can call generic plotting methods and then apply stylings to the returned plot, to save having to wrap plotters individually. I'm not sure how easy this would be, and it would probably require that the plotting method returns the plot rather than plotting it directly.
The use and flexibility of the PlotSettings class is only implicitly shown in the examples. It could be of use to have a dedicated example of the different settings available. This could also be used to demonstrate some of the less-used plotting methods in LUMIN.
Ensemble and Model classes have get_feat_importance methods to compute the permutation importance of input features. Currently the input data must be supplied as a FoldYielder object. There are occasions when one may wish to evaluate the feature importance on only a subset of the data (e.g. only on 2-jet events). This currently requires saving the subset to a foldfile and instantiating a new FoldYielder to point to the subset of data.

The get_feat_importance methods are extended to take pandas.DataFrame objects as inputs. This will no doubt impact certain aspects of the returned information, such as averaging over folds and computing uncertainties. I think it is reasonable that if the user really wants this extra information, they can export a new foldfile. This extension would simply be for getting the rough information quickly and producing an informative plot, which wouldn't necessarily be used for publication.
AbsEndcap acts as a wrapper to apply fixed functions to the outputs of models that were trained on proxy objectives, e.g. to compute the invariant mass from a model that outputs the 3-momenta of two particles (see the Multi_Target_Regression example).

AbsEndcap currently does this via a .forward method and a .predict method, so in theory this should also work for ensembles of models. Wrapping an Ensemble, though, then removes many of the benefits that the Ensemble class offers.

The Ensemble class is extended to include the ability to add an AbsEndcap, through which the outputs of all internal models will be passed.
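A hedged sketch of the extension (add_endcap is a hypothetical method; the real Ensemble applies weighted averaging, a simple mean is used here for brevity):

import numpy as np

class Ensemble:
    ...
    def add_endcap(self, endcap): self.endcap = endcap

    def predict(self, inputs, **kargs):
        preds = [m.predict(inputs, **kargs) for m in self.models]
        if getattr(self, 'endcap', None) is not None:
            preds = [self.endcap.predict(p) for p in preds]  # pass each model's output through the endcap
        return np.mean(preds, axis=0)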
Occasionally during training I get permission errors when trying to save model weights. This then kills the training and I normally have to restart the kernel. This has only ever happened on one computer (Elementary OS 0.4) and is very rare. I've not been able to reproduce the error on demand.

If anyone else ever encounters this error, please let me know!
The plot_feat method provides 1D distributions for features in the form of histograms and KDEs, and also computes the means and standard deviations of the distributions and their uncertainties.

Once the latest version of Seaborn is released, update plot_feat to use weighted KDEs. This will also require plot_kdes_from_bs to use lineplot, since tsplot was removed in Seaborn v0.10.

There was a big kerfuffle in 2019 about some new optimisers: Rectified Adam (Liu et al., 2019), Look Ahead (Zhang, Lucas, Hinton, & Ba, 2019), and a combination of both of them, Ranger (which also now includes Gradient Centralization (Yong, Huang, Hua, & Zhang, 2020)).
Having tried these (except the latest version of Ranger), I've not found much improvement compared to Adam, but this was only on one dataset. The performance of Ranger, though, looks to be quite good for other datasets, so perhaps it is useful.

User-defined optimisers can easily be used in LUMIN by passing the partial optimiser to the opt_args argument of ModelBuilder, e.g. opt_args = {'eps':1e-08, 'opt':partial(RAdam)}. It could be useful, however, to include the optimisers in LUMIN, to allow them to be easily used without the user having to include copied code.

These git repos include Apache 2.0 - licensed implementations of RAdam and Ranger, so inclusion should be straightforward.
The colour schemes for plots are dictated by PlotSettings objects. When these are not passed by the user, a default PlotSettings is used. This uses: tab10 for categorical colours, RdBu_r for divergent colours, and viridis for sequential colours.

These colour schemes exist in matplotlib and give LUMIN plots a kind of generic appearance. Additionally, tab10 is often not sufficient for some plots, leading to colours being repeated, or tab20 being used. I don't particularly like tab20, due to the mixture of hard and pastel colours.

New default sequential, divergent, and categorical colour maps are set or created. Perhaps a series of styles/moods could be defined, which would change all three colour schemes to preset maps, e.g. 'spring' would favour greener colour maps, 'autumn' more reds and browns.
In a few locations, string-valued arguments are used in LUMIN and specific values are expected. Python 3.8 introduced literal types (https://realpython.com/python38-new-features/), where the set of expected values can be stated in the method definition. This is potentially useful, but would require raising the minimum version of Python for LUMIN from 3.6 to 3.8, which could be disruptive to our user-base. (Although, since I doubt we actually have a consistent user-base, this change may not be disruptive and should be done sooner rather than later.)
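For example (the signature here is illustrative, not LUMIN's actual one):

from typing import Literal  # Python >= 3.8

def lookup_act(act: Literal['relu', 'swish', 'sigmoid']): ...  # type checkers flag unexpected strings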
Currently, training of a model uses the same settings, callbacks, and data for the entire training process. It could well be the case that the user wishes to change certain aspects at set points during the training. A simple example could be changing the LR cycle callbacks. A more complicated example could be changing the training data during training, e.g. from Delphes to Geant4 simulations. Another example could be starting with parts of the model frozen, and then unfreezing them at a set point (e.g. pre-training a part of the model to work better on low-level information before introducing high-level information).

This could potentially be allowed by defining training phases, each with its own set of settings, callbacks, and data.

This idea will no doubt require large changes to fold_train_ensemble, and some sort of 'trigger' callback to move to the next training phase.
It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.5.1. You're in good company, about 5% of other
projects updated in the last year are also missing files.
+ /tmp/venv/bin/pip3 wheel --no-binary lumin -w /tmp/ext lumin==0.5.1
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting lumin==0.5.1
Downloading http://10.10.0.139:9191/root/pypi/%2Bf/561/5d2232da9ea91/lumin-0.5.1.tar.gz (116 kB)
ERROR: Command errored out with exit status 1:
command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-2nqqp23z/lumin/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-2nqqp23z/lumin/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-2nqqp23z/lumin/pip-egg-info
cwd: /tmp/pip-wheel-2nqqp23z/lumin/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-wheel-2nqqp23z/lumin/setup.py", line 9, in <module>
with open('requirements.txt') as f: requirements = f.read().strip().split('\n')
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
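The traceback shows that setup.py reads requirements.txt, so the likely fix is simply to ship that file in the sdist, e.g. via the MANIFEST.in:

include requirements.txt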
The LRFinder callback runs once over every fold in the FoldYielder it is passed, and the LR step sizes are computed based on the batch size, the range of LRs specified, and the amount of training data.

For small datasets, the step sizes must be very large in order to cover a sufficient range of LRs, potentially leading to a jagged curve which may not be representative of the ideal LR.

A new argument is added to LRFinder and fold_lr_find to allow the LR range test to run over multiple uses of the same fold, in order to get finer step sizes for small datasets.
Development of new methods and classes is normally done whilst solving a specific problem, and only once the code works is it added to the code-base. Still, changes and deprecations may cause parts of the code to begin to fail, or perhaps the code does not work for all cases (e.g. edge cases exist and are not accounted for).

In order to help protect against this, the examples are designed to utilise as much of the code-base as possible in realistic scenarios. They then function as tests, and are run (at least) prior to the release of a new version so that any errors may be fixed.
Parameterised learning is useful in HEP, for example in cases where a classifier should learn multiple signal hypotheses (e.g. a heavy Higgs of several possible masses); see Baldi et al., 2016.

In this example, the signal would have a parameterised input equal to the true resonant mass, and the background would be randomly assigned resonant masses. Once trained, the entire dataset can be set to a particular resonant mass in order to perform inference for a given hypothesis. This last part is already possible with the ParametrisedPrediction class.

Currently, the random assignment of parameterised-feature values for background (in the example above) is performed once, when preparing the data for training. It could well be that it is useful to perform this random assignment during training, which may provide some of the benefits of train-time data augmentation.

To avoid conflicts with HEPAugFoldYielder, and due to the fact that this only wants to be performed during training, this secondary form of augmentation should probably be implemented as a callback. It also needs to account for the possibility that multiple parameterisation features may be used, and that only a subset of the data may need to be changed.
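A hypothetical sketch of such a callback (the column index, mass list, and background-target convention are all assumptions):

import torch
from lumin.nn.callbacks.abs_callback import AbsCallback

class ParamResample(AbsCallback):
    def __init__(self, param_idx, sig_values, model=None):
        super().__init__(model=model)
        self.param_idx = param_idx
        self.sig_values = torch.as_tensor(sig_values)  # e.g. the true resonant masses

    def on_epoch_begin(self, by=None, **kargs):
        bkg = (by.targets == 0).squeeze()  # only reassign the background subset
        draw = self.sig_values[torch.randint(len(self.sig_values), (int(bkg.sum()),))]
        by.inputs[bkg, self.param_idx] = draw.to(by.inputs.device).type(by.inputs.dtype)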
class BootstrapResample(Callback) runs bootstrap resampling on training data during training, with the idea that each model during ensemble training is more decorrelated from the others, similar to bagging in Random Forest training. This resampling can optionally be performed differently for every epoch, which may have an impact on over-training (either good or bad).

It would be interesting to test this out further to see whether bagging has any real impact on model correlation, ensemble performance, and single-model performance.
AdamW (Loshchilov & Hutter, 2017) should really be included for cases when the user wishes to use weight decay (PyTorch Adam says 'weight_decay', but implements L_2 regularisation). PyTorch >= 1.2 includes AdamW, and there is also an Apache 2.0 implementation here. AdamW should be added as an easily accessible optimiser in LUMIN (if not the default). Warnings should also be added to recommend the user use AdamW if they specify a weight decay with Adam. The user must actively make this change; LUMIN should not silently switch the optimiser to AdamW.

LUMIN moves to the latest version of PyTorch, which includes AdamW, but checks will need to be made to make sure there aren't any compatibility problems.
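In the meantime, AdamW can already be used via opt_args, matching the RAdam example above (assuming extra keys are forwarded to the optimiser):

from functools import partial
import torch

opt_args = {'opt': partial(torch.optim.AdamW), 'weight_decay': 1e-2}  # decoupled weight decay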
In HEP it is useful to compare 'collider data' to 'MC data' as a ratio plot. These are normally placed under more informative plots, e.g.:

It would be useful to have something similar for LUMIN, like a subplot that sits under the main plot, but I'm not sure how easy it would be to implement in a way that generalises to all plots. For now it will probably be sufficient to extend plot_binary_sample_feat to include a ratio plot of background to collider data, or signal to background (and to extend it to plot collider data as dots, with uncertainty bands for background and collider data). It's not a priority, though, since most analyses will probably still produce final plots in ROOT.
When viewing source code in the docs (e.g. https://lumin.readthedocs.io/en/stable/_modules/lumin/nn/ensemble/ensemble.html#Ensemble), the width of the viewer is less than the line length used for coding (160 characters). This causes the code to wrap onto the next line, presenting a confusing appearance.

There is a large space on the right-hand side of the page, and the code view could easily be extended to fill it. We just need to find the right value to edit.

This fix actually needs to be made in https://github.com/GilesStrong/pytorch_sphinx_theme; I've simply referenced it here for visibility.
https://chrissardegna.com/blog/posts/python-expontentiation-performance/ studies the performance of different methods of exponentiation, and finds that chained multiplication should be used for integer powers less than or equal to 5, and math.pow() should be used otherwise; i.e. never use **.

It doesn't study exponentiation of Numpy arrays. It will probably be useful to check whether np.power and ** are equivalent, and to compare math.pow to np.power. There is also an argument for readability against chained multiplication.

This is only minor, but it could be useful to go through the code-base and optimise the exponentiation that is used. Probably just search for **.
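A quick check of the blog's claim (results will vary with interpreter and version):

import timeit

print(timeit.timeit('x*x*x*x*x', setup='x = 1.3'))  # chained multiplication
print(timeit.timeit('x**5', setup='x = 1.3'))  # ** operator
print(timeit.timeit('pow(x, 5)', setup='from math import pow; x = 1.3'))  # math.pow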
The ModelBuilder class should have a __repr__ method that presents a summary of its settings. This could be as simple as instantiating a Model and returning its __repr__ value.
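i.e. something like (sketch):

class ModelBuilder:
    ...
    def __repr__(self) -> str:
        return repr(Model(self))  # Model.__repr__ already summarises the built network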
FoldYielder.get_df is a method to return data from the foldfile as a pandas.DataFrame. Optionally, input data can be returned and, further optionally, it can be deprocessed if an input_pipe was provided. Ideally the same should be done for matrix data. Whilst the potential for a separate input pipe for matrix data has been added, it is not always obvious how the matrix data was originally preprocessed, making deprocessing ambiguous.

Matrix data is not deprocessed when returned by FoldYielder.get_df, and no warnings are raised to alert the user to this.
When using the CycleLR callback, e.g. for cosine LR annealing, the patience must be set to the number of patience cycles + 1. E.g. to ensure that training finishes after one complete cycle without improvement, the patience must be set to 2. This is counter-intuitive for the user.

I think this is due to the possibility that, if the lowest loss is reached partway through the cycle rather than in the very last iteration, some counters (e.g. epochs since last improvement) are not being set/reset. This should be quite simple to fix. I'll look into it.
Kranmer -> Cranmer
In this example we'll reimplement the jets example from Learning to Pivot Louppe, Kagan, & Kranmer, 2016, as per the official code repo (https://github.com/glouppe/paper-learning-to-pivot).
The Mish activation function, x·tanh(ln(1+exp(x))), i.e. x·tanh(softplus(x)) (Misra, 2019), received a lot of attention in 2019, and seems to perform quite well. It should be added to LUMIN as a supported activation function.

I tried to do this already, but my implementation was really slow, and in the end I never committed it. There seems to be a good deal of information about implementations of it on its GitHub (which is MIT licensed). Considering issue #70, it is probable that JIT compilation should be used.
Addition of Mish (and other activation functions) would involve adding its definition to activations.py (or, if a licensed version is copied, to a new file with a header carrying the licence terms and stating that the LUMIN Apache 2.0 licence does not cover code contained in that file (see e.g. lsuv_init.py for an example)). Mish would then need to be added to the lookup_act method, so that it can be called via a string.
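A minimal sketch of what the addition might look like (torch.jit.script to address the speed problem; the Module wrapper would be what lookup_act returns):

import torch
from torch import nn
import torch.nn.functional as F

@torch.jit.script
def mish(x: torch.Tensor) -> torch.Tensor:
    return x*torch.tanh(F.softplus(x))  # x*tanh(ln(1+exp(x)))

class Mish(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor: return mish(x)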
When data augmentation is applied at test-time, the final prediction is based on the original data and the augmented data. This is okay with the current data augmentation in LUMIN, since it (should) result in physically valid events which are as likely as the original event.

The user, or future updates of LUMIN, may add data augmentation which only produces data that is similar to the actual data, but is either not strictly physical, or has a differing probability of occurring.

In these cases it might be advantageous to form the final prediction via Polyak averaging of the score on the original data and on the augmented data, e.g.:

score = (beta*score on original data) + ((1-beta)*mean score on augmented data)

Beta would need to be an optional argument when calling the .predict* methods of Ensemble and Model, and also Model.evaluate*. Beta could also be set as a property of e.g. HEPAugFoldYielder, and the relevant methods could then check whether a beta had been set for the data, to avoid having to explicitly pass it every time.
PyTorch has the ability to Just-In-Time compile stuff to make it run quicker and be more memory efficient. I tried to do this a while ago with the @weak_script and @weak_module decorators, however they didn't seem to do much, and I had trouble automatically generating the docs. I then found that PyTorch recommended that users not use these decorators. Since then, PyTorch has introduced the @torch.jit.script decorator, which is for user use and supposedly provides noticeable improvements in speed and memory usage.
Examples could be for compiling activation functions:
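e.g. a JIT-scripted Swish (a sketch):

import torch

@torch.jit.script
def swish(x: torch.Tensor) -> torch.Tensor:
    return x*torch.sigmoid(x)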
Whereas LUMIN's implementation of Swish is simply x*torch.sigmoid(x). Other possibilities could be in LUMIN's loss functions (e.g. WeightedMSE). I'm not sure how far one can take this; should all things related to PyTorch be JIT compiled, or perhaps only operations on tensors?

A starting point would be to test the JIT-compiled Swish against the current version, and then to try to find out more about what should be JITed, and what shouldn't.
HEPAugFoldYielder applies train-time and test-time augmentation to 3-momenta vectors, but only if they are in Cartesian coordinates (p_x, p_y, p_z). This should ideally be extended to (p_T, eta, phi).

Extend HEPAugFoldYielder to run data augmentation over vectors in (p_T, eta, phi), and to infer the coordinate system from the data, rather than requiring a user-supplied flag.
RFPImp is relied on quite a bit by LUMIN, however it imports from sklearn.ensemble.forest, which has been deprecated since 0.22 and will be removed in 0.24. Currently this raises FutureWarning. Hopefully in the next release this will be fixed and we can upgrade LUMIN to use the new version of RFPImp.
LUMIN's basic 'tick' for most things, except mini-batch updates, is one complete use of a fold of data (referred to as a subepoch). Typically the training data consists of multiple folds, and an epoch refers to the full use of all training folds (multiple subepochs). Callbacks like OneCycle use subepochs when defining their cycle lengths, however setting the upper limit for training in fold_train_ensemble is done in terms of epochs. This inconsistency may be confusing to new users, or to those with experience of other frameworks.

The idea is that all user-facing arguments relating to subepochs or epochs should consistently use one of the two, and not a mixture of both.
The current process for loading data during training is: a FoldYielder loads a fold of data from the foldfile and passes it to a BatchYielder. Either the entire fold is then loaded to device at once, or mini-batches are loaded to device one at a time. The current process for loading data during prediction is similar: a FoldYielder loads a fold of data, which is then passed through the model.

One improvement would be to extend BatchYielder to load minibatches to device in the background, reducing the memory overhead whilst not leading to delays. Another would be to replace BatchYielder with, or have it inherit from, a PyTorch DataLoader, which includes multi-threaded workers (although I find that they're slower than single-core...).

When writing doc-strings, I began using imperative strings, e.g. "Plot KDEs computed via :meth:~lumin.utils.statistics.bootstrap_stats". Following a recommendation, I then switched to writing descriptive doc-strings, e.g. "Plots KDEs computed via :meth:~lumin.utils.statistics.bootstrap_stats".
Ideally the style should be consistent for all doc-strings, so the old ones need to be updated.
So far I've been contacted by two researchers about LUMIN, both of whom were wanting to use it for adversarial training in order to get invariance to some feature of the data. Unfortunately, LUMIN does not offer this out of the box, but seeing as 100% of the people that were interested in using LUMIN are wanting to do adversarial training, it's perhaps something that we should aim to offer in the future.
I made some attempt at this, but never tested it and don't have suitable data to hand to really develop it myself. I have a feeling, though, that a decent implementation will probably be quite task-specific, so input from fellow researchers more experienced with this is really necessary to move forward.
My initial attempt is pasted below, along with a short response to an email:
"however it is potentially implementable by modifying the training loop and inheriting from the Model class.
In the attached files there's an example of this, in which the new training loop (adversarial_fold_train_ensemble) takes two ModelBuilders, one to provide the primary models and the other to provide the adversarial models. Two sets of callbacks get created, one for each model. A new Model class (AdversarialModel) inherits from Model and has a new method (adversarial_fit) which takes the adversarial model as an argument. This can then be modified to provide interaction between the two models as necessary. Does this look like the kind of thing which could help? I'm afraid I haven't had a chance to test it, but I've marked everything new with "# New".
In its current form, adversarial_fold_train_ensemble will only save the primary model, and this is determined by the validation loss returned by primary_model.evaluate, so you may want to modify this to avoid saving models which don't have a flat response."
from typing import Dict, List, Tuple, Any, Optional
from pathlib import Path
from fastprogress import master_bar, progress_bar
import pickle
import timeit
import numpy as np
import os
import sys
from random import shuffle
from collections import OrderedDict
import math
from functools import partial
import warnings
from torch import Tensor  # torch.Tensor constructor; 'import torch.tensor as Tensor' is deprecated
from lumin.nn.data.fold_yielder import FoldYielder
from lumin.nn.data.batch_yielder import BatchYielder
from lumin.nn.models.model_builder import ModelBuilder
from lumin.nn.models.model import Model
from lumin.nn.callbacks.cyclic_callbacks import AbsCyclicCallback
from lumin.nn.callbacks.model_callbacks import AbsModelCallback
from lumin.nn.callbacks.abs_callback import AbsCallback
from lumin.utils.misc import to_tensor, to_device
from lumin.utils.statistics import uncert_round
from lumin.nn.metrics.eval_metric import EvalMetric
from lumin.plotting.training import plot_train_history
from lumin.plotting.plot_settings import PlotSettings
from lumin.nn.training.metric_logger import MetricLogger
from lumin.nn.models.abs_model import AbsModel
import matplotlib.pyplot as plt
__all__ = ['adversarial_fold_train_ensemble', 'AdversarialModel']
# New___
class AdversarialModel(Model):
def adversarial_fit(self, batch_yielder:BatchYielder, adversary_model:AbsModel, primary_callbacks:Optional[List[AbsCallback]]=None,
adversary_callbacks:Optional[List[AbsCallback]]=None, mask_inputs:bool=True) -> float:
r'''
Fit network for one complete iteration of a :class:`~lumin.nn.data.batch_yielder.BatchYielder`, i.e. one (sub-)epoch
Arguments:
batch_yielder: :class:`~lumin.nn.data.batch_yielder.BatchYielder` providing training data in form of tuple of inputs, targets,
and weights as tensors on device
adversary_model: :class:`~lumin.nn.models.model.Model` to act as the adversarial model during training
primary_callbacks: list of :class:`~lumin.nn.callbacks.abs_callback.AbsCallback` to be used during training of the primary model
adversary_callbacks: list of :class:`~lumin.nn.callbacks.abs_callback.AbsCallback` to be used during training of the adversary model
mask_inputs: whether to apply input mask if one has been set
Returns:
Loss on training data averaged across all minibatches
'''
self.model.train()
adversary_model.model.train()
self.stop_train = False
if primary_callbacks is None: primary_callbacks = []
if adversary_callbacks is None: adversary_callbacks = []
for c in primary_callbacks: c.on_epoch_begin(by=batch_yielder)
for c in adversary_callbacks: c.on_epoch_begin(by=batch_yielder)
if self.input_mask is not None and mask_inputs: batch_yielder.inputs = batch_yielder.inputs[:,self.input_mask]
# Replace this as necessary
# _________________________
# losses = []
# for x, y, w in batch_yielder:
# for c in callbacks: c.on_batch_begin()
# y_pred = self.model(x)
# loss = self.loss(weight=w)(y_pred, y) if w is not None else self.loss()(y_pred, y)
# losses.append(loss.data.item())
# self.opt.zero_grad()
# for c in callbacks: c.on_backwards_begin(loss=loss)
# loss.backward()
# for c in callbacks: c.on_backwards_end(loss=loss)
# self.opt.step()
# for c in callbacks: c.on_batch_end(loss=losses[-1])
# if self.stop_train: break
# for c in callbacks: c.on_epoch_end(losses=losses)
# return np.mean(losses)
# _________________________
# ______
def _get_folds(val_idx, n_folds, shuffle_folds:bool=True):
r'''
Return (shuffled) list of fold indices which does not include the validation index
'''
folds = [x for x in range(n_folds) if x != val_idx]
if shuffle_folds: shuffle(folds)
return folds
def adversarial_fold_train_ensemble(fy:FoldYielder, n_models:int, bs:int, primary_model_builder:ModelBuilder, adversarial_model_builder:ModelBuilder,
callback_partials:Optional[List[partial]]=None, eval_metrics:Optional[Dict[str,EvalMetric]]=None,
train_on_weights:bool=True, eval_on_weights:bool=True, patience:int=10, max_epochs:int=200,
shuffle_fold:bool=True, shuffle_folds:bool=True, bulk_move:bool=True,
live_fdbk:bool=True, live_fdbk_first_only:bool=True, live_fdbk_extra:bool=True, live_fdbk_extra_first_only:bool=False,
savepath:Path=Path('train_weights'), verbose:bool=False, log_output:bool=False,
plot_settings:PlotSettings=PlotSettings(), plots:Optional[Any]=None) \
-> Tuple[List[Dict[str,float]],List[Dict[str,List[float]]],List[Dict[str,float]]]:
r'''
Adversarial training method for :class:`~lumin.nn.models.model.Model`.
Trains a specified number of models created by a :class:`~lumin.nn.models.model_builder.ModelBuilder` on data provided by a
:class:`~lumin.nn.data.fold_yielder.FoldYielder`, and save them to savepath.
Note, this does not return trained models; instead they are saved and must be loaded later. This method returns the results of model training.
Each :class:`~lumin.nn.models.model.Model` is trained on N-1 folds, for a :class:`~lumin.nn.data.fold_yielder.FoldYielder` with N folds, and the remaining
fold is used as validation data.
Training folds are loaded iteratively, and model evaluation takes place after each fold use (a sub-epoch), rather than after every use of all folds (epoch).
Training continues until:
- All of the training folds are used max_epoch number of times;
- Or validation loss does not decrease for patience number of training folds;
(or cycles, if using an :class:`~lumin.nn.callbacks.cyclic_callbacks.AbsCyclicCallback`);
- Or a callback triggers training to stop, e.g. :class:`~lumin.nn.callbacks.cyclic_callbacks.OneCycle`
Depending on the live_fdbk arguments, live plots of losses and other metrics may be shown during training, if running in Jupyter. By default, a live plot
with extra information will be shown for training the first model, and afterwards no live plots will be shown. Showing the live plot slightly slows down the
training, but can help highlight problems without having to wait until the end. Therefore this compromises between showing useful information and training
speed, since any problems should hopefully be visible in the first model.
Once training is finished, the state with the lowest validation loss is loaded, evaluated, and saved.
Arguments:
fy: :class:`~lumin.nn.data.fold_yielder.FoldYielder` interfacing to training data
n_models: number of models to train
bs: batch size. Number of data points per iteration
primary_model_builder: :class:`~lumin.nn.models.model_builder.ModelBuilder` creating the primary networks to train
adversarial_model_builder: :class:`~lumin.nn.models.model_builder.ModelBuilder` creating the adversary networks
callback_partials: optional list of functools.partial, each of which will instantiate a :class:`~lumin.nn.callbacks.callback.Callback` when called
eval_metrics: list of instantiated :class:`~lumin.nn.metric.eval_metric.EvalMetric`.
At the end of training, validation data and model predictions will be passed to each, and the results printed and saved
train_on_weights: If weights are present in training data, whether to pass them to the loss function during training
eval_on_weights: If weights are present in validation data, whether to pass them to the loss function during validation
patience: number of folds (sub-epochs) or cycles to train without decrease in validation loss before ending training (early stopping)
max_epochs: maximum number of epochs for which to train
live_fdbk: whether or not to show any live feedback at all during training (slightly slows down training, but helps spot problems)
live_fdbk_first_only: whether to only show live feedback for the first model trained (trade off between time and problem spotting)
live_fdbk_extra: whether to show extra information live feedback (further slows training)
live_fdbk_extra_first_only: whether to only show extra live feedback information for the first model trained (trade off between time and information)
shuffle_fold: whether to tell :class:`~lumin.nn.data.batch_yielder.BatchYielder` to shuffle data
shuffle_folds: whether to shuffle the order of the training folds
bulk_move: whether to pass all training data to device at once, or by minibatch. Bulk moving will be quicker, but may not fit in memory.
savepath: path to which to save model weights and results
verbose: whether to print out extra information during training
log_output: whether to save printed results to a log file rather than printing them
plot_settings: :class:`~lumin.plotting.plot_settings.PlotSettings` class to control figure appearance
plots: Deprecated: loss history will always be shown,
lr history will no longer be shown separately,
and live feedback is now controlled by `live_fdbk` argument
Returns:
- results list of validation losses and other eval_metrics results, ordered by model training.
Can be used to create an :class:`~lumin.nn.ensemble.ensemble.Ensemble`.
- histories list of loss histories, ordered by model training
- cycle_losses if an :class:`~lumin.nn.callbacks.cyclic_callbacks.AbsCyclicCallback` was passed, list of validation losses at the end of each cycle,
ordered by model training. Can be passed to :class:`~lumin.nn.ensemble.ensemble.Ensemble`.
'''
os.makedirs(savepath, exist_ok=True)
os.system(f"rm {savepath}/*.h5 {savepath}/*.json {savepath}/*.pkl {savepath}/*.png {savepath}/*.log")
if callback_partials is None: callback_partials = []
if log_output:
old_stdout = sys.stdout
log_file = open(savepath/'training_log.log', 'w')
sys.stdout = log_file
if plots is not None:
warnings.warn("The plots argument is now depreciated and ignored. Loss history will always be shown, lr history will no longer be shown separately, \
and live feedback is now controlled by the four live_fdbk arguments. This argument will be removed in V0.6.")
train_tmr = timeit.default_timer()
results,histories,cycle_losses = [],[],[]
nb = len(fy.foldfile['fold_0/targets'])//bs
if live_fdbk:
metric_log = MetricLogger(loss_names=['Train', 'Validation'], n_folds=fy.n_folds, extra_detail=live_fdbk_extra or live_fdbk_extra_first_only,
plot_settings=plot_settings)
model_bar = master_bar(range(n_models))
for model_num in (model_bar):
model_bar.show()
val_id = model_num % fy.n_folds
print(f"Training model {model_num+1} / {n_models}, Val ID = {val_id}")
if model_num == 1:
if live_fdbk_first_only: live_fdbk = False # Only show fdbk for first training
elif live_fdbk_extra_first_only: metric_log.extra_detail = False
if live_fdbk: metric_log.reset()
model_tmr = timeit.default_timer()
os.system(f"rm {savepath}/best.h5")
best_loss,epoch_counter,subEpoch,stop = math.inf,0,0,False
loss_history = OrderedDict({'trn_loss': [], 'val_loss': []})
cycle_losses.append({})
trn_ids = _get_folds(val_id, fy.n_folds, shuffle_folds)
primary_model,adversary_model = AdversarialModel(primary_model_builder),AdversarialModel(adversarial_model_builder) # New
val_fold = fy.get_fold(val_id)
if not eval_on_weights: val_fold['weights'] = None
primary_cyclic_callback,primary_callbacks,primary_loss_callbacks = None,[],[]
adversary_cyclic_callback,adversary_callbacks = None,[] # New
for c in callback_partials: primary_callbacks.append(c(model=primary_model))
for c in primary_callbacks:
if isinstance(c, AbsCyclicCallback):
c.set_nb(nb)
primary_cyclic_callback = c
for c in primary_callbacks:
if isinstance(c, AbsModelCallback):
c.set_val_fold(val_fold)
c.set_cyclic_callback(primary_cyclic_callback)
if getattr(c, "get_loss", None):
primary_loss_callbacks.append(c)
if live_fdbk: metric_log.add_loss_name(type(c).__name__)
loss_history[f'{type(c).__name__}_val_loss'] = []
for c in primary_callbacks: c.on_train_begin(model_num=model_num, savepath=savepath)
# New___
for c in callback_partials: adversary_callbacks.append(c(model=adversary_model))
for c in adversary_callbacks:
if isinstance(c, AbsCyclicCallback):
c.set_nb(nb)
adversary_cyclic_callback = c
for c in adversary_callbacks:
if isinstance(c, AbsModelCallback):
c.set_val_fold(val_fold)
c.set_cyclic_callback(adversary_cyclic_callback)
for c in adversary_callbacks: c.on_train_begin(model_num=model_num, savepath=savepath)
# ______
# Validation data
if bulk_move:
if fy.has_matrix and fy.yield_matrix: val_x = (to_device(Tensor(val_fold['inputs'][0]).float()), to_device(Tensor(val_fold['inputs'][1]).float()))
else: val_x = to_device(Tensor(val_fold['inputs']).float())
val_y = to_device(Tensor(val_fold['targets'])) if bulk_move else Tensor(val_fold['targets'])
if train_on_weights: val_w = to_device(to_tensor(val_fold['weights'])) if bulk_move else to_tensor(val_fold['weights'])
else: val_w = None
if 'multiclass' in primary_model_builder.objective: val_y = val_y.long().squeeze()
else: val_y = val_y.float()
epoch_pb = progress_bar(range(max_epochs), leave=True)
if live_fdbk: model_bar.show()
for epoch in epoch_pb:
for trn_id in trn_ids:
subEpoch += 1
batch_yielder = BatchYielder(**fy.get_fold(trn_id), objective=primary_model_builder.objective,
bs=bs, use_weights=train_on_weights, shuffle=shuffle_fold, bulk_move=bulk_move)
loss_history['trn_loss'].append(primary_model.adversarial_fit(batch_yielder, primary_callbacks=primary_callbacks,
adversary_model=adversary_model, adversary_callbacks=adversary_callbacks)) # New
del batch_yielder
if bulk_move:
val_loss = primary_model.evaluate(val_x, val_y, weights=val_w, callbacks=primary_callbacks)
else:
batch_yielder = BatchYielder(**val_fold, objective=primary_model_builder.objective,
bs=bs, use_weights=train_on_weights, shuffle=shuffle_fold, bulk_move=bulk_move)
val_loss = primary_model.evaluate_from_by(batch_yielder, callbacks=primary_callbacks)
del batch_yielder
loss_history['val_loss'].append(val_loss)
loss_callback_idx = None
loss = val_loss
for i, lc in enumerate(primary_loss_callbacks):
l = lc.get_loss()
if l < loss: loss, loss_callback_idx = l, i
if verbose: print(f'{subEpoch} {type(lc).__name__} loss {l}, default loss {val_loss}')
l = loss if l is None or not lc.active else l
loss_history[f'{type(lc).__name__}_val_loss'].append(l)
if primary_cyclic_callback is not None and primary_cyclic_callback.cycle_end:
if verbose: print(f"Saving snapshot {primary_cyclic_callback.cycle_count}")
cycle_losses[-1][primary_cyclic_callback.cycle_count] = val_loss
primary_model.save(str(savepath/f"{model_num}_cycle_{primary_cyclic_callback.cycle_count}.h5"))
if loss <= best_loss:
best_loss = loss
epoch_pb.comment = f'Epoch {subEpoch}, best loss: {best_loss:.4E}'
if verbose: print(epoch_pb.comment)
epoch_counter = 0
if loss_callback_idx is not None: primary_loss_callbacks[loss_callback_idx].test_model.save(savepath/"best.h5")
else: primary_model.save(savepath/"best.h5")
elif primary_cyclic_callback is not None:
if primary_cyclic_callback.cycle_end: epoch_counter += 1
else:
epoch_counter += 1
if live_fdbk: metric_log.update_vals([loss_history[l][-1] for l in loss_history])
if epoch_counter >= patience or primary_model.stop_train: # Early stopping
print('Early stopping after {} epochs'.format(subEpoch))
stop = True; break
if live_fdbk: metric_log.update_plot(best_loss)
if stop: break
primary_model.load(savepath/"best.h5")
primary_model.save(savepath/f'train_{model_num}.h5')
for c in primary_callbacks: c.on_train_end(fy=fy, val_id=val_id, bs=bs if not bulk_move else None)
histories.append({})
histories[-1] = loss_history
results.append({})
results[-1]['loss'] = best_loss
if eval_metrics is not None and len(eval_metrics) > 0:
y_pred = primary_model.predict(val_fold['inputs'], bs=bs if not bulk_move else None)
for m in eval_metrics: results[-1][m] = eval_metrics[m].evaluate(fy, val_id, y_pred)
print(f"Scores are: {results[-1]}")
with open(savepath/'results_file.pkl', 'wb') as fout: pickle.dump(results, fout)
with open(savepath/'cycle_file.pkl', 'wb') as fout: pickle.dump(cycle_losses, fout)
plt.clf()
print(f"Fold took {timeit.default_timer()-model_tmr:.3f}s\n")
print("\n______________________________________")
print("Training finished")
print(f"Cross-validation took {timeit.default_timer()-train_tmr:.3f}s ")
plot_train_history(histories, savepath/'loss_history', settings=plot_settings)
for score in results[0]:
mean = uncert_round(np.mean([x[score] for x in results]), np.std([x[score] for x in results])/np.sqrt(len(results)))
print(f"Mean {score} = {mean[0]}ยฑ{mean[1]}")
print("______________________________________\n")
if log_output:
sys.stdout = old_stdout
log_file.close()
return results, histories, cycle_losses