
menelaus's Issues

Make import statements more efficient

When a dotted call like np.sqrt is used, the lookup resolves np first, then sqrt, on every call. This doesn't make a big difference except within a for loop, apparently -- and most of the detectors are going to be run within for loops. In that case, it's better to write, e.g., from numpy import sqrt instead of import numpy as np.

We might get slightly better performance, then, if we go through the update methods, identify calls that use dotted lookups, and replace them with the pattern above.

https://stackoverflow.com/questions/32151193/is-there-a-performance-cost-putting-python-imports-inside-functions
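As a minimal illustration of the pattern (timings will vary by environment; sqrt here is just a stand-in for whichever NumPy functions the update methods actually call):

```python
import timeit

# Dotted lookup: the module attribute is resolved on every call.
dotted = timeit.timeit(
    "np.sqrt(2.0)", setup="import numpy as np", number=1_000_000
)

# Direct import: the function is bound to a name, so no attribute lookup per call.
direct = timeit.timeit(
    "sqrt(2.0)", setup="from numpy import sqrt", number=1_000_000
)

print(f"np.sqrt: {dotted:.3f}s  vs  sqrt: {direct:.3f}s")
```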

MD3 API modifications

MD3's API differs from the other detectors, because it is intended for a semi-supervised context. It may be desirable to address some or all of these:

  • Currently, MD3 has a set_reference to pass in a batch of data as the reference batch, which is inconsistent with the other streaming detectors. We could (1) leave this as is, (2) leave this as is and add set_reference to the other streaming detectors to make them consistent, (3) make MD3's API compatible with both an initial batch or stream-based data for setting the reference, or (4) change MD3 to be compatible only with stream-based data for setting the reference.
  • The waiting_for_oracle state and give_oracle_label method are unusual. One could imagine gathering labeled samples as we go, rather than using an oracle function, and maintaining them as a buffer with some scheme to "forget" sufficiently old labeled samples. This is not how the paper describes it, though.
  • The number of requested oracle samples and the number of retraining samples are potentially decoupled in our implementation. It may be worth noting this in the docstring.

Add validation to init params where applicable.

For example, I don't believe most of the window-based detectors check that the window size is greater than 0.
For detectors whose parameters have theoretical restrictions (e.g., probabilities lie in [0, 1]), we ought to raise a ValidationError (or similar) too.
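A minimal sketch of the kind of __init__ validation proposed here; the class and parameter names (window_size, delta) are illustrative, not actual menelaus signatures, and ValueError is used purely for illustration:

```python
class SomeWindowDetector:
    """Illustrative detector skeleton showing init-parameter validation."""

    def __init__(self, window_size, delta):
        if window_size <= 0:
            raise ValueError(f"window_size must be greater than 0, got {window_size}")
        if not 0 <= delta <= 1:
            raise ValueError(f"delta must lie in [0, 1], got {delta}")
        self.window_size = window_size
        self.delta = delta


SomeWindowDetector(window_size=50, delta=0.05)   # fine
SomeWindowDetector(window_size=0, delta=0.05)    # raises ValueError
```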

Add citation and description to rainfall data

Can you add a brief description and citation to the docstring for the rainfall data in L129 of make_example_data.py? Should be okay to add a similar citation to refs.bib as minku2010.

Improve handling of categorical columns in KDQTreePartitioner

Overview: Currently, KDQTreePartitioner behavior on datasets with columns containing categorical/n-hot encoded/ordinal data will be volatile. Fixing this will generalize KDQTreePartitioner to mixed-type datasets.

Details: For example, if one column in a dataset is a 0/1 variable, then the first time it is split by build/fill, all the 0-rows will be sent one way. The leaf nodes could hence contain many more data points than the upper bound count_threshold suggests.

  1. The uniqueness criterion (if the number of unique values in a column is too small, stop splitting) is needed to prevent endless recursion. With the min_cutpoint_size proportion added, maybe we can remove it safely, since the uniqueness criterion is what prematurely sends too many points to a leaf node.
  2. We can preprocess data passed to build/fill; e.g., using information provided by the user (or inferred ourselves), we can treat problematic columns specially (skip them if they have too few unique values, etc.); see the sketch at the end of this issue.
  3. We may introduce a split for each value in the category, and force the tree to split as such on the problematic columns.

Note that, once kdq-tree is set up to use dataframes, we can "expect" the categorical dtype to treat these columns appropriately. Update the example(s) accordingly!
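A rough sketch of the preprocessing idea in option 2 above: flag columns that look categorical so the partitioner can treat them specially. The helper name, the cardinality threshold, and the idea of inferring this automatically are assumptions, not existing menelaus behavior.

```python
import numpy as np
import pandas as pd


def flag_categorical_columns(df: pd.DataFrame, max_unique: int = 10) -> list:
    """Flag columns that are explicitly categorical or have low cardinality."""
    flagged = []
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            flagged.append(col)                  # explicit categorical dtype
        elif df[col].nunique() <= max_unique:
            flagged.append(col)                  # low-cardinality numeric, e.g. 0/1
    return flagged


df = pd.DataFrame(
    {"x": np.random.normal(size=100), "flag": np.random.randint(0, 2, 100)}
)
print(flag_categorical_columns(df))  # ['flag']
```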

Address indiscriminate drift alarm in NN-DVI

Task

When applied to 'realistic' and sizable data, the current NNDVI implementation alarms for drift constantly, irrespective of the user-selected k (for k-NN) and the number of sampling times used for drift-threshold estimation. Until this is fixed, NNDVI is bugged and unusable.

Note: the referenced dataset is a private dataset of ~180K data points, split into ~9 batches. An MRE of such data is needed for testing and development on this issue.

Impact

This will debug and un-block an otherwise completed drift detector and partitioner, and hence expand the zoo of detectors provided by menelaus.

Details

At minimum:

  • reproduce a dataset which causes the above behavior
  • debug NNSpacePartitioner.build(), NNSpacePartitioner.compute_nnps_distance() for any partitioner-side problems
  • debug NNDVI.update() and NNDVI.compute_drift_threshold() for any detector-side problems
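A sketch of synthetic data for the MRE mentioned above, assuming roughly the same scale (~180K points across 9 batches) with a mean shift injected partway through; the dimensionality and shift are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_batches, batch_size, n_features = 9, 20_000, 10

batches = []
for i in range(n_batches):
    loc = 0.0 if i < 5 else 2.0  # drift injected from the sixth batch onward
    batches.append(rng.normal(loc=loc, scale=1.0, size=(batch_size, n_features)))

# With no drift in the first five batches, NNDVI should not alarm on them;
# constant alarming on this data would reproduce the reported behavior.
```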

Add ensemble drift detector

A really simple motivating example would be n instances of ADWIN, each monitoring one of {accuracy, precision, recall, TPR, FPR, ...}, or whatever combination -- similar to LFR without the Monte Carlo runs. This implementation of ADWIN currently monitors only accuracy, though it could be modified to monitor other quantities, given some finagling of the update signature.
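A minimal sketch of the ensemble idea, assuming a detector interface with an update method and a drift_state attribute (an assumption for illustration, not a settled API):

```python
class SimpleEnsemble:
    """Wrap several detectors, each monitoring one named metric."""

    def __init__(self, detectors: dict):
        # e.g. {"accuracy": ADWIN(), "recall": ADWIN()}
        self.detectors = detectors

    def update(self, metrics: dict):
        """Feed each named metric value to its corresponding detector."""
        for name, value in metrics.items():
            self.detectors[name].update(value)

    @property
    def drift_state(self):
        """Return the names of any members currently alarming, or None."""
        alarming = [n for n, d in self.detectors.items() if d.drift_state == "drift"]
        return alarming or None
```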

PCA-CD: modify sample_period based on density estimate

Original TODO

modify sample period dependent upon density estimate
    line 97 (initialization of self.sample_period) in pca_cd.py

Other Description

sample_period (float, optional): how often to check for drift. This is 100 samples or sample_period * window_size, whichever is smaller. Default .05, or 5% of the window size.

Change initialization of value for sample_period in PCA_CD based on density estimate. Currently defaults to 5% of the window size.

Make streaming and batch abstract classes

kdq-tree has a lot of weirdness due to handling both streaming and batch, e.g. in the validation: right now, the dimensions of new numpy arrays aren't checked. We could probably add checks comparing the length of self.input_cols against the shape of a numpy array. This still doesn't seem ideal, and the current behavior is definitely breakable.

Rather than continuing to fight the complications introduced by the detector taking both streaming and batch input, we should probably have parallel classes that depend on the same partitioner or similar. This seems like a better structure going forward, as we add more detectors that can do both. The further we get into implementing validation and the like, the more troublesome these problems become.

Follow multi-inheritance pattern for all applicable detectors

Task

After #46, all remaining detectors outside of KdqTree will need to be updated to use the StreamingDetector and BatchDetector ABCs if they are meant to serve both options. Such detectors will (likely) also need to implement their own parent algorithm class, e.g. KdqTreeDetector, as part of the multi-inheritance pattern.
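A rough sketch of the multi-inheritance pattern, using stand-in ABCs since the actual import paths and signatures of StreamingDetector/BatchDetector may differ:

```python
from abc import ABC, abstractmethod


# Stand-ins for the menelaus ABCs named above; paths/signatures are assumptions.
class StreamingDetector(ABC):
    @abstractmethod
    def update(self, X, *args, **kwargs): ...


class BatchDetector(ABC):
    @abstractmethod
    def update(self, X, *args, **kwargs): ...


class SomeAlgorithm:
    """Shared algorithm logic, analogous to KdqTreeDetector."""

    def _process(self, data): ...


class SomeAlgorithmStreaming(SomeAlgorithm, StreamingDetector):
    def update(self, X, *args, **kwargs):
        self._process(X)   # one observation at a time


class SomeAlgorithmBatch(SomeAlgorithm, BatchDetector):
    def update(self, X, *args, **kwargs):
        self._process(X)   # a whole batch at once
```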

Impact

This will significantly progress the broad refactor of the drift detectors' object design. See also #15

Revisit generic setup for detectors

*tms: Updating this to reflect the current code. There are a couple of bits that could be made more generic. This ought to be split up into several issues, when we choose to tackle it.

  1. If update could always take the same input, then it'd be easier to swap out detector objects, instead of having to check "is this an ADWIN object?" or similar. Immediate use cases: comparing the performance of two detectors; ensembling detectors. (A sketch of one possible unified signature follows this list.)
    • e.g. ADWIN.update in the typical use case takes "whether the most recent classification was correct." kdq_tree.update takes, instead, an array of the new sample(s) containing each feature.
    • For further work with semi-supervised detectors, where there may or may not be a label with a new sample, we'll eventually have an algorithm that requires more generic input to be acceptable.
    • The batch algorithms present an additional wrinkle: their default behaviors differ. E.g., HDDDM currently puts all of the new test data into the reference window (by default) when drift is not detected. Assuming the drift decision is correct, this means an increasingly accurate empirical distribution for the reference data. kdq_tree, as currently implemented, maintains an unchanging reference window, because repartitioning an ever-larger sample at each step would be expensive. Doing so is now possible, though, by using set_reference and maintaining the data outside the object.
  2. Validation of input. We probably want decisions on the prior point before troubling with this.
    • For streaming detectors, update should check whether more than one sample (/row) has been passed. KdqTree.set_reference currently does something like this to stop the user from inappropriately calling it when KdqTree is in streaming mode instead of batch.
    • We also should add checks that confirm the dimension (and/or column names) of the passed input matches prior updates.
    • We could likely set up DriftDetector.validate to be called by DriftDetector.update, so that it only needs to be implemented once. This assumes that each detector calls super().update at the very beginning of its implementation, which should be correct in most cases.
  3. Having a set_reference method for streaming detectors means we could potentially have faster code for processing many samples at once, especially for those which have an explicit wait period before doing the real calculations. This seems like a much lower value use case.
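A sketch of one possible unified signature, referenced in item 1 above; the argument names (X, y_true, y_pred) are an assumption for illustration, not the library's settled API:

```python
class GenericDetector:
    def update(self, X=None, y_true=None, y_pred=None):
        """Accept features and/or labels; each detector uses what it needs.

        - An accuracy-based streaming detector (e.g. ADWIN) would compare
          y_true and y_pred and ignore X.
        - A data-drift detector (e.g. kdq-tree) would use X and ignore labels.
        - A semi-supervised detector could accept X with y_true sometimes None.
        """
        raise NotImplementedError
```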

remove/make optional tracking attributes of LFR, CUSUM, PH

PH, LFR, and CUSUM also store some version of their test statistics indefinitely, assuming no drift is alarmed. It might behoove us to add an option that truncates these at some length even when there is no drift. Our example notebooks show how to store the test statistics in a separate dataframe.
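A minimal sketch of the proposed truncation option, assuming a hypothetical max_history parameter; collections.deque with maxlen discards the oldest entries automatically:

```python
from collections import deque


class TrackedStatistics:
    """Illustrative internal tracking structure with optional truncation."""

    def __init__(self, max_history=None):
        # maxlen=None preserves the current unbounded behavior;
        # a finite maxlen silently drops the oldest entries.
        self._test_stats = deque(maxlen=max_history)

    def record(self, value):
        self._test_stats.append(value)
```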

Split off and configure the testing pipeline

  • The current workflow runs coverage tests. Split it into a separate .yml.
  • Coverage % badges seem to be nonstandard for GitHub. At a minimum, figure out a way to pass/fail coverage based on whether 99% or higher is achieved.

Docstring fixes

  • Double-check the formatting on the numbered list in make_example_data.make_example_batch_data
  • docs/source/examples/convert_notebooks.py docstring should include a note that it must be run from within a subdirectory of the cloned repo.
  • Update the release number on conf.py to the appropriate version, or remove the release number from conf.py entirely, if RTD/sphinx allows it.
  • Remove this line from the README: "A flowchart breaking down these contexts can be found on the ReadTheDocs page under “Choosing a Detector.”"
    • The page was removed, but the README text was not updated.
  • Add this to the CHANGELOG.md
## v0.1.2 - July 11, 2022

- Updated the documentation
- Added example jupyter notebooks to ReadTheDocs
- Switched to sphinx-bibtex for citations
- Formatting and language tweaks.
- Added StreamingDetector and BatchDetector abstract base classes.
- Re-factored kdq-tree to use new abstract base classes: the separate classes KdqTreeStreaming and KdqTreeBatch now exist.
- kdq-tree can now consume dataframes.
- Added new git workflows and improved old ones.

Set up backtesting of python versions to establish the earliest compatible version.

  • At a minimum, run unit tests on older python versions/dependency versions, to establish a floor for compatible python versions.
    • We might include running the examples/ scripts and doctests (once they exist) in this pipeline as well.
  • Update the README's version note accordingly.

Probably most easily accomplished by setting up tox or a similar utility.

Suggest default settings for PH threshold

Issue: the PH threshold (aka lambda) starts at 0.1 * window_size, but our three divergence metrics lead to change scores that vary widely in magnitude. E.g., for intersection area, the threshold was always 4 but the change scores were around 0.25-0.50, so it never alarmed, even though the trajectory of the change scores over time looked accurate. LLH and KL divergence were on the order of 500 and 50, respectively.

From the PH paper Qahtan cites here (https://repository.kaust.edu.sa/bitstream/handle/10754/556655/TKDE.pdf;jsessionid=289DDA44451AE77BA76539AB0B619647?sequence=1):

"The parameter λ is usually set after observing <the test stat> for some time and depends on the desired alarm rate."
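One possible approach consistent with the quoted advice is to estimate lambda from a burn-in period of observed change scores rather than from window_size alone; the quantile rule below is an illustration, not the paper's prescription:

```python
import numpy as np


def suggest_ph_threshold(burn_in_scores, quantile=0.99):
    """Set lambda just above the bulk of drift-free change scores."""
    return float(np.quantile(burn_in_scores, quantile))


# E.g., for intersection-area change scores in the 0.25-0.50 range, this yields
# a threshold near 0.5 instead of the too-large default of 0.1 * window_size.
scores = np.random.uniform(0.25, 0.50, size=500)
print(suggest_ph_threshold(scores))
```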

Replace test_example_notebooks with a call to convert_notebooks?

#65 (review)

As noted in the PR that added the latter script, examples.test_example_notebooks and docs.source.examples.convert_notebooks do very similar things. The jupyter-based script is pretty complicated, especially with some venv weirdness on the GitHub runner (if I understand correctly).

It might be better to run the example scripts as the converted .py files once, for testing, instead.

Increase coverage for kdq_tree

After allowing dataframes and adding validation, we have some uncovered lines:
src/menelaus/data_drift/kdq_tree.py 106 12 89% 175-176, 180, 208-217, 223, 232, 354

I think these should just be a couple of cases:

  1. call kdq_tree with a string (i.e., garbage input data)
  2. initialize with a dataframe, then call with a differently-named dataframe
  3. initialize with a numpy array, then pass a dataframe
  4. call to_plotly_dataframe with a different input_cols argument
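Hedged sketches of the first two cases; the constructor defaults, the update/set_reference usage, and the exact exception types are assumptions that would need to match kdq_tree's actual validation:

```python
import pandas as pd
import pytest

from menelaus.data_drift.kdq_tree import KdqTreeBatch


def test_kdq_tree_rejects_garbage_input():
    # Case 1: a string instead of array-like input should fail validation.
    det = KdqTreeBatch()
    with pytest.raises((ValueError, TypeError)):
        det.update("garbage input")


def test_kdq_tree_rejects_renamed_columns():
    # Case 2: column names differing from the reference should fail validation.
    det = KdqTreeBatch()
    det.set_reference(pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}))
    with pytest.raises(ValueError):
        det.update(pd.DataFrame({"c": [1.0], "d": [2.0]}))
```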

Speed up the GitHub actions

  • Can we get a performance improvement by using something other than ubuntu-latest?
  • Can we get a performance improvement by using docker images, instead of repeatedly configuring environments?
  • The pipeline currently complains about not having wheel. Speed up by installing it, or by using docker as above?
  • Is it faster to use e.g. venv over conda?

Update examples to use "source" jupyter notebooks

  • One jupyter notebook per module (data_drift, concept_drift)
    • The jupyter notebook has pretty formatting, in-line display of tables, figures, etc. We should be able to include the html files via sphinx.
      • Confirm that including an nbconverted html file is easy to do via sphinx.
        • Seems to be possible to include notebooks directly via nbsphinx, which will execute the notebooks upon creating the docs.
        • nbconvert can be used to convert notebooks into new notebooks, which would allow us to filter out cells with certain tags from a given notebook and automatically generate a new one. Might end up needlessly elaborate.
      • Figure out which nbextensions we need and how to include them in setup.cfg
    • each jupyter notebook uses the "tag" feature to specify cells corresponding to example (e.g. "all_examples" and "ADWIN" tags)
    • use nbconvert with the tag option to convert the single jupyter notebook
      • Mock up a script that can be added to the pipeline to do this conversion (a rough sketch appears at the end of this issue). May be able to base this on an existing example. Remember to mark it with @pytest.mark.no_cover so that it doesn't inflate the coverage statistics.
    • Switch the .py scripts to jupyter, using the tags.
  • Update the README and setup.cfg, with instructions on how to use:
    • default install includes the visualization dependencies necessary to run each example.py script
    • barebones install is bare minimum dependencies, no visuals.
    • dev install includes the above, but also sphinx and everything else.
    • For the non-dev configs, look into not downloading e.g. docs, tests, etc., as they're a waste of bandwidth.

This lets us maintain the examples in jupyter notebooks and keep them pretty, so that someone can just read the documentation and see what the code does, while also allowing users to run the examples without forcing them to install jupyter.
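A rough sketch of tag-based notebook conversion via nbconvert's documented TagRemovePreprocessor. Note this removes tagged cells; keeping only cells tagged e.g. "ADWIN" (as described above) would need the inverse filter, but the mechanism is the same. The file names and the "skip_in_docs" tag are placeholders:

```python
from traitlets.config import Config
from nbconvert import NotebookExporter

c = Config()
c.TagRemovePreprocessor.remove_cell_tags = ("skip_in_docs",)
c.TagRemovePreprocessor.enabled = True
c.NotebookExporter.preprocessors = ["nbconvert.preprocessors.TagRemovePreprocessor"]

# Convert one notebook into a filtered notebook, dropping the tagged cells.
body, _ = NotebookExporter(config=c).from_filename("examples/source_notebook.ipynb")
with open("examples/filtered_notebook.ipynb", "w") as f:
    f.write(body)
```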

LFR Tests: implement better way to force drift and non-drift

For these test functions, we use a numpy random seed to get the tests to pass. Ideally, we would have a better way to ensure drift or non-drift as a result of the update statements that we use in the tests to introduce new data to the detector.

Workflow to push to pypi on release?

  • Is it possible to run wheel and push to pypi upon release?
  • If not, probably want to add documentation somewhere on how to do this.
  • Investigate what options there are for automatically/reminding to increment version numbers in conf.py and setup.cfg.

Add parameter value recommendations where applicable

Goal: even though some parameters do not have easy interpretations and/or are very data dependent, Kodie suggested we could add recommended directions to move the parameters in, e.g. "if you are getting too many false positives, decrease the lambda value."

This would mean changing default values and adding recommendations in docstrings.

Clean up potential memory issues

  • Detectors which use an internal drift_tracker object or similar have the potential to grow without bound. This tracking should be pushed out to the external for loop, rather than being stored internally.
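A sketch of the external tracking pattern, with a stub standing in for a real detector; the update/drift_state interface is assumed for illustration:

```python
import numpy as np
import pandas as pd


class StubDetector:
    """Stand-in for a real detector; alarms on large values."""

    def __init__(self):
        self.drift_state = None

    def update(self, value):
        self.drift_state = "drift" if value > 3 else None


detector = StubDetector()
stream = np.random.normal(size=1000)

statuses = []
for i, sample in enumerate(stream):
    detector.update(sample)
    statuses.append({"index": i, "drift": detector.drift_state})

status_df = pd.DataFrame(statuses)   # memory use is now under the caller's control
print(status_df["drift"].notna().sum(), "alarms")
```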

Shift data files into data-generating scripts

Task

Currently src/menelaus contains tools/artifacts, which exist only to house large CSV files for test datasets and their accompanying descriptions. As many of these as possible should be transformed so that the data does not need to live in the repository and can instead be generated by Python scripts. For now, this applies only to example_data.csv via make_example_data.R.

Impact

Transforming CSVs into generated data via Python scripts allows for greater flexibility for users, and mimics patterns in tensorflow.keras.datasets, sklearn.datasets, etc. This will also allow us to refactor the suboptimal "tools" folder, which isn't really a sub-package at the moment and contains some large files it may be preferable to avoid downloading.

Details

At minimum:

  • replicate make_example_data.R into a Python script, making sure to fix seeds where applicable
  • place this in a refactored tools/artifacts (e.g. a datasets sub-package)

Nice to have:

  • it may not be ideal to load all data into memory, so we may want to offer a generator class or some such feature for iterating over datasets, in general
  • determine if dataCircleGSev3Sp3Train.csv can also be cleaned up in some way
  • put all included descriptions into one README for the datasets directory
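A rough sketch of what the Python replacement could look like, together with a simple generator for batch-wise iteration (the nice-to-have above). The function name, columns, and grouping are placeholders, not the contents of make_example_data.R:

```python
import numpy as np
import pandas as pd


def make_example_data(n=10_000, seed=123) -> pd.DataFrame:
    """Generate a reproducible example dataset in memory instead of shipping a CSV."""
    rng = np.random.default_rng(seed)          # fixed seed for reproducibility
    return pd.DataFrame(
        {
            "a": rng.normal(size=n),
            "b": rng.normal(size=n),
            "year": 2007 + (np.arange(n) * 10 // n),   # ten roughly equal groups
        }
    )


def iter_batches(df: pd.DataFrame, by: str = "year"):
    """Yield one batch at a time rather than holding everything in memory at once."""
    for _, batch in df.groupby(by):
        yield batch
```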

Create `benchmarking` suite

Task

Even if populated afterwards, a benchmarks directory can serve as an add-on to examples/, and specifically address basic comparisons of detector performance. There is an interesting example of such a markdown file in detectron2.

Impact

This can help visualize/document differences between detectors, and enable us to later create a seamless way of updating our knowledge base about detector strengths/weaknesses with minimal effort.

Details:

At the minimum:

  • benchmarks directory
  • README with rough outline, examples of tables/figures
  • one example script of some detector

See also #59
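A rough outline of what one benchmark script could look like: run detectors over a synthetic stream with an injected shift and tabulate runtime and alarm counts. The stub detectors stand in for real menelaus detectors, whose interface is assumed here:

```python
import time

import numpy as np
import pandas as pd


class StubDetector:
    """Stand-in detector that alarms when a value exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.drift_state = None

    def update(self, value):
        self.drift_state = "drift" if abs(value) > self.threshold else None


rng = np.random.default_rng(42)
stream = np.concatenate([rng.normal(0, 1, 50_000), rng.normal(2, 1, 50_000)])  # injected shift

rows = []
for name, det in {"loose": StubDetector(2.5), "strict": StubDetector(3.5)}.items():
    alarms, start = 0, time.perf_counter()
    for x in stream:
        det.update(x)
        alarms += det.drift_state == "drift"
    rows.append({"detector": name, "alarms": alarms, "runtime_s": round(time.perf_counter() - start, 2)})

print(pd.DataFrame(rows))   # the kind of table the benchmarks README could include
```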

Enable linting pipeline

  • To transition to flake8 as our linter, run it locally and fix the issues.
  • Add example commands for how to run these formatters locally in the README.
  • Turn flake8 workflow back on.

Replace readme_short.rst with the full README on pypi

  • The long_description and long_description_content_type fields in setup.cfg are pointing to readme_short.rst due to pypi being unable to render the citations from the references.rst.
  • This seems avoidable, but low-value.
  • If the base README can be updated s.t. this does not occur, it might be worthwhile to have the full README reflected on pypi.

Find benchmark dataset

We need to identify a benchmark dataset for detecting concept drift in streaming data. This can be used in our final report to showcase the accuracy of our algorithms and how effective they are at mitigating drift.

If we did happen to run into a benchmark dataset for concept drift for batch data, that would be helpful too.

Inject drift according to the types of drift in Lu (sudden, gradual, abrupt, and sources 1, 2, and 3).
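A hedged sketch of injecting two of these drift types into a synthetic univariate stream; the shift sizes and transition window are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Sudden drift: the mean jumps at a single change point.
sudden = np.concatenate([rng.normal(0, 1, n // 2), rng.normal(3, 1, n // 2)])

# Gradual/incremental drift: the mean ramps between the old and new concept
# over a transition window (0 before the transition, 1 after it).
ramp = np.clip(np.linspace(-1, 2, n), 0, 1)
gradual = rng.normal(0, 1, n) + 3 * ramp
```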

ADWIN: tests for BucketRow and BucketRowList methods

Write tests for the other BucketRow and BucketRowList methods, separate from the full ADWIN tests. Currently, the relevant test functions that exist are: test_bucket_row_init, test_bucket_row_list_empty, and test_bucket_row_list_append_head.

Methods in BucketRow still to write tests for:

  • add_bucket
  • remove_buckets
  • shift

Methods in BucketRowList still to write tests for:

  • init
  • append_tail
  • remove_tail
