
menelaus's Introduction


Background

Menelaus implements algorithms for drift detection in machine learning. Drift detection is a branch of machine learning focused on the detection of unforeseen shifts in data. The relationships between variables in a dataset are rarely static and can be affected by changes in both internal and external factors, e.g. changes in data collection techniques, external protocols, and/or population demographics. Both undetected changes in data and undetected model underperformance pose risks to the users thereof. The aim of this package is to enable monitoring of data and of model performance.

The algorithms contained within this package were identified through a comprehensive literature survey. Menelaus' aim was to implement drift detection algorithms that cover a range of statistical methodology. Of the algorithms identified, all are able to identify when drift is occurring; some can highlight suspicious regions of the data in which drift is more significant; and others can also provide model retraining recommendations.

Menelaus implements drift detectors for both streaming and batch data. In a streaming setting, data is arriving continuously and is processed one observation at a time. Streaming detectors process the data with each new observation that arrives and are intended for use cases in which instant analytical results are desired. In a batch setting, information is collected over a period of time. Once the predetermined set is "filled", data is fed into and processed by the drift detection algorithm as a single batch. Within a batch, there is no meaningful ordering of the data with respect to time. Batch algorithms are typically used when it is more important to process large volumes of information simultaneously, where the speed of results after receiving data is of less concern.

Menelaus is named for the Odyssean hero that defeated the shapeshifting Proteus.

Detector List

Menelaus implements the following drift detectors.

Type             | Detector                                                       | Abbreviation | Streaming | Batch
---------------- | -------------------------------------------------------------- | ------------ | --------- | -----
Change detection | Cumulative Sum Test                                            | CUSUM        | x         |
Change detection | Page-Hinkley                                                   | PH           | x         |
Change detection | ADaptive WINdowing                                             | ADWIN        | x         |
Concept drift    | Drift Detection Method                                         | DDM          | x         |
Concept drift    | Early Drift Detection Method                                   | EDDM         | x         |
Concept drift    | Linear Four Rates                                              | LFR          | x         |
Concept drift    | Statistical Test of Equal Proportions to Detect concept drift  | STEPD        | x         |
Concept drift    | Margin Density Drift Detection Method                          | MD3          | x         |
Data drift       | Confidence Distribution Batch Detection                        | CDBD         |           | x
Data drift       | Hellinger Distance Drift Detection Method                      | HDDDM        |           | x
Data drift       | kdq-Tree Detection Method                                      | kdq-Tree     | x         | x
Data drift       | PCA-Based Change Detection                                     | PCA-CD       | x         |
Data drift       | Nearest Neighbor Density Variation Identification              | NN-DVI       |           | x
Ensemble         | Streaming Ensemble                                             | -            | x         |
Ensemble         | Batch Ensemble                                                 | -            |           | x

The three main types of detector, along with ensembles, are described below. More details, including references to the original papers, can be found in the respective module documentation on ReadTheDocs.

  • Change detectors monitor single variables in the streaming context, and alarm when that variable starts taking on values outside of a pre-defined range.
  • Concept drift detectors monitor the performance characteristics of a given model, trying to identify shifts in the joint distribution of the data's feature values and their labels. Note that change detectors can also be applied in this context.
  • Data drift detectors monitor the distribution of the features; in that sense, they are model-agnostic. Such changes in distribution might be to single variables or to the joint distribution of all the features.
  • Ensembles are groups of detectors, where each watches the same data, and drift is determined by combining their output. Menelaus implements a framework for wrapping detectors this way.

The detectors may be applied in two settings, as described in the Background section:

  • Streaming, in which each new observation that arrives is processed separately, as it arrives.
  • Batch, in which the data has no meaningful ordering with respect to time, and the goal is comparing two datasets as a whole.

Additionally, the library implements a kdq-Tree partitioner, for support of the kdq-Tree Detection Method. This data structure partitions a given feature space, then maintains a count of the number of samples from the given dataset that fall into each section of that partition. More details are given in the respective module.

Installation

Create a virtual environment as desired, then:

# for read-only, install from pypi:
pip install menelaus

# to allow editing, running tests, generating docs, etc.
# first, clone the git repo, then:
cd ./menelaus_clone_folder/
pip install -e .[dev] 

# to run examples which use datasets from the wilds library,
# another install option is:
pip install menelaus[wilds]

Menelaus should work with Python 3.8 or higher.

Getting Started

Each detector implements the API defined by menelaus.detector: notably, they have an update method which allows new data to be passed, and a drift_state attribute which tells the user whether drift has been detected, along with (usually) other attributes specific to the detector class.

Generally, the workflow for using a detector, given some data, is as follows:

from menelaus.concept_drift import ADWINAccuracy
from menelaus.data_drift import KdqTreeStreaming
from menelaus.datasets import fetch_rainfall_data
from menelaus.ensemble import StreamingEnsemble, SimpleMajorityElection


# has feature columns, and a binary response 'rain'
df = fetch_rainfall_data()


# use a concept drift detector (response-only)
detector = ADWINAccuracy()
for i, row in df.iterrows():
    detector.update(X=None, y_true=row['rain'], y_pred=0)
    assert detector.drift_state != "drift", f"Drift detected in row {i}"


# use data drift detector (features-only)
detector = KdqTreeStreaming(window_size=5)
for i, row in df.iterrows():
    detector.update(X=df.loc[[i], df.columns != 'rain'], y_true=None, y_pred=None)
    assert detector.drift_state != "drift", f"Drift detected in row {i}"


# use ensemble detector (detectors + voting function)
ensemble = StreamingEnsemble(
  {
    'a': ADWINAccuracy(),
    'k': KdqTreeStreaming(window_size=5)
  },
  SimpleMajorityElection()
)

for i, row in df.iterrows():
    ensemble.update(X=df.loc[[i], df.columns != 'rain'], y_true=row['rain'], y_pred=0)
    assert ensemble.drift_state != "drift", f"Drift detected in row {i}"

As a concept drift detector, ADWIN requires both a true value (y_true) and a predicted value (y_pred) at each update step. The data drift detector KdqTreeStreaming only requires the feature values (X) at each step. More detailed examples, including code for visualizing drift locations, may be found in the examples directory as stand-alone python scripts. The examples, along with their output, can also be viewed on the RTD website.
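The examples above run in the streaming setting. In the batch setting, a reference batch is set first and each subsequent batch is compared against it as a whole. The snippet below is a minimal sketch using KdqTreeBatch; the default constructor and the exact set_reference/update signatures are assumptions here, so consult the data_drift module documentation for the precise arguments.

from menelaus.data_drift import KdqTreeBatch
from menelaus.datasets import fetch_rainfall_data

df = fetch_rainfall_data()
features = df.loc[:, df.columns != 'rain']

# split the data into a reference batch and a later test batch
reference_batch = features.iloc[:1000]
test_batch = features.iloc[1000:2000]

detector = KdqTreeBatch()
detector.set_reference(reference_batch)                 # establish the reference distribution
detector.update(test_batch, y_true=None, y_pred=None)   # compare a new batch against it
print(detector.drift_state)                             # "drift" if the batches differ enough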

Contributing

Install the library using the [dev] option, as above.

  • Testing

    Unit tests can be run with the command pytest. By default, a coverage report with highlighting will be generated in htmlcov/index.html. These default settings are specified in setup.cfg under [tool:pytest].

  • Documentation

    HTML documentation can be generated at menelaus/docs/build/html/index.html with:

    cd docs/source
    sphinx-build . ../build

    If the example notebooks for the docs need to be updated, the corresponding python scripts in the examples directory should also be regenerated via:

    cd docs/source/examples
    python convert_notebooks.py

    Note that this will require the installation of jupyter and nbconvert, which can be added to installation via pip install -e ".[dev, test]".

  • Formatting:

    This project uses black for code formatting, and bandit and flake8 for security checks and linting. To satisfy these requirements when contributing, you may use them as the linter/formatter in your IDE, or manually run the following from the root directory:

    flake8 ./menelaus           # linting
    bandit -r ./menelaus        # security checks
    black ./menelaus            # formatting

Copyright

Authors: Leigh Nicholl, Thomas Schill, India Lindsay, Anmol Srivastava, Kodie P McNamara, Shashank Jarmale.
©2022 The MITRE Corporation. ALL RIGHTS RESERVED
Approved for Public Release; Distribution Unlimited. Public Release
Case Number 22-0244.


menelaus's Issues

Create `benchmarking` suite

Task

Even if populated afterwards, a benchmarks directory can serve as an add-on to examples/, and specifically address basic comparisons of detector performance. There is an interesting markdown file of this kind in detectron2 that could serve as a model.

Impact

This can help visualize/document differences between detectors, and enable us to later create a seamless way of updating our knowledge base about detector strengths/weaknesses with minimal effort.

Details:

At the minimum:

  • benchmarks directory
  • README with rough outline, examples of tables/figures
  • one example script of some detector

See also #59

Workflow to push to pypi on release?

  • Is it possible to run wheel and push to pypi upon release?
  • If not, probably want to add documentation somewhere on how to do this.
  • Investigate what options there are for automatically/reminding to increment version numbers in conf.py and setup.cfg.

Clean up potential memory issues

  • Detectors which use an internal drift_tracker object or similar have the potential to grow unboundedly. These should be pushed out to the external for loop for status tracking, rather than being stored internally.

Address indiscriminate drift alarm in NN-DVI

Task

When applied to 'realistic' and sizable data, the current NNDVI implementation alarms drift constantly, irrespective of the user-selected k (for k-NN) and the number of sampling times for drift-threshold estimation. Until this is fixed, NNDVI is bugged and unusable.

Note: the referenced dataset is a private dataset of ~180K data points, split into ~9 batches. An MRE (minimal reproducible example) of such data is needed for testing and development in this issue.

Impact

This will debug and un-block an otherwise completed drift detector and partitioner, and hence expand the zoo of detectors provided by menelaus.

Details

At minimum:

  • reproduce a dataset which causes the above behavior
  • debug NNSpacePartitioner.build(), NNSpacePartitioner.compute_nnps_distance() for any partitioner-side problems
  • debug NNDVI.update() and NNDVI.compute_drift_threshold() for any detector-side problems

MD3 API modifications

MD3's API differs from the other detectors, because it is intended for a semi-supervised context. It may be desirable to address some or all of these:

  • Currently, MD3 has a set_reference method to pass in a batch of data as the reference batch, which is inconsistent with the other streaming detectors. We could (1) leave this as is, (2) leave this as is and add set_reference to the other streaming detectors to make them consistent, (3) make MD3's API compatible with both an initial batch and stream-based data for setting the reference, or (4) change MD3 to be compatible only with stream-based data for setting the reference. (A sketch of option (2) appears after this list.)
  • The waiting_for_oracle state and give_oracle_label method are unusual. You could imagine gathering the labeled samples as we go, rather than using an oracle function, and maintaining that as a buffer, with some scheme to "forget" "old-enough" labeled samples. This is not as-written in the paper, though.
  • The number of requested oracle samples and number of retraining samples are decoupled in our implementation, potentially. It may be worth making note of this in the docstring.
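A rough sketch of option (2) above: a set_reference hook that the streaming detectors could share. The class and method names here are hypothetical, not the current API.

import pandas as pd

class StreamingReferenceMixin:
    """Hypothetical mixin: lets any streaming detector accept an initial
    reference batch, the way MD3 currently does."""

    def __init__(self):
        self.reference = None

    def set_reference(self, X: pd.DataFrame):
        # store (or summarize) the reference batch; a concrete detector
        # would compute whatever statistics it needs from it here
        self.reference = X.copy()

Option (3) would extend this so the reference could also be accumulated from the first streamed observations; option (4) would drop set_reference entirely.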

PCA-CD: modify sample_period based on density estimate

Original TODO

modify sample period dependent upon density estimate
    line 97 (initialization of self.sample_period) in pca_cd.py

Other Description

sample_period (float, optional): how often to check for drift. This is 100 samples or sample_period * window_size, whichever is smaller. Default .05, or 5% of the window size.

Change initialization of value for sample_period in PCA_CD based on density estimate. Currently defaults to 5% of the window size.
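Restating the current default described above as a small sketch (variable names assumed):

sample_period = 0.05    # default: check every 5% of the window
window_size = 1000      # example value
# drift is checked every min(100, sample_period * window_size) observations
check_interval = min(100, int(sample_period * window_size))  # -> 50 here

The proposed change would replace the fixed 0.05 default with a value driven by the density estimate.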

Split off and configure the testing pipeline

  • The current workflow runs coverage tests. Split it into a separate .yml.
  • Coverage % badges seem to be nonstandard for GitHub. At a minimum, figure out a way to pass/fail coverage based on whether 99% or higher is achieved.

remove/make optional tracking attributes of LFR, CUSUM, PH

PH, LFR, and CUSUM also store some version of their test statistics indefinitely, assuming no drift is alarmed. It might behoove us to add an option to truncate these at some length even if there is no drift. Our example notebooks show how to store the test statistics in a separate dataframe instead.
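A minimal sketch of that external-tracking pattern, reusing the rainfall data and ADWINAccuracy detector from the Getting Started section; the deque cap of 10,000 records is an arbitrary choice.

from collections import deque
import pandas as pd

from menelaus.concept_drift import ADWINAccuracy
from menelaus.datasets import fetch_rainfall_data

df = fetch_rainfall_data()
detector = ADWINAccuracy()

# keep a bounded history of per-step state outside the detector, rather than
# letting anything grow without bound inside it
history = deque(maxlen=10_000)

for i, row in df.iterrows():
    detector.update(X=None, y_true=row['rain'], y_pred=0)
    # drift_state is common to all detectors; which other test statistics are
    # exposed varies by class, so record whichever are of interest
    history.append({"index": i, "drift_state": detector.drift_state})

history_df = pd.DataFrame(history)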

Set up backtesting for python versions to the earliest compatible version.

  • At a minimum, run unit tests on older python versions/dependency versions, to establish a floor for compatible python versions.
    • We might include running the examples/ scripts and doctests (once they exist) in this pipeline as well.
  • Update the README's version note accordingly.

Probably most easily accomplished by setting up tox or a similar utility.

Docstring fixes

  • Double-check the formatting on the numbered list in make_example_data.make_example_batch_data
  • docs/source/examples/convert_notebooks.py docstring should include a note that it must be run from within a subdirectory of the cloned repo.
  • Update the release number on conf.py to the appropriate version, or remove the release number from conf.py entirely, if RTD/sphinx allows it.
  • Remove this line from the README: "A flowchart breaking down these contexts can be found on the ReadTheDocs page under “Choosing a Detector.”"
    • The page was removed, but the docstring update was not.
  • Add this to the CHANGELOG.md
## v0.1.2 - July 11, 2022

- Updated the documentation
- Added example jupyter notebooks to ReadTheDocs
- Switched to sphinx-bibtext for citations
- Formatting and language tweaks.
- Added StreamingDetector and BatchDetector abstract base classes.
- Re-factored kdq-tree to use new abstract base classes: the separate classes KdqTreeStreaming and KdqTreeBatch now exist.
- kdq-tree can now consume dataframes.
- Added new git workflows and improved old ones.

Add ensemble drift detector

A really simple motivating example would be n instances of ADWIN, each monitoring one of {accuracy, precision, recall, TPR, FPR, ...} or whatever combination, similar to LFR without the Monte Carlo runs. This implementation of ADWIN now only points at the accuracy, though it could be modified to monitor other quantities, given some finagling of the update signature.

Speed up the GitHub actions

  • Can we get a performance improvement by using something other than ubuntu-latest?
  • Can we get a performance improvement by using docker images, instead of repeatedly configuring environments?
  • The pipeline currently complains about not having wheel. Speed up by installing it, or by using docker as above?
  • Is it faster to use e.g. venv over conda?

Replace test_example_notebooks with a call to convert_notebooks?

#65 (review)

As noted in the PR that added the latter script, examples.test_example_notebooks and docs.source.examples.convert_notebooks do very similar things. The script for jupyter is pretty complicated, especially with some venv weirdness on the git runner (if I understand correctly).

It might be better to run the example scripts as the converted .py files once, for testing, instead.

Enable linting pipeline

  • To transition to flake8 as our linter, run it locally and fix the issues.
  • Add example commands for how to run these formatters locally in the README.
  • Turn flake8 workflow back on.

Add parameter value recommendations where applicable

Goal: even though some parameters do not have easy interpretations and/or are very data-dependent, Kodie suggested we could add recommended directions to move the parameters in, e.g. "if you are getting too many false positives, decrease the lambda value."

This may also involve changing default values and adding recommendations in docstrings.

Revisit generic setup for detectors

tms: Updating this to reflect the current code. There are a couple of bits that could be made more generic. This ought to be split into several issues when we choose to tackle it.

  1. If update could always take the same input, then it'd be easier to swap out detector objects, instead of having to check "is this an ADWIN object?" or similar. Immediate use cases: comparing the performance of two detectors; ensembling detectors. (A sketch of such a generic signature appears after this list.)
    • e.g. ADWIN.update in the typical use case takes "whether the most recent classification was correct." kdq_tree.update takes, instead, an array of the new sample(s) containing each feature.
    • For further work with semi-supervised detectors, where there may or may not be a label with a new sample, we'll eventually have an algorithm that requires more generic input to be acceptable.
    • The batch algorithms present an additional wrinkle: their default behaviors differ, in that e.g. HDDDM currently puts all of the new test data into the reference window (by default), when drift is not detected. Given that the drift decision is correct, this would mean an increasingly accurate empirical distribution for the reference data. kdq_tree, as currently implemented, maintains an unchanging reference window, because repartitioning an ever-larger sample at each step would be expensive. Doing so is possible, now, using set_reference and maintaining the data outside the object, though.
  2. Validation of input. We probably want decisions on the prior point before troubling with this.
    • For streaming detectors, update should check whether more than one sample (/row) has been passed. KdqTree.set_reference currently does something like this to stop the user from inappropriately calling it when KdqTree is in streaming mode instead of batch.
    • We also should add checks that confirm the dimension (and/or column names) of the passed input matches prior updates.
    • We could likely set up DriftDetector.validate to be called by DriftDetector.update, so that it only needs implementation once. This assumes that each detector is calling super().update at the very beginning of their implementation, which should be correct in most cases.
  3. Having a set_reference method for streaming detectors means we could potentially have faster code for processing many samples at once, especially for those which have an explicit wait period before doing the real calculations. This seems like a much lower value use case.
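As a sketch of the generic signature discussed in point 1, something along these lines could work; this illustrates the direction, not the library's settled design, and the class name is an assumption.

from abc import ABC, abstractmethod

class GenericStreamingDetector(ABC):
    """Sketch: every detector accepts the same three arguments and simply
    ignores the ones it does not need."""

    @abstractmethod
    def update(self, X=None, y_true=None, y_pred=None):
        """X: feature values for the new observation (data drift detectors);
        y_true / y_pred: true and predicted labels (concept drift detectors);
        semi-supervised detectors may receive X with or without labels."""
        ...

# a concept drift detector would then be called as det.update(y_true=1, y_pred=0),
# and a data drift detector as det.update(X=new_row)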

Add citation and description to rainfall data

Can you add a brief description and citation to the docstring for the rainfall data in L129 of make_example_data.py? Should be okay to add a similar citation to refs.bib as minku2010.

Replace readme_short.rst with the full README on pypi

  • The long_description and long_description_content_type fields in setup.cfg are pointing to readme_short.rst due to pypi being unable to render the citations from the references.rst.
  • This seems avoidable, but low-value.
  • If the base README can be updated s.t. this does not occur, it might be worthwhile to have the full README reflected on pypi.

Update examples to use "source" jupyter notebooks

  • One jupyter notebook per module (data_drift, concept_drift)
    • The jupyter notebook has pretty formatting, in-line display of tables, figures, etc. We should be able to include the html files via sphinx.
      • Confirm that including an nbconverted html file is easy to do via sphinx.
        • Seems to be possible to include notebooks directly via nbsphinx, which will execute the notebooks upon creating the docs.
        • nbconvert can also be used to convert notebooks into new notebooks, which would allow us to filter cells with certain tags out of a given notebook and automatically generate a new one. This might end up needlessly elaborate.
      • Figure out what all nb_extensions we need and how to include them in the setup.cfg
    • each jupyter notebook uses the "tag" feature to specify cells corresponding to example (e.g. "all_examples" and "ADWIN" tags)
    • use nbconvert with the tag option to convert the single jupyter notebook
      • Mock up a script that can be added to the pipeline to do this conversion. May be able to base this on an example. Remember to mark this with @pytest.mark.no_cover so that it doesn't inflate the coverage statistics.
    • Switch the .py scripts to jupyter, using the tags.
  • Update the README and setup.cfg, with instructions on how to use:
    • default install includes the visualization dependencies necessary to run each example.py script
    • barebones install is bare minimum dependencies, no visuals.
    • dev install includes the above, but also sphinx and everything else.
    • For the non-dev configs, look into not downloading e.g. docs, tests, etc., as they're a waste of bandwidth.

This lets us maintain the examples in jupyter notebooks and have them be pretty, so that someone can just read the documentation and see what stuff does, while also allowing the user to run stuff without forcing them to install jupyter.

Add validation to init params where applicable.

For example, I don't believe most of the window-based detectors check that the window size is greater than 0.
For detectors where the parameters have theoretical restrictions (e.g. probabilities on [0,1]), we ought to raise a validation error (e.g. ValueError) too.
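A minimal sketch of the kind of check intended, for a hypothetical window-based detector; the parameter names and the choice of ValueError are assumptions.

class SomeWindowDetector:
    def __init__(self, window_size, delta=0.05):
        # reject sizes that cannot define a window
        if window_size <= 0:
            raise ValueError(f"window_size must be positive, got {window_size}")
        # parameters with theoretical restrictions, e.g. probabilities on [0, 1]
        if not 0 <= delta <= 1:
            raise ValueError(f"delta must be in [0, 1], got {delta}")
        self.window_size = window_size
        self.delta = delta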

Find benchmark dataset

Need to identify a benchmark dataset for identifying concept drift for streaming data. This can be used in our final report to showcase the accuracy of our algorithms and how effective they are at mitigating drift.

If we did happen to run into a benchmark dataset for concept drift for batch data, that would be helpful too.

Inject drift according to types of drift in Lu (sudden, gradual, abrupt and source 1, 2, and 3)

LFR Tests: implement better way to force drift and non-drift

For these test functions, we use a numpy random seed to get the tests to pass. Ideally, we would have a better way to ensure drift or non-drift as a result of the update statements that we use in the tests to introduce new data to the detector.

Make streaming and batch abstract classes

kdq-tree has a lot of weirdness due to handling both streaming and batch, e.g. in the validation: right now, the dimensions of new numpy arrays aren't checked. We could probably add checks for the length of self.input_cols vs the shape of a numpy array. This still doesn't seem ideal, but it's definitely break-able as it is.

Rather than continuing to fool with the complications introduced by the detector taking both streaming and batch input, we should probably have parallel classes that have dependency on the same partitioner/similar. This seems like a better structure going forward, as we add more detectors that can do both. The further we get into implementing validation and similar, the more troublesome these problems are.

Suggest default settings for PH threshold

Issue: the PH threshold (a.k.a. lambda) starts at 0.1 * window_size, but our three divergence metrics lead to change scores that vary widely in magnitude. E.g., for intersection area, the threshold was always 4 but the change scores were around 0.25-0.50, so it never alarmed, even though the trajectory of the change scores over time looked accurate. LLH and KL divergence were on the order of 500 and 50, respectively.

From the PH paper Qahtan cites here: https://repository.kaust.edu.sa/bitstream/handle/10754/556655/TKDE.pdf;jsessionid=289DDA44451AE77BA76539AB0B619647?sequence=1

"The parameter λ is usually set after observing <the test stat> for some time and depends on the desired alarm rate."

Follow multi-inheritance pattern for all applicable detectors

Task

After #46, all remaining detectors outside of KdqTree will need to be updated, to now use the StreamingDetector and BatchDetector ABCs if they are meant to service both options. Such detectors will (likely) also need to implement their own parent algorithm class, e.g. KdqTreeDetector as part of the multi-inheritance pattern.

Impact

This will significantly progress the broad refactor of the drift detectors' object design. See also #15

Improve handling of categorical columns in KDQTreePartitioner

Overview: Currently, KDQTreePartitioner behavior on datasets with columns containing categorical/n-hot encoded/ordinal data will be volatile. Fixing this will generalize KDQTreePartitioner to mixed-type datasets.

Details: For example, if one column in a dataset is a 0/1 variable, the first time it is split by build/fill, all 0-rows will be sent one way. The leaf nodes could hence have many more data points in them than the upper bound count_threshold suggests.

  1. The uniqueness criterion (if # unique values in a column are too few, stop splitting) is needed to prevent endless recursion. With the min_cutpoint_size proportion added, maybe we can remove this safely, as the uniqueness criterion is what prematurely sends too many points to a leaf node.
  2. We can preprocess data that is called to build/fill, e.g., either with information passed by user (or determined by ourselves), we can specially treat columns that are problematic (skip if the unique values are too few, etc.).
  3. We may introduce a split for each value in the category, and force the tree to split as such on the problematic columns.

Note that, once kdq-tree is set up to use dataframes, we can "expect" the categorical dtype to treat these columns appropriately. Update the example(s) accordingly!
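A sketch of the user-side preprocessing described in option 2, separating low-cardinality columns before they reach the partitioner; the cardinality threshold and the decision of what to do with the flagged columns are assumptions for illustration.

import pandas as pd

def split_problem_columns(df: pd.DataFrame, max_unique: int = 2):
    """Separate columns with very few unique values (e.g. 0/1 indicators),
    which the current splitting logic handles poorly, from the rest."""
    problem_cols = [c for c in df.columns if df[c].nunique() <= max_unique]
    continuous_cols = [c for c in df.columns if c not in problem_cols]
    return df[continuous_cols], df[problem_cols]

# the continuous columns go to the partitioner/detector as before; the
# flagged columns could be monitored separately or re-encoded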

Increase coverage for kdq_tree

After allowing dataframes and adding validation, we have some uncovered lines:
src/menelaus/data_drift/kdq_tree.py 106 12 89% 175-176, 180, 208-217, 223, 232, 354

I think these should just be a few cases (a sketch of case 3 follows the list):

  1. Call kdq_tree with a string (i.e., garbage input data).
  2. Initialize with a dataframe, then call with a differently-named dataframe.
  3. Initialize with a numpy array, then pass a dataframe.
  4. Call to_plotly_dataframe with a different input_cols argument.
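A sketch of case 3, assuming the batch detector rejects the mismatch with a ValueError (the exact exception type and default constructor are assumptions):

import numpy as np
import pandas as pd
import pytest

from menelaus.data_drift import KdqTreeBatch


def test_numpy_reference_then_dataframe_update():
    detector = KdqTreeBatch()
    detector.set_reference(np.random.default_rng(0).normal(size=(50, 3)))
    test_df = pd.DataFrame(
        np.random.default_rng(1).normal(size=(50, 3)), columns=["a", "b", "c"]
    )
    # passing a dataframe after initializing with an array should be rejected
    with pytest.raises(ValueError):
        detector.update(test_df)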

Shift data files into data-generating scripts

Task

Currently src/menelaus contains tools/artifacts which only exist to house large CSV files for test datasets, and accompanying descriptions. As many of these as possible should be transformed such that the data does not need to live in-repository and can be generated by Python scripts instead. For now, this applies only to example_data.csv via make_example_data.R.

Impact

Transforming CSVs into generated data via Python scripts allows for greater flexibility for users, and mimics patterns in tensorflow.keras.datasets, sklearn.datasets, etc. This will also allow us to refactor the suboptimal "tools" folder, which isn't really a sub-package at the moment and contains some large files it may be preferable to avoid downloading.

Details

At minimum:

  • replicate make_example_data.R into a Python script, making sure to fix seeds where applicable
  • place this in a refactored tools/artifacts (e.g. a datasets sub-package)

Nice to have:

  • it may not be ideal to load all data into memory, so we may want to offer a generator class or some such feature for iterating over datasets, in general
  • determine if dataCircleGSev3Sp3Train.csv can also be cleaned up in some way
  • put all included descriptions into one README for the datasets directory

Make import statements more efficient

If a dot-import (np.sqrt) is called, the lookup resolves np first, then sqrt. This doesn't make a big difference except within a for loop, apparently, and most of the detectors are going to be run within for loops. In that case, it's better to do e.g. from numpy import sqrt instead of import numpy as np.

We might get slightly better performance, then, if we go through update methods and identify eventual calls to dot-imports and replace them with the pattern above.

https://stackoverflow.com/questions/32151193/is-there-a-performance-cost-putting-python-imports-inside-functions
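A quick way to measure whether the difference matters for a given loop, using timeit; the magnitudes will vary by machine, and whether the gain is worth the readability cost is a judgment call.

import timeit

setup = "import numpy as np; from numpy import sqrt; xs = list(range(10_000))"

# attribute lookup (np.sqrt) happens on every iteration
dotted = timeit.timeit("for x in xs: np.sqrt(x)", setup=setup, number=100)

# the name is bound once in setup and looked up directly (no attribute access)
direct = timeit.timeit("for x in xs: sqrt(x)", setup=setup, number=100)

print(f"np.sqrt: {dotted:.3f}s   sqrt: {direct:.3f}s")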

ADWIN: tests for BucketRow and BucketRowList methods

Write tests for the other BucketRow and BucketRowList methods, separate from the full ADWIN tests. Currently, the relevant test functions that exist are: test_bucket_row_init, test_bucket_row_list_empty, and test_bucket_row_list_append_head.

Methods in BucketRow still to write tests for:

  • add_bucket
  • remove_buckets
  • shift

Methods in BucketRowList still to write tests for:

  • init
  • append_tail
  • remove_tail
