
menelaus's Issues

Make import statements more efficient

When a dotted call like np.sqrt is used, the lookup resolves np first, then sqrt, on every call. This doesn't make a big difference except within a for loop, apparently -- and most of the detectors are going to be run within for loops. In that case, it's better to write, e.g., from numpy import sqrt instead of import numpy as np.

We might get slightly better performance, then, if we go through the update methods, identify calls that use dotted lookups, and replace them with the pattern above.

https://stackoverflow.com/questions/32151193/is-there-a-performance-cost-putting-python-imports-inside-functions
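As a minimal illustration of the pattern (timings will vary by environment; sqrt here is just a stand-in for whichever NumPy functions the update methods actually call):

```python
import timeit

# Dotted lookup: the module attribute is resolved on every call.
dotted = timeit.timeit(
    "np.sqrt(2.0)", setup="import numpy as np", number=1_000_000
)

# Direct import: the function is bound to a name, so no attribute lookup per call.
direct = timeit.timeit(
    "sqrt(2.0)", setup="from numpy import sqrt", number=1_000_000
)

print(f"np.sqrt: {dotted:.3f}s  vs  sqrt: {direct:.3f}s")
```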

MD3 API modifications

MD3's API differs from the other detectors, because it is intended for a semi-supervised context. It may be desirable to address some or all of these:

  • Currently, MD3 has a set_reference to pass in a batch of data as the reference batch, which is inconsistent with the other streaming detectors. We could (1) leave this as is, (2) leave this as is and add set_reference to the other streaming detectors to make them consistent, (3) make MD3's API compatible with both an initial batch or stream-based data for setting the reference, or (4) change MD3 to be compatible only with stream-based data for setting the reference.
  • The waiting_for_oracle state and give_oracle_label method are unusual. One could imagine gathering labeled samples as we go, rather than using an oracle function, and maintaining them as a buffer with some scheme to "forget" sufficiently old labeled samples. This is not how the paper describes it, though.
  • The number of requested oracle samples and the number of retraining samples are potentially decoupled in our implementation. It may be worth noting this in the docstring.

Add validation to init params where applicable.

For example, I don't believe most of the window-based detectors check that the window size is greater than 0.
For detectors whose parameters have theoretical restrictions (e.g., probabilities lie in [0, 1]), we ought to raise a ValidationError (or similar) too.
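A minimal sketch of the kind of __init__ validation proposed here; the class and parameter names (window_size, delta) are illustrative, not actual menelaus signatures, and ValueError is used purely for illustration:

```python
class SomeWindowDetector:
    """Illustrative detector skeleton showing init-parameter validation."""

    def __init__(self, window_size, delta):
        if window_size <= 0:
            raise ValueError(f"window_size must be greater than 0, got {window_size}")
        if not 0 <= delta <= 1:
            raise ValueError(f"delta must lie in [0, 1], got {delta}")
        self.window_size = window_size
        self.delta = delta


SomeWindowDetector(window_size=50, delta=0.05)   # fine
SomeWindowDetector(window_size=0, delta=0.05)    # raises ValueError
```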

Add citation and description to rainfall data

Can you add a brief description and citation to the docstring for the rainfall data in L129 of make_example_data.py? Should be okay to add a similar citation to refs.bib as minku2010.

Improve handling of categorical columns in KDQTreePartitioner

Overview: Currently, KDQTreePartitioner behavior on datasets with columns containing categorical/n-hot encoded/ordinal data will be volatile. Fixing this will generalize KDQTreePartitioner to mixed-type datasets.

Details: For example, if one column in a dataset is a 0/1 variable, then the first time it is split by build/fill, all the 0-rows will be sent one way. The leaf nodes could hence contain many more data points than the upper bound count_threshold suggests.

  1. The uniqueness criterion (if the number of unique values in a column is too small, stop splitting) is needed to prevent endless recursion. With the min_cutpoint_size proportion added, maybe we can remove it safely, since the uniqueness criterion is what prematurely sends too many points to a leaf node.
  2. We can preprocess data passed to build/fill; e.g., using information provided by the user (or inferred ourselves), we can treat problematic columns specially (skip them if they have too few unique values, etc.); see the sketch at the end of this issue.
  3. We may introduce a split for each value in the category, and force the tree to split as such on the problematic columns.

Note that, once kdq-tree is set up to use dataframes, we can "expect" the categorical dtype to treat these columns appropriately. Update the example(s) accordingly!
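A rough sketch of the preprocessing idea in option 2 above: flag columns that look categorical so the partitioner can treat them specially. The helper name, the cardinality threshold, and the idea of inferring this automatically are assumptions, not existing menelaus behavior.

```python
import numpy as np
import pandas as pd


def flag_categorical_columns(df: pd.DataFrame, max_unique: int = 10) -> list:
    """Flag columns that are explicitly categorical or have low cardinality."""
    flagged = []
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            flagged.append(col)                  # explicit categorical dtype
        elif df[col].nunique() <= max_unique:
            flagged.append(col)                  # low-cardinality numeric, e.g. 0/1
    return flagged


df = pd.DataFrame(
    {"x": np.random.normal(size=100), "flag": np.random.randint(0, 2, 100)}
)
print(flag_categorical_columns(df))  # ['flag']
```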

Address indiscriminate drift alarm in NN-DVI

Task

When applied to 'realistic' and sizable data, the current NNDVI implementation alarms for drift constantly, irrespective of the user-selected k (for k-NN) and the number of sampling times used for drift-threshold estimation. Until this is fixed, NNDVI is bugged and unusable.

Note: the referenced dataset is a private dataset of ~180K data points, split into ~9 batches. An MRE of such data is needed for testing and development on this issue.

Impact

This will debug and un-block an otherwise completed drift detector and partitioner, and hence expand the zoo of detectors provided by menelaus.

Details

At minimum:

  • reproduce a dataset which causes the above behavior
  • debug NNSpacePartitioner.build(), NNSpacePartitioner.compute_nnps_distance() for any partitioner-side problems
  • debug NNDVI.update() and NNDVI.compute_drift_threshold() for any detector-side problems
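A sketch of synthetic data for the MRE mentioned above, assuming roughly the same scale (~180K points across 9 batches) with a mean shift injected partway through; the dimensionality and shift are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_batches, batch_size, n_features = 9, 20_000, 10

batches = []
for i in range(n_batches):
    loc = 0.0 if i < 5 else 2.0  # drift injected from the sixth batch onward
    batches.append(rng.normal(loc=loc, scale=1.0, size=(batch_size, n_features)))

# With no drift in the first five batches, NNDVI should not alarm on them;
# constant alarming on this data would reproduce the reported behavior.
```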

Add ensemble drift detector

A really simple motivating example would be n instances of ADWIN, each monitoring one of {accuracy, precision, recall, TPR, FPR, ...}, or whatever combination -- similar to LFR without the Monte Carlo runs. This implementation of ADWIN currently monitors only accuracy, though it could be modified to monitor other quantities, given some finagling of the update signature.
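A minimal sketch of the ensemble idea, assuming a detector interface with an update method and a drift_state attribute (an assumption for illustration, not a settled API):

```python
class SimpleEnsemble:
    """Wrap several detectors, each monitoring one named metric."""

    def __init__(self, detectors: dict):
        # e.g. {"accuracy": ADWIN(), "recall": ADWIN()}
        self.detectors = detectors

    def update(self, metrics: dict):
        """Feed each named metric value to its corresponding detector."""
        for name, value in metrics.items():
            self.detectors[name].update(value)

    @property
    def drift_state(self):
        """Return the names of any members currently alarming, or None."""
        alarming = [n for n, d in self.detectors.items() if d.drift_state == "drift"]
        return alarming or None
```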

PCA-CD: modify sample_period based on density estimate

Original TODO

modify sample period dependent upon density estimate
    line 97 (initialization of self.sample_period) in pca_cd.py

Other Description

sample_period (float, optional): how often to check for drift. This is 100 samples or sample_period * window_size, whichever is smaller. Default .05, or 5% of the window size.

Change initialization of value for sample_period in PCA_CD based on density estimate. Currently defaults to 5% of the window size.

Make streaming and batch abstract classes

kdq-tree has a lot of weirdness due to handling both streaming and batch, e.g. in the validation: right now, the dimensions of new numpy arrays aren't checked. We could probably add checks comparing the length of self.input_cols against the shape of a numpy array. This still doesn't seem ideal, and the current behavior is definitely breakable.

Rather than continuing to fight the complications introduced by the detector taking both streaming and batch input, we should probably have parallel classes that depend on the same partitioner or similar. This seems like a better structure going forward, as we add more detectors that can do both. The further we get into implementing validation and the like, the more troublesome these problems become.

Follow multi-inheritance pattern for all applicable detectors

Task

After #46, all remaining detectors outside of KdqTree will need to be updated to use the StreamingDetector and BatchDetector ABCs if they are meant to serve both options. Such detectors will (likely) also need to implement their own parent algorithm class, e.g. KdqTreeDetector, as part of the multi-inheritance pattern.
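A rough sketch of the multi-inheritance pattern, using stand-in ABCs since the actual import paths and signatures of StreamingDetector/BatchDetector may differ:

```python
from abc import ABC, abstractmethod


# Stand-ins for the menelaus ABCs named above; paths/signatures are assumptions.
class StreamingDetector(ABC):
    @abstractmethod
    def update(self, X, *args, **kwargs): ...


class BatchDetector(ABC):
    @abstractmethod
    def update(self, X, *args, **kwargs): ...


class SomeAlgorithm:
    """Shared algorithm logic, analogous to KdqTreeDetector."""

    def _process(self, data): ...


class SomeAlgorithmStreaming(SomeAlgorithm, StreamingDetector):
    def update(self, X, *args, **kwargs):
        self._process(X)   # one observation at a time


class SomeAlgorithmBatch(SomeAlgorithm, BatchDetector):
    def update(self, X, *args, **kwargs):
        self._process(X)   # a whole batch at once
```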

Impact

This will significantly progress the broad refactor of the drift detectors' object design. See also #15

Revisit generic setup for detectors

*tms: Updating this to reflect the current code. There are a couple of bits that could be made more generic. This ought to be split up into several issues, when we choose to tackle it.

  1. If update could always take the same input, then it'd be easier to swap out detector objects, instead of having to check "is this an ADWIN object?" or similar. Immediate use cases: comparing the performance of two detectors; ensembling detectors. (A sketch of one possible unified signature follows this list.)
    • e.g. ADWIN.update in the typical use case takes "whether the most recent classification was correct." kdq_tree.update takes, instead, an array of the new sample(s) containing each feature.
    • For further work with semi-supervised detectors, where there may or may not be a label with a new sample, we'll eventually have an algorithm that requires more generic input to be acceptable.
    • The batch algorithms present an additional wrinkle: their default behaviors differ. E.g., HDDDM currently puts all of the new test data into the reference window (by default) when drift is not detected. Assuming the drift decision is correct, this means an increasingly accurate empirical distribution for the reference data. kdq_tree, as currently implemented, maintains an unchanging reference window, because repartitioning an ever-larger sample at each step would be expensive. Doing so is now possible, though, by using set_reference and maintaining the data outside the object.
  2. Validation of input. We probably want decisions on the prior point before troubling with this.
    • For streaming detectors, update should check whether more than one sample (/row) has been passed. KdqTree.set_reference currently does something like this to stop the user from inappropriately calling it when KdqTree is in streaming mode instead of batch.
    • We also should add checks that confirm the dimension (and/or column names) of the passed input matches prior updates.
    • We could likely set up DriftDetector.validate to be called by DriftDetector.update, so that it only needs to be implemented once. This assumes that each detector calls super().update at the very beginning of its implementation, which should be correct in most cases.
  3. Having a set_reference method for streaming detectors means we could potentially have faster code for processing many samples at once, especially for those which have an explicit wait period before doing the real calculations. This seems like a much lower value use case.
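A sketch of one possible unified signature, referenced in item 1 above; the argument names (X, y_true, y_pred) are an assumption for illustration, not the library's settled API:

```python
class GenericDetector:
    def update(self, X=None, y_true=None, y_pred=None):
        """Accept features and/or labels; each detector uses what it needs.

        - An accuracy-based streaming detector (e.g. ADWIN) would compare
          y_true and y_pred and ignore X.
        - A data-drift detector (e.g. kdq-tree) would use X and ignore labels.
        - A semi-supervised detector could accept X with y_true sometimes None.
        """
        raise NotImplementedError
```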

remove/make optional tracking attributes of LFR, CUSUM, PH

PH, LFR, and CUSUM also store some version of their test statistics indefinitely, assuming no drift is alarmed. It might behoove us to add an option that truncates these at some length even when there is no drift. Our example notebooks show how to store the test statistics in a separate dataframe.
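A minimal sketch of the proposed truncation option, assuming a hypothetical max_history parameter; collections.deque with maxlen discards the oldest entries automatically:

```python
from collections import deque


class TrackedStatistics:
    """Illustrative internal tracking structure with optional truncation."""

    def __init__(self, max_history=None):
        # maxlen=None preserves the current unbounded behavior;
        # a finite maxlen silently drops the oldest entries.
        self._test_stats = deque(maxlen=max_history)

    def record(self, value):
        self._test_stats.append(value)
```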

Split off and configure the testing pipeline

  • The current workflow runs coverage tests. Split it into a separate .yml.
  • Coverage % badges seem to be nonstandard for GitHub. At a minimum, figure out a way to pass/fail coverage based on whether 99% or higher is achieved.

Docstring fixes

  • Double-check the formatting on the numbered list in make_example_data.make_example_batch_data
  • docs/source/examples/convert_notebooks.py docstring should include a note that it must be run from within a subdirectory of the cloned repo.
  • Update the release number on conf.py to the appropriate version, or remove the release number from conf.py entirely, if RTD/sphinx allows it.
  • Remove this line from the README: "A flowchart breaking down these contexts can be found on the ReadTheDocs page under “Choosing a Detector.”"
    • The page was removed, but the README text was not updated.
  • Add this to the CHANGELOG.md
## v0.1.2 - July 11, 2022

- Updated the documentation
- Added example jupyter notebooks to ReadTheDocs
- Switched to sphinx-bibtex for citations
- Formatting and language tweaks.
- Added StreamingDetector and BatchDetector abstract base classes.
- Re-factored kdq-tree to use new abstract base classes: the separate classes KdqTreeStreaming and KdqTreeBatch now exist.
- kdq-tree can now consume dataframes.
- Added new git workflows and improved old ones.

Set up backtesting of python versions to establish the earliest compatible version.

  • At a minimum, run unit tests on older python versions/dependency versions, to establish a floor for compatible python versions.
    • We might include running the examples/ scripts and doctests (once they exist) in this pipeline as well.
  • Update the README's version note accordingly.

Probably most easily accomplished by setting up tox or a similar utility.

Suggest default settings for PH threshold

Issue: the PH threshold (aka lambda) starts at 0.1 * window_size, but our three divergence metrics lead to change scores that vary widely in magnitude. E.g., for intersection area, the threshold was always 4 but the change scores were around 0.25-0.50, so it never alarmed, even though the trajectory of the change scores over time looked accurate. LLH and KL divergence were on the order of 500 and 50, respectively.

From the PH paper Qahtan cites here (https://repository.kaust.edu.sa/bitstream/handle/10754/556655/TKDE.pdf;jsessionid=289DDA44451AE77BA76539AB0B619647?sequence=1):

"The parameter λ is usually set after observing <the test stat> for some time and depends on the desired alarm rate."
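One possible approach consistent with the quoted advice is to estimate lambda from a burn-in period of observed change scores rather than from window_size alone; the quantile rule below is an illustration, not the paper's prescription:

```python
import numpy as np


def suggest_ph_threshold(burn_in_scores, quantile=0.99):
    """Set lambda just above the bulk of drift-free change scores."""
    return float(np.quantile(burn_in_scores, quantile))


# E.g., for intersection-area change scores in the 0.25-0.50 range, this yields
# a threshold near 0.5 instead of the too-large default of 0.1 * window_size.
scores = np.random.uniform(0.25, 0.50, size=500)
print(suggest_ph_threshold(scores))
```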

Replace test_example_notebooks with a call to convert_notebooks?

#65 (review)

As noted in the PR that added the latter script, examples.test_example_notebooks and docs.source.examples.convert_notebooks do very similar things. The jupyter-based script is pretty complicated, especially with some venv weirdness on the GitHub runner (if I understand correctly).

It might be better to run the example scripts as the converted .py files once, for testing, instead.

Increase coverage for kdq_tree

After allowing dataframes and adding validation, we have some uncovered lines:
src/menelaus/data_drift/kdq_tree.py 106 12 89% 175-176, 180, 208-217, 223, 232, 354

I think these should just be a couple of cases:

  1. call kdq_tree with a string (i.e., garbage input data)
  2. initialize with a dataframe, then call with a differently-named dataframe
  3. initialize with a numpy array, then pass a dataframe
  4. call to_plotly_dataframe with a different input_cols argument
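Hedged sketches of the first two cases; the constructor defaults, the update/set_reference usage, and the exact exception types are assumptions that would need to match kdq_tree's actual validation:

```python
import pandas as pd
import pytest

from menelaus.data_drift.kdq_tree import KdqTreeBatch


def test_kdq_tree_rejects_garbage_input():
    # Case 1: a string instead of array-like input should fail validation.
    det = KdqTreeBatch()
    with pytest.raises((ValueError, TypeError)):
        det.update("garbage input")


def test_kdq_tree_rejects_renamed_columns():
    # Case 2: column names differing from the reference should fail validation.
    det = KdqTreeBatch()
    det.set_reference(pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}))
    with pytest.raises(ValueError):
        det.update(pd.DataFrame({"c": [1.0], "d": [2.0]}))
```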

Speed up the GitHub actions

  • Can we get a performance improvement by using something other than ubuntu-latest?
  • Can we get a performance improvement by using docker images, instead of repeatedly configuring environments?
  • The pipeline currently complains about not having wheel. Speed up by installing it, or by using docker as above?
  • Is it faster to use e.g. venv over conda?

Update examples to use "source" jupyter notebooks

  • One jupyter notebook per module (data_drift, concept_drift)
    • The jupyter notebook has pretty formatting, in-line display of tables, figures, etc. We should be able to include the html files via sphinx.
      • Confirm that including an nbconverted html file is easy to do via sphinx.
        • Seems to be possible to include notebooks directly via nbsphinx, which will execute the notebooks upon creating the docs.
        • nbconvert can be used to convert notebooks into new notebooks, which would allow us to filter out cells with certain tags from a given notebook and automatically generate a new one. Might end up needlessly elaborate.
      • Figure out which nbextensions we need and how to include them in setup.cfg
    • each jupyter notebook uses the "tag" feature to specify cells corresponding to example (e.g. "all_examples" and "ADWIN" tags)
    • use nbconvert with the tag option to convert the single jupyter notebook
      • Mock up a script that can be added to the pipeline to do this conversion (a rough sketch appears at the end of this issue). May be able to base this on an existing example. Remember to mark it with @pytest.mark.no_cover so that it doesn't inflate the coverage statistics.
    • Switch the .py scripts to jupyter, using the tags.
  • Update the README and setup.cfg, with instructions on how to use:
    • default install includes the visualization dependencies necessary to run each example.py script
    • barebones install is bare minimum dependencies, no visuals.
    • dev install includes the above, but also sphinx and everything else.
    • For the non-dev configs, look into not downloading e.g. docs, tests, etc., as they're a waste of bandwidth.

This lets us maintain the examples in jupyter notebooks and keep them pretty, so that someone can just read the documentation and see what the code does, while also allowing users to run the examples without forcing them to install jupyter.
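A rough sketch of tag-based notebook conversion via nbconvert's documented TagRemovePreprocessor. Note this removes tagged cells; keeping only cells tagged e.g. "ADWIN" (as described above) would need the inverse filter, but the mechanism is the same. The file names and the "skip_in_docs" tag are placeholders:

```python
from traitlets.config import Config
from nbconvert import NotebookExporter

c = Config()
c.TagRemovePreprocessor.remove_cell_tags = ("skip_in_docs",)
c.TagRemovePreprocessor.enabled = True
c.NotebookExporter.preprocessors = ["nbconvert.preprocessors.TagRemovePreprocessor"]

# Convert one notebook into a filtered notebook, dropping the tagged cells.
body, _ = NotebookExporter(config=c).from_filename("examples/source_notebook.ipynb")
with open("examples/filtered_notebook.ipynb", "w") as f:
    f.write(body)
```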

LFR Tests: implement better way to force drift and non-drift

For these test functions, we use a numpy random seed to get the tests to pass. Ideally, we would have a better way to ensure drift or non-drift as a result of the update statements that we use in the tests to introduce new data to the detector.

Workflow to push to pypi on release?

  • Is it possible to run wheel and push to pypi upon release?
  • If not, probably want to add documentation somewhere on how to do this.
  • Investigate what options there are for automatically/reminding to increment version numbers in conf.py and setup.cfg.

Add parameter value recommendations where applicable

Goal: even though some parameters do not have easy interpretations and/or are very data dependent, Kodie suggested we could add recommended directions to move the parameters in, e.g. "if you are getting too many false positives, decrease the lambda value."

This would mean changing default values and adding recommendations in docstrings.

Clean up potential memory issues

  • Detectors which use an internal drift_tracker object or similar have the potential to grow without bound. This tracking should be pushed out to the external for loop, rather than being stored internally.
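A sketch of the external tracking pattern, with a stub standing in for a real detector; the update/drift_state interface is assumed for illustration:

```python
import numpy as np
import pandas as pd


class StubDetector:
    """Stand-in for a real detector; alarms on large values."""

    def __init__(self):
        self.drift_state = None

    def update(self, value):
        self.drift_state = "drift" if value > 3 else None


detector = StubDetector()
stream = np.random.normal(size=1000)

statuses = []
for i, sample in enumerate(stream):
    detector.update(sample)
    statuses.append({"index": i, "drift": detector.drift_state})

status_df = pd.DataFrame(statuses)   # memory use is now under the caller's control
print(status_df["drift"].notna().sum(), "alarms")
```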

Shift data files into data-generating scripts

Task

Currently src/menelaus contains tools/artifacts, which exist only to house large CSV files for test datasets and their accompanying descriptions. As many of these as possible should be transformed so that the data does not need to live in the repository and can instead be generated by Python scripts. For now, this applies only to example_data.csv via make_example_data.R.

Impact

Transforming CSVs into generated data via Python scripts allows for greater flexibility for users, and mimics patterns in tensorflow.keras.datasets, sklearn.datasets, etc. This will also allow us to refactor the suboptimal "tools" folder, which isn't really a sub-package at the moment and contains some large files it may be preferable to avoid downloading.

Details

At minimum:

  • replicate make_example_data.R into a Python script, making sure to fix seeds where applicable
  • place this in a refactored tools/artifacts (e.g. a datasets sub-package)

Nice to have:

  • it may not be ideal to load all data into memory, so we may want to offer a generator class or some such feature for iterating over datasets, in general
  • determine if dataCircleGSev3Sp3Train.csv can also be cleaned up in some way
  • put all included descriptions into one README for the datasets directory
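A rough sketch of what the Python replacement could look like, together with a simple generator for batch-wise iteration (the nice-to-have above). The function name, columns, and grouping are placeholders, not the contents of make_example_data.R:

```python
import numpy as np
import pandas as pd


def make_example_data(n=10_000, seed=123) -> pd.DataFrame:
    """Generate a reproducible example dataset in memory instead of shipping a CSV."""
    rng = np.random.default_rng(seed)          # fixed seed for reproducibility
    return pd.DataFrame(
        {
            "a": rng.normal(size=n),
            "b": rng.normal(size=n),
            "year": 2007 + (np.arange(n) * 10 // n),   # ten roughly equal groups
        }
    )


def iter_batches(df: pd.DataFrame, by: str = "year"):
    """Yield one batch at a time rather than holding everything in memory at once."""
    for _, batch in df.groupby(by):
        yield batch
```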

Create `benchmarking` suite

Task

Even if populated afterwards, a benchmarks directory can serve as an add-on to examples/, and specifically address basic comparisons of detector performance. There is an interesting example of such a markdown file in detectron2.

Impact

This can help visualize/document differences between detectors, and enable us to later create a seamless way of updating our knowledge base about detector strengths/weaknesses with minimal effort.

Details:

At the minimum:

  • benchmarks directory
  • README with rough outline, examples of tables/figures
  • one example script of some detector

See also #59
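A rough outline of what one benchmark script could look like: run detectors over a synthetic stream with an injected shift and tabulate runtime and alarm counts. The stub detectors stand in for real menelaus detectors, whose interface is assumed here:

```python
import time

import numpy as np
import pandas as pd


class StubDetector:
    """Stand-in detector that alarms when a value exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.drift_state = None

    def update(self, value):
        self.drift_state = "drift" if abs(value) > self.threshold else None


rng = np.random.default_rng(42)
stream = np.concatenate([rng.normal(0, 1, 50_000), rng.normal(2, 1, 50_000)])  # injected shift

rows = []
for name, det in {"loose": StubDetector(2.5), "strict": StubDetector(3.5)}.items():
    alarms, start = 0, time.perf_counter()
    for x in stream:
        det.update(x)
        alarms += det.drift_state == "drift"
    rows.append({"detector": name, "alarms": alarms, "runtime_s": round(time.perf_counter() - start, 2)})

print(pd.DataFrame(rows))   # the kind of table the benchmarks README could include
```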

Enable linting pipeline

  • To transition to flake8 as our linter, run it locally and fix the issues.
  • Add example commands for how to run these formatters locally in the README.
  • Turn flake8 workflow back on.

Replace readme_short.rst with the full README on pypi

  • The long_description and long_description_content_type fields in setup.cfg are pointing to readme_short.rst due to pypi being unable to render the citations from the references.rst.
  • This seems avoidable, but low-value.
  • If the base README can be updated s.t. this does not occur, it might be worthwhile to have the full README reflected on pypi.

Find benchmark dataset

We need to identify a benchmark dataset for detecting concept drift in streaming data. This can be used in our final report to showcase the accuracy of our algorithms and how effective they are at mitigating drift.

If we did happen to run into a benchmark dataset for concept drift for batch data, that would be helpful too.

Inject drift according to the types of drift in Lu (sudden, gradual, abrupt, and sources 1, 2, and 3).
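A hedged sketch of injecting two of these drift types into a synthetic univariate stream; the shift sizes and transition window are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Sudden drift: the mean jumps at a single change point.
sudden = np.concatenate([rng.normal(0, 1, n // 2), rng.normal(3, 1, n // 2)])

# Gradual/incremental drift: the mean ramps between the old and new concept
# over a transition window (0 before the transition, 1 after it).
ramp = np.clip(np.linspace(-1, 2, n), 0, 1)
gradual = rng.normal(0, 1, n) + 3 * ramp
```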

ADWIN: tests for BucketRow and BucketRowList methods

Write tests for the other BucketRow and BucketRowList methods, separate from the full ADWIN tests. Currently, the relevant test functions that exist are: test_bucket_row_init, test_bucket_row_list_empty, and test_bucket_row_list_append_head.

Methods in BucketRow still to write tests for:

  • add_bucket
  • remove_buckets
  • shift

Methods in BucketRowList still to write tests for:

  • init
  • append_tail
  • remove_tail
