
hpi-adp-ind's Introduction

  • 👋 Hi, I'm David Kuska, an M.Sc. Data Engineering student at Hasso Plattner Institute Potsdam (Germany)

  • 👀 I'm interested in Cellular Automata, Dynamical Systems, Generative Art, Data Science & Machine Learning

  • 🌱 I'm currently learning JAX with Flax and how to build a chess engine using neural networks


hpi-adp-ind's Issues

Simplify only running part of the experiments

Currently, all available datasets are sampled every time the sampling script is called. This forces you to remove folders from src/ when you only want to run part of the experiments. There should be a simpler way to run only selected experiments, keeping "run everything" as the default.
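A hypothetical sketch of what a dataset filter on the sampling script's CLI could look like (the --only flag and the discover_datasets helper are assumptions, not existing code):

```python
import argparse
from pathlib import Path


def discover_datasets(src: Path) -> list[Path]:
    """Hypothetical helper: treat every sub-folder of src/ as one dataset."""
    return sorted(p for p in src.iterdir() if p.is_dir())


parser = argparse.ArgumentParser(description="Sample the selected datasets.")
parser.add_argument(
    "--only",
    nargs="*",
    default=None,
    help="names of the datasets to sample; by default, all of them",
)
args = parser.parse_args()

datasets = discover_datasets(Path("src"))
if args.only is not None:
    datasets = [d for d in datasets if d.name in args.only]

for dataset in datasets:
    ...  # existing sampling logic, unchanged
```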

Create new results directory for each experiment run

The number of result JSONs, CSVs, and plots is becoming too much to handle.

It would make more sense to have a folder for each experiment run, so that all results belonging to one run are located in the same place.
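One simple way to get there, as a sketch (the results/ base directory and the naming scheme are assumptions): create a timestamped directory at the start of each run and write every JSON, CSV, and plot of that run below it.

```python
from datetime import datetime
from pathlib import Path


def make_run_directory(base: Path = Path("results")) -> Path:
    """Create one fresh, timestamped directory per experiment run."""
    run_dir = base / datetime.now().strftime("run_%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```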

Add single metric for comparison

We should introduce (ideally) a single metric value that describes how "good" a certain experiment (i.e., a combination of sampling methods, rates, etc.) is, so that multiple experiments can be compared easily.
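One candidate, sketched below, would be the F1 score of the discovered INDs against a ground-truth run; treating the metric as F1 is an assumption, the issue deliberately leaves the choice open.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall over the discovered INDs.

    A single value in [0, 1]: 1.0 means the sampled run found exactly
    the INDs of the ground truth, 0.0 means it found none of them.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```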

Clean-up imports

We have some unused and/or outdated imports that should be cleaned up. One of them even causes an error: a dacite import still exists, but dacite is no longer installed.

Directly use dataclasses-json

In the evaluation script we first use json.load to turn the JSON file into a dict, and then call MetanomeRunBatch.from_dict(data) to convert it into a MetanomeRunBatch. We should simply use MetanomeRunBatch.from_json. This might also get rid of the incorrect-type error we currently see.
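Roughly, the change would look like this (the file name is illustrative; MetanomeRunBatch is the project's own dataclass):

```python
import json

# Before: two steps via a plain dict.
with open("runs.json") as f:
    data = json.load(f)
batch = MetanomeRunBatch.from_dict(data)

# After: let dataclasses-json parse the file contents directly.
with open("runs.json") as f:
    batch = MetanomeRunBatch.from_json(f.read())
```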

Improve the visualization

The visualization should be improved. The possibilities are nearly endless; some examples include:

  • Grouping by certain aspects
  • Using an onion diagram for n-ary INDs

Support headers

Currently, there is a way to enable headers in the CSV source files. However, this setting might not be propagated to every place that needs it.

Cleanup folder structure

  • Move descriptive_statistics -> /scripts/
  • Move plots.py -> /utils/
  • Move sampling_methods.py -> /utils/

Use statistics for exclusion of INDs

We should use statistics, e.g. minimum/maximum values, to rule out candidate INDs that cannot possibly hold before testing them. We should still be able to use data without statistics.
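The min/max rule, as a sketch: A ⊆ B can only hold if A's value range lies inside B's, so any candidate violating that can be skipped (the function name is an assumption):

```python
def ind_candidate_possible(min_a, max_a, min_b, max_b) -> bool:
    """Return False only if A ⊆ B is provably impossible from the ranges.

    If any statistic is missing (None), we cannot prune and must test
    the candidate normally, so data without statistics keeps working.
    """
    if None in (min_a, max_a, min_b, max_b):
        return True
    return min_b <= min_a and max_a <= max_b
```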

Update Readme

The README is currently pretty outdated (e.g., it still references BINDER) and should be brought up to date.

Rework Experiment Design - Combination of Files/Samples

In the beginning it was fine to simply build the cross product of all file versions (original, sampled, etc.).

However, as we move to column sampling this no longer works and will blow up in our faces: with v sampled versions per column and c columns, the cross product already contains v^c experiments, so the number of experiments grows exponentially with the number of columns.

We will need to rethink how we design our experiments and combine columns.

(Temporarily) disable tuples-to-remove metric

I think we should temporarily disable the tuples-to-remove metric, as it is too slow given the many thousands of INDs found with PartialSPIDER. In the long run we should figure out how to improve it, or whether we actually need it at all.

@dkuska @yjojo17 What do you think? Do you agree that we should disable it?

Sample by columns

We should be able to sample data by column (combinations) only, since the connection between the values of one row is not important (unless we look at n-ary INDs over these columns).
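A minimal sketch of independent per-column sampling, assuming single columns (no combinations yet) and an in-memory, row-major table:

```python
import random


def sample_columns(rows: list[list[str]], rate: float, seed: int = 0) -> list[list[str]]:
    """Sample every column independently; row alignment is deliberately lost."""
    rng = random.Random(seed)
    columns = list(zip(*rows))  # switch to a column-major view
    return [
        rng.sample(column, k=max(1, int(len(column) * rate)))
        for column in columns
    ]
```

Because each column is sampled on its own, the output is a list of columns rather than rows; unary IND checking does not need the rows back.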

Calculate and save column statistics at runtime

For result probabilization we need to calculate a number of column statistics and keep them in memory:

  1. Range - Minimum & Maximum
  2. Mean/Median (TODO: Decide if this is useful somewhere)
  3. Value Count - Number of values in column
  4. Distinct Count - Number of distinct/unique values

This information will need to be passed on to the evaluation/result consolidation step; it may either be written to a file or kept in memory during execution.
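As a sketch, the statistics could live in a small dataclass; with dataclasses-json, which the project already uses, it can be dumped to a file or kept in memory as needed (all names here are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

from dataclasses_json import dataclass_json


@dataclass_json
@dataclass
class ColumnStatistics:
    minimum: str              # 1. range: smallest value
    maximum: str              # 1. range: largest value
    mean: Optional[float]     # 2. only meaningful for numeric columns
    median: Optional[float]   # 2. (TODO from the issue: decide if useful)
    value_count: int          # 3. number of values in the column
    distinct_count: int       # 4. number of distinct/unique values


def compute_statistics(values: list[str]) -> ColumnStatistics:
    return ColumnStatistics(
        minimum=min(values),
        maximum=max(values),
        mean=None,
        median=None,
        value_count=len(values),
        distinct_count=len(set(values)),
    )
```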

Better piping

Our current piping approach has two flaws:

  • The sampling script only outputs file names once all datasets are completed. This is wasteful, as the evaluation of finished runs could start while others are still being computed.
  • Even if we fix the first problem, it won't help on its own, because we use sys.stdin.read(), which reads from stdin until EOF. We should instead use an approach similar to https://stackoverflow.com/a/47927374, sketched below.
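In the spirit of the linked answer, the consuming script would iterate over stdin line by line instead of reading until EOF (evaluate is a hypothetical entry point); the producer correspondingly has to emit and flush each file name as soon as it is ready, e.g. print(name, flush=True).

```python
import sys

# Before: blocks until the producer closes its end of the pipe.
# data = sys.stdin.read()

# After: handle each file name as soon as the producer flushes it.
for line in sys.stdin:
    file_name = line.rstrip("\n")
    if not file_name:
        continue
    evaluate(file_name)  # hypothetical per-run evaluation entry point
```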

Comments/Questions/Ideas from the Presentation

It might be good to keep some of today's questions in mind for the report:

  • Is outlier detection a possible approach?
  • What happens in the sorted-data case if A >>>> B?
  • "Something dataset-related, but I can't remember what it was"

Partial IND discovery

We should add the ability to find partial INDs by relaxing the correctness constraint. This could, and probably will, require writing or borrowing a Metanome algorithm.

Sample based on smallest values

We should add a sampling strategy that tests only the x smallest values of the given dataset.

The reasoning behind this is that most INDs either hold or are very wrong, i.e. the overlap between the column (combinations) is very small.
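A minimal sketch of such a strategy (x follows the issue's wording; everything else is an assumption):

```python
def smallest_values_sample(values: list[str], x: int) -> list[str]:
    """Keep only the x smallest distinct values of a column.

    Per the issue's intuition: most INDs either hold or are very wrong,
    so the smallest values alone are often enough to tell them apart.
    """
    return sorted(set(values))[:x]
```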

Rework Sampling Strategies

Instead of applying static sampling strategies and rates, we want to move towards a more dynamic approach, where we sample according to the structure of the data.

  1. Sorting of the values inside a column
  2. Keeping only unique values
  3. New sampling strategies (see the sketch below):
    3.1 Smallest/Largest (for int-type columns)
    3.2 Shortest/Longest (for string-type columns)
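Since steps 1 and 2 (sorting, de-duplication) are shared, each new strategy reduces to a sort key over the distinct values, so a small registry could cover them all (names are assumptions):

```python
SAMPLING_STRATEGIES = {
    # 3.1: for int-type columns
    "smallest": lambda values, k: sorted(set(values))[:k],
    "largest": lambda values, k: sorted(set(values), reverse=True)[:k],
    # 3.2: for string-type columns
    "shortest": lambda values, k: sorted(set(values), key=len)[:k],
    "longest": lambda values, k: sorted(set(values), key=len, reverse=True)[:k],
}

sample = SAMPLING_STRATEGIES["shortest"](["aa", "b", "ccc", "b"], k=2)
# -> ["b", "aa"]
```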

Result aggregation & consolidation

In order to reduce false negatives (FNs) and false positives (FPs), we need to consolidate the results of multiple runs.

This means that several things need to be done:

  1. Results of multiple runs need to be combined.
    1.1 'Majority voting' (sketched below): if an IND is detected in a majority of the aggregated runs, we can assume that it holds; if an IND is detected in only some runs, we have to probabilize it.
  2. Detected INDs need to be probabilized using the output of the error metric 'MissingValues' and statistics about the column.
    2.1 We need to find a threshold on the ratio between the number of distinct values in the column and MissingValues, in combination with the sample size, beyond which we discard a detected IND.

Note: This issue will be extended/refined iteratively with the results of our experiments.
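A sketch of the majority-voting step from 1.1, assuming each run is represented as the set of IND identifiers it detected:

```python
from collections import Counter


def majority_vote(runs: list[set[str]]) -> tuple[set[str], set[str]]:
    """Split detected INDs into 'accepted by majority' and 'undecided'.

    Undecided INDs are not discarded outright; they are handed to the
    probabilization step (2.) instead.
    """
    counts = Counter(ind for run in runs for ind in run)
    accepted = {ind for ind, n in counts.items() if n > len(runs) / 2}
    undecided = set(counts) - accepted
    return accepted, undecided
```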
