Currently, every available dataset is sampled each time the sampling script is called. This makes it necessary to remove folders from src/ when only part of the experiments should be run. There should be a simpler way to run only selected experiments, while keeping the current behavior as the default.
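One way to do this, sketched below, is an opt-in filter flag. The `--only` flag name and the `select_datasets` helper are assumptions for illustration, not existing code; defaulting the flag to `None` preserves the current sample-everything behavior.

```python
import argparse

def select_datasets(available, only=None):
    """Return all datasets by default; restrict to a subset when requested."""
    if not only:
        return list(available)  # default: keep current behaviour, sample everything
    return [d for d in available if d in only]

parser = argparse.ArgumentParser()
# Hypothetical flag, names are placeholders for illustration.
parser.add_argument('--only', nargs='*', default=None,
                    help='Sample only the named datasets (default: all)')
args = parser.parse_args(['--only', 'tpch'])
print(select_datasets(['tpch', 'cars', 'wdc'], args.only))  # ['tpch']
```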
We should introduce an (ideally single) metric value that describes how "good" a certain experiment, i.e. its sampling methods, rates, etc., is, so that multiple experiments can be compared easily.
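One candidate for such a metric (an assumption, not a decision) would be the F1 score over the detected INDs, which folds precision and recall into a single number:

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall over detected INDs."""
    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives else 0.0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A run that finds 8 of 10 true INDs plus 2 spurious ones:
print(f1_score(true_positives=8, false_positives=2, false_negatives=2))  # 0.8
```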
We have some unused and/or outdated imports that should be cleaned up. This even leads to an error, as a dacite import still exists although dacite is no longer installed.
In the evaluation script we first use json.load to turn the JSON file into a dict, and then use MetanomeRunBatch.from_dict(data) to convert it to a MetanomeRunBatch. We should simply use MetanomeRunBatch.from_json. This might also eliminate the error about incorrect types that we currently get.
We should use statistics, e.g. minimum/maximum values, to exclude theoretically possible INDs before testing them. We should still be able to use data without statistics.
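The pruning idea for unary INDs: if the dependent column's values fall outside the referenced column's [min, max] range, or if it has more distinct values, the IND cannot hold and never needs to be tested. A minimal sketch, where the `ColumnStats` shape and `can_hold` name are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    minimum: str
    maximum: str
    distinct_count: int

def can_hold(dependent: ColumnStats, referenced: ColumnStats) -> bool:
    """Cheap necessary conditions for 'dependent ⊆ referenced'."""
    if dependent.minimum < referenced.minimum:
        return False  # dependent has a value smaller than anything in referenced
    if dependent.maximum > referenced.maximum:
        return False  # ... or a value larger than anything in referenced
    if dependent.distinct_count > referenced.distinct_count:
        return False  # a subset cannot have more distinct values
    return True

a = ColumnStats(minimum='b', maximum='f', distinct_count=3)
b = ColumnStats(minimum='a', maximum='z', distinct_count=10)
print(can_hold(a, b))  # True: the statistics do not rule this IND out
```

Note that passing these checks does not prove the IND; it only avoids testing candidates that are guaranteed to fail.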
In the beginning it was fine to simply build the cross product of all file versions (original, sampled, etc.).
However, as we move to column sampling, this no longer works and will blow up in our face: the number of experiments to conduct grows exponentially with the number of columns.
We will need to rethink the way we design our experiments and combine columns.
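To make the blow-up concrete: with per-column sampling, each column can independently be kept original or replaced by a sampled variant, so the combinations for a single table already grow exponentially (the numbers below are purely illustrative):

```python
# v versions per column (e.g. original + 2 sampled variants), c columns:
versions_per_column, columns = 3, 10
combinations = versions_per_column ** columns
print(combinations)  # 59049 experiments for one 10-column table
```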
I think we should temporarily disable the 'tuples to remove' metric, as it is too slow for the many thousands of INDs found with PartialSPIDER. In the long run we should figure out how to improve it, or whether we actually need it at all.
@dkuska @yjojo17 What do you think about this? Do you agree that we should disable it?
We should be able to sample data by column (combinations) only, as the connection between the values of one row is not important (unless we look at n-ary INDs of these columns).
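A minimal sketch of what sampling a single column independently of the rest of its row could look like; a random strategy, the function name, and the fixed seed are assumptions for illustration:

```python
import random

def sample_column(values, rate: float, seed: int = 0):
    """Sample one column on its own; row alignment is deliberately not preserved."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    k = max(1, int(len(values) * rate))
    return rng.sample(values, k)

column = list(range(100))
sampled = sample_column(column, rate=0.1)
print(len(sampled))  # 10
```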
For result probabilization we need to calculate a number of column statistics and keep them in memory:
- Range: minimum & maximum
- Mean/median (TODO: decide if this is useful somewhere)
- Value count: number of values in the column
- Distinct count: number of distinct/unique values
This information will need to be passed on to the evaluation/result consolidation.
As such it may be written to a file or kept in memory during execution.
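The statistics listed above can be gathered in a single pass per column with the standard library; the `ColumnStatistics` container and function name below are assumptions, not existing code:

```python
import statistics
from dataclasses import dataclass, asdict

@dataclass
class ColumnStatistics:
    minimum: float
    maximum: float
    mean: float
    median: float
    value_count: int
    distinct_count: int

def compute_statistics(values) -> ColumnStatistics:
    """Compute the per-column statistics needed for result probabilization."""
    values = list(values)
    return ColumnStatistics(
        minimum=min(values),
        maximum=max(values),
        mean=statistics.mean(values),
        median=statistics.median(values),
        value_count=len(values),
        distinct_count=len(set(values)),
    )

stats = compute_statistics([1, 2, 2, 5])
print(asdict(stats))  # dataclass converts cleanly to a dict for writing to a file
```

Using a dataclass keeps both options open: `asdict` makes the file route trivial, while the object itself can stay in memory during execution.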
The sampling script only outputs file names once all datasets have been completed. This wastes time, as evaluation of finished runs could already start while others are still being computed.
Even if we fix the first problem, it won't help, because we use sys.stdin.read(), which reads from stdin until EOF. We should instead use an approach similar to https://stackoverflow.com/a/47927374.
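The idea behind the linked answer, adapted to our case: iterate over stdin line by line, so each finished dataset can be handled as soon as its name arrives instead of waiting for EOF. A sketch, where the handler is a placeholder and a StringIO stands in for sys.stdin:

```python
import io

def stream_paths(stream, handle):
    """Process each path as soon as its line arrives, instead of waiting for EOF."""
    for line in stream:  # file objects yield lines as they become available
        path = line.strip()
        if path:
            handle(path)

# In the real script the stream would be sys.stdin; a StringIO stands in here.
seen = []
stream_paths(io.StringIO('a.csv\nb.csv\n'), seen.append)
print(seen)  # ['a.csv', 'b.csv']
```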
We should add the ability to find partial INDs by relaxing the correctness constraint. This could - and probably will - require writing or borrowing a Metanome Algorithm.
Instead of applying static sampling strategies and rates, we want to move towards a more dynamic approach, where we sample according to the structure of the data.
1. Sorting of values inside a column
2. Keeping only unique values
3. New sampling strategies:
   3.1 Smallest/Largest (for int-type columns)
   3.2 Shortest/Longest (for string-type columns)
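The strategies in point 3 boil down to sorting by a key and truncating; a sketch with assumed function names:

```python
def smallest(values, k):
    """Keep the k smallest values (for int-type columns)."""
    return sorted(values)[:k]

def largest(values, k):
    """Keep the k largest values (for int-type columns)."""
    return sorted(values, reverse=True)[:k]

def shortest(values, k):
    """Keep the k shortest values (for string-type columns)."""
    return sorted(values, key=len)[:k]

def longest(values, k):
    """Keep the k longest values (for string-type columns)."""
    return sorted(values, key=len, reverse=True)[:k]

print(smallest([5, 3, 9, 1], 2))       # [1, 3]
print(longest(['a', 'abc', 'ab'], 1))  # ['abc']
```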
In order to reduce false negatives (FNs) and false positives (FPs), we need to consolidate the results of multiple runs.
This means that multiple things need to be done:
1. Results of multiple runs need to be combined.
   1.1 'Majority voting': If an IND is detected in a majority of the aggregated runs, we can assume that it holds. If an IND is only detected in some runs, we have to probabilize it.
2. Detected INDs need to be probabilized using the output of the error metric 'MissingValues' and statistics about the column.
   2.1 We need to find a threshold/ratio between the number of distinct values in the column and MissingValues, in combination with the sample size, beyond which we discard a detected IND.
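Step 1.1 as a sketch: count in how many runs each IND was detected and keep those found in a majority. The 0.5 cut-off and the set-of-strings representation of INDs are assumptions for illustration:

```python
from collections import Counter

def majority_vote(runs, threshold=0.5):
    """runs: one set of detected INDs per run; keep INDs found in > threshold of runs."""
    counts = Counter(ind for run in runs for ind in run)
    return {ind for ind, n in counts.items() if n / len(runs) > threshold}

runs = [{'A<B', 'C<D'}, {'A<B'}, {'A<B', 'E<F'}]
print(sorted(majority_vote(runs)))  # ['A<B'] - only A<B appears in more than half the runs
```

INDs that fall below the threshold are exactly the ones that would then go through the probabilization of point 2.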
Note: This issue will be extended/refined iteratively with the results of our experiments.