Currently, every available dataset is sampled each time the sampling script is called. This makes it necessary to remove folders from src/ when only part of the experiments should be run. There should be a simpler way to run only selected experiments, while keeping the current behavior as the default.
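One way to do this, sketched below, is an opt-in filter flag. The `--only` flag name and the `select_datasets` helper are assumptions for illustration, not existing code; defaulting the flag to `None` preserves the current sample-everything behavior.

```python
import argparse

def select_datasets(available, only=None):
    """Return all datasets by default; restrict to a subset when requested."""
    if not only:
        return list(available)  # default: keep current behaviour, sample everything
    return [d for d in available if d in only]

parser = argparse.ArgumentParser()
# Hypothetical flag, names are placeholders for illustration.
parser.add_argument('--only', nargs='*', default=None,
                    help='Sample only the named datasets (default: all)')
args = parser.parse_args(['--only', 'tpch'])
print(select_datasets(['tpch', 'cars', 'wdc'], args.only))  # ['tpch']
```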
We should introduce an (ideally single) metric value that describes how "good" a certain experiment, i.e. its sampling methods, rates, etc., is, so that multiple experiments can be compared easily.
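One candidate for such a metric (an assumption, not a decision) would be the F1 score over the detected INDs, which folds precision and recall into a single number:

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall over detected INDs."""
    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives else 0.0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A run that finds 8 of 10 true INDs plus 2 spurious ones:
print(f1_score(true_positives=8, false_positives=2, false_negatives=2))  # 0.8
```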
We have some unused and/or outdated imports that should be cleaned up. This even leads to an error, as a dacite import still exists although dacite is no longer installed.
In the evaluation script we first use json.load to turn the JSON file into a dict, and then use MetanomeRunBatch.from_dict(data) to convert it to a MetanomeRunBatch. We should simply use MetanomeRunBatch.from_json. This might also eliminate the error about incorrect types that we currently get.
We should use statistics, e.g. minimum/maximum values, to exclude theoretically possible INDs before testing them. We should still be able to use data without statistics.
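The pruning idea for unary INDs: if the dependent column's values fall outside the referenced column's [min, max] range, or if it has more distinct values, the IND cannot hold and never needs to be tested. A minimal sketch, where the `ColumnStats` shape and `can_hold` name are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    minimum: str
    maximum: str
    distinct_count: int

def can_hold(dependent: ColumnStats, referenced: ColumnStats) -> bool:
    """Cheap necessary conditions for 'dependent ⊆ referenced'."""
    if dependent.minimum < referenced.minimum:
        return False  # dependent has a value smaller than anything in referenced
    if dependent.maximum > referenced.maximum:
        return False  # ... or a value larger than anything in referenced
    if dependent.distinct_count > referenced.distinct_count:
        return False  # a subset cannot have more distinct values
    return True

a = ColumnStats(minimum='b', maximum='f', distinct_count=3)
b = ColumnStats(minimum='a', maximum='z', distinct_count=10)
print(can_hold(a, b))  # True: the statistics do not rule this IND out
```

Note that passing these checks does not prove the IND; it only avoids testing candidates that are guaranteed to fail.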
In the beginning it was fine to simply build the cross product of all file versions (original, sampled, etc.).
However, as we move to column sampling, this no longer works and will blow up in our face: the number of experiments to conduct grows exponentially with the number of columns.
We will need to rethink the way we design our experiments and combine columns.
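To make the blow-up concrete: with per-column sampling, each column can independently be kept original or replaced by a sampled variant, so the combinations for a single table already grow exponentially (the numbers below are purely illustrative):

```python
# v versions per column (e.g. original + 2 sampled variants), c columns:
versions_per_column, columns = 3, 10
combinations = versions_per_column ** columns
print(combinations)  # 59049 experiments for one 10-column table
```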
I think we should temporarily disable the 'tuples to remove' metric, as it is too slow for the many thousands of INDs found with PartialSPIDER. In the long run we should figure out how to improve it, or whether we actually need it at all.
@dkuska @yjojo17 What do you think about this? Do you agree that we should disable it?
We should be able to sample data by column (combinations) only, as the connection between the values of one row is not important (unless we look at n-ary INDs of these columns).
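A minimal sketch of what sampling a single column independently of the rest of its row could look like; a random strategy, the function name, and the fixed seed are assumptions for illustration:

```python
import random

def sample_column(values, rate: float, seed: int = 0):
    """Sample one column on its own; row alignment is deliberately not preserved."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    k = max(1, int(len(values) * rate))
    return rng.sample(values, k)

column = list(range(100))
sampled = sample_column(column, rate=0.1)
print(len(sampled))  # 10
```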
For result probabilization we need to calculate a number of column statistics and keep them in memory:
- Range: minimum & maximum
- Mean/median (TODO: decide if this is useful somewhere)
- Value count: number of values in the column
- Distinct count: number of distinct/unique values
This information will need to be passed on to the evaluation/result consolidation.
As such it may be written to a file or kept in memory during execution.
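The statistics listed above can be gathered in a single pass per column with the standard library; the `ColumnStatistics` container and function name below are assumptions, not existing code:

```python
import statistics
from dataclasses import dataclass, asdict

@dataclass
class ColumnStatistics:
    minimum: float
    maximum: float
    mean: float
    median: float
    value_count: int
    distinct_count: int

def compute_statistics(values) -> ColumnStatistics:
    """Compute the per-column statistics needed for result probabilization."""
    values = list(values)
    return ColumnStatistics(
        minimum=min(values),
        maximum=max(values),
        mean=statistics.mean(values),
        median=statistics.median(values),
        value_count=len(values),
        distinct_count=len(set(values)),
    )

stats = compute_statistics([1, 2, 2, 5])
print(asdict(stats))  # dataclass converts cleanly to a dict for writing to a file
```

Using a dataclass keeps both options open: `asdict` makes the file route trivial, while the object itself can stay in memory during execution.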
The sampling script only outputs file names once all datasets have been completed. This wastes time, as evaluation of finished runs could already start while others are still being computed.
Even if we fix the first problem, it won't help, because we use sys.stdin.read(), which reads from stdin until EOF. We should instead use an approach similar to https://stackoverflow.com/a/47927374.
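The idea behind the linked answer, adapted to our case: iterate over stdin line by line, so each finished dataset can be handled as soon as its name arrives instead of waiting for EOF. A sketch, where the handler is a placeholder and a StringIO stands in for sys.stdin:

```python
import io

def stream_paths(stream, handle):
    """Process each path as soon as its line arrives, instead of waiting for EOF."""
    for line in stream:  # file objects yield lines as they become available
        path = line.strip()
        if path:
            handle(path)

# In the real script the stream would be sys.stdin; a StringIO stands in here.
seen = []
stream_paths(io.StringIO('a.csv\nb.csv\n'), seen.append)
print(seen)  # ['a.csv', 'b.csv']
```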
We should add the ability to find partial INDs by relaxing the correctness constraint. This could - and probably will - require writing or borrowing a Metanome Algorithm.
Instead of applying static sampling strategies and rates, we want to move towards a more dynamic approach, where we sample according to the structure of the data.
1. Sorting of values inside a column
2. Keeping only unique values
3. New sampling strategies:
   3.1 Smallest/Largest (for int-type columns)
   3.2 Shortest/Longest (for string-type columns)
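The strategies in point 3 boil down to sorting by a key and truncating; a sketch with assumed function names:

```python
def smallest(values, k):
    """Keep the k smallest values (for int-type columns)."""
    return sorted(values)[:k]

def largest(values, k):
    """Keep the k largest values (for int-type columns)."""
    return sorted(values, reverse=True)[:k]

def shortest(values, k):
    """Keep the k shortest values (for string-type columns)."""
    return sorted(values, key=len)[:k]

def longest(values, k):
    """Keep the k longest values (for string-type columns)."""
    return sorted(values, key=len, reverse=True)[:k]

print(smallest([5, 3, 9, 1], 2))       # [1, 3]
print(longest(['a', 'abc', 'ab'], 1))  # ['abc']
```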
In order to reduce false negatives (FNs) and false positives (FPs), we need to consolidate the results of multiple runs.
This means that multiple things need to be done:
1. Results of multiple runs need to be combined.
   1.1 'Majority voting': If an IND is detected in a majority of the aggregated runs, we can assume that it holds. If an IND is only detected in some runs, we have to probabilize it.
2. Detected INDs need to be probabilized using the output of the error metric 'MissingValues' and statistics about the column.
   2.1 We need to find a threshold/ratio between the number of distinct values in the column and MissingValues, in combination with the sample size, beyond which we discard a detected IND.
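Step 1.1 as a sketch: count in how many runs each IND was detected and keep those found in a majority. The 0.5 cut-off and the set-of-strings representation of INDs are assumptions for illustration:

```python
from collections import Counter

def majority_vote(runs, threshold=0.5):
    """runs: one set of detected INDs per run; keep INDs found in > threshold of runs."""
    counts = Counter(ind for run in runs for ind in run)
    return {ind for ind, n in counts.items() if n / len(runs) > threshold}

runs = [{'A<B', 'C<D'}, {'A<B'}, {'A<B', 'E<F'}]
print(sorted(majority_vote(runs)))  # ['A<B'] - only A<B appears in more than half the runs
```

INDs that fall below the threshold are exactly the ones that would then go through the probabilization of point 2.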
Note: This issue will be extended/refined iteratively with the results of our experiments.