Snowman

Comparing data matching algorithms is still an open problem in both industry and research. With snowman, developers and researchers can compare the performance of different data matching solutions or evaluate new algorithms. Besides traditional metrics, the tool also considers economic aspects such as Soft KPIs.

Benchmark Dashboard

This tool is developed as part of a bachelor's project in collaboration with SAP SE.

Research Project

This tool has been published as part of the paper "Frost: Benchmarking and Exploring Data Matching Results" (2022) at VLDB. More details on reproducing the results shown within the paper can be found here.

Current state

In Q1 and Q2 of 2021, we reached the following milestones:

[x] Milestone 1: Ability to add/delete datasets, experiments and matching solutions; binary comparison and basic behavior analysis; executable on all major platforms
[x] Milestone 2: Compare more than two experiments and create new experiments based on results; survey Soft KPIs, allow comparison based on KPIs
[x] Milestone 3: Allow individual thresholds for experiments, extend Soft KPIs further and allow advanced evaluation of matching solutions

The precise progress is tracked through GitHub issues and project boards. Please get in touch in case you want a special feature included :)

After reaching milestone 3, we plan to continue working on further features that will broaden the tool's abilities.

Showcase

To show off some key features that Snowman offers, we created a short introductory video:

Snowman Showcase

Contributing

Contribution guidelines will follow soon. Until then, feel free to open an issue to report a bug or request a feature.
In case you want to contribute code, please first open an associated issue and afterwards a pull request containing the proposed solution.

Development

See our development guide for more information on how to get started.

Documentation

Please see our documentation for further information: Snowman Docs

Licenses

Copyright 2021 Hasso Plattner Institute. Licensed under the MIT license.

A complete list of all dependencies and their individual licenses can be found within our documentation.

Contributors

dependabot[bot], florian-papsdorf, lasklu, lucky-snowman, martingraf4, phpfs

Issues

Handle silver standards in a detailed way

There are different types of silver standard (or incomplete ground truth matches). For example:

  • specify which pairs are not duplicates
  • specify which pairs are duplicates

Provide support for those.

Enable filtering / sorting / limiting for meta routes

Filtering / sorting / limiting for

  • matching solutions (including soft KPIs)
  • experiments
  • datasets

are currently managed on the frontend. When the number of entries grows, this may lead to problems. We should therefore implement this functionality on the backend side.
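As a sketch of the backend-side contract, a list route could accept a query like the one below. The query shape (`filter`, `sortBy`, `startAt`, `limit`) is an assumption for illustration, not an existing snowman API:

```typescript
// Hypothetical query options for a backend list route.
interface ListQuery<T> {
  filter?: (item: T) => boolean;
  sortBy?: (item: T) => number | string;
  descending?: boolean;
  startAt?: number;
  limit?: number;
}

// Apply filter, sort and pagination in one place.
function applyListQuery<T>(items: T[], query: ListQuery<T>): T[] {
  let result = query.filter ? items.filter(query.filter) : [...items];
  if (query.sortBy) {
    const key = query.sortBy;
    result.sort((a, b) => {
      const ka = key(a);
      const kb = key(b);
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    });
    if (query.descending) result.reverse();
  }
  const start = query.startAt ?? 0;
  return query.limit !== undefined
    ? result.slice(start, start + query.limit)
    : result.slice(start);
}
```

In practice, filtering and sorting would be pushed down into the SQL query instead of being applied in memory; the sketch only fixes the route's contract.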

Support for JedAI

We should consider how to use JedAI:

  • support for their result set format
  • other integrations between Snowman & JedAI

Cache Metrics

Speed up the metrics route by caching the result (or something else).

  • we should come up with a coherent caching strategy we can apply to all routes.

Investigate Dataset Records

When hovering or clicking on a dataset record in the benchmark section, show all records this one is related to (e.g. all true positives, false positives, false negatives and true negatives)

Improve command line parameters

We use command line parameters in both electron and backend. Therefore, it makes sense to have a complete strategy on how these are implemented.

Additionally:

  • accept port and host as command line arguments to start the backend server
  • accept a command line parameter to use an in-memory database

Experiment Cluster Preview

When inspecting or modifying an experiment, or when benchmarking, add the possibility to preview the detected duplicates of a dataset as clusters.

Parallelize long running routes

Especially necessary for

  • uploading files
  • some benchmark routes

Implement by e.g. starting a worker thread and asynchronously waiting for it to finish so other routes can be handled at the same time.

  • cheap solution: call (and await) setImmediate now and then
  • thread library: https://www.npmjs.com/package/workerpool
    keep in mind: only one writable connection per SQLite database (maybe split up databases more?)

Additional speed up by compiling to WASM?
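The "cheap solution" above can be sketched like this: periodically yield to the event loop so other HTTP routes can be served while a long-running computation proceeds. The chunk size of 1000 is an arbitrary example value:

```typescript
// Process a large array without blocking the event loop: every 1000
// items, await setImmediate so pending I/O and other routes can run.
async function processInChunks<T>(
  items: T[],
  handle: (item: T) => void
): Promise<void> {
  for (let i = 0; i < items.length; i++) {
    handle(items[i]);
    if (i % 1000 === 0) {
      await new Promise<void>((resolve) => setImmediate(resolve));
    }
  }
}
```

This trades a little throughput for responsiveness; a worker-thread pool (e.g. workerpool) would avoid even that cost but adds the SQLite single-writer concern noted above.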

Live Synchronisation

Synchronize changes of a shared instance to all users. Otherwise, if two users access the same system at the same time, inconsistencies might occur.

Possible Implementation:

  • connect clients and server via websocket
  • the server informs all connected clients about all changes (e.g. experiment 5 has been updated). The client can then reload the relevant information
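The fan-out described above could look roughly like this; the message shape and resource names are assumptions, and the websocket transport is abstracted behind a `send` callback:

```typescript
// Hypothetical change-notification message; clients refetch on receipt.
interface ChangeMessage {
  resource: "experiment" | "dataset" | "matching-solution";
  id: number;
  action: "created" | "updated" | "deleted";
}

// In a real setup, `send` would write to the client's websocket.
type Client = { send: (msg: ChangeMessage) => void };

class ChangeBroadcaster {
  private clients = new Set<Client>();

  connect(client: Client): void {
    this.clients.add(client);
  }

  disconnect(client: Client): void {
    this.clients.delete(client);
  }

  // Called by the server after any write; informs all connected clients.
  broadcast(msg: ChangeMessage): void {
    for (const client of this.clients) client.send(msg);
  }
}
```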

Searchable experiment file format selector

As we add more and more experiment formats, the list of formats gets too long to be comfortably usable. We should make it searchable to improve usability.

Soft KPIs: Provide new information

It should be possible to add information about a matching solution, like soft KPIs. The currently preferred way is using a questionnaire.

Intersection mode selector

Implement a dropdown (or similar) to select one of the following intersection modes:

  • Pairs
  • Clusters
  • Investigative

Backend implementation, API spec and types will be merged with this PR #55

SideMenu does not re-render

The SideMenu re-renders correctly when both the state and the current path change, but it does not re-render automatically if only the current path changes.

Add search for metrics

As a data scientist I want to have a search field on the metrics viewer page to search for a specific metric.

Increase test coverage

We should add the following types of tests

  • API tests (e.g. check correctness of routes)
  • e2e / integration tests (e.g. via Cypress)
  • performance tests (check that magellan dataset can be uploaded in less than x min)

Additionally, the test coverage of our unit tests should be improved by testing edge cases like

  1. upload dataset (no file)
  2. upload experiment
  3. upload dataset file with too few records and see if error is thrown

Compare matching solutions across multiple datasets

The user runs their matching solution on different datasets. They can then upload the different result sets (based on the different datasets) in our tool and see averaged metrics (i.e. calculate a metric for each result set and then calculate the average).
The user also has the possibility to investigate how the developed algorithm performs on each dataset separately.
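The averaging step might be sketched as follows; `metric` stands in for any per-result-set score such as precision, recall, or F1:

```typescript
// Compute one score per result set, then take the arithmetic mean.
function averagedMetric<T>(
  resultSets: T[],
  metric: (resultSet: T) => number
): number {
  const scores = resultSets.map(metric);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```

Whether a plain mean is the right aggregation (versus, say, weighting by dataset size) would be a design decision for this feature.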

Soft KPI GitHub sync

Add a button in the frontend to "publish" a matching solution + its soft KPIs. This should open a new GitHub issue with the matching solution configuration in its description (format of the example matching solutions)

New datasets from a new snowman version are not shown in the tool

Describe the bug
The new release includes a new preloaded dataset. When users already have a snowman version installed and update it, they do not see the newly added dataset because the database will not be updated. Up to now, they have to delete the whole database in AppData.

Expected behaviour
Snowman should show every newly added dataset. For example, we should restart the database initialization process after installing a new version.

Compare n experiments against a groundtruth

Currently it is only possible to compare 2 experiments. It should be possible to compare n experiments against one ground truth.

This includes:

Frontend:

  • visualising via a Venn diagram
  • save result sets after clicking on a specific area in the Venn diagram

Backend:

  • Update the /benchmark/experiment-intersection/records and /benchmark/experiment-intersection/count routes to support n experiments
    • Bonus: Speed up those routes by adding in some fancy caching (see #21)
    • Idea for speeding up the calculation of the subclusterings (in getFalsePositives): Use a data structure which maps from id to cluster id -> only have to find different cluster ids in a cluster
  • if possible, show duplicate groups together
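The id-to-cluster-id idea from the bullet above can be sketched as follows; with such an index, deciding whether a pair crosses cluster boundaries becomes two map lookups instead of a cluster scan:

```typescript
// Build a map from record id to the id (index) of its cluster.
function buildClusterIndex(clusters: number[][]): Map<number, number> {
  const index = new Map<number, number>();
  clusters.forEach((cluster, clusterId) => {
    for (const recordId of cluster) index.set(recordId, clusterId);
  });
  return index;
}

// Two records are a within-cluster pair iff they map to the same cluster id.
function sameCluster(
  index: Map<number, number>,
  id1: number,
  id2: number
): boolean {
  const c1 = index.get(id1);
  return c1 !== undefined && c1 === index.get(id2);
}
```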

Provide shared resource / tool function pool for tests

Refactor tests and provide

  • mock datasets
  • mock dataset files
  • mock experiments
  • mock experiment files
  • utility functions

Most of them are already implemented and only need to be used. See wrapper/src/api/database/setup/examples.

View questionnaire results

Is your feature request related to a problem? Please describe.
Analysing a duplicate detection solution should also be possible on a higher level: soft KPIs. These should be visualised in the frontend with some statistics.

Resumable File Upload

Problem: The current uploader can only handle files up to ~1GB.
Solution: Add support for partial (resumable) file upload.

SoftKPIs: Backend architecture

Is your feature request related to a problem? Please describe.
It has to be possible to save the results of the soft KPIs in the database.

We should also think about a hosted version for better comparison.

Automatic Database Migration

We need to be able to "upgrade" databases from older versions of our tool (e.g. when adding a new column, a default value must be supplied and an old version of the table should automatically get the new column on system load).
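A possible shape for such a migration runner, using SQLite's `PRAGMA user_version` to track the schema version; the `execute` callback stands in for the real database driver, and the migration list itself is an illustrative assumption:

```typescript
// One migration step: SQL statements that bring the schema to `toVersion`.
interface Migration {
  toVersion: number;
  statements: string[]; // e.g. ALTER TABLE ... ADD COLUMN ... DEFAULT ...
}

// Run all migrations newer than the current version, in order, and
// record the new schema version after each step. Returns the final version.
function migrate(
  currentVersion: number,
  migrations: Migration[],
  execute: (sql: string) => void
): number {
  const pending = migrations
    .filter((m) => m.toVersion > currentVersion)
    .sort((a, b) => a.toVersion - b.toVersion);
  for (const migration of pending) {
    for (const sql of migration.statements) execute(sql);
    execute(`PRAGMA user_version = ${migration.toVersion}`);
  }
  return pending.length > 0
    ? pending[pending.length - 1].toVersion
    : currentVersion;
}
```

On system load, the app would read `PRAGMA user_version` and pass it as `currentVersion`; new columns get their default value via the ALTER TABLE statement.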

Support `limit` and `startAt` in `calculateExperimentIntersectionRecords`

  • limit
  • startAt

Idea to optimize performance for range query:

  1. calculate the number of rows (including empty rows) / the number of groups
    • e.g. for the clusters format, the count of groups in a subclustering with dimensions (d1, d2, d3) would be d1 · d2 · d3
  2. for each cluster, save the accumulated count of rows / groups until this one
    • maybe cache this array
  3. for a range query, only calculate the required subclusterings

Idea to optimize performance for sort query:

  1. define that "sort" means
    • sort the values IN the clusters
    • sort the clusters by comparing their first element
  2. join database table with experiment table
    • there are 2 ids to be joined on (?)
    • join 2 times (?)
  3. sort the joined table by the required attribute
    • make sure id1 < id2 (regarding sorted column)
  4. add the pairs in this order to the graph
  5. when id1=4 is added -> all occurrences of ids 0..3 are guaranteed to be in the graph
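Steps 1 and 2 of the range-query idea can be sketched as accumulated counts plus a binary search for the first cluster a query touches; only that cluster and its successors then need their subclusterings computed:

```typescript
// Step 2: accumulated (prefix-sum) row counts per cluster.
// accumulatedCounts([3, 2, 5]) -> [3, 5, 10]
function accumulatedCounts(clusterSizes: number[]): number[] {
  const acc: number[] = [];
  let total = 0;
  for (const size of clusterSizes) {
    total += size;
    acc.push(total);
  }
  return acc;
}

// Step 3: binary-search the index of the first cluster containing
// row `startAt` (0-based), so the range query skips earlier clusters.
function firstClusterFor(acc: number[], startAt: number): number {
  let lo = 0;
  let hi = acc.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (acc[mid] <= startAt) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}
```

The accumulated array is exactly the structure worth caching, since it only changes when the clustering itself changes.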

Auto detect experiment format

Try to auto-detect the experiment format when uploading experiment files by looking at the columns of the experiment.

We should also provide a preview of the experiment file (and dataset file, if applicable).
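A sketch of column-based detection; the format names and expected column sets below are illustrative examples, not the actual formats snowman supports:

```typescript
// Hypothetical mapping from format name to its expected column set.
const knownFormats: Record<string, string[]> = {
  pairs: ["id1", "id2"],
  "pairs-with-score": ["id1", "id2", "score"],
  clusters: ["cluster_id", "record_id"],
};

// Return the format whose columns match exactly (case-insensitive),
// or undefined so the UI can fall back to asking the user.
function detectExperimentFormat(columns: string[]): string | undefined {
  const normalized = new Set(columns.map((c) => c.toLowerCase()));
  for (const [format, expected] of Object.entries(knownFormats)) {
    if (
      expected.length === normalized.size &&
      expected.every((col) => normalized.has(col))
    ) {
      return format;
    }
  }
  return undefined;
}
```

Detection based on exact column sets is deliberately conservative; ambiguous files would still go through the manual selector (which issue #-- on searchability would improve).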
