Snowman

Comparing data matching algorithms is still an open problem in both industry and research. With snowman, developers and researchers can compare the performance of different data matching solutions or evaluate new algorithms. Besides traditional metrics, the tool also considers economic aspects such as Soft KPIs.

Benchmark Dashboard

This tool is developed as part of a bachelor's project in collaboration with SAP SE.

Research Project

This tool has been published as part of the paper "Frost: Benchmarking and Exploring Data Matching Results" (2022) at VLDB. More details on reproducing the results shown within the paper can be found here.

Current state

In Q1 and Q2 of 2021, we reached the following milestones:

[x] Milestone 1: Ability to add/delete datasets, experiments and matching solutions; binary comparison and basic behavior analysis; executable on all major platforms
[x] Milestone 2: Compare more than two experiments and create new experiments based on results; survey Soft KPIs, allow comparison based on KPIs
[x] Milestone 3: Allow individual thresholds for experiments, extend Soft KPIs further and allow advanced evaluation of matching solutions

The precise progress is tracked through GitHub issues and project boards. Please get in touch in case you want a special feature included :)

After reaching milestone 3, we plan to continue working on further features that will broaden the tool's abilities.

Showcase

To show off some key features that Snowman offers, we created a short introductory video:

Snowman Showcase

Contributing

Contribution guidelines will follow soon. Until then, feel free to open an issue to report a bug or request a feature.
In case you want to contribute code, please first open an associated issue and afterwards a pull request containing the proposed solution.

Development

See our development guide for more information on how to get started.

Documentation

Please see our documentation for further information: Snowman Docs

Licenses

Copyright 2021 Hasso Plattner Institute. Licensed under the MIT license.

A complete list of all dependencies and their individual licenses can be found within our documentation.

Contributors

dependabot[bot], florian-papsdorf, lasklu, lucky-snowman, martingraf4, phpfs

Issues

Handle silver standards in a detailed way

There are different types of silver standard (or incomplete ground truth matches). For example:

  • specify which pairs are not duplicates
  • specify which pairs are duplicates

Provide support for those.

Enable filtering / sorting / limiting for meta routes

Filtering / sorting / limiting for

  • matching solutions (including soft KPIs)
  • experiments
  • datasets

are currently managed on the frontend. When the number of entries grows, this may lead to problems. We should therefore implement this functionality on the backend side.
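As a sketch of the backend-side contract, a list route could accept a query like the one below. The query shape (`filter`, `sortBy`, `startAt`, `limit`) is an assumption for illustration, not an existing snowman API:

```typescript
// Hypothetical query options for a backend list route.
interface ListQuery<T> {
  filter?: (item: T) => boolean;
  sortBy?: (item: T) => number | string;
  descending?: boolean;
  startAt?: number;
  limit?: number;
}

// Apply filter, sort and pagination in one place.
function applyListQuery<T>(items: T[], query: ListQuery<T>): T[] {
  let result = query.filter ? items.filter(query.filter) : [...items];
  if (query.sortBy) {
    const key = query.sortBy;
    result.sort((a, b) => {
      const ka = key(a);
      const kb = key(b);
      return ka < kb ? -1 : ka > kb ? 1 : 0;
    });
    if (query.descending) result.reverse();
  }
  const start = query.startAt ?? 0;
  return query.limit !== undefined
    ? result.slice(start, start + query.limit)
    : result.slice(start);
}
```

In practice, filtering and sorting would be pushed down into the SQL query instead of being applied in memory; the sketch only fixes the route's contract.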

Support for JedAI

We should consider how to use JedAI:

  • support for their result set format
  • other integrations between Snowman & JedAI

Cache Metrics

Speed up the metrics route by caching the result (or something else).

  • we should come up with a coherent caching strategy we can apply to all routes.

Investigate Dataset Records

When hovering or clicking on a dataset record in the benchmark section, show all records this one is related to (e.g. all true positives, false positives, false negatives and true negatives)

Improve command line parameters

We use command line parameters in both electron and backend. Therefore, it makes sense to have a complete strategy on how these are implemented.

Additionally:

  • accept port and host as command line arguments to start the backend server
  • accept a command line parameter to use an in-memory database

Experiment Cluster Preview

When inspecting or modifying an experiment, or when benchmarking, add the possibility to preview the detected duplicates of a dataset as clusters.

Parallelize long running routes

Especially necessary for

  • uploading files
  • some benchmark routes

Implement by e.g. starting a worker thread and asynchronously waiting for it to finish so other routes can be handled at the same time.

  • cheap solution: call (and await) setImmediate now and then
  • thread library: https://www.npmjs.com/package/workerpool
    keep in mind: only one writable connection per SQLite database (maybe split up databases more?)

Additional speed up by compiling to WASM?
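The "cheap solution" above can be sketched like this: periodically yield to the event loop so other HTTP routes can be served while a long-running computation proceeds. The chunk size of 1000 is an arbitrary example value:

```typescript
// Process a large array without blocking the event loop: every 1000
// items, await setImmediate so pending I/O and other routes can run.
async function processInChunks<T>(
  items: T[],
  handle: (item: T) => void
): Promise<void> {
  for (let i = 0; i < items.length; i++) {
    handle(items[i]);
    if (i % 1000 === 0) {
      await new Promise<void>((resolve) => setImmediate(resolve));
    }
  }
}
```

This trades a little throughput for responsiveness; a worker-thread pool (e.g. workerpool) would avoid even that cost but adds the SQLite single-writer concern noted above.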

Live Synchronisation

Synchronize changes of a shared instance to all users. Otherwise, if two users access the same system at the same time, inconsistencies might occur.

Possible Implementation:

  • connect clients and server via websocket
  • the server informs all connected clients about all changes (e.g. experiment 5 has been updated). The client can then reload the relevant information
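The fan-out described above could look roughly like this; the message shape and resource names are assumptions, and the websocket transport is abstracted behind a `send` callback:

```typescript
// Hypothetical change-notification message; clients refetch on receipt.
interface ChangeMessage {
  resource: "experiment" | "dataset" | "matching-solution";
  id: number;
  action: "created" | "updated" | "deleted";
}

// In a real setup, `send` would write to the client's websocket.
type Client = { send: (msg: ChangeMessage) => void };

class ChangeBroadcaster {
  private clients = new Set<Client>();

  connect(client: Client): void {
    this.clients.add(client);
  }

  disconnect(client: Client): void {
    this.clients.delete(client);
  }

  // Called by the server after any write; informs all connected clients.
  broadcast(msg: ChangeMessage): void {
    for (const client of this.clients) client.send(msg);
  }
}
```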

Searchable experiment file format selector

As we add more and more experiment formats, the list of formats gets too long to be comfortably usable. We should make it searchable to improve usability.

Soft KPIs: Provide new information

It should be possible to add information about a matching solution, like soft KPIs. The currently preferred way is using a questionnaire.

Intersection mode selector

Implement a dropdown (or similar) to select one of the following intersection modes:

  • Pairs
  • Clusters
  • Investigative

Backend implementation, API spec and types will be merged with this PR #55

SideMenu does not re-render

The SideMenu re-renders correctly when both the state and the current path change, but it does not re-render automatically if only the current path changes.

Add search for metrics

As a data scientist I want to have a search field on the metrics viewer page to search for a specific metric.

Increase test coverage

We should add the following types of tests

  • API tests (e.g. check correctness of routes)
  • e2e / integration tests (e.g. via Cypress)
  • performance tests (check that magellan dataset can be uploaded in less than x min)

Additionally, the test coverage of our unit tests should be improved by testing edge cases like

  1. upload dataset (no file)
  2. upload experiment
  3. upload dataset file with too few records and see if error is thrown

Compare matching solutions across multiple datasets

The user runs their matching solution on different datasets. They can then upload the different result sets (based on the different datasets) in our tool and see averaged metrics (i.e. calculate a metric for each result set and then calculate the average).
The user also has the possibility to investigate how the developed algorithm performs on each dataset separately.
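The averaging step might be sketched as follows; `metric` stands in for any per-result-set score such as precision, recall, or F1:

```typescript
// Compute one score per result set, then take the arithmetic mean.
function averagedMetric<T>(
  resultSets: T[],
  metric: (resultSet: T) => number
): number {
  const scores = resultSets.map(metric);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```

Whether a plain mean is the right aggregation (versus, say, weighting by dataset size) would be a design decision for this feature.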

Soft KPI GitHub sync

Add a button in the frontend to "publish" a matching solution + its soft KPIs. This should open a new GitHub issue with the matching solution configuration in its description (format of the example matching solutions)

New datasets from a new snowman version are not shown in the tool

Describe the bug
The new release includes a new preloaded dataset. When users already have a snowman version installed and update it, they do not see the newly added dataset because the database will not be updated. Up to now, they have to delete the whole database in AppData.

Expected behaviour
Snowman should show every newly added dataset. For example, we should restart the database initialization process after installing a new version.

Compare n experiments against a groundtruth

Currently it is only possible to compare 2 experiments. It should be possible to compare n experiments against one ground truth.

This includes:

Frontend:

  • visualising via a Venn diagram
  • save result sets after clicking on a specific area in the Venn diagram

Backend:

  • Update the /benchmark/experiment-intersection/records and /benchmark/experiment-intersection/count routes to support n experiments
    • Bonus: Speed up those routes by adding in some fancy caching (see #21)
    • Idea for speeding up the calculation of the subclusterings (in getFalsePositives): Use a data structure which maps from id to cluster id -> only have to find different cluster ids in a cluster
  • if possible, show duplicate groups together
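The id-to-cluster-id idea from the bullet above can be sketched as follows; with such an index, deciding whether a pair crosses cluster boundaries becomes two map lookups instead of a cluster scan:

```typescript
// Build a map from record id to the id (index) of its cluster.
function buildClusterIndex(clusters: number[][]): Map<number, number> {
  const index = new Map<number, number>();
  clusters.forEach((cluster, clusterId) => {
    for (const recordId of cluster) index.set(recordId, clusterId);
  });
  return index;
}

// Two records are a within-cluster pair iff they map to the same cluster id.
function sameCluster(
  index: Map<number, number>,
  id1: number,
  id2: number
): boolean {
  const c1 = index.get(id1);
  return c1 !== undefined && c1 === index.get(id2);
}
```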

Provide shared resource / tool function pool for tests

Refactor tests and provide

  • mock datasets
  • mock dataset files
  • mock experiments
  • mock experiment files
  • utility functions

Most of them are already implemented and only need to be used. See wrapper/src/api/database/setup/examples.

View questionnaire results

Is your feature request related to a problem? Please describe.
Analysing a duplicate detection solution should also be possible on a higher level: soft KPIs. These should be visualised in the frontend with some statistics.

Resumable File Upload

Problem: The current uploader can only handle files up to ~1GB.
Solution: Add support for partial (resumable) file upload.

SoftKPIs: Backend architecture

Is your feature request related to a problem? Please describe.
It has to be possible to save the results of the soft KPIs in the database.

We should also think about a hosted version for better comparison.

Automatic Database Migration

We need to be able to "upgrade" databases from older versions of our tool (e.g. when adding a new column, a default value must be supplied and an old version of the table should automatically get the new column on system load).
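A possible shape for such a migration runner, using SQLite's `PRAGMA user_version` to track the schema version; the `execute` callback stands in for the real database driver, and the migration list itself is an illustrative assumption:

```typescript
// One migration step: SQL statements that bring the schema to `toVersion`.
interface Migration {
  toVersion: number;
  statements: string[]; // e.g. ALTER TABLE ... ADD COLUMN ... DEFAULT ...
}

// Run all migrations newer than the current version, in order, and
// record the new schema version after each step. Returns the final version.
function migrate(
  currentVersion: number,
  migrations: Migration[],
  execute: (sql: string) => void
): number {
  const pending = migrations
    .filter((m) => m.toVersion > currentVersion)
    .sort((a, b) => a.toVersion - b.toVersion);
  for (const migration of pending) {
    for (const sql of migration.statements) execute(sql);
    execute(`PRAGMA user_version = ${migration.toVersion}`);
  }
  return pending.length > 0
    ? pending[pending.length - 1].toVersion
    : currentVersion;
}
```

On system load, the app would read `PRAGMA user_version` and pass it as `currentVersion`; new columns get their default value via the ALTER TABLE statement.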

Support `limit` and `startAt` in `calculateExperimentIntersectionRecords`

  • limit
  • startAt

Idea to optimize performance for range query:

  1. calculate the number of rows (including empty rows) / the number of groups
    • e.g. for the clusters format, the count of groups in a subclustering with dimensions (d1, d2, d3) would be d1 · d2 · d3
  2. for each cluster, save the accumulated count of rows / groups until this one
    • maybe cache this array
  3. for a range query, only calculate the required subclusterings

Idea to optimize performance for sort query:

  1. define that "sort" means
    • sort the values IN the clusters
    • sort the clusters by comparing their first element
  2. join database table with experiment table
    • there are 2 ids to be joined on (?)
    • join 2 times (?)
  3. sort the joined table by the required attribute
    • make sure id1 < id2 (regarding sorted column)
  4. add the pairs in this order to the graph
  5. when id1=4 is added -> all occurrences of ids 0..3 are guaranteed to be in the graph
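Steps 1 and 2 of the range-query idea can be sketched as accumulated counts plus a binary search for the first cluster a query touches; only that cluster and its successors then need their subclusterings computed:

```typescript
// Step 2: accumulated (prefix-sum) row counts per cluster.
// accumulatedCounts([3, 2, 5]) -> [3, 5, 10]
function accumulatedCounts(clusterSizes: number[]): number[] {
  const acc: number[] = [];
  let total = 0;
  for (const size of clusterSizes) {
    total += size;
    acc.push(total);
  }
  return acc;
}

// Step 3: binary-search the index of the first cluster containing
// row `startAt` (0-based), so the range query skips earlier clusters.
function firstClusterFor(acc: number[], startAt: number): number {
  let lo = 0;
  let hi = acc.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (acc[mid] <= startAt) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}
```

The accumulated array is exactly the structure worth caching, since it only changes when the clustering itself changes.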

Auto detect experiment format

Try to auto-detect the experiment format when uploading experiment files by looking at the columns of the experiment.

We should also provide a preview of the experiment file (and dataset file, if applicable).
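A sketch of column-based detection; the format names and expected column sets below are illustrative examples, not the actual formats snowman supports:

```typescript
// Hypothetical mapping from format name to its expected column set.
const knownFormats: Record<string, string[]> = {
  pairs: ["id1", "id2"],
  "pairs-with-score": ["id1", "id2", "score"],
  clusters: ["cluster_id", "record_id"],
};

// Return the format whose columns match exactly (case-insensitive),
// or undefined so the UI can fall back to asking the user.
function detectExperimentFormat(columns: string[]): string | undefined {
  const normalized = new Set(columns.map((c) => c.toLowerCase()));
  for (const [format, expected] of Object.entries(knownFormats)) {
    if (
      expected.length === normalized.size &&
      expected.every((col) => normalized.has(col))
    ) {
      return format;
    }
  }
  return undefined;
}
```

Detection based on exact column sets is deliberately conservative; ambiguous files would still go through the manual selector (which issue #-- on searchability would improve).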
