hpi-information-systems / snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
Home Page: https://hpi-information-systems.github.io/snowman/
License: MIT License
To reproduce (on branch main): select magellan as the experiment format
Synchronize changes of a shared instance to all users. Otherwise, if two users access the same system at the same time, inconsistencies might occur.
Possible Implementation:
Axios is not strictly needed; we could probably use the default request library instead.
At the moment, datasets, experiments and matching solutions can only be created and deleted. However, the API also supports updating them, and this should be reflected in the frontend.
Especially necessary for
Implement this by e.g. starting a worker thread and asynchronously waiting for it to finish, so other routes can be handled at the same time (see the sketch below).
Additional speed-up by compiling to WASM?
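A minimal sketch of the worker-thread approach, assuming a Node.js backend; the file name metricsWorker.js and the commented route are hypothetical:

```ts
import { Worker } from 'worker_threads';

// Runs the heavy metrics computation in a separate thread.
// metricsWorker.js is assumed to call parentPort.postMessage(result) when done.
function computeMetricsInWorker(experimentIds: number[]): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./metricsWorker.js', {
      workerData: { experimentIds },
    });
    worker.once('message', resolve); // result of the computation
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

// A route handler can now await the worker without blocking the event
// loop, so other routes stay responsive:
// app.get('/metrics', async (req, res) => {
//   res.json(await computeMetricsInWorker([1, 2]));
// });
```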
Implement a dropdown (or similar) to select one of the following intersection modes:
Backend implementation, API spec, and types will be merged with PR #55
We use command-line parameters in both Electron and the backend. Therefore, it makes sense to have a complete strategy for how these are implemented.
Additionally:
Is your feature request related to a problem? Please describe.
Analysing a duplicate detection model should also be possible on a higher level: soft KPIs. These should be visualised in the frontend with suitable statistics.
It should be possible to add information about a matching solution, like soft KPIs. The currently preferred way is a questionnaire.
The side menu should show the current page's menu item in the primary color.
Try to auto-detect the experiment format when uploading experiment files by looking at the columns of the experiment (see the sketch below).
We should also provide a preview of the experiment file (and the dataset file, if applicable).
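A minimal sketch of column-based auto-detection; the column signatures below are illustrative assumptions, not the real format specifications:

```ts
// Maps each experiment format to the columns that identify it.
// These signatures are made up for illustration.
const formatSignatures: Record<string, string[]> = {
  magellan: ['_id', 'ltable_id', 'rtable_id'],
  pairs: ['p1', 'p2'],
};

// Returns the first format whose signature columns all appear in the
// uploaded file's header, or undefined if none matches.
function detectExperimentFormat(csvHeader: string[]): string | undefined {
  const columns = new Set(csvHeader.map((c) => c.toLowerCase()));
  return Object.keys(formatSignatures).find((format) =>
    formatSignatures[format].every((col) => columns.has(col))
  );
}
```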
We should consider how to use JedAI
There are different types of silver standards (i.e. incomplete ground truth matches). For example:
Provide support for those.
When hovering over or clicking on a dataset record in the benchmark section, show all records this one is related to (e.g. all true positives, false positives, false negatives and true negatives).
Currently, the tooltip that should explain a metric does not work.
Filtering / sorting / limiting for
are managed on the frontend at the moment. When the number of entries grows, this may lead to problems. We should therefore implement this functionality on the backend side (see the sketch below).
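A minimal sketch of backend-side limiting and sorting, assuming an Express route backed by SQLite; the route, table, and column names are illustrative:

```ts
import express from 'express';
import Database from 'better-sqlite3';

const app = express();
const db = new Database('snowman.db');

// Sorting and limiting happen in SQL, so only one page of rows
// crosses the wire instead of the whole table.
app.get('/datasets', (req, res) => {
  const limit = Math.min(Number(req.query.limit ?? 25), 100); // cap page size
  const startAt = Number(req.query.startAt ?? 0);
  const rows = db
    .prepare('SELECT * FROM datasets ORDER BY name LIMIT ? OFFSET ?')
    .all(limit, startAt);
  res.json(rows);
});
```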
The user runs their matching solution on different datasets. They can then upload the different result sets (based on the different datasets) into our tool and see averaged metrics, i.e. calculate a metric for each result set and then average those values (see the sketch below).
The user can also investigate how the developed algorithm performs on each dataset separately.
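A minimal sketch of the averaging step; the ResultSet shape and the choice of F1 as the metric are illustrative assumptions:

```ts
// Confusion counts of one result set on one dataset (illustrative shape).
interface ResultSet {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
}

// F1 = 2 * tp / (2 * tp + fp + fn)
function f1({ truePositives: tp, falsePositives: fp, falseNegatives: fn }: ResultSet): number {
  return (2 * tp) / (2 * tp + fp + fn);
}

// Compute the metric per result set, then take the (macro) average.
function averagedF1(resultSets: ResultSet[]): number {
  const scores = resultSets.map(f1);
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```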
We need to be able to "upgrade" databases from older versions of our tool (e.g. when adding a new column, a default value must be supplied and an old version of the table should automatically get the new column on system load).
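A minimal sketch of forward-only migrations using SQLite's user_version pragma, assuming better-sqlite3; the table and column names are made up:

```ts
import Database from 'better-sqlite3';

// One entry per schema version; migrations[v] upgrades version v to v + 1.
const migrations: string[] = [
  // version 0 -> 1: new column with a default, so old rows stay valid
  `ALTER TABLE datasets ADD COLUMN description TEXT NOT NULL DEFAULT ''`,
];

// Run on system load: applies every migration the database has not seen yet.
export function upgradeDatabase(db: Database.Database): void {
  const current = db.pragma('user_version', { simple: true }) as number;
  for (let v = current; v < migrations.length; v++) {
    db.exec(migrations[v]);
    db.pragma(`user_version = ${v + 1}`);
  }
}
```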
Add a button in the frontend to "publish" a matching solution together with its soft KPIs. This should open a new GitHub issue with the matching solution configuration in its description (following the format of the example matching solutions).
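GitHub pre-fills a new issue from the title and body query parameters of the issues/new URL, so the button mostly needs to build a link. A minimal sketch (the body format is illustrative):

```ts
// Builds a link that opens a pre-filled "new issue" form on GitHub.
function buildPublishUrl(solutionName: string, configuration: object): string {
  const params = new URLSearchParams({
    title: `Matching solution: ${solutionName}`,
    // Wrap the configuration in a JSON code block for readability.
    body: '```json\n' + JSON.stringify(configuration, null, 2) + '\n```',
  });
  return `https://github.com/hpi-information-systems/snowman/issues/new?${params}`;
}
```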
Refactor tests and provide
Most of them are already implemented and only need to be used. See wrapper/src/api/database/setup/examples
To reproduce (on branch main):
We should add the following types of tests:
Additionally, the test coverage of our unit tests should be improved by testing edge cases like
Please include a list of my authors :)
As we are adding more and more experiment formats, the list of experiment formats gets too large to be comfortably usable. We should make it searchable in some way to improve usability.
Currently, it is only possible to compare two experiments. It should be possible to compare n experiments against one ground truth.
This includes:
- Frontend:
- Backend: extend the /benchmark/experiment-intersection/records and /benchmark/experiment-intersection/count routes to support n experiments (keeping the limit and startAt parameters)
Idea to optimize performance for the range query:
clusters: the count of groups in a subclustering with dimensions (d1, d2, d3) would be d1 * d2 * d3
Idea to optimize performance for the sort query:
The SideMenu rerenders correctly if both the state and the current path change, but it does not rerender automatically if only the current path changes (see the sketch below).
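A minimal sketch of one possible fix, assuming the frontend uses react-router: subscribing to useLocation() makes the menu rerender on every path change, so the active item can be derived instead of cached in state:

```tsx
import React from 'react';
import { useLocation } from 'react-router-dom';

interface MenuItem {
  path: string;
  label: string;
}

export function SideMenu({ items }: { items: MenuItem[] }) {
  // Every path change produces a new location object and a rerender.
  const { pathname } = useLocation();
  return (
    <ul>
      {items.map((item) => (
        <li key={item.path} className={pathname === item.path ? 'active' : ''}>
          {item.label}
        </li>
      ))}
    </ul>
  );
}
```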
Speed up the metrics route, e.g. by caching the result (see the sketch below).
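A minimal sketch of in-memory result caching; the cache key scheme and the computeMetrics callback are illustrative assumptions:

```ts
// Caches computed metrics per set of experiment IDs.
const metricsCache = new Map<string, unknown>();

function cachedMetrics(
  experimentIds: number[],
  computeMetrics: (ids: number[]) => unknown
): unknown {
  // Sort the IDs so [1, 2] and [2, 1] share one cache entry.
  const key = [...experimentIds].sort((a, b) => a - b).join(',');
  if (!metricsCache.has(key)) {
    metricsCache.set(key, computeMetrics(experimentIds));
  }
  return metricsCache.get(key);
}
// Remember to invalidate (metricsCache.clear()) whenever an experiment changes.
```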
The questionnaire for the soft KPIs should be dynamically rendered from a JSON definition file.
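A minimal sketch of what such a definition could look like as a typed structure; the field names are assumptions, not the actual soft-KPI schema:

```ts
// One entry per question; the frontend maps over this definition and
// renders one input per entry.
interface Question {
  id: string;
  label: string;
  type: 'text' | 'number' | 'choice';
  choices?: string[]; // only for type 'choice'
}

const softKpiQuestionnaire: Question[] = [
  { id: 'setupTime', label: 'How long did the setup take (hours)?', type: 'number' },
  {
    id: 'expertise',
    label: 'Required level of expertise',
    type: 'choice',
    choices: ['novice', 'intermediate', 'expert'],
  },
];
```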
For binary comparison, one of the two supplied experiments is regarded as the gold standard. We should make it clear in the UI which one the user wants to use as the gold standard.
When inspecting or modifying an experiment, or when benchmarking, add the possibility to preview the detected duplicates of a dataset as clusters.
See this OpenAPI generator
Describe the bug
The new release includes a new preloaded dataset. When the user already has a Snowman version installed and updates it, they do not see the newly added dataset because the database is not updated. Up to now, they have to delete the whole database in AppData.
Expected behaviour
Snowman should show every newly added dataset. For example, we should restart the database initialization process after installing a new version.
As a data scientist, I want to have a search field on the metrics viewer page to search for a specific metric.
Problem: The current uploader can only handle files up to ~1 GB.
Solution: Add support for partial (resumable) file uploads (see the sketch below).
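A minimal sketch of chunked upload from the browser, assuming a hypothetical /upload endpoint that understands Content-Range (e.g. a tus-style server):

```ts
// Uploads the file in fixed-size chunks so no single request exceeds
// the server's body size limit.
async function uploadInChunks(file: File, chunkSize = 8 * 1024 * 1024): Promise<void> {
  for (let start = 0; start < file.size; start += chunkSize) {
    const end = Math.min(start + chunkSize, file.size);
    await fetch('/upload', {
      method: 'PUT',
      headers: {
        // Tells the server which byte range of the file this chunk covers.
        'Content-Range': `bytes ${start}-${end - 1}/${file.size}`,
      },
      body: file.slice(start, end), // only this chunk crosses the wire
    });
  }
}
// On failure, the client can ask the server for the last received byte
// and resume from there instead of restarting the whole upload.
```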
Adding new experiment formats is pretty easy. Therefore, it makes sense to include a dedicated contribution guide just for this use case.
Should we put this in our docs or in a README inside the experiment formats folder?
@phpfs Do we have any UX concept for that issue?
Is your feature request related to a problem? Please describe.
It has to be possible to save the results of the soft KPIs in the database.
Should we think about a hosted version for better comparison?