hpi-information-systems / snowman
Welcome to Snowman App – a Data Matching Benchmark Platform.
Home Page: https://hpi-information-systems.github.io/snowman/
License: MIT License
To reproduce (on branch main): select magellan as the experiment format
Synchronize changes of a shared instance to all users. Otherwise, if two users access the same system at the same time, inconsistencies might occur.
Possible Implementation:
Axios is not strictly needed; we could probably use the default request library instead.
At the moment, datasets, experiments and matching solutions can only be created and deleted. However, the API also supports updating them, and this should be reflected in the frontend.
Especially necessary for
Implement this by e.g. starting a worker thread and asynchronously waiting for it to finish, so other routes can be handled at the same time (see the sketch below).
Additional speed-up by compiling to WASM?
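A minimal sketch of the worker-thread approach, assuming a Node.js backend; the file name metricsWorker.js and the commented route are hypothetical:

```ts
import { Worker } from 'worker_threads';

// Runs the heavy metrics computation in a separate thread.
// metricsWorker.js is assumed to call parentPort.postMessage(result) when done.
function computeMetricsInWorker(experimentIds: number[]): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./metricsWorker.js', {
      workerData: { experimentIds },
    });
    worker.once('message', resolve); // result of the computation
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

// A route handler can now await the worker without blocking the event
// loop, so other routes stay responsive:
// app.get('/metrics', async (req, res) => {
//   res.json(await computeMetricsInWorker([1, 2]));
// });
```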
Implement a dropdown (or similar) to select one of the following intersection modes:
Backend implementation, API spec, and types will be merged with PR #55
We use command-line parameters in both Electron and the backend. Therefore, it makes sense to have a complete strategy for how these are implemented.
Additionally:
Is your feature request related to a problem? Please describe.
Analysing a duplicate detection model should also be possible on a higher level: soft KPIs. These should be visualised in the frontend with suitable statistics.
It should be possible to add information about a matching solution, like soft KPIs. The currently preferred way is a questionnaire.
The side menu should show the current page's menu item in the primary color.
Try to auto-detect the experiment format when uploading experiment files by looking at the columns of the experiment (see the sketch below).
We should also provide a preview of the experiment file (and the dataset file, if applicable).
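A minimal sketch of column-based auto-detection; the column signatures below are illustrative assumptions, not the real format specifications:

```ts
// Maps each experiment format to the columns that identify it.
// These signatures are made up for illustration.
const formatSignatures: Record<string, string[]> = {
  magellan: ['_id', 'ltable_id', 'rtable_id'],
  pairs: ['p1', 'p2'],
};

// Returns the first format whose signature columns all appear in the
// uploaded file's header, or undefined if none matches.
function detectExperimentFormat(csvHeader: string[]): string | undefined {
  const columns = new Set(csvHeader.map((c) => c.toLowerCase()));
  return Object.keys(formatSignatures).find((format) =>
    formatSignatures[format].every((col) => columns.has(col))
  );
}
```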
We should consider how to use JedAI
There are different types of silver standards (i.e. incomplete ground truth matches). For example:
Provide support for those.
When hovering over or clicking on a dataset record in the benchmark section, show all records this one is related to (e.g. all true positives, false positives, false negatives and true negatives).
Currently, the tooltip that should explain a metric does not work.
Filtering / sorting / limiting for
are managed on the frontend at the moment. When the number of entries grows, this may lead to problems. We should therefore implement this functionality on the backend side (see the sketch below).
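A minimal sketch of backend-side limiting and sorting, assuming an Express route backed by SQLite; the route, table, and column names are illustrative:

```ts
import express from 'express';
import Database from 'better-sqlite3';

const app = express();
const db = new Database('snowman.db');

// Sorting and limiting happen in SQL, so only one page of rows
// crosses the wire instead of the whole table.
app.get('/datasets', (req, res) => {
  const limit = Math.min(Number(req.query.limit ?? 25), 100); // cap page size
  const startAt = Number(req.query.startAt ?? 0);
  const rows = db
    .prepare('SELECT * FROM datasets ORDER BY name LIMIT ? OFFSET ?')
    .all(limit, startAt);
  res.json(rows);
});
```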
The user runs their matching solution on different datasets. They can then upload the different result sets (based on the different datasets) into our tool and see averaged metrics, i.e. calculate a metric for each result set and then average those values (see the sketch below).
The user can also investigate how the developed algorithm performs on each dataset separately.
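A minimal sketch of the averaging step; the ResultSet shape and the choice of F1 as the metric are illustrative assumptions:

```ts
// Confusion counts of one result set on one dataset (illustrative shape).
interface ResultSet {
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
}

// F1 = 2 * tp / (2 * tp + fp + fn)
function f1({ truePositives: tp, falsePositives: fp, falseNegatives: fn }: ResultSet): number {
  return (2 * tp) / (2 * tp + fp + fn);
}

// Compute the metric per result set, then take the (macro) average.
function averagedF1(resultSets: ResultSet[]): number {
  const scores = resultSets.map(f1);
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```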
We need to be able to "upgrade" databases from older versions of our tool (e.g. when adding a new column, a default value must be supplied and an old version of the table should automatically get the new column on system load).
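A minimal sketch of forward-only migrations using SQLite's user_version pragma, assuming better-sqlite3; the table and column names are made up:

```ts
import Database from 'better-sqlite3';

// One entry per schema version; migrations[v] upgrades version v to v + 1.
const migrations: string[] = [
  // version 0 -> 1: new column with a default, so old rows stay valid
  `ALTER TABLE datasets ADD COLUMN description TEXT NOT NULL DEFAULT ''`,
];

// Run on system load: applies every migration the database has not seen yet.
export function upgradeDatabase(db: Database.Database): void {
  const current = db.pragma('user_version', { simple: true }) as number;
  for (let v = current; v < migrations.length; v++) {
    db.exec(migrations[v]);
    db.pragma(`user_version = ${v + 1}`);
  }
}
```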
Add a button in the frontend to "publish" a matching solution together with its soft KPIs. This should open a new GitHub issue with the matching solution configuration in its description (following the format of the example matching solutions).
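GitHub pre-fills a new issue from the title and body query parameters of the issues/new URL, so the button mostly needs to build a link. A minimal sketch (the body format is illustrative):

```ts
// Builds a link that opens a pre-filled "new issue" form on GitHub.
function buildPublishUrl(solutionName: string, configuration: object): string {
  const params = new URLSearchParams({
    title: `Matching solution: ${solutionName}`,
    // Wrap the configuration in a JSON code block for readability.
    body: '```json\n' + JSON.stringify(configuration, null, 2) + '\n```',
  });
  return `https://github.com/hpi-information-systems/snowman/issues/new?${params}`;
}
```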
Refactor tests and provide
Most of them are already implemented and only need to be used. See wrapper/src/api/database/setup/examples
To reproduce (on branch main):
We should add the following types of tests:
Additionally, the test coverage of our unit tests should be improved by testing edge cases like
Please include a list of my authors :)
As we are adding more and more experiment formats, the list of experiment formats gets too large to be comfortably usable. We should make it searchable in some way to improve usability.
Currently, it is only possible to compare two experiments. It should be possible to compare n experiments against one ground truth.
This includes:
- Frontend:
- Backend: extend the /benchmark/experiment-intersection/records and /benchmark/experiment-intersection/count routes to support n experiments (keeping the limit and startAt parameters)
Idea to optimize performance for the range query:
clusters: the count of groups in a subclustering with dimensions (d1, d2, d3) would be d1 * d2 * d3
Idea to optimize performance for the sort query:
The SideMenu rerenders correctly if both the state and the current path change, but it does not rerender automatically if only the current path changes (see the sketch below).
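A minimal sketch of one possible fix, assuming the frontend uses react-router: subscribing to useLocation() makes the menu rerender on every path change, so the active item can be derived instead of cached in state:

```tsx
import React from 'react';
import { useLocation } from 'react-router-dom';

interface MenuItem {
  path: string;
  label: string;
}

export function SideMenu({ items }: { items: MenuItem[] }) {
  // Every path change produces a new location object and a rerender.
  const { pathname } = useLocation();
  return (
    <ul>
      {items.map((item) => (
        <li key={item.path} className={pathname === item.path ? 'active' : ''}>
          {item.label}
        </li>
      ))}
    </ul>
  );
}
```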
Speed up the metrics route, e.g. by caching the result (see the sketch below).
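A minimal sketch of in-memory result caching; the cache key scheme and the computeMetrics callback are illustrative assumptions:

```ts
// Caches computed metrics per set of experiment IDs.
const metricsCache = new Map<string, unknown>();

function cachedMetrics(
  experimentIds: number[],
  computeMetrics: (ids: number[]) => unknown
): unknown {
  // Sort the IDs so [1, 2] and [2, 1] share one cache entry.
  const key = [...experimentIds].sort((a, b) => a - b).join(',');
  if (!metricsCache.has(key)) {
    metricsCache.set(key, computeMetrics(experimentIds));
  }
  return metricsCache.get(key);
}
// Remember to invalidate (metricsCache.clear()) whenever an experiment changes.
```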
The questionnaire for the soft KPIs should be dynamically rendered from a JSON definition file.
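A minimal sketch of what such a definition could look like as a typed structure; the field names are assumptions, not the actual soft-KPI schema:

```ts
// One entry per question; the frontend maps over this definition and
// renders one input per entry.
interface Question {
  id: string;
  label: string;
  type: 'text' | 'number' | 'choice';
  choices?: string[]; // only for type 'choice'
}

const softKpiQuestionnaire: Question[] = [
  { id: 'setupTime', label: 'How long did the setup take (hours)?', type: 'number' },
  {
    id: 'expertise',
    label: 'Required level of expertise',
    type: 'choice',
    choices: ['novice', 'intermediate', 'expert'],
  },
];
```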
For binary comparison, one of the two supplied experiments is regarded as the gold standard. We should make it clear in the UI which one the user wants to use as the gold standard.
When inspecting or modifying an experiment, or when benchmarking, add the possibility to preview the detected duplicates of a dataset as clusters.
See this OpenAPI generator
Describe the bug
The new release includes a new preloaded dataset. When the user already has a Snowman version installed and updates it, they do not see the newly added dataset because the database is not updated. Up to now, they have to delete the whole database in AppData.
Expected behaviour
Snowman should show every newly added dataset. For example, we should restart the database initialization process after installing a new version.
As a data scientist, I want to have a search field on the metrics viewer page to search for a specific metric.
Problem: The current uploader can only handle files up to ~1 GB.
Solution: Add support for partial (resumable) file uploads (see the sketch below).
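A minimal sketch of chunked upload from the browser, assuming a hypothetical /upload endpoint that understands Content-Range (e.g. a tus-style server):

```ts
// Uploads the file in fixed-size chunks so no single request exceeds
// the server's body size limit.
async function uploadInChunks(file: File, chunkSize = 8 * 1024 * 1024): Promise<void> {
  for (let start = 0; start < file.size; start += chunkSize) {
    const end = Math.min(start + chunkSize, file.size);
    await fetch('/upload', {
      method: 'PUT',
      headers: {
        // Tells the server which byte range of the file this chunk covers.
        'Content-Range': `bytes ${start}-${end - 1}/${file.size}`,
      },
      body: file.slice(start, end), // only this chunk crosses the wire
    });
  }
}
// On failure, the client can ask the server for the last received byte
// and resume from there instead of restarting the whole upload.
```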
Adding new experiment formats is pretty easy. Therefore, it makes sense to include a dedicated contribution guide just for this use case.
Should we put this in our docs or in a README inside the experiment formats folder?
@phpfs Do we have any UX concept for that issue?
Is your feature request related to a problem? Please describe.
It has to be possible to save the results of the soft KPIs in the database.
Should we think about a hosted version for better comparison?