aleksbobic / csx

Collaboration Spotting X - A network-based information retrieval and visual-analytics application

License: MIT License

HTML 0.13% JavaScript 88.93% SCSS 0.26% Python 10.68%

information-retrieval data-visualization data-visualisation graph-visualization graph-visualisation network-visualization network-visualisation graph-analysis graph-analytics network-analysis

csx's Introduction

I'm a software engineer with multiple years of experience developing full-stack data-driven solutions and a PhD in computer science focusing on visual analytics, network science and information retrieval.
🌱 I’m currently working on multiple projects that explore a variety of techniques and technologies and aim to solve real-world problems.
🌍 My webpage

csx's People

Forkers

rooronoazoro cyberpeace-institute hedyraymond gdn0101 mr2cool

csx's Issues

Search Hints

Is your feature request related to a problem? Please describe.
Users should be given hints when searching through text fields and dropdowns when searching through categorical values.

Describe the solution you'd like
This will also require that we extract information about categorical values before populating the dataset.

Introduce internal representation of graphs on backend

Is your feature request related to a problem? Please describe.
Current user graphs should be stored on the backend in a Redis instance and be accessible for further processing.

Describe the solution you'd like
The graphs should be stored by providing a unique user id (should be generated when the user first opens the frontend) and additional metadata about the dataset, query and the full graph structure together with node and edge properties. This should be accessible from the entire backend.
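
A minimal sketch of how such storage could look on the Python backend. The key layout `csx:graph:<user_id>`, the payload fields and the `InMemoryStore` stand-in are assumptions made for illustration, not part of CSX. Any client with redis-py style `set` and `get` methods could be passed in as `store`.

```python
import json
import uuid


def new_user_id() -> str:
    """Generate a unique id, e.g. when the user first opens the frontend."""
    return uuid.uuid4().hex


def graph_key(user_id: str) -> str:
    # Hypothetical key layout; the issue does not prescribe one.
    return f"csx:graph:{user_id}"


def store_graph(store, user_id, dataset, query, nodes, edges):
    """Serialise the graph plus its metadata and write it under the user's key."""
    payload = {
        "meta": {"dataset": dataset, "query": query},
        "nodes": nodes,  # list of dicts holding node properties
        "edges": edges,  # list of dicts holding edge properties
    }
    store.set(graph_key(user_id), json.dumps(payload))


def load_graph(store, user_id):
    """Return the stored graph, or None if nothing is stored for this user."""
    raw = store.get(graph_key(user_id))
    return None if raw is None else json.loads(raw)


class InMemoryStore:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)
```

Keeping the graph as one JSON value per user keeps reads simple. Whether that scales to large graphs would need to be checked.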

Chart counts not calculated properly

Describe the bug
Wrong value frequencies in charts

Customize charts

Is your feature request related to a problem? Please describe.
Add custom title, info and labels for charts as part of the add new chart module.

Provide smart restrictions for users

Is your feature request related to a problem? Please describe.
Users should be discouraged from performing questionable actions in CSX.

Describe the solution you'd like

~~1. If a user tries to connect an array feature with a single value feature they should not be able to form a 1:1 connection. (similar logic for other connections)~~
~~2. Users should only be able to add new properties to single value features in the overview graph.~~
3. Users should only be able to use appropriate search/filter nodes with appropriate values.

This should be generalised to all actions

List properties for list anchors

Is your feature request related to a problem? Please describe.
If a user selects a list feature as an anchor and another list feature as its property, the mapping should be 1:1. If the two lists differ in length (e.g. there are more author names than h-indexes), a missing prop value should be added.
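
A small sketch of the positional pairing, assuming both features arrive as Python lists. The `MISSING` placeholder and the function name are hypothetical, and padding whichever list is shorter is an assumption of this sketch.

```python
from itertools import zip_longest

MISSING = "missing"  # hypothetical placeholder for an absent property value


def pair_list_values(anchor_values, property_values, missing=MISSING):
    """Pair anchor and property values by position, padding the shorter list."""
    return list(zip_longest(anchor_values, property_values, fillvalue=missing))
```

For example, three author names and two h-indexes would give three pairs, with the last author paired with the placeholder.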

Consistent styles for light mode

Is your feature request related to a problem? Please describe.
Users should be able to explore their datasets using CSX in light mode as well.

Describe the solution you'd like
Every CSX style should be adapted for light mode.

Export graphs to Neo4j

Is your feature request related to a problem? Please describe.
Users should be able to export their graphs in a format which can be imported directly into Neo4j while preserving all of the graph properties.
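
One possible export format is the CSV layout understood by Neo4j's bulk importer, which uses the header fields `:ID`, `:LABEL`, `:START_ID`, `:END_ID` and `:TYPE`. The sketch below assumes a hypothetical in-memory shape for nodes and edges (`id`, `label`, `properties`, `source`, `target`, `type`). It only carries node properties. Edge properties could be added as extra columns in the same way.

```python
import csv
import io


def to_neo4j_csv(nodes, edges):
    """Return (nodes_csv, relationships_csv) as strings with import headers."""
    # Collect every property name so each node row has the same columns.
    prop_keys = sorted({k for n in nodes for k in n["properties"]})

    node_buf = io.StringIO()
    writer = csv.writer(node_buf)
    writer.writerow(["id:ID", ":LABEL", *prop_keys])
    for n in nodes:
        row = [n["id"], n["label"]]
        row.extend(n["properties"].get(k, "") for k in prop_keys)
        writer.writerow(row)

    edge_buf = io.StringIO()
    writer = csv.writer(edge_buf)
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for e in edges:
        writer.writerow([e["source"], e["target"], e["type"]])

    return node_buf.getvalue(), edge_buf.getvalue()
```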

Deselect when in direct connection view

Describe the bug
The application crashes when deselecting a node in any of the special views.

To Reproduce
Select multiple nodes and go into one of the special views using the top left menu.
Deselect one of the nodes while in the special view.

Simplify default settings by removing default visible nodes and automatically identifying list features

Is your feature request related to a problem? Please describe.
Simplify default settings by removing default visible nodes and automatically identifying list features

Describe the solution you'd like
Users should only define the anchor, links and default searchable nodes. Features that are list should be automatically assigned the list type.

Additional context
Should be implemented after #27 and #24

Handle missing config files

Describe the bug
When a dataset config file is missing CSX crashes.

To Reproduce

Delete an automatically generated config file.
Run CSX (but leave the dataset in Elasticsearch).

Expected behavior
It should notify the user that a config file is missing or autogenerate a new config file.
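
A sketch of the "autogenerate or notify" behavior, assuming configs are JSON files named after the dataset. The file layout, `default_config` and its fields are hypothetical. The returned flag lets the caller decide whether to notify the user.

```python
import json
from pathlib import Path


def default_config(dataset_name: str) -> dict:
    # Hypothetical minimal config; the real CSX config has its own fields.
    return {"dataset": dataset_name, "autogenerated": True}


def load_or_create_config(config_dir, dataset_name):
    """Return (config, was_missing); write a fresh file if none exists."""
    path = Path(config_dir) / f"{dataset_name}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")), False
    config = default_config(dataset_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config, True
```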

Remove links and possibly anchor from detail schema

Describe the solution you'd like
The links do not have any influence on the detail schema so they should be removed.
Since the detail schema is a DAG consider removing the requirement of an anchor and directly identify shortest paths from nodes to other nodes. The paths that overlap should be merged together.
After this is implemented the users should not have to define any anchors or links in the detail schema and should be able to just define connections in the schema.

This issue might also fix #26

Advanced statistic for selected nodes

Is your feature request related to a problem? Please describe.
Users should have access to advanced statistics in the selection menu

Describe the solution you'd like
Users should be able to add their own custom graphs to the statistic menu. When a user clicks a plus button they can choose from multiple types of visualisations (pie, line, bar chart) and features/properties to visualise. They should also be able to select the width of the chart (e.g. half width or full width).

Additionally there should be an option to show the comparison to the full graph for some of the features (e.g. number of nodes in selection vs number of nodes in entire graph)

Introduce basic frontend unit tests for critical larger components

Is your feature request related to a problem? Please describe.
Developers (as well as CI/CD pipelines) should be able to run frontend unit tests to make sure the entire application works as expected.

Describe the solution you'd like
This issue should serve as a first step towards a full test coverage of the frontend.
As part of this issue Jest should be introduced together with tests for critical parts of the CSX frontend (TBD which components exactly need testing).

Automatically handle null values

Is your feature request related to a problem? Please describe.
When uploading a dataset the users should be provided with an option to fill the null values automatically.

Describe the solution you'd like
When a dataset is provided to CSX it should automatically detect columns with null values. Based on the type of the column it should provide a few suggestions for null value alternatives as well as a text field asking the user to enter a specific value they would like to use instead of a suggestion.
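
A sketch of the detection step, assuming rows are already parsed into dicts of strings (e.g. via `csv.DictReader`). The set of null spellings and the concrete suggestions (zero and mean for numeric columns, placeholder strings otherwise) are assumptions chosen for illustration. A user-entered value would be offered alongside them.

```python
NULL_TOKENS = {"", "null", "none", "nan", "n/a"}  # assumed null spellings


def is_null(value) -> bool:
    return value is None or str(value).strip().lower() in NULL_TOKENS


def is_number(value) -> bool:
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def suggest_null_fills(rows):
    """Map each column that contains nulls to a list of suggested fill values."""
    suggestions = {}
    columns = rows[0].keys() if rows else []
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [v for v in values if not is_null(v)]
        if len(present) == len(values):
            continue  # no nulls in this column
        if present and all(is_number(v) for v in present):
            numbers = [float(v) for v in present]
            suggestions[column] = [0.0, sum(numbers) / len(numbers)]
        else:
            suggestions[column] = ["unknown", "missing"]
    return suggestions
```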

Louvain and alternatives

Is your feature request related to a problem? Please describe.
Users should be able to run community detection algorithms on graphs and view/select the resulting communities the same way they can with components.

Describe the solution you'd like
Users should have the option to run Louvain and potentially other reasonably fast algorithms for community detection on their graphs. The community detection should be executed in the background without interfering with the user's interactions with the app and only enrich the existing data once the calculation is done. Once the data is enriched the users should also be presented with a list of communities and with a community color schema when exploring the graph.

The community detection algorithm should run on the internal representation of a graph stored in redis (implemented in #40).

Additional context
This feature should be implemented only once #40 is done.

Update table view based on direct connection exploration actions

Is your feature request related to a problem? Please describe.
Users should be able to filter the tabular data through direct connection exploration (e.g. looking at direct connections of a keyword node connected to multiple papers should only leave those paper entries in the table on the right side)

Enable sorting in charts

Is your feature request related to a problem? Please describe.
Charts with properties such as age should be sortable also by values (i.e. values of age).

Additional context
When properties cannot be sorted by value we need to disable this option or sort by something else (e.g. alphabetically vs. by size).

Search through lists not working

Describe the bug
Searching through list values uses fuzzy retrieval and retrieves whatever is vaguely relevant. So you might

Additional context
Look into using https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

Introduce basic backend unit tests

Is your feature request related to a problem? Please describe.
Developers (as well as CI/CD pipelines) should be able to run backend unit tests to make sure the entire application works as expected.

Describe the solution you'd like
This issue should serve as a first step towards a full test coverage of the backend.
As part of this issue pytest (since FastAPI's testing documentation already relies on it) should be introduced together with tests for critical parts of the CSX backend (TBD which components exactly need testing).

Drag & drop datasets

Is your feature request related to a problem? Please describe.
CSX users shouldn't have to run Python scripts or write config files to get their custom dataset into the tool. There should be a way for users to just drag and drop their dataset and be guided through a set of steps which will result in an automatic generation of a configuration for the dataset.

Describe the solution you'd like
Users should be able to perform the following steps:

Drag and drop a .csv / .xlsx file
CSX should ask the user for the dataset name as well as propose the same name as the filename
CSX should automatically detect column data types (e.g. is it a string, list, number, date, etc.) and provide the suggestions to the user
The user should have an option to define a different type for each column or continue
Users should be asked a set of questions for creating the settings file (e.g. default search features etc.)
~~Users should be asked to define a default schema for the overview graph~~
Users should be warned if the suggested overview schema crosses a threshold of nodes/edges by calculating the ratio between the unique values in the anchor and unique values in the link and comparing it to a threshold
~~Users should be asked to define a default schema for the overview graph, they should be warned if they try connecting features with invalid cardinalities~~
The dataset should show up in the list of existing datasets
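
The type detection step in the list above could be sketched as follows, assuming column values arrive as raw strings from the uploaded file. The recognised types, the single `%Y-%m-%d` date format and the JSON-style list syntax are assumptions. The result is only a suggestion that the user can override.

```python
import json
from datetime import datetime


def infer_value_type(value: str) -> str:
    """Classify one raw string as number, date, list or string."""
    text = value.strip()
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        pass
    try:
        if isinstance(json.loads(text), list):
            return "list"
    except ValueError:
        pass
    return "string"


def infer_column_type(values) -> str:
    """Suggest one type for a column; mixed or empty columns fall back to string."""
    types = {infer_value_type(v) for v in values if v.strip()}
    return types.pop() if len(types) == 1 else "string"
```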

Node selection retrieval

Expand graph by selecting a few nodes and searching for new results based on these node values (either narrow expansion where all of the properties have to be satisfied or broad expansion where any of the properties can be satisfied) Essentially running a search with AND or OR and expanding existing graph with new nodes and connections. This means the tabular data has to be merged and the graph has to be recalculated.

Also in this case you have to add search nodes to the advanced search that will represent this search.
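
The narrow/broad distinction maps naturally onto an Elasticsearch `bool` query, where `must` clauses behave like AND and `should` clauses with `minimum_should_match` behave like OR. The sketch below only builds the query body. The field names passed in and the choice of `match_phrase` are assumptions.

```python
def expansion_query(selected, mode="narrow"):
    """Build an Elasticsearch bool query from selected node values.

    `selected` maps a field name to the value of a selected node.
    """
    clauses = [{"match_phrase": {field: value}} for field, value in selected.items()]
    if mode == "narrow":
        # Every selected property has to be satisfied (AND).
        return {"query": {"bool": {"must": clauses}}}
    if mode == "broad":
        # Any selected property may be satisfied (OR).
        return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}
    raise ValueError(f"unknown mode: {mode}")
```

The results of such a query would then still need to be merged with the existing tabular data before the graph is recalculated.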

Enable deletion of nodes from search workflow and saving of workflows

Is your feature request related to a problem? Please describe.
Users should be able to ~~delete particular nodes from the workflow~~ and save workflows by giving them a name

Detail schema without anchor and links

Is your feature request related to a problem? Please describe.
Users should not have to define anchor and links properties in the detail graph. Explore if there is a possibility of integrating the detail graph generation without the need for a defined anchor node.

Default schema generation in dataset upload form

Is your feature request related to a problem? Please describe.
When uploading a dataset users should be able to define a default schema for the dataset they are uploading.

Describe the solution you'd like

Users should be asked to define a default schema for the overview graph
Users should be warned if the suggested overview schema crosses a threshold of nodes/edges by calculating the ratio between the unique values in the anchor and unique values in the link and comparing it to a threshold
Users should be asked to define a default schema for the overview graph, they should be warned if they try connecting features with invalid cardinalities
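
The ratio check described above could be sketched like this. The default threshold of 10 is a placeholder, and treating a schema without link values as safe is an assumption.

```python
def overview_too_large(anchor_values, link_values, threshold=10.0):
    """Return True when unique anchors per unique link value exceed the threshold."""
    unique_anchor = len(set(anchor_values))
    unique_link = len(set(link_values))
    if unique_link == 0:
        return False  # no link values means no edges to warn about
    return unique_anchor / unique_link > threshold
```

Many unique anchors combined with few unique link values means many anchors share each link value, which is what produces very dense overview graphs.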

Additional context
#1 should be done before

Faster table view performance and more features

Is your feature request related to a problem? Please describe.
The table view should be a first class citizen just like the network view. Therefore, users should be able to load large amounts of data in the table quickly and perform various filtering operations on them.

Describe the solution you'd like
Explore using glide-data-grid for the table view. If it performs better than the existing one use the glide-data-grid and enable users sorting and searching through data entries in the table using it. If the table is filtered down through search the network should be filtered as well.

Pre-compute nodes

Overview graph links AND connection

Is your feature request related to a problem? Please describe.
Users should be able to demand graphs with connections where all link features defined in the overview schema are present in each edge.

Describe the solution you'd like
If a user defines authors and countries as links of the overview schema, the overview graph view should present the user with a graph where only edges with both link features are present.
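
A sketch of the AND semantics, assuming each entry is a dict with an `id` and one list of values per link feature (a hypothetical shape). An edge is kept only if the two entries share at least one value in every link feature.

```python
from itertools import combinations


def and_edges(entries, link_features):
    """Return pairs of entry ids that share a value in every link feature."""
    edges = []
    for a, b in combinations(entries, 2):
        if all(set(a[f]) & set(b[f]) for f in link_features):
            edges.append((a["id"], b["id"]))
    return edges
```

This compares every pair of entries, so it is quadratic in the number of entries and only meant to illustrate the rule.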

Selecting single node with no connection results in a crash

Describe the bug
Selecting a single node with no connections causes CSX to crash.

To Reproduce
Steps to reproduce the behavior:

Search for something
Select a node with no connections

Expected behavior
Selecting a single node with no connections should work the same way as selecting nodes with connections.

Cut graphs

Is your feature request related to a problem? Please describe.
Users should be able to "cut out" parts of graphs and explore only those in either overview or detail view.

Describe the solution you'd like
When a user selects a part of the graph there should be an option to "cut" that part of the graph. This should enable users to narrow down their information need.

Add core decomposition as part of this issue.

Return only nodes if there are no links defined in overview schema

Is your feature request related to a problem? Please describe.
Return only nodes if there are no links defined in overview schema

Smart guide for feature selection in overview graph

Is your feature request related to a problem? Please describe.
Users should be warned if the overview graph has potential to be extremely large if they select a feature that has many unique values as an anchor and features with low number of unique values as links.

Describe the solution you'd like
Users should see a popup / button to visualise the same properties as a detail graph in case they select a combination of link and anchor features which might lead to very large graphs (e.g. more than 3k links).

Additional context
The unique values should be pre-calculated and integrated as part of the drag and drop feature described in #1 and this issue should not be done before it.

Enable clicking on charts to filter out graph data

Is your feature request related to a problem? Please describe.
Users should be able to filter out graphs by clicking on a chart element.

Additional context
Make sure this works also with the exploration menu in the top left corner

Same type "citation like" reference in schema

Is your feature request related to a problem? Please describe.
Users should be able to explore networks of self-referencing nodes. E.g. twitter networks where users follow other users or citation networks where papers reference other papers.

Describe the solution you'd like
Users should choose the dataset feature representing the "relationship" between the current entry and the related entries. This introduces multiple complications:

Which feature is the source and which feature is the destination of this connection? This property should be defined in the overview and detail schema (e.g. if a user selects the feature column "followers" they should also select its type "user_id").
To represent this in the overview schema we need to introduce a new property on the link nodes which expresses that the link is a "same type reference" link.
To represent this in the detail schema we either have to convert our DAG into just a DG (directed graph) or introduce a new property which can be toggled on the nodes.
Processing this will most likely be time-demanding since we're dealing with a column of links. Therefore, when users add their data we should ask them to specify whether any of the columns express "explicit connections to other nodes".

Additional context
#1 Should be done first since we need to ask users in advance for the data types

Sliding window pivot

Support numeric values

Is your feature request related to a problem? Please describe.
Users should be able to leave the numeric values in their dataset as they are and CSX should be able to process them.

Describe the solution you'd like
Currently CSX does not digest numeric values which limits the visualisations and operations which can be performed (e.g. we can't do binning or scale-based color schemas based on values of nodes)

Provide better except feedback

Is your feature request related to a problem? Please describe.
Users should be able to understand what went wrong on the backend (or frontend) by seeing an error message popup on the frontend (currently that is not the case).

Describe the solution you'd like
For example: if a connection to Elasticsearch cannot be established the user should get that as an error message or at least a particular code indicating there is an issue with Elasticsearch.

Simple view

Is your feature request related to a problem? Please describe.
Users should be able to toggle a simple view in which they can only explore predefined schemas, view stats for the entire graph and selection, view the list view and have no advanced toggling and search options. This could serve as a sort of a presentation mode. If possible as part of this issue explore how to introduce a flag in the docker build process to generate a CSX which only enables users to run a simple text search and view networks in a presentation mode.

Overwrite datasets

Is your feature request related to a problem? Please describe.
Users should be able to overwrite datasets by dragging and dropping new datasets with the same name while still keeping the configuration of the old dataset, with the possibility of changing it if the dataset changed.

Additional context
#1 should be done before this

Store findings

Is your feature request related to a problem? Please describe.
Users should be able to store a particular state (e.g. particular search, combined with a particular view and settings) of CSX as a finding which must be given a name and can be given a description. Users should be able to click on these findings from the homepage and immediately navigate to the finding.

Additional context
This feature requires user accounts and storage of existing graphs.

Automatic schema creation from basic settings

Is your feature request related to a problem? Please describe.
When users provide default dataset settings they should also see a detailed graph connected based on these settings.

Describe the solution you'd like
If a user defines feature X as the default anchor and feature Y and Z as the default links then they should see this reflected not only on the overview network but also on the detail network. E.g. the default schema in that case should be feature X connected directly to feature Y and feature Z with appropriate relationships.

Additional context
#24 should be done before this since it is needed to infer appropriate relationships.

View entire graph

Is your feature request related to a problem? Please describe.
In addition to searching through the provided datasets users should be able to view the entire dataset.

Describe the solution you'd like
In addition to the three buttons next to each of the datasets there should be a fourth one offering the option to view the entire dataset.

Parallel coordinates as node stats

Is your feature request related to a problem? Please describe.
Users should be able to view selected node property distributions as node stats.

Describe the solution you'd like
When users select multiple nodes with properties on them these should show up in a parallel coordinates plot in the "selected" stats on the right sidebar.

Additional context
#11 should be done before this

Searching for words with an apostrophe results in empty search results

Describe the bug
When searching for words which include an apostrophe (e.g. haven't) we get an empty set of results.

Expected behavior
Searching for words with apostrophes should work the same way as searching for any other kind of word.

Dataset overview, holistic exploration and deletion

Is your feature request related to a problem? Please describe.
Users should see all available datasets as a scrollable list or a scrollable grid.
They should have the option to explore the entire dataset or to delete the dataset from the index.

Additional context
#1 Should be done before this

Tabular filtering not working in detail view

Advanced color-schema picker and heatmap color schema

Is your feature request related to a problem? Please describe.
Users should be able to select different color schemas and not just the default for each feature.

Describe the solution you'd like
Users should see the possible properties which can be used as color schema properties in a dropdown.
Additionally they should be presented with a dropdown / toggle for choosing between different color schemas for a certain color schema property.

Subtasks:

Create a branch called issue-6 from develop and use it for the development of this task.
Add another dropdown with a label called "color schema" and add 3-5 color schemas (from libraries) Note that color schemas for numeric variables and categorical variables are most likely different.
Add an option to define a color for each categorical feature (users should be provided a color picker)
Open PR to develop

Additional context
#5 should be finished first.

Go to detail graph
Combine 4 nodes in the following way: A -1:1 -> B, A -1:1-> C, D -1:1-> A
regenerate the graph (it crashes)

Expected behavior
The expected behavior would be for it not to crash.