Python framework for graph analytics and co-occurrence analysis

Home Page: https://bluegraph.readthedocs.io

License: Apache License 2.0

Blue Graph

Unifying Python framework for graph analytics and co-occurrence analysis.

About

BlueGraph is a Python framework that consolidates graph analytics capabilities from different graph processing backends. It provides the following set of interfaces:

preprocessing and co-occurrence analysis API providing semantic property encoders and co-occurrence graph generators;
graph analytics API providing interfaces for computing graph metrics, performing path search and community detection;
representation learning API for applying various graph embedding techniques;
representation learning downstream tasks API allowing the user to perform node classification, similarity queries, link prediction.

Using the built-in PGFrame data structure (currently, pandas-based implementation is available) for representing property graphs, it provides a backend-agnostic API supporting the following in-memory and persistent graph backends:

NetworkX (for the analytics API);
graph-tool (for the analytics API);
Neo4j (for the analytics and representation learning APIs);
StellarGraph (for the representation learning API);
gensim (for the representation learning API).

This repository originated from the Blue Brain effort on building a COVID-19-related knowledge graph from the CORD-19 dataset and analysing the generated graph to perform literature review of the role of glucose metabolism deregulations in the progression of COVID-19. For more details on how the knowledge graph is built, explored and analysed, see COVID-19 co-occurrence graph generation and analysis.

`bluegraph` package

BlueGraph's API is built upon 4 main packages:

bluegraph.core providing the exchange data structure for graph representation that serves as the input to graph processors based on different backends (PGFrame), as well as basic interfaces for different graph analytics and embedding classes (MetricProcessor, PathFinder, CommunityDetector, GraphElementEmbedder, etc).
bluegraph.backends is a package that collects implementation of various graph processing and analytics interfaces for different graph backends (for example, NXPathFinder for path search capabilities provided by NetworkX, Neo4jCommunityDetector for community detection methods provided by Neo4j, etc).
bluegraph.preprocess is a package that contains utils for preprocessing property graphs (e.g. SemanticPGEncoder for encoding node/edge properties as numerical vectors, CooccurrenceGenerator for generation and analysis of co-occurrence relations in PGFrames).
bluegraph.downstream is a package that provides a set of utils for various downstream tasks based on vector representations of graphs and graph elements (for example, NodeSimilarityProcessor for building and querying node similarity indices based on vector representation of nodes, EdgePredictor for predicting true and false edges of the graph based on vector representation of its nodes, EmbeddingPipeline for stacking pipelines of graph preprocessing, embedding, similarity index building, etc).
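
To make the idea of co-occurrence relations concrete, here is a minimal sketch in plain Python. It does not use BlueGraph's CooccurrenceGenerator; the function name `cooccurrence_edges` and the toy data are assumptions made for illustration only. Two entities are linked when they are mentioned in the same paper, and the edge weight is the number of papers they share.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_per_paper):
    """Count how many papers mention each unordered pair of entities."""
    counts = Counter()
    for entities in entities_per_paper.values():
        # De-duplicate and sort so that every pair has one canonical order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy input: paper identifier -> entities mentioned in the paper.
papers = {
    "paper-1": ["glucose", "covid-19", "insulin"],
    "paper-2": ["glucose", "covid-19"],
    "paper-3": ["insulin", "glucose"],
}
edges = cooccurrence_edges(papers)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

Each key of `edges` is a candidate edge of the co-occurrence graph, and the count can serve as an edge property such as a frequency.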

Main components of BlueGraph's API are illustrated in the following diagram:

`cord19kg` package

The cord19kg package contains a set of tools for interactive exploration and analysis of the CORD-19 dataset using the co-occurrence analysis of the extracted named entities. It includes data preparation and curation helpers, tools for generation and analysis of co-occurrence graphs. Moreover, it provides several interactive mini-applications (based on JupyterDash and ipywidgets) for Jupyter notebooks allowing the user to interactively perform:

entity curation;
graph visualization and analysis;
dataset saving/loading from Nexus.

`services` package

Collects services included as a part of BlueGraph. Currently, only a mini-service for retrieving embedding vectors and similarity computation is included as a part of this repository (see embedder service specific README).
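
As a rough illustration of what a similarity computation over embedding vectors involves, the sketch below ranks nodes by cosine similarity using only the standard library. It is not the embedder service's implementation or API; the names `cosine_similarity` and `most_similar` and the toy vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(embeddings, query, k=2):
    """Return the k nodes closest to `query`, best match first."""
    scores = [
        (node, cosine_similarity(embeddings[query], vector))
        for node, vector in embeddings.items()
        if node != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical two-dimensional node embeddings.
embeddings = {
    "glucose": [1.0, 0.0],
    "insulin": [0.9, 0.1],
    "virus": [0.0, 1.0],
}
neighbors = most_similar(embeddings, "glucose", k=1)
print(neighbors[0][0])
```

This brute-force scan is linear in the number of nodes per query, which is why a dedicated similarity index becomes worthwhile on larger graphs.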

Installation

It is recommended to use a virtual environment such as venv or conda environment.

Installing backend dependencies

If you want to use graph-tool as a backend, you need to install the library manually: it cannot be installed by simply running pip install, because it is not an ordinary Python library but a wrapper around a C++ library (see the graph-tool installation instructions). Currently, BlueGraph supports graph-tool<=2.37.

Similarly, if you want to use the bluegraph.downstream.similarity module for building similarity indices (on embedded nodes, for example), you should install the Facebook Faiss library separately. See the Faiss installation instructions (conda and conda-forge installation available).

You can install both graph-tool and the Facebook Faiss library by creating a new environment with the right dependencies using conda, as follows:

conda create --name <your_environment> -c conda-forge graph-tool==2.37 faiss python=<your_python>
conda activate <your_environment>

The same holds for the Neo4j backend: in order to use it, the database should be installed and started (see the Neo4j installation instructions). Typically, the Neo4j-based interfaces provided by BlueGraph require the database URI (the bolt port), username and password to be provided. In addition, BlueGraph uses the Neo4j Graph Data Science (GDS) library, which should be installed separately for the database on which you would like to run the analytics (see installation instructions). The currently supported Neo4j GDS version is >=1.6.1.

Installing BlueGraph

BlueGraph supports Python versions >= 3.7 and pip >= 21.0.1. To upgrade an older version of pip, run:

pip install --upgrade pip wheel setuptools

The stable version of BlueGraph can be installed from PyPI using:

pip install bluegraph

The development version of BlueGraph can be installed from the source by cloning the current repository as follows:

git clone https://github.com/BlueBrain/BlueGraph.git
cd BlueGraph

The basic version, which includes only the NetworkX backend, can be installed using:

pip install bluegraph

The prerequisites for using the graph-tool backend can be found in 'Installing backend dependencies'. You can also install additional backends for Neo4j and StellarGraph by running the following:

pip install bluegraph[<backend>]

Here <backend> is either neo4j or stellargraph.

Alternatively, a version supporting all the backends can be installed by running the following command:

pip install bluegraph[all]

In order to use the cord19kg package and its interactive Jupyter applications, run:

pip install bluegraph[cord19kg]

Getting started

The examples directory contains a set of Jupyter notebooks providing tutorials and use cases for BlueGraph.

To get started with the property graph data structure PGFrame provided by BlueGraph and to see an example of semantic property encoding, see the PGFrames and semantic encoding tutorial notebook.

To get familiar with the ideas behind the co-occurrence analysis and the graph analytics interface provided by BlueGraph, we recommend running the following example notebooks:

Literature exploration (PGFrames + in-memory analytics tutorial) illustrates how to use BlueGraph's analytics API for in-memory graph backends based on the NetworkX and the graph-tool libraries.
NASA keywords (PGFrames + Neo4j analytics tutorial) illustrates how to use the Neo4j-based analytics API for persistent property graphs.

Embedding and downstream tasks tutorial starts from the co-occurrence graph generation example and guides the user through graph representation learning and all its downstream tasks, including node similarity queries, node classification and edge prediction.

Create and run embedding pipelines illustrates how embedding pipelines can be built and executed using BlueGraph.

Finally, Create and push embedding pipeline into Nexus.ipynb illustrates how embedding pipelines can be created and pushed to Nexus, and Embedding service API shows how to use the embedding service that retrieves the embedding pipelines from Nexus.

Getting started with cord19kg

The cord19kg package provides examples of CORD-19-specific co-occurrence analysis. For more details on the CORD-19 analysis and exploration pipeline of the Blue Brain Project, see here.

We recommend starting from the Co-occurrence analysis tutorial notebook providing a simple starting example.

The Topic-centered co-occurrence network analysis of CORD-19 notebook provides a full analysis pipeline on the selection of 3000 articles obtained by searching the CORD-19 dataset using the query "Glucose is a risk factor for COVID-19" (the search is performed using BlueBrainSearch).

The Nexus-hosted co-occurrence network analysis of CORD-19 notebook provides an example for the previously mentioned 3000-article dataset, where datasets corresponding to different analysis steps can be saved and loaded to and from a Blue Brain Nexus project.

Finally, the generate_10000_network.py script allows the user to generate the co-occurrence networks for the 10'000 most frequent entities extracted from the entire CORD-19v47 database (based on paper- and paragraph-level entity co-occurrence). To run the script, simply execute python generate_10000_network.py from the examples folder.

Note that the generated networks are highly dense (contain a large number of edges, for example, ~44M edges for the paper-based network), and the process of their generation, even if parallelized, is highly costly.
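
To put the quoted edge count in perspective, the short calculation below estimates the density of the paper-based network. It assumes an undirected graph without self-loops or parallel edges on exactly 10'000 nodes, and it uses the approximate figure of 44M edges given above.

```python
n_nodes = 10_000
n_edges = 44_000_000  # approximate figure quoted in the note above

# Maximum number of edges in an undirected simple graph: n * (n - 1) / 2.
max_edges = n_nodes * (n_nodes - 1) // 2
density = n_edges / max_edges

print(f"{max_edges:,} possible edges, density ~ {density:.2f}")
# prints "49,995,000 possible edges, density ~ 0.88"
```

Under these assumptions, close to nine out of ten possible entity pairs are connected, which explains the cost of both generating and analysing the network.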

Licensing

Blue Graph is distributed under the Apache 2 license.
Included example scripts and notebooks (BlueGraph/examples and BlueGraph/cord19kg/examples) are distributed under the 3-Clause BSD License.
Data files stored in the repository are distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Funding & Acknowledgements

The development of this project was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

bluegraph's Issues

Use BlueGraph on Connectome Analysis

Hello BlueGraph Team,

I have been working on BBP circuits by representing them as graphs and analyzing their motifs, but couldn't find a way to visualize these dense graphs. Then I came across this platform thanks to Cyrille Favreau. Is there a Jira project assigned to this?

So far I have used networkx and igraph (more scalable, as I was told) for the analyses, but mostly I store my graph in adjacency matrix format and do operations with the scipy.sparse module. I started to use Neo4j, but the syntax is a bit annoying. So making a Python backend would be very nice and scalable when we integrate it into BlueGraph.

Here are my projects if I can have a nice UI for the graph:

Utilize the 5th floor screen, open deck to visualize the graph and do lots of cool stuff.
Connect it to the Python scripts we already have.
Make novel and visualization-based simulation analyses. See VIZTM-860, and also this cool company's work and demo video https://cambridge-intelligence.com/products/ on static and dynamic analyses of graph data.

Since we have this software and the brain is a graph, I say let's use it on it :)

All the best,
Kerem

Pandas append on data frames is removed starting at version 2.0

https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#removal-of-prior-version-deprecations-changes

BlueGraph uses the DataFrame append method, which was removed in pandas 2.0.
The requirements need to restrict pandas to versions below 2.0.

Upper bound on scikitlearn version

sklearn.neighbors._dist_metrics is not available above version 0.24.2 (i.e. starting at version 1.0.2).
The upper bound on the scikit-learn dependency should be 0.24.2.

Graph vis app: Get node positions when exporting GML

At the moment, there is no way to retrieve node positions dynamically generated by the layout (e.g. force layout, cose, etc) in Dash or by user dragging.

They are not stored in the Python graph object. They can possibly be retrieved on the JS side using Cytoscape.js; however, this solution is not trivial.

Created a Plotly Community Forum post: https://community.plotly.com/t/dash-cytoscape-returning-node-positions-from-layout/23818/2

Add means for evaluation/calibration of embedding models

We need a means of reserving a validation set (or even performing k-fold cross-validation) for the embedding models (usually based on link prediction, i.e. reserving sets of links for validation).
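
One possible shape for this, sketched in plain Python under the assumption that the graph is available as a list of (source, target) pairs: the hypothetical helper `k_fold_edge_splits` shuffles the edges once and holds out one fold at a time for validation. It is not part of BlueGraph.

```python
import random


def k_fold_edge_splits(edges, k=5, seed=0):
    """Yield (train_edges, validation_edges) pairs for k-fold validation."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled edges into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, validation


# Hypothetical toy graph: a path with 10 edges.
edges = [(i, i + 1) for i in range(10)]
splits = list(k_fold_edge_splits(edges, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

A full link-prediction evaluation would also need sampled non-edges as negative examples, and held-out edges may leave some nodes isolated in the training graph; the sketch leaves both concerns aside.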

Useful references:

Question on BlueGraph compared to other Libraries

How is BlueGraph different compared to CDLib (with iGraph and NetworkX)?

There is not enough variety of centrality measures compared to NetworkX
There are not enough community detection algorithms compared to CDLib
- https://github.com/GiulioRossetti/cdlib
- GiulioRossetti/cdlib#178

How is BlueGraph different compared to KarateClub (regarding the number of node embedding algorithms)?

https://github.com/benedekrozemberczki/karateclub

Create a wrapper for Gensim node embedder

It would be nice to create a new interface GensimNodeEmbedder that could allow us to use different Gensim models, adaptable to the node representation learning problem.

Integration with nexus-forge

BlueGraph's PGFrame needs to provide means for interacting with nexus-forge.

Currently, resources can be converted to pandas dataframes that can be easily transformed into nodes/edges of property graphs.

Possible questions:

Is there a way to design a forge Store for PGFrame?
In that case, what is a resource (a node/edge/property or an entire PGFrame)?

Add rdflib to requirements

Error downloading data from Blue Brain Nexus in the Colab Notebook

Downloading dataset from Nexus fails, i.e. the cell containing:

download_from_nexus(
    uri=f"{nexus_endpoint}/resources/{nexus_bucket}/_/1e01e1a2-133f-4833-9fe0-93230384b95f",
    output_path=DATA_PATH, config_file_path=nexus_config_file,
    nexus_endpoint=nexus_endpoint, nexus_bucket=nexus_bucket, unzip=True)

outputs

Downloading the file to 'BlueGraph/cord19kg/examples/data/Glucose_risk_3000_papers.csv.zip'
<action> _retrieve_filename
<error> DownloadingError: file 'data:_/f72f15bf-0a64-49a3-ab60-48c8575c4bb2' not found in project 'covid19-kg/data'

Decompressing ...

---------------------------------------------------------------------------

FileNotFoundError                         Traceback (most recent call last)

<ipython-input-28-ab35afac1260> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'download_from_nexus(\n    uri=f"{nexus_endpoint}/resources/{nexus_bucket}/_/1e01e1a2-133f-4833-9fe0-93230384b95f",\n    output_path=DATA_PATH, config_file_path=nexus_config_file,\n    nexus_endpoint=nexus_endpoint, nexus_bucket=nexus_bucket, unzip=True)\n\ndata = pd.read_csv(f"{DATA_PATH}/Glucose_risk_3000_papers.csv")\nprint("Done.")')

4 frames

<decorator-gen-53> in time(self, line, cell, local_ns)

<timed exec> in <module>()

/usr/lib/python3.7/zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel)
   1238             while True:
   1239                 try:
-> 1240                     self.fp = io.open(file, filemode)
   1241                 except OSError:
   1242                     if filemode in modeDict:

FileNotFoundError: [Errno 2] No such file or directory: 'BlueGraph/cord19kg/examples/data/Glucose_risk_3000_papers.csv.zip'

Add support for multigraphs in PGFrames

Currently, only one edge between a pair of nodes is allowed. For some use cases, it would be useful to support multiedges. For example:

Alice and Bob have two relationships, friends and colleagues, with different properties. It could be represented as:

@source_id, @target_id, @edge_id, @type, @since
"Alice", "Bob", 1, "friends", 1999
"Alice", "Bob", 2, "colleagues", 1998

Where the index of the dataframe with edges is given by @source_id, @target_id, @edge_id
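
The proposed three-part index can be illustrated without PGFrame at all. In the sketch below a plain dictionary keyed by (@source_id, @target_id, @edge_id) stands in for the edge dataframe; the helper `edges_between` is a hypothetical name used only for this example.

```python
# Edge table keyed by (@source_id, @target_id, @edge_id), as in the example above.
edges = {
    ("Alice", "Bob", 1): {"@type": "friends", "@since": 1999},
    ("Alice", "Bob", 2): {"@type": "colleagues", "@since": 1998},
}


def edges_between(edges, source, target):
    """Return all parallel edges from `source` to `target`, keyed by edge id."""
    return {
        edge_id: props
        for (s, t, edge_id), props in edges.items()
        if (s, t) == (source, target)
    }


alice_bob = edges_between(edges, "Alice", "Bob")
for edge_id, props in sorted(alice_bob.items()):
    print(edge_id, props["@type"], props["@since"])
```

Because @edge_id is part of the key, both relationships coexist, whereas a key made of (@source_id, @target_id) alone would let the second row overwrite the first.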

Curation app: add removing specific entities from the table

Proposed features:

Remove entities of a given type
Remove entities of a given length (now we can remove entities with the length == 2)
Remove entities made of punctuation or of digits
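
The three proposed filters could be combined into a single predicate along the following lines. This is only a sketch: `keep_entity`, its parameters and the toy table are assumptions, not the curation app's code.

```python
import string

# Characters that do not count as meaningful entity content.
JUNK = set(string.punctuation + string.digits + string.whitespace)


def keep_entity(entity, entity_type, excluded_types=(), excluded_lengths=(2,)):
    """Return True if the entity passes all three proposed filters."""
    if entity_type in excluded_types:
        return False
    if len(entity) in excluded_lengths:
        return False
    # Drop entities made only of punctuation, digits or whitespace.
    if all(ch in JUNK for ch in entity):
        return False
    return True


# Hypothetical curation table rows: (entity, entity type).
table = [
    ("glucose", "CHEMICAL"),
    ("19", "CHEMICAL"),
    ("lung", "ORGAN"),
    ("12345", "CHEMICAL"),
]
curated = [
    (entity, entity_type)
    for entity, entity_type in table
    if keep_entity(entity, entity_type, excluded_types={"ORGAN"})
]
print(curated)
```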

Add graph embeddings

(Unsupervised/supervised) Graph embedding constructs embedding vectors for entire graphs.

This makes it possible to perform the following downstream tasks:

graph classification
graph clustering

BlueGraph should implement graph embedding and the respective downstream tasks.

Add optional static typing to BlueGraph

Would it make sense to add typing to BlueGraph, so that additional checks can be performed using mypy? Moreover, it would contribute to the project documentation.
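
For illustration, this is what an annotated helper might look like. The function `degree_counts` is a made-up example rather than part of BlueGraph; it only shows the kind of signature mypy can check.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def degree_counts(edges: Iterable[Edge]) -> Dict[Hashable, int]:
    """Count how many edge endpoints touch each node."""
    degrees: Dict[Hashable, int] = {}
    for source, target in edges:
        degrees[source] = degrees.get(source, 0) + 1
        degrees[target] = degrees.get(target, 0) + 1
    return degrees


edges: List[Edge] = [("a", "b"), ("b", "c")]
degrees = degree_counts(edges)
print(degrees)
```

With annotations like these, a call such as degree_counts(3) is reported by mypy before the code runs, and the signature doubles as documentation of the expected input.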

Graph vis app: fix glitching labels

When the visualization app is used, after some point (switching between graphs, merging or removing nodes) the node labels start to glitch on zoom in/out. They disappear and/or are not displayed correctly.

This bug needs to be investigated further in order to isolate the scenarios in which the glitching appears. It is possible that the source is some internal dash-cytoscape bug.
