
graphlet's Introduction

Graphlet AI Property Graph Factory

Our mascot Orbits the Squirrel has 5 orbits. Everyone knows this about squirrels!

This is the PyPI module for the Graphlet AI Property Graph Factory for building enterprise knowledge graphs as property graphs. Our mission is to create a PySpark-based wizard for building large knowledge graphs in the form of property graphs, making them easier to build for fewer dollars and with less risk.

Motivation

A 100-slide presentation on Graphlet AI explains where we are headed! The motivation for the project is described in Property Graph Factory: Extract, Transform, Resolve, Model, Predict, Explain.

DataCon LA 2022 Graphlet AI Presentation

A video of this presentation is available.

The knowledge graph and graph database markets have long asked themselves: why aren't we larger? The vision of the semantic web was that many datasets could be cross-referenced between independent graph databases to map all knowledge on the web from myriad disparate datasets into one or more authoritative ontologies, which could be accessed by writing SPARQL queries that work across knowledge graphs. The reality of dirty data made this vision impossible. Most time is spent cleaning data which isn't in the format you need to solve your business problems. Multiple datasets in different formats each have quirks. Deduplicating data using entity resolution is an unsolved problem for large graphs. Once you merge duplicate nodes and edges, you rarely have the edge types you need to make a problem easy to solve. It turns out the most likely type of edge in a knowledge graph that solves your problem easily is defined by the output of a Python program using machine learning. For large graphs, this program needs to run on a horizontally scalable platform like PySpark and extend it, rather than be isolated inside a graph database. The quality of the developer experience is critical. In this talk I will review an approach to an Open Source Large Knowledge Graph Factory built on top of Spark that follows the ingest / build / refine / publish / query model that open source big data is based upon.

    --Russell Jurney in Knowledge Graph Factory: Extract, Transform, Resolve, Model, Predict, Explain

Core Features

This project is new; some features we are building are:

  1. Create Pandera / PySpark utilities graphlet.etl for transforming multiple datasets into a uniform ontology

  2. Create a generic, configurable system for entity resolution of heterogeneous networks

  3. Create an efficient pipeline for computing network motifs and aggregating higher order networks

  4. Implement efficient motif searching via neural subgraph matching

Scale Goals

Graphlet AI is a knowledge graph factory designed to scale to 10B node property graphs with 30B edges.

If your network is 10K nodes, let me introduce you to networkx :)

Developer Setup

This project is in a state of development; things are still forming and changing. If you are here, it must be to contribute :)

Dependencies

We manage dependencies with poetry; they are declared (along with most settings) in pyproject.toml.

To install poetry, run:

curl -sSL https://install.python-poetry.org | python3 -

Then upgrade to poetry 1.2b3 (required for the Pydantic non-binary install):

poetry self update --preview

To build the project, run:

poetry install

To add a PyPi package, run:

poetry add <package>

To add a development package, run:

poetry add --dev <package>

If you edit pyproject.toml, you must run the following to regenerate poetry.lock:

poetry update

Pre-Commit Hooks

We use pre-commit to run black, flake8, isort and mypy. This is configured in .pre-commit-config.yaml.

VSCode Settings

The following VSCode settings are defined for the project in .vscode/settings.json to ensure code is formatted consistently with our pre-commit hooks:

{
    "editor.rulers": [90, 120],
    "[python]": {
        "editor.defaultFormatter": "ms-python.python",
        "editor.formatOnSave": true,
        "editor.codeActionsOnSave": {"source.organizeImports": true},
    },
    "python.jediEnabled": false,
    "python.languageServer": "Pylance",
    "python.linting.enabled": true,
    "python.formatting.provider": "black",
    "python.sortImports.args": ["--profile", "black"],
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": true,
    "autoDocstring.docstringFormat": "numpy",
    "mypy.dmypyExecutable": "~/opt/anaconda3/envs/graphlet/bin/dmypy"
}

System Architecture

The system architecture for Graphlet AI is based on a standard "Delta Architecture" that ingests, transforms, refines and publishes data to a graph database built on top of a search engine, served alongside an MLOps platform for ML APIs.

Graphlet AI System Architecture

This architecture is intended to optimize the construction of large property graphs from multiple data sources and, eventually, from NLP: information extraction and entity linking.

How do you build a knowledge graph as a property graph? What is a property graph factory?

The process of building a knowledge graph - a property graph - out of multiple large (and many small) datasets is described below. This is the process we are optimizing.

  1. Assess the input datasets and come up with the Pandera ontology classes - what your graph will look like. I am using films as an example for the test dataset... horror.csv, comedy.csv, directors.csv... and it becomes Movies, Actors, Directors, Awards. So you create those classes and the Directed, ActedIn, Won, etc. edges... as Pandera classes.

    How property graph knowledge graphs are built

  2. Use the Pandera classes that define your ontology to build custom transformation and validation of data, so you instantiate a simple class to transform data from one format to another rather than writing independent implementations. Implement your ETL as part of these classes, using Pandera functions in the class to efficiently transform and also validate data. Pandera validates the ENTIRE record, even if one field fails to parse... so you get ALL the fields' errors at once. The system will report every erroneous field rather than dying on the first error. This would make ETL MUCH faster. You will know all the issues up front, and can put checks in place to prevent creeper issues that kill productivity from making it through the early stages of the lengthy, complex ETL pipelines that large knowledge graph projects often create.

  3. Take the classes we have ETL'd the original datasets into, turn them into text documents using a Ditto-style encoding, and feed them into a Graph Attention Network (GAT) entity resolution (ER) model.

    Ditto encoding of semi-structured data into text documents

  4. The ER model produces aggregate nodes with lots of sub-nodes... what we have called identities made up of entities.

    Aggregate entities in identities in a business graph

  5. The same Pandera classes for the ontology then contain summarization methods - some kind of summarization interface that makes things simple. You got 25 addresses? You have an interface for reducing them: turn them into fields with lists, or deduplicate them.

    NOTE: At this point you have a property graph you can load anywhere - TigerGraph, Neo4j, Elasticsearch or OpenSearch.

  6. Once this is accomplished, we build a graph DB on top of OpenSearch. The security-analytics project is going to do this, so we can wait for them and contribute to that project. Using an OpenSearch plugin reduces round-trip latency substantially, which makes scaling much easier for long walks that expand into many neighboring nodes.

  7. Finally, we create or use a middleware layer as an external API for the platform, in front of MLflow for MLOps / serving any live models, and OpenSearch for graph search and retrieval.

  8. Now that we have a clean property graph, we can pursue our network motif searching and motif-based representation learning.

    Tonight we will take over the world! Muhahahahahaha!

    Pinky and the Brain

    GraphFrames uses PySpark DataFrames to perform network motif search for known motifs until we implement efficient random motif searching via neural subgraph matching.

    Below is an example of a network motif for financial compliance risk (KYC / AML) called Multiple-Path Beneficial Ownership, used for finding the ultimate beneficial owner of a company that uses a layer of companies it owns between it and the asset it wishes to obscure. This motif indicates secrecy, not wrongdoing, but it is a risk factor.

    Multiple-Path Beneficial Ownership Risk Motif

    Below is the PySpark / GraphFrames motif search code that detects this motif. While brute force searching for network motifs using MapReduce joins is not efficient, it does work well for finding known network motifs for most large networks. It is also flexible enough to search for variations, broadening results and providing domain experts with examples of variants from which to learn new motifs or expand existing motifs.

    GraphFrames Network Motif Search
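For a flavor of what such a search looks like, here is a minimal GraphFrames sketch of the Multiple-Path Beneficial Ownership motif. It is illustrative only: the node/edge file paths and the "relationship" column name are assumptions, not Graphlet AI's actual schema.

# A minimal sketch (not the project's exact code) of a GraphFrames motif search for the
# "Multiple-Path Beneficial Ownership" pattern. File paths and the "relationship" column
# are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

nodes = spark.read.parquet("data/nodes.parquet")   # must have an "id" column
edges = spark.read.parquet("data/edges.parquet")   # must have "src" and "dst" columns

g = GraphFrame(nodes, edges)

# Owner (a) controls the asset (c) through two different intermediaries (b1, b2).
paths = g.find("(a)-[ab1]->(b1); (b1)-[b1c]->(c); (a)-[ab2]->(b2); (b2)-[b2c]->(c)")

multiple_path_ownership = (
    paths
    .where(F.col("b1.id") != F.col("b2.id"))          # two distinct intermediate companies
    .where(F.col("a.id") != F.col("c.id"))            # owner and asset are different nodes
    .where(F.col("ab1.relationship") == "owns")
    .where(F.col("b1c.relationship") == "owns")
    .where(F.col("ab2.relationship") == "owns")
    .where(F.col("b2c.relationship") == "owns")
)
multiple_path_ownership.select("a.id", "b1.id", "b2.id", "c.id").show()

GraphFrames expects an "id" column on vertices and "src"/"dst" columns on edges; everything else in the sketch is a filter over the motif match results.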

Optimizing the above process is the purpose of Graphlet AI. We believe that if we make all of that easier, we can help more organizations successfully build large, enterprise knowledge graphs (property graphs) in less time and for less money.

License

This project is created and published under the Apache License, version 2.0.

Conventions

This project uses pre-commit hooks to enforce its conventions: git will reject commits that don't comply with our various flake8 plugins.

We use numpy docstring format on all Python classes and functions, which is enforced by pydocstyle and flake8-docstrings.

We run black, flake8, isort and mypy in .pre-commit-config.yaml. All of these are configured in pyproject.toml except for flake8 which uses .flake8. Flake8 uses the following plugins. We will consider adding any exceptions to the flake config that are warranted, but please document them in your pull requests.

flake8-docstrings = "^1.6.0"
pydocstyle = "^6.1.1"
flake8-simplify = "^0.19.2"
flake8-unused-arguments = "^0.0.10"
flake8-class-attributes-order = "^0.1.3"
flake8-comprehensions = "^3.10.0"
flake8-return = "^1.1.3"
flake8-use-fstring = "^1.3"
flake8-builtins = "^1.5.3"
flake8-functions-names = "^0.3.0"
flake8-comments = "^0.1.2"

Entity Resolution (ER)

This project includes a Graph Attention Network implementation of an entity resolution model where node features are based on the Ditto encoding defined in Deep Entity Matching with Pre-Trained Language Models, Li et al, 2020.

For specifics, see Issue 3: Create a generic, configurable system for entity resolution of heterogeneous networks

Why do Entity Resolution in Graphlet?

The motivation for Graphlet AI is to provide tools that facilitate the construction of networks for research into network motifs, motif search and motif-based representation learning. Without entity resolution... motif analysis does not work well.

Entity resolution enables motif searches to match patterns that otherwise would not resolve!

Entity Resolution Process

  1. Transform Datasets into a set of Common Schemas in a Property Graph Ontology

    The first step in our ER process is to ETL multiple datasets into a common form - in silver tables - in our property graph ontology. Then a single model can be used for each type - rather than having to work across multiple schemas. This simplifies the implementation of entity resolution.

    Entity resolution is simplified if you first ETL datasets into a common format

  2. Ditto Encode Nodes using Pre-Trained Language Models

    As mentioned above, we use the Ditto encoding to encode documents as text documents with column name/type hints which we then embed using a pre-trained language model. Graph Neural Networks accept arbitrary input as features - we believe Ditto provides a general purpose encoding for multiple operations including entity resolution and link prediction.

    Ditto Encoding uses hints for column types to apply the knowledge of pre-trained language models to entity resolution

  3. Blocking Records with Sentence Transformers and Locality Sensitive Hashing (LSH)

    Large knowledge graphs (property graphs) have too many records to perform a pairwise comparison of all records to all records - it is O(N^2) complexity!

    Blocking for entity resolution

    We use Sentence Transformers (PyPI) (GitHub) for blocking, as in Ditto, and we incorporate network topological features in addition to node features in the blocker (a sketch of this blocking step appears after this list).

    Blocking using sentence transformers and Locality Sensitive Hashing

    Note: LSH is powerful for many operations on pairs of network nodes! Google's Grale is described in Grale: Designing Networks for Graph Learning, Halcrow et al, 2020 (arXiv) from Google Research. LSH is an incredibly powerful algorithm - for large graph ML the pattern isn't MapReduce, it is MapLSH: approximate grouping.

  4. Entity Matching with Graph Attention Networks

    TBD :)

    Entity Matching using a language model
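As a rough illustration of steps 2 and 3 above, the sketch below Ditto-encodes a few records, embeds them with a sentence transformer, and blocks them with random-hyperplane LSH. The model name, the number of hyperplanes and the toy records are assumptions for illustration, not the project's configuration.

# Minimal sketch of LSH blocking over sentence-transformer embeddings. The model name and
# hyperparameters are illustrative assumptions, not Graphlet AI's configuration.
import numpy as np
from collections import defaultdict
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

node_texts = [
    "COL name VAL Russell Jurney COL city VAL Oakland",
    "COL name VAL Russ H Jurney COL city VAL Oakland",
    "COL name VAL Ada Lovelace COL city VAL London",
]
embeddings = model.encode(node_texts, normalize_embeddings=True)

# Random-hyperplane LSH: nodes whose embeddings share a signature land in the same block.
rng = np.random.default_rng(31337)
n_planes = 16
planes = rng.normal(size=(n_planes, embeddings.shape[1]))
signatures = (embeddings @ planes.T) > 0

blocks = defaultdict(list)
for node_id, signature in enumerate(signatures):
    blocks[signature.tobytes()].append(node_id)

# Only pairs inside the same block are passed to the (expensive) matcher.
candidate_pairs = [
    (i, j) for members in blocks.values() for i in members for j in members if i < j
]
print(candidate_pairs)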

DBLP Training Data

DBLP is a database of scholarly research in computer science.

The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.

Note that there are additional labels available as XML that we haven't parsed yet.

Collecting and Preparing the Training Data

The DBLP XML and the 50K ER labels are downloaded, parsed and transformed into a graph via graphlet.dblp.__main__ via:

python -m graphlet.dblp

Why property graphs? Why not RDF Triples and SPARQL?

We believe RDF/SPARQL are based on the false assumptions of the Semantic Web, which did not work out.

Cory Doctorow describes the problems with RDF/SPARQL and the Semantic Web in his essay Metacrap. Concepts across organizations are more complex than a simple ontology can represent. You shouldn't trust other organizations' data: it is dirty, and you don't have the context to clean it. ETL is cost-prohibitive with SPARQL - especially when repeated at query time. These tools are optimized for "schema default" and this doesn't work well.

The reality is more like this, which is what our system is optimized for. At present we are not focusing on NLP, information extraction and entity linking; instead we favor tools optimized for building property graphs by using ETL to transform many datasets into a uniform ontology for solving problems using ML and information retrieval.

Bring Your own Knowledge Graph


graphlet's Issues

Implement efficient random motif searching via neural subgraph matching

Motif search for heterogeneous networks - especially temporal heterogeneous networks - has fundamental scalability challenges. Neural Subgraph Matching proposes a technique using graph representation learning and vector search called NeuroMatch. NeuroMatch is an efficient neural approach for subgraph matching.

The source code for NeuroMatch is at github.com/snap-stanford/neural-subgraph-learning-GNN.


FAISS and Distributed FAISS

If the code doesn't scale, is this something we could implement using FAISS and Distributed FAISS?

Create a DBLP labeled training network with SAME_AS edges for training our entity resolution model

DBLP Training Data

I need to create a network with a set of edges that include a SAME_AS edge type and a NOT_SAME_AS edge type for entity resolution to serve as training data to enable @tanmoyio to proceed with training an entity resolution model in #3.

DBLP Datasets

DBLP is a database of scholarly research in computer science.

The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.

Note that there are additional labels available as XML that we haven't parsed yet.

Collecting and Preparing the Training Data

The DBLP XML and the 50K ER labels are downloaded, parsed and transformed into a graph via graphlet.dblp.__main__ via:

python -m graphlet.dblp

See the example data at: https://gist.github.com/rjurney/5acad373d485272b5c1f4352b1dd0fc6

Create Pandera / PySpark utilities `graphlet.etl` to transform / validate multiple datasets into a uniform ontology

Summary

This ticket is to create utilities - including node and edge base classes - that provide an object-oriented interface with runtime validation for defining data types in a property graph ontology. Multiple datasets can then be transformed into these types via ETL in a way that works with Spark schemas and allows for code re-use, rather than writing UDFs for each dataset individually.

How do datasets that map to a single class in an ontology vary?

Datasets that map to the same class or set of classes in a knowledge graph (property graph) ontology can vary in different ways:

  • Naming - Field names vary for the same properties of a given entity.

  • Structure - Records have different structures. Schemas can be flat or nested. One may use lists of objects while another uses one object with column names mapping to lists of values.


  • Formats - Datasets implement different standard data formats at the file or field level. One file is CSV, another Parquet. One timestamp field uses *nix timestamps and another ISO datetimes.

  • Values - The same field name and format may have values with different errors that require cleaning. Fixes for these errors must not break correct records in other datasets.

  • Logical - Datasets may exist in different logical forms when ingested. An application ontology may have the concept of (company)—officership—>(person), while a dataset at ingestion may be two files: companies and their officers (both an officership and a person).

Even in a distributed system like Spark, large datasets can be relatively slow to process. This creates a slow feedback loop for cleaning data and resolving format disparities. Sampling helps, but problems may occur in any one record, requiring a full run over all the data to validate. Fixing issues in each dataset we ingest is a time-consuming problem.

What is pandera and how can it help?

Pandera explains itself as:

data validation library for scientists, engineers, and analysts seeking correctness.

pandera provides a flexible and expressive API for performing data validation on dataframe-like objects to make data processing pipelines more readable and robust.

Dataframes contain information that pandera explicitly validates at runtime. This is useful in production-critical data pipelines or reproducible research settings.

@pandera.decorators.check_types

The check_types decorator can be used to define the input and output schemas of a function that transforms a DataFrame.
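For example, a minimal sketch using pandas - the Movie schema and column names are illustrative, not Graphlet AI's actual ontology classes:

# Minimal sketch of pandera's check_types on a pandas DataFrame. The MovieSchema class,
# its columns and the raw data are illustrative assumptions.
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class MovieSchema(pa.SchemaModel):
    """One ontology class: a movie node."""

    title: Series[str]
    year: Series[int] = pa.Field(ge=1888)   # no films before the first motion picture
    length: Series[int] = pa.Field(ge=0)

    class Config:
        coerce = True


@pa.check_types
def comedies_to_movies(comedies: pd.DataFrame) -> DataFrame[MovieSchema]:
    """Transform a raw comedies DataFrame into validated Movie records."""
    return comedies.rename(columns={"name": "title"})[["title", "year", "length"]]


raw = pd.DataFrame({"name": ["Airplane!"], "year": [1980], "length": [88]})
movies = comedies_to_movies(raw)   # the output is validated against MovieSchema at runtime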

Sources of improvements in productivity from our ETL

  1. Code re-use for validation and transformation: Without any help from Pandera, if we JUST create one class for each type in an ontology - with validation and ETL code we write manually for each class, rather than doing custom ETL scripts for each file we ingest like awards.csv and comedy.csv that have Movies in them - that alone makes things more efficient, just because of the benefit of the pattern of organization in a standard interface and, more importantly, the code reuse. Fortunately we can do better than this with the features Pandera provides :)

  2. Pandera validation tools have powerful features: Pandera can define a Schema Model for each class in an ontology that has Checks that can help validate fields. Things get more efficient because now you have Pandera checks to validate your data once per class, rather than once per file.

  3. Pandera transformation functions

  4. Pandera column validation tools are based on pd.DataFrame and pd.Series vector methods, which are much more efficient via pandera.pyspark and PySpark pandas_udfs. This is much faster than trying to write your own validation code at a record or row level.

  5. Pandera makes debugging data much more efficient with lazy validation - errors are collected and emitted as SchemaErrors when you call schema.validate(df, lazy=True) (see the sketch after this list).

  6. When you factor in that with Pandera you don't just get ONE error before the entire PySpark job dies... you get ALL errors in ALL fields in ALL records... now you are talking about a SERIOUS upgrade to building a knowledge graph, because right now the ENTIRE JOB fails and you get information about a single row.
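A minimal sketch of that lazy validation behavior, with a made-up schema and data:

# Sketch of lazy validation: collect every failing check in every row instead of
# dying on the first error. The schema and data here are illustrative only.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "title": pa.Column(str),
        "year": pa.Column(int, pa.Check.ge(1888)),
        "length": pa.Column(int, pa.Check.ge(0)),
    }
)

df = pd.DataFrame(
    {"title": ["Okay", "Bad"], "year": [1980, 1066], "length": [88, -5]}
)

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # One report covering all fields in all records - the upgrade described above.
    print(err.failure_cases)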

You go through that fail-and-rerun cycle over and over. You miss problems because your data for the new drama.csv file isn't really verified - and downstream in your data pipeline, when incorporating new custom movie reviews into the dataset, you find a problem you missed during ETL of drama.csv. By building up a capability in the Movie class, you get faster and faster as you add datasets to the class! Each one makes the ones that come after faster.

Let's say you start a Movie class and load horror.csv. Then someone sends you comedies.csv to add to the graph. You don't even have to start out by doing any ETL... you can just write code that fills out the fields for the Movie class with values from comedies.csv and... see if it validates. You do no work by default.

Let's say you find a problem... you solve it by improving validation, first off. That now works for action.csv - as does improving the robustness of your transformations.

Pydantic --> Pandera

Note: While an initial attempt at implementing this used Pydantic, there were serialization issues with PySpark and it was not featureful enough for this problem. Pandera is a much better fit, as it already integrates with Pandas, PySpark and Dask. Not all references have yet been updated in the issues and code, but they will be.

ETL is Hard: Problems and Solutions

Several challenges arise when transforming multiple large datasets into a single schema using PySpark, making code re-use difficult to accomplish.

PROBLEM: Ingested Datasets Vary for the same Class in an Ontology

Datasets destined for the same entity in an ontology may vary greatly and require different transformation logic.

For example a Movie class in your ontology might look like this:

    # TODO: update from Pydantic to Pandera!
    # Assumes project helpers NodeBase and text_runtime_to_minutes come from graphlet's ETL module.
    from pydantic import validator

    class Movie(NodeBase):
        """A film node in hollywood."""

        entity_type: str = "movie"
        genre: str
        title: str
        year: str
        length: int = 0
        gross: int = 0
        rating: str

        @validator("length", pre=True)
        def convert_hours_minutes_to_int_minutes(cls, x):
            """Convert text runtimes into integer minutes via a project helper."""
            if x and isinstance(x, str):
                x = text_runtime_to_minutes(x)
            return x

But two datasets you load in PySpark to add to your knowledge graph might look like this... they have movies in them but use a different concept - awards and a sub-class of movies called comedies.

# Movie awards
awards = spark.read.option("header", "true").csv("tests/data/awards.csv")
awards.show()

# A genre of movies
comedies = spark.read.option("header", "true").csv("tests/data/comedy.csv")
comedies.show()

You need to get Movies out of these things, but you need to write custom ETL code to do so! What a bummer :( Or do you...

SOLUTION: Create an Object-Oriented Interface for Code Reuse

While datasets at ingestion can vary dramatically, the goal of ETL for a property graph is to transform them into a single object-oriented form. We can optimize the process by centralizing all ETL in instances of a single class that handles transformation, validation and summarization in a fully reusable way, such that each dataset added makes the class more robust - meaning less work per dataset for each dataset that you add to a given type!

PROBLEM: What good are classes for ETL if they aren’t usable in PySpark?

Extending a base class for ETL for every entity in an ontology isn’t very useful if they can’t work with PySpark’s APIs.

SOLUTION: Pandera has PySpark/Pandas Support!

Fortunately Pandera has Pandas and PySpark support. For traditional PySpark support check out pandera.typing.pyspark. This seems like the way to go... but I am not sure how it will work with the Pandera Schema Models.

For an interface to essentially map a Python function into Spark to run it, check out Fugue.

PROBLEM: We had 10 nodes, now we did entity resolution and we have 2 nodes made up of 5 nodes each. How do we work with and represent this data efficiently?

You DEFINITELY need to re-validate the aggregate nodes when you summarize them.

SOLUTION: We extend Pandera classes with an interface for summarizing records using PySpark/Pandas/Pandera GROUP BY / aggregate functions!

Once you perform entity resolution and you GROUP BY, you start out with a list of nodes inside a big master node... the ones inside are all the same Person, say. Names of Russell Jurney, Russ Jurney, Russell H Jurney, Russ H Jurney, Russell Journey - these are actually real variations of my name in business registries across the US 🙂

So you need to take each field - sometimes multiple fields at once if you need access to the others for logical reasons - and summarize them in a way that creates a more accessible format for the master record. Sometimes you do want a list of records... but more commonly you DO NOT want a list of 5 records that have name fields with various versions of a name, you want a single record with a list field for name.

{
  "name": ["Russell Jurney", "Russell H. Jurney", "Russ H Jurney", ...],
  "address": [{"street": "365 ...", ...}, {"street": "102 ...", ...}, {"street": "365 ...", ...}],
  ...
}

How do we use or extend Pandera classes for our ontology with an interface and process to accomplish this?

PROBLEM: You don't actually just want to create lists of the values, you want to de-duplicate the values!

But you don't want the duplicates in field names like addresses, you want to de-duplicate the values!

{
  "name": ["Russell Jurney", "Russell H. Jurney", "Russ H Jurney", ...],
  "address": [{"street": "365 ...", ...}, {"street": "102 ...", ...}, {"street": "365 ...", ...}],
  ...
}

How do we use or extend Pandera classes for our ontology with an interface and process to accomplish this?
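One possible shape for the answer, sketched with plain PySpark aggregates - the identity_id column and the toy data are assumptions for illustration, not the project's interface:

# Sketch of summarizing resolved entities with PySpark aggregates: one row per identity,
# duplicate values collapsed with collect_set. Column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame(
    [
        (1, "Russell Jurney", "365 Main St"),
        (1, "Russ H Jurney", "365 Main St"),
        (1, "Russell H. Jurney", "102 Oak Ave"),
        (2, "Ada Lovelace", "12 St James Sq"),
    ],
    ["identity_id", "name", "address"],
)

identities = people.groupBy("identity_id").agg(
    F.collect_set("name").alias("name"),        # de-duplicated list of name variants
    F.collect_set("address").alias("address"),  # de-duplicated list of addresses
)
identities.show(truncate=False)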

PROBLEM: Not all summarizations are just list-building or deduplication... some involve creating a single, summarized value of the same or a different type!

Sometimes you have a more complicated summarization method, such as when you really need ONE value for a field in an aggregate node. One such case is when you want to create a single value of a field to easily compare one aggregate node to another one, such as via a distance measure such as cosine similarity and a vector - an embedding.

What if you want to combine 5 names into a single vector representation using the average of 5 Word2Vec embedding vectors of the names? Before you balk, remember that they are the same name... this might work even across languages with the right embedding :)

What about a more sophisticated embedding technique such as using sentence transformers and mean or max pooling to create a single representation for a name?
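A minimal sketch of that pooling idea, assuming a sentence-transformer model (the model name is an illustrative choice):

# Sketch of collapsing several name variants into one comparable vector by mean-pooling
# sentence-transformer embeddings. The model name is an illustrative assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

names = ["Russell Jurney", "Russ Jurney", "Russell H Jurney", "Russ H Jurney", "Russell Journey"]
name_vectors = model.encode(names, normalize_embeddings=True)

# One vector per aggregate node, usable with cosine similarity against other identities.
identity_vector = name_vectors.mean(axis=0)
identity_vector /= np.linalg.norm(identity_vector)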

PROBLEM: Do the individual and aggregate, summarized nodes have the same schema? The same class? How do you handle the fact that some are singular in their values and some are lists? Shit?

Are the aggregate nodes different classes? Shouldn't be. Crap. Should be?

SOLUTION: How do we handle this?

One strategy is to make all fields of nodes and edges contain individual items or lists, such as using a type hint of typing.Union[int, typing.List[int]] for an int field instead of int. There can be chains of consequences for complex type hints, however, and we would need to consider this carefully.
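Purely as an illustration of that strategy - not a working Pandera schema - the type hints would look something like this:

# Illustration only: the "scalar or list" field strategy discussed above, expressed as
# type hints. Whether this composes cleanly with Pandera schemas is exactly the open question.
from typing import List, Union

IntOrList = Union[int, List[int]]
StrOrList = Union[str, List[str]]


class MovieFields:
    """A node whose fields hold one value before ER and a list of values after summarization."""

    title: StrOrList
    year: IntOrList
    length: IntOrList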

What else do we need from these ontology classes?

See #3 for more information, but we need to be able to use the Schema Model to generate a "Ditto format" text version of a node or edge as in ditto/ditto_light/summarize.py:

    def transform(self, row, max_len=128):
        """Summarize one single example.
        Only retain tokens of the highest tf-idf
        Args:
            row (str): a matching example of two data entries and a binary label, separated by tab
            max_len (int, optional): the maximum sequence length to be summarized to
        Returns:
            str: the summarized example
        """
        sentA, sentB, label = row.strip().split('\t')
        res = ''
        cnt = Counter()
        for sent in [sentA, sentB]:
            tokens = sent.split(' ')
            for token in tokens:
                if token not in ['COL', 'VAL'] and \
                   token not in stopwords:
                    if token in self.vocab:
                        cnt[token] += self.idf[self.vocab[token]]

For the full code, see https://github.com/megagonlabs/ditto/blob/master/ditto_light/summarize.py#L63-L84

Ditto text encoding of structured records for embedding with a pre-trained language model like a sentence transformer

Building Knowledge Graphs as Property Graphs

Building a knowledge graph as a property graph from multiple datasets requires a lot of ETL as a pre-processing step: several schemas representing the same thing must be transformed into a single schema within a uniform ontology, so that records can be merged into a pair of tables for nodes and edges and processed with code or statistical models specific to that type of entity rather than to each dataset that makes it up.

Bro, why not RDF? SPARQL!

While there are systems and algorithms specific to RDF triples that avoid ETL up front, they put ETL off until query time, where it must be repeated once per query per dataset involved. They also require property-oriented query languages like SPARQL, which many people find difficult to work with: humans think most easily in terms of objects with properties, while SPARQL requires reification - the restoration of objects from properties. RDF is metadata, not data.


While some machine learning is specific to triples, doing machine learning at scale as part of knowledge graph construction is made easier once a graph is stored as objects, as most tools, models and algorithms do not support triple form.

A Note on Medallion Tables

Calling Medallion Tables a Medallion Architecture is a bit of an exaggeration, but the concepts are useful. Data is ingested into bronze tables, where a unique identifier might be added but the data is otherwise raw, so it can be accessed in its original form at any time for operations like debugging downstream tables by examining the unaltered data. Bronze tables are cleaned, enriched, combined and transformed into one or more intermediate silver tables that may be clean versions of the original data type or entirely new concepts. Silver tables chain into other silver tables. When data is prepared for display to a user or storage in an external system, it is stored in a gold table. Finally, there are the palladium external tables in other systems that gold tables are transferred to - in Graphlet AI, an OpenSearch table.

Dude… that’s a LOT of ETL

Part of building a property graph database out of myriad sources of data is transforming disparate formats representing the same type of entity into the schema of an entity in a uniform ontology.

That means several bronze tables for each entity in our ontology…


And at least one silver table for each entity type in our ontology.


Ok but… hey, that’s a lot of ETL! I want to play with graphs! Don’t you know that ETL of large datasets is hard! Now you get the motivation for this issue :)

Create a `graphlet.null_model` module for heterogeneous networks

In order to determine whether a graphlet is a network motif, we need to compare its frequency versus a null model to determine if it is statistically significant. This means we need heterogeneous null models... and I can't find any libraries such as networkx that contain null models for heterogeneous networks.

Create a graphlet.null_model Module

We should create a module graphlet.null_model with networkx style generators that accept properties such as:

  • Total number of nodes of each node type
  • Summary of degrees of each edge type

Or whatever we find in the literature for various heterogeneous null models.

NetworkX Generators

While networkx Generators cover a range of randomly generated networks, they do not handle multiple types of nodes found in a property graph, also known as a heterogeneous network.

Literature Review

The null model of the heterogeneous networks refers to those network models that have the same set of types of nodes T, number of homogeneous nodes N, number of heterogeneous U, distribution of homogeneous node degree P(k) and distribution of heterogeneous node degree P(u) as the original network, while otherwise being an instance of the random network.

For each pair of node types there is a distribution of heterogeneous node degree. Therefore, there are |T|^2 - |T| distributions of heterogeneous node degree in a heterogeneous network, where T refers to the set of types of nodes.

This null model uses random walks - a Markov process that considers the odds of walking across different edge types (source_type, dest_type) - which reach a steady state that incorporates both homogeneous and heterogeneous degrees. It also includes a modularity function using this null model.
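One possible starting point is a configuration-model-style generator that preserves only per-type node counts and per-edge-type edge counts. The sketch below is a proposal under those assumptions - the graphlet.null_model module path and this signature do not exist yet, and it does not yet preserve the degree distributions the literature's null model requires:

# Sketch of a networkx-style generator for a heterogeneous null model that preserves the
# number of nodes per node type and the number of edges per (source_type, dest_type) pair.
# This is a proposal, not an existing graphlet.null_model API.
import random
import networkx as nx


def heterogeneous_null_model(node_counts, edge_counts, seed=None):
    """node_counts: {node_type: n}; edge_counts: {(src_type, dst_type): m}."""
    rng = random.Random(seed)
    g = nx.MultiDiGraph()

    nodes_by_type = {}
    for node_type, n in node_counts.items():
        nodes = [f"{node_type}_{i}" for i in range(n)]
        nodes_by_type[node_type] = nodes
        g.add_nodes_from(nodes, node_type=node_type)

    # Wire edges uniformly at random within each (source_type, dest_type) pair, preserving
    # only the edge counts - degree distributions are NOT preserved in this simple version.
    for (src_type, dst_type), m in edge_counts.items():
        for _ in range(m):
            u = rng.choice(nodes_by_type[src_type])
            v = rng.choice(nodes_by_type[dst_type])
            g.add_edge(u, v, edge_type=(src_type, dst_type))

    return g


random_graph = heterogeneous_null_model(
    {"person": 100, "company": 40}, {("person", "company"): 300}, seed=1
)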

Create `graphlet.nlp.ie` module for information extraction as part of property graph construction

Use of graphlet.etl Schema Models

We can use graphlet.etl's Pandera Schema Models to define the entities and relations we are extracting.

About graphlet.etl

The module graphlet.etl helps construct enterprise knowledge graphs as property graphs via Extract, Transform, Load (ETL) / Extract, Load, Transform (ELT) with the assistance of Pandera Schema Models on top of PySpark and Dask. These models are useful in that they define the types of nodes and edges of a heterogeneous information network (HIN) - with semi-structured data as properties of nodes and edges - in a central place that other features, such as entity resolution, can refer to.

The classes EntitySchema, NodeSchema and EdgeSchema can be sub-classed to define the types of relations to be extracted.

Use of FlairNLP

FlairNLP is one of the most commonly used projects for Named Entity Recognition and relation extraction. Flair makes it easy to stack embeddings of different types - for example character and word embeddings - in a Flair model.

See the following tutorials:

Features

We need to define the minimum features required to support the integration of these two libraries.
Using flair and transfer learning to perform NER and relation extraction makes the tasks primarily a labeling problem. Platforms like snorkel and skweak are helpful for generating labels programmatically.
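For reference, here is a minimal Flair NER example of the kind this module would build on. The "ner" tagger is Flair's standard pre-trained English model; the sentence text is made up:

# Minimal Flair NER example; the "ner" model name is Flair's standard English tagger.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")

sentence = Sentence("George Lucas founded Lucasfilm in San Francisco.")
tagger.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(entity.text, entity.tag)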

Create a generic, configurable system for entity resolution of heterogeneous networks using pre-trained LMs and GATs

Graph Neural Networks (GNNs) accept arbitrary features as input... making an embedding produced by a model based on ditto a way to encode the properties of nodes in a heterogeneous network. Ditto designed this method for entity matching, which would seem to be analogous to link prediction in training a GNN. Ditto could therefore work as the first layer in a two-step entity resolution process: encode nodes using a pre-trained language model and then distribute node representations around a network to help an entity matching model consider the network neighborhood of each node in its matching.

To create a generic entity resolution system, we will:

  1. Automatically encode nodes as text documents, including column and type hints from the ontology's Pandera classes (sketched after this list).
  2. A pre-trained language model will create a fixed-length vector representation of the node text documents.
  3. A Graph Attention Network (GAT) model will perform entity resolution as a binary classification problem, given two nodes' representations encoded via Ditto --> LM --> GNN, to predict SAME_AS edges for the entire network at once rather than blocking and matching pairs of nodes. This avoids scaling blocking for large networks, which can be painful.
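A rough sketch of steps 1 and 2, with an illustrative hint format and embedding model - the exact serialization graphlet will generate from Pandera classes is still to be designed:

# Sketch of step 1: serialize a node's properties into Ditto-style "COL ... VAL ..." text,
# then embed it (step 2). The hint format and the model name are illustrative assumptions.
from sentence_transformers import SentenceTransformer


def ditto_encode(properties: dict) -> str:
    """Turn {"title": "Alien", "year": 1979} into 'COL title VAL Alien COL year VAL 1979'."""
    return " ".join(f"COL {column} VAL {value}" for column, value in properties.items())


node_text = ditto_encode({"entity_type": "movie", "title": "Alien", "year": 1979})

model = SentenceTransformer("all-MiniLM-L6-v2")
node_vector = model.encode(node_text)   # fixed-length representation to feed the GAT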

Encoding Node Features using Pre-Trained Language Models (LMs)

The landmark paper Deep Entity Matching with Pre-Trained Language Models outlines a way to encode records using pre-trained language models such as baby BERT models by providing clues about the column names as shown in the image below to help the model use what it learned from a huge amount of semi-structured data to encode any semi-structured record.

The code for this paper is available here: github.com/megagonlabs/ditto

We will use the Schema Model to generate a "Ditto format" text version of a node or edge as in ditto/ditto_light/summarize.py:

    def transform(self, row, max_len=128):
        """Summarize one single example.
        Only retain tokens of the highest tf-idf
        Args:
            row (str): a matching example of two data entries and a binary label, separated by tab
            max_len (int, optional): the maximum sequence length to be summarized to
        Returns:
            str: the summarized example
        """
        sentA, sentB, label = row.strip().split('\t')
        res = ''
        cnt = Counter()
        for sent in [sentA, sentB]:
            tokens = sent.split(' ')
            for token in tokens:
                if token not in ['COL', 'VAL'] and \
                   token not in stopwords:
                    if token in self.vocab:
                        cnt[token] += self.idf[self.vocab[token]]

For the full code, see https://github.com/megagonlabs/ditto/blob/master/ditto_light/summarize.py#L63-L84

Ditto text encoding of structured records for embedding with a pre-trained language model like a sentence transformer

Blocking for Pair-Wise Comparisons using LSH and Embeddings


Create `graphlet.nlp.entity_linking` module that uses BLINK on your KG / dataset

Use of graphlet.nlp.ie

The entities and their relations that form the input to this module will be extracted using graphlet.nlp.ie - see #11 and #1.

Integration with BLINK

BLINK by Facebook Research, which implements the ELQ architecture, uses a joint embedding approach:

In a nutshell, BLINK uses a two stages approach for entity linking, based on fine-tuned BERT architectures. In the first stage, BLINK performs retrieval in a dense space defined by a bi-encoder that independently embeds the mention context and the entity descriptions. Each candidate is then examined more carefully with a cross-encoder, that concatenates the mention and entity text. BLINK achieves state-of-the-art results on multiple datasets.

BLINK can be used interactively, which is neat. Adapting BLINK to any given ontology is not a simple task.

BLINK can use the FAISS vector search engine or the more scalable distributed-faiss, which we have used in the past for blocking for entity resolution.
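As an illustration of FAISS-based blocking, the sketch below finds each record's approximate nearest neighbors; the dimensions, k and random embeddings are placeholders:

# Sketch of nearest-neighbor blocking with FAISS: each embedded record is compared only
# to its k approximate neighbors instead of all N records. Values here are placeholders.
import numpy as np
import faiss

dim, k = 384, 10
embeddings = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(embeddings)                 # cosine similarity via inner product

index = faiss.IndexFlatIP(dim)                 # exact index; swap for IVF/HNSW at scale
index.add(embeddings)

scores, neighbors = index.search(embeddings, k + 1)   # +1 because each point matches itself
candidate_pairs = {
    (min(i, int(j)), max(i, int(j)))
    for i, row in enumerate(neighbors)
    for j in row
    if int(j) != i
}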

Using BLINK with an arbitrary KG and dataset

The work in building this ticket would be using BLINK with the knowledge graph defined using the graphlet.etl NodeSchema / EdgeSchema sub-classes and your own corpus of documents.
