moj-analytical-services / splink_graph Goto Github PK

pyspark-parallelised functions producing graph-theoretical metrics in connected component clusters for use in record-linkage (or other domains)

Home Page: https://moj-analytical-services.github.io/splink_graph

License: MIT License

Python 36.78% Jupyter Notebook 5.56% HTML 46.64% JavaScript 9.48% CSS 1.54%

graph graph-algorithms spark pandas-udf networkx record-linkage

splink_graph's Issues

On Pyspark > 3.0 the ARROW_PRE_0_15_IPC_FORMAT hack to make pandas_udf work is not needed.

In case Pyspark version is > 3.0 the ARROW_PRE_0_15_IPC_FORMAT="1" hack to make pandas_udf work will throw an error

RuntimeError: Arrow legacy IPC format is not supported in PySpark, please unset ARROW_PRE_0_15_IPC_FORMAT

the solution to this is to check for spark.version before setting it to 1 and if version is >3 do nothing

edge betweeness function could be faster if it can use either SparkUDF or Vectorized UDFs

some possible improvements

Spark UDFs
Vectorized UDFs

Vectorized UDFs are user defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows vectorized operations.Before Spark 3.0, Vectorized UDFs used to be defined with PandasUDFType. From Spark 3.0 with Python 3.6+, you can also use Python type hints. Using Python type hints are preferred and using PandasUDFType will be deprecated in the future release.

Why does density get rounded?

It's not obvious to me why the user would want the density rounded to 3 decimal places?

splink_graph/splink_graph/cluster_metrics.py

Line 77 in 302fba2

"density", f.round(f.col("edgecount") / f.col("maxNumberOfEdgesundir"), 3)

will networkx work on AWS Glue?

This small collection of functions uses sparkframes and networkx.

Will networkx run on AWS Glue... from what I see there are not `nyC++ or Fortran binaries that networkx uses and wraps.
graphframes runs on Glue so one less thing to worry about

One way to find out ! try it

Implement `has_bridges` as a cluster metric

In addition to the edge function that gives us bridges, it'd be useful to classify clusters as to whether they have bridges

We should be able to use networkX's has_bridges for this.

I'll have a crack at implementing this

Log2 Bayes factor better than match prob

Use of match probability suffers from 'bunching up' near 0 and 1. We currently have to do the 1.01 - match prob transformation to get a sensible distance metric. I suspect the graph outputs would be more informative if we used a log Bayes factor based distance metric

Weird error when testing eigencentrality on Analytical Platform

test_node_metrics.py::test_eigencentrality_star
gives an error when running on AP

pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64

however test passes on github-actions dockerised environment 🤔 and hence it is merged.

Need to test in other envs as well (DAP /AWS GLUE/ OSX) to understad what is going on

location of jars

I think the jars may go in python3.6/site-packages/jars rather than python3.6/site-packages/splink_graph/jars.

Need to experiment with pyproject.toml and the location of /jars in this repo.

Add Algebraic Connectivity to cluster metrics together with its tests

Background for Algebraic Connectivity:
http://www.eecs.qmul.ac.uk/~steve/papers/NGI07.pdf

error: command 'cmake' failed with exit status 1 when running tox


 -- Running cmake for pyarrow
  cmake -DPYTHON_EXECUTABLE=/Users/robinlinacre/Documents/data_linking/splink_graph/.nox/tests-pandas-1-0-1-pyspark-3-0-1/bin/python  -DPYARROW_BOOST_USE_SHARED=on -DCMAKE_BUILD_TYPE=release /private/var/folders/y2/25dftcwj5kv9lqs7fzsyxf6h0000gp/T/pip-install-_0xduqi2/pyarrow
  error: command 'cmake' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly
WARNING: You are using pip version 20.2.4; however, version 21.1.3 is available.
You should consider upgrading via the '/Users/robinlinacre/Documents/data_linking/splink_graph/.nox/tests-pandas-1-0-1-pyspark-3-0-1/bin/python -m pip install --upgrade pip' command.
nox > Session tests(pandas='1.0.1', pyspark='3.0.1') failed.
nox > Ran multiple sessions:
nox > * tests(pandas='0.25.3', pyspark='2.4.5'): failed
nox > * tests(pandas='1.0.1', pyspark='2.4.5'): failed
nox > * tests(pandas='0.25.3', pyspark='3.0.1'): failed
nox > * tests(pandas='1.0.1', pyspark='3.0.1'): failed

after looking at this issue
streamlit/streamlit#2774

ill try adding the following line to toxfile to perhaps solve it?🤞

session.run("python", "-m", "pip", "install", "--upgrade", "pip")

Useful to create a graph converter from connected component clusters to something PyG can load

In order to leverage all the goodness PyTorch Geometric can bring to the graph data available we need the functionality
of converting connected component clusters to a format that a PyTorch Geometric DataLoader can understand. (probably edgelists.. nothing very complicated).

This is not a very difficult task.. and info on how to be able to easily load these subgraphs/clusters in PyTorch Geometric
is avaliable here: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html

It will be very useful if this conversion can be done as a pandas_udf so that its performed in a parallelised manner

Think about adding the connected components step as a submodule in splink_graph

Think about adding the connected components step in splink_graph.

There are plenty of ways to do it:

import jars/use the graphframes functions (this could be brittle)
use networkx functions (this might not scale well)
have options for both?

num_bridges for case where there are zero bridges

I think it might return null for the zero bridges case, rather than 0 (which is presumably what we want).

At least, at the moment, there are no clusters with zero bridges, which can't be right

splink_graph fails to install due to an error installing gensim

Cause: This happens because for some reason gensim (which is a requirement of node2vec) on the 29thAug2021 went to 4.1.0 and that installation fails

Solution:

pip install gensim==3.8.3 or pip install gensim==4.0.1 first before trying to install splink_graph

Division by zero error within NetworkX `cluster_eb_modularity`

File "/home/spark/.local/lib/python3.7/site-packages/splink_graph/cluster_metrics.py", line 359, in cluster_eb_m
    co_eb_mod = nx_comm.modularity(nxGraph, gn)
  File "/home/spark/.local/lib/python3.7/site-packages/networkx/algorithms/community/quality.py", line 342, in modularity
    norm = 1 / deg_sum ** 2
ZeroDivisionError: float division by zero

I will try and delve a little deeper into the cause of this...

Do we need to round edge betweenness?

See here

Suggest removing round unless there's a good reason to do it?

WL hashes will change between NetworkX v2.6 and 2.7... a warning

During the conversation around the addition of WL-subgraph-hashing (a great feature btw) in Networx here there is a mention of the fact WL hashes will change between NetworkX v2.6 and 2.7
a backwards incompatible change. For example, if users are relying on graph hashes to compare graphs that have been hashed with different NX versions, they will get the wrong answer.

For the moment we are not using WL-hashes heavily but this is something to have in mind for avoiding any issues.

Thoughts about splink_graph API/UX

Consistency in naming

There should be common names for different 'things' across API calls. We can decide what the names should be but for example:

Subgraphs (clusters)
Nodes (components)
Edges (pairwise record comparison)

And these should filter through to everything

For example:

from splink_graph.cluster_metrics import x
from splink_graph.node_metrics import x
from splink_graph.edge_metrics import x

The in functions we want consistency in what things are called:

cluster_metrics should expect a cluster_id
node_metrics should expect node_id
edge metrics should expect src dst

We should always use these when performing joins. So for example:

df_nodes = cc.withColumnRenamed("id","index").withColumnRenamed("component","group")

df_nodes = df_nodes.withColumn("name",f.col("index"))
node_df_for_viz = df_nodes.join(node_eigen_centrality,(df_nodes.index==node_eigen_centrality.node)).drop("node")
node_df_for_viz.show(2)

relies on the index of a dataframe. But we should join on the node_id

Method calls could be like:

from splink_graph.node_metrics import eigencentrality
node_eigen_centrality = eigencentrality(edge_df )

where the default arguments for eigencentrality are the default names:

cluster_id_colname="cluster_id", distance_colname="distance",
               src_colname="src", dst="dst"

distance vs. weight

I find the terminology for distance and weight used in splink_graph confusing because I think it differs from the use of the same terminology in networkx.

In networkx, weight and distance are synonyms, whereas in splink, we define distance = (1-weight).

This probably originates from our use of probabilities in splink. The higher the probability, the lower the distance. I think perhaps the use of 'weight' was originally intended to make this simpler - but I think it conflicts with the networkx definitions.

But within networkx I believe weight, cost and distance are used interchangeably. weight seems to be a more generic term because it encompasses the possibility that in some graph analytics scenarios, you may be interested in maximising the weight. An example would be if weights represented the profit from shipping between two nodes.

We can see that weight and distance are comparable here:

G = nx.Graph()

            
data_list = [
    {"src": 1, "dst": 2, "weight":2.0},
    {"src": 2, "dst": 3, "weight":2.0},
    {"src": 1, "dst": 3, "weight":10},
        ]
from_dl = pd.DataFrame(data_list)
G = nx.from_pandas_edgelist(from_dl, "src", "dst", "weight")

shortest_path(G, weight="weight", source=1, target=3)
shortest_path_length(G, weight="weight", source=1, target=3)

> [1,2,3]
> 4.0

With this in mind, I suggest we include within splink graph functions that:

Convert Splink probabilities to an appropriate measure of distance - suggestion below
Convert match weights (log2 bayes factors) to an appropriate measure of distance - this probably just means fixing infinite and zero distances.

And then all functions should use distance, noting in the docstring that this is a synonym of weight

def probability_to_distance(
    df, prob_colname, out_colname="distance"
):

    df = df.withColumn(
        "__match_weight__", f.expr(f"log2({prob_colname}/(1-{prob_colname}))")
    )

    log2_bf = f"log2({prob_colname}/(1-{prob_colname}))"
    expr = f"""
    case
        when {prob_colname} = 0.0 then -40
        when {prob_colname} = 1.0 then 40
        when {log2_bf} > 40 then 40
        when {log2_bf} < -40 then -40
        else {log2_bf}
        end
    """

    df = df.withColumn("__match_weight__", f.expr(expr))

    score_min = df.select(f.min("__match_score__")).collect()[0][0]
    score_max = df.select(f.max("__match_score__")).collect()[0][0]

    expr = f"""
    1.01 - ((__match_score__ - {score_min})/{score_max - score_min} )
    """
    df = df.withColumn(out_colname, f.expr(expr))
    # Higher match weights mean lower distances so invert 
    df = df.withColumn(out_colname, f.expr(f"-{out_colname}"))

    df = df.drop("__match_score__")

    return df

Add possible GPU support?

Many of the functions in splink_graph and particularly the ones based on
numpy or scipy can be sped up by enabling the use of GPUs in the calculations.

The code to do this is not very hard (thanks Nvidia and numpy /scipy developers ! )
however the testing of if this works might be more difficult.

an idea is to use an GPU-enabled Amazon AWS image for testing.

Seperate graph hash calculation from rest of cluster_main_stats

Firstly its not something that most people would need. But very useful for people that do need it
Somehow import is not sucessful in DAP environment and stops the whole cluster_main_stats function from working.

moj-analytical-services / splink_graph Goto Github PK

splink_graph's Issues

Recommend Projects

Recommend Topics

Recommend Org