Comments (3)
Num bridges = 1 seems to only occur with cluster size = 2 in the synthetic data. Is there something wrong with the calculation?
from splink_graph.
This problem relates to the use of weight and distance in the graph calculation.
In particular, if you set distance_colname="weight" you get incorrect results:
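from pyspark.sql import Row, SparkSession
from splink_graph.cluster_metrics import number_of_bridges

# Reuse the active session if one exists (e.g. in a notebook), else create one.
spark = SparkSession.builder.getOrCreate()

# A triangle: every edge lies on a cycle, so the cluster should have no bridges.
data_list = [
    {"src": 1, "dst": 2, "weight": 0.9, "cluster_id": 1},
    {"src": 2, "dst": 3, "weight": 0.9, "cluster_id": 1},
    {"src": 3, "dst": 1, "weight": 0.9, "cluster_id": 1},
]
data = spark.createDataFrame(Row(**x) for x in data_list)

# Passing the weight column as the distance column reproduces the incorrect result.
cluster_num_bridges = number_of_bridges(data, distance_colname="weight")
cluster_num_bridges.toPandas()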
My use of distance_colname="weight" together with the default weight_colname="weight" is illogical.
However, we can avoid this problem altogether, because the bridges calculation uses neither weight nor distance.
So we can simplify the function by removing both parameters from its signature.
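To illustrate why neither weight nor distance is needed, here is a minimal pure-Python sketch of a bridge count that only looks at which nodes are connected. It is not the splink_graph implementation; the function name count_bridges is hypothetical, and it assumes an undirected graph with no duplicate edges.

```python
from collections import defaultdict


def count_bridges(edges):
    """Count bridges in an undirected simple graph given as (src, dst) pairs.

    Hypothetical helper for illustration only. Edge weights are never
    consulted: a bridge is defined purely by connectivity.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    disc = {}  # discovery order of each node in the depth-first search
    low = {}   # earliest discovery order reachable from the node's subtree
    counter = 0
    bridges = 0

    def dfs(node, parent):
        nonlocal counter, bridges
        disc[node] = low[node] = counter
        counter += 1
        for nxt in adj[node]:
            if nxt == parent:
                continue  # do not walk straight back along the tree edge
            if nxt in disc:
                low[node] = min(low[node], disc[nxt])
            else:
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # No back edge from nxt's subtree reaches node or above,
                # so removing (node, nxt) would disconnect the graph.
                if low[nxt] > disc[node]:
                    bridges += 1

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return bridges


triangle = [(1, 2), (2, 3), (3, 1)]  # same shape as the example above
path = [(1, 2), (2, 3)]
pair = [(1, 2)]

print(count_bridges(triangle), count_bridges(path), count_bridges(pair))
```

Under these assumptions the triangle has no bridges, while a two-node cluster always has exactly one, since its single edge is the only connection between the two nodes.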
In the example code the problem is introduced here:
https://github.com/moj-analytical-services/splink_examples_synthetic_data/blob/4e649ac359a290a4749bf35b9ec58e7e20e2fe16/graph_analytics/person/uk_citizens_max_groupsize_20/basic/01_graph_analytics/job.py#L109