I find the terminology for distance and weight used in splink_graph confusing because I think it differs from the use of the same terminology in networkx.
In networkx, weight and distance are synonyms, whereas in splink, we define distance = (1-weight).
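To make the difference concrete, here is a minimal sketch of what happens when a match probability is handed to networkx as the edge weight directly, compared with handing it 1 - probability. The edge values and the attribute names are made up for illustration.

```python
import networkx as nx

# Hypothetical match probabilities between three records
edges = [(1, 2, 0.9), (2, 3, 0.9), (1, 3, 0.2)]

G = nx.Graph()
for src, dst, prob in edges:
    # "weight" here is a probability (higher = stronger link);
    # "distance" is defined as 1 - weight
    G.add_edge(src, dst, weight=prob, distance=1 - prob)

# networkx treats whichever attribute it is given as a cost to minimise
path_by_weight = nx.shortest_path(G, source=1, target=3, weight="weight")
path_by_distance = nx.shortest_path(G, source=1, target=3, weight="distance")

print(path_by_weight)    # [1, 3]: the weakest link looks "shortest"
print(path_by_distance)  # [1, 2, 3]: follows the two strong links
```

So passing a probability under the name weight gives networkx the opposite of what we mean.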
This probably originates from our use of probabilities in splink. The higher the probability, the lower the distance. I think perhaps the use of 'weight' was originally intended to make this simpler - but I think it conflicts with the networkx definitions.
But within networkx I believe weight, cost and distance are used interchangeably. weight seems to be a more generic term because it encompasses the possibility that in some graph analytics scenarios, you may be interested in maximising the weight. An example would be if weights represented the profit from shipping between two nodes.
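As a small illustration of a case where a larger weight is better, the sketch below uses networkx's maximum_spanning_tree on made-up shipping profits. The node names and profit values are assumptions for the example only.

```python
import networkx as nx

# Hypothetical profits from shipping between pairs of nodes
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 5.0), ("b", "c", 3.0), ("a", "c", 1.0)])

# Here the weight is something to maximise, so keep the most profitable links
tree = nx.maximum_spanning_tree(G, weight="weight")

kept = sorted(tuple(sorted(edge)) for edge in tree.edges())
total_profit = tree.size(weight="weight")

print(kept)          # [('a', 'b'), ('b', 'c')]
print(total_profit)  # 8.0
```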
We can see that weight and distance are comparable here:
import networkx as nx
import pandas as pd

data_list = [
    {"src": 1, "dst": 2, "weight": 2.0},
    {"src": 2, "dst": 3, "weight": 2.0},
    {"src": 1, "dst": 3, "weight": 10},
]
from_dl = pd.DataFrame(data_list)
G = nx.from_pandas_edgelist(from_dl, "src", "dst", "weight")
# The "weight" attribute is minimised, i.e. it is treated as a distance
nx.shortest_path(G, weight="weight", source=1, target=3)
nx.shortest_path_length(G, weight="weight", source=1, target=3)
> [1, 2, 3]
> 4.0
With this in mind, I suggest we include within splink_graph functions that:
- Convert Splink probabilities to an appropriate measure of distance - suggestion below
- Convert match weights (log2 Bayes factors) to an appropriate measure of distance - this probably just means fixing infinite and zero distances.
And then all functions should use distance, noting in the docstring that this is a synonym of weight.
from pyspark.sql import functions as f


def probability_to_distance(df, prob_colname, out_colname="distance"):
    # Match weight = log2 Bayes factor, clamped to [-40, 40] so that
    # probabilities of exactly 0 or 1 do not give infinite values
    log2_bf = f"log2({prob_colname}/(1-{prob_colname}))"
    expr = f"""
    case
        when {prob_colname} = 0.0 then -40
        when {prob_colname} = 1.0 then 40
        when {log2_bf} > 40 then 40
        when {log2_bf} < -40 then -40
        else {log2_bf}
    end
    """
    df = df.withColumn("__match_score__", f.expr(expr))

    score_min = df.select(f.min("__match_score__")).collect()[0][0]
    score_max = df.select(f.max("__match_score__")).collect()[0][0]

    # Higher match weights mean lower distances, so invert the min-max
    # scaled score. The 1.01 offset keeps every distance strictly positive
    # (range 0.01 to 1.01). Assumes score_max > score_min.
    expr = f"""
    1.01 - ((__match_score__ - {score_min})/{score_max - score_min})
    """
    df = df.withColumn(out_colname, f.expr(expr))

    df = df.drop("__match_score__")
    return df
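
For the second bullet above (converting match weights directly), a rough sketch in plain Python might look like the following. The function name, the fixed clamp of 40 and the 0.01 floor are all assumptions for illustration. Note that it scales against fixed bounds rather than the observed min and max, which is a different design choice from the Spark function.

```python
import math


def match_weight_to_distance(match_weight, clamp=40.0, floor=0.01):
    """Hypothetical helper: map a log2 Bayes factor to a positive distance.

    Infinite match weights are clamped to +/- clamp, and the floor stops
    the strongest matches from getting a distance of exactly zero.
    """
    if math.isnan(match_weight):
        raise ValueError("match_weight must not be NaN")
    # Clamp so that +/- infinity become finite
    clamped = max(-clamp, min(clamp, match_weight))
    # Scale to [0, 1], where 1 is the strongest possible match
    scaled = (clamped + clamp) / (2 * clamp)
    # Invert: a higher match weight gives a lower distance
    return floor + (1.0 - scaled)


strongest = match_weight_to_distance(float("inf"))
weakest = match_weight_to_distance(float("-inf"))
neutral = match_weight_to_distance(0.0)
```

With these defaults, distances fall between 0.01 and 1.01, so they are safe to pass to shortest path functions that require positive edge costs.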