Comments (3)
Hm, you're right. This was certainly not intentional. And, it's true for f4
as well:
import msprime as ms
ts = ms.sim_ancestry(sequence_length=1000, samples=4, ploidy=2, random_seed=1)
m = ms.sim_mutations(ts, rate=5e-3, random_seed=2)
a, b, c, d = (0, 1), (2, 3), (4, 5), (6, 7)
f12 = m.f4((a, b, c, d), span_normalise=False)
f21 = m.f4((c, d, a, b), span_normalise=False)
assert f12 == f21, f"f4: {f12} !={f21}"
Here's the issue: the definition we gave for f4 (and by extension, f3 and f2) is
The average density of sites at which a and c agree but differs from b and d,
minus the average density of sites at which a and d agree but differs from b and c
This sounds symmetric under switching (a,b) with (c,d), and indeed it is for biallelic loci, since
a and c agree, but differ from b and d
is the same for biallelic loci as
a=c != b=d
and hence the same as
b and d agree, but differ from a and c
However, with more than two alleles, we can have b and d differing yet still neither agreeing with a and c.
So, I see two options:
- symmetrize the definition; this would then have the interpretation
The average density of sites at which a and c agree but differs from b and d, plus the average density of sites at which b and d agree but differs from a and c, minus the average density of sites at which a and d agree but differs from b and c and the average density of sites at which b and c agree but differs from a and d, all divided by two.
- leave it as-is, documenting the possible asymmetry.
I'm in favor of (2) because:
a. people can symmetrize it themselves if they want;
b. the current version seems a bit more natural and perhaps informative than the symmetrized version
c. the only downside of the current version I'm aware of is that it's surprising, so documenting should fix that.
However, if you know of a reason that the symmetrized version (or something else?) is a more natural estimator or something, then that's worth considering?
FYI, we've defined all the statistics in a way that makes sense with more-than-biallelic sites: the strategy taken is to treat each allele as a binary split between "the allele" and "all other alleles", and compute statistics in a way that sums a summary over these splits (see the paper for more discussion). I'm not aware of any other issues like this one that arises from multiallelic sites: for instance, ts.divergence( )
is still symmetric.
from tskit.
Hi Peter,
Thanks a lot for your thorough investigation and explanation of the issue.
Perhaps a lot of the difference in intuition (and why this behavior of tskit was surprising to me) is that I tend to think of
However, I think from the
- For biallelic sites, the
$F$ -stats have some useful relations, e.g.$2F_4(a,b,c,d) = F_2(a,d) + F_2(b,c) - F_2(a,c) - F_2(b,d)$ , it would be nice if they were retained in the multiallelic case. - For biallelic sites,
$F_2$ can be written in terms of pairwise divergences$\pi_{ab}$ and within-sample diversity$\pi_a$ (eq 17 in paper):
$F_2 = \pi_{12} - \pi_1/2 - \pi_2/2$ . This definition starts from asking whether the two alleles in two haploid individuals differ from each other. Thus, if you had pressed me to come up with an extension of$F_2$ -stats to multiallelic sites, this is probably what I would have used. - Combining these two points, one can write
$F_4$ in terms of pairwise divergences:$2F_4 = \pi_{14} + \pi_{23} - \pi_{13} -\pi_{24}$ . This, I think, coincides with the symmetric estimator you laid out. - As far as I can see, the symmetric versions of the statistic retain all the interpretations and properties from the biallelic case; symmetric
$F_2$ is a tree metric, a (squared) distance, a covariance.,$F_3$ and$F_4$ can be interpreted as the lengths of internal and external branches, respectively. Now, these branch lengths will be biased relative to the ones calculated without mutations, presumably in similar ways that naive estimators of branch lengths from number of mutations are biased under recurrent mutation models. However, I think they should still be tree-additive.
In addition, most applications of
In summary, I think the symmetric version is just the nicer, more convenient and, to me, more intuitive extension of
Ben
from tskit.
Thanks very much - this is compelling (but I'm not yet sure). I think we considered the
but used the definitions we ended up with because (a) they were equivalent for biallelic sites, and (b) they were easier to explain in words. But, do you have a good and concise way to explain
However, I think that our current version, symmetrized (i.e., (ts.f4(a,b,c,d) + ts.f4(c,d,a,b))/2
) is not equal to the pairwise divergences version. So, I think what you've got is a third proposal for the definitions, in terms of divergence and diversity, and distinct from the "let's just symmetrize" proposal. Or, have I got this wrong?
Thanks for helping get this right.
p.s. I ran across the thread in which we worked out the multiallelic thing; it would have been nice to have your input at the time!
from tskit.
Related Issues (20)
- edges lost after merging two trees HOT 20
- Codecov upload issues HOT 1
- keep_intervals() giving _tskit.LibraryError: Can't squash, flush, simplify or link ancestors... HOT 7
- Update GitHub upload/download artefacts
- Add XTable.drop_metadata HOT 1
- trees.c Compilation Error HOT 2
- Split large numbers in html/cli print out.
- Post release for 0.5.7 HOT 1
- Fixup tests for lshmm 0.0.6 HOT 7
- Fix tests for numpy 2.0 HOT 1
- Drop "benchmark" CI job?
- Pass a numpy array of booleans to python simplify? HOT 7
- Cannot pickle '_tskit.Tree' object HOT 3
- Trivial wording change HOT 1
- clarify that genetic_relatedness includes self comparisons HOT 1
- set default for genetic_relatedness to polarised=True and remove a factor of 1/2
- "mutation" mode for statistics HOT 1
- inconsistent handling of non-ancestral material in non-strict, unpolarised branch stats HOT 2
- wrong output for genetic_relatedness(... indexes=None, proportion=True)
- Clarification on pairwise coalescent rates HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from tskit.