
Comments (5)

mmatschiner commented on August 22, 2024

Hi Érico,

I'm glad the one issue is sorted out. Regarding the second (the ValueError): I also noticed that when I tested your dataset, and I figured out that it was due to whitespace at the end of each line. I made a small change to the script so that this should now parse correctly. If you re-download F4, you should not see this issue anymore.
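For readers who hit the same ValueError: the format code `f` only applies to numbers, so a value that is still a string (for example, one read from a line with trailing whitespace) has to be stripped and converted first. The following is only a minimal sketch of that idea, not the actual change made to the script; the helper names `format_weight` and `append_weight` are hypothetical.

```python
def format_weight(raw_value):
    """Return a two-decimal string for a weight that was read as text."""
    # Strip trailing whitespace and newlines, then convert to float so that
    # the 'f' format code is applied to a number rather than to a str.
    return "{0:.2f}".format(float(raw_value.strip()))


def append_weight(line, raw_weight):
    """Append ' | <weight>' to a line, keeping a single trailing newline."""
    return line.replace("\n", "") + " | " + format_weight(raw_weight) + "\n"
```

Calling `append_weight("snp_1\n", "0.5 \n")` would then yield a line ending in `| 0.50`.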

Regarding your suggestion to allow the user to account for MAF filtering in the simulations: I agree, this would be possible and it seems to be a good idea. I'll see if I can find time to implement it, but I have too much on my plate at the moment.

Finally, I agree that the set of results that you describe does not sound very consistent. I assume you did that already, but if not, the first thing I would look at is the D-statistic. @millanek's Dsuite program allows the quick calculation of different versions of the D-statistic as well as the F-branch statistic, which might be helpful in your case.
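As a rough illustration of what the D-statistic measures, here is a minimal sketch of the frequency-based ABBA-BABA calculation. It is not Dsuite's code or interface; the function name and the input layout (one tuple of derived-allele frequencies per SNP, ordered P1, P2, P3, outgroup) are assumptions made for this example.

```python
def d_statistic(freqs):
    """Compute D from per-SNP derived-allele frequencies (p1, p2, p3, p_out)."""
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in freqs:
        # ABBA: derived allele shared by P2 and P3; BABA: shared by P1 and P3.
        abba += (1.0 - p1) * p2 * p3 * (1.0 - p_out)
        baba += p1 * (1.0 - p2) * p3 * (1.0 - p_out)
    total = abba + baba
    if total == 0.0:
        # No informative sites, so D is undefined.
        return float("nan")
    return (abba - baba) / total
```

A value near zero is what incomplete lineage sorting alone would produce; an excess of ABBA or BABA sites pushes D towards 1 or -1.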

Cheers,
Michael

from f4.

mmatschiner commented on August 22, 2024

Hi Érico,
this sounds to me like the pairs of species that you specify are not actually pairs. Could this be the case? Does Treemix group either Pop1 and Pop2 as sisters or Pop3 and Pop4 as sisters?
Cheers,
Michael


ericopolo commented on August 22, 2024

Hi, Michael, thanks for the quick response! First of all, I'm sorry, I actually didn't realize the order of populations in the input file would matter, so their position was completely random; I assumed simulations would alternate between the 3 possible topologies, but now I've noticed the information of the assumed topology in the output.

Indeed, most likely they are not actually pairs. Without allowing migration events, Treemix (as well as other phylogenetic methods) returns (1,3),(2,4). When allowing for migration (any number of events), the unrooted topology becomes (1,4),(2,3), which, by the way, agrees with mtDNA data. So the input file I gave you assumes the least likely topology, which would explain an overestimated theta.
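To make the role of the population order concrete, here is a small sketch that evaluates the standard f4 statistic, the mean of (pA - pB)(pC - pD) across SNPs, for each of the three possible pairings. It is an illustration only and not code from the F4 script; the function names and the input layout (one tuple of four allele frequencies per SNP, in the order pop1 to pop4) are assumptions.

```python
def f4(snps, a, b, c, d):
    """Mean of (p_a - p_b) * (p_c - p_d) over all SNPs."""
    if not snps:
        raise ValueError("at least one SNP is required")
    products = [(s[a] - s[b]) * (s[c] - s[d]) for s in snps]
    return sum(products) / len(products)


def f4_for_all_pairings(snps):
    """Evaluate f4 under each of the three ways to pair four populations."""
    # Indices are zero-based positions of pop1..pop4 in each SNP tuple.
    pairings = {
        "(1,2),(3,4)": (0, 1, 2, 3),
        "(1,3),(2,4)": (0, 2, 1, 3),
        "(1,4),(2,3)": (0, 3, 1, 2),
    }
    return {name: f4(snps, *idx) for name, idx in pairings.items()}
```

For the pairing that matches the true tree, f4 is expected to be zero in the absence of gene flow, which is why the assumed pairing matters for the test.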

However, changing the topology didn't have any effect. I've just tested the other two possible ones and the behavior is the same. Burn-in goes on until 28,000 or so and finally stops with that same error.


mmatschiner commented on August 22, 2024

Hi Érico,

This seems to be a tricky one. During the burn-in phase of the simulations, F4 increases the effective population size when the number of simulated SNPs variable in more than one population and the number of simulated SNPs variable on both sides of the root are smaller than the empirical numbers for these two measures. But even with the effective population size going to infinity, the empirical numbers are not reached in your case. Could it be that you filtered the dataset based on a minor allele frequency threshold or similar? It appears that the proportions of SNPs that are variable in more than two populations and on both sides of the root of the assumed tree are unusually large.
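The two measures mentioned above could be counted roughly as in the sketch below. This is one reading of the description, not the script's actual code: it assumes each SNP is a tuple of four allele frequencies, that a population is "variable" when its frequency lies strictly between 0 and 1, and that the assumed tree places pop1 and pop2 on one side of the root and pop3 and pop4 on the other. The function name is hypothetical.

```python
def summarize_variability(snps, left=(0, 1), right=(2, 3)):
    """Count SNPs variable in >1 population and on both sides of the root."""
    n_multi = 0
    n_both_sides = 0
    for freqs in snps:
        # A population is treated as variable if both alleles are present.
        variable = [0.0 < f < 1.0 for f in freqs]
        if sum(variable) > 1:
            n_multi += 1
        if any(variable[i] for i in left) and any(variable[i] for i in right):
            n_both_sides += 1
    return n_multi, n_both_sides
```

Comparing these two counts between the empirical and the simulated data shows whether the simulations can match the observed sharing of variation at all.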


ericopolo commented on August 22, 2024

Hello again, Michael!

You're right. I had filtered out SNPs with a MAF < 0.1. Without this filtering (i.e., removing only constant sites and sites with missing data) things worked just fine, at least with the two more likely topologies. With the third one (the one I sent you), however, I got another error, that occurred just after the simulations:

Traceback (most recent call last):
  File "/home/ericolegal/bin/f4.py", line 1235, in <module>
    outlier_lines[x] = outlier_lines[x].replace("\n","") + " | " + "{0:.2f}".format(line_weight) + "\n"
ValueError: Unknown format code 'f' for object of type 'str'

The output file ends at the "interpretation" part, without saying anything about the SNPs driving the f4 statistic to be different than zero, like it does with the other two topologies. I'm sending the input again, this time without the MAF filtering. input.txt

Back to the original issue: although my MAF filtering seems to improve Treemix results (they make much more sense from the biogeographic point of view), I can see the problem it causes when simulating data. To fix that, I think it would be necessary to ask users for that information (MAF filtering) and include an equivalent MAF filter at the "masking" stage, along with the inclusion of missing data, and before the burn-in. I know, of course, that the filtering in itself is not necessary for the f4 test, and maybe not even useful in any sense, especially with your simulation method, and that such an improvement would only make things more convenient for some users, so maybe a warning in the README would suffice?
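A sketch of what such an equivalent filter could look like is given below. It is only an illustration of the suggestion, not part of F4; the function names and the input layout (per SNP, one `(alt_count, total_count)` pair per population) are assumptions, and the MAF is computed across all populations pooled.

```python
def minor_allele_frequency(snp):
    """Pooled MAF for one SNP given (alt_count, total_count) per population."""
    alt = sum(a for a, _ in snp)
    total = sum(n for _, n in snp)
    if total == 0:
        # No called alleles at this site.
        return 0.0
    return min(alt, total - alt) / total


def apply_maf_filter(snps, threshold=0.1):
    """Keep only SNPs whose pooled MAF is at least the threshold."""
    return [snp for snp in snps if minor_allele_frequency(snp) >= threshold]
```

Applying the same threshold to the simulated SNPs as to the empirical ones would keep the two datasets comparable before the burn-in counts are made.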

Now, maybe this is not the right place to talk about this, but I'm very puzzled about the result I got. What made me search for alternatives to the "fourpop" algorithm implemented in Treemix is that it was the only test that told me there's no introgression with a particular topology: (1,3),(2,4). As I mentioned before, Treemix finds this topology when not accounting for migration, but one migration event is enough not only to significantly improve the likelihood and lower the values in the residual covariance matrix, but it also receives very strong statistical support for a high weight value. The "threepop" test tells me that pop2 is admixed whenever pop1 is included among the other two (which makes sense geographically). D3 (a statistic based on ABBA-BABA) also tells me there is introgression. Fourpop, on the other hand, is telling me that the topology I get from phylogenetic methods that don't take into account any kind of horizontal transfer (e.g. SNAPP, ASTRAL) needs no other process than genetic drift to explain the data. I wondered whether that was being caused by issues with the jackknife blocks (my data being from ddRADseq). So I came across your method, and was confident that the fastsimcoal simulations would tell me that life is good. But it is not. Your test also told me ILS is enough to explain the topology. And now I'm struggling to reconcile those results. Have you seen something like this? Any thoughts would be much appreciated.

Anyway, thank you so much for your help with this "overflow" issue, it was a fun one to solve, thanks to your attention and support.

Cheers,

Érico.

