Comments (6)
Update: I changed the Biopython portion from read() to parse() and was able to run the tree. However, I am now getting P-values of 0 for every SNP in the test files.
from bad_mutations.
Hello,
The first issue (multiple trees in file) looks like a real bug - I will start working on the tree sanitization script to address that. The second issue, regarding P-values of 0, seems like a more complicated case. I will look into that once the tree sanitizing is in place and I can reliably reproduce the problematic files. Thank you, and apologies for the inconvenience!
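For readers hitting the same "multiple trees in file" problem, here is a minimal standard-library sketch of one way to keep only the first tree from a Newick file. It is not the project's sanitization script. The function names `split_newick_trees` and `first_tree` are hypothetical, and the sketch assumes trees are terminated by `;` and that any semicolons inside labels are protected by single quotes (bracketed Newick comments are not handled).

```python
def split_newick_trees(text):
    """Split Newick text into individual tree strings, each ending in ';'.

    Assumption: semicolons inside single-quoted labels are not terminators.
    Bracketed comments are not treated specially in this sketch.
    """
    trees = []
    current = []
    in_quote = False
    for ch in text:
        if ch == "'":
            # A doubled quote ('') inside a label toggles twice, so the
            # quoted state is preserved.
            in_quote = not in_quote
        current.append(ch)
        if ch == ";" and not in_quote:
            tree = "".join(current).strip()
            if tree != ";":  # skip empty statements
                trees.append(tree)
            current = []
    return trees


def first_tree(text):
    """Return the first tree in the text, or raise if none is present."""
    trees = split_newick_trees(text)
    if not trees:
        raise ValueError("no Newick tree found")
    return trees[0]


if __name__ == "__main__":
    example = "((A:0.1,B:0.2):0.3,C:0.4);\n((A:0.1,C:0.2):0.3,B:0.4);\n"
    print(first_tree(example))
```

Writing the result of `first_tree` back to a file would give a single-tree input, though which tree is the right one to keep depends on how the file was produced.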
Thank you for your help! Please let me know if you would like more information or need any help. As an update, I tried running the tree again with 42 species instead of all 90 (what currently downloads) and now get proper P-values. However, all of them are significant, even for the SNPs in the test data that should not be.
Hello,
Thank you for your patience! I have pushed some edits to the dev branch that should address the tree parsing issue. With regard to the P-value issue, can you share the output of the predict and compile commands? I can make sure that the columns of the HYPHY output are being parsed and handled properly. Thank you!
Thanks again for the help! I have attached all the associated files and logs. Please note that if I run the predict command on the provided tree and MSA FASTA file, I get P-values similar to those in the sample output files. I also tried running both commands on some of my own data, and it gives highly significant P-values (e-10) for all possible SNPs, which seems somewhat suspicious to me.
CBF3.tree.txt
CBF3_MSA.fasta.txt
CBF3_Alignment.log.txt
CBF3_Predictions.log.txt
messages.log.txt
Hello,
Thank you for sharing the files! It looks like the alignments and tree files are generating properly and the HYPHY log does not show any problematic output. The issue might be that the logistic regression model for generating P-values does not really translate to analyses that use a different set of query genomes than the ones for which it was developed. My hunch is that the total substitution rate observed depends on the number of sequences in the alignment.
I think the easiest way to get a usable prediction from the output of BAD_Mutations when using a different set of sequences is to apply the heuristics that were originally used in the Chun and Fay 2009 paper (https://genome.cshlp.org/content/19/9/1553.abstract):
Deleterious mutations were predicted by nonsynonymous SNPs that disrupt significantly constrained codons defined by the LRT (P < 0.001) and a number of subsequent filters (Supplemental Table S1). First, positions with low power, <10 eutherian mammals, were eliminated. Second, a small number of sites with dN significantly greater than dS were discarded. Finally, positions where the derived deleterious allele occurred in another eutherian species were eliminated.
I'm sorry that the best solution for now is to go back to heuristic approaches for prediction. We applied similar heuristics in our 2016 paper, which you can see implemented in this old script: https://github.com/MorrellLAB/Deleterious_Mutations/blob/master/Analysis_Scripts/Count_Deleterious_By_Sample.py (function defined on lines 25-63). I'll look into getting a new logistic regression model with a more modern set of genomes for the next release of BAD_Mutations. My apologies!
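As a rough illustration of the heuristics quoted above, the filters could be sketched as follows. This is not the code from the linked script, and it does not read BAD_Mutations output directly. The `SitePrediction` record and all of its field names are hypothetical, and the thresholds are taken only from the quoted passage (LRT P < 0.001, at least 10 species aligned).

```python
from dataclasses import dataclass


@dataclass
class SitePrediction:
    """Hypothetical per-SNP record; field names are assumptions."""
    snp_id: str
    lrt_pvalue: float              # P-value from the likelihood ratio test
    n_species: int                 # species aligned at this codon
    dn_gt_ds: bool                 # True if dN is significantly greater than dS
    derived_in_other_species: bool # derived allele seen in another species


def is_deleterious(site, p_cutoff=0.001, min_species=10):
    """Apply the four filters described in the quoted passage."""
    return (
        site.lrt_pvalue < p_cutoff             # significantly constrained codon
        and site.n_species >= min_species      # enough power at this position
        and not site.dn_gt_ds                  # discard dN > dS sites
        and not site.derived_in_other_species  # allele not tolerated elsewhere
    )


if __name__ == "__main__":
    sites = [
        SitePrediction("snp1", 1e-6, 25, False, False),
        SitePrediction("snp2", 1e-6, 8, False, False),
        SitePrediction("snp3", 0.02, 30, False, False),
        SitePrediction("snp4", 1e-6, 30, False, True),
    ]
    for s in sites:
        print(s.snp_id, is_deleterious(s))
```

Both cutoffs are exposed as parameters because a different species set may call for different values.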