Comments (2)
Hi Sofia,
Thanks for the question.
The --min-coding-length parameter is indeed a length cutoff, but the default is 100 nucleotides of CDS (roughly 33 amino acids), so much shorter than 200 aa, and it's unlikely this threshold is responsible for your missing genes.
A few other parameters may be worth playing with, namely:
- reducing --edge-threshold (default 0.1) may reduce fragmentation of genes (but will increase run time and, in rare cases, lead to concatenated gene models)
- reducing --peak_threshold (default 0.8) may increase recall (but will reduce precision)
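If you want to try a few combinations of these two thresholds side by side, something like the sketch below could help. It only builds the command lines, it does not run anything. The threshold flags are spelled as written in this thread; the other flags (--fasta-path, --lineage, --gff-output-path), the file names, and the reduced values 0.05 and 0.6 are assumptions chosen for illustration, so please check everything against the output of Helixer.py --help before using it.

```python
import shlex
from itertools import product

# Flag spellings copied from this thread; confirm them with `Helixer.py --help`.
EDGE_FLAG = "--edge-threshold"
PEAK_FLAG = "--peak_threshold"
MIN_LEN_FLAG = "--min-coding-length"


def build_helixer_commands(fasta_path, lineage, edge_thresholds,
                           peak_thresholds, min_coding_length=100):
    """Return one argument list per (edge, peak) threshold combination."""
    commands = []
    for edge, peak in product(edge_thresholds, peak_thresholds):
        # Hypothetical output naming scheme so that runs do not overwrite each other.
        output = f"helixer_edge{edge}_peak{peak}.gff3"
        commands.append([
            "Helixer.py",
            "--fasta-path", fasta_path,      # assumed input flag
            "--lineage", lineage,            # assumed model-selection flag
            "--gff-output-path", output,     # assumed output flag
            EDGE_FLAG, str(edge),
            PEAK_FLAG, str(peak),
            MIN_LEN_FLAG, str(min_coding_length),
        ])
    return commands


if __name__ == "__main__":
    # Defaults (0.1, 0.8) plus one reduced value each; the reduced values are arbitrary.
    for cmd in build_helixer_commands("genome.fa", "land_plant",
                                      [0.1, 0.05], [0.8, 0.6]):
        print(shlex.join(cmd))
```

Each printed line can then be run by hand or submitted as a separate job, and the resulting GFF3 files compared for how many members of the family they recover.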
However, it's likely that the neural net simply didn't learn a good representation for this class of genes, and you're right that retraining may help. Certainly 3,000 gene copies from a single family should be enough to drastically improve performance on that family. While I haven't tried to boost performance by gene family, I could speculate on how I'd approach it.
Before I do that, a question: are you interested in only that gene family, or whole genome annotations that specifically perform better on that gene family?
from helixer.
Hi Alisandra,
Thank you for the reply. Looking at the parameters that you listed, my guess is that I would still likely miss my genes. It would probably be best to create a new model using the 3k genes in this family.
On this project (alfalfa plus other plants), I would only need to improve gene models for this gene family. There are published gene models that are good for the whole genome, but they are missing this family, so we tried Helixer specifically to see if we could find those genes, and we didn't.
I have other organisms in totally different projects for which I would like to improve the whole-genome annotations (first in line are a couple of sea anemones and corals). I will try to follow the documentation on how to build models for new organisms. I get a wide variety of species, especially invertebrates, that come to me for structural and functional annotation. Helixer seems like a great option that has been shown to provide good models in a short time for some other species (vertebrates) that I have helped with.
Sofia