Comments (4)
Yes absolutely.
Your previous analysis should have generated a pangenome.h5 file, which you can reuse to rerun parts of the workflow.
The simplest option is to change the minimum length required for an RGP to be predicted. The default is 3000 bp. If you want the predicted RGPs to be at least 5000 bp long, for example, you can use the '--min_length' option with this pangenome.h5 file, as follows:
ppanggolin rgp -p pangenome.h5 --min_length 5000
Another possibility is to modify the minimum score threshold. The default threshold is 4, which roughly means that you need at least 4 shell or cloud genes close together to get an RGP when the other parameters are left at their defaults.
If you feel this is not strict enough and only want regions with many more genes, you can raise this threshold, as follows:
ppanggolin rgp -p pangenome.h5 --min_score 8
This will set the threshold to 8 instead of the default 4.
There are other parameters, but they are less straightforward to explain. You can see them all by running ppanggolin rgp -h.
Afterward, you can regenerate the 'plastic_regions.tsv' file by running:
ppanggolin write -p pangenome.h5 --regions --output MyNewRegionsOutputDir
If you do start tweaking the parameters, you might find the following command useful:
ppanggolin info -p pangenome.h5 --parameters
which will list the parameters used to compute the results currently stored in the .h5 file for all the steps of the analysis.
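To try several thresholds, it can help to generate the three commands above from a small script. The following is only a sketch: it builds and prints the command lines without running them, it uses only the options quoted in this comment, and it assumes that '--min_length' and '--min_score' can be given in the same call. The function name rgp_rerun_commands and its arguments are made up for illustration.

```python
import shlex


def rgp_rerun_commands(pangenome="pangenome.h5", min_length=None,
                       min_score=None, output_dir="MyNewRegionsOutputDir"):
    """Return the rgp / write / info command lines as strings (nothing is executed)."""
    rgp = ["ppanggolin", "rgp", "-p", pangenome]
    # Only add a threshold when the caller asks for a non-default value.
    if min_length is not None:
        rgp += ["--min_length", str(min_length)]
    if min_score is not None:
        rgp += ["--min_score", str(min_score)]
    write = ["ppanggolin", "write", "-p", pangenome,
             "--regions", "--output", output_dir]
    info = ["ppanggolin", "info", "-p", pangenome, "--parameters"]
    return [shlex.join(cmd) for cmd in (rgp, write, info)]


if __name__ == "__main__":
    for line in rgp_rerun_commands(min_length=5000, min_score=8):
        print(line)
```

Printing the commands first makes it easy to check them before pasting them into a terminal.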
from ppanggolin.
Hello
Taken alone, those two parameters kind of oppose each other.
The default persistent penalty is 3. Decreasing it might fuse two RGPs that are close together along the genome but separated by some persistent genes. Increasing it might divide RGPs into multiple components if there are persistent genes included in them.
The default variable gain is 1. Increasing it might fuse two RGPs that are close together along the genome, while decreasing it might divide RGPs into multiple components if there are persistent genes included in them.
And both of those parameters will impact the score of the RGPs that are predicted.
In any case, having persistent genes in the middle of RGPs is relatively rare, so modifying those parameters slightly should not have much impact. Changing them greatly, however, might no longer give you biologically meaningful results, as you may group RGPs together over long stretches of persistent genes.
If you want to understand in more detail how all of those parameters interact, the full method is described in this preprint: https://www.biorxiv.org/content/10.1101/2020.03.26.007484v1.full
In part "2.1 - panRGP method"
In part 2.1.1, parameter p in the formula corresponds to persistent penalty, parameter v to variable gain
In part 2.1.2, parameter s_min is the "min_score" and l_min is the "min_length" that I mentioned previously.
Only 2.1.1 and 2.1.2 will be of interest for understanding how the RGPs are predicted.
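To make the fuse/split behaviour concrete, here is a toy sketch of a running score along a contig. It is a simplified reading of the scheme, not PPanGGOLiN's actual code: it assumes each shell or cloud gene adds the variable gain, each persistent gene subtracts the persistent penalty, and the score never drops below 0. It then treats every stretch of positive scores whose peak reaches min_score as one candidate region. The real method also applies the minimum length and sets region boundaries more carefully, and the function names here are invented for the example.

```python
def running_scores(partitions, persistent_penalty=3, variable_gain=1):
    """Score each gene position; the score is floored at 0."""
    scores, current = [], 0
    for partition in partitions:
        if partition == "persistent":
            current = max(0, current - persistent_penalty)
        else:  # shell or cloud gene
            current += variable_gain
        scores.append(current)
    return scores


def candidate_regions(scores, min_score=4):
    """Return (start, end) index pairs of positive-score runs peaking at >= min_score."""
    regions, start = [], None
    # A trailing 0 acts as a sentinel that closes a run reaching the contig end.
    for i, score in enumerate(scores + [0]):
        if score > 0 and start is None:
            start = i
        elif score == 0 and start is not None:
            if max(scores[start:i]) >= min_score:
                regions.append((start, i - 1))
            start = None
    return regions


if __name__ == "__main__":
    # Toy contig: 5 shell genes, 1 persistent gene, 5 cloud genes.
    genes = ["shell"] * 5 + ["persistent"] + ["cloud"] * 5
    for penalty in (3, 6):
        scores = running_scores(genes, persistent_penalty=penalty)
        print(penalty, scores, candidate_regions(scores))
```

On this toy contig, a penalty of 3 leaves the score above 0 at the persistent gene, so everything stays in one region, while a penalty of 6 resets the score to 0 there and yields two separate regions.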
If something is unclear, do not hesitate to ask more questions :)
from ppanggolin.
Hello!
Could you briefly explain these options?
--persistent_penalty
--variable_gain
If I increase or decrease these values, what should I expect?
Thanks for taking the time to answer these basic things.
Really appreciated
from ppanggolin.
Since this is from May and there have been no other questions since, I will close this issue. If you have any other questions, please do not hesitate to reopen it.
from ppanggolin.