Comments (10)
Hi Leran,
I think the crash relates to #34, and the logging should be addressed in the next build. Let's revisit it once the pipeline is finishing correctly.
from hecatomb.
Some additional data:
134 samples run on Pathogen:
assembly.fasta: 24,280
seqtable.fasta: 35,306,586
Same 134 samples running on HTCF (MMSEQS_AA_PRIMARY step):
assembly.fasta: 24,568
seqtable.fasta: 9,344,202
I don't think the second (HTCF) seqtable will grow any further after this step.
from hecatomb.
It seems like the difference is mainly arising during clustering:
Pathogen:
more p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log
Size of the sequence database: 341461
Size of the alignment database: 341461
Number of clusters: 325347
more p08_remove_exact_dups.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log
Input: 971106 reads 223037112 bases.
Duplicates: 65529 reads (6.75%) 15441315 bases (6.92%) 0 collisions.
Result: 905577 reads (93.25%) 207595797 bases (93.08%)
HTCF (slurm):
more p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log
Size of the sequence database: 147516
Size of the alignment database: 147516
Number of clusters: 77134
more p08_remove_exact_dups.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log
Input: 971504 reads 223125309 bases.
Duplicates: 65541 reads (6.75%) 15444031 bases (6.92%) 0 collisions.
Result: 905963 reads (93.25%) 207681278 bases (93.08%)
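To put the two clustering logs on the same footing, the number of clusters per input sequence can be worked out directly from the figures above. This is only a minimal sketch: the `runs` dictionary and the `cluster_ratio` helper are illustrative names, not part of hecatomb.

```python
# Numbers copied from the two p09_cluster_similar_sequences logs above.
runs = {
    "Pathogen": {"sequences": 341461, "clusters": 325347},
    "HTCF": {"sequences": 147516, "clusters": 77134},
}

def cluster_ratio(sequences: int, clusters: int) -> float:
    """Number of clusters per input sequence (1.0 means nothing was merged)."""
    return clusters / sequences

for name, counts in runs.items():
    ratio = cluster_ratio(**counts)
    print(f"{name}: {counts['clusters']} / {counts['sequences']} = {ratio:.1%}")
```

On these logs the Pathogen run keeps roughly 95% of its sequences as separate clusters, while the HTCF run keeps roughly 52%.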
from hecatomb.
This may well be due to some changes we made to the clustering parameters for linclust. There was some toying with these settings over the past month, so if you ran the Pathogen run a while back and the HTCF run more recently, you would likely get different results.
Can you check the cluster settings for each run? They should be in your config file, but may be in the rule file depending on which version you have and when Mike made updates. A difference that large is more likely explained by a major setting change than by a compute error.
from hecatomb.
Here are the differences:
Pathogen:
mmseqs easy-linclust hecatomb_out/PROCESSING/TMP/p08/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1.deduped.out.fastq hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1 hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_TMP --kmer-per-seq-scale 0.3 -c 0.97 --cov-mode 1 --threads 16 &> hecatomb_out/STDERR/p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log;
HTCF:
mmseqs easy-linclust hecatomb_out/PROCESSING/TMP/p08/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1.deduped.out.fastq hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_R1 hecatomb_out/PROCESSING/TMP/p09/M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10_TMP --kmer-per-seq-scale 0.3 -c 0.7 --cov-mode 1 --min-seq-id 0.95 --alignment-mode 3 --threads 24 &> hecatomb_out/STDERR/p09_cluster_similar_sequences.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log;
from hecatomb.
NOTE:
-c (minimum coverage): lowered from 0.97 to 0.7
--min-seq-id with --alignment-mode 3 (number of identical aligned residues divided by the number of aligned columns, including internal gap columns): newly added, set to 0.95
- not clear to me what happens without these params
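The same comparison can be made mechanically by parsing the option strings of the two commands and printing only the options that differ. This is a rough sketch: `parse_flags` is a hypothetical helper written for this thread, and it assumes option values never start with a dash.

```python
import shlex

def parse_flags(args: str) -> dict:
    """Map each option (a token starting with '-') to its value, or True if it has none."""
    tokens = shlex.split(args)
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-"):
            # Assumption: a value never starts with '-', so the next option marks "no value".
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
            flags[tok] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1
    return flags

# Option strings copied from the Pathogen and HTCF commands (paths omitted).
pathogen = parse_flags("--kmer-per-seq-scale 0.3 -c 0.97 --cov-mode 1 --threads 16")
htcf = parse_flags(
    "--kmer-per-seq-scale 0.3 -c 0.7 --cov-mode 1 "
    "--min-seq-id 0.95 --alignment-mode 3 --threads 24"
)

# Print only the options whose values differ between the two runs.
for flag in sorted(set(pathogen) | set(htcf)):
    old, new = pathogen.get(flag), htcf.get(flag)
    if old != new:
        print(f"{flag}: Pathogen={old} HTCF={new}")
```

Options missing from one command show up as `None` on that side, which makes the added `--min-seq-id` and `--alignment-mode` stand out next to the changed `-c` and `--threads`.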
from hecatomb.
Well, dropping -c from 0.97 to 0.7 is going to have the greatest effect, and 0.7 is likely far lower than what we intend. This should be set back to 0.97 or 0.95.
You can read about alignment mode here: https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster
@mroach-awri can we go back to my original settings as the default for now?
from hecatomb.
Yes, I think I pushed my settings to the last build. The original default (-c .97 --cov-mode 0) I think specifies 97% residue matches in the longer sequence, which will only cluster sequences that are almost identical end to end, so most clusters are n=1. The current settings (--cov-mode 1 -c 0.7 --min-seq-id 0.95) specify 70% alignment coverage of the member sequence by the rep sequence at 95% identity. These are the settings I'll probably be using, as I want to maximize runtime performance.
I'll be pushing a new build soon; did you want the original defaults or did you have a specific coverage and seq identity in mind?
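To make the difference between the two coverage settings concrete, here is a simplified sketch of how a single alignment would be accepted or rejected. It assumes the MMseqs2 documentation's description of `--cov-mode 0` (coverage of both query and target) and `--cov-mode 1` (coverage of the target only), and it approximates coverage as alignment length divided by sequence length. The function name and the example lengths are hypothetical, and which sequence linclust treats as query or target is not modelled here.

```python
def passes_coverage(aln_len: int, query_len: int, target_len: int,
                    c: float, cov_mode: int) -> bool:
    """Simplified coverage filter for --cov-mode 0 and 1 (not MMseqs2's actual code)."""
    query_cov = aln_len / query_len
    target_cov = aln_len / target_len
    if cov_mode == 0:
        # The alignment must cover at least fraction c of BOTH sequences.
        return query_cov >= c and target_cov >= c
    if cov_mode == 1:
        # The alignment must cover at least fraction c of the target only.
        return target_cov >= c
    raise ValueError("only cov-mode 0 and 1 are sketched here")

# Hypothetical pair: 150-residue query, 100-residue target, 100-residue alignment.
print(passes_coverage(100, 150, 100, c=0.97, cov_mode=0))  # False: query coverage is ~0.67
print(passes_coverage(100, 150, 100, c=0.7, cov_mode=1))   # True: target coverage is 1.0
```

Under these assumptions, a shorter sequence fully contained in a longer one is rejected by the stricter bidirectional setting but accepted once only the target has to be covered, which fits with far fewer clusters at `-c 0.7 --cov-mode 1`.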
from hecatomb.
Running:
grep \> seqtable.fasta | sed 's/.*:\(.*\):.*/\1/' | awk '{n+=$1}END{print n}'
gives a consistent number of reads for the seqtable.fasta file.
Leran
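For anyone who prefers to avoid the sed/awk pipeline, the one-liner above can be approximated in Python. This sketch assumes, as the sed expression implies, that each header carries the read count as its second-to-last colon-separated field; the example headers are made up, and `sum_header_counts` is a hypothetical helper rather than anything shipped with hecatomb.

```python
import io

def sum_header_counts(fasta_lines) -> int:
    """Sum the second-to-last colon-separated field of every FASTA header line."""
    total = 0
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line.rstrip("\n").split(":")
        if len(fields) >= 3:  # at least two colons, as the sed pattern requires
            total += int(fields[-2])
    return total

# Made-up headers for illustration; real seqtable.fasta headers may look different.
example = io.StringIO(">s1:3:x\nACGT\n>s2:5:y\nTTGA\n")
print(sum_header_counts(example))  # 8

# With a real file (path is an assumption):
# with open("seqtable.fasta") as fh:
#     print(sum_header_counts(fh))
```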
from hecatomb.