Comments (17)
Not yet, let's get the samplesheets and database sheets uploaded to test-datasets and then we can close it :)
from taxprofiler.
https://lomanlab.github.io/mockcommunity/
And @sofstam will help look for 'real' Illumina/Nanopore stuff :)
I have asked at our site whether we are allowed to use some of our data as test data. Meanwhile, I found these for Nanopore:
https://www.ebi.ac.uk/ena/browser/view/PRJNA312719
@jfy133 I got a response from our site that we cannot use any of our data as test data right now. We need ethical approval, which we will be working on during the fall.
OK, let's just look for already published stuff 👍
Maybe: https://www.nature.com/articles/s41597-019-0287-z ?
I will have a look at this!
Sounds good with the dataset from this article. Since the dataset is focused on bacteria, it might be good to have test data for viruses as well? https://www.ebi.ac.uk/ena/browser/view/PRJNA670157?show=reads
Beyond the mock communities, I'm personally interested in using the CAMI data.
We also need to decide on databases, and where to store them (presumably AWS...?)
I think Zenodo could be a good place if we want to publish the benchmark. Then every database has a DOI.
I fear the file sizes for some will be too large for Zenodo (50 GB limit), but we can see.
Minimum criteria for full-test data:
- Fastq files, 5 Illumina, 2 Nanopore
- Sequencing depth > 10M
- Shotgun experiment
- Multiple run accessions for one sample
- Host removal (contaminant preferably)
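The countable parts of these criteria can be checked against a samplesheet automatically. Below is a minimal sketch, assuming the column names `sample` and `instrument_platform` and the platform values `ILLUMINA` and `OXFORD_NANOPORE` as they appear in the samplesheet later in this thread. The function names and thresholds are hypothetical, and sequencing depth and host removal are not checked here.

```python
import csv
import io
from collections import Counter, defaultdict


def summarise_samplesheet(csv_text):
    """Count distinct samples per platform and list samples with more than one run."""
    samples_per_platform = defaultdict(set)
    runs_per_sample = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        samples_per_platform[row["instrument_platform"]].add(row["sample"])
        runs_per_sample[row["sample"]] += 1
    platform_counts = {p: len(s) for p, s in samples_per_platform.items()}
    multi_run = sorted(s for s, n in runs_per_sample.items() if n > 1)
    return platform_counts, multi_run


def meets_criteria(csv_text, min_illumina=5, min_nanopore=2):
    """Check platform counts and that at least one sample has multiple runs."""
    counts, multi_run = summarise_samplesheet(csv_text)
    return (
        counts.get("ILLUMINA", 0) >= min_illumina
        and counts.get("OXFORD_NANOPORE", 0) >= min_nanopore
        and len(multi_run) > 0
    )
```

Calling `meets_criteria(open("samplesheet.csv").read())` on a candidate sheet (the file name is a placeholder) gives a quick yes/no before any data is downloaded.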
To 'borrow' from the MAG full-test data, we can pick 2-3 Illumina and 2-3 ONT samples/runs from here: https://www.ebi.ac.uk/ena/browser/view/PRJEB29152
It has both Illumina and Nanopore, the sequencing depth is >10M, and it is shotgun.
https://www.nature.com/articles/s41597-019-0287-z
I'm posting the article here so we do not forget.
Meslier2022 is what we are going for: https://www.nature.com/articles/s41597-022-01762-z
I did a test run:
nextflow run nf-core/taxprofiler \
  -profile mpcdf,raven \
  -r combine-kreports-fix \
  --input fulltest_samplesheet.csv \
  --databases fulltest_dbsheet.csv \
  --outdir ./results \
  --save_preprocessed_reads \
  --perform_shortread_qc \
  --shortread_qc_mergepairs \
  --perform_shortread_complexityfilter \
  --save_complexityfiltered_reads \
  --perform_longread_qc \
  --perform_shortread_hostremoval \
  --perform_longread_hostremoval \
  --hostremoval_reference 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/819/615/GCA_000819615.1_ViralProj14015/GCA_000819615.1_ViralProj14015_genomic.fna.gz' \
  --save_hostremoval_index \
  --save_hostremoval_mapped \
  --save_hostremoval_unmapped \
  --perform_runmerging \
  --save_runmerged_reads \
  --run_centrifuge \
  --centrifuge_save_reads \
  --run_diamond \
  --run_kaiju \
  --run_kraken2 \
  --kraken2_save_reads \
  --kraken2_save_readclassification \
  --kraken2_save_minimizers \
  --run_krakenuniq \
  --krakenuniq_save_reads \
  --krakenuniq_save_readclassifications \
  --run_bracken \
  --run_malt \
  --malt_save_reads \
  --malt_generate_megansummary \
  --run_metaphlan3 \
  --run_motus \
  --run_profile_standardisation \
  --run_krona \
  -ansi-log false \
  -with-tower \
  -resume
With the following samplesheet:
sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
MOCK_001_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/000/ERR9765780/ERR9765780.fastq.gz,,
MOCK_002_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/001/ERR9765781/ERR9765781.fastq.gz,,
MOCK_003_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/002/ERR9765782/ERR9765782.fastq.gz,,
MOCK_001_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_2.fastq.gz,
MOCK_002_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,2,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_2.fastq.gz,
And the following database sheet:
tool,db_name,db_params,db_path
bracken,bracken-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/bracken.tar.gz
centrifuge,centrifuge-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/centrifuge.tar.gz
diamond,diamond-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022/diamond/diamond.dmnd
kaiju,kaiju-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kaiju.tar.gz
kraken2,kraken2-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kraken2.tar.gz
krakenuniq,krakenuniq-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/krakenuniq.tar.gz
malt,malt-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/malt.tar.gz
metaphlan3,metaphlan3-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/metaphlan3.tar.gz
motus,motus-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/motus.tar.gz
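The sheet above supplies most databases as `.tar.gz` archives, while DIAMOND is given as a plain `.dmnd` file. Since archive handling differs per tool, it can help to list which entries are archives before a run. A minimal sketch, assuming the `tool`, `db_name` and `db_path` columns shown above; the function name and the set of archive suffixes are assumptions, not pipeline behaviour.

```python
import csv
import io

# Assumed set of archive extensions; adjust to whatever the pipeline actually accepts.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar")


def split_databases(dbsheet_text):
    """Return (archived, direct) lists of (tool, db_name) from a database sheet."""
    archived, direct = [], []
    for row in csv.DictReader(io.StringIO(dbsheet_text)):
        entry = (row["tool"], row["db_name"])
        if row["db_path"].endswith(ARCHIVE_SUFFIXES):
            archived.append(entry)
        else:
            direct.append(entry)
    return archived, direct
```

Running this on the sheet above would put `diamond` in the second list and every other tool in the first, which makes it easy to see which tools depend on the untar step.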
The database TARs are based on these instructions
And it mostly ran! Observations in the next comment
- The MultiQC report looks weird in a few places though
- The Nanopore reads have the ERR* name, not the sample name, in the general stats #223
- We still have the bracken entries in the general stats that we need to turn off #223
- Porechop plots all say 'not able to plot data' #227
[2023-01-20 18:53:22,590] multiqc.plots.bargraph [WARNING] Tried to make bar plot, but had no data: porechop-starttrim-barplot
[2023-01-20 18:53:22,590] multiqc.plots.bargraph [WARNING] Tried to make bar plot, but had no data: porechop-endtrim-barplot
[2023-01-20 18:53:22,590] multiqc.plots.bargraph [WARNING] Tried to make bar plot, but had no data: porechop-middlesplit-barplot
- I agree with @Midnighter that we need a separate MQC report for Nanopore data, but I don't want to mess with pipeline code atm and would rather focus on bug fixing (separate issue: #218)
- We only picked up the Bracken results, and no Kraken2 etc. for some reason (I guess the file names we used weren't correct) https://github.com/nf-core/taxprofiler/pull/223/files
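The Porechop warnings quoted above can be pulled out of a MultiQC log mechanically, which is handy for spotting every plot that ended up without data. A minimal sketch, assuming the warning wording shown in the log lines above stays the same; the function name is hypothetical.

```python
import re

# Matches the bar-plot warning wording seen in the log excerpt in this thread.
NO_DATA = re.compile(
    r"\[WARNING\]\s+Tried to make bar plot, but had no data: (\S+)"
)


def plots_without_data(log_text):
    """Return the plot IDs that MultiQC reported as having no data."""
    return NO_DATA.findall(log_text)
```

Because the pattern captures only the non-whitespace plot ID, it works whether the warnings sit on separate lines or were pasted together on one line.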
Output files:
- Centrifuge ran despite no MultiQC report; however, the .results.txt files in that directory are empty
- Kaiju is a complete fail, no output!
- MetaPhlAn3 runs fine for short reads, but all long reads fail with 100% unknown. Maybe we should filter this out (I suspect long reads are unsuitable because they are being mapped against short genes with a short-read aligner) #221
- mOTUs did not execute because supplying the database as a TAR is not accounted for here #219
- Separate short-read and long-read parameters for the databases. [v1.1!] #226
Overall we get 35% of reads classified with Bracken, so I think this is a good sign that this is a reasonable dataset
- Add to docs that we recommend running SR/LR separately (although running together is supported) #220
Shall we close this?