Comments (17)

jfy133 avatar jfy133 commented on July 23, 2024

Not yet, let's get the samplesheets and database sheets uploaded to test-datasets and then we can close it :)

from taxprofiler.

jfy133 avatar jfy133 commented on July 23, 2024

https://lomanlab.github.io/mockcommunity/

And @sofstam will help look for 'real' Illumina/Nanopore stuff :)

sofstam avatar sofstam commented on July 23, 2024

I have asked at our site whether we are allowed to use some of our data as test data. Meanwhile, I found these for Nanopore:

https://www.ebi.ac.uk/ena/browser/view/PRJNA312719

sofstam avatar sofstam commented on July 23, 2024

@jfy133 I got a response from our site that we cannot use any of our data as test data right now; we need ethical approval, which we will be working on during the fall.

jfy133 avatar jfy133 commented on July 23, 2024

OK, let's just look for already published stuff 👍

Maybe: https://www.nature.com/articles/s41597-019-0287-z ?

sofstam avatar sofstam commented on July 23, 2024

I will have a look at this!

Sounds good with the dataset from this article. Since the dataset is focused on bacteria, it might be good to have test data for viruses as well? https://www.ebi.ac.uk/ena/browser/view/PRJNA670157?show=reads

Midnighter avatar Midnighter commented on July 23, 2024

Beyond the mock communities, I'm personally interested in using the CAMI data.

jfy133 avatar jfy133 commented on July 23, 2024

And we also need to decide on databases, and where to store them (presumably AWS...?)

Midnighter avatar Midnighter commented on July 23, 2024

I think Zenodo could be a good place if we want to publish the benchmark. Then every database has a DOI.

jfy133 avatar jfy133 commented on July 23, 2024

I fear the file sizes for some will be too large for Zenodo (50GB limit) but we can see
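
One way to find out before uploading is to compare the archive sizes against the limit. The snippet below is only a sketch: the function name is made up for illustration, and reading "50GB" as 50 * 10**9 bytes is an assumption, as is treating the limit as a per-file threshold.

```python
import os

# 50 GB as quoted in this thread; interpreting it as decimal gigabytes is an assumption.
ZENODO_LIMIT_BYTES = 50 * 10**9


def oversized(paths, limit=ZENODO_LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files larger than `limit`."""
    sizes = ((path, os.path.getsize(path)) for path in paths)
    return [(path, size) for path, size in sizes if size > limit]
```

Calling `oversized()` with the list of database archives would return only the ones that need a different home.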

sofstam avatar sofstam commented on July 23, 2024

Minimum criteria for full-test data:

  • Fastq files, 5 Illumina, 2 Nanopore
  • Sequencing depth > 10M
  • Shotgun experiment
  • Multiple run accessions for one sample
  • Host removal (contaminant preferably)
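
Some of these criteria can be checked mechanically once a samplesheet exists. The sketch below assumes the samplesheet columns `sample`, `run_accession` and `instrument_platform`, and it reads "5 Illumina, 2 Nanopore" as counts of distinct samples, which is an interpretation. Sequencing depth, shotgun design and host content cannot be read from the sheet, so they are not covered. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def check_samplesheet(csv_text, min_illumina=5, min_nanopore=2):
    """Check platform counts and find samples with more than one run accession."""
    samples_by_platform = defaultdict(set)
    runs_by_sample = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        samples_by_platform[row["instrument_platform"]].add(row["sample"])
        runs_by_sample[row["sample"]].add(row["run_accession"])
    return {
        "illumina_ok": len(samples_by_platform["ILLUMINA"]) >= min_illumina,
        "nanopore_ok": len(samples_by_platform["OXFORD_NANOPORE"]) >= min_nanopore,
        # Samples listed here satisfy "multiple run accessions for one sample".
        "multi_run_samples": sorted(
            sample for sample, runs in runs_by_sample.items() if len(runs) > 1
        ),
    }
```

The thresholds are parameters so that a smaller sheet can still be checked with relaxed counts.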

jfy133 avatar jfy133 commented on July 23, 2024

To 'borrow' from the MAG full-test data, we can pick 2-3 Illumina and 2-3 ONT samples/runs from here: https://www.ebi.ac.uk/ena/browser/view/PRJEB29152

It has both Illumina and Nanopore data, the sequencing depth is >10M, and it is shotgun.

sofstam avatar sofstam commented on July 23, 2024

https://www.nature.com/articles/s41597-019-0287-z

I'm posting the article here so we do not forget.

jfy133 avatar jfy133 commented on July 23, 2024

Meslier2022 is what we are going for: https://www.nature.com/articles/s41597-022-01762-z

I did a test run:

nextflow run nf-core/taxprofiler \
    -profile mpcdf,raven \
    -r combine-kreports-fix \
    --input fulltest_samplesheet.csv \
    --databases fulltest_dbsheet.csv \
    --outdir ./results \
    --save_preprocessed_reads \
    --perform_shortread_qc \
    --shortread_qc_mergepairs \
    --perform_shortread_complexityfilter \
    --save_complexityfiltered_reads \
    --perform_longread_qc \
    --perform_shortread_hostremoval \
    --perform_longread_hostremoval \
    --hostremoval_reference 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/819/615/GCA_000819615.1_ViralProj14015/GCA_000819615.1_ViralProj14015_genomic.fna.gz' \
    --save_hostremoval_index \
    --save_hostremoval_mapped \
    --save_hostremoval_unmapped \
    --perform_runmerging \
    --save_runmerged_reads \
    --run_centrifuge \
    --centrifuge_save_reads \
    --run_diamond \
    --run_kaiju \
    --run_kraken2 \
    --kraken2_save_reads \
    --kraken2_save_readclassification \
    --kraken2_save_minimizers \
    --run_krakenuniq \
    --krakenuniq_save_reads \
    --krakenuniq_save_readclassifications \
    --run_bracken \
    --run_malt \
    --malt_save_reads \
    --malt_generate_megansummary \
    --run_metaphlan3 \
    --run_motus \
    --run_profile_standardisation \
    --run_krona \
    -ansi-log false \
    -with-tower \
    -resume

With the following samplesheet:

sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
MOCK_001_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/000/ERR9765780/ERR9765780.fastq.gz,,
MOCK_002_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/001/ERR9765781/ERR9765781.fastq.gz,,
MOCK_003_Minion_R9,1,OXFORD_NANOPORE,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/002/ERR9765782/ERR9765782.fastq.gz,,
MOCK_001_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/006/ERR9765746/ERR9765746_2.fastq.gz,
MOCK_002_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/007/ERR9765747/ERR9765747_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,1,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/008/ERR9765748/ERR9765748_2.fastq.gz,
MOCK_003_Illumina_Hiseq_3000,2,ILLUMINA,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_1.fastq.gz,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR976/009/ERR9765749/ERR9765749_2.fastq.gz,
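
Rows like these could also be generated from the run accessions rather than typed by hand. The sketch below infers the FTP path layout purely from the URLs in this samplesheet (first six characters of the accession, then `00` plus its last digit), so it is only claimed to hold for 10-character accessions such as `ERR9765780`. Both function names are hypothetical.

```python
def ena_fastq_urls(accession, paired):
    """Build ENA FTP FASTQ URLs for a 10-character run accession."""
    if len(accession) != 10:
        # Other accession lengths use a different layout that is not shown above.
        raise ValueError(f"unsupported accession length: {accession}")
    base = (
        "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"
        f"{accession[:6]}/00{accession[-1]}/{accession}/{accession}"
    )
    if paired:
        return [f"{base}_1.fastq.gz", f"{base}_2.fastq.gz"]
    return [f"{base}.fastq.gz"]


def samplesheet_row(sample, run, platform, accession, paired):
    """Format one samplesheet line; the trailing fasta column is left empty."""
    urls = ena_fastq_urls(accession, paired)
    fastq_2 = urls[1] if paired else ""
    return ",".join([sample, str(run), platform, urls[0], fastq_2, ""])
```

For example, `samplesheet_row("MOCK_001_Minion_R9", 1, "OXFORD_NANOPORE", "ERR9765780", paired=False)` reproduces the first Nanopore row above.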

And the following database sheet:

tool,db_name,db_params,db_path
bracken,bracken-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/bracken.tar.gz
centrifuge,centrifuge-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/centrifuge.tar.gz
diamond,diamond-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022/diamond/diamond.dmnd
kaiju,kaiju-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kaiju.tar.gz
kraken2,kraken2-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/kraken2.tar.gz
krakenuniq,krakenuniq-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/krakenuniq.tar.gz
malt,malt-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/malt.tar.gz
metaphlan3,metaphlan3-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/metaphlan3.tar.gz
motus,motus-db,,/raven/ptmp/jfellowsy/databases/taxprofiler_full_test/meslier2022_final_dbs/tars/motus.tar.gz
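
Note that most rows point at a `.tar.gz` archive while the diamond row points at a bare `.dmnd` file. A small helper can list which tools fall into which group, which is handy when checking how each database is handed to the pipeline. This is a sketch only: the function name is hypothetical and it assumes the `tool` and `db_path` columns shown above.

```python
import csv
import io


def split_by_archive(dbsheet_text):
    """Return (tools with .tar.gz paths, tools with any other path)."""
    archived, plain = [], []
    for row in csv.DictReader(io.StringIO(dbsheet_text)):
        target = archived if row["db_path"].endswith(".tar.gz") else plain
        target.append(row["tool"])
    return archived, plain
```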

The database TARs were built based on these instructions

And it mostly ran! Observations in the next comment

jfy133 avatar jfy133 commented on July 23, 2024
  • The MultiQC report looks weird in a few places though
    • The Nanopore reads have the ERR* name, not the sample name, in the general stats #223
    • We still have the Bracken entries in the general stats that we need to turn off #223
    • Porechop plots all say 'not able to plot data' #227
      [2023-01-20 18:53:22,590] multiqc.plots.bargraph [WARNING] Tried to make bar plot, but had no data: porechop-starttrim-barplot
      [2023-01-20 18:53:22,590] multiqc.plots.bargraph [WARNING] Tried to make bar plot, but had no data: porechop-endtrim-barplot
      [2023-01-20 18:53:22,590] multiqc.plots.bargraph [WARNING] Tried to make bar plot, but had no data: porechop-middlesplit-barplot
  • I agree with @Midnighter that we need a separate MQC report for Nanopore data, but I don't want to mess with pipeline code atm and would rather focus on bug fixing (separate issue: #218)
  • We only picked up the Bracken results, and no Kraken2 etc. for some reason (I guess the file names we used weren't correct) https://github.com/nf-core/taxprofiler/pull/223/files
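
For the Porechop warnings quoted above (#227), the affected plot IDs can be pulled out of a MultiQC log so that it is easy to see which plots came up empty on the next run. A minimal sketch, assuming the warning text keeps the exact wording shown in the log lines above; the function name is hypothetical.

```python
import re

# Matches the warning wording quoted in the log above and captures the plot ID.
NO_DATA = re.compile(r"\[WARNING\] Tried to make bar plot, but had no data: (\S+)")


def plots_without_data(log_text):
    """Return the plot IDs that MultiQC reported as having no data, in order."""
    return NO_DATA.findall(log_text)
```

Because the pattern stops at whitespace, it works whether the warnings sit on separate lines or run together on one line.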

Output files:

  • Centrifuge ran despite no MultiQC report; however, the .results.txt files in that directory are empty
  • Kaiju is a complete fail, no output!
  • MetaPhlAn3 runs fine for short reads, but all long reads fail with 100% unknown; maybe we should filter this out (I suspect long reads are unsuitable because they are being mapped against short genes with a short-read aligner) #221
  • mOTUs did not execute because supplying the database as a TAR is not accounted for here #219
  • Separate short-read and long-read parameters for the databases. [v1.1!] #226

Overall we get 35% of reads classified with Bracken, so I think this is a good sign that this is a reasonable dataset

jfy133 avatar jfy133 commented on July 23, 2024
  • Add to docs that we recommend running SR/LR separately (although running together is supported) #220

sofstam avatar sofstam commented on July 23, 2024

Shall we close this?
