microbiomedata / metamags

Workflow for metagenome assembled genomes generation.

WDL 10.71% Shell 0.36% Dockerfile 0.24% Python 88.69%

metamags's People

jfroula wtroy2 pythseq ajtritt

metamags's Issues

remove JGI mbin files from Docker folder

We do not have permission to distribute these files, which were added in commit 14ec4da in June 2023. I have removed them from the master branch. We need to either update the mbin_nmdc task in the WDL so that it requires only Neha's container, njvarghese/mbin:v0.4, or work with Neha to include mbin_stats.py, which I believe is NMDC code, in a new version of the container.

Workflows - Test small data sets for workflows (metaMAGs)

Create a small test data set with the expected outcome of a successful workflow run.

Link test results to this issue.

Error in package tasks

We are seeing this error in some runs...

Dumping /pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags/ab68c879-fcb8-412f-af79-8db74291bb31/call-package/execution/stderr
Traceback (most recent call last):
  File "/opt/conda/envs/mags_vis/bin/create_tarfiles.py", line 174, in <module>
    krona_plot(ko_result,prefix)
  File "/opt/conda/envs/mags_vis/bin/create_tarfiles.py", line 127, in krona_plot
    df = pd.read_csv(ko_result,sep="\t")
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py", line 229, in _open_handles
    errors=kwds.get("encoding_errors", "strict"),
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/common.py", line 614, in get_handle
    storage_options=storage_options,
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/common.py", line 396, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'NoneType'>

I've created a snapshot for testing in /global/cfs/cdirs/m3408/squads/mags/package_failure on perlmutter.
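
The traceback shows `krona_plot` being called with `ko_result` set to `None`. A minimal sketch of a guard that could sit in front of that call is below. The helper name `should_make_krona_plot` is hypothetical and not part of `create_tarfiles.py`; whether skipping the plot is the right behaviour, rather than failing with a clearer message, is an assumption.

```python
import sys
from typing import Optional


def should_make_krona_plot(ko_result: Optional[str]) -> bool:
    """Return True only when a KO result path was actually found.

    Hypothetical helper: create_tarfiles.py may want to fail loudly instead.
    """
    if ko_result is None:
        # Report the missing input instead of letting pandas raise ValueError.
        print("No KO result file found; skipping krona plot", file=sys.stderr)
        return False
    return True
```

The call site would then read `if should_make_krona_plot(ko_result): krona_plot(ko_result, prefix)`.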

add eukcc component to binning workflow

JGI's eukcc database has been copied to /refdata/eukcc_db/eukcc2_db_ver_1.2

The mbin_nmdc task should be updated to use --eukccdb EUKCCDB to run eukcc.
The mbin_nmdc task and workflow need a new output file, eukcc_output/eukcc.csv.final.
Create a new file type enum for eukcc.csv.final in nmdc-schema.
The package task should be updated to make sure fungal bins get added to the final tarball.

The IMG taxon oids below can be used for testing. They are large assemblies but are known to produce euk bins. Alternatively, an existing metagenome can be spiked with a small eukaryote for testing purposes.
3300067032
3300059473
3300059591 - this project produces just 1 euk bin: Eukaryota; Ascomycota; Dothideomycetes; Pleosporales; Phaeosphaeriaceae; Parastagonospora

silent error during call-mbin_nmdc

I noticed that none of our mags workflow runs generate any bins, even for projects where we know the JGI processing generated mags (e.g. nmdc:wfmgan-11-585bp531.1; the JGI run of this data generated 27 medium and high quality mags, see img taxon oid 3300061644). I find errors in the call-mbin_nmdc stage: 'ERROR: Models must be parsed before identifying HMM hits.'
A quick search suggests an issue with the CheckM installation:
Ecogenomics/CheckM#280

This issue is not caught because the last process of the task completes successfully, so we should also set pipefail to catch these types of problems going forward.

I can find 600 records on pscratch. This likely doesn't include deleted projects, and the issue dates back to at least early August 2023.
/pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags> grep 'ERROR: Models must be parsed before identifying HMM hits' */call-mbin_nmdc/execution/stdout | wc -l
600

cc @mbthornton-lbl
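
For listing which runs are affected, rather than only counting matching lines as the grep above does, a minimal Python sketch follows. It assumes the same `<run id>/call-mbin_nmdc/execution/stdout` layout as the grep command; the function name `runs_with_error` is hypothetical.

```python
from pathlib import Path

ERROR_MARKER = "ERROR: Models must be parsed before identifying HMM hits"


def runs_with_error(root, marker=ERROR_MARKER):
    """Return the run directory names whose mbin_nmdc stdout contains marker."""
    hits = []
    pattern = "*/call-mbin_nmdc/execution/stdout"
    for stdout in sorted(Path(root).glob(pattern)):
        text = stdout.read_text(errors="replace")
        if marker in text:
            # stdout -> execution -> call-mbin_nmdc -> <run id>
            hits.append(stdout.parent.parent.parent.name)
    return hits
```

Note that this counts each run once, while `grep ... | wc -l` counts every matching line, so the two numbers can differ if the error appears more than once in a file.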

MAGs_stats file content is empty even when there are binning results.

Workflows - Test large data sets for workflows (metaMAGs)

Create a large test data set with the expected outcome of a successful workflow run.

Link test results to this issue.

Finish end-to-end workflows for MG

Finish end-to-end workflows for MG going from raw reads to all data products.

eukCC failed on quoted node names.

/pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags/f859549b-b820-48f7-ae8d-c1e8abacd6e2/call-mbin_nmdc/execution

raise NewickError("Unexpected newick format '%s' " %subnw[0:50])
ete3.parser.newick.NewickError: Unexpected newick format ''nmdc:wfmgas-11-wpm64z83.1_scf_321_c1_2':0.423943'

Files that can be used to test this:
https://data.jgi.doe.gov/search?q=1414320&expanded=IMG_AP-1414320

assembly file: assembly.contigs.fasta
mapping file: Ga0597049_contig_names_mapping.tsv
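
One possible workaround, assuming the failure comes from characters such as `:` in contig names forcing quoted labels in the Newick tree, is to rename contigs to Newick-safe names before the run and map them back afterwards. This is a sketch only: the function names are hypothetical and it has not been confirmed as the fix for eukCC.

```python
import re

# Characters that carry meaning in Newick or that force quoting of a label.
_UNSAFE = re.compile(r"[\s:;,()\[\]']")


def newick_safe(name: str) -> str:
    """Replace Newick-special characters in a contig name with underscores."""
    return _UNSAFE.sub("_", name)


def build_name_map(names):
    """Map each safe name back to its original, refusing ambiguous renames."""
    mapping = {}
    for name in names:
        safe = newick_safe(name)
        if safe in mapping and mapping[safe] != name:
            raise ValueError(f"{name!r} and {mapping[safe]!r} both map to {safe!r}")
        mapping[safe] = name
    return mapping
```

The returned dictionary can be used to restore the original names in any output that reports contigs by their renamed form.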

Package MAGs with annotations

This would bring the MAG output more in line with what IMG currently generates.

As part of this we need to:

Generate the various annotation files filtered for each MAG (scripts exist for this already)
Package each of these as a zip bundle per MAG
Update the schema to match the new output model
Test the changes out and validate that the output looks as expected (Emiley and IMG to verify)
Make an incremental update workflow to regenerate data from previous runs
Do the incremental processing for everything in the system.
Re-register the new results.

    output{
        #retaining the sdb is very important for loading into IMG
        File? sdb = "mbin.sdb"
        #flag file to indicate that no bins were generated
        File? nobins = "mbin.nobins"
        #flag file to indicate that pipeline ran through without error
        File? success = "mbin.success"
        #checkm results
        File? checkm = "checkm_qa.out"
        #gtdbtk results
        File? bacsum = "gtdbtk_output/gtdbtk.bac120.summary.tsv"
        File? arcsum = "gtdbtk_output/gtdbtk.ar122.summary.tsv"
        #retaining the metabat-bins folder is important for downstream Euk pipeline
        #NOTE: if the lineage SDB is provided, change below to 'filtered-metabat-bins.tar.gz'
        File? lqbins = "metabat-bins.tar.gz"
        #hq+mq bins folder
        File? hqmqbins = "hqmq-metabat-bins.tar.gz"
        #optional to retain depth file, only for reprocessing
        File? depth = "metabat.depth"
    }
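
The "zip bundle per MAG" step listed above could be sketched with the standard library as follows. The function name `bundle_mag`, the `<mag_id>.zip` naming, and the folder layout inside the archive are assumptions for illustration, not the agreed output model.

```python
import zipfile
from pathlib import Path


def bundle_mag(mag_id, files, out_dir):
    """Write the given per-MAG files into <out_dir>/<mag_id>.zip."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    bundle = out_dir / f"{mag_id}.zip"
    with zipfile.ZipFile(bundle, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            # Store each file under a folder named after the MAG.
            zf.write(f, arcname=f"{mag_id}/{f.name}")
    return bundle
```

Calling this once per MAG with its filtered annotation files would yield one archive per bin.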

Extract phylum data from MAG to support taxonomic search

Generate TSV report of the bins

The TSV would mimic what is in the stats.json but just be in tabular form.
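
A minimal sketch of such a conversion is below. The layout of stats.json is not shown in this issue, so the code assumes the per-bin records are a list of flat dictionaries under a key named `mags_list`; that key name and the function name `bins_to_tsv` are hypothetical and would need to be adjusted to the real file.

```python
import csv
import json


def bins_to_tsv(stats_json_path, tsv_path, list_key="mags_list"):
    """Write the per-bin records of a stats JSON file as a TSV table.

    Assumes stats[list_key] is a list of flat dicts (an assumption).
    Returns the number of bins written.
    """
    with open(stats_json_path) as fh:
        stats = json.load(fh)
    bins = stats.get(list_key, [])

    # Collect column names in first-seen order so the header is stable.
    columns = []
    for record in bins:
        for key in record:
            if key not in columns:
                columns.append(key)

    with open(tsv_path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=columns, delimiter="\t",
            restval="", lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(bins)
    return len(bins)
```

Fields missing from a given bin are written as empty cells, so bins with different sets of keys still share one header.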