microbiomedata / metamags
Workflow for metagenome assembled genomes generation.
We do not have permission to distribute these files. They were added in commit 14ec4da in June 2023, and I have removed them from the master branch. We either need to update the mbin_nmdc task in the WDL so that it requires only Neha's container, njvarghese/mbin:v0.4, or work with Neha to include mbin_stats.py, which I believe is NMDC code, in a new version of the container.
create a small test data set with expected outcome if workflow run is successful
link test results to this issue
We are seeing this error in some runs...
Dumping /pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags/ab68c879-fcb8-412f-af79-8db74291bb31/call-package/execution/stderr
Traceback (most recent call last):
  File "/opt/conda/envs/mags_vis/bin/create_tarfiles.py", line 174, in <module>
    krona_plot(ko_result,prefix)
  File "/opt/conda/envs/mags_vis/bin/create_tarfiles.py", line 127, in krona_plot
    df = pd.read_csv(ko_result,sep="\t")
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 51, in __init__
    self._open_handles(src, kwds)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/parsers/base_parser.py", line 229, in _open_handles
    errors=kwds.get("encoding_errors", "strict"),
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/common.py", line 614, in get_handle
    storage_options=storage_options,
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/pandas/io/common.py", line 396, in _get_filepath_or_buffer
    raise ValueError(msg)
ValueError: Invalid file path or buffer object type: <class 'NoneType'>
I've created a snapshot for testing in /global/cfs/cdirs/m3408/squads/mags/package_failure on Perlmutter.
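The traceback shows pd.read_csv receiving None, which suggests ko_result was never set to a file path before krona_plot was called. A minimal sketch of a guard for that case follows, assuming a run can legitimately have no KO result. The helpers existing_path and should_plot are hypothetical names for illustration and are not part of create_tarfiles.py.

```python
import os
from typing import Optional


def existing_path(path: Optional[str]) -> Optional[str]:
    """Return the path if it names a non-empty regular file, otherwise None."""
    if path is None:
        return None
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return None
    return path


def should_plot(ko_result: Optional[str]) -> bool:
    """True only when there is a usable KO table to pass to the plotting step."""
    return existing_path(ko_result) is not None


if __name__ == "__main__":
    # With no KO result, the caller would skip the plot instead of raising ValueError.
    print(should_plot(None))
```

A caller could check should_plot(ko_result) before invoking the plotting function and log a message when the plot is skipped.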
JGI's eukcc database has been copied to /refdata/eukcc_db/eukcc2_db_ver_1.2
The IMG taxon oids below can be used for testing. They are large assemblies but are known to produce euk bins. Alternatively, an existing metagenome can be spiked with a small eukaryote for testing purposes.
3300067032
3300059473
3300059591 - this project produces just 1 euk bin: Eukaryota; Ascomycota; Dothideomycetes; Pleosporales; Phaeosphaeriaceae; Parastagonospora
I noticed that none of our mags workflow results generate any bins, even for projects where we know the JGI processing generated MAGs (e.g., nmdc:wfmgan-11-585bp531.1; the JGI run of this data generated 27 medium- and high-quality MAGs, see IMG taxon oid 3300061644). I find errors in the call-mbin_nmdc stage: 'ERROR: Models must be parsed before identifying HMM hits.'
A quick search suggests an issue with the CheckM installation:
Ecogenomics/CheckM#280
This issue is not caught because the last process in the task completes successfully, so we should also set pipefail to catch these types of problems going forward.
I can find 600 records on pscratch. This likely doesn't include deleted projects, and the issue dates back to at least early August 2023.
/pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags> grep 'ERROR: Models must be parsed before identifying HMM hits' */call-mbin_nmdc/execution/stdout | wc -l
600
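The masking described in this issue, where a failing step goes unnoticed because the last process exits cleanly, can be reproduced in a few lines. This is a minimal sketch that assumes bash is available on PATH. The commands false and cat are stand-ins for a failing tool and a succeeding final step; they are not the actual mbin_nmdc commands.

```python
import subprocess


def pipeline_exit_code(script: str, pipefail: bool) -> int:
    """Run a shell snippet under bash and return its exit status."""
    prefix = "set -o pipefail\n" if pipefail else ""
    result = subprocess.run(["bash", "-c", prefix + script])
    return result.returncode


# `false` always fails; `cat` reads its empty output and succeeds.
MASKED = "false | cat"

if __name__ == "__main__":
    print(pipeline_exit_code(MASKED, pipefail=False))  # failure hidden by the last command
    print(pipeline_exit_code(MASKED, pipefail=True))   # failure propagated
```

Without pipefail the pipeline reports the status of its last command only. With it, a non-zero status from any stage is returned, so a workflow engine would see the task as failed. In a WDL task this is commonly done by starting the command block with set -euo pipefail.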
create a large test data set with expected outcome if workflow run is successful
link test results to this issue
Finish end-to-end workflows for MG going from raw reads to all data products.
/pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags/f859549b-b820-48f7-ae8d-c1e8abacd6e2/call-mbin_nmdc/execution
raise NewickError("Unexpected newick format '%s' " %subnw[0:50])
ete3.parser.newick.NewickError: Unexpected newick format ''nmdc:wfmgas-11-wpm64z83.1_scf_321_c1_2':0.423943'
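One plausible reading of this error, not confirmed in the issue, is that the colon in the nmdc: sequence names collides with Newick's use of ':' to separate a node label from its branch length. Under that assumption, a reversible renaming step before tree building might look like the sketch below. make_safe_names is a hypothetical helper and is not part of the workflow or of ete3.

```python
from typing import Dict, Iterable, Tuple


def make_safe_names(names: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Map each name to a colon-free alias and return (forward, reverse) lookups."""
    forward: Dict[str, str] = {}
    reverse: Dict[str, str] = {}
    for index, name in enumerate(names):
        if name in forward:
            continue  # keep the first alias for a repeated name
        alias = f"seq{index}"
        forward[name] = alias
        reverse[alias] = name
    return forward, reverse


if __name__ == "__main__":
    forward, reverse = make_safe_names(["nmdc:wfmgas-11-wpm64z83.1_scf_321_c1_2"])
    print(forward)
```

The reverse map lets the original identifiers be restored in any output produced after the tree step.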
Task:
Prepare mags workflow that models ~latest mags version from JGI so that when the new version is ready, we can plug it in and test.
Add a release workflow to manage release versions and assets
When NMDC runs workflows where the assembly was generated at JGI, the binning workflow needs a mapping file to convert between the assembly fasta file and the annotation files. We need to update the workflow so that there is an option to specify --map.
related to #26
Files that can be used to test this:
https://data.jgi.doe.gov/search?q=1414320&expanded=IMG_AP-1414320
assembly file: assembly.contigs.fasta
mapping file: Ga0597049_contig_names_mapping.tsv
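To illustrate what such a mapping does, here is a minimal sketch that rewrites FASTA header IDs from a mapping table. It assumes a two-column, tab-separated file with the name used in the assembly fasta in the first column and the name used in the annotation files in the second. The actual column order of the JGI mapping file should be checked, and the contig names in the demo are made up.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def load_mapping(handle: Iterable[str]) -> Dict[str, str]:
    """Read a two-column TSV into a dict of old name -> new name."""
    mapping: Dict[str, str] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping


def rename_fasta(lines: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    """Yield FASTA lines, replacing the header ID wherever a mapping exists."""
    for line in lines:
        if line.startswith(">"):
            seq_id, _, rest = line[1:].rstrip("\n").partition(" ")
            new_id = mapping.get(seq_id, seq_id)  # unmapped IDs pass through unchanged
            yield ">" + new_id + (" " + rest if rest else "") + "\n"
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    demo_map = load_mapping(io.StringIO("contig_A\trenamed_A\n"))
    demo_fasta = [">contig_A len=4\n", "ACGT\n", ">contig_B\n", "GG\n"]
    print("".join(rename_fasta(demo_fasta, demo_map)), end="")
```

Leaving unmapped IDs unchanged is a design choice for this sketch. A stricter version could raise an error so that an incomplete mapping file is caught early.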
This would bring the MAG output more in line with what IMG currently generates.
As part of this we need to:
Mark F is working on this issue, as mentioned at stand-up. This is the highest priority to fix.
From Neha, summer 2023. Specifically, I'm not sure whether we are saving metabat-bins.tar.gz, which would be needed to implement the eukaryotic binning.
output {
    # retaining the sdb is very important for loading into IMG
    File? sdb = "mbin.sdb"
    # flag file to indicate that no bins were generated
    File? nobins = "mbin.nobins"
    # flag file to indicate that pipeline ran through without error
    File? success = "mbin.success"
    # checkm results
    File? checkm = "checkm_qa.out"
    # gtdbtk results
    File? bacsum = "gtdbtk_output/gtdbtk.bac120.summary.tsv"
    File? arcsum = "gtdbtk_output/gtdbtk.ar122.summary.tsv"
    # retaining the metabat-bins folder is important for downstream Euk pipeline
    # NOTE: if the lineage SDB is provided, change below to 'filtered-metabat-bins.tar.gz'
    File? lqbins = "metabat-bins.tar.gz"
    # hq+mq bins folder
    File? hqmqbins = "hqmq-metabat-bins.tar.gz"
    # optional to retain depth file, only for reprocessing
    File? depth = "metabat.depth"
}
The TSV would mimic what is in the stats.json but just be in tabular form.
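A minimal sketch of that conversion, assuming the per-bin entries can be read out of stats.json as a list of flat records. The layout of stats.json is not shown in this issue, so the field names in the demo are hypothetical, and records_to_tsv is an illustrative helper rather than existing workflow code.

```python
import csv
import io
import json
from typing import Any, Dict, List


def records_to_tsv(records: List[Dict[str, Any]]) -> str:
    """Render a list of flat dicts as TSV, one row per record."""
    columns: List[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # preserve first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    demo = json.loads('[{"bin": "bin.1", "completeness": 95.2}, {"bin": "bin.2"}]')
    print(records_to_tsv(demo), end="")
```

Fields missing from a record are written as empty cells, so every row has the same number of columns.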