franciscozorrilla / metagem Goto Github PK

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data

Home Page: https://franciscozorrilla.github.io/metaGEM/

License: MIT License

Python 71.85% Shell 14.45% R 13.70%

metagenomics computational-biology metabolic-models gut-microbiome snakemake metagenome-assembled-genomes mags metabolism bioinformatics flux-balance-analysis

metagem's Introduction

👋 Hi there

Biological engineer-turned computational biologist researching microbial community metabolism in Kiran Patil's group.

☄️ Developing metagenomics + genome-scale metabolic modeling approaches for microbiome research
⚙️ In descending order of preference, I enjoy coding in bash, R, python, and MATLAB
📦 Writing workflows in Snakemake and deploying them on high performance computing clusters
👽 PhD student in BioSci @ the MRC Toxicology Unit, University of Cambridge

metagem's People

Contributors

Stargazers

Watchers

Forkers

zzsunday hongzhonglu qazwsx1995 gitter-badger liphlab mxrcon cfrioux shyamalschandra vinisalazar haroon123 zpf0117b joelfnogueira janedy cmkobel zoey-rw srisvs33 cvn001 bartoszbartmanski xentrics ntekasi yhbae6022 zhidongzhang-code zhang8128 tom-a-lynch jameyzhu eisenra liupfskygre rvcoudert parthosen aglucaci codechenx fbartusch qing-microbiol hasihays mastika2306174974 lizhi-git jacqabyasa ruanzhepu

metagem's Issues

feat: create task/sample specific folder within each job that uses scratch dir

Not all users may have a job-specific variable like $SCRATCH or $TMPDIR that can be set in the config.yaml file (e.g. #26). To account for this, create working/scratch folders within the scratch/ directory to avoid errors.

~/path/to/scratch/
  |--assemblies/
     |--sample1/
     |--sample2/
     ...

Needs to be implemented in rules that expand wildcards and use scratch/:

megahit
crossMap
kallistoIndex
crossMap3
concoct
metabat
metabatCross
maxbin
maxbinCross
binRefine
binReassemble
abundance
GTDBtk
carveme
smetana
memote
grid
prokka
roary

feat: Add cluster workload manager specific flags for metaGEM.sh parser to handle non-slurm cases

User in issue #16 wants to run metaGEM.sh parser on SGE/OGE-based cluster. In this case need to replace sbatch with qsub and adapt submission parameters/flags.

bug: maxbinCross rule does not use multiple threads

Add -thread {config[cores][maxbin]} flag to maxbinCross rule in Snakefile (line 867):

run_MaxBin.pl -contig contigs.fasta -out $(basename $(dirname {output})) -abund_list abund.list

error in rule megahit when running two or more jobs in parallel

Hello,
thanks for this pipeline, it's been very useful.
I found an error when running the rule megahit with two jobs in parallel . In the line 232 of the Snakefile :

....
-2 $(basename {input.R2}) \
 -o tmp;
echo "done. "

that -o tmp makes megahit to complain and stop because that file/folder already exist. I solved the problem defining an output name depending on the sample name:

#This is inside the shell command
out_dir=$(basename {input.R1} _R1.fastq.gz)
# then I use that as output name
....
-2 $(basename {input.R2}) \
 -o $out_dir;
echo "done. "

Any way to make the compositionVis.R or abundanceVis.R scripts to work?

Hello Francisco!
I was trying to produce a plot similar to the one under 6. Relative abundances with bwa and samtools in the unseenbio tutorial but the compositionVis.R script does not exist. I did find an abundanceVis.R script but it needs two input files that I am unaware of: classification.stats and abundance.stats

It would be great if I could do something to make either of these work!

Thanks,
Shreyansh

feat: Add support for BGC/MGC prediction

May be useful to identify putative metabolism using gutsmash or gecco.

feat: Add optional feature to automate download of user-defined dataset

metaGEM presently expects the users to have already downloaded their dateset, this can be a hurdle.

It could be useful for users to have the option to download a dataset based on an SRR ID list using fasterq-dump as described here.

Note also that users could also easily modify the downloadToy rule to download a dataset of interest. However, this requires a list of links to the individual files, which may prove cumbersome to generate.

[Question] Option to filter host's DNA

Hello,
I was wondering if you haven't tried to check for host DNA in the input files. I've seen some reports showing that it can be problematic for downstream analysis. I'm new analyzing metagenomes, so I'm not sure about it and maybe for your particular analysis this isn't a concern. Searching for software to do this, I found kneaddata, which separates bacterial DNA from host DNA, although it seems to include trimming too, which maybe isn't the most convenient.

https://huttenhower.sph.harvard.edu/kneaddata/

Jose Luis

feat: Add bashplotlib to stats task for quick visualization on the command line

The stats function can be augmented by using bashplotlib to plot *.stats summary files generated by the various *Vis Snakefile rules directly onto the command line for easy inspection (e.g. no need to keep re-downloading the pdf plots generated by R scripts).

Develop and expand repo wiki

Although the metaGEM tutorial fully explains how to run the workflow on two real metagenomic samples, it would be nice to have more of a conceptual explanation of how the different rules/tasks are implemented + tips and tricks. Create and document wiki pages for each of the workflow tasks.

Rule assemblyVis has commented out two relevant lines

metaGEM/Snakefile

Line 327 in ae3e5e3

#mv assembly.stats {config[path][root]}/{config[folder][stats]}

metaGEM/Snakefile

Line 330 in ae3e5e3

    
                   #Rscript {config[path][root]}/{config[folder][scripts]}/{config[scripts][assemblyVis]}

Hey, this two lines and the remove of the redundant plot are commented out, so the task fails

bug: binReassemble implementation suboptimal without --parallel flag

The binReassemble rule is implemented sub-optimally without the --parallel flag. Although multiple threads are being used, only one genome is assembled at a time. For large genome batches, e.g. 349 MAGs (from single sample 😮) this results in a bottleneck and it is much faster to use the --parallel flag. Note that with this implementation there could be wasted/unused resources if the number of genomes < number of threads, however such a job is likely to finish in a short amount of time.

metaGEM/Snakefile

Lines 958 to 968 in 64ffedb

    
                   echo "Running metaWRAP bin reassembly ... " 
        
                   metaWRAP reassemble_bins -o $(basename {output}) \ 
        
                       -b metawrap_*_bins \ 
        
                       -1 $(basename {input.R1}) \ 
        
                       -2 $(basename {input.R2}) \ 
        
                       -t {config[cores][reassemble]} \ 
        
                       -m {config[params][reassembleMem]} \ 
        
                       -c {config[params][reassembleComp]} \ 
        
                       -x {config[params][reassembleCont]}

How can I check jobs are being submitted to our cluster

Hi,

I'm also trying to run metagem on our HPC, and I was wondering how I can check that jobs are actually being submitted or running? I ran bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3 in a tmux window, and now it seems to be stuck in:

nohup: appending output to 'nohup.out'

Is this normal?

Our HPC recently moved from Torque PBS to Slurm for the resource management and job scheduling software, and they rewrote the job wrappers and everything. But normal procedure on the HPC is still to submit jobs using qsub and then a job script you wrote, such as:

qsub metagem.pbs

Any idea whether metagem would work in cluster mode on our HPC?

I also tried running it using --local like I do on our local machine, but that doesn't work on the hpc because metagem is interactive (the y/n questions)

crossMapseries vs crossMapParallel

Hi,

I tried bash metaGEM.sh -t crossMap -c 48 --local, but it didn't work. I see that there are now 2 different versions of crossMap in the metaGEM core workflow instead. I'm not entirely sure what the difference is between these 2 methods, and was wondering if there is any documentation detailing the differences?

Kind regards,
Sam

feat: Implement optional modification to loop metaWRAP refinement + reassembly

Should not be very hard to modify metaWRAP refine + reassembly module to do:

read recruitment
multiple re-assemblies (strict, permissive, original)
choose best version
repeat until circular, certain completeness reached, or max_iter reached

Inspired by the Jorg approach
Code: https://github.com/lmlui/Jorg
Paper: https://www.biorxiv.org/content/10.1101/2020.03.05.979740v2.full

Also would not be a bad idea to create my own version of the refine + reassembly metaWRAP modules to trim down on dependencies and also since metaWRAP has not yet migrated to python3

feat: add rules to create and query custom protein db using diamond

createDBrule for creating database should be implemented with no wildcards:

diamond makedb --in proteins.fa -d custom_db

queryDBrule for querying should be implemented with sample-specific wildcards:

diamond blastp -q query.fa -d custom_db -o output.tsv --very-sensitive

See diamond tutorial for more details.

troubleshooting: poor assembly/binning in complex samples

Hi Francisco! I think a poor assembly is preventing soil samples from producing decent bins. I do plan to try re-assembling with a lower minimum contig length, but I was wondering if you have any other advice for troubleshooting these samples.

In addition to the assembly stats from Megahit, I've included the final and intermediate binning results from the refined_bins directory. When running metaWRAP manually, the results are similarly dismal (had to lower the completeness to 15% to get output). This is one of three samples that ran through metaGEM, and I'm hoping that running ~40 additional samples will improve the MAGs.

Thank you,
Zoey

Continue or restart failed or incomplete tasks

Let's say I have 43 samples, and on one local machine metaGEM finished the assemblies of 10 of them, but now I would like to continue on a faster cluster for the remaining assembly tasks. My question is if I copy these to the assemblies folder of the other local machine, if metaGEM will recognize these and not assemble these samples again?

Regarding usage, installing CPLEX & CarveMe

Hi, The package seems to be interesting, and would like to give a shot on my data. I have a sweet of prokaryotic MAGs and would like to run the GEM pipeline on it without starting from assembly and MAG generation. Is it possible to do so?

[OGE scheduler] Problem with the model reconstruction (carveme) step

Hello, again Francisco!

So after the extractProteinbins step, I had 93 .faa files. I ran the 'bash metaGEM.sh -t carveme -j 93 -c 4 -m 20 -h 4' command after this and ended up with only 47 models in the GEMs folder (This is not due to the time limit as the job ended within an hour). I tried running the carveme step again but his time with 33 .faa files in the extractProteinbins folder. This time I ended up with only 19 models in the GEMs folder.

Please let me know if there is something that I am missing!

P.S. I tried running the carveme step separately on my PC and pasted the 93 files in the GEMs folder with the same naming scheme but then ran into errors when I tried running the memote or modelVis steps.

Someone can provide a virtual disk or something to run data directly?

Someone can provide a virtual disk or something to run data directly? It is too hard for me to set up the metaGEM. I encounter a lot of problems in my ubuntu virtualbox and WSL2. I've given up seting up metaGEM enivronment. :(

feat: Option for users to upload their data to google colab

It would be more exciting for users if they could upload their own raw reads, assemblies, bins, etc. to the google colab notebook.

Need to look around to see what is the best and most reproducible way of going about this, may be best to make a skeleton/template for downloading a sample or MAG from SRA/MGnify so users can simply switch out the link/sampleID to analyze their data of interest.

Query about SMETANA 'detailed' mode runtimes and parallel processing

Hello Francisco!

I obtained around 93 MAGs from the metaGEM workflow and ran smetana for these 93 organisms using smetana. Even though I assigned 30 cores for this job, it seems that the CPU utilization indicates otherwise (shows only one core is used). Is there a way to ensure that the parallel processing smoothly functions for smetana as the runtime has already been upwards of 4 days for a single media (M11)?
I independently tried running smetana in 'global' mode with the communities.tsv file having multiple different sized communities, and it turns out the CPU utilization shows active parallel processing and runtimes are much lower.

I know that detailed mode is expected to take higher runtimes but is there:

Some data about community size vs. runtimes, so that if runtimes are higher, we can maybe break down a large community into multiple smaller communities, parallelly process them, and in the end, combine the results from all using various centrality measures (similar to the 'global' mode approach).
Is there a way to ensure that parallel processing is taking place when the 'detailed' mode is run?

Please let me know!

Thanks,
Shreyansh

feat: Implement complementary pre-assembly binning for additional MAG generation

I have been thinking about trying this out for a long time, and was reminded of this approach in a recent tweet. Digging into the most recent literature, this paper found assemble-first and bin-first approaches to be complementary.

I originally thought that this could be implemented as a complementary draft bin generating approach to be refined and reassembled along with the other 3 draft bin sets. However this is likely not possible since the refinement step requires that all bins be generated from the same assembly.

An alternative approach would be to use this pre-binning assembly only to recover genomes that were not reconstructed using the assemble-first approach. Although perhaps a bit naive, this could be implemented based on taxonomy:

Generate refined and reassembled MAGs as usual (henceforth MAGs_A)
Generate MAGs using bin-first approach with LSA (henceforth MAGs_B)
Assign taxonomy to MAGs from steps i. and ii. using GTDB-Tk
Remove from MAGs_B any bins that have a taxonomic label found in MAGs_A

An alternative and likely more robust approach could be to simply run dRep on each sample's MAGs_A + MAGs_B to get dereplicated bins. See issue in dRep repo.

Add scratchdir/tmpdir folder in config file

Right now many rules just cd or cp into $SCRATCHDIR, however other clusters may have other names or users may want to use custom directories.

Note: This will require modifying the Snakefile as well, and need to be careful about the createFolders rule.

bug: r-gridextra missing from metagem environment

Error in library(gridExtra) : there is no package called ‘gridExtra’

Add r-gridextra to metaGEM env recipe

sbatch: not found Error submitting jobscript (exit code 127):

I primarily install the metaGEM and run with toy dataset. And I got no error (red words) in bash, but a error in nohup.out.
Is that due to the configuration and software about cluster?
I run it on Ubuntu20.04 in virtualbox in my PC (40G RAM)

####################
Here is the error recorded in nohup.out.

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 2
Job counts:
count jobs
1 all
2 qfilter
3
Select jobs to execute...

[Sun May 30 10:25:13 2021]
rule qfilter:
input: /media/sf_stone_meta/software/metaGEM/dataset/sample1/sample1_R1.fastq.gz, /media/sf_stone_meta/software/metaGEM/dataset/sample1/sample1_R2.fastq.gz
output: /media/sf_stone_meta/software/metaGEM/qfiltered/sample1/sample1_R1.fastq.gz, /media/sf_stone_meta/software/metaGEM/qfiltered/sample1/sample1_R2.fastq.gz
jobid: 1
wildcards: IDs=sample1

/bin/sh: 1: sbatch: not found
Error submitting jobscript (exit code 127):

Job failed, going on with independent jobs.

Exiting because a job execution failed. Look above for error message
Complete log: /media/sf_stone_meta/software/metaGEM/.snakemake/log/2021-05-30T102512.578344.snakemake.log
#####################

why use gap filling with carveme

Hi!

Thanks for sharing this pipeline!
I am trying to use it to understand microbial interactions, and I saw that carveme is run with the gap-filling flag. Could you explain why did you chose to use it by default? Considering that we are dealing with metagenomes, we can't be sure that the microbes can grow in isolation - perhaps they rely on essential nutrients provided by other members. On the other hand, MAGs are often incomplete and perhaps omitting gap-filling might overrepresent the interdependence between species. Have you tried both options?

Thanks!!

error while running env_setup.sh script:

Hi,

I'm getting the following error while running the env_setup.sh script:

CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.

I tried conda init bash and source /home/slambrecht/miniconda2/etc/profile.d/conda.sh to solve it, but to no avail.

Also, when I try to run conda deactivate manually, I experience no problems

full output:

Checking if conda is available ... detected version 4.7.12!
Checking if mamba environment is available ... env_setup.sh: line 33: mamba: command not found
Do you wish to create an environment for mamba installation? This is recommended for faster setup (y/n)y

CommandNotFoundError: Your shell has not been properly configured to use 'conda deactivate'.
To initialize your shell, run

    $ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.


Do you wish to download and set up metaGEM conda environment? (y/n)^C

feat: add viral identification module

It would be interesting to add a viral contig detection "module" to metaGEM, in particular for exploring associations between metabolic exchanges in the presence/absence of bacteriophages.

Check out methods of big virus paper for ideas:

Bacteriophages have important roles in the ecology of the human gut microbiome but are under-represented in reference databases. To address this problem, we assembled the Metagenomic Gut Virus catalogue that comprises 189,680 viral genomes from 11,810 publicly available human stool metagenomes. Over 75% of genomes represent double-stranded DNA phages that infect members of the Bacteroidia and Clostridia classes. Based on sequence clustering we identified 54,118 candidate viral species, 92% of which were not found in existing databases. The Metagenomic Gut Virus catalogue improves detection of viruses in stool metagenomes and accounts for nearly 40% of CRISPR spacers found in human gut Bacteria and Archaea. We also produced a catalogue of 459,375 viral protein clusters to explore the functional potential of the gut virome. This revealed tens of thousands of diversity-generating retroelements, which use error-prone reverse transcription to mutate target genes and may be involved in the molecular arms race between phages and their bacterial hosts.

Also see VirFinder, VirSorter2, checkV.

feat: add drep for dereplicating MAGs across samples

Add rule to Snakefile for dereplicating MAGs:

dRep dereplicate output_directory -g path/to/genomes/*.fasta

code, docs

Question about generating "depth.txt" for metabat2

Hi Francisco,
Could I ask a question about usage of metabat2 in bin analysis? I found it need this file-"depth.txt"? However I can not get file from the last step - "megahit". Do you know how to prepare "depth.txt" as the input for metabat2?

Thanks a lot!

Best regards,
Hongzhong

feat: use snakemake integrated conda feature to annotate jobs

Johannes Koster considers activating conda envs within jobs to be bad practice, use integrated conda feature instead. See documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management

How to run on a cluster without workload manager

Dear Francisco,

I realize metaGEM was optimized to run on a HPC cluster system, but the cluster of our lab (40 cores, 500 GB RAM) operates without a workload manager. So as a user you have to take care you don't use all 40 cores, and that you do not submit more jobs then there are cores etc.

I have 43 (soil) samples organized into sample specific subdirectories within the dataset folder

If I run

bash metaGEM.sh --task megahit --local

and I set megahit cores to 24 in the config.yaml file,

Would that mean metaGEM would assemble the samples only one at a time? What I want to avoid is that the metagem parser submits all 43 jobs at once, because on this cluster submitting means starting

Best wishes,
Sam

Have you metadata for the GEMs from gut

Dear Francisco,
Nice toolbox. Thanks for your effort. As I am interested in the gut GEMs, I downloaded GEMs from your article generated by metaGEM. Do you have metadata information for these gut GEMs, like the personal information for each gut GEMs?

Best,
Hongzhong

Is it common that only about 30 MAGs of high quality were obtained from one metagenome sample?

Dear Francisco,
Recently, I have tried to use your pipeline to analyse some metagenome data. For the bin, i found I can only get about 30 MAGs of high quality (completeness > 90% by checkM). So I am wondering do you get similar results or could you get more MAGs of high quality which can be directly used for the model reconstruction? Thanks a lot!

Best,
Hongzhong

feat: Add KneadData as optional pre-processing of raw data

Add KneadData to metaGEM as some users may wish to filter out host-associated sequences prior to analysis (e.g. #27).

Create two new metaGEM.sh tasks: check and stats

These two functionalities are designed to check that input dataset is properly read by metaGEM and for intermediate progress report, e.g. how many of the samples have been assembled, etc.

Check

Check if result folders exists, if not then create them with Snakefile rule createFolders
Check if sequencing files are present in dataset folder
- If sequencing files are dumped and unorganized then run organizeData
- If sequencing files are organized print out count and list of IDs for user verification
Check if environments metagem, metawrap, and prokkaroary exists

Stats

Check if result folders exists, if not then create them with Snakefile rule createFolders
Print out progress bar for each tasks/folder
- dataset: count subfolders to determine total number of samples
- qfilter: count .json report files
- assembly: count .gz fasta files
- concoct: count *concoct-bins subfolders
- maxbin2: count *maxbin-bins subfolders
- metabat2: count *metabat-bins subfolders
- metawrap_refine: count subfolders
- metawrap_reassemble: count subfolders, also determine total number of final MAGs across samples
- taxonomy: count subfolders
- abundances: count subfolders
- models: count subfolders for sample progress and count .xml GEM files for total models generated
- model reports: count subfolders
- simulations: count .tsv files

The stats function can also be augmented by using bashplotlib to plot *.stats summary files generated by the various *Vis Snakefile rules directly onto the command line for easy inspection (e.g. no need to keep re-downloading the pdf plots generated by R scripts).

Add flag to metaGEM.sh parser to allow users to force certain jobs/tasks to be submitted locally

Some users such as #16 may want to run certain tasks on their local machine/login node rather than submitting to a cluster.
Would be good to add flag --local, -l to force any jobs to be submitted locally.
Should include warning that running non-trivial jobs on the login node is a faux pas and will make your fellow cluster users mad at you.

'compositionVis' , 'abundanceVis' and 'taxonomyVis' rule errors

Hello Francisco!

I was following the unseenbio examples, but when I run: bash metaGEM.sh -t compositionVis the terminal throws an error that there is no task under that name. When I check the Snakefile, I do find a compositionVis rule.
And when I run bash metaGEM.sh -t taxonomyVis. I get this error: No rule to produce taxonomyVis
Same error as 2 again when I run bash command for abundanceVis

Can you please tell me why this happens?

Thanks,
Shreyansh

Issue running concoct with crossMapParallel pathway

I am trying to run concoct on metaGEM after running crossMapParallel. I am trying to run metaGEM using a large dataset that has already been quality filtered and assembled into contigs. I was successfully able to run my files through megahit and ran crossMapParallel. I used crossMapParallel since it's recommended for large datasets and it outputted the expected files into the kallisto folder. I ran concoct as the next job in the workflow which calls kallisto2concoct but fails after encountering the following error. Do you know how I can avoid the issue to be able to continue the workflow? Thank you!

P.S. Line 598 has the output file commented out so I removed the "#"

Traceback (most recent call last):
  File "/projectnb2/talbot-lab-data/jlopezna/metaGEM/scripts/kallisto2concoct.py", line 41, in <module>
    main(args)
  File "/projectnb2/talbot-lab-data/jlopezna/metaGEM/scripts/kallisto2concoct.py", line 22, in main
    samplename = samplenames[i]
IndexError: list index out of range

Update needed to modelVis rule for models reconstructed using CarveMe version>=1.5.0

Hello, again Francisco!

I think I figured out the problem with the modelVis rule. Just small corrections are needed for users working with a carveme >1.5.0.

Changes (In the Snakefile):
In lines 1511 to 1518:

while read model;do 
            id=$(echo $(basename $model)|sed 's/.xml//g'); 
            mets=$(less $model| grep "species id="|cut -d ' ' -f 8|sed 's/..$//g'|sort|uniq|wc -l);
            rxns=$(less $model|grep -c 'reaction id=');
            genes=$(less $model|grep 'fbc:geneProduct fbc:id='|grep -vic spontaneous);
            echo "Model: $id has $mets mets, $rxns reactions, and $genes genes ... "
            echo "$id $mets $rxns $genes" >> GEMs.stats;
        done< <(find . -name "*.xml")

species id ----->species metaid
reaction id ------>reaction metaid
fbc:geneProduct fbc:id ------>fbc:geneProduct metaid

These changes are reflected in a smooth run of this rule!

I hope this helps!

P.S.: I am still trying to figure out the multiple job submission on an OGE cluster and the memote errors. I will keep you posted if I find something!

Thanks and regards,
Shreyansh

feat: Check if files/folders exist in tmpdir before copying them within Snakefile rules

This would help make it easier to continue/restart failed or incomplete jobs.

To check if file exists:

FILE=/path/to/file.txt
if test -f "$FILE"; then
    echo "$FILE exists."
fi

To check if folder exists:

FILE=/path/to/folder
if [ -d "$FILE" ]; then
    echo "$FILE is a directory."
fi

For example, a large binReassemble job timed out after the 24 hr limit on my cluster. To continue the job without re-recruiting reads & re-reassembling some genomes I had to manually silence the cp command in the Snakefile rule. This could be handled automatically by a conditional statement as shown above, e.g. if tmp/$job/$sample exists then do not copy any new files into it.

MAG abundance across samples

Hi! It seems the 'abundance' rule maps each sample's reads against the bins generated from that same sample. Reads are not mapped against bins generated from other samples. Is this understanding correct?

If so - why not create one output directory of dereplicated bins from the whole analysis (i.e. flatten the output from reassembled_bins/{sampleID}/reassembled_bins), and map each sample's reads against that directory? Otherwise, how else could you evaluate the abundance of a single species among samples?

I am testing this with the outputs from a relatively unsuccessful analysis (still running the pipeline with better assembly parameters). I input 48 bins (which were mostly very low quality, minimum completeness only 15% to ensure every sample would pass through the metaGEM pipeline). These 48 were combined into only 2 genome clusters. I do expect the results to change when I input better genomes.

Here is how I just tried implementing this using anvi'o. This R code creates the anvi'o input file (with genome names and file paths):

b_directories <- list.files("./reassembled_bins/", pattern = "HARV", full.names = T)
out <- data.frame(name = NULL, path = NULL)
for (sample in b_directories){
	samp.bins <- list.files(paste0(sample, "/reassembled_bins"), pattern=".fa", full.names = T)
	bin_names <- paste0(basename(sample), ".", basename(samp.out))
	samp.bin.info <- cbind(name = bin_names, path = samp.out)
	out <- rbind(out, samp.bin.info)
}
write_tsv(out, "./external-genomes.tsv")

Then, from within the same metaGEM directory:

module load fastani
module load miniconda 
module load anvio 
conda activate anvio-7.0
anvi-dereplicate-genomes -f external-genomes.tsv -o derep_genomes/ —-skip-fasta-report --program fastANI --similarity-threshold 0.85 —-representative-method Qscore

The two bins selected as "representative" for each cluster could then be used for mapping. Do you think this is a sound approach?

feat: Add optional feature to separate contigs after assembly

If users are only interested in either reconstructing prokaryotic or eukaryotic MAGs it could be useful to run EukRep after assembly and proceed to cross-map/bin only the contigs of interest (e.g. only prokaryotic contigs or only eukaryotic contigs). Separating contigs and cross-mapping/binning prokaryotes and eukaryotes separately may also improve contamination and completeness scores of genomes. Could check to see if this would by helpful by scanning reconstructed prokaryotic MAGs for eukaryotic contamination.

Inspired by this EukRep pipeline repo.

bug: problems with GTDBTk, small correction in the snakefile needed

Hello Francisco,

Thanks for making this pipeline!

While running the gtdbtk task, I encountered a 'Missing input error' during the dry run. After a lot of searching, I found that the config.yaml file, the classification folder is called GTDBTk, whereas, in the Snakefile, it is called GTDBtk.

So I changed the config.yaml entry. The dry run was successful this time, although the actual execution was a failure as it couldn't find the ID folder. For this, I just changed the '$(basename $(dirname {input}))' to '{input}' in line 1330 of the Snakefile.

The run has not thrown any errors so far. I just wanted to notify you of this small mistake so that others don't have to face it in the future!

Please let me know if I made an error in any of the steps!

Thanks and regards,
Shreyansh

Running metaGEM.sh parser locally

Hi,
I install metagem automatically on my local computer. I have no problem with bash metaGEM.sh -t downloadToy and bash metaGEM.sh -t organizeData in dataset folder. But I have an error in the nohup.out :
+1: Building DAG of jobs...
Using shell: /bin/bash
Provided cluster nodes: 2
Job counts:
count jobs
1 all
1
Select jobs to execute...

[Wed Mar 3 12:56:47 2021]
Job 0:
WARNING: Be very careful when adding/removing any lines above this message.
The metaGEM.sh parser is presently hardcoded to edit line 22 of this Snakefile to expand target rules accordingly,
therefore adding/removing any lines before this message will likely result in parser malfunction.

/bin/sh: 1: sbatch: not found
Error submitting jobscript (exit code 127):

Job failed, going on with independent jobs.

Do you know what can i do ?
Best

documentation on running metaGEM using user-generated contig assemblies

I am trying to run metaGEM using a dataset that has already been quality filtered and assembled into contigs. I'm trying to format the data the way metaGEM wants it, but I can't get it right (maybe because I'm not "touch"ing the files in the right order for Snakemake?).

Is there any documentation on how users should input files when starting at the crossMap/binning step of the pipeline?

Thank you!

feat: add rule that identifies potentially novel genomes from MAGs

This could be done based on:

output of GTDBTk (% ANI),
dReping all MAGs, and/or
reference genome tree placement.

See methods of big MAG papers here and here for inspiration.

Open to suggestions 🤔

feat: Add crossMap3-based method rules to metaGEM.sh parser

Currently the parser only supports submitting jobs through the crossMap-based method, i.e. N (= number of samples) jobs submitted cross-mapping each set of paired end reads to the focal sample assembly in series for small-medium datasets.

Still need to add the wildcard expansion strings for crossMap3-based method, i.e. where N x N individual jobs are submitted for each individual cross-mapping operation using kallisto for large datasets. Need to add parser support for rules kallistoIndex, crossMap3, gatherCrossMap3, kallisto2concoctTable, etc. in the metaGEM.sh parser and ensure that there are not rule dependency conflicts (e.g. multiple rules generate same output so Snakemake gets confused).


	echo "Running metaWRAP bin reassembly ... "
	metaWRAP reassemble_bins -o $(basename {output}) \
	-b metawrap_*_bins \
	-1 $(basename {input.R1}) \
	-2 $(basename {input.R2}) \
	-t {config[cores][reassemble]} \
	-m {config[params][reassembleMem]} \
	-c {config[params][reassembleComp]} \
	-x {config[params][reassembleCont]}

franciscozorrilla / metagem Goto Github PK

metagem's Introduction

👋 Hi there

metagem's People

Contributors

Stargazers

Watchers

Forkers

metagem's Issues

Check

Stats

Recommend Projects

Recommend Topics

Recommend Org