
camisim's People

Contributors

abremges, alicemchardy, alphasquad, cerebis, ettorerocchi, fungs, katsteinke, p-hofmann, pbelmann, trellixvulnteam


camisim's Issues

full list of dependencies with versions

I'd like to create a conda recipe for CAMISIM and add it to the bioconda channel. It would be helpful to know the full list of dependencies and the (minimum) required version of each (e.g., the versions for Mothur, MUMmer, ART, etc.). I've been looking at the wiki, Dockerfile, etc., and I can't find all of the required software versions.

ReadSimulation random seed

The reads for every genome are simulated with different seeds (derived from the original "overall" seed). This makes some sense for multi-sample experiments, so that not every sample produces the same reads, but it makes it really hard to exactly recreate the SAME data set a second time.
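One way to get both properties at once — different reads per sample, yet fully reproducible runs — is to derive every per-genome seed deterministically from the master seed. A minimal sketch (function and identifier names are hypothetical, not CAMISIM's actual API):

```python
import hashlib

def per_genome_seed(master_seed, genome_id, sample_id):
    # Mix the master seed with stable identifiers so every (genome, sample)
    # pair gets its own seed, yet rerunning with the same master seed
    # reproduces the exact same data set.
    key = "{}|{}|{}".format(master_seed, genome_id, sample_id)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # fits a 32-bit simulator seed
```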

Docker container

@pbelmann You created a new Docker container for CAMISIM, correct?
We might want to include a link to the Docker Hub URL in the manuscript.
Can you take care of registering it there and provide us with a working link?

I guess this is stalled by PR #25 and #29 (and not much can be done now).

Changelog

Please maintain a CHANGELOG.md tracking important code changes, so we (and others) can keep track of new features, changes in functionality, and resolved bugs. Pragmatically, let's start with a clean slate after #30?

Check dependencies

Not sure if implemented already, but do we have a way to check if all dependencies are installed before actually executing the pipeline?
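A lightweight pre-flight check could simply look up every required executable on the PATH before the pipeline starts. A sketch, assuming the tool list would be derived from the config (the names below are only examples):

```python
import shutil

# Hypothetical dependency list -- the real one would come from the config
# (chosen read simulator, samtools, etc.).
REQUIRED_TOOLS = ["samtools", "art_illumina"]

def check_dependencies(tools):
    """Return the subset of tool names not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `check_dependencies(REQUIRED_TOOLS)` at startup and aborting with a clear message if the returned list is non-empty would fail fast instead of crashing mid-run.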

Error on using wgsim with no errors

I'm trying to run CAMISIM using wgsim with the no-error flag, but the pipeline crashes.

Here is the command line that I used
python2.7 metagenomesimulation.py --debug defaults/mini_config_wgsim.ini
With the config file :
mini_config_wgsim.txt

But the pipeline crashed on a python KeyError.
I think it is caused by a preceding warning/error:
2019-04-26 14:50:45 INFO: [GenomePreparation 75339803532] Simulating reads finished
[W::sam_read1] parse error at line 81
[main_samview] truncated file.
[W::sam_read1] parse error at line 3863
[main_samview] truncated file.
[W::sam_read1] parse error at line 9821
[main_samview] truncated file.
[W::sam_read1] parse error at line 38837
[main_samview] truncated file.
[W::sam_read1] parse error at line 71881
[main_samview] truncated file.
[W::sam_read1] parse error at line 216393
[main_samview] truncated file.

Here are the complete logs :
exec_wgsim_stderr.txt

Do you have any idea how to fix it?

Resuming after Community Design

Resuming the pipeline at the read simulation step, instead of performing the community design from scratch, messes up the gold standard creation (unrecognized reference names in mpileup).

Questions related to metadata

Hi, I'm trying to simulate a dataset using CAMISIM, but since I am relatively new to metagenomics, I am unclear about some of the terms. My questions are:

  1. I want to use genome with ncbi accession GCF_000006825.1, so I visited
    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/825/GCF_000006825.1_ASM682v1
    From this site, do I only need to download
    GCF_000006825.1_ASM682v1_genomic.fna.gz (genome sequence)
    GCF_000006825.1_ASM682v1_genomic.gff.gz (genome annotation)
    and add the paths to id_to_genome and id_to_gff respectively?

  2. I'm not sure what the 4 fields in the metadata mean, i.e. row 1 (header): genome_ID\tOTU\tNCBI_ID\tnovelty_category
    What is the genome_ID for GCF_000006825.1? Where can I find this information?
    What is the OTU for GCF_000006825.1? Where can I find this information?
    I found from https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=272843 that the taxonomy identifier for GCF_000006825.1 is 272843.
    What should I put for novelty_category?

It would also be great if you can provide some other concrete example genome with the corresponding genome_ID\tOTU\NCBI_ID\tnovelty_category.
Thank you very much for your help.
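For readers with the same question, a hypothetical metadata line for this genome might look as follows (tab-separated). genome_ID is a user-chosen label that must match the key used in id_to_genome, OTU is an arbitrary integer grouping related genomes, NCBI_ID is the taxon ID; the novelty_category value shown here is an assumption — check the CAMISIM documentation for the accepted categories:

```
genome_ID	OTU	NCBI_ID	novelty_category
GCF_000006825.1	1	272843	known_strain
```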

ART error profile 'hi'

The pipeline "recommends" ART 2.3.6, but actually breaks with ART 2.3.7, because the 'hi' profile (namely the files EmpHiSeq2kR1.txt and EmpHiSeq2kR2.txt) is missing from the profile folder of ART 2.3.7.
See also the suggestion in #5 (using files directly instead of a folder).

Log File

For very large metagenomes the log files get incredibly large; we need an option for no or minimal logging.

Versioning

CAMISIM should be versioned following the semantic versioning conventions: http://semver.org/
Pragmatically, I suggest simply starting with 1.0.0 once we merge #29 into the master branch?

Biom Library

Wiki entry for installing the biom python library is missing.

contig mapping file

Separate columns for "start position", "end position", and "total length" relative to the original sequence are needed.

Config file

The config file requires a lot of files/options which are not immediately clear (at least to me) or probably should not be required, especially concerning gene annotation.

  1. The option ncbi_taxdmp: this is a required option, but it is not necessarily clear why. It is inconvenient having to download a 250 MB folder when it is not absolutely required.
  2. The option metadata: what do "novelty category" and "OTU" mean in this context, and do we need this mapping if no gene annotation is supposed to be performed?
  3. The option id_to_gff_file: similar to 1), this option is required even if no genome annotations are used/available.
    Or am I missing something major here?

In general, the config file is a little bit confusing (again, subjectively) because there are so many options and sub-options for the different parts of the pipeline. The very best would probably be a GUI, maybe with automatic detection of which options are always required and which are only needed if certain steps are to be performed.

Fixed distributions

When using the ReadSimulationWrapper standalone, I could provide the abundances file on my own. That was very helpful, as it made it possible to use fixed distributions for the (simulated) strains.
From what I can tell, this is not possible when using the complete pipeline.
What do you think about adding an option to provide such an abundances file, which is used when given, with the current approach as the fallback?
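Supporting this could be as simple as parsing a user-supplied two-column file and normalizing it before handing it to the pipeline. A sketch (the file format and function name are hypothetical, not CAMISIM's actual interface):

```python
def read_fixed_abundances(path):
    # Parse a two-column "genome_id<TAB>abundance" file and normalize the
    # abundances so they sum to 1, as a drop-in replacement for the
    # distribution the pipeline would otherwise draw.
    abundances = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            abundances[fields[0]] = float(fields[1])
    total = sum(abundances.values())
    return {genome: value / total for genome, value in abundances.items()}
```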

Can I use CAMISIM to simulate OTUs?

Hi
I am just curious: if the input is a taxonomic profile in BIOM format, can I use CAMISIM to simulate the OTUs appearing in it?

Thanks in advance
Ashutosh

NCBI taxonomy versions

The provided NCBI taxonomy dump version(s) might be incompatible with the list for downloading genomes, resulting in pipeline crashes.

Update wiki

Since we have a new and shiny text about the Simpipeline for the CAMI paper, it would be great to incorporate this into the wiki, i.e. remove the nested structure of the documentation and instead have one central text containing all the relevant information, linking to external sources only (?)

Simulation using a set of fasta files

Hi,

I'm sorry, but I don't understand whether or not it is possible to create metagenomic samples from my own FASTA/FASTQ files (for example, private genomes not yet published).

So, I would like to provide a bunch of FASTA files to CAMISIM and then let the software create the abundances, cut the reads, and anonymize everything into a FASTQ file.
Is it possible? If yes, how can I do that?

Error while running "python metagenomesimulation.py defaults/mini_config.ini" command

Hi
I attempted to check that CAMISIM is working properly, as described in the usage section, using the following command:

python metagenomesimulation.py defaults/mini_config.ini

However, at the "merging the bam files" step it aborted with the following error:

2019-04-05 09:12:54 INFO: [Validator 17733368889] Creating pooled gold standard
2019-04-05 09:12:54 INFO: [Validator 17733368889] Merging bam files.
2019-04-05 09:14:18 ERROR: [MetagenomeSimulationPipeline] [Errno 12] Cannot allocate memory in line 111
2019-04-05 09:14:18 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
Capture

Can you please suggest what is going wrong with CAMISIM?

Thanks in Advance

broken README Links

All links in the readme still point to the old repo. Since there is no wiki anymore they point to nowhere.

Issues running CAMISAM

Hi, I have followed the updated documentation but I am having trouble running CAMISIM.

Running the docker command says the mini.biom file is not available. I tried entering the docker container using bash, but that also gives me an error. I also tried running "python metagenomesimulation.py configuration/metagenome_simulation", but it's unclear where that configuration is. I thought it might be the config file, which I have edited to suit, but that doesn't work. Would I be able to get some help getting this to run, please?

bshaban@6300d-111439-l:~/camisim$ sudo docker run -it -v "/path/to/input/directory:/input:rw" -v "/path/to/output/directory:/output:rw" cami/camisim:latest metagenome_from_profile.py -p /input/mini.biom -o /output
NCBI database not present yet (first time used?)
Downloading taxdump.tar.gz from NCBI FTP site...
Done. Parsing...
Loading node names...
2044492 names loaded.
249789 synonyms loaded.
Loading nodes...
2044492 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /root/.etetoolkit/taxa.sqlite ...
2044000 generating entries...
Uploading to /root/.etetoolkit/taxa.sqlite

Inserting synonyms: 245000
Inserting taxid merges: 50000
Inserting taxids: 2040000
2019-01-16 23:37:51 WARNING: [root] Max strains per OTU not set, using default (3)
2019-01-16 23:37:51 WARNING: [root] Mu and sigma have not been set, using defaults (1,2)
Traceback (most recent call last):
File "metagenome_from_profile.py", line 87, in
config = GG.generate_input(args) # total number of genomes and path to updated config
File "/usr/local/bin/scripts/get_genomes.py", line 283, in generate_input
tax_profile = read_taxonomic_profile(args.profile, config, args.samples)
File "/usr/local/bin/scripts/get_genomes.py", line 26, in read_taxonomic_profile
table = biom.load_table(biom_profile)
File "/usr/local/lib/python2.7/dist-packages/biom/parse.py", line 652, in load_table
with biom_open(f) as fp:
File "/usr/lib/python2.7/contextlib.py", line 17, in enter
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/biom/util.py", line 443, in biom_open
if os.path.getsize(fp) == 0:
File "/usr/lib/python2.7/genericpath.py", line 57, in getsize
return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/input/mini.biom'

File of Files for CAMISIM

To improve documentation, simplify usage, and increase understanding of the code, we should create a file of files for all scripts, defaults, and tools provided within CAMISIM.

[documentation] Give correct arguments to read simulators

I'm trying to use CAMISIM to generate a big fastq file containing community reads with no error.

If I understand well, this is possible using wgsim read simulator with the option "errorfree" (I saw that in the code).
So, I changed the config file with these values:
readsim=tools/wgsim/wgsim
error_profiles=errorfree

But this is not that simple.
CAMISIM wants a file as input for the error_profiles value.
How can I change the values to get a working pipeline?

I think it would be nice to have one or more wiki pages explaining the possible values in the config file.

Error in readsimulationwrapper.py

There is a typo in line 1058 of scripts/ReadSimulationWrapper/readsimulationwrapper.py:

fragment_size_mean=options.fragments_size_mean,
should be
fragment_size_mean=options.fragment_size_mean,

No module named StringIO on standard command line

Hi,

I'm trying to generate de novo metagenome samples using the default command line :
python metagenomesimulation.py defaults/mini_config.ini

At the very end of the simulation I have a Python error that aborts the whole process:

2019-04-24 16:24:11 INFO: [FastaAnonymizer] Interweave shuffle and anonymize
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/yoann/Softwares/CAMISIM/fastastreamer.py", line 9, in
File "/home/yoann/Softwares/CAMISIM/anonymizer.py", line 7, in
import StringIO
import StringIO
ModuleNotFoundError: No module named 'StringIO'
ModuleNotFoundError: No module named 'StringIO'
2019-04-24 16:24:11 ERROR: [FastaAnonymizer] Error occurred anonymizing '/tmp/tmpC8uquE/2019.04.24_16.19.24_sample_0/reads'
2019-04-24 16:24:11 ERROR: [MetagenomeSimulationPipeline] Error occurred anonymizing '/tmp/tmpC8uquE/2019.04.24_16.19.24_sample_0/reads' in line 117
2019-04-24 16:24:11 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

The module StringIO is not found. This is very curious because, if I understand correctly, this module is part of the standard library.
This bug is reproducible on multiple Linux computers.
Do you have any idea what is causing this issue?
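For context: StringIO was a Python 2 standard-library module; in Python 3 its functionality moved into the io module, so this traceback suggests the scripts were executed with a Python 3 interpreter even though CAMISIM targets Python 2. A common compatibility shim (a sketch, not CAMISIM's actual code) looks like this:

```python
try:
    from StringIO import StringIO  # Python 2 module name
except ImportError:
    from io import StringIO  # Python 3: moved into the io module

# Works identically under both interpreters
buffer = StringIO()
buffer.write(">read_1\nACGT\n")
```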

ERROR: [NcbiTaxonomy] Directory does not exist:

Hello,
I am trying to simulate a dataset from reference genomes and on two different systems the pipeline fails quite quickly:

$ python2.7  metagenomesimulation.py default_config.ini.txt
2019-03-01 12:52:14 INFO: [MetagenomeSimulationPipeline] Metagenome simulation starting
2019-03-01 12:52:14 INFO: [MetagenomeSimulationPipeline] Validating Genomes
2019-03-01 12:52:14 INFO: [MetadataReader] Reading file: '01_id_to_genome.csv'
2019-03-01 12:52:48 INFO: [MetagenomeSimulationPipeline] Design Communities
2019-03-01 12:52:48 INFO: [CommunityDesign] Drawing strains.
2019-03-01 12:52:48 INFO: [MetadataReader 98870155726] Reading file: '01_metadata.csv'
2019-03-01 12:52:48 INFO: [MetadataReader 70071771555] Reading file: '01_id_to_genome.csv'
2019-03-01 12:52:48 INFO: [CommunityDesign] Validating raw sequence files!
2019-03-01 12:53:06 ERROR: [NcbiTaxonomy] Directory does not exist: '/tmp/tmp5QWyLW/nodes.dmp'
2019-03-01 12:53:06 ERROR: [MetagenomeSimulationPipeline]  in line 83
2019-03-01 12:53:06 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

I don't know if it's the docker container, my configuration, or a bug, but I can't get past this message. I would appreciate a pointer.

This directory seems to be used by some sort of NCBI script, but I don't know why this would fail. I have read and write access to /tmp.

these are the config files I used:

01_id_to_genome.csv.txt
01_metadata.csv.txt
default_config.txt

Low alignment rates of synthetic reads to gold standard coassembly scaffolds

Hi, I don't know if this is the right place to talk about this, but I am using the 2nd CAMI Challenge Human Microbiome Project Toy Dataset. When I mapped the reads (bowtie2) to the gold standard co-assembly scaffolds, for oral, skin, and airways, the alignment rate is pretty high (>99%), but for gastrointestinal tract and urogenital tract, the alignment rate is very low (~40%). I am wondering why this is happening for these two body sites. Thanks!

high temporary space requirement

Reads for the samples are simulated consecutively and require a lot of space uncompressed.
Currently, compression is done only after read simulation of all samples is finished.

Space could be saved by compressing reads right after they are generated.
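A sketch of per-sample compression (function name is hypothetical): gzip each finished FASTQ as soon as its sample is done and delete the uncompressed copy, which caps peak disk usage at roughly one uncompressed sample instead of all of them.

```python
import gzip
import os
import shutil

def compress_fastq(path, remove_original=True):
    # Gzip a finished FASTQ right after simulation of one sample, instead
    # of waiting until all samples are done.
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if remove_original:
        os.remove(path)
    return gz_path
```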

Uniform distribution

For testing purposes it would be useful if we could draw abundances from a uniform distribution.
It might be sufficient to allow the user to set the standard deviation to 0 in the configuration file.
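A sketch of how sigma = 0 could degenerate into equal abundances. The log-normal draw mirrors the general approach of the pipeline, but the function name and defaults here are hypothetical:

```python
import random

def draw_abundances(n, sigma, mu=1.0, rng=None):
    # sigma > 0: draw log-normal abundances as usual.
    # sigma == 0: every genome gets the same weight, i.e. a uniform community.
    rng = rng or random.Random()
    if sigma == 0:
        raw = [1.0] * n
    else:
        raw = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    total = sum(raw)
    return [value / total for value in raw]
```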

log to file

Since the logs are gigantic, it might be useful if they were automatically written to a file, e.g. in encapsulated environments where stdout/stderr is not really an option.
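This can be done with the standard logging module: send everything to a file and only warnings and above to the console. A sketch (logger name and format are illustrative, not CAMISIM's actual setup):

```python
import logging

def setup_logging(log_file, verbose=False):
    # Full logs go to the file; the console only shows warnings and errors
    # unless verbose is requested.
    logger = logging.getLogger("MetagenomeSimulationPipeline")
    logger.setLevel(logging.DEBUG)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG if verbose else logging.WARNING)
    formatter = logging.Formatter("%(asctime)s %(levelname)s: [%(name)s] %(message)s")
    for handler in (file_handler, console):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```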

tmp folder

When a path that does not exist is given for the temp_directory option, the pipeline crashes.
A better behaviour, in my opinion, would be to create the folder at the given location (as it is only supposed to be a temporary folder anyway).
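The fix could be a small guard before the pipeline starts (function name hypothetical):

```python
import os

def ensure_temp_directory(path):
    # Create the configured temp_directory if it is missing instead of
    # crashing; it only ever holds throwaway intermediate files.
    if not os.path.isdir(path):
        os.makedirs(path)
    return path
```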

Metadata

CAMISIM should also provide a description of the data sets created by it (similar to #35, but for the produced data rather than the code itself).

Read simulators

Currently only Art Illumina is implemented in this version.

  • pIRS requires a configuration file generated dynamically for each strain. A nice method needs to be written for that.
  • PBSIM does not create SAM files but MAF files. The script maf_converter.py has been written to convert those MAF files into SAM files, but it still needs to be cleaned up, e.g. by making a class for all the loose functions.
  • The code of the class for both PBSIM and pIRS still needs to be cleaned up similar to ReadSimulationArt.

All scripts can be found in the module scripts.ReadSimulationWrapper.

This is mostly a todo for myself, for when I feel the need to do something more fun than documenting stuff.

Errors in open file

I have followed the user manual, but an error appeared when running CAMISIM.
The input file was a BIOM file produced by MetaPhlAn2, with the command:
./metagenome_from_profile.py -p ~/CAMISIM_test/read_1.biom -o ~/CAMISIM_test/output/
When the read simulation finished, an error like the following appeared:
2019-02-13 04:25:38 INFO: [GenomePreparation 27474285647] Simulating reads finished
[E::hts_open_format] fail to open file '/home/swy/CAMISIM_tmp/tmpnfPzsh/2019.02.13_04.06.47_sample_0/reads/k__Bacteria'
samtools view: failed to open "/home/swy/CAMISIM_tmp/tmpnfPzsh/2019.02.13_04.06.47_sample_0/reads/k__Bacteria" for reading: No such file or directory
/bin/sh: f__Enterobacteriaceae: command not found
/bin/sh: p__Proteobacteria: command not found
/bin/sh: p__Proteobacteria: command not found
/bin/sh: o__Enterobacteriales: command not found
/bin/sh: o__Enterobacteriales: command not found
/bin/sh: c__Gammaproteobacteria: command not found
/bin/sh: f__Enterobacteriaceae: command not found
/bin/sh: s__Klebsiella_pneumoniae.0.sam: command not found
/bin/sh: c__Gammaproteobacteria: command not found
/bin/sh: g__Klebsiella: command not found
/bin/sh: s__Klebsiella_pneumoniae.0.bam: command not found

........

I don't know where the error occurred that caused this error report. I hope you can give me some advice. Thank you very much!

Read Simulation in config file

As the read simulator currently only supports ART, the options in the config for the read simulation are tailored to ART, i.e. the options art_illumina and art_illumina_error_profile/profile.

I would suggest renaming the option art_illumina to something along the lines of simulator_executable (still supposed to point to the executable) and art_illumina_error_profile + profile to just error_profile. This would be a file containing either the Illumina error profile (a file instead of a folder) or an error profile for another read simulator.
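Sketched as a config fragment, the proposal above might look like this (the section name and paths are illustrative only, not the current CAMISIM config schema):

```ini
[ReadSimulator]
; generic option pointing to any simulator binary
simulator_executable=tools/art_illumina-2.3.6/art_illumina
; single error-profile file instead of a folder plus a profile name
error_profile=tools/art_illumina-2.3.6/profiles/EmpHiSeq2kR1.txt
```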

ete2 ftp usage

ete2/3 uses ftp for accessing the NCBI data base which might not be possible if using it behind a proxy.
Probably needs a PR for ete2/3

Config files

For all ways to invoke the pipeline and all data sets we create, we should provide exemplary config files which should work out of the box

Error with default run defaults/mini_config.ini

I installed all dependencies and I still obtain "Error" with the default command line and I'm running out of ideas.

Thanks in advance for the help.

2019-04-14 15:02:58 INFO: [GenomePreparation 11874323490] Simulating reads finished
2019-04-14 15:03:07 INFO: [MetagenomeSimulationPipeline] Generate gold standard assembly
2019-04-14 15:03:07 INFO: [MetadataReader 76565746185] Reading file: '/home/ubuntu/dataBIG/METAG/CAMISIM/out/internal/genome_locations.tsv'
2019-04-14 15:09:19 INFO: [MetagenomeSimulationPipeline] Generate pooled strains gold standard assembly
2019-04-14 15:09:19 INFO: [MetadataReader 40340013719] Reading file: '/home/ubuntu/dataBIG/METAG/CAMISIM/out/internal/genome_locations.tsv'
2019-04-14 15:09:19 INFO: [MetagenomeSimulationPipeline] Samples used for pooled assembly: '2019.04.14_14.53.22_sample_0', '2019.04.14_14.53.22_sample_1', '2019.04.14_14.53.22_sample_2', '2019.04.14_14.53.22_sample_3', '2019.04.14_14.53.22_sample_4', '2019.04.14_14.53.22_sample_5', '2019.04.14_14.53.22_sample_6', '2019.04.14_14.53.22_sample_7', '2019.04.14_14.53.22_sample_8', '2019.04.14_14.53.22_sample_9'
2019-04-14 15:09:19 INFO: [Validator 17733368889] Creating pooled gold standard
2019-04-14 15:09:19 INFO: [Validator 17733368889] Merging bam files.
2019-04-14 15:11:43 INFO: [MetagenomeSimulationPipeline] Creating binning gold standard
2019-04-14 15:11:43 INFO: [MetadataReader 95469384767] Reading file: '/home/ubuntu/dataBIG/METAG/CAMISIM/out/internal/genome_locations.tsv'
2019-04-14 15:11:44 INFO: [MetadataReader 8601722415] Reading file: '/home/ubuntu/dataBIG/METAG/CAMISIM/out/internal/meta_data.tsv'
2019-04-14 15:11:50 INFO: [MetadataReader 44116791807] Reading file: '/home/ubuntu/dataBIG/METAG/CAMISIM/tmp/tmpMa8PBz/read_start_positionsAn8Hvz'
2019-04-14 15:11:52 ERROR: [MetagenomeSimulationPipeline] local variable 'file_path_anonymous_gs_mapping' referenced before assignment in line 122
2019-04-14 15:11:52 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted

Python Requirements

It would be easier for users if CAMISIM contained a requirements.txt. This file simply states the needed libraries with their version numbers.

A user would just have to type:

pip install -r requirements.txt

in order to install CAMISIM with all required libraries.
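A hypothetical starting point for such a file — the package names reflect dependencies mentioned elsewhere in these issues (biom, ete2), are left unpinned, and would need to be confirmed and versioned by the maintainers:

```
# requirements.txt (sketch; versions to be pinned by the maintainers)
biom-format
biopython
ete2
numpy
matplotlib
```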

USE OF CAMISIM

I have raw metagenome data as FASTQ files and many genome bins created with MetaBAT. I don't know how to use CAMISIM. Should I make an input folder? Which files should the folder contain? Thank you very much!

python metagenome_from_profile.py -p path/to/16Sprofile

python metagenomesimulation.py configuration/metagenome_simulation

python genomeannotation.py configuration/genome_annotation

Tutorial or Example for denovo simulation

I was attempting to use CAMISIM, and I have read the user manual and readme, but I am still not sure exactly what the input files for de novo metagenome simulation would look like. Is there a step-by-step guide, or are example input files available for metagenomesimulation.py?

Improve documentation and working example

Hello,

I'm interested in simulating metagenome reads de novo using CAMISIM, however, I find the documentation hard to follow. The issues I have include:

  • Links to obsolete pages: Most links to the configuration file either redirect to the top of the user manual or to a page last updated in 2015.
  • While id_to_gff_file is specified as required, there is no example of this file beyond a short explanation
  • What is the difference between genomes_total and genomes_real?
  • Is there a straightforward way of building a CAMI file without dealing with the hierarchical structure in the tax path and the percentage (e.g. only with a list of species taxIDs and their correspondent abundance)?

A working, reproducible example and a guide to executing it are lacking; providing them would really facilitate the usage of CAMISIM.

Thanks.

What is

I want to generate metagenome data sets with labels, like a FASTQ file with sequences and a CSV file which shows the different cluster names (mapping to the sequence IDs in the FASTQ file). But I was confused by the descriptions of the input and output files; it was too difficult for a CS student to figure out the differences. How can I get the data I want?

Unique filenames per run

To prevent possible mix-ups (which might have happened in #38), each CAMISIM run should use unique filenames for all produced files, based on the sample ID provided in the config, e.g. as a prefix for all files.
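The prefixing itself is a one-liner; a sketch with a hypothetical helper name:

```python
import os

def prefixed_path(output_dir, run_id, filename):
    # Prefix every produced file with the run's unique ID so two CAMISIM
    # runs writing into the same directory cannot mix up their outputs.
    return os.path.join(output_dir, "{}_{}".format(run_id, filename))
```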
