Model options

It would be simpler if the --model_file options was simply --model. The --model option is somewhat less useful now except for people who want a more basic error model. Could be changed to --mode

Additionally, there should be a switch to print informations about the error models provided with the software. That info should also be on the readthedocs

Multiprocessing

Issue to track the progress on the Roadmap item "Add multiprocessing support"

mutliprocessing vs joblib
parallel read generation
~~parallel~~ fast model generation
0.6.0 release 🚀 🚀 🚀

Simple Error Model

Issue to track the progress on the Roadmap item "Add a simple error model"

I guess that the simplest would be to:

Add 2 parameters: mean_quality and std_dev*
Generate random quality scores following a normal distribution
Eventually modify the nucleotides whose quality aren't perfect (it never is)
write docstrings and tests

A few unknowns:

Should the length of the sequences vary?
To which base switch if we get an erroneous call. A random other nucleotide? I would guess a random nucl. is good enough for a simple error model.

* I haven't added a standard deviation parameter. it is hardcoded to 0.01 but can be discussed

Grammar and typesetting

Link to paper: https://www.overleaf.com/read/ksvfpdhfskmq

Reviewer 1

I would have typeset in a different way (or enclosed in quotes) the two commands 'iss generate' and 'iss model' because right now they are undistinguishable and get confusing.
Change 'lognormal distribution per default' to 'lognormal distribution by default'

Reviewer 2

In the abstract there are missing words (“when new projects”).

*N.B: this applies to the current version of the paper, which will be mostly rewritten to incorporate the comparison done in #49 *
Nevertheless addressing theses before will not be time consuming and will prevent any errors on text that would be kept in v2 of the paper to be corrected

Temporary files

If the software is interrupted or crashes before completion, it leaves temporary files behind. The cleanup function should be called upon completion AND crash/interruption

Ambiguous bases

When genomes are downloaded from the ncbi they sometimes contain ambiguous bases (e.g. 'S' which stands for 'G' or 'C') Those cause a KeyError in the dispatch dict for indels

Traceback:

File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/__main__.py", line 5, in <module>
    main()
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 318, in main
    args.func(args)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 123, in generate_reads
    generator.to_fastq(read_gen, args.output)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 124, in to_fastq
    for read_tuple in generator:
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 86, in reads
    reverse, 'reverse', sequence, bounds)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/error_models/__init__.py", line 179, in introduce_indels
    if random.random() < deletions[position][mutable_seq[nucl]]:  # del
KeyError: 'S'

IUPAC codes: http://www.bioinformatics.org/sms2/iupac.html

introduce_indels out of range

The introduce_indels loop sometimes goes out of the sequence bounds

Traceback:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/__main__.py", line 5, in <module>
    main()
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 318, in main
    args.func(args)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 123, in generate_reads
    generator.to_fastq(read_gen, args.output)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 125, in to_fastq
    for read_tuple in generator:
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 69, in reads
    forward, 'forward', sequence, bounds)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/error_models/__init__.py", line 172, in introduce_indels
    if mutable_seq[nucl] == 'N':
  File "/Users/hadrien/Documents/venvs/iss/lib/python3.6/site-packages/Bio/Seq.py", line 1839, in __getitem__
    return self.data[index]
IndexError: array index out of range

Errors when running code

Hi,
I've been trying to use InSiclicoSeq to create an NGS dataset from a single 7.4kb genome (starting with a shortened fragment), running the program from the linux command line in a qiime VirtualBox. I've been given a couple of error messages however; the first (error message 1 attached) stating invalid syntax in app.py at line 78. I could avoid this by removing the "elif" segment that deals with ncbi genomes (lines 68-78), but then run into another issue later where InSilicoSeq calls seq.py (error message 2 attached), this time a type error.
I've also attached the DNA sequence I was using in my command- iss generate --genomes PV3_gb_MF678293Short.fasta -model miseq --output miseq_reads -n 10000

Error1.txt
Error2.txt
PV3_gb_MF678293Short.txt

I can provide further information if it helps to recreate the errors. And thanks for creating this software by the way- I've spent days trying to find an up to date package for NGS data simulation.
Thanks,
Alex

can't use HiSeq or NovaSeq profiles

Trying to use HiSeq or NovaSeq profiles throw an error

ERROR:iss.error_models.two_dim.MultiKDErrorModel:Failed to read ErrorModel file: [Errno 2] No such file or directory: 'HiSeq'

Refactoring

Issue to track the progress on the Roadmap item "Refactor to avoid duplicated code when possible"

Some functions are duplicated at the moment, and the bam parsing and model creation is quite messy.

There are duplicated functions in the error model classes. The indels and substitutions functions can probably be moved to the base ErrorModel class
Ideally a bam file should only be read once to create the model .npz file. It is not the case now.
Some functions are too long (notably in bam.py) and should be broken done into smaller functions. Possibly the bam.py file could even be split in 2 files (perhaps 1 general bam reading / writing and 1 where the heavy lifting / model creation happens)

Mixed datasets

This is just an idea but it would be nice to be able to simulate datasets that contain Bacterial and Viral sequences for example (and also if possible, sequences that are not only retrieved from RefSeq)

Distribution of genomes

Issue to track the progress on the Roadmap item "The user provides a multi-fasta file, and choose the type of distribution"

Errors models in 0.2.1 don't work

The SRA reads taken for building the new error models were not all the same length (adapter trimming, possible quality too) and it caused issues during the model generation 😱

Will rebuild asap with either modified SRA reads or my coral metagenome reads (need to assemble and map back to assembly then)

In the meantime, I've rolled back with the old error models (named HiSeq2500 and MiSeq) and push to 0.2.2 (also on Pypi)

test coverage of download module

test coverage is almost non-existent for the download module (used for the --ncbi option) and it has already cause problem.

add unit tests

Logging and Error Handling

Issue to track the progress on the Roadmap item "Add proper logging and error handling"

add try / except blocks where we can catch exceptions
add sys.exit / write to stderr whenever a fatal exception occurs
switch from sys to logging
add a 'info' logging level to make the software more verbose
add a 'quiet' and 'debug' argument to the parser

Documentation Improvments

reviewer 2

Some of the examples do not have all the required files to run. The following files do not exist or they are wrongly named: model.npz, my_model.npz, reads_R1.fastq.gz, reads_R2.fastq.gz.

make clearer that thos are example commands that the user should run with their own data.

In addition, there are some parameters for the “iss generate” when downloading sequences from NCBI that are not explicitly described. The -u, -k are not explained, -n_genomes does not seem to be used

document those better

The documentation is not extensive but apparently sufficient. However, I missed tutorials for specific scenarios of interest. The examples are vaguely organised.

the documentation does feel a bit minimal and (i) has not been consistently kept up-to-date with new options and (ii) has lost its cohesion with the updates. This issue will now track to re-writing of the doc

Roadmap

Overview

the goal of the project is to create a (meta)genomics reads simulator, i.e. a program capable of simulating realistic high-throughput sequencing data

This issue is aimed at being a todo list for the project.

~~Make a roadmap~~

V1.0.0

Simulating reads

From a genome, output perfect reads in fastq format: ~~Issue #2~~
From a set of genomes, output perfect reads in fastq format given abundance parameters: ~~Issue #3~~
Generate paired-end reads: ~~Issue #4~~
Add a simple error model: ~~Issue #5~~
Add a more complicated error model, (based on real read data, or literature on Illumina error model?): ~~Issue #6~~
Refine the error model from #6: 2D KDE, GC bias and Insert size distribution
Provide at least 1 good HiSeq and 1 good MiSeq model: ~~Issue #13~~ and ~~Issue #26~~

Input

The user provides a single-fasta file and coverage information: ~~Issue #2~~
The user provides a multi-fasta file and abundance parameters: ~~Issue #3~~
The user provides a multi-fasta file, and choose the type of distribution: ~~Issue #12~~
The user chooses nothing, and random genomes from RefSeq genomes are selected: ~~Issue #15~~

Testing and packaging

Add Travis and Codecov integration
Add proper logging and error handling: ~~Issue #7~~
Refactor to avoid duplicated code when possible: ~~Issue #8~~
Write tests: ~~Issue #9~~
Write documentation: ~~Issue #10~~
Submit package to PyPi: ~~Issue #11~~
Add multiprocessing support: ~~Issue #17~~
User testing

Write documentation

Issue to track the progress on the Roadmap item "Write documentation"

write docstrings for all functions
write inline comments for (i) making the code easier to follow (ii) document eventual obscure one-liners
update the README with (i) a brief intro, (ii) install instructions, (iii) basic usage
write complete documentation in a doc subdirectory
publish the doc to readthedocs

WGS genome projects with `--ncbi` option

Some empty genomes (short string of Ns) sometimes get downloaded with the --ncbi option and make the software crash.

HTTP Error 502: Bad Gateway

InsilicoSeq should skip the offending record with a warning when something goes wrong while downloading genomes.

I don't know where that HTTP error is coming from and can't reproduce at the moment. I'll close this issue and reopen if it reoccurs.

Random genome from multifasta

if the --ncbi option works (sorta) well for random genomes, sometimes users (read, me) want random genomes drawn from a set of known genomes (a multifasta file given as input)

Crashes when --model is not provided

when there is not model, iss generate fails with an AttributeError, which is caught and dealt with with display the help message.

TODO:

a more informative error message

Genome -> Reads

Issue to track the progress on the Roadmap item "From a genome, output perfect reads in fastq format"

Test file: genome.fasta
Pattern: ATCG
Length: 2000bp

add Biopython to requirements.txt (aka let's not re-write a fasta parser)
add Numpy to requirements.txt (required by Biopython)
generate 100 (pseudo)random subsequences in fasta format
generate x subsequences covering the input genome 40x (or an arbitrary coverage level)
add quality information (fastq output)
get out of app.py, import module and write ArgumentParser
setup travis and write tests

Note: Biopython discards everything after a blank space (' ') in the sequence headers

Model GC bias

I feel like the GC-bias needs its own issue, it's harder to implement than the i_size distribution

This article is a good read about GC bias in Illumina data.

in this article they scanned a bam file with a sliding window equal to the read-length (and s step of half the read length).

try that and plot result
compare coverage at x GC% with mean coverage
~~convert coverage into probabilities of a reads coming for that region~~
~~save that distribution to the model files~~
add a chance to not generate a read from region with lower probabilities (i.e. "abnormal" gc content

Can't install on python3.7

Upstream issue with pysam here: pysam-developers/pysam#697

add iss download command

The idea would be to have a specific command for downloading genomes from NCBI

Bug when output file already exists

If the output file exists, the program should not write in the existing one but maybe overwrite. I got this error:

INFO:iss.generator:Cleaning up
Traceback (most recent call last):
File "/usr/local/bin/iss", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 365, in main
args.func(args)
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 152, in generate_reads
generator.cleanup(temp_file_list)
File "/usr/local/lib/python3.6/site-packages/iss/generator.py", line 197, in cleanup
os.remove(temp_file + '_R1.fastq')
FileNotFoundError: [Errno 2] No such file or directory: '.iss.tmp.NC_010175.1.0_R1.fastq’

Values for option --model/-m could be case insensitive

For example HiSeq and hiseq could be allowed

gc_bias not working

The --gc_bias options seems to do nothing at all.

Error handling when template > genome

Click on the arrow for traceback

INFO:iss.app:Starting iss generate INFO:iss.app:Using kde ErrorModel INFO:iss.app:Using lognormal abundance distribution INFO:iss.app:Using 2 cpus for read generation INFO:iss.app:Generating 5000 reads INFO:iss.app:Generating reads for record: NC_001740.1:5890-6729 INFO:iss.app:Generating reads for record: NC_018081.1:c2783521-2782661 INFO:iss.app:Generating reads for record: NC_005364.2:c967483-966185 INFO:iss.app:Generating reads for record: NC_003997.3:1351647-1352231 multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 130, in simulate_read assert reverse_end < len(record.seq) AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in call
return self.func(*args, **kwargs)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 131, in call
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 131, in
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 53, in reads
forward, reverse = simulate_read(record, ErrorModel, i)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 135, in simulate_read
read_length, len(record.seq) - read_length
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py", line 199, in randrange
raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (301,284, -17)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 359, in call
raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException

ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:

ValueError: empty range for randrange() (301,284, -17)

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
joblib.my_exceptions.TransportableException: TransportableException

ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:

ValueError: empty range for randrange() (301,284, -17)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/iss", line 11, in
sys.exit(main())
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py", line 403, in main
args.func(args)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py", line 169, in generate_reads
args.gc_bias) for i in range(cpus))
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 789, in call
self.retrieve()
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 740, in retrieve
raise exception
joblib.my_exceptions.JoblibValueError: JoblibValueError

Multiprocessing exception:
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/iss in ()
6
7 from iss.app import main
8
9 if name == 'main':
10 sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0])
---> 11 sys.exit(main())

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py in main()
398 elif args.debug:
399 logging.basicConfig(level=logging.DEBUG)
400 else:
401 logging.basicConfig(level=logging.INFO)
402
--> 403 args.func(args)
args.func =
args = Namespace(abundance='lognormal', abundance_file=... output='test_reads', quiet=False, version=False)
404 logging.shutdown()
405 except AttributeError as e:
406 logger = logging.getLogger(name)
407 logger.debug(e)

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py in generate_reads(args=Namespace(abundance='lognormal', abundance_file=... output='test_reads', quiet=False, version=False))
164
165 record_file_name_list = Parallel(n_jobs=cpus)(
166 delayed(generator.reads)(
167 record, err_mod,
168 n_pairs_per_cpu, i, args.output,
--> 169 args.gc_bias) for i in range(cpus))
args.gc_bias = False
cpus = 2
170 temp_file_list.extend(record_file_name_list)
171 except KeyboardInterrupt as e:
172 logger.error('iss generate interrupted: %s' % e)
173 generator.cleanup(temp_file_list)

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=Parallel(n_jobs=2), iterable=<generator object generate_reads..>)
784 if pre_dispatch == "all" or n_jobs == 1:
785 # The iterable was consumed all at once by the above for loop.
786 # No need to wait for async callbacks to trigger to
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
792 self._print('Done %3i out of %3i | elapsed: %s finished',
793 (len(self._output), len(self._output),

Sub-process traceback:

ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:

ValueError: empty range for randrange() (301,284, -17)

information on Error models

We need more information of the provided error models, i.e. read length, insert size distribution, ...

This will added to the doc and eventually some basic info at the command-line

Issue with MiSeq and NovaSeq models?

From Reviewer 2:

The software does not run correctly all the time. For example, running

iss generate --genomes genomes.fasta --model miseq --output miseq_reads

models miseq and novaseq did not work.

coverage instead of abundance

A comment by the reviewers is that I may be convenient to let the user control the coverage directly instead of it being a function of n_reads * abundance.

It is fairly simple to implement, and having mutually exclusive flags --abundance or --coverage would only warrant a minor version bump.

Comparison with existing software

This the main criticism received by the paper and the reason for the major revision

Software to compare to:

Software	License	Installed	Ran
InSilicoSeq	MIT	✔️	✔️
MetaSim	Academic	❌	❌
FunctionSIM	None	❌	❌
NeSSM	Academic	❌	❌
BEAR	Academic	✔️	❌
FASTQSim	GPL-3.0	✔️	❌
GemSim	GPL-3.0	✔️	❌
Grinder	GPL-3.0	✔️	✔️
pIRS	GPL-2.0	✔️	✔️

TODO:

Installation process

Software	Open Source	Dependency management	CLI	One command
InSilicoSeq	✔️	✔️	✔️	✔️
MetaSim	❌	❌	✔️	✔️
FunctionSIM	❌	❌	❌	✔️
NeSSM	❌	❌	✔️	❌
BEAR	✔️	❌	✔️	❌
FASTQSim	✔️	❌	✔️	❌
GemSim	✔️	❌	✔️	❌
Grinder	✔️	✔️	✔️	✔️
pIRS	✔️	❌	✔️	❌

Model and mode options

If the mode (error model) is basic, it might be good to put a warning saying that any specified model will be ignored

Add --version

iss -v, iss --version should display the version number.

could also display the version number in --debug mode before the programme starts.

MetaGenome -> Reads

Issue to track the progress on the Roadmap item "From a set of genomes, output perfect reads in fastq format given abundance parameters"

Test files:

Number of genomes: 5

Parse the file and get one SeqRecord per genome (we still assume complete genomes: one sequence per genome)
Parse abundance.txt and calculate the necessary coverage per genome to satisfy the abundance requirements
Generate the reads with the generator.reads() function
Use multiprocessing to parallelise the process (1 threads per genome?)
Write the reads to file
write tests

`--n_reads` should allow suffixes

the --n_reads of iss generate should allow suffixes like "M, G"

Example data

Reviewers requested example data.

While I won't include large-ish example data with the pip package, I'm willing to put example files on <osf.io> and modify the examples to include the download and running with the downloaded data.

presets

In light of reviewer 3's comments but only vaguely related, it would be a great addition to allow presets of "known" microbiomes to be simulated, i.e.

iss generate --model miseq --preset human_gut --output wathever

I'll get on that after the issues tagged with review though

Submit to PyPi

Issue to track the progress on the Roadmap item "Submit package to PyPi"

This should be the last step before beta release 🚀

generate MiSeq and HiSeq model .npz files to go with the release
write setup.py
tag release
upload to PyPi

Input Genomes from RefSeq

Issue to track the progress on the Roadmap item "The user chooses nothing, and random genomes from RefSeq genomes are selected"

add new 'refseq' input option (with 'bacteria', 'viruses', 'archaea')
Get random genomes with Entrez
write to multifasta file
Simulate Reads all the same

Bug if no number of reads is specified

If we use the default number of reads and do not specify any, we get this:
Traceback (most recent call last):
File "/usr/local/bin/iss", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 365, in main
args.func(args)
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 108, in generate_reads
n_reads = util.convert_n_reads(args.n_reads)
File "/usr/local/lib/python3.6/site-packages/iss/util.py", line 121, in convert_n_reads
if unit[-1].isdigit():
TypeError: 'int' object is not subscriptable

Testing

Issue to track the progress on the Roadmap item "Add proper logging and error handling"

Most of the old tests are broken since the introduction of indels + the indel code is not tested

create dummy bam, fasta and abundance files with a mix of perfect reads, known substitutions and indels.
fix the old tests
test the indel code

Update error models

The two error models currently built-in are based on fairly old genomics data. It'd be nice to base them on recent-ish metagenomes, or at least recent genomic datasets sequencing with most popular read-lengths + lib prep

Identify 1 MiSeq, 1 HiSeq and 1 NovaSeq dataset to use

MiSeq

Assembly with megahit
mapping the reads back to the assembly with bowtie2
build models from the bam file

HiSeq

~~Assembly with megahit~~ used public illumina data from human
mapping the reads back to the assembly with bowtie2
build models from the bam file

NovaSeq

~~Assembly with megahit~~ used public illumina data from human
mapping the reads back to the assembly with bowtie2
build models from the bam file

Roadmap reference: Provide at least 1 good HiSeq and 1 good MiSeq model

Advanced error model(s)

Issue to track the progress on the Roadmap item "Add a more complicated error model"

This is where the fun begins.

The current plan is to:

Download raw Illumina datasets from known bacterial species*
Map the downloaded reads against their reference genome
Extract the Base quality information from the BAM file
Parse the BAM files and for each position, get the substitution rate (using CIGAR strings?) **
Figure out how to get deletions information from the ~~MD tag and~~ CIGAR string. ~~(samtools calmd?)~~ It worked well with CIGAR
write the error rates to a file
Write the subsitution matrix for every position in a file***
Generate realistic-ish illumina data!

* It will be very important to track every step used to generate the profiles: genomes download, mapping options, ... I will keep everything in evernote for now but will dump it in the software doc later on.

** I'll need pysam as a dependency. ~~Hopefully it will not add difficulties to the install process~~ Pysam installed seamlessly

*** something of the like (not sure yet, subject to change!):

pos	AtoA	AtoT	...
1	0.96	0.01	...
...

n_genomes fails when using ncbi option

Since 1.1.0 and the introduction of -u/--n_genomes for multifasta the ncbi support appears broken

Traceback (most recent call last):
  File "/opt/sw/insilicoseq/1.1.0/bin/iss", line 11, in <module>
    load_entry_point('InSilicoSeq==1.1.0', 'console_scripts', 'iss')()
  File "/opt/sw/insilicoseq/1.1.0/lib/python3.6/site-packages/iss/app.py", line 403, in main
    args.func(args)
  File "/opt/sw/insilicoseq/1.1.0/lib/python3.6/site-packages/iss/app.py", line 136, in generate_reads
    n = args.n_genomes[0][0]
TypeError: 'int' object is not subscriptable

No info about abundance

When using a distribution, no information about the genomes abundance is printed to the log or on file.
Such information is important for the downstream analysis of the simulated reads.

better install instructions

The Installation instructions should contain pip --user and advanced instructions for custom install (i.e. with a module system)

Paired-end reads

Issue to track the progress on the Roadmap item "Generate paired-end reads"

Add insert_size parameters to the ArgumentParser
~~Update the coverage calculations~~ not necessary. The number of reads doesn't change!*
Update read generator to output paired-end reads**
Iterate over generator and write to 2 output files
update test_generator.py

* The number of reads is divided by 2 in the generator function and taken as number of pairs
**We'll probably want to yield a tuple containing two SeqRecord objects: one with the forward read, the other with the reverse

hadrieng / insilicoseq Goto Github PK

insilicoseq's People

Contributors

Stargazers

Watchers

Forkers

insilicoseq's Issues

Reviewer 1

Reviewer 2

reviewer 2

Overview

V1.0.0

Simulating reads

Input

Testing and packaging

Sub-process traceback:

Installation process

MiSeq

HiSeq

NovaSeq

Recommend Projects

Recommend Topics

Recommend Org