Code Monkey home page Code Monkey logo

insilicoseq's People

Contributors

alienzj avatar dependabot-preview[bot] avatar dependabot-support avatar emollier avatar gdrosos avatar hadrieng avatar richelbilderbeek avatar standage avatar stefanlelieveld avatar thijsmaas avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

insilicoseq's Issues

Model options

It would be simpler if the --model_file options was simply --model. The --model option is somewhat less useful now except for people who want a more basic error model. Could be changed to --mode

Additionally, there should be a switch to print informations about the error models provided with the software. That info should also be on the readthedocs

Multiprocessing

Issue to track the progress on the Roadmap item "Add multiprocessing support"

  • mutliprocessing vs joblib
  • parallel read generation
  • parallel fast model generation
  • 0.6.0 release πŸš€ πŸš€ πŸš€

Simple Error Model

Issue to track the progress on the Roadmap item "Add a simple error model"

I guess that the simplest would be to:

  • Add 2 parameters: mean_quality and std_dev*
  • Generate random quality scores following a normal distribution
  • Eventually modify the nucleotides whose quality aren't perfect (it never is)
  • write docstrings and tests

A few unknowns:

  • Should the length of the sequences vary?
  • To which base switch if we get an erroneous call. A random other nucleotide? I would guess a random nucl. is good enough for a simple error model.

* I haven't added a standard deviation parameter. it is hardcoded to 0.01 but can be discussed

Grammar and typesetting

Link to paper: https://www.overleaf.com/read/ksvfpdhfskmq

Reviewer 1

  • I would have typeset in a different way (or enclosed in quotes) the two commands 'iss generate' and 'iss model' because right now they are undistinguishable and get confusing.
  • Change 'lognormal distribution per default' to 'lognormal distribution by default'

Reviewer 2

  • In the abstract there are missing words (β€œwhen new projects”).

*N.B: this applies to the current version of the paper, which will be mostly rewritten to incorporate the comparison done in #49 *
Nevertheless addressing theses before will not be time consuming and will prevent any errors on text that would be kept in v2 of the paper to be corrected

Temporary files

If the software is interrupted or crashes before completion, it leaves temporary files behind. The cleanup function should be called upon completion AND crash/interruption

Ambiguous bases

When genomes are downloaded from the ncbi they sometimes contain ambiguous bases (e.g. 'S' which stands for 'G' or 'C') Those cause a KeyError in the dispatch dict for indels

Traceback:

File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/__main__.py", line 5, in <module>
    main()
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 318, in main
    args.func(args)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 123, in generate_reads
    generator.to_fastq(read_gen, args.output)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 124, in to_fastq
    for read_tuple in generator:
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 86, in reads
    reverse, 'reverse', sequence, bounds)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/error_models/__init__.py", line 179, in introduce_indels
    if random.random() < deletions[position][mutable_seq[nucl]]:  # del
KeyError: 'S'

IUPAC codes: http://www.bioinformatics.org/sms2/iupac.html

introduce_indels out of range

The introduce_indels loop sometimes goes out of the sequence bounds

Traceback:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/__main__.py", line 5, in <module>
    main()
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 318, in main
    args.func(args)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 123, in generate_reads
    generator.to_fastq(read_gen, args.output)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 125, in to_fastq
    for read_tuple in generator:
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 69, in reads
    forward, 'forward', sequence, bounds)
  File "/Users/hadrien/Documents/github/InSilicoSeq/iss/error_models/__init__.py", line 172, in introduce_indels
    if mutable_seq[nucl] == 'N':
  File "/Users/hadrien/Documents/venvs/iss/lib/python3.6/site-packages/Bio/Seq.py", line 1839, in __getitem__
    return self.data[index]
IndexError: array index out of range

Errors when running code

Hi,
I've been trying to use InSiclicoSeq to create an NGS dataset from a single 7.4kb genome (starting with a shortened fragment), running the program from the linux command line in a qiime VirtualBox. I've been given a couple of error messages however; the first (error message 1 attached) stating invalid syntax in app.py at line 78. I could avoid this by removing the "elif" segment that deals with ncbi genomes (lines 68-78), but then run into another issue later where InSilicoSeq calls seq.py (error message 2 attached), this time a type error.
I've also attached the DNA sequence I was using in my command- iss generate --genomes PV3_gb_MF678293Short.fasta -model miseq --output miseq_reads -n 10000

Error1.txt
Error2.txt
PV3_gb_MF678293Short.txt

I can provide further information if it helps to recreate the errors. And thanks for creating this software by the way- I've spent days trying to find an up to date package for NGS data simulation.
Thanks,
Alex

can't use HiSeq or NovaSeq profiles

Trying to use HiSeq or NovaSeq profiles throw an error

ERROR:iss.error_models.two_dim.MultiKDErrorModel:Failed to read ErrorModel file: [Errno 2] No such file or directory: 'HiSeq'

Refactoring

Issue to track the progress on the Roadmap item "Refactor to avoid duplicated code when possible"

Some functions are duplicated at the moment, and the bam parsing and model creation is quite messy.

  • There are duplicated functions in the error model classes. The indels and substitutions functions can probably be moved to the base ErrorModel class
  • Ideally a bam file should only be read once to create the model .npz file. It is not the case now.
  • Some functions are too long (notably in bam.py) and should be broken done into smaller functions. Possibly the bam.py file could even be split in 2 files (perhaps 1 general bam reading / writing and 1 where the heavy lifting / model creation happens)

Mixed datasets

This is just an idea but it would be nice to be able to simulate datasets that contain Bacterial and Viral sequences for example (and also if possible, sequences that are not only retrieved from RefSeq)

Distribution of genomes

Issue to track the progress on the Roadmap item "The user provides a multi-fasta file, and choose the type of distribution"

  • count number of record in input fasta file
  • draw abundance from a few distributions
    • uniform (heh)
    • normal (well half-normal we don't like negative abundances)
    • exponential
    • log-normal (the contender for best distribution and default parameter)
    • 0-inflated log-normal (meaning that some genomes will be "missed" by the sequencing run)
  • store in abundance_dict
  • generate reads all the same
  • write tests for the 5 distributions

Errors models in 0.2.1 don't work

The SRA reads taken for building the new error models were not all the same length (adapter trimming, possible quality too) and it caused issues during the model generation 😱

Will rebuild asap with either modified SRA reads or my coral metagenome reads (need to assemble and map back to assembly then)

In the meantime, I've rolled back with the old error models (named HiSeq2500 and MiSeq) and push to 0.2.2 (also on Pypi)

test coverage of download module

test coverage is almost non-existent for the download module (used for the --ncbi option) and it has already cause problem.

  • add unit tests

Logging and Error Handling

Issue to track the progress on the Roadmap item "Add proper logging and error handling"

  • add try / except blocks where we can catch exceptions
  • add sys.exit / write to stderr whenever a fatal exception occurs
  • switch from sys to logging
  • add a 'info' logging level to make the software more verbose
  • add a 'quiet' and 'debug' argument to the parser

Documentation Improvments

reviewer 2

  1. Some of the examples do not have all the required files to run. The following files do not exist or they are wrongly named: model.npz, my_model.npz, reads_R1.fastq.gz, reads_R2.fastq.gz.
  • make clearer that thos are example commands that the user should run with their own data.
  1. In addition, there are some parameters for the β€œiss generate” when downloading sequences from NCBI that are not explicitly described. The -u, -k are not explained, -n_genomes does not seem to be used
  • document those better
  1. The documentation is not extensive but apparently sufficient. However, I missed tutorials for specific scenarios of interest. The examples are vaguely organised.

the documentation does feel a bit minimal and (i) has not been consistently kept up-to-date with new options and (ii) has lost its cohesion with the updates. This issue will now track to re-writing of the doc

  • Introduction. How does iss work and what does it do.
  • Quick link/infos: github, license, install
  • Install procedure
    • pip
    • docker
  • generating reads
    • minimal examples
    • cpus
    • abundance
    • ncbi
    • n_genomes
    • abundance file
    • mode
    • gc_bias
  • error model
    • creating error models
    • available error models and how they were built

Roadmap

Overview

the goal of the project is to create a (meta)genomics reads simulator, i.e. a program capable of simulating realistic high-throughput sequencing data

This issue is aimed at being a todo list for the project.

  • Make a roadmap

V1.0.0

Simulating reads

  • From a genome, output perfect reads in fastq format: Issue #2
  • From a set of genomes, output perfect reads in fastq format given abundance parameters: Issue #3
  • Generate paired-end reads: Issue #4
  • Add a simple error model: Issue #5
  • Add a more complicated error model, (based on real read data, or literature on Illumina error model?): Issue #6
  • Refine the error model from #6: 2D KDE, GC bias and Insert size distribution
  • Provide at least 1 good HiSeq and 1 good MiSeq model: Issue #13 and Issue #26

Input

  • The user provides a single-fasta file and coverage information: Issue #2
  • The user provides a multi-fasta file and abundance parameters: Issue #3
  • The user provides a multi-fasta file, and choose the type of distribution: Issue #12
  • The user chooses nothing, and random genomes from RefSeq genomes are selected: Issue #15

Testing and packaging

  • Add Travis and Codecov integration
  • Add proper logging and error handling: Issue #7
  • Refactor to avoid duplicated code when possible: Issue #8
  • Write tests: Issue #9
  • Write documentation: Issue #10
  • Submit package to PyPi: Issue #11
  • Add multiprocessing support: Issue #17
  • User testing

Write documentation

Issue to track the progress on the Roadmap item "Write documentation"

  • write docstrings for all functions
  • write inline comments for (i) making the code easier to follow (ii) document eventual obscure one-liners
  • update the README with (i) a brief intro, (ii) install instructions, (iii) basic usage
  • write complete documentation in a doc subdirectory
  • publish the doc to readthedocs

HTTP Error 502: Bad Gateway

InsilicoSeq should skip the offending record with a warning when something goes wrong while downloading genomes.

I don't know where that HTTP error is coming from and can't reproduce at the moment. I'll close this issue and reopen if it reoccurs.

Random genome from multifasta

if the --ncbi option works (sorta) well for random genomes, sometimes users (read, me) want random genomes drawn from a set of known genomes (a multifasta file given as input)

Crashes when --model is not provided

when there is not model, iss generate fails with an AttributeError, which is caught and dealt with with display the help message.

TODO:

  • a more informative error message

Genome -> Reads

Issue to track the progress on the Roadmap item "From a genome, output perfect reads in fastq format"

Test file: genome.fasta
Pattern: ATCG
Length: 2000bp

  • add Biopython to requirements.txt (aka let's not re-write a fasta parser)
  • add Numpy to requirements.txt (required by Biopython)
  • generate 100 (pseudo)random subsequences in fasta format
  • generate x subsequences covering the input genome 40x (or an arbitrary coverage level)
  • add quality information (fastq output)
  • get out of app.py, import module and write ArgumentParser
  • setup travis and write tests

Note: Biopython discards everything after a blank space (' ') in the sequence headers

Model GC bias

I feel like the GC-bias needs its own issue, it's harder to implement than the i_size distribution

This article is a good read about GC bias in Illumina data.

in this article they scanned a bam file with a sliding window equal to the read-length (and s step of half the read length).

  • try that and plot result
  • compare coverage at x GC% with mean coverage
  • convert coverage into probabilities of a reads coming for that region
  • save that distribution to the model files
  • add a chance to not generate a read from region with lower probabilities (i.e. "abnormal" gc content

Bug when output file already exists

If the output file exists, the program should not write in the existing one but maybe overwrite. I got this error:

INFO:iss.generator:Cleaning up
Traceback (most recent call last):
File "/usr/local/bin/iss", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 365, in main
args.func(args)
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 152, in generate_reads
generator.cleanup(temp_file_list)
File "/usr/local/lib/python3.6/site-packages/iss/generator.py", line 197, in cleanup
os.remove(temp_file + '_R1.fastq')
FileNotFoundError: [Errno 2] No such file or directory: '.iss.tmp.NC_010175.1.0_R1.fastq’

Error handling when template > genome

Click on the arrow for traceback

INFO:iss.app:Starting iss generate INFO:iss.app:Using kde ErrorModel INFO:iss.app:Using lognormal abundance distribution INFO:iss.app:Using 2 cpus for read generation INFO:iss.app:Generating 5000 reads INFO:iss.app:Generating reads for record: NC_001740.1:5890-6729 INFO:iss.app:Generating reads for record: NC_018081.1:c2783521-2782661 INFO:iss.app:Generating reads for record: NC_005364.2:c967483-966185 INFO:iss.app:Generating reads for record: NC_003997.3:1351647-1352231 multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 130, in simulate_read assert reverse_end < len(record.seq) AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in call
return self.func(*args, **kwargs)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 131, in call
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 131, in
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 53, in reads
forward, reverse = simulate_read(record, ErrorModel, i)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 135, in simulate_read
read_length, len(record.seq) - read_length
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py", line 199, in randrange
raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (301,284, -17)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 359, in call
raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException


ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:

ValueError: empty range for randrange() (301,284, -17)


"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
joblib.my_exceptions.TransportableException: TransportableException


ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:

ValueError: empty range for randrange() (301,284, -17)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/iss", line 11, in
sys.exit(main())
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py", line 403, in main
args.func(args)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py", line 169, in generate_reads
args.gc_bias) for i in range(cpus))
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 789, in call
self.retrieve()
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 740, in retrieve
raise exception
joblib.my_exceptions.JoblibValueError: JoblibValueError


Multiprocessing exception:
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/iss in ()
6
7 from iss.app import main
8
9 if name == 'main':
10 sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0])
---> 11 sys.exit(main())

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py in main()
398 elif args.debug:
399 logging.basicConfig(level=logging.DEBUG)
400 else:
401 logging.basicConfig(level=logging.INFO)
402
--> 403 args.func(args)
args.func =
args = Namespace(abundance='lognormal', abundance_file=... output='test_reads', quiet=False, version=False)
404 logging.shutdown()
405 except AttributeError as e:
406 logger = logging.getLogger(name)
407 logger.debug(e)

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py in generate_reads(args=Namespace(abundance='lognormal', abundance_file=... output='test_reads', quiet=False, version=False))
164
165 record_file_name_list = Parallel(n_jobs=cpus)(
166 delayed(generator.reads)(
167 record, err_mod,
168 n_pairs_per_cpu, i, args.output,
--> 169 args.gc_bias) for i in range(cpus))
args.gc_bias = False
cpus = 2
170 temp_file_list.extend(record_file_name_list)
171 except KeyboardInterrupt as e:
172 logger.error('iss generate interrupted: %s' % e)
173 generator.cleanup(temp_file_list)

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=Parallel(n_jobs=2), iterable=<generator object generate_reads..>)
784 if pre_dispatch == "all" or n_jobs == 1:
785 # The iterable was consumed all at once by the above for loop.
786 # No need to wait for async callbacks to trigger to
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
792 self._print('Done %3i out of %3i | elapsed: %s finished',
793 (len(self._output), len(self._output),


Sub-process traceback:

ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read

...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:

ValueError: empty range for randrange() (301,284, -17)


information on Error models

We need more information of the provided error models, i.e. read length, insert size distribution, ...

This will added to the doc and eventually some basic info at the command-line

Issue with MiSeq and NovaSeq models?

From Reviewer 2:

The software does not run correctly all the time. For example, running

iss generate --genomes genomes.fasta --model miseq --output miseq_reads

models miseq and novaseq did not work.

coverage instead of abundance

A comment by the reviewers is that I may be convenient to let the user control the coverage directly instead of it being a function of n_reads * abundance.

It is fairly simple to implement, and having mutually exclusive flags --abundance or --coverage would only warrant a minor version bump.

Comparison with existing software

This the main criticism received by the paper and the reason for the major revision

Software to compare to:

Software License Installed Ran
InSilicoSeq MIT βœ”οΈ βœ”οΈ
MetaSim Academic ❌ ❌
FunctionSIM None ❌ ❌
NeSSM Academic ❌ ❌
BEAR Academic βœ”οΈ ❌
FASTQSim GPL-3.0 βœ”οΈ ❌
GemSim GPL-3.0 βœ”οΈ ❌
Grinder GPL-3.0 βœ”οΈ βœ”οΈ
pIRS GPL-2.0 βœ”οΈ βœ”οΈ

TODO:

  • Install all
  • Rate ease of installation
  • Run all (18 genomes)
    • InSilicoSeq
    • BEAR (failed)
    • GemSim (failed)
    • Grinder
    • pIRS
  • Compare run time (4 cpus?)
    • will use GNU time 1.9 for that
  • Compare per base quality and per sequence quality

Installation process

Software Open Source Dependency management CLI One command
InSilicoSeq βœ”οΈ βœ”οΈ βœ”οΈ βœ”οΈ
MetaSim ❌ ❌ βœ”οΈ βœ”οΈ
FunctionSIM ❌ ❌ ❌ βœ”οΈ
NeSSM ❌ ❌ βœ”οΈ ❌
BEAR βœ”οΈ ❌ βœ”οΈ ❌
FASTQSim βœ”οΈ ❌ βœ”οΈ ❌
GemSim βœ”οΈ ❌ βœ”οΈ ❌
Grinder βœ”οΈ βœ”οΈ βœ”οΈ βœ”οΈ
pIRS βœ”οΈ ❌ βœ”οΈ ❌

Model and mode options

If the mode (error model) is basic, it might be good to put a warning saying that any specified model will be ignored

Add --version

iss -v, iss --version should display the version number.

could also display the version number in --debug mode before the programme starts.

MetaGenome -> Reads

Issue to track the progress on the Roadmap item "From a set of genomes, output perfect reads in fastq format given abundance parameters"

Test files:

Number of genomes: 5

  • Parse the file and get one SeqRecord per genome (we still assume complete genomes: one sequence per genome)
  • Parse abundance.txt and calculate the necessary coverage per genome to satisfy the abundance requirements
  • Generate the reads with the generator.reads() function
  • Use multiprocessing to parallelise the process (1 threads per genome?)
  • Write the reads to file
  • write tests

Example data

Reviewers requested example data.

While I won't include large-ish example data with the pip package, I'm willing to put example files on <osf.io> and modify the examples to include the download and running with the downloaded data.

presets

In light of reviewer 3's comments but only vaguely related, it would be a great addition to allow presets of "known" microbiomes to be simulated, i.e.

iss generate --model miseq --preset human_gut --output wathever

I'll get on that after the issues tagged with review though

Submit to PyPi

Issue to track the progress on the Roadmap item "Submit package to PyPi"

This should be the last step before beta release πŸš€

  • generate MiSeq and HiSeq model .npz files to go with the release
  • write setup.py
  • tag release
  • upload to PyPi

Input Genomes from RefSeq

Issue to track the progress on the Roadmap item "The user chooses nothing, and random genomes from RefSeq genomes are selected"

  • add new 'refseq' input option (with 'bacteria', 'viruses', 'archaea')
  • Get random genomes with Entrez
  • write to multifasta file
  • Simulate Reads all the same

Bug if no number of reads is specified

If we use the default number of reads and do not specify any, we get this:
Traceback (most recent call last):
File "/usr/local/bin/iss", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 365, in main
args.func(args)
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 108, in generate_reads
n_reads = util.convert_n_reads(args.n_reads)
File "/usr/local/lib/python3.6/site-packages/iss/util.py", line 121, in convert_n_reads
if unit[-1].isdigit():
TypeError: 'int' object is not subscriptable

Testing

Issue to track the progress on the Roadmap item "Add proper logging and error handling"

Most of the old tests are broken since the introduction of indels + the indel code is not tested

  • create dummy bam, fasta and abundance files with a mix of perfect reads, known substitutions and indels.
  • fix the old tests
  • test the indel code

Update error models

The two error models currently built-in are based on fairly old genomics data. It'd be nice to base them on recent-ish metagenomes, or at least recent genomic datasets sequencing with most popular read-lengths + lib prep

  • Identify 1 MiSeq, 1 HiSeq and 1 NovaSeq dataset to use

MiSeq

  • Assembly with megahit
  • mapping the reads back to the assembly with bowtie2
  • build models from the bam file

HiSeq

  • Assembly with megahit used public illumina data from human
  • mapping the reads back to the assembly with bowtie2
  • build models from the bam file

NovaSeq

  • Assembly with megahit used public illumina data from human
  • mapping the reads back to the assembly with bowtie2
  • build models from the bam file

Roadmap reference: Provide at least 1 good HiSeq and 1 good MiSeq model

Advanced error model(s)

Issue to track the progress on the Roadmap item "Add a more complicated error model"

This is where the fun begins.

The current plan is to:

  • Download raw Illumina datasets from known bacterial species*
  • Map the downloaded reads against their reference genome
  • Extract the Base quality information from the BAM file
  • Parse the BAM files and for each position, get the substitution rate (using CIGAR strings?) **
  • Figure out how to get deletions information from the MD tag and CIGAR string. (samtools calmd?) It worked well with CIGAR
  • write the error rates to a file
  • Write the subsitution matrix for every position in a file***
  • Generate realistic-ish illumina data!

* It will be very important to track every step used to generate the profiles: genomes download, mapping options, ... I will keep everything in evernote for now but will dump it in the software doc later on.

** I'll need pysam as a dependency. Hopefully it will not add difficulties to the install process Pysam installed seamlessly

*** something of the like (not sure yet, subject to change!):

pos AtoA AtoT ...
1 0.96 0.01 ...
...

n_genomes fails when using ncbi option

Since 1.1.0 and the introduction of -u/--n_genomes for multifasta the ncbi support appears broken

Traceback (most recent call last):
  File "/opt/sw/insilicoseq/1.1.0/bin/iss", line 11, in <module>
    load_entry_point('InSilicoSeq==1.1.0', 'console_scripts', 'iss')()
  File "/opt/sw/insilicoseq/1.1.0/lib/python3.6/site-packages/iss/app.py", line 403, in main
    args.func(args)
  File "/opt/sw/insilicoseq/1.1.0/lib/python3.6/site-packages/iss/app.py", line 136, in generate_reads
    n = args.n_genomes[0][0]
TypeError: 'int' object is not subscriptable

No info about abundance

When using a distribution, no information about the genomes abundance is printed to the log or on file.
Such information is important for the downstream analysis of the simulated reads.

better install instructions

The Installation instructions should contain pip --user and advanced instructions for custom install (i.e. with a module system)

Paired-end reads

Issue to track the progress on the Roadmap item "Generate paired-end reads"

  • Add insert_size parameters to the ArgumentParser
  • Update the coverage calculations not necessary. The number of reads doesn't change!*
  • Update read generator to output paired-end reads**
  • Iterate over generator and write to 2 output files
  • update test_generator.py

* The number of reads is divided by 2 in the generator function and taken as number of pairs
**We'll probably want to yield a tuple containing two SeqRecord objects: one with the forward read, the other with the reverse

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.