hadrieng / insilicoseq Goto Github PK
View Code? Open in Web Editor NEW:rocket: A sequencing simulator
Home Page: https://insilicoseq.readthedocs.io
License: MIT License
:rocket: A sequencing simulator
Home Page: https://insilicoseq.readthedocs.io
License: MIT License
It would be simpler if the --model_file
options was simply --model
. The --model
option is somewhat less useful now except for people who want a more basic error model. Could be changed to --mode
Additionally, there should be a switch to print informations about the error models provided with the software. That info should also be on the readthedocs
Issue to track the progress on the Roadmap item "Add multiprocessing support"
Issue to track the progress on the Roadmap item "Add a simple error model"
I guess that the simplest would be to:
A few unknowns:
* I haven't added a standard deviation parameter. it is hardcoded to 0.01 but can be discussed
Link to paper: https://www.overleaf.com/read/ksvfpdhfskmq
*N.B: this applies to the current version of the paper, which will be mostly rewritten to incorporate the comparison done in #49 *
Nevertheless addressing theses before will not be time consuming and will prevent any errors on text that would be kept in v2 of the paper to be corrected
If the software is interrupted or crashes before completion, it leaves temporary files behind. The cleanup function should be called upon completion AND crash/interruption
When genomes are downloaded from the ncbi they sometimes contain ambiguous bases (e.g. 'S' which stands for 'G' or 'C') Those cause a KeyError
in the dispatch dict for indels
Traceback:
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/__main__.py", line 5, in <module>
main()
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 318, in main
args.func(args)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 123, in generate_reads
generator.to_fastq(read_gen, args.output)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 124, in to_fastq
for read_tuple in generator:
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 86, in reads
reverse, 'reverse', sequence, bounds)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/error_models/__init__.py", line 179, in introduce_indels
if random.random() < deletions[position][mutable_seq[nucl]]: # del
KeyError: 'S'
IUPAC codes: http://www.bioinformatics.org/sms2/iupac.html
The introduce_indels loop sometimes goes out of the sequence bounds
Traceback:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/__main__.py", line 5, in <module>
main()
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 318, in main
args.func(args)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/app.py", line 123, in generate_reads
generator.to_fastq(read_gen, args.output)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 125, in to_fastq
for read_tuple in generator:
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/generator.py", line 69, in reads
forward, 'forward', sequence, bounds)
File "/Users/hadrien/Documents/github/InSilicoSeq/iss/error_models/__init__.py", line 172, in introduce_indels
if mutable_seq[nucl] == 'N':
File "/Users/hadrien/Documents/venvs/iss/lib/python3.6/site-packages/Bio/Seq.py", line 1839, in __getitem__
return self.data[index]
IndexError: array index out of range
Hi,
I've been trying to use InSiclicoSeq to create an NGS dataset from a single 7.4kb genome (starting with a shortened fragment), running the program from the linux command line in a qiime VirtualBox. I've been given a couple of error messages however; the first (error message 1 attached) stating invalid syntax in app.py at line 78. I could avoid this by removing the "elif" segment that deals with ncbi genomes (lines 68-78), but then run into another issue later where InSilicoSeq calls seq.py (error message 2 attached), this time a type error.
I've also attached the DNA sequence I was using in my command- iss generate --genomes PV3_gb_MF678293Short.fasta -model miseq --output miseq_reads -n 10000
Error1.txt
Error2.txt
PV3_gb_MF678293Short.txt
I can provide further information if it helps to recreate the errors. And thanks for creating this software by the way- I've spent days trying to find an up to date package for NGS data simulation.
Thanks,
Alex
Trying to use HiSeq or NovaSeq profiles throw an error
ERROR:iss.error_models.two_dim.MultiKDErrorModel:Failed to read ErrorModel file: [Errno 2] No such file or directory: 'HiSeq'
Issue to track the progress on the Roadmap item "Refactor to avoid duplicated code when possible"
Some functions are duplicated at the moment, and the bam parsing and model creation is quite messy.
This is just an idea but it would be nice to be able to simulate datasets that contain Bacterial and Viral sequences for example (and also if possible, sequences that are not only retrieved from RefSeq)
Issue to track the progress on the Roadmap item "The user provides a multi-fasta file, and choose the type of distribution"
The SRA reads taken for building the new error models were not all the same length (adapter trimming, possible quality too) and it caused issues during the model generation π±
Will rebuild asap with either modified SRA reads or my coral metagenome reads (need to assemble and map back to assembly then)
In the meantime, I've rolled back with the old error models (named HiSeq2500 and MiSeq) and push to 0.2.2 (also on Pypi)
test coverage is almost non-existent for the download module (used for the --ncbi
option) and it has already cause problem.
Issue to track the progress on the Roadmap item "Add proper logging and error handling"
try / except
blocks where we can catch exceptionsthe documentation does feel a bit minimal and (i) has not been consistently kept up-to-date with new options and (ii) has lost its cohesion with the updates. This issue will now track to re-writing of the doc
iss
work and what does it do.the goal of the project is to create a (meta)genomics reads simulator, i.e. a program capable of simulating realistic high-throughput sequencing data
This issue is aimed at being a todo list for the project.
Issue to track the progress on the Roadmap item "Write documentation"
Some empty genomes (short string of Ns) sometimes get downloaded with the --ncbi
option and make the software crash.
InsilicoSeq should skip the offending record with a warning when something goes wrong while downloading genomes.
I don't know where that HTTP error is coming from and can't reproduce at the moment. I'll close this issue and reopen if it reoccurs.
if the --ncbi
option works (sorta) well for random genomes, sometimes users (read, me) want random genomes drawn from a set of known genomes (a multifasta file given as input)
when there is not model, iss generate
fails with an AttributeError
, which is caught and dealt with with display the help message.
TODO:
Issue to track the progress on the Roadmap item "From a genome, output perfect reads in fastq format"
Test file: genome.fasta
Pattern: ATCG
Length: 2000bp
Note: Biopython discards everything after a blank space (' ') in the sequence headers
I feel like the GC-bias needs its own issue, it's harder to implement than the i_size distribution
This article is a good read about GC bias in Illumina data.
in this article they scanned a bam file with a sliding window equal to the read-length (and s step of half the read length).
Upstream issue with pysam here: pysam-developers/pysam#697
The idea would be to have a specific command for downloading genomes from NCBI
If the output file exists, the program should not write in the existing one but maybe overwrite. I got this error:
INFO:iss.generator:Cleaning up
Traceback (most recent call last):
File "/usr/local/bin/iss", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 365, in main
args.func(args)
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 152, in generate_reads
generator.cleanup(temp_file_list)
File "/usr/local/lib/python3.6/site-packages/iss/generator.py", line 197, in cleanup
os.remove(temp_file + '_R1.fastq')
FileNotFoundError: [Errno 2] No such file or directory: '.iss.tmp.NC_010175.1.0_R1.fastqβ
For example HiSeq and hiseq could be allowed
The --gc_bias
options seems to do nothing at all.
Click on the arrow for traceback
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in call
return self.func(*args, **kwargs)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 131, in call
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 131, in
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 53, in reads
forward, reverse = simulate_read(record, ErrorModel, i)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py", line 135, in simulate_read
read_length, len(record.seq) - read_length
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py", line 199, in randrange
raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
ValueError: empty range for randrange() (301,284, -17)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 359, in call
raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException
ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:
ValueError: empty range for randrange() (301,284, -17)
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 699, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/usr/local/Cellar/python/3.6.5/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
joblib.my_exceptions.TransportableException: TransportableException
ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:
ValueError: empty range for randrange() (301,284, -17)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/iss", line 11, in
sys.exit(main())
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py", line 403, in main
args.func(args)
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py", line 169, in generate_reads
args.gc_bias) for i in range(cpus))
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 789, in call
self.retrieve()
File "/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py", line 740, in retrieve
raise exception
joblib.my_exceptions.JoblibValueError: JoblibValueError
Multiprocessing exception:
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/iss in ()
6
7 from iss.app import main
8
9 if name == 'main':
10 sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0])
---> 11 sys.exit(main())
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py in main()
398 elif args.debug:
399 logging.basicConfig(level=logging.DEBUG)
400 else:
401 logging.basicConfig(level=logging.INFO)
402
--> 403 args.func(args)
args.func =
args = Namespace(abundance='lognormal', abundance_file=... output='test_reads', quiet=False, version=False)
404 logging.shutdown()
405 except AttributeError as e:
406 logger = logging.getLogger(name)
407 logger.debug(e)
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/app.py in generate_reads(args=Namespace(abundance='lognormal', abundance_file=... output='test_reads', quiet=False, version=False))
164
165 record_file_name_list = Parallel(n_jobs=cpus)(
166 delayed(generator.reads)(
167 record, err_mod,
168 n_pairs_per_cpu, i, args.output,
--> 169 args.gc_bias) for i in range(cpus))
args.gc_bias = False
cpus = 2
170 temp_file_list.extend(record_file_name_list)
171 except KeyboardInterrupt as e:
172 logger.error('iss generate interrupted: %s' % e)
173 generator.cleanup(temp_file_list)
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=Parallel(n_jobs=2), iterable=<generator object generate_reads..>)
784 if pre_dispatch == "all" or n_jobs == 1:
785 # The iterable was consumed all at once by the above for loop.
786 # No need to wait for async callbacks to trigger to
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
792 self._print('Done %3i out of %3i | elapsed: %s finished',
793 (len(self._output), len(self._output),
ValueError Tue May 22 21:04:40 2018
PID: 95094Python 3.6.5: /Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/bin/python3.6
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in call(self=<joblib.parallel.BatchedCalls object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
self.items = [(, (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False), {})]
132
133 def len(self):
134 return self._size
135
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/joblib/parallel.py in (.0=<list_iterator object>)
126 def init(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def call(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), <iss.error_models.kde.KDErrorModel object>, 204, 0, 'test_reads', False)
kwargs = {}
132
133 def len(self):
134 return self._size
135
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in reads(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, n_pairs=204, cpu_number=0, output='test_reads', gc_bias=False)
48 # try:
49 # forward, reverse = simulate_read(record, ErrorModel, i)
50 # except ValueError as e:
51 # logger.error('Skipping this record: %s' % record.id)
52 # return
---> 53 forward, reverse = simulate_read(record, ErrorModel, i)
forward = undefined
reverse = undefined
record = SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[])
ErrorModel = <iss.error_models.kde.KDErrorModel object>
i = 0
54 if gc_bias:
55 stiched_seq = forward.seq + reverse.seq
56 gc_content = GC(stiched_seq)
57 if 40 < gc_content < 60:
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/site-packages/iss/generator.py in simulate_read(record=SeqRecord(seq=Seq('ATGCGGGAGTCTAGTCAAATACGTGAGAC...r. Ames chromosome, complete genome', dbxrefs=[]), ErrorModel=<iss.error_models.kde.KDErrorModel object>, i=0)
130 assert reverse_end < len(record.seq)
131 except AssertionError as e:
132 # we use random insert when the modelled template length distribution
133 # is too large
134 reverse_end = len(record.seq) - random.randrange(
--> 135 read_length, len(record.seq) - read_length
read_length = array(301)
record.seq = Seq('ATGCGGGAGTCTAGTCAAATACGTGAGACGACAGAAACAAAAATAAAATTAAGT...TAA', SingleLetterAlphabet())
136 )
137 reverse_start = reverse_end - read_length
138 bounds = (reverse_start, reverse_end)
139 # create a perfect read
...........................................................................
/Users/hadrien/.local/share/virtualenvs/workspace-w0RVuX47/lib/python3.6/random.py in randrange(self=<random.Random object>, start=array(301), stop=284, step=1, _int=<class 'int'>)
194 raise ValueError("non-integer stop for randrange()")
195 width = istop - istart
196 if step == 1 and width > 0:
197 return istart + self._randbelow(width)
198 if step == 1:
--> 199 raise ValueError("empty range for randrange() (%d,%d, %d)" % (istart, istop, width))
istart = 301
istop = 284
width = -17
200
201 # Non-unit step argument supplied.
202 istep = _int(step)
203 if istep != step:
ValueError: empty range for randrange() (301,284, -17)
We need more information of the provided error models, i.e. read length, insert size distribution, ...
This will added to the doc and eventually some basic info at the command-line
From Reviewer 2:
The software does not run correctly all the time. For example, running
iss generate --genomes genomes.fasta --model miseq --output miseq_reads
models miseq and novaseq did not work.
A comment by the reviewers is that I may be convenient to let the user control the coverage directly instead of it being a function of n_reads * abundance.
It is fairly simple to implement, and having mutually exclusive flags --abundance
or --coverage
would only warrant a minor version bump.
This the main criticism received by the paper and the reason for the major revision
Software to compare to:
Software | License | Installed | Ran |
---|---|---|---|
InSilicoSeq | MIT | βοΈ | βοΈ |
MetaSim | Academic | β | β |
FunctionSIM | None | β | β |
NeSSM | Academic | β | β |
BEAR | Academic | βοΈ | β |
FASTQSim | GPL-3.0 | βοΈ | β |
GemSim | GPL-3.0 | βοΈ | β |
Grinder | GPL-3.0 | βοΈ | βοΈ |
pIRS | GPL-2.0 | βοΈ | βοΈ |
TODO:
Software | Open Source | Dependency management | CLI | One command |
---|---|---|---|---|
InSilicoSeq | βοΈ | βοΈ | βοΈ | βοΈ |
MetaSim | β | β | βοΈ | βοΈ |
FunctionSIM | β | β | β | βοΈ |
NeSSM | β | β | βοΈ | β |
BEAR | βοΈ | β | βοΈ | β |
FASTQSim | βοΈ | β | βοΈ | β |
GemSim | βοΈ | β | βοΈ | β |
Grinder | βοΈ | βοΈ | βοΈ | βοΈ |
pIRS | βοΈ | β | βοΈ | β |
If the mode (error model) is basic, it might be good to put a warning saying that any specified model will be ignored
iss -v
, iss --version
should display the version number.
could also display the version number in --debug
mode before the programme starts.
Issue to track the progress on the Roadmap item "From a set of genomes, output perfect reads in fastq format given abundance parameters"
Test files:
Number of genomes: 5
the --n_reads
of iss generate
should allow suffixes like "M, G"
Reviewers requested example data.
While I won't include large-ish example data with the pip package, I'm willing to put example files on <osf.io> and modify the examples to include the download and running with the downloaded data.
In light of reviewer 3's comments but only vaguely related, it would be a great addition to allow presets of "known" microbiomes to be simulated, i.e.
iss generate --model miseq --preset human_gut --output wathever
I'll get on that after the issues tagged with review though
Issue to track the progress on the Roadmap item "Submit package to PyPi"
This should be the last step before beta release π
setup.py
Issue to track the progress on the Roadmap item "The user chooses nothing, and random genomes from RefSeq genomes are selected"
If we use the default number of reads and do not specify any, we get this:
Traceback (most recent call last):
File "/usr/local/bin/iss", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 365, in main
args.func(args)
File "/usr/local/lib/python3.6/site-packages/iss/app.py", line 108, in generate_reads
n_reads = util.convert_n_reads(args.n_reads)
File "/usr/local/lib/python3.6/site-packages/iss/util.py", line 121, in convert_n_reads
if unit[-1].isdigit():
TypeError: 'int' object is not subscriptable
Issue to track the progress on the Roadmap item "Add proper logging and error handling"
Most of the old tests are broken since the introduction of indels + the indel code is not tested
The two error models currently built-in are based on fairly old genomics data. It'd be nice to base them on recent-ish metagenomes, or at least recent genomic datasets sequencing with most popular read-lengths + lib prep
Roadmap reference: Provide at least 1 good HiSeq and 1 good MiSeq model
Issue to track the progress on the Roadmap item "Add a more complicated error model"
This is where the fun begins.
The current plan is to:
* It will be very important to track every step used to generate the profiles: genomes download, mapping options, ... I will keep everything in evernote for now but will dump it in the software doc later on.
** I'll need pysam as a dependency. Hopefully it will not add difficulties to the install process Pysam installed seamlessly
*** something of the like (not sure yet, subject to change!):
pos | AtoA | AtoT | ... |
---|---|---|---|
1 | 0.96 | 0.01 | ... |
... |
Since 1.1.0 and the introduction of -u
/--n_genomes
for multifasta the ncbi support appears broken
Traceback (most recent call last):
File "/opt/sw/insilicoseq/1.1.0/bin/iss", line 11, in <module>
load_entry_point('InSilicoSeq==1.1.0', 'console_scripts', 'iss')()
File "/opt/sw/insilicoseq/1.1.0/lib/python3.6/site-packages/iss/app.py", line 403, in main
args.func(args)
File "/opt/sw/insilicoseq/1.1.0/lib/python3.6/site-packages/iss/app.py", line 136, in generate_reads
n = args.n_genomes[0][0]
TypeError: 'int' object is not subscriptable
When using a distribution, no information about the genomes abundance is printed to the log or on file.
Such information is important for the downstream analysis of the simulated reads.
The Installation instructions should contain pip --user
and advanced instructions for custom install (i.e. with a module system)
Issue to track the progress on the Roadmap item "Generate paired-end reads"
* The number of reads is divided by 2 in the generator function and taken as number of pairs
**We'll probably want to yield a tuple containing two SeqRecord objects: one with the forward read, the other with the reverse
A declarative, efficient, and flexible JavaScript library for building user interfaces.
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. πππ
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google β€οΈ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.