merenlab / illumina-utils

A library and collection of scripts to work with Illumina paired-end data (for CASAVA 1.7+ pipeline).

License: GNU General Public License v2.0

Python 98.93% R 1.07%

illumina-utils's Issues

Installation with pip-20.1 failed

Hi,
I tried to do the installation of illumina-utils for anvio with pip-20.1.
The following error came up.

Using cached illumina-utils-2.7.tar.gz (3.3 MB)
ERROR: Command errored out with exit status 1:
 command: /Users/virtual-envs/anvio-master/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/_b/40l3prb52ln490d96qf4ccmh0000gn/T/pip-install-n0dzn60q/illumina-utils/setup.py'"'"'; __file__='"'"'/private/var/folders/_b/40l3prb52ln490d96qf4ccmh0000gn/T/pip-install-n0dzn60q/illumina-utils/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/_b/40l3prb52ln490d96qf4ccmh0000gn/T/pip-pip-egg-info-y_xa8czy
 cwd: /private/var/folders/_b/40l3prb52ln490d96qf4ccmh0000gn/T/pip-install-n0dzn60q/illumina-utils/
Complete output (7 lines):
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/var/folders/_b/40l3prb52ln490d96qf4ccmh0000gn/T/pip-install-n0dzn60q/illumina-utils/setup.py", line 17, in <module>
    reqs = [str(ir.req) for ir in install_reqs]
  File "/private/var/folders/_b/40l3prb52ln490d96qf4ccmh0000gn/T/pip-install-n0dzn60q/illumina-utils/setup.py", line 17, in <listcomp>
    reqs = [str(ir.req) for ir in install_reqs]
AttributeError: 'ParsedRequirement' object has no attribute 'req'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I downgraded pip to pip-19.3 and the installation worked without any problems.
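
The traceback shows setup.py reading `.req` from objects returned by pip's requirements parser, which is an internal pip API and appears to have changed its return type in pip 20.1. A minimal sketch of one way to avoid depending on pip internals, assuming the dependencies live in a plain requirements file (the `read_requirements` helper and the file contents below are hypothetical, not taken from the project's setup.py):

```python
import os
import tempfile


def read_requirements(path):
    """Return requirement strings from a pip-style requirements file."""
    requirements = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and pip options such as "-r other.txt".
            if not line or line.startswith(("#", "-")):
                continue
            requirements.append(line)
    return requirements


# Demonstration with a throwaway file; the package names are placeholders.
demo_requirements_path = os.path.join(tempfile.mkdtemp(), "requirements.txt")
with open(demo_requirements_path, "w") as handle:
    handle.write("# plotting\nmatplotlib\n\nnumpy>=1.0\n")

demo_requirements = read_requirements(demo_requirements_path)
```

The resulting list could be passed to `install_requires` in `setup()` without importing anything from pip.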

SequenceSource compressed for fasta

Hi Meren,
There is a "compressed" parameter for fastq files, but I can't find one for fasta. Is it possible to add it or am I missing something?
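
For illustration only, here is a sketch of what reading a gzip-compressed FASTA file could look like with the standard library. The `read_fasta` function and its `compressed` argument are hypothetical and are not part of the SequenceSource API:

```python
import gzip
import os
import tempfile


def read_fasta(path, compressed=False):
    """Yield (header, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if compressed else open
    with opener(path, "rt") as handle:
        header, chunks = None, []
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new defline closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)


# Demonstration with a small throwaway file.
demo_fasta_path = os.path.join(tempfile.mkdtemp(), "demo.fa.gz")
with gzip.open(demo_fasta_path, "wt") as handle:
    handle.write(">seq_1\nACGT\nTTAA\n>seq_2\nGGCC\n")

fasta_records = list(read_fasta(demo_fasta_path, compressed=True))
```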

default number of cores

Hello

rapidmerge.py uses multiprocessing.cpu_count() to get the number of available CPUs, which returns the number of CPUs in the machine. This is not the same as the number of CPUs available to the process, for example when it runs under taskset or a batch scheduler such as Slurm.

see:

$ nproc
96
$ taskset -c 1 nproc
1
$ taskset -c 1 python3 -c "import multiprocessing; print(multiprocessing.cpu_count())"
96

I would suggest using len(os.sched_getaffinity(0)) instead of multiprocessing.cpu_count():

$ python3 -c "import os; print(len(os.sched_getaffinity(0)))"
96
$ taskset -c 1 python3 -c "import os; print(len(os.sched_getaffinity(0)))"
1

regards

Eric
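
The suggestion above can be sketched as a small helper. `os.sched_getaffinity` is not available on every platform (for example macOS and Windows), so this sketch falls back to `multiprocessing.cpu_count()` there. The function name `available_cpu_count` is hypothetical and is not taken from rapidmerge.py:

```python
import multiprocessing
import os


def available_cpu_count():
    """Return the number of CPUs this process is allowed to run on."""
    try:
        # Respects taskset, cgroup cpusets, and most batch schedulers.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the machine total.
        return multiprocessing.cpu_count()


num_cpus = available_cpu_count()
```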

Python 3 compatible

Some methods are not compatible with Python 3. For example:

iu-merge-pairs
      ^
SyntaxError: Missing parentheses in call to 'print'

Failed quality curve generation

Hi,

I'm running into an error trying to produce the quality plots with iu-filter-quality-minoche. This is the command I ran and the error I received:

#!/bin/bash
source activate anvio-7
iu-gen-configs samples_ada.txt -o 01_QC_minoche/
for ini in 01_QC_minoche/*.ini;
do iu-filter-quality-minoche $ini --visualize-quality-curves;
done

Quality scores visualization in progress: FAILED_REASON_N                    Traceback (most recent call last):
  File "/home/saatkinson/anaconda3/envs/anvio-7/bin/iu-filter-quality-minoche", line 314, in <module>
    sys.exit(main(config, args))
  File "/home/saatkinson/anaconda3/envs/anvio-7/bin/iu-filter-quality-minoche", line 265, in main
    title = 'Mean PHRED scores for pairs tagged as "%s"' % entry_type)
  File "/home/saatkinson/anaconda3/envs/anvio-7/lib/python3.6/site-packages/IlluminaUtils/utils/helperfunctions.py", line 558, in visualize_qual_stats_dict
    subplots[tile] = plt.subplot(next(gs))
TypeError: 'Gs' object is not an iterator

All the other files associated with Minoche seem to have been generated, just not the plots.

Any help getting the quality plots to generate would be most appreciated!
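
The TypeError says that `next()` was called on an object that is not an iterator. What the `Gs` object in helperfunctions.py actually is cannot be told from the traceback, so the `Grid` class below is a hypothetical stand-in that only reproduces the general pattern: an object can be iterable without being an iterator, and wrapping it in `iter()` once makes `next()` work.

```python
class Grid:
    """Hypothetical stand-in: iterable (has __iter__) but not an iterator."""

    def __init__(self, size):
        self.size = size

    def __iter__(self):
        return iter(range(self.size))


grid = Grid(3)

try:
    next(grid)  # fails: Grid defines no __next__
    iterator_error = None
except TypeError as error:
    iterator_error = str(error)

# Creating an iterator once and calling next() on it works.
cells = iter(grid)
first_cell = next(cells)
second_cell = next(cells)
```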
Thanks,
Samantha

Suggested changes to iu-remove-ids-from-fastq

I suggest (and plan to implement) the following changes to iu-remove-ids-from-fastq:

Add a flag -G, --generate-output-for-survived-only
-K, --keep-ids - if provided, then instead of removing the reads in the list, only the reads in the list will be kept (and the rest would be removed).
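
The proposed --keep-ids behaviour is the same filter with the membership test inverted. A minimal sketch of that idea, using plain tuples instead of the library's FASTQ classes (the `filter_entries` function and the sample reads are hypothetical):

```python
def filter_entries(entries, ids, keep=False):
    """Filter (read_id, sequence, qualities) tuples by a list of read IDs.

    keep=False removes the listed reads; keep=True keeps only the listed reads.
    """
    ids = set(ids)
    return [entry for entry in entries if (entry[0] in ids) == keep]


reads = [
    ("read_1", "ACGT", "IIII"),
    ("read_2", "TTGA", "IIII"),
    ("read_3", "CCAT", "IIII"),
]
listed_ids = ["read_2"]

survivors_after_removal = filter_entries(reads, listed_ids)
survivors_when_keeping = filter_entries(reads, listed_ids, keep=True)
```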

iu-trim-fastq not working

Using v2.10.
I've tried to use iu-trim-fastq but it failed with this error:

iu-trim-fastq -f 0 -t 100 R1.fastq.gz R1-TRIMMED-TO-100bp.fastq.gz
 00% -- (num pairs processed: 1) Traceback (most recent call last):
  File "/project2/meren/VIRTUAL-ENVS/anvio-dev/bin/iu-trim-fastq", line 51, in <module>
    sys.exit(main(input_file_path, output_file_path, args.trim_from, args.trim_to, compressed))
  File "/project2/meren/VIRTUAL-ENVS/anvio-dev/bin/iu-trim-fastq", line 25, in main
    output.store_entry(input.entry)
  File "/project2/meren/VIRTUAL-ENVS/anvio-dev/lib/python3.6/site-packages/IlluminaUtils/lib/fastqlib.py", line 191, in store_entry
    self.file_pointer.write('@' + e.header_line + '\n')
  File "/project2/meren/VIRTUAL-ENVS/anvio-dev/lib/python3.6/gzip.py", line 260, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

Maybe related to a python 2 to 3 migration?
Thanks for the help!
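
The traceback shows a `str` being written to a gzip file opened in binary mode, which Python 3 rejects. Two common ways to handle this are sketched below: open the file in text mode, or encode the string before writing. This is a generic illustration, not the fix applied in fastqlib.py, and the header line is a placeholder.

```python
import gzip
import os
import tempfile

header_line = "read_1 1:N:0:21"  # placeholder defline
work_dir = tempfile.mkdtemp()
text_mode_path = os.path.join(work_dir, "text-mode.fastq.gz")
binary_mode_path = os.path.join(work_dir, "binary-mode.fastq.gz")

# Option 1: text mode ("wt") accepts str and encodes it for you.
with gzip.open(text_mode_path, "wt") as output:
    output.write("@" + header_line + "\n")

# Option 2: binary mode ("wb") needs bytes, so encode explicitly.
with gzip.open(binary_mode_path, "wb") as output:
    output.write(("@" + header_line + "\n").encode("ascii"))

with gzip.open(text_mode_path, "rt") as handle:
    text_mode_content = handle.read()
with gzip.open(binary_mode_path, "rt") as handle:
    binary_mode_content = handle.read()
```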

Merging ITS amplicons

ITS amplicons can have a great variation in length. Therefore the insert size may become too small to have a partial overlap. Better yet, the insert size may be long enough for a partial overlap, but then after trimming the prefixes from both reads you may run into a situation that requires complete overlap analysis instead of partial overlap.

Here is an example. Say this is read 1 in one of your ITS paired-end sequences:

@M01028:24:000000000-A49NB:1:1101:16134:1723 1:N:0:21
TACGTCAGCGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTACTGTTATTTACTACTACACTGCGTGAGCGGAACGAAAACAACAACACCTAAAATGTGGAATATAGCATATAGTCGACAAGAGAAATCTACGAAAAAACAAACAAAACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAGCGCAGCGAAATGCGATACCTAGTGTGAATTGCAGCCGTCGTGAAT
+
BBBBBFBBFBBBGGGGGGGFGGEHHHHHHHHHGHHGGHHHHHHHHHGGEGGGHBGHFHGHHHH5DFGHHHHHHHHHHHHHG>?B?EFEGGGGGGGDGHHHHHFHHGGH3F?3BFFGHFHHHHHHHBDGHHHHH/F/CFHHHHHHHHHHHGGHGGHGGHHHHHH.GHHHHHHHHHHHGGGGGHHHHGGGGGGGGGGGGGGGGFGGGGFGGG-@DF-@EFEDFFFFFFFFFFFFFFFFFFFFFFAFCFDA/A/

and this is read 2:

@M01028:24:000000000-A49NB:1:1101:16134:1723 2:N:0:21
GTTCAAAGATTCGATGATTCACGACGGCTGCAATTCACACTAGGTATCGCATTTCGCTGCGCTCTTCATCGATGCGAGAACCAAGAGATCCGTTGTTGAAAGTTTTGTTTGTTTTTTCGTAGATTTCTCTTGTCGACTATATGCTATATTCCACATTTTAGGTGTTGTTGTTTTCGTTCCGCTCACGCAGTGTAGTAGTAAATCACAGTAATGATCCTTCCGCAGGTTCACCTACGGAAACCTTGTTACGA
+
AABABFFFFFFFGGGGGGG6FGHGAAEEEGGFHGHFFF5BBB3BDFGGGGGGGHHHGGGGGGCGHHHHHHHHHHFEGGGGHHHHHHHGHHGHG3EGHH4BFFHHHHHGHHHHHHHGGHGHHGEHHFHHHHHHH1DGCGCHFGHFGDGFHHFGGHFGHHGF0GGGHGHHHHFHHHH?GHGGF?EGGHHG@BEFFBFFFBFFF0;FFFFFGGBF9BFF0FFGBFB@B;FFFFFFFFFFFF.BBFFBFBFBB?9

and this is your config.ini file to merge them:

[general]
project_name = CGCCTT_NNNNTCAGC_1
researcher_email = [email protected]
input_directory = /somewhere/on/your/disk
output_directory = /somewhere/on/your/disk/output

[files]
pair_1 = r1
pair_2 = r2

# following section is optional
[prefixes]
pair_1_prefix = ^....TCAGCGTAAAAGTCGTAACAAGGTTTC
pair_2_prefix = ^GTTCAAAGA[C,T]TCGATGATTCAC

And if you run these like this:

merge-illumina-pairs short.ini

reads 1 and 2 will not merge because, after trimming the prefixes and reverse-complementing read 2, they align like this:

read 1:                 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
read 2: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

unlike our expected partial overlap situation:

read 1: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
read 2:                 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

To solve this issue a new parameter is required. When m/o (mismatches at the overlapped region) fails miserably, the algorithm should check "the other way around" alternative, and pick the one with the best m/o.
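
The idea of checking both arrangements and keeping the one with the better m/o can be sketched as follows. This is a toy illustration with an exhaustive ungapped overlap search, not the merging algorithm used by illumina-utils. All function names are hypothetical, and the low-complexity reads were chosen only so the result is easy to verify by eye.

```python
def reverse_complement(sequence):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(sequence))


def best_overlap(left, right, min_overlap=5):
    """Align the end of `left` with the start of `right`.

    Return (mismatch_ratio, overlap_length) for the best overlap, or None.
    """
    best = None
    for length in range(min_overlap, min(len(left), len(right)) + 1):
        mismatches = sum(a != b for a, b in zip(left[-length:], right[:length]))
        # Prefer a lower mismatch ratio, then a longer overlap.
        candidate = (mismatches / length, -length)
        if best is None or candidate < best:
            best = candidate
    return None if best is None else (best[0], -best[1])


def pick_orientation(read_1, read_2_rc, min_overlap=5):
    """Try read 1 on the left (expected) and on the right (swapped)."""
    expected = best_overlap(read_1, read_2_rc, min_overlap)
    swapped = best_overlap(read_2_rc, read_1, min_overlap)
    if swapped is not None and (expected is None or swapped[0] < expected[0]):
        return "swapped", swapped
    return "expected", expected


# Read 1 starts inside the region that read 2 covers first, so only the
# swapped arrangement overlaps cleanly.
read_1 = "CCCCCCCCCCGGGGGGGGGG"
read_2 = "GGGGGGGGGGTTTTTTTTTT"
orientation = pick_orientation(read_1, reverse_complement(read_2))
```

A real implementation would also need a threshold that decides when the expected arrangement has failed badly enough to justify trying the other one.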

Issue with config file for quality filtering

Hi Meren,

I am getting this error when I perform iu-merge-pairs - Config File Error: Unexpected value for "pair_1" section "files": RC_AB4_062216-R1.fastq. Not sure what the issue is. I prepared the samples.txt file (first two rows of the file shown below).

sample	r1	r2
RC_AB4_062216	RC_AB4_062216-R1.fastq	RC_AB4_062216-R2.fastq

I ran iu-gen-configs and then ran iu-merge-pairs as shown below (I also tried iu-filter-minoche to see if the issue was the iu-merge-pairs command itself but that also gave the same error).

iu-gen-configs samples.txt -o 01_QC
iu-merge-pairs --debug 01_QC/RC_AB4_062216.ini 

Config File Error: Unexpected value for "pair_1" section "files": RC_AB4_062216-R1.fastq

Here is the appearance of the config file for the sample

[general]
project_name = RC_AB4_062216
researcher_email = [email protected]
input_directory = /media/shared/Onedrive/Postdoc_Gu/Projects/Oligotyping_RockCreek/metagenomics/raw_fq
output_directory = /media/shared/Onedrive/Postdoc_Gu/Projects/Oligotyping_RockCreek/metagenomics/01_QC

[files]
pair_1 = RC_AB4_062216-R1.fastq
pair_2 = RC_AB4_062216-R2.fastq

Any help with this would be greatly appreciated! (I have this foreboding feeling that I am just missing something extremely obvious!).

Thanks
Varun
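
One thing worth ruling out, on the assumption (not confirmed in this thread) that the config check rejects file names that do not resolve to existing files under input_directory, is whether both FASTQ files are actually present there. A minimal sketch of such a check with the standard library; the `missing_input_files` helper is hypothetical and the demonstration uses throwaway files:

```python
import configparser
import os
import tempfile


def missing_input_files(config_path):
    """Return paths named in [files] that do not exist under input_directory."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path)
    input_directory = config.get("general", "input_directory")
    missing = []
    for key in ("pair_1", "pair_2"):
        # Each entry may hold several comma-separated file names.
        for name in config.get("files", key).split(","):
            path = os.path.join(input_directory, name.strip())
            if not os.path.exists(path):
                missing.append(path)
    return missing


# Demonstration: R1 exists, R2 does not.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "sample-R1.fastq"), "w").close()
demo_config_path = os.path.join(demo_dir, "sample.ini")
with open(demo_config_path, "w") as handle:
    handle.write(
        "[general]\n"
        "project_name = sample\n"
        "input_directory = " + demo_dir + "\n\n"
        "[files]\n"
        "pair_1 = sample-R1.fastq\n"
        "pair_2 = sample-R2.fastq\n"
    )

missing_files = missing_input_files(demo_config_path)
```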

how to demultiplex unknown casava version fastq file ?

Dear all,

I am trying to use iu-demultiplex to work with those fastq files : https://github.com/caporaso-lab/mockrobiota/blob/master/data/mock-9/dataset-metadata.tsv

but they do not seem to be FASTQ files generated by CASAVA 1.8, as iu-demultiplex returns this error:

          Header lines in your FASTQ file does not seem to be the ones illumina-utils     
          expects to see in a FASTQ file generated by CASAVA 1.8. If you call this        
          funciton with 'raw = True' parameter, all should be fine. If you are accessing  
          this function through a client, or in other words if you have no idea what this 
          message is telling you, try to re-run the program with --ignore-deflines        
          parameter. If that parameter is not available to you, then please send an e-mail
          to [email protected]

and I don't understand the 'raw = True' parameter, as it is not an option of iu-demultiplex.
Could you tell me whether it is possible to use your program with these data?

They look like this:

head mock-forward-read.fastq

@ILLUMINA_0331:1:1101:1214:2235#NNNNNNNNNNNN/1
TACGTAGGGCGCAAGCGTTGTCCGGAATTANTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
a_aeceeegggggiiiiiiighiiihehifBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

head mock-index-read.fastq

@ILLUMINA_0331:1:1101:1214:2235#NNNNNNNNNNNN/1
NNNNNNNNNNNN
+
YYYYYYYYYYYY

head mock-reverse-read.fastq

@ILLUMINA_0331:1:1101:1214:2235#NNNNNNNNNNNN/2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

(it's not always NNNNNNN in the index file)

Kind regards

Maria
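
The deflines above follow the older Illumina layout (`@instrument:lane:tile:x:y#index/read`), while CASAVA 1.8 writes `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`. A small sketch for telling the two apart; the regular expressions are simplified and hypothetical, and they are not the patterns illumina-utils itself uses:

```python
import re

# Simplified patterns for the two defline layouts.
CASAVA_18 = re.compile(
    r"^@?[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+ [12]:[YN]:\d+:\S*$"
)
PRE_18 = re.compile(r"^@?[^:\s#]+:\d+:\d+:\d+:\d+#\S+/[12]$")


def defline_style(defline):
    """Classify a FASTQ defline as 'casava-1.8', 'pre-1.8', or 'unknown'."""
    if CASAVA_18.match(defline):
        return "casava-1.8"
    if PRE_18.match(defline):
        return "pre-1.8"
    return "unknown"


new_style = defline_style("@M01028:24:000000000-A49NB:1:1101:16134:1723 1:N:0:21")
old_style = defline_style("@ILLUMINA_0331:1:1101:1214:2235#NNNNNNNNNNNN/1")
```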

iu-subsample-fastq not recognizing paired ends

I am trying to subsample paired end fastq reads with iu-subsample-fastq but I am getting an error that states that the paired end reads are different lengths. However, when I grep the "@" from each paired end fastq file I get the same number of reads. Any ideas what's happening here?

Help interpreting iu-filter-quality-minoche crash

I am hoping someone can help me interpret this error message. I am using iu-filter-quality-minoche on some NextSeq metagenomes (4x R1, 4x R2 per sample) that I ran through Trimmomatic. For most of the metagenomes iu-filter-quality-minoche runs just fine, but a few always fail at the same point in the processing. For this sample it is around 24,000 reads in. If I run iu-filter-quality-minoche on the raw data before Trimmomatic, I have no issues. So it seems Trimmomatic is doing something to a read that causes iu-filter-quality-minoche to crash, but I don't know how to interpret the error and thus troubleshoot the problem.

(num pairs processed: 23,000)
(num pairs processed: 24,000)
Traceback (most recent call last):
File "/miniconda3/bin/iu-filter-quality-minoche", line 313, in <module>
sys.exit(main(config, args))
File "/miniconda3/bin/iu-filter-quality-minoche", line 178, in main
p1_passed_qual, p1_trim_to, p1_fate = IsHighQuality(s1, q1, p)
File "/miniconda3/bin/iu-filter-quality-minoche", line 68, in IsHighQuality
trim_to = None if len(sequence) == trim_to else trim_to
UnboundLocalError: local variable 'trim_to' referenced before assignment
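
The UnboundLocalError means that `trim_to` was read on a code path where nothing had assigned it yet. The source of IsHighQuality is not shown here, so the following is only a generic illustration of how that can happen, for example with an input the function did not anticipate such as an empty read (an assumption, not a diagnosis). Both functions are hypothetical.

```python
def trim_point_buggy(qualities, threshold=30):
    """Number of leading bases at or above the threshold (fails on empty input)."""
    for position, quality in enumerate(qualities):
        trim_to = position  # only bound once the loop body has run
        if quality < threshold:
            return trim_to
    return trim_to + 1  # UnboundLocalError when `qualities` is empty


def trim_point_fixed(qualities, threshold=30):
    """Same result for non-empty input, with a defined result for empty input."""
    trim_to = 0  # bound before any branch can read it
    for position, quality in enumerate(qualities):
        if quality < threshold:
            break
        trim_to = position + 1
    return trim_to


try:
    trim_point_buggy([])
    buggy_failed = False
except UnboundLocalError:
    buggy_failed = True

fixed_result_for_empty = trim_point_fixed([])
```

Checking the reads around pair 24,000 for zero-length sequences would show whether this kind of input is involved.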

--ignore-deflines flag is incompatible with --compute-qual-dicts flag.

If --ignore-deflines flag is used, the use of --compute-qual-dicts should throw an exception.

Related to #3.
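
A minimal sketch of such a check with argparse. The two flag names come from this issue; the parser itself, the program name, and the error text are hypothetical.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="iu-example")
    parser.add_argument("--ignore-deflines", action="store_true")
    parser.add_argument("--compute-qual-dicts", action="store_true")
    args = parser.parse_args(argv)

    # Tile-based quality dictionaries need parsed deflines, so refuse the combination.
    if args.ignore_deflines and args.compute_qual_dicts:
        parser.error("--compute-qual-dicts cannot be used with --ignore-deflines")
    return args


accepted_args = parse_args(["--ignore-deflines"])
```

`parser.error()` prints the usage message and exits with status 2, so the user sees the conflict before any reads are processed.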

using iu-remove-ids-from-fastq to remove read ids obtained from a bam file

First of all, I wanted to say that this is a great tool, so I just wanted to thank the developers!

I am trying to use iu-remove-ids-from-fastq to remove some reads that were mapped using bowtie2, but I have the following problem:
in the bam output from the bowtie2 mapping the reads look like this:
fasta_02:23:B02CBACXX:8:2315:2667:7273

Whereas, if I look at the corresponding read in the fastq file, it looks like this:
@fasta_02:23:B02CBACXX:8:2315:2667:7273 1:N:0:GATCAG

And iu-remove-ids-from-fastq expects:
fasta_02:23:B02CBACXX:8:2315:2667:7273 1:N:0:GATCAG

Even though to my understanding the read name fasta_02:23:B02CBACXX:8:2315:2667:7273 is unique.

Could this behavior be modified?

Thank you!
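
Until the tool supports this, one workaround is to compare only the part of the defline before the first whitespace, which is what the BAM file carries. A minimal sketch, with a hypothetical helper name:

```python
def read_name(defline):
    """Return the read name: the defline without '@' and without the comment part."""
    if defline.startswith("@"):
        defline = defline[1:]
    return defline.split()[0]


fastq_defline = "@fasta_02:23:B02CBACXX:8:2315:2667:7273 1:N:0:GATCAG"
bam_read_name = "fasta_02:23:B02CBACXX:8:2315:2667:7273"

names_match = read_name(fastq_defline) == bam_read_name
```

Note that both mates of a pair share this name, so matching on it alone affects read 1 and read 2 together.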

iu-merge-pairs error

Here is my case. I got my V1-V3 data sequenced by an external provider. They say they used the Illumina Casava pipeline version 1.8.3.

As a start I tried to generate a config file. I followed the steps listed at https://github.com/meren/illumina-utils:

I first generated a tab-delimited file listing all the sample names and the corresponding paired-end fastq files. Then I ran iu-gen-configs and was surprised that, instead of generating a single config file, it generated a config file for each sample.

Then I decided to merge the paired-end fastq files for each sample with iu-merge-pairs, using the --compute-qual-dicts option. When I ran it for the first sample, it produced the following error:

Error: Your input FASTQ files do not seem to be generated by CASAVA 1.8. Please use --ignore-deflines parameter.

I added the parameter as requested. Then I got another error message:

$ iu-merge-pairs --compute-qual-dicts --ignore-deflines 16001_posD09_CCTAAGACACTGCATA.ini
Traceback (most recent call last):
  File "/usr/local/bin/iu-merge-pairs", line 770, in <module>
    sys.exit(merger.run())
  File "/usr/local/bin/iu-merge-pairs", line 398, in run
    tile_number = self.input_1.entry.tile_number
  File "/Library/Python/2.7/site-packages/IlluminaUtils/lib/fastqlib.py", line 82, in __getattr__
    return getattr(self, '_'.join(['process', key]))()
(...)
  File "/Library/Python/2.7/site-packages/IlluminaUtils/lib/fastqlib.py", line 82, in __getattr__
    return getattr(self, '_'.join(['process', key]))()
  File "/Library/Python/2.7/site-packages/IlluminaUtils/lib/fastqlib.py", line 73, in __getattr__
    if key in ['__str__']: 
RuntimeError: maximum recursion depth exceeded in cmp

I don't know what to do now. Can you, perhaps, advise?
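
The repeated frame in the traceback shows `__getattr__` looking up `process_<key>` on the same object. If that method does not exist, the lookup re-enters `__getattr__` with a longer key each time until the recursion limit is hit. The two classes below are hypothetical and only reproduce that pattern, along with one possible guard.

```python
class UnguardedEntry:
    """Hypothetical class reproducing the lookup pattern from the traceback."""

    def __getattr__(self, key):
        return getattr(self, "_".join(["process", key]))()


class GuardedEntry:
    """Same pattern, but a missing process_* method ends the lookup."""

    def process_length(self):
        return 4

    def __getattr__(self, key):
        if key.startswith("process_"):
            raise AttributeError(key)
        return getattr(self, "_".join(["process", key]))()


try:
    UnguardedEntry().tile_number
    recursed = False
except RecursionError:
    recursed = True

guarded_length = GuardedEntry().length
guarded_has_tile = hasattr(GuardedEntry(), "tile_number")
```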

iu-demultiplex creating errors

Dear developer,

I am trying to demultiplex an Illumina run using iu-demultiplex, here is my command iu-demultiplex -s SampleSheet-RC.txt --r1 lane1_NoIndex_L001_R1_001-13C.fastq.gz --r2 lane1_NoIndex_L001_R3_001-13C.fastq.gz --index lane1_NoIndex_L001_R2_001-13C.fastq.gz -o output/

But I got the following errors:
Output directory .............................: /Users/Jincheng/Desktop/tmp/output
Barcodes .....................................: 13 samples found
Traceback (most recent call last):
File "/Users/Jincheng/miniconda3/envs/py34/bin/iu-demultiplex", line 238, in <module>
d._run()
File "/Users/Jincheng/miniconda3/envs/py34/bin/iu-demultiplex", line 45, in _run
self.build_index()
File "/Users/Jincheng/miniconda3/envs/py34/bin/iu-demultiplex", line 116, in build_index
progress.update('~%.2f%% (num index reads with no barcode: %d (%.2f%% of all reads))' % (self.index.percent_read, missing_barcode, missing_barcode * 100.0 / num_index))
TypeError: a float is required

Could you help?
Thank you!

Jincheng

fastaunique

Hi Meren,
We finally switched to python3 and now I have a problem:
/groups/vampsweb/seqinfobin/anaconda3/bin/python3 ./fastaunique TTAGGC_NNNNTCAGC_1_MERGED_V6_PRIMERS_REMOVED
Traceback (most recent call last):
File "./fastaunique", line 74, in <module>
main(args)
File "./fastaunique", line 13, in main
input = u.SequenceSource(args.input_fasta, unique = True)
File "/groups/vampsweb/seqinfobin/anaconda3/lib/python3.6/site-packages/IlluminaUtils/lib/fastalib.py", line 94, in __init__
self.init_unique_hash()
File "/groups/vampsweb/seqinfobin/anaconda3/lib/python3.6/site-packages/IlluminaUtils/lib/fastalib.py", line 98, in init_unique_hash
hash = hashlib.sha1(self.seq.upper()).hexdigest()
TypeError: Unicode-objects must be encoded before hashing
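
In Python 3, hashlib functions accept bytes rather than str, so the sequence has to be encoded before hashing. A minimal sketch of that change; the `sequence_hash` helper is hypothetical, and fastalib.py may have been fixed differently:

```python
import hashlib


def sequence_hash(sequence):
    """Return the SHA-1 hex digest of an upper-cased sequence string."""
    return hashlib.sha1(sequence.upper().encode("utf-8")).hexdigest()


digest_lower = sequence_hash("acgt")
digest_upper = sequence_hash("ACGT")
```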

issue with demultiplexing example

Hello -
I was trying to run the test example using this command:

(illumina-utils-dev) delaney@ada:~/software/illumina-utils/examples/demultiplexing$ iu-demultiplex -s barcode_to_sample.txt --r1 r1.fastq --r2 r2.fastq --index index.fastq -o output/

and get the following error:

Traceback (most recent call last):
File "/home/delaney/software/illumina-utils/scripts/iu-demultiplex", line 18, in <module>
import IlluminaUtils.lib.fastqlib as u
ModuleNotFoundError: No module named 'IlluminaUtils'

Any help is greatly appreciated!!

not compatible with python 3.6

Dear illumina-utils Developer
I have Python 3.6 activated and I am receiving a strange error when trying to run illumina-utils.
Here is my command line and the error:
(py36) [dieunel@genomics ~]$ iu-filter-quality-minoche -h
Your active Python major version ('2') is not compatible with what illumina-utils expects :/ We recently switched to Python 3.

Please any help ?

Regards

Python3 compatibility - iu-merge-pairs

Hi,

I noticed a small Python3 compatibility bug while running iu-merge-pairs.

The error message that I got was this:

Traceback (most recent call last):
  File "/home/danielle/virtual-envs/illumina-utils-v2.0.0/bin/iu-merge-pairs", line 770, in <module>
    sys.exit(merger.run())
  File "/home/danielle/virtual-envs/illumina-utils-v2.0.0/bin/iu-merge-pairs", line 302, in run
    r1_passed_Q30, r1_Q30 = self.passes_minoche_Q30(self.input_1.entry.Q_list[0:-len_overlap])
  File "/home/danielle/virtual-envs/illumina-utils-v2.0.0/bin/iu-merge-pairs", line 666, in passes_minoche_Q30
    Q30 = len([True for _q in base_qualities[0:half_length] if _q > 30])
TypeError: slice indices must be integers or None or have an __index__ method

This is an easy fix (I fixed it on my local machine), by changing line 666 from
Q30 = len([True for _q in base_qualities[0:half_length] if _q > 30])
to
Q30 = len([True for _q in base_qualities[0:int(half_length)] if _q > 30])

Just wanted to let you know!
Thanks for this excellent package!!

Best,
Danielle
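
The error fits the Python 3 change in which `/` always returns a float, even for two integers. How `half_length` is computed in iu-merge-pairs is not shown in this issue, so the following only illustrates the general point, with made-up quality values: floor division (`//`) keeps the value an integer and makes the `int()` cast unnecessary.

```python
base_qualities = [35, 32, 28, 40, 31, 12]  # made-up PHRED scores

# In Python 3, len(...) / 2 is a float and cannot be used as a slice index.
try:
    base_qualities[0:len(base_qualities) / 2]
    float_slice_failed = False
except TypeError:
    float_slice_failed = True

# Floor division keeps the index an integer.
half_length = len(base_qualities) // 2
q30_count = len([q for q in base_qualities[0:half_length] if q > 30])
```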
