When paired-end reads are downloaded from NCBI, the two mates arrive interleaved in a single .fastq.gz
file, but Salmon expects two separate files. @rob-p has written a bash script, which can be found here, that splits an interleaved fastq file into two streams and feeds those streams into salmon. An alternative method would use Python's gzip
library to split the gzipped file into two gzipped outputs without first gunzipping it to disk. It is not clear which approach will be faster, so some benchmarking is in order. The benchmarking should be done on an AWS instance to best match the environment in which the production code will run, because the speed of the disk can apparently affect Salmon's performance on .fastq
vs .fastq.gz
files.
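A minimal timing harness for that comparison might look like the sketch below. The `split_with_python` and `split_with_bash` names in the usage comment are placeholders for the two candidate implementations, not existing functions:

```python
import time

def benchmark(label, func, *args, repeats=3):
    """Time func(*args) over several runs and report the best wall-clock time."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        times.append(time.perf_counter() - start)
    best = min(times)
    print(f"{label}: {best:.3f} s (best of {repeats})")
    return best

# Hypothetical usage -- split_with_python and split_with_bash stand in for
# the two splitter implementations being compared:
# benchmark("python gzip split", split_with_python, "sra_data.fastq.gz")
# benchmark("bash stream split", split_with_bash, "sra_data.fastq.gz")
```

Taking the best of several repeats reduces noise from disk caching, which matters here since the comparison is explicitly about disk-bound behaviour.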
The alternative Python method is currently only at a proof-of-concept level and has not been committed to any repo, so it is included here:
```python
import gzip
import re
import sys

# fastq-dump-style headers look like "@SRR1234567.1.1" for read 1 and
# "@SRR1234567.1.2" for read 2, and the "+" separator line repeats the
# header, so each matching line is followed by exactly one payload line
# (the sequence or the quality string).
READ_1 = re.compile(r".*RR\d+\.\d+\.1")
READ_2 = re.compile(r".*RR\d+\.\d+\.2")

with gzip.open("sra_data.fastq.gz", "rt") as interleaved, \
        gzip.open("read_1.fastq.gz", "wt") as out_1, \
        gzip.open("read_2.fastq.gz", "wt") as out_2:
    for line in interleaved:
        if READ_1.match(line):
            out_1.write(line)
            out_1.write(interleaved.readline())
        elif READ_2.match(line):
            out_2.write(line)
            out_2.write(interleaved.readline())
        else:
            print(line)
            sys.exit("AAAAAAAHHHHHH NO MATCHES")
```
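If the input is known to strictly alternate (a read-1 record, then its read-2 mate, each exactly four lines), the regex matching could be dropped entirely. The following is a sketch of that simpler variant under that alternation assumption; `split_interleaved` is a hypothetical helper, not part of the POC above:

```python
import gzip
from itertools import islice

def split_interleaved(in_path, out1_path, out2_path):
    """Split an interleaved fastq.gz into two gzipped files, assuming
    records strictly alternate: read 1, read 2, read 1, read 2, ..."""
    with gzip.open(in_path, "rt") as interleaved, \
            gzip.open(out1_path, "wt") as out_1, \
            gzip.open(out2_path, "wt") as out_2:
        while True:
            # Each fastq record is four lines: header, sequence, "+", quality.
            record_1 = list(islice(interleaved, 4))
            if not record_1:
                break  # clean end of file
            record_2 = list(islice(interleaved, 4))
            if len(record_1) != 4 or len(record_2) != 4:
                raise ValueError("truncated record pair at end of file")
            out_1.writelines(record_1)
            out_2.writelines(record_2)
```

This version never inspects line content, so it would silently mis-split any file whose records do not alternate; the regex approach in the POC is safer when that assumption cannot be guaranteed.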