Comments (21)
Yes, I can implement this feature.
from fastp.
Haha, I've put this feature in fastp's roadmap.
from fastp.
I think it will be used a lot, because
- FLASH is on sourceforge and isn't able to be downloaded lately due to problems at sourceforge
- PEAR is no longer fully open source, you need a click through licence now
So there is a gap in the open source market for an overlapper tool
It would be amazing to have a tool that does adapters, quality AND stitching!
from fastp.
I'm also interested in this feature!
from fastp.
I promise to implement this in 3 days
from fastp.
As far as I understand, there are already open source tools that stitch overlapping paired-end reads, for example BBMerge from BBMAP.
It would be interesting to see how fastp compares against BBMerge.
from fastp.
fastp is "a tool designed to provide fast all-in-one preprocessing for FastQ files".
You'll need stitching support for that to be really true? :-)
from fastp.
Okay, I will implement it soon, probably in 1 week.
from fastp.
me too!
from fastp.
There's a lot of literature and existing tools for stitching together reads. It'd be nice to implement whichever is considered "the best", as in, the most accurate. Is there a review paper? Does the peanut gallery have any comments on which is perceived to be the best tool by the community?
from fastp.
My old blog post is a start, but there are probably newer tools now:
http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html
Please note that PEAR is no longer open source and should not be considered.
Heng Li also has one buried in fermi-kit somewhere too I think!
from fastp.
ABySS has abyss-mergepairs too. I have no idea how it compares to other tools.
from fastp.
--chastity discard unchaste reads [default]
--no-chastity do not discard unchaste reads
Our old nesoni toolkit had chastity and fidelity options too :)
Regards, the 🥜 gallery.
from fastp.
Random bit of trivia: ABySS discards unchaste reads when building the de Bruijn graph, but uses unchaste reads when mapping back to the assembly (if they map, may as well use them).
from fastp.
Was this implemented? Aligners can penalize unpaired reads, so is it possible that the overlap can be 'clipped' from the read with lower base quality (or randomly if tied)?
from fastp.
@tseemann
BBMerge (which is part of BBMAP) is stitching paired reads really well!
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmerge-guide/
Now even the STAR aligner is stitching together the overlapping reads before mapping them in order to get better alignments.
from fastp.
Hi guys, this function is implemented. Please give it a try and update this thread with your results.
merge paired-end reads
For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode, the output will be a single file.
In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many bases come from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15 means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.
Read pairs that cannot be merged successfully are both included in the output by default, but you can specify the --discard_unmerged option to discard the unmerged reads.
Like the base correction feature, this function is based on overlap detection, which has two adjustable parameters: overlap_len_require (default 30) and overlap_diff_limit (default 5).
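For anyone who wants to check merge results downstream, here is a minimal Python sketch that pulls the two counts out of a merged_xxx_yyy tag. The function name parse_merged_tag and the regular expression are illustrative assumptions based only on the example read name above; they are not part of fastp.

```python
import re
from typing import Optional, Tuple

# Assumed tag shape, inferred from the example in this thread: merged_<n1>_<n2>
_MERGED_TAG = re.compile(r"\bmerged_(\d+)_(\d+)\b")


def parse_merged_tag(read_name: str) -> Optional[Tuple[int, int]]:
    """Return (bases_from_read1, bases_from_read2), or None if no tag is present."""
    match = _MERGED_TAG.search(read_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15"
counts = parse_merged_tag(name)
if counts is not None:
    from_r1, from_r2 = counts
    # The merged read length is the sum of the two contributions.
    print(from_r1, from_r2, from_r1 + from_r2)
```

Unmerged reads carry no tag, so the function returns None for them; that makes it easy to count merged versus unmerged records in a single output file.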
from fastp.
Thank you, Shifu! A couple of questions.
Does it handle the case when the sequenced molecule is less than a read length? For example, with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template followed by 30 bp of adapter.
But you can specify the --chastity option to discard the unmerged reads.
Chastity refers to the Illumina chastity filter, which is a different thing: the :N: or :Y: in the FASTQ header comment. I'd suggest naming this option something like --only-merged.
from fastp.
@sjackman thanks for your reply.
Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.
Yes, it handles that case.
As you suggested, I renamed --chastity to --discard_unmerged.
Please try with the latest code.
from fastp.
Thanks, Shifu! I'll give it a spin.
from fastp.
Hi guys, this feature is revised and improved a lot in fastp v0.19.9 (will be released soon), see the update here:
merge paired-end reads
For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode:
- --merged_out should be given to specify the file to store merged reads, otherwise you should enable --stdout to stream the merged reads to STDOUT. The merged reads are also filtered.
- --out1 and --out2 will be the reads that cannot be merged successfully, but both pass all the filters.
- --unpaired1 will be the reads that cannot be merged, where read1 passes filters but read2 doesn't.
- --unpaired2 will be the reads that cannot be merged, where read2 passes filters but read1 doesn't.
- --include_unmerged can be enabled to redirect the reads of --out1, --out2, --unpaired1 and --unpaired2 to --merged_out, so you will get a single output file. This option is disabled by default.
--failed_out can still be given to store the reads (either merged or unmerged) that failed to pass the filters.
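To make the option list concrete, here is a small Python sketch that assembles a fastp argument list for merge mode without running it. The helper build_merge_command and all output file names are hypothetical; --in1 and --in2 are fastp's usual input options and are not discussed in this thread. Leaving out --out1, --out2, --unpaired1 and --unpaired2 when --include_unmerged is set is an assumption based on the description above, not a documented requirement.

```python
import shlex
from typing import List


def build_merge_command(in1: str, in2: str, prefix: str,
                        include_unmerged: bool = False) -> List[str]:
    """Build (but do not run) a fastp command line for merge mode."""
    cmd = ["fastp", "--in1", in1, "--in2", in2,
           "--merge", "--merged_out", f"{prefix}.merged.fq.gz"]
    if include_unmerged:
        # Everything that passes the filters goes to the --merged_out file.
        cmd.append("--include_unmerged")
    else:
        # Unmerged pairs and orphaned reads go to their own files.
        cmd += ["--out1", f"{prefix}.unmerged_R1.fq.gz",
                "--out2", f"{prefix}.unmerged_R2.fq.gz",
                "--unpaired1", f"{prefix}.unpaired_R1.fq.gz",
                "--unpaired2", f"{prefix}.unpaired_R2.fq.gz"]
    # Reads that fail the filters can still be kept separately.
    cmd += ["--failed_out", f"{prefix}.failed.fq.gz"]
    return cmd


print(shlex.join(build_merge_command("R1.fq.gz", "R2.fq.gz", "sample")))
```

The returned list can be passed to subprocess.run once the paths are real; printing it first is a cheap way to review which reads end up in which file.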
In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many bases come from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15 means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.
This function is also based on overlap detection, which has two adjustable parameters: overlap_len_require (default 30) and overlap_diff_limit (default 5).
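To illustrate what overlap-based merging means, here is a naive Python sketch. It is not fastp's actual implementation. It borrows the parameter names overlap_len_require and overlap_diff_limit from the text, and it assumes that overlap_diff_limit is the maximum number of mismatches allowed inside the overlap. It keeps the read1 bases in the overlap, as described above, and it ignores quality scores and the case where the template is shorter than the read.

```python
import random
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str,
               overlap_len_require: int = 30,
               overlap_diff_limit: int = 5) -> Optional[Tuple[str, int, int]]:
    """Return (merged, bases_from_read1, bases_from_read2), or None if no overlap."""
    rc2 = reverse_complement(read2)
    # Slide the start of rc2 along read1, trying the longest overlaps first.
    for offset in range(len(read1) - overlap_len_require + 1):
        overlap_len = min(len(read1) - offset, len(rc2))
        if overlap_len < overlap_len_require:
            continue
        mismatches = sum(a != b for a, b in
                         zip(read1[offset:offset + overlap_len], rc2[:overlap_len]))
        if mismatches <= overlap_diff_limit:
            tail = rc2[overlap_len:]  # bases that only read2 covers
            return read1 + tail, len(read1), len(tail)
    return None


# Simulated 160 bp fragment sequenced as 2x100 bp, so the reads overlap by 40 bp.
rng = random.Random(0)
fragment = "".join(rng.choice("ACGT") for _ in range(160))
read1 = fragment[:100]
read2 = reverse_complement(fragment[60:])

result = merge_pair(read1, read2)
if result is not None:
    merged, from_r1, from_r2 = result
    print(f"merged_{from_r1}_{from_r2}", merged == fragment)
```

Raising overlap_len_require makes spurious merges less likely at the cost of missing short overlaps, while raising overlap_diff_limit tolerates more sequencing errors in the overlap.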
from fastp.