When the DNA library is overly short, is it possible that most reads overlap?


sfchen commented on May 20, 2024 4

Yes, I can implement this feature.

from fastp.

sfchen commented on May 20, 2024 3

Haha, I've put this feature in fastp's roadmap.


tseemann commented on May 20, 2024 2

I think it will be used a lot, because:

FLASH is on SourceForge and hasn't been downloadable lately due to problems at SourceForge.
PEAR is no longer fully open source; you need a click-through licence now.

So there is a gap in the open source market for an overlapper tool.

It would be amazing to have a tool that does adapters, quality AND stitching!


sjackman commented on May 20, 2024 2

I'm also interested in this feature!


sfchen commented on May 20, 2024 2

I promise to implement this in 3 days


ndaniel commented on May 20, 2024 1

@tseemann

As far as I understand, there are already open source tools that stitch overlapping paired-end reads, for example BBMerge from BBMAP.

It would be interesting to see how fastp compares against BBMerge.


tseemann commented on May 20, 2024 1

fastp is "a tool designed to provide fast all-in-one preprocessing for FastQ files".

You'll need stitching support for that to be really true? :-)


sfchen commented on May 20, 2024 1

Okay, I will implement it soon, probably in 1 week.


oschwengers commented on May 20, 2024

me too!


sjackman commented on May 20, 2024

There's a lot of literature and existing tools for stitching together reads. It'd be nice to implement whichever is considered "the best", as in, the most accurate. Is there a review paper? Does the peanut gallery have any comments on which is perceived to be the best tool by the community?


tseemann commented on May 20, 2024

My old blog post is a start, but there are probably newer tools now:
http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html

Please note that PEAR is no longer open source and should not be considered.
Heng Li also has one buried in fermi-kit somewhere too I think!


sjackman commented on May 20, 2024

ABySS has abyss-mergepairs too. I have no idea how it compares to other tools.


tseemann commented on May 20, 2024

      --chastity          discard unchaste reads [default]
      --no-chastity       do not discard unchaste reads

Our old nesoni toolkit had chastity and fidelity options too :)

Regards, the 🥜 gallery.


sjackman commented on May 20, 2024

Random bit of trivia: ABySS discards unchaste reads when building the de Bruijn graph, but uses unchaste reads when mapping back to the assembly (if they map, may as well use them).


brucemoran commented on May 20, 2024

Was this implemented? Aligners can penalize unpaired reads, so is it possible that the overlap can be 'clipped' from the read with lower base quality (or randomly if tied)?


ndaniel commented on May 20, 2024

@tseemann
BBMerge (which is part of BBMAP) stitches paired reads really well!
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmerge-guide/

Now even the STAR aligner is stitching together the overlapping reads before mapping them in order to get better alignments.


sfchen commented on May 20, 2024

Hi guys, this function is now implemented. Please give it a try and update this thread with your results.

merge paired-end reads

For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode, the output will be a single file.

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs come from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15
means that 150 bp are from read1 and 15 bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.
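For anyone who wants to check the merge results downstream, here is a minimal sketch of how the tag could be parsed from a read name. The function name `parse_merged_tag` and the regular expression are my own illustration, not part of fastp; the sketch only assumes the tag sits at the end of the read name, as in the example above.

```python
import re

# Tag format described above: merged_<bases from read1>_<bases from read2>.
# Assumption: the tag is the last whitespace-separated field of the read name.
MERGED_TAG = re.compile(r"\bmerged_(\d+)_(\d+)\s*$")


def parse_merged_tag(read_name):
    """Return (bases_from_read1, bases_from_read2), or None if there is no tag."""
    match = MERGED_TAG.search(read_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15"
print(parse_merged_tag(name))  # (150, 15)
```

Summing the two numbers gives the length of the merged read (165 bp in this example), which is handy for plotting an insert size distribution.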

Pairs of reads that cannot be merged successfully are both included in the output by default, but you can specify the --discard_unmerged option to discard the unmerged reads.

Like the base correction feature, this function is based on overlap detection, which has adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
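To make the two parameters concrete, here is a rough sketch of overlap-based merging. It is not fastp's actual implementation, only an illustration under my own assumptions: both reads have the same length, read2 is reverse-complemented and slid along read1, the first placement with at least `overlap_len_require` overlapping bases and at most `overlap_diff_limit` mismatches is accepted, and read1 bases are kept in the overlap. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def count_mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def find_overlap(read1, read2, overlap_len_require=30, overlap_diff_limit=5):
    """Return (offset, overlap_len), or None if no acceptable overlap exists.

    offset >= 0: reverse-complemented read2 starts `offset` bases into read1.
    offset < 0: it starts before read1, i.e. the template is shorter than the read.
    """
    r2rc = reverse_complement(read2)
    # Template at least as long as read1: slide read2 rightwards along read1.
    for offset in range(0, len(read1) - overlap_len_require + 1):
        n = min(len(read1) - offset, len(r2rc))
        if n < overlap_len_require:
            continue
        if count_mismatches(read1[offset:offset + n], r2rc[:n]) <= overlap_diff_limit:
            return offset, n
    # Template shorter than the read: read2 overhangs the start of read1.
    for shift in range(1, len(r2rc) - overlap_len_require + 1):
        n = min(len(read1), len(r2rc) - shift)
        if n < overlap_len_require:
            continue
        if count_mismatches(read1[:n], r2rc[shift:shift + n]) <= overlap_diff_limit:
            return -shift, n
    return None


def merge_pair(read1, read2, **params):
    """Return (merged_sequence, tag), or None if the pair cannot be merged."""
    hit = find_overlap(read1, read2, **params)
    if hit is None:
        return None
    offset, n = hit
    if offset < 0:
        # Everything past the overlap is read-through, so keep only the overlap.
        return read1[:n], f"merged_{n}_0"
    # Keep all of read1 and append only the part of read2 that extends past it.
    tail = reverse_complement(read2)[len(read1) - offset:]
    return read1 + tail, f"merged_{len(read1)}_{len(tail)}"
```

Under these assumptions, a 165 bp template sequenced as 2x150 bp would come out tagged merged_150_15. A real tool would also need to consider base qualities and guard against spurious overlaps in low-complexity sequence.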


sjackman commented on May 20, 2024

Thank you, Shifu! A couple of questions.

Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

But you can specify the --chastity option to discard the unmerged reads.

Chastity refers to the Illumina chastity filter, which is a different thing: the :N: or :Y: in the FASTQ header comment. I'd suggest naming this option something like --only-merged.


sfchen commented on May 20, 2024

@sjackman thanks for your reply.

Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

Yes, it handles that case.

As you suggested, I renamed --chastity to --discard_unmerged.

Please try with the latest code.


sjackman commented on May 20, 2024

Thanks, Shifu! I'll give it a spin.


sfchen commented on May 20, 2024

Hi guys, this feature has been revised and improved a lot in fastp v0.19.9 (to be released soon); see the update here:

merge paired-end reads

For paired-end (PE) input, fastp supports stitching them by specifying the -m/--merge option. In this merging mode:

--merged_out should be given to specify the file to store merged reads; otherwise you should enable --stdout to stream the merged reads to STDOUT. The merged reads are also filtered.
--out1 and --out2 will be the reads that cannot be merged successfully, but both pass all the filters.
--unpaired1 will be the reads that cannot be merged, where read1 passes filters but read2 doesn't.
--unpaired2 will be the reads that cannot be merged, where read2 passes filters but read1 doesn't.
--include_unmerged can be enabled to redirect the reads of --out1, --out2, --unpaired1 and --unpaired2 to --merged_out, so you will get a single output file. This option is disabled by default.

--failed_out can still be given to store the reads (either merged or unmerged) that fail to pass the filters.

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs come from read1 and from read2, respectively. For example, @NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15
means that 150 bp are from read1 and 15 bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than read2.

This function is also based on overlap detection, which has adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
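As a usage sketch, the options above could be assembled into a command line from a script like this. The helper name `build_fastp_merge_command` and all file names are hypothetical; `--in1`/`--in2` are fastp's regular input options rather than something described in this thread. The helper only builds the argument list and does not run fastp.

```python
import shlex


def build_fastp_merge_command(in1, in2, merged_out, out1=None, out2=None,
                              unpaired1=None, unpaired2=None, failed_out=None,
                              include_unmerged=False,
                              overlap_len_require=30, overlap_diff_limit=5):
    """Build an argument list for running fastp in merge mode."""
    cmd = ["fastp", "--in1", in1, "--in2", in2,
           "--merge", "--merged_out", merged_out]
    # Optional outputs for pairs that could not be merged or failed the filters.
    optional_outputs = [("--out1", out1), ("--out2", out2),
                        ("--unpaired1", unpaired1), ("--unpaired2", unpaired2),
                        ("--failed_out", failed_out)]
    for flag, path in optional_outputs:
        if path is not None:
            cmd += [flag, path]
    if include_unmerged:
        # Redirect unmerged reads into the merged output as well.
        cmd.append("--include_unmerged")
    cmd += ["--overlap_len_require", str(overlap_len_require),
            "--overlap_diff_limit", str(overlap_diff_limit)]
    return cmd


cmd = build_fastp_merge_command("R1.fq.gz", "R2.fq.gz", "merged.fq.gz",
                                out1="unmerged_R1.fq.gz",
                                out2="unmerged_R2.fq.gz")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once fastp is installed; building it as a list avoids shell quoting problems with unusual file names.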


Stitch together overlapping reads? (fastp issue, open, 21 comments)
