Comments (21)

sfchen avatar sfchen commented on May 20, 2024 4

Yes, I can implement this feature.

from fastp.

sfchen avatar sfchen commented on May 20, 2024 3

Haha, I've put this feature in fastp's roadmap.

tseemann avatar tseemann commented on May 20, 2024 2

I think it will be used a lot, because

  1. FLASH is hosted on SourceForge and has not been downloadable lately because of problems at SourceForge
  2. PEAR is no longer fully open source; you need a click-through licence now

So there is a gap in the open source market for an overlapper tool

It would be amazing to have a tool that does adapters, quality AND stitching!

sjackman avatar sjackman commented on May 20, 2024 2

I'm also interested in this feature!

sfchen avatar sfchen commented on May 20, 2024 2

I promise to implement this in 3 days

ndaniel avatar ndaniel commented on May 20, 2024 1

@tseemann

As far as I understand, there are already open source tools out there that stitch overlapping paired-end reads, for example BBMerge from BBMAP.

It would be interesting to see how fastp compares against BBMerge here.

tseemann avatar tseemann commented on May 20, 2024 1

fastp is "a tool designed to provide fast all-in-one preprocessing for FastQ files".

You'll need stitching support for that to be really true? :-)

sfchen avatar sfchen commented on May 20, 2024 1

Okay, I will implement it soon, probably in 1 week.

oschwengers avatar oschwengers commented on May 20, 2024

me too!

sjackman avatar sjackman commented on May 20, 2024

There's a lot of literature and existing tools for stitching together reads. It'd be nice to implement whichever is considered "the best", as in, the most accurate. Is there a review paper? Does the peanut gallery have any comments on which is perceived to be the best tool by the community?

tseemann avatar tseemann commented on May 20, 2024

My old blog post is a start, but there are probably newer tools now:
http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html

Please note that PEAR is no longer open source and should not be considered.
Heng Li also has one buried in fermi-kit somewhere too I think!

sjackman avatar sjackman commented on May 20, 2024

ABySS has abyss-mergepairs too. I have no idea how it compares to other tools.

tseemann avatar tseemann commented on May 20, 2024

      --chastity          discard unchaste reads [default]
      --no-chastity       do not discard unchaste reads

Our old nesoni toolkit had chastity and fidelity options too :)

Regards, the 🥜 gallery.

sjackman avatar sjackman commented on May 20, 2024

Random bit of trivia: ABySS discards unchaste reads when building the de Bruijn graph, but uses unchaste reads when mapping back to the assembly. (If they map, may as well use them.)

brucemoran avatar brucemoran commented on May 20, 2024

Was this implemented? Aligners can penalize unpaired reads, so is it possible that the overlap can be 'clipped' from the read with lower base quality (or randomly if tied)?

ndaniel avatar ndaniel commented on May 20, 2024

@tseemann
BBMerge (which is part of BBMAP) stitches paired reads really well!
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmerge-guide/

Now even the STAR aligner is stitching together the overlapping reads before mapping them in order to get better alignments.

sfchen avatar sfchen commented on May 20, 2024

Hi guys, this function is now implemented. Please give it a try and report your results in this thread.

merge paired-end reads

For paired-end (PE) input, fastp supports stitching the reads when the -m/--merge option is specified. In this merging mode, the output will be a single file.

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example,
@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15
means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than those in read2.
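
For anyone post-processing the merged output, the tag can be read back with a small script. This is only a sketch: the function name parse_merged_tag is made up for illustration, and it assumes the tag appears in the read name exactly in the merged_xxx_yyy form described above.

```python
import re

# Tag format described above: "merged_<bases from read1>_<bases from read2>".
MERGED_TAG = re.compile(r"\bmerged_(\d+)_(\d+)\b")


def parse_merged_tag(read_name):
    """Return (bases_from_read1, bases_from_read2), or None if the name has no tag.

    parse_merged_tag is a hypothetical helper, not part of fastp.
    """
    match = MERGED_TAG.search(read_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15"
print(parse_merged_tag(name))
```

Reads that were not merged carry no such tag, so the helper returns None for them.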

Pairs of reads that cannot be merged successfully are both included in the output by default, but you can specify the --discard_unmerged option to discard the unmerged reads.

Like the base correction feature, this function is based on overlap detection, which has the adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
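
To make the idea of overlap-based merging concrete, here is a rough sketch. It is not fastp's actual implementation: the function names, the order in which offsets are tried, and the plain mismatch count are all assumptions made for illustration. Only the two parameter names and their defaults come from the description above.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (upper-case A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, overlap_len_require=30, overlap_diff_limit=5):
    """Illustrative overlap merge; returns (merged, n_from_read1, n_from_read2) or None.

    offset is the position of the reverse-complemented read2 relative to read1.
    A negative offset means the template is shorter than the reads, so both
    reads run into adapter sequence.
    """
    rc2 = reverse_complement(read2)
    len1, len2 = len(read1), len(rc2)
    # Assumption: try non-negative offsets first, then negative ones.
    offsets = list(range(0, len1 - overlap_len_require + 1))
    offsets += list(range(-1, -(len2 - overlap_len_require) - 1, -1))
    for offset in offsets:
        start = max(0, offset)
        end = min(len1, offset + len2)
        if end - start < overlap_len_require:
            continue
        # Count mismatching bases inside the overlapping region.
        mismatches = sum(
            a != b
            for a, b in zip(read1[start:end], rc2[start - offset:end - offset])
        )
        if mismatches <= overlap_diff_limit:
            # Keep read1 bases in the overlap; append only what read2 adds.
            tail = rc2[end - offset:]
            return read1[:end] + tail, end, len(tail)
    return None


# Toy example with short reads, so the thresholds are lowered accordingly.
template = "ACGTTGCAAGCTTCGGATCCATGA"
r1 = template[:16]
r2 = reverse_complement(template[8:])
print(merge_pair(r1, r2, overlap_len_require=8, overlap_diff_limit=0))
```

Taking the overlapping bases from read1 mirrors the stated preference for read1, and the two counts returned correspond to the xxx and yyy parts of the tag.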

sjackman avatar sjackman commented on May 20, 2024

Thank you, Shifu! A couple of questions.

Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

But you can specify the --chastity option to discard the unmerged reads.

Chastity refers to the Illumina chastity filter, which is a different thing: the :N: or :Y: in the FASTQ header comment. I'd suggest naming this option something like --only-merged.
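
To show where that flag lives, here is a small sketch that reads it from a header. It assumes the CASAVA 1.8+ comment layout (read number, filter flag, control number, index, separated by colons), where Y marks a read that failed the filter; the function name is made up for illustration.

```python
def failed_chastity_filter(header):
    """Return True if the header's comment field flags the read as filtered (Y).

    Hypothetical helper; assumes a comment like "1:N:0:GATCAG" after the first space.
    """
    fields = header.split()
    if len(fields) < 2:
        raise ValueError("header has no comment field")
    parts = fields[1].split(":")
    if len(parts) < 2:
        raise ValueError("comment field is not in read:filter:control:index form")
    return parts[1] == "Y"


header = "@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG"
print(failed_chastity_filter(header))
```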

sfchen avatar sfchen commented on May 20, 2024

@sjackman thanks for your reply.

Does it handle the case when the sequenced molecule is less than a read length? For example with 2x150 bp sequencing, a result of merged_120_0, when both the first and second read are 120 bp of template and then 30 bp of adapter.

Yes, it handles that case.

As you suggested, I renamed --chastity to --discard_unmerged.

Please try with the latest code.

sjackman avatar sjackman commented on May 20, 2024

Thanks, Shifu! I'll give it a spin.

sfchen avatar sfchen commented on May 20, 2024

Hi guys, this feature has been revised and improved a lot in fastp v0.19.9 (to be released soon). See the update here:

merge paired-end reads

For paired-end (PE) input, fastp supports stitching the reads when the -m/--merge option is specified. In this merging mode:

  • --merged_out should be given to specify the file to store merged reads; otherwise you should enable --stdout to stream the merged reads to STDOUT. The merged reads are also filtered.
  • --out1 and --out2 will receive the reads that cannot be merged successfully, but both pass all the filters.
  • --unpaired1 will receive the reads that cannot be merged, where read1 passes the filters but read2 doesn't.
  • --unpaired2 will receive the reads that cannot be merged, where read2 passes the filters but read1 doesn't.
  • --include_unmerged can be enabled to redirect the reads of --out1, --out2, --unpaired1 and --unpaired2 to --merged_out, so you will get a single output file. This option is disabled by default.

--failed_out can still be given to store the reads (either merged or unmerged) that failed to pass the filters.
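
As a rough illustration of how these options fit together, the sketch below assembles a fastp command line without running it. The helper name and all file names are hypothetical. --in1 and --in2 are fastp's input options, which are not mentioned in this thread; every other flag is taken from the list above.

```python
import shlex


def build_fastp_merge_command(in1, in2, merged_out, out1=None, out2=None,
                              unpaired1=None, unpaired2=None, failed_out=None,
                              include_unmerged=False):
    """Assemble (but do not run) a fastp command line for merge mode."""
    cmd = ["fastp", "--in1", in1, "--in2", in2, "--merge", "--merged_out", merged_out]
    optional = [
        ("--out1", out1),            # unmerged read1, both reads pass filters
        ("--out2", out2),            # unmerged read2, both reads pass filters
        ("--unpaired1", unpaired1),  # read1 passes filters, read2 does not
        ("--unpaired2", unpaired2),  # read2 passes filters, read1 does not
        ("--failed_out", failed_out),
    ]
    for flag, path in optional:
        if path is not None:
            cmd += [flag, path]
    if include_unmerged:
        cmd.append("--include_unmerged")
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_fastp_merge_command(
    "R1.fq.gz", "R2.fq.gz", "merged.fq.gz",
    out1="unmerged_R1.fq.gz", out2="unmerged_R2.fq.gz",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once fastp is installed and the input files exist.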

In the output file, a tag like merged_xxx_yyy will be added to each read name to indicate how many base pairs are from read1 and from read2, respectively. For example,
@NB551106:9:H5Y5GBGX2:1:22306:18653:13119 1:N:0:GATCAG merged_150_15
means that 150bp are from read1, and 15bp are from read2. fastp prefers the bases in read1 since they usually have higher quality than those in read2.

This function is also based on overlap detection, which has the adjustable parameters overlap_len_require (default 30) and overlap_diff_limit (default 5).
