Comments (13)

jdidion commented on August 27, 2024

from atropos.

jdidion commented on August 27, 2024

FYI - the 'sra' branch has a new -sra option that streams reads directly from an SRA accession. This is based on my srastream library, which in turn requires the NCBI ngs-python library to be installed. This will be in Atropos 1.2 if not earlier.

antonkulaga commented on August 27, 2024

Here are my commands (the ${...} variables are substituted per sample).
For detection:

    /usr/local/bin/atropos detect -se ${file} -d heuristic -O fasta -o ${sampleName}_adapters.fasta

For trimming:

    /usr/local/bin/atropos trim --trim-n -se ${file} --known-adapters-file ${adapters} -o ${sampleName}"_trimmed.fastq.gz"

As the known adapters file, I pass the adapters detected in the previous stage (I suspect that they should be combined with the known adapters).

Note: I am using the latest master branch.

jdidion commented on August 27, 2024

Ah, ok. I think there's a misunderstanding here. A known adapters file simply provides a database from which you can access adapters by name; it does not search all your sequences for all of the adapters in the file. That would make trimming take much longer, since there is a linear time increase with each additional adapter searched. I have some ideas about how to make this faster, but that will be in a later version. Exhaustive search would also waste a lot of time since the known adapter file is a mix of forward and reverse adapter sequences.

For right now, you need to provide the adapters to trim in one of three ways:

  • By name: use the known adapters file or the default list of known adapters, and then add option "-a {adapter name}"
  • By sequence: add option "-a {adapter sequence}"
  • By file: add option "-a file://{adapter file}" to search all adapters in the fasta file

The option would be -b instead of -a for 5' adapters.
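
A minimal sketch of the three forms as argument lists, for anyone scripting this. The helper name trim_args and the file names are hypothetical; only the flags (-se, -a, -o, --known-adapters-file, the file:// prefix) and the adapter name and sequence come from this thread. It only builds and prints the commands; it does not run Atropos.

```python
import shlex


def trim_args(reads, output, adapter, flag="-a", known_adapters_file=None):
    """Build an `atropos trim` argument list for single-end input.

    `adapter` may be an adapter name, a literal sequence, or a
    "file://..." reference, matching the three forms listed above.
    """
    args = ["atropos", "trim", "-se", reads]
    if known_adapters_file is not None:
        args += ["--known-adapters-file", known_adapters_file]
    args += [flag, adapter, "-o", output]
    return args


# Hypothetical file names, used only for illustration.
READS = "sample.fastq.gz"
OUT = "sample_trimmed.fastq.gz"

by_name = trim_args(READS, OUT, "IlluminaMultiplexingAdapter1",
                    known_adapters_file="known_adapters.fasta")
by_sequence = trim_args(READS, OUT, "GATCGGAAGAGCACACGTCT")
by_file = trim_args(READS, OUT, "file://sample_adapters.fasta")

for args in (by_name, by_sequence, by_file):
    print(" ".join(shlex.quote(a) for a in args))
```

Passing flag="-b" (or another adapter-type option) changes only the option placed in front of the adapter; the three ways of naming the adapter stay the same.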

Using the sra branch, I ran the following command:

atropos detect -sra SRR2040662 -d heuristic

I got back 6 sequences, all of which are at pretty low abundance (~0.1%). When adapters are this rare, the detect command sometimes needs more data, so I ran:

atropos detect -sra SRR2040662 --max-reads 50000 -d heuristic

Now Atropos picks up the most abundant adapter and matches it to a known sequence:

=======
Input 1
=======

File: SRR2040662
Detected 1 adapters/contaminants:
1. Longest kmer: CTGGAGTTCAGACGTGTGCTCT
   Longest matching sequence: CTGGAGTTCAGACGTGTGCTCTTCCGATCTAATTTT
   Name(s): IlluminaMultiplexingAdapter1
   Known sequence(s): GATCGGAAGAGCACACGTCT
   Known sequence K-mers that match detected contaminant: 100.00%
   Abundance (full-length) in 50000 reads: 52 (0.1%)
   Detected contaminant kmers that match known sequence: 36.00%
   Number of k-mer matches: 5592

Through some sleuthing, I figured out that the true adapter sequence for this dataset is GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC.

If we align the reported known sequence, the longest matching sequence, the sequence detected by FASTQC, and the true adapter sequence, we get:

       GATCGGAAGAGCACACGTCT
AAAATTAGATCGGAAGAGCACACGTCTGAACTCCAG
          CGGAAGAGCACACGTCTGAACTCCAGTCACATCACG
       GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC
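
Note that the second row of the alignment is the reverse complement of the reported longest matching sequence. The following sketch reproduces the four rows from the sequences quoted in this thread; the variable names are illustrative, and the column arithmetic simply anchors each row on where it overlaps the true adapter.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Sequences copied from the detect output and the discussion above.
known = "GATCGGAAGAGCACACGTCT"
longest_match = "CTGGAGTTCAGACGTGTGCTCTTCCGATCTAATTTT"
fastqc_seq = "CGGAAGAGCACACGTCTGAACTCCAGTCACATCACG"
true_adapter = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC"

# Orient the detected sequence the same way as the known adapter.
oriented = reverse_complement(longest_match)

# Column of each row, relative to the start of the oriented read.
known_col = oriented.find(known)
true_col = known_col - true_adapter.find(known)
fastqc_col = true_col + true_adapter.find(fastqc_seq)

rows = [
    (known_col, known),
    (0, oriented),
    (fastqc_col, fastqc_seq),
    (true_col, true_adapter),
]
for col, seq in rows:
    print(" " * col + seq)
```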

So these are all matching the same thing. The challenge is that the read lengths here (36bp) are shorter than the adapter size, so the full adapter is never observed. But FASTQC does do a better job of picking up more of the true adapter sequence, so I'm going to have to look at their code and see how I can improve the heuristic algorithm.

The other issue is that this is a 5' adapter, and the detect command was optimized for 3' adapter detection, so I need to do some work to generalize it for either case. It's pretty easy to tell 5' from 3' by just looking at which end has long runs of As.
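
As a rough illustration of that idea only (this is not how Atropos implements detection), one could tally how many reads begin or end with a long run of As. The function names and the min_run threshold of 8 are assumptions chosen for the sketch.

```python
from collections import Counter


def a_run_lengths(read):
    """Return (leading, trailing) lengths of the runs of 'A' in a read."""
    leading = len(read) - len(read.lstrip("A"))
    trailing = len(read) - len(read.rstrip("A"))
    return leading, trailing


def tally_poly_a_ends(reads, min_run=8):
    """Count reads with an A-run of at least `min_run` at each end.

    `min_run` is an arbitrary, hypothetical threshold.
    """
    counts = Counter()
    for read in reads:
        leading, trailing = a_run_lengths(read.upper())
        if leading >= min_run:
            counts["5prime"] += 1
        if trailing >= min_run:
            counts["3prime"] += 1
    return counts


# Toy reads, made up for the example.
reads = ["AAAAAAAAAACGTACGT", "CGTACGTAAAAAAAAAA", "ACGTACGT"]
print(tally_poly_a_ends(reads))
```

Whichever end dominates the tally across a sample of reads would then hint at which side the contaminant sits on.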

Anyway, I then trimmed using the known sequence:

atropos trim -T 4 -sra SRR2040662 -b CTGGAGTTCAGACGTGTGCTCT -o test.fq

The resulting FastQC report: test_fastqc.zip.

Notice that the second-most over-represented sequence in the original has been reduced by ~60%. You could improve this further with some more parameter tuning, for example by enforcing a minimum read length (-m {min read length}). But if I understand what you're trying to do (automated adapter trimming), realize that it's a hard problem and that results will vary dramatically from dataset to dataset no matter what trimming tool you use.

jdidion commented on August 27, 2024

My plan is to close this issue and open others:

  • Improve documentation on known adapters file
  • Improve heuristic detection when read lengths are very short (< 75bp), perhaps by borrowing ideas from FastQC
  • Better generalize adapter detection for 5' and 3' contaminants, perhaps by identifying the side(s) with long runs of As

jdidion commented on August 27, 2024

The SRA streaming feature is included in v1.1.7.

antonkulaga commented on August 27, 2024

> atropos trim -T 4 -sra SRR2040662 -b CTGGAGTTCAGACGTGTGCTCT -o test.fq
> The resulting FastQC report: test_fastqc.zip.
> Notice that the second-most over-represented sequence in the original has been reduced by ~60%.

Nothing changed: all overrepresented sequences have the same counts as in the original. Maybe you mixed up the files when uploading the report?

UPD: Sorry, I did not notice that it cut part of the sequences. The problem is that the number of overrepresented sequences remained the same (it could not cut the TCCGATCT part).

antonkulaga commented on August 27, 2024

> But if I understand what you're trying to do -- automated adapter trimming -- realize that it's a hard problem and that results will vary dramatically from dataset to dataset no matter what trimming tool you use.

We have to deal with a huge number of different SRA datasets in the lab, and writing down adapters and primers manually for each set of samples is definitely overkill for us. That means I will be happy if it detects (and then cleans) at least those adapters and primers that I see in FASTQC. It is a bit surprising that FASTQC, a more general tool, does a better job of detecting adapters/primers than atropos, a tool specialized for removing adapters.

jdidion commented on August 27, 2024

I understand the use case -- that's why I implemented the 'detect' command. Just understand that it was very recently developed and optimized for long, paired-end sequences. FASTQC has been around forever, and was created back when 36 bp reads were the norm, so I'm not surprised that it's better at handling those.

Atropos will get better with help from users like you. One thing people can do to help is submit complete bug reports. Instead of just saying "it doesn't work," also upload all the relevant information - inputs, expected outputs, actual outputs, stack traces, etc.

In the interim, for datasets where Atropos doesn't perform as well as FASTQC (single-end, shorter reads) you can write a script to extract the over-represented sequences from the FASTQC report, save them to FASTA, and feed those to atropos.
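
A minimal sketch of such a script, assuming the text report (fastqc_data.txt inside the FastQC zip) marks the module with a ">>Overrepresented sequences" line, ends it with ">>END_MODULE", and lists tab-separated rows with the sequence in the first column and its count in the second. The function names, the FASTA record names, and the sample excerpt (including its count) are made up for illustration.

```python
def overrepresented_sequences(fastqc_data_text):
    """Return (sequence, count) pairs from the Overrepresented sequences module."""
    records = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            fields = line.split("\t")
            # Keep the count as text; it is only used to label the record.
            records.append((fields[0], fields[1]))
    return records


def to_fasta(records):
    """Format (sequence, count) pairs as FASTA text."""
    return "".join(
        f">overrepresented_{i} count={count}\n{seq}\n"
        for i, (seq, count) in enumerate(records, start=1)
    )


# Made-up excerpt in the assumed fastqc_data.txt layout.
sample = (
    ">>Per base sequence quality\tpass\n"
    ">>END_MODULE\n"
    ">>Overrepresented sequences\twarn\n"
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "CGGAAGAGCACACGTCTGAACTCCAGTCACATCACG\t1234\t0.5\tNo Hit\n"
    ">>END_MODULE\n"
)
print(to_fasta(overrepresented_sequences(sample)), end="")
```

The resulting FASTA can then be supplied to atropos using the file:// form described earlier in this thread.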

antonkulaga commented on August 27, 2024

> that's why I implemented the 'detect' command

In the detect command, it would be good to differentiate adapters detected at the 3' and 5' ends in the FASTA output file, so that I do not have to work out which ones should go to atropos -a and which to atropos -b.

> you can write a script to extract the over-represented sequences from the FASTQC report,

For those, I do not know whether they are -a or -b adapters. Should I pass them to both -a and -b?

jdidion commented on August 27, 2024

jdidion commented on August 27, 2024

> you can write a script to extract the over-represented sequences from the FASTQC report,
>
> For those, I do not know whether they are -a or -b adapters. Should I pass them to both -a and -b?

First, I made a mistake earlier: when I said '-b' (which searches for the adapter anywhere in the read), I meant '-g' (which searches for it at the front of the read). For the temporary FASTQC solution, you can use the '-g' option.

jdidion commented on August 27, 2024

I just added some new issues to cover the enhancements proposed here, so I'm closing this one.
