Comments (13)
from atropos.
FYI - the 'sra' branch has a new -sra option that streams reads directly from an SRA accession. This is based on my srastream library, which in turn requires the NCBI ngs-python library to be installed. This will be in Atropos 1.2 if not earlier.
from atropos.
Here are my commands (the $ variables are substituted for each sample).
For detection:
/usr/local/bin/atropos detect -se ${file} -d heuristic -O fasta -o ${sampleName}_adapters.fasta
For trimming:
/usr/local/bin/atropos trim --trim-n -se ${file} --known-adapters-file ${adapters} -o ${sampleName}"_trimmed.fastq.gz"
As known adapters I pass the adapters detected in the previous stage (I suspect that they should be combined with the known adapters).
Note: I use the latest master branch.
from atropos.
Ah, ok. I think there's a misunderstanding here. A known adapters file simply provides a database from which you can access adapters by name; it does not search all your sequences for all of the adapters in the file. That would make trimming take much longer, since there is a linear time increase with each additional adapter searched. I have some ideas about how to make this faster, but that will be in a later version. Exhaustive search would also waste a lot of time since the known adapter file is a mix of forward and reverse adapter sequences.
For right now, you need to provide the adapters to trim in one of three ways:
- By name: use the known adapters file or the default list of known adapters, and then add option "-a {adapter name}"
- By sequence: add option "-a {adapter sequence}"
- By file: add option "-a file://{adapter file}" to search all adapters in the fasta file
The option would be -b instead of -a for 5' adapters.
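The three forms above can be sketched as a small helper that builds the argument list for a trim run. This is only an illustration, not part of Atropos: `adapter_args` and `trim_command` are hypothetical names, the file names are placeholders, and the only options used are ones that already appear in this thread (`-se`, `-a`, `--known-adapters-file`, `-o`, and the `file://` prefix).

```python
from typing import List, Optional, Sequence, Tuple


def adapter_args(kind: str, value: str, flag: str = "-a") -> List[str]:
    """Build one adapter option; kind is "name", "sequence", or "file".

    The kind labels are hypothetical and exist only for this sketch.
    """
    if kind in ("name", "sequence"):
        # A name is looked up in the known adapters database;
        # a sequence is passed through as-is.
        return [flag, value]
    if kind == "file":
        # The file:// prefix requests a search for every adapter in the FASTA file.
        return [flag, "file://" + value]
    raise ValueError(f"unknown adapter kind: {kind!r}")


def trim_command(
    reads: str,
    output: str,
    adapters: Sequence[Tuple[str, str]],
    known_adapters_file: Optional[str] = None,
) -> List[str]:
    """Assemble an 'atropos trim' command line for single-end reads."""
    cmd = ["atropos", "trim", "-se", reads]
    if known_adapters_file:
        cmd += ["--known-adapters-file", known_adapters_file]
    for kind, value in adapters:
        cmd += adapter_args(kind, value)
    cmd += ["-o", output]
    return cmd


if __name__ == "__main__":
    # Placeholder file names: trim with every adapter found by the detect step.
    cmd = trim_command(
        "sample.fastq.gz",
        "sample_trimmed.fastq.gz",
        [("file", "sample_adapters.fasta")],
    )
    print(" ".join(cmd))
```

For the workflow in the question, the FASTA written by the detect step would be passed as the `file` form rather than through `--known-adapters-file`.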
Using the sra branch, I ran the following command:
atropos detect -sra SRR2040662 -d heuristic
I got back 6 sequences, all of which are at pretty low abundance (~0.1%). When adapters are this rare, sometimes the detect command can need more data, so I ran:
atropos detect -sra SRR2040662 --max-reads 50000 -d heuristic
Now Atropos picks up the most abundant adapter and matches it to a known sequence:
======= Input 1 =======
File: SRR2040662
Detected 1 adapters/contaminants:
1. Longest kmer: CTGGAGTTCAGACGTGTGCTCT
   Longest matching sequence: CTGGAGTTCAGACGTGTGCTCTTCCGATCTAATTTT
   Name(s): IlluminaMultiplexingAdapter1
   Known sequence(s): GATCGGAAGAGCACACGTCT
   Known sequence K-mers that match detected contaminant: 100.00%
   Abundance (full-length) in 50000 reads: 52 (0.1%)
   Detected contaminant kmers that match known sequence: 36.00%
   Number of k-mer matches: 5592
Through some sleuthing, I figured out that the true adapter sequence for this dataset is GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC.
If we align the reported known sequence, the longest matching sequence, the sequence detected by FASTQC, and the true adapter sequence, we get:
       GATCGGAAGAGCACACGTCT
AAAATTAGATCGGAAGAGCACACGTCTGAACTCCAG
          CGGAAGAGCACACGTCTGAACTCCAGTCACATCACG
       GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC
So these are all matching the same thing. The challenge is that the read lengths here (36bp) are shorter than the adapter size, so the full adapter is never observed. But FASTQC does do a better job of picking up more of the true adapter sequence, so I'm going to have to look at their code and see how I can improve the heuristic algorithm.
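The alignment above can be reproduced with a few lines of Python. Note that the longest matching sequence appears in the alignment as the reverse complement of what the detect report printed. The sketch below is an illustration only: it uses a simple ungapped sliding comparison (not whatever Atropos or FastQC do internally), and the function names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def best_offset(seq: str, ref: str) -> int:
    """Return the ungapped shift of seq relative to ref with the most matching bases.

    A negative shift means seq starts before the first base of ref.
    """
    best_matches, best_shift = -1, 0
    for shift in range(-len(seq) + 1, len(ref)):
        matches = sum(
            1
            for i, base in enumerate(seq)
            if 0 <= i + shift < len(ref) and ref[i + shift] == base
        )
        if matches > best_matches:
            best_matches, best_shift = matches, shift
    return best_shift


def aligned_rows(seqs, ref):
    """Left-pad each sequence so that all of them line up against ref."""
    offsets = [best_offset(s, ref) for s in seqs]
    leftmost = min(offsets)
    return [" " * (off - leftmost) + s for s, off in zip(seqs, offsets)]


# Sequences taken from this thread.
TRUE_ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTC"
KNOWN = "GATCGGAAGAGCACACGTCT"
LONGEST_MATCH = reverse_complement("CTGGAGTTCAGACGTGTGCTCTTCCGATCTAATTTT")
FASTQC_SEQ = "CGGAAGAGCACACGTCTGAACTCCAGTCACATCACG"

if __name__ == "__main__":
    for row in aligned_rows([KNOWN, LONGEST_MATCH, FASTQC_SEQ, TRUE_ADAPTER], TRUE_ADAPTER):
        print(row)
```

With these inputs, the longest matching sequence starts 7 bases before the true adapter and the FastQC sequence starts 3 bases into it, which is why neither 36 bp read fragment can cover the full adapter.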
The other issue is that this is a 5' adapter, and the detect command was optimized for 3' end adapter detection, so I need to do some work to generalize it for either case. It's pretty easy to detect 5' vs 3' by just looking at what end you find long runs of As.
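As a rough illustration of the "long runs of As" idea, the sketch below counts how many reads start or end with a poly-A run and reports which end dominates. It is not how Atropos works: the function names and the `min_run` threshold of 8 are assumptions made for this example, and deciding what the dominant end implies about adapter placement is left open.

```python
from typing import Iterable, Optional, Tuple


def a_run_lengths(read: str) -> Tuple[int, int]:
    """Return the lengths of the runs of As at the start and at the end of a read."""
    leading = len(read) - len(read.lstrip("A"))
    trailing = len(read) - len(read.rstrip("A"))
    return leading, trailing


def poly_a_end(reads: Iterable[str], min_run: int = 8) -> Optional[str]:
    """Report whether poly-A runs are more common at the "start" or "end" of reads.

    Returns None when neither end dominates. min_run is an arbitrary
    threshold chosen for this sketch.
    """
    at_start = at_end = 0
    for read in reads:
        leading, trailing = a_run_lengths(read.upper())
        if leading >= min_run:
            at_start += 1
        if trailing >= min_run:
            at_end += 1
    if at_start == at_end:
        return None
    return "start" if at_start > at_end else "end"


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    reads = ["ACGTACGT" + "A" * 10, "CCGTTGCA" + "A" * 9, "A" * 8 + "CGTACGTT"]
    print(poly_a_end(reads))
```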
Anyway, I then trimmed using the known sequence:
atropos trim -T 4 -sra SRR2040662 -b CTGGAGTTCAGACGTGTGCTCT -o test.fq
The resulting FastQC report: test_fastqc.zip.
Notice that the second-most over-represented sequence in the original has reduced by ~60%. You could improve this further with some more parameter tuning, for example by enforcing a minimum read length (-m {min read length}). But if I understand what you're trying to do -- automated adapter trimming -- realize that it's a hard problem and that results will vary dramatically from dataset to dataset no matter what trimming tool you use.
from atropos.
My plan is to close this issue and open others:
- Improve documentation on known adapters file
- Improve heuristic detection when read lengths are very short (< 75bp), perhaps by borrowing ideas from FastQC
- Better generalize adapter detection for 5' and 3' contaminants, perhaps by identifying the side(s) with long runs of As
from atropos.
The SRA streaming feature is included in v1.1.7.
from atropos.
> atropos trim -T 4 -sra SRR2040662 -b CTGGAGTTCAGACGTGTGCTCT -o test.fq
> The resulting FastQC report: test_fastqc.zip.
> Notice that the second-most over-represented sequence in the original has reduced by ~60%.
Nothing changed: all overrepresented sequences have the same counts as in the original. Maybe you mixed up the files when uploading the report?
UPD: Sorry, I did not notice that it cut part of the sequences. The problem is that the number of overrepresented sequences remained the same (it could not cut the TCCGATCT part).
from atropos.
> But if I understand what you're trying to do -- automated adapter trimming -- realize that it's a hard problem and that results will vary dramatically from dataset to dataset no matter what trimming tool you use.
We have to deal with a huge number of different SRA datasets in the lab, and writing down adapters and primers manually for each set of samples is definitely overkill for us. So I will be happy if it detects (and then removes) at least those adapters and primers that I see in FASTQC. It is a bit surprising that FASTQC, a more general tool, does a better job of detecting adapters/primers than Atropos, a tool specialized for removing adapters.
from atropos.
I understand the use case -- that's why I implemented the 'detect' command. Just understand that it was very recently developed and optimized for long, paired-end sequences. FASTQC has been around forever, and was created back when 36 bp reads were the norm, so I'm not surprised that it's better at handling those.
Atropos will get better with help from users like you. One thing people can do to help is submit complete bug reports. Instead of just saying "it doesn't work," also upload all the relevant information - inputs, expected outputs, actual outputs, stack traces, etc.
In the interim, for datasets where Atropos doesn't perform as well as FASTQC (single-end, shorter reads) you can write a script to extract the over-represented sequences from the FASTQC report, save them to FASTA, and feed those to atropos.
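A minimal sketch of such a script follows. It assumes the text report inside the FastQC zip (`fastqc_data.txt`) delimits the module with `>>Overrepresented sequences` and `>>END_MODULE` lines and stores tab-separated rows of sequence, count, percentage, and possible source; check this against your own reports. The function names and FASTA record names are hypothetical, and the sample rows are synthetic, not real results.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, int, str]  # (sequence, count, possible source)


def overrepresented_sequences(lines: Iterable[str]) -> List[Row]:
    """Collect rows of the 'Overrepresented sequences' module from fastqc_data.txt."""
    rows: List[Row] = []
    in_module = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            fields = line.split("\t")
            source = fields[3] if len(fields) > 3 else ""
            # Parse via float in case a report writes the count as e.g. "52.0".
            rows.append((fields[0], int(float(fields[1])), source))
    return rows


def to_fasta(rows: Iterable[Row]) -> str:
    """Format the sequences as FASTA records with generated names."""
    return "".join(
        f">overrepresented_{i} count={count}\n{seq}\n"
        for i, (seq, count, _source) in enumerate(rows, start=1)
    )


# Synthetic report fragment, for illustration only.
SAMPLE = (
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    ">>END_MODULE\n"
    ">>Overrepresented sequences\twarn\n"
    "#Sequence\tCount\tPercentage\tPossible Source\n"
    "CGGAAGAGCACACGTCTGAACTCCAGTCACATCACG\t1200\t0.6\tNo Hit\n"
    ">>END_MODULE\n"
)

if __name__ == "__main__":
    print(to_fasta(overrepresented_sequences(SAMPLE.splitlines())), end="")
```

The resulting FASTA file can then be handed to the trim command using the `file://` form of the adapter option.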
from atropos.
> that's why I implemented the 'detect' command
In the detect command, it would be good to distinguish adapters detected at the 3' and 5' ends in the FASTA output file, so that I do not have to work out which should go to atropos A and which to atropos B.
> you can write a script to extract the over-represented sequences from the FASTQC report,
For those, I do not know whether they are A or B. Should I pass them all to both A and B?
from atropos.
from atropos.
> you can write a script to extract the over-represented sequences from the FASTQC report,
> For those, I do not know whether they are A or B. Should I pass them all to both A and B?
First, I made a mistake earlier: when I said '-b' (which means to search for the adapter anywhere in the read), I meant '-g' (which means to search for it at the front of the read). For the temporary FASTQC solution, you can use the '-g' option.
from atropos.
I just added some new issues to cover the enhancements proposed here, so I'm closing this one.
from atropos.