magicdgs / readtools Goto Github PK
View Code? Open in Web Editor NEWA Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms
Home Page: https://magicdgs.github.io/ReadTools/
License: MIT License
A Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms
Home Page: https://magicdgs.github.io/ReadTools/
License: MIT License
The algorithm to perform trimming is a single class with everything contained there. A better approach is use the lambda support from Java8 to extract different components of each trimming there (by quality, by Ns, etc). In addition, we can use the GATK4 framework to implement a plugin in the command line which takes parameters for each trimming algorithm if requested.
First draft:
ReadTransformer
-like class, which transform the read to the trim one.GATKRead
implementation that is called GATKTrimmedRead
which contains the original read and the coordinates for trim.TrimmingFunction
as a ReadTransformer
to return a read with updated tags.ApplyTrimResultReadTransfomer
for hard-clip the read based on the tags.TrimmerPluginDescriptor
as a plugin system to provide trimmers to the command line.ReadFilter
): first trimmers (TrimmingFunction
), then apply the trim points (ApplyTrimResultReadTransformer
), and finaly filters (ReadFilter
). This include the accumulation of new metrics classes while processing.TrimReads
tool, which includes a plugin to provide the trimmer(s) and a plugin to provide the read filter(s).There should be a TrimmingMetric
extending MetricBase
, containing the following fields:
Similarly, a FilteringMetric
extending MetricBase
should be implemented containing FILTER (name), TOTAL_SINGLE/TOTAL_FIRST/TOTAL_SECOND (reads reaching the filter) and PASSED_SINGLE/PASSED_FIRST/PASSED_SECOND (reads that reach the filter and pass).
I used ReadTools v.0.2.2.r_4831abfbac to add read groups to a BAM-file and got the following error message:
SAM validation error: ERROR: Read name E00300:175:H3LYTALXX:6:1103:27813:5546#ACAGTGAT, No real operator (M|I|D|N) in CIGAR
This is the respective read pair:
E00300:175:H3LYTALXX:6:1103:27813:5546#ACAGTGAT 99 2L 1 70 150S = 1 129 TTTTTTCGGTTCTTGACTGGGGCGGAGCGTGAACGCTCAGCTCTGTTTTCTCGGGAAAAAATACAAGCGCGAACTCCGCGTCGTGTTAAAGACATATATATTTTTTTAATCTCATTGATCTTTTTTTGCTGTTCTTTGCTTGTGGTTCTT AAFFFKKKKAKKKKKKFKKKKKKKKAKKKFKKKKKKAF7KKKKKFFKKKKFKKKKKKKFKKKKAKKKKKKK<AKKKKKFFKKKKKKKKAKKKKFAKKKKKKKKKKKKFAA<KK7FKKKKKKKKKAKKKKKKAFKKKKKKKFKKFF7<FKA AM:i:8 AS:i:256 MD:Z:114 NM:i:0 PG:Z:novoalign.I PQ:i:438 SM:i:8 UQ:i:256
E00300:175:H3LYTALXX:6:1103:27813:5546#ACAGTGAT 147 2L 1 70 21S129M = 1 -129 ACTGGGGCGGAGCGTGAACGCTCAGCTCTGTTTTCTCGGGAAAAAATACAAGCGCGAACTCCGCGTCGTGTTAAAGACATATATATTTTTTTAATCTCATTGATCTTTTTTTGCTGTTCTTTGCTTGTGGTTCTTTTAGGAATTAATTGG <KFAKKKKKKKKKKKKKKKKKKKKKKKKKF<KKKKKKKKFKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKFKFKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA MD:Z:129 PG:Z:novoalign.I AM:i:8 NM:i:0 SM:i:70 PQ:i:438 UQ:i:166 AS:i:166
Thanks a lot for looking into this, Daniel.
Cheers,
Thomas
Using a similar command line handling as Picard will be useful for change parameters, documentation of the tools and adding new parameters (now it is a nightmare).
Thus, either develop a new command line interface API (based on annotations) or port our tools the Picard/GATK4 framework.
The --non-standardize-output
option allows the user to output a file in the same format as the previous file. Because ReadTools is a toolkit for standardize, this does not make any more sense. I implemented this option due to the heuristic approach to guess the format in the input file, to allow the user to maintain an incorrectly guessed Illumina format.
Nevertheless, I think that it will be much more clear to create an option for the input file format, check if the guessed format agrees and if not log a warning instead and proceed with the input format provided by the user. That will really convert ReadTools in a set of tools which consistent output format (Sanger/Standard).
Probably it will be better for the final user to know that running the tool with -h or --help provides the full information for the tool.
When working with BAM files, it is clear that the barcode is contained in a tag that it is named usually BC
(and B2
for a second index), but this tags could be changed in the command line or using a java variable.
On the other hand, when working with FASTQ files the things are much more complicated (see #48). We should use patterns to recognize the format of the FATSQ barcode encoding automatically. It should:
@readName#ACTG
)@readName#ACTG_ACGT
), where the defult separation is a "_"([ACGT]+)(_??)([ACGT]+)
, or with a different separatorA simpler way is to create a command line parameter as suggested in #48, but once it is change to the GATK4 framework it could be a plugin to specify an method (example, DefaultReadNameBarcodeEncoding
, SeparatorReadNameBarcodeEncoding
with an option for the separator between read name and barcode, BestGuessReadNameBarcodeEncoding
with the pattern strategy). This will give to the user a finer control of what is happening in the data, but increase the command line complexity. On the other hand, this allows to include other plugins not related only with FASTQ to detect the barcode in the BAM tags and/or the sequence itself.
Due to the fix made for #24, now there is an option for allow higher qualities. QualityChecker
should help to the user to decide if the new option is needed after detecting the format, checking all the qualities and throwing a warning if the qualities are higher than expected.
Although it will break the compatibility with previous versions (although they are pre-releases), I would like to output the barcodes in the read name when several barcodes with a separator between them (like that, they could be retrieved separately to test other approaches.
Some helpers are already implemented on f6803eb
In release 0.1.4 no barcode is detected (all reads are in the discarded file) running in single-end mode (-s).
Implementation in commit 73b1f62 does not work properly in real data.
Implement unit tests for the trimming algorithm and found the bug.
When Sanger-like enconding (Phred+33) comes from Illumina 1.8+, it allows the value 41(J). Nevertheless, my checker is using 40 as the highest value for this Phred score.
Change in the next release.
HTSJDK have its own TrimmingUtils
with one simple method to perform trimming.
Extracting the Mott algorithm from the TrimmingMethods
will be important for support other kind of trimming once #25 is done, because it will use the same quality enconding.
In addition, trimming Ns could be ported to HTSJDK TrimmingUtils
to keep our code cleaner, and at some point even the Mott algorithm could be ported.
Check the loggers for all the tools. The problem was detected on BasrcodedBamToFastq.
Now I'm using Codacy in my projects for automatic review and check code quality. I want to increase the code quality of ReadTools by using the rules implemented there, so this issue requires:
Goal: 50% (it is enough for legacy code with integration tests)
Note: set thresholds in the build (in the CI and maybe with gradle)
HTSJDK will include a RawRead
interface for FastqRecord
and SAMRecord
with defined contract for quality encoding and so on: samtools/htsjdk#572
This will allow to simplify some code and have an unique quality management. Things which can be done when this PR is accepted:
SAMRecordUtils
and FastqRecordUtils
regarding quality and barcodes in name.SolexaQualityConverter
FastqWriter
directly with SAMRecord
s when converting from tagged BAM to FASTQ.StandardizerAndChecker
Found bug in trimming algorithm for master. Changed in develop commit e764c91
Although the method is working, the output files are empty.
I would like to migrate the org.vetmeduni
package to org.magicdgs
, and apply the coding style that I'm defining in the styleguide repo.
TrimFastq for single end is trying to write a null object.
Minor bug in lines 173-174:
SAMRecordUtils.addBarcodeToName(record1, String.join("", barcodes));
SAMRecordUtils.addBarcodeToName(record2, String.join("", barcodes));
Should be:
SAMRecordUtils.addBarcodeToName(record1, BarcodeMethods.joinBarcodes(barcodes));
SAMRecordUtils.addBarcodeToName(record2, BarcodeMethods.joinBarcodes(barcodes));
Bug reported on line 86 and 121 in BarcodedBamToFastq.java and line 220 in BarcodeMethods.
When the output is a folder (or several folders) that does not exist, try to create it instead of throwing an error. Another enhancement will be to check each time that an output is created if it exists (to avoid overwriting files).
For paired-end, the returned SplitFastqWriter is for single end. Check the factory for writers
Only 1 million reads are checked for the quality, but the SangerReaders or the normal ones assume that the encoding is the same for the rest of the file. To make sure that there is not a mixture several options could be implemented:
Anyway, if a mixture is found, an error should be thrown.
It could be worth it to implement a new tool to combine FASTQ/BAM files converting qualities if the files are different.
Coverage should increase in beta to at least 80%. We should have a unit test for each component if possible, not only integration tests for all the tools.
Note: thresholds in codecov.yml should be changed.
Issues:
Improvements:
It would be nice to compute a new metric for the trim algorithm and/or analysis of the raw datasets: the raw coverage. The raw coverage is the total number of bases contained in the reads divided by the genome length. We can allow the computation of this by an input of the FASTA genome or the genome length as parameter, or provide the number of bases present in the raw reads vs. trim reads to see how much we lose.
quality 42 ('K') occurs.
https://www.biostars.org/p/179149/
Illumina info about their quality scores:
http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf
Although it is not still out, next version of htsjdk will have support for normalizing read names (see this PR), other useful stuff for FastqRecords (serializable and so on, see this PR), and asynchronious stuff which should be take into account (see this PR).
Thus, updating the dependency to this version need modification in the source code.
For any tool that outputs FASTQ/BAM files the quality should of the final files should be Sanger. For that purpose, add a quality checker in any program and a FastqReader that returns the record already converted.
Implement a maximum length filtering
Running in the same dataset (although not the same system) TrimFastq version v.0.1.4.r_3293bb31 took around 30s per 1M reads; on the other hand, version v.0.1.4.r_3293bb31 is taking 300s.
I should test it properly and check that is not a system issue.
As part of the migration to GATK4 (#23) and gradle, some other stuff should be done:
Here is tracked what it is already done in the port_to_gatk branch. Specific issues for each part are opened to track what is already done in the develop branch.
Some FATSQ files have the barcode in the name not in the standard format (@readName#ATCG
) but including spaces. An example record for this is the following:
@ST-E00276:193:H5M5TALXX:3:1101:8420:1625 1:N:0:NAACAAAA
NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+
#AAFFFKKKKFFKKKKKKKKKKKKFFKKKKKKFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKAKKKKKKFFKKKKKKKKKKKKKFKKKKKKKKKKFK
The best temporary solution for the last pre-release is to provide a command line optional parameter with the default separator ("#"). Thus, to handle the previous record the user should provide the following: --read-name-barcode-separator " 1:N:0:"
.
Because this is changing to often, we should provide a better way of handling this for later realeases.
Currently single-end mode in TaggedBamToFastq
running on the same data as FastqBarcodeDetector
outputs different markers for the files ("/1" , "/2", "/0").
This should be normalize somehow. Probably this will be much easier once #36 is done.
Creates a tool to upload FASTQ in the Distmap format to HDFS.
We could benefit from downloading and formatting SRA accessions provided as input, to normalize also public available data. Exploring the SRA handling from htsjdk will help to improve it.
Implement the tools contained in MapStats as quality control tools for mapped reads. A nice improvement with respect the current implementation should be:
A tab-delimited reader could be easily implemented instead of using the dependency.
In TaggedBamToFastq
users have problems with the meaning of the --tag
option, so it will be better to provide defaults ("BC" and "B2"), because it will be easier. Nevertheless, it should change the logic of loading the barcodes if the user does not provide any tag.
In addition, the error is not meaningful:
Each sample should have the same number of barcodes. Check the formatting of barcodes.txt
This was reported in version v.0.1.4.r_3293bb31
Default should be SILENT
, although this could change in future releases.
The barcode dictionary maps currently sample names (String
) to sequences, but it will be more interesting to map SAMReadGroupRecord
for a more general implementation. Like that, it will make easier the implementation of the barcode detector for a BAM file.
After refactoring (#32) and checking that everything is working as expected, I would like to have a consistent way of managing input/output formats before releasing version 1. This requires:
SAMRecord
and/orGATKRead
. Thus, this requires implement a FASTQ reader wrapper which transform the reads.SAMRecord
orGATKRead
).apply(single)
or apply(first, second)
). That will be much more easier than the current way of managing the tools.A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.