magicdgs / readtools Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 3.0 6.42 MB

A Universal Toolkit for Handling Sequence Data from Different Sequencing Platforms

Home Page: https://magicdgs.github.io/ReadTools/

License: MIT License

Java 99.61% Shell 0.39%

bioinformatics dna genomics ngs sequencing

readtools's People

Contributors

Stargazers

Watchers

Forkers

altingia maxmaronna popgenvienna

readtools's Issues

More flexible trimming command line

The algorithm to perform trimming is a single class with everything contained there. A better approach is use the lambda support from Java8 to extract different components of each trimming there (by quality, by Ns, etc). In addition, we can use the GATK4 framework to implement a plugin in the command line which takes parameters for each trimming algorithm if requested.

First draft:

ReadTransformer-like class, which transform the read to the trim one.
We should decide how the read will be output if it is completely trimmed. One strategy is return a GATKRead implementation that is called GATKTrimmedRead which contains the original read and the coordinates for trim.
They should have a setter for trim the 5' end or not.
We should separate the logic of filtering the read. By length or if it continue having Ns should be a post-filtering after trimming.

EDITED: Final design

Tags for trim points and completely trim read.
TrimmingFunction as a ReadTransformer to return a read with updated tags.
Read filters for the current pipeline.
ApplyTrimResultReadTransfomer for hard-clip the read based on the tags.
TrimmerPluginDescriptor as a plugin system to provide trimmers to the command line.
New metrics classes to output the information for trimming (see below)
New metrics classes to output the information for the filtering (see below)
Class for apply a pipeline to reads (maybe a ReadFilter): first trimmers (TrimmingFunction), then apply the trim points (ApplyTrimResultReadTransformer), and finaly filters (ReadFilter). This include the accumulation of new metrics classes while processing.
TrimReads tool, which includes a plugin to provide the trimmer(s) and a plugin to provide the read filter(s).

There should be a TrimmingMetric extending MetricBase, containing the following fields:

TRIMMER: trimmer name. The same as the one provided to switch on the trimmer (usually the class name).
TOTAL_SINGLE/TOTAL_FIRST/TOTAL_SECOND: total number of reads passed to this trimmer (single-end or pair-end). This should be the ones that the trimmer are able to trim (for instance, the ones that are completely trimmed does not pass update this field).
TRIMMED_SINGLE/TRIMMED_FIRST/TRIMMED_SECOND: the number of reads from the previous fields trimmed. We should be careful regarding this, because if 5' or 3' are not trimmed, this should reflect it.

Similarly, a FilteringMetric extending MetricBase should be implemented containing FILTER (name), TOTAL_SINGLE/TOTAL_FIRST/TOTAL_SECOND (reads reaching the filter) and PASSED_SINGLE/PASSED_FIRST/PASSED_SECOND (reads that reach the filter and pass).

SAM validation error - No real operator in CIGAR

I used ReadTools v.0.2.2.r_4831abfbac to add read groups to a BAM-file and got the following error message:

SAM validation error: ERROR: Read name E00300:175:H3LYTALXX:6:1103:27813:5546#ACAGTGAT, No real operator (M|I|D|N) in CIGAR

This is the respective read pair:

E00300:175:H3LYTALXX:6:1103:27813:5546#ACAGTGAT 99      2L      1       70      150S    =       1       129     TTTTTTCGGTTCTTGACTGGGGCGGAGCGTGAACGCTCAGCTCTGTTTTCTCGGGAAAAAATACAAGCGCGAACTCCGCGTCGTGTTAAAGACATATATATTTTTTTAATCTCATTGATCTTTTTTTGCTGTTCTTTGCTTGTGGTTCTT  AAFFFKKKKAKKKKKKFKKKKKKKKAKKKFKKKKKKAF7KKKKKFFKKKKFKKKKKKKFKKKKAKKKKKKK<AKKKKKFFKKKKKKKKAKKKKFAKKKKKKKKKKKKFAA<KK7FKKKKKKKKKAKKKKKKAFKKKKKKKFKKFF7<FKA  AM:i:8  AS:i:256        MD:Z:114        NM:i:0  PG:Z:novoalign.I        PQ:i:438        SM:i:8  UQ:i:256
E00300:175:H3LYTALXX:6:1103:27813:5546#ACAGTGAT 147     2L      1       70      21S129M =       1       -129    ACTGGGGCGGAGCGTGAACGCTCAGCTCTGTTTTCTCGGGAAAAAATACAAGCGCGAACTCCGCGTCGTGTTAAAGACATATATATTTTTTTAATCTCATTGATCTTTTTTTGCTGTTCTTTGCTTGTGGTTCTTTTAGGAATTAATTGG  <KFAKKKKKKKKKKKKKKKKKKKKKKKKKF<KKKKKKKKFKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKFKFKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA  MD:Z:129        PG:Z:novoalign.I        AM:i:8  NM:i:0  SM:i:70 PQ:i:438        UQ:i:166        AS:i:166

Thanks a lot for looking into this, Daniel.

Cheers,

Thomas

Simpler command line handling

Using a similar command line handling as Picard will be useful for change parameters, documentation of the tools and adding new parameters (now it is a nightmare).

Thus, either develop a new command line interface API (based on annotations) or port our tools the Picard/GATK4 framework.

Remove -nstd option in beta tools in favor of force input format

The --non-standardize-output option allows the user to output a file in the same format as the previous file. Because ReadTools is a toolkit for standardize, this does not make any more sense. I implemented this option due to the heuristic approach to guess the format in the input file, to allow the user to maintain an incorrectly guessed Illumina format.

Nevertheless, I think that it will be much more clear to create an option for the input file format, check if the guessed format agrees and if not log a warning instead and proceed with the input format provided by the user. That will really convert ReadTools in a set of tools which consistent output format (Sanger/Standard).

Main help with more details

Probably it will be better for the final user to know that running the tool with -h or --help provides the full information for the tool.

Automatic detection of barcode type of encoding in FASTQ files

When working with BAM files, it is clear that the barcode is contained in a tag that it is named usually BC (and B2 for a second index), but this tags could be changed in the command line or using a java variable.

On the other hand, when working with FASTQ files the things are much more complicated (see #48). We should use patterns to recognize the format of the FATSQ barcode encoding automatically. It should:

Be simple for the case of default read name encoding (@readName#ACTG)
Recognize several indexes separated (@readName#ACTG_ACGT), where the defult separation is a "_"
Allow other kind of separator for several indexes (e.g. "-")
Check in the readName for a DNA-like string through a pattern: ([ACGT]+)(_??)([ACGT]+), or with a different separator
Probably it would be worthy to check for a barcode pattern in the quality header too if the previous ones failed

A simpler way is to create a command line parameter as suggested in #48, but once it is change to the GATK4 framework it could be a plugin to specify an method (example, DefaultReadNameBarcodeEncoding, SeparatorReadNameBarcodeEncoding with an option for the separator between read name and barcode, BestGuessReadNameBarcodeEncoding with the pattern strategy). This will give to the user a finer control of what is happening in the data, but increase the command line complexity. On the other hand, this allows to include other plugins not related only with FASTQ to detect the barcode in the BAM tags and/or the sequence itself.

QualityChecker should warn when high qualities are found in Standard format

Due to the fix made for #24, now there is an option for allow higher qualities. QualityChecker should help to the user to decide if the new option is needed after detecting the format, checking all the qualities and throwing a warning if the qualities are higher than expected.

Change the input/output for barcodes

Although it will break the compatibility with previous versions (although they are pre-releases), I would like to output the barcodes in the read name when several barcodes with a separator between them (like that, they could be retrieved separately to test other approaches.

Some helpers are already implemented on f6803eb

Bug in TaggedBamToFastq single-end

In release 0.1.4 no barcode is detected (all reads are in the discarded file) running in single-end mode (-s).

New implementation of TrimFastq does not provides the same output as the previous

Implementation in commit 73b1f62 does not work properly in real data.

Implement unit tests for the trimming algorithm and found the bug.

Checking Sanger encoding fails when found J

When Sanger-like enconding (Phred+33) comes from Illumina 1.8+, it allows the value 41(J). Nevertheless, my checker is using 40 as the highest value for this Phred score.

Change in the next release.

Create trimming utils to implement different kind of trimming

HTSJDK have its own TrimmingUtils with one simple method to perform trimming.

Extracting the Mott algorithm from the TrimmingMethods will be important for support other kind of trimming once #25 is done, because it will use the same quality enconding.

In addition, trimming Ns could be ported to HTSJDK TrimmingUtils to keep our code cleaner, and at some point even the Mott algorithm could be ported.

Logger wrong

Check the loggers for all the tools. The problem was detected on BasrcodedBamToFastq.

Code quality

Now I'm using Codacy in my projects for automatic review and check code quality. I want to increase the code quality of ReadTools by using the rules implemented there, so this issue requires:

Setup codacy in ReadTools
Add README badge for code quality
Clean the code to reach the highest level in develop (A)

Increase test coverage for alpha

Goal: 50% (it is enough for legacy code with integration tests)

Note: set thresholds in the build (in the CI and maybe with gradle)

Use RawRead interface when added to htsjdk

HTSJDK will include a RawRead interface for FastqRecordand SAMRecord with defined contract for quality encoding and so on: samtools/htsjdk#572

This will allow to simplify some code and have an unique quality management. Things which can be done when this PR is accepted:

Unify methods in SAMRecordUtils and FastqRecordUtils regarding quality and barcodes in name.
~~Support Solexa qualities through SolexaQualityConverter~~
~~Use FastqWriter directly with SAMRecords when converting from tagged BAM to FASTQ.~~
Unify methods in StandardizerAndChecker

Bug in trimming

Found bug in trimming algorithm for master. Changed in develop commit e764c91

Split in BamBarcodeDetector is not working

Although the method is working, the output files are empty.

Coding style and main package name

I would like to migrate the org.vetmeduni package to org.magicdgs, and apply the coding style that I'm defining in the styleguide repo.

Bug in TrimFastq for single end

TrimFastq for single end is trying to write a null object.

Pair-end mode in TaggedBamToFastq for unknown barcodes is broken

Minor bug in lines 173-174:

SAMRecordUtils.addBarcodeToName(record1, String.join("", barcodes));
SAMRecordUtils.addBarcodeToName(record2, String.join("", barcodes));

Should be:

SAMRecordUtils.addBarcodeToName(record1, BarcodeMethods.joinBarcodes(barcodes));
SAMRecordUtils.addBarcodeToName(record2, BarcodeMethods.joinBarcodes(barcodes));

BarcodedBamToFastq does not allow more than one mismatch option

Bug reported on line 86 and 121 in BarcodedBamToFastq.java and line 220 in BarcodeMethods.

Create directories and check for already existing files for output

When the output is a folder (or several folders) that does not exist, try to create it instead of throwing an error. Another enhancement will be to check each time that an output is created if it exists (to avoid overwriting files).

Test coverage with Coveralls

BarcodedBamToFastq problem with writer

For paired-end, the returned SplitFastqWriter is for single end. Check the factory for writers

Quality encoding detection could generate bugs in mixed files

Only 1 million reads are checked for the quality, but the SangerReaders or the normal ones assume that the encoding is the same for the rest of the file. To make sure that there is not a mixture several options could be implemented:

Check the quality after reading/conversion of each read to make sure that are correctly encoded and within the range of each quality. This approach could be very time consuming for reading, although the checking could be done at the same time as the conversion. This does not resolve if a conversion is not performed.
Check the quality after reading/conversion each N read, being N the default number of reads that were use to check the files (1 million). The withdraw of this approach is that if two files are merged, one with 1 million reads and the second with less than one million, it won't be detected.
Check only in tools where the quality is required (TrimFastq and conversion). The withdraw is that every tools is converting by default, and it will be the same as in the first point.
Implement a different quality detector (not the htsjdk) to randomly sample reads from the file.

Anyway, if a mixture is found, an error should be thrown.

It could be worth it to implement a new tool to combine FASTQ/BAM files converting qualities if the files are different.

Code coverage for beta

Coverage should increase in beta to at least 80%. We should have a unit test for each component if possible, not only integration tests for all the tools.

Note: thresholds in codecov.yml should be changed.

Barcode detection algorithm issues

Issues:

Currently uppercase bases mismatch with lowercase and vice-versa; this could introduce a bug if the bases are capitalized differently
When converting from tagged BAM to FASTQ, we should check the PF flag to remove that reads

Improvements:

Option for do not count as mismatches the N bases
Allow only certain number of Ns in the barcode (when allowing lots of mismatches or not counting them as mismatches)
Output a MetricFile (from htsjdk) with an histogram with number of mismatches in the barcode and the final statistics for the de-multiplexing

Metrics for raw/trim coverage

It would be nice to compute a new metric for the trim algorithm and/or analysis of the raw datasets: the raw coverage. The raw coverage is the total number of bases contained in the reads divided by the genome length. We can allow the computation of this by an input of the FASTA genome or the genome length as parameter, or provide the number of bases present in the raw reads vs. trim reads to see how much we lose.

Use codecov instead of coveralls for coverage report

HiSeq XTEN Quality encoding

quality 42 ('K') occurs.

https://www.biostars.org/p/179149/

Illumina info about their quality scores:
http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf

Update to htsjdk v.2.1.2

Although it is not still out, next version of htsjdk will have support for normalizing read names (see this PR), other useful stuff for FastqRecords (serializable and so on, see this PR), and asynchronious stuff which should be take into account (see this PR).

Thus, updating the dependency to this version need modification in the source code.

Force overwrite file for all tools

Quality conversion by default in all tools

For any tool that outputs FASTQ/BAM files the quality should of the final files should be Sanger. For that purpose, add a quality checker in any program and a FastqReader that returns the record already converted.

new option for TrimFastq

Implement a maximum length filtering

Language and license badge

Migration to gradle

Testing framework to TestNG

Possible performance regression in TrimFastq

Running in the same dataset (although not the same system) TrimFastq version v.0.1.4.r_3293bb31 took around 30s per 1M reads; on the other hand, version v.0.1.4.r_3293bb31 is taking 300s.

I should test it properly and check that is not a system issue.

Refactoring (port to GATK4 framework)

As part of the migration to GATK4 (#23) and gradle, some other stuff should be done:

Port to GATK4 framework (in a separate branch to keep it clean)
Integration tests for the ported tools using the current version
~~Refactoring code to keep it simpler (see #25, for example).~~ Edited: with the refactoring is much simpler, but simplification will be performed in a different way.
Make variables final when possible.
Move tests to TestNG
Increase test coverage (ideally at least 50%)
~~Check code for putative errors and add javadoc to public and protected methods.~~ Edited: new issue for this (#63)

Here is tracked what it is already done in the port_to_gatk branch. Specific issues for each part are opened to track what is already done in the develop branch.

Allow to specify the separator of the barcodes

Some FATSQ files have the barcode in the name not in the standard format (@readName#ATCG) but including spaces. An example record for this is the following:

@ST-E00276:193:H5M5TALXX:3:1101:8420:1625 1:N:0:NAACAAAA
NCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+
#AAFFFKKKKFFKKKKKKKKKKKKFFKKKKKKFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKAKKKKKKFFKKKKKKKKKKKKKFKKKKKKKKKKFK

The best temporary solution for the last pre-release is to provide a command line optional parameter with the default separator ("#"). Thus, to handle the previous record the user should provide the following: --read-name-barcode-separator " 1:N:0:".

Because this is changing to often, we should provide a better way of handling this for later realeases.

Normalize single-end output FASTQ files

Currently single-end mode in TaggedBamToFastq running on the same data as FastqBarcodeDetector outputs different markers for the files ("/1" , "/2", "/0").

This should be normalize somehow. Probably this will be much easier once #36 is done.

Tool for upload read sources to HDFS in Distmap format

Creates a tool to upload FASTQ in the Distmap format to HDFS.

Using SRA support from htsjdk

We could benefit from downloading and formatting SRA accessions provided as input, to normalize also public available data. Exploring the SRA handling from htsjdk will help to improve it.

Include MapStat tools

Implement the tools contained in MapStats as quality control tools for mapped reads. A nice improvement with respect the current implementation should be:

Use a common walker for window-base analysis (maybe if broadinstitute/gatk#1708 is accepted will be easier).
Use the metrics system implemented in HTSJDK/GATK4 for output.

Remove dependency for opencsv

A tab-delimited reader could be easily implemented instead of using the dependency.

Default values for BC tag

In TaggedBamToFastq users have problems with the meaning of the --tag option, so it will be better to provide defaults ("BC" and "B2"), because it will be easier. Nevertheless, it should change the logic of loading the barcodes if the user does not provide any tag.

In addition, the error is not meaningful:

Each sample should have the same number of barcodes. Check the formatting of barcodes.txt

This was reported in version v.0.1.4.r_3293bb31

Create validation stringency option

Default should be SILENT, although this could change in future releases.

Barcode dictionary improvements

The barcode dictionary maps currently sample names (String) to sequences, but it will be more interesting to map SAMReadGroupRecord for a more general implementation. Like that, it will make easier the implementation of the barcode detector for a BAM file.

Change input/output handling and default format

After refactoring (#32) and checking that everything is working as expected, I would like to have a consistent way of managing input/output formats before releasing version 1. This requires:

User side:

Output consistent format: BAM with information in the tags will be the best approach as a default output to store raw reads. ~~I have to decide if every tool will allow output a FASTQ file or just one tool will convert BAM to FASTQ using the tags.~~ Edited: Only BAM files will be output, and one tool will standardize the FASTQ
Input raw reads: now only single FASTQ/BAM files, paired FASTQ files (1 file per pair) and paired BAM files (one after the other in name order) are allowed. We should allow other kind of inputs, as interleaved FASTQ files.

Developer side:

Readers returning only one type of read: even if the input is a FASTQ file, the readers should return a ~~SAMRecord and/or~~ GATKRead. Thus, this requires implement a FASTQ reader wrapper which transform the reads.
Methods should be apply to only one read source (~~SAMRecord or~~ GATKRead).
Common interface for output reads: only one generic writer type should be implemented. Implementations of tools will decide how the read will be transform.
Common walker for raw reads (single and paired end): the GATK framework is awesome to create walker over the data. The best way is implement a walker for reads which takes one/two arguments, determine the type of source (paired vs. single) and iterates through one read or two applying a function (apply(single) or apply(first, second)). That will be much more easier than the current way of managing the tools.