
long-read-pipelines's Introduction


Long read pipelines

This repository contains pipelines for processing long-read data from the PacBio and Oxford Nanopore platforms. The pipelines are written in WDL 1.0 and intended for use on Google Cloud Platform via the Cromwell scientific workflow engine. Processing is designed to be as consistent as possible across the two long-read platforms, using platform-specific options or tasks only where necessary.

High level workflows can be found in the wdl/ directory.

Documentation: Documentation for each workflow can be found at the repository site.

External Contributors: Please see the Contributing Guidelines for information on how to contribute to the repository.

long-read-pipelines's People

Contributors

bshifaw, ericsong, eviewan, jonn-smith, kvg, shuang-broad, tedsharpe

long-read-pipelines's Issues

Dealing with strange (and sometimes wrong) CIGARs in alignments

The strange CIGAR could be something like this in minimap2 (MM2) output.

Or it could be something that Picard's SortSam complains about:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: WARNING: Read name m64020_190419_185501/103416136/ccs, No M or N operator between pair of D operators in CIGAR

when sorting the bamout results from GATK HaplotypeCaller run on a CCS-corrected BAM (where the upstream aligner could be MM2).
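One way to surface such records before they break downstream tools is to validate the BAM upfront. A minimal sketch using Picard's ValidateSamFile (the input path is illustrative):

picard ValidateSamFile \
    I=input.bam \
    MODE=SUMMARY

MODE=VERBOSE would list each offending record instead of just the error counts.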

Generate graph visualization of the pipelines

As we clean up and document the WDLs, it'll be good to have a graph/DAG visualization of each workflow (see the sketch below).

This helps

  • new users understand the structure of the pipeline, and
  • developers keep the repo as clean as possible (e.g. there are pipelines that we may not need anymore, and repeated tasks).
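womtool, which ships alongside Cromwell, can already emit a Graphviz DOT file for a workflow; rendering it is one more step. A minimal sketch, assuming womtool.jar and Graphviz's dot are available locally, with an illustrative workflow path:

java -jar womtool.jar graph wdl/PBCCSWholeGenomeSingleFlowcell.wdl > workflow.dot
dot -Tpng workflow.dot -o workflow.png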

How to handle sex chromosomes with GATK

The PR that will bring in GATK has a limitation: unless one carefully modifies the scattering scheme, the pipeline will fail, because when sex chromosomes and autosomes are mixed in one scatter, the ploidy-detection logic errors out.

Currently we can get away with separate runs that limit the input intervals to the workflow (i.e. one run for autosomes, one for sex chromosomes).
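A minimal sketch of that workaround, splitting a BED file of calling intervals into the two runs (file names are illustrative, and chr-prefixed contig names are assumed):

grep -vE '^(chrX|chrY)\b' intervals.bed > autosome_intervals.bed
grep -E  '^(chrX|chrY)\b' intervals.bed > sex_chromosome_intervals.bed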

Reproducible test case for the Intel codec bug

Right now we're using the JDK codec for bgzip because of an apparent intermittent bug in the Intel codec. Because the JDK codec comes with a huge speed penalty, we write temporary files out in uncompressed form. Producing a reproducible test case for the Intel codec bug would hopefully allow Intel to fix the issue, which would in turn allow us to make this pipeline leaner and faster.

Auto-evaluation of resource usage of our WDLs

The resource monitoring script gets run for all of our tasks, but I have yet to analyze the data. Let's write an R script that examines resource usage per task over time, compares it against that task's requested resources, and aggregates data from multiple invocations of the task across workflows. That way we'll know where to scale back our resource requests and optimize the pipeline.

Remove ProcessReads

With the recent changes, we now have some redundant functionality across the codebase. For example, AlignReads, PBUtils, ONTUtils, and Utils now substantially overlap with their progenitor WDL, ProcessReads. We should remove ProcessReads and update anything that depended on it.

Investigate if we should `tar` some of the result folders when using cromwell

Relying on Cromwell to delocalize a folder gives you names like

glob-4474304f3e3392228593ba19c5cd74e8,
glob-de84a38a722f40d0a73061d4179b1788

which take unnecessary effort to decode, especially when two folders hold contents with highly similar structures.

tar without compression avoids the compression time, so it may be a good compromise between ease of comprehension and speed.
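A minimal sketch of what a task's command section could do instead of relying on glob() (names are illustrative):

# bundle the output directory so the delocalized artifact keeps a readable name
tar -cf sample1_results.tar results_dir/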

Set up script(s) for processing resource monitoring log file

E.g. bare minimum:

# summarize memory usage (GiB) recorded in the monitoring log
grep -F 'Memory usage:' resources.log | \
    grep -Eo '[0-9.]+ GiB' | \
    sed 's/ GiB//' | \
    Rscript -e 'summary(as.numeric(readLines("stdin")))'

# summarize CPU usage (%) recorded in the monitoring log
grep -F 'CPU usage:' resources.log | \
    awk -F ':' '{print $2}' | \
    sed 's/ //g; s/%//' | \
    Rscript -e 'summary(as.numeric(readLines("stdin")))'

Check upfront in the PacBio CCS and CLR pipelines whether a PacBio flowcell actually represents CCS or CLR data

The metadata.xml file that accompanies PacBio data is unreliable with regard to whether a flowcell represents CCS or CLR data. However, it's pretty easy to look at the first few hundred reads of a subreads file and make this determination by counting the frequency of ZMW numbers (in the uniq -c output below, each line is a subreads-per-ZMW count followed by the ZMW number). Here's a quick example:

Likely CCS data:
$ gsutil cat gs://broad-gp-pacbio/r64020_20190507_173946/4_D01/**.subreads.bam | samtools view | awk -F"/" '{ print $2 }' | uniq -c | head
14 0
2 1
15 2
22 5
2 7
20 8
15 9
2 12
5 13
2 15

Likely CLR data:
$ gsutil cat gs://broad-gp-pacbio/r64020_20190507_173946/1_A01/**.subreads.bam | samtools view | awk -F"/" '{ print $2 }' | uniq -c | head
1 4
1 6
1 7
1 12
1 15
1 16
1 20
1 25
1 30
1 31

TODO: Write a tool that runs very early in the PBCCS and PBCLR workflows, checks (say) the first 10000 reads, and determines whether the run is actually appropriate for that workflow. Throw an error if not.
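A minimal sketch of such a check in shell; the threshold of 2 subreads per ZMW is an illustrative assumption, not a validated cutoff:

# mean number of subreads per ZMW across the first 10000 reads
mean=$(samtools view input.subreads.bam | head -n 10000 | \
    awk -F'/' '{print $2}' | uniq -c | \
    awk '{sum += $1; n++} END {print sum / n}')

# many subreads per ZMW suggests CCS; close to one per ZMW suggests CLR
if awk -v m="$mean" 'BEGIN {exit !(m >= 2)}'; then
    echo "likely CCS"
else
    echo "likely CLR"
fi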

Official Sniffles docker seems to have issues with Cromwell

See here.
The server is Cromwell version 36, though, so updating to version 39 or newer should fix this issue, as claimed in the ticket above.
We also see the following messages, even though the (sub)workflow is marked as "Done"; apparently the image's BusyBox find and xargs do not support the -empty and -I options that the wrapper script uses:

find: unrecognized: -empty
xargs: invalid option -- 'I'
BusyBox v1.22.1 (2014-05-23 01:24:27 UTC) multi-call binary.

Usage: find [-HL] [PATH]... [OPTIONS] [ACTIONS]

Search for files and perform actions on them.
First failed action stops processing of current file.
Defaults: PATH is current directory, action is '-print'

	-L,-follow	Follow symlinks
	-H		...on command line only
	-xdev		Don't descend directories on other filesystems
	-maxdepth N	Descend at most N levels. -maxdepth 0 applies
			actions to command line arguments only
	-mindepth N	Don't act on first N levels
	-depth		Act on directory *after* traversing it

Actions:
	( ACTIONS )	Group actions for -o / -a
	! ACT		Invert ACT's success/failure
	ACT1 [-a] ACT2	If ACT1 fails, stop, else do ACT2
	ACT1 -o ACT2	If ACT1 succeeds, stop, else do ACT2
			Note: -a has higher priority than -o
	-name PATTERN	Match file name (w/o directory name) to PATTERN
	-iname PATTERN	Case insensitive -name
	-path PATTERN	Match path to PATTERN
	-ipath PATTERN	Case insensitive -path
	-regex PATTERN	Match path to regex PATTERN
	-type X		File type is X (one of: f,d,l,b,c,...)
	-perm MASK	At least one mask bit (+MASK), all bits (-MASK),
			or exactly MASK bits are set in file's mode
	-mtime DAYS	mtime is greater than (+N), less than (-N),
			or exactly N days in the past
	-mmin MINS	mtime is greater than (+N), less than (-N),
			or exactly N minutes in the past
	-newer FILE	mtime is more recent than FILE's
	-user NAME/ID	File is owned by given user
	-group NAME/ID	File is owned by given group
	-size N[bck]	File size is N (c:bytes,k:kbytes,b:512 bytes(def.))
			+/-N: file size is bigger/smaller than N
	-prune		If current file is directory, don't descend into it
If none of the following actions is specified, -print is assumed
	-print		Print file name
	-print0		Print file name, NUL terminated
	-exec CMD ARG ;	Run CMD with all instances of {} replaced by
			file name. Fails if CMD exits with nonzero

BusyBox v1.22.1 (2014-05-23 01:24:27 UTC) multi-call binary.

Usage: xargs [OPTIONS] [PROG ARGS]

Run PROG on every item given by stdin

	-r	Don't run command if input is empty
	-0	Input is separated by NUL characters
	-t	Print the command on stderr before execution
	-e[STR]	STR stops input processing
	-n N	Pass no more than N args to PROG
	-s N	Pass command line of no more than N bytes
	-x	Exit if size is exceeded

Two workflows in an invalid state

womtool validate tells me these two are invalid.

PB10xSingleFlowcell.wdl
Failed to import 'tasks/HiFi.wdl' (reason 1 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'relative to directory [...]/wdl (escaping allowed)' (reason 1 of 1): File not found: tasks/HiFi.wdl
Failed to import 'tasks/HiFi.wdl' (reason 2 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'entire local filesystem (relative to '/')' (reason 1 of 1): File not found: tasks/HiFi.wdl
Failed to import 'tasks/HiFi.wdl' (reason 3 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'relative to directory [...]/wdl (escaping allowed)' (reason 1 of 1): File not found: tasks/HiFi.wdl
Failed to import 'tasks/HiFi.wdl' (reason 4 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'http importer (no 'relative-to' origin)' (reason 1 of 1): Relative path
PB10xSingleProcessedSample.wdl
Failed to import 'tasks/HiFi.wdl' (reason 1 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'relative to directory [...]/wdl (escaping allowed)' (reason 1 of 1): File not found: tasks/HiFi.wdl
Failed to import 'tasks/HiFi.wdl' (reason 2 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'entire local filesystem (relative to '/')' (reason 1 of 1): File not found: tasks/HiFi.wdl
Failed to import 'tasks/HiFi.wdl' (reason 3 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'relative to directory [...]/wdl (escaping allowed)' (reason 1 of 1): File not found: tasks/HiFi.wdl
Failed to import 'tasks/HiFi.wdl' (reason 4 of 4): Failed to resolve 'tasks/HiFi.wdl' using resolver: 'http importer (no 'relative-to' origin)' (reason 1 of 1): Relative path

Task for routine trio-assembly quality assessment

Now that we've shown trio-assembly costs can be brought below $10, it makes sense to have a (sub-)pipeline for routine quality assessment of trio assemblies.

Currently I'm experimenting with BUSCO and U50.
Other suggestions/ideas are welcome.
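For scoping, a minimal sketch of the kind of BUSCO invocation such a task might wrap (the lineage, thread count, and paths are illustrative):

busco -i assembly.fasta -l primates_odb10 -m genome -c 8 -o busco_out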

DeepVariant optimization

We need two optimizations for DV:

  • Separate out the three steps: make_examples, call_variants, and postprocess_variants. make_examples is CPU-intensive and currently takes ~50 CPU-hours for our EAP data; call_variants can be hugely improved with a GPU and takes roughly 1-2 hours (and is relatively more expensive on a per-hour basis); postprocessing is a mere one hour.
  • We need an AVX512F-optimized DV docker to shorten make_examples, but this is more easily done by the DV team.

Altogether this could bring the cost per WGS down to ~$3, and ultimately we can bring it below $1. A sketch of the three stages run separately follows.
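A minimal sketch of the split, assuming the standard DeepVariant entry points, with illustrative paths and sharding flags omitted:

make_examples --mode calling --ref ref.fasta --reads input.bam --examples examples.tfrecord.gz
call_variants --examples examples.tfrecord.gz --checkpoint model.ckpt --outfile call_variants_out.tfrecord.gz
postprocess_variants --ref ref.fasta --infile call_variants_out.tfrecord.gz --outfile output.vcf.gz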

Longshot taking a long time

Currently the scattering scheme is per chromosome, which leads to jobs running 10 hours or longer.

We should be using Picard's IntervalListTools to produce more, shorter intervals.
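A minimal sketch of generating a finer scatter with Picard (the scatter count and paths are illustrative):

picard IntervalListTools \
    I=whole_genome.interval_list \
    SCATTER_COUNT=200 \
    O=scattered_intervals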

Inventory codebase to identify and remove redundant functionality

We now have a few WDLs that have some tasks that are slightly or entirely redundant with other WDLs. We should inventory the codebase to see where we have multiple tasks that do nearly the same thing, and then make a plan to consolidate those tasks into one canonical task.

Change to dockerhub

Right now the pipeline Docker images all live in our personal Docker registries. Let's move them to Docker Hub.

Pipeline regression testing

How are we going to verify that this pipeline continues to function the way we expect as we add things to it over time?

Keeping track of which dockers we use

So that we know which images to clean up, and what the dependency chain is.
I'm not sure where to keep this documentation, so I just created this ticket.


3rd party:

DOCKER                                          TAG                WDL
quay.io/biocontainers/mosdepth                  0.2.4--he527e40_0  AlignedMetrics.wdl
quay.io/biocontainers/nanoplot                  1.28.0--py_0       NanoPlot.wdl
gcr.io/deepvariant-docker/deepvariant           0.8.0-gpu          DeepVariantLR.wdl
us.gcr.io/broad-gatk/gatk                       latest             GATKBestPractice.wdl
us.gcr.io/broad-gotc-prod/genomes-in-the-cloud  2.4.1-1540490856   Utils.wdl

We have control over these, but need time to migrate:

DOCKER                                       TAG                        WDL
quay.io/broad-long-read-pipelines/canu       v1.9_wdl_patch_varibale_k  AssignChildLongReads.wdl
quay.io/broad-long-read-pipelines/canu       v1.9_wdl_patch_varibale_k  CollectParentsKmerStats.wdl
us.gcr.io/broad-dsde-methods/samtools-cloud  v1.clean                   GATKBestPractice.wdl

We have control over these, but need to get the versions right:

DOCKER                                        TAG     WDL
us.gcr.io/broad-dsp-lrma/lr-10x               0.1.9   AnnotateAdapters.wdl, ONT10xSingleFlowcell.wdl
us.gcr.io/broad-dsp-lrma/lr-align             0.1.26  AlignReads.wdl, PhaseReads.wdl, Utils.wdl, PB10xSingleProcessedSample.wdl, TestCromwell.wdl
us.gcr.io/broad-dsp-lrma/lr-asm               0.1.12  AssembleTarget.wdl
us.gcr.io/broad-dsp-lrma/lr-c3poa             0.1.4   C3POa.wdl
us.gcr.io/broad-dsp-lrma/lr-canu              0.1.0   Canu.wdl
us.gcr.io/broad-dsp-lrma/lr-cloud-downloader  0.2.1   DownloadFromSRA.wdl
us.gcr.io/broad-dsp-lrma/lr-finalize          0.1.2   Finalize.wdl
us.gcr.io/broad-dsp-lrma/lr-gatk              0.1.1   GATKBestPractice.wdl
us.gcr.io/broad-dsp-lrma/lr-guppy             4.0.14  Guppy.wdl
us.gcr.io/broad-dsp-lrma/lr-longshot          0.1.1   CallSmallVariants.wdl
us.gcr.io/broad-dsp-lrma/lr-medaka            0.1.0   Medaka.wdl
us.gcr.io/broad-dsp-lrma/lr-metrics           0.1.8   AlignedMetrics.wdl, UnalignedMetrics.wdl, Utils.wdl
us.gcr.io/broad-dsp-lrma/lr-nanopolish        0.3.0   Nanopolish.wdl
us.gcr.io/broad-dsp-lrma/lr-pb                0.1.5   PBUtils.wdl
us.gcr.io/broad-dsp-lrma/lr-peregrine         0.1.6   Peregrine.wdl
us.gcr.io/broad-dsp-lrma/lr-quast             0.1.0   Quast.wdl
us.gcr.io/broad-dsp-lrma/lr-racon             0.1.0   Racon.wdl
us.gcr.io/broad-dsp-lrma/lr-sv                0.1.2   CallSVs.wdl
us.gcr.io/broad-dsp-lrma/lr-utils             0.1.6   Guppy.wdl, ONTUtils.wdl, PBUtils.wdl, Utils.wdl

Decide on code formatting standards

We now have code written in

  • WDL
  • shell
  • python

And sooner or later there could be more.
We need to think about picking a style guide for each.

For WDL, there's the WDL Guidelines for the GATK repo;
for shell scripts, I generally use Sublime + SublimeLinter + shellcheck;
for Python, there's the generally accepted PEP 8.

Find empirical formula on batch size for racon

Racon is showing some strange behavior in splitting the data: with the batch size set to 80 it complains about not having enough data, but set to 90, huzzah!

And batch sizes that work for 3000-contig assemblies may be too large for 4000-contig assemblies, leading to out-of-memory (OOM) errors.

The task is to find an empirical formula.

Who should take this on?

@kvg

Implement a task for the SNV caller for raw Nanopore data, Clair

Clairvoyante is a new deep-learning approach to SNP and indel calling designed for long-read sequencing (https://www.nature.com/articles/s41467-019-09025-z). We should implement this in our pipeline ASAP. It is particularly important for Nanopore and PacBio CLR sequencing, as we don't currently have a solution for SNP calling on such data (the pipeline can only call SNPs on PacBio CCS data, using DeepVariant and, very soon, GATK HaplotypeCaller).

Clair is the successor to Clairvoyante, with apparently a ~5% bump in sensitivity for ONT data (https://github.com/HKU-BAL/Clair). The usage is apparently identical to Clairvoyante's (https://github.com/aquaskyline/Clairvoyante). This is probably where we should start, rather than with Clairvoyante itself.
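If the CLI does mirror Clairvoyante's, a task command might look roughly like the following; the entry point, flags, and paths here are unverified assumptions to be checked against the Clair docs:

clair.py callVarBam \
    --chkpnt_fn model \
    --ref_fn ref.fasta \
    --bam_fn input.bam \
    --ctgName chr20 \
    --sampleName sample1 \
    --call_fn sample1.chr20.vcf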

Abstract
The accurate identification of DNA sequence variants is an important, but challenging task in genomics. It is particularly difficult for single molecule sequencing, which has a per-nucleotide error rate of ~5–15%. Meeting this demand, we developed Clairvoyante, a multi-task five-layer convolutional neural network model for predicting variant type (SNP or indel), zygosity, alternative allele and indel length from aligned reads. For the well-characterized NA12878 human sample, Clairvoyante achieves 99.67, 95.78, 90.53% F1-score on 1KP common variants, and 98.65, 92.57, 87.26% F1-score for whole-genome analysis, using Illumina, PacBio, and Oxford Nanopore data, respectively. Training on a second human sample shows Clairvoyante is sample agnostic and finds variants in less than 2 h on a standard server. Furthermore, we present 3,135 variants that are missed using Illumina but supported independently by both PacBio and Oxford Nanopore reads. Clairvoyante is available open-source (https://github.com/aquaskyline/Clairvoyante), with modules to train, utilize and visualize the model.

Improve the pipeline monitoring unit

A wish list

  • more metrics monitored in the docker (e.g. I/O ops, GPU)
  • a more robust querying command (I can imagine many ways to fool the current one; e.g. the Python script does not check a job's status, so it may fail on failed jobs)

Anyone interested in implementing a feature can cross the item out.

More wishes welcome.

Watchout for alignments with > 65535 CIGAR operations

As reads and assemblies get longer, we will get there.
In fact, we might already be there for some assemblies with NG50 > 10 Mb.

See minimap2's comments on how to handle this (the BAM format caps a record at 65535 CIGAR operations; the hts-specs workaround is to move the long CIGAR into the CG tag).

A tricky thing is how up to date downstream tools are when it comes to adhering to hts-specs.
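A quick way to check whether a given BAM is affected, by counting CIGAR operators per record (the path is illustrative):

samtools view aln.bam | \
    awk '{n = gsub(/[MIDNSHP=X]/, "", $6); if (n > 65535) print $1, n}'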

Too few reads in bam for Sniffles

In the PBCCSWholeGenomeSingleFlowcell workflow's CallSVs.Sniffles task, I got an error suggesting there are too few reads in the BAM for Sniffles to estimate a parameter it needs. Log file excerpt follows:

++ samtools view -H /cromwell_root/fc-408ac13e-b06c-4835-b747-4258321b9a9b/8aaaf651-b226-426c-9580-40be7525723e/PBCCSWholeGenomeSingleFlowcell/8a3ed254-64b3-4266-8f57-08e5e1e0766b/call-MergeRuns/SM-JOTZQ_RW.m64020_200118_025318.bam
++ grep -m1 '^@RG'
++ sed 's/\t/\n/g'
++ grep '^SM:'
++ sed s/SM://g
+ SM=SM-JOTZQ_RW
+ sniffles -t 8 -m /cromwell_root/fc-408ac13e-b06c-4835-b747-4258321b9a9b/8aaaf651-b226-426c-9580-40be7525723e/PBCCSWholeGenomeSingleFlowcell/8a3ed254-64b3-4266-8f57-08e5e1e0766b/call-MergeRuns/SM-JOTZQ_RW.m64020_200118_025318.bam -v SM-JOTZQ_RW.m64020_200118_025318.sniffles.pre.vcf -s 3 -r 1000 -q 20 --genotype --report_seq --report_read_strands
Estimating parameter...
Too few reads detected in /cromwell_root/fc-408ac13e-b06c-4835-b747-4258321b9a9b/8aaaf651-b226-426c-9580-40be7525723e/PBCCSWholeGenomeSingleFlowcell/8a3ed254-64b3-4266-8f57-08e5e1e0766b/call-MergeRuns/SM-JOTZQ_RW.m64020_200118_025318.bam

This is using Terra, with the method imported from Dockstore:
github.com/broadinstitute/long-read-pipelines/PBCCSWholeGenomeSingleFlowcell, version: 2.0-dockstore-test-2

Let me know if any other details are needed.

Implement and evaluate a new long-read-specialized tandem repeat finder

Implement a WDL task for, and evaluate the use of, the Noise-Cancelling Repeat Finder (NCRF). From the manuscript (https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz484/5530597):

Abstract
Summary
Tandem DNA repeats can be sequenced with long-read technologies, but cannot be accurately deciphered due to the lack of computational tools taking high error rates of these technologies into account. Here we introduce Noise-Cancelling Repeat Finder (NCRF) to uncover putative tandem repeats of specified motifs in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. Using simulations, we validated the use of NCRF to locate tandem repeats with motifs of various lengths and demonstrated its superior performance as compared to two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat involved in heat shock stress response.

Availability and implementation
NCRF is implemented in C, supported by several python scripts, and is available in bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.
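For scoping the task, a minimal sketch of the kind of invocation the WDL would wrap, following the usage shown in the NCRF README (the motif and paths are illustrative):

cat reads.fasta | NCRF AATGG > reads.AATGG.ncrf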
