
HaROLD - HAplotype Reconstruction Of Longitudinal Deep sequencing data

Overview

HaROLD reconstructs haplotypes from longitudinal deep-sequencing data by identifying co-varying variant frequencies within a probabilistic framework. For more details, please refer to our paper in Virus Evolution.

Usage

HaROLD requires Java 8 (or newer). Download pre-built binaries from releases.

Prepare input files

This step may not be necessary, depending on the software used for alignment; HaROLD has been tested with BAM files produced by BBMap and BWA. Convert the BAM files using samtools:

samtools view -h -G69 to.convert.bam | samtools view -h -G133 > file.bam
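
The `-G` option drops reads for which all bits of the given SAM flag mask are set, so the two passes above remove unmapped first-in-pair reads (69 = 1 + 4 + 64) and unmapped second-in-pair reads (133 = 1 + 4 + 128). A small sketch that decodes such masks (the bit meanings come from the SAM specification):

```shell
# Decode a SAM flag mask into its component bits (1 = paired, 4 = unmapped,
# 64 = first in pair, 128 = second in pair, per the SAM specification).
decode_flag() {
  bits=""
  for bit in 1 2 4 8 16 32 64 128 256 512 1024 2048; do
    [ $(( $1 & bit )) -ne 0 ] && bits="$bits $bit"
  done
  echo "flag $1 =$bits"
}

decode_flag 69    # -> flag 69 = 1 4 64
decode_flag 133   # -> flag 133 = 1 4 128
```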

For every sample, generate a strandcount.csv file from the BAM file (examples of this step can be found in the "example" folder):

java -cp /your-path-to-HaROLD/lib/htsjdk-unspecified-SNAPSHOT.jar:\
/Your-path-to-HaROLD/lib/picocli-4.1.2.jar:\
/Your-path-to-HaROLD/lib/pal-1.5.1.jar:\
/Your-path-to-HaROLD/lib/cache2k-all-1.0.2.Final.jar:\
/Your-path-to-HaROLD/lib/commons-math3-3.6.1.jar:\
/Your-path-to-HaROLD/jar/MakeReadCount.jar makereadcount.MakeReadCount file.bam
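
The long `-cp` arguments above are easy to mistype, and a single stray space after a `:` breaks the classpath. A hedged sketch that assembles the classpath from whatever jars sit in the lib directory; `HAROLD_DIR` is a placeholder for your install location, and the lines that create empty jar files exist only to make the example self-contained:

```shell
# Sketch: build the -cp string from every jar under HaROLD's lib directory,
# so no separator can be mistyped.
HAROLD_DIR=${HAROLD_DIR:-./HaROLD}

# Illustration only: fake install tree so the sketch is self-contained.
mkdir -p "$HAROLD_DIR/lib" "$HAROLD_DIR/jar"
: > "$HAROLD_DIR/lib/picocli-4.1.2.jar"
: > "$HAROLD_DIR/lib/pal-1.5.1.jar"

# printf emits "jar1:jar2:...:"; append the tool jar at the end.
CP=$(printf '%s:' "$HAROLD_DIR"/lib/*.jar)"$HAROLD_DIR/jar/MakeReadCount.jar"
echo java -cp "$CP" makereadcount.MakeReadCount file.bam   # dry run: print only
```

Drop the `echo` (and the fake-tree lines) to run the command for real.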

Step 1 - Running HaROLD

The only input file HaROLD requires is a list of strandcount.csv files. Longitudinal samples are listed together in one file, but samples from different runs must be submitted separately. The data and examples used in the simulations in the paper can be found in the "simulation" folder. For example, if you have 4 longitudinal samples from the same patient, your sample.txt will look like:

sampleA_timepoint1.strandcount.csv
sampleA_timepoint2.strandcount.csv
sampleA_timepoint3.strandcount.csv
sampleA_timepoint4.strandcount.csv
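
A list like this can be generated directly from the strandcount files; a sketch, where the placeholder files are created only so the example is self-contained:

```shell
# Illustration only: create four empty placeholder strandcount files.
mkdir -p harold_demo
for t in 1 2 3 4; do
  : > "harold_demo/sampleA_timepoint${t}.strandcount.csv"
done

# Build sample.txt: one file name per line, sorted so that time points
# appear in order (this relies on the names sorting lexically).
( cd harold_demo && ls sampleA_timepoint*.strandcount.csv | sort > sample.txt )
cat harold_demo/sample.txt
```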

View program options:

java -jar /your-path-to-HaROLD/jar/Cluster_RG/dist/HaROLD-2.0.jar --help
Usage: richards-haplotype-model [-AhHLNvV] [--alpha-frac=<alpha_frac>]
                                [--threads=<threads>] [--tol=<tol>]
                                [-o=<optimiser>] [-p=<prefix>]
                                [-s=<randomSeed>] [-a=<initialAlphaParams>
                                <initialAlphaParams>]... -c=<countFile>...
                                [-c=<countFile>...]...
                                [-f=<initialFreqFile>...]... -n=<haplotypes>...
                                [-n=<haplotypes>...]...

      --alpha-frac=<alpha_frac>
                            Fraction of sites to use to optimise error parameters
      --threads=<threads>
      --tol=<tol>
  -a, --initial-alpha=<initialAlphaParams> <initialAlphaParams>

  -A, --fix-alpha           Fix alpha parameters
  -c, --count-file=<countFile>...
                            file containing list of count files
  -f, --initial-freq-file=<initialFreqFile>...
                            optional file containing hap frequency values
  -h, -?, --help            give this help list
  -H, --printHaplotypes     Print haplotypes
  -L, --printLikelihoods    Print likelihoods
  -n, --haplotypes=<haplotypes>...
                            number of haplotypes
  -N, --noOpt               Process without optimising
  -o, --optimiser=<optimiser>
                            Optimiser for haplotype frequencies
  -p, --prefix=<prefix>     Results file prefix
  -s, --seed=<randomSeed>
  -v, --verbose
  -V, --version

For example, the following command was used in our simulation with norovirus data (details in the paper):

java -jar /your-path-to-HaROLD/jar/Cluster_RG/dist/HaROLD-2.0.jar \
--count-file sample.txt --haplotypes 4 --alpha-frac 0.5 --gamma-cache 10000 \
-H -L --threads 4 -p /your-path-to-results/Step1_results

This step produces three output files:

  • your-path-to-results/Step1_results.log: log file
  • your-path-to-results/Step1_results.fasta: haplotype fasta sequences
  • your-path-to-results/Step1_results.lld: base frequency file

A note about choosing the number of haplotypes

The log file (your-path-to-results/Step1_results.log) reports the total likelihood of the model (at the end of the log file, "Main: Final total likelihood"). Try different numbers of haplotypes and choose the one that gives the highest likelihood. The following step will also help identify the best number.
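
Assuming the likelihood line keeps the format quoted above (the exact trailing "`: value`" form is an assumption), runs with different -n values can be compared automatically. The log files and likelihood values below are synthetic placeholders, present only to make the sketch self-contained:

```shell
# Illustration only: two synthetic Step 1 logs with made-up likelihoods.
mkdir -p logs_demo
echo 'Main: Final total likelihood: -152000.7' > logs_demo/nhaplo_3_results.log
echo 'Main: Final total likelihood: -151230.4' > logs_demo/nhaplo_4_results.log

# Pick the log whose final total likelihood is highest (least negative).
best=$(for f in logs_demo/*.log; do
  printf '%s\t%s\n' "$(awk '/Main: Final total likelihood/ {print $NF}' "$f")" "$f"
done | sort -g | tail -n 1 | cut -f2)
echo "Best run: $best"
```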

Step 2 - Refining Output from HaROLD

In this step, the algorithm takes the reads from the different samples and calculates both the optimal haplotype frequencies for each sample and the optimal haplotype sequences. It adjusts the haplotype frequencies and sequences, uses these values to assign reads to haplotypes, and then calculates the log likelihood based on the assigned reads. The required input files are:

  • output files from HaROLD Step 1
  • reference sequence in Fasta format
  • BAM files

View program options:

java -cp /your-path-to-HaROLD/lib/htsjdk-unspecified-SNAPSHOT.jar:\
/your-path-to-HaROLD/lib/picocli-4.1.2.jar:\
/your-path-to-HaROLD/lib/pal-1.5.1.jar:\
/your-path-to-HaROLD/lib/commons-math3-3.6.1.jar:\
/your-path-to-HaROLD/lib/cache2k-all-1.0.2.Final.jar:\
/your-path-to-HaROLD/lib/flanagan.jar:\
/your-path-to-HaROLD/jar/RefineHaplotypes.jar refineHaplotypes.RefineHaplotypes -h
Usage: richards-haplotype-model [-hIV] [--expand] [--printIntermediate]
                                [--printReference]
                                --alignment=<hapAlignmentFile>
                                --bam=<readsFile> --baseFreq=<baseFreqFile>
                                [-D=<minReadDepth>] [--errorRate=<errorRate>]
                                [--hapFreq=<hapFreqFile>] [-m=<minReads>]
                                [--maxHaplo=<maxHaplo>]
                                [--maxIterate=<maxIter>]
                                [--maxRecombine=<maxRecombine>]
                                --reference=<refSeqFile> [--seed=<randomSeed>]
                                -t=<tag>

      --alignment, --hapAlignment=<hapAlignmentFile>
                            Name of haplotype alignment file, fasta (req)
      --bam, --BAM, --sam, --SAM=<readsFile>
                            Name of bam/sam file (req)
      --baseFreq=<baseFreqFile>
                            Name of base frequency file (req)
  -D, --minReadDepth=<minReadDepth>
                            Minimum read depth (<0 indicates inactive) (-1.)
      --errorRate=<errorRate>
                            Error rate (0.002)
      --expand              Consider splits that increase number of haplotypes
                              (false)
  -h, -?, --help            give this help list
      --hapFreq=<hapFreqFile>
                            Name of haplotype frequency file (null)
  -I, --iterate             Turn on iteration (false)
  -m, --minReads=<minReads> Minimum number of reads (20.0)
      --maxHaplo, --maxHaplotypes=<maxHaplo>
                            Maximum number of haplotypes (10)
      --maxIterate=<maxIter>
                            Maximum number of iterations (10)
      --maxRecombine=<maxRecombine>
                            Maximum number of recombination attempts (20)
      --printIntermediate   Print intermediate sequences (false)
      --printReference      Print reference sequence (false)
      --reference, --refSequence, --referenceSequence=<refSeqFile>
                            Name of file containing reference sequence, fasta
                              (req)
      --seed=<randomSeed>   Random number seed (-1, not specified)
  -t, --tag=<tag>           Tag for output files (req)
  -V, --version

Here we show an example for sample2 from the first norovirus longitudinal set in the simulation (referred to in the paper as the "2 haplotypes, low similarity, 5 time points" set). Run this step for every sample.

java -cp /your-path-to-HaROLD/lib/htsjdk-unspecified-SNAPSHOT.jar:\
/your-path-to-HaROLD/lib/picocli-4.1.2.jar:\
/your-path-to-HaROLD/lib/pal-1.5.1.jar:\
/your-path-to-HaROLD/lib/commons-math3-3.6.1.jar:\
/your-path-to-HaROLD/lib/cache2k-all-1.0.2.Final.jar:\
/your-path-to-HaROLD/lib/flanagan.jar:\
/your-path-to-HaROLD/jar/RefineHaplotypes.jar refineHaplotypes.RefineHaplotypes \
-t sample2 --bam sample2-1longitudinal.sorted.dedup.bam.fixed.bam \
--baseFreq nhaplo_4_results.lld --refSequence refseq-JX459907.fasta \
--hapAlignment nhaplo_4_resultsHaplo.fasta --iterate
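
Since this step runs once per sample, it can be wrapped in a loop. A dry-run sketch that only prints each command; the `<sample>.bam` naming scheme and `HAROLD_DIR` path are assumptions about your layout, and `lib/*` is a Java classpath wildcard expanded by java itself, so it must stay quoted in the shell:

```shell
# Dry run: print one RefineHaplotypes command per sample; remove the
# 'echo' to execute. File names follow an assumed <sample>.bam scheme.
HAROLD_DIR=${HAROLD_DIR:-/your-path-to-HaROLD}
CP="$HAROLD_DIR/lib/*:$HAROLD_DIR/jar/RefineHaplotypes.jar"

for s in sample1 sample2; do
  cmd="java -cp $CP refineHaplotypes.RefineHaplotypes -t $s \
--bam $s.bam --baseFreq nhaplo_4_results.lld \
--refSequence refseq-JX459907.fasta \
--hapAlignment nhaplo_4_resultsHaplo.fasta --iterate"
  echo "$cmd"
done
```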

This step produces two output files for each sample:

  • your-path-to-results/sample2.log: log file
  • your-path-to-results/sample2.fasta: haplotype fasta sequences for the sample

Getting help

If you have any questions, please feel free to contact Juanita Pang, Cristina Venturini or Richard Goldstein.


Issues

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 300000 out of bounds for length 300000

This is the command line I ran with your simulated dataset: sample1_touse.strandcount.csv

[nina]$ h=1
[nina]$ cat $temp_dir/$convertedPrefx".txt"
sample1_touse.strandcount.csv
[nina]$ 
[nina]$ time java -jar $app_dir/jar/Cluster_RG/dist/HaROLD-2.0.jar         --count-file $temp_dir/$convertedPrefx".txt" --haplotypes ${h}         --alpha-frac 0.5 --gamma-cache 10000 --threads 32 -H -L -p $tempRes_prefix
Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Picked up _JAVA_OPTIONS: -Xms256m -Xmx256g
Main: arguments = --count-file /scratch/nina7/SSenr9_harold/results/temp/BC32.converted.txt --haplotypes 1 --alpha-frac 0.5 --gamma-cache 10000 --threads 32 -H -L -p /scratch/nina7/SSenr9_harold/results/temp/BC32_H1_harold
Main: seed = 1714761951164
Optimisation method :BOBYQ
May 03, 2024 2:45:51 P.M. org.cache2k.core.Cache2kCoreProviderImpl
INFO: cache2k starting. version=1.0.2.Final, build=undefined, defaultImplementation=HeapCache
BC32.converted.txt: /scratch/nina7/SSenr9_harold/results/temp/BC32.converted.txt
BC32.converted.txt: haplotypes = 1
/scratch/nina7/SSenr9_harold/results/temp/BC32.converted.txt
initialFreqSet = false
/scratch/nina7/SSenr9_harold/results/temp/sample1_touse.strandcount.csv
Average read depth
0       81.90919980095208

[0]
BC32.converted.txt: timepoints = 1
BC32.converted.txt: sites = 227081
Starting pre-optimisation of hap frequencies, a = 0.99900000 0.03000000
[0.999, 0.03]
[1.0]
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at cluster_rg.Cluster_RG.getFutureResults(Cluster_RG.java:342)
        at cluster_rg.Cluster_RG.optimise(Cluster_RG.java:164)
        at cluster_rg.Cluster_RG.run(Cluster_RG.java:97)
        at cluster_rg.Cluster_RG.main(Cluster_RG.java:50)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
        at cluster_rg.Cluster.run(Cluster.java:122)
        at cluster_rg.Cluster_RG.lambda$optimise$0(Cluster_RG.java:161)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
Exception in thread "main" picocli.CommandLine$ExecutionException: Error
        at cluster_rg.Cluster_RG.run(Cluster_RG.java:123)
        at cluster_rg.Cluster_RG.main(Cluster_RG.java:50)
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
        at cluster_rg.Cluster_RG.getFutureResults(Cluster_RG.java:345)
        at cluster_rg.Cluster_RG.optimise(Cluster_RG.java:164)
        at cluster_rg.Cluster_RG.run(Cluster_RG.java:97)
        ... 1 more
Caused by: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at cluster_rg.Cluster_RG.getFutureResults(Cluster_RG.java:342)
        ... 3 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
        at cluster_rg.Cluster.run(Cluster.java:122)
        at cluster_rg.Cluster_RG.lambda$optimise$0(Cluster_RG.java:161)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

Desktop (please complete the following information):

  • OS: [LINUX]
  • x86_64 AMD EPYC 7532 32-Core Processor AuthenticAMD GNU/Linux
  • java/17.0.6

Exception in thread "main" java.lang.NegativeArraySizeException

Describe the bug

When trying to create the "strandcount" tables from bam files I get the following error:

./my-data/step0-input/iii-13-13MT2EXPIIIVP10FullRepseq25052018_S1_L001-sorted.bam.
Exception in thread "main" java.lang.NegativeArraySizeException: -449
	at makereadcount.Read.<init>(Read.java:39)
	at makereadcount.MakeReadCount.readFromFile(MakeReadCount.java:447)
	at makereadcount.MakeReadCount.readData(MakeReadCount.java:427)
	at makereadcount.MakeReadCount.run(MakeReadCount.java:62)
	at makereadcount.MakeReadCount.main(MakeReadCount.java:44)

Here's the code that I run for this step:

for i in $(ls ./my-data/step0-input); do echo ${i} && java -cp ./lib/htsjdk-unspecified-SNAPSHOT.jar:./lib/picocli-4.1.2.jar:./lib/pal-1.5.1.jar:./lib/cache2k-all-1.0.2.Final.jar:./lib/commons-math3-3.6.1.jar:./jar/MakeReadCount.jar makereadcount.MakeReadCount ./my-data/step0-input/${i}; done

Additional context

  1. The same command works for example bam files provided within the package.
  2. My bam files were created using BWA aligner with default parameters.

Error when run HaROLD

Hi, I ran into a problem when running HaROLD according to the manual.

When I run the command below,

java -cp HAROLD/lib/htsjdk-unspecified-SNAPSHOT.jar:\
HaROLD/lib/picocli-4.1.2.jar: \
HaROLD/lib/pal-1.5.1.jar: \
HaROLD/lib/cache2k-all-1.0.2.Final.jar: \
HaROLD/lib/commons-math3-3.6.1.jar: \
HaROLD/jar/MakeReadCount.jar makereadcount.MakeReadCount file.bam

I got the error:

Error: Could not find or load main class HaROLD.lib.pal-1.5.1.jar:
Caused by: java.lang.ClassNotFoundException: HaROLD.lib.pal-1.5.1.jar:

Do you have any idea about this? Thanks a lot!
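
The likely cause here is the space before each line-continuation backslash (e.g. `HaROLD/lib/picocli-4.1.2.jar: \`): the shell then passes `HaROLD/lib/pal-1.5.1.jar:` as a separate argument, which java interprets as the main class name, exactly as the error message shows. Writing the classpath with no space between the `:` and the `\` (as in the Prepare-input-files command above) avoids this. A minimal demonstration of the splitting:

```shell
# Count how many arguments the shell actually passes.
count_args() { echo $#; }

# No space before the backslash: the two lines join into ONE argument.
joined=$(count_args one.jar:\
two.jar)

# Space before the backslash: the shell sees TWO arguments, and java
# would treat the second one as the name of the main class.
split=$(count_args one.jar: \
two.jar)

echo "$joined $split"   # -> 1 2
```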
