nphase's Introduction

nPhase

Ploidy agnostic phasing pipeline and algorithm

nPhase is a ploidy-agnostic tool, developed in Python, which predicts the haplotypes of a sample that was sequenced with both long and short reads by aligning them to a reference. It should work with any ploidy.

Quick-start

nPhase on conda

Conda takes forever to resolve environments; use mamba or micromamba instead. Once installed, you can install nPhase with the following commands:

micromamba create -n polyploidPhasing -c oakheart nphase -c bioconda
micromamba activate polyploidPhasing

Then you can phase your data with the following command:

nphase pipeline --sampleName Individual_1 --reference /path/to/Individual_referenceGenome.fasta 
                --R1 /path/to/Individual_1_shortReads_R1.fastq.gz --R2 /path/to/Individual_1_shortReads_R2.fastq.gz
                --longReads /path/to/Individual_1_longReads.fastq.gz --longReadPlatform [ont|pacbio]
                --output /path/to/outputFolder --threads 8

Once you've phased your data, you can use our automated cleaning method on the raw results folder (/path/to/outputFolder/Individual_1) with the following command:

nphase cleaning --sampleName Cleaning_1 --resultFolder /path/to/outputFolder/Individual_1
                --longReads /path/to/Individual_1_longReads.fastq.gz --threads 8

Installation

With bioconda

Install mamba or micromamba by following the steps here.

Then you can create a new environment and install nPhase with the following commands:

micromamba create -n polyploidPhasing -c oakheart nphase -c bioconda
micromamba activate polyploidPhasing

Without bioconda

Pre-requisites

You will have to install the following software before nPhase:

Python ≥ 3.8

bwa mem

GATK 4.0

samtools v1.9*

ngmlr

*Currently, installing samtools v1.10 or higher will cause an error, due to a bug in the way ngmlr handles MAPQ scores.

Installation via PyPI

You can now install nPhase via

pip install -U nPhase

Usage

nphase pipeline or algorithm

There are two main ways to run nPhase:

nphase pipeline will run the entire pipeline from start to finish and requires the following inputs:

nphase pipeline --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE
                --longReadPlatform {ont,pacbio} --R1 SHORT_READ_FILE_R1 --R2 SHORT_READ_FILE_R2 --threads 8

Optional parameters are described in Parameters.

nphase algorithm will only run the phasing algorithm; it requires inputs generated by nphase pipeline. This is useful if you want to test different parameters on your dataset after having generated the pre-processed files once. Here are the inputs required by nphase algorithm:

nphase algorithm --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE
                 --contextDepth CONTEXT_DEPTHS_FILE --processedLongReads VALIDATED_SNP_ASSIGNMENTS_FILE

Optional parameters are described in Parameters.

nphase partial

Alternatively, if you have already mapped your short reads to a reference, or your long reads, or have already variant called your mapped short reads, you can try to use

nphase partial, which will run only the parts of the pipeline that you still need. For example, if you provide it with a VCF file, it will only try to map the long reads; if you provide it with mapped short reads and mapped long reads, it will only variant call the short reads, and so on. This is not recommended, since I can't control what you input, but it can save people time or allow for more creative uses of nPhase.

Here are the use cases of nphase partial:

You have mapped long reads and a VCF file of your short reads:

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE
               --vcf VCF_FILE --mappedLongReads MAPPED_LONG_READ_FILE --threads 8

You have mapped long reads and mapped short reads, but no VCF:

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE
               --mappedLongReads MAPPED_LONG_READ_FILE --mappedShortReads MAPPED_SHORT_READ_FILE --threads 8

You have mapped long reads, but no mapped short reads and no VCF:

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE
               --mappedLongReads MAPPED_LONG_READ_FILE --R1 SHORT_READ_FILE_R1 --R2 SHORT_READ_FILE_R2 --threads 8

You have a short read VCF, but no mapped long reads:

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE
               --vcf VCF_FILE --longReadPlatform {ont,pacbio} --threads 8

You have mapped short reads, but no mapped long reads and no VCF:

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE
               --mappedShortReads MAPPED_SHORT_READ_FILE --longReadPlatform {ont,pacbio} --threads 8

Parameters

nPhase pipeline or algorithm

nphase pipeline [-h] [--version] [--threads [THREADS]] [--maxID [MAXID]] [--minOvl [MINOVL]] [--minSim [MINSIM]] [--minLen [MINLEN]] [--nPairs [NPAIRS]] --sampleName SAMPLE_NAME
                       --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE --longReadPlatform {ont,pacbio} --R1 SHORT_READ_FILE_R1 --R2
                       SHORT_READ_FILE_R2
or
nphase algorithm [-h] [--threads [THREADS]] [--maxID [MAXID]] [--minOvl [MINOVL]] [--minSim [MINSIM]] [--minLen [MINLEN]] [--nPairs [NPAIRS]] --sampleName SAMPLE_NAME
                        --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE --contextDepth CONTEXT_DEPTHS_FILE --processedLongReads
                        VALIDATED_SNP_ASSIGNMENTS_FILE

positional arguments:
    pipeline            Run the entire nPhase pipeline on your sample
    algorithm           Only run the nPhase algorithm. NOTE: This will require files generated by running the pipeline mode

arguments always required:
  --sampleName STRAINNAME
                        Name of your sample, ex: "Individual_1"
  --reference REFERENCE
                        Path to fasta file of reference genome to align to, ex: /home/reference/Individual_reference.fasta
  --output OUTPUTFOLDER
                        Path to output folder, ex: /home/phased/
  --longReads LONGREADFILE
                        Path to long read FastQ file, ex: /home/longReads/Individual_1.fastq.gz

additional arguments required by nphase pipeline:
  --longReadPlatform {ont,pacbio}
                        Long read platform, must be 'ont' or 'pacbio'
  --R1 SHORTREADFILE_R1
                        Path to paired end short read FastQ file #1, ex: /home/shortReads/Individual_1_R1.fastq.gz
  --R2 SHORTREADFILE_R2
                        Path to paired end short read FastQ file #2, ex: /home/shortReads/Individual_1_R2.fastq.gz
  --nPairs [NPAIRS]
                        Number of overlapping clusters to merge per iterative step in the nPhase algorithm. Higher values can speed up runtime at the cost of accuracy. Default 1

additional arguments required by nphase algorithm:
  --contextDepth CONTEXTDEPTHSFILE
                        Path to context depths file, ex: /home/phased/Individual_1/Overlaps/Individual_1.contextDepths.tsv
  --processedLongReads VALIDATEDSNPASSIGNMENTSFILE
                        Path to validated long read SNPs, ex:
                        /home/phased/Individual_1/VariantCalls/longReads/Individual_1.hetPositions.SNPxLongReads.validated.tsv
  --nPairs [NPAIRS]
                        Number of overlapping clusters to merge per iterative step in the nPhase algorithm. Higher values can speed up runtime at the cost of accuracy. Default 1

additional arguments potentially required by nphase partial:
  --mappedShortReads MAPPEDSHORTREADS
                        Path to mapped, sorted short read file in BAM format, ex: /home/phased/Individual_1/Mapped/shortReads/Individual_1.final.bam
  --vcf VCFFILE         Path to VCF file (must be generated with GATK --ploidy=2), ex: /home/phased/Individual_1/VariantCalls/shortReads/Individual.vcf
  --mappedLongReads MAPPEDLONGREADS
                        Path to mapped long read file in SAM format; it should be filtered with samtools view -F 260 (removing unmapped and secondary alignments while keeping split reads), ex: /home/phased/Individual_1/Mapped/longReads/Individual_1.sorted.sam
  --longReadPlatform {ont,pacbio}
                        Long read platform, must be 'ont' or 'pacbio'
  --R1 SHORTREADFILE_R1
                        Path to paired end short read FastQ file #1, ex: /home/shortReads/Individual_1_R1.fastq.gz
  --R2 SHORTREADFILE_R2
                        Path to paired end short read FastQ file #2, ex: /home/shortReads/Individual_1_R2.fastq.gz
  --nPairs [NPAIRS]
                        Number of overlapping clusters to merge per iterative step in the nPhase algorithm. Higher values can speed up runtime at the cost of accuracy. Default 1

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --threads [THREADS]   Number of threads to use on some steps, default 8
  --maxID [MAXID]       MaxID parameter, determines how different two clusters must be to prevent them from merging. Default 0.05
  --minOvl [MINOVL]     minOvl parameter, determines the minimal percentage of overlap required to allow a merge between two clusters that have fewer than
                        100 heterozygous SNPs in common. Default 0.1
  --minSim [MINSIM]     minSim parameter, determines the minimal percentage of similarity required to allow a merge between two clusters. Default 0.01
  --minLen [MINLEN]     minLen parameter, any cluster based on fewer than N reads will not be output. Default 0

nPhase cleaning

nphase cleaning [-h] [--version] --sampleName STRAINNAME --longReads LONGREADFILE --resultFolder PHASINGRESULT [--filterPct [PERCENTKEPT]]
                [--fillGaps [DEDUPLICATE]] [--filterFirst [FFBOOL]] [--setMaxDiscordance [MAXDISCORDANCE]]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

required arguments:
  --sampleName STRAINNAME
                        Name of your sample, ex: "Individual_1"
  --longReads LONGREADFILE
                        Path to long read FastQ file, ex: /home/longReads/Individual_1.fastq.gz
  --resultFolder PHASINGRESULT
                        Path to output folder, ex: /home/nphaseResults/Individual_1/
  --filterPct [PERCENTKEPT]
                        Percentage of results to filter out. Default 2.0
  --fillGaps [DEDUPLICATE]
                        Attempt to fill gaps by redistributing reads from the most covered cluster. Set to 0 to disable this step. Default 1
  --filterFirst [FFBOOL]
                        Perform the filtering step before the merging step. Set to 1 to enable. Default 0
  --setMaxDiscordance [MAXDISCORDANCE]
                        If this is set higher than 0, use this percentage as the stopping rule for merging instead of using the mean discordance of raw
                        results. Inputting 6.8 means 6.8% discordance max allowed. Default 0

Outputs

All files will be named with the prefix "SAMPLE_NAME_minOvl_minSim_maxID", which we will shorten to $prefix in this section. A small parsing example follows the list of .tsv files below.

  • One fastQ file per predicted haplotig in OUTPUT_FOLDER/Phased/FastQ/

  • Three plots in OUTPUT_FOLDER/Phased/Plots/:

    • $prefix_coverageVis.{png|svg|pdf}: Coverage of each haplotig plotted against your reference genome
    • $prefix_phasedVis.{png|svg|pdf}: Displays the haplotigs along the genome
    • $prefix_discordanceVis.{png|svg|pdf}: Violin plot for each haplotig showing the distribution of allele frequencies within the haplotig. Values around 0.5 here are a sign that two distinct haplotypes were erroneously merged into one. Note: these are currently generated by plotnine and suffer from memory issues. For now nPhase will only try to generate the svg plot, and may error out, but the raw data is still available. More info: has2k1/plotnine#430
  • Six tab-separated value (.tsv) files in OUTPUT_FOLDER/Phased/:

    • $prefix_variants.tsv: For each haplotig, the haplotig name, chromosome, position, base
    • $prefix_clusterReadNames.tsv: For each haplotig, the names of reads that comprise it
    • $prefix_phasedDataFull.tsv: For each haplotig, the position, chromosome, y position of each phased SNP. Raw data, not used to generate the phasedVis plot.
    • $prefix_discordanceVis.tsv: For each haplotig, the haplotig name, chromosome, position, base, frequency, coverage. Used to generate the discordanceVis plot.
    • $prefix_phasedDataSimple.tsv: For each haplotig, the start position, stop position, chromosome, y position. Used to generate the phasedVis plot.
    • $prefix_covVis.tsv: For each haplotig, the haplotig name, chromosome, 5kb window, mean coverage for the window
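
As promised above, here is a minimal sketch of how one of these tables can be inspected, using the discordanceVis file as an example. It assumes the tab-separated column order listed above (haplotig name, chromosome, position, base, frequency, coverage) and no header row; the file name is illustrative, built from the default parameter values:

import csv
from collections import defaultdict

# Count, per haplotig, how many positions have an allele frequency near 0.5,
# which (per the note above) can indicate two haplotypes merged into one.
ambiguous = defaultdict(int)
total = defaultdict(int)
with open("Individual_1_0.1_0.01_0.05_discordanceVis.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        haplotig, chrom, pos, base, freq, cov = row[:6]
        total[haplotig] += 1
        if 0.4 <= float(freq) <= 0.6:
            ambiguous[haplotig] += 1

for haplotig in sorted(total):
    pct = 100 * ambiguous[haplotig] / total[haplotig]
    print(f"{haplotig}\t{pct:.1f}% of positions near 0.5 allele frequency")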

Paper

If you use nPhase in your work or if you're interested in better understanding how nPhase works, please cite/read the following publication: nPhase: An accurate and contiguous phasing method for polyploids

Media

Online lightning talk I gave about nPhase for an Oxford Nanopore event [5:44]: link

Recommendations

Current recommendations for default parameters are at least 20X coverage per haplotype (so 3*20=60X for a triploid) and a heterozygosity level of at least 0.4% (average of 1 heterozygous SNP every 250 bp).

It is currently untested on PacBio data, so if you have a PacBio dataset (with a known ground truth), please contact me (raise an issue on GitHub or email me), especially if you run into errors.

There is an example dataset in the example folder of this GitHub repository which you can test nPhase on. It contains a reference sequence, short reads and long reads. The result should look like a triploid, and it should run quickly.

For significantly more exhaustive testing, please check out the Phasing Toolkit repo. You can use it to automatically run benchmarks, compare nPhase to other polyploid phasing tools, and calculate accuracy metrics on tests with ground-truth datasets generated in the same way as described in the nPhase paper.

Contact me

email: [email protected]

nphase's People

Contributors

omaroakheart

nphase's Issues

Running nphase partial with a .bam file or a .sam file that isn't sorted or has a header

Currently nphase partial fails if you don't provide mapped long reads in exactly the right format. I'll close this issue when I've updated it to automatically detect whether the user submits a .bam or .sam file and to process it accordingly.

If you run into any issue and you use nphase partial with a mapped long read file, follow these steps:

samtools view -h -t reference.fa -@ threadNumber -F 260 longReads.bam -o longReads.sam
samtools sort longReads.sam -@ threadNumber -o longReads.sorted.sam
samtools view longReads.sorted.sam -@ threadNumber -o longReads.sorted.noHeader.sam

Then re-run your command and replace longReads.bam with longReads.sorted.noHeader.sam

Nphase partial error

Hi,

I was hoping to use your tool to phase a draft assembly I have using a VCF file generated from medaka. I have retried the tool three times and I keep getting the below error. Please advise on how I can run the tool?

Traceback (most recent call last):
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 560, in main
    partialPipeline(args)
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 449, in partialPipeline
    phaseTool.nPhase(validatedSNPAssignmentsFile,args.strainName,contextDepthsFile,phasedPath,basePath,args.reference,args.minSim,args.minOvl,args.minLen,args.maxID)
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhase.py", line 499, in nPhase
    visDataTextFull=nPhaseFunctions.giveMeFullData(cleanAllClusters)
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 766, in giveMeFullData
    previouslySeen={sortedClusterLines[0][1]:i}
  File "/home/FCAM/mxaba/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/sortedcontainers/sortedlist.py", line 891, in __getitem__
    raise IndexError('list index out of range')
IndexError: list index out of range

The output file reads as follows:

Tue Oct 26 03:23:32 EDT 2021
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
1 %
[... progress counter continues in 1 % increments ...]
100 %
Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.

The code I used is as follows: 
samtools view -h -t Racon_purge.fa -@ 16 -F 260 Racon_aligned.bam -o /core/Racon_aligned.sam
samtools sort /core/Racon_aligned.sam -@ 16 -o /core/Racon_aligned_sorted.sam
samtools view /core/Racon_aligned_sorted.sam -@ 16 -o /core/Racon_aligned_sorted.noHeader.sam

conda activate polyploidPhasing

nphase partial --sampleName Racon_phase --reference Racon_purge.fasta --output /core/Phased --longReads /core/PromethionFlonge_Porechopped.fasta_no_contaminants.fa --vcf medaka_variant/round_1_phased.vcf --mappedLongReads /core/Racon_aligned_sorted.noHeader.sam --threads 16

Thank you.

Output FQ file is empty

Hi,
I want to separate my FastQ file with nPhase, but the output folder (OUTPUT_FOLDER/Phased/FastQ/) is empty.
Do you know the reason? Many thanks!

Applying to contigs or chromosome-level scaffolds

Hi,

Thank you for sharing this tool with the community. After reading the manual and issues, I would like to use nPhase for my genome projects.

FYI,
Input data: PacBio (Sequel) with ~200X coverage (raw read N50 = ~14Kb)
Assembled contigs: ~1,000 contigs (N50 = ~3Mb)
Chromosome-level scaffolds: 20 Chromosomes with Pore-C data (PromethION)
Trio data: Not available

Given the circumstances, I would like to ask your professional opinion on which stage would be better for running nPhase: 1) the assembled contigs or 2) the chromosome-level scaffolds?

Thank you in advance!

Running nPhase partial with mapped short reads

Hi!
I have an issue when trying to run nPhase partial (version v1.0.12) using both mapped long reads and mapped short reads.
Running:
nphase partial --sampleName $ID --reference $ref --output ~/nPhase/ --longReads $longReads --mappedLongReads $mappedLongReads.sam --mappedShortReads $mappedShortReads.bam

gave me the following error:
Traceback (most recent call last):
  File "/home/denkena/miniconda3/envs/polyploidPhasing/bin/nphase", line 8, in <module>
    sys.exit(main())
  File "/home/denkena/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 557, in main
    partialPipeline(args)
  File "/home/denkena/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 375, in partialPipeline
    SNPVCFFile=open(SNPVCFFilePath,"r")
FileNotFoundError: [Errno 2] No such file or directory: '/home/denkena/nPhase/matchedNormal_DNA/VariantCalls/shortReads/matchedNormal_DNA.SNPs.vcf'

It seems the pipeline is skipping the Variant Calling from the mapped short reads? Is there something I can do to get this to run?
Thanks!

errors in partial and full pipeline

Hi, very interesting tool for us. I tried the following but encountered an error:
nphase partial --sampleName $samplename --reference $reference --output ./ --mappedShortReads $mappedShortReads
--longReads $longreads --mappedLongReads $mappedONTread

Traceback (most recent call last):
  File "/media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 513, in main
    partialPipeline(args)
  File "/media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 333, in partialPipeline
    shortReadBam==args.mappedShortReads
UnboundLocalError: local variable 'shortReadBam' referenced before assignment

Any idea ?

I also tried the whole pipeline, but after the MarkDuplicates step the BAM file seems to be removed, and sorting the BAM files gives an error.
Here is the relevant part of the log file:
INFO 2020-09-14 15:25:16 MarkDuplicates Read 448,000,000 records. Elapsed time: 08:23:29s. Time for last 1,000,000: 1,100s. Last read position: StSOLv1.1ch01:34,366,000
INFO 2020-09-14 15:25:16 MarkDuplicates Tracking 55877209 as yet unmatched pairs. 55877209 records in RAM.
INFO 2020-09-14 15:47:00 MarkDuplicates Read 449,000,000 records. Elapsed time: 08:45:13s. Time for last 1,000,000: 1,303s. Last read position: StSOLv1.1ch01:34,414,428
INFO 2020-09-14 15:47:00 MarkDuplicates Tracking 55885188 as yet unmatched pairs. 55885188 records in RAM.
INFO 2020-09-14 16:18:50 MarkDuplicates Read 450,000,000 records. Elapsed time: 09:17:03s. Time for last 1,000,000: 1,909s. Last read position: StSOLv1.1ch01:34,477,480
INFO 2020-09-14 16:18:50 MarkDuplicates Tracking 55925108 as yet unmatched pairs. 55925108 records in RAM.
[Mon Sep 14 16:51:18 CEST 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 589.56 minutes.
Runtime.totalMemory()=29487529984
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.reflect.Array.newArray(Native Method)
at java.lang.reflect.Array.newInstance(Array.java:75)
at java.util.Arrays.parallelSort(Arrays.java:1178)
at htsjdk.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:247)
at htsjdk.samtools.util.SortingCollection.add(SortingCollection.java:182)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:553)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /media/bulk_01/users/essel002/anaconda3/envs/polyploidPhasing/share/gatk4-4.1.8.1-0/gatk-package-4.1.8.1-local.jar MarkDuplicates --REMOVE_DUPLICATES true -I ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.bam -O ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam -M /dev/null

STDOUT:

COMMAND: samtools sort ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam -o ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.final.bam

STDERR:

[E::hts_open_format] Failed to open file ./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam
samtools sort: can't open "./StSOLv1.1ch01.Colomba/Mapped/shortReads/StSOLv1.1ch01.Colomba.sorted.MD.bam": No such file or directory

No clusters found

I got an error when I used nPhase with PacBio data:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
1 %
[... progress counter continues in 1 % increments ...]
100 %
Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (0 sequences)
No clusters.
Traceback (most recent call last):
  File "/usr/local/bin/nphase", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 562, in main
    nPhasePipeline(args)
  File "/usr/local/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 173, in nPhasePipeline
    phaseTool.nPhase(validatedSNPAssignmentsFile,args.strainName,contextDepthsFile,phasedPath,basePath,args.reference,args.minSim,args.minOvl,args.minLen,args.maxID)
  File "/usr/local/lib/python3.8/site-packages/bin/nPhase.py", line 499, in nPhase
    visDataTextFull=nPhaseFunctions.giveMeFullData(cleanAllClusters)
  File "/usr/local/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 766, in giveMeFullData
    previouslySeen={sortedClusterLines[0][1]:i}
  File "/usr/local/lib/python3.8/site-packages/sortedcontainers/sortedlist.py", line 891, in __getitem__
    raise IndexError('list index out of range')
IndexError: list index out of range

nPhase was launched like this:
nphase pipeline --sampleName 52594 --reference /mnt/CANU-pilon.fasta --R1 /mnt/52594_R1-fastp.fastq --R2 /mnt/52594_R2-fastp.fastq --longReads /mnt/52594.fastq --longReadPlatform pacbio --output /mnt/outNPHASE --threads 20

Can you help me to solve this?

nPhase on one chromosome

Hi,
I am working on tetraploid potato and trying to run nPhase on my dataset, but, as you mention yourself, it is very time-consuming, and I had to terminate the process after running it for 8 days. I was wondering if there is any way to parallelize the process, for example to run it one chromosome at a time like you present in your paper, though I haven't figured out how to do so. Is there an option in the algorithm, or did you just subset the dataset before running it?
Best regards,
Elsa

nPhase has no option to set max heterozygosity level

Whenever someone wants to run nPhase on a very heterozygous, large genome and runs into memory issues or excessive runtimes, I recommend subsampling the VCF down to a 0.5-1% heterozygosity level, since nPhase will still perform well (and run significantly faster). This is probably a pain for everyone, and it would be good to include an option to set a maximum heterozygosity level so that the pipeline can do the subsampling on its own (a rough sketch follows).
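
For illustration, here is a minimal sketch of that kind of VCF subsampling, assuming a plain single-sample VCF and a known genome size; subsample_vcf is a hypothetical helper, not an existing nPhase option:

import random

def subsample_vcf(in_path, out_path, genome_size, target_het=0.005):
    # Randomly thin SNP records so that roughly target_het * genome_size remain.
    with open(in_path) as f:
        lines = f.readlines()
    header = [l for l in lines if l.startswith("#")]
    snps = [l for l in lines if not l.startswith("#")]
    keep_p = min(1.0, (target_het * genome_size) / max(len(snps), 1))
    with open(out_path, "w") as out:
        out.writelines(header)
        out.writelines(l for l in snps if random.random() < keep_p)

# e.g. thin a 12 Mb yeast VCF down to ~0.5% heterozygosity:
subsample_vcf("Individual_1.SNPs.vcf", "Individual_1.SNPs.thinned.vcf", 12_000_000)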

Any advice on how to create a haplotype-resolved assembly from the phased FastQ reads

I now have phased nanopore reads from nPhase. Thank you very much for the tool and your help so far.

Now, the task is to generate a first haplotype-resolved assembly. The phased reads are distributed into 191 clusters of different sizes. Having a tetraploid organism with 16 chromosomes, that is approximately 3 clusters per expected haplotype (191 clusters / 64 expected haplotypes ≈ 3), which looks like a good number to me. However, the clusters are of very different sizes (from 1 to several thousand reads) and total lengths (from 87 bp to 57 Mbp!). So I thought to: 1. filter out any cluster < 10 kb and any singleton clusters (see the sketch below); 2. sort the clusters by chromosome (as given in the .tsv); 3. assemble each remaining cluster individually with Flye or NECAT, in a way similar to what is described in O'Donnell et al. 2023. But I am not sure if I understand the methods used correctly.

Thank you very much for your help.
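
For what it's worth, here is a rough sketch of the singleton-filtering step described above, assuming the $prefix_clusterReadNames.tsv layout documented in Outputs (tab-separated lines mapping a haplotig name to its read names; the exact layout and the file name are illustrative):

from collections import defaultdict

# Collect read names per haplotig cluster, then drop singleton clusters
# before any per-cluster assembly.
reads_per_cluster = defaultdict(list)
with open("Individual_1_0.1_0.01_0.05_clusterReadNames.tsv") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        haplotig, read_names = fields[0], fields[1:]
        reads_per_cluster[haplotig].extend(read_names)

kept = {h: r for h, r in reads_per_cluster.items() if len(r) > 1}
print(f"Kept {len(kept)} of {len(reads_per_cluster)} clusters")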

ZeroDivisionError in assignLongReadToSNPs when samLines < 100

Hi,

I've come across an edge case whilst messing around with the tool. When the number of reads in the long-read alignment file is less than 100, the following code in function assignLongReadToSNPs() will fail:

if i%(int(len(samLines)/100)) == 0:
    print(int(i/len(samLines)*100)+1,"%")

If len(samLines)/100 is less than 1, int() will turn it into a 0, which results in a ZeroDivisionError in the if conditional. (A possible guard is sketched below.)
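
A minimal guard, offered as a suggestion rather than an official patch, clamps the divisor so it can never be 0; it is made self-contained here, with samLines standing in for the parsed alignment lines:

samLines = ["read1", "read2", "read3"]  # e.g. an alignment with < 100 lines
step = max(1, len(samLines) // 100)     # clamped so the divisor is never 0
for i, line in enumerate(samLines):
    if i % step == 0:
        print(int(i / len(samLines) * 100) + 1, "%")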

Cheers
Alastair

nPhase version: nPhase pipeline 1.1.3
Downloaded through conda

NameError: name 'cachedSimilarities' is not defined

I keep getting this error, which I suspect terminates the clustering iteration at 374 clusters every time, no matter what parameters I change.

With this error, nPhase can still produce output. But 374 clusters is too many, as I am expecting 3 clusters/haplotypes from this data.

Is this NameError normal? Is there any possible way to debug this on my end? Thank you in advance!

Traceback (most recent call last):
  File "/Users/naphat/miniconda3/miniconda3/envs/polyploidPhasing/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/naphat/miniconda3/miniconda3/envs/polyploidPhasing/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/naphat/miniconda3/miniconda3/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhase.py", line 179, in cacheFiller
    cacheA = cachedSimilarities[cacheKey]
NameError: name 'cachedSimilarities' is not defined

error in : Identified heterozygous SNPs in short read VCF

I installed nPhase in a Singularity container (actually, I installed conda in the container).
I have an issue while running nphase and I am not sure if it's a problem with the container or with nphase itself:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Traceback (most recent call last):
  File "/opt/miniconda/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/opt/miniconda/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 298, in main
    nPhasePipeline(args)
  File "/opt/miniconda/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 140, in nPhasePipeline
    line=[line[0],str(int(line[1])-1),line[1]]
ValueError: invalid literal for int() with base 10: '62166_obj'

Can you help me?

Duration and Heterozygosity

Hi, I'm trying to run nPhase on a tetraploid yeast genome I sequenced with PacBio HiFi reads.

Here is my code:

nphase pipeline --sampleName trialRun1 --reference s288c.fa --longReads pacbioHiFireads.fastq.gz --longReadPlatform pacbio --R1 illuminaShortRead_1.fastq --R2 illuminaShortRead_2.fastq --output trialRun1Output --threads 8

The issue I am having is that it's taking a long time to run. My computer accidentally restarted during its run, so the progress from the past day or two of running is gone, and before I run it again I wanted to see if there is a flag I could add, or something I could do to the data, to make it go faster.

I saw your messages in issues #11 and #10 but was wondering if anything was done to the nPhase code to account for this. I also read your solution to #12, but I'm not sure how to go about doing that. I can run the command written out above, and I don't mind if it takes longer, but how much time on average would a tetraploid yeast genome take to run through nPhase?

Process killed after variant calling on short reads

Hi Omar, I have tried to run nPhase on one sample of tetraploid potato. It takes a lot of time, but today it finally finished variant calling on the short reads. I have the output VCF file and also the VCF file with selected SNPs, but then the process stops. I tried to restart the process with nphase partial, but it is killed again. I can't seem to figure out what the problem is; do you have an idea? I've attached the log file here.

Best, Elsa

fullLog.txt

fastQ file creation failed

In v1.0.8 and prior versions, the FastQ file creation code can error out, saying that it can't find a given read.

nphase takes a long time at GATK step

I am now running nPhase on my real data and noticed it takes a very long time to finish. My data is yeast ONT reads with ~120X coverage and Illumina reads with ~100X coverage.

It seems nPhase is stuck in the GATK HaplotypeCaller step on the short reads, because this is the only process running (at only ~100% CPU):

michaeld  40243  40242 98 Jan02 ?        9-21:53:35 java -Dsamjdk.use_async_io_read_samtools=false 
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 
-jar /Home/ii/michaeld/miniconda2/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller 
-R ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna -ploidy 2 -I ./nphase_out/Kveik_sample_6/Mapped/shortReads/Kveik_sample_6.final.bam -O ./nphase_out/Kveik_sample_6/VariantCalls/shortReads/Kveik_sample_6.vcf

This is the full command line I am using to run it:

nohup nice -2 nphase pipeline --sampleName Kveik_sample_6 --longReads sample6_cleaned_trimmed.fastq.gz 
--longReadPlatform  ont --R1 ../../Illumina_seq/Fastq/6-1_S8_R1_001.fastq.gz --R2 ../../Illumina_seq/Fastq/6-1_S8_R2_001.fastq.gz  \
--reference  ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna --output ./nphase_out --threads 120 >nphase_nohup.out &

The machine has 144 threads and 2 TB RAM, so it should be fine; also, I remember that variant calling based on the short reads didn't take that long when I ran GATK directly. Possibly, one could set some performance options for java/GATK to increase the speed?

Short reads unsuccessfully mapped to reference

When I use the conda installation, I get this error:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
ERROR: Short reads unsuccessfully mapped to reference

this is my command:

conda activate polyploidPhasing
nphase pipeline --sampleName TME3 --reference /media/vol2/scratch/lcornet/cassava/NPHASE/tme-hybrid-complete.fasta --R1 /media/vol2/scratch/lcornet/cassava/Dataset/raw-illumina/tme3/separator-R1.paired.fastq --R2 /media/vol2/scratch/lcornet/cassava/Dataset/raw-illumina/tme3/separator-R2.paired.fastq --longReads /media/vol2/scratch/lcornet/cassava/Dataset/ONT/tme3/tme3_all_cells_more_than_8_kb.fastq --longReadPlatform ont --output /media/vol2/scratch/lcornet/cassava/NPHASE/nphase --threads 40

The same command works (at this first step) with a Singularity container and the same files. Is there something more to do with conda? Could it be linked to a RAM problem?

Thanks,

Number of haplotig

I have a question concerning the number of haplotigs.

I have a diploid genome of 173 contigs.
nphase predicted 757 haplotigs.
After visualization, some contigs have a high ploidy (frequently 4-6).

Is this the result of the options/thresholds? Should I specify something else (currently running with default options)?

nPhase is too slow/memory intensive

Right now nPhase is deterministic, meaning it doesn't use any useful time-saving heuristics. It also makes no use of parallelization, and in general it isn't optimized. It would be good to work on these points to make it more appropriate for use on larger/more heterozygous genomes.

No plots and phased FastQ generated, AttributeError: 'DataFrame' object has no attribute 'append'

I am trying to run nPhase on long-read and short-read data from a tetraploid yeast strain. However, after running nphase pipeline or algorithm, no plots are generated in Phased/Plots and no files in Phased/FastQ, while the .tsv files are generated.

Command:

          nphase algorithm --sampleName Kveik_sample_6 --longReads   sample6_cleaned_trimmed.fastq.gz --contextDepth  \
          nphase_out/Kveik_sample_6/Overlaps/Kveik_sample_6.contextDepths.tsv --processedLongReads \
          nphase_out/Kveik_sample_6/VariantCalls/longReads/Kveik_sample_6.hetPositions.SNPxLongReads.validated.tsv  \
          --output ./nphase_out --threads 20 --reference ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna

There is no error message in the log file, but an error is printed on STDERR. Here is the output of the run:

Loading reads.
Reads loaded.
Loading context depth.
Context depth loaded
Split reads identified.
Split reads processed.
All reads processed.
Initializing cachedCluster
Filling cachedCluster with similarity information
Preparing initial similarity index
Starting clustering loop (163818 sequences)
334 clusters
Initializing cachedCluster
Filling cachedCluster with similarity information
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:727: PlotnineWarning: Saving 18 x 10 in image.
/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py:730: PlotnineWarning: Filename: ./nphase_out/Kveik_sample_6/Phased/Plots/Kveik_sample_6_0.1_0.01_0.05_0_phasedVis.svg
Preparing initial similarity index
Starting clustering loop (168228 sequences)
187 clusters

Phased files can be found at ./nphase_out/Kveik_sample_6/Phased
The *_variants.tsv file contains information on the consensus heterozygous variants present in each predicted haplotig.
The *_clusterReadNames.tsv file contains information on the reads which comprise each cluster.
Traceback (most recent call last):
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/bin/nphase", line 11, in <module>
    sys.exit(main())
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 589, in main
    nPhaseAlgorithm(args)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipeline.py", line 256, in nPhaseAlgorithm
    nPhaseFunctions.generatePhasingVis(simpleOutPath,datavisPath)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/bin/nPhasePipelineFunctions.py", line 594, in generatePhasingVis
    ggsave(g,filename=outputSVG,width=18,height=10)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 761, in ggsave
    return plot.save(*arg, **kwargs)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 750, in save
    raise err
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 747, in save
    _save()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 734, in _save
    fig = figure[0] = self.draw()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 181, in draw
    return self._draw(return_ggplot)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 188, in _draw
    self._build()
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/ggplot.py", line 284, in _build
    layout.setup(layers, self)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/layout.py", line 64, in setup
    layer.data = self.facet.map(ldata, self.layout)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/facets/facet_wrap.py", line 136, in map
    keys = join_keys(facet_vals, layout, self.vars)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/plotnine/utils.py", line 370, in join_keys
    joint = x[by].append(y[by], ignore_index=True)
  File "/home/ubuntu/micromamba/envs/polyploidPhasing/lib/python3.8/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'

nPhase was installed using conda on Ubuntu, following the instructions.

Python and pandas version:

Python 3.8.5 | packaged by conda-forge | (default, Jul 31 2020, 02:39:48)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> pandas.__version__
'2.0.3'

Linux ubuntu-compute-2 5.-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
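
For context on this traceback: pandas 2.0 removed DataFrame.append, which older plotnine releases still call, so the join_keys call above fails under pandas 2.0.3. Downgrading pandas below 2.0 or upgrading plotnine may avoid the error. A minimal sketch of the modern equivalent of the failing call (illustrative, not a patch to plotnine):

import pandas as pd

x = pd.DataFrame({"PANEL": [1, 2]})
y = pd.DataFrame({"PANEL": [3]})

# pandas < 2.0 allowed: joint = x.append(y, ignore_index=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement:
joint = pd.concat([x, y], ignore_index=True)
print(joint)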
