
hairsplitter's Introduction


Splits contigs into their different haplotypes (or repeats into their different versions).

For developers working on similar problems: HairSplitter is purposefully built as a series of modules that can be integrated into other software. See the "How does it work?" section and do not hesitate to contact me.

What is HairSplitter?

HairSplitter takes as input an assembly (obtained by any means) and the long reads (including high-error-rate long reads) used to build it. For each contig, it checks whether the contig was built using reads from different haplotypes/regions. If it was, HairSplitter separates the reads into as many groups as necessary and computes the different versions (e.g. alleles) of the contig actually present in the genome. It outputs a new assembly in which the different versions of a contig are not collapsed into one but assembled separately.

Why is it useful?

HairSplitter can be used to refine a metagenomic assembly. Assemblers commonly collapse closely related strains into a single genome; HairSplitter can recover the lost strains. The uncollapsed parts of the assembly are left as is. HairSplitter is also useful for single-organism assembly, especially if you are trying to obtain a phased assembly. The main advantage of HairSplitter compared to other techniques is that it is totally parameter-free. Most importantly, it does not require knowing the ploidy of the organism, and can infer a different ploidy for each contig. It can thus be used just as well on haploid assemblies (to improve the assembly of duplications) as on complex allotetraploids (to assemble the haplotypes separately). Just run the assembly through!

Installation

You can install HairSplitter through conda:

conda install -c bioconda hairsplitter

Dependencies

List of dependencies

If minimap2, racon, medaka or samtools are not in the PATH, their locations should be specified through the --path_to_minimap2, --path_to_racon, --path_to_medaka or --path_to_samtools options.

Quick conda dependencies

The recommended way to install HairSplitter is to create and activate a conda environment with all dependencies:

conda create -c bioconda -c conda-forge -c anaconda -n hairsplitter cmake gxx gcc python scipy numpy minimap2 minigraph=0.20 racon "samtools>=1.16" raven-assembler openmp
conda activate hairsplitter

conda install -c bioconda -c conda-forge medaka #only if you specifically want to use medaka /!\ Very heavy installation

Download & Compilation

To download and compile, run

git clone https://github.com/RolandFaure/Hairsplitter.git
cd Hairsplitter/src/
mkdir build && cd build
cmake ..
make
cd ../../ && chmod +x hairsplitter.py

Usage

Quick start

Let's say reads.fastq (ONT reads) were used to build the assembly assembly.gfa with any assembler (the assembly can be in GFA or FASTA format). To improve/phase the assembly using HairSplitter, run

python /path/to/hairsplitter/folder/hairsplitter.py -f reads.fastq -i assembly.gfa -x ont -o hairsplitter_out/

In the folder hairsplitter_out, you will find the new assembly, named hairsplitter_final_assembly.gfa. Another generated file is hairsplitter_summary.txt, which lists which contigs were duplicated and which were merged.
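For a quick sanity check of the result, note that the output is standard GFA. Below is a minimal sketch (not part of HairSplitter) that counts the contigs and total assembled length in a GFA 1.0 file; the file name matches the output described above, but adapt the path to your run.

```python
# Minimal sketch: count segments (S-lines) in a GFA 1.0 assembly
# and report the total assembled length.

def gfa_stats(path):
    n_segments, total_len = 0, 0
    with open(path) as f:
        for line in f:
            if line.startswith("S\t"):  # GFA segment line: S <name> <sequence> [tags]
                fields = line.rstrip("\n").split("\t")
                n_segments += 1
                if fields[2] != "*":  # the sequence may be elided as "*"
                    total_len += len(fields[2])
    return n_segments, total_len

# Example: n, bp = gfa_stats("hairsplitter_out/hairsplitter_final_assembly.gfa")
```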

You can test the installation on the provided mock instance and check that HairSplitter exits without problems.

python hairsplitter.py -i test/simple_mock/assembly.gfa -f test/simple_mock/mock_reads.fasta -o test_hairsplitter/ -F

Options

usage: hairsplitter.py [-h] -i ASSEMBLY -f FASTQ [-c HAPLOID_COVERAGE]
                       [-x USE_CASE] [-p POLISHER] [--correct-assembly]
                       [-t THREADS] -o OUTPUT [--resume] [-s] [-P] [-F] [-l]
                       [--clean]
                       [--rarest-strain-abundance RAREST_STRAIN_ABUNDANCE]
                       [--minimap2-params MINIMAP2_PARAMS]
                       [--path_to_minimap2 PATH_TO_MINIMAP2]
                       [--path_to_minigraph PATH_TO_MINIGRAPH]
                       [--path_to_racon PATH_TO_RACON]
                       [--path_to_medaka PATH_TO_MEDAKA]
                       [--path_to_samtools PATH_TO_SAMTOOLS]
                       [--path_to_python PATH_TO_PYTHON]
                       [--path_to_raven PATH_TO_RAVEN] [-v] [-d]

optional arguments:
  -h, --help            show this help message and exit
  -i ASSEMBLY, --assembly ASSEMBLY
                        Original assembly in GFA or FASTA format (required)
  -f FASTQ, --fastq FASTQ
                        Sequencing reads fastq or fasta (required)
  -c HAPLOID_COVERAGE, --haploid-coverage HAPLOID_COVERAGE
                        Expected haploid coverage. 0 if does not apply [0]
  -x USE_CASE, --use-case USE_CASE
                        {ont, pacbio, hifi,amplicon} [ont]
  -p POLISHER, --polisher POLISHER
                        {racon,medaka} medaka is more accurate but much slower
                        [racon]
  --correct-assembly    Correct structural errors in the input assembly (time-
                        consuming)
  -t THREADS, --threads THREADS
                        Number of threads [1]
  -o OUTPUT, --output OUTPUT
                        Output directory
  --resume              Resume from a previous run
  -s, --dont_simplify   Don't merge the contig
  -P, --polish-everything
                        Polish every contig with racon, even those where there
                        is only one haplotype
  -F, --force           Force overwrite of output folder if it exists
  -l, --low-memory      Turn on the low-memory mode (at the expense of speed)
  --clean               Clean the temporary files
  --rarest-strain-abundance RAREST_STRAIN_ABUNDANCE
                        Limit on the relative abundance of the rarest strain
                        to detect (0 might be slow for some datasets) [0.01]
  --minimap2-params MINIMAP2_PARAMS
                        Parameters to pass to minimap2
  --path_to_minimap2 PATH_TO_MINIMAP2
                        Path to the executable minimap2 [minimap2]
  --path_to_minigraph PATH_TO_MINIGRAPH
                        Path to the executable minigraph [minigraph]
  --path_to_racon PATH_TO_RACON
                        Path to the executable racon [racon]
  --path_to_medaka PATH_TO_MEDAKA
                        Path to the executable medaka [medaka]
  --path_to_samtools PATH_TO_SAMTOOLS
                        Path to samtools [samtools]
  --path_to_python PATH_TO_PYTHON
                        Path to python [python]
  --path_to_raven PATH_TO_RAVEN
                        Path to raven [raven]
  -v, --version         Print version and exit
  -d, --debug           Debug mode

Issues

Most installation issues we have seen so far stem from compilers that are too old. HairSplitter has been developed with gcc 11.2.0. Sometimes the default compiler version is too old (especially on servers). Specify compiler versions manually to cmake using -DCMAKE_CXX_COMPILER=/path/to/modern/g++ and -DCMAKE_C_COMPILER=/path/to/modern/gcc.

How does it work?

HairSplitter is organized as a series of modules, some of which are of independent interest. The full documentation can be found in the doc/ folder.

  1. Cleaning the assembly. Ideally, the assembly would be purged of all assembly errors. In practice, HairSplitter ensures there is no over-duplication by deleting unconnected contigs that align very well on other contigs.

  2. Calling variants. Variants are called from an alignment of the reads on the assembly; for now, a basic pileup is used. Calling variants in a metagenomic context is hard: HairSplitter favors calling false variants over missing true variants, since false variants can be filtered afterward.

  3. Filtering variants. This step is crucial. Each called variant partitions the reads into groups. Only variants whose partition recurs frequently are kept, because such recurrence cannot be due to chance. This way, only very robust variants are retained.

  4. Separating the reads. Based on the robust variants, HairSplitter inspects each contig and determines whether several distinct groups of reads align there. If so, several different versions of the contig exist.

  5. Creating the new contigs. Each new version of a contig is created by polishing the existing contig with the corresponding group of reads.

  6. Improving contiguity. Contigs are generally separated only locally. To improve contiguity, HairSplitter uses the long reads that align on several contigs sequentially.
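To make the filtering idea of step 3 concrete, here is a toy sketch (not HairSplitter's actual code): each variant partitions the reads by the allele they carry, and a variant is kept only if the same partition is observed independently at several positions, which is unlikely for sequencing errors.

```python
from collections import Counter

def filter_variants(variants, min_recurrence=2):
    """Toy illustration of robust-variant filtering (not HairSplitter's
    actual code). `variants` maps a position to the set of read names
    carrying the alternative allele there; that set defines a partition
    of the reads. We keep a variant only if its partition recurs at
    >= min_recurrence positions, since identical splits appearing
    independently several times cannot reasonably be chance errors."""
    partitions = Counter(frozenset(reads) for reads in variants.values())
    return {pos: reads for pos, reads in variants.items()
            if partitions[frozenset(reads)] >= min_recurrence}
```

For instance, a split {r1, r2} vs. the rest seen at two positions would be kept, while a one-off split caused by a sequencing error would be discarded.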

Citation

Please cite the preprint on bioRxiv: https://www.biorxiv.org/content/10.1101/2024.02.13.580067v1

hairsplitter's People

Contributors

rolandfaure


hairsplitter's Issues

Failed to compile from source directory

Hello,

I tried cloning and installing as described in the documentation, but the installation failed:

git clone https://github.com/RolandFaure/Hairsplitter.git
cd Hairsplitter/src/build
cmake ..
make
Cloning into 'Hairsplitter'...
remote: Enumerating objects: 1095, done.
remote: Counting objects: 100% (76/76), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 1095 (delta 39), reused 55 (delta 27), pack-reused 1019
Receiving objects: 100% (1095/1095), 6.67 MiB | 5.85 MiB/s, done.
Resolving deltas: 100% (817/817), done.
cd: no such file or directory: Hairsplitter/src/build
CMake Warning:
  Ignoring extra path from command line:

   ".."

Then I tried running cmake and make from the src/ subdirectory:

cd Hairsplitter/src/
cmake .
make


[  2%] Building CXX object CMakeFiles/fa2gfa.dir/fa2gfa.cpp.o
[  4%] Linking CXX executable fa2gfa
[  4%] Built target fa2gfa
[  6%] Building CXX object CMakeFiles/gfa2fa.dir/gfa2fa.cpp.o
[  8%] Linking CXX executable gfa2fa
[  8%] Built target gfa2fa
[ 11%] Building CXX object CMakeFiles/clean_graph.dir/Partition.cpp.o
[ 13%] Building CXX object CMakeFiles/clean_graph.dir/clean_graph.cpp.o
[ 15%] Building CXX object CMakeFiles/clean_graph.dir/input_output.cpp.o
[ 17%] Building CXX object CMakeFiles/clean_graph.dir/read.cpp.o
[ 20%] Building CXX object CMakeFiles/clean_graph.dir/sequence.cpp.o
[ 22%] Building CXX object CMakeFiles/clean_graph.dir/tools.cpp.o
[ 24%] Building CXX object CMakeFiles/clean_graph.dir/edlib/src/edlib.cpp.o
[ 26%] Linking CXX executable clean_graph
[ 26%] Built target clean_graph
[ 28%] Building CXX object CMakeFiles/call_variants.dir/Partition.cpp.o
[ 31%] Building CXX object CMakeFiles/call_variants.dir/call_variants.cpp.o
[ 33%] Building CXX object CMakeFiles/call_variants.dir/input_output.cpp.o
[ 35%] Building CXX object CMakeFiles/call_variants.dir/read.cpp.o
[ 37%] Building CXX object CMakeFiles/call_variants.dir/sequence.cpp.o
[ 40%] Building CXX object CMakeFiles/call_variants.dir/tools.cpp.o
/mnt/sda1/Alex/software/Hairsplitter/src/tools.cpp: In function ‘std::__cxx11::string consensus_reads(const string&, std::vector<std::__cxx11::basic_string<char> >&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&)’:
/mnt/sda1/Alex/software/Hairsplitter/src/tools.cpp:225:11: warning: ignoring return value of ‘int system(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
     system("mkdir tmp/ 2> trash.txt");
     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/sda1/Alex/software/Hairsplitter/src/tools.cpp:284:11: warning: ignoring return value of ‘int system(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
     system(commandMap.c_str());
     ~~~~~~^~~~~~~~~~~~~~~~~~~~
/mnt/sda1/Alex/software/Hairsplitter/src/tools.cpp:288:11: warning: ignoring return value of ‘int system(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
     system(commandPolish.c_str());
     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~
/mnt/sda1/Alex/software/Hairsplitter/src/tools.cpp: In function ‘std::__cxx11::string consensus_reads_wtdbg2(const string&, std::vector<std::__cxx11::basic_string<char> >&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&, std::__cxx11::string&)’:
/mnt/sda1/Alex/software/Hairsplitter/src/tools.cpp:399:11: warning: ignoring return value of ‘int system(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
     system("mkdir tmp/ 2> trash.txt");
     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/sda1/Alex/software/Hairsplitter/src/tools.cpp:427:11: warning: ignoring return value of ‘int system(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
     system(commandMap.c_str());
     ~~~~~~^~~~~~~~~~~~~~~~~~~~
[ 42%] Building CXX object CMakeFiles/call_variants.dir/edlib/src/edlib.cpp.o
[ 44%] Linking CXX executable call_variants
[ 44%] Built target call_variants
[ 46%] Building CXX object CMakeFiles/filter_variants.dir/Partition.cpp.o
[ 48%] Building CXX object CMakeFiles/filter_variants.dir/filter_variants.cpp.o
/mnt/sda1/Alex/software/Hairsplitter/src/filter_variants.cpp:12:10: fatal error: clipp.h: No such file or directory
 #include "clipp.h" //library to build command line interfaces
          ^~~~~~~~~
compilation terminated.
make[2]: *** [CMakeFiles/filter_variants.dir/build.make:90: CMakeFiles/filter_variants.dir/filter_variants.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:199: CMakeFiles/filter_variants.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Indeed I tried to locate this file but could not:

find ./ clipp.h
./
./Partition.h
./fa2gfa
./filter_variants.cpp
./separate_reads.h
./edlib
./edlib/src
./edlib/src/edlib.cpp
./edlib/include
./edlib/include/edlib.h
./Makefile
./Partition.cpp
./CMakeCache.txt
./read.cpp
./cluster_graph.h
./create_new_contigs.h
./cmake_install.cmake
./fa2gfa.cpp
./call_variants
./input_output.h
./clean_graph
./CMakeLists.txt
./read.h
./input_output.cpp
./separate_reads.cpp
./robin_hood.h
./tools.h
./gfa2fa.cpp
./CMakeFiles
./CMakeFiles/separate_reads.dir
./CMakeFiles/separate_reads.dir/depend.make
./CMakeFiles/separate_reads.dir/build.make
./CMakeFiles/separate_reads.dir/edlib
./CMakeFiles/separate_reads.dir/edlib/src
./CMakeFiles/separate_reads.dir/link.txt
./CMakeFiles/separate_reads.dir/progress.make
./CMakeFiles/separate_reads.dir/compiler_depend.make
./CMakeFiles/separate_reads.dir/compiler_depend.ts
./CMakeFiles/separate_reads.dir/flags.make
./CMakeFiles/separate_reads.dir/DependInfo.cmake
./CMakeFiles/separate_reads.dir/cmake_clean.cmake
./CMakeFiles/feature_tests.cxx
./CMakeFiles/Makefile.cmake
./CMakeFiles/progress.marks
./CMakeFiles/Progress
./CMakeFiles/Progress/35
./CMakeFiles/Progress/16
./CMakeFiles/Progress/3
./CMakeFiles/Progress/8
./CMakeFiles/Progress/5
./CMakeFiles/Progress/2
./CMakeFiles/Progress/14
./CMakeFiles/Progress/7
./CMakeFiles/Progress/count.txt
./CMakeFiles/Progress/11
./CMakeFiles/Progress/4
./CMakeFiles/Progress/9
./CMakeFiles/Progress/25
./CMakeFiles/Progress/27
./CMakeFiles/Progress/13
./CMakeFiles/Progress/36
./CMakeFiles/Progress/26
./CMakeFiles/Progress/12
./CMakeFiles/Progress/28
./CMakeFiles/Progress/1
./CMakeFiles/Progress/15
./CMakeFiles/Progress/10
./CMakeFiles/Progress/6
./CMakeFiles/CMakeDirectoryInformation.cmake
./CMakeFiles/CMakeOutput.log
./CMakeFiles/filter_variants.dir
./CMakeFiles/filter_variants.dir/depend.make
./CMakeFiles/filter_variants.dir/build.make
./CMakeFiles/filter_variants.dir/edlib
./CMakeFiles/filter_variants.dir/edlib/src
./CMakeFiles/filter_variants.dir/link.txt
./CMakeFiles/filter_variants.dir/progress.make
./CMakeFiles/filter_variants.dir/compiler_depend.make
./CMakeFiles/filter_variants.dir/compiler_depend.ts
./CMakeFiles/filter_variants.dir/flags.make
./CMakeFiles/filter_variants.dir/Partition.cpp.o
./CMakeFiles/filter_variants.dir/DependInfo.cmake
./CMakeFiles/filter_variants.dir/Partition.cpp.o.d
./CMakeFiles/filter_variants.dir/cmake_clean.cmake
./CMakeFiles/feature_tests.bin
./CMakeFiles/call_variants.dir
./CMakeFiles/call_variants.dir/depend.make
./CMakeFiles/call_variants.dir/build.make
./CMakeFiles/call_variants.dir/edlib
./CMakeFiles/call_variants.dir/edlib/src
./CMakeFiles/call_variants.dir/edlib/src/edlib.cpp.o.d
./CMakeFiles/call_variants.dir/edlib/src/edlib.cpp.o
./CMakeFiles/call_variants.dir/call_variants.cpp.o
./CMakeFiles/call_variants.dir/sequence.cpp.o.d
./CMakeFiles/call_variants.dir/sequence.cpp.o
./CMakeFiles/call_variants.dir/link.txt
./CMakeFiles/call_variants.dir/progress.make
./CMakeFiles/call_variants.dir/compiler_depend.make
./CMakeFiles/call_variants.dir/compiler_depend.ts
./CMakeFiles/call_variants.dir/input_output.cpp.o.d
./CMakeFiles/call_variants.dir/flags.make
./CMakeFiles/call_variants.dir/input_output.cpp.o
./CMakeFiles/call_variants.dir/Partition.cpp.o
./CMakeFiles/call_variants.dir/DependInfo.cmake
./CMakeFiles/call_variants.dir/Partition.cpp.o.d
./CMakeFiles/call_variants.dir/tools.cpp.o
./CMakeFiles/call_variants.dir/call_variants.cpp.o.d
./CMakeFiles/call_variants.dir/cmake_clean.cmake
./CMakeFiles/call_variants.dir/tools.cpp.o.d
./CMakeFiles/call_variants.dir/read.cpp.o.d
./CMakeFiles/call_variants.dir/read.cpp.o
./CMakeFiles/create_new_contigs.dir
./CMakeFiles/create_new_contigs.dir/depend.make
./CMakeFiles/create_new_contigs.dir/build.make
./CMakeFiles/create_new_contigs.dir/edlib
./CMakeFiles/create_new_contigs.dir/edlib/src
./CMakeFiles/create_new_contigs.dir/link.txt
./CMakeFiles/create_new_contigs.dir/progress.make
./CMakeFiles/create_new_contigs.dir/CXX.includecache
./CMakeFiles/create_new_contigs.dir/compiler_depend.make
./CMakeFiles/create_new_contigs.dir/compiler_depend.ts
./CMakeFiles/create_new_contigs.dir/flags.make
./CMakeFiles/create_new_contigs.dir/DependInfo.cmake
./CMakeFiles/create_new_contigs.dir/cmake_clean.cmake
./CMakeFiles/feature_tests.c
./CMakeFiles/FindOpenMP
./CMakeFiles/FindOpenMP/ompver_CXX.bin
./CMakeFiles/FindOpenMP/OpenMPCheckVersion.c
./CMakeFiles/FindOpenMP/ompver_C.bin
./CMakeFiles/FindOpenMP/OpenMPTryFlag.cpp
./CMakeFiles/FindOpenMP/OpenMPCheckVersion.cpp
./CMakeFiles/FindOpenMP/OpenMPTryFlag.c
./CMakeFiles/Makefile2
./CMakeFiles/CMakeScratch
./CMakeFiles/clean_graph.dir
./CMakeFiles/clean_graph.dir/depend.make
./CMakeFiles/clean_graph.dir/build.make
./CMakeFiles/clean_graph.dir/edlib
./CMakeFiles/clean_graph.dir/edlib/src
./CMakeFiles/clean_graph.dir/edlib/src/edlib.cpp.o.d
./CMakeFiles/clean_graph.dir/edlib/src/edlib.cpp.o
./CMakeFiles/clean_graph.dir/sequence.cpp.o.d
./CMakeFiles/clean_graph.dir/sequence.cpp.o
./CMakeFiles/clean_graph.dir/link.txt
./CMakeFiles/clean_graph.dir/progress.make
./CMakeFiles/clean_graph.dir/compiler_depend.make
./CMakeFiles/clean_graph.dir/compiler_depend.ts
./CMakeFiles/clean_graph.dir/input_output.cpp.o.d
./CMakeFiles/clean_graph.dir/flags.make
./CMakeFiles/clean_graph.dir/input_output.cpp.o
./CMakeFiles/clean_graph.dir/Partition.cpp.o
./CMakeFiles/clean_graph.dir/clean_graph.cpp.o
./CMakeFiles/clean_graph.dir/DependInfo.cmake
./CMakeFiles/clean_graph.dir/Partition.cpp.o.d
./CMakeFiles/clean_graph.dir/clean_graph.cpp.o.d
./CMakeFiles/clean_graph.dir/tools.cpp.o
./CMakeFiles/clean_graph.dir/cmake_clean.cmake
./CMakeFiles/clean_graph.dir/tools.cpp.o.d
./CMakeFiles/clean_graph.dir/read.cpp.o.d
./CMakeFiles/clean_graph.dir/read.cpp.o
./CMakeFiles/3.26.4
./CMakeFiles/3.26.4/CMakeCXXCompiler.cmake
./CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CXX.bin
./CMakeFiles/3.26.4/CMakeCCompiler.cmake
./CMakeFiles/3.26.4/CompilerIdCXX
./CMakeFiles/3.26.4/CompilerIdCXX/tmp
./CMakeFiles/3.26.4/CompilerIdCXX/a.out
./CMakeFiles/3.26.4/CompilerIdCXX/CMakeCXXCompilerId.cpp
./CMakeFiles/3.26.4/CompilerIdC
./CMakeFiles/3.26.4/CompilerIdC/CMakeCCompilerId.c
./CMakeFiles/3.26.4/CompilerIdC/tmp
./CMakeFiles/3.26.4/CompilerIdC/a.out
./CMakeFiles/3.26.4/CMakeDetermineCompilerABI_C.bin
./CMakeFiles/3.26.4/CMakeSystem.cmake
./CMakeFiles/CMakeTmp
./CMakeFiles/TargetDirectories.txt
./CMakeFiles/pkgRedirects
./CMakeFiles/3.12.1
./CMakeFiles/3.12.1/CMakeCXXCompiler.cmake
./CMakeFiles/3.12.1/CMakeDetermineCompilerABI_CXX.bin
./CMakeFiles/3.12.1/CMakeCCompiler.cmake
./CMakeFiles/3.12.1/CompilerIdCXX
./CMakeFiles/3.12.1/CompilerIdCXX/tmp
./CMakeFiles/3.12.1/CompilerIdCXX/a.out
./CMakeFiles/3.12.1/CompilerIdCXX/CMakeCXXCompilerId.cpp
./CMakeFiles/3.12.1/CompilerIdC
./CMakeFiles/3.12.1/CompilerIdC/CMakeCCompilerId.c
./CMakeFiles/3.12.1/CompilerIdC/tmp
./CMakeFiles/3.12.1/CompilerIdC/a.out
./CMakeFiles/3.12.1/CMakeDetermineCompilerABI_C.bin
./CMakeFiles/3.12.1/CMakeSystem.cmake
./CMakeFiles/gfa2fa.dir
./CMakeFiles/gfa2fa.dir/depend.make
./CMakeFiles/gfa2fa.dir/build.make
./CMakeFiles/gfa2fa.dir/link.txt
./CMakeFiles/gfa2fa.dir/progress.make
./CMakeFiles/gfa2fa.dir/CXX.includecache
./CMakeFiles/gfa2fa.dir/compiler_depend.make
./CMakeFiles/gfa2fa.dir/compiler_depend.ts
./CMakeFiles/gfa2fa.dir/flags.make
./CMakeFiles/gfa2fa.dir/DependInfo.cmake
./CMakeFiles/gfa2fa.dir/cmake_clean.cmake
./CMakeFiles/gfa2fa.dir/gfa2fa.cpp.o
./CMakeFiles/gfa2fa.dir/gfa2fa.cpp.o.d
./CMakeFiles/CMakeConfigureLog.yaml
./CMakeFiles/cmake.check_cache
./CMakeFiles/fa2gfa.dir
./CMakeFiles/fa2gfa.dir/depend.make
./CMakeFiles/fa2gfa.dir/build.make
./CMakeFiles/fa2gfa.dir/link.txt
./CMakeFiles/fa2gfa.dir/progress.make
./CMakeFiles/fa2gfa.dir/compiler_depend.make
./CMakeFiles/fa2gfa.dir/compiler_depend.ts
./CMakeFiles/fa2gfa.dir/flags.make
./CMakeFiles/fa2gfa.dir/fa2gfa.cpp.o
./CMakeFiles/fa2gfa.dir/DependInfo.cmake
./CMakeFiles/fa2gfa.dir/fa2gfa.cpp.o.d
./CMakeFiles/fa2gfa.dir/cmake_clean.cmake
./gfa2fa
./GraphUnzip
./GraphUnzip/analyse_coverage_HiC.py
./GraphUnzip/graphunzip.py
./GraphUnzip/transform_gfa.py
./GraphUnzip/input_output.py
./GraphUnzip/analyse_HiC.py
./GraphUnzip/docopt.py
./GraphUnzip/interaction_between_contigs.py
./GraphUnzip/README.md
./GraphUnzip/tests.py
./GraphUnzip/LICENSE
./GraphUnzip/solve_ambiguities.py
./GraphUnzip/solve_with_long_reads.py
./GraphUnzip/simple_unzip.py
./GraphUnzip/gfa_tangled.png
./GraphUnzip/gfa_split.png
./GraphUnzip/segment.py
./GraphUnzip/solve_with_HiC.py
./GraphUnzip/check_phasing.py
./GraphUnzip/contig_DBG.py
./GraphUnzip/determine_multiplicity.py
./GraphUnzip/trash.py
./GraphUnzip/finish_untangling.py
./call_variants.cpp
./cluster_graph.cpp
./sequence.h
./call_variants.h
./filter_variants.h
./tools.cpp
./create_new_contigs.cpp
./clean_graph.cpp
./sequence.cpp
find: ‘clipp.h’: No such file or directory

Any help would be appreciated

Thanks
Alex

ERROR: create_new_contigs failed.

Sorry for already having another bug to report! I was trying to run Hairsplitter today after the new update (one with the multiploid command, one without, both using multithreading). Hairsplitter ran so much faster this time around, and none of the previously problematic steps seemed to have issues!

However, there seems to be a new issue with STAGE 6.


Running (this one was without multiploid)

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=1376_hairsplitter-25kb-keephap-2
#SBATCH --export=ALL
#SBATCH --output=1376_hairsplitter-25kb-keephap-2.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter -t 28

Resulted in

 - Loading all reads from ../1_demul_adtrim/BC15-25kbmin.fastq in memory
 - Loading all contigs from ../8_hairsplitter/tmp/cut_assembly.gfa in memory
 - Loading alignments of the reads on the contigs from ../8_hairsplitter/tmp/reads_on_asm.sam
 - Calling variants on each contig using basic pileup
separating reads on contig CONTIG	edge_34@3	46849	117.353
separating reads on contig CONTIG	edge_14@0	8141	131.349
separating reads on contig CONTIG	edge_40@3	11455	100.049
separating reads on contig CONTIG	edge_12@0	1480	444.816
separating reads on contig CONTIG	edge_23@0	18874	219.034
separating reads on contig CONTIG	edge_15@0	21915	217.477
separating reads on contig CONTIG	edge_45@4	300000	223.291
separating reads on contig CONTIG	edge_1@0	72803	76.6896
separating reads on contig CONTIG	edge_36@0	10083	360.579
separating reads on contig CONTIG	edge_37@1	64272	233.718
separating reads on contig CONTIG	edge_46@0	300000	211.382
separating reads on contig CONTIG	edge_28@6	87656	145.68
separating reads on contig CONTIG	edge_39@0	115223	209.538
separating reads on contig CONTIG	edge_41@1	233709	216.108
separating reads on contig CONTIG	edge_16@1	93636	209.789
separating reads on contig CONTIG	edge_45@0	300000	195.72
separating reads on contig CONTIG	edge_28@5	300000	225.631
separating reads on contig CONTIG	edge_44@4	300000	221.944
separating reads on contig CONTIG	edge_45@2	300000	218.656
separating reads on contig CONTIG	edge_37@0	300000	214.827
separating reads on contig CONTIG	edge_44@1	300000	220.502
separating reads on contig CONTIG	edge_44@0	300000	231.295
separating reads on contig CONTIG	edge_28@4	300000	209.897
separating reads on contig CONTIG	edge_6@1	300000	210.705
separating reads on contig CONTIG	edge_48@0	300000	228.833
separating reads on contig CONTIG	edge_45@1	300000	211.574
separating reads on contig CONTIG	edge_22@0	2532	458.313
separating reads on contig CONTIG	edge_40@1	300000	223.89
separating reads on contig CONTIG	edge_44@3	300000	215.01
separating reads on contig CONTIG	edge_6@0	300000	215.125
separating reads on contig CONTIG	edge_34@1	300000	212.838
separating reads on contig CONTIG	edge_48@1	300000	243.961
separating reads on contig CONTIG	edge_33@0	10239	214.442
separating reads on contig CONTIG	edge_38@0	300000	209.204
separating reads on contig CONTIG	edge_44@2	300000	221.737
separating reads on contig CONTIG	edge_34@2	300000	200.835
separating reads on contig CONTIG	edge_35@0	300000	227.907
separating reads on contig CONTIG	edge_34@0	300000	198.119
separating reads on contig CONTIG	edge_45@5	299225	193.126
separating reads on contig CONTIG	edge_3@0	25469	220.44
separating reads on contig CONTIG	edge_44@6	300000	224.849
separating reads on contig CONTIG	edge_28@2	300000	218.004
separating reads on contig CONTIG	edge_44@9	99227	148.113
separating reads on contig CONTIG	edge_4@0	16120	53.9382
separating reads on contig CONTIG	edge_42@0	1252	12304.8
separating reads on contig CONTIG	edge_7@2	233248	230.62
separating reads on contig CONTIG	edge_44@5	300000	223.86
separating reads on contig CONTIG	edge_28@3	300000	226.373
separating reads on contig CONTIG	edge_47@0	262506	233.746
separating reads on contig CONTIG	edge_40@0	300000	203.257
separating reads on contig CONTIG	edge_45@3	300000	226.299
separating reads on contig CONTIG	edge_35@1	300000	226.719
separating reads on contig CONTIG	edge_41@0	300000	203.041
separating reads on contig CONTIG	edge_28@1	300000	211.533
separating reads on contig CONTIG	edge_40@2	300000	204.561
separating reads on contig CONTIG	edge_44@7	300000	228.214
separating reads on contig CONTIG	edge_32@0	159093	131.672
separating reads on contig CONTIG	edge_7@1	300000	263.177
separating reads on contig CONTIG	edge_44@8	300000	226.218
separating reads on contig CONTIG	edge_28@0	300000	232.642
separating reads on contig CONTIG	edge_16@0	300000	204.921
separating reads on contig CONTIG	edge_38@1	97516	218.851
separating reads on contig CONTIG	edge_6@2	260846	197.358
separating reads on contig CONTIG	edge_48@3	85724	104.09
separating reads on contig CONTIG	edge_48@2	300000	239.204
separating reads on contig CONTIG	edge_46@1	80433	197.154
separating reads on contig CONTIG	edge_7@0	300000	203.408
 - Creating the .gaf file describing how the reads align on the new contigs
 - Creating the new contigs
ERROR racon failed, while running racon -w 500 -e 1 -t 1 ../8_hairsplitter/tmp/reads_11.fasta ../8_hairsplitter/tmp/mapped_11.paf ../8_hairsplitter/tmp/unpolished_11.fasta > ../8_hairsplitter/tmp/polished_11.fasta 2>../8_hairsplitter/tmp/trash.txt
/users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter -t 28
HairSplitter v1.3.3 (github.com/RolandFaure/HairSplitter). Last update: 2023-08-21

	******************
	*                *
	*  Hairsplitter  *
	*    Welcome!    *
	*                *
	******************


===== STAGE 1: Cleaning graph of small contigs that are unconnected parts of haplotypes   [ 2023-08-21 14:44:48.560662 ]


 When the assemblers manage to locally phase the haplotypes, they sometimes assemble the alternative haplotype as a separate contig, unconnected in the gfa graph. This affects negatively the performance of Hairsplitter. Let's delete these contigs

 - Mapping the assembly against itself
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/clean_graph assembly_graph.gfa ../8_hairsplitter/tmp/cleaned_assembly.gfa ../8_hairsplitter ../8_hairsplitter/hairsplitter.log 28 minimap2
 - Eliminated small unconnected contigs that align on other contigs

===== STAGE 2: Aligning reads on the reference   [ 2023-08-21 14:44:50.479979 ]

 - Cutting the contigs in chunks of 300000bp to avoid memory issues
 - Converting the assembly in fasta format
 - Aligning the reads on the assembly
 - Running minimap with command line:
      minimap2 ../8_hairsplitter/tmp/cleaned_assembly.fasta ../1_demul_adtrim/BC15-25kbmin.fastq -x map-ont -a --secondary=no -t 28 > ../8_hairsplitter/tmp/reads_on_asm.sam 2> ../8_hairsplitter/tmp/logminimap.txt 
   The log of minimap2 can be found at ../8_hairsplitter/tmp/logminimap.txt

===== STAGE 3: Calling variants   [ 2023-08-21 14:46:07.495951 ]

 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/call_variants ../8_hairsplitter/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq ../8_hairsplitter/tmp/reads_on_asm.sam 28 ../8_hairsplitter/tmp ../8_hairsplitter/tmp/error_rate.txt 0 ../8_hairsplitter/tmp/variants.col ../8_hairsplitter/tmp/variants.vcf

===== STAGE 4: Filtering variants   [ 2023-08-21 14:51:21.968869 ]

 - Filtering variants
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/filter_variants ../8_hairsplitter/tmp/variants.col 0.0121198 28 0 ../8_hairsplitter/tmp/filtered_variants.col ../8_hairsplitter/tmp/variants.vcf ../8_hairsplitter/tmp/variants_filtered.vcf

===== STAGE 5: Separating reads by haplotype of origin   [ 2023-08-21 14:51:51.986796 ]

 - Separating reads by haplotype of origin
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter/tmp/filtered_variants.col 28 0.0121198 0 ../8_hairsplitter/tmp/reads_haplo.gro

===== STAGE 6: Creating all the new contigs   [ 2023-08-21 16:22:00.649424 ]

 This can take time, as we need to polish every new contig using Racon
 Running :  /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter/tmp/reads_haplo.gro ../8_hairsplitter/tmp 28 ont ../8_hairsplitter/tmp/zipped_assembly.gfa ../8_hairsplitter/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0
ERROR: create_new_contigs failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter/tmp/reads_haplo.gro ../8_hairsplitter/tmp 28 ont ../8_hairsplitter/tmp/zipped_assembly.gfa ../8_hairsplitter/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0

And running the following job script (this one used the multiploid option):

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=28
#SBATCH --account=PAS1802
#SBATCH --job-name=1376_hairsplitter-25kb-keephap-2-multiploid
#SBATCH --export=ALL
#SBATCH --output=1376_hairsplitter-25kb-keephap-2-multiploid.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-multiploid -m -t 28

This resulted in:

 - Loading all reads from ../1_demul_adtrim/BC15-25kbmin.fastq in memory
 - Loading all contigs from ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa in memory
 - Loading alignments of the reads on the contigs from ../8_hairsplitter-multiploid/tmp/reads_on_asm.sam
 - Calling variants on each contig using basic pileup
separating reads on contig CONTIG	edge_22@0	2532	458.313
separating reads on contig CONTIG	edge_15@0	21915	217.477
separating reads on contig CONTIG	edge_36@0	10083	360.579
separating reads on contig CONTIG	edge_1@0	72803	76.6896
separating reads on contig CONTIG	edge_48@3	85724	104.09
separating reads on contig CONTIG	edge_44@9	99227	148.113
separating reads on contig CONTIG	edge_48@2	300000	239.204
separating reads on contig CONTIG	edge_28@6	87656	145.68
separating reads on contig CONTIG	edge_41@1	233709	216.108
separating reads on contig CONTIG	edge_28@1	300000	211.533
separating reads on contig CONTIG	edge_28@5	300000	225.631
separating reads on contig CONTIG	edge_45@2	300000	218.656
separating reads on contig CONTIG	edge_47@0	262506	233.746
separating reads on contig CONTIG	edge_44@1	300000	220.502
separating reads on contig CONTIG	edge_48@0	300000	228.833
separating reads on contig CONTIG	edge_41@0	300000	203.041
separating reads on contig CONTIG	edge_44@8	300000	226.218
separating reads on contig CONTIG	edge_44@0	300000	231.295
separating reads on contig CONTIG	edge_28@4	300000	209.897
separating reads on contig CONTIG	edge_28@0	300000	232.642
separating reads on contig CONTIG	edge_35@1	300000	226.719
separating reads on contig CONTIG	edge_45@1	300000	211.574
separating reads on contig CONTIG	edge_7@2	233248	230.62
separating reads on contig CONTIG	edge_40@1	300000	223.89
separating reads on contig CONTIG	edge_28@2	300000	218.004
separating reads on contig CONTIG	edge_34@1	300000	212.838
separating reads on contig CONTIG	edge_32@0	159093	131.672
separating reads on contig CONTIG	edge_6@0	300000	215.125
separating reads on contig CONTIG	edge_38@1	97516	218.851
separating reads on contig CONTIG	edge_35@0	300000	227.907
separating reads on contig CONTIG	edge_40@3	11455	100.049
separating reads on contig CONTIG	edge_46@0	300000	211.382
separating reads on contig CONTIG	edge_42@0	1252	12304.8
separating reads on contig CONTIG	edge_45@5	299225	193.126
separating reads on contig CONTIG	edge_44@3	300000	215.01
separating reads on contig CONTIG	edge_45@4	300000	223.291
separating reads on contig CONTIG	edge_34@3	46849	117.353
separating reads on contig CONTIG	edge_40@2	300000	204.561
separating reads on contig CONTIG	edge_40@0	300000	203.257
separating reads on contig CONTIG	edge_34@0	300000	198.119
separating reads on contig CONTIG	edge_37@1	64272	233.718
separating reads on contig CONTIG	edge_45@0	300000	195.72
separating reads on contig CONTIG	edge_14@0	8141	131.349
separating reads on contig CONTIG	edge_28@3	300000	226.373
separating reads on contig CONTIG	edge_23@0	18874	219.034
separating reads on contig CONTIG	edge_44@6	300000	224.849
separating reads on contig CONTIG	edge_33@0	10239	214.442
separating reads on contig CONTIG	edge_6@2	260846	197.358
separating reads on contig CONTIG	edge_12@0	1480	444.816
separating reads on contig CONTIG	edge_37@0	300000	214.827
separating reads on contig CONTIG	edge_16@1	93636	209.789
separating reads on contig CONTIG	edge_6@1	300000	210.705
separating reads on contig CONTIG	edge_46@1	80433	197.154
separating reads on contig CONTIG	edge_4@0	16120	53.9382
separating reads on contig CONTIG	edge_44@4	300000	221.944
separating reads on contig CONTIG	edge_34@2	300000	200.835
separating reads on contig CONTIG	edge_44@2	300000	221.737
separating reads on contig CONTIG	edge_16@0	300000	204.921
separating reads on contig CONTIG	edge_48@1	300000	243.961
separating reads on contig CONTIG	edge_38@0	300000	209.204
separating reads on contig CONTIG	edge_44@7	300000	228.214
separating reads on contig CONTIG	edge_39@0	115223	209.538
separating reads on contig CONTIG	edge_7@0	300000	203.408
separating reads on contig CONTIG	edge_7@1	300000	263.177
separating reads on contig CONTIG	edge_45@3	300000	226.299
separating reads on contig CONTIG	edge_44@5	300000	223.86
separating reads on contig CONTIG	edge_3@0	25469	220.44
 - Creating the .gaf file describing how the reads align on the new contigs
 - Creating the new contigs
ERROR racon failed, while running racon -w 500 -e 1 -t 1 ../8_hairsplitter-multiploid/tmp/reads_11.fasta ../8_hairsplitter-multiploid/tmp/mapped_11.paf ../8_hairsplitter-multiploid/tmp/unpolished_11.fasta > ../8_hairsplitter-multiploid/tmp/polished_11.fasta 2>../8_hairsplitter-multiploid/tmp/trash.txt
/users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC15-25kbmin.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-multiploid -m -t 28
HairSplitter v1.3.3 (github.com/RolandFaure/HairSplitter). Last update: 2023-08-21

	******************
	*                *
	*  Hairsplitter  *
	*    Welcome!    *
	*                *
	******************


===== STAGE 1: Cleaning graph of small contigs that are unconnected parts of haplotypes   [ 2023-08-21 14:44:56.992370 ]


 When the assemblers manage to locally phase the haplotypes, they sometimes assemble the alternative haplotype as a separate contig, unconnected in the gfa graph. This affects negatively the performance of Hairsplitter. Let's delete these contigs

 - Mapping the assembly against itself
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/clean_graph assembly_graph.gfa ../8_hairsplitter-multiploid/tmp/cleaned_assembly.gfa ../8_hairsplitter-multiploid ../8_hairsplitter-multiploid/hairsplitter.log 28 minimap2
 - Eliminated small unconnected contigs that align on other contigs

===== STAGE 2: Aligning reads on the reference   [ 2023-08-21 14:44:58.965748 ]

 - Cutting the contigs in chunks of 300000bp to avoid memory issues
 - Converting the assembly in fasta format
 - Aligning the reads on the assembly
 - Running minimap with command line:
      minimap2 ../8_hairsplitter-multiploid/tmp/cleaned_assembly.fasta ../1_demul_adtrim/BC15-25kbmin.fastq -x map-ont -a --secondary=no -t 28 > ../8_hairsplitter-multiploid/tmp/reads_on_asm.sam 2> ../8_hairsplitter-multiploid/tmp/logminimap.txt 
   The log of minimap2 can be found at ../8_hairsplitter-multiploid/tmp/logminimap.txt

===== STAGE 3: Calling variants   [ 2023-08-21 14:46:15.601267 ]

 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/call_variants ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq ../8_hairsplitter-multiploid/tmp/reads_on_asm.sam 28 ../8_hairsplitter-multiploid/tmp ../8_hairsplitter-multiploid/tmp/error_rate.txt 0 ../8_hairsplitter-multiploid/tmp/variants.col ../8_hairsplitter-multiploid/tmp/variants.vcf

===== STAGE 4: Filtering variants   [ 2023-08-21 14:51:39.988027 ]

 - Filtering variants
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/filter_variants ../8_hairsplitter-multiploid/tmp/variants.col 0.0121198 28 0 ../8_hairsplitter-multiploid/tmp/filtered_variants.col ../8_hairsplitter-multiploid/tmp/variants.vcf ../8_hairsplitter-multiploid/tmp/variants_filtered.vcf

===== STAGE 5: Separating reads by haplotype of origin   [ 2023-08-21 14:52:10.334915 ]

 - Separating reads by haplotype of origin
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-multiploid/tmp/filtered_variants.col 28 0.0121198 0 ../8_hairsplitter-multiploid/tmp/reads_haplo.gro

===== STAGE 6: Creating all the new contigs   [ 2023-08-21 16:14:06.974230 ]

 This can take time, as we need to polish every new contig using Racon
 Running :  /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter-multiploid/tmp/reads_haplo.gro ../8_hairsplitter-multiploid/tmp 28 ont ../8_hairsplitter-multiploid/tmp/zipped_assembly.gfa ../8_hairsplitter-multiploid/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0
ERROR: create_new_contigs failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/create_new_contigs ../8_hairsplitter-multiploid/tmp/cut_assembly.gfa ../1_demul_adtrim/BC15-25kbmin.fastq 0.0121198 ../8_hairsplitter-multiploid/tmp/reads_haplo.gro ../8_hairsplitter-multiploid/tmp 28 ont ../8_hairsplitter-multiploid/tmp/zipped_assembly.gfa ../8_hairsplitter-multiploid/tmp/reads_on_new_contig.gaf 0 minimap2 racon 0

Both runs end in what appears to be the same error, so I don't think the multiploid argument is involved. It's definitely not something I encountered previously, and this is the same dataset I ran last week; the only difference is multithreading (but I'm not certain multithreading is at fault here). I did look at the commits and noticed that src/tools.cpp was changed to allow an exit() during polishing, but since the Minimap2 step (which was also given an exit()) ran without a problem, I don't know why Racon would end up having an issue.

An oddity I just noticed during this: the variants_filtered.vcf file never seems to have much added to it, even in my successful runs of Hairsplitter. All it seems to contain is:

##fileformat=VCFv4.2
##source=call_variants
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
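As a quick way to confirm the file really is header-only, one can count the non-header records (a minimal sketch; this is just the Python equivalent of `grep -vc '^#'` on the VCF):

```python
import io

def count_vcf_records(handle):
    """Count non-header records in a VCF stream (lines not starting with '#')."""
    return sum(1 for line in handle if line.strip() and not line.startswith("#"))

# A header-only VCF like the one above yields 0 records.
header_only = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "##source=call_variants\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
print(count_vcf_records(header_only))  # 0
```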

Lastly, I don't know if it's helpful, but it looks like Hairsplitter tried to make a /tmp/ directory and a trash.txt file in my assembly folder (i.e. in the /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1376/2_flye_assembly-keephap-25kb-2/ directory). I don't know if it has done that every time and simply deleted them later, or if this is a new bug and it happened to leave the files there because Hairsplitter died before it could remove them. There are also a lot more files in the proper /8_hairsplitter/tmp/tmp/ directory than there were previously, as if it wasn't deleting files as it ran:
[screenshot: contents of the tmp directory]

Raven fails

Hi.

This is a great implementation.
When running Hairsplitter, however, it fails during the Raven reassemblies after a while (see the attached picture).

$python hairsplitter.py -f SRR27955205.fastq -i metaflye_assembly.fasta -x ont -o outdir -t 20

I am not sure what the cause is, so I would appreciate it if you could include a tiny test dataset in the repository or, if possible, write up an example run on some public data.

Kazuma Uesaka

[screenshot: error message]

ERROR: call_variants failed

Hi Roland,

I still haven't been able to run the pipeline successfully. I am getting an error that the variant-calling step failed, and also the warnings about the read headers again. Please have a look:

assembly=/mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.chrom.interleaved.fasta
reads=/mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.ONT.BXQ_G.merged.filt.40000.90.1000.fastq
outdir=/mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom

python3 /mnt/sda1/Alex/software/Hairsplitter/hairsplitter.py -i $assembly -f $reads -x ont -t 12 -o $outdir -F 
/mnt/sda1/Alex/software/Hairsplitter/hairsplitter.py -i /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.chrom.interleaved.fasta -f /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.ONT.BXQ_G.merged.filt.40000.90.1000.fastq -x ont -t 12 -o /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom -F
HairSplitter v1.3.2 (github.com/RolandFaure/HairSplitter). Last update: 2023-08-11

	******************
	*                *
	*  Hairsplitter  *
	*    Welcome!    *
	*                *
	******************


===== STAGE 1: Cleaning graph of small contigs that are unconnected parts of haplotypes   [ 2023-08-11 17:11:11.723509 ]


 When the assemblers manage to locally phase the haplotypes, they sometimes assemble the alternative haplotype as a separate contig, unconnected in the gfa graph. This affects negatively the performance of Hairsplitter. Let's delete these contigs

 - Mapping the assembly against itself
 Running:  /mnt/sda1/Alex/software/Hairsplitter/src/build/clean_graph /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/assembly.gfa /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/cleaned_assembly.gfa /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/hairsplitter.log 12 minimap2
 - Eliminated small unconnected contigs that align on other contigs

===== STAGE 2: Aligning reads on the reference   [ 2023-08-11 17:11:36.097181 ]

 - Converting the assembly in fasta format
 - Aligning the reads on the assembly
 - Running minimap with command line:
      minimap2 /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/cleaned_assembly.fasta /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.ONT.BXQ_G.merged.filt.40000.90.1000.fastq -x map-ont -a --secondary=no -t 12 > /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/reads_on_asm.sam 2> /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/logminimap.txt 
   The log of minimap2 can be found at /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/logminimap.txt

===== STAGE 3: Calling variants   [ 2023-08-11 17:24:35.077202 ]

 Running:  /mnt/sda1/Alex/software/Hairsplitter/src/build/call_variants /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/cleaned_assembly.gfa /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.ONT.BXQ_G.merged.filt.40000.90.1000.fastq /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/reads_on_asm.sam 12 /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/error_rate.txt 0 /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/variants.col /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/variants.vcf
 - Loading all reads from /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.ONT.BXQ_G.merged.filt.40000.90.1000.fastq in memory
 - Loading all contigs from /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/cleaned_assembly.gfa in memory
 - Loading alignments of the reads on the contigs from /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/reads_on_asm.sam
 - Calling variants on each contig using basic pileup
double free or corruption (out)
Aborted (core dumped)
ERROR: call_variants failed. Was trying to run: /mnt/sda1/Alex/software/Hairsplitter/src/build/call_variants /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/cleaned_assembly.gfa /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/Adineta_ricciae.ONT.BXQ_G.merged.filt.40000.90.1000.fastq /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/reads_on_asm.sam 12 /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/error_rate.txt 0 /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/variants.col /mnt/sda1/Alex/16.PHASED_ASSEMBLIES_HETEROZYGOSITY/Adineta_ricciae/hairsplitter_aricciae_chrom/tmp/variants.vcf

Fasta assembly not recognized

Hello,

Thanks for developing such useful tools.

For some reason my assembly file is not recognized:

python3 hairsplitter.py -i $assembly -f $reads -x ont -t 12 -o $outdir

	******************
	*                *
	*  Hairsplitter  *
	*    Welcome!    *
	*                *
	******************

ERROR: Assembly file must be in GFA or FASTA format. File extension not recognized.

I have tried with both .fasta and .fa extensions. Any idea what the issue is?
I am using

HairSplitter v1.3.0 (RolandFaure/HairSplitter). Last update: 2023-07-20

Thanks
Alex

High RAM Usage during "separate_reads" & GraphUnzip Output Fragmented

EDIT:

RAM usage was solved by my decision to remove all reads below 25 kb - it now uses far more reasonable amounts of RAM (I think it capped out at ~5-10 GB at one point), meaning the shorter reads were absolutely the issue on that front. The GraphUnzip step, unfortunately, is still resulting in fragmented assemblies, even when using the 25+ kb read file (the zipped_assembly.gfa also still looks great).


I've been looking at some of the outputs now that I've managed to get HairSplitter to run all the way to the end, and I've found a few more things I have questions about or have noticed.


First, and this is minor because I'm able to get the information via the job output log from the supercomputing cluster I use, the hairsplitter.log file in my output directory (not the /tmp/) only includes information from Stage 1. For example:

STAGE 1 : Deleting contigs that look like parts of haplotypes but that are disconnected from the rest of the assembly
Suppressing contig edge_34 because it is disconnected from the rest of the graph and contained in edge_32
Suppressing contig edge_54 because it is disconnected from the rest of the graph and contained in edge_51

Again, it's not that big of an issue (for me at least) because I have an output log elsewhere from my job script, but it's just something I've noticed.


Second, during the "separate_reads" step, RAM usage seems to be incredibly high with large long-read files, making it hard to predict how much RAM/memory you will need to request/allocate for Hairsplitter. Running

#!/bin/bash
#SBATCH --time=08:00:00
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=48 --gpus-per-node=2 --partition=gpuserial-48core
#SBATCH --account=PAS1802
#SBATCH --job-name=1739-hairsplitter-DualCore-semiLONG
#SBATCH --export=ALL
#SBATCH --output=1739-hairsplitter-DualCore-semiLONG.out.%j
module load cmake/3.25.2
module load gnu/11.2.0
source /users/PAS1802/woodruff207/miniconda3/bin/activate
conda activate hairsplitter_env
cd /fs/ess/PAS1802/ALW/2023_06_15-MAY1376_TLOKOs_LongRead/1739/2_flye_assembly-keephap-2/
python /users/PAS1802/woodruff207/Hairsplitter/hairsplitter.py -f ../1_demul_adtrim/BC16.fastq -i assembly_graph.gfa -x ont -o ../8_hairsplitter-DualCore-semiLONG
(which gives it ~370 GB of RAM to use on the supercomputer) gave an out-of-memory error at the separate_reads step:
===== STAGE 5: Separating reads by haplotype of origin   [ 2023-08-17 03:41:59.941568 ]

 - Separating reads by haplotype of origin
 Running:  /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-DualCore-semiLONG/tmp/filtered_variants.col 1 0.0059046 0 ../8_hairsplitter-DualCore-semiLONG/tmp/reads_haplo.gro
ERROR: separate_reads failed. Was trying to run: /users/PAS1802/woodruff207/Hairsplitter/src/build/separate_reads ../8_hairsplitter-DualCore-semiLONG/tmp/filtered_variants.col 1 0.0059046 0 ../8_hairsplitter-DualCore-semiLONG/tmp/reads_haplo.gro
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=23613304.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The BC16.fastq file was 8.6 GB large, and the input contigs from this run were from a .gfa file, where the largest contigs were: 1,886,358 bp; 1,780,637 bp; 1,381,840 bp; and 1,243,546 bp (the rest were below 1,000,000 bp).

Subsampling the file to reduce its size worked for me previously, but I'm concerned that doing so might remove larger reads that could help improve the detection or incorporation of heterozygous deletions or inversions (assuming that's something HairSplitter is capable of when generating new contigs). Additionally, given how long it takes to run, experimenting with settings is hard: there are a few hours of waiting for a job slot to open, and then more waiting to see whether the job has enough memory.
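Since subsampling comes up above: a minimal, reproducible sketch of randomly subsampling a plain 4-line FASTQ in pure Python (the seed and the record parsing are my own choices here, not anything HairSplitter itself does):

```python
import io
import random

def read_fastq(handle):
    """Yield 4-line FASTQ records as (header, sequence, plus, quality) tuples."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def subsample_fastq(records, fraction, seed=0):
    """Keep each record independently with probability `fraction` (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]

# Example with two tiny reads; fraction 1.0 keeps everything.
data = io.StringIO("@r1\nACGT\n+\n!!!!\n@r2\nGGGG\n+\n!!!!\n")
reads = list(read_fastq(data))
print(len(subsample_fastq(reads, 1.0)))  # 2
```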


Third, for my two runs that have worked (one was using my original haploid assembly, the second was using a .gfa file from Flye output, both used the subsampled long read file), it seems like GraphUnzip (I think it's GraphUnzip at least) is acting oddly compared to your poster/presentation. From your poster, your before and after images look great and the phased chromosomes remain adjacent/connected:
[screenshot: before/after assembly graphs from the poster]

However, when I ran it myself, the end result (hairsplitter_final_assembly.gfa and hairsplitter_final_assembly.fasta) for both the haploid assembly and the .gfa Flye assembly was heavily fragmented (the following is from my .gfa assembly, but it closely mirrors what I saw for the haploid assembly):
[screenshot: fragmented final assembly graph]

What's most confusing to me is what I found when I checked the /tmp/ directory - there are two big things of note. First, the logGraphUnzip.txt file is empty, so there's no information on what exactly it was doing. Second, I checked the zipped_assembly.gfa file, and it looks like it had actually done a great job at phasing portions of the genome:
[screenshot: phased regions in zipped_assembly.gfa]

When I check the hairsplitter_final_assembly.gfa and hairsplitter_final_assembly.fasta files, they have done an EXCELLENT job at phasing SNPs (relative to a current phased assembly in the field, and apart from missing some heterozygous deletions/inversions) - as I mentioned, it just seems like GraphUnzip somehow heavily fragmented the resulting assembly, rather than maintaining the previous connections like your poster seems to have done. My long read file does have a number of shorter reads (shorter being between ~300-1000 bp) - could those possibly be interfering with the unzipping step somehow? Instead of randomly subsampling long reads, should I instead selectively remove reads shorter than ~2000-2500 bp, since Hairsplitter/GraphUnzip seems to be using ~2000 bp segments (plus or minus a few bases) for phasing?


UPDATE:

I'm going to experiment with a pipeline that drops either all reads below 10kb or 25kb, run Flye to assemble a draft genome, then try running that through Hairsplitter. Looking at the stats of the .fastq files using seqkit stats [NAME].fastq makes me feel fairly confident that the shorter reads could be choking Hairsplitter as it tries to draw upon the reads:

$ seqkit stats BC15.fastq 
file        format  type   num_seqs        sum_len  min_len  avg_len  max_len
BC15.fastq  FASTQ   DNA   1,738,400  6,814,432,181        1  3,919.9  200,922

$ seqkit stats BC15-10kbmin.fastq 
file                format  type  num_seqs        sum_len  min_len  avg_len  max_len
BC15-10kbmin.fastq  FASTQ   DNA    169,356  4,732,147,553   10,000   27,942  200,922

$ seqkit stats BC15-25kbmin.fastq 
file                format  type  num_seqs        sum_len  min_len  avg_len  max_len
BC15-25kbmin.fastq  FASTQ   DNA     80,732  3,294,751,518   25,000   40,811  200,922

Hopefully, after removing all of those shorter length reads, this might solve a few more problems.
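The length filtering described above can be reproduced in pure Python (a minimal sketch, equivalent to what `seqkit seq -m 25000` does, assuming plain 4-line FASTQ records):

```python
import io

def read_fastq(handle):
    """Yield 4-line FASTQ records as (header, sequence, plus, quality) tuples."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            return
        yield tuple(lines)

def filter_min_length(records, min_len):
    """Keep records whose sequence is at least min_len bp long."""
    return [rec for rec in records if len(rec[1]) >= min_len]

# One 3 bp read and one 30 bp read; a 10 bp cutoff keeps only the long one.
data = io.StringIO("@short\nACG\n+\n!!!\n@long\n" + "A" * 30 + "\n+\n" + "!" * 30 + "\n")
kept = filter_min_length(list(read_fastq(data)), 10)
print([rec[0] for rec in kept])  # ['@long']
```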

Output explanation - Hairsplitter

Hello Roland,

I was wondering how to interpret the output of Hairsplitter

 - Between positions 96000 and 101999 of the contig, I've created these contigs:
   - Scaffold_1@1_96000_1
   - Scaffold_1@1_96000_0
 - Between positions 102000 and 105999 of the contig, I've created these contigs:
   - Scaffold_1@1_102000_2
   - Scaffold_1@1_102000_1
   - Scaffold_1@1_102000_0

Does this mean that Hairsplitter outputs different ploidies for different parts of the original assembly? You can see here that both intervals are on the same original scaffold (Scaffold_1). It was not very clear from the documentation that the software computes ploidy separately for different parts of the original scaffold. Does the above mean that it detects 2 haplotypes in the first interval but 3 haplotypes in the second interval (possible aneuploidy)?

Is there a way to see which supercontigs are different alleles of other supercontigs in the final assembly?
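One ad hoc way to recover the allele grouping is to parse the supercontig names themselves. A minimal sketch, assuming the naming scheme `<contig>@<chunk>_<start>_<haplotype>` inferred from the log above (this scheme is not confirmed by the documentation):

```python
import re
from collections import defaultdict

# ASSUMPTION: supercontig names look like Scaffold_1@1_96000_0, i.e.
# <contig>@<chunk>_<interval start>_<haplotype index>.
NAME_RE = re.compile(r"^(?P<interval>.+@\d+_\d+)_(?P<hap>\d+)$")

def group_alleles(names):
    """Group supercontig names that look like alternative haplotypes of the same interval."""
    groups = defaultdict(list)
    for name in names:
        m = NAME_RE.match(name)
        groups[m.group("interval") if m else name].append(name)
    return dict(groups)

print(group_alleles([
    "Scaffold_1@1_96000_1", "Scaffold_1@1_96000_0",
    "Scaffold_1@1_102000_2", "Scaffold_1@1_102000_1", "Scaffold_1@1_102000_0",
]))
```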

Thanks
Alex

ERROR: GraphUnzip failed.

It moved further along this time, but it looks like both my runs failed at the GraphUnzip step. I will note that I used a .gfa file output from Flye as the assembly input, BUT I had modified it to remove some problematic telomeric sequence, which might have caused an issue. The file opened up fine in Bandage, and Hairsplitter made a cleaned_assembly.gfa file just fine, so I'm not sure if that contributed or not. Just so it's up at the top, though, the zipped_assembly.gfa files look fantastic now (apart from a minor issue which I discuss near the bottom of this post).

Both of the runs (one multiploid, the other not) ended with:

===== STAGE 7: Untangling (~scaffolding) the new assembly graph to improve contiguity   [ 2023-08-23 14:00:51.693528 ]

 - Running GraphUnzip with command line:
      python /users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py unzip -l ../8_hairsplitter/tmp/reads_on_new_contig.gaf -g ../8_hairsplitter/tmp/zipped_assembly.gfa -o ../8_hairsplitter/hairsplitter_final_assembly.gfa --meta 2>../8_hairsplitter/tmp/logGraphUnzip.txt >../8_hairsplitter/tmp/trash.txt 
   The log of GraphUnzip is written on  ../8_hairsplitter/tmp/logGraphUnzip.txt

ERROR: GraphUnzip failed. Please check the output of GraphUnzip in ../8_hairsplitter/tmp/logGraphUnzip.txt

For the non-multiploid run

The logGraphUnzip.txt file was empty, but trash.txt had this inside:

Loading the GFA file
Loading contigs
WARNING: contig  ['edge_1@0_72804_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_1@0_72804_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_12@0_1481_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_12@0_1481_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_45@4_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_45@4_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_22@0_2533_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_22@0_2533_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_14@0_8142_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_14@0_8142_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_3@0_25470_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_3@0_25470_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_33@0_10240_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_33@0_10240_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_38@1_97517_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_38@1_97517_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_16@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_16@0_300001_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_23@0_18875_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_23@0_18875_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_15@0_21916_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_15@0_21916_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_16@1_93637_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_16@1_93637_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_4@0_16121_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_4@0_16121_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_46@1_80434_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_46@1_80434_0']  has length = 0. This might infer in handling the coverage
WARNING: contig  ['edge_34@0_300001_0']  has no readable coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
WARNING: contig  ['edge_34@0_300001_0']  has length = 0. This might infer in handling the coverage
[... the same pair of warnings repeats for each of the 67 affected contigs (edge_45@5, edge_39@0, edge_41@0, edge_32@0, edge_34@1-3, edge_6@0-2, edge_48@0-3, edge_37@0-1, edge_45@0-3, edge_44@0-9, edge_46@0, edge_28@0-6, edge_40@0-3, edge_35@0-1, edge_47@0, edge_36@0, edge_42@0, edge_38@0, edge_7@0-2, ...) ...]
WARNING:  67  contigs out of  1525  had no coverage information or coverage=0. If this is a widespread issue, please use --conservative mode
================

Everything loaded, moving on to untangling the graph

================

*Untangling the graph using long reads*

Reading the gaf file...
Finished going through the gaf file.
here are all the links of  edge_44@9_0_0  :  [['edge_44@8_300001_0']]   [1]
Here is the gaf line :  ('aadd4b75-6dae-421f-a428-8bd1744e25f8', '<edge_44@9_0_0<edge_44@8_296000_0')
WARNING: discrepancy between what's found in the alignment files and the inputted GFA graph. Link  ['edge_44@9_0_0', 'edge_44@8_296000_0'] <<  not found in the gfa
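The warning above says a link traversed by a read in the GAF file has no matching L-line in the GFA (note the path uses `edge_44@8_296000_0` while the graph only knows `edge_44@8_300001_0`). As a quick way to reproduce that check outside HairSplitter, here is a minimal sketch; the function names are mine, not GraphUnzip's, and it assumes GAF paths use the usual `<`/`>` orientation syntax:

```python
import re

def gfa_links(gfa_lines):
    """Collect undirected links from the L-lines of a GFA."""
    links = set()
    for line in gfa_lines:
        if line.startswith("L\t"):
            f = line.split("\t")
            links.add(frozenset((f[1], f[3])))  # the two segment names
    return links

def missing_links(gaf_path, links):
    """Return consecutive node pairs of a GAF path absent from the GFA links.

    gaf_path looks like "<edge_44@9_0_0<edge_44@8_296000_0": nodes are
    separated by orientation characters '<' and '>'.
    """
    nodes = [n for n in re.split("[<>]", gaf_path) if n]
    return [(a, b) for a, b in zip(nodes, nodes[1:])
            if frozenset((a, b)) not in links]
```

Running this on the line from the log flags exactly the pair reported in the warning, which suggests the GAF was produced against a slightly different (re-chunked) version of the graph.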

For the multiploid run

The logGraphUnzip.txt file contained the following (trash.txt looked very similar to the log above, but stopped at the "Finished going through the gaf file" line):

Traceback (most recent call last):
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py", line 436, in <module>
    main()
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/graphunzip.py", line 388, in main
    segments = simple_unzip(segments, names, lrFile)
  File "/users/PAS1802/woodruff207/Hairsplitter/src/GraphUnzip/simple_unzip.py", line 253, in simple_unzip
    if best_pair_for_each_left_link[p[0]][0] < pairs[p] :
IndexError: list index out of range
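The failing line compares `best_pair_for_each_left_link[p[0]][0]` against `pairs[p]`, and the IndexError means `p[0]` falls outside the list (plausibly for one of the contigs with no coverage/zero depth). Purely as a hypothetical illustration of a defensive guard, not a claim about GraphUnzip's intended semantics:

```python
def safe_compare(best_pair_for_each_left_link, pairs, p):
    """Guarded version of the comparison that crashed in simple_unzip.py.

    If p[0] has no recorded best pair yet (index out of range or empty
    entry), return False instead of raising IndexError. Whether False is
    the right fallback depends on GraphUnzip's internals; this only shows
    where the bounds check would go.
    """
    key = p[0]
    if key >= len(best_pair_for_each_left_link) or not best_pair_for_each_left_link[key]:
        return False
    return best_pair_for_each_left_link[key][0] < pairs[p]
```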

Looks like it's almost there! The zipped_assembly.gfa from both the multiploid and non-multiploid runs looks excellent now: the contigs in those files are MUCH larger, and I checked them against the field's current phased reference assembly (which has a number of issues, but is at least decent for checking phasing). The homologous contigs in zipped_assembly.gfa appear to be VERY well phased, since each agrees clearly better with only one of the phased references (for example, edge_40@2_144000_1 matches Chr4A at 99.7% and Chr4B at 98.8%, while edge_40@2_144000_0 matches Chr4B at 99.6% and Chr4A at 98.7%). This is very exciting, especially because some of the phased contigs HairSplitter generates in zipped_assembly.gfa are over 150,000 bp long!

There is one issue with the zipped_assembly.gfa file, though: every single contig in it, when visualized in Bandage, shows "Depth: 0.0X". Maybe that is contributing to the crash? I'm not sure.
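The "Depth: 0.0X" display and the "no readable coverage information" warnings both point at missing depth annotations on the GFA segments. Assemblers commonly record per-contig coverage as a `dp:f:` (or `DP:f:`) optional tag on S-lines, and Bandage reads depth from such tags; assuming that is also where HairSplitter looks (an assumption on my part), a quick way to list the contigs lacking the tag is:

```python
def contigs_without_depth(gfa_lines):
    """Return names of GFA S-line segments with no dp:f:/DP:f: depth tag.

    This assumes depth is stored as an optional tag on the S-line, which
    is a common convention but not guaranteed for every assembler.
    """
    missing = []
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        name = fields[1]
        has_depth = any(f.startswith(("dp:f:", "DP:f:")) for f in fields[3:])
        if not has_depth:
            missing.append(name)
    return missing
```

If every segment of zipped_assembly.gfa comes back in `missing`, the depth tags are simply not being written after the unzipping step, which would explain both symptoms.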

From the error logs, it looks like only GraphUnzip is failing now; every other file, apart from those that should be produced after the GraphUnzip step, looks good.

If you need me to provide any files to try to solve this one, just let me know!
