zhangrengang / subphaser

Phase, partition and visualize subgenomes of a neoallopolyploid or hybrid based on the subgenome-specific repetitive kmers.

Home Page: https://doi.org/10.1111/nph.18173

License: GNU General Public License v3.0

Python 99.03% Perl 0.17% Shell 0.80%

allopolyploid subgenome kmer partition phasing exchange

subphaser's Introduction

Quick install and start

git clone https://github.com/zhangrengang/SubPhaser
cd SubPhaser

# install
conda env create -f SubPhaser.yaml
conda activate SubPhaser
python setup.py install

# start
cd example_data
# small genome    (Arabidopsis_suecica: 270Mb)
bash test_Arabidopsis.sh
# medium genome   (peanut: 2.6Gb)
bash test_peanut.sh
# large genome    (wheat: 14Gb)
bash test_wheat.sh

Introduction
Inputs
Run SubPhaser
Run SubPhaser through Singularity/Apptainer
Outputs
When SubPhaser does not work
Citation
Applications
Contact
Full Usage and Default Parameters

Introduction

For many allopolyploid species, their diploid progenitors are unknown or extinct, making it impossible to unravel their subgenomes. Here, we develop SubPhaser to phase subgenomes, by using repetitive kmers as the "differential signatures", assuming that repetitive sequences (mainly transposable elements) were expanded across chromosomes in the progenitors' independently evolving period. The tool also identifies genome-wide subgenome-specific regions and long terminal repeat retrotransposons (LTR-RTs), which will provide insights into the evolutionary history of allopolyploidization.

For details of the methods and benchmarking results of SubPhaser, please see the paper in New Phytologist and its Supplementary Material, which includes performance on dozens of chromosome-level neoallopolyploid/hybrid genomes published before October 2021.

SubPhaser has four main modules:

The core module to phase subgenomes:
- Count kmers by jellyfish.
- Identify the differential kmers among homoeologous chromosome sets.
- Cluster into subgenomes by a K-Means algorithm and estimate confidence level by the bootstrap.
- Evaluate whether subgenomes are successfully phased by hierarchical clustering and principal component analysis (PCA).
The module to identify and test the enrichments of subgenome-specific kmers:
- Identify subgenome-specific kmers.
- Identify significant enrichments of subgenome-specific kmers by genome window/bin, which is useful for identifying inter-subgenomic exchanges (refer to Supplementary Material for identifying bona fide exchanges) and/or assembly errors (e.g. switch errors and hamming errors).
- Identify subgenome-specific enrichments with user-defined features (e.g. transposable elements, genes) via -custom_features.
The LTR module to identify and analyze subgenome-specific LTR-RT elements (disable with -disable_ltr):
- Identify the LTR-RTs by LTRharvest and/or LTRfinder (time-consuming for large genomes, especially LTRfinder).
- Classify the LTR-RTs by TEsorter.
- Identify subgenome-specific LTR-RTs by testing the enrichment of subgenome-specific kmers.
- Estimate the insertion age of subgenome-specific LTR-RTs, which is helpful to estimate the time of divergence–hybridization period(s) (the period in which the progenitors are evolving independently; refer to Supplementary Material for estimating the time period).
- Reconstruct phylogenetic trees of subgenome-specific LTR/Gypsy and LTR/Copia elements, which is helpful to infer the evolutionary history of these LTR-RTs (disable with -disable_ltrtree; time-consuming for large genomes).
The visualization module to visualize genome-wide data (disable with -disable_circos):
- Identify the homoeologous blocks with a simple minimap2 alignment (disable with -disable_blocks; time-consuming for large genomes).
- Integrate and visualize the whole genome-wide data by circos.
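
The insertion-age step of the LTR module can be illustrated with the formula commonly used for LTR-RTs, T = K / (2μ), where K is the divergence between the two LTRs of one element and μ is the substitution rate per site per year (the `-mu` option, default 1.3e-08). The sketch below assumes a Jukes-Cantor correction for K. Whether SubPhaser applies exactly this correction is an assumption, and the function names are hypothetical.

```python
import math

def jukes_cantor_distance(p):
    """Correct an observed proportion of differing sites p (0 <= p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

def insertion_age(p, mu=1.3e-8):
    """Estimate the insertion age in years as T = K / (2 * mu)."""
    return jukes_cantor_distance(p) / (2 * mu)

# Two LTRs of one element that differ at 2% of aligned sites
age = insertion_age(0.02)
print(f"{age / 1e6:.2f} million years")
```

A larger `-mu` gives proportionally younger ages, so the rate should be chosen for the lineage under study.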

Below is an example of the output figures for wheat (ABD, 1n=3x=21):

Figure. Phased subgenomes of allohexaploid bread wheat genome. Colors are unified with each subgenome in subplots B-F, i.e. the same color means the same subgenome.

(A) The histogram of differential k-mers among homoeologous chromosome sets.
(B) Heatmap and clustering of differential k-mers. The x-axis, differential k-mers; y-axis, chromosomes. The vertical color bar shows the subgenome to which each chromosome is assigned; the horizontal color bar shows the subgenome to which each k-mer is specific (blank for non-specific k-mers).
(C) Principal component analysis (PCA) of differential k-mers. Points indicate chromosomes.
(D) Chromosomal characteristics (window size: 1 Mb). Rings from outer to inner:
- (1) Subgenome assignments by a k-Means algorithm.
- (2) Significant enrichment of subgenome-specific k-mers (blank for non-enriched windows).
- (3) Normalized proportion of subgenome-specific k-mers.
- (4-6) Density distribution (count) of each subgenome-specific k-mer set.
- (7) Density distribution (count) of subgenome-specific LTR-RTs and other LTR-RTs (the outermost, in grey).
- (8) Homoeologous blocks of each homoeologous chromosome set.
(E) Insertion time of subgenome-specific LTR-RTs.
(F) A phylogenetic tree of 1,000 randomly subsampled LTR/Gypsy elements.

Note: On the clustering heatmap (Fig. B) and PCA plot (Fig. C), a subgenome is considered well-phased if it shows clearly distinguishable patterns of both differential k-mers and homoeologous chromosomes, indicating that each subgenome shares subgenome-specific features as expected. If the subgenomes are not well-phased, the downstream analyses may fail, and any results they produce are meaningless and should be ignored. Sometimes just a few abnormal chromosomes are mistakenly assigned by the k-Means method, as judged from the heatmap, PCA and/or circos plots. In this case, users can manually adjust the subgenome assignments (edit and rename the *chrom-subgenome.tsv file) and then feed the file to SubPhaser with the -sg_assigned option for downstream analysis.

Inputs

Chromosome-level genome sequences (fasta format), e.g. the wheat genome (haploid assembly, ABD, 1n=3x=21). Note: do not use a genome hard-masked by RepeatMasker, as SubPhaser depends on repeat sequences.
Configuration of homoeologous chromosome sets, e.g.

Chr1A   Chr1B   Chr1D                      # each row is one homoeologous chromosome set
Chr2B   Chr2A   Chr2D                      # chromosome order within a row is arbitrary and has no effect
Chr3D   Chr3B   Chr3A                      # separate with blank character(s)
Chr4B   Chr4D   Chr4A
5A|Chr5A   5B|Chr5B   5D|Chr5D             # will rename chromosome id to 5A, 5B and 5D, respectively
Chr6A,Chr7A   Chr6B,Chr7B   Chr6D,Chr7D    # will treat multiple chromosomes together using ","

If some homoeologous relationships are ambiguous, they can be placed as singletons that will not be used to identify differential kmers. For example:

Chr1A   Chr1B   Chr1D
Chr2B   Chr2A   Chr2D
Chr3D   Chr3B   Chr3A
Chr4B   Chr4D
Chr4A					# singleton(s) are skipped when identifying differential kmers
...
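
The config rules above (blank-separated columns, `|` for renaming, `,` for grouping, single-column rows as singletons) can be sketched as a small parser. This is an illustration written from the format description only, not SubPhaser's own parser; `parse_sg_config` and its return shape are hypothetical.

```python
def parse_sg_config(text, sep="|"):
    """Parse config rows into (sets, singletons, renames)."""
    sets, singletons, renames = [], [], {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line == "...":
            continue
        row = []
        for column in line.split():            # columns are separated by blanks
            group = []
            for item in column.split(","):     # "," treats chromosomes as one unit
                if sep in item:                # "new|old" renames old to new
                    new_id, old_id = item.split(sep, 1)
                    renames[old_id] = new_id
                    item = new_id
                group.append(item)
            row.append(group)
        # a row with a single column is a singleton
        (sets if len(row) > 1 else singletons).append(row)
    return sets, singletons, renames

example = """\
Chr1A   Chr1B   Chr1D
5A|Chr5A   5B|Chr5B   5D|Chr5D
Chr6A,Chr7A   Chr6B,Chr7B   Chr6D,Chr7D
Chr4A    # singleton
"""
sets, singletons, renames = parse_sg_config(example)
print(sets)
print(singletons, renames)
```

Running a check like this before a long job makes it easy to spot a config in which every row ended up as a singleton.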

[Optional] Sequences of genomic features (fasta format, with -custom_features): Any sequences of genomic features, such as transposable elements (TEs), long terminal repeat retrotransposons (LTR-RTs), simple repeats and genes, could be fed to identify the subgenome-specific ones.

Run SubPhaser

Run with default parameters:

subphaser -i genome.fasta.gz -c sg.config

Run with just the core algorithm enabled:

subphaser -i genome.fasta.gz -c sg.config -just_core

subphaser -i genome.fasta.gz -c sg.config -disable_ltr -disable_circos

Change key parameters when differential kmers are too few (see Fig. A):

subphaser -i genome.fasta.gz -c sg.config -k 15 -q 50 -f 2
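
To see why lowering `-q` (or `-f`) yields more differential kmers, the sketch below applies a simplified version of the two filters. It assumes a kmer is kept when its total count reaches `min_freq` and, within every homoeologous set, the top count is at least `min_fold` times the sub-maximum (the `-baseline 1` behaviour). SubPhaser's real filter works on proportions by default and may differ in detail; `is_differential` is a hypothetical name.

```python
def is_differential(counts_by_set, min_freq=200, min_fold=2):
    """counts_by_set holds one list of per-chromosome counts per homoeologous set."""
    total = sum(sum(counts) for counts in counts_by_set)
    if total < min_freq:                       # analogous to -q
        return False
    for counts in counts_by_set:
        if len(counts) < 2:                    # singletons carry no fold information
            continue
        ranked = sorted(counts, reverse=True)
        top, sub_max = ranked[0], ranked[1]
        if top == 0 or top < min_fold * sub_max:   # analogous to -f
            return False
    return True

abundant_specific = [[120, 10, 5], [90, 8, 12]]
abundant_shared = [[60, 50, 40], [70, 45, 30]]
rare_specific = [[30, 2, 1], [25, 3, 0]]

print(is_differential(abundant_specific))
print(is_differential(abundant_shared))
print(is_differential(rare_specific))
print(is_differential(rare_specific, min_freq=50))
```

The last kmer is rejected only by the count threshold, which is the situation where reducing `-q` helps.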

Multiple genomes (e.g. two related species):

subphaser -i genomeA.fasta.gz genomeB.fasta.gz -c sg.config

Multiple config files:

subphaser -i genome.fasta.gz -c sg1.config sg2.config

Input custom feature (e.g. transposable element, gene) sequences for subgenome-specific enrichments:

subphaser -i genome.fasta.gz -c sg.config -custom_features TEs.fasta genes.fasta

Define custom colors for subgenomes:

subphaser -i genome.fasta.gz -c sg.config -colors "#f9c00c,#00b9f1,#7200da"

Run SubPhaser through Singularity/Apptainer

Alternatively, you can run SubPhaser through a Singularity/Apptainer container:

# install
apptainer remote add --no-login SylabsCloud cloud.sylabs.io
apptainer remote use SylabsCloud
apptainer pull subphaser.sif library://shang-hongyun/collection/subphaser.sif:1.2.6

# run
./subphaser.sif subphaser -h

Outputs

phase-results/
├── k15_q200_f2.circos/                # config and data files for circos plot, so developers are able to re-plot with some custom modifications
├── k15_q200_f2.kmer_freq.pdf          # histogram of differential kmers, useful to adjust option `-q`
├── k15_q200_f2.kmer.mat               # differential kmer matrix (m kmer × n chromosome)
├── k15_q200_f2.kmer.mat.pdf           # heatmap of the kmer matrix
├── k15_q200_f2.kmer.mat.R             # R script for the heatmap plot
├── k15_q200_f2.kmer_pca.pdf           # PCA plot of the kmer matrix
├── k15_q200_f2.chrom-subgenome.tsv    # subgenome assignments and bootstrap values
├── k15_q200_f2.sig.kmer-subgenome.tsv # subgenome-specific kmers
├── k15_q200_f2.bin.enrich             # subgenome-specific enrichments by genome window/bin
├── k15_q200_f2.bin.group              # grouped bins by potential exchanges based on enrichments
├── k15_q200_f2.ltr.enrich             # subgenome-specific LTR-RTs
├── k15_q200_f2.ltr.insert.pdf         # density plot of insertion age of subgenome-specific LTR-RTs
├── k15_q200_f2.ltr.insert.R           # R script for the density plot
├── k15_q200_f2.LTR_Copia.tree.pdf     # phylogenetic tree plot of subgenome-specific LTR/Copia elements
├── k15_q200_f2.LTR_Copia.tree.R       # R script for the LTR/Copia tree plot
├── k15_q200_f2.LTR_Gypsy.tree.pdf     # phylogenetic tree plot of subgenome-specific LTR/Gypsy elements
├── k15_q200_f2.LTR_Gypsy.tree.R       # R script for the LTR/Gypsy tree plot
├── k15_q200_f2.circos.pdf             # final circos plot
├── k15_q200_f2.circos.png
├── circos_legend.txt                  # legend of the circos plot
.....

tmp/
├── LTR.scn                 # identification of LTR-RTs by LTRharvest and/or LTRfinder
├── LTR.inner.fa            # inner sequences of LTR-RTs
├── LTR.inner.fa.cls.*      # classification of LTR-RTs by TEsorter
├── LTR.filtered.LTR.fa     # full sequences of the filtered LTR-RTs
├── LTR.LTR_*.aln           # alignments of LTR-RTs' protein domains for the below tree
├── LTR.LTR_*.rooted.tre    # phylogenetic tree files
├── LTR.LTR_*.map           # information of tip nodes on the above tree
.....
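
The output names above follow the pattern `k<k>_q<q>_f<f>.<suffix>`. The sketch below builds that prefix from the parameter values and lists which core files are missing from an output directory. It assumes no extra `-pre` prefix is in use; the helper names and the chosen list of suffixes are hypothetical.

```python
from pathlib import Path

CORE_SUFFIXES = [
    "kmer_freq.pdf",
    "kmer.mat",
    "kmer.mat.pdf",
    "kmer_pca.pdf",
    "chrom-subgenome.tsv",
]

def output_prefix(k=15, q=200, f=2):
    """Build the file prefix, e.g. k15_q200_f2 or k17_q50_f1.5."""
    return f"k{k}_q{q}_f{f:g}"

def missing_outputs(outdir="phase-results", k=15, q=200, f=2, suffixes=CORE_SUFFIXES):
    """Return the expected core output files that are absent from outdir."""
    prefix = output_prefix(k, q, f)
    names = [f"{prefix}.{suffix}" for suffix in suffixes]
    return [name for name in names if not (Path(outdir) / name).exists()]

print(output_prefix())
print(missing_outputs())
```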

When SubPhaser does not work

It is not expected to work with autopolyploids, as autopolyploids are not expected to produce subgenome-specific TEs.
It may not work when there are too many natural recombinations or artifactual switch errors between subgenomes.
It does not work when no or too few TEs expanded during the independently evolving period of the progenitors, or when these TEs have since been eliminated. This may be true for some plants and fungi with low TE content.
It may not work for mesopolyploids, and does not work for paleopolyploids, in which subgenome-specific TEs have been eliminated. However, the boundary between these categories is not clear-cut.
Other unknown cases can be reported to me.

When it does not work, you may try another pipeline based on multiple lines of evidence from synteny, orthology and phylogeny.

Citation

If you use SubPhaser, please cite:

Jia KH, Wang ZX, Wang L et al. SubPhaser: A robust allopolyploid subgenome phasing method based on subgenome-specific k-mers [J]. New Phytologist, 2022, 235: 801-809 DOI:10.1111/nph.18173

Applications

Evolution of genome size

In this study, SubPhaser was used to identify species-specific TEs among the apple tribe. By comparing the contents of non-TEs, species-specific TEs and non-specific TEs, the differences in genome size could be attributed to differential expansion and contraction of specific and non‐specific TEs, assuming that specific TEs expanded and non‐specific TEs contracted after the species split.

Zhang TC, Qiao Q, Du X et al. Cultivated hawthorn (Crataegus pinnatifida var. major) genome sheds light on the evolution of Maleae (apple tribe) [J]. J. Integr. Plant Biol., 2022, 64 (8): 1487–1501 DOI:10.1111/jipb.13318

Evolution of reticulate allopolyploidization

In this study, SubPhaser was used to partition subgenomes of both neo-allotetraploid and neo-allooctoploid poppy genomes, identify exchanges between subgenomes and identify subgenome-specific LTR-RTs. By analysing subgenome phylogeny, exchange patterns and LTR-RT insertion time, a reticulate allopolyploidization evolutionary scenario was strongly supported.

Zhang RG, Lu C, Li G et al. Subgenome-aware analyses suggest a reticulate allopolyploidization origin in three Papaver genomes [J]. Nat. Commun., 2023, 14 (1): 2204 DOI:10.1038/s41467-023-37939-2

Contact

For collaboration on polyploid genome research, please contact us via email ([email protected]) or WeChat (bio_ture).

Full Usage and Default Parameters

usage: subphaser [-h] -i GENOME [GENOME ...] -c CFGFILE [CFGFILE ...]
                         [-labels LABEL [LABEL ...]] [-no_label]
                         [-target FILE] [-sg_assigned FILE] [-sep STR]
                         [-custom_features FASTA [FASTA ...]] [-pre STR]
                         [-o DIR] [-tmpdir DIR] [-k INT] [-f FLOAT] [-q INT]
                         [-baseline BASELINE] [-lower_count INT]
                         [-min_prop FLOAT] [-max_freq INT] [-max_prop FLOAT]
                         [-low_mem] [-by_count] [-re_filter] [-nsg INT]
                         [-replicates INT] [-jackknife FLOAT]
                         [-max_pval FLOAT]
                         [-test_method {ttest_ind,kruskal,wilcoxon,mannwhitneyu}]
                         [-figfmt {pdf,png}]
                         [-heatmap_colors COLOR [COLOR ...]]
                         [-heatmap_options STR] [-just_core] [-disable_ltr]
                         [-ltr_detectors {ltr_finder,ltr_harvest} [{ltr_finder,ltr_harvest} ...]]
                         [-ltr_finder_options STR] [-ltr_harvest_options STR]
                         [-tesorter_options STR] [-all_ltr] [-intact_ltr]
                         [-exclude_exchanges] [-shared_ltr] [-mu FLOAT]
                         [-disable_ltrtree] [-subsample INT]
                         [-ltr_domains {GAG,PROT,INT,RT,RH,AP,RNaseH} [{GAG,PROT,INT,RT,RH,AP,RNaseH} ...]]
                         [-trimal_options STR]
                         [-tree_method {iqtree,FastTree}] [-tree_options STR]
                         [-ggtree_options STR] [-disable_circos]
                         [-window_size INT] [-disable_blocks] [-aligner PROG]
                         [-aligner_options STR] [-min_block INT]
                         [-alt_cfgs CFGFILE [CFGFILE ...]] [-chr_ordered FILE]
                         [-p INT] [-max_memory MEM] [-cleanup] [-overwrite]
                         [-v]

Phase and visualize subgenomes of an allopolyploid or hybrid based on the repetitive kmers.

optional arguments:
  -h, --help            show this help message and exit

Input:
  Input genome and config files

  -i GENOME [GENOME ...], -genomes GENOME [GENOME ...]
                        Input genome sequences in fasta format [required]
  -c CFGFILE [CFGFILE ...], -sg_cfgs CFGFILE [CFGFILE ...]
                        Subgenomes config file (one homologous group per
                        line); this chromosome set is for identifying
                        differential kmers [required]
  -labels LABEL [LABEL ...]
                        For multiple genomes, provide prefix labels for each
                        genome sequence to avoid conficts among chromosome id
                        [default: '1-, 2-, ..., n-']
  -no_label             Do not use default prefix labels for genome sequences
                        as there is no confict among chromosome id
                        [default=False]
  -target FILE          Target chromosomes to output; id mapping is allowed;
                        this chromosome set is for cluster and phase [default:
                        the same chromosome set as `-sg_cfgs`]
  -sg_assigned FILE     Provide subgenome assignments to skip k-means
                        clustering and to identify subgenome-specific features
                        [default=None]
  -sep STR              Seperator for chromosome ID [default="|"]
  -custom_features FASTA [FASTA ...]
                        Custom features in fasta format to enrich subgenome-
                        specific kmers, such as TE and gene [default: None]

Output:
  -pre STR, -prefix STR
                        Prefix for output [default=None]
  -o DIR, -outdir DIR   Output directory [default=phase-results]
  -tmpdir DIR           Temporary directory [default=tmp]

Kmer:
  Options to count and filter kmers

  -k INT                Length of kmer [default=15]
  -f FLOAT, -min_fold FLOAT
                        Minimum fold [default=2]
  -q INT, -min_freq INT
                        Minimum total count for each kmer; will not work if
                        `-min_prop` is specified [default=200]
  -baseline BASELINE    Use sub-maximum (1) or minimum (-1) as the baseline of
                        fold [default=1]
  -lower_count INT      Don't output k-mer with count < lower-count
                        [default=3]
  -min_prop FLOAT       Minimum total proportion (< 1) for each kmer
                        [default=None]
  -max_freq INT         Maximum total count for each kmer; will not work if
                        `-max_prop` is specified [default=1000000000.0]
  -max_prop FLOAT       Maximum total proportion (< 1) for each kmer
                        [default=None]
  -low_mem              Low MEMory but slower [default: True if genome size >
                        3G, else False]
  -by_count             Calculate fold by count instead of by proportion
                        [default=False]
  -re_filter            Re-filter with subset of chromosomes (subgenome
                        assignments are expected to change) [default=False]

Cluster:
  Options for clustering to phase

  -nsg INT              Number of subgenomes (>1) [default: auto]
  -replicates INT       Number of replicates for bootstrap [default=1000]
  -jackknife FLOAT      Percent of kmers to resample for each bootstrap
                        [default=50]
  -max_pval FLOAT       Maximum P value for all hypothesis tests
                        [default=0.05]
  -test_method {ttest_ind,kruskal,wilcoxon,mannwhitneyu}
                        The test method to identify differiential
                        kmers[default=ttest_ind]
  -figfmt {pdf,png}     Format of figures [default=pdf]
  -heatmap_colors COLOR [COLOR ...]
                        Color panel (2 or 3 colors) for heatmap plot [default:
                        ('green', 'black', 'red')]
  -heatmap_options STR  Options for heatmap plot (see more in R shell with
                        `?heatmap.2` of `gplots` package) [default="Rowv=T,Col
                        v=T,scale='col',dendrogram='row',labCol=F,trace='none'
                        ,key=T,key.title=NA,density.info='density',main=NA,xla
                        b='Differential kmers',margins=c(2.5,12)"]
  -just_core            Exit after the core phasing module
                        [default=False]

LTR:
  Options for LTR analyses

  -disable_ltr          Disable this step (this step is time-consuming for
                        large genome) [default=False]
  -ltr_detectors {ltr_finder,ltr_harvest} [{ltr_finder,ltr_harvest} ...]
                        Programs to detect LTR-RTs [default=['ltr_harvest']]
  -ltr_finder_options STR
                        Options for `ltr_finder` to identify LTR-RTs (see more
                        with `ltr_finder -h`) [default="-w 2 -D 15000 -d 1000
                        -L 7000 -l 100 -p 20 -C -M 0.8"]
  -ltr_harvest_options STR
                        Options for `gt ltrharvest` to identify LTR-RTs (see
                        more with `gt ltrharvest -help`) [default="-seqids yes
                        -similar 80 -vic 10 -seed 20 -minlenltr 100 -maxlenltr
                        7000 -mintsd 4 -maxtsd 6"]
  -tesorter_options STR
                        Options for `TEsorter` to classify LTR-RTs (see more
                        with `TEsorter -h`) [default="-db rexdb -dp2"]
  -all_ltr              Use all LTR-RTs identified by `-ltr_detectors` (more
                        LTR-RTs but slower) [default: only use LTR as
                        classified by `TEsorter`]
  -intact_ltr           Use completed LTR-RTs classified by `TEsorter` (less
                        LTR-RTs but faster) [default: the same as `-all_ltr`]
  -exclude_exchanges    Exclude potential exchanged LTRs for insertion age
                        estimation and phylogenetic trees [default=False]
  -shared_ltr           Identify shared LTR-RTs among subgenomes
                        (experimental) [default=False]
  -mu FLOAT             Substitution rate per year in the intergenic region,
                        for estimating age of LTR insertion [default=1.3e-08]
  -disable_ltrtree      Disable subgenome-specific LTR tree (this step is
                        time-consuming when subgenome-specific LTR-RTs are too
                        many, so `-subsample` is enabled by defualt)
                        [default=False]
  -subsample INT        Subsample LTR-RTs to avoid too many to construct a
                        tree [default=1000] (0 to disable)
  -ltr_domains {GAG,PROT,INT,RT,RH,AP,RNaseH} [{GAG,PROT,INT,RT,RH,AP,RNaseH} ...]
                        Domains for LTR tree (Note: for domains identified by
                        `TEsorter`, PROT (rexdb) = AP (gydb), RH (rexdb) =
                        RNaseH (gydb)) [default: ['INT', 'RT', 'RH']]
  -trimal_options STR   Options for `trimal` to trim alignment (see more with
                        `trimal -h`) [default="-automated1"]
  -tree_method {iqtree,FastTree}
                        Programs to construct phylogenetic trees
                        [default=FastTree]
  -tree_options STR     Options for `-tree_method` to construct phylogenetic
                        trees (see more with `iqtree -h` or `FastTree
                        -expert`) [default=""]
  -ggtree_options STR   Options for `ggtree` to show phylogenetic trees (see
                        more from `https://yulab-smu.top/treedata-book`)
                        [default="branch.length='none', layout='circular'"]

Circos:
  Options for circos plot

  -disable_circos       Disable this step [default=False]
  -window_size INT      Window size (bp) for circos plot [default=1000000]
  -disable_blocks       Disable to plot homologous blocks [default=False]
  -aligner PROG         Programs to identify homologous blocks
                        [default=minimap2]
  -aligner_options STR  Options for `-aligner` to align chromosome sequences
                        [default="-x asm20 -n 10"]
  -min_block INT        Minimum block size (bp) to show [default=100000]
  -alt_cfgs CFGFILE [CFGFILE ...]
                        An alternative config file for identifying homologous
                        blocks [default=None]
  -chr_ordered FILE     Provide a chromosome order to plot circos
                        [default=None]

Other options:
  -p INT, -ncpu INT     Maximum number of processors to use [default=32]
  -max_memory MEM       Maximum memory to use where limiting can be enabled.
                        [default=65.2G]
  -cleanup              Remove the temporary directory [default=False]
  -overwrite            Overwrite even if check point files existed
                        [default=False]
  -v, -version          show program's version number and exit


subphaser's Issues

sg.config configuration (The homology of the de novo assembled genome is not known)

Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line?
I am looking forward to your reply.

Installation problem

When I use conda to install this software, it prompts me as follows:
Grid computing is not available because DRMAA not configured properly: Could not find drmaa library.
How can I solve this problem?
Many thanks.

Is there a limitation on chromosome counts?

Hi Rengang,

Thanks for your pipeline ! SubPhaser is very useful for our project.

I wonder whether SubPhaser has a limit on the number of chromosomes, because my species has a very large chromosome number.

best,

Cheng

`TEsorter` cannot find `rexdb` in Singularity container

Hi and thanks for the tool! It looks very promising.

I had to use Singularity because of some cluster vs. conda Qt conflicts that I could not resolve. However, with Singularity I found myself unable to proceed beyond the TEsorter stage because of the following error:

Apptainer> cat /netscratch/dep_mercier/grp_novikova/software/SubPhaser/example_data/tmp/LTR.inner.fa.tesort.log
2023-12-08 17:33:23,593 -INFO- VARS: {'sequence': '/netscratch/dep_mercier/grp_novikova/software/SubPhaser/example_data/tmp/LTR.inner.fa', 'hmm_database': 'rexdb', 'seq_type': 'nucl', 'prefix': '/netscratch/dep_mercier/grp_novikova/software/SubPhaser/example_data/tmp/LTR.inner.fa', 'force_write_hmmscan': False, 'processors': 48, 'tmp_dir': '/netscratch/dep_mercier/grp_novikova/software/SubPhaser/example_data/tmp/LTR', 'min_coverage': 20, 'max_evalue': 0.001, 'disable_pass2': True, 'pass2_rule': '80-80-80', 'no_library': False, 'no_reverse': False, 'no_cleanup': False}
2023-12-08 17:33:23,594 -INFO- checking dependencies:
Traceback (most recent call last):
  File "/share/home/app/bin/miniconda3/envs/SubPhaser/bin/TEsorter", line 10, in <module>
    sys.exit(main())
  File "/share/home/app/bin/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/TEsorter/app.py", line 1014, in main
    pipeline(Args())
  File "/share/home/app/bin/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/TEsorter/app.py", line 145, in pipeline
    Dependency().check_hmmer(db=DB[args.hmm_database])
  File "/share/home/app/bin/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/TEsorter/app.py", line 952, in check_hmmer
    dp_version = self.get_hmm_version(db)[:3]
  File "/share/home/app/bin/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/TEsorter/app.py", line 967, in get_hmm_version
    line = open(db).readline()
FileNotFoundError: [Errno 2] No such file or directory: '/share/home/app/bin/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/TEsorter/database/REXdb_protein_database_viridiplantae_v3.0_plus_metazoa_v3.hmm'

Turns out databases are not loaded at all:

Apptainer> ls /share/home/app/bin/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/TEsorter/
__init__.py  __main__.py  app.py       modules/     version.py

How would I fix that?

Cheers,
Nikita

Only one pair of homologous chromosomes were not phased

Hi~
SubPhaser is a great piece of software, but I have run into a problem when using it to phase my diploid genome.

After Hi-C scaffolding, I got 22 superscaffolds, which I want to divide into 2 parts (2n=2x=22).

I ran SubPhaser (-k 17 -q 50 -f 1.5); 20 superscaffolds were phased and only 1 pair of homologous scaffolds (scaffold_9 and scaffold_10) was not phased.

k17_q50_f1.5.kmer_freq.pdf
k17_q50_f1.5.kmer_pca.pdf
k17_q50_f1.5.ltr.insert.density.pdf

How can I solve this problem?
Looking forward to your reply!
Yang

Division by zero when trying to build trees?

Hi, when running SubPhaser I get the following error:

Traceback (most recent call last):
  File "/home/531734/.conda/envs/SubPhaser/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
  File "/home/531734/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/__main__.py", line 779, in main
    pipeline.run()
  File "/home/531734/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/__main__.py", line 516, in run
    ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
  File "/home/531734/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/__main__.py", line 615, in step_ltr
    d_files = tree.build(job_args=job_args)
  File "/home/531734/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/LTR.py", line 210, in build
    ncpus = [max(1, int(self.ncpu*v/tprop)) for v in prop]
  File "/home/531734/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/LTR.py", line 210, in <listcomp>
    ncpus = [max(1, int(self.ncpu*v/tprop)) for v in prop]
ZeroDivisionError: division by zero

I believe this could be because only one scaffold was identified as a subgenome. Does this sound possible?

Many thanks

Mike
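
The traceback above ends in a division by `tprop` while CPUs are being shared out among tree jobs. The sketch below reproduces that expression in isolation and adds a guard. It assumes `tprop` is the sum of `prop`, which the traceback does not show, and it is not the upstream fix; `allocate_cpus` is a hypothetical name.

```python
def allocate_cpus(ncpu, prop):
    """Share ncpu among jobs in proportion to their weights, at least 1 each."""
    tprop = sum(prop)  # assumption: the total weight over all jobs
    if tprop == 0:
        # No weight to share by (e.g. no elements for any tree):
        # fall back to one CPU per job instead of dividing by zero.
        return [1 for _ in prop]
    return [max(1, int(ncpu * v / tprop)) for v in prop]

print(allocate_cpus(32, [3, 1]))
print(allocate_cpus(32, [0, 0]))
```

If the weights are all zero because no subgenome-specific LTR-RTs were found, running with -disable_ltrtree would skip the tree step altogether.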

ValueError: All singletons are not allowed

When I run the command `subphaser -i brg.fa -c brg.txt`, it shows the error `ValueError: All singletons are not allowed`.

How can I solve this? Thanks!

Traceback (most recent call last):
  File "/home/zuozd/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.6', 'console_scripts', 'subphaser')())
  File "/home/zuozd/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 797, in main
    pipeline.run()
  File "/home/zuozd/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 422, in run
    d_mat = dumps.filter(d_mat, lengths, self.sgs, outfig=histfig, #d_targets=d_targets,
  File "/home/zuozd/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/Jellyfish.py", line 479, in filter
    raise ValueError('All singletons are not allowed')
ValueError: All singletons are not allowed

IndexError: cannot do a non-empty take from an empty axes.

Hi, I got the following error with my dataset when I tried to pre-assign all 40 chromosomes to 2 subgenomes. Apparently, SubPhaser re-assigned all chromosomes to SG1. With a smaller number of homologous chromosome assignments, as you suggested in #7, SubPhaser completed successfully on the same genome.

22-12-23 03:08:56 [INFO] Version: 1.2.5
22-12-23 03:08:56 [INFO] Arguments: {'genomes': ['/gfe_data/species_genome/Nepenthes_gracilis_male_HiC.fa.gz'], 'sg_cfgs': ['/gfe_data/species_subphaser_cfg/Nepenthes_gracilis_subphaser_cfg.txt'], 'labels': None, 'no_label': True, 'target': None, 'sg_assigned': None, 'sep': '|', 'custom_features': None, 'prefix': 'Nepenthes_gracilis.', 'outdir': 'Nepenthes_gracilis.subphaser', 'tmpdir': 'Nepenthes_gracilis.tmp', 'k': 15, 'min_fold': 2, 'min_freq': 200, 'baseline': 1, 'lower_count': 3, 'min_prop': None, 'max_freq': 1000000000.0, 'max_prop': None, 'low_mem': None, 'by_count': False, 're_filter': False, 'nsg': None, 'replicates': 1000, 'jackknife': 50, 'max_pval': 0.05, 'test_method': 'ttest_ind', 'figfmt': 'pdf', 'heatmap_colors': ('green', 'black', 'red'), 'heatmap_options': "Rowv=T,Colv=T,scale='col',dendrogram='row',labCol=F,trace='none',key=T,key.title=NA,density.info='density',main=NA,xlab='Differential kmers',margins=c(2.5,12)", 'just_core': False, 'disable_ltr': False, 'ltr_detectors': ['ltr_harvest'], 'ltr_finder_options': '-w 2 -D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.8', 'ltr_harvest_options': '-seqids yes -similar 80 -vic 10 -seed 20 -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6', 'tesorter_options': '-db rexdb -dp2', 'all_ltr': False, 'intact_ltr': False, 'exclude_exchanges': False, 'non_specific': False, 'mu': 1.3e-08, 'disable_ltrtree': False, 'subsample': 1000, 'ltr_domains': ['INT', 'RT', 'RH'], 'trimal_options': '-automated1', 'tree_method': 'FastTree', 'tree_options': '', 'ggtree_options': "branch.length='none', layout='circular'", 'disable_circos': False, 'window_size': 1000000, 'disable_blocks': False, 'aligner': 'minimap2', 'aligner_options': '-x asm20 -n 10', 'min_block': 100000, 'alt_cfgs': None, 'chr_ordered': None, 'ncpu': 4, 'max_memory': '32', 'cleanup': False, 'overwrite': False}
22-12-23 03:08:56 [INFO] Target chromosomes: ['scaffold2', 'scaffold1', 'scaffold8', 'scaffold11', 'scaffold12', 'scaffold3', 'scaffold17', 'scaffold23', 'scaffold24', 'scaffold40', 'scaffold4', 'scaffold22', 'scaffold30', 'scaffold33', 'scaffold39', 'scaffold5', 'scaffold13', 'scaffold16', 'scaffold18', 'scaffold26', 'scaffold6', 'scaffold15', 'scaffold20', 'scaffold32', 'scaffold38', 'scaffold7', 'scaffold14', 'scaffold27', 'scaffold28', 'scaffold29', 'scaffold9', 'scaffold19', 'scaffold21', 'scaffold34', 'scaffold36', 'scaffold10', 'scaffold25', 'scaffold31', 'scaffold35', 'scaffold37']
22-12-23 03:08:56 [INFO] Splitting genomes by chromosome into `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.`
22-12-23 03:09:08 [INFO] New check point file: `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.split.ok`
22-12-23 03:09:08 [INFO] Chromosomes: ['scaffold2', 'scaffold1', 'scaffold8', 'scaffold11', 'scaffold12', 'scaffold3', 'scaffold17', 'scaffold23', 'scaffold24', 'scaffold40', 'scaffold4', 'scaffold22', 'scaffold30', 'scaffold33', 'scaffold39', 'scaffold5', 'scaffold13', 'scaffold16', 'scaffold18', 'scaffold26', 'scaffold6', 'scaffold15', 'scaffold20', 'scaffold32', 'scaffold38', 'scaffold7', 'scaffold14', 'scaffold27', 'scaffold28', 'scaffold29', 'scaffold9', 'scaffold19', 'scaffold21', 'scaffold34', 'scaffold36', 'scaffold10', 'scaffold25', 'scaffold31', 'scaffold35', 'scaffold37']
22-12-23 03:09:08 [INFO] Chromosome Number: 40
22-12-23 03:09:08 [INFO] CONFIG: [[['scaffold2'], ['scaffold1', 'scaffold8', 'scaffold11', 'scaffold12']], [['scaffold3'], ['scaffold17', 'scaffold23', 'scaffold24', 'scaffold40']], [['scaffold4'], ['scaffold22', 'scaffold30', 'scaffold33', 'scaffold39']], [['scaffold5'], ['scaffold13', 'scaffold16', 'scaffold18', 'scaffold26']], [['scaffold6'], ['scaffold15', 'scaffold20', 'scaffold32', 'scaffold38']], [['scaffold7'], ['scaffold14', 'scaffold27', 'scaffold28', 'scaffold29']], [['scaffold9'], ['scaffold19', 'scaffold21', 'scaffold34', 'scaffold36']], [['scaffold10'], ['scaffold25', 'scaffold31', 'scaffold35', 'scaffold37']]]
22-12-23 03:09:08 [INFO] Genome size: 746,713,351 bp
22-12-23 03:09:08 [INFO] ###Step: Kmer Count
22-12-23 03:09:08 [INFO] Counting kmer by jellyfish
22-12-23 03:09:08 [INFO] Start Pool with 4 process(es)
.
.
.
22-12-23 03:13:26 [INFO] Bootstrap: mean Adjusted Rand-Index: 0.9428; mean V-measure score: 0.9295
22-12-23 03:13:26 [INFO] Subgenome assignments: OrderedDict([('scaffold2', 'SG1'), ('scaffold1', 'SG1'), ('scaffold8', 'SG1'), ('scaffold11', 'SG1'), ('scaffold12', 'SG1'), ('scaffold3', 'SG1'), ('scaffold17', 'SG1'), ('scaffold23', 'SG1'), ('scaffold24', 'SG1'), ('scaffold40', 'SG1'), ('scaffold4', 'SG1'), ('scaffold22', 'SG1'), ('scaffold30', 'SG2'), ('scaffold33', 'SG2'), ('scaffold39', 'SG1'), ('scaffold5', 'SG1'), ('scaffold13', 'SG1'), ('scaffold16', 'SG1'), ('scaffold18', 'SG1'), ('scaffold26', 'SG1'), ('scaffold6', 'SG1'), ('scaffold15', 'SG1'), ('scaffold20', 'SG1'), ('scaffold32', 'SG1'), ('scaffold38', 'SG1'), ('scaffold7', 'SG1'), ('scaffold14', 'SG1'), ('scaffold27', 'SG1'), ('scaffold28', 'SG1'), ('scaffold29', 'SG1'), ('scaffold9', 'SG1'), ('scaffold19', 'SG1'), ('scaffold21', 'SG1'), ('scaffold34', 'SG1'), ('scaffold36', 'SG1'), ('scaffold10', 'SG1'), ('scaffold25', 'SG1'), ('scaffold31', 'SG1'), ('scaffold35', 'SG1'), ('scaffold37', 'SG1')])
22-12-23 03:13:26 [INFO] Outputing `chromosome` - `subgenome` assignments to `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.subphaser/Nepenthes_gracilis.k15_q200_f2.chrom-subgenome.tsv`
22-12-23 03:13:26 [INFO] Outputing significant differiential `kmer` - `subgenome` maps to `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.subphaser/Nepenthes_gracilis.k15_q200_f2.sig.kmer-subgenome.tsv`
22-12-23 03:13:26 [INFO] Start Pool with 4 process(es)
22-12-23 03:13:26 [INFO] 9 significant subgenome-specific kmers
22-12-23 03:13:26 [INFO] 	9 SG2-specific kmers
22-12-23 03:13:27 [INFO] run CMD: `Rscript /gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.subphaser/Nepenthes_gracilis.k15_q200_f2.kmer.mat.R`
22-12-23 03:13:27 [INFO] Outputing PCA plot to `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.subphaser/Nepenthes_gracilis.k15_q200_f2.kmer_pca.pdf`
22-12-23 03:13:28 [INFO] Outputing `coordinate` - `subgenome` maps to `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.subphaser/Nepenthes_gracilis.k15_q200_f2.subgenome.bin.count`
22-12-23 03:13:28 [INFO] Start Pool with 4 process(es)
.
.
.
22-12-23 03:14:47 [INFO] Processed 94 sequences
22-12-23 03:14:47 [INFO] 92 (97.87%) sequences contain subgenome-specific kmers
22-12-23 03:14:47 [INFO] 100.00% of 9 subgenome-specific kmers are mapped
22-12-23 03:14:47 [INFO] New check point file: `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.Nepenthes_gracilis.k15_q200_f2.subgenome.bin.count.ok`
22-12-23 03:14:47 [INFO] Enriching subgenome by chromosome window (size: 1000000)
22-12-23 03:14:47 [INFO] Start Pool with 4 process(es)
.
.
.
22-12-23 03:21:31 [INFO] finished with 0 commands uncompleted
22-12-23 03:21:32 [INFO] New check point file: `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.LTR.scn.ok`
22-12-23 03:21:32 [INFO] 23051 LTRs identified
22-12-23 03:21:32 [INFO] Extracting inner sequences of LTRs to classify by `TEsorter`
22-12-23 03:21:32 [INFO] run CMD: `TEsorter /gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.LTR.inner.fa -db rexdb -dp2 -p 4 -pre /gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.LTR.inner.fa -tmp /gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.LTR &> /gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.LTR.inner.fa.tesort.log`
22-12-23 03:39:13 [INFO] New check point file: `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.LTR.tesort.ok`
22-12-23 03:39:13 [INFO] By TEsorter, 13396 (58.1%) are classified as LTRs, of which 5538 (41.3%) are intact with complete protein domains
22-12-23 03:39:13 [INFO] After filtering, 13202 / 23051 (57.3%) LTRs retained
22-12-23 03:39:13 [INFO] Outputing `coordinate` - `LTR` maps to `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.subphaser/Nepenthes_gracilis.k15_q200_f2.ltr.bin.count`
22-12-23 03:39:13 [INFO] Start Pool with 4 process(es)
22-12-23 03:39:23 [INFO] Processed 13202 sequences
22-12-23 03:39:23 [INFO] 204 (1.55%) sequences contain subgenome-specific kmers
22-12-23 03:39:23 [INFO] 44.44% of 9 subgenome-specific kmers are mapped
22-12-23 03:39:25 [INFO] New check point file: `/gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.tmp/Nepenthes_gracilis.Nepenthes_gracilis.k15_q200_f2.ltr.bin.count.ok`
22-12-23 03:39:25 [INFO] Enriching subgenome-specific LTR-RTs
22-12-23 03:39:25 [INFO] Start Pool with 4 process(es)
/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/Stats.py:157: RuntimeWarning: invalid value encountered in divide
  ratios = np.array(row) / np.array(total)
/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/Stats.py:157: RuntimeWarning: invalid value encountered in divide
  ratios = np.array(row) / np.array(total)
/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/Stats.py:157: RuntimeWarning: invalid value encountered in divide
  ratios = np.array(row) / np.array(total)
/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/Stats.py:157: RuntimeWarning: invalid value encountered in divide
  ratios = np.array(row) / np.array(total)
22-12-23 03:39:25 [INFO] Output: /gfe_data/tmp/14_Nepenthes_gracilis/Nepenthes_gracilis.subphaser/Nepenthes_gracilis.k15_q200_f2.ltr.enrich
22-12-23 03:39:25 [INFO] 0 significant subgenome-specific LTR-RTs
22-12-23 03:39:28 [INFO] Summary of overall LTR insertion age (million years):
/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/opt/conda/envs/biotools/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 784, in main
    pipeline.run()
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 518, in run
    ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 602, in step_ltr
    enrich_ltrs = LTR.plot_insert_age(ltrs, d_enriched, prefix, 
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/LTR.py", line 515, in plot_insert_age
    d_info = summary_ltr_time(d_data, fout)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/LTR.py", line 601, in summary_ltr_time
    np.median(xages), abs(np.percentile(xages, 2.5)), np.percentile(xages, 97.5)))
  File "<__array_function__ internals>", line 180, in percentile
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4166, in percentile
    return _quantile_unchecked(
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4424, in _quantile_unchecked
    r, k = _ureduce(a,
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/lib/function_base.py", line 3725, in _ureduce
    r = func(a, **kwargs)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4593, in _quantile_ureduce_func
    result = _quantile(arr,
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4699, in _quantile
    take(arr, indices=-1, axis=DATA_AXIS)
  File "<__array_function__ internals>", line 180, in take
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 190, in take
    return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return bound(*args, **kwds)
IndexError: cannot do a non-empty take from an empty axes.
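
The traceback ends in `np.percentile` being called on an empty array, right after the log reports 0 significant subgenome-specific LTR-RTs. The sketch below reproduces that failure mode in isolation and shows one defensive guard. `summarize_ages` is a hypothetical helper written for illustration; it is not SubPhaser's `summary_ltr_time`, and whether the upstream fix takes this form is an assumption.

```python
import numpy as np

def summarize_ages(ages):
    """Return (median, 2.5th percentile, 97.5th percentile), or None if empty.

    Hypothetical helper: guards against the empty input that makes
    np.percentile raise IndexError in the traceback above.
    """
    ages = np.asarray(ages, dtype=float)
    if ages.size == 0:
        # Nothing to summarize (e.g. no subgenome-specific LTR-RTs were found).
        return None
    return (float(np.median(ages)),
            float(np.percentile(ages, 2.5)),
            float(np.percentile(ages, 97.5)))

print(summarize_ages([]))               # empty group: skipped instead of crashing
print(summarize_ages([1.0, 2.0, 3.0]))  # non-empty group: summary tuple
```

With a guard like this, an empty group would be skipped in the summary rather than aborting the whole run.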

Failed to install SubPhaser

Hi~
When I run 'conda env create -f SubPhaser.yaml', the following error appears:

Do you know how to solve this?
Best wishes.

cannot allocate memory

Hi,

Thanks a lot for the very nice tool!

I am trying to phase the subgenomes of a hexaploid haplotype-phased genome (9 Gb), but I always get stuck with the error message 'Cannot allocate memory', even after changing the memory option several times. Any help is appreciated.

Cheers
André
...
24-01-25 07:23:35 [INFO] Loading kmer matrix from jellyfish
24-01-25 07:23:35 [INFO] Start Pool with 40 process(es)
24-01-25 07:23:57 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_53.fasta_15.fa
24-01-25 07:28:54 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_60.fasta_15.fa
24-01-25 07:29:21 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_5.fasta_15.fa
24-01-25 07:30:13 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_57.fasta_15.fa
24-01-25 07:30:47 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_61.fasta_15.fa
24-01-25 07:30:52 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_54.fasta_15.fa
24-01-25 07:31:00 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_22.fasta_15.fa
24-01-25 07:31:36 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_50.fasta_15.fa
24-01-25 07:31:46 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_52.fasta_15.fa
24-01-25 07:32:25 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_48.fasta_15.fa
24-01-25 07:32:31 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_42.fasta_15.fa
24-01-25 07:32:38 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_47.fasta_15.fa
24-01-25 07:32:44 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_55.fasta_15.fa
24-01-25 07:32:49 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_4.fasta_15.fa
24-01-25 07:33:38 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_35.fasta_15.fa
24-01-25 07:33:47 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_40.fasta_15.fa
24-01-25 07:33:53 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_25.fasta_15.fa
24-01-25 07:34:02 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_27.fasta_15.fa
24-01-25 07:34:12 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_38.fasta_15.fa
24-01-25 07:34:22 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_37.fasta_15.fa
24-01-25 07:35:11 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_41.fasta_15.fa
24-01-25 07:35:17 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_26.fasta_15.fa
24-01-25 07:35:28 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_33.fasta_15.fa
24-01-25 07:35:40 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_65.fasta_15.fa
24-01-25 07:35:52 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_28.fasta_15.fa
24-01-25 07:36:01 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_7.fasta_15.fa
24-01-25 07:36:12 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_17.fasta_15.fa
24-01-25 07:36:21 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_36.fasta_15.fa
24-01-25 07:36:32 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_30.fasta_15.fa
24-01-25 07:36:44 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_14.fasta_15.fa
24-01-25 07:37:41 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_18.fasta_15.fa
24-01-25 07:37:57 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_63.fasta_15.fa
24-01-25 07:38:08 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_1.fasta_15.fa
24-01-25 07:38:19 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_16.fasta_15.fa
24-01-25 07:38:27 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_31.fasta_15.fa
24-01-25 07:38:36 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_12.fasta_15.fa
24-01-25 07:38:49 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_11.fasta_15.fa
24-01-25 07:39:01 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_62.fasta_15.fa
24-01-25 07:39:07 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_23.fasta_15.fa
24-01-25 07:39:18 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_64.fasta_15.fa
24-01-25 07:39:23 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_66.fasta_15.fa
24-01-25 07:39:37 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_39.fasta_15.fa
24-01-25 07:39:55 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_15.fasta_15.fa
24-01-25 07:40:08 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_3.fasta_15.fa
24-01-25 07:40:19 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_21.fasta_15.fa
24-01-25 07:40:29 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_24.fasta_15.fa
24-01-25 07:41:21 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_29.fasta_15.fa
24-01-25 07:41:31 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_34.fasta_15.fa
24-01-25 07:41:40 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_32.fasta_15.fa
24-01-25 07:41:52 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_56.fasta_15.fa
24-01-25 07:42:08 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_8.fasta_15.fa
24-01-25 07:42:20 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_9.fasta_15.fa
24-01-25 07:42:32 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_10.fasta_15.fa
24-01-25 07:42:43 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_13.fasta_15.fa
24-01-25 07:42:55 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_2.fasta_15.fa
24-01-25 07:43:08 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_19.fasta_15.fa
24-01-25 07:43:20 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_20.fasta_15.fa
24-01-25 07:43:30 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_45.fasta_15.fa
24-01-25 07:43:38 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_46.fasta_15.fa
24-01-25 07:43:44 [INFO] Loading /netscratch/dep_mercier/grp_marques/marques/LPA/CBC/SubPhaser/wgdi/non-necessary/CBC_tmp/CBC_chromosomes/scaffold_6.fasta_15.fa
24-01-25 07:43:51 [INFO] 62557073 kmers in total
24-01-25 07:43:51 [INFO] Filtering differential kmers
Traceback (most recent call last):
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.6', 'console_scripts', 'subphaser')())
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 797, in main
    pipeline.run()
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 422, in run
    d_mat = dumps.filter(d_mat, lengths, self.sgs, outfig=histfig, #d_targets=d_targets,
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/Jellyfish.py", line 487, in filter
    for kmer, freqs, tot_freq in pool_func(_filter_kmer, args, self.ncpu,
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/RunCmdsMP.py", line 336, in pool_func
    pool = multiprocessing.Pool(processors)
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/netscratch/dep_mercier/grp_marques/bin/marques-envs/SGphasing/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
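
The log shows a pool of 40 processes being started after 62,557,073 kmers were loaded, and the failure happens inside `os.fork()`. One plausible reading, which is an assumption and not confirmed in this thread, is that forking that many workers from a parent holding a large kmer matrix asks the kernel to commit more memory than it will grant. The sketch below is only a worst-case back-of-the-envelope cap on the worker count; `max_workers` and the sizes used are hypothetical and are not part of SubPhaser.

```python
def max_workers(requested, parent_bytes, available_bytes):
    """Cap the pool size, assuming each forked worker may need a full
    copy of the parent's memory (a deliberately pessimistic assumption)."""
    if parent_bytes <= 0:
        return requested
    affordable = available_bytes // parent_bytes
    # Always allow at least one worker so the job can still run serially.
    return max(1, min(requested, int(affordable)))

GiB = 1024 ** 3
# Illustrative numbers only: a 20 GiB parent on a node with 128 GiB free.
print(max_workers(40, 20 * GiB, 128 * GiB))  # -> 6
```

Under this reading, rerunning with a much smaller number of processes would be a cheap first experiment before changing anything else.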

ModuleNotFoundError: No module named 'TEsorter'

Hi, I get a ModuleNotFoundError (No. 1) when I run subphaser.
To fix it, I installed TEsorter using both the new and the old installation methods, but subphaser still cannot find the module.
When I run 'from TEsorter.app import CommonClassifications' in python3, I get an ImportError (No. 2).
Could you suggest a solution?

Thank you !
Jung

(No. 1)
(SubPhaser)$ subphaser

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/subphaser-1.2.6-py3.8.egg/subphaser/LTR.py", line 9, in <module>
    from TEsorter.app import CommonClassifications
ModuleNotFoundError: No module named 'TEsorter'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/subphaser", line 11, in <module>
    load_entry_point('subphaser==1.2.6', 'console_scripts', 'subphaser')()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 490, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2854, in load_entry_point
    return ep.load()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2445, in load
    return self.resolve()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2451, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python3.8/dist-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 15, in <module>
    from . import LTR
  File "/usr/local/lib/python3.8/dist-packages/subphaser-1.2.6-py3.8.egg/subphaser/LTR.py", line 11, in <module>
    from .api.TEsorter.app import CommonClassifications
  File "/usr/local/lib/python3.8/dist-packages/subphaser-1.2.6-py3.8.egg/subphaser/api/TEsorter/app.py", line 37, in <module>
    from .modules.get_record import get_records
  File "/usr/local/lib/python3.8/dist-packages/subphaser-1.2.6-py3.8.egg/subphaser/api/TEsorter/modules/get_record.py", line 6, in <module>
    from TEsorter.modules.small_tools import open_file as open
ModuleNotFoundError: No module named 'TEsorter'

(No. 2)

>>> from TEsorter.app import CommonClassifications
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'CommonClassifications' from 'TEsorter.app' (/home/super/miniconda3/envs/SubPhaser/lib/python3.10/site-packages/TEsorter/app.py)
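
The two tracebacks point at different Python installations: subphaser is loaded from `/usr/local/lib/python3.8/dist-packages`, while the TEsorter that was found lives under `/home/super/miniconda3/envs/SubPhaser/lib/python3.10/site-packages`. A quick way to see which interpreter is running and where it would import each package from is sketched below. `locate` is a hypothetical helper; only standard-library calls are used.

```python
import importlib.util
import sys

def locate(module_name):
    """Return the file the current interpreter would import for
    `module_name`, or None if the module cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None

print("interpreter:", sys.executable)
for name in ("subphaser", "TEsorter"):
    print(name, "->", locate(name))
```

Running this with the same `python3` that the `subphaser` script uses (see the first line of that script) shows whether both packages resolve inside one environment.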

Getting location of subgenome specific TEs

Hi! Thanks for the great tool! I was wondering whether one can get the genomic locations of subgenome-specific TEs or TE k-mers. My idea is to examine the coding regions upstream and downstream of subgenome-specific TEs.

My current approach is to look at the 'k15_q200_f2.ltr.enrich' file in the phase-results folder and select the specific k-mers that are found in one subgenome (column 2) and that show no potential exchange among subgenomes (column 5). Once I have identified the k-mers that meet those requirements, I plan to look up their genomic positions in the 'LTR.inner.fa.dom.gff3' file in the tmp directory. Is that approach correct, or should I be looking at other output files?

Thank you in advance!!
Bests,
Emiliano

Configuration file

Hello! How can I create a configuration file for my genome?
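
Judging from the `CONFIG` line in the log of the first issue above and the `-c ${prefix}_sg.config` argument used in a later one, the configuration appears to be a plain-text file with one homoeologous chromosome set per line and the chromosome IDs separated by whitespace. That layout is an assumption here and should be checked against the "Inputs" section of the README. A minimal sketch for generating such a file follows; the chromosome names and `format_sg_config` are hypothetical.

```python
def format_sg_config(homoeologous_sets):
    """Render one homoeologous chromosome set per line,
    with chromosome IDs separated by a single space."""
    lines = [" ".join(chroms) for chroms in homoeologous_sets]
    return "\n".join(lines) + "\n"

# Hypothetical IDs; they must match the sequence IDs in the genome FASTA.
sets = [
    ["Chr1A", "Chr1B"],
    ["Chr2A", "Chr2B"],
]
config_text = format_sg_config(sets)
print(config_text, end="")

# To save it: open("my_sg.config", "w").write(config_text)
```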

The output subgenomes are not paired

Hi~,
I used this software to phase the subgenomes. My input was pairs of chromosomes, but the output assigned 11 chromosomes to one subgenome and 13 to the other. Is this result correct? How should I interpret it?

Looking forward to your reply!
Hang

ValueError: 0 kmer with fold > 2. Please reset the filter options.

When I run the analysis with the default parameters, the following error occurs. What should I set the kmer parameters to?

23-12-30 15:24:20 [INFO] After filtering, remained 0 (0.00%) differential (freq >= 200) and 0 (0.00%) candidate (freq > 0) kmers
Traceback (most recent call last):
  File "/home/zuozd/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.6', 'console_scripts', 'subphaser')())
  File "/home/zuozd/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 797, in main
    pipeline.run()
  File "/home/zuozd/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 422, in run
    d_mat = dumps.filter(d_mat, lengths, self.sgs, outfig=histfig, #d_targets=d_targets,
  File "/home/zuozd/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/Jellyfish.py", line 502, in filter
    raise ValueError('0 kmer with fold > {}. Please reset the filter options.'.format(min_fold))
ValueError: 0 kmer with fold > 2. Please reset the filter options.

THANK YOU!
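
The log line mentions two thresholds: a frequency cutoff (`freq >= 200`) and a fold cutoff (`fold > 2`). The sketch below is a deliberately simplified illustration of how two such thresholds can interact. It is not SubPhaser's `_filter_kmer`, which also handles normalization and further options; `is_differential` and the counts are hypothetical.

```python
def is_differential(counts, min_freq=200, min_fold=2.0):
    """counts: occurrences of one kmer on each chromosome of a
    homoeologous set. Simplified reading of the two thresholds."""
    total = sum(counts)
    if total < min_freq:
        return False            # too rare to count as a repetitive kmer
    lowest, highest = min(counts), max(counts)
    if lowest == 0:
        return highest > 0      # present on some chromosomes, absent on others
    return highest / lowest >= min_fold

print(is_differential([300, 20]))   # True: abundant and skewed
print(is_differential([150, 140]))  # False: abundant but evenly shared
print(is_differential([30, 2]))     # False: skewed but below min_freq
```

Under this reading, zero surviving kmers means that the kmers are either too rare to reach the frequency cutoff or too evenly distributed among the homoeologous chromosomes to reach the fold cutoff.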

Used for contig-level assembly

Hi, thanks for developing such a useful tool. I wonder whether it can be used with a contig-level assembly.
Thank you for your reply in advance~

Changing mutation rate

Hello,

I'm curious why the densities in the ***.ltr.insert.density.pdf plot change by an order of magnitude when the mutation rate is changed, while the ***.ltr.insert.histo.pdf plot remains the same. I anticipated that only the x-axis would change with the new mutation rate, but somehow the density (y-axis) of LTRs changes as well. I attached the small genome example run with the defaults and with a different mutation rate of -mu 6.7e-09.

I ran the default example script, then ran this:

prefix=Arabidopsis_suecica
DT=$(date +"%y%m%d%H%M")
options="-pre ${prefix}_" # to avoid conflicts
subphaser -i ${prefix}_genome.fasta.gz -c ${prefix}_sg.config -max_memory 128G -disable_circos -intact_ltr -mu 1.75e-09 $options 2>&1 | tee ${prefix}.log.$DT

I checked without the -intact_ltr flag and the results are the same.

Any insights would be greatly appreciated.

Arabidopsis_suecica_k15_q200_f2.ltr.insert.histo.default.pdf
Arabidopsis_suecica_k15_q200_f2.ltr.insert.density.1.75.pdf
Arabidopsis_suecica_k15_q200_f2.ltr.insert.histo.1.75.pdf
Arabidopsis_suecica_k15_q200_f2.ltr.insert.density.default.pdf

Thank you.
Crystal
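
One possible explanation can be checked numerically. If insertion age is computed from LTR divergence K as T = K / (2 * mu), which is the usual formula and is assumed here to be what SubPhaser applies, then lowering mu stretches every age by the same factor. A density curve is normalized to unit area, so stretching the x-axis by a factor c lowers the curve's height by 1/c, whereas the bar heights of a count histogram stay the same. The sketch below uses toy divergence values, the default `mu` of 1.3e-08 from the argument log earlier on this page, and the 1.75e-09 from the command above.

```python
import numpy as np

def insertion_age_my(divergence, mu):
    """Insertion age in million years, T = K / (2 * mu)."""
    return np.asarray(divergence, dtype=float) / (2.0 * mu) / 1e6

def peak_density(ages, bins=20):
    """Height of the tallest bar of a histogram normalized to unit area."""
    density, _ = np.histogram(ages, bins=bins, density=True)
    return float(density.max())

rng = np.random.default_rng(0)
divergence = rng.uniform(0.0, 0.05, size=2000)  # toy values, not real data

ages_default = insertion_age_my(divergence, 1.3e-8)
ages_slow = insertion_age_my(divergence, 1.75e-9)

stretch = 1.3e-8 / 1.75e-9  # factor by which the x-axis is stretched
print("x-axis stretch:", round(stretch, 2))
print("peak density ratio:", peak_density(ages_default) / peak_density(ages_slow))
```

The two printed numbers should be close to each other, which would match the roughly order-of-magnitude change seen on the y-axis of the density plot.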

Can't install SubPhaser: Found conflicts! Looking for incompatible packages.

Dear Dr Zhang
Thanks for developing this useful tool.
Unfortunately, I am stuck at the installation step of SubPhaser.
When I run
conda env create -f SubPhaser.yaml
the conda environment cannot be set up, and errors like the following appear:
`Collecting package metadata (repodata.json): done
Solving environment: Found conflicts! Looking for incompatible packages.
...
The following specifications were found to be incompatible with your system:

feature:/linux-64::__glibc==2.17=0
feature:|@/linux-64::__glibc==2.17=0
biopython==1.79=py38h497a2fe_0 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
blast==2.11.0=pl526he19e7b1_0 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
...
Your installed version is: 2.17
`
Looking forward to your response.

No differential kmers

Hi,

I am trying to use SubPhaser to phase the subgenomes of my species. However, after filtering, no differential kmers remained. I used the parameters -k 15 -q 50 -f 2. I get the same result even when I further reduce -k and -q.

23-12-27 17:32:21 [INFO] 125035 kmers in total
23-12-27 17:32:21 [INFO] Filtering differential kmers
23-12-27 17:32:22 [INFO] Start Pool with 112 process(es)
23-12-27 17:32:25 [INFO] After filtering, remained 0 (0.00%) differential (freq >= 25) and 0 (0.00%) candidate (freq > 0) kmers

Do you have any suggestions?

thanks,
Chen

Arabidopsis_suecica_LTR.inner.fa.cls.tsv not found

Thank you for developing this excellent tool! I installed SubPhaser in a Singularity container with the minimal dependencies listed in #10 (comment) and tried to run the Arabidopsis test dataset, but an error occurred. A separate run with my own dataset stopped with the same error message. I would be grateful for any suggestions on how to solve this.

22-12-18 06:54:20 [INFO] finished with 0 commands uncompleted
22-12-18 06:54:20 [INFO] New check point file: `/home/kfuku/docker_img/gfe/usr/local/bin/SubPhaser/example_data/Arabidopsis_suecica_tmp/Arabidopsis_suecica_LTR.scn.ok`
22-12-18 06:54:20 [INFO] 5566 LTRs identified
22-12-18 06:54:20 [INFO] Extracting inner sequences of LTRs to classify by `TEsorter`
22-12-18 06:54:20 [INFO] run CMD: `TEsorter /home/kfuku/docker_img/gfe/usr/local/bin/SubPhaser/example_data/Arabidopsis_suecica_tmp/Arabidopsis_suecica_LTR.inner.fa -db rexdb -dp2 -p 128 -pre /home/kfuku/docker_img/gfe/usr/local/bin/SubPhaser/example_data/Arabidopsis_suecica_tmp/Arabidopsis_suecica_LTR.inner.fa -tmp /home/kfuku/docker_img/gfe/usr/local/bin/SubPhaser/example_data/Arabidopsis_suecica_tmp/Arabidopsis_suecica_LTR &> /home/kfuku/docker_img/gfe/usr/local/bin/SubPhaser/example_data/Arabidopsis_suecica_tmp/Arabidopsis_suecica_LTR.inner.fa.tesort.log`
22-12-18 06:54:56 [INFO] New check point file: `/home/kfuku/docker_img/gfe/usr/local/bin/SubPhaser/example_data/Arabidopsis_suecica_tmp/Arabidopsis_suecica_LTR.tesort.ok`
Traceback (most recent call last):
  File "/opt/conda/envs/biotools/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 784, in main
    pipeline.run()
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 518, in run
    ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 556, in step_ltr
    ltrs, ltrfile = pipeline.run()
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/LTR.py", line 335, in run
    d_class = self.classfify(ltrs)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/LTR.py", line 397, in classfify
    for classification in CommonClassifications(clsfile):
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/api/TEsorter/app.py", line 339, in _parse
    for i, line in enumerate(open(self.clsfile)):
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/xopen/__init__.py", line 1291, in xopen
    opened_file = open(filename, mode, **text_mode_kwargs)  # type: ignore
FileNotFoundError: [Errno 2] No such file or directory: '/home/kfuku/docker_img/gfe/usr/local/bin/SubPhaser/example_data/Arabidopsis_suecica_tmp/Arabidopsis_suecica_LTR.inner.fa.cls.tsv'
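
The traceback above shows the pipeline opening `Arabidopsis_suecica_LTR.inner.fa.cls.tsv` right after the `TEsorter` command, even though that file was never written. The actual cause is therefore more likely to be recorded in the `.tesort.log` file named in the command than in the traceback itself. The following is only a minimal sketch of a defensive check. The helper name `require_output` is hypothetical and not part of SubPhaser.

```python
import os

def require_output(path, log_path, tail=20):
    """Return `path` if it exists and is non-empty.

    Otherwise raise FileNotFoundError and include the last lines of the
    external tool's log, so the real failure is visible to the user.
    (Hypothetical helper, not SubPhaser's own code.)
    """
    if os.path.exists(path) and os.path.getsize(path) > 0:
        return path
    log_tail = ""
    if os.path.exists(log_path):
        with open(log_path, errors="replace") as handle:
            log_tail = "".join(handle.readlines()[-tail:])
    raise FileNotFoundError(
        f"expected output {path!r} is missing or empty; "
        f"last lines of {log_path!r}:\n{log_tail}"
    )
```

With a check like this, a failed classification step would stop with the tail of the tool's log rather than a bare missing-file error further downstream.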

IndexError: index -1 is out of bounds for axis 0 with size 0

Hi, thanks for developing the tool. I tried the ginger example and it ran successfully. But when I used my own triploid genome (3n=63), I hit an error. My config file is as follows:
1 2 3
4 5 6
7 8 9
10 11 12
13 14 15
16 17 18
19 21 22
23 24 25
27 28 29
30 31 32
33 34 35
36 37 38
39 40 41
42 43 45
47 48 49
50 51 52
53 54 59
60 64 65
67 68 71
72 73 74
75 76 77

The command was 'subphaser -i ref.fa -c config.txt -pre out'. Then I got this error:

22-06-02 16:24:49 [INFO] Summary of overall LTR insertion age (million years):
/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3440: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "/home/wangyue/software/miniconda2/envs/SubPhaser/bin/subphaser", line 33, in
sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main
pipeline.run()
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run
ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 600, in step_ltr
enrich_ltrs = LTR.plot_insert_age(ltrs, d_enriched, prefix, shared=d_shared,
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/LTR.py", line 513, in plot_insert_age
d_info = summary_ltr_time(d_data, fout)
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/LTR.py", line 578, in summary_ltr_time
np.median(xages), abs(np.percentile(xages, 2.5)), np.percentile(xages, 97.5)))
File "<array_function internals>", line 5, in percentile
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3867, in percentile
return _quantile_unchecked(
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3986, in _quantile_unchecked
r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3564, in _ureduce
r = func(a, **kwargs)
File "/home/wangyue/software/miniconda2/envs/SubPhaser/lib/python3.8/site-packages/numpy/lib/function_base.py", line 4098, in _quantile_ureduce_func
n = np.isnan(ap[-1])
IndexError: index -1 is out of bounds for axis 0 with size 0

Also, the results in the file "outk15_q200_f2.chrom-subgenome.tsv" show a different number of chromosomes for each genotype.

I don't know what the problem is. Can you give me any advice? Thanks a lot.
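
The traceback above ends in `np.percentile(xages, ...)` on an array of size 0, preceded by NumPy's "Mean of empty slice" warning. This suggests that at least one group had no LTR insertion ages to summarize. The following is a minimal sketch of guarding such a summary against empty input. The function name `summarize_ages` is an assumption for illustration and not SubPhaser's own code.

```python
import numpy as np

def summarize_ages(ages):
    """Return (mean, median, 2.5th percentile, 97.5th percentile) of `ages`.

    Returns None for empty input instead of letting NumPy warn or raise.
    (Hypothetical helper for illustration.)
    """
    ages = np.asarray(ages, dtype=float)
    if ages.size == 0:
        return None
    return (
        float(np.mean(ages)),
        float(np.median(ages)),
        float(np.percentile(ages, 2.5)),
        float(np.percentile(ages, 97.5)),
    )
```

A caller could then skip or flag any group for which the summary is `None`, rather than crash.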

Autopolyploid config file

Hey Dr Zhang @zhangrengang
I was reading your manuscript and wondering what the config file for an autopolyploid genome looks like. For example, Medicago sativa.

Error in os.link(figfile, dstfig)

I came across a seemingly rare error, which occurred when I ran SubPhaser installed in a Singularity container on macOS using Vagrant. This is not a big issue for me because it didn't happen when I used the same container in my main environment on a Linux server, but I would like to report it here.

(Please note that I used Arabidopsis thaliana as input only for testing purposes.)

22-12-22 12:11:09 [INFO] ###Step: Circos
22-12-22 12:11:09 [INFO] Limit memory 4.6G per process with total memory 1.2
22-12-22 12:11:09 [INFO] Using 1 processes to align chromosome sequences
22-12-22 12:11:09 [INFO] Check point file: `/gfe_data/tmp/1_Arabidopsis_thaliana/Arabidopsis_thaliana.tmp/Arabidopsis_thaliana.Blocks/Chr1-Chr2.paf.ok` exists; skip this step
22-12-22 12:11:09 [INFO] Start Pool with 1 process(es)
22-12-22 12:11:10 [INFO] Copy `/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/circos` to `/gfe_data/tmp/1_Arabidopsis_thaliana/Arabidopsis_thaliana.subphaser/`
using cutoff: upper 43576.5 for SG1
using cutoff: upper 694.5 for SG2
22-12-22 12:11:12 [INFO] run CMD: `cd /gfe_data/tmp/1_Arabidopsis_thaliana/Arabidopsis_thaliana.subphaser/Arabidopsis_thaliana.k15_q200_f2.circos && circos -conf ./circos.conf`
Traceback (most recent call last):
  File "/opt/conda/envs/biotools/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 784, in main
    pipeline.run()
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 524, in run
    self.step_circos(
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/__main__.py", line 684, in step_circos
    Circos.circos_plot(self.chromfiles, wkdir, *args, **kargs)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/subphaser-1.2.5-py3.9.egg/subphaser/Circos.py", line 515, in circos_plot
    os.link(figfile, dstfig)
PermissionError: [Errno 1] Operation not permitted: '/gfe_data/tmp/1_Arabidopsis_thaliana/Arabidopsis_thaliana.subphaser/Arabidopsis_thaliana.k15_q200_f2.circos/circos.png' -> '/gfe_data/tmp/1_Arabidopsis_thaliana/Arabidopsis_thaliana.subphaser/Arabidopsis_thaliana.k15_q200_f2.circos.png'
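
The failing call in the traceback is `os.link(figfile, dstfig)`, which creates a hard link. Some filesystems, possibly including the shared folder used in this Vagrant setup, do not permit hard links. A minimal sketch of a fallback that copies the file when linking fails is shown here. The helper name `link_or_copy` is hypothetical and not part of SubPhaser.

```python
import os
import shutil

def link_or_copy(src, dst):
    """Hard-link `src` to `dst`, copying instead if linking is not permitted.

    Returns "link" or "copy" to indicate which path was taken.
    (Hypothetical helper for illustration.)
    """
    if os.path.lexists(dst):
        os.remove(dst)  # os.link refuses to overwrite an existing file
    try:
        os.link(src, dst)
        return "link"
    except OSError:  # PermissionError is a subclass of OSError
        shutil.copyfile(src, dst)
        return "copy"
```

The copy costs extra disk space compared with a hard link, but the destination file is created either way.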

Unbalanced chromosome numbers and differential kmer numbers among subgenomes

Hi~,
I have a whole new set of problems now:

One of the subgenomes has abnormally few subgenome-specific kmers, and the numbers of chromosomes assigned to the subgenomes are abnormally unbalanced. I have tried -k (8, 13, 15, 17, 22, 27, 33, 37, 45, 50), -q (10, 200, 600, 1000), and -f (1.5, 2), but none of these solved the problem. Do you have any suggestions?

Too few markers

Hi,

I am trying to use SubPhaser to phase the subgenomes of my species. The parental species are unknown, and I built a fairly good chromosome-level assembly with 99% of BUSCOs complete. I named the subgenomes after a synteny analysis with a closely related species, Sorghum bicolor. When I ran SubPhaser, I managed to phase the subgenomes, but there seem to be very few kmer markers, and no LTRs were found, far fewer than in your example files. I used the parameters -k 13 -q 100 -f 2 -disable_ltr.
Attachment: k13_q100_f2.0.circos.pdf
Is this result trustworthy? What could be the reason?

thanks,
Cui

Singularity container fails if environmental variable `R_LIBS_USER` is set

Hi!

I was able to finish the pipeline with the Singularity version, but had to reset the path to the R libraries manually (I have a custom R library path set in my .bashrc). The following was sufficient:

export R_LIBS_USER=/share/home/app/bin/miniconda3/envs/SubPhaser/lib/R/library/

Maybe it's worth adding that variable to the container recipe.

Also, the mafft stage fails if $TMPDIR points to a path other than /tmp (it was /scratch in my case); I had to specify the bind path manually when running the container. (Could this be fixed by adding $TMPDIR to the default bind paths or to the SINGULARITY_BIND variable?)

Thanks again!
Nikita
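
The two workarounds described in this post can be sketched as a small launcher helper that prepares the environment before calling the container. The sketch rests on two assumptions: that Singularity passes host environment variables into the container by default, and that it reads extra bind paths from `SINGULARITY_BIND`. The function name `singularity_env` and its parameters are hypothetical.

```python
import os

def singularity_env(base_env=None, r_libs=None, default_tmp="/tmp"):
    """Build an environment dict for launching the container.

    - Overrides R_LIBS_USER when `r_libs` is given, as the poster did by hand.
    - Appends $TMPDIR to SINGULARITY_BIND when it is not the default /tmp.
    (Hypothetical helper for illustration.)
    """
    env = dict(os.environ if base_env is None else base_env)
    if r_libs is not None:
        env["R_LIBS_USER"] = r_libs
    tmpdir = env.get("TMPDIR", "")
    if tmpdir and os.path.normpath(tmpdir) != default_tmp:
        # SINGULARITY_BIND holds a comma-separated list of bind paths.
        binds = [b for b in env.get("SINGULARITY_BIND", "").split(",") if b]
        if tmpdir not in binds:
            binds.append(tmpdir)
        env["SINGULARITY_BIND"] = ",".join(binds)
    return env
```

The returned dict could be passed as `env=` to `subprocess.run` when invoking `singularity exec`, with `r_libs` set to the library path reported in the post.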

matplotlib raise RuntimeError ('Invalid DISPLAY variable')

Hi, when plotting the kmer_freq figure, SubPhaser reported an error like this:

"23-09-13 23:23:33 [INFO] Plot k15_q200_f2.kmer_freq.pdf
Traceback (most recent call last):
  File "~/.conda/envs/SubPhaser/bin/subphaser", line 33, in <module>
    sys.exit(load_entry_point('subphaser==1.2.6', 'console_scripts', 'subphaser')())
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 790, in main
    pipeline.run()
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/__main__.py", line 415, in run
    d_mat = dumps.filter(d_mat, lengths, self.sgs, outfig=histfig, #d_targets=d_targets, 
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/Jellyfish.py", line 504, in filter
    plot_histogram(tot_freqs, outfig, vline=None)
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.6-py3.8.egg/subphaser/Jellyfish.py", line 647, in plot_histogram
    plt.figure(figsize=(7,5), dpi=300, tight_layout=True)
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/matplotlib/pyplot.py", line 797, in figure
    manager = new_figure_manager(
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/matplotlib/pyplot.py", line 316, in new_figure_manager
    return _backend_mod.new_figure_manager(*args, **kwargs)
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 3545, in new_figure_manager
    return cls.new_figure_manager_given_figure(num, fig)
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/matplotlib/backend_bases.py", line 3550, in new_figure_manager_given_figure
    canvas = cls.FigureCanvas(figure)
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/matplotlib/backends/backend_qt5agg.py", line 21, in __init__
    super().__init__(figure=figure)
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/matplotlib/backends/backend_qt5.py", line 213, in __init__
    _create_qApp()
  File "~/.conda/envs/SubPhaser/lib/python3.8/site-packages/matplotlib/backends/backend_qt5.py", line 108, in _create_qApp
    raise RuntimeError('Invalid DISPLAY variable')
RuntimeError: Invalid DISPLAY variable"

How can I solve it?
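
The traceback above passes through `backend_qt5agg.py`, meaning matplotlib selected an interactive Qt backend that needs a working X display. A common workaround on headless servers is to select the non-interactive `Agg` backend before `matplotlib.pyplot` is first imported, for example through the `MPLBACKEND` environment variable. The following is a minimal sketch. The function `save_histogram` is a hypothetical stand-in and not SubPhaser's own `plot_histogram`.

```python
import os

# Must be set before matplotlib.pyplot is imported for the first time.
os.environ["MPLBACKEND"] = "Agg"

def save_histogram(values, outfig, bins=50):
    """Write a histogram of `values` to `outfig` without needing a display.

    (Hypothetical helper for illustration.)
    """
    import matplotlib.pyplot as plt  # imported after the backend is chosen
    fig = plt.figure(figsize=(7, 5), dpi=300, tight_layout=True)
    plt.hist(values, bins=bins)
    fig.savefig(outfig)
    plt.close(fig)
```

Setting `MPLBACKEND=Agg` in the shell before starting the run should have the same effect without changing any code.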

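Subgenome analysis

Hello, Prof. Zhang! I am using SubPhaser to partition the A and B subgenomes of an allotetraploid and have obtained preliminary results; could you please take a look? The assembled genome has 22 chromosomes in total. After the SubPhaser analysis, one group has 12 chromosomes and the other has 10, which confuses me. I would appreciate your advice. The results are shown in the figure below: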

ValueError: n_components=3 must be between 0 and min

Dear author,
Thanks for your useful pipeline, but I hit an error when using SubPhaser.
My command:
subphaser -i groups_genome.fasta -c groups_sg.config
My config file:

But I got this error:

What is going wrong?

Invalid specifier: '>=3.6:'

With Python 3.9, I got the following error from python setup.py install when installing the latest SubPhaser.

> python setup.py install
error in subphaser setup command: 'python_requires' must be a string containing valid version specifiers; Invalid specifier: '>=3.6:'

The installation worked well after replacing python_requires='>=3.6:' with python_requires='>=3.6' in setup.py.

kmer 13 or less gives a lot of broken pipe errors

Thank you for developing this pipeline.

I've noticed that for my allotetraploid species, -k 15 and -k 14 work fine, but once I try -k 13 or -k 12, there are a lot of broken pipe errors, and many of the underlying Python scripts start failing.

I was wondering if I could talk to someone about this? Thanks!

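Triploid genome

Hello, Professor,
My triploid genome was partitioned like this, with 17 chromosomes per haploid set. Can this be improved?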

Suggestion for specific settings to improve subphasing

Hi
Thanks a lot for this awesome tool!

I am trying to phase an allopentaploid genome which we expect to have 4 subgenomes. Although the clustering works very well, I am having trouble adjusting the settings to get the four subgenomes correctly identified. SubPhaser normally identifies 3 subgenomes, but if I set -nsg 4 it does not identify the 4th subgenome correctly based on the clustering; instead it wrongly splits one subgenome. Please see below.

Using -nsg 3:

Using -nsg 4:

Using only the set of chromosomes from S1/2 and s3 from the two subgenomes that should be split:

Ideally, I would like to have in one run the 4 subgenomes correctly identified and split. Any suggestions are welcome!
Best
André