This repository contains links to all the supplementary material for my bachelor's thesis Masked Superstrings for Efficient k-Mer Set Representation and Indexing, i.e., experimental pipelines and results and individual plots.
The thesis is based on two arcitles, consider also visiting their respective supplementary repositories.
- Paper I: Masked superstrings as a unified framework for textual k-mer set representations - supplementary repository
- Paper II: Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k-Mer Sets - supplementary repository
To cite the thesis, use the following:
@mastersthesis{sladky2024-thesis,
title = { Masked Superstrings for Efficient $k$-Mer Set Representation and Indexing },
author = { Sladk{\'y}, Ond{\v r}ej },
type = { Bachelor's Thesis },
year = { 2024 },
publisher = { Charles University },
doi = { 10.5281/zenodo.11076871 }
}
If you use masked superstrings or KmerCamel🐫 please cite also our paper on masked superstrings:
Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717
@article{sladky2023-masked-superstrings,
title = { Masked superstrings as a unified framework for textual $k$-mer set representations },
author = { Sladk{\'y}, Ond{\v r}ej and Vesel{\'y}, Pavel and B{\v r}inda, Karel },
journal = { bioRxiv },
volume = { 2023.02.01.526717 },
year = { 2023 },
doi = { 10.1101/2023.02.01.526717 }
}
If you use
Ondřej Sladký, Pavel Veselý, and Karel Břinda: Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k-Mer Sets. bioRxiv 2024.03.06.583483, 2024. https://doi.org/10.1101/2024.03.06.583483
@article {sladky2024-f-masked-superstrings,
author = {Ond{\v r}ej Sladk{\'y} and Pavel Vesel{\'y} and Karel B{\v r}inda},
title = {Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for 𝑘-Mer Sets},
elocation-id = {2024.03.06.583483},
year = {2024},
doi = {10.1101/2024.03.06.583483},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/03/11/2024.03.06.583483},
eprint = {https://www.biorxiv.org/content/early/2024/03/11/2024.03.06.583483.full.pdf},
journal = {bioRxiv}
}
Superstrings were computed using the KmerCamel🐫 program, which local and global greedy heuristics for masked superstring computation using hash tables and also experimental implementations using Aho-Corasick automaton.
By default, the superstrings computed by KmerCamel🐫 come with default masks; these contain the minimal possible number of 1's (i.e., every k-mer masked on only once) and the patterns of 1's and 0's reflect the orders in which individual k-mers were added to the superstrings. Additionally, these masks can be optimized by KmerCamel🐫 to either contain the maximum/minimum number of ones or the minimum number of runs of ones.
Importantly, changes in the underlying data structures (hash-table vs. AC automaton), as well as changing machines or compilers, results/may result in different superstrings and their mask, and the specific choices can affect mask compressibility. For instance, hash-table-based approaches tend to produce more regular masks that are better compressible (e.g., for nearly complete de Bruijn graphs).
Indexing, membership queries, and set operations on k-mer sets represented via f-masked superstrings was performed and benchmarked on FMSI, which experimentaly implements membership queries as well as several basic operations on indexed masked superstrings such as normalization, export and merging, which can be used to perform set operations.
-
S. pneumoniae genome (ATCC 700669, NC_011900.1, fna
online)
- The resulting file: data/spneumoniae.fa.xz
-
S. pneumoniae pan-genome - 616 genomes, as provided in RASE DB S.
pneumoniae
- k-mers were collected and stored in the form of simplitigs (ProphAsm v0.1.1, k=63)
- The resulting file: data/spneumo_pangenome-616.simplitigs.k63.fa.xz
- S. cerevisiae genome (S288C, fna.gz online)
-
E. coli pan-genome, obtained as the union of the genomes from the 661k collection, downloaded from Phylogenetically compressed 661k collection
-
k-mers were collected and stored in the form of unitigs with
$k = 32$ and$k=61$ (BCALM 2, version v2.2.3, git commit e57cc46) - not provided in this repository due to file size (ask if you wish to get these datasets)
-
k-mers were collected and stored in the form of unitigs with
-
SARS-CoV-2 pan-genome - downloaded from GISAID
(access upon registration)
- 590k genomes for experiments on compressibility
- k-mers were collected using JellyFish 2 (v2.2.10, k=63) and stored in the form of simplitigs (ProphAsm v0.1.1, k=63)
- The resulting file: data/sars-cov-2_pangenome_k32.fa.xz
- 14.7M genomes for experiments on indexing
- k-mers were collected using JellyFish 2 (v2.2.10, k=32) and stored in the form of simplitigs (ProphAsm v0.1.1, k=32)
- The resulting file: data/sars-cov-2_pangenome_k32.fa.xz
- 590k genomes for experiments on compressibility
- Human genome (
GRCh38.p14
, genome length 3.1 Gbp)- Downloaded from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/
- not provided in this repository due to file size
-
C. elegans (
NC_003279.8
) - downloaded from NCBI -
C. briggsae (
NC_013489.2
) - downloaded from NCBI
For generating negative membership queries to these datasets, we used a 2MB prefix of the FASTA file for chromosome 1 of H. sapiens genome (GRCh38.p14 Primary Assembly
, NC_000001.11
), downloaded from NCBI; see data/GRCh38.p14.chromosome1.prefix2M.fasta.xz
To download the used tools, run the following:
git submodule update --init
Besides standard Linux programs the experimental pipeline requires Snakemake and seqtk).
All the experimental pipelines are located in the experiments
directory.
There are several subdirectories with individual experimental pipelines. The folders prefixed with 1*
correspond to experiments regarding storing masked superstrings (Chapters 2 and 3 of the thesis) and these prefixed with 2*
correspond to experiments on indexing (Chapters 4 and 5).
The individual pipelines are the following:
11_compute_tigs
/14_subsampled_compute_tigs
- computation of unitigs (required by matchtigs), simplitigs, eulertigs and greedy/optimal matchtigs; for not subsampled / subsampled data12_optimize_tigs
/15_subsampled_optimize_tigs
- mask optimization of previously computed *tigs and compression with different algorithms; for not subsampled / subsampled data13_compute_camel
/16_subsampled_compute_camel
- computation and optimization of masked superstrings with KmerCamel🐫 and compression with different algorithms; for not subsampled / subsampled data07_collect_results
- aggregation of data from previous pipelines21_build_and_query_memtime
- contains additional sub-pipelines for build and query time vs. memory experiments on indexes22_set_operations
- experiment on set operations using FMSI
Each pipeline can be run by command make
in the corresponding directory. For some make test
can be used to check if everything is set up correctly. Pipelines should be run from smaller numbers to larger.
Figures 6.1 - 6.5 were generated using R. The individual plots for each of these figures can be found in the figures
directory under a subdirectory starting with the figure number, i.e. 6.x
.
The plots used in the thesis can be generated by running make
in the appropriate subdirectory.
This also generates plots for other datasets and for other parameters.
Figure 1.1 was drawn in Google Slides,
Figures 1.2 - 1.4 and 2.1 were created in IPE, and Figures 3.1 and 4.1 were made directly in LaTeX.
Figure 6.5 was additionally modified in IPE.
Figure 6.6 was created in Google Slides from the data in the 22_set_operations
experiment.
Ondřej Sladký ([email protected])