Code Monkey home page Code Monkey logo

bc-thesis-supplement's Introduction

Supplementary repository of Ondřej Sladký's bachelor's thesis

Introduction

This repository contains links to all the supplementary material for my bachelor's thesis Masked Superstrings for Efficient k-Mer Set Representation and Indexing, i.e., experimental pipelines and results and individual plots.

The thesis is based on two arcitles, consider also visiting their respective supplementary repositories.

Citation

To cite the thesis, use the following:

@mastersthesis{sladky2024-thesis,
    title     = { Masked Superstrings for Efficient $k$-Mer Set Representation and Indexing },
    author    = { Sladk{\'y}, Ond{\v r}ej },
    type      = { Bachelor's Thesis },
    year      = { 2024 },
    publisher = { Charles University },
    doi       = { 10.5281/zenodo.11076871 }
}

If you use masked superstrings or KmerCamel🐫 please cite also our paper on masked superstrings:

Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717

@article{sladky2023-masked-superstrings,
  title   = { Masked superstrings as a unified framework for textual $k$-mer set representations },
  author  = { Sladk{\'y}, Ond{\v r}ej and Vesel{\'y}, Pavel and B{\v r}inda, Karel },
  journal = { bioRxiv },
  volume  = { 2023.02.01.526717 },
  year    = { 2023 },
  doi     = { 10.1101/2023.02.01.526717 }
}

If you use $f$-masked superstrings, FMS-index or FMSI, please also cite our paper on $f$-masked superstrings:

Ondřej Sladký, Pavel Veselý, and Karel Břinda: Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k-Mer Sets. bioRxiv 2024.03.06.583483, 2024. https://doi.org/10.1101/2024.03.06.583483

@article {sladky2024-f-masked-superstrings,
	author = {Ond{\v r}ej Sladk{\'y} and Pavel Vesel{\'y} and Karel B{\v r}inda},
	title = {Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for 𝑘-Mer Sets},
	elocation-id = {2024.03.06.583483},
	year = {2024},
	doi = {10.1101/2024.03.06.583483},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2024/03/11/2024.03.06.583483},
	eprint = {https://www.biorxiv.org/content/early/2024/03/11/2024.03.06.583483.full.pdf},
	journal = {bioRxiv}
}

Methods

Masked superstrings computation - KmerCamel🐫

Superstrings were computed using the KmerCamel🐫 program, which local and global greedy heuristics for masked superstring computation using hash tables and also experimental implementations using Aho-Corasick automaton.

By default, the superstrings computed by KmerCamel🐫 come with default masks; these contain the minimal possible number of 1's (i.e., every k-mer masked on only once) and the patterns of 1's and 0's reflect the orders in which individual k-mers were added to the superstrings. Additionally, these masks can be optimized by KmerCamel🐫 to either contain the maximum/minimum number of ones or the minimum number of runs of ones.

Importantly, changes in the underlying data structures (hash-table vs. AC automaton), as well as changing machines or compilers, results/may result in different superstrings and their mask, and the specific choices can affect mask compressibility. For instance, hash-table-based approaches tend to produce more regular masks that are better compressible (e.g., for nearly complete de Bruijn graphs).

Indexing masked superstrings - FMSI

Indexing, membership queries, and set operations on k-mer sets represented via f-masked superstrings was performed and benchmarked on FMSI, which experimentaly implements membership queries as well as several basic operations on indexed masked superstrings such as normalization, export and merging, which can be used to perform set operations.

Experimental evaluation

Benchmark datasets

For generating negative membership queries to these datasets, we used a 2MB prefix of the FASTA file for chromosome 1 of H. sapiens genome (GRCh38.p14 Primary Assembly, NC_000001.11), downloaded from NCBI; see data/GRCh38.p14.chromosome1.prefix2M.fasta.xz

Reproducing experimental results

To download the used tools, run the following:

git submodule update --init

Besides standard Linux programs the experimental pipeline requires Snakemake and seqtk).

All the experimental pipelines are located in the experiments directory. There are several subdirectories with individual experimental pipelines. The folders prefixed with 1* correspond to experiments regarding storing masked superstrings (Chapters 2 and 3 of the thesis) and these prefixed with 2* correspond to experiments on indexing (Chapters 4 and 5). The individual pipelines are the following:

  • 11_compute_tigs/14_subsampled_compute_tigs - computation of unitigs (required by matchtigs), simplitigs, eulertigs and greedy/optimal matchtigs; for not subsampled / subsampled data
  • 12_optimize_tigs/15_subsampled_optimize_tigs - mask optimization of previously computed *tigs and compression with different algorithms; for not subsampled / subsampled data
  • 13_compute_camel/16_subsampled_compute_camel - computation and optimization of masked superstrings with KmerCamel🐫 and compression with different algorithms; for not subsampled / subsampled data
  • 07_collect_results - aggregation of data from previous pipelines
  • 21_build_and_query_memtime - contains additional sub-pipelines for build and query time vs. memory experiments on indexes
  • 22_set_operations - experiment on set operations using FMSI

Each pipeline can be run by command make in the corresponding directory. For some make test can be used to check if everything is set up correctly. Pipelines should be run from smaller numbers to larger.

Figures + supplementary plots

Figures 6.1 - 6.5 were generated using R. The individual plots for each of these figures can be found in the figures directory under a subdirectory starting with the figure number, i.e. 6.x. The plots used in the thesis can be generated by running make in the appropriate subdirectory. This also generates plots for other datasets and for other parameters.

Figure 1.1 was drawn in Google Slides, Figures 1.2 - 1.4 and 2.1 were created in IPE, and Figures 3.1 and 4.1 were made directly in LaTeX. Figure 6.5 was additionally modified in IPE. Figure 6.6 was created in Google Slides from the data in the 22_set_operations experiment.

Contact

Ondřej Sladký ([email protected])

bc-thesis-supplement's People

Contributors

ondrejsladky avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.