
atlas-feature-selection-benchmark's Introduction

Atlas feature selection benchmarking

This repository contains code for benchmarking the effect of feature selection on scRNA-seq atlas construction and use.

For more information please refer to the documentation on the wiki.

Directory structure

  • analysis/ - Notebooks used to perform analysis of the results
    • R/ - R functions used in the analysis notebooks
  • bin/ - Scripts used in Nextflow workflows
    • functions/ - Functions used across multiple scripts
  • conf/ - Nextflow configuration files
  • envs/ - conda environment YAML files
  • output/ - Output from Nextflow workflows (not included in git)
  • reports/ - RMarkdown files and functions for output reports
  • work/ - The Nextflow working directory (not included in git)
  • workflows/ - Nextflow workflow files
  • LICENSE - The project license
  • main.nf - Main Nextflow workflow file
  • nextflow.config - Main Nextflow config file
  • README.md - This README
  • style_bin.sh - A script for styling the files in bin/

atlas-feature-selection-benchmark's People

Contributors

amitfrish, cramsuig, lazappi, oliverdietrich, sabrinarichter, wwxkenmo

atlas-feature-selection-benchmark's Issues

New metric: F1

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The F1 score is a commonly used evaluation metric that measures classification performance as the harmonic mean of precision and recall.
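
As a rough sketch of how this might be computed from true and predicted query labels, a minimal example using scikit-learn (the toy labels are placeholders):

    from sklearn.metrics import f1_score

    # Toy example: true cell labels vs. labels predicted from reference mapping
    true = ["B", "B", "T", "T", "NK", "NK"]
    pred = ["B", "T", "T", "T", "NK", "NK"]

    # The `average` argument is where the class-averaging decision noted below is made
    print(f1_score(true, pred, average="micro"))
    print(f1_score(true, pred, average="macro"))     # unweighted mean over labels
    print(f1_score(true, pred, average="weighted"))  # weighted by label frequency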

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

  • Need to decide what kind of class averaging to use

New method: singleCellHaystack

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

singleCellHaystack is a package for predicting differentially expressed genes (DEGs) in single-cell transcriptome data without the use of cell labels. It uses Kullback-Leibler Divergence to find genes that are expressed in subsets of cells that are non-randomly positioned in a reduced dimensional space.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • use.advanced.sampling mode

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New method: scSEGIndex

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

The single-cell Stably Expressed Gene (scSEG) index, available in the {scMerge} package, measures how stably expressed a gene is across a dataset and can be used to select stably expressed genes. This is the opposite of typical methods, which look for highly variable genes, so it would serve as a negative control for the benchmark.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

  • Need to decide which genes to select based on the score, the top n genes is probably the simplest approach

New metric: NMI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Normalised Mutual Information (NMI) measures the similarity between two sets of labels (in this case the ground truth cell labels and cluster assignments). It was used as part of the scIB project.
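
A minimal sketch using scikit-learn, assuming cluster assignments have already been computed (the toy inputs are placeholders):

    from sklearn.metrics import normalized_mutual_info_score

    # Toy example: ground truth cell labels vs. cluster assignments
    ground_truth = ["B", "B", "T", "T", "NK", "NK"]
    clusters = [0, 0, 1, 1, 1, 2]

    nmi = normalized_mutual_info_score(ground_truth, clusters)
    print(f"NMI: {nmi:.3f}")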

Links

Anything else?

Anything else you think is important about the metric

  • Clustering can be optimised for this metric using a function in scib

Things to think about

Promising:

  • reference mapping metrics
  • downstream analysis after integration and/or reference mapping, then metrics for this

Not sooo promising:

  • more integration metrics
  • potentially more methods

New method: seurat

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

{Seurat} is the most commonly used R toolbox. It contains a highly variable gene feature selection function that selects features by either performing a variance stabilising transformation and selecting variable features ("vst"), binning features by expression and selecting over-dispersed features ("mean.var.plot") or simply selecting the features with highest dispersion values ("dispersion").

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Method ("vst", "mean.var.plot", "dispersion")
  • Number of features (for "dispersion" and "vst")
  • Overall vs per batch

Links

Anything else?

  • Some methods are also implemented in scanpy but both should be included for comparison
  • Consider implementing each method separately (different scripts)

New feature: Pass reference and query to metrics

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

Adapt pipeline to pass reference and query integrated objects to metrics. This is required by some of the mapping/unseen population metrics.

Anything else?

Anything else you think is important about the feature

New method: Wilcoxon

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features by taking the top marker genes for each label as detected using the Wilcoxon rank sum test
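
A rough sketch of one way this could look using scanpy's Wilcoxon rank-sum implementation (the label column name and number of genes per label are assumptions, and the data are assumed to be log-normalised):

    import scanpy as sc

    def select_wilcoxon_markers(adata, label_col="Label", n_per_label=20):
        """Select the union of the top marker genes per label (Wilcoxon rank-sum test)."""
        sc.tl.rank_genes_groups(adata, groupby=label_col, method="wilcoxon")
        selected = set()
        for label in adata.obs[label_col].unique():
            markers = sc.get.rank_genes_groups_df(adata, group=str(label))
            selected.update(markers["names"].head(n_per_label))
        return sorted(selected)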

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Filtering of genes (expression/proportion)
  • Ranking/sorting
  • Number of genes per label

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

Filtering 0 count genes in reference and query

Currently, removal of genes with 0 counts is done before splitting into reference and query datasets. This keeps the feature sets identical but means one of them could still contain genes with 0 counts. Consider whether filtering should be done on each separately and the intersection used.
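
A sketch of the separate-filter-then-intersect option, assuming both splits are AnnData objects that still share the same genes in the same order (names are placeholders):

    import numpy as np

    def shared_expressed_genes(reference, query):
        """Genes with at least one non-zero count in both the reference and the query."""
        ref_keep = np.asarray((reference.X > 0).sum(axis=0)).ravel() > 0
        query_keep = np.asarray((query.X > 0).sum(axis=0)).ravel() > 0
        return reference.var_names[ref_keep & query_keep]

    # genes = shared_expressed_genes(reference, query)
    # reference = reference[:, genes].copy()
    # query = query[:, genes].copy()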

New metric: ref-query ILIS

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

LISI between the reference and the query.
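
A simplified sketch of the idea, computing a plain inverse Simpson's index over each cell's k nearest neighbours with reference/query as the label (the perplexity-based weighting of the original LISI implementation is skipped; names are placeholders):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def simple_ref_query_lisi(embedding, is_query, n_neighbors=30):
        """Mean inverse Simpson's index of reference/query labels in each cell's kNN."""
        labels = np.asarray(is_query).astype(int)
        nn = NearestNeighbors(n_neighbors=n_neighbors).fit(embedding)
        _, idx = nn.kneighbors(embedding)
        scores = []
        for neighbours in idx:
            props = np.bincount(labels[neighbours], minlength=2) / n_neighbors
            scores.append(1.0 / np.sum(props ** 2))
        # Ranges from 1 (reference and query separated) to 2 (fully mixed)
        return float(np.mean(scores))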

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

  • This should be able to be implemented by modifying one of the existing LISI scripts to use different labels (reference/query instead of batch).
  • Will need to take both the reference and mapped query objects

New method: SCMER

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

SCMER is a feature selection method designed for single-cell data analysis. It selects a compact set of markers that preserve the manifold in the original data. It can also be used for multimodal data integration by using features in one modality to match the manifold of another modality.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Cell cycle conservation

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The cell cycle conservation score measures how much of the variance associated with the cell cycle in individual batches remains after integration. It was used as part of the scIB project.

Links

Anything else?

Anything else you think is important about the metric

  • Requires the pipeline to be modified to pass species information for datasets to metrics

New dataset: Human Endoderm Atlas

Thanks for taking the time to suggest a new dataset!

Description

Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?

The Human Endoderm Atlas [HEA] is a reference atlas of multiple endodermal organs from human development. It contains 34 samples from 14 individuals across 15 tissues from six organs. High quality labels with two hierarchical levels (major cell type, 7; cell type, 27) are available.

Links

Links to information about the dataset

Anything else?

Anything else you think is important about the dataset

  • This dataset is large (155,232 cells)

New method: scanpy

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

scanpy is the most commonly used Python toolbox and contains three methods for selecting highly variable genes: "seurat" (default), "cell_ranger" and "seurat_v3". For "seurat" and "cell_ranger" genes are binned by mean expression and normalised dispersions calculated per bin. "seurat" uses thresholds to select features while "cell_ranger" uses a target number of genes.

"seurat_v3" starts with raw counts (rather than log normalised) and applies a variance stabilising transformation and ranking genes by normalised variance.

More details on methods in the scanpy documentation.
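
A rough sketch covering the flavors described above (the batch_key argument corresponds to the per-batch variant listed below; how normalisation is handled beforehand is an assumption):

    import scanpy as sc

    def select_scanpy_hvgs(adata, flavor="seurat", n_top_genes=2000, batch_key=None):
        """Return highly variable genes for a given scanpy flavor."""
        adata = adata.copy()
        if flavor != "seurat_v3":
            # "seurat" and "cell_ranger" expect log-normalised data
            sc.pp.normalize_total(adata, target_sum=1e4)
            sc.pp.log1p(adata)
        # "seurat_v3" works directly on raw counts
        sc.pp.highly_variable_genes(
            adata, flavor=flavor, n_top_genes=n_top_genes, batch_key=batch_key
        )
        return adata.var_names[adata.var["highly_variable"]].tolist()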

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Flavors ("seurat", "cell_ranger" and "seurat_v3")
  • Number of features (for "cell_ranger" and "seurat_v3")
  • Overall vs per batch

Links

Anything else?

Anything else you think is important about the method

  • The "seurat" and "seurat_v3" flavors are also implemented in the original {Seurat} R package but it is worth including both to compare how similar the implementations are
  • Consider whether everything should be implemented as one method with variants (a single script) or as a separate method for each flavor (multiple scripts)

New metric: MILO

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Use the MILO method to identify cells with enriched query neighbourhoods in unseen cell labels.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

Might need some kind of summarisation to get an overall score.

Weight classification metrics by label rarity

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Classification metrics that are calculated per label (such as F1 score) can be averaged in various ways. The suggestion would be to weight averages by the rarity of the labels to focus on correct classifications of uncommon labels.
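
One possible sketch, averaging per-label F1 scores with weights inversely proportional to label frequency (the exact weighting scheme is an assumption):

    import numpy as np
    from sklearn.metrics import f1_score

    def rarity_weighted_f1(true, pred):
        """Average per-label F1 scores, weighted by inverse label frequency."""
        true = np.asarray(true)
        labels, counts = np.unique(true, return_counts=True)
        per_label = f1_score(true, pred, labels=labels, average=None)
        return float(np.average(per_label, weights=1.0 / counts))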

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New feature: Update to R 4.2

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

Update all R environments to use R 4.2 and Bioconductor 3.16

Anything else?

Anything else you think is important about the feature

New method: Simple statistics

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features based on simple statistical values
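
A minimal sketch covering the mean and variance variants listed below (top-n selection and a cells x genes matrix are assumptions):

    import numpy as np

    def select_by_statistic(counts, gene_names, statistic="variance", n_features=2000):
        """Select the top n genes by a simple per-gene statistic."""
        counts = np.asarray(counts, dtype=float)  # cells x genes
        if statistic == "mean":
            values = counts.mean(axis=0)
        elif statistic == "variance":
            values = counts.var(axis=0)
        else:
            raise ValueError(f"Unknown statistic: {statistic}")
        order = np.argsort(values)[::-1][:n_features]
        return [gene_names[i] for i in order]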

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Mean
  • Variance

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New method: scPNMF

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

The single-cell Projective Non-negative Matrix Factorization (scPNMF) method performs a dimensionality reduction and then filters bases to find those that show evidence of multimodal structure. Features can then be selected based on those bases.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Jaccard Index

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Jaccard index, or Jaccard similarity coefficient, is defined as the size of the intersection divided by the size of the union of two label sets.
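
A small sketch using scikit-learn; the `average` argument is where the class-averaging decision noted below would be made (toy labels are placeholders):

    from sklearn.metrics import jaccard_score

    # Toy example: true cell labels vs. labels predicted from reference mapping
    true = ["B", "B", "T", "T", "NK", "NK"]
    pred = ["B", "T", "T", "T", "NK", "B"]

    print(jaccard_score(true, pred, average="macro"))     # unweighted mean over labels
    print(jaccard_score(true, pred, average="weighted"))  # weighted by label frequency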

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

  • Need to decide what kind of class averaging to use

New metric: kBET

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The k-Nearest Neighbour Batch effect Test (kBET) uses a statistical test to measure the mixing of batches and labels within the neighbourhood of a cell. It was used as part of the scIB project.

Links

Anything else?

Anything else you think is important about the metric

  • kBET can be difficult and slow to implement (partly because it relies on rpy2), so it may not be worth including

New metric: Batch PCR

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The PCA Regression (PCR) comparison measures the amount of variance in the dataset explained by the batch label before and after integration. If the integration performs well, less variance should be explained by batch after integration than before. It was used as part of the scIB project.
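
A simplified sketch of the principal component regression part, computing the fraction of variance explained by batch for a single matrix or embedding (the number of components and the one-hot batch design are assumptions); the metric would compare this value before and after integration:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    def batch_variance_explained(matrix, batches, n_comps=50):
        """Fraction of PCA variance explained by regressing each PC on batch."""
        pca = PCA(n_components=n_comps)
        pcs = pca.fit_transform(matrix)
        design = pd.get_dummies(pd.Series(batches)).to_numpy(dtype=float)
        explained = 0.0
        for i in range(pcs.shape[1]):
            model = LinearRegression().fit(design, pcs[:, i])
            explained += model.score(design, pcs[:, i]) * pca.explained_variance_[i]
        return float(explained / pca.explained_variance_.sum())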

Links

Anything else?

Anything else you think is important about the metric

New metric: Imbalanced clustering metrics

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Add extensions to existing integration metrics based on clustering that take into account the imbalance in ground truth labels.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New method: Hotspot

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Hotspot is a graph-based method that selects features that are associated with similarity between cells (represented as a graph).

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Graph iLISI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Local Inverse Simpson’s Index (LISI) measures diversity in the neighbourhood of a cell. The integration variant (iLISI) gives better scores when the neighbourhood contains cells from batches other than the target cell's. It was used as part of the scIB project where a more flexible graph-based implementation was developed.

Links

Anything else?

Anything else you think is important about the metric

New metric: ARI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Adjusted Rand Index (ARI) measures the similarity between two sets of labels (in this case the ground truth label and a clustering). It adjusts for the similarity that would be expected by chance depending on the size of the clusters. It was used as part of the scIB project.
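
A minimal sketch using scikit-learn, assuming cluster assignments have already been computed (toy inputs are placeholders); this could likely share a script with the NMI metric:

    from sklearn.metrics import adjusted_rand_score

    # Toy example: ground truth cell labels vs. cluster assignments
    ground_truth = ["B", "B", "T", "T", "NK", "NK"]
    clusters = [0, 0, 1, 1, 1, 2]

    ari = adjusted_rand_score(ground_truth, clusters)
    print(f"ARI: {ari:.3f}")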

Links

Anything else?

Anything else you think is important about the metric

  • Clustering can be optimised for this metric using a function in scib

New metric: Reconstruction error

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Distance between an average reconstructed cell and real query cells

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New method: DUBStepR

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

DUBStepR (Determining the Underlying Basis using Step-wise Regression) is a feature selection algorithm for cell type identification in single-cell RNA-sequencing data. It is based on the intuition that cell-type-specific marker genes tend to be well correlated with each other, i.e. they typically have strong positive and negative correlations with other marker genes.

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New method: OSCA

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

The Orchestrating Single-Cell Analysis with Bioconductor book describes how to use core Bioconductor packages to analyse single-cell data. It proposes a method for feature selection that considers batches in the data and selects features with additional biological variation.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Anything else?

Anything else you think is important about the method

New metric: Graph connectivity

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The graph connectivity score measures how connected the subgraphs for each label are. For a well-integrated dataset it is expected that cells with the same label will be well connected while for a poorly integrated dataset they will be more disconnected. It was used as part of the scIB project.
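
A simplified sketch of the scIB-style calculation (for each label, the fraction of its cells in the largest connected component of the kNN subgraph, averaged over labels); using the neighbour graph stored by scanpy in adata.obsp["connectivities"] is an assumption:

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def graph_connectivity(connectivities, labels):
        """Mean over labels of the largest connected component fraction."""
        labels = np.asarray(labels)
        scores = []
        for label in np.unique(labels):
            mask = labels == label
            subgraph = connectivities[mask][:, mask]
            _, components = connected_components(subgraph, directed=False)
            scores.append(np.bincount(components).max() / mask.sum())
        return float(np.mean(scores))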

Links

Anything else?

Anything else you think is important about the metric

New metric: MCC

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Matthews Correlation Coefficient (MCC) measures the quality of binary and multiclass classifications and is regarded as a balanced measure that can be used even if the classes are of very different sizes.
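
A minimal sketch using scikit-learn (toy labels are placeholders):

    from sklearn.metrics import matthews_corrcoef

    # Toy example: true cell labels vs. labels predicted from reference mapping
    true = ["B", "B", "T", "T", "NK", "NK"]
    pred = ["B", "T", "T", "T", "NK", "NK"]

    mcc = matthews_corrcoef(true, pred)
    print(f"MCC: {mcc:.3f}")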

Links

Anything else?

Anything else you think is important about the metric

New feature: Bump scIB version

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

Update scIB environments to the latest release once changes have been merged

Anything else?

Anything else you think is important about the feature

Required to avoid workarounds in:

  • LISI metrics
  • Cell cycle metric

New method: Brennecke

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features based on excess coefficient of variation
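
A simplified sketch of the idea, fitting the trend between squared coefficient of variation and mean expression with ordinary least squares rather than the gamma GLM used in the original method (top-n selection and normalised counts as input are assumptions):

    import numpy as np

    def select_excess_cv2(norm_counts, gene_names, n_features=2000):
        """Select genes whose squared CV most exceeds the mean-dependent trend."""
        norm_counts = np.asarray(norm_counts, dtype=float)  # cells x genes
        mean = norm_counts.mean(axis=0)
        keep = mean > 0
        cv2 = norm_counts.var(axis=0)[keep] / mean[keep] ** 2
        # Fit cv2 ~ a1 / mean + a0 by least squares (the original uses a gamma GLM)
        design = np.column_stack([1.0 / mean[keep], np.ones(keep.sum())])
        (a1, a0), *_ = np.linalg.lstsq(design, cv2, rcond=None)
        excess = np.full(mean.shape, -np.inf)
        excess[keep] = cv2 / (a1 / mean[keep] + a0)
        order = np.argsort(excess)[::-1][:n_features]
        return [gene_names[i] for i in order]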

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Local structure

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Describes how well the local structure of each group prior to integration is preserved after integration

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

New metric: Graph cLISI

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Local Inverse Simpson’s Index (LISI) measures diversity in the neighbourhood of a cell. The cell-type variant (cLISI) gives better scores when the neighbourhood consists of the same labels as the target cell. It was used as part of the scIB project where a more flexible graph-based implementation was developed.

Links

Anything else?

Anything else you think is important about the metric

New dataset: HLCA

Thanks for taking the time to suggest a new dataset!

Description

Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?

The Human Lung Cell Atlas is a comprehensive catalogue of cells in the human lung. It contains samples from hundreds of individuals and high-quality consensus cell labels at different hierarchical levels.

Links

Anything else?

Anything else you think is important about the dataset

  • This dataset is very large and may be difficult in terms of computational resources

New metric: ALCS

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Accuracy Loss of Cell type Self-projection (ALCS) is the difference in accuracy between classifiers trained on the query alone (per batch) and on the query mapped into the reference space.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

Not sure whether the exact metric is available in the package or if it needs some wrapping code.

Bug: NBumi returns 0 features on test simulation

Thanks for taking the time to fill out this bug report!
Please use Markdown formatting for any code snippets.

What happened?

Briefly describe the issue

On the small test simulation datasets the NBumi method selects zero genes which breaks the integration stage of the pipeline

What were you doing?

Briefly describe what led to the issue. A reproducible example or other code snippets are great

What did you see?

Include any error messages, log output or other output that could help diagnose the problem

    Subsetting to 0 selected features...
    Setting up AnnData for scVI...
    ...
    IndexError: index 172 is out of bounds for size 0

Proposed solution

If you have a suggestion for how to solve the issue we would love to hear it!

Options:

  • Make NBumi error if no features are selected (and allow the Nextflow process to fail)
  • Return all genes if none are selected?
  • Modify the integration stage to error if there are no selected genes and allow the Nextflow process to fail (see the sketch below)
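
A minimal sketch of the third option, adding a guard to the integration script before it subsets to the selected features (shown for the Python/scVI stage; the variable names are assumptions):

    # In the integration script, before subsetting the AnnData to the selected features
    if len(selected_features) == 0:
        raise ValueError(
            "No features were selected, failing so the Nextflow process reports an error"
        )
    adata = adata[:, selected_features].copy()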

Your environment

Please include the information relevant to your issue

HMGU server

Anything else?

Anything else you want to tell us about the issue

New metric: query kNN-corr

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

Correlation between the KNN of query cells before and after mapping.

Links

Links to information about the metric

Anything else?

Anything else you think is important about the metric

May be possible to reuse the Symphony implementation (preferred) but if not should not be too difficult to reimplement

New method: triku

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

triku is a graph-based feature selection method that selects features that show an unexpected number of zero counts and whose expression is located in cells that have similar expression profiles.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

New metric: Label ASW

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Average Silhouette Width (ASW) measures the compactness of clusters in a dataset. By calculating it on cell labels we evaluate whether cells of the same type are nearby or separated. It was used as part of the scIB project.
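
A minimal sketch using scikit-learn's silhouette score on the integrated embedding (rescaling to [0, 1] as in scIB is an assumption about how it would be reported):

    from sklearn.metrics import silhouette_score

    def label_asw(embedding, labels):
        """Silhouette width of cell labels in an embedding, rescaled to [0, 1]."""
        return (silhouette_score(embedding, labels) + 1) / 2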

Links

Anything else?

Anything else you think is important about the metric

New method: Pearson residuals

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Select features based on high Pearson residuals
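
A simplified sketch, computing analytic Pearson residuals under a negative binomial model with fixed overdispersion and ranking genes by residual variance (the overdispersion value, clipping and top-n selection are assumptions; as noted below this could instead reuse the existing scanpy script):

    import numpy as np

    def select_pearson_residuals(counts, gene_names, n_features=2000, theta=100.0):
        """Select genes with the highest variance of analytic Pearson residuals."""
        counts = np.asarray(counts, dtype=float)  # cells x genes, raw counts
        total = counts.sum()
        mu = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
        denom = np.sqrt(mu + mu ** 2 / theta)
        residuals = np.divide(counts - mu, denom, out=np.zeros_like(counts), where=denom > 0)
        # Clip residuals to sqrt(n_cells), a common choice
        clip = np.sqrt(counts.shape[0])
        residuals = np.clip(residuals, -clip, clip)
        order = np.argsort(residuals.var(axis=0))[::-1][:n_features]
        return [gene_names[i] for i in order]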

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

Can probably be implemented by modifying the existing scanpy script

New metric: Isolated labels score

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The isolated labels score measures how well labels that were present in few samples can be distinguished in the integrated dataset. It can be calculated using either a clustering-based approach with the F1 score or unsupervised ASW. It was used as part of the scIB project.

Links

Anything else?

Anything else you think is important about the metric

  • Can probably be implemented as a single script with an option to select either the F1 or ASW score
    • Check what clustering information is required/used for the F1 variant

New feature: Pass dataset species

Thanks for taking the time to suggest a new feature!

Description

Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?

The pipeline should keep track of the species for each dataset and pass that information as needed (particularly to metrics). Required for the cell cycle conservation score #18.

Anything else?

Anything else you think is important about the feature

New metric: Batch ASW

Thanks for taking the time to suggest a new metric!

Description

Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?

The Average Silhouette Width (ASW) measures the compactness of clusters in a dataset. For the scIB project a modified version was developed which evaluates how spread out (and therefore how well integrated) batches are.

Links

Anything else?

Anything else you think is important about the metric

New method: M3Drop

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

M3Drop fits a Michaelis-Menten model to the pattern of dropouts in single-cell RNASeq data. This model is used as a null to identify significantly variable (i.e. differentially expressed) genes for use in downstream analysis.

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

  • Variants for full-length (M3Drop) and UMI data (NBumi)

Links

Links to information about the method

Anything else?

Anything else you think is important about the method

  • Package also contains implementations of other simple feature selection methods (Brennecke, giniFS, pcaFS, corFS, ConsensusFS)

New dataset: NeurIPS 2021

Thanks for taking the time to suggest a new dataset!

Description

Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?

Open Problems in Single-Cell Analysis produced a dataset for a NeurIPS 2021 competition. It contains multi-omics samples from several individuals, produced at different sequencing facilities, with consensus cell labels.

Links

Anything else?

Anything else you think is important about the dataset

  • The samples contain multiple modalities (either CITE-seq or 10x Multiome); we are only interested in the RNA data
  • Possibly use a subset or create two datasets for the different technologies

New method: anticor

Thanks for taking the time to suggest a new method!

Description

Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?

Selects features based on those that have an excess of negative correlations

Variants

Please describe any variants of the method (different parameters, number of selected features etc.), if any

Links

Links to information about the method

Anything else?

Anything else you think is important about the method
