Code Monkey home page Code Monkey logo

dkm's Introduction

When Data Do Not Conform to Rows and Columns: A Case Study of T-cell Receptor Datasets

Publication submitted for peer-review
JARED L OSTMEYER, ASSISTANT PROFESSOR, UT SOUTHWESTERN DEPARTMENT OF POPULATION AND DATA SCIENCES

Introduction

Statistical classifiers are mathematical models that use example data to find patterns in features that predict a label. Most statistical classifiers assume the features are arranged into rows and columns, like a spreadsheet, but many kinds of data do not conform to this structure. Sequences are one example of a different kind of data, which is why this data is usually stored in a text document, not a spreadsheet. To build statistical classifiers for sequences and other non-conforming features, we have developed what we call dynamic kernel matching (DKM).

DKM is analogous to a convolutional network, but for sequences. Consider the problem of classifying a sequence. Because some sequences are longer than others, the number of features is irregular. Given a specific sequence, the challenge is to determine the appropriate permutation of features with weights, allowing us to run the features through the statistical classifier to generate a prediction. To find the permutation of features that exhibit the maximal response, like how max-pooling identifies the image patch that exhibit the maximal response, we use a sequence alignment algorithm. The alignment score, as it is called, is analgous to the output of a max-pooling operation of a convolutional network, but for sequences. Given the immense number of possible permutations, the problem appears computationally complex but can be solved in polynomial time using a sequence alignment algorithm. Here, we implement the Needleman-Wunsch algorithm (link) in TensorFlow. Equivalents to a sequence alignment algorithm exist for (i) sets, (ii) trees, and (iii) graphs, making it possible to use DKM on non-conforming features represented by these structures (The general problem of graph alignment is considered NP-hard).

To illustrate the types of non-conforming features that we can handle with DKM, we consider two datasets of T-cell receptors, anticipating these datasets to contain signatures for diagnosing disease.

alt text

Antigen Classification Problem

10x Genomics has published a dataset of sequenced T-cell receptors labelled by interaction with disease particles, which are called antigens. We refer to this as the antigen classification problem. To solve the antigen classification problem, we use DKM to classify each sequence in this dataset. See the folder antigen-classification-problem for details (link).

Repertoire Classification Problem

Adaptive Biotechnologies has published a separate dataset of patients' sequenced T-cell receptors, which are called immune repertoires, labelled by those patients' cytomegalovirus (CMV) serostatus. We refer to this as the repertoire classification problem. To solve the repertoire classification problem, we use DKM to classify each set of sequences in this dataset. See the folder repertoire-classification-problem for details (link).

Study Design

The training cohort is used to fit a model, the validation cohort is used for model selection, and the test cohort is used for reporting results. We strictly adhere to this protocol, ensuring that we avoid a model selection bias when reporting results.

Requirements

Recommended Tools

Download

  • Download: zip
  • Git: git clone https://github.com/jostmey/dkm

dkm's People

Contributors

jostmey avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.