Data Challenge - Kernel Methods (MVA)

Authors: Adrien Le Franc and Alex Nowak

Introduction

The goal of the data challenge is to learn how to implement machine learning algorithms, gain an understanding of them, and adapt them to structural data. For this reason, we have chosen a sequence classification task: predicting whether a DNA sequence region is a binding site for a specific transcription factor.

Transcription factors (TFs) are regulatory proteins that bind specific sequence motifs in the genome to activate or repress transcription of target genes. Genome-wide protein-DNA binding maps can be profiled using experimental techniques, so every genomic region can be assigned to one of two classes for a TF of interest: bound or unbound. In this challenge, we will work with three datasets corresponding to three different TFs.

What is expected

Two days after the deadline of the data challenge, you will have to provide

  • a small report on what you did (in pdf format, 11pt, 2 pages A4 max)
  • your source code (zip archive), with a simple script "start" (that may be called from Matlab, Python, R, or Julia) which reproduces your submission and saves it in Yte.csv
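
As an illustration of the last step of such a "start" script, here is a minimal sketch that saves predictions to Yte.csv. The function name write_submission and the column names "Id" and "Bound" are assumptions made for this example; this README does not specify the expected file layout, so check the challenge page for the actual format.

```python
import csv


def write_submission(predictions, path="Yte.csv"):
    """Write one row per test sequence: its index and its predicted 0/1 label.

    The header ("Id", "Bound") is an assumption, not taken from this README.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Bound"])  # assumed column names
        for i, label in enumerate(predictions):
            writer.writerow([i, int(label)])
```

Calling write_submission([1, 0, 1]) would create Yte.csv in the current directory with a header row followed by three rows.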

Rules

  • At most 3 persons per team.
  • One team can submit results up to twice per day during the challenge.
  • A leaderboard will be available during the challenge, showing the best result per team as measured on a subset of the test set. A different part of the test set will be used after the challenge to evaluate the results.
  • Registration has to be done with email addresses @ens-cachan.fr, @polytechnique.edu, @u-psud.fr, @student.ecp.fr, @ens.fr, @mines-paristech.fr, @telecom-paristech.fr, @ensiee.fr, @dauphine.eu, @centralesupelec.fr, @ensiie.fr, @etu.parisdescartes.fr, @ens-paris-saclay.fr, @eleves.enpc.fr, @mines-ensae.fr.
  • The most important rule is: DO IT YOURSELF. The goal of the data challenge is not to get the best score on this data set at all costs, but instead to learn how to implement things in practice and gain practical experience with the machine learning techniques involved.

For this reason, the use of external machine learning libraries is forbidden. For instance, this includes, but is not limited to, libsvm, liblinear, scikit-learn, ...

On the other hand, you are welcome to use general-purpose libraries, such as libraries for linear algebra (e.g., svd, eigenvalue decompositions) and optimization libraries (e.g., for solving linear or quadratic programs).

Run the code

Make sure to change the paths in the scripts so that the kernels are read and written correctly.

Requirements

  • Python 3.6.2
  • Numpy
  • matplotlib
  • cvxopt

Build Kernels from sequences

Feel free to play with the parameters inside the scripts listed in this section.

Our implementation

  • To create the kernel matrix for the mismatch kernel using a depth-first graph search, run python Code/utils.py
  • To create the kernel matrix for the substring kernel by computing the features explicitly (relatively efficient), run python Code/kernel_substring.py
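
To make the two kernels above concrete, here is a minimal sketch that builds a Gram matrix from explicit k-mer count features. It is not the code in Code/utils.py or Code/kernel_substring.py: it enumerates mismatch neighbours by brute force rather than with a graph search, and every name in it (ALPHABET, neighbors, feature_map, kernel_value, gram_matrix) is hypothetical. With m=0 it reduces to a substring (spectrum) kernel; with m>0 each k-mer also counts towards every k-mer within Hamming distance m.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"  # assumed DNA alphabet


def neighbors(kmer, m):
    """Return the set of k-mers within Hamming distance <= m of kmer."""
    result = {kmer}
    for d in range(1, m + 1):
        for positions in combinations(range(len(kmer)), d):
            for letters in product(ALPHABET, repeat=d):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result


def feature_map(seq, k, m=0):
    """Count, for each k-mer, how many windows of seq lie within m mismatches of it."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        for neighbor in neighbors(seq[i:i + k], m):
            counts[neighbor] += 1
    return counts


def kernel_value(phi_x, phi_y):
    """Inner product of two sparse feature maps."""
    if len(phi_x) > len(phi_y):
        phi_x, phi_y = phi_y, phi_x  # iterate over the smaller map
    return sum(count * phi_y[kmer] for kmer, count in phi_x.items())


def gram_matrix(seqs, k, m=0):
    """Symmetric kernel matrix K[i][j] = <phi(seqs[i]), phi(seqs[j])>."""
    feats = [feature_map(s, k, m) for s in seqs]
    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = kernel_value(feats[i], feats[j])
    return K
```

For example, gram_matrix(["ACGT", "ACGA"], k=2) compares the two sequences through their shared 2-mers AC and CG. The brute-force neighbour enumeration grows quickly with k and m, which is one reason to prefer a tree or graph search for larger settings.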

External (non-ML) libraries

  • To create the kernel matrix for the mismatch kernel using the approximate Monte Carlo-based method, run python Code/mismatch/pyparse.py
  • To create the kernel matrix for the shape kernel (using the R code), run python Code/tofasta.py

Run experiments

python Code/main.py
