Authors: Adrien Le Franc and Alex Nowak
The goal of the data challenge is to learn how to implement machine learning algorithms, gain understanding about them and adapt them to structured data. For this reason, we have chosen a sequence classification task: predicting whether a DNA sequence region is a binding site for a specific transcription factor.
Transcription factors (TFs) are regulatory proteins that bind specific sequence motifs in the genome to activate or repress transcription of target genes. Genome-wide protein-DNA binding maps can be profiled using experimental techniques, and thus all genomic sequences can be classified into two classes for a TF of interest: bound or unbound. In this challenge, we will work with three datasets corresponding to three different TFs.
Two days after the deadline of the data challenge, you will have to provide:
- a short report on what you did (in PDF format, 11pt, 2 pages A4 max)
- your source code (zip archive), with a simple script "start" (that may be called from Matlab, Python, R, or Julia) which reproduces your submission and saves it in Yte.csv
- At most 3 persons per team.
- One team can submit results up to twice per day during the challenge.
- A leaderboard will be available during the challenge, showing the best result of each team, as measured on a subset of the test set. A different part of the test set will be used after the challenge to evaluate the final results.
- Registration has to be done with email addresses @ens-cachan.fr, @polytechnique.edu, @u-psud.fr, @student.ecp.fr, @ens.fr, @mines-paristech.fr, @telecom-paristech.fr, @ensiee.fr, @dauphine.eu, @centralesupelec.fr, @ensiie.fr, @etu.parisdescartes.fr, @ens-paris-saclay.fr, @eleves.enpc.fr, @mines-ensae.fr.
- The most important rule is: DO IT YOURSELF. The goal of the data challenge is not to get the best score on this data set at all costs, but instead to learn how to implement things in practice, and to gain practical experience with the machine learning techniques involved.
For this reason, the use of external machine learning libraries is forbidden. For instance, this includes, but is not limited to, libsvm, liblinear, scikit-learn, ...
On the other hand, you are welcome to use general purpose libraries, such as libraries for linear algebra (e.g., SVD, eigenvalue decompositions) and optimization libraries (e.g., for solving linear or quadratic programs).
Make sure to change the paths used to read and write the kernels so that they point to the correct locations on your machine.
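As an illustration of this rule, here is a minimal sketch of a learning algorithm written by hand on top of a general purpose linear algebra routine. Kernel ridge regression is chosen only as an example and is not prescribed by the challenge; the function names `fit_kernel_ridge` and `predict`, the toy data and the regularization value are all assumptions made for this sketch.

```python
import numpy as np


def fit_kernel_ridge(K, y, lam):
    """Solve (K + lam * n * I) alpha = y for the dual coefficients alpha."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)


def predict(K_test_train, alpha):
    """Predict labels in {-1, +1} from kernel values between test and training points."""
    return np.sign(K_test_train @ alpha)


# Toy data (hypothetical): two well-separated clusters and a linear kernel.
X = np.array([[1.0, 1.0], [1.5, 0.5], [-1.0, -1.0], [-0.5, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T

alpha = fit_kernel_ridge(K, y, lam=0.1)
train_pred = predict(K, alpha)

# Classify a new point using its kernel values against the training set.
x_new = np.array([[2.0, 1.0]])
new_pred = predict(x_new @ X.T, alpha)
```

Only `np.linalg.solve` comes from a library here; the model itself (the regularized linear system and the prediction rule) is written out explicitly, which is the spirit of the rule above.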
- Python 3.6.2
- Numpy
- matplotlib
- cvxopt
Feel free to play with the parameters inside these files.
- To create the kernel matrix for the mismatch kernel using a depth-first graph search, run
python Code/utils.py
- To create the kernel matrix for the substring kernel by computing the features explicitly (relatively efficient), run
python Code/kernel_substring.py
- To create the kernel matrix for the mismatch kernel using the approximate Monte Carlo-based method, run
python Code/mismatch/pyparse.py
- To create the kernel matrix for the shape kernel (using the R code), run the following two commands in order
python Code/tofasta.py
python Code/main.py
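For readers who want to see what the substring and mismatch kernels compute, here is a minimal sketch. It is not the code in `Code/kernel_substring.py` or `Code/utils.py`. It assumes that the substring kernel counts contiguous substrings of a fixed length k (often called the spectrum kernel) and that the mismatch kernel counts each k-mer together with all k-mers within Hamming distance m. The mismatch neighborhood is enumerated by brute force rather than by a graph search, and all function names are hypothetical.

```python
from collections import Counter

import numpy as np

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mismatch_neighbors(kmer, m):
    """Return all k-mers within Hamming distance m of kmer (brute force)."""
    neighbors = {kmer}
    for _ in range(m):
        step = set()
        for s in neighbors:
            for i in range(len(s)):
                for c in ALPHABET:
                    step.add(s[:i] + c + s[i + 1:])
        neighbors |= step
    return neighbors


def mismatch_counts(seq, k, m):
    """Feature map of the (k, m)-mismatch kernel as a sparse count vector."""
    counts = Counter()
    for kmer, n in kmer_counts(seq, k).items():
        for nb in mismatch_neighbors(kmer, m):
            counts[nb] += n
    return counts


def gram_matrix(features):
    """Gram matrix of sparse count vectors stored as Counters."""
    n = len(features)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            # Iterate over the smaller vector; missing keys count as 0.
            small, large = sorted((features[i], features[j]), key=len)
            K[i, j] = K[j, i] = sum(v * large[key] for key, v in small.items())
    return K


# Toy sequences (hypothetical), with k = 3 and at most m = 1 mismatch.
seqs = ["ACGT", "ACGA", "TTTT"]
K_spec = gram_matrix([kmer_counts(s, 3) for s in seqs])
K_mis = gram_matrix([mismatch_counts(s, 3, 1) for s in seqs])
```

On these toy sequences, "ACGT" and "ACGA" share only the 3-mer "ACG" under the substring kernel, while the mismatch kernel also credits the near match between "CGT" and "CGA", so their similarity is larger. The brute-force enumeration grows quickly with k and m, which is why more efficient or approximate methods are useful on real data.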