Code Monkey home page Code Monkey logo

predictingcircadiantime's Introduction

Identifying marker genes to tell the circadian time using a single transcriptomic time point

We developed a ML based pipeline to predict the circadian time (phase) at any single transcriptomic sampling time point using gene expression data from a set of marker genes. This code uses an artificial neural network (with Tensorflow). The training, test and validation datasets for our analysis can be found in the data folder (as detailed below).

To try it out, the best place to start is the jupyter notebook. This is what you need:

1. A working version of python and tensorflow. A simple installation on your laptop could probably just be done with pip. (https://www.tensorflow.org/install/)
2. The data: we have provided this in the folder “Data”, this provided data can be replaced in the notebook with your own data.		
	- X_train_raw is the Romanowski expression matrix before scaling and feature selection [Our training data]
	- X_valid_raw is the Yang expression matrix before scaling and feature selection [Our validation data]
	- X_test_raw_A/B are the Graf expression matrices for the two different ecotypes before scaling and feature selection [Our test data]
	- X_train_postprocessed is the Romanowski expression matrix after scaling and feature selection [Our training data]
	- X_valid_postprocessed is the Yang expression matrix after scaling and feature selection  [Our validation data]
	- X_test_postprocessed is the Graf expression matrix including the two different ecotypes after scaling and feature selection [Our test data]
	- X_train_clusters contains WGCNA clusters generated for the Romanowski data

This notebook takes as input the raw training, test and validation data (In[3]). After removing low variance genes (In[4]), the following metrics; ARSER (Yang & Su, 2010); JTK-Cycle (Hughes et al., 2010); cross-correlation with circadian time; autocorrelation, were applied to quantify the rhythmicity of each gene’s expression pattern in the training dataset. Despite the metrics being highly correlated with each other, using several metrics should increase the number of detected rhythmic patterns compared to using one single criterion. The scores for each metric were combined using a Gaussian copula resulting in one score for each gene that enables the set of genes to be ranked in order of measured rhythmicity. The notebook includes the cross-correlation with circadian time and autocorrelation feature selection metrics that we used (‘CorrelationMetrics.py’ and In[6]). The top-ranking n genes were taken forward for model training and further feature selection where n was adjusted to minimise validation error. After a ranked subset of n rhythmic genes was obtained, their expression levels were scaled using MinMaxScaler in the range 0 to 1 as this method of scaling resulted in the smallest error for the validation dataset (In[6]). The scaling transformation was fitted on the training set and applied to both validation and test sets to prevent data leakage. We prioritized transcripts using feature selection to make the frequency distribution of sub-clusters generated by the WGCNA gene co-expression network analysis uniform and train the neural network (‘CustomFeatureSelection.py’ and In[7]).

The training dataset was used to learn the circadian time in hours, from the expression data from each transcriptomic timepoint individually, whilst the validation set was utilised for adjusting hyperparameters to reduce overfitting. The test set was used to estimate the error on unseen data. Here n is equal to our final set of transcripts that are sufficient to allow prediction of the circadian time. Finally we show the expected output line plots for training and validation predictions compared to ground truth data using our final subset of transcripts and optimized hyperparameters (In[8]). Due to the cyclical nature of the target (time 0-24 hours), standard regression loss functions were not suitable for this task. To quantify the error in the predictions, we defined the loss function as the squared angle between actual circadian time and predicted circadian time after being transformed onto a unit circle (In[5]).

The ‘Results’ folder contains the final predictions that are generated by ‘FinalPredictions.py’ which is found in the repository.

Dependencies: Python 3.7.5 Pandas 0.25.0 Numpy 1.16.6 Matplotlib 3.1.1 Sklearn 0.23.2 Seaborn 0.9.0 Tensorflow 2.0.0 Keras 2.2.4 Correlation Metrics Tqdm 4.32.1

predictingcircadiantime's People

Contributors

joshuacolmer avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.