SSCB project In_silico tissue stratification

This repository serves as the deliverable for our assignment project during the Summer School in Computational Biology at the Universidade de Coimbra in September 2023. The project aims to address an important challenge in computational biology: predicting the percentage of each cell type within bulk tissue samples. This prediction is valuable for reducing result interpretation bias and optimizing sequencing costs.

The challenge

Biological samples under different health conditions (healthy and diseased) may contain varying numbers of distinct cell types. The composition of these cell types within a tissue can significantly affect which genes are identified as differentially expressed. When tissue samples are analyzed as a whole (bulk analysis), gene expression values are averaged across all cells within the sample. If individual cells from the same tissue are sequenced instead (single-cell analysis), it becomes possible to determine the expression of each gene per cell type. This introduces a critical challenge: relative gene expression from bulk data may not reveal the true differences between conditions, potentially leading to biased interpretation of results and to false negatives or false positives. For example, if we do not know the percentage of each cell type in the sample, we cannot tell whether a change in gene expression is due to changes in the molecular processes within the cells or to a shift in cellular composition.
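The confounding described above can be illustrated with a small numeric sketch. It assumes that the bulk signal of a gene is the composition-weighted mean of its per-cell-type expression. The cell type names, proportions, and expression values are made up for illustration and do not come from the datasets used in this project.

```python
def bulk_expression(proportions, expression_by_type):
    """Bulk signal of one gene: composition-weighted mean of per-cell-type expression."""
    # Cell type fractions must describe the whole sample.
    assert abs(sum(proportions.values()) - 1.0) < 1e-9
    return sum(proportions[ct] * expression_by_type[ct] for ct in proportions)


# Hypothetical per-cell-type expression of a single gene.
# It is identical in both conditions: nothing changes inside the cells.
expression = {"neuron": 10.0, "astrocyte": 2.0}

# Hypothetical tissue compositions: only the cell type mix differs.
healthy = {"neuron": 0.7, "astrocyte": 0.3}
disease = {"neuron": 0.4, "astrocyte": 0.6}

bulk_healthy = bulk_expression(healthy, expression)
bulk_disease = bulk_expression(disease, expression)

print(f"healthy bulk: {bulk_healthy:.2f}, disease bulk: {bulk_disease:.2f}")
```

In this toy case the bulk value drops from roughly 7.6 to roughly 5.2 even though no cell changed its expression, which is exactly the kind of apparent differential expression that knowing the composition helps to rule out.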

In essence, the challenge aims to solve two problems:

  • Reduced Bias in Interpretation: Accurate estimation of cell type composition within a bulk-RNA tissue sample enables us to mitigate the bias introduced by cell type variations when identifying differentially expressed genes.
  • Optimizing Sequencing Costs: Precise cell type composition prediction can help optimize sequencing resources by reducing the need for costly single-cell RNA-seq experiments, especially when such information can be inferred from bulk RNA-seq data.

Data Resources

For this assignment, we use Huntington's disease data from published studies, available through GEO:

| Publication | Data source | GEO | Organism |
|---|---|---|---|
| Hodges et al 2006 | bulk RNA (Affymetrix) | GSE3790 | H. sapiens |
| Matsushima et al 2023 | single-cell RNA-seq (Illumina) | GSE152058 | H. sapiens |

Our approach

To tackle this challenge, we initially envisioned an ideal experimental scenario: splitting each tissue sample into two equal parts. One part would undergo bulk analysis, while the other would undergo single-cell analysis. This would show how single cells contribute to the overall gene expression, by examining differences in gene expression for the identified cell types. However, such an experiment is beyond the scope of the course, so our starting point was the available data mentioned above.

With that in mind, we propose a machine learning approach. Specifically, we plan to train a neural network to learn the weights of each gene, with tissue composition (the percentage of each cell type) as the target variable. Using our single-cell sample data, we calculate the percentage of each cell type in the tissue and the corresponding gene expression levels. Subsequently, we compute an overall expression value for each gene by averaging over all cells in the sample. The resulting gene expression matrix becomes the input to the neural network and is formatted like the bulk tissue sample data.
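The data preparation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the function names `cell_type_proportions` and `pseudo_bulk`, the cell type labels, and the toy expression values are all hypothetical, and a plain mean over cells is assumed as the overall expression value.

```python
from collections import Counter


def cell_type_proportions(cell_types):
    """Fraction of cells per cell type in one sample (the training target)."""
    counts = Counter(cell_types)
    total = len(cell_types)
    return {ct: n / total for ct, n in sorted(counts.items())}


def pseudo_bulk(expression_rows):
    """Mean expression of each gene over all cells (rows = cells, columns = genes)."""
    n_cells = len(expression_rows)
    n_genes = len(expression_rows[0])
    return [sum(row[g] for row in expression_rows) / n_cells for g in range(n_genes)]


# Toy single-cell sample: 4 cells, 3 genes (hypothetical values).
cell_types = ["neuron", "neuron", "astrocyte", "neuron"]
cells = [
    [4.0, 0.0, 2.0],
    [6.0, 0.0, 2.0],
    [0.0, 8.0, 2.0],
    [2.0, 0.0, 2.0],
]

target = cell_type_proportions(cell_types)  # one fraction per cell type
features = pseudo_bulk(cells)               # one averaged value per gene

print(target)
print(features)
```

Each single-cell sample then contributes one row of features (averaged gene expression) and one row of targets (cell type fractions), which mirrors the shape of a bulk sample with a known composition.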

For the sake of simplicity and time efficiency, we design a neural network with two layers, with the number of nodes equal to the number of genes. The final layer applies a softmax function, yielding one probability per cell type. Since our targets are cell composition vectors (one per sample) whose entries sum to 1, these probabilities can be read as the estimated fraction of each cell type within the tissue, derived from the bulk RNA data.
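A forward pass through such a network might look like the sketch below. Several details are assumptions made only for illustration: the two layers are read as one hidden layer with as many nodes as genes followed by an output layer with one node per cell type, the hidden activation is taken to be ReLU, the weights are random and untrained, and all function names are hypothetical. A real implementation would more likely use a deep learning framework and a training loop.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into positive values that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Assumed hidden activation (not specified in the project description)."""
    return [max(0.0, v) for v in values]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; untrained, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_composition(expression, hidden, output):
    """Map one expression profile to estimated cell type fractions."""
    h = relu(dense(expression, *hidden))
    return softmax(dense(h, *output))


rng = random.Random(0)
n_genes, n_cell_types = 5, 3                     # toy sizes
hidden = init_layer(n_genes, n_genes, rng)       # hidden layer: n_genes nodes
output = init_layer(n_genes, n_cell_types, rng)  # output layer: one node per cell type

composition = predict_composition([3.0, 2.0, 2.0, 0.5, 1.0], hidden, output)
print(composition, sum(composition))
```

Because the softmax output is strictly positive and sums to 1, it has the same form as the composition targets, which is what makes it a natural choice for the final layer.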

The following flowchart summarizes our approach, with solid lines representing the problem flow and dashed lines indicating our proposed solution:

graph TD
%% problem, solid line %%
A((tissue sample)) ===> B[Bulk RNA-seq] 
B ---> E[unknown % cell type]
A ===> C[single-cell RNA-seq]
B --> |tissue average|D[gene relative expression]
C --> |cell type average|D
C ---> F[known % cell type]
E ---> G[unknown tissue composition]
F --> H[known tissue composition]
%% solution, dashed line %%
H -.-> I1[compute tissue gene expression average]
I1 -.-> I2[build neural net]
J1[features = genes, age, condition] -...- I2
J2[targets = % cell type] -...- I2
I2 -.- K1[model]
K1 -.-> G
G -.-> I[tissue composition estimation]

Work carried out under the supervision of Prof. Matthias Futschik and Sofia Torres (/fistorres).
