weiba / mofgcn

mofgcn's Introduction

README

MOFGCN: Predicting Drug Response Based on Multi-omics Fusion and Graph Convolution

This document introduces the Python code of the MOFGCN algorithm.

Requirements

pytorch==1.6.0
tensorflow==2.3.1
numpy==1.17.3+mkl
scipy==1.4.1
pandas==0.25.2
scikit-learn==0.21.3
pubchempy==1.0.4
seaborn==0.10.0
hickle==4.0.1
keras==2.4.3
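
To compare an existing environment against the pins above, something like the following may help. It is a minimal sketch, not part of the repository: `parse_pin` and `check_pins` are hypothetical helper names, it needs Python 3.8+ for `importlib.metadata`, and it assumes that "pytorch" refers to the PyPI distribution named `torch` and that the `+mkl` suffix is only a build tag.

```python
import re
from importlib import metadata

# Pins copied from the Requirements list.
PINS = """\
pytorch==1.6.0
tensorflow==2.3.1
numpy==1.17.3+mkl
scipy==1.4.1
pandas==0.25.2
scikit-learn==0.21.3
pubchempy==1.0.4
seaborn==0.10.0
hickle==4.0.1
keras==2.4.3
"""

# Assumption: the PyPI distribution for PyTorch is installed as "torch".
ALIASES = {"pytorch": "torch"}


def parse_pin(line):
    """Split 'name==version' (or 'name=version') into (name, version)."""
    match = re.fullmatch(r"\s*([A-Za-z0-9_.-]+)\s*={1,2}\s*(\S+)\s*", line)
    if match is None:
        raise ValueError(f"not a pin: {line!r}")
    name, version = match.groups()
    # Drop a local build tag such as "+mkl" before comparing.
    return name, version.split("+")[0]


def check_pins(text):
    """Return {name: (wanted, installed_or_None)} for every pin that does not match."""
    problems = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, wanted = parse_pin(line)
        try:
            installed = metadata.version(ALIASES.get(name, name))
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or installed.split("+")[0] != wanted:
            problems[name] = (wanted, installed)
    return problems


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(PINS).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result from `check_pins(PINS)` means every listed package is installed at the pinned version.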

Instructions

This project contains all the code for MOFGCN and 5 comparison algorithms, each evaluated on the CCLE and GDSC databases. We describe only MOFGCN, the algorithm proposed in our paper; descriptions of the other algorithms can be found in their corresponding papers.

Model composition and meaning

MOFGCN is composed of common modules and experimental modules.

Common module

model.py defines the complete MOFGCN model.
optimizer.py defines the optimizer of the model.
myutils.py defines the utility functions used throughout the algorithm.

Experimental module

Entire_Drug_Cell performs the random clearing cross-validation experiment.
- entire_main.py runs the complete MOFGCN algorithm, which combines gene expression, copy number variation, and somatic mutation to compute cell line similarity.
- entire_gene_main.py performs an experiment that uses only gene expression to calculate cell line similarity.
- sampler.py defines the sampler for random zeroing experiments.
- result_data and statistic_result folders save the output results and statistical results of the algorithm respectively.
New_Drug_Cell performs single row and single column clearing experiments.
- main.py performs single row and single column clearing experiments.
- MOFGCN_New_target.py integrates the MOFGCN algorithm.
- sampler.py defines the sampler for single row and single column clearing experiments.
- result_data and statistic_result folders save the output results and statistical results of the algorithm respectively.
Single_Drug_Cell performs a single drug response prediction experiment.
- main.py performs a single drug response prediction experiment.
- MOFGCN_Single_target.py integrates the MOFGCN algorithm.
- sampler.py defines the sampler for single drug response prediction experiment.
- pan_reslt_data and statistic_result folders save the output results and statistical results of the algorithm respectively.
Target_Drug performs targeted drug experiments.
- target_main.py performs targeted drug experiments.
- sampler.py defines the sampler for targeted drug experiments.
- result_data and statistic_result folders save the output results and statistical results of the algorithm respectively.
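
The random zeroing and single column clearing ideas behind the samplers above can be sketched as follows. This is a minimal illustration in plain Python, not the repository's sampler.py: the function names `random_zeroing_folds` and `clear_column`, the default fold count, and the choice to hold out only positive entries of the binary matrix are all assumptions.

```python
import random


def random_zeroing_folds(matrix, n_folds=5, seed=0):
    """Yield (train_matrix, test_positions) for each fold.

    Positive entries (value 1) are shuffled and split into n_folds groups.
    In each fold, one group is set to 0 in a copy of the matrix (the
    training input) and its positions are returned as the test set.
    """
    positives = [(i, j) for i, row in enumerate(matrix)
                 for j, value in enumerate(row) if value == 1]
    rng = random.Random(seed)
    rng.shuffle(positives)
    for fold in range(n_folds):
        test_positions = positives[fold::n_folds]
        train = [list(row) for row in matrix]  # copy, input is not modified
        for i, j in test_positions:
            train[i][j] = 0
        yield train, test_positions


def clear_column(matrix, column):
    """Return a copy of the matrix with one whole column set to 0."""
    return [[0 if j == column else value for j, value in enumerate(row)]
            for row in matrix]


if __name__ == "__main__":
    toy = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
    for train, test_positions in random_zeroing_folds(toy, n_folds=3):
        print(test_positions, train)
    print(clear_column(toy, 1))
```

Every positive entry lands in exactly one test fold, so the folds together cover all known associations once.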

Each main file runs one complete experiment. Because the split into training and test data is random, we record the true values of the test data while the algorithm runs. The output of each main file therefore contains the true and predicted values of the test data across the repeated cross-validation runs, and the subsequent statistical analysis is performed on this output. The myutils.py file contains all the tools needed to run and analyze the experiments, such as the calculation of AUC, ACC, F1 score, and MCC. All functions are implemented in PyTorch and support CUDA.
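
For readers who want to recompute the statistics from saved true and predicted values, the four metrics can be sketched in plain Python as below. This is not the code in myutils.py (which is PyTorch-based); the function names and the 0.5 decision threshold are assumptions, and AUC is computed by the pairwise-ranking definition, which is simple but quadratic in the number of samples.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def acc_f1_mcc(y_true, scores, threshold=0.5):
    """Threshold the scores, then return (accuracy, F1, MCC)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion(y_true, y_pred)
    acc = (tp + tn) / len(y_true)
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return acc, f1, mcc


def auc(y_true, scores):
    """ROC AUC: probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    predicted = [0.9, 0.4, 0.6, 0.1]
    print(auc(labels, predicted), acc_f1_mcc(labels, predicted))
```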

Both the CCLE and GDSC folders contain a processed_data folder, which holds the input data required by MOFGCN and the comparison algorithms.

GDSC/processed_data/
- cell_drug_common.csv records the log IC50 association matrix of cell line-drug.
- cell_drug_common_binary.csv records the binary cell line-drug association matrix.
- cell_gene_cna.csv records the CNA features of the cell line.
- cell_gene_feature.csv records cell line gene expression features.
- cell_gene_mutation.csv records somatic mutation features of cell lines.
- cell_id_tag.csv records the COSMIC ID of the cell line.
- drug_cid.csv records the PubChem IDs of all drugs screened in GDSC.
- drug_feature.csv records the fingerprint features of drugs.
- null_mask.csv records the null values in the cell line-drug association matrix.
- threshold.csv records the drug sensitivity threshold.

CCLE/processed_data/
- cell_drug.csv records the log IC50 association matrix of cell line-drug.
- cell_drug_binary.csv records the binary cell line-drug association matrix.
- cna_feature.csv records the CNA features of the cell line.
- drug_feature.csv records the fingerprint features of drugs.
- drug_name_cid.csv records the drug name and PubChem ID.
- gene_feature.csv records cell line gene expression features.
- mutation_featre.csv records somatic mutation features of cell lines.
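
Before running an experiment, it may be worth checking that a local checkout actually contains the files listed above. The following is a small sketch using only the standard library; `missing_files` and `EXPECTED` are hypothetical names, and the file names are taken from the lists above (with cell_drug_common.csv spelled as in the directory listing quoted in the issues further down).

```python
from pathlib import Path

# Expected inputs per folder, as described in this README.
EXPECTED = {
    "GDSC/processed_data": [
        "cell_drug_common.csv",
        "cell_drug_common_binary.csv",
        "cell_gene_cna.csv",
        "cell_gene_feature.csv",
        "cell_gene_mutation.csv",
        "cell_id_tag.csv",
        "drug_cid.csv",
        "drug_feature.csv",
        "null_mask.csv",
        "threshold.csv",
    ],
    "CCLE/processed_data": [
        "cell_drug.csv",
        "cell_drug_binary.csv",
        "cna_feature.csv",
        "drug_feature.csv",
        "drug_name_cid.csv",
        "gene_feature.csv",
        "mutation_featre.csv",  # spelled as written in the README
    ],
}


def missing_files(repo_root, expected=EXPECTED):
    """Return {folder: [absent file names]} for folders with missing inputs."""
    root = Path(repo_root)
    report = {}
    for folder, names in expected.items():
        absent = [name for name in names
                  if not (root / folder / name).is_file()]
        if absent:
            report[folder] = absent
    return report


if __name__ == "__main__":
    # Run from the repository root.
    for folder, names in missing_files(".").items():
        print(folder, "is missing:", ", ".join(names))
```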

Contact

If you have any questions regarding our code or data, please do not hesitate to open an issue or contact me directly ([email protected]).

mofgcn's Issues

How to run this code?

Hi,

I want to run this code on a Linux machine, but I'm not sure how to run it.

If I run entire_main.py, I get an error about import_path.
I'm not sure what this is. Is it a library? I can't find it on PyPI.

$ python CCLE/MOFGCN/Entire_Drug_Cell/entire_main.py 
Traceback (most recent call last):
  File "CCLE/MOFGCN/Entire_Drug_Cell/entire_main.py", line 5, in <module>
    from import_path import *
ModuleNotFoundError: No module named 'import_path'

And if I comment that import out, I get this error.

 $ python CCLE/MOFGCN/Entire_Drug_Cell/entire_main.py 
Traceback (most recent call last):
  File "CCLE/MOFGCN/Entire_Drug_Cell/entire_main.py", line 6, in <module>
    from MOFGCN.model import GModel

I don't know how to resolve this. Could you give me more instructions about this?

GDSC doesn't have some data.

Hi,

Your work has been very helpful to me, but I found that GDSC/processed_data is missing some data.

You said:

GDSC/processed_data/
- cell_drg_common.csv records the log IC50 association matrix of cell line-drug.
- cell_drug_common_binary.csv records the binary cell line-drug association matrix.
- cell_gene_cna.csv records the CNA features of the cell line.
- cell_gene_feature.csv records cell line gene expression features.
- cell_gene_mutation.csv records somatic mutation features of cell lines.
- cell_id_tag.csv records the COSMIC ID of the cell line.
- drug_cid.csv records the PubChem IDs of all drugs screened in GDSC.
- drug_feature.csv records the fingerprint features of drugs.
- null_mask.csv records the null values in the cell line-drug association matrix.
- threshold.csv records the drug sensitivity threshold.

But I can't find these three files: cell_gene_cna.csv, cell_gene_feature.csv, and cell_gene_mutation.csv, even in Data.
Could you add the data needed to run this code? Thank you very much!

GDSC and CCLE don't have some data.

Hi,

I found that GDSC/processed_data and CCLE/processed_data are missing some data.

You said

Both the CCLE and GDSC folders contain the processed_data file, which contains the input data required by the MOFGCN algorithm and the comparison algorithm. 

- GDSC/processed_data/ 
    - cell_drg_common.csv records the log IC50 association matrix of cell line-drug. 
    - cell_drug_common_binary.csv records the binary cell line-drug association matrix. 
    - cell_gene_cna.csv records the CNA features of the cell line. 
    - cell_gene_feature.csv records cell line gene expression features. 
    - cell_gene_mutation.csv records somatic mutation features of cell lines. 
    - cell_id_tag.csv records the COSMIC ID of the cell line. 
    - drug_cid.csv records the PubChem IDs of all drugs screened in GDSC. 
    - drug_feature.csv records the fingerprint features of drugs. 
    - null_mask.csv records the null values in the cell line-drug association matrix. 
    - threshold.csv records the drug sensitivity threshold.

- CCLE/processed_data/ 
    - cell_drug.csv records the log IC50 association matrix of cell line-drug. 
    - cell_drug_binary.csv records the binary cell line-drug association matrix. 
    - cna_feature.csv records the CNA features of the cell line. 
    - drug_feature.csv records the fingerprint features of drugs. 
    - drug_name_cid.csv records the drug name and PubChem ID. 
    - gene_feature.csv records cell line gene expression features. 
    - mutation_featre.csv records somatic mutation features of cell lines.

But it looks like not all of the data is there.

/home/inoue019/code/MOFGCN $ ls CCLE/processed_data/
cell_drug_binary.csv  
cell_drug.csv  
drug_feature.csv  
drug_name_cid.csv  
drug_protein.csv
/home/inoue019/code/MOFGCN $

/home/inoue019/code/MOFGCN $ ls GDSC/processed_data/
cell_drug_common_binary.csv  
cell_id_tag.csv  
drug_feature.csv  
threshold.csv
cell_drug_common.csv         
drug_cid.csv     
null_mask.csv
/home/inoue019/code/MOFGCN $

Could you add some data to run this code?

Question about gene expression data

Hi,

I have one question about gene expression data.
It has columns; I thought it meant genes' names or identifiers.
Could you explain how you collected this and the meaning of the number of columns?

Best,
Yoshi

Question about result

Hi author, sorry to bother you again. I have a question about the results produced by the CCLE code. After I run entire_main.py in the Entire_Drug_Cell folder, it generates predict_data.csv and true_data.csv. Both files contain 25 rows and 680 columns, but according to the CCLE data there should be only 24 drugs and 436 cell lines. Why don't the generated results match the dimensions of the original matrix? Please correct me if I have misunderstood anything. Thank you!

The version of Python

Which version of Python does your program run on? It is not given in the README.md. Also, I can't find the 'import_path' module in your code files.