Code Monkey home page Code Monkey logo

amirreza-mousavi / protein_normalized_sw_similarity_score Goto Github PK

View Code? Open in Web Editor NEW
1.0 1.0 0.0 8.55 MB

An R script that calculates a similarity matrix for a list of protein sequences with the aid of Bleakley-Yamanishi Normalized Smith-Waterman Similarity Score.

License: MIT License

R 100.00%
psa pairwise-alignment sequence sequence-alignment r fasta fasta-sequences biostrings similarity-score similarity-matrix drug-discovery dti dti-prediction protein protein-sequences smith-waterman smith-waterman-alignment computer-aided-drug-design computational-biology local-alignment

protein_normalized_sw_similarity_score's Introduction

Protein_Normalized_SW_Similarity_Score

An R script that calculates a similarity matrix for a list of protein sequences with the aid of Bleakley-Yamanishi Normalized Smith-Waterman Similarity Score.


Installation

Make sure to have installed the following packages : Biostrings, and here.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")
install.packages("here")

More Details

In the area of Drug-Target Interaction (DTI) prediction and Computer-Aided Drug Design (CADD) , most of the times, it is necessary to express the proteins and the drugs as a series of features and attributes that help us train our prediction model more efficiently. Bleakley-Yamanishi Normalized Smith-Waterman Similarity Score is one of the ways to rate and cluster a pair of proteins based on the similarity of their sequences. The classic Smith-Waterman Similarity Score is higher for larger sequences and lower for small sequences. The normalization process, helps to neutralize the influence of the length of the pairs of the sequence on the similarity score, hence the name normalized similarity score.

Bleakley-Yamanishi Normalized Smith-Waterman Similarity Score was first described in this paper. and was subsequently used by other researchers in this field(see this paper.).

The settings for the pairwise alignment are similar to the default paramters of EMBOSS Water PSA program hosted at European Bioinformatic Institute(EBI) website :

Substitution Matrix = BLOSUM62, Gap open = 10, Gap extend = 0.5

This particular formula taken from Bleakley et al (2009), is the heart of this script :

Image

Further Readings :

Kevin Bleakley, Yoshihiro Yamanishi, Supervised prediction of drug–target interactions using bipartite local models, Bioinformatics, Volume 25, Issue 18, 15 September 2009, Pages 2397–2403, https://doi.org/10.1093/bioinformatics/btp433

Seal, A., Ahn, YY. & Wild, D.J. Optimizing drug–target interaction prediction based on random walk on heterogeneous networks. J Cheminform 7, 40 (2015). https://doi.org/10.1186/s13321-015-0089-z

Maryam Bagherian, Elyas Sabeti, Kai Wang, Maureen A Sartor, Zaneta Nikolovska-Coleska, Kayvan Najarian, Machine learning approaches and databases for prediction of drug–target interaction: a survey paper, Briefings in Bioinformatics, Volume 22, Issue 1, January 2021, Pages 247–269, https://doi.org/10.1093/bib/bbz157


  • The script takes about a minute or so provided that the number of sequences are below 50. For fasta files which contain 1000 unique sequences and more, it takes around 24 hours for the tasks to complete!
  • For larger sequences, one might use parallelization. I have to use parallel package of R, if I want to implement parallelization in the future.
  • I have provided five datasets, containing the sequences of various proteins of human proteome. All of these five files are taken from Uniprotkb. "16_Selected_Proteins.txt" and "38_Selected_Proteins.txt" can be used as demo datasets to test the script. The other three datasets, contain much more sequences and it takes more than a day for my system to execute them. Parallelization might be helpful to reduce the needed time.
  • I first calculated the similarity matrix, then used (res)^(-1) as a measure of dissimilarity. There must exist a better way to calculate dissimilarity scores from similarity scores...

NOTE: As always, I have put the Expected_Results folder which contains the files that the R script generates once you run it.

NOTE: Make sure to edit setwd command in the R script to point at your desired directory. In this particular repository, it is not necessary because of the here package. However you can consider to do it manually in case the script somehow does not work for you.

protein_normalized_sw_similarity_score's People

Contributors

amirreza-mousavi avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.