
megago's Introduction

Published in: Journal of Proteome Research

MegaGO

Calculate semantic distance for sets of Gene Ontology terms.

Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

Scripts are written in Python 3. One easy way to get started is to install Miniconda 3.

On Linux:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Installing

Clone the repository:

git clone https://github.com/MEGA-GO/Mega-Go.git

Install package:

cd Mega-Go
pip install -U .

Execute example analysis:

megago sample7.txt sample8.txt

These files can be found here:

How does it work?

MegaGO calculates the similarity between two GO terms with the Lin semantic similarity (sim_Lin) metric [1]:

sim_Lin(go_1, go_2) = 2 · IC(MICA) / (IC(go_1) + IC(go_2))

where:

  • MICA: the most informative common ancestor of go_1 and go_2.
  • IC(go_i): the information content of the term go_i.
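Plugging toy information-content values into this formula, the Lin similarity could be sketched in Python as follows (the `ic` dictionary and the `mica` argument are illustrative placeholders, not MegaGO's actual API):

```python
def lin_similarity(ic, go1, go2, mica):
    """Lin similarity: twice the information content of the most
    informative common ancestor, divided by the summed information
    content of both terms."""
    denominator = ic[go1] + ic[go2]
    if denominator == 0:
        return 0.0  # both terms carry no information (e.g. root terms)
    return 2 * ic[mica] / denominator

# Toy example with made-up IC values:
ic = {"GO:A": 4.0, "GO:B": 3.0, "GO:MICA": 2.5}
print(lin_similarity(ic, "GO:A", "GO:B", "GO:MICA"))  # 2 * 2.5 / 7.0 ≈ 0.714
```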

The information content of a GO term is calculated as follows:

IC(go) = -log p(go)

The frequency p of a term go is defined as:

p(go) = (n_go + Σ_{go' ∈ c} n_go') / N

where:

  • c: the children of go.
  • N: the total number of terms in the GO corpus.
  • n_go': the number of occurrences of a term go' in a reference data set.
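A minimal sketch of this IC calculation, with hypothetical `counts` and `children` structures standing in for the real reference data set:

```python
import math

def information_content(counts, children, go, total):
    """IC(go) = -log p(go), where p(go) counts occurrences of go and of
    its children in a reference data set.

    counts   -- dict: GO term -> occurrences in the reference set (hypothetical)
    children -- dict: GO term -> set of child terms (hypothetical)
    total    -- N, the total number of terms in the corpus
    """
    n = counts.get(go, 0) + sum(counts.get(c, 0) for c in children.get(go, ()))
    return -math.log(n / total)

# Toy example: a parent seen 10 times with one child seen 40 times.
counts = {"GO:parent": 10, "GO:child": 40}
children = {"GO:parent": {"GO:child"}}
print(information_content(counts, children, "GO:parent", 1000))  # -log(0.05) ≈ 3.0
```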

To calculate the similarity of two sets of GO terms, the best match average (BMA) [1] is used:

sim_BMA(g_1, g_2) = ( Σ_{i=1..m} max_j sim(go1_i, go2_j) + Σ_{j=1..n} max_i sim(go1_i, go2_j) ) / (m + n)

where:

  • m, n: the number of terms in set g_1 and g_2, respectively.
  • sim(go1_i, go2_j): the similarity between the GO terms go1_i and go2_j.
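The BMA aggregation can be sketched as follows; `sim` stands for any pairwise term similarity function, and the toy similarity values are made up:

```python
def best_match_average(set1, set2, sim):
    """Best match average: for every term in one set take its best match
    in the other set, then average over all m + n best matches."""
    best1 = [max(sim(a, b) for b in set2) for a in set1]
    best2 = [max(sim(a, b) for a in set1) for b in set2]
    return (sum(best1) + sum(best2)) / (len(set1) + len(set2))

# Toy similarity: 1.0 for identical terms, 0.2 otherwise.
toy_sim = lambda a, b: 1.0 if a == b else 0.2
print(best_match_average(["x", "y"], ["x"], toy_sim))  # (1.0 + 0.2 + 1.0) / 3 ≈ 0.733
```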

[1]: Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity.” In Proceedings of the 15th International Conference on Machine Learning, 296–304.

Interpretation

The relative similarity ranges between 0 and 1.

sim(go1_i, go2_j) value    Interpretation
> 0.9                      highly similar functions
0.3 – 0.9                  functionally related
< 0.3                      not functionally similar

License

This project is licensed under the MIT License - see the LICENSE file for details

megago's People

Contributors

bmesuere, dependabot[bot], github-actions[bot], michbur, pverscha, rababerladuseladim, tivdnbos, wassimg


megago's Issues

Allow filtering on a subset of GO terms

Right now, we calculate our similarity scores on the full Gene Ontology, but some researchers might have a focus on a specific subset. For these users, it might be useful to filter the Gene Ontology and only run on the subset.

  • There must be a user-friendly way to select a subset. A list of terms to include is a minimum, but we probably also want to say "these terms and all child terms".
  • At first sight, we could limit the GO terms only during the similarity calculation phase. This might, however, give inaccurate results because the other terms would still be taken into account when calculating the information content. For optimal results, the filtering should also be applied when generating our initial values from Swissprot. We would need to test both approaches and see whether the second option is needed.
  • We would need to talk to some users who have this problem to get a better idea about their needs.
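The "these terms and all child terms" selection could be sketched as a simple traversal over a hypothetical parent-to-children mapping (not MegaGO's actual data structures):

```python
def expand_with_descendants(seeds, children):
    """Return the seed terms plus all their (transitive) child terms.

    children -- dict mapping a GO term to its direct children (hypothetical)
    """
    subset = set()
    stack = list(seeds)
    while stack:
        term = stack.pop()
        if term in subset:
            continue  # already visited; the GO graph can reach a term twice
        subset.add(term)
        stack.extend(children.get(term, ()))
    return subset

children = {"GO:1": ["GO:2", "GO:3"], "GO:2": ["GO:4"]}
print(sorted(expand_with_descendants(["GO:1"], children)))
# ['GO:1', 'GO:2', 'GO:3', 'GO:4']
```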

Similarity values of non-comparable GO-term sets

The tool reports negative similarity values in some specific cases (cases I could identify are listed below).

Identified cases related to the issue:

term vs. term similarity calculation, using Rel_Metric function:

  1. GO terms are from different namespaces (e.g. molecular function and biological process) -> return -1

set vs. set similarity calculation, using the best match average (BMA) function:

  1. both sets are empty -> crash (zero division error)
  2. one set is empty, the other is not -> return -1
  3. all pairwise similarities are -1 -> return -1

The motivation for letting the Rel_Metric function return '-1' instead of None or null was that the max function used in the BMA function filters these values out, and max does not work with mixed types.

Unfortunately, it leads to some edge cases and I think it will easily introduce bugs if we ever decide to use a different aggregation function than BMA.

Proposed solution (suggestions welcome)

Raise an error in the Rel_Metric function and catch it with a try/except statement in the BMA function. In any newly implemented aggregation function, these errors would not be caught and therefore could not pass silently.

One question remains: What is the similarity of sets that cannot be compared? My gut says to report 0.
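The raise-and-catch approach could look roughly like this; the function names mirror the issue, the similarity values are placeholders, and incomparable sets report 0 per the suggestion above:

```python
class IncomparableTermsError(ValueError):
    """Raised when two GO terms cannot be compared (e.g. different namespaces)."""

def rel_metric(go1, go2, namespace_of):
    # Hypothetical stand-in for the real Rel_Metric: raise instead of returning -1.
    if namespace_of(go1) != namespace_of(go2):
        raise IncomparableTermsError(f"{go1} and {go2} are in different namespaces")
    return 1.0 if go1 == go2 else 0.5  # placeholder similarity value

def bma(set1, set2, namespace_of):
    # One-sided best-match average, kept short for illustration.
    best_matches = []
    for a in set1:
        candidates = []
        for b in set2:
            try:
                candidates.append(rel_metric(a, b, namespace_of))
            except IncomparableTermsError:
                continue  # skip incomparable pairs instead of mixing in -1
        if candidates:
            best_matches.append(max(candidates))
    if not best_matches:
        return 0.0  # incomparable or empty sets: report 0
    return sum(best_matches) / len(best_matches)

namespace_of = lambda go: go.split(":")[0]
print(bma(["MF:1", "BP:1"], ["MF:1"], namespace_of))  # 1.0
print(bma([], [], namespace_of))                      # 0.0 instead of a crash
```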

How should code respond on non-existing terms?

Sometimes, the user inputs a list of GO-terms that contain unknown GO-terms (e.g. terms that might have recently changed). How should our application react to this? Do we ignore those terms? Do we print an error / warning message?
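One possible policy, sketched here (not MegaGO's current behaviour), is to drop unknown terms and emit a warning so the user knows something was skipped:

```python
import logging

def filter_known_terms(terms, known):
    """Keep only terms present in the loaded ontology; warn about the rest."""
    unknown = [t for t in terms if t not in known]
    if unknown:
        logging.warning("Ignoring %d unknown GO term(s): %s",
                        len(unknown), ", ".join(unknown))
    return [t for t in terms if t in known]

print(filter_known_terms(["GO:0030170", "GO:9999999"], {"GO:0030170"}))
# ['GO:0030170'], with a warning about GO:9999999 on stderr
```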

Fix ZeroDivisionError

Sometimes a ZeroDivisionError occurs on line 106 of the megago.py file. This happens when we try to compute the BMA score for two empty lists, and can be fixed relatively easily.
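A minimal guard for the empty-list case could look like this (illustrative only; the actual fix belongs in megago.py):

```python
def safe_bma(best_matches_1, best_matches_2):
    """Average the best-match scores of both sets, guarding against the
    ZeroDivisionError that occurs when both input lists are empty."""
    total = len(best_matches_1) + len(best_matches_2)
    if total == 0:
        return None  # no terms at all: the similarity is undefined
    return (sum(best_matches_1) + sum(best_matches_2)) / total

print(safe_bma([], []))        # None instead of a crash
print(safe_bma([0.8], [0.6]))  # 0.7
```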

Pre-calculating all GO-term comparisons

Since it's very hard to come up with useful heuristics for comparing 2 GO-terms, we decided that it would make a lot of sense to just precompute the comparison values for all GO-terms (in the same namespace) and use these to speed up the process of comparing two samples with each other.

We need to investigate a few things before we can do this:

  • How are we going to store the comparison values for all GO-terms? (relational database (SQLite?)?)
  • How long will it take to compute the values for all pairs of GO-terms?
  • Can we run this on an HPC?
  • We need to parallelise this computation as much as possible to decrease run time.
  • How are we going to distribute this database? (this depends on its eventual size)

Ideas for heuristics to decrease the number of comparisons.

Comparing large sets of GO terms leads to a quadratic explosion in the number of required comparisons. This makes analyzing large sets infeasible due to enormous run times. One way to reduce the computational overhead is to try to infer the similarity of two given terms based on heuristics, and only calculate the similarity if this fails.

One heuristic is to check the namespace of the terms first and only compare terms if they are from the same namespace.

Any other ideas for sensible heuristics?
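The namespace heuristic, combined with an identical-term shortcut, could be sketched as below. Note the shortcut assumes a metric where sim(t, t) = 1, such as plain Lin similarity; as the issue further down describes, the relative metric does not behave this way:

```python
def similarity_with_shortcut(go1, go2, namespace_of, expensive_sim):
    """Run two cheap checks before falling back to the expensive calculation."""
    if namespace_of(go1) != namespace_of(go2):
        return None  # different namespaces: incomparable, skip entirely
    if go1 == go2:
        return 1.0   # identical terms: no graph traversal needed
    return expensive_sim(go1, go2)

calls = []
def fake_sim(a, b):
    calls.append((a, b))  # record how often the expensive path is taken
    return 0.5

ns = lambda go: go.split(":")[0]
print(similarity_with_shortcut("MF:1", "BP:2", ns, fake_sim))  # None, fake_sim skipped
print(similarity_with_shortcut("MF:1", "MF:1", ns, fake_sim))  # 1.0, fake_sim skipped
print(similarity_with_shortcut("MF:1", "MF:2", ns, fake_sim))  # 0.5, fake_sim called
```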

Similarity can exceed 1 if term is missing and parent is present

If a term is missing in the Swissprot database but its parent is present, the best match average can be greater than 1. This is caused by the way the BMA is calculated: the numerator is twice the information content of the lowest common ancestor of the two terms (say 2 * 0.5), while the denominator is the sum of the information content of the individual terms (0 + 0.5). In the given example, the resulting BMA would be 2.

Possible solutions:

  • inform the user that the similarity measure cannot be computed;
  • take the closest existing parent (with the lowest/the highest information content).

Interpretation problem when comparing identical high-level terms

When identical high-level terms are compared, a low score is returned, e.g.:
GO:0030170 (pyridoxal phosphate binding) vs GO:0030170 gives 99% similarity
GO:0043167 (ion binding) vs GO:0043167 gives 55% similarity
GO:0003674 (molecular function) vs GO:0003674 gives 0% similarity

I also tested what happens if that term appears multiple times in the list (e.g. 10x GO:0043167 vs 1x GO:0043167), but this gives the same result, 55% in this case.

Obtain output in matrix like text file

When running megago with multiple files, the output with semantic similarities gets printed to the terminal in an odd format. A line is printed for each sample comparison with the respective scores, and the sample names are always overwritten to "sample 0 and 1", "sample 1 and 2", etc.

I see that the heatmap option helps the interpretation by visualizing the scores and displaying the original file names as sample names. However, I would like to obtain a matrix-like text file output of the semantic similarity scores, which would essentially be the same data the heatmap is based on. Currently I don't see any option to output the similarity scores in such a format, and parsing the terminal output seems impractical, especially since the sample names are changed in the output.

Is there any possibility to achieve such a matrix-like text file output of the scores?

With missing cache files, precomputation status messages appear in output

Precomputation messages are printed to stdout instead of stderr.

Reproduce:

  • remove json files from package storage (frequency_counts_uniprot.json and highest_ic_uniprot.json).
  • execute megago, e.g. in command line
  • progress messages from the JSON rebuilding will end up in the result file

Proposed solution

Use the logging facility so that progress messages are written to stderr instead of stdout.
