
megago's Introduction

Published in: Journal of Proteome Research

MegaGO

Calculate semantic distance for sets of Gene Ontology terms.

Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

Scripts are written in Python 3. One easy way to get started is to install Miniconda 3.

On Linux:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Installing

Clone the repository:

git clone https://github.com/MEGA-GO/Mega-Go.git

Install package:

cd Mega-Go
pip install -U .

Execute example analysis:

megago sample7.txt sample8.txt

These files can be found here:

How does it work?

MegaGO calculates the similarity between two GO terms with the Lin semantic similarity (sim_Lin) metric [1]:

sim_Lin(go_1, go_2) = 2 · IC(MICA) / (IC(go_1) + IC(go_2))

where:

  • MICA: the most informative common ancestor of go_1 and go_2.
  • IC(go_i): the information content of the term go_i.
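Plugging toy information-content values into this formula, the Lin similarity could be sketched in Python as follows (the `ic` dictionary and the `mica` argument are illustrative placeholders, not MegaGO's actual API):

```python
def lin_similarity(ic, go1, go2, mica):
    """Lin similarity: twice the information content of the most
    informative common ancestor, divided by the summed information
    content of both terms."""
    denominator = ic[go1] + ic[go2]
    if denominator == 0:
        return 0.0  # both terms carry no information (e.g. root terms)
    return 2 * ic[mica] / denominator

# Toy example with made-up IC values:
ic = {"GO:A": 4.0, "GO:B": 3.0, "GO:MICA": 2.5}
print(lin_similarity(ic, "GO:A", "GO:B", "GO:MICA"))  # 2 * 2.5 / 7.0 ≈ 0.714
```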

The information content of a GO term is calculated as follows:

IC(go) = -log p(go)

The frequency p of a term go is defined as:

p(go) = (n_go + Σ_{go' ∈ c} n_go') / N

where:

  • c: the children of go.
  • N: the total number of terms in the GO corpus.
  • n_go': the number of occurrences of a term go' in a reference data set.
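A minimal sketch of this IC calculation, with hypothetical `counts` and `children` structures standing in for the real reference data set:

```python
import math

def information_content(counts, children, go, total):
    """IC(go) = -log p(go), where p(go) counts occurrences of go and of
    its children in a reference data set.

    counts   -- dict: GO term -> occurrences in the reference set (hypothetical)
    children -- dict: GO term -> set of child terms (hypothetical)
    total    -- N, the total number of terms in the corpus
    """
    n = counts.get(go, 0) + sum(counts.get(c, 0) for c in children.get(go, ()))
    return -math.log(n / total)

# Toy example: a parent seen 10 times with one child seen 40 times.
counts = {"GO:parent": 10, "GO:child": 40}
children = {"GO:parent": {"GO:child"}}
print(information_content(counts, children, "GO:parent", 1000))  # -log(0.05) ≈ 3.0
```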

To calculate the similarity of two sets of GO terms, the best match average (BMA) [1] is used:

sim_BMA(g_1, g_2) = ( Σ_{i=1..m} max_j sim(go1_i, go2_j) + Σ_{j=1..n} max_i sim(go1_i, go2_j) ) / (m + n)

where:

  • m, n: the number of terms in set g_1 and g_2, respectively.
  • sim(go1_i, go2_j): the similarity between the GO terms go1_i and go2_j.
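The BMA aggregation can be sketched as follows; `sim` stands for any pairwise term similarity function, and the toy similarity values are made up:

```python
def best_match_average(set1, set2, sim):
    """Best match average: for every term in one set take its best match
    in the other set, then average over all m + n best matches."""
    best1 = [max(sim(a, b) for b in set2) for a in set1]
    best2 = [max(sim(a, b) for a in set1) for b in set2]
    return (sum(best1) + sum(best2)) / (len(set1) + len(set2))

# Toy similarity: 1.0 for identical terms, 0.2 otherwise.
toy_sim = lambda a, b: 1.0 if a == b else 0.2
print(best_match_average(["x", "y"], ["x"], toy_sim))  # (1.0 + 0.2 + 1.0) / 3 ≈ 0.733
```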

[1]: Lin, Dekang. 1998. “An Information-Theoretic Definition of Similarity.” In Proceedings of the 15th International Conference on Machine Learning, 296–304.

Interpretation

The relative similarity ranges between 0 and 1.

sim(go1_i, go2_j) value    Interpretation
> 0.9                      highly similar functions
0.3 – 0.9                  functionally related
< 0.3                      not functionally similar

License

This project is licensed under the MIT License - see the LICENSE file for details

megago's People

Contributors

bmesuere, dependabot[bot], github-actions[bot], michbur, pverscha, rababerladuseladim, tivdnbos, wassimg


megago's Issues

Allow filtering on a subset of GO terms

Right now, we calculate our similarity scores on the full Gene Ontology, but some researchers might have a focus on a specific subset. For these users, it might be useful to filter the Gene Ontology and only run on the subset.

  • There must be a user-friendly way to select a subset. A list of terms to include is a minimum, but we probably also want to say "these terms and all child terms".
  • At first sight, we could limit the GO terms only during the similarity calculation phase. This might, however, give inaccurate results because the other terms would still be taken into account when calculating the information content. For optimal results, the filtering should also be applied when generating our initial values from Swissprot. We would need to test both approaches and see whether the second option is needed.
  • We would need to talk to some users who have this problem to get a better idea about their needs.
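The "these terms and all child terms" selection could be sketched as a simple traversal over a hypothetical parent-to-children mapping (not MegaGO's actual data structures):

```python
def expand_with_descendants(seeds, children):
    """Return the seed terms plus all their (transitive) child terms.

    children -- dict mapping a GO term to its direct children (hypothetical)
    """
    subset = set()
    stack = list(seeds)
    while stack:
        term = stack.pop()
        if term in subset:
            continue  # already visited; the GO graph can reach a term twice
        subset.add(term)
        stack.extend(children.get(term, ()))
    return subset

children = {"GO:1": ["GO:2", "GO:3"], "GO:2": ["GO:4"]}
print(sorted(expand_with_descendants(["GO:1"], children)))
# ['GO:1', 'GO:2', 'GO:3', 'GO:4']
```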

Similarity values of non-comparable GO-term sets

The tool reports negative similarity values in some specific cases (cases I could identify are listed below).

Identified cases related to the issue:

term vs. term similarity calculation, using Rel_Metric function:

  1. GO terms are from different namespaces (e.g. molecular function and biological process) -> return -1

set vs. set similarity calculation, using the best match average (BMA) function:

  1. both sets are empty -> crash (zero division error)
  2. one set is empty, the other is not -> return -1
  3. all pairwise similarities are -1 -> return -1

The motivation for letting the Rel_Metric function return '-1' instead of None or null was that the max function used in the BMA function filters these values out, and max does not work with mixed types.

Unfortunately, it leads to some edge cases and I think it will easily introduce bugs if we ever decide to use a different aggregation function than BMA.

Proposed solution (suggestions welcome)

Raise an error in the Rel_Metric function and catch it with a try/except statement in the BMA function. In any newly implemented aggregation function, these errors would not be caught and therefore could not pass silently.

One question remains: What is the similarity of sets that cannot be compared? My gut says to report 0.
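The raise-and-catch approach could look roughly like this; the function names mirror the issue, the similarity values are placeholders, and incomparable sets report 0 per the suggestion above:

```python
class IncomparableTermsError(ValueError):
    """Raised when two GO terms cannot be compared (e.g. different namespaces)."""

def rel_metric(go1, go2, namespace_of):
    # Hypothetical stand-in for the real Rel_Metric: raise instead of returning -1.
    if namespace_of(go1) != namespace_of(go2):
        raise IncomparableTermsError(f"{go1} and {go2} are in different namespaces")
    return 1.0 if go1 == go2 else 0.5  # placeholder similarity value

def bma(set1, set2, namespace_of):
    # One-sided best-match average, kept short for illustration.
    best_matches = []
    for a in set1:
        candidates = []
        for b in set2:
            try:
                candidates.append(rel_metric(a, b, namespace_of))
            except IncomparableTermsError:
                continue  # skip incomparable pairs instead of mixing in -1
        if candidates:
            best_matches.append(max(candidates))
    if not best_matches:
        return 0.0  # incomparable or empty sets: report 0
    return sum(best_matches) / len(best_matches)

namespace_of = lambda go: go.split(":")[0]
print(bma(["MF:1", "BP:1"], ["MF:1"], namespace_of))  # 1.0
print(bma([], [], namespace_of))                      # 0.0 instead of a crash
```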

How should code respond on non-existing terms?

Sometimes, the user inputs a list of GO-terms that contain unknown GO-terms (e.g. terms that might have recently changed). How should our application react to this? Do we ignore those terms? Do we print an error / warning message?
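One possible policy, sketched here (not MegaGO's current behaviour), is to drop unknown terms and emit a warning so the user knows something was skipped:

```python
import logging

def filter_known_terms(terms, known):
    """Keep only terms present in the loaded ontology; warn about the rest."""
    unknown = [t for t in terms if t not in known]
    if unknown:
        logging.warning("Ignoring %d unknown GO term(s): %s",
                        len(unknown), ", ".join(unknown))
    return [t for t in terms if t in known]

print(filter_known_terms(["GO:0030170", "GO:9999999"], {"GO:0030170"}))
# ['GO:0030170'], with a warning about GO:9999999 on stderr
```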

Fix ZeroDivisionError

Sometimes a ZeroDivisionError occurs on line 106 of the megago.py file. This happens when we try to compute the BMA score for two empty lists, and can be fixed relatively easily.
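A minimal guard for the empty-list case could look like this (illustrative only; the actual fix belongs in megago.py):

```python
def safe_bma(best_matches_1, best_matches_2):
    """Average the best-match scores of both sets, guarding against the
    ZeroDivisionError that occurs when both input lists are empty."""
    total = len(best_matches_1) + len(best_matches_2)
    if total == 0:
        return None  # no terms at all: the similarity is undefined
    return (sum(best_matches_1) + sum(best_matches_2)) / total

print(safe_bma([], []))        # None instead of a crash
print(safe_bma([0.8], [0.6]))  # 0.7
```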

Pre-calculating all GO-term comparisons

Since it's very hard to come up with useful heuristics for comparing 2 GO-terms, we decided that it would make a lot of sense to just precompute the comparison values for all GO-terms (in the same namespace) and use these to speed up the process of comparing two samples with each other.

We need to investigate a few things before we can do this:

  • How are we going to store the comparison values for all GO-terms? (relational database (SQLite?)?)
  • How long will it take to compute the values for all pairs of GO-terms?
  • Can we run this on an HPC?
  • We need to parallelise this computation as much as possible to decrease run time.
  • How are we going to distribute this database? (this depends on its eventual size)

Ideas for heuristics to decrease the number of comparisons.

Comparing large sets of GO terms leads to a quadratic explosion in the number of required comparisons. This makes analyzing large sets infeasible due to enormous run times. One way to reduce the computational overhead is to try to infer the similarity of two given terms based on heuristics, and only calculate the similarity if this fails.

One heuristic is to check the namespace of the terms first and only compare terms if they are from the same namespace.

Any other ideas for sensible heuristics?
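The namespace heuristic, combined with an identical-term shortcut, could be sketched as below. Note the shortcut assumes a metric where sim(t, t) = 1, such as plain Lin similarity; as the issue further down describes, the relative metric does not behave this way:

```python
def similarity_with_shortcut(go1, go2, namespace_of, expensive_sim):
    """Run two cheap checks before falling back to the expensive calculation."""
    if namespace_of(go1) != namespace_of(go2):
        return None  # different namespaces: incomparable, skip entirely
    if go1 == go2:
        return 1.0   # identical terms: no graph traversal needed
    return expensive_sim(go1, go2)

calls = []
def fake_sim(a, b):
    calls.append((a, b))  # record how often the expensive path is taken
    return 0.5

ns = lambda go: go.split(":")[0]
print(similarity_with_shortcut("MF:1", "BP:2", ns, fake_sim))  # None, fake_sim skipped
print(similarity_with_shortcut("MF:1", "MF:1", ns, fake_sim))  # 1.0, fake_sim skipped
print(similarity_with_shortcut("MF:1", "MF:2", ns, fake_sim))  # 0.5, fake_sim called
```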

Similarity can exceed 1 if term is missing and parent is present

If a term is missing in the Swissprot database but its parent is present, the best match average can be greater than 1. This is caused by the way the BMA is calculated: the numerator is twice the information content of the lowest common ancestor of the two terms (say 2 * 0.5), while the denominator is the sum of the information content of the individual terms (0 + 0.5). In the given example, the resulting BMA would be 2.

Possible solutions:

  • inform the user that the similarity measure cannot be computed;
  • take the closest existing parent (with the lowest/the highest information content).

Interpretation problem when comparing identical high-level terms

When identical high-level terms are compared, a low score is returned, e.g.:
GO:0030170 (pyridoxal phosphate binding) vs GO:0030170 gives 99% similarity
GO:0043167 (ion binding) vs GO:0043167 gives 55% similarity
GO:0003674 (molecular function) vs GO:0003674 gives 0% similarity

I also tested what happens if that term appears multiple times in the list (e.g. 10x GO:0043167 vs 1x GO:0043167), but this gives the same result, 55% in this case.

Obtain output in matrix like text file

When running megago with multiple files, the output with semantic similarities gets printed to the terminal in an odd format. A line is printed for each sample comparison with the respective scores, and the sample names are always overwritten to "sample 0 and 1", "sample 1 and 2", etc.

I see that the heatmap option helps the interpretation by visualizing the scores and displaying the original file names as sample names. However, I would like to obtain a matrix-like text file output of the semantic similarity scores, which would essentially be the same data the heatmap is based on. Currently I don't see any option to output the similarity scores in such a format, and parsing the terminal output seems impractical, especially since the sample names are changed in the output.

Is there any possibility to achieve such a matrix-like text file output of the scores?

With missing cache files, precomputation status messages appear in output

Precomputation messages are printed to stdout instead of stderr.

Reproduce:

  • remove json files from package storage (frequency_counts_uniprot.json and highest_ic_uniprot.json).
  • execute megago, e.g. in command line
  • progress messages from the JSON rebuilding will end up in the result file

Proposed solution

Use the logging facility so that progress messages are written to stderr instead of stdout.
