Code Monkey home page Code Monkey logo

homologene's Introduction

homologene

Build Status codecov

An r package that works as a wrapper to homologene

Available species are

homologene::taxData
##    tax_id                      name_txt
## 1   10090                  Mus musculus
## 2   10116             Rattus norvegicus
## 3   28985          Kluyveromyces lactis
## 4  318829            Magnaporthe oryzae
## 5   33169         Eremothecium gossypii
## 6    3702          Arabidopsis thaliana
## 7    4530                  Oryza sativa
## 8    4896     Schizosaccharomyces pombe
## 9    4932      Saccharomyces cerevisiae
## 10   5141             Neurospora crassa
## 11   6239        Caenorhabditis elegans
## 12   7165             Anopheles gambiae
## 13   7227       Drosophila melanogaster
## 14   7955                   Danio rerio
## 15   8364 Xenopus (Silurana) tropicalis
## 16   9031                 Gallus gallus
## 17   9544                Macaca mulatta
## 18   9598               Pan troglodytes
## 19   9606                  Homo sapiens
## 20   9615        Canis lupus familiaris
## 21   9913                    Bos taurus

Installation

install.packages('homologene')

or

devtools::install_github('oganm/homologene')

Usage

Basic homologene function requires a list of gene symbols or NCBI ids, and an inTax and an outTax. In this example, inTax is the taxon id of mus musculus while outTax is for humans.

homologene(c('Eno2','Mog'), inTax = 10090, outTax = 9606)
##   10090 9606 10090_ID 9606_ID
## 1  Eno2 ENO2    13807    2026
## 2   Mog  MOG    17441    4340
homologene(c('Eno2','17441'), inTax = 10090, outTax = 9606)
##   10090 9606 10090_ID 9606_ID
## 1  Eno2 ENO2    13807    2026
## 2   Mog  MOG    17441    4340

For mouse and humans two convenience functions exist that removes the need to provide taxonomic identifiers. Note that the column names are not the same as the homologene output.

mouse2human(c('Eno2','Mog'))
##   mouseGene humanGene mouseID humanID
## 1      Eno2      ENO2   13807    2026
## 2       Mog       MOG   17441    4340
human2mouse(c('ENO2','MOG','GZMH'))
##   humanGene mouseGene humanID mouseID
## 1      ENO2      Eno2    2026   13807
## 2       MOG       Mog    4340   17441
## 3      GZMH      Gzmd    2999   14941
## 4      GZMH      Gzme    2999   14942
## 5      GZMH      Gzmg    2999   14944
## 6      GZMH      Gzmf    2999   14943

homologeneData2

Original homologene database has not been updated since 2014. This package also includes an updated version of the homologene database that replaces gene symbols and identifiers with the their latest version. For the procedure followed for updating, see this blog post and/or see the processing code.

Using the updated version can help you match genes that cannot matched due to out of date annotations.

mouse2human(c('Mesd',
              'Trp53rka',
              'Cstdc4',
              'Ifit3b'))
## [1] mouseGene humanGene mouseID   humanID  
## <0 rows> (or 0-length row.names)
mouse2human(c('Mesd',
              'Trp53rka',
              'Cstdc4',
              'Ifit3b'),
            db = homologeneData2)
##   mouseGene humanGene mouseID humanID
## 1      Mesd      MESD   67943   23184
## 2  Trp53rka    TP53RK  381406  112858
## 3    Cstdc4      CSTA  433016    1475
## 4    Ifit3b     IFIT3  667370    3437

The homologeneData2 object that comes with the GitHub version of this package is updated weekly but if you are using the CRAN version and want the latest annotations, or if you want to keep a frozen version homologene, you can use the updateHomologene function.

homologeneDataVeryNew = updateHomologene() # update the homologene database with the latest identifiers

mouse2human(c('Mesd',
              'Trp53rka',
              'Cstdc4',
              'Ifit3b'),
            db = homologeneDataVeryNew)

Gene ID syncronization

The package also includes functions that were used to create the homologeneData2, for updating outdated gene symbols and identifiers.

library(dplyr)

gene_history = getGeneHistory()
oldIds = c(4340964, 4349034, 4332470, 4334151, 4323831)
newIds = updateIDs(oldIds,gene_history)
print(newIds)
## [1] "9267698" "4349033" "4332468" "4334150" "4324017"
# get the latest gene symbols for the ids

gene_info = getGeneInfo()

gene_info %>%
    dplyr::filter(GeneID %in% as.integer(newIds)) # faster to match integers
## # A tibble: 5 × 3
##   tax_id  GeneID Symbol    
##    <int>   <int> <chr>     
## 1  39947 4324017 LOC4324017
## 2  39947 4332468 SPO11-3   
## 3  39947 4334150 LOC4334150
## 4  39947 4349033 LOC4349033
## 5  39947 9267698 LOC9267698

Querying DIOPT

Instead of using just homologene, one can also make queries into the DIOPT database. Diopt uses multiple databases to find gene homolog/orthologues. Note that this function has a delay parameter that is set to 10 seconds by default. This was done to obey the robots.txt of their website.

diopt(c('GZMH'),inTax = 9606, outTax = 10090) %>% 
    knitr::kable()
Input Order Search Term Human GeneID HGNCID Human Symbol Species 2 Mouse GeneID Mouse Species Gene ID Mouse Symbol Ensmbl ID (link HPA) DIOPT Score Weighted Score Rank Best Score Best Score Reverse Prediction Derived From Alignment & Scores Feedback Gene Details
1 GZMH 2999 4710 GZMH Mouse 14944 109253 Gzmg NA 6 5.87 high Yes Yes eggNOG, Homologene, Isobase, OrthoDB, Panther, Phylome NA Add G2F details (Gzmg) DRscDB Data: (Gzmg)
1 GZMH 2999 4710 GZMH Mouse 14939 109267 Gzmb NA 6 5.85 moderate Yes No eggNOG, OrthoDB, orthoMCL, Panther, Phylome, RoundUp NA Add G2F details (Gzmb) DRscDB Data: (Gzmb)
1 GZMH 2999 4710 GZMH Mouse 245839 2675494 Gzmn NA 5 4.93 moderate No No eggNOG, OMA, OrthoDB, Panther, Phylome NA Add G2F details (Gzmn) DRscDB Data: (Gzmn)
1 GZMH 2999 4710 GZMH Mouse 14940 109256 Gzmc NA 5 4.93 moderate No No eggNOG, OMA, OrthoDB, Panther, Phylome NA Add G2F details (Gzmc) DRscDB Data: (Gzmc)
1 GZMH 2999 4710 GZMH Mouse 14941 109255 Gzmd NA 5 4.92 moderate No No eggNOG, Homologene, OrthoDB, Panther, Phylome NA Add G2F details (Gzmd) DRscDB Data: (Gzmd)
1 GZMH 2999 4710 GZMH Mouse 14943 109254 Gzmf NA 5 4.92 moderate No No eggNOG, Homologene, OrthoDB, Panther, Phylome NA Add G2F details (Gzmf) DRscDB Data: (Gzmf)
1 GZMH 2999 4710 GZMH Mouse 14942 109265 Gzme NA 5 4.92 moderate No No eggNOG, Homologene, OrthoDB, Panther, Phylome NA Add G2F details (Gzme) DRscDB Data: (Gzme)
1 GZMH 2999 4710 GZMH Mouse 17231 1261780 Mcpt8 NA 4 3.97 moderate No No eggNOG, OrthoDB, Panther, TreeFam NA Add G2F details (Mcpt8) DRscDB Data: (Mcpt8)
1 GZMH 2999 4710 GZMH Mouse 13035 88563 Ctsg NA 2 1.91 low No No eggNOG, OrthoDB NA Add G2F details (Ctsg) DRscDB Data: (Ctsg)
1 GZMH 2999 4710 GZMH Mouse 17228 96941 Cma1 NA 2 1.91 low No No eggNOG, OrthoDB NA Add G2F details (Cma1) DRscDB Data: (Cma1)
1 GZMH 2999 4710 GZMH Mouse 14938 109266 Gzma NA 1 1.03 low No No RoundUp NA Add G2F details (Gzma) DRscDB Data: (Gzma)
1 GZMH 2999 4710 GZMH Mouse 17232 1194491 Mcpt9 NA 1 1.01 low No Yes OrthoDB NA Add G2F details (Mcpt9) DRscDB Data: (Mcpt9)
1 GZMH 2999 4710 GZMH Mouse 17225 96938 Mcpt2 NA 1 1.01 low No Yes OrthoDB NA Add G2F details (Mcpt2) DRscDB Data: (Mcpt2)
1 GZMH 2999 4710 GZMH Mouse 17224 96937 Mcpt1 NA 1 1.01 low No Yes OrthoDB NA Add G2F details (Mcpt1) DRscDB Data: (Mcpt1)
1 GZMH 2999 4710 GZMH Mouse 545055 88426 Cma2 NA 1 1.01 low No Yes OrthoDB NA Add G2F details (Cma2) DRscDB Data: (Cma2)
diopt(c('Eno2','Mog'),inTax = 10090, outTax =9606) %>%
    knitr::kable()
Input Order Search Term Mouse GeneID MGIID Mouse Symbol Species 2 Human GeneID Human Species Gene ID Human Symbol Ensmbl ID (link HPA) DIOPT Score Weighted Score Rank Best Score Best Score Reverse Prediction Derived From Alignment & Scores Feedback Gene Details
1 Eno2 13807 95394 Eno2 Human 2026 3353 ENO2 ENSG00000111674 14 14.29 high Yes Yes Compara, eggNOG, HGNC, Hieranoid, Homologene, Inparanoid, OMA, OrthoFinder, OrthoInspector, orthoMCL, Panther, Phylome, RoundUp, TreeFam NA Add G2F details (ENO2) DRscDB Data: (ENO2)
Human Protein Atlas (HPA)
1 Eno2 13807 95394 Eno2 Human 2023 3350 ENO1 ENSG00000074800 5 4.84 moderate No No eggNOG, OrthoDB, OrthoFinder, orthoMCL, RoundUp NA Add G2F details (ENO1) DRscDB Data: (ENO1)
Human Protein Atlas (HPA)
1 Eno2 13807 95394 Eno2 Human 2027 3354 ENO3 ENSG00000108515 4 3.83 moderate No No eggNOG, OrthoFinder, orthoMCL, RoundUp NA Add G2F details (ENO3) DRscDB Data: (ENO3)
Human Protein Atlas (HPA)
1 Eno2 13807 95394 Eno2 Human 2580 4113 GAK ENSG00000178950 1 1.01 low No No OrthoDB NA Add G2F details (GAK) DRscDB Data: (GAK)
Human Protein Atlas (HPA)
1 Eno2 13807 95394 Eno2 Human 2534 4037 FYN ENSG00000010810 1 1.01 low No No OrthoDB NA Add G2F details (FYN) DRscDB Data: (FYN)
Human Protein Atlas (HPA)
1 Eno2 13807 95394 Eno2 Human 387712 31670 ENO4 ENSG00000188316 1 0.90 low No No eggNOG NA Add G2F details (ENO4) DRscDB Data: (ENO4)
Human Protein Atlas (HPA)
2 Mog 17441 97435 Mog Human 4340 7197 MOG ENSG00000204655 15 15.30 high Yes Yes Compara, eggNOG, HGNC, Hieranoid, Homologene, Inparanoid, OMA, OrthoDB, OrthoFinder, OrthoInspector, orthoMCL, Panther, Phylome, RoundUp, TreeFam NA Add G2F details (MOG) DRscDB Data: (MOG)
Human Protein Atlas (HPA)

Mishaps

As of version version 1.1.68, the output now includes NCBI ids. Since it doesn’t change any of the existing column names or their order, this shouldn’t cause problems in most use cases.

If a you can’t find a gene you are looking for it may have synonyms. See geneSynonym package to find them. If you have other problems open an issue or send a mail.

homologene's People

Contributors

jimhester avatar leonfrench avatar oganm avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

homologene's Issues

homologeneData2 convergence

It seems like there are times where 2 different genes with different HIDs can converge into a single gene ID. find a way to deal with that. If given a gene ID that has multiple HIDs associated with it, what is the correct output

Also history seems to be capable of altering itself to break previously existing connections. Found 16 cases so far. Mostly LOC genes but 2 real ones. Keeping things up to date as a chain might not be the best idea. For CRAN submission, update from the latest file. For later updates will have to think what to do. Investigating the broken connection between 417553 and 107054794 suggests they are the same gene

return NA for non-mappable genes

Hi,

I'm trying to use homologene to rename the rows of a gene expression table. As indicated in the docs, queries for genes that cannot be mapped are not returned in the results

mapped_genes <- mouse2human(c('NoKnownHomolog1,'NoKnownHomolog2'))
## [1] mouseGene humanGene mouseID   humanID  
## <0 rows> (or 0-length row.names)

This means the resulting dataframe will be of a different length than the input vector, which prevents doing something like
new_rownames<- mouse2human(rownames(input_data))$humanGene, followed by dropping all NA rows from the df.
Is there any feature/best practice to apply homologene to this problem?

ensembl support

Ebsembl is problematic because each species has its own database with their own gene symbol name that needs to be manually identified. To be as smooth as homologene, I need to translate any possible NCBI ids and gene symbols in the input to ensembl ids, get symbols and ncbi ids from the ensembl ids of the output.

It seems possible but will require manual fiddling

Add more species in homologene

Hi,
I am wondering whether it is possible to add the two modeling species in Plant, Arabidopsis and rice in the homologene package? I find your R package is really convenient to handle large amounts of data.

Dealing with changes in the history file

It seems like genes_history file can gain or lose events retroactively. This means certain paths to new genes occasionally break which means certain genes can become unavailable between different versions and there would be no way to identify what happened without looking at a past version of the file. Of course such an archive of gene_history files do not exist, to my understanding the whole point of gene_history file was to be such an archive anyway. Edits into the past do break this idea.

This doesn't happen frequently as far as I can see. Over the course of two months I lost 1 connection between a mouse gene in original homologene and its future version. I didn't really dig too deep into this though.

The major implication is that using updateHomologene function with the baseline homologeneData2 included in CRAN, which is the default parameter, you may get a different file compared to what I generate here since I always start from the original homologene. The default was set to homologeneData2 because it is significantly faster to update from a later starting point. No sure if I should change this default in light of this.

weekly updates stopped?

Hi,
Small question: maybe I am wrong, but it seems that since June 7 somehow the weekly updates of homologeneData2 have stopped. Or is a less frequent update schedule implemented now?

Thanks,
Guido

Nice work, thanks!

Hi,
No question or problem, but just saying thanks for keeping your package and the underlying homologene-info up-to-date; much appreciated!

Guido

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.