Comments (4)
I also found the same issue with the Taylor dataset:
taylor.gex <- curatedPCaData::mae_taylor[["gex"]]
dim(taylor.gex)
[1] 17638 300
row.names(taylor.gex)[duplicated(row.names(taylor.gex))]
[1] "ABCB9" "ABHD2" "ADAMTS6" "ADAMTSL1" "ADAMTSL4" "AFG3L1P" "AGAP6" "AGPAT3" "ALG13" "ANKRD36B" "ANKS6"
[12] "ARHGAP19" "ARHGAP27" "ARHGEF10" "ARID5A" "ATL3" "ATP11B" "ATP13A4" "ATP6V1C2" "ATXN7L1" "BACH1" "BIRC5"
[23] "BMP8A" "BOD1L1" "BRK1" "BTN3A2" "BZW1" "C1orf86" "C1QTNF5" "C21orf62" "C4orf3" "CACNA1C" "CCDC108"
[34] "CCDC127" "CCDC144A" "CCDC150" "CCDC57" "CCDC93" "CCNL2" "CD22" "CEP128" "CHRNA7" "CLIC5" "COX7A2"
[45] "CPLX2" "CRB2" "CSH1" "CYP2A7" "CYP2R1" "DBF4B" "DENND1B" "DGCR14" "DIDO1" "DIP2B" "DKK3"
[56] "DLG2" "DNAH10" "DNAH3" "DNAH6" "DNHD1" "DPY19L2P2" "EEF1D" "EPHA10" "EPHA6" "FAM13A" "FAM91A1"
[67] "FANCI" "FKBP9" "FOLH1" "FOXK2" "FSD1L" "FSIP2" "FZD5" "GFOD1" "GLIS3" "GNA11" "GNAI2"
[78] "GOLGA8B" "GPR107" "GPR110" "GPR26" "GRK5" "H2AFV" "HEATR1" "HIP1" "HIST1H2BJ" "HIST1H3I" "IGF1R"
[89] "IGF2" "INF2" "ING5" "KALRN" "KIAA0101" "KIAA0125" "KIAA1456" "KIF24" "KIF5A" "KIR2DL4" "KLHDC10"
[100] "LASP1" "LINC00982" "LMBR1" "LNPEP" "LPHN1" "LRRFIP1" "LRRIQ3" "LRRK1" "LSM14B" "LSM14B" "LYRM4"
[111] "LYRM7" "MAFK" "MB" "MBP" "MCM9" "MECOM" "MEGF8" "MMP16" "MYCN" "MZT2B" "N4BP2L1"
[122] "NAIF1" "NCKAP5" "NDUFS4" "NEK10" "NHLRC2" "NOL4L" "NPR3" "NUDT22" "ORAOV1" "P4HA2" "PANK2"
[133] "PCDH11X" "PCID2" "PCLO" "PGPEP1" "PHLDB2" "PHLDB3" "PIEZO2" "PLB1" "PLXNA2" "PPIH" "PRR13"
[144] "PRR14L" "PRRC2B" "PRRC2B" "PRUNE2" "PSD3" "PTEN" "PTGR1" "PXN" "RAB3IP" "RAPGEF1" "RASGRF1"
[155] "RBM14" "RBM6" "RBMS1" "REEP3" "RFT1" "RGSL1" "RNF17" "RNF207" "RNF213" "ROPN1B" "RPL14"
[166] "RPL23AP7" "RPL27A" "RSC1A1" "RUSC1-AS1" "SAA1" "SAP25" "SCN8A" "SEC62" "SERHL2" "SH3PXD2A" "SHISA6"
[177] "SHPK" "SLC1A5" "SLC25A37" "SLC7A14" "SMAD2" "SNRPN" "SORBS2" "SPAG11B" "SPEM1" "SPTB" "SRP19"
[188] "ST8SIA2" "STMN1" "STX1B" "SZT2" "TAMM41" "TBC1D16" "TEAD1" "TF" "TMEM51-AS1" "TOM1L1" "TOM1L2"
[199] "TOR1AIP2" "TRA2A" "TRERF1" "TRMT44" "TSHZ2" "TSPAN9" "TTC40" "TUBB2B" "UBE2D3" "UNKL" "VANGL1"
[210] "VKORC1L1" "VPS53" "WDFY4" "WDR90" "WNK1" "XKR6" "YIF1B" "ZC3HAV1L" "ZDHHC20" "ZDHHC3" "ZNF384"
[221] "ZNF385C" "ZNF479" "ZNF678" "ZNF706" "ZNF75A" "ZNF8" "ZRANB1" "ZXDC"
taylor.gex <- ExpressionSet(assayData=taylor.gex[!duplicated(row.names(taylor.gex)),])
dim(taylor.gex)
Features Samples
17410 300
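Note that `!duplicated(...)` keeps only the first row for each symbol and silently discards the values of the later ones. A minimal sketch of the difference between that and averaging rows that share a symbol, using a made-up toy matrix (the values and the averaging strategy are illustrative assumptions, not something the package does):

```r
# Toy matrix with a repeated row name; values are invented for illustration.
gex <- matrix(c(1, 2,
                3, 4,
                5, 6),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("PTEN", "PTEN", "TF"), c("s1", "s2")))

# Option 1: keep the first row per symbol (what the snippet above does).
first_only <- gex[!duplicated(rownames(gex)), , drop = FALSE]

# Option 2: average all rows that share a symbol.
sums     <- rowsum(gex, group = rownames(gex))             # per-symbol column sums
counts   <- as.vector(table(rownames(gex))[rownames(sums)]) # rows per symbol
averaged <- sums / counts                                   # recycled down each column

first_only
averaged
```

Which option is appropriate depends on why the symbols are duplicated in the first place, so this is only a starting point for inspection.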
from curatedpcadata.
I've managed to track down the source of this issue: it comes down to differences in how biomaRt and cBioPortal handle gene symbols and aliases. For example, RCAN1 and CSP1 are both mapped to RCAN1 in cBioPortal, whereas querying them from the Ensembl database with biomaRt assigns them separate gene symbols. The other duplicated gene is RCAN2. I don't know why Ensembl returns separate gene symbols, so I'll have to think about whether I can come up with a generalized solution or just work one out for RCAN1 and RCAN2 (and their potential aliases):
> head(cgdsr::getProfileData(mycgds, genes = c("RCAN1", "CSP1"), geneticProfile = "prad_tcga_pub_rna_seq_v2_mrna", caseList = "prad_tcga_pub_sequenced"))
RCAN1 RCAN1.1
TCGA.EJ.5502.01 660.5782 660.5782
TCGA.HC.7209.01 469.9961 469.9961
TCGA.HC.7748.01 349.4016 349.4016
TCGA.J4.A83N.01 390.1099 390.1099
TCGA.2A.A8VV.01 238.6163 238.6163
TCGA.2A.A8VT.01 325.7353 325.7353
> c("RCAN1", "CSP1") %in% unique(curatedPCaData:::curatedPCaData_genes$hgnc_symbol)
[1] TRUE TRUE
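The collision described here can be sketched in base R: once an alias is resolved to its primary symbol, two distinct query symbols end up with the same name. The `alias_map` table and the `resolve_aliases()` helper below are hypothetical and hold only the single mapping reported in this comment; a real table would have to come from an annotation source.

```r
# Symbols as they might appear in a query list (toy example).
symbols <- c("RCAN1", "CSP1", "PTEN")

# Hypothetical alias -> primary symbol table; only the mapping reported above.
alias_map <- c(CSP1 = "RCAN1")

# Hypothetical helper: replace any known alias with its primary symbol.
resolve_aliases <- function(x, map) {
  hit <- x %in% names(map)
  x[hit] <- unname(map[x[hit]])
  x
}

resolved <- resolve_aliases(symbols, alias_map)

# Group the original symbols by resolved name and keep groups with > 1 member.
collisions <- split(symbols, resolved)
collisions <- collisions[lengths(collisions) > 1]
collisions

# Repeated names get a numeric suffix when R makes them unique,
# which matches the "RCAN1.1" column header in the output above.
make.unique(resolved)
```

Listing the collisions before building a matrix makes it possible to decide case by case whether two symbols really refer to the same gene.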
Interestingly, GeneCards lists CSP1 as an alias for RCAN1 but not vice versa: querying for CSP1 yields a unique gene not connected to RCAN1 (one is located on chromosome 2, the other on chromosome 21):
https://www.genecards.org/cgi-bin/carddisp.pl?gene=RCAN1
https://www.genecards.org/cgi-bin/carddisp.pl?gene=CSP1
These gene mappings are no longer an issue, as we don't query from cBio:
> table(duplicated(row.names(mae_tcga[["gex.fpkm"]])))
FALSE
58387
> table(duplicated(row.names(mae_taylor[["gex.rma"]])))
FALSE
17410
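A check like the one above could also be wrapped in a small guard that fails loudly if duplicates ever reappear. This is a minimal sketch; `check_unique_rownames()` is a hypothetical helper, not a function from curatedPCaData, and the matrices are toy data.

```r
# Hypothetical guard: stop with an informative message if any row name repeats.
check_unique_rownames <- function(x) {
  rn  <- rownames(x)
  dup <- unique(rn[duplicated(rn)])
  if (length(dup) > 0) {
    stop("Duplicated row names: ", paste(dup, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy matrices for illustration only.
ok  <- matrix(1:4, nrow = 2, dimnames = list(c("PTEN", "TF"), c("s1", "s2")))
bad <- matrix(1:4, nrow = 2, dimnames = list(c("PTEN", "PTEN"), c("s1", "s2")))

check_unique_rownames(ok)                     # passes silently
res <- try(check_unique_rownames(bad), silent = TRUE)
inherits(res, "try-error")                    # the guard raised an error
```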