seandavi / geoquery
The bridge between the NCBI Gene Expression Omnibus and Bioconductor
Home Page: http://seandavi.github.io/GEOquery
License: MIT License
Unfortunately I don't have a build report to point you to because our windows devel build didn't happen today (for an unrelated reason) but I noticed that on windows GEOquery fails a unit test:
1 Test Suite :
GEOquery RUnit Tests - 14 test functions, 0 errors, 1 failure
FAILURE in testSuppFileSupport: Error in checkEquals(10, ncol(fres)) : Mean relative difference: 0.3
Test files with failing tests
test_SuppFileSupport.R
testSuppFileSupport
Basically it's because file.info() returns a different number of columns on Windows (7) than on unix (10).
So the test should be fixed.
Thanks,
Dan
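One way to make such a test platform-independent is to assert on the columns it actually needs rather than on the total column count. The following is only a sketch under that assumption; the temporary file and the `required_cols` vector are illustrative and are not taken from the GEOquery test suite.

```r
# Sketch: check for the columns a test relies on instead of the
# platform-dependent total number of columns returned by file.info().
tmp <- tempfile(fileext = ".txt")
writeLines("example", tmp)

fres <- file.info(tmp)

# Columns that file.info() provides on every platform; the extra
# columns (for example uid/gid on Unix-alikes) differ between systems.
required_cols <- c("size", "isdir", "mode", "mtime", "ctime", "atime")
has_required <- all(required_cols %in% colnames(fres))

unlink(tmp)
```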
reported in email:
library(GEOquery)
GSE37405 <- getGEO(GEO = "GSE37405", GSEMatrix = T, destdir = getwd())
results in:
Error: Duplicate identifiers for rows (232, 285, 338, 391, 444, 497, 550), (503, 556), (147, 200, 253, 306, 359, 412, 465, 518, 571)
This is a general problem, so not specific to this GSE.
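The error arises whenever a long-to-wide reshape sees the same key twice for one sample. The sketch below reproduces that situation with a toy table in base R and shows one possible way to disambiguate the keys; the data and the `k_unique` column are hypothetical, and this is not necessarily how GEOquery resolves the problem.

```r
# Toy long-format table: sample "s1" carries the key "tissue" twice,
# which is what makes a long-to-wide reshape ambiguous.
pd <- data.frame(
  sample = c("s1", "s1", "s1", "s2"),
  k      = c("tissue", "tissue", "age", "tissue"),
  v      = c("liver", "kidney", "40", "liver"),
  stringsAsFactors = FALSE
)

# Flag every row whose (sample, key) pair occurs more than once.
pair <- paste(pd$sample, pd$k, sep = "\r")
is_dup <- duplicated(pair) | duplicated(pair, fromLast = TRUE)

# One possible repair (an assumption): number repeated keys within
# each sample so that every (sample, key) pair becomes unique.
pd$k_unique <- ave(pd$k, pd$sample, FUN = make.unique)
```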
getGEO("GPL13112")
Error in download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
cannot open URL 'http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL13112&form=text&view=full'
$ GEOquery::getGEO('GSE23397')
Traceback:
1. GEOquery::getGEO("GSE23397")
2. getAndParseGSEMatrices(GEO, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL,
. parseCharacteristics = parseCharacteristics)
3. parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL,
. getGPL = getGPL)
4. dplyr::mutate(pd, characteristics = ifelse(grepl("_ch2", characteristics),
. "ch2", "ch1")) %>% tidyr::separate(kvpair, into = c("k",
. "v"), sep = ":", fill = "right") %>% dplyr::mutate(k = paste(k,
. characteristics, sep = ":")) %>% dplyr::select(-characteristics) %>%
. tidyr::spread(k, v)
5. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. eval(quote(`_fseq`(`_lhs`)), env, env)
8. `_fseq`(`_lhs`)
9. freduce(value, `_function_list`)
10. withVisible(function_list[[k]](value))
11. function_list[[k]](value)
12. tidyr::spread(., k, v)
13. spread.data.frame(., k, v)
14. abort(glue("Duplicate identifiers for rows {rows}"))
Retrieving data from getGEO seems to fail:
> library(GEOquery)
> gse <- getGEO('21653')
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE21nnn/GSE21653/matrix/
Found 1 file(s)
GSE21653_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE21nnn/GSE21653/matrix/GSE21653_series_matrix.txt.gz'
ftp data connection made, file length 22510111 bytes
==================================================
downloaded 22.5 MB
File stored at:
/tmp/Rtmp96s5ht/GPL570.soft
Error in xj[i] : only 0's may be mixed with negative subscripts
From: Hervé Pagès [email protected]
Sent: Sep 27, 2017 7:27 PM
To: "Davis, Sean (NIH/NCI) [E]" [email protected]
Cc: "Shepherd, Lori" [email protected]
Subject: intermittent GEO errors
Hi Sean,
We seem to get these intermittent GEO errors on our build reports
pretty often these days, maybe more than usual. For example today
we see them in release here:
https://bioconductor.org/checkResults/3.5/bioc-LATEST/GEOquery/malbec2-checksrc.html
https://bioconductor.org/checkResults/3.5/bioc-LATEST/ChIPXpress/malbec2-checksrc.html
and in devel here:
https://bioconductor.org/checkResults/3.6/bioc-LATEST/ChIPXpress/malbec1-checksrc.html
Do you have any idea why the GEO service is so flaky?
Are the sys admins in charge of ftp.ncbi.nlm.nih.gov aware of this?
Is there anything that could be done to improve this situation?
Was just wondering if you had any insight on this.
Thanks,
H.
E.g. GSE30729, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30729
Hi Sean,
geoq <- getGEO("GSE9514")
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
Found 1 file(s)
GSE9514_series_matrix.txt.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 378k 100 378k 0 0 204k 0 0:00:01 0:00:01 --:--:-- 204k
File stored at:
/data3/tmp/RtmpkDXZzR/GPL90.soft
Error in xj[i] : only 0's may be mixed with negative subscripts
And the error appears to come from this section in parseGPL():
if (hasDataTable) {
nLinesToRead <- NULL
if (!is.null(n)) {
nLinesToRead <- n - length(txt)
}
dat3 <- fastTabRead(con, n = nLinesToRead, quote = "")
geoDataTable <- new("GEODataTable", columns = cols, table = dat3[1:(nrow(dat3) -
1), ])
}
Where there is no error trapping for the case that fastTabRead returns a zero row data.frame:
debug: dat3 <- fastTabRead(con, n = nLinesToRead, quote = "")
Browse[3]> dim(dat3)
[1] 0 17
Browse[3]> dat3
[1] ID ORF
[3] SPOT_ID Species Scientific Name
[5] Annotation Date Sequence Type
[7] Sequence Source Target Description
[9] Representative Public ID Gene Title
[11] Gene Symbol ENTREZ_GENE_ID
[13] RefSeq Transcript ID SGD accession number
[15] Gene Ontology Biological Process Gene Ontology Cellular Component
[17] Gene Ontology Molecular Function
<0 rows> (or 0-length row.names)
Best,
Jim
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
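The failure mode described above can be reproduced without GEO: for a zero-row data.frame, `1:(nrow(x) - 1)` evaluates to `c(1, 0, -1)`, which mixes positive and negative subscripts. The sketch below shows one possible guard; `drop_last_row` is a hypothetical helper, not a GEOquery function.

```r
# Hypothetical helper: drop the last row of a table, tolerating
# tables that have no rows at all. head(df, -1) returns all but the
# last row and yields zero rows when there is nothing to drop.
drop_last_row <- function(df) {
  head(df, -1)
}

# A zero-row table, as fastTabRead returned in the report above.
empty <- data.frame(ID = character(0), ORF = character(0))

# The unguarded indexing from parseGPL() raises a subscript error here.
unguarded <- tryCatch(
  empty[1:(nrow(empty) - 1), ],
  error = function(e) "error"
)

# The guarded version simply returns the empty table.
guarded <- drop_last_row(empty)
```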
Hello,
Retrieving data from GEO in the form of getGEO("GSExxxx") seems to produce errors that didn't occur before. Perhaps this is due to NCBI's recent move from http to https?
>GSE <- getGEO("GSE3105")
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE3nnn/GSE3105/matrix/
Found 1 file(s)
GSE3105_series_matrix.txt.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 251k 100 251k 0 0 70762 0 0:00:03 0:00:03 --:--:-- 70753
File stored at:
/tmp/RtmpnbTgfV/GPL199.soft
Error in read.table(con, sep = "\t", header = FALSE, nrows = nseries) :
invalid 'nlines' argument
Is this a known issue?
Yair
Reported by Alex Abbas via email:
> getGEO("GSE83452")
Found 1 file(s)
GSE83452_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE83nnn/GSE83452/matrix/GSE83452_series_matrix.txt.gz'
Content type 'application/x-gzip' length 40807397 bytes (38.9 MB)
==================================================
downloaded 38.9 MB
Parsed with column specification:
cols(
.default = col_double(),
ID_REF = col_integer()
)
See spec(...) for full column specifications.
|====================================================================================================================================| 100% 93 MB
File stored at:
/var/folders/wf/65nq9x_d4xj9ttpbz7shlkgc0000gn/T//RtmpbPhopr/GPL16686.soft
Warning: 190 parsing failures.
# A tibble: 5 x 5
    row col   expected   actual         file
  <int> <chr> <chr>      <chr>          <chr>
1 53792 ID    an integer AFFX-BioB-3_at literal data
2 53793 ID    an integer AFFX-BioB-3_st literal data
3 53794 ID    an integer AFFX-BioB-5_at literal data
4 53795 ID    an integer AFFX-BioB-5_st literal data
5 53796 ID    an integer AFFX-BioB-M_at literal data
...
See problems(...) for more details.
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning messages:
1: attributes are not identical across measure variables;
they will be dropped
2: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 1)
3: non-unique value when setting 'row.names': ‘NA’
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GEOquery_2.46.3 bindrcpp_0.2 Biobase_2.38.0 BiocGenerics_0.24.0 ggplot2_2.2.1 tidyr_0.7.2 dplyr_0.7.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 xml2_1.1.1 bindr_0.1 magrittr_1.5 hms_0.3 tidyselect_0.2.3 munsell_0.4.3 colorspace_1.3-2
[9] R6_2.2.2 rlang_0.1.4 httr_1.3.1 plyr_1.8.4 tools_3.4.2 grid_3.4.2 gtable_0.2.0 lazyeval_0.2.1
[17] assertthat_0.2.0 tibble_1.3.4 readr_1.1.1 purrr_0.2.4 bitops_1.0-6 curl_3.0 RCurl_1.95-4.8 glue_1.2.0
[25] labeling_0.3 stringi_1.1.5 compiler_3.4.2 scales_0.5.0 XML_3.98-1.9 pkgconfig_2.0.1
Here is the output from appveyor. The same code has line numbers that match up on Linux/Mac and with the actual counts from GEO. This appears to affect only GPL and GSMs.
R version 3.4.2 Patched (2017-11-06 r73690) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(testthat)
> library(GEOquery)
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colMeans, colSums, colnames, do.call,
duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
lapply, lengths, mapply, match, mget, order, paste, pmax, pmax.int,
pmin, pmin.int, rank, rbind, rowMeans, rowSums, rownames, sapply,
setdiff, sort, table, tapply, union, unique, unsplit, which,
which.max, which.min
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
>
> test_check("GEOquery")
Attaching package: 'limma'
The following object is masked from 'package:BiocGenerics':
plotMA
1. Failure: generic GPL parsing works as expected (@test_GPL.R#9) --------------
nrow(Table(gpl)) not equivalent to 22283.
1/1 mismatches
[1] 22284 - 22283 == 1
2. Failure: quoted GPL works (@test_GPL.R#19) ----------------------------------
45220 not equivalent to nrow(Table(gpl)).
1/1 mismatches
[1] 45220 - 45221 == -1
3. Failure: short GPL works (@test_GPL.R#26) -----------------------------------
52 not equivalent to nrow(Table(gpl)).
1/1 mismatches
[1] 52 - 53 == -1
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11413/matrix/GSE11413_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3997 bytes
==================================================
downloaded 3997 bytes
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE35nnn/GSE35683/matrix/GSE35683_series_matrix.txt.gz'
Content type 'application/x-gzip' length 5733793 bytes (5.5 MB)
==================================================
downloaded 5.5 MB
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11595/matrix/GSE11595-GPL3906_series_matrix.txt.gz'
Content type 'application/x-gzip' length 59663 bytes (58 KB)
==================================================
downloaded 58 KB
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11595/matrix/GSE11595-GPL4348_series_matrix.txt.gz'
Content type 'application/x-gzip' length 1528930 bytes (1.5 MB)
==================================================
downloaded 1.5 MB
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34145/matrix/GSE34145-GPL15796_series_matrix.txt.gz'
Content type 'application/x-gzip' length 6643 bytes
==================================================
downloaded 6643 bytes
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34145/matrix/GSE34145-GPL6102_series_matrix.txt.gz'
Content type 'application/x-gzip' length 1428433 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
4. Failure: basic GSM works (@test_GSM.R#12) -----------------------------------
nrow(Table(gsm)) not equivalent to 22283.
1/1 mismatches
[1] 22284 - 22283 == 1
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2553/matrix/GSE2553_series_matrix.txt.gz'
Content type 'application/x-gzip' length 8480960 bytes (8.1 MB)
==================================================
downloaded 8.1 MB
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1000/suppl//GSE1000_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 35307520 bytes (33.7 MB)
==================================================
downloaded 33.7 MB
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM15nnn/GSM15789/suppl//GSM15789.cel.gz?tool=geoquery'
Content type 'application/x-gzip' length 3507725 bytes (3.3 MB)
==================================================
downloaded 3.3 MB
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM15nnn/GSM15789/suppl//GSM15789.cel.gz?tool=geoquery'
Content type 'application/x-gzip' length 3507725 bytes (3.3 MB)
==================================================
downloaded 3.3 MB
testthat results ================================================================
OK: 181 SKIPPED: 0 FAILED: 4
1. Failure: generic GPL parsing works as expected (@test_GPL.R#9)
2. Failure: quoted GPL works (@test_GPL.R#19)
3. Failure: short GPL works (@test_GPL.R#26)
4. Failure: basic GSM works (@test_GSM.R#12)
Error: testthat unit tests failed
Execution halted
Some machines do not have curl installed....
In the vignette, in the section Converting a GSE to an expressionset, you mention that all the GSM arrays are from the GPL5 platform. However if I run the code on my computer, I get a list of GPL96 and GPL97 arrays:
> gse <- getGEO(filename=system.file("extdata/GSE781_family.soft.gz",package="GEOquery"))
> gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform})
> gsmplatforms
$GSM11805
[1] "GPL96"
$GSM11810
[1] "GPL97"
$GSM11814
[1] "GPL96"
This is also evident in the bioconductor page of GEOquery: http://www.bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html#converting-gse-to-an-expressionset
Maybe this file has been changed since the last documentation update?
The "full_soft" format contains both GDS expression and GPL annotation in the same file. Currently, GEOquery doesn't handle this format, but a user downloaded such a file from the website and was confused that GEOquery did not correctly parse it.
Hi there and Sean,
The problem is the extra 'GPL' meta line in the full soft file, it will
confuse the built-in file parser. Maybe this could be considered a bug,
since for each GEO series (not super series) there should be only one
GPL.ID associated with it. The meta data returned after parsing the soft
file should contain only a length one GPL.ID for the platform. (CCing
Sean for a possible fix)
You should be able to proceed with the first line changed as follows,
Considering that GEOquery is more of an online toolset that lets you download
GEO data on the go, it makes less sense to pre-download the SOFT file
manually.
Best,
Dan
On Sun, 2013-11-24 at 20:21 +0800, 水静流深 wrote:
Hi, I am new to Bioconductor. When I run the following commands, I get this error:
Error in gzfile(fname, open = "rt") : invalid 'description' argument
What is the matter?
library(Biobase)
library(GEOquery)
gds4577 <- getGEO(filename='c:/test/GDS4577_full.soft.gz')
eset <- GDS2eSet(gds4577, do.log2=TRUE)
> eset <- GDS2eSet(gds4577, do.log2=TRUE)
File stored at:
C:\DOCUME1\sanya\LOCALS1\Temp\RtmpQtuak0/GPL1261.annot.gz
C:\DOCUME1\sanya\LOCALS1\Temp\RtmpQtuak0/GPL1261.annot.gz
Error in gzfile(fname, open = "rt") : invalid 'description' argument
In addition: Warning messages:
1: In if (GSEMatrix & geotype == "GSE") { :
the condition has length > 1 and only the first element will be used
2: In if (geotype == "GDS") { :
the condition has length > 1 and only the first element will be used
3: In if (geotype == "GSE" & amount == "full") { :
the condition has length > 1 and only the first element will be used
4: In if (geotype == "GSE" & amount != "full" & amount != "table") { :
the condition has length > 1 and only the first element will be used
5: In if (geotype == "GPL") { :
the condition has length > 1 and only the first element will be used
6: In if (!file.exists(destfile)) { :
the condition has length > 1 and only the first element will be used
7: In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
only first element of 'url' argument used
8: In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
only first element of 'destfile' argument used
> eset
Error: object 'eset' not found
What is the matter with my computer?
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Chinese_People's Republic of China.936 LC_CTYPE=Chinese_People's Republic of China.936
[3] LC_MONETARY=Chinese_People's Republic of China.936 LC_NUMERIC=C
[5] LC_TIME=Chinese_People's Republic of China.936
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] GEOquery_2.28.0 BiocInstaller_1.12.0 affy_1.40.0 Biobase_2.22.0 BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] affyio_1.30.0 preprocessCore_1.24.0 RCurl_1.95-4.1 tools_3.0.2
Would it be possible to add an argument to getGEO, e.g., GPLdata = TRUE, or define a special meaning for an NA value in AnnotGPL, that indicates if GPL feature data should be downloaded at all?
Example:
> eset <- getGEO('GSE12345')
> head(fData(eset)[, 1:5])
ID GB_ACC SPOT_ID Species Scientific Name Annotation Date
1007_s_at 1007_s_at U48705 <NA> Homo sapiens Jun 9, 2011
1053_at 1053_at M87338 <NA> Homo sapiens Jun 9, 2011
117_at 117_at X51757 <NA> Homo sapiens Jun 9, 2011
121_at 121_at X69699 <NA> Homo sapiens Jun 9, 2011
1255_g_at 1255_g_at L36861 <NA> Homo sapiens Jun 9, 2011
1294_at 1294_at L13852 <NA> Homo sapiens Jun 9, 2011
>
> # while no feature data is fetched with:
> eset <- getGEO('GSE12345', GPLdata = FALSE)
> # or
> # eset <- getGEO('GSE12345', AnnotGPL = NA)
> featureData(eset)
An object of class 'AnnotatedDataFrame': none
> fData(eset)
data frame with 0 columns and 54675 rows
Note that one needs to be careful when setting an empty feature data AnnotatedDataFrame object: rownames must still correspond to featureNames, otherwise the object does not pass validation:
> featureData(eset) <- AnnotatedDataFrame(data.frame(row.names = featureNames(eset)))
> validObject(eset)
[1] TRUE
> featureData(eset) <- AnnotatedDataFrame()
> validObject(eset)
Error in validObject(eset) :
invalid class “ExpressionMix” object: 1: feature numbers differ between assayData and featureData
invalid class “ExpressionMix” object: 2: featureNames differ between assayData and featureData
Thanks!
Hello,
I am using the getGEO function with GEOquery 2.34.0 and R version 3.2.1.
getGEO(filename="gsexx_series_matrix.txt.gz",getGPL=F) does not behave correctly: it downloads the gplxx.soft data from NCBI as if getGPL=T had been set, but I do not want to download the GPL data.
I tried getGEO with getGPL=F on several datasets, but the function always downloads the GPL data. Please help.
See #16 for details.
Here are some links to support this:
http://www.ncbi.nlm.nih.gov/geo/info/geo_paccess.html
http://cran.r-project.org/web/packages/rentrez/index.html
https://github.com/ropensci/rentrez
Hi Sean,
I think there is a typo in the GEO ftp site url when you download supplement files for a GPL with the function getGEOSuppFiles: platform should be replaced by platforms (line 37 of the file getGEOSuppFiles.R):
url <- sprintf("ftp://ftp.ncbi.nlm.nih.gov/geo/platform/%s/%s/suppl/",stub,GEO)
url <- sprintf("ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/%s/%s/suppl/",stub,GEO)
thanks
Slim Fourati
Sr Research Associate
Department of Pathology, CWRU
Reported by Mike Love and Mike Smith who noted that the GSEMatrix file parsing seemed to not end. GEOquery enters an infinite loop in findFirstEntity.
This code from http://bioconductor.org/packages/3.6/data/experiment/vignettes/airway/inst/doc/airway.html produced the bug:
suppressPackageStartupMessages( library( "GEOquery" ) )
suppressPackageStartupMessages( library( "airway" ) )
dir <- system.file("extdata",package="airway")
geofile <- file.path(dir, "GSE52778_series_matrix.txt")
gse <- getGEO(filename=geofile)
Currently, specifying filename in getGEO will result in a rather cryptic set of error messages when the file does not exist. Fix will mean checking for file existence and then reporting better what the problem is.
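A check of this kind could look roughly like the sketch below. `check_geo_file` is a hypothetical helper name, and the wording of the messages is only an assumption about what a clearer error might say.

```r
# Hypothetical pre-flight check for a user-supplied file name:
# fail early with a readable message instead of a parser error.
check_geo_file <- function(filename) {
  if (!is.character(filename) || length(filename) != 1) {
    stop("'filename' must be a single character string", call. = FALSE)
  }
  if (!file.exists(filename)) {
    stop(sprintf("File not found: '%s'", filename), call. = FALSE)
  }
  invisible(normalizePath(filename))
}

# A missing file now produces one clear message.
missing_msg <- tryCatch(
  check_geo_file(file.path(tempdir(), "does_not_exist.soft")),
  error = function(e) conditionMessage(e)
)
```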
This is the error (below). The directory listing is finding both the parent directory and the actual file. We rely on finding only the file.
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2553/matrix/
OK
Found 2 file(s)
/geo/series/GSE2nnn/GSE2553/
Warning in download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s", :
URL https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2553/matrix//geo/series/GSE2nnn/GSE2553/: cannot open destfile '/tmp/RtmpLJwK5g//geo/series/GSE2nnn/GSE2553/', reason 'No such file or directory'
Warning in download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s", :
download had nonzero exit status
Quitting from lines 135-139 (GEOquery.Rmd)
Error: processing vignette 'GEOquery.Rmd' failed with diagnostics:
'/tmp/RtmpLJwK5g//geo/series/GSE2nnn/GSE2553/' does not exist.
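One way to avoid depending on the listing containing only the file is to filter the entries by name before downloading. This is a sketch only: the `listing` vector imitates the output above, and the file name pattern is an assumption about how series matrix files are named.

```r
# Entries as they might come back from the directory listing: the
# parent directory link plus the actual series matrix file.
listing <- c(
  "/geo/series/GSE2nnn/GSE2553/",
  "GSE2553_series_matrix.txt.gz"
)

# Keep only entries that look like series matrix files (assumed
# naming pattern) and reduce them to bare file names.
is_matrix <- grepl("_series_matrix\\.txt\\.gz$", listing)
matrix_files <- basename(listing[is_matrix])
```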
Hello,
I'm experiencing frequent connection issues with getGEO(). This is an issue on multiple computers, all with different network connections from different locations.
Attempt 1:
> esetlist <- getGEO("gse94802")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix/
OK
Found 2 file(s)
/geo/series/GSE94nnn/GSE94802/
downloaded 0 bytes
Error in download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s", :
cannot download all files
In addition: Warning message:
In download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s", :
URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix//geo/series/GSE94nnn/GSE94802/': status was '404 Not Found'
Attempt 2 (5 seconds later):
> esetlist <- getGEO("gse94802")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix/
OK
Found 1 file(s)
GSE94802_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix/GSE94802_series_matrix.txt.gz'
Content type 'unknown' length 3277 bytes
==================================================
downloaded 3277 bytes
File stored at:
/tmp/RtmpYLjH8P/GPL17021.soft
Is there a way to have getGEO() reattempt the connection several times (perhaps specified via an argument) before throwing an error? This is causing my reports to crash unexpectedly, disrupting reproducibility. Alternatively, I could try wrapping getGEO() in a try-catch block to achieve this suggested functionality.
Thanks for your help!
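The try-catch idea mentioned above could be sketched as a small generic wrapper. `with_retries` and its `times` and `wait` arguments are hypothetical names, and the `flaky` function merely simulates an intermittent failure so the example runs without network access.

```r
# Hypothetical retry wrapper: call fn() up to `times` times, waiting
# `wait` seconds between attempts, and return the first success.
with_retries <- function(fn, times = 3, wait = 0) {
  last_error <- NULL
  for (attempt in seq_len(times)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (attempt < times) Sys.sleep(wait)
  }
  stop("all ", times, " attempts failed; last error: ",
       conditionMessage(last_error), call. = FALSE)
}

# Stand-in for a flaky download: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network failure")
  "ok"
}

result <- with_retries(flaky, times = 5)
```

With GEOquery loaded, the same wrapper could presumably be used as `with_retries(function() getGEO("GSE94802"), times = 5, wait = 2)`.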
Good day,
I found an issue in the way that getGEO parses the clinical data. Consider
ovaCancer <- getGEO("GSE53963")[[1]]
pData(ovaCancer)[1:6, 20:25]
characteristics_ch2.1 characteristics_ch2.2 characteristics_ch2.3 characteristics_ch2.4 characteristics_ch2.5 characteristics_ch2.6
GSM1304246 morphology: Serous stage_#: 2 Stage: II substage: B grade: 4 debulking: Optimal
GSM1304247 morphology: Serous stage_#: 3 Stage: III substage: C grade: 4 debulking: Optimal
GSM1304248 morphology: Serous stage_#: 4 Stage: IV grade: 3 debulking: Sub-optimal time_fu_months: 1.12
GSM1304249 morphology: Serous stage_#: 3 Stage: III substage: C grade: 3 debulking: Optimal
GSM1304250 morphology: Serous stage_#: 3 Stage: III substage: C grade: 4 debulking: Optimal
GSM1304251 morphology: Serous stage_#: 4 Stage: IV grade: 3 debulking: Sub-optimal time_fu_months: 47.20
Because stage 4 samples have no substage data, the values after the stage column are shifted one column too far to the left. Is it possible to handle this case automatically?
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
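Aligning by key rather than by position would avoid the shift. The sketch below does this in base R for a list of "key: value" strings per sample; `parse_characteristics` is a hypothetical helper and the input is a trimmed-down version of the example above, not GEOquery's internal representation.

```r
# Hypothetical helper: turn per-sample "key: value" strings into a
# table with one column per key, filling missing keys with NA.
parse_characteristics <- function(x) {
  split_kv <- function(entries) {
    entries <- entries[nzchar(entries)]
    keys <- trimws(sub(":.*$", "", entries))      # text before first colon
    vals <- trimws(sub("^[^:]*:", "", entries))   # text after first colon
    setNames(vals, keys)
  }
  parsed <- lapply(x, split_kv)
  all_keys <- unique(unlist(lapply(parsed, names)))
  # Look up every key for every sample; absent keys become NA.
  out <- do.call(rbind, lapply(parsed, function(p) unname(p[all_keys])))
  colnames(out) <- all_keys
  as.data.frame(out, stringsAsFactors = FALSE)
}

samples <- list(
  GSM1304246 = c("Stage: II", "substage: B", "grade: 4"),
  GSM1304248 = c("Stage: IV", "grade: 3")
)
aligned <- parse_characteristics(samples)
```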
From Lori:
Error : object ‘gunzip’ is not exported by 'namespace:GEOquery'
From Lori:
AnnotationHubData and ExperimentHubData are failing with
Error : object ‘getGEOSuppFiles’ is not exported by 'namespace:GEOquery'
I looked at AnnotationHubData and while it is imported in the NAMESPACE I don't think it's currently used anymore within the code, so I can just remove this.
But I also wanted to double check that this was intentional, since GEOquery still has the man file for the functions but they are no longer exported?
From [email protected]:
I get this error with GSE10, GSE17, or more recent files such as GSE11956 or GSE10442 (these files are from GPL4), but not with GSE2253 (GPL339). I opened some of the files I can't download correctly, and they have different line counts. Maybe the problem comes from the GPL4 technology or something like that? It's a SAGE NlaIII platform, but I tried another SAGE NlaIII platform, GPL8251, and succeeded in downloading the GSE file correctly.
Hi Sean,
I have always been experiencing issues with the GEOquery package, when running behind an annoying university proxy, which I "bypass" using cntlm.
Forgetting about the technical details, the issue is that the GEO URL that is fetched by getAndParseGSEMatrices gets scrambled into an HTML page somewhere in between nih.gov and the R session (probably by the proxy or cntlm), this causes getGEO to throw the following typical error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 6 elements
For example:
getURL('ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12008/matrix/')
[1] "\r\n<meta http-equiv="Content-Type" content="text-html; charset=UTF-8">\r\n\r\n<TITLE>FTP directory /geo/series/GSE12nnn/GSE12008/matrix/ at ftp.ncbi.nlm.nih.gov. </TITLE>\r\n\r\n\r\nFTP directory /geo/series/GSE12nnn/GSE12008/matrix/ at ftp.ncbi.nlm.nih.gov.
\r\n
\r\n\r\n <DIR> <A HREF="..">..\r\n
09/18/13 09:09AM [GMT] 719,285 <A HREF="/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL4134_series_matrix.txt.gz">GSE12008-GPL4134_series_matrix.txt.gz\r\n09/18/13 09:09AM [GMT] 557,250 <A HREF="/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL6244_series_matrix.txt.gz">GSE12008-GPL6244_series_matrix.txt.gz\r\n
\r\n\r\n\r\n"
I have a fix for this (see below), which would be really nice if applied to GEOquery, so that I can work with GEO datasets :D
The idea is that instead of always scanning for a txt table format, we first check if the result is an HTML document, and, if so, parse for the href strings that point to the matrix files to download.
Here I provide two alternative ways of processing the HTML content, one one-liner using the stringr package, or more lines if you don't want to depend on stringr.
The message should probably go away or disabled if not in verbose mode.
Please let me know if you consider this fix can be applied any time soon.
Thank you.
Bests,
Renaud
gdsurl <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/"
a <- getURL(sprintf(gdsurl, stub, GEO))
if( grepl("^<HTML", a) ){ # process HTML content
message("# Processing HTML result page (behind a proxy?) ... ", appendLF=FALSE)
b <- stringr::str_match_all(a, "((href)|(HREF))=\\s*[\"']/[^\"']+/([^/]+)[\"']")[[1]]
# or, alternatively, without stringr
sa <- gsub('HREF', 'href', a, fixed = TRUE)# just not to depend on case change
sa <- strsplit(sa, 'href', fixed = TRUE)[[1L]]
pattern <- "^=\\s*[\"']/[^\"']+/([^/]+)[\"'].*"
b <- as.matrix(gsub(pattern, "\\1", sa[grepl(pattern, sa)]))
#
message('OK')
}else{ # standard processing of txt content
tmpcon <- textConnection(a, "r")
b <- read.table(tmpcon)
close(tmpcon)
}
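For reference, the href-based approach can be tried in isolation on a small HTML fragment. This is a base R sketch: `extract_matrix_files` is a hypothetical helper, the `html` string is a simplified imitation of the proxy output quoted above, and the file name filter is an assumption.

```r
# Simplified imitation of the HTML directory page returned via a proxy.
html <- paste0(
  '<A HREF="..">..</A>\r\n',
  '<A HREF="/geo/series/GSE12nnn/GSE12008/matrix/',
  'GSE12008-GPL4134_series_matrix.txt.gz">file 1</A>\r\n',
  '<A HREF="/geo/series/GSE12nnn/GSE12008/matrix/',
  'GSE12008-GPL6244_series_matrix.txt.gz">file 2</A>\r\n'
)

# Hypothetical helper: collect href targets and keep the entries
# that look like series matrix files.
extract_matrix_files <- function(html) {
  m <- gregexpr("href\\s*=\\s*[\"'][^\"']+[\"']", html, ignore.case = TRUE)
  hrefs <- regmatches(html, m)[[1]]
  # Strip the leading href= and the surrounding quotes.
  targets <- gsub("^href\\s*=\\s*[\"']|[\"']$", "", hrefs, ignore.case = TRUE)
  files <- basename(targets)
  files[grepl("_series_matrix\\.txt\\.gz$", files)]
}

files <- extract_matrix_files(html)
```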
Dear developers of the GEOquery package,
gds <- getGEO("GDS3666")
eset <- GDS2eSet(gds)
results in the following error:
Error in value[[3L]](cond) : row names contain missing values
AnnotatedDataFrame 'initialize' could not update varMetadata:
perhaps pData and varMetadata are inconsistent?
I found out that this is due to a single NA in the ID list and that this
could be easily fixed by excluding NA's:
check.na <- is.na(gds@dataTable@table$ID_REF)
has.na <- any(check.na)
gpl <- NULL
if(has.na) {
ind <- !check.na
gds@dataTable@table <- gds@dataTable@table[ind,]
gpl <- getGEO(Meta(gds)$platform, AnnotGPL=TRUE)
gpl@dataTable@table <- gpl@dataTable@table[ind,]
}
eset <- GDS2eSet(gds, GPL=gpl)
I wonder whether you want to include something similar in the GDS2eSet
function in order to avoid these issues with NAs.
Best,
Ludwig
The refactored code for parsing GSE matrices seems to fail in some cases e.g. GSE5350 which we use in the BeadArrayUseCases vignette.
A straightforward path to producing the error is to try calling parseGSEMatrix on the GEO series text file directly:
GEOquery:::parseGSEMatrix(fname = "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5350/matrix/GSE5350-GPL2507_series_matrix.txt.gz")
Error in enc2utf8(col_names(col_labels, sep = sep)) :
argument is not a character vector
Some rudimentary digging leads me to think this is an issue with how the Sample_characteristics_ch1 field is being processed, but I don't know enough about how that may look across multiple datasets to offer a generic solution.
Happy to do some more experimentation if needed.
GSM1062236 is not public, leading to findFirstEntity failing (infinite loop). Need to catch these cases.
Hi Sean,
I'm still seeing some timeouts with GEOquery 2.46.10 on bioc-release.
Here's a quick example:
library('GEOquery')
getGEO('GSM1062236', getGPL = FALSE)
I found it from
https://github.com/leekgroup/recount/blob/master/tests/testthat/test-misc.R#L19
Best,
Leo
Reported by Amy Tang:
gse2 = getGEO('GSE99479')
Found 2 file(s)
GSE99479-GPL11154_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE99nnn/GSE99479/matrix/GSE99479-GPL11154_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3751 bytes
==================================================
downloaded 3751 bytes
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 21 did not have 31 elements
>
The GSE in question has no data table:
....
!series_matrix_table_begin
"ID_REF" "GSM2644615" "GSM2644616" "GSM2644617" "GSM2644618" "GSM2644619" "GSM2644620" "GSM2644621" "GSM2644622" "GSM2644623" "GSM2644624" "GSM2644625" "GSM2644626" "GSM2644627" "GSM2644628" "GSM2644629" "GSM2644630" "GSM2644631" "GSM2644632" "GSM2644633" "GSM2644634" "GSM2644635" "GSM2644636" "GSM2644637" "GSM2644638" "GSM2644639" "GSM2644640" "GSM2644641" "GSM2644642" "GSM2644643" "GSM2644644"
!series_matrix_table_end
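A parser could detect this case before attempting to read the table by counting the lines between the two markers. The sketch below assumes the file has already been read into a character vector; `count_table_rows` is a hypothetical helper and the `!Series_title` line is invented for illustration.

```r
# Hypothetical helper: count data rows between the table markers of a
# series matrix file, excluding the header line. Returns NA when the
# markers are missing or out of order.
count_table_rows <- function(lines) {
  begin <- which(lines == "!series_matrix_table_begin")
  end <- which(lines == "!series_matrix_table_end")
  if (length(begin) != 1 || length(end) != 1 || end <= begin) {
    return(NA_integer_)
  }
  # Lines strictly between the markers, minus one header line.
  max(end - begin - 2L, 0L)
}

# A file with a header but no data rows, as in the report above.
empty_matrix <- c(
  '!Series_title\t"example"',
  "!series_matrix_table_begin",
  '"ID_REF"\t"GSM2644615"\t"GSM2644616"',
  "!series_matrix_table_end"
)
n_rows <- count_table_rows(empty_matrix)
```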
Apparently, NCBI GEO changed the number of supplementary files for GSE1000, causing a test failure:
1 Test Suite :
GEOquery RUnit Tests - 14 test functions, 0 errors, 1 failure
FAILURE in testSuppFileSupport: Error in checkEquals(2, nrow(fres)) : Mean relative difference: 0.5
Test files with failing tests
test_SuppFileSupport.R
testSuppFileSupport
E.g.:
$ GEOquery::getGEO('GSE23397', parseCharacteristics = FALSE)
Traceback:
1. GEOquery::getGEO("GSE23397", parseCharacteristics = FALSE)
2. getAndParseGSEMatrices(GEO, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL,
. parseCharacteristics = parseCharacteristics)
3. parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL,
. getGPL = getGPL)
4. dplyr::mutate(pd, characteristics = ifelse(grepl("_ch2", characteristics),
. "ch2", "ch1")) %>% tidyr::separate(kvpair, into = c("k",
. "v"), sep = ":", fill = "right") %>% dplyr::mutate(k = paste(k,
. characteristics, sep = ":")) %>% dplyr::select(-characteristics) %>%
. tidyr::spread(k, v)
5. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. eval(quote(`_fseq`(`_lhs`)), env, env)
8. `_fseq`(`_lhs`)
9. freduce(value, `_function_list`)
10. withVisible(function_list[[k]](value))
11. function_list[[k]](value)
12. tidyr::spread(., k, v)
13. spread.data.frame(., k, v)
14. abort(glue("Duplicate identifiers for rows {rows}"))
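tidyr::spread() raises this error when the same combination of row identifier and key occurs more than once. A possible workaround, sketched here in base R on made-up data, is to number repeated keys within each sample before reshaping. The column names `sample`, `k`, and `v` mirror the traceback, but the data frame and the `k_unique` column are assumptions for illustration, not GEOquery's code.

```r
# Long-format characteristics: sample "s1" has the key "tissue" twice,
# the situation in which tidyr::spread() reports duplicate identifiers.
pd <- data.frame(
  sample = c("s1", "s1", "s1", "s2"),
  k      = c("tissue", "tissue", "age", "tissue"),
  v      = c("liver", "kidney", "12", "liver"),
  stringsAsFactors = FALSE
)

# Number repeated keys within each sample: tissue, tissue.1, ...
occurrence <- ave(seq_along(pd$k), pd$sample, pd$k, FUN = seq_along)
pd$k_unique <- ifelse(occurrence == 1L, pd$k,
                      paste(pd$k, occurrence - 1L, sep = "."))

# Every (sample, key) pair is now unique, so a long-to-wide reshape is well defined.
wide <- reshape(pd[c("sample", "k_unique", "v")],
                idvar = "sample", timevar = "k_unique", direction = "wide")
```

Samples that lack a given key simply get NA in the corresponding column.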
Hi Sean,
How does GEOquery's getGEO() parse a "quick" SOFT file? I get the error below. Could you help?
getGEOfile("GSE10", destdir = "/Users/cwu1", AnnotGPL = FALSE, amount = "quick")
getGEO(filename="/Users/cwu1/GSE10.soft")
Parsing....
Error in readLines(con, lineCounts[1, 1] - 1) : invalid 'n' argument
Thanks,
Canglin Wu
Research Associate
Department of Biostatistics and Computational Biology
School of Medicine and Dentistry
University of Rochester
601 Elmwood Avenue, Box 630
Rochester, New York 14642
When I try to load a (manually downloaded) GSE matrix file by
getGEO(filename="GSEXXXX_series_matrix.txt", destdir="./")
the destdir option is not passed to parseGSEMatrix() through parseGEO().
Therefore, GPL file is not searched in the destdir, but downloaded
into a temporary folder every time R is restarted.
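The fix amounts to forwarding the argument explicitly at each level. The toy functions below only illustrate that pattern; `parse_geo_sketch` and `parse_gse_matrix_sketch` are hypothetical stand-ins, and the real signatures of parseGEO() and parseGSEMatrix() may differ.

```r
# Toy stand-ins for the real functions; all names here are hypothetical.
parse_gse_matrix_sketch <- function(fname, destdir = tempdir()) {
  # The real parser would look for (or download) the GPL file in destdir.
  list(fname = fname, gpl_dir = destdir)
}

parse_geo_sketch <- function(fname, destdir = tempdir()) {
  # Forward destdir explicitly so the callee does not fall back to tempdir().
  parse_gse_matrix_sketch(fname, destdir = destdir)
}

res <- parse_geo_sketch("GSEXXXX_series_matrix.txt", destdir = "./")
res$gpl_dir
```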
Hello,
Would it be possible to enable the getGEO function to import a dataset that is already hosted on GEO but will only be made public at a later date? It would be nice to submit an R Markdown analysis to a journal that readers could easily reproduce without changing any file paths, while keeping the dataset private until the article is published. Reviewers would also benefit from the convenience.
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
Hi,
In the current implementation, GEOquery already looks for a local version of a GPL file to avoid downloading it again. But when reading multiple GEO files associated with a single GPL soft file, it parses and loads into memory this same file multiple times, resulting in many identical copies of an AnnotatedDataFrame that represents the Platform! Below is a reproducible example:
R > gse32982 <- getGEO('GSE32982')
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE32nnn/GSE32982/matrix/
Found 1 file(s)
GSE32982_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE32nnn/GSE32982/matrix/GSE32982_series_matrix.txt.gz'
ftp data connection made, file length 959302 bytes
==================================================
downloaded 936 KB
File stored at:
/tmp/Rtmpce6W06/GPL570.soft
R > gse45016 <- getGEO('GSE45016')
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45016/matrix/
Found 1 file(s)
GSE45016_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45016/matrix/GSE45016_series_matrix.txt.gz'
ftp data connection made, file length 1486395 bytes
==================================================
downloaded 1.4 MB
Using locally cached version of GPL570 found here:
/tmp/Rtmpce6W06/GPL570.soft
R > gpl570_1 <- featureData(gse32982[[1]])
R > gpl570_2 <- featureData(gse45016[[1]])
R > identical(gpl570_1, gpl570_2)
[1] TRUE
R > data.table::address(gpl570_1)
[1] "0x10d32c78"
R > data.table::address(gpl570_2)
[1] "0x78835a0"
This wastes time and memory; caching the object parsed from a specific file for the duration of the session shouldn't be hard to do.
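A session-level cache of this kind could be sketched with an environment keyed by file path. Everything below is hypothetical (`.gpl_cache`, `cached_parse`, and the stand-in parser); it is not how GEOquery is implemented, and a real version would probably also need to invalidate entries when the file changes.

```r
# Session-level cache keyed by normalized file path; names are hypothetical.
.gpl_cache <- new.env(parent = emptyenv())

cached_parse <- function(path, parser) {
  key <- normalizePath(path, mustWork = FALSE)
  if (!exists(key, envir = .gpl_cache, inherits = FALSE)) {
    # First request for this file in the session: parse and remember the result.
    assign(key, parser(path), envir = .gpl_cache)
  }
  get(key, envir = .gpl_cache, inherits = FALSE)
}

# Stand-in parser that counts how often it is actually run.
n_parses <- 0L
fake_parser <- function(path) {
  n_parses <<- n_parses + 1L
  data.frame(ID = c("1007_s_at", "1053_at"))
}

gpl_path <- file.path(tempdir(), "GPL570.soft")
a <- cached_parse(gpl_path, fake_parser)
b <- cached_parse(gpl_path, fake_parser)
n_parses  # the parser ran only for the first request
```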
Hi Sean,
I received the green light to share GEOquery applog usage numbers
with you. So, as previously discussed, please go ahead and insert an extra
parameter into URLs.
Please use the parameter:
tool=geoquery
e.g.,
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96&format=sort&tool=geoquery
Once this is in place, we will be able to track usage and report numbers
to you.
Let me know if you have any questions or concerns.
Kind regards,
Tanya
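Appending the parameter could be done in one place where query URLs are built. This is a minimal sketch with a hypothetical helper name, `add_tool_param`; it does not reflect where or how GEOquery actually constructs its URLs.

```r
# Hypothetical helper: append tool=geoquery to a GEO query URL
# unless a tool parameter is already present.
add_tool_param <- function(url, tool = "geoquery") {
  if (grepl("[?&]tool=", url)) {
    return(url)
  }
  # Use "&" when the URL already has a query string, "?" otherwise.
  sep <- if (grepl("?", url, fixed = TRUE)) "&" else "?"
  paste0(url, sep, "tool=", tool)
}

tagged <- add_tool_param("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96")
```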
Hi Sean,
calling the following generates an error:
getGEOSuppFiles('GSE12345', baseDir = tempdir())
this is due to this piece of code, which creates the directory in the current working directory
if(makeDirectory) {
    suppressWarnings(try(dir.create(GEO),silent=TRUE))
    storedir <- file.path(baseDir,GEO)
}
One could do:
dir.create( storedir <- file.path(baseDir,GEO) )
Thanks
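The suggested change can be tried in isolation. The wrapper below is a hypothetical sketch (`make_store_dir` is not a GEOquery function); it builds the target path first and then creates the directory at that path rather than in the working directory.

```r
# Sketch of the suggested fix; the wrapper name is hypothetical.
make_store_dir <- function(GEO, baseDir = getwd(), makeDirectory = TRUE) {
  storedir <- baseDir
  if (makeDirectory) {
    storedir <- file.path(baseDir, GEO)
    # Create the directory where the files will be stored, not in getwd().
    dir.create(storedir, showWarnings = FALSE, recursive = TRUE)
  }
  storedir
}

storedir <- make_store_dir("GSE12345", baseDir = tempdir())
dir.exists(storedir)
```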
Sorry to bother you again with this very same issue.
The patch solved the problem in getAndParseGSEMatrices, but I also get the proxy problem with getGEOSuppFiles, where the list of tar files is not detected properly. No error thrown here (it is caught in a try()), only a message saying there are no supplemental files (while there are).
In this function I see you have/use a function getDirListing.
It might be the right place to put the patch, so that it can be used consistently across the package wherever needed (including by getAndParseGSEMatrices).
Sorry I did not report this together with the other one. I was just not aware of this function (! :().
Thank you.
Bests,
Renaud
I've been reviewing build failures of certain experimental data packages in Bioconductor 3.6 devel, and I believe I've isolated the issue. The getGEO() function from GEOquery seems to be dropping attributes when reading in a GSE matrix. This creates problems further down in their code. This reproducible example is gleaned from their code:
library( "GEOquery" )
gse <- getGEO("GSE52778")
pdata <- pData(gse)[,grepl("characteristics",names(pData(gse)))]
names(pdata) <- c("treatment","tissue","ercc_mix","cell","celltype")
Notice that a warning is thrown by getGEO:
Warning message:
attributes are not identical across measure variables;
they will be dropped
The user actually downloaded the file GSE52778_series_matrix.txt from GEO, but the result of using the file and accessing it through NCBI appears to be the same.
Here is the sessionInfo():
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] mzR_2.11.11 Rcpp_0.12.13
[3] parathyroidSE_1.15.0 airway_0.111.0
[5] SummarizedExperiment_1.7.10 DelayedArray_0.3.21
[7] matrixStats_0.52.2 GenomicRanges_1.29.15
[9] GenomeInfoDb_1.13.5 bindrcpp_0.2
[11] illuminaHumanv1.db_1.26.0 org.Hs.eg.db_3.4.2
[13] AnnotationDbi_1.39.4 IRanges_2.11.19
[15] S4Vectors_0.15.14 limma_3.33.14
[17] GEOquery_2.45.2 Biobase_2.37.2
[19] BiocGenerics_0.23.3
loaded via a namespace (and not attached):
[1] XVector_0.17.1 BiocInstaller_1.27.5 compiler_3.4.1
[4] bindr_0.1 ProtGenerics_1.9.1 zlibbioc_1.23.0
[7] bitops_1.0-6 tools_3.4.1 digest_0.6.12
[10] bit_1.1-12 lattice_0.20-35 RSQLite_2.0
[13] memoise_1.1.0 tibble_1.3.4 pkgconfig_2.0.1
[16] rlang_0.1.2 Matrix_1.2-11 DBI_0.7
[19] curl_3.0 GenomeInfoDbData_0.99.1 dplyr_0.7.4
[22] httr_1.3.1 xml2_1.1.1 hms_0.3
[25] grid_3.4.1 tidyselect_0.2.2 bit64_0.9-7
[28] glue_1.1.1 R6_2.2.2 XML_3.98-1.9
[31] tidyr_0.7.2 readr_1.1.1 purrr_0.2.3
[34] blob_1.1.0 magrittr_1.5 codetools_0.2-15
[37] assertthat_0.2.0 stringi_1.1.5 RCurl_1.95-4.8
Probably want to just parse the GSE header metadata and then do separate getGEO
calls for GSM and GPLs.