
geoquery's Introduction

Status

R-CMD-check

Installation

To install from Bioconductor, use the following code:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("GEOquery")

To install directly from GitHub:

library(devtools)
install_github('seandavi/GEOquery')

Usage

See the full vignette in rmarkdown or visit Bioconductor for details.

How to contribute

Contributions to GEOquery development can be submitted as a pull request or a feature request issue. We recommend following the Bioconductor coding standards where possible.

geoquery's People

Contributors

assaron, dalloliogm, dereckdemezquita, dtenenba, hpages, jwokaty, kalugny, khughitt, leogama, link-ny, lshep, nturaga, ramiromagno, russhyde, seandavi, vlakam, vobencha, yihui, yunuuuu

geoquery's Issues

Parse GDS full_soft format

The "full_soft" format contains both GDS expression and GPL annotation in the same file. Currently, GEOquery doesn't handle this format, but a user downloaded such a file from the website and was confused that GEOquery did not correctly parse it.


Hi there and Sean,

The problem is the extra 'GPL' meta line in the full soft file, it will
confuse the built-in file parser. Maybe this could be considered a bug,
since for each GEO series (not super series) there should be only one
GPL.ID associated with it. The meta data returned after parsing the soft
file should contain only a length one GPL.ID for the platform. (CCing
Sean for a possible fix)

You should be able to proceed with the first line changed as follows,

  • either retrieve at run time, preferred
    gds4577 <- getGEO(GEO='GDS4577')
  • or download the GDS4577.soft.gz instead of the full one.
    gds4577 <- getGEO(filename='GDS4577.soft.gz')

Considering that GEOquery is more of an online toolset that lets you download
GEO data on the go, it makes less sense to pre-download the SOFT file
manually.

Best,
Dan

On Sun, 2013-11-24 at 20:21 +0800, 水静流深 wrote:

Hi, I am new to Bioconductor. When I run the following command, I get the wrong output:
Error in gzfile(fname, open = "rt") : invalid 'description' argument
what is the matter?

library(Biobase)
library(GEOquery)
gds4577 <- getGEO(filename='c:/test/GDS4577_full.soft.gz')
eset <- GDS2eSet(gds4577, do.log2=TRUE)

> eset <- GDS2eSet(gds4577, do.log2=TRUE)
File stored at:
C:\DOCUME1\sanya\LOCALS1\Temp\RtmpQtuak0/GPL1261.annot.gzC:\DOCUME1\sanya\LOCALS1\Temp\RtmpQtuak0/GPL1261.annot.gz
Error in gzfile(fname, open = "rt") : invalid 'description' argument
In addition: Warning messages:
1: In if (GSEMatrix & geotype == "GSE") { :
  the condition has length > 1 and only the first element will be used
2: In if (geotype == "GDS") { :
  the condition has length > 1 and only the first element will be used
3: In if (geotype == "GSE" & amount == "full") { :
  the condition has length > 1 and only the first element will be used
4: In if (geotype == "GSE" & amount != "full" & amount != "table") { :
  the condition has length > 1 and only the first element will be used
5: In if (geotype == "GPL") { :
  the condition has length > 1 and only the first element will be used
6: In if (!file.exists(destfile)) { :
  the condition has length > 1 and only the first element will be used
7: In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  only first element of 'url' argument used
8: In download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  only first element of 'destfile' argument used
> eset
Error: object 'eset' not found
what is the matter with my computer?

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Chinese_People's Republic of China.936  LC_CTYPE=Chinese_People's Republic of China.936
[3] LC_MONETARY=Chinese_People's Republic of China.936 LC_NUMERIC=C
[5] LC_TIME=Chinese_People's Republic of China.936

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] GEOquery_2.28.0 BiocInstaller_1.12.0 affy_1.40.0 Biobase_2.22.0 BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] affyio_1.30.0 preprocessCore_1.24.0 RCurl_1.95-4.1 tools_3.0.2

findFirstEntity enters infinite loop on GSEMatrix file stored locally

Reported by Mike Love and Mike Smith who noted that the GSEMatrix file parsing seemed to not end. GEOquery enters an infinite loop in findFirstEntity.

This code from http://bioconductor.org/packages/3.6/data/experiment/vignettes/airway/inst/doc/airway.html produced the bug:

suppressPackageStartupMessages( library( "GEOquery" ) )
suppressPackageStartupMessages( library( "airway" ) )
dir <- system.file("extdata",package="airway")
geofile <- file.path(dir, "GSE52778_series_matrix.txt")
gse <- getGEO(filename=geofile)

messy metadata in GSE causes parsing failure

Reported by Amy Tang:

 gse2 = getGEO('GSE99479')
Found 2 file(s)
GSE99479-GPL11154_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE99nnn/GSE99479/matrix/GSE99479-GPL11154_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3751 bytes
==================================================
downloaded 3751 bytes

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 21 did not have 31 elements
> 

The GSE in question has no data table:

....
!series_matrix_table_begin
"ID_REF"	"GSM2644615"	"GSM2644616"	"GSM2644617"	"GSM2644618"	"GSM2644619"	"GSM2644620"	"GSM2644621"	"GSM2644622"	"GSM2644623"	"GSM2644624"	"GSM2644625"	"GSM2644626"	"GSM2644627"	"GSM2644628"	"GSM2644629"	"GSM2644630"	"GSM2644631"	"GSM2644632"	"GSM2644633"	"GSM2644634"	"GSM2644635"	"GSM2644636"	"GSM2644637"	"GSM2644638"	"GSM2644639"	"GSM2644640"	"GSM2644641"	"GSM2644642"	"GSM2644643"	"GSM2644644"
!series_matrix_table_end
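
The excerpt above shows a table block that contains only the header row. A minimal sketch in base R, not GEOquery's actual parser, of detecting that case before handing the lines to a table reader; the function name `series_matrix_data_rows` is hypothetical.

```r
# Count data rows between the table markers of a series matrix file.
# `lines` is a character vector holding the file's lines.
series_matrix_data_rows <- function(lines) {
  begin <- grep("^!series_matrix_table_begin", lines)
  end <- grep("^!series_matrix_table_end", lines)
  if (length(begin) != 1L || length(end) != 1L || end <= begin) {
    stop("table markers not found or malformed")
  }
  # The first line after the begin marker is the header; the rest are data rows.
  max(end - begin - 2L, 0L)
}

example_lines <- c(
  "!Series_title\t\"example\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM2644615\"\t\"GSM2644616\"",
  "!series_matrix_table_end"
)
n_rows <- series_matrix_data_rows(example_lines)
if (n_rows == 0L) {
  message("No data rows: build an empty expression matrix instead of parsing")
}
```

A parser could branch on this count and construct a zero-row matrix with the sample columns from the header, rather than letting the table reader fail.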

Bug: dirlisting is incorrect with changes on NCBI directory listing html

This is the error (below). The directory listing is finding both the parent directory and the actual file. We rely on finding only the file.

https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2553/matrix/
OK
Found 2 file(s)
/geo/series/GSE2nnn/GSE2553/
Warning in download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s",  :
  URL https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2553/matrix//geo/series/GSE2nnn/GSE2553/: cannot open destfile '/tmp/RtmpLJwK5g//geo/series/GSE2nnn/GSE2553/', reason 'No such file or directory'
Warning in download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s",  :
  download had nonzero exit status
Quitting from lines 135-139 (GEOquery.Rmd) 
Error: processing vignette 'GEOquery.Rmd' failed with diagnostics:
'/tmp/RtmpLJwK5g//geo/series/GSE2nnn/GSE2553/' does not exist.
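
One way to make the listing robust against a parent-directory link is to keep only entries that look like series matrix files. The following is a sketch under that assumption, not the package's actual `getDirListing` code; `matrix_files_from_hrefs` is a hypothetical helper name.

```r
# Keep only links that look like series matrix files. Using basename()
# treats absolute hrefs and bare file names the same way.
matrix_files_from_hrefs <- function(hrefs) {
  files <- basename(hrefs)
  unique(files[grepl("_series_matrix\\.txt\\.gz$", files)])
}

hrefs <- c(
  "/geo/series/GSE2nnn/GSE2553/",   # parent directory link
  "GSE2553_series_matrix.txt.gz"    # the file we actually want
)
matrix_files <- matrix_files_from_hrefs(hrefs)
print(matrix_files)
```

Filtering by file-name pattern rather than by position in the listing means the result does not change if NCBI adds or reorders navigation links in the HTML.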

Deal with NA in GDS dataset IDs

Dear developers of the GEOquery package,

gds <- getGEO("GDS3666")
eset <- GDS2eSet(gds)

results in the following error:

Error in value[[3L]](cond) : row names contain missing values
 AnnotatedDataFrame 'initialize' could not update varMetadata:
 perhaps pData and varMetadata are inconsistent?

I found out that this is due to a single NA in the ID list and that this
could be easily fixed by excluding NA's:

check.na <- is.na(gds@dataTable@table$ID_REF)
has.na <- any(check.na)
gpl <- NULL
if(has.na) {
   ind <- !check.na
   gds@dataTable@table <- gds@dataTable@table[ind,]
   gpl <- getGEO(Meta(gds)$platform, AnnotGPL=TRUE)
   gpl@dataTable@table <- gpl@dataTable@table[ind,]
 }
eset <- GDS2eSet(gds, GPL=gpl)

I wonder whether you want to include something similar in the GDS2eSet
function in order to avoid these issues with NAs.

Best,
Ludwig

Access to private and reviewer link datasets

Hello,

Would it be possible to enable the getGEO function to import a dataset which will be made public at a later date but is already hosted on GEO? It would be nice to submit an R Markdown analysis to a journal that readers could easily reproduce without changing any file paths, but keep the dataset private until the article is published. Reviewers would also benefit from the convenience.


Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia

FTP urls translated to HTML behind firewall

Hi Sean,

I have always been experiencing issues with the GEOquery package, when running behind an annoying university proxy, which I "bypass" using cntlm.
Forgetting about the technical details, the issue is that the GEO URL that is fetched by getAndParseGSEMatrices gets scrambled into an HTML page somewhere in between nih.gov and the R session (probably by the proxy or cntlm), this causes getGEO to throw the following typical error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 6 elements

For example:

getURL('ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12008/matrix/')
[1] "\r\n<meta http-equiv="Content-Type" content="text-html; charset=UTF-8">\r\n\r\n<TITLE>FTP directory /geo/series/GSE12nnn/GSE12008/matrix/ at ftp.ncbi.nlm.nih.gov. </TITLE>\r\n\r\n\r\n

FTP directory /geo/series/GSE12nnn/GSE12008/matrix/ at ftp.ncbi.nlm.nih.gov.

\r\n
\r\n
\r\n                          <DIR> <A HREF="..">..
09/18/13 09:09AM [GMT] 719,285 <A HREF="/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL4134_series_matrix.txt.gz">GSE12008-GPL4134_series_matrix.txt.gz\r\n09/18/13 09:09AM [GMT] 557,250 <A HREF="/geo/series/GSE12nnn/GSE12008/matrix/GSE12008-GPL6244_series_matrix.txt.gz">GSE12008-GPL6244_series_matrix.txt.gz\r\n
\r\n
\r\n\r\n\r\n"

I have a fix for this (see below), which would be really nice if applied to GEOquery, so that I can work with GEO datasets :D

The idea is that instead of always scanning for a txt table format, we first check if the result is an HTML document, and, if so, parse for the href strings that point to the matrix files to download.
Here I provide two alternative ways of processing the HTML content, one one-liner using the stringr package, or more lines if you don't want to depend on stringr.
The message should probably go away or disabled if not in verbose mode.
Please let me know if you consider this fix can be applied any time soon.
Thank you.

Bests,
Renaud

########### in getAndParseGSEMatrices

gdsurl <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/"
a <- getURL(sprintf(gdsurl, stub, GEO))
if( grepl("^<HTML", a) ){ # process HTML content
message("# Processing HTML result page (behind a proxy?) ... ", appendLF=FALSE)
b <- stringr::str_match_all(a, "((href)|(HREF))=\\s*[\"']/[^\"']+/([^/]+)[\"']")[[1]]

 # or, alternatively, without stringr
 sa <- gsub('HREF', 'href', a, fixed = TRUE)# just not to depend on case change
 sa <- strsplit(sa, 'href', fixed = TRUE)[[1L]]
 pattern <- "^=\\s*[\"']/[^\"']+/([^/]+)[\"'].*"
 b <- as.matrix(gsub(pattern, "\\1", sa[grepl(pattern, sa)]))
 #

 message('OK')

}else{ # standard processing of txt content
tmpcon <- textConnection(a, "r")
b <- read.table(tmpcon)
close(tmpcon)
}

Parsing characteristics fails

$ GEOquery::getGEO('GSE23397')

Traceback:

1. GEOquery::getGEO("GSE23397")
2. getAndParseGSEMatrices(GEO, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL, 
 .     parseCharacteristics = parseCharacteristics)
3. parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL, 
 .     getGPL = getGPL)
4. dplyr::mutate(pd, characteristics = ifelse(grepl("_ch2", characteristics), 
 .     "ch2", "ch1")) %>% tidyr::separate(kvpair, into = c("k", 
 .     "v"), sep = ":", fill = "right") %>% dplyr::mutate(k = paste(k, 
 .     characteristics, sep = ":")) %>% dplyr::select(-characteristics) %>% 
 .     tidyr::spread(k, v)
5. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. eval(quote(`_fseq`(`_lhs`)), env, env)
8. `_fseq`(`_lhs`)
9. freduce(value, `_function_list`)
10. withVisible(function_list[[k]](value))
11. function_list[[k]](value)
12. tidyr::spread(., k, v)
13. spread.data.frame(., k, v)
14. abort(glue("Duplicate identifiers for rows {rows}"))

getGEO fails on some GSEs

Reported by Alex Abbas via email:

> getGEO("GSE83452")
Found 1 file(s)
GSE83452_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE83nnn/GSE83452/matrix/GSE83452_series_matrix.txt.gz'
Content type 'application/x-gzip' length 40807397 bytes (38.9 MB)
==================================================
downloaded 38.9 MB

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_integer()
)
See spec(...) for full column specifications.
|====================================================================================================================================| 100%   93 MB
File stored at:
/var/folders/wf/65nq9x_d4xj9ttpbz7shlkgc0000gn/T//RtmpbPhopr/GPL16686.soft
Warning: 190 parsing failures.
row   col expected   actual         file
53792 ID  an integer AFFX-BioB-3_at literal data
53793 ID  an integer AFFX-BioB-3_st literal data
53794 ID  an integer AFFX-BioB-5_at literal data
53795 ID  an integer AFFX-BioB-5_st literal data
53796 ID  an integer AFFX-BioB-M_at literal data
...
See problems(...) for more details.

Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning messages:
1: attributes are not identical across measure variables;
they will be dropped
2: In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 1)
3: non-unique value when setting 'row.names': ‘NA’
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GEOquery_2.46.3     bindrcpp_0.2        Biobase_2.38.0      BiocGenerics_0.24.0 ggplot2_2.2.1       tidyr_0.7.2         dplyr_0.7.4

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13     xml2_1.1.1       bindr_0.1        magrittr_1.5     hms_0.3          tidyselect_0.2.3 munsell_0.4.3    colorspace_1.3-2
 [9] R6_2.2.2         rlang_0.1.4      httr_1.3.1       plyr_1.8.4       tools_3.4.2      grid_3.4.2       gtable_0.2.0     lazyeval_0.2.1
[17] assertthat_0.2.0 tibble_1.3.4     readr_1.1.1      purrr_0.2.4      bitops_1.0-6     curl_3.0         RCurl_1.95-4.8   glue_1.2.0
[25] labeling_0.3     stringi_1.1.5    compiler_3.4.2   scales_0.5.0     XML_3.98-1.9     pkgconfig_2.0.1

error in the ftp site url when downloading supplement files for a GPL with getGEOSuppFiles

Hi Sean,

I think there is a typo in the GEO ftp site url when you download supplement files for a GPL with the function getGEOSuppFiles: platform should be replaced by platforms (line 37 of the file getGEOSuppFiles.R):

url <- sprintf("ftp://ftp.ncbi.nlm.nih.gov/geo/platform/%s/%s/suppl/",stub,GEO)
url <- sprintf("ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/%s/%s/suppl/",stub,GEO)

thanks

Slim Fourati
Sr Research Associate
Department of Pathology, CWRU
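
To illustrate the corrected path, here is a small sketch that builds the supplementary-file directory URL from an accession. It is not the package's code: `geo_suppl_url` is a hypothetical helper, and the directory names and the "nnn" stub rule are assumptions inferred from the URLs quoted in these reports.

```r
# Build the supplementary-file directory URL for a GEO accession.
# Directory names ("platforms", "series") and the "nnn" stub rule are
# assumptions based on the URLs shown in these reports.
geo_suppl_url <- function(geo) {
  geo <- toupper(geo)
  type_dir <- switch(substr(geo, 1, 3),
    GPL = "platforms",
    GSE = "series",
    stop("unsupported accession type: ", geo)
  )
  # Replace the last one to three digits with "nnn", e.g. GSE12008 -> GSE12nnn.
  stub <- sub("[0-9]{1,3}$", "nnn", geo)
  sprintf("ftp://ftp.ncbi.nlm.nih.gov/geo/%s/%s/%s/suppl/", type_dir, stub, geo)
}

gpl_url <- geo_suppl_url("GPL570")
gse_url <- geo_suppl_url("GSE12008")
print(gpl_url)
print(gse_url)
```

Keeping the accession-type-to-directory mapping in one place makes a singular/plural slip like the one reported here easier to spot.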

'destdir' is ignored when GPL file is downloaded for local GSE matrix file

When I try to load a (manually downloaded) GSE matrix file by

getGEO(filename="GSEXXXX_series_matrix.txt", destdir="./")

the destdir option is not passed to parseGSEMatrix() through parseGEO().
Therefore, GPL file is not searched in the destdir, but downloaded
into a temporary folder every time R is restarted.

Frequent inability to connect to GEO

Hello,

I'm experiencing frequent connection issues with getGEO(). This is an issue on multiple computers, all with different network connections from different locations.

Attempt 1:

> esetlist <- getGEO("gse94802")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix/
OK
Found 2 file(s)
/geo/series/GSE94nnn/GSE94802/

downloaded 0 bytes

Error in download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s",  : 
  cannot download all files
In addition: Warning message:
In download.file(sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/%s",  :
  URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix//geo/series/GSE94nnn/GSE94802/': status was '404 Not Found'

Attempt 2 (5 seconds later):

> esetlist <- getGEO("gse94802")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix/
OK
Found 1 file(s)
GSE94802_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE94nnn/GSE94802/matrix/GSE94802_series_matrix.txt.gz'
Content type 'unknown' length 3277 bytes
==================================================
downloaded 3277 bytes

File stored at: 
/tmp/RtmpYLjH8P/GPL17021.soft

Is there a way to have getGEO() reattempt the connection several times (perhaps specified via an argument) before throwing an error? This is causing my reports to crash unexpectedly, disrupting reproducibility. Alternatively, I could try wrapping getGEO() in a try-catch block to achieve this suggested functionality.

Thanks for your help!
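
The try-catch workaround mentioned above could look roughly like the following. This is a user-side sketch, not a feature of getGEO(); `with_retries` and its `attempts` and `wait` arguments are hypothetical names, and a stand-in function replaces the network call so the example runs offline.

```r
# Call `f()` up to `attempts` times, pausing `wait` seconds between failures.
with_retries <- function(f, attempts = 3, wait = 1) {
  last_error <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    message(sprintf("Attempt %d of %d failed: %s",
                    i, attempts, conditionMessage(result)))
    if (i < attempts) Sys.sleep(wait)
  }
  # All attempts failed: re-signal the last error.
  stop(last_error)
}

# Stand-in for a flaky download: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("cannot download all files")
  "ok"
}
result <- with_retries(flaky, attempts = 5, wait = 0)
```

In a report, the call would be wrapped as `with_retries(function() getGEO("GSE94802"))`. Note that retrying only helps with transient failures; it does not address a listing that returns the wrong file name.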

Give the option not to download GPL data

Would it be possible to add an argument to getGEO, e.g., GPLdata = TRUE, or define a special meaning for an NA value in AnnotGPL, that indicates if GPL feature data should be downloaded at all?

Example:

> eset <- getGEO('GSE12345')
> head(fData(eset)[, 1:5])
                 ID GB_ACC SPOT_ID Species Scientific Name Annotation Date
1007_s_at 1007_s_at U48705    <NA>            Homo sapiens     Jun 9, 2011
1053_at     1053_at M87338    <NA>            Homo sapiens     Jun 9, 2011
117_at       117_at X51757    <NA>            Homo sapiens     Jun 9, 2011
121_at       121_at X69699    <NA>            Homo sapiens     Jun 9, 2011
1255_g_at 1255_g_at L36861    <NA>            Homo sapiens     Jun 9, 2011
1294_at     1294_at L13852    <NA>            Homo sapiens     Jun 9, 2011
> 
> # while no feature data is fetched with:
> eset <- getGEO('GSE12345', GPLdata = FALSE)
> # or
> # eset <- getGEO('GSE12345', AnnotGPL = NA)
> featureData(eset)
An object of class 'AnnotatedDataFrame': none
> fData(eset)
data frame with 0 columns and 54675 rows

Note that one needs to be careful when setting an empty feature data AnnotatedDataFrame object: rownames must still correspond to featureNames, otherwise the object does not pass validation:

> featureData(eset) <- AnnotatedDataFrame(data.frame(row.names = featureNames(eset)))
> validObject(eset)
[1] TRUE
> featureData(eset) <- AnnotatedDataFrame()
> validObject(eset)
Error in validObject(eset) : 
  invalid class “ExpressionMix” object: 1: feature numbers differ between assayData and featureData
invalid class “ExpressionMix” object: 2: featureNames differ between assayData and featureData

Thanks!

Parse "quick" output

Hi Sean,

How does getGEO() in the GEOquery library parse the quick SOFT file? I get the error below. Could you help?

getGEOfile("GSE10", destdir = "/Users/cwu1", AnnotGPL = FALSE, amount = "quick")
getGEO(filename="/Users/cwu1/GSE10.soft")
Parsing....
Error in readLines(con, lineCounts[1, 1] - 1) : invalid 'n' argument

Thanks,

Canglin Wu
Research Associate
Department of Biostatistics and Computational Biology
School of Medicine and Dentistry
University of Rochester
601 Elmwood Avenue, Box 630
Rochester, New York 14642

Error in xj[i] : only 0's may be mixed with negative subscripts

Retrieving data from getGEO seems to fail:

> library(GEOquery)
> gse <- getGEO('21653')
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE21nnn/GSE21653/matrix/
Found 1 file(s)
GSE21653_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE21nnn/GSE21653/matrix/GSE21653_series_matrix.txt.gz'
ftp data connection made, file length 22510111 bytes
==================================================
downloaded 22.5 MB
File stored at: 
/tmp/Rtmp96s5ht/GPL570.soft
Error in xj[i] : only 0's may be mixed with negative subscripts

getGEO("GSExxxx") error

Hello,
Retrieving data from GEO in the form of getGEO("GSExxxx") seems to produce errors that didn't occur before. Perhaps this is due to NCBI's recent move from http to https?

>GSE <- getGEO("GSE3105")
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE3nnn/GSE3105/matrix/

Found 1 file(s)
GSE3105_series_matrix.txt.gz

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  251k  100  251k    0     0  70762      0  0:00:03  0:00:03 --:--:-- 70753
File stored at:
/tmp/RtmpnbTgfV/GPL199.soft
Error in read.table(con, sep = "\t", header = FALSE, nrows = nseries) :
  invalid 'nlines' argument

Is this a known issue?
Yair

getGEOSuppFiles does not create a directory in baseDir

Hi Sean,

calling the following generates an error:

getGEOSuppFiles('GSE12345', baseDir = tempdir())

this is due to this piece of code, which creates the directory in the current working directory

if(makeDirectory) {
    suppressWarnings(try(dir.create(GEO),silent=TRUE))
    storedir <- file.path(baseDir,GEO)
  }

One could do:

dir.create( storedir <- file.path(baseDir,GEO) )

Thanks
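
A slightly fuller sketch of the suggested fix, written as a standalone helper rather than as a patch to getGEOSuppFiles; the function and argument names (`make_store_dir`, `base_dir`, `make_directory`) are hypothetical.

```r
# Create (if needed) and return the storage directory for an accession.
# The directory is created under base_dir, not the working directory.
make_store_dir <- function(geo, base_dir = getwd(), make_directory = TRUE) {
  store_dir <- if (make_directory) file.path(base_dir, geo) else base_dir
  dir.create(store_dir, recursive = TRUE, showWarnings = FALSE)
  store_dir
}

base <- tempfile("geo_base_")
store_dir <- make_store_dir("GSE12345", base_dir = base)
print(dir.exists(store_dir))
```

Using `recursive = TRUE` also covers the case where `base_dir` itself does not exist yet, and `showWarnings = FALSE` keeps repeated calls quiet when the directory is already there.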

Parsing of differing characteristics in Series Matrices makes phenodata messy

Good day,

I found an issue in the way that getGEO parses the clinical data. Consider

ovaCancer <- getGEO("GSE53963")[[1]]
pData(ovaCancer)[1:6, 20:25]
          characteristics_ch2.1 characteristics_ch2.2 characteristics_ch2.3 characteristics_ch2.4 characteristics_ch2.5 characteristics_ch2.6
GSM1304246    morphology: Serous            stage_#: 2             Stage: II           substage: B               grade: 4    debulking: Optimal
GSM1304247    morphology: Serous            stage_#: 3            Stage: III           substage: C               grade: 4   debulking: Optimal
GSM1304248    morphology: Serous            stage_#: 4             Stage: IV              grade: 3 debulking: Sub-optimal  time_fu_months: 1.12
GSM1304249    morphology: Serous            stage_#: 3            Stage: III           substage: C               grade: 3   debulking: Optimal
GSM1304250    morphology: Serous            stage_#: 3            Stage: III           substage: C               grade: 4   debulking: Optimal
GSM1304251    morphology: Serous            stage_#: 4             Stage: IV              grade: 3 debulking: Sub-optimal time_fu_months: 47.20

Because stage 4 samples have no substage data, their values after the stage column are shifted one column to the left. Is it possible to handle this case automatically?


Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
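
One way to avoid the misalignment is to key each value by its characteristic name instead of its position. The sketch below shows the idea in base R on toy data; it is not how getGEO currently parses phenodata, and `align_characteristics` is a hypothetical helper.

```r
# Turn per-sample "key: value" strings into one column per key.
# `chars` is a named list with one character vector per sample.
align_characteristics <- function(chars) {
  key_of <- function(x) trimws(sub(":.*$", "", x))
  value_of <- function(x) trimws(sub("^[^:]*:", "", x))
  keys <- unique(unlist(lapply(chars, key_of)))
  rows <- lapply(chars, function(x) {
    out <- setNames(rep(NA_character_, length(keys)), keys)
    out[key_of(x)] <- value_of(x)   # missing keys stay NA
    out
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

chars <- list(
  GSM1304246 = c("Stage: II", "substage: B", "grade: 4"),
  GSM1304248 = c("Stage: IV", "grade: 3")
)
aligned <- align_characteristics(chars)
print(aligned)
```

Samples that lack a characteristic get `NA` in that column rather than shifting later values. This sketch assumes each key appears at most once per sample; repeated keys would overwrite each other.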

Error: duplicate identifiers for rows....

reported in email:

library(GEOquery)
GSE37405 <- getGEO(GEO = "GSE37405", GSEMatrix = T, destdir = getwd())

results in:

Error: Duplicate identifiers for rows (232, 285, 338, 391, 444, 497, 550), (503, 556), (147, 200, 253, 306, 359, 412, 465, 518, 571)

This is a general problem, so not specific to this GSE.
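
The error comes from spreading key/value pairs where the same sample and key combination occurs more than once. One plausible mitigation, shown as a sketch and not as the fix adopted by the package, is to number repeated keys within a sample before spreading; `dedupe_keys` is a hypothetical helper.

```r
# Number repeated keys within one sample so that each key is unique.
dedupe_keys <- function(keys) {
  make.unique(keys, sep = "_")
}

keys <- dedupe_keys(c("treatment", "treatment", "tissue"))
print(keys)
```

With unique keys per sample, each (sample, key) pair identifies a single value, which is the condition the spread step requires.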

parseGPL reparses GPL files in the same session

Hi,
In the current implementation, GEOquery already looks for a local version of a GPL file to avoid downloading it again. But when reading multiple GEO files associated with a single GPL soft file, it parses and loads into memory this same file multiple times, resulting in many identical copies of an AnnotatedDataFrame that represents the Platform! Below is a reproducible example:

R > gse32982 <- getGEO('GSE32982')
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE32nnn/GSE32982/matrix/
Found 1 file(s)
GSE32982_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE32nnn/GSE32982/matrix/GSE32982_series_matrix.txt.gz'
ftp data connection made, file length 959302 bytes
==================================================
downloaded 936 KB

File stored at:
/tmp/Rtmpce6W06/GPL570.soft

R > gse45016 <- getGEO('GSE45016')
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45016/matrix/
Found 1 file(s)
GSE45016_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE45nnn/GSE45016/matrix/GSE45016_series_matrix.txt.gz'
ftp data connection made, file length 1486395 bytes
==================================================
downloaded 1.4 MB

Using locally cached version of GPL570 found here:
/tmp/Rtmpce6W06/GPL570.soft

R > gpl570_1 <- featureData(gse32982[[1]])
R > gpl570_2 <- featureData(gse45016[[1]])
R > identical(gpl570_1, gpl570_2)
[1] TRUE
R > data.table::address(gpl570_1)
[1] "0x10d32c78"
R > data.table::address(gpl570_2)
[1] "0x78835a0"

This wastes time and memory; caching the object parsed from a given file for the duration of the session shouldn't be hard to do.
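
A minimal sketch of such a session-level cache, using an environment keyed by file path. The names `.gpl_cache` and `cached_parse` are hypothetical, and a toy parser stands in for the real GPL parser so the example runs without GEOquery.

```r
# Session-level cache keyed by normalized file path.
.gpl_cache <- new.env(parent = emptyenv())

cached_parse <- function(path, parser) {
  key <- normalizePath(path, mustWork = TRUE)
  if (!exists(key, envir = .gpl_cache, inherits = FALSE)) {
    assign(key, parser(key), envir = .gpl_cache)
  }
  get(key, envir = .gpl_cache, inherits = FALSE)
}

# Stand-in parser that counts how often it is called.
parse_calls <- 0
toy_parser <- function(path) {
  parse_calls <<- parse_calls + 1
  readLines(path)
}

tmp <- tempfile(fileext = ".soft")
writeLines(c("^PLATFORM = GPL570", "!Platform_title = example"), tmp)
first <- cached_parse(tmp, toy_parser)
second <- cached_parse(tmp, toy_parser)
print(parse_calls)
```

The second call returns the stored object without invoking the parser again. This sketch never invalidates entries, so a file that changes on disk during the session would still yield the old parse.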

Test failure on Windows

Unfortunately I don't have a build report to point you to because our windows devel build didn't happen today (for an unrelated reason) but I noticed that on windows GEOquery fails a unit test:

1 Test Suite :
GEOquery RUnit Tests - 14 test functions, 0 errors, 1 failure
FAILURE in testSuppFileSupport: Error in checkEquals(10, ncol(fres)) : Mean relative difference: 0.3

Test files with failing tests

test_SuppFileSupport.R
testSuppFileSupport

Basically it's because file.info() returns a different number of columns on Windows (7) than on unix (10).

So the test should be fixed.

Thanks,
Dan
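
A platform-independent version of the check could assert on the columns the code actually relies on instead of the total column count. This is a sketch using base `stopifnot()` rather than the package's RUnit test, and the list of required columns is an assumption.

```r
# file.info() returns extra platform-specific columns, so test for the
# columns that are needed rather than for a fixed ncol().
tmp <- tempfile()
writeLines("x", tmp)
info <- file.info(tmp)

required <- c("size", "isdir", "mode", "mtime", "ctime", "atime")
missing_cols <- setdiff(required, names(info))
stopifnot(length(missing_cols) == 0)
```

The same idea carries over to RUnit by checking that `missing_cols` is empty instead of comparing `ncol(fres)` to 10.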

vignette: number of platforms for the GSE781 dataset

In the vignette, in the section Converting a GSE to an expressionset, you mention that all the GSM arrays are from the GPL5 platform. However if I run the code on my computer, I get a list of GPL96 and GPL97 arrays:

> gse <- getGEO(filename=system.file("extdata/GSE781_family.soft.gz",package="GEOquery"))
> gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform})
> gsmplatforms
$GSM11805
[1] "GPL96"

$GSM11810
[1] "GPL97"

$GSM11814
[1] "GPL96"

This is also evident in the bioconductor page of GEOquery: http://www.bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html#converting-gse-to-an-expressionset

Maybe this file has been changed since the last documentation update?

using `parseCharacteristics = FALSE` still parses characteristics

E.g.:
$ GEOquery::getGEO('GSE23397', parseCharacteristics = FALSE)

Traceback:

1. GEOquery::getGEO("GSE23397", parseCharacteristics = FALSE)
2. getAndParseGSEMatrices(GEO, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL, 
 .     parseCharacteristics = parseCharacteristics)
3. parseGSEMatrix(destfile, destdir = destdir, AnnotGPL = AnnotGPL, 
 .     getGPL = getGPL)
4. dplyr::mutate(pd, characteristics = ifelse(grepl("_ch2", characteristics), 
 .     "ch2", "ch1")) %>% tidyr::separate(kvpair, into = c("k", 
 .     "v"), sep = ":", fill = "right") %>% dplyr::mutate(k = paste(k, 
 .     characteristics, sep = ":")) %>% dplyr::select(-characteristics) %>% 
 .     tidyr::spread(k, v)
5. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. eval(quote(`_fseq`(`_lhs`)), env, env)
8. `_fseq`(`_lhs`)
9. freduce(value, `_function_list`)
10. withVisible(function_list[[k]](value))
11. function_list[[k]](value)
12. tidyr::spread(., k, v)
13. spread.data.frame(., k, v)
14. abort(glue("Duplicate identifiers for rows {rows}"))

getGEOSuppFiles not exported

From Lori:

AnnotationHubData and ExperimentHubData are failing with

Error : object ‘getGEOSuppFiles’ is not exported by 'namespace:GEOquery'

I looked at AnnotationHubData and while it is imported in the NAMESPACE I don't think it's currently used anymore within the code, so I can just remove this.
But I also wanted to double check that this was intentional, since GEOquery still has the man file for the function but it is no longer exported?

fix getDirListing to deal with proxies that convert directory indexes to HTML

sorry to bother you again with this very same issue.

The patch solved the problem in getAndParseGSEMatrices, but I also get the proxy problem with getGEOSuppFiles, where the list of tar files is not detected properly. No error thrown here (it is caught in a try()), only a message saying there are no supplemental files (while there are).

In this function I see you have/use a function getDirListing.
It might be the right place to put the patch, so that it can be used consistently across the package wherever needed (including by getAndParseGSEMatrices).

Sorry I did not report this together with the other one. I was just not aware of this function (! :().
Thank you.

Bests,
Renaud

Line counts on Windows different than linux/mac causing test failures

Here is the output from appveyor. The same code has line numbers that match up on Linux/Mac and with the actual counts from GEO. This appears to affect only GPL and GSMs.


R version 3.4.2 Patched (2017-11-06 r73690) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(testthat)
> library(GEOquery)
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colMeans, colSums, colnames, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, lengths, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, rank, rbind, rowMeans, rowSums, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which,
    which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
> 
> test_check("GEOquery")

Attaching package: 'limma'

The following object is masked from 'package:BiocGenerics':

    plotMA

1. Failure: generic GPL parsing works as expected (@test_GPL.R#9) --------------
nrow(Table(gpl)) not equivalent to 22283.
1/1 mismatches
[1] 22284 - 22283 == 1


2. Failure: quoted GPL works (@test_GPL.R#19) ----------------------------------
45220 not equivalent to nrow(Table(gpl)).
1/1 mismatches
[1] 45220 - 45221 == -1


3. Failure: short GPL works (@test_GPL.R#26) -----------------------------------
52 not equivalent to nrow(Table(gpl)).
1/1 mismatches
[1] 52 - 53 == -1


trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11413/matrix/GSE11413_series_matrix.txt.gz'
Content type 'application/x-gzip' length 3997 bytes
==================================================
downloaded 3997 bytes

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE35nnn/GSE35683/matrix/GSE35683_series_matrix.txt.gz'
Content type 'application/x-gzip' length 5733793 bytes (5.5 MB)
==================================================
downloaded 5.5 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11595/matrix/GSE11595-GPL3906_series_matrix.txt.gz'
Content type 'application/x-gzip' length 59663 bytes (58 KB)
==================================================
downloaded 58 KB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11595/matrix/GSE11595-GPL4348_series_matrix.txt.gz'
Content type 'application/x-gzip' length 1528930 bytes (1.5 MB)
==================================================
downloaded 1.5 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34145/matrix/GSE34145-GPL15796_series_matrix.txt.gz'
Content type 'application/x-gzip' length 6643 bytes
==================================================
downloaded 6643 bytes

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34145/matrix/GSE34145-GPL6102_series_matrix.txt.gz'
Content type 'application/x-gzip' length 1428433 bytes (1.4 MB)
==================================================
downloaded 1.4 MB

4. Failure: basic GSM works (@test_GSM.R#12) -----------------------------------
nrow(Table(gsm)) not equivalent to 22283.
1/1 mismatches
[1] 22284 - 22283 == 1


trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE2nnn/GSE2553/matrix/GSE2553_series_matrix.txt.gz'
Content type 'application/x-gzip' length 8480960 bytes (8.1 MB)
==================================================
downloaded 8.1 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1000/suppl//GSE1000_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 35307520 bytes (33.7 MB)
==================================================
downloaded 33.7 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM15nnn/GSM15789/suppl//GSM15789.cel.gz?tool=geoquery'
Content type 'application/x-gzip' length 3507725 bytes (3.3 MB)
==================================================
downloaded 3.3 MB

trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM15nnn/GSM15789/suppl//GSM15789.cel.gz?tool=geoquery'
Content type 'application/x-gzip' length 3507725 bytes (3.3 MB)
==================================================
downloaded 3.3 MB

testthat results ================================================================
OK: 181 SKIPPED: 0 FAILED: 4
1. Failure: generic GPL parsing works as expected (@test_GPL.R#9) 
2. Failure: quoted GPL works (@test_GPL.R#19) 
3. Failure: short GPL works (@test_GPL.R#26) 
4. Failure: basic GSM works (@test_GSM.R#12) 

Error: testthat unit tests failed
Execution halted

getGEO() drops GSE attributes

I've been reviewing build failures of certain experimental data packages in Bioconductor 3.6 devel, and I believe I've isolated the issue. The getGEO() function from GEOquery seems to be dropping attributes when reading in a GSE matrix. This creates problems further down in their code. This reproducible example is gleaned from their code:

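library("GEOquery")
# For a GSE accession, getGEO() returns a list of ExpressionSets; take the first.
gse <- getGEO("GSE52778")[[1]]
# Keep only the sample "characteristics" columns of the phenotype data.
pdata <- pData(gse)[, grepl("characteristics", names(pData(gse)))]
names(pdata) <- c("treatment", "tissue", "ercc_mix", "cell", "celltype")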

Notice that a warning is thrown by getGEO:

Warning message:
attributes are not identical across measure variables;
they will be dropped

The user actually downloaded the file GSE52778_series_matrix.txt from GEO, but the result of using the file and accessing it through NCBI appears to be the same.

Here is the sessionInfo():

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] mzR_2.11.11                 Rcpp_0.12.13
 [3] parathyroidSE_1.15.0        airway_0.111.0
 [5] SummarizedExperiment_1.7.10 DelayedArray_0.3.21
 [7] matrixStats_0.52.2          GenomicRanges_1.29.15
 [9] GenomeInfoDb_1.13.5         bindrcpp_0.2
[11] illuminaHumanv1.db_1.26.0   org.Hs.eg.db_3.4.2
[13] AnnotationDbi_1.39.4        IRanges_2.11.19
[15] S4Vectors_0.15.14           limma_3.33.14
[17] GEOquery_2.45.2             Biobase_2.37.2
[19] BiocGenerics_0.23.3

loaded via a namespace (and not attached):
 [1] XVector_0.17.1          BiocInstaller_1.27.5    compiler_3.4.1
 [4] bindr_0.1               ProtGenerics_1.9.1      zlibbioc_1.23.0
 [7] bitops_1.0-6            tools_3.4.1             digest_0.6.12
[10] bit_1.1-12              lattice_0.20-35         RSQLite_2.0
[13] memoise_1.1.0           tibble_1.3.4            pkgconfig_2.0.1
[16] rlang_0.1.2             Matrix_1.2-11           DBI_0.7
[19] curl_3.0                GenomeInfoDbData_0.99.1 dplyr_0.7.4
[22] httr_1.3.1              xml2_1.1.1              hms_0.3
[25] grid_3.4.1              tidyselect_0.2.2        bit64_0.9-7
[28] glue_1.1.1              R6_2.2.2                XML_3.98-1.9
[31] tidyr_0.7.2             readr_1.1.1             purrr_0.2.3
[34] blob_1.1.0              magrittr_1.5            codetools_0.2-15
[37] assertthat_0.2.0        stringi_1.1.5           RCurl_1.95-4.8

Intermittent connection issues to GEO

From: Hervé Pagès [email protected]
Sent: Sep 27, 2017 7:27 PM
To: "Davis, Sean (NIH/NCI) [E]" [email protected]
Cc: "Shepherd, Lori" [email protected]
Subject: intermittent GEO errors

Hi Sean,

We seem to get these intermittent GEO errors on our build reports
pretty often these days, maybe more than usual. For example today
we see them in release here:

https://bioconductor.org/checkResults/3.5/bioc-LATEST/GEOquery/malbec2-checksrc.html

https://bioconductor.org/checkResults/3.5/bioc-LATEST/ChIPXpress/malbec2-checksrc.html

and in devel here:

https://bioconductor.org/checkResults/3.6/bioc-LATEST/ChIPXpress/malbec1-checksrc.html

Do you have any idea why the GEO service is so flaky?

Are the sys admins in charge of ftp.ncbi.nlm.nih.gov aware of this?

Is there anything that could be done to improve this situation?

Was just wondering if you had any insight on this.

Thanks,
H.
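
One common client-side mitigation for intermittent failures like these is to retry the request after a short pause. The following is only a minimal sketch in base R; `with_retries()` is a hypothetical helper, not part of GEOquery, and it does nothing about the underlying server flakiness.

```r
# Hypothetical helper, not part of GEOquery: call `f()` up to `max_tries`
# times, pausing `wait` seconds between failed attempts.
with_retries <- function(f, max_tries = 3, wait = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    # Capture an error as a value instead of letting it propagate.
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("all ", max_tries, " attempts failed; last error: ",
       conditionMessage(last_error))
}

# Possible usage (requires GEOquery and network access, so left commented out):
# gse <- with_retries(function() GEOquery::getGEO("GSE11413"), max_tries = 3, wait = 5)
```

A fixed pause keeps the sketch short; a build system might prefer an increasing delay between attempts.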

Fix (expected) error in tests

Apparently, NCBI GEO changed the number of supplementary files for GSE1000, causing a test failure:

  1 Test Suite : 
  GEOquery RUnit Tests - 14 test functions, 0 errors, 1 failure
  FAILURE in testSuppFileSupport: Error in checkEquals(2, nrow(fres)) : Mean relative difference: 0.5
  
  
  Test files with failing tests
  
     test_SuppFileSupport.R 
       testSuppFileSupport 

Empty GPL platform_table on download from web API

Hi Sean,

geoq <- getGEO("GSE9514")
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE9nnn/GSE9514/matrix/
Found 1 file(s)
GSE9514_series_matrix.txt.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 378k 100 378k 0 0 204k 0 0:00:01 0:00:01 --:--:-- 204k
File stored at:
/data3/tmp/RtmpkDXZzR/GPL90.soft
Error in xj[i] : only 0's may be mixed with negative subscripts

And the error appears to come from this section in parseGPL():

if (hasDataTable) {
    nLinesToRead <- NULL
    if (!is.null(n)) {
        nLinesToRead <- n - length(txt)
    }
    dat3 <- fastTabRead(con, n = nLinesToRead, quote = "")
    geoDataTable <- new("GEODataTable", columns = cols,
                        table = dat3[1:(nrow(dat3) - 1), ])
}

There is no error trapping for the case where fastTabRead returns a zero-row data.frame:

debug: dat3 <- fastTabRead(con, n = nLinesToRead, quote = "")
Browse[3]> dim(dat3)
[1] 0 17
Browse[3]> dat3
[1] ID ORF
[3] SPOT_ID Species Scientific Name
[5] Annotation Date Sequence Type
[7] Sequence Source Target Description
[9] Representative Public ID Gene Title
[11] Gene Symbol ENTREZ_GENE_ID
[13] RefSeq Transcript ID SGD accession number
[15] Gene Ontology Biological Process Gene Ontology Cellular Component
[17] Gene Ontology Molecular Function
<0 rows> (or 0-length row.names)

Best,

Jim

James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
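
The zero-row failure reported above can be reproduced without GEOquery. The sketch below builds an empty table standing in for `dat3` (columns abbreviated) and shows why `1:(nrow(dat3) - 1)` breaks. `drop_last_row()` is a hypothetical helper that illustrates one possible guard; it is not the fix actually applied in the package.

```r
# A zero-row table standing in for the `dat3` shown in the debug output.
dat3 <- data.frame(ID = character(0), ORF = character(0))

# With nrow(dat3) == 0, 1:(nrow(dat3) - 1) is 1:-1, i.e. c(1, 0, -1).
# Mixing positive and negative subscripts is an error in R.
err <- tryCatch(dat3[1:(nrow(dat3) - 1), ],
                error = function(e) conditionMessage(e))
print(err)

# Hypothetical helper: drop the last row, returning zero rows (with the
# columns intact) when there is nothing to drop.
drop_last_row <- function(df) {
  keep <- seq_len(max(nrow(df) - 1L, 0L))
  df[keep, , drop = FALSE]
}

nrow(drop_last_row(dat3))                   # empty input stays empty
nrow(drop_last_row(data.frame(ID = 1:3)))   # non-empty input loses one row
```

Using `seq_len()` rather than `1:n` avoids the descending sequence that `1:n` produces when `n` is zero or negative.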

GEOquery error when parsing SAGE TAG data

From [email protected]:

I get this error with GSE10, GSE17, and more recent series such as GSE11956 or GSE10442 (these are all from GPL4), but not with GSE2253 (GPL339). I opened some of the files that I can't download correctly, and they have different line counts. Maybe the problem comes from the GPL4 technology or something like that? It's a SAGE NlaIII platform, but I tried another SAGE NlaIII platform, GPL8251, and succeeded in downloading the GSE file correctly.

Problem parsing some GSE matrices

The refactored code for parsing GSE matrices seems to fail in some cases e.g. GSE5350 which we use in the BeadArrayUseCases vignette.

A straightforward path to producing the error is to try calling parseGSEMatrix on the GEO series text file directly:

GEOquery:::parseGSEMatrix(fname = "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5350/matrix/GSE5350-GPL2507_series_matrix.txt.gz")

Error in enc2utf8(col_names(col_labels, sep = sep)) : 
  argument is not a character vector

Some rudimentary digging leads me to think this is an issue with how the Sample_characteristics_ch1 field is being processed, but I don't know enough about how that may look across multiple datasets to offer a generic solution.

Happy to do some more experimentation if needed.

getGPL=F has no effect in getGEO

Hello,

I am using the getGEO function with GEOquery 2.34.0 and R version 3.2.1.

getGEO(filename="gsexx_series_matrix.txt.gz",getGPL=F) does not work correctly. The function downloads the gplxx.soft data from NCBI as if getGPL=T had been set, but I do not want to download the GPL data.

I tried getGEO with getGPL=F on several datasets, but the function always downloads the GPL data. Please help me.

Add tool tag to URLs

Hi Sean,

I received the green light to share GEOquery applog usage numbers
with you. So, as previously discussed, please go ahead and insert an extra
parameter into URLs.

Please use the parameter:
tool=geoquery

e.g.,
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96&format=sort&tool=geoquery

Once this is in place, we will be able to track usage and report numbers
to you.

Let me know if you have any questions or concerns.

Kind regards,
Tanya
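
As an illustration of the request above, here is a minimal base R sketch of appending the `tool=geoquery` parameter to a URL. `add_tool_tag()` is a hypothetical helper written for this example; it is not GEOquery's actual implementation.

```r
# Hypothetical helper (not GEOquery's actual implementation): append
# `tool=<tool>` to each URL, using "&" when a query string already exists
# and "?" otherwise. URLs that already carry a tool parameter are unchanged.
add_tool_tag <- function(url, tool = "geoquery") {
  has_tag <- grepl("[?&]tool=", url)
  sep <- ifelse(grepl("?", url, fixed = TRUE), "&", "?")
  ifelse(has_tag, url, paste0(url, sep, "tool=", tool))
}

add_tool_tag("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL96")
add_tool_tag("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1000/suppl/GSE1000_RAW.tar")
```

Checking for an existing `tool=` parameter keeps the helper safe to apply more than once to the same URL.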
