
spocc (SPecies OCCurrence)


Docs: http://docs.ropensci.org/spocc/

At rOpenSci, we have been writing R packages to interact with many sources of species occurrence data, including GBIF, VertNet, iNaturalist, and eBird. Other databases are out there as well, which we can pull in. spocc is an R package to query and collect species occurrence data from many sources. The goal is to create a seamless search experience across data sources, as well as unified outputs across data sources.

spocc currently interfaces with seven major biodiversity repositories:

  1. Global Biodiversity Information Facility (GBIF) (via rgbif) GBIF is a government-funded open data repository, built with several partner organizations, with the express goal of providing access to data on Earth's biodiversity. The data are made available by a network of member nodes, coordinating information from various participant organizations and government agencies.

  2. iNaturalist iNaturalist provides access to crowdsourced citizen science data on species observations.

  3. VertNet (via rvertnet) Similar to GBIF (see above), VertNet provides access to more than 80 million vertebrate records spanning a large number of institutions and museums, primarily covering four major disciplines (mammalogy, herpetology, ornithology, and ichthyology).

  4. eBird (via rebird) eBird is a database developed and maintained by the Cornell Lab of Ornithology and the National Audubon Society. It provides real-time access to checklist data, data on bird abundance and distribution, and community reports from birders.

  5. iDigBio (via ridigbio) iDigBio facilitates the digitization of biological and paleobiological specimens and their associated data, houses the resulting specimen data, and provides it via RESTful web services.

  6. OBIS OBIS (Ocean Biogeographic Information System) allows users to search marine species datasets from all of the world's oceans.

  7. Atlas of Living Australia ALA (Atlas of Living Australia) contains information on all the known species in Australia aggregated from a wide range of data providers: museums, herbaria, community groups, government departments, individuals and universities; it contains more than 50 million occurrence records.

The inspiration for this comes from users requesting a more seamless experience across data sources, and from our work on a similar package for taxonomy data (taxize).

BEWARE: In cases where you request data from multiple providers, especially when including GBIF, there could be duplicate records, since many providers' data eventually end up in GBIF. See ?spocc_duplicates, after installation, for more.
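For example, a minimal cross-source query might look like the sketch below (requires an internet connection; the exact columns of the combined data.frame may vary by spocc version):

```r
library("spocc")

# Query two sources at once; occ() returns one slot per data source
res <- occ(query = "Accipiter striatus", from = c("gbif", "inat"), limit = 25)

# Combine all sources into a single data.frame with standardized columns
df <- occ2df(res)
head(df)
```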

Learn more

spocc documentation: <docs.ropensci.org/spocc/>

Contributing

See CONTRIBUTING.md

Installation

Stable version from CRAN

install.packages("spocc", dependencies = TRUE)

Or the development version from GitHub

install.packages("remotes")
remotes::install_github("ropensci/spocc")
library("spocc")

Make maps

All mapping functionality is now in a separate package, mapr (formerly known as spoccutils), to make spocc easier to maintain. mapr is on CRAN.
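A sketch of the intended workflow, assuming mapr's map_leaflet() accepts occ() output as described in mapr's docs (requires an internet connection):

```r
library("spocc")
library("mapr")

# Fetch occurrences with spocc, then hand the result to mapr for mapping
dat <- occ(query = "Puma concolor", from = "gbif", limit = 100)
map_leaflet(dat)  # interactive Leaflet map of the occurrence points
```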

Meta

spocc's People

Contributors

dlebauer, emhart, hannahlowens, jeroen, karthik, maelle, sckott, stewid


spocc's Issues

Dealing with many species names

@ropensci/owners The occ function only accepts one species name (though it accepts many sources: gbif, ebird, etc.), so I wrote another function, occlist, that accepts many species.

Is this okay? I thought it was necessary given the complexity already in occ, and occlist has different S4 classes given the different setup.

Various issues with querying

Just a laundry list of things I'm coming up with as I go over workshop curriculum, with the call and the error.

  • Problem with ebird queries:
    hawks <- occ(query = "Buteo jamaicensis", from = c("ebird"), limit = 35)
    Error in list(fmt = "json", r = region, rtype = regtype, sci = species, : 'region' is missing
  • Problem with bison queries:
    hawks <- occ(query = "Buteo jamaicensis", from = c("bison"), limit = 35)
    Error in (function (species, type = "scientific_name", start = NULL, count = 10, : unused argument (what = "points")
  • lat and lon are swapped for some query results. Note the difference between gbif and ebird, e.g.:

red_tailed_hawk <- occ(query = "Buteo jamaicensis", from = c("gbif", "ebird"), limit = 35, ebirdopts = list(region = 'US'))

rt_hawk <-  occ2df(red_tailed_hawk)

> head(rt_hawk)
               name  longitude latitude prov
1 Buteo jamaicensis  -72.75626 44.33782 gbif
2 Buteo jamaicensis  -72.95540 44.38789 gbif
3 Buteo jamaicensis -122.29619 45.48817 gbif
4 Buteo jamaicensis  -73.07887 43.79315 gbif
5 Buteo jamaicensis  -93.68460 32.46638 gbif
6 Buteo jamaicensis -116.93248 32.61669 gbif
> tail(rt_hawk)
                name longitude   latitude  prov
65 Buteo jamaicensis  28.37477  -98.15052 ebird
66 Buteo jamaicensis  35.86407  -75.86201 ebird
67 Buteo jamaicensis  44.61453 -123.28182 ebird
68 Buteo jamaicensis  31.88652 -110.96856 ebird
69 Buteo jamaicensis  34.12420  -84.94828 ebird
70 Buteo jamaicensis  33.73912  -96.77625 ebird
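One way to detect and repair the swap shown above is to flag rows whose latitude is impossible (a rough sketch; fix_swapped_coords is a hypothetical helper, and it assumes an occ2df()-style data.frame with latitude and longitude columns):

```r
# Hypothetical helper: swap coordinates in rows where the latitude is
# outside [-90, 90], which can only happen if lat and lon were exchanged
fix_swapped_coords <- function(df) {
  swapped <- !is.na(df$latitude) & abs(df$latitude) > 90
  tmp <- df$latitude[swapped]
  df$latitude[swapped] <- df$longitude[swapped]
  df$longitude[swapped] <- tmp
  df
}

rt <- data.frame(
  name = "Buteo jamaicensis",
  longitude = c(-72.75626, 44.61453),   # second row has lon/lat swapped
  latitude  = c(44.33782, -123.28182),
  prov = c("gbif", "ebird")
)
fixed <- fix_swapped_coords(rt)
fixed$latitude  # 44.33782 44.61453
```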

Add ecoengine

Should we eventually integrate the ecoengine into this?

Upcoming changes in AntWeb

will require minor modifications in spocc, especially since results are now returned as an S3 class. Also, all requests are automatically limited to 1000 records if more are available, forcing the user to paginate (with offsets and limits) rather than getting a 500 error.

Maps: what do we want to do on maps?

We have options here. Some questions and thoughts:

  • How do we want to name these mapping functions? For now using same prefix, and different suffix with no separation, like maprcharts, mapgist, etc.?
  • Support maps that pull in shp files, etc. via other packages?
  • How do we want to create the connection between the output of an occurrence-data call and plotting? Right now, the main function to get occurrence data, occ(), outputs an S4 object of class occdat with slots for metadata, parameter settings for each data source, and the data for each data source. occ() is not vectorized, so you have to pass it to lapply, e.g., for more than one taxonomic name. For multiple species, we could vectorize occ, or just add a function that combines multiple outputs of occ into another class, e.g., occdat_many for occurrence data for many species, which can then be passed to a plotting function, coerced to an sp object, etc.

Add function to look up options for each source

Right now users would have to load each package separately and then look up the parameter settings that are available for each of inatopts, ebirdopts, etc. It would be easier to just have a function that prints out the args and their descriptions for each of the sources.
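Later versions of spocc do ship a function along these lines, occ_options(); the usage sketched below is an assumption about its interface (requires spocc to be installed):

```r
library("spocc")

# Look up the extra parameters a given source accepts, without loading
# the underlying client package yourself
occ_options("gbif")
```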

Travis issues

Not sure why we are still having so many issues getting Travis to pass any build of spocc. Looking over the latest fail, it seems like the issue is here:

Reading state information... Done
E: Unable to locate package r-cran-devtools
E: Unable to locate package r-cran-testthat
+for wait_time in 5 20 30 60
+echo 'Command failed, retrying in 60 ...'
Command failed, retrying in 60 ...
+sleep 60
+sudo apt-get install r-cran-devtools r-cran-testthat
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package r-cran-devtools
E: Unable to locate package r-cran-testthat
+echo 'Failed all retries!'
Failed all retries!
+exit 1

Do we still need devtools for the build to pass? Especially since we have now removed rCharts until that can appear on CRAN?

Warning message on load

Warning message:
replacing previous import by ‘rgbif::blanktheme’ when loading ‘spocc’

Metadata use cases

Following off of @cboettig issue in RNeXML ropensci/RNeXML#21 ...

What metadata would be useful for objects within this package?

  • data source (e.g., GBIF, iNaturalist, VertNet)
  • query parameter values used to generate the output (e.g., species = "Puma concolor")
  • date and time the query was made (could help with reproducibility, e.g. justifying why the same query made on different dates gives different results)
  • other...

Push version 1 to CRAN

I think all the open issues now are really enhancements. Once we remove the rCharts dependency and replace it with leafletR we should be free to complete the unit tests #28, do a quick code review, and submit to CRAN.

Yea or nay?

Removing rnpn as a source for now

I don't think it makes sense to include rnpn as a source anymore. It just has very little data per species, and it's really not meant to be a source of occurrence data. Also, it looks like the API is changing too (usa-npn/rnpn#4 (comment)), so it makes sense to take it out for now. This also means we have all our data source packages on CRAN, and are close to getting this to CRAN.

@ropensci/owners Okay with this?

Quick license Q

Is there a reason we are going with MIT + file LICENSE over CC0 for this?

Make new function to create data.frame from many outputs of occ() fxn

We have occ_todf to squash one output of occ (which can have data from many sources: gbif, inaturalist, etc.) into a single data.frame, but nothing for many outputs of occ.

Make new function to create data.frame from many outputs of occ() fxn, e.g.,

out <- lapply(c('species1','species2','species3'), function(x) occ(x, ...))
occmany_todf(out)

# giving a single data.frame with all data combined 
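Pending such a function, the combination step can be sketched in base R: run the to-data.frame step on each occ() output and row-bind the results (the stand-in data.frames below mimic that output so the sketch runs offline):

```r
# With real data this would be: dfs <- lapply(out, occ_todf)
dfs <- list(
  data.frame(name = "species1", longitude = -72.8, latitude = 44.3, prov = "gbif"),
  data.frame(name = "species2", longitude = -122.3, latitude = 45.5, prov = "inat")
)

# Row-bind into one data.frame, as occmany_todf() would
combined <- do.call(rbind, dfs)
combined
```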

Query by spatial extent

I know you can do this in iNaturalist, but I'm not sure about the other packages. Is there a way to allow people to query by a bounding box and return all species within that box? If so, can we implement this kind of query? It seems like it would be really useful.
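spocc later grew spatial search via a geometry argument to occ(); the sketch below assumes it accepts a bounding box as c(xmin, ymin, xmax, ymax) or a WKT string, and needs an internet connection:

```r
library("spocc")

# All occurrences inside a bounding box over northern California
# (assumed bbox order: xmin, ymin, xmax, ymax)
res <- occ(from = "gbif", geometry = c(-125.0, 39.0, -120.0, 42.0), limit = 50)
occ2df(res)
```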

example in mapggplot.r doesn't work

This example yields a map of west Africa.

#' ecoengine_data <- occ(query = "Lynx rufus californicus", from = "ecoengine")
#' mapggplot(ecoengine_data)

Class framework

Have we settled on a class framework for this? Does each search result get its own spot in a list? Do we want to build on @cboettig suggestion of just creating a standardized data frame and then writing a REML file to go with it?

Can't search multiple sources.

out <- occ(query='Setophaga caerulescens')
Error: client error: (404) Not Found
traceback()
8: stop(status$message, call. = FALSE)
7: stop_for_status(tt)
6: (function (species, type = "scientific_name", start = NULL, count = 10,
countyFips = NULL, county = NULL, aoi = NULL, aoibbox = NULL)
{
if (!is.null(county)) {
numbs <- fips[grep(county, fips$county), ]
if (nrow(numbs) > 1) {
message("\n\n")
print(numbs)
message("\nMore than one matching county found '",
county, "'!\nEnter row number of county you want (other inputs will return 'NA'):\n")
take <- scan(n = 1, quiet = TRUE, what = "raw")
if (length(take) == 0)
take <- "notake"
if (take %in% seq_len(nrow(numbs))) {
take <- as.numeric(take)
message("Input accepted, took county '", as.character(numbs[take,
"county"]), "'.\n")
countyFips <- paste0(numbs[take, c("fips_state",
"fips_county")], collapse = "")
}
else {
countyFips <- NA
message("\nReturned 'NA'!\n\n")
}
}
else if (nrow(numbs) == 1) {
countyFips <- paste0(numbs[, c("fips_state", "fips_county")],
collapse = "")
}
else {
stop("a problem occurred...")
}
}
url <- "http://bison.usgs.ornl.gov/api/search"
args <- compact(list(species = species, type = type, start = start,
count = count, countyFips = countyFips, aoi = aoi, aoibbox = aoibbox))
tt <- GET(url, query = args)
stop_for_status(tt)
out <- content(tt)
class(out) <- "bison"
return(out)
})(species = "Setophaga caerulescens")
5: do.call(bison, opts)
4: foo_bison(sources, x, bisonopts)
3: FUN("Setophaga caerulescens"[[1L]], ...)
2: lapply(query, loopfun)
1: occ(query = "Setophaga caerulescens")

Push a minor update to CRAN

New binaries for AntWeb and Ecoengine are now on CRAN.

AntWeb: Simplifies data retrieval and now returns an S3 class

Taxonomic scrubbing

Names are returned with too much detail, I think, e.g.: Accipiter striatus Vieillot, 1808. Is there scrubbing of the inputs? Or should we scrub the outputs? Or just parse the string down?
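As a stopgap, the authorship can often be parsed off the returned string (a rough sketch; scrub_name is a hypothetical helper that assumes binomial names, and real name parsing, e.g. GBIF's name parser, handles many more cases):

```r
# Strip trailing author + year, e.g. "Accipiter striatus Vieillot, 1808":
# keep only the first two whitespace-separated words of each name
scrub_name <- function(x) {
  vapply(strsplit(x, "\\s+"), function(w) paste(w[1:2], collapse = " "),
         character(1))
}

scrub_name("Accipiter striatus Vieillot, 1808")  # "Accipiter striatus"
```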

Support passing in diff sources for each name

Right now you can pass in many sources, like from = c('bison', 'antweb'), but the species searched for are all queried against each of those sources. I think we could support passing in a different source for each name, like

occ(queryfrom = list(c('gbif', 'Setophaga caerulescens'), c('bison', 'Spinus tristis'), list('antweb', c('formica','camponotus','tapinoma'))))

Where queryfrom accepts a list, with each element being a vector of length two like c('source', 'name') or a list like list('source', c('name1','name2',...))
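The dispatch logic this interface would need can be sketched offline (queryfrom and the plan structure here are hypothetical; with real queries each element would be routed to occ(query = ..., from = ...)):

```r
# Each element pairs one source with one or more names
queryfrom <- list(c("gbif", "Setophaga caerulescens"),
                  c("bison", "Spinus tristis"),
                  list("antweb", c("formica", "camponotus", "tapinoma")))

# Normalize each element into a (source, names) pair; a real implementation
# would then call occ(query = names, from = source) per element
plan <- lapply(queryfrom, function(x)
  list(source = x[[1]], names = unlist(x[-1])))

plan[[3]]$names  # "formica" "camponotus" "tapinoma"
```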

Methods issue.

in method for ‘coerce’ with signature ‘"occdat","SpatialPointsDataFrame"’: no definition for class “SpatialPointsDataFrame”

Class structure

@ropensci/owners Thoughts on the class structure?

Right now, occ and occlist give back a slot for each source out@gbif, out@ebird, etc., then within each source are slots for metadata and data (e.g., out@gbif@meta and out@gbif@data). The data slots can be coerced to a single data.frame for all sources using a wrapper function that coerces S4 classes.

Is it better to have occ and occlist give back a slot for all metadata for all sources and a slot for all data, then within each of the @meta and @data slots have source slots (e.g., out@meta@gbif or out@data@ebird)?

New function to clean data

From discussion on #43

  • Check for impossible values of lat and long
  • Check for contextually wrong values. That is, if 99 out of 100 lat/long coordinates are within the continental US, but 1 is in China, then perhaps something is wrong with that one point
  • Check for points in the wrong country, or points outside of a known country
  • Check for points that must be in a particular habitat, or must be outside of a particular habitat
  • Check for possible duplicate occurrence records across GBIF + other data sources (see discussion below)
  • Check for points that are at 0/0. There seem to be quite a few of these; they can't be right. Remove them.
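The first and last checks are easy to sketch in base R (clean_coords is a hypothetical helper for occ2df()-style data.frames; the country and habitat checks would need external polygon data):

```r
# Drop impossible and 0/0 coordinates from a data.frame with
# latitude/longitude columns
clean_coords <- function(df) {
  ok <- !is.na(df$latitude) & !is.na(df$longitude) &
    abs(df$latitude) <= 90 & abs(df$longitude) <= 180 &
    !(df$latitude == 0 & df$longitude == 0)
  df[ok, , drop = FALSE]
}

d <- data.frame(longitude = c(-72.8, 0, 200), latitude = c(44.3, 0, 10))
nrow(clean_coords(d))  # 1 -- only the first row survives
```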

Standardize options

We should have a list of options we want to include and make sure they're standardized across APIs. For instance, right now in rinat I don't have a date range, but it can be implemented by subsetting the returned data frame, or in the API. It seems like we should have a set of common options that we can standardize so they all take the same input form in occdat searches; e.g., we should have a standardized date format (which may differ among the underlying search packages).
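For dates specifically, a sketch of normalizing provider-specific strings to R's Date class before filtering (parse_occ_date is hypothetical, and the two formats tried are assumptions about what providers return):

```r
# Parse heterogeneous date strings to Date, trying common formats in order
parse_occ_date <- function(x) {
  for (fmt in c("%Y-%m-%d", "%m/%d/%Y")) {
    d <- as.Date(x, format = fmt)
    if (!is.na(d)) return(d)
  }
  as.Date(NA)
}

parse_occ_date("2013-05-01")  # "2013-05-01"
parse_occ_date("05/01/2013")  # "2013-05-01"
```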

Allow occ to proceed even with some services down

Right now AntWeb is down, and if we do

occ(query = "Bison bison", from = c("bison","antweb"))

we get

Error in (function (genus = NULL, species = NULL, scientific_name = NULL,  : 
  server error: (500) Internal Server Error

Yes, I realize we can't search for Bison in AntWeb, but the point is that the service is down, so it returns a 500 error.

We should allow the function to still retrieve the data from services that are up, and just pass on those that are down.
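A sketch of the per-source error handling this would take: wrap each source's fetch in tryCatch() so a down service yields an empty result plus a warning instead of aborting the whole call (try_source and the stub fetchers are hypothetical):

```r
# Wrap a data-fetching call so one failing source doesn't abort the rest
try_source <- function(fetch_fun) {
  tryCatch(fetch_fun(),
           error = function(e) {
             warning("source failed, returning no data: ", conditionMessage(e))
             data.frame()  # empty placeholder keeps downstream code working
           })
}

# Stub fetchers standing in for one healthy and one down service
up   <- function() data.frame(name = "Bison bison", prov = "bison")
down <- function() stop("server error: (500) Internal Server Error")

results <- lapply(list(up, down), try_source)
nrow(results[[1]])  # 1
nrow(results[[2]])  # 0
```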

Install error

Is anyone else getting an error when installing spocc?

When I try install(spocc):

* installing *source* package 'spocc' ...
** R
** tests
** preparing package for lazy loading
Warning: replacing previous import by 'rgbif::blanktheme' when loading 'spocc'
** help
*** installing help indices
** building package indices
** installing vignettes
Warning in file(con, "w") :
  cannot open file '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/spocc/doc/index.html': No such file or directory
Error in file(con, "w") : cannot open the connection
ERROR: installing vignettes failed
* removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/spocc'
Error: Command failed (1)

But when I try install(spocc, build_vignettes = FALSE), it all works fine.

Does anybody have a clue as to what's going on? I don't think I'm getting this error with other packages.

Custom icons in Leaflet maps

Here is a way to show custom icons on maps (via ramnathv/rCharts#301).

L1$geoJson(toGeoJSON(data_), 
  pointToLayer =  "#! function(feature, latlng){
    return L.marker(latlng, {icon: L.Icon.extend({
      options: {
        shadowUrl: 'leaf-shadow.png',
        iconSize:     [38, 95],
        shadowSize:   [50, 64],
        iconAnchor:   [22, 94],
        shadowAnchor: [4, 62],
        popupAnchor:  [-3, -76]
      }})
    })
  } !#"
)

Show examples of how to do this; try to incorporate it inside a function, though we may not be able to, since we can't define a JavaScript variable/function and use it later.
