
spocc (SPecies OCCurrence)


Docs: http://docs.ropensci.org/spocc/

At rOpenSci, we have been writing R packages to interact with many sources of species occurrence data, including GBIF, VertNet, iNaturalist, and eBird. Other databases are out there as well, which we can pull in. spocc is an R package to query and collect species occurrence data from many sources. The goal is to create a seamless search experience across data sources, as well as unified outputs across data sources.

spocc currently interfaces with seven major biodiversity repositories:

  1. Global Biodiversity Information Facility (GBIF) (via rgbif) GBIF is a government-funded open data repository, built with several partner organizations, with the express goal of providing access to data on Earth's biodiversity. The data are made available by a network of member nodes, coordinating information from various participant organizations and government agencies.

  2. iNaturalist iNaturalist provides access to crowdsourced citizen science data on species observations.

  3. VertNet (via rvertnet) Similar to GBIF (see above), VertNet provides access to more than 80 million vertebrate records spanning a large number of institutions and museums, primarily covering four major disciplines (mammalogy, herpetology, ornithology, and ichthyology).

  4. eBird (via rebird) eBird is a database developed and maintained by the Cornell Lab of Ornithology and the National Audubon Society. It provides real-time access to checklist data, data on bird abundance and distribution, and community reports from birders.

  5. iDigBio (via ridigbio) iDigBio facilitates the digitization of biological and paleobiological specimens and their associated data, houses the resulting specimen data, and provides it via RESTful web services.

  6. OBIS OBIS (Ocean Biogeographic Information System) allows users to search marine species datasets from all of the world's oceans.

  7. Atlas of Living Australia ALA (Atlas of Living Australia) contains information on all the known species in Australia aggregated from a wide range of data providers: museums, herbaria, community groups, government departments, individuals and universities; it contains more than 50 million occurrence records.

The inspiration for this comes from users requesting a more seamless experience across data sources, and from our work on a similar package for taxonomy data (taxize).

BEWARE: In cases where you request data from multiple providers, especially when including GBIF, there could be duplicate records, since many providers' data eventually end up in GBIF. See ?spocc_duplicates, after installation, for more.
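For example, a minimal cross-source query might look like the sketch below (requires an internet connection; the exact columns of the combined data.frame may vary by spocc version):

```r
library("spocc")

# Query two sources at once; occ() returns one slot per data source
res <- occ(query = "Accipiter striatus", from = c("gbif", "inat"), limit = 25)

# Combine all sources into a single data.frame with standardized columns
df <- occ2df(res)
head(df)
```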

Learn more

spocc documentation: <docs.ropensci.org/spocc/>

Contributing

See CONTRIBUTING.md

Installation

Stable version from CRAN

install.packages("spocc", dependencies = TRUE)

Or the development version from GitHub

install.packages("remotes")
remotes::install_github("ropensci/spocc")
library("spocc")

Make maps

All mapping functionality is now in a separate package, mapr (formerly known as spoccutils), to make spocc easier to maintain. mapr is on CRAN.
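A sketch of the intended workflow, assuming mapr's map_leaflet() accepts occ() output as described in mapr's docs (requires an internet connection):

```r
library("spocc")
library("mapr")

# Fetch occurrences with spocc, then hand the result to mapr for mapping
dat <- occ(query = "Puma concolor", from = "gbif", limit = 100)
map_leaflet(dat)  # interactive Leaflet map of the occurrence points
```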

Meta

spocc's People

Contributors

dlebauer, emhart, hannahlowens, jeroen, karthik, maelle, sckott, stewid


spocc's Issues

Dealing with many species names

@ropensci/owners The occ function only accepts one species name (though it accepts many sources: gbif, ebird, etc.), so I wrote another function, occlist, that accepts many species.

Is this okay? I thought it was necessary given the complexity already in occ, and occlist has different S4 classes given the different setup.

Various issues with querying

Just a laundry list of things I'm coming up with as I go over workshop curriculum, with the call and the error.

  • Problem with ebird queries:
    hawks <- occ(query = "Buteo jamaicensis", from = c("ebird"), limit = 35)
    Error in list(fmt = "json", r = region, rtype = regtype, sci = species, : 'region' is missing
  • Problem with bison queries:
    hawks <- occ(query = "Buteo jamaicensis", from = c("bison"), limit = 35)
    Error in (function (species, type = "scientific_name", start = NULL, count = 10, : unused argument (what = "points")
  • lat and lon are swapped for some query results. Note the difference between gbif and ebird, e.g.:

red_tailed_hawk <- occ(query = "Buteo jamaicensis", from = c("gbif", "ebird"), limit = 35, ebirdopts = list(region = 'US'))

rt_hawk <-  occ2df(red_tailed_hawk)

> head(rt_hawk)
               name  longitude latitude prov
1 Buteo jamaicensis  -72.75626 44.33782 gbif
2 Buteo jamaicensis  -72.95540 44.38789 gbif
3 Buteo jamaicensis -122.29619 45.48817 gbif
4 Buteo jamaicensis  -73.07887 43.79315 gbif
5 Buteo jamaicensis  -93.68460 32.46638 gbif
6 Buteo jamaicensis -116.93248 32.61669 gbif
> tail(rt_hawk)
                name longitude   latitude  prov
65 Buteo jamaicensis  28.37477  -98.15052 ebird
66 Buteo jamaicensis  35.86407  -75.86201 ebird
67 Buteo jamaicensis  44.61453 -123.28182 ebird
68 Buteo jamaicensis  31.88652 -110.96856 ebird
69 Buteo jamaicensis  34.12420  -84.94828 ebird
70 Buteo jamaicensis  33.73912  -96.77625 ebird
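One way to detect and repair the swap shown above is to flag rows whose latitude is impossible (a rough sketch; fix_swapped_coords is a hypothetical helper, and it assumes an occ2df()-style data.frame with latitude and longitude columns):

```r
# Hypothetical helper: swap coordinates in rows where the latitude is
# outside [-90, 90], which can only happen if lat and lon were exchanged
fix_swapped_coords <- function(df) {
  swapped <- !is.na(df$latitude) & abs(df$latitude) > 90
  tmp <- df$latitude[swapped]
  df$latitude[swapped] <- df$longitude[swapped]
  df$longitude[swapped] <- tmp
  df
}

rt <- data.frame(
  name = "Buteo jamaicensis",
  longitude = c(-72.75626, 44.61453),   # second row has lon/lat swapped
  latitude  = c(44.33782, -123.28182),
  prov = c("gbif", "ebird")
)
fixed <- fix_swapped_coords(rt)
fixed$latitude  # 44.33782 44.61453
```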

Add ecoengine

Should we eventually integrate the ecoengine into this?

Upcoming changes in AntWeb

will require minor modifications in spocc, especially since results are now returned as an S3 class. Also, all requests are automatically limited to 1000 records if more are available, forcing the user to paginate (with offsets and limits) rather than getting a 500 error.

Maps: what do we want to do on maps?

We have options here. Some questions and thoughts:

  • How do we want to name these mapping functions? For now using same prefix, and different suffix with no separation, like maprcharts, mapgist, etc.?
  • Support maps that pull in shp files, etc. via other packages?
  • How do we want to create the connection between the output of an occurrence-data call and plotting? Right now, the main function to get occurrence data, occ(), outputs an S4 object of class occdat with slots for metadata, parameter settings for each data source, and the data for each data source. occ() is not vectorized, so you have to pass it to lapply, e.g., for more than one taxonomic name. For multiple species, we could vectorize occ, or just add a function that combines multiple outputs of occ into another class, e.g., occdat_many for occurrence data for many species, which can then be passed to a plotting function, coerced to an sp object, etc.

Add function to look up options for each source

Right now users would have to load each package separately and then look up the parameter settings that are available for each of inatopts, ebirdopts, etc. It would be easier to just have a function that prints out the args and their descriptions for each of the sources.
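Later versions of spocc do ship a function along these lines, occ_options(); the usage sketched below is an assumption about its interface (requires spocc to be installed):

```r
library("spocc")

# Look up the extra parameters a given source accepts, without loading
# the underlying client package yourself
occ_options("gbif")
```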

Travis issues

Not sure why we are still having so many issues getting Travis to pass any build of spocc. Looking over the latest fail, it seems like the issue is here:

Reading state information... Done
E: Unable to locate package r-cran-devtools
E: Unable to locate package r-cran-testthat
+for wait_time in 5 20 30 60
+echo 'Command failed, retrying in 60 ...'
Command failed, retrying in 60 ...
+sleep 60
+sudo apt-get install r-cran-devtools r-cran-testthat
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package r-cran-devtools
E: Unable to locate package r-cran-testthat
+echo 'Failed all retries!'
Failed all retries!
+exit 1

Do we still need devtools for the build to pass? Especially since we have now removed rCharts until that can appear on CRAN?

Warning message on load

Warning message:
replacing previous import by ‘rgbif::blanktheme’ when loading ‘spocc’

Metadata use cases

Following off of @cboettig issue in RNeXML ropensci/RNeXML#21 ...

What metadata would be useful for objects within this package?

  • data source (e.g., GBIF, iNaturalist, VertNet)
  • query parameter values used to generate the output (e.g., species = "Puma concolor")
  • date and time the query was made (could help with reproducibility, e.g. justifying why the same query made on different dates gives different results)
  • other...

Push version 1 to CRAN

I think all the open issues now are really enhancements. Once we remove the rCharts dependency and replace it with leafletR we should be free to complete the unit tests #28, do a quick code review, and submit to CRAN.

Yea or nay?

Removing rnpn as a source for now

I don't think it makes sense to include rnpn as a source anymore. It just has very little data per species, and it's really not meant to be a source of occurrence data. Also, it looks like the API is changing too (usa-npn/rnpn#4 (comment)), so it makes sense to take it out for now. This also means we have all our data source packages on CRAN, and are close to getting this to CRAN.

@ropensci/owners Okay with this?

Quick license Q

Is there a reason we are going with MIT + file LICENSE over CC0 for this?

Make new function to create data.frame from many outputs of occ() fxn

We have occ_todf to squash one output of occ (which can have data from many sources: gbif, inaturalist, etc.) into a single data.frame, but nothing for many outputs of occ.

Make new function to create data.frame from many outputs of occ() fxn, e.g.,

out <- lapply(c('species1','species2','species3'), function(x) occ(x, ...))
occmany_todf(out)

# giving a single data.frame with all data combined 
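Pending such a function, the combination step can be sketched in base R: run the to-data.frame step on each occ() output and row-bind the results (the stand-in data.frames below mimic that output so the sketch runs offline):

```r
# With real data this would be: dfs <- lapply(out, occ_todf)
dfs <- list(
  data.frame(name = "species1", longitude = -72.8, latitude = 44.3, prov = "gbif"),
  data.frame(name = "species2", longitude = -122.3, latitude = 45.5, prov = "inat")
)

# Row-bind into one data.frame, as occmany_todf() would
combined <- do.call(rbind, dfs)
combined
```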

Query by spatial extent

I know you can do this in iNaturalist, but I'm not sure about the other packages. Is there a way to allow people to query by a bounding box and return all species within that box? If so, can we implement this kind of query? It seems like it would be really useful.
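spocc later grew spatial search via a geometry argument to occ(); the sketch below assumes it accepts a bounding box as c(xmin, ymin, xmax, ymax) or a WKT string, and needs an internet connection:

```r
library("spocc")

# All occurrences inside a bounding box over northern California
# (assumed bbox order: xmin, ymin, xmax, ymax)
res <- occ(from = "gbif", geometry = c(-125.0, 39.0, -120.0, 42.0), limit = 50)
occ2df(res)
```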

example in mapggplot.r doesn't work

This example yields a map of west Africa.

#' ecoengine_data <- occ(query = "Lynx rufus californicus", from = "ecoengine")
#' mapggplot(ecoengine_data)

Class framework

Have we settled on a class framework for this? Does each search result get its own spot in a list? Do we want to build on @cboettig suggestion of just creating a standardized data frame and then writing a REML file to go with it?

Can't search multiple sources.

out <- occ(query='Setophaga caerulescens')
Error: client error: (404) Not Found
traceback()
8: stop(status$message, call. = FALSE)
7: stop_for_status(tt)
6: (function (species, type = "scientific_name", start = NULL, count = 10,
countyFips = NULL, county = NULL, aoi = NULL, aoibbox = NULL)
{
if (!is.null(county)) {
numbs <- fips[grep(county, fips$county), ]
if (nrow(numbs) > 1) {
message("\n\n")
print(numbs)
message("\nMore than one matching county found '",
county, "'!\nEnter row number of county you want (other inputs will return 'NA'):\n")
take <- scan(n = 1, quiet = TRUE, what = "raw")
if (length(take) == 0)
take <- "notake"
if (take %in% seq_len(nrow(numbs))) {
take <- as.numeric(take)
message("Input accepted, took county '", as.character(numbs[take,
"county"]), "'.\n")
countyFips <- paste0(numbs[take, c("fips_state",
"fips_county")], collapse = "")
}
else {
countyFips <- NA
message("\nReturned 'NA'!\n\n")
}
}
else if (nrow(numbs) == 1) {
countyFips <- paste0(numbs[, c("fips_state", "fips_county")],
collapse = "")
}
else {
stop("a problem occurred...")
}
}
url <- "http://bison.usgs.ornl.gov/api/search"
args <- compact(list(species = species, type = type, start = start,
count = count, countyFips = countyFips, aoi = aoi, aoibbox = aoibbox))
tt <- GET(url, query = args)
stop_for_status(tt)
out <- content(tt)
class(out) <- "bison"
return(out)
})(species = "Setophaga caerulescens")
5: do.call(bison, opts)
4: foo_bison(sources, x, bisonopts)
3: FUN("Setophaga caerulescens"[[1L]], ...)
2: lapply(query, loopfun)
1: occ(query = "Setophaga caerulescens")

Push a minor update to CRAN

New binaries for AntWeb and Ecoengine are now on CRAN.

AntWeb: Simplifies data retrieval and now returns an S3 class

Taxonomic scrubbing

Names are returned with too much detail, I think, e.g.: Accipiter striatus Vieillot, 1808. Is there scrubbing of the inputs? Or should we scrub the outputs? Or just parse the string down?
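As a stopgap, the authorship can often be parsed off the returned string (a rough sketch; scrub_name is a hypothetical helper that assumes binomial names, and real name parsing, e.g. GBIF's name parser, handles many more cases):

```r
# Strip trailing author + year, e.g. "Accipiter striatus Vieillot, 1808":
# keep only the first two whitespace-separated words of each name
scrub_name <- function(x) {
  vapply(strsplit(x, "\\s+"), function(w) paste(w[1:2], collapse = " "),
         character(1))
}

scrub_name("Accipiter striatus Vieillot, 1808")  # "Accipiter striatus"
```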

Support passing in diff sources for each name

Right now you can pass in many sources, like from = c('bison', 'antweb'), but the species searched for are all queried against each of those sources. I think we could support passing in a different source for each name, like

occ(queryfrom = list(c('gbif', 'Setophaga caerulescens'), c('bison', 'Spinus tristis'), list('antweb', c('formica','camponotus','tapinoma'))))

Where queryfrom accepts a list, with each element being a vector of length two like c('source', 'name') or a list like list('source', c('name1','name2',...))
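The dispatch logic this interface would need can be sketched offline (queryfrom and the plan structure here are hypothetical; with real queries each element would be routed to occ(query = ..., from = ...)):

```r
# Each element pairs one source with one or more names
queryfrom <- list(c("gbif", "Setophaga caerulescens"),
                  c("bison", "Spinus tristis"),
                  list("antweb", c("formica", "camponotus", "tapinoma")))

# Normalize each element into a (source, names) pair; a real implementation
# would then call occ(query = names, from = source) per element
plan <- lapply(queryfrom, function(x)
  list(source = x[[1]], names = unlist(x[-1])))

plan[[3]]$names  # "formica" "camponotus" "tapinoma"
```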

Methods issue.

in method for ‘coerce’ with signature ‘"occdat","SpatialPointsDataFrame"’: no definition for class “SpatialPointsDataFrame”

Class structure

@ropensci/owners Thoughts on the class structure?

Right now, occ and occlist give back a slot for each source out@gbif, out@ebird, etc., then within each source are slots for metadata and data (e.g., out@gbif@meta and out@gbif@data). The data slots can be coerced to a single data.frame for all sources using a wrapper function that coerces S4 classes.

Is it better to have occ and occlist give back a slot for all metadata for all sources and a slot for all data, then within each of the @meta and @data slots have source slots (e.g., out@meta@gbif or out@data@ebird)?

New function to clean data

From discussion on #43

  • Check for impossible values of lat and long
  • Check for contextually wrong values. That is, if 99 out of 100 lat/long coordinates are within the continental US, but 1 is in China, then perhaps something is wrong with that one point
  • Check for points in the wrong country, or points outside of a known country
  • Check for points that must be in a particular habitat, or must be outside of a particular habitat
  • Check for possible duplicate occurrence records across GBIF + other data sources (see discussion below)
  • Check for points that are at 0/0. There seem to be quite a few of these; they can't be right. Remove them.
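The first and last checks are easy to sketch in base R (clean_coords is a hypothetical helper for occ2df()-style data.frames; the country and habitat checks would need external polygon data):

```r
# Drop impossible and 0/0 coordinates from a data.frame with
# latitude/longitude columns
clean_coords <- function(df) {
  ok <- !is.na(df$latitude) & !is.na(df$longitude) &
    abs(df$latitude) <= 90 & abs(df$longitude) <= 180 &
    !(df$latitude == 0 & df$longitude == 0)
  df[ok, , drop = FALSE]
}

d <- data.frame(longitude = c(-72.8, 0, 200), latitude = c(44.3, 0, 10))
nrow(clean_coords(d))  # 1 -- only the first row survives
```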

Standardize options

We should have a list of options we want to include and make sure they're standardized across APIs. For instance, right now in rinat I don't have a date range, but it can be implemented by subsetting the returned data frame, or in the API. It seems like we should have a set of common options that we can standardize so they all take the same input form in occdat searches; e.g., we should have a standardized date format (which may differ among the underlying search packages).
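For dates specifically, a sketch of normalizing provider-specific strings to R's Date class before filtering (parse_occ_date is hypothetical, and the two formats tried are assumptions about what providers return):

```r
# Parse heterogeneous date strings to Date, trying common formats in order
parse_occ_date <- function(x) {
  for (fmt in c("%Y-%m-%d", "%m/%d/%Y")) {
    d <- as.Date(x, format = fmt)
    if (!is.na(d)) return(d)
  }
  as.Date(NA)
}

parse_occ_date("2013-05-01")  # "2013-05-01"
parse_occ_date("05/01/2013")  # "2013-05-01"
```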

Allow occ to proceed even with some services down

Right now AntWeb is down, and if we do

occ(query = "Bison bison", from = c("bison","antweb"))

we get

Error in (function (genus = NULL, species = NULL, scientific_name = NULL,  : 
  server error: (500) Internal Server Error

Yes, I realize we can't search for Bison in AntWeb, but the point is that the service is down, so it returns a 500 error.

We should allow the function to still retrieve the data from services that are up, and just pass on those that are down.
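A sketch of the per-source error handling this would take: wrap each source's fetch in tryCatch() so a down service yields an empty result plus a warning instead of aborting the whole call (try_source and the stub fetchers are hypothetical):

```r
# Wrap a data-fetching call so one failing source doesn't abort the rest
try_source <- function(fetch_fun) {
  tryCatch(fetch_fun(),
           error = function(e) {
             warning("source failed, returning no data: ", conditionMessage(e))
             data.frame()  # empty placeholder keeps downstream code working
           })
}

# Stub fetchers standing in for one healthy and one down service
up   <- function() data.frame(name = "Bison bison", prov = "bison")
down <- function() stop("server error: (500) Internal Server Error")

results <- lapply(list(up, down), try_source)
nrow(results[[1]])  # 1
nrow(results[[2]])  # 0
```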

Install error

Is anyone else getting an error when installing spocc?

When I try install(spocc):

* installing *source* package 'spocc' ...
** R
** tests
** preparing package for lazy loading
Warning: replacing previous import by 'rgbif::blanktheme' when loading 'spocc'
** help
*** installing help indices
** building package indices
** installing vignettes
Warning in file(con, "w") :
  cannot open file '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/spocc/doc/index.html': No such file or directory
Error in file(con, "w") : cannot open the connection
ERROR: installing vignettes failed
* removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/spocc'
Error: Command failed (1)

But when I try install(spocc, build_vignettes = FALSE), it all works fine.

Does anybody have a clue as to what's going on? I don't think I'm getting this error with other packages.

Custom icons in Leaflet maps

Here is a way to show custom icons on maps (via ramnathv/rCharts#301).

L1$geoJson(toGeoJSON(data_), 
  pointToLayer =  "#! function(feature, latlng){
    return L.marker(latlng, {icon: L.Icon.extend({
      options: {
        shadowUrl: 'leaf-shadow.png',
        iconSize:     [38, 95],
        shadowSize:   [50, 64],
        iconAnchor:   [22, 94],
        shadowAnchor: [4, 62],
        popupAnchor:  [-3, -76]
      }})
    })
  } !#"
)

Show examples of how to do this; try to incorporate it inside a function, though we may not be able to, since we can't define a JavaScript variable/function and use it later.
