
data_cleaning

Basic workflow for biodiversity data cleaning using R

0. Loading packages

For this exercise, we will use the R environment. If you need to download it, go to https://www.r-project.org/. We strongly suggest using an editor, and we recommend RStudio (https://rstudio.com).

For this tutorial, you will need to install the R packages rgbif, Taxonstand, CoordinateCleaner, and maps. If you don't have them installed, use the following commands:

install.packages("rgbif")
install.packages("Taxonstand")
install.packages("CoordinateCleaner")
install.packages("maps")

Then, we'll start loading the packages.

library(rgbif)
library(Taxonstand)
library(CoordinateCleaner)
library(maps)

1. Getting the data

First, let's download occurrence data for Cariniana legalis (Mart.) Kuntze, a South American tree species in the family Lecythidaceae.


species <- "Cariniana legalis"
occs <- occ_search(scientificName = species, 
                   limit = 500)$data # in rgbif >= 3, records are in the $data element
nrow(occs) # number of records 

In the raw data, we have 500 records, which is the default record limit of occ_search().

Column names returned from GBIF follow the Darwin Core standard (https://dwc.tdwg.org).

colnames(occs)

Exporting raw data

In order to guarantee the documentation of all steps, saving the raw data is essential. We will create a directory to save the data and then export it as a CSV file (comma-separated values).

dir.create("data")
write.csv(occs, 
          "data/raw_data.csv", 
          row.names = FALSE)

2. Checking species taxonomy

Let's check the unique entries for the species name we just searched.

sort(unique(occs$scientificName))

In this particular case, we have two synonyms: Cariniana brasiliensis and Couratari legalis. The GBIF data already contains a column showing the currently accepted taxonomy:

table(occs$taxonomicStatus)

Let's use the function TPL() from the package Taxonstand to check whether the taxonomic updates in the GBIF data are correct. This function receives a vector containing a list of species names and performs both orthographical and nomenclatural checking. Nomenclature checking follows The Plant List.

We will first generate a list of unique species names and later combine the results with the data. This is preferable because we do not need to check the same name more than once and, when working with several species, it makes the workflow faster.

species.names <- unique(occs$scientificName) 
tax.check <- TPL(species.names)

Let's check the output:

tax.check

Note that the function adds several new variables to the input data, creating columns such as New.Genus and New.Species with the accepted name. We should adopt these names when the column New.Taxonomic.status is filled with "Accepted".

We will merge the new genus and species and then add them to the original data.

# creating new object w/ original and new names after TPL
new.tax <- data.frame(scientificName = species.names, 
                      genus.new.TPL = tax.check$New.Genus, 
                      species.new.TPL = tax.check$New.Species,
                      status.TPL = tax.check$Taxonomic.status,
                      scientificName.new.TPL = paste(tax.check$New.Genus,
                                                     tax.check$New.Species)) 
# now we are merging raw data and checked data
occs.new.tax <- merge(occs, new.tax, by = "scientificName")
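The acceptance rule above can be sketched on toy data standing in for the merged table (column names follow the merge step; the species names and statuses here are illustrative):

```r
# Toy stand-in for occs.new.tax after merging with the TPL results
checked <- data.frame(
  scientificName         = c("Cariniana brasiliensis", "Cariniana legalis"),
  scientificName.new.TPL = c("Cariniana legalis", "Cariniana legalis"),
  status.TPL             = c("Accepted", "Accepted"),
  stringsAsFactors       = FALSE
)
# adopt the TPL name only when its status is "Accepted"; otherwise keep
# the original name for manual inspection
checked$final.name <- ifelse(checked$status.TPL == "Accepted",
                             checked$scientificName.new.TPL,
                             checked$scientificName)
checked$final.name
```

Keeping the original name when the status is not "Accepted" avoids silently adopting names that still need expert review.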

Exporting data after taxonomy check

To guarantee the documentation of all steps, we will export the data after the taxonomy check.

write.csv(occs.new.tax, 
          "data/data_taxonomy_check.csv", 
          row.names = FALSE)

3. Checking species' coordinates

First, let's inspect visually the coordinates in the raw data.

plot(decimalLatitude ~ decimalLongitude, data = occs)
map(add = TRUE) # overlay country borders from the maps package

Now we will use the function clean_coordinates() from the CoordinateCleaner package to clean the species records. This function checks for common errors in coordinates, such as coordinates of institutions, coordinates at sea, outliers, zeros, country centroids, etc. Because the function does not accept missing values (NA), we will first keep only records that have numerical values for both latitude and longitude.

Note: at this point, having a unique ID code for each observation is essential. The raw data already provides one in the column key, which we will use for merging below.
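A quick sanity check on toy data (the keys are hypothetical) shows how to confirm the IDs are unique, and how to drop duplicated records if they are not:

```r
# Toy occurrence table standing in for occs (hypothetical record keys)
occs.toy <- data.frame(key             = c(10, 11, 11, 12),
                       decimalLatitude = c(-22.9, -23.5, -23.5, -21.1))
anyDuplicated(occs.toy$key)                      # 3: position of the first duplicate (0 = none)
occs.dedup <- occs.toy[!duplicated(occs.toy$key), ]
nrow(occs.dedup)                                 # 3 unique records remain
```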

occs.coord <- occs[!is.na(occs$decimalLatitude) 
                   & !is.na(occs$decimalLongitude),]

Now that we don't have NA in latitude or longitude, we can perform the coordinate cleaning.

# output w/ only potential correct coordinates
geo.clean <- clean_coordinates(x = occs.coord, 
                               lon = "decimalLongitude",
                               lat = "decimalLatitude",
                               species = "species", 
                               value = "clean")

Let's plot the output of the clean data.

par(mfrow = c(1, 2))
plot(decimalLatitude ~ decimalLongitude, data = occs)
map(add = TRUE)
plot(decimalLatitude ~ decimalLongitude, data = geo.clean)
map(add = TRUE)
par(mfrow = c(1, 1))

When setting value = "clean", the function returns only the potentially correct coordinates. For checking and reproducibility, we want to save the full output with the flags generated by the routine, so let's try a different output.

occs.new.geo <- clean_coordinates(x = occs.coord, 
                                  lon = "decimalLongitude",
                                  lat = "decimalLatitude",
                                  species = "species", 
                                  value = "spatialvalid")
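With value = "spatialvalid", the output keeps every record and adds logical flag columns, including an overall flag named .summary (TRUE means the record passed all tests). A minimal sketch on a toy stand-in for that output:

```r
# Toy stand-in for a "spatialvalid" result (keys and flags are made up)
geo.flags <- data.frame(key      = c(10, 11, 12, 13),
                        .summary = c(TRUE, TRUE, FALSE, TRUE))
table(geo.flags$.summary)    # overview of clean vs. flagged records
sum(!geo.flags$.summary)     # 1 record failed at least one test
```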

Then, we merge the raw data with the cleaned data.

# merging w/ original data
occs.new.geo2 <- merge(occs, occs.new.geo, 
                       all.x = TRUE, 
                       by = "key") 
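Because we removed records with missing coordinates before cleaning, all.x = TRUE matters here: it performs a left join that keeps every raw record, filling NA where no cleaned match exists. A minimal sketch with hypothetical keys:

```r
# Left-join behaviour of merge(all.x = TRUE): raw record 11 had no
# coordinates, so it has no row in the cleaned table
raw    <- data.frame(key = c(10, 11, 12))
clean  <- data.frame(key = c(10, 12), flagged = c(FALSE, TRUE))
joined <- merge(raw, clean, all.x = TRUE, by = "key")
joined
# key 11 gets NA in 'flagged' because it was never checked
```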

Exporting the data after coordinate check

write.csv(occs.new.geo2, 
          "../data/data_coordinate_check.csv", 
          row.names = FALSE)

This is just a quick example of a data-cleaning workflow using tools available in R.

Contributors

saramortara, andreasancheztapia
