barthoekstra / brc-data-preprocessor

The data preprocessor checks the raw Batumi Raptor Count data coming straight from the Trektellen database. It flags records containing possibly erroneous or suspicious information, but does not delete any data. It is up to coordinators and data technicians to decide what to do with the flagged records.

Home Page: https://www.batumiraptorcount.org

License: GNU General Public License v3.0

Python 99.48% Dockerfile 0.52%

brc-data-preprocessor's Introduction

brc-data-preprocessor

Author: Bart Hoekstra | Mail: [email protected]

General workflow

The preprocessor runs on AWS Lambda and regularly checks the Trektellen site for newly uploaded BRC counts. Once both stations have uploaded data for the day, the fetcher downloads the data and stores a raw version in Dropbox (e.g. in 2019/data/raw). The preprocessor then checks a copy of the raw data for a range of possible errors and flags them by adding a description of the potential problem to a check column in the file stored in 2019/data/inprogress. It is then up to coordinators to use their experience and knowledge of the migration on a given day to determine the validity of the flags added by the preprocessor and act accordingly. Once they have dealt with these issues and emptied the check column of flags, the file can be moved to 2019/data/clean. A copy of the checked file is stored in 2019/data/inprogress-backup, so data technicians can check how the data have been changed.
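The rule that a file may only move to the clean folder once its check column is empty can be sketched as follows. This is a minimal illustration and not the repository's code: it assumes the in-progress file is a CSV with a header row, and the function name `unresolved_flags` is hypothetical. Only the `check` column name comes from the description above.

```python
import csv
import io


def unresolved_flags(csv_text, check_column="check"):
    """Return (line number, flag text) for every row that still carries a flag.

    Assumes CSV input with a header row; line numbers count the header as line 1.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    remaining = []
    for line_no, row in enumerate(reader, start=2):
        flag = (row.get(check_column) or "").strip()
        if flag:
            remaining.append((line_no, flag))
    return remaining


# Made-up example data: one clean row and one row that is still flagged.
sample = "speciesname,count,check\nHB,5,\nBK,3,possible doublecount\n"
print(unresolved_flags(sample))
```

An empty result would mean every flag has been dealt with and the file is ready to be moved.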

Flagged records

The following records will be flagged by the preprocessor:

  • Records with invalid doublecount entries (e.g. not within 10 minutes or with the wrong distance code).
  • Records containing >1 bird that is injured and/or killed (rare occurrence).
  • Records lacking critical information in datetime, telpost, speciesname, count or location columns (very unlikely, but the possible result of a bug).
  • Records of birds in >E3 (rare occurrence).
  • Records with registered morphs for all species other than Booted Eagles (and Eleonora's Falcons).
  • Records of HB_NONJUV, HB_JUV, BK_NONJUV and BK_JUV if the number of aged birds is higher than the number of counted birds (HB and BK) within a 10-minute window around the age record.
  • Records of Honey Buzzards that should probably be single-counted (at Station 2 during the HB focus period).
  • Records of aged Honey Buzzards and Black Kites outside of expected distance codes (i.e. outside of W1-O-E1).
  • Records containing unexpected combinations of sex and/or age information.
  • Records with no timestamps, which are set to 00:00:00 during processing.
  • Records containing non-protocol species.
  • Records with age details in W3, E3 and >E3, excluding non-juvenile harriers with a sex, juvenile MonPalHen and juvenile/non-juvenile eagles.
  • Records of female Pallid Harriers with I or A age (legal per protocol, though very difficult to age in the field).
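As an illustration of how checks like these can flag a record without deleting anything, here is a minimal sketch of two of them (missing critical information and unexpected morphs). It is not the preprocessor's actual code. The critical column names and the `check` column are taken from the text above; the `morph` column name, the species code `BootedE`, the flag wording and the function names are assumptions made for the example.

```python
# Columns named as critical in the list above.
CRITICAL_COLUMNS = ("datetime", "telpost", "speciesname", "count", "location")

# Species for which a morph is expected. "BootedE" is a hypothetical code.
MORPH_SPECIES = {"BootedE"}


def add_flag(record, message):
    """Append a flag description to the record's check column."""
    existing = record.get("check", "")
    record["check"] = f"{existing}; {message}" if existing else message


def flag_record(record):
    """Flag suspicious content in one record (a dict); no data is removed."""
    missing = [col for col in CRITICAL_COLUMNS if record.get(col) in (None, "")]
    if missing:
        add_flag(record, "missing critical information: " + ", ".join(missing))
    if record.get("morph") and record.get("speciesname") not in MORPH_SPECIES:
        add_flag(record, "morph registered for unexpected species")
    return record


# Made-up record with an empty location and a morph on a Honey Buzzard.
example = flag_record({
    "datetime": "2019-09-01 10:00:00", "telpost": "1", "speciesname": "HB",
    "count": "5", "location": "", "morph": "D",
})
print(example["check"])
```

Collecting all flags in one text column keeps the original values untouched, which matches the principle that the decision stays with coordinators and data technicians.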

Todo

  • Implement automatic download of the data, flagging of suspicious records and storing of the data in Dropbox using AWS Lambda.
  • Automatically add START and END records to fetched data based on count start and end times.

Future additions

  • Implement checks for possibly erroneous records based on some statistical rules, e.g. the expected (daily) phenology of a species.

Build Lambda deployment Docker image (requires Docker and AWS CLI)

  1. Clone this repository.
  2. cd into this directory.
  3. Build the Docker image to generate a deployment image for the function.
    docker build --platform linux/amd64 -t brc-data-preprocessor-docker:v1 . 
    
  4. Tag the Docker image. Replace XXXXXX with your account ID.
    docker tag brc-data-preprocessor-docker:v1 XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
    
  5. Push the Docker image to the Amazon Elastic Container Registry (ECR). Replace XXXXXX with your account ID.
    docker push XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
    
  6. Update function. Replace XXXXXX with your account ID.
    aws lambda update-function-code --function-name brc-data-preprocessor-docker \
    --image-uri XXXXXX.dkr.ecr.eu-central-1.amazonaws.com/brc-data-preprocessor-docker:latest
    

brc-data-preprocessor's Issues

v1.0 Improvements + Minor changes

Improvements

  • Flag non-protocol species (rare occurrence, but makes filtering out e.g. EuroBirdwatch results easier as well). Flags of non-protocol species should replace all other flags if they have been flagged.
  • Flag aged and/or sexed birds in W3, E3 and >E3, excluding non-juvenile Harriers with a sex and juvenile/non-juvenile eagles.
  • Flag all adult or immature female Pallid Harriers in all distance codes.

Minor changes

  • Clarify that flagged double counts should not be changed during data-checking anymore.
  • Stop flagging Eleonora's Falcons with morph.
  • Change harrier combination flag to include ‘species’ reference, e.g. ‘unexpected species+age+sex combination’.
  • Let script change HB_AD to HB_NONJUV.
  • Add last missing/rare protocol species.

AWS Lambda

  • Change trigger timer to once every 15 minutes from earliest season count end time (October 21st) + 4 hours, to not inflate Trektellen view numbers too much. See CloudWatch Cron timers.
  • Check if all environment variables match with protocol, such as the HB focus period.

Consider flagging records with unusual numbers of birds

Ideally records are flagged statistically, but that is quite time-intensive to implement anytime soon. Alternatively, we can just flag records based on a simple set of rules, such as:

  • Any ‘rarity’ with count > 2 (e.g. EgyptianV, CrestedHB, ImperialE)
  • Any large eagle that is not LesserSE with count > 2
  • ...
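A rule set like the one proposed above could be sketched as follows. This is only an illustration of the idea, not an agreed design: the threshold of 2 and the codes `EgyptianV`, `CrestedHB`, `ImperialE` and `LesserSE` come from the bullets above, while `GreaterSE` and `SteppeE`, the flag wording and the function name are assumptions.

```python
# Rarities named as examples in the issue.
RARITIES = {"EgyptianV", "CrestedHB", "ImperialE"}

# Large eagles other than LesserSE. "GreaterSE" and "SteppeE" are hypothetical codes.
LARGE_EAGLES_EXCEPT_LESSER = {"GreaterSE", "SteppeE", "ImperialE"}


def unusual_count_flags(record, max_count=2):
    """Return flag descriptions for a record whose count looks unusually high."""
    species = record["speciesname"]
    count = int(record["count"])
    flags = []
    if species in RARITIES and count > max_count:
        flags.append(f"unusually high count for a rarity ({count})")
    if species in LARGE_EAGLES_EXCEPT_LESSER and count > max_count:
        flags.append(f"unusually high count for a large eagle ({count})")
    return flags


print(unusual_count_flags({"speciesname": "EgyptianV", "count": 3}))
print(unusual_count_flags({"speciesname": "LesserSE", "count": 50}))
```

Keeping the species sets and the threshold as plain data would make it straightforward to extend the rules later.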

Fix SSL verification

SSL issues on Trektellen's side were ignored by passing verify=False to requests, which disables certificate verification altogether; that is an inelegant solution.
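One possible direction is to keep verification switched on and, if needed, trust an additional CA bundle. The sketch below uses only the Python standard library (`ssl` and `urllib.request`) rather than requests, so it is a swapped-in technique and not the project's code. It assumes the failure stems from a certificate chain that the default trust store cannot complete, which has not been confirmed here; the function names and the `ca_bundle` path are hypothetical. With requests, the comparable option is to pass the path of a CA bundle as `verify` instead of `False`.

```python
import ssl
import urllib.request


def build_ssl_context(ca_bundle=None):
    """Return a verifying SSL context, optionally trusting an extra CA bundle.

    ca_bundle is a hypothetical path to a PEM file holding the missing
    certificates; without it, the system's default trust store is used.
    """
    context = ssl.create_default_context()
    if ca_bundle is not None:
        context.load_verify_locations(cafile=ca_bundle)
    return context


def fetch(url, ca_bundle=None, timeout=30):
    """Download a URL with certificate verification left enabled."""
    context = build_ssl_context(ca_bundle)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```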

Improvements to erroneous doublecount detection

Flagging of records where two doublecounts are consecutive rows is possibly too aggressive.

The text of the erroneous doublecount flag should probably be changed to encourage small corrections in cases where the code does not flag entirely correctly.

Add a check for counts with 0 birds

A count with zero birds and a duration of more than 0 is quite unusual, so a check for it should probably be added somewhere.

If there was no actual count, e.g. due to rain, the START and END times should be the same and duration should not be >0. See for example 2023-10-17, with obviously wrong start and end times.
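Such a check could look roughly like the sketch below, which flags a count that has zero birds but a duration greater than 0. The function name, its parameters and the flag wording are assumptions for illustration, and the times in the example are made up rather than taken from the actual data.

```python
from datetime import datetime


def zero_count_flag(start, end, total_birds):
    """Return a flag description for a zero-bird count that still has a duration.

    start and end are the datetimes of the START and END records.
    """
    duration = end - start
    if total_birds == 0 and duration.total_seconds() > 0:
        return "count with 0 birds but duration > 0"
    return None


# Made-up times: an eight-hour count without any birds is flagged.
print(zero_count_flag(datetime(2023, 10, 17, 7, 0), datetime(2023, 10, 17, 15, 0), 0))
# Identical START and END times (no actual count) are not flagged.
print(zero_count_flag(datetime(2023, 10, 17, 7, 0), datetime(2023, 10, 17, 7, 0), 0))
```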
