
cob_datapipeline's Introduction

cob_datapipeline


cob_datapipeline is the repository that holds the Airflow DAGs (Directed Acyclic Graphs, i.e., data-processing workflows) and related scripts for Temple University Libraries' Library Search (tul_cob) indexing workflows.

These DAGs (and related scripts) are expected to run within an Airflow installation akin to the one built by our TUL Airflow Playbook (private repository).

Some DAG tasks in this repository use Temple University Libraries' centralized Python library, tulflow.

Local Development, QA, and Production environment usage of these DAGs is detailed below.

Prerequisites

Libraries & Packages

  • Python. Version as specified in .python-version.
  • Python Package Dependencies: see the Pipfile
  • Ruby (for running Traject via the TUL_COB Rails Application). These steps are tested with the following Ruby version:
    • 3.1.3
  • Ruby Libraries:
    • rvm
    • tul_cob gemset installed:
      rvm use 3.1.3@tul_cob --create
      

Airflow Variables

These variables are initially set from, and listed in, the variables.json file.
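
On Airflow 2.x the file can typically be loaded with the airflow variables import variables.json CLI command; inside a DAG, tasks then read the values through the Variable model. A minimal sketch (not code from this repository; AZ_CONFIGSET is one of the variable names used elsewhere in this README, and the default value here is made up):

from airflow.models import Variable

# Fall back to a placeholder so DAG parsing does not fail before
# variables.json has been imported into the Airflow metadata database.
az_configset = Variable.get("AZ_CONFIGSET", default_var="az-database")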

Airflow Connections

  • SOLRCLOUD: An HTTP Connection used to connect to SolrCloud.
  • AIRFLOW_S3: An AWS Connection (the connection type changed from S3 to AWS with the latest Airflow upgrade) used to manage the AWS credentials we use to interact with our Airflow Data S3 Bucket.
  • slack_api_default: Used to report DAG run successes and failures to our internal Slack channels.
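
A minimal sketch of how a task can resolve these connections at runtime (assuming Airflow 2.x import paths; the attribute names come from Airflow's standard Connection model, not from this repository's code):

from airflow.hooks.base import BaseHook

# SOLRCLOUD carries the SolrCloud host plus basic-auth credentials.
solr_conn = BaseHook.get_connection("SOLRCLOUD")
solr_auth = (solr_conn.login, solr_conn.password)

# AIRFLOW_S3 carries the AWS credentials used for the Airflow Data S3 Bucket.
s3_conn = BaseHook.get_connection("AIRFLOW_S3")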

Local Development

Local development relies on the Airflow Docker Dev Setup submodule.

This project uses the UNIX make command to build, run, stop, configure, and test DAGs for local development. Each make target first changes into the submodule directory so that the development setup there is used. See the Makefile for the complete list of available commands.

Related Documentation and Projects

Linting & Testing

Perform syntax and style checks on the Airflow code with pylint.

To install and configure pylint:

$ pip install pipenv
$ pipenv install --dev

To lint the DAGs:

$ pipenv run pylint cob_datapipeline

Use pytest to run unit and functional tests on this project.

$ pipenv run pytest

pylint and pytest are run automatically by CircleCI on each pull request.

Deployment

CircleCI checks (lints and tests) code and deploys to the QA server when development branches are merged into the main branch. Code is deployed to production when a new release is created. See the CircleCI configuration file for details.

cob_datapipeline's People

Contributors

bibliotechy, cdoyle-temple, cmharlow, dependabot-preview[bot], dependabot[bot], dkinzer, ebtoner, htomren, nomadicoder, relaxing, sensei100, tulibraries-devops, tulmachine

cob_datapipeline's Issues

bash vs. bashrc vs. profile (interactive/login vs. non-interactive/non-login shells) run by BashOperators

When the BashOperator runs a bash script, even as the airflow user, it runs bash as a non-interactive, non-login shell; thus the bash script cannot access rbenv (or any of the airflow user's bash environment setup).

For now this is temporarily fixed by sourcing .bashrc in the bash scripts themselves; what should we do to fix this properly in the future? bash -l for the BashOperator? Is there a missing profile that should be loaded?
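
One hedged sketch of the bash -l option mentioned above (assuming Airflow 2.x import paths; the DAG id and script path are hypothetical, not this repository's):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bash_login_shell_example",  # hypothetical, not one of this repo's DAGs
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # `bash -l -c` starts a login shell, so the airflow user's profile files
    # (and therefore rbenv/rvm setup) are loaded without each script having
    # to source ~/.bashrc itself. The script path is a placeholder.
    ingest_task = BashOperator(
        task_id="ingest_marc",
        bash_command="bash -l -c '/var/lib/airflow/scripts/ingest_marc.sh'",
    )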

Add automatic check for full reindex on production

We need to be extra careful when we are running a full re-index for production. Production currently uses two solr instances to manage full re-indexing. We run a full reindex on the instance that is not being used and then swap it with the instance that is being used.

Create a check that SSHes into production, retrieves the current $SOLR_URL, and exits out of the full reindex process if by some mistake we are trying to run a full reindex job on the current Solr instance.
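
A hedged sketch of such a check (the production hostname and the way $SOLR_URL is exposed on that host are assumptions):

import subprocess

def assert_not_live_solr(target_solr_url, prod_host="libprod.example.edu"):
    """Fail the full reindex early if the target is the live Solr instance."""
    # Ask the production box which Solr URL the running application is configured with.
    live_solr_url = subprocess.run(
        ["ssh", prod_host, "bash -l -c 'echo $SOLR_URL'"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if live_solr_url == target_solr_url:
        raise ValueError(
            f"Refusing to run a full reindex against the live Solr instance: {target_solr_url}"
        )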

Refactor AZ_CORE/AZ_CONFIGSET usage

We currently use both AZ_CORE and AZ_CONFIGSET. They hold essentially the same type of data, which we use to generate the AZ solr_url.

Once we move over completely to SolrCloud, we should remove AZ_CORE and just use AZ_CONFIGSET.
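
A sketch of the post-refactor shape (the URL layout is an assumption, not the repository's actual convention):

from airflow.models import Variable

def az_solr_url(solr_base):
    # After the refactor, AZ_CONFIGSET alone is enough to build the AZ solr_url.
    configset = Variable.get("AZ_CONFIGSET")
    return f"{solr_base}/solr/{configset}"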

Improve Alma OAI Endpoint

  • Confirm Alma OAI's last_updated field(s) per record in the ADM enhanced field accurately reflects the last_updated date of the bibliographic record, holdings, and possibly items. A test set of these records, with new enrichment added, is available on the FTP server (file name test_alma_bibs_2019041016_11157338920003811_new.1.xml)
  • Use an updated traject mapping in tul_cob; see tulibraries/tul_cob#1113
  • Shift OAI harvest time ranges to cover possible dropping of records between Alma OAI publication jobs (every 6 hours); see the sketch after this list
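
For the time-range point, a hedged sketch using Sickle (the endpoint placeholder, metadata prefix, and 30-minute pad are assumptions):

from datetime import datetime, timedelta

from sickle import Sickle

# Pad the `from` end of the harvest window so a record published right at the
# boundary of Alma's roughly six-hourly OAI publication job is not skipped.
PAD = timedelta(minutes=30)

def harvest(endpoint, last_harvest, until):
    oai = Sickle(endpoint)
    return oai.ListRecords(
        metadataPrefix="marc21",
        **{
            "from": (last_harvest - PAD).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "until": until.strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    )

# e.g. harvest("https://OUR-ALMA-OAI-ENDPOINT/request",
#              last_harvest=datetime(2019, 4, 10, 10, 0),
#              until=datetime(2019, 4, 10, 16, 0))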

How to have the Airflow task running Traject only log (not die) upon a Traject-to-Solr HTTP response error

We have document versioning turned on in Solr, which can sometimes cause Solr to respond with a 409 during a full reindex. In the current process this does not stop the Traject process; it just throws errors in the logfile. Airflow, however, sees that error code and considers the task failed. Could the Airflow script become a wrapper around Traject that swallows the error, while Traject retries the individual records (which slows the process)?
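
A hedged sketch of such a wrapper (the command and the log pattern are assumptions, not the repository's actual invocation):

import re
import subprocess

def run_traject(cmd):
    """Run Traject, but treat a failure as soft when the only errors logged
    are Solr 409 (version conflict) responses."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    print(output)  # keep the full Traject log visible in the Airflow task log
    if result.returncode == 0:
        return
    hard_errors = [
        line for line in output.splitlines()
        if "ERROR" in line and not re.search(r"\b409\b", line)
    ]
    if hard_errors:
        raise RuntimeError("Traject failed with non-409 errors; see log above.")
    print("Only Solr 409 (version conflict) errors seen; treating the run as successful.")

# Illustrative call, not the repo's actual command:
# run_traject(["bundle", "exec", "traject", "-c", "traject_config.rb", "marc_dump.xml"])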

Full reindex to Solr Dev Boxes

  • Full reindex using Airflow QA to Solr Dev 1
  • Full reindex using Jenkins to Solr Dev 2

Hold off on turning on partial (OAI) updates on these until dev review; requires tulibraries/grittyOps#59

How to manage/archive old sftp dumps

Currently this is a manual process, since the entire Alma sftp export is manual anyway.
Do we want to automate it with logrotate (my preference), or have an Airflow task do it?
And how many dumps do we keep? (Disk space vs. the ability to look back.)
Keep in mind we may greatly increase the frequency of full reindexing.
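
If the Airflow-task route is chosen, a minimal sketch (the directory layout, filename pattern, and retention count are assumptions):

import os
from pathlib import Path

def prune_alma_dumps(dump_dir, keep=3):
    """Keep only the `keep` newest Alma sftp dumps and delete the rest."""
    dumps = sorted(Path(dump_dir).glob("alma_bibs_*"),
                   key=os.path.getmtime, reverse=True)
    for old in dumps[keep:]:
        old.unlink()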

Stop hitting the SolrCloud URL over http

It looks like we hit the SolrCloud URL with http instead of https, which could be exposing our basic auth credentials. This happens because we don't specify http/https in connection.host and instead add the scheme programmatically.
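
A sketch of one possible fix (an assumption, not the repository's current code): default to https whenever connection.host does not already carry a scheme.

from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("SOLRCLOUD")
# Only add a scheme when the host does not already include one, and never
# fall back to plain http for a connection that carries basic auth.
solr_base = conn.host if "://" in conn.host else f"https://{conn.host}"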

Boundwith DAG: use S3

Switch the boundwith implementation in #91 to use S3 instead of local storage.

This should affect the OAI harvesting and HTTP API calls.

Additionally, update the reads as well to use the S3 location.

Create Web DAG that indexes to SolrCloud

Steps:

  • Gets number of solr docs for Web from SolrCloud Web Alias
  • Creates new Web SolrCloud collection
  • Indexes to new Web SolrCloud collection (ending with Web SolrCloud collection healthcheck)
  • Swaps Web SolrCloud Alias (same name as configset) to point to newly created & indexed SolrCloud collection (see the funcake DAGs for an example of the command, and the sketch after this list)
  • Gets number of solr docs for Web from SolrCloud Web Alias

Stretch goals:

  • remove the delete-all-records step from the indexing process
  • use the tulflow shared Slack tasks instead of locally defined tasks
  • Abstract out SolrCloud HTTP Operator tasks for reuse across cob_datapipeline (and then maybe to tulflow for reuse across all possible DAGs)
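
As a hedged illustration of the alias-swap step referenced above (the base URL, alias, and collection names are placeholders; CREATEALIAS itself is the standard SolrCloud Collections API action):

import requests

def swap_alias(solr_base, alias, new_collection, auth=None):
    """Re-point a SolrCloud alias at the newly created and indexed collection."""
    resp = requests.get(
        f"{solr_base}/solr/admin/collections",
        params={"action": "CREATEALIAS", "name": alias, "collections": new_collection},
        auth=auth,
    )
    resp.raise_for_status()

# e.g. swap_alias("https://solrcloud.example.edu", "web-content", "web-content-20210101")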

UI Icon 404 error for Airflow in QA

Error from webserver logs: "GET /static/appbuilder/fonts/glyphicons-halflings-regular.woff2 HTTP/1.1" 404 3695 "https://airflow.qa.tul-infra.page/static/appbuilder/css/bootstrap.min.css" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0"

Refactor harvest tasks to allow for no ALMA OAI harvests

The current OAI harvest task, almaoai_harvest.py, has a lot of baked-in assumptions about the incoming data that block it from being reused for non-Alma OAI feeds.

This includes, but is not limited to:

  • wrapping the data in a collection tag
  • Sickle harvest arguments

The harvest task should be refactored into a more generic OAI harvester, allowing extra steps to be added via parametrization.
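
A hedged sketch of the more generic shape (the parameter names are assumptions, not the tulflow or repository API):

from sickle import Sickle

def oai_harvest(endpoint, metadata_prefix, set_spec=None, process_record=None, **harvest_args):
    """Yield OAI records, leaving any feed-specific handling (for example,
    wrapping Alma records in a collection tag) to an injected callable
    instead of baking it into the harvester."""
    args = {"metadataPrefix": metadata_prefix, **harvest_args}
    if set_spec:
        args["set"] = set_spec
    for record in Sickle(endpoint).ListRecords(**args):
        yield process_record(record) if process_record else record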

Create AZ DAG copy that indexes to SolrCloud

Steps:

  • Gets number of solr docs for AZ from SolrCloud AZ Alias
  • Creates new AZ SolrCloud collection
  • Indexes to new AZ SolrCloud collection (ending with AZ SolrCloud collection healthcheck)
  • Swaps AZ SolrCloud Alias (same name as configset) to point to newly created & indexed SolrCloud collection (see funcake dags for example of command)
  • Gets number of solr docs for AZ from SolrCloud AZ Alias

Stretch goals:

  • remove the delete-all-records step from the indexing process
  • use the tulflow shared Slack tasks instead of locally defined tasks

Make sftp file code more robust?

Most of the code dealing with sftp dumps assumes there will be only one set of files present at a time. For a variety of reasons, that assumption can fail and bad things can happen as a result.

A regex for extracting the date from an export file exists; could we use that in other places to ensure we only operate on the latest dump?

On the other hand, reasoning about the state of sftp exports is just a heuristic and probably prone to edge cases. Ideally we'd want to start with having the sftp server pexpect code only download the latest export (and have that date used for all subsequent steps), but that process is hacky enough already.
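
A sketch of reusing that regex to select only the newest dump (the filename pattern is modeled on the example file name mentioned in the Alma OAI issue above and is an assumption):

import re
from pathlib import Path

EXPORT_DATE = re.compile(r"alma_bibs_(\d{10})_")

def latest_export_files(dump_dir):
    """Group export files by their embedded date stamp and return only the
    newest set, so stray older dumps are ignored rather than re-indexed."""
    files = [p for p in Path(dump_dir).iterdir() if EXPORT_DATE.search(p.name)]
    if not files:
        return []
    newest = max(EXPORT_DATE.search(p.name).group(1) for p in files)
    return [p for p in files if EXPORT_DATE.search(p.name).group(1) == newest]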

Create repo with traject fork

Get the fork of traject currently being used in Airflow onto GitHub as a separate repository so it can be used in automation and be maintained.

Is it possible to know the result of a Solr delete query?

Solr currently returns a nonzero status if there was a Solr-internal failure, but not if the query matched nothing and nothing was deleted.
It would be nice to know, for sanity checking, whether delete records are actually deleted, and also when they are preempted by another create record in the same harvest.
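
One workaround sketch: Solr's update response does not report how many documents a delete-by-query removed, but counting matches before and after the delete gives an equivalent signal (the URL shape is standard Solr; the collection name and query are placeholders):

import requests

def count_matches(solr_base, collection, query, auth=None):
    """Return numFound for `query`; call before and after the delete to see
    whether the delete actually removed anything."""
    resp = requests.get(
        f"{solr_base}/solr/{collection}/select",
        params={"q": query, "rows": 0},
        auth=auth,
    )
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

# before = count_matches(solr_base, "catalog", f"id:{record_id}")
# ... run the delete ...
# after = count_matches(solr_base, "catalog", f"id:{record_id}")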

Airflow Staging Infrastructure

  • VM for Airflow
  • Mount / Volume for Airflow VM
  • Run ansible-server-bootstrap playbook on above
  • Build the above in terraform using remote state

Airflow User not able to access Shell

See output @relaxing experienced:

Apr  9 21:13:47 li1241-105 airflow: /bin/sh: line 0: exec: bash: not found
Apr  9 21:13:47 li1241-105 airflow: [2019-04-09 21:13:47,172] {local_executor.py:91} ERROR - Failed to execute task Command 'exec bash -c 'airflow run tul_cob_reindex get_num_solr_docs_pre 2019-04-09T20:59:52.378315+00:00 --local -sd /var/lib/airflow/airflow/dags/cob_datapipeline/tul_cob_fullreindex_dag.py'' returned non-zero exit status 127..
