Serverless Spark CI/CD on AWS with GitHub Actions

This is the code used for my GitHub Universe presentation on using GitHub Actions with EMR Serverless.

There is also a workshop you can use with step-by-step instructions: Build analytics applications using Apache Spark with Amazon EMR Serverless

Pre-requisites

  • An AWS Account with Admin privileges
  • GitHub OIDC Provider in AWS
  • S3 Bucket(s)
  • EMR Serverless Spark application(s)
  • IAM Roles for GitHub and EMR Serverless

You can create all of these, including some sample data, using the included CloudFormation template.

Warning 💰 The CloudFormation template creates EMR Serverless applications that you will be charged for when the integration tests and the scheduled workflow run.

Note The IAM roles created in the template are very tightly scoped to the relevant S3 Buckets and EMR Serverless applications created by the stack.

Setup

To follow along, fork this repository into your own account, clone it locally, and do the following:

  1. Create the CloudFormation Stack
aws cloudformation create-stack \
    --stack-name gh-severless-spark-demo \
    --template-body file://./template.cfn.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters ParameterKey=GitHubRepo,ParameterValue=USERNAME/REPO ParameterKey=CreateOIDCProvider,ParameterValue=true
  • GitHubRepo is the user/repo format of your GitHub repository that you want your OIDC role to be able to access.
  • CreateOIDCProvider allows you to disable creating the OIDC endpoint for GitHub in your AWS account if it already exists.
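Once the stack finishes creating, you can pull the outputs you will need in the next steps. A hedged sketch, assuming the AWS CLI is configured for the same account:

```shell
# List the stack outputs (including TestApplicationId and ProductionApplicationId).
aws cloudformation describe-stacks \
    --stack-name gh-severless-spark-demo \
    --query "Stacks[0].Outputs" \
    --output table
```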
  2. Create an "Actions" Secret in your repo

Go to your repository settings, find Secrets on the left-hand side, then Actions. Click "New repository secret" and add a secret named AWS_ACCOUNT_ID with your 12 digit AWS Account ID.

Note This is not sensitive information; it just makes it easier to reuse the Actions.
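If you prefer the command line, the GitHub CLI can create the secret for you (the account ID below is a placeholder):

```shell
# Requires `gh auth login` first; run from inside your clone of the fork.
gh secret set AWS_ACCOUNT_ID --body "123456789012"
```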

  3. Update the Application IDs
  • In integration-test.yaml, replace TEST_APPLICATION_ID with the TestApplicationId output from the CloudFormation stack
  • In run-job.yaml, replace PROD_APPLICATION_ID with the ProductionApplicationId output from the CloudFormation stack
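Each of these is a simple text substitution in the workflow file. A self-contained sketch against a throwaway file, using a made-up application ID in place of your stack's output:

```shell
# Throwaway snippet standing in for the env block of integration-test.yaml.
cat > /tmp/integration-test.yaml <<'EOF'
env:
  APPLICATION_ID: TEST_APPLICATION_ID
EOF

# Substitute the placeholder with the TestApplicationId stack output (made up here).
sed -i 's/TEST_APPLICATION_ID/00abc123examples/' /tmp/integration-test.yaml
cat /tmp/integration-test.yaml
```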

The rest of the environment variables in your workflows should stay the same unless you deployed in a region other than us-east-1.

With that done, you should be able to experiment with pushing new commits to the repo, opening pull requests, and running the "Fetch Data" workflow.

You can view the status of your job runs in the EMR Serverless console.

Overview

The demo covers 4 specific use cases, each defined as a separate GitHub Actions workflow. These are intended to be easily reusable.

Creating a simple pytest unit test

The unit-tests.yaml file defines a very simple GitHub Action that runs on any push event. It runs the tests in the pyspark/tests/test_basic.py.
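You can run the same tests locally before pushing, assuming Python and the project's test dependencies are installed:

```shell
# From the repository root; -v lists each test as it runs.
cd pyspark
python3 -m pytest tests/test_basic.py -v
```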

Running integration tests on EMR Serverless

integration-test.yaml runs on any pull request and both 1/ copies the local PySpark code to S3 and 2/ runs an EMR Serverless job and waits until it completes.
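The job-run half of that workflow boils down to a single CLI call. A sketch with placeholder role ARN and bucket name (use your stack's actual values):

```shell
# Start a job run against the test application; all identifiers below are
# placeholders for the values created by your CloudFormation stack.
aws emr-serverless start-job-run \
    --application-id "$TEST_APPLICATION_ID" \
    --execution-role-arn arn:aws:iam::123456789012:role/example-job-role \
    --job-driver '{"sparkSubmit": {"entryPoint": "s3://example-bucket/github/pyspark/main.py"}}'
```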

Deploying EMR Serverless PySpark job to S3

When a semantic-versioned tag is added to the repository, deploy.yaml zips up files in the jobs folder, and copies the zip and main.py files to S3 in a location with the tag as part of the prefix.
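The packaging step is just a zip of the jobs folder plus a copy under a versioned prefix. A sketch with made-up bucket and tag names, using Python's stdlib zipfile so the demo has no extra dependencies:

```shell
# Made-up values; deploy.yaml derives these from the pushed tag and repo config.
TAG=v1.2.3
BUCKET=example-artifact-bucket

# Zip a throwaway jobs folder to demonstrate the packaging step.
mkdir -p /tmp/demo/jobs
echo "print('job helper')" > /tmp/demo/jobs/extras.py
python3 -m zipfile -c /tmp/demo/jobs.zip /tmp/demo/jobs

# The workflow would then copy the artifacts under the tag-based prefix, e.g.:
echo "aws s3 cp /tmp/demo/jobs.zip s3://${BUCKET}/artifacts/${TAG}/jobs.zip"
```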

Running ETL jobs manually or on a schedule

run-job.yaml runs the main.py script on a schedule with the version defined in the JOB_VERSION variable. The workflow_dispatch section also lets you run the job manually, which by default uses the "latest" semantic tag on the repository.
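If you want to trigger the manual run from a terminal instead of the Actions tab, the GitHub CLI can dispatch it (assumes an authenticated `gh` inside your clone):

```shell
# Kick off the production job via its workflow_dispatch trigger.
gh workflow run run-job.yaml

# Check on the most recent run of that workflow.
gh run list --workflow=run-job.yaml --limit 1
```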
