gcp-demo

Prerequisites

Background

This repo is for trying out GCP, including the Google Cloud SDK, BigQuery, and Dataproc with Spark. It mostly contains tutorial material from GCP as an introduction to the tools. It was created as an exploration of different tools in a tech stack.

Additionally, it contains an example that uses Java streams with MySQL and part of the NPR Story API.

(Eventually, the goal is to write the combined dataset to GCP. Currently the two parts are not connected; the repo is still a work in progress.)

Querying MySQL with ORM, Streams, and NPR API XML example

The src folder contains a Gradle/Java 8 sub-project that pulls records from a MySQL DB (using the MySQL connector and Hibernate ORM) running in a Docker stack managed with Compose.

  • A util class fetches genre data from the NPR API as XML.
  • Then a dataset of Users is fetched from the DB using the ORM.
  • Streams are used to perform simple operations:
    • First listing the Users
    • Then, filtering on Users with a favorite genre matching one selected from the API.
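The two stream operations listed above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: `DemoUser`, its fields, and `UserStreamSketch` are hypothetical stand-ins for the Hibernate-mapped entity, and the users are hard-coded instead of being loaded through the ORM.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the ORM-mapped user entity; field names are assumptions.
final class DemoUser {
    private final String name;
    private final String favoriteGenre;

    DemoUser(String name, String favoriteGenre) {
        this.name = name;
        this.favoriteGenre = favoriteGenre;
    }

    String getName() { return name; }
    String getFavoriteGenre() { return favoriteGenre; }
}

final class UserStreamSketch {
    // Step 1: list the users (here, by name).
    static List<String> listNames(List<DemoUser> users) {
        return users.stream()
                .map(DemoUser::getName)
                .collect(Collectors.toList());
    }

    // Step 2: keep only users whose favorite genre matches the selected genre,
    // ignoring case. In the demo the genre would come from the NPR API response.
    static List<DemoUser> withFavoriteGenre(List<DemoUser> users, String genre) {
        return users.stream()
                .filter(u -> genre.equalsIgnoreCase(u.getFavoriteGenre()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<DemoUser> users = Arrays.asList(
                new DemoUser("Ada", "Jazz"),
                new DemoUser("Grace", "Classical"),
                new DemoUser("Linus", "jazz"));

        listNames(users).forEach(System.out::println);
        withFavoriteGenre(users, "Jazz")
                .forEach(u -> System.out.println(u.getName()));
    }
}
```

Matching on a case-insensitive string is only one possible choice; a real implementation might compare genre IDs instead.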

The purpose of this demo was to take a look at the existing APIs NPR has available, as well as to imagine a use case that the replacement platform for the CMS Story API and Public Media Platform (PMP) might serve.

Running locally

Run the build script to build the MySQL Docker container and then run the Java 8 app:

./build.sh

or individually:

  1. Build and stage the Java app with Gradle:
./gradlew clean fatJar prepareEnvironment
  2. Then start the Docker stack, which builds the MySQL and Java containers and brings up the environment:
docker-compose -f ./docker/docker-compose.yml up --build
  3. The database initializes by loading the init SQL script (InnoDB setup and startup will take a minute), and the Java container waits for the DB to be healthy before starting the app.

Improvements

This was made as an exploratory project before an interview. To actually run an app or service like this, a number of improvements could be made.

Use of an ORM vs direct DB access

ORMs can be a useful abstraction on top of relational databases, but they can lead to issues. Direct DB access can be simpler and more modular in some cases, and many things that can be done with an ORM can also be done with a plain SQL query.

Interaction with NPR One API vs Story API

The NPR One API is well documented and seems to have ongoing support. In future iterations, if this were a sample app, it could interact with the NPR One API or use the JavaScript SDK.

XML parser

If the app keeps using the legacy API, there are many XML parsers available. SAX, Jackson XML, or others could be good refactors.
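As a rough idea of what a SAX-based refactor could look like, the sketch below collects the text of every `title` element using the SAX parser that ships with the JDK. The XML shape (`list`/`item`/`title`) and the `GenreSaxSketch` class are assumptions made for illustration; the real Story API response may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class GenreSaxSketch {
    // Returns the trimmed text of every <title> element in document order.
    // The element name "title" is an assumption about the XML layout.
    static List<String> parseTitles(String xml) {
        final List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <title>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName) && text != null) {
                    titles.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<list><item id=\"1\"><title>Jazz</title></item>"
                + "<item id=\"2\"><title>Classical</title></item></list>";
        System.out.println(parseTitles(xml));
    }
}
```

SAX streams through the document without building a tree, which keeps memory use low for large responses at the cost of handling parser state by hand.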

API Client

If this were a production app, it would be good to generalize the API client to cover multiple routes in the Story API. When pulling static content from that API, it could also be useful to cache the responses, provided the underlying data isn't subject to frequent change.
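One way the caching idea could be sketched is a thin wrapper that remembers the response per route. Everything here is hypothetical: `CachingClientSketch`, the route strings, and the `fetcher` function (which stands in for the real HTTP call) are illustrative names, and the cache has no expiry, so it only suits data that rarely changes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CachingClientSketch {
    // Performs the real request for a route; injected so any route can be served.
    private final Function<String, String> fetcher;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    CachingClientSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the cached body for a route, calling the fetcher only on first use.
    String get(String route) {
        return cache.computeIfAbsent(route, fetcher);
    }

    // Drops a cached entry so the next get() fetches it again.
    void invalidate(String route) {
        cache.remove(route);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fake fetcher standing in for an HTTP request to the API.
        CachingClientSketch client = new CachingClientSketch(route -> {
            calls.incrementAndGet();
            return "response for " + route;
        });

        client.get("/genres");
        client.get("/genres"); // served from the cache
        System.out.println("fetches performed: " + calls.get());
    }
}
```

Passing the fetcher in as a function keeps the cache independent of any particular HTTP library and makes it easy to test without network access.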

Integration with brightspot CMS

To integrate this project with brightspot-cms (the CMS that may be used in the NPR platform), a few things could be done in future iterations.

  1. Replace the Hibernate ORM with Dari, the data modeling framework used in Brightspot.
  2. Make the existing MySQL instance compatible with Brightspot/Dari. Dari has its own DDL (see the DDL here) that is used to make database versions compatible across vendors and that also loads the table schemas.
  3. Update the existing model classes to use Dari. Note that Dari doesn't store objects the way Hibernate and other ORMs do, where a class is mapped to a table. Instead, Dari stores them in the Record table, serializing each object as JSON and storing it in a blob.
    1. In this example project we would extend the Content class provided by Brightspot to implement this.

GCP Demo using Apache Spark and BigQuery.

Define a bucket with:

gsutil mb gs://gcp-demo-spark-bucket

Launch spark-shell with the BigQuery connector:

spark-shell --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar

or use the following if running python:

gcloud dataproc jobs submit pyspark ./demo/wordcount.py \
    --cluster gcp-demo \
    --region us-west2 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
spark-submit --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar wordcount.py

Acknowledgements
