Architect Big Data Solutions with Apache Spark


Introduction

This repository contains the lectures and code for a course that provides a gentle introduction to building distributed big data pipelines with Apache Spark. Apache Spark is an open-source data processing engine for engineers and analysts that includes an optimized general execution runtime and a set of standard libraries for building data pipelines, advanced algorithms, and more. Spark is rapidly becoming the compute engine of choice for big data. Spark programs are typically more concise than equivalent Hadoop MapReduce jobs and, depending on the workload, can run 10-100 times faster, with the largest gains on in-memory workloads. As companies realize this, Spark developers are becoming increasingly valued.

In this course we will cover both the architectural and the practical side of using Apache Spark to implement big data solutions. We will use Spark Core, Spark SQL, Spark Streaming, and Spark ML to implement advanced analytics and machine learning algorithms in a production-like data pipeline. The course will help you master designing solutions for common big data tasks such as creating batch and real-time data processing pipelines, doing machine learning at scale, and deploying machine learning models into a production environment.


Content

  1. Introduction [lecture 1] [labs] [pyspark Python cheat sheet]
  2. SQL and DataFrame [labs] [pyspark SQL cheat sheet]
  3. Batch Processing [lecture 2] [lecture 3]
  4. Stream Processing [lecture 4] [lecture 5] [labs]
  5. Machine Learning [lecture 6] [labs]

Computational Resources

  1. Please register for the Community Edition of Databricks here.
  2. Please register for a free-tier AWS account here.

Data Sources

You can find data and additional information from the links below:

  1. MovieLens DataSet
  2. House Prices: Advanced Regression Techniques
  3. Titanic: Machine Learning from Disaster

Note: For your convenience, the data has already been downloaded to the Datasets folder of this repository.

Note: You can upload data to Databricks directly or use an AWS S3 bucket for storage.
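
The two storage options in the note above could be sketched as follows. This is a minimal illustration, not the course's own loader. The helper names `dataset_path` and `load_csv`, the file name `ratings.csv`, and the bucket name are hypothetical. The `dbfs:/FileStore/tables/` location assumes the default target of the Databricks upload UI, and the S3 variant assumes the cluster already has credentials for the bucket.

```python
def dataset_path(filename, s3_bucket=None, prefix="Datasets"):
    """Return a Spark-readable URI for a course dataset.

    Without a bucket, assume the file was uploaded through the Databricks UI,
    which by default places uploads under /FileStore/tables on DBFS.
    With a bucket, assume the layout s3a://<bucket>/<prefix>/<filename>.
    """
    if s3_bucket is None:
        return f"dbfs:/FileStore/tables/{filename}"
    return f"s3a://{s3_bucket}/{prefix}/{filename}"


def load_csv(spark, path):
    """Read a CSV file with a header row, letting Spark infer column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


# In a Databricks notebook the `spark` session is already defined, so usage
# would look like this (file and bucket names are placeholders):
# ratings = load_csv(spark, dataset_path("ratings.csv"))
# ratings = load_csv(spark, dataset_path("ratings.csv", s3_bucket="my-course-bucket"))
```

Passing the session in as an argument keeps the helpers free of a hard PySpark import, so the same code works in a notebook and in a local script that builds its own `SparkSession`.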


Additional Resources

We provide links to useful cheat sheets and books to make the course as smooth as possible:

  1. A Gentle Introduction to Apache Spark
  2. How to import Data to DataBricks using S3
  3. Python Cheat Sheet
  4. Machine Learning Tutorial for AWS
  5. DataBricks Development Documentation
  6. Developers Guide for AWS Machine Learning
  7. Superset

Course Initiative:

If you like the initiative, please star or fork this repository and feel free to contribute with pull requests.


Places where this course has been taught (physically)

Contributors

ekhtiar, mgarriga, osin-vladimir, stefanodallapalma

architect_big_data_solutions_with_spark's Issues

Course extensions

  • Add Scala samples
  • Add hotel recommender
  • Cross-cloud support/examples (GCP, AWS)
  • End-to-end pipelines
  • Medium blog for course
  • Rebuild labs and lectures

Make Code Fully Reproducible

  • Change datasets (e.g. English-based, lightweight)
  • Split content by modules
  • 30 minutes to start working with code
  • Create syllabus, with links
  • Make sure all code is executable in Databricks
