Data Processing with Spark

Materials for the Advanced Data Processing course of the Big Data Analytics Master at the Universitat Politècnica de València.

This course gives a 30-hour overview of many concepts, techniques and tools for data processing with Spark, including some key concepts from Apache Beam. We assume you're familiar with Python, but all the exercises can easily be followed in Java or Scala. We've included a Vagrant definition and Docker images for both Spark and Beam.

If you find a bug or want to contribute comments, please file an issue in this repository or simply write to us. You're free to reuse the course materials; please follow the details in the License section.

Structure

Part A - Spark

  1. Brief intro to functional programming
  2. Spark basics
  3. PySpark: transformations, actions and basic IO
  4. Spark SQL
  5. MLlib
  6. Graphs
    • GraphX (Scala)
    • GraphFrames (Python)
  7. Spark cluster deployment
  8. Apache Beam
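
As a taste of topics 1 and 3, here is a minimal sketch of the functional style that Spark builds on, written in plain Python so it runs without a Spark installation. The sample sentences and the helper `add_pair` are illustrative assumptions and are not taken from the course materials. The comments name the PySpark RDD operations that play a roughly similar role.

```python
from functools import reduce

# Illustrative input; in Spark this would typically come from a file or an RDD.
lines = ["spark makes big data simple", "beam and spark share ideas", ""]

# "Transformations": these only describe the computation. Python's filter/map
# and generator expressions are lazy, much like RDD transformations.
non_empty = filter(lambda line: line != "", lines)              # cf. rdd.filter
words = (word for line in non_empty for word in line.split())   # cf. rdd.flatMap
pairs = map(lambda word: (word, 1), words)                      # cf. rdd.map


def add_pair(counts, pair):
    """Hypothetical helper: fold one (word, 1) pair into a new counts dict."""
    word, n = pair
    return {**counts, word: counts.get(word, 0) + n}


# "Action": reduce consumes the lazy pipeline and produces a concrete result,
# similar in spirit to rdd.reduceByKey followed by collect.
word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

Nothing is computed until `reduce` pulls values through the pipeline, which mirrors the way Spark defers work until an action is called.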

Part B - Architecture Workshop

Team work using Aronson's jigsaw technique. We present a set of real case studies, and teams have to design and develop a solution using any technology available on the market today.

In the first phase, the teams split up with the goal of becoming experts in a particular area and dig into the specifics of the proposed tools and frameworks. In the second phase, they return to their peers to design a system that covers the use case requirements. Each team then gives a 15-minute presentation to share its results.

Lecture Notes

To be added soon, stay tuned!

Source Samples

Assignments

Final course assignments can be found in this document. They are in Spanish; I'm planning to translate them for the 2017 edition.

I'm not publishing the solutions, so that I don't have to rewrite the exercises every year. There's a test suite using py.test to help you validate your results. If you're really interested in the solutions, please write to me at [email protected].
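
To give an idea of what a py.test check looks like, here is a minimal sketch. The exercise `top_words` and its test are hypothetical and are not taken from the actual course test suite.

```python
def top_words(lines, n):
    """Hypothetical exercise: return the n most frequent words.

    Words are lowercased; ties are broken alphabetically.
    """
    counts = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending count, then by word, and keep the first n entries.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def test_top_words():
    # py.test collects functions whose names start with "test_".
    lines = ["Spark and Beam", "spark again"]
    assert top_words(lines, 1) == [("spark", 2)]
    assert top_words(lines, 2) == [("spark", 2), ("again", 1)]
```

Saved in a file such as `test_top_words.py` (an assumed name), running `py.test` in the same directory should discover and run the test.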

Evaluation Criteria

Self-sufficiency is the state of not requiring any aid, support, or interaction, for survival; it is therefore a type of personal or collective autonomy - Wikipedia.

We follow a self-sufficiency principle to drive the course goals. At the end of the course, students should have enough knowledge and tools to develop small data processing solutions on their own.

  1. Student understands the underlying concepts behind Spark, and is able to write data processing scripts using PySpark, Spark SQL and MLlib.
  2. Student is capable of identifying common data processing libraries and frameworks and their applications.
  3. Student is capable of working in a team to design a system that covers a simple data processing scenario, understanding the basic implications of the choices made on systems, languages, libraries and platforms.

Readings and links

We recommend the following papers to expand knowledge on Spark and other data processing techniques:

Roadmap

Some ideas we might add in forthcoming course editions:

  • Code samples in Python notebooks
  • Apache Flink and Apache Beam (2017)
  • Add Tachyon content and exercises
  • Add Kafka source to the streaming sample
  • Introduce samples with Minio / InfiniSpan
  • Improve deployment scenarios and tools: Mesos, Chef, etc. (2017)
  • Monitoring using Prometheus and Grafana, provide ready-to-use docker containers
  • Profiling of Spark applications (Scala only)
  • Translate all content to English and Spanish

License

Advanced Data Processing course materials.
Copyright (C) 2016, Luis Belloch

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Recommended citation

Luis Belloch, course materials for Advanced Data Processing, Spring 2016. Master on Big Data Analytics (http://bigdata.inf.upv.es), Universitat Politècnica de València. Downloaded on [DD Month YYYY].
