edx-cs100.1-big-data-with-apache-spark's Introduction

edX-CS100.1-Big-Data-with-Apache-Spark

Introduction to Big Data with Apache Spark (BerkeleyX CS100.1x). Ended Jul 07, 2015.

COURSE OVERVIEW

Organizations use their data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that teach students how to manipulate datasets using parallel processing with PySpark.

COURSE CONTENT

Week 1: Big Data and Data Science

    Introduction to Big Data and Data Science - learn about big data and see examples of how data science can leverage big data
    Performing Data Science and Preparing Data - explore data science definitions and topics, and the process of preparing data
    Setting up the Course Software Environment - download and install the course software, run your first Apache Spark notebook, and submit your first assignment

Week 2: Introduction to Apache Spark

    Big Data, Hardware Trends, and the History of Apache Spark - discuss big data and hardware trends, and learn about the history of Apache Spark
    Spark Essentials - learn about Spark's Resilient Distributed Datasets, transformations, and actions 
    Lab 1: Learning Apache Spark - perform your first course lab where you will learn about the Spark data model, transformations, and actions, and write a word counting program to count the words in all of Shakespeare's plays
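The word-count exercise in Lab 1 can be sketched in plain Python so that it runs without a Spark installation. This is only an illustration of the flatMap, map, and reduce-by-key pattern, not the lab's actual solution; the helper names `tokenize` and `word_count` and the sample lines are assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line, drop punctuation, and split it into words."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", line.lower())
    return cleaned.split()


def word_count(lines):
    """Count words across lines, mirroring a Spark word-count pipeline.

    In PySpark the same steps would typically be a flatMap over the
    lines, a map to (word, 1) pairs, and a reduceByKey that sums them.
    """
    # flatMap step: one stream of words from many lines
    words = (word for line in lines for word in tokenize(line))
    # map + reduceByKey step: add 1 per occurrence, grouped by word
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts


# Hypothetical two-line input standing in for the Shakespeare text
sample_lines = ["To be, or not to be,", "that is the question."]
counts = word_count(sample_lines)
print(counts.most_common(3))
```

In Spark the counting step runs in parallel across partitions, and an action such as `takeOrdered` or `collect` triggers the computation; the plain-Python version runs eagerly on a single machine.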

Week 3: Data Management

    Semi-Structured Data - explore the concept of semi-structured data and how tabular data is handled in Spark
    Structured Data - learn about structured data, the relational data model, SQL, and joins in SQL and Spark 
    Lab 2: Web Server Log Analysis with Apache Spark - use Spark to explore a NASA Apache web server log in the second course lab 
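As a rough sketch of the kind of parsing Lab 2 involves, the following plain-Python example parses lines in Apache Common Log Format and summarizes status codes. The regular expression, the `parse_line` helper, and the sample lines are assumptions made for illustration; they are not taken from the course material or the NASA dataset.

```python
import re
from collections import Counter

# Apache Common Log Format: host, identity, user, [timestamp],
# "method endpoint protocol", status, response size ("-" if none)
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    size = match.group(9)
    return {
        "host": match.group(1),
        "timestamp": match.group(4),
        "method": match.group(5),
        "endpoint": match.group(6),
        "status": int(match.group(8)),
        "bytes": 0 if size == "-" else int(size),
    }


# Made-up sample lines, including one that is deliberately malformed
sample_log = [
    'host1.example.com - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839',
    'host2.example.com - - [01/Aug/1995:00:00:07 -0400] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]

parsed = [parse_line(line) for line in sample_log]
records = [rec for rec in parsed if rec is not None]      # filter step
status_counts = Counter(rec["status"] for rec in records)  # count by key
not_found = sorted({rec["endpoint"] for rec in records if rec["status"] == 404})
print(status_counts, not_found)
```

With Spark, `parse_line` would be applied through a `map` over the log RDD, followed by a `filter` to drop unparsed lines, so the same function can be reused unchanged.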

Week 4: Data Quality, Exploratory Data Analysis, and Machine Learning

    Data Quality - learn about the challenges of data quality and cleaning
    Exploratory Data Analysis - understand the statistics of Exploratory Data Analysis and data distributions
    Machine Learning - learn about Spark's machine learning library, MLlib
    Lab 3: Text Analysis and Entity Resolution - perform text analysis and entity resolution on Google and Amazon product listings using Spark in the third course lab 
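One common way to approach entity resolution on product listings is to compare TF-IDF weighted token vectors with cosine similarity. The overview above does not say which method the lab uses, so treat the sketch below as an assumption about technique. It is plain Python, the product strings are invented, and the smoothed IDF formula is one choice among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_weights(token_lists):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Term frequency times IDF, as a sparse dict vector."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: (cnt / total) * idf[tok] for tok, cnt in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented listings: the first two are meant to describe the same product
listings = [
    "Adobe Photoshop Elements 6 software",
    "adobe photoshop elements 6.0 win",
    "QuickBooks Pro accounting software",
]
token_lists = [tokenize(text) for text in listings]
idf = idf_weights(token_lists)
vectors = [tfidf(tokens, idf) for tokens in token_lists]

same_product = cosine(vectors[0], vectors[1])
different_product = cosine(vectors[0], vectors[2])
print(same_product, different_product)
```

Pairs whose similarity exceeds a chosen threshold would be treated as candidate matches; picking that threshold is a precision and recall trade-off that depends on the data.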

Week 5: Data Management

    Lab 4: Introduction to Machine Learning with Apache Spark - use Spark's MLlib machine learning library to perform collaborative filtering on a movie dataset in the fourth course lab
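Collaborative filtering by matrix factorization can be illustrated without Spark. MLlib's implementation uses alternating least squares (ALS); the sketch below deliberately swaps in plain stochastic gradient descent to stay short, so it shows the idea of learning user and item factors rather than MLlib's algorithm. The ratings, the `train_mf` and `rmse` helpers, and all hyperparameters are invented for this example.

```python
import math
import random


def train_mf(ratings, rank=2, epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_f = {u: [rng.uniform(0, 0.1) for _ in range(rank)] for u in users}
    item_f = {i: [rng.uniform(0, 0.1) for _ in range(rank)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(p * q for p, q in zip(user_f[u], item_f[i]))
            err = r - pred
            for k in range(rank):
                p, q = user_f[u][k], item_f[i][k]
                # Gradient step with L2 regularization on both factors
                user_f[u][k] += lr * (err * q - reg * p)
                item_f[i][k] += lr * (err * p - reg * q)
    return user_f, item_f


def rmse(ratings, user_f, item_f):
    """Root mean squared error of predictions on the given ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(p * q for p, q in zip(user_f[u], item_f[i]))
        total += (r - pred) ** 2
    return math.sqrt(total / len(ratings))


# Invented (user, movie, rating) triples on a 1-5 scale
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.0), ("u1", "m3", 1.0),
    ("u2", "m1", 4.0), ("u2", "m3", 2.0),
    ("u3", "m2", 1.0), ("u3", "m3", 5.0),
]

initial_rmse = rmse(ratings, *train_mf(ratings, epochs=0))
final_rmse = rmse(ratings, *train_mf(ratings))
print(initial_rmse, final_rmse)
```

This measures error on the training ratings only. A realistic evaluation would hold out a validation set and compare values of the rank and regularization strength against it.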

edx-cs100.1-big-data-with-apache-spark's People

Contributors

rajeshthallam
