Scalable Data Science

Scalable Data Science is a technical course in the area of Big Data, aimed at the needs of the data industry. The course uses Apache Spark, a fast and general engine for large-scale data processing, via Databricks to compute with datasets that won't fit on a single computer. It introduces Spark's core concepts through hands-on coding, including resilient distributed datasets and map-reduce algorithms, DataFrames and Spark SQL on Catalyst, scalable machine-learning pipelines in MLlib, and vertex programs using the distributed graph-processing framework GraphX. We will solve instances of real-world big data decision problems from various scientific domains.
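As a rough illustration of the map-reduce idea mentioned above, here is a minimal single-machine sketch in plain Python. It does not use Spark and is not taken from the course notebooks; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example. In Spark the same three steps would be distributed across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values of each key with an associative sum."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Chain map, shuffle and reduce to count words (hypothetical helper)."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```

The reduce function is an associative sum, which is what allows a distributed engine to combine partial results from different machines in any order.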

This is being prepared by Raazesh Sainudiin and Sivanand Sivaram with assistance from Paul Brouwers, Dillon George and Ivan Sadikov.

All course projects by seven enrolled and four observing students for Semester 1 of 2016 at UC, Ilam are part of this content.

How to self-learn this content?

The 2016 instance of this scalable-data-science course finished on June 30 2016.

To learn Apache Spark for free, try Databricks Community Edition, starting from https://databricks.com/try-databricks.

All course content can be loaded for self-paced learning by copying the URL of the 2016/Spark1_6_to_1_3/scalable-data-science.dbc archive and importing it from that URL into your free Databricks Community Edition.

The Gitbook version of this content is https://www.gitbook.com/book/raazesh-sainudiin/scalable-data-science/details.

The browsable git-pages version of the content is http://raazesh-sainudiin.github.io/scalable-data-science/.

How to cite this work?

Scalable Data Science, Raazesh Sainudiin and Sivanand Sivaram, Published by GitBook https://www.gitbook.com/book/raazesh-sainudiin/scalable-data-science/details, 787 pages, 30th June 2016.

Supported By

Databricks Academic Partners Program and Amazon Web Services Educate.

Summary of Contents

Contribute

All course content is currently being pushed by Raazesh Sainudiin after it has been tested in Databricks cloud (mostly under Spark 1.6 and some involving Magellan under Spark 1.5.1).

The markdown version for GitBook is generated from the Databricks .scala, .py and other source code. The GitBook is not a substitute for the Databricks notebooks available in the Databricks cloud. The following issue needs to be resolved:

  • a stable solution is needed for showing the output of various Databricks cells in the GitBook, including output from display_HTML and frameIt with their in-place embeds of web content.

Please feel free to fork the GitHub repository.

Furthermore, in anticipation of Spark 2.0, this mostly Spark 1.6 version could be enhanced with a 2.0-specific upgrade.

Please send any typos or suggestions to [email protected]

Please read the note on babel to understand how the GitBook is generated from the .scala source of the Databricks notebooks.
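To give a feel for what such a conversion involves, here is a minimal sketch in Python. It is not the babel tool referred to above. It assumes the notebook was exported in Databricks source format, where (as far as we understand that format) the file begins with a "// Databricks notebook source" line, cells are separated by "// COMMAND ----------" lines, and markdown cells carry a "// MAGIC" prefix on each line. The function names notebook_to_markdown and strip_magic are hypothetical.

```python
HEADER = "// Databricks notebook source"   # assumed first line of an exported .scala notebook
CELL_SEPARATOR = "// COMMAND ----------"   # assumed separator between cells
MAGIC_PREFIX = "// MAGIC"                  # assumed prefix on lines of non-Scala cells
FENCE = "`" * 3                            # markdown code fence, built to avoid nesting fences


def strip_magic(line):
    """Drop the MAGIC prefix and one following space, if present."""
    rest = line[len(MAGIC_PREFIX):]
    return rest[1:] if rest.startswith(" ") else rest


def notebook_to_markdown(source):
    """Turn markdown cells into prose and all other cells into fenced Scala blocks."""
    body = "\n".join(l for l in source.splitlines() if l.strip() != HEADER)
    chunks = []
    for cell in body.split(CELL_SEPARATOR):
        lines = cell.strip().splitlines()
        if not lines:
            continue  # skip empty cells
        if lines[0].startswith(MAGIC_PREFIX + " %md"):
            # Markdown cell: remove the %md directive and the MAGIC prefixes.
            first = strip_magic(lines[0])[len("%md"):].strip()
            text = [first] if first else []
            text += [strip_magic(l) if l.startswith(MAGIC_PREFIX) else l for l in lines[1:]]
            chunks.append("\n".join(text).strip())
        else:
            # Code cell: wrap the Scala source in a fenced block.
            chunks.append(FENCE + "scala\n" + "\n".join(lines) + "\n" + FENCE)
    return "\n\n".join(chunks) + "\n"


if __name__ == "__main__":
    example = (
        "// Databricks notebook source\n"
        "// MAGIC %md\n"
        "// MAGIC # Title\n"
        "\n"
        "// COMMAND ----------\n"
        "\n"
        "val x = 1\n"
    )
    print(notebook_to_markdown(example))
```

This sketch ignores cell output entirely, which is exactly the unresolved issue listed above: results rendered by display_HTML or frameIt exist only in the running notebook and not in the exported source.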

Raazesh Sainudiin, Laboratory for Mathematical Statistical Experiments, Christchurch Centre and School of Mathematics and Statistics, University of Canterbury, Private Bag 4800, Christchurch 8041, Aotearoa New Zealand

Sun Jun 19 21:59:19 NZST 2016
