
525_group_24's Introduction

Machine Learning Models in the Cloud to Predict Daily Rainfall in Australia

DSCI 525 - Group 24 @ University of British Columbia, March 30, 2021

About

In this project, we build and deploy ensemble machine learning models in the cloud to predict daily rainfall in Australia on a large dataset (~6 GB; data source is here), where the features are the outputs of different climate models and the target is the actual rainfall observation. The purpose of this project is to get exposure to working with larger datasets and to achieve the learning objectives of each of the following four milestones:

Milestone 1: Get the data from the web using an API, process it, and convert it to an efficient file format.
Milestone 2: Move the data to the cloud, set up the infrastructure in the cloud, and build a machine learning model.
Milestone 3: Set up distributed infrastructure using Spark and run the machine learning model on Spark.
Milestone 4: Deploy the machine learning model in the cloud so that other consumers can use it.

Report

Milestone 1: A summary of observations and a discussion of the challenges encountered is documented in notebook_1.

Milestone 2: A summary of moving the data to the cloud and wrangling it for machine learning is documented in notebook_2.

Milestone 3: A summary of the machine learning model building results is documented here.

Milestone 4: A summary of the API deployment results is documented here.

License

The analysis materials for “Machine Learning Models in the Cloud to Predict Daily Rainfall in Australia” are licensed under the MIT License (Copyright (c) 2020 Master of Data Science at the University of British Columbia). If you want to re-use/re-mix the analysis and the materials used in this project, please provide attribution and a link to this repository.

The data used to create the “Daily Rainfall over NSW, Australia” data set are freely available under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

Contributors

Contributor Name GitHub Username
Huanhuan Li huan-ds
Nash Makhija nashmakh
Nicholas Wu nichowu


525_group_24's Issues

Load the combined CSV into memory and perform a simple EDA

rubric={correctness:10,reasoning:10}

Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts); a minimal sketch follows this list.

  • Changing the dtype of your data
  • Loading just the columns we want
  • Loading in chunks
  • Dask

Discuss your observations.
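A minimal pandas sketch of two of these approaches (column selection with smaller dtypes, and chunked reading); the file path and column names below are assumptions about the combined CSV rather than confirmed values:

```python
import pandas as pd

# Approach 1: load only the columns we need, with smaller dtypes.
# Column names are assumptions about the combined CSV.
use_cols = ["model", "rain (mm/day)"]
dtypes = {"model": "category", "rain (mm/day)": "float32"}
df = pd.read_csv("figshare_data/combined_data.csv", usecols=use_cols, dtype=dtypes)
print(f"{df.memory_usage(deep=True).sum() / 1e6:.1f} MB in memory")
print(df["model"].value_counts())

# Approach 2: read in chunks so the full file never sits in memory at once.
counts = pd.Series(dtype="int64")
for chunk in pd.read_csv("figshare_data/combined_data.csv",
                         usecols=["model"], chunksize=5_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype("int64"))
```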

Feedback

  • Very well-designed README file.
  • The report was perfect, and you have successfully done all the sections (more than what is needed!).
  • Very good comparison in parts 3 and 4.
  • The final “Challenges and Difficulties Faced” part was great!

Submission

In the textbox provided on Canvas for the Milestone 1 assignment, include:

  • The URL of your public project repository
  • The URL of your notebook for this milestone
  • Your team contract, copied and pasted

Downloading the data

Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).
Extract the zip file, again programmatically, similar to how we did it in class.
You could download the data and unzip it manually, but since we learned about APIs, we can do it in a reproducible way with the requests library.

There are 5 files in the figshare repo. The one we want is: data.zip
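A minimal sketch of this step with requests; the article id and output paths below are placeholders (the numeric id comes from the dataset's figshare URL), so treat them as assumptions:

```python
import os
import zipfile

import requests

# Placeholder: replace with the numeric id from the dataset's figshare URL
article_id = 14096681
url = f"https://api.figshare.com/v2/articles/{article_id}"
output_dir = "figshare_data"
os.makedirs(output_dir, exist_ok=True)

# List the files attached to the article and pick out data.zip
files = requests.get(url).json()["files"]
data_file = next(f for f in files if f["name"] == "data.zip")

# Stream the download to disk, then extract it programmatically
zip_path = os.path.join(output_dir, "data.zip")
with requests.get(data_file["download_url"], stream=True) as r:
    r.raise_for_status()
    with open(zip_path, "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)

with zipfile.ZipFile(zip_path) as z:
    z.extractall(output_dir)
```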

5. Submission

SUBMISSION: Please put a link in Canvas where TAs can find the following:

  • Python 3 notebook, with the code for the ML model in scikit-learn. (You can develop this on the existing JupyterHub in your EC2 instance from milestone 2.)
  • PySpark notebook, with the code for obtaining the best hyperparameter settings. (For this you have to use the PySpark notebook in your EMR cluster.)
  • Screenshots from:
      • Setting up your EMR cluster (Task 1).
      • Setting up your browser and Jupyter environment, and connecting to the master node (Task 2).
      • Your S3 bucket showing the model.joblib file (from Task 3, "Develop a ML model using scikit-learn"); a minimal sketch of that step follows this list.
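A minimal sketch of the Task 3 deliverable (fit a scikit-learn model, serialize it with joblib, and push it to S3); the training-file name, target column, bucket, and key are assumptions rather than the course's exact values:

```python
import boto3
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed inputs: a wrangled training CSV with the observed rainfall in a
# column named "observed_rainfall". Adjust the names to match your data.
df = pd.read_csv("ml_data_SYD.csv").dropna()
X, y = df.drop(columns=["observed_rainfall"]), df["observed_rainfall"]

model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
model.fit(X, y)

# Serialize the fitted model and upload it to the team S3 bucket
# ("mds-s3-group24" and the key are placeholder names).
joblib.dump(model, "model.joblib")
boto3.client("s3").upload_file("model.joblib", "mds-s3-group24", "output/model.joblib")
```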

  • Make sure the notebook runs from beginning to end.
  • Make sure the notebook is well-documented and self-explanatory.
  • Discuss any challenges or difficulties you faced when dealing with this large data on your laptops.
  • Briefly explain your approach to overcoming the challenges, or the reasons why you were not able to overcome them.

6. Perform a simple EDA in R

rubric={correctness:15,reasoning:10}

Pick an approach to transfer the dataframe from Python to R (a sketch of the file-based options follows this list):

  • Parquet file
  • Feather file
  • Pandas exchange
  • Arrow exchange

Discuss why you chose this approach over the others.
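A minimal sketch of the file-based routes (Parquet and Feather), assuming pyarrow is installed on the Python side and the arrow package on the R side; the file paths are carried over from the earlier sketches as assumptions:

```python
import pandas as pd

# Write the combined data once from Python; both formats are Arrow-backed,
# preserve dtypes, and avoid re-parsing a large CSV in R.
df = pd.read_csv("figshare_data/combined_data.csv")
df.to_parquet("figshare_data/combined_data.parquet")
df.to_feather("figshare_data/combined_data.feather")

# On the R side (shown here as comments only):
#   df <- arrow::read_parquet("figshare_data/combined_data.parquet")
#   df <- arrow::read_feather("figshare_data/combined_data.feather")
```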

From Slack:

  1. When you pick a method to make your Python dataframe available in R, make sure you give the reasoning behind it. This must be your team's reasoning/thoughts, and if you want, you can include references to some articles. But the answer shouldn't be of the form "I picked it because I like it," "I picked this because you said it in class," or "This seems easy to me" - nothing of that sort.

Combine data CSV files

Use one of the following options to combine the data CSVs into a single CSV (a minimal sketch follows this list):

  • Pandas
  • Dask

When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).
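A minimal pandas sketch of the combining step (a dask.dataframe version looks almost identical); the folder name is an assumption carried over from the download step above:

```python
import glob
import os

import pandas as pd

# Per-model CSVs extracted from data.zip, named "<MODEL>_daily_rainfall_NSW.csv"
csv_files = glob.glob("figshare_data/*_daily_rainfall_NSW.csv")

frames = []
for path in csv_files:
    # Derive the "model" column from the file name, e.g. SAM0-UNICON
    model = os.path.basename(path).replace("_daily_rainfall_NSW.csv", "")
    df = pd.read_csv(path)
    df["model"] = model
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("figshare_data/combined_data.csv", index=False)
```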

Compare the run times and memory usage of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you check memory usage and discuss the reasons why you might not have been able to run this on your laptop.

Creating repository and project structure

  • Write a brief introduction to the project in the README.
  • Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.
