
525_group_24's Introduction

Machine Learning Models in the Cloud to Predict Daily Rainfall in Australia

DSCI 525 - Group 24 @ University of British Columbia, March 30, 2021

About

In this project, we build and deploy ensemble machine learning models in the cloud to predict daily rainfall in Australia on a large dataset (~6 GB; data source is here), where the features are the outputs of different climate models and the target is the actual rainfall observation. The purpose of this project is to get exposure to working with larger datasets and to achieve the learning objectives of each of the following four milestones:

Milestone 1: Get the data from the web using an API, process it, and convert it to an efficient file format.
Milestone 2: Move the data to the cloud, set up the infrastructure in the cloud, and build a machine learning model.
Milestone 3: Set up distributed infrastructure using Spark and run the machine learning model on Spark.
Milestone 4: Deploy the machine learning model in the cloud so that other consumers can use it.

Report

Milestone 1: A summary of observations and a discussion of the challenges encountered is documented in notebook_1.

Milestone 2: A summary of moving the data to the cloud and wrangling it for machine learning is documented in notebook_2.

Milestone 3: A summary of the machine learning model building results is documented here.

Milestone 4: A summary of the API deployment results is documented here.

License

The analysis materials for “Machine Learning Models in the Cloud to Predict Daily Rainfall in Australia” are licensed under the MIT License (Copyright (c) 2020 Master of Data Science at the University of British Columbia). If you want to re-use/re-mix the analysis and the materials used in this project, please provide attribution and a link to this repository.

The data used to create the “Daily Rainfall over NSW, Australia” data set are freely available under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

Contributors

Contributor Name GitHub Username
Huanhuan Li huan-ds
Nash Makhija nashmakh
Nicholas Wu nichowu


525_group_24's Issues

Load the combined CSV into memory and perform a simple EDA

rubric={correctness:10,reasoning:10}

Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts); a minimal sketch follows this list.

  • Changing the dtype of your data
  • Loading just the columns we want
  • Loading in chunks
  • Dask

Discuss your observations.
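A minimal pandas sketch of two of these approaches (column selection with smaller dtypes, and chunked reading); the file path and column names below are assumptions about the combined CSV rather than confirmed values:

```python
import pandas as pd

# Approach 1: load only the columns we need, with smaller dtypes.
# Column names are assumptions about the combined CSV.
use_cols = ["model", "rain (mm/day)"]
dtypes = {"model": "category", "rain (mm/day)": "float32"}
df = pd.read_csv("figshare_data/combined_data.csv", usecols=use_cols, dtype=dtypes)
print(f"{df.memory_usage(deep=True).sum() / 1e6:.1f} MB in memory")
print(df["model"].value_counts())

# Approach 2: read in chunks so the full file never sits in memory at once.
counts = pd.Series(dtype="int64")
for chunk in pd.read_csv("figshare_data/combined_data.csv",
                         usecols=["model"], chunksize=5_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype("int64"))
```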

Feedback

  • Very well-designed README file.
  • The report was perfect, and you have successfully done all the sections (more than what is needed!).
  • Very good comparison in parts 3 and 4.
  • The final “Challenges and Difficulties Faced” part was great!

Submission

In the textbox provided on Canvas for the Milestone 1 assignment, include:

  • The URL of your public project repository
  • The URL of your notebook for this milestone
  • Your team contract, copied and pasted

Downloading the data

Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).
Extract the zip file, again programmatically, similar to how we did it in class.
You could download the data and unzip it manually, but since we learned about APIs, we can do it in a reproducible way with the requests library.

There are 5 files in the figshare repo. The one we want is: data.zip
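A minimal sketch of this step with requests; the article id and output paths below are placeholders (the numeric id comes from the dataset's figshare URL), so treat them as assumptions:

```python
import os
import zipfile

import requests

# Placeholder: replace with the numeric id from the dataset's figshare URL
article_id = 14096681
url = f"https://api.figshare.com/v2/articles/{article_id}"
output_dir = "figshare_data"
os.makedirs(output_dir, exist_ok=True)

# List the files attached to the article and pick out data.zip
files = requests.get(url).json()["files"]
data_file = next(f for f in files if f["name"] == "data.zip")

# Stream the download to disk, then extract it programmatically
zip_path = os.path.join(output_dir, "data.zip")
with requests.get(data_file["download_url"], stream=True) as r:
    r.raise_for_status()
    with open(zip_path, "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)

with zipfile.ZipFile(zip_path) as z:
    z.extractall(output_dir)
```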

5. Submission

SUBMISSION: Please put a link in Canvas where TAs can find the following:

  • Python 3 notebook, with the code for the ML model in scikit-learn. (You can develop this on the existing JupyterHub in your EC2 instance from milestone 2.)
  • PySpark notebook, with the code for obtaining the best hyperparameter settings. (For this you have to use the PySpark notebook in your EMR cluster.)
  • Screenshots from:
      • Setting up your EMR cluster (Task 1).
      • Setting up your browser and Jupyter environment, and connecting to the master node (Task 2).
      • Your S3 bucket showing the model.joblib file (from Task 3, "Develop a ML model using scikit-learn"); a minimal sketch of that step follows this list.
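A minimal sketch of the Task 3 deliverable (fit a scikit-learn model, serialize it with joblib, and push it to S3); the training-file name, target column, bucket, and key are assumptions rather than the course's exact values:

```python
import boto3
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Assumed inputs: a wrangled training CSV with the observed rainfall in a
# column named "observed_rainfall". Adjust the names to match your data.
df = pd.read_csv("ml_data_SYD.csv").dropna()
X, y = df.drop(columns=["observed_rainfall"]), df["observed_rainfall"]

model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
model.fit(X, y)

# Serialize the fitted model and upload it to the team S3 bucket
# ("mds-s3-group24" and the key are placeholder names).
joblib.dump(model, "model.joblib")
boto3.client("s3").upload_file("model.joblib", "mds-s3-group24", "output/model.joblib")
```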

  • Make sure the notebook runs from beginning to end.
  • Make sure the notebook is well-documented and self-explanatory.
  • Discuss any challenges or difficulties you faced when dealing with this large data on your laptops.
  • Briefly explain your approach to overcoming the challenges, or the reasons why you were not able to overcome them.

6. Perform a simple EDA in R

rubric={correctness:15,reasoning:10}

Pick an approach to transfer the dataframe from Python to R (a sketch of the file-based options follows this list):

  • Parquet file
  • Feather file
  • Pandas exchange
  • Arrow exchange

Discuss why you chose this approach over the others.
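A minimal sketch of the file-based routes (Parquet and Feather), assuming pyarrow is installed on the Python side and the arrow package on the R side; the file paths are carried over from the earlier sketches as assumptions:

```python
import pandas as pd

# Write the combined data once from Python; both formats are Arrow-backed,
# preserve dtypes, and avoid re-parsing a large CSV in R.
df = pd.read_csv("figshare_data/combined_data.csv")
df.to_parquet("figshare_data/combined_data.parquet")
df.to_feather("figshare_data/combined_data.feather")

# On the R side (shown here as comments only):
#   df <- arrow::read_parquet("figshare_data/combined_data.parquet")
#   df <- arrow::read_feather("figshare_data/combined_data.feather")
```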

From Slack:

  1. When you pick a method to make your Python dataframe available in R, make sure you give the reasoning behind it. This must be your team's reasoning/thoughts, and if you want, you can include references to some articles. But the answer shouldn't be of the form "I picked it because I like it," "I picked this because you said it in class," or "This seems easy to me" - nothing of that sort.

Combine data CSV files

Use one of the following options to combine the data CSVs into a single CSV (a minimal sketch follows this list):

  • Pandas
  • Dask

When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).
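A minimal pandas sketch of the combining step (a dask.dataframe version looks almost identical); the folder name is an assumption carried over from the download step above:

```python
import glob
import os

import pandas as pd

# Per-model CSVs extracted from data.zip, named "<MODEL>_daily_rainfall_NSW.csv"
csv_files = glob.glob("figshare_data/*_daily_rainfall_NSW.csv")

frames = []
for path in csv_files:
    # Derive the "model" column from the file name, e.g. SAM0-UNICON
    model = os.path.basename(path).replace("_daily_rainfall_NSW.csv", "")
    df = pd.read_csv(path)
    df["model"] = model
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("figshare_data/combined_data.csv", index=False)
```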

Compare the run times and memory usage of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you check memory usage and discuss the reasons why you might not have been able to run this on your laptop.

Creating repository and project structure

  • Write a brief introduction to the project in the README.
  • Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.
