ubc-mds / dsci525_group14

Web and Cloud Computing

License: MIT License

Jupyter Notebook 42.20% HTML 57.80%

api datawrangling dataextraction

dsci525_group14's Introduction

DSCI 525 Group14

UBC MDS Web and Cloud Computing Course

Source

The goal of this project is to build and deploy ensemble machine learning models in the cloud to predict daily rainfall in Australia. We are using a large dataset from figshare. Features are outputs of different climate models and the target is the actual rainfall observation.

There are four milestones for this project:

Milestone 1

Download the data and perform simple EDA

Milestone 1 Notebook

Milestone 2

Transfer the data to the cloud and set up the infrastructure for the machine learning model.

Milestone 2 Notebook

Milestone 3

Build distributed infrastructure (Spark) in the cloud and train a machine learning model.

Milestone 3 Notebooks

Milestone 4

Deploy the ML model using Flask.

Milestone 4 Notebooks

Team Members

Chuck Ho
Sakshi Jain
Zeliha Ural Merpez
Sasha Babicki

dsci525_group14's Issues

Downloading the data

3. Downloading the data

rubric={correctness:10}

Download the data from figshare to your local computer using the figshare API (you can make use of requests library).
Extract the zip file, again programmatically, similar to how we did it in class.

You can download the data and unzip it manually. But we learned about APIs, and so we can do it in a reproducible way with the requests library, similar to how we did it in class.

There are 5 files in the figshare repo. The one we want is: data.zip
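
A minimal sketch of the download-and-extract step, assuming the figshare v2 endpoint `https://api.figshare.com/v2/articles/{article_id}` and a JSON response whose `files` entries carry `name` and `download_url` fields. The `article_id` argument, the helper names, and the output directory are placeholders for illustration, not values taken from this repository.

```python
import os
import zipfile


def pick_download_url(files, wanted_name):
    """Return the download URL of the entry whose name is `wanted_name`."""
    for entry in files:
        if entry["name"] == wanted_name:
            return entry["download_url"]
    raise FileNotFoundError(f"{wanted_name} not found in the article's file list")


def download_figshare_file(article_id, wanted_name, output_dir):
    """Download one file of a figshare article and return its local path."""
    # Third-party dependency; imported here so the other helpers work without it
    import requests

    meta = requests.get(
        f"https://api.figshare.com/v2/articles/{article_id}", timeout=60
    )
    meta.raise_for_status()
    url = pick_download_url(meta.json()["files"], wanted_name)

    os.makedirs(output_dir, exist_ok=True)
    local_path = os.path.join(output_dir, wanted_name)
    # Stream to disk so the archive is never held in memory all at once
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(local_path, "wb") as fh:
            for chunk in response.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    return local_path


def extract_zip(zip_path, output_dir):
    """Extract every member of `zip_path` into `output_dir`; return member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(output_dir)
        return archive.namelist()
```

Usage would look like `extract_zip(download_figshare_file(article_id, "data.zip", "figshare"), "figshare")`, with `article_id` set to the ID of the figshare article that holds the dataset.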

Error while combining the files and saving them

Hello group,

As discussed, I cloned the repository again. However, I am still getting the same error while combining and saving the files. The error details are below:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<timed exec> in <module>

C:\miniconda3\envs\525\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    283     ValueError: Indexes have overlapping values: ['a']
    284     """
--> 285     op = _Concatenator(
    286         objs,
    287         axis=axis,

C:\miniconda3\envs\525\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    337             objs = [objs[k] for k in keys]
    338         else:
--> 339             objs = list(objs)
    340 
    341         if len(objs) == 0:

<timed exec> in <genexpr>(.0)

IndexError: list index out of range

My code:

combined_file_path = output_directory + "combined_data.csv"
%%time
%memit

# Combine files and save
files = glob.glob(output_directory + "*_daily_rainfall_NSW.csv")
df = pd.concat(
    (
        pd.read_csv(file, index_col=0).assign(
            model=re.findall(r"/(.*)_daily_rainfall", file)[0]
        )
        for file in files
    )
)
df.to_csv(combined_file_path)
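
One likely cause, consistent with the `C:\miniconda3` paths in the traceback: the pattern `/(.*)_daily_rainfall` requires a forward slash, so when a path contains none (for example, when Windows backslash separators are used) `re.findall(...)` returns an empty list and `[0]` raises `IndexError`. A possible platform-independent sketch takes the model name from the file name alone. `model_name_from_path` is a hypothetical helper, not code from the repository.

```python
import re


def model_name_from_path(path):
    """Return the model name encoded in a '<model>_daily_rainfall...' file name."""
    # Treat both separator styles alike, then keep only the final path component
    filename = path.replace("\\", "/").rsplit("/", 1)[-1]
    match = re.match(r"(.*)_daily_rainfall", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.group(1)
```

In the generator above, `model=model_name_from_path(file)` would replace the `re.findall(...)[0]` expression.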

2. Try to fix foxyproxy setup

Load the combined CSV to memory and perform a simple EDA

5. Load the combined CSV to memory and perform a simple EDA

rubric={correctness:10,reasoning:10}

Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
- Changing dtype of your data
- Loading just the columns we want
- Loading in chunks
- Dask
Discuss your observations.
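
A rough sketch of how three of these approaches (column selection, explicit dtypes, and chunked loading) might be combined for a `value_counts`-style summary in pandas. The function name and the default chunk size are illustrative assumptions, not part of the assignment.

```python
import pandas as pd


def value_counts_low_memory(csv_path, column, dtype=None, chunksize=1_000_000):
    """Count the values of one column while reading the CSV in chunks."""
    counts = None
    reader = pd.read_csv(
        csv_path,
        usecols=[column],     # load just the column we need
        dtype=dtype,          # optional narrower dtype, e.g. {"some_float_col": "float32"}
        chunksize=chunksize,  # iterate over pieces instead of loading everything
    )
    for chunk in reader:
        part = chunk[column].value_counts()
        # Align on the value labels; labels missing from one side count as 0
        counts = part if counts is None else counts.add(part, fill_value=0)
    if counts is None:
        return pd.Series(dtype="int64")
    return counts.astype("int64")
```

For example, `value_counts_low_memory(combined_file_path, "model")` would count rows per model without holding the full combined CSV in memory.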

Team-work contract

1. Team-work contract

rubric={correctness:10}

Similar to what you did in DSCI 522 and DSCI 524, create a team-work contract. The contract should outline how you commit to working together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team. It is a fairly personal document, so please do not push it to your public repositories. Instead, save it somewhere your team can easily share it, and share a link to it, or a copy, with us in your Canvas submission to prove you did this.

https://docs.google.com/document/d/1fipaY7_6rSNIgwLjjY6kFh5faoZpOt6nrFFWC76tJrU

Creating repository and project structure

2. Creating repository and project structure

rubric={mechanics:10}

Similar to previous project courses, create a public repository under UBC-MDS org for your project.
Write a brief introduction of the project in the README.
Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.

Combining data CSVs

4. Combining data CSVs

rubric={correctness:10,reasoning:10}

Use one of the following options to combine data CSVs into a single CSV.
- Pandas
- DASK
When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).
Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you check memory usage and discuss the reasons why you might not have been able to run this on your laptop.
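
A hedged sketch of the pandas option that also reports run time. Unlike a single `pd.concat` over all files, it appends one file at a time so that only one model's data is in memory at once; this is a swapped-in variant, not necessarily what the team used. It assumes every input CSV has the same columns in the same order, and `combine_csvs` is a hypothetical helper name.

```python
import glob
import os
import time

import pandas as pd


def combine_csvs(input_dir, combined_path, suffix="_daily_rainfall_NSW.csv"):
    """Append each model's CSV to one combined CSV; return elapsed seconds."""
    files = sorted(glob.glob(os.path.join(input_dir, "*" + suffix)))
    start = time.perf_counter()
    for i, path in enumerate(files):
        # File name minus the shared suffix, e.g. "SAM0-UNICON"
        model = os.path.basename(path)[: -len(suffix)]
        df = pd.read_csv(path, index_col=0).assign(model=model)
        # Write the header once, then append without it
        df.to_csv(combined_path, mode="w" if i == 0 else "a", header=(i == 0))
    return time.perf_counter() - start
```

The returned time can be recorded per machine for the comparison; memory usage would still need a separate tool such as the `%memit` magic already used in the notebook.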

Regular expression only works for Mac

Fixed in #25

Consistency with %%

Hello everybody,

I noticed that we are not using %% consistently. We use a single % in some places and %% in others. I checked a couple of notebooks and noticed that the correct usage is %%. I don't know if it matters.

I still have issues, and my friend is running the file for me. Once I get an update, I will post here.

6. Perform a simple EDA in R

rubric={correctness:15,reasoning:10}

Pick an approach to transfer the dataframe from python to R.
Discuss why you chose this approach over others.
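
One possible approach, shown only as an assumption since the issue does not say which one the team chose: write the dataframe to a Parquet file from pandas and read it in R with the arrow package (`arrow::read_parquet(path)`). This relies on `pyarrow` being installed; `export_for_r` is a hypothetical helper name.

```python
import pandas as pd


def export_for_r(df, path):
    """Write `df` to a Parquet file that R can read with the arrow package."""
    # Parquet keeps column dtypes, so R does not have to re-infer them from text
    df.to_parquet(path, engine="pyarrow", index=False)
    return path
```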

3. Set up ML Model

Develop an ML model using scikit-learn.
rubric={correctness:25}

Upload this notebook to your JupyterHub (TLJH in your EC2) from your previous milestone and follow the instructions given there. https://github.ubc.ca/MDS-2020-21/DSCI_525_web-cloud-comp_students/blob/master/Milestones/milestone3/Milestone3-Task3.ipynb

Submit to canvas

Submission instructions

rubric={mechanics:5}

In the textbox provided on Canvas for the Milestone 1 assignment include:

The URL of your public project's repository
The URL of your notebook for this milestone

Milestone 3 Feedback

Great job, a few comments:
Organization:

Forgot to edit the links to Tasks 3&4 in Milestone3.ipynb,
3 Comments denoting steps 5 & 6 aren't as clear as 1-4

3 I think you should be comparing the RMSE for the test sets rather than train.
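
To illustrate the last comment, here is a small sketch that reports RMSE on a held-out test set alongside the training RMSE. It uses synthetic stand-in data and a `RandomForestRegressor` purely as assumptions for illustration; it is not the project's data or model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the project's rainfall data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X.mean(axis=1) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Training RMSE is usually optimistic; the test RMSE is the one to compare
train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```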