dsci525_group14's Introduction

DSCI 525 Group14

UBC MDS Web and Cloud Computing Course

The goal of this project is to build and deploy ensemble machine learning models in the cloud to predict daily rainfall in Australia. We are using a large dataset from figshare. Features are outputs of different climate models and the target is the actual rainfall observation.

There are four milestones for this project:

Milestone 1

Download the data and perform simple EDA

Milestone 1 Notebook

Milestone 2

Transfer the data to the cloud and set up the infrastructure for the machine learning model.

Milestone 2 Notebook

Milestone 3

Build distributed infrastructure (Spark) in the cloud and fit a machine learning model.

Milestone 3 Notebooks

Milestone 4

Deploy the ML model using Flask.

Milestone 4 Notebooks

Team Members

  • Chuck Ho
  • Sakshi Jain
  • Zeliha Ural Merpez
  • Sasha Babicki

dsci525_group14's People

Contributors

chuckho777, hellosakshi, sbabicki, zmerpez

Forkers

chuckho777

dsci525_group14's Issues

Downloading the data

3. Downloading the data

rubric={correctness:10}

  1. Download the data from figshare to your local computer using the figshare API (you can make use of the requests library).
  2. Extract the zip file, again programmatically, similar to how we did it in class.

You can download the data and unzip it manually. But we learned about APIs, and so we can do it in a reproducible way with the requests library, similar to how we did it in class.

There are 5 files in the figshare repo. The one we want is: data.zip
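
The two steps above could be sketched as follows. This assumes the public figshare v2 article endpoint (`https://api.figshare.com/v2/articles/<id>`), whose JSON response lists each file with a `name` and a `download_url`. The helper names are hypothetical, and the article ID is not given here, so it is left as a parameter.

```python
import os
import zipfile


def find_download_url(files, wanted_name):
    """Return the download URL of the entry in `files` whose name matches."""
    for entry in files:
        if entry["name"] == wanted_name:
            return entry["download_url"]
    raise FileNotFoundError(f"{wanted_name!r} is not listed for this article")


def download_file(article_id, wanted_name, output_dir):
    """Download one file of a public figshare article and return its local path."""
    import requests  # third-party; imported here so the other helpers work without it

    # Assumed endpoint: the article metadata includes a "files" list.
    meta = requests.get(
        f"https://api.figshare.com/v2/articles/{article_id}", timeout=30
    )
    meta.raise_for_status()
    url = find_download_url(meta.json()["files"], wanted_name)

    os.makedirs(output_dir, exist_ok=True)
    target = os.path.join(output_dir, wanted_name)
    # Stream to disk so the archive is never held in memory all at once.
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(target, "wb") as handle:
            for chunk in response.iter_content(chunk_size=1 << 20):
                handle.write(chunk)
    return target


def extract_zip(zip_path, output_dir):
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(output_dir)
        return archive.namelist()
```

A call such as `extract_zip(download_file(article_id, "data.zip", "data/"), "data/")` would then fetch and unpack the archive, where `article_id` is the number of the figshare article and `"data/"` is a placeholder directory.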

Error while combining the files and saving them

Hello group,

As discussed, I cloned the repository again. However, I am getting the same error while combining and saving the files. Below are the error details:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<timed exec> in <module>

C:\miniconda3\envs\525\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    283     ValueError: Indexes have overlapping values: ['a']
    284     """
--> 285     op = _Concatenator(
    286         objs,
    287         axis=axis,

C:\miniconda3\envs\525\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    337             objs = [objs[k] for k in keys]
    338         else:
--> 339             objs = list(objs)
    340 
    341         if len(objs) == 0:

<timed exec> in <genexpr>(.0)

IndexError: list index out of range

My code:

combined_file_path = output_directory + "combined_data.csv"
%%time
%memit

# Combine files and save
files = glob.glob(output_directory + "*_daily_rainfall_NSW.csv")
df = pd.concat(
    (
        pd.read_csv(file, index_col=0).assign(
            model=re.findall(r"/(.*)_daily_rainfall", file)[0]
        )
        for file in files
    )
)
df.to_csv(combined_file_path)
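
One possible cause, offered as an assumption because the value of `output_directory` is not shown: the traceback paths (`C:\miniconda3\...`) indicate Windows, where `glob` can return paths that use `\` as the separator. If a path contains no `/`, the pattern `/(.*)_daily_rainfall` matches nothing, `re.findall` returns an empty list, and `[0]` raises `IndexError: list index out of range` inside the generator expression. A sketch of a separator-independent alternative (the helper name `model_name` is hypothetical):

```python
from pathlib import PureWindowsPath


def model_name(path):
    """Return the part of the file name that precedes "_daily_rainfall"."""
    # PureWindowsPath treats both "\" and "/" as separators, so the same
    # code handles paths produced on Windows, macOS, and Linux.
    file_name = PureWindowsPath(path).name
    return file_name.split("_daily_rainfall")[0]


print(model_name(r"data\SAM0-UNICON_daily_rainfall_NSW.csv"))  # SAM0-UNICON
print(model_name("data/SAM0-UNICON_daily_rainfall_NSW.csv"))   # SAM0-UNICON
```

In the cell above, this would replace the regular expression: `.assign(model=model_name(file))`.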

Load the combined CSV to memory and perform a simple EDA

5. Load the combined CSV to memory and perform a simple EDA

rubric={correctness:10,reasoning:10}

  1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
    • Changing dtype of your data
    • Loading only the columns we want
    • Loading in chunks
    • Dask
  2. Discuss your observations.
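
A sketch of how the first three approaches might look with pandas. The function names are hypothetical, and the column names passed in are assumptions about the combined CSV (apart from `model`, which the combining step adds).

```python
from collections import Counter

import pandas as pd


def count_values_in_chunks(csv_path, column, chunksize=1_000_000):
    """value_counts for one column, reading only that column in chunks."""
    counts = Counter()
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        # Add this chunk's counts to the running totals.
        counts.update(chunk[column].value_counts().to_dict())
    return pd.Series(counts).sort_values(ascending=False)


def load_with_small_dtypes(csv_path, float_columns):
    """Load only the selected columns, as float32 instead of float64."""
    dtypes = {name: "float32" for name in float_columns}
    return pd.read_csv(csv_path, usecols=list(float_columns), dtype=dtypes)
```

`count_values_in_chunks(path, "model")` never holds more than one chunk of one column in memory, while `load_with_small_dtypes` roughly halves the footprint of the numeric columns at the cost of some precision. Calling `df.memory_usage(deep=True).sum()` before and after gives a number to report in the discussion.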

Team-work contract

1. Team-work contract

rubric={correctness:10}

Similar to what you did in DSCI 522 and DSCI 524, create a team-work contract. The contract should outline how you commit to working together so that you are accountable to one another. Again, you may start with your team contract document from previous project courses and adapt it for your new team. It is a fairly personal document, so please do not push it to your public repositories. Instead, save it somewhere your team can easily share it; you can share a link to it, or a copy, with us in your submission to Canvas to prove you did this.

https://docs.google.com/document/d/1fipaY7_6rSNIgwLjjY6kFh5faoZpOt6nrFFWC76tJrU

Creating repository and project structure

2. Creating repository and project structure

rubric={mechanics:10}

  1. Similar to previous project courses, create a public repository under the UBC-MDS org for your project.
  2. Write a brief introduction of the project in the README.
  3. Create a folder called notebooks in the repository and create a notebook for this milestone in that folder.

Combining data CSVs

4. Combining data CSVs

rubric={correctness:10,reasoning:10}

  1. Use one of the following options to combine data CSVs into a single CSV.

  2. When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name; e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).

  3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you check memory usage and discuss the reasons why you might not have been able to run this on your laptop.
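
For machines that run out of memory, one approach is to append the files one at a time. This is a sketch assuming pandas and is not necessarily one of the options the assignment lists. The function name and its arguments are hypothetical, and it assumes every input file has the same columns in the same order.

```python
import glob
import os

import pandas as pd


def combine_csvs(input_dir, output_path, suffix="_daily_rainfall_NSW.csv"):
    """Append every matching CSV to output_path with an extra "model" column."""
    files = sorted(glob.glob(os.path.join(input_dir, "*" + suffix)))
    for position, path in enumerate(files):
        # "SAM0-UNICON_daily_rainfall_NSW.csv" -> "SAM0-UNICON"
        model = os.path.basename(path)[: -len(suffix)]
        frame = pd.read_csv(path, index_col=0).assign(model=model)
        # Overwrite and write the header for the first file, then append without it.
        frame.to_csv(
            output_path,
            mode="w" if position == 0 else "a",
            header=position == 0,
        )
    return len(files)
```

Only one input file is held in memory at a time, which trades some run time (repeated writes) for a much lower peak memory than a single `pd.concat` over all files.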

Consistency with %%

Hello everybody,

I noticed that we are not using %% consistently. We use a single % in some places and %% in others. I checked a couple of notebooks and noticed that the correct usage is %%. I don't know if it matters.

I still have issues, and my friend is running the file for me. Once I have an update, I will post here.

Milestone 1 Feedback

Well-organized repo and notebook
Great job discussing problems through GH issues.
I encourage more discussion of results; remember to summarize at the end of each section.

Submit to canvas

Submission instructions

rubric={mechanics:5}

In the textbox provided on Canvas for the Milestone 1 assignment include:

  • The URL of your public project's repository
  • The URL of your notebook for this milestone

Milestone 3 Feedback

Great job, a few comments:
Organization:

  • Forgot to edit the links to Tasks 3&4 in Milestone3.ipynb.
  • 3 Comments denoting steps 5 & 6 aren't as clear as 1-4

3 I think you should be comparing the RMSE for the test sets rather than train.
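
For reference, a minimal sketch of the metric in question. Computed on held-out test data, it estimates how the model performs on unseen observations, whereas the training value tends to be optimistic. The function is illustrative and is not the notebook's code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

Reporting `rmse(y_test, model_predictions_on_test)` next to the training value makes any gap between the two visible; both argument names here are placeholders.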
