Code Monkey home page Code Monkey logo

codechallenge's Introduction

Machine Learning Code Challenge

Objective

Given 2 data challenges that test for the ability to:

  • Wrangle/clean data to make it usable by a model
  • Figure out how to set up X's and y's for a use case, given a dataset
  • Write code to robustly and reproducibly preprocess data
  • Pick/design the right model, and tune hyperparameters to get the best performance

Deliverables:

  • Github
  • Code is documented within the notebooks
  • Models stored with joblib
  • ReadME file:

Most of the code documentation is explored in greater depth and detail in the working copy notebooks. In total, there are 3 working copies dedicated to task 1, task 2, and the LSTM Neural Net.

The main notebook is the MachineLearningChallenge notebook. It contains a comprehensive functionalized implementation of the code for both the tasks. Including final "train(data)" and "predict(data)" functions. Hyperparameter tuning resulted in models too big for github so I've scaled them back down for space limitations.

For the Neural Network implementation, I explored an LSTM. However, with the some increase in results, it came with additional complexities. I've left the LSTM to just work on the data in it's notebook and haven't included it in the final train/predict functions in the main notebook. For now, a simple yet robust versions of models using XGBoost for Predictive Maintenance and RandomForestRegression for Forecasting have been used.

If I had to pick, I'd say I'm proud of not having used any for loops at all in the code. Vectorized and package based implementations help with speed and scale! Especially, with the Preprocess function that uses several pandas methods to encompass edge cases. The function also manages to fill NaNs to the closest likely value by choosing information at the nearest timestep of the turbines.

Oh, and also proud of how much fun I had working through this problem!

Code Assessment:

  • Accuracy/RMSE of your model when predicting on held-out data

  • How well various edge cases are handled when testing on held-out data. For example, if the held-out data contains:

    • A new column that wasn't present in the dataset given to you
    • New value in a categorical field that wasn't seen in the dataset given to you
    • NA values

    Preprocess Functions aims to cover these edge cases

  • Efficiency of the code.

    • Is it easy to understand? I think so
    • Are the variable names descriptive? Yes
    • Are there any variables created that aren't used? No
    • Is redundant code replaced with function calls? Yes
    • Is vectorized implementation used instead of nested for loops? Yes, pandas!
    • Are classes defined and objects created where applicable? Wasn't applicable, I believe
    • Are packages used to perform tasks instead of implementing them from scratch? Yes, numpy, pandas, scikit-learn, keras

References:

WaterPump Maintenance: https://towardsdatascience.com/water-pumps-maintenance-prediction-data-science-illustrated-20c7100017c5

Hyperparameter tuning: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

codechallenge's People

Contributors

prakash-murugesan avatar amolmane1 avatar stevekludt avatar saad-kabir avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.