
[DS 110 Fall 2022] Chocolate Bars Rating Prediction ML Model

By: Xiang Fu, Kayla Wu

Publicly Available Dataset and Machine Learning

Find a publicly available dataset (give a link in your proposal) that has some data you can use for the final project, and decide on 2 approaches that you could take to predict some variable with machine learning. (Approaches include k-nearest neighbors, decision trees, and random forests; you can also experiment with anything else you find in scikit-learn or elsewhere.) You don't need to implement the machine learning algorithm yourself - you can make use of scikit-learn and other libraries. Identify the columns you will use for prediction. You should also plan to vary some parameters in each approach to achieve the best possible performance - for example, vary k for k-nearest neighbors, or vary maximum depth for decision trees.

Dataset and inputs (Data preparation)

  • Description
    • Believe it or not, chocolate is ranked the most popular candy in the world. For our final project, we are interested in exploring expert ratings of over 1,700 individual chocolate bars from the Chocolate Bar Ratings dataset on Kaggle. Specifically, we will train a machine learning model to predict chocolate bar ratings using the following parameters/columns: (1) regional origin, (2) percentage of cocoa, (3) variety of chocolate bean used, and (4) broad bean origin.
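Loading the dataset and selecting the prediction columns might look like the sketch below. The column names follow the variable list in this proposal, but the exact headers in the Kaggle CSV are an assumption; a tiny in-memory CSV stands in for the real file so the example is self-contained.

```python
import io
import pandas as pd

# In-memory stand-in for the Kaggle CSV (two illustrative rows; the real
# file has over 1,700 and its headers may differ slightly).
csv_text = """Company,Specific Bean Origin or Bar Name,Review Date,Cocoa Percent,Company Location,Rating,Bean Type,Broad Bean Origin
A. Morin,Agua Grande,2016,63%,France,3.75,Criollo,Sao Tome
Soma,Chuao,2014,70%,Canada,4.0,Trinitario,Venezuela
"""

df = pd.read_csv(io.StringIO(csv_text))
features = df[["Cocoa Percent", "Bean Type", "Broad Bean Origin"]]
target = df["Rating"]
print(features.shape)
```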

Objectives of our dataset exploration

  1. Estimate the chocolate bar ratings with the aforementioned parameters in the dataset
  2. Determine which parameter(s) will have the greatest impact on chocolate bar ratings

Possible approaches

  1. Decision trees
    • We can split the dataset into smaller and smaller segments; the tree is built and tested using the dataset.
    • For instance: Is the cocoa percent above 75%? (Y/N) > Does the bean type originate in Peru? (Y/N) > Was the broad bean grown in Venezuela? (Y/N)
  2. Random forests
    • This approach will also allow us to determine which parameters are most important, and random forests generally provide a higher level of accuracy than a single decision tree.
    • First, we will select random samples from the dataset, then construct a decision tree for each sample, obtaining a prediction result from each tree.
    • Next, we will perform a majority vote over the trees' predicted results and select the prediction with the most votes as the final prediction.
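The two approaches above can be sketched with scikit-learn on a toy dataset. The feature values, coefficients, and model settings here are illustrative assumptions, not the project's actual data; since Rating is numeric, regressors are used.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy stand-in features: cocoa percent plus two yes/no origin flags,
# mirroring the example questions above.
X = np.column_stack([
    rng.uniform(0.5, 1.0, 200),   # cocoa percent as a fraction
    rng.integers(0, 2, 200),      # bean type originates in Peru? (Y/N)
    rng.integers(0, 2, 200),      # broad bean grown in Venezuela? (Y/N)
])
# Synthetic rating: mostly driven by cocoa percent, plus noise (assumption).
y = 2.0 + 2.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.1, 200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Random forests expose per-feature importances, which speaks to the
# objective of finding which parameters impact ratings the most.
print(forest.feature_importances_)
```

On this synthetic data the cocoa-percent column dominates the importances, because it dominates the formula that generated the ratings.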

List of all variables

  • Company (Maker, if known)
  • Specific Bean Origin or Bar Name
  • Review Date
  • Cocoa Percent (feature/parameter)
  • Company Location
  • Rating (outcome)
  • Bean Type (feature/parameter)
  • Broad Bean Origin (feature/parameter)

Chosen columns for chocolate bar rating prediction

  1. Cocoa Percent
  2. Bean type
  3. Broad Bean Origin
  4. Specific Bean Origin or Bar Name
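Most of the chosen columns are text (bean types and origins are category names, and Cocoa Percent may be stored as strings like "70%"), so they need numeric encoding before tree models can use them. A sketch under those assumptions:

```python
import pandas as pd

# Small illustrative frame mimicking the chosen columns (values are made up).
df = pd.DataFrame({
    "Cocoa Percent": ["63%", "70%", "72%"],
    "Bean Type": ["Criollo", "Trinitario", "Criollo"],
    "Broad Bean Origin": ["Sao Tome", "Venezuela", "Peru"],
    "Rating": [3.75, 4.0, 3.5],
})

# Convert "70%"-style strings to floats (assumption about the raw format).
df["Cocoa Percent"] = df["Cocoa Percent"].str.rstrip("%").astype(float) / 100

# One-hot encode the categorical columns so the models get numeric inputs.
X = pd.get_dummies(df[["Cocoa Percent", "Bean Type", "Broad Bean Origin"]],
                   columns=["Bean Type", "Broad Bean Origin"])
y = df["Rating"]
print(X.columns.tolist())
```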

Steps of Our Machine Learning Training Process

  1. Data cleaning
    • During this process, we will remove inaccurate, corrupted, incorrectly formatted, duplicated, or incomplete data from our dataset.
  2. Train the model
    • We will use our data to incrementally improve our model's ability to predict the ratings of chocolate.
    • Split the dataset into two sets:
      1. Training data: 80% of the dataset
      2. Testing data (evaluation): 20% of the dataset
  3. Evaluate the model
    • Train the model on the training set, then test it on the testing set and evaluate performance.
    • For the decision tree:
      1. Run the trained model on the test data and see what it predicts: test_pred_decision_tree = clf.predict(test_x)
      2. Visualize the result with a confusion matrix (a way to express how many of a classifier's predictions were correct and, when incorrect, where the classifier got "confused"): confusion_matrix(y_test, y_pred_test)
      3. A Seaborn heatmap() can also be used to visualize the confusion matrix.
    • For the random forest:
      1. Use an accuracy score to measure how many labels the model got right out of the total number of predictions: accuracy_score(y_test, y_pred_test)
      2. Use scikit-learn's classification_report(): classification_report(y_test, y_pred_test)
  4. Parameter tuning
    • In this step, we will tune model hyperparameters to improve performance:
      1. Decision tree: random_state, max_depth, min_samples_leaf
      2. Random forest: max_features (the maximum number of features the random forest is allowed to try in an individual tree), max_depth, min_samples_leaf, n_estimators (the number of trees to build before taking the majority vote or average of predictions), random_state
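The split / evaluate / tune steps above can be sketched end-to-end. The synthetic features and the binary "high rating" labels are assumptions made so the classification metrics named in the steps (confusion matrix, accuracy, classification report) apply; the real project would substitute the cleaned, encoded chocolate data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

rng = np.random.default_rng(42)
X = rng.uniform(0.5, 1.0, (300, 3))                            # toy features
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0.75).astype(int)  # "high rating"?

# Step 2: 80/20 train/test split.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 3a: evaluate a decision tree with a confusion matrix.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(train_x, train_y)
test_pred_decision_tree = clf.predict(test_x)
print(confusion_matrix(test_y, test_pred_decision_tree))

# Step 3b: evaluate a random forest with accuracy and a classification report.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(train_x, train_y)
pred_forest = forest.predict(test_x)
print(accuracy_score(test_y, pred_forest))
print(classification_report(test_y, pred_forest))

# Step 4: simple parameter tuning, varying max_depth and keeping the best.
best_depth, best_acc = None, 0.0
for depth in [2, 3, 5, 8, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(train_x, train_y)
    acc = accuracy_score(test_y, model.predict(test_x))
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)
```

A grid-search utility such as scikit-learn's GridSearchCV could replace the manual loop once more than one hyperparameter is varied at a time.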
