
artisan1218 / recommendation-system


Hybrid RecSys, CF-based RecSys, Model-based RecSys, Content-based RecSys, Finding similar items using Jaccard similarity

Python 59.87% Jupyter Notebook 40.13%
jaccard-similarity cosine-similarity tfidf collaborative-filtering content-based-recommendation user-based-recommendation item-based-recommendation spark pearson-correlation tf-idf-score similar-items svd-matrix-factorisation xgboost xgboost-regression surprise-python feature-augmentation upsampling


Recommendation Systems

Note: The recommendation systems (RS) use data from yelp.com.

  • train_review.json – the main file, containing the review data. The RS works primarily with this file.
  • test_review.json – contains only the target user and business pairs for the prediction tasks.
  • test_review_ratings.json – contains the ground-truth ratings for the test pairs.
  • stopwords – contains common stopwords that are used when calculating TF-IDF scores.
  • The data is first preprocessed using Apache Spark.
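
As a rough illustration of the data layout, the sketch below parses review records into a nested dictionary of ratings. It assumes one JSON object per line with the fields `user_id`, `business_id`, `stars`, and `text`; these field names and the helper `load_reviews` are assumptions made for this example, and the project itself does this step with Apache Spark rather than plain Python.

```python
import json
from collections import defaultdict


def load_reviews(lines):
    """Parse JSON-lines review records into {business_id: {user_id: stars}}.

    Hypothetical helper: the field names are assumed, not taken from the repo.
    """
    ratings = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        ratings[record["business_id"]][record["user_id"]] = float(record["stars"])
    return dict(ratings)


# Tiny in-memory sample standing in for train_review.json (made-up records).
sample = [
    '{"user_id": "u1", "business_id": "b1", "stars": 5.0, "text": "great tacos"}',
    '{"user_id": "u2", "business_id": "b1", "stars": 3.0, "text": "slow service"}',
    '{"user_id": "u1", "business_id": "b2", "stars": 4.0, "text": "good coffee"}',
]
ratings = load_reviews(sample)
```

With a real file, the same function could be fed an open file handle, since iterating over a text file yields one line at a time.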

The project is divided into four subfolders, each using a different algorithm to produce recommendations.

  1. Collaborative Filtering: a collaborative filtering (CF) recommendation system with two cases, item-based CF and user-based CF.

    1. Item-based CF: the RS computes the Pearson correlation for business pairs with at least three co-rated users and uses the 3 or 5 neighbors most similar to the target business.
    2. User-based CF: MinHash and LSH are used first to identify similar users, which reduces the number of pairs for which the Pearson correlation must be computed. After identifying similar users by their Jaccard similarity, the RS computes the Pearson correlation for all candidate user pairs and makes the prediction.
  2. Content-Based Recommendation System: a content-based RS that generates profiles for users and businesses from the review texts in train_review.json. Algorithms used: TF-IDF scoring and cosine similarity.

  3. Finding Similar Items: finds similar business pairs in train_review.json. Algorithms used: MinHash, Locality Sensitive Hashing (LSH), and Jaccard similarity.

  4. Hybrid Recommendation System: a hybrid RS that combines several different models to jointly produce the best result. This project ranked third in the USC Data Mining (Recommendation System) Competition 2021, with a final score of 2709 and an RMSE of 1.1498.

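
The MinHash and LSH step that appears in both the user-based CF and the similar-items components can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the number of bands and rows, the hash family, and the 0.5 threshold are all assumptions chosen for the example. Each business is represented as a non-empty set of integer user indices, LSH proposes candidate pairs, and every candidate is then checked with the exact Jaccard similarity.

```python
import random
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def make_hash_funcs(n, prime=2_147_483_647, seed=42):
    """Build n hash functions of the form (a*x + b) mod prime (assumed family)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime), prime) for _ in range(n)]


def minhash_signature(indices, hash_funcs):
    """Signature = minimum hash value over the set, one entry per hash function."""
    return [min((a * x + b) % p for x in indices) for a, b, p in hash_funcs]


def lsh_candidates(signatures, bands, rows):
    """Pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[chunk].append(key)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


def similar_pairs(sets_by_key, bands=20, rows=2, threshold=0.5):
    """Run LSH, then keep candidates whose exact Jaccard similarity passes the threshold."""
    hash_funcs = make_hash_funcs(bands * rows)
    signatures = {k: minhash_signature(v, hash_funcs) for k, v in sets_by_key.items()}
    result = []
    for b1, b2 in sorted(lsh_candidates(signatures, bands, rows)):
        sim = jaccard(sets_by_key[b1], sets_by_key[b2])
        if sim >= threshold:
            result.append({"b1": b1, "b2": b2, "sim": sim})
    return result


# Made-up data: b1 and b2 share all users, b3 shares none.
businesses = {"b1": {0, 1, 2, 3}, "b2": {0, 1, 2, 3}, "b3": {10, 11, 12}}
pairs = similar_pairs(businesses)
```

Because every candidate is verified with the exact Jaccard similarity, LSH in this sketch can only miss true pairs (lower recall). It cannot report a false pair.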

Output Demo

  • Similar Items:
    • b1 and b2 are the business IDs
    • sim is the Jaccard similarity of b1 and b2
  • Content-based RS:
    • each user_id and business_id pair answers the question 'would this user prefer to review this business?'
    • sim is the calculated (predicted) cosine similarity between the two profile vectors
  • User-based CF Pearson Correlation Model:
    • u1 and u2 are the user IDs
    • sim is the Pearson correlation between the two users
  • Item-based CF Pearson Correlation Model:
    • b1 and b2 are the business IDs
    • sim is the Pearson correlation between the two businesses
  • CF prediction result:
    • each user_id and business_id pair means 'this user will likely rate this business with this many stars'
    • stars is the predicted rating
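
To make the `sim` and `stars` fields above more concrete, here is a small sketch of an item-based CF prediction. It follows the description given earlier (Pearson correlation over at least three co-rated users, then the most similar neighbors), but the details are assumptions for illustration: the function names are hypothetical, only positively correlated neighbors are used, and the prediction is a similarity-weighted average. The repository's actual weighting may differ.

```python
from math import sqrt


def pearson(ratings_a, ratings_b, min_corated=3):
    """Pearson correlation over users who rated both businesses.

    Returns None when there are fewer than min_corated common users
    or when either business has zero variance on the common users.
    """
    common = set(ratings_a) & set(ratings_b)
    if len(common) < min_corated:
        return None
    mean_a = sum(ratings_a[u] for u in common) / len(common)
    mean_b = sum(ratings_b[u] for u in common) / len(common)
    num = sum((ratings_a[u] - mean_a) * (ratings_b[u] - mean_b) for u in common)
    var_a = sum((ratings_a[u] - mean_a) ** 2 for u in common)
    var_b = sum((ratings_b[u] - mean_b) ** 2 for u in common)
    den = sqrt(var_a * var_b)
    return num / den if den else None


def predict_stars(user_id, business_id, ratings, n_neighbors=3):
    """Similarity-weighted average of the user's ratings on similar businesses."""
    scored = []
    for other, other_ratings in ratings.items():
        if other == business_id or user_id not in other_ratings:
            continue
        sim = pearson(ratings[business_id], other_ratings)
        if sim is not None and sim > 0:  # assumption: ignore non-positive neighbors
            scored.append((sim, other_ratings[user_id]))
    top = sorted(scored, reverse=True)[:n_neighbors]
    if not top:
        return None  # no usable neighbor; a real system would need a fallback
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


# Made-up ratings: {business_id: {user_id: stars}}; u4 has not rated b1.
ratings = {
    "b1": {"u1": 5, "u2": 3, "u3": 1},
    "b2": {"u1": 4, "u2": 3, "u3": 2, "u4": 4},
    "b3": {"u1": 1, "u2": 3, "u3": 5, "u4": 2},
}
sim_b1_b2 = pearson(ratings["b1"], ratings["b2"])
stars = predict_stars("u4", "b1", ratings)
prediction = {"user_id": "u4", "business_id": "b1", "stars": stars}
```

In this toy data b2 is perfectly correlated with b1 and b3 is perfectly anti-correlated, so only b2 contributes and the predicted rating equals the rating u4 gave b2.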

Model and prediction accuracy/precision/recall/RMSE

  1. Similar business pairs
    1. precision: 1.0
    2. recall: 0.9582400942205771
  2. Content-based RS
    1. precision (test set): 1.0
    2. recall (test set): 0.999469477863536
  3. CF model
    1. item-based CF model
      1. precision: 0.9641450981844213
      2. recall: 0.9805068470797926
    2. user-based CF model
      1. precision: 0.9573746593617223
      2. recall: 0.8276633759390503
  4. CF prediction
    1. item-based RMSE (test set): 0.9023539405054186
    2. user-based RMSE (test set): 0.9901023647008427
  5. Hybrid Recommendation System:
    1. Blind test set RMSE: 1.1498
    2. Test set RMSE: 1.14166
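
For readers who want to reproduce metrics of this kind, the sketch below shows one common way to compute RMSE for rating predictions and precision and recall for a set of discovered pairs. How the figures above were actually computed is not stated here, so the ground-truth comparison, the function names, and the sample values are assumptions for illustration only.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error over the (user_id, business_id) keys present in both."""
    keys = predicted.keys() & actual.keys()
    return sqrt(sum((predicted[k] - actual[k]) ** 2 for k in keys) / len(keys))


def precision_recall(found_pairs, true_pairs):
    """Precision and recall of unordered pairs against a ground-truth set."""
    found = {frozenset(p) for p in found_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(found & truth)
    return true_positives / len(found), true_positives / len(truth)


# Made-up predictions and ground truth.
predicted = {("u1", "b1"): 4.0, ("u2", "b1"): 3.0}
actual = {("u1", "b1"): 5.0, ("u2", "b1"): 3.0}
score = rmse(predicted, actual)

precision, recall = precision_recall(
    [("b1", "b2"), ("b3", "b4")],
    [("b2", "b1"), ("b3", "b4"), ("b5", "b6")],
)
```

Pairs are stored as `frozenset` so that (b1, b2) and (b2, b1) count as the same pair.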

Algorithms and mathematical derivation of the models

  1. Cosine Similarity:

  2. Normalized Term Frequency:

  3. Inverse Document Frequency:

  4. TF-IDF score:

  5. Jaccard Similarity and distance:

  6. User-based CF Pearson Correlation:

  7. User-based CF prediction using Pearson Correlation:

  8. Item-based CF Pearson Correlation:

  9. Item-based CF prediction using Pearson Correlation:
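
For reference, the textbook forms of the nine quantities listed above can be written as follows. These are the standard definitions and are an assumption here: the project's code may differ in details such as the logarithm base, how the means are taken, or which neighbors enter the prediction. Notation (all assumed): $f_{t,d}$ is the raw count of term $t$ in document $d$, $N$ is the number of documents, $n_t$ the number of documents containing $t$, $r_{u,i}$ the rating of user $u$ for business $i$, $I_{uv}$ the businesses co-rated by users $u$ and $v$, $U_{ij}$ the users who rated both businesses $i$ and $j$, and $\mathcal{N}$ the chosen neighbor set.

```latex
\begin{align*}
% 1. Cosine similarity between two profile vectors
\cos(\mathbf{a}, \mathbf{b}) &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} \\
% 2. Normalized term frequency
\mathrm{TF}_{t,d} &= \frac{f_{t,d}}{\max_{k} f_{k,d}} \\
% 3. Inverse document frequency
\mathrm{IDF}_{t} &= \log_2 \frac{N}{n_t} \\
% 4. TF-IDF score
w_{t,d} &= \mathrm{TF}_{t,d} \times \mathrm{IDF}_{t} \\
% 5. Jaccard similarity and distance
J(A, B) &= \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}, \qquad d_J(A, B) = 1 - J(A, B) \\
% 6. User-based Pearson correlation
w_{u,v} &= \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}
               {\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}} \\
% 7. User-based prediction
\hat{r}_{u,i} &= \bar{r}_u + \frac{\sum_{v \in \mathcal{N}} w_{u,v} \, (r_{v,i} - \bar{r}_v)}{\sum_{v \in \mathcal{N}} \lvert w_{u,v} \rvert} \\
% 8. Item-based Pearson correlation
w_{i,j} &= \frac{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}
               {\sqrt{\sum_{u \in U_{ij}} (r_{u,i} - \bar{r}_i)^2} \, \sqrt{\sum_{u \in U_{ij}} (r_{u,j} - \bar{r}_j)^2}} \\
% 9. Item-based prediction
\hat{r}_{u,i} &= \frac{\sum_{j \in \mathcal{N}} w_{i,j} \, r_{u,j}}{\sum_{j \in \mathcal{N}} \lvert w_{i,j} \rvert}
\end{align*}
```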
