This task was part of an assignment in the course SDSC 3002 (Data Mining) taught by Dr. Yu Yang at City University of Hong Kong. The task was as follows:
Download the files "training.txt", "testing.txt" and "item_tag.txt". In the file "training.txt", each line is of the form u,i,r, which means that the rating of user u on the item i is r.
In the file "testing.txt", each line is represented by u,i,? which means you are required to predict the rating of user u on item i. Use the training dataset "training.txt" to build a recommender system and make predictions for the testing dataset "testing.txt" by replacing all the "?" with your predicted ratings. All the ratings are within the range [0,5].
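Assuming a comma-separated layout (u,i,r for training lines and u,i,? for testing lines; the exact separator in the provided files may differ), the two files can be parsed along these lines:

```python
def parse_training_line(line):
    """Parse one training line of the (assumed) form 'u,i,r'."""
    u, i, r = line.strip().split(",")
    return u, i, float(r)

def parse_testing_line(line):
    """Parse one testing line of the form 'u,i,?'; the rating is unknown."""
    u, i, _ = line.strip().split(",")
    return u, i

# Example usage on in-memory lines standing in for the files:
training_lines = ["1,10,4.0", "2,10,3.5"]
testing_lines = ["1,20,?"]

ratings = [parse_training_line(l) for l in training_lines]
pairs = [parse_testing_line(l) for l in testing_lines]
```

The predicted ratings for `pairs` would then be written back in place of the "?" fields.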
You may also want to use the file "item_tag.txt", where each line lists an item i followed by its tags, indicating that the item i has tags t1, t2, …. Note that some items may not have any tags, so it is normal if you cannot find some items in the file "item_tag.txt".
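Assuming each line of "item_tag.txt" is comma-separated with the item id first (an assumption; the real file's delimiter may differ), the tags can be loaded into a dictionary, with a default for untagged items:

```python
def load_item_tags(lines):
    """Build an item -> tag-list mapping from lines of the (assumed)
    form 'i,t1,t2,...'. Items absent from the file get no entry,
    matching the note that some items carry no tags."""
    tags = {}
    for line in lines:
        item, *item_tags = line.strip().split(",")
        tags[item] = item_tags
    return tags

# In-memory stand-in for the file's contents:
item_tag_lines = ["10,action,thriller", "20,comedy"]
item_tags = load_item_tags(item_tag_lines)

# Items not present in the file are looked up with an empty default:
tags_30 = item_tags.get("30", [])
```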
Solution Approach:
This recommendation system assumes that users' tastes are stable over time: a user who liked certain items in the past will like similar items in the future, so similar items should receive similar ratings from the same user.
This recommendation system uses singular value decomposition (SVD), a collaborative filtering method. SVD starts from a matrix with users as rows and items as columns, whose entries are the corresponding users' ratings on the items. It then factorizes this matrix into three matrices, extracting latent factors from the high-dimensional rating matrix.
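The user-item matrix described above can be sketched in plain Python (the ids and ratings here are made up; unobserved entries are left as None):

```python
def build_rating_matrix(ratings):
    """Arrange (user, item, rating) triples into a users-by-items matrix.

    Missing (user, item) pairs stay None: the matrix is typically sparse,
    and it is exactly these gaps that the recommender must fill in."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    u_idx = {u: k for k, u in enumerate(users)}
    i_idx = {i: k for k, i in enumerate(items)}
    matrix = [[None] * len(items) for _ in users]
    for u, i, r in ratings:
        matrix[u_idx[u]][i_idx[i]] = r
    return matrix, users, items

ratings = [("u1", "i1", 4.0), ("u1", "i2", 2.0), ("u2", "i2", 5.0)]
matrix, users, items = build_rating_matrix(ratings)
# matrix[0] is user "u1"'s row; user "u2" has no rating for "i1".
```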
The SVD model is built from users' past behavior, namely each user's ratings of items. The model learns associations between users and items, and then uses these learned features to predict which items a user may be interested in and what rating the user would give. Here, the trained model is used to predict the rating of a user on a specific item.
For each prediction, a pair (u, i) is required as input, where u is the user id and i is the item id. The scikit-surprise library is used for learning. To reduce the error between the actual and predicted ratings, bias terms are added. The prediction with bias terms is

r̂_ui = μ + b_u + b_i + q_iᵀ p_u

where μ is the overall mean rating, b_u and b_i are the user and item biases, and p_u and q_i are the latent factor vectors of user u and item i.
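The biased prediction r̂_ui = μ + b_u + b_i + q_i · p_u can be written as a small function (the names and toy numbers here are illustrative, not the scikit-surprise internals), with the result clipped to the valid rating range [0, 5]:

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, lo=0.0, hi=5.0):
    """Biased matrix-factorization prediction:
    mu + b_u + b_i + dot(q_i, p_u), clipped to [lo, hi]."""
    dot = sum(p * q for p, q in zip(p_u, q_i))
    return min(hi, max(lo, mu + b_u + b_i + dot))

# Toy numbers: global mean 3.0, a slightly generous user (+0.2),
# a well-liked item (+0.5), and 2-dimensional latent factors.
est = predict_rating(3.0, 0.2, 0.5, [0.1, 0.3], [0.4, 0.2])
```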
To estimate the unknowns, we minimize the following regularized squared error over the set K of known ratings:

Σ_{(u,i)∈K} (r_ui − μ − b_u − b_i − q_iᵀ p_u)² + λ (b_u² + b_i² + ‖p_u‖² + ‖q_i‖²)
Stochastic gradient descent (SGD) is used to minimize this error. The number of SGD iterations, n_epochs, is a tunable parameter, as is n_factors, the number of latent factors. The learning rate is set to 0.005 and the regularization term λ to 0.02, the scikit-surprise defaults.
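A minimal SGD loop for this objective, in pure Python with the defaults mentioned above (lr = 0.005, reg = 0.02); this is a sketch of the standard update rules, not the scikit-surprise implementation:

```python
import random

def sgd_train(ratings, n_users, n_items, n_factors=2, n_epochs=50,
              lr=0.005, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on the regularized squared error.
    ratings: list of (user_index, item_index, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    b_u = [0.0] * n_users
    b_i = [0.0] * n_items
    p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            e = r - (mu + b_u[u] + b_i[i] + dot)        # prediction error
            b_u[u] += lr * (e - reg * b_u[u])
            b_i[i] += lr * (e - reg * b_i[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (e * qif - reg * puf)
                q[i][f] += lr * (e * puf - reg * qif)
    return mu, b_u, b_i, p, q

def sq_error(ratings, mu, b_u, b_i, p, q):
    """Unregularized squared error of the fitted model on known ratings."""
    return sum((r - (mu + b_u[u] + b_i[i]
                + sum(pf * qf for pf, qf in zip(p[u], q[i])))) ** 2
               for u, i, r in ratings)

# Tiny made-up example: 2 users x 2 items with integer-indexed ids.
data = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0)]
mu, b_u, b_i, p, q = sgd_train(data, n_users=2, n_items=2, n_epochs=200)
err = sq_error(data, mu, b_u, b_i, p, q)
```

With only the global mean μ = 3.25, the squared error on this toy data would be 8.75; after training, the biases and factors bring it below that.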