Chapter 5 Concluding remarks

In this project, we train a recommender system to predict movie ratings on a scale from 0.5 to 5, using the MovieLens 10M (GroupLens 2009) dataset. Our final model considers user, movie, genre, and time-based biases, and uses Funk’s matrix factorization to approximate the residuals after these effects have been removed from the ratings. The RMSE achieved by our final model, as evaluated using the validation partition, is 0.7941947.
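Schematically, and with notation assumed from earlier chapters (where the bias terms are defined precisely), the final model can be written as

\[
\hat{r}_{u,i} = \mu + b_u + b_i + b_{g(i)} + b_{t(u,i)} + U_u V_i^{\mathsf{T}},
\]

where \(\mu\) is the overall mean rating, the \(b\) terms are the user, movie, genre, and time biases, and \(U_u V_i^{\mathsf{T}}\) is the matrix-factorization approximation of the remaining residual.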

Note that the effect of adding the genre and time-based biases was small. To see why, first note that the movie bias for frequently rated movies is already estimated accurately, without needing to “borrow” additional information from similar movies. Conversely, movies with few ratings contribute little to the overall RMSE. For the same reason, adding each movie’s year of release as a model feature is also unlikely to improve the results significantly.

However, this result is partly an artifact of how the validation set was constructed: it deliberately contains only movies that also appear in the training and test sets. In a live environment, genre and time-based information would prove useful for predicting ratings of new movies, for which no movie bias can be computed (setting it to zero is likely the best fallback). In that setting, adding the year of release as an additional model feature would likely improve prediction accuracy. Another candidate feature is the age of a movie at the time it was rated. The tag information included in the original MovieLens 10M dataset (unused in this project) could also help estimate the ratings of new or rarely rated movies.
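For illustration, a zero fallback for the movie bias is straightforward to implement. The sketch below is hypothetical: it assumes the bias-fitting step produced an overall mean mu and lookup tables user_biases (columns userId, b_u) and movie_biases (columns movieId, b_i), names which do not appear in this section.

# A minimal sketch (object names hypothetical) of predicting with a zero
#  fallback for the movie bias of movies unseen during training
library(dplyr)

predict_with_fallback <- function(new_data, mu, user_biases, movie_biases) {
  new_data %>%
    left_join(user_biases,  by = "userId") %>%
    left_join(movie_biases, by = "movieId") %>%  # b_i is NA for unseen movies
    mutate(b_i  = coalesce(b_i, 0),              # zero fallback
           pred = mu + b_u + b_i) %>%
    pull(pred)
}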

One further consideration is that while this project minimizes the error of the raw ratings, a binary framing may be more useful in practice: would a user enjoy a movie they have not yet watched, if it were recommended to them? If we assume a user enjoys a movie when they rate it 3.5 stars or higher, then the confusion matrix as computed on the validation set is:

# Classify movies as good or bad based on 3.5-star threshold
#  and compute confusion matrix
library(caret)

confusionMatrix(
  as.factor(ifelse(predicted_ratings_FINAL_VALIDATION >= 3.5, 'Good', 'Bad')),
  as.factor(ifelse(validation$rating >= 3.5, 'Good', 'Bad')),
  positive = 'Good')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    Bad   Good
##       Bad  301123 140407
##       Good 110332 448137
##                                           
##                Accuracy : 0.7493          
##                  95% CI : (0.7484, 0.7501)
##     No Information Rate : 0.5885          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4879          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7614          
##             Specificity : 0.7318          
##          Pos Pred Value : 0.8024          
##          Neg Pred Value : 0.6820          
##              Prevalence : 0.5885          
##          Detection Rate : 0.4481          
##    Detection Prevalence : 0.5585          
##       Balanced Accuracy : 0.7466          
##                                           
##        'Positive' Class : Good            
## 

The accuracy of our model is about 75%, with roughly balanced sensitivity (76.1%) and specificity (73.2%).

Furthermore, note that while the ratings in the MovieLens dataset are discrete, the generated predictions are not. If only discrete predictions are allowed, a series of thresholds may be fitted to our current model for binning (these thresholds need not be a half-star apart and could instead be based on the distribution of true and predicted ratings). How such an approach would affect the accuracy of our model remains to be seen: correct binning decreases the error of a prediction, but incorrect binning can increase it. For example, a prediction of 3.6 binned to 3.5 sees its error decrease from 0.1 to 0 if the true rating is 3.5, but increase from 0.4 to 0.5 if the true rating is 4.0.
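The simplest such scheme, uniform half-star rounding, provides a baseline against which fitted thresholds could be compared. The sketch below reuses the prediction vector from the confusion matrix above; the rmse helper is defined inline for self-containment:

# A minimal sketch of uniform half-star binning as a baseline for
#  fitted thresholds, comparing RMSE before and after discretization
rmse <- function(pred, truth) sqrt(mean((pred - truth)^2))

# Round to the nearest half star, then clamp to the valid [0.5, 5] range
binned <- pmin(pmax(round(predicted_ratings_FINAL_VALIDATION * 2) / 2, 0.5), 5)

rmse(predicted_ratings_FINAL_VALIDATION, validation$rating)  # continuous
rmse(binned, validation$rating)                              # discretized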

5.1 The cmfrec package and benchmarks

After completing this project, I discovered another R package called cmfrec (Cortes 2022a) which can handle the size of the MovieLens 10M dataset and in fact uses it for benchmarking (Cortes 2022b). The best reported result among the R implementations has an RMSE of 0.782465, somewhat better than that achieved here.

There are several possible reasons why cmfrec outperforms the method used in this project. First, the cmfrec algorithms also update the user and movie biases during each MF iteration, as opposed to the static approach used here, which only updates \(U\) and \(V\). The benchmarks in Cortes (2022b) show that such static methods generally perform worse than methods with iterative bias updates. Second, methods in cmfrec use alternating least squares (ALS) by default rather than the gradient descent used here, which improves numerical stability and convergence; another benefit of ALS is the possibility of massive parallelization (Koren, Bell, and Volinsky 2009). Finally, no optimization was performed on the learning and regularization parameters \(\lambda\) and \(\gamma\) in this project.
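For reference, a rough sketch of what such a fit might look like with cmfrec follows. The hyperparameter values are placeholders, the training data.frame is assumed to be named edx (the counterpart to the validation set above), and the call relies on cmfrec’s data.frame interface, which remaps raw user and movie IDs internally:

# A rough sketch (hyperparameters are placeholders) of an ALS fit with
#  iteratively-updated biases using cmfrec's data.frame interface
library(cmfrec)

fit <- CMF(
  edx[, c("userId", "movieId", "rating")],  # (user, item, value) triplets
  k = 50,            # number of latent factors
  method = "als",    # alternating least squares
  user_bias = TRUE,  # biases re-estimated at every iteration, unlike
  item_bias = TRUE,  #  the static biases used in this project
  niter = 15)

# Predict using the same raw IDs as in the training triplets
preds <- predict(fit, user = validation$userId, item = validation$movieId)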

References

Cortes, David. 2022a. Cmfrec: Collective Matrix Factorization for Recommender Systems. https://cran.r-project.org/package=cmfrec.
———. 2022b. “Matrix Factorization Benchmarks.” https://github.com/david-cortes/cmfrec/tree/master/benchmark.
GroupLens. 2009. “MovieLens 10M Dataset.” https://grouplens.org/datasets/movielens/10m/.
Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer 42 (8): 30–37. https://doi.org/10.1109/mc.2009.263.