Chapter 2 Linear regression models
We start by splitting edx into a training set and a test set:
# Test set will be 10% of edx data
set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = edx$rating, times = 1, p = 0.1, list = FALSE)
edx_train <- edx[-test_index,]
temp <- edx[test_index,]

# Make sure userId and movieId in edx_test set are also in edx_train set
edx_test <- temp |>
  semi_join(edx_train, by = "movieId") |>
  semi_join(edx_train, by = "userId") |>
  as.data.table()

# Add rows removed from edx_test set back into edx_train set
removed2 <- anti_join(temp, edx_test)
edx_train <- rbind(edx_train, removed2) |> as.data.table()

rm(removed2, temp, test_index)
2.1 Overview and notation
Let \(Y\) be an \(N_\mathrm{U}\times N_\mathrm{M}\) matrix of movie ratings, where \(N_\mathrm{U}\) is the number of users and \(N_\mathrm{M}\) the number of movies, such that \(Y_{u,i}\) is the rating user \(u\) has given or would give movie \(i\). Additionally, define \(X_j\) such that \(X_{j;u,i}\) denotes the \(j\)th attribute of the user-movie pair \((u,i)\). Such attributes include \(u\) and \(i\) themselves, the genres of movie \(i\), and the timestamp at which the rating was made. Finally, only the entries \(Y_{u,i}\) whose indices \((u,i)\) are contained in the training set, denoted \(\mathcal{T}\), are observed.
The goal is to estimate \(Y\) given the observable elements of \(Y\) (the actual ratings). Given a user-movie pair \((u,i)\), we model \(Y_{u,i}\) using a multiple linear regression model: \[\begin{equation} Y_{u,i} \sim \mu + \left(\sum_j \beta_{j;u,i} \right) +\varepsilon_{u,i}, \tag{2.1} \end{equation}\] where
- \(\mu\) represents the “true” rating for all movies,
- \(\beta_{j;u,i}\) is the \(j\)th bias term for pair \((u,i)\),
- and \(\varepsilon_{u,i}\) is a random error term; the errors are sampled independently from a common zero-mean distribution.
We further define \(b_j\) such that \[ \left(X_{j;u,i} = n\right) \implies \left(\beta_{j;u,i} = b_{j;n}\right). \] In other words, whereas \(\beta_j\) defines biases based on user-movie pairs, \(b_j\) defines the same biases based on some attribute of these pairs, for example, the movie genres.
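For example, if attribute \(j = 1\) is the movie index itself, then \(X_{1;u,i} = i\), and the corresponding bias depends only on the movie: \[ \beta_{1;u,i} = b_{1;i}, \] which is exactly the movie-effect term introduced in Section 2.3.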
We can write Equation (2.1) in matrix form: \[ Y \sim \mu + \left( \sum_j \beta_j \right) + \varepsilon. \]
The objective is to minimize the sum of squared errors \[ \text{SSE} = \sum_{(u,i)\in\mathcal{T}} \left[Y_{u,i} - \mu - \sum_j \beta_{j;u,i} \right]^2, \] where the sum runs over the training set \(\mathcal{T}\) of observed movie ratings.
The estimated value of \(Y_{u,i}\) for \((u,i) \notin \mathcal{T}\) is \[ \hat{Y}_{u,i} = \hat{\mu} + \sum_j \hat{\beta}_{j;u,i}. \]
2.2 Using the mean rating only
Our first model is of the form
\[
Y_{u,i} \sim \mu + \varepsilon_{u,i}.
\]
The least-squares estimate \(\hat{\mu}\) of \(\mu\) is the mean of all ratings in edx_train, that is, \(\hat{\mu} = \frac{1}{|\mathcal{T}|}\sum_{(u,i)\in\mathcal{T}} Y_{u,i}\):
# Mean-only model: use mean rating to predict all ratings
mu <- mean(edx_train$rating)
mu
## [1] 3.512551
This model gives the following RMSE values when applied to edx_test:
# Compute RMSE of mean-only model and add to a tibble.
RMSEs <- tibble(Method = c("Mean only"),
                RMSE = RMSE(mu, edx_test$rating),
                "RMSE (clamped estimates)" = RMSE(mu, edx_test$rating))
RMSEs[[nrow(RMSEs), 'RMSE']]
## [1] 1.060054
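Throughout this chapter, RMSE() denotes the usual root-mean-square error (as provided by caret, or by an equivalent helper defined earlier in the report); for reference, a minimal stand-in with the same argument order would be:
# Root-mean-square error between predicted and observed ratings
RMSE <- function(predicted, actual) sqrt(mean((predicted - actual)^2))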
2.3 Modeling movie effects
We add a term for movie effects to our model: \[ Y_{u,i} \sim \mu + b_{1;i} + \varepsilon_{u,i}. \] The least-squares estimate \(\hat{b}_{1;i}\) of \(b_{1;i}\) is the training-set mean of \(Y_{u,i} - \hat{\mu}\) over all ratings of movie \(i\). The following code computes \(\hat{b}_{1;i}\) for each \(i\) and plots the estimates as a histogram:
# Least-squares estimate of movie effect is the mean of (rating - mu) for all
# ratings of that movie.
movie_biases <- edx_train |>
  group_by(movieId) |>
  summarize(b_i = mean(rating - mu))
# Plot a histogram of the movie effects
par(cex = 0.7)
hist(movie_biases$b_i, 30, xlab = TeX(r'[$\hat{b}_{1;i}$]'),
main = TeX(r'[Histogram of $\hat{b}_{1;i}$]'))
The new model gives the following RMSE values when applied to edx_test:
# Obtain predictions for the edx_test set
predicted_ratings <- edx_test |>
  left_join(movie_biases, by='movieId') |>
  mutate(pred = mu + b_i) |> pull(pred)

# When multiple effects (movie, user, genre) are added in our model, some predictions
# may fall out of the valid range. This function fixes these predictions to the range
# [0.5, 5].
clamp <- function(x) raster::clamp(as.numeric(x), 0.5, 5)

# Compute RMSE and add to data.table
RMSEs <- RMSEs |>
  add_row(Method = "Movie effects",
          RMSE = RMSE(predicted_ratings, edx_test$rating),
          "RMSE (clamped estimates)" =
            RMSE(clamp(predicted_ratings), edx_test$rating))
RMSEs[nrow(RMSEs),] |>
  kable(align='lrr', booktabs = T) |> row_spec(0, bold = T)
Method | RMSE | RMSE (clamped estimates) |
---|---|---|
Movie effects | 0.9429615 | 0.9429615 |
2.3.1 Clamping the predictions
In the above table, clamping means setting any predictions less than 0.5 to 0.5, and any predictions greater than 5.0 to 5.0, thus enforcing the limits of possible ratings. This slightly reduces the RMSE when multiple biases are added to the model, as we demonstrate below.
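The clamp() helper defined above relies on the raster package. A base-R sketch with the same effect, should that dependency be undesirable (the name clamp_base is arbitrary), is:
# Clamp predictions to the valid rating range [0.5, 5] using base R only
clamp_base <- function(x) pmin(pmax(as.numeric(x), 0.5), 5)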
2.4 Modeling movie and user effects
We add a term \(b_{2;u}\) to our model for user effects: \[ Y_{u,i} \sim \mu + b_{1;i} + b_{2;u} + \varepsilon_{u,i}. \] We approximate \(b_{2;u}\) for each user \(u\) by the training-set mean of \(Y_{u,i} - \hat{\mu} - \hat{b}_{1;i}\) over all movies rated by user \(u\). The following code computes \(\hat{b}_{2;u}\) for each \(u\) and plots the estimates as a histogram:
# Estimate user effects
user_biases <- edx_train |>
  left_join(movie_biases, by='movieId') |>
  group_by(userId) |>
  summarize(b_u = mean(rating - mu - b_i))
# Plot a histogram of the user effects
par(cex = 0.7)
hist(user_biases$b_u, 30, xlab = TeX(r'[$\hat{b}_{2;u}$]'),
main = TeX(r'[Histogram of $\hat{b}_{2;u}$]'))
The new model gives the following RMSE values when applied to the edx_test set:
# Obtain predictions for the edx_test set
predicted_ratings <- edx_test |>
  left_join(movie_biases, by='movieId') |>
  left_join(user_biases, by='userId') |>
  mutate(pred = mu + b_i + b_u) |> pull(pred)

# Compute RMSE and add to data.table
RMSEs <- RMSEs |>
  add_row(Method = "Movie + user effects",
          RMSE = RMSE(predicted_ratings, edx_test$rating),
          "RMSE (clamped estimates)" =
            RMSE(clamp(predicted_ratings), edx_test$rating))
RMSEs[nrow(RMSEs),] |>
  kable(align='lrr', booktabs = T) |> row_spec(0, bold = T)
Method | RMSE | RMSE (clamped estimates) |
---|---|---|
Movie + user effects | 0.8646843 | 0.8644818 |
2.5 Adding genre effects
We add another bias term \(b_{3;g}\) to our model for genre effects: \[ Y_{u,i} \sim \mu + b_{1;i} + b_{2;u} + b_{3;g(i)} + \varepsilon_{u,i}, \] where \(g(i)\) is the combination of genres for movie \(i\). We approximate \(b_{3;g}\) for each genre combination \(g\) by the mean of \(Y_{u,i} - \hat{\mu} - \hat{b}_{1;i} - \hat{b}_{2;u}\) over all ratings in the training set with \(g(i) = g\). The following code computes \(\hat{b}_{3;g}\) for each \(g\) and plots the estimates as a histogram:
# Estimate genre effects
genre_biases <- edx_train |>
  left_join(movie_biases, by='movieId') |>
  left_join(user_biases, by='userId') |>
  group_by(genres) |>
  summarize(b_g = mean(rating - mu - b_i - b_u))
# Plot a histogram of the genre effects
par(cex = 0.7)
hist(genre_biases$b_g, 30, xlab = TeX(r'[$\hat{b}_{3;g}$]'),
main = TeX(r'[Histogram of $\hat{b}_{3;g}$]'))
The new model gives the following RMSE values when applied to the edx_test set:
# Obtain predictions for the edx_test set
predicted_ratings <- edx_test |>
  left_join(movie_biases, by='movieId') |>
  left_join(user_biases, by='userId') |>
  left_join(genre_biases, by='genres') |>
  mutate(pred = mu + b_i + b_u + b_g) |> pull(pred)

# Compute RMSE and add to data.table
RMSEs <- RMSEs |>
  add_row(Method = "Movie + user + genre effects",
          RMSE = RMSE(predicted_ratings, edx_test$rating),
          "RMSE (clamped estimates)" =
            RMSE(clamp(predicted_ratings), edx_test$rating))
RMSEs[nrow(RMSEs),] |>
  kable(align='lrr', booktabs = T) |> row_spec(0, bold = T)
Method | RMSE | RMSE (clamped estimates) |
---|---|---|
Movie + user + genre effects | 0.8643241 | 0.8641138 |
2.6 Adding a time effect
Consider a new model of the form \[ Y_{u,i} \sim \mu + b_{1;i} + b_{2;u} + b_{3;g(i)} + f(t_{u,i}) + \varepsilon_{u,i}, \] where \(f\) is a smooth function of the week index \(t_{u,i}\), defined so that the date of the oldest rating marks the start of Week 1.
The new optimization problem minimizes \[ \text{SSE} = \sum_{(u,i)\in\mathcal{T}} \left[Y_{u,i} - \mu - b_{1;i} - b_{2;u} - b_{3;g(i)} - f(t_{u,i})\right]^2. \]
The following code estimates \(f(t)\) as the smoothed mean rating in Week \(t\), minus \(\hat{\mu}\):
# Add a week number to each rating in the edx_train and edx_test datasets
edx_train <- edx_train |>
  mutate(weekNum = (timestamp - min(timestamp)) |>
           as.numeric(unit = "days") |> {\(x) floor(x/7) + 1}() )
edx_test <- edx_test |>
  mutate(weekNum = (timestamp - min(timestamp)) |>
           as.numeric(unit = "days") |> {\(x) floor(x/7) + 1}() )

# Fit a smooth curve to the ratings as a function of time
fit <- mgcv::gam(rating ~ s(weekNum, bs = "cs"),
                 family = gaussian(), data = edx_train) # apply smoothing

# Evaluate the fitted curve for each week number
r <- seq(1, max(edx_train$weekNum))
f_t <- mgcv::predict.gam(fit, data.frame(weekNum = r)) - mu
rm(fit)
# Plot the fitted curve
ggplot(data.frame(weekNum = r, f_t), aes(weekNum, f_t)) + geom_line() +
xlab('t, Week number') + ylab(TeX(r'[$f(t)$]'))
Refitting the movie, user, and genre effects for the new time-based model, we obtain the following RMSE values:
# Compute the biases
movie_biases_t <- edx_train |>
  mutate(f_t = f_t[weekNum]) |>
  group_by(movieId) |>
  summarize(b_i = mean(rating - mu - f_t))

user_biases_t <- edx_train |>
  mutate(f_t = f_t[weekNum]) |>
  left_join(movie_biases_t, by='movieId') |>
  group_by(userId) |>
  summarize(b_u = mean(rating - mu - b_i - f_t))

genre_biases_t <- edx_train |>
  mutate(f_t = f_t[weekNum]) |>
  left_join(movie_biases_t, by='movieId') |>
  left_join(user_biases_t, by='userId') |>
  group_by(genres) |>
  summarize(b_g = mean(rating - mu - b_i - b_u - f_t))

# Obtain predictions for the edx_test set
predicted_ratings <- edx_test |>
  mutate(f_t = f_t[weekNum]) |>
  left_join(movie_biases_t, by='movieId') |>
  left_join(user_biases_t, by='userId') |>
  left_join(genre_biases_t, by='genres') |>
  mutate(pred = mu + b_i + b_u + b_g + f_t) |>
  pull(pred)

# Compute RMSE and add to data.table
RMSEs <- RMSEs |>
  add_row(Method = "Movie + user + genre + time effects",
          RMSE = RMSE(predicted_ratings, edx_test$rating),
          "RMSE (clamped estimates)" =
            RMSE(clamp(predicted_ratings), edx_test$rating))
RMSEs[nrow(RMSEs),] |>
  kable(align='lrr', booktabs = T) |> row_spec(0, bold = T)
Method | RMSE | RMSE (clamped estimates) |
---|---|---|
Movie + user + genre + time effects | 0.8641266 | 0.8639174 |
2.7 Adding \(L_2\) regularization
To improve our model further, we can add \(L_2\) regularization. Whereas the previous model fitting procedure minimizes \[ \text{SSE} = \sum_{(u,i)\in\mathcal{T}} \left[Y_{u,i} - \mu - b_{1;i} - b_{2;u} - b_{3;g(i)} - f(t_{u,i})\right]^2, \] in this section we add a penalty term such that the new expression to minimize is as follows: \[ \text{SSE}_R = \text{SSE} + \lambda\sum_j\left\Vert b_j\right\Vert_2^2 = \text{SSE} + \lambda\sum_i b_{1;i}^2 + \lambda\sum_u b_{2;u}^2 + \lambda\sum_g b_{3;g}^2. \] Fitting the regularized model to the training set for different \(\lambda\), and using the test set for RMSE calculation, we obtain the following plot of RMSE against \(\lambda\).
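Before running the sweep over \(\lambda\), note that because each bias is estimated in turn from the residuals of the terms already fitted, the penalized problem has a simple closed form for each group. For the movie effect, which is fitted first, minimizing \(\text{SSE}_R\) with respect to \(b_{1;i}\) gives \[ \hat{b}_{1;i} = \frac{\sum_{u:(u,i)\in\mathcal{T}} \left[Y_{u,i} - \hat{\mu} - f(t_{u,i})\right]}{n_i + \lambda}, \] where \(n_i\) is the number of training ratings of movie \(i\). This shrinks the estimates for rarely rated movies toward zero, and it is what the sum(...)/(.N + l) expressions in the code below compute; the user and genre estimates follow the same pattern.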
# List of regularization parameter values to try.
# I increased the density of points near the optimal value since
# I already know it approximately.
lambdas <- c(0,1,2,3,4,seq(4.5,5.5,0.1),6,7,8,9,10)

# Compute RMSE values for each lambda using the test set.
rmses <- sapply(lambdas, function(l){
  message("lambda = ", l)

  # Compute movie, user, genre, and time effects using the training set.
  # Note that f_t here refers to the variable f_t and not the f_t column in
  # any of the data.tables.
  movie_biases_reg <-
    edx_train[, .(b_i = sum(rating - mu - f_t[weekNum])/(.N+l)), by = 'movieId']
  temp <- movie_biases_reg[edx_train, on = 'movieId']
  user_biases_reg <-
    temp[, .(b_u = sum(rating - mu - b_i - f_t[weekNum])/(.N+l)), by = 'userId']
  temp <- user_biases_reg[temp, on = 'userId']
  genre_biases_reg <-
    temp[, .(b_g = sum(rating - mu - b_i - b_u - f_t[weekNum])/(.N+l)), by = 'genres']

  # Generate predictions
  predicted_ratings <- genre_biases_reg[
    user_biases_reg[
      movie_biases_reg[
        edx_test, on = 'movieId'],
      on = 'userId'],
    on = 'genres'] |>
    mutate(pred = mu + b_i + b_u + b_g + f_t[weekNum]) |>
    pull(pred)

  # Compute RMSE
  return(RMSE(predicted_ratings, edx_test$rating))
})
# Plot RMSE against lambda
par(cex = 0.7)
qplot(lambdas, rmses, xlab = TeX(r'($\lambda$)'),
      ylab = 'RMSE', geom = c('point', 'line'))
The optimal value of \(\lambda\) is thus:
lambda <- lambdas[which.min(rmses)]
lambda
## [1] 4.9
Fitting the regularized model one last time and computing the RMSE on edx_test, we obtain:
# Compute movie, user, genre, and time effects using the training set, with the
# optimal regularization parameter value lambda we just found.
movie_biases_reg <-
  edx_train[, .(b_i = sum(rating - mu - f_t[weekNum])/(.N+lambda)), by = 'movieId']
temp <- movie_biases_reg[edx_train, on = 'movieId']
user_biases_reg <-
  temp[, .(b_u = sum(rating - mu - b_i - f_t[weekNum])/(.N+lambda)), by = 'userId']
temp <- user_biases_reg[temp, on = 'userId']
genre_biases_reg <-
  temp[, .(b_g = sum(rating - mu - b_i - b_u - f_t[weekNum])/(.N+lambda)),
       by = 'genres']

# Generate predictions for the edx_test set
predicted_ratings_reg <- genre_biases_reg[
  user_biases_reg[
    movie_biases_reg[
      edx_test, on = 'movieId'],
    on = 'userId'],
  on = 'genres'] |>
  mutate(pred = mu + b_i + b_u + b_g + f_t[weekNum]) |>
  pull(pred)
rm(temp)
# Compute RMSE and add to data.table
RMSEs <- RMSEs |>
  add_row(Method = "Movie + user + genre + time effects (regularized)",
          RMSE = RMSE(predicted_ratings_reg, edx_test$rating),
          "RMSE (clamped estimates)" =
            RMSE(clamp(predicted_ratings_reg), edx_test$rating))
RMSEs[nrow(RMSEs),] |>
  kable(align='lrr', booktabs = T) |> row_spec(0, bold = T)
Method | RMSE | RMSE (clamped estimates) |
---|---|---|
Movie + user + genre + time effects (regularized) | 0.8636151 | 0.8634932 |
2.8 Section summary
The table of RMSEs for all models considered in this section is below.
RMSEs |> kable(align='lrr', booktabs = T, linesep='') |> row_spec(0, bold = T)
Method | RMSE | RMSE (clamped estimates) |
---|---|---|
Mean only | 1.0600537 | 1.0600537 |
Movie effects | 0.9429615 | 0.9429615 |
Movie + user effects | 0.8646843 | 0.8644818 |
Movie + user + genre effects | 0.8643241 | 0.8641138 |
Movie + user + genre + time effects | 0.8641266 | 0.8639174 |
Movie + user + genre + time effects (regularized) | 0.8636151 | 0.8634932 |
The results show that each added effect reduced the RMSE, as did regularization and clamping the predictions; however, the gains diminish with each additional term.
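To make the diminishing returns explicit, the successive improvements can be tabulated directly from the RMSEs table; a brief sketch (the Improvement column name is arbitrary):
# Difference in (unclamped) RMSE between each model and the one before it
RMSEs |>
  mutate(Improvement = c(NA, -diff(RMSE))) |>
  kable(align='lrrr', booktabs = T, linesep='') |> row_spec(0, bold = T)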