Chapter 4 Final validation
We select our best model (3.1) for validation against the validation
dataset.
save(mu, movie_biases_reg, user_biases_reg, genre_biases_reg,
file = 'FINAL_model.Rdata') f_t, Uidx, Vidx, U, V,
The number of parameters in the model is:
nrow(movie_biases_reg) + nrow(user_biases_reg) + nrow(genre_biases_reg) +
length(f_t) + length(U) + length(V)
## [1] 1209852
The final RMSE computed with the validation
set is:
# Generate final predictions for the VALIDATION datset
<- validation |>
predicted_ratings_FINAL_VALIDATION mutate(weekNum = (timestamp - min(timestamp)) |>
as.numeric(unit = "days") |> {\(x) floor(x/7) + 1}() ) |>
mutate(f_t = f_t[weekNum]) |>
left_join(movie_biases_reg, by='movieId') |>
left_join(user_biases_reg, by='userId') |>
left_join(genre_biases_reg, by='genres') |>
mutate(pred = mu + b_i + b_u + b_g + f_t +
map2_dbl(userId, movieId, \(u,v) U[Uidx[u],] %*% V[Vidx[v],])) |>
pull(pred) |> clamp()
# Compute RMSE
RMSE(predicted_ratings_FINAL_VALIDATION, validation$rating)
## [1] 0.7941947
The plot below plots a histogram of prediction errors:
par(cex = 0.7)
hist(predicted_ratings_FINAL_VALIDATION - validation$rating, 50,
xlab = 'Prediction Error', main = '')
Below is a plot of the cumulative distribution of the absolute error:
<- ecdf(predicted_ratings_FINAL_VALIDATION - validation$rating)
the_ecdf par(cex = 0.7)
qplot(seq(0,4.5,0.001), the_ecdf(seq(0,4.5,0.001)),
xlab = 'Absoulute error of prediction', ylab = 'Empirical CDF', geom = 'line')
The proportion of predictions are within half a star of the actual rating is:
mean(abs(predicted_ratings_FINAL_VALIDATION - validation$rating) < 0.5)
## [1] 0.5169635