Forward stepwise selection
The first variable chosen was the one with the highest adjusted R², which is studio. It was then combined with the second-ranked variable, director.
stu_dir_model = lm(global_score ~ studio + director, data = movies)
summary(stu_dir_model)$adj.r.squared
## [1] 0.4017996
# Adjusted R² calculated earlier for the linear model with director as the only predictor
summary(dir_model)$adj.r.squared
## [1] 0.3993372
The difference is small. The first value (0.4018) is the adjusted R² of the model with both studio and director; the second (0.3993) is the value calculated earlier for the model with director alone. Let's try the variable with the next-highest adjusted R², imdb_num_votes, starting with a model in which it is the only predictor.
# Note: despite its name, this model uses imdb_num_votes as its only predictor
stu_imdb_votes_model = lm(global_score ~ imdb_num_votes, data = movies)
summary(stu_imdb_votes_model)$adj.r.squared
## [1] 0.07836049
On its own, this variable should not even be considered: an adjusted R² of 0.078 is very far from what we would call a ‘good model’. However, let’s combine all three variables we have tried so far: studio + director + imdb_num_votes.
stu_dir_imdb_votes_model = lm(global_score ~ studio + director + imdb_num_votes, data = movies)
summary(stu_dir_imdb_votes_model)$adj.r.squared
## [1] 0.4971885
That is something. Let’s now include the remaining variable under consideration, best_actor_win, to see whether we can do better.
stu_dir_imdb_votes_best_act_model = lm(global_score ~ studio + director + imdb_num_votes + best_actor_win, data = movies)
summary(stu_dir_imdb_votes_best_act_model)$adj.r.squared
## [1] 0.4855043
This variable does not help: the adjusted R² drops from 0.4972 to 0.4855. So let’s remove it and stick with the previous model, stu_dir_imdb_votes_model, which has the higher adjusted R².
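The manual steps above can be automated. The following is a minimal sketch of forward selection by adjusted R², not the exact procedure used in this report: it re-evaluates every remaining candidate at each step instead of following a fixed ranking. The function name forward_select is hypothetical, and the built-in mtcars data stands in for movies so that the example runs on its own.

```r
# Forward selection by adjusted R-squared (illustrative sketch).
# mtcars and the candidate list are stand-ins, not the movies data.
forward_select <- function(response, candidates, data) {
  selected <- character(0)
  best_adj <- -Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Adjusted R-squared of the current model plus each remaining candidate
    scores <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    # Stop when no candidate improves on the current best
    if (max(scores) <= best_adj) break
    best_adj <- max(scores)
    selected <- c(selected, names(which.max(scores)))
  }
  list(variables = selected, adj_r_squared = best_adj)
}

result <- forward_select("mpg", c("wt", "hp", "qsec", "am"), mtcars)
print(result)
```

With the movies data, the call would presumably look like forward_select("global_score", c("studio", "director", "imdb_num_votes", "best_actor_win"), movies).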
Checking conditions
Since we have only one quantitative variable, imdb_num_votes, this is the one to be considered.
1. Outliers: The outliers present are not due to incorrect entries or measurement errors. They are legitimate observations.
2. Linear relationship: The response variable has a linear relationship with the predictor variable X (imdb_num_votes).
ggplot(aes(x = imdb_num_votes, y = global_score), data = movies) +
  geom_jitter() +
  geom_smooth(method = 'lm', se = FALSE)

3. Homoscedasticity: The residuals do not have constant variability. Instead, they form a fan shape, i.e., the higher the value of the predictor variable X, the larger the variability of the residuals.
ggplot(aes(x = imdb_num_votes, y = global_score), data = movies) +
  geom_jitter() +
  geom_smooth(method = 'lm')
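A fan shape is usually easier to judge on a plot of residuals against fitted values than on the raw scatter plot. The sketch below illustrates the idea on simulated data; x, y and fit are hypothetical names and the numbers are not taken from the movies data. It also compares the residual spread in the lower and upper halves of the fitted values as a crude numeric check.

```r
# Simulated fan-shaped data (hypothetical; not the movies data)
set.seed(42)
x <- runif(500, 0, 100)
y <- 50 + 0.2 * x + rnorm(500, sd = 0.1 * x + 1)  # noise grows with x
fit <- lm(y ~ x)

# Residuals vs fitted values: a fan shape indicates non-constant variance
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Crude numeric check: residual spread in the lower vs upper half of fitted values
upper <- fitted(fit) > median(fitted(fit))
spread <- c(lower = sd(resid(fit)[!upper]), upper = sd(resid(fit)[upper]))
print(spread)
```

If the two standard deviations were close, the constant-variance condition would look more plausible; here the upper half is clearly more spread out by construction.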

4. No autocorrelation: The observations are independent, i.e., one observation does not affect another.
5. Multicollinearity: The predictor variables should not be related to one another.
Independence was checked for each pair of predictors using Pearson’s chi-squared test.
\(H_0:\) The variables are independent;
\(H_a:\) The variables are dependent;
Test: imdb_num_votes vs director
chisq.test(movies$imdb_num_votes, movies$director)
## Warning in chisq.test(movies$imdb_num_votes, movies$director): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: movies$imdb_num_votes and movies$director
## X-squared = 331410, df = 330540, p-value = 0.1403
As the p-value is greater than 0.05, we do not reject the null hypothesis \(H_0\). The test gives no evidence of a relationship between these two variables.
Test: imdb_num_votes vs studio
chisq.test(movies$imdb_num_votes, movies$studio)
## Warning in chisq.test(movies$imdb_num_votes, movies$studio): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: movies$imdb_num_votes and movies$studio
## X-squared = 132460, df = 132090, p-value = 0.2374
As the p-value is greater than 0.05, we do not reject the null hypothesis \(H_0\). The test gives no evidence of a relationship between these two variables.
Test: director vs studio
chisq.test(movies$director, movies$studio)
## Warning in chisq.test(movies$director, movies$studio): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: movies$director and movies$studio
## X-squared = 117990, df = 109310, p-value < 2.2e-16
As the p-value is smaller than 0.05, we reject the null hypothesis \(H_0\). There is evidence of a relationship between these two variables.
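The very large degrees of freedom and the "approximation may be incorrect" warnings in the outputs above have a common cause, which the following sketch tries to make visible. chisq.test() treats every distinct value as a category, so the degrees of freedom are (rows - 1) × (columns - 1) of the contingency table, and the warning is raised when expected cell counts fall below 5. The votes and studio vectors here are simulated stand-ins, not the movies data.

```r
# Simulated stand-ins (hypothetical; not the movies data)
set.seed(1)
votes  <- sample(1000:900000, 200)   # numeric, every value distinct
studio <- sample(c("A", "B", "C", "D"), 200, replace = TRUE)

# One row per distinct vote count, one column per studio
tab <- table(votes, studio)
df_formula <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Warning suppressed here only to keep the output short
res <- suppressWarnings(chisq.test(tab))
print(c(df_formula = df_formula, df_test = unname(res$parameter)))

# Expected counts are far below 5, which is what triggers the warning
print(max(res$expected))
```

With tables this sparse the chi-squared approximation is weak, so the p-values above are best read with caution.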
6. Normally distributed residuals:
imdb_num_votes_model = lm(global_score ~ imdb_num_votes, data = movies)
# Plot the residuals from their own data frame so the lengths always match
ggplot(data.frame(residuals = imdb_num_votes_model$residuals), aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 4, color = 'black', fill = '#44679F') +
  geom_density(alpha = .3, fill = '#DDF5F7') +
  xlab('Residuals') +
  ggtitle('Density histogram of IMDB number of votes model residuals') +
  theme_minimal()

shapiro.test(imdb_num_votes_model$residuals)
##
## Shapiro-Wilk normality test
##
## data: imdb_num_votes_model$residuals
## W = 0.97776, p-value = 2.784e-08
Both the Shapiro-Wilk p-value (far below 0.05) and the density histogram of the residuals indicate that the residuals are not normally distributed, which makes our model less reliable.
Note: The null hypothesis \(H_0\) for the Shapiro-Wilk normality test is ‘Data are normally distributed’, whereas the alternative hypothesis \(H_a\) is ‘Data are not normally distributed’.
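To see how the test behaves, here is a small sketch on simulated values, assuming nothing about the movies data. normal_resid and skewed_resid are hypothetical vectors: the first is drawn from a normal distribution, the second from a shifted exponential, which is strongly right-skewed. A normal Q-Q plot is added as a visual companion to the test.

```r
# Simulated residual-like samples (hypothetical; not the movies data)
set.seed(7)
normal_resid <- rnorm(300)       # drawn from a normal distribution
skewed_resid <- rexp(300) - 1    # right-skewed, centred near zero

# A small p-value means H0 (normality) is rejected
p_normal <- shapiro.test(normal_resid)$p.value
p_skewed <- shapiro.test(skewed_resid)$p.value
print(c(normal = p_normal, skewed = p_skewed))

# Points that bend away from the reference line suggest non-normality
qqnorm(skewed_resid)
qqline(skewed_resid)
```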
The formalized model is:
Formula: \(\hat{y} = \text{intercept} + \text{slope} \times \text{variable}\)
The model has many categorical variables, and a categorical variable enters the model either with all of its levels or not at all, so writing out every term would make the formula far too long. In summarized form it is:
\(\widehat{\text{globalScore}} = 10.6230058 + \text{slope}_1 \times \text{var}_1 + \text{slope}_2 \times \text{var}_2 + \dots + \text{slope}_N \times \text{var}_N\)
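The reason the formula grows so quickly is that lm() gives each non-reference level of a categorical predictor its own 0/1 indicator variable and its own slope. The sketch below shows this on the built-in iris data, used here only as a stand-in for movies: Species plays the role of a categorical predictor such as studio. It also rebuilds one prediction by hand as intercept + sum of slope × variable.

```r
# iris stands in for movies: Species plays the role of a categorical predictor
fit <- lm(Sepal.Length ~ Species + Petal.Width, data = iris)

# One intercept, one slope per non-reference Species level, one for Petal.Width
coefs <- coef(fit)
print(coefs)

# Design matrix: the categorical variable is expanded into 0/1 indicator columns
X <- model.matrix(fit)

# Manual prediction for the first row: intercept + sum(slope * variable)
manual <- as.numeric(X[1, ] %*% coefs)
print(c(manual = manual, predict = unname(predict(fit)[1])))
```

For the movies model, coef(stu_dir_imdb_votes_model) would list the intercept and every slope in the same way, with one entry per studio and director level.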