Adding extra package - GGally to use ggpairs function
library(statsr)
library(dplyr)
library(ggplot2)
library(GGally)
#load("movies.Rdata")
load("movies.Rdata")
The data was gathered from independent movie internet-resources such as Rotten Tomatoes and IMDB,random sampled. 651 movies, which are less than 10% of all movies for the cinematographic era and less than 10% of movies on both web resources. So, because data is get from 2 popular movie resources, we should keep in mind that prediction model, based on the data from this resources is applicable to make prediction for them only. It could not be spreaded for entire population, except IMDB and Rotten tomatoes variability of which should be considered in a separate work.
In this data we will evaluate which factors affect movie popularity. We are running observational studies. From a customer point of view, it’s quite interesting to find out why a certain movie could be popular.
To start with, we need to find out, which certain score we should use as a measure of movies’ success. We have 3 options: IMDB auditorium option, critics score on Rotten Tomatoes and Audience score on Rotten Tomatoes. We are analyzing different rating for the purpose of their representative evaluation.
movies_rate<-movies%>%
select(imdb_rating, critics_score, audience_score)%>%
group_by(imdb_rating, critics_score, audience_score)%>%
summarise(counts = n())
movies_rate
## Source: local data frame [644 x 4]
## Groups: imdb_rating, critics_score [?]
##
## imdb_rating critics_score audience_score counts
## <dbl> <dbl> <dbl> <int>
## 1 1.9 1 19 1
## 2 2.3 2 29 1
## 3 2.4 3 11 1
## 4 2.7 17 17 1
## 5 2.8 8 18 1
## 6 3.1 7 17 1
## 7 3.4 3 37 1
## 8 3.4 10 23 1
## 9 3.4 11 24 1
## 10 3.4 17 26 1
## # ... with 634 more rows
summary(movies_rate)
## imdb_rating critics_score audience_score counts
## Min. :1.900 Min. : 1.00 Min. :11.00 Min. :1.000
## 1st Qu.:5.900 1st Qu.: 33.00 1st Qu.:46.00 1st Qu.:1.000
## Median :6.600 Median : 61.00 Median :65.00 Median :1.000
## Mean :6.485 Mean : 57.40 Mean :62.23 Mean :1.011
## 3rd Qu.:7.300 3rd Qu.: 82.25 3rd Qu.:80.00 3rd Qu.:1.000
## Max. :9.000 Max. :100.00 Max. :97.00 Max. :2.000
ggpairs(movies, columns = c (13, 16, 18))
From this table of summary and correlation plots, we can see very high correlation between different scoring: IMDB, Rotten Tomatoes critic scores and audience scores.
Next, we exploring this single variety for it’s normality
summary(movies_rate$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.23 80.00 97.00
movies_rate%>%
group_by(audience_score)%>%
summarise(count = n())
## # A tibble: 84 × 2
## audience_score count
## <dbl> <int>
## 1 11 1
## 2 13 1
## 3 14 2
## 4 15 1
## 5 17 3
## 6 18 1
## 7 19 4
## 8 21 1
## 9 22 3
## 10 23 2
## # ... with 74 more rows
ggplot(movies_rate, aes(x=audience_score))+
geom_histogram(binwidth = 0.5)
Here we could see a short summary of the audience scores distribution. We can see that it’s normal distribution with a slight left skew.
ggplot(data = movies, aes(x = runtime, y = audience_score)) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE) +
xlab("Run time")+
ylab("Audience score")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
ggplot(data = movies, aes(x = thtr_rel_year, y = audience_score)) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Year of release")+
ylab("Audience score")
ggplot(data = movies, aes(x = audience_score, y = title_type )) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Audience score")+
ylab("Title type")
ggplot(data = movies, aes(x = imdb_num_votes, y = audience_score)) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Numer voters in IMDB")+
ylab("Audience score")
ggplot(data = movies, aes(x = audience_rating, y = audience_score)) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Audience rating")+
ylab("Audience score")
ggplot(data = movies, aes(x = audience_score , y =best_pic_nom )) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Audience score")+
ylab("Oscar nomination")
ggplot(data = movies, aes(x = audience_score , y = best_pic_win)) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Audience score")+
ylab("Oscar title")
ggplot(data = movies, aes(x = audience_score , y = best_dir_win)) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Audience score")+
ylab("Director gets Oscar")
ggplot(data = movies, aes(x = audience_score, y = top200_box )) +
geom_jitter()+
geom_smooth(method = "lm", se = FALSE)+
xlab("Audience score")+
ylab("Top 200 box list")
Providing this simple exploratory data analysis, we could find some tendentious: 1 run time vs score; 2. theater year release vs score; 3. documentary films have high grades; 4.movies with high numbers of voters, have higher scores; 5. Upright category of movies are associated with high ratings, also. 6. Oscar nominated movies have trend to have high ratings. 7. Oscar winning movies have trend to have high rating. 8. Oscar winning director, the tendency align that a movie with those tends to have high scores. 9. movies are in top200 have tendency to have high scores.
From this exploratory analysis we could highlight 9 important points to explore.
At this point of research, we need to run simple linear models to investigate our hypothesis if those variables are significant to predict audience scores.
#fit model
run_score = lm(audience_score ~ runtime , data = movies)
summary(run_score)
##
## Call:
## lm(formula = audience_score ~ runtime, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.641 -15.626 3.008 17.080 34.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.4203 4.3256 9.807 < 2e-16 ***
## runtime 0.1883 0.0402 4.684 3.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.92 on 648 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03275, Adjusted R-squared: 0.03125
## F-statistic: 21.94 on 1 and 648 DF, p-value: 3.431e-06
type_score = lm(audience_score ~ title_type , data = movies)
summary(type_score)
##
## Call:
## lm(formula = audience_score ~ title_type, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.465 -14.465 2.535 15.535 36.535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.255 2.592 32.114 < 2e-16 ***
## title_typeFeature Film -22.789 2.710 -8.408 2.65e-16 ***
## title_typeTV Movie -26.455 8.981 -2.946 0.00334 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.23 on 648 degrees of freedom
## Multiple R-squared: 0.09889, Adjusted R-squared: 0.09611
## F-statistic: 35.56 on 2 and 648 DF, p-value: 2.225e-15
year_score = lm(audience_score ~ thtr_rel_year, data = movies)
summary(year_score)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.250 -15.962 3.042 17.147 34.042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 261.45858 144.30579 1.812 0.0705 .
## thtr_rel_year -0.09965 0.07223 -1.380 0.1682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.21 on 649 degrees of freedom
## Multiple R-squared: 0.002925, Adjusted R-squared: 0.001388
## F-statistic: 1.904 on 1 and 649 DF, p-value: 0.1682
inv_score = lm(audience_score ~ imdb_num_votes, data = movies)
summary(inv_score)
##
## Call:
## lm(formula = audience_score ~ imdb_num_votes, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.707 -15.085 1.866 16.245 36.521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.936e+01 8.534e-01 69.552 < 2e-16 ***
## imdb_num_votes 5.227e-05 6.776e-06 7.714 4.6e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.37 on 649 degrees of freedom
## Multiple R-squared: 0.08399, Adjusted R-squared: 0.08258
## F-statistic: 59.51 on 1 and 649 DF, p-value: 4.602e-14
audience_score2 = lm(audience_score ~ audience_rating, data = movies)
summary(audience_score2)
##
## Call:
## lm(formula = audience_score ~ audience_rating, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.9345 -7.3032 0.6968 8.6968 19.6968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.9345 0.6133 68.38 <2e-16 ***
## audience_ratingUpright 35.3686 0.8070 43.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.17 on 649 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.7471
## F-statistic: 1921 on 1 and 649 DF, p-value: < 2.2e-16
bpn_score = lm(audience_score ~ best_pic_nom, data = movies)
summary(bpn_score)
##
## Call:
## lm(formula = audience_score ~ best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.56 -15.56 2.44 16.44 34.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.5596 0.7885 78.069 < 2e-16 ***
## best_pic_nomyes 23.7586 4.2894 5.539 4.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.78 on 649 degrees of freedom
## Multiple R-squared: 0.04514, Adjusted R-squared: 0.04367
## F-statistic: 30.68 on 1 and 649 DF, p-value: 4.425e-08
bpw_score = lm(audience_score ~ best_pic_win, data = movies)
summary(bpw_score)
##
## Call:
## lm(formula = audience_score ~ best_pic_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.12 -16.12 2.88 16.88 33.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.1196 0.7922 78.416 < 2e-16 ***
## best_pic_winyes 22.5947 7.6395 2.958 0.00321 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.1 on 649 degrees of freedom
## Multiple R-squared: 0.0133, Adjusted R-squared: 0.01178
## F-statistic: 8.748 on 1 and 649 DF, p-value: 0.003213
bdw_score = lm(audience_score ~ best_dir_win, data = movies)
summary(bdw_score)
##
## Call:
## lm(formula = audience_score ~ best_dir_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.857 -16.857 3.143 17.143 34.143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.8569 0.8171 75.701 <2e-16 ***
## best_dir_winyes 7.6547 3.1794 2.408 0.0163 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.15 on 649 degrees of freedom
## Multiple R-squared: 0.008852, Adjusted R-squared: 0.007325
## F-statistic: 5.797 on 1 and 649 DF, p-value: 0.01634
t2b_score = lm(audience_score ~ top200_box, data = movies)
summary(t2b_score)
##
## Call:
## lm(formula = audience_score ~ top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.075 -16.075 2.925 16.925 34.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.0755 0.7991 77.686 <2e-16 ***
## top200_boxyes 12.4579 5.2641 2.367 0.0182 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.15 on 649 degrees of freedom
## Multiple R-squared: 0.008556, Adjusted R-squared: 0.007028
## F-statistic: 5.601 on 1 and 649 DF, p-value: 0.01825
By running this single line models we investigated that 9 explanatory variable are fit to be used in building prediction model. 0 hypothesis is rejected due to low p-values in all 9 cases.
Next step is to test residuals of all 9 explanatory variables for linearity, constant variability and near normality of residuals and their independence.
#test on nearly normal residuals, linearity and constant variability
ggplot(data = run_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = run_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = type_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = type_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = year_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = year_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = inv_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = inv_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = audience_score2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = audience_score2, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = bpn_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = bpn_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = bpw_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = bpw_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = bdw_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = bdw_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = t2b_score, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = t2b_score, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
The residuals appear to be randomly distributed around 0. The plots are indicative of a linear relationships between running time, movie type, year of release, IMDb number of votes, audience rating; weather the movie was nominated or won Oscar and weather or not director won Oscar or not and audience score. Based on the residual plots, the constant variability condition appears to be met, as well as, condition if independent residuals.
Now let’s build a full prediction model, including all funded explanatory variables.
score1 = lm(audience_score ~ runtime + thtr_rel_year + title_type + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win + best_dir_win + top200_box, data = movies)
summary(score1)
##
## Call:
## lm(formula = audience_score ~ runtime + thtr_rel_year + title_type +
## imdb_num_votes + audience_rating + best_pic_nom + best_pic_win +
## best_dir_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.878 -6.762 0.802 7.486 20.277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.411e+02 7.357e+01 3.278 0.00110 **
## runtime 2.461e-02 2.161e-02 1.139 0.25516
## thtr_rel_year -9.608e-02 3.654e-02 -2.629 0.00876 **
## title_typeFeature Film -1.040e+01 1.481e+00 -7.026 5.45e-12 ***
## title_typeTV Movie -1.415e+01 4.497e+00 -3.147 0.00173 **
## imdb_num_votes 2.204e-05 4.173e-06 5.280 1.77e-07 ***
## audience_ratingUpright 3.226e+01 8.337e-01 38.701 < 2e-16 ***
## best_pic_nomyes 6.446e+00 2.460e+00 2.621 0.00898 **
## best_pic_winyes -6.555e+00 4.404e+00 -1.488 0.13712
## best_dir_winyes 2.607e+00 1.648e+00 1.582 0.11418
## top200_boxyes -2.401e+00 2.641e+00 -0.909 0.36353
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.587 on 639 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.779, Adjusted R-squared: 0.7755
## F-statistic: 225.2 on 10 and 639 DF, p-value: < 2.2e-16
We choose those certain variables because during running the exploratory analysis we found them having the smallest p-values as indicators for significant roles to predict audience scores.
The next step is to choose the best prediction parsimony model which is the model with the smalls amount of variables and the highest prediction power by choosing backward p-values selection. Drop the largest p-values of the model.
score = lm(audience_score ~ thtr_rel_year + title_type + imdb_num_votes + audience_rating + best_pic_nom, data = movies)
summary(score)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + title_type + imdb_num_votes +
## audience_rating + best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.6067 -6.8432 0.8671 7.4220 20.4680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.543e+02 7.179e+01 3.542 0.000426 ***
## thtr_rel_year -1.016e-01 3.578e-02 -2.839 0.004662 **
## title_typeFeature Film -9.890e+00 1.467e+00 -6.741 3.49e-11 ***
## title_typeTV Movie -1.383e+01 4.502e+00 -3.072 0.002216 **
## imdb_num_votes 2.204e-05 3.794e-06 5.810 9.85e-09 ***
## audience_ratingUpright 3.237e+01 8.328e-01 38.867 < 2e-16 ***
## best_pic_nomyes 5.673e+00 2.227e+00 2.548 0.011070 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.607 on 644 degrees of freedom
## Multiple R-squared: 0.7764, Adjusted R-squared: 0.7743
## F-statistic: 372.7 on 6 and 644 DF, p-value: < 2.2e-16
Trying for adjusted R^2 backwards model
r2<-lm(audience_score ~ audience_rating + imdb_num_votes + title_type +thtr_rel_year + best_pic_nom + runtime+best_dir_win+best_pic_win, data = movies)
summary(r2)$adj.r.squared
## [1] 0.7755891
Running the different options, we found this model as the model with highest adjusted R^2. #Conclution: p-value backward elimination model is the most efficiant model as long as adjusted r^2 model offers only 1 variable elimination.
this search was done earlier in the document. The graphs show linear relationships.
hist(score$residuals)
qqnorm(score$residuals)
qqline(score$residuals)
Both plots show nearly normal distribution of residuals
Residuals should be equally variable for low and high values of the predicted response variable. We can check that by plotting residuals vs. predicted values of dependent variable. In that manner we can check the whole model at once.
plot(score$residuals ~ score$fitted)
Here we can observe awkward situation where high and low variables are not equilly destributed, this bias could appear due to 2 non-numerial variables which I do not know how to handle, yet.
#Checking for model’s residuals independence
plot(score$residuals)
We see random normal destribution of the values -> residuals are independent
The model is score = 2.5e-1.01e(year of release)-9.89e(if the movie is featured)-1.38e(if the movie is TV)+2.2e(number of voted)+3.23e*(if the movie is Upright rated)+5.6e(if the move is Oscar nominated)
Interpretation: variable “year of release” somewhat meaningless if its value less than 1800
Using founded p-value backward model to run the prediction:
score <- lm(audience_score ~ thtr_rel_year + title_type + imdb_num_votes + audience_rating + best_pic_nom, data = movies)
Because of p-value is so small, we could reject H0 and say that the model as a whole is significant. R2 tells us, that 77.5% of all variables are included in the prediction model.
#Suicide Squad
movie2016 <- data.frame(thtr_rel_year = 2016, imdb_num_votes = 391131, title_type = "Feature Film", audience_rating = "Spilled", best_pic_nom = "yes" )
movie2016
## thtr_rel_year imdb_num_votes title_type audience_rating best_pic_nom
## 1 2016 391131 Feature Film Spilled yes
predict(score, movie2016, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 53.86204 34.43349 73.29058
Hence, the model predicts, with 95% confidence, that Suicide Squad with a year of release 2016, number of voted -391131 (http://www.imdb.com/), audience rating Spilled (https://www.rottentomatoes.com/), and been nominated on Oscar’s best picture (https://www.polygon.com/2017/1/24/14371144/suicide-squad-Oscar-nomination-razzie), expected to have an audience score between 34.43 and 73.29 wit a fit on 53.86
The model predicts not high level of audience rating - 53.86. According to the Rotten tomatoes, current Suicide Squad’s audience rating is 62. Before the data analysis I was sure that one of important factors influence movie ratings are dates of its releases, studio and length of the movie, but the analysis show that, actually, year of release (seems like new movies tend to have lower ratings), type of the movie (if it’s documentary, it has higher ratings). And the most predictable explanatory variables are number of voted and weather or not movie is nominated on Oscar. The findings of the research are interested in the contest to be transferred on the entire population, as the next step. How the IMDB and Rotten Tomatoes rating are correlated with the world’s participation of cinematography?