library(ggplot2)
library(dplyr)
library(statsr)
load("movies.Rdata")
The data set contains 651 randomly sampled movies produced and released before 2016. The information in this data set was gathered from Rotten Tomatoes and IMDB.
Since the sample was drawn at random and includes movies in all languages, the results of this analysis can be generalized to the movies listed on the IMDB and Rotten Tomatoes websites. However, this is an observational study with no random assignment, so the results cannot be used to establish causality.
* * *
Research Question: Build the best-fitting multiple linear regression model to predict the IMDB rating.
This question is of interest to me mainly because I check the IMDB rating of a movie before watching it. A good IMDB rating indicates that a movie is worth watching. If the movie rating is bad, I wouldn’t invest my time in watching it. Similarly many people across the globe have trust in IMDB ratings. This fact motivated me to figure out the factors responsible for rating a movie on a scale of 1 to 10.
Let’s have a look at the structure of the dataset.
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
Let’s omit the rows with incomplete values.
movie1 <- na.omit(movies)
dim(movie1)
## [1] 619 32
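Note that na.omit() drops every row with a missing value in any of the 32 columns, which is why the data shrinks from 651 to 619 rows. The sketch below shows one way to count missing values per column before dropping rows. The `toy` data frame and its values are made up for illustration and are not part of the movies data.

```r
# Toy data frame standing in for `movies` (hypothetical values).
toy <- data.frame(
  title   = c("A", "B", "C", "D"),
  runtime = c(90, NA, 120, 105),
  studio  = c("X", "Y", NA, "Z"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows na.omit() will keep: rows with no NA in any column.
n_complete <- sum(complete.cases(toy))

# Drop the incomplete rows, as done for `movies` above.
toy_clean <- na.omit(toy)

print(na_per_column)
print(dim(toy_clean))
```

Running the same two counts on `movies` would show which columns are responsible for the 32 dropped rows.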
We will now look at the summary statistics of the imdb_rating variable to get an idea of the values it takes.
summary(movie1$imdb_rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.900 5.900 6.600 6.486 7.300 9.000
The average rating of a movie in our sample is 6.486, with a minimum of 1.9 and a maximum of 9.
Let’s create a boxplot to check for potentially influential points.
boxplot(movie1$imdb_rating)
It looks like we have a few outliers with an IMDB rating of less than 4. For now I will keep these outliers and continue the analysis with the entire data set, as I would like to know the factors that result in a lower or higher IMDB rating.
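The cut-off of roughly 4 can be checked against the usual 1.5 x IQR boxplot rule. The sketch below uses the quartiles from the summary output above, then applies boxplot.stats() to a small made-up vector (`ratings` is hypothetical, not the real data). Note that boxplot() works with hinges, which can differ slightly from the quartiles that summary() reports.

```r
# Quartiles taken from the summary() output above.
q1 <- 5.9
q3 <- 7.3
iqr <- q3 - q1

# Points outside these fences are drawn as outliers by boxplot().
lower_fence <- q1 - 1.5 * iqr   # about 3.8
upper_fence <- q3 + 1.5 * iqr   # about 9.4
print(c(lower = lower_fence, upper = upper_fence))

# Hypothetical ratings, used only to show how to list the flagged points.
ratings <- c(1.9, 3.5, 5.9, 6.2, 6.6, 7.0, 7.3, 8.1, 9.0)
flagged <- boxplot.stats(ratings)$out
print(flagged)
```

With a lower fence near 3.8, only low ratings are flagged. The maximum rating of 9 sits below the upper fence.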
To avoid collinearity and attain parsimony, let’s look at how a few variables relate to the other variables in the dataset.
First, let’s find the correlations between the numeric variables.
num_movie <- data.frame(movie1$imdb_rating, movie1$critics_score, movie1$audience_score, movie1$runtime, movie1$imdb_num_votes)
cor(num_movie)
## movie1.imdb_rating movie1.critics_score
## movie1.imdb_rating 1.0000000 0.7619990
## movie1.critics_score 0.7619990 1.0000000
## movie1.audience_score 0.8605425 0.7015256
## movie1.runtime 0.2974388 0.2000696
## movie1.imdb_num_votes 0.3476431 0.2217208
## movie1.audience_score movie1.runtime
## movie1.imdb_rating 0.8605425 0.2974388
## movie1.critics_score 0.7015256 0.2000696
## movie1.audience_score 1.0000000 0.2031201
## movie1.runtime 0.2031201 1.0000000
## movie1.imdb_num_votes 0.3035904 0.3430813
## movie1.imdb_num_votes
## movie1.imdb_rating 0.3476431
## movie1.critics_score 0.2217208
## movie1.audience_score 0.3035904
## movie1.runtime 0.3430813
## movie1.imdb_num_votes 1.0000000
We can see that imdb_rating, critics_score and audience_score are highly correlated with one another (every pairwise correlation is above 0.7). Because of this collinearity, it is not necessary to use all three variables. Hence, alongside the response imdb_rating, we will keep only runtime and imdb_num_votes: the correlations among these variables are low (below 0.35), so the coefficient estimates should be more stable.
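Reading a wrapped correlation matrix by eye is error-prone, so here is a minimal sketch of how the highly correlated pairs could be listed automatically. The data is simulated, and the variable names and the 0.7 threshold are assumptions chosen for illustration.

```r
# Simulated stand-in data: two scores driven by a shared "quality" factor,
# plus an unrelated runtime column (all values are made up).
set.seed(1)
n <- 200
quality <- rnorm(n)
toy <- data.frame(
  critics_score  = 60 + 20 * quality + rnorm(n, sd = 5),
  audience_score = 60 + 15 * quality + rnorm(n, sd = 5),
  runtime        = rnorm(n, mean = 105, sd = 20)
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the threshold.
threshold <- 0.7
high <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(pairs)
```

Building the data frame with named columns, for example `movie1[, c("imdb_rating", "critics_score")]`, would also avoid the `movie1.` prefix seen in the output above.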
Let’s also look at some categorical variables to see whether any of them behave in a similar fashion. To find out, I’ll build an MLR with categorical variables only and examine their relationship with the IMDB rating.
best <- lm(imdb_rating ~ best_pic_win + best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movie1)
summary(best)
##
## Call:
## lm(formula = imdb_rating ~ best_pic_win + best_pic_nom + best_actor_win +
## best_actress_win + best_dir_win + top200_box, data = movie1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4892 -0.5289 0.0711 0.7108 2.1108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.38916 0.04825 132.424 < 2e-16 ***
## best_pic_winyes 0.03871 0.47377 0.082 0.9349
## best_pic_nomyes 1.13902 0.26326 4.327 1.77e-05 ***
## best_actor_winyes 0.03976 0.12140 0.328 0.7434
## best_actress_winyes 0.05790 0.13561 0.427 0.6696
## best_dir_winyes 0.44136 0.17547 2.515 0.0121 *
## top200_boxyes 0.51150 0.27431 1.865 0.0627 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.042 on 612 degrees of freedom
## Multiple R-squared: 0.06884, Adjusted R-squared: 0.05971
## F-statistic: 7.541 on 6 and 612 DF, p-value: 8.132e-08
NOTE: This model is built only to understand how the categorical variables relate to imdb_rating. We will leave out the variables with high p-values, since they add little to this model and would make the final model less reliable.
Some of the variables in the summary above have much higher p-values than others. To achieve parsimony, I’m going to use the single variable “best_pic_win” to represent the “best_actress_win”, “best_actor_win” and “best_pic_win” variables.
In addition, I’m going to skip “director”, “actor1”, “actor2”, “actor3”, “actor4”, “actor5”, “imdb_url” and “rt_url”, as these hold the names of people involved in the film and URLs, which are neither numerical nor categorical (continuous or discrete) variables. Certainly a few actors and directors are responsible for drawing audiences to the theatres, but there are simply too many distinct values, and variables like these require different techniques to analyse.
The same applies to the variables “title”, “title_type”, “mpaa_rating”, “studio” and “best_pic_nom”. I would analyse these variables using different techniques. I don’t want to include the month of release, since a good film will be watched whenever people have time, whether it is released during the holiday season or the exam season. Hence I’m skipping the “dvd_rel_month” and “thtr_rel_month” variables for now and will do a separate analysis specific to them.
I’m creating a new data frame with the remaining variables to continue my analysis.
Response variable - imdb_rating: Rating on IMDB
Explanatory variables -
imdb_num_votes: Number of votes on IMDB
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
thtr_rel_year: Year the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
movie <- movie1 %>%
select(imdb_rating, imdb_num_votes, runtime, genre, thtr_rel_year, dvd_rel_year, best_pic_win, best_dir_win, top200_box)
dim(movie)
## [1] 619 9
str(movie)
## Classes 'tbl_df', 'tbl' and 'data.frame': 619 obs. of 9 variables:
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
## $ imdb_num_votes: int 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
## $ runtime : num 80 101 84 139 90 142 93 88 119 127 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 5 6 1 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
## .. ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...
Let’s build an MLR model with the remaining variables to find out their influence on imdb_rating.
It is now appropriate to conduct a hypothesis test for the model as a whole. Null hypothesis: the slopes of all the variables are equal to zero. Alternative hypothesis: at least one of the slopes is different from zero.
movie_full <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win + top200_box, data = movie1)
summary(movie_full)
##
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + genre + runtime +
## thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win +
## top200_box, data = movie1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9247 -0.4248 0.0690 0.5412 2.1891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.072e+01 1.593e+01 3.184 0.00153 **
## imdb_num_votes 3.682e-06 3.610e-07 10.200 < 2e-16 ***
## genreAnimation 8.971e-02 3.205e-01 0.280 0.77966
## genreArt House & International 1.215e+00 2.692e-01 4.513 7.68e-06 ***
## genreComedy -7.059e-02 1.427e-01 -0.495 0.62113
## genreDocumentary 2.062e+00 1.766e-01 11.678 < 2e-16 ***
## genreDrama 7.127e-01 1.202e-01 5.929 5.15e-09 ***
## genreHorror 9.012e-03 2.126e-01 0.042 0.96620
## genreMusical & Performing Arts 1.489e+00 2.695e-01 5.524 4.93e-08 ***
## genreMystery & Suspense 4.450e-01 1.575e-01 2.826 0.00487 **
## genreOther 5.256e-01 2.463e-01 2.134 0.03326 *
## genreScience Fiction & Fantasy -1.745e-02 3.189e-01 -0.055 0.95638
## runtime 4.481e-03 2.068e-03 2.167 0.03061 *
## thtr_rel_year -1.135e-02 4.369e-03 -2.599 0.00959 **
## dvd_rel_year -1.139e-02 9.980e-03 -1.141 0.25426
## best_pic_winyes -2.290e-01 3.583e-01 -0.639 0.52310
## best_dir_winyes 2.254e-01 1.458e-01 1.546 0.12263
## top200_boxyes -3.655e-02 2.352e-01 -0.155 0.87652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8463 on 601 degrees of freedom
## Multiple R-squared: 0.397, Adjusted R-squared: 0.3799
## F-statistic: 23.27 on 17 and 601 DF, p-value: < 2.2e-16
The model gave us a p-value (< 2.2e-16) that is less than 0.05. We therefore conclude that the model as a whole is significant.
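The overall p-value can be reproduced from the F-statistic line of the summary with pf(). This is a minimal sketch using the printed values (F = 23.27 on 17 and 601 degrees of freedom). R prints "< 2.2e-16" because the exact value is smaller than the machine epsilon it uses as a display floor.

```r
# Values read from the last line of summary(movie_full).
f_stat <- 23.27
df1 <- 17    # number of estimated slopes
df2 <- 601   # residual degrees of freedom

# Upper-tail probability of the F distribution = overall model p-value.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(p_value)
print(.Machine$double.eps)   # the display floor, about 2.2e-16
print(p_value < 0.05)
```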
Now let’s conduct backward elimination using adjusted R-squared to arrive at a more reliable prediction model.
Let’s drop imdb_num_votes
m1 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win + top200_box, data = movie1)
summary(m1)$adj.r.squared
## [1] 0.2737727
Dropping genre
m2 <- lm(imdb_rating ~ imdb_num_votes + runtime + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win + top200_box , data = movie1)
summary(m2)$adj.r.squared
## [1] 0.1552294
Dropping runtime
m3 <- lm(imdb_rating ~ genre + imdb_num_votes + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win + top200_box , data = movie1)
summary(m3)$adj.r.squared
## [1] 0.3761099
Dropping thtr_rel_year
m4 <- lm(imdb_rating ~ genre + runtime + imdb_num_votes + dvd_rel_year + best_pic_win + best_dir_win + top200_box , data = movie1)
summary(m4)$adj.r.squared
## [1] 0.3739922
Dropping dvd_rel_year
m5 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + imdb_num_votes + best_pic_win + best_dir_win + top200_box , data = movie1)
summary(m5)$adj.r.squared
## [1] 0.3796064
Dropping best_pic_win
m6 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + dvd_rel_year + imdb_num_votes + best_dir_win + top200_box , data = movie1)
summary(m6)$adj.r.squared
## [1] 0.3805273
Dropping best_dir_win
m7 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + dvd_rel_year + best_pic_win + imdb_num_votes + top200_box , data = movie1)
summary(m7)$adj.r.squared
## [1] 0.3784859
Dropping top200_box
m8 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win + imdb_num_votes , data = movie1)
summary(m8)$adj.r.squared
## [1] 0.3809229
After fitting all the models with one variable removed at a time, the best-fitting model (m8) has an adjusted R-squared of 0.3809229, obtained by removing the top200_box variable. This is higher than the full model’s 0.3799.
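The drop-one comparison above can also be written as a loop, which avoids retyping each formula. The sketch below uses the built-in mtcars data as a stand-in for the movie data. The response, the predictor names and the helper `adj_r2()` are assumptions for illustration only.

```r
# Built-in mtcars data stands in for the movie data (illustration only).
response   <- "mpg"
predictors <- c("wt", "hp", "qsec", "drat")

# Hypothetical helper: adjusted R-squared of a model using the given predictors.
adj_r2 <- function(vars) {
  f <- reformulate(vars, response = response)
  summary(lm(f, data = mtcars))$adj.r.squared
}

# Score of the full model.
full_score <- adj_r2(predictors)

# Score after dropping each predictor in turn.
drop_one <- sapply(predictors, function(v) adj_r2(setdiff(predictors, v)))

print(full_score)
print(round(drop_one, 4))

# Candidate for removal: the predictor whose removal gives the highest score.
# It is only dropped if that score exceeds the full model's score.
best_drop <- names(which.max(drop_one))
print(best_drop)
print(max(drop_one) > full_score)
```

Repeating the same step on the reduced predictor set, until no removal improves the score, gives the next rounds of the elimination.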
Let’s continue the analysis by removing a second variable.
Dropping top200_box + imdb_num_votes
m9 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win, data = movie1)
summary(m9)$adj.r.squared
## [1] 0.2680378
Dropping top200_box + genre
m10 <- lm(imdb_rating ~ imdb_num_votes + runtime + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win, data = movie1)
summary(m10)$adj.r.squared
## [1] 0.1561708
Dropping top200_box + runtime
m11 <- lm(imdb_rating ~ imdb_num_votes + genre + thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win, data = movie1)
summary(m11)$adj.r.squared
## [1] 0.377138
Dropping top200_box + thtr_rel_year
m12 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + dvd_rel_year + best_pic_win + best_dir_win, data = movie1)
summary(m12)$adj.r.squared
## [1] 0.3750213
Dropping top200_box + dvd_rel_year
m13 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + best_pic_win + best_dir_win, data = movie1)
summary(m13)$adj.r.squared
## [1] 0.3806066
Dropping top200_box + best_pic_win
m14 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + dvd_rel_year + best_dir_win, data = movie1)
summary(m14)$adj.r.squared
## [1] 0.3815317
Dropping top200_box + best_dir_win
m15 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + best_pic_win + dvd_rel_year, data = movie1)
summary(m15)$adj.r.squared
## [1] 0.3794728
After fitting all the models with a second variable removed, the best-fitting model (m14) has an adjusted R-squared of 0.3815317, obtained by removing the top200_box and best_pic_win variables.
Let’s continue the analysis by removing a third variable.
Dropping top200_box + best_pic_win + imdb_num_votes
m16 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + dvd_rel_year + best_dir_win, data = movie1)
summary(m16)$adj.r.squared
## [1] 0.2640055
Dropping top200_box + best_pic_win + genre
m17 <- lm(imdb_rating ~ imdb_num_votes + runtime + thtr_rel_year + dvd_rel_year + best_dir_win, data = movie1)
summary(m17)$adj.r.squared
## [1] 0.1575187
Dropping top200_box + best_pic_win + runtime
m18 <- lm(imdb_rating ~ imdb_num_votes + genre + thtr_rel_year + dvd_rel_year + best_dir_win, data = movie1)
summary(m18)$adj.r.squared
## [1] 0.3778628
Dropping top200_box + best_pic_win + thtr_rel_year
m19 <- lm(imdb_rating ~ imdb_num_votes + runtime + genre + dvd_rel_year + best_dir_win, data = movie1)
summary(m19)$adj.r.squared
## [1] 0.3758894
Dropping top200_box + best_pic_win + dvd_rel_year
m20 <- lm(imdb_rating ~ imdb_num_votes + runtime + thtr_rel_year + genre + best_dir_win, data = movie1)
summary(m20)$adj.r.squared
## [1] 0.3811376
Dropping top200_box + best_pic_win + best_dir_win
m21 <- lm(imdb_rating ~ imdb_num_votes + runtime + thtr_rel_year + dvd_rel_year + genre , data = movie1)
summary(m21)$adj.r.squared
## [1] 0.3804479
As we don’t see any increase in adjusted R-squared after removing a third variable, “m14” is the best-fitting model for predicting the IMDB rating.
Let’s run this model to get all of its summary statistics.
m14 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + dvd_rel_year + best_dir_win, data = movie1)
summary(m14)
##
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + genre + runtime +
## thtr_rel_year + dvd_rel_year + best_dir_win, data = movie1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9203 -0.4250 0.0745 0.5407 2.1897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.075e+01 1.589e+01 3.193 0.00148 **
## imdb_num_votes 3.608e-06 3.353e-07 10.760 < 2e-16 ***
## genreAnimation 8.694e-02 3.197e-01 0.272 0.78579
## genreArt House & International 1.211e+00 2.684e-01 4.512 7.72e-06 ***
## genreComedy -7.477e-02 1.418e-01 -0.527 0.59808
## genreDocumentary 2.057e+00 1.757e-01 11.702 < 2e-16 ***
## genreDrama 7.106e-01 1.192e-01 5.961 4.27e-09 ***
## genreHorror 7.384e-03 2.118e-01 0.035 0.97220
## genreMusical & Performing Arts 1.488e+00 2.687e-01 5.539 4.54e-08 ***
## genreMystery & Suspense 4.446e-01 1.565e-01 2.840 0.00466 **
## genreOther 5.323e-01 2.456e-01 2.167 0.03062 *
## genreScience Fiction & Fantasy -1.539e-02 3.184e-01 -0.048 0.96146
## runtime 4.414e-03 2.062e-03 2.141 0.03269 *
## thtr_rel_year -1.104e-02 4.325e-03 -2.552 0.01097 *
## dvd_rel_year -1.172e-02 9.955e-03 -1.177 0.23974
## best_dir_winyes 2.010e-01 1.401e-01 1.435 0.15188
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8452 on 603 degrees of freedom
## Multiple R-squared: 0.3965, Adjusted R-squared: 0.3815
## F-statistic: 26.42 on 15 and 603 DF, p-value: < 2.2e-16
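As a quick check, the adjusted R-squared reported above can be recomputed from the multiple R-squared using the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below plugs in the printed values. Here n = 619 is the number of rows in movie1 and k = 15 is the number of slopes, read from the F-statistic line.

```r
r2 <- 0.3965   # Multiple R-squared from the summary above
n  <- 619      # observations: 603 residual df + 15 slopes + 1 intercept
k  <- 15       # number of estimated slopes

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))   # matches the 0.3815 printed by summary(m14)
```

This penalty is why removing a weak predictor can raise adjusted R-squared even though the plain R-squared can never increase when a variable is dropped.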
Interpretation of the intercept and of the slope of each explanatory variable in the model -
Intercept: An Action & Adventure movie (the reference genre) with zero votes, zero runtime, theatre and DVD release years of 0, and a director who has never won an Oscar is expected on average to have an IMDB rating of 50.75. This is meaningless in context, but it helps to adjust the height of the line.
imdb_num_votes: All else held constant, for every additional vote, the model predicts the IMDB rating of the movie to be higher on average by 0.000003608 points.
genre: Genre is a categorical variable with “Action & Adventure” as the reference level, so each genre slope below compares that genre with an Action & Adventure movie.
genreAnimation: All else held constant, for a movie in the Animation genre, the model predicts the IMDB rating to be higher on average by 0.08694 points than for an Action & Adventure movie.
genreArt House & International: All else held constant, for a movie in the Art House & International genre, the model predicts the IMDB rating to be higher on average by 1.211 points than for an Action & Adventure movie.
genreComedy: All else held constant, for a movie in the Comedy genre, the model predicts the IMDB rating to be lower on average by 0.07477 points than for an Action & Adventure movie.
genreDocumentary: All else held constant, for a movie in the Documentary genre, the model predicts the IMDB rating to be higher on average by 2.057 points than for an Action & Adventure movie.
genreDrama: All else held constant, for a movie in the Drama genre, the model predicts the IMDB rating to be higher on average by 0.7106 points than for an Action & Adventure movie.
genreHorror: All else held constant, for a movie in the Horror genre, the model predicts the IMDB rating to be higher on average by 0.007384 points than for an Action & Adventure movie.
genreMusical & Performing Arts: All else held constant, for a movie in the Musical & Performing Arts genre, the model predicts the IMDB rating to be higher on average by 1.488 points than for an Action & Adventure movie.
genreMystery & Suspense: All else held constant, for a movie in the Mystery & Suspense genre, the model predicts the IMDB rating to be higher on average by 0.4446 points than for an Action & Adventure movie.
genreOther: All else held constant, for a movie in the Other genre, the model predicts the IMDB rating to be higher on average by 0.5323 points than for an Action & Adventure movie.
genreScience Fiction & Fantasy: All else held constant, for a movie in the Science Fiction & Fantasy genre, the model predicts the IMDB rating to be lower on average by 0.01539 points than for an Action & Adventure movie.
runtime: All else held constant, for each additional minute of runtime, the model predicts the IMDB rating of the movie to be higher on average by 0.004414 points.
thtr_rel_year: All else held constant, for each one-year increase in the theatre release year, the model predicts the IMDB rating of the movie to be lower on average by 0.01104 points.
dvd_rel_year: All else held constant, for each one-year increase in the DVD release year, the model predicts the IMDB rating of the movie to be lower on average by 0.01172 points.
best_dir_win: All else held constant, for a movie whose director has won an Oscar, the model predicts the IMDB rating of the movie to be higher on average by 0.201 points.
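The genre slopes in the summary are differences from a reference level, which lm() takes to be the first level of the factor ("Action & Adventure" in the movies data). The toy example below shows this behaviour. The `toy` data frame and its values are made up for illustration.

```r
# Hypothetical ratings for three genres, two movies each.
toy <- data.frame(
  rating = c(6.0, 6.2, 7.9, 8.1, 6.9, 7.1),
  genre  = factor(c("Action", "Action", "Documentary",
                    "Documentary", "Drama", "Drama"))
)

# The first factor level is the reference level used by lm().
ref_level <- levels(toy$genre)[1]
print(ref_level)

fit <- lm(rating ~ genre, data = toy)
coefs <- coef(fit)

# Intercept = mean rating of the reference genre (6.1);
# each genre slope = that genre's mean minus the reference mean.
print(round(coefs, 2))

# relevel() changes the reference level if another baseline is preferred.
toy$genre_drama_ref <- relevel(toy$genre, ref = "Drama")
print(levels(toy$genre_drama_ref)[1])
```

With only one categorical predictor the slopes are plain differences in group means. In the full model they are differences after adjusting for the other variables.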
In order to determine whether using a linear model is the best way to predict the IMDB rating of a movie, we need to conduct some diagnostic tests.
1. Test for a linear relationship between each numerical explanatory variable and the response variable
plot(m14$residuals ~ movie1$imdb_num_votes)
The plot above does not show a U-shaped or inverted-U-shaped pattern, but it does show a fan-shaped spread.
plot(m14$residuals ~ movie1$runtime)
The plot above does not show a U-shaped or inverted-U-shaped pattern, but it does show a fan-shaped spread.
plot(m14$residuals ~ movie1$thtr_rel_year)
The points are randomly scattered around 0.
plot(m14$residuals ~ movie1$dvd_rel_year)
The points are randomly scattered around 0.
Since two of the four numeric variables do not show a random scatter around 0, our model fails this test.
2. Test for nearly normal residuals
hist(m14$residuals)
The histogram of residuals is nearly normal and centered at 0.
qqnorm(m14$residuals)
qqline(m14$residuals)
From the normal Q-Q plot above, we can confirm that the nearly normal condition is met.
3. Test for constant variability of residuals
plot(m14$residuals ~ m14$fitted.values)
From the graph above, we can see that the residuals are scattered roughly in the shape of a fan (from right to left).
plot(abs(m14$residuals) ~ m14$fitted.values)
The plot of the absolute values of the residuals against the predicted values shows a nearly triangular structure.
Since we were not able to achieve homoscedasticity, condition 3 is violated by our model.
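The visual check of absolute residuals against fitted values can be backed up with a number: the slope of a simple regression of |residuals| on the fitted values. This is a rough sketch on simulated data in which the noise grows with x by construction, and it is not a formal test. All names and values are assumptions for illustration.

```r
# Simulated data whose error spread increases with x (heteroscedastic).
set.seed(42)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)

# Regress the absolute residuals on the fitted values.
# A slope far from 0 suggests the spread changes with the prediction.
spread_fit <- lm(abs(residuals(fit)) ~ fitted(fit))
slope <- coef(spread_fit)[[2]]
print(slope)
print(summary(spread_fit)$coefficients)
```

A positive slope means the spread widens as the fitted values increase, while a negative slope means it narrows, as in a fan that opens from right to left.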
4. Test for independent residuals
plot(m14$residuals)
This condition is met, as the residuals are randomly scattered and hence the observations are independent.
I’m going to predict the IMDB rating for the movie described below.
Movie title: Raised on media
imdb_num_votes: 7
genre: crime
runtime: 65 mins
thtr_rel_year: 2016
dvd_rel_year: 2016
best_dir_win: No
Data reference: IMDB.com
With reference to the summary results of the best-fitting model (m14), the equation to predict imdb_rating for this movie can be written as shown below. Since the crime genre is not one of the genres in our model, we classify it under the “Other” genre.
Below is the equation of the line for model m14:
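imdb_rating = 50.75 + (0.000003608 * imdb_num_votes) + (0.5323 * genreOther) + (0.004414 * runtime) - (0.01104 * thtr_rel_year) - (0.01172 * dvd_rel_year) + (0.201 * best_dir_winyes)
Here genreOther and best_dir_winyes are indicator variables (1 if the condition holds, 0 otherwise). The remaining genre terms are omitted because they are 0 for this movie.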
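As a sanity check, the prediction can be worked out by hand from the coefficients printed in the summary of m14, including the intercept of 50.75. The sketch below hard-codes those rounded coefficients. The vector names are labels chosen for illustration.

```r
# Coefficients as printed (rounded) in the summary of m14.
b <- c(
  intercept      = 50.75,
  imdb_num_votes = 3.608e-06,
  genreOther     = 0.5323,
  runtime        = 4.414e-03,
  thtr_rel_year  = -1.104e-02,
  dvd_rel_year   = -1.172e-02,
  best_dir_win   = 0.2010
)

# Values for the new movie; the two indicator variables are coded 0/1.
x <- c(
  intercept      = 1,
  imdb_num_votes = 7,
  genreOther     = 1,
  runtime        = 65,
  thtr_rel_year  = 2016,
  dvd_rel_year   = 2016,
  best_dir_win   = 0
)

# Linear predictor: sum of coefficient * value over all terms.
by_hand <- sum(b * x[names(b)])
print(round(by_hand, 3))   # about 5.685
```

The small gap between this value and the fit that predict() reports for m14 most likely comes from rounding: the printed coefficients carry four significant digits, and the two year slopes are multiplied by 2016.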
Let’s build a data frame with the above data.
new_movie <- data.frame(imdb_num_votes = 7, genre = "Other", runtime = 65, thtr_rel_year = 2016, dvd_rel_year = 2016, best_dir_win = "no")
Let’s build a prediction interval to get a measure of the uncertainty around this prediction.
predict(m14, new_movie, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 5.698991 3.964844 7.433137
Hence, the model predicts, with 95% confidence, that a crime movie released in theatres and on DVD in 2016, with 7 IMDB votes, a runtime of 65 minutes and a director who has not won an Oscar, is expected to have an IMDB rating between 3.96 and 7.43.
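The interval above is wide because it is a prediction interval for a single movie, not a confidence interval for the average rating of all such movies. The sketch below contrasts the two using the built-in mtcars data as a stand-in. The model and the `new_car` values are assumptions for illustration.

```r
# Stand-in model on built-in data (not the movie model).
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3.0, hp = 120)

# Confidence interval: uncertainty about the mean response at these values.
conf_int <- predict(fit, new_car, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about one new observation,
# which adds the residual variability on top.
pred_int <- predict(fit, new_car, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
print(pred_width > conf_width)
```

Both intervals share the same point estimate. Only the width differs.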
The IMDB rating prediction model built in the above analysis can only predict the rating from the non-obvious factors such as runtime, release year, genre etc. I also tried using the obvious factors such as actor1, director etc., which resulted in lengthy summary statistics and an overall model p-value > 0.05. This implies that we fail to reject the null hypothesis, i.e. that there is no evidence that any of the explanatory variables’ coefficients (slopes) differs from 0. This didn’t make sense to me, and I finally ended up building the above model. Further research is needed to figure out the best way to address problems like this.
Since two of the four diagnostic conditions are violated by our model, further research is needed, either to simplify this model or to try a non-linear regression analysis, which might help in building a model that predicts imdb_rating with better accuracy.