Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("movies.Rdata")

Part 1: Data

The data set contains 651 randomly sampled movies produced and released before 2016. The information in this data set was gathered from Rotten Tomatoes and IMDB.

Since the movies were randomly sampled and the sample includes movies in all languages, the analysis results are generalizable to the movies listed on the IMDB and Rotten Tomatoes websites. However, this is an observational study: no random assignment took place, so the results cannot be used to establish causal relationships.

Part 2: Research question

Research question: Build the best-fitting multiple linear regression model to predict the IMDB rating.

This question is of interest to me mainly because I check the IMDB rating of a movie before watching it. A good IMDB rating indicates that a movie is worth watching. If the rating is bad, I wouldn’t invest my time in watching it. Similarly, many people across the globe trust IMDB ratings. This motivated me to figure out which factors are associated with a movie’s rating on a scale of 1 to 10.


Part 3: Exploratory data analysis

Let’s have a look at the structure of the dataset.

str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

Let’s omit the rows with incomplete values.

movie1 <- na.omit(movies)

dim(movie1)
## [1] 619  32
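
Before dropping rows, it can help to see which columns contain the missing values that cause the 32 rows (651 minus 619) to be removed. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from the movies data.

```r
# Toy data frame standing in for `movies`; values are invented for illustration.
toy <- data.frame(
  runtime      = c(80, NA, 84, 139),
  dvd_rel_year = c(2013, 2001, NA, 2001),
  imdb_rating  = c(5.5, 7.3, 7.6, 7.2)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that na.omit() would keep
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

The same two calls applied to `movies` would show whether the dropped rows come from one column or from several.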

We will now look into the summary statistics of the imdb_rating variable to get an idea about the type of values in it

summary(movie1$imdb_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.600   6.486   7.300   9.000

The average movie in our sample has an IMDB rating of 6.486, with a minimum rating of 1.9 and a maximum rating of 9.

Let’s create a boxplot to find out whether there are any influential points.

boxplot(movie1$imdb_rating)

It looks like we have a few outliers with an IMDB rating of less than 4. For now I do not want to remove these outliers; I will continue my analysis with the entire data set, as I would like to know the factors that result in a lower or higher IMDB rating.
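
The points a boxplot draws individually can also be listed numerically. This is a small sketch using base R’s `boxplot.stats()`, which applies the same 1.5 * IQR rule as `boxplot()`; the `ratings` vector is invented for illustration and is not the real imdb_rating column.

```r
# Hypothetical ratings, invented for illustration only.
ratings <- c(1.9, 3.2, 5.9, 6.1, 6.4, 6.6, 6.8, 7.0, 7.3, 7.5, 8.1)

# boxplot.stats() returns the values lying beyond the whiskers in $out.
flagged <- boxplot.stats(ratings)$out
print(flagged)
```

On the real data, `boxplot.stats(movie1$imdb_rating)$out` would return the ratings shown as individual points in the plot above.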

To avoid collinearity and attain parsimony, let’s look at how a few of the variables relate to the other variables in the dataset.

First, let’s find the correlations between the numeric variables.

num_movie <- data.frame(movie1$imdb_rating , movie1$critics_score , movie1$audience_score , movie1$runtime , movie1$imdb_num_votes)


cor(num_movie)
##                       movie1.imdb_rating movie1.critics_score
## movie1.imdb_rating             1.0000000            0.7619990
## movie1.critics_score           0.7619990            1.0000000
## movie1.audience_score          0.8605425            0.7015256
## movie1.runtime                 0.2974388            0.2000696
## movie1.imdb_num_votes          0.3476431            0.2217208
##                       movie1.audience_score movie1.runtime
## movie1.imdb_rating                0.8605425      0.2974388
## movie1.critics_score              0.7015256      0.2000696
## movie1.audience_score             1.0000000      0.2031201
## movie1.runtime                    0.2031201      1.0000000
## movie1.imdb_num_votes             0.3035904      0.3430813
##                       movie1.imdb_num_votes
## movie1.imdb_rating                0.3476431
## movie1.critics_score              0.2217208
## movie1.audience_score             0.3035904
## movie1.runtime                    0.3430813
## movie1.imdb_num_votes             1.0000000

We can see that imdb_rating, critics_score and audience_score are highly correlated with one another (all pairwise correlations are above 0.70). Because of this collinearity, it is not necessary to use all three of these variables. Hence we keep imdb_rating as the response and, of the numeric variables, use only runtime and imdb_num_votes as predictors, since the correlation between these two is low (0.34) and collinear predictors would make the coefficient estimates unreliable.
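
Reading a large correlation matrix by eye is error-prone, so one option is to list only the pairs above a chosen cut-off. The helper `high_cor_pairs()` below is a hypothetical name introduced for this sketch, the 0.7 threshold is an assumption, and the `toy` data are invented.

```r
# Hypothetical helper: list variable pairs whose |correlation| exceeds a threshold.
high_cor_pairs <- function(df, threshold = 0.7) {
  cm <- cor(df)
  # Look only above the diagonal so each pair is reported once.
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy numeric data, invented for illustration (not the movies data).
toy <- data.frame(
  x1 = 1:6,
  x2 = 2 * (1:6) + c(0.1, -0.1, 0.1, -0.1, 0.1, -0.1),  # nearly a multiple of x1
  x3 = c(2, 5, 1, 6, 2, 5)                               # only weakly related
)

pairs_found <- high_cor_pairs(toy, threshold = 0.7)
print(pairs_found)
```

Applied to `num_movie`, this would report the three pairs formed by imdb_rating, critics_score and audience_score.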

Let’s also peek into some categorical variables to see whether any of them behave in a similar fashion. To find this, I’ll build an MLR model on the categorical variables and look at how they relate to the IMDB rating.

best <- lm(imdb_rating ~ best_pic_win + best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movie1)

summary(best)
## 
## Call:
## lm(formula = imdb_rating ~ best_pic_win + best_pic_nom + best_actor_win + 
##     best_actress_win + best_dir_win + top200_box, data = movie1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4892 -0.5289  0.0711  0.7108  2.1108 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.38916    0.04825 132.424  < 2e-16 ***
## best_pic_winyes      0.03871    0.47377   0.082   0.9349    
## best_pic_nomyes      1.13902    0.26326   4.327 1.77e-05 ***
## best_actor_winyes    0.03976    0.12140   0.328   0.7434    
## best_actress_winyes  0.05790    0.13561   0.427   0.6696    
## best_dir_winyes      0.44136    0.17547   2.515   0.0121 *  
## top200_boxyes        0.51150    0.27431   1.865   0.0627 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.042 on 612 degrees of freedom
## Multiple R-squared:  0.06884,    Adjusted R-squared:  0.05971 
## F-statistic: 7.541 on 6 and 612 DF,  p-value: 8.132e-08

NOTE: This model is built only to understand how the categorical variables relate to the imdb_rating variable. We will take out the variables with high p-values, as including them in the final model would add predictors that contribute little and make it less reliable.

Some of the variables in the above summary have higher p-values than others. To achieve parsimony, I’m going to use the single variable “best_pic_win” to represent the “best_actress_win”, “best_actor_win”, and “best_pic_win” variables.

In addition, I’m going to skip “director”, “actor1”, “actor2”, “actor3”, “actor4”, “actor5”, “imdb_url” and “rt_url”, as these are names of people involved in the film and URLs, and they are not usable as numerical or categorical (continuous or discrete) variables here. Certainly a few actors and directors are responsible for drawing audiences to the theatres, but these variables take too many distinct values, and analysing them requires different techniques.

The same applies to the variables “title”, “title_type”, “mpaa_rating”, “studio” and “best_pic_nom”; I would analyse these using different techniques. I don’t want to include the month of release, since a good film will be watched when people have time, whether it is released during the holiday season or during the exam season. Hence I’m skipping the “dvd_rel_month” and “thtr_rel_month” variables for now and will do a different type of analysis specific to them.


Part 4: Modeling

I’m creating a new data frame with the remaining variables to continue with my analysis

Response variable - imdb_rating: Rating on IMDB

Explanatory variables -

imdb_num_votes: Number of votes on IMDB

genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)

runtime: Runtime of movie (in minutes)

thtr_rel_year: Year the movie is released in theaters

dvd_rel_year: Year the movie is released on DVD

best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)

best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes); note that this is not necessarily whether the director won an Oscar for the given movie

top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)

movie <- movie1 %>%
  select(imdb_rating, imdb_num_votes, runtime, genre, thtr_rel_year, dvd_rel_year, best_pic_win, best_dir_win, top200_box)

dim(movie)
## [1] 619   9
str(movie)
## Classes 'tbl_df', 'tbl' and 'data.frame':    619 obs. of  9 variables:
##  $ imdb_rating   : num  5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
##  $ imdb_num_votes: int  899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
##  $ runtime       : num  80 101 84 139 90 142 93 88 119 127 ...
##  $ genre         : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 6 6 5 6 1 ...
##  $ thtr_rel_year : num  2013 2001 1996 1993 2004 ...
##  $ dvd_rel_year  : num  2013 2001 2001 2001 2005 ...
##  $ best_pic_win  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
##   .. ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...

Let’s build an MLR model with the remaining variables to find out their influence on imdb_rating.

We also conduct a hypothesis test for the model as a whole. Null hypothesis: the slopes of all the variables are equal to zero. Alternative hypothesis: at least one of the slopes is different from zero.

movie_full <- lm(imdb_rating ~ imdb_num_votes + genre + runtime  + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win + top200_box , data = movie1)

summary(movie_full)
## 
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + genre + runtime + 
##     thtr_rel_year + dvd_rel_year + best_pic_win + best_dir_win + 
##     top200_box, data = movie1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9247 -0.4248  0.0690  0.5412  2.1891 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.072e+01  1.593e+01   3.184  0.00153 ** 
## imdb_num_votes                  3.682e-06  3.610e-07  10.200  < 2e-16 ***
## genreAnimation                  8.971e-02  3.205e-01   0.280  0.77966    
## genreArt House & International  1.215e+00  2.692e-01   4.513 7.68e-06 ***
## genreComedy                    -7.059e-02  1.427e-01  -0.495  0.62113    
## genreDocumentary                2.062e+00  1.766e-01  11.678  < 2e-16 ***
## genreDrama                      7.127e-01  1.202e-01   5.929 5.15e-09 ***
## genreHorror                     9.012e-03  2.126e-01   0.042  0.96620    
## genreMusical & Performing Arts  1.489e+00  2.695e-01   5.524 4.93e-08 ***
## genreMystery & Suspense         4.450e-01  1.575e-01   2.826  0.00487 ** 
## genreOther                      5.256e-01  2.463e-01   2.134  0.03326 *  
## genreScience Fiction & Fantasy -1.745e-02  3.189e-01  -0.055  0.95638    
## runtime                         4.481e-03  2.068e-03   2.167  0.03061 *  
## thtr_rel_year                  -1.135e-02  4.369e-03  -2.599  0.00959 ** 
## dvd_rel_year                   -1.139e-02  9.980e-03  -1.141  0.25426    
## best_pic_winyes                -2.290e-01  3.583e-01  -0.639  0.52310    
## best_dir_winyes                 2.254e-01  1.458e-01   1.546  0.12263    
## top200_boxyes                  -3.655e-02  2.352e-01  -0.155  0.87652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8463 on 601 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.3799 
## F-statistic: 23.27 on 17 and 601 DF,  p-value: < 2.2e-16

The model gave us a p-value (< 2.2e-16) that is less than 0.05, so we reject the null hypothesis and conclude that the model as a whole is significant.
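
The overall p-value can be reproduced from the F-statistic and its degrees of freedom reported in the summary (23.27 on 17 and 601 DF). This sketch uses only base R’s `pf()`; the variable names are illustrative.

```r
# Values copied from the last line of summary(movie_full)
f_value <- 23.27
df_num  <- 17    # number of slope coefficients
df_den  <- 601   # residual degrees of freedom

# Upper-tail probability of the F distribution = overall model p-value
p_value <- pf(f_value, df_num, df_den, lower.tail = FALSE)
print(p_value)
```

R prints `< 2.2e-16` in the summary simply because the true value is smaller than the machine precision it chooses to display.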

Now let’s conduct backward elimination using adjusted R squared to arrive at a more reliable prediction model. At each step, we drop each variable in turn and keep the model with the highest adjusted R squared.

Dropping imdb_num_votes

m1 <- lm(imdb_rating ~ genre + runtime  + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win + top200_box , data = movie1)

summary(m1)$adj.r.squared
## [1] 0.2737727

Dropping genre

m2 <- lm(imdb_rating ~ imdb_num_votes + runtime  + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win + top200_box , data = movie1)

summary(m2)$adj.r.squared
## [1] 0.1552294

Dropping runtime

m3 <- lm(imdb_rating ~ genre + imdb_num_votes  + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win + top200_box , data = movie1)

summary(m3)$adj.r.squared
## [1] 0.3761099

Dropping thtr_rel_year

m4 <- lm(imdb_rating ~ genre + runtime  + imdb_num_votes + dvd_rel_year + best_pic_win +  best_dir_win + top200_box , data = movie1)

summary(m4)$adj.r.squared
## [1] 0.3739922

Dropping dvd_rel_year

m5 <- lm(imdb_rating ~ genre + runtime  + thtr_rel_year  + imdb_num_votes + best_pic_win +  best_dir_win + top200_box , data = movie1)

summary(m5)$adj.r.squared
## [1] 0.3796064

Dropping best_pic_win

m6 <- lm(imdb_rating ~ genre + runtime  + thtr_rel_year  + dvd_rel_year + imdb_num_votes +  best_dir_win + top200_box , data = movie1)

summary(m6)$adj.r.squared
## [1] 0.3805273

Dropping best_dir_win

m7 <- lm(imdb_rating ~ genre + runtime  + thtr_rel_year  + dvd_rel_year + best_pic_win +  imdb_num_votes + top200_box , data = movie1)

summary(m7)$adj.r.squared
## [1] 0.3784859

Dropping top200_box

m8 <- lm(imdb_rating ~ genre + runtime  + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win + imdb_num_votes , data = movie1)

summary(m8)$adj.r.squared
## [1] 0.3809229

After fitting all the models obtained by removing one variable at a time, the best-fitting model (m8), which drops the top200_box variable, has an adjusted R squared of 0.3809229, higher than the 0.3799 of the full model.

Let’s continue by removing a second variable.

Dropping top200_box + imdb_num_votes

m9 <- lm(imdb_rating ~ genre + runtime  + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win, data = movie1)

summary(m9)$adj.r.squared
## [1] 0.2680378

Dropping top200_box + genre

m10 <- lm(imdb_rating ~ imdb_num_votes + runtime  + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win, data = movie1)

summary(m10)$adj.r.squared
## [1] 0.1561708

Dropping top200_box + runtime

m11 <- lm(imdb_rating ~ imdb_num_votes + genre + thtr_rel_year  + dvd_rel_year + best_pic_win +  best_dir_win, data = movie1)

summary(m11)$adj.r.squared
## [1] 0.377138

Dropping top200_box + thtr_rel_year

m12 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + dvd_rel_year + best_pic_win +  best_dir_win, data = movie1)

summary(m12)$adj.r.squared
## [1] 0.3750213

Dropping top200_box + dvd_rel_year

m13 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + best_pic_win +  best_dir_win, data = movie1)

summary(m13)$adj.r.squared
## [1] 0.3806066

Dropping top200_box + best_pic_win

m14 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + dvd_rel_year +  best_dir_win, data = movie1)

summary(m14)$adj.r.squared
## [1] 0.3815317

Dropping top200_box + best_dir_win

m15 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + best_pic_win +  dvd_rel_year, data = movie1)

summary(m15)$adj.r.squared
## [1] 0.3794728

After fitting all the models obtained by removing a second variable, the best-fitting model (m14), which drops the top200_box and best_pic_win variables, has an adjusted R squared of 0.3815317.

Let’s continue by removing a third variable.

Dropping top200_box + best_pic_win + imdb_num_votes

m16 <- lm(imdb_rating ~ genre + runtime + thtr_rel_year + dvd_rel_year +  best_dir_win, data = movie1)

summary(m16)$adj.r.squared
## [1] 0.2640055

Dropping top200_box + best_pic_win + genre

m17 <- lm(imdb_rating ~  imdb_num_votes + runtime + thtr_rel_year + dvd_rel_year +  best_dir_win, data = movie1)

summary(m17)$adj.r.squared
## [1] 0.1575187

Dropping top200_box + best_pic_win + runtime

m18 <- lm(imdb_rating ~  imdb_num_votes + genre  + thtr_rel_year + dvd_rel_year +  best_dir_win, data = movie1)

summary(m18)$adj.r.squared
## [1] 0.3778628

Dropping top200_box + best_pic_win + thtr_rel_year

m19 <- lm(imdb_rating ~  imdb_num_votes + runtime + genre + dvd_rel_year +  best_dir_win, data = movie1)

summary(m19)$adj.r.squared
## [1] 0.3758894

Dropping top200_box + best_pic_win + dvd_rel_year

m20 <- lm(imdb_rating ~  imdb_num_votes + runtime + thtr_rel_year + genre +  best_dir_win, data = movie1)

summary(m20)$adj.r.squared
## [1] 0.3811376

Dropping top200_box + best_pic_win + best_dir_win

m21 <- lm(imdb_rating ~  imdb_num_votes + runtime + thtr_rel_year + dvd_rel_year + genre , data = movie1)

summary(m21)$adj.r.squared
## [1] 0.3804479

As none of the models with three variables removed has a higher adjusted R squared than 0.3815317, “m14” is the best-fitting model for predicting the IMDB rating.
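
The manual procedure above (drop each variable in turn, keep the best model, repeat until adjusted R squared stops improving) can be written as a loop. This is a minimal sketch, not the code used for the results above: `backward_adj_r2()` is a hypothetical helper, and R’s built-in `mtcars` data is used only as a stand-in so the example runs without movies.Rdata.

```r
# Hypothetical helper: backward elimination driven by adjusted R squared.
backward_adj_r2 <- function(response, predictors, data) {
  # Adjusted R squared of the model using the given predictors
  adj_r2 <- function(vars) {
    f <- as.formula(paste(response, "~", paste(vars, collapse = " + ")))
    summary(lm(f, data = data))$adj.r.squared
  }
  current <- predictors
  best <- adj_r2(current)
  while (length(current) > 1) {
    # Adjusted R squared after dropping each remaining variable in turn
    candidates <- sapply(current, function(v) adj_r2(setdiff(current, v)))
    if (max(candidates) <= best) break   # no drop improves the model: stop
    best <- max(candidates)
    current <- setdiff(current, names(which.max(candidates)))
  }
  list(predictors = current, adj_r2 = best)
}

# Stand-in example on mtcars (not the movies data)
start_vars <- c("wt", "hp", "qsec", "drat")
result <- backward_adj_r2("mpg", start_vars, mtcars)
print(result)
```

With the movies data, the call would take "imdb_rating" as the response, the eight explanatory variables as predictors and `movie1` as the data.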

Let’s run this model to find all the summary statistics related to it.

m14 <- lm(imdb_rating ~ imdb_num_votes + genre + runtime + thtr_rel_year + dvd_rel_year +  best_dir_win, data = movie1)

summary(m14)
## 
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + genre + runtime + 
##     thtr_rel_year + dvd_rel_year + best_dir_win, data = movie1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9203 -0.4250  0.0745  0.5407  2.1897 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.075e+01  1.589e+01   3.193  0.00148 ** 
## imdb_num_votes                  3.608e-06  3.353e-07  10.760  < 2e-16 ***
## genreAnimation                  8.694e-02  3.197e-01   0.272  0.78579    
## genreArt House & International  1.211e+00  2.684e-01   4.512 7.72e-06 ***
## genreComedy                    -7.477e-02  1.418e-01  -0.527  0.59808    
## genreDocumentary                2.057e+00  1.757e-01  11.702  < 2e-16 ***
## genreDrama                      7.106e-01  1.192e-01   5.961 4.27e-09 ***
## genreHorror                     7.384e-03  2.118e-01   0.035  0.97220    
## genreMusical & Performing Arts  1.488e+00  2.687e-01   5.539 4.54e-08 ***
## genreMystery & Suspense         4.446e-01  1.565e-01   2.840  0.00466 ** 
## genreOther                      5.323e-01  2.456e-01   2.167  0.03062 *  
## genreScience Fiction & Fantasy -1.539e-02  3.184e-01  -0.048  0.96146    
## runtime                         4.414e-03  2.062e-03   2.141  0.03269 *  
## thtr_rel_year                  -1.104e-02  4.325e-03  -2.552  0.01097 *  
## dvd_rel_year                   -1.172e-02  9.955e-03  -1.177  0.23974    
## best_dir_winyes                 2.010e-01  1.401e-01   1.435  0.15188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8452 on 603 degrees of freedom
## Multiple R-squared:  0.3965, Adjusted R-squared:  0.3815 
## F-statistic: 26.42 on 15 and 603 DF,  p-value: < 2.2e-16

Interpretation of the intercept and of the slope of each explanatory variable in the model:

Intercept: A movie in the reference categories (Action & Adventure genre, director who has not won an Oscar) with 0 votes, a runtime of 0 minutes, and theatre and DVD release years of 0 is expected on average to have an IMDB rating of 50.75. This is meaningless in context (it is not even on the 1 to 10 scale), but it helps to adjust the height of the line.

imdb_num_votes: All else held constant, for every additional vote, the model predicts the IMDB rating of the movie to be higher on average by 0.000003608 points.

genre: genre is a categorical variable with Action & Adventure as the reference level, so each genre coefficient below is the predicted difference in IMDB rating relative to an Action & Adventure movie.

genreAnimation: All else held constant, the model predicts the IMDB rating of an Animation movie to be higher on average by 0.08694 points than that of an Action & Adventure movie.

genreArt House & International: All else held constant, the model predicts the IMDB rating of an Art House & International movie to be higher on average by 1.211 points than that of an Action & Adventure movie.

genreComedy: All else held constant, the model predicts the IMDB rating of a Comedy movie to be lower on average by 0.07477 points than that of an Action & Adventure movie.

genreDocumentary: All else held constant, the model predicts the IMDB rating of a Documentary movie to be higher on average by 2.057 points than that of an Action & Adventure movie.

genreDrama: All else held constant, the model predicts the IMDB rating of a Drama movie to be higher on average by 0.7106 points than that of an Action & Adventure movie.

genreHorror: All else held constant, the model predicts the IMDB rating of a Horror movie to be higher on average by 0.007384 points than that of an Action & Adventure movie.

genreMusical & Performing Arts: All else held constant, the model predicts the IMDB rating of a Musical & Performing Arts movie to be higher on average by 1.488 points than that of an Action & Adventure movie.

genreMystery & Suspense: All else held constant, the model predicts the IMDB rating of a Mystery & Suspense movie to be higher on average by 0.4446 points than that of an Action & Adventure movie.

genreOther: All else held constant, the model predicts the IMDB rating of a movie in the Other genre to be higher on average by 0.5323 points than that of an Action & Adventure movie.

genreScience Fiction & Fantasy: All else held constant, the model predicts the IMDB rating of a Science Fiction & Fantasy movie to be lower on average by 0.01539 points than that of an Action & Adventure movie.

runtime: All else held constant, for each additional minute of runtime, the model predicts the IMDB rating of the movie to be higher on average by 0.004414 points.

thtr_rel_year: All else held constant, for each one-year increase in the theatre release year, the model predicts the IMDB rating of the movie to be lower on average by 0.01104 points.

dvd_rel_year: All else held constant, for each one-year increase in the DVD release year, the model predicts the IMDB rating of the movie to be lower on average by 0.01172 points.

best_dir_win: All else held constant, the model predicts the IMDB rating of a movie whose director has won an Oscar to be higher on average by 0.201 points than that of a movie whose director has not.
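
The genre coefficients compare each genre with the factor’s first level, which `lm()` uses as the reference. The sketch below shows how to check the reference level and, if a different baseline is preferred, how to change it with base R’s `relevel()`; the `genre_toy` vector is invented for illustration.

```r
# Toy factor, invented for illustration; levels are sorted alphabetically by default.
genre_toy <- factor(c("Drama", "Comedy", "Action & Adventure", "Drama"))

# The first level is the reference category used by lm()
reference_level <- levels(genre_toy)[1]
print(reference_level)

# Make "Drama" the baseline instead; coefficients would then be relative to Drama
genre_releveled <- relevel(genre_toy, ref = "Drama")
new_reference <- levels(genre_releveled)[1]
print(new_reference)
```

For the movies data, `levels(movie1$genre)[1]` gives the genre that does not appear among the coefficients in the summary above.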

DIAGNOSTICS

In order to determine whether using a linear model is the best way to predict the IMDB rating of a movie, we need to conduct some diagnostic tests.

1. Test for a linear relationship between each numerical explanatory variable and the response variable

plot(m14$residuals ~ movie1$imdb_num_votes) 

The above plot doesn’t show a U-shaped or inverted-U-shaped pattern, but it does show a fan-shaped spread.

plot(m14$residuals ~ movie1$runtime) 

The above plot doesn’t show a U-shaped or inverted-U-shaped pattern, but it does show a fan-shaped spread.

plot(m14$residuals ~ movie1$thtr_rel_year) 

The points are randomly scattered around 0.

plot(m14$residuals ~ movie1$dvd_rel_year) 

The points are randomly scattered around 0.

Since two of the four numeric variables do not show a random scatter around 0, our model fails this test.

2. Test for nearly normal residuals with mean 0

hist(m14$residuals)

The histogram of residuals is nearly normal and centered at mean 0.

qqnorm(m14$residuals)
qqline(m14$residuals)

From the above graph, we can confirm that the nearly normal condition is met.

3. Constant variability of residuals

plot(m14$residuals ~ m14$fitted.values)

From the above graph, we can see that the residuals are scattered roughly in the shape of a fan (from right to left).

plot(abs(m14$residuals) ~ m14$fitted.values)

The plot of the absolute values of the residuals against the predicted values shows a nearly triangular structure.

Since we were not able to achieve homoscedasticity, condition #3 is violated by our model.
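
A fan or triangle shape can also be checked numerically. One simple option, shown here as an assumption rather than as part of the original analysis, is to regress the absolute residuals on the fitted values: a clearly non-zero slope means the spread changes with the prediction. The toy data below are deterministic and invented for illustration.

```r
# Deterministic toy data whose noise grows with x (invented for illustration).
i <- 1:50
toy <- data.frame(x = i, y = 1 + 2 * i + (-1)^i * 0.1 * i)

fit <- lm(y ~ x, data = toy)

# Regress |residual| on the fitted values; a positive slope means the
# spread of the residuals grows with the prediction (a fan shape).
spread_fit <- lm(abs(residuals(fit)) ~ fitted(fit))
spread_slope <- coef(spread_fit)[[2]]
print(spread_slope)
```

Replacing `fit` with `m14` would apply the same check to the movie model; a fan opening from right to left would show up as a negative slope.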

4. Independent residuals

plot(m14$residuals)

This condition is met, as the residuals are randomly scattered when plotted in order of collection, which suggests that the observations are independent.


Part 5: Prediction

I’m going to predict the IMDB rating for the movie mentioned below

Movie title: Raised on media

imdb_num_votes: 7

genre: crime

runtime: 65 mins

thtr_rel_year: 2016

dvd_rel_year: 2016

best_dir_win: No

Data reference: IMDB.com

With reference to the summary results of the best-fitting model (m14), the equation to predict imdb_rating for this movie can be written as shown below. Since crime is not one of the genre levels in our model, we classify the movie under the “Other” genre.

Below is the equation of the line for model m14:

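imdb_rating = 50.75 + (0.000003608 * imdb_num_votes) + (0.5323 * genreOther) + (0.004414 * runtime) - (0.01104 * thtr_rel_year) - (0.01172 * dvd_rel_year) + (0.2010 * best_dir_winyes)

Here genreOther is 1 for a movie in the Other genre and best_dir_winyes is 1 if the director has won an Oscar (0 otherwise). The coefficients of the remaining genre levels are left out because their indicators are 0 for this movie.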
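
As a check, the prediction can be worked out by hand from the coefficients printed in the summary of m14. The sketch below does that arithmetic; the vector names are illustrative. Because the printed coefficients are rounded to four significant digits and are multiplied by years around 2016, the hand calculation is expected to land close to, but not exactly on, the value returned by `predict()`.

```r
# Coefficients copied (rounded) from summary(m14)
coefs <- c(
  intercept       = 50.75,
  imdb_num_votes  = 3.608e-06,
  genreOther      = 0.5323,
  runtime         = 0.004414,
  thtr_rel_year   = -0.01104,
  dvd_rel_year    = -0.01172,
  best_dir_winyes = 0.2010
)

# Values for the movie being predicted (1 = intercept / indicator switched on)
x <- c(
  intercept       = 1,
  imdb_num_votes  = 7,
  genreOther      = 1,
  runtime         = 65,
  thtr_rel_year   = 2016,
  dvd_rel_year    = 2016,
  best_dir_winyes = 0
)

# Sum of coefficient * value, matched by name
manual_fit <- sum(coefs * x[names(coefs)])
print(round(manual_fit, 3))
```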

Let’s build a data frame with the above data.

new_movie <- data.frame(imdb_num_votes = 7, genre = "Other", runtime = 65, thtr_rel_year = 2016, dvd_rel_year = 2016, best_dir_win = "no")

Let’s build a prediction interval to quantify the uncertainty around this prediction.

predict(m14, new_movie, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 5.698991 3.964844 7.433137

Hence, the model predicts, with 95% confidence, that a crime movie with a theatre release and DVD release in 2016, 7 IMDB votes, a runtime of 65 minutes and a director who has not won an Oscar is expected to have an IMDB rating between 3.96 and 7.43. The point estimate is 5.70.
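
The interval above is a prediction interval for a single movie, which is wider than a confidence interval for the average rating of all movies with the same characteristics. The sketch below contrasts the two settings of the `interval` argument of `predict()`; it uses R’s built-in `mtcars` data purely as a stand-in, and the `new_car` values are invented.

```r
# Stand-in model on mtcars (not the movies data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)   # hypothetical new observation

# Interval for the mean response at these predictor values
conf_int <- predict(fit, new_car, interval = "confidence", level = 0.95)

# Interval for one new individual observation (adds residual variability)
pred_int <- predict(fit, new_car, interval = "prediction", level = 0.95)

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
print(c(confidence = conf_width, prediction = pred_width))
```

Both intervals share the same fitted value; only their widths differ.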


Part 6: Conclusion

  1. The IMDB rating prediction model built in the above analysis can only predict the rating from the non-obvious factors such as runtime, movie release year, genre etc. I also tried using the obvious factors such as actor1, director etc., which resulted in lengthy summary statistics and an overall model p-value of > 0.05. This implies that we fail to reject the null hypothesis, i.e. there is no evidence that any of the explanatory variables’ coefficients (slopes) differ from 0. This didn’t make sense to me and I finally ended up building the above model. Further research is needed to figure out the best way to address problems like this.

  2. Since two out of four diagnostic conditions are violated by our model, further research is needed: either simplifying this model or performing a non-linear regression analysis might help in building a model that predicts imdb_rating with better accuracy.