Modeling and prediction for movies

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.4

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.5

library(statsr)
library(caret)

## Warning: package 'caret' was built under R version 3.2.5

## Warning: package 'lattice' was built under R version 3.2.5

Load data

The data file movies.rdata must be in the same directory as the R Markdown file. Loading it creates a data frame called movies.

load("movies.rdata")
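# Convert to a tibble for compact printing (tbl_df() is deprecated in newer dplyr; as_tibble() is the replacement)
movies_table <- dplyr::tbl_df(movies)
# Keep only the variables considered for modeling
movies_table_model <- dplyr::select(movies_table, title, genre, runtime, mpaa_rating, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, top200_box)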
movies_table_model

## # A tibble: 651 × 11
##                     title       genre runtime mpaa_rating imdb_rating
##                     <chr>      <fctr>   <dbl>      <fctr>       <dbl>
## 1             Filly Brown       Drama      80           R         5.5
## 2                The Dish       Drama     101       PG-13         7.3
## 3     Waiting for Guffman      Comedy      84           R         7.6
## 4    The Age of Innocence       Drama     139          PG         7.2
## 5             Malevolence      Horror      90           R         5.1
## 6             Old Partner Documentary      78     Unrated         7.8
## 7               Lady Jane       Drama     142       PG-13         7.2
## 8            Mad Dog Time       Drama      93           R         5.5
## 9  Beauty Is Embarrassing Documentary      88     Unrated         7.5
## 10   The Snowtown Murders       Drama     119     Unrated         6.6
## # ... with 641 more rows, and 6 more variables: imdb_num_votes <int>,
## #   critics_rating <fctr>, critics_score <dbl>, audience_rating <fctr>,
## #   audience_score <dbl>, top200_box <fctr>

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016.

We are going to use this data to generalize about the factors that affect the popularity of movies. Such generalizations might only be applicable to US-based movies.

The fields of the data are as listed below:

title: Title of movie
title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Animation, Art House & International, Comedy, Documentary, Drama, Horror, Musical & Performing Arts, Mystery & Suspense, Other, Science Fiction & Fantasy)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, NC-17, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
thtr_rel_year: Year the movie is released in theaters
thtr_rel_month: Month the movie is released in theaters
thtr_rel_day: Day of the month the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
dvd_rel_month: Month the movie is released on DVD
dvd_rel_day: Day of the month the movie is released on DVD
imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director: Director of the movie
actor1: First main actor/actress in the abridged cast of the movie
actor2: Second main actor/actress in the abridged cast of the movie
actor3: Third main actor/actress in the abridged cast of the movie
actor4: Fourth main actor/actress in the abridged cast of the movie
actor5: Fifth main actor/actress in the abridged cast of the movie
imdb_url: Link to IMDB page for the movie
rt_url: Link to Rotten Tomatoes page for the movie

The potential bias that I see in the data is that the audience ratings are collected only from IMDB and Rotten Tomatoes, so they are limited to people who register and take the time to rate movies on these sites, rather than everyone who has seen the movie. Since the movies themselves were sampled at random, we might still be able to use this data to study what is associated with a movie’s popularity.

Part 2: Research question

Is there any association between the critics’ score and the audience’s score?
Which of the following is most likely to influence the audience score: critics score, IMDB number of votes, IMDB rating, or critics rating?

Part 3: Exploratory data analysis

Let us analyze the scatter plot of audience score against critics score.
Here I colored the points by the genre of the movie to see which kinds of movies tend to get a higher critics score, because we are trying to relate how critics score might affect audience score. Movies in the Comedy, Documentary and Drama genres tend to get a higher critics score.

qplot(critics_score,audience_score,colour=genre, data=movies_table_model)

Here I colored the movies by critics rating. In this graph we can see that Certified Fresh and Fresh movies tend to get a higher audience score.

qplot(critics_score,audience_score,colour=critics_rating, data=movies_table_model)

Here I colored each movie by its MPAA rating to see which ratings tend to have a high audience score. We can see that there is a good mix.

qplot(critics_score,audience_score,colour=mpaa_rating, data=movies_table_model)

Let us see whether movies in the Top 200 Box Office list tend to have a high audience score. We can see that a movie being in the top 200 does not necessarily mean a high audience score.

qplot(critics_score,audience_score,colour=top200_box, data=movies_table_model)

Let us see the density of audience scores by genre. In the graph we can see that Documentary films account for the bulk of the high audience scores.

qplot(audience_score,colour=genre, data=movies_table_model, geom="density")

Now let us see if there is a linear relationship between the audience score and the critics score. As we can see in the graph, the two appear to follow a roughly linear trend.

ggplot(data = movies_table_model, aes(x = critics_score, y = audience_score)) +
  geom_jitter() +
  geom_smooth(method = "lm")
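
To put a number on the trend in the plot, one could compute the Pearson correlation and fit a one-predictor regression. The sketch below is only an illustration: it uses a small simulated data frame, toy_movies (a hypothetical stand-in, not the real sample), with the same two column names. With the real data, movies_table_model would take its place.

```r
# Simulated stand-in for movies_table_model (hypothetical data, not the real sample)
set.seed(42)
toy_movies <- data.frame(critics_score = runif(200, min = 1, max = 100))
toy_movies$audience_score <- 30 + 0.5 * toy_movies$critics_score + rnorm(200, sd = 10)

# Pearson correlation between the two scores; complete.obs drops rows with NA
score_cor <- cor(toy_movies$critics_score, toy_movies$audience_score,
                 use = "complete.obs")

# Simple linear regression of audience score on critics score
m_simple <- lm(audience_score ~ critics_score, data = toy_movies)
slope <- unname(coef(m_simple)["critics_score"])

cat("correlation:", round(score_cor, 3), "\n")
cat("slope:", round(slope, 3), "\n")
```

In a one-predictor regression the R-squared equals the squared correlation, so the two numbers tell the same story about the strength of the linear trend.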

Part 4: Modeling

Let us model audience_score with a full model that uses the following predictors: genre, runtime, mpaa_rating, imdb_rating, imdb_num_votes, critics_rating and critics_score. It looks like mpaa_rating is not a strong predictor, since all of its levels have large p-values (0.55 to 0.89).

m_critics1 <- lm(audience_score ~ genre + runtime + mpaa_rating + imdb_rating + imdb_num_votes + critics_rating + critics_score, data = movies_table_model)
summary(m_critics1)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating + 
##     imdb_rating + imdb_num_votes + critics_rating + critics_score, 
##     data = movies_table_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.270  -6.304   0.468   5.667  49.182 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.723e+01  4.900e+00  -5.556 4.07e-08 ***
## genreAnimation                  7.776e+00  3.827e+00   2.032   0.0426 *  
## genreArt House & International -1.500e-01  2.990e+00  -0.050   0.9600    
## genreComedy                     2.035e+00  1.631e+00   1.247   0.2127    
## genreDocumentary                1.052e+00  2.292e+00   0.459   0.6464    
## genreDrama                      4.848e-01  1.434e+00   0.338   0.7354    
## genreHorror                    -5.073e+00  2.444e+00  -2.076   0.0383 *  
## genreMusical & Performing Arts  5.198e+00  3.187e+00   1.631   0.1034    
## genreMystery & Suspense        -5.506e+00  1.828e+00  -3.012   0.0027 ** 
## genreOther                      1.510e+00  2.770e+00   0.545   0.5859    
## genreScience Fiction & Fantasy -9.231e-01  3.502e+00  -0.264   0.7922    
## runtime                        -4.596e-02  2.296e-02  -2.002   0.0457 *  
## mpaa_ratingNC-17               -4.448e+00  7.421e+00  -0.599   0.5491    
## mpaa_ratingPG                   9.615e-01  2.702e+00   0.356   0.7220    
## mpaa_ratingPG-13               -6.938e-01  2.789e+00  -0.249   0.8036    
## mpaa_ratingR                   -3.703e-01  2.684e+00  -0.138   0.8903    
## mpaa_ratingUnrated              9.499e-01  3.064e+00   0.310   0.7566    
## imdb_rating                     1.494e+01  6.211e-01  24.055  < 2e-16 ***
## imdb_num_votes                  3.595e-06  4.364e-06   0.824   0.4103    
## critics_ratingFresh            -2.052e+00  1.217e+00  -1.686   0.0923 .  
## critics_ratingRotten           -4.762e+00  1.977e+00  -2.408   0.0163 *  
## critics_score                   1.949e-03  3.574e-02   0.055   0.9565    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.781 on 628 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7739, Adjusted R-squared:  0.7664 
## F-statistic: 102.4 on 21 and 628 DF,  p-value: < 2.2e-16

Let’s remove mpaa_rating because it is not a meaningful addition to our model, and see if that gives us a simpler model. Critics score also has a high p-value (0.9565) in the full model.

m_critics2 <- lm(audience_score ~ genre + runtime + imdb_rating + imdb_num_votes + critics_rating + critics_score, data = movies_table_model)
summary(m_critics2)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     imdb_num_votes + critics_rating + critics_score, data = movies_table_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.118  -5.979   0.452   5.650  49.099 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.753e+01  4.207e+00  -6.543 1.25e-10 ***
## genreAnimation                  8.057e+00  3.501e+00   2.301   0.0217 *  
## genreArt House & International -1.503e-01  2.925e+00  -0.051   0.9590    
## genreComedy                     1.861e+00  1.611e+00   1.155   0.2484    
## genreDocumentary                1.418e+00  2.062e+00   0.688   0.4919    
## genreDrama                      1.756e-01  1.394e+00   0.126   0.8998    
## genreHorror                    -5.274e+00  2.387e+00  -2.210   0.0275 *  
## genreMusical & Performing Arts  5.187e+00  3.161e+00   1.641   0.1014    
## genreMystery & Suspense        -5.884e+00  1.780e+00  -3.305   0.0010 ** 
## genreOther                      1.721e+00  2.753e+00   0.625   0.5320    
## genreScience Fiction & Fantasy -8.504e-01  3.491e+00  -0.244   0.8076    
## runtime                        -4.652e-02  2.256e-02  -2.062   0.0397 *  
## imdb_rating                     1.494e+01  6.176e-01  24.194  < 2e-16 ***
## imdb_num_votes                  2.863e-06  4.308e-06   0.665   0.5065    
## critics_ratingFresh            -1.957e+00  1.212e+00  -1.615   0.1067    
## critics_ratingRotten           -4.546e+00  1.965e+00  -2.314   0.0210 *  
## critics_score                   7.852e-03  3.532e-02   0.222   0.8241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.762 on 633 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.773,  Adjusted R-squared:  0.7672 
## F-statistic: 134.7 on 16 and 633 DF,  p-value: < 2.2e-16
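
Because mpaa_rating is a factor, each of its p-values only compares one level against the baseline level. One common way to judge such a variable as a whole is a partial F-test between the nested models using anova(). The sketch below assumes simulated data (toy, m_full and m_reduced are hypothetical names, and the response is constructed so that the rating has no effect); it is not a result from the movies data. Note that anova() requires both models to be fitted to the same rows.

```r
# Simulated stand-in data (hypothetical, not the real sample)
set.seed(1)
n <- 300
toy <- data.frame(
  imdb_rating = runif(n, 2, 9),
  mpaa_rating = factor(sample(c("G", "PG", "PG-13", "R"), n, replace = TRUE))
)
# In this simulation the response depends on imdb_rating only, not on mpaa_rating
toy$audience_score <- -25 + 15 * toy$imdb_rating + rnorm(n, sd = 10)

m_full    <- lm(audience_score ~ imdb_rating + mpaa_rating, data = toy)
m_reduced <- lm(audience_score ~ imdb_rating, data = toy)

# Partial F-test: does the factor as a whole improve the fit?
f_test <- anova(m_reduced, m_full)
print(f_test)

# p-value for adding all mpaa_rating levels at once
p_value <- f_test[["Pr(>F)"]][2]
```

A large p_value here would support dropping the whole factor, which is the same decision the individual p-values suggested above.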

Let’s remove critics score as well, since it is still not a strong predictor (p-value 0.8241).

m_critics3 <- lm(audience_score ~ genre + runtime + imdb_rating + imdb_num_votes + critics_rating, data = movies_table_model)
summary(m_critics3)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     imdb_num_votes + critics_rating, data = movies_table_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.035  -6.046   0.477   5.592  49.049 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2.740e+01  4.164e+00  -6.579  9.9e-11 ***
## genreAnimation                  8.070e+00  3.498e+00   2.307 0.021367 *  
## genreArt House & International -1.964e-01  2.915e+00  -0.067 0.946312    
## genreComedy                     1.867e+00  1.610e+00   1.160 0.246623    
## genreDocumentary                1.445e+00  2.057e+00   0.702 0.482679    
## genreDrama                      1.870e-01  1.392e+00   0.134 0.893147    
## genreHorror                    -5.254e+00  2.383e+00  -2.204 0.027865 *  
## genreMusical & Performing Arts  5.213e+00  3.157e+00   1.651 0.099185 .  
## genreMystery & Suspense        -5.882e+00  1.779e+00  -3.307 0.000998 ***
## genreOther                      1.739e+00  2.750e+00   0.633 0.527224    
## genreScience Fiction & Fantasy -8.510e-01  3.489e+00  -0.244 0.807354    
## runtime                        -4.629e-02  2.252e-02  -2.055 0.040268 *  
## imdb_rating                     1.501e+01  5.232e-01  28.700  < 2e-16 ***
## imdb_num_votes                  2.721e-06  4.257e-06   0.639 0.522986    
## critics_ratingFresh            -2.015e+00  1.183e+00  -1.704 0.088830 .  
## critics_ratingRotten           -4.873e+00  1.298e+00  -3.756 0.000189 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.755 on 634 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.773,  Adjusted R-squared:  0.7676 
## F-statistic: 143.9 on 15 and 634 DF,  p-value: < 2.2e-16

Alright, let’s see if removing genre (and imdb_num_votes) still gives a good model in which every remaining explanatory variable is significant.

m_critics3_reduced <- lm(audience_score ~ runtime + imdb_rating + critics_rating, data = movies_table_model)
summary(m_critics3_reduced)

## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_rating, 
##     data = movies_table_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.829  -6.530   0.558   5.620  51.228 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -26.74042    3.85742  -6.932 1.01e-11 ***
## runtime               -0.05459    0.02099  -2.601  0.00952 ** 
## imdb_rating           15.16031    0.48215  31.443  < 2e-16 ***
## critics_ratingFresh   -2.85310    1.12448  -2.537  0.01141 *  
## critics_ratingRotten  -5.58152    1.27888  -4.364 1.48e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.984 on 645 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.758,  Adjusted R-squared:  0.7565 
## F-statistic: 505.2 on 4 and 645 DF,  p-value: < 2.2e-16

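Removing genre (along with imdb_num_votes) lowered our adjusted R-squared, but not by much: from 0.7676 to 0.7565. Since genre still adds a little, let’s put it back and drop only imdb_num_votes.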

m_critics4 <- lm(audience_score ~  genre + runtime + imdb_rating + critics_rating, data = movies_table_model)
summary(m_critics4)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     critics_rating, data = movies_table_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.036  -6.019   0.599   5.575  49.175 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -27.90462    4.08624  -6.829 2.00e-11 ***
## genreAnimation                   8.03284    3.49580   2.298 0.021894 *  
## genreArt House & International  -0.44754    2.88716  -0.155 0.876862    
## genreComedy                      1.81976    1.60740   1.132 0.258014    
## genreDocumentary                 1.11257    1.98951   0.559 0.576212    
## genreDrama                       0.04222    1.37264   0.031 0.975471    
## genreHorror                     -5.32295    2.37988  -2.237 0.025656 *  
## genreMusical & Performing Arts   4.88807    3.11428   1.570 0.117014    
## genreMystery & Suspense         -5.92951    1.77663  -3.337 0.000895 ***
## genreOther                       1.76351    2.74809   0.642 0.521285    
## genreScience Fiction & Fantasy  -0.81231    3.48644  -0.233 0.815843    
## runtime                         -0.04258    0.02175  -1.957 0.050730 .  
## imdb_rating                     15.10176    0.50487  29.912  < 2e-16 ***
## critics_ratingFresh             -2.27666    1.10903  -2.053 0.040498 *  
## critics_ratingRotten            -5.07158    1.25938  -4.027 6.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.75 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7728, Adjusted R-squared:  0.7678 
## F-statistic: 154.3 on 14 and 635 DF,  p-value: < 2.2e-16

Looks like genre, runtime, IMDB rating and critics rating together explain about 77% of the variability in audience score (adjusted R-squared of 0.7678), the highest of the models tried.
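
As a quick check on how the adjusted R-squared relates to the other numbers in the summary, the calculation can be redone by hand. This is a sketch using the rounded values printed above; the variable names are only illustrative.

```r
# Values read from the summary(m_critics4) output (rounded as printed)
r_squared <- 0.7728   # Multiple R-squared
n_obs     <- 650      # 651 movies minus 1 row deleted due to missingness
n_coef    <- 14       # slope coefficients, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_coef - 1)
print(round(adj_r_squared, 4))   # 0.7678
```

The denominator n_obs - n_coef - 1 is 635, which matches the residual degrees of freedom reported in the summary.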

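Let’s look at some diagnostics for the linear model, especially the distribution of the residuals. The histogram and normal Q-Q plot check whether the residuals are nearly normal, the plots of residuals against fitted values check for constant variability, and the plot of residuals in row order checks for independence.
Looking at the histogram below, the residuals look fairly normally distributed.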

hist(m_critics4$residuals)

qqnorm(m_critics4$residuals)
qqline(m_critics4$residuals)

plot(m_critics4$residuals ~ m_critics4$fitted)

plot(abs(m_critics4$residuals) ~ m_critics4$fitted)

plot(m_critics4$residuals)

Part 5: Prediction

So let’s predict what the audience score would be for the following movie

Prediction for Popstar: Never Stop Never Stopping
Runtime: 86 minutes
IMDB Rating: 6.8
Genre: Comedy
Rotten Tomatoes Audience Score: 68%
Critics Rating: “Certified Fresh”

new.movie <- data.frame(genre=c('Comedy'),runtime=c(86),imdb_rating=c(6.8),critics_rating=c('Certified Fresh'))
predict(m_critics4,newdata=new.movie, interval='confidence')

##       fit      lwr      upr
## 1 72.9455 70.17536 75.71564

predict(m_critics4,newdata=new.movie, interval='prediction')

##       fit      lwr      upr
## 1 72.9455 53.59933 92.29166

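Our model predicts an audience score of about 72.95 for Popstar: Never Stop Never Stopping. From the confidence interval, we are 95% confident that movies with the same runtime, IMDB rating, genre and critics rating have an average audience score between 70.18 and 75.72. From the prediction interval, we are 95% confident that an individual movie with these characteristics will have an audience score between 53.60 and 92.29. The actual audience score of 68 falls inside the prediction interval.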
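
The fitted value can also be reproduced by hand from the coefficients, which shows how each input contributes to the prediction. This is a sketch using the rounded coefficients printed in the summary(m_critics4) output, so it agrees with predict() only up to rounding; the variable names are illustrative.

```r
# Coefficients copied from the summary(m_critics4) output (rounded as printed)
intercept      <- -27.90462
b_genre_comedy <-   1.81976   # Comedy relative to the baseline genre
b_runtime      <-  -0.04258   # per minute of runtime
b_imdb_rating  <-  15.10176   # per IMDB rating point
# "Certified Fresh" is the baseline level of critics_rating, so it adds 0

fit_by_hand <- intercept + b_genre_comedy + b_runtime * 86 + b_imdb_rating * 6.8
print(round(fit_by_hand, 2))   # 72.95

# Does the observed audience score fall inside the 95% prediction interval?
observed <- 68
pred_lwr <- 53.59933
pred_upr <- 92.29166
in_interval <- observed >= pred_lwr && observed <= pred_upr
print(in_interval)             # TRUE
```

Nearly all of the predicted score comes from the IMDB rating term (15.10176 × 6.8), which fits with imdb_rating being by far the strongest predictor in the model.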

Part 6: Conclusion

Even though I started by focusing on critics score as a predictor, I eventually dropped it from the final model because it did not really help to increase the accuracy of the model, probably because IMDB rating and critics rating already carry much of the same information. I ended up using genre, runtime, IMDB rating and critics rating because they gave the highest adjusted R-squared among the models we fitted. It looks like we can still have a good model based on runtime, IMDB rating and critics rating if we drop genre from the model.