library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
library(statsr)
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Warning: package 'lattice' was built under R version 3.2.5
load("movies.rdata")
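# Convert to a tibble; as_tibble() replaces the deprecated tbl_df()
movies_table <- dplyr::as_tibble(movies)
# Keep only the variables considered for modeling
movies_table_model <- dplyr::select(movies_table, title, genre, runtime, mpaa_rating, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, top200_box)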
movies_table_model
## # A tibble: 651 × 11
## title genre runtime mpaa_rating imdb_rating
## <chr> <fctr> <dbl> <fctr> <dbl>
## 1 Filly Brown Drama 80 R 5.5
## 2 The Dish Drama 101 PG-13 7.3
## 3 Waiting for Guffman Comedy 84 R 7.6
## 4 The Age of Innocence Drama 139 PG 7.2
## 5 Malevolence Horror 90 R 5.1
## 6 Old Partner Documentary 78 Unrated 7.8
## 7 Lady Jane Drama 142 PG-13 7.2
## 8 Mad Dog Time Drama 93 R 5.5
## 9 Beauty Is Embarrassing Documentary 88 Unrated 7.5
## 10 The Snowtown Murders Drama 119 Unrated 6.6
## # ... with 641 more rows, and 6 more variables: imdb_num_votes <int>,
## # critics_rating <fctr>, critics_score <dbl>, audience_rating <fctr>,
## # audience_score <dbl>, top200_box <fctr>
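Before modeling, it can help to check for missing values, since lm() drops any row that has an NA in a variable it uses. The following is a minimal sketch on a made-up stand-in data frame (toy_movies and its values are hypothetical, not taken from movies); the same two calls could be applied to movies_table_model.

```r
# Toy stand-in for movies_table_model (values are made up for illustration).
toy_movies <- data.frame(
  title          = c("A", "B", "C", "D"),
  runtime        = c(80, NA, 101, 95),
  imdb_rating    = c(5.5, 7.3, 7.6, 6.1),
  audience_score = c(73, 81, 91, 60)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy_movies))
print(na_per_column)

# Rows with at least one NA; lm() drops such a row when the NA
# is in a variable that appears in the model formula.
incomplete_rows <- toy_movies[!complete.cases(toy_movies), ]
print(incomplete_rows$title)
```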
The data set comprises 651 randomly sampled movies produced and released before 2016.
We will use this data to look for factors associated with the popularity of movies. Because the movies were sampled at random, the findings may generalize to similar movies, though possibly only to US-based ones.
The fields of the data are listed below:
title: Title of movie
title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
thtr_rel_year: Year the movie is released in theaters
thtr_rel_month: Month the movie is released in theaters
thtr_rel_day: Day of the month the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
dvd_rel_month: Month the movie is released on DVD
dvd_rel_day: Day of the month the movie is released on DVD
imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes); note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress_win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes); note that this is not necessarily whether the actress won an Oscar for her role in the given movie
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes); note that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director: Director of the movie
actor1: First main actor/actress in the abridged cast of the movie
actor2: Second main actor/actress in the abridged cast of the movie
actor3: Third main actor/actress in the abridged cast of the movie
actor4: Fourth main actor/actress in the abridged cast of the movie
actor5: Fifth main actor/actress in the abridged cast of the movie
imdb_url: Link to IMDB page for the movie
rt_url: Link to Rotten Tomatoes page for the movie
The potential bias that I see in the data is that the audience ratings are collected only from IMDB or Rotten Tomatoes, so they reflect only people who register and take the time to rate movies on these sites rather than everyone who has seen the movie. Because the movies themselves were sampled at random, we can still use this data to explore what is associated with a movie’s popularity. Since this is an observational study with no random assignment, it cannot establish causation.
Is there an association between the critics’ score and the audience’s score?
Which of the following is most likely to influence the audience score: critics score, IMDB number of votes, IMDB rating, or critics rating?
Let us analyze the scatter plot of audience score against critics score.
Here I colored the points by the genre of the movie to see which kinds of movies tend to get a higher critics score, since we are trying to relate critics score to audience score. Movies in the Comedy, Documentary, and Drama genres tend to get a higher critics score.
qplot(critics_score,audience_score,colour=genre, data=movies_table_model)
Here I colored the movies by critics rating. In this graph we can see that Certified Fresh and Fresh movies tend to get a higher audience score.
qplot(critics_score,audience_score,colour=critics_rating, data=movies_table_model)
Here I colored the points by MPAA rating to see which ratings tend to have a high audience score. We can see that there is a good mix.
qplot(critics_score,audience_score,colour=mpaa_rating, data=movies_table_model)
Let us see whether movies in the top 200 box office list tend to have a high audience score. We can see that being in the top 200 does not necessarily mean a high audience score.
qplot(critics_score,audience_score,colour=top200_box, data=movies_table_model)
Let us look at the density of audience scores by genre. In the graph we can see that Documentary films account for the bulk of the high audience scores.
qplot(audience_score,colour=genre, data=movies_table_model, geom="density")
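To put a number on the genre comparison, one option is to compare the median audience score per genre. This is a minimal sketch in base R using a made-up data frame (toy_scores and its values are hypothetical; the real analysis would use movies_table_model).

```r
# Made-up scores for three genres, for illustration only.
toy_scores <- data.frame(
  genre = c("Documentary", "Documentary", "Drama", "Drama", "Horror", "Horror"),
  audience_score = c(86, 90, 70, 64, 40, 52)
)

# Median audience score within each genre, as a named numeric vector.
median_by_genre <- vapply(
  split(toy_scores$audience_score, toy_scores$genre),
  median,
  numeric(1)
)

# Highest-scoring genres first.
median_by_genre <- sort(median_by_genre, decreasing = TRUE)
print(median_by_genre)
```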
Now let us see if there is a linear relationship between the audience score and the critics score. As the graph shows, there appears to be a positive linear trend between the two.
ggplot(data = movies_table_model, aes(x = critics_score, y = audience_score)) +
geom_jitter() +
geom_smooth(method = "lm")
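The strength of a linear trend like this can be summarized with the correlation coefficient. The sketch below uses six made-up score pairs (not values from movies) and also checks that, in a regression with a single predictor, R-squared equals the squared correlation.

```r
# Made-up score pairs, for illustration only.
critics_score  <- c(20, 35, 50, 65, 80, 95)
audience_score <- c(35, 40, 58, 62, 75, 88)

# Pearson correlation between the two scores.
r <- cor(critics_score, audience_score)

# Simple linear regression of audience score on critics score.
fit <- lm(audience_score ~ critics_score)
r_squared <- summary(fit)$r.squared

# With one predictor, R-squared is the square of the correlation.
print(c(correlation = r, r_squared = r_squared))
```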
Let us model audience_score with the following explanatory variables for the full model: genre, runtime, mpaa_rating, imdb_rating, imdb_num_votes, critics_rating, and critics_score. It looks like mpaa_rating is not a strong predictor, based on the large p-values of all its levels.
m_critics1 <- lm(audience_score ~ genre + runtime + mpaa_rating + imdb_rating + imdb_num_votes + critics_rating + critics_score, data = movies_table_model)
summary(m_critics1)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating +
## imdb_rating + imdb_num_votes + critics_rating + critics_score,
## data = movies_table_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.270 -6.304 0.468 5.667 49.182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.723e+01 4.900e+00 -5.556 4.07e-08 ***
## genreAnimation 7.776e+00 3.827e+00 2.032 0.0426 *
## genreArt House & International -1.500e-01 2.990e+00 -0.050 0.9600
## genreComedy 2.035e+00 1.631e+00 1.247 0.2127
## genreDocumentary 1.052e+00 2.292e+00 0.459 0.6464
## genreDrama 4.848e-01 1.434e+00 0.338 0.7354
## genreHorror -5.073e+00 2.444e+00 -2.076 0.0383 *
## genreMusical & Performing Arts 5.198e+00 3.187e+00 1.631 0.1034
## genreMystery & Suspense -5.506e+00 1.828e+00 -3.012 0.0027 **
## genreOther 1.510e+00 2.770e+00 0.545 0.5859
## genreScience Fiction & Fantasy -9.231e-01 3.502e+00 -0.264 0.7922
## runtime -4.596e-02 2.296e-02 -2.002 0.0457 *
## mpaa_ratingNC-17 -4.448e+00 7.421e+00 -0.599 0.5491
## mpaa_ratingPG 9.615e-01 2.702e+00 0.356 0.7220
## mpaa_ratingPG-13 -6.938e-01 2.789e+00 -0.249 0.8036
## mpaa_ratingR -3.703e-01 2.684e+00 -0.138 0.8903
## mpaa_ratingUnrated 9.499e-01 3.064e+00 0.310 0.7566
## imdb_rating 1.494e+01 6.211e-01 24.055 < 2e-16 ***
## imdb_num_votes 3.595e-06 4.364e-06 0.824 0.4103
## critics_ratingFresh -2.052e+00 1.217e+00 -1.686 0.0923 .
## critics_ratingRotten -4.762e+00 1.977e+00 -2.408 0.0163 *
## critics_score 1.949e-03 3.574e-02 0.055 0.9565
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.781 on 628 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7739, Adjusted R-squared: 0.7664
## F-statistic: 102.4 on 21 and 628 DF, p-value: < 2.2e-16
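Because mpaa_rating is a factor with several levels, each of its p-values only compares one level with the baseline. One way to test the variable as a whole is a partial F-test between nested models with anova(). This is a minimal sketch on simulated data (toy, full, and reduced are hypothetical names, and the simulated mpaa_rating is deliberately unrelated to the response).

```r
set.seed(1)
n <- 200

# Simulated stand-in data: audience_score depends on imdb_rating only.
toy <- data.frame(
  imdb_rating = runif(n, 3, 9),
  mpaa_rating = factor(sample(c("G", "PG", "PG-13", "R"), n, replace = TRUE))
)
toy$audience_score <- -25 + 15 * toy$imdb_rating + rnorm(n, sd = 10)

# Nested models: the reduced model omits the factor being tested.
full    <- lm(audience_score ~ imdb_rating + mpaa_rating, data = toy)
reduced <- lm(audience_score ~ imdb_rating, data = toy)

# Partial F-test: does adding mpaa_rating (all levels at once) improve the fit?
f_test <- anova(reduced, full)
print(f_test)
```

A large p-value in the second row of the table would support dropping the factor as a whole.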
Let’s remove mpaa_rating because it is not a meaningful addition to our model, and see whether that gives us a simpler model. The critics score also has a high p-value (0.9565), so it is a candidate for removal next.
m_critics2 <- lm(audience_score ~ genre + runtime + imdb_rating + imdb_num_votes + critics_rating + critics_score, data = movies_table_model)
summary(m_critics2)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## imdb_num_votes + critics_rating + critics_score, data = movies_table_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.118 -5.979 0.452 5.650 49.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.753e+01 4.207e+00 -6.543 1.25e-10 ***
## genreAnimation 8.057e+00 3.501e+00 2.301 0.0217 *
## genreArt House & International -1.503e-01 2.925e+00 -0.051 0.9590
## genreComedy 1.861e+00 1.611e+00 1.155 0.2484
## genreDocumentary 1.418e+00 2.062e+00 0.688 0.4919
## genreDrama 1.756e-01 1.394e+00 0.126 0.8998
## genreHorror -5.274e+00 2.387e+00 -2.210 0.0275 *
## genreMusical & Performing Arts 5.187e+00 3.161e+00 1.641 0.1014
## genreMystery & Suspense -5.884e+00 1.780e+00 -3.305 0.0010 **
## genreOther 1.721e+00 2.753e+00 0.625 0.5320
## genreScience Fiction & Fantasy -8.504e-01 3.491e+00 -0.244 0.8076
## runtime -4.652e-02 2.256e-02 -2.062 0.0397 *
## imdb_rating 1.494e+01 6.176e-01 24.194 < 2e-16 ***
## imdb_num_votes 2.863e-06 4.308e-06 0.665 0.5065
## critics_ratingFresh -1.957e+00 1.212e+00 -1.615 0.1067
## critics_ratingRotten -4.546e+00 1.965e+00 -2.314 0.0210 *
## critics_score 7.852e-03 3.532e-02 0.222 0.8241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.762 on 633 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.773, Adjusted R-squared: 0.7672
## F-statistic: 134.7 on 16 and 633 DF, p-value: < 2.2e-16
Let’s remove critics_score, since it is not a strong predictor (p-value 0.8241).
m_critics3 <- lm(audience_score ~ genre + runtime + imdb_rating + imdb_num_votes + critics_rating, data = movies_table_model)
summary(m_critics3)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## imdb_num_votes + critics_rating, data = movies_table_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.035 -6.046 0.477 5.592 49.049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.740e+01 4.164e+00 -6.579 9.9e-11 ***
## genreAnimation 8.070e+00 3.498e+00 2.307 0.021367 *
## genreArt House & International -1.964e-01 2.915e+00 -0.067 0.946312
## genreComedy 1.867e+00 1.610e+00 1.160 0.246623
## genreDocumentary 1.445e+00 2.057e+00 0.702 0.482679
## genreDrama 1.870e-01 1.392e+00 0.134 0.893147
## genreHorror -5.254e+00 2.383e+00 -2.204 0.027865 *
## genreMusical & Performing Arts 5.213e+00 3.157e+00 1.651 0.099185 .
## genreMystery & Suspense -5.882e+00 1.779e+00 -3.307 0.000998 ***
## genreOther 1.739e+00 2.750e+00 0.633 0.527224
## genreScience Fiction & Fantasy -8.510e-01 3.489e+00 -0.244 0.807354
## runtime -4.629e-02 2.252e-02 -2.055 0.040268 *
## imdb_rating 1.501e+01 5.232e-01 28.700 < 2e-16 ***
## imdb_num_votes 2.721e-06 4.257e-06 0.639 0.522986
## critics_ratingFresh -2.015e+00 1.183e+00 -1.704 0.088830 .
## critics_ratingRotten -4.873e+00 1.298e+00 -3.756 0.000189 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.755 on 634 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.773, Adjusted R-squared: 0.7676
## F-statistic: 143.9 on 15 and 634 DF, p-value: < 2.2e-16
Alright, let’s see if removing genre (and imdb_num_votes, which also has a high p-value) still gives a good model in which all of the remaining explanatory variables are significant.
m_critics_no_genre <- lm(audience_score ~ runtime + imdb_rating + critics_rating, data = movies_table_model)
summary(m_critics_no_genre)
##
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_rating,
## data = movies_table_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.829 -6.530 0.558 5.620 51.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.74042 3.85742 -6.932 1.01e-11 ***
## runtime -0.05459 0.02099 -2.601 0.00952 **
## imdb_rating 15.16031 0.48215 31.443 < 2e-16 ***
## critics_ratingFresh -2.85310 1.12448 -2.537 0.01141 *
## critics_ratingRotten -5.58152 1.27888 -4.364 1.48e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.984 on 645 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.758, Adjusted R-squared: 0.7565
## F-statistic: 505.2 on 4 and 645 DF, p-value: < 2.2e-16
Removing genre lowered our adjusted R-squared, but not by much (from 0.7676 to 0.7565). Let’s put genre back and drop only imdb_num_votes.
m_critics4 <- lm(audience_score ~ genre + runtime + imdb_rating + critics_rating, data = movies_table_model)
summary(m_critics4)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## critics_rating, data = movies_table_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.036 -6.019 0.599 5.575 49.175
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -27.90462 4.08624 -6.829 2.00e-11 ***
## genreAnimation 8.03284 3.49580 2.298 0.021894 *
## genreArt House & International -0.44754 2.88716 -0.155 0.876862
## genreComedy 1.81976 1.60740 1.132 0.258014
## genreDocumentary 1.11257 1.98951 0.559 0.576212
## genreDrama 0.04222 1.37264 0.031 0.975471
## genreHorror -5.32295 2.37988 -2.237 0.025656 *
## genreMusical & Performing Arts 4.88807 3.11428 1.570 0.117014
## genreMystery & Suspense -5.92951 1.77663 -3.337 0.000895 ***
## genreOther 1.76351 2.74809 0.642 0.521285
## genreScience Fiction & Fantasy -0.81231 3.48644 -0.233 0.815843
## runtime -0.04258 0.02175 -1.957 0.050730 .
## imdb_rating 15.10176 0.50487 29.912 < 2e-16 ***
## critics_ratingFresh -2.27666 1.10903 -2.053 0.040498 *
## critics_ratingRotten -5.07158 1.25938 -4.027 6.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.75 on 635 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7728, Adjusted R-squared: 0.7678
## F-statistic: 154.3 on 14 and 635 DF, p-value: < 2.2e-16
It looks like genre, runtime, IMDB rating, and critics rating together explain the audience score with an adjusted R-squared of 0.7678 (about 77%).
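The adjusted R-squared can be reproduced by hand from the summary output with 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of slope coefficients. This short sketch uses the figures reported for m_critics4 above (the variable names are only for illustration).

```r
r_squared <- 0.7728  # Multiple R-squared reported for m_critics4
n_slopes  <- 14      # coefficients excluding the intercept (F-statistic numerator df)
n_obs     <- 650     # 635 residual df + 14 slopes + 1 intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_slopes - 1)
print(round(adj_r_squared, 4))  # 0.7678, matching the summary output
```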
Let’s look at some diagnostics for the linear model, especially the distribution of the residuals.
Looking at the graphs below, the residuals look fairly normally distributed, although their range (-26.0 to 49.2) suggests a somewhat longer right tail.
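# Normality of residuals: histogram and normal Q-Q plot
hist(m_critics4$residuals)
qqnorm(m_critics4$residuals)
qqline(m_critics4$residuals)
# Constant variance: residuals and absolute residuals against fitted values
plot(m_critics4$residuals ~ m_critics4$fitted.values)
plot(abs(m_critics4$residuals) ~ m_critics4$fitted.values)
# Independence: residuals in the order the observations appear in the data
plot(m_critics4$residuals)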
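As a numeric complement to the plots, standardized residuals can flag unusual observations; the largest residual reported for m_critics4 (49.175) is about five residual standard errors (9.75). The sketch below uses simulated data with one planted outlier (all names and values are hypothetical), and the cutoff of 3 is a common rule of thumb rather than a fixed rule.

```r
set.seed(42)
n <- 100

# Simulated predictor and response with normally distributed noise.
x <- runif(n, 3, 9)
y <- -25 + 15 * x + rnorm(n, sd = 10)

# Plant one unusually large positive residual in the first observation.
y[1] <- y[1] + 80

fit <- lm(y ~ x)

# Standardized residuals: residuals scaled by their estimated standard deviation.
std_res <- rstandard(fit)

# Indices of observations more than 3 standard deviations from the fit.
flagged <- which(abs(std_res) > 3)
print(flagged)
```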
So let’s predict what the audience score would be for the following movie:
Prediction for Popstar: Never Stop Never Stopping
Runtime = 86
IMDB Rating = 6.8
Genre: Comedy
Rotten Tomatoes Audience Score = 68%
Critics Rating= “Certified Fresh”
new.movie <- data.frame(genre=c('Comedy'),runtime=c(86),imdb_rating=c(6.8),critics_rating=c('Certified Fresh'))
predict(m_critics4,newdata=new.movie, interval='confidence')
## fit lwr upr
## 1 72.9455 70.17536 75.71564
predict(m_critics4,newdata=new.movie, interval='prediction')
## fit lwr upr
## 1 72.9455 53.59933 92.29166
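Our model estimates that movies with the same runtime, IMDB rating, genre, and critics rating as “Popstar: Never Stop Never Stopping” have, on average, an audience score between 70.18% and 75.72% (95% confidence interval). For an individual movie with these characteristics, the 95% prediction interval is 53.60% to 92.29%. The point prediction of 72.95% is close to the actual Rotten Tomatoes audience score of 68%, which falls inside the prediction interval.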
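The two intervals answer different questions: the confidence interval is for the mean score of all such movies, while the prediction interval is for a single movie and so is always wider. This is a minimal sketch on simulated data (toy, fit, and new_movie are hypothetical) that compares the two widths for the same fitted value.

```r
set.seed(7)
n <- 150

# Simulated stand-in data with a single predictor.
toy <- data.frame(imdb_rating = runif(n, 3, 9))
toy$audience_score <- -25 + 15 * toy$imdb_rating + rnorm(n, sd = 10)

fit <- lm(audience_score ~ imdb_rating, data = toy)
new_movie <- data.frame(imdb_rating = 6.8)

# Interval for the mean response at imdb_rating = 6.8.
conf_int <- predict(fit, newdata = new_movie, interval = "confidence")
# Interval for one new observation at imdb_rating = 6.8.
pred_int <- predict(fit, newdata = new_movie, interval = "prediction")

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
print(c(conf_width = conf_width, pred_width = pred_width))
```

In the output above for m_critics4, the same pattern holds: the confidence interval is about 5.5 points wide and the prediction interval about 38.7 points wide.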
Even though I started out focusing on critics score as a predictor, I eventually dropped it from the final model because it did not really help increase the accuracy of the model. I ended up using genre, runtime, IMDB rating, and critics rating because they gave the highest adjusted R-squared among the models we fit. It also looks like if we drop genre from the model, we can still have a good model based on runtime, IMDB rating, and critics rating.