library(ggplot2)
library(dplyr)
library(statsr)
library(psych)
load("movies.Rdata")
The goal of this section is:
Describe how the observations in the sample are collected and the implications of this data collection method on the scope of inference (generalizability / causality).
The data set “movies” comprises 651 randomly sampled movies produced and released before 2016. The movies are from American studios (we know this because the MPAA ratings in the data apply only to American movies). The data set includes information from both Rotten Tomatoes and IMDb.
Rotten Tomatoes and the Tomatometer® rating is the most trusted measurement of quality entertainment. As the leading online aggregator of movie and TV show reviews from professional critics, Rotten Tomatoes offers the most comprehensive guide to what’s fresh. The world-famous Tomatometer® rating represents the percentage of positive professional reviews for films and TV shows and is used by millions every day to help with their entertainment viewing decisions. Rotten Tomatoes designates the best reviewed movies and TV shows as Certified Fresh. That accolade is awarded with Tomatometer ratings of 75% and higher and a required minimum number of reviews.
The Internet Movie Database is an online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews. As of June 2016, IMDb has approximately 3.7 million titles (including episodes) and 7 million personalities in its database.
The introduction to the data above provides key information about the generalizability of the data. It suggests the data set should be considered the result of a retrospective observational study that uses a random sampling design to select a representative sample of U.S. movies. When random sampling has been employed in data collection, the results should be generalizable to the target population. Therefore, the results of the analysis should be generalizable to U.S. movies released between 1970 and 2014.
Note that observational studies show associations. In a data analysis, association does not imply causation. Causation can only be inferred from a randomized experiment. This analysis does not meet the requirements of an experiment.
While we have assumed the data to be a random sample, there are always potential sources of bias that can skew results.
Voluntary Response Bias: The study may suffer from voluntary response bias, since people with strong opinions are more likely to participate, and these voluntary participants may not be representative of the U.S. population. Also note that professional critics provide the guidance on Rotten Tomatoes, while authenticated users provide the IMDb data.
Non-Response Bias: Simple random sampling means that each case in the population has an equal chance of being included and there is no implied connection between the cases in the sample. However, even when people are picked at random, caution must be exercised if the non-response is high. If only 30% of the people randomly sampled for a survey respond, then it is uncertain whether the results are representative of the entire population. This non-response bias can skew results. Because we do not know how many professional critics or authenticated users provide feedback on only certain movies, non-response bias may exist.
Even with these types of bias, this analysis assumes the data to be randomly sampled. * * *
The project objective for this section:
Develop a research question that you want to answer using these data and a multiple linear regression model. You should phrase your research question in a way that matches up with the scope of inference your dataset allows for. Include a brief discussion why this question is of interest to you and/or your audience.
I’m 21 years old and in a transition period from college to the corporate world. Because I meet so many different kinds of personalities, I find it intriguing to ask why some movies are liked only by college students and not by the wider public, and what it takes to make a movie popular with both segments of the population, since these two segments contribute a major share of box-office business.
Therefore my research question is:
What does it take to make a movie popular?
The project objective for this section:
Perform exploratory data analysis (EDA) that addresses the research question. The EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
The audience_score is the response variable for my analysis; it is the variable we will predict. Why did I choose audience_score instead of imdb_rating? No specific reason: it was a coin toss. The two variables are highly correlated, so picking one over the other is essentially arbitrary on my part. Here is the correlation calculation:
cor(movies$imdb_rating, movies$audience_score)
## [1] 0.8648652
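A quick scatter plot makes the strength of this relationship visible (a minimal sketch using the full movies data and the already-loaded ggplot2):
#scatter plot of the two candidate response variables with a fitted line
ggplot(movies, aes(x = imdb_rating, y = audience_score)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("IMDb Rating") + ylab("Audience Score") +
  ggtitle("IMDb Rating vs. Audience Score")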
Not all of the variables are relevant to my research question. I have limited the movies data set to movies released after 1999, because people are likely to hold a certain level of prejudice against older movies. I also remove the release-schedule dates, the individual actor variables, and the IMDb and Rotten Tomatoes URLs.
movies_filtered <- filter(movies, thtr_rel_year >1999)
movies_filtered <- select(movies_filtered, c(-(thtr_rel_year:dvd_rel_day), -(actor1:rt_url)))
Next, remove the records that have NA in any field:
movies_filtered <- na.omit(movies_filtered)
There are still some variables that I do not need for my analysis. I have assumed the movie title and the title type are not relevant. The variable critics_rating is simply a categorical representation of critics_score, so it too is removed from the data set; similarly, audience_rating is a categorical representation of audience_score and is removed. I also assume that runtime is not a criterion for a movie’s popularity, and for simplicity we do not take into consideration the studio, the director, or the number of IMDb votes.
movies_filtered <- select(movies_filtered, c(-(title:title_type), -critics_rating, -audience_rating, -runtime, -studio, -director, -imdb_num_votes))
Let’s see what our reduced data looks like:
dim(movies_filtered)
## [1] 332 11
str(movies_filtered)
## Classes 'tbl_df', 'tbl' and 'data.frame': 332 obs. of 11 variables:
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 7 5 5 6 4 4 6 7 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 6 6 6 5 4 5 5 ...
## $ imdb_rating : num 5.5 7.3 5.1 7.8 7.5 6.6 7.6 5.9 7 4.4 ...
## $ critics_score : num 45 96 33 91 90 83 80 25 78 19 ...
## $ audience_score : num 73 81 27 86 89 66 89 53 64 24 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 2 1 2 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int 47 156 170 238
## ..- attr(*, "names")= chr "47" "156" "170" "238"
summary(movies_filtered)
## genre mpaa_rating imdb_rating critics_score
## Drama :146 G : 6 Min. :1.900 Min. : 1.00
## Comedy : 43 NC-17 : 0 1st Qu.:5.900 1st Qu.: 33.00
## Documentary : 42 PG : 39 Median :6.600 Median : 60.00
## Action & Adventure: 33 PG-13 : 85 Mean :6.511 Mean : 56.72
## Mystery & Suspense: 29 R :158 3rd Qu.:7.400 3rd Qu.: 80.00
## Horror : 11 Unrated: 44 Max. :8.500 Max. :100.00
## (Other) : 28
## audience_score best_pic_nom best_pic_win best_actor_win best_actress_win
## Min. :11.00 no :324 no :330 no :289 no :297
## 1st Qu.:47.00 yes: 8 yes: 2 yes: 43 yes: 35
## Median :65.00
## Mean :62.27
## 3rd Qu.:79.00
## Max. :96.00
##
## best_dir_win top200_box
## no :318 no :327
## yes: 14 yes: 5
##
##
##
##
##
Take note of the distribution of the last six variables. The number of Yes responses relative to No responses is highly unbalanced. To attenuate the potential impact of this imbalance, I will combine these variables into one: if any of the following factor variables is yes, then a new variable called best_any will be Yes. This consolidates the six variables into one.
1. best_pic_nom
2. best_pic_win
3. best_actor_win
4. best_actress_win
5. best_dir_win
6. top200_box
#Create a new variable: best_any = 1 ("Yes") if any of the six best*/top200 variables is "yes"
movies_filtered <- mutate(movies_filtered, best_any=ifelse(best_pic_nom=="yes", 1,
ifelse(best_pic_win=="yes", 1,
ifelse(best_actor_win=="yes", 1,
ifelse(best_actress_win=="yes", 1,
ifelse(best_dir_win=="yes", 1,
ifelse(top200_box=="yes", 1, 0)))))))
#Make best_any into a factor
movies_filtered$best_any <- factor(movies_filtered$best_any, labels = c("No", "Yes"))
#Remove the 6 variables that have been consolidated into best_any
movies_filtered <- select(movies_filtered, -c(best_pic_nom:top200_box))
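As a side note, the nested ifelse() chain above can be written more compactly with dplyr’s if_any() (available in dplyr 1.0.4 and later). The sketch below is an equivalent construction; it would replace the mutate step above and must be run before the six source columns are dropped:
#sketch (assumes dplyr >= 1.0.4): equivalent, more concise construction of best_any
movies_filtered <- movies_filtered %>%
  mutate(best_any = factor(
    ifelse(if_any(best_pic_nom:top200_box, ~ .x == "yes"), "Yes", "No"),
    levels = c("No", "Yes")
  ))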
We have made many data transformations. Let’s review the current state of the data:
dim(movies_filtered)
## [1] 332 6
str(movies_filtered)
## Classes 'tbl_df', 'tbl' and 'data.frame': 332 obs. of 6 variables:
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 7 5 5 6 4 4 6 7 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 6 6 6 5 4 5 5 ...
## $ imdb_rating : num 5.5 7.3 5.1 7.8 7.5 6.6 7.6 5.9 7 4.4 ...
## $ critics_score : num 45 96 33 91 90 83 80 25 78 19 ...
## $ audience_score: num 73 81 27 86 89 66 89 53 64 24 ...
## $ best_any : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 2 1 ...
summary(movies_filtered)
## genre mpaa_rating imdb_rating critics_score
## Drama :146 G : 6 Min. :1.900 Min. : 1.00
## Comedy : 43 NC-17 : 0 1st Qu.:5.900 1st Qu.: 33.00
## Documentary : 42 PG : 39 Median :6.600 Median : 60.00
## Action & Adventure: 33 PG-13 : 85 Mean :6.511 Mean : 56.72
## Mystery & Suspense: 29 R :158 3rd Qu.:7.400 3rd Qu.: 80.00
## Horror : 11 Unrated: 44 Max. :8.500 Max. :100.00
## (Other) : 28
## audience_score best_any
## Min. :11.00 No :249
## 1st Qu.:47.00 Yes: 83
## Median :65.00
## Mean :62.27
## 3rd Qu.:79.00
## Max. :96.00
##
Let’s focus for a moment on audience_score. Since it is the variable of interest, let’s consider its distribution. For the linear regression later on we will want nearly normal residuals, so it is worth checking the shape of the response now.
summary(movies_filtered$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 47.00 65.00 62.27 79.00 96.00
The mean and median are quite close, suggesting the distribution may be roughly symmetric. Let’s check with a histogram:
ggplot(movies_filtered, aes(x = audience_score)) + geom_histogram(binwidth = 10) + xlab("Audience Score") + ylab("Count") + ggtitle("Histogram of Audience Score")
The histogram appears slightly left skewed.
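As a numerical complement to the histogram, the describe() function from the already-loaded psych package reports a skew statistic; a negative value would be consistent with the slight left skew visible in the plot. A minimal sketch:
#uses the psych package; a negative skew value indicates left skew
describe(movies_filtered$audience_score)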
Our reduced data set now contains only the variables to be used in the model, so it is easy to plot them all at once. Because the base R pairs plot does not display very useful information when many categorical factors are present, I opted for the modified pairs plot available in the psych package. (The psych package’s functions are aimed primarily at multivariate analysis and scale construction using factor analysis, principal component analysis, cluster analysis, and reliability analysis, although others provide basic descriptive statistics.)
#uses the psych package
pairs.panels(movies_filtered)
The matrix plot above lets us visualize the relationships among all variables in a single image.
The oval shapes in the scatter plots are correlation ellipses. They visualize how strongly the variables are correlated: the dot at the center of each ellipse marks the mean of the x-axis and y-axis variables, and the shape of the ellipse reflects the strength of the correlation, with an almost perfectly round ellipse indicating a very weak correlation.
The curve drawn on each scatter plot is a loess smooth. It indicates the general relationship between the x-axis and y-axis variables and can reveal nonlinear trends, something that correlations alone cannot show.
Finally, check for correlation among the variables. This step is important for understanding the relationship between the response and the explanatory variables and for spotting collinearity: ideally, the explanatory variables should not be strongly correlated with one another. Other than the correlations among critics_score, imdb_rating, and audience_score, there do not appear to be any notably strong relationships.
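To back up the visual impression with numbers, we can compute the correlation matrix of the numeric variables kept in movies_filtered (a minimal sketch):
#correlation matrix of the numeric variables in the filtered data
movies_filtered %>%
  select(imdb_rating, critics_score, audience_score) %>%
  cor()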
The project objective for this section:
Develop a multiple linear regression model to predict a numerical variable in the dataset. The response variable and the explanatory variables can be existing variables in the dataset, or new variables you create based on existing variables.
To get started, we will build a linear model using all the remaining variables:
MLR <- lm(audience_score ~., data=movies_filtered)
summary(MLR)
##
## Call:
## lm(formula = audience_score ~ ., data = movies_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.748 -6.464 -0.164 6.113 47.035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.41404 6.59048 -4.615 5.74e-06 ***
## genreAnimation 13.68926 5.28965 2.588 0.0101 *
## genreArt House & International 3.50964 4.19862 0.836 0.4038
## genreComedy 0.73026 2.34967 0.311 0.7562
## genreDocumentary 4.80759 2.97200 1.618 0.1067
## genreDrama 2.08921 2.11670 0.987 0.3244
## genreHorror -4.95782 3.61468 -1.372 0.1712
## genreMusical & Performing Arts 5.65942 4.76162 1.189 0.2355
## genreMystery & Suspense -6.70642 2.66753 -2.514 0.0124 *
## genreOther 3.98383 5.53091 0.720 0.4719
## genreScience Fiction & Fantasy -0.44660 6.15832 -0.073 0.9422
## mpaa_ratingPG 3.72239 5.41773 0.687 0.4925
## mpaa_ratingPG-13 3.94831 5.55458 0.711 0.4777
## mpaa_ratingR 4.12884 5.50635 0.750 0.4539
## mpaa_ratingUnrated 4.51872 5.54627 0.815 0.4158
## imdb_rating 12.92552 0.80344 16.088 < 2e-16 ***
## critics_score 0.07024 0.03190 2.202 0.0284 *
## best_anyYes -3.43430 1.33405 -2.574 0.0105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.08 on 314 degrees of freedom
## Multiple R-squared: 0.7605, Adjusted R-squared: 0.7475
## F-statistic: 58.64 on 17 and 314 DF, p-value: < 2.2e-16
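To make the coefficient table above easier to scan, we can pull out the terms whose p-values fall below 0.05 (a quick sketch; column 4 of the coefficient matrix holds Pr(>|t|)):
#extract the coefficients of the full model with p-values below 0.05
coefs <- summary(MLR)$coefficients
coefs[coefs[, 4] < 0.05, ]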
From the model output we can make some interesting observations: imdb_rating is by far the strongest predictor, and critics_score, best_any, and a couple of genre levels are also significant at the 0.05 level, while none of the mpaa_rating coefficients is.
We will now proceed with variable selection: to select the best model for answering our research question, we will perform backward elimination, checking the adjusted R^2 after each removal.
All of the mpaa_rating levels have non-significant p-values, so mpaa_rating is the first candidate for removal. Let’s remove it now:
MLR1 <- lm(audience_score ~. -mpaa_rating, data=movies_filtered)
summary(MLR1)
##
## Call:
## lm(formula = audience_score ~ . - mpaa_rating, data = movies_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.053 -6.416 -0.134 6.003 47.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26.72159 4.20755 -6.351 7.39e-10 ***
## genreAnimation 11.32763 4.19310 2.701 0.00727 **
## genreArt House & International 3.78126 4.04043 0.936 0.35006
## genreComedy 0.77348 2.32514 0.333 0.73961
## genreDocumentary 4.82248 2.71954 1.773 0.07714 .
## genreDrama 2.14130 2.07552 1.032 0.30300
## genreHorror -4.72971 3.50711 -1.349 0.17842
## genreMusical & Performing Arts 5.68775 4.70120 1.210 0.22723
## genreMystery & Suspense -6.62765 2.59714 -2.552 0.01118 *
## genreOther 4.14691 5.44963 0.761 0.44725
## genreScience Fiction & Fantasy -0.53185 6.12393 -0.087 0.93085
## imdb_rating 12.96501 0.79152 16.380 < 2e-16 ***
## critics_score 0.07136 0.03119 2.288 0.02280 *
## best_anyYes -3.45232 1.31976 -2.616 0.00932 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 318 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7501
## F-statistic: 77.44 on 13 and 318 DF, p-value: < 2.2e-16
Removing mpaa_rating from our model raises the adjusted R^2 to 0.7501 (from 0.7475). Removing further variables one at a time and re-checking the adjusted R^2 does not lead to any improvement, so this is the highest performing model and will serve as the model for our predictions below.
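The claim that no further single-variable removal improves the fit can be verified programmatically. The sketch below (hypothetical helper code, not part of the original analysis) refits the model dropping each remaining predictor in turn and compares adjusted R^2 values against MLR1:
#adjusted R^2 of MLR1 and of each model that drops one additional predictor
remaining <- c("genre", "imdb_rating", "critics_score", "best_any")
adj_r2 <- sapply(remaining, function(v) {
  f <- as.formula(paste("audience_score ~ . - mpaa_rating -", v))
  summary(lm(f, data = movies_filtered))$adj.r.squared
})
round(c(MLR1 = summary(MLR1)$adj.r.squared, adj_r2), 4)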
Multiple linear regression relies on several conditions that we should evaluate: (1) each numerical explanatory variable should be linearly related to the response, (2) the residuals should be nearly normal, (3) the residuals should have constant variability, and (4) the observations should be independent.
Linearity: we evaluate this condition by looking at the relationship between each numerical explanatory variable and the response using scatter plots of y vs. each x, and by plotting the residuals vs. each x. The scatter plots were provided in the pairs plot above; the residual plots are provided below.
plot(MLR1$residuals ~ movies_filtered$critics_score, main = "Residuals vs.critics_score")
abline(h=0)
plot(MLR1$residuals ~ movies_filtered$imdb_rating, main = "Residuals vs. imdb_rating")
abline(h=0)
Because the residuals appear randomly scattered around 0 with no obvious pattern in either plot, we can consider the linearity condition satisfied.
Nearly normal residuals: the histogram of the residuals below appears nearly normal and centered around 0, and the normal probability plot beneath it shows no notable deviations from the reference line. These plots support the condition.
hist(MLR1$residuals, main = "Histogram of Residuals")
qqnorm(MLR1$residuals, main = "Normal Probability Plot of Residuals")
qqline(MLR1$residuals)
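As an optional numerical complement to these visual checks, a Shapiro-Wilk test could be run on the residuals (a sketch; note that with 332 observations the test flags even minor departures from normality):
#formal normality test of the residuals (very sensitive for large samples)
shapiro.test(MLR1$residuals)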
Constant variability of residuals: we evaluate homoscedasticity using the plots below:
plot(MLR1$residuals ~ MLR1$fitted.values, main = "Residuals vs. Fitted")
abline(h=0)
plot(abs(MLR1$residuals) ~ MLR1$fitted.values, main = "Absolute Value of Residuals vs. Fitted")
abline(h=0)
We don’t see a fan shape here. The variability of the residuals appears to stay constant as the fitted (predicted) values change, so the constant variability condition appears to be met.
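As an optional numerical check of constant variance (a sketch that assumes the lmtest package is installed; it is not used elsewhere in this analysis), the Breusch-Pagan test could be applied to the fitted model:
#Breusch-Pagan test for non-constant variance; requires the lmtest package
library(lmtest)
bptest(MLR1)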
Independent observations: the sampling of the data was discussed at the beginning of this analysis, where we concluded that the data can be treated as a random sample and the results generalized. * * *
The project objective for this section:
Pick a movie from 2016 (a new movie that is not in the sample) and do a prediction for this movie using your the model you developed and the predict function in R. Also quantify the uncertainty around this prediction using an appropriate interval.
The movie whose audience score we will try to predict is Black Panther. Using data from IMDb and Rotten Tomatoes, a data frame is created:
black_panther <- data.frame(genre = "Science Fiction & Fantasy",mpaa_rating = "PG-13", imdb_rating = 7.4, critics_score = 97, best_any = "Yes")
The genre and imdb_rating values come from the IMDb website, while critics_score and the actual audience score we will compare against come from the Rotten Tomatoes website.
Let’s use our regression model to predict the audience_score:
bp_score <- predict(MLR1, newdata = black_panther, interval = "prediction")
bp_score
## fit lwr upr
## 1 72.15686 49.31919 94.99453
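For comparison, replacing interval = "prediction" with interval = "confidence" gives the much narrower interval for the mean audience score of movies with this profile, rather than for an individual movie; a minimal sketch:
#confidence interval for the mean response, for comparison with the prediction interval above
predict(MLR1, newdata = black_panther, interval = "confidence")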
Let’s evaluate the output from the Black Panther prediction:
The predicted audience_score is about 72.2, and the 95% prediction interval is roughly 49.3 to 95.0. Note: the actual audience_score for Black Panther is 79%.
With this information we can say we are 95% confident that the actual audience score for a movie with Black Panther’s characteristics lies between about 49.3 and 95.0. The interval comfortably contains the actual score of 79, and the point prediction of about 72.2 is quite close to it. Some underprediction was expected: we did not take the studio into consideration, and Black Panther is a big-studio movie with well-known names attached, so the audience score predicted by our model was likely to undervalue the actual rating. * * *
The project objective for this section:
A brief summary of your findings from the previous sections without repeating your statements from earlier as well as a discussion of what you have learned about the data and your research question. You should also discuss any shortcomings of your current study (either due to data collection or methodology) and include ideas for possible future research.
We began this analysis with 32 variables, which seemed to promise a robust model for reliably predicting audience_score. However, as the data were examined and the variables evaluated and modified to meet the needs of the research question, our final data set was reduced to one response variable and five explanatory variables (four of which remained in the chosen model). Model selection determined that the best performing model had an adjusted R^2 of about 0.75.
The model was accurate enough to predict the score of a 2018 movie: the actual audience score fell within the 95% prediction interval. The model explains about 75% of the variance in audience_score, which I consider a relatively high-performing linear regression model given the basic tools used to construct it. I am happy with the results.
In the future we might consider:
Perhaps not all of the relationships between the explanatory variables and the response are linear; a polynomial or other non-linear regression might yield a higher-performing predictive model.
What other data could we add to the model to improve its accuracy?
We reduced the data set record count from 651 observations to 332. Perhaps a larger set of data would help improve the model.
Testing some variable transformations may help improve the model.
Including studio, director, and cast names in the model might also improve the predictions, at the cost of increased complexity.