Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

Load data

load("movies.Rdata")

Part 1: Data

The purpose of this observational study is to define which variables best predict whether a movie is going to be popular so the results can be further applied to other cases therefore arrangements can be done before footage phase.

We cannot say it is a causal study since there is no control of external variables.

The data set contains 651 movie samples, randomly sampled, produced and released before 2016.


Data munging (or Data wrangling)

# Removing unused variables
movies = subset(movies, select = -c(actor1, actor2, actor3, actor4, actor5, imdb_url, rt_url, mpaa_rating, thtr_rel_year, thtr_rel_month, thtr_rel_day, dvd_rel_year, dvd_rel_month, dvd_rel_day, best_pic_nom, best_pic_win))

# Removing rows with NA
#movies = movies[complete.cases(movies),]
movies = na.omit(movies)

# Changing the scale of imdb rating so it has closer values to the ones in columns audience_score and critics_score
movies$imdb_rating = movies$imdb_rating * 10

# New column to formalize the values in columns: imdb_rating, audience_score and critics_score
movies['global_score'] = round((movies$audience_score + movies$critics_score + movies$imdb_rating) / 3, 2)

The dataset presented some NA values in some columns and some variables that would not help in any way so these rows were removed since they would not affect our results.

The variable imdb_rating had a precision of 3 which was changed to 4 since all its values were multiplied by 10 so it woulb have the same precision as the variables critics_score and audience_score. Then these variables were merged into a single one called global_score which is the arithmetic mean.


Used variables

Variable Type Description
critics_score Interval Critics score on Rotten Tomatoes
audience_score Interval Audience score on Rotten Tomatoes
imdb_rating Interval Rating on IMDB
global_score Interval Aritimetical mean of the three above variables
studio Nominal Studio that produced the movie
best_dir_win Dichotomous Whether or not the director of the movie ever won an Oscar (no, yes)
best_actor_win Dichotomous Whether or not one of the main actors in the movie ever won an Oscar (no, yes)

Note: You might be wondering why the types used were interval instead of continuous and dichotomous instead of categorical in the table above. It is worth an explanation for clarification purposes.

A categorical variable can be categorized as nominal, ordinal or dichotomous.
  1. Nominal: Two or more categories but there is no order. E.g., Skin color;
  2. Ordinal: Two or more categories but there is an order (rank). E.g., Shirt sizes;
  3. Dichotomous: Only two categories and there is no order. E.g., Respondent gender;
A continuous (quantitative) variable can be categorizes as interval or ratio.
  1. Interval: Equal intervals between between measures. E.g., Temperatures. The difference between 9 and 10 degrees is 1. And the difference between 20 and 21 is also 1;
  2. Ratio: Is an interval variable but with the condition that a value 0 represents the absence of that variable. E.g., Temperature in Kelvin;

Part 2: Research questions

1 - Is there an association between the fact that a director had ever won an Oscar and a movie’s popularity?

If a director had ever won an Oscar it means their movies tend to be better hence their movies have higher chances to be popular, right? We will find out.


3 - Does the fact that the main character of a movie had ever won an Oscar increase a movie’s popularity?

This question would help directors to better think about actors/actresses prerequisites to invite them to be a character of their movies.


Part 3: Exploratory data analysis [ EDA ]

EDA - Movies global score

Interquartile Range [ IQR ]
summary(movies$global_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.67   48.33   62.67   61.74   77.75   94.67
Interpretation

There is not outlier in this dataset regarding global_score variable.
The minimum score is 12.67 and the maximum is 94.97.
The average score is 61.86 and the value that cuts the scores in half is 63.

75% of the scores are equal or less than 77.75
50% of the scores are between 48.67 and 77.75
25% of the scores are less than 48.67

Boxplot
ggplot(aes(x = '', y = global_score), data = movies) +
    geom_boxplot(aes(fill = '#FF4545')) +
    guides(fill = F) +
    ylab('Score') + 
    ggtitle('Movies global score') +
    scale_y_continuous(breaks = seq(0, max(movies$global_score), 10)) +
    stat_summary(fun.y = mean, geom = 'point', shape = 4, size = 2) +
    theme_minimal() +
    coord_flip()

Interpretation

The colorful rectangle represents 50% of the data.
The horizontal lines before and after the rectangles represents 25% of the data each.
The middle vertical line represents the median.
The x close to the middle lines represents the mean.


EDA - Directors had ever won an Oscar

ggplot(aes(x = best_dir_win, fill = best_dir_win), data = movies) +
    geom_bar() +
    geom_text(stat = 'count', aes(label = ..count..), vjust = -1) +
    scale_fill_manual(values = c('#FF4545', '#F5CD6D')) +
    guides(fill = guide_legend(title = 'Had ever won an Oscar')) +
    scale_y_continuous(limits = c(0, length(which(movies$best_dir_win == 'no')) + 20)) +
    xlab('') +
    ylab('Directors') +
    ggtitle('Directors had ever won an Oscar') +
    theme_minimal()

Interpretation

From 640 directors, 43 had ever won an Oscar. That is only about 6.8% of the whole sample.


Directors had ever won an Oscar x Movies global score

Interquartile Range [ IQR ]
by(movies$global_score, movies$best_dir_win, summary)
## movies$best_dir_win: no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.67   47.67   61.67   61.10   77.00   92.33 
## -------------------------------------------------------- 
## movies$best_dir_win: yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.67   60.33   73.67   70.67   83.00   94.67

The range between the groups of directors who had ever won an Oscar and those who have not are different.

Director had ever won an Oscar Range Difference
No 12.67 to 92.33 79.66
Yes 30.67 to 94.67 64

The global_score range is smaller for the group of directors who have ever won an Oscar.

Boxplot

ggplot(aes(y = global_score, x = best_dir_win, fill = best_dir_win), data = movies) +
    geom_boxplot() +
    guides(fill = F) +
    xlab('Director had ever won an Oscar') +
    ylab('Global score') +
    ggtitle('Directors had ever won an Oscar x Movies global score') +
    scale_y_continuous(breaks = seq(floor(min(movies$global_score)), max(movies$global_score), 10)) +
    stat_summary(fun.y = mean, geom = 'point', shape = 4, size = 2) +
    theme_minimal() +
    coord_flip()

Interpretation

The colorful rectangle represents 50% of the data.
The horizontal lines before and after the rectangles represents 25% of the data each.
The middle vertical line represents the median.
The x close to the middle lines represents the mean.

As median line shows in the blue box most directors who have ever won an Oscar produced a movie that the global_score is above 73 points whereas the directors in the red box group is more equally distributed.


Studio x Movies global score

ggplot(aes(x = studio, y = global_score), data = movies) +
    geom_jitter() +
    xlab('Studios') +
    ylab('Global score') +
    ggtitle('Studio x Movies global score') +
    scale_y_continuous(breaks = seq(0, max(movies$global_score), 10)) +
    theme(axis.text.x = element_blank())

Interpretation

Scatterplot was chosen since there are 210 different studios and it would be too many to display with labels. So the purpouse of this graph is to provide a better look on how the studios where movies were produced and global scores are distributed.

Studios with maximum and minimum global_score

Maximum global_score

max_global_score_movie = movies[which.max(movies$global_score), c('studio', 'title', 'genre', 'best_actor_win', 'best_dir_win', 'director', 'global_score')]

knitr::kable(max_global_score_movie)
studio title genre best_actor_win best_dir_win director global_score
Paramount Pictures The Godfather, Part II Mystery & Suspense yes yes Francis Ford Coppola 94.67

Minimum global_score

min_global_score_movie = movies[which.min(movies$global_score), c('studio', 'title', 'genre', 'best_actor_win', 'best_dir_win', 'director', 'global_score')]
knitr::kable(min_global_score_movie)
studio title genre best_actor_win best_dir_win director global_score
Warner Bros. Pictures Battlefield Earth Action & Adventure yes no Roger Christian 12.67
Studios that produced more and less movies

More movies

unique(subset(movies, studio == tail(names(sort(table(movies$studio))), 1))['studio'])
## # A tibble: 1 × 1
##               studio
##               <fctr>
## 1 Paramount Pictures
nrow(subset(movies, studio == tail(names(sort(table(movies$studio))), 1)))
## [1] 37

Paramount Pictures is the studio that produced more movies than any other in the sample. 37 to be more specific. That is about 5.85% of the whole sample.

Interquartile Range [ IQR ] - Global score

summary(subset(movies, studio == tail(names(sort(table(movies$studio))), 1))['global_score'])
##   global_score  
##  Min.   :33.00  
##  1st Qu.:48.67  
##  Median :63.33  
##  Mean   :62.96  
##  3rd Qu.:75.67  
##  Max.   :94.67

Interquartile Range [ IQR ] - Genre

summary(subset(movies, studio == tail(names(sort(table(movies$studio))), 1))['genre'])
##                 genre   
##  Drama             :19  
##  Comedy            : 7  
##  Action & Adventure: 5  
##  Other             : 3  
##  Mystery & Suspense: 2  
##  Animation         : 1  
##  (Other)           : 0

Less movies

studio_ftable = data.frame(count = sort(table(movies$studio), decreasing = T))
studio_one_freq = subset(studio_ftable, count.Freq == 1)

There are 132 studios that produced only one movie.


Main character had ever won an Oscar

ggplot(aes(x = best_actor_win, fill = best_actor_win), data = movies) +
    geom_bar() +
    geom_text(stat = 'count', aes(label = ..count..), vjust = -1) +
    scale_fill_manual(values = c('#FF4545', '#F5CD6D')) +
    guides(fill = guide_legend(title = 'Had ever won an Oscar')) +
    scale_y_continuous(limits = c(0, length(which(movies$best_actor_win == 'no')) + 20)) +
    xlab('') +
    ylab('Actors') +
    ggtitle('Main character had ever won an Oscar') +
    theme_minimal()

Interpretation

From 632 actors, 92 had ever won an Oscar. That is only about 5% of the whole sample.


Main character had ever won an Oscar x Movies global score

by(movies$global_score, movies$best_actor_win, summary)
## movies$best_actor_win: no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   47.92   61.67   61.43   78.00   92.33 
## -------------------------------------------------------- 
## movies$best_actor_win: yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.67   51.58   65.34   63.56   77.67   94.67

The range between the groups of directors who have ever won an Oscar and those who have not are different.

Actor had ever won an Oscar Range Difference
No 13.00 to 92.33 79.33
Yes 12.67 to 94.67 82

The global_score range is smaller for the group of actors who never won an Oscar.

Boxplot

ggplot(aes(y = global_score, x = best_actor_win, fill = best_actor_win), data = movies) +
    geom_boxplot() +
    guides(fill = F) +
    xlab('Actor had ever won an Oscar') +
    ylab('Global score') +
    ggtitle('Actors have never won an Oscar x Movies global score') +
    scale_y_continuous(breaks = seq(floor(min(movies$global_score)), max(movies$global_score), 10)) +
    stat_summary(fun.y = mean, geom = 'point', shape = 4, size = 2) +
    theme_minimal() +
    coord_flip()

Interpretation

The colorful rectangle represents 50% of the data.
The horizontal lines before and after the rectangles represents 25% of the data each.
The middle vertical line represents the median.
The x close to the middle lines represents the mean.

Despite the larger range where the main actor/actress have ever won an Oscar its middle 50% score range is smaller.


Part 4: Modeling

Approach

Stepwise model selection

The approach chosen in this step was the forward selection due to the fact that there are too many variables and we have structured data. I.e., columns with labels. So it was possible to choose some variables by meaning.

Choosing predictor variables: Adjusted R² vs P-value

We use the adjusted R² when we want more reliable predictions and the p-value when we want to know which predictors are more significant.
Since we want to predict wheather a movie is popular we chose the adjusted R² approach.

Selecting variables to the model

Since no good results were achieved new variables were selected in order to improve our model reliability.

actor_oscar_model = lm(global_score ~ best_actor_win, data = movies)
dir_oscar_model = lm(global_score ~ best_dir_win, data = movies)
imdb_votes_model = lm(global_score ~ imdb_num_votes, data = movies)
dir_model = lm(global_score ~ director, data = movies)
studio_model = lm(global_score ~ studio, data = movies)
Variable Adjusted R²
best_dir_win 1.035540610^{-4}
best_actor_win 0.0157031
studio 0.3993372
imdb_num_votes 0.0783605
director 0.1366016

Note: The adjusted R² is the adjusted value of the square of the correlation coefficient. The is the percentage of variability in Y (response variable) explained by X (explanatory or explanatory variable. The model).

Forward stepwise selection

The first variable chosen was the one with higher adjusted R² which is studio. Then it was combined with the second in the rank director.

stu_dir_model = lm(global_score ~ studio + director, data = movies)
summary(stu_dir_model)$adj.r.squared
## [1] 0.4017996
# Previously calculated adjusted R² from linear model using the director variable as predictor;
summary(dir_model)$adj.r.squared
## [1] 0.3993372

There was no significant difference right? The first value is what we had before adding the variable director as a predictor and the second one is the result after the variable was included. Lets try using the next variable with higher adjusted R². It is imdb_num_votes.

stu_imdb_votes_model = lm(global_score ~ imdb_num_votes, data = movies)
summary(stu_imdb_votes_model)$adj.r.squared
## [1] 0.07836049

That should not even be considered. This value is really far from what we would call a ‘good model’. However let’s combine all the three variables we tryied before. studio + director + imdb_num_votes.

stu_dir_imdb_votes_model = lm(global_score ~ studio + director + imdb_num_votes, data = movies)
summary(stu_dir_imdb_votes_model)$adj.r.squared
## [1] 0.4971885

That is something. Let’s include now the other remaining variable in consideration best_actor_win to see if we can get something better than that.

stu_dir_imdb_votes_best_act_model = lm(global_score ~ studio + director + imdb_num_votes + best_actor_win, data = movies)
summary(stu_dir_imdb_votes_best_act_model)$adj.r.squared
## [1] 0.4855043

It seems like this variable would not help much. So lets remove this variable from our model and stick to the previous model with higher adjusted R² stu_dir_imdb_votes_model.

Checking conditions

Since we only have one quantitative variable which is imdb_num_votes this is the one to be considered.

1. Outliers: The present outliers are not due to incorrectly entries or wrong measurement. Those are legitimate observations.

2. Linear relationship: The response variable has a linear relationship with the predictor variable X.

ggplot(aes(x = imdb_num_votes, y = global_score), data = movies) +
  geom_jitter() +
  geom_smooth(method = 'lm', se = F)

3. Homoscedasticity: The residuals do not have a constant variability. Instead it is a fan shape. I.g., the higher the value of the predictor variable X, the larger the residuals variability.

ggplot(aes(x = imdb_num_votes, y = global_score), data = movies) +
  geom_jitter() +
  geom_smooth(method = 'lm')

4. No autocorrelation: The data shows independence of observations. I.g., one sample do not affect other.

5. Multicollinearity: The predictor variables are not correlated.
The test was made using Pearson’s Chi-square test.
\(H_0:\) The variables are independent;
\(H_a:\) The variables are dependent;

Test: imdb_num_votes vs director

chisq.test(movies$imdb_num_votes, movies$director)
## Warning in chisq.test(movies$imdb_num_votes, movies$director): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  movies$imdb_num_votes and movies$director
## X-squared = 331410, df = 330540, p-value = 0.1403

As p-value is greater than 0.05 we do not reject the null hypothesis \(H_0\). There is no relationship between these two variables.

Test: imdb_num_votes vs studio

chisq.test(movies$imdb_num_votes, movies$studio)
## Warning in chisq.test(movies$imdb_num_votes, movies$studio): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  movies$imdb_num_votes and movies$studio
## X-squared = 132460, df = 132090, p-value = 0.2374

As p-value is greater than 0.05 we do not reject the null hypothesis \(H_0\). There is no relationship between these two variables.

Test: director vs studio

chisq.test(movies$director, movies$studio)
## Warning in chisq.test(movies$director, movies$studio): Chi-squared
## approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  movies$director and movies$studio
## X-squared = 117990, df = 109310, p-value < 2.2e-16

As p-value is smaller than 0.05 we do reject the null hypothesis \(H_0\). There is a relationship between these two variables.

6. Residuals normal distribution:

imdb_num_votes_model = lm(global_score ~ imdb_num_votes, data = movies)
ggplot(aes(x = imdb_num_votes_model$residuals), data = movies) +
  geom_histogram(aes(y = ..density..), binwidth = 4, color = 'black', fill = '#44679F') +
  geom_density(alpha = .3, fill = '#DDF5F7') +
  xlab('Residuals') +
  ggtitle('Density histogram of IMDB number of votes model residuals') +
  theme_minimal()

shapiro.test(imdb_num_votes_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  imdb_num_votes_model$residuals
## W = 0.97776, p-value = 2.784e-08

Since the Shapiro-Wilk normality test p-value and the density residuals histogram of the model shows the residuals are not normally distributed so that makes our model not so reliable.

Note: The null hypothesis \(H_0\) for the Shapiro-Wilk normality test is ‘Data are normally distributed’. Whereas the alternative hypothesis \(H_a\) is ‘Data are not normally distributed’.

The formalized model is:

Formula: \(\hat{y} = intercept + slope \times variable\)

We have too many categorical variables and since we can either add all categories or neither of them it would be too much variables to write in a formula. So the summarized one would be like the following:
\(\hat{globalScore} =\) 10.6230058 \(+ slopeVar1 * var1 + slopeVar2 * var2 + ... + slopeVarN * varN\)


Part 5: Prediction

It is time to test our model with new movies and predict their global_score.

Title Studio Director Imdb num of votes
Movie One ‘Sony Pictures Home Entertainment’ ‘Joe Dante’ 60519
Movie Two ‘Miramax Films’ ‘Dennis Dugan’ 650496
Movie Three ‘Concorde/New Horizons Home Video’ ‘John Whitesell’ 396703
Movie Four ‘Warner Bros. Pictures’ ‘Patrick Tatopoulos’ 793684
Movie Five ‘Paramount Pictures’ ‘Jack Nicholson’ 462861
summary(movies$imdb_num_votes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     180    4848   15450   58200   58950  893000
movieOne = data.frame(title = 'Movie One', studio = 'Trimark Pictures', director = 'Joe Dante', imdb_num_votes = 60519)
predict(stu_dir_imdb_votes_model, movieOne)
## Warning in predict.lm(stu_dir_imdb_votes_model, movieOne): prediction from
## a rank-deficient fit may be misleading
##        1 
## 110.9774

We got this the warning prediction from a rank-deficient fit may be misleading because of the association between the variables studio and director we saw earlier.

Then we tried removing the studio from the model since it yields a lower adjusted R² than the variable director.

# Instances
movieOne = data.frame(title = 'Top Secret!', director = 'David Zucker', imdb_num_votes = 44741)
movieTwo = data.frame(title = 'Gaza Strip', director = 'James Longley', imdb_num_votes = 285)
movieThree = data.frame(title = 'Night and the City', director = 'Irwin Winkler', imdb_num_votes = 3673)
movieFour = data.frame(title = 'Orwell Rolls in His Grave', director = 'Robert Kane Pappas', imdb_num_votes = 1058)
movieFive = data.frame(title = 'Sunshine Cleaning', director = 'Christine Jeffs', imdb_num_votes = 60220)

# Model
imdb_votes_dir_model = lm(global_score ~ imdb_num_votes + director, data = movies)

# Predictions
predict(imdb_votes_dir_model, movieOne)
##  1 
## 76
predict(imdb_votes_dir_model, movieTwo)
##  1 
## 80
predict(imdb_votes_dir_model, movieThree)
##      1 
## 48.671
predict(imdb_votes_dir_model, movieFour)
##  1 
## 84
predict(imdb_votes_dir_model, movieFive)
##     1 
## 67.67

Results

We are 95% confident about the predicted global_scores:

Title Observed value Predicted value
Top Secret! 76 76
Gaza Strip 80 80
Night and the City 48.33 48.671
Orwell Rolls in His Grave 84 84
Sunshine Cleaning 67.67 67.67

Part 6: Conclusion

In order to predict whether a movie would get a good global_score the main factors to consider are the director who directed the movie and the number of votes it gets on IMDB.
The fact that the main actor had ever won an oscar would not tell much about a movie’s popularity.
The studio where the movie was produced is not a very good predictor of its popularity.


References

Data mania

Codebook

Inferential Statistics Final Project - Victor Augusto

Coursera exercises