Setup

Load packages

We also load the extra package GGally in order to use the ggpairs function.

library(statsr)
library(dplyr)
library(ggplot2)
library(GGally)

Load data

load("movies.Rdata")

Part 1: Data

The data set contains 651 randomly sampled movies, with information gathered from popular movie web resources such as Rotten Tomatoes and IMDB. Because the sample is random and comprises less than 10% of all movies released in the era covered, and less than 10% of the movies listed on either site, the observations can be treated as independent. Since the data come from these two popular movie resources, we should keep in mind that a prediction model based on them is applicable only to movies rated on those sites; it cannot be generalized to the entire population of movies without a separate study of how representative IMDB and Rotten Tomatoes are.


Part 2: Research question

With these data we evaluate which factors are associated with movie popularity. This is an observational study, so we can identify associations rather than causal effects. From a viewer’s point of view, it is interesting to find out why a certain movie becomes popular.



Our key explanatory variables are:

  1. runtime: Runtime of movie (in minutes)
  2. title_type: Type of movie (Documentary, Feature Film, TV Movie)
  3. thtr_rel_year: Year the movie was released in theaters
  4. imdb_num_votes: Number of votes on IMDB
  5. audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
  6. best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
  7. best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
  8. best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes); note that this is not necessarily for the movie in question
  9. top200_box: Whether or not the movie is in the Top 200 Box Office list (no, yes)


Our response variable is

audience_score: Audience score on Rotten Tomatoes

We choose it, rather than the other rating variables, for three reasons:

  1. It shows the highest cross-collinearity with the other scores, so it can stand in for them.
  2. No matter what the critics say, it is the audience who chooses the movies and pays for them.
  3. Its 100-point scale is convenient to interpret.
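
For reference, here is a quick structural look at these columns. This is a minimal sketch using dplyr’s glimpse; it assumes the movies data frame loaded above.

# compact overview of the selected explanatory variables and the response
movies %>%
  select(runtime, title_type, thtr_rel_year, imdb_num_votes, audience_rating,
         best_pic_nom, best_pic_win, best_dir_win, top200_box, audience_score) %>%
  glimpse()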


Part 3: Exploratory data analysis

To start with, we need to decide which score to use as the measure of a movie’s success. We have three options: the IMDB rating, the critics score on Rotten Tomatoes, and the audience score on Rotten Tomatoes. We analyze the different ratings to evaluate how well each one represents the others.

movies_rate <- movies %>%
  select(imdb_rating, critics_score, audience_score) %>%
  group_by(imdb_rating, critics_score, audience_score) %>%
  summarise(counts = n())
movies_rate
## Source: local data frame [644 x 4]
## Groups: imdb_rating, critics_score [?]
## 
##    imdb_rating critics_score audience_score counts
##          <dbl>         <dbl>          <dbl>  <int>
## 1          1.9             1             19      1
## 2          2.3             2             29      1
## 3          2.4             3             11      1
## 4          2.7            17             17      1
## 5          2.8             8             18      1
## 6          3.1             7             17      1
## 7          3.4             3             37      1
## 8          3.4            10             23      1
## 9          3.4            11             24      1
## 10         3.4            17             26      1
## # ... with 634 more rows
summary(movies_rate)
##   imdb_rating    critics_score    audience_score      counts     
##  Min.   :1.900   Min.   :  1.00   Min.   :11.00   Min.   :1.000  
##  1st Qu.:5.900   1st Qu.: 33.00   1st Qu.:46.00   1st Qu.:1.000  
##  Median :6.600   Median : 61.00   Median :65.00   Median :1.000  
##  Mean   :6.485   Mean   : 57.40   Mean   :62.23   Mean   :1.011  
##  3rd Qu.:7.300   3rd Qu.: 82.25   3rd Qu.:80.00   3rd Qu.:1.000  
##  Max.   :9.000   Max.   :100.00   Max.   :97.00   Max.   :2.000
# columns 13, 16 and 18 are imdb_rating, critics_score and audience_score
ggpairs(movies, columns = c(13, 16, 18))

From this summary table and the correlation plots, we can see very high correlation among the different scores: the IMDB rating, the Rotten Tomatoes critics score and the audience score.

These variables are collinear (correlated), and adding more than one of these variables to the model would not add much value to the model. In this application and with these highly-correlated predictors, it is reasonable to use the audience score as the single representative of these variables.
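
As a cross-check, the pairwise correlations behind the ggpairs panels can also be printed directly; this is a small sketch using the same three columns.

# correlation matrix of the three rating variables
movies %>%
  select(imdb_rating, critics_score, audience_score) %>%
  cor(use = "complete.obs")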

Next, we explore the distribution of this single variable to check how close it is to normal.

summary(movies_rate$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.23   80.00   97.00
movies_rate%>%
group_by(audience_score)%>%
  summarise(count = n())
## # A tibble: 84 × 2
##    audience_score count
##             <dbl> <int>
## 1              11     1
## 2              13     1
## 3              14     2
## 4              15     1
## 5              17     3
## 6              18     1
## 7              19     4
## 8              21     1
## 9              22     3
## 10             23     2
## # ... with 74 more rows
ggplot(movies_rate,  aes(x=audience_score))+
  geom_histogram(binwidth = 0.5)

Here we see a short summary of the audience score distribution. The distribution is roughly normal with a slight left skew.

Now we investigate the relationships between the audience score and the other variables recorded along with the movies’ ratings, looking for trends between a small set of variables and the audience rating. Note that some preliminary work has already been done: we singled out the nine variables listed above and analyse them as the most promising components of the prediction model. The remaining twenty-odd variables in the data set appeared insignificant and are not shown in this report, in order to save the reader’s time.

ggplot(data = movies, aes(x = runtime, y = audience_score)) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Run time")+
  ylab("Audience score")
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

ggplot(data = movies, aes(x = thtr_rel_year, y = audience_score)) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Year of release")+
  ylab("Audience score")

ggplot(data = movies, aes(x = audience_score, y = title_type )) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Audience score")+
  ylab("Title type")

ggplot(data = movies, aes(x = imdb_num_votes, y = audience_score)) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Numer voters in IMDB")+
  ylab("Audience score")

ggplot(data = movies, aes(x = audience_rating, y = audience_score)) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Audience rating")+
  ylab("Audience score")

ggplot(data = movies, aes(x = audience_score , y =best_pic_nom )) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Audience score")+
  ylab("Oscar nomination")

ggplot(data = movies, aes(x =  audience_score , y = best_pic_win)) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Audience score")+
  ylab("Oscar title")

ggplot(data = movies, aes(x = audience_score , y = best_dir_win)) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Audience score")+
  ylab("Director gets Oscar")

ggplot(data = movies, aes(x = audience_score, y = top200_box )) +
  geom_jitter()+
  geom_smooth(method = "lm", se = FALSE)+
  xlab("Audience score")+
  ylab("Top 200 box list")

Looking over these simple exploratory plots, we can spot several tendencies:

  1. Run time shows a positive trend with audience score.
  2. Year of theater release shows a slight negative trend with audience score.
  3. Documentary films tend to have higher scores.
  4. Movies with a larger number of IMDB votes tend to have higher scores.
  5. Movies in the Upright audience-rating category are associated with higher scores.
  6. Movies nominated for an Oscar tend to have higher scores.
  7. Movies that won an Oscar tend to have higher scores.
  8. Movies whose director has won an Oscar tend to have higher scores.
  9. Movies in the Top 200 box office list tend to have higher scores.

From this exploratory analysis we highlight these nine variables as the ones to carry forward.

Before building the multiple regression model, we first fit each candidate variable on its own and check the corresponding model diagnostics.

At this point of the research, we run simple linear models to test the hypothesis that each of these variables is a significant predictor of the audience score.

#fit model
run_score = lm(audience_score ~ runtime , data = movies)
summary(run_score)
## 
## Call:
## lm(formula = audience_score ~ runtime, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.641 -15.626   3.008  17.080  34.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.4203     4.3256   9.807  < 2e-16 ***
## runtime       0.1883     0.0402   4.684 3.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.92 on 648 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03275,    Adjusted R-squared:  0.03125 
## F-statistic: 21.94 on 1 and 648 DF,  p-value: 3.431e-06
type_score = lm(audience_score ~ title_type , data = movies)
summary(type_score)
## 
## Call:
## lm(formula = audience_score ~ title_type, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.465 -14.465   2.535  15.535  36.535 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              83.255      2.592  32.114  < 2e-16 ***
## title_typeFeature Film  -22.789      2.710  -8.408 2.65e-16 ***
## title_typeTV Movie      -26.455      8.981  -2.946  0.00334 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.23 on 648 degrees of freedom
## Multiple R-squared:  0.09889,    Adjusted R-squared:  0.09611 
## F-statistic: 35.56 on 2 and 648 DF,  p-value: 2.225e-15
year_score = lm(audience_score ~ thtr_rel_year, data = movies)
summary(year_score)
## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.250 -15.962   3.042  17.147  34.042 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   261.45858  144.30579   1.812   0.0705 .
## thtr_rel_year  -0.09965    0.07223  -1.380   0.1682  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.21 on 649 degrees of freedom
## Multiple R-squared:  0.002925,   Adjusted R-squared:  0.001388 
## F-statistic: 1.904 on 1 and 649 DF,  p-value: 0.1682
inv_score = lm(audience_score ~ imdb_num_votes, data = movies)
summary(inv_score)
## 
## Call:
## lm(formula = audience_score ~ imdb_num_votes, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.707 -15.085   1.866  16.245  36.521 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.936e+01  8.534e-01  69.552  < 2e-16 ***
## imdb_num_votes 5.227e-05  6.776e-06   7.714  4.6e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.37 on 649 degrees of freedom
## Multiple R-squared:  0.08399,    Adjusted R-squared:  0.08258 
## F-statistic: 59.51 on 1 and 649 DF,  p-value: 4.602e-14
audience_score2 = lm(audience_score ~ audience_rating, data = movies)
summary(audience_score2)
## 
## Call:
## lm(formula = audience_score ~ audience_rating, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.9345  -7.3032   0.6968   8.6968  19.6968 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             41.9345     0.6133   68.38   <2e-16 ***
## audience_ratingUpright  35.3686     0.8070   43.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.17 on 649 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7471 
## F-statistic:  1921 on 1 and 649 DF,  p-value: < 2.2e-16
bpn_score = lm(audience_score ~ best_pic_nom, data = movies)
summary(bpn_score)
## 
## Call:
## lm(formula = audience_score ~ best_pic_nom, data = movies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.56 -15.56   2.44  16.44  34.44 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.5596     0.7885  78.069  < 2e-16 ***
## best_pic_nomyes  23.7586     4.2894   5.539 4.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.78 on 649 degrees of freedom
## Multiple R-squared:  0.04514,    Adjusted R-squared:  0.04367 
## F-statistic: 30.68 on 1 and 649 DF,  p-value: 4.425e-08
bpw_score = lm(audience_score ~ best_pic_win, data = movies)
summary(bpw_score)
## 
## Call:
## lm(formula = audience_score ~ best_pic_win, data = movies)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.12 -16.12   2.88  16.88  33.88 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.1196     0.7922  78.416  < 2e-16 ***
## best_pic_winyes  22.5947     7.6395   2.958  0.00321 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.1 on 649 degrees of freedom
## Multiple R-squared:  0.0133, Adjusted R-squared:  0.01178 
## F-statistic: 8.748 on 1 and 649 DF,  p-value: 0.003213
bdw_score = lm(audience_score ~ best_dir_win, data = movies)
summary(bdw_score)
## 
## Call:
## lm(formula = audience_score ~ best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.857 -16.857   3.143  17.143  34.143 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      61.8569     0.8171  75.701   <2e-16 ***
## best_dir_winyes   7.6547     3.1794   2.408   0.0163 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.15 on 649 degrees of freedom
## Multiple R-squared:  0.008852,   Adjusted R-squared:  0.007325 
## F-statistic: 5.797 on 1 and 649 DF,  p-value: 0.01634
t2b_score = lm(audience_score ~ top200_box, data = movies)
summary(t2b_score)
## 
## Call:
## lm(formula = audience_score ~ top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.075 -16.075   2.925  16.925  34.925 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    62.0755     0.7991  77.686   <2e-16 ***
## top200_boxyes  12.4579     5.2641   2.367   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.15 on 649 degrees of freedom
## Multiple R-squared:  0.008556,   Adjusted R-squared:  0.007028 
## F-statistic: 5.601 on 1 and 649 DF,  p-value: 0.01825

By running these single-variable models we find that the nine explanatory variables are reasonable candidates for the prediction model. The null hypothesis of no association is rejected in eight of the nine cases because of the low p-values; the exception is thtr_rel_year (p = 0.168), which we nevertheless carry into the full model.
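
For reference, the nine single-predictor fits above can be reproduced more compactly in a loop. The following sketch (assuming the same movies data frame) collects the overall F-test p-value of each model:

# fit each single-predictor model and extract the overall F-test p-value
predictors <- c("runtime", "title_type", "thtr_rel_year", "imdb_num_votes",
                "audience_rating", "best_pic_nom", "best_pic_win",
                "best_dir_win", "top200_box")

single_p_values <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "audience_score"), data = movies)
  f <- summary(fit)$fstatistic                      # F value and its degrees of freedom
  unname(pf(f[1], f[2], f[3], lower.tail = FALSE))
})

data.frame(predictor = predictors, p_value = signif(single_p_values, 3))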

The next step is to check the residuals of all nine simple models for linearity, constant variability, near normality and independence.

#test on nearly normal residuals, linearity and constant variability

ggplot(data = run_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = run_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = type_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = type_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = year_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = year_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = inv_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = inv_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = audience_score2, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = audience_score2, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = bpn_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = bpn_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = bpw_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = bpw_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = bdw_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = bdw_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

ggplot(data = t2b_score, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

ggplot(data = t2b_score, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

The residuals appear to be randomly scattered around 0. The plots are indicative of linear relationships between the audience score and running time, movie type, year of release, IMDB number of votes, audience rating, whether the movie was nominated for or won an Oscar, and whether the director has won an Oscar. Based on the residual plots, the constant variability condition appears to be met, as does the condition of independent residuals.
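
Since the nine pairs of plots above repeat the same code, a small helper function could produce both diagnostic plots for any fitted model. This is a sketch with a hypothetical name, resid_plots; it only wraps the ggplot calls already used above.

# hypothetical helper: residuals-vs-fitted scatter plus residual histogram
# for any fitted lm object
resid_plots <- function(fit, bins = 30) {
  d <- data.frame(fitted = fitted(fit), resid = resid(fit))
  scatter <- ggplot(d, aes(x = fitted, y = resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed") +
    xlab("Fitted values") +
    ylab("Residuals")
  histogram <- ggplot(d, aes(x = resid)) +
    geom_histogram(bins = bins) +
    xlab("Residuals")
  list(scatter = scatter, histogram = histogram)
}

# example: the same diagnostics for the runtime-only model
resid_plots(run_score)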


Part 4: Modeling

Now let’s build a full prediction model including all of the explanatory variables identified above.

score1 = lm(audience_score ~ runtime + thtr_rel_year + title_type + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win + best_dir_win + top200_box, data = movies)
summary(score1)
## 
## Call:
## lm(formula = audience_score ~ runtime + thtr_rel_year + title_type + 
##     imdb_num_votes + audience_rating + best_pic_nom + best_pic_win + 
##     best_dir_win + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.878  -6.762   0.802   7.486  20.277 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.411e+02  7.357e+01   3.278  0.00110 ** 
## runtime                 2.461e-02  2.161e-02   1.139  0.25516    
## thtr_rel_year          -9.608e-02  3.654e-02  -2.629  0.00876 ** 
## title_typeFeature Film -1.040e+01  1.481e+00  -7.026 5.45e-12 ***
## title_typeTV Movie     -1.415e+01  4.497e+00  -3.147  0.00173 ** 
## imdb_num_votes          2.204e-05  4.173e-06   5.280 1.77e-07 ***
## audience_ratingUpright  3.226e+01  8.337e-01  38.701  < 2e-16 ***
## best_pic_nomyes         6.446e+00  2.460e+00   2.621  0.00898 ** 
## best_pic_winyes        -6.555e+00  4.404e+00  -1.488  0.13712    
## best_dir_winyes         2.607e+00  1.648e+00   1.582  0.11418    
## top200_boxyes          -2.401e+00  2.641e+00  -0.909  0.36353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.587 on 639 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.779,  Adjusted R-squared:  0.7755 
## F-statistic: 225.2 on 10 and 639 DF,  p-value: < 2.2e-16

We chose these particular variables because, while running the exploratory analysis, we found them to have the smallest p-values, indicating that they may play significant roles in predicting audience scores.


The next step is to choose the most parsimonious prediction model, i.e. the model with the smallest number of variables and the highest predictive power, using backward elimination based on p-values: at each step we drop the variable with the largest p-value from the model and refit.
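
For illustration, a single elimination step might look like the sketch below (it assumes the full model score1 fitted above): drop top200_box, the variable with the largest p-value (0.36), and refit. Repeating this until all remaining p-values are small leads to the model that follows.

# one backward-elimination step: remove the least significant predictor and refit
score_step1 <- update(score1, . ~ . - top200_box)
summary(score_step1)$coefficients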

score = lm(audience_score ~ thtr_rel_year + title_type + imdb_num_votes + audience_rating + best_pic_nom, data = movies)
summary(score)
## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + title_type + imdb_num_votes + 
##     audience_rating + best_pic_nom, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.6067  -6.8432   0.8671   7.4220  20.4680 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.543e+02  7.179e+01   3.542 0.000426 ***
## thtr_rel_year          -1.016e-01  3.578e-02  -2.839 0.004662 ** 
## title_typeFeature Film -9.890e+00  1.467e+00  -6.741 3.49e-11 ***
## title_typeTV Movie     -1.383e+01  4.502e+00  -3.072 0.002216 ** 
## imdb_num_votes          2.204e-05  3.794e-06   5.810 9.85e-09 ***
## audience_ratingUpright  3.237e+01  8.328e-01  38.867  < 2e-16 ***
## best_pic_nomyes         5.673e+00  2.227e+00   2.548 0.011070 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.607 on 644 degrees of freedom
## Multiple R-squared:  0.7764, Adjusted R-squared:  0.7743 
## F-statistic: 372.7 on 6 and 644 DF,  p-value: < 2.2e-16

Now we try backward selection based on adjusted R^2 instead:

r2 <- lm(audience_score ~ audience_rating + imdb_num_votes + title_type + thtr_rel_year + best_pic_nom + runtime + best_dir_win + best_pic_win, data = movies)
summary(r2)$adj.r.squared
## [1] 0.7755891

Running through the different options, we found this to be the model with the highest adjusted R^2. Conclusion: the p-value backward-elimination model is the more efficient one, since adjusted R^2 elimination removes only a single variable.
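
One way to carry out this comparison programmatically is the sketch below (it assumes the full model score1 fitted above): drop each predictor in turn and record the adjusted R^2 of the reduced model.

# one step of adjusted-R^2 backward selection: refit the full model without
# each predictor in turn and compare the resulting adjusted R^2 values
full_terms <- attr(terms(score1), "term.labels")

adj_r2 <- sapply(full_terms, function(v) {
  reduced <- update(score1, as.formula(paste(". ~ . -", v)))
  summary(reduced)$adj.r.squared
})

sort(adj_r2, decreasing = TRUE)  # the top entry is the predictor whose removal costs the least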

Model diagnostics

Linear relationship between the explanatory variables and the response

This check was carried out earlier in the document; the graphs show linear relationships.

Nearly normal residuals

We check the residuals of the model for normality: the distribution is roughly normal, centered around 0, with a slight left skew.

hist(score$residuals)

qqnorm(score$residuals)
qqline(score$residuals)

Both plots show a nearly normal distribution of residuals.

Constant variability of residuals

Residuals should be equally variable for low and high values of the predicted response variable. We can check this by plotting the residuals against the predicted values of the dependent variable; in this way we can check the whole model at once.

plot(score$residuals ~ score$fitted)

Here we observe an awkward pattern: the residuals are not equally spread across low and high fitted values. This banding appears to be driven by the categorical (non-numerical) variables in the model, which I have not yet found a good way to handle.

Checking the model’s residuals for independence

plot(score$residuals)

We see a random scatter of the values, so the residuals can be treated as independent.

The fitted model is:

audience_score = 254.3 - 0.102 x thtr_rel_year - 9.89 x (title type: Feature Film) - 13.8 x (title type: TV Movie) + 2.20e-05 x imdb_num_votes + 32.4 x (audience rating: Upright) + 5.67 x (best picture nomination: yes)

Interpretation note: the intercept corresponds to a release year of 0, so on its own it is not meaningful, and the model should not be applied to release years far outside the range of the data.


Part 5: Prediction.

We use the p-value backward-elimination model found above to run the prediction:

score <- lm(audience_score ~ thtr_rel_year + title_type + imdb_num_votes + audience_rating + best_pic_nom, data = movies)

Because the p-value is so small, we can reject H0 and say that the model as a whole is significant. R^2 tells us that roughly 77.6% of the variability in the audience scores is explained by the prediction model.

#Suicide Squad

movie2016 <- data.frame(thtr_rel_year = 2016, imdb_num_votes = 391131, title_type = "Feature Film", audience_rating = "Spilled", best_pic_nom = "yes" )
movie2016 
##   thtr_rel_year imdb_num_votes   title_type audience_rating best_pic_nom
## 1          2016         391131 Feature Film         Spilled          yes
predict(score, movie2016, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 53.86204 34.43349 73.29058

Hence, the model predicts, with 95% confidence, that Suicide Squad (release year 2016, 391131 IMDB votes per http://www.imdb.com/, audience rating Spilled per https://www.rottentomatoes.com/, and a best picture Oscar nomination per https://www.polygon.com/2017/1/24/14371144/suicide-squad-Oscar-nomination-razzie) is expected to have an audience score between 34.43 and 73.29, with a point estimate of 53.86.
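
As a sanity check, the point estimate can be reproduced by hand from the fitted coefficients; the sketch below sets the dummy variables according to the movie2016 data frame above.

# rebuild the linear predictor manually for the Suicide Squad row
b <- coef(score)
fit_by_hand <- b["(Intercept)"] +
  b["thtr_rel_year"]          * 2016 +
  b["title_typeFeature Film"] * 1 +    # Feature Film
  b["title_typeTV Movie"]     * 0 +
  b["imdb_num_votes"]         * 391131 +
  b["audience_ratingUpright"] * 0 +    # Spilled, so the Upright dummy is 0
  b["best_pic_nomyes"]        * 1      # nominated
unname(fit_by_hand)                    # should match the fit returned by predict()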


Part 6: Conclusion

The model predicts a fairly modest audience rating of 53.86. According to Rotten Tomatoes, Suicide Squad’s current audience rating is 62. Before the analysis I was sure that the important factors influencing movie ratings would be the release date, the studio and the length of the movie, but the analysis shows that the relevant factors are instead the year of release (newer movies seem to have slightly lower ratings) and the type of the movie (documentaries have higher ratings), and that the most predictive explanatory variables are the number of IMDB votes and whether or not the movie was nominated for an Oscar. As a next step, it would be interesting to see how far these findings transfer to the entire population of movies: how well do IMDB and Rotten Tomatoes ratings represent worldwide cinema audiences?