Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)
library(grid)
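# Install corrplot and mlbench only when they are missing: re-installing
# a package that is already loaded raises "Updating loaded packages".
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")
library(corrplot)
if (!requireNamespace("mlbench", quietly = TRUE)) install.packages("mlbench")
library(mlbench)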

Load data

load("movies.Rdata")
View(movies)

Part 1: Data

This dataset comprises 651 randomly sampled movies produced or released before 2016. It includes information from Rotten Tomatoes and IMDb, with 32 attributes per movie. The study is observational because the data were collected in a way that does not directly interfere with how the data arise. It is also retrospective, since it uses data from the past. Because the movies were sampled at random, the results can be generalized to the population the sample was drawn from.

Because this is an observational study with no random assignment, we cannot establish causal connections between the explanatory variables and the response variable. Any relationships we find are associations, not causal effects.

imdb_rating is a score on a scale of 1 to 10 given by users of IMDb. Each IMDb user has one vote per title and can change it at any time. The votes are converted into a weighted mean rating that is displayed beside each title on the website. imdb_num_votes is the number of IMDb users who voted for a particular film. It is an open-ended count beginning at 0, and its maximum depends on how many of the website's users voted for that film.

critics_score and audience_score are ratings from the website Rotten Tomatoes. Both have a scale of 1 to 100. Each movie features a “user average,” which calculates the percentage of users who have rated the film positively. Users rate the movie on a scale of 0-10, while critics' reviews generally use 4-star ratings and are often qualitative.

Part 2: Research question

Which attributes are significant predictors of a movie's popularity, as measured by its IMDb rating? Does the analysis reveal anything new about movies?

Part 3: Exploratory data analysis

summary(movies)

##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58301  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

attach(movies)

names(movies)

##  [1] "title"            "title_type"       "genre"           
##  [4] "runtime"          "mpaa_rating"      "studio"          
##  [7] "thtr_rel_year"    "thtr_rel_month"   "thtr_rel_day"    
## [10] "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"  
## [16] "critics_score"    "audience_rating"  "audience_score"  
## [19] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [22] "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"          
## [28] "actor3"           "actor4"           "actor5"          
## [31] "imdb_url"         "rt_url"

head(movies$genre)

## [1] Drama       Drama       Comedy      Drama       Horror      Documentary
## 11 Levels: Action & Adventure Animation ... Science Fiction & Fantasy

summary(movies$genre)

##        Action & Adventure                 Animation 
##                        65                         9 
## Art House & International                    Comedy 
##                        14                        87 
##               Documentary                     Drama 
##                        52                       305 
##                    Horror Musical & Performing Arts 
##                        23                        12 
##        Mystery & Suspense                     Other 
##                        59                        16 
## Science Fiction & Fantasy 
##                         9

summary(movies$runtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    39.0    92.0   103.0   105.8   115.8   267.0       1

summary(movies$studio)

##                       Paramount Pictures 
##                                       37 
##                    Warner Bros. Pictures 
##                                       30 
##         Sony Pictures Home Entertainment 
##                                       27 
##                       Universal Pictures 
##                                       23 
##                        Warner Home Video 
##                                       19 
##                         20th Century Fox 
##                                       18 
##                            Miramax Films 
##                                       18 
##                                      MGM 
##                                       16 
## Twentieth Century Fox Home Entertainment 
##                                       14 
##                                IFC Films 
##                                       13 
##                 MCA Universal Home Video 
##                                       13 
##                     Paramount Home Video 
##                                       12 
##                          New Line Cinema 
##                                       10 
##                            Sony Pictures 
##                                       10 
##                   Sony Pictures Classics 
##                                       10 
##                     Buena Vista Pictures 
##                                        9 
##                        Magnolia Pictures 
##                                        9 
##                   MGM Home Entertainment 
##                                        9 
##                 WARNER BROTHERS PICTURES 
##                                        9 
##                                HBO Video 
##                                        8 
##                        Columbia Pictures 
##                                        7 
##                         Lions Gate Films 
##                                        7 
##                                  Miramax 
##                                        7 
##              New Line Home Entertainment 
##                                        7 
##                    The Weinstein Company 
##                                        7 
##                             Warner Bros. 
##                                        7 
##                              Buena Vista 
##                                        6 
##                       First Run Features 
##                                        6 
##                           United Artists 
##                                        6 
##        20th Century Fox Film Corporation 
##                                        5 
##                         Orion Home Video 
##                                        5 
##                                Paramount 
##                                        5 
##                           Focus Features 
##                                        4 
##                          Fox Searchlight 
##                                        4 
##                                Lionsgate 
##                                        4 
##               Orion Pictures Corporation 
##                                        4 
##                        Paramount Studios 
##                                        4 
##              Sony Pictures Entertainment 
##                                        4 
##                      Touchstone Pictures 
##                                        4 
##                                      Fox 
##                                        3 
##                 Fox Searchlight Pictures 
##                                        3 
##                       Hollywood Pictures 
##                                        3 
##                                      IFC 
##                                        3 
##                      Image Entertainment 
##                                        3 
##                     Independent Pictures 
##                                        3 
##                          Lionsgate Films 
##                                        3 
##                         New Yorker Films 
##                                        3 
##                              Screen Gems 
##                                        3 
##                     Summit Entertainment 
##                                        3 
##                        The Weinstein Co. 
##                                        3 
##                                ThinkFilm 
##                                        3 
##                     Walt Disney Pictures 
##                                        3 
##              Warner Independent Pictures 
##                                        3 
##           20th Century Fox Film Corporat 
##                                        2 
##                                A24 Films 
##                                        2 
##                 Anchor Bay Entertainment 
##                                        2 
##                    Artisan Entertainment 
##                                        2 
##           Buena Vista Distribution Compa 
##                                        2 
##                          Cowboy Pictures 
##                                        2 
##                             FilmDistrict 
##                                        2 
##                               Fox Atomic 
##                                        2 
##                     Lions Gate Releasing 
##                                        2 
##                  LionsGate Entertainment 
##                                        2 
##                          Live Home Video 
##                                        2 
##                          Music Box Films 
##                                        2 
##        National Geographic Entertainment 
##                                        2 
##                     Nelson Entertainment 
##                                        2 
##                           Overture Films 
##                                        2 
##             Republic Pictures Home Video 
##                                        2 
##                     Roadside Attractions 
##                                        2 
##                     Samuel Goldwyn Films 
##                                        2 
##                Sony Pictures/Screen Gems 
##                                        2 
##                         Strand Releasing 
##                                        2 
##                                  Trimark 
##                                        2 
##                         TriStar Pictures 
##                                        2 
##                        Universal Studios 
##                                        2 
##                                USA Films 
##                                        2 
##                  Walt Disney Productions 
##                                        2 
##                        Weinstein Company 
##                                        2 
##                           7-57 Releasing 
##                                        1 
##                                  7th art 
##                                        1 
##                          905 Corporation 
##                                        1 
##                                      A24 
##                                        1 
##                     All Girl Productions 
##                                        1 
##         Alliance Atlantis Communications 
##                                        1 
##          American International Pictures 
##                                        1 
##                                 Analysis 
##                                        1 
##                         Anchor Bay Films 
##                                        1 
##                   Arab Film Distribution 
##                                        1 
##                     Arenas Entertainment 
##                                        1 
##                    AVCO Embassy Pictures 
##                                        1 
##                           Bankside Films 
##                                        1 
##                                Blumhouse 
##                                        1 
##                                      BMG 
##                                        1 
##                         Brainstorm Media 
##                                        1 
##                 Buena Vista Internationa 
##                                        1 
##                    Carnaby International 
##                                        1 
##                        Chloe Productions 
##                                        1 
##                                  (Other) 
##                                      113 
##                                     NA's 
##                                        8

summary(movies$critics_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   33.00   61.00   57.69   83.00  100.00

summary(movies$critics_rating)

## Certified Fresh           Fresh          Rotten 
##             135             209             307

summary(movies$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

summary(movies$audience_rating)

## Spilled Upright 
##     275     376

summary(movies$imdb_num_votes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     180    4546   15120   57530   58300  893000

######################### Plotting ##########################
hist(movies$imdb_rating)

[Figure: histogram of imdb_rating]

hist(movies$imdb_num_votes)

[Figure: histogram of imdb_num_votes]

hist(movies$critics_score)

[Figure: histogram of critics_score]

hist(movies$audience_score)

[Figure: histogram of audience_score]

quantile(movies$imdb_num_votes, c(0, 0.25, 0.5, 0.75, 0.9, 1))

##       0%      25%      50%      75%      90%     100% 
##    180.0   4545.5  15116.0  58300.5 151934.0 893008.0

imdb_rating has the shape closest to a normal distribution. imdb_num_votes is heavily right-skewed, with 90% of movies having 151,934 votes or fewer.
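The skew in imdb_num_votes also shows up in the summary above, where the mean (57,533) is far above the median (15,116). As a minimal sketch of how to check this numerically, the code below uses synthetic log-normal counts as a hypothetical stand-in for movies$imdb_num_votes (the name votes and the distribution parameters are assumptions, not values from the dataset) and compares the mean-to-median ratio before and after a log transform.

```r
# Hypothetical stand-in for movies$imdb_num_votes: right-skewed synthetic counts
set.seed(1)
votes <- round(rlnorm(651, meanlog = 9.6, sdlog = 1.8))

# Share of movies at or below the 90th percentile
p90 <- quantile(votes, 0.9)
share_below <- mean(votes <= p90)

# A mean well above the median is a quick sign of right skew
skew_ratio <- mean(votes) / median(votes)

# On the log10 scale the mean and median are much closer together
log_votes <- log10(votes + 1)
log_ratio <- mean(log_votes) / median(log_votes)

print(c(share_below = share_below, skew_ratio = skew_ratio, log_ratio = log_ratio))
```

If the real variable behaves similarly, a log transform may be worth considering before using it as a predictor.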

The distributions of critics_score and audience_score appear similar, except that the audience score tapers more at both ends.

We will choose imdb_rating as our response variable.

data1 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = critics_rating)) + geom_point()
data2 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = audience_rating)) + geom_point()
grid.arrange(data1, data2, nrow = 1, ncol = 2)

[Figure: imdb_rating vs. critics_score, colored by critics_rating (left) and by audience_rating (right)]

Between critics_rating and audience_rating, we will choose critics_rating because it has 3 categories.

hist(movies$thtr_rel_year, col = "green" )

[Figure: histogram of thtr_rel_year]

hist(movies$thtr_rel_month, col = "red")

[Figure: histogram of thtr_rel_month]

hist(movies$thtr_rel_day, col = "yellow" )

[Figure: histogram of thtr_rel_day]

d1 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = thtr_rel_year)) + geom_point()
d2 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = thtr_rel_month)) + geom_point()
d3 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = thtr_rel_day)) + geom_point()
grid.arrange(d1, d2, d3, nrow = 1, ncol = 3)

[Figure: imdb_rating vs. critics_score, colored by theatrical release year, month, and day]

The histograms above show that there are particular years, months, and days when more movies are released in theaters for the first time. We do not observe any clustering of these points in the scatterplots along the imdb_rating and critics_score scales.

hist(movies$dvd_rel_year, col = "green" )

[Figure: histogram of dvd_rel_year]

hist(movies$dvd_rel_month, col = "red")

[Figure: histogram of dvd_rel_month]

hist(movies$dvd_rel_day, col = "yellow" )

[Figure: histogram of dvd_rel_day]

d4 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = dvd_rel_year)) + geom_point()
d5 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = dvd_rel_month)) + geom_point()
d6 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = dvd_rel_day)) + geom_point()
grid.arrange(d4, d5, d6, nrow = 1, ncol = 3)

[Figure: imdb_rating vs. critics_score, colored by DVD release year, month, and day]

The histograms above show that there are particular years, months, and days when more DVDs are released for the first time. We do not observe any clustering of these points in the scatterplots along the imdb_rating and critics_score scales.

ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = top200_box)) + geom_point()

[Figure: imdb_rating vs. critics_score, colored by top200_box]

ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = best_actor_win)) + geom_point()

[Figure: imdb_rating vs. critics_score, colored by best_actor_win]

ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = best_actress_win)) + geom_point()

[Figure: imdb_rating vs. critics_score, colored by best_actress_win]

ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = best_dir_win)) + geom_point()

[Figure: imdb_rating vs. critics_score, colored by best_dir_win]

Part 4: Modeling

# Select the numerical variables
M <-  movies[, c("runtime","critics_score", "audience_score", "imdb_num_votes","imdb_rating")]

head(M)

## # A tibble: 6 x 5
##   runtime critics_score audience_score imdb_num_votes imdb_rating
##     <dbl>         <dbl>          <dbl>          <int>       <dbl>
## 1    80.0          45.0           73.0            899        5.50
## 2   101            96.0           81.0          12285        7.30
## 3    84.0          91.0           91.0          22381        7.60
## 4   139            80.0           76.0          35096        7.20
## 5    90.0          33.0           27.0           2386        5.10
## 6    78.0          91.0           86.0            333        7.80

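# runtime has one missing value, so compute correlations on complete rows only
M1 <- cor(M, use = "complete.obs")
corrplot(M1, method = "square", type = "upper", order= "alphabet")

[Figure: correlation plot of the numerical variables]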

Since the distribution of imdb_rating is close to a normal distribution, we choose it as the response variable.

We choose audience_score as our first explanatory variable because it is the variable most strongly correlated with the response variable.
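The ranking behind this choice can also be read off numerically instead of from the correlation plot. The following is a minimal sketch on synthetic data: the data frame M_demo and its coefficients are hypothetical stand-ins for the numeric columns of movies, and use = "complete.obs" is there because runtime has one missing value in the summary above.

```r
set.seed(42)
n <- 500

# Hypothetical stand-in for the numeric columns of `movies`
audience_score <- runif(n, 10, 100)
critics_score  <- pmin(100, pmax(1, audience_score + rnorm(n, 0, 30)))
runtime        <- rnorm(n, 105, 20)
imdb_rating    <- 3.6 + 0.035 * audience_score + 0.012 * critics_score +
  rnorm(n, 0, 0.5)
M_demo <- data.frame(runtime, critics_score, audience_score, imdb_rating)
M_demo$runtime[1] <- NA  # mimic the single missing runtime

# Correlation of every column with the response, using complete rows only
cors <- cor(M_demo, use = "complete.obs")[, "imdb_rating"]

# Drop the response itself and rank the predictors by absolute correlation
ranked <- sort(abs(cors[names(cors) != "imdb_rating"]), decreasing = TRUE)
print(ranked)
```

The first name in ranked is the natural first candidate for the model, and the second name is the next one to try.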

lm_as <- lm(imdb_rating ~ audience_score)
summary(lm_as)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2082 -0.1866  0.0712  0.3093  1.1516 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.599992   0.069291   51.95   <2e-16 ***
## audience_score 0.046392   0.001057   43.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.545 on 649 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7476 
## F-statistic:  1926 on 1 and 649 DF,  p-value: < 2.2e-16

The summary above shows that audience_score is a significant predictor: its p-value is below 2e-16, and the model has an R-squared of 0.748.
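These figures do not have to be read off the printed summary; they can be pulled out of the summary object. The sketch below assumes a synthetic data frame named demo (hypothetical, not the movies data) and shows where R-squared, the slope, and its p-value live. With a single predictor, R-squared equals the squared correlation between the two variables, which gives a quick sanity check.

```r
set.seed(7)

# Hypothetical data with a relationship similar in shape to the fitted model
demo <- data.frame(audience_score = runif(300, 10, 100))
demo$imdb_rating <- 3.6 + 0.046 * demo$audience_score + rnorm(300, 0, 0.55)

fit <- lm(imdb_rating ~ audience_score, data = demo)
s <- summary(fit)

r2    <- s$r.squared                              # Multiple R-squared
slope <- coef(s)["audience_score", "Estimate"]    # estimated coefficient
p_val <- coef(s)["audience_score", "Pr(>|t|)"]    # p-value of the t-test

# With one predictor, R-squared is the squared correlation
r2_check <- cor(demo$imdb_rating, demo$audience_score)^2
print(c(r2 = r2, r2_check = r2_check, slope = slope, p_val = p_val))
```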

The second variable we will add is critics_score, as it has the second-highest correlation with imdb_rating.

ml2 <- lm(imdb_rating ~ audience_score + critics_score)
summary(ml2)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51964 -0.19767  0.03466  0.30671  1.22691 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.647241   0.062471   58.38   <2e-16 ***
## audience_score 0.034703   0.001340   25.90   <2e-16 ***
## critics_score  0.011816   0.000954   12.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4904 on 648 degrees of freedom
## Multiple R-squared:  0.7962, Adjusted R-squared:  0.7956 
## F-statistic:  1266 on 2 and 648 DF,  p-value: < 2.2e-16

We can see that both R-squared and adjusted R-squared increased after adding critics_score, and the p-values of both variables are less than 0.05. Hence, audience_score and critics_score are significant predictors.
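Another way to compare the one-variable and two-variable models is a nested F-test with anova(). This is a sketch on synthetic data (the data frame demo and the names small and large are hypothetical), not a rerun of the analysis above. When exactly one term is added, the F-test p-value matches the t-test p-value of that term.

```r
set.seed(11)
n <- 300

# Hypothetical data: two correlated scores and a rating that depends on both
demo <- data.frame(audience_score = runif(n, 10, 100))
demo$critics_score <- pmin(100, pmax(0, demo$audience_score + rnorm(n, 0, 25)))
demo$imdb_rating <- 3.6 + 0.035 * demo$audience_score +
  0.012 * demo$critics_score + rnorm(n, 0, 0.5)

small <- lm(imdb_rating ~ audience_score, data = demo)
large <- lm(imdb_rating ~ audience_score + critics_score, data = demo)

# F-test for the added term
cmp <- anova(small, large)
f_p <- cmp[["Pr(>F)"]][2]

# t-test p-value of the same term in the larger model
t_p <- coef(summary(large))["critics_score", "Pr(>|t|)"]

adj_small <- summary(small)$adj.r.squared
adj_large <- summary(large)$adj.r.squared
print(cmp)
```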

Next, we add imdb_num_votes to the model.

ml3 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes)
summary(ml3)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49004 -0.18552  0.02332  0.29450  1.17298 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.683e+00  6.192e-02  59.471  < 2e-16 ***
## audience_score 3.340e-02  1.347e-03  24.794  < 2e-16 ***
## critics_score  1.178e-02  9.387e-04  12.552  < 2e-16 ***
## imdb_num_votes 8.335e-07  1.764e-07   4.726 2.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4825 on 647 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.8021 
## F-statistic: 879.3 on 3 and 647 DF,  p-value: < 2.2e-16

Adding imdb_num_votes to the model increased R-squared (from 0.7962 to 0.803) and adjusted R-squared (from 0.7956 to 0.8021) only slightly, even though its p-value is small.
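Adjusted R-squared penalizes each extra predictor, which is why it is the fairer number to compare across these models. As a worked example, the helper below (adj_r2 is a name introduced here for illustration) applies the standard formula 1 - (1 - R^2)(n - 1)/(n - k - 1) to the R-squared values printed in the summaries of this report, with n = 651 movies and k predictors.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 651  # number of movies in the dataset

# R-squared values as printed in the model summaries of this report
a1 <- round(adj_r2(0.7480, n, 1), 4)  # audience_score only        -> 0.7476
a2 <- round(adj_r2(0.7962, n, 2), 4)  # + critics_score            -> 0.7956
a3 <- round(adj_r2(0.8030, n, 3), 4)  # + imdb_num_votes           -> 0.8021
a4 <- round(adj_r2(0.8050, n, 8), 4)  # + the five award variables -> 0.8026
print(c(a1, a2, a3, a4))
```

The results reproduce the adjusted values in the printed summaries, and show that going from two predictors to eight moves adjusted R-squared by less than 0.01.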

ml4 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes + best_dir_win + best_actor_win + best_actress_win + best_pic_nom + best_pic_win)
summary(ml4)

## 
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes + 
##     best_dir_win + best_actor_win + best_actress_win + best_pic_nom + 
##     best_pic_win)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.46429 -0.19315  0.02108  0.28757  1.19805 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.654e+00  6.321e-02  57.811  < 2e-16 ***
## audience_score       3.366e-02  1.352e-03  24.891  < 2e-16 ***
## critics_score        1.160e-02  9.450e-04  12.280  < 2e-16 ***
## imdb_num_votes       8.035e-07  1.886e-07   4.261 2.34e-05 ***
## best_dir_winyes      7.686e-02  8.173e-02   0.940   0.3474    
## best_actor_winyes    9.206e-02  5.543e-02   1.661   0.0972 .  
## best_actress_winyes  8.391e-02  6.219e-02   1.349   0.1777    
## best_pic_nomyes     -7.990e-02  1.254e-01  -0.637   0.5241    
## best_pic_winyes     -5.738e-02  2.219e-01  -0.259   0.7961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.482 on 642 degrees of freedom
## Multiple R-squared:  0.805,  Adjusted R-squared:  0.8026 
## F-statistic: 331.3 on 8 and 642 DF,  p-value: < 2.2e-16

After adding the best_dir_win, best_actor_win, best_actress_win, best_pic_nom, and best_pic_win variables to the model, there is a negligible difference in R-squared and adjusted R-squared. The p-values for these variables are also greater than 0.05. Hence, they are not significant predictors. Other categorical variables such as studio, genre, and mpaa_rating have too many levels to be useful.
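The steps above add one candidate at a time and keep it only if the model improves. A simplified version of that procedure can be sketched as a loop that uses adjusted R-squared as its only criterion. This is an assumption made for illustration: the report also weighs p-values and judgment. The function forward_select, the data frame demo, and the column noise_var are hypothetical names, and the data are synthetic.

```r
# Forward selection on adjusted R-squared: at each step add the candidate
# that raises adjusted R-squared the most, and stop when none raises it.
forward_select <- function(data, response, candidates) {
  selected <- character(0)
  best_adj <- -Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    # Adjusted R-squared of each model that adds one more candidate
    scores <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    if (max(scores) <= best_adj) break
    best_adj <- max(scores)
    selected <- c(selected, names(which.max(scores)))
  }
  list(selected = selected, adj_r2 = best_adj)
}

# Synthetic demonstration data (hypothetical, not the movies dataset)
set.seed(5)
n <- 400
demo <- data.frame(audience_score = runif(n, 10, 100))
demo$critics_score <- pmin(100, pmax(0, demo$audience_score + rnorm(n, 0, 25)))
demo$noise_var <- rnorm(n)
demo$imdb_rating <- 3.6 + 0.035 * demo$audience_score +
  0.012 * demo$critics_score + rnorm(n, 0, 0.5)

result <- forward_select(demo, "imdb_rating",
                         c("audience_score", "critics_score", "noise_var"))
print(result)
```

Adjusted R-squared alone can still admit a weak variable by chance, which is one reason to combine it with p-values and subject-matter reasoning.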

Model Interpretation:

We used a forward selection approach with a combination of criteria: p-value, adjusted R-squared, and logical reasoning. The final model is:

imdb_rating = 3.65 + 0.03 * audience_score + 0.01 * critics_score

R-squared: 79.6% of the variability in imdb_rating is explained by the model. Both audience_score and critics_score have p-values less than 0.05.
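As a quick worked example of how the fitted equation is used, the sketch below plugs a made-up movie into it. The coefficients are the rounded values quoted in the text, and the two scores are hypothetical, so the result only illustrates the arithmetic.

```r
# Rounded coefficients copied from the model equation in the text
intercept  <- 3.65
b_audience <- 0.03
b_critics  <- 0.01

# Hypothetical movie: these scores are made up for illustration
example_audience_score <- 80
example_critics_score  <- 70

# 3.65 + 0.03 * 80 + 0.01 * 70 = 3.65 + 2.4 + 0.7
example_rating <- intercept +
  b_audience * example_audience_score +
  b_critics * example_critics_score
print(example_rating)  # 6.75
```

Because the coefficients are rounded, predict() on the fitted model object will give slightly different values.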

Model Diagnostics:

-> Linear relationships between explanatory numerical variables and response variable

Each numerical explanatory variable should be linearly related to the response variable. We can check this condition using residual plots (e vs. x). There are two numerical variables in the model.

rating_final <- lm(imdb_rating ~ audience_score + critics_score)
plot(rating_final$residuals ~ audience_score, main = "Residuals vs. audience_score")
abline(0, 0)

plot of chunk unnamed-chunk-12

plot(rating_final$residuals ~ critics_score, main = "Residuals vs. critics_score")
abline(0, 0)

plot of chunk unnamed-chunk-13

In the residual plots, we look for residuals scattered around 0 with no systematic pattern, which indicates a linear relationship between each explanatory variable and the response. This condition is met by our model.

-> Nearly normal residuals

On a residuals plot we look for random scatter of residuals around 0. This translates to a nearly normal distribution of residuals centered at 0. We can check this using a histogram or a normal probability plot.

hist(rating_final$residuals)

plot of chunk unnamed-chunk-14

The histogram above is skewed. This suggests that our model is less reliable when audience score or critics score is low.

qqnorm(rating_final$residuals)
qqline(rating_final$residuals)

plot of chunk unnamed-chunk-15

The points deviate a little from the line at the lower end, but the deviation is not large. This again suggests that our model is less reliable when audience score or critics score is low.
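The histogram and normal probability plot can be complemented by a formal test. The sketch below applies the Shapiro-Wilk test from base R to the residuals of a model fitted to simulated data. The data and the object name `sim_fit` are assumptions for illustration; with the movies data the test would be applied to rating_final$residuals.

```r
# Simulated stand-in data (an assumption): one score-like predictor and
# a rating-like response with normal noise
set.seed(1)
sim_score  <- runif(200, min = 0, max = 100)
sim_rating <- 3.5 + 0.04 * sim_score + rnorm(200, sd = 0.5)
sim_fit    <- lm(sim_rating ~ sim_score)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(sim_fit))
print(sw)
```

A small p-value is evidence against normal residuals. With several hundred observations the test can flag mild departures, so it is best read together with the plots.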

-> Constant variability of residuals

Residuals should be equally variable for low and high values of the predicted response variable. We can check this using plots of residuals vs. predicted values.

plot(rating_final$residuals ~ rating_final$fitted.values)
abline(0, 0)

plot of chunk unnamed-chunk-16

plot(abs(rating_final$residuals) ~ rating_final$fitted.values)
abline(0, 0)

plot of chunk unnamed-chunk-17

The plots of residuals versus fitted values show heteroscedasticity. This means that our model has more variability when predicting low imdb_rating values.
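The visual impression of non-constant variance can also be checked numerically. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R: the squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The simulated data and all object names here are assumptions for illustration, not part of the original analysis.

```r
# Simulated stand-in data (an assumption): the noise shrinks as the
# predictor grows, mimicking more spread at low ratings
set.seed(2)
n_obs     <- 300
sim_x     <- runif(n_obs, min = 0, max = 100)
sim_y     <- 3.5 + 0.04 * sim_x + rnorm(n_obs, sd = 1.5 - 0.012 * sim_x)
sim_model <- lm(sim_y ~ sim_x)

# Auxiliary regression of squared residuals on the predictor
sq_resid <- residuals(sim_model)^2
aux      <- lm(sq_resid ~ sim_x)

# Test statistic: n * R-squared, chi-squared with 1 df (one predictor)
bp_stat <- n_obs * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value points to heteroscedasticity. For a model with two predictors, both would enter the auxiliary regression and the degrees of freedom would be 2.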

-> Independence of residuals

Independent residuals mean independent observations. We can check this using a plot of residuals vs. order of data collection.

plot(rating_final$residuals)

plot of chunk unnamed-chunk-18

There is no specific pattern in the plot above. The observations appear to be independent of each other, so this condition is satisfied.

Part 5: Prediction

set.seed(3974)
# Sample row indexes: keep 650 movies for training and hold out 1 for testing
indexes <- sample(seq_len(nrow(movies)), size = floor(0.999 * nrow(movies)))

# Split data
train = movies[indexes,]
dim(train)  # 650 32

## [1] 650  32

test = movies[-indexes,]
dim(test) #1 32

## [1]  1 32

View(test)

predicted_rating <- predict(rating_final, test, interval = "prediction", level = 0.95, se.fit = TRUE)
predicted_rating

## $fit
##        fit      lwr     upr
## 1 7.310473 6.346216 8.27473
## 
## $se.fit
## [1] 0.02516751
## 
## $df
## [1] 648
## 
## $residual.scale
## [1] 0.4904126

This model predicts that the movie North Sea Hijack will have an imdb_rating within a 95% prediction interval of 4.6 to 6.5 points. The actual imdb_rating in the movies dataset is 6.3, which lies inside the prediction interval.

Part 6: Conclusion

This model has demonstrated that with only two predictors we can predict, with a certain amount of accuracy, the popularity of movies, using imdb_rating as the measure of popularity. Looking at the model's residual plots, there seems to be greater variance when rating bad movies than good movies, and critics are more likely than the audience to give extreme ratings.

One of the shortcomings of our model is that it was not able to include the box office list variable as one of its predictors. It seems reasonable that higher movie revenue is associated with popularity. The small number of movies in our dataset that appear on the list might have reduced the variable's ability to reach a small p-value. It might have been better to use a continuous variable of gross movie sales. We also could have used a larger number of observations in our testing data set to get a better measure of the model's accuracy.
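The idea of a larger testing set can be sketched as follows. This is a minimal illustration, assuming a 20% hold-out, with the built-in mtcars data standing in for movies. The object names (`train_set`, `test_set`, `holdout_fit`) are hypothetical. The model is fitted on the training rows only, so the test rows are unseen when they are predicted.

```r
# Sketch only: mtcars stands in for movies; a 20% hold-out is an assumption
set.seed(3974)
n_rows    <- nrow(mtcars)
test_idx  <- sample(seq_len(n_rows), size = round(0.2 * n_rows))
train_set <- mtcars[-test_idx, ]
test_set  <- mtcars[test_idx, ]

# Fit on the training rows only
holdout_fit <- lm(mpg ~ wt + hp, data = train_set)

# Predict the held-out rows with 95% prediction intervals
pred <- predict(holdout_fit, newdata = test_set,
                interval = "prediction", level = 0.95)

# Root mean squared error and share of actual values inside the interval
rmse     <- sqrt(mean((test_set$mpg - pred[, "fit"])^2))
coverage <- mean(test_set$mpg >= pred[, "lwr"] & test_set$mpg <= pred[, "upr"])
print(c(rmse = rmse, coverage = coverage))
```

With many held-out movies, the coverage figure shows how often the actual rating falls inside the 95% prediction interval, which a single test movie cannot show.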