library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)
library(grid)
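# Install corrplot and mlbench only if they are missing; calling install.packages() on a package that is already loaded fails with "Updating loaded packages".
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")
library(corrplot)
if (!requireNamespace("mlbench", quietly = TRUE)) install.packages("mlbench")
library(mlbench)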
load("movies.Rdata")
View(movies)
This dataset comprises 651 randomly sampled movies produced or released before 2016. It includes information from Rotten Tomatoes and IMDb, with 32 attributes per movie. The study is observational because the data were collected in a way that does not directly interfere with how the data arise. It is retrospective, since it uses data from the past. Because the movies were randomly sampled, the results can be generalized to the population the sample was drawn from; because there was no random assignment, they cannot be used to establish causation.
We cannot establish causal connections between the explanatory variables and the response variable, because this is an observational study with no random assignment. Any relationships found here are associations, not causal effects.
imdb_rating is a score on a scale of 1 to 10 given by IMDb users. Each user has one vote per title and can change that vote at any time. The votes are converted into a weighted mean rating that is displayed beside each title on the website. imdb_num_votes is the number of IMDb users who voted for a particular film. It is an open-ended count starting at 0, with no fixed upper limit other than the number of users who choose to vote on a film.
critics_score and audience_score are ratings from the website Rotten Tomatoes. Both are on a 100-point scale. Each movie features a “user average,” which calculates the percentage of users who have rated the film positively. Users rate the movie on a scale of 0-10, while critics' reviews generally use 4-star ratings and are often qualitative.
Which attributes are significant predictors of a movie's popularity, as measured by its IMDb rating? Is there anything new to learn about movies along the way?
summary(movies)
## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
attach(movies)
names(movies)
## [1] "title" "title_type" "genre"
## [4] "runtime" "mpaa_rating" "studio"
## [7] "thtr_rel_year" "thtr_rel_month" "thtr_rel_day"
## [10] "dvd_rel_year" "dvd_rel_month" "dvd_rel_day"
## [13] "imdb_rating" "imdb_num_votes" "critics_rating"
## [16] "critics_score" "audience_rating" "audience_score"
## [19] "best_pic_nom" "best_pic_win" "best_actor_win"
## [22] "best_actress_win" "best_dir_win" "top200_box"
## [25] "director" "actor1" "actor2"
## [28] "actor3" "actor4" "actor5"
## [31] "imdb_url" "rt_url"
head(movies$genre)
## [1] Drama Drama Comedy Drama Horror Documentary
## 11 Levels: Action & Adventure Animation ... Science Fiction & Fantasy
summary(movies$genre)
## Action & Adventure Animation
## 65 9
## Art House & International Comedy
## 14 87
## Documentary Drama
## 52 305
## Horror Musical & Performing Arts
## 23 12
## Mystery & Suspense Other
## 59 16
## Science Fiction & Fantasy
## 9
summary(movies$runtime)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 39.0 92.0 103.0 105.8 115.8 267.0 1
summary(movies$studio)
## Paramount Pictures
## 37
## Warner Bros. Pictures
## 30
## Sony Pictures Home Entertainment
## 27
## Universal Pictures
## 23
## Warner Home Video
## 19
## 20th Century Fox
## 18
## Miramax Films
## 18
## MGM
## 16
## Twentieth Century Fox Home Entertainment
## 14
## IFC Films
## 13
## MCA Universal Home Video
## 13
## Paramount Home Video
## 12
## New Line Cinema
## 10
## Sony Pictures
## 10
## Sony Pictures Classics
## 10
## Buena Vista Pictures
## 9
## Magnolia Pictures
## 9
## MGM Home Entertainment
## 9
## WARNER BROTHERS PICTURES
## 9
## HBO Video
## 8
## Columbia Pictures
## 7
## Lions Gate Films
## 7
## Miramax
## 7
## New Line Home Entertainment
## 7
## The Weinstein Company
## 7
## Warner Bros.
## 7
## Buena Vista
## 6
## First Run Features
## 6
## United Artists
## 6
## 20th Century Fox Film Corporation
## 5
## Orion Home Video
## 5
## Paramount
## 5
## Focus Features
## 4
## Fox Searchlight
## 4
## Lionsgate
## 4
## Orion Pictures Corporation
## 4
## Paramount Studios
## 4
## Sony Pictures Entertainment
## 4
## Touchstone Pictures
## 4
## Fox
## 3
## Fox Searchlight Pictures
## 3
## Hollywood Pictures
## 3
## IFC
## 3
## Image Entertainment
## 3
## Independent Pictures
## 3
## Lionsgate Films
## 3
## New Yorker Films
## 3
## Screen Gems
## 3
## Summit Entertainment
## 3
## The Weinstein Co.
## 3
## ThinkFilm
## 3
## Walt Disney Pictures
## 3
## Warner Independent Pictures
## 3
## 20th Century Fox Film Corporat
## 2
## A24 Films
## 2
## Anchor Bay Entertainment
## 2
## Artisan Entertainment
## 2
## Buena Vista Distribution Compa
## 2
## Cowboy Pictures
## 2
## FilmDistrict
## 2
## Fox Atomic
## 2
## Lions Gate Releasing
## 2
## LionsGate Entertainment
## 2
## Live Home Video
## 2
## Music Box Films
## 2
## National Geographic Entertainment
## 2
## Nelson Entertainment
## 2
## Overture Films
## 2
## Republic Pictures Home Video
## 2
## Roadside Attractions
## 2
## Samuel Goldwyn Films
## 2
## Sony Pictures/Screen Gems
## 2
## Strand Releasing
## 2
## Trimark
## 2
## TriStar Pictures
## 2
## Universal Studios
## 2
## USA Films
## 2
## Walt Disney Productions
## 2
## Weinstein Company
## 2
## 7-57 Releasing
## 1
## 7th art
## 1
## 905 Corporation
## 1
## A24
## 1
## All Girl Productions
## 1
## Alliance Atlantis Communications
## 1
## American International Pictures
## 1
## Analysis
## 1
## Anchor Bay Films
## 1
## Arab Film Distribution
## 1
## Arenas Entertainment
## 1
## AVCO Embassy Pictures
## 1
## Bankside Films
## 1
## Blumhouse
## 1
## BMG
## 1
## Brainstorm Media
## 1
## Buena Vista Internationa
## 1
## Carnaby International
## 1
## Chloe Productions
## 1
## (Other)
## 113
## NA's
## 8
summary(movies$critics_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.00 61.00 57.69 83.00 100.00
summary(movies$critics_rating)
## Certified Fresh Fresh Rotten
## 135 209 307
summary(movies$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.36 80.00 97.00
summary(movies$audience_rating)
## Spilled Upright
## 275 376
summary(movies$imdb_num_votes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 180 4546 15120 57530 58300 893000
######################### Plotting ##########################
hist(movies$imdb_rating)
hist(movies$imdb_num_votes)
hist(movies$critics_score)
hist(movies$audience_score)
quantile(movies$imdb_num_votes, c(0, 0.25, 0.5, 0.75, 0.9, 1))
## 0% 25% 50% 75% 90% 100%
## 180.0 4545.5 15116.0 58300.5 151934.0 893008.0
imdb_rating comes closest to the shape of a normal distribution. imdb_num_votes is heavily right-skewed, with 90% of movies having 151,934 votes or fewer.
The distributions of critics_score and audience_score appear similar, except that audience_score tapers off more at both ends.
We will choose imdb_rating as our response variable.
data1 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = critics_rating)) + geom_point()
data2 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = audience_rating)) + geom_point()
grid.arrange(data1, data2, nrow = 1, ncol = 2)
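Of the two Rotten Tomatoes scores, we use critics_score on the x-axis of the following plots because its companion variable, critics_rating, has three categories (Certified Fresh, Fresh, Rotten), whereas audience_rating has only two.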
hist(movies$thtr_rel_year, col = "green" )
hist(movies$thtr_rel_month, col = "red")
hist(movies$thtr_rel_day, col = "yellow" )
d1 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = thtr_rel_year)) + geom_point()
d2 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = thtr_rel_month)) + geom_point()
d3 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = thtr_rel_day)) + geom_point()
grid.arrange(d1, d2, d3, nrow = 1, ncol = 3)
The histograms above show that there are particular years, months, and days where more movies are released in theaters for the first time. We do not observe any clustering of these points in the scatterplot along the imdb_rating and critics_score scale.
hist(movies$dvd_rel_year, col = "green" )
hist(movies$dvd_rel_month, col = "red")
hist(movies$dvd_rel_day, col = "yellow" )
d4 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = dvd_rel_year)) + geom_point()
d5 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = dvd_rel_month)) + geom_point()
d6 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = dvd_rel_day)) + geom_point()
grid.arrange(d4, d5, d6, nrow = 1, ncol = 3)
The histograms above show that there are particular years, months, and days in which more DVDs are released for the first time. We do not observe any clustering of these points in the scatterplots along the imdb_rating and critics_score scales.
ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = top200_box)) + geom_point()
ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = best_actor_win)) + geom_point()
ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = best_actress_win)) + geom_point()
ggplot(data = movies, aes(y = imdb_rating, x = critics_score, color = best_dir_win)) + geom_point()
# Select the numerical variables
M <- movies[, c("runtime","critics_score", "audience_score", "imdb_num_votes","imdb_rating")]
head(M)
## # A tibble: 6 x 5
## runtime critics_score audience_score imdb_num_votes imdb_rating
## <dbl> <dbl> <dbl> <int> <dbl>
## 1 80.0 45.0 73.0 899 5.50
## 2 101 96.0 81.0 12285 7.30
## 3 84.0 91.0 91.0 22381 7.60
## 4 139 80.0 76.0 35096 7.20
## 5 90.0 33.0 27.0 2386 5.10
## 6 78.0 91.0 86.0 333 7.80
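# runtime has one missing value, so use complete observations only (otherwise its correlations are NA)
M1 <- cor(M, use = "complete.obs")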
corrplot(M1, method = "square", type = "upper", order= "alphabet")
Because the distribution of imdb_rating is close to normal, we choose it as the response variable.
We choose audience_score as the first explanatory variable because it is the variable most strongly correlated with the response.
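The choice of the first predictor can be made explicit by ranking the candidates by their absolute correlation with the response. The sketch below is an illustration on a small simulated data frame: the column names mirror the ones used here, but the values and coefficients are made up. It uses pairwise-complete observations so that a single missing runtime does not turn that variable's correlation into NA.

```r
# Simulated stand-in for the numeric columns of `movies` (all values are made up).
set.seed(1)
n <- 200
audience_score <- runif(n, 10, 100)
critics_score  <- pmin(100, pmax(1, audience_score + rnorm(n, 0, 20)))
runtime        <- rnorm(n, 105, 20)
runtime[5]     <- NA                       # mimic the single missing runtime
imdb_rating    <- 3.6 + 0.035 * audience_score + 0.012 * critics_score +
                  rnorm(n, 0, 0.5)
sim <- data.frame(runtime, critics_score, audience_score, imdb_rating)

# Correlation of every candidate predictor with the response.
cors <- cor(sim[, c("runtime", "critics_score", "audience_score")],
            sim$imdb_rating, use = "pairwise.complete.obs")[, 1]

# Rank by absolute correlation, strongest first.
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)
```

The first name in `ranked` is the candidate to enter the model first; the same two lines can be pointed at `M` and `M$imdb_rating` instead of the simulated data.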
lm_as <- lm(imdb_rating ~ audience_score)
summary(lm_as)
##
## Call:
## lm(formula = imdb_rating ~ audience_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2082 -0.1866 0.0712 0.3093 1.1516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.599992 0.069291 51.95 <2e-16 ***
## audience_score 0.046392 0.001057 43.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.545 on 649 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.7476
## F-statistic: 1926 on 1 and 649 DF, p-value: < 2.2e-16
The summary above shows that audience_score is a significant predictor: its p-value is below 2e-16, and the model has an R-squared of 0.748.
The second variable we add is critics_score, as it has the second-highest correlation with imdb_rating.
ml2 <- lm(imdb_rating ~ audience_score + critics_score)
summary(ml2)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.51964 -0.19767 0.03466 0.30671 1.22691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.647241 0.062471 58.38 <2e-16 ***
## audience_score 0.034703 0.001340 25.90 <2e-16 ***
## critics_score 0.011816 0.000954 12.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4904 on 648 degrees of freedom
## Multiple R-squared: 0.7962, Adjusted R-squared: 0.7956
## F-statistic: 1266 on 2 and 648 DF, p-value: < 2.2e-16
Both R-squared and adjusted R-squared increased after adding critics_score (adjusted R-squared rose from 0.7476 to 0.7956), and the p-values of both predictors are below 0.05. Hence, audience_score and critics_score are significant predictors.
Next, we add imdb_num_votes to the model.
ml3 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes)
summary(ml3)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49004 -0.18552 0.02332 0.29450 1.17298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.683e+00 6.192e-02 59.471 < 2e-16 ***
## audience_score 3.340e-02 1.347e-03 24.794 < 2e-16 ***
## critics_score 1.178e-02 9.387e-04 12.552 < 2e-16 ***
## imdb_num_votes 8.335e-07 1.764e-07 4.726 2.82e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4825 on 647 degrees of freedom
## Multiple R-squared: 0.803, Adjusted R-squared: 0.8021
## F-statistic: 879.3 on 3 and 647 DF, p-value: < 2.2e-16
Adding imdb_num_votes to the model increased R-squared and adjusted R-squared only slightly (adjusted R-squared rose from 0.7956 to 0.8021), even though its p-value is small.
ml4 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes + best_dir_win + best_actor_win + best_actress_win + best_pic_nom + best_pic_win)
summary(ml4)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes +
## best_dir_win + best_actor_win + best_actress_win + best_pic_nom +
## best_pic_win)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.46429 -0.19315 0.02108 0.28757 1.19805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.654e+00 6.321e-02 57.811 < 2e-16 ***
## audience_score 3.366e-02 1.352e-03 24.891 < 2e-16 ***
## critics_score 1.160e-02 9.450e-04 12.280 < 2e-16 ***
## imdb_num_votes 8.035e-07 1.886e-07 4.261 2.34e-05 ***
## best_dir_winyes 7.686e-02 8.173e-02 0.940 0.3474
## best_actor_winyes 9.206e-02 5.543e-02 1.661 0.0972 .
## best_actress_winyes 8.391e-02 6.219e-02 1.349 0.1777
## best_pic_nomyes -7.990e-02 1.254e-01 -0.637 0.5241
## best_pic_winyes -5.738e-02 2.219e-01 -0.259 0.7961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.482 on 642 degrees of freedom
## Multiple R-squared: 0.805, Adjusted R-squared: 0.8026
## F-statistic: 331.3 on 8 and 642 DF, p-value: < 2.2e-16
After adding best_dir_win, best_actor_win, best_actress_win, best_pic_nom, and best_pic_win to the model, the change in R-squared and adjusted R-squared is negligible (adjusted R-squared goes from 0.8021 to 0.8026). The p-values for all five variables are also above 0.05, so none of them is a significant predictor. Other categorical variables such as studio, genre, and mpaa_rating have too many levels to be useful.
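The step-by-step comparison above can be written as a loop. The following is a minimal sketch of forward selection by adjusted R-squared on simulated data. The variables x1, x2, x3 and the helper adj_r2() are hypothetical names introduced for illustration; they are not part of the movies dataset or of any package.

```r
# Simulated data: y depends on x1 and x2; x3 is pure noise (all values are made up).
set.seed(42)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 1 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# Hypothetical helper: adjusted R-squared of y regressed on the given variables.
adj_r2 <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  summary(lm(as.formula(paste("y ~", rhs)), data = sim))$adj.r.squared
}

selected   <- character(0)
candidates <- c("x1", "x2", "x3")
best <- adj_r2(selected)                   # intercept-only model as the baseline
repeat {
  if (length(candidates) == 0) break
  # Score each remaining candidate when added to the current model.
  scores <- sapply(candidates, function(v) adj_r2(c(selected, v)))
  if (max(scores) <= best) break           # stop when no candidate improves the fit
  pick       <- names(which.max(scores))
  selected   <- c(selected, pick)
  candidates <- setdiff(candidates, pick)
  best       <- max(scores)
}
print(selected)
print(best)
```

Adjusted R-squared rises whenever the added variable has |t| > 1, so this criterion alone is lenient and can admit a noise variable. That is one reason to look at p-values as well, as done above.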
Model Interpretation:
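We used a forward selection approach with combined criteria of p-value, adjusted R-squared, and logical reasoning. The final model is imdb_rating = 3.647 + 0.0347 × audience_score + 0.0118 × critics_score. Holding critics_score constant, each additional audience_score point is associated with an increase of about 0.035 in the predicted imdb_rating.
R-squared: 79.6% of the variability in imdb_rating is explained by the model. audience_score and critics_score both have p-values below 0.05.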
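As a worked example of the fitted equation, the sketch below plugs in the second row of head(M) (audience_score = 81, critics_score = 96, imdb_rating = 7.3). The coefficients are copied from the summary(ml2) output; the variable names b0, b_aud, and b_crit are introduced only for this illustration.

```r
# Coefficients copied from the summary(ml2) output above.
b0     <- 3.647241
b_aud  <- 0.034703
b_crit <- 0.011816

# Second row of head(M): audience_score = 81, critics_score = 96, imdb_rating = 7.3.
predicted <- b0 + b_aud * 81 + b_crit * 96
residual  <- 7.3 - predicted               # observed minus predicted

print(round(predicted, 2))  # 7.59
print(round(residual, 2))   # -0.29
```

The negative residual means the model overpredicts this movie's rating by roughly 0.3 points, which is well within the residual standard error of 0.49.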
Model Diagnostics:
-> Linear relationships between the numerical explanatory variables and the response variable
Each numerical explanatory variable should be linearly related to the response variable. We can check this condition with a residual plot (residuals vs. each explanatory variable). There are two numerical variables in the model.
rating_final <- lm(imdb_rating ~ audience_score + critics_score)
plot(rating_final$residuals ~ audience_score, main = "Residuals vs. audience_score")
abline(0, 0)
plot(rating_final$residuals ~ critics_score, main = "Residuals vs. critics_score")
abline(0, 0)
In the residual plots, the residuals scatter around 0 without an obvious pattern, which is what a linear relationship between each explanatory variable and the response looks like. This condition is met by our model.
-> Nearly normal residuals
On a residuals plot we look for random scatter of residuals around 0. This translates to a nearly normal distribution of residuals centered at 0. We can check this using a histogram or a normal probability plot.
hist(rating_final$residuals)
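The histogram above is left-skewed: the residuals reach down to about -2.5 but only up to about 1.2. This suggests that the model overpredicts some movies by a wide margin and may be less reliable when audience score or critics score is low.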
qqnorm(rating_final$residuals)
qqline(rating_final$residuals)
The points deviate from the line in the lower tail, but the deviation elsewhere is small.
Hence, the normality condition is only approximately met, and the model may be less reliable when audience score or critics score is low.
-> Constant variability of residuals
The residuals should be equally variable for low and high values of the predicted response variable. We can check this using plots of residuals vs. predicted values.
plot(rating_final$residuals ~ rating_final$fitted.values)
abline(0, 0)
plot(abs(rating_final$residuals) ~ rating_final$fitted.values)
abline(0, 0)
The plot of residuals versus fitted values shows heteroscedasticity. This means that our model's errors are more variable when it predicts a low imdb_rating.
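A visual impression of unequal spread can be backed by a simple number. The sketch below is an illustration on simulated data, not on the movies model: it fits a line to data whose noise shrinks as the predictor grows, then compares the residual standard deviation in the lower and upper halves of the fitted values. All values are made up.

```r
# Simulated data whose noise SD shrinks as x grows (all values are made up),
# mimicking larger scatter for low predicted ratings.
set.seed(7)
n <- 500
x <- runif(n, 10, 100)
y <- 3.6 + 0.045 * x + rnorm(n, 0, 1.0 - 0.007 * x)
fit <- lm(y ~ x)

# Split observations at the median fitted value and compare residual spread.
low_half <- fitted(fit) <= median(fitted(fit))
sd_low   <- sd(resid(fit)[low_half])
sd_high  <- sd(resid(fit)[!low_half])
print(c(sd_low = sd_low, sd_high = sd_high, ratio = sd_low / sd_high))
```

A ratio well above 1 points to more variable residuals at low fitted values. Applied to rating_final, the same comparison would use fitted(rating_final) and resid(rating_final).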
-> Independence of residuals
Independent residuals mean independent observations. We can check this using a plot of residuals vs. order of data collection.
plot(rating_final$residuals)
There is no specific pattern in above plot. The observations appear to be independent of each other. This condition is satisfied.
set.seed(3974)
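# Sample row indexes for the training set (0.999 * 651 is truncated to 650 rows)
indexes <- sample(1:nrow(movies), size = 0.999 * nrow(movies))
# Split the data: 650 rows for training, 1 row held out for testing
train <- movies[indexes, ]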
dim(train) # 650 32
## [1] 650 32
test = movies[-indexes,]
dim(test) #1 32
## [1] 1 32
View(test)
predicted_rating <- predict(rating_final, test, interval = "prediction", level = 0.95, se.fit = TRUE)
predicted_rating
## $fit
## fit lwr upr
## 1 7.310473 6.346216 8.27473
##
## $se.fit
## [1] 0.02516751
##
## $df
## [1] 648
##
## $residual.scale
## [1] 0.4904126
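For the held-out movie, the output above gives a predicted imdb_rating of 7.31 with a 95% prediction interval of 6.35 to 8.27 points. For the movie North Sea Hijack, the model gives a 95% prediction interval of 4.6 to 6.5 points; the original imdb_rating in the movies dataset is 6.3, which lies inside that interval. Which movie is held out depends on the random split, so the title and actual rating should be read from test$title and test$imdb_rating.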
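The lwr and upr values printed by predict() can be reproduced from the other components of its output. The sketch below copies $fit, $se.fit, $residual.scale, and $df from the output above and combines them the way a prediction interval for a single new observation is built; the variable names are chosen for this illustration only.

```r
# Values copied from the predict() output above.
fit            <- 7.310473
se_fit         <- 0.02516751
residual_scale <- 0.4904126
df             <- 648

# A new observation carries two sources of uncertainty: the estimated mean
# (se_fit) and the scatter of individual movies around it (residual_scale).
se_pred <- sqrt(se_fit^2 + residual_scale^2)
t_crit  <- qt(0.975, df)                   # two-sided 95% interval

lwr <- fit - t_crit * se_pred
upr <- fit + t_crit * se_pred
print(round(c(lwr = lwr, upr = upr), 3))
```

Almost all of the interval's width comes from residual_scale, so collecting more movies would narrow the interval only marginally; a better-fitting model would be needed for that.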
This model has demonstrated that, with only two predictors, we can predict the popularity of movies with reasonable accuracy, using imdb_rating as the measure of popularity. Looking at the model's residual plots, there seems to be greater variance for poorly rated movies than for well-rated ones, and critics seem more likely than audiences to give extreme scores.
One shortcoming of our model is that it did not include the box office variable (top200_box) as a predictor. It seems reasonable that higher revenue from movie sales would be associated with popularity. The small number of movies in our dataset that appear on the list may have limited the variable's ability to reach a small p-value. A continuous variable of gross movie sales might have worked better. We also could have used a larger number of observations in our testing data set to get a better measure of the model's accuracy; note too that rating_final was fit on all 651 movies, so the single test movie was not truly held out.
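A larger test set could be evaluated along the following lines. This is a sketch on simulated data under stated assumptions: the column names mirror the model above, the coefficients and noise level are made up, and the 80/20 split is an arbitrary choice. The model is fit on the training rows only, then scored on the held-out rows by root mean squared error and by how often the 95% prediction interval contains the actual value.

```r
# Simulated stand-in for the movies data (all values are made up).
set.seed(3974)
n <- 651
audience_score <- runif(n, 10, 100)
critics_score  <- runif(n, 1, 100)
imdb_rating    <- 3.65 + 0.035 * audience_score + 0.012 * critics_score +
                  rnorm(n, 0, 0.5)
sim <- data.frame(audience_score, critics_score, imdb_rating)

# Hold out 20% of the rows; fit on the remaining 80% only.
test_idx <- sample(seq_len(n), size = floor(0.2 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]
fit   <- lm(imdb_rating ~ audience_score + critics_score, data = train)

# Score the held-out rows.
pred     <- predict(fit, newdata = test, interval = "prediction", level = 0.95)
rmse     <- sqrt(mean((test$imdb_rating - pred[, "fit"])^2))
coverage <- mean(test$imdb_rating >= pred[, "lwr"] &
                 test$imdb_rating <= pred[, "upr"])
print(c(rmse = rmse, coverage = coverage))
```

Passing data = train to lm() is what keeps the test rows out of the fit; a model fit on attached columns of the full data frame sees every row.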