Sachin Shekhar (SachinShekhar@outlook.com)
Nidhi Misra (< email redacted >)
RSVP Movies is an Indian film production company which has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.
The production company wants to plan their every move analytically based on data and have approached you for help with this new project. You have been provided with the data of the movies that have been released in the past three years. You have to analyse the data set and draw meaningful insights that can help them start their new project.
USE rsvp;
SELECT 'movie' AS 'table', COUNT(*) AS row_count FROM movie
UNION
SELECT 'genre' AS 'table', COUNT(*) AS row_count FROM genre
UNION
SELECT 'director_mapping' AS 'table', COUNT(*) AS row_count FROM director_mapping
UNION
SELECT 'names' AS 'table', COUNT(*) AS row_count FROM names
UNION
SELECT 'ratings' AS 'table', COUNT(*) AS row_count FROM ratings
UNION
SELECT 'role_mapping' AS 'table', COUNT(*) AS row_count FROM role_mapping;
SELECT DISTINCT 'id' AS 'column_name' FROM movie WHERE id IS NULL
UNION
SELECT DISTINCT 'title' AS 'column_name' FROM movie WHERE title IS NULL
UNION
SELECT DISTINCT 'year' AS 'column_name' FROM movie WHERE year IS NULL
UNION
SELECT DISTINCT 'date_published' AS 'column_name' FROM movie WHERE date_published IS NULL
UNION
SELECT DISTINCT 'duration' AS 'column_name' FROM movie WHERE duration IS NULL
UNION
SELECT DISTINCT 'country' AS 'column_name' FROM movie WHERE country IS NULL
UNION
SELECT DISTINCT 'worlwide_gross_income' AS 'column_name' FROM movie WHERE worlwide_gross_income IS NULL
UNION
SELECT DISTINCT 'languages' AS 'column_name' FROM movie WHERE languages IS NULL
UNION
SELECT DISTINCT 'production_company' AS 'column_name' FROM movie WHERE production_company IS NULL;
Columns country, worlwide_gross_income, languages and production_company have null values.
/* Output format for the first part:
+---------------+-------------------+
| Year | number_of_movies|
+-------------------+----------------
| 2017 | 2134 |
| 2018 | . |
| 2019 | . |
+---------------+-------------------+
Output format for the second part of the question:
+---------------+-------------------+
| month_num | number_of_movies|
+---------------+----------------
| 1 | 134 |
| 2 | 231 |
| . | . |
+---------------+-------------------+ */
SELECT year, COUNT(id) AS number_of_movies
FROM movie
GROUP BY year;
SELECT
MONTH(date_published) AS month_num,
COUNT(id) AS number_of_movies
FROM movie
GROUP BY month_num
ORDER BY number_of_movies DESC;
The highest number of movies is produced in the month of March.
So, now that we have understood the month-wise trend of movies, let’s take a look at the other details in the movies table.
We know USA and India produces huge number of movies each year. Let's find the number of movies produced by USA or India for the last year.
SELECT
COUNT(id) AS movies_count
FROM
movie
WHERE
year = 2019
AND (country LIKE '%India%'
OR country LIKE '%USA%');
USA and India produced more than a thousand movies in the year 2019.
SELECT DISTINCT genre
FROM genre;
WITH movies_per_genre
AS (SELECT g.genre,
Count(m.id) AS movies_count
FROM movie m
INNER JOIN genre g
ON m.id = g.movie_id
GROUP BY g.genre)
SELECT genre
FROM movies_per_genre
WHERE movies_count = (SELECT Max(movies_count)
FROM movies_per_genre);
Drama has the highest number of movies produced overall. So, based on this insight, RSVP Movies should focus on the Drama genre. But wait, it is too early to decide. A movie can belong to two or more genres.
So, let’s find out the count of movies that belong to only one genre.
SELECT Count(*)
FROM (SELECT m.id,
Count(g.genre) AS genre_count
FROM movie m
INNER JOIN genre g
ON m.id = g.movie_id
GROUP BY m.id
HAVING genre_count = 1) genres_per_movie;
There are more than 3000 movies which has only one genre associated with them. So, this figure appears significant.
Note: The same movie can belong to multiple genres.
/* Output format:
+---------------+-------------------+
| genre | avg_duration |
+-------------------+----------------
| thriller | 105 |
| . | . |
| . | . |
+---------------+-------------------+ */
SELECT g.genre,
ROUND(AVG(duration),2) AS avg_duration
FROM movie m
INNER JOIN genre g
ON m.id = g.movie_id
GROUP BY g.genre
ORDER BY avg_duration DESC;
Now we know, movies of genre Drama (produced highest in number in 2019) has the average duration of 106.77 minutes.
thriller genre of movies among all the genres in terms of number of movies produced?/* Output format:
+---------------+-------------------+---------------------+
| genre | movie_count | genre_rank |
+---------------+-------------------+---------------------+
|drama | 2312 | 2 |
+---------------+-------------------+---------------------+*/
WITH movies_per_genre
AS (SELECT g.genre,
Count(m.id) AS movie_count,
Rank()
OVER(
ORDER BY Count(m.id) DESC) AS genre_rank
FROM movie m
INNER JOIN genre g
ON m.id = g.movie_id
GROUP BY g.genre)
SELECT *
FROM movies_per_genre
WHERE genre = 'thriller';
Thriller movies is in top 3 among all genres in terms of number of movies.
Rating > 8: Superhit movies
Rating between 7 and 8: Hit movies
Rating between 5 and 7: One-time-watch movies
Rating < 5: Flop movies
SELECT m.title,
r.avg_rating,
CASE
WHEN r.avg_rating > 8 THEN 'Superhit movies'
WHEN r.avg_rating BETWEEN 7 AND 8 THEN 'Hit movies'
WHEN r.avg_rating BETWEEN 5 AND 7 THEN 'One-time-watch movies'
ELSE 'Flop movies'
END AS movie_success_status
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
INNER JOIN genre using(movie_id)
WHERE genre = 'thriller';
/* Output format:
+---------------+-------------------+---------------------+----------------------+
| genre | avg_duration |running_total_duration|moving_avg_duration |
+---------------+-------------------+---------------------+----------------------+
| comdy | 145 | 106.2 | 128.42 |
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
+---------------+-------------------+---------------------+----------------------+*/
SELECT g.genre,
Round(Avg(m.duration), 2) AS avg_duration,
SUM(Round(Avg(m.duration), 2))
over(
ORDER BY g.genre ROWS unbounded preceding) AS running_total_duration,
Avg(Round(Avg(m.duration), 2))
over(
ORDER BY g.genre ROWS 10 preceding) AS moving_avg_duration
FROM movie m
inner join genre g
ON m.id = g.movie_id
GROUP BY g.genre
ORDER BY g.genre;
Note: The top 3 genres would have the most number of movies.
/* Output format:
+---------------+-------------------+---------------------+----------------------+-----------------+
| genre | year | movie_name |worldwide_gross_income|movie_rank |
+---------------+-------------------+---------------------+----------------------+-----------------+
| comedy | 2017 | indian | $103244842 | 1 |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
+---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH joined_table
AS (SELECT *,
CASE
WHEN Position('$' IN worlwide_gross_income) > 0 THEN ROUND(Substring(worlwide_gross_income, 3))
WHEN Position('INR' IN worlwide_gross_income) > 0 THEN Round(Substring(worlwide_gross_income, 5) / 79.78) -- Converted INR to $
END AS worldwide_gross_income
FROM movie m
inner join genre g
ON m.id = g.movie_id),
top_genres
AS (SELECT genre,
Rank()
over(
ORDER BY Count(movie_id) DESC) AS genre_rank
FROM joined_table
GROUP BY genre),
top_movies_per_year
AS (SELECT genre,
year,
title AS movie_name,
Concat('$ ', worldwide_gross_income) AS worldwide_gross_income,
Rank()
over(
PARTITION BY year
ORDER BY worldwide_gross_income DESC) AS movie_rank
FROM joined_table
WHERE genre IN (SELECT genre
FROM top_genres
WHERE genre_rank <= 3))
SELECT *
FROM top_movies_per_year
WHERE movie_rank <= 5;
/* Output format:
+---------------+-------------------+---------------------+----------------------+-----------------+-----------------+
| min_avg_rating| max_avg_rating | min_total_votes | max_total_votes |min_median_rating|min_median_rating|
+---------------+-------------------+---------------------+----------------------+-----------------+-----------------+
| 0 | 5 | 177 | 2000 | 0 | 8 |
+---------------+-------------------+---------------------+----------------------+-----------------+-----------------+*/
SELECT
MIN(avg_rating) AS min_avg_rating,
MAX(avg_rating) AS max_avg_rating,
MIN(total_votes) AS min_total_votes,
MAX(total_votes) AS max_total_votes,
MIN(median_rating) AS min_median_rating,
MAX(median_rating) AS max_median_rating
FROM
ratings;
So, the minimum and maximum values in each column of the ratings table are in the expected range. This implies there are no outliers in the table.
/* Output format:
+---------------+-------------------+---------------------+
| title | avg_rating | movie_rank |
+---------------+-------------------+---------------------+
| Fan | 9.6 | 5 |
| . | . | . |
| . | . | . |
| . | . | . |
+---------------+-------------------+---------------------+*/
WITH tmp_table
AS (SELECT m.title AS title,
r.avg_rating AS avg_rating,
Rank()
OVER(
ORDER BY r.avg_rating DESC) AS movie_rank
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id)
SELECT *
FROM tmp_table
WHERE movie_rank <= 10
LIMIT 10;
/* Output format:
+---------------+-------------------+
| median_rating | movie_count |
+-------------------+----------------
| 1 | 105 |
| . | . |
| . | . |
+---------------+-------------------+ */
SELECT
median_rating, COUNT(movie_id) AS movie_count
FROM
ratings
GROUP BY median_rating
ORDER BY movie_count DESC;
Movies with a median rating of 7 are highest in number.
/* Output format:
+---------------+-------------------+
| genre | movie_count |
+-------------------+----------------
| thriller | 105 |
| . | . |
| . | . |
+---------------+-------------------+ */
SELECT
g.genre AS genre, COUNT(DISTINCT m.id) AS movie_count
FROM
movie m
INNER JOIN
genre g ON m.id = g.movie_id
INNER JOIN
ratings r ON m.id = r.movie_id
WHERE
MONTH(m.date_published) = 3
AND m.year = 2017
AND m.country LIKE '%USA%'
AND r.total_votes > 1000
GROUP BY g.genre
ORDER BY movie_count DESC;
/* Output format:
+---------------+-------------------+---------------------+
| title | avg_rating | genre |
+---------------+-------------------+---------------------+
| Theeran | 8.3 | Thriller |
| . | . | . |
| . | . | . |
| . | . | . |
+---------------+-------------------+---------------------+*/
SELECT
m.title AS title,
r.avg_rating AS avg_rating,
g.genre AS genre
FROM
movie m
INNER JOIN
genre g ON m.id = g.movie_id
INNER JOIN
ratings r ON m.id = r.movie_id
WHERE
m.title LIKE 'The%' AND r.avg_rating > 8
ORDER BY genre;
SELECT
r.median_rating AS median_rating,
COUNT(DISTINCT m.id) AS movie_count
FROM
movie m
INNER JOIN
ratings r ON m.id = r.movie_id
WHERE
m.date_published BETWEEN '2018-04-01' AND '2019-04-01'
GROUP BY r.median_rating
HAVING median_rating = 8;
WITH german_vote_count
AS
(
SELECT sum(r.total_votes) AS german_vote_count
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
WHERE country LIKE '%germany%' ),
italian_vote_count
AS
(
SELECT sum(r.total_votes) AS italian_vote_count
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
WHERE country LIKE '%italy%' )
SELECT
CASE
WHEN german_vote_count - italian_vote_count > 0 THEN 'Yes'
ELSE 'No'
end AS result
FROM german_vote_count
JOIN italian_vote_count;
Answer is Yes.
/* Output format:
+---------------+-------------------+---------------------+----------------------+
| name_nulls | height_nulls |date_of_birth_nulls |known_for_movies_nulls|
+---------------+-------------------+---------------------+----------------------+
| 0 | 123 | 1234 | 12345 |
+---------------+-------------------+---------------------+----------------------+*/
SELECT
SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) AS name_nulls,
SUM(CASE WHEN height IS NULL THEN 1 ELSE 0 END) AS height_nulls,
SUM(CASE WHEN date_of_birth IS NULL THEN 1 ELSE 0 END) AS date_of_birth_nulls,
SUM(CASE WHEN known_for_movies IS NULL THEN 1 ELSE 0 END) AS known_for_movies_nulls
FROM names;
There are no Null value in the column 'name'.
/* Output format:
+---------------+-------------------+
| director_name | movie_count |
+---------------+-------------------|
|James Mangold | 4 |
| . | . |
| . | . |
+---------------+-------------------+ */
WITH joined_table
AS (SELECT *
FROM names n
INNER JOIN director_mapping d
ON n.id = d.name_id
INNER JOIN genre g using(movie_id)
INNER JOIN ratings r using(movie_id)
WHERE r.avg_rating > 8),
top_genres
AS (SELECT genre,
Rank()
OVER(
ORDER BY Count(DISTINCT movie_id) DESC) AS genre_rank
FROM joined_table
GROUP BY genre),
top_directors
AS (SELECT NAME AS director_name,
Count(DISTINCT movie_id) AS movie_count,
Rank()
OVER(
ORDER BY Count(DISTINCT movie_id) DESC) AS director_rank
FROM joined_table
WHERE genre IN (SELECT genre
FROM top_genres
WHERE genre_rank <= 3)
GROUP BY director_name)
SELECT director_name,
movie_count
FROM top_directors
WHERE director_rank <= 3;
James Mangold can be hired as the director for RSVP's next project.
Director id
Name
Number of movies
Average inter movie duration in days
Average movie ratings
Total votes
Min rating
Max rating
total movie durations
/* Output format:
+---------------+-------------------+---------------------+----------------------+--------------+--------------+------------+------------+----------------+
| director_id | director_name | number_of_movies | avg_inter_movie_days | avg_rating | total_votes | min_rating | max_rating | total_duration |
+---------------+-------------------+---------------------+----------------------+--------------+--------------+------------+------------+----------------+
|nm1777967 | A.L. Vijay | 5 | 177 | 5.65 | 1754 | 3.7 | 6.9 | 613 |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . | . |
+---------------+-------------------+---------------------+----------------------+--------------+--------------+------------+------------+----------------+
--------------------------------------------------------------------------------------------*/
WITH tmp_table
AS (SELECT d.name_id AS director_id,
n.NAME AS director_name,
movie_id,
m.date_published,
Lead(m.date_published, 1)
OVER(
partition BY d.name_id
ORDER BY m.date_published) AS next_date_published,
r.avg_rating AS avg_rating,
r.total_votes AS total_votes,
m.duration AS duration
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
INNER JOIN director_mapping d using(movie_id)
INNER JOIN names n
ON n.id = d.name_id),
tmp_table_2
AS (SELECT *,
Datediff(next_date_published, date_published) AS inter_movie_days
FROM tmp_table),
tmp_table_3
AS (SELECT director_id,
director_name,
Count(movie_id) AS number_of_movies,
Round(Avg(inter_movie_days)) AS avg_inter_movie_days,
Round(Sum(avg_rating * total_votes) / Sum(total_votes), 2) AS avg_rating,
Sum(total_votes) AS total_votes,
Min(avg_rating) AS min_rating,
Max(avg_rating) AS max_rating,
Sum(duration) AS total_duration,
Rank()
OVER(
ORDER BY Count(movie_id) DESC) AS director_rank
FROM tmp_table_2
GROUP BY director_id)
SELECT director_id,
director_name,
number_of_movies,
avg_inter_movie_days,
avg_rating,
total_votes,
min_rating,
max_rating,
total_duration
FROM tmp_table_3
WHERE director_rank <= 9;
/* Output format:
+------------------+-------------------+---------------------+
|production_company|movie_count | prod_company_rank|
+------------------+-------------------+---------------------+
| The Archers | 1 | 1 |
+------------------+-------------------+---------------------+*/
WITH tmp_table
AS (SELECT m.id AS movie_id,
m.production_company AS production_company
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
WHERE r.avg_rating > 8
AND m.production_company IS NOT NULL)
SELECT *
FROM (SELECT production_company,
Count(movie_id) AS movie_count,
Rank()
OVER(
ORDER BY Count(movie_id) DESC) AS prod_company_rank
FROM tmp_table
GROUP BY production_company) tmp_table_2
WHERE prod_company_rank = 1;
/* Output format:
+------------------+--------------------+---------------------+
|production_company|vote_count | prod_comp_rank|
+------------------+--------------------+---------------------+
| The Archers | 830 | 1 |
| . | . | . |
| . | . | . |
+-------------------+-------------------+---------------------+*/
WITH tmp_table
AS (SELECT m.production_company AS production_company,
Sum(r.total_votes) AS vote_count,
Rank()
OVER(
ORDER BY Sum(r.total_votes) DESC) AS prod_comp_rank
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
GROUP BY m.production_company)
SELECT *
FROM tmp_table
WHERE prod_comp_rank <= 3;
Marvel Studios rules the movie world.
/* Output format:
+-------------------+-------------------+---------------------+
|production_company |movie_count | prod_comp_rank|
+-------------------+-------------------+---------------------+
| The Archers | 830 | 1 |
| . | . | . |
| . | . | . |
+-------------------+-------------------+---------------------+*/
WITH top_prod_comp
AS (SELECT m.production_company AS production_company,
Count(m.id) AS movie_count,
Rank()
over(
ORDER BY Count(m.id) DESC) AS prod_comp_rank
FROM movie m
inner join ratings r
ON m.id = r.movie_id
WHERE r.median_rating >= 8
AND m.production_company IS NOT NULL
AND Position(',' IN m.languages) > 0
GROUP BY m.production_company)
SELECT *
FROM top_prod_comp
WHERE prod_comp_rank <= 2;
/* Output format:
+---------------+-------------------+
| actor_name | movie_count |
+-------------------+----------------
|Christain Bale | 10 |
| . | . |
+---------------+-------------------+ */
WITH tmp_table
AS (SELECT n.NAME AS actor_name,
Count(DISTINCT movie_id) AS movie_count,
Rank()
OVER(
ORDER BY Count(DISTINCT movie_id) DESC) AS actor_rank
FROM names n
INNER JOIN role_mapping rm
ON n.id = rm.name_id
INNER JOIN ratings r using(movie_id)
WHERE rm.category = 'actor'
AND r.median_rating >= 8
GROUP BY actor_name)
SELECT actor_name,
movie_count
FROM tmp_table
WHERE actor_rank <= 2;
Note: The actor should have acted in at least five Indian movies.
/* Output format:
+---------------+-------------------+---------------------+----------------------+-----------------+
| actor_name | total_votes | movie_count | actor_avg_rating |actor_rank |
+---------------+-------------------+---------------------+----------------------+-----------------+
| Yogi Babu | 3455 | 11 | 8.42 | 1 |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
+---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH tmp_table
AS (SELECT n.NAME AS actor_name,
Sum(r.total_votes) AS total_votes,
Count(DISTINCT movie_id) AS movie_count,
Round(Sum(r.avg_rating * r.total_votes) / Sum(r.total_votes), 2) AS actor_avg_rating
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
INNER JOIN role_mapping rm using(movie_id)
INNER JOIN names n
ON n.id = rm.name_id
WHERE rm.category = 'actor'
AND m.country LIKE '%India%'
GROUP BY n.NAME
HAVING movie_count >= 5)
SELECT *,
Dense_rank()
OVER(
ORDER BY actor_avg_rating DESC) AS actor_rank
FROM tmp_table
ORDER BY actor_rank,
total_votes DESC;
Top actor is Vijay Sethupathi
Note: The actresses should have acted in at least three Indian movies.
/* Output format:
+---------------+-------------------+---------------------+----------------------+-----------------+
| actress_name | total_votes | movie_count | actress_avg_rating |actress_rank |
+---------------+-------------------+---------------------+----------------------+-----------------+
| Tabu | 3455 | 11 | 8.42 | 1 |
| . | . | . | . | . |
| . | . | . | . | . |
| . | . | . | . | . |
+---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH tmp_table
AS
(
SELECT n.name AS actress_name ,
sum(r.total_votes) AS total_votes,
count(DISTINCT movie_id) AS movie_count,
round(sum(r.avg_rating * r.total_votes) / sum(r.total_votes), 2) AS actress_avg_rating
FROM movie m
INNER JOIN ratings r
ON m.id = r.movie_id
INNER JOIN role_mapping rm
USING (movie_id)
INNER JOIN names n
ON n.id = rm.name_id
WHERE rm.category = 'actress'
AND languages LIKE '%Hindi%'
AND country LIKE '%India%'
GROUP BY n.name
HAVING movie_count >= 3),
tmp_table_2
AS
(
SELECT *,
rank() over( ORDER BY actress_avg_rating DESC) AS actress_rank
FROM tmp_table )
SELECT *
FROM tmp_table_2
WHERE actress_rank <= 5
ORDER BY actress_rank,
total_votes DESC
LIMIT 5;
Taapsee Pannu tops with average rating 7.74.
/* Output format:
+---------------+-------------------+---------------------+----------------------+-----------------+
| actress_name | total_votes | movie_count |actress_avg_rating |actress_rank |
+---------------+-------------------+---------------------+----------------------+-----------------+
| Laura Dern | 1016 | 1 | 9.60 | 1 |
| . | . | . | . | . |
| . | . | . | . | . |
+---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH tmp_table AS
(
SELECT n.NAME AS actress_name,
Sum(r.total_votes) AS total_votes,
Count(movie_id) AS movie_count,
Round(Sum(r.avg_rating * r.total_votes) / Sum(r.total_votes), 2) AS actress_avg_rating
FROM genre g
INNER JOIN ratings r
using (movie_id)
INNER JOIN role_mapping rm
using (movie_id)
INNER JOIN names n
ON n.id = rm.name_id
WHERE rm.category = 'actress'
AND r.avg_rating > 8
AND g.genre = 'drama'
GROUP BY n.NAME ), tmp_table_2 AS
(
SELECT *,
Rank() OVER(ORDER BY actress_avg_rating DESC) AS actress_rank
FROM tmp_table )
SELECT *
FROM tmp_table_2
WHERE actress_rank <= 3
ORDER BY actress_rank,
total_votes DESC limit 3;