RSVP Movies Analytics

Sachin Shekhar (SachinShekhar@outlook.com)
Nidhi Misra (< email redacted >)

Problem Statement

RSVP Movies is an Indian film production company which has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.

The production company wants to plan their every move analytically based on data and have approached you for help with this new project. You have been provided with the data of the movies that have been released in the past three years. You have to analyse the data set and draw meaningful insights that can help them start their new project.

Importing Dataset

USE rsvp;

Exploring Database

Find the total number of rows in each table of the schema?

SELECT 'movie' AS 'table', COUNT(*) AS row_count FROM movie UNION SELECT 'genre' AS 'table', COUNT(*) AS row_count FROM genre UNION SELECT 'director_mapping' AS 'table', COUNT(*) AS row_count FROM director_mapping UNION SELECT 'names' AS 'table', COUNT(*) AS row_count FROM names UNION SELECT 'ratings' AS 'table', COUNT(*) AS row_count FROM ratings UNION SELECT 'role_mapping' AS 'table', COUNT(*) AS row_count FROM role_mapping;

Which columns in the movie table have null values?

SELECT DISTINCT 'id' AS 'column_name' FROM movie WHERE id IS NULL UNION SELECT DISTINCT 'title' AS 'column_name' FROM movie WHERE title IS NULL UNION SELECT DISTINCT 'year' AS 'column_name' FROM movie WHERE year IS NULL UNION SELECT DISTINCT 'date_published' AS 'column_name' FROM movie WHERE date_published IS NULL UNION SELECT DISTINCT 'duration' AS 'column_name' FROM movie WHERE duration IS NULL UNION SELECT DISTINCT 'country' AS 'column_name' FROM movie WHERE country IS NULL UNION SELECT DISTINCT 'worlwide_gross_income' AS 'column_name' FROM movie WHERE worlwide_gross_income IS NULL UNION SELECT DISTINCT 'languages' AS 'column_name' FROM movie WHERE languages IS NULL UNION SELECT DISTINCT 'production_company' AS 'column_name' FROM movie WHERE production_company IS NULL;

Columns country, worlwide_gross_income, languages and production_company have null values.

Release Year

Find the total number of movies released each year? How does the trend look month wise?

/* Output format for the first part: +---------------+-------------------+ | Year | number_of_movies| +-------------------+---------------- | 2017 | 2134 | | 2018 | . | | 2019 | . | +---------------+-------------------+ Output format for the second part of the question: +---------------+-------------------+ | month_num | number_of_movies| +---------------+---------------- | 1 | 134 | | 2 | 231 | | . | . | +---------------+-------------------+ */
SELECT year, COUNT(id) AS number_of_movies FROM movie GROUP BY year; SELECT MONTH(date_published) AS month_num, COUNT(id) AS number_of_movies FROM movie GROUP BY month_num ORDER BY number_of_movies DESC;

The highest number of movies is produced in the month of March. So, now that we have understood the month-wise trend of movies, let’s take a look at the other details in the movies table.
We know USA and India produces huge number of movies each year. Let's find the number of movies produced by USA or India for the last year.

How many movies were produced in the USA or India in the year 2019?

SELECT COUNT(id) AS movies_count FROM movie WHERE year = 2019 AND (country LIKE '%India%' OR country LIKE '%USA%');

USA and India produced more than a thousand movies in the year 2019.

Genre

Find the unique list of the genres present in the data set?

SELECT DISTINCT genre FROM genre;

Which genre had the highest number of movies produced overall?

WITH movies_per_genre AS (SELECT g.genre, Count(m.id) AS movies_count FROM movie m INNER JOIN genre g ON m.id = g.movie_id GROUP BY g.genre) SELECT genre FROM movies_per_genre WHERE movies_count = (SELECT Max(movies_count) FROM movies_per_genre);

Drama has the highest number of movies produced overall. So, based on this insight, RSVP Movies should focus on the Drama genre. But wait, it is too early to decide. A movie can belong to two or more genres. So, let’s find out the count of movies that belong to only one genre.

How many movies belong to only one genre?

SELECT Count(*) FROM (SELECT m.id, Count(g.genre) AS genre_count FROM movie m INNER JOIN genre g ON m.id = g.movie_id GROUP BY m.id HAVING genre_count = 1) genres_per_movie;

There are more than 3000 movies which has only one genre associated with them. So, this figure appears significant.

What is the average duration of movies in each genre?

Note: The same movie can belong to multiple genres.

/* Output format: +---------------+-------------------+ | genre | avg_duration | +-------------------+---------------- | thriller | 105 | | . | . | | . | . | +---------------+-------------------+ */
SELECT g.genre, ROUND(AVG(duration),2) AS avg_duration FROM movie m INNER JOIN genre g ON m.id = g.movie_id GROUP BY g.genre ORDER BY avg_duration DESC;

Now we know, movies of genre Drama (produced highest in number in 2019) has the average duration of 106.77 minutes.

What is the rank of the thriller genre of movies among all the genres in terms of number of movies produced?

/* Output format: +---------------+-------------------+---------------------+ | genre | movie_count | genre_rank | +---------------+-------------------+---------------------+ |drama | 2312 | 2 | +---------------+-------------------+---------------------+*/
WITH movies_per_genre AS (SELECT g.genre, Count(m.id) AS movie_count, Rank() OVER( ORDER BY Count(m.id) DESC) AS genre_rank FROM movie m INNER JOIN genre g ON m.id = g.movie_id GROUP BY g.genre) SELECT * FROM movies_per_genre WHERE genre = 'thriller';

Thriller movies is in top 3 among all genres in terms of number of movies.

Select thriller movies as per avg rating and classify them in the following category:

		Rating > 8: Superhit movies
		Rating between 7 and 8: Hit movies
		Rating between 5 and 7: One-time-watch movies
		Rating < 5: Flop movies
SELECT m.title, r.avg_rating, CASE WHEN r.avg_rating > 8 THEN 'Superhit movies' WHEN r.avg_rating BETWEEN 7 AND 8 THEN 'Hit movies' WHEN r.avg_rating BETWEEN 5 AND 7 THEN 'One-time-watch movies' ELSE 'Flop movies' END AS movie_success_status FROM movie m INNER JOIN ratings r ON m.id = r.movie_id INNER JOIN genre using(movie_id) WHERE genre = 'thriller';

What is the genre-wise running total and moving average of the average movie duration?

/* Output format: +---------------+-------------------+---------------------+----------------------+ | genre | avg_duration |running_total_duration|moving_avg_duration | +---------------+-------------------+---------------------+----------------------+ | comdy | 145 | 106.2 | 128.42 | | . | . | . | . | | . | . | . | . | | . | . | . | . | +---------------+-------------------+---------------------+----------------------+*/
SELECT g.genre, Round(Avg(m.duration), 2) AS avg_duration, SUM(Round(Avg(m.duration), 2)) over( ORDER BY g.genre ROWS unbounded preceding) AS running_total_duration, Avg(Round(Avg(m.duration), 2)) over( ORDER BY g.genre ROWS 10 preceding) AS moving_avg_duration FROM movie m inner join genre g ON m.id = g.movie_id GROUP BY g.genre ORDER BY g.genre;

Which are the five highest-grossing movies of each year that belong to the top three genres?

Note: The top 3 genres would have the most number of movies.

/* Output format: +---------------+-------------------+---------------------+----------------------+-----------------+ | genre | year | movie_name |worldwide_gross_income|movie_rank | +---------------+-------------------+---------------------+----------------------+-----------------+ | comedy | 2017 | indian | $103244842 | 1 | | . | . | . | . | . | | . | . | . | . | . | | . | . | . | . | . | +---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH joined_table AS (SELECT *, CASE WHEN Position('$' IN worlwide_gross_income) > 0 THEN ROUND(Substring(worlwide_gross_income, 3)) WHEN Position('INR' IN worlwide_gross_income) > 0 THEN Round(Substring(worlwide_gross_income, 5) / 79.78) -- Converted INR to $ END AS worldwide_gross_income FROM movie m inner join genre g ON m.id = g.movie_id), top_genres AS (SELECT genre, Rank() over( ORDER BY Count(movie_id) DESC) AS genre_rank FROM joined_table GROUP BY genre), top_movies_per_year AS (SELECT genre, year, title AS movie_name, Concat('$ ', worldwide_gross_income) AS worldwide_gross_income, Rank() over( PARTITION BY year ORDER BY worldwide_gross_income DESC) AS movie_rank FROM joined_table WHERE genre IN (SELECT genre FROM top_genres WHERE genre_rank <= 3)) SELECT * FROM top_movies_per_year WHERE movie_rank <= 5;

Ratings

Find the minimum and maximum values in each column of the ratings table except the movie_id column?

/* Output format: +---------------+-------------------+---------------------+----------------------+-----------------+-----------------+ | min_avg_rating| max_avg_rating | min_total_votes | max_total_votes |min_median_rating|min_median_rating| +---------------+-------------------+---------------------+----------------------+-----------------+-----------------+ | 0 | 5 | 177 | 2000 | 0 | 8 | +---------------+-------------------+---------------------+----------------------+-----------------+-----------------+*/
SELECT MIN(avg_rating) AS min_avg_rating, MAX(avg_rating) AS max_avg_rating, MIN(total_votes) AS min_total_votes, MAX(total_votes) AS max_total_votes, MIN(median_rating) AS min_median_rating, MAX(median_rating) AS max_median_rating FROM ratings;

So, the minimum and maximum values in each column of the ratings table are in the expected range. This implies there are no outliers in the table.

Which are the top 10 movies based on average rating?

/* Output format: +---------------+-------------------+---------------------+ | title | avg_rating | movie_rank | +---------------+-------------------+---------------------+ | Fan | 9.6 | 5 | | . | . | . | | . | . | . | | . | . | . | +---------------+-------------------+---------------------+*/
WITH tmp_table AS (SELECT m.title AS title, r.avg_rating AS avg_rating, Rank() OVER( ORDER BY r.avg_rating DESC) AS movie_rank FROM movie m INNER JOIN ratings r ON m.id = r.movie_id) SELECT * FROM tmp_table WHERE movie_rank <= 10 LIMIT 10;

Summarise the ratings table based on the movie counts by median ratings.

/* Output format: +---------------+-------------------+ | median_rating | movie_count | +-------------------+---------------- | 1 | 105 | | . | . | | . | . | +---------------+-------------------+ */
SELECT median_rating, COUNT(movie_id) AS movie_count FROM ratings GROUP BY median_rating ORDER BY movie_count DESC;

Movies with a median rating of 7 are highest in number.

How many movies released in each genre during March 2017 in the USA had more than 1,000 votes?

/* Output format: +---------------+-------------------+ | genre | movie_count | +-------------------+---------------- | thriller | 105 | | . | . | | . | . | +---------------+-------------------+ */
SELECT g.genre AS genre, COUNT(DISTINCT m.id) AS movie_count FROM movie m INNER JOIN genre g ON m.id = g.movie_id INNER JOIN ratings r ON m.id = r.movie_id WHERE MONTH(m.date_published) = 3 AND m.year = 2017 AND m.country LIKE '%USA%' AND r.total_votes > 1000 GROUP BY g.genre ORDER BY movie_count DESC;

Find movies of each genre that start with the word ‘The’ and which have an average rating > 8?

/* Output format: +---------------+-------------------+---------------------+ | title | avg_rating | genre | +---------------+-------------------+---------------------+ | Theeran | 8.3 | Thriller | | . | . | . | | . | . | . | | . | . | . | +---------------+-------------------+---------------------+*/
SELECT m.title AS title, r.avg_rating AS avg_rating, g.genre AS genre FROM movie m INNER JOIN genre g ON m.id = g.movie_id INNER JOIN ratings r ON m.id = r.movie_id WHERE m.title LIKE 'The%' AND r.avg_rating > 8 ORDER BY genre;

Of the movies released between 1 April 2018 and 1 April 2019, how many were given a median rating of 8?

SELECT r.median_rating AS median_rating, COUNT(DISTINCT m.id) AS movie_count FROM movie m INNER JOIN ratings r ON m.id = r.movie_id WHERE m.date_published BETWEEN '2018-04-01' AND '2019-04-01' GROUP BY r.median_rating HAVING median_rating = 8;

Do German movies get more votes than Italian movies?

WITH german_vote_count AS ( SELECT sum(r.total_votes) AS german_vote_count FROM movie m INNER JOIN ratings r ON m.id = r.movie_id WHERE country LIKE '%germany%' ), italian_vote_count AS ( SELECT sum(r.total_votes) AS italian_vote_count FROM movie m INNER JOIN ratings r ON m.id = r.movie_id WHERE country LIKE '%italy%' ) SELECT CASE WHEN german_vote_count - italian_vote_count > 0 THEN 'Yes' ELSE 'No' end AS result FROM german_vote_count JOIN italian_vote_count;

Answer is Yes.

Director

Which columns in the names table have null values?

/* Output format: +---------------+-------------------+---------------------+----------------------+ | name_nulls | height_nulls |date_of_birth_nulls |known_for_movies_nulls| +---------------+-------------------+---------------------+----------------------+ | 0 | 123 | 1234 | 12345 | +---------------+-------------------+---------------------+----------------------+*/
SELECT SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) AS name_nulls, SUM(CASE WHEN height IS NULL THEN 1 ELSE 0 END) AS height_nulls, SUM(CASE WHEN date_of_birth IS NULL THEN 1 ELSE 0 END) AS date_of_birth_nulls, SUM(CASE WHEN known_for_movies IS NULL THEN 1 ELSE 0 END) AS known_for_movies_nulls FROM names;

There are no Null value in the column 'name'.

Who are the top three directors in the top three genres whose movies have an average rating > 8?

/* Output format: +---------------+-------------------+ | director_name | movie_count | +---------------+-------------------| |James Mangold | 4 | | . | . | | . | . | +---------------+-------------------+ */
WITH joined_table AS (SELECT * FROM names n INNER JOIN director_mapping d ON n.id = d.name_id INNER JOIN genre g using(movie_id) INNER JOIN ratings r using(movie_id) WHERE r.avg_rating > 8), top_genres AS (SELECT genre, Rank() OVER( ORDER BY Count(DISTINCT movie_id) DESC) AS genre_rank FROM joined_table GROUP BY genre), top_directors AS (SELECT NAME AS director_name, Count(DISTINCT movie_id) AS movie_count, Rank() OVER( ORDER BY Count(DISTINCT movie_id) DESC) AS director_rank FROM joined_table WHERE genre IN (SELECT genre FROM top_genres WHERE genre_rank <= 3) GROUP BY director_name) SELECT director_name, movie_count FROM top_directors WHERE director_rank <= 3;

James Mangold can be hired as the director for RSVP's next project.

Get the following details for top 9 directors (based on number of movies):

Director id
Name
Number of movies
Average inter movie duration in days
Average movie ratings
Total votes
Min rating
Max rating
total movie durations
/* Output format: +---------------+-------------------+---------------------+----------------------+--------------+--------------+------------+------------+----------------+ | director_id | director_name | number_of_movies | avg_inter_movie_days | avg_rating | total_votes | min_rating | max_rating | total_duration | +---------------+-------------------+---------------------+----------------------+--------------+--------------+------------+------------+----------------+ |nm1777967 | A.L. Vijay | 5 | 177 | 5.65 | 1754 | 3.7 | 6.9 | 613 | | . | . | . | . | . | . | . | . | . | | . | . | . | . | . | . | . | . | . | | . | . | . | . | . | . | . | . | . | | . | . | . | . | . | . | . | . | . | | . | . | . | . | . | . | . | . | . | | . | . | . | . | . | . | . | . | . | | . | . | . | . | . | . | . | . | . | | . | . | . | . | . | . | . | . | . | +---------------+-------------------+---------------------+----------------------+--------------+--------------+------------+------------+----------------+ --------------------------------------------------------------------------------------------*/
WITH tmp_table AS (SELECT d.name_id AS director_id, n.NAME AS director_name, movie_id, m.date_published, Lead(m.date_published, 1) OVER( partition BY d.name_id ORDER BY m.date_published) AS next_date_published, r.avg_rating AS avg_rating, r.total_votes AS total_votes, m.duration AS duration FROM movie m INNER JOIN ratings r ON m.id = r.movie_id INNER JOIN director_mapping d using(movie_id) INNER JOIN names n ON n.id = d.name_id), tmp_table_2 AS (SELECT *, Datediff(next_date_published, date_published) AS inter_movie_days FROM tmp_table), tmp_table_3 AS (SELECT director_id, director_name, Count(movie_id) AS number_of_movies, Round(Avg(inter_movie_days)) AS avg_inter_movie_days, Round(Sum(avg_rating * total_votes) / Sum(total_votes), 2) AS avg_rating, Sum(total_votes) AS total_votes, Min(avg_rating) AS min_rating, Max(avg_rating) AS max_rating, Sum(duration) AS total_duration, Rank() OVER( ORDER BY Count(movie_id) DESC) AS director_rank FROM tmp_table_2 GROUP BY director_id) SELECT director_id, director_name, number_of_movies, avg_inter_movie_days, avg_rating, total_votes, min_rating, max_rating, total_duration FROM tmp_table_3 WHERE director_rank <= 9;

Production House

Which production house has produced the most number of hit movies (average rating > 8)?

/* Output format: +------------------+-------------------+---------------------+ |production_company|movie_count | prod_company_rank| +------------------+-------------------+---------------------+ | The Archers | 1 | 1 | +------------------+-------------------+---------------------+*/
WITH tmp_table AS (SELECT m.id AS movie_id, m.production_company AS production_company FROM movie m INNER JOIN ratings r ON m.id = r.movie_id WHERE r.avg_rating > 8 AND m.production_company IS NOT NULL) SELECT * FROM (SELECT production_company, Count(movie_id) AS movie_count, Rank() OVER( ORDER BY Count(movie_id) DESC) AS prod_company_rank FROM tmp_table GROUP BY production_company) tmp_table_2 WHERE prod_company_rank = 1;

Which are the top three production houses based on the number of votes received by their movies?

/* Output format: +------------------+--------------------+---------------------+ |production_company|vote_count | prod_comp_rank| +------------------+--------------------+---------------------+ | The Archers | 830 | 1 | | . | . | . | | . | . | . | +-------------------+-------------------+---------------------+*/
WITH tmp_table AS (SELECT m.production_company AS production_company, Sum(r.total_votes) AS vote_count, Rank() OVER( ORDER BY Sum(r.total_votes) DESC) AS prod_comp_rank FROM movie m INNER JOIN ratings r ON m.id = r.movie_id GROUP BY m.production_company) SELECT * FROM tmp_table WHERE prod_comp_rank <= 3;

Marvel Studios rules the movie world.

Which are the top two production houses that have produced the highest number of hits (median rating >= 8) among multilingual movies?

/* Output format: +-------------------+-------------------+---------------------+ |production_company |movie_count | prod_comp_rank| +-------------------+-------------------+---------------------+ | The Archers | 830 | 1 | | . | . | . | | . | . | . | +-------------------+-------------------+---------------------+*/
WITH top_prod_comp AS (SELECT m.production_company AS production_company, Count(m.id) AS movie_count, Rank() over( ORDER BY Count(m.id) DESC) AS prod_comp_rank FROM movie m inner join ratings r ON m.id = r.movie_id WHERE r.median_rating >= 8 AND m.production_company IS NOT NULL AND Position(',' IN m.languages) > 0 GROUP BY m.production_company) SELECT * FROM top_prod_comp WHERE prod_comp_rank <= 2;

Actors

Who are the top two actors whose movies have a median rating >= 8?

/* Output format: +---------------+-------------------+ | actor_name | movie_count | +-------------------+---------------- |Christain Bale | 10 | | . | . | +---------------+-------------------+ */
WITH tmp_table AS (SELECT n.NAME AS actor_name, Count(DISTINCT movie_id) AS movie_count, Rank() OVER( ORDER BY Count(DISTINCT movie_id) DESC) AS actor_rank FROM names n INNER JOIN role_mapping rm ON n.id = rm.name_id INNER JOIN ratings r using(movie_id) WHERE rm.category = 'actor' AND r.median_rating >= 8 GROUP BY actor_name) SELECT actor_name, movie_count FROM tmp_table WHERE actor_rank <= 2;

Rank actors with movies released in India based on their average ratings. Which actor is at the top of the list?

Note: The actor should have acted in at least five Indian movies.

/* Output format: +---------------+-------------------+---------------------+----------------------+-----------------+ | actor_name | total_votes | movie_count | actor_avg_rating |actor_rank | +---------------+-------------------+---------------------+----------------------+-----------------+ | Yogi Babu | 3455 | 11 | 8.42 | 1 | | . | . | . | . | . | | . | . | . | . | . | | . | . | . | . | . | +---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH tmp_table AS (SELECT n.NAME AS actor_name, Sum(r.total_votes) AS total_votes, Count(DISTINCT movie_id) AS movie_count, Round(Sum(r.avg_rating * r.total_votes) / Sum(r.total_votes), 2) AS actor_avg_rating FROM movie m INNER JOIN ratings r ON m.id = r.movie_id INNER JOIN role_mapping rm using(movie_id) INNER JOIN names n ON n.id = rm.name_id WHERE rm.category = 'actor' AND m.country LIKE '%India%' GROUP BY n.NAME HAVING movie_count >= 5) SELECT *, Dense_rank() OVER( ORDER BY actor_avg_rating DESC) AS actor_rank FROM tmp_table ORDER BY actor_rank, total_votes DESC;

Top actor is Vijay Sethupathi

Find out the top five actresses in Hindi movies released in India based on their average ratings?

Note: The actresses should have acted in at least three Indian movies.

/* Output format: +---------------+-------------------+---------------------+----------------------+-----------------+ | actress_name | total_votes | movie_count | actress_avg_rating |actress_rank | +---------------+-------------------+---------------------+----------------------+-----------------+ | Tabu | 3455 | 11 | 8.42 | 1 | | . | . | . | . | . | | . | . | . | . | . | | . | . | . | . | . | +---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH tmp_table AS ( SELECT n.name AS actress_name , sum(r.total_votes) AS total_votes, count(DISTINCT movie_id) AS movie_count, round(sum(r.avg_rating * r.total_votes) / sum(r.total_votes), 2) AS actress_avg_rating FROM movie m INNER JOIN ratings r ON m.id = r.movie_id INNER JOIN role_mapping rm USING (movie_id) INNER JOIN names n ON n.id = rm.name_id WHERE rm.category = 'actress' AND languages LIKE '%Hindi%' AND country LIKE '%India%' GROUP BY n.name HAVING movie_count >= 3), tmp_table_2 AS ( SELECT *, rank() over( ORDER BY actress_avg_rating DESC) AS actress_rank FROM tmp_table ) SELECT * FROM tmp_table_2 WHERE actress_rank <= 5 ORDER BY actress_rank, total_votes DESC LIMIT 5;

Taapsee Pannu tops with average rating 7.74.

Who are the top 3 actresses based on number of Super Hit movies (average rating >8) in drama genre?

/* Output format: +---------------+-------------------+---------------------+----------------------+-----------------+ | actress_name | total_votes | movie_count |actress_avg_rating |actress_rank | +---------------+-------------------+---------------------+----------------------+-----------------+ | Laura Dern | 1016 | 1 | 9.60 | 1 | | . | . | . | . | . | | . | . | . | . | . | +---------------+-------------------+---------------------+----------------------+-----------------+*/
WITH tmp_table AS ( SELECT n.NAME AS actress_name, Sum(r.total_votes) AS total_votes, Count(movie_id) AS movie_count, Round(Sum(r.avg_rating * r.total_votes) / Sum(r.total_votes), 2) AS actress_avg_rating FROM genre g INNER JOIN ratings r using (movie_id) INNER JOIN role_mapping rm using (movie_id) INNER JOIN names n ON n.id = rm.name_id WHERE rm.category = 'actress' AND r.avg_rating > 8 AND g.genre = 'drama' GROUP BY n.NAME ), tmp_table_2 AS ( SELECT *, Rank() OVER(ORDER BY actress_avg_rating DESC) AS actress_rank FROM tmp_table ) SELECT * FROM tmp_table_2 WHERE actress_rank <= 3 ORDER BY actress_rank, total_votes DESC limit 3;