# Movie Recommendation Algorithm
*Team Project: Ahmed, Dennis, Pedro and Steven. This particular notebook was written by Pedro and Dennis, with review by Ahmed and Steven, and with comments and prose by Pedro.*

# Table of Contents


1.   Preparing the Dataset
  - Importing the data, merging separate .csv files, creating word soups!
2.   Creating Our Recommendation Model
  - Gathering user input, vectorizing word soups, investigating cosine similarity
3.   References



In this notebook, we will show how we modified code from Datacamp's [movie recommendation algorithm tutorial](https://www.datacamp.com/community/tutorials/recommender-systems-python), and data from [Kaggle](https://www.kaggle.com/rounakbanik/the-movies-dataset), to create a flexible movie recommendation chatbot. 

Our idea was to build a chatbot that you can talk to in order to get good movie recommendations. We wanted it to be flexible enough to be more useful than simply looking up 'best movie' lists, yet constrained enough that we could still make good recommendations. Our approach was therefore to build on the cosine similarity model presented by Datacamp (linked above) to create a system that does much more than take in a movie title and return the most similar movies: we built a model that gives increasingly precise recommendations the more you tell the chatbot about the genres, actors, and even plot elements that you enjoy (or are looking for at the moment).

## Preparing the Dataset

The first step in the process was to import the data. We downloaded the data from Kaggle and stored it in Google Drive for use with Colab (having the data in the cloud makes it easier for others to access the latest version). Kaggle's dataset contains four relevant .csv files, each holding different information for around 45,000 movies. We import these .csv files from Google Drive into Python using pandas.

In [None]:
#Mounting Google Drive to access the dataset
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

#Importing the relevant datasets from the mounted Google Drive (change the code below if the data is hosted elsewhere)
metadata = pd.read_csv("/content/drive/Shareddrives/Tutorial_GProject /Pedro's Experimenting Data/movies_metadata.csv")
ratings = pd.read_csv("/content/drive/Shareddrives/Tutorial_GProject /Pedro's Experimenting Data/ratings.csv")
credits = pd.read_csv("/content/drive/Shareddrives/Tutorial_GProject /Pedro's Experimenting Data/credits.csv")
keywords = pd.read_csv("/content/drive/Shareddrives/Tutorial_GProject /Pedro's Experimenting Data/keywords.csv")



The first step when we get access to new data is to take a look at the format of the tables and the content of the columns, so that we know how to move forward.

In [None]:
#Data Exploration
#Here we explore the shape of the data and the names of its columns

#check the shape of the DataFrame (rows, columns)
#check the data columns
print("metadata shape:",metadata.shape)
print("metadata columns name:", metadata.columns)
print() 

print("ratings shape:",ratings.shape)
print("ratings columns name:", list(ratings.columns))
print()

print("credits shape:",credits.shape)
print("columns name:", list(credits.columns))
print()

print("keywords shape:",keywords.shape)
print("columns name:", list(keywords.columns))
print()

metadata shape: (45466, 24)
metadata columns name: Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

ratings shape: (26024289, 4)
ratings columns name: ['userId', 'movieId', 'rating', 'timestamp']

credits shape: (45476, 3)
columns name: ['cast', 'crew', 'id']

keywords shape: (46419, 2)
columns name: ['id', 'keywords']



As we can see above, these tables are not yet organized well enough to be used together. They all have different dimensions, and merging them for our recommendation engine could get quite confusing. Luckily, the metadata, credits and keywords tables all share an `id` column (the ratings table has a `movieId` column instead). This numeric movie identifier, which is distinct from the `imdb_id` column in the metadata table, allows us to merge the tables effectively.

In [None]:
#Cutting the data to reduce resource use: see the discussion of cosine
#similarity further below for WHY we had to cut the data down by so much

#Re-importing the data here: otherwise, every time this cell is re-run, the cut
#and the merge would be applied to an already-merged dataframe
metadata = pd.read_csv("/content/drive/Shareddrives/Tutorial_GProject /Pedro's Experimenting Data/movies_metadata.csv")
metadata = metadata.iloc[0:10000,:].copy() #.copy() avoids a SettingWithCopyWarning on the id conversion below

# Convert IDs to int. Required for merging on id using pandas .merge command
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe: this will look
#for candidates on the credits and keywords tables that have ids that match those
#in the metadata table, which we will use as our main data from now on.
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')
metadata.shape



(10048, 27)
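Note that the merged table has 10,048 rows even though we kept only 10,000 movies. One plausible explanation, which we did not verify in this notebook, is that some ids appear more than once in the credits or keywords tables: an inner merge emits one row per matching pair. The sketch below illustrates the effect on made-up miniature tables (all ids and values are invented):

```python
import pandas as pd

# Hypothetical miniature tables; the ids and values are invented.
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({"id": [1, 2, 2, 4], "cast": ["x", "y", "y", "z"]})

# The default merge is an inner join: id 3 has no match and is dropped,
# while id 2 matches twice and therefore appears twice in the result.
merged = movies.merge(credits_toy, on="id")
print(merged.shape)  # (3, 3)

# Dropping duplicate ids first keeps at most one row per movie.
deduped = movies.merge(credits_toy.drop_duplicates(subset="id"), on="id")
print(deduped.shape)  # (2, 3)
```

If duplicates turn out to be the cause, calling `drop_duplicates(subset="id")` on the credits and keywords tables before merging would be one way to keep a single row per movie.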

Now, let us take a look into what our merged table looks like:

In [None]:
# Print the important features of metadata
metadata[['title', 'cast', 'crew', 'keywords', 'genres']].head()

| |title|cast|crew|keywords|genres|
|---|---|---|---|---|---|
|0|Toy Story|[{'cast_id': 14, 'character': 'Woody (voice)',...|[{'credit_id': '52fe4284c3a36847f8024f49', 'de...|[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...|[{'id': 16, 'name': 'Animation'}, {'id': 35, '...|
|1|Jumanji|[{'cast_id': 1, 'character': 'Alan Parrish', '...|[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...|[{'id': 10090, 'name': 'board game'}, {'id': 1...|[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...|
|2|Grumpier Old Men|[{'cast_id': 2, 'character': 'Max Goldman', 'c...|[{'credit_id': '52fe466a9251416c75077a89', 'de...|[{'id': 1495, 'name': 'fishing'}, {'id': 12392...|[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...|
|3|Waiting to Exhale|[{'cast_id': 1, 'character': "Savannah 'Vannah...|[{'credit_id': '52fe44779251416c91011acb', 'de...|[{'id': 818, 'name': 'based on novel'}, {'id':...|[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...|
|4|Father of the Bride Part II|[{'cast_id': 1, 'character': 'George Banks', '...|[{'credit_id': '52fe44959251416c75039ed7', 'de...|[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...|[{'id': 35, 'name': 'Comedy'}]|


This looks good, but notice how the cast, crew, keywords and genres are stored in what **looks** like lists of dictionaries. Let us check what type they are really stored as:

In [None]:
print(metadata['keywords'][0])
print(type(metadata['keywords'][0]))

[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]
<class 'str'>


It is a string! Pandas reads text values like this from a .csv file as strings by default, but we don't want that. The data were originally Python objects, and lists and dictionaries are much more useful for our pre-processing. That is why we apply the `literal_eval` function below to the columns holding the information we want:

In [None]:
#Parse the stringified features into their corresponding Python objects.
#literal_eval only evaluates Python literals (strings, numbers, lists, dicts, ...)
#and raises an exception for anything else, so no arbitrary code gets executed.

from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

How about now? What datatype is each cell in the cast, crew, keywords and genres columns?

In [None]:
print(metadata['genres'][0])
print(type(metadata['genres'][0])) #type of the contents of the cell
print(type(metadata['genres'][0][0])) #type of the contents of the list

[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
<class 'list'>
<class 'dict'>
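To see the conversion in isolation, here is a small sketch on a hand-written string. The `safe_literal_eval` helper is hypothetical and not part of our notebook: it assumes we would rather get an empty list than an exception when a cell is missing or malformed.

```python
from ast import literal_eval

raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
parsed = literal_eval(raw)  # str -> list of dicts
print(type(raw).__name__, type(parsed).__name__, parsed[0]["name"])

def safe_literal_eval(value):
    """Hypothetical helper: return [] instead of raising on bad input."""
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []

print(safe_literal_eval(float("nan")))  # a missing cell gives []
print(safe_literal_eval("[1, 2"))       # a malformed string gives []
```

In our notebook the plain `literal_eval` was enough for the first 10,000 rows, so a wrapper like this would only matter if the full dataset contained missing cells in these columns.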


Great! Now we have recovered the python objects. The next step is to make use of the data objects to extract the information we want. For instance, now that we have the crew information as a series of dictionaries within a list, we may want to extract the director. First, we will want to understand what each dictionary stores exactly:

In [None]:
print(metadata['crew'][0])
print(type(metadata['crew'][0])) #type of the contents of the cell
print(type(metadata['crew'][0][0])) #type of the contents of the list

[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, {'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}, {'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}, {'credit_id': '52fe4284c3a36847f8024f5b', 'department': 'Writing', 'gender': 2, 'id': 12892, 'job': 'Screenplay', 'name': 'Joel Cohen', 'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'}, {'credit_id': '52fe4284c3a36847f8024f61', 'department': 'Writing', 'gender': 0, 'id': 12893, 'job': 'Screenplay', 'name': 'Alec Sokolow', 'profile_path': '/v79vlRYi94BZUQnkkyznbGUZLjT.jpg'}, {'credit_id': '52fe4284c3a36847f8024f67', 'depart

Seems like the crew list for a particular movie has one dictionary object per crew member. Each dictionary has a key called 'job' which tells us if that person was the director or not. With that in mind we can create a function to extract the director:

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

We can do something similar to extract the actors, keywords and genres with a function as well:

In [None]:
#Getting a list of the actors, keywords and genres
def get_list(x):
    if isinstance(x, list): #checking to see if the input is a list or not
        names = [i['name'] for i in x] #if we take a look at the data, we find that
        #the key 'name' holds the names of the actors,
        #the actual keywords and the actual genres
        
        #Check if more than 3 elements exist. If yes, return only first three. 
        #If no, return entire list. Too many elements would slow down our algorithm 
        #too much, and three should be more than enough for good recommendations.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []
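Before applying these functions to the whole table, we can sanity-check them on hand-written records. The sketch below restates both functions compactly so that it runs on its own; the `limit` parameter and the use of `math.nan` in place of `np.nan` are assumptions made only to keep the example dependency-free.

```python
import math

def get_director(crew):
    # Return the name of the first crew member whose job is 'Director'.
    for member in crew:
        if member["job"] == "Director":
            return member["name"]
    return math.nan  # stand-in for np.nan in this sketch

def get_list(items, limit=3):
    # Keep at most `limit` names; return [] for missing/malformed input.
    if isinstance(items, list):
        return [item["name"] for item in items][:limit]
    return []

crew = [{"job": "Screenplay", "name": "Joss Whedon"},
        {"job": "Director", "name": "John Lasseter"}]
cast = [{"name": n} for n in ["Tom Hanks", "Tim Allen", "Don Rickles", "Jim Varney"]]

print(get_director(crew))  # John Lasseter
print(get_list(cast))      # ['Tom Hanks', 'Tim Allen', 'Don Rickles']
print(get_list(None))      # []
```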

Now that we have written functions to clean up our data into director names and lists with only the relevant info for cast, keywords and genres, we can apply those functions to our data and see the results:

In [None]:
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

metadata[['title', 'cast', 'director', 'keywords', 'genres']].head()

| |title|cast|director|keywords|genres|
|---|---|---|---|---|---|
|0|Toy Story|[Tom Hanks, Tim Allen, Don Rickles]|John Lasseter|[jealousy, toy, boy]|[Animation, Comedy, Family]|
|1|Jumanji|[Robin Williams, Jonathan Hyde, Kirsten Dunst]|Joe Johnston|[board game, disappearance, based on children'...|[Adventure, Fantasy, Family]|
|2|Grumpier Old Men|[Walter Matthau, Jack Lemmon, Ann-Margret]|Howard Deutch|[fishing, best friend, duringcreditsstinger]|[Romance, Comedy]|
|3|Waiting to Exhale|[Whitney Houston, Angela Bassett, Loretta Devine]|Forest Whitaker|[based on novel, interracial relationship, sin...|[Comedy, Drama, Romance]|
|4|Father of the Bride Part II|[Steve Martin, Diane Keaton, Martin Short]|Charles Shyer|[baby, midlife crisis, confidence]|[Comedy]|


Seems like everything is working fine! Note that our objective is to eventually have one big word soup for each movie, so that we can vectorize these soups and then compute cosine similarity. This means we have to clean up the data a bit more: we want every entry to be lowercase, and we want each name to be joined into a single word without spaces. Otherwise, when we vectorize, the "Robert" in "Robert De Niro" would be counted as the same word as the "Robert" in "Robert Downey Junior", and it would be arbitrary to say that these actors are similar just because they share a first name. If we store them as "robertdeniro" and "robertdowneyjunior" instead, each actor gets a separate entry in the vectorization.

In [None]:
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x] #cleaning up spaces in the data
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''
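The effect of this cleanup can be checked on the example from the text. The sketch below repeats a condensed version of `clean_data` so that it is self-contained, and compares the words shared by the two names before and after cleaning.

```python
def clean_data(x):
    # Lowercase and strip spaces from a list of names or from a single name.
    if isinstance(x, list):
        return [item.replace(" ", "").lower() for item in x]
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    return ""  # e.g. a missing director (NaN)

actors = ["Robert De Niro", "Robert Downey Junior"]

# Without cleaning, splitting on spaces makes the two actors share a word.
raw_tokens = [set(name.lower().split()) for name in actors]
print(raw_tokens[0] & raw_tokens[1])  # {'robert'}

# After cleaning, each actor is a single, distinct word.
cleaned = clean_data(actors)
print(cleaned)                   # ['robertdeniro', 'robertdowneyjunior']
print(clean_data(float("nan")))  # prints an empty string
```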

Now that we have a cleanup function that works for both lists and strings, we can use it on our data:

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

metadata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[animation, comedy, family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.9469,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy]",johnlasseter
1,False,,65000000,"[adventure, fantasy, family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.0155,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[robinwilliams, jonathanhyde, kirstendunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[boardgame, disappearance, basedonchildren'sbook]",joejohnston
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[romance, comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[waltermatthau, jacklemmon, ann-margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, bestfriend, duringcreditsstinger]",howarddeutch
3,False,,16000000,"[comedy, drama, romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.85949,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[whitneyhouston, angelabassett, lorettadevine]","[{'credit_id': '52fe44779251416c91011acb', 'de...","[basedonnovel, interracialrelationship, single...",forestwhitaker
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.38752,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[stevemartin, dianekeaton, martinshort]","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlifecrisis, confidence]",charlesshyer


The data is now cleaned! Finally, we are ready to create our soup for each movie. We can now write a function that is applied to each row of our metadata and joins the keywords, cast, director and genres columns into one big word soup. The elements are separated by a space " ", which signals to our vectorization function that each one is a separate word, to be encoded on its own.

In [None]:
#This function relies on the fact that, once the soups are vectorized into word
#counts, the order and types of the words don't matter: what matters for cosine
#similarity is which words the different soups have in common
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

metadata['soup'] = metadata.apply(create_soup, axis=1)
#metadata.head()
metadata[['title', 'soup', 'cast', 'director', 'keywords', 'genres']].head()

| |title|soup|cast|director|keywords|genres|
|---|---|---|---|---|---|---|
|0|Toy Story|jealousy toy boy tomhanks timallen donrickles ...|[tomhanks, timallen, donrickles]|johnlasseter|[jealousy, toy, boy]|[animation, comedy, family]|
|1|Jumanji|boardgame disappearance basedonchildren'sbook ...|[robinwilliams, jonathanhyde, kirstendunst]|joejohnston|[boardgame, disappearance, basedonchildren'sbook]|[adventure, fantasy, family]|
|2|Grumpier Old Men|fishing bestfriend duringcreditsstinger walter...|[waltermatthau, jacklemmon, ann-margret]|howarddeutch|[fishing, bestfriend, duringcreditsstinger]|[romance, comedy]|
|3|Waiting to Exhale|basedonnovel interracialrelationship singlemot...|[whitneyhouston, angelabassett, lorettadevine]|forestwhitaker|[basedonnovel, interracialrelationship, single...|[comedy, drama, romance]|
|4|Father of the Bride Part II|baby midlifecrisis confidence stevemartin dian...|[stevemartin, dianekeaton, martinshort]|charlesshyer|[baby, midlifecrisis, confidence]|[comedy]|


Take a look at the soup! It looks pretty good: all the information about each movie is now a compact list of names corresponding to the genres, director, actors and keywords. 

Now that we have the soup for each movie, we want to create one more soup every time our recommender is run: a soup of the user's inputs. We want to collect the genres, directors, actors and keywords THEY like, so that we can then vectorize everything, compute the cosine similarity between that input and each movie in our database, and rank the movies by how similar they are to that input. To that effect, Dennis wrote the following code:

In [None]:
#Getting the user's input for genre, actors and directors of their liking.
def get_genres():
  genres = input("What Movie Genre are you interested in (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  genres = " ".join(["".join(n.split()) for n in genres.lower().split(',')])
  return genres

def get_actors():
  actors = input("Who are some actors within the genre that you love (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  actors = " ".join(["".join(n.split()) for n in actors.lower().split(',')])
  return actors

def get_directors():
  directors = input("Who are some directors within the genre that you love (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  directors = " ".join(["".join(n.split()) for n in directors.lower().split(',')])
  return directors

def get_keywords():
  keywords = input("What are some of the keywords that describe the movie you want to watch, like elements of the plot, whether or not it is about friendship, etc? (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] ")
  keywords = " ".join(["".join(n.split()) for n in keywords.lower().split(',')])
  return keywords

def get_searchTerms():
  searchTerms = [] 
  genres = get_genres()
  if genres != 'skip':
    searchTerms.append(genres)

  actors = get_actors()
  if actors != 'skip':
    searchTerms.append(actors)

  directors = get_directors()
  if directors != 'skip':
    searchTerms.append(directors)

  keywords = get_keywords()
  if keywords != 'skip':
    searchTerms.append(keywords)
  
  return searchTerms

Note how each of the functions above prompts for a different type of search input, and how we worded the questions so that the answers come in a format our functions can convert into space-separated terms, which can then be 'word souped' and vectorized together with the word soups for our movies.
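The four `get_*` functions share the same parsing step, which can be tried out without typing into a prompt. The sketch below factors that step into a hypothetical `normalize_terms` helper and a hypothetical `build_query` function (neither name exists in our notebook) that takes the answers as a plain list instead of calling `input()`.

```python
def normalize_terms(raw):
    """Hypothetical helper mirroring the parsing inside get_genres() etc.:
    lowercase, split on commas, and remove the spaces inside each term."""
    return " ".join("".join(part.split()) for part in raw.lower().split(","))

def build_query(answers):
    """Join every answer that is not 'skip' into a single query soup."""
    terms = [normalize_terms(answer) for answer in answers]
    return " ".join(term for term in terms if term and term != "skip")

print(normalize_terms("Tom Cruise,  Nicole Kidman"))
# tomcruise nicolekidman
print(build_query(["horror", "tom Cruise", "skip", "blood, boat"]))
# horror tomcruise blood boat
```

Because the terms are lowercased and stripped of spaces in the same way as the movie soups, a name typed by the user ends up as the same word that `clean_data` produced for the movies.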

## Creating Our Recommendation Model Based on Count Vectoriser and Cosine Similarity

With our building blocks in place and our data properly formatted, we can finally implement the ranking/recommendation function. As mentioned above, our function takes as input the data that has already been pre-processed and asks for user input. It then turns the user input into a word soup and adds it as a row to our data. Next, it vectorizes these word soups using CountVectorizer from the sklearn Python library. What CountVectorizer does is simple: it takes documents (strings) and returns a matrix of token counts, so that each word soup is encoded as the frequencies of the words it contains. For example, take the following sentences, stored in a list:

corpus = [

'This is the first document.',

'This document is the second document.',

'And this is the third one.',

'Is this the first document?']


If we apply the CountVectorizer to them, we would get the following table:

|and|document|first|is|one|second|the|third|this|
|---|---|---|---|---|---|---|---|---|
|0|1|1|1|0|0|1|0|1|
|0|2|0|1|0|1|1|0|1|
|1|0|0|1|1|0|1|1|1|
|0|1|1|1|0|0|1|0|1|

Each row is one sentence and each column is one word of the vocabulary, in alphabetical order. The second column, for example, is 'document', and the table shows that 'document' appears once in the first sentence and twice in the second.
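To make the counting explicit, the sketch below rebuilds the table by hand using only the standard library. It is not sklearn's implementation: it imitates what we understand to be CountVectorizer's default behaviour (lowercasing, keeping tokens of two or more alphanumeric characters, and ordering the vocabulary alphabetically).

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts))  # one column per distinct word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

The first printed row is `[0, 1, 1, 1, 0, 0, 1, 0, 1]`, matching the first row of the table. In the notebook itself we rely on sklearn's CountVectorizer, which additionally returns a sparse matrix and can drop stop words.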

Note that for our recommendation algorithm, we also want to vectorize the user input. We chose to do that by simply adding the inputted word soup to the metadata table, as the last entry, and then running the vectorization. While this isn't the most efficient way to go about it, CountVectorizer is very quick to run and uses few resources. The bigger problem we have to face is the cosine similarity computation.

Cosine similarity is a mathematical computation that tells us the similarity between two vectors $A$ and $B$. In effect, we are calculating the cosine of the angle $\theta$ between these two vectors. In general, the result ranges from -1, indicating vectors pointing in opposite directions, to 1, indicating vectors pointing in the same direction. 0 indicates orthogonal vectors (in our case, word soups with no words in common), and intermediate values indicate intermediate levels of similarity. Since our count vectors never contain negative entries, our similarities always fall between 0 and 1. The formula we use to compute cosine similarity is the following:

$\cos(\theta)=\frac{A\cdot B}{||A||\space||B||}$, where $||A||=\sqrt{\sum_{i=1}^{n}{A_i}^2}$ and $||B||=\sqrt{\sum_{i=1}^{n}{B_i}^2}$.
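As a worked example of the formula, the sketch below computes the cosine similarity by hand for count vectors taken from the CountVectorizer table above (sentences 1, 2 and 4). The function name `cosine` is ours; in the notebook we use sklearn's `cosine_similarity` instead.

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the two magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc1 = [0, 1, 1, 1, 0, 0, 1, 0, 1]  # 'This is the first document.'
doc2 = [0, 2, 0, 1, 0, 1, 1, 0, 1]  # 'This document is the second document.'
doc4 = [0, 1, 1, 1, 0, 0, 1, 0, 1]  # 'Is this the first document?'

sim_12 = cosine(doc1, doc2)
sim_14 = cosine(doc1, doc4)
print(round(sim_12, 3))  # 0.791
print(round(sim_14, 3))  # 1.0
```

Here $A\cdot B=5$, $||A||=\sqrt{5}$ and $||B||=\sqrt{8}$ for the first pair, giving $5/\sqrt{40}\approx 0.791$. Sentences 1 and 4 contain the same words in a different order, so their similarity is 1.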

Note that the cost of computing cosine similarity grows linearly with the size of A and B (note that A and B have the same size, $n$). If we add *t* more values to A and B, the dot product requires *t* more multiplications (*n+t* in total), and the cost of computing each magnitude also grows linearly. So far, no trouble in computational complexity.

However, our algorithm computes the cosine similarity between each possible pair of movies. If we have $k$ movies, then we need to perform $k^2$ computations. This is the reason why we had to reduce the number of movies in our dataset from 45,000 to 10,000: the 35,000 movie difference saves $1.925 \times 10^9$ computations. Of course, there are methods to decrease the number of computations required and therefore allow us to use the entire dataset, but we decided to leave these as backlogged potential improvements for now.
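The arithmetic behind that figure can be checked directly. The sketch below also counts the comparisons needed if we only scored the user's query against every movie, which is the only row of the similarity matrix that our recommendation function ends up reading; the function names are ours.

```python
def pairwise_comparisons(k):
    # Every movie compared against every movie (full k x k matrix).
    return k * k

def query_comparisons(k):
    # Only the user's query compared against every movie.
    return k

saved = pairwise_comparisons(45_000) - pairwise_comparisons(10_000)
print(f"{saved:,}")                      # 1,925,000,000
print(f"{query_comparisons(45_000):,}")  # 45,000
```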





In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def make_recommendation(metadata=metadata):
  new_row = metadata.iloc[-1,:].copy() #creating a copy of the last row of the 
  #dataset, which we will use to hold the user's input
  
  #grabbing the new wordsoup from the user
  searchTerms = get_searchTerms()  
  new_row['soup'] = " ".join(searchTerms) #adding the input to our new row
  
  #adding the new row to the dataset (DataFrame.append was removed in pandas 2.0,
  #so we use pd.concat instead)
  metadata = pd.concat([metadata, new_row.to_frame().T], ignore_index=True)
  
  #Vectorizing the entire matrix as described above!
  count = CountVectorizer(stop_words='english')
  count_matrix = count.fit_transform(metadata['soup'])

  #running pairwise cosine similarity 
  cosine_sim2 = cosine_similarity(count_matrix, count_matrix) #getting a similarity matrix
  
  #sorting cosine similarities by highest to lowest
  sim_scores = list(enumerate(cosine_sim2[-1,:]))
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  #matching the similarities to the movie titles and ids; position 0 is skipped
  #because it is normally the user's own row (similarity 1 with itself)
  ranked_titles = []
  for i in range(1, 11):
    indx = sim_scores[i][0]
    ranked_titles.append([metadata['title'].iloc[indx], metadata['imdb_id'].iloc[indx]])
  
  return ranked_titles

In [None]:
make_recommendation()

What Movie Genre are you interested in (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] horror
Who are some actors within the genre that you love (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] tom Cruise
Who are some directors within the genre that you love (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] james cameron
What are some of the keywords that describe the movie you want to watch, like elements of the plot, whether or not it is about friendship, etc? (if multiple, please separate them with a comma)? [Type 'skip' to skip this question] blood, boat, romance, sex, magic


[['Interview with the Vampire', 'tt0110148'],
 ['Empire of Passion', 'tt0077132'],
 ['All the Right Moves', 'tt0085154'],
 ['Love Object', 'tt0328077'],
 ['Women in Love', 'tt0066579'],
 ['Once Bitten', 'tt0089730'],
 ['Amityville II: The Possession', 'tt0083550'],
 ['Prom Night IV: Deliver Us from Evil', 'tt0105179'],
 ['Vampyros Lesbos', 'tt0066380'],
 ['Rowing with the Wind', 'tt0093840']]
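One of the backlogged improvements mentioned earlier could look like the sketch below: instead of appending the query to the table and computing the full similarity matrix, we fit the vectorizer on the movie soups only and score the query against them, which yields a single row of scores. This is a sketch under assumptions, not the method used above: the titles, soups and query are invented stand-ins for `metadata['title']`, `metadata['soup']` and the user input.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy data standing in for metadata['title'] and metadata['soup'].
titles = ["Movie A", "Movie B", "Movie C"]
soups = [
    "vampire blood actorone directorone horror",
    "toy friendship actortwo directortwo animation",
    "boat romance actorthree directorthree drama",
]
query = "horror actorone blood"

vectorizer = CountVectorizer(stop_words="english")
movie_matrix = vectorizer.fit_transform(soups)  # shape (n_movies, n_terms)
query_vector = vectorizer.transform([query])    # shape (1, n_terms)

# One row of scores instead of an n_movies x n_movies matrix.
scores = cosine_similarity(query_vector, movie_matrix).ravel()
ranking = scores.argsort()[::-1]                # best match first
top = [(titles[i], round(float(scores[i]), 3)) for i in ranking[:2]]
print(top)
```

With this design, query words that appear in no movie soup are simply ignored by `transform`. That changes the absolute scores slightly compared with appending the query as a row, but since it rescales every score by the same factor, the ranking stays the same.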

# References
- https://www.datacamp.com/community/tutorials/recommender-systems-python
- https://www.kaggle.com/rounakbanik/the-movies-dataset
- https://www.kite.com/python/docs/ast.literal_eval
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
