Practical Deep Learning, Lesson 7, Movie Recommendations

In this notebook, we’ll use the MovieLens 10M dataset and collaborative filtering to build a movie recommendation model. We’ll use the data from movies.dat and ratings.dat to learn embeddings that help predict ratings for movies I haven’t watched yet.

Create some personal data

Before writing any code to train models, I code-generated a quick UI for rating movies, which produces a my_ratings.dat file that can be appended to ratings.dat. It takes a bit of code, but the nice part is that with inline script metadata and uv, we can write (generate) and run the whole tool as a single file.

Here is the code:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "fastapi",
#     "pandas",
#     "uvicorn",
# ]
# ///
import os
from datetime import datetime

import pandas as pd
import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse, HTMLResponse

app = FastAPI()

# Load the movie list, split the year out of the title, and show newest first
movies_df = pd.read_csv(
    "ml-10M100K/movies.dat",
    sep="::",
    names=["movie_id", "title", "genres"],
    engine="python",
)
movies_df["year"] = movies_df["title"].str.extract(r"\((\d{4})\)")
movies_df["title"] = movies_df["title"].str.replace(r"\s*\(\d{4}\)", "", regex=True)
movies_df = movies_df.sort_values("year", ascending=False)

# Resume from the saved index so restarting the tool doesn't start over
last_rated_index = 0
if os.path.exists("last_rated.txt"):
    with open("last_rated.txt") as f:
        last_rated_index = int(f.read())


@app.get("/", response_class=HTMLResponse)
async def get_root():
    return """
    <!DOCTYPE html>
    <html>
    <head>
        <title>Movie Ratings</title>
        <style>
            body { font-family: sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }
            .movie { margin-bottom: 20px; padding: 20px; border: 1px solid #ccc; border-radius: 5px; }
            .rating-buttons { margin-top: 10px; }
            button { margin-right: 5px; padding: 5px 10px; cursor: pointer; }
            .rating-btn { background: #4CAF50; color: white; border: none; }
            .skip-btn { background: #f44336; color: white; border: none; }
        </style>
    </head>
    <body>
        <div id="current-movie" class="movie">
            <h2 id="movie-title"></h2>
            <p>Year: <span id="movie-year"></span></p>
            <p>Genres: <span id="movie-genres"></span></p>
            <div class="rating-buttons">
                <button class="rating-btn" onclick="rateMovie(1)">1★</button>
                <button class="rating-btn" onclick="rateMovie(2)">2★</button>
                <button class="rating-btn" onclick="rateMovie(3)">3★</button>
                <button class="rating-btn" onclick="rateMovie(4)">4★</button>
                <button class="rating-btn" onclick="rateMovie(5)">5★</button>
                <button class="skip-btn" onclick="skipMovie()">Skip</button>
            </div>
        </div>
        <script>
            let currentMovie = null;
            async function loadNextMovie() {
                const response = await fetch('/next-movie');
                currentMovie = await response.json();
                document.getElementById('movie-title').textContent = currentMovie.title;
                document.getElementById('movie-year').textContent = currentMovie.year;
                document.getElementById('movie-genres').textContent = currentMovie.genres;
            }
            async function rateMovie(rating) {
                if (!currentMovie) return;
                await fetch(`/rate-movie/${currentMovie.movie_id}/${rating}`, {
                    method: 'POST'
                });
                loadNextMovie();
            }
            async function skipMovie() {
                if (!currentMovie) return;
                await fetch(`/skip-movie/${currentMovie.movie_id}`, {
                    method: 'POST'
                });
                loadNextMovie();
            }
            loadNextMovie();
        </script>
    </body>
    </html>
    """


@app.get("/next-movie")
async def get_next_movie():
    movie = movies_df.iloc[last_rated_index].to_dict()
    return JSONResponse(movie)


@app.post("/rate-movie/{movie_id}/{rating}")
async def rate_movie(movie_id: int, rating: int):
    global last_rated_index
    if rating not in range(1, 6):
        return JSONResponse(
            {"error": "Rating must be between 1 and 5"}, status_code=400
        )
    # Append in the same userId::movieId::rating::timestamp format as ratings.dat
    timestamp = int(datetime.now().timestamp())
    user_id = 99999
    with open("my_ratings.dat", "a") as f:
        f.write(f"{user_id}::{movie_id}::{rating}::{timestamp}\n")
    last_rated_index += 1
    with open("last_rated.txt", "w") as f:
        f.write(str(last_rated_index))
    return JSONResponse({"status": "success"})


@app.post("/skip-movie/{movie_id}")
async def skip_movie(movie_id: int):
    global last_rated_index
    last_rated_index += 1
    with open("last_rated.txt", "w") as f:
        f.write(str(last_rated_index))
    return JSONResponse({"status": "success"})


if __name__ == "__main__":
    uvicorn.run(app, host="127.0.0.1", port=8000)

which can be run with

uv run app.py

When run, the tool looks like this.

[Screenshot of the movie rating tool]

Load the data

With around 40 movies rated and saved in my_ratings.dat, let’s install fastai, suppress warnings to keep the notebook clean, and import the libraries we’ll need to train the model.

!pip install fastai
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import matplotlib.pyplot as plt
from fastai.collab import *
from fastai.tabular.all import *
user_id = 99999

Looking at the README for the dataset, we see movies.dat has the following structure:

MovieID::Title::Genres

We can import that as a CSV. Since the :: separator is more than one character, pandas needs the Python parsing engine to handle it.

movies = pd.read_csv('ml-10M100K/movies.dat', sep='::', names=['id', 'name', 'genre'], engine='python')
movies['year'] = movies['name'].str.extract(r'\((\d{4})\)')
movies.head()
id name genre year
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1995
1 2 Jumanji (1995) Adventure|Children|Fantasy 1995
2 3 Grumpier Old Men (1995) Comedy|Romance 1995
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 1995
4 5 Father of the Bride Part II (1995) Comedy 1995

Next, we load the ratings from the dataset and concatenate them with the ratings I created, so that we can generate predictions for my user (id 99999).

ratings = pd.concat([
    pd.read_csv('ml-10M100K/ratings.dat', sep='::', names=['userId', 'movieId', 'rating', 'timestamp'], engine='python'),
    pd.read_csv('ml-10M100K/my_ratings.dat', sep='::', names=['userId', 'movieId', 'rating', 'timestamp'], engine='python')
])
ratings.tail()
ratings.tail()
userId movieId rating timestamp
35 99999 46578 3.0 1734831045
36 99999 44191 5.0 1734831168
37 99999 40815 4.0 1734831310
38 99999 30793 3.0 1734831332
39 99999 35836 4.0 1734831347

Once the ratings are loaded, we can plot the rating distribution to validate that it’s diverse enough to make a good dataset.

ratings['rating'].hist(bins=20, figsize=(10,6))
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

[Histogram: Distribution of Movie Ratings]

Train some models

Now we’ll lean heavily on fastai to create and train a collaborative filtering learner from the ratings data, for five epochs in total: three now, then two more after saving a checkpoint. This process will take some time.

dls = CollabDataLoaders.from_df(
    ratings,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=256
)
dls.show_batch()
userId movieId rating
0 51959 4102 3.5
1 1485 377 3.5
2 38778 733 5.0
3 8960 1635 5.0
4 15107 1254 3.0
5 2392 246 5.0
6 36036 32 4.0
7 28788 3751 3.5
8 69439 3547 1.0
9 436 410 3.0
learner = collab_learner(
    dls,
    n_factors=20,
    y_range=(0.5, 5.5)
)
learner.fit_one_cycle(3)
epoch train_loss valid_loss time
0 0.717291 0.733708 02:19
1 0.652198 0.687307 02:20
2 0.632653 0.676981 02:23

Let’s save a checkpoint of the model, then reload it and run two more epochs.

learner.save('collab_model_20_factors_256_bs')
Path('models/collab_model_20_factors_256_bs.pth')
learner = learner.load('collab_model_20_factors_256_bs')
learner.fit_one_cycle(2)
epoch train_loss valid_loss time
0 0.602660 0.670399 02:21
1 0.586995 0.658351 02:20

It looks like the loss doesn’t improve much with the two additional training epochs, which is good to know for future training runs.
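To see that as a curve rather than squinting at the epoch table, we can use fastai’s Recorder, which keeps the loss history from the most recent fit:

# Plot the training and validation loss from the last fit
learner.recorder.plot_loss()
plt.show()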

I’m intentionally keeping the training time relatively fast at this point. I want to be able to get a feel for how training these types of models works. Once I get a better sense of that, I’ll increase things like n_factors and training epochs.
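In the meantime, one cheap way to build intuition is to peek at what the model has learned. Here’s a sketch, assuming the default EmbeddingDotBias model that collab_learner builds: each movie gets a learned bias, roughly how well liked it is independent of any user’s taste.

# Peek at learned movie biases for the 1,000 most-rated movies.
# Assumes the default EmbeddingDotBias model from collab_learner.
top_ids = ratings['movieId'].value_counts().index[:1000]
movie_bias = learner.model.bias(top_ids, is_item=True)
bias_df = pd.DataFrame({'id': top_ids, 'bias': movie_bias.numpy()})
bias_df.merge(movies, on='id').nlargest(10, 'bias')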

Now that we’ve trained the model, let’s get movie recommendations for me (user 99999). To do this, we’ll predict ratings for every movie and sort by highest predicted rating. The predicted values are the ratings the model expects me to give, based on my rating history.

def get_preds(learner, user_id=99999, num_recs=20):
    # Pair the target user with every movie, then predict a rating for each pair
    all_movies = pd.DataFrame({'userId': [user_id] * len(movies), 'movieId': movies['id']})
    preds = learner.get_preds(dl=learner.dls.test_dl(all_movies))[0].numpy()
    recommendations = pd.DataFrame({
        'movie_id': movies['id'],
        'title': movies['name'],
        'year': movies['year'],
        'predicted_rating': preds
    })
    return recommendations.sort_values('predicted_rating', ascending=False).head(num_recs)
recommendations = get_preds(learner)
recommendations
movie_id title year predicted_rating
293 296 Pulp Fiction (1994) 1994 4.727180
49 50 Usual Suspects, The (1995) 1995 4.673663
315 318 Shawshank Redemption, The (1994) 1994 4.528549
843 858 Godfather, The (1972) 1972 4.521612
2487 2571 Matrix, The (1999) 1999 4.423033
4134 4226 Memento (2000) 2000 4.410904
1173 1198 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) 1981 4.393826
4899 4993 Lord of the Rings: The Fellowship of the Ring, The (2001) 2001 4.388923
2874 2959 Fight Club (1999) 1999 4.380060
10216 58559 Dark Knight, The (2008) 2008 4.371044
7039 7153 Lord of the Rings: The Return of the King, The (2003) 2003 4.354837
1195 1221 Godfather: Part II, The (1974) 1974 4.352811
1067 1089 Reservoir Dogs (1992) 1992 4.338005
5852 5952 Lord of the Rings: The Two Towers, The (2002) 2002 4.321620
1113 1136 Monty Python and the Holy Grail (1975) 1975 4.311879
1171 1196 Star Wars: Episode V - The Empire Strikes Back (1980) 1980 4.311409
257 260 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 1977 4.301165
523 527 Schindler's List (1993) 1993 4.293892
5916 6016 City of God (Cidade de Deus) (2002) 2002 4.267132
1187 1213 Goodfellas (1990) 1990 4.253361

These recommendations seem pretty good. I’ve actually seen and liked some of the movies the model is recommending. Something I didn’t expect is also happening: the model is generating predictions for movies I’ve already rated. Movie id 58559 is already in my_ratings.dat:

ratings[(ratings['userId'] == 99999) & (ratings['movieId'] == 58559)]
userId movieId rating timestamp
0 99999 58559 5.0 1734829873
recommendations[recommendations['movie_id'] == 58559]
movie_id title year predicted_rating
10216 58559 Dark Knight, The (2008) 2008 4.371044

Let’s modify get_preds to filter out movies I’ve already rated.

def get_preds(learner, ratings, user_id=99999, num_recs=20):
    all_movies = pd.DataFrame({'userId': [user_id] * len(movies), 'movieId': movies['id']})
    preds = learner.get_preds(dl=learner.dls.test_dl(all_movies))[0].numpy()
    recommendations = pd.DataFrame({
        'movie_id': movies['id'],
        'title': movies['name'],
        'year': movies['year'],
        'predicted_rating': preds
    })
    # Drop anything this user has already rated
    rated_movies = ratings[ratings['userId'] == user_id]['movieId'].values
    recommendations = recommendations[~recommendations['movie_id'].isin(rated_movies)]
    return recommendations.sort_values('predicted_rating', ascending=False).head(num_recs)
recommendations = get_preds(learner, ratings)
recommendations
movie_id title year predicted_rating
293 296 Pulp Fiction (1994) 1994 4.727180
49 50 Usual Suspects, The (1995) 1995 4.673663
315 318 Shawshank Redemption, The (1994) 1994 4.528549
843 858 Godfather, The (1972) 1972 4.521612
2487 2571 Matrix, The (1999) 1999 4.423033
4134 4226 Memento (2000) 2000 4.410904
1173 1198 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) 1981 4.393826
4899 4993 Lord of the Rings: The Fellowship of the Ring, The (2001) 2001 4.388923
2874 2959 Fight Club (1999) 1999 4.380060
7039 7153 Lord of the Rings: The Return of the King, The (2003) 2003 4.354837
1195 1221 Godfather: Part II, The (1974) 1974 4.352811
1067 1089 Reservoir Dogs (1992) 1992 4.338005
5852 5952 Lord of the Rings: The Two Towers, The (2002) 2002 4.321620
1113 1136 Monty Python and the Holy Grail (1975) 1975 4.311879
1171 1196 Star Wars: Episode V - The Empire Strikes Back (1980) 1980 4.311409
257 260 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 1977 4.301165
523 527 Schindler's List (1993) 1993 4.293892
5916 6016 City of God (Cidade de Deus) (2002) 2002 4.267132
1187 1213 Goodfellas (1990) 1990 4.253361
1172 1197 Princess Bride, The (1987) 1987 4.251543

Now we’re getting clean predictions.
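A quick sanity check with the variables above confirms the filter did its job:

# None of the recommended movies should already be rated by me
rated = set(ratings.loc[ratings['userId'] == user_id, 'movieId'])
assert set(recommendations['movie_id']).isdisjoint(rated)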

Since I’ve seen some of these movies (but haven’t added ratings for them yet), it would be nice to do that so I can generate new recommendations with additional data. Given the way collaborative filtering works, we’d need to retrain the model with the augmented ratings dataset.

I could just filter out the IDs of the recommendations I’ve already watched and work through the rest of the existing predictions list. However, those predictions would still only reflect the original ratings the model was trained on for my user, which means we wouldn’t be making great use of the new data.

It would be nice to retrain the model on this augmented data, simulating what real system retraining could look like.

import time

def add_new_ratings(user_id, new_ratings):
    # Build (movieId, rating) pairs for this user, timestamped now,
    # matching the ratings.dat schema
    return pd.DataFrame({
        'userId': [user_id] * len(new_ratings),
        'movieId': [x[0] for x in new_ratings],
        'rating': [x[1] for x in new_ratings],
        'timestamp': [int(time.time())] * len(new_ratings)
    })

new_ratings = [
    (318, 4.5),   # Shawshank Redemption
    (50, 4),      # The Usual Suspects
    (4226, 4.5),  # Memento
]
new_ratings_df = add_new_ratings(user_id, new_ratings)
new_ratings_df
userId movieId rating timestamp
0 99999 318 4.5 1735055101
1 99999 50 4.0 1735055101
2 99999 4226 4.5 1735055101
ratings2 = pd.concat([ratings, new_ratings_df], ignore_index=True)
ratings2.tail(5)
userId movieId rating timestamp
10000092 99999 30793 3.0 1734831332
10000093 99999 35836 4.0 1734831347
10000094 99999 318 4.5 1735055101
10000095 99999 50 4.0 1735055101
10000096 99999 4226 4.5 1735055101

We validate that our new ratings have been added, then train a new model, for just three epochs this time (because I am impatient).

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=256,
)
learner2 = collab_learner(
    dls,
    n_factors=20,
    y_range=(0.5, 5.5)
)
learner2.fit_one_cycle(3)
epoch train_loss valid_loss time
0 0.698186 0.734756 02:15
1 0.640888 0.686441 02:16
2 0.620803 0.676162 02:16
get_preds(learner2, ratings2)
movie_id title year predicted_rating
843 858 Godfather, The (1972) 1972 4.458255
523 527 Schindler's List (1993) 1993 4.407414
732 745 Wallace & Gromit: A Close Shave (1995) 1995 4.385326
660 668 Pather Panchali (1955) 1955 4.381900
708 720 Wallace & Gromit: The Best of Aardman Animation (1996) 1996 4.373108
1125 1148 Wallace & Gromit: The Wrong Trousers (1993) 1993 4.368431
895 912 Casablanca (1942) 1942 4.365451
661 670 World of Apu, The (Apur Sansar) (1959) 1959 4.364862
4879 4973 Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) 2001 4.358409
293 296 Pulp Fiction (1994) 1994 4.358396
5916 6016 City of God (Cidade de Deus) (2002) 2002 4.351150
1169 1193 One Flew Over the Cuckoo's Nest (1975) 1975 4.349298
9468 44555 Lives of Others, The (Das Leben der Anderen) (2006) 2006 4.344989
2937 3022 General, The (1927) 1927 4.336016
1935 2019 Seven Samurai (Shichinin no samurai) (1954) 1954 4.329831
887 904 Rear Window (1954) 1954 4.318365
737 750 Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) 1964 4.311434
2945 3030 Yojimbo (1961) 1961 4.303742
1195 1221 Godfather: Part II, The (1974) 1974 4.294334
1178 1203 12 Angry Men (1957) 1957 4.291220

We see these recommendations are a bit different, and not just because the three movies I rated are now filtered out of the predictions. They seem OK, but they skew a bit older than I typically like. It’s hard to pin down exactly, but I think we can improve on this.
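As a rough check on that impression, here’s a quick sketch (not part of the original flow) comparing the median year of the new model’s recommendations against the median year of the movies I’ve rated:

# Quantify the "skews older" impression with median years
recs2 = get_preds(learner2, ratings2)
rec_years = pd.to_numeric(recs2['year'], errors='coerce')
my_ids = ratings2.loc[ratings2['userId'] == user_id, 'movieId']
my_years = pd.to_numeric(movies.loc[movies['id'].isin(my_ids), 'year'], errors='coerce')
print(f"median recommended year: {rec_years.median():.0f}")
print(f"median rated year: {my_years.median():.0f}")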

With the training process working reasonably well, let’s train a model with more factors to see how the predictions change and, hopefully, improve.

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=256,
)
learner3 = collab_learner(
    dls,
    n_factors=100,
    y_range=(0.5, 5.5)
)
learner3.fit_one_cycle(3)
epoch train_loss valid_loss time
0 0.653620 0.705440 03:37
1 0.515555 0.662772 03:44
2 0.494222 0.654260 03:42

This model ended up training faster than I expected. Here are the recommendations:

recommendations = get_preds(learner3, ratings2)
recommendations
movie_id title year predicted_rating
843 858 Godfather, The (1972) 1972 4.215202
2240 2324 Life Is Beautiful (La Vita è bella) (1997) 1997 4.161647
523 527 Schindler's List (1993) 1993 4.159183
9468 44555 Lives of Others, The (Das Leben der Anderen) (2006) 2006 4.143415
293 296 Pulp Fiction (1994) 1994 4.143309
2245 2329 American History X (1998) 1998 4.124294
5916 6016 City of God (Cidade de Deus) (2002) 2002 4.122414
1171 1196 Star Wars: Episode V - The Empire Strikes Back (1980) 1980 4.106239
257 260 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 1977 4.096898
895 912 Casablanca (1942) 1942 4.092230
1149 1172 Cinema Paradiso (Nuovo cinema Paradiso) (1989) 1989 4.082796
1645 1704 Good Will Hunting (1997) 1997 4.076812
290 293 Léon: The Professional (Léon) (Professional, The) (1994) 1994 4.039810
4879 4973 Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) 2001 4.038737
1195 1221 Godfather: Part II, The (1974) 1974 4.038615
2773 2858 American Beauty (1999) 1999 4.035391
2874 2959 Fight Club (1999) 1999 4.030475
1172 1197 Princess Bride, The (1987) 1987 4.022649
6427 6539 Pirates of the Caribbean: The Curse of the Black Pearl (2003) 2003 4.022105
1207 1234 Sting, The (1973) 1973 4.010592

The predictions are pretty similar to the previous models’. It seems I misunderstood the effect of bs (batch size): with larger batches, training can get stuck in local minima. The fastai default is bs=64, so let’s try training another model with that.

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=64,
)
learner4 = collab_learner(
    dls,
    n_factors=100,
    y_range=(0.5, 5.5)
)
learner4.fit_one_cycle(3)
epoch train_loss valid_loss time
0 0.670424 0.718308 15:11
1 0.552318 0.678102 15:23
2 0.520957 0.663053 15:25

Forty-five minutes later, our model is trained. Let’s see how we did.

recommendations = get_preds(learner4, ratings2)
recommendations
movie_id title year predicted_rating
293 296 Pulp Fiction (1994) 1994 4.540648
2874 2959 Fight Club (1999) 1999 4.533676
843 858 Godfather, The (1972) 1972 4.420893
2487 2571 Matrix, The (1999) 1999 4.418619
1195 1221 Godfather: Part II, The (1974) 1974 4.376376
5916 6016 City of God (Cidade de Deus) (2002) 2002 4.362540
2245 2329 American History X (1998) 1998 4.313848
1935 2019 Seven Samurai (Shichinin no samurai) (1954) 1954 4.291965
587 593 Silence of the Lambs, The (1991) 1991 4.287333
46 47 Seven (a.k.a. Se7en) (1995) 1995 4.286267
7039 7153 Lord of the Rings: The Return of the King, The (2003) 2003 4.276011
1067 1089 Reservoir Dogs (1992) 1992 4.256866
7247 7361 Eternal Sunshine of the Spotless Mind (2004) 2004 4.256151
4899 4993 Lord of the Rings: The Fellowship of the Ring, The (2001) 2001 4.251918
5852 5952 Lord of the Rings: The Two Towers, The (2002) 2002 4.243845
6762 6874 Kill Bill: Vol. 1 (2003) 2003 4.231368
1564 1617 L.A. Confidential (1997) 1997 4.224444
2240 2324 Life Is Beautiful (La Vita è bella) (1997) 1997 4.212670
9127 33794 Batman Begins (2005) 2005 4.211322
523 527 Schindler's List (1993) 1993 4.199523

Qualitatively, I think these predictions are an improvement. The movies are a bit newer than those surfaced by the other models, and more of them are ones I’ve either already seen or had recommended to me by friends.

Let’s see if we can reduce n_factors and still get what appear to be good results, as this should speed up training.

My goal is to find a balance of prediction quality and training speed that makes it reasonable to retrain a model whenever I update my ratings list. It’s possible faster training will sacrifice too much quality on my hardware, but let’s see.
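As an aside, if I keep running variations like this, a small helper (hypothetical, not something I used above) could log how each configuration trades training time against validation loss. Note that learn.validate() returns the validation loss as its first element.

import time

results = []

def train_config(n_factors, bs, epochs=3):
    # Build loaders and a learner for one configuration, train, and log it
    dls = CollabDataLoaders.from_df(
        ratings2,
        user_name='userId',
        item_name='movieId',
        rating_name='rating',
        bs=bs,
    )
    learn = collab_learner(dls, n_factors=n_factors, y_range=(0.5, 5.5))
    start = time.time()
    learn.fit_one_cycle(epochs)
    results.append({
        'n_factors': n_factors,
        'bs': bs,
        'minutes': (time.time() - start) / 60,
        'valid_loss': float(learn.validate()[0]),
    })
    return learn

For now, though, I’ll keep running configurations by hand.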

dls = CollabDataLoaders.from_df(
    ratings2,
    user_name='userId',
    item_name='movieId',
    rating_name='rating',
    bs=64,
)
learner5 = collab_learner(
    dls,
    n_factors=50,
    y_range=(0.5, 5.5)
)
learner5.fit_one_cycle(3)
epoch train_loss valid_loss time
0 0.679360 0.726244 10:14
1 0.624619 0.682216 10:10
2 0.580484 0.668213 10:08
recommendations = get_preds(learner5, ratings2)
recommendations
movie_id title year predicted_rating
7039 7153 Lord of the Rings: The Return of the King, The (2003) 2003 4.511814
5852 5952 Lord of the Rings: The Two Towers, The (2002) 2002 4.485892
108 110 Braveheart (1995) 1995 4.471653
4899 4993 Lord of the Rings: The Fellowship of the Ring, The (2001) 2001 4.465234
257 260 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 1977 4.461960
2487 2571 Matrix, The (1999) 1999 4.458735
1171 1196 Star Wars: Episode V - The Empire Strikes Back (1980) 1980 4.448349
1173 1198 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) 1981 4.401854
843 858 Godfather, The (1972) 1972 4.367479
523 527 Schindler's List (1993) 1993 4.364442
1184 1210 Star Wars: Episode VI - Return of the Jedi (1983) 1983 4.358963
1944 2028 Saving Private Ryan (1998) 1998 4.299438
1195 1221 Godfather: Part II, The (1974) 1974 4.241967
2245 2329 American History X (1998) 1998 4.237298
1113 1136 Monty Python and the Holy Grail (1975) 1975 4.220087
3489 3578 Gladiator (2000) 2000 4.213627
2677 2762 Sixth Sense, The (1999) 1999 4.209116
9127 33794 Batman Begins (2005) 2005 4.207511
2874 2959 Fight Club (1999) 1999 4.203415
587 593 Silence of the Lambs, The (1991) 1991 4.195570

These seem reasonably similar. Not too surprisingly, if you haven’t seen Lord of the Rings or Star Wars, the model thinks you should.

Do things change with further training?

learner5.fit_one_cycle(2)
epoch train_loss valid_loss time
0 0.584899 0.686621 10:07
1 0.575225 0.664721 10:07
recommendations = get_preds(learner5, ratings2)
recommendations
movie_id title year predicted_rating
843 858 Godfather, The (1972) 1972 4.525020
1173 1198 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) 1981 4.468911
1195 1221 Godfather: Part II, The (1974) 1974 4.464158
108 110 Braveheart (1995) 1995 4.433938
1171 1196 Star Wars: Episode V - The Empire Strikes Back (1980) 1980 4.393318
1944 2028 Saving Private Ryan (1998) 1998 4.367393
7039 7153 Lord of the Rings: The Return of the King, The (2003) 2003 4.325635
5852 5952 Lord of the Rings: The Two Towers, The (2002) 2002 4.321746
4899 4993 Lord of the Rings: The Fellowship of the Ring, The (2001) 2001 4.321311
2487 2571 Matrix, The (1999) 1999 4.315341
523 527 Schindler's List (1993) 1993 4.307681
257 260 Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977) 1977 4.301846
1187 1213 Goodfellas (1990) 1990 4.281674
352 356 Forrest Gump (1994) 1994 4.267944
293 296 Pulp Fiction (1994) 1994 4.262625
5916 6016 City of God (Cidade de Deus) (2002) 2002 4.260427
9127 33794 Batman Begins (2005) 2005 4.240374
1263 1291 Indiana Jones and the Last Crusade (1989) 1989 4.234428
1935 2019 Seven Samurai (Shichinin no samurai) (1954) 1954 4.223034
1184 1210 Star Wars: Episode VI - Return of the Jedi (1983) 1983 4.215607

Not too different.
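To put a rough number on “not too different”, a small sketch could measure the overlap between two top-20 lists. This assumes we had saved each list separately, e.g. recs4 = get_preds(learner4, ratings2) and recs5 = get_preds(learner5, ratings2):

# Jaccard overlap between two hypothetical saved recommendation lists
def rec_overlap(recs_a, recs_b):
    ids_a, ids_b = set(recs_a['movie_id']), set(recs_b['movie_id'])
    return len(ids_a & ids_b) / len(ids_a | ids_b)

print(f"top-20 overlap: {rec_overlap(recs4, recs5):.0%}")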

Let’s add some more filtering capabilities to our prediction generator.

def get_preds(learner, ratings, user_id=99999, num_recs=20, exclude_terms=None):
    all_movies = pd.DataFrame({'userId': [user_id] * len(movies), 'movieId': movies['id']})
    preds = learner.get_preds(dl=learner.dls.test_dl(all_movies))[0].numpy()
    recommendations = pd.DataFrame({
        'movie_id': movies['id'],
        'title': movies['name'],
        'year': movies['year'],
        'predicted_rating': preds
    })
    # Drop anything this user has already rated
    rated_movies = ratings[ratings['userId'] == user_id]['movieId'].values
    recommendations = recommendations[~recommendations['movie_id'].isin(rated_movies)]
    # Optionally drop titles containing any excluded term (plain substring match)
    if exclude_terms:
        for term in exclude_terms:
            recommendations = recommendations[~recommendations['title'].str.contains(term, case=False, regex=False)]
    return recommendations.sort_values('predicted_rating', ascending=False).head(num_recs)
recommendations = get_preds(learner5, ratings2, exclude_terms=['Star Wars', 'Lord of the Rings'])
recommendations
movie_id title year predicted_rating
843 858 Godfather, The (1972) 1972 4.525020
1173 1198 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) 1981 4.468911
1195 1221 Godfather: Part II, The (1974) 1974 4.464158
108 110 Braveheart (1995) 1995 4.433938
1944 2028 Saving Private Ryan (1998) 1998 4.367393
2487 2571 Matrix, The (1999) 1999 4.315341
523 527 Schindler's List (1993) 1993 4.307681
1187 1213 Goodfellas (1990) 1990 4.281674
352 356 Forrest Gump (1994) 1994 4.267944
293 296 Pulp Fiction (1994) 1994 4.262625
5916 6016 City of God (Cidade de Deus) (2002) 2002 4.260427
9127 33794 Batman Begins (2005) 2005 4.240374
1263 1291 Indiana Jones and the Last Crusade (1989) 1989 4.234428
1935 2019 Seven Samurai (Shichinin no samurai) (1954) 1954 4.223034
895 912 Casablanca (1942) 1942 4.207191
3489 3578 Gladiator (2000) 2000 4.201711
1113 1136 Monty Python and the Holy Grail (1975) 1975 4.175015
2245 2329 American History X (1998) 1998 4.164785
9468 44555 Lives of Others, The (Das Leben der Anderen) (2006) 2006 4.145649
2418 2502 Office Space (1999) 1999 4.134266

Nice. Lots of movies here that I’ve heard of, or have seen and liked but haven’t rated yet.

It turns out MovieLens is a movie recommendation service (they’ve kindly provided their data to learn from). Given my success here, I will probably try out the service.

Wrapping up

This exploration was great for getting a feel for how hyperparameter tuning affects model predictions, and for the time constraints you run into when training many slightly varied models. Specifically, I now have a deeper appreciation for the challenges of continuously retraining a model as new data arrives. Since training isn’t always fast, this isn’t as trivial as, say, writing the new data to a database.
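If I wanted to automate that loop, a minimal retraining helper might look like this sketch, reusing the pieces from this notebook (hypothetical, and without any of the scheduling or serving concerns a real system would need):

def retrain(n_factors=100, bs=64, epochs=3):
    # Reload the base ratings plus my latest personal ratings
    cols = ['userId', 'movieId', 'rating', 'timestamp']
    base = pd.read_csv('ml-10M100K/ratings.dat', sep='::', names=cols, engine='python')
    mine = pd.read_csv('ml-10M100K/my_ratings.dat', sep='::', names=cols, engine='python')
    all_ratings = pd.concat([base, mine], ignore_index=True)
    # Rebuild the DataLoaders and train from scratch
    dls = CollabDataLoaders.from_df(
        all_ratings,
        user_name='userId',
        item_name='movieId',
        rating_name='rating',
        bs=bs,
    )
    learn = collab_learner(dls, n_factors=n_factors, y_range=(0.5, 5.5))
    learn.fit_one_cycle(epochs)
    learn.save(f'collab_model_{n_factors}_factors_{bs}_bs')
    return learn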

It was fun to generate personalized predictions and I actually ended up watching The Usual Suspects after getting the recommendation from the model, which I enjoyed.