Practical Deep Learning, Lesson 3, Stochastic Gradient Descent on the Titanic Dataset
In this notebook, we train two similar neural nets on the classic Titanic dataset using techniques from fastbook
chapter 1 and chapter 4.
We train the first using mostly PyTorch APIs and the second with fastai APIs. A few cells output warnings; I kept those because I wanted to preserve the printouts of the models’ accuracy.
The Titanic dataset can be downloaded from Kaggle or with:
```
!kaggle competitions download -c titanic
```
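The Kaggle CLI saves the competition data as a zip archive. One way to extract it into a `titanic/` directory, so the `titanic/train.csv` paths used below resolve, is a short sketch like this (assuming the archive is named `titanic.zip`):

```python
# Sketch only: extract the downloaded archive into a titanic/ directory.
# Assumes the Kaggle CLI saved titanic.zip in the current working directory.
import zipfile

with zipfile.ZipFile('titanic.zip') as zf:
    zf.extractall('titanic')
```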
To start, we install and import the dependencies we’ll need:
```
%pip install torch pandas scikit-learn fastai
```
```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim

from fastai.tabular.all import *
from sklearn.preprocessing import StandardScaler
```
Next, we load the training data:
```python
df = pd.read_csv('titanic/train.csv')
```
```python
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].copy()
y = df['Survived'].copy()
X.head(5)
```
|   | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.0500 |
Now, we define two functions to encode the Sex column numerically and fill in missing values so we can train on the data.
```python
def process_training_data(X):
    # Encode Sex numerically and fill missing Age and Fare with the column medians
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    X['Age'] = X['Age'].fillna(X['Age'].median())
    X['Fare'] = X['Fare'].fillna(X['Fare'].median())
    return X


def process_test_data(X):
    # Encode Sex numerically
    X['Sex'] = X['Sex'].map({'male': 0, 'female': 1})
    return X
```
```python
X = process_training_data(X)
X.head(5)
```
|   | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 |
We need to scale the numeric features, otherwise training fails with:

```
RuntimeError: all elements of input should be between 0 and 1
```

We’ll do this with `StandardScaler` for both the training and test data, per Sonnet’s recommendation. `StandardScaler` doesn’t actually constrain the data to be between 0 and 1, but it seems to get the job done for the needs of the model architecture I selected.
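To make the difference concrete, here’s a small illustration (not part of the training code): `StandardScaler` standardizes each column to zero mean and unit variance, which can fall outside [0, 1], while scikit-learn’s `MinMaxScaler` would actually map a column into [0, 1].

```python
# Illustrative only; not used in the training code below.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

fares = np.array([[7.25], [71.2833], [512.3292]])
print(StandardScaler().fit_transform(fares).ravel())  # zero mean, unit variance; some values are negative
print(MinMaxScaler().fit_transform(fares).ravel())    # rescaled into [0, 1]
```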
```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
```python
test_df = pd.read_csv('titanic/test.csv')
X_test = test_df[features].copy()
X_test = process_test_data(X_test)
X_test_scaled = scaler.transform(X_test)

y_test_df = pd.read_csv('titanic/gender_submission.csv')
y_test = y_test_df['Survived']
```
Next, we turn these `numpy` arrays into PyTorch tensors and define the model architecture.
```python
X_train_tensor = torch.FloatTensor(X_scaled)
y_train_tensor = torch.FloatTensor(y.values)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_test_tensor = torch.FloatTensor(y_test.values)
```
```python
model = nn.Sequential(
    nn.Linear(6, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid()
)
```
Also, define a loss function and an optimizer:
```python
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
```
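As an aside, a common alternative pairing (not what this notebook uses) is to drop the final `Sigmoid` and use `BCEWithLogitsLoss`, which applies the sigmoid internally and is more numerically stable. A sketch, assuming the same six input features:

```python
# Sketch of an alternative setup; not used in this notebook.
alt_model = nn.Sequential(
    nn.Linear(6, 8),
    nn.ReLU(),
    nn.Linear(8, 1),  # raw logits, no Sigmoid
)
alt_criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
alt_optimizer = optim.SGD(alt_model.parameters(), lr=0.01)
```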
Finally, we can train the model. Sonnet wrote this code.
```python
num_epochs = 1000
batch_size = 64

for epoch in range(num_epochs):
    for i in range(0, len(X_train_tensor), batch_size):
        batch_X = X_train_tensor[i:i+batch_size]
        batch_y = y_train_tensor[i:i+batch_size]

        outputs = model(batch_X)
        loss = criterion(outputs, batch_y.unsqueeze(1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
```
```
Epoch [100/1000], Loss: 0.3562
Epoch [200/1000], Loss: 0.3216
Epoch [300/1000], Loss: 0.3113
Epoch [400/1000], Loss: 0.3065
Epoch [500/1000], Loss: 0.3038
Epoch [600/1000], Loss: 0.3024
Epoch [700/1000], Loss: 0.2996
Epoch [800/1000], Loss: 0.2975
Epoch [900/1000], Loss: 0.2955
Epoch [1000/1000], Loss: 0.2937
```
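The loop above slices the tensors in a fixed order each epoch. If you wanted shuffled mini-batches instead, PyTorch’s `TensorDataset` and `DataLoader` are one way to do it; here’s a sketch of what a single epoch would look like under that setup:

```python
# Sketch only: shuffled mini-batches via DataLoader; not used above.
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(X_train_tensor, y_train_tensor.unsqueeze(1))
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

for batch_X, batch_y in train_dl:
    outputs = model(batch_X)
    loss = criterion(outputs, batch_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```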
With the model trained, we can run inference on the test set and compare the results to the “Survived” column from `gender_submission.csv`.
```python
model.eval()
with torch.no_grad():
    y_pred = model(X_test_tensor)
    y_pred_class = (y_pred > 0.5).float()

    correct_predictions = (y_pred_class == y_test_tensor.unsqueeze(1)).sum().item()
    total_predictions = len(y_test_tensor)
    acc = correct_predictions / total_predictions

    print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
    print(f"Accuracy: {acc:.2%}")
```
```
Correct predictions: 368 out of 418
Accuracy: 88.04%
```
Now, let’s build what I think is a similar model with `fastai` primitives.
Load the data again to avoid any unintentional contamination.
```python
train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')
```
The `TabularDataLoaders` from `fastai` needs the following configuration to create `DataLoaders`:

- `cat_names`: the names of the categorical variables
- `cont_names`: the names of the continuous variables
- `y_names`: the names of the dependent variables
```python
cat_names = ['Pclass', 'Sex']
cont_names = ['Age', 'SibSp', 'Parch', 'Fare']
dep_var = 'Survived'
```
Following a pattern similar to the one used in chapter 1, we train the model:
```python
procs = [Categorify, FillMissing, Normalize]

dls = TabularDataLoaders.from_df(
    train_df,
    path='.',
    procs=procs,
    cat_names=cat_names,
    cont_names=cont_names,
    y_names=dep_var,
    valid_pct=0.2,
    seed=42,
    bs=64,
)
```
```python
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(5, 1e-2)
```
```
/Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
  to[n].fillna(self.na_dict[n], inplace=True)
```
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.486258 | 0.233690 | 0.662921 | 00:02 |
| 1 | 0.378460 | 0.192642 | 0.662921 | 00:00 |
| 2 | 0.294309 | 0.132269 | 0.662921 | 00:00 |
| 3 | 0.248516 | 0.140377 | 0.662921 | 00:00 |
| 4 | 0.220335 | 0.132353 | 0.662921 | 00:00 |
For some reason, `learn.dls.test_dl` does not apply `FillMissing` to the `Fare` column of the test data, so we do that manually here.
```python
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())
```
We run the test set through the model, then compare the results to the ground truth labels and calculate the model accuracy.
```python
test_dl = learn.dls.test_dl(test_df)
preds, _ = learn.get_preds(dl=test_dl)

binary_preds = (preds > 0.5).float()

y_test = pd.read_csv('titanic/gender_submission.csv')
correct_predictions = (binary_preds.numpy().flatten() == y_test['Survived']).sum()
total_predictions = len(y_test)

acc = correct_predictions / total_predictions

print(f"Correct predictions: {correct_predictions} out of {total_predictions}")
print(f"Accuracy: {acc:.2%}")
```
```
/Users/danielcorin/dev/lab/fastbook_projects/sgd_titanic/.venv/lib/python3.12/site-packages/fastai/tabular/core.py:314: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
  to[n].fillna(self.na_dict[n], inplace=True)
```
```
Correct predictions: 377 out of 418
Accuracy: 90.19%
```
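If you wanted to submit these predictions to Kaggle instead of scoring them against `gender_submission.csv`, a submission file could be written from the fastai predictions along these lines (a sketch, assuming `binary_preds` and `test_df` from the cells above):

```python
# Sketch: write a Kaggle submission file from the predictions above.
submission = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': binary_preds.numpy().flatten().astype(int),
})
submission.to_csv('submission.csv', index=False)
```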
The accuracies of the two models are about the same! For a first pass at training neural networks (with plenty of help from Sonnet), I think this went pretty well. If you know things about deep learning, let me know if I made any major mistakes. It’s a bit tough to know if you’re doing things correctly in isolation. I suppose that’s why Kaggle competitions can be useful for learning.