Regression and Classification with Fully Connected Neural Networks#

Deep learning is a large, developing field with many sub-communities, a constant stream of new developments, and unlimited application areas. Despite this complexity, most deep learning techniques share a relatively small set of algorithmic building blocks. The purpose of this notebook is to gain familiarity with some of these core concepts. Much of the intuition you develop here is applicable to the neural networks making headline news from the likes of OpenAI. To focus this notebook, we will consider the two main types of supervised machine learning:

  1. Regression: estimation of continuous quantities

  2. Classification: estimation of categories or discrete quantities

We will create “deep learning” models for regression and classification. We will see the power of neural networks as well as the pitfalls.

# libraries we will need
import torch
from torch import nn
import numpy as np
from torch.distributions import bernoulli
import matplotlib.pyplot as plt

# set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

Tasks to solve in this notebook#

Regression#

# make some fake regression data
n_samples_reg = 100
x_reg = 1.5*torch.randn(n_samples_reg, 1)
y_reg = 2 * x_reg + 3.0*torch.sin(3*x_reg) + 1 + 2 * torch.randn(n_samples_reg, 1)
y_reg = y_reg.squeeze()
plt.plot(x_reg, y_reg, 'o')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
[Figure: scatter plot of the synthetic regression data (y vs. x)]

Classification#

# fake classification data
n_samples_clf = 200
x_clf = 2.5*torch.randn(n_samples_clf, 2)
d = torch.sqrt( x_clf[:, 0]**2 +  x_clf[:, 1]**2 )
y_clf = (torch.sin(d) > 0) & (d<torch.pi)

# swap some labels near the boundary
width = 0.7
pgivd = 0.3 * width**2 / ((d - torch.pi)**2 + width**2)
swaps = bernoulli.Bernoulli(pgivd).sample().type(torch.bool)
y_clf[swaps] = ~y_clf[swaps]
plt.plot(x_clf[y_clf, 0], x_clf[y_clf,1], 'o', label='positive')
plt.plot(x_clf[~y_clf, 0], x_clf[~y_clf,1], 'o', label='negative')
plt.gca().set_aspect(1)
plt.grid()
[Figure: scatter plot of the synthetic classification data, positive vs. negative points]

The Simplest “Deep Learning” Model: Linear Regression#

Initializing our model#

Good old-fashioned linear model: \(y = wx + b\).

# Pytorch's `nn` module has lots of neural network building blocks
# nn.Linear is what we need for y=wx+b
reg_model = nn.Linear(in_features = 1, out_features = 1)

Question

What do you think the arguments `in_features` and `out_features` mean? When might they not be 1?
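If it helps to make the shapes concrete, here is a small illustrative sketch (not used elsewhere in this notebook) with a hypothetical layer that maps 3 input features to 2 outputs:

# a hypothetical layer mapping 3 input features to 2 outputs
layer = nn.Linear(in_features=3, out_features=2)

x_demo = torch.randn(5, 3)      # a batch of 5 samples, 3 features each
y_demo = layer(x_demo)

print(y_demo.shape)             # torch.Size([5, 2])
print(layer.weight.shape)       # torch.Size([2, 3]): one row of weights per output
print(layer.bias.shape)         # torch.Size([2])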

# pytorch randomly initializes the w and b
reg_model.weight, reg_model.bias
(Parameter containing:
 tensor([[-0.8761]], requires_grad=True),
 Parameter containing:
 tensor([-0.5502], requires_grad=True))
# let's look at some predictions
with torch.no_grad():
    y_reg_init = reg_model(x_reg)
    
y_reg_init[:5]
tensor([[-3.0825],
        [-2.5047],
        [-1.7339],
        [ 2.2167],
        [-1.4418]])
# plot the predictions before training the model
plt.plot(x_reg, y_reg, 'o', label='Actual targets')
plt.plot(x_reg, y_reg_init, 'o', label='Predicted targets')
plt.grid()
plt.legend()
_ = plt.title("This model sucks.\nWe better fit it.")
[Figure: actual targets and untrained model predictions ("This model sucks. We better fit it.")]

Anatomy of a training loop#

Basic idea:

  • Start with random parameter values

  • Define a loss/cost/objective measure to optimize

  • Use training data to evaluate the loss

  • Update the model to reduce the loss (use gradient descent)

  • Repeat until converged

Basic weight update formula: \( w_{i+1} = w_i - \mathrm{lr}\times \left.\frac{\partial L}{\partial w}\right|_{w=w_i} \)

Question

Why the minus sign in front of the second term?

[Figure: illustration of gradient descent]
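Before handing the updates to torch's optimizer, it can help to see the update formula carried out by hand. The sketch below is illustrative only (a single parameter w, a single gradient-descent step) and is not used later in the notebook:

# one hand-rolled gradient descent step on a single parameter (illustration only)
w = torch.tensor(0.5, requires_grad=True)
x_demo = torch.tensor([1.0, 2.0, 3.0])
y_demo = torch.tensor([2.0, 4.0, 6.0])    # the true relationship is y = 2x
lr = 0.1

loss = torch.mean((w * x_demo - y_demo)**2)   # MSE loss
loss.backward()                               # computes w.grad = dL/dw

with torch.no_grad():
    w -= lr * w.grad                          # w_{i+1} = w_i - lr * dL/dw

print(w)   # w has moved from 0.5 toward the true value of 2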
# re-initialize our model each time we run this cell
# otherwise the model picks up from where it left off
reg_model = nn.Linear(in_features = 1, out_features = 1)

# the torch `optim` module helps us fit models
import torch.optim as optim

# create our optimizer object and tell it about the parameters in our model
# the optimizer will be responsible for updating these parameters during training
optimizer = optim.SGD(reg_model.parameters(), lr=0.01)

# how many times to update the model based on the available data
num_epochs = 100
for i in range(num_epochs):
    # make sure gradients are set to zero on all parameters
    # by default, gradients accumulate
    optimizer.zero_grad()
    
    # "forward pass"
    y_hat = reg_model(x_reg).squeeze()
    
    # measure the loss
    # this is the mean squared error
    loss = torch.mean((y_hat - y_reg)**2)
    # can use pytorch built-in: https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html
    
    # "backward pass"
    # that is, compute gradient of loss wrt all parameters
    loss.backward()
    
    # parameter updates
    # use the basic weight update formula given above (with slight modifications)
    optimizer.step()
    
    if i % 20 == 0:
        print(f"Epoch {i+1}. MSE Loss = {loss:0.3f}")
        with torch.no_grad():
            y_reg_init = reg_model(x_reg)
        
        ix = x_reg.argsort(axis=0).squeeze()
        plt.plot(x_reg.squeeze()[ix], y_reg_init.squeeze()[ix], '-', label=i, alpha=1)


plt.plot(x_reg, y_reg, 'ok', label="Actual data")
plt.grid()
_ = plt.legend(title='Epoch')
Epoch 1. MSE Loss = 24.350
Epoch 21. MSE Loss = 9.776
Epoch 41. MSE Loss = 7.300
Epoch 61. MSE Loss = 6.871
Epoch 81. MSE Loss = 6.793
[Figure: regression data with the fitted line at several epochs]
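As the comment in the loop points out, the hand-written mean squared error is also available as a PyTorch built-in. A quick sketch of the equivalence, evaluated with the model we just trained:

import torch.nn.functional as F

with torch.no_grad():
    y_hat = reg_model(x_reg).squeeze()
    manual_mse = torch.mean((y_hat - y_reg)**2)
    builtin_mse = F.mse_loss(y_hat, y_reg)   # identical value
print(manual_mse, builtin_mse)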

Question

What shortcomings does this model have?

🔥 IMPORTANT CONCEPT 🔥

Model bias is a type of error that occurs when your model doesn't have enough flexibility to represent the real-world data!

Linear classification#

A slight modification to the model#

We need to modify our model because the targets are True/False values:

y_clf.unique()
tensor([False,  True])

While training the model, we cannot simply threshold the model outputs (e.g. call outputs greater than 0 ‘True’). Instead, we can have our model output a continuous number between 0 and 1. Use sigmoid to transform the output of a linear model to the range (0,1): \( p(y=1 | \vec{x}) = \mathrm{sigmoid}(\vec w \cdot \vec x + b) \)

Question

Why doesn't thresholding work while training?

🔥 IMPORTANT CONCEPT 🔥

In order to use gradient descent, neural networks must be differentiable.
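To make the point concrete, here is a tiny illustration (not part of the classification pipeline): the sigmoid produces a useful gradient, while a hard threshold does not.

# sigmoid is differentiable, so backward() produces a learning signal
z = torch.tensor(0.3, requires_grad=True)
torch.sigmoid(z).backward()
print(z.grad)   # non-zero: tells us how a small change in z moves the output

# a hard threshold such as (z > 0) is flat everywhere it is defined,
# so its "gradient" is zero and gradient descent gets no signal from it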

# what is sigmoid?
x_plt = torch.linspace(-10,10, 100)
y_plt = torch.sigmoid(x_plt)
plt.axvline(0, color='gray')
plt.plot(x_plt, y_plt)
plt.grid()
plt.xlabel('input')
plt.ylabel('output')
plt.gcf().set_size_inches(5,2.5)
_ = plt.title('Sigmoid Function')
[Figure: the sigmoid function]
# proposed new model for classification
# use nn.Sequential to string operations together
model_clf = nn.Sequential(
    nn.Linear(2, 1), # notice: we now have two inputs
    nn.Sigmoid()  # sigmoid squishes real numbers into (0, 1)
)
# what do the outputs look like?
with torch.no_grad():
    print(model_clf(x_clf)[:10])
# notice how they all lie between 0 and 1
    
# we loosely interpret these numbers as
# "the probability that y=1 given the provided value of x"
tensor([[0.6701],
        [0.8487],
        [0.7813],
        [0.4899],
        [0.8440],
        [0.2253],
        [0.3412],
        [0.4453],
        [0.4484],
        [0.5387]])
# turning probabilities into "decisions"
# apply a threshold
with torch.no_grad():
    print(model_clf(x_clf)[:10] > 0.5)
tensor([[ True],
        [ True],
        [ True],
        [False],
        [ True],
        [False],
        [False],
        [False],
        [False],
        [ True]])
# let's see how this model does before any training

# Code to plot the decision regions
xx, yy = torch.meshgrid(
    torch.linspace(x_clf[:,0].min(), x_clf[:,0].max(), 100),
    torch.linspace(x_clf[:,1].min(), x_clf[:,1].max(), 100),
    indexing='ij'
)
grid_pts = torch.vstack([xx.ravel(), yy.ravel()]).T

with torch.no_grad():
    # use a 0.5 decision threshold
    grid_preds = model_clf(grid_pts).squeeze()>0.5

grid_preds = grid_preds.reshape(xx.shape)

plt.gcf().set_size_inches(7,7)
plt.plot(x_clf[y_clf, 0], x_clf[y_clf,1], 'o', label='positive')
plt.plot(x_clf[~y_clf, 0], x_clf[~y_clf,1], 'o', label='negative')
plt.gca().set_aspect(1)
plt.grid()
plt.contourf(xx, yy, grid_preds, cmap=plt.cm.RdBu, alpha=0.4)
plt.legend()
_ = plt.title("Decision boundary is not good")
[Figure: decision regions of the untrained classifier ("Decision boundary is not good")]

Training loop#

The main difference between regression and classification is the loss function. For regression, we used mean squared error. For classification, we will use cross-entropy loss. This loss formula encourages the model to output low values for \(p(y=1|x)\) when the class is 0 and high values when the class is 1.
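For two classes this is the binary cross-entropy. A minimal sketch, using made-up probabilities and labels, showing how the formula used in the loop below lines up with the PyTorch built-in:

import torch.nn.functional as F

p = torch.tensor([0.9, 0.2, 0.6])   # predicted p(y=1|x) for three samples
y = torch.tensor([1.0, 0.0, 1.0])   # true labels as floats

manual = (-y * torch.log(p) - (1 - y) * torch.log(1 - p)).mean()
builtin = F.binary_cross_entropy(p, y)   # same value
print(manual, builtin)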

# re-initialize our model each time we run this cell
# otherwise the model picks up from where it left off
model_clf = nn.Sequential(
    nn.Linear(2, 1),
    nn.Sigmoid()
)

# create our optimizer object and tell it about the parameters in our model
optimizer = optim.SGD(model_clf.parameters(), lr=0.01)

# torch needs the class labels to be zeros and ones, not False/True
y_clf_int = y_clf.type(torch.int16)

# how many times to update the model based on the available data
num_epochs = 1000
for i in range(num_epochs):
    optimizer.zero_grad()
    
    y_hat = model_clf(x_clf).squeeze()
    
    # measure the goodness of fit
    # need to use a different loss function here
    # this is the "cross entropy loss"
    loss = -y_clf_int * torch.log(y_hat) - (1-y_clf_int)*torch.log(1-y_hat)
    loss = loss.mean()
    # pytorch built-in: https://pytorch.org/docs/stable/generated/torch.nn.functional.binary_cross_entropy.html
    
    # update the model
    loss.backward() # gradient computation
    optimizer.step()  # weight updates
    
    if i % 100 == 0:
        print(f"Epoch {i+1}. MSE Loss = {loss:0.3f}")
Epoch 1. BCE Loss = 0.959
Epoch 101. BCE Loss = 0.774
Epoch 201. BCE Loss = 0.727
Epoch 301. BCE Loss = 0.705
Epoch 401. BCE Loss = 0.691
Epoch 501. BCE Loss = 0.683
Epoch 601. BCE Loss = 0.678
Epoch 701. BCE Loss = 0.675
Epoch 801. BCE Loss = 0.673
Epoch 901. BCE Loss = 0.672
with torch.no_grad():
    # use a 0.5 decision threshold
    grid_preds = model_clf(grid_pts).squeeze()>0.5

grid_preds = grid_preds.reshape(xx.shape)

plt.gcf().set_size_inches(7,7)
plt.plot(x_clf[y_clf, 0], x_clf[y_clf,1], 'o', label='positive')
plt.plot(x_clf[~y_clf, 0], x_clf[~y_clf,1], 'o', label='negative')
plt.gca().set_aspect(1)
plt.grid()
plt.contourf(xx, yy, grid_preds, cmap=plt.cm.RdBu, alpha=0.4)
plt.legend()
_ = plt.title("Decision boundary is still not good")
[Figure: decision regions after training the linear classifier ("Decision boundary is still not good")]

Question

Why did this experiment fail? Any ideas what we could do to make it work?

Summing up linear models#

Sometimes linear models can be a good approximation to our data. Sometimes not.

Linear models tend to be biased in the statistical sense of the word. That is, they enforce linearity in the form of linear regression lines or linear decision boundaries. The linear model will fail to the degree that the real-world data is non-linear.

Getting non-linear with fully-connected neural networks#

We can compose simple math operations together to represent very complex functions.

[Figure: illustration of gradient descent]
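As a concrete (illustrative) example of composing simple operations, the few lines below build a tiny network out of plain tensor math; it performs the same kind of computation as nn.Sequential(nn.Linear(1, 3), nn.Tanh(), nn.Linear(3, 1)), just with its own random weights:

# a tiny network written as plain tensor math: three tanh "bumps" combined linearly
W1, b1 = torch.randn(3, 1), torch.randn(3)
W2, b2 = torch.randn(1, 3), torch.randn(1)

def tiny_net(x):
    h = torch.tanh(x @ W1.T + b1)   # hidden layer: 3 non-linear features of x
    return h @ W2.T + b2            # output layer: a linear combination of those features

print(tiny_net(torch.randn(5, 1)).shape)   # torch.Size([5, 1])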

A few helper functions#

def train_and_test_regression_model(model, num_epochs=1000, reporting_interval=100):

    # create our optimizer object and tell it about the parameters in our model
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # how many times to update the model based on the available data
    for i in range(num_epochs):
        optimizer.zero_grad()

        y_hat = model(x_reg).squeeze()

        # measure the goodness of fit
        loss = torch.mean((y_hat - y_reg)**2)

        # update the model
        loss.backward() # gradient computation
        optimizer.step()  # weight updates

        if i % reporting_interval == 0:
            print(f"Epoch {i+1}. MSE Loss = {loss:0.3f}")
            
    with torch.no_grad():
        y_reg_init = model(x_reg)
    plt.plot(x_reg.cpu(), y_reg.cpu(), 'o', label='Actual targets')
    plt.plot(x_reg.cpu(), y_reg_init.cpu(), 'x', label='Predicted targets')
    plt.grid()
    plt.legend()
def train_and_test_classification_model(model, num_epochs=1000, reporting_interval=100):

    # create our optimizer object and tell it about the parameters in our model
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # how many times to update the model based on the available data
    for i in range(num_epochs):
        optimizer.zero_grad()

        y_hat = model(x_clf).squeeze()

        # measure the goodness of fit
        # need to use a different loss function here
        loss = -y_clf_int * torch.log(y_hat) - (1-y_clf_int)*torch.log(1-y_hat)
        loss = loss.mean()

        # update the model
        loss.backward() # gradient computation
        optimizer.step()  # weight updates
    
        if i % reporting_interval == 0:
            print(f"Epoch {i+1}. MSE Loss = {loss:0.3f}")   
            
    with torch.no_grad():
        grid_preds = model(grid_pts).squeeze()>0.5

    grid_preds = grid_preds.reshape(xx.shape)

    plt.gcf().set_size_inches(7,7)
    plt.plot(x_clf[y_clf, 0], x_clf[y_clf,1], 'o', label='positive')
    plt.plot(x_clf[~y_clf, 0], x_clf[~y_clf,1], 'o', label='negative')
    plt.gca().set_aspect(1)
    plt.grid()
    plt.contourf(xx, yy, grid_preds, cmap=plt.cm.RdBu, alpha=0.4)
    plt.legend()

Regression#

What about nested linear models? Might this do something interesting? \( y = w_1(w_0 x + b_0) + b_1 \)

# We can build more complex models by stringing together more operations inside nn.Sequential:
model =  nn.Sequential(
    nn.Linear(in_features = 1, out_features = 1),
    nn.Linear(1,1)
)

train_and_test_regression_model(model)
_ = plt.title("But wait, that's still just a line?!")
Epoch 1. MSE Loss = 16.093
Epoch 101. MSE Loss = 6.771
Epoch 201. MSE Loss = 6.771
Epoch 301. MSE Loss = 6.771
Epoch 401. MSE Loss = 6.771
Epoch 501. MSE Loss = 6.771
Epoch 601. MSE Loss = 6.771
Epoch 701. MSE Loss = 6.771
Epoch 801. MSE Loss = 6.771
Epoch 901. MSE Loss = 6.771
[Figure: nested linear model fit, still a straight line]

Question

Why is it still just a line? How can we fix it?
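One way to see why: two stacked linear layers always collapse into a single linear layer. A small sketch that inspects the model we just trained and computes the equivalent single weight and bias:

# w1*(w0*x + b0) + b1  is the same as  (w1@w0)*x + (w1@b0 + b1)
w0, b0 = model[0].weight, model[0].bias
w1, b1 = model[1].weight, model[1].bias

w_eff = w1 @ w0          # effective slope
b_eff = w1 @ b0 + b1     # effective intercept
print(w_eff.item(), b_eff.item())   # so the composed model is still just y = w_eff*x + b_eff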

# We need a non-linearity between the two linear operations
model =  nn.Sequential(
    nn.Linear(in_features = 1, out_features = 1),
    nn.Tanh(),
    nn.Linear(1,1)
)
# this model does not reduce to a linear operation

train_and_test_regression_model(model)
_ = plt.title("Non-linearity! But not enough.")
Epoch 1. MSE Loss = 15.631
Epoch 101. MSE Loss = 7.572
Epoch 201. MSE Loss = 7.014
Epoch 301. MSE Loss = 6.965
Epoch 401. MSE Loss = 6.940
Epoch 501. MSE Loss = 6.922
Epoch 601. MSE Loss = 6.907
Epoch 701. MSE Loss = 6.896
Epoch 801. MSE Loss = 6.887
Epoch 901. MSE Loss = 6.880
[Figure: fit of the small tanh model, only slightly non-linear]
# Let's really crank up the depth and the number of hidden nodes
model =  nn.Sequential(
    nn.Linear(in_features = 1, out_features = 30),
    nn.Tanh(),
    nn.Linear(30,30),
    nn.Tanh(),
    nn.Linear(30,30),
    nn.Tanh(),
    nn.Linear(30,30),
    nn.Tanh(),
    nn.Linear(30,30),
    nn.Tanh(),
    nn.Linear(30,30),
    nn.Tanh(),
    nn.Linear(30,1)
)

train_and_test_regression_model(model, num_epochs=50000, reporting_interval=5000)
Epoch 1. MSE Loss = 17.355
Epoch 5001. MSE Loss = 2.136
Epoch 10001. MSE Loss = 1.604
Epoch 15001. MSE Loss = 1.451
Epoch 20001. MSE Loss = 1.072
Epoch 25001. MSE Loss = 1.222
Epoch 30001. MSE Loss = 0.997
Epoch 35001. MSE Loss = 0.880
Epoch 40001. MSE Loss = 0.940
Epoch 45001. MSE Loss = 0.741
[Figure: fit of the large network on the regression data]

Question

What is wrong with this picture?

with torch.no_grad():
    x_grid = torch.linspace(x_reg.min(), x_reg.max(), 200)[:,None]
    y_reg_grid = model(x_grid)
    y_reg_init = model(x_reg)
plt.plot(x_reg.cpu(), y_reg.cpu(), 'o', label='Actual targets')
plt.plot(x_reg.cpu(), y_reg_init.cpu(), 'x', label='Predicted targets')
plt.plot(x_grid.cpu(), y_reg_grid.cpu(), 'k-', label='Predicted')
plt.grid()
_ = plt.legend()
[Figure: large network predictions evaluated on a dense grid of x values, showing wiggles between data points]

Question

Where is overfitting most severe? How might having more data alleviate the problem?

🔥 IMPORTANT CONCEPT 🔥

Overfitting is a type of error that occurs when your model memorizes the training samples and fails to generalize to unseen data. This usually happens when the model has excess capacity.
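A standard way to detect overfitting is to hold out a validation set and compare losses. A minimal sketch of the bookkeeping on this notebook's regression data; note that for the comparison to be meaningful, the model must be trained on the training split only, which the models above were not:

# hold out 20% of the regression data as a validation set
n_train = int(0.8 * n_samples_reg)
perm = torch.randperm(n_samples_reg)
train_ix, val_ix = perm[:n_train], perm[n_train:]
x_train, y_train = x_reg[train_ix], y_reg[train_ix]
x_val, y_val = x_reg[val_ix], y_reg[val_ix]

# after fitting a model on (x_train, y_train) only, compare losses:
# a validation loss much larger than the training loss is the signature of overfitting
with torch.no_grad():
    train_loss = torch.mean((model(x_train).squeeze() - y_train)**2)
    val_loss = torch.mean((model(x_val).squeeze() - y_val)**2)
print(train_loss, val_loss)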

😅 EXERCISE 😅

Using the code cell below, design and test a model that gets it "just right".

#model = your model goes here

# train_and_test_regression_model(model, num_epochs=10000, reporting_interval=1000)

Classification#

Let’s apply the same non-linear treatment to the case of classification.

model =  nn.Sequential(
    nn.Linear(in_features = 2, out_features = 2),
    nn.Tanh(),
    nn.Linear(2,1),
    nn.Sigmoid()
)

train_and_test_classification_model(model, 10000, 1000)
_ = plt.title("Not non-linear enough")
Epoch 1. BCE Loss = 0.748
Epoch 1001. BCE Loss = 0.674
Epoch 2001. BCE Loss = 0.665
Epoch 3001. BCE Loss = 0.640
Epoch 4001. BCE Loss = 0.589
Epoch 5001. BCE Loss = 0.548
Epoch 6001. BCE Loss = 0.522
Epoch 7001. BCE Loss = 0.507
Epoch 8001. BCE Loss = 0.497
Epoch 9001. BCE Loss = 0.492
[Figure: decision regions of the small non-linear classifier ("Not non-linear enough")]

Question

What would overfitting look like in this type of graph?

model =  nn.Sequential(
    nn.Linear(in_features = 2, out_features = 50),
    nn.Tanh(),
    nn.Linear(50,50),
    nn.Tanh(),
    nn.Linear(50,50),
    nn.Tanh(),
    nn.Linear(50,50),
    nn.Tanh(),
    nn.Linear(50,50),
    nn.Tanh(),
    nn.Linear(50,50),
    nn.Tanh(),
    nn.Linear(50,1),
    nn.Sigmoid()
)

train_and_test_classification_model(model, 30000, 3000)
_ = plt.title("What went wrong here?")
Epoch 1. BCE Loss = 0.697
Epoch 3001. BCE Loss = 0.276
Epoch 6001. BCE Loss = 0.251
Epoch 9001. BCE Loss = 0.236
Epoch 12001. BCE Loss = 0.221
Epoch 15001. BCE Loss = 0.209
Epoch 18001. BCE Loss = 0.186
Epoch 21001. BCE Loss = 0.168
Epoch 24001. BCE Loss = 0.150
Epoch 27001. BCE Loss = 0.141
[Figure: decision regions of the large classifier ("What went wrong here?")]

Question

What's wrong with this picture?

😅 EXERCISE 😅

Using the code cell below, design and test a model that gets it "just right".

#model = your model goes here

# train_and_test_classification_model(model, 20000, 2000)

Summing up fully connected neural networks#

  • Fully connected neural networks can represent highly non-linear functions

  • We can learn good functions through gradient descent

  • Overfitting is a big problem

These concepts apply to nearly any neural network trained with gradient descent from the smallest fully connected net to GPT-4.