Intro to Natural Language Processing#

Welcome to NLP. Natural language processing (NLP) aims to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. Applications include sentiment analysis, machine translation, chatbots, speech recognition, text summarization, and information retrieval. In this notebook, we’ll dive into the world of text analysis. We will explore ways to extract meaning from text and build a model that classifies newsgroup posts into their respective topics. We’ll be using a simple technique called Bag of Words, which represents each text as a numerical vector: every word in the vocabulary is assigned an index, and the value at that index is how often the word occurs in the text.
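To make the Bag of Words idea concrete, here is a minimal sketch in plain Python. It is not the scikit-learn implementation used later in this notebook; the helper names `build_vocabulary` and `to_count_vector` and the two toy sentences are made up for illustration, and tokenization is assumed to be a simple whitespace split.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each unique lowercase token to a column index (sorted for stability)."""
    tokens = {token for doc in documents for token in doc.lower().split()}
    return {token: index for index, token in enumerate(sorted(tokens))}

def to_count_vector(document, vocabulary):
    """Return a list where position i holds the count of the word with index i."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector

# Toy corpus, made up for illustration (hypothetical data)
docs = ["the car is fast", "the car the engine"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]

print(vocab)    # word -> column index
print(vectors)  # one count vector per document
```

Note that word order is lost: each document is reduced to counts, which is what makes the representation simple and also what limits it.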

Data#

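To start off, let’s get some data. The dataset we are going to use is the 20 newsgroups dataset, available through scikit-learn. It comprises around 18,000 newsgroup posts on 20 topics. By default, fetch_20newsgroups returns only the training subset, which is why the DataFrame we build here has 11,314 rows.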

from sklearn.datasets import fetch_20newsgroups

news_dataset = fetch_20newsgroups(data_home='./')
import pandas as pd

# make a pandas dataframe out of the dataset
df = pd.DataFrame({
    'text': news_dataset.data,
    'label_number': news_dataset.target,
    'label_name': news_dataset.filenames  # file paths for now; the topic is the parent directory name
})

# df = df[:1000]

#  Let's see what's in it
df.head(5)
text label_number label_name
0 From: lerxst@wam.umd.edu (where's my thing)\nS... 7 ./20news_home/20news-bydate-train/rec.autos/10...
1 From: guykuo@carson.u.washington.edu (Guy Kuo)... 4 ./20news_home/20news-bydate-train/comp.sys.mac...
2 From: twillis@ec.ecn.purdue.edu (Thomas E Will... 4 ./20news_home/20news-bydate-train/comp.sys.mac...
3 From: jgreen@amber (Joe Green)\nSubject: Re: W... 1 ./20news_home/20news-bydate-train/comp.graphic...
4 From: jcm@head-cfa.harvard.edu (Jonathan McDow... 14 ./20news_home/20news-bydate-train/sci.space/60880

Preprocessing Data#

Raw posts are not suitable for analysis as they are, so we first need to clean and transform them. We’ll use a regular expression to strip out unwanted characters, and the CountVectorizer from scikit-learn to transform the collection of texts into a matrix of token counts, where each row represents a document and each column represents a unique word in the document collection. Let’s see it in action.

import re
from pprint import pprint

def clean_df(df: pd.DataFrame):
    # The regex matches HTML-like tags (e.g., <tag />), any character that is not a
    # word character, space, or apostrophe (so punctuation and newlines), digits, and underscores
    regex = re.compile(r"<\w+ /?>|[^\w ']|\d|_")
    
    # Replace the matched substrings with a space and re-assign the result
    df['text'] = df['text'].replace(regex, ' ', regex=True)

    def extract_label_name(text: str):
        # The regex expects Windows-style paths (backslash separators) and extracts the 3rd directory;
        # paths that use forward slashes do not match and are returned unchanged
        match = re.match(r'\./(.+)\\(.+)\\(.+)\\(.+)', text)
        return match.group(3) if match else text

    # extract label_name
    df['label_name'] = df.label_name.apply(extract_label_name)

    return df

# before
pprint(df.iloc[0]['text'])

clean_df(df)

# after
pprint(df.iloc[0]['text'])
("From: lerxst@wam.umd.edu (where's my thing)\n"
 'Subject: WHAT car is this!?\n'
 'Nntp-Posting-Host: rac3.wam.umd.edu\n'
 'Organization: University of Maryland, College Park\n'
 'Lines: 15\n'
 '\n'
 ' I was wondering if anyone out there could enlighten me on this car I saw\n'
 'the other day. It was a 2-door sports car, looked to be from the late 60s/\n'
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition,\n'
 'the front bumper was separate from the rest of the body. This is \n'
 'all I know. If anyone can tellme a model name, engine specs, years\n'
 'of production, where this car is made, history, or whatever info you\n'
 'have on this funky looking car, please e-mail.\n'
 '\n'
 'Thanks,\n'
 '- IL\n'
 '   ---- brought to you by your neighborhood Lerxst ----\n'
 '\n'
 '\n'
 '\n'
 '\n')
("From  lerxst wam umd edu  where's my thing  Subject  WHAT car is this   Nntp "
 'Posting Host  rac  wam umd edu Organization  University of Maryland  College '
 'Park Lines       I was wondering if anyone out there could enlighten me on '
 'this car I saw the other day  It was a   door sports car  looked to be from '
 'the late   s  early   s  It was called a Bricklin  The doors were really '
 'small  In addition  the front bumper was separate from the rest of the body  '
 'This is  all I know  If anyone can tellme a model name  engine specs  years '
 'of production  where this car is made  history  or whatever info you have on '
 'this funky looking car  please e mail   Thanks    IL         brought to you '
 'by your neighborhood Lerxst          ')
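To see how the cleaning pattern produces the "after" text, the sketch below applies the same regular expression to a short string. The sample string is made up for illustration; only the pattern comes from clean_df.

```python
import re

# Same pattern as in clean_df, written as a raw string
pattern = re.compile(r"<\w+ /?>|[^\w ']|\d|_")

# Hypothetical sample that imitates a post header
sample = "From: a@b.edu (where's my thing)\nLines: 15"

# Every match is replaced by a single space
cleaned = pattern.sub(" ", sample)

print(repr(cleaned))
print(cleaned.split())
```

Punctuation, the newline, and the digits all become spaces, while the apostrophe in "where's" survives because it is excluded from the character class.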
from utils import create_answer_box
create_answer_box("We're working with 18,000 newsgroup posts across 20 topics. What challenges do you think we'll face that are unique to text data compared to the image data we've worked with before?", "08-01")


df.head()
text label_number label_name
0 From lerxst wam umd edu where's my thing Su... 7 ./20news_home/20news-bydate-train/rec.autos/10...
1 From guykuo carson u washington edu Guy Kuo ... 4 ./20news_home/20news-bydate-train/comp.sys.mac...
2 From twillis ec ecn purdue edu Thomas E Will... 4 ./20news_home/20news-bydate-train/comp.sys.mac...
3 From jgreen amber Joe Green Subject Re We... 1 ./20news_home/20news-bydate-train/comp.graphic...
4 From jcm head cfa harvard edu Jonathan McDow... 14 ./20news_home/20news-bydate-train/sci.space/60880
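The label_name column still holds full file paths after cleaning, because the pattern in extract_label_name only matches backslash-separated paths. One possible alternative, sketched here under the assumption that the paths use forward slashes as in the output shown, is to take the name of the file's parent directory with pathlib. The helper name `label_from_path` is hypothetical and not part of the notebook.

```python
from pathlib import PurePosixPath

def label_from_path(path: str) -> str:
    """Return the name of the directory that directly contains the file."""
    # Assumption: the path uses forward slashes (POSIX style)
    return PurePosixPath(path).parent.name

# A path copied from the df.head() output in this notebook
example = "./20news_home/20news-bydate-train/sci.space/60880"
print(label_from_path(example))
```

Another option would be to look the name up directly with `news_dataset.target_names[label_number]`, which avoids parsing paths altogether.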
from sklearn.feature_extraction.text import CountVectorizer

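# stop_words='english' removes common English words like "a" or "the" from the text
# max_df=.5 drops words that appear in more than half of the posts; min_df=10 drops words that appear in fewer than 10 posts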
vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
bag_of_words = vectorizer.fit_transform(df['text'])
df.shape
(11314, 3)
bag_of_words
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 981191 stored elements and shape (11314, 14060)>
# see some of the tokens it collected
vectorizer.get_feature_names_out()
array(['aa', 'aaa', 'aardvark', ..., 'zx', 'zyeh', 'zz'], dtype=object)
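The min_df and max_df arguments prune the vocabulary by document frequency, that is, by the number of documents a word appears in. The sketch below imitates that filtering in plain Python on a toy corpus; it is an approximation for illustration and not scikit-learn's code. The function name `filter_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep tokens whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    # Count in how many documents each token appears (not how often overall)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return sorted(
        token for token, df_count in doc_freq.items()
        if min_df <= df_count <= max_df * n_docs
    )

# Toy corpus, made up for illustration (hypothetical data)
toy_docs = [
    ["car", "engine", "fast"],
    ["car", "space", "orbit"],
    ["car", "space", "engine"],
    ["car", "graphics"],
]
kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.5)
print(kept)
```

Here "car" is dropped because it appears in every document, and "fast", "orbit", and "graphics" are dropped because they appear only once. Very common words carry little topic information, and very rare words add columns without helping the model generalize.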

Dataset#

Now let’s use the cleaned DataFrame to make a PyTorch Dataset so that we can manage and load data into our model later.

import torch
import numpy as np
from torch.utils.data import Dataset

class NewsGroupsDataset(Dataset):

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)

        # fit vectorizer
        self.bag_of_words = self.vectorizer.fit_transform(self.df['text'])

    def __getitem__(self, index: int):
        # note: the features must be floats for the linear layer, while
        # CrossEntropyLoss expects the target as an integer class index

        # converting bag-of-words representation into numpy array
        X = self.bag_of_words[index].toarray().squeeze().astype(np.float32)

        # one-hot encoded vector representing the target data
        # Y = [0.0] * len(self.classes)
        # Y[self.df.iloc[index]['label_number']] = 1.0
        # Y = torch.tensor(Y)
        Y = self.df.iloc[index]['label_number']

        return X, Y

    def __len__(self):
        return len(self.df)

    @property
    def classes(self):
        return fetch_20newsgroups(data_home='./').target_names

    @property
    def vocab_size(self):
        return len(self.vectorizer.get_feature_names_out())
dataset = NewsGroupsDataset(df)
dataset[0]
(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 7)
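The target returned by the dataset is a plain class index (7 here) and not a one-hot vector. The sketch below shows, in plain Python, why an index is enough for a cross-entropy loss: the loss is the negative log of the softmax probability assigned to the true class. This is a hand-written illustration for a single example, not PyTorch's implementation, and the scores are made up.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log of the softmax probability assigned to the true class."""
    # Subtract the max logit for numerical stability before exponentiating
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    probability = exps[target_index] / sum(exps)
    return -math.log(probability)

# Three made-up class scores; the true class is index 2 (hypothetical data)
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target_index=2)
confident_loss = cross_entropy([0.0, 0.0, 5.0], target_index=2)

print(uniform_loss, confident_loss)
```

When all scores are equal the loss is log(3), and it shrinks toward zero as the score of the true class grows relative to the others.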
create_answer_box("We're using bag-of-words, but you might have heard of word embeddings or transformers. What advantages might those approaches have over counting words?", "08-02")


Hyperparameters#

Set some hyperparameters for our model.

import os

epochs = 10
batch_size = 128
lr = 1e-3
num_workers = 10
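As a rough sanity check, these values determine how many optimizer steps a training run takes. The arithmetic below assumes the 11,314 rows reported by df.shape, the 20% validation split used later in LitModel, a single device, and a DataLoader that keeps the last partial batch (the PyTorch default).

```python
import math

n_posts = 11314       # rows reported by df.shape in this notebook
val_fraction = 0.2    # assumption: the split fraction used later in LitModel
batch_size = 128      # same value as set above
epochs = 10           # same value as set above

n_val = int(math.floor(val_fraction * n_posts))
n_train = n_posts - n_val
# The last, smaller batch still counts as one step
batches_per_epoch = math.ceil(n_train / batch_size)
total_steps = batches_per_epoch * epochs

print(n_train, n_val, batches_per_epoch, total_steps)
```

This works out to 9052 training posts, 2262 validation posts, 71 batches per epoch, and 710 training steps in total.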

Model#

Next, let’s make the model. We’ll build a very simple linear classifier (a single linear layer trained with cross-entropy loss, also known as multinomial logistic regression) using pytorch_lightning. For those unfamiliar with the library, PyTorch Lightning is a lightweight and flexible PyTorch wrapper that lets you focus on the high-level structure of your deep learning models rather than the low-level details of PyTorch.

Here is a list of the basic things a PyTorch Lightning model needs to get started:

  1. the forward function

  2. training_step and validation_step

  3. configure_optimizers

  4. train_dataloader and val_dataloader

from torch import nn
from torch.utils.data import SubsetRandomSampler
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import torchmetrics

class LitModel(pl.LightningModule):
    def __init__(self, in_features: int, out_features: int, hidden_units=16, *, dataset=dataset):
        super().__init__()
        self.dataset = dataset
        self.loss_fn = nn.CrossEntropyLoss()
        self.metric = torchmetrics.Accuracy(task='multiclass', num_classes=len(self.dataset.classes))

        self.in_features = in_features
        self.out_features = out_features
        self.hidden_units = hidden_units

        # setting up samplers to split data for training and evaluation
        dataset_indices = list(range(len(self.dataset)))
        np.random.shuffle(dataset_indices)
        split_index = int(np.floor(0.2 * len(self.dataset)))
        train_indices, val_indices = dataset_indices[split_index:], dataset_indices[:split_index]
        self.train_sampler = SubsetRandomSampler(train_indices)
        self.val_sampler = SubsetRandomSampler(val_indices)

        # layers
        self.layer = nn.Sequential(
            nn.Identity()
            #nn.Linear(self.in_features, self.hidden_units),
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(self.out_features),
        )

    def forward(self, X: torch.Tensor):
        outputs = self.layer(X)
        return self.fc(outputs)

    def training_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)
        # log data to a logger
        self.log('train_loss', loss.item(), on_step=True, sync_dist=True)
        return loss

    def validation_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)

        accuracy = self.metric(torch.argmax(output, dim=-1), y).item()

        # log data to a logger
        self.log('val_loss', loss.item(), on_step=True, sync_dist=True)
        self.log('accuracy', accuracy, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=lr)

    def train_dataloader(self):
        train_loader = DataLoader(
            dataset=self.dataset,
            batch_size=batch_size,
            sampler=self.train_sampler,
            num_workers=num_workers
        )
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            sampler=self.val_sampler,
            num_workers=num_workers
        )
        return val_loader
create_answer_box("We're doing a random 80/20 split for training/validation. For text classification, what other splitting strategies might be important? What could go wrong with random splits?", "08-03")


model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)
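The split inside LitModel.__init__ shuffles the indices without a fixed seed, so every new model instance gets a different train/validation partition. A minimal sketch of a reproducible variant is shown below; it uses only the standard library, and the function name `split_indices` and its `seed` parameter are hypothetical, not part of the notebook.

```python
import random

def split_indices(n_items, val_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and split them into train and validation."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(indices)
    split_index = int(n_items * val_fraction)
    return indices[split_index:], indices[:split_index]

train_idx, val_idx = split_indices(100, val_fraction=0.2, seed=42)
print(len(train_idx), len(val_idx))
```

With a fixed seed, the experiments that follow would be compared on the same validation posts, which makes differences in accuracy easier to attribute to the model and not to the split.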

Trainer#

To train our model, we’ll need a Trainer. The Trainer is a high-level class that provides a simple and consistent interface for training, validating, and testing your PyTorch models. We’ll also want to log our metrics with a logger. You will need a Weights & Biases (wandb) account to view your logs.

from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer
logger = WandbLogger(
    project='lightning_logs',
    name='experiment1')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
/home/cehrett/.conda/envs/PytorchWorkshop/lib/python3.11/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /home/cehrett/.conda/envs/PytorchWorkshop/lib/python ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
trainer.fit(model)
You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: cehrett. Use `wandb login --relogin` to force relogin
wandb: WARNING Unable to render HTML, can't import display from ipython.core
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/cehrett/.conda/envs/PytorchWorkshop/lib/python3.11/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py:477: The total number of parameters detected may be inaccurate because the model contains an instance of `UninitializedParameter`. To get an accurate number, set `self.example_input_array` in your LightningModule.

  | Name    | Type               | Params | Mode 
-------------------------------------------------------
0 | loss_fn | CrossEntropyLoss   | 0      | train
1 | metric  | MulticlassAccuracy | 0      | train
2 | layer   | Sequential         | 0      | train
3 | fc      | Sequential         | 0      | train
-------------------------------------------------------
0         Trainable params
0         Non-trainable params
0         Total params
0.000     Total estimated model params size (MB)
6         Modules in train mode
0         Modules in eval mode
/home/cehrett/.conda/envs/PytorchWorkshop/lib/python3.11/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
`Trainer.fit` stopped: `max_epochs=10` reached.
logger.experiment.finish()
wandb: WARNING Unable to render Widget, can't import display from ipython.core
wandb: WARNING Unable to render HTML, can't import display from ipython.core

Improvement#

There are many ways to improve a model. One is to try different model architectures. Let’s start by introducing non-linearity: a hidden linear layer followed by the ReLU activation function.
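Why does the activation matter? Without it, two stacked linear layers collapse into a single linear map, so the extra layer adds no expressive power. The scalar sketch below illustrates this with made-up weights; the function names are hypothetical and the example is far smaller than the real model.

```python
def relu(value):
    """ReLU keeps positive inputs and maps negative inputs to zero."""
    return max(0.0, value)

def two_linear_layers(x, w1, w2):
    # Without an activation, this equals one layer with weight w1 * w2
    return w2 * (w1 * x)

def two_layers_with_relu(x, w1, w2):
    # The ReLU in between means the result is no longer a single linear map
    return w2 * relu(w1 * x)

w1, w2 = 2.0, 3.0  # made-up weights (hypothetical values)
for x in (-1.0, 1.0):
    print(x, two_linear_layers(x, w1, w2), two_layers_with_relu(x, w1, w2))
```

For the positive input both versions agree, but for the negative input the ReLU version outputs zero, which no single weight could reproduce for both inputs at once.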

model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)
model.layer = nn.Sequential(
    nn.Linear(model.in_features, model.hidden_units),
    nn.ReLU(),
)
logger = WandbLogger(name='experiment2', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()

Did it improve? Next, try adding regularization with Dropout, which randomly zeroes a fraction of the hidden activations during training.
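As a rough picture of what a dropout layer does, here is a plain Python sketch of "inverted" dropout: during training each value is zeroed with probability p and the survivors are scaled by 1/(1-p), while at evaluation time the input passes through unchanged. This is an illustration assuming 0 <= p < 1, not PyTorch's implementation, and the function name and `rng` parameter are hypothetical.

```python
import random

def dropout(values, p=0.2, training=True, rng=None):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    # Scaling keeps the expected value of each activation unchanged
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else value * scale for value in values]

activations = [1.0, 2.0, 3.0, 4.0]  # made-up hidden activations
print(dropout(activations, p=0.2, training=True, rng=random.Random(0)))
print(dropout(activations, p=0.2, training=False))
```

Because a different random subset of units is silenced on every step, the network cannot rely too heavily on any single feature, which tends to reduce overfitting.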

model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)
model.layer = nn.Sequential(
    nn.Linear(model.in_features, model.hidden_units),
    nn.ReLU(),
    nn.Dropout(.2),
)
logger = WandbLogger(name='experiment3', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()

Conclusion#

Machine learning requires a lot of experimentation: achieving good results often means trying out different models, hyperparameters, and preprocessing techniques. This is only the start of NLP. Throughout the workshop you may find other approaches to this problem.

create_answer_box("Thank you for taking part in the Beginner Pytorch workshop! Please take a moment to describe what you think could improve this workshop's usefulness for you and students like you in the future.", "08-04")
