Intro to Natural Language Processing#
Welcome to NLP. NLP aims to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. Applications of NLP include sentiment analysis, machine translation, chatbots, speech recognition, text summarization, and information retrieval. In this notebook, we’ll dive into the world of text analysis. We will explore ways to extract meaning from text and build a model that classifies newsgroup posts into their respective topics. We’ll use a simple technique called Bag of Words, which represents each text as a numerical vector: every word in the vocabulary is assigned an index, and the value at that index is the number of times the word occurs in the text.
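To make the Bag of Words idea concrete, here is a minimal pure-Python sketch. The two example sentences and the helper name `bag_of_words_vectors` are made up for illustration; the notebook itself relies on scikit-learn's CountVectorizer, which also handles tokenization details this sketch ignores.

```python
from collections import Counter

def bag_of_words_vectors(documents):
    """Return (vocabulary, vectors) for a list of strings (illustrative helper)."""
    # Naive tokenization: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in documents]
    # Sort the vocabulary so that every word gets a stable column index
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word, count in Counter(tokens).items():
            row[index[word]] = count  # frequency of the word in this document
        vectors.append(row)
    return vocabulary, vectors

docs = ["the car is fast", "the car is a red car"]
vocabulary, vectors = bag_of_words_vectors(docs)
print(vocabulary)  # ['a', 'car', 'fast', 'is', 'red', 'the']
print(vectors)     # [[0, 1, 1, 1, 0, 1], [1, 2, 0, 1, 1, 1]]
```

Note that word order is discarded: only the counts per vocabulary index survive.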
Data#
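To start off, let’s get some data. The dataset we are going to use is the 20 newsgroups dataset from scikit-learn.
The full dataset comprises around 18,000 newsgroup posts on 20 topics. By default, fetch_20newsgroups loads only the training subset, which is why the DataFrame we build below has 11,314 rows.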
from sklearn.datasets import fetch_20newsgroups
news_dataset = fetch_20newsgroups(data_home='./')
import pandas as pd
# make a pandas dataframe out of the dataset
df = pd.DataFrame({
'text': news_dataset.data,
'label_number': news_dataset.target,
'label_name': news_dataset.filenames
})
# df = df[:1000]
# Let's see what's in it
df.head(5)
 | text | label_number | label_name |
---|---|---|---|
0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | ./20news_home/20news-bydate-train/rec.autos/10... |
1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | ./20news_home/20news-bydate-train/comp.sys.mac... |
2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | ./20news_home/20news-bydate-train/comp.sys.mac... |
3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | ./20news_home/20news-bydate-train/comp.graphic... |
4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | ./20news_home/20news-bydate-train/sci.space/60880 |
Preprocessing Data#
To get usable data, we must transform the raw text into a form suitable for analysis.
We’ll use a regular expression to strip unwanted strings, and
the CountVectorizer from scikit-learn to transform the collection of texts
into a matrix of token counts, where each row represents a document and
each column represents a unique word in the document collection.
Let’s see it in action.
import re
from pprint import pprint
def clean_df(df: pd.DataFrame):
    # The regex matches HTML-like tags (e.g., <tag />),
    # non-word characters (excluding spaces and apostrophes), digits, and underscores
    regex = re.compile('<\\w+ /?>|[^\\w \']|\\d|_')
    # Replace the matched substrings with a space and re-assign the result
    df['text'] = df['text'].replace(regex, ' ', regex=True)

    def extract_label_name(text: str):
        # The regex expects backslash (Windows-style) separators after the leading './'
        # and extracts the 3rd directory. On paths that use forward slashes it does
        # not match, so the path is returned unchanged.
        match = re.match(r'\./(.+)\\(.+)\\(.+)\\(.+)', text)
        return match.group(3) if match else text

    # extract label_name
    df['label_name'] = df.label_name.apply(extract_label_name)
    return df
# before
pprint(df.iloc[0]['text'])
clean_df(df)
# after
pprint(df.iloc[0]['text'])
("From: lerxst@wam.umd.edu (where's my thing)\n"
'Subject: WHAT car is this!?\n'
'Nntp-Posting-Host: rac3.wam.umd.edu\n'
'Organization: University of Maryland, College Park\n'
'Lines: 15\n'
'\n'
' I was wondering if anyone out there could enlighten me on this car I saw\n'
'the other day. It was a 2-door sports car, looked to be from the late 60s/\n'
'early 70s. It was called a Bricklin. The doors were really small. In '
'addition,\n'
'the front bumper was separate from the rest of the body. This is \n'
'all I know. If anyone can tellme a model name, engine specs, years\n'
'of production, where this car is made, history, or whatever info you\n'
'have on this funky looking car, please e-mail.\n'
'\n'
'Thanks,\n'
'- IL\n'
' ---- brought to you by your neighborhood Lerxst ----\n'
'\n'
'\n'
'\n'
'\n')
("From lerxst wam umd edu where's my thing Subject WHAT car is this Nntp "
'Posting Host rac wam umd edu Organization University of Maryland College '
'Park Lines I was wondering if anyone out there could enlighten me on '
'this car I saw the other day It was a door sports car looked to be from '
'the late s early s It was called a Bricklin The doors were really '
'small In addition the front bumper was separate from the rest of the body '
'This is all I know If anyone can tellme a model name engine specs years '
'of production where this car is made history or whatever info you have on '
'this funky looking car please e mail Thanks IL brought to you '
'by your neighborhood Lerxst ')
from utils import create_answer_box
create_answer_box("We're working with 18,000 newsgroup posts across 20 topics. What challenges do you think we'll face that are unique to text data compared to the image data we've worked with before?", "08-01")
df.head()
 | text | label_number | label_name |
---|---|---|---|
0 | From lerxst wam umd edu where's my thing Su... | 7 | ./20news_home/20news-bydate-train/rec.autos/10... |
1 | From guykuo carson u washington edu Guy Kuo ... | 4 | ./20news_home/20news-bydate-train/comp.sys.mac... |
2 | From twillis ec ecn purdue edu Thomas E Will... | 4 | ./20news_home/20news-bydate-train/comp.sys.mac... |
3 | From jgreen amber Joe Green Subject Re We... | 1 | ./20news_home/20news-bydate-train/comp.graphic... |
4 | From jcm head cfa harvard edu Jonathan McDow... | 14 | ./20news_home/20news-bydate-train/sci.space/60880 |
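The label_name column above still holds full file paths, because the paths in this run use forward slashes. One possible separator-agnostic way to pull out the topic directory is sketched below with the standard library's pathlib. The helper name `label_from_path` and the backslash example path are hypothetical and not part of the original notebook.

```python
from pathlib import PurePosixPath

def label_from_path(path: str) -> str:
    """Return the name of the directory that directly contains the file."""
    # Treat backslashes as separators too, so Windows-style paths also work
    normalized = path.replace('\\', '/')
    return PurePosixPath(normalized).parent.name

# Path taken from the table above
print(label_from_path('./20news_home/20news-bydate-train/sci.space/60880'))  # sci.space
# Hypothetical Windows-style path, for illustration only
print(label_from_path('.\\data\\train\\rec.autos\\0001'))  # rec.autos
```

If this approach is adopted, it could be applied with `df['label_name'].apply(label_from_path)` in place of the regex-based helper.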
from sklearn.feature_extraction.text import CountVectorizer
# stop_words='english' removes common English words like "a" or "the" from the text
vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
bag_of_words = vectorizer.fit_transform(df['text'])
df.shape
(11314, 3)
bag_of_words
<Compressed Sparse Row sparse matrix of dtype 'int64'
with 981191 stored elements and shape (11314, 14060)>
# see some of the tokens it collected
vectorizer.get_feature_names_out()
array(['aa', 'aaa', 'aardvark', ..., 'zx', 'zyeh', 'zz'], dtype=object)
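The min_df and max_df arguments used above prune the vocabulary by document frequency: an integer min_df is an absolute number of documents, and a float max_df is a proportion of documents. The following pure-Python sketch illustrates that filtering rule on a toy corpus. The helper `filter_vocabulary` and the four tiny documents are invented for illustration and are not how scikit-learn implements it internally.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.5):
    """Keep tokens that appear in at least min_df documents
    and in at most max_df * n_docs documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() so that a token counts once per document
        doc_freq.update(set(tokens))
    return sorted(
        token for token, freq in doc_freq.items()
        if freq >= min_df and freq <= max_df * n_docs
    )

toy_docs = [
    ["car", "engine", "the"],
    ["car", "space", "the"],
    ["space", "orbit", "the"],
    ["the", "engine"],
]
kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.5)
print(kept)  # ['car', 'engine', 'space']
```

Here "the" is dropped for appearing in every document and "orbit" for appearing in only one, which is the same trade-off min_df=10 and max_df=.5 make on the newsgroup posts.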
Dataset#
Now let’s use the cleaned DataFrame to make a PyTorch Dataset so that we can manage and load data into our model later.
import torch
import numpy as np
from torch.utils.data import Dataset
class NewsGroupsDataset(Dataset):
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
        # fit vectorizer
        self.bag_of_words = self.vectorizer.fit_transform(self.df['text'])

    def __getitem__(self, index: int):
        # note: the model's linear layer expects float32 inputs, while
        # CrossEntropyLoss takes the target as an integer class index
        # converting bag-of-words representation into numpy array
        X = self.bag_of_words[index].toarray().squeeze().astype(np.float32)
        # one-hot encoded vector representing the target data
        # Y = [0.0] * len(self.classes)
        # Y[self.df.iloc[index]['label_number']] = 1.0
        # Y = torch.tensor(Y)
        Y = self.df.iloc[index]['label_number']
        return X, Y

    def __len__(self):
        return len(self.df)

    @property
    def classes(self):
        return fetch_20newsgroups(data_home='./').target_names

    @property
    def vocab_size(self):
        return len(self.vectorizer.get_feature_names_out())
dataset = NewsGroupsDataset(df)
dataset[0]
(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 7)
create_answer_box("We're using bag-of-words, but you might have heard of word embeddings or transformers. What advantages might those approaches have over counting words?", "08-02")
Hyperparameters#
Set some hyperparameters for our model.
import os
epochs = 10
batch_size = 128
lr = 1e-3
num_workers = 10
Model#
Next, let’s make the model. We’ll build a very simple linear classifier (a single linear layer trained with cross-entropy loss, i.e. multinomial logistic regression) using pytorch_lightning.
For those unfamiliar with the library, PyTorch Lightning is a lightweight and flexible PyTorch wrapper
that lets you focus on the high-level structure of your deep learning models rather than the low-level details of PyTorch.
Here is a list of the basic things a PyTorch Lightning model needs to get started:
the forward function
training_step and validation_step
configure_optimizers
train_dataloader and val_dataloader
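Before looking at the model code, it may help to see what the loss computes. nn.CrossEntropyLoss takes the raw scores (logits) produced by the linear layer together with an integer class index, applies a softmax, and returns the negative log-probability of the correct class. The sketch below reproduces that arithmetic for a single example in plain Python; the logits are made-up numbers and the function names are illustrative only.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability; the result is unchanged
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])

logits = [2.0, 0.5, -1.0]  # hypothetical scores for 3 classes
probabilities = softmax(logits)
loss_if_class_0 = cross_entropy(logits, 0)  # the model favours class 0: small loss
loss_if_class_2 = cross_entropy(logits, 2)  # the model disfavours class 2: large loss
print(probabilities, loss_if_class_0, loss_if_class_2)
```

This is why the dataset returns the label as a plain class index rather than a one-hot vector.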
from torch import nn
from torch.utils.data import SubsetRandomSampler
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import torchmetrics
class LitModel(pl.LightningModule):
    def __init__(self, in_features: int, out_features: int, hidden_units=16, *, dataset=dataset):
        super().__init__()
        self.dataset = dataset
        self.loss_fn = nn.CrossEntropyLoss()
        self.metric = torchmetrics.Accuracy(task='multiclass', num_classes=len(self.dataset.classes))
        self.in_features = in_features
        self.out_features = out_features
        self.hidden_units = hidden_units
        # setting up samplers to split data for training and evaluation
        dataset_indices = list(range(len(self.dataset)))
        np.random.shuffle(dataset_indices)
        split_index = int(np.floor(0.2 * len(self.dataset)))
        train_indices, val_indices = dataset_indices[split_index:], dataset_indices[:split_index]
        self.train_sampler = SubsetRandomSampler(train_indices)
        self.val_sampler = SubsetRandomSampler(val_indices)
        # layers
        self.layer = nn.Sequential(
            nn.Identity()
            #nn.Linear(self.in_features, self.hidden_units),
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(self.out_features),
        )

    def forward(self, X: torch.Tensor):
        outputs = self.layer(X)
        return self.fc(outputs)

    def training_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)
        # log data to a logger
        self.log('train_loss', loss.item(), on_step=True, sync_dist=True)
        return loss

    def validation_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)
        accuracy = self.metric(torch.argmax(output, dim=-1), y).item()
        # log data to a logger
        self.log('val_loss', loss.item(), on_step=True, sync_dist=True)
        self.log('accuracy', accuracy, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=lr)

    def train_dataloader(self):
        # use the dataset passed to the constructor rather than the global variable
        train_loader = DataLoader(
            dataset=self.dataset,
            batch_size=batch_size,
            sampler=self.train_sampler,
            num_workers=num_workers
        )
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(
            dataset=self.dataset,
            batch_size=batch_size,
            sampler=self.val_sampler,
            num_workers=num_workers
        )
        return val_loader
create_answer_box("We're doing a random 80/20 split for training/validation. For text classification, what other splitting strategies might be important? What could go wrong with random splits?", "08-03")
model = LitModel(
in_features=dataset.vocab_size,
out_features=len(dataset.classes),
hidden_units=32,
dataset=dataset
)
Trainer#
To train our model, we’ll need a Trainer. The Trainer is a high-level class that provides a simple and consistent interface for training, validating, and testing your PyTorch models.
We’ll also want to log our data with a logger. You will need a Weights & Biases (wandb) account to view your logs.
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer
logger = WandbLogger(
project='lightning_logs',
name='experiment1')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
/home/cehrett/.conda/envs/PytorchWorkshop/lib/python3.11/site-packages/lightning_fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /home/cehrett/.conda/envs/PytorchWorkshop/lib/python ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
trainer.fit(model)
You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: cehrett. Use `wandb login --relogin` to force relogin
wandb: WARNING Unable to render HTML, can't import display from ipython.core
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/cehrett/.conda/envs/PytorchWorkshop/lib/python3.11/site-packages/pytorch_lightning/utilities/model_summary/model_summary.py:477: The total number of parameters detected may be inaccurate because the model contains an instance of `UninitializedParameter`. To get an accurate number, set `self.example_input_array` in your LightningModule.
| Name | Type | Params | Mode
-------------------------------------------------------
0 | loss_fn | CrossEntropyLoss | 0 | train
1 | metric | MulticlassAccuracy | 0 | train
2 | layer | Sequential | 0 | train
3 | fc | Sequential | 0 | train
-------------------------------------------------------
0 Trainable params
0 Non-trainable params
0 Total params
0.000 Total estimated model params size (MB)
6 Modules in train mode
0 Modules in eval mode
/home/cehrett/.conda/envs/PytorchWorkshop/lib/python3.11/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 8, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
`Trainer.fit` stopped: `max_epochs=10` reached.
logger.experiment.finish()
wandb: WARNING Unable to render Widget, can't import display from ipython.core
wandb: WARNING Unable to render HTML, can't import display from ipython.core
Improvement#
There are several ways to improve a model. One is to try a different model architecture.
Let’s start by introducing non-linearity: a hidden linear layer followed by the ReLU activation function.
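Why does the activation matter? Two linear layers stacked directly on top of each other collapse into a single linear map, so the extra layer adds no expressive power until a non-linearity sits between them. The scalar sketch below shows this with made-up weights; it is a toy illustration, not the model trained in this notebook.

```python
def linear(weight, bias):
    """Return a one-dimensional linear layer x -> weight * x + bias."""
    return lambda x: weight * x + bias

def relu(x):
    return max(0.0, x)

first = linear(2.0, -1.0)   # hypothetical first layer
second = linear(-3.0, 0.5)  # hypothetical second layer

def stacked_linear(x):
    # Two linear layers with nothing in between
    return second(first(x))

def collapsed(x):
    # The equivalent single layer: weight = -3 * 2, bias = -3 * (-1) + 0.5
    return -6.0 * x + 3.5

def stacked_relu(x):
    # The same two layers with ReLU in between
    return second(relu(first(x)))

for x in [-2.0, 0.0, 1.0, 3.0]:
    print(x, stacked_linear(x), collapsed(x), stacked_relu(x))
```

The first two columns of results always agree, while the ReLU version departs from them wherever the first layer's output is negative.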
model = LitModel(
in_features=dataset.vocab_size,
out_features=len(dataset.classes),
hidden_units=32,
dataset=dataset
)
model.layer = nn.Sequential(
nn.Linear(model.in_features, model.hidden_units),
nn.ReLU(),
)
logger = WandbLogger(name='experiment2', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()
Did it improve?
Next, try adding regularization with Dropout.
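As a rough picture of what nn.Dropout does: during training each activation is zeroed with probability p and the survivors are scaled by 1 / (1 - p), while in evaluation mode the input passes through unchanged. The pure-Python sketch below mimics that behaviour on a list of numbers. The function name and the seeded random generator are choices made for this illustration.

```python
import random

def dropout(values, p=0.2, training=True, rng=None):
    """Inverted dropout on a list of floats (illustrative, not the PyTorch code)."""
    if not training or p == 0.0:
        # Evaluation mode: nothing is dropped or rescaled
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    # Zero each value with probability p, scale the rest to keep the expected sum
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

activations = [1.0] * 10
train_out = dropout(activations, p=0.2, training=True, rng=random.Random(0))
eval_out = dropout(activations, p=0.2, training=False)
print(train_out)  # a mix of 0.0 and 1.25
print(eval_out)   # unchanged
```

Because different units are silenced on every step, the network cannot lean too heavily on any single word count.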
model = LitModel(
in_features=dataset.vocab_size,
out_features=len(dataset.classes),
hidden_units=32,
dataset=dataset
)
model.layer = nn.Sequential(
nn.Linear(model.in_features, model.hidden_units),
nn.ReLU(),
nn.Dropout(.2),
)
logger = WandbLogger(name='experiment3', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()
Conclusion#
Machine learning requires lots of experimenting. It often requires trying out different models, hyperparameters, and preprocessing techniques to achieve optimal results. This is only the start of NLP. Throughout the workshop you may find other approaches to this problem.
create_answer_box("Thank you for taking part in the Beginner Pytorch workshop! Please take a moment to describe what you think could improve this workshop's usefulness for you and students like you in the future.", "08-04")