Intro to Natural Language Processing#
Welcome to NLP. NLP aims to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. Applications of NLP include sentiment analysis, machine translation, chatbots, speech recognition, text summarization, and information retrieval. In this notebook, we’ll dive into the world of text analysis. We will explore ways to extract meaning from text and build a model that can classify newsgroup posts into their respective topics. We’ll be using a simple technique called Bag of Words, which represents each document as a numerical vector of word counts indexed by a fixed vocabulary.
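To make the idea concrete, here is a minimal bag-of-words sketch on a made-up two-document corpus (plain Python, for illustration only):
# toy corpus (hypothetical, just to illustrate the representation)
docs = ["the cat sat on the mat", "the dog sat"]
# build a shared vocabulary: each unique word gets a column index
vocab = sorted({word for doc in docs for word in doc.split()})
# represent each document as a vector of word counts over that vocabulary
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]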
Data#
To start off, let’s get some data. The dataset we are going to use is the 20 newsgroups dataset from scikit-learn.
It comprises around 18,000 newsgroup posts on 20 topics.
from sklearn.datasets import fetch_20newsgroups
news_dataset = fetch_20newsgroups(data_home='./')
import pandas as pd
# make a pandas dataframe out of the dataset
df = pd.DataFrame({
'text': news_dataset.data,
'label_number': news_dataset.target,
'label_name': news_dataset.filenames  # file paths for now; the topic name is extracted from these below
})
# df = df[:1000]  # uncomment to experiment on a smaller subset
# Let's see what's in it
df.head(5)
Preprocessing Data#
In order to get usable data, we must transform the text into a form suitable for analysis.
We’ll be using some regular expressions to clean out unwanted strings, and
the CountVectorizer from scikit-learn to transform the collection of texts
into a matrix of token counts, where each row represents a document and
each column represents a unique word in the document collection.
Let’s see it in action.
import re
from pprint import pprint

def clean_df(df: pd.DataFrame):
    # The regex matches HTML-like tags (e.g., <tag />),
    # non-word characters (excluding spaces and apostrophes), digits, and underscores
    regex = re.compile(r"<\w+ /?>|[^\w ']|\d|_")
    # Replace the matched substrings with a space and re-assign the result
    df['text'] = df['text'].replace(regex, ' ', regex=True)

    def extract_label_name(text: str):
        # The path looks like ./<dir>/<subset>/<topic>/<file>;
        # match either / or \ as the separator and extract the 3rd directory (the topic)
        match = re.match(r'\./(.+)[/\\](.+)[/\\](.+)[/\\](.+)', text)
        return match.group(3) if match else text

    # extract label_name
    df['label_name'] = df.label_name.apply(extract_label_name)
    return df
# before
pprint(df.iloc[0]['text'])
clean_df(df)
# after
pprint(df.iloc[0]['text'])
df.head()
from sklearn.feature_extraction.text import CountVectorizer
# stop_words='english' removes common English words like "a" or "the" from the text
vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
bag_of_words = vectorizer.fit_transform(df['text'])
df.shape
bag_of_words
# see some of the tokens it collected
vectorizer.get_feature_names_out()
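As a quick sanity check (a small sketch using the bag_of_words matrix and vectorizer fitted above), we can look at the matrix shape and map the first document’s non-zero counts back to their tokens:
# number of documents and size of the vocabulary
print(bag_of_words.shape)
# non-zero token counts for the first document, mapped back to their tokens
feature_names = vectorizer.get_feature_names_out()
first_doc = bag_of_words[0].toarray().ravel()
print({feature_names[i]: int(first_doc[i]) for i in first_doc.nonzero()[0]})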
Dataset#
Now let’s use the cleaned DataFrame to make a PyTorch Dataset
so that we can manage and load data into our model later.
import torch
import numpy as np
from torch.utils.data import Dataset

class NewsGroupsDataset(Dataset):
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
        # fit vectorizer
        self.bag_of_words = self.vectorizer.fit_transform(self.df['text'])

    def __getitem__(self, index: int):
        # the model expects float inputs, so convert the sparse bag-of-words row
        # into a dense float32 numpy array
        X = self.bag_of_words[index].toarray().squeeze().astype(np.float32)
        # CrossEntropyLoss takes the class index directly as the target;
        # a one-hot encoded vector would also work:
        # Y = [0.0] * len(self.classes)
        # Y[self.df.iloc[index]['label_number']] = 1.0
        # Y = torch.tensor(Y)
        Y = self.df.iloc[index]['label_number']
        return X, Y

    def __len__(self):
        return len(self.df)

    @property
    def classes(self):
        return fetch_20newsgroups(data_home='./').target_names

    @property
    def vocab_size(self):
        return len(self.vectorizer.get_feature_names_out())
dataset = NewsGroupsDataset(df)
dataset[0]
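To confirm the Dataset returns what we expect, here is a quick sanity-check sketch using the dataset object created above:
# one sample: a dense float32 bag-of-words vector and an integer class label
X, y = dataset[0]
print(X.shape, X.dtype)        # (vocab_size,) float32
print(y, dataset.classes[y])   # label number and its topic name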
Hyperparameters#
Set some hyperparameters for our model.
import os
epochs = 10
batch_size = 128
lr = 1e-3
num_workers = 10
Model#
Next, let’s make the model. We’ll make a super simple linear model using pytorch_lightning.
For those unfamiliar with the library, PyTorch Lightning is a lightweight and flexible PyTorch wrapper
that lets you focus on the high-level structure of your deep learning models rather than the low-level details of PyTorch.
Here is a list of the basic things we need in a PyTorch Lightning model to get started:
the forward function
training_step and validation_step
configure_optimizers
train_dataloader and val_dataloader
from torch import nn
from torch.utils.data import SubsetRandomSampler
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import torchmetrics

class LitModel(pl.LightningModule):
    def __init__(self, in_features: int, out_features: int, hidden_units=16, *, dataset=dataset):
        super().__init__()
        self.dataset = dataset
        self.loss_fn = nn.CrossEntropyLoss()
        self.metric = torchmetrics.Accuracy(task='multiclass', num_classes=len(self.dataset.classes))
        self.in_features = in_features
        self.out_features = out_features
        self.hidden_units = hidden_units
        # setting up samplers to split data for training and evaluation
        dataset_indices = list(range(len(self.dataset)))
        np.random.shuffle(dataset_indices)
        split_index = int(np.floor(0.2 * len(self.dataset)))
        train_indices, val_indices = dataset_indices[split_index:], dataset_indices[:split_index]
        self.train_sampler = SubsetRandomSampler(train_indices)
        self.val_sampler = SubsetRandomSampler(val_indices)
        # layers
        self.layer = nn.Sequential(
            nn.Identity()
            # nn.Linear(self.in_features, self.hidden_units),
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(self.out_features),
        )

    def forward(self, X: torch.Tensor):
        outputs = self.layer(X)
        return self.fc(outputs)

    def training_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)
        # log data to a logger
        self.log('train_loss', loss.item(), on_step=True, sync_dist=True)
        return loss

    def validation_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)
        accuracy = self.metric(torch.argmax(output, dim=-1), y).item()
        # log data to a logger
        self.log('val_loss', loss.item(), on_step=True, sync_dist=True)
        self.log('accuracy', accuracy, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=lr)

    def train_dataloader(self):
        train_loader = DataLoader(
            dataset=self.dataset,
            batch_size=batch_size,
            sampler=self.train_sampler,
            num_workers=num_workers
        )
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(
            dataset=self.dataset,
            batch_size=batch_size,
            sampler=self.val_sampler,
            num_workers=num_workers
        )
        return val_loader
model = LitModel(
in_features=dataset.vocab_size,
out_features=len(dataset.classes),
hidden_units=32,
dataset=dataset
)
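Before training, it can help to push a single sample through the untrained model to make sure the shapes line up (a quick sanity-check sketch, not required for training):
# run one bag-of-words vector through the model; unsqueeze adds a batch dimension
X, y = dataset[0]
with torch.no_grad():
    logits = model(torch.from_numpy(X).unsqueeze(0))
print(logits.shape)  # expected: (1, number of classes)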
Trainer#
To train our model, we’ll need a Trainer.
The Trainer is a high-level module that provides a simple and consistent interface for training, validating, and testing your PyTorch models.
We’ll also want to log our data with a logger. You will need a Weights & Biases (wandb) account to view your logs.
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer
logger = WandbLogger(name='experiment1', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()
Improvement#
There are many ways to improve a model. One is to try a different model architecture.
Let’s start by introducing non-linearity with the ReLU activation function.
model = LitModel(
in_features=dataset.vocab_size,
out_features=len(dataset.classes),
hidden_units=32,
dataset=dataset
)
model.layer = nn.Sequential(
nn.Linear(model.in_features, model.hidden_units),
nn.ReLU(),
)
logger = WandbLogger(name='experiment2', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()
Did it improve?
Now try adding regularization with Dropout.
model = LitModel(
in_features=dataset.vocab_size,
out_features=len(dataset.classes),
hidden_units=32,
dataset=dataset
)
model.layer = nn.Sequential(
nn.Linear(model.in_features, model.hidden_units),
nn.ReLU(),
nn.Dropout(.2),
)
logger = WandbLogger(name='experiment3', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()
Conclusion#
Machine learning requires a lot of experimentation. It often takes trying different models, hyperparameters, and preprocessing techniques to achieve good results. This is only the start of NLP; throughout the workshop you may find other approaches to this problem.
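For example, one preprocessing variation worth trying (a hedged sketch, not covered above) is swapping the raw counts from CountVectorizer for TF-IDF weights using scikit-learn's TfidfVectorizer, which down-weights words that appear in many documents:
from sklearn.feature_extraction.text import TfidfVectorizer
# same interface as CountVectorizer, but produces TF-IDF weights instead of raw counts
tfidf_vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
print(tfidf_matrix.shape)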