Intro to Natural Language Processing#

Welcome to NLP. Natural language processing (NLP) aims to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. Applications of NLP include sentiment analysis, machine translation, chatbots, speech recognition, text summarization, and information retrieval. In this notebook, we’ll dive into the world of text analysis: we will explore ways to extract meaning from text and build a model that can classify newsgroup posts into their respective topics. We’ll be using a simple technique called Bag of Words, which represents each text as a numerical vector in which each index corresponds to a vocabulary word and each value is that word’s frequency in the text.
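To make the idea concrete, here is a tiny, self-contained illustration (using two made-up sentences, not the newsgroups data) of how a bag-of-words vectorizer turns documents into count vectors; the vocabulary and counts in the comments are what scikit-learn's CountVectorizer produces for these toy documents.

from sklearn.feature_extraction.text import CountVectorizer

# two toy documents, purely for illustration
toy_docs = ["the cat sat on the mat", "the dog sat on the log"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

print(toy_vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(toy_counts.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]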

Data#

To start off, let’s get some data. The dataset we are going to use is the 20 newsgroups dataset from scikit-learn, which comprises around 18,000 newsgroup posts on 20 topics.

from sklearn.datasets import fetch_20newsgroups

news_dataset = fetch_20newsgroups(data_home='./')
import pandas as pd

# make a pandas dataframe out of the dataset
# (label_name starts out as the file path; we'll extract the topic name from it later)
df = pd.DataFrame({
    'text': news_dataset.data,
    'label_number': news_dataset.target,
    'label_name': news_dataset.filenames
})

# df = df[:1000]  # optionally work with a smaller subset for faster iteration

#  Let's see what's in it
df.head(5)

Preprocessing Data#

Before we can analyze the text, we must transform it into a form suitable for analysis. We’ll use some regular expressions to clean out unwanted strings, and the CountVectorizer from scikit-learn to transform the collection of texts into a matrix of token counts, where each row represents a document and each column represents a unique word in the collection. Let’s see it in action.

import re
from pprint import pprint

def clean_df(df: pd.DataFrame):
    # The regex matches HTML-like tags (e.g., <tag />), 
    # non-word characters (excluding spaces and apostrophes), digits, and underscores
    regex = re.compile(r"<\w+ /?>|[^\w ']|\d|_")
    
    # Replace the matched substrings with a space and re-assign the result
    df['text'] = df['text'].replace(regex, ' ', regex=True)

    def extract_label_name(text: str):
        # The filename looks like ./<cache_dir>/<subset>/<topic>/<doc_id>;
        # match both / and \ path separators and return the topic (3rd group)
        match = re.match(r'\./(.+)[/\\](.+)[/\\](.+)[/\\](.+)', text)
        return match.group(3) if match else text

    # extract label_name
    df['label_name'] = df.label_name.apply(extract_label_name)

    return df

# before
pprint(df.iloc[0]['text'])

clean_df(df)

# after
pprint(df.iloc[0]['text'])
df.head()
from sklearn.feature_extraction.text import CountVectorizer

# stop_words='english' removes common English words like "a" or "the" from the text
vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
bag_of_words = vectorizer.fit_transform(df['text'])
df.shape
bag_of_words
# see some of the tokens it collected
vectorizer.get_feature_names_out()
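As a quick, optional sanity check (not part of the pipeline above), we can map a single document's row of the matrix back to (token, count) pairs; the columns line up with the order of get_feature_names_out().

# peek at the non-zero token counts for the first document
row = bag_of_words[0].toarray().ravel()
tokens = vectorizer.get_feature_names_out()
print({tok: int(cnt) for tok, cnt in zip(tokens, row) if cnt > 0})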

Dataset#

Now let’s use the cleaned DataFrame to make a PyTorch Dataset so that we can manage and load data into our model later.

import torch
import numpy as np
from torch.utils.data import Dataset

class NewsGroupsDataset(Dataset):

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)

        # fit vectorizer
        self.bag_of_words = self.vectorizer.fit_transform(self.df['text'])

    def __getitem__(self, index: int):
        # note: CrossEntropyLoss expects float inputs (logits) and integer class targets

        # convert the sparse bag-of-words row into a dense float32 numpy array
        X = self.bag_of_words[index].toarray().squeeze().astype(np.float32)

        # an alternative one-hot encoded target (not needed here, since
        # CrossEntropyLoss accepts integer class indices directly):
        # Y = [0.0] * len(self.classes)
        # Y[self.df.iloc[index]['label_number']] = 1.0
        # Y = torch.tensor(Y)
        Y = self.df.iloc[index]['label_number']

        return X, Y

    def __len__(self):
        return len(self.df)

    @property
    def classes(self):
        return fetch_20newsgroups(data_home='./').target_names

    @property
    def vocab_size(self):
        return len(self.vectorizer.get_feature_names_out())
dataset = NewsGroupsDataset(df)
dataset[0]
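Before building the model, it can be worth checking that a DataLoader batches the dataset as expected. This is just an optional sanity check; the batch size of 4 here is arbitrary.

from torch.utils.data import DataLoader

# optional sanity check: draw one small batch from the dataset
check_loader = DataLoader(dataset, batch_size=4, shuffle=True)
X_batch, y_batch = next(iter(check_loader))
print(X_batch.shape, X_batch.dtype)  # (4, vocab_size) float32 bag-of-words vectors
print(y_batch.shape, y_batch.dtype)  # (4,) integer class labels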

Hyperparameters#

Set some hyperparameters for our model.

import os

epochs = 10
batch_size = 128
lr = 1e-3
num_workers = 10

Model#

Next, let’s make the model. We’ll make a super simple linear classifier (essentially multinomial logistic regression) using pytorch_lightning. For those who are unfamiliar with the library, PyTorch Lightning is a lightweight and flexible PyTorch wrapper that lets you focus on the high-level structure of your deep learning models rather than the low-level details of PyTorch.

Here is a list of the basic things we will need in a PyTorch Lightning model to get started.

  1. the forward function

  2. training_step and validation_step

  3. configure_optimizers

  4. train_dataloader and val_dataloader

from torch import nn
from torch.utils.data import SubsetRandomSampler
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import torchmetrics

class LitModel(pl.LightningModule):
    def __init__(self, in_features: int, out_features: int, hidden_units=16, *, dataset=dataset):
        super().__init__()
        self.dataset = dataset
        self.loss_fn = nn.CrossEntropyLoss()
        self.metric = torchmetrics.Accuracy(task='multiclass', num_classes=len(self.dataset.classes))

        self.in_features = in_features
        self.out_features = out_features
        self.hidden_units = hidden_units

        # setting up samplers to split data for training and evaluation
        dataset_indices = list(range(len(self.dataset)))
        np.random.shuffle(dataset_indices)
        split_index = int(np.floor(0.2 * len(self.dataset)))
        train_indices, val_indices = dataset_indices[split_index:], dataset_indices[:split_index]
        self.train_sampler = SubsetRandomSampler(train_indices)
        self.val_sampler = SubsetRandomSampler(val_indices)

        # layers
        self.layer = nn.Sequential(
            nn.Identity()
            #nn.Linear(self.in_features, self.hidden_units),
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(self.out_features),
        )

    def forward(self, X: torch.Tensor):
        outputs = self.layer(X)
        return self.fc(outputs)

    def training_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)
        # log data to a logger
        self.log('train_loss', loss.item(), on_step=True, sync_dist=True)
        return loss

    def validation_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)

        accuracy = self.metric(torch.argmax(output, dim=-1), y).item()

        # log data to a logger
        self.log('val_loss', loss.item(), on_step=True, sync_dist=True)
        self.log('accuracy', accuracy, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=lr)

    def train_dataloader(self):
        train_loader = DataLoader(
            dataset=self.dataset,
            batch_size=batch_size,
            sampler=self.train_sampler,
            num_workers=num_workers
        )
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(
            dataset=self.dataset,
            batch_size=batch_size,
            sampler=self.val_sampler,
            num_workers=num_workers
        )
        return val_loader
model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)

Trainer#

To train our model, we’ll need a Trainer. The Trainer is a high-level module that provides a simple and consistent interface for training, validating, and testing your PyTorch models. We’ll also want to log our data with a logger. You will need a Weights & Biases (wandb) account to view your logs.

from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

# the run name is arbitrary; pick any name you like
logger = WandbLogger(name='experiment1', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()

Improvement#

There are many ways to improve a model. One is to try a different architecture. Let’s start by introducing non-linearity with the ReLU activation function.

model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)
model.layer = nn.Sequential(
    nn.Linear(model.in_features, model.hidden_units),
    nn.ReLU(),
)
logger = WandbLogger(name='experiment2', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()

Did it improve? Try adding regularization with Dropout.

model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)
model.layer = nn.Sequential(
    nn.Linear(model.in_features, model.hidden_units),
    nn.ReLU(),
    nn.Dropout(.2),
)
logger = WandbLogger(name='experiment3', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)
trainer.fit(model)
logger.experiment.finish()

Conclusion#

Machine learning requires lots of experimentation: it often takes trying out different models, hyperparameters, and preprocessing techniques to achieve good results. This is only the start of NLP; throughout the workshop you may find other approaches to this problem.
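For example, one preprocessing variation worth experimenting with (a sketch only, not something used above) is replacing raw counts with TF-IDF weights, which down-weight words that appear in many documents.

from sklearn.feature_extraction.text import TfidfVectorizer

# same interface as CountVectorizer, but weights each term by
# term frequency * inverse document frequency instead of a raw count
tfidf_vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
tfidf_features = tfidf_vectorizer.fit_transform(df['text'])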