Preparing data for LLM training

We are going to train our LLM using the PubMed dataset, which contains abstracts from biomedical journal articles. To keep things quick for the workshop, we will be working with a small subset of 50k abstracts. There are a few key concepts to consider when preparing data for LLM training:

  1. Tokenization: how do we turn the raw text strings into units of analysis for our program?

  2. Batching: how do we combine multiple documents into a single data structure for efficient model training?

To address these issues, we will use a pre-built biomedical tokenizer available from the Hugging Face Hub. Implementing a tokenizer is a complicated topic in its own right, and we will not cover it in detail here.

One pleasant difference between LLMs and the previous generation of NLP techniques is that we do not usually need to perform elaborate data preprocessing to achieve good results.

PubMed Data

The Hugging Face datasets library provides useful utilities for loading and working with text data. We use it here to load the abstracts.

from datasets import load_dataset
# path to folder containing "train.txt" and "test.txt" files containing train/test PubMed abstracts
root = "/project/rcde/datasets/pubmed/mesh_50k/splits/"

train_test_files = {
    "train": root+"train.txt",
    "test": root+"test.txt"
}

dataset = load_dataset("text", data_files = train_test_files).with_format("torch")

dataset

Let’s check the sizes of training and test sets:

len(dataset["train"]), len(dataset["test"])

Look at a particular training sample:

dataset["train"][34799]

Tokenization

Here, we will use the Hugging Face transformers library to fetch a tokenizer purpose-built for biomedical data. The AutoTokenizer class lets us provide the name of a model on the Hugging Face Hub and automatically retrieves the associated tokenizer. We could experiment with different tokenizers to try to achieve better results.

from transformers import AutoTokenizer
# use a pretrained tokenizer
# https://huggingface.co/dmis-lab/biobert-base-cased-v1.2
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")

Let’s tokenize some text:

tokenizer(dataset["train"][34799]['text'])

The input_ids list contains an encoded representation of our text. It is a sequence of integer IDs corresponding to the tokens that appear in the text we tokenized. Each ID refers to a specific entry in the pre-defined vocabulary that came with the tokenizer, so the input_ids list can be decoded back into text that closely matches the original.
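To make the ID-to-vocabulary mapping concrete, here is a minimal sketch with a made-up five-entry vocabulary and plain whitespace splitting. The real BioBERT tokenizer uses a far larger vocabulary and splits words into subword pieces, so this only illustrates the lookup idea. The names `toy_vocab`, `encode`, and `decode` are hypothetical and exist only for this example.

```python
# Toy illustration only: a made-up vocabulary, not the BioBERT one.
toy_vocab = ["[UNK]", "the", "patient", "received", "aspirin"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}

def encode(text):
    """Map each lowercased, whitespace-separated word to its integer ID."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(word, unk_id) for word in text.lower().split()]

def decode(ids):
    """Map integer IDs back to vocabulary entries and join them with spaces."""
    return " ".join(toy_vocab[i] for i in ids)

ids = encode("The patient received aspirin")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # the patient received aspirin

# A word missing from the vocabulary falls back to the unknown token.
print(decode(encode("The patient received ibuprofen")))  # the patient received [UNK]
```

In this toy version, decoding loses the original capitalization and any out-of-vocabulary word, which shows why a round trip through a tokenizer is not always exact.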

We will usually want PyTorch tensors, not lists, as output. Because every row of a tensor must have the same length, we need to enable padding when tokenizing several texts at once.

What do you think padding does?

ids = tokenizer(dataset["train"][34799:34801]['text'], return_tensors='pt', padding=True)['input_ids']
ids

Can you see how padding appears in the tokenized text?

Do you notice anything special about the first and last non-padding tokens?

Let’s decode back:

[tokenizer.decode(input_ids) for input_ids in ids]

There is a minor cleanliness issue: the abstracts begin and end with unneeded quotation marks. We will add a preprocessing step to remove these while batching samples.
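Returning to the padded tensor seen earlier, the following plain-Python sketch shows mechanically what padding and special tokens do. It assumes BERT-style conventions: a [CLS] token at the start, a [SEP] token at the end, and [PAD] tokens filling out shorter sequences. The ID values are placeholders chosen for illustration and may not match the BioBERT vocabulary, and `pad_batch` is a hypothetical helper rather than a library function.

```python
# Placeholder special-token IDs for illustration (assumed, BERT-style).
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_batch(sequences, max_length=None):
    """Wrap each ID sequence in [CLS]/[SEP], optionally truncate, and right-pad."""
    wrapped = []
    for seq in sequences:
        if max_length is not None:
            seq = seq[: max_length - 2]  # leave room for [CLS] and [SEP]
        wrapped.append([CLS_ID] + list(seq) + [SEP_ID])
    longest = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in wrapped]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in wrapped]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([[7, 8, 9, 10], [11, 12]])
for row in input_ids:
    print(row)
# [101, 7, 8, 9, 10, 102]
# [101, 11, 12, 102, 0, 0]
print(attention_mask)
# [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
```

Once every row has the same length, the batch can be stored as a single rectangular tensor, and the attention mask records which positions hold real tokens.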

Cleaning and batching

Here we will connect the Hugging Face tools with native PyTorch tools.

from torch.utils.data import DataLoader, default_collate
def clean_and_tokenize(text_batch):
    """
    This method demonstrates how you can apply custom preprocessing logic while you load your data. 
    
    It expects a list of plaintext abstracts as input. 
    """
    ## custom preprocessing
    # get rid of unwanted opening/closing quotes
    text_batch = [t[1:-1] for t in text_batch]
    
    ## tokenization
    # we use the huggingface tokenizer as above
    text_batch = tokenizer(text_batch, padding=True, truncation=True, max_length=512)
    
    return text_batch
    
def custom_collate(batch_list):
    """
    This is for use with the pytorch DataLoader class. We use the default collate function
    but add the cleaning and tokenization step. 
    """
    batch = default_collate(batch_list)
    batch['text'] = clean_and_tokenize(batch['text'])
    
    return batch
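
The slice t[1:-1] in clean_and_tokenize removes the first and last character of every abstract, whether or not they are quotation marks. If some abstracts might not be wrapped in quotes, a more defensive variant could look like the sketch below. `strip_wrapping_quotes` is a hypothetical helper, not part of the workshop code, and it assumes the wrapping characters are plain double quotes.

```python
def strip_wrapping_quotes(text, quote='"'):
    """Remove one leading and one trailing quote, only if both are present."""
    if len(text) >= 2 and text.startswith(quote) and text.endswith(quote):
        return text[1:-1]
    return text

samples = ['"Aspirin reduces fever."', "No quotes here.", '"']
print([strip_wrapping_quotes(s) for s in samples])
# ['Aspirin reduces fever.', 'No quotes here.', '"']
```

Such a helper could replace the slice inside the list comprehension, at the cost of a little extra checking per abstract.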

We can now use this collate function with the PyTorch DataLoader class to load, clean, tokenize and batch our text data. Once we can do this, we’re ready to work on modeling our data.

dl = DataLoader(dataset['train'], batch_size=3, collate_fn = custom_collate)
# Let's look at a batch
batch = next(iter(dl))
batch
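Conceptually, the DataLoader above walks through the dataset in chunks of batch_size and hands each chunk to collate_fn. The plain-Python sketch below mimics that loop without PyTorch. `iterate_batches` and `toy_collate` are hypothetical names used only here, and the real DataLoader offers much more, such as shuffling and parallel loading.

```python
def iterate_batches(samples, batch_size, collate_fn):
    """Yield collated batches of up to batch_size consecutive samples."""
    for start in range(0, len(samples), batch_size):
        yield collate_fn(samples[start : start + batch_size])

def toy_collate(batch_list):
    """Turn a list of {'text': str} samples into one dict holding a list of texts."""
    return {"text": [sample["text"] for sample in batch_list]}

toy_dataset = [{"text": f"abstract {i}"} for i in range(7)]
batches = list(iterate_batches(toy_dataset, batch_size=3, collate_fn=toy_collate))
print(len(batches))         # 3
print(batches[0]["text"])   # ['abstract 0', 'abstract 1', 'abstract 2']
print(batches[-1]["text"])  # ['abstract 6']
```

The last batch is smaller because 7 is not a multiple of 3. A collate function is the natural place for per-batch work such as cleaning and tokenization, since it sees all the texts of a batch at once and can pad them to a common length.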

Saving code for later

I’ve pulled the above code into a separate file called dataset.py. This will allow us to reuse the code in future notebooks. Copy the file into your working directory:

wget https://raw.githubusercontent.com/clemsonciti/rcde_workshops/master/pytorch_llm/dataset.py

Let’s briefly look at the usage:

from dataset import PubMedDataset
dataset = PubMedDataset(
    root = "/project/rcde/datasets/pubmed/mesh_50k/splits/", 
    max_tokens = 20,
    tokenizer_model = "dmis-lab/biobert-base-cased-v1.2"
)
dl_train = dataset.get_dataloader(split='train', batch_size=3) # split can be "train" or "test"
batch = next(iter(dl_train))
batch
dataset.decode_batch(batch['input_ids'])