# Preparing data for LLM training

We will train our LLM on the [PubMed dataset](https://pubmed.ncbi.nlm.nih.gov/download/), which contains abstracts from biomedical journal articles. To keep things quick for the workshop, we will work with a [small subset of 50k abstracts](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification). Two key concepts come up when preparing data for LLM training:
1. **Tokenization**: how do we turn the raw text strings into units of analysis for our program?
2. **Batching**: how do we combine multiple documents into a single data structure for efficient model training?

To address these issues, we will use a [pre-built biomedical tokenizer](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) available from the Hugging Face Hub. Implementing a tokenizer is a complicated topic in its own right, and we will not cover it in detail here.

One pleasant difference between LLMs and the previous generation of NLP techniques is that we do not usually need to perform elaborate data preprocessing to achieve good results.

## PubMed Data
The Hugging Face `datasets` library provides useful utilities for loading and working with text data, so we use it here.

In [1]:
from datasets import load_dataset

In [2]:
# path to folder containing "train.txt" and "test.txt" files containing train/test PubMed abstracts
root = "/project/rcde/datasets/pubmed/mesh_50k/splits/"

train_test_files = {
    "train": root+"train.txt",
    "test": root+"test.txt"
}

dataset = load_dataset("text", data_files = train_test_files).with_format("torch")

dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 45137
    })
    test: Dataset({
        features: ['text'],
        num_rows: 5036
    })
})

Let's check the sizes of training and test sets: 

In [4]:
len(dataset["train"]), len(dataset["test"])

(45137, 5036)

Look at a particular training sample:

In [7]:
dataset["train"][34799]

{'text': '"BACKGROUND & AIMS: The contribution of duodeno-gastroesophageal reflux to the development of Barrett\'s esophagus has remained an interesting but controversial topic. The present study assessed the risk for Barrett\'s esophagus after partial gastrectomy.METHODS: The data of outpatients from a medicine and gastroenterology clinic who underwent upper gastrointestinal endoscopy for any reason were analyzed in a case-control study. A case population of 650 patients with short- segment and 366 patients with long-segment Barrett\'s esophagus was compared in a multivariate logistic regression to a control population of 3047 subjects without Barrett\'s esophagus or other types of gastroesophageal reflux disease.RESULTS: In the case population, 25 (4%) patients with short-segment and 15 (4%) patients with long-segment Barrett\'s esophagus presented with a history of gastric surgery compared with 162 (5%) patients in the control population, yielding an adjusted odds ratio of 0.89 with

## Tokenization

Here, we will use the Hugging Face `transformers` library to fetch a tokenizer purpose-built for biomedical data. The `AutoTokenizer` class lets us provide the name of a model on the Hugging Face Hub and automatically retrieves the associated tokenizer. We could experiment with different tokenizers to try to achieve better results.

In [8]:
from transformers import AutoTokenizer

In [9]:
# use a pretrained tokenizer
# https://huggingface.co/dmis-lab/biobert-base-cased-v1.2
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")

Let's tokenize some text:

In [10]:
tokenizer(dataset["train"][34799]['text'])

{'input_ids': [101, 107, 3582, 111, 8469, 131, 1103, 6436, 1104, 6862, 2883, 1186, 118, 3245, 8005, 1279, 4184, 19911, 1348, 1231, 2087, 24796, 1106, 1103, 1718, 1104, 2927, 8127, 1204, 112, 188, 13936, 4184, 2328, 12909, 1144, 1915, 1126, 5426, 1133, 6241, 8366, 119, 1103, 1675, 2025, 14758, 1103, 3187, 1111, 2927, 8127, 1204, 112, 188, 13936, 4184, 2328, 12909, 1170, 7597, 3245, 7877, 5822, 18574, 119, 4069, 131, 1103, 2233, 1104, 1149, 27420, 1116, 1121, 170, 5182, 1105, 3245, 8005, 25195, 4807, 12257, 1150, 9315, 3105, 3245, 8005, 10879, 2556, 14196, 1322, 2155, 20739, 1111, 1251, 2255, 1127, 17689, 1107, 170, 1692, 118, 1654, 2025, 119, 170, 1692, 1416, 1104, 14166, 4420, 1114, 1603, 118, 6441, 1105, 3164, 1545, 4420, 1114, 1263, 118, 6441, 2927, 8127, 1204, 112, 188, 13936, 4184, 2328, 12909, 1108, 3402, 1107, 170, 4321, 8997, 15045, 9366, 5562, 1231, 24032, 1106, 170, 1654, 1416, 1104, 26714, 1559, 5174, 1443, 2927, 8127, 1204, 112, 188, 13936, 4184, 2328, 12909, 1137, 1168, 332

The `input_ids` list contains an encoded representation of our text: a sequence of integer IDs, one for each token that appears in the text we tokenized. Each ID refers to a specific entry in the pre-defined vocabulary that came with the tokenizer, so the `input_ids` list can be decoded back into a normalized version of our original text.
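
To make the ID lookup concrete, here is a minimal sketch using a toy vocabulary. The vocabulary, the `encode` and `decode` helpers, and the whitespace splitting are all assumptions made for illustration; the real tokenizer uses a much larger vocabulary and splits rare words into subword pieces instead of mapping them to `[UNK]`.

```python
# Toy vocabulary: made up for illustration, NOT the BioBERT vocabulary.
vocab = ["[UNK]", "the", "risk", "for", "barrett", "esophagus", "after", "surgery"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(text):
    """Lowercase, split on whitespace, and look up each token's ID."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(tok, unk_id) for tok in text.lower().split()]


def decode(ids):
    """Map IDs back to vocabulary entries and join them with spaces."""
    return " ".join(vocab[i] for i in ids)


ids = encode("The risk for Barrett esophagus after gastrectomy")
print(ids)          # [1, 2, 3, 4, 5, 6, 0]
print(decode(ids))  # the risk for barrett esophagus after [UNK]
```

Note that the round trip is lossy in this sketch: capitalization is gone, and the out-of-vocabulary word comes back as `[UNK]`.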

We will usually want PyTorch tensors, not lists, as output. For this we need to enable padding. 

*What do you think padding does?*

In [21]:
ids = tokenizer(dataset["train"][34799:34801]['text'], return_tensors='pt', padding=True)['input_ids']
ids

tensor([[  101,   107,  3582,   111,  8469,   131,  1103,  6436,  1104,  6862,
          2883,  1186,   118,  3245,  8005,  1279,  4184, 19911,  1348,  1231,
          2087, 24796,  1106,  1103,  1718,  1104,  2927,  8127,  1204,   112,
           188, 13936,  4184,  2328, 12909,  1144,  1915,  1126,  5426,  1133,
          6241,  8366,   119,  1103,  1675,  2025, 14758,  1103,  3187,  1111,
          2927,  8127,  1204,   112,   188, 13936,  4184,  2328, 12909,  1170,
          7597,  3245,  7877,  5822, 18574,   119,  4069,   131,  1103,  2233,
          1104,  1149, 27420,  1116,  1121,   170,  5182,  1105,  3245,  8005,
         25195,  4807, 12257,  1150,  9315,  3105,  3245,  8005, 10879,  2556,
         14196,  1322,  2155, 20739,  1111,  1251,  2255,  1127, 17689,  1107,
           170,  1692,   118,  1654,  2025,   119,   170,  1692,  1416,  1104,
         14166,  4420,  1114,  1603,   118,  6441,  1105,  3164,  1545,  4420,
          1114,  1263,   118,  6441,  2927,  8127,  

*Can you see how padding appears in the tokenized text?* 

*Do you notice anything special about the first and last non-padding tokens?* 

Let's decode back:

In [22]:
[tokenizer.decode(input_ids) for input_ids in ids]

['[CLS] " background & aims : the contribution of duodeno - gastroesophageal reflux to the development of barrett\'s esophagus has remained an interesting but controversial topic. the present study assessed the risk for barrett\'s esophagus after partial gastrectomy. methods : the data of outpatients from a medicine and gastroenterology clinic who underwent upper gastrointestinal endoscopy for any reason were analyzed in a case - control study. a case population of 650 patients with short - segment and 366 patients with long - segment barrett\'s esophagus was compared in a multivariate logistic regression to a control population of 3047 subjects without barrett\'s esophagus or other types of gastroesophageal reflux disease. results : in the case population, 25 ( 4 % ) patients with short - segment and 15 ( 4 % ) patients with long - segment barrett\'s esophagus presented with a history of gastric surgery compared with 162 ( 5 % ) patients in the control population, yielding an adjusted

There is a minor cleanliness issue: the abstracts start and close with unneeded quotation marks. We will add a preprocessing step to remove these while batching samples. 
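
As a small illustration of such a cleaning step, the sketch below removes the quotes only when an abstract is actually wrapped in them. The helper name `strip_wrapping_quotes` and the sample strings are hypothetical; the workshop code in the next section uses a simpler slice that assumes every abstract is quoted.

```python
def strip_wrapping_quotes(text: str) -> str:
    """Remove one pair of double quotes, but only if the text is wrapped in them."""
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    return text


# Made-up sample strings, not taken from the dataset.
samples = ['"BACKGROUND: a quoted abstract."', "No quotes here."]
cleaned = [strip_wrapping_quotes(s) for s in samples]
print(cleaned)  # ['BACKGROUND: a quoted abstract.', 'No quotes here.']
```

The check costs very little and avoids silently dropping the first and last characters of any abstract that happens not to be quoted.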

## Cleaning and batching

Here we connect the Hugging Face tools to native PyTorch tools. 

In [27]:
from torch.utils.data import Dataset, DataLoader, default_collate

In [28]:
def clean_and_tokenize(text_batch):
    """
    This function demonstrates how to apply custom preprocessing logic while loading your data. 
    
    It expects a list of plaintext abstracts as input. 
    """
    ## custom preprocessing
    # get rid of unwanted opening/closing quotes
    text_batch = [t[1:-1] for t in text_batch]
    
    ## tokenization
    # we use the huggingface tokenizer as above
    text_batch = tokenizer(text_batch, padding=True, truncation=True, max_length=512)
    
    return text_batch
    
def custom_collate(batch_list):
    """
    This is for use with the pytorch DataLoader class. We use the default collate function
    but add the cleaning and tokenization step. 
    """
    batch = default_collate(batch_list)
    batch['text'] = clean_and_tokenize(batch['text'])
    
    return batch

We can now use this collate function with the PyTorch `DataLoader` class to load, clean, tokenize and batch our text data. Once we can do this, we're ready to work on modeling our data. 
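
To show roughly what happens when samples are grouped and collated, here is a pure-Python sketch. It is not how `DataLoader` or the Hugging Face tokenizer are implemented; `pad_collate`, `iter_batches`, and the toy ID lists are hypothetical stand-ins. The padding ID of 0 is an assumption based on the zeros that appear in the batch output, and truncation here simply cuts each list off, without the special-token handling a real tokenizer performs.

```python
PAD_ID = 0  # assumption: padding ID 0, matching the zeros in the batch output


def pad_collate(samples, max_length=8):
    """Truncate each sample to max_length, then pad to the longest sample in the batch."""
    truncated = [s[:max_length] for s in samples]
    width = max(len(s) for s in truncated)
    input_ids = [s + [PAD_ID] * (width - len(s)) for s in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def iter_batches(data, batch_size, collate_fn):
    """Yield collated batches of up to batch_size consecutive samples."""
    for start in range(0, len(data), batch_size):
        yield collate_fn(data[start:start + batch_size])


# Toy "tokenized" samples of different lengths (made-up IDs).
toy_data = [[11, 12, 13, 14], [21, 22, 23], [31, 32, 33, 34, 35, 36], [41, 42]]
batches = list(iter_batches(toy_data, batch_size=3, collate_fn=pad_collate))
first = batches[0]
print(first["input_ids"])
# [[11, 12, 13, 14, 0, 0], [21, 22, 23, 0, 0, 0], [31, 32, 33, 34, 35, 36]]
print(first["attention_mask"])
# [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
```

Each batch is padded only to the length of its own longest sample, so the final batch, which holds a single short sample, needs no padding at all.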

In [33]:
dl = DataLoader(dataset['train'], batch_size=3, collate_fn = custom_collate)

In [34]:
# Let's look at a batch
batch = next(iter(dl))
batch

{'text': {'input_ids': [[101, 1103, 2853, 1104, 4167, 21943, 4807, 1120, 1103, 2704, 173, 4894, 2883, 118, 15688, 10886, 21039, 113, 173, 4894, 2883, 117, 176, 14170, 1183, 114, 1108, 1771, 1107, 7079, 1112, 1141, 1104, 1103, 3778, 7844, 1104, 4167, 21943, 4807, 1107, 170, 5186, 2704, 1107, 176, 14170, 1183, 119, 173, 4894, 2883, 1108, 1103, 2364, 1104, 21718, 21501, 1183, 1105, 117, 1112, 1216, 117, 1141, 1104, 1103, 1211, 5918, 3057, 6425, 1104, 176, 14170, 1183, 119, 1142, 2440, 2820, 1110, 1145, 7226, 1118, 1103, 2704, 112, 188, 1607, 2111, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

## Saving code for later

I've pulled the above code into a separate file called [dataset.py](https://raw.githubusercontent.com/clemsonciti/rcde_workshops/master/pytorch_llm/dataset.py). This will allow us to reuse the code in future notebooks. Copy the file into your working directory:
```
wget https://raw.githubusercontent.com/clemsonciti/rcde_workshops/master/pytorch_llm/dataset.py
```

Let's briefly look at the usage: 

In [35]:
from dataset import PubMedDataset

In [38]:
dataset = PubMedDataset(
    root = "/project/rcde/datasets/pubmed/mesh_50k/splits/", 
    max_tokens = 20,
    tokenizer_model = "dmis-lab/biobert-base-cased-v1.2"
)

In [42]:
dl_train = dataset.get_dataloader(split='train', batch_size=3) # split can be "train" or "test"
batch = next(iter(dl_train))
batch

{'input_ids': tensor([[  101,  1103,  2853,  1104,  4167, 21943,  4807,  1120,  1103,  2704,
           173,  4894,  2883,   118, 15688, 10886, 21039,   113,   173,   102],
        [  101,  3582,   131,  1195,  3402,  1103, 22760,  1104,  5677,   118,
          2747, 12365, 25711,  4043, 10831,  1107,   170,  1372,  1104,   102],
        [  101,  3582,   131,  1103,  6457,  1104,  1103,  1675,  2025,  1108,
          1106, 14133,  1103,  7300, 23891,  1104,  8276, 24928,  7880,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [43]:
dataset.decode_batch(batch['input_ids'])

['[CLS] the department of dermatology at the hospital dresden - friedrichstadt ( d [SEP]',
 '[CLS] background : we compared the prevalence of organ - specific autoantibodies in a group of [SEP]',
 '[CLS] background : the aim of the present study was to compare the clinical efficacy of radical neph [SEP]']
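
With `max_tokens = 20`, every row in the batch above is exactly 20 IDs long, starting with 101 (`[CLS]`) and ending with 102 (`[SEP]`). The sketch below shows one way truncation can reserve room for those two special tokens. It illustrates the idea only and is not the code in `dataset.py` or in the tokenizer; the helper `truncate_with_special_tokens` and the content IDs are made up.

```python
CLS_ID, SEP_ID = 101, 102  # IDs seen at the start and end of each row above


def truncate_with_special_tokens(content_ids, max_tokens):
    """Keep at most max_tokens IDs in total, including [CLS] and [SEP]."""
    if max_tokens < 2:
        raise ValueError("max_tokens must leave room for [CLS] and [SEP]")
    kept = content_ids[: max_tokens - 2]
    return [CLS_ID] + kept + [SEP_ID]


content = list(range(1000, 1030))  # 30 made-up content token IDs
row = truncate_with_special_tokens(content, max_tokens=20)
print(len(row), row[0], row[-1])  # 20 101 102
```

This is why the decoded abstracts above stop mid-sentence, or even mid-word, just before `[SEP]`.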