Minimum working example, and what it’s missing#
Implementation#
After setting up our access to the Huggingface Hub (see notebook 1) – and after requesting and receiving access to any gated models that we want to use – we are ready to dive into the code for running these LLMs on the cluster.
The below code cell constitutes a minimum working example (MWE) of LLM text generation on the cluster. Let’s read through it line by line to make sure we understand what is happening.
from transformers import pipeline
model_id = "meta-llama/Llama-3.2-1B-Instruct"
pipe = pipeline(
"text-generation",
model=model_id
)
pipe("How do you exit Vim? I'm stuck")
Some limitations to note here:
We aren’t explicitly setting the device. Thankfully, transformers automatically detects our GPU and puts the model there, but it is good to be explicit.
We get a strange message telling us that the pad_token_id is being set automatically.
We have little control in this code over what exactly our prompt is.
We’re not yet using tuneable parameters such as temperature or top_k.
We’re not specifying a system prompt.
We’re not using a chat prompt template.
We’re not exercising any control over what the model’s outputs are.
We’re not using batch processing, which means this code will be very inefficient if we want to generate outputs for many input prompts instead of just one.
In what follows, we’re going to talk about each of these points, and show how to set up your code in a way that demystifies what the model is doing, gives you greater control so you can achieve higher-quality output, and uses compute resources more efficiently.
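As a quick preview (we build all of this up properly below), here is a minimal sketch of how the same pipeline call could be made more explicit about the device and a few generation settings. The specific values are illustrative assumptions, not recommendations.
# Sketch only: the same pipeline, with the device and some generation
# settings made explicit (values are illustrative, not recommendations).
from transformers import pipeline
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",   # be explicit about where the model is placed
)
pipe(
    "How do you exit Vim? I'm stuck",
    max_new_tokens=50,   # explicit output length
    do_sample=True,
    temperature=0.7,     # explicit sampling settings
    top_k=50,
)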
As before, we’ll clear the model from memory, so we don’t run into any CUDA out-of-memory errors in this notebook.
# Clear the pipeline from memory
import torch
del pipe
torch.cuda.empty_cache()
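If memory does not seem to be released as expected, it can also help to trigger Python’s garbage collector explicitly before emptying the CUDA cache. A small optional sketch:
# Optionally force a garbage-collection pass before emptying the CUDA cache
import gc
gc.collect()
torch.cuda.empty_cache()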
Switch to directly loading model and tokenizer#
To gain finer-grained control and insight into our code and what is happening with the model, it is recommended that we abandon the pipeline class and instead load our tokenizer and model directly ourselves.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
This makes things clearer because now we have our model and tokenizer as separate objects, and each of them is pretty simple. When we want complex things to happen, we want them to happen in a way that we understand and control, not in the hidden way that pipeline facilitates.
But this does have costs. Notice that unlike pipeline, loading the model this way doesn’t automatically put it on the GPU:
print(model.device)
So, we need to be explicit about putting it there ourselves. Let’s load the model again:
# Clear the model from memory
del model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")
print(model.device)
That’s better. Now, we can see the tokenizer directly in action. Let’s leave the model aside, and just look at what the tokenizer does.
tokenizer('Hello world!')
Let’s look at each of these tokens individually:
for i in tokenizer('Hello world!')['input_ids']:
decoded_token = tokenizer.decode(i)
print(f'Token {i}: {decoded_token}')
So, the tokenizer has already added one token: a special beginning-of-text token that tells the model that a text document is starting. These tokenized input_ids are what we will provide as input to the model. The model will then repeatedly try to predict the next token that should follow the tokens we provided.
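If you want to check which special tokens your tokenizer uses, you can inspect it directly. A small sketch using standard tokenizer attributes:
# Inspect the tokenizer's special tokens directly
print(tokenizer.bos_token, tokenizer.bos_token_id)   # beginning-of-sequence token
print(tokenizer.eos_token, tokenizer.eos_token_id)   # end-of-sequence token
print(tokenizer.special_tokens_map)                  # all special tokens at a glance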
tokenizer_output = tokenizer('Hello world! I am ready to', return_tensors='pt').to('cuda')
model_outputs = model.generate(input_ids = tokenizer_output['input_ids'],
attention_mask = tokenizer_output['attention_mask']
)
print(model_outputs)
We got more token ids as output – a sequence that starts with the tokens from our input sequence. Let’s see what the tokens say, when decoded. Just to make everything as clear as possible, let’s decode the tokens one at a time.
for i in model_outputs[0]:
decoded_token = tokenizer.decode(i)
print(f'Token {i:8}: {decoded_token}')
Now let’s see the same output printed more cleanly:
print(tokenizer.decode(model_outputs[0]))
If we don’t want the special tokens in there, we can tell the tokenizer that when it decodes.
print(tokenizer.decode(model_outputs[0], skip_special_tokens=True))
Setting inputs to model.generate that affect the model output#
Let’s take stock of what we’ve seen so far. You now can use code to:
Load the model
Load the tokenizer
Put the model on the GPU
Get tokenized inputs
Verify exactly what your model inputs are
Get model outputs
Convert your model output tokens back into natural language
However, so far, we’re still using default model settings. We’re using a default output length, a default temperature, and a default top_k value. We’re not using a system prompt, and we’re not using a chat prompt template. We’re not exercising any control over what the model’s outputs are. We’re not using batch processing, which means this code will be very inefficient if we want to generate outputs for many input prompts instead of just one. Let’s address these issues, starting with setting explicit inputs to model.generate that affect the model output.
model_outputs = model.generate(input_ids = tokenizer_output['input_ids'],
attention_mask = tokenizer_output['attention_mask'],
pad_token_id = tokenizer.eos_token_id,
max_new_tokens = 50,
do_sample=True,
temperature = 0.7,
top_k = 50,
)
print(tokenizer.decode(model_outputs[0], skip_special_tokens=True))
top_k vs temperature: these are two parameters that control the randomness of the model’s output. top_k is a hard cutoff: at each step, only the top_k most probable tokens are considered as candidates for the next token. temperature is a soft control: it rescales the probability distribution over candidate tokens, with values below 1 making the output more deterministic and values above 1 making it more random.
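To build intuition for temperature, you can generate a continuation of the same prompt at a few different values and compare the results. A minimal sketch (the temperature values are arbitrary):
# Compare generations for the same prompt at a few different temperatures
for temp in [0.2, 0.7, 1.3]:
    out = model.generate(input_ids=tokenizer_output['input_ids'],
                         attention_mask=tokenizer_output['attention_mask'],
                         pad_token_id=tokenizer.eos_token_id,
                         max_new_tokens=30,
                         do_sample=True,
                         temperature=temp,
                         )
    print(f'--- temperature = {temp} ---')
    print(tokenizer.decode(out[0], skip_special_tokens=True))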
There are many more generation settings available: see the transformers text-generation documentation for more information.
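One convenient way to keep these settings explicit and reproducible is to bundle them into a GenerationConfig object and pass that to model.generate. A sketch, with illustrative values:
# Bundle generation settings into a reusable, explicit configuration object
from transformers import GenerationConfig
gen_config = GenerationConfig(max_new_tokens=50,
                              do_sample=True,
                              temperature=0.7,
                              top_k=50,
                              pad_token_id=tokenizer.eos_token_id,
                              )
config_outputs = model.generate(input_ids=tokenizer_output['input_ids'],
                                attention_mask=tokenizer_output['attention_mask'],
                                generation_config=gen_config,
                                )
print(tokenizer.decode(config_outputs[0], skip_special_tokens=True))
Keeping the configuration in one object also makes it easy to report exactly which settings were used when you write up your results.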
Specifying a system prompt and chat template#
So far, we have not been using the model we’ve loaded “correctly”. This model is intended to be used with a system prompt and with a very specific chat template.
The model still can only ever accept inputs that are strings of tokens, but as an instruction fine-tuned model, it expects those strings of tokens to represent a system prompt followed by some nonzero number of turns of a conversation between a user and itself, the AI assistant. It expects each of those things, the system prompt and each turn of the conversation, to be delimited by special tokens. To explore this, let’s step back and discuss instruction fine-tuning, and then see the impact of system prompts and chat templates on the model’s output.
Instruction fine-tuning#
A crucial distinction when working with LLMs is between models which are, and models which are not, instruction fine-tuned. First, consider models which are not instruction fine-tuned. These are “base” LLMs, which are pure next-token predictors. They are very simply trained to do the following: Look at a string of text, and predict what comes next in that text. Base models like this can be extremely powerful for some applications.
Question: Why would a base model tend to perform poorly in chatbot-style applications?
Consider giving a base LLM a string of text like “What sorts of things do tigers like to eat?” If you encountered this text some random place on the internet, how might the text continue?
Instruction fine-tuned models are base LLMs that receive an extra round of a different kind of training. This extra training no longer simply involves trying to predict what the next bit of text would be. Instruction fine-tuned models are trained to expect to generate a very, very specific form of text: text output by a helpful, intelligent AI assistant as part of a conversation with a human.
So when we give a prompt prompt_string to an instruction-tuned LLM, it is no longer answering the question:
Given that one sees prompt_string somewhere on the internet, what text would be likely to follow it?
Instead, it is answering the question:
Given that a conversation between a helpful AI assistant and a human begins with prompt_string, how would the helpful AI continue the next bit of the conversation?
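To see this difference concretely, you could load the corresponding base checkpoint (meta-llama/Llama-3.2-1B, i.e. the same model without -Instruct) and give it the tiger question from above. A sketch, assuming you have requested access to that base model and have enough free GPU memory for a second small model:
# Sketch: see how a base (non-instruction-tuned) model continues a prompt.
# Assumes access to the gated checkpoint meta-llama/Llama-3.2-1B and enough
# free GPU memory for a second 1B-parameter model.
base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto")
prompt = "What sorts of things do tigers like to eat?"
inputs = base_tokenizer(prompt, return_tensors='pt').to(base_model.device)
outputs = base_model.generate(**inputs,
                              max_new_tokens=40,
                              do_sample=False,
                              pad_token_id=base_tokenizer.eos_token_id)
print(base_tokenizer.decode(outputs[0], skip_special_tokens=True))
# Free the GPU memory again
del base_model
torch.cuda.empty_cache()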
For instruction fine-tuned models, the system prompt is a crucial part of the input prompt string.
System prompts and user prompts#
When you interact with an LLM chatbot such as ChatGPT, Claude or Gemini, you provide a prompt. Casual LLM users may not be aware, however, that this is only part of the full prompt that is given to the LLM. The full prompt is actually a concatenation of two parts: the system prompt and the user prompt. What you type into the chatbot interface is only the user prompt.
The system prompt is a fixed string that is prepended to the user prompt before the full prompt is given to the LLM. The system prompt is used to set the context for the user prompt. For example, the system prompt might be something like
You are a helpful chatbot. Today’s date is 2025-01-25. You are talking to a human. Be friendly and helpful, but do not give them medical advice or help them write malicious software code.
So, if you, a user, type in:
Hey, what should I take for a rapidly worsening cough and headache?
The LLM will “see” the instructions in the system prompt followed by that user message, and will then try to “predict”, token by token, what would be written by a helpful chatbot in that situation.
When you use e.g. ChatGPT through its web interface, you can neither see nor control the system prompt. Gaining this control is one of many reasons to work with LLMs through code rather than through a web interface. This control is especially important for scientific research using LLMs. If OpenAI changes their system prompt tomorrow, your user prompts will get different results from what they get today, and the nature of that difference will be opaque to you! This is not a good situation for reproducibility.
# Let's define a system prompt.
system_prompt = """You are a helpful AI assistant. \
You answer the user's questions clearly and politely. \
If the user says or implies that they are hungry, \
try to convince them to buy a Burger King Whopper."""
# And a user prompt, for the start of a conversation.
user_prompt = """Hello, I'm stuck in Vim. How do I exit? \
I need to resolve this quickly because I am starving."""
# We can specify the system prompts and any previous turns of the conversation as a list of dicts.
messages = [{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
]
# We can now use the tokenizer's `apply_chat_template` method to encode the conversation.
# This will add the special tokens and segment IDs needed for the model to understand the conversation.
encoded_input = tokenizer.apply_chat_template(messages, return_tensors='pt', return_dict=True).to('cuda')
# Let's see the resulting encoded input, first token by token, then all together.
for i in encoded_input['input_ids'][0]:
decoded_token = tokenizer.decode(i)
print(f'Token {i:8}: {decoded_token}')
print('\n\n' + tokenizer.decode(encoded_input['input_ids'][0]))
With this sequence of tokens, when the model starts predicting new tokens, what should its first predictions be? What do you think should be the first three or four tokens that the model generates?
# Let's check what the first four tokens are that the model generates using this input.
model_outputs = model.generate(**encoded_input, max_new_tokens=4)
for i in model_outputs[0]:
decoded_token = tokenizer.decode(i)
print(f'Token {i:8}: {decoded_token}')
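The first tokens the model generates here should be the special tokens that open the assistant’s turn. In practice, you usually add those yourself by passing add_generation_prompt=True to apply_chat_template (we will do exactly that further below), so that the model’s first generated tokens are already part of its answer. A quick sketch:
# Add the assistant header ourselves, so generation starts with the answer itself
encoded_with_prompt = tokenizer.apply_chat_template(messages,
                                                    return_tensors='pt',
                                                    return_dict=True,
                                                    add_generation_prompt=True).to('cuda')
outputs_with_prompt = model.generate(**encoded_with_prompt,
                                     max_new_tokens=4,
                                     pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs_with_prompt[0][encoded_with_prompt['input_ids'].shape[1]:]))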
For most applications, instruction-tuned models are better. But instruction-tuned models are always trained on a specific chat template that structures the conversation between the human and the AI. If you use the wrong chat template, the model will still respond, but its responses will be poorer.
Let’s see this in action.
# Correct chat template:
correct_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 24 Jan 2025
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_prompt}<|eot_id|>"""
# Incorrect chat template:
incorrect_template = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{system_prompt}
### Input:
{user_prompt}"""
# Let's use these two templates to query the model, with these system and user prompts
system_prompt = "You are a helpful assistant. You answer the user's questions."
user_prompt = "Can you please provide an elegant proof of Euler's formula?"
correct_template_formatted = correct_template.format(system_prompt=system_prompt, user_prompt=user_prompt)
incorrect_input_formatted = incorrect_template.format(system_prompt=system_prompt, user_prompt=user_prompt)
# from transformers import AutoTokenizer, AutoModelForCausalLM
# import torch
# # Load tokenizer and model
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")
# Set the padding token if it's not defined (some models still need this)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Tokenize the chat templates
correct_inputs = tokenizer(correct_template_formatted, padding=True, return_tensors='pt', add_special_tokens=False).to('cuda')
incorrect_inputs = tokenizer(incorrect_input_formatted, padding=True, return_tensors='pt', add_special_tokens=False).to('cuda')
# Generate responses for both templates
correct_template_output_ids = model.generate(correct_inputs['input_ids'],
                                             attention_mask=correct_inputs['attention_mask'],
                                             do_sample=False,
                                             max_new_tokens=200,
                                             temperature=None,
                                             pad_token_id=tokenizer.pad_token_id)
incorrect_template_output_ids = model.generate(incorrect_inputs['input_ids'],
                                               attention_mask=incorrect_inputs['attention_mask'],
                                               do_sample=False,
                                               max_new_tokens=200,
                                               temperature=None,
                                               pad_token_id=tokenizer.pad_token_id)
# Decode responses
correct_template_response = [tokenizer.decode(ids, skip_special_tokens=False) for ids in correct_template_output_ids]
incorrect_template_response = [tokenizer.decode(ids, skip_special_tokens=False) for ids in incorrect_template_output_ids]
# Print each generated response
print("CORRECT TEMPLATE:\n" + correct_template_response[0], end='\n\n' + '#' * 100 + '\n')
print("INCORRECT TEMPLATE:\n" + incorrect_template_response[0])
Using the correct chat template is crucial. In the case of common, popular models, Hugging Face has already provided the correct chat template for you, as it did above. But in general, you should be aware of whether a chat template was used for your model and if so, which one.
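You can always check which chat template your tokenizer will apply, either by printing its raw template string or by rendering it on a dummy conversation without tokenizing. A short sketch:
# Inspect the chat template that ships with the tokenizer
print(tokenizer.chat_template)   # the raw Jinja template string
# Render it on a dummy conversation to see the exact prompt format
print(tokenizer.apply_chat_template(
    [{"role": "system", "content": "SYSTEM PROMPT HERE"},
     {"role": "user", "content": "USER PROMPT HERE"}],
    tokenize=False,
))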
Controlling the model’s outputs#
In addition to inputs like temperature and top_k, which can affect the quality of the model’s generations for your use-case, there are also ways to impose more direct control over the model’s outputs. For example, you can forbid certain tokens, require certain tokens, or write the beginning of the model’s response yourself.
Reasons for exercising this control include:
Ensuring that the model’s output is appropriate for your use-case.
Ensuring that the model’s output is safe.
Making your generations more efficient by writing portions yourself instead of having the model generate them.
Inducing the model to follow a certain format, or a certain style of writing.
# Forbid tokens
def get_tokens_as_list(word_list):
"Converts a sequence of words into a list of tokens"
if word_list == []: return None
tokens_list = []
for word in word_list:
tokenized_word = tokenizer([word], add_special_tokens=False).input_ids[0]
tokens_list.append(tokenized_word)
return tokens_list
bad_word_list = [" vi", " vim", " Vim", " VIM", " Vi", "Vim", "VIM", "(VIM)"]
bad_tokens = get_tokens_as_list(bad_word_list)
# Generate a response; it should exclude our forbidden tokens.
model_outputs = model.generate(**encoded_input, max_new_tokens=100, bad_words_ids=bad_tokens, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(model_outputs[0], skip_special_tokens=True))
for i in model_outputs[0]:
decoded_token = tokenizer.decode(i)
print(f'Token {i:8}: \"{decoded_token}\"')
print("\nForbidden tokens:")
for i in bad_tokens:
for j in i:
decoded_token = tokenizer.decode(j)
print(f'Token {j:8}: \"{decoded_token}\"')
# Require tokens
good_word_list = [" Burger King", " Whopper", " delicious"]
good_tokens = get_tokens_as_list(good_word_list)
# Generate a response; it should include our required tokens.
model_outputs = model.generate(**encoded_input,
max_new_tokens=500,
num_beams=15,
force_words_ids=good_tokens,
pad_token_id=tokenizer.eos_token_id,
do_sample=False,
)
print(tokenizer.decode(model_outputs[0], skip_special_tokens=True))
# Write the beginning of the model's response - highly useful
# First, let's make sure we have the correct special tokens for the model.
example_tokens = tokenizer.apply_chat_template(messages,
return_tensors='pt',
return_dict=True,
add_generation_prompt=True).to("cuda")
# Now, let's get the tokens for what we want the start of the model's response to be.
response_start_tokens = tokenizer("Well, before we talk about Vim, let's sort out this hunger issue. A delicious",
return_tensors='pt',
add_special_tokens=False, # important! Otherwise we get BoS token
).to("cuda")
# Combine them
final_input_ids = torch.cat((example_tokens['input_ids'], response_start_tokens['input_ids']), dim=1)
attention_mask = torch.cat((example_tokens['attention_mask'], response_start_tokens['attention_mask']), dim=1)
# Now let's get the model to continue the response we started.
model_outputs = model.generate(input_ids=final_input_ids,
attention_mask=attention_mask,
pad_token_id=tokenizer.eos_token_id,
max_new_tokens=100,
do_sample=False,
)
print(tokenizer.decode(model_outputs[0], skip_special_tokens=False))
This is especially helpful when you want to enforce a certain format, such as pure JSON output.
# And a user prompt, for the start of a conversation.
user_prompt_JSON = "Hello, I'm stuck in Vim. How do I exit? I need to resolve this quickly because I am starving. Please respond in JSON, with keys \"Vim_solution\" and \"Hunger_solution\"."
# We can specify the system prompts and any previous turns of the conversation as a list of dicts.
messages_JSON = [{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt_JSON},
]
encoded_input = tokenizer.apply_chat_template(messages_JSON,
return_tensors='pt',
return_dict=True,
add_generation_prompt=True,
).to('cuda')
# Now, let's get the tokens for what we want the start of the model's response to be.
response_start_tokens = tokenizer(r'{"Vim_solution":',
return_tensors='pt',
add_special_tokens=False, # important! Otherwise we get BoS token
).to("cuda")
# Combine them
final_input_ids = torch.cat((encoded_input['input_ids'], response_start_tokens['input_ids']), dim=1)
attention_mask = torch.cat((encoded_input['attention_mask'], response_start_tokens['attention_mask']), dim=1)
# Now let's get the model to continue the response we started.
model_outputs = model.generate(input_ids=final_input_ids,
attention_mask=attention_mask,
pad_token_id=tokenizer.eos_token_id,
max_new_tokens=200,
do_sample=False,
)
print(tokenizer.decode(model_outputs[0], skip_special_tokens=False))
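Having forced the response to start as JSON, you will often want to parse it programmatically. One way to do that is to slice off the prompt tokens and re-attach the prefix we wrote ourselves. A sketch (this assumes the model closed the JSON object within the token budget, which is not guaranteed):
import json
# Keep only the newly generated tokens (everything after our combined prompt)
generated_ids = model_outputs[0][final_input_ids.shape[1]:]
generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
# Re-attach the response prefix we wrote ourselves to recover the full JSON string
json_text = '{"Vim_solution":' + generated_text
try:
    parsed = json.loads(json_text)
    print(parsed)
except json.JSONDecodeError:
    # The model may not have produced complete, valid JSON within max_new_tokens
    print("Could not parse model output as JSON:\n", json_text)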
# Clear the model from memory
import torch
del model
torch.cuda.empty_cache()