Running LLMs on Palmetto#
Instructor#
Instructor: Carl Ehrett
Office: 2105 Barre Hall, Clemson University
Email: cehrett AT clemson DOT edu
Workshop Description#
This workshop series introduces essential concepts related to LLMs and works through the common steps in an LLM inference workflow. The focus is on efficiently running LLMs, rather than on constructing, training, or fine-tuning them. Throughout the sessions, students will learn how to use the Hugging Face Transformers library to run LLMs on the Palmetto Cluster, how to apply LLMs to large datasets, and how to scale inference across multiple GPUs and multiple nodes.
Prerequisites#
All workshop participants should have a Palmetto Cluster account. If you do not already have an account, you can visit our getting started page.
Participants should be familiar with the Python programming language. This requirement could be fulfilled by personal projects, coursework, or completion of the Introduction to Python Programming workshop series.
Accessing Workshop Files#
You can download the notebooks and their contents as follows.
In the terminal, create or navigate to an empty folder. Run the following command: wget https://raw.githubusercontent.com/clemsonciti/rcde_workshops/master/llms_inference/download.sh
This downloads a script, download.sh, that, when run, will copy the full workshop files to your drive space. Now that you have that script, run the command: bash download.sh
You should now have a folder, llms_inference, which contains the workshop files.
Environment#
To run the code in this workshop, you will need a Python environment with the appropriate libraries installed. You can create such an environment as follows.
Navigate to the directory where the workshop contents are stored and submit the script create_env.sh as a job by running the command: sbatch create_env.sh
This will create a conda environment named LLMsInferenceWorkshop. (This will take a while; up to 60 minutes.) You can then use that environment as the Jupyter kernel to run the workshop notebooks.
Alternatively, if you'd rather run the script interactively: in the terminal (and not in JupyterLab), get an interactive session with the command: salloc --mem=12GB --time=01:30:00
Then, in the directory where the workshop contents are stored, run: bash create_env.sh
Hugging Face Hub#
In order to use the code in the Workshop notebooks, you will need a Hugging Face account. You can create one here.
Where to get models#
To use pre-trained models in your workflows, you need to know where to find and download them. Here’s a quick guide:
NOTE: USE THE DATA TRANSFER NODE TO DOWNLOAD MODELS. From the login node, run ssh hpcdtn01.rcd.clemson.edu or ssh hpcdtn02.rcd.clemson.edu.
Direct Download#
Many models are available for direct download from their creators’ websites or repositories.
Use wget or curl to fetch model files directly into your HPC environment.
Hugging Face Hub#
The Hugging Face Hub is a popular platform for accessing pre-trained models, datasets, and other ML resources.
What is the Hugging Face Hub?#
A community-driven platform with thousands of pre-trained models for NLP, computer vision, and more.
Provides tools for easy integration with Python scripts and libraries like transformers.
Setting Up Your Hugging Face Account#
Visit huggingface.co and create an account.
Generate a personal access token:
Go to Settings > Access Tokens.
Create a new token with the necessary permissions for downloading models.
Log in to the Hugging Face CLI to store your token:
Run the following command after activating your LLMsInferenceWorkshop environment: huggingface-cli login
Paste your access token when prompted.
The token will be saved in ~/.huggingface/token, allowing persistent access without needing to re-enter it.
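If you prefer to authenticate from Python rather than the CLI, the huggingface_hub library (installed as a dependency of transformers) provides a login() helper. This is a minimal sketch; it assumes you have stored your token in an environment variable (here called HF_TOKEN) rather than hard-coding it.
# Minimal sketch: programmatic login as an alternative to `huggingface-cli login`.
# Assumes the token was exported beforehand, e.g. `export HF_TOKEN=...`.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # saves the token for later downloads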
Setting the Cache Directory#
To avoid storing large models in your home directory, configure a cache directory on your scratch space:
export HF_HOME=/scratch/username/huggingface
Add this line to your .bashrc or .bash_profile to make it persistent.
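If you would rather set the cache location from within a Python script or notebook (for example, at the top of a batch job), you can set the environment variable before importing any Hugging Face libraries. This is a minimal sketch with a placeholder path; substitute your own username.
# Minimal sketch: point the Hugging Face cache at scratch space from Python.
# HF_HOME must be set *before* importing transformers for it to take effect.
import os
os.environ["HF_HOME"] = "/scratch/username/huggingface"  # placeholder path

from transformers import pipeline  # imported only after HF_HOME is set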
Downloading models#
Now open a new terminal window and re-activate your environment.
To download a model, run the command: huggingface-cli download [model_name]
For these notebooks, you will need to download Qwen/Qwen2.5-0.5B-Instruct, Qwen/Qwen2-VL-2B-Instruct, Equall/Saul-Instruct-v1, and meta-llama/Llama-3.2-1B.
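You can also download models from Python using the huggingface_hub library's snapshot_download function; with HF_HOME set as above, the files land in your scratch-space cache. Note that meta-llama/Llama-3.2-1B is a gated model, so you must accept its license on the model page and be logged in before downloading it. A minimal sketch:
# Minimal sketch: download a workshop model from Python instead of the CLI.
from huggingface_hub import snapshot_download

local_path = snapshot_download("Qwen/Qwen2.5-0.5B-Instruct")
print(f"Model files cached at: {local_path}")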
Background on LLMs#
LLM Overview#
Conceptual overview#
LLMs are a class of models that are designed to predict the next token (word, or chunk of a word) in a sequence of tokens. They are trained on a truly vast amount of text data.
LLMs these days tend to be Transformer-based neural networks. If you’re interested in learning more about the Transformer architecture, we have a workshop in which we build and train a Transformer-based LLM from scratch using PyTorch.
In this workshop, we will not dwell on the details of the Transformer architecture, how LLMs are trained, or the mathematical details of how these models work at a fundamental level. Instead, we will focus on how to use pre-trained LLMs to generate text efficiently on Palmetto. Everything we discuss here also extends to using multimodal LLMs on Palmetto.
Running an LLM can be simple or complicated, depending on how you want to use it. A very simple case would be the following code:
import warnings
import logging
# Suppress warnings and logging
warnings.filterwarnings('ignore')
logging.getLogger("transformers").setLevel(logging.ERROR)
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation",
model="Qwen/Qwen2.5-0.5B-Instruct",
)
messages = ["The purpose of this workshop is"]
output = pipe(messages)
print(output[0][0]['generated_text'])
# Clear the pipeline from memory
import torch
del pipe
torch.cuda.empty_cache()
Using LLMs on the Cluster#
How is using an LLM on the Cluster different from using a web interface like ChatGPT? How is it different from using an LLM through an API like OpenAI’s API?
Control: Beyond control over the system prompt, you have control over the model itself: you can choose exactly which model (and version) to use. This is crucial for reproducibility. Third-party services like Anthropic, Google, and OpenAI can and do change or retire models without warning, which can undermine your research.
Cost: Since you all already have access to the Cluster, you can run LLMs on the Cluster for free. This is not the case with third-party services.
Privacy: When running LLMs on the Cluster, your data never need leave the Cluster. This is not the case with third-party services.
Speed: The Cluster is a powerful computing resource. You can run LLMs on the Cluster much faster than you can on your own computer. This is especially true if you use a GPU on the Cluster, and if you batch your data.
Not a persistent server: While you can run LLM inference on the Palmetto Cluster, the Cluster does not support hosting a persistent, externally accessible service for real-time LLM interaction, like ChatGPT. HPC systems like the Cluster are optimized for batch processing and resource-intensive jobs, not for real-time interaction, especially with external users.
LLM Size#
Sizes of commonly used LLMs#
LLMs are enormous; while small ones can run reasonably well on a good laptop, larger ones require the kind of hardware you would only find in an HPC cluster like Palmetto.
We don’t know how big chatbots like GPT-4o are, because OpenAI won’t tell us. But models with similar performance typically have at least tens, and often hundreds, of billions of parameters.
That just means that the mathematical object that is the model is composed of that many numbers. Each such number is typically represented by a 16-bit floating point number, which is 2 bytes. So, a 100-billion-parameter model would be about 200 GB in size. To run that model, it’s not enough to store that on your SSD or HDD; you need to load it into memory – preferably GPU memory! Nobody’s laptop has that much memory.
Even if you can manage to run a 1- or 3-billion parameter model on your laptop, it will typically be much easier, faster and more efficient to run it on Palmetto, especially for large-scale workflows.
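This back-of-the-envelope arithmetic is easy to script. The sketch below estimates the memory needed just to hold the model weights for a few illustrative parameter counts; the model sizes are examples, not measurements, and real workloads need additional memory for activations and the KV cache.
# Minimal sketch: rough memory needed to hold LLM weights.
# bytes = parameter count * bytes per parameter (2 for fp16/bf16, 1 for int8, 0.5 for 4-bit).
def weight_memory_gb(n_params, bytes_per_param=2):
    return n_params * bytes_per_param / 1e9

for name, n_params in [("0.5B model", 0.5e9), ("8B model", 8e9), ("100B model", 100e9)]:
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB of weights in 16-bit precision")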
Quantization#
One way to run larger models on smaller hardware is to use quantization. This is useful sometimes both for running LLMs on your laptop and for running them on Palmetto.
Quantization represents the model parameters with fewer bits than the 16-bit floating point numbers that are typically used. This can reduce the size of the model by a factor of 2, 4, or more. It can also speed up the model. It comes at a cost, however: quantization can reduce the performance of the model.
# Load a heavily quantized model to demonstrate size reduction
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model in standard 16-bit (bfloat16) precision
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# Load the same model with 4-bit quantization
model_4bit = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=quantization_config
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=24)
pipe_4bit = pipeline("text-generation", model=model_4bit, tokenizer=tokenizer, max_new_tokens=24)
# Print model size info
print(f"Model loaded in 16-bit quantization, Approximate memory usage: {pipe.model.get_memory_footprint() / 1e9:.2f} GB")
print(f"Model loaded in 4-bit quantization, Approximate memory usage: {pipe_4bit.model.get_memory_footprint() / 1e9:.2f} GB")
message = "The Fibonacci sequence begins: "
print(f"16-bit model: {pipe(message)[0]['generated_text']}")
print(f"4-bit model: {pipe_4bit(message)[0]['generated_text']}")
# Clear the model from memory
import torch
del pipe
del pipe_4bit
del model
del model_4bit
torch.cuda.empty_cache()
LLM Variety#
Domain-specific LLMs#
There are many LLMs available for use. Some are general-purpose, like Llama-3.2 models. Others are domain-specific, like models trained on scientific literature, or on code, or even “non-language” data like molecular structures.
Most LLMs, though, are trained to be general-purpose. General-purpose models can be adapted to specific domains by prompt-engineering, few-shot learning, retrieval-augmented generation, fine-tuning, or other techniques.
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="Equall/Saul-Instruct-v1", torch_dtype=torch.bfloat16, device_map="auto")
messages = [
{"role": "user", "content": "Please explain the key differences between common law and civil law systems."},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=False)
print(outputs[0]["generated_text"])
# Clear the model from memory
del pipe, prompt, outputs
torch.cuda.empty_cache()
Multimodal LLMs#
Multimodal LLMs are LLMs that can take in not just text, but also images, audio, video, or other data types. They work essentially just like text LLMs, but they can tokenize and process these other data types as well.
Let’s have one such model take a look at the image below. We can ask the model questions about it.

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from pprint import pprint
import torch
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-2B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", max_pixels=256*28*28)
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "files/kitchen.jpg",
},
{"type": "text", "text": "Write a recipe that this woman might be using."},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
pprint(output_text[0])
# Clear the model from memory
del model
torch.cuda.empty_cache()
Inference vs. training#
The model is just a very large collection of (billions of) numbers, describing a mathematical operation which is performed on the input (the prompt) in order to produce the output (the generated text). Model training is the process of finding those numbers, identifying the values of the model parameters that get the best output.
Fine-tuning is a kind of model training in which a model, after first being trained on a broad dataset, is further trained on a specialized dataset.
In this workshop, we are not training or fine-tuning models; we are inferencing models. That means the model parameters are already fixed, and we won’t do anything to change them. We are instead using those fixed model parameters to generate outputs. This is also what happens when you chat with an LLM such as ChatGPT.
Training and fine-tuning are far more compute-intensive and require much more GPU memory than inference. For example, an 8B-parameter LLM takes about 16 GB of GPU memory to run inference, but about 32 GB or more to fine-tune, due to the additional memory required for gradient computations, optimizer states, and intermediate activations during backpropagation.
Fine-tuning is often not necessary, even for specialized tasks. Prompt engineering, few-shot learning, and retrieval-augmented generation are often just as effective as fine-tuning, or more so, for molding an LLM to your particular desired behavior.
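As a small illustration of adapting a general-purpose model without touching its parameters, here is a minimal few-shot prompting sketch using the same small Qwen model as earlier. The sentiment-labeling task and example reviews are made up for demonstration purposes.
# Minimal sketch: few-shot prompting as an alternative to fine-tuning.
# No model parameters change; the task is specified entirely in the prompt.
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct", max_new_tokens=5)

prompt = (
    "Label each review as Positive or Negative.\n"
    "Review: The talk was engaging and clear. Label: Positive\n"
    "Review: The cluster queue was frustratingly slow today. Label: Negative\n"
    "Review: These workshop notebooks are excellent. Label:"
)
print(pipe(prompt)[0]["generated_text"])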
Alternative LLM Frameworks: When to Use Them#
While transformers is the primary library we’ll use in this workshop for GPU-based workflows, it’s worth briefly introducing two lightweight alternatives: LlamaCpp and Ollama. These tools are particularly useful in scenarios where GPUs aren’t available, or for quick, lightweight experiments.
LlamaCpp#
What it is: A lightweight, CPU-first library designed for running quantized versions of models like LLaMA.
Why use it: Efficient for local inference on CPU nodes or testing quantized models without GPU dependency. Not good for batching or large-scale workflows.
Key feature: Extremely low memory footprint and no reliance on external frameworks.
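For reference, here is a minimal sketch of what CPU inference with the llama-cpp-python bindings might look like. This is not part of the workshop environment: it assumes you have installed the llama-cpp-python package and downloaded a quantized model in GGUF format, and the model path is a placeholder.
# Minimal sketch (not part of the workshop environment): CPU inference with llama-cpp-python.
# Assumes `pip install llama-cpp-python` and a quantized GGUF model file on disk.
from llama_cpp import Llama

llm = Llama(model_path="/scratch/username/models/model.gguf")  # placeholder path
result = llm("The purpose of this workshop is", max_tokens=32)
print(result["choices"][0]["text"])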
Ollama#
What it is: A simple CLI tool and platform for running pre-trained models locally.
Why use it: Easy to use for prototyping, quick experiments, or exploring pre-packaged models. Not good for batching or large-scale workflows.
Key feature: Integrated model management with minimal setup.
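Similarly, here is a minimal sketch of the Ollama Python client. This is also not part of the workshop environment: it assumes the ollama package is installed, the Ollama server is running locally, and the model tag shown has already been pulled (e.g. with ollama pull).
# Minimal sketch (not part of the workshop environment): chatting with a locally served model.
# Assumes a running Ollama server and a previously pulled model tag.
import ollama

response = ollama.chat(
    model="llama3.2:1b",  # example model tag
    messages=[{"role": "user", "content": "Summarize what an HPC cluster is in one sentence."}],
)
print(response["message"]["content"])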
Comparison Table: When to Use Each Framework#
| Framework | Best Use Case | Strengths | Limitations |
|---|---|---|---|
| Transformers | GPU-based training and inference on large-scale models in HPC environments. | Extensive library, GPU acceleration, and flexibility for advanced workflows and training/fine-tuning. | Requires GPUs and higher resource overhead. |
| LlamaCpp | CPU-based inference for small or quantized models in low-resource settings. | Lightweight, runs efficiently on CPUs, no GPU dependency. | Slower for large-scale tasks, limited feature set. |
| Ollama | Simple, quick prototyping or local testing of pre-trained models. | Easy-to-use CLI, minimal setup required. | Less customizable, not designed for large-scale HPC workflows. |