Running an interactive job on Palmetto
QSUB
Now, we arrive at the most important part of today’s workshop: getting on the compute nodes. Compute nodes are the real power of Palmetto. Let’s see which of the compute nodes are available at the moment:
whatsfree
We can see that the cluster is quite busy, but a fair number of compute nodes are available for us. Now, let’s request one compute node. Please type the following (or paste from the website into your SSH terminal):
qsub -I -l select=1:ncpus=4:mem=10gb:interconnect=1g,walltime=2:00:00
It is very important not to make typos: use spaces and upper/lowercase exactly as shown, and use the proper punctuation (note the : between ncpus and mem, and the , before walltime). If you make a mistake, nothing bad will happen, but the scheduler won’t understand your request.
Now, let’s carefully go through the request:
qsub means that we are asking the scheduler to grant us access to a compute node;
-I means it’s an interactive job (we’ll talk about it in a bit);
-l is the list of resource requirements we are asking for;
select=1 means we are asking for one compute node;
ncpus=4 means that we only need four CPUs on the node (since all Palmetto compute nodes have at least 8 CPUs, we might share the compute node with other users, but that’s OK because users who share a node do not interfere with each other);
mem=10gb means that we are asking for 10 GB of RAM (you shouldn’t ask for less than 8 GB); again, memory is specific to the user and is not shared between different users on the same node;
interconnect=1g is the type of interconnect (the allowed types are 1g, 10ge, fdr, hdr, and any). If you look at the output of whatsfree and cat /etc/hardware-table, you will see the different CPU/RAM configurations that are available for each type of interconnect. Typically, but not always, 1g nodes have less RAM and fewer CPUs than fdr and hdr nodes (with the hdr nodes being the most powerful in terms of RAM and CPUs);
finally, walltime=2:00:00 means that we are asking to use the node for 2 hours; after two hours we will be logged off the compute node if we haven’t already disconnected.
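To make the punctuation rules concrete, here is a minimal sketch that assembles the same request from its parts and prints it instead of submitting it. The function name build_qsub_request is a hypothetical helper for illustration, not a Palmetto or PBS tool.

```shell
# Hypothetical helper (not part of Palmetto or PBS): build the interactive
# request from its parts so the ':' and ',' separators are always correct.
build_qsub_request() {
    local nodes="$1" ncpus="$2" mem="$3" interconnect="$4" walltime="$5"
    # Options describing the node are joined with ':';
    # walltime is a separate resource, so it follows a ','.
    printf 'qsub -I -l select=%s:ncpus=%s:mem=%s:interconnect=%s,walltime=%s\n' \
        "$nodes" "$ncpus" "$mem" "$interconnect" "$walltime"
}

# Print the command for review; copy it into the terminal to actually submit.
build_qsub_request 1 4 10gb 1g 2:00:00
```

Printing the command first is a deliberate choice here: it lets you check the request by eye before the scheduler sees it.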
This is actually a very modest request, and the scheduler should grant it right away. Sometimes, when we ask for a much more substantial amount of resources (for example, 20 nodes with 40 cores and 370 GB of RAM), the scheduler cannot satisfy our request immediately and will put us into the queue, so we will have to wait until the nodes become available.
Once the request is granted, you will see something like this:
[dndawso@login002 ~]$ qsub -I -l select=1:ncpus=4:mem=10gb:interconnect=1g,walltime=2:00:00
qsub (Warning): Interactive jobs will be treated as not rerunnable
qsub: waiting for job 74956.pbs02 to start
qsub: job 74956.pbs02 ready
[dndawso@node0033 ~]$
Importantly, you will see the prompt change. Previously, the prompt showed login002; now it shows node0033 (you might be on a different compute node). You can also see the job ID; in this case it is 74956.pbs02.
We can see the information about the compute node by using the pbsnodes command:
pbsnodes node0033
Here is the information about the node that I was assigned to (node0033):
node0033
Mom = node0033.palmetto.clemson.edu
ntype = PBS
state = free
pcpus = 8
Priority = 1
jobs = 61932.pbs02/0, 74956.pbs02/1, 74956.pbs02/2, 74956.pbs02/3, 74956.pbs02/4
resources_available.arch = linux
resources_available.chip_manufacturer = intel
resources_available.chip_model = xeon
resources_available.chip_type = e5520
resources_available.host = node0033
resources_available.hpmem = 0b
resources_available.interconnect = 1g, any
resources_available.make = dell
resources_available.manufacturer = dell
resources_available.mem = 31876mb
resources_available.model = r610
resources_available.ncpus = 8
resources_available.ngpus = 0
resources_available.node_make = dell
resources_available.node_manufacturer = dell
resources_available.node_model = r610
resources_available.phase = 1a
resources_available.qcat = c1_workq_qcat, c1_solo_qcat, osg_qcat, phase01a_qcat, mx_qcat, gilligan_qcat
resources_available.ssd = False
resources_available.vmem = 32836mb
resources_available.vnode = node0033
resources_available.vntype = cpu_node
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 1048576kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 1
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Thu Jan 26 04:00:34 2023
last_used_time = Thu Jan 26 10:25:35 2023
You can see that the node has 8 CPUs and no GPUs, and that it is currently running two jobs. One of these jobs is mine (74956). When I submitted the qsub request, the scheduler told me that my job ID is 74956. The pbsnodes command gives us the list of jobs that are currently running on the compute node, and, happily, I see my job on that list. It appears four times, because I have requested four CPUs. Somebody else is running a job (61932) that uses just one CPU.
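Counting entries in the jobs line by eye gets tedious on busier nodes. The sketch below tallies how many CPU slots each job holds; count_cpus_per_job is a hypothetical helper, and the sample line is copied from the output above so the snippet also runs away from the cluster.

```shell
# Hypothetical helper: read pbsnodes output on stdin and print
# "<job id> <number of CPU slots>" for every job on the node.
count_cpus_per_job() {
    grep -E '^[[:space:]]*jobs = ' \
        | sed -E 's/^[[:space:]]*jobs = //' \
        | tr ',' '\n' \
        | sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//; s#/[0-9]+$##' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Sample "jobs" line taken from the pbsnodes output shown above.
sample='jobs = 61932.pbs02/0, 74956.pbs02/1, 74956.pbs02/2, 74956.pbs02/3, 74956.pbs02/4'
printf '%s\n' "$sample" | count_cpus_per_job
```

On the cluster, the same function could be fed real output, for example with pbsnodes node0033 | count_cpus_per_job.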
To exit the compute node, type:
exit
This will bring you back to the login node. See how your prompt has changed to login002. It is important to notice that you have to be on a login node to request a compute node. Once you are on a compute node and want to move to another compute node, you have to exit first.
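If you lose track of where you are, the host name tells you. The following is a small sketch that classifies the current host; node_kind is a hypothetical helper, and it assumes that login nodes are named login... and compute nodes node..., as in the prompts shown above.

```shell
# Hypothetical helper: classify a host name using the prefixes seen in the
# prompts above (login002, node0033). The naming pattern is an assumption.
node_kind() {
    # Use the given name, or this machine's name if none is given.
    local host="${1:-$(uname -n)}"
    host="${host%%.*}"          # drop any domain suffix
    case "$host" in
        login*) echo "login" ;;
        node*)  echo "compute" ;;
        *)      echo "unknown" ;;
    esac
}

if [ "$(node_kind)" = "compute" ]; then
    echo "Already on a compute node; type exit before requesting another one."
else
    echo "Not on a compute node (detected: $(node_kind))."
fi
```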
For some jobs, you might want to get a GPU, or perhaps two GPUs. For such requests, the qsub command needs to specify the number of GPUs and the type of GPUs (which you can get from cat /etc/hardware-table). For example, let’s request an NVIDIA K20:
qsub -I -l select=1:ncpus=4:mem=10gb:ngpus=1:gpu_model=k20,walltime=0:10:00
You might have to wait for a bit if the K20 nodes are busy. Once you get on the compute node, you can run:
nvidia-smi
Then, exit the compute node to give other people a chance to get on it.
If you want a GPU but don’t care about the type of the GPU, you can request gpu_model=any.
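The two GPU variants differ only in the ngpus and gpu_model values. As a rough sketch, the hypothetical helper build_gpu_request below prints a GPU request and falls back to gpu_model=any when no model is given; the CPU and memory values are simply the ones used in the example above.

```shell
# Hypothetical helper: print an interactive GPU request.
# Arguments: number of GPUs, GPU model (default: any), walltime (default: 0:10:00).
build_gpu_request() {
    local ngpus="$1" gpu_model="${2:-any}" walltime="${3:-0:10:00}"
    printf 'qsub -I -l select=1:ncpus=4:mem=10gb:ngpus=%s:gpu_model=%s,walltime=%s\n' \
        "$ngpus" "$gpu_model" "$walltime"
}

build_gpu_request 1 k20     # one K20, as in the example above
build_gpu_request 2         # two GPUs of any model
```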
It is possible to ask for several compute nodes at a time; for example, select=4 will give you 4 compute nodes. Some programs, such as LAMMPS or NAMD, work a lot faster if you ask for several nodes. This is an advanced topic and we will not discuss it here, but you can find some examples on our website.
There are other resource limit selection options documented on our website.
Warning
Please be considerate of others when you issue qsub. Remember that Palmetto is a shared resource. Don’t request resources you don’t plan on actually using. Jobs that request in-demand resources and don’t use them are subject to termination.
Important
It is very important to remember that you shouldn’t run computations on the login node, because the login node is shared between everyone who logs into Palmetto, so your computations will interfere with other people’s login processes. However, once you are on a compute node, you can run some computations, because each user gets their own CPUs and RAM so there is no interference.
Modules
If you are on the compute node, exit it. Once you get on the login node, type this:
qsub -I -l select=1:ncpus=4:mem=10gb,walltime=2:00:00
We have a lot of software installed on Palmetto, but most of it is organized into modules, which need to be loaded. To see which modules are available on Palmetto, please type
module avail
Hit SPACE several times to get to the end of the module list. This is a very long list, and you can see that there is a lot of software installed for you. If you want to see which versions of MATLAB are installed, you can type
module avail matlab
[dndawso@node0033 ~]$ module avail matlab
------------------------------------------------- /software/AltModFiles --------------------------------------------------
matlab/MUSC2018b matlab/2021a matlab/2021b matlab/2022a (D)
Where:
D: Default Module
If the avail list is too long consider trying:
"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
Let’s say you want to use R. To load the module, you will need to specify its full name. To see which versions of R are available, type
module avail r
This will give you a list of all modules which have the letter “r” in them (module avail is not very sophisticated). Let’s see what happens when you load the R 4.1.3 module:
module load r/4.1.3-gcc/9.5.0
module list
Currently Loaded Modules:
1) tcl/8.6.12-gcc/9.5.0 4) openjdk/11.0.15_10-gcc/9.5.0 7) glib/2.72.1-gcc/9.5.0
2) sqlite/3.38.5-gcc/9.5.0 5) libxml2/2.9.13-gcc/9.5.0 8) cairo/1.16.0-gcc/9.5.0
3) openssl/1.1.1o-gcc/9.5.0 6) libpng/1.6.37-gcc/9.5.0 9) r/4.1.3-gcc/9.5.0
R depends on other software to run, so we have configured the R module in a way that when you load it, it automatically loads other modules that it depends on.
To start command-line R, you can simply type
R
To quit R, type
quit()
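Because module avail r matches every module with an “r” in its name, it can help to filter the list down to names that start with r/. The sketch below does this on a few names copied from the module list above; only_module is a hypothetical helper, not a module subcommand.

```shell
# Hypothetical helper: keep only the lines that start with "<name>/".
only_module() {
    awk -v prefix="$1/" 'index($0, prefix) == 1'
}

# Sample names copied from the "module list" output above.
printf '%s\n' \
    'tcl/8.6.12-gcc/9.5.0' \
    'openjdk/11.0.15_10-gcc/9.5.0' \
    'cairo/1.16.0-gcc/9.5.0' \
    'r/4.1.3-gcc/9.5.0' \
    | only_module r

# On the cluster, assuming the module command supports terse output (-t)
# and writes its listing to standard error, this might look like:
#   module -t avail 2>&1 | only_module r
```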
Key Points
qsub sends a request for a compute node to the scheduler.
Software available on Palmetto is organized into modules according to version.
Modules need to be loaded before use.