Slurm agent#
This guide provides a comprehensive overview of setting up an environment to test the Slurm agent locally and enabling the agent in your Flyte deployment. The first and most important step is to spin up your own Slurm cluster, as it serves as the foundation for everything that follows.
Spin up a Slurm cluster#
Setting up a Slurm cluster can be challenging due to the limited detail in the official instructions. This tutorial simplifies the process, focusing on configuring a single-host Slurm cluster with slurmctld (the central management daemon) and slurmd (the compute node daemon).
Install MUNGE#
MUNGE is an authentication service, allowing a process to authenticate the UID and GID of another local or remote process within a group of hosts having common users and groups.
1. Install necessary packages#
sudo apt install munge libmunge2 libmunge-dev
2. Generate and verify a MUNGE credential#
After a MUNGE credential is generated, you can decode and verify the encoded token as follows:
munge -n | unmunge | grep STATUS
Note
A status of STATUS: Success (0) is expected, and the MUNGE key is stored at /etc/munge/munge.key. If the key is absent, please run the following command to create one manually:
sudo /usr/sbin/create-munge-key
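If munged later fails to start, incorrect key permissions are a common cause: the key must be owned by the munge user and readable only by it. A quick fix, assuming the default key path:
# Restrict the MUNGE key to the munge user (assumed default key path)
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key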
3. Start MUNGE#
Please make munged start at boot and restart the service:
sudo systemctl enable munge
sudo systemctl restart munge
Note
To check if the daemon runs as expected, you can either use systemctl status munge or inspect the log file under /var/log/munge.
Create a dedicated Slurm user#
The SlurmUser must be created as needed prior to starting Slurm and must exist on all nodes in your cluster.
Please make sure that the uid is equal to the gid to avoid permission issues:
sudo adduser --system --uid <uid> --group --home /var/lib/slurm slurm
Note
A system user usually has a uid in the range 0-999; please refer to the section Add a system user.
Once the system user is created, you can verify it using the following command:
cat /etc/passwd | grep <uid>
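Alternatively, id prints the uid and gid directly so you can confirm they match; the numeric values below are illustrative:
id slurm
# uid=990(slurm) gid=990(slurm) groups=990(slurm)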
Properly setting ownership of specific Slurm-related directories is crucial to avoid access issues. These directories are created automatically when the Slurm services start, but manually creating them and adjusting ownership beforehand can make setup easier:
sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown -R slurm: /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
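To confirm the ownership change took effect, list the directories and check that slurm is the owner of each:
ls -ld /var/spool/slurmctld /var/spool/slurmd /var/log/slurm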
Run the Slurm cluster#
1. Install Slurm packages#
First, you can download the Slurm source from here (we’ll use version 24.05.5 for illustration):
mkdir <your-clean-dir> && cd <your-clean-dir>
wget https://download.schedmd.com/slurm/slurm-24.05.5.tar.bz2
Note
We recommend downloading the file to a clean directory because all Debian packages will be generated under this path.
Then, Debian packages can be built following this official guide:
# Install basic Debian package build requirements
sudo apt-get update
sudo apt-get install -y build-essential fakeroot devscripts equivs
# (Optional) Install dependencies if missing
sudo apt install -y \
libncurses-dev libgtk2.0-dev libpam0g-dev libperl-dev liblua5.3-dev \
libhwloc-dev dh-exec librrd-dev libipmimonitoring-dev hdf5-helpers \
libfreeipmi-dev libhdf5-dev man2html-base libcurl4-openssl-dev \
libpmix-dev libhttp-parser-dev libyaml-dev libjson-c-dev \
libjwt-dev liblz4-dev libmariadb-dev libdbus-1-dev librdkafka-dev
# Unpack the distributed tarball
tar -xaf slurm-24.05.5.tar.bz2
# cd to the directory containing the Slurm source
cd slurm-24.05.5
# (Optional) Enable source packages for Ubuntu 24.04
# For details, please refer to
# https://manpages.debian.org/stretch/apt/sources.list.5.en.html
# and https://askubuntu.com/questions/1512042/
sudo sed -i 's/^Types: deb$/Types: deb deb-src/' /etc/apt/sources.list.d/ubuntu.sources
sudo apt update
# Install the Slurm package dependencies
sudo mk-build-deps -i debian/control
# Build the Slurm packages
debuild -b -uc -us
Debian packages are built and placed under the parent directory <your-clean-dir>. Since the single-host Slurm cluster functions as both a controller and a compute node, the following packages are required: slurm-smd, slurm-smd-client (for CLI), slurm-smd-slurmctld, and slurm-smd-slurmd.
# cd to the parent directory
cd ..
sudo dpkg -i slurm-smd_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-client_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-slurmctld_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-slurmd_24.05.5-1_amd64.deb
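As a quick sanity check, list the installed Slurm packages and confirm all four are present:
dpkg -l | grep slurm-smd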
Note
Please refer to Installing Packages for package selection.
2. Generate a Slurm configuration file#
After installation, generate a valid slurm.conf file for the Slurm cluster. We recommend using the official configurator to create it.
The following key-value pairs need to be set manually. Please leave the other options unchanged, as the default settings are sufficient for running slurmctld and slurmd.
# == Cluster Name ==
ClusterName=localcluster
# == Control Machines ==
SlurmctldHost=localhost
# == Process Tracking ==
ProctrackType=proctrack/linuxproc
# == Event Logging ==
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
# == Compute Nodes ==
# For checking CPU info, please use `lscpu | egrep 'Socket|Thread|CPU\(s\)'`
# For checking Mem info, please use `free -m` and write "available" value
NodeName=localhost CPUs=<cpus> RealMemory=<available-mem> Sockets=<sockets> CoresPerSocket=<cores-per-socket> ThreadsPerCore=<threads-per-core> State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
After completing the form, submit it, copy the content, and save it to /etc/slurm/slurm.conf.
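Instead of collecting the hardware values by hand, you can run slurmd -C on the node; it prints the detected physical configuration in slurm.conf format, which you can compare against (or paste into) the NodeName line above. The output below is illustrative:
slurmd -C
# NodeName=localhost CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15990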
Note
For a sample configuration file, please refer to this slurm.conf.
Note
If you are using GPUs in your Slurm cluster, you need additional configuration files. Here is an example configuration for an Ubuntu machine with a Tesla T4 GPU:
In /etc/slurm/slurm.conf, add:
GresTypes=gpu
NodeName=localhost Gres=gpu:1 CPUs=<cpus> RealMemory=<available-mem> Sockets=<sockets> CoresPerSocket=<cores-per-socket> ThreadsPerCore=<threads-per-core> State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
In /etc/slurm/gres.conf, add:
AutoDetect=nvml
NodeName=localhost Name=gpu Type=tesla File=/dev/nvidia0
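Once the daemons are running (see the next step), you can verify that the GPU is schedulable by requesting it through Slurm, assuming nvidia-smi is available on the node:
srun --gres=gpu:1 nvidia-smi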
3. Start daemons#
Then, enable slurmctld and slurmd to start at boot and restart them:
# For controller
sudo systemctl enable slurmctld
sudo systemctl restart slurmctld
# For compute
sudo systemctl enable slurmd
sudo systemctl restart slurmd
You can verify the status of the daemons using systemctl status <daemon> or check the logs in /var/log/slurm/slurmctld.log and /var/log/slurm/slurmd.log to ensure the Slurm cluster is running correctly.
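You can also ask the controller directly whether it is up:
scontrol ping
# Expected output similar to: Slurmctld(primary) at localhost is UP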
4. Try some Slurm commands#
Finally, run the following commands to ensure that a Slurm job can be submitted successfully:
sinfo: View information about Slurm nodes and partitions
root@rockwei:/etc/slurm# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle localhost
Note
Here’s a small tip to enable job submission when the state is set to drain. Simply change the state back to idle as shown below:
scontrol update nodename=<your-nodename> state=idle
srun: Run a parallel job on a cluster managed by Slurm
root@rockwei:/etc/slurm# srun -N 1 hostname
rockwei
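Since the Slurm agent submits work with sbatch as well, it is worth confirming batch submission. A minimal test, where the script name and contents are illustrative:
cat <<'EOF' > test_job.sh
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello.out
echo "Hello from $(hostname)"
EOF
sbatch test_job.sh
# Expected output similar to: Submitted batch job <job-id>
# After the job finishes, inspect the result:
cat hello.out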
If these commands execute successfully and return the expected results, you can proceed with testing the Slurm agent.
Test your Slurm agent locally#
This section provides a brief guide on setting up an environment to test the Slurm agent locally without running the backend service (e.g., flyte agent gRPC server). It covers both basic and advanced use cases: the basic use case runs a shell script directly, while the advanced use case executes user-defined task functions on a Slurm cluster.
Overview#
At the highest level, the Slurm agent has three core methods to interact with a Slurm cluster:
create: Send an srun or sbatch command to run a Slurm job on a Slurm cluster
get: Use scontrol show job <job-id> to monitor the Slurm job state
delete: Use scancel <job-id> to cancel the Slurm job
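Conceptually, these methods map to ordinary Slurm CLI calls. The sketch below shows rough equivalents; the job id and script path are illustrative:
# create: submit the job (the agent composes the command for you)
sbatch /path/to/job.slurm      # prints: Submitted batch job 123
# get: poll the job state
scontrol show job 123 | grep JobState
# delete: cancel the job
scancel 123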
In its simplest form, the Slurm agent supports running a batch script using sbatch on a Slurm cluster, as shown below:

For Python function tasks, the Slurm agent supports running a batch script using sbatch on a Slurm cluster with pyflyte-fast-execute as the entrypoint, as shown below:

Set up a local test environment#
This setup consists of three main components: a client (localhost), a remote Slurm cluster, and an S3-compatible object storage. First, you need to configure an SSH connection to facilitate communication between the client and the cluster, which relies on asyncssh. Then, an S3-compatible object storage is required for advanced use cases. Here, we use Amazon S3 as an example.
Note
A persistence layer, such as an S3-compatible object storage, is essential for managing complex workflows, particularly when integrating heterogeneous task types.
1. Install the Slurm agent on your local machine (Flyte client)#
Note
It is recommended to create a virtual environment when using Python to avoid contaminating the base environment and prevent conflicts between different projects.
pip install flytekitplugins-slurm
2. Install the Slurm agent on the Slurm cluster#
To run user-defined task functions on the Slurm cluster, you need to install the Slurm agent on it:
pip install flytekitplugins-slurm
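On either machine, you can confirm that the plugin is installed with pip:
pip show flytekitplugins-slurm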
3. Set up SSH configuration#
To facilitate communication between your local machine and the Slurm cluster, please set up SSH on the local machine as follows:
Create a new authentication key pair:
ssh-keygen -t rsa -b 4096
Copy the public key into the Slurm cluster:
ssh-copy-id <username>@<fqdn-or-ip>
Enable key-based authentication by writing the following content to ~/.ssh/config:
Host <host-alias>
HostName <fqdn-or-ip>
Port <ssh-port>
User <username>
IdentityFile <path-to-private-key>
Finally, run a sanity check to verify connectivity to the Slurm cluster:
ssh <host-alias>
4. Set up Amazon S3 bucket#
For advanced use cases where user-defined task functions are executed on the Slurm cluster, an S3-compatible object storage is essential. The setup process is summarized below:
Click the “Create bucket” button to create a bucket on this page
Note
Please choose a unique bucket name and adjust the settings as needed.
Click your user name in the top right corner and go to “Security credentials”
Create an access key and save it
Set up AWS credentials to enable access to the Amazon S3 bucket on both machines
In ~/.aws/config:
[default]
region = <your-region>
In ~/.aws/credentials:
[default]
aws_access_key_id = <aws-access-key-id>
aws_secret_access_key = <aws-secret-access-key>
Once configured, both machines will have access to the Amazon S3 bucket.
If you are running the Slurm agent on an AWS EC2 instance, you will need to:
Add an IAM role to the EC2 instance under the “Security” settings
Configure the IAM role with appropriate read and write permissions to access Flyte’s blob store
Note
You can verify your S3 access by running aws s3 ls in the terminal. This command will list all accessible S3 buckets if your credentials are configured correctly.
Specify agent configuration#
Enable the Slurm agent on the demo cluster by updating the ConfigMap:
kubectl edit configmap flyte-sandbox-config -n flyte
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      slurm_fn: agent-service
      slurm: agent-service
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - agent-service
For a Flyte binary deployment, edit the relevant YAML file to specify the plugin:
tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - agent-service
    default-for-task-types:
      - container: container
      - container_array: k8s-array
      - slurm_fn: agent-service
      - slurm: agent-service
For a Flyte core deployment, create a file named values-override.yaml and add the following config to it:
enabled_plugins:
  tasks:
    task-plugins:
      enabled-plugins:
        - container
        - sidecar
        - k8s-array
        - agent-service
      default-for-task-types:
        container: container
        sidecar: sidecar
        container_array: k8s-array
        slurm_fn: agent-service
        slurm: agent-service
Add the Slurm private key#
You must add the Slurm private key to the Flyte configuration.
Install the flyteagent pod using helm:
helm repo add flyteorg https://flyteorg.github.io/flyte
helm install flyteagent flyteorg/flyteagent --namespace flyte
Set your Slurm private key as a secret (base64 encoded):
SECRET_VALUE=$(base64 < your_slurm_private_key_path) && \
kubectl patch secret flyteagent -n flyte --patch "{\"data\":{\"flyte_slurm_private_key\":\"$SECRET_VALUE\"}}"
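To confirm the key was stored correctly, decode the secret and check that it starts with a private-key header (the exact header depends on your key type):
kubectl get secret flyteagent -n flyte -o jsonpath='{.data.flyte_slurm_private_key}' | base64 --decode | head -n 1
# Expected output similar to: -----BEGIN OPENSSH PRIVATE KEY-----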
Restart the deployment:
kubectl rollout restart deployment flyteagent -n flyte
Upgrade the deployment#
For the demo cluster:
kubectl rollout restart deployment flyte-sandbox -n flyte
For a Flyte binary deployment:
helm upgrade <RELEASE_NAME> flyteorg/flyte-binary -n <YOUR_NAMESPACE> --values <YOUR_YAML_FILE>
Replace <RELEASE_NAME> with the name of your release (e.g., flyte-backend), <YOUR_NAMESPACE> with the name of your namespace (e.g., flyte), and <YOUR_YAML_FILE> with the name of your YAML file.
For a Flyte core deployment:
helm upgrade <RELEASE_NAME> flyte/flyte-core -n <YOUR_NAMESPACE> --values values-override.yaml
Replace <RELEASE_NAME> with the name of your release (e.g., flyte) and <YOUR_NAMESPACE> with the name of your namespace (e.g., flyte).
Wait for the upgrade to complete. You can check the status of the deployment pods by running the following command:
kubectl get pods -n flyte
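If the pods are running but Slurm tasks still fail, the agent logs are a good first place to look:
kubectl logs deployment/flyteagent -n flyte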
For examples of running tasks with the Slurm agent on a Flyte cluster, see the Slurm agent examples.