Slurm agent#

This guide provides a comprehensive overview of setting up an environment to test the Slurm agent locally and enabling the agent in your Flyte deployment. Before proceeding, you must spin up your own Slurm cluster, as it serves as the foundation for the rest of the setup.

Spin up a Slurm cluster#

Setting up a Slurm cluster can be challenging due to the limited detail in the official instructions. This tutorial simplifies the process, focusing on configuring a single-host Slurm cluster with slurmctld (central management daemon) and slurmd (compute node daemon).

Install MUNGE#

MUNGE is an authentication service that allows a process to authenticate the UID and GID of another local or remote process within a group of hosts that share common users and groups.

1. Install necessary packages#

sudo apt install munge libmunge2 libmunge-dev

2. Generate and verify a MUNGE credential#

Generate a MUNGE credential with munge -n, then decode and verify the encoded token as follows:

munge -n | unmunge | grep STATUS

Note

A status of STATUS: Success (0) is expected. The MUNGE key is stored at /etc/munge/munge.key; if the key is absent, run the following command to create one manually:

sudo /usr/sbin/create-munge-key
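If you create the key manually, also make sure it is owned by the munge user and readable only by that user, since munged refuses to start with an overly permissive key file (a quick fix, assuming the default key path):

sudo chown munge: /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key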

3. Start MUNGE#

Enable munged to start at boot and restart the service:

sudo systemctl enable munge
sudo systemctl restart munge

Note

To check if the daemon runs as expected, you can either use systemctl status munge or inspect the log file under /var/log/munge.
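For example (the log file name assumes the default munged.log):

systemctl status munge
sudo tail -n 20 /var/log/munge/munged.log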

Create a dedicated Slurm user#

The SlurmUser must be created as needed prior to starting Slurm and must exist on all nodes in your cluster.

— Slurm Super Quick Start

Make sure the uid is equal to the gid to avoid permission issues:

sudo adduser --system --uid <uid> --group --home /var/lib/slurm slurm

Note

A system user usually has a uid in the range 0-999; for details, refer to the section Add a system user.

Once the system user is created, you can verify it using the following command:

cat /etc/passwd | grep <uid>

Setting the correct ownership of specific Slurm-related directories is crucial to prevent access issues. The directories below are created automatically when the Slurm services start, but creating them manually and adjusting their ownership beforehand helps reduce errors:

sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown -R slurm: /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
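You can confirm the result with ls; each directory should now be owned by the slurm user:

ls -ld /var/spool/slurmctld /var/spool/slurmd /var/log/slurm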

Run the Slurm cluster#

1. Install Slurm packages#

First, download the Slurm source tarball from the SchedMD download site (we’ll use version 24.05.5 for illustration):

mkdir <your-clean-dir> && cd <your-clean-dir>
wget https://download.schedmd.com/slurm/slurm-24.05.5.tar.bz2

Note

We recommend downloading the file to a clean directory, because all Debian packages will be generated under this path.

Then, the Debian packages can be built following the official guide:

# Install basic Debian package build requirements
sudo apt-get update
sudo apt-get install -y build-essential fakeroot devscripts equivs

# (Optional) Install dependencies if missing
sudo apt install -y \
    libncurses-dev libgtk2.0-dev libpam0g-dev libperl-dev liblua5.3-dev \
    libhwloc-dev dh-exec librrd-dev libipmimonitoring-dev hdf5-helpers \
    libfreeipmi-dev libhdf5-dev man2html-base libcurl4-openssl-dev \
    libpmix-dev libhttp-parser-dev libyaml-dev libjson-c-dev \
    libjwt-dev liblz4-dev libmariadb-dev libdbus-1-dev librdkafka-dev

# Unpack the distributed tarball
tar -xaf slurm-24.05.5.tar.bz2

# cd to the directory containing the Slurm source
cd slurm-24.05.5

# (Optional) Enable source packages for Ubuntu 24.04
# For details, please refer to
# https://manpages.debian.org/stretch/apt/sources.list.5.en.html
# and https://askubuntu.com/questions/1512042/
sudo sed -i 's/^Types: deb$/Types: deb deb-src/' /etc/apt/sources.list.d/ubuntu.sources
sudo apt update

# Install the Slurm package dependencies
sudo mk-build-deps -i debian/control

# Build the Slurm packages
debuild -b -uc -us

Debian packages are built and placed under the parent directory <your-clean-dir>. Since the single-host Slurm cluster functions as both a controller and a compute node, the following packages are required: slurm-smd, slurm-smd-client (for CLI), slurm-smd-slurmctld, and slurm-smd-slurmd.

# cd to the parent directory
cd ..

sudo dpkg -i slurm-smd_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-client_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-slurmctld_24.05.5-1_amd64.deb
sudo dpkg -i slurm-smd-slurmd_24.05.5-1_amd64.deb

Note

Please refer to Installing Packages for package selection.

2. Generate a Slurm configuration file#

After installation, generate a valid slurm.conf file for the Slurm cluster. We recommend using the official configurator to create it.

The following key-value pairs need to be set manually. Please leave the other options unchanged, as the default settings are sufficient for running slurmctld and slurmd.

# == Cluster Name ==
ClusterName=localcluster

# == Control Machines ==
SlurmctldHost=localhost

# == Process Tracking ==
ProctrackType=proctrack/linuxproc

# == Event Logging ==
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

# == Compute Nodes ==
# For checking CPU info, please use `lscpu | egrep 'Socket|Thread|CPU\(s\)'`
# For checking Mem info, please use `free -m` and write "available" value
NodeName=localhost CPUs=<cpus> RealMemory=<available-mem> Sockets=<sockets> CoresPerSocket=<cores-per-socket> ThreadsPerCore=<threads-per-core> State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

After completing the form, submit it, copy the content, and save it to /etc/slurm/slurm.conf.
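Instead of reading the lscpu and free output yourself, you can also let slurmd report the hardware it detects in slurm.conf format and paste the result into the Compute Nodes section:

slurmd -C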

Note

For a sample configuration file, please refer to this slurm.conf.

Note

If you are using GPUs in your Slurm cluster, you need additional configuration files. Here is an example configuration for an Ubuntu machine with a Tesla T4 GPU:

In /etc/slurm/slurm.conf, add:

GresTypes=gpu
NodeName=localhost Gres=gpu:1 CPUs=<cpus> RealMemory=<available-mem> Sockets=<sockets> CoresPerSocket=<cores-per-socket> ThreadsPerCore=<threads-per-core> State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

In /etc/slurm/gres.conf, add:

AutoDetect=nvml
NodeName=localhost Name=gpu Type=tesla File=/dev/nvidia0

3. Start daemons#

Then, enable slurmctld and slurmd to start at boot and restart them.

# For controller
sudo systemctl enable slurmctld
sudo systemctl restart slurmctld

# For compute
sudo systemctl enable slurmd
sudo systemctl restart slurmd

You can verify the status of the daemons using systemctl status <daemon> or check the logs in /var/log/slurm/slurmctld.log and /var/log/slurm/slurmd.log to ensure the Slurm cluster is running correctly.
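For example, the following commands check both daemons at once and confirm that the node has registered with the controller (the node name assumes the localhost setup above):

sudo systemctl status slurmctld slurmd --no-pager
# If the node registered correctly, its State should be IDLE
scontrol show node localhost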

4. Try some Slurm commands#

Finally, run the following commands to ensure that a Slurm job can be submitted successfully:

  • sinfo: View information about Slurm nodes and partitions

root@rockwei:/etc/slurm# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle localhost

Note

Here’s a small tip to enable job submission when the state is set to drain. Simply change the state back to idle as shown below:

scontrol update nodename=<your-nodename> state=idle
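To see why a node was drained before resuming it, you can list drained nodes together with their reasons:

sinfo -R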
  • srun: Run a parallel job on a cluster managed by Slurm

root@rockwei:/etc/slurm# srun -N 1 hostname
rockwei

If both commands execute successfully and return the expected results, you can proceed with testing the Slurm agent.
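Since the agent ultimately submits work through sbatch, it is also worth verifying that a minimal batch script runs (a small smoke test; the script and output paths below are arbitrary choices):

cat <<'EOF' > /tmp/smoke_test.sh
#!/bin/bash
#SBATCH --job-name=smoke-test
#SBATCH --output=/tmp/smoke_test.out
hostname
EOF
sbatch /tmp/smoke_test.sh

# Once the job has finished (check with `squeue`), inspect the output
cat /tmp/smoke_test.out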

Test your Slurm agent locally#

This section provides a brief guide on setting up an environment to test the Slurm agent locally without running the backend service (e.g., flyte agent gRPC server). It covers both basic and advanced use cases: the basic use case runs a shell script directly, while the advanced use case executes user-defined task functions on a Slurm cluster.

Overview#

At the highest level, the Slurm agent has three core methods for interacting with a Slurm cluster, each of which maps onto a plain Slurm CLI call (see the sketch after this list):

  1. create: Send a srun or sbatch command to run a Slurm job on a Slurm cluster

  2. get: Use scontrol show job <job-id> to monitor the Slurm job state

  3. delete: Use scancel <job-id> to cancel the Slurm job
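Conceptually, this lifecycle maps onto plain Slurm CLI calls, which the agent issues over an SSH connection. A rough sketch, reusing the smoke-test script from the previous section purely for illustration:

# create: submit the job and capture its ID
job_id=$(sbatch --parsable /tmp/smoke_test.sh)

# get: poll the job state
scontrol show job "$job_id" | grep -o 'JobState=[A-Z]*'

# delete: cancel the job
scancel "$job_id"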

In its simplest form, the Slurm agent supports running a batch script using sbatch on a Slurm cluster, as shown below:

https://github.com/flyteorg/static-resources/blob/main/flytekit/plugins/slurm/basic_arch.png?raw=true

For Python function tasks, the Slurm agent supports running a batch script using sbatch on a Slurm cluster with pyflyte-fast-execute as the entrypoint, as shown below:

https://github.com/flyteorg/static-resources/blob/main/flytekit/plugins/slurm/slurm_function_task.png?raw=true

Set up a local test environment#

This setup consists of three main components: a client (localhost), a remote Slurm cluster, and an S3-compatible object storage. First, you need to configure an SSH connection between the client and the Slurm cluster; the agent relies on asyncssh for this communication. Then, an S3-compatible object storage is required for advanced use cases. Here, we use Amazon S3 as an example.

Note

A persistence layer, such as an S3-compatible object storage, is essential for managing complex workflows, particularly when integrating heterogeneous task types.

1. Install the Slurm agent on your local machine (Flyte client)#

Note

It is recommended to create a virtual environment when using Python to avoid contaminating the base environment and prevent conflicts between different projects.
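For example:

python -m venv slurm-agent-env
source slurm-agent-env/bin/activate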

pip install flytekitplugins-slurm

2. Install the Slurm agent on the Slurm cluster#

To run user-defined task functions on the Slurm cluster, you need to install the Slurm agent on it:

pip install flytekitplugins-slurm

3. Set up SSH configuration#

To facilitate communication between your local machine and the Slurm cluster, set up SSH on the local machine as follows:

  1. Create a new authentication key pair:

ssh-keygen -t rsa -b 4096
  2. Copy the public key into the Slurm cluster:

ssh-copy-id <username>@<fqdn-or-ip>
  3. Enable key-based authentication by writing the following content to ~/.ssh/config:

Host <host-alias>
  HostName <fqdn-or-ip>
  Port <ssh-port>
  User <username>
  IdentityFile <path-to-private-key>

Finally, run a sanity check to verify connectivity to the Slurm cluster:

ssh <host-alias>
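Because the agent connects programmatically via asyncssh, it is also worth confirming that the login works non-interactively, i.e. without any password prompt:

ssh -o BatchMode=yes <host-alias> hostname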

4. Set up Amazon S3 bucket#

For advanced use cases where user-defined task functions are executed on the Slurm cluster, an S3-compatible object storage is essential. The setup process is summarized below:

  1. Click the “Create bucket” button in the Amazon S3 console to create a bucket

Note

Please choose a unique bucket name and adjust the settings as needed.

  2. Click your user name in the top right corner and go to “Security credentials”

  3. Create an access key and save it

  4. Set up AWS credentials on both machines to enable access to the Amazon S3 bucket, starting with the default region:

[default]
region = <your-region>
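The access key created in the previous step typically goes into ~/.aws/credentials (a sketch of the standard AWS CLI layout; fill in the placeholders with your own values):

[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>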

Once configured, both machines will have access to the Amazon S3 bucket.

If you are running the Slurm agent on an AWS EC2 instance, you will need to:

  1. Add an IAM role to the EC2 instance under the “Security” settings

  2. Configure the IAM role with appropriate read and write permissions to access Flyte’s blob store

Note

You can verify your S3 access by running aws s3 ls in the terminal.

This command will list all accessible S3 buckets if your credentials are configured correctly.
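To check the specific bucket created above, you can also list its contents directly (the bucket name below is a placeholder):

aws s3 ls s3://<your-bucket-name>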

Specify agent configuration#

Enable the Slurm agent on the demo cluster by updating the ConfigMap:

kubectl edit configmap flyte-sandbox-config -n flyte

In the editor, map the slurm and slurm_fn task types to the agent service:
tasks:
  task-plugins:
    default-for-task-types:
      container: container
      container_array: k8s-array
      sidecar: sidecar
      slurm_fn: agent-service
      slurm: agent-service
    enabled-plugins:
      - container
      - sidecar
      - k8s-array
      - agent-service

Add the Slurm private key#

You must add the Slurm private key to the Flyte configuration.

  1. Install the flyteagent pod using helm

helm repo add flyteorg https://flyteorg.github.io/flyte
helm install flyteagent flyteorg/flyteagent --namespace flyte
  2. Set your Slurm private key as a secret (base64 encoded):

SECRET_VALUE=$(base64 < your_slurm_private_key_path) && \
kubectl patch secret flyteagent -n flyte --patch "{\"data\":{\"flyte_slurm_private_key\":\"$SECRET_VALUE\"}}"
  3. Restart the deployment:

kubectl rollout restart deployment flyteagent -n flyte
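To confirm the key was stored correctly, you can decode the secret and compare it with your private key file (a quick sanity check):

kubectl get secret flyteagent -n flyte \
  -o jsonpath='{.data.flyte_slurm_private_key}' | base64 --decode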

Upgrade the deployment#

kubectl rollout restart deployment flyte-sandbox -n flyte

Wait for the upgrade to complete. You can check the status of the deployment pods by running the following command:

kubectl get pods -n flyte
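Alternatively, kubectl rollout status blocks until the rollout has finished:

kubectl rollout status deployment flyte-sandbox -n flyte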

For using the Slurm agent on the Flyte cluster, see Slurm agent.