
Slurm

Introduction

The Slurm platform provides a multi-node HPC environment based on the Slurm workload manager and Open OnDemand. The platform can be accessed through the Open OnDemand web interface in a browser, or via SSH.

Launch configuration

Warning

Platforms and their names are visible to all members of the cloud project!

Platform name: A name to identify the Slurm platform.
External IP: If the list is empty, use the plus button to assign an external IP address to your cloud project, then select an IP to assign to the login node of your Slurm platform.
Compute node count: The number of Slurm compute (worker) nodes to configure for your Slurm platform.
Compute node size: The size of the Slurm compute (worker) nodes. The options in this menu are set by the cloud operator, and the number of CPUs and amount of RAM are displayed for each size.
Run post-configuration validation?: Run a small suite of tests to check that the Slurm platform is functioning as expected.

Advanced

Platform monitoring

A Grafana dashboard for system monitoring is included in the platform and is accessible from the platforms page. It shows current and historical system information.

Additionally, Open OnDemand presents monitoring dashboards for each Slurm job.

Root access

To get passwordless sudo on the login node, SSH to it as the rocky user instead of the azimuth user shown on the platform's details page.

Other nodes can also be accessed as rocky by jumping through the login node, e.g.:

ssh -J rocky@$LOGIN_ADDR rocky@$NODE_ADDR

where $LOGIN_ADDR is the login node's address shown on the platform's details page. The other nodes' addresses can be retrieved from the /etc/hosts file on the login node, e.g.:

ssh rocky@$LOGIN_ADDR cat /etc/hosts

Additional software

Software installed directly via sudo will be lost when the platform is upgraded, as upgrades are performed by reimaging all nodes with a new image.

Where possible, it is preferable to package additional software for use via Apptainer, which is installed on all Slurm platforms and supports both SIF and Docker/OCI container formats.
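
For example, a minimal sketch of this workflow, pulling a public Python image and running it as a Slurm job (the image name, SIF filename and command are illustrative):

# Pull a Docker/OCI image and convert it to a local SIF file
apptainer pull python_3.12.sif docker://python:3.12

# Run a command from the container as a Slurm job
srun apptainer exec python_3.12.sif python3 --version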

Some software is also available via the EESSI pilot repository; follow the instructions in the EESSI documentation.
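
For instance, a minimal sketch of making EESSI software available in a session, assuming the pilot repository is mounted under /cvmfs as described in the EESSI documentation:

# Initialise the EESSI pilot environment for this node's architecture
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# List the software modules that EESSI provides
module avail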

If these methods are not appropriate and the software has wide applicability, consider making a PR to the Slurm appliance repository, which builds the images used for the Slurm platforms. Additional Ansible tasks could be added to the extras.yml playbook.
