# Slurm
## Introduction
The Slurm platform provides a multi-node HPC environment based on the Slurm workload manager and Open OnDemand. The platform is accessible with a web browser via the Open OnDemand web interface, or via SSH.
## Launch configuration
> **Warning**
>
> Platforms and their names are visible to all members of the cloud project!
| Option | Explanation |
|---|---|
| Platform name | A name to identify the Slurm platform. |
| External IP | If the list is empty, use the plus button to assign an external IP address to your cloud project, then select an IP to assign to the login node of your Slurm platform. |
| Compute node count | The number of Slurm compute (worker) nodes to configure for your Slurm platform. |
| Compute node size | The size of the Slurm compute (worker) nodes. The options in this menu are set by the cloud operator, and the number of CPUs and quantity of RAM are displayed for each size. |
| Run post-configuration validation? | Run a small suite of tests to check that the Slurm platform is functioning as expected. |
### Advanced
## Platform monitoring
A Grafana dashboard for system monitoring is included in the platform and is accessible from the platforms page. It shows both current and historical system information.
Additionally, Open OnDemand presents monitoring dashboards for each Slurm job.
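To see a per-job dashboard in action, a job must first be submitted. As a minimal illustrative sketch (the job name, resources, and time limit are arbitrary example values, not values prescribed by the platform), a batch script like the following could be submitted with `sbatch`:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=hello        # name shown in squeue and the job dashboard
#SBATCH --nodes=1               # run on a single compute node
#SBATCH --ntasks=1              # one task is enough for this demo
#SBATCH --time=00:05:00         # short time limit for a test job

# Print the node the job landed on, then idle briefly so the job
# is visible in the per-job monitoring dashboard while it runs.
srun hostname
sleep 60
```

Submitting with `sbatch hello.sh` and checking `squeue -u $USER` should show the job, after which its dashboard appears in Open OnDemand.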
## Root access
To get passwordless `sudo` on the login node, SSH in as the `rocky` user instead of the `azimuth` user shown on the platform's details page.
Other nodes can also be accessed as `rocky` by jumping through the login node, e.g.:

```shell
ssh -J rocky@$LOGIN_ADDR rocky@$NODE_ADDR
```
where `$LOGIN_ADDR` is the login node's address shown on the platform's details page, and the other nodes' addresses can be retrieved from the `/etc/hosts` file on the login node, e.g.:

```shell
ssh rocky@$LOGIN_ADDR cat /etc/hosts
```
## Additional software
Software installed directly via `sudo` will be lost when the platform is upgraded, as upgrades are performed by reimaging all nodes with a new image.
Where possible, it is preferable to package additional software as a container for use with Apptainer, which is installed on all Slurm platforms and supports both SIF and Docker/OCI container formats.
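As an illustrative sketch, a Docker/OCI image can be pulled, converted to a SIF file, and run with Apptainer (the `lolcow` demo image is an example from the Apptainer project, not software provided by the platform):

```shell
# Pull a Docker/OCI image and convert it to a local SIF file
apptainer pull lolcow.sif docker://ghcr.io/apptainer/lolcow

# Run the container's default runscript
apptainer run lolcow.sif

# Or execute an arbitrary command inside the container
apptainer exec lolcow.sif cat /etc/os-release
```

Because the SIF file lives in your home directory rather than on the node image, it survives platform upgrades.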
Some software is also available via the EESSI pilot repository; see the EESSI documentation for setup instructions.
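As a hedged sketch of typical EESSI usage, assuming the EESSI pilot CernVM-FS repository is mounted on the nodes, the environment is initialised with a shell script and software is then loaded via environment modules (the initialisation path follows the EESSI pilot documentation; `GROMACS` is an example module name that may not be present in every pilot version):

```shell
# Initialise the EESSI pilot software environment for this shell
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# List the software made available by EESSI
module avail

# Load an application from the EESSI stack (example module name)
module load GROMACS
```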
If these methods are not appropriate and the software has wide applicability, consider making a PR to the Slurm appliance repository, which builds the images for the Slurm platforms. Additional Ansible tasks can be added to the `extras.yml` playbook.
Created: December 14, 2023