At Supercomputing '16 (SC16), I spent a lot of time talking about OpenStack.
OpenStack (and virtualization in general) has matured considerably over the last few years. Performance inside a virtual machine is now high enough that it is a legitimate landing zone for a growing number and variety of HPC workloads.
In the Red Hat booth on the show floor at SC '16, I ran a demo where I deployed multiple Slurm Workload Manager clusters inside an OpenStack instance and had them ready to do work. The demo highlighted a few key features of OpenStack, along with a little bit of Ansible, to show how they can work together.
Here’s a great interview with Dan McGuan talking about exactly this topic.
I built out the demo on a Red Hat OpenStack 8 installation we maintain in a lab environment at our headquarters in Raleigh. It is a fairly standard install that uses Nuage Networks as the SDN solution.
That said, this demo should run on any OpenStack platform with minimal, if any, changes.
The workload manager I picked for the initial demo was Slurm. It’s widespread and open source. I am also currently working on a version of the demo using Univa Grid Engine.
The Slurm source is easy enough to build, and I found a solid howto that outlined how to build Slurm RPMs and configure it.
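The Slurm tarball supports building RPMs directly with rpmbuild's `-ta` option, which is the approach the usual howtos take. The release version below is only an example, and the exact build dependencies vary by release; the command is collected into a variable rather than executed, so the sketch runs anywhere.

```shell
# Example only: substitute the Slurm release you actually downloaded.
SLURM_TARBALL=slurm-16.05.6.tar.bz2
BUILD_CMD="rpmbuild -ta ${SLURM_TARBALL}"

# Run this on a build host with rpm-build, gcc, and munge-devel installed.
echo "${BUILD_CMD}"
```

The resulting RPMs can then be installed into the image before it is snapshotted.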
I started with the publicly available RHEL 7.2 qcow2 image.
I ended up with two images that I used in my deployment.
Master Node Image
The master image served as the template for both. Following the walkthrough, I enabled both slurmctld and slurmd via systemctl. A few other highlights:
I set up /etc/hosts with static IP entries for each node
I set up ssh keys so the master node could ssh to itself and to every other node
I enabled ntpd so time stays in sync across the nodes (massively important for HPC workloads)
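The image-prep steps above can be sketched as a short script. It writes to local snippet files so it is safe to try anywhere; on the real image the targets are /etc/hosts and ~/.ssh/. The 192.168.100.x range and example.com domain are assumptions, since the post doesn't show the demo's actual addressing.

```shell
# 1. Static hosts entries for the five nodes (append to /etc/hosts on the image).
#    Addresses and domain are assumptions, not taken from the demo.
for i in 0 1 2 3 4; do
    echo "192.168.100.1${i}  node${i}.example.com node${i}"
done > hosts.snippet

# 2. One ssh keypair baked into the image; because the workers clone the
#    same image, the master can reach itself and every worker with it.
#    On the image, as root:
#      ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
#      cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys

# 3. Services enabled on the master image (workers skip slurmctld):
#      systemctl enable slurmctld slurmd ntpd
```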
Worker Node Image
The worker node image is identical to the master image, except that slurmctld is disabled via systemctl.
Creating the Slurm Cluster
To deploy a cluster, I launch the Heat stack via the command line, the API, or the GUI.
The deployment typically takes about 90 seconds, depending on the OpenStack instance you are using.
Each Slurm cluster is five nodes, and every cluster uses the same hostnames and IP addresses. This works because each cluster also creates its own SDN network and router, so there are no IP or hostname collisions, and your job scripts are easier to copy and paste between clusters.
Because each cluster has its own router, you can also deploy the same HOT (Heat Orchestration Template) multiple times.
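Since every cluster brings its own network and router, launching several clusters is just repeated stack creations with different stack names. The template filename and stack names below are assumptions; the commands are collected into a script for review rather than executed, so the sketch runs without an OpenStack endpoint.

```shell
# Assumed template filename -- substitute your own HOT.
TEMPLATE=slurm_cluster.yaml

# Generate one 'openstack stack create' per cluster; review the file,
# then run it against a real cloud with: bash deploy_clusters.sh
for n in 1 2 3; do
    echo "openstack stack create -t ${TEMPLATE} slurm-demo-${n}"
done > deploy_clusters.sh
```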
The only system with a Floating IP is the control node, which lets you access the cluster easily via ssh. You could restrict this access further, depending on your needs; since this is a demo, I wanted to be able to reach the system easily.
Preparing the Slurm Cluster
You can log into the master node as cloud-user (remember, it’s the standard RHEL 7.2 image). In the Heat stack, you specify the public ssh key you want injected into each of the hosts.
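Logging in then looks roughly like the following. The stack name, the stack output name, and the key path are all assumptions about how the HOT is written; the floating IP is a TEST-NET placeholder, and the real one would come from the stack outputs.

```shell
# Assumed stack name and output name; adjust to your template.
STACK=slurm-demo
# On a live cloud you would fetch the address from the stack outputs:
#   FIP=$(openstack stack output show ${STACK} master_floating_ip -f value -c output_value)
FIP=192.0.2.10   # placeholder address for the sketch

SSH_CMD="ssh -i ~/.ssh/demo_key cloud-user@${FIP}"
echo "${SSH_CMD}"
```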
Lab Prep with Ansible
The GitHub repository is checked out to /root/rhissr on the master node image. From that directory, you can run the lab-prep playbook:
[root@node0 ~]# cd rhissr/
[root@node0 rhissr]# ansible-playbook -i inventory reset_lab.yaml
This playbook does a few things:
stop all slurm services
clear out the slurm spool directories
restart the slurm services
This ensures the systems are functioning cleanly and able to communicate with one another.
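Roughly, what the playbook does on each node could be expressed as plain shell like the script below. The spool paths are common Slurm defaults, not taken from the demo; check SlurmdSpoolDir and StateSaveLocation in your slurm.conf. The script is written to a file for review rather than run directly.

```shell
# Approximation of reset_lab.yaml's per-node steps; spool paths are assumed
# defaults. Run the generated script as root on each node.
cat > reset_node.sh <<'EOF'
systemctl stop slurmctld slurmd
rm -rf /var/spool/slurmd/* /var/spool/slurmctld/*
systemctl start slurmd
systemctl start slurmctld   # master node only
EOF
```

Ansible simply does this across the whole inventory in one command.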
Running a job using sbatch
Now that you have a fully functional Slurm cluster, let’s run a simple job to verify everything is working.
[root@node0 rhissr]# cat sbatch.sh
#!/usr/bin/env bash
# Usage: sbatch -N5 $this_file
#SBATCH -o slurm.out
#SBATCH -p sc16
#SBATCH -D /tmp
srun hostname | sort
[root@node0 rhissr]# sbatch -N5 sbatch.sh
Submitted batch job 3
[root@node0 rhissr]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[root@node0 rhissr]# cat /tmp/slurm.out
node0.example.com
node1.example.com
node2.example.com
node3.example.com
node4.example.com
This script is incredibly simple: it runs hostname on every Slurm node and sorts the output, confirming that all five nodes are up and communicating.
Scaling from 5 to 50 to 500
Right now, to add additional nodes, you copy and paste a few stanzas in the HOT.
A few things are still on my to-do list:
Make a more science-y demo
Finish the Univa Grid Engine variant
Make scaling easier
OpenStack is an effective solution for an increasing number of HPC workloads. This demo shows how you can take OpenStack and Ansible and quickly create a viable HPC platform. The inherently multi-tenant nature of OpenStack means that multiple scientists can run their jobs simultaneously. At the end of the day, it’s all about "time-to-science", and this can help make that time shorter.