A NUMA-aware VM Balancer Using Group Scheduling for OpenNebula: Part 1

I’ve recently had the joy of building out a Hadoop cluster (Cloudera Distribution with Cloudera Manager) for internal development at work. The process was quite tedious, as a large number of machines had to be configured by hand. This seemed like a good application to move to a cloud infrastructure, both to improve cluster deployment times and to provide an on-demand resource for this workload. OpenNebula was the perfect choice for this compute-oriented cloud architecture because resources can be added to or removed from the Hadoop environment on demand. As Hadoop is a fairly memory-intensive workload, we wanted to improve memory throughput on the VMs, and group scheduling showed some promise for improving VM CPU and memory placement.

About our Infrastructure

We are working with Dell C6145s, which are 8-way NUMA systems based on quad-socket, 12-core AMD Magny-Cours processors. Since each socket contains two NUMA nodes, these systems have 8 NUMA domains even though they are only quad-socket. We wanted to see if group scheduling could be used to improve performance on these boxes by compartmentalizing VMs, minimizing memory accesses between NUMA domains and improving L2/L3 cache hit rates.

The Linux NUMA-aware scheduler already does a great job; however, we wanted to see if there was a quick and easy way to allocate resources on these NUMA machines to reduce non-local memory accesses, improve memory throughput, and in turn speed up memory-sensitive workloads like Hadoop. A cpuset is a combination of CPUs and memory nodes configured as a single scheduling domain. Libvirt, the control API used to manage KVM, has some capability to map vCPUs to real CPUs and even to present a virtual NUMA topology mimicking the host it's running on; however, we found it very cumbersome to use because each VM has to be hand-tuned to get any advantage. It also defeats the OpenNebula paradigm of rapid template-based provisioning.

Implementation

Alex Tsariounov wrote a very user-friendly program called cpuset (the cset command) that does all the heavy lifting of moving processes from one cpuset to another. The source is available from the Google Code repository below, or from the Ubuntu 12.04+ repositories.

http://code.google.com/p/cpuset/

I wrote a python wrapper script building on cpuset, which adds the following features:

  • Creates cpusets based on the numactl --hardware output
  • Maps CPUs and their memory domains into their respective cpusets
  • Places KVM virtual machines built using libvirt into cpusets using a balancing policy
  • Rebalances VMs based on the balancing policy
  • Runs once, then exits, so that system administrators can control when and how much balancing to do
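The first step above, turning numactl --hardware output into a node-to-CPU map, can be sketched in a few lines of Python. This is a minimal illustration, not the actual parsing code from vm-balancer.py:

```python
import re

def parse_numa_topology(numactl_output):
    """Map each NUMA node to its CPU list from `numactl --hardware` output.

    Lines of interest look like:
      node 0 cpus: 0 1 2 3 4 5
    """
    topology = {}
    for line in numactl_output.splitlines():
        m = re.match(r"node (\d+) cpus: (.*)", line.strip())
        if m:
            node = int(m.group(1))
            cpus = [int(c) for c in m.group(2).split()]
            topology[node] = cpus
    return topology

# Example using the first two nodes of the 48-core machine described above:
sample = """\
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 1 cpus: 6 7 8 9 10 11
"""
print(parse_numa_topology(sample))
# {0: [0, 1, 2, 3, 4, 5], 1: [6, 7, 8, 9, 10, 11]}
```

Each resulting node entry corresponds to one VMBS cpuset in the listings shown later.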

Implementation – Example

Cpusets without NUMA configuration; this is the status of most systems without group scheduling configured. The CPUs column lists the CPUs in that particular scheduling domain, and the MEMs column likewise lists its memory domains. This system has 48 cores (0-47) and 8 NUMA nodes (0-7).

root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   490    1 /
      libvirt      ***** n   ***** n     0    1 /libvirt

Cpusets after a vm-balancer.py run without any VMs running. Notice how the CPUs and memory domains have been paired up.

root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   489    9 /
      libvirt      ***** n   ***** n     0    1 /libvirt
        VMBS7      42-47 n       7 n     0    0 /VMBS7
        VMBS6      36-41 n       6 n     0    0 /VMBS6
        VMBS5      30-35 n       5 n     0    0 /VMBS5
        VMBS4      24-29 n       4 n     0    0 /VMBS4
        VMBS3      18-23 n       3 n     0    0 /VMBS3
        VMBS2      12-17 n       2 n     0    0 /VMBS2
        VMBS1       6-11 n       1 n     0    0 /VMBS1
        VMBS0        0-5 n       0 n     0    0 /VMBS0

The VM balancer and cset in action, moving 8 newly created KVM processes and their threads (vCPUs and I/O threads) each into a NUMA-aligned cpuset:

root@clyde:~# ./vm-balancer.py
Found cset at /usr/bin/cset
Found numactl at /usr/bin/numactl
Found virsh at /usr/bin/virsh
cset: --> created cpuset "VMBS0"
cset: --> created cpuset "VMBS1"
cset: --> created cpuset "VMBS2"
cset: --> created cpuset "VMBS3"
cset: --> created cpuset "VMBS4"
cset: --> created cpuset "VMBS5"
cset: --> created cpuset "VMBS6"
cset: --> created cpuset "VMBS7"
cset: moving following pidspec: 47737,47763,47762,47765,49299
cset: moving 5 userspace tasks to /VMBS0
[==================================================]%
cset: done
cset: moving following pidspec: 46200,46203,46204,46207
cset: moving 4 userspace tasks to /VMBS1
[==================================================]%
cset: done
cset: moving following pidspec: 45213,45210,45215,45214
cset: moving 4 userspace tasks to /VMBS2
[==================================================]%
cset: done
cset: moving following pidspec: 45709,45710,45711,45705
cset: moving 4 userspace tasks to /VMBS3
[==================================================]%
cset: done
cset: moving following pidspec: 46719,46718,46717,46714
cset: moving 4 userspace tasks to /VMBS4
[==================================================]%
cset: done
cset: moving following pidspec: 47306,47262,49078,47246,47278
cset: moving 5 userspace tasks to /VMBS5
[==================================================]%
cset: done
cset: moving following pidspec: 48247,48258,48252,48274
cset: moving 4 userspace tasks to /VMBS6
[==================================================]%
cset: done
cset: moving following pidspec: 48743,48748,48749,48746
cset: moving 4 userspace tasks to /VMBS7
[==================================================]%
cset: done
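Each pidspec above is the set of thread IDs belonging to one QEMU/KVM process: the parent process plus its vCPU and I/O threads. On Linux these can be read straight out of /proc; a minimal sketch (the function name is illustrative, not from the actual script):

```python
import os

def qemu_task_ids(pid):
    """Return all task (thread) IDs of a process, read from /proc/<pid>/task.

    For a QEMU/KVM process this yields the main process plus its vCPU and
    I/O threads -- the same kind of pidspec that cset moves in the run above.
    """
    task_dir = "/proc/%d/task" % pid
    return sorted(int(tid) for tid in os.listdir(task_dir))

# Demonstrate on the current Python process; its own PID is always listed:
print(qemu_task_ids(os.getpid()))
```

In practice the balancer would obtain each VM's parent PID from libvirt (e.g. via virsh) and then gather its threads this way before handing them to cset.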

After the VMs are balanced into their respective NUMA domains. Note that each VM has 3 vCPU threads plus 1 parent process; the VM showing 5 tasks is actually running an additional short-lived I/O thread.

root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   500    9 /
        VMBS7      42-47 n       7 n     5    0 /VMBS7
        VMBS6      36-41 n       6 n     4    0 /VMBS6
        VMBS5      30-35 n       5 n     4    0 /VMBS5
        VMBS4      24-29 n       4 n     4    0 /VMBS4
        VMBS3      18-23 n       3 n     4    0 /VMBS3
        VMBS2      12-17 n       2 n     4    0 /VMBS2
        VMBS1       6-11 n       1 n     4    0 /VMBS1
        VMBS0        0-5 n       0 n     4    0 /VMBS0
      libvirt      ***** n   ***** n     0    1 /libvirt

The Python script can be downloaded from http://code.google.com/p/vm-balancer-numa/downloads/list for now.

Note: This was inspired by work presented at KVM Forum 2011 by Andrew Theurer; we even used the same benchmark to test our configuration.
http://www.linux-kvm.org/wiki/images/5/53/2011-forum-Improving-out-of-box-performance-v1.4.pdf

In the next part I will explain how the vm-balancer.py script works and its limitations.