
A NUMA-aware VM Balancer Using Group Scheduling for OpenNebula: Part 1

Shankhadeep Shome

Jul 10, 2012

I’ve recently had the joy of building out a Hadoop cluster (Cloudera Distribution with Cloudera Manager) for internal development at work. The process was quite tedious, as a large number of machines had to be configured by hand. This seemed like a good workload to move to a cloud infrastructure, both to improve cluster deployment times and to provide an on-demand resource pool (see Cloudera Hadoop Distribution and Manager). OpenNebula was the perfect choice for this compute-oriented cloud architecture because resources can be allocated to and removed from the Hadoop environment on demand. Since Hadoop is a fairly memory-intensive workload, we wanted to improve memory throughput in the VMs, and group scheduling showed some promise for improving VM CPU and memory placement.

About our Infrastructure

We are working with Dell C6145 servers, which are 8-way NUMA systems based on quad-socket, 12-core AMD Magny-Cours processors; each socket contains two NUMA nodes, so even though these systems are quad socket, they have 8 NUMA domains! We wanted to see if group scheduling could be used to improve performance on these boxes by compartmentalizing VMs, so that memory accesses between NUMA domains are minimized and L2/L3 cache hit rates improve.
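
The NUMA layout is easy to confirm straight from sysfs (or from numactl --hardware, which the balancer itself uses); here is a small Python snippet that just prints each node and its CPU list, and on a C6145 it should report 8 nodes of 6 cores each:

#!/usr/bin/env python
# Print each NUMA node and the CPUs that belong to it, using standard sysfs paths.
# On the C6145 this should show 8 nodes (node0-node7) with 6 cores each.
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        print("%s: cpus %s" % (os.path.basename(node), f.read().strip()))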

The Linux NUMA-aware scheduler already does a great job; however, we wanted to see if there was a quick and easy way to allocate resources on these NUMA machines to reduce non-local memory accesses, improve memory throughput, and in turn speed up memory-sensitive workloads like Hadoop. A cpuset is a combination of CPUs and memory nodes configured as a single scheduling domain. Libvirt, the control API used to manage KVM, has some capability to map vCPUs to real CPUs and even to configure a virtual NUMA topology mimicking the host it runs on; however, we found it very cumbersome to use because each VM has to be hand-tuned to get any advantage. It also defeats the OpenNebula paradigm of rapid template-based provisioning.
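
To make the cpuset idea concrete, here is a minimal sketch of the kernel interface that cset automates: a cpuset is just a directory in the cpuset cgroup hierarchy whose cpuset.cpus and cpuset.mems files define the scheduling domain. The mount point and the example PID below are assumptions and will vary by distribution.

#!/usr/bin/env python
# Minimal sketch of the cpuset cgroup (v1) interface that cset wraps.
# Assumes the cpuset cgroup is mounted at /sys/fs/cgroup/cpuset;
# the mount point and the PID below are illustrative only.
import os

CPUSET_ROOT = "/sys/fs/cgroup/cpuset"

def make_cpuset(name, cpus, mems):
    """Create a scheduling domain limited to the given CPUs and memory nodes."""
    path = os.path.join(CPUSET_ROOT, name)
    if not os.path.isdir(path):
        os.mkdir(path)
    with open(os.path.join(path, "cpuset.cpus"), "w") as f:
        f.write(cpus)   # e.g. "0-5"
    with open(os.path.join(path, "cpuset.mems"), "w") as f:
        f.write(mems)   # e.g. "0" -> allocate memory only from NUMA node 0
    return path

def move_process(path, pid):
    """Move a whole process (all of its threads) into the cpuset."""
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))

if __name__ == "__main__":
    vmbs0 = make_cpuset("VMBS0", "0-5", "0")   # CPUs 0-5 plus local memory node 0
    move_process(vmbs0, 12345)                 # 12345: hypothetical qemu-kvm PID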

Implementation

Alex Tsariounov wrote a very user-friendly program called cpuset (the cset command) that does all the heavy lifting of moving processes from one cpuset to another. The source is available from the Google Code repository below or from the Ubuntu 12.04+ repositories.

http://code.google.com/p/cpuset/

I wrote a Python wrapper script building on cpuset, which adds the following features (a rough sketch of the approach follows the list):

  • Creates cpusets based on the numactl --hardware output
  • Maps CPUs and their memory domains into their respective cpusets
  • Places KVM virtual machines built with libvirt into cpusets using a balancing policy
  • Rebalances VMs based on the balancing policy
  • Runs once and then exits, so that system administrators can control when and how much balancing is done.
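
The actual script is linked at the end of this article; the sketch below is only an outline of the approach, assuming cset, numactl and virsh are in the PATH, that virsh is new enough to support list --name, and that libvirt writes per-domain pidfiles under /var/run/libvirt/qemu/. The helper names here are mine, not the script's.

#!/usr/bin/env python
# Illustrative sketch of the balancing approach, not the actual vm-balancer.py.
import os
import subprocess

def numa_nodes():
    """Parse `numactl --hardware` into a {node: "cpu,cpu,..."} map."""
    out = subprocess.check_output(["numactl", "--hardware"]).decode()
    nodes = {}
    for line in out.splitlines():
        parts = line.split()
        # lines look like: "node 0 cpus: 0 1 2 3 4 5"
        if len(parts) > 3 and parts[0] == "node" and parts[2] == "cpus:":
            nodes[int(parts[1])] = ",".join(parts[3:])
    return nodes

def create_cpusets(nodes):
    """One cpuset per NUMA node: its CPUs paired with its local memory node."""
    for node, cpus in nodes.items():
        subprocess.check_call(
            ["cset", "set", "-c", cpus, "-m", str(node), "-s", "VMBS%d" % node])

def vm_tasks(domain):
    """All task (thread) ids of a running KVM domain, via its libvirt pidfile."""
    with open("/var/run/libvirt/qemu/%s.pid" % domain) as f:
        pid = f.read().strip()
    return ",".join(os.listdir("/proc/%s/task" % pid))

def balance():
    nodes = numa_nodes()
    create_cpusets(nodes)
    domains = subprocess.check_output(
        ["virsh", "list", "--name"]).decode().split()
    for i, dom in enumerate(domains):          # naive round-robin placement
        target = "VMBS%d" % (i % len(nodes))
        subprocess.check_call(
            ["cset", "proc", "--move", "--pid=" + vm_tasks(dom), "--toset=" + target])

if __name__ == "__main__":
    balance()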

Implementation – Example

Here is the cpuset layout without any NUMA configuration; this is the status of most systems without group scheduling configured. The CPUs column lists the CPUs in that particular scheduling domain, and the MEMs column the memory nodes. This system has 48 cores (0-47) and 8 NUMA nodes (0-7).

root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   490    1 /
      libvirt      ***** n   ***** n     0    1 /libvirt

Cpuset layout after a vm-balancer.py run with no VMs running. Notice how the CPUs and memory domains have been paired up.

root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   489    9 /
      libvirt      ***** n   ***** n     0    1 /libvirt
        VMBS7      42-47 n       7 n     0    0 /VMBS7
        VMBS6      36-41 n       6 n     0    0 /VMBS6
        VMBS5      30-35 n       5 n     0    0 /VMBS5
        VMBS4      24-29 n       4 n     0    0 /VMBS4
        VMBS3      18-23 n       3 n     0    0 /VMBS3
        VMBS2      12-17 n       2 n     0    0 /VMBS2
        VMBS1       6-11 n       1 n     0    0 /VMBS1
        VMBS0        0-5 n       0 n     0    0 /VMBS0

VM balancer and cset in action, moving 8 newly created KVM processes and their threads (vCPUs and iothreads) into their NUMA-aligned cpusets:

root@clyde:~# ./vm-balancer.py
Found cset at /usr/bin/cset
Found numactl at /usr/bin/numactl
Found virsh at /usr/bin/virsh
cset: --> created cpuset "VMBS0"
cset: --> created cpuset "VMBS1"
cset: --> created cpuset "VMBS2"
cset: --> created cpuset "VMBS3"
cset: --> created cpuset "VMBS4"
cset: --> created cpuset "VMBS5"
cset: --> created cpuset "VMBS6"
cset: --> created cpuset "VMBS7"
cset: moving following pidspec: 47737,47763,47762,47765,49299
cset: moving 5 userspace tasks to /VMBS0
[==================================================]%
cset: done
cset: moving following pidspec: 46200,46203,46204,46207
cset: moving 4 userspace tasks to /VMBS1
[==================================================]%
cset: done
cset: moving following pidspec: 45213,45210,45215,45214
cset: moving 4 userspace tasks to /VMBS2
[==================================================]%
cset: done
cset: moving following pidspec: 45709,45710,45711,45705
cset: moving 4 userspace tasks to /VMBS3
[==================================================]%
cset: done
cset: moving following pidspec: 46719,46718,46717,46714
cset: moving 4 userspace tasks to /VMBS4
[==================================================]%
cset: done
cset: moving following pidspec: 47306,47262,49078,47246,47278
cset: moving 5 userspace tasks to /VMBS5
[==================================================]%
cset: done
cset: moving following pidspec: 48247,48258,48252,48274
cset: moving 4 userspace tasks to /VMBS6
[==================================================]%
cset: done
cset: moving following pidspec: 48743,48748,48749,48746
cset: moving 4 userspace tasks to /VMBS7
[==================================================]%
cset: done

After the VMs are balanced into their respective NUMA domains, note that each VM has 3 vCPU threads and 1 parent process; the VM showing 5 tasks is also running a short-lived iothread.

root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   500    9 /
        VMBS7      42-47 n       7 n     5    0 /VMBS7
        VMBS6      36-41 n       6 n     4    0 /VMBS6
        VMBS5      30-35 n       5 n     4    0 /VMBS5
        VMBS4      24-29 n       4 n     4    0 /VMBS4
        VMBS3      18-23 n       3 n     4    0 /VMBS3
        VMBS2      12-17 n       2 n     4    0 /VMBS2
        VMBS1       6-11 n       1 n     4    0 /VMBS1
        VMBS0        0-5 n       0 n     4    0 /VMBS0
      libvirt      ***** n   ***** n     0    1 /libvirt
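
A quick way to double-check what ended up in each cpuset is to read the tasks file of each VMBS set and look up the thread names in /proc. Here is a small sketch, assuming the cpuset hierarchy is mounted at /sys/fs/cgroup/cpuset (adjust the path to your mount point):

#!/usr/bin/env python
# List how many tasks each VMBS cpuset holds and which commands they belong to.
# Assumes the cpuset cgroup is mounted at /sys/fs/cgroup/cpuset.
import glob
import os

for tasks_file in sorted(glob.glob("/sys/fs/cgroup/cpuset/VMBS*/tasks")):
    cpuset = os.path.basename(os.path.dirname(tasks_file))
    with open(tasks_file) as f:
        tids = f.read().split()
    names = set()
    for tid in tids:
        try:
            with open("/proc/%s/comm" % tid) as c:
                names.add(c.read().strip())
        except IOError:   # short-lived iothreads may have exited already
            pass
    print("%s: %d tasks (%s)" % (cpuset, len(tids), ", ".join(sorted(names))))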

For now, the Python script can be downloaded from http://code.google.com/p/vm-balancer-numa/downloads/list.

Note: this was inspired by work presented at KVM Forum 2011 by Andrew Theurer; we even used the same benchmark to test our configuration.
http://www.linux-kvm.org/wiki/images/5/53/2011-forum-Improving-out-of-box-performance-v1.4.pdf

In the next part I will explain how the vm-balancer.py script works and its limitations.

12 Comments

  1. Poil

    Hi,

    It doesn’t work on Ubuntu 12.04; I’m looking into why.

    root@vmhost02l:/opt/MegaRAID/MegaCli# vm-balancer.py
    Found cset at /usr/bin/cset
    Found numactl at /usr/bin/numactl
    Found virsh at /usr/bin/virsh
    cset: --> modified cpuset "VMBS0"
    cset: --> modified cpuset "VMBS1"
    cset: moving following pidspec: 31702,23797,31990,31842,31980,31762,31729,31786,23800,23801,31733,31784,31785,32022,32020,32021,32019,31846,32000
    cset: moving 19 userspace tasks to /VMBS0
    [==================================================]%
    cset: **> 19 tasks are not movable, impossible to move
    cset: done
    cset: moving following pidspec: 30560,30581,23881,31328,23869,30585,31324,30595,30589,30597,23880,31322,31323,31329,31325,31327,31326,30399
    cset: moving 18 userspace tasks to /VMBS1
    [==================================================]%
    cset: **> 18 tasks are not movable, impossible to move
    cset: done

    For your information, you can replace
    ps -e | grep kvm | grep -v --
    with
    pgrep kvm

    Best regards

  2. Poil

    OK, I’ve patched cpuset; it’s now working.
    Sorry to have troubled you.

  3. Poil

    Another question: is there a difference between using vm-balancer (which uses cset) and virsh vcpupin?

  4. Shankhadeep

    virsh vcpupin does basically the same thing for CPUs; however, you need cpuset support if you want to control memory placement as well. Libvirt's cpuset mechanism is clunky and requires the definition file to be changed every time you want to pin CPU and memory. cset just works in most cases for this.

  5. Poil

    On Ubuntu 12.04 the VMs are already placed in cgroups, so I’ve written a small, simple script, which I'm sharing here:

    #!/bin/bash
    # Author: Benjamin DUPUIS
    # Role: test pinning of vCPUs to physical sockets
    # Date: 20/07/2012
    # Version: 1.00
    #----------------------------------------------
    # History:
    # 20/07/2012 - v1.00 BDU: initial version
    ################################################

    VM_DEF_DIR='/etc/libvirt/qemu/'
    SYS_DEV_CPU='/sys/devices/system/cpu/'

    [[ ! -d ${VM_DEF_DIR} ]] && exit 2
    [[ ! -d ${SYS_DEV_CPU} ]] && exit 3

    #----------------------------------------------
    # Host
    cd ${SYS_DEV_CPU}

    # CPU topology: one "<cpu> <core_siblings_list>" pair per line
    cfgCORE=$(for cpuCore in $(ls -d cpu[0-9]*); do
        echo -n "${cpuCore##*cpu} "
        cat ${cpuCore}/topology/core_siblings_list
    done)

    # Number of sockets
    CoreList=$(echo "${cfgCORE}" | cut -d" " -f2 | sort)
    SocketList=$(echo "${cfgCORE}" | cut -d" " -f2 | sort | uniq)
    nSocket=$(echo "${SocketList}" | wc -l)

    i=1
    firstr=0
    for socket in ${SocketList}; do
        [[ ${firstr} == 0 ]] && oldsocket=${socket} && firstr=1
        if [[ ${socket} != ${oldsocket} ]]; then
            i=$((${i} + 1))
            echo "${socket} != ${oldsocket}"
        fi
        tabsocket[${i}]=$(echo "${cfgCORE}" | awk -v Sock=${socket} 'BEGIN { first=0 } {
            if ($2 ~ Sock) {
                if (first!=1) {
                    str=$1
                    first=1
                } else {
                    str=str","$1
                }
            }
        } END { print str }')
        oldsocket=${socket}
    done

    #----------------------------------------------
    # Virtual machines

    cd ${VM_DEF_DIR}
    # total number of allocated vCPUs (FS splits on the XML angle brackets,
    # so $3 is the count inside the <vcpu> element)
    TOTAL_vCPU=$(cat * | awk 'BEGIN { CPUn=0; FS = "[<>]" } /cpu/ { CPUn=CPUn+$3 } END { print CPUn }')

    # maximum number of vCPUs per socket
    MID_vCPU=$((${TOTAL_vCPU} / ${nSocket}))

    curCore=1
    for myVM in $(ls); do
        vmCPU="${vmCPU} ${myVM%%.*}"
        nvmCPU=$((${nvmCPU} + $(cat ${myVM} | awk 'BEGIN { FS = "[<>]" } /cpu/ { print $3 }')))
        if [[ ${nvmCPU} -ge ${MID_vCPU} ]]; then
            cfgVMCPU[${curCore}]=${vmCPU}
            debugVMCPU[${curCore}]=${nvmCPU}
            curCore=$((${curCore} + 1))
            nvmCPU=0
            vmCPU=""
        fi
    done
    cfgVMCPU[${curCore}]=${vmCPU}
    debugVMCPU[${curCore}]=${nvmCPU}

    #----------------------------------------------
    # Configuration

    for j in $(seq 1 ${nSocket}); do
        echo "On core no. ${j}: ${cfgVMCPU[${j}]} for a total of ${debugVMCPU[${j}]} vCPUs"
        for vm in ${cfgVMCPU[${j}]}; do
            # number of vCPUs for this VM
            vmncpu=$(awk 'BEGIN { CPUn=0; FS = "[<>]" } /cpu/ { CPUn=CPUn+$3 } END { print CPUn }' ${VM_DEF_DIR}/${vm}.xml)
            # bind each of the VM's vCPUs to the same socket
            for vmcpu in $(seq 1 ${vmncpu}); do
                # bind the vCPU to the socket
                virsh vcpupin ${vm} $((${vmcpu} - 1 )) ${tabsocket[$j]}
                # bind the KVM process to the same processor/memory node
                cset set -s /libvirt/qemu/${vm} -c ${tabsocket[$j]} -m $(( $j - 1 ))
            done
        done
    done

  6. Shank

    Hi, thanks for sharing. Can you share the cpuset patch you made, with a short explanation? I had issues making cset work with libvirt as well: it wouldn’t recognize the directory structure when libvirt created its subsets first. I had to turn off cpuset cgroup support in libvirt and use it only with cpuset.

  7. Poil

    I was using the cpuset binary in v1.0, but I had to fix it by adding a prefix to all cset paths.

    In v2.0 (http://own-you.com/tmp/cpu_pinning.sh) I don’t use the cpuset binary anymore.

    My cpuset is mounted at /cpuset
    My libvirt subsets are at /cpusets/libvirt/qemu/${vm}/cpuset.cpus

  8. Shankhadeep Shome

    Ah OK, I just saw the script. Thanks for the contribution 🙂

  9. Poil

    I’ve uploaded v2.10.
    Changelog: use only started VMs, and only execute if there is a diff since the last run.

    I run it every 15 minutes on my KVM hosts.

  10. Mark

    @Poil: Thanks for this script!

    Very useful, I modified it to use virsh dumpxml instead of reading the configs from /etc/libvirt/qemu/*.xml as this seemed the best way to integrate it with OpenNebula.

    I also changed the line that mounts /cpusets to be slightly more modern. Also, plain 'mount -t cpuset' always returned nothing for me; apparently this worked for you?

    [[ -z $(mount|grep '/cpusets') ]] && mount -t cgroup none /cpusets/ -o cpuset

  11. Poil

    Hi,

    Yes, on Ubuntu "mount -t cpuset" works, but I think we don’t need it; the cgroup already exists ("cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,relatime,cpuset)").
    So I think we can remove the cpuset dependency and use that path. How is it on SUSE?

    I’ve also added a check at the beginning:
    ## Sub: check that a cpuset directory exists for this VM
    for vm in ${CUR_VM}; do
        if [[ ! -d "/cpusets/libvirt/qemu/${vm}" ]]; then
            echo "Error: cpuset for VM ${vm} not found"
            rm -f ${LIB_DIR}/cpupinning.cfg
            exit 1
        fi
    done

