Bringing the best of NFS and LVM together in OpenNebula

I am an engineer working on the European project BonFIRE, whose goal is to:

Give researchers access to large-scale virtualised compute, storage and networking resources with the necessary control and monitoring services for detailed experimentation of their systems and applications.

In more technical terms, it provides access to a set of testbeds behind a common API, as well as a set of tools to monitor the cloud infrastructure at both the VM and hypervisor levels, enabling experimenters to diagnose cross-experiment effects on the infrastructure.

Some of these testbeds are running OpenNebula (3.6 with some patches for BonFIRE). As each testbed has its own administration domain, the setup differs between sites. The way we use OpenNebula is:

  • we have some default images, used for almost all VMs
  • default images are not updated often
  • users can save their own images, but this does not happen often

The update of the BonFIRE software stack to the latest version was accompanied, at our laboratory, by an upgrade of our testbed. After some study and calculations, we bought this hardware:

  • 1 server with 2+8 disks (RAID 1 for the system, RAID 5 on 8 SAS 10k 600GB disks), 6 cores, 48GB of RAM and 2 quad-port Gbps network cards
  • 4 servers with 2 drives, 2 × 6 cores (24 threads in total), 64GB of RAM and 2 Gbps ports

We previously had 8 small worker nodes (4GB of RAM) that were configured to use LVM with a snapshot-based cache feature to improve OpenNebula performance.

With this new hardware, the snapshot feature cannot be used anymore, as it performs disastrously once more than a few snapshots share the same source LV.
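
For context, the old snapshot-based cache worked roughly like the sketch below: the source image is written once to a cache LV, and each VM disk is a cheap copy-on-write snapshot of it. VG/LV names and image paths are illustrative, not our actual configuration.

    # One-time: populate the cache LV with the source image
    dd if=/srv/images/debian-squeeze.img of=/dev/vg0/cache-debian bs=1M

    # Per VM: the VM disk is a copy-on-write snapshot of the cache LV
    lvcreate --snapshot --name one-42-disk0 --size 5G /dev/vg0/cache-debian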

On the other hand, everyone reading this blog probably knows that hosting dozens of VM volumes on shared (NFS) storage requires a really strong backend with really good performance to reach acceptable results.

The idea of our setup is to bring NFS and local storage to work together, providing :

  • better management of image copies over the network (ssh has a huge performance impact)
  • good VM performance, as each image is copied from NFS to local storage before being used

Just before explaining the whole setup, here is some raw performance data (yes, I know, this is not a really relevant benchmark; it is just to give some idea of the hardware capacity).

  • Server has write performance (dd if=/dev/zero of=/dev/vg/somelv conv=fdatasync) > 700MB/s and read > 1GB/s
  • Workers have their 2 disks added as 2 PVs with striped LVs; performance is > 300MB/s for synchronous writes and ~400MB/s for reads (the commands used are sketched below)
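
The write command is the one quoted above; the read side can be measured with a similar dd, provided the page cache is dropped first so the disks are actually hit. Block sizes and counts below are illustrative.

    # Sequential write to an LV, forcing data to be flushed before dd exits
    dd if=/dev/zero of=/dev/vg/somelv bs=1M count=4096 conv=fdatasync

    # Sequential read, after dropping the page cache
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/vg/somelv of=/dev/null bs=1M count=4096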

As we use virtualisation a lot to host services, the ONE frontend is itself installed in a virtual machine. Our previous install was on CentOS 5, but to get a more recent kernel and virtio drivers I installed it on CentOS 6.

To reduce network hops and improve disk performance, this VM is hosted on the disk server (it is almost the only VM there).

The default setup (ssh + LVM) did not perform well, mostly due to the cost of ssh encryption. I then switched to netcat, which was much better (almost at the maximum Gb link speed) but has at least 2 drawbacks (a sketch of this approach follows the list):

  • it does not manage caching efficiently (no cache on the client, only FS cache on the server, so almost no cache benefit between 2 copies)
  • it needs a netcat listener to be set up on the worker for each copy
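
For the record, the netcat-based copy was along these lines; the port, paths and LV names are illustrative, and the exact flags depend on the netcat variant installed.

    # On the worker: listen and write the incoming stream straight into the target LV
    nc -l -p 5000 | dd of=/dev/vg0/one-42-disk0 bs=1M

    # On the frontend: stream the image without paying the ssh encryption cost
    dd if=/var/lib/one/images/debian-squeeze.img bs=1M | nc worker01 5000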

So I finally set up NFS. To avoid extra I/O between the server, the network and the VM, I put the NFS server on the hypervisor itself and mounted the share on the OpenNebula frontend. The advantage of NFS is that it handles caching on both client and server pretty well (for static content at least). That way, we have a good solution to:

  • copy images when they are created on the frontend (they are simply written to the NFS share, which is mounted synchronously from the hypervisor)
  • copy images from NFS to the workers (a dd from the NFS mount to a local LV), which may benefit from the client cache when the same image is copied many times (remember, we mostly use the same set of 5 or 6 source images); an example export/mount configuration is shown below
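
A minimal export/mount configuration for this layout could look like the following; the export path, network and mount point are assumptions, not our exact setup.

    # /etc/exports on the hypervisor hosting the frontend VM
    /srv/one-datastore  192.168.10.0/24(rw,sync,no_root_squash)

    # Mount on the frontend and on each worker node
    mount -t nfs -o rw,hard,intr nfsserver:/srv/one-datastore /var/lib/one/datastores/1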

So, assuming we have a good copy solution, one remaining bottleneck is the network. To avoid, or at least limit, it, I aggregated Gb links between the NFS server and our switch (and between each worker node and the switch) to get 4Gbps of capacity between the NFS server and the switch. Moreover, since the load-balancing algorithm decides which link carries a given flow (a single flow cannot be spread over more than one link), I computed the IP addresses of the worker nodes so that each one uses a given, unique link. This does not mean that link will not be used for other traffic or other workers, but it ensures that when copying to the 4 nodes at the same time the network is optimally used: one 1Gb link per flow. The bonding configuration is sketched below.
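
The aggregation is the standard Linux bonding driver; the sketch below is a CentOS 6-style configuration in which the mode and hash policy are assumptions that depend on what the switch supports.

    # /etc/modprobe.d/bonding.conf
    alias bond0 bonding
    options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer2+3

    # With a layer2+3 policy, the outgoing slave is chosen roughly as
    #   ((src MAC XOR dst MAC) XOR (src IP XOR dst IP)) modulo number_of_slaves
    # so worker IP addresses can be picked so that each of the 4 workers hashes
    # onto a different slave of the NFS server's 4-link bond.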

Also, in order to reduce useless network transfers, I updated the TM drivers to (see the sketch after this list):

  • handle the copy from NFS to LV
  • avoid ssh/scp when possible and just do a cp on the NFS share (for example when saving a VM, generating the context, etc.)
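
A heavily simplified sketch of the modified clone action is shown below. The argument handling, LV naming and sudo setup are assumptions for illustration; the real OpenNebula 3.6 TM scripts carry more plumbing. The point is that only control commands go over ssh, while the image data itself is read from the locally mounted NFS share and written straight into an LV.

    #!/bin/bash
    # clone <frontend:source_image> <worker:destination_disk>  (simplified)
    SRC_PATH=${1#*:}                  # e.g. /var/lib/one/datastores/1/abc123
    DST_HOST=${2%%:*}
    DST_PATH=${2#*:}                  # e.g. /var/lib/one/datastores/0/42/disk.0

    VG=vg0
    LV=one-$(basename "$(dirname "$DST_PATH")")-$(basename "$DST_PATH")
    SIZE_MB=$(du -m "$SRC_PATH" | cut -f1)

    # Create the LV on the worker, fill it from the NFS mount (same path on the
    # worker as on the frontend), and link it where OpenNebula expects the disk.
    ssh "$DST_HOST" "sudo lvcreate -L ${SIZE_MB}M -n $LV $VG && \
                     sudo dd if=$SRC_PATH of=/dev/$VG/$LV bs=1M && \
                     ln -sf /dev/$VG/$LV $DST_PATH"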

We first wished to mount the NFS share read-only on the workers, but that would require an scp when saving a VM, which is not optimal. So a few things are still written from the workers to NFS:

  • the deployment file (as it is sent over ssh to a 'cat > file')
  • saved images

This way, we:

  • have an efficient copy of images to the workers (no ssh tunnelling)
  • may see significant improvements thanks to the NFS cache
  • do not suffer from concurrent write access to NFS, because VMs boot from a local copy

Some quick benchmarks to finish this post:

  • From first submission (no cache, NFS share just mounted) to 100 VMs running (ssh-able): < 4 min
  • From first submission (no cache, NFS share just mounted) to 200 VMs running (ssh-able): > 8 min

When deploying a large number of VMs simultaneously, OpenNebula reveals some bottlenecks: the monitoring of already-running VMs slows down the deployment of new ones. In more detail, the monitoring threads may interfere with the deployment of new VMs because OpenNebula enforces a single VMM task per host at a time, since in general hypervisors do not robustly support multiple concurrent operations.

In our particular 200-VM deployment we noticed this effect: the deployment of new VMs was slowed down by the monitoring of already-running VMs. We ran the test with the default monitoring interval, but to mitigate this issue OpenNebula offers the possibility to adjust the monitoring and scheduling intervals, as well as to tune the number of VMs dispatched per cycle (an example configuration is sketched below).
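
For reference, in OpenNebula 3.x these knobs live in oned.conf and sched.conf; the values below are illustrative examples, not recommendations.

    # /etc/one/oned.conf -- how often hosts and running VMs are polled
    HOST_MONITORING_INTERVAL = 600
    VM_POLLING_INTERVAL      = 600

    # /etc/one/sched.conf -- scheduling period and dispatch limits per cycle
    SCHED_INTERVAL = 30
    MAX_VM         = 300
    MAX_DISPATCH   = 30
    MAX_HOST       = 10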

Additionally, there is an ongoing effort to optimize the monitoring strategy to remove this effect, namely to: (i) move VM monitoring to the information system, preventing the overlap with VM control operations, and (ii) obtain the information for all running VMs in a single operation, as currently implemented when using Ganglia (see tracker 1739 for more details).