One of OpenNebula’s main features is its low resource footprint, which allows OpenNebula clouds to grow very large without a large increase in hardware requirements. The OpenNebula development team works continuously on efficiency and performance, and the latest release, OpenNebula 5.8 “Edge”, includes several improvements in this area. This blog post describes the scalability testing performed to define the scale limits of a single OpenNebula instance (single zone). The tests, along with recommendations for tuning your deployment, are described in the new Scalability Testing and Tuning guide in the OpenNebula documentation.
Scalability for OpenNebula can be limited on the server side, in terms of the maximum number of nodes and Virtual Machines (VMs) in a single zone, and on the node side, in terms of the maximum number of VMs a single node can handle. In the first case, OpenNebula’s core defines the scale limit; in the second, it is the monitoring daemon (collectd) client. A set of tests has been designed to address both cases. The general recommendation is to have a single instance manage no more than 2,500 servers and 10,000 VMs, with an API load of no more than 30 req/s. Better performance and higher scalability can be achieved by tuning other components such as the DB, using better hardware, or adding a proxy server. In any case, to grow your cloud beyond these limits, you can scale horizontally by adding new OpenNebula zones within a federated deployment. Currently, the largest OpenNebula deployment consists of 16 data centers and 300,000 cores.
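As a quick illustration of the recommendation above, the following minimal sketch checks a planned zone against the three recommended limits. The `ZoneLoad` class and `exceeded_limits` function are hypothetical names introduced here for illustration; they are not part of OpenNebula.

```python
from dataclasses import dataclass

# Recommended single-zone limits quoted in this post.
MAX_HOSTS = 2_500
MAX_VMS = 10_000
MAX_API_REQ_PER_S = 30


@dataclass
class ZoneLoad:
    """Hypothetical description of the load planned for one zone."""
    hosts: int
    vms: int
    api_req_per_s: float


def exceeded_limits(load: ZoneLoad) -> list:
    """Return the names of the recommended limits that this load exceeds."""
    checks = [
        ("hosts", load.hosts, MAX_HOSTS),
        ("vms", load.vms, MAX_VMS),
        ("api_req_per_s", load.api_req_per_s, MAX_API_REQ_PER_S),
    ]
    return [name for name, value, limit in checks if value > limit]


# Example: a zone planned with 3,000 hosts exceeds only the host limit.
print(exceeded_limits(ZoneLoad(hosts=3_000, vms=9_000, api_req_per_s=25)))
```

A non-empty result suggests considering the options mentioned above: tuning, better hardware, or splitting the cloud into federated zones.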
The tests ran on a Packet t1.small.x86 bare-metal cloud instance. OpenNebula was used with its default configuration, without any optimization or extra tuning. The hardware and software specifications are as follows:
|CPU model:||Intel(R) Atom(TM) CPU C2550 @ 2.40GHz, 4 cores, no HT|
|RAM:||8GB, DDR3, 1600 MT/s, single channel|
|Database:||MariaDB v10.1 with default configurations|
|Hypervisor:||Libvirt (4.0), QEMU (2.11), LXD (3.03)|
Front-end (oned core) Testing
This is the main OpenNebula service, which orchestrates all the pools in the cloud (VMs, hosts, virtual networks, users, groups, etc.).
A single OpenNebula zone was configured for this test with the following parameters:
|Number of hosts:||2,500|
|Number of VMs:||10,000|
|Average VM template size:||7KBytes|
Note: Although the hosts and VMs used were dummies, each one is stored in the DB as an entry identical to that of a real host or VM with a template size of 7 KBytes. For this reason, results should be the same as in a real scenario with similar parameters.
The four most common API calls were used to stress the core simultaneously, in approximately the same ratio seen in real deployments. Three total API loads were tested: 10, 20 and 30 calls per second. Under these conditions, with hosts monitored at a rate of 20 hosts/second in a pool of 2,500 hosts (that is, a monitoring period of 125 seconds for each host), the response times in seconds of the oned process for the most common XML-RPC calls are shown below:
|Response Time (seconds)|
|API Call – ratio:||API Load: 10 req/s||API Load: 20 req/s||API Load: 30 req/s|
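The 125-second monitoring period quoted for this test follows directly from the pool size and the monitoring rate. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
def monitoring_period_s(hosts: int, hosts_per_second: float) -> float:
    """Seconds needed to cycle through every host once at the given rate."""
    if hosts_per_second <= 0:
        raise ValueError("hosts_per_second must be positive")
    return hosts / hosts_per_second


# Values from the test setup: 2,500 hosts monitored at 20 hosts/second.
print(monitoring_period_s(2_500, 20))  # 125.0
```

The same calculation shows how the period would change with a different pool size or rate. For example, halving the rate to 10 hosts/second would double the period to 250 seconds.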
Host (monitoring client) Testing
This test stresses the monitoring probes in charge of querying the state, consumption, possible crashes, etc. of both physical hypervisors and virtual machines.
For this test, virtual instances were deployed incrementally in batches of 20. Each time 20 new virtual instances had launched successfully, and before launching the next 20, the monitoring client was executed to measure the time needed to monitor every virtual instance. This process was repeated until the node ran out of allocated resources, which happened at 250 virtual instances, when OpenNebula’s scheduler could no longer deploy more instances. Two monitoring drivers were tested: KVM and LXD. These are the settings for each KVM and LXD instance deployed:
|KVM VMs||None (empty disk)||32MB||0.1|
|LXD containers||Alpine 3.8||32MB||0.1|
Results for each driver are as follows:
|Monitoring Driver||Monitor time per virtual instance|
|KVM IM||0.42 seconds|
|LXD IM||0.1 seconds|
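To put these per-instance figures in context, the sketch below estimates the time to monitor a fully loaded node. It assumes that monitoring time scales linearly with the number of instances, which is an assumption made here for illustration and not something the tests state. The function and dictionary names are hypothetical.

```python
# Per-instance monitoring times measured in the host test (seconds).
MONITOR_TIME_PER_INSTANCE_S = {"kvm": 0.42, "lxd": 0.1}


def node_monitor_time_s(driver: str, instances: int) -> float:
    """Estimated time to monitor all instances on one node.

    Assumes linear scaling with the instance count (illustrative only).
    """
    return MONITOR_TIME_PER_INSTANCE_S[driver] * instances


# 250 instances was the point at which the test node ran out of resources.
for driver in ("kvm", "lxd"):
    print(f"{driver}: {node_monitor_time_s(driver, 250):.1f} s")
```

Under that assumption, a node with 250 instances would take roughly 105 seconds to monitor with the KVM driver and roughly 25 seconds with the LXD driver.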