Liferay Enterprise Search

[LES] Elasticsearch Virtual Machine Considerations

Introduction

Elasticsearch* instances running in VM infrastructures can present unique challenges. This article describes some common issues we may see with different VM software.

VMware

vSphere is a very widely deployed VM infrastructure. It consists of physical hosts running VMware vSphere ESXi, management systems running vCenter, and a large number of ancillary packages that handle tasks such as backups, replicating VMs to redundant datacenters, deploying software on virtual infrastructure, and converting physical machines to virtual instances. It is convenient for managing a datacenter because all storage and networking infrastructure can be managed through the vSphere management package. Admins can allocate storage and networking to the physical nodes and parcel out CPU, memory, disk, and network bandwidth as needed, with tight controls on each allocation.

CPU Issues

The primary CPU problem on a VMware virtual infrastructure is overallocation. The physical host has a fixed number of cores, and each virtual machine is given a certain number of vCPUs. If the total number of vCPUs allocated to the VMs on a host exceeds the number of physical cores, some requests for CPU cycles have to wait while other VMs are using the cores. This is called overcommitting, and it can make CPU response slower than expected.
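As a rough illustration of the arithmetic, the sketch below computes the ratio of allocated vCPUs to physical cores on one host. The function name and the host figures are hypothetical, and a ratio above 1 only indicates that contention is possible, not that it is happening.

```python
def vcpu_overcommit_ratio(physical_cores: int, vcpus_per_vm: list[int]) -> float:
    """Return total allocated vCPUs divided by physical cores on one host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical host: 16 physical cores running five VMs.
ratio = vcpu_overcommit_ratio(16, [8, 4, 4, 2, 2])
print(f"vCPU:core ratio = {ratio:.2f}")  # 20 vCPUs / 16 cores
print("overcommitted" if ratio > 1 else "not overcommitted")
```

In a real environment the inputs would come from the vSphere inventory rather than hard-coded numbers.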

The problem of overcommitted CPU resources gets worse when a VM has many vCPUs. Because the guest OS believes it has many cores to work with, it schedules operations on the assumption that they will run immediately, and work on one vCPU may depend on results from operations that were supposed to be running in parallel on others. The scheduler on the physical host may have to wait until 4 or 8 cores (for example) are free before it can run a VM with 4 or 8 vCPUs, which can take longer than waiting for fewer cores. Two vCPUs can grab clock cycles whenever they are available, whereas four or eight might stall waiting for enough cores to become available. As a result, a VM with more vCPUs allocated to it can sometimes run slower than one with fewer.

vSphere also lets admins allocate clock ticks per VM, so some VMs can be capped at a certain speed while others run as fast as possible.

For Java applications, it's a good idea to have a minimum of two vCPUs. One can handle garbage collection and other JVM administration while the other handles the bulk of the application processing. If only one vCPU is available, GC will tend to block regular operation more than it should.

Memory Issues

Memory can also be overcommitted on a physical host. This happens when the memory allocated to the VMs running on the host exceeds the total amount of memory available on the host. It isn't a big problem until the VMs' operating systems have collectively claimed all available memory on the physical host. At that point, the physical host has to start swapping pages out to disk to satisfy requests for memory. This is invisible to the VMs, except that operations that would normally read from main memory are actually reading from disk without the guest OS's knowledge.

One symptom of this can be high %sys CPU usage in top. If the guest OS itself were swapping, you'd see high %iowait and high swap use; when the swapping happens at the physical host, those indicators are not present. In that case, the admin needs to move some VMs off the physical machine or set a memory reservation for the ES node's VM. A memory reservation guarantees that physical memory will always be available for the VM and will not be swapped out, at the expense of the other VMs on the host.
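The symptom check described above can be sketched from inside a Linux guest. The example below derives %sys and %iowait from two samples of the aggregate `cpu` line in `/proc/stat` and applies the heuristic from this section. The function names, the sample values, and all three thresholds are assumptions chosen for illustration, not tuned recommendations, and a positive result is only a hint to check memory pressure on the physical host.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_percentages(before: str, after: str) -> dict[str, float]:
    """Percent of CPU time per state between two /proc/stat 'cpu' lines."""
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}


def suggests_host_level_swapping(pct: dict[str, float], guest_swap_used_kb: int,
                                 sys_high: float = 30.0, iowait_low: float = 5.0,
                                 swap_low_kb: int = 1024) -> bool:
    """High %sys with low %iowait and little guest swap use (assumed thresholds)."""
    return (pct["system"] >= sys_high
            and pct["iowait"] <= iowait_low
            and guest_swap_used_kb <= swap_low_kb)


# Made-up samples; on a real guest, read the first line of /proc/stat twice,
# a few seconds apart, and take swap use from /proc/meminfo.
sample_1 = "cpu  1000 0 1000 8000 100 0 0 0"
sample_2 = "cpu  1100 0 1500 8380 110 0 0 10"

pct = cpu_percentages(sample_1, sample_2)
print(f"%sys={pct['system']:.1f} %iowait={pct['iowait']:.1f}")
print("possible host-level swapping:", suggests_host_level_swapping(pct, 0))
```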

Disk Utilization

Storage on the physical host is presented to the VM as a virtual disk. Much like memory, if the underlying disks are overutilized, you might see slow disk responses even when the ES node's VM is not using the disk heavily itself.

Many virtual infrastructures use large-scale storage arrays to provide virtual disks to many VMs. If a large number of ES nodes use the same storage array for their data paths, it can quickly become a bottleneck. The storage architecture must be carefully considered when designing a virtualized Elasticsearch infrastructure. Some of these problems can be avoided by making multiple storage arrays available for Elasticsearch to use as a back end, or by using a virtualized storage architecture such as VMware vSAN or another converged infrastructure solution.

Another consideration when designing a storage infrastructure for Elasticsearch in a virtualized environment is the impact of data protection. VM snapshots are not recommended. As the Elasticsearch documentation states, you cannot back up an Elasticsearch cluster by simply snapshotting the data directories of all of its nodes. Elasticsearch may be making changes to the contents of its data directories while it is running, so a snapshot of those directories cannot be expected to capture a consistent picture of their contents. If you try to restore a cluster from such a backup, it may fail and report corruption and/or missing files. Alternatively, it may appear to have succeeded even though it silently lost some of its data. The only reliable way to back up a cluster is to use Elasticsearch's snapshot and restore functionality.
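For orientation, here is a minimal sketch of the two REST calls involved in a basic snapshot: registering a shared-filesystem repository and creating a snapshot in it. The cluster URL, repository name, snapshot name, and filesystem location are all placeholders, and the helper functions are hypothetical wrappers written for this example. The code only builds the requests; the line that would send them is commented out.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster


def es_request(method, path, body=None):
    """Build (but do not send) a JSON request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def register_fs_repository(name, location):
    # An "fs" repository location must also be listed under path.repo
    # in elasticsearch.yml on every node.
    body = {"type": "fs", "settings": {"location": location}}
    return es_request("PUT", f"/_snapshot/{name}", body)


def create_snapshot(repository, snapshot):
    return es_request(
        "PUT", f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true"
    )


# Placeholder names, for illustration only.
requests_to_send = [
    register_fs_repository("liferay_backup", "/mnt/es_backups"),
    create_snapshot("liferay_backup", "snapshot_1"),
]
for req in requests_to_send:
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to send to a real cluster
```

A secured cluster would additionally need authentication and TLS settings, which are omitted here.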

Network Utilization

Network overutilization on the physical host is similar to overutilization of disk: network operations on the VM will be slow for no reason that is apparent from inside the VM.

Migration with vMotion

Live migration of VMs with vMotion will pause the VM for a period of time. While the VM is paused, any running Elasticsearch nodes will not respond to requests, including health checks. If a node does not respond to health checks for long enough, it is deemed unhealthy and removed from the cluster. To avoid any disruption caused by a temporarily unresponsive node, we recommend following the rolling restart procedure and performing the VM migration while the node is stopped.
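The recommendation above can be sketched as an ordered checklist that mixes Elasticsearch REST calls with manual steps. The outline follows the general shape of a rolling restart (disable replica allocation, flush, stop the node, do the work, restart, re-enable allocation, wait for green); the helper names, node name, and timeout are hypothetical, and the exact steps should be checked against the rolling restart documentation for your Elasticsearch version.

```python
import json


def api(method, path, body=None):
    """Describe one Elasticsearch REST call in the plan."""
    return {"kind": "api", "method": method, "path": path, "body": body}


def manual(action):
    """Describe one step an operator performs outside Elasticsearch."""
    return {"kind": "manual", "action": action}


def migration_plan(node_name):
    """Ordered steps for migrating one node's VM while the node is stopped."""
    allocation = "cluster.routing.allocation.enable"
    return [
        api("PUT", "/_cluster/settings", {"persistent": {allocation: "primaries"}}),
        api("POST", "/_flush"),
        manual(f"stop Elasticsearch on {node_name}"),
        manual(f"migrate the VM hosting {node_name} with vMotion"),
        manual(f"start Elasticsearch on {node_name} and wait for it to rejoin"),
        # Setting the value to null restores the default allocation behaviour.
        api("PUT", "/_cluster/settings", {"persistent": {allocation: None}}),
        api("GET", "/_cluster/health?wait_for_status=green&timeout=120s"),
    ]


# "es-node-1" is a placeholder node name.
for number, step in enumerate(migration_plan("es-node-1"), start=1):
    if step["kind"] == "api":
        body = "" if step["body"] is None else " " + json.dumps(step["body"])
        print(f"{number}. {step['method']} {step['path']}{body}")
    else:
        print(f"{number}. (manual) {step['action']}")
```

Repeating the plan one node at a time keeps the rest of the cluster serving requests during each migration.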

* Elastic, Elasticsearch, and X-Pack are trademarks of Elasticsearch BV, registered in the U.S. and in other countries.
