This article describes several methods to test VMware High Availability (HA) failover in your environment for cluster testing purposes. It assumes that you are working with an HA-enabled cluster in vCenter Server consisting of two ESXi/ESX hosts, where the Management Network is uplinked in a redundant vmnic configuration.
How you test HA failover depends on the version of vSphere deployed in the environment. Select the procedure below that matches your installed version of vSphere.
Procedure 1
For a vSphere 4.x environment where HA is based on AAM and the Management Network has two redundant NICs, you can physically disconnect the patch cables from the physical NICs that uplink the Management Network.
Alternatively, you can issue a command in your switch software to disconnect the corresponding ports. This simulates a host isolation event because vCenter Server can no longer communicate with the host. Each host in the cluster runs the AAM agent, which is designed to monitor the uptime of neighboring hosts in the cluster. If the master host of the cluster detects that the host you have disconnected is isolated, it restarts that host's virtual machines on the surviving hosts in the cluster. Ensure that your HA cluster settings have the appropriate Host Isolation Response setting, because this type of host outage is considered to be a Network Isolation.
Procedure 2
In vSphere 5.x, HA is provided by the Fault Domain Manager (FDM) agent deployed on each host in the HA cluster. FDM uses both Network and Datastore Heartbeats to determine whether a host is available and to determine the type of host failure: a physically failed host or a Network Isolation. The FDM agents on the secondary hosts report uptime information to the FDM agent on the master host. The master host reports its own uptime and that of all secondary hosts to vCenter Server.
For example, consider two ESXi/ESX hosts, each with two vmnics in a redundant NIC team serving Management Network traffic. The hosts share a single datastore. You want the virtual machines to fail over to the surviving host in the cluster.
To prepare the environment for failover simulation:
- Log in to the vCenter Server with the vSphere Client.
- Edit the Cluster Settings.
- Under vSphere HA settings, change the Datastore heartbeat to None. Ensure no datastores are selected from the available list, and select Select only from my preferred datastores.
To disrupt communication between a single host and vCenter Server, you can physically disconnect the patch cables from the physical NICs that uplink the Management Network. Alternatively, you can issue a command in your switch software to disconnect the ports.
When network communication between the host and the master host (or between the master host and vCenter Server, if this host is the master) is disrupted, vCenter Server waits for the timeout period during which it receives no communication from the host it is managing, and then declares the host as Isolated. This causes all of that host's virtual machines to be registered and restarted on the surviving host.
Procedure 3
As mentioned in Procedures 1 and 2, disconnecting the network to forcibly disrupt communication between the master and secondary hosts is one way to simulate HA failover. To simulate a power outage or hardware fault instead, hard power off the host, either physically or by using a remote management application such as KVM, DRAC, iLO, or RAS.
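If the host's management controller accepts IPMI over LAN (an assumption; this article does not say which interface your DRAC or iLO exposes), a hard power-off could be issued with ipmitool. The sketch below only prints the command so that it can be reviewed before it is run. The function name bmc_power_off_cmd and its arguments are hypothetical placeholders.

```shell
#!/bin/sh
# Print (do not run) an ipmitool command that hard powers off a host
# through its management controller.
# Assumption: the controller has IPMI over LAN enabled.
# bmc_host and bmc_user are placeholders for your own values.
bmc_power_off_cmd() {
    bmc_host="$1"
    bmc_user="$2"
    if [ -z "$bmc_host" ] || [ -z "$bmc_user" ]; then
        echo "usage: bmc_power_off_cmd <bmc_host> <bmc_user>" >&2
        return 1
    fi
    # -I lanplus: IPMI v2.0 over LAN
    # -a: prompt for the password instead of passing it on the command line
    echo "ipmitool -I lanplus -H $bmc_host -U $bmc_user -a chassis power off"
}
```

Calling the function with the controller's address and a user name prints the command. Run the printed command only against the host you intend to fail, because chassis power off does not shut down the guest operating systems cleanly.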
Procedure 4
Note: This method may require re-installation of ESXi/ESX if the kernel module is not properly disabled and re-enabled. When you disable the kernel module for the physical NIC, you lose all remote management through the ESXi Service Console, and can manage the host remotely only through KVM, DRAC, iLO, or RAS. Ensure that you have physical access to the host if a remote management application is not available.
Procedure 4 also simulates a network isolation, but it does so by disabling the physical NIC (vmnic) driver module in the VMkernel instead of physically disconnecting a patch cable or interrupting connectivity at the physical switch layer.
First, determine which module the physical NIC uses by following one of these articles, depending on your installed vSphere version:
- For vSphere 5.x: Determining which storage or network driver is actively being used on ESXi host (1034674)
- For vSphere 4.x: Determining Network/Storage firmware and driver version in ESXi/ESX 4.x and 5.x (1027206)
Then disable the module with the command for your installed vSphere version:
- For vSphere 5.x:
esxcli system module set --enabled=false --module=module_name
- For vSphere 4.x:
esxcfg-module -d module_name
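The two steps above (find the driver module, then disable it) can be sketched as a pair of small shell helpers. This is an illustration and not part of the original procedure. The helper names driver_for_nic and module_cmd are hypothetical. The sketch assumes that the driver name is the third column of the esxcfg-nics -l listing, and it uses the --enabled and --module options of esxcli system module set and the -d and -e options of esxcfg-module. module_cmd only prints the command so that you can review it before you cut off your own management access.

```shell
#!/bin/sh
# Read an "esxcfg-nics -l" listing on stdin and print the driver name
# for the given vmnic.
# Assumption: the driver is the third whitespace-separated column.
driver_for_nic() {
    awk -v nic="$1" '$1 == nic { print $3 }'
}

# Print (do not run) the command that disables or re-enables a module.
# usage: module_cmd <4|5> <enable|disable> <module_name>
module_cmd() {
    version="$1"
    action="$2"
    module="$3"
    if [ -z "$module" ]; then
        echo "usage: module_cmd <4|5> <enable|disable> <module_name>" >&2
        return 1
    fi
    case "$version:$action" in
        5:disable) echo "esxcli system module set --enabled=false --module=$module" ;;
        5:enable)  echo "esxcli system module set --enabled=true --module=$module" ;;
        4:disable) echo "esxcfg-module -d $module" ;;
        4:enable)  echo "esxcfg-module -e $module" ;;
        *)
            echo "usage: module_cmd <4|5> <enable|disable> <module_name>" >&2
            return 1
            ;;
    esac
}
```

On a host, the lookup would be used as esxcfg-nics -l | driver_for_nic vmnic0, where vmnic0 is a placeholder for one of your Management Network uplinks. Printing the matching enable command in advance, and keeping it at hand on the remote console, makes it easier to restore the module after the test.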