Fault Tolerance is a seldom-used feature that has been available since the days of VMware Infrastructure, the old name for what we today know as VMware vSphere: the software suite comprising vCenter Server and the ESXi hypervisor, along with the various APIs, tools, and clients that come included with it. If you’ve been through my post on High Availability, you’ll know that Fault Tolerance is a feature intrinsic to an HA-enabled cluster.
A brief overview

VMware Fault Tolerance (FT) is a process by which a virtual machine, called the primary vm, replicates changes to a secondary vm created on a host other than the one hosting the primary. Think mirroring. Should the primary vm’s host fail, the secondary immediately takes over with zero interruption and downtime. Similarly, if the secondary vm’s host goes offline, a new secondary vm is created on another host, assuming one is available. For this reason, it’s best to have a 3-node cluster as a minimum, even though FT works just as well on a 2-node cluster.
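The failover behaviour just described can be pictured as a small state model. The sketch below is purely illustrative (the `FtPair` class and host names are invented, not part of any VMware SDK):

```python
# Illustrative model of FT failover behaviour - not a VMware API.
class FtPair:
    def __init__(self, hosts, primary_host, secondary_host):
        self.hosts = set(hosts)          # all surviving hosts in the cluster
        self.primary = primary_host      # host running the primary vm
        self.secondary = secondary_host  # host running the secondary vm

    def host_failed(self, host):
        """Simulate a host failure and return the new (primary, secondary) hosts."""
        self.hosts.discard(host)
        if host == self.primary:
            # The secondary takes over instantly, becoming the new primary...
            self.primary = self.secondary
            self.secondary = self._spawn_secondary()
        elif host == self.secondary:
            # ...while a lost secondary is simply re-created elsewhere.
            self.secondary = self._spawn_secondary()
        return self.primary, self.secondary

    def _spawn_secondary(self):
        # Pick any surviving host other than the primary's, if one exists.
        candidates = self.hosts - {self.primary}
        return next(iter(candidates), None)  # None = vm left unprotected

pair = FtPair(["esx1", "esx2", "esx3"], "esx1", "esx2")
pair.host_failed("esx1")  # secondary on esx2 takes over; esx3 hosts the new secondary
```

With only two hosts, losing either one leaves no candidate for a new secondary and the vm runs unprotected, which is why a 3-node cluster is the sensible minimum.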
The one benefit that immediately stands out is that Fault Tolerance raises the High Availability bar a notch higher, bolstering business continuity by mitigating data loss and application downtime, something the HA cluster feature alone cannot deliver. FT is also useful in scenarios where expensive clustering solutions are impractical to implement, from both technical and financial perspectives.
Occasionally, you may come across the term On-Demand Fault Tolerance. Since FT is a somewhat resource-intensive process, you may decide to employ scripting, to mention just one approach, to enable and disable FT on a schedule. On-Demand FT is used to protect business-critical vms and the corresponding business processes, such as payroll applications and payroll runs, against data loss and service interruption. That said, keep in mind that FT protects at the host level. It does not provide any application-level protection, meaning that manual intervention will still be required if a vm experiences OS and/or application failures.
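One way to picture On-Demand FT is as a simple schedule check that a script (PowerCLI, pyvmomi, or similar) would run before calling the vSphere API to turn FT on or off. The protection window below is a made-up example, as is the function name:

```python
from datetime import time

# Hypothetical protection window: FT on during the evening payroll run only.
PROTECTION_WINDOWS = [(time(18, 0), time(23, 0))]  # 18:00 - 23:00

def ft_should_be_on(now, windows=PROTECTION_WINDOWS):
    """Return True if the current time falls inside a protection window."""
    return any(start <= now <= end for start, end in windows)

# A scheduled task would compare this desired state with the vm's actual
# FT state and enable or disable FT through the vSphere API accordingly.
ft_should_be_on(time(19, 30))  # True - payroll run in progress, FT on
ft_should_be_on(time(9, 0))    # False - FT off, resources freed up
```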
New features and support

I used the word seldom in the opening line in reference to how FT has been generally overlooked, prior to vSphere 6 at least, mainly due to its lack of support for symmetric multiprocessing and for anything sporting more than 1GB of RAM. In fact, before vSphere 6.0 came along, FT could only be enabled on vms with a single vCPU and 1GB of RAM or less which, needless to say, turned out to be a show stopper given the generous compute resource requirements of today’s operating systems and applications.
So, without further ado, the goodies vSphere 6.0 brings to FT are as follows. These are the ones which in my opinion make it a viable inclusion to any business continuity plan.
- Support for symmetric multiprocessor vms
- Max. 2 vCPUs (vSphere Standard and Enterprise licenses)
- Max. 4 vCPUs (vSphere Enterprise Plus licenses)
- Support for all types of vm disk provisioning
- Thick Provision Lazy Zeroed
- Thick Provision Eager Zeroed
- Thin Provision
- FT vms can now be backed up using VADP disk-only snapshots
- Support for vms with up to 64GB of RAM and vmdk sizes of up to 2TB
The following vm features and devices are still not supported in conjunction with FT:

- CD-ROM or floppy virtual devices backed by a physical or remote device
- USB, Sound devices and 3D enabled Video devices
- Hot-plugging devices, I/O filters, Serial or parallel ports and NIC pass-through
- N_Port ID Virtualization (NPIV)
- Virtual Machine Communication Interface (VMCI)
- Virtual EFI (Extensible Firmware Interface) firmware
- Physical Raw Disk mappings (RDM)
- VM snapshots (remove them before enabling FT on a VM)
- Linked Clones
- Storage vMotion (moving to an alternate datastore)
- Storage-based policy management
- Virtual SAN
- Virtual Volume Datastores
- VM Component Protection (see my HA post)
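To make the limits above concrete, here is a rough pre-flight check mirroring the vSphere 6.0 restrictions just listed. The `VmConfig` structure and device names are invented for illustration; a real check would read these values through the vSphere API:

```python
from dataclasses import dataclass, field

@dataclass
class VmConfig:  # hypothetical stand-in for real vm configuration data
    vcpus: int
    ram_gb: int
    has_snapshots: bool = False
    devices: list = field(default_factory=list)

# Illustrative labels for a few of the unsupported devices listed above.
UNSUPPORTED_DEVICES = {"usb", "sound", "3d-video", "physical-cdrom",
                       "serial-port", "parallel-port", "physical-rdm"}

def ft_precheck(vm, max_vcpus=4, max_ram_gb=64):
    """Return a list of reasons the vm cannot be FT protected (empty = eligible)."""
    problems = []
    if vm.vcpus > max_vcpus:
        problems.append(f"too many vCPUs ({vm.vcpus} > {max_vcpus})")
    if vm.ram_gb > max_ram_gb:
        problems.append(f"too much RAM ({vm.ram_gb}GB > {max_ram_gb}GB)")
    if vm.has_snapshots:
        problems.append("snapshots present - remove them first")
    for dev in vm.devices:
        if dev in UNSUPPORTED_DEVICES:
            problems.append(f"unsupported device: {dev}")
    return problems

ft_precheck(VmConfig(vcpus=2, ram_gb=8))                      # [] - eligible
ft_precheck(VmConfig(vcpus=8, ram_gb=8, has_snapshots=True))  # two problems
```

The `max_vcpus` default of 4 assumes an Enterprise Plus license; drop it to 2 for Standard or Enterprise.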
Legacy FT

Before I move on, I need to highlight that VMware now uses the term “Legacy FT” to refer to FT implementations pre-dating vSphere 6.0. If required, you can still enable “Legacy FT” by adding vm.uselegacyft to the list of advanced configuration parameters.
Cluster and Host basic requirements

One requirement in particular, namely a 10Gbit dedicated network, is a bit stringent in that you’d generally find this kind of infrastructure deployed in large enterprises, so FT may be a hard sell for SMEs and the like. Other than that, just ensure that the host CPUs in your cluster support hardware MMU virtualization and are vMotion ready. For Intel CPUs, anything from Sandy Bridge upwards is supported; for AMD, Bulldozer is the minimum supported microarchitecture.
On the networking side, you’ll need to configure at least one VMkernel adapter for Fault Tolerance logging. At a minimum, each host should have a Gbit nic dedicated to vMotion and another to FT logging.
Note: For DRS to work with FT, you will need to enable EVC. Apart from ensuring a consistent CPU profile, this allows DRS to optimize the initial placement of FT protected vms.
Turning FT on and off

There are a number of requirements to fulfill before you can turn on FT for a vm. You’ll need to ensure that the vm resides on shared storage (iSCSI, NFS, etc.) and that it has no unsupported devices (see the list above). Also note that the disk provisioning type changes to “Thick Provision Eager Zeroed” when FT is turned on; this may take a while for large vmdks.
FT can be turned on whether the vm is powered on or not. A number of validation checks are also run, and these differ slightly depending on the vm’s power state.
To turn on FT, just right-click on the vm and select “Turn On Fault Tolerance”. Use the same procedure to turn it off.
If the vm passes the validation checks, you should find that the secondary vm is created on a host other than the primary’s. In this case I’ve enabled FT on a vm called Win2008-C which is hosted on 192.168.11.63. The secondary vm Win2008-c (secondary), as you can see, is created on 192.168.11.64.

Figure 6 – vmdk scrubbing – changing provisioning type to thick eager zeroed
Now select the vm for which you turned on FT. Under the “Summary” tab you should see a sub-window titled “Fault Tolerance”. A number of FT metrics for the specific vm are displayed here. Of particular interest is the “vLockstep interval” which simply put is a measure of how far behind the secondary vm is from the primary in terms of replication changes.
Typically, the value should be less than 500ms. In the example shown below the interval stands at 0.023s, or 23ms, which is good. Another metric is “Log Bandwidth”, the network capacity currently in use to replicate changes from the primary to the secondary vm. This can quickly add up when you have multiple FT protected vms (max. 4 vms or 8 vCPUs per host), hence the 10Gbit dedicated network requirement for FT logging.
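If you wanted to monitor this yourself, a trivial check against the 500ms guideline might look like the following. The function name is invented and the threshold comes straight from the rule of thumb above:

```python
def lockstep_health(interval_ms, warn_ms=500):
    """Classify the vLockstep interval: how far the secondary lags the primary."""
    if interval_ms < warn_ms:
        return "ok"
    return "lagging - check FT logging bandwidth"

lockstep_health(23)   # 'ok' - the 0.023s interval seen in the example
lockstep_health(750)  # flagged - secondary is falling too far behind
```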
Figure 9 – Fault Tolerance details for an FT protected virtual machine
You can simulate a host failure by selecting “Test Failover” from the vm’s context menu. Similarly, a simulated Secondary restart is carried out using the same menu. In both instances, the vm is momentarily left unprotected until the test completes.
The next video shows you how to test a failover and perform a secondary restart. It doesn’t need much explaining but for completeness sake I’ll give a brief run through. As per the previous video, I’m using a 2-node nested environment with FT already enabled on the vm called Win2008-C.
- I first show where HA (and DRS) are enabled so that FT, in turn, can be enabled.
- Next we verify that the primary and secondary vms are hosted on separate servers.
- We then initiate a failover test while pinging the FT protected vm’s ip address. Normally you would experience a single “ping” loss but given the low resource environment I’m using, you’ll notice a loss of 3 packets. After a brief while, the secondary vm powers up and takes over from where the primary failed.
- Next, we simulate a secondary restart. This time only a single packet is lost. Notice that the vm is briefly left “unprotected”, after which it returns to being fully protected as expected.
Conclusion

Undoubtedly, vSphere 6.0 and the improvements it brought to Fault Tolerance make it a very valid tool for ensuring business continuity and, why not, fewer calls in the middle of the night. The 10Gbit dedicated network may prove too much of a requirement for small businesses. That said, you could always settle for cheaper 10Gbit gear, but then again, if you’re using FT to protect mission critical machines, I think it’s best to go the extra mile and work with trusted providers.
That’s it for FT. For a complete list of requirements and functionality, do have a look at these:
- vSphere Availability
- VMware vSphere 6 Fault Tolerance
- Providing Fault Tolerance for Virtual Machines