FT – latoga labs

One of vSphere 4.0’s most understated features, in my opinion, is Fault Tolerance. I truly see this as a capability of vSphere that goes overlooked by most people (especially those who are focused on the cost of a vSphere deployment…as Fault Tolerance is included in vSphere Advanced and higher packages). Not too long ago, companies paid millions of dollars to achieve a lock step fault tolerant solution. Today, with vSphere, you can enable Fault Tolerance on a VM with just the click of your mouse. I want to clarify the key points on Fault Tolerance that most of my clients seem to ask me about. This won’t be a deep technical discussion on Fault Tolerance; that has been covered by others already, and you can find those discussions in the links I have included at the end of this article.

I still find it amazing how often the spectrum of availability solutions gets confused by IT administrators and executives alike. So, first a brief refresher on this spectrum:

Load Balancing: Multiple running copies of an application, failure may affect end user. Load Balancing via a network load balancer is the lowest common denominator for availability. Actually, these solutions are typically used to achieve scale out of applications that can’t scale out on their own. Load balancers allow you to run multiple copies of the same stateless (typically REST based) application. The nature of the client’s connection to the application determines what availability impact a load balancer has. If a failure occurs between a client’s connections, the client should not be affected by the failure. However, if the failure occurs during a client’s connection, the client most likely will be affected in some way, possibly losing their work (REST, stateless short transactions are less affected; non-REST, long connections are more affected). By definition, load balancing will increase the utilization on multiple servers…that’s what it’s designed to do. I spent five years crafting load balancing solutions for clients back in the late ’90s…and yet I still come across confusion here from time to time.
High Availability: Single running copy of an application, failure will affect end users. High Availability simply means that when a failure occurs, the highly available application will start running on another server. In vSphere, this means the environment will restart the VM on another ESX host to ensure a minimal amount of downtime for users of the application. Typically, the user will be affected by the failure.
Fault Tolerance: Multiple (typically two) running copies of an application, failure will not affect end user. Fault Tolerance means that you are running two copies of the application in lock step: whatever instruction gets executed on the primary also gets executed on the secondary. This doubles the resource utilization in your environment, but ensures that a failure has no impact on the end user. When a failure occurs, the IP address of the primary system moves to the secondary system and the user continues doing whatever they were doing, because the secondary system was processing the same instruction as the primary when it failed. By this definition, Fault Tolerance isn’t ideal for every application due to the higher cost of resource utilization: if you’re running at 80% utilization of your VM prior to Fault Tolerance, you will be running two VMs at 80% when Fault Tolerance is turned on.
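
The resource cost in that last point is simple arithmetic, and a quick back-of-the-envelope sketch can help when sizing a cluster. The snippet below is only an illustration: the `VM` class, the `cpu_demand` helper, and the example VMs are hypothetical names made up for this post, not anything from the vSphere API, and the model assumes the secondary consumes exactly what the primary does (it ignores any overhead from the FT traffic itself).

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical description of a VM for capacity planning."""
    name: str
    vcpus: int
    cpu_utilization: float  # fraction of the allocated vCPUs in use, 0.0 to 1.0
    ft_enabled: bool = False


def cpu_demand(vm: VM) -> float:
    """Return the vCPU-equivalents this VM consumes across the cluster.

    Assumption: with Fault Tolerance on, a second copy runs in lock step
    and uses the same CPU as the primary, so demand simply doubles.
    """
    copies = 2 if vm.ft_enabled else 1
    return copies * vm.vcpus * vm.cpu_utilization


# Example inventory: one FT-protected mail server, one unprotected web server.
vms = [
    VM("mail", vcpus=1, cpu_utilization=0.80, ft_enabled=True),
    VM("web", vcpus=1, cpu_utilization=0.50),
]

total_demand = sum(cpu_demand(vm) for vm in vms)
for vm in vms:
    print(f"{vm.name}: {cpu_demand(vm):.2f} vCPU-equivalents")
print(f"cluster total: {total_demand:.2f} vCPU-equivalents")
```

The useful part is the comparison: flipping `ft_enabled` on a VM doubles only that VM's line, which is why Fault Tolerance makes sense for a handful of key applications rather than the whole environment.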

What makes vSphere’s Fault Tolerance feature a diamond in the rough is that this zero downtime solution is baked into the virtualization infrastructure that you may already own. For those key applications where zero downtime is valuable, it’s there to be turned on with minimal additional cost. There are some hardware requirements that need to be kept in mind, like an additional network for the FT messages to be passed across (two networks if you want a 100% fault tolerant system) and the right type of processor. But these are similar to the requirements for most comparable solutions.

What makes Fault Tolerance a bit rough is the fact that it only supports Virtual Machines with one vCPU. If your application needs multiple vCPUs, you’re out of luck. At least for today. Considering Fault Tolerance is a 1.0 feature, this limitation is understandable. It’s even easier to appreciate when you consider what is happening under the covers to keep the instructions in sync across two VMs. Watch the following video from VMware Principal Engineer Doug Scale for the details:

Now imagine the complexity of trying to track, synchronize, and replay the processing instructions for multiple processors. Going back to the basic Computer Science knowledge from my first year in college, it’s obvious to me that supporting multiple processors is orders of magnitude more challenging than supporting one processor. So you gotta start somewhere!
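
To make the single-processor case a little more concrete, here is a toy sketch of the general record-and-replay idea. It is not how VMware implements Fault Tolerance; `ToyMachine`, `run_primary`, and `run_secondary` are hypothetical names, and the "machine" is just a deterministic function of its inputs. The primary logs every non-deterministic input it sees, and the secondary replays that log in the same order to arrive at the same state.

```python
import random
from collections import deque


class ToyMachine:
    """A deterministic state machine: same inputs in the same order give the same state."""

    def __init__(self) -> None:
        self.state = 0

    def step(self, external_input: int) -> None:
        # Any deterministic update works here; this one just mixes the input in.
        self.state = (self.state * 31 + external_input) % 1_000_003


def run_primary(machine: ToyMachine, steps: int, rng: random.Random, log: deque) -> None:
    """Run the primary, recording each non-deterministic input before applying it."""
    for _ in range(steps):
        value = rng.randrange(256)  # stands in for an unpredictable event (I/O, timer, ...)
        log.append(value)           # record it so the secondary can replay it
        machine.step(value)


def run_secondary(machine: ToyMachine, log: deque) -> None:
    """Replay the recorded inputs, in order, on the secondary."""
    while log:
        machine.step(log.popleft())


primary, secondary = ToyMachine(), ToyMachine()
log: deque = deque()

run_primary(primary, steps=1000, rng=random.Random(), log=log)
run_secondary(secondary, log)

in_sync = primary.state == secondary.state
print("secondary in sync with primary:", in_sync)
```

With a single stream of instructions, recording the inputs in order is enough for the replay to match. Add a second processor sharing the same memory and the order in which the two interleave becomes one more thing that would have to be captured and reproduced, which gives a feel for why the one vCPU limit exists.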

Taking all this into consideration, there are multiple applications that my clients are looking at as candidates for Fault Tolerance, from mail servers and messaging servers to custom applications where downtime needs to be avoided. Before, these used to be apps where “downtime needs to be avoided at any cost”, but with vSphere Fault Tolerance it has become more like “avoided at a little cost”.

What apps do you have that you’re considering Fault Tolerance for? Tell me about them by leaving a comment.

Additional Resources on vSphere Fault Tolerance

Training Lab simulator for vSphere Fault Tolerance (FT) (via VMWARE INFO) – “See” FT in action thanks to this simulation created by the VMware Training team.
Check ESX CPU And VM OS Requirements for vSphere Fault Tolerance (via VM /ETC) – Make sure you meet the hardware requirements for FT…or use the new SiteSurvey utility from VMware, which checks for Fault Tolerance compatibility.
VMware Fault Tolerance at your home-lab (via Eric Sloof) – after reading the last two, find out how to set this up for testing at home…
vSphere Availability Guide (pdf) – for the full skinny on Fault Tolerance and HA in vSphere.
How does Fault Tolerance prevent a split brain scenario? (via The VMguy) – Understand what happens when failure does occur.
VMware engineers caution IT pros: Use Fault Tolerance sparingly (via SearchServerVirtualization) – a bit of a misleading title, but reiterates my comments above: FT is different from HA, has specific use cases, and does use additional resources.
Don’t forget to search the VMware Knowledge Base for the latest articles on Fault Tolerance