Keeping all of your systems and networks running is one of the most important tasks for a network administrator. In this video, you’ll learn about fault tolerance, redundancy, and high availability.

When you’re working with technology, the question isn’t if you’re going to have a problem, it’s when you’re going to have a problem. You need a plan for when some type of failure occurs. Providing continuous uptime is an important consideration for any part of information technology, so you need to have a fault-tolerant plan in place.

Fault tolerance usually adds complexity, because there are more processes and procedures that you have to follow. It may also add cost as you acquire the extra components needed to make everything fault tolerant. If you’re adding fault tolerance to an individual device, you may be adding storage devices and configuring RAID. Or maybe you’re installing a second power supply in that device.

Or you may implement load balancing across entire server farms so you can have a large-scale fault tolerance in place. You might also include multiple network paths so if one particular device fails, you have a way to communicate through a different path.

We often implement this fault tolerance by using redundancy. We’ll have an additional device either standing by or online. And if the first device fails, we can fail over to the secondary device. This means that you might have separate power supplies within a single server. Or you might build out two completely separate servers: one that is your primary device and another that’s used for redundancy.

A Redundant Array of Independent Disks, or RAID, is a common way to combine multiple drives inside of a device and provide redundancy should any one of those drives fail. You might also include a UPS, an Uninterruptible Power Supply, because sometimes the entire power circuit may go down, and you still need some way to power all of these systems.

You might also want to create a cluster of servers so that if any individual server fails, the other servers will still provide that function. And you might even want to set up your fault tolerance with load balancing where there are always devices online, and the load is distributed throughout all of them.

Here’s an example of fault tolerance. We have an internet provider. And that internet provider is connecting to our firewall. Our firewall is then connecting to our internal router. The router is connecting to our internal switch. And finally, that’s connected to our web server.

But what if we have a problem with that firewall? Perhaps the power supply fails, or the software in that firewall is having a problem. With that single device down, the entire flow between the internet and that web server is affected. Fortunately, we planned a fault-tolerant configuration that will handle a firewall outage, and we have already purchased a spare firewall that’s on standby.

We’ll turn the spare firewall on, make sure it’s up to date, and then we’ll slide it into place. And the network is up and running again. As you’ve seen, fault tolerance and redundancy don’t necessarily mean that you have 100% uptime. When our firewall failed, we had a redundant firewall available. But it took a bit of time to get it up and running so that you could have connectivity again. Many organizations, though, can’t afford to have any downtime. And in those scenarios, you need to have a configuration that is highly available.
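
To put a number on that kind of downtime, here’s a minimal sketch in Python. The 30-day month and the two-hour firewall swap are made-up figures for illustration only, and the function name is just a convenient label.

```python
def availability_percent(period_hours: float, downtime_hours: float) -> float:
    """Return the percentage of the period during which the service was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


# Hypothetical figures: a 30-day month (720 hours) and a 2-hour firewall swap.
month_hours = 30 * 24
print(f"{availability_percent(month_hours, 2):.2f}%")  # prints 99.72%
```

Even a short manual swap keeps the result below 100%, which is the gap a highly available design is meant to close.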

This is often referred to as an HA configuration, for High Availability, where it is always on and always available. This also means that you’re probably going to install multiple devices that will always be running and always working together. You want to be sure that you don’t have any place in the entire path of communication that may be a single point of failure.

As you can imagine, making something highly available usually means that you’re going to be paying additional money. You may need upgraded power supplies or higher-quality components. Or you may be buying multiple devices instead of a single device.

One way to provide high availability with servers is to put them behind a load balancer. With many load balancers, you can configure certain servers to always be active. Those are the ones designated with the green dots. But there may be other servers that are online in a passive role, waiting for a problem to occur. If you do have a server failure, one of these passive servers is going to take the failed server’s place.

So let’s take a scenario where a user is communicating through the network, through our load balancer to server A. As long as server A is up and running, that user can connect to the load balancer and communicate with that web server. The load balancer is always performing health checks on all of these servers. And if a server suddenly becomes unavailable, the load balancer will recognize that and begin using a standby server in its place. That way, if someone does need access to this resource, there will always be a server available for that request.
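
The health check and failover behavior can be sketched in a few lines of Python. This is a simplified simulation rather than how any particular load balancer is implemented. The `Server` and `LoadBalancer` classes, the `healthy` flag, and the server names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    healthy: bool = True  # stands in for the result of a real health probe


@dataclass
class LoadBalancer:
    active: List[Server]
    standby: List[Server] = field(default_factory=list)
    failed: List[Server] = field(default_factory=list)
    _next: int = 0  # round-robin counter

    def health_check(self) -> None:
        """Swap each unhealthy active server for a healthy standby, if one exists."""
        still_active: List[Server] = []
        for server in self.active:
            if server.healthy:
                still_active.append(server)
                continue
            self.failed.append(server)
            replacement = next((s for s in self.standby if s.healthy), None)
            if replacement is not None:
                self.standby.remove(replacement)
                still_active.append(replacement)
        self.active = still_active

    def route(self) -> str:
        """Pick the next active server in round-robin order."""
        if not self.active:
            raise RuntimeError("no servers available")
        server = self.active[self._next % len(self.active)]
        self._next += 1
        return server.name


lb = LoadBalancer(active=[Server("A"), Server("B")], standby=[Server("C")])
lb.active[0].healthy = False          # server A goes down
lb.health_check()                     # standby server C takes its place
print([s.name for s in lb.active])    # prints ['C', 'B']
```

In this sketch the user never has to know that server A failed, because requests keep going to whichever servers are currently in the active list.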

Let’s build out a highly available network based on the configuration we had earlier with our internet provider, our firewall, our router, our switch, and our web server. In the earlier configuration, our firewall had a problem, and we lost connectivity. So one way to provide high availability is to include a separate firewall that can work in conjunction with the original.

We also might want to provide redundancy with our routers and have them up and running all the time, using high-availability protocols to allow traffic to flow through either one or both of these routers if they’re available. We can also provide redundancy and high availability with our switches. And we can add a load balancer to provide high availability to our web servers.

We could even take this further and include a separate internet provider in case one internet provider suddenly is unavailable. And of course, you could continue with this process of building out the high availability with multiple devices until you find exactly the right configuration that makes sense for your business requirements.
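
One way to check a design like this for single points of failure is to remove each device in turn and see whether traffic can still get through. The Python sketch below does that for the two networks described above. The device names and the exact cross-connections in the highly available version are assumptions for illustration, and the load balancer and web servers are treated as a single "web-tier" node to keep the example short.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_links(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn a list of device-to-device links into a two-way adjacency map."""
    links: Dict[str, Set[str]] = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links: Dict[str, Set[str]], start: str, goal: str,
              removed: Optional[str] = None) -> bool:
    """Breadth-first search from start to goal, skipping the removed device."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links: Dict[str, Set[str]],
                             start: str, goal: str) -> List[str]:
    """List every device whose failure alone disconnects start from goal."""
    devices = set(links) - {start, goal}
    return sorted(d for d in devices if not reachable(links, start, goal, removed=d))


original = build_links([
    ("internet", "isp"), ("isp", "firewall"), ("firewall", "router"),
    ("router", "switch"), ("switch", "web-tier"),
])
highly_available = build_links(
    [("internet", "isp-a"), ("internet", "isp-b"),
     ("isp-a", "firewall-a"), ("isp-b", "firewall-b")]
    + [(f, r) for f in ("firewall-a", "firewall-b") for r in ("router-a", "router-b")]
    + [(r, s) for r in ("router-a", "router-b") for s in ("switch-a", "switch-b")]
    + [("switch-a", "web-tier"), ("switch-b", "web-tier")]
)

print(single_points_of_failure(original, "internet", "web-tier"))
# prints ['firewall', 'isp', 'router', 'switch']
print(single_points_of_failure(highly_available, "internet", "web-tier"))
# prints []
```

In the original chain, every device is a single point of failure. In the assumed highly available layout, no single device failure breaks the path.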

If you’re trying to provide high availability for servers, you may want to look at NIC teaming. This is network interface card teaming. It’s often called LBFO, for Load Balancing/Failover. This provides not only aggregation of bandwidth, because you’re using multiple network connections, but also redundant paths. So if one path disappears, you still have a way to communicate out of that server.

From a practical perspective, we’re using multiple interface cards and teaming them together in the operating system. To the operating system, it looks like a single network interface card. But we really have multiple paths out of that server, usually redundant paths, so that if any one of them disappears, we still have connectivity. The network interface cards are constantly communicating with each other, usually across the network. They’re using multicast to perform health checks of all of the other network interface cards in that server. If any one of those network interface cards doesn’t respond to these health checks, it’s taken out of service, and the remaining network interface cards continue to provide connectivity.
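
Here’s a small Python simulation of that behavior. It is not how an operating system’s teaming driver is written, and it doesn’t send any multicast traffic. The `Nic` and `NicTeam` classes, the `responded` dictionary that stands in for health check results, and the flow-based selection are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Nic:
    name: str
    in_service: bool = True


class NicTeam:
    """Presents several NICs as one logical interface."""

    def __init__(self, nics: Iterable[Nic]) -> None:
        self.nics: List[Nic] = list(nics)

    def apply_health_checks(self, responded: Dict[str, bool]) -> None:
        """Take any NIC that did not answer its health check out of service."""
        for nic in self.nics:
            nic.in_service = responded.get(nic.name, False)

    def pick_nic(self, flow_id: int) -> str:
        """Spread flows across the NICs that are still in service."""
        available = [nic for nic in self.nics if nic.in_service]
        if not available:
            raise RuntimeError("no NICs in service")
        return available[flow_id % len(available)].name


team = NicTeam([Nic("nic0"), Nic("nic1")])
print([team.pick_nic(flow) for flow in range(4)])
# prints ['nic0', 'nic1', 'nic0', 'nic1']

team.apply_health_checks({"nic0": True, "nic1": False})  # nic1 stops responding
print([team.pick_nic(flow) for flow in range(4)])
# prints ['nic0', 'nic0', 'nic0', 'nic0']
```

While both cards are healthy, the traffic is spread across them. Once one stops responding, everything continues over the remaining card.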

Here’s a server configuration where multiple clients on the network may be connecting to the server through a series of switches. On the switch that’s connected to the server, you can see there are redundant connections through multiple interface cards. This is commonly called port aggregation because you’re using both of those interfaces as a single aggregated connection.

If you want to add some fault tolerance, you may want to plug those connections into different switches. That way, if one of these switches fails, you will still have a network path to the rest of the devices on the network.