Before reading this article, you should read the first part: “High Availability Part I”.
High availability, the expectation that a system will operate continuously for a significant span of time with minimal downtime, is a goal for most organizations. For example, with 8,760 hours in a year, 99% availability allows over 7 hours of downtime a month, or about 88 hours over the course of the year. In turn, 99.9% availability (“three nines”) adds up to over eight hours of unplanned downtime a year, while 99.99% (“four nines”) translates into under an hour a year.
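The arithmetic behind these figures can be reproduced with a short Python sketch. It assumes the same 8,760-hour year used above; the function name `downtime_hours_per_year` is illustrative only.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours, the figure used in the text


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, that an availability target allows."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for two to five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    yearly = downtime_hours_per_year(target)
    print(
        f"{target}% -> {yearly:.2f} h/year, "
        f"{yearly / 12:.2f} h/month, "
        f"{yearly * 60:.1f} min/year"
    )
```

Each extra nine divides the allowed downtime by ten, which is why five nines leaves only a few minutes a year.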
How Many Nines Do You Need?
That is the question to ask. The answer will depend on your requirements. While some organizations set their sights on 100% availability, most systems don’t need to hit such heights. Nuclear reactors, missile defense systems, stock exchanges, transportation systems, and automated teller machines have a high cost of failure and need high reliability, but web applications may not need as much.
Smaller businesses often don’t need to invest in high levels of fault tolerance, which can require immense hardware and engineering resources. Every time you add a nine, costs rise steeply.
A high availability solution is worthwhile, but it comes at a significant cost. You must ask yourself whether the decision is justified from a financial point of view.
Two important questions:
- What is the dollar value per hour of downtime, and how does that cost compare to the cost of offsetting the problem?
- How much availability does your system realistically require? Do you need five nines? Three nines? Will 99% suffice?
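One way to work through the first question is to compare the yearly downtime cost avoided by a higher availability target with the extra yearly spend needed to reach it. The sketch below is a rough model, not a costing method: the dollar figures are made-up inputs, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760


def annual_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a given availability level."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours * cost_per_hour


def upgrade_is_justified(
    current_percent: float,
    target_percent: float,
    cost_per_hour: float,
    extra_annual_spend: float,
) -> bool:
    """True if the downtime cost avoided exceeds the extra yearly spend."""
    savings = annual_downtime_cost(current_percent, cost_per_hour) - annual_downtime_cost(
        target_percent, cost_per_hour
    )
    return savings > extra_annual_spend


# Assumed inputs: $1,000 per hour of downtime, $50,000 per year of extra spend.
print(upgrade_is_justified(99.0, 99.9, 1_000, 50_000))   # two nines -> three nines
print(upgrade_is_justified(99.9, 99.99, 1_000, 50_000))  # three nines -> four nines
```

With these assumed numbers, the first step avoids about $78,840 of downtime a year and pays for itself, while the second avoids only about $7,884 and does not.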
What Can Go Wrong? Everything.
Murphy’s Law says whatever can go wrong will go wrong. For starters, it’s critical to identify all parts of your system—a single machine, one data center, one network in one location, a single cloud provider—that can fail and, as previously mentioned, put the right redundancies in place.
Single-machine failures are typically inexpensive to protect against and quick to recover from. To increase availability, you can deploy across multiple Availability Zones, which are distinct locations, each with its own data centers and servers. Launching instances in separate Availability Zones can protect applications from a single-location failure. Above Availability Zones are regions, which are located in different geographic areas. With multiple regions, a failure in one region usually doesn’t impact the availability of the system.
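The benefit of redundancy can be estimated with a simple probability model. The sketch below assumes that each zone fails independently and that the system stays up as long as at least one zone is up. Real zones can share failure causes, so treat the result as an optimistic upper bound. The function name is illustrative.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, each with availability
    `single` (a fraction such as 0.99), assuming independent failures."""
    if not 0 <= single <= 1:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1 - (1 - single) ** copies


for zones in (1, 2, 3):
    print(f"{zones} zone(s): {combined_availability(0.99, zones):.6f}")
```

Under these assumptions, two zones at 99% each combine to roughly 99.99%, which is why adding a second Availability Zone is usually the first step.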
Taking AWS as an example, the strategy for a mid-sized company might look something like the list below. You can achieve the same results with Microsoft Azure or Oracle Cloud Infrastructure.
- Use Aurora, SQL Server, or Oracle RDBMS to manage database availability.
- Replicate the database into multiple Availability Zones to increase uptime.
- Deploy applications into multiple Availability Zones.
- Ensure all applications are deployed to at least two nodes.
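The last two items of the checklist above can be verified automatically against a deployment inventory. The sketch below assumes a hypothetical inventory format, a list of `(application, node, zone)` tuples, and does not call any cloud provider API. The application, node, and zone names are invented for illustration.

```python
from collections import defaultdict


def find_redundancy_gaps(placements, min_nodes=2, min_zones=2):
    """Return {application: [problems]} for applications that run on too few
    nodes or in too few Availability Zones."""
    nodes = defaultdict(set)
    zones = defaultdict(set)
    for app, node, zone in placements:
        nodes[app].add(node)
        zones[app].add(zone)

    gaps = {}
    for app in nodes:
        problems = []
        if len(nodes[app]) < min_nodes:
            problems.append(f"only {len(nodes[app])} node(s), expected at least {min_nodes}")
        if len(zones[app]) < min_zones:
            problems.append(f"only {len(zones[app])} zone(s), expected at least {min_zones}")
        if problems:
            gaps[app] = problems
    return gaps


# Hypothetical inventory: (application, node, zone).
placements = [
    ("web", "node-1", "zone-a"),
    ("web", "node-2", "zone-b"),
    ("api", "node-3", "zone-a"),
    ("api", "node-4", "zone-a"),    # two nodes, but both in the same zone
    ("batch", "node-5", "zone-b"),  # a single node in a single zone
]

gaps = find_redundancy_gaps(placements)
for app, problems in gaps.items():
    print(app, "->", "; ".join(problems))
```

In this sample, `web` passes, `api` is flagged for sitting in a single zone, and `batch` is flagged on both counts.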
Start Small, Grow Smart
Companies usually start with a few nodes in one Availability Zone and then grow to two or three Availability Zones. As the cost of downtime increases, companies start to look at more expensive options to increase availability, such as multiple regions and multiple cloud providers.
On-Premises Solutions
For on-premises high availability of mission-critical applications, organizations can rely on tried and tested solutions such as multiple application servers, database scaling, data backup and recovery, clustering, replication, network load balancing, and geographic redundancy.
The next articles will focus on how to deploy these solutions in our on-premises environment.