High Availability Part I


Unplanned outages are very expensive, and the cost of downtime is rising as almost all mission-critical systems become fully automated. Industries such as banking and finance, telecommunications, transportation, and airline reservations have automated their end-to-end processes, so the stakes are high when building the applications that run them.

To keep these systems highly available, the applications should be designed and implemented with high availability requirements in mind from the start.

If you are building a new application, choose a suitable high availability solution early and integrate high availability considerations into your current development methodology.

The first part of this article defines many high availability terms.

Today you can take advantage of on-premises options, cloud options, and various hybrid options.

Many other combinations of cloud and hosted options are in use these days, including infrastructure as a service (IaaS), fully hosted infrastructure (hardware you rent), and colocation (servers housed in someone else’s racks). These solutions are still not foolproof: a component within any of these stacks may fail and render the application unavailable.

For example, an application may be running in the cloud, but if a user cannot reach it over the network, it is essentially “down.”
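One way to make this concrete is to treat the stack as a chain in which every layer must be up at the same time. The sketch below is illustrative only: the layer names and availability figures are hypothetical, and it assumes the layers fail independently, which real systems only approximate.

```python
from math import prod


def end_to_end_availability(layer_availabilities):
    """Availability of a stack whose layers must all be up at once.

    Assumes layers fail independently, so the fractions multiply.
    """
    return prod(layer_availabilities)


# Hypothetical figures, for illustration only.
layers = {
    "application": 0.999,
    "database": 0.999,
    "network_path": 0.995,
}

overall = end_to_end_availability(layers.values())
print(f"end-to-end availability: {overall:.4%}")
```

Under these assumptions the result is always lower than the weakest layer, which is why an application that is healthy in the cloud can still be "down" from the user's point of view.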

With the move toward more cloud-based or hybrid application solutions, the focus of your high availability efforts may shift as well. On top of that, human error runs through all of these components. If you reduce or eliminate human errors (bringing a database (DB) offline accidentally, deleting a table’s data in error, and so on), you greatly increase your system’s availability.

Continuous application availability (or high availability) involves three major variables:

Uptime—the time during which your application is up, running, and available to the end users.

Planned downtime—the time during which IT makes one or more of the system stacks unavailable for planned maintenance, upgrades, etc.

Unplanned downtime—the time during which the system is unavailable due to failure—human error, hardware failure, software failure, or natural disaster (earthquake, tornado, and so on).
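These three variables are commonly combined into a single availability figure: uptime divided by total time. The sketch below assumes that planned downtime counts against availability (some service agreements exclude it), and the hour values are hypothetical.

```python
def availability(uptime_hours, planned_downtime_hours, unplanned_downtime_hours):
    """Fraction of total time the application was available to end users."""
    total = uptime_hours + planned_downtime_hours + unplanned_downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime_hours / total


HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

# Hypothetical yearly figures, for illustration only.
planned = 20.0
unplanned = 4.0
uptime = HOURS_PER_YEAR - planned - unplanned

pct = availability(uptime, planned, unplanned)
print(f"availability: {pct:.3%}")
```

Keeping planned and unplanned downtime as separate inputs makes it easy to see which of the two is consuming most of the downtime budget.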

As you may know from your own experience, planned downtime contributes the biggest share of unavailability on most systems. It involves things such as hardware upgrades, OS upgrades, and the application of service packs to the DB, OS, or applications. However, there is a steady trend toward hardware and software platforms that allow this element to be minimized (and, often, completely eliminated). For instance, many vendors offer systems whose hardware components, such as CPUs, disk drives, and even memory, are “hot swappable.” Engineered systems such as Oracle Exadata Machine and Oracle Database Appliance are typical examples, but the price of these systems tends to be high.

Any downtime, planned or unplanned, is factored into your overall requirements of what you need for a high availability system. You can “design in” uptime to your overall system stack and application architectures by using basic distributed data techniques, basic backup and recovery practices, and basic application clustering techniques, as well as by leveraging hardware solutions that are almost completely resilient to failures.

Unplanned downtime usually takes the form of memory failures, disk drive failures, database corruption (both logical and physical), data integrity breakdowns, virus outbreaks, application errors, OS failures, network failures, natural disasters, and plain and simple human error. There are basically two types of unplanned downtime:

Downtime that is “recoverable” by normal recovery mechanisms—this includes downtime caused by things like swapping in a new hard drive to replace a failed one and then bringing the system back up.

Downtime that is “not recoverable” and that makes your system completely unavailable and unable to be recovered locally—this includes things such as a natural disaster or any other unplanned downtime that affects hardware.

In addition, a good disaster recovery plan is paramount to any company’s critical application implementation and should be part of your high availability planning.

If you simply apply a standard high availability technique to all your applications, you will probably sacrifice something that may turn out to be equally important (such as performance or recovery time). So, be very cautious with a blanket approach to high availability.

You can read the second article of the series, High Availability Part II.
