Always Open for Business

Follow these best practices to help ensure high availability for your mission-critical applications.

Imagine what your day would be like if you couldn't access your e-mail server or your Web server, or log in to a domain. Now consider a business like the stock trading site E*Trade, and how many millions of dollars it would lose if its customers couldn't access their portfolios. It's not a pleasant thought.

As technology became pervasive, the cry arose to have it available 24 hours a day, seven days a week, 365 days a year. This need for "high availability" isn't just for trading stocks or powerful corporations. It's a necessity for all organizations, large and small.

So just what is high availability? High availability refers to systems that have redundant components to allow for uninterrupted service in the event of hardware or software failure. Without drilling down into every possible fault, error or outage, we can say that unplanned downtime stems from any of the following events:

  • hardware or software faults, like a blown power supply, component failure or OS crash
  • data or media errors, ranging from disk failures to data corruption
  • human error, including any number of user or administrative errors or even sabotage
  • external factors, like electromagnetic interference
  • natural disasters, like hurricanes and earthquakes, that lead to loss of power or property

People may want 24x7x365 availability, but that's often not a reasonable expectation. Instead, you offer a high probability of availability. Uptime is usually expressed as a percentage of the year and described by its number of nines. For example, 99.9 percent ("three nines") allows for unplanned downtime of 43.8 minutes per month, or 8.76 hours per year (based on the 525,600 minutes in a year). You can raise that percentage to 99.99 percent or 99.999 percent with more hardware and software built for high availability. So, one major constraint on these solutions is clearly the budget.
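The arithmetic behind the nines is easy to check for yourself. The short Python sketch below is only an illustration (the function name is made up for this example, and it assumes a 365-day year), but it shows how each additional nine shrinks the downtime you're allowed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Unplanned downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    per_year = allowed_downtime_minutes(target)
    print(f"{target}%: {per_year / 60:.2f} hours/year, "
          f"{per_year / 12:.1f} minutes/month")
```

For 99.9 percent, this reproduces the 8.76 hours per year quoted above. At 99.99 percent the allowance drops to roughly 52.6 minutes per year, and at 99.999 percent to roughly 5.3 minutes.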

Microsoft builds high-availability features into its server products at no extra cost, although what you get depends on the product and edition you've installed. For example, Windows Server 2003 Enterprise Edition includes clustering services for high availability, whereas the Standard Edition doesn't. There's also a cornucopia of high-availability solutions currently on the market (see "Always There for You"). Deciding which way to go for your organization is but one tiny piece of the puzzle when discussing best practices for high availability.

Best Practices for Exchange 2007 High Availability

Here are a few pointers if you plan to use the integrated high-availability solutions within Exchange 2007.

Whenever possible, use identical hardware for the disks or servers on which you plan to implement a high-availability solution. You should also try to ensure they have similar software installed and configured.

With Local Continuous Replication (LCR), you can replicate a storage group to another disk location. In the event of a crash or data corruption, you can manually change the location where Exchange looks for that storage group to the functioning disk. Don't wait until the disk crashes to learn this part of the process.

Beyond using two disks so that a storage group on the first disk can fail over to the second, place the transaction logs and database files on separate disks as well. Place transaction logs on RAID 10 LUNs. To maximize your high-availability options with LCR, place controller cards on separate PCI buses and ensure active and passive storage LUNs are on different arrays.

With Cluster Continuous Replication (CCR), many of the same considerations apply. However, in this case you'll hold the passive copy of the data on a passive node. In the event of a crash, there's an automatic failover. You shouldn't become too reliant on the process, though. Make sure you know how to confirm the failover has occurred properly. If necessary, you can place the passive node in a completely different physical site, which provides site resiliency.

For both LCR and CCR, use Volume Shadow Copy Service (VSS) backups of the passive copy. Hopefully you'll never have to use the backups, but it's good to know they're there. By running VSS backups against the passive copy, you limit the effect on the active system and the working environment.

Single Copy Clusters (SCC) are a shared-storage failover cluster solution. They're a carry-over from previous Exchange versions and provide redundancy for the server, but not for the storage. The storage is usually an expensive SAN solution that has its own redundancy in place.

Standby Continuous Replication (SCR) is a new feature found in Exchange Server 2007 SP1. This option uses the same log shipping and replay functionality as LCR. However, instead of replicating to another disk, you can replicate to another server. This solution doesn't require any clustering services, and you can use it for both standalone and clustered mailbox servers.

Client Access Servers
To provide high availability for a Client Access server (CAS), you can add more CAS servers to the Active Directory site. Usually, however, you'll have connections from the Internet going to a dedicated CAS, so simply adding another won't provide high availability by itself. In that case, use Network Load Balancing (NLB) services in Windows Server 2003 to present a single IP address to the Internet and spread incoming connections across both servers.

Hub Transport Servers
You can support high availability for Hub Transport servers by simply adding another Hub Transport server to the Active Directory site. In this case, load balancing between the servers in the site happens automatically, so you won't need an additional load-balancing solution.

Edge Transport Servers
Edge Transport servers are not part of the Active Directory domain for your Exchange organization, so configuration and recipient information is stored using Active Directory Application Mode (ADAM). You use an Edge Subscription between the Hub Transport server and the Edge Transport server to populate the data within ADAM.

Keep in mind that you can't simply clone an Edge Transport server or use the same subscription you used for the first one. When setting up a second Edge server, you need to create its own Edge Subscription. In order to load balance your Edge servers, you need to ensure they're listed as smart hosts in a send connector. Then they'll be balanced for outgoing mail.

However, to balance their use for incoming mail when you have a single incoming connection, you can use round-robin DNS or specify multiple mail exchange (MX) records on the public DNS servers. If you have multiple incoming connections, you can simply configure the MX record for each connection to point to a different Edge Transport server.
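To see why multiple MX records spread incoming mail, it helps to look at how a sending server typically chooses among them: it prefers the lowest preference value and picks at random among records that tie. The Python sketch below simulates only that selection step. The host names and preference values are hypothetical, and the sketch doesn't model the fallback to a higher-preference host when the preferred ones are unreachable.

```python
import random
from collections import Counter

# Hypothetical MX records for a domain: (preference, host).
MX_RECORDS = [
    (10, "edge1.example.com"),
    (10, "edge2.example.com"),
    (20, "backup.example.com"),
]


def pick_mx(records, rng=random):
    """Choose a delivery target: lowest preference wins, ties are broken at random."""
    lowest = min(pref for pref, _ in records)
    candidates = [host for pref, host in records if pref == lowest]
    return rng.choice(candidates)


# Simulate 10,000 independent senders with a fixed seed for repeatability.
rng = random.Random(42)
tally = Counter(pick_mx(MX_RECORDS, rng) for _ in range(10_000))
print(tally)
```

With two records at the same preference, the two Edge servers each receive roughly half of the simulated deliveries, while the higher-numbered record receives none as long as the preferred hosts are the ones being chosen.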

Unified Messaging
Establish multiple Unified Messaging (UM) servers within the same dial plan to create a higher level of availability for that dial plan. You can also configure your VoIP gateways to round-robin calls made to the UM server.


Planning for High Availability
Don't buy anything yet. If you don't already have a solution in place, do your research before you go out and blow thousands of dollars on a SAN. If you already have one, then you've narrowed the path you'll need to take.

In planning your high-availability solution, you need to assess your current business processes and requirements. Which servers and services do you consider essential? To what degree do you need to provide availability? At what point is it acceptable to relinquish high availability and move on to disaster recovery? For example, if one drive fails, can you keep going? If two drives fail, are you then stuck restoring from backup? Is that reasonable to you, or do you want more?

You have to determine where high availability ends and disaster recovery begins for your organization. So, you need to know your starting point: what you currently have in place and the budget you have for change. You also need to consider your destination: your hopes for redundant hardware/software solutions.

In considering your needs, you'll have to identify and categorize your systems. Are some, like your domain controllers, already structured for greater availability by simply having more than one? Are some running applications that could be down for short periods of time for maintenance or minor service disruptions? Are the systems essential to the day-to-day operations of your business, like a back-end SQL Server or an Exchange server? Obviously, these systems have no room for failure, so your solutions must match the needs.

Then consider the various commercial solutions that are available for your environment. You may decide that the solutions provided by Microsoft encompass your needs (see sidebar, "Best Practices for Exchange 2007 High Availability"). You may choose to evaluate one of the many third-party solutions, such as VMware HA or Veritas Cluster Server, or products from SteelEye Technology Inc., Double-Take Software Inc., Neverfail Ltd. and so on.

Serious Choices
When considering your options, don't get buried in the buzzwords, jargon and techno-speak that seem to saturate the technology marketing world. Instead, as previously mentioned, decide how much availability you need to provide, and concern yourself with the possibility of more than one failure.

You should be prepared to handle any one failure. Handling two is more difficult. This can include hardware and application failures. Ordinarily you don't assume two pieces will go at the same time, but it happens all too often.

Another consideration is the failure rate of any product or products. This rate seems to be highest either right after you install something or toward the end of its normal functioning lifetime. You put in something new, and the next day it goes. In between, anything can still go when you least expect it.

Even if you can't purchase an expensive SAN solution, you might consider hardware that allows for a greater amount of "hot swapping." This lets you remove and replace parts of the system that burn out while the system is still running.

With hot swapping, the system has to detect the replaced component without requiring a reboot. Combining this type of feature with systems that have redundancy in certain primary failure points like power supplies and cooling fans will ensure a greater amount of availability.
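A rough calculation shows why redundancy at primary failure points pays off. The Python sketch below is a simplified model, not a sizing tool: the 99 percent figure is hypothetical, and the formulas assume component failures are independent, which real-world double failures often contradict.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies, assuming independent failures
    and that any one working copy keeps the system up."""
    return 1 - (1 - component_availability) ** copies


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every part must be working."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


power_supply = 0.99  # hypothetical availability of a single power supply

print(f"one supply:  {parallel_availability(power_supply, 1):.4f}")
print(f"two supplies: {parallel_availability(power_supply, 2):.4f}")
print(f"two single points of failure in a row: "
      f"{series_availability(power_supply, power_supply):.4f}")
```

Under these assumptions, a second power supply takes that one component from 99 percent to 99.99 percent, while chaining two non-redundant 99 percent components drops the combined figure to about 98 percent. That's the argument for removing single points of failure first.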

The Human Side to High Availability
You can only automate so much. Even if you pay for the best in hardware and software, you and your team still need to prepare for the worst. Here are a few general best practices you should implement after you've made your decisions:

System Monitoring: Some of you who have experienced a crash could have avoided much heartache if you had simply monitored your servers more closely and ensured things were running smoothly. You can catch increases in little faults and minor errors before the availability of your servers comes into question. You can also transfer services to another system while you perform routine maintenance on a server, so the service stays available while you work.

Keep Items in Stock: If you worry about certain parts of your systems burning out, keep spares of that hardware on hand. It would be a shame to choose systems with hot-swappable components and then not have those components ready when you need them.

Training: Whatever the size of your department, you need to prepare for every likely downtime scenario. To do this properly, you'll have to document procedures for a system or disk failure: how to fail over and how to ensure the failover runs smoothly. Run a few fire drills to ensure your team is ready to handle whatever comes its way. Don't forget to prepare for the worst-case scenario, when you need to switch gears from high availability to disaster recovery.
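The monitoring advice above, catching a rise in minor errors before it becomes an outage, can be reduced to a simple trend check. The Python sketch below is one hypothetical way to do it: the function name, the three-day window and the doubling threshold are all assumptions chosen for illustration, and the daily counts would come from whatever event logs or monitoring tools you already use.

```python
from statistics import mean


def fault_rate_rising(daily_error_counts, recent_days=3, factor=2.0):
    """Return True when the recent average error count is at least
    `factor` times the earlier baseline average."""
    if len(daily_error_counts) <= recent_days:
        return False  # not enough history to form a baseline
    baseline = mean(daily_error_counts[:-recent_days])
    recent = mean(daily_error_counts[-recent_days:])
    # Treat any baseline below one error per day as one, so a quiet
    # server doesn't raise an alert over a single stray error.
    return recent >= factor * max(baseline, 1.0)


quiet_week = [2, 3, 1, 2, 2, 3, 2]       # made-up daily error counts
creeping_week = [2, 3, 1, 2, 6, 9, 14]   # errors climbing over the last 3 days

print(fault_rate_rising(quiet_week))
print(fault_rate_rising(creeping_week))
```

The first week stays under the threshold and the second trips it. The point isn't the specific numbers but the habit: compare recent behavior against a baseline, and investigate before users notice.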

No Single Point of Failure
Hopefully you now have a lot to consider if you're going to implement a high availability solution within your environment. As you can see, there's a great deal that goes into the planning and preparing stages with the hope that you never have to be in the middle of a mid-afternoon meltdown.

If you're planning to put high-availability solutions in place, and if you've followed these best practice suggestions and prepared well, a failure should look to your users like any ordinary day at the office. You'll be behind the scenes making sure they never know how close they came to going home early.

