You are performing your backups and have an infrastructure in place with all of the appropriate redundancies. To complete the disaster recovery scenario, you need to recognize when a disaster has happened and have the tools and processes in place to execute your recovery plan. One of the coolest things about the cloud is that all of this can be automated. You can recover from the loss of Amazon’s U.S. data centers while you sleep.
Monitoring your cloud infrastructure is extremely important. You cannot replace a failing server or execute your disaster recovery plan if you don’t know that there has been a failure. The trick, however, is that your monitoring systems cannot live in either your primary or secondary cloud provider’s infrastructure. They must be independent of your clouds. If you want to enable automated disaster recovery, they also need the ability to manage your EC2 infrastructure from the monitoring site.
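As a rough illustration of the kind of check an independent monitoring site might run, the following Python sketch polls a server and triggers a recovery action only after several consecutive failures. Everything named here is an assumption made for the example: the `watch` and `http_probe` functions, the threshold of three failures, and the recovery callback, which stands in for whatever calls your own recovery plan makes against your cloud provider.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts, and HTTP error statuses all count as failures.
        return False


def watch(
    probe: Callable[[], bool],
    on_failure: Callable[[], None],
    checks: int,
    max_consecutive_failures: int = 3,
    interval_seconds: float = 0.0,
) -> bool:
    """Run up to `checks` probes and call `on_failure` once after
    `max_consecutive_failures` failed probes in a row.

    Returns True if recovery was triggered, False otherwise.
    """
    failures = 0
    for _ in range(checks):
        if probe():
            failures = 0  # a single success resets the count
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                on_failure()
                return True
        time.sleep(interval_seconds)
    return False
```

A monitoring host outside both clouds might call `watch(lambda: http_probe("https://www.example.com/health"), start_recovery, checks=1000, interval_seconds=60)`, where the URL and `start_recovery` are placeholders. Requiring several consecutive failures keeps a single dropped request from launching a full recovery.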
A number of tools, such as enStratus and RightScale, manage your infrastructure for you. Some even automate your disaster recovery processes so you don’t have to write any custom code.
Your primary monitoring objective should be to figure out what is going to fail before it actually fails. The most common problem I have encountered in EC2 is servers that gradually decrease in local file I/O throughput until they become unusable. This problem is something you can easily watch for and fix before users even notice it. ...
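One way to watch for this kind of gradual decline is to fit a trend line to recent throughput measurements and estimate how long the server has before it drops below a usable level. The Python sketch below assumes you already collect throughput samples (for example, in MB/s) at a fixed interval. The function names, the `floor` value, and the use of a simple least-squares slope are illustrative choices, not a prescribed method.

```python
from typing import Optional, Sequence


def throughput_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of the samples, in throughput units per sample interval."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def intervals_until_unusable(samples: Sequence[float], floor: float) -> Optional[float]:
    """Estimate how many sample intervals remain before throughput reaches `floor`.

    Returns None when throughput is flat or improving, and 0.0 when the
    latest sample is already at or below the floor. This is a rough
    projection from the latest sample along the fitted slope.
    """
    slope = throughput_slope(samples)
    if not samples or slope >= 0:
        return None
    latest = samples[-1]
    if latest <= floor:
        return 0.0
    return (latest - floor) / -slope
```

With samples of 100, 90, 80, 70, and 60 and a floor of 20, the slope is -10 per interval, so the estimate is 4 intervals. A monitor could replace the server whenever the estimate falls below some lead time you choose.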