I recently asked Tanya Reilly, Site Reliability Engineer at Google, to share her thoughts on how to make better disaster recovery plans. Tanya is presenting a session titled "Have you tried turning it off and turning it on again?" at the O’Reilly Velocity Conference, taking place Oct. 1-4 in New York.
1. What are the most common mistakes people make when planning their backup systems strategy?
The classic line is “you don’t need a backup strategy, you need a restore strategy.” If you have backups, but you haven’t tested restoring them, you don’t really have backups. Testing doesn’t just mean knowing you can get the data back; it means knowing how to put it back into the database, how to handle incremental changes, how to reinstall the whole thing if you need to. It means being sure that your recovery path doesn’t rely on some system that could be lost at the same time as the data.
But testing restores is tedious. It’s the sort of thing that people will cut corners on if they’re busy. It’s worth taking the time to make it as simple and painless and automated as possible; never rely on human willpower for anything! At the same time, you have to be sure that the people involved know what to do, so it’s good to plan regular wide-scale disaster tests. Recovery exercises are a great way to find out that the documentation for the process is missing or out of date, or that you don’t have enough resources (disk, network, etc.) to transfer and reinsert the data.
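Neither the question nor the answer prescribes particular tooling, but as a rough illustration of the kind of automated restore check described above, here is a minimal sketch in Python. It assumes a PostgreSQL dump, a throwaway scratch database the script may drop and recreate, and an EXPECTED_TABLES list you would tailor to your own schema; the paths and names are hypothetical, and the sketch is a starting point rather than a complete disaster drill.

```python
#!/usr/bin/env python3
"""Minimal sketch of an automated restore drill.

Assumptions (not from the interview): a PostgreSQL custom-format dump,
a disposable scratch database, and dropdb/createdb/pg_restore/psql on PATH.
"""
import subprocess
import sys

DUMP_FILE = "/backups/latest.dump"      # hypothetical path to the most recent backup
SCRATCH_DB = "restore_drill"            # throwaway database, safe to drop and recreate
EXPECTED_TABLES = ["users", "orders"]   # example tables you'd expect after a restore


def run(cmd):
    """Run a command, failing the drill loudly if it exits non-zero."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def main():
    # Recreate the scratch database so the restore starts from a clean slate.
    run(["dropdb", "--if-exists", SCRATCH_DB])
    run(["createdb", SCRATCH_DB])

    # Restore the dump into the scratch database, never into production.
    run(["pg_restore", "--dbname", SCRATCH_DB, "--no-owner", DUMP_FILE])

    # Sanity-check the result: every expected table should exist and be non-empty.
    for table in EXPECTED_TABLES:
        out = subprocess.run(
            ["psql", "--dbname", SCRATCH_DB, "--tuples-only", "--no-align",
             "--command", f"SELECT count(*) FROM {table};"],
            check=True, capture_output=True, text=True,
        )
        count = int(out.stdout.strip())
        if count == 0:
            sys.exit(f"restore drill failed: table {table!r} is empty")
        print(f"{table}: {count} rows restored")

    print("restore drill passed")


if __name__ == "__main__":
    main()
```

Run on a schedule from cron or CI rather than by hand, a check like this makes "restores get tested" a property of the system instead of a chore that competes with busier work.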
2. What are the most common challenges in creating a disaster recovery (DR) plan?
I think a lot of DR is an afterthought: “We have this great system, and our business relies on it … I guess we should do DR for it?” And by that point, the system is extremely complex, full of interdependencies and hard to duplicate.