This is a bit of a primer for a totally relevant post, so stick with me.
This morning, I awoke to an email from a colleague that a few of our internal apps were down. On first inspection, it appeared a drive in the server had become unavailable. On deeper inspection…the drive was dead. Where was that backup drive I realized I needed three years ago? Oh. I never bought it. PANIC ENSUES.
We didn’t lose any Sherlock® work or any really important data. But what we did lose was important enough, and the week or so it would take to rebuild the servers on that disk expensive enough, to make me face palm that I had never gotten that backup done.
Your investment in BI is way, way more important than that one crummy drive I didn’t have backed up. I’ve given a presentation a few times on the path from fault tolerance -> high availability -> disaster recovery and the backup and recovery requirements that accompany each of those milestones in your organization’s landscape. It’s even been a part of our BetterBOBJ webinar of late.
Achieving Fault Tolerance
You might have a really simple BI landscape. One node. No web tier. A small population of users. So, why does fault tolerance matter? I was working with a customer earlier this week on this very question, ironically enough. I can’t say I’d ever been challenged with it before. Off the top of my head, my response to the customer as to why that second server for fault tolerance was important was:
“Fault tolerance for your environment is like an insurance policy. You buy insurance for your car, your home, and even your body. Why wouldn’t you invest in a second, even passive node, to achieve fault tolerance for your BI landscape?” (quoted rather loosely, but you get the idea)
Fault tolerance IS important, and it doesn’t have to get the keepers of the budget in a tizzy. Whether it’s achieved through a simple active/passive cluster or a more robust high availability configuration, that second node in your cluster can make the difference between BI or no BI on failure.
Backup and Recovery
Whether you have simple fault tolerance or a highly available cluster, a good backup strategy is vital. I have a lot of love for technologies on the storage and database front that allow for point-in-time recovery. Small enterprises may find this cost prohibitive. So what? Do backup and recovery the old-fashioned way: full backups monthly and daily incremental backups. And recovery tests are really a swell idea.
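For the curious, here’s a minimal sketch of that old-fashioned full-plus-incremental routine, assuming GNU tar (which supports incremental snapshots via `-g`). The paths are hypothetical stand-ins; in real life your backup target lives on a different drive or host than the data, and the restore drill at the end is the "recovery test" part:

```shell
set -eu

SRC=$(mktemp -d)    # stand-in for the data directory you care about
DEST=$(mktemp -d)   # stand-in for the backup target (another drive!)

echo "important report" > "$SRC/report.txt"

# Monthly full backup: a fresh snapshot file makes tar record everything.
tar -czf "$DEST/full.tar.gz" -g "$DEST/snapshot" -C "$SRC" .

# A day later, new work shows up...
echo "new data" > "$SRC/new.txt"

# Daily incremental: reusing the snapshot file makes tar save only changes.
tar -czf "$DEST/incr-day1.tar.gz" -g "$DEST/snapshot" -C "$SRC" .

# Recovery test: restore the full backup, then each incremental, in order.
RESTORE=$(mktemp -d)
tar -xzf "$DEST/full.tar.gz" -g /dev/null -C "$RESTORE"
tar -xzf "$DEST/incr-day1.tar.gz" -g /dev/null -C "$RESTORE"
```

Wrap the two backup steps in cron jobs and you’ve got the whole strategy; the one piece people skip is actually running that last restore step on a schedule.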
So if you couldn’t guess by the punchline, my pal and colleague, Benjamin, says to me, “Put the drive in the freezer for 15 minutes, tap it on the side a few times, then plug it back in.” Skeptical, I went with it. Sure enough, with a freezing cold drive in my hand, I plugged it back in and stood there in awe that the drive was recognized. It was loud and clunky, and it failed the first time I tried to copy the files to another working drive. On the second attempt to remount it, sure enough, I got all of the stuff (180 GB of VMs included) and we’re back in business. I didn’t see that one coming.
The BI your users consume, published daily, hourly, in real time…whatever…is important. Not designing an architecture to survive a fault is…well…as dumb as me ignoring a backup process in my own office. In this case, I have to call it how I see it.