A while ago I wrote a post, Visualize Disaster, prompted by a real incident we had at my office. Fortunately we came through it OK from a business point of view, but I took away an important lesson: whether or not your organization and your team are savvy about disaster recovery, it’s very easy to have significant blind spots about recovering from some large, unexpected outage. We have very clear direction and decent budgets to work with, and the safety and recoverability of applications and data is a real, primary objective at my workplace – and still this was a take-your-breath-away, eye-opening kind of experience. Here are some common places I have seen such blind spots in my past work. Perhaps you can look around and see whether any of them exist wherever you are today, and maybe find some ways to combat them:
Invincibility Blind Spot
I think most of us have worked at a place where the leadership is just oblivious to the idea that anything damaging could happen to their business from an IT failure. I did some consulting at a place where one of the owners flat-out told me, “I’m not sure all this technology really helps us. It’s certainly not essential. We could go back to pencil and paper and be just fine.” I knew just from watching their operation for a short time that if they lost their technology, they would probably go straight out of business. Their server was the type you see pictures of as jokes on the web – in a room with no air conditioning, with an oscillating fan aimed at it, on a rickety shelf, with a shared password, etc.
This can be hard to combat. A typical organization like this would look at ideas like backup or disaster recovery, and immediately balk because it “sounds expensive” or “there isn’t time for that, because it’s not real work.” And they aren’t always small companies – one place I did work for had 1500 employees and essentially no DR strategy other than some half-hearted tape backups.
The only way I have gotten traction with these cases is to do two things:
- Make sure the leadership hears the argument for DR from someone they trust and who has credibility with them. In some cases, that was me, once I built trust working with them. In other cases the argument, right or wrong, had to come from someone else – perhaps another business person, not even in IT – for it to carry any weight.
- Once you have that voice that carries real weight, walk the leaders of that organization through a visualization of what could really happen if they lost their infrastructure: the sending people home, the loss of credibility with customers, the real, no-hand-waving, no-magic amount of time it would take to recreate a functional system, the work lost. It has to be real, and it has to burst that imaginary bubble that can surround computer technology and make it seem like it’ll just somehow keep working. A building fire is usually a good scenario, because you don’t have to be in IT to relate.
Ego Blind Spot
The Ego blind spot is somewhat trickier. It can lurk among capable IT staff who do have a mandate to make DR work, but who become defensive or make excuses whenever they are approached about discussing DR or testing their systems. There can be an undercurrent in the conversation that any suggestion DR isn’t “covered” by their systems is some sort of insult. Often that undercurrent actually comes from insecurity – there may really be gaps in their systems that they privately worry over, but don’t want to crack open and solve because either (a) it’s embarrassing or (b) they don’t relish the extra work and risk it could take to reconfigure a running system. These folks generally have the best intentions, but getting at the gaps in the technology can be a real problem, just because of personalities.
Here the only remedy I know of is sociological – the business continuity leader (or the IT team lead, if it’s the same person) has to have the leadership skills to win these folks over. The technical staff have to be in a position where finding the DR gaps and improving their systems is something they perceive as an opportunity to demonstrate their skills, not a threat to them. It has to feel like a worthwhile project. It’s almost impossible to get at the underlying problems any other way. The leader in this scenario will need the staff’s technical expertise and their on-the-ground view of how systems really work to even locate the issues, and for that, grudging cooperation will not do. Working DR has to become a real part of the staff’s fully owned, personal priorities. If the person who knows the low-level detail about how a system works is armed with DR know-how, and committed to making DR work, the gaps will disappear. If, on the other hand, the technical people don’t want to see the gaps, and the leadership isn’t capable of seeing the gaps, the gaps will remain until some incident exposes them.
Magic System Blind Spot
This is an interesting one – the Magic System blind spot is essentially a blind faith that some of the latest gee-whiz tech is the silver bullet that will save everything. “We have DR covered because we virtualized.” “We have DR covered because we replicate.” “We have DR covered because we load balance.” “Disaster can’t touch us – we have a SAN!”
I’ve seen naïve, young people succumb, I’ve seen leadership (the ones out of touch with the technology, generally) succumb, but surprisingly I have also seen savvy people I would never have expected succumb to this.
The remedy here looks simple to a staffer, but maybe difficult to a leader: no matter what a vendor claims or advertises, or what we imagine a magical system can do, you must have someone available who, impartially, knows how that technology works well enough to dispel the magic. All this stuff works for a reason. Using dedupe? Make sure someone on staff understands how that really works. Snapshots? How do they work? Relying on virtualization for DR? Exactly how does that work? Only by unpacking how these systems do what they do can you be sure they will work at crunch time.
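To make “unpacking the magic” concrete, here is a toy sketch of why “we replicate” does not by itself mean “we have DR covered.” It is not modeled on any real product; the `ReplicatedStore` class and its methods are made up for illustration. It assumes a replica that mirrors every change, and shows that a mistaken delete is mirrored just as faithfully as a good write, while a point-in-time copy taken earlier still holds the data.

```python
import copy


class ReplicatedStore:
    """Toy key-value store whose every change is mirrored to a replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backups = []  # point-in-time copies, oldest first

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # replication copies the write

    def delete(self, key):
        self.primary.pop(key, None)
        self.replica.pop(key, None)  # replication copies the delete too

    def take_backup(self):
        # A backup is a frozen copy; later changes do not reach it.
        self.backups.append(copy.deepcopy(self.primary))


store = ReplicatedStore()
store.put("orders", ["order-1", "order-2"])
store.take_backup()

# An operator mistake or a bad deploy removes the data.
store.delete("orders")

replica_has_orders = "orders" in store.replica
backup_has_orders = "orders" in store.backups[-1]

print("replica still has orders:", replica_has_orders)  # False
print("backup still has orders:", backup_has_orders)    # True
```

Real replication and snapshot products are far more sophisticated than this, which is exactly why someone on staff needs to know what yours actually does when bad data arrives.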
Devil in the Details Blind Spot
Lastly, we have the blind spot that is the nemesis of us all, the one that is present in every organization and extremely difficult to stamp out: “When we fail over, when the data center goes down, will it work?” This is a simple question, but here’s why it is so difficult: every system has so many moving parts, each of which perhaps requires specialized knowledge, and a seemingly small detail that nobody thought of can absolutely wreck the DR process when you have a real incident. It’s very easy to have a scenario where practically everything works except that one tiny thing that prevents it all working – the database is there, the web servers are up, we have network connectivity and name resolution, but everyone forgot that the encryption key to the whoozit has to be loaded into the whatchacallit. It’s really easy to miss something. And because the something that was missed is small, maybe nobody took it very seriously.
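One way to picture that scenario is as a readiness checklist where a single small failure blocks everything. The sketch below is only an illustration: the check names come from the example above, and the probes are hard-coded stand-ins (an assumption on my part) for whatever real tests you would run against your own DR site.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # returns True when the dependency is ready


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for check in checks:
        try:
            ok = check.probe()
        except Exception:
            ok = False  # a probe that crashes counts as a failure
        if not ok:
            failed.append(check.name)
    return failed


# Hypothetical probes; real ones would query the actual DR site.
dr_site_checks = [
    Check("database restored", lambda: True),
    Check("web servers up", lambda: True),
    Check("network connectivity", lambda: True),
    Check("name resolution", lambda: True),
    Check("encryption key loaded", lambda: False),  # the forgotten detail
]

failed = run_checks(dr_site_checks)
ready = not failed  # ready only if nothing failed

passed_count = len(dr_site_checks) - len(failed)
print(f"{passed_count} of {len(dr_site_checks)} checks passed")
print("DR site ready:", ready)
print("failed:", failed)
```

Four of five checks pass and the site is still not usable. The hard part, of course, is knowing that the fifth check belongs on the list in the first place.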
Remedies for this are more difficult. For some organizations with the finances, it might be possible to actually run multiple data centers and, in fact, fail production systems between them. That would ensure the design is sound. Most of us, though, have to use test systems and then just try our darnedest to be really careful.
If you can’t test with production systems, the next best thing would be to have a pre-prod or staging system that is comparable to production where you can do rehearsals. Such a rehearsal can be a drill around some imaginary scenario, say “It’s 5:00 am and Data Center A is on fire. (This is a drill.) Go.”
Failing that, the only recourse – and it’s much less accurate – is a careful and detailed tabletop visualization. Visualizations like this are great, and valuable, if they are run well. Vital ingredients:
- Effective leadership that can persuade people to check egos at the door and take it seriously. Without buy-in, you never get to the details that matter.
- A facilitator who can ask relevant but probing questions, in order to eliminate the inevitable hand-waving that masks gaps in the system. Example: “At this point we would load the logins into the DR SQL Server.” The facilitator should not say “OK.” She should say “From where? How? Who?”
- Detail. Everything in a reasonably sane organization works at a high level. It’s only by diving into the details and making a visualization real that you uncover those small, system-breaking gaps.
- Note takers. In every tabletop I’ve attended, a huge number of issues were uncovered, and in order to get the most value, it’s important to capture them all right then, in the room. Otherwise they escape!
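If your note takers want something more structured than a notepad, the facilitator’s three questions can be recorded per step. This is just a sketch of one possible shape: the `RunbookStep` fields and the sample answers are hypothetical, not a standard format. Any step with an unanswered question is captured as an open gap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunbookStep:
    description: str
    source: Optional[str] = None  # from where?
    method: Optional[str] = None  # how?
    owner: Optional[str] = None   # who?


def open_questions(step: RunbookStep) -> List[str]:
    """Return the facilitator's questions this step has not answered yet."""
    questions = []
    if not step.source:
        questions.append("From where?")
    if not step.method:
        questions.append("How?")
    if not step.owner:
        questions.append("Who?")
    return questions


steps = [
    # The hand-waved step from the example above: no details at all.
    RunbookStep("Load the logins into the DR SQL Server"),
    # A fully answered step; the answers are made up for illustration.
    RunbookStep(
        "Restore the database",
        source="last night's full backup (hypothetical)",
        method="scripted restore (hypothetical)",
        owner="on-call DBA (hypothetical)",
    ),
]

# Keep only the steps that still have unanswered questions.
gaps = {s.description: open_questions(s) for s in steps if open_questions(s)}

for description, questions in gaps.items():
    print(f"{description}: {' '.join(questions)}")
```

The point is not the tooling. It is that every hand-waved step leaves the room with its open questions written down.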
Do you see one of these four blind spots in your organization? Others? Any tips or processes for stamping them out? I’d love to hear.