British Airways IT Issues

The media coverage of the recent major systems outage at British Airways is some of the worst I can recall reading. Essentially a national institution working in an industry synonymous with resilience, safety and preparation is making a drama out of a crisis and many technical practitioners are still trying to understand what happened. 

In the most recent reports, it sounds like a contractor accidentally switched off the power in their datacenter and, with it, toppled the first domino in a series which led to 750,000 passengers being unable to fly and an as-yet-uncalculated compensation bill. 

"It was not an IT issue, it was a power issue" - British Airways 

Nothing in the information they’ve released so far really explains how this course of events came about, nor does it provide any confidence or assurance around BA’s approach to business continuity planning (BCP) or disaster recovery (DR). Essentially, someone was able to gain access to and then press a button which took the entire organisation offline in such a way that they were unable to recover in a suitable timeframe. This also managed to happen at a peak period for the business, a bank holiday weekend, and one which the media had reported would see record travel figures given the abundance of good weather in the UK. We now have a lot of questions about change control, BCP/DR, personnel management, system resilience, architectural practices, and governance, risk and compliance, and so far no real answers. 

BCP/DR aren’t easy but nothing worth doing is. Over the years, I’ve seen shining examples of how to prepare for the worst and the unthinkable. On the other hand, I’ve seen how not to do it. As a consultant, you can often get a feel for how seriously an organisation takes BCP/DR with a few well-placed questions around hosting and infrastructure. If you’re met with a wry, nervous smile when you ask, you can bet the situation isn’t going to be great. 

Those organisations that really were ready all had the same thing in common: they were expecting, or had needed to handle, the worst in recent history, and this had made certain that BCP/DR remained an active consideration in every aspect of how the company operated, how it tracked and cared for its people and how it provisioned central services. I recall one client in particular, not just because of how much they impressed me with their preparations but because of how integral preparation and resilience were to their day-to-day business. It came as no surprise, then, when this client went on to detail their plans for New Zealand and Japan and then to show how they had managed the earthquakes in Christchurch and Fukushima in 2011. It focuses the mind when you consider IT availability issues within the context of human life. Being able to support your employees’ livelihoods during a major recovery effort is a major part of an employer’s duty of care. 

Hopefully, this current media backlash and the inquiry to follow will provide the leadership team there with the support and investment they need to prevent anything like this from happening again. Hopefully, this scrutiny will extend organisation-wide and highlight any other weaknesses or gaps in both how BA operates and in the systems they employ.

