How a BGP error knocked Facebook, WhatsApp and Instagram offline for millions

Millions were unable to use Facebook, WhatsApp and Instagram yesterday as a configuration change had a disastrous effect (Reuters)

Facebook has apologised for yesterday’s outage, the biggest in its history, which also affected Instagram and WhatsApp.

The service blackout affected millions of users around the globe for several hours on Monday afternoon and evening. Users in the UK were reporting problems as early as 4.30pm and weren’t able to get back on until around 10pm.

Facebook hasn’t given any detailed technical information about what caused the outage, but reports suggest it had to scramble engineers to its data centre to uncover the source of the problem.

The company said a ‘faulty configuration change’ was to blame for the problem, but technical experts believe the issue lay with something called the Border Gateway Protocol (BGP), which is one of the systems the internet uses to direct traffic around the web.

Unfortunately for Facebook, many of its engineers rely on internal logins to access the communication tools they need to resolve the problem. With the entire system down, they had to resort to alternative methods of communication – like holding conversations over email.

‘BGP is a technology [by] which ISPs (Internet Service Providers) share information about which providers are responsible for routing Internet traffic to which specific groups of Internet addresses,’ explained PJ Norris, principal systems engineer at Tripwire.

A problem with the Border Gateway Protocol meant Facebook users weren’t able to connect (Shutterstock)

The best way to think of BGP is as an air-traffic controller for the internet: it tells routers which paths packets of data should take so they reach the right servers in the quickest and most efficient way possible. Because the routes around the web are always changing, BGP is an automated way of keeping things going in the right direction.

When a BGP configuration is changed incorrectly, your computer suddenly has no route to follow to its destination.

‘In other words, Facebook inadvertently removed the ability to tell the world where it lives,’ PJ said.
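To make that idea concrete, here is a minimal sketch in Python of what losing a route looks like. It is a toy model only, not real BGP and not Facebook's configuration: the `RoutingTable` class, the `example-social-network` name and the `192.0.2.0/24` address block (a range reserved for documentation) are all assumptions made for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes a router has learned from announcements."""

    def __init__(self):
        # Maps an announced network to the (hypothetical) name of its origin.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` is responsible for the address block `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced address block."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching route, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: traffic has nowhere to go
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "example-social-network")
before = table.lookup("192.0.2.10")

# A faulty change withdraws the announcement.
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.10")

print(before, after)
```

Before the withdrawal the lookup finds an owner for the address; afterwards it finds nothing, which is the sense in which a network can stop telling the world where it lives.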

As well as having to use alternative tools to communicate, Facebook was also grappling with the fact that not everyone was in the same place. Like many tech companies, the pandemic meant a number of employees were working remotely.

‘Those who were onsite at the data centres and offices who were trying to back out the change, were unable to access the environments as the door access control system was down due to the impact of the outage,’ PJ said.

‘So the question always comes down to, “could this have been avoided?” It’s evident at this early stage that Facebook had a single point of failure that cascaded into a significant and costly outage for the technology giant.

‘Any changes, especially to critical services, should be tested, and double checked before implementation. It’s unclear around the circumstances of this change to the BGP at this point in time, so it’s speculative on how this happened.’
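As a rough illustration of the kind of check Norris describes, the Python sketch below refuses a change that would withdraw too large a share of the address blocks a network currently announces. It is purely hypothetical: the `check_change` function, its 50 per cent threshold and the example prefixes are assumptions, and nothing here reflects how Facebook actually reviews its changes.

```python
def check_change(current_prefixes, proposed_prefixes, max_withdrawn_fraction=0.5):
    """Hypothetical pre-change check: flag changes that withdraw too many routes.

    Returns a (safe, reason) tuple. The threshold is an arbitrary assumption.
    """
    current = set(current_prefixes)
    if not current:
        return True, "nothing currently announced"

    # Prefixes announced today that the proposed configuration would drop.
    withdrawn = current - set(proposed_prefixes)
    fraction = len(withdrawn) / len(current)

    if fraction > max_withdrawn_fraction:
        return False, (
            f"change would withdraw {len(withdrawn)} of {len(current)} prefixes"
        )
    return True, "ok"


announced = ["192.0.2.0/24", "198.51.100.0/24"]

# A change that keeps every announcement passes the check.
safe_result = check_change(announced, announced)

# A change that drops every announcement is rejected.
risky_result = check_change(announced, [])

print(safe_result)
print(risky_result)
```

A check like this would not catch every mistake, but it shows how an automated gate could stop a change that removes all of a network's routes at once.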

Many of Facebook’s communication tools are internal, which means engineers had difficulty in communicating (Getty)

Facebook, which owns both WhatsApp and Instagram, said in a statement: ‘Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication.

‘This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.

‘We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.’
