EXCLUSIVE NEWS – When Facebook, one of the most powerful companies in the world, suffered the biggest internet outage in history, billions of people were cut off from communication. The company lost 60 million dollars during the outage, which lasted about 8 hours. That figure may seem large, but for a company that lays its own internet cables on the ocean floor and makes close to $30 billion a year, it is small change.
So what happened that night? Why did Facebook suddenly disappear from the infrastructure of the internet, and why did its engineers have to cut through their own company’s security barriers, almost like cavemen? Eren Algan, who moved from Uber to Facebook as a Senior Software Engineer 7 months ago, told Webtekno what happened on the night of the crash.
Why did Facebook, Instagram, WhatsApp and Oculus crash?
There are two concepts you should know before we start: BGP and DNS.
- BGP (Border Gateway Protocol): In short, ‘the mail service of the internet’. When you send a message on WhatsApp, BGP works out the fastest, most efficient route for delivering it to the other person.
- DNS (Domain Name System): In short, ‘the phone book of the internet’. It is the system that tells your browser which IP address (a kind of identification number) a name such as “facebook.com” corresponds to.
“Both of these systems are necessary for you to reach any internet address,” says Eren Algan. “While DNS tells you which IP the characters you type belong to, BGP is the system that tells you the fastest way to get from your current network to the network you want to reach.” Let’s explain with a simple example: you open WhatsApp and type a person’s name in the search box; the results appear and the contact name is matched to a phone number/profile (DNS). You call the person, and the call travels first to the nearest base station, then to the closest satellite, then to the base station closest to the person on the other end, and finally to their phone (BGP).
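The two steps described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the hostname, the IP addresses (taken from ranges reserved for documentation), the AS numbers and the function names are all hypothetical, and real BGP route selection weighs many more factors than the "longest matching prefix, then shortest path" rule used here.

```python
import ipaddress

# Hypothetical "phone book": hostname -> IP address (documentation range).
DNS_RECORDS = {"example.test": "192.0.2.10"}

# Hypothetical routing table: prefix -> AS paths advertised by neighbours.
ROUTES = {
    "192.0.2.0/24": [[64500, 64510, 64496], [64501, 64496]],
    "198.51.100.0/24": [[64500, 64497]],
}


def resolve(hostname):
    """DNS step: return the IP for a name, or None if there is no record."""
    return DNS_RECORDS.get(hostname)


def best_path(ip, routes):
    """BGP-like step: longest matching prefix, then shortest AS path."""
    addr = ipaddress.ip_address(ip)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # nobody announces a route to this address
    prefix = max(matches, key=lambda net: net.prefixlen)
    return min(routes[str(prefix)], key=len)


def reach(hostname, routes=ROUTES):
    """Both steps must succeed before a connection is possible."""
    ip = resolve(hostname)
    if ip is None:
        return "DNS failure: name not found"
    path = best_path(ip, routes)
    if path is None:
        return f"{ip}: no route to host"
    return f"{ip} via AS path {path}"
```

Calling `reach("example.test")` succeeds only when both lookups do; passing an empty routing table, as in `reach("example.test", {})`, shows that knowing the IP is useless once no route to it is announced.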
The beginning of the end: only one engineer and one wrong command!
Eren Algan describes his employer as “an institution with its own data centers and a huge network structure.” “We are talking about an enormous company that even runs its own fiber optic cables under the Pacific Ocean,” he says. “All these wired networks pass through a cabling system that Facebook calls the ‘backbone network’.”
In other words, although we mostly connect to Facebook wirelessly, all of its systems ultimately depend on cables, as is the nature of the internet. The company therefore has to maintain this wired infrastructure regularly, and only the engineering teams responsible for this work carry out the regular tests.
“During routine ‘backbone’ maintenance on October 4, an engineer ran a command to see how much capacity this network had. This command, the beginning of the end, unintentionally took down every connection in the ‘backbone’ network, knocking all of Facebook’s data centers offline.”
The chain of misfortunes continues: Eren Algan describes the moment when Facebook literally disappeared from the internet:
Eren Algan says that Facebook’s systems are normally designed to prevent such errors: “Unfortunately, there was also a bug in the software developed to catch these errors, so the incorrect command entered by the engineer could not be stopped. As a result, the link between Facebook’s data centers and the internet was severed.” He continues:
“If Facebook’s DNS servers cannot reach the data centers, they mark themselves as ‘unreachable/faulty’ and report this to the postal service (BGP). On the night of the crash, the DNS servers sent out an error that amounted to ‘We are not at home; there isn’t even a house.’”
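The behaviour described in the quote above can be sketched roughly as follows. This is a simplified model, not Facebook’s actual software: the `DnsServer` class, its method names and the address prefixes (documentation ranges) are assumptions made for illustration.

```python
class DnsServer:
    """Toy DNS server that announces its prefix only while it is healthy."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.announced = True

    def health_check(self, can_reach_data_center):
        # If the data centers are unreachable, the server declares itself
        # faulty and withdraws its route instead of serving bad answers.
        self.announced = can_reach_data_center
        return "announce" if self.announced else "withdraw"


def routable_prefixes(servers):
    """Prefixes the rest of the internet can still find a route to."""
    return sorted(s.prefix for s in servers if s.announced)


servers = [DnsServer("192.0.2.0/24"), DnsServer("198.51.100.0/24")]

# Backbone down: every server fails its check at the same moment.
actions = [s.health_check(can_reach_data_center=False) for s in servers]
```

A safeguard that makes sense when one server loses its connection becomes a total blackout when every server loses it at once: after the checks above, `routable_prefixes(servers)` is empty, so no one outside can find the DNS servers at all.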
Facebook’s trip back to the Stone Age: cutting through server doors with an angle grinder and breaking into your own company like a thief...
Eren Algan says that the engineers, realizing there was a problem with Facebook and all of its connected applications, got to work immediately. “They encountered another unpleasant surprise. Since all the data centers were offline, nobody could intervene over the internet,” he says. In other words, the employees of Facebook, one of the largest internet companies in the world, were left “without internet” while the rest of the world still had it.
“The only thing that could be done was to physically go to the data centers and bring the DNS servers back up. Of course, it was not that simple. Facebook had taken extra security measures so that hackers could not get into its data centers. To reach these computers, engineers had to get an angle grinder and cut through their own company’s physical security barriers like thieves.”
Although everything is back to normal, the problems are not over:
After the engineers broke into their own company like thieves and corrected the wrong command, “the problems, of course, did not end there,” says Eren Algan. “Turning on all systems at once can never be the right solution for companies of this size. Since there will be a heavy load on the computers when the systems come back up, they must be brought up in a specific order and at limited capacity.” This is why WhatsApp does not have an on-off switch like a light bulb.
“Because Facebook anticipates potential problems like this, it has a protocol called ‘storm drills’, in which it tests how systems will behave in the event of a disaster. Thanks to this, the systems could be brought back up in a controlled order and at controlled capacity. After about 5 hours of feverish work, access to the Facebook, Instagram, WhatsApp and Oculus applications was restored :)”
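The idea of bringing systems up “in a specific order and at limited capacity” can be illustrated with a short sketch. The service names, the dependency map and the capacity steps below are hypothetical; the article does not describe Facebook’s real recovery procedure in this level of detail.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: each service lists what must be up before it.
DEPENDS_ON = {
    "network": [],
    "dns": ["network"],
    "apps": ["dns"],
}


def restart_plan(depends_on, ramp=(10, 50, 100)):
    """Return ordered (service, capacity %) steps for a gradual restart."""
    # Dependencies come first, so nothing starts before what it relies on.
    order = TopologicalSorter(depends_on).static_order()
    # Each service is raised in stages rather than jumping from 0 to 100.
    return [(service, percent) for service in order for percent in ramp]


plan = restart_plan(DEPENDS_ON)
```

Here `plan` starts with the network at 10 percent and only reaches the applications after DNS is fully up, which is the opposite of flipping a single switch.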
“When I went to the office on October 4th, I could not access the internal network. Even the printers were not working. Because the internal communication systems were down, it was much later that we even learned what was going on.”
Now it’s time to answer the question on everyone’s mind: what happened to the engineer who typed that wrong command and made Facebook lose $67 million?
“Nothing happened, of course. Companies like Facebook treat such mistakes as learning opportunities and hold a post-mortem (autopsy) to improve their systems. People usually don’t get fired for mistakes like this. For the person involved it becomes a memorable career story, and for Facebook a costly mistake.”
We would like to thank Eren Algan for his contribution to this article. You can reach him through his LinkedIn and Instagram accounts.