Amazon Reveals the Cause of a Massive Server Crash Last Week

If your favourite website wasn’t accessible last Tuesday, you weren’t alone. As widely reported, Amazon’s cloud services suffered technical problems that affected many internet services, devices and computers, and the cause turned out to be quite unexpected.

Amazon Web Services (AWS), the popular storage and hosting platform used by a huge range of companies such as Pinterest, Airbnb, Netflix, Slack, Buzzfeed, Spotify and some Gannett systems, experienced intermittent outages on Tuesday afternoon. Even Amazon’s ability to report problems was broken for a while: the AWS status dashboard wasn’t changing colour because the dashboard itself was affected by the problems with S3, Amazon’s storage service.

People also reported delays on services like Slack, Trello, Sprinklr, Venmo and even Down Detector, the site that shows where outages are occurring in real time.

“AWS provides cloud-based storage and web services for companies so that they don’t have to build their own server farms, allowing them to rapidly deploy computing power without having to invest in infrastructure. For example, a business might store its video or images or databases on an AWS server and access it via the Internet,” said USA Today.

Affected sites reported the outage began around 12:40 p.m. ET and was resolved shortly before 5 p.m. ET, although Down Detector was still unavailable.

Nest customers reported widespread issues with their cameras, and the company tweeted that the AWS outage was the likely cause. Other services, such as encryption and content delivery (used to distribute images and other media for websites), were also affected. Even Amazon devices like Alexa stopped functioning during the outage, causing some frustration among users.

The company did not initially explain what went wrong, although its status page narrowed the outage down to a Northern Virginia location. Finally, on the 2nd of March, the company announced that the incident was the result of human error. More precisely, it was a typo.

Amazon published a full explanation of last Tuesday’s S3 failure. In a statement, Amazon said that an employee on its S3 team was working on an issue with the billing system and had to shut down a small number of servers. However, he or she entered the command incorrectly and removed a much larger set of servers.

During the outage, people complained that they could not use their favourite websites or share files on the enterprise chat app Slack. Some news agencies also reported that they could not publish articles.

“While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further,” said company representatives.

Amazon is now making several changes to its systems to avoid similar incidents in the future. Specifically, the tool used to remove servers will no longer allow so many of them to be turned off so quickly.
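To illustrate the kind of safeguard described above, here is a minimal sketch in Python. It is not Amazon’s tool: the function name, the 5 per cent limit and the minimum of 10 remaining servers are all hypothetical values chosen for the example. The idea is simply that a removal command is checked against limits before anything is shut down.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request would take out too much capacity."""


def check_removal(total_servers, requested, max_fraction=0.05, min_remaining=10):
    """Return the number of servers to remove, or raise if the request looks unsafe.

    max_fraction and min_remaining are illustrative assumptions,
    not values taken from Amazon's statement.
    """
    if requested < 0 or requested > total_servers:
        raise ValueError("requested must be between 0 and total_servers")

    # Refuse to remove more than a small share of the fleet in one command.
    if requested > total_servers * max_fraction:
        raise CapacityGuardError(
            f"refusing to remove {requested} of {total_servers} servers at once"
        )

    # Refuse to let the fleet drop below a minimum required size.
    if total_servers - requested < min_remaining:
        raise CapacityGuardError("removal would leave too few servers running")

    return requested
```

With these assumed limits, a request to remove 20 servers out of 1,000 would pass, while a mistyped request for 500 would be rejected before any server is touched.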

According to Synergy Research Group, Amazon Web Services holds over 40 per cent of the cloud services market. This means it is responsible for keeping the websites of a large range of popular companies running. Big businesses have been steadily moving storage to the cloud because it is cheaper, easily accessible and more resilient. The downside is that when there is an incident, there are problems for everybody.

Companies can contract with several cloud providers to avoid such issues, but that is quite pricey, so many of them simply accept that on rare occasions they are going to have a very bad day.

“Only the most paranoid, and very large companies, distribute their files across not just AWS but also Microsoft and Google, and replicate them geographically across regions — but that’s very, very expensive,” said a cloud analyst, Lydia Leong.
