
It has been close to 33 hours since the outage at Amazon’s N. Virginia data center began. The incident has brought down a large number of websites hosted on Amazon Web Services (AWS). Though the outage affects only the company’s N. Virginia data center, a large number of customers are affected, which Amazon cannot, or rather should not, ignore. However, there has been no official statement, blog post, or tweet from Amazon so far. The last press release from Amazon, on the 19th of April, was Announcing Live Streaming for Amazon CloudFront.

Amazon gives its customers a 99.9% uptime guarantee, which translates to roughly 8.76 hours of allowable downtime per year. With over 33 hours of outage, Amazon has already exceeded that allowance almost four times over. Yet Amazon apparently doesn’t see any obligation to inform its customers about the cause of the outage. Logs on the AWS Service Health Dashboard show more services going down. Through the dashboard, Amazon is trying to assure users that it is working hard to get things sorted out, but 33 hours of outage isn’t something anyone expects from one of the biggest cloud providers in the world.
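For perspective, here is the arithmetic behind those figures as a back-of-the-envelope Python sketch; the 99.9% guarantee and the 33-hour outage duration come from the paragraph above, the rest is plain calculation.

```python
# Downtime allowed under a 99.9% uptime guarantee, compared with the outage so far.
hours_per_year = 365 * 24                        # 8,760 hours in a year
allowed_downtime = hours_per_year * (1 - 0.999)  # 8.76 hours of permitted downtime

outage_hours = 33                                # duration of the outage at the time of writing
print(f"Allowed downtime: {allowed_downtime:.2f} hours/year")           # 8.76
print(f"Outage vs. allowance: {outage_hours / allowed_downtime:.1f}x")  # ~3.8x, almost four times
```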

Here is a current log of events:

Amazon Elastic Compute Cloud (N. Virginia)

10:58 PM PDT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It’s taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.
2:41 AM PDT We continue to make progress in restoring volumes but don’t yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available.
6:18 AM PDT We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we’ll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we’ll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone.
8:49 AM PDT We continue to see progress in recovering volumes, and have heard many additional customers confirm that they’re recovering. Our current estimate is that the majority of volumes will be recovered over the next 5 to 6 hours. As we mentioned in our last post, a smaller number of volumes will require a more time consuming process to recover, and we anticipate that those will take longer to recover. We will continue to keep everyone updated as we have additional information.

Amazon Relational Database Service (N. Virginia)

2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.
11:42 PM PDT In line with the most recent Amazon EC2 update, we wanted to let you know that the team continues to be all-hands on deck working on the remaining database instances in the single affected Availability Zone. It’s taking us longer than we anticipated. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.
7:08 AM PDT In line with the most recent Amazon EC2 update, we are making steady progress in restoring the remaining affected RDS instances. We expect this progress to continue over the next few hours and we’ll keep folks posted.
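The 2:35 PM PDT RDS entry above recommends restoring snapshots from the impaired zone into another zone and not targeting a specific Availability Zone for new RDS instances. As a rough illustration only (using boto3, the current Python SDK, with hypothetical instance and snapshot names), that workaround might look like this:

```python
# Illustrative sketch: restore an RDS snapshot without pinning it to an Availability Zone.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="mydb-recovered",        # hypothetical name for the replacement instance
    DBSnapshotIdentifier="mydb-snap-2011-04-21",  # hypothetical snapshot taken before the outage
    # No AvailabilityZone parameter: leaving the request untargeted lets the
    # service avoid the impaired zone, as the status update above recommends.
)
```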
Amazon EC2 (N. Virginia)
6:18 PM PDT Earlier today we shared our high level ETA for a full recovery. At this point, all Availability Zones except one have been functioning normally for the past 5 hours. We have stabilized the remaining Availability Zone, but recovery is taking longer than we originally expected. We have been working hard to add the capacity that will enable us to safely re-mirror the stuck volumes. We expect to incrementally recover stuck volumes over the coming hours, but believe it will likely be several more hours until a significant number of volumes fully recover and customers are able to create new EBS-backed instances in the affected Availability Zone. We will be providing more information here as soon as we have it. Here are a couple of things that customers can do in the short term to work around these problems. Customers having problems contacting EC2 instances or with instances stuck shutting down/stopping can launch a replacement instance without targeting a specific Availability Zone. If you have EBS volumes stuck detaching/attaching and have taken snapshots, you can create new volumes from snapshots in one of the other Availability Zones. Customers with instances and/or volumes that appear to be unavailable should not try to recover them by rebooting, stopping, or detaching, as these actions will not currently work on resources in the affected zone.
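The 6:18 PM PDT entry above describes two workarounds: launching a replacement instance without targeting an Availability Zone, and recreating EBS volumes from snapshots in a healthy zone. A minimal sketch of both (again using boto3 for illustration; the AMI, snapshot ID, and zone name are hypothetical placeholders):

```python
# Illustrative sketch of the two EC2 workarounds described in the status update above.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Launch a replacement instance without a Placement/AvailabilityZone,
#    so EC2 is free to place it outside the impaired zone.
ec2.run_instances(
    ImageId="ami-12345678",   # hypothetical AMI of the affected application
    InstanceType="m1.large",  # 2011-era instance type, purely illustrative
    MinCount=1,
    MaxCount=1,
)

# 2. Recreate a stuck EBS volume from an existing snapshot in another zone.
ec2.create_volume(
    SnapshotId="snap-12345678",     # hypothetical snapshot of the stuck volume
    AvailabilityZone="us-east-1b",  # any zone other than the affected one
)
```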
AWS Elastic Beanstalk (N. Virginia)
2:16 PM PDT We have observed several successful launches of new and updated environments over the last hour. A single Availability Zone in US-EAST-1 is still experiencing problems. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.
The AWS downtime counter can be seen here.