Highly-available S3

It's possible to raise the availability of S3 through a multi-region setup. Below I'll explain some of the work we've done at Brightcove on this, including a new Node.js library.

Nearly-highly-available S3

S3 is a fantastic service, and it has the reputation of being a very reliable object store. Not only is the service well-regarded, Amazon touts a 99.999999999% durability of objects over a given year.

So why is anything more needed here for reliability?

While the stated durability is incredibly high, the claims on availability are fairly standard. It is available somewhere between 99.99% a year (from a details page) and 99.9% a month (from the SLA). At the lower end, that's about 44 minutes of allowed downtime in a month.
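
To make the arithmetic concrete, here is a small sketch that turns an availability percentage into allowed downtime. The function name and the choice of an average month (365.25 / 12 days) are my own assumptions for illustration.

```javascript
// Minutes of downtime permitted by an availability percentage over a period.
function allowedDowntimeMinutes(availabilityPercent, periodDays) {
  var periodMinutes = periodDays * 24 * 60;
  return periodMinutes * (1 - availabilityPercent / 100);
}

// 99.9% over an average month comes to about 44 minutes.
var monthly = allowedDowntimeMinutes(99.9, 365.25 / 12);
// 99.99% over a year comes to a little under 53 minutes.
var yearly = allowedDowntimeMinutes(99.99, 365);
console.log(monthly.toFixed(1), yearly.toFixed(1));
```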

On the project I'm working on, the player services for the Brightcove player, we have a lot of API calls to make. Each API that has 99.9% (or even 99.99%) availability makes it much harder for us to meet our own SLA requirements. Raising the availability rating on S3 ourselves makes life simpler.
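
A rough way to see why each dependency hurts: if a request needs every dependency to succeed, and the dependencies fail independently (a simplifying assumption), the availabilities multiply. The five dependencies at 99.9% below are hypothetical numbers, not our actual services.

```javascript
// Availability of a request that needs every dependency to succeed,
// assuming the dependencies fail independently of each other.
function combinedAvailability(availabilities) {
  return availabilities.reduce(function (product, a) {
    return product * a;
  }, 1);
}

// Five hypothetical dependencies at 99.9% each: roughly 99.5% combined.
var combined = combinedAvailability([0.999, 0.999, 0.999, 0.999, 0.999]);
console.log((combined * 100).toFixed(2) + '%');
```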

And no matter S3's reputation, all services can fail. This was shown with S3 in the August outage in us-east. We had downtime during this issue, as did many others.

Try, try again

The main way we're making sure that S3 is available is through the usage of a second region with replicated S3 data. But there's a simpler step to understand first. If there are intermittent issues from one location, the retry mechanism built into the AWS SDK will help. By default, each S3 call will be retried up to 3 times using exponential backoff. If you want to try even more times, it's simple enough to do so:

  awsConfig.maxRetries = maxAttempts;
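
For a feel of what exponential backoff means for timing, here is a minimal sketch of backoff with random jitter. It illustrates the general technique and is not the SDK's own code; the 100 ms base and the injectable random function are assumptions made for the example.

```javascript
// Delay before retry number `retryCount` (0 for the first retry).
// `baseMs` and the injectable `random` are illustrative parameters.
function backoffDelayMs(retryCount, baseMs, random) {
  var rand = random || Math.random;
  // The ceiling doubles with each attempt; jitter picks a point below it.
  return rand() * Math.pow(2, retryCount) * baseMs;
}

// Upper bounds for three retries with a 100 ms base: 100, 200, 400 ms.
var ceilings = [0, 1, 2].map(function (n) {
  return backoffDelayMs(n, 100, function () { return 1; });
});
console.log(ceilings);
```

Because the ceiling doubles each time, raising the retry count adds noticeably more worst-case latency with every extra attempt.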

The retries happen on 5xx responses. We've also seen some 400 responses, such as socket timeouts, that succeed on a retry. If you want to retry on additional conditions, or simply capture information about retries, you can do this with the retry event:

  awsRequest.on('retry', function (response) {
    // add an "if" statement here for the
    // conditions you want to retry on
    response.error.retryable = true;
  });
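
As one possible way to fill in that "if" statement, the sketch below retries only on the 400-level socket timeouts mentioned above. It assumes those show up as an error with statusCode 400 and the S3 error code "RequestTimeout"; check what you actually see in your own logs. The function names are hypothetical.

```javascript
// Returns true for errors we want retried beyond the SDK defaults.
// Assumption: the socket timeouts show up as HTTP 400 with the S3 error
// code "RequestTimeout"; adjust to whatever you see in your own logs.
function isExtraRetryable(error) {
  return Boolean(error) &&
    error.statusCode === 400 &&
    error.code === 'RequestTimeout';
}

// Handler for the request's 'retry' event.
function onRetry(response) {
  if (isExtraRetryable(response.error)) {
    response.error.retryable = true;
  }
}

// Usage, with awsRequest being the request object from the snippet above:
//   awsRequest.on('retry', onRetry);
```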

Multi-region S3

The best way to increase the availability of S3, as I mentioned above, is to use a second region.

The heavy lifting for using this feature is all done for you by Amazon with cross-region replication. By enabling cross-region replication in both directions between two buckets, you can always write to one of the buckets and have the other one as the backup.

Before going forward with this plan, make sure to spend some time in the AWS cost calculator. You will have twice the storage cost, twice the API calls, and an increase in inter-region data transfer. Take this into account along with the extra work to be done below and, most importantly, the extra ongoing complexity of adding this in. You may find it's not worth it for your case, or that there are lower-hanging availability risks to tackle first.

Setting up cross-region replication

See Amazon's guide for setup of replication. A few additional tips on setup:

  • Don't forget to turn on versioning for both buckets
  • Once you have gone through the replication steps in Amazon's guide, remember to go back to setting up the second bucket for replication as well
  • If you are starting with one bucket that already has data, make sure to use the AWS SDK for an initial copy of files from one bucket to another, since replication only picks up objects written after it is enabled.
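
If you would rather script the setup than click through the console, the sketch below builds the parameter objects for the SDK's putBucketVersioning and putBucketReplication calls, one replication rule per direction. The bucket names, account ID, and role ARN are made up, and the rule uses the simple prefix-based form; treat it as a starting point and follow Amazon's guide for the IAM role itself.

```javascript
// Parameters for s3.putBucketVersioning: replication requires versioning
// on both the source and the destination bucket.
function versioningParams(bucket) {
  return {
    Bucket: bucket,
    VersioningConfiguration: { Status: 'Enabled' }
  };
}

// Parameters for s3.putBucketReplication, replicating every key
// (empty prefix) from `source` to `destination`.
function replicationParams(source, destination, roleArn) {
  return {
    Bucket: source,
    ReplicationConfiguration: {
      Role: roleArn,
      Rules: [{
        Prefix: '',
        Status: 'Enabled',
        Destination: { Bucket: 'arn:aws:s3:::' + destination }
      }]
    }
  };
}

// Hypothetical names; one rule per direction.
var role = 'arn:aws:iam::123456789012:role/example-replication-role';
var forward = replicationParams('example-primary', 'example-failover', role);
var backward = replicationParams('example-failover', 'example-primary', role);
```

Each object would be passed to an S3 client configured for the source bucket's region, for example s3.putBucketReplication(forward, callback).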

Using a failover bucket

With a second bucket set up and cross-region replication enabled, it's time to start using your new setup. There are two main ways you can use your new failover capabilities:

  1. Switching to the failover bucket for public users of S3 data
  2. Switching to the failover bucket for API calls

For the case of public viewing, you will need something that is contacted before S3 to have this work properly. Hopefully you are already using a CDN that is contacted before S3, in which case this work should be fairly simple. Many CDNs have a way to switch over to a failover origin as needed. In the case of CloudFront, I believe this needs to be done using Route 53 instead, but you'll need to read up on this yourself.

For the case of API usage, you will want a way to automatically switch API calls to the failover bucket as needed. We created an open source Node.js library just for this case, s3-s3. Even if you aren't using Node.js, looking at the library should give you some ideas of how this could be done.
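
To give a flavor of the idea, here is a minimal callback-style sketch of failing over from one call to another. It is not the s3-s3 implementation or its API; the function names are hypothetical, and each wrapped function is assumed to fill in its own bucket name and region.

```javascript
// Calls `primary(params, cb)` and, if it reports an error, calls
// `secondary(params, cb)` with the same parameters.
// `onFailure` (optional) is told about the primary error.
function withFailover(primary, secondary, onFailure) {
  return function (params, callback) {
    primary(params, function (primaryErr, data) {
      if (!primaryErr) {
        return callback(null, data);
      }
      if (onFailure) {
        onFailure(primaryErr);
      }
      secondary(params, callback);
    });
  };
}

// Stand-ins for calls against the primary and failover buckets.
function brokenPrimary(params, cb) { cb(new Error('primary unavailable')); }
function workingSecondary(params, cb) { cb(null, { Body: 'from failover' }); }

var getObject = withFailover(brokenPrimary, workingSecondary);
getObject({ Key: 'example.json' }, function (err, data) {
  console.log(err, data.Body);
});
```

A real version would probably fail over only on errors that suggest the region is unhealthy, rather than on something like a missing key.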

Testing failure

So you've set up your failover bucket and ensured its usage within your application. Now what? Sure, you could deploy and call it done, but it's nice to know that things really work right before a disaster shows they don't.

The AWS SDK does provide a way to route S3 calls differently so that you can put them through a proxy that will simulate failure. You can do this by setting the endpoint property in the AWS config:

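    // Route calls for the primary bucket through the proxy.
    var endpoint = new aws.Endpoint(s3PrimaryProxy);
    awsConfigPrimary.endpoint = endpoint;
    // Treat the endpoint as addressing the bucket itself.
    awsConfigPrimary.s3BucketEndpoint = true;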

Set s3PrimaryProxy to the full URL of the proxy of your choice. I've seen good work done with toxy.
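
If you only need to confirm that failover kicks in at all, something cruder than a real proxy can do. The sketch below is a stub endpoint built on Node's http module that answers every request with a 503; it does not forward anything to S3, and the port and function names are arbitrary choices for the example.

```javascript
var http = require('http');

// Answers every request with a 503 and an S3-style XML error body.
function alwaysUnavailable(req, res) {
  res.writeHead(503, { 'Content-Type': 'application/xml' });
  res.end('<Error><Code>ServiceUnavailable</Code>' +
    '<Message>Simulated outage</Message></Error>');
}

// Creates the stub server without starting it.
function createFailingEndpoint() {
  return http.createServer(alwaysUnavailable);
}

// Usage (the port is an arbitrary choice):
//   createFailingEndpoint().listen(4569);
//   var s3PrimaryProxy = 'http://localhost:4569';
```

A proxy such as toxy is the better fit once you want partial failures, added latency, or traffic that still reaches S3.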

Monitoring failure

If you've written and tested things well, you'll never even know things have gone badly. Well, unless you read any tech news anywhere, as they will all report on an S3 apocalypse. But it's nice to find out yourself.

The s3-s3 library has a failure event that you can use to send metrics or alerts when a failure happens. I have seen some of these go off already, interestingly enough. The best guess is that these are just network blips, and it's a fairly tiny fraction of calls.
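
As a sketch of what hooking into such an event might look like, the example below counts 'failure' events and passes each one to a reporting function. A plain EventEmitter stands in for the library object here; which object emits the event and what arguments it passes are assumptions on my part, so check the library's README for the real shape. The metric name is made up.

```javascript
var EventEmitter = require('events').EventEmitter;

// Counts 'failure' events from `emitter` and reports each one through
// `report` (for example, a metrics client or an alerting hook).
function watchFailures(emitter, report) {
  var counts = { failures: 0 };
  emitter.on('failure', function (err) {
    counts.failures += 1;
    report('s3.primary.failure', err);
  });
  return counts;
}

// A plain EventEmitter stands in for the library's object here.
var fakeRequest = new EventEmitter();
var reported = [];
var stats = watchFailures(fakeRequest, function (name, err) {
  reported.push(name + ': ' + err.message);
});
fakeRequest.emit('failure', new Error('simulated network blip'));
console.log(stats.failures, reported);
```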

Another way to find out about failures is to look at the S3 bucket logs. The logs can take a while to fill in, so it won't be an instant notification, but it will tell you eventually, likely within an hour. To read the logs, I've liked how Sumo Logic shows things, although there are many other tools. You may need multiple tools here to get exactly what you need.
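
If you want to pick through the logs yourself, a small parser is enough to spot server-side errors. The sketch below reads the leading fields of one S3 server access log line and flags 5xx responses. The regular expression covers only the fields it needs and assumes the documented field order; the sample line is made up.

```javascript
// Pulls a few fields out of one S3 server access log line.
// Returns null when the line does not match the expected layout.
var LOG_LINE = /^\S+ (\S+) \[([^\]]+)\] \S+ \S+ \S+ (\S+) (\S+) "[^"]*" (\d{3}) (\S+)/;

function parseAccessLogLine(line) {
  var m = LOG_LINE.exec(line);
  if (!m) {
    return null;
  }
  return {
    bucket: m[1],
    time: m[2],
    operation: m[3],
    key: m[4],
    status: Number(m[5]),
    errorCode: m[6]
  };
}

// True when the parsed entry records a 5xx response.
function isServerError(entry) {
  return entry !== null && entry.status >= 500;
}

// Made-up sample line in the documented field order.
var sample = 'OWNERID example-primary [06/Feb/2019:00:00:38 +0000] ' +
  '192.0.2.3 REQUESTER 3E57427F3EXAMPLE REST.GET.OBJECT config.json ' +
  '"GET /example-primary/config.json HTTP/1.1" 503 SlowDown 0 - 7 - ' +
  '"-" "aws-sdk-nodejs" -';
var entry = parseAccessLogLine(sample);
console.log(entry.status, isServerError(entry));
```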

Moving on

And finally, one less failure to worry about! Let me know in the comments if I can explain any of the many details better or what alternatives to the above you've discovered.