Utilising AWS Lambda to migrate an S3 bucket with 25,000,000+ images

When AWS announced AWS Lambda at last year's re:Invent, we were really excited about it here at Mind Candy. The concept of a zero-administration compute platform that is scalable, cheap, easy to use and at the same time integrates with so many AWS services through triggers is pretty exciting and potentially very powerful.

Since then, we started using AWS Lambda in some of our products – PopJam being one of them. We use it to near-instantly generate thumbnails of all the amazing creations users of PopJam share through the app.

Recently, an interesting story surfaced in our sprint: migrating one of the AWS S3 buckets PopJam uses from US to EU (to bring it closer to the backend and our users) without any downtime for users.

Now, you might think: "why would that be interesting?"

The answer is the scale of the task: 25,000,000+ objects.

The aforementioned AWS S3 bucket stores over 25,000,000 files (mostly images) and this number is growing faster every single day. Just running 's3cmd du' on the bucket took almost a day. When I tried to perform 's3cmd ls' to count the number of keys in the bucket, I got bored before it finished (I had to write a simple Python script that uses multiprocessing to split the counting across 256 parallel workers; only then did it finish within a few minutes).
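
Our counting script was Python, but the same idea sketched in Node.js (the language used elsewhere in this post) looks roughly like the following; the bucket name and the assumption that keys are evenly spread across two-character hex prefixes are purely illustrative.

var AWS = require('aws-sdk');
var s3 = new AWS.S3({region: 'us-east-1'});

// count every key under a given prefix, following the pagination marker
function countPrefix(bucket, prefix, marker, count, cb) {
    var params = {Bucket: bucket, Prefix: prefix};
    if (marker) params.Marker = marker;
    s3.listObjects(params, function(err, data) {
        if (err) return cb(err);
        count += data.Contents.length;
        if (data.IsTruncated) {
            var last = data.Contents[data.Contents.length - 1].Key;
            return countPrefix(bucket, prefix, last, count, cb);
        }
        cb(null, count);
    });
}

// fan the work out over 256 prefixes ('00'..'ff') so the listings run in parallel
var prefixes = [];
for (var i = 0; i < 256; i++) {
    prefixes.push(('0' + i.toString(16)).slice(-2));
}

var total = 0;
var pending = prefixes.length;
prefixes.forEach(function(prefix) {
    countPrefix('popjam-old-bucket', prefix, null, 0, function(err, count) {
        if (err) console.log(prefix, err);
        else total += count;
        if (--pending === 0) console.log('total keys:', total);
    });
});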

Obviously, any existing CLI command like s3cmd sync or the AWS CLI s3 commands was out of the question: by the time it finished (after many, many hours), the source bucket would contain tens of thousands of new files which hadn't been copied across, and re-running it would just lead to the same situation.

As I mentioned, AWS Lambda functions can be triggered by other AWS services, one of them being AWS S3. Essentially, we can configure an AWS S3 bucket to invoke a Lambda function whenever a new object (key) is created.

Given this, we could attach a Lambda function to the old bucket that is triggered whenever a new key is created (the ObjectCreated event) and copies that key over to the new bucket. Then we'd only have to sync the old bucket to the new one once, without having to worry about missing keys along the way.
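
For reference, here's a minimal sketch of wiring that trigger up with the Node.js SDK; the bucket name and Lambda function ARN are placeholders, and S3 additionally needs permission to invoke the function (lambda.addPermission), which is omitted here. The same configuration can also be done through the console.

var AWS = require('aws-sdk');
var s3 = new AWS.S3({region: 'us-east-1'});

// fire the copy function for every new key in the old bucket
s3.putBucketNotificationConfiguration({
    Bucket: 'popjam-old-bucket',
    NotificationConfiguration: {
        LambdaFunctionConfigurations: [{
            Events: ['s3:ObjectCreated:*'],
            LambdaFunctionArn: 'arn:aws:lambda:us-east-1:123456789012:function:copyToEU'
        }]
    }
}, function(err) {
    if (err) console.log(err, err.stack);
    else     console.log('copy function will now be invoked for every new key');
});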

The proposed plan looked like this:

  1. Create new S3 bucket in EU
  2. Set up AWS Lambda Copy function and configure it to be triggered whenever a new key is added
  3. Run aws s3 sync command in background
  4. Wait, wait, wait…
  5. Reconfigure CDN to use the new bucket as origin
  6. Switch backend application to upload all images from now on, to the new S3 bucket in EU

This plan also meant there should be zero downtime during the whole migration. Everyone likes zero downtime migrations, right?

The actual implementation, while not very painful, did uncover a few issues with the plan that had to be dealt with. These issues resulted in some learnings which I wanted to share here.

AWS Lambda copy object function

The Lambda function code to perform the copy happens to be pretty trivial.

var AWS = require('aws-sdk');
var util = require('util');

exports.handler = function(event, context) {
        var s3 = new AWS.S3({region: 'eu-west-1'});

        // the event record tells us which bucket/key triggered the invocation
        var params = {
                Bucket: 'popjam-new-bucket',
                CopySource: event.Records[0].s3.bucket.name + '/' + event.Records[0].s3.object.key,
                Key: event.Records[0].s3.object.key,
                ACL: 'public-read'
        };

        s3.copyObject(params, function(err, data) {
                if (err) {
                        console.log(err, err.stack);  // an error occurred
                        context.fail(err);            // mark the invocation as failed
                } else {
                        context.done();               // successful response
                }
        });
};

It just works, but there's one small catch…

… what happens to S3 object ACLs should they be changed in the meantime?

We needed ACLs for particular objects to be in-sync (for various reasons, one of them being moderation).

Given that the Lambda function is triggered only on the ObjectCreated event (there sadly isn't a way to trigger it when an existing object is modified), if an object's ACL changes after it has been copied, there's no way to propagate that change through AWS Lambda at this stage.

We worked around this problem by writing a Python script that iterates through both S3 buckets, compares the ACLs and fixes them where needed (as before, we had to parallelise it, otherwise it would take ages).
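
The real tool was a parallelised Python script, but the core of the idea, sketched in Node.js for a single key, looks like this; the bucket names are placeholders and the naive grant comparison glosses over grant ordering and cross-account ownership.

var AWS = require('aws-sdk');
var usS3 = new AWS.S3({region: 'us-east-1'});
var euS3 = new AWS.S3({region: 'eu-west-1'});

// fetch both ACLs and, if they have drifted, copy the source grants across
function syncAcl(key, cb) {
    usS3.getObjectAcl({Bucket: 'popjam-old-bucket', Key: key}, function(err, src) {
        if (err) return cb(err);
        euS3.getObjectAcl({Bucket: 'popjam-new-bucket', Key: key}, function(err, dst) {
            if (err) return cb(err);
            if (JSON.stringify(src.Grants) === JSON.stringify(dst.Grants)) return cb();
            euS3.putObjectAcl({
                Bucket: 'popjam-new-bucket',
                Key: key,
                AccessControlPolicy: {Owner: dst.Owner, Grants: src.Grants}
            }, cb);
        });
    });
}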

Beware of AWS Lambda limits!

While pretty scalable, AWS Lambda does have some limits. We were bitten by the "Concurrent requests per account" and "Requests per second per account" limits a few times (fortunately, we did just enough with AWS Lambda to get the attention of the AWS Lambda product team, and they kindly raised these limits for us).

For most use cases those limits should be fine, but in our case, where on top of the Lambda copy function we were also triggering a series of functions to generate thumbnails, we hit them pretty quickly and had to temporarily throttle our migration scripts.

AWS Lambda is still a pretty bleeding-edge technology

AWS Lambda will work great for you most of the time. However, when it fails, troubleshooting can be quite … inconvenient to say the least.

Remember you can now view all AWS Lambda logs through CloudWatch – make use of them and don’t shy away from placing debug statements in your code.

The deployment of AWS Lambda functions is pretty tricky, too. While there are some options, it's still at an early stage and it feels like even AWS is still trying to figure it out (potentially through feedback from customers; if you use AWS Lambda, do make sure to send feedback to AWS).

The most interesting tool I've found for deploying and integrating with AWS Lambda in general is kappa.

And all of this for what?

Let the graph speak for itself…

(the graph represents upload time to S3 bucket in US – green line, and S3 bucket in EU – orange line – after migration)

Testing with Amazon SQS

We all know how great Amazon SQS is, and here at Mind Candy we use it extensively in our projects.

Quite recently, we started making some changes to our Data Pipeline in order to speed up our Event Processing, and we found ourselves with the following problem: how can we generate thousands of messages (events) to benchmark it? The first solution that came to mind was to use the AWS Command Line Interface, which is a very nifty tool and works great.

The AWS Command Line Interface SQS module comes with the ability to send out messages in batches, with a maximum of 10 messages per batch, so we said: “right, let’s write a bash script to send out some batches”, and so we did.

Problem

It worked alright, but it had some problems:

  • It was slow, because messages were sent in batches of up to 10 messages and not in parallel
  • The JSON payload had to contain some metadata along with the same message repeated 10 times (once for each message entry)
  • If you needed to send 15 messages, you would have to have 1 message batch with 10 entries and another one with 5 entries (2 JSON files)
  • Bash scripts are not the best thing in the world for maintenance

So, what did we do to solve it? We wrote our own command line program, of course!

Solution: meet sqs-postman

Writing command line applications in Node.js is very easy with the aid of the good old Commander.js. Luckily, AWS has an SDK for Node.js, which means we don't need to worry about AWS authentication, SQS API design, and so on. Convenient? Absolutely!

Sqs-postman was designed with the following features out of the box:

  • Sends messages in batches of up to 10 messages at a time (AWS limit)
  • Batches are sent out in parallel using a default of 10 producers, which can be configured using the --concurrent-producers option
  • A single message is read from disk, and expanded into the total number of messages that need to be sent out
  • It supports AWS configuration and profiles

In order to solve the "messages in parallel" problem, we used the async library. We basically split the messages into batches and then use eachLimit to determine how many batches can be executed in parallel; this defaults to 10 but can be configured with an option.
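
A condensed sketch of that approach (not sqs-postman's actual source; the queue URL is a placeholder):

var AWS = require('aws-sdk');
var async = require('async');
var sqs = new AWS.SQS({region: 'eu-west-1'});

function sendAll(queueUrl, body, total, concurrentProducers, done) {
    // expand a single message into `total` copies, split into batches of 10 (the AWS limit)
    var batches = [];
    for (var i = 0; i < total; i += 10) {
        var entries = [];
        for (var j = i; j < Math.min(i + 10, total); j++) {
            entries.push({Id: String(j), MessageBody: body});
        }
        batches.push(entries);
    }

    // send up to `concurrentProducers` batches in parallel
    async.eachLimit(batches, concurrentProducers, function(entries, cb) {
        sqs.sendMessageBatch({QueueUrl: queueUrl, Entries: entries}, cb);
    }, done);
}

sendAll('https://sqs.eu-west-1.amazonaws.com/123456789012/my-test-queue',
        JSON.stringify({message: 'hello from sqs-postman'}), 1000, 10,
        function(err) {
            if (err) console.log(err);
            else     console.log('all batches sent');
        });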

Can I see it in action?

Of course you can! sqs-postman has been published to npm, so you can install it by running:

 npm install -g sqs-postman

Once installed, just follow these simple steps:

  • Make sure to configure AWS
  • Create a file containing the message, e.g. message.json with some dummy content
    {
       "message": "hello from sqs-postman"
    }
  • Run it
    $ postman message my-test-queue --message-source ./message.json --concurrent-producers 100 --total 1000

If you would like to see more information, debug mode can be enabled by prefixing the command with DEBUG=sqs-postman

Text is boring, show me some numbers!

You are absolutely right! If we don’t share some numbers, it will be hard to determine how good sqs-postman is.

Messages    aws-cli        sqs-postman
100         0m 4.956s      0m 0.90s
1000        2m 31.457s     0m 4.18s
10000       8m 30.715s     0m 30.83s

As you can appreciate, the difference in performance between aws-cli and sqs-postman is huge: at 10,000 messages, sqs-postman is roughly 16x faster (about 31 seconds versus 8.5 minutes). Because sqs-postman processes batches in parallel (via async), the execution time is reduced quite considerably.

These tests were performed on a MacBook Pro (15-inch, Mid 2012) with a 2.6 GHz Intel Core i7 and 16 GB of 1600 MHz DDR3 RAM, and time was measured using the Unix time command.

Conclusion

Writing this Node.js module was very easy (and fun). It clearly shows the power of Node.js for writing command line applications and how extensive the module library is when it comes to reusing existing modules/components (e.g. AWS SDK).

The module has been open sourced and can be found here; full documentation can be found there too.

As usual, feel free to raise issues or better yet contribute (we love contributors!).

Deployments using All the Things!

As we’ve mentioned in previous posts, we use AWS services extensively at Mind Candy. One of the services that we’ve blogged about before is CloudFormation. CloudFormation (CF) lets us template multiple AWS resources for a given product into a single file which can be easily version controlled in our internal Git implementation.

Our standard setup for production is to use CF to create Autoscaling Groups for all EC2 instances where, as Bart posted a while back, we mix and match our usage of on-demand instances and spot priced instances to get the maximum compute power for our money.

During load testing of the backend services for our games we did, however, notice a flaw in the way we were doing things: the speed with which we could scale up under rapid traffic surges, such as those generated by feature placement in mobile app stores.

The core problem for us was that our process started from a base Amazon Machine Image (AMI); after initial boot it would then call into Puppet to be configured from the ground up. This meant that a scale-up event could take many minutes to complete – even with SSD-backed instances – which isn't ideal.

Not only could this take a long time – when it worked – but we were also dependent on third-party repositories being available, or praying that Ruby gem installations actually worked. If a third party was not available then the instances would not even come up, which is a worse position to be in than just being slow.

The obvious answer to this problem is to cut an AMI of the whole system and use that for scaling up. However, this poses another problem: your AMI becomes a cliff edge that sits outside of your configuration management system.

This is not a particularly new problem or conundrum of course. I can personally recall quite heated debates in previous companies about the merits of using AMIs versus a configuration management system alone.

We thought about this ourselves too and came to the conclusion that instead of accepting this binary choice we'd split the difference and use both. We achieved this by modularising our deployment process for production and using a number of different tools.

The Tools

Teamcity – we were already using our continuous integration system as the initiator of our non-production deployments, so we decided to leverage all the good stuff we already had there and, crucially, let our different product teams deploy their own builds to production while we just support the process.

Fabric – we’ve been using Fabric for deployments for quite some time already. Thanks to the excellent support for AWS through the Boto library we were easily able to utilise the Amazon API to programmatically determine our environments and services within our Fabric scripts.

Puppet – when you have just one server for a product, a push deploy method makes sense as it's quick. However, this doesn't scale. Bart created a custom Puppet provider that could retrieve a versioned deployment from S3 (pushed via Fabric), so we could pull our code deploys onto remote hosts.

Packer – we opted to use Packer to build our AMIs. With Packer, we could version control our environments and then build a stable image of a fully puppetized host which would also have the latest release of code running at boot, but could still run Puppet as normal. This meant we could remove the AMI cliff edge because, at the very worst, we would bring up the AMI and then pull in anything that was missing, but do so quickly as it was already "pre-puppetized".

Cloudformation – once we had a working AMI we could then update our version controlled templates and poke the Amazon API to update them in CloudFormation. All scaling events would then occur using the new AMI containing the released version of code.
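
As a hedged illustration of that last step, here is roughly what "poking the Amazon API" could look like with the Node.js SDK; our own tooling drives this from the deployment scripts, and the stack name, template path and AmiId parameter below are assumptions.

var AWS = require('aws-sdk');
var fs = require('fs');
var cloudformation = new AWS.CloudFormation({region: 'eu-west-1'});

// push the version controlled template with the freshly cut AMI as a parameter
cloudformation.updateStack({
    StackName: 'product-app-production',
    TemplateBody: fs.readFileSync('templates/product-app.json', 'utf8'),
    Parameters: [{ParameterKey: 'AmiId', ParameterValue: 'ami-0123abcd'}]
}, function(err, data) {
    if (err) console.log(err, err.stack);
    else     console.log('stack update started:', data.StackId);
});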

The Process – when you hit “Run” in Teamcity

  1. Checkout from git the Fabric repo, the Packer repo and the Cloudformation repo.
  2. Using a config file passed to Fabric, run a task that queries the Amazon API and discovers our current live infrastructure for a given application/service.
  3. Administratively disable Puppet on the current live infrastructure so Puppet doesn’t deploy code from S3 outside of the deployment process.
  4. Push our new version of code to S3.
  5. Initiate a Packer build, launching an instance and deploying the new code release.
  6. Run some smoke tests on the Packer instance to confirm and validate deployment.
  7. Cut the AMI and capture its ID from the API when it's complete.
  8. Re-enable and run Puppet on our running infrastructure thus deploying the new code.
  9. Update our Cloudformation template with the new AMI and push the updated template to the CloudFormation API.
  10. Check-in the template change to Git.
  11. Update our Packer configuration file to use the latest AMI as its base image for the next deploy.

What we’ve found with this set-up is, for the most part, a robust means of using Puppet to deploy our code in a controlled manner, and being able to take advantage of all the gains you get when autoscaling from baked AMI images.

Obviously we do run the risk of a scaling event occurring during deployment; however, by linking the AMI-cutting process with Puppet we have yet to experience this edge case. Plus, all our code deploys are (and should be) backwards compatible, so the edge case doesn't pose much of a risk in our set-up.


Cutting the AWS bill with spot instances

AWS has definitely changed the way we all approach infrastructure these days, especially here at Mind Candy.

We're finally not limited by the amount of available hardware, so we can get whatever resources we need (well, nearly), whenever we need them, plus we get CloudFormation.

However, as exciting as spawning 100+ servers can be, as with many things, if you’re not cautious and smart, it can cost you a lot of money.

One way to save a bit of money on your AWS bill (and “a bit” is a serious understatement) is by utilising Spot Instances.

“Spot Instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current Spot Price, which varies in real-time based on supply and demand.” – http://aws.amazon.com/ec2/purchasing-options/spot-instances/

How much can you save? Well, the c3.large instances which we use across the board for our application tier cost $0.12 per hour on-demand. With spot pricing we get the same instance type most of the time for around $0.02 per hour. That's 6x cheaper than on-demand.

So what's the trade-off? Well, if for some reason the spot price exceeds your bid price, your spot requests will get cancelled and your spot instances will be killed. In short, your instances can and will die at random times, and it's not 100% guaranteed that you'll get them when you want them.

That's not good. Even if you use CloudFormation and auto-scaling, you could end up without instances when the spot price becomes too high; if you're not prepared for it, that could be almost the same as an AZ failure.

However, there's a way to mitigate that risk. In a single CloudFormation template, you can create two launch configurations: one for on-demand instances and another for spot instances. With carefully tweaked scaling thresholds, you can make spot instances preferred over on-demand instances, but still ensure on-demand takes over should spot instances no longer be available at your bid price.

This way, if you can get spot instances, your stack will be pretty much fully built from spot instances. If (and when) the spot price goes over your bid price, spot instances will start getting killed and on-demand instances will start booting up to take their place. When the spot price returns below your bid price, spot instances will start booting up again, slowly phasing out the on-demand instances.

After a few weeks of testing we came up with a set of thresholds which work pretty well for us and keep our stacks stable around the clock.

With on-demand, we always have a single instance running by setting the minimum to 1. A scale-up event happens when our average CPU usage exceeds 80% for a 5-minute period, at which point we increase the on-demand autoscale group by 2 instances. We then scale down 1 instance at a time if the average CPU usage is less than 65% for a period of 5 minutes, and we ensure that a scale-down event only happens once in any 15-minute period.

With spot instances, we also request a minimum of 1 instance, but we set ourselves a bid price of $0.12 (remember, the bid price is not the price you pay; it's the maximum you are willing to pay). Most of the time the spot price we pay is just $0.02!

As with on-demand, we scale on average CPU in the spot autoscale group. However, we scale up whenever we reach 50% CPU usage (instead of 80%), and we also add 2 instances at a time. We scale down and cancel spot instances when we dip below 30% CPU usage.
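
To make the "two launch configurations" idea concrete, here is a rough sketch using the Node.js SDK; we actually declare all of this in a CloudFormation template, and the names and AMI ID below are placeholders.

var AWS = require('aws-sdk');
var autoscaling = new AWS.AutoScaling({region: 'eu-west-1'});

function logIfError(err) {
    if (err) console.log(err, err.stack);
}

// on-demand launch configuration: the safety net that always keeps at least one instance up
autoscaling.createLaunchConfiguration({
    LaunchConfigurationName: 'app-ondemand',
    ImageId: 'ami-0123abcd',
    InstanceType: 'c3.large'
}, logIfError);

// spot launch configuration: bid the on-demand price, usually pay around $0.02/hour
autoscaling.createLaunchConfiguration({
    LaunchConfigurationName: 'app-spot',
    ImageId: 'ami-0123abcd',
    InstanceType: 'c3.large',
    SpotPrice: '0.12'
}, logIfError);

The scaling thresholds described above then live in CloudWatch alarms attached to scaling policies on each autoscale group, with the spot group's thresholds set lower (50%/30% rather than 80%/65%) so it is always the first to grow and the last to shrink.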

The result is probably best shown as a picture from Ice, a great tool from Netflix that helps manage AWS costs. Below is the hourly cost of one of our app tiers before and after we started utilising spot instances.

[Ice graph: hourly cost of the app tier before and after the switch to spot instances]

For us, in the case of this specific stack, spot instances gave us savings of up to 60%. Bear in mind that this particular stack is quite small (up to 10-12 instances at peak), so the bigger the stack, the bigger the savings you'll see!

To wrap up, I just wanted to share a few tips and tricks we picked up along the way that should help you:

  • bake AMIs; tools like Packer will greatly help you do this, and it will let you minimise the time required to boot up a new instance; it gives you much more than that, but boot time is crucial when it comes to scale-up events, especially when spot instances are being killed and you want on-demand instances to fill the gaps ASAP. We managed to get the time required to boot up a new instance down to around 75 seconds
  • use EBS-backed instances; they cost a fraction more (and yeah, EBS can be painful) but they boot significantly faster than ephemeral-storage backed instances
  • set your bid price to the on-demand instance price; this way, in the worst case you'll pay what you'd normally pay for an on-demand instance
  • utilise reserved instances for the “base” on-demand instances in on-demand stacks
  • did I mention Ice from Netflix? Use it!

<shamelessplug> Obviously, the most important requirement is having an awesome application that is cloud-friendly. If you're interested in building cloud-native applications and awesome infrastructures, we'd love to hear from you! ;-) </shamelessplug>

That’ll be all folks! Happy spot-instancing!