Utilising AWS Lambda to migrate a 25,000,000+ image S3 bucket

When AWS announced AWS Lambda at last year’s re:Invent, we were really excited about it here at Mind Candy. The concept of a zero-administration compute platform, that is very scalable, cheap and so easy to use AND at the same time integrates with so many AWS services through triggers is pretty exciting and potentially – very powerful.

Since then, we started using AWS Lambda in some of our products – PopJam being one of them. We use it to near-instantly generate thumbnails of all the amazing creations users of PopJam share through the app.

Recently, a quite interesting story surfaced on our sprint – we were to migrate one of the AWS S3 buckets PopJam uses, from US to EU (to bring it closer to the backend and users) without any downtime for users.

Now, you’ll think – “why would that be interesting?”

The answer is – 25,000,000+ – the scale of this task.

The aforementioned AWS S3 bucket stores over 25,000,000 files (mostly images) and this number is growing faster every single day. Just running ‘s3cmd du’ on the bucket took almost a day. When I tried to perform ‘s3cmd ls’ to count the number of keys in the bucket, I got bored before it finished (I had to write a simple Python script that utilises multi-processing and splits the process of counting into 256 threads; only then would it finish within a few minutes).
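To give an idea of how such a split can work, here is a minimal Node.js sketch (not the original Python script). It assumes that keys begin with two hexadecimal characters, which would give 256 independent listings; that is a guess based on the number 256, not something stated above. `countKeysWithPrefix` is a hypothetical function that, in real use, would page through an S3 list call for one prefix and resolve to the number of keys found.

```javascript
// Build the 256 two-character hex prefixes: "00", "01", ... "ff".
// Assumption: object keys start with hex characters (e.g. hashed names).
function hexPrefixes() {
  const prefixes = [];
  for (let i = 0; i < 256; i++) {
    prefixes.push(i.toString(16).padStart(2, '0'));
  }
  return prefixes;
}

// Count keys under every prefix, running at most `concurrency` listings at once.
// `countKeysWithPrefix(prefix)` is hypothetical and must return a Promise<number>.
async function countAllKeys(countKeysWithPrefix, concurrency) {
  const queue = hexPrefixes();
  let total = 0;

  async function worker() {
    while (queue.length > 0) {
      const prefix = queue.shift();
      // Await first, then add: "total += await ..." would read a stale total.
      const count = await countKeysWithPrefix(prefix);
      total += count;
    }
  }

  const workers = [];
  for (let i = 0; i < concurrency; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return total;
}
```

Each worker pulls the next prefix from a shared queue, so slow prefixes don't hold up the others.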

Obviously, any existing CLI command like s3cmd sync or the AWS CLI s3 commands is out of the question: before it finishes (after many, many hours), the source bucket will have tens of thousands of new files which haven’t been copied across, and we’d have to re-run it, which would lead to the same situation.

As I mentioned, AWS Lambda functions can be triggered by other AWS services, one of them being AWS S3. Essentially, we can configure an AWS S3 bucket to invoke a Lambda function whenever a new object (key) is created.

Given this, we could create a Lambda function on the old bucket that is triggered whenever a new key is created (ObjectCreated event) and copies the new key over to the new bucket. Then we’d only have to sync the old bucket to the new one, without having to worry about missing some keys on the way.
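For reference, this is roughly what a function receives when S3 invokes it. The sketch below pulls the bucket and key out of each record; the bucket name and key in `sampleEvent` are made up for illustration. One detail worth knowing is that object keys arrive URL-encoded in the event, with spaces turned into "+".

```javascript
// Extract the source bucket and object key from an S3 event notification.
function objectsFromEvent(event) {
  return event.Records
    .filter((record) => record.eventName.startsWith('ObjectCreated:'))
    .map((record) => ({
      bucket: record.s3.bucket.name,
      // Keys are URL-encoded in the event, with spaces encoded as "+".
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
    }));
}

// A trimmed-down sample event; real events carry many more fields.
const sampleEvent = {
  Records: [
    {
      eventName: 'ObjectCreated:Put',
      s3: {
        bucket: { name: 'popjam-old-bucket' }, // hypothetical bucket name
        object: { key: 'uploads/my+drawing%281%29.png' }, // hypothetical key
      },
    },
  ],
};

console.log(objectsFromEvent(sampleEvent));
```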

The proposed plan looked like this:

  1. Create new S3 bucket in EU
  2. Set up AWS Lambda Copy function and configure it to be triggered whenever a new key is added
  3. Run aws s3 sync command in background
  4. Wait, wait, wait…
  5. Reconfigure CDN to use the new bucket as origin
  6. Switch backend application to upload all images from now on, to the new S3 bucket in EU

This plan also meant there should be zero downtime during the whole migration. Everyone likes zero downtime migrations, right?

The actual implementation, while not very painful, did uncover a few issues with the plan that had to be dealt with. These issues resulted in some learnings which I wanted to share here.

AWS Lambda copy object function

The Lambda function code to perform the copy happens to be pretty trivial.

var AWS = require('aws-sdk');

exports.handler = function(event, context) {
        var s3 = new AWS.S3({region: 'eu-west-1'});
        var record = event.Records[0].s3;

        var params = {
                Bucket: 'popjam-new-bucket',  // destination bucket
                CopySource: record.bucket.name + '/' + record.object.key,
                Key: record.object.key,
                ACL: 'public-read'
        };

        s3.copyObject(params, function(err, data) {
                if (err) {
                        console.log(err, err.stack);  // an error occurred
                        context.done(err);  // report the failure instead of hanging
                } else {
                        context.done();  // successful response
                }
        });
};

It just works, but there’s one small catch…

… what happens to S3 object ACLs should they be changed in the meantime?

We needed ACLs for particular objects to be in-sync (for various reasons, one of them being moderation).

Given that the AWS Lambda function is triggered on the ObjectCreated event (there sadly isn’t a way to trigger it on ObjectModify), there’s no way to propagate an ACL change through AWS Lambda at this stage.

We worked around this problem by writing a Python script that iterates through both S3 buckets, compares ACLs and tweaks them where needed (as before, we had to parallelise it, otherwise it’d take ages).
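The comparison step itself is simple. Below is a minimal Node.js sketch (not the Python script we used) that decides whether two object ACLs differ. It assumes both ACLs have the shape returned by S3's get-object-ACL call, i.e. a `Grants` list of `{ Grantee, Permission }` entries; `grantSignature` and `aclsDiffer` are names made up for this example.

```javascript
// Reduce a grant to a comparable string such as
// "http://acs.amazonaws.com/groups/global/AllUsers:READ".
function grantSignature(grant) {
  const who = grant.Grantee.URI || grant.Grantee.ID || grant.Grantee.EmailAddress;
  return who + ':' + grant.Permission;
}

// True when the two ACLs do not contain exactly the same set of grants.
function aclsDiffer(sourceAcl, targetAcl) {
  const a = sourceAcl.Grants.map(grantSignature).sort();
  const b = targetAcl.Grants.map(grantSignature).sort();
  return a.length !== b.length || a.some((signature, i) => signature !== b[i]);
}

// Example: the source object was made private, the copy is still public-read.
const ownerGrant = {
  Grantee: { Type: 'CanonicalUser', ID: 'owner-id' }, // hypothetical owner ID
  Permission: 'FULL_CONTROL',
};
const publicRead = {
  Grantee: { Type: 'Group', URI: 'http://acs.amazonaws.com/groups/global/AllUsers' },
  Permission: 'READ',
};

console.log(aclsDiffer({ Grants: [ownerGrant] }, { Grants: [ownerGrant, publicRead] }));
```

When `aclsDiffer` returns true, a real script would then write the source ACL onto the target object.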

Beware of AWS Lambda limits!

While being pretty scalable, AWS Lambda has got some limits. We were bitten by the “Concurrent requests per account” and “Requests per second per account” limits a few times (fortunately we did just enough with AWS Lambda to get the attention of AWS Lambda product team and they kindly raised these limits for us).

For most of the use cases those limits should be fine, but in our case, when on top of the AWS Lambda copy function we were also triggering a series of functions to generate thumbnails, we hit these limits pretty quickly and had to temporarily throttle our migration scripts.
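How you throttle depends on your scripts. As one possible approach (an illustration, not what our migration scripts actually did), the sketch below processes items in slices of at most `perSecond` and pauses for a second between slices. `handler` stands in for whatever triggers the work, and the `sleep` parameter exists only so the pause can be swapped out.

```javascript
// Default pause implementation based on setTimeout.
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Call `handler` for every item, starting at most `perSecond` calls per second.
async function throttled(items, perSecond, handler, sleep = defaultSleep) {
  for (let i = 0; i < items.length; i += perSecond) {
    const slice = items.slice(i, i + perSecond);
    // Run one slice concurrently and wait for all of it to finish.
    await Promise.all(slice.map((item) => handler(item)));
    // Pause before the next slice, but not after the last one.
    if (i + perSecond < items.length) {
      await sleep(1000);
    }
  }
}
```

This is deliberately crude: it trades some throughput for staying under a per-second limit.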

AWS Lambda is still pretty bleeding edge technology

AWS Lambda will work great for you most of the time. However, when it fails, troubleshooting can be quite … inconvenient to say the least.

Remember you can now view all AWS Lambda logs through CloudWatch – make use of them and don’t shy away from placing debug statements in your code.

The deployment of AWS Lambda functions is pretty tricky, too. While there are some options, it’s still at an early stage and it feels like even AWS is still trying to figure it out (potentially through feedback from customers – if you use AWS Lambda, do make sure to give feedback to AWS).

The most interesting tool I found to support deployment and integration with AWS Lambda in general is kappa.

And all of this for what?

Let the graph speak for itself…

(the graph represents upload time to S3 bucket in US – green line, and S3 bucket in EU – orange line – after migration)

Testing with Amazon SQS

We all know how great Amazon SQS is, and here at Mind Candy we use it extensively in our projects.

Quite recently, we started making some changes to our Data Pipeline in order to speed up our Event Processing, and we found ourselves with the following problem: how can we generate thousands of messages (events) to benchmark it? The first solution that came into our minds was to use the AWS Command Line Interface, which is a very nifty tool and works great.

The AWS Command Line Interface SQS module comes with the ability to send out messages in batches, with a maximum of 10 messages per batch, so we said: “right, let’s write a bash script to send out some batches”, and so we did.


It worked alright, but it had some problems:

  • It was slow, because messages were being sent in batches of up to 10 messages and not in parallel
  • The JSON payload had to contain some metadata along with the same message repeated 10 times (1 for each message entry)
  • If you needed to send 15 messages, you would have to have 1 message batch with 10 entries and another one with 5 entries (2 JSON files)
  • Bash scripts are not the best thing in the world for maintenance

So, what did we do to solve it? We wrote our own command line program, of course!

Solution: meet sqs-postman

Writing command line applications in Node.js is very very easy, with the aid of the good old Commander.js. Luckily, AWS has an SDK for Node.js, so that means that we don’t need to worry about: AWS authentication, SQS API design, etc. Convenient? Absolutely!

Sqs-postman was designed with the following features out of the box:

  • Sends messages in batches of up to 10 messages at a time (AWS limit)
  • Batches are sent out in parallel using a default of 10 producers, which can be configured using the --concurrent-producers option
  • A single message is read from disk, and expanded into the total number of messages that need to be sent out
  • It supports AWS configuration and profiles

In order to solve the “messages in parallel” problem, we used the async library. We basically split the messages into batches and we then use eachLimit to determine how many batches can be executed in parallel, which starts with a default value of 10 but can be configured with an option.
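To make the idea concrete, here is a minimal, dependency-free sketch of the same pattern. It is not the sqs-postman source: `runWithLimit` is a hand-rolled stand-in for the async library's eachLimit, and `sendBatch` is a hypothetical function that would call SQS's batch send in real use. The entries use the `Id` / `MessageBody` shape that SQS batch requests expect.

```javascript
// Repeat one message body `total` times (a single message is expanded).
function expandMessage(body, total) {
  return Array.from({ length: total }, () => body);
}

// Split messages into batches of at most `batchSize` entries (SQS allows 10).
function toBatches(messages, batchSize = 10) {
  const batches = [];
  for (let i = 0; i < messages.length; i += batchSize) {
    const entries = messages
      .slice(i, i + batchSize)
      .map((body, j) => ({ Id: String(i + j), MessageBody: body }));
    batches.push(entries);
  }
  return batches;
}

// Run `worker` over all items with at most `limit` calls in flight at once.
async function runWithLimit(items, limit, worker) {
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const item = items[next++];
      await worker(item);
    }
  }
  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
}

// Usage sketch: 15 messages become one batch of 10 and one batch of 5.
async function sendAll(body, total, concurrentProducers, sendBatch) {
  const batches = toBatches(expandMessage(body, total));
  await runWithLimit(batches, concurrentProducers, sendBatch);
  return batches.length;
}
```

Splitting first and limiting concurrency second means no hand-written JSON files are needed for odd totals such as 15.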

Can I see it in action?

Of course you can! sqs-postman has been published to npm, so you can install it by running:

 npm install -g sqs-postman

Once installed, just follow these simple steps:

  • Make sure to configure AWS
  • Create a file containing the message, e.g. message.json with some dummy content
       { "message": "hello from sqs-postman" }
  • Run it
    $ postman message my-test-queue --message-source ./message.json --concurrent-producers 100 --total 1000

If you would like to see more information, the debug mode can be enabled by prepending DEBUG=sqs-postman postman…

Text is boring, show me some numbers!

You are absolutely right! If we don’t share some numbers, it will be hard to determine how good sqs-postman is.

Messages    aws-cli        sqs-postman
100         0m 4.956s      0m 0.90s
1000        2m 31.457s     0m 4.18s
10000       8m 30.715s     0m 30.83s

As you can appreciate, the difference in performance between aws-cli and sqs-postman is huge! Because of sqs-postman’s ability to process batches in parallel (async), the execution time can be reduced quite considerably.

These tests were performed on a MacBook Pro 15-inch, Mid 2012 with a 2.6 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 RAM. Time was measured using the Unix time command.


Writing this Node.js module was very easy (and fun). It clearly shows the power of Node.js for writing command line applications and how extensive the module library is when it comes to reusing existing modules/components (e.g. AWS SDK).

The module has been open sourced and can be found here. Full documentation can be found in there too.

As usual, feel free to raise issues or better yet contribute (we love contributors!).

London PostgreSQL Meetup

The London PostgreSQL Group meetup is an unofficial PostgreSQL community event happening quarterly. The meetup agenda is very relaxed but it always involves a lot of good PostgreSQL discussions over some pizza and beer.

The event is always open to everyone and usually announced well in advance through meetup.com website — http://www.meetup.com/London-PostgreSQL-Meetup-Group

Mind Candy had the pleasure of hosting the meetup on 21 January 2015. We actually had a record attendance, which was awesome; thank you to everyone who came!

We had two really good talks. The first was a joint talk by Howard Rolph & Giovanni Ciolli about key features of the recently released PostgreSQL 9.4, followed by an awesome talk by Rachid Belaid about full-text search capabilities [1] in PostgreSQL (with a proper deep-dive into the technical details and how to do it). Apparently you don’t really need to build a totally separate Elasticsearch cluster if you want to store documents and perform the most usual operations on them; Postgres will do just as well! Who knew!

Howard talking about new key features in PostgreSQL 9.4

Again, thanks to everyone for coming, and especially to the great speakers. See you all next time!

[1] Slides available here https://speakerdeck.com/rach/postgres-full-text-search-is-good-enough

Cutting the AWS bill with spot instances

AWS has definitely changed the way we all approach infrastructures these days, especially here — at Mind Candy.

We’re finally not limited by the amount of available hardware, so we can get whatever amount of resources (well, nearly) we need, whenever we need, plus we get CloudFormations.

However, as exciting as spawning 100+ servers can be, as with many things, if you’re not cautious and smart, it can cost you a lot of money.

One way to save a bit of money on your AWS bill (and “a bit” is a serious understatement) is by utilising Spot Instances.

“Spot Instances allow you to name your own price for Amazon EC2 computing capacity. You simply bid on spare Amazon EC2 instances and run them whenever your bid exceeds the current Spot Price, which varies in real-time based on supply and demand.” – http://aws.amazon.com/ec2/purchasing-options/spot-instances/

How much can you save? Well, the c3.large instances which we use across the board for our application tier in on-demand pricing cost $0.12 per hour. When we use the same instance type with spot pricing we get them most of the time for around $0.02. That’s 6x cheaper compared to on-demand.
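The arithmetic is easy to check. The snippet below uses only the two prices quoted above; the function names are made up for illustration, and the 10-instance stack is just an example size.

```javascript
// Prices in USD per instance-hour, as quoted above for c3.large.
const ON_DEMAND_PRICE = 0.12;
const TYPICAL_SPOT_PRICE = 0.02;

// Hourly cost of running `instances` machines at a given price.
function hourlyCost(instances, pricePerHour) {
  return instances * pricePerHour;
}

// Percentage saved per instance-hour by paying `spot` instead of `onDemand`.
function savingPercent(onDemand, spot) {
  return (1 - spot / onDemand) * 100;
}

// Roughly a 6x price ratio, i.e. a saving of about 83% per instance-hour.
console.log((ON_DEMAND_PRICE / TYPICAL_SPOT_PRICE).toFixed(1));
console.log(savingPercent(ON_DEMAND_PRICE, TYPICAL_SPOT_PRICE).toFixed(1));
// Example: a 10-instance stack at each price.
console.log(hourlyCost(10, ON_DEMAND_PRICE).toFixed(2), hourlyCost(10, TYPICAL_SPOT_PRICE).toFixed(2));
```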

So what’s the trade-off? Well, if for some reason the spot-instance price exceeds your bid price, your spot reservations will get cancelled and your spot-instances will be killed. In short — your instances can and will die at random times and it’s not 100% guaranteed that you’ll get them when you want them.

That’s not good, even if you use CloudFormation and auto-scaling, as you could end up without instances when the spot price becomes too high. That could be almost the same as an AZ failure if you’re not prepared for it.

However, there’s a way to overcome that risk. In a single CloudFormation, you can create two launch configurations — one for on-demand instances and another one for spot-instances. With carefully tweaked scaling thresholds, you can make your spot-instances be preferred over on-demand instances, but still ensure on-demand takes over should spot-instances no longer be available at your bid price.

This way, if you can get spot-instances, your stack will be pretty much fully built using spot-instances. If (and when) the price goes over your bid price, spot instances will start getting killed and your on-demand instances will start booting up to take their place. When the spot price returns below your bid price, spot instances will start booting up again, slowly phasing out the on-demand instances.

After a few weeks of tests we managed to come up with a set of thresholds which work pretty well for us and keep our stacks stable around the clock.

With on-demand, we always have a single instance running by setting the minimum to 1. A scale-up event happens when our average CPU usage exceeds 80% for a 5 minute period, and we increase the on-demand autoscale group by 2 instances. We then scale down 1 instance at a time if the average CPU usage is less than 65% for a period of 5 minutes, and we ensure that a scale-down event only happens once in a 15 minute period.

With spot-instances, we also request a minimum of 1 instance but we set ourselves a bid price of $0.12 – remember, the bid price is not the price you pay, it’s the maximum you are willing to pay. Most of the time we have a spot-price cost of just $0.02!

As with on-demand we scale on average CPU in the spot-price autoscale group. However, we scale-up whenever we reach 50% (instead of 80%), and we also add 2 instances. We scale down and cancel our spot instances when we dip below 30% CPU usage.
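To show how these two sets of thresholds interact, here is a small model of the scaling decisions. It is only an illustration: the real rules live in CloudFormation scaling policies and alarms, the evaluation periods and cooldowns are left out, and the spot scale-down step of 1 instance is an assumption (the post does not say how many are removed).

```javascript
// Thresholds as described above; spot removeOnScaleDown is an assumed value.
const POLICIES = {
  onDemand: { scaleUpAbove: 80, scaleDownBelow: 65, addOnScaleUp: 2, removeOnScaleDown: 1, min: 1 },
  spot: { scaleUpAbove: 50, scaleDownBelow: 30, addOnScaleUp: 2, removeOnScaleDown: 1, min: 1 },
};

// Return how many instances a group would add (positive) or remove (negative)
// for a given average CPU percentage and current group size.
function desiredChange(policy, averageCpu, currentInstances) {
  if (averageCpu > policy.scaleUpAbove) {
    return policy.addOnScaleUp;
  }
  if (averageCpu < policy.scaleDownBelow && currentInstances > policy.min) {
    return -policy.removeOnScaleDown;
  }
  return 0;
}

// At 60% average CPU the spot group grows while the on-demand group shrinks,
// which is how spot ends up preferred whenever it is available.
console.log(desiredChange(POLICIES.spot, 60, 3), desiredChange(POLICIES.onDemand, 60, 3));
```

On-demand only grows once CPU passes 80%, which in practice happens when the spot group cannot add capacity.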

The result is probably best as a picture from Ice (Ice is a great tool from Netflix that helps manage AWS costs). Below is the hourly cost of one of our app tiers before and after we started utilising spot instances.

(the graph shows the hourly cost of one of our app tiers before and after we started utilising spot instances)

For us, in the case of this specific stack, spot instances gave us savings of up to 60%. Bear in mind the size of this specific stack is quite small (up to 10-12 instances at peak); so the bigger the stack, the more savings you’ll see!

To wrap up, I just wanted to share a few tips and tricks we picked up along the way that should help you:

  • bake AMIs; tools like Packer will greatly help you do this; this will let you minimise the time required to boot up new instances; it’ll give you much more, but the time is crucial when it comes to scale-up events, especially when spot-instances are being killed and you want on-demand instances to fill the empty spaces ASAP. We managed to get the time required to boot up a new instance down to around 75 seconds
  • use EBS based instances; they cost a fraction more (and yeah, EBS can be painful) but their boot time is significantly faster than the ephemeral-storage based instances
  • bid price = on-demand instance price; this way, in the worst case, you’ll pay what you’d normally pay for an on-demand instance
  • utilise reserved instances for the “base” on-demand instances in on-demand stacks
  • did I mention Ice from Netflix? Use it!

<shamelessplug> Obviously, the most important requirement is having an awesome application that is cloud-friendly. If you’re interested in building cloud-native applications and awesome infrastructures, we’d love to hear from you! ;-) </shamelessplug>

That’ll be all folks! Happy spot-instancing!

How bloated is your PostgreSQL database?

When dealing with databases (or, in fact, any data that you need to read from disk), we all know how important it is to have a lot of memory.  When we have a lot of memory, a good portion of the data gets nicely cached thanks to smart operating system caching, and most of the data, when requested, comes from memory rather than disk, which is much, much faster.  Hence keeping your dataset as small as possible becomes quite an important maintenance task.

One of the things that takes up quite a bit of space in PostgreSQL, which we use across most of the systems here at Mind Candy, is indexes.  That’s fine, because they vastly speed up access to data; however, they easily get “bloated”, especially if the data you store in your tables gets modified often.

However, before we even are able to tackle the actual problem of bloated indexes, first we need to figure out which indexes are bloated.  There’re some tricky SQL queries that you can run against the database to see the index bloat, but in our experience, results we got were not always accurate (and actually quite far off).

Beside having happy databases we also care a lot about the actual data we store so we back it up very often (nightly backups + PITR backups) and once a day we do a fully automatic database restore to make sure backups we take, work.

Now, a restore operation includes building indexes from scratch, which means those indexes are fresh and free of bloat.

Now, if we only could compare the sizes of indexes from our production databases to the ones from restored backups, we could easily say, very precisely, how much bloat we’ve got in our production database.  To help with that, we wrote a simple python script.

$ ./indexbloat.py -ps csva.csv csvb.csv
Index idx3 size compare to clean import: 117 % (14.49G vs. 12.35G)
Index idx2 size compare to clean import: 279 % (14.49G vs. 5.18G)
Ough!  idx4 index is missing in the csvb.csv file.  Likely a problem with backup!
Total index bloat: 11.46G

The whole process works as follows:

  1. At the time of backup, we run a long SQL query which prints all indexes in production database alongside with their sizes in CSV format
  2. After the backup is restored, we run the same SQL query that prints all indexes in “fresh” database alongside with their sizes in CSV format
  3. We then run the aforementioned Python script which parses both CSV files and prints out human-friendly information showing exactly how much bloat we’ve got in our indexes

We also added a percent threshold option so it’ll print out only indexes with bloat of more than X %.  This is so we won’t get bothered by little differences.
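The core of the comparison fits in a few lines. The sketch below is a Node.js illustration of the idea, not the actual pgindexbloat script. It assumes each CSV line has the hypothetical form `index_name,size_in_bytes`, and it treats the threshold as "size above the clean import, in percent"; the real script's CSV layout and threshold semantics may differ.

```javascript
// Parse "index_name,size_in_bytes" lines into a Map (assumed CSV layout).
function parseSizes(csvText) {
  const sizes = new Map();
  for (const line of csvText.trim().split('\n')) {
    const [name, bytes] = line.split(',');
    sizes.set(name.trim(), Number(bytes));
  }
  return sizes;
}

// Compare production index sizes against a freshly restored (clean) database.
function bloatReport(productionCsv, cleanCsv, thresholdPercent) {
  const production = parseSizes(productionCsv);
  const clean = parseSizes(cleanCsv);
  const report = { bloated: [], missing: [], totalBloatBytes: 0 };

  for (const [name, productionBytes] of production) {
    if (!clean.has(name)) {
      // Index absent from the restored database: likely a backup problem.
      report.missing.push(name);
      continue;
    }
    const cleanBytes = clean.get(name);
    const percent = (productionBytes / cleanBytes) * 100;
    report.totalBloatBytes += Math.max(0, productionBytes - cleanBytes);
    // Only list indexes whose bloat exceeds the threshold.
    if (percent - 100 > thresholdPercent) {
      report.bloated.push({ name, percent: Math.round(percent) });
    }
  }
  return report;
}

// Example with made-up sizes: idx4 exists in production only.
console.log(bloatReport('idx2,300\nidx3,117\nidx4,50', 'idx2,100\nidx3,100', 20));
```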

The aforementioned script, called pgindexbloat, can be found on the Mind Candy GitHub account.  It’s very easy to run from a cron job, or to wrap in a check script and use to feed Nagios / Sensu.

As an interesting note, I’ll just add that the first run of the script uncovered that we had nearly 40GB worth of bloat in our production database.  That was much more than we anticipated, and getting rid of that bloat definitely made our database much happier.