Testing with Amazon SQS

We all know how great Amazon SQS is, and here at Mind Candy we use it extensively in our projects.

Quite recently, we started making some changes to our Data Pipeline in order to speed up our Event Processing, and we found ourselves with the following problem: how can we generate thousands of messages (events) to benchmark it? The first solution that came into our minds was to use the AWS Command Line Interface, which is a very nifty tool and works great.

The AWS Command Line Interface SQS module comes with the ability to send out messages in batches, with a maximum of 10 messages per batch, so we said: “right, let’s write a bash script to send out some batches”, and so we did.

Problem

It worked alright, but it had some problems:

  • It was slow; because messages were being sent in batches of up to 10 messages and not in parallel
  • The JSON payload had to contain some metadata along with the same message repeated 10 times (1 for each message entry)
  • If you needed to send 15 messages, you would have to have 1 message batch with 10 entries and another one with 5 entries (2 JSON files)
  • Bash scripts are not the best thing in the world for maintenance

So, what did we do to solve it? We wrote our own command line program, of course!

Solution: meet sqs-postman

Writing command line applications in Node.js is very very easy, with the aid of the good old Commander.js. Luckily, AWS has an SDK for Node.js, so that means that we don’t need to worry about: AWS authentication, SQS API design, etc. Convenient? Absolutely!

Sqs-postman was designed with the following features out of the box:

  • Sends messages in batches of up to 10 messages at a time (AWS limit)
  • Batches are sent out in parallel using a default of 10 producers, which can be configured using the –concurrent-producers option
  • A single message is read from disk, and expanded into the total number of messages that need to be sent out
  • It supports AWS configuration and profiles

In order to solve the “messages in parallel” problem, we used the async library. We basically split the messages into batches and we then use eachLimit to determine how many batches can be executed in parallel, which starts with a default value of 10 but can be configured with an option.

Can I see it in action?

Of course you can! sqs-postman has been published to npm, so you can install it by running:

 npm install -g sqs-postman

Once installed, just follow these simple steps:

  • Make sure to configure AWS
  • Create a file containing the message, i.e. message.json with a dummy content
    {
       "message": "hello from sqs-postman"
    }
  • Run it
    $ postman message my-test-queue --message-source ./message.json --concurrent-producers 100 --total 1000

If you would like to see more information, the debug mode can be enabled by prepending DEBUG=sqs-postman postman…

Text is boring, show me some numbers!

You are absolutely right! If we don’t share some numbers, it will be hard to determine how good sqs-postman is.

Messages aws-cli sqs-postman
100 0m 4.956s 0m 0.90s
1000 2m 31.457s 0m 4.18s
10000 8m 30.715s 0m 30.83s

As you can appreciate, the difference in performance between aws-cli and sqs-postman is huge! Because of sqs-postman’s ability to process batches in parallel (async), the execution time can be reduced quite considerably.

These tests were performed on a Macbook Pro 15-inch, Mid 2012 with a 2.6 GHz Intel Core i7 Processor and 16 GB 1600 MHz DDR3 of RAM. And time was measured using Unix time.

Conclusion

Writing this Node.js module was very easy (and fun). It clearly shows the power of Node.js for writing command line applications and how extensive the module library is when it comes to reusing existing modules/components (e.g. AWS SDK).

The module has been open sourced and can be found here. Full documentation can be found in there too.

As usual, feel free to raise issues or better yet contribute (we love contributors!).

Event Processing at Mind Candy

At Mind Candy we want to build great games that are fun and that captivate our audience. We gather a great deal of data from all of our products and analyse it to determine how our players interact with our games, and to find out how we can improve. The vast majority of this data consists of ‘events’; a blob of json that is fired by the client or server in response to an interesting action happening in the game.

This blog post is about the approach that we have taken at Mind Candy to gather and process these events, and scale the systems into the cloud using fluentd, akka, SQS, Redshift and other AWS Web Services.

What is an event?

From our point of view, an event is any arbitrary valid json that is fired (sent) to our Eventing service via a simple REST api.

When an event is received, it is enriched with some additional fields which includes a ‘fired_ts’ of when the event was received, a unique uuid and, importantly, the game name, version, and event name taken from the endpoint. These three together form what we call the ‘event key’.

This service is extremely lean, and does not itself expect or enforce a rigid taxonomy. It simply writes the enriched events to disk. As a result, the service is incredibly easy to scale and to achieve high availability.

Validation and processing

We then use fluentd, an open source data collector and aggregator, to take the enriched data written to disk and place it onto an SQS queue. Currently, we use a single queue (per environment) which receives data from many different eventing servers.

Once all that data is arriving on the queue, we need to do something useful with it! This is where our home grown event processing engine, Whirlpool, comes into play.

Whirlpool is a scala and akka based service which retrieves messages from SQS and processes and validates them accordingly. It uses a variant of the akka work-pull pattern with dedicated workers for pre-fetching, processing, and writing events, communicating with a master worker. The number of workers and other parameters can be tweaked for maximum throughput.

Where does the metadata for processing come from? We have a shared ‘data model’ which contains information on what an event should look like for a specific game and version. This is essentially a scala library that reads from a backing Postgres store.

The structure of that schema is (simplified):

Screen Shot 2014-07-25 at 15.49.50

An event field is a single field to be found in the json of the sent event. It has a number of different properties, for example whether it is mandatory or not, and whether it should be expanded (exploded out into multiple events), and the json path to where that field should be expected. The point of the eventversion table is to provide a history, so that all changes to all events are recorded over time so we have a rollback, as well as an audit trail for free.

An event destination configures where an event should end up in our warehouse. It can be copied to any number of schemas and tables as we require.

Whirlpool retrieves the metadata for an event based on the extracted event key. It then passes the event through a series of validation steps. If it fails at any level, the reason why is recorded. If it completes all validations, the event can be processed as expected.

The processMessage function looks like this:

Screen Shot 2014-07-25 at 16.49.28

We use Argonaut as our JSON processing library. It is a fully functional library written in Scala that is very nice to work with, as well as having the added benefit that our resident Mind Candy, Sean, is a contributor!

After our events have been validated, they are either a successful event for a particular game and version, or a failure. At this point we make use of fluentd again with a modified version of the Redshift plugin to load them into our Redshift data warehouse. Here they are available for querying by our data scientists and data analysts. Typically, the period from an event being received to being queryable within the data warehouse is measured in seconds, and in any case within a couple of minutes in normal cases.

Configuring events

To actually setup the metadata for what constitutes an event, we have created a simple GUI that can be accessed by all game teams. Any changes are picked up within a few minutes by Whirlpool, and those events will start to flow through our pipeline.

We also needed to solve one large problem with the configuration, namely: “How do you avoid having to create a mapping for every single game version when the events haven’t changed, and how do you accommodate for changes when they do occur?”

It took us a while to find a nice balance for solving this, but what we have now is a mapping from any POSIX regex which is matched against an incoming game version, to a specific version that should be used for retrieving the metadata (this is the purpose of the ‘configmapping’ table in the schema). So, when we release 1.0 of our game, we can create metadata that applies to “1.x”. If in version 1.5 we introduce a new event, we can create a new config at that point to apply to all later versions, while still having versions 1.0-1.4 processed correctly.

Handling Failure

Events can fail for a large variety of reasons. Currently there are 17 specific types of these, with a couple being:

  • The event is malformed; it does not contain the fields that we expect
  • The event is unknown

A failure is captured by the following class:

Screen Shot 2014-07-25 at 16.49.46

The FailureType here is another case class corresponding to the specific failure that was generated, and the fields contain some additional attributes which may or may not be extracted from the failure.

We treat failures separately from processed events, but they still make their way into Redshift in a separate schema. Each failure contains enough information to identity the problem with the event, which can then be fixed in most cases in the metadata; typically, event failures occur during development, and are a rare occurrence in production.

Scaling our infrastructure

We make heavy use of AWS at Mind Candy, and the eventing pipeline is no exception. All the eventing servers are described via Cloud Formation, and setup in an autoscale group fronted by an ELB. As a result, the number of servers deployed scales up and down in response to rising and waning demand.

The use of SQS also separates out our event gathering and event processing infrastructure. This means that Whirlpool instances do not have to scale as aggressively, as the queue provides a natural buffer to iron out fluctuations in the event stream due to peaks of traffic. For Redshift, we have a 6XL node cluster which we can scale up when required, thanks to the awesome features provided by Amazon.

Performance

We’ve benchmarked each of our eventing servers comfortably processing 5k events/sec, on m1.medium instances.

Whirlpool does a little more work, but we are currently running a configuration offering a sustained rate of just over 3k events/sec per instance, on c1.medium instances, with a quick ramp up time.

Instances of both Eventing and Whirlpool operate independently, so we scale horizontally as required.

Screen Shot 2014-07-25 at 16.24.54

The Future

We have real-time dashboards that run aggregations against our event data and display it on screens around the office. It’s very useful, but is only the first incarnation. Currently we’re working on streaming processed events from Whirlpool into Spark via Kafka, to complete our lambda architecture and greatly reduce the load on our Redshift cluster. We’re also improving the structure of how we store events in Redshift, based on our learnings over the last year or so! At some point when we have more time, we would also like to open-source Whirlpool into the community.

 

How I Learned to Stop Worrying and Love AWS CloudFormation

We love using AWS CloudFormation, here, at Mind Candy. Last year we moved all our cloud-based products application stacks to CloudFormations. We have learned, sometimes the hard way, how to design and use them in the best possible way for us. In this post I’m trying to summarize how we build and operate CloudFormations and what are the DOs and DON’Ts when using this technology. Throughout this post I will refer to CloudFormation as CF, to save some precious typing time.

First of all, you need to get to know cloud formation templates. This are just blocks of JSON, and as such are not really nice to edit (remember – no comments allowed in JSON). Because of that we use a helper technology – a templating tool to build CF templates. We decided to use tuxpiper’s cloudcast library (we are a python shop). You can take a peek or download it here https://github.com/tuxpiper/cloudcast. If your primary language is different than python you can easily find or write your own templating tool – it was pointed to me by a former colleague that CFNDSL is a good starting point for rubyists (https://github.com/howech/cfndsl). So lesson one is – don’t use plain JSON to write your CF templates. You will save yourself a lot of tedious time.

Once you have your first stack up and running you’ll realise how easy it is to modify and use it. But wait, what about testing the changes? That’s one of the biggest flaws of the CF technology. There is no other way to test your template than to apply it. CF does not give you a second chance – you can easily terminate/recreate your whole stack by changing of single line in your template. The good practice we try to adhere to is to test every single change in the template using different AWS account (we use separate AWS accounts for our development, integration, staging and production environments) or region, i.e. launch identical stack first in another AWS location and then perform the change on it to test if we end up in the desired state.

To make it possible to launch identical stacks in different accounts or regions one can leverage CF mappings and parameters. We don’t use parameters yet, but we use mapping heavily. That allows us to use a single CF template file to create multiple stacks in different environments. All you have to do is to define environment-specific properties within a global mapping on top of our template and then use CF’s “Fn::FindInMap” intrinsic function (actually, cloudcast does it for you). Also, use CF Outputs – they will allow you to programmatically access the resources created in your CF.

Next one is a set of more generic hints for those who work with AWS, still 100% valid for CF. First, use IAM roles to launch your stacks/instances. Let me quote AWS IAM official documentation here:

A role is an entity that has its own set of permissions, but that isn’t a user or group. Roles also don’t have their own permanent set of credentials the way IAM users do. Instead, a role is assumed by other entities. Credentials are then either associated with the assuming identity, or IAM dynamically provides temporary credentials (in the case of Amazon EC2)“.

That will make your environment more secure and save you misery of maintaining IAM users and keys. Bear in mind that once the instance is created you cannot assign it to an IAM role, so if you’re not using IAM roles yet you should create IAM role with an “empty” policy now and use it for all your resources until you’re ready to benefit from full-fat IAM roles.

Secondly, use a minimalistic user data – make it identical for your whole estate. Delegate environment/application specific settings to your configuration management system. This will just make your life easier. Get familiar with and start using auto-scaling groups, even if you’re launching a single instance (in that case you can have an auto-scaling group with minimum and maximum number of instances equal to 1). You’ll benefit from that approach later, once your service starts to scale up.

Finally, use AWS tags to tag your AWS resources. Tags allow you to do a lot of funky stuff with your AWS resources (let me only mention grouping, accounting, monitoring and reporting here).

Now, a few DON’Ts for your CF:

  • Don’t mix VPC and non-VPC regions in your mappings – CF uses different set of properties for EC2-VPC resources than for EC2-classic resources
  • Don’t ever specify resource name properties in your CF template. Using auto-generated names makes your stack easily portable. Thus, you can copy your existing stack to another environment or launch a completely new stack (say your canary stack) using the same template. Also some of AWS resource names need to be globally/regionally unique, so defining a name in your stack is not such a good idea. Finally, virtually any resource which allows you to set its name will require replacement on update – just imagine your whole stack relaunching from scratch when someone comes with a clever idea to rename resources in line with a new naming convention or a new product name?
  • Don’t use existing (non-CF built) AWS objects in your stack, if you can. Using existing resources also makes your stack non-portable. A lot here depends on the use case (i.e. we have a couple of security groups which we use in our stacks, but even then we provide their names/ids in the mappings or parameters, rather than using them directly in resource declaration).

Know your limits – CF is great orchestration tool, but it has its limits. You cannot create or update some AWS resources (e.g. EC2 keypairs). You cannot self-reference security groups in their definitions, which sucks (how do I open all my cassandra nodes for inter-node communication on port 7001 within the CF?). Stacks are difficult to maintain, as there are no incremental changes. For the above and other, obvious, reasons – don’t forget to source control your CF stacks (we have a dedicated git repository for that).

Finally, the last, and maybe most important, point – separate your applications into multiple CF stacks. One can easily get excited about CF and create a single stack for the whole application (network, databases, application servers, load balancers, caches, queues and so one). That’s not a good idea – you don’t want your database servers to relaunch when you decide to modify the properties of the auto-scaling group for you application layer. The solution is simple – create multiple CF stacks for your four application stack. Make your database layer a separate CF stack, then your distribution (app server auto-scaling groups and ELBs) a second CF stack and so on. This will give you the flexibility of CF without taking a risk of unwanted service disruption, due to CF stack update (been there, done that…). It’s very tempting to create very sophisticated CF stack, with many inter-dependent components, but I cannot stress enough how important is not to do it.

What’s next?

We are all the time looking to improve our stacks and processes, so definitely we are only at the beginning of our CF journey. One of my colleagues is looking at another CF templating library (https://github.com/cloudtools/troposphere) to help us automate our processes of CF creation even more. We will very likely start to protect our CF resources in production using stack policies soon. We will start working with CF parameters and dependencies more to make our templates 100% independent of our account/regional settings. Finally, we need to research if Custom Resources are fit for our purposes.