Replicating from AWS RDS MySQL to an external slave (without downtime)

We’ve recently needed to create an external copy of a large database running on Amazon RDS, with minimal or no downtime. The database is the backend to a busy site, and our goal was to create a replica in our data centre without causing any disruption to our users. With Amazon adding support for MySQL 5.6, we’re now able to access the binary logs from an external location, which wasn’t possible before.

As MySQL replication only works from a lower version to an equal or higher version, we had to ensure that both our databases were on MySQL 5.6. This was straightforward for the external slave but not as easy for the RDS instance, which was on MySQL 5.1. Upgrading the RDS instance would require a reboot after each version upgrade (5.1 -> 5.5 -> 5.6). As per the recommendation in the Amazon upgrade guide, we created a read replica and upgraded it to 5.6. With the replica synced up, we then needed to enable automated backups before it was in a state where it could be used as a replication source.

Creating an initial database dump proved tricky: the backup itself took around 40-50 minutes to create and a further 3-4 hours to import into the external slave, and with the site being as active as it is, the binary log file and position change pretty quickly. The best option was to stop replication on the RDS slave while the backup was taken. However, due to the permissions Amazon gives the ‘master’ user, running a STOP SLAVE command returns:

ERROR 1045 (28000): Access denied for user 'admin'@'%' (using password: YES)

Luckily there’s a stored procedure which can be used to stop replication – mysql.rds_stop_replication:

mysql> CALL mysql.rds_stop_replication;
+---------------------------+
| Message                   |
+---------------------------+
| Slave is down or disabled |
+---------------------------+
1 row in set (1.08 sec)

Query OK, 0 rows affected (1.08 sec)

With replication on the RDS slave stopped, we can start creating the backup assured that no changes will be made during the process and any locking of tables won’t affect any users browsing the website.
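The dump itself can be taken with plain mysqldump against the (now stopped) RDS read replica; something along these lines, with placeholder endpoint, user and database names:

mysqldump -h myreplica.xxxxxxxx.eu-west-1.rds.amazonaws.com -u admin -p \
  --single-transaction --routines --triggers --databases myreplicateddb > myreplicateddb.sql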

Once the backup completes we’ll want to start replication again, but before doing so we take note of the binary log file and position:

mysql> show master status;
+----------------------------+----------+--------------+------------------+-------------------+
| File                       | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+----------------------------+----------+--------------+------------------+-------------------+
| mysql-bin-changelog.074036 | 11653042 |              |                  |                   |
+----------------------------+----------+--------------+------------------+-------------------+

This will be required when setting up the external slave later on. Now that we have the relevant information, we can start replication again. As before, we need to use the RDS equivalent of START SLAVE:

mysql> CALL mysql.rds_start_replication;
+-------------------------+
| Message                 |
+-------------------------+
| Slave running normally. |
+-------------------------+

Once the dump has been imported, we can set the new master on the external slave using the values previously recorded:

CHANGE MASTER TO MASTER_HOST='AWS_RDS_SLAVE', MASTER_PASSWORD='SOMEPASS', MASTER_USER='REPL_USER', MASTER_LOG_FILE='mysql-bin-changelog.074036', MASTER_LOG_POS=11653042;

Before we start the replication, we need to add a few more settings to the external slave’s my.cnf:

  • a unique server-id, i.e. one that’s not being used by any of the other MySQL DBs
  • the database(s) you want to replicate with replicate-do-db. This stops the slave trying to replicate the internal mysql database and other potential RDS-related objects. Thanks to Phil for picking that up.

So something like:

server-id = 739472
replicate-do-db=myreplicateddb
replicate-do-db=mysecondreplicateddb   # if more than one db needs to be replicated

Start up replication on the external slave – START SLAVE; 

This should start updating the slave, which you can monitor via

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin-changelog.075782
Read_Master_Log_Pos: 113973
Relay_Log_File: relaylog.014354
Relay_Log_Pos: 114089
Relay_Master_Log_File: mysql-bin-changelog.074036
Slave_SQL_Running_State: updating

The above values are the most important in the sea of information the command returns. You’re waiting for Master_Log_File and Relay_Master_Log_File to become identical, and for Slave_SQL_Running_State to show ‘Slave has read all relay log; waiting for the slave I/O thread to update it’.

Once that syncs up, an external replica has been created with zero downtime!

Managing Mac OS X

In the beginning, Clonezilla was used for imaging the Macs and for asset management we relied solely on GLPI. Having come from a Windows background with a little Debian and Red Hat experience, this lack of control over and visibility of the machines I was supporting was completely alien to me. Soon after joining I was given the task of reviewing our imaging process and looking at what other technologies were available. Along came DeployStudio.

Deployment

DeployStudio is a fantastic deployment tool which allows you to image and deploy Macs and PCs. Rather than just creating and deploying images, it lets you build advanced workflows and fully automate your deployment from start to finish. To run DeployStudio you will need OS X Server for NetBoot deployments. We initially installed DeployStudio on a Mac Mini, which ran fine, and later set up another DeployStudio server on a second Mac Mini as a replica for redundancy.

We didn’t want to get into the habit of creating multiple images for different models of machine, but instead to have one vanilla image with varying workflows depending on the purpose of the Mac it will be deployed to. To create a vanilla image, you download the latest Mac OS X installer via the App Store and then use AutoDMG to build an up-to-date system image ready for use. Once we have our base image we can create a workflow, which applies all the changes needed on top of the vanilla image: localising the Mac, setting the computer name, creating the local users, installing necessary packages, running scripts and, lastly, running software update.

Management 

We were then left with a problem: how do we preload and manage software on all our Macs? DeployStudio can install packages, but they quickly become out of date, and you aren’t able to manage the software already installed on the machines on your network.

Munki is a package management system which allows you to distribute software via a web server and client-side tools. Using catalogs and manifests, you can install and remove software on your clients depending on which manifest has been applied to which machine. Munki also rather handily collects information on the clients each time it runs (the user, hostname, IP, failed Munki runs, installed software and so on) and stores it on the Munki server.

The problem we found is that while Munki works well, the software uploaded to the Munki server soon becomes out of date, and managing the repository becomes a task in itself. Here’s where AutoPkg and Jenkins come in. AutoPkg is an automation framework for OS X software packaging and distribution, and Jenkins is a continuous integration server. AutoPkg uses recipes to download the latest version of each application and create the relevant metadata ready for importing into the Munki server via the Munki tools. Jenkins is then used to schedule the AutoPkg recipes and Munki imports as builds every day in the early hours of the morning. The end product is a Munki repository containing the most up-to-date software you’d expect on all of your clients, e.g. Chrome, Firefox, Munki tools, antivirus, Mac Office.
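Each nightly Jenkins build then essentially boils down to running AutoPkg over the recipes you care about and rebuilding the Munki catalogs. Roughly something like this (the recipe list here is just an example and depends on what you deploy):

# one-off: add the core AutoPkg recipe repo
autopkg repo-add recipes

# nightly Jenkins build step
autopkg run -v GoogleChrome.munki Firefox.munki MakeCatalogs.munki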

Monitoring

After playing with Munki and being really impressed with it, I decided to see what else I could find that would hook into it. After some searching I found Sal, the creation of Graham Gilbert. Graham is an active contributor to the Mac management world and I’m personally a big advocate of his; you can find his GitHub page here. Sal is a Munki and Puppet reporting tool which gives you not only a broad overview of your Macs via a dashboard, but also an in-depth look at each client via the stats collected by Munki and Facter.

Once Sal has been configured on the client, each time Munki runs it triggers a Sal submit script which gathers information and stores it on the Sal server. The highly customisable Sal dashboard uses plugins to display high-level counters showing, for example, which machines are running which version of OS X or when clients last checked in. There are custom plugins available on GitHub, or you can create your own, and you can rearrange the order of the plugins or hide certain plugins if you run multiple dashboards. Thanks to Munki and Facter, you can drill down into an individual client and retrieve hourly status updates as well as static information about the machine.

Asset Management 

In terms of asset management not much has changed: GLPI is fit for purpose, so we’ve had no need to replace it. However, we do now use Meraki in conjunction with it.

Meraki is an MDM tool which we initially used for our mobile devices, but as Meraki grew, its Mac and Windows management tools improved. By default, all machines are now managed by Meraki as well as by the other tools mentioned previously. Meraki provides some extremely powerful features such as location tracking, remote desktop, viewing network stats, fetching process lists and sending notifications. While they aren’t necessarily needed on a daily basis, they do come in handy for those odd situations.

Summary 

With all of the technologies above, we now have an imaging process which takes up to 15 minutes with only 30 seconds of actual input. In that imaging process a DeployStudio workflow images the Mac, configures it for the individual user, runs software update and installs Munki and Sal. Once booted, Munki runs and installs all the up-to-date software (thanks to AutoPkg and Jenkins) that the client’s manifest specifies. Going forward you then have monitoring (Sal) and management (Munki and Meraki) of that Mac, giving you control over and visibility of the machine.

The best part is this hasn’t cost us a penny! This is thanks to the ever passionate and generous open source community. See below for the fantastic folks who created the amazing tools I mentioned in the post.

  • AutoDMG – Builds a system image, suitable for deployment with DeployStudio
  • DeployStudio – Image & deployment system for Macs and PCs
  • Munki – Package management for Mac
  • AutoPKG & Jenkins –  Automation framework for OS X software packaging and distribution and continuous integration server
  • Sal – Munki & Puppet reporting tool for Macs
  • Meraki – Mobile Device Management
  • GLPI – Asset management
  • Puppet & Facter – Configuration management utility and a cross-platform system profiling library

Dynamic-DynamoDB has been summoned!

Back in July, we posted about a tool for Amazon DynamoDB called Dynamic-DynamoDB that gives you the ability to dynamically increase and decrease your provisioned throughput in DynamoDB according to your current usage.

What you gain from using a tool like this, if you implement it correctly, is provisioned throughput in Dynamo that will ensure you have capacity when you need it.

Last week, we released our latest game, World of Warriors on the App Store, and, without going into the back-end systems in too much detail, we have a use case for DynamoDB in the game.

This is why we put the time into writing a Puppet module for DynamicDynamoDB which you can get from our Github account.

Since launch, World of Warriors has performed incredibly. We received Editor’s Choice from Apple and had over 2 million downloads in the first weekend alone, which saw the game crack the top 50 grossing games in the US and become the #1 role-playing game for iPad in 80 countries.

As you can imagine, with statistics and rapid uptake like that, you need good scaling in place. So did DynamicDynamoDB “do the business” for us over an extremely busy weekend?

We’ll let the following graph speak for itself. It shows our usage and scaling of DynamoDB from launch to yesterday.

[Graph: DynamoDB provisioned vs. consumed capacity, from launch to yesterday.]

As you can see, we took a cautious approach, keeping provisioned capacity comfortably above consumed capacity.

We did this because we didn’t want to scale up too slowly and get throttled by Amazon. Obviously this has a trade-off with cost, but for us it was acceptable, thanks to what we had learned prior to hard launch.

So what had we learned?

We’d observed that altering throughput on DynamoDB tables is in no way immediate. This led us to the conclusion that to get the most out of dynamic scaling and avoid throttling you need to scale up early and do so by a significant amount.

In our case, we initiate a scaling up event when our consumed capacity reaches 60% of our provisioned capacity, and we scale back when we go below 30%. Each of these scaling events either increases or decreases provisioned capacity by 25%.
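In Dynamic DynamoDB configuration terms, that policy looks roughly like the per-table section below. The table name is a placeholder and the option names are those from the Dynamic DynamoDB documentation at the time; double-check them against the version you run:

[table: ^mygametable$]
reads-upper-threshold: 60
reads-lower-threshold: 30
increase-reads-with: 25
decrease-reads-with: 25
writes-upper-threshold: 60
writes-lower-threshold: 30
increase-writes-with: 25
decrease-writes-with: 25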

As the graph above shows, this strategy meant that at all times we had significant capacity for sudden bursts in traffic, whether due to virality, marketing, or any other sort of active promotional event that might occur.

Don’t forget to download World of Warriors on the App Store. We’ll see you in Wildlands! Let battle commence!


A Puppet module for Dynamic DynamoDB

As my colleagues have said in other posts, we make extensive use of Amazon Web Services at Mind Candy. Recently we decided to use the AWS NoSQL offering, DynamoDB, for a specific use case in one of our products.

Whilst DynamoDB provides us with a highly distributed NoSQL solution, it works by you telling Amazon what read and write capacity you require via their API. If you go over either of these values, you begin, potentially at least, to lose queries if you have not factored in some sort of caching layer using, for example, Amazon SQS.

In an ideal world, Amazon would offer auto-scaling features for DynamoDB; however, at the time of writing they don’t. Instead they advise people to use an independently developed tool called Dynamic DynamoDB, written by Sebastian Dahlgren.

Dynamic DynamoDB is a tool written in Python that allows us to effectively auto-scale our provisioned reads and writes. It uses CloudWatch metrics to establish current usage and then, based on your configuration options, scales your provisioned capacity up or down on a per-table basis.

As I’ve posted before here, we use Puppet at Mind Candy, so the first port of call whenever a new tool comes along is to see if anyone has written, or started to write, a Puppet module for it. Sadly it didn’t look like anyone had, so we quickly wrote our own, which is available on Github here.

Event Processing at Mind Candy

At Mind Candy we want to build great games that are fun and that captivate our audience. We gather a great deal of data from all of our products and analyse it to determine how our players interact with our games, and to find out how we can improve. The vast majority of this data consists of ‘events’: blobs of JSON fired by the client or server in response to an interesting action happening in the game.

This blog post is about the approach that we have taken at Mind Candy to gather and process these events, and to scale the systems into the cloud using fluentd, akka, SQS, Redshift and other AWS services.

What is an event?

From our point of view, an event is any arbitrary valid json that is fired (sent) to our Eventing service via a simple REST api.

When an event is received, it is enriched with some additional fields, including a ‘fired_ts’ timestamp of when the event was received, a unique UUID and, importantly, the game name, version and event name taken from the endpoint. The latter three together form what we call the ‘event key’.

This service is extremely lean, and does not itself expect or enforce a rigid taxonomy. It simply writes the enriched events to disk. As a result, the service is incredibly easy to scale and to achieve high availability.

Validation and processing

We then use fluentd, an open source data collector and aggregator, to take the enriched data written to disk and place it onto an SQS queue. Currently, we use a single queue (per environment) which receives data from many different eventing servers.

Once all that data is arriving on the queue, we need to do something useful with it! This is where our home grown event processing engine, Whirlpool, comes into play.

Whirlpool is a scala and akka based service which retrieves messages from SQS and processes and validates them accordingly. It uses a variant of the akka work-pull pattern with dedicated workers for pre-fetching, processing, and writing events, communicating with a master worker. The number of workers and other parameters can be tweaked for maximum throughput.
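Whirlpool itself isn’t open sourced yet, but the work-pulling idea is simple: the master only hands work to workers that have asked for it, so slow workers are never swamped. The sketch below is purely illustrative; the names and message protocol are hypothetical, not Whirlpool’s actual code.

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case object GiveMeWork
case class Work(rawEvent: String)

class Master extends Actor {
  private var pending = Vector.empty[String]
  private var idle = Vector.empty[ActorRef]

  def receive = {
    case raw: String =>                 // e.g. a message pre-fetched from SQS
      if (idle.nonEmpty) { idle.head ! Work(raw); idle = idle.tail }
      else pending :+= raw
    case GiveMeWork =>                  // a worker asking for more work
      if (pending.nonEmpty) { sender() ! Work(pending.head); pending = pending.tail }
      else idle :+= sender()
  }
}

class Worker(master: ActorRef) extends Actor {
  override def preStart(): Unit = master ! GiveMeWork
  def receive = {
    case Work(raw) =>
      println(s"validating and processing: $raw") // real workers validate, then write
      master ! GiveMeWork
  }
}

object WorkPullSketch extends App {
  val system = ActorSystem("work-pull-sketch")
  val master = system.actorOf(Props[Master], "master")
  (1 to 4).foreach(i => system.actorOf(Props(new Worker(master)), s"worker-$i"))
  master ! """{"event_name":"example"}"""
}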

Where does the metadata for processing come from? We have a shared ‘data model’ which contains information on what an event should look like for a specific game and version. This is essentially a scala library that reads from a backing Postgres store.

The structure of that schema is (simplified):

[Diagram: the simplified metadata schema.]

An event field is a single field to be found in the JSON of the sent event. It has a number of properties: for example, whether or not it is mandatory, whether it should be expanded (exploded out into multiple events), and the JSON path at which the field should be expected. The point of the eventversion table is to provide a history, so that all changes to all events are recorded over time, giving us rollback as well as an audit trail for free.

An event destination configures where an event should end up in our warehouse. It can be copied to any number of schemas and tables as we require.
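As a rough, hypothetical reconstruction of that simplified schema from the description above (the table and column names here are ours, not necessarily the real ones), it is the kind of structure you could express as:

CREATE TABLE event            (id serial PRIMARY KEY, game text, event_name text);
CREATE TABLE eventversion     (id serial PRIMARY KEY, event_id int REFERENCES event(id),
                               version text, created_at timestamptz DEFAULT now());
CREATE TABLE eventfield       (id serial PRIMARY KEY, eventversion_id int REFERENCES eventversion(id),
                               json_path text, mandatory boolean, expand boolean);
CREATE TABLE eventdestination (id serial PRIMARY KEY, eventversion_id int REFERENCES eventversion(id),
                               target_schema text, target_table text);
CREATE TABLE configmapping    (id serial PRIMARY KEY, game text, version_regex text,
                               metadata_version text);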

Whirlpool retrieves the metadata for an event based on the extracted event key. It then passes the event through a series of validation steps. If it fails at any level, the reason why is recorded. If it completes all validations, the event can be processed as expected.

The processMessage function looks like this:

[Code screenshot: the processMessage implementation.]
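As a rough, purely illustrative sketch of the shape of such a function (all names are hypothetical, and it deliberately sidesteps the real Argonaut API and Whirlpool’s own types):

object ProcessingSketch {
  case class EventKey(game: String, version: String, event: String)

  sealed trait Result
  case class Processed(key: EventKey, fields: Map[String, String]) extends Result
  case class Failed(reason: String) extends Result

  // Stubs standing in for JSON parsing, key extraction and metadata-driven validation.
  def parse(raw: String): Option[Map[String, String]] =
    if (raw.nonEmpty) Some(Map("event_name" -> "example")) else None
  def extractKey(fields: Map[String, String]): Option[EventKey] =
    fields.get("event_name").map(e => EventKey("mygame", "1.0", e))
  def mandatoryFieldsPresent(fields: Map[String, String]): Boolean = fields.nonEmpty

  def processMessage(raw: String): Result =
    parse(raw) match {
      case None => Failed("malformed json")
      case Some(fields) =>
        extractKey(fields) match {
          case None                                        => Failed("unknown event")
          case Some(key) if mandatoryFieldsPresent(fields) => Processed(key, fields)
          case Some(_)                                     => Failed("missing mandatory fields")
        }
    }
}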

We use Argonaut as our JSON processing library. It is a fully functional library written in Scala that is very nice to work with, as well as having the added benefit that our resident Mind Candy, Sean, is a contributor!

After our events have been validated, they are either a successful event for a particular game and version, or a failure. At this point we make use of fluentd again, with a modified version of the Redshift plugin, to load them into our Redshift data warehouse. Here they are available for querying by our data scientists and data analysts. Typically, the period from an event being received to it being queryable within the data warehouse is measured in seconds, and in normal cases it is no more than a couple of minutes.

Configuring events

To actually setup the metadata for what constitutes an event, we have created a simple GUI that can be accessed by all game teams. Any changes are picked up within a few minutes by Whirlpool, and those events will start to flow through our pipeline.

We also needed to solve one large problem with the configuration, namely: “How do you avoid having to create a mapping for every single game version when the events haven’t changed, and how do you accommodate for changes when they do occur?”

It took us a while to find a nice balance for solving this, but what we have now is a mapping from any POSIX regex which is matched against an incoming game version, to a specific version that should be used for retrieving the metadata (this is the purpose of the ‘configmapping’ table in the schema). So, when we release 1.0 of our game, we can create metadata that applies to “1.x”. If in version 1.5 we introduce a new event, we can create a new config at that point to apply to all later versions, while still having versions 1.0-1.4 processed correctly.

Handling Failure

Events can fail for a large variety of reasons. Currently there are 17 specific types of these, with a couple being:

  • The event is malformed; it does not contain the fields that we expect
  • The event is unknown

A failure is captured by the following class:

[Code screenshot: the failure case class.]
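As a purely illustrative sketch of the shape described below (hypothetical names, not the original class):

case class FailureType(name: String)   // e.g. "malformed", "unknown event"

case class EventFailure(
  failureType: FailureType,
  fields: Map[String, String]          // additional attributes extracted from the failure, where available
)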

The FailureType here is another case class corresponding to the specific failure that was generated, and the fields contain some additional attributes which may or may not be extracted from the failure.

We treat failures separately from processed events, but they still make their way into Redshift in a separate schema. Each failure contains enough information to identify the problem with the event, which can then, in most cases, be fixed in the metadata; typically, event failures occur during development and are a rare occurrence in production.

Scaling our infrastructure

We make heavy use of AWS at Mind Candy, and the eventing pipeline is no exception. All the eventing servers are described via CloudFormation and set up in an auto-scaling group fronted by an ELB. As a result, the number of servers deployed scales up and down in response to rising and waning demand.

The use of SQS also separates out our event gathering and event processing infrastructure. This means that Whirlpool instances do not have to scale as aggressively, as the queue provides a natural buffer to iron out fluctuations in the event stream due to peaks of traffic. For Redshift, we have a 6XL node cluster which we can scale up when required, thanks to the awesome features provided by Amazon.

Performance

We’ve benchmarked each of our eventing servers comfortably processing 5k events/sec, on m1.medium instances.

Whirlpool does a little more work, but we are currently running a configuration offering a sustained rate of just over 3k events/sec per instance, on c1.medium instances, with a quick ramp up time.

Instances of both Eventing and Whirlpool operate independently, so we scale horizontally as required.


The Future

We have real-time dashboards that run aggregations against our event data and display it on screens around the office. It’s very useful, but is only the first incarnation. Currently we’re working on streaming processed events from Whirlpool into Spark via Kafka, to complete our lambda architecture and greatly reduce the load on our Redshift cluster. We’re also improving the structure of how we store events in Redshift, based on our learnings over the last year or so! At some point when we have more time, we would also like to open-source Whirlpool into the community.

Advanced Scala Meetup at Mind Candy

We’re very happy to announce the Advanced Scala Meetup, in collaboration with the London Scala Users’ Group. This is a new regular meet-up for proficient Scala developers to share their problems, solutions and experience to make for better code.

At Mind Candy, all our new games and products use Scala, so we’re very interested in Scala and happy to host the meetup!

For more details and to sign up see the event page: http://www.meetup.com/london-scala/events/195004482/

If you’d like to learn more about joining us at Mind Candy to work on highly scalable servers for our new wave of games, check out our Careers page for current vacancies. We’re hiring!

How I Learned to Stop Worrying and Love AWS CloudFormation

We love using AWS CloudFormation here at Mind Candy. Last year we moved all our cloud-based products’ application stacks to CloudFormation. We have learned, sometimes the hard way, how to design and use them in the best possible way for us. In this post I’ll try to summarise how we build and operate CloudFormations, and what the DOs and DON’Ts are when using this technology. Throughout this post I will refer to CloudFormation as CF, to save some precious typing time.

First of all, you need to get to know CloudFormation templates. These are just blocks of JSON, and as such are not really nice to edit (remember – no comments allowed in JSON). Because of that, we use a helper technology – a templating tool – to build CF templates. We decided to use tuxpiper’s cloudcast library (we are a Python shop). You can take a peek or download it here: https://github.com/tuxpiper/cloudcast. If your primary language is not Python, you can easily find or write your own templating tool – a former colleague pointed out to me that CFNDSL is a good starting point for rubyists (https://github.com/howech/cfndsl). So lesson one is: don’t use plain JSON to write your CF templates. You will save yourself a lot of tedious time.

Once you have your first stack up and running you’ll realise how easy it is to modify and use it. But wait, what about testing the changes? That’s one of the biggest flaws of the CF technology: there is no other way to test your template than to apply it. CF does not give you a second chance – you can easily terminate/recreate your whole stack by changing a single line in your template. The good practice we try to adhere to is to test every single change to a template using a different AWS account (we use separate AWS accounts for our development, integration, staging and production environments) or region, i.e. launch an identical stack first in another AWS location and then perform the change on it to check that we end up in the desired state.

To make it possible to launch identical stacks in different accounts or regions, one can leverage CF mappings and parameters. We don’t use parameters yet, but we use mappings heavily. That allows us to use a single CF template file to create multiple stacks in different environments. All you have to do is define environment-specific properties within a global mapping at the top of your template and then use CF’s “Fn::FindInMap” intrinsic function (actually, cloudcast does it for you). Also, use CF Outputs – they will allow you to programmatically access the resources created in your CF stack.
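As a minimal illustration of the pattern in raw template JSON (the templating tool generates this for you; the map name, keys and values here are made up):

"Mappings": {
  "EnvironmentMap": {
    "eu-west-1": { "InstanceType": "m1.small" },
    "us-east-1": { "InstanceType": "m1.large" }
  }
},
"Resources": {
  "AppLaunchConfig": {
    "Type": "AWS::AutoScaling::LaunchConfiguration",
    "Properties": {
      "ImageId": "ami-12345678",
      "InstanceType": { "Fn::FindInMap": [ "EnvironmentMap", { "Ref": "AWS::Region" }, "InstanceType" ] }
    }
  }
},
"Outputs": {
  "LaunchConfigName": { "Value": { "Ref": "AppLaunchConfig" } }
}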

Next is a set of more generic hints for those who work with AWS, still 100% valid for CF. First, use IAM roles to launch your stacks/instances. Let me quote the official AWS IAM documentation here:

“A role is an entity that has its own set of permissions, but that isn’t a user or group. Roles also don’t have their own permanent set of credentials the way IAM users do. Instead, a role is assumed by other entities. Credentials are then either associated with the assuming identity, or IAM dynamically provides temporary credentials (in the case of Amazon EC2).”

That will make your environment more secure and save you the misery of maintaining IAM users and keys. Bear in mind that once an instance is created you cannot assign it to an IAM role, so if you’re not using IAM roles yet you should create an IAM role with an “empty” policy now and use it for all your resources until you’re ready to benefit from full-fat IAM roles.

Secondly, use minimalistic user data – make it identical across your whole estate. Delegate environment- and application-specific settings to your configuration management system. This will just make your life easier. Get familiar with and start using auto-scaling groups, even if you’re launching a single instance (in that case you can have an auto-scaling group with a minimum and maximum number of instances equal to 1). You’ll benefit from that approach later, once your service starts to scale up.

Finally, use AWS tags to tag your AWS resources. Tags allow you to do a lot of funky stuff with your AWS resources (let me only mention grouping, accounting, monitoring and reporting here).

Now, a few DON’Ts for your CF:

  • Don’t mix VPC and non-VPC regions in your mappings – CF uses a different set of properties for EC2-VPC resources than for EC2-Classic resources.
  • Don’t ever specify resource name properties in your CF template. Using auto-generated names makes your stack easily portable, so you can copy an existing stack to another environment or launch a completely new stack (say, a canary stack) from the same template. Also, some AWS resource names need to be globally or regionally unique, so defining a name in your stack is not such a good idea. Finally, virtually any resource which allows you to set its name will require replacement on update – just imagine your whole stack relaunching from scratch when someone comes up with a clever idea to rename resources in line with a new naming convention or a new product name.
  • Don’t use existing (non-CF-built) AWS objects in your stack if you can avoid it. Using existing resources makes your stack non-portable. A lot here depends on the use case (e.g. we have a couple of security groups which we use in our stacks, but even then we provide their names/IDs in the mappings or parameters, rather than using them directly in resource declarations).

Know your limits – CF is a great orchestration tool, but it has its limits. You cannot create or update some AWS resources (e.g. EC2 key pairs). You cannot self-reference security groups in their definitions, which sucks (how do I open all my Cassandra nodes for inter-node communication on port 7001 within CF?). Stacks are difficult to maintain, as there are no incremental changes. For the above and other, obvious, reasons – don’t forget to source control your CF stacks (we have a dedicated git repository for that).

Finally, the last, and maybe most important, point – separate your applications into multiple CF stacks. One can easily get excited about CF and create a single stack for the whole application (network, databases, application servers, load balancers, caches, queues and so on). That’s not a good idea – you don’t want your database servers to relaunch when you decide to modify the properties of the auto-scaling group for your application layer. The solution is simple – create multiple CF stacks for your full application stack. Make your database layer a separate CF stack, then your distribution layer (app server auto-scaling groups and ELBs) a second CF stack, and so on. This will give you the flexibility of CF without the risk of unwanted service disruption due to a CF stack update (been there, done that…). It’s very tempting to create a very sophisticated CF stack with many inter-dependent components, but I cannot stress enough how important it is not to do so.

What’s next?

We are constantly looking to improve our stacks and processes, so we are definitely only at the beginning of our CF journey. One of my colleagues is looking at another CF templating library (https://github.com/cloudtools/troposphere) to help us automate our CF creation processes even further. We will very likely start to protect our production CF resources using stack policies soon. We will start working with CF parameters and dependencies more, to make our templates 100% independent of our account/regional settings. Finally, we need to research whether Custom Resources are fit for our purposes.

Scaling Puppet for Donuts

In the last year we’ve had a fair number of challenges within NetOps, especially with our config management tool of choice, Puppet. Not only did we have a big piece of work that involved taking the giant leap from Puppet 2.x to 3.x, we also faced some architectural and performance challenges.

Whilst the upgrade was successful, we continued to have an architecture that was vertically scaled and, worse still, a CA signing authority host that had become a snowflake due to manual intervention during our upgrade. The architecture issues then really started to become apparent when we started hitting around 600 client nodes.

Now, as the old saying goes, if you have a problem, and no one else can help, maybe you should open…. a JIRA! So we did and it included the promise of donuts to the person willing to do the following:

1: Puppetise all our puppet infrastructure – inception FTW.
2: Add a level of redundancy and resilience to our Puppet CA signing authority.
3: Get us moved to the latest version of PuppetDB.
4: Make Puppet Dashboard better somehow.
5: Do it all on Debian Wheezy because Debian Squeeze support expires soon.
6: Seamlessly move to the new infrastructure.

What happened next was three weeks of work creating new modules in our Puppet git repo that could sit alongside our current configuration and be ready for the proverbial flip of the switch at the appropriate moment.

After a quick bit of research it became clear that the best approach to take was to separate out our CA signing authority host from our Puppet masters that would serve the vast majority of requests. This would allow us to make the CA resilient, which we achieved through bi-directional syncing of the signed certificates between our primary and failover CA.

Separation also meant that our “worker” cluster could be horizontally scaled on demand, and we estimate we can easily accommodate 2000 client nodes with our new set-up, which looks like this:

[Diagram: the new Puppet infrastructure.]

You may be thinking at this point that PuppetDB is an anomaly because it’s not redundant. However, we took the view that as the reporting data was transient and could potentially change with every puppet run on our agent nodes, we could quite happily take the hit on losing it (temporarily).

Yes we would have to rebuild it, but the data would repopulate once back online. In order to cope with a PuppetDB failure we enabled the “soft_write_failure” option on our Puppet masters, i.e. CA and Worker hosts. This meant that they would still serve in the event of a PuppetDB outage.
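For reference, soft_write_failure is switched on in the PuppetDB terminus configuration on each master; something like the snippet below (file location, section and server details will vary with your setup and PuppetDB version):

# /etc/puppet/puppetdb.conf
[main]
server = puppetdb.example.com
port = 8081
soft_write_failure = true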

Finally, we decided to move away from using the official Puppet Dashboard – which relied on reports and local SQL storage – and use the Puppetboard GitHub project instead, as it talks directly to PuppetDB. Puppetboard is written using the Flask (Python) web framework, and we run it internally fronted with Facebook’s Tornado web server.

Musings on BDD

This month at the London Python Dojo I gave a talk on Behaviour Driven Development. People were a lot more interested than I expected, so I thought I’d share my thoughts more widely on here. The main motivation behind this talk was to dispel some of the myths around BDD and make it more accessible. You can view the slides on google drive.

What is BDD?
The main points I made could be summarised as:

  • BDD is a methodology, not a tool or a process
  • As a methodology it’s a collection of best practices
  • You don’t have to take the whole thing for it to be useful
  • As the name should make clear – it’s about focussing on behaviour not implementation.

For a more thorough introduction I recommend that you read the slides with the speaker notes.

What I found more interesting though was the Q&A, which lasted longer than the slides themselves. There appeared to be a lot of misconceptions about testing in general. This came out because BDD encourages you to look at behaviour at the user level. This is important as it helps focus your tests and give them tangible meaning. The leap of comprehension though is that when you test at different levels you are concerned with different users.

At a unit level your user is a software developer who will be using that unit of code. The most important thing at this level is the behaviour to this developer, who could be you, or could be someone else.

At an integration level your user is another service or unit of code. Here you are concerned with the interface that is provided to these other services and code units. When this second piece of code interfaces with the first, how does it expect it to behave?

At a system level your user is probably a person, or possibly a script. At this level you should really be using the user interface, be it a GUI or text. This is the level at which there was least confusion, as it should be fairly obvious that the behaviour you’re interested in here is that of the system as a whole when the user attempts to interact with it.

BDD in Practice

One of the questions I was asked was about speed – as ‘behaviour’ is only the first word, it is important to be able to use it to drive your development. Indeed, system-level tests will rarely be as fast to execute as your unit tests. This does not, however, destroy the value of having these tests, and with a little work, in most cases, system and integration tests do not have to be as slow and unpredictable as you might have experienced. This comes down to the proper application of mocking.

The most useful way I have heard of explaining this is from a talk by Simon Stewart where he talks about having small, medium and large tests. A small test lives entirely in memory, is single threaded, doesn’t sleep, and doesn’t access system properties. They should ideally be entirely self contained and able to be run in parallel without any restrictions on ordering, and each one should take less than a minute to run.

A medium test is given much more free rein. Its only restriction is that it cannot touch anything not running on localhost, i.e. any network operations should run on loopback. It should take no more than five minutes. A large test has none of these restrictions.

From this it should be clear that your small tests are almost certainly unit tests of some description. They will act entirely on one unit of code, and any other units of code they interact with should be mocked out with the behaviour they expect to consume. Because they run so fast, you can afford to set these up to run on every commit, and wait for the results before you continue to write code that corrects any errors.
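As a tiny illustration of that last point (the names here are entirely made up), a hand-rolled stub is often all you need at the unit level:

trait PaymentGateway { def charge(pence: Int): Boolean }

class Checkout(gateway: PaymentGateway) {
  def buy(pence: Int): String = if (gateway.charge(pence)) "ok" else "declined"
}

// A small test substitutes a stub with exactly the behaviour Checkout expects to consume.
object CheckoutSpec extends App {
  val alwaysAccepts = new PaymentGateway { def charge(pence: Int) = true }
  assert(new Checkout(alwaysAccepts).buy(100) == "ok")
}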

Your large tests are going to be full end-to-end system tests, but might also include integrations with third party services, such as AWS. These are going to be slow. You might choose to use your continuous integration server to run these, or you might run them nightly. You might also choose to include your manual tests in this category, and I find it a perfectly acceptable, or in fact desirable, first step into BDD to have your behaviour specifications executed by a manual tester.

Your medium tests therefore might be assumed to be integration tests, and in some cases they will be, however by mocking out everything non-local which your user interface interacts with, you can make these medium tests too, and if you have a behaviour specification for the units this interacts with, you should have no less confidence in the results of this test. It is true that there is value in having end-to-end tests, but this can give you a much earlier indication of any issues your changes may have introduced or resolved on the system level. With up to a five minute execution time you’re probably not going to wait for them to run before you carry on coding, but you should definitely wait before merging your work to somewhere it will affect others.

Managing access from AWS EC2 through Cisco ASA firewalls

At Mind Candy, to support our games, we have increasingly been looking toward AWS in order to take advantage of features such as autoscaling and RDS. With an existing estate spread across three datacentres, introducing instances in EC2 brought about a number of challenges.

Every time an EC2 instance is spun up, it’s given a random dynamic IP and hostname. It’s not possible to know in advance what this will be. While this might sound completely crazy if you’re used to working only with physical hardware, it’s actually been a great catalyst in the move towards ‘disposable’ infrastructure, where your server nodes can build and destroy themselves when required, with no sysadmin intervention. We’ll be talking more about how we do this another time, but today I wanted to talk about security.

As with any organisation, our physical locations have firewalls with very restrictive policies. All of our servers have static addressing, and we know exactly where to find them. Creating firewall rules is predictable and no problem. Unfortunately, introducing EC2 nodes into the mix has caused us a bit of bother – how do we allow these dynamic nodes to talk back to our datacentres?

Amazon publish their list of IP ranges every now and then, via their support forum. Predictably, it’s huge. A number of companies I know have gone down the route of allowing the entire AWS address space to access their private networks. For me, this is simply asking for trouble.

We realised that we would need a more robust and targeted way of setting up access rules to only the hosts we were using, and ensuring that we didn’t persist rules for addresses which were no longer under our control.

Enter rulemanager!

This tool is written in Ruby, using the AWS SDK. Its job is to read a list of AWS account details, build a list of all active hosts and their IP addresses in every account and region, and then ensure that each IP is present in an object-group on a number of Cisco firewalls. Knowing that a certain object-group will always contain our entire AWS estate, we can happily create firewall rules without worrying about opening our network up to half the Internet.
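On the firewall side, the end result is just a regularly refreshed object-group that ordinary rules can reference, along these lines (the group name, addresses and example rule here are illustrative):

object-group network AWS-EC2-NODES
 network-object host 203.0.113.10
 network-object host 203.0.113.11
!
access-list outside-in extended permit tcp object-group AWS-EC2-NODES any eq 443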

Take a look at our github project page:
http://github.com/mindcandy/rulemanager