As we’ve mentioned in previous posts, we use AWS services extensively at Mind Candy. One of the services that we’ve blogged about before is CloudFormation. CloudFormation (CF) lets us template multiple AWS resources for a given product into a single file which can be easily version controlled in our internal Git implementation.
Our standard setup for production is to use CF to create Autoscaling Groups for all EC2 instances where, as Bart posted a while back, we mix and match our usage of on-demand instances and spot priced instances to get the maximum compute power for our money.
During load testing of the backend services of our games we did, however, notice a flaw in the way we’re doing things. Essentially, this was the speed with which we could scale up under rapid traffic surges, such as those generated by feature place in mobile app stores.
The core problem for us was that our process started with a base Amazon Image (AMI), after initial boot it would then call into Puppet to configure it from the ground up. This meant that a scaling up event could take many minutes to occur – even with SSD-backed instances – which isn’t ideal.
Not only could this take a long time – when it worked – but we were also dependent on third-party repositories being available, or praying that Ruby gem installations actually worked. If a third-party was not available then the instances would not even come up, which is worse position to be in than it just being slow.
The obvious answer to this problem is to cut an AMI of the whole system and use that for scaling up. However, this also poses another problem that you now make your AMI a cliff edge that sits outside of your configuration management system.
This is not a particularly new problem or conundrum of course. I can personally recall quite heated debates in previous companies about the merits of using AMIs versus a configuration management system alone.
We thought about this ourselves too and came the conclusion that instead of accepting this binary choice we’d split the difference and use both. We achieved this by modularising our deployment process for production and using a number of different tools.
Teamcity – we were already using our continuous integration system as the initiator of our non-production deployments so we decided to leverage all the good stuff we already had there and, crucially, we could let our different product teams deploy their own builds to productions and we would just support the process.
Fabric – we’ve been using Fabric for deployments for quite some time already. Thanks to the excellent support for AWS through the Boto library we were easily able to utilise the Amazon API to programmatically determine our environments and services within our Fabric scripts.
Puppet – when you just have one server for a product using a push deploy method makes sense as its quick. However, this doesn’t scale. Bart created a custom Puppet provider that could retrieve a versioned deployment from S3 (pushed via Fabric) so we could pull our code deploys on to remote hosts.
Packer – we opted to use Packer to build our AMIs. With Packer, we could version control our environments and then build a stable image of a fully puppetized host which would also have the latest release of code running at boot, but could still run Puppet as normal as well. This meant we could remove the cliff edge with an AMI, because, at the very worst we would bring up the AMI and then gain anything that was missing but do so quickly as it was “pre-puppetized”.
Cloudformation – Once we had a working AMI we could then update our version controlled templates and poke the Amazon API to update them in CloudFormation. All scaling events would then occur using the new AMI containing the released version of code.
The Process – when you hit “Run” in Teamcity
- Checkout from git the Fabric repo, the Packer repo and the Cloudformation repo.
- Using a config file passed to Fabric that would run a task to query the Amazon API and discover our current live infrastructure for a given application/service.
- Administratively disable Puppet on the current live infrastructure so Puppet doesn’t deploy code from S3 outside of the deployment process.
- Push our new version of code to S3.
- Initiate a Packer build, launching an instance and deploying the new code release.
- Run some smoke tests on the Packer instance to confirm and validate deployment.
- Cut the AMI and capture its ID from the API when its complete.
- Re-enable and run Puppet on our running infrastructure thus deploying the new code.
- Update our Cloudformation template with the new AMI and push the updated template to the CloudFormation API.
- Check-in the template change to Git.
- Update our Packer configuration file to use the latest AMI as its base image for the next deploy.
What we’ve found with this set-up is, for the most part, a robust means of using Puppet to deploy our code in a controlled manner, and being able to take advantage of all the gains you get when autoscaling from baked AMI images.
Obviously we do run the risk of having a scaling event occur during deployment, however, by linking the AMI cutting process with Puppet we’re yet to experience this edge case, plus all our code deploys are (and should be) backwards compatible, so the edge case doesn’t pose that much of a risk in our set-up.