Utilising AWS Lambda to migrate a 25,000,000+ image S3 bucket

When AWS announced AWS Lambda at last year’s re:Invent, we were really excited about it here at Mind Candy. The concept of a zero-administration compute platform that is scalable, cheap, easy to use and, at the same time, integrates with so many AWS services through triggers is pretty exciting and potentially very powerful.

Since then, we started using AWS Lambda in some of our products – PopJam being one of them. We use it to near-instantly generate thumbnails of all the amazing creations users of PopJam share through the app.

Recently, quite an interesting story surfaced in our sprint – we were to migrate one of the AWS S3 buckets PopJam uses from the US to the EU (to bring it closer to the backend and users) without any downtime for users.

Now, you might think – “why would that be interesting?”

The answer is the scale of the task – 25,000,000+.

The aforementioned AWS S3 bucket stores over 25,000,000 files (mostly images) and this number is growing faster every single day. Just running ‘s3cmd du’ on the bucket took almost a day. When I tried to run ‘s3cmd ls’ to count the number of keys in the bucket, I got bored before it finished (I ended up writing a simple Python script that uses multiprocessing to split the counting across 256 workers; only then would it finish within a few minutes).
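
The original script isn’t reproduced here, but the idea boils down to the sketch below: shard the keyspace by prefix and count each shard in its own process. The bucket name, the use of boto3 and the assumption that keys sit under two-character hex prefixes are all illustrative.

# Rough sketch only: count keys in parallel by sharding on key prefix.
from multiprocessing import Pool

import boto3

BUCKET = 'popjam-old-bucket'  # hypothetical name


def count_prefix(prefix):
    s3 = boto3.client('s3')  # one client per worker process
    paginator = s3.get_paginator('list_objects_v2')
    return sum(page.get('KeyCount', 0)
               for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix))


if __name__ == '__main__':
    prefixes = ['%02x' % i for i in range(256)]  # 256 shards
    with Pool(processes=32) as pool:
        print(sum(pool.map(count_prefix, prefixes)))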

Obviously, any existing CLI command like ‘s3cmd sync’ or the AWS CLI ‘s3’ commands was out of the question: by the time it finished (after many, many hours), the source bucket would have tens of thousands of new files which hadn’t been copied across, we’d have to re-run it, and that would lead to the same situation all over again.

As I mentioned, AWS Lambda functions can be triggered by other AWS services, one of them being AWS S3. Essentially, we can configure an AWS S3 bucket to invoke a Lambda function whenever a new object (key) is created.

Given this, we could attach a Lambda function to the old bucket, triggered whenever a new key is created (the ObjectCreated event), that copies new keys over to the new bucket. Then we’d only have to sync the old bucket to the new one once, without having to worry about missing keys along the way.
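
Wiring that up amounts to adding a notification configuration on the source bucket. A minimal boto3 sketch might look like the following; the bucket name and function ARN are made up, and the function also needs a resource policy allowing S3 to invoke it:

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# Invoke the copy function for every new object created in the old (US) bucket.
s3.put_bucket_notification_configuration(
    Bucket='popjam-old-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:s3-copy-to-eu',
            'Events': ['s3:ObjectCreated:*'],
        }]
    },
)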

The proposed plan looked like this:

  1. Create new S3 bucket in EU
  2. Set up AWS Lambda Copy function and configure it to be triggered whenever a new key is added
  3. Run aws s3 sync command in background
  4. Wait, wait, wait…
  5. Reconfigure CDN to use the new bucket as origin
  6. Switch backend application to upload all images from now on, to the new S3 bucket in EU

This plan also meant there should be zero downtime during the whole migration. Everyone likes zero-downtime migrations, right?

The actual implementation, while not very painful, did uncover a few issues with the plan that had to be dealt with. These issues resulted in some learnings which I wanted to share here.

AWS Lambda copy object function

The Lambda function code to perform the copy happens to be pretty trivial.

var AWS = require('aws-sdk');

exports.handler = function(event, context) {
    // The destination bucket lives in eu-west-1
    var s3 = new AWS.S3({region: 'eu-west-1'});

    var record = event.Records[0].s3;
    var params = {
        Bucket: 'popjam-new-bucket',
        CopySource: record.bucket.name + '/' + record.object.key,
        Key: record.object.key,
        ACL: 'public-read'
    };

    s3.copyObject(params, function(err, data) {
        if (err) context.fail(err);   // surface the failure to Lambda
        else     context.done();      // successful response
    });
};

It just works, but there’s one small caveat…

… what happens to S3 object ACLs should they be changed in the meantime?

We needed ACLs for particular objects to be in-sync (for various reasons, one of them being moderation).

Given that the AWS Lambda function is triggered on the ObjectCreated event (there sadly isn’t a way to trigger it on an ObjectModify-style event), should you need to change an object’s ACL, there’s no way to propagate that through AWS Lambda at this stage.

We worked around this problem by writing a Python script that iterates through the S3 buckets, compares ACLs and tweaks them where needed (as before, we had to parallelise it, otherwise it would take ages).
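
Stripped of the parallelisation, the heart of that reconciliation might look something like this sketch (bucket names are hypothetical, boto3 assumed):

import boto3

s3 = boto3.client('s3')


def sync_acl(key, src='popjam-old-bucket', dst='popjam-new-bucket'):
    """Copy the source object's ACL grants onto the destination object if they differ."""
    src_acl = s3.get_object_acl(Bucket=src, Key=key)
    dst_acl = s3.get_object_acl(Bucket=dst, Key=key)
    if src_acl['Grants'] != dst_acl['Grants']:
        s3.put_object_acl(
            Bucket=dst,
            Key=key,
            AccessControlPolicy={
                'Grants': src_acl['Grants'],
                'Owner': dst_acl['Owner'],  # keep the destination object's owner
            },
        )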

Beware of AWS Lambda limits!

While pretty scalable, AWS Lambda does have some limits. We were bitten by the “Concurrent requests per account” and “Requests per second per account” limits a few times (fortunately we did just enough with AWS Lambda to get the attention of the AWS Lambda product team, and they kindly raised these limits for us).

For most use cases those limits should be fine, but in our case, where on top of the AWS Lambda copy function we were also triggering a series of functions to generate thumbnails, we hit them pretty quickly and had to temporarily throttle our migration scripts.

AWS Lambda is still pretty bleeding edge technology

AWS Lambda will work great for you most of the time. However, when it fails, troubleshooting can be quite … inconvenient to say the least.

Remember you can now view all AWS Lambda logs through CloudWatch – make use of them and don’t shy away from placing debug statements in your code.

Deploying AWS Lambda functions is pretty tricky, too. While there are some options, the tooling is still at an early stage and it feels like even AWS is still figuring it out (potentially through feedback from customers – if you use AWS Lambda, do make sure to feed back to AWS).

The most interesting tool I’ve found for supporting deployment of, and integration with, AWS Lambda in general is kappa.

And all of this for what?

Let the graph speak for itself…

(The graph shows upload times to the S3 bucket in the US – green line – and to the S3 bucket in the EU after the migration – orange line.)

Office Music

Office music: some love it and some hate it. While I’m in the camp that’s for office music, I can completely understand why some might not be in favour of it.

We here at Mind Candy find music in the workplace to be a mood enhancer and, in a way, a bonding process. You find similarities between yourself and your peers and form links that weren’t there previously. Music helps reduce those awkward silences filled with keyboard tapping, mouse clicking and the odd coughing fit, and introduces an atmosphere which is conducive to the culture we look to nurture and promote. There are a few great articles out there which go into greater detail about whether music in the workplace is a good or bad thing; some can be found here.

Last year we started looking into a solution for playing music in the area where our team sits, and after some search engine fu we found Mopidy. Mopidy is an extensible MPD and HTTP server written in Python. Mopidy plays music from your local disk and radio streams, while with the help of extensions you can also play music from cloud services such as Spotify, SoundCloud and Google Play Music.

As we already have a few Spotify accounts we thought we’d toy with the idea of using Mopidy to play music from Spotify. In order to use Spotify you also need to use the Mopidy-Spotify extension.

Once we had both Mopidy and the Spotify extension working we then needed something to interact with it all. After looking through the Mopidy documentation we came across the web extensions section which suggests various web interfaces to interact with the HTTP side of the Mopidy server.

Initially we used Apollo Player. Apollo Player’s great as it allows anyone to log in using their Google Apps or Twitter credentials and then add music to a one-time playlist, meaning anyone can choose what music is playing. There is also a bombing feature, so any track that’s been added can be skipped if it’s bombed by three people. When no music has been selected it defaults back to a playlist set in config.js, which is found in the root directory of Apollo. The problem there is that once the default playlist has been played for the umpteenth time it can get pretty tedious, and only people with access to the app’s root directory can change it. This led us to Mopify.

Mopify gives you much of the functionality that the Spotify client gives you e.g. Browse, Featured Playlists, New Releases, Playlists and Stations. You can log in with your own Spotify account or use the account that Mopidy-Spotify is utilising and use the playlists associated with either account. It gives you greater functionality and options than Apollo but then you lose the collaboration and unmanaged element you had with Apollo.

Finally, we needed to actually run Mopidy on something, as it was no good having it run from my local machine. We decided to use a Raspberry Pi and plugged it into some speakers running along the cable trays above our heads. The Raspberry Pi runs Raspbian with Mopidy, Mopidy-Spotify and whichever web extension we’ve chosen. Another Raspberry Pi with Mopidy has been set up as a jukebox in our chillout/games area, which works really well with mobile devices as most of the web extensions are built with Bootstrap. This gives employees the flexibility to easily play whatever music they feel like when they are in the communal area.

In our eyes, while music in the office isn’t a necessity, it is definitely beneficial, and it’s fantastic that all these open source tools and products give us the ability to do this.

And let’s be honest, who can resist an impromptu sing-along to Bohemian Rhapsody!

Mopidy – Extensible music server written in Python

Mopidy-Spotify – Mopidy extension for playing music from Spotify

Apollo Player – Mopidy web extension

Mopify –  Mopidy web extension

Raspberry Pi – ARM based computer running under GNU/Linux

 

Pi-tomation

Screen Automation – Selenium (and some other stuff) meets Raspberry Pi

Let’s set the scene: you need to display some stuff on a screen so everyone in the office can see it. Easy – you mount a couple of TVs on the wall, get a DVI splitter and an old Mac Mini you had in the store room on the top shelf behind a roll of Cat5 cable.

Set everything up, get the Mac Mini to auto-login and mount a shared drive, then run a little script that uses Selenium to open a browser and show pre-determined images of the stuff you want to display, all stored on the same shared drive. Done…

Fast forward a couple of years and you now have a lot more to display on a lot more screens, but what are you going to do? It’s impractical – and expensive – to buy a bunch of Mac Minis just to run a script that opens a web browser. The end goal of all this is to have dashboards that are easily manageable by their respective teams.

Challenge Accepted

Have you heard of this new Raspberry Pi thing? It’s a small ARM PC that’s the size of a credit card, and they’re cheap. What’s that, they’re also USB powered? Bonus – now we can power them from the TV itself, and when the TV comes on the Pi comes on. Now we just replace the Mac Mini with the Pi, run the same script when it boots, and we’re all done. Wait, not so fast: the share isn’t public, so we need credentials to connect. That’s OK, we can store them in a file locally and use fstab to connect. Yeah, that works, but we want to display different things on different screens, so now I have to create different scripts and manually tell each Pi which one to use. OK, that’s not too bad – the first time you set up each one, just point it at the script it needs to run, and then you can update the script and reboot the Pi. So far it’s shaky, but it works… sometimes. One of the problems was that sometimes it would try to run the script on the network share before it was mounted properly, and besides, running a script (or multiple, at this point) over the network on a device with the processing power of about 7.4 hamsters isn’t really going to cut it. I was getting tired of crowbarring fixes into something that wasn’t really designed for this use and troubleshooting seemingly random issues.

What do I actually want to accomplish here and how am I going to do it??

  1. Have the script run locally; it’s only managing a web browser after all.
  2. Config easily changeable and centrally managed.
  3. Get the pi to check for new config on startup.

Done – yes, that’s it, pretty simple. So here’s what I did.

Ingredients

  • bash script
  • JSON file. Lists the pages that the web browser should visit; these could also be local files (images etc.) loaded into the browser.
  • Python script. Loads the JSON ‘config’, specifies how long each page should be displayed, and does a bit of error checking.
  • Git (or other) repository

Method

Edit your rc.local to run a bash script that lives somewhere locally on the Pi, e.g. /opt/scripts/. The bash script downloads Selenium, Firefox (actually Iceweasel on Debian) and Facter (so we can get system info really quickly).

I did consider using Puppet for this whole thing at one point, but that was a bit of overkill, plus it had its own complications at the time trying to run on an ARM processor.

The bash script also uses Facter to determine the MAC address of the Pi and remove the colons (I must admit that Facter may be a bit of overkill here as well, but hey, I’ve gotten used to having it around). It then searches your web server (or other location) for files carrying its MAC address as a name (I have a set of defaults that it uses if none are found). Have your web server run a cron job that pulls the repository of all your files. You could have each device pull the repository directly, but the more screens you have the more inefficient that becomes, as you’ll be storing a whole repo on the Pi just to get at one or two files. You could also have a webhook that only updates the web server when there are changes to the repo, but I didn’t think it was worth it at this point. The JSON is self-explanatory.
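
For a flavour of the Python side, the ‘config loader’ from the ingredients list boils down to something like this sketch – the JSON field names, the file path and the use of Firefox via Selenium’s Python bindings are assumptions for illustration, not the exact code in the repo:

import json
import time

from selenium import webdriver

# Hypothetical location the bash script drops the per-Pi config into
with open('/opt/scripts/screens.json') as f:
    config = json.load(f)

driver = webdriver.Firefox()  # Iceweasel on Raspbian

# Cycle through the configured pages forever
while True:
    for page in config.get('pages', []):
        url = page.get('url')
        if not url:
            continue  # basic error checking: skip malformed entries
        driver.get(url)
        time.sleep(page.get('seconds', 60))  # how long to show each page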

You can take a look at the principle here.

https://github.com/mindcandy/pi-screens.git

Plans for the future of this project include a self-service dashboard that will take the ingredients and mix them with the right config, without the user necessarily having any coding knowledge.

HTML5 games in Mind Candy


HTML5 games have always been a bit of a grey area; even with the decline of the Flash platform, it still felt like web technologies were lagging behind what the Flash and Unity players could do in the browser.

Over the last year or two this has all changed. Since Steve Jobs declared war on Flash it’s been a bit of a bumpy ride, but with companies such as Google, Mozilla and Microsoft all getting behind HTML5, and the W3C finally declaring the standard ‘complete’, it suddenly feels like the technology has grown up.

HTML games have also grown up, with Nintendo partnering with Unity and ImpactJS for their Web Framework, and the BBC and Nickelodeon investing a lot of money into converting their existing Flash games to create new and exciting experiences for users on a wider range of devices.

Here at Mind Candy we always want to push things and try new technologies; however, we also feel that whatever we try has to work in a real-world scenario, and while HTML5 has been around for a while, we never felt it was a good fit for us until now.

Why HTML5?

With PopJam growing as a platform we always wanted to deliver games to our audience. However, with App Store submission times, releasing content frequently was completely out of the window if we made the games natively within the app; and since we had to support multiple platforms, we needed something we could write once and deploy everywhere. This is where HTML5 came in for us.

Cross Platform

One of the huge benefits of using HTML5 was that it is truly cross platform, and while the performance of native will always be far greater, porting the games over to each platform would’ve destroyed us as a team.

When starting out with HTML5 we instantly noticed that, even though we were cross-platform, there were still hoops we needed to jump through to make things work in the way we wanted – the main pain point being audio.

As we were targeting a mass of devices, we needed to make sure our games worked at all resolutions and that inputs worked as expected; however, it felt that once we’d broken through this barrier we’d be okay.

iOS provided a UIWebView we could use out of the box; for Android, however, we decided to use the Crosswalk Project, as it gave us greater control than the WebView that comes built into Android.

Fast Iterations

Using HTML5 meant we were not bound by App Store restrictions, so we could push new games and updates out incredibly fast. It’s not only deployments that are faster either; one of the most powerful things about making HTML5 games is that the game is just a link to a page, and it can be played.

With JavaScript there is no compilation time, and you can debug in real time within the browser. This also meant we could develop some pretty cool in house tools that would speed up the development of our games and systems, whilst running within PopJam.

Starting out

One of the things about making HTML5 games is that there are so many things to consider – asset loading, memory management, input, physics, 3D, 2D, animations and many more – so we had to decide on the best way to deliver our games in the most optimal way possible.

On top of all of these decisions there are also multiple ways to render content within the browser:

  1. Canvas – Started as an experiment by Apple, it is now possibly the most widely supported standard for generating graphics on the web. Using canvas also eliminates a lot of the cross-compatibility issues that other methods may have. Performance tests on both iOS and Android worked out quite well for us.
  2. WebGL – WebGL offers hardware-accelerated graphics within the browser, but on mobile it is really still early days; while iOS implemented full support, it still comes with some very interesting edge cases, and Android support for WebGL is a very different world, as we found out when targeting low-end devices.
  3. Divs / CSS Transitions – Updating divs rendered on the page is an interesting method as it allows for nice effects using CSS3 transitions; however, the lack of support across mobile browsers and different versions of mobile operating systems was a problem.

We tried all of the above methods and ultimately ended up utilising all of them; it really came down to the content being presented to the user. We used WebGL where we could, and on anything that didn’t support it we fell back to Canvas.

For anything with relatively simple content, we ended up manipulating divs and using various methods of transitioning elements to work around cross-compatibility issues.

Choosing a Game Engine

One of the things that stood out when looking for a game engine is that there are a lot of them – and not only engines; there are also products out there, known as ‘game makers’, that allow you to make games with little to no code, such as Construct, Game Maker and GameSalad. If you’re looking for something to try, I can highly recommend this website.

We actually tried a couple of different engines, as well as allowing people who weren’t developers to use the ‘Game Makers’ to prototype ideas and test performance.

After evaluating our choices we decided to use Pixi.js from Goodboy Digital, an incredibly lightweight engine that offers an ActionScript-like API as well as many other features such as:

  • Asset management
  • Multi Touch for Mobile
  • Sprite Sheet support
  • Full Scene Graphs
  • Third Party Libraries (Spine, Tiling)

It also allowed us to toggle effortlessly between Canvas and WebGL, enabling support on lower-end devices.

Another thing Pixi has is thorough tutorials, incredible documentation and a very active community, which goes a long way when choosing something like an engine, be it for games or software in general.

At the time of writing this article, the Pixi team has just announced v3 of the engine and provided a benchmark test to show off the performance. I would strongly urge you to go check it out; even on a low-end device it’s pretty impressive.

Tools

One of the things that came as a breath of fresh air when venturing into the world of JavaScript is how far it’s come since I last used it; for the last few years I’ve had my head firmly planted in AS3 and Unity with C#.

With tools such as:

  1. Yeoman – Yeoman lets you start new projects by choosing from the hundreds of generators that have been created for it, so you can scaffold new projects quickly whilst following best practices and tooling.
  2. Bower – Along with npm, this is one of the most lightweight package managers I’ve used in my career, allowing us to manage dependencies across projects effectively and keep our repositories incredibly small.
  3. Grunt – Using Grunt as our build system was one of the best decisions we made, allowing us to move incredibly fast when building our games and to automate a lot of tasks that would’ve been incredibly laborious to do manually.

We were able to create a solid workflow from starting a project to releasing our content onto PopJam.

Beyond these tools, the JavaScript community is an incredibly talented one, with a lot of libraries out there that help with everyday web development.

It’s not all rosy

As amazing as the past few months of making games have been, they have not been without headaches and hair-pulling moments – but that’s why we love what we do, right? If it wasn’t a challenge it would be boring.

Targeting multiple platforms comes with its own problems, but some of the biggest we had were with hardware on Android: there are a lot of cheaper, low-end devices that are prime candidates for parents to buy for their children, and we encountered devices claiming to support certain features which, when running in the browser, would crash PopJam instantly, leaving us in a state of flux with no logs to go on. We found a lot of this came down to the chipsets the cheaper devices use.

It wasn’t only Android that caused us problems either. The iPod Touch 4G was one of the most used devices amongst children, and with some only supporting iOS 6 we couldn’t push performance as much as we wanted; the iOS 6 UIWebView implementation was also very temperamental about which standards it supported.

The one thing that caused us the most headaches, though, was audio. HTML5 audio is still very limited, and even more so on some of the cheaper devices, some of which only support the WAV format; that means larger file sizes, and using any other format would cause the whole application to crash as no other codec was available.

Conclusion

We’ve had some amazing fun creating interesting games for PopJam using HTML5 – not only because we got to make games, but because we also got to build some awesome internal technology and tools, create a pipeline from concept to production in just a few months and, most importantly, create some engaging experiences for our PopJam users.

Deployments using All the Things!

As we’ve mentioned in previous posts, we use AWS services extensively at Mind Candy. One of the services that we’ve blogged about before is CloudFormation. CloudFormation (CF) lets us template multiple AWS resources for a given product into a single file which can be easily version controlled in our internal Git implementation.

Our standard setup for production is to use CF to create Autoscaling Groups for all EC2 instances where, as Bart posted a while back, we mix and match our usage of on-demand instances and spot priced instances to get the maximum compute power for our money.

During load testing of the backend services of our games, however, we noticed a flaw in the way we were doing things: essentially, the speed with which we could scale up under rapid traffic surges, such as those generated by featured placement in mobile app stores.

The core problem for us was that our process started with a base Amazon Machine Image (AMI); after the initial boot it would then call into Puppet to configure the instance from the ground up. This meant that a scale-up event could take many minutes to complete – even with SSD-backed instances – which isn’t ideal.

Not only could this take a long time – when it worked – but we were also dependent on third-party repositories being available, or praying that Ruby gem installations actually worked. If a third party was not available then the instances would not even come up, which is a worse position to be in than it just being slow.

The obvious answer to this problem is to cut an AMI of the whole system and use that for scaling up. However, this poses another problem: the AMI becomes a cliff edge that sits outside of your configuration management system.

This is not a particularly new problem or conundrum of course. I can personally recall quite heated debates in previous companies about the merits of using AMIs versus a configuration management system alone.

We thought about this ourselves too and came to the conclusion that, instead of accepting this binary choice, we’d split the difference and use both. We achieved this by modularising our deployment process for production and using a number of different tools.

The Tools

Teamcity – we were already using our continuous integration system as the initiator of our non-production deployments, so we decided to leverage all the good stuff we already had there; crucially, we could let our different product teams deploy their own builds to production while we just supported the process.

Fabric – we’ve been using Fabric for deployments for quite some time already. Thanks to the excellent support for AWS through the Boto library, we were easily able to use the Amazon API to programmatically determine our environments and services within our Fabric scripts.
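
As a rough illustration of that discovery step, a Fabric task might look something like the sketch below; the task name, the ‘Service’ tag and the use of boto3 are assumptions rather than our exact scripts:

import boto3
from fabric.api import env, task


@task
def discover(service='popjam-api'):
    """Point Fabric at the running instances tagged for a given service."""
    ec2 = boto3.client('ec2', region_name='eu-west-1')
    pages = ec2.get_paginator('describe_instances').paginate(
        Filters=[{'Name': 'tag:Service', 'Values': [service]},
                 {'Name': 'instance-state-name', 'Values': ['running']}])
    env.hosts = [instance['PrivateIpAddress']
                 for page in pages
                 for reservation in page['Reservations']
                 for instance in reservation['Instances']]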

Puppet – when you have just one server for a product, a push-deploy method makes sense as it’s quick. However, this doesn’t scale. Bart created a custom Puppet provider that could retrieve a versioned deployment from S3 (pushed via Fabric), so we could pull our code deploys onto remote hosts.

Packer – we opted to use Packer to build our AMIs. With Packer, we could version control our environments and then build a stable image of a fully puppetized host which would have the latest release of code running at boot, but could still run Puppet as normal. This meant we could remove the cliff edge that an AMI creates, because at the very worst we would bring up the AMI and then gain anything that was missing, but do so quickly as it was “pre-puppetized”.

CloudFormation – once we had a working AMI we could then update our version-controlled templates and poke the Amazon API to update them in CloudFormation. All scaling events would then occur using the new AMI containing the released version of code.

The Process – when you hit “Run” in Teamcity

  1. Checkout from git the Fabric repo, the Packer repo and the Cloudformation repo.
  2. Using a config file passed to Fabric, run a task that queries the Amazon API and discovers our current live infrastructure for a given application/service.
  3. Administratively disable Puppet on the current live infrastructure so Puppet doesn’t deploy code from S3 outside of the deployment process.
  4. Push our new version of code to S3.
  5. Initiate a Packer build, launching an instance and deploying the new code release.
  6. Run some smoke tests on the Packer instance to confirm and validate deployment.
  7. Cut the AMI and capture its ID from the API when it’s complete.
  8. Re-enable and run Puppet on our running infrastructure thus deploying the new code.
  9. Update our CloudFormation template with the new AMI and push the updated template to the CloudFormation API (see the sketch after this list).
  10. Check-in the template change to Git.
  11. Update our Packer configuration file to use the latest AMI as its base image for the next deploy.
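
Steps 7 and 9 boil down to something like the following sketch; the stack name, the ‘BaseAmiId’ parameter key and the region are hypothetical, and boto3 is assumed:

import boto3


def update_stack_ami(stack_name, ami_id, template_body):
    """Wait for the freshly cut AMI, then push it into the CloudFormation stack."""
    ec2 = boto3.client('ec2', region_name='eu-west-1')
    cloudformation = boto3.client('cloudformation', region_name='eu-west-1')

    # Step 7: block until the AMI Packer produced is actually usable
    ec2.get_waiter('image_available').wait(ImageIds=[ami_id])

    # Step 9: update the stack, passing the new AMI in as a template parameter
    cloudformation.update_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=[{'ParameterKey': 'BaseAmiId', 'ParameterValue': ami_id}],
    )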

What we’ve found with this set-up is, for the most part, a robust means of using Puppet to deploy our code in a controlled manner, while being able to take advantage of all the gains you get when autoscaling from baked AMI images.

Obviously we do run the risk of having a scaling event occur during a deployment; however, by linking the AMI-cutting process with Puppet we’re yet to experience this edge case, plus all our code deploys are (and should be) backwards compatible, so the edge case doesn’t pose that much of a risk in our set-up.

Replicating from AWS RDS MySQL to an external slave (without downtime)

We recently needed to create an external copy of a large database running on Amazon RDS, with minimal or no downtime. The database is the backend to a busy site, and our goal was to create a replica in our data centre without causing any disruption to our users. With Amazon adding support for MySQL 5.6, we were able to access the binary logs from an external location, which wasn’t possible before.

As MySQL replication only works from a lower version to an equal or higher version, we had to ensure that both our databases were on MySQL 5.6. This was simple with regards to the external slave, but not as easy for the RDS instance, which was on MySQL 5.1. Upgrading the RDS instance would require a reboot after upgrading to each version, i.e. 5.1 -> 5.5 -> 5.6. As per the recommendation in the Amazon upgrade guide, we created a read replica and upgraded it to 5.6. With the replica synced up, we needed to enable automated backups before it was in a state where it could be used as a replication source.
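
For anyone driving those steps through the RDS API rather than the console, the outline looks roughly like the sketch below; the instance identifiers and engine versions are illustrative, and boto3 is assumed:

import boto3

rds = boto3.client('rds', region_name='us-east-1')

# Create the read replica that will eventually act as the replication source
rds.create_db_instance_read_replica(
    DBInstanceIdentifier='popjam-replica',
    SourceDBInstanceIdentifier='popjam-master')

# Upgrade it in steps (5.1 -> 5.5 -> 5.6); each step triggers a reboot
rds.modify_db_instance(DBInstanceIdentifier='popjam-replica',
                       EngineVersion='5.5.46', ApplyImmediately=True)
# ... wait for the upgrade to finish (e.g. poll describe_db_instances), then repeat for 5.6 ...

# Automated backups must be enabled before it can be used as a replication source
rds.modify_db_instance(DBInstanceIdentifier='popjam-replica',
                       BackupRetentionPeriod=7, ApplyImmediately=True)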

Creating an initial database dump proved tricky, as the actual time to create the backup was around 40-50 minutes. The import into the external slave took around 3-4 hours, and with the site being as active as it is, the binary log file and position change pretty quickly. The best option was to stop replication on the RDS slave while the backup was happening. Due to the permissions given to the ‘master’ user by Amazon, running a STOP SLAVE command returns an error:

ERROR 1045 (28000): Access denied for user 'admin'@'%' (using password: YES)

Luckily there’s a stored procedure which can be used to stop replication – mysql.rds_stop_replication:

mysql> CALL mysql.rds_stop_replication;
+---------------------------+
| Message                   |
+---------------------------+
| Slave is down or disabled |
+---------------------------+
1 row in set (1.08 sec)

Query OK, 0 rows affected (1.08 sec)

With replication on the RDS slave stopped, we can start creating the backup assured that no changes will be made during the process and any locking of tables won’t affect any users browsing the website.

Once the backup completes we’d want to start replication again, but before doing so we grab the binlog file name and position:

mysql> show master status;
+----------------------------+----------+--------------+------------------+-------------------+
| File                       | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+----------------------------+----------+--------------+------------------+-------------------+
| mysql-bin-changelog.074036 | 11653042 |              |                  |                   |
+----------------------------+----------+--------------+------------------+-------------------+

These will be required when setting up the external slave later on. Now that we have the relevant information, we can start replication again, using the RDS equivalent of START SLAVE:

mysql> CALL mysql.rds_start_replication;
+-------------------------+
| Message                 |
+-------------------------+
| Slave running normally. |
+-------------------------+

 

Once the dump has been imported, we can set the new master on the external slave with the values previously recorded:

CHANGE MASTER TO MASTER_HOST='AWS_RDS_SLAVE', MASTER_PASSWORD='SOMEPASS', MASTER_USER='REPL_USER', MASTER_LOG_FILE='mysql-bin-changelog.074036', MASTER_LOG_POS=11653042;

Before we start the replication, we need to add a few more settings to the external slave’s my.cnf:

  • a unique server-id i.e. one that’s not being used by any of the other mysql DBs
  • the database(s) you want to replicate with replicate-do-db. This stops the slave trying to replicate the mysql system database and other potential RDS-related stuff. Thanks to Phil for picking that up.

So something like:

server-id = 739472
replicate-do-db=myreplicateddb
replicate-do-db=mysecondreplicateddb (if more than one db needs to be replicated)

Start up replication on the external slave – START SLAVE; 

This should start updating the slave, which you can monitor via

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Log_File: mysql-bin-changelog.075782
Read_Master_Log_Pos: 113973
Relay_Log_File: relaylog.014354
Relay_Log_Pos: 114089
Relay_Master_Log_File: mysql-bin-changelog.074036
Slave_SQL_Running_State: updating

The above values are the most important from the sea of information that the command returns. You’ll be waiting for Master_Log_File and Relay_Master_Log_File to be identical, and for Slave_SQL_Running_State to show ‘Slave has read all relay log; waiting for the slave I/O thread to update it’.
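
If you’d rather script that wait than keep re-running the command, a small polling sketch (assuming the PyMySQL driver and placeholder credentials) could look like this:

import time

import pymysql

conn = pymysql.connect(host='localhost', user='root', password='SOMEPASS',
                       cursorclass=pymysql.cursors.DictCursor)

while True:
    with conn.cursor() as cur:
        cur.execute('SHOW SLAVE STATUS')
        status = cur.fetchone()
    caught_up = (status['Master_Log_File'] == status['Relay_Master_Log_File']
                 and status['Slave_SQL_Running_State'].startswith('Slave has read all relay log'))
    if caught_up:
        print('Slave has caught up')
        break
    time.sleep(30)  # check again in half a minute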

Once that syncs up, an external replica has been created with zero downtime!

Managing Mac OS X

In the beginning, Clonezilla was used for imaging the Macs and for asset management we solely used GLPI. Having come from a Windows background with a little bit of Debian and Red Hat experience, this lack of control and visibility over the machines I was supporting was completely alien to me. Soon after joining, I was given the task of reviewing our imaging process and looking at what other technologies were available. Along came DeployStudio.

Deployment

DeployStudio is a fantastic deployment tool which allows you to image and deploy Macs and PCs. Rather than just creating and deploying images, it allows you to create advanced workflows and fully automate your deployment from start to finish. In order to run DeployStudio you will need OS X Server for NetBoot deployments. We initially installed DeployStudio on a Mac Mini, which ran fine, and then set up another DeployStudio server on a second Mac Mini as a replica for redundancy purposes.

We didn’t want to get into the habit of creating multiple images for different models of machine, but instead have one vanilla image with varying workflows depending on the purpose of the Mac it will be deployed to. To create a vanilla image, you download the latest Mac OS X installer via the App Store and then use AutoDMG to build a system image with the latest updates, ready for use. Once we have our base image we can then create a workflow. This workflow applies all the changes needed over the top of the vanilla image, such as localising the Mac, setting the computer name, creating the local users, installing necessary packages and running scripts; lastly, it runs software update.

Management 

We were then left with the problem of how to preload and manage software on all our Macs. DeployStudio can install packages, but they quickly become out of date, and you aren’t able to manage the software already installed on the machines on your network.

Munki is a package management system which allows you to distribute software via a web server and client-side tools. Using catalogs and manifests you can install and remove software on your clients depending on which manifest has been applied to which machine. Munki also, rather handily, collects information on the clients each time it runs and stores it on the Munki server – the user, hostname, IP, failed Munki runs, installed software and so on.

The problem we found is that while Munki works well, the software uploaded to the Munki server soon becomes out of date, and managing the repository of software becomes a task in itself. Here’s where AutoPkg and Jenkins come in. AutoPkg is an automation framework for OS X software packaging and distribution, and Jenkins is a continuous integration server. AutoPkg uses recipes to download the latest software and create the relevant metadata ready for importing to the Munki server via the Munki tools. Jenkins is then used to schedule the AutoPkg recipes and Munki imports as builds every day in the early hours of the morning. The end product is Munki containing the most up-to-date software that you’d expect on all of your clients, e.g. Chrome, Firefox, Munki tools, antivirus, Mac Office.

Monitoring

After playing with Munki and being really impressed with it, I decided to see what else I could find that would potentially hook into it. After some searching I found Sal, which is the creation of Graham Gilbert. Graham is an active contributor in the Mac management world and I personally am a big advocate of his; you can find his GitHub page here. Sal is a Munki and Puppet reporting tool which gives you not only a broad overview of your Macs via a dashboard, but also an in-depth look at each client via the stats collected by Munki and Facter.

Once Sal has been configured on the client, each time Munki runs it triggers a Sal submit script which gathers information and stores it on the Sal server. The highly customisable Sal dashboard uses plugins to display high-level counters showing, for example, which machines are running which version of OS X or when the clients last checked in. There are custom plugins available on GitHub, or you can even create your own. If you wanted to, you could rearrange the order of the plugins and hide certain plugins if you had multiple dashboards. Thanks to Munki and Facter, it also gives you the ability to drill down into an individual client and retrieve hourly status updates as well as static information about the machine.

Asset Management 

In terms of asset management not much has changed, as GLPI is fit for purpose, so we’ve had no need to replace it; however, we do now use Meraki in conjunction with it.

Meraki is an MDM tool which we initially used for our mobile devices, but as Meraki grew, their Mac and Windows management tools improved. By default all machines are now managed by Meraki as well as by the other tools mentioned previously. With Meraki you get extremely powerful tools such as location tracking, remote desktop, viewing network stats, fetching process lists and sending notifications. While they aren’t necessarily needed on a daily basis, they do come in handy for those odd situations.

Summary 

Via all of the technologies above we now have an imaging process which takes up to 15 minutes, with only 30 seconds of actual input. In that imaging process, a DeployStudio workflow images the Mac, configures it for the individual user, runs software update and installs Munki and Sal. Once booted, Munki runs and installs all the up-to-date software (thanks to AutoPkg and Jenkins) that the client’s manifest specifies. Going forward you then have monitoring (Sal) and management (Munki and Meraki) of that Mac, which gives you control and visibility of the machine.

The best part is this hasn’t cost us a penny! This is thanks to the ever passionate and generous open source community. See below for the fantastic folks who created the amazing tools I mentioned in the post.

  • AutoDMG – Builds a system image, suitable for deployment with DeployStudio
  • DeployStudio – Image & deployment system for Macs and PCs
  • Munki – Package management for Mac
  • AutoPKG & Jenkins –  Automation framework for OS X software packaging and distribution and continuous integration server
  • Sal – Munki & Puppet reporting tool for Macs
  • Meraki – Mobile Device Management
  • GLPI – Asset management
  • Puppet & Facter – Configuration management utility and a cross-platform system profiling library

London PostgreSQL Meetup

The London PostgreSQL Group meetup is an unofficial PostgreSQL community event that happens quarterly. The meetup agenda is very relaxed, but it always involves a lot of good PostgreSQL discussion over some pizza and beer.

The event is always open to everyone and usually announced well in advance through the meetup.com website — http://www.meetup.com/London-PostgreSQL-Meetup-Group

Mind Candy had the pleasure of hosting the meetup on 21 January 2015. We actually had record attendance, which was awesome; thank you to everyone who came!

We had two really good talks. The first was a joint talk by Howard Rolph & Giovanni Ciolli about key features of the recently released PostgreSQL 9.4, followed by an awesome talk by Rachid Belaid about full-text search capabilities in PostgreSQL [1] (with a proper deep dive into the technical details and how to do it). Apparently you don’t really need to build a totally separate Elasticsearch cluster if you want to store documents and perform the most usual operations on them; Postgres will do just as well. Who knew!

Howard talking about new key features in PostgreSQL 9.4

Again, thanks everyone for coming and especially to the great speakers and see you all next time!

[1] Slides available here https://speakerdeck.com/rach/postgres-full-text-search-is-good-enough

Circles Of Causality

The Misnomer of ‘Move Fast and Break Things’

For a while I’ve wanted to visualise the pain points of the development cycle so I could better explain to product and business owners why their new features take time to deliver, and to squash the misnomer of Facebook’s overused “Move fast and break things” mantra.

Some of you may know that Facebook recently realised this wasn’t actually sustainable once your product is stable and established. It might work in the early conception of a product, but later on it will come back to haunt you.

At the F8 Developers conference in 2014, Facebook announced they are now embracing the motto “Move Fast With Stable Infra.”

“We used to have this famous mantra … and the idea here is that as developers, moving quickly is so important that we were even willing to tolerate a few bugs in order to do it,” Zuckerberg said. “What we realized over time is that it wasn’t helping us to move faster, because having to slow down to fix these bugs was slowing us down and not improving our speed.”

I’ve recently been reading Peter Senge’s “The Fifth Discipline: The Art and Practice of the Learning Organization” and thought circles of causality would be a good way to express this subject. Using systems thinking and looking at the whole picture can really help you stand back and see what’s going on.

Be it growing user acquisition or sales, a good place to start is what you are trying to achieve and how to exponentially increase it. In this case, let’s say that as a business you want to drive user engagement and grow your DAU. One possible way is to add new features to your product.

So following the circle in this diagram we assume that by delivering new features, we increase user engagement which in turn leads to growing your DAU/Virality.

Let’s assume the product has been soft-launched: you’re acquiring users, A/B testing initial features and have begun growing your list of improvements and new features. Let’s create a ‘backlog’ of all this work, prioritise it, and plan how to deliver it quickly using an Agile Scrum framework.

We want to deliver an MVP as quickly as possible, so let’s do two-week ‘sprints’. The product team has lots of great ideas, but far too many to put into one sprint, and some of the new features require several sprints. Product owners and other business leaders debate the product roadmap and you have your sprint planning… simple, right? Well, to begin with, yes.

So let’s look at what the size of the backlog does to delivery. In this diagram you see that the size of the backlog directly affects how soon improvements and features are made. Why? Because the backlog is also made up of bug fixes and technical debt, often inherited from your prototyping phase and from deploying your MVP.

You’d love to tell the business that you’ll just work on new stuff, but hey – worst case, we deliver something every two weeks, though some features could take months to appear.

So with a relatively small backlog we are OK. Yes, some business leaders are a bit frustrated that their feature won’t get deployed in this sprint, but the short-term roadmap is clear, right?

The dev team gets its head down and gets on with the sprint… but in the background, product and business owners have moved on from yesterday’s must-have feature to yet another new shiny idea or potential crisis; the backlog grows and the roadmap changes. Features get pushed further down the priority list.

So the situation is that we have a fixed amount of resource and, over time, the business is getting frustrated at the pace of delivering changes to the product. Weeks have passed and only 25% of these ideas and features have shipped.

The Symptomatic Solution

There are two potential fixes for this: move faster and drop quality so we can ship stuff quicker, or throw more people at it. Let’s look at what happens with the “Move fast, break things” mantra. To speed up delivery we cut corners – drop some testing and code reviews, push developers to make hacky solutions, and so on.

As you see in this diagram, as you do this you create more bugs and the QA process takes longer. Any initial advantage is lost as this builds up.

Now we have also added a ‘side effect’: more bugs increase the size of the backlog, creating the opposite effect to the one you intended in the first place.

So let’s put in more man-hours (overtime) to get those bugs down and reduce this growing backlog. More overtime increases fatigue and lowers the quality of the work. Developers get burnt out, they make more mistakes, quality suffers, there are yet more bugs, and the team is even more demoralised.

Let’s look at the result of this on staff and on the complexity of their work. In this diagram we see that by reducing quality we also increase code complexity, which generates technical debt, which again slows down development. Tech debt is pretty demoralising, as usually no one is invested in fixing it and in most cases you just work around it.

Adding more developers has a different outcome, with equally diminishing results. Big teams in an Agile framework aren’t always a great idea. The typical strategy is to organise your larger team into a collection of smaller teams, and the most effective way to do so is around the architecture of your system.

The harder you push, the harder the system pushes back

When you look at the whole system, each part has a cause and effect. The harder you push one part, other parts are affected. Each one of these parts needs to be balanced against each other so that the system runs efficiently. It’s also important to step back and make sure you are solving the actual problem not trying to fix a symptom.

Balance

In this example the perceived view is that the team is moving slowly, whereas in fact they are moving at a pace that balances the system. “Move fast with stable infra” is the sensible option. Use system diagrams like this to seek out counterbalances to reinforcing circles.

Reference – “The Fifth Discipline: The Art and Practice of the Learning Organization” Peter Senge – Chapter 5 “A Shift Of Mind”