Working from home in AWS (with access to everything)

Ever since we started moving parts of our services into EC2, we’ve been faced with a growing problem. It’s important that our team can access nodes directly in a troubleshooting situation, even at 3am from Poland if necessary. With resources in AWS, this can mean that before you log into a node, you first have to log into the console (with two-factor auth), find a relevant security group, find your public IP, and then give yourself access via SSH. This can make troubleshooting among nodes in Amazon take much longer than it should.

So, we’ve been trying out a different approach. We already have a mechanism for authenticating our employees when working from home – our VPN. We can be reasonably confident that Cisco have done a good job at making this secure, and we’re able to assume that anyone who successfully logs in is probably worthy of access to other resources off-site, in EC2.

So, we developed a tool which checks the VPN session database every minute, by connecting to one of our Cisco ASAs and extracting the username, tunnel-group and IP address of all logged-in users. In this way, it’s possible to manage security-groups in EC2, such that you can automatically give users access to the resources they need, based upon their tunnel-group on the Cisco. Essentially, we now no longer need to think about remote access; it just works.

There are two components to this system; the first polls the ASA for information every minute, and creates a hash containing user information. It then sends this to an HTTP endpoint, where the correct security groups are updated. The latter part of this is embedded in our ‘mission control’ system, but is really quite basic, in that it simply uses boto to create security groups based on tunnel-group names, and keeps track of when the user was last seen on the VPN in a simple database, so that inactive users can be removed from the groups. The Cisco part is perhaps a little trickier, so we’ve put this on GitHub, in case it comes in handy.
You can find it here:
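
The security-group half of the system is not part of the GitHub project, but its bookkeeping can be sketched in a few lines of Python. This is only an illustration of the idea described above, not our production code: the function name, the shape of the session dictionaries and the ten-minute idle timeout are all assumptions, and the actual boto calls that would create groups and authorise or revoke the /32 rules are left out.

```python
from datetime import datetime, timedelta


def plan_security_group_changes(sessions, last_seen, now,
                                max_idle=timedelta(minutes=10)):
    """Work out which ingress rules to add and which to revoke.

    sessions:  list of dicts with 'username', 'tunnel_group' and 'ip' keys
               (hypothetical shape of the data polled from the VPN).
    last_seen: dict mapping (tunnel_group, ip) -> datetime of last sighting;
               it is updated in place, standing in for the simple database.
    Returns (to_authorize, to_revoke) as sorted lists of (group_name, cidr).
    """
    to_authorize = []
    for session in sessions:
        key = (session["tunnel_group"], session["ip"])
        if key not in last_seen:
            # First sighting: this address needs a rule in its group.
            to_authorize.append((key[0], key[1] + "/32"))
        last_seen[key] = now

    to_revoke = []
    for key, seen in list(last_seen.items()):
        if now - seen > max_idle:
            # Not seen on the VPN recently: drop the rule and forget the user.
            to_revoke.append((key[0], key[1] + "/32"))
            del last_seen[key]

    return sorted(to_authorize), sorted(to_revoke)
```

Each pair returned would then be turned into a security-group change, with the group named after the tunnel-group.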

Scaling Puppet for Donuts

In the last year we’ve had a fair number of challenges within NetOps, especially with Puppet, our config management tool of choice. Not only did we have a big piece of work that involved taking the giant leap from Puppet 2.x to 3.x, we also faced some architectural and performance challenges.

Whilst the upgrade was successful, we continued to have an architecture that was vertically scaled, and worse still we had a CA signing authority host that had become a snowflake due to manual intervention during our upgrade. The architecture issues really started to become apparent when we hit around 600 client nodes.

Now, as the old saying goes, if you have a problem, and no one else can help, maybe you should open…. a JIRA! So we did and it included the promise of donuts to the person willing to do the following:

1: Puppetise all our puppet infrastructure – inception FTW.
2: Add a level of redundancy and resilience to our Puppet CA signing authority.
3: Get us moved to the latest version of PuppetDB.
4: Make Puppet Dashboard better somehow.
5: Do it all on Debian Wheezy because Debian Squeeze support expires soon.
6: Seamlessly move to the new infrastructure.

What happened next was three weeks of work creating new modules in our Puppet git repo that could sit alongside our current configuration and be ready for the proverbial flip of the switch at the appropriate moment.

After a quick bit of research it became clear that the best approach to take was to separate out our CA signing authority host from our Puppet masters that would serve the vast majority of requests. This would allow us to make the CA resilient, which we achieved through bi-directional syncing of the signed certificates between our primary and failover CA.

Separation also meant that our “worker” cluster could be horizontally scaled on demand, and we estimate we can easily accommodate 2000 client nodes with our new set-up, which looks like this:


You may be thinking at this point that PuppetDB is an anomaly because it’s not redundant. However, we took the view that as the reporting data was transient and could potentially change with every puppet run on our agent nodes, we could quite happily take the hit on losing it (temporarily).

Yes, we would have to rebuild it, but the data would repopulate once back online. In order to cope with a PuppetDB failure we enabled the “soft_write_failure” option on our Puppet masters, i.e. the CA and Worker hosts. This meant that they would still serve in the event of a PuppetDB outage.

Finally, we decided to move away from using the official Puppet Dashboard – which relied on reports and local SQL storage – and used the Puppetboard GitHub project instead, as it talks directly to PuppetDB. Puppetboard was written using the Flask (Python) web framework, and we run it internally fronted with Facebook’s Tornado web server.

How bloated is your PostgreSQL database?

When dealing with databases (or, in fact, any data that you need to read from disk), we all know how important it is to have a lot of memory. When we have a lot of memory, a good portion of the data gets nicely cached thanks to smart operating system caching, and most of the data, when requested, comes from memory rather than disk, which is much, much faster. Hence keeping your dataset as small as possible becomes quite an important maintenance task.

One of the things that takes up quite a bit of space in PostgreSQL, which we use across most of the systems here at Mind Candy, is indexes. That’s good, because they vastly speed up access to data; however, they easily get “bloated”, especially if the data you store in your tables gets modified often.

However, before we can even tackle the actual problem of bloated indexes, we first need to figure out which indexes are bloated. There are some tricky SQL queries that you can run against the database to see the index bloat but, in our experience, the results we got were not always accurate (and were actually quite far off).

Besides having happy databases, we also care a lot about the actual data we store, so we back it up very often (nightly backups + PITR backups), and once a day we do a fully automatic database restore to make sure the backups we take actually work.

Now, a restore operation includes building indexes from scratch, which means those indexes are fresh and free of bloat.

Now, if only we could compare the sizes of indexes from our production databases to the ones from restored backups, we could easily say, very precisely, how much bloat we’ve got in our production database. To help with that, we wrote a simple Python script.

$ ./ -ps csva.csv csvb.csv
Index idx3 size compare to clean import: 117 % (14.49G vs. 12.35G)
Index idx2 size compare to clean import: 279 % (14.49G vs. 5.18G)
Ough!  idx4 index is missing in the csvb.csv file.  Likely a problem with backup!
Total index bloat: 11.46G

The whole process works as follows:

  1. At the time of backup, we run a long SQL query which prints all indexes in the production database, along with their sizes, in CSV format
  2. After the backup is restored, we run the same SQL query, which prints all indexes in the “fresh” database, along with their sizes, in CSV format
  3. We then run the aforementioned Python script, which parses both CSV files and prints out human-friendly information showing exactly how much bloat we’ve got in our indexes
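
The comparison in step 3 can be sketched as below. This is not the actual script, just a minimal illustration of the idea: it assumes each CSV row is simply `index_name,size_in_bytes`, treats the threshold as "percent over the clean size", and only counts reported indexes towards the total. The function names are made up for the example.

```python
import csv
import io


def read_index_sizes(handle):
    """Read 'index_name,size_in_bytes' rows (assumed format) into a dict."""
    return {row[0]: int(row[1]) for row in csv.reader(handle) if row}


def compare_indexes(production, restored, threshold_percent=0.0):
    """Compare production index sizes against a freshly restored copy.

    Returns (bloated, missing, total_bloat) where bloated is a list of
    (name, percent_of_clean_size, production_size, clean_size) tuples,
    missing lists indexes absent from the restored database, and
    total_bloat is the summed size difference of the reported indexes.
    """
    bloated, missing, total_bloat = [], [], 0
    for name in sorted(production):
        if name not in restored:
            # An index missing from the restore hints at a backup problem.
            missing.append(name)
            continue
        prod_size, clean_size = production[name], restored[name]
        if clean_size <= 0:
            continue  # cannot express a ratio against an empty index
        percent = 100.0 * prod_size / clean_size
        if percent - 100.0 > threshold_percent:
            bloated.append((name, percent, prod_size, clean_size))
            total_bloat += prod_size - clean_size
    return bloated, missing, total_bloat


# Small demonstration with in-memory CSV data instead of real files.
prod = read_index_sizes(io.StringIO("idx1,110\nidx2,300\nidx4,50\n"))
clean = read_index_sizes(io.StringIO("idx1,100\nidx2,100\n"))
for name, percent, prod_size, clean_size in compare_indexes(prod, clean, 20)[0]:
    print("Index {}: {:.0f} % ({} vs. {} bytes)".format(
        name, percent, prod_size, clean_size))
```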

We also added a percent threshold option so it’ll print out only indexes with bloat of more than X %. This is so we won’t get bothered by small differences.

The aforementioned script, called pgindexbloat, can be found on the Mind Candy GitHub account. It’s very easy to run from a cron job, or to wrap in a check script and use to feed Nagios / Sensu.

As an interesting note, I’ll just add that the first run of the script uncovered that we had nearly 40GB worth of bloat on our production database. That was much more than we anticipated, and getting rid of that bloat definitely made our database much happier.

Musings on BDD

This month at the London Python Dojo I gave a talk on Behaviour Driven Development. People were a lot more interested than I expected, so I thought I’d share my thoughts more widely on here. The main motivation behind this talk was to dispel some of the myths around BDD and make it more accessible. You can view the slides on Google Drive.

What is BDD?
The main points I made could be summarised as:

  • BDD is a methodology, not a tool or a process
  • As a methodology it’s a collection of best practices
  • You don’t have to take the whole thing for it to be useful
  • As the name should make clear – it’s about focussing on behaviour not implementation.

For a more thorough introduction I recommend that you read the slides with the speaker notes.

What I found more interesting though was the Q&A, which lasted longer than the slides themselves. There appeared to be a lot of misconceptions about testing in general. This came out because BDD encourages you to look at behaviour at the user level. This is important as it helps focus your tests and give them tangible meaning. The leap of comprehension though is that when you test at different levels you are concerned with different users.

At a unit level your user is a software developer who will be using that unit of code. The most important thing at this level is the behaviour to this developer, who could be you, or could be someone else.

At an integration level your user is another service or unit of code. Here you are concerned with the interface that is provided to these other services and code units. When this second piece of code interfaces with the first, how does it expect it to behave?

At a system level your user is probably a person, or possibly a script. At this level you should really be using the user interface, be it a GUI or text. This is the level at which there was least confusion, as it should be fairly obvious that the behaviour you’re interested in here is that of the system as a whole when the user attempts to interact with it.

BDD in Practice

One of the questions I was asked was about speed: as “behaviour” is only the first word, it is important that these tests can actually be used to drive your development. Indeed, system level tests will rarely be as fast to execute as your unit tests. This does not, however, destroy the value of having these tests, and with a little work, in most cases, system and integration tests do not have to be as slow and unpredictable as you might have experienced. This comes down to the proper application of mocking.

The most useful way I have heard of explaining this is from a talk by Simon Stewart where he talks about having small, medium and large tests. A small test lives entirely in memory, is single threaded, doesn’t sleep, and doesn’t access system properties. They should ideally be entirely self contained and able to be run in parallel without any restrictions on ordering, and each one should take less than a minute to run.

A medium test is given much more free rein. Its only restriction is that it cannot touch anything not running on localhost, i.e. any network operations should run on loopback. It should take no more than five minutes. A large test has none of these restrictions.

From this it should be clear that your small tests are almost certainly unit tests of some description. They will act entirely on one unit of code, and any other units of code they interact with should be mocked out with the behaviour they expect to consume. Because they run so fast, you can afford to set these up to run on every commit, and wait for the results before you continue to write code that corrects any errors.
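
As a rough illustration of a small test in this sense, here is a sketch using Python’s standard `unittest` and `unittest.mock` modules. The `greeting_for` function and the `user_store` collaborator are invented for the example; the point is only that the neighbouring unit is replaced by a mock that exhibits the behaviour the unit under test expects to consume, so nothing leaves memory.

```python
import unittest
from unittest import mock


def greeting_for(user_id, user_store):
    """Unit under test (hypothetical): greet a user by their stored name."""
    user = user_store.fetch(user_id)
    if user is None:
        return "Hello, stranger!"
    return "Hello, {}!".format(user["name"])


class GreetingBehaviour(unittest.TestCase):
    def test_greets_a_known_user_by_name(self):
        # The store is mocked out, so no database or network is involved.
        store = mock.Mock()
        store.fetch.return_value = {"name": "Poppet"}
        self.assertEqual(greeting_for(42, store), "Hello, Poppet!")
        store.fetch.assert_called_once_with(42)

    def test_greets_an_unknown_user_as_a_stranger(self):
        store = mock.Mock()
        store.fetch.return_value = None
        self.assertEqual(greeting_for(7, store), "Hello, stranger!")
```

Saved in a file, tests like these can be run with `python -m unittest`, and because they have no shared state they can run in any order.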

Your large tests are going to be full end-to-end system tests, but might also include integrations with third party services, such as AWS. These are going to be slow. You might choose to use your continuous integration server to run these, or you might run them nightly. You might also choose to include your manual tests in this category, and I find it a perfectly acceptable, or in fact desirable, first step into BDD to have your behaviour specifications executed by a manual tester.

Your medium tests might therefore be assumed to be integration tests, and in some cases they will be. However, by mocking out everything non-local that your user interface interacts with, you can make your system tests medium tests too, and if you have a behaviour specification for the units they interact with, you should have no less confidence in the results. It is true that there is value in having end-to-end tests, but medium tests can give you a much earlier indication of any issues your changes may have introduced or resolved at the system level. With up to a five minute execution time you’re probably not going to wait for them to run before you carry on coding, but you should definitely wait before merging your work to somewhere it will affect others.

Managing access from AWS EC2 through Cisco ASA firewalls

At Mind Candy, to support our games, we have increasingly been looking toward AWS in order to take advantage of features such as autoscaling and RDS. With an existing estate spread across three datacentres, introducing instances in EC2 brought about a number of challenges.

Every time an EC2 instance is spun up, it’s given a random dynamic IP and hostname. It’s not possible to know in advance what this will be. While this might sound completely crazy if you’re used to working only with physical hardware, it’s actually been a great catalyst in the move towards ‘disposable’ infrastructure, where your server nodes can build and destroy themselves when required, with no sysadmin intervention. We’ll be talking more about how we do this another time, but today I wanted to talk about security.

As with any organisation, our physical locations have firewalls with very restrictive policies. All of our servers have static addressing, and we know exactly where to find them. Creating firewall rules is predictable and no problem. Unfortunately, introducing EC2 nodes into the mix has caused us a bit of bother – how do we allow these dynamic nodes to talk back to our datacentres?

Amazon publish their list of IP ranges every now and then, via their support forum. Predictably, it’s huge. A number of companies I know have gone down the route of allowing the entire AWS address space to access their private networks. For me, this is simply asking for trouble.

We realised that we would need a more robust and targeted way of setting up access rules to only the hosts we were using, and ensuring that we didn’t persist rules for addresses which were no longer under our control.

Enter rulemanager!

This tool is written in Ruby, using the AWS SDK. Its job is to read a list of AWS account details, build a list of all active hosts and their IP addresses in every account and region, and then ensure that each IP is present in an object-group on a number of Cisco firewalls. Knowing that a certain object-group will always contain our entire AWS estate, we can happily create firewall rules without worrying about opening our network up to half the Internet.
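
The core of that job is a reconciliation between two sets of addresses. The sketch below shows the idea in Python rather than Ruby, purely for illustration; it is not taken from rulemanager, and the function names and the shape of the host dictionaries are assumptions. Talking to the AWS API and to the firewalls is deliberately left out.

```python
def active_ips(hosts_by_account):
    """Collect the IPs of running hosts across every account and region.

    hosts_by_account maps an account name to a list of host dicts with
    'ip' and 'state' keys (a hypothetical shape for this example).
    """
    ips = set()
    for hosts in hosts_by_account.values():
        for host in hosts:
            if host["state"] == "running":
                ips.add(host["ip"])
    return ips


def reconcile_object_group(wanted_ips, group_ips):
    """Return (to_add, to_remove) so the object-group matches wanted_ips."""
    wanted, current = set(wanted_ips), set(group_ips)
    # Add addresses we now own; remove ones that are no longer ours.
    return sorted(wanted - current), sorted(current - wanted)
```

Removing stale entries matters as much as adding new ones, since an address we have released may be handed to someone else.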

Take a look at our github project page:

Mindcandy Hosts London Python Dojo

For those who don’t know, the London Python Dojo is a monthly meetup. While it doesn’t strictly adhere to the traditional code dojo formula, we do get together once a month to eat pizza, drink beer, and hack out a solution to an interesting problem in python.

Attendees of all backgrounds are welcome whether you’ve been coding in python for decades or you’re a complete coding newbie. Everyone is given the opportunity to write some code, but no one is forced.

The agenda is roughly as follows (although times are only approximate):

  • 6:30-7:30: People turn up, eat the food, drink the refreshments. Everyone has a chat, and people are free to suggest problems on the ideas board.
  • 7:30-8:00: Lightning talk(s), recruitment shout-outs, and the all important voting on which idea we will work on.
  • 8:00-9:30: People are arbitrarily split into teams, which set to work on the problem.
  • 9:30-10:00: Each team presents their work, so you get to see many varied solutions to the same problem.
  • 10:00-: Pub!

Tickets are free and are available on eventwax here

    Teaching Unity at Mind Candy

    At Mind Candy we use the Unity game engine for some of our games, especially those on mobile. We love Unity – it’s great for rapid prototyping and allows designers and artists to directly create game content, rather than having to go via programmers. We use it for prototyping, game jams and also to create some of our new mobile games!

    Why Learn Unity?

    Here at Mind Candy we’re spending a lot of time creating new games and we realised a great way to allow more people to get involved is to teach Unity to anyone who wanted to learn it! So back in October 2013 we set up a lunchtime “Learn Unity Club”. The aim is to teach more people how to make games!

    We had a few false starts, but we iterated and hit upon what we think is a winning formula, which we are sharing with the world – more on that later!

    Watching Videos?

    We started off by watching various video tutorials on Unity. Whilst the videos from Unity themselves are great, other sources varied in quality. We found that videos tended to be good to teach you how to use the editor, but didn’t have good information on game logic and programming. They could also be hard to follow and could be pretty passive, so it didn’t feel like people were really learning. We noticed after a few weeks that we were losing people (as this was an optional lunchtime club) so decided to change approach.

    Game Jam to learn?

    We love Game Jams and thought that perhaps a more hands-on project would be a good way to learn practical Unity techniques. So in December 2013 we got a group together, generated some random videogame names and set off! But we found that this approach wasn’t successful either. We were spending too much time getting excited by the game concepts and talking about/prototyping game design (even making up a set of cards to test “Intense Baking – the card game”). So we didn’t spend much time actually doing things in Unity. Also it turned out that December was a really busy month, so people didn’t have any time outside of the lunch hour to work on their games. We even saw quite a drop-off in numbers as folks were just too busy.

    A winning format!

    As they say, the third time’s a charm – at the start of 2014 we decided to take everything we’d learnt from the previous months and reboot Learn Unity Club to a new format.

    The basic pitch is “Give us a lunchtime and you’ll learn how to make a game” – seems like a pretty good deal, right? The idea is that we would start each lunchtime lesson with a pre-prepared Unity project which everyone would have on their laptops. The tutor would then step through, building this up until by the end of lunchtime everyone in the room would have a working game that they could then take away and play with more, if they have the time. Each week is self-contained.

    A key point is that if you don’t have the time, that’s ok as well – just come back next week and we’ll have another game to make! We’d realised that people often did not have the time or would miss a session, so we wanted to allow people to easily drop in and not feel lost.

    The sessions are intentionally very hands-on – you’re always making changes on your own laptop and learning as you do it. A lot of people learn by doing so this method should be effective.

    Because we’re building a game in an hour, we don’t want to try and introduce too many new concepts, so we’re aiming for 1-2 new things each time.

    Lastly and very conveniently, Unity 4.3 with official Unity 2D APIs came out just at the right time, so we have focused on doing everything in 2D so far. This is really handy as 2D games are much less complex and easier to understand than 3D. It also makes building art much easier as even terrible “programmer art” will do for teaching purposes. It’s also great to be teaching an official Unity API.

    So to sum up, the format is:

    • 1 hour

    • build a game

    • self contained

    • hands on

    • learn 1-2 new things

    • Unity 2D

    What’s the downside? It’s more work for the course organisers! We couldn’t find any courses for learning Unity2D that exactly matched our needs – not least because Unity2D was so new. However, we did use this excellent 2D tutorial for the first week as a test.

    To share the load we have two people running the course, myself (Mark Baker) and Scott Mather. We take alternate weeks so we’ve got (in theory) 2 weeks to prepare each lesson.

    We’ve had 7 weeks so far and still have a great number of people coming. So we’re very happy with it so far!

    Good news everybody!

    Since we only created this because we couldn’t find it elsewhere, we are sharing it with everyone! If you head over to our project page you will see we have all the starter Unity projects, asset files and step-by-step instructions for each lesson. We also plan on adding videos, in case you would rather watch someone building it and follow along that way.

    At the time of writing we have:

    All this is being released under the MIT license. This means you are free to use it any way you want, even for commercial purposes. The only condition is that the copyright and license notice, which acknowledges us as authors of the work, must be included with any copies or substantial portions of it.

    So if you are a school, university, code club or company that wants to use this for training – please go ahead!

    We’d love to hear any feedback you have – neither of us is a professional educator so it’s entirely possible we don’t explain things well. Please get in touch via gitlab or comments on this blog.

    In coming weeks we’ll be adding more lessons – as we create them and run them at Mind Candy. We’ll also be adding video and featuring lessons on this blog.

    If you wanted to get more information and see a 30 minute sample of training, Scott and I both spoke at a recent London Unity User Group meeting:

    You can view the slides from the presentation here.

    (there will soon be a collection of video guides available on this Google+ page)

    Mindcandy Techcon 2014

    On 11th February 2014 we held our own mini tech conference in London at the Rich Mix Cinema in Shoreditch.

    Today we streamlined our jelly beans, expanded our guilds, sprinkled DevOps everywhere, emphasised our polyglotism then un-cheated our backends !

    In 2011 & 2012 we held a company Techcon to get together all Tech Mindcandies to share our technical experiences and technologies. It gives us an opportunity to share knowledge & give insights into different teams and products we wouldn’t normally get time to do.

    Jeff Reynar

    As Jeff, our new CTO joined us in the new year, it was a good opportunity for him to talk to all of us about our Tech & strategy going forward; especially on how we can build a great Tech culture here, so we can all grow and learn great things.


    We had some great talks from all the teams & learnt some new things. One talk even caused an outbreak of nose bleeds….too much data!

    A learning organisation

    Collaboration was the focus of the day, where we talked about how we better collaborate across cross functional teams & disciplines. And so the “Guilds” were born & there was much rejoicing.

    DevOps soon followed with a healthy smattering of dev and ops hugging. We have been practising DevOps methodologies here for a while, so we showed the fruits of our labour. From shared tools, infrastructure as code to automation & sharing the PagerDuty rota. We saw the future & it was continuous delivery…..& there was much rejoicing.

    Middleware team splashed us with more water themed services, with Plunger & Pipe Cleaner. We are safe in the knowledge our events can make it through the pipeline so quickly. We learnt about how we use FluentD, AWS SQS & Redshift to get gazillions of events from our games into our data warehouse. And they showed improvements made to our identity & AB testing services.

    Tools team had a vision…..and it was “make things less crappy”. They talked about deployment tools & automation with the promise of making everyone productive & happy. We were happy…and less crappy. The future would be filled with tools that are whizz bang and swishy like Iron Man….I have raised a JIRA ticket for my flying suit, it’s in the backlog people!

    NetOps Team talked about our implementation of autoscaling using cloudformation stacks & how we manage dynamic disposable infrastructure. Also, we got introduced to the Moshling Army, who will tirelessly automate and keep tidy our AWS accounts. The Bean Counter app would also keep all the product teams updated on a daily basis with their Amazon AWS costs.

    We then had presentations by all the product teams, QA & IT OPS.

    The product teams deep dived into their front & back end architecture. We were shown how we load tested one of our game backends using Gatling. Blasting an AWS Autoscale group with 1.2 million req/min, breaking Cassandra & RDS Postgres along the way. The monster 244GB, 88 ECUs RDS instance took it in the end. (err, time to scale out before we need that, methinks)

    This was swiftly followed by the “Bastardisation of Unity” & how we mashed it, wrung its neck & made awesome 3D on mobile devices. Apache Thrift made an appearance, & we learnt how we use the binary communication protocol in one of our apps.

    Moshi Monsters web team talked about the lessons learnt over the last 5-6 years in managing a complicated code base. They revealed the pain of deployments back in the day & how they have been streamlined with “The Birth Monster” deploy tool. Wisdom was imparted about tech debt, code reviews & knowledge sharing.

    Our Brighton studio dazzled us with their game architecture & visually jaw dropping in-game graphics. The front end tools they built to improve workflow were awesome. Using timelines to combine animation, audio, cameras, game logic, UI & VFX means they can build stuff super fast. They also talked about cheat detection & the best ways to tackle it.

    The QA Team told us to pull our socks up ! Together we should strive to always finish backlog stories, improve TDD & automation. Make sure we have plenty of time for regression testing. Fortunately, they are working like crazy with acceptance criteria on stories, testing & improving communication.

    Finally the IT OPS team joined the DevOps march by turning their Mac builds from manual to automated nirvana. Using Munki & Puppet to handle software / configuration of all our company Apple Macs. Amazed !

    Also we learnt …never go on extended holiday

    Or this happens….


    Looking forward to the next Mindcandy Techcon !

    Play Framework meetup and example code!

    Recently Mind Candy hosted the London Play Framework Meetup group, where Guy and I gave a couple of quick talks covering some useful parts of Play that don’t always get good examples / documentation.

    Here are the slides and a video of the event.

    Guy talked about using Akka actors to provide an asynchronous ‘task runner’ that we used to build our own web-based content deployment tool for Moshi Monsters. His sample code is here:

    Mark talked about Javascript Routing in Play 2, which helps maintain the “Don’t Repeat Yourself” principle by keeping URL routing in the Play routes file and nowhere else!  This has recently been documented for Play 2.1 but did you know you can use it in 2.0.x as well? Mark’s sample code is here:

    Finally – did you know that Mind Candy is currently recruiting for a Senior Software Engineer to join the Tools team? Click here for more info and to apply.

    Migrating the Moshi Monsters backend from SVN to Git

    Currently at Mindcandy we use a combination of SVN and Git for all our code. This is because storing lots of frequently changed binary Flash assets in Git is a pretty bad idea. There was also some legacy code that would benefit from moving to Git, but finding the time to do anything about it had been difficult.

    Thanks to some recent cleanup and changes, some of those barriers have been disappearing, so over the last couple of days I’ve been getting the Moshi Monsters backend migrated over. It ended up being quite involved, and as we’ll hopefully be migrating more code over, I decided to write up how I did it.


    Migrating a large SVN repository to Git can cause issues when it contains a large amount of history, tags and branches. This is primarily due to the differences in the way that SVN and Git handle commits and branches.

    In short, when using git-svn to migrate, it’s necessary to pull down each commit from SVN, have Git calculate a commit hash for it, then re-commit that to the local repository. Furthermore, because SVN works by copying files for branches and doesn’t merge changes back into trunk in the same way as Git, it is also necessary to track back through every commit in a branch and calculate the commit information. Tags are awkward for similar reasons.

    In a repository like the Moshi backend, with a little over 6 years of history and plenty of old branches and tags, this can result in Git taking a lot of time, and a lot of CPU, to calculate this information, much of which is actually so old that it isn’t needed.

    Interestingly though, if we ignore all the branches and tags and just pull down trunk into Git then the process takes about 10 minutes.


    The decision was made to not migrate across all the branches and tags, but instead to get the entirety of the trunk history, and just a select number of the recent branches and tags. Unfortunately, this is a little fiddly to do with git-svn and requires a bit of config magic.

    I’ll cover the commands necessary for doing this, however there is some other information that is useful when doing an SVN migration that this won’t go over, primarily dealing with commit author name transformations. The site article covers that in more detail.


    The first step is simple enough thankfully, and just requires cloning the SVN repository trunk folder, making sure to tell git-svn that this is just trunk. You can do this by using the -T or --trunk flags, which will make sure Git knows that there could be other folders containing the tags or branches.

    git svn clone -T trunk http://svn.url/svn/repo/project/ project

    It is worth pointing out that there may be multiple remotes with the same name, but followed by “@” and a revision number. This happens when a branch was copied from a subfolder in the repository, and is not necessarily a whole repository copy. For example, when cloning our backend project I got this :-

    remotes/trunk 0f6ddda [maven-release-plugin] prepare for next development iteration
    remotes/trunk@8127 cbef06a Fixing Bug

    Going back through the SVN history, it’s possible to see that revision 8128 was where /TRUNK was copied to /trunk. These should be safe to remove, because once Git has pulled everything it will track the history in its own commits. We’ll cover getting rid of them later.


    Once we have this, we need to manually add each branch we want to pull down by adding an svn-remote to our Git config. This needs to have a URL and a fetch ref so Git knows what to get from where.

    git config --add svn-remote.mybranch.url http://svn.url/svn/repo/
    git config --add svn-remote.mybranch.fetch branches/mybranch:refs/remotes/mybranch

    With that done we can fetch it from SVN and create a local branch.

    git svn fetch mybranch
    git checkout -b local-mybranch remotes/mybranch
    git svn rebase mybranch

    The fetch may also take a while but once the above is done you have a normal-looking Git branch, ready to be pushed to our new remote Git repository.


    Adding specific tags is pretty similar to adding branches; in fact, Git treats SVN tags like branches, because really they are just copies of the entire project up to a certain revision. This means that once they’ve been fetched, we’re going to have to convert them to Git tags.

    git config --add svn-remote.4.9.9.url http://svn.url/svn/repo/
    git config --add svn-remote.4.9.9.fetch tags/4.9.9:refs/remotes/tags/4.9.9
    git svn fetch 4.9.9

    So now we need to turn this into a real Git tag. We’ll make this an annotated tag and mention that it’s been ported from SVN as well. If you were going to continue working with this repo against SVN then you’d probably want to delete the remote branch, but since we’re just doing a migration I won’t bother.

    git tag -a 4.9.9 tags/4.9.9 -m "importing tag from svn"

    At this point, if you go back and look at the tag in the Git history, you’ll see that actually it is pointing to a commit that’s sitting off on its own, and not part of the branch history. This is because SVN created a new commit just for the tag, unlike Git which creates tags against existing commits. If you really don’t like this then you could create the tag against the previous commit using :-

    git tag -a 4.9.9 tags/4.9.9^ -m "importing tag from svn"
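
    If you want to see what the ^ variant does without running a real import, here is a small sketch in a throwaway repository. Everything in it is made up for illustration: two empty commits stand in for the last trunk change and the SVN copy commit, and git update-ref fakes the remote ref that git-svn would normally create.

```shell
#!/bin/bash
set -e

# Scratch repository standing in for a git-svn clone (hypothetical data).
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# First commit plays the last real change, second plays the SVN tag copy.
git commit -q --allow-empty -m "last real change on trunk"
trunk_commit=$(git rev-parse HEAD)
git commit -q --allow-empty -m "svn copy: create tag 4.9.9"

# Pretend git-svn fetched the tag as a remote ref.
git update-ref refs/remotes/tags/4.9.9 HEAD

# Tag the parent of the copy commit; the full ref name avoids any ambiguity.
git tag -a 4.9.9 "refs/remotes/tags/4.9.9^" -m "importing tag from svn"

# An annotated tag is its own object that points at the chosen commit.
tag_type=$(git cat-file -t 4.9.9)
tagged_commit=$(git rev-parse "4.9.9^{commit}")
echo "$tag_type $tagged_commit"
```

    Here the tag ends up on the commit that stands in for the trunk change rather than on the copy commit, which is the behaviour the ^ suffix is meant to give you on a real import.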

    Pushing to Git

    With that done, we can now push our repository up to our Git host and not have to worry about SVN again.

    git remote add origin 
    git push origin --all
    git push origin --tags

    Now we have a Git repository with all of the Trunk history in and only those branches and tags we specifically wanted. At this point you probably want to set the old SVN repo to be read only and get everybody moved over to Git.

    Cleaning up

    If you aren’t using this repo for a migration, and instead just want to use git-svn to interact with your SVN repository, then you will probably want to clean up the remotes a little. As I mentioned earlier, when Git pulls everything out of SVN, it will create extra remotes for tags and branches at revisions where there were incomplete repository copies. Once the data is in Git you don’t need these, so we can safely remove them.

    To get a list of them we can use the Git plumbing command for-each-ref.

    git for-each-ref --format="%(refname:short)" refs/remotes/ | grep "@"

    With this we can iterate through and delete them.

    git for-each-ref --format="%(refname:short)" refs/remotes/ | grep "@" | while read -r ref; do
      git branch -rd "$ref"
    done

    Other options

    There are a few other options to git-svn that can be useful when migrating over, though it’s worth investigating them before setting a script running for two days so you don’t end up with a repository that doesn’t contain what you were expecting.

    The --no-follow-parent option can be passed when cloning or fetching so that Git won’t follow the commit history all the way back. This will result in things being much quicker, but it also means that, according to the git-svn docs:

    branches created by git-svn will all be linear and not share any history

    In practice I found that this gave me a linear Git history with nothing in the places I expected. On the plus side, it was way quicker! Worth looking at but use with caution.

    The other option worth knowing about is --no-metadata, which will stop Git adding the git-svn-id metadata to each commit. This will result in cleaner commit logs, but means you won’t be able to commit back to the SVN repository. It’s fine if you’re making a clean break from SVN, but dangerous otherwise. I’m also not sure how well it works with pulling down separate branches from SVN to merge into Git. That investigation is left as an exercise for the reader! :)


    So it’s all well and good being able to add our branches and tags, but we don’t want to do this by hand for each one when we can write a script to do it for us.

    Combining everything we’ve done so far, this shell script should do the job for us and leave us with a nice looking, ready to push, Git repository. I’m doing the cleanup step in the middle just to make sure there’s no ambiguity with which branches and tags are being created, and also so it’s easier to see what’s been created once all the dust settles.

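    #!/bin/bash
    # Expects SVNURL, FOLDER_NAME, BRANCH_FOLDER, TAG_FOLDER, BRANCHES and TAGS to be set.
    git svn clone -T trunk "$SVNURL" "$FOLDER_NAME"
    cd "$FOLDER_NAME" || exit 1
    # Fetch each selected branch through its own svn-remote.
    for bname in $BRANCHES; do
        git config --add svn-remote.svn-$bname.url "$SVNURL"
        git config --add svn-remote.svn-$bname.fetch "$BRANCH_FOLDER/$bname:refs/remotes/svn-$bname"
        git svn fetch svn-$bname
    done
    # Fetch each selected tag the same way.
    for tname in $TAGS; do
        git config --add svn-remote.$tname.url "$SVNURL"
        git config --add svn-remote.$tname.fetch "$TAG_FOLDER/$tname:refs/remotes/tags/$tname"
        git svn fetch $tname
    done
    # Cleanup: remove the "@" remotes left by partial repository copies.
    git for-each-ref --format="%(refname:short)" refs/remotes/ | grep "@" | while read -r ref; do
        git branch -rd "$ref"
    done
    # Create a local branch for each fetched SVN branch.
    for bname in $BRANCHES; do
        git checkout -b "$bname" "remotes/svn-$bname"
        git svn rebase svn-$bname
    done
    # Convert each fetched SVN tag into an annotated Git tag.
    for tname in $TAGS; do
        git tag -a "$tname" "tags/$tname" -m "importing tag from svn"
    done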
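
    The script relies on six variables that have to be set before it runs. Since a mistake here can cost hours of fetching, a quick check beforehand might be worthwhile. The sketch below is an assumption about how you might do that, not part of the migration itself: the values are hypothetical placeholders, print_plan is a helper name invented for this example, and nothing in it contacts SVN.

```shell
#!/bin/bash
# Hypothetical values; substitute your own repository details.
SVNURL="http://svn.url/svn/repo/project"
FOLDER_NAME="project"
BRANCH_FOLDER="branches"
TAG_FOLDER="tags"
BRANCHES="mybranch otherbranch"
TAGS="4.9.9"

# Collect the names of any settings that are unset or empty.
missing=""
for var in SVNURL FOLDER_NAME BRANCH_FOLDER TAG_FOLDER BRANCHES TAGS; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
        missing="$missing $var"
    fi
done

if [ -n "$missing" ]; then
    echo "Missing settings:$missing" >&2
    exit 1
fi

# Print the fetch specs the migration script would configure.
print_plan() {
    for bname in $BRANCHES; do
        echo "$BRANCH_FOLDER/$bname:refs/remotes/svn-$bname"
    done
    for tname in $TAGS; do
        echo "$TAG_FOLDER/$tname:refs/remotes/tags/$tname"
    done
}

plan=$(print_plan)
echo "$plan"
```

    Reading the printed fetch specs before kicking off the clone is a cheap way to catch a typo in a branch or tag name.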


    So migrating SVN to Git isn’t too tricky, but there are a few things worth knowing and it can certainly take a long time if you have a lot of history and branches. There are probably some mistakes and useful things I missed so feel free to get in contact if so.