A DevOps Journey

Over the past few years Mind Candy has gone through a DevOps transformation. We did this because we knew we had to improve the delivery of our products, and we knew that getting to where we wanted to be involved having the following three things in place.

1. Shared goals and practices, achieved by aligning our different teams.
2. Unified toolsets – again, we needed to align around a common set of tools.
3. Collaborative learning – knowledge sharing was and remains vitally important to us.

Obviously, achieving something like this cannot happen overnight. It had to be an iterative process, just as software development is, and its starting point was changing the mindset of people across the teams so that we began to do DevOps.

These are some of the practical things we did on that journey.

Familiarity doesn’t breed contempt

In Aesop’s fable of the Fox and the Lion, we’re taught the moral that familiarity breeds contempt. In an organisation trying to transform towards a DevOps way of thinking, however, we turned the fable on its head, acknowledging that it’s not familiarity that breeds contempt but separation, in the form of silos.

For us this didn’t mean that we needed everyone to be familiar with everything about everything. Unicorns don’t exist. What it meant was making our physical working proximity closer. It’s pretty amazing how much more readily different teams interact and collaborate organically when they can hear each other – from Dev through QA to Ops.

We found that technical decision-making became a much more shared process. Closer working environments encourage greater mutual support between teams.

It’s good to talk

Email is a wonderful thing. Instant messaging and relay chats are even better once you’re in a good DevOps place. However, if you’re trying to shift attitude and thinking email is not a substitute for getting up and talking to someone or having a phone/video call.

It might not always be possible across timezones, but it doesn’t take a genius to realise that intonation can easily be lost in the written word, even if someone uses an emoticon.

The slowest and most problematic IT organisations I’ve known have tended to be ones where everyone hides behind email, resulting in bubbling tensions and often leading to escalation and wars over who can CC the most senior people in. Change can still be effected, but only by whoever has the loudest shout or the most clout.

Meanwhile, the best and least problematic IT organisations tend to be the ones where different functional teams not only sit physically close to each other but also walk across the office to talk to each other instead of sending snippets of easily misinterpreted text over the Internet. Obviously when you have no choice you have to use electronic communications, but when you don’t need to, you probably shouldn’t.

Investment in knowledge pays the best interest

When you look up a typical DevOps Venn diagram online, it will be one where DevOps sits at the intersection of Dev, QA and Ops. Acknowledging this intersection is crucial in moving an organisation’s mindset. The intersection represents all the things you do that have a shared interest and investment in them. This is the place where you need to align across teams.

Take code deployment as the classic example.

During any software cycle, each team will deploy to different environments, and it’s highly likely that there will be differences in the process due to the scale of the environments, whether they operate under an SLA, or whether they fall under internal governance controls like change management.

The tools used to deploy, and the process followed, are an excellent starting point in any DevOps transformation. They not only encourage collaboration between teams, but also enable you to unify your toolset under known standards, something we have done at Mind Candy that I blogged about previously.

This has empowered tech teams to collaborate on a shared interest and shared investment, whilst also carrying a shared responsibility for its maintenance. The tool is as much a “product” as the product that it ships.

The net result of this investment is that code deployment becomes so trivial that it widens the scope of who can “push to live” to pretty much anyone. This shouldn’t be mistaken for meaning that anyone should (or does) deploy to live. That would be silly. Rather, it should be seen in terms of a robust deployment process eliminating the lone rock-star engineer as a single point of failure.

As Mazz Mosley said at Monki Gras 2013 when talking about how GDS built gov.uk, “rockstars are not webscale”.

This approach doesn’t negate strict change control and governance in the organisation (if you have it). It simply removes blockers from your delivery pipeline. That’s a win for the business as much as it is a win for those who have shared and gained knowledge through collaboration.

Devs as Ops and Ops as Devs

Once we had shared ownership and responsibility of tooling like deployment spanning teams across the organisation, it was clear that the reality of the DevOps intersection is one where Devs are Ops and Ops are Devs.

This doesn’t mean that either team does the other’s job. This is not the full-stack unicorn. Sysadmins are not dead, and nor are developers. It just means that where the things they do have alignment, they can learn from each other.

Take the traditional sysadmin position. They will often be quick to tell you that they’re not a developer. They may even say it with a sense of disgust that you even dared to ask the question. The sad truth is that they’re actually in denial.

They might not like it, but when writing short scripts, or declaring something in a configuration management system, they are developing and, as the saying goes, doing “infrastructure as code”.

The only real difference is that frequently they have made life hard for themselves by lovingly hacking systems and creating the snowflake server. It’s great for job security, of course, but it’s terrible for the business – rock-star ninja single points of failure again.

At the very least they need to be using some sort of version control for their infrastructure, and what is version control if not a development tool? However, it’s not just in the tools that your Ops can be more like Devs. There are the working practices too.

The Ops team had already been using Kanban to prioritise work weekly. Whilst this worked to a degree, the team still had an ever-growing backlog of tickets and requests, and what went on the Kanban board each week still contained a considerable amount of reactive work.

We decided, as a team, to take our workflow a step further and apply more development principles to the management of our ticket queue. We would align ourselves with our colleagues and move towards a fuller form of Agile along Scrum lines, using sprints, planning, backlog pruning and prioritisation.

We began to work through our backlog by opting for two-week sprints. We introduced sprint planning and started to commit to a certain number of story points (issues) for each sprint, and, barring any major issues or emergencies (which we left slack for), we would stick to the committed work and do nothing else.

The impact of what was a pretty small change was huge. It took a few sprints, but as our different product teams (who were, of course, all doing sprints too) became aware that we were working in the same way as them, emergency work and out-of-the-blue high-priority issues gradually declined.

Obviously it’s not always like that when you’re also supporting live services, but by aligning our working practices with those of our primary internal customers, there grew a greater appreciation of how our backlog could be impacted, just as theirs could be, by altering the scope of a sprint.

This was indirect collaboration, born on the back of working in a more aligned way with our peers. Our backlog went from over 100 tickets to fewer than 40.

Meanwhile, as we in Ops were becoming more like Devs, we started to share some of our Ops roles with the Devs, with a little help from a friend called Canbot.

ChatOps sets you free

Candy Bot, or Canbot for short, is our in-house name for GitHub’s Hubot. It sits in our dedicated Slack channel, #chatops, and when not providing us with amusing animated cat images he/she does things for the Devs and for Ops.

Canbot can tell us where servers are. This is vital, as we use AWS, so the environment can be fluid and dynamic. Canbot can deploy config changes for the Devs to each environment, including live, and it’s all totally transparent.

If someone changes the code base in our Puppet infrastructure, Canbot will tell #chatops about the commit and who made it. We also opened up the Puppet repository to the Devs, and some of them change it every now and then. Shared responsibility, after all.

Canbot can also execute commands on our infrastructure, but never in secret. Transparency is the key feature here. What Canbot can do is also open across the teams for development: primarily it is Ops that play with him, but there is nothing stopping a pull request from others internally.

Canbot has allowed our Devs to be a bit more like Ops. They can orchestrate production without having ssh access, and it can all be audited. No more tickets asking for information about production.

Embrace failure

Failure is an opportunity to learn; it is not an opportunity to point a finger of blame and start shouting at someone. A DevOps mindset should see each failure in these terms. Iterate on the failure and eliminate it with better tooling, better documentation or better gated processes.

When we celebrate failure we do it with Krispy Kreme doughnuts!

Encourage Tech Culture

Most of the people who work in tech love tech. Few of us see our jobs as a mere means to an end. If you encourage your technical teams to collaborate through learning sessions too, you can create a greater sense of being “one team of many disciplines” rather than single teams each doing only one thing.

At Mind Candy we hold regular weekly book clubs, open to whoever wishes to join, where we work through a particular book on a technology topic. We also have Guilds, where we present and share what we’re working on between teams.

Additionally, we use our office as a host location for meetups across tech businesses. Next month we’re hosting a London Virtual Reality meetup. Sharing shouldn’t always just be in-house, after all.

Wrapping things up

Obviously the list and experiences above are not exhaustive. There are many little things an organisation can do when adopting a DevOps approach. What’s important is to change mindsets first and then iterate, encouraging ever greater collaboration. Once an IT organisation realises that it relies on mutual support to sustain itself, change can come about quite rapidly.

A Puppet module for Dynamic DynamoDB

As my colleagues have mentioned in other posts, we make extensive use of Amazon Web Services at Mind Candy. Recently we decided to use AWS’s NoSQL offering, DynamoDB, for a specific use case in one of our products.

Whilst DynamoDB provides us with a highly distributed NoSQL solution, it works by having you tell Amazon what read and write capacity you require via their API. If you go over either of these values you begin, potentially at least, to lose queries, unless you have factored in some sort of buffering layer using, for example, Amazon SQS.
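
For illustration, here is roughly what declaring that capacity looks like through the API. This is a minimal sketch using the boto3 SDK (an assumption on my part – the table name, key schema and capacity figures below are all made up):

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Capacity is declared up front: this hypothetical table may consume at most
# 100 reads/sec and 50 writes/sec before AWS starts throttling requests.
dynamodb.create_table(
    TableName="player-scores",  # hypothetical table
    AttributeDefinitions=[{"AttributeName": "player_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "player_id", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
)
```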

In an ideal world Amazon would offer auto-scaling features for DynamoDB; however, at the time of writing they don’t. Instead they advise people to use an independently developed tool called Dynamic DynamoDB, written by Sebastian Dahlgren.

Dynamic DynamoDB is a tool written in Python that allows us to effectively auto-scale our provisioned reads and writes. It uses CloudWatch metrics to establish current usage and then, based on your configuration, scales your provisioned capacity up or down on a per-table basis.
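
Conceptually, the loop Dynamic DynamoDB automates looks something like the sketch below. To be clear, this is not the tool’s actual code – just an illustration of the check-CloudWatch-then-update-capacity cycle, written against boto3, with made-up table names and thresholds:

```python
import boto3
from datetime import datetime, timedelta

TABLE = "player-scores"    # hypothetical table name
PROVISIONED_READS = 100    # current provisioned read capacity
SCALE_UP_AT = 0.9          # scale up at 90% utilisation

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")
dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Sum of consumed read units over the last five minutes.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedReadCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": TABLE}],
    StartTime=datetime.utcnow() - timedelta(minutes=5),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)

if stats["Datapoints"]:
    # Divide the summed units by the period to get per-second consumption.
    consumed = stats["Datapoints"][0]["Sum"] / 300
    if consumed > PROVISIONED_READS * SCALE_UP_AT:
        # Bump provisioned reads by 50%; writes are left unchanged in this sketch.
        dynamodb.update_table(
            TableName=TABLE,
            ProvisionedThroughput={
                "ReadCapacityUnits": int(PROVISIONED_READS * 1.5),
                "WriteCapacityUnits": 50,
            },
        )
```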

As I’ve posted before, we use Puppet at Mind Candy, so the first port of call whenever a new tool comes along is to see if anyone has written, or started to write, a Puppet module for it. Sadly it didn’t look like anyone had, so we quickly wrote our own, which is available on GitHub here.

How I Learned to Stop Worrying and Love AWS CloudFormation

We love using AWS CloudFormation here at Mind Candy. Last year we moved all our cloud-based products’ application stacks to CloudFormation. We have learned, sometimes the hard way, how to design and use them in the best possible way for us. In this post I’ll try to summarise how we build and operate CloudFormations, and what the DOs and DON’Ts are when using this technology. Throughout this post I will refer to CloudFormation as CF, to save some precious typing time.

First of all, you need to get to know CloudFormation templates. These are just blocks of JSON, and as such are not really nice to edit (remember – no comments allowed in JSON). Because of that we use a helper technology – a templating tool – to build CF templates. We decided to use tuxpiper’s cloudcast library (we are a Python shop). You can take a peek or download it here: https://github.com/tuxpiper/cloudcast. If your primary language is different from Python you can easily find or write your own templating tool – it was pointed out to me by a former colleague that CFNDSL is a good starting point for Rubyists (https://github.com/howech/cfndsl). So lesson one is: don’t use plain JSON to write your CF templates. You will save yourself a lot of tedious time.
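
To make the point concrete without tying it to any particular library, here is a minimal sketch of the approach: generate the template JSON from a script, where you get comments, variables and loops for free. The resource names and values below are made up:

```python
import json

# Values we would otherwise copy-paste around a hand-written JSON file.
ENVIRONMENT = "staging"
INSTANCE_TYPE = {"staging": "t1.micro", "production": "m1.large"}[ENVIRONMENT]

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    # Python comments stand in for the comments JSON won't let us write.
    "Description": "Example app stack (%s)" % ENVIRONMENT,
    "Resources": {
        "AppInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-12345678",  # hypothetical AMI
                "InstanceType": INSTANCE_TYPE,
            },
        }
    },
}

print(json.dumps(template, indent=2))
```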

Once you have your first stack up and running you’ll realise how easy it is to modify and use it. But wait – what about testing the changes? That’s one of the biggest flaws of the CF technology: there is no way to test your template other than to apply it. CF does not give you a second chance – you can easily terminate/recreate your whole stack by changing a single line in your template. The good practice we try to adhere to is to test every single change to a template using a different AWS account (we use separate AWS accounts for our development, integration, staging and production environments) or region, i.e. launch an identical stack first in another AWS location and then perform the change on it, to test whether we end up in the desired state.

To make it possible to launch identical stacks in different accounts or regions, you can leverage CF mappings and parameters. We don’t use parameters yet, but we use mappings heavily. That allows us to use a single CF template file to create multiple stacks in different environments. All you have to do is define the environment-specific properties within a global mapping at the top of your template and then use CF’s “Fn::FindInMap” intrinsic function (actually, cloudcast does this for you). Also, use CF Outputs – they will allow you to programmatically access the resources created in your CF.
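
Here is a sketch of the shape the raw template takes once mappings and outputs are in play, shown as the Python dict it would be generated from (cloudcast emits the Fn::FindInMap calls for you; all names and values are illustrative):

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Mappings": {
        # One block of environment-specific properties per environment.
        "EnvMap": {
            "staging":    {"InstanceType": "t1.micro", "KeyName": "staging-key"},
            "production": {"InstanceType": "m1.large", "KeyName": "prod-key"},
        }
    },
    "Resources": {
        "AppInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": "ami-12345678",  # hypothetical AMI
                # Resolved per environment at stack-creation time.
                "InstanceType": {"Fn::FindInMap": ["EnvMap", "staging", "InstanceType"]},
                "KeyName": {"Fn::FindInMap": ["EnvMap", "staging", "KeyName"]},
            },
        }
    },
    "Outputs": {
        # Expose the instance ID so other tooling can look it up programmatically.
        "AppInstanceId": {"Value": {"Ref": "AppInstance"}},
    },
}

print(json.dumps(template, indent=2))
```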

Next is a set of more generic hints for those who work with AWS, all still 100% valid for CF. First, use IAM roles to launch your stacks/instances. Let me quote the official AWS IAM documentation here:

“A role is an entity that has its own set of permissions, but that isn’t a user or group. Roles also don’t have their own permanent set of credentials the way IAM users do. Instead, a role is assumed by other entities. Credentials are then either associated with the assuming identity, or IAM dynamically provides temporary credentials (in the case of Amazon EC2).”

That will make your environment more secure and save you the misery of maintaining IAM users and keys. Bear in mind that once an instance is created you cannot assign it to an IAM role, so if you’re not using IAM roles yet you should create an IAM role with an “empty” policy now and use it for all your resources, until you’re ready to benefit from full-fat IAM roles.
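
A sketch of such an “empty” role and its instance profile as CF resources, again written as the Python dict that serialises to the template JSON (the logical names are illustrative):

```python
resources = {
    "EmptyRole": {
        "Type": "AWS::IAM::Role",
        "Properties": {
            # Allow EC2 instances to assume this role.
            "AssumeRolePolicyDocument": {
                "Statement": [{
                    "Effect": "Allow",
                    "Principal": {"Service": ["ec2.amazonaws.com"]},
                    "Action": ["sts:AssumeRole"],
                }]
            },
            "Path": "/",
            # No policies attached yet: permissions can be granted later
            # without recreating the instances that use this role.
        },
    },
    "EmptyInstanceProfile": {
        "Type": "AWS::IAM::InstanceProfile",
        "Properties": {"Path": "/", "Roles": [{"Ref": "EmptyRole"}]},
    },
}
```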

Secondly, use minimalistic user data and make it identical across your whole estate. Delegate environment/application-specific settings to your configuration management system. This will just make your life easier. Get familiar with, and start using, auto-scaling groups, even if you’re launching a single instance (in that case you can have an auto-scaling group with the minimum and maximum number of instances both equal to 1). You’ll benefit from this approach later, once your service starts to scale up.
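
For example, a single instance wrapped in an auto-scaling group looks roughly like this (hypothetical AMI and names, same dict-to-JSON convention as above; the user data here just bootstraps configuration management, as an assumption about what “minimalistic” means):

```python
resources = {
    "AppLaunchConfig": {
        "Type": "AWS::AutoScaling::LaunchConfiguration",
        "Properties": {
            "ImageId": "ami-12345678",  # hypothetical AMI
            "InstanceType": "m1.small",
            # Minimal, estate-wide user data: hand off to config management.
            "UserData": {"Fn::Base64": "#!/bin/bash\npuppet agent --onetime\n"},
        },
    },
    "AppGroup": {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "AvailabilityZones": {"Fn::GetAZs": ""},
            "LaunchConfigurationName": {"Ref": "AppLaunchConfig"},
            # A "group" of exactly one instance; raise these when you scale.
            "MinSize": "1",
            "MaxSize": "1",
        },
    },
}
```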

Finally, use AWS tags to tag your AWS resources. Tags allow you to do a lot of funky stuff with your AWS resources (let me only mention grouping, accounting, monitoring and reporting here).

Now, a few DON’Ts for your CF:

  • Don’t mix VPC and non-VPC regions in your mappings – CF uses a different set of properties for EC2-VPC resources than for EC2-Classic resources.
  • Don’t ever specify resource name properties in your CF template. Using auto-generated names makes your stack easily portable, so you can copy an existing stack to another environment or launch a completely new stack (say, a canary stack) from the same template. Some AWS resource names also need to be globally or regionally unique, so defining a name in your stack is not such a good idea. Finally, virtually any resource that allows you to set its name will require replacement on update – just imagine your whole stack relaunching from scratch when someone comes up with a clever idea to rename resources in line with a new naming convention or a new product name.
  • Don’t use existing (non-CF-built) AWS objects in your stack if you can avoid it. Using existing resources makes your stack non-portable. A lot here depends on the use case (e.g. we have a couple of security groups which we use in our stacks, but even then we provide their names/IDs in the mappings or parameters, rather than using them directly in resource declarations).

Know your limits – CF is a great orchestration tool, but it has its limits. You cannot create or update some AWS resources (e.g. EC2 keypairs). You cannot self-reference security groups in their definitions, which sucks (how do I open all my Cassandra nodes for inter-node communication on port 7001 within the CF? – a workaround is sketched below). Stacks are difficult to maintain, as there are no incremental changes. For these and other, obvious, reasons – don’t forget to source control your CF stacks (we have a dedicated git repository for that).
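
For the record, the self-reference problem has a known workaround: declare the ingress rule as a separate AWS::EC2::SecurityGroupIngress resource that points back at the group. A sketch for the Cassandra example above, in the same dict-to-JSON style (hypothetical names; shown EC2-Classic style, a VPC variant would use GroupId/SourceSecurityGroupId instead):

```python
resources = {
    "CassandraSecurityGroup": {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {"GroupDescription": "Cassandra inter-node traffic"},
    },
    # Declared separately so the group can reference itself without a
    # circular dependency inside its own definition.
    "CassandraInterNodeIngress": {
        "Type": "AWS::EC2::SecurityGroupIngress",
        "Properties": {
            "GroupName": {"Ref": "CassandraSecurityGroup"},
            "IpProtocol": "tcp",
            "FromPort": 7001,
            "ToPort": 7001,
            "SourceSecurityGroupName": {"Ref": "CassandraSecurityGroup"},
        },
    },
}
```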

Finally, the last, and maybe most important, point – separate your applications into multiple CF stacks. One can easily get excited about CF and create a single stack for a whole application (network, databases, application servers, load balancers, caches, queues and so on). That’s not a good idea – you don’t want your database servers to relaunch when you decide to modify the properties of the auto-scaling group for your application layer. The solution is simple – create multiple CF stacks for your full application stack. Make your database layer a separate CF stack, then your distribution layer (app server auto-scaling groups and ELBs) a second CF stack, and so on. This will give you the flexibility of CF without the risk of unwanted service disruption due to a CF stack update (been there, done that…). It’s very tempting to create a very sophisticated CF stack with many inter-dependent components, but I cannot stress enough how important it is not to do that.

What’s next?

We are always looking to improve our stacks and processes, so we are definitely only at the beginning of our CF journey. One of my colleagues is looking at another CF templating library (https://github.com/cloudtools/troposphere) to help us automate our CF creation processes even further. We will very likely start to protect our CF resources in production using stack policies soon. We will work with CF parameters and dependencies more, to make our templates 100% independent of our account/regional settings. Finally, we need to research whether Custom Resources are fit for our purposes.