Software Ate My Infrastructure: Two Years on AWS with Ansible, Terraform and Packer - Part 1

NAME

software-ate-my-infrastructure — Software Ate My Infrastructure: Two Years on AWS with Ansible, Terraform and Packer - Part 1

SYNOPSIS

Agari Data has made significant investment into infrastructure as code. Almost two years into this project, we've learned some lessons. Our efforts have already y...

METADATA

date: August 29, 2016

length: 9.2K

reading: ~6m

Previously: read about our first year

Agari Data has made significant investment into infrastructure as code. Almost two years into this project, we’ve learned some lessons. Our efforts have already yielded dividends by increasing engineering velocity while maintaining infrastructure reliability. Additionally, our new toolset enables more experimentation, assured in the knowledge that we’re never more than a few commands away from being able to roll back changes or re-deploy with zero data loss.

Our automation toolset makes managing cloud-based infrastructure orders of magnitude faster and more repeatable, but more powerful tools bring complexity, technical debt, and raised barriers to entry for simple changes. Managing state, deploying secrets, working with distributed data stores, and using tools that are themselves in a state of rapid development (e.g. Packer, Terraform, and Ansible) can yield significant and sometimes unexpected challenges. I’ll share several pitfalls and best practices in a three-part series.

Your automation repository: git, branching and organization.

First, I’d like to address one of the more contentious issues we’ve struggled with: the organization of an integrated automation repository. I’ve seen a variety of approaches to organizing infrastructure code. An organization might place each of its automation tools (e.g. Ansible, Terraform, CloudFormation, Puppet) in a separate repository. Alternatively, it might prefer that its infrastructure code be co-resident with each product or product component. For example, all of the web caching code and its automation code might live in a web caching repo, while the same would be true of a database component.

We’ve taken a different approach. All automation code, by which I mean nearly all non-product code, lives in one big, slightly messy git repo. I’m not going to say that this is the most elegant or virtuous solution, but it has some important advantages.

Maximizing code re-use between teams

It’s very easy to share Ansible roles between product teams when all you have to do is reference a shared role alongside your custom application-specific role. For example, here’s how we configure roles for one of our host groups:

  roles:
    - nrt-alerter
    - role: cousteau
      url_domain: producta.stage.agari.com
      cousteau_branch: develop
    - postfix

This particular machine gets three roles: the role specific to its function (nrt-alerter in this instance), a role which checks out code specific to the product it supports and the environment it’s in (i.e., cousteau) and finally a general-purpose postfix role which supports Amazon SES as well as custom MTA configuration.

Additionally, all hosts get a base set of roles which configure things like users and ssh keys, IDS/IPS agents, host monitoring agents, ntp and log shipping configurations. For these catch-all roles we just use a hosts: all group.

Another major requirement of developing software at Agari is enabling product teams to have visibility into the automation work contributed by other product teams. This promotes code reusability, cross-team collaboration, and dissemination of best practices and reusable design patterns. Just being peripherally aware of what your colleagues are working on can be extremely valuable.

Integration between automation tools

Besides enabling teams to share code, a single repository lets us leverage the work we’ve put into writing and maintaining our Ansible roles: we use them not only for typical playbook runs but also as provisioners for Terraform and Packer when appropriate. An example is a Packer-built Docker container we use to run Apache Airflow. In this case, a shell provisioner first bootstraps Ansible, and an ansible-local provisioner then applies our Airflow role, thereby reusing a role we had already written to install and configure Airflow on a virtual machine host.

  "provisioners": [
    {
      "type"   : "shell",
      "inline" : [
        "sudo apt-get update",
        "sudo apt-get install -y software-properties-common",
        "sudo apt-add-repository ppa:ansible/ansible",
        "sudo apt-get update",
        "sudo apt-get -y --force-yes install ansible python-apt"
      ]
    },
    {
      "type"         : "ansible-local",
      "playbook_file": "local.yml",
      "role_paths"   : [
        "{{ user `ans_home` }}/roles/airflow"
      ]
    }
  ]

Repo organization

├── bin
├── examples
├── group_vars
├── host_vars
├── packer
│   └── airflow
├── roles
└── terraform
    ├── product_a
    │   ├── dev
    │   ├── prod
    │   └── stage
    └── product_b

It’s important to maintain a pretty clear idea of where things go when you’ve got multiple product teams contributing to a single infrastructure repository. We organize our repository this way:

Ansible code goes into the top level (i.e. Level 1)
- Roles, group vars and host vars sit directly underneath (i.e. Level 2)
- This way, Ansible code can easily be referenced by other tools, which live in second-level directories
Other second-level directories include:
- A bin directory for things like Python scripts that interact with EC2 via boto
- An examples directory to store items like an example .ssh/config for connecting through our bastion hosts
Parameterizing environments in Terraform with make

We’ve iterated a few times on how to best organize Terraform configurations. Early efforts duplicated a lot of code between environments, resulting in configuration drift: it became increasingly difficult to keep staging and production in sync. Our solution to this issue is to parameterize the differences between environments and to use a Makefile to run Terraform with environment-specific state files and variables.

#  Test that we have necessary executables available
EXECUTABLES = ansible terraform
K := $(foreach exec,$(EXECUTABLES),\
	 $(if $(shell which $(exec)),some string,$(error "No $(exec) in PATH")))

all: plan

#  Every Terraform target requires ENV=(prod|stage|dev); state and variables are read from that directory
plan:
	@if [ -z "$(ENV)" ]; then echo "usage: make plan ENV=(prod|stage|dev)" ; exit 1 ; else terraform plan -state=$(ENV)/terraform.tfstate -var-file=$(ENV)/terraform.tfvars -var-file=$(ENV)/secrets.tfvars -out=$(ENV)/terraform.tfplan $(ARGS); fi

apply:
	@if [ -z "$(ENV)" ]; then echo "usage: make apply ENV=(prod|stage|dev)" ; exit 1 ; else terraform apply -state=$(ENV)/terraform.tfstate -var-file=$(ENV)/terraform.tfvars -var-file=$(ENV)/secrets.tfvars $(ARGS); fi

destroy:
	@if [ -z "$(ENV)" ]; then echo "usage: make destroy ENV=(prod|stage|dev)" ; exit 1 ; else terraform destroy -state=$(ENV)/terraform.tfstate -var-file=$(ENV)/terraform.tfvars -var-file=$(ENV)/secrets.tfvars $(ARGS); fi

#  Remove saved plans; -f keeps this from failing when none exist
clean:
	rm -f */terraform.tfplan

.PHONY: all plan apply destroy clean

In this example, the common Terraform configurations live in the parent directory, one per product. Under each product directory we have per-environment directories, each of which contains a terraform.tfvars, a terraform.tfstate and a secrets.tfvars. We generally store account credentials in secrets.tfvars since we have different environments segmented by AWS account (and you should too). The tfvars file looks something like this:

environment = "prod"
account_id = "313371234567"
ssl_certificate_id = "arn:aws:iam::313371234567:server-certificate/myagari-ev-cert"
app_server_count = 8
test_elb_count = 0

Setting the environment variable (a Terraform variable, not a shell one) enables configurations like Name = "${format("app-%02d", count.index)}.cp.${var.environment}.agari.com". The account ID and SSL certificates need to be set separately per environment, of course. Server counts often differ between environments. Resource counts can be set to zero for resources that exist in one environment but not the others, for example prototypes that are spun up in a development environment.
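
To make the naming scheme concrete, here is a minimal Python sketch of the names that the format() expression above would produce for a given resource count and environment. It only mimics the string expansion; make_names is a hypothetical helper for illustration, not something Terraform provides or that lives in our repository.

```python
# Sketch only: mimics how the Name tag expands for a counted resource.
# make_names() is a hypothetical helper, not a Terraform or repo function.

def make_names(count, environment):
    """Return one host name per count.index (which starts at 0)."""
    return [
        "app-%02d.cp.%s.agari.com" % (index, environment)
        for index in range(count)
    ]


if __name__ == "__main__":
    # Mirrors the example tfvars: environment = "prod"
    for name in make_names(3, "prod"):
        print(name)
    # A count of zero yields no resources at all
    print(make_names(0, "dev"))
```

Because the index is zero-padded to two digits, names sort cleanly up to 100 instances per group, and a count of zero simply produces an empty list, which is how a resource can exist in one environment and be absent from another.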

Git branching

Our git branching strategy is as follows:

master — represents production for both products
develop-product_a — holds product a-related changes to be deployed to our pre-production environment (i.e. staging). After staging is validated, we merge this branch into master for eventual production deployment.
develop-product_b — holds product b-related changes to be deployed to our pre-production environment (i.e. staging). After staging is validated, we merge this branch into master for eventual production deployment.

Three simple rules for how to use them:

Run Terraform/Ansible against prod from master and against stage from develop-product_{a,b} only.
As part of daily development, please commit and push to develop-product_{a,b}.
On prod deployments, please merge your respective develop-product_{a,b} branch into master as you would for the product code repository, and immediately merge master back into your develop-product_{a,b} branch to ensure coherence. Periodic merges of master into your develop-product_{a,b} branch outside of prod deployments can’t hurt either.
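
The first rule lends itself to a small pre-flight check. The following is a sketch of how such a guard might look, not something we claim to run: ALLOWED_BRANCHES and branch_allowed are hypothetical names, and treating environments that the rules don't mention (such as dev) as unrestricted is an assumption.

```python
# Sketch of rule 1 as a pre-flight check.
# ALLOWED_BRANCHES and branch_allowed() are hypothetical names.
import re

# Which branches may be used to run Terraform/Ansible against each environment
ALLOWED_BRANCHES = {
    "prod": re.compile(r"master"),
    "stage": re.compile(r"develop-product_[ab]"),
}


def branch_allowed(branch, env):
    """Return True if `branch` may be deployed to `env` under rule 1."""
    pattern = ALLOWED_BRANCHES.get(env)
    if pattern is None:
        # Assumption: environments not covered by the rules (e.g. dev)
        # are left unrestricted.
        return True
    # fullmatch requires the whole branch name to match the pattern
    return pattern.fullmatch(branch) is not None


if __name__ == "__main__":
    print(branch_allowed("master", "prod"))
    print(branch_allowed("develop-product_a", "prod"))
```

A wrapper script could obtain the current branch with git rev-parse --abbrev-ref HEAD and refuse to continue when branch_allowed returns False.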

Coming up next

In part 2 of this post we’ll discuss:

State management
Dealing with datastores: special cases and how to handle complex migrations
Why your immutable infrastructure doesn’t necessarily need to be 100% immutable