Sunday, June 30, 2019

Azure DevOps, Terraform Validation and Linting

Hey all!

This post is part of a series on Azure DevOps CI/CD, which we've used to integrate with Azure Cloud, build a terraform deployment, and commit code to build resources.

In parts 1 and 2 of this series, we:
  • Learned several DevOps and Azure Cloud terms
  • Signed up for an Azure Cloud and Azure DevOps (ADO) account
  • Created an Azure Cloud Service Connection to connect Azure Cloud and ADO
  • Initialized a new git repo in ADO
  • Installed git on our machine (if it didn't have it already)
  • Created an SSH key and associated it with our user account
  • Cloned our (mostly empty) git repo to our computers
  • Created terraform base code and a .gitignore file for terraform
  • Used git on our local computer to create a branch, add the files to it, and push the files and branch to our Azure DevOps git repository
  • Merged the code changes into our master branch
  • Created a build and release pipeline for terraform to validate and push code out to our cloud environment
This blog will continue from where part 2 ended, so if you're following along, walk through parts 1 and 2 to build all of the above from scratch. Or just read along and watch me do it - I've included lots of pictures. 

Despite all of these neat cloud pipelines, git branches, and terraform automation, most of our pipeline is still manual. To get a change out, we have to: 
  1. Build and test the code locally before committing to master - nothing tests it automatically before it lands in master and a build and release are run
  2. Run the build pipeline manually once new code is committed to master
  3. Run the release pipeline to test the code
There are also significant parallelization drawbacks to this method. Say I write a bad terraform commit and push it to master. The release pipeline starts failing for everyone, regardless of what they changed, until this particular problem is sorted. This might work for a few developers who have time to commit to this project, but imagine dozens of devs working on a project. There would constantly be faulty code that blocks others - production would grind to a halt.

Software engineering, from which DevOps borrows many of its practices and methodologies, handles this problem by integrating testing before a merge, during the pull request phase of a commit. Then if problems are created, they're confined to the individual workspace of a branch, rather than the master that is shared by all developers. 

We're able to build something similar in Azure DevOps, and this blog will also walk through that process. So what are we waiting for? Let's get started! 

Break (Up) the Release Process

Right now our release pipeline has a single stage, and that stage does everything in Terraform - it stages the code, it init(ializes) the environment, validates the terraform, and applies it. 

That's not going to scale, so let's go into the release and edit the first single stage. What we'll do is remove all the "apply" steps, so the first stage is all about testing and validation. So let's update the first stage's title to match. 


We also want to remove the "terraform apply" step - don't worry, we'll add it to a future stage. Click on the apply step in the left column and then in the top right, click on the trash can labelled "remove" to delete that step. Hit save, and click "Pipeline" in the top left. 


We could go ahead and build an entirely new stage with just what we need, but there's an easier way. In the stages area, hover your mouse over the "Terraform Validation" stage that we just updated. Click on "Clone" and you'll see a new stage appear. 



Boom, you are a master of efficiency. Click on the "1 job, 5 tasks" in that second (far-right) stage, that starts with "Copy of..."


This second stage is all about applying the Terraform. Before we get to this stage, the Terraform code will be validated by the first stage, so no need to validate again. However, we do need to "init" the environment again - each stage is handled by an entirely new container. So let's pop in there and remove that pesky testing. 

What you have left will look something like this - make sure the last step is Apply. It's also a good idea to name this stage "Terraform Apply" to help keep track of what each stage is doing. 
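In command-line terms, the split between the two stages looks roughly like the sketch below. The function names validate_stage and apply_stage are made up for illustration, and the backend settings that init would need are left out - the real stages are tasks configured in the Azure DevOps GUI, so treat this as a picture of the idea rather than what ADO actually runs.

```shell
#!/bin/sh
# Rough sketch of the Terraform commands behind each release stage.
# validate_stage and apply_stage are hypothetical names; the real stages
# are built from tasks in the Azure DevOps GUI. Backend settings for
# "terraform init" are omitted here.

# Stage 1: testing only. Nothing in here changes infrastructure.
validate_stage() {
  terraform init -input=false &&
  terraform validate &&
  terraform plan -input=false
}

# Stage 2: runs in a brand new container, so it has to init again
# before it can apply.
apply_stage() {
  terraform init -input=false &&
  terraform apply -input=false -auto-approve
}
```

Note that the apply stage doesn't validate again - by the time it runs, the first stage has already done that job.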


Hit "Save" in the top right and we now have two discrete stages - the first stage tests, and the second stage applies. Which doesn't do a lot of good yet, because the stages are automatically linked and apply without any input from anyone! So let's tell stage 2 (Apply) to wait for our ok before continuing. 

In the top left, click on "Pipeline" again. Find the "Terraform Apply" stage and hover your mouse over it. Click on the lightning bolt - that's the "Pre-deployment conditions" menu. 


There's a lot going on here, but for now feel free to minimize the "Triggers" section to make this place a lot less busy. Find the "Pre-deployment approvals" section and slide the slider to the Enabled position. Azure DevOps will require that someone approve this stage before continuing, and you can lock down who has the authority. In our imaginary business we're the only current employee, so add yourself as an approver. Then click save in the top right. 



Let's test it! In the top right, click on "Create Release" to run this release again. The first stage will run normally since we triggered it, but something new will happen: the second stage WON'T run immediately. It'll wait patiently for us to "Approve" it. 

This is important for several different use cases. For one, we can just plain validate that the "Terraform Plan" is only showing the actions we expect it to do. Second, you could give more senior team members the ability to approve rollouts, or potentially an InfoSec team, platform team, etc. 

Click into the running release pipeline and the GUI is very clear - the "Terraform Validation" (and plan!) stage ran successfully, and the second stage, "Terraform Apply", is pending approval and won't run without our say-so. Feel free to click into the Logs on the Terraform Validation stage and make sure plan is only doing what we expect, then click "Approve" on the second stage, and the apply will continue on. 


Woot! We've broken out our flow of work so there is a separate testing stage from an apply stage. That'll be important for the next few automated linting test items we add. 

Limit the Blast Radius - Pull Requests

Anyone who's worked in infrastructure for any length of time is familiar with the phrase "limiting the blast radius." It's a way to frame changes or processes in an adversarial light - if it all goes wrong, how wrong can it go? The idea is to build processes, protections, and to train the team so that WHEN things go wrong, they don't go terribly wrong. 

We can take that principle to heart here and limit someone's ability to commit changes directly to our master branch, where a change is harder to back out and skips the normal review process that code (and any infrastructure changes) should go through. 

Direct commits to master shouldn't be allowed except in unusual circumstances. So let's require Pull Requests (PRs). Click on Repos --> Branches and find your "master" branch. It's created automatically when the repo is initialized. Click on the 3 dots to pull up all the settings that apply to that branch, and click on "Branch Policies". 



Check the box next to "Require a minimum number of reviewers" and change the minimum number to 1 (since our company only has a single employee!). If you are building this for a larger team, set the number wherever you'd like (limit 10). Since we'll be approving our own changes, make sure to also check the box next to "Allow users to approve their own changes." That's not a best practice for a real enterprise, but for our lab it'll do just fine. 


Hit save to make the policy live. But before we leave this page, we should make one other change. 

Automatic Branch Builds

Normally when you run a "build" job, it'll stage artifacts for all the release jobs from the master branch. It is possible to stage artifacts from a branch, or a particular commit, but it's complex to manage, and can go wrong (rolling out changes from an out-dated version of code, for instance). 

However, that's exactly what we need to do for branches - stage their code in a special artifacts area where it can be tested without messing with our master branch, or with any of the other branches where folks might be working in parallel. 

So it's convenient we're already in this settings page for the master branch. In it you'll find this exact setting that we need to enable. Find the section labelled "Build validation" and click the "Add build policy" button. 


Find your "Build pipeline" in the drop-down menu for Terraform building and select it. Leave the other options on their defaults. The one we're particularly interested in is the "Trigger" being set to automatic. 


What we've just done is made sure that every time a Pull Request targeting our master terraform branch is created or even updated, a new build process will run. That by itself won't test our code, but it's a step in the right direction. Now we just need to tell our release testing stages to also run when the code is staged in a pull request. So let's do that. 

Automatic Release Testing

What we'd like to happen when a PR is created or updated is for our terraform validation stages to run, but not our apply stages. Running an apply stage against a branch that might differ from the master branch is a recipe for chaos. We can do that by setting a flag in two places - on the "artifacts gather" part of our release pipeline, as well as on each stage we want to include in our automated build processes. 

First, go to Pipelines --> Releases and click edit on our terraform release. Find the Artifacts section and click on the lightning bolt - it's labelled the "Continuous deployment trigger" menu. 


In the menu that pops up we're going to change a few things. First, we need to make sure that this release pipeline automatically grabs the new build artifacts when a PR builds them, and then runs a release against them. 

A target branch filter is required. It's asking which target branches should trigger this automated pull request release. Sometimes code is staged in a branch to be ready for a release date. However, in our case most PRs will be created against the master branch directly, so select it as the target branch. 



At the very bottom of this picture above, note something interesting in yellow. Though we've turned on an automatic build and release, it's telling us that neither stage of this release pipeline is going to be executed. 

ADO is cautious - it doesn't want us accidentally overwriting our production infrastructure. So we'll need to go into each stage we want to enable for this automated release process and flip a flag to say "yes, include this stage in the process." Click on the lightning bolt on the "Terraform Validation" / planning stage. 


Under the Triggers section, find the "Pull request deployment" slider and move it to Enabled. This is the flag to include this stage in our automated PR testing process. 


Hit save to make the changes live. 

If you were to head back to the Release pipeline's Artifact continuous deployment settings, where you last saw "0 of 2 stages are enabled..." you'll now see "1 of 2" stages are enabled for pull request deployment. 

Did it Work? 

Let's test it so we can see it in action! On our local machine let's make sure we've checked out our master git branch, then pull any changes we might have missed. This is silly in our lab - we're the only ones in it! But it's a good practice to get into for production environments. 


Let's start a new branch (make sure to not use any spaces in your branch name). 
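On the command line, those two steps come out to something like the sketch below. It's wrapped in a small function called start_branch, which is just a name I've made up for illustration, and it assumes you're sitting inside your cloned repo with the remote named "origin".

```shell
# start_branch is a hypothetical helper name; run it from inside your cloned repo.
# $1 is the name of the new branch (no spaces!).
start_branch() {
  git checkout master &&      # switch to the shared master branch
  git pull origin master &&   # pick up anything merged since we last looked
  git checkout -b "$1"        # create the new branch and switch to it
}

# Example: start_branch testing-unit-testing
```

Running the three git commands by hand, one at a time, works just as well.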


Edit our main.tf terraform file to add something new - say a virtual network and a new subnet. Something like this would do the trick: 

provider "azurerm" {}

terraform {
  backend "azurerm" {}
}

resource "azurerm_resource_group" "rg" {
  name     = "testResourceGroup"
  location = "westus"
}

resource "azurerm_virtual_network" "WeAreAwesome" {
  name                = "vNetNewWeAreAwesome1"
  address_space       = ["10.0.0.0/16"]
  location            = "westus"
  resource_group_name = "${azurerm_resource_group.rg.name}"
}

resource "azurerm_subnet" "test" {
  name                 = "testsubnet"
  resource_group_name  = "${azurerm_resource_group.rg.name}"
  virtual_network_name = "${azurerm_virtual_network.WeAreAwesome.name}"
  address_prefix       = "10.0.1.0/24"
}

Copy and paste the above and save it over your main.tf file. Then head back to your command line and add the updated file to your change (git add .), commit the change to your new branch (git commit -m "adding a vNet and subnet"), and then push your new branch, commits and files in tow, to your Azure DevOps git repo with "git push origin testing-unit-testing". 
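If you'd rather see the add, commit, and push in one place, here's a small sketch. The function name publish_branch is made up for illustration, and it assumes your remote is named "origin" and your git user name and email are already configured.

```shell
# publish_branch is a hypothetical helper name; run it from inside your cloned repo.
# $1 is the branch to push, $2 is the commit message.
publish_branch() {
  git add . &&             # stage every changed file in the repo
  git commit -m "$2" &&    # commit the staged changes to the current branch
  git push origin "$1"     # send the branch and its commits up to the remote
}

# Example: publish_branch testing-unit-testing "adding a vNet and subnet"
```

Because the steps are chained with &&, a failed commit (say, nothing to commit) stops the push from running.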


Back on ADO, head to Repos --> Branches to find your new branch. All the way at the right there is a column called "Pull Request." If you hover over it, you'll see the option to create a new pull request, called "New pull request". Click it, and then click "Create" on the next page. 

And we'll see the build process go, but nothing else! What did we miss?


Interestingly, the build process has to run a single time to initialize the ADO back-end for build status. But we can fix it! Head back to Repos --> Branches and find the master branch. Hover over it to find the three dots, click them, and then click on "Branch Policies". Look for a section called "Require approval from additional services." This vaguely named section is able to receive input from other processes and give a go/no-go on whether a PR can proceed to be merged. It's exactly what the doctor ordered here. Click on "Add status policy". 


Under "Status to check", you'll now see your release pipeline name. Select it, and check either "Required" (which I heartily recommend, even in a lab environment) or "Optional". The selections are literal - if a required check shows a failed state (like a terraform init or validate failed), the PR CANNOT be merged into master until the code is fixed and the release testing passes. 


Now head back to your PR (Repos --> Pull Requests in the left bar). The build status won't show up right away, but we can tell our build to run again, which will trigger our new required policy. Hover over the "Build succeeded" line and click "Queue build". The build will run once more (in the future this will happen automatically, this is just a first-time gremlin), and then the release pipeline's validation stage will run automatically and report a status. 


If all goes well you'll see a couple of happy green check marks - 1 for the build status (staging artifacts) and 1 for the release status (doing linting, terraform init, validate, plan). 

Wait, What Did We Just Do? 

What we just did is build automated linting and validation. We set PRs to be required for all changes to our terraform codebase, and added automated building, terraform initialization, and validation on every PR creation or update. We also made it so that any failure in build or validation would block PRs from being merged into our master tree, where they might cause problems for others. 

Really what we did is enable a large team of developers to work on the same code concurrently without getting in each other's way, and permit developers of this infrastructure code to move fast and break things in their own branches, with their own testing pipelines, without breaking anything in production. Pretty good for a day's work. 

Good luck out there! 
kyler
