Tuesday, September 17, 2019

Cloud, DevOps: In Defense of Doing It Wrong

I am just awful at watching training videos and remembering the content. Which feels like a fair trade-off from the world for my ability to remember most things that I read with good fidelity. A place where this problem comes to an (unexpected) head is in DevOps and cloud architectures.

AWS and Azure are moving so fast they have trouble keeping up with written documentation. The written documentation usually exists, and isn't terrible, but is usually missing the most recent iterative developments.

Especially as a network engineer, the old model for architecting a system was to sit and read the architectural books. OSI standards don't really go and change on you, so the information can be presented in many different ways, boiled down by excellent authors, and presented in an accurate way that'll stand the test of time.

For instance, I know when I started studying for a CCIE (Cisco expert networking certification) one of the most recommended books to purchase and study was a volume that came out in the mid 90s, when I was learning how to tie my shoes and starting to read chapter books.
...the pace of iteration and invention within the technology space is increasing.
But I'm sure most of the folks reading this from technology fields feel this - the pace of iteration and invention within the technology space is increasing. Every week it seems like a new cloud functionality is released that layers up what that provider is offering, or a new devops framework, module, or practice is developed that can help increase the efficiency of what you and your team are doing.

That's not to say that the old methods and tools you're using will be deprecated (although it sometimes does!), but usually that you're designing for the state of the art from weeks, months, or years ago, and your competitors might be designing for the state of the art today.

It has a tremendously negative effective on the "tribal knowledge" of a team as new tools and practices are implemented. No longer does the cisco networking guru stay at the top of their game just by renewing their CCIE every few years with the same knowledge they had years ago - now we're trying to level up as fast as we can, and so is everyone else.
...we're trying to level up as fast as we can, and so is everyone else.

So Let's Do It Wrong

Which brings me to my point. The tools and technologies we use aren't going to slow down, and the documentation around them will continue to be sub-par. There's no way to become a paper expert at "cloud" or "devops" - the only way to get there as an expert is to DO it.

So deploy your own cloud, learn how your devops tools work by doing it wrong, and then iterating, and then doing it a bit less wrong, etc. Each time you do it wrong you learn a valuable lesson that can't have been learned elsewhere.
Each time you do it wrong you learn a value lesson that couldn't have been learned elsewhere.
So GO - build, break, iterate. Let's build this thing.

Sunday, September 15, 2019

Terraform - Iterative Subnet Module - AWS

Hey all!

Terraform has the ability to call modules, which are snippets of terraform code that can be passed information to build resources. Generally these modules enshrine best practices, and help to keep your DevOps teams on-track in terms of resource nomenclature, structure, and security guidelines.

Modules also have the ability to contain multiple resources, and be passed a "count" variable, which will lead to several similar resources being constructed.

In this blog I'll share code for an AWS subnet and route-table association module that can accept a count variable, and build "n" number of subnets. New subnets are as easy as updating the count variable and updating the list of subnets passed to the module.

But enough talking about the cool thing, let's build it.

Modules Overview

Creating a module is as easy as saying "hey, terraform, call this module, here's where it lives, like this: 

That'd work just fine, however there's much more power in modules when we pass information to them. Imagine calling an ec2 module and passing the subnetID, AMI, size of subnet, etc to it. That makes the module a lot more powerful.

You can imagine how extensible this solution is. For example, here's the finish module call we'll be building today:

We're calling a module that builds subnets, we're telling it a "group" to use in naming of the subnets, availability zones, subnet addresses, route table, oh my! The module has to be written to accept and use all these values, which is the tricky part. So let's jump into that.

Iterative Modules - The Secret Sauce

Writing modules isn't terribly hard - you tell it what values to accept from the caller, and assign those values to fields that terraform accepts, and boom, you have a functional module. However, that module can only build a single resource. Telling it to build several resources in a cogent way is some engineering, some creativity, and some luck. It starts with the "count" parameter.

Count is a built-in terraform variable that terraform uses to know how many times to loop over the same resource and build it several times. This built-in loop within terraform is exposed to the resource via an "index" attribute that tells the loop how many times the loop has run.

If that's made your head spin, that's okay - we'll walk through it with examples. First, let's take the subnet module a few lines at a time. Here's the start of our subnet module, and we're building a subnet resource. We're also using the count attribute we just talked about, but rather than assigning it by hand (which would work fine, as long as we remember to update it for each new subnet!), we're going to pass the number of items in the subnet_addresses list instead using the length function. Basically, we're saying to terraform, "If there are 3 subnets in the subnet_addresses list, iterate 3 times and build 3 subnets."

You can also see that the VPC ID is just specified like you would in any normal non-iterative module. That's because that value will remains static across all the subnets we build.

Let's add one more line - the variable that tells terraform the CIDR address of this subnet. Now, this item needs to change depending on which iteration of the loop we're on. So we tell terraform to pick up the variable passed to this module called subnet_addresses using the element function, of index whatever number of the loop we're on. So if we pass this module an array of "1, 2, 3" and the loop is on iteration 3, it'll pick out the 3rd item in the list, and use the value "3".

You can use those index values to interpolate also. It makes a lot of sense to me to name the subnets starting at 1, rather than 0 where a computer starts counting, so we interpolate the loop index value, and add 1 (so 0 becomes 1, and 1 becomes 2).

And boom, that's our iterative module. However, we also need to associate the subnet to a route-table. Check out the second half of this module and see if you can pick out where iterative looping is done, where interpolation and value modification is done, and follow along.

Notice also that when we're calling the subnet module above, we're only 2 availability zones and 2 route tables. How is the subnet module dealing with only having 2 values when it iterates 5 times? The answer is that lists inherently loop. So if a list contains value "A, B" and the loop is called 5 times, loop 1 will be A, loop 2 will be B, loop 3 will be A, and so on.

Outputs - Splat!

Normally, a module can output a static number of resources, so outputs are easy to write. However, in an iterative module, any number of resources can be created. Outputs don't support the "count" parameter in the same way resources do, so we have to use another creation of Terraform's - the Splat expression.

Here's what our output looks like in this subnet module:

So within the subnet module, that's how you'd export ALL subnet IDs. But say you wanted to reference one of the subnets from the main.tf? It'd look like this - referencing the array value and module name from within the main.tf file:

Go Build It Yourself!

You can find all the functional code at GitHub here: https://github.com/KyMidd/TerraformAwsSubnetModule

Go build some cool iterative modules! Imagine building a dozen ec2 instances, and standardizing their names, security settings, and managing them as a single flexible unit. The possibilities are endless.

Good luck out there.

Thursday, August 29, 2019

Recursive Terraform with Terragrunt

Hey all!

Terraform is capable of remarkable things, not least of which is speaking API commands for many dozens of providers, which lets terraform configuration do amazing things. That isn't to say that terraform is perfect, or as flexible as we'd like it to be, which is why some free-lancers have built wrappers around terraform to add functionality.

One that has gained significant traction is Terragrunt. Terragrunt is a tool that permits terraform to run in parallel, on lots of different main.tf files, and to go even further towards a DRY (Don't Repeat Yourself) implementation with sym-links to shared files. The github page for the tool goes into greater depth and specificity than I'd be able to, so visit it if you're interested!

Terragrunt also permits something cool - to recursively dive into a folder structure and execute all the main.tf files that are found there. Terragrunt can also manage each of these files' state files and help store them in an appropriate back-end.

This recursive property lets you separate your resources out into n number of data stacks, all with separate state files. This is particularly useful within the context of a CI/CD that has to be hard-coded to execute a single or few commands, e.g. terraform against a single location. If your users are going to be adding main.tf files with new resources in new folders, you'll have to constantly be updating your CI/CD to point at those new main.tf files. Or you can use terragrunt and tell it to dive into a folder structure recursively and grab them all, each time its run, automatically.

During this blog I'll walk through how to set up terragrunt, how to separate resources and reference them from different main.tf files, all in the context of what I hope is) an interesting example - building public and private subnets, 2x servers, a load balancer, listener, target group, and all sorts of other cool stuff. At the end of this demo, you'll have an internet-facing Application Load Balancer in AWS that can accept incoming http connections and load-share them between the 2x ec2 servers we'll build.

AWS Bootstrap - S3, IAM

Before we can run terragrunt against an AWS environment, we need to add some resources, which would include an S3 bucket to store the remote state, as well as an IAM user with a policy that lets them do things.

First, log into your AWS account and in the top right, click on My Security Credentials. This is your root user, with unlimited abilities. We don't want to do a ton with this, but we'll use it to get started

Then click on "Access keys" drop-down to expose your keys. If there aren't any there, that's fine. 

Click on "Create New Access Keys", then on "show access keys". Copy those down, then export them into your browser session. These exports will work on a linux or mac computer - you'll need to export in windows syntax for a windows session.

That will permit your terminal to authenticate to your AWS and do things.

Navigate to projects/ado_init on your local disk and update the ado_init_variables.tf file with custom names. The S3 name has to be globally unique, the rest are whatever names make sense to you.

Then run "terraform apply" against the main.tf there. It will build an S3 bucket and the IAM user with policies that you can assume.

Find that IAM user in the IAM panel, click on Users on left side, and find the new IAM user that we built. Click on it to open it up, then click on "Secure Credentials". Click "create access key" to generate some CLI creds.

Now we're ready to get started in terragrunt, so let's dive in!

Terragrunt Properties

Terragrunt is written in the same language as Terraform - HCL, so it'll look very familiar. Here's an example of a "terragrunt.hcl" file that exists in the networking folder of my terraform project.

Notice that it looks exactly like how a remote state backend is written in terraform. And rather than having all this information in both places, terragrunt requires us to remove that information from the terraform main.tf file. Really all that's left in the terraform file is the terraform init statement.

Here's what my folder structure looks like. Notice that there are several different "projects", which are components of my environment, broken out into main.tf files. All are under the "projects" folder, so that's where we can run our terragrunt commands.

Each project component will have its own terragrunt.hcl file, but they'll vary slightly. The reason for this is that each one will maintain a separate state file. Here's the terragrunt.hcl file for the security_group project. Notice that the s3 bucket stays the same - we're putting all our remote state files into the same s3 bucket, but the "key" (folder path and filename) changes, as well as the dynamo-db table.

You'll need to sync down the Terragrunt git repo for this demo to your local comp.

Go through the various terragrunt.hcl files and update the S3 bucket to the one you created in the ado_init step earlier. You can also update the name of the dynamoDB table if you'd like - terragrunt will automatically create these for you if they don't exist yet.

Now it's time to run it!

Run Terragrunt

Navigate to the projects folder and run command "terragrunt apply-all". This command will tell terragrunt to recurse through the directories and execute the main.tf files in each directory that has a terragrunt.hcl file. It'll read the terragrunt.hcl file in each directory and grab (or push) the terraform state to that remote location.

You'll see log messages from each of the main.tf files as it goes, and it can sometimes be hard to tell where the log messages come from.

You may need to run the command a few times - there are inter-dependencies among the several files, and some files can't execute at all when their data sources reference something in a data stack that doesn't exist yet.

Try it out and let me know what you think! Good luck out there.

Saturday, August 24, 2019

AWS IAM: Assuming an IAM role from an EC2 instance

tl;dr: A batch script (code provided) to assume an IAM role from an ec2 instance. Also provided is terraform code to build the IAM roles with proper linked permissions, which can be tricky. 

I'm working through an interesting problem - syncing Azure DevOps to AWS, and making the connection functional, scalable, and simple. Sometimes, when designing anything, a path is followed that doesn't pan out. This is one of those paths, and I wanted to share some lessons learned and code that might help you if this path is a winner for you.

Our security model for EC2 requires that a machine assume a higher IAM policy when it is required, but the rest of the time it have much lower permissions. That's a common use case, and a best practice.

Some applications support assuming a higher IAM role natively - I later learned, after pursuing this, that terraform is one of those applications (more details on that in a future blog). However, some applications can't, and require you to do the heavy lifting yourself.

IAM - a Sordid (and Ongoing) History

IAM (Identity and Access Management) is complex beast that controls authentication (who are you?) and authorization (what are you allowed to do?). Because even simple complex can be made complex with enough work, IAM supports recursive role assumptions, so a resource that starts with 1 set of credentials can assume a different (or more expensive) set of credentials during certain actions.

This has the benefit of being very flexible, and the detriment of allowing deployments so complex it can require a serious amount of nancy drew-ing to sort out what permissions something "really" has. 

This complexity has led to a series of high profile security vulnerabilities introduced by a lack of understanding or a too-complex deployment in some of what are generally thought to be the most security companies. The most recent high profile one was Capital One's hack by an ex-AWS employee. The ec2 IAM policies were written in such a way as to provide access to all s3 buckets, so once a single ec2 instance was compromised, all data everywhere was compromised. KrebsOnSecurity has a great write-up of the incident

Definitions - Policies, Roles, and Trust Relationships, Oh My

So clearly, lack of understanding here can be a vulnerability all in itself, so let's break down what pieces comprise IAM. 
  • Policies: Policies are a list of permissions that can be granted. They are not allowed to be assigned to resources themselves (to my knowledge). Rather, they are assigned to one or more roles, and the roles are assigned to or assumed by resources.
  • Roles: An IAM role is a bucket of permissions. The permissions it contains are not "within" the role, but rather are described in the IAM policies that are assigned to the role. These roles can be assigned to a resource (think ec2 resources being assigned a single ec2 role) or assumed by a resource or process. 
  • Instance Profile: An IAM Instance Profile is a somewhat hidden feature of IAM roles. Instance Profiles are assigned 1:1 to an IAM Role, and when assigned, allow an ec2 instance to be assigned the role. To be even simpler: This stand-alone resource acts as a check box for an IAM role on whether it can be assumed by an ec2 instance or not. 
    • Interesting tip: I say this this resource type is somewhat hidden because when an IAM role is created in the GUI, an Instance Profile is automatically created and assigned. However, if you're building an IAM Role via command line or API call (thing Terraform or CloudFormation), this resource isn't automatically created, and instead acts as a "gotcha". 
  • Trust Relationship: An IAM Trust Relationship is a special policy attached to an IAM Role that controls who can assume the role. This is a key part of our IAM role assuming, and we'll walk through the different policies required on the implicit (assigned) IAM role for the ec2 instance vs the IAM role assumed by the instance.

The Implicit IAM Role

We'll build several IAM roles, with associated policies and trust relationships. First, let's build the Implicit IAM role. This role will be assigned directly to the ec2 instance, and is static.

Note that this role has an embedded IAM policy - this is our trust policy that permits the ec2 instance service to assume this role - this is required if any ec2 instance will be assigned the role.  

Next we'll create an IAM policy for this implicit role. The only permission we want this policy to contain is the ability to use the STS service to assume a specific IAM role. Otherwise, this ec2 instance should act as a normal virtual machine, and not be able to edit or control the AWS environment around it.

Then we link the two together - remember that roles and policies are not linked by default, and have to be assigned together.

And remember this implicit IAM role needs to be statically assigned to an ec2 instance, and that requires it to have an instance profile, so let's build that and assign to the IAM role. 

Once this is all applied, it'll look like this:

And here's the trust policy under the "trust relationships" tab. You should see the ec2 service is trusted by this policy to be assumed.

Now that we have an IAM role with a policy and a trust relationship to the ec2 service (and that gotcha of an instance profile), let's go assign it to an ec2 instance. I didn't include terraform code for this, so you'll build an ec2 instance by hand. Once ready, go into the instance settings, and click "Attach/Replace IAM Role".

Find the IAM role you want to associate with the ec2 instance (the implicit one we just built). If you don't see it, try the refresh icon next to the list, or go check to make sure the instance profile is built and associated with the IAM role properly. 

Great, now we have an IAM role, assigned to an ec2 instance, that permits it to assume a higher permissions role. Which is all well and good, but we haven't built that higher permissions role yet, so let's do that.

More Permissions, Give Me More!

The whole point of this exercise is for the ec2 instance to be able to assume a set of more expansive permissions when it needs it, so we need to build a distinct IAM role to contain those permissions, a policy to describe what permissions we want to grant, and a trust relationship that allows the implicit (statically assigned) ec2 instance to assume the higher permissioned role.

First let's build the IAM role. The role parts are exactly the same, but notice the embedded IAM policy (the trust relationship) is entirely different. Rather than trusting the ec2 service to assume it, it's trusting the first IAM policy only. This assures that only a single specific IAM role can assume this upper IAM role. And the lower role is assigned only to a single ec2 (or more if you want) instance, creating a limited chain of permissions that is very flexible to assign.

Now, let's build an IAM policy of permissions for this expansive role. The example here permits all actions to all services, which is NOT AT ALL a best practice. If at all possible make sure to limit your expansive IAM policies to much more specific actions to specific resources. The policy here should rarely be used.

And you can probably guess what comes next - we need to link the IAM role to the IAM policy that we just built, which looks like this:
When you look at the new role in the AWS console, it'll look like this:

The trust relationship tab will look like this:

Let's Assume The Role, For Real Now

Now that everything is in place, we're ready to go onto the ec2 instance and assume the role. This involves running a batch script, which will do several things - clearing the variables in case of a last run hanging around, figuring out the account ID by calling the AWS ec2 metadata service, figuring out the instance ID, and setting the information to a text file where bash can call it and set the global env variables.

Then we start the cool stuff. AWS ec2 linux AMIs already contain the AWS CLI toolset. If you don't have it, install it for this to work.

First we use the AWS CLI to assume our role, depending on both the dynamic info we gathered earlier - the account number and the EC2 ID. These dynamic pieces permit this same script to be run in any account, and to set an IAM session name that is globally identifiable to this instance, for later CloudTrail-ing.

Then we use jq (javascript query) to export the pieces we need to a file, then we call bash to read the file and set variables into the bash shell environment. Then we cleanup by removing the STS creds from the disk.

Boom, your ec2 instance has now assumed a higher IAM role that the assigned one, and can do all sorts of stuff.

Wrap It All Up

The collected code for all these examples can be found here: https://github.com/KyMidd/AWS_EC2_IAM_Authentication

I'll continue to investigate how to use IAM roles in order to build a comprehensive terraform and Azure DevOps CI/CD, so these types of posts will continue. In the next one, I'll cover how Terraform can handle most of these items itself, so the bash script is not needed.

However, I hope this script and the coverage of IAM helps you in your non-Terraform requirements. Thanks all!

Good luck out there.

Monday, August 12, 2019

Azure DevOps & Terraform: Breaking Up The Monolith - Strategy

Azure DevOps is a CI/CD automation platform from Microsoft ($MSFT). It supports repositories and running all sorts of code and automated code against the code (among many, many other functions). This includes Terraform, a tool that converts scripted, declarative configurations to real resources in cloud (and other) providers via API calls.

Terraform has been an excellent tool for us so far, and is starting to be adopted by other teams, for other purposes, to manage more accounts and resources. Which means the model we selected - to have a single terraform file (with a single .tfstate file) that calls all resources and configurations for all resources in an environment, is quickly getting strained.

Here's an example - say you have this above environment, with a single file. You have a dozen developers working in parallel building projects and adding them to the single monolithic file. Changes might get through the PR process without being properly vetted. Devs might push changes to the terraform repo and not deploy changes yet - maybe the changes aren't ready yet, maybe they shouldn't be deployed yet for some dependency reason. And now it's time that you want to push a tiny little change - maybe to change the size of an instance. You push your PR, run a terraform plan, and it wants to change 22 resources in 3 different time-zones. Would you push the approval through? If you're an experienced engineer, heck no you wouldn't. You could break any number of things.

So that's a scary situation, and probably an eventuality for most companies that start using terraform and don't plan an extensible way to manage these files But that's okay - for better or worse, the best driver of innovation is impending failure. 

What Options Do We Have?

So how can we fix this problem? I have a few different strategies I want to discuss.

Option A: A few more TF files

We could of course break the single monolithic TF and .tfstate file into a few TF files. For instance, put all servers into a single file, and all databases into another. This has the benefit of minimizing changes to process, and putting off the eventual time where many changes are queued up for TF apply-ing.

This has the benefit also of being easily supported by Azure DevOps - you can point the native Terraform plan/apply steps at the several different files, even have them in different concurrent stages of the Terraform release. They can all run automatically, and boom, you're in business.

The big con here is that the problem is only delayed. You have expanded the ability for your processes to scale, but you're still queueing up changes within a single file. And you're going to need to do this again and again in the future.

What would be more ideal is a solution to the problem, rather than a bandaid. So what else can we do?

Option B: Many project TF files, Terragrunt recursion

A problem with Azure DevOps and Terraform in general is that each Terraform step must be pointed at a single directory, and Terraform doesn't support recursion. Which means if you have half a dozen TF files that need to be run, your TF release pipeline is going to be relatively complex. But if you have hundreds? It'd be untenable. Not to mention that ever time a project is added your release project would need to be updated. 

Which is exactly the gap that Terragrunt looks to fill. It natively supports recursion, complex deployments, and lots of tools to keep your configuration DRY (humorously, Don't Repeat Yourself). 

A pro here is that now you can expand ad infinitum with Terraform stacks. Your can tell your devs that if they drop their terraform code into a folder tree you specify, their code will be executed on the next run. 

There's still some downsides. Terragrunt, because of its additional deployment logic, requires new files to be added, and some changes to your TF stack config. If you already have lots of files, not great. And learning a new tool just for this problem isn't ideal either. One complication that seems trivial (but probably isn't) is the Azure DevOps tasks that consume a Service Principal are for Terraform in particular, not any other command, even if it's very similar (Terragrunt). Which means you're looking for a Terragrunt deployment module, which... doesn't exist (yet). So you're deploying code with straight-up terminal commands, and handling the service principal authentication yourself, which isn't a security best practice. 

And one of our big initial drawbacks remains - when an "apply" is run against the top-level of the folder structure, all changes that have been queued up by PR approvals in the terraform repo will be executed. Again, we might end up pushing out dozens of changes if devs haven't been applying their changes right after getting PRs approved. Still not ideal. 

Ideally, we'd be able to get all the benefits from Option B (Recursive Terragrunt) without learning and implementing a new tool and applying changes en masse during a single run. And what a monster I'd be if I didn't present something that satisfied that criteria - customized 

Option C: Targeted, custom Azure DevOps release pipelines

What many companies do is implement Jenkins, an extensible CI/CD that permits more customization of releases, including setting variables that can target particular files for jobs. This is used to help target and run specific Terraform file updates. 

Thankfully, Azure DevOps supports similar functionality. The functionality is relatively recent and still in development, so documentation isn't great. However, we can piece together enough disparate features to make this work well. 

When initiating a TF release pipeline, we can surface a variable that can be consumed by our TF steps within the pipeline to target specific files for execution. Combine building individual TF files with individual state files with a release pipeline that permits executing single TF files one at a time, and we can scale out indefinitely (thousands of TF files) and programmatically define where the TF state file is stored for each TF file.  


The output of all this:
  • We can scale out TF files indefinitely - TF files now stand alone, and aren't all tied back to a single file that can become cluttered and queue up many changes
  • Changes can be applied carefully and methodically - TF updates aren't applied all at once for an entire folder structure - they are targeted and only a single stack is updated at a time
  • No new tooling has to be implemented - We can rely on native Azure DevOps and Terraform functionality. There's no need to teach your team an entirely new tool and methodology

In future blog posts I'll be looking at Terragrunt to implement TF recursively in a folder structure, and separately at customizing Azure DevOps release pipelines with custom variables to permit releases targeting only a single arbitrary TF file.

Good luck out there!

Sunday, August 4, 2019

Connect Azure DevOps to AWS

Azure DevOps (ADO) is a CI/CD platform from Microsoft ($MSFT). It permits a great deal of flexibility in the type of code run, the structure and permissions sets applied to jobs, and many other items of your creation and management of resources and automated jobs. However, support for other cloud providers is (perhaps obviously) weaker than at $MSFT's native Azure Cloud.

However, that doesn't meant $MSFT hasn't made inroads into helping us connect Azure DevOps jobs to the other cloud providers.

I've spent the week researching how to integrate the two. The closest I could find were specific use cases, like Elastic Beanstalk deployments (sans terraform) or arguments about how things worked, or why. No one seems to have built it before, so I knew this challenge would make an interesting blog post. I've done my best to package up the code and lessons to permit you to get this stuff going in your own lab as well.

Install the Microsoft DevLab Terraform Add-On Into Azure DevOps

So first of all, let's install the add-on. Make sure you're signed into dev.azure.com with whatever account you'd like to connect this service to, then go here: https://marketplace.visualstudio.com/items?itemName=ms-devlabs.custom-terraform-tasks

At the top, click the button that says "Get It Free".

Make sure your org is selected, then click "Install". Once complete, you're good to go. Head back to dev.azure.com (or click "Proceed to Organization") to get started.

Remote State, State Locking, Permissions

Azure DevOps builds these items for us in the Azure cloud, so we never have to worry about it. However, when we're crossing clouds we'll need to build a few items to enable Azure DevOps to take over and do its thing in AWS. The items we need to address are: 
  • Remote state storage - Terraform uses a state file to keep track of resources and map the text TF configuration to the resource IDs in the environment. We'll need to store it somewhere. In AWS, the preferred method is an S3 bucket
  • State Locking - When Terraform is actively making changes to a remote state file, it locks the file so no one else can make changes at the same time. This prevents the remote state file from being corrupted by multiple concurrent writes. The preferred way to handle this in AWS is a DynamoDB database. 
  • Permissions - This is the most complex bit - We need to create an IAM user in AWS that ADO can connect as (authentication) and associate any IAM policies the ADO user might require to the role (authorization). 

Catch22, Immediately

Our next step is to build some resources in AWS to permit this connection (IAM), store the remote state (S3 bucket), and handle state locking (DynamoDB). What intuitively makes the most sense is to use our trusty Azure DevOps to build a terraform job to build the things.

But it's a catch-22 - we can't execute the job against Terraform without the permissions already in place. And we can't very easily managed the AWS resources with Terraform if we build them by hand. So what do we do?

The best solution I've found is to create the Azure DevOps "seed" configuration in AWS via a Terraform job from my desktop, without using a remote state file. Once we get all the configuration in place to where Azure DevOps can take over, we'll add the remote-state file from our desktop to the S3 bucket, and start running our jobs from ADO.

Let's build some resources!

Local Terraform - S3, IAM, DynamoDB

Doing all this from the ground up is time consuming and complex! So I did that work for you, and created a cheat-sheet of Terraform to help you get started.


This GitHub repo contains a few files you can use to get a running start. Make sure to preserve the folder structure - the main.tf file uses the path to the ado_seed to find it.

Let's walk through what we're doing in the main.tf file. The first block of main.tf initializes terraform, and requires we use version 0.12.6 exactly. When you run terraform it'll tell you if your version is behind. Right now, 0.12.6 is the state of the art.
Then we define the provider - in this case, AWS. Change the region to whatever region you'd like. When we update this in the future for cloud hosting in ADO, we'll add a remote state location to this block. For now, though, we want to create resources in AWS from our computer, and store the tfstate locally. Then we call the ado_seed module and pass it some variables. This helps ADO name the resources specific to what you'd like. You'll also have an opportunity to look over the ado_seed module itself and see where that info is.
Let's pop into the ado_seed module and see what TF code we're running. First, we're building the S3 bucket. The name of the S3 bucket can be anything, but it has to be globally unique. Also, these S3 buckets are only useable for us in the same region as the environment, so it makes sense to include the region ID in the name for ease of use.

We're enabling strong encryption by default, versioning history as the state file changes, and a terraform attribute called lifecycle prevent_destroy which means TF will error out before replacing or destroying the resource, which is good news for us - we will be in trouble if our state file gets destroyed.
Then we're going to build the DynamoDB. Terraform can consume this database to use it for state locking. Basically, when terraform is editing the state file in S3, it'll put an entry into the database here. When it's done, it removes the entry. As long as every TF session is configured to use the same database, the state locking mechanism works. The primary key for this DB is required to be LockID.
Then we need to start on the IAM user, role, and policies. Bear with me, because the AWS implementation of permissions is incredibly verbose.

First, we need to create an IAM user. This user is where we can generate secret credentials to teach something how to connect as it - for instance, to tell ADO to connect as this user. The user itself doesn't contain permissions - there's no authorization, only authentication.
Then we create a policy for the IAM user. This is a list of the permissions we grant it. I've done something here for simplicity that isn't a good practice - note that the second rule in this policy grants our IAM user ALL rights to ALL resources. That convenient, but if someone compromises this user, not great. It's a better idea to iterate through each permission your ADO service requires and grant it there.
Then we link the policy to the IAM user.
Despite this step-by-step walkthrough I'd recommend copying the whole things down to your computer to avoid syntax and spelling issues and go from there.

Local AWS Authentication, TF Apply

Now that we understand all the steps, let's authenticate our local comp to our AWS environment and build these items. Log into your AWS account and click on your org name, then on "My Security Credentials"

Click on "Create New Access Key" and then copy down the data that is displayed. This credential provides root level access to your AWS account, so 100% do not share it. Copy down both before closing this window - it won't be displayed again.

Export that info to your terminal using this type of syntax:

Run "terraform init" and then "terraform apply" from your desktop in the directory where the main.tf is. Once you see the confirmation to create, type "yes" and hit enter. Terraform will report if there were any issues.

Now we have an S3 bucket for storage, a DynamoDB for locking, and an IAM user for authentication. Let's switch to Azure DevOps to move our Terraform jobs to the cloud!

State to S3, Create IAM Creds

Now that we have our environment in the state we want it, we need to make sure our cloud Terraform jobs know about the state of the environment as it exists right now. To do that, we'll need to upload our local terraform.tfstate file into the S3 bucket.

Head over to the S3 bucket and click on Upload in the top left. Find your terraform.tfstate file in the root of the location you ran your "terraform apply" in and upload it. All options can be left at their defaults.

Once that's done, we need to head into the IAM console to generate some secrets info for our new IAM user so we can provide it to ADO for authentication. Head over to the IAM console --> Users --> and find your user. Click it to jump into it.

Click on the "Security credentials" tab, then click on "Create access key" to generate an IAM secret.

This IAM secret will only be shown once, so don't close this window. Copy down the Access key ID and Secret access key. We'll use that information in the next section. 

Integrating Azure DevOps with AWS IAM

With that done, we're finally(!) ready to head over to Azure DevOps and add a service principal that utilizes this new IAM user and the secrets info we just created.

Drop into ADO --> your project --> Project Settings in the bottom left. Under pipelines, find "Service Connections". These service connections are useful in that they are able to store and manage the secrets and configuration required to authenticate to a cloud environment. Our terraform jobs will be able to consume this info and make our lives easier.

Click on "New service connection" in the top left of this panel and find the "AWS for Terraform" selection. If it's not listed there, head back up to the top of this blog and make sure to follow the steps under "Install the Microsoft DevLab Terraform Add-On Into Azure DevOps".

Fill in the information requested. The name is just a string, name it whatever makes sense to you. The access key and secrets key id are the information from the IAM user that we just generated. This ISN'T your own user's root access to the env. That will work, but isn't a best practice since the root user has unfettered access to the account, not the permissions we set in the IAM policy assigned to this user. Also fill in the region - these service accounts (and S3 buckets, for what it's worth) are only valid in the region they are created for. So if you need to deploy this stuff in multiple regions, you're going to have multiple S3 buckets and multiple service connections, one for each region.

Update Code, then push to ADO repo

We're moving all our workflows into the Azure DevOps cloud, which means we need our Terraform code to live there also. The only change we have to make before pushing this code to our ADO repo is to add the "backend s3" block to our terraform config in the main.tf, like so: 

Since we're just starting this repo, you'll probably push directly to master. For info on how to do that (or how to start up a branch in git and add your changes to it), refer to previous blogs.

I put mine in the folders terraform / terraform-aws / main.tf.

Pipeline Time!

Now all the pieces are in place, and we can get to actually building the pipeline and setting up each step we'll need to actually DO stuff! This is exciting times. Let's do it.

I'm assuming the build pipeline for the terraform repo is already complete. If it isn't, refer back to previous blogs on this site on how to build that.

Head into your project, then click on Pipelines --> Releases. Click on "New" in the top left, then on New release pipeline to build a new one.

ADO really wants to be helpful and can be somewhat confusing. Make sure to click "start with an empty job" to avoid the wizard.

Click on the Artifacts box in the left on "Add an artifact" to pull up our build artifact selection wizard. 

On the "select an artifact" screen, find your terraform build pipeline, then probably use the "Latest" Default version. Click Add to head back to our release pipeline automation screen. 

Click on "Add a stage" to add the new Terraform Plan stage. 

Call the stage AWS Terraform Plan or something similar. This stage will only do validations and planning - no changes will be executed. That'll help us confirm our stages and configuration are working correctly before we move on to executing changes. 

Click the plus (+) sign on the right side of the agent job to add a step and search for "terraform". Look for the "terraform tool installer". It'll handle installing the version of terraform you specify. 

Remember we've required terraform version 0.12.6 in our config files, so make sure to specify the right version here.

Click the plus (+) sign on the agent bar on the left again and again search for "terraform". Look for a step called just that - Terraform. There are several add-on modules that sound similar but have different capabilities, so look for one that looks like this picture. Click add to put this step into your workflow.

Change the provider to AWS, and set the TF command to "init". Also make sure to hit the 3 dots to the right of the "Configuration directory" to find where the command should be executed at - the location of your main.tf file.

Under the Amazon Web Services (AWS) backend configuration, find your Amazon Web Services connection - the service connection we built earlier that uses the IAM user. If nothing shows up, click the refresh button on the right or double-check you created the correct type of ADO service connection. Set the bucket name to the bucket you created from your desktop. Then set the "Key" to the path and name of your terraform.tfstate file in the S3 bucket. I just put mine in the root of the S3 bucket, so my key is simply "terraform.tfstate".

And boom, that's init. You can stop and test here, but I'd recommend adding a few more steps to make sure we're all good to go. We'll want to add a "terraform validate" and a "terraform plan" step to this stage. The easiest way is to right click on the "Terraform Init" step we just created and click "Clone task(s)".

Update the third stage to "terraform validate" and the fourth to "terraform plan". Each stage requires different information, but it's all information we've covered already. Once you've created the stages, it'll look like this: 

Once you feel good about it, click on "Save" in the top right, then click on "Create release", then "Create". Click on the release banner at the top to jump into the release logs. 

In my experience these freeze a lot, so be aware of the "refresh" at the top. If the "Terraform Plan" stage fails, click into the logs and you can check out why. Click on the "terraform plan" stage to see the CLI results. Hopefully yours looks like this also, which means all things have gone well, and our ADO Terraform now has the same state files as we did locally and all things are working. 


There ya go, a functional Azure DevOps Terraform pipeline to build and manage your resources in AWS. Woot! 

Try building your own resources and see how things go! Try to tack on pull requests, validation of PRs pre-merge, and anything else, and report back the cool things you find! 

Good luck out there! 

Sunday, July 14, 2019

Network Engineering is Dying (Except at Cloud Providers)

Hey all!

This past week I spoke to a recruiter for one of the gang of 4 largest companies in tech. That term refers to Google, Amazon, Facebook, Apple (and sometimes Microsoft). The recruiter pitched me on a network engineering role - something that I've happily done for years now.

For the past 20+ years, network engineering teams from most companies have maintained the networks that connect computers which serve up every internet service we interact with each day. Network engineers make sure redundancies exist for the inevitable failures of a network that spans the globe, and they verify the health of all the hardware devices and interfaces which run the network.

A common analogy for network engineering is building the roads for the application "cars" to drive upon.

These jobs have been stable and profitable, integral to the growth and stability of any company that wants to use the internet to drive its business (read: all of them). Most would jump at the opportunity to take any job at these companies. These jobs sparkle on resumes, and even if the day-to-day is similar to most other jobs in the industry, the looming profile these companies have in the news cycle mean it'd be foolish to write off an employment opportunity like this one.

However, the world of network engineering is changing. Many would say dying.

With the exception of maybe a dozen companies on the planet, nearly every company is moving away from physical data centers. IT orgs struggle with the long lead times required to make changes in physical data centers. Purchasing hardware, organizing cabling standards, cooling, 24x7 staffing, and dozens of other concerns are simply avoided by moving to the cloud.

Ironically, the only companies who aren't decreasing their data center footprint? Cloud providers. 

Because of the increased demand, cloud providers are growing their physical data centers at an incredible rate. This requires hiring network engineers, data center engineers, and others with the skillsets to grow them in a scalable way.

The gilded cage of skill-set lock-in
The problem, of course, is skillset lock-in. Not only do most of the gang of 4 famously build their own tooling, but their business model is shared by almost no other company on the globe - to build world-spanning data centers and massive internet-scale networks.

Only cloud providers still invest in physical data centers - and the skillsets required to run them.

Spending time in your career at one of these companies in a department focused on these legacy networks is a dead-end in a career because of this skillset lock-in. It'll be difficult for the folks locked into these positions to leave the very small network of a dozen or so companies that provide these massive clouds and take just a job at just about anywhere else, because these other companies are looking for reliability engineers (SREs), DevOps engineers, and any number of other software-defined cloud computing experts that need entirely different skillsets than those harbored within these divisions at the gang of 4.

If you have the opportunity to work in these divisions at cloud providers, good luck to you! Their famously great pay and benefits are nothing to scoff at. But I hope you consider my points above about career lock-in. Your career must be played as a strategic long-game, and I worry these jobs might be the wrong move.

Best of luck out there.

Sunday, July 7, 2019

Sync Terraform Config and .tfstate for Existing AWS Resources

Hey all!

Terraform is a great (and dominant) infrastructure automation tool. It is multi-cloud, can build all sorts of resources, and in some cases supports API calls to build resources before the native tooling from cloud providers does.

However, it's dependent on a state file that is local, and only reflects resources created by terraform, and a local configuration file to describe resources. It's not able to reach out to a cloud account and create a configuration and .tfstate file based on the existing resources that were built via another method. Or at least, it isn't able to yet. The scaffolding for this functionality exists within Terraform for the AWS cloud, and is called the "import" functionality. It's able to map a single existing resource to a single configuration block for the same resource type and fill in the info, which is of course a manual and tedious process. And imagine if you have hundreds (or thousands!) of resources. It isn't a feasible way to move forward.

Terraforming (link) is a wrapper around terraform and is able to map multiple resources at the same time to configuration blocks, as well as build .tfstate files for multiple existing resource types. Still, it's a little awkward to use - only a single resource type is able to be imported at the same time, and if a command is run against a non-existing resource type (say you don't have a batch configuration, and run a sync against the batch resource), it wipes out the existing .tfstate entirely, removing your progress.

Clearly, the tools could use some help. So I wrote some. I imagine both of these tools (terraform import & terraforming) will eventually get this same functionality. In fact, both of these tools are open source, and I'll work on adding this functionality natively to both of these tools via PRs.

However, for the time being, I'm publishing my code which permits:

  • Creating from scratch a .tfstate file for every terraforming supported resource in an AWS region
  • Creating a single (monolithic) configuration file for each existing resource in an AWS region
This code assumes you don't have an existing .tfstate file - in fact, it wipes out your existing local .tfstate file and builds a new one. So please back up your .tfstate and configuration files before running this tool

However, if your'e new to terraform and want to sync the configuration to an existing AWS region and look at the config for all the resources that exist there, this is a neat shortcut. 

Rather than post the code here and update each time I (or you! The code is open source) update it, I'll post a link to the public github repo. 

I hope it's useful to you. Please add any corrections, comments, and feature additions you'd like via pull requests. And if you know how to update the terraform or terraforming source code to add these functionalities and make my code obsolete, please do so! That would be the best case scenario. 

Thanks all. Good luck out there! 

Sunday, June 30, 2019

Azure DevOps, Terraform Validation and Linting

Hey all!

This post is part of a series on Azure DevOps CI/CD, which we've used to integrate with Azure Cloud, build a terraform deployment, and commit code to build resources.

In part 1 of this series, we:
  • Learned several DevOps and Azure Cloud terms
  • Signed up for an Azure Cloud and Azure DevOps (ADO) account
  • Created an Azure Cloud Service Connection to connect Azure Cloud and ADO
  • Initialized a new git repo in ADO
  • Installed git on our machine (if it didn't have it already)
  • Created an SSH key and associated it with our user account
  • Cloned our (mostly empty) git repo to our computers
  • Create terraform base code and a .gitignore file for terraform
  • Used git on our local computer to create a branch, add the files to it, and push the files and branch to our Azure DevOps git repository
  • Merged the code changes into our master branch
  • Created a build and release pipeline for terraform to validate and push code out to our cloud environment
This blog will continue from where part 2 ended, so if you're following along, walk through part 1 and 2 to build all the above from scratch. Or just read along and watch me do it - I've included lots of pictures. 

Despite all of these neat cloud pipelines, git branches, and terraform automation, most of our pipeline is still manual. Code must be: 
  1. Build locally on a machine and tested before committing to the master - code isn't tested before committing to master and running a build and release
  2. Running the build pipeline manually once new code is committed to master
  3. Running the release pipeline to test code
There's also significant parallelization drawbacks to this method. Say I write a bad terraform commit and push it to master. The release pipeline starts failing for everyone, for any reason, until this particular problem is sorted. This might work for a few developers who have time to commit to this project, but imagine dozens of devs working on a project. There would constantly be faulty code that blocks others - production would grind to a halt.

Software engineering, from which DevOps borrows many of its practices and methodologies, handles this problem by integrating testing before a merge, during the pull request phase of a commit. Then if problems are created, it's only in the individual workspace of a branch, rather than the master that is shared by all developers. 

We're able to build something similar in Azure DevOps, and this blog will also walk through that process. So what are we waiting for? Let's get started! 

Break (Up) the Release Process

Right now our release pipeline has a single stage, and that stage does everything in Terraform - it stages the code, it init(ializes) the environment, validates the terraform, and applies it. 

That's not going to scale, so let's go into the release and edit the first single stage. What we'll do is remove all the "apply" steps, so the first stage is all about testing and validation. So let's update the first stage's title to match. 

We also want to remove the "terraform apply" step - don't worry, we'll add it to a future stage. Click on the apply step in the left column and then in the top right, click on the trash can labelled "remove" to delete that step. Hit save, and click "Pipeline" in the top left. 

We could go ahead and build an entirely new stage with just what we need, but there's an easier way. In the stages area, hover your mouse over the "Terraform Validate" step that we just updated. Click on "Clone" and you'll see a new stage appear. 

Boom, you are a master of efficiency. Click on the "1 job, 5 tasks" in that second (far-right) stage, that starts with "Copy of..."

This second stage is all about applying the Terraform. Before we get to this stage, the Terraform code will be validated by the first stage, so no need to validate again. However, we do need to "init" the environment again - each stage is handled by an entirely new container. So let's pop in there and remove that pesky testing. 

What you have left will look something like this - make sure the last step is Apply. It's also a good idea to name this step "Terraform Apply" to help keep track of what each stage is doing. 

Hit "Save" in the top right and we now have two discrete stages - the first stage tests, and the second stage applies. Which doesn't do a lot of good yet, because the stages are automatically linked and apply without any input from anyone! So let's tell stage 2 (Apply) to wait for our ok before continuing. 

In the top left click on pipeline again. Find the "Terraform Apply" step and hover your mouse over it. Click on the lightning bolt - that's the "Pre-Deployment Conditions" instructions. 

There's a lot going on here, but for now feel free to minimize the "Triggers" section to make this place a lot less busy. Find the "Pre-deployment approvals" section and slide the slider to the Enabled position. Azure DevOps will require that someone approve this step before continuing, and you can lock down who has the authority. In our imaginary business we're the only current employee, so add yourself as an approve. Then click save in the top right. 

Let's test it! In the top right, click on "Create Release" to run this release again. The first step will run normally - we triggered it to run, but something new will happen. The second stage WON'T run immediately. It'll wait patiently for us to "Approve" it. 

This is important for several different use cases. For one, we can just plain validate the "Terraform Plan" is only showing the actions we expect it to do. Second, you could assign more senior resources the ability to approve rollouts, or potentially an InfoSec team, platform team, etc. 

Click into the running release pipeline and the GUI is very clear - the "Terraform Validation" (and plan!) stage ran successfully, and the second stage "Terraform Apply" is pending approval, and won't run without our say so. Feel free to click into the Logs on the Terraform Validation step and make sure plan is only doing what we say, then click "Approve" on the second step, and the apply will continue on. 

Woot! We've broken out our flow of work so there is a separate testing stage from an apply stage. That'll be important for the next few automated linting test items we add. 

Limit the Blast Radius - Pull Requests

Anyone who's worked in infrastructure for any length of time is familiar with the phrase "limiting the blast radius." It's a way to frame changes or processes in an adversarial light - if it all goes wrong, how wrong can it go? The idea is to build processes, protections, and to train the team so that WHEN things go wrong, they don't go terribly wrong. 

We can take that principal to heart here and limit someone's ability to commit changes directly to our master branch, where it is both harder to back-out and avoids the normal review process that code (and any infrastructure changes) should go through. 

In most cases, it shouldn't be allowed except in unusual circumstance. So let's require Pull Requests (PRs). Click on Repos --> Branches and find your "master" branch. It's created automatically when the repo is initialized. Click on the 3 dots to pull up all the settings that apply to that branch, and click on "Branch Policies". 

Check the box next to "Require a minimum number of reviews" and change the minimum number to 1 (since our company only has a single employee!). If you are building this for a larger team, set the number wherever you'd like (limit 10). Since we'll be approving our own changes, make sure to also check the box next to "Allow users to approve their own changes." That's not a best practice for a real enterprise, but for our lab it'll do just fine. 

Hit save to make the policy live. But before we leave this page, we should make one other change. 

Automatic Branch Builds

Normally when you run a "build" job, it'll stage artifacts for all the release jobs from the master branch. It is possible to stage artifacts from a branch, or a particular commit, but it's complex to manage, and can go wrong (rolling out changes from an out-dated version of code, for instance). 

However, that's exactly what we need to do for branches - stage their code in a special artifacts area where it can be tested without messing with our master branch, or with any of the other branches where folks might be working in parallel. 

So it's convenient we're already in this settings page for the master branch. In it you'll find this exact setting that we need to enable. Find the section labelled "Build validation" and click the "Add build policy" button. 

Find your "Build pipeline" in the drop-down menu for Terraform building and select it. Leave the other options on their defaults. The one we're interested in in particular is the "Trigger" being set to automatic. 

What we've just done is made sure that every time a Pull Request targeting our master terraform branch is created or even updated, a new build process will run. That by itself won't test our code, but it's a step in the right direction. Now we just need to tell our release testing stages to also run when the code is staged in a pull request. So let's do that. 

Automatic Release Testing

What we'd like to happen when a PR is created or updated is for our terraform validation stages to run, but not our apply stages. Running an apply stage against a branch that might differ from the master branch is a recipe for chaos. We can do that by tagging a flag in two places - on the "artifacts gather" part of our release pipeline, as well as on each stage to include it in our automated build processes. 

First, go to Pipelines --> Releases and click edit on our terraform release. Find the Artifacts section and click on the lightning bolt - it's labelled the "Continuous deployment trigger" menu. 

In the menu that pops up we're going to change a few things. First, we need to make sure that this release pipeline automatically grabs the new build artifacts when a PR builds them, and then executes this pull request. 

Target branch filter is required to be set. It's asking when to trigger this automated pull request release. Sometimes code is staged in a branch to be ready for a release date. However in our case, most PRs will be created against the master branch directly, so select it as the target branch. 

At the very bottom of this picture above, note something interesting in yellow. Though we've turned on an automatic build and release, it's telling us that neither stage of this release pipeline is going to be executed. 

ADO is cautious - it doesn't want us accidentally over-writing our production infrastructure. So we'll need to go into each stage we want to enable for this automated release process and flip a flag to say "yes, include this stage in the process." Click on the lightning bolt on the "Terraform Validation" / planning step. 

Under the Triggers section, find the "Pull request deployment" slider and move it to Enabled. This is the flag to include this stage in our automated PR testing process. 

Hit save to make the changes live. 

If you were to head back to the Release pipeline's Artifact continuous deployment settings, where you last saw "0 of 2 stages are enabled..." you'll now see "1 of 2" stages are enabled for pull request deployment. 

Did it Work? 

Let's test it so we can see it in action! On our local machine let's make sure we've checkout out our master git branch, then pull any changes we might have missed. This is silly in our lab - we're the only ones in it! But it's a good practice to get into for production environments. 

Let's start a new branch (make sure to not use any spaces in your branch name). 

Edit our main.tf terraform file to add something new - say a virtual network and a new subnet. Something like this would do the trick: 

provider "azurerm" {}

terraform {
backend "azurerm" {}

resource "azurerm_resource_group" "rg" {
name = "testResourceGroup"
location = "westus"

resource "azurerm_virtual_network" "WeAreAwesome" {
  name                = "vNetNewWeAreAwesome1"
  address_space       = [""]
  location            = "westus"
  resource_group_name = "${azurerm_resource_group.rg.name}"

resource "azurerm_subnet" "test" {
  name                 = "testsubnet"
  resource_group_name  = "${azurerm_resource_group.rg.name}"
  virtual_network_name = "${azurerm_virtual_network.WeAreAwesome.name}"
  address_prefix       = ""

Copy and paste the above and save it over your main.tf file. Then head back to your command line and add the updated file to your change (git add .), commit the change to your new branch (git commit -m "adding a vNet and subnet" and then push your new branch, commits and files in tow, to your git master, with "git push origin testing-unit-testing". 

Back on ADO, head to Repos --> Branches to find your new branch. All the way at the right there is a column called "Pull Request." If you hover over it, you'll see the option to create a new pull request, called "New pull request". Click it, and then click "Create" on the next page. 

And we'll see the build process go, but nothing else! What did we miss?

Interestingly, the build process has to run a single time to initialize the ADO back-end for build status. But we can fix it! Head back to Repo --> Branches and find the master branch. Hover over it to find the three dots and then click on Branch Policies. Click it and then look for a section called "Require approval from additional services." This vaguely named section is able to receive input from other processes and give a go/no-go on whether a PR can proceed to be merged. It's exactly what the doctor ordered here. Click on "Add status policy". 

Under "Status to check", you'll now see your release pipeline name. Select it, and check either "Required" (which I heartily recommend, even in a lab environment) or "Optional". The selections are literal - if a required build shows a failed state (like a terraform init or validate failed), the PR CANNOT be merged into master until the code is fixed and a release testing passes. 

Now head back to your PR (Repo --> Pull Requests in the left bar). The build status won't show up right away, but we can tell our build to run again, which will trigger our new required policy. Hover over the "Build succeeded" line and click "Queue build". The build will run once more (in the future this will happen automatically, this is just a first-time gremlin), and then the release pipeline's validate step will run automatically a report a status. 

If all goes well you'll see a couple of happy green check marks - 1 for the build status (staging artifacts) and 1 for the release status (doing linting, terraform init, validate, plan). 

Wait, What Did We Just Do? 

What we just did is build automated linting and validation. We set PRs to be required for all changes to our terraform codebase, and put in automated building and terraform initialization and validation to any PR creation or update. We also made it so that any failure in build or validation would block PRs from being merged into our master tree, where they might cause problems for others. 

Really what we did is enable a large team of developers to work on the same code concurrently without getting in each other's way, and permit developers of this infrastructure code to move fast and break things in their own branches, with their own testing pipelines, without breaking anything in production. Pretty good for a day's work. 

Good luck out there!