Monday, August 12, 2019

Azure DevOps & Terraform: Breaking Up The Monolith - Strategy

Azure DevOps is a CI/CD automation platform from Microsoft ($MSFT). It hosts repositories and can run all sorts of automated jobs against the code in them (among many, many other functions). This includes Terraform, a tool that converts scripted, declarative configurations into real resources in cloud (and other) providers via API calls.

Terraform has been an excellent tool for us so far, and is starting to be adopted by other teams, for other purposes, to manage more accounts and resources. Which means the model we selected - a single Terraform configuration (with a single .tfstate file) that declares all resources and configurations for an entire environment - is quickly getting strained.

Here's an example - say you have the above environment, managed in a single file. You have a dozen developers working in parallel, building projects and adding them to the single monolithic file. Changes might get through the PR process without being properly vetted. Devs might push changes to the Terraform repo without deploying them yet - maybe the changes aren't ready, maybe they shouldn't be deployed yet for some dependency reason. And now you want to push a tiny little change - say, resizing an instance. You push your PR, run a terraform plan, and it wants to change 22 resources in 3 different time zones. Would you push the approval through? If you're an experienced engineer, heck no you wouldn't. You could break any number of things.

So that's a scary situation, and probably an eventuality for most companies that start using Terraform without planning an extensible way to manage these files. But that's okay - for better or worse, the best driver of innovation is impending failure.

What Options Do We Have?


So how can we fix this problem? I have a few different strategies I want to discuss.

Option A: A few more TF files


We could of course break the single monolithic TF and .tfstate file into a few TF files. For instance, put all servers into one file and all databases into another. This has the benefit of minimizing changes to process, and it puts off the eventual day when many changes are queued up for a single terraform apply.
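As a sketch of the idea - assuming remote state lives in an S3 bucket purely for illustration (the same pattern works with whatever backend you already use), each group of resources gets its own directory and its own state key:

```hcl
# servers/main.tf - one state file just for servers
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state" # illustrative bucket name
    key    = "servers/terraform.tfstate"
    region = "us-east-1"
  }
}

# databases/main.tf - a separate state file just for databases
terraform {
  backend "s3" {
    bucket = "mycompany-terraform-state"
    key    = "databases/terraform.tfstate"
    region = "us-east-1"
  }
}
```

A plan against the servers directory now can't queue up changes to the databases, and vice versa.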

This has the benefit also of being easily supported by Azure DevOps - you can point the native Terraform plan/apply steps at the several different files, even have them in different concurrent stages of the Terraform release. They can all run automatically, and boom, you're in business.

The big con here is that the problem is only delayed. You have expanded the ability for your processes to scale, but you're still queueing up changes within a single file. And you're going to need to do this again and again in the future.

What would be more ideal is a solution to the problem, rather than a bandaid. So what else can we do?

Option B: Many project TF files, Terragrunt recursion


A problem with Azure DevOps and Terraform in general is that each Terraform step must be pointed at a single directory, and Terraform doesn't support recursion. Which means if you have half a dozen TF files that need to be run, your TF release pipeline is going to be relatively complex. But if you have hundreds? It'd be untenable. Not to mention that every time a project is added, your release pipeline would need to be updated.

Which is exactly the gap that Terragrunt looks to fill. It natively supports recursion and complex deployments, and includes lots of tools to keep your configuration DRY (Don't Repeat Yourself).

A pro here is that you can now expand ad infinitum with Terraform stacks. You can tell your devs that if they drop their Terraform code into a folder tree you specify, their code will be executed on the next run.
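A minimal sketch of what that folder tree might look like, using Terragrunt's 2019-era (v0.19+) syntax - the bucket name and module source are hypothetical:

```hcl
# live/terragrunt.hcl (root) - shared remote state config, kept DRY for every child
remote_state {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state" # illustrative bucket name
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = "us-east-1"
  }
}

# live/project-a/terragrunt.hcl - one small file per project folder
include {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://example.com/modules.git//project-a" # illustrative module source
}
```

Running `terragrunt apply-all` from the root walks the whole tree and applies every stack it finds (newer Terragrunt versions renamed this to `run-all apply`).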

There are still some downsides. Terragrunt, because of its additional deployment logic, requires new files to be added and some changes to your TF stack config. If you already have lots of files, that's not great. And learning a new tool just for this problem isn't ideal either. One complication that seems trivial (but probably isn't): the Azure DevOps tasks that consume a service principal are built for Terraform specifically, not for any other command, even one that's very similar (Terragrunt). Which means you're looking for a Terragrunt deployment task, which... doesn't exist (yet). So you're deploying code with straight-up terminal commands and handling the service principal authentication yourself, which isn't a security best practice.

And one of our big initial drawbacks remains - when an "apply" is run against the top-level of the folder structure, all changes that have been queued up by PR approvals in the terraform repo will be executed. Again, we might end up pushing out dozens of changes if devs haven't been applying their changes right after getting PRs approved. Still not ideal. 


Ideally, we'd be able to get all the benefits of Option B (recursive Terragrunt) without learning and implementing a new tool, and without applying changes en masse during a single run. And what a monster I'd be if I didn't present something that satisfies those criteria.

Option C: Targeted, custom Azure DevOps release pipelines


What many companies do is implement Jenkins, an extensible CI/CD platform that permits more customization of releases, including setting variables that can target particular files for jobs. This can be used to target and run updates to specific Terraform files.

Thankfully, Azure DevOps supports similar functionality. The functionality is relatively recent and still in development, so documentation isn't great. However, we can piece together enough disparate features to make this work well. 

When initiating a TF release pipeline, we can surface a variable that our TF steps within the pipeline consume to target specific files for execution. Combine individual TF files (each with its own state file) with a release pipeline that can execute a single TF file at a time, and we can scale out indefinitely (thousands of TF files) while programmatically defining where each TF file's state is stored.
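As a sketch of the idea - the variable name and paths here are hypothetical, and this is a plain-script version of what the pipeline steps would do rather than the exact task configuration:

```shell
# "tfProject" is a hypothetical pipeline variable surfaced at release time,
# e.g. tfProject=projects/instance-sizing
cd "terraform/$(tfProject)"

# Each project's state lives under its own key, derived from the same variable
terraform init -backend-config="key=$(tfProject)/terraform.tfstate"
terraform plan
```

One release run touches exactly one project's files and one state file; nothing else in the repo is planned or applied.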

Conclusions


The output of all this:
  • We can scale out TF files indefinitely - TF files now stand alone, and aren't all tied back to a single file that can become cluttered and queue up many changes
  • Changes can be applied carefully and methodically - TF updates aren't applied all at once for an entire folder structure - they are targeted and only a single stack is updated at a time
  • No new tooling has to be implemented - We can rely on native Azure DevOps and Terraform functionality. There's no need to teach your team an entirely new tool and methodology

In future blog posts I'll be looking at Terragrunt to implement TF recursively in a folder structure, and separately at customizing Azure DevOps release pipelines with custom variables to permit releases targeting only a single arbitrary TF file.

Good luck out there!
kyler

Sunday, August 4, 2019

Connect Azure DevOps to AWS

Azure DevOps (ADO) is a CI/CD platform from Microsoft ($MSFT). It permits a great deal of flexibility in the type of code you run, the structure and permission sets applied to jobs, and many other aspects of creating and managing resources and automated jobs. However, support for other cloud providers is (perhaps obviously) weaker than for $MSFT's native Azure cloud.

However, that doesn't mean $MSFT hasn't made inroads into helping us connect Azure DevOps jobs to the other cloud providers.

I've spent the week researching how to integrate the two. The closest things I could find were specific use cases, like Elastic Beanstalk deployments (sans Terraform), or arguments about how things worked and why. No one seems to have built this before, so I knew this challenge would make an interesting blog post. I've done my best to package up the code and lessons so you can get this going in your own lab as well.

Install the Microsoft DevLab Terraform Add-On Into Azure DevOps


So first of all, let's install the add-on. Make sure you're signed into dev.azure.com with whatever account you'd like to connect this service to, then go here: https://marketplace.visualstudio.com/items?itemName=ms-devlabs.custom-terraform-tasks

At the top, click the button that says "Get It Free".



Make sure your org is selected, then click "Install". Once complete, you're good to go. Head back to dev.azure.com (or click "Proceed to Organization") to get started.

Remote State, State Locking, Permissions


Azure DevOps builds these items for us in the Azure cloud, so we never have to worry about them. However, when we're crossing clouds we'll need to build a few items to enable Azure DevOps to take over and do its thing in AWS. The items we need to address are: 
  • Remote state storage - Terraform uses a state file to keep track of resources and map the text TF configuration to the resource IDs in the environment. We'll need to store it somewhere. In AWS, the preferred method is an S3 bucket
  • State Locking - When Terraform is actively making changes to a remote state file, it locks the file so no one else can make changes at the same time. This prevents the remote state file from being corrupted by multiple concurrent writes. The preferred way to handle this in AWS is a DynamoDB database. 
  • Permissions - This is the most complex bit - we need to create an IAM user in AWS that ADO can connect as (authentication) and attach any IAM policies the ADO user might require (authorization). 

Catch22, Immediately


Our next step is to build some resources in AWS to permit this connection (IAM), store the remote state (S3 bucket), and handle state locking (DynamoDB). What intuitively makes the most sense is to use our trusty Azure DevOps to build a terraform job to build the things.

But it's a catch-22 - we can't execute the Terraform job without the permissions already in place, and we can't very easily manage the AWS resources with Terraform if we build them by hand. So what do we do?

The best solution I've found is to create the Azure DevOps "seed" configuration in AWS via a Terraform job from my desktop, without using a remote state file. Once we get all the configuration in place to where Azure DevOps can take over, we'll add the remote-state file from our desktop to the S3 bucket, and start running our jobs from ADO.

Let's build some resources!

Local Terraform - S3, IAM, DynamoDB


Doing all this from the ground up is time consuming and complex! So I did that work for you, and created a cheat-sheet of Terraform to help you get started.

https://github.com/KyMidd/AzureDevOps_Terraform_AwsSeed

This GitHub repo contains a few files you can use to get a running start. Make sure to preserve the folder structure - the main.tf file uses the path to the ado_seed to find it.

Let's walk through what we're doing in the main.tf file. The first block of main.tf initializes Terraform and requires that we use version 0.12.6 exactly. When you run Terraform, it'll tell you if your version is behind. Right now, 0.12.6 is the state of the art.
Then we define the provider - in this case, AWS. Change the region to whatever region you'd like. When we update this in the future for cloud hosting in ADO, we'll add a remote state location to this block. For now, though, we want to create resources in AWS from our computer and store the tfstate locally.

Then we call the ado_seed module and pass it some variables. This helps name the resources specific to what you'd like. You'll also have an opportunity to look over the ado_seed module itself and see where that info is used.
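Pieced together, the main.tf looks roughly like this - a sketch rather than the exact repo contents, with illustrative region, module path, and variable names (check the linked GitHub repo for the real thing):

```hcl
terraform {
  required_version = "= 0.12.6" # require this exact version
}

provider "aws" {
  region = "us-east-1" # change to whatever region you'd like
}

module "ado_seed" {
  source      = "./modules/ado_seed" # path depends on the repo's folder structure
  name_prefix = "ado-terraform"      # illustrative variable for naming resources
  region      = "us-east-1"
}
```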
Let's pop into the ado_seed module and see what TF code we're running. First, we're building the S3 bucket. The name of the S3 bucket can be anything, but it has to be globally unique. Also, these S3 buckets are only usable for remote state in the same region as the environment, so it makes sense to include the region ID in the name for ease of use.

We're enabling strong encryption by default, versioning history as the state file changes, and a Terraform lifecycle attribute called prevent_destroy, which makes TF error out before replacing or destroying the resource. That's good news for us - we'd be in trouble if our state file got destroyed.
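A sketch of that bucket resource (the bucket name is illustrative - yours must be globally unique):

```hcl
resource "aws_s3_bucket" "remote_state" {
  bucket = "mycompany-terraform-state-us-east-1" # illustrative; include the region for ease of use

  # Encrypt the state file at rest by default
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }

  # Keep a version history as the state file changes
  versioning {
    enabled = true
  }

  # Error out rather than ever replace or destroy the state bucket
  lifecycle {
    prevent_destroy = true
  }
}
```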
Then we build the DynamoDB table. Terraform can use this database for state locking: when Terraform is editing the state file in S3, it puts an entry into this database, and removes the entry when it's done. As long as every TF session is configured to use the same database, the state locking mechanism works. The primary key for this table is required to be LockID.
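That table is small - something like this sketch, with an illustrative table name:

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks" # illustrative name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # Terraform requires this exact primary key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```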
Then we need to start on the IAM user, role, and policies. Bear with me, because the AWS implementation of permissions is incredibly verbose.

First, we need to create an IAM user. This user is what we generate secret credentials for, so that something can connect as it - for instance, so ADO can connect as this user. The user itself doesn't contain permissions - there's no authorization, only authentication.
Then we create a policy for the IAM user. This is the list of permissions we grant it. I've done something here for simplicity that isn't a good practice - note that the second rule in this policy grants our IAM user ALL rights to ALL resources. That's convenient, but if someone compromises this user, not great. It's a better idea to iterate through each permission your ADO service requires and grant only those.
Then we link the policy to the IAM user.
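Those three steps look roughly like this - a sketch with illustrative names, and the same too-broad second statement called out above (don't ship that to production):

```hcl
# Authentication only: the user carries no permissions by itself
resource "aws_iam_user" "ado" {
  name = "azure-devops-terraform" # illustrative name
}

# Authorization: the permissions we grant the user
resource "aws_iam_policy" "ado" {
  name = "azure-devops-terraform"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "TerraformStateAccess"
        Effect   = "Allow"
        Action   = ["s3:*", "dynamodb:*"]
        Resource = "*"
      },
      {
        # NOT a good practice - grants ALL rights to ALL resources for simplicity
        Sid      = "AllowEverything"
        Effect   = "Allow"
        Action   = "*"
        Resource = "*"
      }
    ]
  })
}

# Link the policy to the user
resource "aws_iam_user_policy_attachment" "ado" {
  user       = aws_iam_user.ado.name
  policy_arn = aws_iam_policy.ado.arn
}
```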
Despite this step-by-step walkthrough, I'd recommend copying the whole thing down to your computer to avoid syntax and spelling issues, and going from there.

Local AWS Authentication, TF Apply


Now that we understand all the steps, let's authenticate our local machine to our AWS environment and build these items. Log into your AWS account and click on your org name, then on "My Security Credentials"


Click on "Create New Access Key" and then copy down the data that is displayed. This credential provides root level access to your AWS account, so 100% do not share it. Copy down both before closing this window - it won't be displayed again.


Export that info to your terminal using this type of syntax:
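The values below are placeholders (AWS's documentation example keys) - substitute the access key ID and secret you just copied down:

```shell
# Placeholder credentials - replace with your own before running
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFE/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_DEFAULT_REGION="us-east-1" # the region you're building in
```

The AWS provider picks these environment variables up automatically, so no credentials need to live in your Terraform files.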

Run "terraform init" and then "terraform apply" from your desktop in the directory where the main.tf is. Once you see the confirmation to create, type "yes" and hit enter. Terraform will report if there were any issues.


Now we have an S3 bucket for storage, a DynamoDB for locking, and an IAM user for authentication. Let's switch to Azure DevOps to move our Terraform jobs to the cloud!

State to S3, Create IAM Creds


Now that we have our environment in the state we want it, we need to make sure our cloud Terraform jobs know about the state of the environment as it exists right now. To do that, we'll need to upload our local terraform.tfstate file into the S3 bucket.

Head over to the S3 bucket and click on Upload in the top left. Find your terraform.tfstate file in the root of the location you ran your "terraform apply" in and upload it. All options can be left at their defaults.


Once that's done, we need to head into the IAM console to generate some secrets info for our new IAM user so we can provide it to ADO for authentication. Head over to the IAM console --> Users --> and find your user. Click it to jump into it.

Click on the "Security credentials" tab, then click on "Create access key" to generate an IAM secret.


This IAM secret will only be shown once, so don't close this window. Copy down the Access key ID and Secret access key. We'll use that information in the next section. 


Integrating Azure DevOps with AWS IAM


With that done, we're finally(!) ready to head over to Azure DevOps and add a service principal that utilizes this new IAM user and the secrets info we just created.

Drop into ADO --> your project --> Project Settings in the bottom left. Under pipelines, find "Service Connections". These service connections are useful in that they are able to store and manage the secrets and configuration required to authenticate to a cloud environment. Our terraform jobs will be able to consume this info and make our lives easier.

Click on "New service connection" in the top left of this panel and find the "AWS for Terraform" selection. If it's not listed there, head back up to the top of this blog and make sure to follow the steps under "Install the Microsoft DevLab Terraform Add-On Into Azure DevOps".


Fill in the information requested. The name is just a string - name it whatever makes sense to you. The access key ID and secret access key are the values from the IAM user we just generated. This ISN'T your own user's root access to the environment. That would work, but it isn't a best practice, since the root user has unfettered access to the account rather than the permissions we set in the IAM policy assigned to this user. Also fill in the region - these service connections (and S3 buckets, for what it's worth) are only valid in the region they are created for. So if you need to deploy this stuff in multiple regions, you're going to have multiple S3 buckets and multiple service connections, one for each region.


Update Code, then push to ADO repo


We're moving all our workflows into the Azure DevOps cloud, which means we need our Terraform code to live there also. The only change we have to make before pushing this code to our ADO repo is to add the "backend s3" block to our terraform config in the main.tf, like so: 
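A sketch of that updated terraform block - the bucket and table names must match whatever the seed module actually created in your account:

```hcl
terraform {
  required_version = "= 0.12.6"

  backend "s3" {
    bucket         = "mycompany-terraform-state-us-east-1" # the seed S3 bucket
    key            = "terraform.tfstate"                   # path to the state file within the bucket
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"               # the seed DynamoDB lock table
  }
}
```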

Since we're just starting this repo, you'll probably push directly to master. For info on how to do that (or how to start up a branch in git and add your changes to it), refer to previous blogs.

I put mine in the folders terraform / terraform-aws / main.tf.


Pipeline Time!


Now all the pieces are in place, and we can get to actually building the pipeline and setting up each step we'll need to actually DO stuff! These are exciting times. Let's do it.

I'm assuming the build pipeline for the terraform repo is already complete. If it isn't, refer back to previous blogs on this site on how to build that.

Head into your project, then click on Pipelines --> Releases. Click on "New" in the top left, then on New release pipeline to build a new one.



ADO really wants to be helpful and can be somewhat confusing. Make sure to click "start with an empty job" to avoid the wizard.


Click on the Artifacts box in the left on "Add an artifact" to pull up our build artifact selection wizard. 


On the "select an artifact" screen, find your terraform build pipeline, then probably use the "Latest" Default version. Click Add to head back to our release pipeline automation screen. 


Click on "Add a stage" to add the new Terraform Plan stage. 


Call the stage AWS Terraform Plan or something similar. This stage will only do validations and planning - no changes will be executed. That'll help us confirm our stages and configuration are working correctly before we move on to executing changes. 


Click the plus (+) sign on the right side of the agent job to add a step and search for "terraform". Look for the "terraform tool installer". It'll handle installing the version of terraform you specify. 

Remember we've required terraform version 0.12.6 in our config files, so make sure to specify the right version here.


Click the plus (+) sign on the agent bar on the left again and again search for "terraform". Look for a step called just that - Terraform. There are several add-on modules that sound similar but have different capabilities, so look for one that looks like this picture. Click add to put this step into your workflow.


Change the provider to AWS, and set the TF command to "init". Also make sure to hit the 3 dots to the right of "Configuration directory" to find the directory where the command should be executed - the location of your main.tf file.


Under the Amazon Web Services (AWS) backend configuration, find your Amazon Web Services connection - the service connection we built earlier that uses the IAM user. If nothing shows up, click the refresh button on the right or double-check you created the correct type of ADO service connection. Set the bucket name to the bucket you created from your desktop. Then set the "Key" to the path and name of your terraform.tfstate file in the S3 bucket. I just put mine in the root of the S3 bucket, so my key is simply "terraform.tfstate".

And boom, that's init. You can stop and test here, but I'd recommend adding a few more steps to make sure we're all good to go. We'll want to add a "terraform validate" and a "terraform plan" step to this stage. The easiest way is to right click on the "Terraform Init" step we just created and click "Clone task(s)".


Update the third step to "terraform validate" and the fourth to "terraform plan". Each step requires different information, but it's all information we've covered already. Once you've created the steps, it'll look like this: 


Once you feel good about it, click on "Save" in the top right, then click on "Create release", then "Create". Click on the release banner at the top to jump into the release logs. 


In my experience these screens freeze a lot, so be aware of the "Refresh" button at the top. If the "AWS Terraform Plan" stage fails, click into the logs to check out why. Click on the "terraform plan" step to see the CLI results. Hopefully yours looks like this too, which means all has gone well - our ADO Terraform now has the same state file we had locally, and everything is working. 


Profit!


There ya go, a functional Azure DevOps Terraform pipeline to build and manage your resources in AWS. Woot! 

Try building your own resources and see how things go! Try tacking on pull requests, validation of PRs pre-merge, and anything else, and report back the cool things you find! 

Good luck out there! 
kyler