Tuesday, September 17, 2019

Cloud, DevOps: In Defense of Doing It Wrong


I am just awful at watching training videos and remembering the content. Which feels like a fair trade-off from the world for my ability to remember most things that I read with good fidelity. A place where this problem comes to an (unexpected) head is in DevOps and cloud architectures.

AWS and Azure are moving so fast they have trouble keeping up with written documentation. The written documentation usually exists, and isn't terrible, but is usually missing the most recent iterative developments.

Especially as a network engineer, the old model for architecting a system was to sit and read the architectural books. OSI standards don't really go and change on you, so the information can be presented in many different ways, boiled down by excellent authors, and presented in an accurate way that'll stand the test of time.

For instance, I know when I started studying for a CCIE (Cisco expert networking certification) one of the most recommended books to purchase and study was a volume that came out in the mid 90s, when I was learning how to tie my shoes and starting to read chapter books.
But I'm sure most of the folks reading this from technology fields feel this - the pace of iteration and invention within the technology space is increasing. Every week it seems like a new cloud functionality is released that layers up what that provider is offering, or a new devops framework, module, or practice is developed that can help increase the efficiency of what you and your team are doing.

That's not to say that the old methods and tools you're using will be deprecated (although that sometimes happens!), but usually that you're designing for the state of the art from weeks, months, or years ago, and your competitors might be designing for the state of the art today.

It has a tremendously negative effect on the "tribal knowledge" of a team as new tools and practices are implemented. No longer does the Cisco networking guru stay at the top of their game just by renewing their CCIE every few years with the same knowledge they had years ago - now we're trying to level up as fast as we can, and so is everyone else.

So Let's Do It Wrong


Which brings me to my point. The tools and technologies we use aren't going to slow down, and the documentation around them will continue to be sub-par. There's no way to become a paper expert at "cloud" or "devops" - the only way to get there as an expert is to DO it.

So deploy your own cloud, learn how your devops tools work by doing it wrong, then iterating, then doing it a bit less wrong, and so on. Each time you do it wrong you learn a valuable lesson that couldn't have been learned elsewhere.
So GO - build, break, iterate. Let's build this thing.
kyler

Sunday, September 15, 2019

Terraform - Iterative Subnet Module - AWS

Hey all!

Terraform has the ability to call modules, which are snippets of terraform code that can be passed information to build resources. Generally these modules enshrine best practices, and help to keep your DevOps teams on-track in terms of resource nomenclature, structure, and security guidelines.

Modules also have the ability to contain multiple resources, and be passed a "count" variable, which will lead to several similar resources being constructed.

In this blog I'll share code for an AWS subnet and route-table association module that can accept a count variable, and build "n" number of subnets. New subnets are as easy as updating the count variable and updating the list of subnets passed to the module.

But enough talking about the cool thing, let's build it.

Modules Overview


Creating a module is as easy as saying "hey, terraform, call this module, here's where it lives", like this:
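As a minimal sketch of the simplest possible call, assuming the module's code sits in a local folder (the module name "subnets" and the path are placeholders, not taken from the original post):

```hcl
# Hypothetical minimal module call: "subnets" and the path are placeholders.
module "subnets" {
  # Where the module's .tf files live, relative to this file
  source = "./modules/subnet"
}
```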


That'd work just fine, however there's much more power in modules when we pass information to them. Imagine calling an ec2 module and passing the subnet ID, AMI, instance size, etc. to it. That makes the module a lot more powerful.
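Here's a rough sketch of what passing values to an ec2 module could look like. The argument names are whatever the module declares as variables, so the ones below are illustrative assumptions, and the IDs are placeholders:

```hcl
# Hypothetical ec2 module call. Argument names must match the variables
# the module declares; these names and IDs are placeholders.
module "web_server" {
  source = "./modules/ec2"

  subnet_id     = "subnet-0123456789abcdef0"
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.micro"
}
```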


You can imagine how extensible this solution is. For example, here's the finished module call we'll be building today:


We're calling a module that builds subnets, we're telling it a "group" to use in naming of the subnets, availability zones, subnet addresses, route table, oh my! The module has to be written to accept and use all these values, which is the tricky part. So let's jump into that.
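A call matching that description might look something like the sketch below. Only "subnet_addresses" is a name used in this post; the other argument names, the module name, and every ID and address are assumptions for illustration, so check the GitHub repo for the real ones.

```hcl
# Sketch only: names other than subnet_addresses are hypothetical.
module "private_subnets" {
  source = "./modules/subnet"

  vpc_id = "vpc-0123456789abcdef0"

  # Used when naming the subnets
  group = "private"

  # Five addresses in the list, so five subnets get built
  subnet_addresses = [
    "10.0.1.0/24",
    "10.0.2.0/24",
    "10.0.3.0/24",
    "10.0.4.0/24",
    "10.0.5.0/24",
  ]

  # Only two of each are passed in
  availability_zones = ["us-east-1a", "us-east-1b"]
  route_table_ids    = ["rtb-0123456789abcdef0", "rtb-0fedcba9876543210"]
}
```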

Iterative Modules - The Secret Sauce


Writing modules isn't terribly hard - you tell it what values to accept from the caller, and assign those values to fields that terraform accepts, and boom, you have a functional module. However, that module can only build a single resource. Telling it to build several resources in a cogent way is some engineering, some creativity, and some luck. It starts with the "count" parameter.

Count is a built-in terraform meta-argument that tells terraform how many times to loop over the same resource block, building one copy per pass. This built-in loop is exposed to the resource via an "index" attribute (count.index), which holds the number of the current pass, starting at 0.

If that's made your head spin, that's okay - we'll walk through it with examples. First, let's take the subnet module a few lines at a time. Here's the start of our subnet module, and we're building a subnet resource. We're also using the count attribute we just talked about, but rather than assigning it by hand (which would work fine, as long as we remember to update it for each new subnet!), we're going to pass the number of items in the subnet_addresses list instead using the length function. Basically, we're saying to terraform, "If there are 3 subnets in the subnet_addresses list, iterate 3 times and build 3 subnets."

You can also see that the VPC ID is just specified like you would in any normal non-iterative module. That's because that value will remain static across all the subnets we build.


Let's add one more line - the variable that tells terraform the CIDR address of this subnet. Now, this item needs to change depending on which iteration of the loop we're on. So we tell terraform to pick up the variable passed to this module called subnet_addresses using the element function, of index whatever number of the loop we're on. So if we pass this module an array of "1, 2, 3" and the loop is on iteration 3, it'll pick out the 3rd item in the list, and use the value "3".
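To see the element function on its own, here's a tiny stand-alone sketch using the "1, 2, 3" list from the paragraph above. The local names are made up for the example:

```hcl
locals {
  example_list = ["1", "2", "3"]

  # Indexes are zero-based, so the third pass of the loop has index 2
  third_item = element(local.example_list, 2) # "3"
}
```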


You can use those index values to interpolate also. It makes a lot of sense to me to name the subnets starting at 1, rather than 0 where a computer starts counting, so we interpolate the loop index value, and add 1 (so 0 becomes 1, and 1 becomes 2).
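Putting those pieces together, the first half of the module could look roughly like this. It's a sketch written in Terraform 0.12-style syntax; the resource name "subnet", the variables other than subnet_addresses, and the naming scheme in the tag are assumptions, not the repo's exact code.

```hcl
variable "vpc_id" {
  type = string
}

variable "group" {
  type = string
}

variable "subnet_addresses" {
  type = list(string)
}

variable "availability_zones" {
  type = list(string)
}

resource "aws_subnet" "subnet" {
  # One subnet per address in the list
  count = length(var.subnet_addresses)

  # Same VPC for every subnet, so no index needed
  vpc_id = var.vpc_id

  # These change on each pass of the loop
  cidr_block        = element(var.subnet_addresses, count.index)
  availability_zone = element(var.availability_zones, count.index)

  tags = {
    # count.index starts at 0, so add 1 to name the subnets from 1
    Name = "${var.group}-subnet-${count.index + 1}"
  }
}
```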


And boom, that's our iterative module. However, we also need to associate the subnet to a route-table. Check out the second half of this module and see if you can pick out where iterative looping is done, where interpolation and value modification is done, and follow along.
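For reference, a route-table association written in the same style might look like the sketch below. It assumes the subnet resource in the same module is named "subnet" and is built with count, and that the caller passes a list variable named "route_table_ids" (both names are hypothetical).

```hcl
variable "route_table_ids" {
  type = list(string)
}

resource "aws_route_table_association" "subnet" {
  # One association per subnet
  count = length(var.subnet_addresses)

  # Match each association to the subnet built on the same pass
  subnet_id = aws_subnet.subnet[count.index].id

  # Pick a route table by index
  route_table_id = element(var.route_table_ids, count.index)
}
```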

Notice also that when we're calling the subnet module above, we're only passing 2 availability zones and 2 route tables. How is the subnet module dealing with only having 2 values when it iterates 5 times? The answer is that the element function wraps around when the index runs past the end of the list. So if a list contains values "A, B" and the loop is called 5 times, loop 1 will be A, loop 2 will be B, loop 3 will be A, and so on.
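That wrap-around is easy to check in isolation. This small sketch (local names invented for the example) asks for five items out of a two-item list:

```hcl
locals {
  zones = ["A", "B"]

  # element() wraps around once the index runs past the end of the list
  picks = [for i in range(5) : element(local.zones, i)]
  # picks is ["A", "B", "A", "B", "A"]
}
```

Note that this is behavior of the element function specifically. Indexing the list directly with square brackets and an out-of-range number produces an error instead.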

Outputs - Splat!


Normally, a module can output a static number of resources, so outputs are easy to write. However, in an iterative module, any number of resources can be created. Outputs don't support the "count" parameter in the same way resources do, so we have to use another creation of Terraform's - the Splat expression.

Here's what our output looks like in this subnet module:
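As a sketch, assuming the subnet resource inside the module is named "subnet" and the output is called "subnet_ids" (both names are assumptions), a splat output could be written like this:

```hcl
output "subnet_ids" {
  # The splat collects the id attribute of every subnet built by count
  value = aws_subnet.subnet[*].id
}
```

Older configurations often use the legacy form aws_subnet.subnet.*.id, which returns the same list.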


So within the subnet module, that's how you'd export ALL subnet IDs. But say you wanted to reference one of the subnets from the main.tf? It'd look like this - referencing the array value and module name from within the main.tf file:
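Here's a sketch of that reference, assuming the module was called under the hypothetical name "private_subnets" and exposes a list output named "subnet_ids". The instance itself is just a placeholder consumer:

```hcl
resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.micro"

  # First subnet built by the module (list indexes start at 0)
  subnet_id = module.private_subnets.subnet_ids[0]
}
```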


Go Build It Yourself!


You can find all the functional code at GitHub here: https://github.com/KyMidd/TerraformAwsSubnetModule

Go build some cool iterative modules! Imagine building a dozen ec2 instances, and standardizing their names, security settings, and managing them as a single flexible unit. The possibilities are endless.

Good luck out there.
kyler

Thursday, August 29, 2019

Recursive Terraform with Terragrunt

Hey all!

Terraform is capable of remarkable things, not least of which is speaking API commands for many dozens of providers, which lets terraform configuration do amazing things. That isn't to say that terraform is perfect, or as flexible as we'd like it to be, which is why some free-lancers have built wrappers around terraform to add functionality.

One that has gained significant traction is Terragrunt. Terragrunt is a tool that permits terraform to run in parallel, on lots of different main.tf files, and to go even further towards a DRY (Don't Repeat Yourself) implementation with sym-links to shared files. The github page for the tool goes into greater depth and specificity than I'd be able to, so visit it if you're interested!

Terragrunt also permits something cool - to recursively dive into a folder structure and execute all the main.tf files that are found there. Terragrunt can also manage each of these files' state files and help store them in an appropriate back-end.

This recursive property lets you separate your resources out into n number of data stacks, all with separate state files. This is particularly useful within the context of a CI/CD that has to be hard-coded to execute a single or few commands, e.g. terraform against a single location. If your users are going to be adding main.tf files with new resources in new folders, you'll have to constantly be updating your CI/CD to point at those new main.tf files. Or you can use terragrunt and tell it to dive into a folder structure recursively and grab them all, each time it's run, automatically.

During this blog I'll walk through how to set up terragrunt and how to separate resources and reference them from different main.tf files, all in the context of (what I hope is) an interesting example - building public and private subnets, 2x servers, a load balancer, listener, target group, and all sorts of other cool stuff. At the end of this demo, you'll have an internet-facing Application Load Balancer in AWS that can accept incoming http connections and load-share them between the 2x ec2 servers we'll build.

AWS Bootstrap - S3, IAM


Before we can run terragrunt against an AWS environment, we need to add some resources, which would include an S3 bucket to store the remote state, as well as an IAM user with a policy that lets them do things.

First, log into your AWS account and in the top right, click on My Security Credentials. This is your root user, with unlimited abilities. We don't want to do a ton with this, but we'll use it to get started.


Then click on "Access keys" drop-down to expose your keys. If there aren't any there, that's fine. 


Click on "Create New Access Keys", then on "show access keys". Copy those down, then export them into your terminal session. These exports will work on a linux or mac computer - you'll need to export in windows syntax for a windows session.

That will permit your terminal to authenticate to your AWS account and do things.

Navigate to projects/ado_init on your local disk and update the ado_init_variables.tf file with custom names. The S3 name has to be globally unique, the rest are whatever names make sense to you.

Then run "terraform apply" against the main.tf there. It will build an S3 bucket and the IAM user with policies that you can assume.

Open the IAM panel, click on Users on the left side, and find the new IAM user that we built. Click on it to open it up, then click on "Security credentials". Click "create access key" to generate some CLI creds.


Now we're ready to get started in terragrunt, so let's dive in!

Terragrunt Properties


Terragrunt is written in the same language as Terraform - HCL, so it'll look very familiar. Here's an example of a "terragrunt.hcl" file that exists in the networking folder of my terraform project.
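For readers without the repo open, a terragrunt.hcl for a networking folder might look roughly like the sketch below. The bucket name, key, region, and table name are placeholders I've made up, not the values from my project:

```hcl
# Sketch of a terragrunt.hcl remote state block. All values are placeholders.
remote_state {
  backend = "s3"

  config = {
    bucket         = "my-unique-terraform-state-bucket"
    key            = "networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "networking-terraform-lock"
  }
}
```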

Notice that it looks exactly like how a remote state backend is written in terraform. And rather than having all this information in both places, terragrunt requires us to remove that information from the terraform main.tf file. Really all that's left in the terraform file is the terraform init statement.
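Terragrunt's documentation asks for an empty backend block to stay behind in the terraform code, so the settings it passes in have somewhere to land. Assuming that's the statement meant here, the part left in main.tf would be about this small:

```hcl
terraform {
  # Left empty on purpose: Terragrunt supplies the bucket, key, region,
  # and lock table from terragrunt.hcl when it runs terraform init
  backend "s3" {}
}
```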

Here's what my folder structure looks like. Notice that there are several different "projects", which are components of my environment, broken out into main.tf files. All are under the "projects" folder, so that's where we can run our terragrunt commands.



Each project component will have its own terragrunt.hcl file, but they'll vary slightly. The reason for this is that each one will maintain a separate state file. Here's the terragrunt.hcl file for the security_group project. Notice that the s3 bucket stays the same - we're putting all our remote state files into the same s3 bucket, but the "key" (folder path and filename) changes, as well as the dynamo-db table.

You'll need to sync down the Terragrunt git repo for this demo to your local comp.

Go through the various terragrunt.hcl files and update the S3 bucket to the one you created in the ado_init step earlier. You can also update the name of the dynamoDB table if you'd like - terragrunt will automatically create these for you if they don't exist yet.

Now it's time to run it!

Run Terragrunt


Navigate to the projects folder and run command "terragrunt apply-all". This command will tell terragrunt to recurse through the directories and execute the main.tf files in each directory that has a terragrunt.hcl file. It'll read the terragrunt.hcl file in each directory and grab (or push) the terraform state to that remote location.

You'll see log messages from each of the main.tf files as it goes, and it can sometimes be hard to tell where the log messages come from.

You may need to run the command a few times - there are inter-dependencies among the several files, and some files can't execute at all when their data sources reference something in a data stack that doesn't exist yet.
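One possible way to cut down on the re-runs, which I'm not using in this demo, is Terragrunt's "dependencies" block. It tells the recursive commands to apply the listed folders first. A sketch, assuming sibling folders named networking and security_group, placed in the terragrunt.hcl of a stack that reads from both:

```hcl
# Hypothetical addition to a terragrunt.hcl. Paths are relative to the
# folder this file lives in.
dependencies {
  paths = ["../networking", "../security_group"]
}
```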

Try it out and let me know what you think! Good luck out there.
kyler

Saturday, August 24, 2019

AWS IAM: Assuming an IAM role from an EC2 instance

tl;dr: A bash script (code provided) to assume an IAM role from an ec2 instance. Also provided is terraform code to build the IAM roles with proper linked permissions, which can be tricky. 

I'm working through an interesting problem - syncing Azure DevOps to AWS, and making the connection functional, scalable, and simple. Sometimes, when designing anything, a path is followed that doesn't pan out. This is one of those paths, and I wanted to share some lessons learned and code that might help you if this path is a winner for you.

Our security model for EC2 requires that a machine assume a more privileged IAM role when it is required, but the rest of the time it has much lower permissions. That's a common use case, and a best practice.

Some applications support assuming a higher IAM role natively - I later learned, after pursuing this, that terraform is one of those applications (more details on that in a future blog). However, some applications can't, and require you to do the heavy lifting yourself.

IAM - a Sordid (and Ongoing) History




IAM (Identity and Access Management) is a complex beast that controls authentication (who are you?) and authorization (what are you allowed to do?). Because even simple concepts can be made complex with enough work, IAM supports recursive role assumptions, so a resource that starts with 1 set of credentials can assume a different (or more expansive) set of credentials during certain actions.

This has the benefit of being very flexible, and the detriment of allowing deployments so complex it can require a serious amount of Nancy Drew-ing to sort out what permissions something "really" has. 

This complexity has led to a series of high profile security vulnerabilities, introduced by a lack of understanding or a too-complex deployment, at some of what are generally thought to be the most security-conscious companies. The most recent high profile one was Capital One's hack by an ex-AWS employee. The ec2 IAM policies were written in such a way as to provide access to all s3 buckets, so once a single ec2 instance was compromised, all data everywhere was compromised. KrebsOnSecurity has a great write-up of the incident.

Definitions - Policies, Roles, and Trust Relationships, Oh My


So clearly, lack of understanding here can be a vulnerability all in itself, so let's break down what pieces comprise IAM. 
  • Policies: Policies are a list of permissions that can be granted. They are not allowed to be assigned to resources themselves (to my knowledge). Rather, they are assigned to one or more roles, and the roles are assigned to or assumed by resources.
  • Roles: An IAM role is a bucket of permissions. The permissions it contains are not "within" the role, but rather are described in the IAM policies that are assigned to the role. These roles can be assigned to a resource (think ec2 resources being assigned a single ec2 role) or assumed by a resource or process. 
  • Instance Profile: An IAM Instance Profile is a somewhat hidden feature of IAM roles. Instance Profiles are assigned 1:1 to an IAM Role, and when assigned, allow an ec2 instance to be assigned the role. To be even simpler: This stand-alone resource acts as a check box for an IAM role on whether it can be assumed by an ec2 instance or not. 
    • Interesting tip: I say this resource type is somewhat hidden because when an IAM role is created in the GUI, an Instance Profile is automatically created and assigned. However, if you're building an IAM Role via command line or API call (think Terraform or CloudFormation), this resource isn't automatically created, and instead acts as a "gotcha". 
  • Trust Relationship: An IAM Trust Relationship is a special policy attached to an IAM Role that controls who can assume the role. This is a key part of our IAM role assuming, and we'll walk through the different policies required on the implicit (assigned) IAM role for the ec2 instance vs the IAM role assumed by the instance.

The Implicit IAM Role


We'll build several IAM roles, with associated policies and trust relationships. First, let's build the Implicit IAM role. This role will be assigned directly to the ec2 instance, and is static.

Note that this role has an embedded IAM policy - this is our trust policy that permits the ec2 instance service to assume this role - this is required if any ec2 instance will be assigned the role.  
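A sketch of what such a role can look like in terraform follows. The resource and role names are placeholders, and I'm using jsonencode here for readability, which may differ from how the repo writes the JSON:

```hcl
resource "aws_iam_role" "implicit" {
  name = "Ec2ImplicitRole"

  # Trust policy: lets the ec2 service assume this role for an instance
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Action    = "sts:AssumeRole"
        Principal = { Service = "ec2.amazonaws.com" }
      }
    ]
  })
}
```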

Next we'll create an IAM policy for this implicit role. The only permission we want this policy to contain is the ability to use the STS service to assume a specific IAM role. Otherwise, this ec2 instance should act as a normal virtual machine, and not be able to edit or control the AWS environment around it.
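A policy along those lines could be sketched like this. The account number and the name of the role being assumed are placeholders, so swap in your own:

```hcl
resource "aws_iam_policy" "implicit" {
  name = "Ec2ImplicitAssumeElevated"

  # The only permission granted: assume one specific role through STS
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = "sts:AssumeRole"
        Resource = "arn:aws:iam::123456789012:role/Ec2ElevatedRole"
      }
    ]
  })
}
```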


Then we link the two together - remember that roles and policies are not linked by default, and have to be assigned together.

And remember this implicit IAM role needs to be statically assigned to an ec2 instance, and that requires it to have an instance profile, so let's build that and assign to the IAM role. 
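Those two linking steps might look like the sketch below. It assumes the implicit role and its policy are declared elsewhere in the same configuration as aws_iam_role.implicit and aws_iam_policy.implicit (hypothetical names):

```hcl
# Link the policy to the role
resource "aws_iam_role_policy_attachment" "implicit" {
  role       = aws_iam_role.implicit.name
  policy_arn = aws_iam_policy.implicit.arn
}

# The "gotcha": without this, the role can't be attached to an ec2 instance
resource "aws_iam_instance_profile" "implicit" {
  name = "Ec2ImplicitRole"
  role = aws_iam_role.implicit.name
}
```

Giving the instance profile the same name as the role mirrors what the console does when it creates one for you, but any name works.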

Once this is all applied, it'll look like this:

And here's the trust policy under the "trust relationships" tab. You should see that the ec2 service is trusted by this policy to assume the role.


Now that we have an IAM role with a policy and a trust relationship to the ec2 service (and that gotcha of an instance profile), let's go assign it to an ec2 instance. I didn't include terraform code for this, so you'll build an ec2 instance by hand. Once ready, go into the instance settings, and click "Attach/Replace IAM Role".


Find the IAM role you want to associate with the ec2 instance (the implicit one we just built). If you don't see it, try the refresh icon next to the list, or go check to make sure the instance profile is built and associated with the IAM role properly. 

 
Great, now we have an IAM role, assigned to an ec2 instance, that permits it to assume a higher permissions role. Which is all well and good, but we haven't built that higher permissions role yet, so let's do that.

More Permissions, Give Me More!

The whole point of this exercise is for the ec2 instance to be able to assume a set of more expansive permissions when it needs it, so we need to build a distinct IAM role to contain those permissions, a policy to describe what permissions we want to grant, and a trust relationship that allows the implicit (statically assigned) ec2 instance to assume the higher permissioned role.

First let's build the IAM role. The role parts are exactly the same, but notice the embedded IAM policy (the trust relationship) is entirely different. Rather than trusting the ec2 service to assume it, it's trusting the first IAM role only. This assures that only a single specific IAM role can assume this upper IAM role. And the lower role is assigned only to a single ec2 (or more if you want) instance, creating a limited chain of permissions that is very flexible to assign.
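As a sketch, the higher-permission role could be declared like this. The names and account number are placeholders, with the principal pointing at the ARN of the implicit role:

```hcl
resource "aws_iam_role" "elevated" {
  name = "Ec2ElevatedRole"

  # Trust policy: only the implicit role is allowed to assume this one
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Action    = "sts:AssumeRole"
        Principal = { AWS = "arn:aws:iam::123456789012:role/Ec2ImplicitRole" }
      }
    ]
  })
}
```

AWS will generally reject a trust policy that names a role which doesn't exist yet, so the implicit role needs to be built first.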

Now, let's build an IAM policy of permissions for this expansive role. The example here permits all actions to all services, which is NOT AT ALL a best practice. If at all possible make sure to limit your expansive IAM policies to much more specific actions to specific resources. The policy here should rarely be used.
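For illustration only, an allow-everything policy looks like the sketch below (names are placeholders). In real use, narrow the Action and Resource values down to what the job actually needs:

```hcl
resource "aws_iam_policy" "elevated" {
  name = "Ec2ElevatedPolicy"

  # Allows every action on every resource. NOT a best practice.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = "*"
        Resource = "*"
      }
    ]
  })
}
```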

And you can probably guess what comes next - we need to link the IAM role to the IAM policy that we just built, which looks like this:
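A sketch of that link, assuming the role and policy are declared in the same configuration under the hypothetical names aws_iam_role.elevated and aws_iam_policy.elevated:

```hcl
resource "aws_iam_role_policy_attachment" "elevated" {
  role       = aws_iam_role.elevated.name
  policy_arn = aws_iam_policy.elevated.arn
}
```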
When you look at the new role in the AWS console, it'll look like this:

The trust relationship tab will look like this:

Let's Assume The Role, For Real Now


Now that everything is in place, we're ready to go onto the ec2 instance and assume the role. This involves running a bash script, which will do several things - clearing the variables in case of a last run hanging around, figuring out the account ID by calling the AWS ec2 metadata service, figuring out the instance ID, and setting the information to a text file where bash can call it and set the global env variables.

Then we start the cool stuff. AWS ec2 linux AMIs already contain the AWS CLI toolset. If you don't have it, install it for this to work.

First we use the AWS CLI to assume our role, depending on both the dynamic info we gathered earlier - the account number and the EC2 ID. These dynamic pieces permit this same script to be run in any account, and to set an IAM session name that is globally identifiable to this instance, for later CloudTrail-ing.

Then we use jq (a command-line JSON processor) to export the pieces we need to a file, then we call bash to read the file and set variables into the bash shell environment. Then we clean up by removing the STS creds from the disk.

Boom, your ec2 instance has now assumed a higher IAM role than the assigned one, and can do all sorts of stuff.

Wrap It All Up


The collected code for all these examples can be found here: https://github.com/KyMidd/AWS_EC2_IAM_Authentication

I'll continue to investigate how to use IAM roles in order to build a comprehensive terraform and Azure DevOps CI/CD, so these types of posts will continue. In the next one, I'll cover how Terraform can handle most of these items itself, so the bash script is not needed.

However, I hope this script and the coverage of IAM helps you in your non-Terraform requirements. Thanks all!

Good luck out there.
kyler

Monday, August 12, 2019

Azure DevOps & Terraform: Breaking Up The Monolith - Strategy

Azure DevOps is a CI/CD automation platform from Microsoft ($MSFT). It supports repositories and running all sorts of automated jobs against the code in them (among many, many other functions). This includes Terraform, a tool that converts scripted, declarative configurations to real resources in cloud (and other) providers via API calls.

Terraform has been an excellent tool for us so far, and is starting to be adopted by other teams, for other purposes, to manage more accounts and resources. Which means the model we selected - to have a single terraform file (with a single .tfstate file) that calls all resources and configurations for all resources in an environment, is quickly getting strained.

Here's an example - say you have this above environment, with a single file. You have a dozen developers working in parallel building projects and adding them to the single monolithic file. Changes might get through the PR process without being properly vetted. Devs might push changes to the terraform repo and not deploy changes yet - maybe the changes aren't ready yet, maybe they shouldn't be deployed yet for some dependency reason. And now it's time that you want to push a tiny little change - maybe to change the size of an instance. You push your PR, run a terraform plan, and it wants to change 22 resources in 3 different time-zones. Would you push the approval through? If you're an experienced engineer, heck no you wouldn't. You could break any number of things.

So that's a scary situation, and probably an eventuality for most companies that start using terraform and don't plan an extensible way to manage these files. But that's okay - for better or worse, the best driver of innovation is impending failure. 

What Options Do We Have?


So how can we fix this problem? I have a few different strategies I want to discuss.

Option A: A few more TF files


We could of course break the single monolithic TF and .tfstate file into a few TF files. For instance, put all servers into a single file, and all databases into another. This has the benefit of minimizing changes to process, and putting off the eventual time where many changes are queued up for TF apply-ing.

This has the benefit also of being easily supported by Azure DevOps - you can point the native Terraform plan/apply steps at the several different files, even have them in different concurrent stages of the Terraform release. They can all run automatically, and boom, you're in business.

The big con here is that the problem is only delayed. You have expanded the ability for your processes to scale, but you're still queueing up changes within a single file. And you're going to need to do this again and again in the future.

What would be more ideal is a solution to the problem, rather than a bandaid. So what else can we do?

Option B: Many project TF files, Terragrunt recursion


A problem with Azure DevOps and Terraform in general is that each Terraform step must be pointed at a single directory, and Terraform doesn't support recursion. Which means if you have half a dozen TF files that need to be run, your TF release pipeline is going to be relatively complex. But if you have hundreds? It'd be untenable. Not to mention that every time a project is added your release pipeline would need to be updated. 

Which is exactly the gap that Terragrunt looks to fill. It natively supports recursion, complex deployments, and lots of tools to keep your configuration DRY (humorously, Don't Repeat Yourself). 

A pro here is that now you can expand ad infinitum with Terraform stacks. You can tell your devs that if they drop their terraform code into a folder tree you specify, their code will be executed on the next run. 

There are still some downsides. Terragrunt, because of its additional deployment logic, requires new files to be added, and some changes to your TF stack config. If you already have lots of files, not great. And learning a new tool just for this problem isn't ideal either. One complication that seems trivial (but probably isn't) is that the Azure DevOps tasks that consume a Service Principal are for Terraform in particular, not any other command, even if it's very similar (Terragrunt). Which means you're looking for a Terragrunt deployment module, which... doesn't exist (yet). So you're deploying code with straight-up terminal commands, and handling the service principal authentication yourself, which isn't a security best practice. 

And one of our big initial drawbacks remains - when an "apply" is run against the top-level of the folder structure, all changes that have been queued up by PR approvals in the terraform repo will be executed. Again, we might end up pushing out dozens of changes if devs haven't been applying their changes right after getting PRs approved. Still not ideal. 


Ideally, we'd be able to get all the benefits from Option B (Recursive Terragrunt) without learning and implementing a new tool, and without applying changes en masse during a single run. And what a monster I'd be if I didn't present something that satisfied those criteria - customized release pipelines.

Option C: Targeted, custom Azure DevOps release pipelines


What many companies do is implement Jenkins, an extensible CI/CD that permits more customization of releases, including setting variables that can target particular files for jobs. This is used to help target and run specific Terraform file updates. 

Thankfully, Azure DevOps supports similar functionality. The functionality is relatively recent and still in development, so documentation isn't great. However, we can piece together enough disparate features to make this work well. 

When initiating a TF release pipeline, we can surface a variable that can be consumed by our TF steps within the pipeline to target specific files for execution. Combine building individual TF files with individual state files with a release pipeline that permits executing single TF files one at a time, and we can scale out indefinitely (thousands of TF files) and programmatically define where the TF state file is stored for each TF file.  

Conclusions


The output of all this:
  • We can scale out TF files indefinitely - TF files now stand alone, and aren't all tied back to a single file that can become cluttered and queue up many changes
  • Changes can be applied carefully and methodically - TF updates aren't applied all at once for an entire folder structure - they are targeted and only a single stack is updated at a time
  • No new tooling has to be implemented - We can rely on native Azure DevOps and Terraform functionality. There's no need to teach your team an entirely new tool and methodology

In future blog posts I'll be looking at Terragrunt to implement TF recursively in a folder structure, and separately at customizing Azure DevOps release pipelines with custom variables to permit releases targeting only a single arbitrary TF file.

Good luck out there!
kyler

Sunday, August 4, 2019

Connect Azure DevOps to AWS

Azure DevOps (ADO) is a CI/CD platform from Microsoft ($MSFT). It permits a great deal of flexibility in the type of code run, the structure and permissions sets applied to jobs, and many other items of your creation and management of resources and automated jobs. However, support for other cloud providers is (perhaps obviously) weaker than at $MSFT's native Azure Cloud.

However, that doesn't mean $MSFT hasn't made inroads into helping us connect Azure DevOps jobs to the other cloud providers.

I've spent the week researching how to integrate the two. The closest I could find were specific use cases, like Elastic Beanstalk deployments (sans terraform) or arguments about how things worked, or why. No one seems to have built it before, so I knew this challenge would make an interesting blog post. I've done my best to package up the code and lessons to permit you to get this stuff going in your own lab as well.

Install the Microsoft DevLab Terraform Add-On Into Azure DevOps


So first of all, let's install the add-on. Make sure you're signed into dev.azure.com with whatever account you'd like to connect this service to, then go here: https://marketplace.visualstudio.com/items?itemName=ms-devlabs.custom-terraform-tasks

At the top, click the button that says "Get It Free".



Make sure your org is selected, then click "Install". Once complete, you're good to go. Head back to dev.azure.com (or click "Proceed to Organization") to get started.

Remote State, State Locking, Permissions


Azure DevOps builds these items for us in the Azure cloud, so we never have to worry about it. However, when we're crossing clouds we'll need to build a few items to enable Azure DevOps to take over and do its thing in AWS. The items we need to address are: 
  • Remote state storage - Terraform uses a state file to keep track of resources and map the text TF configuration to the resource IDs in the environment. We'll need to store it somewhere. In AWS, the preferred method is an S3 bucket
  • State Locking - When Terraform is actively making changes to a remote state file, it locks the file so no one else can make changes at the same time. This prevents the remote state file from being corrupted by multiple concurrent writes. The preferred way to handle this in AWS is a DynamoDB database. 
  • Permissions - This is the most complex bit - We need to create an IAM user in AWS that ADO can connect as (authentication) and attach any IAM policies the ADO user might require to that user (authorization).

Catch22, Immediately


Our next step is to build some resources in AWS to permit this connection (IAM), store the remote state (S3 bucket), and handle state locking (DynamoDB). What intuitively makes the most sense is to use our trusty Azure DevOps to build a terraform job to build the things.

But it's a catch-22 - we can't execute the Terraform job against AWS without the permissions already in place. And we can't very easily manage the AWS resources with Terraform if we build them by hand. So what do we do?

The best solution I've found is to create the Azure DevOps "seed" configuration in AWS via a Terraform job from my desktop, without using a remote state file. Once we get all the configuration in place to where Azure DevOps can take over, we'll add the remote-state file from our desktop to the S3 bucket, and start running our jobs from ADO.

Let's build some resources!

Local Terraform - S3, IAM, DynamoDB


Doing all this from the ground up is time consuming and complex! So I did that work for you, and created a cheat-sheet of Terraform to help you get started.

https://github.com/KyMidd/AzureDevOps_Terraform_AwsSeed

This GitHub repo contains a few files you can use to get a running start. Make sure to preserve the folder structure - the main.tf file finds the ado_seed module by its path.

Let's walk through what we're doing in the main.tf file. The first block of main.tf initializes terraform, and requires we use version 0.12.6 exactly. When you run terraform it'll tell you if your version is behind. Right now, 0.12.6 is the state of the art.
Then we define the provider - in this case, AWS. Change the region to whatever region you'd like. When we update this in the future for cloud hosting in ADO, we'll add a remote state location to this block. For now, though, we want to create resources in AWS from our computer, and store the tfstate locally. Then we call the ado_seed module and pass it some variables. This helps ADO name the resources specific to what you'd like. You'll also have an opportunity to look over the ado_seed module itself and see where that info is.
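
As a rough sketch of what those three pieces of main.tf can look like (this is not copied from the repo): the region, the module path, and both module input names below are assumptions for illustration, so check the variables the real ado_seed module declares.

```hcl
# Pin Terraform to the exact version this post uses.
terraform {
  required_version = "= 0.12.6"
}

# AWS provider. Credentials are read from the environment.
provider "aws" {
  region = "us-east-1" # assumed example region, change to your own
}

# Call the seed module by relative path, which is why the folder
# structure has to be preserved. The path is an assumption.
module "ado_seed" {
  source = "./ado_seed"

  # Hypothetical input names, used here only to show the idea of
  # passing naming info into the module.
  name_prefix = "mycompany"
  region      = "us-east-1"
}
```

Note there is no backend block yet, so the tfstate is written locally next to main.tf.
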
Let's pop into the ado_seed module and see what TF code we're running. First, we're building the S3 bucket. The name of the S3 bucket can be anything, but it has to be globally unique. Also, these S3 buckets are only usable for us in the same region as the environment, so it makes sense to include the region ID in the name for ease of use.

We're enabling strong encryption by default and versioning, which keeps a history as the state file changes. We're also setting the Terraform lifecycle attribute prevent_destroy, which means TF will error out rather than replace or destroy the resource. That's good news for us - we'd be in trouble if our state file got destroyed.
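
A minimal sketch of a bucket with those three settings, assuming the inline versioning and encryption blocks that AWS provider releases from the Terraform 0.12 era accept (later major provider versions moved these into separate resources). The bucket name and the AES256 algorithm are placeholders of mine, not values taken from the repo.

```hcl
resource "aws_s3_bucket" "terraform_state" {
  # Must be globally unique. Placeholder name with the region ID included.
  bucket = "mycompany-terraform-state-us-east-1"

  # Keep a history of every version of the state file.
  versioning {
    enabled = true
  }

  # Encrypt objects at rest by default. AES256 is an assumption;
  # the real module may use a KMS key instead.
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }

  # Make any plan that would destroy this bucket fail with an error.
  lifecycle {
    prevent_destroy = true
  }
}
```
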
Then we're going to build the DynamoDB. Terraform can consume this database to use it for state locking. Basically, when terraform is editing the state file in S3, it'll put an entry into the database here. When it's done, it removes the entry. As long as every TF session is configured to use the same database, the state locking mechanism works. The primary key for this DB is required to be LockID.
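
Sketched out, a lock table can be this small. The table name is a placeholder and on-demand billing is my own choice for a lab, so the real module may set provisioned capacity instead. The one fixed requirement is the string key named LockID.

```hcl
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock" # placeholder name
  billing_mode = "PAY_PER_REQUEST"      # assumption: on-demand billing

  # Terraform's S3 backend expects a partition key named exactly "LockID".
  hash_key = "LockID"

  attribute {
    name = "LockID"
    type = "S" # string
  }
}
```
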
Then we need to start on the IAM user, role, and policies. Bear with me, because the AWS implementation of permissions is incredibly verbose.

First, we need to create an IAM user. This user is what we generate secret credentials for, so that something else (here, ADO) can connect as it. The user itself doesn't contain permissions - there's no authorization, only authentication.
Then we create a policy for the IAM user. This is a list of the permissions we grant it. I've done something here for simplicity that isn't a good practice - note that the second rule in this policy grants our IAM user ALL rights to ALL resources. That's convenient, but if someone compromises this user, not great. It's a better idea to iterate through each permission your ADO service requires and grant it there.
Then we link the policy to the IAM user.
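
Here is a hedged sketch of the user, policy, and attachment. It deliberately shows the narrower approach rather than the allow-everything rule in the repo, and it only covers what the state backend itself needs (S3 object access plus DynamoDB lock entries). ADO would still need extra permissions for whatever resources your Terraform builds. All names and ARNs are placeholders.

```hcl
# The identity ADO authenticates as. No permissions on its own.
resource "aws_iam_user" "ado" {
  name = "AzureDevOpsTerraform" # placeholder name
}

# Permissions for reading/writing state and taking the lock.
resource "aws_iam_policy" "ado" {
  name = "AzureDevOpsTerraformState" # placeholder name

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::mycompany-terraform-state-us-east-1"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::mycompany-terraform-state-us-east-1/*"
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"],
      "Resource": "arn:aws:dynamodb:*:*:table/terraform-state-lock"
    }
  ]
}
EOF
}

# Link the policy to the user (authorization).
resource "aws_iam_user_policy_attachment" "ado" {
  user       = aws_iam_user.ado.name
  policy_arn = aws_iam_policy.ado.arn
}
```
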
Despite this step-by-step walkthrough I'd recommend copying the whole thing down to your computer to avoid syntax and spelling issues and go from there.

Local AWS Authentication, TF Apply


Now that we understand all the steps, let's authenticate our local comp to our AWS environment and build these items. Log into your AWS account and click on your org name, then on "My Security Credentials"


Click on "Create New Access Key" and then copy down the Access key ID and Secret access key that are displayed. This credential provides root-level access to your AWS account, so 100% do not share it. Copy down both values before closing this window - they won't be displayed again.


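Export that info to your terminal as environment variables: AWS_ACCESS_KEY_ID for the access key ID and AWS_SECRET_ACCESS_KEY for the secret access key. Terraform's AWS provider reads both automatically.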

Run "terraform init" and then "terraform apply" from your desktop in the directory where the main.tf is. Once you see the confirmation to create, type "yes" and hit enter. Terraform will report if there were any issues.


Now we have an S3 bucket for storage, a DynamoDB for locking, and an IAM user for authentication. Let's switch to Azure DevOps to move our Terraform jobs to the cloud!

State to S3, Create IAM Creds


Now that we have our environment in the state we want it, we need to make sure our cloud Terraform jobs know about the state of the environment as it exists right now. To do that, we'll need to upload our local terraform.tfstate file into the S3 bucket.

Head over to the S3 bucket and click on Upload in the top left. Find your terraform.tfstate file in the root of the location you ran your "terraform apply" in and upload it. All options can be left at their defaults.


Once that's done, we need to head into the IAM console to generate some secrets info for our new IAM user so we can provide it to ADO for authentication. Head over to the IAM console --> Users --> and find your user. Click it to jump into it.

Click on the "Security credentials" tab, then click on "Create access key" to generate an IAM secret.


This IAM secret will only be shown once, so don't close this window. Copy down the Access key ID and Secret access key. We'll use that information in the next section. 


Integrating Azure DevOps with AWS IAM


With that done, we're finally(!) ready to head over to Azure DevOps and add a service connection that utilizes this new IAM user and the secrets info we just created.

Drop into ADO --> your project --> Project Settings in the bottom left. Under pipelines, find "Service Connections". These service connections are useful in that they are able to store and manage the secrets and configuration required to authenticate to a cloud environment. Our terraform jobs will be able to consume this info and make our lives easier.

Click on "New service connection" in the top left of this panel and find the "AWS for Terraform" selection. If it's not listed there, head back up to the top of this blog and make sure to follow the steps under "Install the Microsoft DevLab Terraform Add-On Into Azure DevOps".


Fill in the information requested. The name is just a string, so name it whatever makes sense to you. The access key ID and secret access key are the information from the IAM user that we just generated. This ISN'T your own user's root access to the env. That will work, but isn't a best practice since the root user has unfettered access to the account, not the permissions we set in the IAM policy assigned to this user. Also fill in the region - these service connections (and S3 buckets, for what it's worth) are only valid in the region they are created for. So if you need to deploy this stuff in multiple regions, you're going to have multiple S3 buckets and multiple service connections, one for each region.


Update Code, then push to ADO repo


We're moving all our workflows into the Azure DevOps cloud, which means we need our Terraform code to live there also. The only change we have to make before pushing this code to our ADO repo is to add the "backend s3" block to our terraform config in the main.tf, like so: 
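
A minimal sketch of that block, assuming the placeholder bucket, table, and region names used for illustration here (swap in whatever you actually created). The key matches a state file uploaded to the root of the bucket.

```hcl
terraform {
  required_version = "= 0.12.6"

  # Remote state in S3, with locking through DynamoDB.
  backend "s3" {
    bucket         = "mycompany-terraform-state-us-east-1" # placeholder
    key            = "terraform.tfstate"                   # path of the state file in the bucket
    region         = "us-east-1"                           # placeholder
    dynamodb_table = "terraform-state-lock"                # placeholder
    encrypt        = true
  }
}
```

Backend blocks can't reference variables, so these values are written out literally.
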

Since we're just starting this repo, you'll probably push directly to master. For info on how to do that (or how to start up a branch in git and add your changes to it), refer to previous blogs.

I put mine in the folders terraform / terraform-aws / main.tf.


Pipeline Time!


Now all the pieces are in place, and we can get to actually building the pipeline and setting up each step we'll need to actually DO stuff! These are exciting times. Let's do it.

I'm assuming the build pipeline for the terraform repo is already complete. If it isn't, refer back to previous blogs on this site on how to build that.

Head into your project, then click on Pipelines --> Releases. Click on "New" in the top left, then on New release pipeline to build a new one.



ADO really wants to be helpful and can be somewhat confusing. Make sure to click "start with an empty job" to avoid the wizard.


In the Artifacts box on the left, click "Add an artifact" to pull up our build artifact selection wizard.


On the "select an artifact" screen, find your terraform build pipeline, then probably use the "Latest" Default version. Click Add to head back to our release pipeline automation screen. 


Click on "Add a stage" to add the new Terraform Plan stage. 


Call the stage AWS Terraform Plan or something similar. This stage will only do validations and planning - no changes will be executed. That'll help us confirm our stages and configuration are working correctly before we move on to executing changes. 


Click the plus (+) sign on the right side of the agent job to add a step and search for "terraform". Look for the "terraform tool installer". It'll handle installing the version of terraform you specify. 

Remember we've required terraform version 0.12.6 in our config files, so make sure to specify the right version here.


Click the plus (+) sign on the agent bar on the left again and again search for "terraform". Look for a step called just that - Terraform. There are several add-on modules that sound similar but have different capabilities, so look for one that looks like this picture. Click add to put this step into your workflow.


Change the provider to AWS, and set the TF command to "init". Also make sure to hit the 3 dots to the right of the "Configuration directory" to choose where the command should be executed - the location of your main.tf file.


Under the Amazon Web Services (AWS) backend configuration, find your Amazon Web Services connection - the service connection we built earlier that uses the IAM user. If nothing shows up, click the refresh button on the right or double-check you created the correct type of ADO service connection. Set the bucket name to the bucket you created from your desktop. Then set the "Key" to the path and name of your terraform.tfstate file in the S3 bucket. I just put mine in the root of the S3 bucket, so my key is simply "terraform.tfstate".

And boom, that's init. You can stop and test here, but I'd recommend adding a few more steps to make sure we're all good to go. We'll want to add a "terraform validate" and a "terraform plan" step to this stage. The easiest way is to right click on the "Terraform Init" step we just created and click "Clone task(s)".


Update the third step to "terraform validate" and the fourth to "terraform plan". Each step requires different information, but it's all information we've covered already. Once you've created the steps, it'll look like this: 


Once you feel good about it, click on "Save" in the top right, then click on "Create release", then "Create". Click on the release banner at the top to jump into the release logs. 


In my experience these freeze a lot, so be aware of the "refresh" at the top. If the "Terraform Plan" stage fails, click into the logs and you can check out why. Click on the "terraform plan" step to see the CLI results. Hopefully yours looks like this also, which means all things have gone well, and our ADO Terraform now has the same state file as we did locally and all things are working.


Profit!


There ya go, a functional Azure DevOps Terraform pipeline to build and manage your resources in AWS. Woot! 

Try building your own resources and see how things go! Try to tack on pull requests, validation of PRs pre-merge, and anything else, and report back the cool things you find! 

Good luck out there! 
kyler

Sunday, July 14, 2019

Network Engineering is Dying (Except at Cloud Providers)

Hey all!

This past week I spoke to a recruiter for one of the "gang of 4", the largest companies in tech. That term refers to Google, Amazon, Facebook, Apple (and sometimes Microsoft). The recruiter pitched me on a network engineering role - something that I've happily done for years now.

For the past 20+ years, network engineering teams from most companies have maintained the networks that connect computers which serve up every internet service we interact with each day. Network engineers make sure redundancies exist for the inevitable failures of a network that spans the globe, and they verify the health of all the hardware devices and interfaces which run the network.

A common analogy for network engineering is building the roads for the application "cars" to drive upon.

These jobs have been stable and profitable, integral to the growth and stability of any company that wants to use the internet to drive its business (read: all of them). Most would jump at the opportunity to take any job at these companies. These jobs sparkle on resumes, and even if the day-to-day is similar to most other jobs in the industry, the looming profile these companies have in the news cycle means it'd be foolish to write off an employment opportunity like this one.

However, the world of network engineering is changing. Many would say dying.


With the exception of maybe a dozen companies on the planet, nearly every company is moving away from physical data centers. IT orgs struggle with the long lead times required to make changes in physical data centers. Purchasing hardware, organizing cabling standards, cooling, 24x7 staffing, and dozens of other concerns are simply avoided by moving to the cloud.

Ironically, the only companies who aren't decreasing their data center footprint? Cloud providers. 


Because of the increased demand, cloud providers are growing their physical data centers at an incredible rate. This requires hiring network engineers, data center engineers, and others with the skillsets to grow them in a scalable way.

The gilded cage of skill-set lock-in
The problem, of course, is skillset lock-in. Not only do most of the gang of 4 famously build their own tooling, but their business model is shared by almost no other company on the globe - to build world-spanning data centers and massive internet-scale networks.

Only cloud providers still invest in physical data centers - and the skillsets required to run them.


Spending time at one of these companies in a department focused on these legacy networks is a career dead-end because of this skillset lock-in. It'll be difficult for the folks locked into these positions to leave the very small network of a dozen or so companies that provide these massive clouds and take a job just about anywhere else. Those other companies are looking for reliability engineers (SREs), DevOps engineers, and any number of other software-defined cloud computing experts, who need entirely different skillsets than those harbored within these divisions at the gang of 4.

If you have the opportunity to work in these divisions at cloud providers, good luck to you! Their famously great pay and benefits are nothing to scoff at. But I hope you consider my points above about career lock-in. Your career must be played as a strategic long-game, and I worry these jobs might be the wrong move.

Best of luck out there.
kyler