Kyler Middleton: iam

Thursday, August 29, 2019

Recursive Terraform with Terragrunt

Hey all!

Terraform is capable of remarkable things, not least of which is speaking API commands for many dozens of providers, which lets terraform configuration do amazing things. That isn't to say that terraform is perfect, or as flexible as we'd like it to be, which is why some free-lancers have built wrappers around terraform to add functionality.

One that has gained significant traction is Terragrunt. Terragrunt is a tool that permits terraform to run in parallel, on lots of different main.tf files, and to go even further towards a DRY (Don't Repeat Yourself) implementation with sym-links to shared files. The github page for the tool goes into greater depth and specificity than I'd be able to, so visit it if you're interested!

Terragrunt also permits something cool - to recursively dive into a folder structure and execute all the main.tf files that are found there. Terragrunt can also manage each of these files' state files and help store them in an appropriate back-end.

This recursive property lets you separate your resources out into n number of data stacks, all with separate state files. This is particularly useful within the context of a CI/CD that has to be hard-coded to execute one or a few commands, e.g. terraform against a single location. If your users are going to be adding main.tf files with new resources in new folders, you'll have to constantly be updating your CI/CD to point at those new main.tf files. Or you can use terragrunt and tell it to dive into a folder structure recursively and grab them all, each time it's run, automatically.

During this blog I'll walk through how to set up terragrunt and how to separate resources and reference them from different main.tf files, all in the context of (what I hope is) an interesting example - building public and private subnets, 2x servers, a load balancer, listener, target group, and all sorts of other cool stuff. At the end of this demo, you'll have an internet-facing Application Load Balancer in AWS that can accept incoming http connections and load-share them between the 2x ec2 servers we'll build.

AWS Bootstrap - S3, IAM

Before we can run terragrunt against an AWS environment, we need to add some resources, which would include an S3 bucket to store the remote state, as well as an IAM user with a policy that lets them do things.

First, log into your AWS account and in the top right, click on My Security Credentials. This is your root user, with unlimited abilities. We don't want to do a ton with this, but we'll use it to get started.

Then click on "Access keys" drop-down to expose your keys. If there aren't any there, that's fine.

Click on "Create New Access Keys", then on "show access keys". Copy those down, then export them (as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) into your terminal session. These exports will work on a linux or mac computer - you'll need to export in windows syntax for a windows session.

That will permit your terminal to authenticate to your AWS and do things.

Navigate to projects/ado_init on your local disk and update the ado_init_variables.tf file with custom names. The S3 name has to be globally unique, the rest are whatever names make sense to you.

Then run "terraform apply" against the main.tf there. It will build an S3 bucket and the IAM user with policies that you can assume.
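The ado_init project itself isn't reproduced in this post. As a rough sketch of what such a bootstrap could contain, assuming hypothetical variable and resource names (state_bucket_name, terragrunt_user_name) that are not taken from the actual repo, and leaving out the IAM policies:

```hcl
# Hypothetical variables; the real ado_init_variables.tf may use different names.
variable "state_bucket_name" {
  description = "Globally unique name for the S3 bucket that holds remote state"
  type        = string
}

variable "terragrunt_user_name" {
  description = "Name of the IAM user that terragrunt will authenticate as"
  type        = string
}

# Bucket that will store the terraform state files.
resource "aws_s3_bucket" "state" {
  bucket = var.state_bucket_name
}

# IAM user whose access keys get generated in the console in the next step.
resource "aws_iam_user" "terragrunt" {
  name = var.terragrunt_user_name
}
```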

In the IAM panel, click on Users on the left side and find the new IAM user that we built. Click on it to open it up, then click on the "Security credentials" tab. Click "create access key" to generate some CLI creds.

Now we're ready to get started in terragrunt, so let's dive in!

Terragrunt Properties

Terragrunt is written in the same language as Terraform - HCL, so it'll look very familiar. Here's an example of a "terragrunt.hcl" file that exists in the networking folder of my terraform project.
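The original listing isn't reproduced here. A minimal sketch of what a terragrunt.hcl for the networking folder could look like, assuming placeholder values for the bucket, key, region, and lock table (none of these are the names from the demo repo):

```hcl
# projects/networking/terragrunt.hcl (sketch; every value below is a placeholder)
remote_state {
  backend = "s3"

  config = {
    bucket         = "my-unique-terraform-state-bucket" # the bucket built in the bootstrap step
    key            = "networking/terraform.tfstate"     # path and filename of this project's state
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-lock-networking"               # table used for state locking
  }
}
```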

Notice that it looks almost exactly like how a remote state backend is written in terraform. And rather than having all this information in both places, terragrunt requires us to remove that information from the terraform main.tf file. Really all that's left in the terraform file is the terraform block with an empty backend declaration.
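With the backend settings moved into terragrunt.hcl, the terraform side could be reduced to something like the following. This is a sketch that assumes the S3 backend; the demo's main.tf may contain more than this:

```hcl
# main.tf (sketch)
terraform {
  # Left empty on purpose: terragrunt fills in bucket, key, region and the
  # lock table from terragrunt.hcl when it runs terraform init.
  backend "s3" {}
}
```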

Here's what my folder structure looks like. Notice that there are several different "projects", which are components of my environment, broken out into main.tf files. All are under the "projects" folder, so that's where we can run our terragrunt commands.

Each project component will have its own terragrunt.hcl file, but they'll vary slightly. The reason for this is that each one will maintain a separate state file. Here's the terragrunt.hcl file for the security_group project. Notice that the s3 bucket stays the same - we're putting all our remote state files into the same s3 bucket, but the "key" (folder path and filename) changes, as well as the dynamo-db table.
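To illustrate the difference described above, a sketch of a second project's file might look like this, again with placeholder names. Only the key and the lock table change, while the bucket stays the same:

```hcl
# projects/security_group/terragrunt.hcl (sketch; every value below is a placeholder)
remote_state {
  backend = "s3"

  config = {
    bucket         = "my-unique-terraform-state-bucket" # same bucket as every other project
    key            = "security_group/terraform.tfstate" # different key, so a separate state file
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-lock-security-group"           # different lock table
  }
}
```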

You'll need to sync down the Terragrunt git repo for this demo to your local comp.

Go through the various terragrunt.hcl files and update the S3 bucket to the one you created in the ado_init step earlier. You can also update the name of the dynamoDB table if you'd like - terragrunt will automatically create these for you if they don't exist yet.

Now it's time to run it!

Run Terragrunt

Navigate to the projects folder and run command "terragrunt apply-all". This command will tell terragrunt to recurse through the directories and execute the main.tf files in each directory that has a terragrunt.hcl file. It'll read the terragrunt.hcl file in each directory and grab (or push) the terraform state to that remote location.

You'll see log messages from each of the main.tf files as it goes, and it can sometimes be hard to tell where the log messages come from.

You may need to run the command a few times - there are inter-dependencies among the several files, and some files can't execute at all when their data sources reference something in a data stack that doesn't exist yet.
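One thing that may cut down on the re-runs: terragrunt supports a dependencies block that tells the -all commands which folders have to be applied first. Whether the demo repo uses it isn't stated, so treat this as a sketch, assuming the security_group project depends on the networking project:

```hcl
# Added to projects/security_group/terragrunt.hcl (sketch)
dependencies {
  # Apply the networking project before this one when running apply-all.
  paths = ["../networking"]
}
```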

Try it out and let me know what you think! Good luck out there.
kyler

Saturday, August 24, 2019

AWS IAM: Assuming an IAM role from an EC2 instance

tl;dr: A bash script (code provided) to assume an IAM role from an ec2 instance. Also provided is terraform code to build the IAM roles with proper linked permissions, which can be tricky.

I'm working through an interesting problem - syncing Azure DevOps to AWS, and making the connection functional, scalable, and simple. Sometimes, when designing anything, a path is followed that doesn't pan out. This is one of those paths, and I wanted to share some lessons learned and code that might help you if this path is a winner for you.

Our security model for EC2 requires that a machine assume a higher IAM role when it is required, but the rest of the time it has much lower permissions. That's a common use case, and a best practice.

Some applications support assuming a higher IAM role natively - I later learned, after pursuing this, that terraform is one of those applications (more details on that in a future blog). However, some applications can't, and require you to do the heavy lifting yourself.

IAM - a Sordid (and Ongoing) History

IAM (Identity and Access Management) is a complex beast that controls authentication (who are you?) and authorization (what are you allowed to do?). Because even simple things can be made complex with enough work, IAM supports recursive role assumptions, so a resource that starts with 1 set of credentials can assume a different (or more expansive) set of credentials during certain actions.

This has the benefit of being very flexible, and the detriment of allowing deployments so complex it can require a serious amount of Nancy Drew-ing to sort out what permissions something "really" has.

This complexity has led to a series of high profile security vulnerabilities introduced by a lack of understanding or a too-complex deployment in some of what are generally thought to be the most secure companies. The most recent high profile one was Capital One's hack by an ex-AWS employee. The ec2 IAM policies were written in such a way as to provide access to all s3 buckets, so once a single ec2 instance was compromised, all data everywhere was compromised. KrebsOnSecurity has a great write-up of the incident.

Definitions - Policies, Roles, and Trust Relationships, Oh My

So clearly, lack of understanding here can be a vulnerability all in itself, so let's break down what pieces comprise IAM.

Policies: Policies are a list of permissions that can be granted. They are not allowed to be assigned to resources themselves (to my knowledge). Rather, they are assigned to one or more roles, and the roles are assigned to or assumed by resources.
Roles: An IAM role is a bucket of permissions. The permissions it contains are not "within" the role, but rather are described in the IAM policies that are assigned to the role. These roles can be assigned to a resource (think ec2 resources being assigned a single ec2 role) or assumed by a resource or process.
Instance Profile: An IAM Instance Profile is a somewhat hidden feature of IAM roles. Instance Profiles are assigned 1:1 to an IAM Role, and when assigned, allow an ec2 instance to be assigned the role. To be even simpler: This stand-alone resource acts as a check box for an IAM role on whether it can be assumed by an ec2 instance or not.

Interesting tip: I say this resource type is somewhat hidden because when an IAM role is created in the GUI, an Instance Profile is automatically created and assigned. However, if you're building an IAM Role via command line or API call (think Terraform or CloudFormation), this resource isn't automatically created, and instead acts as a "gotcha".

Trust Relationship: An IAM Trust Relationship is a special policy attached to an IAM Role that controls who can assume the role. This is a key part of our IAM role assuming, and we'll walk through the different policies required on the implicit (assigned) IAM role for the ec2 instance vs the IAM role assumed by the instance.

The Implicit IAM Role

We'll build several IAM roles, with associated policies and trust relationships. First, let's build the Implicit IAM role. This role will be assigned directly to the ec2 instance, and is static.

Note that this role has an embedded IAM policy - this is our trust policy that permits the ec2 instance service to assume this role - this is required if any ec2 instance will be assigned the role.
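The role and its embedded trust policy might be sketched like this. The resource label and role name are hypothetical, and the linked repo is the source of truth for the author's actual code:

```hcl
resource "aws_iam_role" "implicit" {
  name = "Ec2ImplicitRole" # hypothetical name

  # Trust policy: lets the ec2 service assume this role on behalf of an instance.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Action    = "sts:AssumeRole"
        Principal = { Service = "ec2.amazonaws.com" }
      }
    ]
  })
}
```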

Next we'll create an IAM policy for this implicit role. The only permission we want this policy to contain is the ability to use the STS service to assume a specific IAM role. Otherwise, this ec2 instance should act as a normal virtual machine, and not be able to edit or control the AWS environment around it.
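A policy along those lines could look like the sketch below. The variable elevated_role_arn is a hypothetical input used to keep the example standalone; in a single configuration you would more likely reference the higher role's arn attribute directly:

```hcl
# Hypothetical input: ARN of the higher-permission role built later in this post.
variable "elevated_role_arn" {
  type = string
}

resource "aws_iam_policy" "implicit_assume" {
  name = "Ec2ImplicitAssumePolicy" # hypothetical name

  # The only permission granted: use STS to assume one specific role.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = "sts:AssumeRole"
        Resource = var.elevated_role_arn
      }
    ]
  })
}
```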

Then we link the two together - remember that roles and policies are not linked by default, and have to be assigned together.
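The link is a small resource of its own. A sketch, using hypothetical variables for the role name and policy ARN so that it stands alone (in one configuration these would normally be references to the role and policy resources, declared once):

```hcl
# Hypothetical inputs standing in for the role and policy built above.
variable "implicit_role_name" {
  type = string
}

variable "implicit_policy_arn" {
  type = string
}

# Attaches the policy to the role; without this the role grants nothing.
resource "aws_iam_role_policy_attachment" "implicit" {
  role       = var.implicit_role_name
  policy_arn = var.implicit_policy_arn
}
```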

And remember this implicit IAM role needs to be statically assigned to an ec2 instance, and that requires it to have an instance profile, so let's build that and assign to the IAM role.
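The instance profile is similarly short. In this sketch the profile name is a placeholder and the role name comes from a hypothetical variable, declared here only to keep the block standalone:

```hcl
# Hypothetical input standing in for the implicit role's name.
variable "implicit_role_name" {
  type = string
}

# The "check box" that lets the role be attached to an ec2 instance.
resource "aws_iam_instance_profile" "implicit" {
  name = "Ec2ImplicitProfile" # hypothetical name
  role = var.implicit_role_name
}
```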

Once this is all applied, it'll look like this:

And here's the trust policy under the "trust relationships" tab. You should see that the ec2 service is trusted to assume this role.

Now that we have an IAM role with a policy and a trust relationship to the ec2 service (and that gotcha of an instance profile), let's go assign it to an ec2 instance. I didn't include terraform code for this, so you'll build an ec2 instance by hand. Once ready, go into the instance settings, and click "Attach/Replace IAM Role".

Find the IAM role you want to associate with the ec2 instance (the implicit one we just built). If you don't see it, try the refresh icon next to the list, or go check to make sure the instance profile is built and associated with the IAM role properly.

Great, now we have an IAM role, assigned to an ec2 instance, that permits it to assume a higher permissions role. Which is all well and good, but we haven't built that higher permissions role yet, so let's do that.

More Permissions, Give Me More!

The whole point of this exercise is for the ec2 instance to be able to assume a set of more expansive permissions when it needs it, so we need to build a distinct IAM role to contain those permissions, a policy to describe what permissions we want to grant, and a trust relationship that allows the implicit (statically assigned) ec2 instance to assume the higher permissioned role.

First let's build the IAM role. The role parts are exactly the same, but notice the embedded IAM policy (the trust relationship) is entirely different. Rather than trusting the ec2 service to assume it, it's trusting the first IAM role only. This ensures that only a single specific IAM role can assume this upper IAM role. And the lower role is assigned only to a single ec2 instance (or more if you want), creating a limited chain of permissions that is very flexible to assign.
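A sketch of that higher role, where the trusted principal is the implicit role's ARN rather than the ec2 service. The role name and the implicit_role_arn variable are hypothetical:

```hcl
# Hypothetical input: ARN of the implicit role assigned to the ec2 instance.
variable "implicit_role_arn" {
  type = string
}

resource "aws_iam_role" "elevated" {
  name = "Ec2ElevatedRole" # hypothetical name

  # Trust policy: only the implicit role may assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Action    = "sts:AssumeRole"
        Principal = { AWS = var.implicit_role_arn }
      }
    ]
  })
}
```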

Now, let's build an IAM policy of permissions for this expansive role. The example here permits all actions to all services, which is NOT AT ALL a best practice. If at all possible make sure to limit your expansive IAM policies to much more specific actions to specific resources. The policy here should rarely be used.
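As an illustration of the narrower scoping recommended above (this is not the all-actions policy from the example), a sketch limited to reading one hypothetical S3 bucket might look like this:

```hcl
resource "aws_iam_policy" "elevated_scoped" {
  name = "Ec2ElevatedScopedPolicy" # hypothetical name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Listing applies to the bucket itself.
        Effect   = "Allow"
        Action   = "s3:ListBucket"
        Resource = "arn:aws:s3:::example-bucket" # placeholder bucket
      },
      {
        # Reading applies to the objects inside it.
        Effect   = "Allow"
        Action   = "s3:GetObject"
        Resource = "arn:aws:s3:::example-bucket/*"
      }
    ]
  })
}
```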

And you can probably guess what comes next - we need to link the IAM role to the IAM policy that we just built, which looks like this:

When you look at the new role in the AWS console, it'll look like this:

The trust relationship tab will look like this:

Let's Assume The Role, For Real Now

Now that everything is in place, we're ready to go onto the ec2 instance and assume the role. This involves running a bash script, which will do several things - clearing the variables in case values from a previous run are hanging around, figuring out the account ID by calling the AWS ec2 metadata service, figuring out the instance ID, and writing the information to a text file where bash can read it and set the global env variables.

Then we start the cool stuff. Amazon Linux AMIs on ec2 already contain the AWS CLI toolset. If your image doesn't have it, install it for this to work.

First we use the AWS CLI to assume our role, using both pieces of dynamic info we gathered earlier - the account number and the EC2 ID. These dynamic pieces permit this same script to be run in any account, and to set an IAM session name that is globally identifiable to this instance, for later CloudTrail-ing.

Then we use jq (a command-line JSON processor) to export the pieces we need to a file, then we call bash to read the file and set variables into the bash shell environment. Then we clean up by removing the STS creds from the disk.

Boom, your ec2 instance has now assumed a higher IAM role than the assigned one, and can do all sorts of stuff.

Wrap It All Up

The collected code for all these examples can be found here: https://github.com/KyMidd/AWS_EC2_IAM_Authentication

I'll continue to investigate how to use IAM roles in order to build a comprehensive terraform and Azure DevOps CI/CD, so these types of posts will continue. In the next one, I'll cover how Terraform can handle most of these items itself, so the bash script is not needed.

However, I hope this script and the coverage of IAM helps you in your non-Terraform requirements. Thanks all!

Good luck out there.
kyler