TDD for Infrastructure

Test Driven Development (TDD) is an important principle for producing quality software. It is not a new concept. The Extreme Programming (XP) agile methodology (1999) outlined the practice before the acronym became widely adopted: “Another requirement is testability. You must be able to create automated unit and functional tests… You may need to change your system design to be easier to test. Just remember, where there is a will there is a way to test.” [2] Another succinct description of why TDD is a common-sense approach: “This is opposed to software development that allows code to be added that is not proven to meet requirements.” [1]

Infrastructure setup is still software. All setup should have adequate testing to ensure that, at any time (not just during installation or configuration), any system is in a known state. While Configuration Management (CM) works toward the goal of convergence, i.e. ensuring a system is in a known state, testing should validate and identify any non-conformance, not attempt to correct it.

The Bash Automated Test System (BATS) [3][4] is a well-known testing framework for shell scripting, and it is very easy to use.
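
As a taste, here is a minimal standalone test file and its output (the file name and the command checked are illustrative):

$ cat example.bats
@test "mysql client command present" {
  run which mysql
  [ "${status}" -eq 0 ]
}

$ bats example.bats
 ✓ mysql client command present

1 test, 0 failures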

Good habits come from practicing them consistently. Even for a quick test of a running MySQL server via vagrant for a blog post, the automated installation includes validating this simple infrastructure setup with a bats test.

$ tail install.sh

...
sudo mysql -NBe "SHOW GRANTS"
systemctl status mysqld.service
ps -ef | grep mysqld
pidof mysqld
bats /vagrant/mysql8.bats

Rather than having output that requires a human to read and interpret each line and make a determination, automate it. A good result is:

$ vagrant up
...
    mysql8: ok 1 bats present
    mysql8: ok 2 rpm present
    mysql8: ok 3 openssl present
    mysql8: ok 4 mysql rpm install
    mysql8: ok 5 mysql server command present
    mysql8: ok 6 mysql client command present
    mysql8: ok 7 mysqld running
    mysql8: ok 8 automated mysql access 
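
Each of these checks is only a few lines of bats. A sketch of what the first tests might look like (the actual mysql8.bats may differ):

@test "bats present" {
  run which bats
  [ "${status}" -eq 0 ]
}

@test "rpm present" {
  run which rpm
  [ "${status}" -eq 0 ]
}

@test "mysqld running" {
  run pidof mysqld
  [ "${status}" -eq 0 ]
}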

An unsuccessful installation looks like:

$ vagrant provision
...
    mysql8: not ok 8 automated mysql access
    mysql8: # (in test file /vagrant/mysql8.bats, line 60)
    mysql8: #   `[ "${status}" -eq 0 ]' failed
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.

$ echo $?
1

This very simple testing, and the re-execution of the tests either via ssh or a re-provision, highlighted a bug in the installation script. Anyone who wishes to identify it, please reach out directly!

...
# Because openssl does not always give you a special character
NEWPASSWD="$(openssl rand -base64 24)+"
mysql -uroot -p${PASSWD} -e "ALTER USER USER() IDENTIFIED BY '${NEWPASSWD}'" --connect-expired-password
# TODO: create mylogin.cnf which is more obfuscated
echo "[mysql]
user=root
password='$NEWPASSWD'" | sudo tee -a /root/.my.cnf
sudo mysql -NBe "SHOW GRANTS"
systemctl status mysqld.service
ps -ef | grep mysqld
pidof mysqld
bats /vagrant/mysql8.bats
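
Re-running just the tests, without waiting for a full re-provision, is a one-liner from the host; a vagrant provision re-executes the entire installation script, including the final bats call:

$ vagrant ssh -c "bats /vagrant/mysql8.bats"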

A simple trick with a BATS test is to echo any output that will help debug a failing test. If the test succeeds, no output is given; if it fails, you get the information for free. For example, let's say your test is:

# Note: additional security to both access the server via ssh
#       and accessing sudo should be in place for production systems
@test "automated mysql access" {
  local EXPECTED="${USER}@localhost"
  run sudo mysql -NBe "SELECT USER()"
  [ "${status}" -eq 0 ]
  [ "${output}" = "${EXPECTED}" ]
}

Execution will only provide:

 ✗ automated mysql access
   (in test file /vagrant/mysql8.bats, line 62)
     `[ "${output}" = "${EXPECTED}" ]' failed

What you want to see to more easily identify the problem is:

 ✗ automated mysql access
   (in test file /vagrant/mysql8.bats, line 62)
     `[ "${output}" = "${EXPECTED}" ]' failed
   root@localhost != vagrant@localhost

This echo enables quicker diagnosis and correction of the failing test.

...
  [ "${status}" -eq 0 ]
  echo "${output} != ${EXPECTED}"
  [ "${output}" = "${EXPECTED}" ]
...
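
Putting the pieces together, the complete test with the debugging echo becomes:

@test "automated mysql access" {
  local EXPECTED="${USER}@localhost"
  run sudo mysql -NBe "SELECT USER()"
  [ "${status}" -eq 0 ]
  echo "${output} != ${EXPECTED}"
  [ "${output}" = "${EXPECTED}" ]
}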

Testing is only as good as the boundary conditions put in place. Here is an example: your code used a number of environment variables, and your testing process checked that these variables existed.

@test "EXAMPLE_VAR is defined ${EXAMPLE_VAR}" {
  [ -n "${EXAMPLE_VAR}" ]
}

The code was subsequently refactored and the environment variable was removed. Do you remove the test that checks for its existence? No. You should now ensure the variable is not set, so that any code, now or in the future, acts as desired.

@test "EXAMPLE_VAR is NOT defined" {
  [ -z "${EXAMPLE_VAR}" ]
}

References:
[1] https://en.wikipedia.org/wiki/Test-driven_development
[2] http://www.extremeprogramming.org/when.html
[3] https://github.com/sstephenson/bats
[4] https://github.com/bats-core/bats-core

Enforcing a least-privilege security model can be hard

In a greenfield environment you generally have the luxury of righting the wrongs of past tech debt. It can be more difficult to apply this to an existing environment. For example, my setup is configured to just work with the AWS CLI, with various litmus tests to validate that. Generally, instructions would include: validate your AWS access. This can be as simple as:

$ aws ec2 describe-regions
$ aws ec2 describe-availability-zones --profile oh

As part of documenting some upcoming Athena/Hadoop/Pig/RDBMS posts, I decided it was important to separate out the AWS IAM privileges with a new user and permission policies. This introduced a number of steps that simply did not work. Creating a new AWS IAM user is not complex; validating console and API access for that user required some revised setup.

$ aws ec2 describe-regions

An error occurred (AuthFailure) when calling the DescribeRegions operation: AWS was not able to validate the provided access credentials

In order to use the CLI you require your aws_access_key_id and aws_secret_access_key information, as well as aws_session_token if used. These typically live in ~/.aws/credentials; a minimal sketch with placeholder values is:
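
$ cat ~/.aws/credentials
[default]
aws_access_key_id = [access key id]
aws_secret_access_key = [secret access key]

In order for a new individual user to obtain these keys, you also need a number of policy rules, including the ability to ListAccessKeys, CreateAccessKey and potentially DeleteAccessKey: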

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:DeleteAccessKey",
                "iam:CreateAccessKey",
                "iam:ListAccessKeys"
            ],
            "Resource": "arn:aws:iam::[account]:user/[username]"
        }
    ]
}
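
With the policy above attached, the new user can mint and rotate their own credentials, for example:

$ aws iam create-access-key --user-name [username]
$ aws iam list-access-keys --user-name [username]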

In this example, we also tighten the least-privilege model with a specific user resource ARN. For a single user account that is workable; for a large organization it would not be.
This gives the ability to configure your AWS CLI via the typical ~/.aws/credentials and/or ~/.aws/config settings. Performing the litmus test now gives:

$ aws ec2 describe-regions

An error occurred (UnauthorizedOperation) when calling the DescribeRegions operation: You are not authorized to perform this operation.

This requires a policy of:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeRegions"
            ],
            "Resource": "*"
        }
    ]
}

$ aws ec2 describe-regions | jq '.Regions[0]'
{
  "Endpoint": "ec2.eu-north-1.amazonaws.com",
  "RegionName": "eu-north-1",
  "OptInStatus": "opt-in-not-required"
}


$ aws ec2 describe-availability-zones --filter "Name=region-name,Values=us-east-1" | jq -r '.AvailabilityZones[].ZoneName'

us-east-1a
us-east-1b
us-east-1c
us-east-1d
us-east-1e
us-east-1f

However, this may be too restrictive for a larger organization. The EC2 access level for ‘list’ currently includes over 120 individual permissions. A more open policy could be:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*"
            ],
            "Resource": "*"
        }
    ]
}

However, this does not provide all of the EC2 ‘list’ actions, e.g. ExportClientVpnClientConfiguration, and it includes several ‘read’ actions, e.g. DescribeVolumesModifications.
Selecting the ‘list’ tickbox via the GUI will add each action by name individually to the policy action list (currently 117); however, this is not forward compatible with any future actions added to the ‘list’ access level.
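
One way to check whether a given action is permitted, without trial and error at the CLI, is the IAM policy simulator (placeholders as above; the calling user also needs iam:SimulatePrincipalPolicy):

$ aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::[account]:user/[username] \
    --action-names ec2:DescribeRegions \
  | jq -r '.EvaluationResults[].EvalDecision'
allowed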

And this is all before the exercise of starting to grant access to a new AWS service, Athena, and its data source S3.