ECS Golang REST API
Background
This project involves deploying a simple Golang REST API into AWS, using managed services like ECS (with the Fargate launch type) and RDS.
After completing the project, I will have a highly available, scalable, and containerized REST API available on the public internet.
Additionally, I will have a pipeline that allows for simple blue/green updates to the containerized application in order to support future development efforts.
Implementation
Code repository: https://github.com/mehlj/ecs-golang-api
The project consists of the application code (Golang) that comprises the API, and the declarative IaC (Terraform) that defines the infrastructure needed to support it.
All AWS infrastructure is provisioned and managed via Terraform.
Terraform remote state is hosted in S3, with DynamoDB state locks.
All code is source controlled in GitHub repositories. The few secrets involved are restricted to AWS Secrets Manager.
GitHub Actions handles deployment of the application and the dependent AWS infrastructure.
Application
The REST API is written entirely in Go.
It interacts with a Postgres backend, configured via a standard DSN/connection string environment variable, PG_DSN.
On startup, it idempotently configures the database with the single table the application expects.
Interaction with the API revolves around simple CRUD operations:
CREATE: POST /product
READ: GET /products or GET /product?name=object
UPDATE: PUT /product
DELETE: DELETE /product
The API uses the Go libraries net/http and github.com/gorilla/mux to serve HTTP:
// URL : /product?name=apple
func QueryProduct(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")

    k := r.URL.Query().Get("name")
    p := QueryRow(k)

    json.NewEncoder(w).Encode(p)
}
<snip>
r := mux.NewRouter().StrictSlash(true)
r.HandleFunc("/", DefaultHandler)
r.HandleFunc("/product", QueryProduct).Methods("GET")
<snip>
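For context, a minimal sketch of what the corresponding POST handler might look like is shown below. The Product struct and the InsertRow helper are assumptions inferred from the JSON bodies in the Interaction section and the products table schema; the actual repository code may differ.
// Assumed product shape, inferred from the JSON seen later ({"Name":"ball","quantity":4})
// and the products(name TEXT, quantity INT) table.
type Product struct {
    Name     string `json:"Name"`
    Quantity int    `json:"quantity"`
}

// URL : /product (POST) -- hypothetical sketch, not the repository's exact code
func AddProduct(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")

    var p Product
    if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    InsertRow(p) // hypothetical SQL helper, analogous to RemoveRow below
    json.NewEncoder(w).Encode(p)
}

// registered alongside the other routes:
// r.HandleFunc("/product", AddProduct).Methods("POST")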
SQL interaction uses prepared statements and leverages the libraries database/sql and github.com/lib/pq:
<snip>
func RemoveRow(product Product) {
    // open connection
    db, err := sql.Open("postgres", os.Getenv("PG_DSN"))
    checkSQLError(err)

    // delete product
    _, err = db.Exec("DELETE FROM products WHERE name=$1", product.Name)
    checkSQLError(err)
}

func checkSQLError(err error) {
    if err != nil {
        panic(err)
    }
}
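The QueryRow helper referenced in QueryProduct above is elided in this post; below is a hedged sketch of how it could look, reusing the Product struct and checkSQLError helper. The real implementation may differ.
// Hypothetical sketch of the QueryRow helper used by QueryProduct.
func QueryRow(name string) Product {
    // open connection
    db, err := sql.Open("postgres", os.Getenv("PG_DSN"))
    checkSQLError(err)
    defer db.Close()

    // select the matching product using a Postgres-style $1 placeholder
    var p Product
    err = db.QueryRow("SELECT name, quantity FROM products WHERE name=$1", name).Scan(&p.Name, &p.Quantity)
    if err != nil && err != sql.ErrNoRows {
        checkSQLError(err)
    }
    return p
}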
The application is packaged into a Docker image using a simple Dockerfile at the root of the repository:
FROM golang:1.15-buster
WORKDIR /opt/mehlj-pipeline/api/
COPY api/* ./
RUN go mod download
RUN go build main.go sql.go
CMD ["./main"]
Infrastructure
The AWS infrastructure that supports this application stack mainly consists of ECR, ECS, RDS, a VPC carved into private/public subnets, and an ALB to handle ingress to the application.
A GitHub Actions pipeline provisions infrastructure (declared by Terraform IaC) and deploys the application in the following general order.
Container Image Repository
Before the Docker image of the API is built, a repository/registry for the image must exist.
Since we are on AWS, usage of ECR makes sense, as it is a managed service that integrates nicely with security groups and IAM.
We can define our ECR repository using the aws_ecr_repository resource:
resource "aws_ecr_repository" "ecr" {
  name = "mehlj-pipeline"
}
Along with the repository policy, lifecycle policy, and IAM policy document necessary to access and configure the repository:
data "aws_iam_policy_document" "ecrpolicy" {
  <snip>
}

resource "aws_ecr_repository_policy" "ecrpolicy" {
  repository = aws_ecr_repository.ecr.name
  policy     = data.aws_iam_policy_document.ecrpolicy.json
}

resource "aws_ecr_lifecycle_policy" "repositoryPolicy" {
  <snip>
}
A Terraform output is defined so that, once the repository is provisioned, the unique repository URL needed to push images is captured and easy to retrieve.
output "repository_url" {
  description = "URL for the ECR repository."
  value       = aws_ecr_repository.ecr.repository_url
}
Database
The API uses Postgres-style prepared statements, so we will provision an RDS instance with the Postgres engine.
We can provision a database with the aws_db_instance resource:
<snip>
resource "aws_db_instance" "mehlj-pipeline" {
  identifier             = "mehlj-pipeline"
  db_name                = "ecspoc"
  instance_class         = "db.t3.micro"
  allocated_storage      = 5
  engine                 = "postgres"
  engine_version         = "14.1"
  username               = "dbuser"
  password               = jsondecode(data.aws_secretsmanager_secret_version.current.secret_string)["vault"]
  db_subnet_group_name   = aws_db_subnet_group.mehlj-pipeline.name
  vpc_security_group_ids = [aws_security_group.rds.id]
  parameter_group_name   = aws_db_parameter_group.mehlj-pipeline.name
  skip_final_snapshot    = true
}
Again, we define outputs that capture the instance-specific values needed to connect to our database later:
output "rds_hostname" {
  description = "RDS instance hostname"
  value       = aws_db_instance.mehlj-pipeline.address
  sensitive   = true
}

output "rds_port" {
  description = "RDS instance port"
  value       = aws_db_instance.mehlj-pipeline.port
  sensitive   = true
}

output "rds_username" {
  description = "RDS instance root username"
  value       = aws_db_instance.mehlj-pipeline.username
  sensitive   = true
}
Build
To build our container image, we simply need to read our Dockerfile and run a standard docker build and docker push workflow.
Since we are pushing to an ECR registry, we need to ensure we are logged into ECR first.
Amazon provides an Action for this common task, which we can leverage in our pipeline:
- name: Login to Amazon ECR
  uses: aws-actions/amazon-ecr-login@v1
After login, we can build and push our image with the commit SHA as the image tag:
<snip>
outputs:
  FULL_IMAGE_TAG: $
<snip>
- name: Build, tag, and push the image to Amazon ECR
  id: get_image_tag
  env:
    ECR_REPOSITORY: $
    IMAGE_TAG: $
  run: |
    docker build -t $ECR_REPOSITORY:$IMAGE_TAG .
    docker push $ECR_REPOSITORY:$IMAGE_TAG
    echo "FULL_IMAGE_TAG=$ECR_REPOSITORY:$IMAGE_TAG" >> "$GITHUB_OUTPUT"
Using Job Outputs, we can pass the full image tag to a later stage of the pipeline: namely, the deployment stage, which needs it in order to deploy that container image.
Deploy
This stage is responsible for provisioning all other resources that the application depends on. Namely, the VPC, subnets, the ECS cluster/service/task definition, and the frontend ALB.
VPC/Subnets
To simulate a production environment, and better protect our AWS resources, a VPC with multiple subnets and varying levels of accessibility was provisioned.
A large private CIDR block was allocated, with the intent of carving it up into subnets:
resource "aws_vpc" "ecs_vpc" {
  cidr_block = "10.128.0.0/16"
}
One group of subnets within this VPC must be publicly accessible, since the application itself will be publicly available.
These subnets will house the publicly-facing ALB and the NAT Gateways that provide outbound internet access for the private subnets.
data "aws_availability_zones" "available_zones" {
  state = "available"
}
<snip>
resource "aws_subnet" "public" {
  count             = 2
  cidr_block        = cidrsubnet(aws_vpc.ecs_vpc.cidr_block, 8, 2 + count.index)
  vpc_id            = aws_vpc.ecs_vpc.id
  availability_zone = data.aws_availability_zones.available_zones.names[count.index]

  # instances launched in subnet get assigned a public IP
  map_public_ip_on_launch = true
}
The other group of subnets should not be publicly exposed.
These subnets will house everything else, including the ECS tasks and RDS.
resource "aws_subnet" "private" {
  count             = 2
  cidr_block        = cidrsubnet(aws_vpc.ecs_vpc.cidr_block, 8, count.index)
  vpc_id            = aws_vpc.ecs_vpc.id
  availability_zone = data.aws_availability_zones.available_zones.names[count.index]
}
ECS
First, the ECS cluster must be provisioned:
resource "aws_ecs_cluster" "ecs_cluster" {
  name = "poc-cluster"
}
Next, we must create a JSON “task definition” that ECS will use to define container specifications, such as CPU, memory, the requested image, environment variables, etc.
We will leverage Fargate to limit the toil involved in managing dedicated ECS cluster members, and again, to simulate a production environment.
resource "aws_ecs_task_definition" "taskdef" {
  family       = "mehlj-pipeline"
  network_mode = "awsvpc"

  # EC2 backing vs fargate
  requires_compatibilities = ["FARGATE"]
  cpu                      = 1024
  memory                   = 2048

  # Summary: Execution Roles grant the ECS agents permission to make AWS API calls.
  # I.e.: The task is able to send container logs to CloudWatch or pull an image from ECR.
  # -----
  # allows ECS agent to pull image from ECR
  execution_role_arn = aws_iam_role.ecsTaskExecution_role.arn

  container_definitions = jsonencode([
    {
      name      = "mehlj-pipeline"
      image     = var.image_tag
      cpu       = 1024
      memory    = 2048
      essential = true

      environment = [
        {
          name  = "PG_DSN"
          value = var.pg_dsn
        }
      ]

      network_mode = "awsvpc"

      portMappings = [
        {
          containerPort = 80
          hostPort      = 80
        }
      ]

      # needs IAM as well
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-group         = "ecs-poc"
          awslogs-region        = "us-east-1"
          awslogs-create-group  = "true"
          awslogs-stream-prefix = "mehlj-pipeline"
        }
      }
    },
  ])
}
The cluster and the container specifications are defined, but we still need to tell ECS how to deploy those containers into the cluster.
There are a few ways to do this, but for long-lived servers, the aws_ecs_service resource is generally the best fit.
Defining an ECS Service tells ECS to keep that set of containers alive and healthy at all times, at the desired_count we specify.
resource "aws_ecs_service" "service" {
  name            = "mehlj-pipeline"
  cluster         = aws_ecs_cluster.ecs_cluster.id
  task_definition = aws_ecs_task_definition.taskdef.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    security_groups = [aws_security_group.task_sg.id]
    subnets         = aws_subnet.private.*.id
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.asg-tg.arn
    container_name   = "mehlj-pipeline"
    container_port   = 80
  }

  depends_on = [aws_lb_listener.ecs-lb-list]
}
ALB
The containers and underlying infrastructure are defined, but the ECS tasks are still provisioned into the private subnets.
Rather than exposing the containers directly, we can add an abstraction layer that also brings industry-standard load balancing features to the stack.
That layer is an Application Load Balancer. The ALB will be provisioned in the public subnets, and security groups/IAM will be configured to allow communication to the protected ECS tasks.
We can define the load balancer using the aws_lb resource:
resource "aws_lb" "ecs-lb" {
  name                       = "ecs-lb"
  subnets                    = aws_subnet.public.*.id
  security_groups            = [aws_security_group.ecs_alb_sg.id]
  enable_deletion_protection = false
}
We can then configure our load balancer to listen on port 80 and use HTTP:
resource "aws_lb_target_group" "asg-tg" {
  name        = "asg-tg"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = aws_vpc.ecs_vpc.id
  target_type = "ip"
}

resource "aws_lb_listener" "ecs-lb-list" {
  load_balancer_arn = aws_lb.ecs-lb.arn
  port              = "80"
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.asg-tg.arn
  }
}
Note: the integration of this load balancer with our ECS service is done in earlier code:
resource "aws_ecs_service" "service" {
  <snip>
  load_balancer {
    target_group_arn = aws_lb_target_group.asg-tg.arn
    container_name   = "mehlj-pipeline"
    container_port   = 80
  }
  <snip>
}
Once again, we define a Terraform output to capture the unique ALB DNS name that browsers need in order to reach our public application.
output "alb_domain_name" {
  value = aws_lb.ecs-lb.dns_name
}
Validation
This last pipeline stage simply runs curl against the public ALB URL until the application starts responding with HTTP 200.
This gives quick feedback on whether the previous stages succeeded and whether the application is responding properly.
validate:
  name: Ensure app is responding publicly
  runs-on: ubuntu-22.04
  needs: deploy
  steps:
    - name: cURL for 5 minutes until errors out
      env:
        ALB_URL: $
      run: curl --head -X GET -v --retry 20 --retry-all-errors --retry-max-time 300 $ALB_URL
After all pipeline stages run, the API container image will be built, all infrastructure supporting the application will be provisioned, and the API will be deployed into said infrastructure.
Interaction
# Create
[justen.mehl@workstation ~]$ curl ecs-lb-350925889.us-east-1.elb.amazonaws.com
Hello World!!
[justen.mehl@workstation ~]$ curl -d '{"Name":"ball","quantity":4}' -X POST ecs-lb-350925889.us-east-1.elb.amazonaws.com/product
{"Name":"ball","quantity":4}
# Read
[justen.mehl@workstation ~]$ curl -d '{"Name":"oranges","quantity":2}' -X POST ecs-lb-350925889.us-east-1.elb.amazonaws.com/product
{"Name":"oranges","quantity":2}
[justen.mehl@workstation ~]$ curl ecs-lb-350925889.us-east-1.elb.amazonaws.com/product?name=oranges
{"Name":"oranges","quantity":2}
# Update
[justen.mehl@workstation ~]$ curl -d '{"Name":"ball","quantity":5}' -X PUT ecs-lb-350925889.us-east-1.elb.amazonaws.com/product
{"Name":"ball","quantity":5}
# Delete
[justen.mehl@workstation ~]$ curl -d '{"Name":"ball","quantity":5}' -X DELETE ecs-lb-350925889.us-east-1.elb.amazonaws.com/product
{"Name":"ball","quantity":5}
Lessons Learned
RDS Table Configuration - Best Practices
I reached a certain point where I had provisioned an RDS database with a Postgres engine, but I needed to create a table with two columns that my application expected.
I provisioned the database itself with Terraform, so my natural inclination was to configure the table with Terraform as well.
After more research, it turned out that workflow is an anti-pattern; application-specific configuration such as that table should be handled outside of Terraform.
At that point, we enter configuration management territory, so Ansible was an option. AWS Lambda was another option if I wanted to stay strictly within AWS tooling.
Ultimately, I opted to add the table configuration responsibility to my Go API itself:
func CreateTable() {
    // open connection
    db, err := sql.Open("postgres", os.Getenv("PG_DSN"))
    checkSQLError(err)

    // format table properly
    _, err = db.Exec("CREATE TABLE IF NOT EXISTS products(name TEXT, quantity INT)")
    checkSQLError(err)
}
<snip>
func main() {
    // prep database
    CreateTable()

    r := mux.NewRouter().StrictSlash(true)
    r.HandleFunc("/", DefaultHandler)
    <snip>
}
The API containers already had firewall rules and IAM permissions in place to interact with RDS, thus making this a simple option.
Using IF NOT EXISTS allowed me to make this operation idempotent, which was my main concern.
Before serving the HTTP API, the application ensures that the table format is in place.
Docker Image Tagging Strategy
I ultimately wanted a pipeline that, upon committing ANY changes to my API code, would rebuild the application and properly roll out that change to the ECS cluster.
My first hiccup with this was when I began writing my ECS task definition and had to define my image tag:
<snip>
container_definitions = jsonencode([
  {
    name  = "mehlj-pipeline"
    image = ?
<snip>
I wanted this image tag to change after every commit. How could I hardcode this value and still have it update every time?
My first thought was to hardcode this line to use latest, and in my pipeline, always tag my image latest and push it up. That way, my task definition code stays the same, but I'm always receiving the latest image in the cluster.
<snip>
container_definitions = jsonencode([
  {
    name  = "mehlj-pipeline"
    image = <ECR_URL>:latest
<snip>
This functionally worked okay, but when I first pushed up minor changes to the API code, the containers in the cluster did not change at all, despite the pipeline running successfully.
It turns out that Terraform cannot tell when the image behind the latest tag has changed, since the task definition itself never changes. I could have configured the pipeline to force a new deployment to the cluster, similar to kubectl rollout, but that seemed unclean.
My final solution was to use a Terraform variable as the value for image in the task definition.
variable "image_tag" {
  description = "Container image tag, corresponds to commit SHA"
  type        = string
  default     = "placeholder"
}
<snip>
container_definitions = jsonencode([
  {
    name  = "mehlj-pipeline"
    image = var.image_tag
<snip>
This value can change on every pipeline run, so long as I provide that image tag value through an environment variable called TF_VAR_image_tag.
Before I provision the resource that depends on that Terraform variable, I must supply the environment variable using a Job Output from the earlier Build stage.
<snip>
- run: terraform apply -auto-approve
  env:
    TF_VAR_image_tag: $
<snip>
Ultimately, the full image tag corresponds to the unique SHA of the commit, so the ECS task definition is always updated upon a commit.
And when a task definition is updated (new revision), the ECS service handles proper blue/green deployment of the change to the cluster.
Parameter Placeholders (Postgres vs SQLite)
Before this project was re-engineered to fit a cloud deployment, it was deployed into a lab Kubernetes environment and used a SQLite database mounted locally on a Volume.
Originally, all of my SQL prepared statements used ? as the parameter placeholder value.
stmt, err := db.Prepare("INSERT INTO products(name, quantity) values(?,?)")
This works for SQLite, as well as MySQL (and some others), but evidently, Postgres uses another value. It requires $1, and for multiple parameters, $1, $2, and so on.
After reworking my statements to fit this syntax, my SQL statements functioned properly.
stmt, err := db.Prepare("INSERT INTO products(name, quantity) values($1,$2)")
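For completeness, here is a hedged sketch of how such a prepared statement could be executed end to end, reusing the Product struct and checkSQLError helper from earlier; the function name AddRow is hypothetical.
// Hypothetical sketch of a full insert using a prepared statement.
func AddRow(product Product) {
    // open connection
    db, err := sql.Open("postgres", os.Getenv("PG_DSN"))
    checkSQLError(err)
    defer db.Close()

    // prepare the insert with Postgres-style positional placeholders
    stmt, err := db.Prepare("INSERT INTO products(name, quantity) values($1,$2)")
    checkSQLError(err)
    defer stmt.Close()

    // bind the parameters and execute
    _, err = stmt.Exec(product.Name, product.Quantity)
    checkSQLError(err)
}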
GitHub Actions - Job Outputs and Secrets
At one point during development, I was building my Postgres DSN/connection string in one Job (the RDS deployment job) and trying to pass that value as an Output to a later Job (the deployment job).
The problem was that the connection string contains a sensitive combination of the RDS URL and the database credentials.
I missed the section of the GitHub Actions documentation that stated this:
Outputs containing secrets are redacted on the runner and not sent to GitHub Actions.
I would run my pipeline, and everything looked successful, but my ECS tasks would fail because the environment variable TF_VAR_pg_dsn was empty, when it should have contained the Postgres DSN.
After a lot of debugging, I finally found a small warning in my pipeline:
Warning: Skip output 'TF_VAR_pg_dsn' since it may contain secret.
GitHub Actions performs secret detection, and if it detects any in an Output, it will silently fail the passing of said output. Understandably so, since passing information between Jobs involves sending/storing the data on GitHub's infrastructure, albeit temporarily.
The fix to this problem was to properly use a secrets manager, in this case AWS Secrets Manager. And since that secret is available to any job, I don't have to pass the output between jobs; I can simply build the DSN right before running terraform apply.
- name: Read database secrets from AWS Secrets Manager into environment variables
  uses: abhilash1in/aws-secrets-manager-action@v2.1.0
  with:
    secrets: mehlj_lab_creds
    parse-json: true
- run: terraform init
- name: Build Postgres DSN string
  run: echo TF_VAR_pg_dsn="postgres://$(terraform output -raw rds_username):${MEHLJ_LAB_CREDS_VAULT}@$(terraform output -raw rds_hostname):5432/ecspoc" >> $GITHUB_ENV
- run: terraform fmt -check
- run: terraform apply -auto-approve
$GITHUB_ENV stores temporary, job-local environment variables, and has no problem storing secrets, since they are lost after the job ends.
The ultimate best practice is to not inject this DSN into the ECS tasks as an environment variable at all, and instead store that information entirely in a secrets manager and have the ECS tasks read it directly from said manager.
My first look at that approach revealed a challenge: the DSN includes a unique Terraform output, and I don't know its value until that pipeline Job executes. There may have been a way to programmatically generate and store the secret, but my research stopped there. The current solution seemed suitable for an educational project such as this.
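As a rough illustration of that direction (not something this project implements), the Go application itself could fetch the DSN from Secrets Manager at startup using the AWS SDK for Go v2, instead of reading PG_DSN from the environment. The secret name and function below are hypothetical.
// Requires the AWS SDK for Go v2 modules:
//   github.com/aws/aws-sdk-go-v2/aws
//   github.com/aws/aws-sdk-go-v2/config
//   github.com/aws/aws-sdk-go-v2/service/secretsmanager
//
// fetchDSN reads a pre-built Postgres DSN from AWS Secrets Manager.
// The secret name "mehlj_lab_pg_dsn" is hypothetical. In main(), the
// returned DSN would be passed to sql.Open("postgres", dsn) instead of
// being read from the PG_DSN environment variable.
func fetchDSN(ctx context.Context) (string, error) {
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        return "", err
    }

    client := secretsmanager.NewFromConfig(cfg)
    out, err := client.GetSecretValue(ctx, &secretsmanager.GetSecretValueInput{
        SecretId: aws.String("mehlj_lab_pg_dsn"),
    })
    if err != nil {
        return "", err
    }

    return aws.ToString(out.SecretString), nil
}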