Skip to content

hardfinhq/terraform-aws-tailscale-subnet-router

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Terraform module for Tailscale subnet router in ECS Fargate

This module deploys a Tailscale subnet router as an AWS Fargate ECS task. The subnet router runs within an AWS VPC and advertises (to the Tailnet) the entire CIDR block for that VPC.

Docker Container

The _docker/tailscale.Dockerfile file extends the tailscale/tailscale image with an entrypoint script that starts the Tailscale daemon and runs tailscale up using an auth key and the relevant advertised CIDR block.

This Docker container must be built and pushed to an ECR repository.

docker build \
  --tag tailscale-subnet-router:v1.20230311.1 \
  --file ./_docker/tailscale.Dockerfile \
  .

# Optionally override the tag for the base `tailscale/tailscale` image
docker build \
  --build-arg TAILSCALE_TAG=v1.38.4 \
  --tag tailscale-subnet-router:v1.20230311.1 \
  --file ./_docker/tailscale.Dockerfile \
  .

Operator's Notes

  • The Tailscale state (/var/lib/tailscale) is stored in an EFS disk so that the subnet router only needs to be authorized once.
  • When deploying a new version, ECS will do a rolling update so two ECS tasks will be simultaneously claiming to be the same host. This conflict will eventually resolve itself some time after the older task exits, but may be confusing during the rollout.

Room for Improvement

Throughput

Right now this explicitly maps exactly one subnet router per VPC. As an organization grows, this can cause the subnet router to get saturated and cause a bottleneck. One of the perks of a mesh VPN is that bottlenecks via a centralized controller aren't possible, so reintroducing a bottleneck is unfortunate.

The best way to avoid this bottleneck is to not use a subnet router at all, but many engineering organizations can't (or don't want to) run Tailscale as a sidecar for all workloads. Assuming a subnet router will be used, there are a few ways bottlenecks can be mitigated:

  • Use smaller VPCs and utilize VPC peering as needed.
  • Use multiple subnet routers to cover one VPC. To enable this we could allow the CIDR range covered by the subnet router (via --advertise-routes) to be configurable.
  • Use subnet router failover for business users.
  • Use the subnet router only as a way to access jump / bastion hosts (with access limited via Tailscale network ACLs) and then rely on scaling jump hosts to increase throughput.

State

In the current form, this module uses AWS EFS to persist the Tailscale state in /var/lib/tailscale across deploys.

tailscaled --state arn:aws:ssm:zz-minotaur-7:123456789012:parameter/sandbox-tailscale

VPC

This module assumes a VPC Name is used, equivalent to:

data "aws_vpc" "sandbox" {
  tags = {
    Name = "sandbox"
  }
}

We'd be open to accepting a vpc_id directly.

Subnet group

The subnet_group variable is of note; it is used to filter subnets tagged with group={subnet_group}. This is a convention we use at Hardfin to group together subnets that are part of the same VPC (usually one subnet per AZ). In Terraform, this is determined via:

data "aws_subnets" "primary" {
  filter {
    name   = "vpc-id"
    values = ["vpc-51edfd86d3223cdff"]
  }
  tags = {
    group = "sandbox-igw-zz-minotaur-7"
  }
}

We'd be open to accepting an aws_subnet_ids list directly.