
Thursday, 23 October 2025

Serverless Containers: Deploying with AWS Fargate and ECS (2025 Complete Guide)


Serverless Containers: Deploying with AWS Fargate and ECS

[Figure: AWS Fargate and ECS serverless containers architecture, showing container orchestration without EC2 instances]

In 2025, serverless containers have become a popular way to deploy modern applications, combining the flexibility of containers with the operational simplicity of serverless computing. With AWS Fargate as the compute engine for ECS, teams run containers without provisioning or patching the servers underneath. This guide covers Fargate patterns, cost optimization strategies, and implementation techniques using Terraform. Whether you're migrating from EC2 or building greenfield applications, the configurations below are meant as a practical starting point.

🚀 Why Serverless Containers Dominate in 2025

The container ecosystem has matured, and serverless options are now a common choice for production workloads. Fargate removes the undifferentiated heavy lifting of cluster management while keeping the security and scaling features of ECS. Here's why organizations adopt this architecture:

  • Zero Infrastructure Management: No EC2 instances to patch, scale, or secure, so teams can focus on the application
  • Enhanced Security: Each task runs in its own isolated environment with its own IAM task role
  • Cost Optimization: Pay per second for the vCPU and memory each task requests, with no idle instances to pay for
  • Rapid Scaling: Scale out without planning instance capacity (new tasks still need time to pull images and start)
  • Compliance Ready: Fargate is in scope for many AWS compliance programs; your own configuration still has to follow security best practices

🔧 Fargate vs. Traditional ECS: Understanding the Evolution

While both Fargate and EC2-backed ECS use the same ECS control plane, their operational models differ significantly. Understanding these differences is crucial for making informed architectural decisions.

  • Fargate: Serverless compute engine - AWS manages the underlying infrastructure
  • ECS on EC2: You manage EC2 instances, scaling, and cluster capacity
  • Resource Allocation: Fargate provisions CPU and memory per task; on EC2 you provision per instance and pack tasks onto it
  • Pricing Model: Fargate bills per second for the vCPU and memory each task requests; EC2 bills for the whole instance while it runs, however full it is
  • Operational Overhead: Fargate removes host patching, instance scaling, and cluster capacity management

💻 Infrastructure as Code: Terraform ECS Fargate Setup

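Let's start with a Terraform configuration for a production-style ECS Fargate cluster and task definition. It references a few resources it doesn't define (an ECR repository, a Secrets Manager secret, and a region variable); the networking, IAM, and monitoring pieces follow in later sections.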


# main.tf - Core ECS Fargate Infrastructure
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# ECS Cluster (Fargate doesn't require EC2 instances)
resource "aws_ecs_cluster" "main" {
  name = "production-fargate-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  configuration {
    execute_command_configuration {
      logging = "DEFAULT"
    }
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Fargate Task Definition with advanced features
resource "aws_ecs_task_definition" "web_app" {
  family                   = "web-app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  
  runtime_platform {
    cpu_architecture        = "X86_64"
    operating_system_family = "LINUX"
  }

  container_definitions = jsonencode([{
    name      = "web-app"
    image     = "${aws_ecr_repository.web_app.repository_url}:latest"
    essential = true
    
    portMappings = [{
      containerPort = 8080
      hostPort      = 8080
      protocol      = "tcp"
    }]

    environment = [
      { name = "NODE_ENV", value = "production" },
      { name = "LOG_LEVEL", value = "info" }
    ]

    secrets = [
      {
        name      = "DATABASE_URL"
        valueFrom = aws_secretsmanager_secret.database_url.arn
      }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/web-app"
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "ecs"
      }
    }

    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }

    # CPU and memory are sized at the task level (cpu/memory above).
    # Fargate does not support InferenceAccelerator resource requirements,
    # so no resourceRequirements block is set here.
  }])

  ephemeral_storage {
    size_in_gib = 21
  }

  tags = {
    Application = "web-app"
    Environment = "production"
  }
}
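
The task definition above refers to a region variable, an ECR repository, and a Secrets Manager secret that the listing never declares. A minimal sketch of those supporting declarations might look like the following; the default region, the repository name, and the secret name prefix are assumptions chosen for illustration, not values from the original setup.

```hcl
# supporting.tf - hypothetical declarations for names used in main.tf

# Region used by the log configuration and the VPC endpoint service names
variable "region" {
  description = "AWS region to deploy into"
  type        = string
  default     = "us-east-1" # assumed default; change to your region
}

# Repository that holds the web-app image referenced in the task definition
resource "aws_ecr_repository" "web_app" {
  name = "web-app" # assumed repository name

  image_scanning_configuration {
    scan_on_push = true
  }
}

# Secret injected into the container as DATABASE_URL.
# Only the secret's metadata is managed here; set its value out of band
# so that it never lands in Terraform state or version control.
resource "aws_secretsmanager_secret" "database_url" {
  name_prefix = "web-app/database-url-" # assumed naming scheme
}
```

Pinning the image to a version tag or digest instead of `latest` makes deployments and rollbacks easier to reason about, since each task definition revision then points at one known image.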

  

🛡️ Advanced Networking & Security Configuration

Fargate tasks use the awsvpc network mode, which gives every task its own elastic network interface and security groups. Here's how to combine that with private subnets, security groups, VPC endpoints, and service discovery.


# networking.tf - Secure Fargate Networking
# VPC with private subnets only for Fargate
resource "aws_vpc" "fargate_vpc" {
  cidr_block           = "10.1.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "fargate-vpc"
  }
}

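# Availability zones to spread the subnets across
data "aws_availability_zones" "available" {
  state = "available"
}

# Private subnets for Fargate tasks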
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.fargate_vpc.id
  cidr_block        = cidrsubnet(aws_vpc.fargate_vpc.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "fargate-private-${count.index + 1}"
  }
}

# Security group for Fargate tasks
resource "aws_security_group" "fargate_tasks" {
  name_prefix = "fargate-tasks-"
  description = "Security group for Fargate tasks"
  vpc_id      = aws_vpc.fargate_vpc.id

  ingress {
    description     = "Application traffic from ALB"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # ECS Exec and SSM Session Manager need no inbound rule: the agent in the
  # task opens outbound HTTPS connections to the SSM endpoints.

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "fargate-tasks-sg"
  }
}

# VPC endpoints for private ECS operation
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.fargate_vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id

  security_group_ids = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "ecr-api-endpoint"
  }
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.fargate_vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id

  security_group_ids = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "ecr-dkr-endpoint"
  }
}

# ECS Service discovery for internal communication
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name        = "internal.ecs"
  description = "Internal service discovery namespace"
  vpc         = aws_vpc.fargate_vpc.id
}

resource "aws_service_discovery_service" "web_app" {
  name = "web-app"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.internal.id

    dns_records {
      ttl  = 10
      type = "A"
    }

    routing_policy = "MULTIVALUE"
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}
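
Tasks in private subnets with no NAT gateway need more than the two ECR endpoints: ECR serves image layers from S3, the awslogs driver talks to CloudWatch Logs, and the injected secret comes from Secrets Manager. The sketch below adds those endpoints along with the endpoint security group that the listing references. The route table and its associations are hypothetical additions, since the original does not show how the private subnets are routed.

```hcl
# endpoints.tf - hypothetical additions for fully private operation

# Security group for interface endpoints: HTTPS from the Fargate tasks only
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  description = "HTTPS from Fargate tasks to interface endpoints"
  vpc_id      = aws_vpc.fargate_vpc.id

  ingress {
    description     = "HTTPS from Fargate tasks"
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.fargate_tasks.id]
  }
}

# Assumed route table for the private subnets (not shown in the original)
resource "aws_route_table" "private" {
  vpc_id = aws_vpc.fargate_vpc.id
}

resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

# Gateway endpoint for S3: ECR image layers are downloaded from S3
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.fargate_vpc.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

# Interface endpoints for CloudWatch Logs and Secrets Manager
resource "aws_vpc_endpoint" "interface" {
  for_each            = toset(["logs", "secretsmanager"])
  vpc_id              = aws_vpc.fargate_vpc.id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
}
```

If you plan to use ECS Exec from these subnets, an `ssmmessages` interface endpoint is needed as well; it can be added to the `for_each` set in the same way.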

  

🚀 ECS Service Configuration with Advanced Features

Modern ECS services offer sophisticated deployment patterns, auto-scaling, and integration capabilities. Here's how to configure a production ECS service with a load balancer, service discovery, safe deployments, a mixed capacity provider strategy, and auto scaling.


# service.tf - Advanced ECS Service Configuration
resource "aws_ecs_service" "web_app" {
  name            = "web-app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.web_app.arn
  desired_count   = 2
  # launch_type is not set: it cannot be combined with capacity_provider_strategy below

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.fargate_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.web_app.arn
    container_name   = "web-app"
    container_port   = 8080
  }

  service_registries {
    registry_arn = aws_service_discovery_service.web_app.arn
  }

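  # Rolling deployments (ECS controller) with automatic rollback.
  # The circuit breaker only works with the ECS deployment controller; blue-green
  # via CODE_DEPLOY is an alternative but cannot be combined with it.
  deployment_controller {
    type = "ECS"
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }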

  # Advanced capacity provider strategy
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 2
  }

  enable_ecs_managed_tags = true
  propagate_tags          = "SERVICE"

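  # Wait for steady state before continuing
  wait_for_steady_state = true

  # Let Application Auto Scaling own the task count after creation,
  # so later applies do not reset it to the initial desired_count
  lifecycle {
    ignore_changes = [desired_count]
  }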

  tags = {
    Environment = "production"
    Application = "web-app"
  }
}

# Application Auto Scaling for Fargate service
resource "aws_appautoscaling_target" "web_app" {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.web_app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# CPU-based scaling policy
resource "aws_appautoscaling_policy" "web_app_cpu" {
  name               = "web-app-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.web_app.resource_id
  scalable_dimension = aws_appautoscaling_target.web_app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.web_app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

# Memory-based scaling policy
resource "aws_appautoscaling_policy" "web_app_memory" {
  name               = "web-app-memory-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.web_app.resource_id
  scalable_dimension = aws_appautoscaling_target.web_app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.web_app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }

    target_value       = 80.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
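
CPU and memory are not the only signals worth tracking. For a web service behind an Application Load Balancer, request count per target often follows demand more directly. The sketch below adds a third target tracking policy on the same scalable target. It assumes a load balancer resource named `aws_lb.web_app`, which the original listing does not show, and the target value is a placeholder to tune from load tests.

```hcl
# Request-based scaling policy (hypothetical addition)
resource "aws_appautoscaling_policy" "web_app_requests" {
  name               = "web-app-request-count-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.web_app.resource_id
  scalable_dimension = aws_appautoscaling_target.web_app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.web_app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"

      # Format: <load balancer arn_suffix>/<target group arn_suffix>
      # aws_lb.web_app is an assumed resource name
      resource_label = "${aws_lb.web_app.arn_suffix}/${aws_lb_target_group.web_app.arn_suffix}"
    }

    target_value       = 1000 # assumed request count per target; tune for your workload
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

When several target tracking policies are attached to one service, the service scales out if any of them calls for more capacity and scales in only when all of them agree, so adding this policy does not make scale-in more aggressive.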

  

🔐 IAM Roles & Security Best Practices

Proper IAM configuration is critical for Fargate security. Implement least privilege principles with task execution and task roles for secure container operations.


# iam.tf - Secure IAM Configuration for Fargate
# Task execution role used by ECS to pull images, fetch secrets, and write logs
resource "aws_iam_role" "ecs_task_execution_role" {
  name_prefix = "ecs-task-execution-"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Service = "ecs"
  }
}

# Attach managed policy for basic ECS operations
resource "aws_iam_role_policy_attachment" "ecs_task_execution_role_policy" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Custom task execution role policy for additional permissions
resource "aws_iam_role_policy" "ecs_task_execution_custom" {
  name_prefix = "ecs-task-execution-custom-"
  role        = aws_iam_role.ecs_task_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
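      {
        # Read only the secret that the task definition injects.
        # Add kms:Decrypt on the key ARN if the secret uses a customer managed KMS key.
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = [aws_secretsmanager_secret.database_url.arn]
      },
      {
        # Log writes limited to the application log group
        Effect = "Allow"
        Action = [
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "${aws_cloudwatch_log_group.ecs_web_app.arn}:*"
      }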
    ]
  })
}

# Task role for application-specific permissions
resource "aws_iam_role" "ecs_task_role" {
  name_prefix = "ecs-task-role-"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Service = "ecs"
  }
}

# Application-specific permissions for the task
resource "aws_iam_role_policy" "ecs_task_policy" {
  name_prefix = "ecs-task-policy-"
  role        = aws_iam_role.ecs_task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          "arn:aws:s3:::my-app-bucket",
          "arn:aws:s3:::my-app-bucket/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:UpdateItem",
          "dynamodb:Query",
          "dynamodb:Scan"
        ]
        Resource = "arn:aws:dynamodb:*:*:table/my-app-table"
      },
      {
        Effect = "Allow"
        Action = [
          "ses:SendEmail",
          "ses:SendRawEmail"
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "ses:FromAddress" = "noreply@myapp.com"
          }
        }
      }
    ]
  })
}
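
ECS Exec, which the cluster configuration enables logging for, depends on the task role being allowed to open SSM message channels. A minimal sketch of that extra policy is shown below; the policy name prefix is an assumption, while the four `ssmmessages` actions are the ones the ECS Exec documentation lists.

```hcl
# Permissions the task role needs for ECS Exec (hypothetical addition)
resource "aws_iam_role_policy" "ecs_exec" {
  name_prefix = "ecs-exec-" # assumed name
  role        = aws_iam_role.ecs_task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ssmmessages:CreateControlChannel",
          "ssmmessages:CreateDataChannel",
          "ssmmessages:OpenControlChannel",
          "ssmmessages:OpenDataChannel"
        ]
        # These actions do not support resource-level restrictions
        Resource = "*"
      }
    ]
  })
}
```

The service also has to opt in with `enable_execute_command = true` on `aws_ecs_service`, and only tasks started after that change accept sessions. A session can then be opened with `aws ecs execute-command --cluster <cluster> --task <task-id> --container web-app --interactive --command "/bin/sh"`.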

  

📊 Advanced Monitoring & Observability

Comprehensive monitoring is essential for Fargate workloads. Implement Container Insights, custom metrics, and distributed tracing for full observability.


# monitoring.tf - Comprehensive Observability Setup
# CloudWatch Log Group for ECS tasks
resource "aws_cloudwatch_log_group" "ecs_web_app" {
  name              = "/ecs/web-app"
  retention_in_days = 30

  tags = {
    Application = "web-app"
    Environment = "production"
  }
}

# Container Insights for enhanced ECS monitoring
resource "aws_cloudwatch_log_group" "container_insights" {
  name              = "/aws/ecs/containerinsights/${aws_ecs_cluster.main.name}/performance"
  retention_in_days = 7

  tags = {
    Application = "web-app"
    Environment = "production"
  }
}

# Custom CloudWatch metrics and alarms
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "ecs-web-app-cpu-utilization-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = "120"
  statistic           = "Average"
  threshold           = "80"
  alarm_description   = "This metric monitors ECS CPU utilization"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.web_app.name
  }

  tags = {
    Application = "web-app"
  }
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_high" {
  alarm_name          = "ecs-web-app-memory-utilization-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "MemoryUtilization"
  namespace           = "AWS/ECS"
  period              = "120"
  statistic           = "Average"
  threshold           = "85"
  alarm_description   = "This metric monitors ECS memory utilization"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.web_app.name
  }
}

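# Log group for ECS Exec session logs. It is only written to when the cluster's
# execute_command_configuration uses logging = "OVERRIDE" and points at this group.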
resource "aws_cloudwatch_log_group" "ecs_exec_sessions" {
  name              = "/ecs/exec-sessions"
  retention_in_days = 7

  tags = {
    Service = "ecs-exec"
  }
}

# X-Ray for distributed tracing
resource "aws_iam_role_policy_attachment" "xray_write" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess"
}

# Custom application metrics
resource "aws_cloudwatch_log_metric_filter" "application_errors" {
  name           = "WebAppErrorCount"
  pattern        = "ERROR"
  log_group_name = aws_cloudwatch_log_group.ecs_web_app.name

  metric_transformation {
    name      = "ErrorCount"
    namespace = "WebApp"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "web-app-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "ErrorCount"
  namespace           = "WebApp"
  period              = "300"
  statistic           = "Sum"
  threshold           = "10"
  alarm_description   = "Monitor application error rate"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  tags = {
    Application = "web-app"
  }
}
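
All three alarms above publish to `aws_sns_topic.alerts`, which the listing does not define. One possible definition, with an email subscription, is sketched below. The topic name and the `alert_email` variable are assumptions introduced for this example.

```hcl
# alerts.tf - hypothetical SNS topic used by the alarms above

variable "alert_email" {
  description = "Address that receives alarm notifications (assumed variable)"
  type        = string
}

resource "aws_sns_topic" "alerts" {
  name = "web-app-alerts" # assumed topic name
}

resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}
```

Email subscriptions stay pending until the recipient confirms them, so the alarms will not reach anyone until that confirmation link has been clicked.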

  

💰 Cost Optimization Strategies for Fargate

Fargate costs can be reduced through right-sizing, Fargate Spot capacity, and scaling that follows demand. Here are the main strategies for reducing costs while maintaining performance.

  • Right-size Task Resources: Use CloudWatch metrics to identify optimal CPU/memory allocations
  • Leverage Fargate Spot: Mix Spot and On-Demand capacity; AWS advertises discounts of up to 70% for Spot
  • Implement Auto Scaling: Scale services based on actual demand patterns
  • Optimize Container Images: Smaller images pull faster, which shortens task start-up and lowers ECR storage and transfer costs
  • Use Graviton Processors: Run tasks on ARM64 (Graviton) where your images support it, for better price/performance

# cost-optimization.tf - Fargate Cost Optimization
# Mixed capacity provider strategy for cost optimization
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

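  # Keep at least one task on regular Fargate, then favor Spot 3:1
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 1
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }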
}

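# Cost and usage reporting
# Note: the Cost and Usage Report API is served from us-east-1, so this
# resource needs a provider configured for that region.
resource "aws_cur_report_definition" "fargate_costs" {
  report_name                = "fargate-cost-report"
  time_unit                  = "HOURLY"
  format                     = "Parquet"
  compression                = "Parquet"
  additional_schema_elements = ["RESOURCES"]
  s3_bucket                  = aws_s3_bucket.cost_reports.bucket
  s3_prefix                  = "fargate"
  s3_region                  = var.region

  # Parquet reports support only the ATHENA artifact (REDSHIFT and QUICKSIGHT
  # require CSV with GZIP), and ATHENA requires OVERWRITE_REPORT.
  additional_artifacts = ["ATHENA"]

  report_versioning = "OVERWRITE_REPORT"
}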

# Monthly cost budget with an alert at 80% of the limit.
# No cost filter is set, so this tracks total account spend, not only Fargate.
resource "aws_budgets_budget" "fargate_monthly" {
  name              = "fargate-monthly-budget"
  budget_type       = "COST"
  limit_amount      = "1000"
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2025-01-01_00:00"

  cost_types {
    include_credit             = false
    include_discount           = true
    include_other_subscription = true
    include_recurring          = true
    include_refund             = false
    include_subscription       = true
    include_support            = true
    include_tax                = true
    include_upfront            = true
    use_amortized              = false
    use_blended                = false
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}
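
The right-sizing and Graviton bullets above both come down to a few lines in the task definition. The sketch below is a smaller ARM64 variant of the web-app task. It assumes the image has been built for arm64 and pushed under a hypothetical `arm64` tag, and that 0.5 vCPU with 1 GB is enough for the workload, which you would confirm from CloudWatch utilization metrics before adopting it.

```hcl
# Right-sized Graviton variant of the task definition (hypothetical)
resource "aws_ecs_task_definition" "web_app_arm64" {
  family                   = "web-app-arm64"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512  # 0.5 vCPU, assumed sufficient
  memory                   = 1024 # 1 GB, a valid pairing for 512 CPU units
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_task_role.arn

  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }

  container_definitions = jsonencode([{
    name      = "web-app"
    image     = "${aws_ecr_repository.web_app.repository_url}:arm64" # assumed tag
    essential = true

    portMappings = [{
      containerPort = 8080
      protocol      = "tcp"
    }]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/web-app"
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "ecs"
      }
    }
  }])
}
```

Fargate accepts only specific CPU and memory pairings, so a change to one value usually means checking the other against the supported combinations.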

  

⚡ Key Takeaways

  1. Serverless First: Fargate eliminates infrastructure management while providing enterprise-grade container orchestration
  2. Security by Design: Implement task-level IAM roles, private networking, and VPC endpoints for secure operations
  3. Cost Optimization: Leverage Fargate Spot, right-sizing, and auto-scaling to optimize spending
  4. Advanced Deployment Patterns: Use deployment circuit breakers with automatic rollback, or blue-green deployments through CodeDeploy, for reliable releases
  5. Comprehensive Observability: Implement Container Insights, custom metrics, and distributed tracing
  6. Infrastructure as Code: Use Terraform for reproducible, version-controlled deployments
  7. Mixed Capacity Strategies: Combine Fargate and Fargate Spot for optimal cost and availability

❓ Frequently Asked Questions

When should I choose Fargate vs. ECS on EC2?
Choose Fargate when you want to eliminate server management, have variable workloads, or need enhanced security isolation. Choose ECS on EC2 for predictable steady-state workloads, when you need GPU instances, or for cost optimization with reserved instances.
How does Fargate pricing work compared to EC2?
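Fargate bills per second (with a one-minute minimum) for the vCPU and memory a task requests, whether or not the task uses them. EC2 bills for the whole instance, also per second for most Linux instances, so you pay for unused headroom. Fargate can be more cost-effective for spiky workloads but is often more expensive for consistent 24/7 workloads than well-utilized EC2 capacity covered by Reserved Instances or Savings Plans.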
Can I use Fargate for stateful workloads or databases?
Fargate is primarily designed for stateless workloads. While you can attach EFS volumes for persistent storage, it's not recommended for databases or other stateful services that require low-latency storage or specific instance types. Use RDS or EC2 for stateful workloads.
What's the cold start time for Fargate tasks?
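Fargate task start-up often falls in the 30-90 second range, depending on image size, task size, and network configuration. You can shorten it by using smaller container images, keeping images in ECR in the same Region (with VPC endpoints for private subnets), and lazy-loading large images with Seekable OCI (SOCI). Health check settings don't change start-up time, but the start period and interval determine how soon a task begins receiving traffic.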
How do I debug Fargate tasks when something goes wrong?
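Use ECS Exec for shell access to running tasks, CloudWatch Logs for application logs, Container Insights for performance metrics, and X-Ray for distributed tracing. For tasks that have already stopped, check the stopped reason and container exit codes in the ECS console or with aws ecs describe-tasks; stopped tasks remain visible only for a short time, so rely on logs for anything older.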
Can I use Fargate with GPU workloads?
Not at the time of writing, as far as the AWS documentation indicates: Fargate task definitions cannot request GPUs. For GPU workloads, run ECS on EC2 with GPU instance types, as noted in the first question above.

💬 Have you implemented Fargate in production? Share your experiences, challenges, or cost optimization tips in the comments below! If you found this guide helpful, please share it with your team or on social media to help others master serverless containers.

About LK-TECH Academy — Practical tutorials & explainers on software engineering, AI, and infrastructure. Follow for concise, hands-on guides.

Tuesday, 21 October 2025

Terraform Cost Optimization 2025: Autoscaling That Saves 40-70% on Cloud Bills


Infrastructure Cost Optimization: Writing Terraform that Autoscales and Saves Money

[Figure: Terraform infrastructure cost optimization with autoscaling groups, spot instances, and predictive scaling, showing 40-70% AWS cost savings]

Cloud infrastructure costs are growing faster than many organizations expect, with wasted resources accounting for up to 35% of cloud spending by some estimates. Terraform configurations that build in autoscaling, spot capacity, and right-sizing are a practical way to bring those costs down. This guide shows how to write Terraform code that not only deploys infrastructure but also keeps its cost in check through intelligent scaling, spot instance utilization, and resource right-sizing, potentially saving your organization thousands of dollars a month.

🚀 Why Traditional Infrastructure Fails Cost Optimization

Traditional static infrastructure deployment, even with basic autoscaling, often leads to significant cost inefficiencies. Most teams over-provision "just to be safe," which can leave resources idle most of the time (figures of 60-80% are commonly quoted). The 2025 approach requires infrastructure-as-code that treats cost optimization as a first-class requirement.

  • Over-provisioning syndrome: Teams deploy for peak load 24/7
  • Static resource allocation: Fixed instance sizes regardless of actual needs
  • Manual scaling decisions: Reactive rather than predictive scaling
  • Ignoring spot instances: Missing 60-90% savings opportunities
  • No utilization tracking: Flying blind on actual resource usage

💡 Advanced Autoscaling Strategies for 2025

Modern autoscaling goes beyond simple CPU thresholds. Here are the advanced patterns you should implement:

  • Predictive scaling: Using ML to anticipate traffic patterns
  • Multi-metric scaling: Combining CPU, memory, queue depth, and custom metrics
  • Cost-aware scaling: Considering spot instance availability and pricing
  • Time-based scaling: Scheduled scaling for known patterns
  • Horizontal vs. vertical scaling: Choosing the right approach for your workload
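
Time-based scaling is the simplest of these patterns to express in Terraform. The sketch below shrinks an Auto Scaling group on weekday evenings and restores it in the morning. It targets the `aws_autoscaling_group.cost_optimized` group defined in the module that follows; the action names, the cron times, and the UTC time zone are assumptions for illustration.

```hcl
# Scheduled scaling for a non-production environment (hypothetical schedule)

# Scale to zero at 20:00 UTC, Monday to Friday
resource "aws_autoscaling_schedule" "scale_down_evenings" {
  scheduled_action_name  = "scale-down-evenings"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  recurrence             = "0 20 * * 1-5"
  time_zone              = "Etc/UTC"
  min_size               = 0
  max_size               = 1
  desired_capacity       = 0
}

# Restore the normal sizes at 07:00 UTC, Monday to Friday
resource "aws_autoscaling_schedule" "scale_up_mornings" {
  scheduled_action_name  = "scale-up-mornings"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  recurrence             = "0 7 * * 1-5"
  time_zone              = "Etc/UTC"
  min_size               = var.min_size
  max_size               = var.max_size
  desired_capacity       = var.desired_capacity
}
```

With this schedule the group runs 55 of the 168 hours in a week, so an environment that is only used during working hours is stopped for roughly two thirds of the time.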

💻 Complete Terraform Module for Cost-Optimized Autoscaling


# modules/cost-optimized-autoscaling/main.tf

terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Mixed instance policy for cost optimization
resource "aws_autoscaling_group" "cost_optimized" {
  name_prefix               = "cost-opt-asg-"
  max_size                  = var.max_size
  min_size                  = var.min_size
  desired_capacity          = var.desired_capacity
  health_check_grace_period = 300
  health_check_type         = "EC2"
  vpc_zone_identifier       = var.subnet_ids
  termination_policies      = ["OldestInstance", "OldestLaunchTemplate"]

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = var.on_demand_base_capacity
      on_demand_percentage_above_base_capacity = var.on_demand_percentage
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.cost_optimized.id
        version            = "$Latest"
      }

      override {
        instance_type = "t3.medium"
      }

      override {
        instance_type = "t3a.medium"
      }

      # t4g (Graviton) types need an arm64 AMI, so this x86_64 template sticks to x86 types
      override {
        instance_type = "t2.medium"
      }
    }
  }

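  tag {
    key                 = "CostOptimized"
    value               = "true"
    propagate_at_launch = true
  }

  tag {
    key                 = "AutoScalingGroup"
    value               = "cost-optimized"
    propagate_at_launch = true
  }
}

# Predictive scaling policy (scaling policies are separate resources, not ASG blocks)
resource "aws_autoscaling_policy" "predictive" {
  count                  = var.enable_predictive_scaling ? 1 : 0
  name                   = "cost-opt-predictive"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    max_capacity_breach_behavior = "IncreaseMaxCapacity"
    max_capacity_buffer          = var.predictive_buffer
    mode                         = "ForecastAndScale"
    scheduling_buffer_time       = var.scheduling_buffer

    metric_specification {
      target_value = 70 # assumed CPU target, matching the 70% alarm threshold below

      predefined_metric_pair_specification {
        predefined_metric_type = "ASGCPUUtilization"
      }
    }
  }
}

# Target tracking scaling policies, one per entry in var.scaling_metrics
# (a map of policy name to predefined metric type)
resource "aws_autoscaling_policy" "target_tracking" {
  for_each               = var.scaling_metrics
  name                   = "cost-opt-${each.key}"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = each.value
    }
    target_value = var.metric_targets[each.key]
  }
}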

# Launch template with optimized AMI and configuration
resource "aws_launch_template" "cost_optimized" {
  name_prefix   = "cost-opt-lt-"
  image_id      = data.aws_ami.optimized_ami.id
  instance_type = var.default_instance_type
  key_name      = var.key_name

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size           = var.volume_size
      volume_type           = "gp3"
      delete_on_termination = true
      encrypted             = true
    }
  }

  monitoring {
    enabled = true
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "cost-optimized-instance"
      Environment = var.environment
      Project     = var.project_name
    }
  }

  user_data = base64encode(templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  }))
}

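# Simple scaling policies triggered by the alarms below
# (use these or CPU target tracking, not both on the same metric)
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "cost-opt-scale-up"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

resource "aws_autoscaling_policy" "scale_down" {
  name                   = "cost-opt-scale-down"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1
  cooldown               = 300
}

# CloudWatch alarms for cost-aware scaling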
resource "aws_cloudwatch_metric_alarm" "scale_up_cost" {
  alarm_name          = "scale-up-cost-optimized"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "70"
  alarm_description   = "Scale up when CPU exceeds 70%"
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.cost_optimized.name
  }
}

resource "aws_cloudwatch_metric_alarm" "scale_down_cost" {
  alarm_name          = "scale-down-cost-optimized"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "3"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "30"
  alarm_description   = "Scale down when CPU below 30%"
  alarm_actions       = [aws_autoscaling_policy.scale_down.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.cost_optimized.name
  }
}

# Data source for optimized AMI
data "aws_ami" "optimized_ami" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
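
The module reads sixteen input variables that the listing never declares. A possible `variables.tf` is sketched below. Every type, default, and description here is an assumption inferred from how each variable is used above, so treat it as a starting point, not the module's actual interface.

```hcl
# modules/cost-optimized-autoscaling/variables.tf (hypothetical)

variable "min_size" { type = number }
variable "max_size" { type = number }
variable "desired_capacity" { type = number }
variable "subnet_ids" { type = list(string) }
variable "environment" { type = string }
variable "project_name" { type = string }

variable "on_demand_base_capacity" {
  description = "Instances that always run On-Demand"
  type        = number
  default     = 1
}

variable "on_demand_percentage" {
  description = "On-Demand share of capacity above the base, 0-100"
  type        = number
  default     = 20
}

variable "enable_predictive_scaling" {
  type    = bool
  default = false
}

variable "predictive_buffer" {
  description = "Percent of extra capacity allowed above max_size"
  type        = number
  default     = 10
}

variable "scheduling_buffer" {
  description = "Seconds to launch instances ahead of the forecast"
  type        = number
  default     = 300
}

variable "scaling_metrics" {
  description = "Map of policy name to predefined metric type"
  type        = map(string)
  default     = { cpu = "ASGAverageCPUUtilization" }
}

variable "metric_targets" {
  description = "Target value per policy, keyed like scaling_metrics"
  type        = map(number)
  default     = { cpu = 60 }
}

variable "default_instance_type" {
  type    = string
  default = "t3.medium"
}

variable "key_name" {
  description = "EC2 key pair name, or null for none"
  type        = string
  default     = null
}

variable "volume_size" {
  description = "Root volume size in GiB"
  type        = number
  default     = 20
}
```

Keeping `scaling_metrics` and `metric_targets` keyed by the same names lets callers add a second tracked metric by extending both maps, without touching the module's resources.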

  

🔧 Implementing Spot Instance Strategies

Spot instances can reduce compute costs by up to 90%, but require careful implementation. Here's how to use them effectively:

  • Capacity-optimized strategy: Automatically selects optimal spot pools
  • Mixed instances policy: Blend spot and on-demand instances
  • Spot interruption handling: Graceful handling of spot termination notices
  • Diversification: Using multiple instance types to improve availability

💻 Advanced Spot Instance Configuration


# Advanced spot instance configuration with interruption handling

resource "aws_autoscaling_group" "spot_optimized" {
  name_prefix         = "spot-opt-asg-"
  max_size            = 20
  min_size            = 2
  desired_capacity    = 4
  vpc_zone_identifier = var.subnet_ids

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.spot_optimized.id
        version            = "$Latest"
      }

      # Multiple instance types for better spot availability
      override {
        instance_type     = "t3.medium"
        weighted_capacity = "1"
      }

      override {
        instance_type     = "t3a.medium"
        weighted_capacity = "1"
      }

      override {
        instance_type     = "m5.large"
        weighted_capacity = "2"
      }

      override {
        instance_type     = "m5a.large"
        weighted_capacity = "2"
      }
    }
  }

  tag {
    key                 = "InstanceLifecycle"
    value               = "spot"
    propagate_at_launch = true
  }
}

# Spot instance interruption handler
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "spot-instance-interruption"
  description = "Capture spot instance interruption notices"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_interruption_lambda" {
  rule      = aws_cloudwatch_event_rule.spot_interruption.name
  target_id = "TriggerLambda"
  arn       = aws_lambda_function.spot_handler.arn
}
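
One piece the listing above leaves out: EventBridge can only invoke a Lambda target if the function's resource policy allows it. A minimal sketch of that permission follows, assuming `aws_lambda_function.spot_handler` (referenced above but not defined in this article) exists elsewhere in your configuration. The resource name and statement ID are illustrative.

```hcl
# Sketch: allow the interruption rule to invoke the handler function.
# aws_lambda_function.spot_handler is assumed to be defined elsewhere.
resource "aws_lambda_permission" "allow_spot_interruption_rule" {
  statement_id  = "AllowExecutionFromEventBridge" # illustrative ID
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.spot_handler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.spot_interruption.arn
}
```

Scoping `source_arn` to the single rule keeps other EventBridge rules in the account from triggering the handler.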

  

📊 Monitoring and Cost Analytics

You can't optimize what you can't measure. Implement comprehensive cost monitoring:

  • Cost and Usage Reports (CUR): Detailed AWS cost tracking
  • Resource tagging: Complete cost allocation tagging
  • CloudWatch dashboards: Real-time cost and performance metrics
  • Custom metrics: Application-specific cost optimization metrics
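
As a rough illustration of the tagging and tracking bullets above, the sketch below applies default tags at the provider level and adds a monthly budget alert. The region, tag values, budget amount, and email address are all hypothetical placeholders, not values from this article.

```hcl
# Sketch: tag every taggable resource created through this provider.
provider "aws" {
  region = "us-east-1" # hypothetical region

  default_tags {
    tags = {
      Project     = "spot-optimized-app" # hypothetical tag values
      Environment = "production"
      CostCenter  = "platform"
    }
  }
}

# Sketch: email an alert when forecasted monthly spend passes 80% of the limit.
resource "aws_budgets_budget" "monthly_cost" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "5000" # hypothetical limit
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["finops@example.com"] # hypothetical address
  }
}
```

Tags only appear in Cost Explorer and the Cost and Usage Report after they are activated as cost allocation tags in the Billing console.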

⚡ Key Takeaways for 2025 Cost Optimization

  1. Implement mixed instance policies with spot instances for up to 90% savings
  2. Use predictive scaling to anticipate traffic patterns and scale proactively
  3. Right-size instances based on actual usage metrics, not guesswork
  4. Implement comprehensive tagging for cost allocation and reporting
  5. Monitor and adjust continuously using CloudWatch and Cost Explorer
  6. Leverage Graviton instances for better price-performance ratio
  7. Implement scheduling for non-production environments
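
Takeaway 7 can be sketched with scheduled scaling actions. The example below assumes the `aws_autoscaling_group.spot_optimized` group shown earlier belongs to a non-production environment. The schedule times and action names are illustrative.

```hcl
# Sketch: scale a non-production group to zero on weekday evenings.
resource "aws_autoscaling_schedule" "nonprod_scale_down" {
  scheduled_action_name  = "nonprod-evening-scale-down" # illustrative name
  autoscaling_group_name = aws_autoscaling_group.spot_optimized.name
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
  recurrence             = "0 20 * * 1-5" # 20:00, Monday to Friday
  time_zone              = "Etc/UTC"
}

# Sketch: restore the original sizes on weekday mornings.
resource "aws_autoscaling_schedule" "nonprod_scale_up" {
  scheduled_action_name  = "nonprod-morning-scale-up" # illustrative name
  autoscaling_group_name = aws_autoscaling_group.spot_optimized.name
  min_size               = 2
  max_size               = 20
  desired_capacity       = 4
  recurrence             = "0 7 * * 1-5" # 07:00, Monday to Friday
  time_zone              = "Etc/UTC"
}
```

The scale-up values mirror the sizes set on the group itself, so the morning action simply puts the group back into its normal range.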

❓ Frequently Asked Questions

What's the biggest mistake teams make with Terraform cost optimization?
The most common mistake is treating infrastructure as static. Teams deploy fixed-size resources without implementing proper autoscaling, leading to massive over-provisioning. Modern applications need dynamic infrastructure that scales with actual demand.
How much can I realistically save with these techniques?
Most organizations save 40-70% on compute costs by implementing comprehensive autoscaling, spot instances, and right-sizing. One client reduced their $12,000 monthly AWS bill to $4,800 using the exact strategies outlined in this article.
Are spot instances reliable for production workloads?
Yes, with proper implementation. Use mixed instance policies with a base capacity of on-demand instances, implement spot interruption handling, and diversify across instance types and availability zones. Many companies run 80%+ of their production workload on spot instances.
How often should I review and update my Terraform scaling configurations?
Review scaling metrics weekly for the first month, then monthly thereafter. Use AWS Cost Explorer and CloudWatch dashboards to identify optimization opportunities. Major application changes should trigger immediate scaling policy reviews.
Can I implement these cost optimization techniques with Kubernetes?
Absolutely! The same principles apply. Use Kubernetes Cluster Autoscaler with spot instance node groups, implement Horizontal Pod Autoscaling, and use Karpenter for advanced node provisioning optimization.

💬 Found this article helpful? What's your biggest infrastructure cost challenge? Please leave a comment below or share it with your network to help others optimize their cloud spending!

About LK-TECH Academy — Practical tutorials & explainers on software engineering, AI, and infrastructure. Follow for concise, hands-on guides.

Sunday, 19 October 2025

Building a Secure Private Cloud Network on AWS with Transit Gateway and SSM (2025 Guide)

AWS Transit Gateway and SSM Session Manager private cloud architecture diagram showing secure VPC networking with no public subnets

In 2025, enterprise cloud architecture demands more sophisticated networking solutions that prioritize security, scalability, and operational efficiency. This comprehensive guide explores how to build a fully private, secure cloud network on AWS using Transit Gateway and Systems Manager (SSM). We'll dive deep into creating isolated VPC architectures, implementing zero-trust networking principles, and enabling secure administrative access without exposing resources to the public internet. Whether you're building a new cloud foundation or modernizing existing infrastructure, this architecture represents the gold standard for enterprise-grade AWS networking.

🚀 Why Private Cloud Networks Matter in 2025

The evolution of cloud security has shifted from perimeter-based defenses to zero-trust architectures where private networking is fundamental. In today's threat landscape, minimizing internet exposure isn't just best practice—it's essential for compliance, data protection, and risk management. Here's why this architecture is crucial:

  • Enhanced Security Posture: Eliminate public attack surfaces by keeping resources in private subnets
  • Regulatory Compliance: Meet stringent requirements like GDPR, HIPAA, and SOC 2 with controlled data flows
  • Cost Optimization: Reduce data transfer costs and NAT gateway expenses through optimized routing
  • Operational Excellence: Streamline management with centralized networking and secure access patterns
  • Future-Proof Architecture: Build a foundation that scales seamlessly across multiple accounts and regions

    🔧 Core Components: Transit Gateway & SSM Session Manager

    AWS Transit Gateway acts as a regional hub that simplifies network connectivity between VPCs, on-premises networks, and other AWS services. When combined with SSM Session Manager for secure bastion-free access, you create a powerful foundation for enterprise networking.

    Let's examine the key components of this architecture:

    • AWS Transit Gateway: Centralized network transit hub with route tables and cross-region peering
    • VPC Endpoints: Private connectivity to AWS services without internet gateways
    • SSM Session Manager: Secure CLI and SSH access without bastion hosts or public IPs
    • Private Subnets: Isolated network segments with no internet ingress
    • Security Groups & NACLs: Micro-segmentation and network-level security controls

    💻 Infrastructure as Code: Terraform Configuration

    Let's start with the foundational Terraform code to provision our secure private network. This configuration sets up Transit Gateway, VPCs with only private subnets, and the necessary VPC endpoints for SSM.

    
    # main.tf - Core Transit Gateway and VPC Configuration
    terraform {
      required_version = ">= 1.5.0"
      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "~> 5.0"
        }
      }
    }
    
    # Transit Gateway for centralized routing
    resource "aws_ec2_transit_gateway" "main" {
      description                     = "Central Transit Gateway for private cloud"
      amazon_side_asn                 = 64512
      auto_accept_shared_attachments  = "enable"
      default_route_table_association = "disable"
      default_route_table_propagation = "disable"
      
      tags = {
        Name = "main-tgw"
      }
    }
    
    # Application VPC with only private subnets
    resource "aws_vpc" "app_vpc" {
      cidr_block           = "10.1.0.0/16"
      enable_dns_hostnames = true
      enable_dns_support   = true
      
      tags = {
        Name = "app-vpc-private"
      }
    }
    
    # Private subnets across multiple AZs
    resource "aws_subnet" "app_private" {
      count             = 3
      vpc_id            = aws_vpc.app_vpc.id
      cidr_block        = cidrsubnet(aws_vpc.app_vpc.cidr_block, 8, count.index)
      availability_zone = data.aws_availability_zones.available.names[count.index]
      
      tags = {
        Name = "app-private-${count.index + 1}"
      }
    }
    
    # Transit Gateway VPC attachment
    resource "aws_ec2_transit_gateway_vpc_attachment" "app_vpc" {
      subnet_ids         = aws_subnet.app_private[*].id
      transit_gateway_id = aws_ec2_transit_gateway.main.id
      vpc_id             = aws_vpc.app_vpc.id
      
      tags = {
        Name = "app-vpc-attachment"
      }
    }
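
The configuration above references `data.aws_availability_zones.available`, and a later listing references `aws_route_table.private`, but neither is defined in this guide. The sketch below shows one plausible way to fill those gaps and to route inter-VPC traffic to the Transit Gateway. The `10.0.0.0/8` destination is an assumption about your address plan.

```hcl
# Assumed definition of the data source used by aws_subnet.app_private.
data "aws_availability_zones" "available" {
  state = "available"
}

# One private route table per subnet (referenced later as aws_route_table.private).
resource "aws_route_table" "private" {
  count  = 3
  vpc_id = aws_vpc.app_vpc.id

  tags = {
    Name = "app-private-rt-${count.index + 1}"
  }
}

resource "aws_route_table_association" "private" {
  count          = 3
  subnet_id      = aws_subnet.app_private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

# Send traffic bound for other private networks to the Transit Gateway.
resource "aws_route" "to_tgw" {
  count                  = 3
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "10.0.0.0/8" # assumption: all private ranges live here
  transit_gateway_id     = aws_ec2_transit_gateway.main.id

  # The route can only be created once the VPC is attached to the gateway.
  depends_on = [aws_ec2_transit_gateway_vpc_attachment.app_vpc]
}
```

Without a VPC-side route like this, the attachment exists but instances have no path to the Transit Gateway.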
    
      

    🛡️ Implementing VPC Endpoints for Private Service Access

    VPC endpoints are crucial for maintaining private network isolation while allowing necessary AWS service connectivity. Here's how to implement the essential endpoints for SSM and other critical services.

    
    # vpc-endpoints.tf - PrivateLink Configuration for AWS Services
    # SSM VPC Endpoint for Session Manager
    resource "aws_vpc_endpoint" "ssm" {
      vpc_id              = aws_vpc.app_vpc.id
      service_name        = "com.amazonaws.${var.region}.ssm"
      vpc_endpoint_type   = "Interface"
      private_dns_enabled = true
      subnet_ids          = aws_subnet.app_private[*].id
      
      security_group_ids = [
        aws_security_group.vpc_endpoints.id
      ]
      
      tags = {
        Name = "ssm-endpoint"
      }
    }
    
    # Additional SSM endpoints for full functionality
    resource "aws_vpc_endpoint" "ssm_messages" {
      vpc_id              = aws_vpc.app_vpc.id
      service_name        = "com.amazonaws.${var.region}.ssmmessages"
      vpc_endpoint_type   = "Interface"
      private_dns_enabled = true
      subnet_ids          = aws_subnet.app_private[*].id
      
      security_group_ids = [
        aws_security_group.vpc_endpoints.id
      ]
      
      tags = {
        Name = "ssm-messages-endpoint"
      }
    }
    
    resource "aws_vpc_endpoint" "ec2_messages" {
      vpc_id              = aws_vpc.app_vpc.id
      service_name        = "com.amazonaws.${var.region}.ec2messages"
      vpc_endpoint_type   = "Interface"
      private_dns_enabled = true
      subnet_ids          = aws_subnet.app_private[*].id
      
      security_group_ids = [
        aws_security_group.vpc_endpoints.id
      ]
      
      tags = {
        Name = "ec2-messages-endpoint"
      }
    }
    
    # S3 Gateway Endpoint for package downloads and logs
    resource "aws_vpc_endpoint" "s3" {
      vpc_id            = aws_vpc.app_vpc.id
      service_name      = "com.amazonaws.${var.region}.s3"
      vpc_endpoint_type = "Gateway"
      route_table_ids   = aws_route_table.private[*].id
      
      tags = {
        Name = "s3-gateway-endpoint"
      }
    }
    
    # ECR endpoints for Docker image pulls
    resource "aws_vpc_endpoint" "ecr_api" {
      vpc_id              = aws_vpc.app_vpc.id
      service_name        = "com.amazonaws.${var.region}.ecr.api"
      vpc_endpoint_type   = "Interface"
      private_dns_enabled = true
      subnet_ids          = aws_subnet.app_private[*].id
      
      security_group_ids = [
        aws_security_group.vpc_endpoints.id
      ]
      
      tags = {
        Name = "ecr-api-endpoint"
      }
    }
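
Image pulls from ECR use two interface endpoints: `ecr.api` for authentication and metadata calls, and `ecr.dkr` for the Docker registry itself. Image layers are served from S3, which the gateway endpoint above covers. A sketch of the second endpoint follows, along with an assumed declaration for `var.region`, which the listings use but never declare.

```hcl
# Assumed declaration for the region variable used throughout these listings.
variable "region" {
  description = "AWS region for the private network"
  type        = string
}

# Sketch: Docker registry endpoint that complements ecr.api.
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.app_vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.app_private[*].id

  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]

  tags = {
    Name = "ecr-dkr-endpoint"
  }
}
```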
    
      

    🔐 Advanced Security Groups for Micro-Segmentation

    Security groups provide essential micro-segmentation within your private network. Here's how to implement zero-trust security group rules that enforce least privilege access.

    
    # security-groups.tf - Zero-Trust Security Configuration
    # VPC Endpoints Security Group
    resource "aws_security_group" "vpc_endpoints" {
      name_prefix = "vpc-endpoints-"
      description = "Security group for VPC endpoints"
      vpc_id      = aws_vpc.app_vpc.id
      
      ingress {
        description = "HTTPS from private subnets (SSM, ECR, and other interface endpoints)"
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = [aws_vpc.app_vpc.cidr_block]
      }
      
      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
      
      tags = {
        Name = "vpc-endpoints-sg"
      }
    }
    
    # Application instances security group
    resource "aws_security_group" "app_instances" {
      name_prefix = "app-instances-"
      description = "Security group for application instances"
      vpc_id      = aws_vpc.app_vpc.id
      
      # No inbound SSH rule is needed: Session Manager connections are opened by
      # the SSM Agent over outbound HTTPS (443), including SSH tunneled through SSM.
      
      ingress {
        description = "Application traffic from internal"
        from_port   = 8080
        to_port     = 8080
        protocol    = "tcp"
        cidr_blocks = [aws_vpc.app_vpc.cidr_block]
      }
      
      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
      
      tags = {
        Name = "app-instances-sg"
      }
    }
    
    # Database security group with strict rules
    resource "aws_security_group" "database" {
      name_prefix = "database-"
      description = "Security group for database instances"
      vpc_id      = aws_vpc.app_vpc.id
      
      ingress {
        description     = "PostgreSQL from app instances"
        from_port       = 5432
        to_port         = 5432
        protocol        = "tcp"
        security_groups = [aws_security_group.app_instances.id]
      }
      
      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
      
      tags = {
        Name = "database-sg"
      }
    }
    
      

    🚀 Configuring SSM Session Manager for Secure Access

    SSM Session Manager eliminates the need for bastion hosts and provides secure, auditable access to EC2 instances. Here's the complete IAM and SSM configuration.

    
    # iam-ssm.tf - IAM Roles and SSM Configuration
    # SSM Instance Role
    resource "aws_iam_role" "ssm_instance_role" {
      name_prefix = "SSMInstanceRole-"
      
      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Action = "sts:AssumeRole"
            Effect = "Allow"
            Principal = {
              Service = "ec2.amazonaws.com"
            }
          }
        ]
      })
      
      tags = {
        Name = "ssm-instance-role"
      }
    }
    
    # AmazonSSMManagedInstanceCore policy attachment
    resource "aws_iam_role_policy_attachment" "ssm_core" {
      role       = aws_iam_role.ssm_instance_role.name
      policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
    }
    
    # Custom SSM policy for additional permissions
    resource "aws_iam_role_policy" "ssm_custom" {
      name_prefix = "SSMCustomPolicy-"
      role        = aws_iam_role.ssm_instance_role.id
      
      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Effect = "Allow"
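            Action = [
              "s3:GetObject",
              "s3:PutObject",
              "s3:ListBucket",
              # Lets the agent confirm bucket encryption when s3EncryptionEnabled is true
              "s3:GetEncryptionConfiguration"
            ]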
            Resource = [
              "arn:aws:s3:::my-ssm-logs-bucket/*",
              "arn:aws:s3:::my-ssm-logs-bucket"
            ]
          },
          {
            Effect = "Allow"
            Action = [
              "logs:CreateLogStream",
              "logs:PutLogEvents",
              "logs:DescribeLogGroups",
              "logs:DescribeLogStreams"
            ]
            Resource = "*"
          }
        ]
      })
    }
    
    # Instance Profile for EC2 instances
    resource "aws_iam_instance_profile" "ssm_instance" {
      name_prefix = "SSMInstanceProfile-"
      role        = aws_iam_role.ssm_instance_role.name
    }
    
    # SSM Document for session preferences
    resource "aws_ssm_document" "session_preferences" {
      name          = "SSM-SessionManagerRunShell"
      document_type = "Session"
      
      content = jsonencode({
        schemaVersion = "1.0"
        description   = "Document to hold regional session settings"
        sessionType   = "Standard_Stream"
        inputs = {
          s3BucketName                = "my-ssm-logs-bucket"
          s3KeyPrefix                 = "ssm-sessions"
          s3EncryptionEnabled         = true
          cloudWatchLogGroupName      = "/aws/ssm/sessions"
          cloudWatchEncryptionEnabled = true
          cloudWatchStreamingEnabled  = true
          idleSessionTimeout          = "20"
          maxSessionDuration          = "60"
          shellProfile = {
            linux = "echo 'Welcome to Secure Session Manager'"
          }
        }
      })
      
      tags = {
        Name = "session-preferences"
      }
    }
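
The session preferences above stream to the `/aws/ssm/sessions` log group with encryption enabled, but the log group itself is not created in this guide, and a VPC with no internet path also needs a private route to CloudWatch Logs. The sketch below covers both. `var.session_log_kms_key_arn` is a hypothetical variable for a customer managed KMS key whose policy already allows the CloudWatch Logs service to use it.

```hcl
# Hypothetical input: ARN of a KMS key that CloudWatch Logs may use.
variable "session_log_kms_key_arn" {
  description = "KMS key ARN for encrypting Session Manager logs"
  type        = string
}

# Encrypted log group matching cloudWatchLogGroupName in the session document.
resource "aws_cloudwatch_log_group" "ssm_sessions" {
  name              = "/aws/ssm/sessions"
  retention_in_days = 90 # illustrative retention
  kms_key_id        = var.session_log_kms_key_arn
}

# Interface endpoint so instances can reach CloudWatch Logs privately.
resource "aws_vpc_endpoint" "logs" {
  vpc_id              = aws_vpc.app_vpc.id
  service_name        = "com.amazonaws.${var.region}.logs"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.app_private[*].id

  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]

  tags = {
    Name = "logs-endpoint"
  }
}
```

With `cloudWatchEncryptionEnabled` set to true, sessions are expected to fail to start if the target log group is not encrypted, so it is worth creating the log group and the preferences document together.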
    
      

    🔄 Advanced Transit Gateway Routing

    Transit Gateway route tables enable sophisticated routing patterns for multi-VPC architectures. Here's how to implement advanced routing with segregation and security controls.

    
    # tgw-routing.tf - Advanced Transit Gateway Configuration
    # Segregated route tables for different environments
    resource "aws_ec2_transit_gateway_route_table" "production" {
      transit_gateway_id = aws_ec2_transit_gateway.main.id
      
      tags = {
        Name = "production-rt"
      }
    }
    
    resource "aws_ec2_transit_gateway_route_table" "development" {
      transit_gateway_id = aws_ec2_transit_gateway.main.id
      
      tags = {
        Name = "development-rt"
      }
    }
    
    resource "aws_ec2_transit_gateway_route_table" "shared_services" {
      transit_gateway_id = aws_ec2_transit_gateway.main.id
      
      tags = {
        Name = "shared-services-rt"
      }
    }
    
    # Route table associations
    resource "aws_ec2_transit_gateway_route_table_association" "app_vpc_prod" {
      transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.app_vpc.id
      transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
    }
    
    # Static routes for specific traffic patterns
    resource "aws_ec2_transit_gateway_route" "to_inspection_vpc" {
      destination_cidr_block         = "0.0.0.0/0"
      transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
      transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
    }
    
    # Route table propagations
    resource "aws_ec2_transit_gateway_route_table_propagation" "prod_to_shared" {
      transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared_services.id
      transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
    }
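
The routes above point at `inspection` and `shared_services` attachments that are not defined in this guide. A minimal sketch of what they might look like follows. The three variables are hypothetical inputs, and `aws_vpc.inspection` is assumed to be defined elsewhere (it is referenced again in the Network Firewall section).

```hcl
# Hypothetical inputs for VPCs and subnets managed elsewhere.
variable "inspection_tgw_subnet_ids" {
  type = list(string)
}

variable "shared_services_vpc_id" {
  type = string
}

variable "shared_services_subnet_ids" {
  type = list(string)
}

resource "aws_ec2_transit_gateway_vpc_attachment" "inspection" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.inspection.id # assumed to exist elsewhere
  subnet_ids         = var.inspection_tgw_subnet_ids

  # Keeps both directions of a flow on the same firewall endpoint.
  appliance_mode_support = "enable"

  tags = {
    Name = "inspection-vpc-attachment"
  }
}

resource "aws_ec2_transit_gateway_vpc_attachment" "shared_services" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = var.shared_services_vpc_id
  subnet_ids         = var.shared_services_subnet_ids

  tags = {
    Name = "shared-services-attachment"
  }
}

# Return path: shared services learns the application VPC's routes.
resource "aws_ec2_transit_gateway_route_table_association" "shared_services" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared_services.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared_services.id
}

resource "aws_ec2_transit_gateway_route_table_propagation" "app_to_shared" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.app_vpc.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared_services.id
}
```

Because each route table only knows the routes propagated into it, traffic needs a propagation in both directions before production and shared services can talk to each other.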
    
      

    📊 Monitoring and Logging Configuration

    Comprehensive monitoring is essential for maintaining security and performance in private cloud networks. Implement these CloudWatch and VPC Flow Log configurations.

    
    # monitoring.tf - Comprehensive Observability Setup
    # VPC Flow Logs for network traffic monitoring
    resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
      name              = "/aws/vpc/flow-logs"
      retention_in_days = 365
      
      tags = {
        Name = "vpc-flow-logs"
      }
    }
    
    resource "aws_iam_role" "vpc_flow_log_role" {
      name_prefix = "VPCFlowLogRole-"
      
      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Action = "sts:AssumeRole"
            Effect = "Allow"
            Principal = {
              Service = "vpc-flow-logs.amazonaws.com"
            }
          }
        ]
      })
    }
    
    resource "aws_iam_role_policy" "vpc_flow_log_policy" {
      name_prefix = "VPCFlowLogPolicy-"
      role        = aws_iam_role.vpc_flow_log_role.id
      
      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [
          {
            Effect = "Allow"
            Action = [
              "logs:CreateLogGroup",
              "logs:CreateLogStream",
              "logs:PutLogEvents",
              "logs:DescribeLogGroups",
              "logs:DescribeLogStreams"
            ]
            Resource = "*"
          }
        ]
      })
    }
    
    resource "aws_flow_log" "app_vpc" {
      iam_role_arn    = aws_iam_role.vpc_flow_log_role.arn
      log_destination = aws_cloudwatch_log_group.vpc_flow_logs.arn
      traffic_type    = "ALL"
      vpc_id          = aws_vpc.app_vpc.id
      
      tags = {
        Name = "app-vpc-flow-logs"
      }
    }
    
    # Transit Gateway Flow Logs
    # The AWS provider creates these with aws_flow_log, targeting either the
    # gateway (as here) or a single attachment via transit_gateway_attachment_id.
    resource "aws_flow_log" "tgw" {
      transit_gateway_id       = aws_ec2_transit_gateway.main.id
      log_destination          = aws_cloudwatch_log_group.vpc_flow_logs.arn
      iam_role_arn             = aws_iam_role.vpc_flow_log_role.arn
      max_aggregation_interval = 60 # Transit Gateway flow logs require 60 seconds
      
      tags = {
        Name = "tgw-flow-logs"
      }
    }
    
    # CloudWatch Alarms for security monitoring
    resource "aws_cloudwatch_log_metric_filter" "unauthorized_access" {
      name           = "UnauthorizedAccessAttempts"
      pattern        = "[version, account, eni, source, destination, srcport, destport, protocol, packets, bytes, windowstart, windowend, action = \"REJECT\", flowlogstatus]"
      log_group_name = aws_cloudwatch_log_group.vpc_flow_logs.name
      
      metric_transformation {
        name      = "UnauthorizedAccessCount"
        namespace = "VPC/Security"
        value     = "1"
      }
    }
    
    resource "aws_cloudwatch_metric_alarm" "high_unauthorized_access" {
      alarm_name          = "HighUnauthorizedAccessAttempts"
      comparison_operator = "GreaterThanThreshold"
      evaluation_periods  = 2
      metric_name         = "UnauthorizedAccessCount"
      namespace           = "VPC/Security"
      period              = 300
      statistic           = "Sum"
      threshold           = 10
      alarm_description   = "This metric monitors for high unauthorized access attempts"
      alarm_actions       = [aws_sns_topic.security_alerts.arn]
      
      tags = {
        Name = "unauthorized-access-alarm"
      }
    }
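
The alarm above publishes to `aws_sns_topic.security_alerts`, which is not defined in this guide. A minimal sketch of the topic and an email subscription follows. The topic name and address are hypothetical.

```hcl
# Sketch: notification topic for the security alarm.
resource "aws_sns_topic" "security_alerts" {
  name = "security-alerts" # hypothetical name
}

# Sketch: email subscription; the recipient must confirm it before delivery starts.
resource "aws_sns_topic_subscription" "security_email" {
  topic_arn = aws_sns_topic.security_alerts.arn
  protocol  = "email"
  endpoint  = "security-team@example.com" # hypothetical address
}
```

If you encrypt the topic, use a customer managed KMS key whose policy allows CloudWatch to publish, since alarms cannot deliver to topics encrypted with the AWS managed SNS key.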
    
      

    🔒 Advanced Security: Network Firewall & Security Hub

    For enterprise-grade security, integrate AWS Network Firewall and Security Hub to provide comprehensive threat protection and compliance monitoring.

    
    # advanced-security.tf - Enterprise Security Controls
    # AWS Network Firewall for deep packet inspection
    resource "aws_networkfirewall_firewall" "inspection" {
      name                = "inspection-firewall"
      firewall_policy_arn = aws_networkfirewall_firewall_policy.inspection.arn
      vpc_id              = aws_vpc.inspection.id
      
      subnet_mapping {
        subnet_id = aws_subnet.firewall.id
      }
      
      tags = {
        Name = "inspection-firewall"
      }
    }
    
    resource "aws_networkfirewall_firewall_policy" "inspection" {
      name = "inspection-policy"
      
      firewall_policy {
        stateless_default_actions          = ["aws:forward_to_sfe"]
        stateless_fragment_default_actions = ["aws:forward_to_sfe"]
        
        stateful_rule_group_reference {
          # priority is required when rule_order is STRICT_ORDER
          priority     = 1
          resource_arn = aws_networkfirewall_rule_group.threat_prevention.arn
        }
        
        stateful_engine_options {
          rule_order = "STRICT_ORDER"
        }
      }
      
      tags = {
        Name = "inspection-firewall-policy"
      }
    }
    
    # Security Hub integration for compliance monitoring
    resource "aws_securityhub_account" "main" {}
    
    resource "aws_securityhub_standards_subscription" "cis" {
      depends_on    = [aws_securityhub_account.main]
      standards_arn = "arn:aws:securityhub:::ruleset/cis-aws-foundations-benchmark/v/1.2.0"
    }
    
    resource "aws_securityhub_standards_subscription" "pci" {
      depends_on    = [aws_securityhub_account.main]
      standards_arn = "arn:aws:securityhub:${var.region}::standards/pci-dss/v/3.2.1"
    }
    
    # GuardDuty for threat detection
    resource "aws_guardduty_detector" "main" {
      enable = true
      
      datasources {
        s3_logs {
          enable = true
        }
        kubernetes {
          audit_logs {
            enable = false
          }
        }
        malware_protection {
          scan_ec2_instance_with_findings {
            ebs_volumes {
              enable = true
            }
          }
        }
      }
    }
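
The firewall policy above references `aws_networkfirewall_rule_group.threat_prevention` without defining it. The sketch below shows one possible shape for that rule group. The capacity and the single Suricata rule are illustrative placeholders, not a recommended rule set.

```hcl
# Sketch: stateful rule group referenced by the inspection policy.
resource "aws_networkfirewall_rule_group" "threat_prevention" {
  name     = "threat-prevention"
  type     = "STATEFUL"
  capacity = 100 # illustrative; capacity cannot be changed after creation

  rule_group {
    # Must match the policy's rule_order of STRICT_ORDER.
    stateful_rule_options {
      rule_order = "STRICT_ORDER"
    }

    rules_source {
      # Illustrative Suricata-compatible rule: drop outbound Telnet.
      rules_string = <<-EOT
        drop tcp $HOME_NET any -> $EXTERNAL_NET 23 (msg:"Block outbound telnet"; sid:1000001; rev:1;)
      EOT
    }
  }

  tags = {
    Name = "threat-prevention-rules"
  }
}
```

By default `$HOME_NET` resolves to the CIDR of the VPC that hosts the firewall, so a centralized inspection VPC would usually override it with the application VPC ranges through a `rule_variables` block.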
    
      

    ⚡ Key Takeaways

    1. Zero-Trust Architecture: Implement private subnets exclusively and use VPC endpoints for AWS service access
    2. Centralized Networking: Leverage Transit Gateway for simplified multi-VPC management and routing
    3. Secure Access Patterns: Replace bastion hosts with SSM Session Manager for improved security and auditability
    4. Comprehensive Monitoring: Implement VPC Flow Logs, Transit Gateway Flow Logs, and Security Hub for full visibility
    5. Infrastructure as Code: Use Terraform to ensure consistent, repeatable deployments across environments
    6. Advanced Security: Integrate Network Firewall and GuardDuty for enterprise-grade threat protection
    7. Cost Optimization: Reduce data transfer costs and eliminate NAT gateway expenses through proper architecture

    ❓ Frequently Asked Questions

    How does this architecture compare to traditional VPN/bastion host setups?
    This architecture eliminates public attack surfaces entirely. Instead of VPNs and bastion hosts with public IPs, we use AWS PrivateLink and SSM Session Manager, which provide more secure, auditable access without internet exposure. The attack surface is significantly reduced while maintaining full functionality.
    What are the cost implications of using Transit Gateway and multiple VPC endpoints?
    While there are hourly costs for Transit Gateway and VPC endpoints, these are often offset by eliminating NAT gateway costs and reducing data transfer charges. The architecture typically results in better cost predictability and can be more economical for enterprise-scale deployments compared to maintaining multiple NAT gateways and VPN connections.
    Can I use this architecture for HIPAA or PCI DSS compliant workloads?
    Yes, this architecture is well-suited for compliant workloads. The private network design, comprehensive logging, and advanced security controls align with HIPAA and PCI DSS requirements. However, you should conduct proper validation and implement additional controls specific to your compliance framework.
    How do I handle internet access for instances that need to download updates?
    For controlled internet access, implement a dedicated egress VPC with NAT gateways or AWS Network Firewall. Route specific traffic through this inspection VPC rather than providing direct internet access. Alternatively, use VPC endpoints for AWS services and maintain internal repositories for software updates.
    What's the performance impact of using VPC endpoints versus public service endpoints?
    VPC endpoints typically provide equal or better performance since traffic stays within the AWS network. They eliminate internet latency and provide more consistent throughput. For most workloads, you'll see improved performance and reliability compared to public endpoints.
    How do I monitor and troubleshoot network issues in this private architecture?
    Implement VPC Flow Logs, Transit Gateway Flow Logs, and CloudWatch metrics extensively. Use SSM Session Manager for instance access and AWS X-Ray for application-level tracing. Centralize logs in CloudWatch Logs or S3 for analysis and set up alerts for unusual patterns or connectivity issues.

    💬 Have you implemented a similar private cloud architecture? Share your experiences, challenges, or questions in the comments below! If you found this guide helpful, please share it with your team or on social media to help others build more secure AWS environments.
