What's the biggest mistake teams make with Terraform cost optimization?

The most common mistake is treating infrastructure as static. Teams deploy fixed-size resources without implementing proper autoscaling, leading to massive over-provisioning. Modern applications need dynamic infrastructure that scales with actual demand.

How much can I realistically save with these techniques?

Most organizations save 40-70% on compute costs by implementing comprehensive autoscaling, spot instances, and right-sizing. One client reduced their $12,000 monthly AWS bill to $4,800 using the exact strategies outlined in this article.

Are spot instances reliable for production workloads?

Yes, with proper implementation. Use mixed instance policies with a base capacity of on-demand instances, implement spot interruption handling, and diversify across instance types and availability zones. Many companies run 80%+ of their production workload on spot instances.

How often should I review and update my Terraform scaling configurations?

Review scaling metrics weekly for the first month, then monthly thereafter. Use AWS Cost Explorer and CloudWatch dashboards to identify optimization opportunities. Major application changes should trigger immediate scaling policy reviews.

Can I implement these cost optimization techniques with Kubernetes?

Absolutely! The same principles apply. Use Kubernetes Cluster Autoscaler with spot instance node groups, implement Horizontal Pod Autoscaling, and use Karpenter for advanced node provisioning optimization.

How does this architecture compare to traditional VPN/bastion host setups?

This architecture eliminates public attack surfaces entirely. Instead of VPNs and bastion hosts with public IPs, we use AWS PrivateLink and SSM Session Manager, which provide more secure, auditable access without internet exposure. The attack surface is significantly reduced while maintaining full functionality.

What are the cost implications of using Transit Gateway and multiple VPC endpoints?

While there are hourly costs for Transit Gateway and VPC endpoints, these are often offset by eliminating NAT gateway costs and reducing data transfer charges. The architecture typically results in better cost predictability and can be more economical for enterprise-scale deployments compared to maintaining multiple NAT gateways and VPN connections.

Can I use this architecture for HIPAA or PCI DSS compliant workloads?

Yes, this architecture is well-suited for compliant workloads. The private network design, comprehensive logging, and advanced security controls align with HIPAA and PCI DSS requirements. However, you should conduct proper validation and implement additional controls specific to your compliance framework.

How do I handle internet access for instances that need to download updates?

For controlled internet access, implement a dedicated egress VPC with NAT gateways or AWS Network Firewall. Route outbound traffic through this egress VPC rather than giving instances direct internet access. Alternatively, use VPC endpoints for AWS services and maintain internal repositories for software updates.

What's the performance impact of using VPC endpoints versus public service endpoints?

VPC endpoints typically provide equal or better performance since traffic stays within the AWS network. They eliminate internet latency and provide more consistent throughput. For most workloads, you'll see improved performance and reliability compared to public endpoints.

How do I monitor and troubleshoot network issues in this private architecture?

Implement VPC Flow Logs, Transit Gateway Flow Logs, and CloudWatch metrics extensively. Use SSM Session Manager for instance access and AWS X-Ray for application-level tracing. Centralize logs in CloudWatch Logs or S3 for analysis and set up alerts for unusual patterns or connectivity issues.

LK‑TECH Academy – Master the Latest in Web & App Development: Terraform

Thursday, 23 October 2025

Serverless Containers: Deploying with AWS Fargate and ECS (2025 Complete Guide)

[Figure: AWS Fargate and ECS serverless container architecture, showing container orchestration without EC2 instances]

In 2025, serverless containers are a common default for deploying modern applications, combining the flexibility of containers with the operational simplicity of serverless computing. AWS Fargate with ECS lets teams run containers without managing servers or cluster capacity. This guide covers Fargate deployment patterns, cost optimization strategies, and implementation techniques, with Terraform examples throughout. Whether you're migrating from EC2 or building greenfield applications, the same building blocks apply.

🚀 Why Serverless Containers Dominate in 2025

The container ecosystem has matured significantly, with serverless options becoming the preferred choice for production workloads. Fargate's serverless approach eliminates the undifferentiated heavy lifting of cluster management while providing superior security, scalability, and cost efficiency. Here's why organizations are rapidly adopting this architecture:

Zero Infrastructure Management: No EC2 instances to patch, scale, or secure, so the team can focus on the application
Enhanced Security: Each task runs in its own isolated environment with its own network interface and task-level IAM role
Cost Optimization: Pay per second for the vCPU and memory each task requests, with no idle instances to pay for
Rapid Scaling: Scale out without capacity planning, though each new task still needs time to pull its image and start
Compliance Ready: Fargate is covered by AWS compliance programs; workload-level controls remain your responsibility

🔧 Fargate vs. Traditional ECS: Understanding the Evolution

While both Fargate and EC2-backed ECS use the same ECS control plane, their operational models differ significantly. Understanding these differences is crucial for making informed architectural decisions.

Fargate: Serverless compute engine - AWS manages the underlying infrastructure
ECS on EC2: You manage EC2 instances, scaling, and cluster capacity
Resource Allocation: Fargate uses task-level resource provisioning vs. instance-level in EC2
Pricing Model: Fargate bills per second for the vCPU and memory a task requests vs. EC2 billing for whole instances, whether used or idle
Operational Overhead: Fargate eliminates patching, scaling, and capacity management

💻 Infrastructure as Code: Terraform ECS Fargate Setup

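Let's start with a Terraform configuration for a production-oriented ECS Fargate cluster, covering networking, security, and monitoring. The snippets reference a few resources that are assumed to be defined elsewhere in your configuration, such as aws_ecr_repository.web_app, aws_secretsmanager_secret.database_url, aws_security_group.alb, aws_lb_target_group.web_app, aws_sns_topic.alerts, and var.region.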


# main.tf - Core ECS Fargate Infrastructure
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# ECS Cluster (Fargate doesn't require EC2 instances)
resource "aws_ecs_cluster" "main" {
  name = "production-fargate-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  configuration {
    execute_command_configuration {
      logging = "DEFAULT"
    }
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Fargate Task Definition with advanced features
resource "aws_ecs_task_definition" "web_app" {
  family                   = "web-app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  
  runtime_platform {
    cpu_architecture        = "X86_64"
    operating_system_family = "LINUX"
  }

  container_definitions = jsonencode([{
    name      = "web-app"
    image     = "${aws_ecr_repository.web_app.repository_url}:latest"
    essential = true
    
    portMappings = [{
      containerPort = 8080
      hostPort      = 8080
      protocol      = "tcp"
    }]

    environment = [
      { name = "NODE_ENV", value = "production" },
      { name = "LOG_LEVEL", value = "info" }
    ]

    secrets = [
      {
        name      = "DATABASE_URL"
        valueFrom = aws_secretsmanager_secret.database_url.arn
      }
    ]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/web-app"
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "ecs"
      }
    }

    healthCheck = {
      command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 60
    }

    # No resourceRequirements block: Fargate does not support GPU or Elastic
    # Inference accelerators. CPU and memory are set at task level above.
  }])

  ephemeral_storage {
    size_in_gib = 21
  }

  tags = {
    Application = "web-app"
    Environment = "production"
  }
}

🛡️ Advanced Networking & Security Configuration

Fargate tasks use the awsvpc network mode, which gives each task its own elastic network interface and its own security groups. Here's how to implement advanced networking patterns with security groups, VPC endpoints, and private subnets.


# networking.tf - Secure Fargate Networking
# VPC with private subnets only for Fargate
resource "aws_vpc" "fargate_vpc" {
  cidr_block           = "10.1.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "fargate-vpc"
  }
}

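# Availability zones used to spread the subnets
data "aws_availability_zones" "available" {
  state = "available"
}

# Private subnets for Fargate tasks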
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.fargate_vpc.id
  cidr_block        = cidrsubnet(aws_vpc.fargate_vpc.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "fargate-private-${count.index + 1}"
  }
}

# Security group for Fargate tasks
resource "aws_security_group" "fargate_tasks" {
  name_prefix = "fargate-tasks-"
  description = "Security group for Fargate tasks"
  vpc_id      = aws_vpc.fargate_vpc.id

  ingress {
    description     = "Application traffic from ALB"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  # No inbound rule is needed for ECS Exec or SSM Session Manager: the agent
  # opens outbound HTTPS (443) connections, which the egress rule below allows.

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "fargate-tasks-sg"
  }
}

# VPC endpoints for private ECS operation
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.fargate_vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id

  security_group_ids = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "ecr-api-endpoint"
  }
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = aws_vpc.fargate_vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id

  security_group_ids = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "ecr-dkr-endpoint"
  }
}

# ECS Service discovery for internal communication
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name        = "internal.ecs"
  description = "Internal service discovery namespace"
  vpc         = aws_vpc.fargate_vpc.id
}

resource "aws_service_discovery_service" "web_app" {
  name = "web-app"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.internal.id

    dns_records {
      ttl  = 10
      type = "A"
    }

    routing_policy = "MULTIVALUE"
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}
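
The interface endpoints above reference aws_security_group.vpc_endpoints, which the listing does not define. If the private subnets have no NAT gateway, tasks also need a path to S3 (ECR stores image layers there) and to CloudWatch Logs for the awslogs driver. A minimal sketch of those pieces follows; aws_route_table.private is a hypothetical route table associated with the private subnets, not something defined in this guide.

```hcl
# Security group referenced by the interface endpoints
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  description = "Allow HTTPS from Fargate tasks to interface endpoints"
  vpc_id      = aws_vpc.fargate_vpc.id

  ingress {
    description     = "HTTPS from Fargate tasks"
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.fargate_tasks.id]
  }
}

# Gateway endpoint so image layers can be pulled from S3 without a NAT gateway
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.fargate_vpc.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id] # hypothetical route table
}

# Interface endpoint so the awslogs driver can reach CloudWatch Logs
resource "aws_vpc_endpoint" "logs" {
  vpc_id              = aws_vpc.fargate_vpc.id
  service_name        = "com.amazonaws.${var.region}.logs"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
}
```

Because the task definition reads DATABASE_URL from Secrets Manager, a Secrets Manager interface endpoint would likely be needed in the same way.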

🚀 ECS Service Configuration with Advanced Features

Modern ECS services offer sophisticated deployment patterns, auto-scaling, and integration capabilities. Here's how to configure a production ECS service with a controlled deployment strategy, automatic rollback, load balancing, and auto scaling.


# service.tf - Advanced ECS Service Configuration
resource "aws_ecs_service" "web_app" {
  name            = "web-app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.web_app.arn
  desired_count   = 2
  # launch_type is omitted because it conflicts with capacity_provider_strategy

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.fargate_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.web_app.arn
    container_name   = "web-app"
    container_port   = 8080
  }

  service_registries {
    registry_arn = aws_service_discovery_service.web_app.arn
  }

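  # Rolling deployment with automatic rollback. The deployment circuit
  # breaker only works with the ECS (rolling update) controller; for
  # CodeDeploy blue/green, set type = "CODE_DEPLOY" and remove the
  # deployment_circuit_breaker block.
  deployment_controller {
    type = "ECS"
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }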

  # Advanced capacity provider strategy
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 2
  }

  enable_ecs_managed_tags = true
  propagate_tags          = "SERVICE"

  # Wait for steady state before continuing
  wait_for_steady_state = true

  tags = {
    Environment = "production"
    Application = "web-app"
  }
}

# Application Auto Scaling for Fargate service
resource "aws_appautoscaling_target" "web_app" {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.web_app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# CPU-based scaling policy
resource "aws_appautoscaling_policy" "web_app_cpu" {
  name               = "web-app-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.web_app.resource_id
  scalable_dimension = aws_appautoscaling_target.web_app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.web_app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

# Memory-based scaling policy
resource "aws_appautoscaling_policy" "web_app_memory" {
  name               = "web-app-memory-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.web_app.resource_id
  scalable_dimension = aws_appautoscaling_target.web_app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.web_app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }

    target_value       = 80.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
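
CPU and memory are not the only signals available. As a minimal sketch, assuming the service sits behind an Application Load Balancer declared as aws_lb.main (a hypothetical name; aws_lb_target_group.web_app is the target group the service already references), a third target tracking policy can scale on requests per task. The target value of 1000 is illustrative, not a recommendation.

```hcl
# Request-count-based scaling policy (sketch)
resource "aws_appautoscaling_policy" "web_app_requests" {
  name               = "web-app-request-count-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.web_app.resource_id
  scalable_dimension = aws_appautoscaling_target.web_app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.web_app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Format: <load balancer ARN suffix>/<target group ARN suffix>
      resource_label = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.web_app.arn_suffix}"
    }

    target_value       = 1000 # illustrative; tune from observed traffic
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

When several target tracking policies are attached to the same service, it scales out if any one of them calls for more capacity and scales in only when all of them allow it.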

🔐 IAM Roles & Security Best Practices

Proper IAM configuration is critical for Fargate security. Implement least privilege principles with task execution and task roles for secure container operations.


# iam.tf - Secure IAM Configuration for Fargate
# Task execution role: lets ECS pull images, fetch secrets, and write logs
resource "aws_iam_role" "ecs_task_execution_role" {
  name_prefix = "ecs-task-execution-"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Service = "ecs"
  }
}

# Attach managed policy for basic ECS operations
resource "aws_iam_role_policy_attachment" "ecs_task_execution_role_policy" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Custom task execution role policy for additional permissions
resource "aws_iam_role_policy" "ecs_task_execution_custom" {
  name_prefix = "ecs-task-execution-custom-"
  role        = aws_iam_role.ecs_task_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17"
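    Statement = [
      {
        # Only the secret that the task definition injects
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = aws_secretsmanager_secret.database_url.arn
      },
      {
        # Needed only for SSM parameters or customer managed KMS keys;
        # replace "*" with the specific parameter and key ARNs
        Effect = "Allow"
        Action = [
          "ssm:GetParameters",
          "kms:Decrypt"
        ]
        Resource = "*"
      },
      {
        # Log permissions scoped to this service's log group
        Effect = "Allow"
        Action = [
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "${aws_cloudwatch_log_group.ecs_web_app.arn}:*"
      }
    ]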
  })
}

# Task role for application-specific permissions
resource "aws_iam_role" "ecs_task_role" {
  name_prefix = "ecs-task-role-"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Service = "ecs"
  }
}

# Application-specific permissions for the task
resource "aws_iam_role_policy" "ecs_task_policy" {
  name_prefix = "ecs-task-policy-"
  role        = aws_iam_role.ecs_task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          "arn:aws:s3:::my-app-bucket",
          "arn:aws:s3:::my-app-bucket/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:UpdateItem",
          "dynamodb:Query",
          "dynamodb:Scan"
        ]
        Resource = "arn:aws:dynamodb:*:*:table/my-app-table"
      },
      {
        Effect = "Allow"
        Action = [
          "ses:SendEmail",
          "ses:SendRawEmail"
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "ses:FromAddress": "noreply@myapp.com"
          }
        }
      }
    ]
  })
}

📊 Advanced Monitoring & Observability

Comprehensive monitoring is essential for Fargate workloads. Implement Container Insights, custom metrics, and distributed tracing for full observability.


# monitoring.tf - Comprehensive Observability Setup
# CloudWatch Log Group for ECS tasks
resource "aws_cloudwatch_log_group" "ecs_web_app" {
  name              = "/ecs/web-app"
  retention_in_days = 30

  tags = {
    Application = "web-app"
    Environment = "production"
  }
}

# Container Insights for enhanced ECS monitoring
resource "aws_cloudwatch_log_group" "container_insights" {
  name              = "/aws/ecs/containerinsights/${aws_ecs_cluster.main.name}/performance"
  retention_in_days = 7

  tags = {
    Application = "web-app"
    Environment = "production"
  }
}

# CloudWatch alarms on the built-in AWS/ECS service metrics
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "ecs-web-app-cpu-utilization-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = "120"
  statistic           = "Average"
  threshold           = "80"
  alarm_description   = "This metric monitors ECS CPU utilization"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.web_app.name
  }

  tags = {
    Application = "web-app"
  }
}

resource "aws_cloudwatch_metric_alarm" "ecs_memory_high" {
  alarm_name          = "ecs-web-app-memory-utilization-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "MemoryUtilization"
  namespace           = "AWS/ECS"
  period              = "120"
  statistic           = "Average"
  threshold           = "85"
  alarm_description   = "This metric monitors ECS memory utilization"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.web_app.name
  }
}

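# Log group for ECS Exec session logs. It is only used when the cluster's
# execute_command_configuration sets logging = "OVERRIDE" and points at it.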
resource "aws_cloudwatch_log_group" "ecs_exec_sessions" {
  name              = "/ecs/exec-sessions"
  retention_in_days = 7

  tags = {
    Service = "ecs-exec"
  }
}

# X-Ray for distributed tracing
resource "aws_iam_role_policy_attachment" "xray_write" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess"
}

# Custom application metrics
resource "aws_cloudwatch_log_metric_filter" "application_errors" {
  name           = "WebAppErrorCount"
  pattern        = "ERROR"
  log_group_name = aws_cloudwatch_log_group.ecs_web_app.name

  metric_transformation {
    name      = "ErrorCount"
    namespace = "WebApp"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "web-app-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "ErrorCount"
  namespace           = "WebApp"
  period              = "300"
  statistic           = "Sum"
  threshold           = "10"
  alarm_description   = "Monitor application error rate"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  tags = {
    Application = "web-app"
  }
}
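
The alarms above publish to aws_sns_topic.alerts, which the listing does not define. A minimal sketch of that topic with an email subscription might look like the following; the topic name and var.alert_email are hypothetical and stand in for whatever notification channel you use.

```hcl
# SNS topic that receives the CloudWatch alarm notifications
resource "aws_sns_topic" "alerts" {
  name = "web-app-alerts" # hypothetical name
}

# Email subscription; the recipient must confirm it before messages arrive
resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email # hypothetical variable
}
```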

💰 Cost Optimization Strategies for Fargate

Fargate costs can be reduced through right-sizing, Fargate Spot, and demand-based scaling. Here are proven strategies for reducing costs while maintaining performance.

Right-size Task Resources: Use CloudWatch metrics to identify optimal CPU/memory allocations
Leverage Fargate Spot: Mix Spot and On-Demand capacity; Spot is discounted by up to 70%, but tasks can be interrupted with a two-minute warning
Implement Auto Scaling: Scale services based on actual demand patterns
Optimize Container Images: Smaller images pull faster, which shortens task startup and the billed time before the application serves traffic
Use Graviton Processors: ARM64 (Graviton) Fargate tasks offer better price/performance for images built for that architecture
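
As an illustration of the Graviton item, the only task-definition change is the CPU architecture. This is a sketch under assumptions: the web-app-arm family name is hypothetical, and the image tag must point at an arm64 or multi-architecture image, otherwise the container fails to start.

```hcl
# ARM64 (Graviton) variant of the web-app task definition (sketch)
resource "aws_ecs_task_definition" "web_app_arm" {
  family                   = "web-app-arm" # hypothetical family name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  task_role_arn            = aws_iam_role.ecs_task_role.arn

  runtime_platform {
    cpu_architecture        = "ARM64"
    operating_system_family = "LINUX"
  }

  container_definitions = jsonencode([{
    name      = "web-app"
    # Assumes this tag is an arm64 or multi-arch image
    image     = "${aws_ecr_repository.web_app.repository_url}:latest"
    essential = true
  }])
}
```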


# cost-optimization.tf - Fargate Cost Optimization
# Mixed capacity provider strategy for cost optimization
resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

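  # Keep one task on regular Fargate, then favor Spot 3:1 for the rest
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 1
    base              = 1
  }

  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }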
}

# Cost and usage reporting (report definitions must be created in us-east-1)
resource "aws_cur_report_definition" "fargate_costs" {
  report_name                = "fargate-cost-report"
  time_unit                  = "HOURLY"
  format                     = "Parquet"
  compression                = "Parquet"
  additional_schema_elements = ["RESOURCES"]
  s3_bucket                  = aws_s3_bucket.cost_reports.bucket
  s3_prefix                  = "fargate"
  s3_region                  = var.region
  # Parquet reports support only the ATHENA artifact
  additional_artifacts       = ["ATHENA"]

  report_versioning = "OVERWRITE_REPORT"
}

# Budget alerts. With no cost filter this budget tracks total account spend;
# add a filter to limit it to ECS/Fargate costs.
resource "aws_budgets_budget" "fargate_monthly" {
  name              = "fargate-monthly-budget"
  budget_type       = "COST"
  limit_amount      = "1000"
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2025-01-01_00:00"

  cost_types {
    include_credit             = false
    include_discount           = true
    include_other_subscription = true
    include_recurring          = true
    include_refund             = false
    include_subscription       = true
    include_support            = true
    include_tax                = true
    include_upfront            = true
    use_amortized              = false
    use_blended                = false
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}

⚡ Key Takeaways

Serverless First: Fargate eliminates infrastructure management while providing enterprise-grade container orchestration
Security by Design: Implement task-level IAM roles, private networking, and VPC endpoints for secure operations
Cost Optimization: Leverage Fargate Spot, right-sizing, and auto-scaling to optimize spending
Advanced Deployment Patterns: Use blue-green deployments and circuit breakers for reliable releases
Comprehensive Observability: Implement Container Insights, custom metrics, and distributed tracing
Infrastructure as Code: Use Terraform for reproducible, version-controlled deployments
Mixed Capacity Strategies: Combine Fargate and Fargate Spot for optimal cost and availability

❓ Frequently Asked Questions

When should I choose Fargate vs. ECS on EC2? Choose Fargate when you want to eliminate server management, have variable workloads, or need task-level isolation. Choose ECS on EC2 for predictable steady-state workloads, when you need GPU instances, or for cost optimization with Reserved Instances.
How does Fargate pricing work compared to EC2? Fargate bills per second for the vCPU and memory each task requests, while EC2 bills for each running instance whether or not its capacity is used. Fargate can be more cost-effective for spiky workloads but may be more expensive for consistent 24/7 workloads compared to properly sized EC2 Reserved Instances.
Can I use Fargate for stateful workloads or databases? Fargate is primarily designed for stateless workloads. While you can attach EFS volumes for persistent storage, it's not recommended for databases or other stateful services that require low-latency storage or specific instance types. Use RDS for databases, or EC2 when you need control over storage and instance type.
What's the cold start time for Fargate tasks? Fargate cold starts typically range from 30-90 seconds, depending on image size, task size, and network configuration. You can shorten this by using smaller container images, lazy-loading images with Seekable OCI (SOCI) indexes, and tuning health check intervals so new tasks are marked healthy sooner.
How do I debug Fargate tasks when something goes wrong? Use ECS Exec for direct shell access to running tasks, CloudWatch Logs for application logs, Container Insights for performance metrics, and X-Ray for distributed tracing. For tasks that have already stopped, check the stopped reason and container exit codes, which ECS keeps visible for a limited time after the task stops.
Can I use Fargate with GPU workloads? No. Fargate does not currently support GPUs, so GPU workloads need ECS on EC2 with GPU instance types or another GPU-capable service. This is one of the main reasons to keep an EC2-backed capacity provider alongside Fargate.
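
For the debugging answer above, ECS Exec needs two things that the earlier listings do not show: the service must set enable_execute_command = true, and the task role must be allowed to open SSM message channels. A minimal sketch of the IAM part, assuming the aws_iam_role.ecs_task_role defined in iam.tf (the policy name prefix is hypothetical):

```hcl
# Task-role permissions that ECS Exec relies on
resource "aws_iam_role_policy" "ecs_exec" {
  name_prefix = "ecs-exec-" # hypothetical prefix
  role        = aws_iam_role.ecs_task_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ssmmessages:CreateControlChannel",
          "ssmmessages:CreateDataChannel",
          "ssmmessages:OpenControlChannel",
          "ssmmessages:OpenDataChannel"
        ]
        Resource = "*"
      }
    ]
  })
}
```

With both in place, a shell can be opened with aws ecs execute-command, passing the cluster, task ID, container name, --interactive, and a --command such as "/bin/sh". Only tasks started after the setting is enabled can be reached this way.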

💬 Have you implemented Fargate in production? Share your experiences, challenges, or cost optimization tips in the comments below! If you found this guide helpful, please share it with your team or on social media to help others master serverless containers.

About LK-TECH Academy — Practical tutorials & explainers on software engineering, AI, and infrastructure. Follow for concise, hands-on guides.

Tags: AWS, aws 2025, aws fargate

Tuesday, 21 October 2025

Terraform Cost Optimization 2025: Autoscaling That Saves 40-70% on Cloud Bills

Infrastructure Cost Optimization: Writing Terraform that Autoscales and Saves Money

[Figure: Terraform infrastructure cost optimization with autoscaling groups, spot instances, and predictive scaling, showing 40-70% AWS cost savings]

Cloud infrastructure costs are spiraling out of control for many organizations, with wasted resources accounting for up to 35% of cloud spending. In 2025, smart Terraform configurations that leverage advanced autoscaling capabilities have become essential for maintaining competitive advantage. This comprehensive guide will show you how to write Terraform code that not only deploys infrastructure but actively optimizes costs through intelligent scaling, spot instance utilization, and resource right-sizing—potentially saving your organization thousands monthly.

🚀 Why Traditional Infrastructure Fails Cost Optimization

Traditional static infrastructure deployment, even with basic autoscaling, often leads to significant cost inefficiencies. Most teams over-provision "just to be safe," resulting in resources sitting idle 60-80% of the time. The 2025 approach requires infrastructure-as-code that understands cost optimization as a first-class requirement.

Over-provisioning syndrome: Teams deploy for peak load 24/7
Static resource allocation: Fixed instance sizes regardless of actual needs
Manual scaling decisions: Reactive rather than predictive scaling
Ignoring spot instances: Missing discounts of up to 90% compared with On-Demand pricing
No utilization tracking: Flying blind on actual resource usage

💡 Advanced Autoscaling Strategies for 2025

Modern autoscaling goes beyond simple CPU thresholds. Here are the advanced patterns you should implement:

Predictive scaling: Using ML to anticipate traffic patterns
Multi-metric scaling: Combining CPU, memory, queue depth, and custom metrics
Cost-aware scaling: Considering spot instance availability and pricing
Time-based scaling: Scheduled scaling for known patterns
Horizontal vs. vertical scaling: Choosing the right approach for your workload
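
Time-based scaling is the simplest of these patterns to express. A minimal sketch, assuming the aws_autoscaling_group.cost_optimized resource from the module in the next section and a workload that is quiet outside business hours; the sizes and cron times are illustrative only.

```hcl
# Scale down on weekday evenings (times are UTC)
resource "aws_autoscaling_schedule" "evening_scale_down" {
  scheduled_action_name  = "evening-scale-down"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  min_size               = 1
  max_size               = 4
  desired_capacity       = 1
  recurrence             = "0 20 * * 1-5"
  time_zone              = "Etc/UTC"
}

# Restore daytime capacity on weekday mornings
resource "aws_autoscaling_schedule" "morning_scale_up" {
  scheduled_action_name  = "morning-scale-up"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  min_size               = 2
  max_size               = 10
  desired_capacity       = 4
  recurrence             = "0 7 * * 1-5"
  time_zone              = "Etc/UTC"
}
```

Scheduled actions change the group's size bounds; metric-based policies keep operating within whatever bounds are currently in effect.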

💻 Complete Terraform Module for Cost-Optimized Autoscaling


# modules/cost-optimized-autoscaling/main.tf

terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Mixed instance policy for cost optimization
resource "aws_autoscaling_group" "cost_optimized" {
  name_prefix               = "cost-opt-asg-"
  max_size                  = var.max_size
  min_size                  = var.min_size
  desired_capacity          = var.desired_capacity
  health_check_grace_period = 300
  health_check_type         = "EC2"
  vpc_zone_identifier       = var.subnet_ids
  termination_policies      = ["OldestInstance", "OldestLaunchTemplate"]

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = var.on_demand_base_capacity
      on_demand_percentage_above_base_capacity = var.on_demand_percentage
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.cost_optimized.id
        version            = "$Latest"
      }

      override {
        instance_type = "t3.medium"
      }

      override {
        instance_type = "t3a.medium"
      }

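      # Graviton types such as t4g.medium need an arm64 AMI. The AMI data
      # source in this module is x86_64, so every override stays on x86.
      override {
        instance_type = "t2.medium"
      }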
    }
  }

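  # AWS provider 5.x uses repeated tag blocks (the list-style tags argument was removed)
  tag {
    key                 = "CostOptimized"
    value               = "true"
    propagate_at_launch = true
  }

  tag {
    key                 = "AutoScalingGroup"
    value               = "cost-optimized"
    propagate_at_launch = true
  }
}

# Predictive scaling is a separate policy resource, not a block on the ASG
resource "aws_autoscaling_policy" "predictive" {
  count                  = var.enable_predictive_scaling ? 1 : 0
  name                   = "cost-opt-predictive"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    max_capacity_breach_behavior = "IncreaseMaxCapacity"
    max_capacity_buffer          = var.predictive_buffer
    mode                         = "ForecastAndScale"
    scheduling_buffer_time       = var.scheduling_buffer

    metric_specification {
      target_value = 50 # assumed placeholder; tune for your workload
      predefined_metric_pair_specification {
        predefined_metric_type = "ASGCPUUtilization"
      }
    }
  }
}

# Target tracking scaling policies, one per entry in var.scaling_metrics
resource "aws_autoscaling_policy" "target_tracking" {
  for_each               = var.scaling_metrics
  name                   = "cost-opt-target-${each.key}"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = each.value
    }
    target_value = var.metric_targets[each.key]
  }
}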

# Launch template with optimized AMI and configuration
resource "aws_launch_template" "cost_optimized" {
  name_prefix   = "cost-opt-lt-"
  image_id      = data.aws_ami.optimized_ami.id
  instance_type = var.default_instance_type
  key_name      = var.key_name

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size           = var.volume_size
      volume_type           = "gp3"
      delete_on_termination = true
      encrypted             = true
    }
  }

  monitoring {
    enabled = true
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "cost-optimized-instance"
      Environment = var.environment
      Project     = var.project_name
    }
  }

  user_data = base64encode(templatefile("${path.module}/user_data.sh", {
    environment = var.environment
  }))
}

# CloudWatch alarms for cost-aware scaling
resource "aws_cloudwatch_metric_alarm" "scale_up_cost" {
  alarm_name          = "scale-up-cost-optimized"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "70"
  alarm_description   = "Scale up when CPU exceeds 70%"
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.cost_optimized.name
  }
}

resource "aws_cloudwatch_metric_alarm" "scale_down_cost" {
  alarm_name          = "scale-down-cost-optimized"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "3"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "30"
  alarm_description   = "Scale down when CPU below 30%"
  alarm_actions       = [aws_autoscaling_policy.scale_down.arn]

  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.cost_optimized.name
  }
}

# Data source for optimized AMI
data "aws_ami" "optimized_ami" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
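
The two CloudWatch alarms above point at `aws_autoscaling_policy.scale_up` and `aws_autoscaling_policy.scale_down`, which the module never defines. A minimal sketch of what they could look like follows. The policy names, adjustment sizes, and cooldowns are assumptions rather than values from the original module.

```hcl
# Add one unit of capacity when the scale-up alarm fires
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "cost-opt-scale-up"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

# Remove one unit of capacity when the scale-down alarm fires
resource "aws_autoscaling_policy" "scale_down" {
  name                   = "cost-opt-scale-down"
  autoscaling_group_name = aws_autoscaling_group.cost_optimized.name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1
  cooldown               = 300
}
```

Scaling in one unit at a time, with a longer evaluation window on the scale-down alarm, is a cautious default that trades a little cost for fewer capacity oscillations.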

🔧 Implementing Spot Instance Strategies

Spot instances can reduce compute costs by up to 90%, but require careful implementation. Here's how to use them effectively:

Capacity-optimized strategy: Automatically selects optimal spot pools
Mixed instances policy: Blend spot and on-demand instances
Spot interruption handling: Graceful handling of spot termination notices
Diversification: Using multiple instance types to improve availability

💻 Advanced Spot Instance Configuration


# Advanced spot instance configuration with interruption handling

resource "aws_autoscaling_group" "spot_optimized" {
  name_prefix         = "spot-opt-asg-"
  max_size            = 20
  min_size            = 2
  desired_capacity    = 4
  vpc_zone_identifier = var.subnet_ids

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1
      on_demand_percentage_above_base_capacity = 20
      # spot_instance_pools applies only to the lowest-price strategy, so it is omitted here
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.spot_optimized.id
        version            = "$Latest"
      }

      # Multiple instance types for better spot availability
      override {
        instance_type     = "t3.medium"
        weighted_capacity = "1"
      }

      override {
        instance_type     = "t3a.medium"
        weighted_capacity = "1"
      }

      override {
        instance_type     = "m5.large"
        weighted_capacity = "2"
      }

      override {
        instance_type     = "m5a.large"
        weighted_capacity = "2"
      }
    }
  }

  tag {
    key                 = "InstanceLifecycle"
    value               = "spot"
    propagate_at_launch = true
  }
}

# Spot instance interruption handler
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "spot-instance-interruption"
  description = "Capture spot instance interruption notices"

  event_pattern = jsonencode({
    source      = ["aws.ec2"]
    detail-type = ["EC2 Spot Instance Interruption Warning"]
  })
}

resource "aws_cloudwatch_event_target" "spot_interruption_lambda" {
  rule      = aws_cloudwatch_event_rule.spot_interruption.name
  target_id = "TriggerLambda"
  arn       = aws_lambda_function.spot_handler.arn
}
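
An EventBridge rule cannot invoke a Lambda function unless the function's resource policy allows it, and the configuration above does not grant that permission. The sketch below adds it, assuming `aws_lambda_function.spot_handler` is defined elsewhere (it is referenced above but not shown). The statement ID is an arbitrary label.

```hcl
# Allow the spot interruption rule to invoke the handler function
resource "aws_lambda_permission" "allow_spot_interruption_rule" {
  statement_id  = "AllowSpotInterruptionRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.spot_handler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.spot_interruption.arn
}
```

Scoping `source_arn` to the single rule keeps other EventBridge rules in the account from triggering the handler.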

📊 Monitoring and Cost Analytics

You can't optimize what you can't measure. Implement comprehensive cost monitoring:

Cost and Usage Reports (CUR): Detailed AWS cost tracking
Resource tagging: Complete cost allocation tagging
CloudWatch dashboards: Real-time cost and performance metrics
Custom metrics: Application-specific cost optimization metrics
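
For the tagging item, one common approach is to set provider-level default tags so that every taggable resource carries the same cost allocation keys. This is a sketch for the root module. `var.region` and the `CostCenter` value are assumptions, while `var.environment` and `var.project_name` reuse the variable names from the module above.

```hcl
provider "aws" {
  region = var.region

  # Applied to every taggable resource this provider creates
  default_tags {
    tags = {
      Environment = var.environment
      Project     = var.project_name
      CostCenter  = "platform" # hypothetical cost allocation value
      ManagedBy   = "terraform"
    }
  }
}
```

Instances launched by an Auto Scaling group do not inherit provider default tags. They get their tags from the group's `tag` blocks and the launch template's `tag_specifications`, so keep the same keys there. Tags also need to be activated as cost allocation tags in the billing console before they appear in cost reports.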

⚡ Key Takeaways for 2025 Cost Optimization

Implement mixed instance policies with spot instances for up to 90% savings
Use predictive scaling to anticipate traffic patterns and scale proactively
Right-size instances based on actual usage metrics, not guesswork
Implement comprehensive tagging for cost allocation and reporting
Monitor and adjust continuously using CloudWatch and Cost Explorer
Leverage Graviton instances for better price-performance ratio
Implement scheduling for non-production environments

❓ Frequently Asked Questions

What's the biggest mistake teams make with Terraform cost optimization?

The most common mistake is treating infrastructure as static. Teams deploy fixed-size resources without implementing proper autoscaling, leading to massive over-provisioning. Modern applications need dynamic infrastructure that scales with actual demand.

How much can I realistically save with these techniques?

Most organizations save 40-70% on compute costs by implementing comprehensive autoscaling, spot instances, and right-sizing. One client reduced their $12,000 monthly AWS bill to $4,800, a 60% reduction, using the strategies outlined in this article.

Are spot instances reliable for production workloads?

Yes, with proper implementation. Use mixed instance policies with a base capacity of on-demand instances, implement spot interruption handling, and diversify across instance types and availability zones. Many companies run 80%+ of their production workload on spot instances.

How often should I review and update my Terraform scaling configurations?

Review scaling metrics weekly for the first month, then monthly thereafter. Use AWS Cost Explorer and CloudWatch dashboards to identify optimization opportunities. Major application changes should trigger immediate scaling policy reviews.

Can I implement these cost optimization techniques with Kubernetes?

Absolutely! The same principles apply. Use Kubernetes Cluster Autoscaler or Karpenter to provision nodes from spot instance node groups, and implement Horizontal Pod Autoscaling for the workloads themselves.

💬 Found this article helpful? What's your biggest infrastructure cost challenge? Please leave a comment below or share it with your network to help others optimize their cloud spending!

About LK-TECH Academy — Practical tutorials & explainers on software engineering, AI, and infrastructure. Follow for concise, hands-on guides.

Tags: autoscaling, AWS, cloud-computing

Sunday, 19 October 2025

Building a Secure Private Cloud Network on AWS with Transit Gateway and SSM (2025 Guide)

[Figure: AWS Transit Gateway and SSM Session Manager private cloud architecture diagram, showing secure VPC networking with no public subnets]

In 2025, enterprise cloud architecture demands more sophisticated networking solutions that prioritize security, scalability, and operational efficiency. This comprehensive guide explores how to build a fully private, secure cloud network on AWS using Transit Gateway and Systems Manager (SSM). We'll dive deep into creating isolated VPC architectures, implementing zero-trust networking principles, and enabling secure administrative access without exposing resources to the public internet. Whether you're building a new cloud foundation or modernizing existing infrastructure, this architecture represents the gold standard for enterprise-grade AWS networking.

🚀 Why Private Cloud Networks Matter in 2025

The evolution of cloud security has shifted from perimeter-based defenses to zero-trust architectures where private networking is fundamental. In today's threat landscape, minimizing internet exposure isn't just best practice—it's essential for compliance, data protection, and risk management. Here's why this architecture is crucial:

Enhanced Security Posture: Eliminate public attack surfaces by keeping resources in private subnets
Regulatory Compliance: Private networking and comprehensive logging support frameworks such as HIPAA and PCI DSS
Cost Optimization: Reduce data transfer costs and NAT gateway expenses through optimized routing
Operational Excellence: Streamline management with centralized networking and secure access patterns
Future-Proof Architecture: Build a foundation that scales seamlessly across multiple accounts and regions

🔧 Core Components: Transit Gateway & SSM Session Manager

AWS Transit Gateway acts as a regional hub that simplifies network connectivity between VPCs, on-premises networks, and other AWS services. When combined with SSM Session Manager for secure bastion-free access, you create a powerful foundation for enterprise networking.

Let's examine the key components of this architecture:

AWS Transit Gateway: Centralized network transit hub with route tables and cross-region peering
VPC Endpoints: Private connectivity to AWS services without internet gateways
SSM Session Manager: Secure CLI and SSH access without bastion hosts or public IPs
Private Subnets: Isolated network segments with no internet ingress
Security Groups & NACLs: Micro-segmentation and network-level security controls

💻 Infrastructure as Code: Terraform Configuration

Let's start with the foundational Terraform code to provision our secure private network. This configuration sets up Transit Gateway, VPCs with only private subnets, and the necessary VPC endpoints for SSM.


# main.tf - Core Transit Gateway and VPC Configuration
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Transit Gateway for centralized routing
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Central Transit Gateway for private cloud"
  amazon_side_asn                 = 64512
  auto_accept_shared_attachments  = "enable"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
  
  tags = {
    Name = "main-tgw"
  }
}

# Application VPC with only private subnets
resource "aws_vpc" "app_vpc" {
  cidr_block           = "10.1.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name = "app-vpc-private"
  }
}

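# Availability zones used to spread the private subnets
data "aws_availability_zones" "available" {
  state = "available"
}

# Private subnets across multiple AZs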
resource "aws_subnet" "app_private" {
  count             = 3
  vpc_id            = aws_vpc.app_vpc.id
  cidr_block        = cidrsubnet(aws_vpc.app_vpc.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "app-private-${count.index + 1}"
  }
}

# Transit Gateway VPC attachment
resource "aws_ec2_transit_gateway_vpc_attachment" "app_vpc" {
  subnet_ids         = aws_subnet.app_private[*].id
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.app_vpc.id
  
  tags = {
    Name = "app-vpc-attachment"
  }
}
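
The attachment alone does not send any traffic to the Transit Gateway: each subnet also needs a route table entry that points at it. The sketch below fills that gap and defines the `aws_route_table.private` resources that the S3 gateway endpoint later in this guide refers to. The `10.0.0.0/8` supernet is an assumption about the internal address plan.

```hcl
# One private route table per subnet
resource "aws_route_table" "private" {
  count  = 3
  vpc_id = aws_vpc.app_vpc.id

  tags = {
    Name = "app-private-rt-${count.index + 1}"
  }
}

# Send internal traffic to the Transit Gateway once the attachment exists
resource "aws_route" "private_to_tgw" {
  count                  = 3
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "10.0.0.0/8" # assumed internal supernet
  transit_gateway_id     = aws_ec2_transit_gateway.main.id

  depends_on = [aws_ec2_transit_gateway_vpc_attachment.app_vpc]
}

resource "aws_route_table_association" "private" {
  count          = 3
  subnet_id      = aws_subnet.app_private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}
```

Standalone `aws_route` resources are used instead of inline `route` blocks so that the table can also receive routes managed elsewhere, such as those added by a gateway endpoint.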

🛡️ Implementing VPC Endpoints for Private Service Access

VPC endpoints are crucial for maintaining private network isolation while allowing necessary AWS service connectivity. Here's how to implement the essential endpoints for SSM and other critical services.


# vpc-endpoints.tf - PrivateLink Configuration for AWS Services
# SSM VPC Endpoint for Session Manager
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.app_vpc.id
  service_name        = "com.amazonaws.${var.region}.ssm"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.app_private[*].id
  
  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]
  
  tags = {
    Name = "ssm-endpoint"
  }
}

# Additional SSM endpoints for full functionality
resource "aws_vpc_endpoint" "ssm_messages" {
  vpc_id              = aws_vpc.app_vpc.id
  service_name        = "com.amazonaws.${var.region}.ssmmessages"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.app_private[*].id
  
  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]
  
  tags = {
    Name = "ssm-messages-endpoint"
  }
}

resource "aws_vpc_endpoint" "ec2_messages" {
  vpc_id              = aws_vpc.app_vpc.id
  service_name        = "com.amazonaws.${var.region}.ec2messages"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.app_private[*].id
  
  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]
  
  tags = {
    Name = "ec2-messages-endpoint"
  }
}

# S3 Gateway Endpoint for package downloads and logs
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.app_vpc.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
  
  tags = {
    Name = "s3-gateway-endpoint"
  }
}

# ECR endpoints for Docker image pulls
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.app_vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.app_private[*].id
  
  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]
  
  tags = {
    Name = "ecr-api-endpoint"
  }
}
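
Pulling images privately from ECR generally needs the `ecr.dkr` endpoint in addition to `ecr.api` and the S3 gateway endpoint, and Session Manager logging to CloudWatch needs a `logs` endpoint. A sketch that adds both with `for_each` follows. The local value name and the resource name `extra` are illustrative.

```hcl
locals {
  # Interface endpoints not covered by the resources above
  extra_interface_services = ["ecr.dkr", "logs"]
}

resource "aws_vpc_endpoint" "extra" {
  for_each = toset(local.extra_interface_services)

  vpc_id              = aws_vpc.app_vpc.id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = aws_subnet.app_private[*].id

  security_group_ids = [
    aws_security_group.vpc_endpoints.id
  ]

  tags = {
    Name = "${replace(each.value, ".", "-")}-endpoint"
  }
}
```

The same `for_each` pattern could replace the individual SSM endpoint resources above, at the cost of changing their resource addresses in state.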

🔐 Advanced Security Groups for Micro-Segmentation

Security groups provide essential micro-segmentation within your private network. Here's how to implement zero-trust security group rules that enforce least privilege access.


# security-groups.tf - Zero-Trust Security Configuration
# VPC Endpoints Security Group
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  description = "Security group for VPC endpoints"
  vpc_id      = aws_vpc.app_vpc.id
  
  # A single HTTPS rule covers SSM and every other interface endpoint.
  # Declaring the same port and CIDR twice is rejected as a duplicate rule.
  ingress {
    description = "HTTPS from private subnets"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.app_vpc.cidr_block]
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "vpc-endpoints-sg"
  }
}

# Application instances security group
resource "aws_security_group" "app_instances" {
  name_prefix = "app-instances-"
  description = "Security group for application instances"
  vpc_id      = aws_vpc.app_vpc.id
  
  # Session Manager needs no inbound rule: the SSM Agent opens outbound
  # HTTPS (443) connections to the VPC endpoints, so port 22 stays closed.
  
  ingress {
    description = "Application traffic from internal"
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.app_vpc.cidr_block]
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "app-instances-sg"
  }
}

# Database security group with strict rules
resource "aws_security_group" "database" {
  name_prefix = "database-"
  description = "Security group for database instances"
  vpc_id      = aws_vpc.app_vpc.id
  
  ingress {
    description = "PostgreSQL from app instances"
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    security_groups = [aws_security_group.app_instances.id]
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "database-sg"
  }
}
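
The groups above still allow all outbound traffic, which is looser than least privilege suggests for a database tier. One possible tightening is sketched below as a separate, hypothetical `database_strict` group that uses standalone rule resources and defines no egress at all. It is an alternative to the `database` group above, not something to combine with its inline rules.

```hcl
# No inline ingress or egress blocks: rules live in standalone resources.
# Terraform removes the default allow-all egress rule on groups it creates,
# so this group ends up with no outbound rules.
resource "aws_security_group" "database_strict" {
  name_prefix = "database-strict-"
  description = "Database security group without egress rules"
  vpc_id      = aws_vpc.app_vpc.id

  tags = {
    Name = "database-strict-sg"
  }
}

# Only the application instances may reach PostgreSQL
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.database_strict.id
  referenced_security_group_id = aws_security_group.app_instances.id
  description                  = "PostgreSQL from app instances"
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```

Security groups are stateful, so replies to allowed inbound connections still flow without an egress rule.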

🚀 Configuring SSM Session Manager for Secure Access

SSM Session Manager eliminates the need for bastion hosts and provides secure, auditable access to EC2 instances. Here's the complete IAM and SSM configuration.


# iam-ssm.tf - IAM Roles and SSM Configuration
# SSM Instance Role
resource "aws_iam_role" "ssm_instance_role" {
  name_prefix = "SSMInstanceRole-"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
  
  tags = {
    Name = "ssm-instance-role"
  }
}

# AmazonSSMManagedInstanceCore policy attachment
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm_instance_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Custom SSM policy for additional permissions
resource "aws_iam_role_policy" "ssm_custom" {
  name_prefix = "SSMCustomPolicy-"
  role        = aws_iam_role.ssm_instance_role.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket",
          "s3:GetEncryptionConfiguration" # needed when s3EncryptionEnabled is true
        ]
        Resource = [
          "arn:aws:s3:::my-ssm-logs-bucket/*",
          "arn:aws:s3:::my-ssm-logs-bucket"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "logs:DescribeLogGroups",
          "logs:DescribeLogStreams"
        ]
        Resource = "*"
      }
    ]
  })
}

# Instance Profile for EC2 instances
resource "aws_iam_instance_profile" "ssm_instance" {
  name_prefix = "SSMInstanceProfile-"
  role        = aws_iam_role.ssm_instance_role.name
}

# SSM Document for session preferences
resource "aws_ssm_document" "session_preferences" {
  name          = "SSM-SessionManagerRunShell"
  document_type = "Session"
  
  content = jsonencode({
    schemaVersion = "1.0"
    description   = "Document to hold regional session settings"
    sessionType   = "Standard_Stream"
    inputs = {
      s3BucketName                = "my-ssm-logs-bucket"
      s3KeyPrefix                 = "ssm-sessions"
      s3EncryptionEnabled         = true
      cloudWatchLogGroupName      = "/aws/ssm/sessions"
      cloudWatchEncryptionEnabled = true
      cloudWatchStreamingEnabled  = true
      idleSessionTimeout          = "20"
      maxSessionDuration          = "60"
      shellProfile = {
        linux = "echo 'Welcome to Secure Session Manager'"
      }
    }
  })
  
  tags = {
    Name = "session-preferences"
  }
}
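
To show how these pieces connect, here is a sketch of an instance that uses the profile above in a private subnet. `var.app_ami_id` and the instance type are assumptions; the AMI needs the SSM Agent installed, as Amazon Linux images do by default.

```hcl
resource "aws_instance" "app" {
  ami                         = var.app_ami_id # hypothetical variable
  instance_type               = "t3.micro"
  subnet_id                   = aws_subnet.app_private[0].id
  vpc_security_group_ids      = [aws_security_group.app_instances.id]
  iam_instance_profile        = aws_iam_instance_profile.ssm_instance.name
  associate_public_ip_address = false

  # Require IMDSv2 for instance metadata requests
  metadata_options {
    http_tokens = "required"
  }

  tags = {
    Name = "app-instance"
  }
}

# Convenience output: the CLI command that opens a session to the instance
output "ssm_start_session_command" {
  value = "aws ssm start-session --target ${aws_instance.app.id}"
}
```

Running the printed command requires the Session Manager plugin for the AWS CLI and IAM permission to call `ssm:StartSession` on the instance.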

🔄 Advanced Transit Gateway Routing

Transit Gateway route tables enable sophisticated routing patterns for multi-VPC architectures. Here's how to implement advanced routing with segregation and security controls.


# tgw-routing.tf - Advanced Transit Gateway Configuration
# Segregated route tables for different environments
resource "aws_ec2_transit_gateway_route_table" "production" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  
  tags = {
    Name = "production-rt"
  }
}

resource "aws_ec2_transit_gateway_route_table" "development" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  
  tags = {
    Name = "development-rt"
  }
}

resource "aws_ec2_transit_gateway_route_table" "shared_services" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  
  tags = {
    Name = "shared-services-rt"
  }
}

# Route table associations
resource "aws_ec2_transit_gateway_route_table_association" "app_vpc_prod" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.app_vpc.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
}

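# Static routes for specific traffic patterns
# (the inspection and shared_services attachments referenced below are assumed
# to be defined alongside their own VPCs; they are not shown in this guide)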
resource "aws_ec2_transit_gateway_route" "to_inspection_vpc" {
  destination_cidr_block         = "0.0.0.0/0"
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.inspection.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
}

# Route table propagations
resource "aws_ec2_transit_gateway_route_table_propagation" "prod_to_shared" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared_services.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
}
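
Segregated route tables only isolate environments if nothing routes between them. One way to make that explicit is a blackhole route, sketched below together with a propagation that lets shared services find their way back to the application VPC. `var.development_cidr` is a hypothetical variable, and the `shared_services` route table is the one defined above.

```hcl
# Drop any production traffic addressed to the development range
resource "aws_ec2_transit_gateway_route" "prod_blackhole_dev" {
  destination_cidr_block         = var.development_cidr # hypothetical variable
  blackhole                      = true
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.production.id
}

# Let the shared services route table learn the application VPC's CIDR
resource "aws_ec2_transit_gateway_route_table_propagation" "app_to_shared" {
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.app_vpc.id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared_services.id
}
```

Because the production table sends `0.0.0.0/0` to the inspection VPC, the more specific blackhole route wins for the development range under longest-prefix matching.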

📊 Monitoring and Logging Configuration

Comprehensive monitoring is essential for maintaining security and performance in private cloud networks. Implement these CloudWatch and VPC Flow Log configurations.


# monitoring.tf - Comprehensive Observability Setup
# VPC Flow Logs for network traffic monitoring
resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
  name              = "/aws/vpc/flow-logs"
  retention_in_days = 365
  
  tags = {
    Name = "vpc-flow-logs"
  }
}

resource "aws_iam_role" "vpc_flow_log_role" {
  name_prefix = "VPCFlowLogRole-"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "vpc-flow-logs.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "vpc_flow_log_policy" {
  name_prefix = "VPCFlowLogPolicy-"
  role        = aws_iam_role.vpc_flow_log_role.id
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "logs:DescribeLogGroups",
          "logs:DescribeLogStreams"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_flow_log" "app_vpc" {
  iam_role_arn    = aws_iam_role.vpc_flow_log_role.arn
  log_destination = aws_cloudwatch_log_group.vpc_flow_logs.arn
  traffic_type    = "ALL"
  vpc_id          = aws_vpc.app_vpc.id
  
  tags = {
    Name = "app-vpc-flow-logs"
  }
}

# Transit Gateway Flow Logs: these use aws_flow_log with a transit gateway
# target, and get their own log group because the record format differs
resource "aws_cloudwatch_log_group" "tgw_flow_logs" {
  name              = "/aws/tgw/flow-logs"
  retention_in_days = 365
}

resource "aws_flow_log" "tgw" {
  transit_gateway_id       = aws_ec2_transit_gateway.main.id
  log_destination          = aws_cloudwatch_log_group.tgw_flow_logs.arn
  iam_role_arn             = aws_iam_role.vpc_flow_log_role.arn
  max_aggregation_interval = 60 # Transit Gateway flow logs require 60 seconds

  tags = {
    Name = "tgw-flow-logs"
  }
}

# CloudWatch Alarms for security monitoring
resource "aws_cloudwatch_log_metric_filter" "unauthorized_access" {
  name           = "UnauthorizedAccessAttempts"
  pattern        = "[version, account, eni, source, destination, srcport, destport, protocol, packets, bytes, windowstart, windowend, action = \"REJECT\", flowlogstatus]"
  log_group_name = aws_cloudwatch_log_group.vpc_flow_logs.name
  
  metric_transformation {
    name      = "UnauthorizedAccessCount"
    namespace = "VPC/Security"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "high_unauthorized_access" {
  alarm_name          = "HighUnauthorizedAccessAttempts"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "UnauthorizedAccessCount"
  namespace           = "VPC/Security"
  period              = "300"
  statistic           = "Sum"
  threshold           = "10"
  alarm_description   = "This metric monitors for high unauthorized access attempts"
  alarm_actions       = [aws_sns_topic.security_alerts.arn]
  
  tags = {
    Name = "unauthorized-access-alarm"
  }
}
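
The alarm above publishes to `aws_sns_topic.security_alerts`, which is not defined in this file. A minimal sketch of the topic and an email subscription follows. The topic name and `var.security_alert_email` are assumptions.

```hcl
resource "aws_sns_topic" "security_alerts" {
  name = "security-alerts"

  tags = {
    Name = "security-alerts"
  }
}

# Email subscriptions stay pending until the recipient confirms them
resource "aws_sns_topic_subscription" "security_email" {
  topic_arn = aws_sns_topic.security_alerts.arn
  protocol  = "email"
  endpoint  = var.security_alert_email # hypothetical variable
}
```

If the topic is later encrypted, use a customer managed KMS key whose policy allows CloudWatch to use it. Alarms cannot publish to topics encrypted with the default AWS managed SNS key.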

🔒 Advanced Security: Network Firewall & Security Hub

For enterprise-grade security, integrate AWS Network Firewall and Security Hub to provide comprehensive threat protection and compliance monitoring.


# advanced-security.tf - Enterprise Security Controls
# AWS Network Firewall for deep packet inspection
resource "aws_networkfirewall_firewall" "inspection" {
  name                = "inspection-firewall"
  firewall_policy_arn = aws_networkfirewall_firewall_policy.inspection.arn
  vpc_id              = aws_vpc.inspection.id
  
  subnet_mapping {
    subnet_id = aws_subnet.firewall.id
  }
  
  tags = {
    Name = "inspection-firewall"
  }
}

resource "aws_networkfirewall_firewall_policy" "inspection" {
  name = "inspection-policy"
  
  firewall_policy {
    stateless_default_actions          = ["aws:forward_to_sfe"]
    stateless_fragment_default_actions = ["aws:forward_to_sfe"]
    
    stateful_rule_group_reference {
      priority     = 1 # required when rule_order is STRICT_ORDER
      resource_arn = aws_networkfirewall_rule_group.threat_prevention.arn
    }
    
    stateful_engine_options {
      rule_order = "STRICT_ORDER"
    }
  }
  
  tags = {
    Name = "inspection-firewall-policy"
  }
}

# Security Hub integration for compliance monitoring
resource "aws_securityhub_account" "main" {}

resource "aws_securityhub_standards_subscription" "cis" {
  depends_on    = [aws_securityhub_account.main]
  standards_arn = "arn:aws:securityhub:::ruleset/cis-aws-foundations-benchmark/v/1.2.0"
}

resource "aws_securityhub_standards_subscription" "pci" {
  depends_on    = [aws_securityhub_account.main]
  standards_arn = "arn:aws:securityhub:${var.region}::standards/pci-dss/v/3.2.1"
}

# GuardDuty for threat detection
resource "aws_guardduty_detector" "main" {
  enable = true
  
  datasources {
    s3_logs {
      enable = true
    }
    kubernetes {
      audit_logs {
        enable = false
      }
    }
    malware_protection {
      scan_ec2_instance_with_findings {
        ebs_volumes {
          enable = true
        }
      }
    }
  }
}

⚡ Key Takeaways

Zero-Trust Architecture: Implement private subnets exclusively and use VPC endpoints for AWS service access
Centralized Networking: Leverage Transit Gateway for simplified multi-VPC management and routing
Secure Access Patterns: Replace bastion hosts with SSM Session Manager for improved security and auditability
Comprehensive Monitoring: Implement VPC Flow Logs, Transit Gateway Flow Logs, and Security Hub for full visibility
Infrastructure as Code: Use Terraform to ensure consistent, repeatable deployments across environments
Advanced Security: Integrate Network Firewall and GuardDuty for enterprise-grade threat protection
Cost Optimization: Reduce data transfer costs and eliminate NAT gateway expenses through proper architecture

❓ Frequently Asked Questions

How does this architecture compare to traditional VPN/bastion host setups?

This architecture eliminates public attack surfaces entirely. Instead of VPNs and bastion hosts with public IPs, we use AWS PrivateLink and SSM Session Manager, which provide more secure, auditable access without internet exposure. The attack surface is significantly reduced while maintaining full functionality.

What are the cost implications of using Transit Gateway and multiple VPC endpoints?

While there are hourly costs for Transit Gateway and VPC endpoints, these are often offset by eliminating NAT gateway costs and reducing data transfer charges. The architecture typically results in better cost predictability and can be more economical for enterprise-scale deployments compared to maintaining multiple NAT gateways and VPN connections.

Can I use this architecture for HIPAA or PCI DSS compliant workloads?

Yes, this architecture is well-suited for compliant workloads. The private network design, comprehensive logging, and advanced security controls align with HIPAA and PCI DSS requirements. However, you should conduct proper validation and implement additional controls specific to your compliance framework.

How do I handle internet access for instances that need to download updates?

For controlled internet access, implement a dedicated egress VPC with NAT gateways or AWS Network Firewall. Route specific traffic through this inspection VPC rather than providing direct internet access. Alternatively, use VPC endpoints for AWS services and maintain internal repositories for software updates.

What's the performance impact of using VPC endpoints versus public service endpoints?

VPC endpoints typically provide equal or better performance since traffic stays within the AWS network. They eliminate internet latency and provide more consistent throughput. For most workloads, you'll see improved performance and reliability compared to public endpoints.

How do I monitor and troubleshoot network issues in this private architecture?

Implement VPC Flow Logs, Transit Gateway Flow Logs, and CloudWatch metrics extensively. Use SSM Session Manager for instance access and AWS X-Ray for application-level tracing. Centralize logs in CloudWatch Logs or S3 for analysis and set up alerts for unusual patterns or connectivity issues.

💬 Have you implemented a similar private cloud architecture? Share your experiences, challenges, or questions in the comments below! If you found this guide helpful, please share it with your team or on social media to help others build more secure AWS environments.

Tags: aws 2025, aws networking, aws transit gateway
