Running Ghost on AWS requires visibility into application performance, system health, and user experience. Without proper monitoring, you're flying blind - unable to detect issues before users complain or understand what went wrong when problems occur. This post details a comprehensive monitoring setup that provides complete operational visibility at a reasonable cost.

The monitoring infrastructure we've built provides comprehensive visibility with seven operational alarms plus intelligent deployment suppression, dashboard widgets showing real-time metrics, and automated Log Insights queries for rapid troubleshooting. The system is designed to detect issues like database connection spikes, memory leaks, and potential DDoS attempts before they impact users.

The Monitoring Architecture

The monitoring system consists of CloudWatch dashboards displaying real-time metrics, SNS topics delivering email alerts, and Log Insights queries for troubleshooting. Every component generates metrics that flow into a unified dashboard providing a single pane of glass for operations.

graph TB
    subgraph Ghost Infrastructure
        ECS[ECS Fargate<br/>Ghost + Nginx]
        ALB[Application<br/>Load Balancer]
        RDS[(Aurora<br/>Serverless)]
        WAF[AWS WAF]
    end

    subgraph Monitoring Layer
        CW[CloudWatch<br/>Metrics]
        Logs[CloudWatch<br/>Logs]
        Insights[Container<br/>Insights]
    end

    subgraph Alerting
        Alarms[CloudWatch<br/>Alarms]
        SNS[SNS Topics]
        Email[Email<br/>Notifications]
    end

    subgraph Visualization
        Dashboard[CloudWatch<br/>Dashboard]
        Queries[Log Insights<br/>Queries]
    end

    ECS --> CW
    ECS --> Logs
    ECS --> Insights
    ALB --> CW
    RDS --> CW
    WAF --> CW

    CW --> Dashboard
    Logs --> Queries

    CW --> Alarms
    Alarms --> SNS
    SNS --> Email

    Dashboard --> Ops[Operations Team]
    Email --> Ops

Each layer serves a specific purpose. The infrastructure layer generates metrics and logs. CloudWatch aggregates these into actionable insights. Alarms detect anomalies and trigger notifications. Dashboards provide visual confirmation and historical context.

CloudWatch Dashboard Setup

The CDK creates a comprehensive dashboard with six widget groups monitoring different aspects of the system. The dashboard defaults to a one-hour time window; during incidents you can narrow the range and enable one-minute auto-refresh in the console for near-real-time visibility.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

export class GhostMonitoring extends Construct {
  public readonly dashboard: cloudwatch.Dashboard;

  constructor(scope: Construct, id: string, props: GhostMonitoringProps) {
    super(scope, id);

    // Create unified monitoring dashboard
    // (dashboard names allow only letters, digits, "-" and "_", so replace every dot)
    this.dashboard = new cloudwatch.Dashboard(this, 'Dashboard', {
      dashboardName: `ghost-cms-${props.domainName.replace(/\./g, '-')}`,
      defaultInterval: cdk.Duration.hours(1),
    });

    // ECS Service Metrics - CPU, Memory, Task Count
    const ecsMetricsWidget = new cloudwatch.GraphWidget({
      title: 'ECS Service Metrics',
      left: [
        new cloudwatch.Metric({
          namespace: 'AWS/ECS',
          metricName: 'CPUUtilization',
          dimensionsMap: {
            ClusterName: props.cluster.clusterName,
            ServiceName: props.service.serviceName,
          },
          statistic: 'Average',
          label: 'CPU Utilization',
        }),
        new cloudwatch.Metric({
          namespace: 'AWS/ECS',
          metricName: 'MemoryUtilization',
          dimensionsMap: {
            ClusterName: props.cluster.clusterName,
            ServiceName: props.service.serviceName,
          },
          statistic: 'Average',
          label: 'Memory Utilization',
        }),
      ],
      right: [
        new cloudwatch.Metric({
          namespace: 'AWS/ECS',
          metricName: 'RunningTaskCount',
          dimensionsMap: {
            ClusterName: props.cluster.clusterName,
            ServiceName: props.service.serviceName,
          },
          statistic: 'Average',
          label: 'Running Tasks',
        }),
      ],
    });

    this.dashboard.addWidgets(ecsMetricsWidget);

The dashboard displays ECS metrics showing container resource utilization, ALB metrics tracking request patterns and response times, target health indicating container availability, HTTP status code distribution revealing error rates, database metrics monitoring connections and capacity, and performance metrics tracking query latency.
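The ALB and database widgets follow the same pattern as the ECS widget above. The sketch below shows roughly what they could look like; props.loadBalancer and props.database are assumed to be the ALB and Aurora cluster handed to the construct, and the metric names are the standard AWS/ApplicationELB and AWS/RDS ones.

// ALB request volume and latency (assumes the ALB construct exposes loadBalancerFullName;
// otherwise derive it from the ARN as shown in the alerting section below)
const albMetricsWidget = new cloudwatch.GraphWidget({
  title: 'ALB Requests & Latency',
  left: [
    new cloudwatch.Metric({
      namespace: 'AWS/ApplicationELB',
      metricName: 'RequestCount',
      dimensionsMap: { LoadBalancer: props.loadBalancer.loadBalancerFullName },
      statistic: 'Sum',
      label: 'Requests',
    }),
  ],
  right: [
    new cloudwatch.Metric({
      namespace: 'AWS/ApplicationELB',
      metricName: 'TargetResponseTime',
      dimensionsMap: { LoadBalancer: props.loadBalancer.loadBalancerFullName },
      statistic: 'Average',
      label: 'Response Time (s)',
    }),
  ],
});

// Aurora connections and serverless capacity (assumes the cluster is passed as props.database)
const dbMetricsWidget = new cloudwatch.GraphWidget({
  title: 'Database Connections & Capacity',
  left: [
    new cloudwatch.Metric({
      namespace: 'AWS/RDS',
      metricName: 'DatabaseConnections',
      dimensionsMap: { DBClusterIdentifier: props.database.clusterIdentifier },
      statistic: 'Average',
    }),
  ],
  right: [
    new cloudwatch.Metric({
      namespace: 'AWS/RDS',
      metricName: 'ServerlessDatabaseCapacity',
      dimensionsMap: { DBClusterIdentifier: props.database.clusterIdentifier },
      statistic: 'Average',
    }),
  ],
});

this.dashboard.addWidgets(albMetricsWidget, dbMetricsWidget);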

Alerting Strategy

The alerting system uses seven operational alarms covering the most critical failure modes, plus a composite alarm that intelligently suppresses false positives during deployments.

// Create SNS topic for all alarms
this.alarmTopic = new sns.Topic(this, 'AlarmTopic', {
  displayName: 'Ghost CMS Alarms',
});

// Add email subscription for immediate notification
if (props.alertEmail) {
  this.alarmTopic.addSubscription(
    new snsSubscriptions.EmailSubscription(props.alertEmail),
  );
}

// High CPU Alarm - indicates scaling need or runaway process
const cpuAlarm = new cloudwatch.Alarm(this, 'HighCpuAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ECS',
    metricName: 'CPUUtilization',
    dimensionsMap: {
      ClusterName: props.cluster.clusterName,
      ServiceName: props.service.serviceName,
    },
    statistic: 'Average',
  }),
  threshold: 80,
  evaluationPeriods: 2,
  datapointsToAlarm: 2,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  alarmDescription: 'Alarm when CPU exceeds 80%',
});
cpuAlarm.addAlarmAction(new cloudwatchActions.SnsAction(this.alarmTopic));

The operational alarms deployed cover:

  • CPU utilization over 80% - indicates scaling needs or runaway processes
  • Memory utilization over 80% - suggests memory leaks in themes or plugins
  • Unhealthy targets (composite) - alerts only for sustained issues, not deployments
  • 5xx errors exceeding 10 in 5 minutes - reveals application issues (sketched after this list)
  • Response times over 2 seconds - indicates performance degradation
  • Database connections over 40 - prevents connection exhaustion
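As a further example, the 5xx alarm from the list above could be wired up roughly as follows. This is a sketch: it reuses the loadBalancerFullName value derived in the next section, and the threshold mirrors the 10-errors-in-5-minutes rule.

// 5xx error alarm - more than 10 server errors within a 5-minute window
const serverErrorAlarm = new cloudwatch.Alarm(this, 'Http5xxAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ApplicationELB',
    metricName: 'HTTPCode_Target_5XX_Count',
    dimensionsMap: { LoadBalancer: loadBalancerFullName },
    statistic: 'Sum',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  alarmDescription: 'More than 10 HTTP 5xx responses in 5 minutes',
});
serverErrorAlarm.addAlarmAction(new cloudwatchActions.SnsAction(this.alarmTopic));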

Intelligent Deployment Suppression

A key innovation in our monitoring is the composite alarm pattern that prevents false alerts during deployments:

// Extract correct dimensions from ARNs for CloudWatch metrics
const targetGroupFullName = cdk.Fn.select(
  5,
  cdk.Fn.split(':', props.targetGroup.targetGroupArn),
);
const arnParts = cdk.Fn.split('/', props.loadBalancer.loadBalancerArn);
const loadBalancerFullName = cdk.Fn.join('/', [
  cdk.Fn.select(1, arnParts),
  cdk.Fn.select(2, arnParts),
  cdk.Fn.select(3, arnParts),
]);

// Unhealthy host count metric built from the extracted dimensions
const unhealthyMetric = new cloudwatch.Metric({
  namespace: 'AWS/ApplicationELB',
  metricName: 'UnHealthyHostCount',
  dimensionsMap: {
    TargetGroup: targetGroupFullName,
    LoadBalancer: loadBalancerFullName,
  },
  statistic: 'Maximum',
});

// Base unhealthy alarm (no direct SNS action)
const unhealthyAlarm = new cloudwatch.Alarm(this, 'UnhealthyHostsAlarm', {
  metric: unhealthyMetric,
  threshold: 1,
  evaluationPeriods: 3,
  datapointsToAlarm: 3,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});

// Deployment detector alarm
const deploymentDetector = new cloudwatch.Alarm(this, 'DeploymentDetector', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ECS',
    metricName: 'RunningTaskCount',
    dimensionsMap: {
      ClusterName: props.cluster.clusterName,
      ServiceName: props.service.serviceName,
    },
  }),
  threshold: 2,
  evaluationPeriods: 1,
  comparisonOperator:
    cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
});

// Composite alarm: only alert when unhealthy AND not deploying
const unhealthyNotDeployingAlarm = new cloudwatch.CompositeAlarm(
  this,
  'UnhealthyNotDeploying',
  {
    compositeAlarmName: `ghost-unhealthy-not-deploying-${props.domainName.replace(
      /\./g,
      '-',
    )}`,
    alarmRule: cloudwatch.AlarmRule.allOf(
      cloudwatch.AlarmRule.fromAlarm(
        unhealthyAlarm,
        cloudwatch.AlarmState.ALARM,
      ),
      cloudwatch.AlarmRule.not(
        cloudwatch.AlarmRule.fromAlarm(
          deploymentDetector,
          cloudwatch.AlarmState.ALARM,
        ),
      ),
    ),
  },
);
unhealthyNotDeployingAlarm.addAlarmAction(
  new cloudwatchActions.SnsAction(this.alarmTopic),
);

This pattern ensures you're only alerted for real issues, not normal deployment transitions.

Multi-Layered Health Check Architecture

The Ghost on AWS setup employs a sophisticated multi-layered health check system that ensures high availability and enables zero-downtime deployments. Understanding these different layers is crucial for maintaining a resilient production environment.

Three Levels of Health Monitoring

The architecture implements health checks at three distinct levels, each serving a specific purpose in maintaining service reliability:

graph TD
    subgraph "External Layer"
        ALB[Application Load Balancer]
        TG[Target Group Health Check]
    end

    subgraph "Container Layer"
        Nginx[Nginx Container<br/>Port 80]
        Ghost[Ghost Container<br/>Port 2368]
        ActivityPub[ActivityPub Container<br/>Port 8080]
    end

    subgraph "Application Layer"
        HealthEndpoint[/health endpoint<br/>Returns 200 OK]
        GhostAPI[Ghost Admin API<br/>/ghost/api/admin/site/]
    end

    ALB --> TG
    TG --> Nginx
    Nginx --> HealthEndpoint
    Ghost --> GhostAPI

1. ALB Target Group Health Checks

The Application Load Balancer continuously monitors container health through target group health checks. These checks determine whether traffic should be routed to a container:

// Configure health check for Nginx
this.service.targetGroup.configureHealthCheck({
  path: '/health',
  healthyHttpCodes: '200',
  interval: cdk.Duration.seconds(30),
  timeout: cdk.Duration.seconds(5),
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 3,
});

Key configuration details:

  • Health check path: /health - A simple endpoint in Nginx that returns 200 OK
  • Check frequency: Every 30 seconds
  • Timeout: 5 seconds per check
  • Healthy threshold: 2 consecutive successful checks to mark healthy
  • Unhealthy threshold: 3 consecutive failures to mark unhealthy
  • Deregistration delay: 30 seconds of connection draining before a deregistered target is fully removed

2. Container-Level Configuration

While Docker supports HEALTHCHECK instructions, the current implementation relies on ALB health checks rather than container-level health checks. This design choice simplifies the architecture while maintaining reliability:

// Nginx container serves as the health check responder
const nginxContainer = taskDefinition.addContainer('NginxContainer', {
  image: ecs.ContainerImage.fromAsset('containers/nginx-proxy'),
  portMappings: [
    {
      containerPort: 80,
      protocol: ecs.Protocol.TCP,
    },
  ],
  essential: true, // Container must be running for task to be healthy
});
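For comparison, ECS also supports Docker-style container health checks. The sketch below shows what adding one to the Nginx container might look like; it is not part of the current stack and assumes the image ships with curl (or a busybox wget) to probe the endpoint.

// Hypothetical container-level health check (not used in this setup)
const nginxWithHealthCheck = taskDefinition.addContainer('NginxContainer', {
  image: ecs.ContainerImage.fromAsset('containers/nginx-proxy'),
  portMappings: [{ containerPort: 80, protocol: ecs.Protocol.TCP }],
  essential: true,
  healthCheck: {
    // Requires curl inside the image; swap for wget if using a busybox-based image
    command: ['CMD-SHELL', 'curl -f http://localhost/health || exit 1'],
    interval: cdk.Duration.seconds(30),
    timeout: cdk.Duration.seconds(5),
    retries: 3,
    startPeriod: cdk.Duration.seconds(60),
  },
});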

The Nginx configuration includes a dedicated health endpoint:

    # Health check endpoint for ALB
    location /health {
        access_log off;
        return 200 "healthy\n";
        add_header Content-Type text/plain;
    }

This approach provides several benefits:

  • Fast response times: Simple 200 response without backend calls
  • Reduced load: No database queries or application logic
  • Clear separation: Health checks don't impact application performance

3. Deployment Health Management

The ECS service configuration includes sophisticated deployment controls that work with health checks to ensure safe rollouts:

// Create Fargate service with health check grace period
this.service = new ecsPatterns.ApplicationLoadBalancedFargateService(
  this,
  'Service',
  {
    cluster: this.cluster,
    taskDefinition: taskDefinition,
    desiredCount: 1,
    healthCheckGracePeriod: cdk.Duration.minutes(5),
    enableExecuteCommand: true,
  },
);

// Configure deployment circuit breaker with automatic rollback
const cfnService = this.service.service.node.defaultChild as ecs.CfnService;
cfnService.deploymentConfiguration = {
  minimumHealthyPercent: 100, // Never go below desired count
  maximumPercent: 200, // Can temporarily double containers
  deploymentCircuitBreaker: {
    enable: true,
    rollback: true, // Auto-rollback on failure
  },
};

Health Check Flow During Deployments

Understanding how health checks interact during deployments is crucial for zero-downtime updates:

  1. New task starts: ECS launches new container with updated image
  2. Grace period: 5-minute window where health checks are ignored
  3. Initial checks: After grace period, ALB begins health checks
  4. Healthy threshold: Container must pass 2 consecutive checks (1 minute)
  5. Traffic routing: ALB begins routing traffic to new container
  6. Old task draining: Existing connections complete gracefully
  7. Circuit breaker: If new tasks fail repeatedly, automatic rollback occurs

Why Nginx as the Health Check Target?

The architecture uses Nginx as the primary health check target rather than Ghost directly for several reasons:

  1. Proxy readiness: Ensures the entire request path is functional
  2. Fast response: No application processing required
  3. Isolation: Health checks don't impact Ghost performance
  4. Multiple backends: Can verify connectivity to Ghost, ActivityPub, and analytics

Container Dependencies and Startup Order

For complex deployments with multiple containers, proper dependency management ensures healthy startup:

// ActivityPub depends on database migration completion
activityPubContainer.addContainerDependencies({
  container: initContainer,
  condition: ecs.ContainerDependencyCondition.SUCCESS,
});

// Ghost waits for ActivityPub initialization if enabled
ghostContainer.addContainerDependencies({
  container: initContainer,
  condition: ecs.ContainerDependencyCondition.SUCCESS,
});

This ensures services start in the correct order and are fully initialized before receiving traffic.

Best Practices for Production Health Checks

Based on this implementation, here are key recommendations:

  1. Use simple health endpoints: Avoid complex logic that could fail
  2. Set appropriate grace periods: Allow time for initialization
  3. Configure circuit breakers: Enable automatic rollback for safety
  4. Monitor all layers: Track both ALB and application metrics
  5. Test deployment scenarios: Verify health checks during updates

The multi-layered health check architecture provides robust monitoring while minimizing false positives and ensuring smooth deployments. This design has proven effective for maintaining high availability in production Ghost deployments.

Container Logging Architecture

Each container type has its own log group with specific retention periods and structured logging formats. This separation allows targeted troubleshooting and cost optimization.

// Ghost container logs with structured JSON format
const ghostLogGroup = new logs.LogGroup(this, 'GhostLogs', {
  logGroupName: '/ecs/ghost',
  retention: logs.RetentionDays.ONE_WEEK,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
});

// Nginx sidecar logs for request analysis
const nginxLogGroup = new logs.LogGroup(this, 'NginxLogs', {
  logGroupName: '/ecs/ghost/nginx',
  retention: logs.RetentionDays.THREE_DAYS,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
});

// Container configuration with CloudWatch logging
const ghostContainer = taskDefinition.addContainer('ghost', {
  image: ecs.ContainerImage.fromRegistry('ghost:5-alpine'),
  logging: ecs.LogDrivers.awsLogs({
    streamPrefix: 'ghost',
    logGroup: ghostLogGroup,
  }),
  environment: {
    NODE_ENV: 'production',
    logging__level: 'info',
    logging__transports: '["stdout"]',
  },
});

The logging configuration includes Ghost application logs with one-week retention for debugging, Nginx access logs with three-day retention for traffic analysis, ActivityPub federation logs for troubleshooting federation issues, and init container logs capturing startup and migration output.
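The ActivityPub and init container log groups follow the same shape; a short sketch (names and retention periods here are illustrative):

// ActivityPub federation logs
const activityPubLogGroup = new logs.LogGroup(this, 'ActivityPubLogs', {
  logGroupName: '/ecs/ghost/activitypub',
  retention: logs.RetentionDays.ONE_WEEK,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
});

// Init container logs - startup and migration output
const initLogGroup = new logs.LogGroup(this, 'InitLogs', {
  logGroupName: '/ecs/ghost/init',
  retention: logs.RetentionDays.THREE_DAYS,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
});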

Log Insights Queries

Pre-configured queries enable rapid troubleshooting of common issues. These queries search across all container logs to identify patterns and correlate events.

// Pre-configured query for finding errors
const errorLogQuery = new logs.QueryDefinition(this, 'ErrorLogQuery', {
  queryDefinitionName: 'Ghost-Errors',
  queryString: new logs.QueryString({
    fields: ['@timestamp', '@message'],
    filter: '@message like /ERROR/',
    sort: '@timestamp desc',
    limit: 100,
  }),
  logGroups: [ghostLogGroup],
});

// Query for slow database queries
const slowQueryLog = new logs.QueryDefinition(this, 'SlowQueries', {
  queryDefinitionName: 'Ghost-Slow-DB-Queries',
  queryString: new logs.QueryString({
    fields: ['@timestamp', 'query', 'duration'],
    filter: 'duration > 1000',
    sort: 'duration desc',
    limit: 50,
  }),
  logGroups: [ghostLogGroup],
});

Additional queries identify failed login attempts for security monitoring, track newsletter send progress, monitor image upload failures, and analyze traffic patterns from Nginx logs.
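For instance, the failed-login query could be defined along these lines; the filter string is an assumption and needs to match however Ghost phrases authentication failures in your logs.

const failedLoginQuery = new logs.QueryDefinition(this, 'FailedLogins', {
  queryDefinitionName: 'Ghost-Failed-Logins',
  queryString: new logs.QueryString({
    fields: ['@timestamp', '@message'],
    // Adjust the pattern to match Ghost's actual authentication error messages
    filter: '@message like /password is incorrect/ or @message like /Failed login/',
    sort: '@timestamp desc',
    limit: 100,
  }),
  logGroups: [ghostLogGroup],
});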

Auto-scaling Configuration

The monitoring metrics drive auto-scaling decisions, ensuring the application scales smoothly under load while controlling costs during quiet periods.

const scaling = service.autoScaleTaskCount({
  minCapacity: 1,
  maxCapacity: 3,
});

// Scale on CPU utilization
scaling.scaleOnCpuUtilization('CpuScaling', {
  targetUtilizationPercent: 70,
  scaleInCooldown: cdk.Duration.minutes(5),
  scaleOutCooldown: cdk.Duration.minutes(1),
});

// Scale on memory utilization
scaling.scaleOnMemoryUtilization('MemoryScaling', {
  targetUtilizationPercent: 70,
  scaleInCooldown: cdk.Duration.minutes(5),
  scaleOutCooldown: cdk.Duration.minutes(1),
});

The scaling configuration maintains 70% target utilization for both CPU and memory, scales out quickly in one minute to handle traffic spikes, scales in slowly after five minutes to avoid flapping, and supports one to three container instances based on load.

Note: Auto-scaling creates its own control alarms (separate from monitoring alarms) that will show as ALARM when utilization is low - this is normal behavior as they signal the auto-scaler not to add more capacity.

Health Check Configuration

Health checks ensure only healthy containers receive traffic, with circuit breaker protection for safe deployments.

const targetGroup = new elbv2.ApplicationTargetGroup(this, 'TargetGroup', {
  vpc: props.vpc,
  port: 80,
  protocol: elbv2.ApplicationProtocol.HTTP,
  targetType: elbv2.TargetType.IP,
  healthCheck: {
    enabled: true,
    path: '/ghost/api/admin/site/',
    protocol: elbv2.Protocol.HTTP,
    healthyHttpCodes: '200',
    interval: cdk.Duration.seconds(30),
    timeout: cdk.Duration.seconds(5),
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 3,
  },
  deregistrationDelay: cdk.Duration.seconds(30),
});

// ECS service with circuit breaker
const service = new ecs.FargateService(this, 'GhostService', {
  cluster: props.cluster,
  taskDefinition,
  desiredCount: 1,
  assignPublicIp: false,
  circuitBreaker: { rollback: true },
  healthCheckGracePeriod: cdk.Duration.minutes(5),
});

Health checks verify the Ghost admin API every 30 seconds, require two consecutive successful checks for healthy status, mark containers unhealthy after three failures, and automatically roll back failed deployments.

WAF Monitoring

The WAF integration provides security metrics and blocks malicious traffic before it reaches the application.

const webAcl = new wafv2.CfnWebACL(this, 'WebACL', {
  scope: 'REGIONAL', // required: REGIONAL for web ACLs attached to an ALB
  defaultAction: { allow: {} },
  rules: [
    {
      name: 'RateLimitRule',
      priority: 1,
      statement: {
        rateBasedStatement: {
          limit: 2000,
          aggregateKeyType: 'IP',
        },
      },
      action: { block: {} },
      visibilityConfig: {
        sampledRequestsEnabled: true,
        cloudWatchMetricsEnabled: true,
        metricName: 'RateLimitRule',
      },
    },
    {
      name: 'AWSManagedRulesCommonRuleSet',
      priority: 2,
      statement: {
        managedRuleGroupStatement: {
          vendorName: 'AWS',
          name: 'AWSManagedRulesCommonRuleSet',
        },
      },
      overrideAction: { none: {} },
      visibilityConfig: {
        sampledRequestsEnabled: true,
        cloudWatchMetricsEnabled: true,
        metricName: 'CommonRuleSet',
      },
    },
  ],
  visibilityConfig: {
    sampledRequestsEnabled: true,
    cloudWatchMetricsEnabled: true,
    metricName: 'ghost-waf',
  },
});

WAF monitoring tracks rate limit violations identifying potential DDoS attacks, blocked requests by rule showing attack patterns, sampled requests for security analysis, and geographical distribution of blocked traffic.
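These metrics can also drive alarms, routed to the same SNS topic as the other alarms. The sketch below uses the standard AWS/WAFV2 BlockedRequests metric; the WebACL and Rule dimension values are assumptions and must match the names and metric names configured on your web ACL.

const wafBlockedAlarm = new cloudwatch.Alarm(this, 'WafBlockedRequestsAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/WAFV2',
    metricName: 'BlockedRequests',
    dimensionsMap: {
      WebACL: 'ghost-waf',   // must match your web ACL's name/metric configuration
      Rule: 'RateLimitRule', // rule metric name from visibilityConfig above
      Region: cdk.Stack.of(this).region,
    },
    statistic: 'Sum',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 100, // illustrative: sustained blocking suggests an attack in progress
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  alarmDescription: 'Spike in WAF-blocked requests - possible DDoS or scanning activity',
});
wafBlockedAlarm.addAlarmAction(new cloudwatchActions.SnsAction(this.alarmTopic));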

Cost Optimization

The monitoring setup uses several strategies to minimize costs while maintaining visibility.

Log Retention Policies

Different retention periods for different log types optimize storage costs:

// Critical application logs - 1 week
const appLogs = new logs.LogGroup(this, 'AppLogs', {
  retention: logs.RetentionDays.ONE_WEEK,
});

// Access logs - 3 days
const accessLogs = new logs.LogGroup(this, 'AccessLogs', {
  retention: logs.RetentionDays.THREE_DAYS,
});

// Debug logs - 1 day
const debugLogs = new logs.LogGroup(this, 'DebugLogs', {
  retention: logs.RetentionDays.ONE_DAY,
});

Metric Filters Instead of Lambda

Using CloudWatch metric filters avoids Lambda costs for simple metric extraction:

new logs.MetricFilter(this, 'ErrorCountMetric', {
  logGroup: ghostLogGroup,
  // Bracketed patterns are parsed as space-delimited filters, so match the term directly
  filterPattern: logs.FilterPattern.anyTerm('ERROR'),
  metricNamespace: 'Ghost/Application',
  metricName: 'ErrorCount',
  metricValue: '1',
});
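The extracted metric can then back a regular alarm, keeping the whole error-alerting path Lambda-free. A minimal sketch, reusing the alarm topic from the alerting section (referenced here as alarmTopic):

const errorCountAlarm = new cloudwatch.Alarm(this, 'ApplicationErrorAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'Ghost/Application',
    metricName: 'ErrorCount',
    statistic: 'Sum',
    period: cdk.Duration.minutes(5),
  }),
  threshold: 5, // illustrative: more than 5 logged errors in 5 minutes
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});
// alarmTopic: the SNS topic created in the alerting section
errorCountAlarm.addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));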

Container Insights Optimization

Container Insights provides deep visibility but can be expensive. We enable it selectively:

const cluster = new ecs.Cluster(this, 'Cluster', {
  vpc,
  containerInsights: true, // Enable for production only
  enableFargateCapacityProviders: true,
});

Cost Considerations

The monitoring setup is designed to be cost-effective while providing comprehensive visibility. Based on typical Ghost deployment patterns, the estimated monthly costs are approximately $10-20 depending on traffic volume and log retention settings. The investment provides complete operational visibility and peace of mind for production deployments.

Subscribing to Notifications

After deployment:

  1. Subscribe to SNS alerts: The email subscription requires confirmation. Replace the wildcard placeholder below with the full topic ARN from the stack outputs

    aws sns subscribe \
      --topic-arn arn:aws:sns:region:account:GhostStack-MonitoringAlarmTopic* \
      --protocol email \
      --notification-endpoint your-email@example.com
    
  2. Verify alarm configuration: Check that all alarms are in OK state initially

  3. Customize thresholds: Adjust based on your traffic patterns and requirements

  4. Monitor auto-scaling alarms separately: These will show as ALARM when load is low (this is normal)

The monitoring system provides the operational excellence needed for production Ghost deployments. With comprehensive metrics, intelligent alerting, and cost-effective logging, you can confidently run Ghost on AWS knowing issues will be detected and resolved quickly.