Solving Docker BuildKit Compatibility Issues with Amazon ECS

A deep dive into resolving Docker BuildKit attestation manifest incompatibilities that prevent ECS deployments, including investigation steps and solutions.

Docker containers that build successfully on local development machines can fail when deployed to Amazon ECS through CDK. This post documents the investigation and resolution of a deployment failure caused by Docker BuildKit's attestation manifests, which are incompatible with Amazon ECS.

The root cause stems from Docker BuildKit creating OCI image manifests with additional security attestation layers that ECS cannot parse. While these attestations provide supply chain security benefits, ECS expects simple, single-platform Docker manifests.

The Problem

Our CDK deployments of containerized services were failing with cryptic errors. The images pushed successfully to ECR, but the ECS tasks refused to start. The most puzzling aspect? The exact same Docker builds worked flawlessly on our local development machines.

[Diagram: Local Docker Build → Container Runs Locally (success); Push via CDK → ECR Repository → Pull Image → ECS Task → CannotPullContainerError (failure).]

The deployment failures manifested in several ways. CDK reported build failures despite local builds working perfectly. Images appeared in ECR but showed 0 MB size. ECS tasks immediately stopped with pull errors, and the error messages referenced missing linux/amd64 descriptors.
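
The suspicious size can also be confirmed from the CLI. A minimal sketch, assuming a repository named my-repo (substitute your own):

# List pushed images with the size ECR reports for them
aws ecr describe-images \
  --repository-name my-repo \
  --query 'imageDetails[].{tags:imageTags,sizeBytes:imageSizeInBytes}'
# For an OCI index, ECR may report only the manifest size,
# which is what shows up as ~0 MB in the console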

The ECS console displayed this error:

Stopped reason:
CannotPullContainerError: failed to resolve reference
"xxx.dkr.ecr.region.amazonaws.com/repo:tag":
pulling from host xxx.dkr.ecr.region.amazonaws.com
failed with status code [manifests tag]: 400 Bad Request
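
The same stopped reason can be retrieved without the console. A minimal sketch, assuming a cluster named my-cluster:

# Grab the most recently stopped task and read its stopped reason
TASK_ARN=$(aws ecs list-tasks \
  --cluster my-cluster \
  --desired-status STOPPED \
  --query 'taskArns[0]' \
  --output text)

aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks "$TASK_ARN" \
  --query 'tasks[].{status:lastStatus,reason:stoppedReason}'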

The Investigation

Step 1: Verifying Local Builds

Our first instinct was to verify that the Dockerfiles were correct. We ran local builds:

# Local build test
docker build -t test-image .
docker run test-image
 
# Check image details
docker images | grep test-image
# Output: test-image  latest  abc123  2 minutes ago  292MB

Everything worked perfectly. The containers started, ran their applications, and had the expected file sizes. This eliminated Dockerfile issues as the cause.
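
One extra local check worth keeping in mind: the platform recorded in the image config can be read directly, although this inspects the local image store and will not surface attestation manifests, which only appear in the pushed manifest list. A sketch, using the same test-image tag:

# Confirm the OS/architecture baked into the local image
docker image inspect test-image --format '{{.Os}}/{{.Architecture}}'
# Expected: linux/amd64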

Step 2: Pursuing Red Herrings

Since local builds worked but CDK builds failed, we suspected issues with our monorepo structure:

# Original Dockerfile attempting to handle pnpm workspaces
FROM node:18-alpine
WORKDIR /app

# pnpm is not bundled with node:18-alpine, so enable it via corepack
RUN corepack enable

# We thought this was the problem
COPY package.json pnpm-lock.yaml pnpm-workspace.yaml ./
COPY containers/my-app/package.json ./containers/my-app/

# Hours spent converting this to standalone npm...
RUN pnpm install --frozen-lockfile --filter my-app...

We spent significant time converting pnpm workspaces to standalone npm packages, reorganizing .dockerignore files, and restructuring the monorepo layout. None of these changes had any effect on the actual problem.

Step 3: Analyzing ECR Manifests

The breakthrough came when we inspected the image manifests in ECR:

# Check manifest type
aws ecr batch-get-image \
  --repository-name my-repo \
  --image-ids imageTag=latest \
  --query 'images[0].imageManifest' \
  --output text | jq '.mediaType'
 
# Output: "application/vnd.oci.image.index.v1+json"

This revealed that our images were using OCI index manifests rather than simple Docker manifests. Further investigation showed attestation layers:

# Check for attestation layers
aws ecr batch-get-image \
  --repository-name my-repo \
  --image-ids imageTag=latest \
  --query 'images[0].imageManifest' \
  --output text | jq '.manifests[]'
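
If Buildx is available locally, docker buildx imagetools inspect gives a more readable view of the same pushed manifest list. A sketch; the registry URI is a placeholder and an ECR docker login is assumed:

# Inspect the manifest list straight from the registry
docker buildx imagetools inspect \
  xxx.dkr.ecr.region.amazonaws.com/my-repo:latest
# A BuildKit-pushed image typically shows an extra unknown/unknown
# entry that holds the attestation manifest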

Understanding the Root Cause

Docker BuildKit vs Legacy Builder

Docker BuildKit, enabled by default in recent Docker versions, creates sophisticated image manifests that include:

[Diagram: Docker BuildKit → OCI Image Index containing a Platform Manifest (linux/amd64) plus Attestation Manifests for Provenance and SBOM data (ECS incompatible ✗). Legacy Builder → Simple Manifest, linux/amd64 only (ECS compatible ✓).]

BuildKit adds security attestations (provenance and SBOM data) that provide supply chain security benefits. However, ECS expects simple, single-platform manifests and cannot parse these additional layers.

Manifest Structure Comparison

BuildKit Manifest (Incompatible with ECS):

{
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      }
    },
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "platform": {
        "architecture": "unknown",
        "os": "unknown"
      },
      "annotations": {
        "vnd.docker.reference.type": "attestation-manifest"
      }
    }
  ]
}

The attestation manifest itself carries in-toto payloads (application/vnd.in-toto+json with a predicate type such as https://slsa.dev/provenance/v0.2) rather than runnable image layers.

Legacy Builder Manifest (Compatible with ECS):

{
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "schemaVersion": 2,
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json"
  }
}

The Solution

Immediate Fix

The solution was surprisingly simple once we understood the root cause. We disabled BuildKit in our deployment script:

#!/bin/bash
set -e
 
# Disable Docker BuildKit to avoid attestation manifest issues with ECS
# ECS expects simple linux/amd64 images, not manifest lists
export DOCKER_BUILDKIT=0
 
# Continue with CDK deployment
cdk deploy --all
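
If you would rather keep BuildKit (for its caching and parallelism) than fall back to the legacy builder, a possible alternative is to disable only the attestations. This is a sketch of the relevant Buildx options, not the fix we describe above, and whether they reach a CDK-driven build depends on how CDK invokes Docker in your setup:

# Alternative sketch: keep BuildKit but skip attestation generation
export BUILDX_NO_DEFAULT_ATTESTATIONS=1
docker buildx build --provenance=false --sbom=false -t test-image .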

CDK Configuration

For CDK projects, you can also specify the platform explicitly:

import * as path from "path";
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import { DockerImageAsset, Platform } from "aws-cdk-lib/aws-ecr-assets";
import { ContainerImage, FargateTaskDefinition, LogDrivers } from "aws-cdk-lib/aws-ecs";
 
export class MyStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
 
    // Create Docker image asset with explicit platform
    const asset = new DockerImageAsset(this, "MyImage", {
      directory: path.join(__dirname, "../containers/my-app"),
      platform: Platform.LINUX_AMD64,
      // BuildKit will be disabled by environment variable
    });
 
    // Use in ECS task definition
    const taskDefinition = new FargateTaskDefinition(this, "TaskDef");
 
    taskDefinition.addContainer("Container", {
      image: ContainerImage.fromDockerImageAsset(asset),
      memoryLimitMiB: 512,
      logging: LogDrivers.awsLogs({
        streamPrefix: "my-app",
      }),
    });
  }
}
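
DockerImageAsset shells out to the local Docker CLI, so the DOCKER_BUILDKIT=0 setting has to be present in the shell that runs the deployment. A minimal sketch:

# Run the CDK deployment with the legacy builder
DOCKER_BUILDKIT=0 npx cdk deploy --all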

Validation Steps

After implementing the fix, we validated our deployments:

# 1. Build with BuildKit disabled
export DOCKER_BUILDKIT=0
docker build -t test-image .

# 2. Deploy to ECS (CDK rebuilds the image and pushes it to ECR)
cdk deploy

# 3. Check the manifest type of the pushed image
#    (docker manifest inspect reads from the registry, so it needs the
#     pushed ECR reference and an ECR docker login, not the local tag)
docker manifest inspect xxx.dkr.ecr.region.amazonaws.com/my-repo:latest | jq '.mediaType'
# Should show: "application/vnd.docker.distribution.manifest.v2+json"

# 4. Verify ECS tasks are running
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --desired-status RUNNING

Lessons Learned

This investigation revealed several important insights about containerized deployments. Not all AWS services support the latest container specifications. While BuildKit's attestations provide valuable security benefits, compatibility with your deployment target takes precedence.

Local Docker builds can behave differently from CDK's build process. Our local Docker daemon handled BuildKit manifests perfectly, but ECS has different requirements. Testing in an environment that closely matches production would have revealed this issue earlier.

Error messages don't always point to the root cause. The errors about missing linux/amd64 descriptors suggested an architecture incompatibility, when the real problem was manifest format parsing. Understanding the complete deployment pipeline helps identify where issues actually originate.

Before refactoring application structure, investigate deployment tool differences. We spent hours restructuring our monorepo when the solution was a simple environment variable change.

Best Practices

Creating a deployment checklist helps prevent similar issues. Key items to verify include disabling BuildKit in deployment scripts, confirming base images use the linux/amd64 architecture, validating CDK platform specifications, and testing ECS task launches in staging environments. Comparing the image size reported in ECR against the local build and reviewing .dockerignore configurations can also catch issues early in the deployment process.

Diagnostic commands prove invaluable for troubleshooting container deployment issues:

#!/bin/bash
# diagnose.sh
# Usage: ./diagnose.sh <repository-name> <image-tag>

REPO="$1"
TAG="$2"

echo "Checking Docker BuildKit status..."
echo "DOCKER_BUILDKIT=${DOCKER_BUILDKIT}"

echo "Checking image manifest type..."
aws ecr batch-get-image \
  --repository-name "$REPO" \
  --image-ids imageTag="$TAG" \
  --query 'images[0].imageManifest' \
  --output text | jq '.mediaType'

echo "Checking for attestation layers..."
aws ecr batch-get-image \
  --repository-name "$REPO" \
  --image-ids imageTag="$TAG" \
  --query 'images[0].imageManifest' \
  --output text | jq '.manifests[]? | select(.annotations["vnd.docker.reference.type"] == "attestation-manifest")'
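
Example invocation against a pushed image (repository and tag are placeholders):

# Run the checks (after chmod +x diagnose.sh)
./diagnose.sh my-repo latest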

Conclusion

Docker BuildKit's advanced features create a compatibility gap with Amazon ECS's current manifest parsing capabilities. While this caused significant debugging time, the solution—disabling BuildKit—is straightforward once you understand the root cause.

The key takeaway? When containerized services work locally but fail in cloud deployments, investigate the build and deployment toolchain differences before diving into application code changes. Sometimes the simplest solution is the right one.

Remember to keep this workaround documented and revisit it periodically as both Docker and AWS ECS continue to evolve. What's incompatible today might become the standard tomorrow.