Mode 3: Cloud Rendering

Run video rendering at scale on AWS Fargate. Best for production deployments, long videos, and parallel batch processing.

Overview

Cloud rendering uses AWS ECS Fargate to run containerized render workers in the cloud. Each render job gets its own isolated task with dedicated resources (4 vCPU, 16GB RAM).

Ideal for:

Production video rendering at scale
Long videos (5+ minutes)
Parallel batch processing (10+ concurrent renders)
Hands-off automated rendering
No local compute resource consumption

Not ideal for:

Development and testing (use Mode 1)
Immediate render results (every task pays a ~2-minute cold start)
Cost-sensitive small batches (Lambda is cheaper for short videos)

How It Works

Cloud rendering follows this workflow:

1. User clicks "Render" in UI
   ↓
2. Job record created (status: queued)
   ↓
3. EventBridge trigger (every 1 minute)
   ↓
4. Lambda polls for queued jobs
   ↓
5. Lambda starts ECS Fargate task
   ↓
6. Task provisions (~2 minutes)
   ↓
7. Container starts, runs worker
   ↓
8. Worker processes render job
   ↓
9. MP4 uploaded to S3
   ↓
10. RenderRun record created
   ↓
11. Task exits, resources released
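
Steps 4 and 5 amount to a capacity check followed by task launches. The sketch below shows one way that decision could look, assuming the trigger honors the MAX_CONCURRENT_TASKS limit described under Parallel Processing. `planTaskStarts` and its parameters are hypothetical names, not taken from the worker source:

```typescript
// Hypothetical helper: pick which queued jobs get a Fargate task on this poll.
// Assumes the job IDs are already ordered oldest-first.
function planTaskStarts(
  queuedJobIds: string[],
  runningTasks: number,
  maxConcurrent: number,
): string[] {
  // Free slots never go negative, even if more tasks are running than the cap allows.
  const freeSlots = Math.max(0, maxConcurrent - runningTasks);
  return queuedJobIds.slice(0, freeSlots);
}

console.log(planTaskStarts(['job-a', 'job-b', 'job-c'], 8, 10)); // [ 'job-a', 'job-b' ]
```

Jobs that do not fit stay queued and are picked up on a later poll.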

Architecture Components

1. ECR Repository

Stores Docker images for render workers.

Name: babulus-render-worker
Region: us-east-1
Retention: Images retained on stack deletion

2. VPC & Networking

Private network for task execution.

Configuration: 2 availability zones, 1 NAT gateway
Subnets: Private subnets with NAT for internet access
Security: Outbound-only, no inbound connections

3. ECS Cluster

Orchestrates Fargate task execution.

Name: babulus-render-cluster
Type: ECS with Fargate launch type
Container Insights: Enabled for monitoring

4. Task Definition

Defines container configuration.

CPU: 4 vCPU (4096 units)
Memory: 16 GB (16384 MB)
Image: Latest from ECR
Entrypoint: src/worker-ecs.ts

5. Render Trigger Lambda

Polls for jobs and starts tasks.

Trigger: EventBridge schedule (every 1 minute)
Runtime: Node.js 20
Timeout: 30 seconds
IAM: Permissions to run ECS tasks, query AppSync

6. CloudWatch Monitoring

Tracks worker health and performance.

Alarms: High error rate, long execution, no completions
Dashboard: Lambda invocations, task metrics, errors
Log Retention: 1 week

Prerequisites

AWS Account

You need an AWS account with:

ECS Fargate enabled
Sufficient Fargate quota for 10 concurrent tasks (40 On-Demand vCPUs at 4 vCPU per task)
Permissions to create VPC, ECS, Lambda resources

Docker Image

The render worker image must be built and pushed to ECR:

# Build image
docker build --platform linux/amd64 -t babulus-render-worker:latest -f Dockerfile .

# Tag for ECR
docker tag babulus-render-worker:latest \
  335163751677.dkr.ecr.us-east-1.amazonaws.com/babulus-render-worker:latest

# Login to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  335163751677.dkr.ecr.us-east-1.amazonaws.com

# Push to ECR
docker push 335163751677.dkr.ecr.us-east-1.amazonaws.com/babulus-render-worker:latest

Infrastructure Deployed

The ECS infrastructure must be deployed via Amplify Gen 2:

cd apps/studio-web
npx ampx sandbox

Or deployed to production by pushing to the main branch (Amplify Hosting auto-deploys).

Creating Render Jobs

Via API

Use the Amplify Data client to create a job:

import { generateClient } from 'aws-amplify/data';

const client = generateClient({ authMode: 'userPool' });

// Create render job
const { data: job, errors } = await client.models.Job.create({
  kind: 'render',
  status: 'queued',
  orgId: 'your-org-id',
  inputJson: JSON.stringify({
    generationRunId: 'your-generation-run-id'  // Links to generated assets
  })
});

if (errors) {
  console.error('Failed to create job:', errors);
} else {
  console.log(`Created render job: ${job?.id}`);  // data is typed as nullable
}

Via GraphQL

Direct GraphQL mutation:

mutation CreateRenderJob {
  createJob(input: {
    kind: "render"
    status: "queued"
    orgId: "your-org-id"
    inputJson: "{\"generationRunId\":\"gen-123\"}"
  }) {
    id
    status
    createdAt
  }
}
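
Hand-escaping the `inputJson` string is easy to get wrong. The sketch below builds the mutation input with `JSON.stringify` instead. `buildCreateJobInput` and the `CreateJobInput` interface are hypothetical; the field names are the ones used in the examples above:

```typescript
// Hypothetical shape mirroring the fields used in the mutation above.
interface CreateJobInput {
  kind: 'render';
  status: 'queued';
  orgId: string;
  inputJson: string; // JSON-encoded string, not a nested object
}

function buildCreateJobInput(orgId: string, generationRunId: string): CreateJobInput {
  return {
    kind: 'render',
    status: 'queued',
    orgId,
    // JSON.stringify handles quoting, so no manual backslashes are needed.
    inputJson: JSON.stringify({ generationRunId }),
  };
}

const input = buildCreateJobInput('your-org-id', 'gen-123');
console.log(input.inputJson); // {"generationRunId":"gen-123"}
```

The resulting object can be passed as a GraphQL variable rather than inlined into the mutation text.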

Monitoring Render Jobs

Check ECS Tasks

List running tasks:

aws ecs list-tasks \
  --cluster babulus-render-cluster \
  --region us-east-1

Get task details:

aws ecs describe-tasks \
  --cluster babulus-render-cluster \
  --tasks <task-arn> \
  --region us-east-1
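
The `describe-tasks` response is verbose. The sketch below reduces it to a per-task summary, assuming the command was run with `--output json` and the result parsed into an object. `summarizeTasks` is a hypothetical helper that reads only the `taskArn`, `lastStatus`, `stoppedReason`, and container `exitCode` fields of the response:

```typescript
// Subset of the ECS DescribeTasks response that this sketch reads.
interface EcsContainer { name: string; exitCode?: number; }
interface EcsTask {
  taskArn: string;
  lastStatus: string;
  stoppedReason?: string;
  containers: EcsContainer[];
}
interface TaskSummary { taskId: string; status: string; failed: boolean; reason: string; }

function summarizeTasks(response: { tasks: EcsTask[] }): TaskSummary[] {
  return response.tasks.map((task) => {
    // Heuristic: flag a task when it has stopped and a container exited non-zero.
    const failed =
      task.lastStatus === 'STOPPED' &&
      task.containers.some((c) => c.exitCode !== undefined && c.exitCode !== 0);
    return {
      taskId: task.taskArn.split('/').pop() ?? task.taskArn, // last ARN segment
      status: task.lastStatus,
      failed,
      reason: task.stoppedReason ?? '',
    };
  });
}

// Illustrative response; the ARN and reason text are made up.
const sample = {
  tasks: [{
    taskArn: 'arn:aws:ecs:us-east-1:123456789012:task/babulus-render-cluster/abc123',
    lastStatus: 'STOPPED',
    stoppedReason: 'Essential container in task exited',
    containers: [{ name: 'render-worker', exitCode: 1 }],
  }],
};
console.log(summarizeTasks(sample));
```

Tasks that stop before a container starts (an image pull failure, for example) carry no exit code and are not flagged by this heuristic, so `reason` is still worth reading.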

View Logs

Task logs are in CloudWatch:

aws logs tail "amplify-...-RenderTaskDefinition..." \
  --region us-east-1 \
  --follow

Lambda trigger logs:

aws logs tail "/aws/lambda/amplify-...-RenderTriggerFunction..." \
  --region us-east-1 \
  --follow

CloudWatch Dashboard

Access the pre-configured dashboard:

Open AWS Console
Navigate to CloudWatch → Dashboards
Select babulus-generation-worker
View metrics for invocations, errors, duration, throttles

Cost Analysis

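Fargate bills per second (one-minute minimum) for the vCPU and memory a task reserves, from the start of the image pull until the task stops. The figures below count render time only, so the billed part of provisioning adds slightly to each: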

Pricing (us-east-1):

vCPU: $0.04048 per vCPU per hour
Memory: $0.004445 per GB per hour

Per-task cost (4 vCPU, 16GB RAM):

vCPU: 4 × $0.04048 = $0.16192/hour
Memory: 16 × $0.004445 = $0.07112/hour
Total: $0.23304/hour = $0.00388/minute

Example costs:

Video Duration	Render Time	Task Cost
30 seconds	~2 minutes	$0.008
60 seconds	~3 minutes	$0.012
5 minutes	~8 minutes	$0.031
30 minutes	~40 minutes	$0.155

Additional costs:

ECR storage: $0.10/GB/month (minimal for Docker images)
Data transfer: $0.09/GB out (only MP4 downloads)
CloudWatch logs: $0.50/GB ingested (typical: $1-5/month)

Monthly estimate (100 renders/day):

100 renders × 3 minutes × $0.00388/min = $1.16/day
~$35/month for compute
~$2-5/month for logs and storage
Total: ~$40/month
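
The arithmetic in this section can be reproduced with a few lines of code, which helps when re-estimating for a different task size or volume. This is a sketch using the us-east-1 rates quoted above; the function names are illustrative, and a 30-day month is an assumption:

```typescript
// us-east-1 Fargate rates quoted above (USD per hour).
const VCPU_PER_HOUR = 0.04048;
const GB_PER_HOUR = 0.004445;

// Cost of one task per minute of runtime.
function taskCostPerMinute(vcpu: number, memoryGb: number): number {
  return (vcpu * VCPU_PER_HOUR + memoryGb * GB_PER_HOUR) / 60;
}

// Compute-only monthly cost; logs, storage, and data transfer are extra.
function monthlyComputeCost(
  rendersPerDay: number,
  minutesPerRender: number,
  perMinute: number,
  days = 30,
): number {
  return rendersPerDay * minutesPerRender * perMinute * days;
}

const perMinute = taskCostPerMinute(4, 16);
console.log(perMinute.toFixed(5));                             // 0.00388
console.log(monthlyComputeCost(100, 3, perMinute).toFixed(2)); // 34.96
```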

Performance Characteristics

Task Provisioning

Cold start: 1.5-2.5 minutes

Pull Docker image from ECR
Initialize container
Start Node.js process

Warm starts: Not applicable (tasks exit after each render)

Rendering Speed

With 4 vCPU, 16GB RAM:

Video Length	Render Time	FPS	Total Time (provision + render)
30 seconds	1-2 minutes	~15-30	3-4 minutes
60 seconds	2-3 minutes	~20-30	4-5 minutes
5 minutes	6-8 minutes	~20-30	8-10 minutes
30 minutes	35-45 minutes	~15-25	37-47 minutes

Frame capture rate: 15-30 FPS (varies by scene complexity)
Encoding speed: ~60-120x realtime

Parallel Processing

ECS can run up to 10 concurrent tasks (configurable):

Sequential: 10 videos × 3 minutes = 30 minutes
Parallel:   (10 videos ÷ 10 tasks) × 3 minutes = 3 minutes

Scale limit: MAX_CONCURRENT_TASKS environment variable (default: 10)
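
For batches larger than the concurrency limit, renders run in waves. The sketch below gives a rough wall-clock estimate under simplifying assumptions: every video takes the same time, each wave starts together, and the one-minute polling delay is ignored. `batchWallClockMinutes` is a hypothetical helper:

```typescript
function batchWallClockMinutes(
  videos: number,
  maxConcurrent: number,
  renderMinutes: number,
  provisionMinutes = 0,
): number {
  // Each wave runs up to maxConcurrent tasks side by side.
  const waves = Math.ceil(videos / maxConcurrent);
  return waves * (provisionMinutes + renderMinutes);
}

console.log(batchWallClockMinutes(10, 1, 3));     // 30 (sequential)
console.log(batchWallClockMinutes(10, 10, 3));    // 3  (fully parallel)
console.log(batchWallClockMinutes(25, 10, 3, 2)); // 15 (three waves, ~2 min provisioning each)
```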

Troubleshooting

Task Fails Immediately

Check CloudWatch logs:

aws logs tail "amplify-...-RenderTaskDefinition..." --since 10m --region us-east-1

Common causes:

Empty AMPLIFY_OUTPUTS (authentication fails)
Invalid worker credentials
Missing S3 permissions
Network connectivity issues
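
The first two causes can be caught before any network call with a startup check. This is a minimal sketch, assuming AMPLIFY_OUTPUTS holds a JSON string and that the worker reads WORKER_EMAIL and WORKER_PASSWORD (the names used in the Secrets Manager example later in this guide). `validateWorkerEnv` is a hypothetical helper, not part of the worker source:

```typescript
// Returns human-readable problems; an empty list means the checks passed.
function validateWorkerEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];

  const outputs = env.AMPLIFY_OUTPUTS;
  if (!outputs || outputs.trim() === '') {
    problems.push('AMPLIFY_OUTPUTS is empty');
  } else {
    try {
      JSON.parse(outputs); // assumption: the variable carries JSON
    } catch {
      problems.push('AMPLIFY_OUTPUTS is not valid JSON');
    }
  }

  for (const name of ['WORKER_EMAIL', 'WORKER_PASSWORD']) {
    if (!env[name]) problems.push(`${name} is not set`);
  }
  return problems;
}

console.log(validateWorkerEnv({ AMPLIFY_OUTPUTS: '', WORKER_EMAIL: 'worker@example.com' }));
// [ 'AMPLIFY_OUTPUTS is empty', 'WORKER_PASSWORD is not set' ]
```

Calling it with `process.env` at the top of the worker and exiting on a non-empty result would put a clear message in the task log instead of a later authentication error.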

Task Stuck in PENDING

Check VPC configuration:

NAT gateway attached to private subnets
Route tables configured correctly
Security group allows outbound HTTPS

Check the Fargate On-Demand vCPU quota (10 concurrent 4 vCPU tasks need at least 40 vCPUs):

aws service-quotas get-service-quota \
  --service-code fargate \
  --quota-code L-3032A538 \
  --region us-east-1

Lambda Not Triggering Tasks

Check EventBridge rule:

aws events list-rules --name-prefix amplify- --region us-east-1

Check Lambda logs:

aws logs tail "/aws/lambda/amplify-...-RenderTriggerFunction..." --region us-east-1

Verify Lambda has IAM permissions:

ecs:RunTask
ecs:DescribeTasks
iam:PassRole
appsync:GraphQL

High Task Failure Rate

Check CloudWatch alarm:

Navigate to CloudWatch → Alarms
Look for babulus-worker-high-error-rate

Common fixes:

Update Docker image with bug fix
Increase task timeout
Add retry logic
Check for intermittent network issues
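
As an illustration of the retry suggestion, the sketch below wraps an async operation in exponential backoff. It is generic rather than the worker's actual code: `withRetry` and `backoffDelayMs` are hypothetical names, the defaults are arbitrary, and no jitter is applied:

```typescript
// Delay before the retry that follows attempt N (1-based): base * 2^(N-1), capped.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait before the next attempt; the final failure is rethrown immediately.
        const delay = backoffDelayMs(attempt, baseMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

console.log([1, 2, 3].map((n) => backoffDelayMs(n))); // [ 1000, 2000, 4000 ]
```

Retries suit transient failures such as a dropped upload. A render that fails deterministically will fail again, so the attempt count is best kept small.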

Scaling Configuration

Increase Concurrent Tasks

Edit apps/studio-web/amplify/backend.ts:

environment: {
  // ... other vars
  MAX_CONCURRENT_TASKS: '20',  // Increase from 10 to 20
}

Deploy:

git add apps/studio-web/amplify/backend.ts
git commit -m "Increase concurrent render tasks to 20"
git push origin main

Adjust Task Resources

For longer videos, increase CPU/RAM:

const renderTaskDefinition = new ecs.FargateTaskDefinition(backend.stack, 'RenderTaskDefinition', {
  cpu: 8192,              // 8 vCPU (was 4096)
  memoryLimitMiB: 32768,  // 32 GB (was 16384); 8 vCPU tasks accept 16-60 GB in 4 GB steps
  // ...
});
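
Fargate accepts only specific CPU and memory pairings, and an unsupported pair fails at deploy time. The sketch below checks a pair against the published Linux task sizes before editing the task definition. `isValidFargateSize` is a hypothetical helper, and the table is worth rechecking against current AWS documentation:

```typescript
const MIB_PER_GIB = 1024;

// Memory values in MiB from minGib to maxGib inclusive, in stepGib increments.
function gibRange(minGib: number, maxGib: number, stepGib: number): number[] {
  const values: number[] = [];
  for (let gib = minGib; gib <= maxGib; gib += stepGib) values.push(gib * MIB_PER_GIB);
  return values;
}

// CPU units -> allowed memory values (MiB).
const FARGATE_MEMORY_MIB: Record<number, number[]> = {
  256: [512, 1024, 2048],
  512: gibRange(1, 4, 1),
  1024: gibRange(2, 8, 1),
  2048: gibRange(4, 16, 1),
  4096: gibRange(8, 30, 1),
  8192: gibRange(16, 60, 4),
  16384: gibRange(32, 120, 8),
};

function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  return (FARGATE_MEMORY_MIB[cpu] ?? []).includes(memoryMiB);
}

console.log(isValidFargateSize(4096, 16384)); // true
console.log(isValidFargateSize(8192, 32768)); // true
console.log(isValidFargateSize(8192, 30720)); // false (30 GB is not a 4 GB step from 16 GB)
```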

Change Polling Interval

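Edit the EventBridge schedule. Rate expressions have a one-minute minimum, so the interval can be lengthened but not shortened below 1 minute:

const renderWorkerRule = new events.Rule(backend.stack, 'RenderWorkerSchedule', {
  schedule: events.Schedule.rate(Duration.minutes(5)),  // Every 5 minutes (was 1 minute)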
  // ...
});

Security Best Practices

1. Use Secrets Manager for Worker Credentials

Worker credentials are currently passed as plaintext environment variables. Move them to Secrets Manager:

import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';

const workerSecret = new secretsmanager.Secret(backend.stack, 'WorkerCredentials', {
  secretName: 'babulus-worker-credentials',
  generateSecretString: {
    secretStringTemplate: JSON.stringify({ email: 'render-worker@babulus.internal' }),
    generateStringKey: 'password',
  },
});

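// ECS resolves `secrets` with the task execution role, and CDK grants that role
// read access when the secret is referenced below. Grant the task role as well
// only if the worker reads the secret itself at runtime.
workerSecret.grantRead(taskRole);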

// Update task definition
renderTaskDefinition.addContainer('render-worker', {
  // ...
  secrets: {
    WORKER_EMAIL: ecs.Secret.fromSecretsManager(workerSecret, 'email'),
    WORKER_PASSWORD: ecs.Secret.fromSecretsManager(workerSecret, 'password'),
  },
});

2. Restrict Task IAM Permissions

Only grant permissions the worker actually needs:

taskRole.addToPolicy(
  new iam.PolicyStatement({
    actions: ['s3:GetObject', 's3:PutObject'],  // No DeleteObject
    resources: [`${bucket.bucketArn}/renders/*`],  // Only renders prefix
  })
);

3. Enable VPC Flow Logs

Monitor network traffic:

vpc.addFlowLog('FlowLog', {
  destination: ec2.FlowLogDestination.toCloudWatchLogs(),
  trafficType: ec2.FlowLogTrafficType.ALL,
});

Pros & Cons

Advantages

✅ Scales automatically (up to 10 concurrent tasks by default)
✅ No server management required
✅ Pay only for task runtime
✅ Handles long-running renders (no Lambda timeout)
✅ Isolated execution per job
✅ CloudWatch monitoring built-in

Disadvantages

❌ Cold start overhead (~2 minutes)
❌ Requires AWS infrastructure
❌ More complex setup vs. local
❌ Costs more per render than local
❌ Debugging requires CloudWatch access

Next Steps

Monitoring & Alerts - Configure SNS alerts
Cost Optimization - Reduce render costs
Performance Tuning - Speed up renders
Troubleshooting Guide - Common issues