TechAnek

Automating AMI security patch updates across environments at scale


What Improved After Automation?

  • Deployment reliability improved to 98%

  • Manual operational effort reduced by 85%

  • Patch adoption across environments became faster

  • Human errors during rollouts dropped sharply

Summary

A large financial services client with a multibillion-dollar annual turnover faced repeated operational friction while keeping server fleets patched. Their application stacks are provisioned by CloudFormation and deployed into Auto Scaling groups. AMI updates required repetitive manual edits to stack parameters, which invited human error and consumed developer time.

We delivered an automated, auditable, single-click workflow that updates AMI IDs across CloudFormation stacks and reports progress back into CI. The solution cut manual touch points to one action and made the process observable and reliable.

Client and Context

Our client is a leading global asset management firm known for its innovative approach to alternative investments and retirement solutions. Guided by values like pushing boundaries, creating opportunities and leading with integrity, they support both institutional and individual investors across credit, equity, real assets and retirement strategies.

As part of their digital transformation journey, the firm built an AI-powered application on AWS to enhance customer engagement, showcasing their commitment to responsible, scalable technology and modern client experiences.

The Core Problem

Updating AMIs at scale looked simple in theory but in practice created several failure modes and inefficiencies.

  • Many CloudFormation stacks across accounts and environments, each with its own AMI parameter. Moving an environment to a new AMI meant tracking and updating many independent parameters.

  • Manual updates were error prone. Engineers sometimes updated the wrong stack, forgot to trigger the stack update after changing an SSM parameter, or assumed a stack was already current when it was not.

  • Detection was hard. Some AMIs were expired or untagged, making it difficult to tell whether a stack was referencing the latest image.

  • Pace of change made the task repetitive. With patched AMIs released on a weekly cadence, the work consumed developer time that would be better spent on higher-value initiatives.

  • Timeouts and responsiveness. An initial automation attempt that queried all stacks synchronously ran into API Gateway timeouts when scanning large accounts.

The client needed a reliable, auditable workflow that removed most manual steps while preserving control and notification.

Solution overview

We designed a lightweight control plane that integrates with the client's CI system and the existing CloudFormation workflow. Key goals were minimal user interaction, clear status reporting, and robust execution at account scale.

Components we delivered

Infrastructure as code

  • Terraform modules to provision a secure REST API gateway, two AWS Lambda functions, and a DynamoDB table to store job state and metadata.

  • Fine grained IAM roles and least privilege policies for each component.

Job initiation and discovery

  • A Jenkins Groovy pipeline acts as the single-click interface for engineers. The pipeline asks the user for a target AMI ID and an environment tag.

  • The pipeline calls the REST API gateway with a JSON payload. The API gateway invokes the first Lambda to begin discovery.
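The request body is a small JSON document. A representative payload might look like the following sketch; the field names here are illustrative assumptions, not the client's actual API contract:

```python
import json

# Illustrative payload the Jenkins pipeline could POST to the REST endpoint.
# Field names are assumptions; the AMI ID matches the example in the diagram.
payload = {
    "amiId": "ami-0a1b2c3d4e",    # target AMI supplied by the engineer
    "environmentTag": "staging",  # tag used to scope discovery
}
body = json.dumps(payload)
```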

Lambda one – discovery and orchestration

  • The first Lambda identifies which autoscaling groups and CloudFormation stacks match the configured AMI name and tags.

  • It writes an entry into DynamoDB with a unique job ID. The DynamoDB record stores stack names, account IDs, regions, and execution metadata.

  • The Lambda then triggers stack updates selectively. It can either update stack parameters directly or update SSM parameter store values that the stacks reference, depending on the client account pattern.

  • The Lambda operates asynchronously for large accounts. If its initial run will exceed the API gateway time budget, it re-invokes itself with state saved in DynamoDB and continues processing in the background.
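As a rough sketch, the initial DynamoDB job record might be assembled like this. The attribute names and the `uuid`-based job ID scheme are assumptions for illustration:

```python
import time
import uuid

def build_job_record(stacks, ami_id, environment):
    """Assemble the initial DynamoDB item for a new update campaign.

    `stacks` is a list of dicts holding stack name, account ID and region,
    as gathered during discovery. Attribute names are illustrative.
    """
    return {
        "jobId": str(uuid.uuid4()),
        "status": "IN_PROGRESS",
        "amiId": ami_id,
        "environment": environment,
        # Every discovered stack starts out pending.
        "stacks": [{**s, "state": "PENDING"} for s in stacks],
        "createdAt": int(time.time()),
    }

record = build_job_record(
    [{"stackName": "web-asg", "accountId": "111122223333", "region": "us-east-1"}],
    "ami-0a1b2c3d4e",
    "dev",
)
```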

Lambda two – status and reporting

  • The second Lambda reads DynamoDB using the job ID and reports the real time status of the update campaign.

  • It returns lists of stacks that are pending, done, failed, or skipped. It captures CloudFormation events and failure reasons so the caller can act.
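In sketch form, the status Lambda's aggregation over a job record could look like this; the per-stack `state` values mirror the categories above, and all names are assumptions:

```python
def summarize_job(record):
    """Group a job record's stacks by state for the status report."""
    summary = {"pending": [], "done": [], "failed": [], "skipped": []}
    for stack in record["stacks"]:
        summary[stack["state"].lower()].append(stack["stackName"])
    # The campaign is finished once nothing is still pending.
    complete = not summary["pending"]
    return {"jobId": record["jobId"], "complete": complete, **summary}

status = summarize_job({
    "jobId": "xyz-9312",
    "stacks": [
        {"stackName": "web", "state": "DONE"},
        {"stackName": "api", "state": "PENDING"},
        {"stackName": "batch", "state": "FAILED"},
    ],
})
```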

CI integration and notifications

  • The Jenkins pipeline triggers the discovery Lambda and then polls the status Lambda in a loop through the API gateway.

  • Depending on the status returned, the pipeline keeps running until the job completes or errors out.

  • On completion, the pipeline sends an email to the user with a human readable report listing succeeded stacks, failed stacks with failure messages, and stacks that required no change.

  • An additional Jenkins job can fetch the latest AMI by OS family on demand and populate it into an interactive build parameter so engineers do not need to log in to the console.
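The polling loop the pipeline runs boils down to the following, shown here in Python rather than Jenkins Groovy for brevity; the interval and timeout values are assumptions:

```python
import time

def poll_until_complete(fetch_status, interval=30, timeout=3600, sleep=time.sleep):
    """Poll the status endpoint until the job finishes or the budget runs out.

    `fetch_status` is any callable returning the status Lambda's response dict;
    injecting it keeps the loop testable without a live API Gateway.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("complete"):
            return status
        sleep(interval)
    raise TimeoutError("AMI update job did not complete within the time budget")

# Example with a stubbed status source that completes on the third poll.
responses = iter([{"complete": False}, {"complete": False}, {"complete": True, "done": 3}])
result = poll_until_complete(lambda: next(responses), interval=0, sleep=lambda _: None)
```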

[Architecture diagram: a single-click Jenkins CI build calls an API Gateway REST endpoint (asynchronous, so the 29-second limit causes no 504 timeouts), which invokes Lambda 1 for discovery and orchestration. Lambda 1 scans and tag-filters stacks, re-invokes itself asynchronously, updates CloudFormation stacks or SSM parameters, and writes durable job state and metadata (e.g. job-id xyz-9312, status IN_PROGRESS, stacks queued) to DynamoDB. Lambda 2 reads that state for real-time status polling by Jenkins (done, pending, failed counts) as the new AMI, e.g. ami-0a1b2c3d4e, rolls out across Dev, Staging, and Production, followed by a post-execution email report of succeeded, failed, and skipped stacks as a human-readable audit trail. Outcomes shown: 98% deployment reliability, 85% less manual effort.]

Design decisions and rationale

Use DynamoDB for durable state

  • A single DynamoDB table serves as the source of truth for each update campaign. It enables safe re-invocation of the discovery Lambda, has sufficient scalability for the job sizes, and provides a simple query surface for the status Lambda.

Asynchronous orchestration for reliability

  • API Gateway has a hard timeout of 29 seconds for synchronous integration. Scanning dozens or hundreds of stacks can take much longer. Requiring synchronous completion created brittle behavior.

  • The discovery Lambda was implemented to accept the initial API request then fork asynchronous processing. It writes an initial job entry and returns a job ID quickly. Background processing continues regardless of the HTTP connection.

Minimal blast radius and opt in

  • The orchestration only targets stacks that match a supplied AMI name pattern and configured tags. This preserves safe boundaries and prevents accidental updates to unrelated stacks.
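A minimal sketch of that opt-in filter, assuming stacks carry an `Environment` tag and the AMI name pattern uses shell-style wildcards (both assumptions for illustration):

```python
import fnmatch

def matches_scope(stack, ami_name_pattern, environment):
    """Return True only for stacks inside the configured blast radius."""
    tags = {t["Key"]: t["Value"] for t in stack.get("Tags", [])}
    return (
        tags.get("Environment") == environment
        and fnmatch.fnmatch(stack.get("AmiName", ""), ami_name_pattern)
    )

stacks = [
    {"AmiName": "app-base-2024-06", "Tags": [{"Key": "Environment", "Value": "dev"}]},
    {"AmiName": "legacy-image", "Tags": [{"Key": "Environment", "Value": "dev"}]},
]
in_scope = [s for s in stacks if matches_scope(s, "app-base-*", "dev")]
```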

Clear observability and retry semantics

  • Durable job state in DynamoDB, combined with Jenkins polling the status Lambda, gives engineers a live view of every update campaign and a per-stack audit trail.

  • Because the discovery Lambda persists its progress under the job ID, an interrupted or failed run can be safely re-invoked and resume from the stored state rather than starting over.

Technical challenge and solution

When we first ran the discovery Lambda across a large account the function took 40 to 50 seconds just to enumerate resources and prepare updates. This triggered API Gateway 504 errors for upstream callers. We considered raising API Gateway timeouts at the account level but deemed that fragile and impractical across the client’s many accounts.

The robust fix was to change to an asynchronous execution model. The initial Lambda call now performs three quick steps:

  • Validate input and permissions

  • Create a DynamoDB job record with a unique job id and initial metadata

  • Return the job id to the caller

The Lambda then re-invokes itself asynchronously using the AWS Lambda invocation API or pushes tasks to an SQS queue that the Lambda worker consumes. Background runs pick up the job id from DynamoDB and proceed to enumerate stacks and update them. This approach makes the API response time predictable and allows discovery to continue regardless of HTTP timeouts.
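The three-step fast path can be sketched as follows. The DynamoDB write and the asynchronous self-invocation are passed in as plain callables so the control flow is visible without AWS clients; in the real handler these would be `boto3` calls, and all names here are illustrative:

```python
import json
import uuid

def handler(event, put_item, invoke_async):
    """Sketch of the discovery Lambda's fast synchronous path.

    `put_item` stands in for the DynamoDB write and `invoke_async` for the
    fire-and-forget Lambda self-invocation (InvocationType "Event").
    """
    # 1. Validate input (permission checks elided in this sketch).
    if "amiId" not in event:
        return {"statusCode": 400, "body": json.dumps({"error": "amiId required"})}

    # 2. Create the durable job record before any slow work begins.
    job_id = str(uuid.uuid4())
    put_item({"jobId": job_id, "status": "IN_PROGRESS", "amiId": event["amiId"]})

    # 3. Hand off to a background self-invocation and return the job ID at
    #    once, keeping the response inside API Gateway's 29-second budget.
    invoke_async({"jobId": job_id})
    return {"statusCode": 202, "body": json.dumps({"jobId": job_id})}

saved, queued = [], []
response = handler({"amiId": "ami-0a1b2c3d4e"}, saved.append, queued.append)
```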

Conclusion

Automating AMI updates turned a repetitive, error-prone chore into a single-click, auditable workflow. By pairing a fast API front door with asynchronous Lambda orchestration and durable job state in DynamoDB, the client now rolls patched AMIs across environments with predictable response times, clear per-stack reporting, and far less manual effort, freeing engineers to focus on higher-value work while keeping fleets current on security patches.

Ready to Automate Patch Management at Scale?

Schedule a consultation and see how our AWS automation approach can streamline AMI rollouts across your environments, cut manual effort, and keep your server fleets secure.
Let's turn your patching challenges into a competitive advantage.