TechAnek

Automating AMI security patch updates across environments at scale


What Improved After Automation?

  • Deployment reliability improved to 98%

  • Manual operational effort reduced by 85%

  • Patch adoption across environments became faster

  • Human errors during rollouts dropped sharply

Summary

A large financial services client with a multibillion-dollar annual turnover faced repeated operational friction while keeping server fleets patched. Their application stacks are provisioned by CloudFormation and deployed into Auto Scaling groups. AMI updates required repetitive manual edits to stack parameters, which invited human error and consumed developer time.

We delivered an automated, auditable, single-click workflow that updates AMI IDs across CloudFormation stacks and reports progress back into CI. The solution cut manual touch points to one action and made the process observable and reliable.

Client and Context

Our client is a leading global asset management firm known for its innovative approach to alternative investments and retirement solutions. Guided by values like pushing boundaries, creating opportunities and leading with integrity, they support both institutional and individual investors across credit, equity, real assets and retirement strategies.

As part of their digital transformation journey, the firm built an AI-powered application on AWS to enhance customer engagement, showcasing their commitment to responsible, scalable technology and modern client experiences.

The Core Problem

Updating AMIs at scale looked simple in theory but in practice created several failure modes and inefficiencies.

  • Many CloudFormation stacks across accounts and environments, each with its own AMI parameter. Moving an environment to a new AMI meant tracking and updating many independent parameters.

  • Manual updates were error prone. Engineers sometimes updated the wrong stack, forgot to trigger the stack update after changing an SSM parameter, or assumed a stack was already current when it was not.

  • Detection was hard. Some AMIs were expired or untagged, making it difficult to tell whether a stack was referencing the latest image.

  • Pace of change made the task repetitive. With patched AMIs released on a weekly cadence, the work consumed developer time that would be better spent on higher-value initiatives.

  • Timeouts and responsiveness. An initial automation attempt that queried all stacks synchronously ran into API Gateway timeouts when scanning large accounts.

The client needed a reliable, auditable workflow that removed most manual steps while preserving control and notification.

Solution overview

We designed a lightweight control plane that integrates with the client's CI system and the existing CloudFormation workflow. Key goals were minimal user interaction, clear status reporting, and robust execution at account scale.

Components we delivered

Infrastructure as code

  • Terraform modules to provision a secure REST API gateway, two AWS Lambda functions, and a DynamoDB table to store job state and metadata.

  • Fine grained IAM roles and least privilege policies for each component.

Job initiation and discovery

  • A Jenkins Groovy pipeline acts as the single-click interface for engineers. The pipeline asks the user for a target AMI ID and an environment tag.

  • The pipeline calls the REST API gateway with a JSON payload. The API gateway invokes the first Lambda to begin discovery.
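The request body is a small JSON document. A representative payload might look like the following sketch; the field names here are illustrative assumptions, not the client's actual API contract:

```python
import json

# Illustrative payload the Jenkins pipeline could POST to the REST endpoint.
# Field names are assumptions; the AMI ID matches the example in the diagram.
payload = {
    "amiId": "ami-0a1b2c3d4e",    # target AMI supplied by the engineer
    "environmentTag": "staging",  # tag used to scope discovery
}
body = json.dumps(payload)
```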

Lambda one – discovery and orchestration

  • The first Lambda identifies which autoscaling groups and CloudFormation stacks match the configured AMI name and tags.

  • It writes an entry into DynamoDB with a unique job ID. The DynamoDB record stores stack names, account IDs, regions, and execution metadata.

  • The Lambda then triggers stack updates selectively. It can either update stack parameters directly or update SSM parameter store values that the stacks reference, depending on the client account pattern.

  • The Lambda operates asynchronously for large accounts. If its initial run will exceed the API gateway time budget, it re-invokes itself with state saved in DynamoDB and continues processing in the background.
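As a rough sketch, the initial DynamoDB job record might be assembled like this. The attribute names and the `uuid`-based job ID scheme are assumptions for illustration:

```python
import time
import uuid

def build_job_record(stacks, ami_id, environment):
    """Assemble the initial DynamoDB item for a new update campaign.

    `stacks` is a list of dicts holding stack name, account ID and region,
    as gathered during discovery. Attribute names are illustrative.
    """
    return {
        "jobId": str(uuid.uuid4()),
        "status": "IN_PROGRESS",
        "amiId": ami_id,
        "environment": environment,
        # Every discovered stack starts out pending.
        "stacks": [{**s, "state": "PENDING"} for s in stacks],
        "createdAt": int(time.time()),
    }

record = build_job_record(
    [{"stackName": "web-asg", "accountId": "111122223333", "region": "us-east-1"}],
    "ami-0a1b2c3d4e",
    "dev",
)
```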

Lambda two – status and reporting

  • The second Lambda reads DynamoDB using the job ID and reports the real time status of the update campaign.

  • It returns lists of stacks that are pending, done, failed, or skipped. It captures CloudFormation events and failure reasons so the caller can act.
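In sketch form, the status Lambda's aggregation over a job record could look like this; the per-stack `state` values mirror the categories above, and all names are assumptions:

```python
def summarize_job(record):
    """Group a job record's stacks by state for the status report."""
    summary = {"pending": [], "done": [], "failed": [], "skipped": []}
    for stack in record["stacks"]:
        summary[stack["state"].lower()].append(stack["stackName"])
    # The campaign is finished once nothing is still pending.
    complete = not summary["pending"]
    return {"jobId": record["jobId"], "complete": complete, **summary}

status = summarize_job({
    "jobId": "xyz-9312",
    "stacks": [
        {"stackName": "web", "state": "DONE"},
        {"stackName": "api", "state": "PENDING"},
        {"stackName": "batch", "state": "FAILED"},
    ],
})
```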

CI integration and notifications

  • The Jenkins pipeline triggers the discovery Lambda and then polls the status Lambda in a loop through the API gateway.

  • Depending on the status returned, the pipeline keeps running until the job completes or errors out.

  • On completion, the pipeline sends an email to the user with a human readable report listing succeeded stacks, failed stacks with failure messages, and stacks that required no change.

  • An additional Jenkins job can fetch the latest AMI by OS family on demand and populate it into an interactive build parameter so engineers do not need to log in to the console.
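The polling loop the pipeline runs boils down to the following, shown here in Python rather than Jenkins Groovy for brevity; the interval and timeout values are assumptions:

```python
import time

def poll_until_complete(fetch_status, interval=30, timeout=3600, sleep=time.sleep):
    """Poll the status endpoint until the job finishes or the budget runs out.

    `fetch_status` is any callable returning the status Lambda's response dict;
    injecting it keeps the loop testable without a live API Gateway.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("complete"):
            return status
        sleep(interval)
    raise TimeoutError("AMI update job did not complete within the time budget")

# Example with a stubbed status source that completes on the third poll.
responses = iter([{"complete": False}, {"complete": False}, {"complete": True, "done": 3}])
result = poll_until_complete(lambda: next(responses), interval=0, sleep=lambda _: None)
```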

[Architecture diagram: a single-click Jenkins CI build calls an API Gateway REST endpoint (asynchronous, so the 29-second limit causes no 504 timeouts), which invokes Lambda 1 for discovery and orchestration. Lambda 1 scans and tag-filters stacks, re-invokes itself asynchronously, updates CloudFormation stacks or SSM parameters, and writes durable job state and metadata (e.g. job-id xyz-9312, status IN_PROGRESS, stacks queued) to DynamoDB. Lambda 2 reads that state for real-time status polling by Jenkins (done, pending, failed counts) as the new AMI, e.g. ami-0a1b2c3d4e, rolls out across Dev, Staging, and Production, followed by a post-execution email report of succeeded, failed, and skipped stacks as a human-readable audit trail. Outcomes shown: 98% deployment reliability, 85% less manual effort.]

Design decisions and rationale

Use DynamoDB for durable state

  • A single DynamoDB table serves as the source of truth for each update campaign. It enables safe re-invocation of the discovery Lambda, has sufficient scalability for the job sizes, and provides a simple query surface for the status Lambda.

Asynchronous orchestration for reliability

  • API Gateway has a hard timeout of 29 seconds for synchronous integration. Scanning dozens or hundreds of stacks can take much longer. Requiring synchronous completion created brittle behavior.

  • The discovery Lambda was implemented to accept the initial API request then fork asynchronous processing. It writes an initial job entry and returns a job ID quickly. Background processing continues regardless of the HTTP connection.

Minimal blast radius and opt in

  • The orchestration only targets stacks that match a supplied AMI name pattern and configured tags. This preserves safe boundaries and prevents accidental updates to unrelated stacks.
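A minimal sketch of that opt-in filter, assuming stacks carry an `Environment` tag and the AMI name pattern uses shell-style wildcards (both assumptions for illustration):

```python
import fnmatch

def matches_scope(stack, ami_name_pattern, environment):
    """Return True only for stacks inside the configured blast radius."""
    tags = {t["Key"]: t["Value"] for t in stack.get("Tags", [])}
    return (
        tags.get("Environment") == environment
        and fnmatch.fnmatch(stack.get("AmiName", ""), ami_name_pattern)
    )

stacks = [
    {"AmiName": "app-base-2024-06", "Tags": [{"Key": "Environment", "Value": "dev"}]},
    {"AmiName": "legacy-image", "Tags": [{"Key": "Environment", "Value": "dev"}]},
]
in_scope = [s for s in stacks if matches_scope(s, "app-base-*", "dev")]
```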

Clear observability and retry semantics

  • Durable job state in DynamoDB, combined with Jenkins polling the status Lambda, gives engineers a live view of every update campaign and a per-stack audit trail.

  • Because the discovery Lambda persists its progress under the job ID, an interrupted or failed run can be safely re-invoked and resume from the stored state rather than starting over.

Technical challenge and solution

When we first ran the discovery Lambda across a large account the function took 40 to 50 seconds just to enumerate resources and prepare updates. This triggered API Gateway 504 errors for upstream callers. We considered raising API Gateway timeouts at the account level but deemed that fragile and impractical across the client’s many accounts.

The robust fix was to change to an asynchronous execution model. The initial Lambda call now performs three quick steps:

  • Validate input and permissions

  • Create a DynamoDB job record with a unique job id and initial metadata

  • Return the job id to the caller

The Lambda then re-invokes itself asynchronously using the AWS Lambda invocation API or pushes tasks to an SQS queue that the Lambda worker consumes. Background runs pick up the job id from DynamoDB and proceed to enumerate stacks and update them. This approach makes the API response time predictable and allows discovery to continue regardless of HTTP timeouts.
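The three-step fast path can be sketched as follows. The DynamoDB write and the asynchronous self-invocation are passed in as plain callables so the control flow is visible without AWS clients; in the real handler these would be `boto3` calls, and all names here are illustrative:

```python
import json
import uuid

def handler(event, put_item, invoke_async):
    """Sketch of the discovery Lambda's fast synchronous path.

    `put_item` stands in for the DynamoDB write and `invoke_async` for the
    fire-and-forget Lambda self-invocation (InvocationType "Event").
    """
    # 1. Validate input (permission checks elided in this sketch).
    if "amiId" not in event:
        return {"statusCode": 400, "body": json.dumps({"error": "amiId required"})}

    # 2. Create the durable job record before any slow work begins.
    job_id = str(uuid.uuid4())
    put_item({"jobId": job_id, "status": "IN_PROGRESS", "amiId": event["amiId"]})

    # 3. Hand off to a background self-invocation and return the job ID at
    #    once, keeping the response inside API Gateway's 29-second budget.
    invoke_async({"jobId": job_id})
    return {"statusCode": 202, "body": json.dumps({"jobId": job_id})}

saved, queued = [], []
response = handler({"amiId": "ami-0a1b2c3d4e"}, saved.append, queued.append)
```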

Conclusion

Automating AMI updates turned a repetitive, error-prone chore into a single-click, auditable workflow. By pairing a fast API front door with asynchronous Lambda orchestration and durable job state in DynamoDB, the client now rolls patched AMIs across environments with predictable response times, clear per-stack reporting, and far less manual effort, freeing engineers to focus on higher-value work while keeping fleets current on security patches.

Ready to Automate Patch Management at Scale?

Schedule a consultation and see how our AWS automation approach can streamline AMI rollouts across your environments, cut manual effort, and keep your server fleets secure.
Let's turn your patching challenges into a competitive advantage.