<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[DevOps with Piyush]]></title><description><![CDATA[Hands-on DevOps tutorials covering AWS, Kubernetes (EKS), ArgoCD, Terraform, CI/CD, and cloud security by a CKA-certified DevOps engineer.]]></description><link>https://blog.devopswithpiyush.in</link><generator>RSS for Node</generator><lastBuildDate>Sun, 26 Apr 2026 06:13:41 GMT</lastBuildDate><atom:link href="https://blog.devopswithpiyush.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Pro-Level Observability, Logging & Best Practices in AWS API Gateway (Part - 6)]]></title><description><![CDATA[You’ve built your API, secured it, assigned a custom domain, and safely deployed it to production. Everything is running smoothly until one day, you check your dashboard and see that 15% of your users]]></description><link>https://blog.devopswithpiyush.in/api-gateway-6</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-6</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[serverless]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[api]]></category><category><![CDATA[Devops articles]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 13:01:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/f753ed77-c86f-4477-ba28-2107075e407a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You’ve built your API, secured it, assigned a custom domain, and safely deployed it to 
production. Everything is running smoothly until one day, you check your dashboard and see that 15% of your users are getting <code>500 Internal Server Error</code> messages.</p>
<p>If you don't have observability set up, you are completely blind. You won't know <em>who</em> is getting the error, <em>why</em> it is happening, or <em>which</em> piece of your backend code is broken.</p>
<p>In this final post of the series, we are going to look at how to set up professional-grade logging and monitoring for Amazon API Gateway so you can debug any multi-point failure in seconds.</p>
<h2><strong>1. Amazon CloudWatch: The Black Box Recorder</strong></h2>
<p>Whenever an airplane crashes, investigators look for the "black box" to hear exactly what was happening in the cockpit. In AWS, <strong>CloudWatch Logs</strong> is your black box.</p>
<p>When you enable CloudWatch logging for your API Gateway, it records exactly what happens during request execution and client access. There are two main types of logs you need to care about:</p>
<ul>
<li><p><strong>Execution Logs:</strong> This tells you what happened <em>inside</em> API Gateway. Did the Lambda Authorizer allow the request? Did the data transformation map correctly? Did the backend server take too long to respond?</p>
</li>
<li><p><strong>Access Logs:</strong> This tells you <em>who</em> called the API. It records the caller's IP address, the time of the request, and the specific endpoint they tried to hit. (You can also send these access logs to <strong>Amazon Data Firehose</strong> if you want to store them in a massive data lake for long-term analysis.)</p>
</li>
</ul>
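<p>As a concrete example, here is the kind of JSON access-log format you can configure on a stage. The <code>$context</code> variables are filled in by API Gateway at request time; the field names on the left are your own choice:</p>
<pre><code class="language-json">{
  "requestId": "$context.requestId",
  "ip": "$context.identity.sourceIp",
  "requestTime": "$context.requestTime",
  "httpMethod": "$context.httpMethod",
  "path": "$context.path",
  "status": "$context.status",
  "responseLength": "$context.responseLength"
}
</code></pre>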
<p><em>Pro Tip:</em> Don't just look at logs; set up <strong>CloudWatch Alarms</strong>. You can tell CloudWatch to watch a specific metric, such as your API's error rate. If the error rate spikes above 5% for more than 5 minutes, CloudWatch can automatically send a notification to your engineering team's Slack channel via Amazon SNS.</p>
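<p>As a rough sketch, the alarm from that pro tip could be created with boto3 (the API name and SNS topic ARN are placeholders; the CloudWatch client is passed in as a parameter so the function is easy to test):</p>
<pre><code class="language-python"># In production you would pass a real client:
#   import boto3
#   cw = boto3.client("cloudwatch")
def create_error_rate_alarm(cloudwatch, api_name, sns_topic_arn):
    """Alarm when the average 5XXError rate exceeds 5% over 5 minutes."""
    return cloudwatch.put_metric_alarm(
        AlarmName=api_name + "-5xx-error-rate",
        Namespace="AWS/ApiGateway",
        MetricName="5XXError",
        Dimensions=[{"Name": "ApiName", "Value": api_name}],
        Statistic="Average",           # average of 0/1 samples = error rate
        Period=300,                    # evaluate over 5-minute windows
        EvaluationPeriods=1,
        Threshold=0.05,                # alarm above a 5% error rate
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],  # e.g. an SNS topic wired to Slack
    )
</code></pre>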
<h2><strong>2. AWS X-Ray: The MRI Machine for Your Code</strong></h2>
<p>CloudWatch tells you <em>that</em> an error happened in your API Gateway. But what if the error didn't actually happen in the Gateway? What if API Gateway passed the request to a Lambda function, which passed it to a DynamoDB database, and the database was the thing that timed out?</p>
<p>This is where <strong>AWS X-Ray</strong> comes in.</p>
<p>When you enable X-Ray tracing for your REST APIs (it works for Regional, Edge-optimized, and Private endpoints), X-Ray assigns a unique "Trace ID" to a user's request the second it hits the Gateway. That Trace ID follows the request as it travels through your entire AWS backend.</p>
<p>Instead of reading lines of text in a log file, X-Ray gives you a visual <strong>Service Map</strong>. It draws a flowchart showing your API Gateway connecting to your Lambda function, which in turn connects to your database.</p>
<p>If the database is running slowly, X-Ray will highlight that specific connection in red and tell you exactly how many milliseconds it took. It gives you an end-to-end view of the entire request so you can instantly analyze latencies and pinpoint the exact bottleneck.</p>
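<p>If you prefer code over console clicks, here is a minimal boto3-style sketch of turning tracing on for an existing REST API stage (the client is passed in so the snippet stays testable; <code>/tracingEnabled</code> is the stage setting that controls X-Ray tracing):</p>
<pre><code class="language-python"># In production: apigw = boto3.client("apigateway")
def enable_xray_tracing(apigw, rest_api_id, stage_name):
    """Flip the tracingEnabled flag on a deployed REST API stage."""
    return apigw.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=[
            {"op": "replace", "path": "/tracingEnabled", "value": "true"}
        ],
    )
</code></pre>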
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/8cebbee5-211f-4925-bbd1-2d8249fecbf9.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>3. AWS CloudTrail: The Security Camera</strong></h2>
<p>While CloudWatch monitors the <em>traffic</em> hitting your API, <strong>AWS CloudTrail</strong> monitors the <em>developers</em> managing your API.</p>
<p>CloudTrail provides a continuous record of every single action taken by a user, IAM role, or AWS service inside your account.</p>
<p>If someone on your team accidentally deletes a route, disables an authorizer, or pushes a bad configuration change, CloudTrail records it. You can look at the CloudTrail history to determine exactly <em>who</em> made the change, <em>when</em> it happened, and from <em>which IP address</em>.</p>
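<p>As an illustrative sketch, the boto3 <code>lookup_events</code> call can answer that "who did it?" question (the event name is a placeholder; pass whichever management event you are investigating):</p>
<pre><code class="language-python"># In production: ct = boto3.client("cloudtrail")
def who_changed_my_api(cloudtrail, event_name="DeleteRoute"):
    """Return (username, time, event) tuples for recent matching events."""
    resp = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        MaxResults=10,
    )
    return [
        (e.get("Username"), e.get("EventTime"), e.get("EventName"))
        for e in resp["Events"]
    ]
</code></pre>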
<h2><strong>4. AWS Config: The Compliance Checker</strong></h2>
<p>If you manage a large enterprise, you might have hundreds of APIs running at once. How do you make sure every single one of them has X-Ray tracing enabled and a Web Application Firewall (WAF) attached?</p>
<p>You use <strong>AWS Config</strong>.</p>
<p>AWS Config lets you define strict rules for your resources. You can create a rule that says: <em>"Every API Gateway must have CloudWatch Access Logging enabled."</em> If a developer creates a new API and forgets to turn on logging, AWS Config will immediately flag that API as "noncompliant" and can even send an alert to your security team.</p>
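<p>For example, AWS ships a managed rule, <code>API_GW_EXECUTION_LOGGING_ENABLED</code>, that checks exactly this. A minimal boto3-style sketch of enabling it (rule name is a placeholder, client injected for testability):</p>
<pre><code class="language-python"># In production: cfg = boto3.client("config")
def require_api_logging(config, rule_name="api-gw-logging-enabled"):
    """Enable AWS Config's managed check for API Gateway execution logging."""
    return config.put_config_rule(
        ConfigRule={
            "ConfigRuleName": rule_name,
            "Source": {
                "Owner": "AWS",
                "SourceIdentifier": "API_GW_EXECUTION_LOGGING_ENABLED",
            },
        }
    )
</code></pre>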
<h2><strong>Conclusion: You Are Ready for Production</strong></h2>
<p>Congratulations! Over the course of this 6-part series, you have gone from a complete beginner to mastering Amazon API Gateway.</p>
<p>You now know how to:</p>
<ul>
<li><p>Choose the right architecture (<strong>Part 1 &amp; 2</strong>)</p>
</li>
<li><p>Build real-time, two-way communication systems with WebSockets (<strong>Part 3</strong>)</p>
</li>
<li><p>Lock down your API with Authorizers and WAF (<strong>Part 4</strong>)</p>
</li>
<li><p>Launch safely using Custom Domains and Canary Deployments (<strong>Part 5</strong>)</p>
</li>
<li><p>Monitor, trace, and debug any failure in production (<strong>Part 6</strong>)</p>
</li>
</ul>
<p>API Gateway is the ultimate front door for modern, serverless applications. Whether you are building a simple side project or a massive enterprise platform, you now have the tools to route your traffic quickly, securely, and reliably.</p>
<p>Happy building!</p>
<p>Here is the full series roadmap for readers who jumped straight to Part 6:</p>
<table>
<thead>
<tr>
<th><strong>Part</strong></th>
<th><strong>Topic</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Part 1</strong></td>
<td>Architecture &amp; Core Concepts</td>
</tr>
<tr>
<td><strong>Part 2</strong></td>
<td>REST vs HTTP APIs &amp; Building Your First One</td>
</tr>
<tr>
<td><strong>Part 3</strong></td>
<td>WebSocket APIs &amp; Real-Time Applications</td>
</tr>
<tr>
<td><strong>Part 4</strong></td>
<td>Securing &amp; Throttling Your APIs (Auth, WAF, Quotas)</td>
</tr>
<tr>
<td><strong>Part 5</strong></td>
<td>Data Mapping, Custom Domains &amp; Deployments</td>
</tr>
<tr>
<td><strong>Part 6 (This Blog)</strong></td>
<td>Pro-Level Observability, Logging &amp; Best Practices</td>
</tr>
</tbody></table>
<p>This completes the blog series, which is based directly on the official AWS documentation.</p>
]]></content:encoded></item><item><title><![CDATA[Data Mapping, Custom Domains & Deployments in AWS API Gateway (Part - 5)]]></title><description><![CDATA[You’ve built your API, secured it with authentication, and set up throttling rules so nobody can crash your servers. You are finally ready to show it to the world.
But right now, your API lives at a U]]></description><link>https://blog.devopswithpiyush.in/api-gateway-5</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-5</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[route53]]></category><category><![CDATA[#domains]]></category><category><![CDATA[Custom Domain]]></category><category><![CDATA[deployment]]></category><category><![CDATA[Canary deployment]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[serverless]]></category><category><![CDATA[backend]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 12:42:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/14d8863e-7b87-4f48-838c-a4ebb329d33e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You’ve built your API, secured it with authentication, and set up throttling rules so nobody can crash your servers. You are finally ready to show it to the world.</p>
<p>But right now, your API lives at a URL that looks like this:<br /><a href="https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/dev"><code>https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/dev</code></a></p>
<p>No one wants to give that URL to their customers. Plus, what happens when you need to update your API? If you push a bad update, you could instantly break the app for every single user.</p>
<p>In this post, we’re going to look at how to launch your API like a professional. We will cover setting up a beautiful <strong>Custom Domain Name</strong>, and how to use <strong>Canary Deployments</strong> to safely roll out updates without risking a massive outage.</p>
<h2><strong>The Professional Touch: Custom Domain Names</strong></h2>
<p>A custom domain turns that ugly AWS URL into something clean and professional, like:<br /><a href="https://api.mycoolstartup.com/v1/users"><code>https://api.mycoolstartup.com/v1/users</code></a></p>
<p>Setting this up in API Gateway is straightforward, but you need two things before you start:</p>
<ol>
<li><p><strong>A Registered Domain Name:</strong> You can buy this through Amazon Route 53 or any third-party provider (like GoDaddy or Namecheap).</p>
</li>
<li><p><strong>An SSL/TLS Certificate:</strong> Your API needs to be secure (HTTPS). You must request a free certificate using <strong>AWS Certificate Manager (ACM)</strong>.</p>
</li>
</ol>
<h2><strong>How to Map the Domain</strong></h2>
<p>Once you have your certificate, you create a Custom Domain in the API Gateway console. API Gateway will generate a special target domain name. You take that target name, go to your DNS provider (like Route 53), and create a <code>CNAME</code> or <code>Alias</code> record pointing <a href="http://api.mycoolstartup.com"><code>api.mycoolstartup.com</code></a> to the API Gateway target.</p>
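<p>The same two steps can be sketched with boto3 for a Regional REST API (domain, certificate ARN, API ID, and stage below are placeholders; the client is injected so the sketch is testable):</p>
<pre><code class="language-python"># In production: apigw = boto3.client("apigateway")
def map_custom_domain(apigw, domain, cert_arn, rest_api_id, stage):
    """Create the custom domain, then map it to a deployed API stage."""
    apigw.create_domain_name(
        domainName=domain,
        regionalCertificateArn=cert_arn,       # the ACM certificate
        endpointConfiguration={"types": ["REGIONAL"]},
    )
    # Requests to the domain's root path now route to this API + stage.
    return apigw.create_base_path_mapping(
        domainName=domain, restApiId=rest_api_id, stage=stage
    )
</code></pre>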
<p><em>Pro Tip:</em> You can also set up <strong>Wildcard Custom Domains</strong>. If you want to give every customer their own API endpoint (like <code>customerA.mycoolstartup.com</code> and <code>customerB.mycoolstartup.com</code>), you can use a wildcard certificate (<code>*.mycoolstartup.com</code>) to route them all to the same API Gateway without having to set up hundreds of individual domains.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/37b2e60e-a305-406d-8b49-d46bd1d64b10.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Stages: Managing Environments</strong></h2>
<p>Before we talk about deploying updates, we need to talk about <strong>Stages</strong>.<br />When you deploy an API in AWS, you don't just deploy it "to the internet." You deploy it to a specific Stage. A stage is just a named reference to a snapshot of your API.</p>
<p>Most companies use stages to separate their environments:</p>
<ul>
<li><p><code>dev</code> (for developers testing new code)</p>
</li>
<li><p><code>qa</code> (for quality assurance testing)</p>
</li>
<li><p><code>prod</code> (the live version actual customers use)</p>
</li>
</ul>
<p>Instead of building three completely separate APIs, you build one API and deploy it to these three different stages.</p>
<h2><strong>Playing it Safe: Canary Deployments</strong></h2>
<p>Let's say your <code>prod</code> API is running perfectly, handling 10,000 users a minute. Your team has just built an exciting new feature, and you want to push it live.</p>
<p>If you update the <code>prod</code> stage directly and there is a bug in the code, all 10,000 users instantly crash. This is a disaster.</p>
<p>To solve this, API Gateway offers <strong>Canary Deployments</strong> (currently only available for REST APIs).</p>
<h2><strong>How a Canary Works</strong></h2>
<p>A Canary Deployment allows you to split your traffic. Instead of sending 100% of your users to the new code, you tell API Gateway:<br /><em>"Keep 95% of users on the old, stable version. Send a random 5% of users to the new, experimental version."</em></p>
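<p>The traffic split itself is easy to reason about with a tiny simulation. This is purely illustrative Python, not what API Gateway runs internally (in real life you would set <code>canarySettings</code> with <code>percentTraffic</code> on the stage):</p>
<pre><code class="language-python">import random

def pick_version(rng, canary_percent=5.0):
    # One routing decision: a request lands on the canary with
    # probability canary_percent / 100.
    return "canary" if rng.random() * 100 < canary_percent else "stable"

rng = random.Random(42)          # seeded for reproducibility
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(rng)] += 1
# counts["canary"] ends up close to 500, i.e. roughly 5% of 10,000 requests
</code></pre>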
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/28475a4a-d757-4266-850f-a8865be0659f.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Monitoring the Canary</strong></h2>
<p>Because you enabled the Canary, API Gateway automatically separates your logs and metrics. In AWS CloudWatch, you will see two separate folders: one for the 95% of normal traffic, and a special <code>/Canary</code> folder for the 5% testing the new code.</p>
<p>You monitor the Canary logs.</p>
<ul>
<li><p>Are the 5% of users getting errors? <strong>If yes</strong>, you instantly slide the traffic dial back to 0%. The experiment is over, but 95% of your users never noticed a thing.</p>
</li>
<li><p>Are the 5% of users getting fast, successful responses? <strong>If yes</strong>, you can "Promote" the Canary. API Gateway shifts 100% of the traffic over, and your new code officially becomes the new stable version.</p>
</li>
</ul>
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>You now have a beautifully named API that can be safely updated without breaking production. But what happens when things <em>do</em> go wrong? How do you figure out exactly which line of code is slowing down your system?</p>
<p>In our final post, <strong>Part 6: Pro-Level Observability, Logging &amp; Best Practices</strong>, we will cover how to use CloudWatch, CloudTrail, and X-Ray to trace and debug every single request that moves through your API.</p>
]]></content:encoded></item><item><title><![CDATA[Securing & Throttling Your APIs in AWS (Auth, WAF, Quotas) (Part - 4)]]></title><description><![CDATA[So far in this series, we have built fast HTTP APIs and real-time WebSocket APIs. But right now, we have a major problem: Our "front door" is wide open. Anyone on the internet can hit our API endpoint]]></description><link>https://blog.devopswithpiyush.in/api-gateway-4</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-4</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Devops articles]]></category><category><![CDATA[#Devopscommunity]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[serverless]]></category><category><![CDATA[Security]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[backend]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 12:28:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/33494487-4c0c-4956-99d7-097bce3981b5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So far in this series, we have built fast HTTP APIs and real-time WebSocket APIs. But right now, we have a major problem: Our "front door" is wide open. Anyone on the internet can hit our API endpoint, run our backend code, and potentially run up a massive AWS bill.</p>
<p>Security in the cloud is a "shared responsibility." AWS secures the physical servers and the network infrastructure, but <strong>you</strong> are responsible for deciding who is allowed to walk through your API's front door and what they are allowed to do once inside.</p>
<p>In this post, we will look at how to lock down your API Gateway using authentication, firewalls, and throttling rules.</p>
<h2><strong>The Bouncers: Authentication &amp; Authorization</strong></h2>
<p>You wouldn't let just anyone walk into a private club without checking their ID. In API Gateway, you have a few different "bouncers" you can hire to check IDs before letting a request through.</p>
<ol>
<li><p><strong>Amazon Cognito</strong> (The Standard Bouncer): If you are building a mobile or web app where users need to log in with a username and password (or via Google/Facebook), Amazon Cognito is usually the best choice. When a user logs in, Cognito gives their app a digital token. When the app calls your API, it flashes this token. API Gateway automatically checks with Cognito: "Is this token valid? Did this user really log in?" If yes, the request goes through.</p>
</li>
<li><p><strong>AWS IAM Roles</strong> (The VIP List): Sometimes, your API isn't meant for regular users. Maybe you have an internal AWS Lambda function or an EC2 server that needs to call your API. In this case, you use AWS Identity and Access Management (IAM). Instead of passwords, you give your internal servers special IAM roles. API Gateway checks this VIP list. If the server calling the API isn't on the list, it gets blocked immediately.</p>
</li>
<li><p><strong>Lambda Authorizers</strong> (The Custom Bouncer): What if you are already using a third-party login system like Auth0, or you have a weird, custom security requirement? You can write a Lambda Authorizer. This is simply a piece of custom code you write that runs before your actual API request. API Gateway hands the user's token (or headers) to your Lambda Authorizer. Your code inspects the data and returns a simple "Allow" or "Deny".</p>
</li>
</ol>
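<p>As a rough sketch, a token-based Lambda Authorizer might look like the handler below. The hard-coded token check and principal ID are placeholders; real code would verify a JWT or call your identity provider:</p>
<pre><code class="language-python">def handler(event, context=None):
    # For a TOKEN authorizer, API Gateway passes the client's token
    # in event["authorizationToken"] and the called method's ARN in
    # event["methodArn"].
    token = event.get("authorizationToken", "")
    effect = "Allow" if token == "secret-token" else "Deny"  # placeholder check
    return {
        "principalId": "user-123",  # placeholder identity
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event["methodArn"],
            }],
        },
    }
</code></pre>
<p>API Gateway caches the returned policy for a configurable TTL, so the authorizer doesn't have to run on every single request.</p>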
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/49e0393e-122c-4466-833a-31d952b6a983.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>The Security Guards: Firewalls and Policies</strong></h2>
<p>Checking IDs is great, but what if someone is trying to blow up the building? You need deeper security layers.</p>
<h2><strong>AWS WAF (Web Application Firewall)</strong></h2>
<p>If you chose to build a <strong>REST API</strong> (as we discussed in Part 2), you can attach <strong>AWS WAF</strong> directly to your API Gateway.<br />AWS WAF acts as a smart firewall that protects your API from common web exploits. If a hacker tries to send a malicious SQL injection attack or a flood of bot traffic, AWS WAF will intercept and block the request before it even reaches your API Gateway.</p>
<h2><strong>Resource Policies</strong></h2>
<p>Sometimes, you want to restrict access based on <em>where</em> the request is coming from, not just <em>who</em> is sending it. <strong>Resource Policies</strong> let you tell API Gateway: <em>"Only allow requests if they come from this specific IP address, or from inside this specific private network (VPC)."</em> This is perfect for internal company APIs.</p>
<h2><strong>The Managers: Usage Plans &amp; Throttling</strong></h2>
<p>Even legitimate users can crash your system if they send too many requests at once. To prevent your backend servers from melting (and your AWS bill from exploding), you need to set limits.</p>
<h2><strong>API Keys and Usage Plans</strong></h2>
<p>If you want to sell access to your API (like a weather data service) or limit how much third-party developers can use it, you can generate <strong>API Keys</strong>.<br />You group these keys into <strong>Usage Plans</strong>. For example:</p>
<ul>
<li><p><strong>Basic Plan:</strong> The user's API Key allows 1,000 requests per month.</p>
</li>
<li><p><strong>Pro Plan:</strong> The user's API Key allows 10,000 requests per month.</p>
</li>
</ul>
<p>Once a user hits their limit, API Gateway automatically blocks them with a <code>429 Too Many Requests</code> error.</p>
<h2><strong>Throttling and Burst Limits</strong></h2>
<p>What if a user tries to send all 1,000 of their monthly requests in a single second? That spike could crash your database.<br />API Gateway allows you to set <strong>Rate Limits</strong> (how many steady requests per second are allowed) and <strong>Burst Limits</strong> (how many sudden, simultaneous requests are allowed). This ensures smooth, predictable traffic flow to your backend servers.</p>
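<p>A useful mental model for how rate and burst limits interact is a token bucket. The sketch below is illustrative Python, not API Gateway's actual implementation:</p>
<pre><code class="language-python">class Throttle:
    def __init__(self, rate, burst):
        self.rate = rate            # steady requests/second refilled
        self.burst = burst          # bucket size: max sudden spike allowed
        self.tokens = float(burst)  # bucket starts full
        self.last = 0.0

    def allow(self, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would receive 429 Too Many Requests
</code></pre>
<p>With <code>Throttle(rate=10, burst=5)</code>, five requests arriving in the same instant all succeed, the sixth is rejected, and a tenth of a second later one token has refilled so traffic flows again at the steady rate.</p>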
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>Now our API is fast, supports real-time communication, and is locked down tight. But what happens when we need to release a new version of our API without breaking the old one? Or what if we want our API to live at <a href="http://api.mycoolstartup.com"><code>api.mycoolstartup.com</code></a> instead of a random, ugly AWS URL?</p>
<p>In <strong>Part 5: Data Mapping, Custom Domains &amp; Deployments</strong>, we will look at how to transform data on the fly and how to launch your API into production like a pro.</p>
]]></content:encoded></item><item><title><![CDATA[WebSocket APIs in AWS – Building Real-Time Magic (Part - 3)]]></title><description><![CDATA[In Part 2, we looked at HTTP and REST APIs. These are known as stateless APIs. You (the client) ask a question, the server gives an answer, and then the server immediately forgets about you. If you wa]]></description><link>https://blog.devopswithpiyush.in/api-gateway-3</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-3</guid><category><![CDATA[AWS]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Devops]]></category><category><![CDATA[serverless]]></category><category><![CDATA[websockets]]></category><category><![CDATA[backend]]></category><category><![CDATA[StateFUL]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Devops articles]]></category><category><![CDATA[DevOps Journey]]></category><category><![CDATA[#Devopscommunity]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:57:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/f8d163d1-e095-47a5-a679-ac4ca449ddf4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 2, we looked at HTTP and REST APIs. These are known as <em>stateless</em> APIs. You (the client) ask a question, the server gives an answer, and then the server immediately forgets about you. If you want another update, you have to ask again.</p>
<p>But what if you are building a chat application, a live stock ticker, or a multiplayer game? You can't have your app asking the server "Any new messages?" every single second—it would drain the user's battery and crash your server.</p>
<p>You need the server to say, "Hey, don't keep asking. Just stay on the line, and I will push the new messages to you the second they arrive."</p>
<p>This is where <strong>Amazon API Gateway WebSocket APIs</strong> come in.</p>
<h2><strong>What is a WebSocket API?</strong></h2>
<p>Unlike a standard REST API, a WebSocket API is <strong>stateful and bidirectional</strong>.</p>
<p>Think of a REST API like sending a text message: You send a text, wait, and get a reply.<br />Think of a WebSocket API like a phone call: You dial the number, someone picks up, and the line stays open. Both of you can talk and listen at the exact same time without having to hang up and redial.</p>
<p>In API Gateway, a WebSocket API creates a persistent connection between your user's app and your AWS backend. The backend can now independently push data down to the client without the client explicitly requesting it.</p>
<h2><strong>How Do WebSockets Work in API Gateway?</strong></h2>
<p>Because the connection stays open, API Gateway needs a way to figure out what to do with the continuous stream of messages flowing back and forth. It does this using <strong>Routes</strong>.</p>
<p>When you build an HTTP API, you use URLs (like <code>/get-weather</code> or <code>/update-profile</code>) to tell the server what you want. But in a WebSocket, there is only one URL. Once you are connected, everything happens over that single open connection.</p>
<p>So, how does the server know if a message is a "chat message" or a "friend request"? API Gateway looks inside the actual content of the message using something called a <strong>Route Selection Expression</strong>.</p>
<p>If your app sends a JSON message like this:</p>
<pre><code class="language-json">{
 "action": "send_message",
 "text": "Hello World!" 
}
</code></pre>
<p>API Gateway can look at the <code>"action"</code> property. It sees <code>"send_message"</code> and routes that specific chunk of data to the correct AWS Lambda function.</p>
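<p>You can mimic that routing decision in a few lines of Python. The handler functions here are stand-ins for your Lambda functions; the real route selection expression for this setup would be <code>$request.body.action</code>:</p>
<pre><code class="language-python">import json

def route_message(raw, handlers):
    # Mimics API Gateway evaluating the route selection expression
    # $request.body.action against one incoming message.
    try:
        action = json.loads(raw).get("action")
    except json.JSONDecodeError:
        action = None
    handler = handlers.get(action, handlers["$default"])
    return handler(raw)

# Stand-ins for the Lambda functions each route would invoke:
handlers = {
    "send_message": lambda raw: "routed to SendMessageFunction",
    "$default": lambda raw: "Sorry, I didn't understand that command.",
}
</code></pre>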
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/a9d2e342-248a-42ff-972a-a5b106fdc593.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>The Three Magical Predefined Routes</strong></h2>
<p>When you set up a WebSocket API, AWS gives you three built-in routes to manage the lifecycle of the phone call:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0d9378be-1043-4dcf-96df-c10d4b3c7298.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p><code>$connect</code>: This triggers the exact moment a user opens the app and connects to the API. You usually connect this to a Lambda function that saves the user's unique "Connection ID" into a database (like DynamoDB) so you know who is online.</p>
</li>
<li><p><code>$disconnect</code>: This triggers when the user closes the app or loses their internet connection. You use this to delete their Connection ID from your database.</p>
</li>
<li><p><code>$default</code>: If the user sends a message that doesn't match any of your custom rules, it falls into this bucket. It is a great place to send error messages like "Sorry, I didn't understand that command."</p>
</li>
</ol>
<h2><strong>How the Server Talks Back</strong></h2>
<p>Getting messages from the user is easy, but how does the server push messages back to them?</p>
<p>Because you saved the user's "Connection ID" during the <code>$connect</code> phase, your backend services (like Lambda) can use a special AWS command called the <code>@connections</code> <strong>API</strong>.</p>
<p>If User A sends a chat message intended for User B, your Lambda function looks up User B's Connection ID in your database. It then uses the <code>@connections</code> API to push the text directly to User B's open WebSocket.</p>
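<p>With boto3, that push boils down to one call on the <code>apigatewaymanagementapi</code> client. A minimal sketch (the endpoint URL in the comment is a placeholder for your own API's URL; the client is injected so the function is testable):</p>
<pre><code class="language-python"># In production the client points at your WebSocket API's stage URL:
#   client = boto3.client(
#       "apigatewaymanagementapi",
#       endpoint_url="https://your-api-id.execute-api.us-east-1.amazonaws.com/production")
def push_to_client(client, connection_id, message):
    # ConnectionId is the ID you stored in DynamoDB during $connect.
    return client.post_to_connection(
        ConnectionId=connection_id,
        Data=message.encode("utf-8"),
    )
</code></pre>
<p>If the connection has already closed, <code>post_to_connection</code> raises a <code>GoneException</code>, which is your cue to delete that stale Connection ID from the database.</p>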
<h2><strong>Important Limitations to Keep in Mind</strong></h2>
<p>WebSockets are powerful, but they aren't magic. AWS enforces a few rules you need to know:</p>
<ul>
<li><p><strong>Idle Timeouts:</strong> If a user connects but doesn't send or receive any data for 10 minutes, API Gateway will automatically hang up the phone (closing the connection with a <strong>1001 status code</strong>).</p>
</li>
<li><p><strong>Maximum Lifespan:</strong> Even if the user is actively chatting, AWS forces a hard reset after 2 hours. Your app needs to be programmed to quietly reconnect when this happens.</p>
</li>
<li><p><strong>Payload Limits:</strong> If a user tries to send a message that is too massive (the cap is 128 KB per message), API Gateway will reject it with a <strong>1009 status code</strong>.</p>
</li>
</ul>
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>Now you know how to build fast HTTP APIs and real-time WebSocket APIs. But so far, we have left the front door completely unlocked. Anyone on the internet can access your endpoints, which could cost you a fortune or expose your data.</p>
<p>In <strong>Part 4: Securing &amp; Throttling Your APIs</strong>, we are going to lock things down. We will look at how to use IAM, Lambda Authorizers, and Amazon Cognito to ensure only the right people get through the door, and how to use Quotas so they don't overwhelm your servers.</p>
]]></content:encoded></item><item><title><![CDATA[REST vs. HTTP APIs in AWS – Which One Should You Pick? (And How to Build Your First One) (Part - 2)]]></title><description><![CDATA[In Part 1, we learned that API Gateway acts as the helpful "waiter" standing between your users and your backend servers. But when you log into the AWS Console to create your first API, AWS asks you t]]></description><link>https://blog.devopswithpiyush.in/api-gateway-2</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-2</guid><category><![CDATA[AWS]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Devops]]></category><category><![CDATA[serverless]]></category><category><![CDATA[APIs]]></category><category><![CDATA[REST API]]></category><category><![CDATA[http api]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:19:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/1e48e602-7d4e-4017-8150-dee8792dd198.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 1, we learned that API Gateway acts as the helpful "waiter" standing between your users and your backend servers. But when you log into the AWS Console to create your first API, AWS asks you to choose a menu: Do you want an HTTP API or a REST API?</p>
<p>Both of them do the exact same core job (moving data between a client and a server), but they have very different price tags and features. Let's break down the difference in simple English and then build one in less than 5 minutes.</p>
<h2><strong>The Fine Dining vs. Fast Food Analogy</strong></h2>
<p>Think of a <strong>REST API</strong> like a high-end, fine-dining restaurant experience. You get a massive menu of features: valet parking, custom table settings, and a sommelier. In the AWS world, this means built-in API keys to sell access to your API, strict request validation (making sure users don't send garbage data), and integration with AWS WAF to block hackers. But just like fine dining, it is heavier and costs more.</p>
<p>Think of an <strong>HTTP API</strong> like a high-quality fast-food drive-thru. It is designed to be lean, incredibly fast, and very cheap. It strips away the heavy "fine dining" features you probably don't need for a simple app. If you just want to connect a mobile app to an AWS Lambda function as quickly and cheaply as possible, this is your choice.</p>
<h2><strong>The Showdown: HTTP API vs. REST API</strong></h2>
<p>Here is a simple cheat sheet to help you decide which API type fits your project:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>HTTP API (The Fast Track)</th>
<th>REST API (The Heavyweight)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Cost</strong></td>
<td>Up to 71% cheaper than REST.</td>
<td>More expensive.</td>
</tr>
<tr>
<td><strong>Speed</strong></td>
<td>Lower latency (faster responses).</td>
<td>Slightly higher latency due to heavy features.</td>
</tr>
<tr>
<td><strong>API Keys &amp; Monetization</strong></td>
<td>❌ Not supported.</td>
<td>✅ Yes, you can generate keys and throttle usage per client.</td>
</tr>
<tr>
<td><strong>AWS WAF (Firewall)</strong></td>
<td>❌ Not supported.</td>
<td>✅ Yes, built-in protection against web exploits.</td>
</tr>
<tr>
<td><strong>Edge-Optimized Endpoints</strong></td>
<td>❌ Regional only.</td>
<td>✅ Yes, routes traffic through AWS's global network.</td>
</tr>
<tr>
<td><strong>Built-in Caching</strong></td>
<td>❌ Not supported.</td>
<td>✅ Yes, caches responses to save backend compute time.</td>
</tr>
<tr>
<td><strong>When to use it?</strong></td>
<td>Connecting a simple web/mobile app directly to a Lambda function or a database.</td>
<td>Enterprise apps, public APIs you want to sell, or highly secure financial apps.</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/7d0694f2-8789-4ea7-a8eb-755195a5b3f3.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Let's Build Your First HTTP API (In 5 Minutes)</strong></h2>
<p>Since HTTP APIs are the easiest and cheapest way to get started, let's build a simple one right now. We will assume you already have a basic "Hello World" AWS Lambda function ready to go.</p>
<h2><strong>Step 1: Create the API</strong></h2>
<p>Log into the AWS Management Console, search for <strong>API Gateway</strong>, and click <strong>Create API</strong>. Under "HTTP API," click the <strong>Build</strong> button.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/1ff01d92-cd32-4b5b-a30b-14699bd6f4dc.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Step 2: Add Your Integration</strong></h2>
<p>API Gateway will ask you what you want this API to talk to. Click <strong>Add integration</strong>. Select <strong>Lambda</strong> from the dropdown, and then choose your "Hello World" Lambda function. Give your API a name (like <code>MyFirstFastAPI</code>).</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/fe8f090b-c876-42ef-8f27-d585305c54ab.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Step 3: Configure Your Routes</strong></h2>
<p>A "Route" is just the specific URL path a user visits to trigger your code.</p>
<ul>
<li><p>Set the Method to <strong>GET</strong> (this means the user is just asking for data).</p>
</li>
<li><p>Set the Resource path to <code>/hello</code>.</p>
</li>
<li><p>Make sure it points to your Lambda function.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/87570e51-1ab6-4d79-965d-61a05f3f8414.png" alt="" style="display:block;margin:0 auto" />
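<p>If you don't have the "Hello World" function yet, here is a minimal sketch of one (Python is assumed here; any supported runtime works the same way). With an HTTP API, Lambda receives the request as an event dictionary and returns a dictionary describing the HTTP response:</p>

```python
import json

def lambda_handler(event, context):
    # HTTP APIs (payload format 2.0) pass the query string in the event;
    # the key is None when the caller sends no parameters at all.
    name = (event.get("queryStringParameters") or {}).get("name", "World")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

<p>Calling <code>/hello?name=Piyush</code> would then return <code>{"message": "Hello, Piyush!"}</code>, while a plain <code>/hello</code> falls back to the default greeting.</p>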

<h2><strong>Configure Stages (Optional)</strong></h2>
<p>Stages are independently configurable environments that your API can be deployed to. You must deploy to a stage for API configuration changes to take effect, unless that stage is configured to autodeploy. By default, all HTTP APIs created through the console have a default stage named $default. All changes that you make to your API are autodeployed to that stage. You can add stages that represent environments such as development or production.</p>
<h2><strong>Step 4: Deploy and Test!</strong></h2>
<p>AWS HTTP APIs have a magical feature called <strong>Automatic Deployments</strong>. As soon as you hit "Create," AWS immediately pushes your API to the internet.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/02afa3be-0fbb-4208-b5aa-23116edfda48.png" alt="" style="display:block;margin:0 auto" />

<p>You will see an "Invoke URL" on your screen. Copy that URL, paste it into your browser, add <code>/hello</code> to the end, and hit enter. Boom! You just triggered a serverless backend from the public internet.</p>
<pre><code class="language-plaintext">https://d-l9jomka06f.execute-api.us-east-1.amazonaws.com/hello
</code></pre>
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>Now you know how to build a basic stateless API. But what if you are building a chat application, a live stock ticker, or a multiplayer game where the server needs to push updates to the user instantly? A standard HTTP API won't cut it.</p>
<p>In <strong>Part 3: WebSocket APIs — Building Real-Time Magic</strong>, we will dive into stateful connections, where the API keeps the connection open constantly for real-time two-way communication.</p>
]]></content:encoded></item><item><title><![CDATA[The Ultimate Beginner's Guide to AWS API Gateway Architecture (Part - 1)]]></title><description><![CDATA[Have you ever wondered how mobile apps and websites magically talk to servers without crashing when millions of users log in at once? The secret often lies in a powerful "front door" known as an API G]]></description><link>https://blog.devopswithpiyush.in/api-gateway-1</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-1</guid><category><![CDATA[AWS]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Devops]]></category><category><![CDATA[serverless]]></category><category><![CDATA[learning]]></category><category><![CDATA[lambda]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:56:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/65132fc4-5c9f-4106-b23a-30a03bb278fe.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever wondered how mobile apps and websites magically talk to servers without crashing when millions of users log in at once? The secret often lies in a powerful "front door" known as an API Gateway. In this post, we are going to break down the architecture and fundamentals of Amazon API Gateway.</p>
<h2><strong>What is Amazon API Gateway?</strong></h2>
<p>Imagine you are at a massive, bustling luxury restaurant. You (the client) don't walk directly into the kitchen (the server) to cook your own food or yell your order at the chefs. Instead, you talk to a waiter. The waiter takes your order, makes sure you are allowed to order from that menu, hands the request to the right chef in the kitchen, and then brings your food back to you.</p>
<p>In the AWS cloud, <strong>Amazon API Gateway is that waiter</strong>.</p>
<p>It is a fully managed AWS service that acts as the "front door" for your applications. Instead of your mobile app or website talking directly to your backend databases or code, it talks to the API Gateway. The Gateway handles all the heavy lifting—like accepting up to hundreds of thousands of concurrent API calls, managing traffic, and ensuring only authorized users get through.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/259c9828-0984-47be-9cb0-2d3bcbeeec5a.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>The Core Architecture</strong></h2>
<p>To understand how API Gateway works, you only need to know how it sits between your users and your backend.</p>
<p>When a user interacts with your app, their request hits an <strong>API endpoint</strong>. This is essentially a web address (a URL) that API Gateway provides. AWS offers different types of endpoints depending on where your users are:</p>
<ul>
<li><p><strong>Edge-optimized endpoints:</strong> Best for users scattered globally. It uses AWS's global network to route requests to the nearest location, speeding up the connection.</p>
</li>
<li><p><strong>Regional endpoints:</strong> Perfect if your users and your backend servers are in the same geographic region, cutting out unnecessary travel time.</p>
</li>
<li><p><strong>Private endpoints:</strong> Used when you want to keep your API completely hidden from the public internet, allowing access only from within your secure AWS network.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/fb07c281-7fd6-4c6f-9dee-4cde8f55cef0.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>How Requests Travel: The Integration Phase</strong></h2>
<p>Once the API Gateway receives a request, it needs to know what to do with it. This is where <strong>Integrations</strong> come in.</p>
<p>API Gateway uses an <strong>Integration request</strong> to map the incoming data (like a user submitting a form) into a format that your backend code can understand. It then passes the request to your backend—which could be an AWS Lambda function, an Amazon EC2 server, or any other web application.</p>
<p>Once your backend does its job (like fetching user data), it sends the data back to the API Gateway. The Gateway uses an <strong>Integration response</strong> to package that data neatly and hand it back to the user's app.</p>
<h2><strong>A Real-World Example: Proxy Integration</strong></h2>
<p>Sometimes, you don't want the waiter to repackage your order; you just want them to hand it straight to the chef as-is. This is called a <strong>Proxy integration</strong>.</p>
<p>Let's say you have a simple app that checks the weather. With a proxy integration, API Gateway takes the user's exact request ("What is the weather in London?"), hands the entire thing directly to an AWS Lambda function, and then takes the Lambda function's exact answer and gives it back to the user. It is the easiest and most common way to connect API Gateway to serverless code today because it requires minimal setup.</p>
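<p>To make "hand it straight to the chef" concrete, here is a rough sketch of a proxy-integrated handler (HTTP API payload format 2.0 field names are assumed; other API types shape the event slightly differently). The function simply reads what API Gateway passed through untouched:</p>

```python
import json

def lambda_handler(event, context):
    # With proxy integration, the whole request arrives as-is:
    # method, path, headers, and query string are all inside `event`.
    method = event.get("requestContext", {}).get("http", {}).get("method", "GET")
    path = event.get("rawPath", "/")
    # Whatever we return here is handed straight back to the caller.
    return {
        "statusCode": 200,
        "body": json.dumps({"method": method, "path": path}),
    }
```

<p>No mapping templates, no repackaging: the event in and the response out are the whole contract.</p>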
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/fa018df1-41d5-4ced-aadc-cf6b9da1a750.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>What's Next?</strong></h2>
<p>This was <strong>Part 1</strong> of our complete AWS API Gateway blog series, where we covered the foundation — what API Gateway is, how its architecture works, how requests travel through integrations, and the different endpoint types available to you.</p>
<p>Now that you understand the "waiter" and how the restaurant works, it's time to look at the <strong>menu options</strong>. API Gateway doesn't offer just one type of API — it gives you three distinct flavors: <strong>REST APIs, HTTP APIs, and WebSocket APIs</strong>. Choosing the wrong one can cost you extra money or leave you without features you need.</p>
<p>In <strong>Part 2: REST APIs vs HTTP APIs — Which One Should You Pick?</strong>, we will break down the two stateless API types side by side in plain English. We'll cover:</p>
<ul>
<li><p>What makes REST APIs and HTTP APIs different (spoiler: it's not just the name)</p>
</li>
<li><p>A simple comparison table of features, pricing, and use cases</p>
</li>
<li><p>When to pick one over the other with real-world scenarios</p>
</li>
<li><p>Common mistakes beginners make when choosing between them</p>
</li>
</ul>
<p>If you are just getting started with API Gateway, <strong>bookmark this series</strong> — we are going to cover every single feature, configuration, and limitation of the service across the upcoming parts.</p>
<p>💡 <strong>Pro Tip:</strong> Each blog in this series is designed to be read independently, but following the sequence will give you the most complete understanding — from zero to production-ready.</p>
]]></content:encoded></item><item><title><![CDATA[AWS Lambda: The Complete Guide — From Zero to Expert]]></title><description><![CDATA[AWS Lambda is one of the most widely used services in modern cloud and DevOps architectures — but many engineers still struggle to understand when to actually use it.
Should you use Lambda or EC2?When]]></description><link>https://blog.devopswithpiyush.in/aws-lambda</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/aws-lambda</guid><category><![CDATA[aws lambda]]></category><category><![CDATA[AWS]]></category><category><![CDATA[serverless]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[sam-template]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sat, 21 Mar 2026 17:54:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0433770c-28dc-448e-b920-9e2b9f815903.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS Lambda is one of the most widely used services in modern cloud and DevOps architectures — but many engineers still struggle to understand when to actually use it.</p>
<p>Should you use Lambda or EC2?<br />When does serverless make sense?<br />What are the real-world scenarios?</p>
<p>In this guide, we’ll go from zero to advanced — covering how Lambda works, when to use it, key configurations, and production-grade patterns like API Gateway integration and SAM templates.</p>
<p>By the end, you’ll not just understand Lambda — you’ll know how to use it in real systems.</p>
<hr />
<h2><strong>Lambda vs EC2: When to Use What</strong></h2>
<p>EC2 gives you full control over virtual servers — OS, networking, storage, patching — while Lambda abstracts all of that away.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Lambda (Serverless)</th>
<th>EC2 (Server-based)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Management</strong></td>
<td>AWS manages OS, patching, scaling</td>
<td>You manage everything</td>
</tr>
<tr>
<td><strong>State</strong></td>
<td>Stateless (ephemeral)</td>
<td>Stateful (persistent)</td>
</tr>
<tr>
<td><strong>Pricing</strong></td>
<td>Pay per request + duration (ms)</td>
<td>Pay per hour/second for provisioned capacity</td>
</tr>
<tr>
<td><strong>Scaling</strong></td>
<td>Automatic, instant</td>
<td>Manual or Auto Scaling Groups</td>
</tr>
<tr>
<td><strong>Max Execution</strong></td>
<td>15 minutes</td>
<td>Unlimited</td>
</tr>
<tr>
<td><strong>Control</strong></td>
<td>Low</td>
<td>Full OS-level control</td>
</tr>
</tbody></table>
<h2><strong>When to Use Lambda</strong></h2>
<ul>
<li><p><strong>Event-driven workloads</strong>: S3 file uploads triggering processing, DynamoDB stream handlers</p>
</li>
<li><p><strong>API backends</strong>: Lightweight REST/GraphQL APIs behind API Gateway</p>
</li>
<li><p><strong>Scheduled tasks</strong>: Cron-like jobs (e.g., daily tenant reports like the <code>Daily-tenant-report</code> function in the screenshot)</p>
</li>
<li><p><strong>Chatbot/IoT processing</strong>: Handling Alexa skills, IoT device data</p>
</li>
<li><p><strong>Automation</strong>: Infrastructure tasks triggered by CloudTrail or Config rules</p>
</li>
</ul>
<h2><strong>When to Use EC2</strong></h2>
<ul>
<li><p>Long-running processes (&gt;15 minutes)</p>
</li>
<li><p>Stateful applications needing persistent memory</p>
</li>
<li><p>Legacy monolithic apps requiring specific OS configurations</p>
</li>
<li><p>GPU/specialized hardware workloads</p>
</li>
</ul>
<h2><strong>Hybrid Approach</strong></h2>
<p>Many organizations use both — Lambda for bursty, event-driven tasks and EC2 for steady-state workloads requiring fine-grained control.</p>
<h2>Creating a Lambda Function (Step by Step)</h2>
<p>When you click <strong>Create function</strong> in the console, you see several options:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ae6a9500-8bf7-4210-bb1c-83027d9f0c16.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p><strong>Author from Scratch</strong>: Start with a Hello World example. You pick a runtime (e.g., nodejs24.x, python3.12), name your function, and Lambda sets up a basic handler.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/438b3782-b433-4768-bf4f-94ddbf968038.png" alt="" style="display:block;margin:0 auto" />
</li>
<li><p><strong>Use a Blueprint</strong>: Pre-built sample code for common use cases — S3 thumbnail generation, DynamoDB processing, Kinesis stream readers. Great for learning.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/a67b3a37-9c03-4fde-94ae-2724e40e3b3e.png" alt="" style="display:block;margin:0 auto" />
</li>
<li><p><strong>Container Image</strong>: Deploy your function as a Docker container image stored in Amazon ECR. More on this in the advanced section below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/c1377b83-529f-4f48-bb05-0a163570d4d1.png" alt="" style="display:block;margin:0 auto" /></li>
</ol>
<hr />
<h2><strong>Architecture: arm64 vs x86_64</strong></h2>
<p>When creating a function, you choose the instruction set architecture:</p>
<table>
<thead>
<tr>
<th>Architecture</th>
<th>Description</th>
<th>Best For</th>
</tr>
</thead>
<tbody><tr>
<td><strong>x86_64</strong></td>
<td>Traditional Intel/AMD. Default option.</td>
<td>Compatibility with existing libraries</td>
</tr>
<tr>
<td><strong>arm64</strong></td>
<td>AWS Graviton2 processors. Up to 20% cheaper and often faster.</td>
<td>Cost optimization, new workloads</td>
</tr>
</tbody></table>
<p><strong>Scenario</strong>: If you're writing a Python-based daily report generator (like <code>Daily-tenant-report</code>), <code>arm64</code> is an easy win — most Python packages support it and you save money.</p>
<hr />
<h2><strong>Lambda Configuration Deep Dive</strong></h2>
<h2><strong>General Configuration</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/de0f99b2-7547-47e3-b417-e344695d10cf.png" alt="" style="display:block;margin:0 auto" />

<ul>
<li><p><strong>Memory</strong>: 128 MB to 10,240 MB. CPU scales proportionally with memory.</p>
</li>
<li><p><strong>Timeout</strong>: 1 second to 15 minutes max.</p>
</li>
<li><p><strong>Ephemeral storage (</strong><code>/tmp</code><strong>)</strong>: 512 MB to 10,240 MB for temporary files.</p>
</li>
</ul>
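<p>The timeout setting matters inside your code, too: a handler can ask the context object how much time it has left and wind down gracefully instead of being killed mid-task. A sketch (the doubling "work" is just a placeholder):</p>

```python
def lambda_handler(event, context):
    # Stop taking on new work when fewer than 5 seconds remain,
    # rather than being cut off by the configured timeout.
    processed = []
    for item in event.get("items", []):
        if context.get_remaining_time_in_millis() < 5_000:
            break
        processed.append(item * 2)  # placeholder for real work
    return {"processed": processed}
```

<p><code>get_remaining_time_in_millis()</code> is part of the standard Lambda context object for Python, so no extra setup is needed.</p>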
<h2><strong>Environment Variables</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/6f4bb45d-91c8-4544-9010-1a0bd0f32486.png" alt="" style="display:block;margin:0 auto" />

<p>Key-value pairs injected at runtime. Use them for:</p>
<ul>
<li><p>Database connection strings</p>
</li>
<li><p>API keys (encrypted with KMS)</p>
</li>
<li><p>Feature flags</p>
</li>
<li><p>Stage identifiers (<code>prod</code>, <code>staging</code>)</p>
</li>
</ul>
<pre><code class="language-python">import os

# Fail fast at cold start if a required setting is missing
DB_HOST = os.environ['DB_HOST']
API_KEY = os.environ['API_KEY']
STAGE = os.environ.get('STAGE', 'staging')  # optional, with a default
</code></pre>
<h2><strong>Permissions (Execution Role)</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/6db08e69-d9ed-4036-aa1b-d8666d91a74d.png" alt="" style="display:block;margin:0 auto" />

<p>Every Lambda function needs an IAM execution role. By default, Lambda creates one with CloudWatch Logs permissions. You add policies for whatever the function accesses — S3, DynamoDB, SQS, etc.</p>
<p><strong>Scenario</strong>: Your <code>Daily-tenant-report</code> function needs to read from DynamoDB and send emails via SES → attach <code>AmazonDynamoDBReadOnlyAccess</code> and <code>AmazonSESFullAccess</code> policies to the execution role.</p>
<h2><strong>VPC Configuration</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/8d025aca-37f5-480a-b078-25889154a7b8.png" alt="" style="display:block;margin:0 auto" />

<p>Connect your Lambda to a VPC to access private resources like RDS databases or ElastiCache. When enabled, Lambda creates ENIs in your specified subnets.</p>
<p><strong>Trade-off</strong>: VPC-connected functions may have slightly longer cold starts, though AWS has significantly improved this.</p>
<h2><strong>Function URL</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/467b2f4d-7365-4bd6-91fc-10fd23771e37.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/a2935733-033e-444b-88a9-028f6e89c15e.png" alt="" style="display:block;margin:0 auto" />

<p>Assign an HTTPS endpoint directly to your Lambda — no API Gateway needed. Great for simple webhooks or internal tools.</p>
<h2><strong>Triggers</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0a66712d-3241-4b9c-bc6c-6b7dc7565578.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/3c24a882-0c14-4970-b3e9-fdbbea2833ce.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/57493e02-638b-4952-aacf-738f8bb2531c.png" alt="" style="display:block;margin:0 auto" />

<p>Lambda can be triggered by 200+ AWS services:</p>
<ul>
<li><p>API Gateway (HTTP requests)</p>
</li>
<li><p>S3 (file events)</p>
</li>
<li><p>DynamoDB Streams (data changes)</p>
</li>
<li><p>SQS/SNS (messages)</p>
</li>
<li><p>EventBridge (scheduled/event rules)</p>
</li>
<li><p>CloudWatch (alarms)</p>
</li>
</ul>
<h2><strong>Destinations</strong></h2>
<p>Configure where successful or failed async invocation results go — SQS, SNS, Lambda, or EventBridge.</p>
<h2><strong>Concurrency and Recursion Detection</strong></h2>
<p>Concurrency simply means:</p>
<p>👉 <strong>How many times your Lambda function can run at the same time</strong></p>
<ul>
<li><p><strong>Reserved concurrency</strong>: Guarantees a set number of concurrent executions</p>
<ul>
<li><p>Think of this as:</p>
<p>👉 <em>“I want to reserve a fixed number of slots for my function”</em></p>
<ul>
<li><p>Guarantees that your function always has capacity available</p>
</li>
<li><p>Prevents other functions from using all resources</p>
</li>
</ul>
<p>📌 <strong>Example:</strong></p>
<ul>
<li><p>You set reserved concurrency = 10</p>
</li>
<li><p>Your function can run up to 10 times simultaneously</p>
</li>
<li><p>Even if the system is busy, these 10 slots are reserved for you</p>
</li>
</ul>
<p>✔️ Useful for:</p>
<ul>
<li><p>Critical applications</p>
</li>
<li><p>Preventing overload on downstream systems (like databases)</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Provisioned concurrency</strong>: Pre-initializes execution environments to eliminate cold starts</p>
<ul>
<li><p>Normally, Lambda may take a little time to start (called <strong>cold start</strong>).</p>
<p>Provisioned concurrency means:</p>
<p>👉 <em>“Keep some instances of my function already running”</em></p>
<ul>
<li><p>Removes cold start delays</p>
</li>
<li><p>Improves response time</p>
</li>
</ul>
<p>📌 <strong>Example:</strong></p>
<ul>
<li><p>You configure 5 provisioned instances</p>
</li>
<li><p>These are always ready → faster execution</p>
</li>
</ul>
<p>✔️ Useful for:</p>
<ul>
<li><p>APIs</p>
</li>
<li><p>User-facing applications</p>
</li>
<li><p>Low-latency requirements</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Recursion detection</strong>: Prevents infinite loops where Lambda triggers itself</p>
<ul>
<li><p>This is a safety feature.</p>
<p>👉 Prevents your Lambda from calling itself again and again in a loop.</p>
</li>
</ul>
</li>
</ul>
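<p>A rough local analogy in plain Python (this is not AWS code, just a mental model): reserved concurrency behaves like a fixed pool of slots, and an invocation that finds no free slot is throttled with a 429:</p>

```python
import threading

RESERVED_CONCURRENCY = 10  # the 10 "slots" from the example above
slots = threading.BoundedSemaphore(RESERVED_CONCURRENCY)

def invoke(handler, event):
    # No free slot -> Lambda throttles the caller (HTTP 429)
    if not slots.acquire(blocking=False):
        return {"statusCode": 429, "body": "TooManyRequestsException"}
    try:
        return handler(event)
    finally:
        slots.release()
```

<p>Provisioned concurrency is the same pool, but with the environments already warmed up before the first request arrives.</p>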
<h2><strong>Code Signing</strong></h2>
<p>Ensures only trusted, signed code runs in your function. You create a Code Signing Configuration linking to an AWS Signer signing profile.</p>
<h2><strong>Monitoring and Operations Tools</strong></h2>
<p>Lambda integrates with CloudWatch Logs, X-Ray (tracing), and CloudWatch Lambda Insights for performance monitoring.</p>
<h2><strong>Versions and Aliases</strong></h2>
<h2><strong>Versions</strong></h2>
<p>A <strong>version</strong> is an immutable snapshot of your function's code + configuration. When you publish a version, Lambda assigns it a number (1, 2, 3...). The <code>$LATEST</code> version is always mutable — it's your working copy.</p>
<p><strong>Scenario</strong>: You deploy v1 of <code>Daily-tenant-report</code> to production. You make changes and publish v2. If v2 has a bug, v1 still exists untouched.</p>
<h2><strong>Aliases</strong></h2>
<p>An <strong>alias</strong> is a named pointer (like <code>prod</code>, <code>staging</code>, <code>dev</code>) to a specific version.</p>
<pre><code class="language-shell">aws lambda create-alias \
  --function-name Daily-tenant-report \
  --name prod \
  --function-version 5
</code></pre>
<p><strong>Why aliases matter</strong>: Your API Gateway integration points to the alias ARN, not a version number. When you deploy v6, just update the alias — no need to change API Gateway.</p>
<h2><strong>Additional Resources Explained</strong></h2>
<h2><strong>Layers</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/c7d45c2a-2b54-415c-a4a9-9b2f93dad903.png" alt="" style="display:block;margin:0 auto" />

<p>A <strong>layer</strong> is a .zip archive containing libraries, custom runtimes, or other dependencies. Instead of bundling everything in your deployment package, you attach shared layers.</p>
<p><strong>Scenario</strong>: Multiple Lambda functions use the <code>pandas</code> library. Create one layer with <code>pandas</code>, attach it to all functions. Update the layer once, and all functions get the update.</p>
<p>Each layer version is immutable and identified by a unique ARN.</p>
<h2><strong>Event Source Mappings (ESMs)</strong></h2>
<p>An ESM is a Lambda resource that <strong>polls</strong> stream/queue-based services and invokes your function with batches of records.</p>
<p>Supported sources: SQS, Kinesis, DynamoDB Streams, MSK (Kafka), Amazon MQ, DocumentDB.</p>
<p><strong>Scenario</strong>: An SQS queue receives order events. An ESM polls the queue and invokes your Lambda with batches of 10 messages. You configure batch size, batching window, retry policies, and parallelization.</p>
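<p>A handler on the receiving end of that mapping just loops over <code>event["Records"]</code> (the standard SQS event shape; the <code>order_id</code> field is a hypothetical payload of ours, not part of the SQS format):</p>

```python
import json

def lambda_handler(event, context):
    # The ESM invokes us with a *batch* of SQS messages, not one at a time
    order_ids = []
    for record in event["Records"]:
        order = json.loads(record["body"])   # each message body is our JSON payload
        order_ids.append(order["order_id"])  # hypothetical field
    return {"processed": order_ids}
```

<p>Batch size here is the lever: a bigger batch means fewer invocations but more work lost if one fails, which is why retry policy and partial-batch settings go hand in hand with it.</p>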
<h2><strong>Capacity Providers (New — Lambda Managed Instances)</strong></h2>
<p>This is a new feature that lets Lambda functions run on EC2 instances managed by Lambda, combining serverless development experience with dedicated compute.</p>
<p>You create a capacity provider specifying VPC, subnets, security groups, IAM roles, and optionally instance types and scaling config. Functions using capacity providers get access to specialized EC2 instance types while Lambda still handles scaling and patching.</p>
<p><strong>Use case</strong>: Workloads needing GPU instances or specific hardware that standard Lambda doesn't offer.</p>
<h2><strong>Code Signing Configurations</strong></h2>
<p>Ensures deployment integrity — only code signed by approved developers/CI pipelines can be deployed to your functions.</p>
<h2><strong>Replicas</strong></h2>
<p>Lambda@Edge replicas — when you associate a Lambda function with CloudFront distributions, AWS replicates your function to edge locations globally for low-latency execution.</p>
<h2><strong>Container Image Functions (Deep Dive)</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/92429a58-3c74-4861-b979-a1e61d8183fa.png" alt="" style="display:block;margin:0 auto" />

<p>Instead of uploading a .zip file, you can package your Lambda function as a <strong>Docker container image</strong> (up to 10 GB uncompressed) stored in Amazon ECR.</p>
<h2><strong>Three Ways to Build Container Images</strong></h2>
<ol>
<li><p><strong>AWS base image</strong>: Pre-loaded with runtime + runtime interface client. Easiest approach.</p>
</li>
<li><p><strong>AWS OS-only base image</strong>: Amazon Linux with just the OS. You add your runtime. Used for Go, Rust, or custom runtimes.</p>
</li>
<li><p><strong>Non-AWS base image</strong>: Alpine, Debian, or any custom image. You must include a runtime interface client.</p>
</li>
</ol>
<h2><strong>Example Dockerfile (Python)</strong></h2>
<pre><code class="language-dockerfile">FROM public.ecr.aws/lambda/python:3.12
COPY requirements.txt .
RUN pip install -r requirements.txt 
COPY app.py . 
CMD ["app.handler"]
</code></pre>
<h2><strong>Deploy a Container Image Function</strong></h2>
<pre><code class="language-shell"># Build and push to ECR
docker build -t daily-report .
docker tag daily-report:latest 076829085184.dkr.ecr.us-east-1.amazonaws.com/daily-report:latest
docker push 076829085184.dkr.ecr.us-east-1.amazonaws.com/daily-report:latest

# Create Lambda function
aws lambda create-function \
  --function-name Daily-tenant-report \
  --package-type Image \
  --code ImageUri=076829085184.dkr.ecr.us-east-1.amazonaws.com/daily-report:latest \
  --role arn:aws:iam::076829085184:role/lambda-execution-role
</code></pre>
<h2><strong>When to Use Container Images vs .zip</strong></h2>
<ul>
<li><p><strong>Container images</strong>: Complex dependencies, large packages (ML models), existing Docker workflows, need for custom OS packages</p>
</li>
<li><p><strong>.zip archives</strong>: Simple functions, quick iterations, smaller codebases</p>
</li>
</ul>
<p><strong>Important</strong>: You cannot change deployment type after creation — a container image function stays container, a .zip stays .zip.</p>
<h2><strong>Function Lifecycle for Container Images</strong></h2>
<p>After uploading, Lambda optimizes the image (function is in <code>Pending</code> state). Once <code>Active</code>, it can receive invocations. If unused for weeks, it goes <code>Inactive</code> and requires re-optimization on next invocation.</p>
<h2><strong>Advanced: SAM Template with API Gateway + Lambda</strong></h2>
<h2>What is AWS SAM?</h2>
<p>AWS SAM (Serverless Application Model) is a tool that helps you define and deploy serverless applications using simple configuration files.</p>
<p>Instead of manually creating:</p>
<ul>
<li><p>Lambda functions</p>
</li>
<li><p>API Gateway</p>
</li>
<li><p>IAM roles</p>
</li>
<li><p>Event triggers</p>
</li>
</ul>
<p>You can define everything in one file and deploy it together.</p>
<p>Think of SAM as:</p>
<blockquote>
<p>“A simplified way to write CloudFormation specifically for serverless applications.”</p>
</blockquote>
<h2>Why use SAM?</h2>
<p>Without SAM:</p>
<ul>
<li><p>You manually create resources from AWS Console</p>
</li>
<li><p>Difficult to manage and replicate</p>
</li>
</ul>
<p>With SAM:</p>
<ul>
<li><p>Everything is written as code</p>
</li>
<li><p>Easy to version control</p>
</li>
<li><p>Easy to reuse across environments (dev, prod)</p>
</li>
</ul>
<h2>Step 1: Install SAM CLI</h2>
<p>SAM CLI is the tool used to build and deploy your application.</p>
<pre><code class="language-shell"># macOS
brew install aws-sam-cli

# Linux
pip install aws-sam-cli
</code></pre>
<hr />
<h2>Step 2: Initialize a SAM Project</h2>
<pre><code class="language-shell">sam init --runtime python3.12 --name daily-report-api
</code></pre>
<p>This creates a project structure like:</p>
<ul>
<li><p>template.yaml → Main configuration file</p>
</li>
<li><p>hello_world/ → Lambda code</p>
</li>
<li><p>tests/ → Unit tests</p>
</li>
<li><p>events/ → Sample test events</p>
</li>
</ul>
<hr />
<h2>Step 3: Understanding template.yaml</h2>
<p>This is the most important file.</p>
<p>It defines:</p>
<ul>
<li><p>Lambda functions</p>
</li>
<li><p>API endpoints</p>
</li>
<li><p>Database</p>
</li>
<li><p>Permissions</p>
</li>
</ul>
<hr />
<h2>Basic Structure</h2>
<pre><code class="language-yaml">AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
</code></pre>
<p>This tells AWS:</p>
<ul>
<li><p>This is a SAM template</p>
</li>
<li><p>Use serverless transformation</p>
</li>
</ul>
<hr />
<h2>Globals Section</h2>
<pre><code class="language-yaml">Globals:
  Function:
    Timeout: 30
    Runtime: python3.12
    Architectures:
      - arm64
</code></pre>
<p>This applies default settings to all Lambda functions.</p>
<p>Meaning:</p>
<ul>
<li><p>Every function will use Python 3.12</p>
</li>
<li><p>Timeout = 30 seconds</p>
</li>
<li><p>Architecture = arm64</p>
</li>
</ul>
<p>This avoids repeating configuration again and again.</p>
<hr />
<h2>Lambda Function Definition</h2>
<pre><code class="language-yaml">DailyTenantReportFunction:
  Type: AWS::Serverless::Function
</code></pre>
<p>This creates a Lambda function.</p>
<hr />
<h3>Key Properties Explained</h3>
<pre><code class="language-yaml">CodeUri: src/
Handler: app.lambda_handler
</code></pre>
<ul>
<li><p>CodeUri → Where your code is located</p>
</li>
<li><p>Handler → Entry point of your function</p>
</li>
</ul>
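<p>With <code>Handler: app.lambda_handler</code>, Lambda imports <code>app.py</code> from <code>src/</code> and calls its <code>lambda_handler</code> function. A minimal sketch:</p>

```python
import json

def lambda_handler(event, context):
    # 'event' carries the trigger payload; 'context' carries runtime metadata
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "ok"}),
    }
```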
<hr />
<pre><code class="language-yaml">MemorySize: 256
</code></pre>
<ul>
<li><p>Allocates memory to Lambda</p>
</li>
<li><p>More memory also means proportionally more CPU, so the function often runs faster</p>
</li>
</ul>
<hr />
<pre><code class="language-yaml">Environment:
  Variables:
    DB_TABLE: TenantReports
    STAGE: production
</code></pre>
<ul>
<li><p>Environment variables for configuration</p>
</li>
<li><p>Used inside your code</p>
</li>
</ul>
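<p>Inside the function, these variables are read from the environment. A sketch (the <code>load_config</code> helper is illustrative, not part of SAM):</p>

```python
import os

def load_config(env=None):
    # Defaults mirror the Environment.Variables block in template.yaml
    env = os.environ if env is None else env
    return {
        "table": env.get("DB_TABLE", "TenantReports"),
        "stage": env.get("STAGE", "dev"),
    }
```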
<hr />
<h2>Permissions (Policies)</h2>
<pre><code class="language-yaml">Policies:
  - DynamoDBReadPolicy:
      TableName: !Ref TenantReportsTable
</code></pre>
<p>This allows Lambda to:</p>
<ul>
<li>Read from DynamoDB table</li>
</ul>
<hr />
<pre><code class="language-yaml">- SESCrudPolicy:
    IdentityName: "reports@shipsy.io"
</code></pre>
<p>Allows Lambda to:</p>
<ul>
<li>Send emails using SES</li>
</ul>
<h2>Event Triggers (Very Important)</h2>
<p>This is where SAM becomes powerful.</p>
<hr />
<h3>API Gateway Integration</h3>
<pre><code class="language-yaml">Events:
  GetReport:
    Type: Api
    Properties:
      Path: /reports/{tenantId}
      Method: get
</code></pre>
<p>This means:</p>
<ul>
<li><p>Create API endpoint</p>
</li>
<li><p>When someone calls:</p>
<pre><code class="language-plaintext">/reports/{tenantId}
</code></pre>
</li>
<li><p>Lambda will run</p>
</li>
</ul>
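<p>Inside the handler, API Gateway's proxy integration delivers the path variable under <code>pathParameters</code>. A hedged sketch:</p>

```python
import json

def lambda_handler(event, context):
    # Proxy integration puts {tenantId} from the path into event["pathParameters"]
    tenant_id = (event.get("pathParameters") or {}).get("tenantId")
    if not tenant_id:
        return {"statusCode": 400, "body": json.dumps({"error": "tenantId is required"})}
    return {"statusCode": 200, "body": json.dumps({"tenantId": tenant_id})}
```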
<hr />
<h3>Another API Endpoint</h3>
<pre><code class="language-yaml">GenerateReport:
  Type: Api
  Properties:
    Path: /reports/generate
    Method: post
</code></pre>
<p>Now you have:</p>
<ul>
<li><p>GET API → fetch report</p>
</li>
<li><p>POST API → generate report</p>
</li>
</ul>
<hr />
<h3>Scheduled Trigger (Cron Job)</h3>
<pre><code class="language-yaml">DailySchedule:
  Type: Schedule
  Properties:
    Schedule: cron(0 6 * * ? *)
</code></pre>
<p>This runs Lambda:</p>
<ul>
<li>Every day at 6 AM UTC</li>
</ul>
<p>Under the hood, SAM creates an EventBridge rule for this schedule.</p>
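<p>If the same function serves both the API and the schedule, the handler can branch on the event shape: scheduled EventBridge events set <code>source</code> to <code>aws.events</code>, while API Gateway proxy events carry HTTP fields such as <code>httpMethod</code>. A sketch:</p>

```python
def lambda_handler(event, context):
    # EventBridge scheduled invocations set source to "aws.events";
    # API Gateway proxy events have httpMethod / pathParameters instead
    if event.get("source") == "aws.events":
        return {"trigger": "schedule"}
    return {"trigger": "api"}
```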
<hr />
<h2>DynamoDB Table</h2>
<pre><code class="language-yaml">TenantReportsTable:
  Type: AWS::DynamoDB::Table
</code></pre>
<p>This creates a database table.</p>
<pre><code class="language-yaml">KeySchema:
  - AttributeName: tenantId
    KeyType: HASH
  - AttributeName: reportDate
    KeyType: RANGE
</code></pre>
<ul>
<li><p>tenantId → Partition key</p>
</li>
<li><p>reportDate → Sort key</p>
</li>
</ul>
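<p>The effect of this key design can be illustrated with a plain in-memory model (a sketch only; DynamoDB does this server-side): items sharing a partition key live together and come back ordered by the sort key.</p>

```python
# In-memory model of a table keyed by (partition key, sort key)
table = {}

def put_item(tenant_id, report_date, data):
    table.setdefault(tenant_id, {})[report_date] = data

def query(tenant_id):
    # One partition's items, ordered by sort key (ISO dates sort correctly)
    items = table.get(tenant_id, {})
    return [items[d] for d in sorted(items)]
```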
<hr />
<pre><code class="language-yaml">BillingMode: PAY_PER_REQUEST
</code></pre>
<ul>
<li><p>No need to manage capacity</p>
</li>
<li><p>Pay only when used</p>
</li>
</ul>
<hr />
<h2>Outputs (Important)</h2>
<pre><code class="language-yaml">Outputs:
  ApiEndpoint:
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod"
</code></pre>
<p>After deployment, this gives:</p>
<ul>
<li>Your API URL</li>
</ul>
<hr />
<h2>Step 4: Build and Test Locally</h2>
<pre><code class="language-shell">sam build
</code></pre>
<ul>
<li>Prepares your application</li>
</ul>
<hr />
<pre><code class="language-shell">sam local invoke DailyTenantReportFunction --event events/test.json
</code></pre>
<ul>
<li>Runs Lambda locally</li>
</ul>
<hr />
<pre><code class="language-shell">sam local start-api
</code></pre>
<ul>
<li>Starts local API server</li>
</ul>
<p>Test using:</p>
<pre><code class="language-shell">curl http://localhost:3000/reports/tenant-123
</code></pre>
<hr />
<h2>Step 5: Deploy to AWS</h2>
<pre><code class="language-shell">sam deploy --guided
</code></pre>
<p>This will:</p>
<ul>
<li><p>Ask for configuration (region, stack name)</p>
</li>
<li><p>Upload code to S3</p>
</li>
<li><p>Create all resources</p>
</li>
</ul>
<hr />
<h2>After Deployment</h2>
<p>You will get an API like:</p>
<pre><code class="language-plaintext">https://abc123.execute-api.us-east-1.amazonaws.com/Prod/reports/{tenantId}
</code></pre>
<h2>Example API Calls</h2>
<pre><code class="language-shell">curl https://.../reports/tenant-456
</code></pre>
<p>Fetch report</p>
<hr />
<pre><code class="language-shell">curl -X POST https://.../reports/generate \
  -H "Content-Type: application/json" \
  -d '{"tenantId": "tenant-456"}'
</code></pre>
<p>Generate report</p>
<h2>Final Architecture</h2>
<pre><code class="language-plaintext">Client → API Gateway → Lambda → DynamoDB
                          ↑
            Scheduled Event (cron)
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Implementing Traefik on AWS EKS with Network Load Balancer (NLB): A Complete Guide]]></title><description><![CDATA[TL;DR: This blog walks you through deploying Traefik as an Ingress Controller on AWS EKS using an AWS Network Load Balancer (NLB), covering setup, configuration, known limitations, and best practices ]]></description><link>https://blog.devopswithpiyush.in/traefik-ingress-eks-nlb-guide</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/traefik-ingress-eks-nlb-guide</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Traefik]]></category><category><![CDATA[Devops]]></category><category><![CDATA[ingress]]></category><category><![CDATA[Cloud Computing]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Thu, 12 Mar 2026 07:07:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/2f3ea986-b82a-43d1-9a8b-8e90c71c7871.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR:</strong> This blog walks you through deploying Traefik as an Ingress Controller on AWS EKS using an AWS Network Load Balancer (NLB), covering setup, configuration, known limitations, and best practices — all in one place.</p>
</blockquote>
<hr />
<h2>What is Traefik and Why Use It on EKS?</h2>
<p>When you run multiple services inside a Kubernetes cluster, you need something to manage how external traffic reaches each service. That's where an <strong>Ingress Controller</strong> comes in.</p>
<p><strong>Traefik</strong> is a cloud-native, open-source Ingress Controller and reverse proxy that automatically discovers your services and routes traffic to them — no manual route updates needed.</p>
<blockquote>
<p>📖 Official Docs: <a href="https://doc.traefik.io/traefik/">What is Traefik?</a></p>
</blockquote>
<p>On <strong>AWS EKS (Elastic Kubernetes Service)</strong>, Traefik pairs naturally with an AWS <strong>Network Load Balancer (NLB)</strong> to handle high-throughput, low-latency traffic routing at Layer 4 (TCP/UDP).</p>
<p><strong>Why Traefik over the default AWS ALB Ingress Controller?</strong></p>
<ul>
<li><p>More feature-rich routing rules (path, headers, middlewares)</p>
</li>
<li><p>Built-in dashboard for monitoring</p>
</li>
<li><p>Automatic SSL via AWS ACM</p>
</li>
<li><p>Prometheus metrics out of the box</p>
</li>
<li><p>No separate ALB per service (cost-effective)</p>
</li>
</ul>
<hr />
<h2>Architecture Overview</h2>
<p>Here's how the traffic flows in this setup:</p>
<pre><code class="language-plaintext">Internet
↓
AWS Network Load Balancer (NLB)
↓
Traefik Ingress Controller (running on EKS pods)
↓
Your Kubernetes Services / Apps
</code></pre>
<p>The NLB acts as the entry point from the internet. It forwards all traffic to Traefik, which then applies routing rules to send requests to the right service inside the cluster.</p>
<hr />
<h2>Prerequisites</h2>
<p>Before you begin, make sure the following are in place:</p>
<table>
<thead>
<tr>
<th>Requirement</th>
<th>Details</th>
</tr>
</thead>
<tbody><tr>
<td>AWS EKS Cluster</td>
<td>A running and configured Kubernetes cluster on EKS</td>
</tr>
<tr>
<td><code>kubectl</code></td>
<td>Installed and connected to your EKS cluster</td>
</tr>
<tr>
<td><code>Helm</code></td>
<td>Version 3+ installed (<a href="https://helm.sh/docs/intro/install/">Install Helm</a>)</td>
</tr>
<tr>
<td>Traefik Helm Chart</td>
<td>Version 3 (&gt; 3.9.0)</td>
</tr>
<tr>
<td>AWS IAM Permissions</td>
<td>Permissions to create Load Balancers, ACM certificates, Security Groups</td>
</tr>
<tr>
<td>ACM Certificate</td>
<td>SSL certificate created in AWS Certificate Manager (ACM)</td>
</tr>
</tbody></table>
<blockquote>
<p>📖 Traefik Installation Guide: <a href="https://doc.traefik.io/traefik/getting-started/install-traefik/">https://doc.traefik.io/traefik/getting-started/install-traefik/</a></p>
</blockquote>
<hr />
<h2>Step 1: Add Traefik Helm Repository</h2>
<pre><code class="language-bash">helm repo add traefik https://helm.traefik.io/traefik
helm repo update
</code></pre>
<p>This adds the official Traefik Helm chart repository to your local Helm setup.</p>
<blockquote>
<p><strong>📖 Reference:</strong> <a href="https://doc.traefik.io/traefik/getting-started/install-traefik/#use-the-helm-chart"><strong>Traefik Helm Chart Docs</strong></a></p>
</blockquote>
<hr />
<h2><strong>Step 2: Create the</strong> <code>custom-values.yaml</code> <strong>Configuration</strong></h2>
<p>Create a file named <code>custom-values.yaml</code> with the following configuration. Each section is explained below.</p>
<pre><code class="language-yaml">ingressClass:
  enabled: true
  isDefaultClass: true

providers:
  kubernetesCRD:
    enabled: true
    namespaces:
      - traefik-app-server
  kubernetesIngress:
    enabled: true
    namespaces:
      - traefik-app-server
      - default

ingressRoute:
  dashboard:
    enabled: true
    matchRule: Host(`traefik-dashboard.yourdomain.com`) &amp;&amp; (PathPrefix(`/dashboard`) || PathPrefix(`/api`))
    services:
      - name: api@internal
        kind: TraefikService
    entryPoints: ["web"]
    middlewares:
      - name: auth
        namespace: traefik-app-server

service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "&lt;your-acm-cert-arn&gt;"
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-subnets: &lt;your-subnet-ids&gt;
    service.beta.kubernetes.io/aws-load-balancer-security-groups: &lt;your-sg-ids&gt;
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true

ports:
  web:
    port: 8000
    exposedPort: 443
  websecure:
    port: 8443
    exposedPort: 443
  traefik:
    port: 8080
    exposedPort: 8080

globalArguments:
  - "--api.insecure=true"
  - "--serversTransport.insecureSkipVerify=true"

externalTrafficPolicy: Cluster

logs:
  general:
    format: json
    level: "INFO"
    noColor: true
  access:
    enabled: true
    format: json
    bufferingSize: 100
    filters:
      statuscodes: "200-299"
    addInternals: false

metrics:
  prometheus:
    entryPoint: metrics
    addRoutersLabels: true
    addServicesLabels: true
    buckets: "0.1,0.3,1.2,5.0"
</code></pre>
<h2><strong>What Each Section Does</strong></h2>
<p><code>ingressClass</code> — Makes Traefik the default ingress controller in your cluster.</p>
<p><code>providers</code> — Tells Traefik to watch for both <code>IngressRoute</code> (CRD) and standard <code>Ingress</code> resources in specified namespaces.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/providers/kubernetes-crd/"><strong>Traefik Kubernetes Providers</strong></a></p>
</blockquote>
<p><code>ingressRoute.dashboard</code> — Configures the Traefik dashboard with a specific hostname and path, protected by auth middleware.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/operations/dashboard/"><strong>Traefik Dashboard Docs</strong></a></p>
</blockquote>
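<p>Note that the <code>auth</code> middleware referenced above is not created by the chart; you define it yourself. A minimal sketch using BasicAuth (the Secret name is illustrative and must contain htpasswd-format users):</p>

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: auth
  namespace: traefik-app-server
spec:
  basicAuth:
    # Kubernetes Secret holding htpasswd-formatted "users" data
    secret: traefik-dashboard-auth
```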
<p><code>service annotations</code> — These AWS-specific annotations automatically trigger the creation of an NLB when Traefik is deployed. Key ones:</p>
<ul>
<li><p><code>aws-load-balancer-type: nlb</code> → Use NLB instead of CLB</p>
</li>
<li><p><code>aws-load-balancer-scheme: internet-facing</code> → Public internet accessible</p>
</li>
<li><p><code>aws-load-balancer-ssl-cert</code> → Attach your ACM certificate for HTTPS</p>
</li>
<li><p><code>preserve_client_ip.enabled=true</code> → Preserve the real client IP</p>
</li>
</ul>
<p><code>ports</code> — Maps Traefik's internal ports to external exposed ports.</p>
<p><code>logs</code> — Enables JSON-formatted access logs, filtering only successful (2xx) HTTP responses.</p>
<p><code>metrics</code> — Enables Prometheus scraping for monitoring Traefik performance.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/observability/metrics/prometheus/"><strong>Traefik Metrics with Prometheus</strong></a></p>
</blockquote>
<h2><strong>Step 3: Install Traefik Using Helm</strong></h2>
<pre><code class="language-shell">helm install traefik traefik/traefik \
  --namespace traefik-app-server \
  --create-namespace \
  -f custom-values.yaml
</code></pre>
<p>Verify the pods are running:</p>
<pre><code class="language-shell">kubectl get pods -n traefik-app-server
</code></pre>
<hr />
<h2><strong>Step 4: Verify the AWS NLB is Created</strong></h2>
<p>After installation, AWS automatically provisions an NLB based on the annotations in <code>custom-values.yaml</code>. Verify by:</p>
<ol>
<li><p>Going to <strong>AWS Console → EC2 → Load Balancers</strong> and look for the new NLB</p>
</li>
<li><p>Or run:</p>
</li>
</ol>
<pre><code class="language-shell">kubectl get svc -n traefik-app-server traefik
</code></pre>
<p>You should see an external hostname (the NLB DNS name) in the <code>EXTERNAL-IP</code> column.</p>
<h2><strong>Step 5: Configure IngressRoute for Your Services</strong></h2>
<p>Create an <code>IngressRoute</code> resource to route traffic to your apps:</p>
<pre><code class="language-yaml">apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: my-app-route
  namespace: traefik-app-server
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`myapp.yourdomain.com`)
      kind: Rule
      services:
        - name: my-app-service
          port: 80
</code></pre>
<p>Apply it:</p>
<pre><code class="language-shell">kubectl apply -f my-app-ingressroute.yaml
</code></pre>
<h2><strong>Step 6: Enable Metrics and Logging</strong></h2>
<p>Prometheus metrics are already enabled in <code>custom-values.yaml</code>.<br />Verify Traefik is exposing metrics:</p>
<pre><code class="language-shell">kubectl port-forward svc/traefik 8080:8080 -n traefik-app-server
</code></pre>
<p>Then visit <code>http://localhost:8080/metrics</code> in your browser.</p>
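<p>The <code>/metrics</code> endpoint returns plain-text Prometheus exposition format, with lines like <code>traefik_service_requests_total{code="200"} 42</code>. A small sketch of how such a line breaks down (simplified; real label values may contain escaped commas or quotes):</p>

```python
import re

def parse_metric(line):
    # name{label="value",...} sample  ->  (name, labels dict, float sample)
    m = re.match(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$', line)
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    labels = {}
    if raw_labels:
        for pair in raw_labels.split(","):  # simplified: no commas inside values
            key, val = pair.split("=", 1)
            labels[key] = val.strip('"')
    return name, labels, value
```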
<p>Check Traefik logs:</p>
<pre><code class="language-shell">kubectl logs &lt;traefik-pod-name&gt; -n traefik-app-server
</code></pre>
<hr />
<h2><strong>⚠️ Known Issue: Nested NLB + AWS Global Accelerator and Client IP Preservation</strong></h2>
<p>This is a <strong>critical limitation</strong> you must be aware of before designing your architecture.</p>
<h2><strong>What is the Scenario?</strong></h2>
<p>Many production teams try to achieve two goals simultaneously:</p>
<ol>
<li><p><strong>Use AWS Global Accelerator</strong> — to reduce latency globally by routing traffic through AWS's private backbone network</p>
</li>
<li><p><strong>Preserve the original Client IP</strong> — so that their apps can use the real user IP for security rules, geo-blocking, rate limiting, and analytics</p>
</li>
</ol>
<p>A natural architecture that seems to solve both is a <strong>Nested NLB setup</strong>:</p>
<pre><code class="language-plaintext">Internet
   ↓
AWS Global Accelerator
   ↓
NLB #1 (TCP Listeners) ← Global Accelerator Endpoint
   ↓
NLB #2 (TLS Listeners) ← Target Group of NLB #1
   ↓
Traefik on EKS
</code></pre>
<p>The idea here is:</p>
<ul>
<li><p><strong>NLB #1</strong> handles Global Accelerator traffic and forwards it to NLB #2</p>
</li>
<li><p><strong>NLB #2</strong> handles TLS termination and forwards to Traefik</p>
</li>
<li><p>This way, you get global acceleration AND SSL handling via ACM</p>
</li>
</ul>
<p><strong>Sounds logical, right? But it doesn't work.</strong></p>
<h2><strong>Why Does the Issue Arise?</strong></h2>
<p>When NLB #1 tries to route traffic to <strong>NLB #2's ENI (Elastic Network Interface)</strong> as a target, AWS blocks Client IP preservation. This is because:</p>
<blockquote>
<p><strong>AWS explicitly does not support Client IP preservation when a target<br />group contains the ENI of another Network Load Balancer or AWS PrivateLink ENIs.</strong></p>
</blockquote>
<blockquote>
<p><strong>📖 Official AWS Reference:</strong><br /><a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation"><strong>NLB Target Groups — Client IP Preservation</strong></a></p>
</blockquote>
<p>In simple words — when NLB #1 forwards packets to NLB #2, the source IP (original client IP) gets <strong>replaced with NLB #1's IP</strong>. By the time the request reaches Traefik and your app, you see the NLB's IP, not the user's real IP.</p>
<h2><strong>What is the Real-World Impact?</strong></h2>
<p>This limitation breaks several things your application may depend on:</p>
<table>
<thead>
<tr>
<th>Feature Affected</th>
<th>Why it Breaks</th>
</tr>
</thead>
<tbody><tr>
<td>IP-based rate limiting</td>
<td>You rate-limit the NLB IP, not the real user</td>
</tr>
<tr>
<td>Geo-blocking / GeoIP rules</td>
<td>NLB's IP is from AWS datacenter, not user's country</td>
</tr>
<tr>
<td>Security rules / WAF</td>
<td>Cannot block/allow specific client IPs</td>
</tr>
<tr>
<td>Analytics &amp; Traffic analysis</td>
<td>All traffic appears to come from one IP</td>
</tr>
<tr>
<td>Audit logs</td>
<td>No real user IP in logs for compliance</td>
</tr>
</tbody></table>
<h2><strong>The Root Cause (Technical)</strong></h2>
<p>In a standard NLB setup with Client IP Preservation enabled, the NLB simply <strong>forwards the TCP packet as-is</strong> to the target, preserving the source IP in the packet header. The target (Traefik pod) sees the real client IP directly.</p>
<p>But when NLB #2 is itself a target inside NLB #1's target group, NLB #1 needs to rewrite the destination IP of the packet to point to NLB #2's ENI. In this rewrite process, AWS's networking layer <strong>cannot maintain both the source IP and perform the destination rewrite simultaneously</strong> for chained NLBs.</p>
<hr />
<h2><strong>Troubleshooting Common Issues</strong></h2>
<table>
<thead>
<tr>
<th>Issue</th>
<th>Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>NLB not created</td>
<td>Wrong or missing service annotations</td>
<td>Re-check <code>custom-values.yaml</code> service annotations, especially subnet IDs and security group IDs</td>
</tr>
<tr>
<td>Dashboard not accessible</td>
<td>DNS or IngressRoute misconfiguration</td>
<td>Verify DNS resolves to NLB, check <code>matchRule</code> in <code>ingressRoute</code> config</td>
</tr>
<tr>
<td>SSL not working</td>
<td>Wrong ACM certificate ARN</td>
<td>Verify the ARN in <code>aws-load-balancer-ssl-cert</code> annotation matches your ACM cert</td>
</tr>
<tr>
<td>Client IP showing as NLB IP</td>
<td>Client IP Preservation disabled or nested NLB issue</td>
<td>Enable <code>preserve_client_ip.enabled=true</code> in target group attributes; avoid nested NLB setup</td>
</tr>
</tbody></table>
<h2><strong>Best Practices</strong></h2>
<ul>
<li><p><strong>Always protect the Traefik dashboard</strong> with authentication middleware; an unprotected dashboard exposes your entire routing configuration.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/middlewares/http/basicauth/"><strong>Traefik Middlewares — BasicAuth</strong></a></p>
</blockquote>
</li>
<li><p><strong>Use AWS ACM for SSL certificates</strong> instead of managing certs manually — ACM handles renewals automatically.</p>
</li>
<li><p><strong>Enable Prometheus metrics</strong> and connect to Grafana for a complete observability setup.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/observability/metrics/prometheus/"><strong>Traefik + Grafana Dashboard</strong></a></p>
</blockquote>
</li>
<li><p><strong>For internal-only services</strong>, change the NLB scheme to <code>internal</code>:</p>
<pre><code class="language-plaintext">service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
</code></pre>
</li>
<li><p><strong>Automate Helm deployments</strong> via CI/CD pipelines (GitHub Actions, ArgoCD) for consistent and repeatable deployments.</p>
</li>
<li><p><strong>Avoid Nested NLB setups</strong> if Client IP preservation is critical for your application — use a single NLB with Proxy Protocol v2 instead.</p>
</li>
</ul>
<h2><strong>Conclusion</strong></h2>
<p>Traefik on AWS EKS with NLB is a powerful, production-ready setup that gives<br />you fine-grained traffic control, automatic service discovery, SSL management,<br />and rich observability. However, when designing for advanced scenarios like<br />Global Accelerator with Client IP preservation, be aware of AWS's nested NLB<br />limitations and plan your architecture accordingly.</p>
<h2><strong>References</strong></h2>
<ul>
<li><p><a href="https://doc.traefik.io/traefik/"><strong>Traefik Official Documentation</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/getting-started/install-traefik/"><strong>Traefik Installation Guide</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/providers/kubernetes-crd/"><strong>Traefik Kubernetes CRD Provider</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/routing/providers/kubernetes-crd/"><strong>Traefik IngressRoute Reference</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/routing/entrypoints/#proxyprotocol"><strong>Traefik Proxy Protocol</strong></a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation"><strong>AWS NLB Target Groups — Client IP Preservation</strong></a></p>
</li>
<li><p><a href="https://aws.amazon.com/about-aws/whats-new/2023/08/aws-global-accelerator-client-ip-address-preservation-network-load-balancer-endpoints/"><strong>AWS Global Accelerator + NLB Client IP Preservation</strong></a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html"><strong>AWS EKS Network Load Balancing</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/observability/metrics/prometheus/"><strong>Traefik Prometheus Metrics</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/operations/dashboard/"><strong>Traefik Dashboard Docs</strong></a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Mastering Kubernetes Cluster Autoscaler on Amazon EKS: A Complete Guide]]></title><description><![CDATA[🚀 TL;DR: If your pods are stuck in Pending state because there aren't enough nodes — Cluster Autoscaler (CA) is your answer. This guide walks you through everything from IAM setup to a full productio]]></description><link>https://blog.devopswithpiyush.in/kubernetes-cluster-autoscaler-eks-guide</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/kubernetes-cluster-autoscaler-eks-guide</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[clusterautoscaler]]></category><category><![CDATA[Cloud Computing]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Thu, 12 Mar 2026 06:36:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/1022bb7f-51fb-44dd-a333-b09bbef798bb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<blockquote>
<p>🚀 <strong>TL;DR:</strong> If your pods are stuck in <code>Pending</code> state because there aren't enough nodes — <strong>Cluster Autoscaler (CA)</strong> is your answer. This guide walks you through everything from IAM setup to a full production deployment on Amazon EKS.</p>
</blockquote>
<hr />
<h2>👋 Who Is This For?</h2>
<table>
<thead>
<tr>
<th>Level</th>
<th>What You'll Get</th>
</tr>
</thead>
<tbody><tr>
<td>🟢 Beginner</td>
<td>Understand what CA is and why you need it</td>
</tr>
<tr>
<td>🟡 Intermediate</td>
<td>Full step-by-step installation on EKS</td>
</tr>
<tr>
<td>🔴 Advanced</td>
<td>Multi-node group strategies, expander policies, best practices</td>
</tr>
</tbody></table>
<hr />
<h2>🤔 The Problem — Why Does Autoscaling Even Matter?</h2>
<p>Imagine your application is running fine on Amazon EKS with 3 nodes. Suddenly, a <strong>traffic surge hits</strong> — a flash sale, a major client onboarding, or a viral event. Your Kubernetes Deployment tries to spin up 10 more pods — but there's <strong>no room</strong> on existing nodes. Those pods sit in <code>Pending</code> state, requests time out, and your users see errors.</p>
<p>You <em>could</em> manually add nodes — but who's watching at 2 AM on a Sunday?</p>
<p>This is exactly where <strong>Cluster Autoscaler (CA)</strong> steps in. It watches for <code>Pending</code> pods and automatically <strong>scales your EC2 node count up or down</strong> via AWS Auto Scaling Groups — no human intervention needed.</p>
<hr />
<h2>🧠 Section 1: What Is Cluster Autoscaler? (Beginner)</h2>
<p>Cluster Autoscaler is an <strong>open-source Kubernetes component</strong> that runs as a <code>Deployment</code> inside your cluster (in the <code>kube-system</code> namespace). It does two things:</p>
<ul>
<li><p><strong>Scale Up 📈</strong> — When pods are unschedulable (Pending), CA adds new EC2 nodes</p>
</li>
<li><p><strong>Scale Down 📉</strong> — When nodes are underutilized, CA safely drains and removes them</p>
</li>
</ul>
<h3>How It Works (Every 10 Seconds)</h3>
<ol>
<li><p>Are there any Pending pods?<br />YES → Find a Node Group that can fit them → Tell ASG to increase capacity</p>
</li>
<li><p>Are any nodes underutilized (&lt; 50% by default)?<br />YES → Can all pods fit elsewhere? → Drain node → Terminate EC2 instance</p>
</li>
</ol>
<blockquote>
<p>💡 <strong>Key Insight:</strong> CA doesn't look at CPU/Memory <em>usage</em>. It looks at <strong>resource REQUESTS</strong> defined in your pod spec. <strong>Always set</strong> <code>resources.requests</code> or CA won't scale!</p>
</blockquote>
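<p>For example, a pod template should declare requests like this (values are illustrative):</p>

```yaml
containers:
  - name: web
    image: nginx:1.27        # illustrative image
    resources:
      requests:              # what CA sums when sizing node capacity
        cpu: "250m"
        memory: "512Mi"
```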
<h3>CA vs HPA vs Karpenter</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>What it Scales</th>
<th>How</th>
</tr>
</thead>
<tbody><tr>
<td><strong>HPA</strong></td>
<td>Pod replicas</td>
<td>Based on CPU/memory metrics</td>
</tr>
<tr>
<td><strong>CA</strong></td>
<td>EC2 Nodes</td>
<td>Based on pending pods + AWS ASG</td>
</tr>
<tr>
<td><strong>Karpenter</strong></td>
<td>EC2 Nodes</td>
<td>Dynamic, just-in-time, more flexible</td>
</tr>
</tbody></table>
<p>Think of it this way: <strong>HPA scales your app. CA scales your infrastructure.</strong></p>
<hr />
<h2>🏗️ Section 2: Architecture Overview</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/bda0e453-29cd-4097-8d50-182913712a5a.png" alt="" style="display:block;margin:0 auto" />

<p>The <strong>OIDC + IRSA</strong> bridge is the key — it lets the CA pod (inside Kubernetes) make authenticated AWS API calls without storing any long-lived credentials.</p>
<hr />
<h2>🛠️ Section 3: Full Installation Guide (Intermediate)</h2>
<h3>Prerequisites Checklist</h3>
<p>Before you begin, make sure you have:</p>
<ul>
<li><p>✅ An active <strong>Amazon EKS Cluster</strong> (v1.24+)</p>
</li>
<li><p>✅ <code>kubectl</code> configured and pointing to your cluster</p>
</li>
<li><p>✅ <code>eksctl</code> installed (v0.160+)</p>
</li>
<li><p>✅ <code>aws cli</code> v2 configured with admin permissions</p>
</li>
<li><p>✅ Node Groups created with <strong>ASG enabled</strong> (<code>--asg-access</code> flag)</p>
</li>
</ul>
<hr />
<h3>Step 1: Enable IAM OIDC Provider</h3>
<p>OIDC is an identity bridge — it lets Kubernetes ServiceAccounts assume AWS IAM Roles, so your CA pod can call AWS APIs securely without hardcoding credentials.</p>
<pre><code class="language-bash">export CLUSTER_NAME=&lt;your-cluster-name&gt;
export AWS_REGION=ap-south-1   # Change to your region

# Enable OIDC for your cluster
eksctl utils associate-iam-oidc-provider \
  --region $AWS_REGION \
  --cluster $CLUSTER_NAME \
  --approve

# Verify
aws eks describe-cluster --name $CLUSTER_NAME \
  --query "cluster.identity.oidc.issuer" --output text
</code></pre>
<h3>Step 2: <strong>Create IAM Policy</strong></h3>
<p>Save the following as <code>iam-policy.json</code>. This policy defines exactly what CA is allowed to do in AWS — describe ASGs, set desired capacity, and terminate instances.</p>
<pre><code class="language-json">{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
</code></pre>
<pre><code class="language-shell">aws iam create-policy \
  --policy-name AmazonEKSClusterAutoscalerPolicy \
  --policy-document file://iam-policy.json
</code></pre>
<p>Note down the <strong>Policy ARN</strong> from the output — you'll need it in the next step.</p>
<h3>Step 3: <strong>Create IAM Role + Kubernetes ServiceAccount (IRSA)</strong></h3>
<p>IRSA (IAM Roles for Service Accounts) annotates a Kubernetes ServiceAccount with an IAM Role ARN, so only the CA pod gets AWS permissions — nothing else.</p>
<pre><code class="language-shell">export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

eksctl create iamserviceaccount \
  --cluster=$CLUSTER_NAME \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AmazonEKSClusterAutoscalerPolicy \
  --override-existing-serviceaccounts \
  --approve
</code></pre>
<p>If you prefer to apply the ServiceAccount manually, save this as <code>cluster-autoscaler-sa.yaml</code> and replace the role ARN:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::&lt;YOUR-ACCOUNT-ID&gt;:role/&lt;YOUR-IAM-ROLE-NAME&gt;
</code></pre>
<pre><code class="language-shell">kubectl apply -f cluster-autoscaler-sa.yaml
</code></pre>
<h2>Step 4: <strong>Apply RBAC — ClusterRole, Role, and Bindings</strong></h2>
<p>Save the following as <code>cluster-autoscaler-rbac.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["events", "endpoints"]
  verbs: ["create", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["get", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
  resources: ["namespaces", "pods", "services", "replicationcontrollers", "persistentvolumeclaims", "persistentvolumes"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
  resources: ["replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["watch", "list"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
  resourceNames: ["cluster-autoscaler"]
  resources: ["leases"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "list", "watch"]
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
  verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system
</code></pre>
<pre><code class="language-shell">kubectl apply -f cluster-autoscaler-rbac.yaml
</code></pre>
<h2><strong>Step 5: Deploy Cluster Autoscaler</strong></h2>
<p>Save the following as <code>cluster-autoscaler-deployment.yaml</code>.</p>
<blockquote>
<p><strong>⚠️ Replace</strong> <code>&lt;YOUR-CLUSTER-NAME&gt;</code> <strong>on the</strong> <code>--node-group-auto-discovery</code> <strong>line.<br />⚠️ Match the image version (</strong><code>v1.27.3</code> <strong>in the example) to your EKS cluster version, e.g., EKS 1.30 → use</strong> <code>v1.30.x</code></p>
</blockquote>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: cluster-autoscaler
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ALL]
            readOnlyRootFilesystem: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"
</code></pre>
<pre><code class="language-shell">kubectl apply -f cluster-autoscaler-deployment.yaml
</code></pre>
<h2><strong>Step 6: Tag Your ASG Node Groups</strong></h2>
<p>CA uses <strong>tags</strong> to discover which Auto Scaling Groups it should manage. Add these two tags to your Node Group's ASG in AWS Console or CLI:</p>
<table>
<thead>
<tr>
<th>Tag Key</th>
<th>Tag Value</th>
</tr>
</thead>
<tbody><tr>
<td><code>k8s.io/cluster-autoscaler/enabled</code></td>
<td><code>true</code></td>
</tr>
<tr>
<td><code>k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;</code></td>
<td><code>owned</code></td>
</tr>
</tbody></table>
<pre><code class="language-shell">aws autoscaling create-or-update-tags \
  --tags \
  "ResourceId=&lt;YOUR-ASG-NAME&gt;,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=&lt;YOUR-ASG-NAME&gt;,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;,Value=owned,PropagateAtLaunch=true"
</code></pre>
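<p>A tag mismatch here is the most common reason CA discovers no node groups at all, so it's worth double-checking that the tags actually exist on the ASG. Something like this should work (placeholder ASG name):</p>
<pre><code class="language-shell"># List cluster-autoscaler tags on the ASG
aws autoscaling describe-tags \
  --filters "Name=auto-scaling-group,Values=&lt;YOUR-ASG-NAME&gt;" \
  --query "Tags[?contains(Key, 'cluster-autoscaler')].[Key,Value]" \
  --output table
</code></pre>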
<h2><strong>Step 7: Verify Everything Is Working</strong></h2>
<pre><code class="language-shell"># Check the pod is Running
kubectl get pods -n kube-system | grep cluster-autoscaler

# Watch live logs
kubectl logs -f deployment/cluster-autoscaler -n kube-system
</code></pre>
<h2><strong>🗂️ Section 4: Node Group Strategies — The "NodePool" Equivalent (Advanced)</strong></h2>
<p>Unlike Karpenter (which has <code>NodePool</code> and <code>EC2NodeClass</code> CRDs), CA works with <strong>pre-defined EKS Node Groups (ASGs)</strong>.</p>
<table>
<thead>
<tr>
<th>Karpenter Concept</th>
<th>CA Equivalent</th>
</tr>
</thead>
<tbody><tr>
<td><code>EC2NodeClass</code></td>
<td>Launch Template</td>
</tr>
<tr>
<td><code>NodePool</code></td>
<td>EKS Managed Node Group (ASG)</td>
</tr>
<tr>
<td><code>NodePool limits</code></td>
<td>ASG Min/Max size</td>
</tr>
<tr>
<td><code>NodePool labels/taints</code></td>
<td>Node Group labels &amp; taints</td>
</tr>
</tbody></table>
<p>Here's a production-ready multi-node-group config using <code>eksctl</code>. Save as <code>production-nodegroups.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: &lt;YOUR-CLUSTER-NAME&gt;
  region: ap-south-1

managedNodeGroups:

  # Pool 1: General Purpose (always-on baseline)
  - name: general-ng
    instanceType: m5.xlarge
    minSize: 2
    maxSize: 10
    desiredCapacity: 2
    labels:
      workload: general
      lifecycle: on-demand
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;: "owned"
    iam:
      withAddonPolicies:
        autoScaler: true

  # Pool 2: High Memory (scale from zero for data workloads)
  - name: highmem-ng
    instanceType: r5.2xlarge
    minSize: 0
    maxSize: 5
    desiredCapacity: 0
    labels:
      workload: high-memory
    taints:
      - key: dedicated
        value: high-memory
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;: "owned"
      k8s.io/cluster-autoscaler/node-template/label/workload: "high-memory"
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: "high-memory:NoSchedule"

  # Pool 3: Spot Instances (cost savings for batch/non-critical)
  - name: spot-ng
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
    spot: true
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    labels:
      lifecycle: spot
      workload: batch
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;: "owned"
</code></pre>
<pre><code class="language-shell">eksctl create nodegroup -f production-nodegroups.yaml
</code></pre>
<h2><strong>Scheduling Pods to Specific Node Groups</strong></h2>
<pre><code class="language-yaml"># Example: Schedule a high-memory pod to the highmem-ng pool
spec:
  nodeSelector:
    workload: high-memory
  tolerations:
    - key: dedicated
      value: high-memory
      effect: NoSchedule
  containers:
    - name: app
      image: your-image:latest
      resources:
        requests:           # REQUIRED for CA to work!
          cpu: "2"
          memory: "8Gi"
</code></pre>
<h2><strong>⚙️ Section 5: Expander Strategies</strong></h2>
<p>When multiple node groups can accommodate a pending pod, CA uses an <strong>Expander</strong> to decide which one to pick:</p>
<table>
<thead>
<tr>
<th>Expander</th>
<th>Behavior</th>
<th>Best For</th>
</tr>
</thead>
<tbody><tr>
<td><code>least-waste</code></td>
<td>Picks group with least wasted resources after scaling</td>
<td><strong>Recommended</strong></td>
</tr>
<tr>
<td><code>random</code></td>
<td>Picks randomly</td>
<td>Testing only</td>
</tr>
<tr>
<td><code>most-pods</code></td>
<td>Picks group that schedules the most pods</td>
<td>High-density</td>
</tr>
<tr>
<td><code>priority</code></td>
<td>You assign priority order to node groups</td>
<td>Fine-grained control</td>
</tr>
<tr>
<td><code>price</code></td>
<td>Prefers cheapest node type</td>
<td>Cost-sensitive</td>
</tr>
</tbody></table>
<p>Set it in your deployment:</p>
<pre><code class="language-shell">- --expander=least-waste
</code></pre>
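<p>If you opt for the <code>priority</code> expander instead, CA reads its ordering from a ConfigMap named <code>cluster-autoscaler-priority-expander</code> in <code>kube-system</code> (the Role from Step 4 already grants access to it). A minimal sketch — higher numbers win, and the node-group regex patterns here are illustrative:</p>
<pre><code class="language-shell">kubectl apply -f - &lt;&lt;'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot-ng.*
    10:
      - .*general-ng.*
EOF
</code></pre>
<p>With this config, Spot capacity (priority 50) is tried before the on-demand pool (priority 10).</p>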
<h2><strong>Section 6: Test Your Setup</strong></h2>
<pre><code class="language-shell"># Create a deployment that will trigger scale-up
kubectl create deployment inflate \
  --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7 \
  --replicas=10

kubectl set resources deployment inflate \
  --requests=cpu=1,memory=1Gi

# Watch pods — some will go Pending, then get scheduled on new nodes
kubectl get pods -w

# Watch CA logs in real-time
kubectl logs -f deployment/cluster-autoscaler -n kube-system | grep -E "scale_up|ScaleUp"

# Watch new nodes join
kubectl get nodes -w

# Cleanup — triggers scale-down after ~10 minutes
kubectl delete deployment inflate
</code></pre>
<h2><strong>Section 7: Production Best Practices</strong></h2>
<ol>
<li><p><strong>Always set</strong> <code>resources.requests</code> — CA is blind without them; it won't scale if requests aren't defined</p>
</li>
<li><p><strong>Use</strong> <code>PodDisruptionBudgets (PDB)</code> — Protects critical pods during scale-down draining</p>
</li>
<li><p><strong>Pin CA version to EKS version</strong> — Use <code>v1.30.x</code> for EKS 1.30; version mismatch breaks scaling</p>
</li>
<li><p><strong>Use</strong> <code>--balance-similar-node-groups</code> — Spreads nodes evenly across AZs for high availability</p>
</li>
<li><p><strong>Add</strong> <code>safe-to-evict: "false"</code> <strong>on CA pod itself</strong> — Prevents it from being evicted during scale-down</p>
</li>
<li><p><strong>Don't mix instance families in one ASG</strong> — Keep node groups homogeneous for predictable scaling</p>
</li>
<li><p><strong>Monitor with Prometheus</strong> — CA exposes metrics on port <code>8085</code>; scrape and alert on scaling events</p>
</li>
</ol>
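<p>For practice 3, you can derive the image tag from the cluster's reported version instead of hardcoding it — a sketch, assuming a <code>.0</code> patch release exists for your minor version:</p>
<pre><code class="language-shell"># Derive a matching CA image tag from the EKS control-plane version.
# In a real cluster, set eks_version from:
#   aws eks describe-cluster --name "$CLUSTER_NAME" --query cluster.version --output text
eks_version="1.30"
ca_image="registry.k8s.io/autoscaling/cluster-autoscaler:v${eks_version}.0"
echo "$ca_image"
</code></pre>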
<h2><strong>Section 8: Troubleshooting</strong></h2>
<table>
<thead>
<tr>
<th>Issue</th>
<th>Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>Pods stuck in Pending, no new nodes</td>
<td>ASG tags missing or wrong</td>
<td>Verify tags on your ASG match <code>--node-group-auto-discovery</code></td>
</tr>
<tr>
<td><code>Permission denied</code> errors in logs</td>
<td>IAM Role misconfigured</td>
<td>Check role trust relationship + OIDC annotation on ServiceAccount</td>
</tr>
<tr>
<td><code>CrashLoopBackOff</code> on CA pod</td>
<td>Wrong image version or bad command flags</td>
<td>Match image to EKS version; check <code>--node-group-auto-discovery</code> flag</td>
</tr>
<tr>
<td>Scale-down not happening</td>
<td><code>scale-down-unneeded-time</code> not elapsed or PDB blocking</td>
<td>Wait 10 min; check PodDisruptionBudgets</td>
</tr>
<tr>
<td>Scale-from-zero not working</td>
<td>Node group labels missing as ASG tags</td>
<td>Add <code>node-template/label/</code> and <code>node-template/taint/</code> tags to ASG</td>
</tr>
</tbody></table>
<pre><code class="language-shell"># Always start debugging here
kubectl logs -n kube-system deployment/cluster-autoscaler
</code></pre>
<h2><strong>CA vs Karpenter — Which One Should You Use in 2026?</strong></h2>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Cluster Autoscaler</th>
<th>Karpenter</th>
</tr>
</thead>
<tbody><tr>
<td>Setup Complexity</td>
<td>Moderate</td>
<td>Higher</td>
</tr>
<tr>
<td>Scaling Speed</td>
<td>2–5 min</td>
<td>30–60 sec</td>
</tr>
<tr>
<td>Instance Flexibility</td>
<td>Fixed per ASG</td>
<td>Dynamic, any type</td>
</tr>
<tr>
<td>Cost Optimization</td>
<td>Good with Spot</td>
<td>Excellent (node consolidation)</td>
</tr>
<tr>
<td>EKS Auto Mode support</td>
<td>No</td>
<td>Yes (native)</td>
</tr>
<tr>
<td>Maturity &amp; Stability</td>
<td>⭐⭐⭐⭐⭐ Battle-tested</td>
<td>⭐⭐⭐⭐ Growing fast</td>
</tr>
</tbody></table>
<h2><strong>Wrapping Up</strong></h2>
<p>Cluster Autoscaler is the backbone of production Kubernetes infrastructure on AWS. Set it up correctly with proper Node Groups, IRSA, and resource requests — and it will silently keep your cluster right-sized, saving both cost and on-call headaches.</p>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>🔐 OIDC + IRSA = Secure, credential-free AWS authentication from Kubernetes</p>
</li>
<li><p>🗂️ Node Groups = Your pre-defined capacity pools (CA's version of Karpenter's NodePools)</p>
</li>
<li><p>📦 Always set <code>resources.requests</code> — CA depends on it entirely</p>
</li>
<li><p>⚖️ Use <code>least-waste</code> expander for cost efficiency</p>
</li>
<li><p>📊 Watch CA logs — they're incredibly detailed and tell you exactly what's happening</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Cross-Project Cloud SQL Migration Using Google Database Migration Service (DMS)]]></title><description><![CDATA[Migrating a Cloud SQL database from one Google Cloud project to another can be challenging—especially when you want minimal downtime and continuous replication (via Change Data Capture — CDC).
Google']]></description><link>https://blog.devopswithpiyush.in/gcp-cross-project-db-migration</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/gcp-cross-project-db-migration</guid><category><![CDATA[google cloud]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Databases]]></category><category><![CDATA[#dms]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[MySQL]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Wed, 11 Mar 2026 09:30:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/686ab37a-1501-4017-a322-3a0374cfeb8f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Migrating a <strong>Cloud SQL</strong> database from one Google Cloud project to another can be challenging—especially when you want <strong>minimal downtime</strong> and <strong>continuous replication</strong> (via Change Data Capture — CDC).</p>
<p>Google's <strong>Database Migration Service (DMS)</strong> makes this straightforward, even over <strong>public IP</strong> connectivity (ideal when VPC peering or Shared VPC isn't feasible).</p>
<p>In this guide, I walk you through a real-world <strong>cross-project</strong> migration of a <strong>Cloud SQL for MySQL</strong> instance using <strong>public IP allowlist</strong> connectivity — <strong>continuous mode</strong> — from source project → destination project.</p>
<p>This method helped me consolidate databases, refactor environments, and improve project isolation/security/governance.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/bf61277b-4d17-4734-b04b-fa1d5b2f384f.png" alt="" style="display:block;margin:0 auto" />

<p><em>High-level flow of DMS continuous migration with public IP connectivity</em></p>
<h2>1. Introduction</h2>
<h3>Purpose</h3>
<p>This post provides a detailed, production-tested step-by-step guide to migrate a <strong>Cloud SQL</strong> instance between GCP projects using <strong>DMS</strong> over public IP. It covers prerequisites, IAM roles, connectivity setup, job configuration, testing, cutover (promotion), and verification.</p>
<h3>Target Audience</h3>
<ul>
<li><p>DevOps Engineers &amp; SREs</p>
</li>
<li><p>Cloud Infrastructure / Database Administrators</p>
</li>
<li><p>GCP Architects performing project consolidations or refactoring</p>
</li>
</ul>
<h2>2. Overview</h2>
<p><strong>Database Migration Service (DMS)</strong> is a fully managed GCP service for <strong>zero/minimal-downtime</strong> migrations to <strong>Cloud SQL</strong> (MySQL, PostgreSQL) and AlloyDB.</p>
<p><strong>Use cases for cross-project migration</strong>:</p>
<ul>
<li><p>Consolidating scattered databases into a central project</p>
</li>
<li><p>Refactoring legacy/multi-project environments</p>
</li>
<li><p>Enforcing better security &amp; governance through project boundaries</p>
</li>
</ul>
<p>We use <strong>continuous migration</strong> (full load + CDC) over <strong>public IP allowlist</strong> connectivity.</p>
<p><strong>Note</strong>: All DMS resources (connection profile, migration job, etc.) <strong>must reside in the same region</strong> as the destination Cloud SQL instance.</p>
<h2>3. Prerequisites</h2>
<h3>Tools &amp; Versions</h3>
<table>
<thead>
<tr>
<th>Tool / Technology</th>
<th>Requirement</th>
</tr>
</thead>
<tbody><tr>
<td>Google Cloud Platform</td>
<td>Active billing in <strong>both</strong> projects</td>
</tr>
<tr>
<td>Cloud SQL</td>
<td>Same engine &amp; version (e.g. MySQL 8.0.35+)</td>
</tr>
<tr>
<td>Database Migration Service</td>
<td>Enabled in the <strong>destination</strong> project</td>
</tr>
</tbody></table>
<h3>Required IAM Roles</h3>
<table>
<thead>
<tr>
<th>Role</th>
<th>Project</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td>Cloud SQL Admin (<code>roles/cloudsql.admin</code>)</td>
<td>Both</td>
<td>Manage Cloud SQL instances</td>
</tr>
<tr>
<td>Database Migration Admin (<code>roles/datamigration.admin</code>)</td>
<td>Destination</td>
<td>Create &amp; manage DMS jobs/profiles</td>
</tr>
<tr>
<td>Compute Network Admin (<code>roles/compute.networkAdmin</code>)</td>
<td>Destination</td>
<td>Manage authorized networks (allowlist)</td>
</tr>
</tbody></table>
<h2>4. Step-by-Step Migration Guide</h2>
<h3>Step 1: Get the Public IP of the Source Cloud SQL Instance</h3>
<ul>
<li><p>Go to <strong>SQL &gt; Instances</strong> in the <strong>source</strong> project</p>
</li>
<li><p>Open the instance → <strong>Overview</strong> tab</p>
</li>
<li><p>Copy the <strong>Public IP address</strong></p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/f63afd96-d9f7-4c46-8eb8-493198bdf48a.png" alt="" style="display:block;margin:0 auto" />
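<p>The same value can be pulled from the CLI — a sketch with placeholder instance and project names (the first <code>ipAddresses</code> entry is typically the primary/public address):</p>
<pre><code class="language-shell">gcloud sql instances describe SOURCE_INSTANCE_NAME \
  --project=SOURCE_PROJECT_ID \
  --format='value(ipAddresses[0].ipAddress)'
</code></pre>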

<h3>Step 2: Create a Connection Profile in the Destination Project</h3>
<ul>
<li><p>Navigate to <strong>Database Migration &gt; Connection profiles &gt; Create profile</strong></p>
</li>
<li><p>Settings:</p>
<ul>
<li><p><strong>Profile role</strong>: Source</p>
</li>
<li><p><strong>Database engine</strong>: MySQL (or PostgreSQL)</p>
</li>
<li><p><strong>Connection profile name/ID</strong>: e.g. <code>source-db-profile</code></p>
</li>
<li><p><strong>Hostname/IP</strong>: Paste source Cloud SQL <strong>public IP</strong></p>
</li>
<li><p><strong>Port</strong>: 3306 (MySQL) or 5432 (PostgreSQL)</p>
</li>
<li><p><strong>Username/Password</strong>: Source DB credentials (e.g. <code>root</code> user)</p>
</li>
<li><p><strong>Region</strong>: Must match destination Cloud SQL region</p>
</li>
</ul>
</li>
<li><p>Save</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/be2dcf92-f5dc-46d6-ada1-58ddbcd78fcf.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 3: Create the Migration Job</h3>
<ul>
<li><p>Go to <strong>Database Migration &gt; Migration jobs &gt; Create</strong></p>
</li>
<li><p>Fill basics:</p>
<ul>
<li><p><strong>Migration job name/ID</strong>: e.g. <code>cross-project-mig</code></p>
</li>
<li><p><strong>Source database engine</strong>: MySQL</p>
</li>
<li><p><strong>Destination region</strong>: (same as target instance)</p>
</li>
<li><p><strong>Migration job type</strong>: <strong>Continuous</strong> (enables CDC / real-time sync)</p>
</li>
</ul>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/5ff89511-ee94-4ee4-aa38-9b1bed17e837.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 4: Define Source Configuration</h3>
<ul>
<li><p>Select the connection profile created in Step 2</p>
</li>
<li><p><strong>Full dump configuration</strong>:</p>
<ul>
<li><p>Dump method: <strong>Logical</strong></p>
</li>
<li><p>Parallelism: <strong>Optimal</strong> or <strong>Max</strong> (for better performance)</p>
</li>
</ul>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/bd562775-df49-4674-8f00-75776ea36221.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 5: Define the Destination Cloud SQL Instance</h3>
<ul>
<li><p><strong>Option A</strong> — Existing instance: Select it (must match engine/version)</p>
</li>
<li><p><strong>Option B</strong> — New instance: Let DMS create it</p>
<ul>
<li><p>Match source engine &amp; version</p>
</li>
<li><p>Set root password</p>
</li>
<li><p>Choose adequate machine type &amp; storage (under-provisioning slows migration!)</p>
</li>
</ul>
</li>
</ul>
<p><strong>Important</strong>: This choice (existing vs new) is <strong>permanent</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ea40fbc5-e0c5-4980-beb2-ada1755be0ba.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/25f3715d-7ede-439b-9382-64bc68deee06.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 6: Configure IP Allowlist (Public Connectivity)</h3>
<p>DMS requires bidirectional connectivity over public IP.</p>
<ol>
<li><p><strong>Destination instance</strong>:</p>
<ul>
<li><p>Go to <strong>Cloud SQL &gt; Connections</strong></p>
</li>
<li><p>Enable <strong>Public IP</strong> if not already</p>
</li>
<li><p>Note the <strong>Outgoing IP</strong> from Overview tab (this is the IP DMS uses to connect <strong>to source</strong>)</p>
</li>
</ul>
</li>
<li><p><strong>Source instance</strong>:</p>
<ul>
<li><p>Go to <strong>Cloud SQL &gt; Connections &gt; Authorized networks</strong></p>
</li>
<li><p>Add the <strong>destination's outgoing IP</strong> (from step above) as an authorized network</p>
</li>
</ul>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/daba0747-4483-4a65-b565-72abbe51cac6.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/c83dd31c-f6ad-4158-b267-771dc3ca5f51.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 7: Test the Migration Job</h3>
<ul>
<li><p>In the migration job creation wizard → <strong>Test</strong> button</p>
</li>
<li><p>Wait for "Test run complete – successful"</p>
</li>
<li><p>If it fails: double-check credentials, public IPs, allowlist, firewall rules</p>
</li>
</ul>
<p>Once passed → <strong>Create</strong> (you can start immediately or later)</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/7caafd02-bfe0-4294-abe8-1dbe0f216456.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/e986437a-3590-4f87-bf49-a2a8b220b98a.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 8: Start &amp; Monitor the Job and Verify Data Consistency</h3>
<ul>
<li><p>Start the job</p>
</li>
<li><p>Monitor:</p>
<ul>
<li><p>Replication delay / lag</p>
</li>
<li><p>Phase (Full catch-up → CDC)</p>
</li>
</ul>
</li>
</ul>
<p>Source:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/cf28bddc-a461-4efa-99d7-6c56deed408d.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ec6c025b-946f-49d0-8865-988c2d4acc14.png" alt="" style="display:block;margin:0 auto" />

<p>Destination:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/4e42df5f-612c-47a7-aaee-e5dff7546474.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/200de87b-f70e-4cec-a1dd-967d6435ce08.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 9: Promote the Destination instance</h3>
<p>Once data consistency is verified and the replication delay is near zero, promote the destination database to a writeable instance.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ef3d3635-a306-47c4-bdae-7a4f1f32a0a3.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 10: Check Migration Job Logs or Destination Instance Logs</h3>
<p>If you need the migration job's logs or the destination instance's logs, click <strong>View logs</strong> and select the log type you want.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0f0c01af-edee-43e3-b7f9-727fee59578f.png" alt="" style="display:block;margin:0 auto" />

<h2>5. Troubleshooting</h2>
<h3>5.1: Common Issues</h3>
<table>
<thead>
<tr>
<th>Issues</th>
<th>Possible Cause</th>
</tr>
</thead>
<tbody><tr>
<td>Connection Test Fails</td>
<td>Public IP not allowlisted or wrong credentials</td>
</tr>
<tr>
<td>Version Mismatch</td>
<td>Cloud SQL minor version mismatch</td>
</tr>
<tr>
<td>IAM Permission errors</td>
<td>Missing roles in source/destination</td>
</tr>
<tr>
<td>Cutover Fails</td>
<td>Replication lag or ongoing writes on the source</td>
</tr>
</tbody></table>
<h3>5.2: Solutions</h3>
<ul>
<li><p>Re-check the authorized network settings</p>
</li>
<li><p>Verify the SQL version via <code>gcloud sql instances describe</code></p>
</li>
<li><p>Ensure IAM roles and APIs are correctly configured</p>
</li>
</ul>
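<p>For the version check in particular, comparing both sides from the CLI is quick — placeholder instance and project IDs below:</p>
<pre><code class="language-shell"># Engine versions on both sides should match (e.g. MYSQL_8_0)
gcloud sql instances describe SOURCE_INSTANCE --project=SOURCE_PROJECT \
  --format='value(databaseVersion)'
gcloud sql instances describe DEST_INSTANCE --project=DEST_PROJECT \
  --format='value(databaseVersion)'
</code></pre>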
<h2>6. Conclusion</h2>
<p>This blog explained how to migrate a Cloud SQL instance across GCP projects using DMS over public IP.</p>
<p>It covered:</p>
<ol>
<li><p>API Setup</p>
</li>
<li><p>Source/Destination Configuration</p>
</li>
<li><p>DMS Connection profiles and job creation</p>
</li>
<li><p>Troubleshooting the issues</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Integrating AWS IAM Identity Center (SSO) with Argo CD and Argo Workflows using SAML 2.0: A Step-by-Step Guide]]></title><description><![CDATA[As organizations increasingly adopt GitOps practices for managing Kubernetes deployments, tools like Argo CD and Argo Workflows have become essential in the modern cloud-native ecosystem. Argo CD auto]]></description><link>https://blog.devopswithpiyush.in/integrating-aws-iam-identity-center-sso-with-argo-cd-and-argo-workflows-using-saml-2-0-a-step-by-step-guide</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/integrating-aws-iam-identity-center-sso-with-argo-cd-and-argo-workflows-using-saml-2-0-a-step-by-step-guide</guid><category><![CDATA[AWS]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[gitops]]></category><category><![CDATA[AWS IAM Identity Center]]></category><category><![CDATA[SSO]]></category><category><![CDATA[SSO - Single Sign-On]]></category><category><![CDATA[argoworkflow]]></category><category><![CDATA[workflows]]></category><category><![CDATA[Active Directory]]></category><category><![CDATA[IAM]]></category><category><![CDATA[Security]]></category><category><![CDATA[DevSecOps]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Tue, 10 Mar 2026 20:46:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/b7490023-bf98-4a7b-96b6-947b9d159c2d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As organizations increasingly adopt <strong>GitOps</strong> practices for managing Kubernetes deployments, tools like <strong>Argo CD</strong> and <strong>Argo Workflows</strong> have become essential in the modern cloud-native ecosystem. Argo CD automates application deployments declaratively from Git repositories, while Argo Workflows orchestrates complex, scalable pipelines and batch jobs on Kubernetes.</p>
<p>To make these tools secure and user-friendly, especially in enterprise environments, integrating <strong>AWS IAM Identity Center</strong> (formerly known as AWS SSO) via <strong>SAML 2.0</strong> provides centralized authentication, group-based access control, and a single sign-on (SSO) experience. This eliminates multiple logins, reduces credential sprawl, and aligns with zero-trust security principles.</p>
<p>In this blog post, I'll walk you through the complete setup in a clear, beginner-friendly way — perfect for freshers learning GitOps, experienced DevOps engineers hardening access, or teams pursuing CNCF and AWS community contributions. The guide draws from the official Argo CD documentation and practical implementations, updated for 2026 best practices.</p>
<h3>Why Integrate AWS IAM Identity Center with Argo CD and Argo Workflows?</h3>
<ul>
<li><p><strong>Centralized Access Management</strong> — Manage users and groups in one place (AWS IAM Identity Center) for consistent policies across AWS services and third-party apps.</p>
</li>
<li><p><strong>Enhanced Security</strong> — Leverage SAML 2.0 federation to avoid storing local credentials; enforce MFA and compliance easily.</p>
</li>
<li><p><strong>Improved User Experience</strong> — Users log in once with corporate credentials and access Argo CD's UI and Argo Workflows seamlessly.</p>
</li>
<li><p><strong>Group-Based RBAC</strong> — Map AWS groups to Argo roles (e.g., readonly vs. admin) for fine-grained permissions.</p>
</li>
</ul>
<h3>Architecture Overview</h3>
<p>User → AWS IAM Identity Center (IdP) → SAML Assertion → Argo CD Dex (bundled OIDC provider) → Argo CD / Argo Workflows (Service Providers)</p>
<p>Argo CD uses <strong>Dex</strong> (its embedded identity broker) to handle SAML, while Argo Workflows can federate via the same Dex instance for shared SSO.</p>
<h3>High-Level Architecture</h3>
<p>User logs in via corporate credentials → AWS IAM Identity Center authenticates → issues SAML assertion → Argo CD's <strong>Dex</strong> (built-in identity broker) validates → grants access based on groups.</p>
<p>Here's a simple flow diagram (Mermaid syntax — paste into <a href="http://mermaid.live">mermaid.live</a> or your blog renderer):</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/25a05205-9880-4ccb-ac57-b70d0b007a1b.svg" alt="" style="display:block;margin:0 auto" />

<p>This shows the <strong>authentication flow</strong>. AWS acts as Identity Provider (<strong>IdP</strong>), Argo CD/Dex as Service Provider (<strong>SP</strong>).</p>
<h3>Detailed Component Diagram – Infrastructure View</h3>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/4d5b923c-8a0c-4beb-875f-f634b4ee55fc.png" alt="" style="display:block;margin:0 auto" />

<h3>Prerequisites</h3>
<ul>
<li><p>A running Kubernetes cluster with Argo CD and (optionally) Argo Workflows installed (preferably via Helm).</p>
</li>
<li><p>Access to <strong>AWS IAM Identity Center</strong> with permissions to create SAML applications.</p>
</li>
<li><p>Argo CD exposed via a domain (e.g., <a href="https://argocd.yourdomain.com">https://argocd.yourdomain.com</a>).</p>
</li>
<li><p>kubectl access to create secrets and edit ConfigMaps.</p>
</li>
<li><p>Basic understanding of YAML and Kubernetes resources.</p>
</li>
</ul>
<h3>Step-by-Step Implementation</h3>
<h4>Step 1: Create a Custom SAML 2.0 Application in AWS IAM Identity Center</h4>
<ol>
<li><p>Go to <strong>AWS IAM Identity Center</strong> → <strong>Applications</strong> → <strong>Add application</strong>.</p>
</li>
<li><p>Choose <strong>Add custom SAML 2.0 application</strong>.</p>
</li>
<li><p>Set <strong>Display name</strong> (e.g., "Argo CD SSO").</p>
</li>
<li><p>Under <strong>Application metadata</strong>:</p>
<ul>
<li><p>Select <strong>Manually type metadata values</strong>.</p>
</li>
<li><p><strong>Application ACS URL</strong>: <a href="https://argocd.yourdomain.com/api/dex/callback">https://argocd.yourdomain.com/api/dex/callback</a></p>
</li>
<li><p><strong>Application SAML audience</strong>: <a href="https://argocd.yourdomain.com/api/dex/callback">https://argocd.yourdomain.com/api/dex/callback</a></p>
</li>
</ul>
</li>
<li><p>(Optional) Set <strong>Application start URL</strong>: <a href="https://argocd.yourdomain.com">https://argocd.yourdomain.com</a></p>
</li>
<li><p>Download the <strong>IAM Identity Center certificate</strong> (you'll need it later).</p>
</li>
<li><p>Submit and go to <strong>Attribute mappings</strong>:</p>
<ul>
<li><p>Add mappings:</p>
<ul>
<li><p><strong>Subject</strong> → ${user:subject} (persistent)</p>
</li>
<li><p><strong>groups</strong> → ${user:groups}</p>
</li>
<li><p><strong>email</strong> → ${user:email}</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Assign users/groups who should access Argo CD.</p>
</li>
</ol>
<p><strong>Note</strong>: Use your actual Argo CD domain. The callback URL is critical for Dex.</p>
<h4>Step 2: Prepare the Certificate</h4>
<p>Base64-encode the downloaded certificate, including the <code>-----BEGIN CERTIFICATE-----</code> and <code>-----END CERTIFICATE-----</code> lines:</p>
<pre><code class="language-bash">base64 -w 0 iam-identity-center-cert.pem &gt; encoded-cert.txt
</code></pre>
<p>Copy the output — you'll paste it into the <code>caData</code> field in the next step.</p>
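<p>As a quick sanity check, the encoded value should decode back to a certificate that <code>openssl</code> can parse (illustrative session — the subject and expiry will be those of your IAM Identity Center certificate):</p>
<pre><code class="language-plaintext">$ base64 -d encoded-cert.txt | openssl x509 -noout -subject -enddate
subject=...   (the IAM Identity Center signing certificate)
notAfter=...  (confirm this date is in the future)
</code></pre>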
<h4>Step 3: Configure Argo CD (via Helm values or argocd-cm ConfigMap)</h4>
<p>Update your Argo CD Helm values (or edit argocd-cm directly):</p>
<pre><code class="language-yaml">configs:
  cm:
    create: true
    url: https://argocd.yourdomain.com
    dex.config: |
      logger:
        level: debug
        format: json
      connectors:
      - type: saml
        id: aws
        name: "AWS IAM Identity Center"
        config:
          ssoURL: "https://portal.sso.&lt;region&gt;.amazonaws.com/saml/assertion/&lt;id&gt;"  # From your SAML app sign-in URL
          caData: "&lt;base64-encoded-cert-from-step-2&gt;"
          entityIssuer: https://argocd.yourdomain.com/api/dex/callback
          redirectURI: https://argocd.yourdomain.com/api/dex/callback
          usernameAttr: email
          emailAttr: email
          groupsAttr: groups

  rbac:
    policy.default: role:readonly
    policy.csv: |
      p, role:readonly, applications, get, */*, allow
      p, role:readonly, certificates, get, *, allow
      p, role:readonly, clusters, get, *, allow
      # ... (add more readonly permissions as needed)

      p, role:admin, applications, create, */*, allow
      p, role:admin, applications, update, */*, allow
      # ... (add admin permissions)

      g, "&lt;your-aws-group-id&gt;", role:admin  # e.g., g, "argocd-admins", role:admin
    scopes: '[groups, email]'
</code></pre>
<p><strong>Key Tips</strong>:</p>
<ul>
<li><p>ssoURL comes from the SAML app's sign-in URL.</p>
</li>
<li><p>For group mapping, use the exact group name/ID from AWS. (AWS IAM Identity Center doesn't officially document passing groups in SAML assertions, but the <code>${user:groups}</code> attribute mapping from Step 1 works reliably in practice.)</p>
</li>
<li><p>Apply changes and restart Dex pod if needed.</p>
</li>
</ul>
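<p>After applying the values, a rollout restart forces Dex to reload its connector configuration. The deployment name below matches a standard Helm/manifests install — adjust it if yours differs:</p>
<pre><code class="language-plaintext">$ kubectl -n argocd rollout restart deployment argocd-dex-server
deployment.apps/argocd-dex-server restarted
</code></pre>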
<h4>Step 4: Create Kubernetes Secret (for Shared Client Secret if Using Argo Workflows)</h4>
<pre><code class="language-bash">kubectl create secret generic argocd-sso-secret \
  --namespace argocd \
  --from-literal=client-id="https://portal.sso.&lt;region&gt;.amazonaws.com/saml/assertion/&lt;id&gt;" \
  --from-literal=client-secret="some-random-secure-string"  # Or generate one
</code></pre>
<p>If Argo Workflows is in a different namespace, recreate the same secret there.</p>
<h4>Step 5: Configure Argo Workflows (Optional but Recommended for Unified SSO)</h4>
<p>In Argo Workflows Helm values:</p>
<pre><code class="language-yaml">server:
  authModes:
    - sso
    - client  # Optional: keep client for backward compat; can remove later
  sso:
    enabled: true
    issuer: https://argocd.yourdomain.com/api/dex
    clientId:
      name: argocd-sso-secret
      key: client-id
    clientSecret:
      name: argocd-sso-secret
      key: client-secret
    redirectUrl: https://argocd.yourdomain.com/oauth2/callback
    sessionExpiry: 8h
</code></pre>
<p>This allows Argo Workflows to use the same Dex instance for SSO.</p>
<h4>Step 6: Test the Integration</h4>
<ul>
<li><p>Access <a href="https://argocd.yourdomain.com">https://argocd.yourdomain.com</a></p>
</li>
<li><p>Click <strong>LOGIN VIA SSO</strong></p>
</li>
<li><p>You should redirect to AWS IAM Identity Center login</p>
</li>
<li><p>After authentication, return to Argo CD with proper permissions based on your group</p>
</li>
</ul>
<p>If issues arise, check the Dex logs for debug info: <code>kubectl logs -n argocd -l app.kubernetes.io/name=argocd-dex-server</code>.</p>
<h3>Troubleshooting Common Issues</h3>
<ul>
<li><p><strong>Authentication fails</strong> → Verify URLs match exactly (case-sensitive); check certificate encoding.</p>
</li>
<li><p><strong>Groups not recognized</strong> → Confirm group names in AWS and in the RBAC <code>policy.csv</code>; use debug logging.</p>
</li>
<li><p><strong>Callback errors</strong> → Ensure ACS URL and audience match Dex callback.</p>
</li>
<li><p><strong>Connectivity</strong> → Confirm network policies allow outbound to AWS endpoints.</p>
</li>
</ul>
<h3>Best Practices</h3>
<ul>
<li><p>Store secrets securely (use external secret managers like AWS Secrets Manager + External Secrets Operator).</p>
</li>
<li><p>Rotate client secrets periodically.</p>
</li>
<li><p>Use least-privilege RBAC: Start with readonly default, grant admin only to specific groups.</p>
</li>
<li><p>Monitor Dex logs and set up alerts for auth failures.</p>
</li>
<li><p>Test group membership changes in a staging environment.</p>
</li>
</ul>
<h3>Conclusion</h3>
<p>Integrating <strong>AWS IAM Identity Center</strong> with <strong>Argo CD</strong> (and optionally Argo Workflows) via SAML 2.0 brings enterprise-grade authentication to your GitOps workflows. It simplifies onboarding, boosts security, and supports scalable team collaboration — key for CNCF-aligned projects and AWS ecosystems.</p>
<p>By following this guide, you can achieve centralized, secure access in minutes (after initial setup). If you're contributing to open-source or building AWS community projects, this pattern is battle-tested and aligns with modern cloud-native security.</p>
<p>Happy GitOps-ing! If you implement this, share your experiences — feedback helps the community grow.</p>
]]></content:encoded></item><item><title><![CDATA[Build Your Own SMTP Mail Server on AWS EC2 Using Node.js — A Complete Hands-On Guide]]></title><description><![CDATA[Introduction
Ever wondered what really happens when you click "Send" on an email? Behind the scenes, a chain of DNS lookups, protocol handshakes, and server communications takes place — all orchestrat]]></description><link>https://blog.devopswithpiyush.in/build-smtp-mail-server-aws-ec2-nodejs</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/build-smtp-mail-server-aws-ec2-nodejs</guid><category><![CDATA[AWS]]></category><category><![CDATA[ec2]]></category><category><![CDATA[Devops]]></category><category><![CDATA[jo]]></category><category><![CDATA[Node.js]]></category><category><![CDATA[smtp]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Tue, 10 Mar 2026 17:28:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ab1ad4cf-02b4-4047-85af-e85b9588f877.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2>Introduction</h2>
<p>Ever wondered what really happens when you click "Send" on an email? Behind the scenes, a chain of DNS lookups, protocol handshakes, and server communications takes place — all orchestrated by SMTP (Simple Mail Transfer Protocol).</p>
<p>In this tutorial, we'll demystify email delivery by building a custom SMTP server from scratch — hosted on <strong>Amazon EC2</strong>. By the end, you'll have a working mail server that can receive emails on your own domain, and a deep understanding of how email infrastructure works under the hood.</p>
<blockquote>
<p>🔗 This post is based on my video walkthrough: <a href="https://youtu.be/l3htAzOAx7c?si=HwQNHU9txEmOBboc">Build Your Own Mail Server | SMTP Server</a></p>
</blockquote>
<hr />
<h2>Prerequisites</h2>
<p>Before you begin, make sure you have:</p>
<ul>
<li><p>An <strong>AWS account</strong> with access to the EC2 console</p>
</li>
<li><p>A registered <strong>domain name</strong> (with access to DNS management)</p>
</li>
<li><p>Basic familiarity with <strong>Linux terminal commands</strong></p>
</li>
<li><p>Basic understanding of <strong>Node.js</strong></p>
</li>
</ul>
<hr />
<h2>How Email Delivery Actually Works</h2>
<p>Let's say <strong>Piyush</strong> (using Gmail) wants to send an email to <strong>Abhay</strong> (using Outlook).</p>
<p>Here's the step-by-step flow:</p>
<ol>
<li><p><strong>MX Record Lookup</strong> — Piyush's mail server performs a DNS query for <code>outlook.com</code>'s <strong>MX (Mail Exchanger) Record</strong>. This tells it which server is responsible for handling incoming mail for that domain.</p>
</li>
<li><p><strong>A Record Lookup</strong> — The MX record returns a hostname (e.g., <code>mailserver.outlook.com</code>). A second DNS query resolves this hostname to an <strong>IPv4 address</strong> using the <strong>A Record</strong>.</p>
</li>
<li><p><strong>SMTP Connection</strong> — Piyush's server opens a TCP connection to the resolved IP on <strong>port 25</strong> and begins the SMTP handshake to deliver the message.</p>
</li>
</ol>
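<p>You can watch steps 1 and 2 happen yourself with <code>dig</code>. The output below is illustrative — actual hostnames, priorities, and IPs will vary:</p>
<pre><code class="language-plaintext">$ dig MX outlook.com +short
5 outlook-com.olc.protection.outlook.com.

$ dig A outlook-com.olc.protection.outlook.com +short
52.101.40.29
</code></pre>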
<hr />
<h2>DNS Records Every Mail Server Needs</h2>
<p>Before your server can send or receive email reliably, you need to configure several DNS records:</p>
<table>
<thead>
<tr>
<th>Record</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><strong>MX</strong></td>
<td>Specifies which mail server handles email for your domain</td>
</tr>
<tr>
<td><strong>A</strong></td>
<td>Maps your mail server's hostname to its public IPv4 address</td>
</tr>
<tr>
<td><strong>SPF</strong></td>
<td>Defines which servers are authorized to send email on behalf of your domain (prevents spoofing)</td>
</tr>
<tr>
<td><strong>DKIM</strong></td>
<td>Adds a cryptographic signature to outgoing emails, verifying sender identity and message integrity</td>
</tr>
<tr>
<td><strong>DMARC</strong></td>
<td>Builds on SPF and DKIM to define how receiving servers should handle authentication failures</td>
</tr>
</tbody></table>
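<p>For reference, here is roughly what the three authentication records look like in zone-file form. All values are illustrative — your sending IP, DKIM selector, public key, and reporting address will differ:</p>
<pre><code class="language-plaintext">yourdomain.com.                        TXT  "v=spf1 ip4:203.0.113.10 -all"
selector1._domainkey.yourdomain.com.   TXT  "v=DKIM1; k=rsa; p=MIGfMA0GCSq...AB"
_dmarc.yourdomain.com.                 TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@yourdomain.com"
</code></pre>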
<hr />
<h2>SMTP Protocol — The Handshake</h2>
<p>SMTP communication follows a structured command sequence. Here's how the conversation between two servers typically flows:</p>
<table>
<thead>
<tr>
<th>Command</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>HELO</code></td>
<td>Initiates the SMTP session; the client introduces itself to the server</td>
</tr>
<tr>
<td><code>MAIL FROM</code></td>
<td>Declares the sender's email address</td>
</tr>
<tr>
<td><code>RCPT TO</code></td>
<td>Specifies the recipient's email address</td>
</tr>
<tr>
<td><code>DATA</code></td>
<td>Requests permission to begin transmitting the email body</td>
</tr>
<tr>
<td><code>QUIT</code></td>
<td>Terminates the SMTP session</td>
</tr>
</tbody></table>
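<p>You can drive this handshake by hand to see the commands in action. A typical session against a mail server on port 25 looks like this (server responses are illustrative):</p>
<pre><code class="language-plaintext">$ telnet mail.yourdomain.com 25
220 mail.yourdomain.com ESMTP
HELO client.example.com
250 mail.yourdomain.com
MAIL FROM:&lt;piyush@example.com&gt;
250 OK
RCPT TO:&lt;abhay@yourdomain.com&gt;
250 OK
DATA
354 End data with &lt;CR&gt;&lt;LF&gt;.&lt;CR&gt;&lt;LF&gt;
Subject: Hello

Testing SMTP by hand.
.
250 OK: queued
QUIT
221 Bye
</code></pre>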
<p><strong>Default SMTP Ports:</strong></p>
<ul>
<li><p><strong>Port 25</strong> — Standard SMTP</p>
</li>
<li><p><strong>Port 465</strong> — SMTP over implicit SSL/TLS (secure)</p>
</li>
<li><p><strong>Port 587</strong> — Mail submission with STARTTLS (the standard for authenticated client-to-server sending)</p>
</li>
</ul>
<hr />
<h2>Step 1 — Launch an EC2 Instance on AWS</h2>
<p>Head over to the <strong>AWS Management Console</strong> and launch a new EC2 instance.</p>
<p><strong>Recommended Configuration:</strong></p>
<ul>
<li><p><strong>AMI:</strong> Ubuntu Server 22.04 LTS (or latest)</p>
</li>
<li><p><strong>Instance Type:</strong> <code>t2.micro</code> (Free Tier eligible — perfect for this project)</p>
</li>
<li><p><strong>Key Pair:</strong> Create or select an existing SSH key pair</p>
</li>
<li><p><strong>Network:</strong> Ensure the instance has a <strong>public IP address</strong> (we'll configure the security group shortly)</p>
</li>
</ul>
<blockquote>
<p>💡 <strong>Why EC2?</strong> Amazon EC2 gives you full control over your server environment — including the operating system, network configuration, and security policies. It's ideal for running custom services like an SMTP server where you need to open specific ports and manage DNS records pointing to your instance's public IP.</p>
</blockquote>
<hr />
<h2>Step 2 — Install Node.js and npm</h2>
<p>SSH into your EC2 instance and install Node.js:</p>
<pre><code class="language-bash">sudo apt update
sudo apt install nodejs npm -y
node -v
</code></pre>
<p>You should see output similar to:</p>
<pre><code class="language-plaintext">v18.x.x
</code></pre>
<blockquote>
<p><strong>Tip:</strong> For the latest LTS version, consider using <a href="https://github.com/nvm-sh/nvm">nvm (Node Version Manager)</a> instead of the default apt package.</p>
</blockquote>
<hr />
<h2>Step 3 — Install the SMTP Server Package</h2>
<p>Create a project directory and install the <code>smtp-server</code> npm package:</p>
<pre><code class="language-bash">mkdir smtp-server &amp;&amp; cd smtp-server
npm init -y
npm install smtp-server
</code></pre>
<hr />
<h2>Step 4 — Write the SMTP Server Code</h2>
<p>Create a file called <code>index.js</code>:</p>
<pre><code class="language-bash">nano index.js
</code></pre>
<p>Paste the following Node.js code:</p>
<pre><code class="language-javascript">const { SMTPServer } = require("smtp-server");

const server = new SMTPServer({
  allowInsecureAuth: true,
  authOptional: true,

  onConnect(session, cb) {
    console.log(`[CONNECT] Session ID: ${session.id}`);
    cb();
  },

  onMailFrom(address, session, cb) {
    console.log(`[MAIL FROM] ${address.address} | Session: ${session.id}`);
    cb();
  },

  onRcptTo(address, session, cb) {
    console.log(`[RCPT TO] ${address.address} | Session: ${session.id}`);
    cb();
  },

  onData(stream, session, cb) {
    let emailData = "";
    stream.on("data", (chunk) =&gt; {
      emailData += chunk.toString();
    });
    stream.on("end", () =&gt; {
      console.log(`[DATA] Email content:\n${emailData}`);
      cb();
    });
  },
});

server.listen(25, () =&gt; {
  console.log("✅ SMTP Server is running on port 25");
});
</code></pre>
<blockquote>
<p>This creates a minimal SMTP server that logs every incoming email connection, sender, recipient, and message body — great for understanding the protocol in action.</p>
</blockquote>
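<p>To go one step beyond logging raw text, you could split the captured message into headers and a body inside the <code>onData</code> handler. The helper below is a teaching sketch (the name <code>parseEmail</code> is mine, not part of the <code>smtp-server</code> package) and deliberately ignores folded multi-line headers and MIME parts:</p>
<pre><code class="language-javascript">// Split a raw RFC 5322-style message into a header map and a body string.
// Headers end at the first blank line; header names are lower-cased.
function parseEmail(raw) {
  const normalized = raw.replace(/\r\n/g, "\n");
  const splitAt = normalized.indexOf("\n\n");
  const headerBlock = splitAt === -1 ? normalized : normalized.slice(0, splitAt);
  const body = splitAt === -1 ? "" : normalized.slice(splitAt + 2);
  const headers = {};
  for (const line of headerBlock.split("\n")) {
    const idx = line.indexOf(":");
    if (idx !== -1) {
      headers[line.slice(0, idx).toLowerCase()] = line.slice(idx + 1).trim();
    }
  }
  return { headers, body };
}
</code></pre>
<p>In the <code>end</code> callback you could then log, say, <code>parseEmail(emailData).headers.subject</code> instead of dumping the whole message.</p>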
<hr />
<h2>Step 5 — Configure the EC2 Security Group</h2>
<p>Back in the <strong>AWS EC2 Console</strong>, navigate to your instance's <strong>Security Group</strong> and add the following <strong>inbound rule</strong>:</p>
<table>
<thead>
<tr>
<th>Type</th>
<th>Protocol</th>
<th>Port Range</th>
<th>Source</th>
</tr>
</thead>
<tbody><tr>
<td>Custom TCP</td>
<td>TCP</td>
<td>25</td>
<td>0.0.0.0/0 (or restrict as needed)</td>
</tr>
</tbody></table>
<blockquote>
<p>⚠️ <strong>Security Note:</strong> Opening port 25 to <code>0.0.0.0/0</code> is fine for testing, but in production you should restrict access and implement authentication. AWS also throttles port 25 by default on EC2 — you may need to <a href="https://aws.amazon.com/premiumsupport/knowledge-center/ec2-port-25-throttle/">submit a request</a> to remove the restriction for outbound SMTP traffic.</p>
</blockquote>
<hr />
<h2>Step 6 — Configure DNS Records</h2>
<p>Go to your <strong>domain registrar</strong> (or <strong>Amazon Route 53</strong> if you manage DNS through AWS) and add the following records:</p>
<table>
<thead>
<tr>
<th>Record Type</th>
<th>Host</th>
<th>Value</th>
<th>TTL</th>
</tr>
</thead>
<tbody><tr>
<td><strong>A</strong></td>
<td><code>mail.yourdomain.com</code></td>
<td><code>&lt;Your EC2 Public IP&gt;</code></td>
<td>300</td>
</tr>
<tr>
<td><strong>MX</strong></td>
<td><code>yourdomain.com</code></td>
<td><code>mail.yourdomain.com</code> (Priority: 10)</td>
<td>300</td>
</tr>
</tbody></table>
<blockquote>
<p>🔑 <strong>Pro Tip:</strong> If you're using <strong>Amazon Route 53</strong> for DNS management, you can associate an <strong>Elastic IP</strong> with your EC2 instance. This ensures your server's IP remains static, even if you stop/start the instance — critical for reliable mail delivery.</p>
</blockquote>
<hr />
<h2>Step 7 — Start the Server</h2>
<p>You can start the server directly with Node:</p>
<pre><code class="language-bash">sudo node index.js
</code></pre>
<p>For production persistence, use <strong>PM2</strong> (a Node.js process manager):</p>
<pre><code class="language-bash">sudo npm install -g pm2
sudo pm2 start index.js
sudo pm2 save
sudo pm2 startup
</code></pre>
<p>You should see:</p>
<pre><code class="language-plaintext">✅ SMTP Server is running on port 25
</code></pre>
<hr />
<h2>Step 8 — Test by Sending an Email</h2>
<p>Open any email client (Gmail, Yahoo, Outlook) and send a test email to:</p>
<pre><code class="language-plaintext">anything@yourdomain.com
</code></pre>
<p>Check your EC2 terminal — you should see the SMTP handshake logs appear in real time:</p>
<pre><code class="language-plaintext">[CONNECT] Session ID: abc123
[MAIL FROM] sender@gmail.com | Session: abc123
[RCPT TO] anything@yourdomain.com | Session: abc123
[DATA] Email content:
Subject: Test Email
Hello from Gmail!
</code></pre>
<p>🎉 <strong>Congratulations!</strong> You've just built a working SMTP server on AWS EC2.</p>
<hr />
<h2>AWS Services Used</h2>
<table>
<thead>
<tr>
<th>Service</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Amazon EC2</strong></td>
<td>Hosts the SMTP server on a virtual Linux machine in the cloud</td>
</tr>
<tr>
<td><strong>Security Groups</strong></td>
<td>Acts as a virtual firewall to control inbound/outbound traffic on port 25</td>
</tr>
<tr>
<td><strong>Elastic IP</strong> <em>(optional)</em></td>
<td>Provides a static public IP for consistent DNS resolution</td>
</tr>
<tr>
<td><strong>Amazon Route 53</strong> <em>(optional)</em></td>
<td>Managed DNS service for configuring MX and A records</td>
</tr>
</tbody></table>
<hr />
<h2>What's Next?</h2>
<p>This tutorial sets up a <strong>basic receive-only SMTP server</strong> for learning purposes. To take it further, consider:</p>
<ul>
<li><p>Adding <strong>TLS encryption</strong> with Let's Encrypt certificates for secure communication</p>
</li>
<li><p>Configuring <strong>SPF, DKIM, and DMARC</strong> records for email authentication</p>
</li>
<li><p>Using <strong>Amazon SES</strong> alongside your custom server for reliable outbound email delivery</p>
</li>
<li><p>Implementing <strong>Postfix</strong> or <strong>Haraka</strong> for a production-grade mail transfer agent</p>
</li>
<li><p>Monitoring server health with <strong>Amazon CloudWatch</strong></p>
</li>
</ul>
<hr />
<h2>Wrapping Up</h2>
<p>Building an SMTP server from scratch is one of the best ways to understand how email really works at the protocol level. By hosting it on <strong>Amazon EC2</strong>, you get the flexibility of full server access combined with the reliability and scalability of AWS infrastructure.</p>
<p>If this post helped you, feel free to drop a ❤️ and share it with someone learning about cloud infrastructure!</p>
<hr />
<p><em>Have questions or want to connect? Find me on</em> <a href="https://linkedin.com"><em>LinkedIn</em></a><em>.</em></p>
]]></content:encoded></item></channel></rss>