<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[DevOps with Piyush]]></title><description><![CDATA[Hands-on DevOps tutorials covering AWS, Kubernetes (EKS), ArgoCD, Terraform, CI/CD, and cloud security by a CKA-certified DevOps engineer.]]></description><link>https://blog.devopswithpiyush.in</link><generator>RSS for Node</generator><lastBuildDate>Sun, 26 Apr 2026 06:13:41 GMT</lastBuildDate><atom:link href="https://blog.devopswithpiyush.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Pro-Level Observability, Logging & Best Practices in AWS API Gateway (Part - 6)]]></title><description><![CDATA[You’ve built your API, secured it, assigned a custom domain, and safely deployed it to production. Everything is running smoothly until one day, you check your dashboard and see that 15% of your users]]></description><link>https://blog.devopswithpiyush.in/api-gateway-6</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-6</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[serverless]]></category><category><![CDATA[observability]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[api]]></category><category><![CDATA[Devops articles]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 13:01:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/f753ed77-c86f-4477-ba28-2107075e407a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You’ve built your API, secured it, assigned a custom domain, and safely deployed it to 
production. Everything is running smoothly until one day, you check your dashboard and see that 15% of your users are getting <code>500 Internal Server Error</code> messages.</p>
<p>If you don't have observability set up, you are completely blind. You won't know <em>who</em> is getting the error, <em>why</em> it is happening, or <em>which</em> piece of your backend code is broken.</p>
<p>In this final post of the series, we are going to look at how to set up professional-grade logging and monitoring for Amazon API Gateway so you can debug any multi-point failure in seconds.</p>
<h2><strong>1. Amazon CloudWatch: The Black Box Recorder</strong></h2>
<p>Whenever an airplane crashes, investigators look for the "black box" to hear exactly what was happening in the cockpit. In AWS, <strong>CloudWatch Logs</strong> is your black box.</p>
<p>When you enable CloudWatch logging for your API Gateway, it records exactly what happens during request execution and client access. There are two main types of logs you need to care about:</p>
<ul>
<li><p><strong>Execution Logs:</strong> This tells you what happened <em>inside</em> API Gateway. Did the Lambda Authorizer allow the request? Did the data transformation map correctly? Did the backend server take too long to respond?</p>
</li>
<li><p><strong>Access Logs:</strong> This tells you <em>who</em> called the API. It records the caller's IP address, the time of the request, and the specific endpoint they tried to hit. (You can also send these access logs to <strong>Amazon Data Firehose</strong> if you want to store them in a massive data lake for long-term analysis.)</p>
</li>
</ul>
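<p>As a concrete example, here is the kind of JSON access-log format you can configure on a stage. The <code>$context</code> variables are filled in by API Gateway at request time; the field names on the left are your own choice:</p>
<pre><code class="language-json">{
  "requestId": "$context.requestId",
  "ip": "$context.identity.sourceIp",
  "requestTime": "$context.requestTime",
  "httpMethod": "$context.httpMethod",
  "path": "$context.path",
  "status": "$context.status",
  "responseLength": "$context.responseLength"
}
</code></pre>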
<p><em>Pro Tip:</em> Don't just look at logs; set up <strong>CloudWatch Alarms</strong>. You can tell CloudWatch to watch a specific metric, such as your API's error rate. If the error rate spikes above 5% for more than 5 minutes, CloudWatch can automatically send a notification to your engineering team's Slack channel via Amazon SNS.</p>
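<p>As a rough sketch, the alarm from that pro tip could be created with boto3 (the API name and SNS topic ARN are placeholders; the CloudWatch client is passed in as a parameter so the function is easy to test):</p>
<pre><code class="language-python"># In production you would pass a real client:
#   import boto3
#   cw = boto3.client("cloudwatch")
def create_error_rate_alarm(cloudwatch, api_name, sns_topic_arn):
    """Alarm when the average 5XXError rate exceeds 5% over 5 minutes."""
    return cloudwatch.put_metric_alarm(
        AlarmName=api_name + "-5xx-error-rate",
        Namespace="AWS/ApiGateway",
        MetricName="5XXError",
        Dimensions=[{"Name": "ApiName", "Value": api_name}],
        Statistic="Average",           # average of 0/1 samples = error rate
        Period=300,                    # evaluate over 5-minute windows
        EvaluationPeriods=1,
        Threshold=0.05,                # alarm above a 5% error rate
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],  # e.g. an SNS topic wired to Slack
    )
</code></pre>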
<h2><strong>2. AWS X-Ray: The MRI Machine for Your Code</strong></h2>
<p>CloudWatch tells you <em>that</em> an error happened in your API Gateway. But what if the error didn't actually happen in the Gateway? What if API Gateway passed the request to a Lambda function, which passed it to a DynamoDB database, and the database was the thing that timed out?</p>
<p>This is where <strong>AWS X-Ray</strong> comes in.</p>
<p>When you enable X-Ray tracing for your REST APIs (it works for Regional, Edge-optimized, and Private endpoints), X-Ray assigns a unique "Trace ID" to a user's request the second it hits the Gateway. That Trace ID follows the request as it travels through your entire AWS backend.</p>
<p>Instead of reading lines of text in a log file, X-Ray gives you a visual <strong>Service Map</strong>. It draws a flowchart showing your API Gateway connecting to your Lambda function, which in turn connects to your database.</p>
<p>If the database is running slowly, X-Ray will highlight that specific connection in red and tell you exactly how many milliseconds it took. It gives you an end-to-end view of the entire request so you can instantly analyze latencies and pinpoint the exact bottleneck.</p>
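<p>If you prefer code over console clicks, here is a minimal boto3-style sketch of turning tracing on for an existing REST API stage (the client is passed in so the snippet stays testable; <code>/tracingEnabled</code> is the stage setting that controls X-Ray tracing):</p>
<pre><code class="language-python"># In production: apigw = boto3.client("apigateway")
def enable_xray_tracing(apigw, rest_api_id, stage_name):
    """Flip the tracingEnabled flag on a deployed REST API stage."""
    return apigw.update_stage(
        restApiId=rest_api_id,
        stageName=stage_name,
        patchOperations=[
            {"op": "replace", "path": "/tracingEnabled", "value": "true"}
        ],
    )
</code></pre>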
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/8cebbee5-211f-4925-bbd1-2d8249fecbf9.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>3. AWS CloudTrail: The Security Camera</strong></h2>
<p>While CloudWatch monitors the <em>traffic</em> hitting your API, <strong>AWS CloudTrail</strong> monitors the <em>developers</em> managing your API.</p>
<p>CloudTrail provides a continuous record of every single action taken by a user, IAM role, or AWS service inside your account.</p>
<p>If someone on your team accidentally deletes a route, disables an authorizer, or pushes a bad configuration change, CloudTrail records it. You can look at the CloudTrail history to determine exactly <em>who</em> made the change, <em>when</em> it happened, and from <em>which IP address</em>.</p>
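<p>As an illustrative sketch, the boto3 <code>lookup_events</code> call can answer that "who did it?" question (the event name is a placeholder; pass whichever management event you are investigating):</p>
<pre><code class="language-python"># In production: ct = boto3.client("cloudtrail")
def who_changed_my_api(cloudtrail, event_name="DeleteRoute"):
    """Return (username, time, event) tuples for recent matching events."""
    resp = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        MaxResults=10,
    )
    return [
        (e.get("Username"), e.get("EventTime"), e.get("EventName"))
        for e in resp["Events"]
    ]
</code></pre>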
<h2><strong>4. AWS Config: The Compliance Checker</strong></h2>
<p>If you manage a large enterprise, you might have hundreds of APIs running at once. How do you make sure every single one of them has X-Ray tracing enabled and a Web Application Firewall (WAF) attached?</p>
<p>You use <strong>AWS Config</strong>.</p>
<p>AWS Config lets you define strict rules for your resources. You can create a rule that says: <em>"Every API Gateway must have CloudWatch Access Logging enabled."</em> If a developer creates a new API and forgets to turn on logging, AWS Config will immediately flag that API as "noncompliant" and can even send an alert to your security team.</p>
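<p>For example, AWS ships a managed rule, <code>API_GW_EXECUTION_LOGGING_ENABLED</code>, that checks exactly this. A minimal boto3-style sketch of enabling it (rule name is a placeholder, client injected for testability):</p>
<pre><code class="language-python"># In production: cfg = boto3.client("config")
def require_api_logging(config, rule_name="api-gw-logging-enabled"):
    """Enable AWS Config's managed check for API Gateway execution logging."""
    return config.put_config_rule(
        ConfigRule={
            "ConfigRuleName": rule_name,
            "Source": {
                "Owner": "AWS",
                "SourceIdentifier": "API_GW_EXECUTION_LOGGING_ENABLED",
            },
        }
    )
</code></pre>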
<h2><strong>Conclusion: You Are Ready for Production</strong></h2>
<p>Congratulations! Over the course of this 6-part series, you have gone from a complete beginner to mastering Amazon API Gateway.</p>
<p>You now know how to:</p>
<ul>
<li><p>Choose the right architecture (<strong>Part 1 &amp; 2</strong>)</p>
</li>
<li><p>Build real-time, two-way communication systems with WebSockets (<strong>Part 3</strong>)</p>
</li>
<li><p>Lock down your API with Authorizers and WAF (<strong>Part 4</strong>)</p>
</li>
<li><p>Launch safely using Custom Domains and Canary Deployments (<strong>Part 5</strong>)</p>
</li>
<li><p>Monitor, trace, and debug any failure in production (<strong>Part 6</strong>)</p>
</li>
</ul>
<p>API Gateway is the ultimate front door for modern, serverless applications. Whether you are building a simple side project or a massive enterprise platform, you now have the tools to route your traffic quickly, securely, and reliably.</p>
<p>Happy building!</p>
<p>Here is the full series roadmap for readers who jumped straight to Part 6:</p>
<table>
<thead>
<tr>
<th><strong>Part</strong></th>
<th><strong>Topic</strong></th>
</tr>
</thead>
<tbody><tr>
<td><strong>Part 1</strong></td>
<td>Architecture &amp; Core Concepts</td>
</tr>
<tr>
<td><strong>Part 2</strong></td>
<td>REST vs HTTP APIs &amp; Building Your First One</td>
</tr>
<tr>
<td><strong>Part 3</strong></td>
<td>WebSocket APIs &amp; Real-Time Applications</td>
</tr>
<tr>
<td><strong>Part 4</strong></td>
<td>Securing &amp; Throttling Your APIs (Auth, WAF, Quotas)</td>
</tr>
<tr>
<td><strong>Part 5</strong></td>
<td>Data Mapping, Custom Domains &amp; Deployments</td>
</tr>
<tr>
<td><strong>Part 6 (This Blog)</strong></td>
<td>Pro-Level Observability, Logging &amp; Best Practices</td>
</tr>
</tbody></table>
<p>This completes the blog series, which is based directly on the official AWS documentation.</p>
]]></content:encoded></item><item><title><![CDATA[Data Mapping, Custom Domains & Deployments in AWS API Gateway (Part - 5)]]></title><description><![CDATA[You’ve built your API, secured it with authentication, and set up throttling rules so nobody can crash your servers. You are finally ready to show it to the world.
But right now, your API lives at a U]]></description><link>https://blog.devopswithpiyush.in/api-gateway-5</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-5</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[route53]]></category><category><![CDATA[#domains]]></category><category><![CDATA[Custom Domain]]></category><category><![CDATA[deployment]]></category><category><![CDATA[Canary deployment]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[serverless]]></category><category><![CDATA[backend]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 12:42:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/14d8863e-7b87-4f48-838c-a4ebb329d33e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You’ve built your API, secured it with authentication, and set up throttling rules so nobody can crash your servers. You are finally ready to show it to the world.</p>
<p>But right now, your API lives at a URL that looks like this:<br /><a href="https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/dev"><code>https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/dev</code></a></p>
<p>No one wants to give that URL to their customers. Plus, what happens when you need to update your API? If you push a bad update, you could instantly break the app for every single user.</p>
<p>In this post, we’re going to look at how to launch your API like a professional. We will cover setting up a beautiful <strong>Custom Domain Name</strong>, and how to use <strong>Canary Deployments</strong> to safely roll out updates without risking a massive outage.</p>
<h2><strong>The Professional Touch: Custom Domain Names</strong></h2>
<p>A custom domain turns that ugly AWS URL into something clean and professional, like:<br /><a href="https://api.mycoolstartup.com/v1/users"><code>https://api.mycoolstartup.com/v1/users</code></a></p>
<p>Setting this up in API Gateway is straightforward, but you need two things before you start:</p>
<ol>
<li><p><strong>A Registered Domain Name:</strong> You can buy this through Amazon Route 53 or any third-party provider (like GoDaddy or Namecheap).</p>
</li>
<li><p><strong>An SSL/TLS Certificate:</strong> Your API needs to be secure (HTTPS). You must request a free certificate using <strong>AWS Certificate Manager (ACM)</strong>.</p>
</li>
</ol>
<h2><strong>How to Map the Domain</strong></h2>
<p>Once you have your certificate, you create a Custom Domain in the API Gateway console. API Gateway will generate a special target domain name. You take that target name, go to your DNS provider (like Route 53), and create a <code>CNAME</code> or <code>Alias</code> record pointing <a href="http://api.mycoolstartup.com"><code>api.mycoolstartup.com</code></a> to the API Gateway target.</p>
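<p>The same two steps can be sketched with boto3 for a Regional REST API (domain, certificate ARN, API ID, and stage below are placeholders; the client is injected so the sketch is testable):</p>
<pre><code class="language-python"># In production: apigw = boto3.client("apigateway")
def map_custom_domain(apigw, domain, cert_arn, rest_api_id, stage):
    """Create the custom domain, then map it to a deployed API stage."""
    apigw.create_domain_name(
        domainName=domain,
        regionalCertificateArn=cert_arn,       # the ACM certificate
        endpointConfiguration={"types": ["REGIONAL"]},
    )
    # Requests to the domain's root path now route to this API + stage.
    return apigw.create_base_path_mapping(
        domainName=domain, restApiId=rest_api_id, stage=stage
    )
</code></pre>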
<p><em>Pro Tip:</em> You can also set up <strong>Wildcard Custom Domains</strong>. If you want to give every customer their own API endpoint (like <code>customerA.mycoolstartup.com</code> and <code>customerB.mycoolstartup.com</code>), you can use a wildcard certificate (<code>*.mycoolstartup.com</code>) to route them all to the same API Gateway without having to set up hundreds of individual domains.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/37b2e60e-a305-406d-8b49-d46bd1d64b10.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Stages: Managing Environments</strong></h2>
<p>Before we talk about deploying updates, we need to talk about <strong>Stages</strong>.<br />When you deploy an API in AWS, you don't just deploy it "to the internet." You deploy it to a specific Stage. A stage is just a named reference to a snapshot of your API.</p>
<p>Most companies use stages to separate their environments:</p>
<ul>
<li><p><code>dev</code> (for developers testing new code)</p>
</li>
<li><p><code>qa</code> (for quality assurance testing)</p>
</li>
<li><p><code>prod</code> (the live version actual customers use)</p>
</li>
</ul>
<p>Instead of building three completely separate APIs, you build one API and deploy it to these three different stages.</p>
<h2><strong>Playing it Safe: Canary Deployments</strong></h2>
<p>Let's say your <code>prod</code> API is running perfectly, handling 10,000 users a minute. Your team has just built an exciting new feature, and you want to push it live.</p>
<p>If you update the <code>prod</code> stage directly and there is a bug in the code, all 10,000 users instantly crash. This is a disaster.</p>
<p>To solve this, API Gateway offers <strong>Canary Deployments</strong> (currently only available for REST APIs).</p>
<h2><strong>How a Canary Works</strong></h2>
<p>A Canary Deployment allows you to split your traffic. Instead of sending 100% of your users to the new code, you tell API Gateway:<br /><em>"Keep 95% of users on the old, stable version. Send a random 5% of users to the new, experimental version."</em></p>
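<p>The traffic split itself is easy to reason about with a tiny simulation. This is purely illustrative Python, not what API Gateway runs internally (in real life you would set <code>canarySettings</code> with <code>percentTraffic</code> on the stage):</p>
<pre><code class="language-python">import random

def pick_version(rng, canary_percent=5.0):
    # One routing decision: a request lands on the canary with
    # probability canary_percent / 100.
    return "canary" if rng.random() * 100 < canary_percent else "stable"

rng = random.Random(42)          # seeded for reproducibility
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(rng)] += 1
# counts["canary"] ends up close to 500, i.e. roughly 5% of 10,000 requests
</code></pre>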
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/28475a4a-d757-4266-850f-a8865be0659f.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Monitoring the Canary</strong></h2>
<p>Because you enabled the Canary, API Gateway automatically separates your logs and metrics. In AWS CloudWatch, you will see two separate folders: one for the 95% of normal traffic, and a special <code>/Canary</code> folder for the 5% testing the new code.</p>
<p>You monitor the Canary logs.</p>
<ul>
<li><p>Are the 5% of users getting errors? <strong>If yes</strong>, you instantly slide the traffic dial back to 0%. The experiment is over, but 95% of your users never noticed a thing.</p>
</li>
<li><p>Are the 5% of users getting fast, successful responses? <strong>If yes</strong>, you can "Promote" the Canary. API Gateway shifts 100% of the traffic over, and your new code officially becomes the new stable version.</p>
</li>
</ul>
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>You now have a beautifully named API that can be safely updated without breaking production. But what happens when things <em>do</em> go wrong? How do you figure out exactly which line of code is slowing down your system?</p>
<p>In our final post, <strong>Part 6: Pro-Level Observability, Logging &amp; Best Practices</strong>, we will cover how to use CloudWatch, CloudTrail, and X-Ray to trace and debug every single request that moves through your API.</p>
]]></content:encoded></item><item><title><![CDATA[Securing & Throttling Your APIs in AWS (Auth, WAF, Quotas) (Part - 4)]]></title><description><![CDATA[So far in this series, we have built fast HTTP APIs and real-time WebSocket APIs. But right now, we have a major problem: Our "front door" is wide open. Anyone on the internet can hit our API endpoint]]></description><link>https://blog.devopswithpiyush.in/api-gateway-4</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-4</guid><category><![CDATA[AWS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Devops articles]]></category><category><![CDATA[#Devopscommunity]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[serverless]]></category><category><![CDATA[Security]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[backend]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 12:28:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/33494487-4c0c-4956-99d7-097bce3981b5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So far in this series, we have built fast HTTP APIs and real-time WebSocket APIs. But right now, we have a major problem: Our "front door" is wide open. Anyone on the internet can hit our API endpoint, run our backend code, and potentially run up a massive AWS bill.</p>
<p>Security in the cloud is a "shared responsibility." AWS secures the physical servers and the network infrastructure, but <strong>you</strong> are responsible for deciding who is allowed to walk through your API's front door and what they are allowed to do once inside.</p>
<p>In this post, we will look at how to lock down your API Gateway using authentication, firewalls, and throttling rules.</p>
<h2><strong>The Bouncers: Authentication &amp; Authorization</strong></h2>
<p>You wouldn't let just anyone walk into a private club without checking their ID. In API Gateway, you have a few different "bouncers" you can hire to check IDs before letting a request through.</p>
<ol>
<li><p><strong>Amazon Cognito</strong> (The Standard Bouncer): If you are building a mobile or web app where users need to log in with a username and password (or via Google/Facebook), Amazon Cognito is usually the best choice. When a user logs in, Cognito gives their app a digital token. When the app calls your API, it flashes this token. API Gateway automatically checks with Cognito: "Is this token valid? Did this user really log in?" If yes, the request goes through.</p>
</li>
<li><p><strong>AWS IAM Roles</strong> (The VIP List): Sometimes, your API isn't meant for regular users. Maybe you have an internal AWS Lambda function or an EC2 server that needs to call your API. In this case, you use AWS Identity and Access Management (IAM). Instead of passwords, you give your internal servers special IAM roles. API Gateway checks this VIP list. If the server calling the API isn't on the list, it gets blocked immediately.</p>
</li>
<li><p><strong>Lambda Authorizers</strong> (The Custom Bouncer): What if you are already using a third-party login system like Auth0, or you have a weird, custom security requirement? You can write a Lambda Authorizer. This is simply a piece of custom code you write that runs before your actual API request. API Gateway hands the user's token (or headers) to your Lambda Authorizer. Your code inspects the data and returns a simple "Allow" or "Deny".</p>
</li>
</ol>
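<p>As a rough sketch, a token-based Lambda Authorizer might look like the handler below. The hard-coded token check and principal ID are placeholders; real code would verify a JWT or call your identity provider:</p>
<pre><code class="language-python">def handler(event, context=None):
    # For a TOKEN authorizer, API Gateway passes the client's token
    # in event["authorizationToken"] and the called method's ARN in
    # event["methodArn"].
    token = event.get("authorizationToken", "")
    effect = "Allow" if token == "secret-token" else "Deny"  # placeholder check
    return {
        "principalId": "user-123",  # placeholder identity
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event["methodArn"],
            }],
        },
    }
</code></pre>
<p>API Gateway caches the returned policy for a configurable TTL, so the authorizer doesn't have to run on every single request.</p>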
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/49e0393e-122c-4466-833a-31d952b6a983.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>The Security Guards: Firewalls and Policies</strong></h2>
<p>Checking IDs is great, but what if someone is trying to blow up the building? You need deeper security layers.</p>
<h2><strong>AWS WAF (Web Application Firewall)</strong></h2>
<p>If you chose to build a <strong>REST API</strong> (as we discussed in Part 2), you can attach <strong>AWS WAF</strong> directly to your API Gateway.<br />AWS WAF acts as a smart firewall that protects your API from common web exploits. If a hacker tries to send a malicious SQL injection attack or a flood of bot traffic, AWS WAF will intercept and block the request before it even reaches your API Gateway.</p>
<h2><strong>Resource Policies</strong></h2>
<p>Sometimes, you want to restrict access based on <em>where</em> the request is coming from, not just <em>who</em> is sending it. <strong>Resource Policies</strong> let you tell API Gateway: <em>"Only allow requests if they come from this specific IP address, or from inside this specific private network (VPC)."</em> This is perfect for internal company APIs.</p>
<h2><strong>The Managers: Usage Plans &amp; Throttling</strong></h2>
<p>Even legitimate users can crash your system if they send too many requests at once. To prevent your backend servers from melting (and your AWS bill from exploding), you need to set limits.</p>
<h2><strong>API Keys and Usage Plans</strong></h2>
<p>If you want to sell access to your API (like a weather data service) or limit how much third-party developers can use it, you can generate <strong>API Keys</strong>.<br />You group these keys into <strong>Usage Plans</strong>. For example:</p>
<ul>
<li><p><strong>Basic Plan:</strong> The user's API Key allows 1,000 requests per month.</p>
</li>
<li><p><strong>Pro Plan:</strong> The user's API Key allows 10,000 requests per month.</p>
</li>
</ul>
<p>Once a user hits their limit, API Gateway automatically blocks them with a <code>429 Too Many Requests</code> error.</p>
<h2><strong>Throttling and Burst Limits</strong></h2>
<p>What if a user tries to send all 1,000 of their monthly requests in a single second? That spike could crash your database.<br />API Gateway allows you to set <strong>Rate Limits</strong> (how many steady requests per second are allowed) and <strong>Burst Limits</strong> (how many sudden, simultaneous requests are allowed). This ensures smooth, predictable traffic flow to your backend servers.</p>
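<p>A useful mental model for how rate and burst limits interact is a token bucket. The sketch below is illustrative Python, not API Gateway's actual implementation:</p>
<pre><code class="language-python">class Throttle:
    def __init__(self, rate, burst):
        self.rate = rate            # steady requests/second refilled
        self.burst = burst          # bucket size: max sudden spike allowed
        self.tokens = float(burst)  # bucket starts full
        self.last = 0.0

    def allow(self, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would receive 429 Too Many Requests
</code></pre>
<p>With <code>Throttle(rate=10, burst=5)</code>, five requests arriving in the same instant all succeed, the sixth is rejected, and a tenth of a second later one token has refilled so traffic flows again at the steady rate.</p>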
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>Now our API is fast, supports real-time communication, and is locked down tight. But what happens when we need to release a new version of our API without breaking the old one? Or what if we want our API to live at <a href="http://api.mycoolstartup.com"><code>api.mycoolstartup.com</code></a> instead of a random, ugly AWS URL?</p>
<p>In <strong>Part 5: Data Mapping, Custom Domains &amp; Deployments</strong>, we will look at how to transform data on the fly and how to launch your API into production like a pro.</p>
]]></content:encoded></item><item><title><![CDATA[WebSocket APIs in AWS – Building Real-Time Magic (Part - 3)]]></title><description><![CDATA[In Part 2, we looked at HTTP and REST APIs. These are known as stateless APIs. You (the client) ask a question, the server gives an answer, and then the server immediately forgets about you. If you wa]]></description><link>https://blog.devopswithpiyush.in/api-gateway-3</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-3</guid><category><![CDATA[AWS]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Devops]]></category><category><![CDATA[serverless]]></category><category><![CDATA[websockets]]></category><category><![CDATA[backend]]></category><category><![CDATA[StateFUL]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Devops articles]]></category><category><![CDATA[DevOps Journey]]></category><category><![CDATA[#Devopscommunity]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:57:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/f8d163d1-e095-47a5-a679-ac4ca449ddf4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 2, we looked at HTTP and REST APIs. These are known as <em>stateless</em> APIs. You (the client) ask a question, the server gives an answer, and then the server immediately forgets about you. If you want another update, you have to ask again.</p>
<p>But what if you are building a chat application, a live stock ticker, or a multiplayer game? You can't have your app asking the server "Any new messages?" every single second—it would drain the user's battery and crash your server.</p>
<p>You need the server to say, "Hey, don't keep asking. Just stay on the line, and I will push the new messages to you the second they arrive."</p>
<p>This is where <strong>Amazon API Gateway WebSocket APIs</strong> come in.</p>
<h2><strong>What is a WebSocket API?</strong></h2>
<p>Unlike a standard REST API, a WebSocket API is <strong>stateful and bidirectional</strong>.</p>
<p>Think of a REST API like sending a text message: You send a text, wait, and get a reply.<br />Think of a WebSocket API like a phone call: You dial the number, someone picks up, and the line stays open. Both of you can talk and listen at the exact same time without having to hang up and redial.</p>
<p>In API Gateway, a WebSocket API creates a persistent connection between your user's app and your AWS backend. The backend can now independently push data down to the client without the client explicitly requesting it.</p>
<h2><strong>How Do WebSockets Work in API Gateway?</strong></h2>
<p>Because the connection stays open, API Gateway needs a way to figure out what to do with the continuous stream of messages flowing back and forth. It does this using <strong>Routes</strong>.</p>
<p>When you build an HTTP API, you use URLs (like <code>/get-weather</code> or <code>/update-profile</code>) to tell the server what you want. But in a WebSocket, there is only one URL. Once you are connected, everything happens over that single open connection.</p>
<p>So, how does the server know if a message is a "chat message" or a "friend request"? API Gateway looks inside the actual content of the message using something called a <strong>Route Selection Expression</strong>.</p>
<p>If your app sends a JSON message like this:</p>
<pre><code class="language-json">{
 "action": "send_message",
 "text": "Hello World!" 
}
</code></pre>
<p>API Gateway can look at the <code>"action"</code> property. It sees <code>"send_message"</code> and routes that specific chunk of data to the correct AWS Lambda function.</p>
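<p>You can mimic that routing decision in a few lines of Python. The handler functions here are stand-ins for your Lambda functions; the real route selection expression for this setup would be <code>$request.body.action</code>:</p>
<pre><code class="language-python">import json

def route_message(raw, handlers):
    # Mimics API Gateway evaluating the route selection expression
    # $request.body.action against one incoming message.
    try:
        action = json.loads(raw).get("action")
    except json.JSONDecodeError:
        action = None
    handler = handlers.get(action, handlers["$default"])
    return handler(raw)

# Stand-ins for the Lambda functions each route would invoke:
handlers = {
    "send_message": lambda raw: "routed to SendMessageFunction",
    "$default": lambda raw: "Sorry, I didn't understand that command.",
}
</code></pre>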
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/a9d2e342-248a-42ff-972a-a5b106fdc593.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>The Three Magical Predefined Routes</strong></h2>
<p>When you set up a WebSocket API, AWS gives you three built-in routes to manage the lifecycle of the phone call:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0d9378be-1043-4dcf-96df-c10d4b3c7298.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p><code>$connect</code>: This triggers the exact moment a user opens the app and connects to the API. You usually connect this to a Lambda function that saves the user's unique "Connection ID" into a database (like DynamoDB) so you know who is online.</p>
</li>
<li><p><code>$disconnect</code>: This triggers when the user closes the app or loses their internet connection. You use this to delete their Connection ID from your database.</p>
</li>
<li><p><code>$default</code>: If the user sends a message that doesn't match any of your custom rules, it falls into this bucket. It is a great place to send error messages like "Sorry, I didn't understand that command."</p>
</li>
</ol>
<h2><strong>How the Server Talks Back</strong></h2>
<p>Getting messages from the user is easy, but how does the server push messages back to them?</p>
<p>Because you saved the user's "Connection ID" during the <code>$connect</code> phase, your backend services (like Lambda) can use a special AWS command called the <code>@connections</code> <strong>API</strong>.</p>
<p>If User A sends a chat message intended for User B, your Lambda function looks up User B's Connection ID in your database. It then uses the <code>@connections</code> API to push the text directly to User B's open WebSocket.</p>
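<p>With boto3, that push boils down to one call on the <code>apigatewaymanagementapi</code> client. A minimal sketch (the endpoint URL in the comment is a placeholder for your own API's URL; the client is injected so the function is testable):</p>
<pre><code class="language-python"># In production the client points at your WebSocket API's stage URL:
#   client = boto3.client(
#       "apigatewaymanagementapi",
#       endpoint_url="https://your-api-id.execute-api.us-east-1.amazonaws.com/production")
def push_to_client(client, connection_id, message):
    # ConnectionId is the ID you stored in DynamoDB during $connect.
    return client.post_to_connection(
        ConnectionId=connection_id,
        Data=message.encode("utf-8"),
    )
</code></pre>
<p>If the connection has already closed, <code>post_to_connection</code> raises a <code>GoneException</code>, which is your cue to delete that stale Connection ID from the database.</p>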
<h2><strong>Important Limitations to Keep in Mind</strong></h2>
<p>WebSockets are powerful, but they aren't magic. AWS enforces a few rules you need to know:</p>
<ul>
<li><p><strong>Idle Timeouts:</strong> If a user connects but doesn't send or receive any data for 10 minutes, API Gateway will automatically hang up the phone (closing the connection with a <strong>1001 status code</strong>).</p>
</li>
<li><p><strong>Maximum Lifespan:</strong> Even if the user is actively chatting, AWS forces a hard reset after 2 hours. Your app needs to be programmed to quietly reconnect when this happens.</p>
</li>
<li><p><strong>Payload Limits:</strong> If a user tries to send a message that is too massive (the cap is 128 KB per message), API Gateway will reject it with a <strong>1009 status code</strong>.</p>
</li>
</ul>
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>Now you know how to build fast HTTP APIs and real-time WebSocket APIs. But so far, we have left the front door completely unlocked. Anyone on the internet can access your endpoints, which could cost you a fortune or expose your data.</p>
<p>In <strong>Part 4: Securing &amp; Throttling Your APIs</strong>, we are going to lock things down. We will look at how to use IAM, Lambda Authorizers, and Amazon Cognito to ensure only the right people get through the door, and how to use Quotas so they don't overwhelm your servers.</p>
]]></content:encoded></item><item><title><![CDATA[REST vs. HTTP APIs in AWS – Which One Should You Pick? (And How to Build Your First One) (Part - 2)]]></title><description><![CDATA[In Part 1, we learned that API Gateway acts as the helpful "waiter" standing between your users and your backend servers. But when you log into the AWS Console to create your first API, AWS asks you t]]></description><link>https://blog.devopswithpiyush.in/api-gateway-2</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-2</guid><category><![CDATA[AWS]]></category><category><![CDATA[aws-apigateway]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Devops]]></category><category><![CDATA[serverless]]></category><category><![CDATA[APIs]]></category><category><![CDATA[REST API]]></category><category><![CDATA[http api]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 11:19:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/1e48e602-7d4e-4017-8150-dee8792dd198.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In Part 1, we learned that API Gateway acts as the helpful "waiter" standing between your users and your backend servers. But when you log into the AWS Console to create your first API, AWS asks you to choose a menu: Do you want an HTTP API or a REST API?</p>
<p>Both of them do the exact same core job (moving data between a client and a server), but they have very different price tags and features. Let's break down the difference in simple English and then build one in less than 5 minutes.</p>
<h2><strong>The Fine Dining vs. Fast Food Analogy</strong></h2>
<p>Think of a <strong>REST API</strong> like a high-end, fine-dining restaurant experience. You get a massive menu of features: valet parking, custom table settings, and a sommelier. In the AWS world, this means built-in API keys to sell access to your API, strict request validation (making sure users don't send garbage data), and integration with AWS WAF to block hackers. But just like fine dining, it is heavier and costs more.</p>
<p>Think of an <strong>HTTP API</strong> like a high-quality fast-food drive-thru. It is designed to be lean, incredibly fast, and very cheap. It strips away the heavy "fine dining" features you probably don't need for a simple app. If you just want to connect a mobile app to an AWS Lambda function as quickly and cheaply as possible, this is your choice.</p>
<h2><strong>The Showdown: HTTP API vs. REST API</strong></h2>
<p>Here is a simple cheat sheet to help you decide which API type fits your project:</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>HTTP API (The Fast Track)</th>
<th>REST API (The Heavyweight)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Cost</strong></td>
<td>Up to 71% cheaper than REST.</td>
<td>More expensive.</td>
</tr>
<tr>
<td><strong>Speed</strong></td>
<td>Lower latency (faster responses).</td>
<td>Slightly higher latency due to heavy features.</td>
</tr>
<tr>
<td><strong>API Keys &amp; Monetization</strong></td>
<td>❌ Not supported.</td>
<td>✅ Yes, you can generate keys and throttle usage per client.</td>
</tr>
<tr>
<td><strong>AWS WAF (Firewall)</strong></td>
<td>❌ Not supported.</td>
<td>✅ Yes, built-in protection against web exploits.</td>
</tr>
<tr>
<td><strong>Edge-Optimized Endpoints</strong></td>
<td>❌ Regional only.</td>
<td>✅ Yes, routes traffic through AWS's global network.</td>
</tr>
<tr>
<td><strong>Built-in Caching</strong></td>
<td>❌ Not supported.</td>
<td>✅ Yes, caches responses to save backend compute time.</td>
</tr>
<tr>
<td><strong>When to use it?</strong></td>
<td>Connecting a simple web/mobile app directly to a Lambda function or a database.</td>
<td>Enterprise apps, public APIs you want to sell, or highly secure financial apps.</td>
</tr>
</tbody></table>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/7d0694f2-8789-4ea7-a8eb-755195a5b3f3.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Let's Build Your First HTTP API (In 5 Minutes)</strong></h2>
<p>Since HTTP APIs are the easiest and cheapest way to get started, let's build a simple one right now. We will assume you already have a basic "Hello World" AWS Lambda function ready to go.</p>
<h2><strong>Step 1: Create the API</strong></h2>
<p>Log into the AWS Management Console, search for <strong>API Gateway</strong>, and click <strong>Create API</strong>. Under "HTTP API," click the <strong>Build</strong> button.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/1ff01d92-cd32-4b5b-a30b-14699bd6f4dc.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Step 2: Add Your Integration</strong></h2>
<p>API Gateway will ask you what you want this API to talk to. Click <strong>Add integration</strong>. Select <strong>Lambda</strong> from the dropdown, and then choose your "Hello World" Lambda function. Give your API a name (like <code>MyFirstFastAPI</code>).</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/fe8f090b-c876-42ef-8f27-d585305c54ab.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>Step 3: Configure Your Routes</strong></h2>
<p>A "Route" is just the specific URL path a user visits to trigger your code.</p>
<ul>
<li><p>Set the Method to <strong>GET</strong> (this means the user is just asking for data).</p>
</li>
<li><p>Set the Resource path to <code>/hello</code>.</p>
</li>
<li><p>Make sure it points to your Lambda function.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/87570e51-1ab6-4d79-965d-61a05f3f8414.png" alt="" style="display:block;margin:0 auto" />
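<p>If you don't have the "Hello World" function yet, here is a minimal sketch of one (Python is assumed here; any supported runtime works the same way). With an HTTP API, Lambda receives the request as an event dictionary and returns a dictionary describing the HTTP response:</p>

```python
import json

def lambda_handler(event, context):
    # HTTP APIs (payload format 2.0) pass the query string in the event;
    # the key is None when the caller sends no parameters at all.
    name = (event.get("queryStringParameters") or {}).get("name", "World")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

<p>Calling <code>/hello?name=Piyush</code> would then return <code>{"message": "Hello, Piyush!"}</code>, while a plain <code>/hello</code> falls back to the default greeting.</p>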

<h2><strong>Configure Stages (Optional)</strong></h2>
<p>Stages are independently configurable environments that your API can be deployed to. You must deploy to a stage for API configuration changes to take effect, unless that stage is configured to autodeploy. By default, all HTTP APIs created through the console have a default stage named $default. All changes that you make to your API are autodeployed to that stage. You can add stages that represent environments such as development or production.</p>
<h2><strong>Step 4: Deploy and Test!</strong></h2>
<p>AWS HTTP APIs have a magical feature called <strong>Automatic Deployments</strong>. As soon as you hit "Create," AWS immediately pushes your API to the internet.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/02afa3be-0fbb-4208-b5aa-23116edfda48.png" alt="" style="display:block;margin:0 auto" />

<p>You will see an "Invoke URL" on your screen. Copy that URL, paste it into your browser, add <code>/hello</code> to the end, and hit enter. Boom! You just triggered a serverless backend from the public internet.</p>
<pre><code class="language-plaintext">https://d-l9jomka06f.execute-api.us-east-1.amazonaws.com/hello
</code></pre>
<hr />
<h2><strong>What's Next in the Series?</strong></h2>
<p>Now you know how to build a basic stateless API. But what if you are building a chat application, a live stock ticker, or a multiplayer game where the server needs to push updates to the user instantly? A standard HTTP API won't cut it.</p>
<p>In <strong>Part 3: WebSocket APIs — Building Real-Time Magic</strong>, we will dive into stateful connections, where the API keeps the connection open constantly for real-time two-way communication.</p>
]]></content:encoded></item><item><title><![CDATA[The Ultimate Beginner's Guide to AWS API Gateway Architecture (Part - 1)]]></title><description><![CDATA[Have you ever wondered how mobile apps and websites magically talk to servers without crashing when millions of users log in at once? The secret often lies in a powerful "front door" known as an API G]]></description><link>https://blog.devopswithpiyush.in/api-gateway-1</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/api-gateway-1</guid><category><![CDATA[AWS]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[Devops]]></category><category><![CDATA[serverless]]></category><category><![CDATA[learning]]></category><category><![CDATA[lambda]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:56:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/65132fc4-5c9f-4106-b23a-30a03bb278fe.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever wondered how mobile apps and websites magically talk to servers without crashing when millions of users log in at once? The secret often lies in a powerful "front door" known as an API Gateway. In this post, we are going to break down the architecture and fundamentals of Amazon API Gateway.</p>
<h2><strong>What is Amazon API Gateway?</strong></h2>
<p>Imagine you are at a massive, bustling luxury restaurant. You (the client) don't walk directly into the kitchen (the server) to cook your own food or yell your order at the chefs. Instead, you talk to a waiter. The waiter takes your order, makes sure you are allowed to order from that menu, hands the request to the right chef in the kitchen, and then brings your food back to you.</p>
<p>In the AWS cloud, <strong>Amazon API Gateway is that waiter</strong>.</p>
<p>It is a fully managed AWS service that acts as the "front door" for your applications. Instead of your mobile app or website talking directly to your backend databases or code, it talks to the API Gateway. The Gateway handles all the heavy lifting—like accepting up to hundreds of thousands of concurrent API calls, managing traffic, and ensuring only authorized users get through.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/259c9828-0984-47be-9cb0-2d3bcbeeec5a.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>The Core Architecture</strong></h2>
<p>To understand how API Gateway works, you only need to know how it sits between your users and your backend.</p>
<p>When a user interacts with your app, their request hits an <strong>API endpoint</strong>. This is essentially a web address (a URL) that API Gateway provides. AWS offers different types of endpoints depending on where your users are:</p>
<ul>
<li><p><strong>Edge-optimized endpoints:</strong> Best for users scattered globally. It uses AWS's global network to route requests to the nearest location, speeding up the connection.</p>
</li>
<li><p><strong>Regional endpoints:</strong> Perfect if your users and your backend servers are in the same geographic region, cutting out unnecessary travel time.</p>
</li>
<li><p><strong>Private endpoints:</strong> Used when you want to keep your API completely hidden from the public internet, allowing access only from within your secure AWS network.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/fb07c281-7fd6-4c6f-9dee-4cde8f55cef0.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>How Requests Travel: The Integration Phase</strong></h2>
<p>Once the API Gateway receives a request, it needs to know what to do with it. This is where <strong>Integrations</strong> come in.</p>
<p>API Gateway uses an <strong>Integration request</strong> to map the incoming data (like a user submitting a form) into a format that your backend code can understand. It then passes the request to your backend—which could be an AWS Lambda function, an Amazon EC2 server, or any other web application.</p>
<p>Once your backend does its job (like fetching user data), it sends the data back to the API Gateway. The Gateway uses an <strong>Integration response</strong> to package that data neatly and hand it back to the user's app.</p>
<h2><strong>A Real-World Example: Proxy Integration</strong></h2>
<p>Sometimes, you don't want the waiter to repackage your order; you just want them to hand it straight to the chef as-is. This is called a <strong>Proxy integration</strong>.</p>
<p>Let's say you have a simple app that checks the weather. With a proxy integration, API Gateway takes the user's exact request ("What is the weather in London?"), hands the entire thing directly to an AWS Lambda function, and then takes the Lambda function's exact answer and gives it back to the user. It is the easiest and most common way to connect API Gateway to serverless code today because it requires minimal setup.</p>
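<p>To make "hand it straight to the chef" concrete, here is a rough sketch of a proxy-integrated handler (HTTP API payload format 2.0 field names are assumed; other API types shape the event slightly differently). The function simply reads what API Gateway passed through untouched:</p>

```python
import json

def lambda_handler(event, context):
    # With proxy integration, the whole request arrives as-is:
    # method, path, headers, and query string are all inside `event`.
    method = event.get("requestContext", {}).get("http", {}).get("method", "GET")
    path = event.get("rawPath", "/")
    # Whatever we return here is handed straight back to the caller.
    return {
        "statusCode": 200,
        "body": json.dumps({"method": method, "path": path}),
    }
```

<p>No mapping templates, no repackaging: the event in and the response out are the whole contract.</p>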
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/fa018df1-41d5-4ced-aadc-cf6b9da1a750.png" alt="" style="display:block;margin:0 auto" />

<h2><strong>What's Next?</strong></h2>
<p>This was <strong>Part 1</strong> of our complete AWS API Gateway blog series, where we covered the foundation — what API Gateway is, how its architecture works, how requests travel through integrations, and the different endpoint types available to you.</p>
<p>Now that you understand the "waiter" and how the restaurant works, it's time to look at the <strong>menu options</strong>. API Gateway doesn't offer just one type of API — it gives you three distinct flavors: <strong>REST APIs, HTTP APIs, and WebSocket APIs</strong>. Choosing the wrong one can cost you extra money or leave you without features you need.</p>
<p>In <strong>Part 2: REST APIs vs HTTP APIs — Which One Should You Pick?</strong>, we will break down the two stateless API types side by side in plain English. We'll cover:</p>
<ul>
<li><p>What makes REST APIs and HTTP APIs different (spoiler: it's not just the name)</p>
</li>
<li><p>A simple comparison table of features, pricing, and use cases</p>
</li>
<li><p>When to pick one over the other with real-world scenarios</p>
</li>
<li><p>Common mistakes beginners make when choosing between them</p>
</li>
</ul>
<p>If you are just getting started with API Gateway, <strong>bookmark this series</strong> — we are going to cover every single feature, configuration, and limitation of the service across the upcoming parts.</p>
<p>💡 <strong>Pro Tip:</strong> Each blog in this series is designed to be read independently, but following the sequence will give you the most complete understanding — from zero to production-ready.</p>
]]></content:encoded></item><item><title><![CDATA[AWS Lambda: The Complete Guide — From Zero to Expert]]></title><description><![CDATA[AWS Lambda is one of the most widely used services in modern cloud and DevOps architectures — but many engineers still struggle to understand when to actually use it.
Should you use Lambda or EC2?When]]></description><link>https://blog.devopswithpiyush.in/aws-lambda</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/aws-lambda</guid><category><![CDATA[aws lambda]]></category><category><![CDATA[AWS]]></category><category><![CDATA[serverless]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[sam-template]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Sat, 21 Mar 2026 17:54:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0433770c-28dc-448e-b920-9e2b9f815903.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AWS Lambda is one of the most widely used services in modern cloud and DevOps architectures — but many engineers still struggle to understand when to actually use it.</p>
<p>Should you use Lambda or EC2?<br />When does serverless make sense?<br />What are the real-world scenarios?</p>
<p>In this guide, we’ll go from zero to advanced — covering how Lambda works, when to use it, key configurations, and production-grade patterns like API Gateway integration and SAM templates.</p>
<p>By the end, you’ll not just understand Lambda — you’ll know how to use it in real systems.</p>
<hr />
<h2><strong>Lambda vs EC2: When to Use What</strong></h2>
<p>EC2 gives you full control over virtual servers — OS, networking, storage, patching — while Lambda abstracts all of that away.</p>
<table>
<thead>
<tr>
<th>Feature</th>
<th>Lambda (Serverless)</th>
<th>EC2 (Server-based)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Management</strong></td>
<td>AWS manages OS, patching, scaling</td>
<td>You manage everything</td>
</tr>
<tr>
<td><strong>State</strong></td>
<td>Stateless (ephemeral)</td>
<td>Stateful (persistent)</td>
</tr>
<tr>
<td><strong>Pricing</strong></td>
<td>Pay per request + duration (ms)</td>
<td>Pay per hour/second for provisioned capacity</td>
</tr>
<tr>
<td><strong>Scaling</strong></td>
<td>Automatic, instant</td>
<td>Manual or Auto Scaling Groups</td>
</tr>
<tr>
<td><strong>Max Execution</strong></td>
<td>15 minutes</td>
<td>Unlimited</td>
</tr>
<tr>
<td><strong>Control</strong></td>
<td>Low</td>
<td>Full OS-level control</td>
</tr>
</tbody></table>
<h2><strong>When to Use Lambda</strong></h2>
<ul>
<li><p><strong>Event-driven workloads</strong>: S3 file uploads triggering processing, DynamoDB stream handlers</p>
</li>
<li><p><strong>API backends</strong>: Lightweight REST/GraphQL APIs behind API Gateway</p>
</li>
<li><p><strong>Scheduled tasks</strong>: Cron-like jobs (e.g., daily tenant reports like the <code>Daily-tenant-report</code> function in the screenshot)</p>
</li>
<li><p><strong>Chatbot/IoT processing</strong>: Handling Alexa skills, IoT device data</p>
</li>
<li><p><strong>Automation</strong>: Infrastructure tasks triggered by CloudTrail or Config rules</p>
</li>
</ul>
<h2><strong>When to Use EC2</strong></h2>
<ul>
<li><p>Long-running processes (&gt;15 minutes)</p>
</li>
<li><p>Stateful applications needing persistent memory</p>
</li>
<li><p>Legacy monolithic apps requiring specific OS configurations</p>
</li>
<li><p>GPU/specialized hardware workloads</p>
</li>
</ul>
<h2><strong>Hybrid Approach</strong></h2>
<p>Many organizations use both — Lambda for bursty, event-driven tasks and EC2 for steady-state workloads requiring fine-grained control.</p>
<h2>Creating a Lambda Function (Step by Step)</h2>
<p>When you click <strong>Create function</strong> in the console, you see several options:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ae6a9500-8bf7-4210-bb1c-83027d9f0c16.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li><p><strong>Author from Scratch</strong>: Start with a Hello World example. You pick a runtime (e.g., nodejs24.x, python3.12), name your function, and Lambda sets up a basic handler.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/438b3782-b433-4768-bf4f-94ddbf968038.png" alt="" style="display:block;margin:0 auto" />
</li>
<li><p><strong>Use a Blueprint</strong>: Pre-built sample code for common use cases — S3 thumbnail generation, DynamoDB processing, Kinesis stream readers. Great for learning.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/a67b3a37-9c03-4fde-94ae-2724e40e3b3e.png" alt="" style="display:block;margin:0 auto" />
</li>
<li><p><strong>Container Image</strong>: Deploy your function as a Docker container image stored in Amazon ECR. More on this in the advanced section below.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/c1377b83-529f-4f48-bb05-0a163570d4d1.png" alt="" style="display:block;margin:0 auto" /></li>
</ol>
<hr />
<h2><strong>Architecture: arm64 vs x86_64</strong></h2>
<p>When creating a function, you choose the instruction set architecture:</p>
<table>
<thead>
<tr>
<th>Architecture</th>
<th>Description</th>
<th>Best For</th>
</tr>
</thead>
<tbody><tr>
<td><strong>x86_64</strong></td>
<td>Traditional Intel/AMD. Default option.</td>
<td>Compatibility with existing libraries</td>
</tr>
<tr>
<td><strong>arm64</strong></td>
<td>AWS Graviton2 processors. Up to 20% cheaper and often faster.</td>
<td>Cost optimization, new workloads</td>
</tr>
</tbody></table>
<p><strong>Scenario</strong>: If you're writing a Python-based daily report generator (like <code>Daily-tenant-report</code>), <code>arm64</code> is an easy win — most Python packages support it and you save money.</p>
<hr />
<h2><strong>Lambda Configuration Deep Dive</strong></h2>
<h2><strong>General Configuration</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/de0f99b2-7547-47e3-b417-e344695d10cf.png" alt="" style="display:block;margin:0 auto" />

<ul>
<li><p><strong>Memory</strong>: 128 MB to 10,240 MB. CPU scales proportionally with memory.</p>
</li>
<li><p><strong>Timeout</strong>: 1 second to 15 minutes max.</p>
</li>
<li><p><strong>Ephemeral storage (</strong><code>/tmp</code><strong>)</strong>: 512 MB to 10,240 MB for temporary files.</p>
</li>
</ul>
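<p>The timeout setting matters inside your code, too: a handler can ask the context object how much time it has left and wind down gracefully instead of being killed mid-task. A sketch (the doubling "work" is just a placeholder):</p>

```python
def lambda_handler(event, context):
    # Stop taking on new work when fewer than 5 seconds remain,
    # rather than being cut off by the configured timeout.
    processed = []
    for item in event.get("items", []):
        if context.get_remaining_time_in_millis() < 5_000:
            break
        processed.append(item * 2)  # placeholder for real work
    return {"processed": processed}
```

<p><code>get_remaining_time_in_millis()</code> is part of the standard Lambda context object for Python, so no extra setup is needed.</p>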
<h2><strong>Environment Variables</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/6f4bb45d-91c8-4544-9010-1a0bd0f32486.png" alt="" style="display:block;margin:0 auto" />

<p>Key-value pairs injected at runtime. Use them for:</p>
<ul>
<li><p>Database connection strings</p>
</li>
<li><p>API keys (encrypted with KMS)</p>
</li>
<li><p>Feature flags</p>
</li>
<li><p>Stage identifiers (<code>prod</code>, <code>staging</code>)</p>
</li>
</ul>
<pre><code class="language-python">import os

# Fail fast at cold start if a required setting is missing
DB_HOST = os.environ['DB_HOST']
API_KEY = os.environ['API_KEY']
STAGE = os.environ.get('STAGE', 'staging')  # optional, with a default
</code></pre>
<h2><strong>Permissions (Execution Role)</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/6db08e69-d9ed-4036-aa1b-d8666d91a74d.png" alt="" style="display:block;margin:0 auto" />

<p>Every Lambda function needs an IAM execution role. By default, Lambda creates one with CloudWatch Logs permissions. You add policies for whatever the function accesses — S3, DynamoDB, SQS, etc.</p>
<p><strong>Scenario</strong>: Your <code>Daily-tenant-report</code> function needs to read from DynamoDB and send emails via SES → attach <code>AmazonDynamoDBReadOnlyAccess</code> and <code>AmazonSESFullAccess</code> policies to the execution role.</p>
<h2><strong>VPC Configuration</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/8d025aca-37f5-480a-b078-25889154a7b8.png" alt="" style="display:block;margin:0 auto" />

<p>Connect your Lambda to a VPC to access private resources like RDS databases or ElastiCache. When enabled, Lambda creates ENIs in your specified subnets.</p>
<p><strong>Trade-off</strong>: VPC-connected functions may have slightly longer cold starts, though AWS has significantly improved this.</p>
<h2><strong>Function URL</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/467b2f4d-7365-4bd6-91fc-10fd23771e37.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/a2935733-033e-444b-88a9-028f6e89c15e.png" alt="" style="display:block;margin:0 auto" />

<p>Assign an HTTPS endpoint directly to your Lambda — no API Gateway needed. Great for simple webhooks or internal tools.</p>
<h2><strong>Triggers</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0a66712d-3241-4b9c-bc6c-6b7dc7565578.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/3c24a882-0c14-4970-b3e9-fdbbea2833ce.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/57493e02-638b-4952-aacf-738f8bb2531c.png" alt="" style="display:block;margin:0 auto" />

<p>Lambda can be triggered by 200+ AWS services:</p>
<ul>
<li><p>API Gateway (HTTP requests)</p>
</li>
<li><p>S3 (file events)</p>
</li>
<li><p>DynamoDB Streams (data changes)</p>
</li>
<li><p>SQS/SNS (messages)</p>
</li>
<li><p>EventBridge (scheduled/event rules)</p>
</li>
<li><p>CloudWatch (alarms)</p>
</li>
</ul>
<h2><strong>Destinations</strong></h2>
<p>Configure where successful or failed async invocation results go — SQS, SNS, Lambda, or EventBridge.</p>
<h2><strong>Concurrency and Recursion Detection</strong></h2>
<p>Concurrency simply means:</p>
<p>👉 <strong>How many times your Lambda function can run at the same time</strong></p>
<ul>
<li><p><strong>Reserved concurrency</strong>: Guarantees a set number of concurrent executions</p>
<ul>
<li><p>Think of this as:</p>
<p>👉 <em>“I want to reserve a fixed number of slots for my function”</em></p>
<ul>
<li><p>Guarantees that your function always has capacity available</p>
</li>
<li><p>Prevents other functions from using all resources</p>
</li>
</ul>
<p>📌 <strong>Example:</strong></p>
<ul>
<li><p>You set reserved concurrency = 10</p>
</li>
<li><p>Your function can run up to 10 times simultaneously</p>
</li>
<li><p>Even if the system is busy, these 10 slots are reserved for you</p>
</li>
</ul>
<p>✔️ Useful for:</p>
<ul>
<li><p>Critical applications</p>
</li>
<li><p>Preventing overload on downstream systems (like databases)</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Provisioned concurrency</strong>: Pre-initializes execution environments to eliminate cold starts</p>
<ul>
<li><p>Normally, Lambda may take a little time to start (called <strong>cold start</strong>).</p>
<p>Provisioned concurrency means:</p>
<p>👉 <em>“Keep some instances of my function already running”</em></p>
<ul>
<li><p>Removes cold start delays</p>
</li>
<li><p>Improves response time</p>
</li>
</ul>
<p>📌 <strong>Example:</strong></p>
<ul>
<li><p>You configure 5 provisioned instances</p>
</li>
<li><p>These are always ready → faster execution</p>
</li>
</ul>
<p>✔️ Useful for:</p>
<ul>
<li><p>APIs</p>
</li>
<li><p>User-facing applications</p>
</li>
<li><p>Low-latency requirements</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Recursion detection</strong>: Prevents infinite loops where Lambda triggers itself</p>
<ul>
<li><p>This is a safety feature.</p>
<p>👉 Prevents your Lambda from calling itself again and again in a loop.</p>
</li>
</ul>
</li>
</ul>
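<p>A rough local analogy in plain Python (this is not AWS code, just a mental model): reserved concurrency behaves like a fixed pool of slots, and an invocation that finds no free slot is throttled with a 429:</p>

```python
import threading

RESERVED_CONCURRENCY = 10  # the 10 "slots" from the example above
slots = threading.BoundedSemaphore(RESERVED_CONCURRENCY)

def invoke(handler, event):
    # No free slot -> Lambda throttles the caller (HTTP 429)
    if not slots.acquire(blocking=False):
        return {"statusCode": 429, "body": "TooManyRequestsException"}
    try:
        return handler(event)
    finally:
        slots.release()
```

<p>Provisioned concurrency is the same pool, but with the environments already warmed up before the first request arrives.</p>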
<h2><strong>Code Signing</strong></h2>
<p>Ensures only trusted, signed code runs in your function. You create a Code Signing Configuration linking to an AWS Signer signing profile.</p>
<h2><strong>Monitoring and Operations Tools</strong></h2>
<p>Lambda integrates with CloudWatch Logs, X-Ray (tracing), and CloudWatch Lambda Insights for performance monitoring.</p>
<h2><strong>Versions and Aliases</strong></h2>
<h2><strong>Versions</strong></h2>
<p>A <strong>version</strong> is an immutable snapshot of your function's code + configuration. When you publish a version, Lambda assigns it a number (1, 2, 3...). The <code>$LATEST</code> version is always mutable — it's your working copy.</p>
<p><strong>Scenario</strong>: You deploy v1 of <code>Daily-tenant-report</code> to production. You make changes and publish v2. If v2 has a bug, v1 still exists untouched.</p>
<h2><strong>Aliases</strong></h2>
<p>An <strong>alias</strong> is a named pointer (like <code>prod</code>, <code>staging</code>, <code>dev</code>) to a specific version.</p>
<pre><code class="language-shell">aws lambda create-alias \
  --function-name Daily-tenant-report \
  --name prod \
  --function-version 5
</code></pre>
<p><strong>Why aliases matter</strong>: Your API Gateway integration points to the alias ARN, not a version number. When you deploy v6, just update the alias — no need to change API Gateway.</p>
<h2><strong>Additional Resources Explained</strong></h2>
<h2><strong>Layers</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/c7d45c2a-2b54-415c-a4a9-9b2f93dad903.png" alt="" style="display:block;margin:0 auto" />

<p>A <strong>layer</strong> is a .zip archive containing libraries, custom runtimes, or other dependencies. Instead of bundling everything in your deployment package, you attach shared layers.</p>
<p><strong>Scenario</strong>: Multiple Lambda functions use the <code>pandas</code> library. Create one layer with <code>pandas</code>, attach it to all functions. Update the layer once, and all functions get the update.</p>
<p>Each layer version is immutable and identified by a unique ARN.</p>
<h2><strong>Event Source Mappings (ESMs)</strong></h2>
<p>An ESM is a Lambda resource that <strong>polls</strong> stream/queue-based services and invokes your function with batches of records.</p>
<p>Supported sources: SQS, Kinesis, DynamoDB Streams, MSK (Kafka), Amazon MQ, DocumentDB.</p>
<p><strong>Scenario</strong>: An SQS queue receives order events. An ESM polls the queue and invokes your Lambda with batches of 10 messages. You configure batch size, batching window, retry policies, and parallelization.</p>
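<p>A handler on the receiving end of that mapping just loops over <code>event["Records"]</code> (the standard SQS event shape; the <code>order_id</code> field is a hypothetical payload of ours, not part of the SQS format):</p>

```python
import json

def lambda_handler(event, context):
    # The ESM invokes us with a *batch* of SQS messages, not one at a time
    order_ids = []
    for record in event["Records"]:
        order = json.loads(record["body"])   # each message body is our JSON payload
        order_ids.append(order["order_id"])  # hypothetical field
    return {"processed": order_ids}
```

<p>Batch size here is the lever: a bigger batch means fewer invocations but more work lost if one fails, which is why retry policy and partial-batch settings go hand in hand with it.</p>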
<h2><strong>Capacity Providers (New — Lambda Managed Instances)</strong></h2>
<p>This is a new feature that lets Lambda functions run on EC2 instances managed by Lambda, combining serverless development experience with dedicated compute.</p>
<p>You create a capacity provider specifying VPC, subnets, security groups, IAM roles, and optionally instance types and scaling config. Functions using capacity providers get access to specialized EC2 instance types while Lambda still handles scaling and patching.</p>
<p><strong>Use case</strong>: Workloads needing GPU instances or specific hardware that standard Lambda doesn't offer.</p>
<h2><strong>Code Signing Configurations</strong></h2>
<p>Ensures deployment integrity — only code signed by approved developers/CI pipelines can be deployed to your functions.</p>
<h2><strong>Replicas</strong></h2>
<p>Lambda@Edge replicas — when you associate a Lambda function with CloudFront distributions, AWS replicates your function to edge locations globally for low-latency execution.</p>
<h2><strong>Container Image Functions (Deep Dive)</strong></h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/92429a58-3c74-4861-b979-a1e61d8183fa.png" alt="" style="display:block;margin:0 auto" />

<p>Instead of uploading a .zip file, you can package your Lambda function as a <strong>Docker container image</strong> (up to 10 GB uncompressed) stored in Amazon ECR.</p>
<h2><strong>Three Ways to Build Container Images</strong></h2>
<ol>
<li><p><strong>AWS base image</strong>: Pre-loaded with runtime + runtime interface client. Easiest approach.</p>
</li>
<li><p><strong>AWS OS-only base image</strong>: Amazon Linux with just the OS. You add your runtime. Used for Go, Rust, or custom runtimes.</p>
</li>
<li><p><strong>Non-AWS base image</strong>: Alpine, Debian, or any custom image. You must include a runtime interface client.</p>
</li>
</ol>
<h2><strong>Example Dockerfile (Python)</strong></h2>
<pre><code class="language-dockerfile">FROM public.ecr.aws/lambda/python:3.12
COPY requirements.txt .
RUN pip install -r requirements.txt 
COPY app.py . 
CMD ["app.handler"]
</code></pre>
<h2><strong>Deploy a Container Image Function</strong></h2>
<pre><code class="language-shell"># Build and push to ECR
docker build -t daily-report .
docker tag daily-report:latest 076829085184.dkr.ecr.us-east-1.amazonaws.com/daily-report:latest
docker push 076829085184.dkr.ecr.us-east-1.amazonaws.com/daily-report:latest

# Create Lambda function
aws lambda create-function \
  --function-name Daily-tenant-report \
  --package-type Image \
  --code ImageUri=076829085184.dkr.ecr.us-east-1.amazonaws.com/daily-report:latest \
  --role arn:aws:iam::076829085184:role/lambda-execution-role
</code></pre>
<h2><strong>When to Use Container Images vs .zip</strong></h2>
<ul>
<li><p><strong>Container images</strong>: Complex dependencies, large packages (ML models), existing Docker workflows, need for custom OS packages</p>
</li>
<li><p><strong>.zip archives</strong>: Simple functions, quick iterations, smaller codebases</p>
</li>
</ul>
<p><strong>Important</strong>: You cannot change deployment type after creation — a container image function stays container, a .zip stays .zip.</p>
<h2><strong>Function Lifecycle for Container Images</strong></h2>
<p>After uploading, Lambda optimizes the image (function is in <code>Pending</code> state). Once <code>Active</code>, it can receive invocations. If unused for weeks, it goes <code>Inactive</code> and requires re-optimization on next invocation.</p>
<h2><strong>Advanced: SAM Template with API Gateway + Lambda</strong></h2>
<h2>What is AWS SAM?</h2>
<p>AWS SAM (Serverless Application Model) is a tool that helps you define and deploy serverless applications using simple configuration files.</p>
<p>Instead of manually creating:</p>
<ul>
<li><p>Lambda functions</p>
</li>
<li><p>API Gateway</p>
</li>
<li><p>IAM roles</p>
</li>
<li><p>Event triggers</p>
</li>
</ul>
<p>You can define everything in one file and deploy it together.</p>
<p>Think of SAM as:</p>
<blockquote>
<p>“A simplified way to write CloudFormation specifically for serverless applications.”</p>
</blockquote>
<h2>Why use SAM?</h2>
<p>Without SAM:</p>
<ul>
<li><p>You manually create resources from AWS Console</p>
</li>
<li><p>Difficult to manage and replicate</p>
</li>
</ul>
<p>With SAM:</p>
<ul>
<li><p>Everything is written as code</p>
</li>
<li><p>Easy to version control</p>
</li>
<li><p>Easy to reuse across environments (dev, prod)</p>
</li>
</ul>
<h2>Step 1: Install SAM CLI</h2>
<p>SAM CLI is the tool used to build and deploy your application.</p>
<pre><code class="language-shell"># macOS
brew install aws-sam-cli

# Linux
pip install aws-sam-cli
</code></pre>
<hr />
<h2>Step 2: Initialize a SAM Project</h2>
<pre><code class="language-shell">sam init --runtime python3.12 --name daily-report-api
</code></pre>
<p>This creates a project structure like:</p>
<ul>
<li><p>template.yaml → Main configuration file</p>
</li>
<li><p>hello_world/ → Lambda code</p>
</li>
<li><p>tests/ → Unit tests</p>
</li>
<li><p>events/ → Sample test events</p>
</li>
</ul>
<hr />
<h2>Step 3: Understanding template.yaml</h2>
<p>This is the most important file.</p>
<p>It defines:</p>
<ul>
<li><p>Lambda functions</p>
</li>
<li><p>API endpoints</p>
</li>
<li><p>Database</p>
</li>
<li><p>Permissions</p>
</li>
</ul>
<hr />
<h2>Basic Structure</h2>
<pre><code class="language-yaml">AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
</code></pre>
<p>This tells AWS:</p>
<ul>
<li><p>This is a SAM template</p>
</li>
<li><p>Use serverless transformation</p>
</li>
</ul>
<hr />
<h2>Globals Section</h2>
<pre><code class="language-yaml">Globals:
  Function:
    Timeout: 30
    Runtime: python3.12
    Architectures:
      - arm64
</code></pre>
<p>This applies default settings to all Lambda functions.</p>
<p>Meaning:</p>
<ul>
<li><p>Every function will use Python 3.12</p>
</li>
<li><p>Timeout = 30 seconds</p>
</li>
<li><p>Architecture = arm64</p>
</li>
</ul>
<p>This avoids repeating configuration again and again.</p>
<hr />
<h2>Lambda Function Definition</h2>
<pre><code class="language-yaml">DailyTenantReportFunction:
  Type: AWS::Serverless::Function
</code></pre>
<p>This creates a Lambda function.</p>
<hr />
<h3>Key Properties Explained</h3>
<pre><code class="language-yaml">CodeUri: src/
Handler: app.lambda_handler
</code></pre>
<ul>
<li><p>CodeUri → Where your code is located</p>
</li>
<li><p>Handler → Entry point of your function</p>
</li>
</ul>
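<p>With <code>Handler: app.lambda_handler</code>, Lambda imports <code>app.py</code> from <code>src/</code> and calls its <code>lambda_handler</code> function. A minimal sketch:</p>

```python
import json

def lambda_handler(event, context):
    # 'event' carries the trigger payload; 'context' carries runtime metadata
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "ok"}),
    }
```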
<hr />
<pre><code class="language-yaml">MemorySize: 256
</code></pre>
<ul>
<li><p>Allocates memory to Lambda</p>
</li>
<li><p>More memory also means proportionally more CPU, so the function often runs faster</p>
</li>
</ul>
<hr />
<pre><code class="language-yaml">Environment:
  Variables:
    DB_TABLE: TenantReports
    STAGE: production
</code></pre>
<ul>
<li><p>Environment variables for configuration</p>
</li>
<li><p>Used inside your code</p>
</li>
</ul>
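<p>Inside the function, these variables are read from the environment. A sketch (the <code>load_config</code> helper is illustrative, not part of SAM):</p>

```python
import os

def load_config(env=None):
    # Defaults mirror the Environment.Variables block in template.yaml
    env = os.environ if env is None else env
    return {
        "table": env.get("DB_TABLE", "TenantReports"),
        "stage": env.get("STAGE", "dev"),
    }
```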
<hr />
<h2>Permissions (Policies)</h2>
<pre><code class="language-yaml">Policies:
  - DynamoDBReadPolicy:
      TableName: !Ref TenantReportsTable
</code></pre>
<p>This allows Lambda to:</p>
<ul>
<li>Read from DynamoDB table</li>
</ul>
<hr />
<pre><code class="language-yaml">- SESCrudPolicy:
    IdentityName: "reports@shipsy.io"
</code></pre>
<p>Allows Lambda to:</p>
<ul>
<li>Send emails using SES</li>
</ul>
<h2>Event Triggers (Very Important)</h2>
<p>This is where SAM becomes powerful.</p>
<hr />
<h3>API Gateway Integration</h3>
<pre><code class="language-yaml">Events:
  GetReport:
    Type: Api
    Properties:
      Path: /reports/{tenantId}
      Method: get
</code></pre>
<p>This means:</p>
<ul>
<li><p>Create API endpoint</p>
</li>
<li><p>When someone calls:</p>
<pre><code class="language-plaintext">/reports/{tenantId}
</code></pre>
</li>
<li><p>Lambda will run</p>
</li>
</ul>
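<p>Inside the handler, API Gateway's proxy integration delivers the path variable under <code>pathParameters</code>. A hedged sketch:</p>

```python
import json

def lambda_handler(event, context):
    # Proxy integration puts {tenantId} from the path into event["pathParameters"]
    tenant_id = (event.get("pathParameters") or {}).get("tenantId")
    if not tenant_id:
        return {"statusCode": 400, "body": json.dumps({"error": "tenantId is required"})}
    return {"statusCode": 200, "body": json.dumps({"tenantId": tenant_id})}
```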
<hr />
<h3>Another API Endpoint</h3>
<pre><code class="language-yaml">GenerateReport:
  Type: Api
  Properties:
    Path: /reports/generate
    Method: post
</code></pre>
<p>Now you have:</p>
<ul>
<li><p>GET API → fetch report</p>
</li>
<li><p>POST API → generate report</p>
</li>
</ul>
<hr />
<h3>Scheduled Trigger (Cron Job)</h3>
<pre><code class="language-yaml">DailySchedule:
  Type: Schedule
  Properties:
    Schedule: cron(0 6 * * ? *)
</code></pre>
<p>This runs Lambda:</p>
<ul>
<li>Every day at 6 AM UTC</li>
</ul>
<p>Under the hood, SAM creates an EventBridge rule for this schedule.</p>
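<p>If the same function serves both the API and the schedule, the handler can branch on the event shape: scheduled EventBridge events set <code>source</code> to <code>aws.events</code>, while API Gateway proxy events carry HTTP fields such as <code>httpMethod</code>. A sketch:</p>

```python
def lambda_handler(event, context):
    # EventBridge scheduled invocations set source to "aws.events";
    # API Gateway proxy events have httpMethod / pathParameters instead
    if event.get("source") == "aws.events":
        return {"trigger": "schedule"}
    return {"trigger": "api"}
```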
<hr />
<h2>DynamoDB Table</h2>
<pre><code class="language-yaml">TenantReportsTable:
  Type: AWS::DynamoDB::Table
</code></pre>
<p>This creates a database table.</p>
<pre><code class="language-yaml">KeySchema:
  - AttributeName: tenantId
    KeyType: HASH
  - AttributeName: reportDate
    KeyType: RANGE
</code></pre>
<ul>
<li><p>tenantId → Partition key</p>
</li>
<li><p>reportDate → Sort key</p>
</li>
</ul>
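<p>The effect of this key design can be illustrated with a plain in-memory model (a sketch only; DynamoDB does this server-side): items sharing a partition key live together and come back ordered by the sort key.</p>

```python
# In-memory model of a table keyed by (partition key, sort key)
table = {}

def put_item(tenant_id, report_date, data):
    table.setdefault(tenant_id, {})[report_date] = data

def query(tenant_id):
    # One partition's items, ordered by sort key (ISO dates sort correctly)
    items = table.get(tenant_id, {})
    return [items[d] for d in sorted(items)]
```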
<hr />
<pre><code class="language-yaml">BillingMode: PAY_PER_REQUEST
</code></pre>
<ul>
<li><p>No need to manage capacity</p>
</li>
<li><p>Pay only when used</p>
</li>
</ul>
<hr />
<h2>Outputs (Important)</h2>
<pre><code class="language-yaml">Outputs:
  ApiEndpoint:
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod"
</code></pre>
<p>After deployment, this gives:</p>
<ul>
<li>Your API URL</li>
</ul>
<hr />
<h2>Step 4: Build and Test Locally</h2>
<pre><code class="language-shell">sam build
</code></pre>
<ul>
<li>Prepares your application</li>
</ul>
<hr />
<pre><code class="language-shell">sam local invoke DailyTenantReportFunction --event events/test.json
</code></pre>
<ul>
<li>Runs Lambda locally</li>
</ul>
<hr />
<pre><code class="language-shell">sam local start-api
</code></pre>
<ul>
<li>Starts local API server</li>
</ul>
<p>Test using:</p>
<pre><code class="language-shell">curl http://localhost:3000/reports/tenant-123
</code></pre>
<hr />
<h2>Step 5: Deploy to AWS</h2>
<pre><code class="language-shell">sam deploy --guided
</code></pre>
<p>This will:</p>
<ul>
<li><p>Ask for configuration (region, stack name)</p>
</li>
<li><p>Upload code to S3</p>
</li>
<li><p>Create all resources</p>
</li>
</ul>
<hr />
<h2>After Deployment</h2>
<p>You will get an API like:</p>
<pre><code class="language-plaintext">https://abc123.execute-api.us-east-1.amazonaws.com/Prod/reports/{tenantId}
</code></pre>
<h2>Example API Calls</h2>
<pre><code class="language-shell">curl https://.../reports/tenant-456
</code></pre>
<p>Fetch report</p>
<hr />
<pre><code class="language-shell">curl -X POST https://.../reports/generate \
  -H "Content-Type: application/json" \
  -d '{"tenantId": "tenant-456"}'
</code></pre>
<p>Generate report</p>
<h2>Final Architecture</h2>
<pre><code class="language-plaintext">Client → API Gateway → Lambda → DynamoDB
                          ↑
            Scheduled Event (cron)
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Implementing Traefik on AWS EKS with Network Load Balancer (NLB): A Complete Guide]]></title><description><![CDATA[TL;DR: This blog walks you through deploying Traefik as an Ingress Controller on AWS EKS using an AWS Network Load Balancer (NLB), covering setup, configuration, known limitations, and best practices ]]></description><link>https://blog.devopswithpiyush.in/traefik-ingress-eks-nlb-guide</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/traefik-ingress-eks-nlb-guide</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Traefik]]></category><category><![CDATA[Devops]]></category><category><![CDATA[ingress]]></category><category><![CDATA[Cloud Computing]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Thu, 12 Mar 2026 07:07:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/2f3ea986-b82a-43d1-9a8b-8e90c71c7871.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR:</strong> This blog walks you through deploying Traefik as an Ingress Controller on AWS EKS using an AWS Network Load Balancer (NLB), covering setup, configuration, known limitations, and best practices — all in one place.</p>
</blockquote>
<hr />
<h2>What is Traefik and Why Use It on EKS?</h2>
<p>When you run multiple services inside a Kubernetes cluster, you need something to manage how external traffic reaches each service. That's where an <strong>Ingress Controller</strong> comes in.</p>
<p><strong>Traefik</strong> is a cloud-native, open-source Ingress Controller and reverse proxy that automatically discovers your services and routes traffic to them — no manual route updates needed.</p>
<blockquote>
<p>📖 Official Docs: <a href="https://doc.traefik.io/traefik/">What is Traefik?</a></p>
</blockquote>
<p>On <strong>AWS EKS (Elastic Kubernetes Service)</strong>, Traefik pairs naturally with an AWS <strong>Network Load Balancer (NLB)</strong> to handle high-throughput, low-latency traffic routing at Layer 4 (TCP/UDP).</p>
<p><strong>Why Traefik over the default AWS ALB Ingress Controller?</strong></p>
<ul>
<li><p>More feature-rich routing rules (path, headers, middlewares)</p>
</li>
<li><p>Built-in dashboard for monitoring</p>
</li>
<li><p>Automatic SSL via AWS ACM</p>
</li>
<li><p>Prometheus metrics out of the box</p>
</li>
<li><p>No separate ALB per service (cost-effective)</p>
</li>
</ul>
<hr />
<h2>Architecture Overview</h2>
<p>Here's how the traffic flows in this setup:</p>
<pre><code class="language-plaintext">Internet
↓
AWS Network Load Balancer (NLB)
↓
Traefik Ingress Controller (running on EKS pods)
↓
Your Kubernetes Services / Apps
</code></pre>
<p>The NLB acts as the entry point from the internet. It forwards all traffic to Traefik, which then applies routing rules to send requests to the right service inside the cluster.</p>
<hr />
<h2>Prerequisites</h2>
<p>Before you begin, make sure the following are in place:</p>
<table>
<thead>
<tr>
<th>Requirement</th>
<th>Details</th>
</tr>
</thead>
<tbody><tr>
<td>AWS EKS Cluster</td>
<td>A running and configured Kubernetes cluster on EKS</td>
</tr>
<tr>
<td><code>kubectl</code></td>
<td>Installed and connected to your EKS cluster</td>
</tr>
<tr>
<td><code>Helm</code></td>
<td>Version 3+ installed (<a href="https://helm.sh/docs/intro/install/">Install Helm</a>)</td>
</tr>
<tr>
<td>Traefik Helm Chart</td>
<td>Version 3 (&gt; 3.9.0)</td>
</tr>
<tr>
<td>AWS IAM Permissions</td>
<td>Permissions to create Load Balancers, ACM certificates, Security Groups</td>
</tr>
<tr>
<td>ACM Certificate</td>
<td>SSL certificate created in AWS Certificate Manager (ACM)</td>
</tr>
</tbody></table>
<blockquote>
<p>📖 Traefik Installation Guide: <a href="https://doc.traefik.io/traefik/getting-started/install-traefik/">https://doc.traefik.io/traefik/getting-started/install-traefik/</a></p>
</blockquote>
<hr />
<h2>Step 1: Add Traefik Helm Repository</h2>
<pre><code class="language-bash">helm repo add traefik https://helm.traefik.io/traefik
helm repo update
</code></pre>
<p>This adds the official Traefik Helm chart repository to your local Helm setup.</p>
<blockquote>
<p><strong>📖 Reference:</strong> <a href="https://doc.traefik.io/traefik/getting-started/install-traefik/#use-the-helm-chart"><strong>Traefik Helm Chart Docs</strong></a></p>
</blockquote>
<hr />
<h2><strong>Step 2: Create the</strong> <code>custom-values.yaml</code> <strong>Configuration</strong></h2>
<p>Create a file named <code>custom-values.yaml</code> with the following configuration. Each section is explained below.</p>
<pre><code class="language-yaml">ingressClass:
  enabled: true
  isDefaultClass: true

providers:
  kubernetesCRD:
    enabled: true
    namespaces:
      - traefik-app-server
  kubernetesIngress:
    enabled: true
    namespaces:
      - traefik-app-server
      - default

ingressRoute:
  dashboard:
    enabled: true
    matchRule: Host(`traefik-dashboard.yourdomain.com`) &amp;&amp; (PathPrefix(`/dashboard`) || PathPrefix(`/api`))
    services:
      - name: api@internal
        kind: TraefikService
    entryPoints: ["web"]
    middlewares:
      - name: auth
        namespace: traefik-app-server

service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "&lt;your-acm-cert-arn&gt;"
    service.beta.kubernetes.io/aws-load-balancer-access-log-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-subnets: &lt;your-subnet-ids&gt;
    service.beta.kubernetes.io/aws-load-balancer-security-groups: &lt;your-sg-ids&gt;
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true

ports:
  web:
    port: 8000
    exposedPort: 443
  websecure:
    port: 8443
    exposedPort: 443
  traefik:
    port: 8080
    exposedPort: 8080

globalArguments:
  - "--api.insecure=true"
  - "--serversTransport.insecureSkipVerify=true"

externalTrafficPolicy: Cluster

logs:
  general:
    format: json
    level: "INFO"
    noColor: true
  access:
    enabled: true
    format: json
    bufferingSize: 100
    filters:
      statuscodes: "200-299"
    addInternals: false

metrics:
  prometheus:
    entryPoint: metrics
    addRoutersLabels: true
    addServicesLabels: true
    buckets: "0.1,0.3,1.2,5.0"
</code></pre>
<h2><strong>What Each Section Does</strong></h2>
<p><code>ingressClass</code> — Makes Traefik the default ingress controller in your cluster.</p>
<p><code>providers</code> — Tells Traefik to watch for both <code>IngressRoute</code> (CRD) and standard <code>Ingress</code> resources in specified namespaces.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/providers/kubernetes-crd/"><strong>Traefik Kubernetes Providers</strong></a></p>
</blockquote>
<p><code>ingressRoute.dashboard</code> — Configures the Traefik dashboard with a specific hostname and path, protected by auth middleware.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/operations/dashboard/"><strong>Traefik Dashboard Docs</strong></a></p>
</blockquote>
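<p>Note that the <code>auth</code> middleware referenced above is not created by the chart; you define it yourself. A minimal sketch using BasicAuth (the Secret name is illustrative and must contain htpasswd-format users):</p>

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: auth
  namespace: traefik-app-server
spec:
  basicAuth:
    # Kubernetes Secret holding htpasswd-formatted "users" data
    secret: traefik-dashboard-auth
```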
<p><code>service annotations</code> — These AWS-specific annotations automatically trigger the creation of an NLB when Traefik is deployed. Key ones:</p>
<ul>
<li><p><code>aws-load-balancer-type: nlb</code> → Use NLB instead of CLB</p>
</li>
<li><p><code>aws-load-balancer-scheme: internet-facing</code> → Public internet accessible</p>
</li>
<li><p><code>aws-load-balancer-ssl-cert</code> → Attach your ACM certificate for HTTPS</p>
</li>
<li><p><code>preserve_client_ip.enabled=true</code> → Preserve the real client IP</p>
</li>
</ul>
<p><code>ports</code> — Maps Traefik's internal ports to external exposed ports.</p>
<p><code>logs</code> — Enables JSON-formatted access logs, filtering only successful (2xx) HTTP responses.</p>
<p><code>metrics</code> — Enables Prometheus scraping for monitoring Traefik performance.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/observability/metrics/prometheus/"><strong>Traefik Metrics with Prometheus</strong></a></p>
</blockquote>
<h2><strong>Step 3: Install Traefik Using Helm</strong></h2>
<pre><code class="language-shell">helm install traefik traefik/traefik \
  --namespace traefik-app-server \
  --create-namespace \
  -f custom-values.yaml
</code></pre>
<p>Verify the pods are running:</p>
<pre><code class="language-shell">kubectl get pods -n traefik-app-server
</code></pre>
<hr />
<h2><strong>Step 4: Verify the AWS NLB is Created</strong></h2>
<p>After installation, AWS automatically provisions an NLB based on the annotations in <code>custom-values.yaml</code>. Verify by:</p>
<ol>
<li><p>Going to <strong>AWS Console → EC2 → Load Balancers</strong> and look for the new NLB</p>
</li>
<li><p>Or run:</p>
</li>
</ol>
<pre><code class="language-shell">kubectl get svc -n traefik-app-server traefik
</code></pre>
<p>You should see an external hostname (the NLB DNS name) in the <code>EXTERNAL-IP</code> column.</p>
<h2><strong>Step 5: Configure IngressRoute for Your Services</strong></h2>
<p>Create an <code>IngressRoute</code> resource to route traffic to your apps:</p>
<pre><code class="language-yaml">apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: my-app-route
  namespace: traefik-app-server
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`myapp.yourdomain.com`)
      kind: Rule
      services:
        - name: my-app-service
          port: 80
</code></pre>
<p>Apply it:</p>
<pre><code class="language-shell">kubectl apply -f my-app-ingressroute.yaml
</code></pre>
<h2><strong>Step 6: Enable Metrics and Logging</strong></h2>
<p>Prometheus metrics are already enabled in <code>custom-values.yaml</code>.<br />Verify Traefik is exposing metrics:</p>
<pre><code class="language-shell">kubectl port-forward svc/traefik 8080:8080 -n traefik-app-server
</code></pre>
<p>Then visit <code>http://localhost:8080/metrics</code> in your browser.</p>
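<p>The <code>/metrics</code> endpoint returns plain-text Prometheus exposition format, with lines like <code>traefik_service_requests_total{code="200"} 42</code>. A small sketch of how such a line breaks down (simplified; real label values may contain escaped commas or quotes):</p>

```python
import re

def parse_metric(line):
    # name{label="value",...} sample  ->  (name, labels dict, float sample)
    m = re.match(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$', line)
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    labels = {}
    if raw_labels:
        for pair in raw_labels.split(","):  # simplified: no commas inside values
            key, val = pair.split("=", 1)
            labels[key] = val.strip('"')
    return name, labels, value
```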
<p>Check Traefik logs:</p>
<pre><code class="language-shell">kubectl logs &lt;traefik-pod-name&gt; -n traefik-app-server
</code></pre>
<hr />
<h2><strong>⚠️ Known Issue: Nested NLB + AWS Global Accelerator and Client IP Preservation</strong></h2>
<p>This is a <strong>critical limitation</strong> you must be aware of before designing your architecture.</p>
<h2><strong>What is the Scenario?</strong></h2>
<p>Many production teams try to achieve two goals simultaneously:</p>
<ol>
<li><p><strong>Use AWS Global Accelerator</strong> — to reduce latency globally by routing traffic through AWS's private backbone network</p>
</li>
<li><p><strong>Preserve the original Client IP</strong> — so that their apps can use the real user IP for security rules, geo-blocking, rate limiting, and analytics</p>
</li>
</ol>
<p>A natural architecture that seems to solve both is a <strong>Nested NLB setup</strong>:</p>
<pre><code class="language-plaintext">Internet
   ↓
AWS Global Accelerator
   ↓
NLB #1 (TCP Listeners) ← Global Accelerator Endpoint
   ↓
NLB #2 (TLS Listeners) ← Target Group of NLB #1
   ↓
Traefik on EKS
</code></pre>
<p>The idea here is:</p>
<ul>
<li><p><strong>NLB #1</strong> handles Global Accelerator traffic and forwards it to NLB #2</p>
</li>
<li><p><strong>NLB #2</strong> handles TLS termination and forwards to Traefik</p>
</li>
<li><p>This way, you get global acceleration AND SSL handling via ACM</p>
</li>
</ul>
<p><strong>Sounds logical, right? But it doesn't work.</strong></p>
<h2><strong>Why Does the Issue Arise?</strong></h2>
<p>When NLB #1 tries to route traffic to <strong>NLB #2's ENI (Elastic Network Interface)</strong> as a target, AWS blocks Client IP preservation. This is because:</p>
<blockquote>
<p><strong>AWS explicitly does not support Client IP preservation when a target<br />group contains the ENI of another Network Load Balancer or AWS PrivateLink ENIs.</strong></p>
</blockquote>
<blockquote>
<p><strong>📖 Official AWS Reference:</strong><br /><a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation"><strong>NLB Target Groups — Client IP Preservation</strong></a></p>
</blockquote>
<p>In simple words — when NLB #1 forwards packets to NLB #2, the source IP (original client IP) gets <strong>replaced with NLB #1's IP</strong>. By the time the request reaches Traefik and your app, you see the NLB's IP, not the user's real IP.</p>
<h2><strong>What is the Real-World Impact?</strong></h2>
<p>This limitation breaks several things your application may depend on:</p>
<table>
<thead>
<tr>
<th>Feature Affected</th>
<th>Why it Breaks</th>
</tr>
</thead>
<tbody><tr>
<td>IP-based rate limiting</td>
<td>You rate-limit the NLB IP, not the real user</td>
</tr>
<tr>
<td>Geo-blocking / GeoIP rules</td>
<td>NLB's IP is from AWS datacenter, not user's country</td>
</tr>
<tr>
<td>Security rules / WAF</td>
<td>Cannot block/allow specific client IPs</td>
</tr>
<tr>
<td>Analytics &amp; Traffic analysis</td>
<td>All traffic appears to come from one IP</td>
</tr>
<tr>
<td>Audit logs</td>
<td>No real user IP in logs for compliance</td>
</tr>
</tbody></table>
<h2><strong>The Root Cause (Technical)</strong></h2>
<p>In a standard NLB setup with Client IP Preservation enabled, the NLB simply <strong>forwards the TCP packet as-is</strong> to the target, preserving the source IP in the packet header. The target (Traefik pod) sees the real client IP directly.</p>
<p>But when NLB #2 is itself a target inside NLB #1's target group, NLB #1 needs to rewrite the destination IP of the packet to point to NLB #2's ENI. In this rewrite process, AWS's networking layer <strong>cannot maintain both the source IP and perform the destination rewrite simultaneously</strong> for chained NLBs.</p>
<hr />
<h2><strong>Troubleshooting Common Issues</strong></h2>
<table>
<thead>
<tr>
<th>Issue</th>
<th>Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>NLB not created</td>
<td>Wrong or missing service annotations</td>
<td>Re-check <code>custom-values.yaml</code> service annotations, especially subnet IDs and security group IDs</td>
</tr>
<tr>
<td>Dashboard not accessible</td>
<td>DNS or IngressRoute misconfiguration</td>
<td>Verify DNS resolves to NLB, check <code>matchRule</code> in <code>ingressRoute</code> config</td>
</tr>
<tr>
<td>SSL not working</td>
<td>Wrong ACM certificate ARN</td>
<td>Verify the ARN in <code>aws-load-balancer-ssl-cert</code> annotation matches your ACM cert</td>
</tr>
<tr>
<td>Client IP showing as NLB IP</td>
<td>Client IP Preservation disabled or nested NLB issue</td>
<td>Enable <code>preserve_client_ip.enabled=true</code> in target group attributes; avoid nested NLB setup</td>
</tr>
</tbody></table>
<h2><strong>Best Practices</strong></h2>
<ul>
<li><p><strong>Always protect the Traefik dashboard</strong> with authentication middleware; an unprotected dashboard exposes your entire routing configuration.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/middlewares/http/basicauth/"><strong>Traefik Middlewares — BasicAuth</strong></a></p>
</blockquote>
</li>
<li><p><strong>Use AWS ACM for SSL certificates</strong> instead of managing certs manually — ACM handles renewals automatically.</p>
</li>
<li><p><strong>Enable Prometheus metrics</strong> and connect to Grafana for a complete observability setup.</p>
<blockquote>
<p><strong>📖</strong> <a href="https://doc.traefik.io/traefik/observability/metrics/prometheus/"><strong>Traefik + Grafana Dashboard</strong></a></p>
</blockquote>
</li>
<li><p><strong>For internal-only services</strong>, change the NLB scheme to <code>internal</code>:</p>
<pre><code class="language-plaintext">service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
</code></pre>
</li>
<li><p><strong>Automate Helm deployments</strong> via CI/CD pipelines (GitHub Actions, ArgoCD) for consistent and repeatable deployments.</p>
</li>
<li><p><strong>Avoid Nested NLB setups</strong> if Client IP preservation is critical for your application — use a single NLB with Proxy Protocol v2 instead.</p>
</li>
</ul>
<h2><strong>Conclusion</strong></h2>
<p>Traefik on AWS EKS with NLB is a powerful, production-ready setup that gives<br />you fine-grained traffic control, automatic service discovery, SSL management,<br />and rich observability. However, when designing for advanced scenarios like<br />Global Accelerator with Client IP preservation, be aware of AWS's nested NLB<br />limitations and plan your architecture accordingly.</p>
<h2><strong>References</strong></h2>
<ul>
<li><p><a href="https://doc.traefik.io/traefik/"><strong>Traefik Official Documentation</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/getting-started/install-traefik/"><strong>Traefik Installation Guide</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/providers/kubernetes-crd/"><strong>Traefik Kubernetes CRD Provider</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/routing/providers/kubernetes-crd/"><strong>Traefik IngressRoute Reference</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/routing/entrypoints/#proxyprotocol"><strong>Traefik Proxy Protocol</strong></a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation"><strong>AWS NLB Target Groups — Client IP Preservation</strong></a></p>
</li>
<li><p><a href="https://aws.amazon.com/about-aws/whats-new/2023/08/aws-global-accelerator-client-ip-address-preservation-network-load-balancer-endpoints/"><strong>AWS Global Accelerator + NLB Client IP Preservation</strong></a></p>
</li>
<li><p><a href="https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html"><strong>AWS EKS Network Load Balancing</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/observability/metrics/prometheus/"><strong>Traefik Prometheus Metrics</strong></a></p>
</li>
<li><p><a href="https://doc.traefik.io/traefik/operations/dashboard/"><strong>Traefik Dashboard Docs</strong></a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Mastering Kubernetes Cluster Autoscaler on Amazon EKS: A Complete Guide]]></title><description><![CDATA[🚀 TL;DR: If your pods are stuck in Pending state because there aren't enough nodes — Cluster Autoscaler (CA) is your answer. This guide walks you through everything from IAM setup to a full productio]]></description><link>https://blog.devopswithpiyush.in/kubernetes-cluster-autoscaler-eks-guide</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/kubernetes-cluster-autoscaler-eks-guide</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[AWS]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Devops]]></category><category><![CDATA[clusterautoscaler]]></category><category><![CDATA[Cloud Computing]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Thu, 12 Mar 2026 06:36:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/1022bb7f-51fb-44dd-a333-b09bbef798bb.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<blockquote>
<p>🚀 <strong>TL;DR:</strong> If your pods are stuck in <code>Pending</code> state because there aren't enough nodes — <strong>Cluster Autoscaler (CA)</strong> is your answer. This guide walks you through everything from IAM setup to a full production deployment on Amazon EKS.</p>
</blockquote>
<hr />
<h2>👋 Who Is This For?</h2>
<table>
<thead>
<tr>
<th>Level</th>
<th>What You'll Get</th>
</tr>
</thead>
<tbody><tr>
<td>🟢 Beginner</td>
<td>Understand what CA is and why you need it</td>
</tr>
<tr>
<td>🟡 Intermediate</td>
<td>Full step-by-step installation on EKS</td>
</tr>
<tr>
<td>🔴 Advanced</td>
<td>Multi-node group strategies, expander policies, best practices</td>
</tr>
</tbody></table>
<hr />
<h2>🤔 The Problem — Why Does Autoscaling Even Matter?</h2>
<p>Imagine your application is running fine on Amazon EKS with 3 nodes. Suddenly, a <strong>traffic surge hits</strong> — a flash sale, a major client onboarding, or a viral event. Your Kubernetes Deployment tries to spin up 10 more pods — but there's <strong>no room</strong> on existing nodes. Those pods sit in <code>Pending</code> state, requests time out, and your users see errors.</p>
<p>You <em>could</em> manually add nodes — but who's watching at 2 AM on a Sunday?</p>
<p>This is exactly where <strong>Cluster Autoscaler (CA)</strong> steps in. It watches for <code>Pending</code> pods and automatically <strong>scales your EC2 node count up or down</strong> via AWS Auto Scaling Groups — no human intervention needed.</p>
<hr />
<h2>🧠 Section 1: What Is Cluster Autoscaler? (Beginner)</h2>
<p>Cluster Autoscaler is an <strong>open-source Kubernetes component</strong> that runs as a <code>Deployment</code> inside your cluster (in the <code>kube-system</code> namespace). It does two things:</p>
<ul>
<li><p><strong>Scale Up 📈</strong> — When pods are unschedulable (Pending), CA adds new EC2 nodes</p>
</li>
<li><p><strong>Scale Down 📉</strong> — When nodes are underutilized, CA safely drains and removes them</p>
</li>
</ul>
<h3>How It Works (Every 10 Seconds)</h3>
<ol>
<li><p>Are there any Pending pods?<br />YES → Find a Node Group that can fit them → Tell ASG to increase capacity</p>
</li>
<li><p>Are any nodes underutilized (&lt; 50% by default)?<br />YES → Can all pods fit elsewhere? → Drain node → Terminate EC2 instance</p>
</li>
</ol>
<blockquote>
<p>💡 <strong>Key Insight:</strong> CA doesn't look at CPU/Memory <em>usage</em>. It looks at <strong>resource REQUESTS</strong> defined in your pod spec. <strong>Always set</strong> <code>resources.requests</code> or CA won't scale!</p>
</blockquote>
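<p>For example, a pod template should declare requests like this (values are illustrative):</p>

```yaml
containers:
  - name: web
    image: nginx:1.27        # illustrative image
    resources:
      requests:              # what CA sums when sizing node capacity
        cpu: "250m"
        memory: "512Mi"
```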
<h3>CA vs HPA vs Karpenter</h3>
<table>
<thead>
<tr>
<th>Tool</th>
<th>What it Scales</th>
<th>How</th>
</tr>
</thead>
<tbody><tr>
<td><strong>HPA</strong></td>
<td>Pod replicas</td>
<td>Based on CPU/memory metrics</td>
</tr>
<tr>
<td><strong>CA</strong></td>
<td>EC2 Nodes</td>
<td>Based on pending pods + AWS ASG</td>
</tr>
<tr>
<td><strong>Karpenter</strong></td>
<td>EC2 Nodes</td>
<td>Dynamic, just-in-time, more flexible</td>
</tr>
</tbody></table>
<p>Think of it this way: <strong>HPA scales your app. CA scales your infrastructure.</strong></p>
<hr />
<h2>🏗️ Section 2: Architecture Overview</h2>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/bda0e453-29cd-4097-8d50-182913712a5a.png" alt="" style="display:block;margin:0 auto" />

<p>The <strong>OIDC + IRSA</strong> bridge is the key — it lets the CA pod (inside Kubernetes) make authenticated AWS API calls without storing any long-lived credentials.</p>
<hr />
<h2>🛠️ Section 3: Full Installation Guide (Intermediate)</h2>
<h3>Prerequisites Checklist</h3>
<p>Before you begin, make sure you have:</p>
<ul>
<li><p>✅ An active <strong>Amazon EKS Cluster</strong> (v1.24+)</p>
</li>
<li><p>✅ <code>kubectl</code> configured and pointing to your cluster</p>
</li>
<li><p>✅ <code>eksctl</code> installed (v0.160+)</p>
</li>
<li><p>✅ <code>aws cli</code> v2 configured with admin permissions</p>
</li>
<li><p>✅ Node Groups created with <strong>ASG enabled</strong> (<code>--asg-access</code> flag)</p>
</li>
</ul>
<hr />
<h3>Step 1: Enable IAM OIDC Provider</h3>
<p>OIDC is an identity bridge — it lets Kubernetes ServiceAccounts assume AWS IAM Roles, so your CA pod can call AWS APIs securely without hardcoding credentials.</p>
<pre><code class="language-bash">export CLUSTER_NAME=&lt;your-cluster-name&gt;
export AWS_REGION=ap-south-1   # Change to your region

# Enable OIDC for your cluster
eksctl utils associate-iam-oidc-provider \
  --region $AWS_REGION \
  --cluster $CLUSTER_NAME \
  --approve

# Verify
aws eks describe-cluster --name $CLUSTER_NAME \
  --query "cluster.identity.oidc.issuer" --output text
</code></pre>
<h3>Step 2: <strong>Create IAM Policy</strong></h3>
<p>Save the following as <code>iam-policy.json</code>. This policy defines exactly what CA is allowed to do in AWS — describe ASGs, set desired capacity, and terminate instances.</p>
<pre><code class="language-json">{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
</code></pre>
<pre><code class="language-shell">aws iam create-policy \
  --policy-name AmazonEKSClusterAutoscalerPolicy \
  --policy-document file://iam-policy.json
</code></pre>
<p>Note down the <strong>Policy ARN</strong> from the output — you'll need it in the next step.</p>
<h3>Step 3: <strong>Create IAM Role + Kubernetes ServiceAccount (IRSA)</strong></h3>
<p>IRSA (IAM Roles for Service Accounts) annotates a Kubernetes ServiceAccount with an IAM Role ARN, so only the CA pod gets AWS permissions — nothing else.</p>
<pre><code class="language-shell">export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

eksctl create iamserviceaccount \
  --cluster=$CLUSTER_NAME \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AmazonEKSClusterAutoscalerPolicy \
  --override-existing-serviceaccounts \
  --approve
</code></pre>
<p>If you prefer to apply the ServiceAccount manually, save this as <code>cluster-autoscaler-sa.yaml</code> and replace the role ARN:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::&lt;YOUR-ACCOUNT-ID&gt;:role/&lt;YOUR-IAM-ROLE-NAME&gt;
</code></pre>
<pre><code class="language-shell">kubectl apply -f cluster-autoscaler-sa.yaml
</code></pre>
<h2>Step 4: <strong>Apply RBAC — ClusterRole, Role, and Bindings</strong></h2>
<p>Save the following as <code>cluster-autoscaler-rbac.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["events", "endpoints"]
  verbs: ["create", "patch"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods/status"]
  verbs: ["update"]
- apiGroups: [""]
  resources: ["endpoints"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["get", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
  resources: ["namespaces", "pods", "services", "replicationcontrollers", "persistentvolumeclaims", "persistentvolumes"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
  resources: ["replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["watch", "list"]
- apiGroups: ["apps"]
  resources: ["statefulsets", "replicasets", "daemonsets"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
  verbs: ["watch", "list", "get"]
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
  resourceNames: ["cluster-autoscaler"]
  resources: ["leases"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "list", "watch"]
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
  verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
- kind: ServiceAccount
  name: cluster-autoscaler
  namespace: kube-system
</code></pre>
<pre><code class="language-shell">kubectl apply -f cluster-autoscaler-rbac.yaml
</code></pre>
<h2><strong>Step 5: Deploy Cluster Autoscaler</strong></h2>
<p>Save the following as <code>cluster-autoscaler-deployment.yaml</code>.</p>
<blockquote>
<p><strong>⚠️ Replace</strong> <code>&lt;YOUR-CLUSTER-NAME&gt;</code> <strong>on the</strong> <code>--node-group-auto-discovery</code> <strong>line.<br />⚠️ Match the image version (</strong><code>v1.27.3</code> <strong>in the example) to your EKS cluster version, e.g., EKS 1.30 → use</strong> <code>v1.30.x</code></p>
</blockquote>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: cluster-autoscaler
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ALL]
            readOnlyRootFilesystem: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"
</code></pre>
<pre><code class="language-shell">kubectl apply -f cluster-autoscaler-deployment.yaml
</code></pre>
<h2><strong>Step 6: Tag Your ASG Node Groups</strong></h2>
<p>CA uses <strong>tags</strong> to discover which Auto Scaling Groups it should manage. Add these two tags to your Node Group's ASG in AWS Console or CLI:</p>
<table>
<thead>
<tr>
<th>Tag Key</th>
<th>Tag Value</th>
</tr>
</thead>
<tbody><tr>
<td><code>k8s.io/cluster-autoscaler/enabled</code></td>
<td><code>true</code></td>
</tr>
<tr>
<td><code>k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;</code></td>
<td><code>owned</code></td>
</tr>
</tbody></table>
<pre><code class="language-shell">aws autoscaling create-or-update-tags \
  --tags \
  "ResourceId=&lt;YOUR-ASG-NAME&gt;,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=&lt;YOUR-ASG-NAME&gt;,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;,Value=owned,PropagateAtLaunch=true"
</code></pre>
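<p>A tag mismatch here is the most common reason CA discovers no node groups at all, so it's worth double-checking that the tags actually exist on the ASG. Something like this should work (placeholder ASG name):</p>
<pre><code class="language-shell"># List cluster-autoscaler tags on the ASG
aws autoscaling describe-tags \
  --filters "Name=auto-scaling-group,Values=&lt;YOUR-ASG-NAME&gt;" \
  --query "Tags[?contains(Key, 'cluster-autoscaler')].[Key,Value]" \
  --output table
</code></pre>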
<h2><strong>Step 7: Verify Everything Is Working</strong></h2>
<pre><code class="language-shell"># Check the pod is Running
kubectl get pods -n kube-system | grep cluster-autoscaler

# Watch live logs
kubectl logs -f deployment/cluster-autoscaler -n kube-system
</code></pre>
<h2><strong>🗂️ Section 4: Node Group Strategies — The "NodePool" Equivalent (Advanced)</strong></h2>
<p>Unlike Karpenter (which has <code>NodePool</code> and <code>EC2NodeClass</code> CRDs), CA works with <strong>pre-defined EKS Node Groups (ASGs)</strong>.</p>
<table>
<thead>
<tr>
<th>Karpenter Concept</th>
<th>CA Equivalent</th>
</tr>
</thead>
<tbody><tr>
<td><code>EC2NodeClass</code></td>
<td>Launch Template</td>
</tr>
<tr>
<td><code>NodePool</code></td>
<td>EKS Managed Node Group (ASG)</td>
</tr>
<tr>
<td><code>NodePool limits</code></td>
<td>ASG Min/Max size</td>
</tr>
<tr>
<td><code>NodePool labels/taints</code></td>
<td>Node Group labels &amp; taints</td>
</tr>
</tbody></table>
<p>Here's a production-ready multi-node-group config using <code>eksctl</code>. Save as <code>production-nodegroups.yaml</code>:</p>
<pre><code class="language-yaml">apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: &lt;YOUR-CLUSTER-NAME&gt;
  region: ap-south-1

managedNodeGroups:

  # Pool 1: General Purpose (always-on baseline)
  - name: general-ng
    instanceType: m5.xlarge
    minSize: 2
    maxSize: 10
    desiredCapacity: 2
    labels:
      workload: general
      lifecycle: on-demand
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;: "owned"
    iam:
      withAddonPolicies:
        autoScaler: true

  # Pool 2: High Memory (scale from zero for data workloads)
  - name: highmem-ng
    instanceType: r5.2xlarge
    minSize: 0
    maxSize: 5
    desiredCapacity: 0
    labels:
      workload: high-memory
    taints:
      - key: dedicated
        value: high-memory
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;: "owned"
      k8s.io/cluster-autoscaler/node-template/label/workload: "high-memory"
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: "high-memory:NoSchedule"

  # Pool 3: Spot Instances (cost savings for batch/non-critical)
  - name: spot-ng
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
    spot: true
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    labels:
      lifecycle: spot
      workload: batch
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/&lt;YOUR-CLUSTER-NAME&gt;: "owned"
</code></pre>
<pre><code class="language-shell">eksctl create nodegroup -f production-nodegroups.yaml
</code></pre>
<h2><strong>Scheduling Pods to Specific Node Groups</strong></h2>
<pre><code class="language-yaml"># Example: Schedule a high-memory pod to the highmem-ng pool
spec:
  nodeSelector:
    workload: high-memory
  tolerations:
    - key: dedicated
      value: high-memory
      effect: NoSchedule
  containers:
    - name: app
      image: your-image:latest
      resources:
        requests:           # REQUIRED for CA to work!
          cpu: "2"
          memory: "8Gi"
</code></pre>
<h2><strong>⚙️ Section 5: Expander Strategies</strong></h2>
<p>When multiple node groups can accommodate a pending pod, CA uses an <strong>Expander</strong> to decide which one to pick:</p>
<table>
<thead>
<tr>
<th>Expander</th>
<th>Behavior</th>
<th>Best For</th>
</tr>
</thead>
<tbody><tr>
<td><code>least-waste</code></td>
<td>Picks group with least wasted resources after scaling</td>
<td><strong>Recommended</strong></td>
</tr>
<tr>
<td><code>random</code></td>
<td>Picks randomly</td>
<td>Testing only</td>
</tr>
<tr>
<td><code>most-pods</code></td>
<td>Picks group that schedules the most pods</td>
<td>High-density</td>
</tr>
<tr>
<td><code>priority</code></td>
<td>You assign priority order to node groups</td>
<td>Fine-grained control</td>
</tr>
<tr>
<td><code>price</code></td>
<td>Prefers cheapest node type</td>
<td>Cost-sensitive</td>
</tr>
</tbody></table>
<p>Set it in your deployment:</p>
<pre><code class="language-shell">- --expander=least-waste
</code></pre>
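<p>If you opt for the <code>priority</code> expander instead, CA reads its ordering from a ConfigMap named <code>cluster-autoscaler-priority-expander</code> in <code>kube-system</code> (the Role from Step 4 already grants access to it). A minimal sketch — higher numbers win, and the node-group regex patterns here are illustrative:</p>
<pre><code class="language-shell">kubectl apply -f - &lt;&lt;'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    50:
      - .*spot-ng.*
    10:
      - .*general-ng.*
EOF
</code></pre>
<p>With this config, Spot capacity (priority 50) is tried before the on-demand pool (priority 10).</p>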
<h2><strong>Section 6: Test Your Setup</strong></h2>
<pre><code class="language-shell"># Create a deployment that will trigger scale-up
kubectl create deployment inflate \
  --image=public.ecr.aws/eks-distro/kubernetes/pause:3.7 \
  --replicas=10

kubectl set resources deployment inflate \
  --requests=cpu=1,memory=1Gi

# Watch pods — some will go Pending, then get scheduled on new nodes
kubectl get pods -w

# Watch CA logs in real-time
kubectl logs -f deployment/cluster-autoscaler -n kube-system | grep -E "scale_up|ScaleUp"

# Watch new nodes join
kubectl get nodes -w

# Cleanup — triggers scale-down after ~10 minutes
kubectl delete deployment inflate
</code></pre>
<h2><strong>Section 7: Production Best Practices</strong></h2>
<ol>
<li><p><strong>Always set</strong> <code>resources.requests</code> — CA is blind without them; it won't scale if requests aren't defined</p>
</li>
<li><p><strong>Use</strong> <code>PodDisruptionBudgets (PDB)</code> — Protects critical pods during scale-down draining</p>
</li>
<li><p><strong>Pin CA version to EKS version</strong> — Use <code>v1.30.x</code> for EKS 1.30; version mismatch breaks scaling</p>
</li>
<li><p><strong>Use</strong> <code>--balance-similar-node-groups</code> — Spreads nodes evenly across AZs for high availability</p>
</li>
<li><p><strong>Add</strong> <code>safe-to-evict: "false"</code> <strong>on CA pod itself</strong> — Prevents it from being evicted during scale-down</p>
</li>
<li><p><strong>Don't mix instance families in one ASG</strong> — Keep node groups homogeneous for predictable scaling</p>
</li>
<li><p><strong>Monitor with Prometheus</strong> — CA exposes metrics on port <code>8085</code>; scrape and alert on scaling events</p>
</li>
</ol>
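<p>For practice 3, you can derive the image tag from the cluster's reported version instead of hardcoding it — a sketch, assuming a <code>.0</code> patch release exists for your minor version:</p>
<pre><code class="language-shell"># Derive a matching CA image tag from the EKS control-plane version.
# In a real cluster, set eks_version from:
#   aws eks describe-cluster --name "$CLUSTER_NAME" --query cluster.version --output text
eks_version="1.30"
ca_image="registry.k8s.io/autoscaling/cluster-autoscaler:v${eks_version}.0"
echo "$ca_image"
</code></pre>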
<h2><strong>Section 8: Troubleshooting</strong></h2>
<table>
<thead>
<tr>
<th>Issue</th>
<th>Likely Cause</th>
<th>Fix</th>
</tr>
</thead>
<tbody><tr>
<td>Pods stuck in Pending, no new nodes</td>
<td>ASG tags missing or wrong</td>
<td>Verify tags on your ASG match <code>--node-group-auto-discovery</code></td>
</tr>
<tr>
<td><code>Permission denied</code> errors in logs</td>
<td>IAM Role misconfigured</td>
<td>Check role trust relationship + OIDC annotation on ServiceAccount</td>
</tr>
<tr>
<td><code>CrashLoopBackOff</code> on CA pod</td>
<td>Wrong image version or bad command flags</td>
<td>Match image to EKS version; check <code>--node-group-auto-discovery</code> flag</td>
</tr>
<tr>
<td>Scale-down not happening</td>
<td><code>scale-down-unneeded-time</code> not elapsed or PDB blocking</td>
<td>Wait 10 min; check PodDisruptionBudgets</td>
</tr>
<tr>
<td>Scale-from-zero not working</td>
<td>Node group labels missing as ASG tags</td>
<td>Add <code>node-template/label/</code> and <code>node-template/taint/</code> tags to ASG</td>
</tr>
</tbody></table>
<pre><code class="language-shell"># Always start debugging here
kubectl logs -n kube-system deployment/cluster-autoscaler
</code></pre>
<h2><strong>CA vs Karpenter — Which One Should You Use in 2026?</strong></h2>
<table>
<thead>
<tr>
<th>Factor</th>
<th>Cluster Autoscaler</th>
<th>Karpenter</th>
</tr>
</thead>
<tbody><tr>
<td>Setup Complexity</td>
<td>Moderate</td>
<td>Higher</td>
</tr>
<tr>
<td>Scaling Speed</td>
<td>2–5 min</td>
<td>30–60 sec</td>
</tr>
<tr>
<td>Instance Flexibility</td>
<td>Fixed per ASG</td>
<td>Dynamic, any type</td>
</tr>
<tr>
<td>Cost Optimization</td>
<td>Good with Spot</td>
<td>Excellent (node consolidation)</td>
</tr>
<tr>
<td>EKS Auto Mode support</td>
<td>No</td>
<td>Yes (native)</td>
</tr>
<tr>
<td>Maturity &amp; Stability</td>
<td>⭐⭐⭐⭐⭐ Battle-tested</td>
<td>⭐⭐⭐⭐ Growing fast</td>
</tr>
</tbody></table>
<h2><strong>Wrapping Up</strong></h2>
<p>Cluster Autoscaler is the backbone of production Kubernetes infrastructure on AWS. Set it up correctly with proper Node Groups, IRSA, and resource requests — and it will silently keep your cluster right-sized, saving both cost and on-call headaches.</p>
<p><strong>Key Takeaways:</strong></p>
<ul>
<li><p>🔐 OIDC + IRSA = Secure, credential-free AWS authentication from Kubernetes</p>
</li>
<li><p>🗂️ Node Groups = Your pre-defined capacity pools (CA's version of Karpenter's NodePools)</p>
</li>
<li><p>📦 Always set <code>resources.requests</code> — CA depends on it entirely</p>
</li>
<li><p>⚖️ Use <code>least-waste</code> expander for cost efficiency</p>
</li>
<li><p>📊 Watch CA logs — they're incredibly detailed and tell you exactly what's happening</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Cross-Project Cloud SQL Migration Using Google Database Migration Service (DMS)]]></title><description><![CDATA[Migrating a Cloud SQL database from one Google Cloud project to another can be challenging—especially when you want minimal downtime and continuous replication (via Change Data Capture — CDC).
Google']]></description><link>https://blog.devopswithpiyush.in/gcp-cross-project-db-migration</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/gcp-cross-project-db-migration</guid><category><![CDATA[google cloud]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Databases]]></category><category><![CDATA[#dms]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[MySQL]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Wed, 11 Mar 2026 09:30:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/686ab37a-1501-4017-a322-3a0374cfeb8f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>Migrating a <strong>Cloud SQL</strong> database from one Google Cloud project to another can be challenging—especially when you want <strong>minimal downtime</strong> and <strong>continuous replication</strong> (via Change Data Capture — CDC).</p>
<p>Google's <strong>Database Migration Service (DMS)</strong> makes this straightforward, even over <strong>public IP</strong> connectivity (ideal when VPC peering or Shared VPC isn't feasible).</p>
<p>In this guide, I walk you through a real-world <strong>cross-project</strong> migration of a <strong>Cloud SQL for MySQL</strong> instance using <strong>public IP allowlist</strong> connectivity — <strong>continuous mode</strong> — from source project → destination project.</p>
<p>This method helped me consolidate databases, refactor environments, and improve project isolation/security/governance.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/bf61277b-4d17-4734-b04b-fa1d5b2f384f.png" alt="" style="display:block;margin:0 auto" />

<p><em>High-level flow of DMS continuous migration with public IP connectivity</em></p>
<h2>1. Introduction</h2>
<h3>Purpose</h3>
<p>This post provides a detailed, production-tested step-by-step guide to migrate a <strong>Cloud SQL</strong> instance between GCP projects using <strong>DMS</strong> over public IP. It covers prerequisites, IAM roles, connectivity setup, job configuration, testing, cutover (promotion), and verification.</p>
<h3>Target Audience</h3>
<ul>
<li><p>DevOps Engineers &amp; SREs</p>
</li>
<li><p>Cloud Infrastructure / Database Administrators</p>
</li>
<li><p>GCP Architects performing project consolidations or refactoring</p>
</li>
</ul>
<h2>2. Overview</h2>
<p><strong>Database Migration Service (DMS)</strong> is a fully managed GCP service for <strong>zero/minimal-downtime</strong> migrations to <strong>Cloud SQL</strong> (MySQL, PostgreSQL) and AlloyDB.</p>
<p><strong>Use cases for cross-project migration</strong>:</p>
<ul>
<li><p>Consolidating scattered databases into a central project</p>
</li>
<li><p>Refactoring legacy/multi-project environments</p>
</li>
<li><p>Enforcing better security &amp; governance through project boundaries</p>
</li>
</ul>
<p>We use <strong>continuous migration</strong> (full load + CDC) over <strong>public IP allowlist</strong> connectivity.</p>
<p><strong>Note</strong>: All DMS resources (connection profile, migration job, etc.) <strong>must reside in the same region</strong> as the destination Cloud SQL instance.</p>
<h2>3. Prerequisites</h2>
<h3>Tools &amp; Versions</h3>
<table>
<thead>
<tr>
<th>Tool / Technology</th>
<th>Requirement</th>
</tr>
</thead>
<tbody><tr>
<td>Google Cloud Platform</td>
<td>Active billing in <strong>both</strong> projects</td>
</tr>
<tr>
<td>Cloud SQL</td>
<td>Same engine &amp; version (e.g. MySQL 8.0.35+)</td>
</tr>
<tr>
<td>Database Migration Service</td>
<td>Enabled in the <strong>destination</strong> project</td>
</tr>
</tbody></table>
<h3>Required IAM Roles</h3>
<table>
<thead>
<tr>
<th>Role</th>
<th>Project</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td>Cloud SQL Admin (<code>roles/cloudsql.admin</code>)</td>
<td>Both</td>
<td>Manage Cloud SQL instances</td>
</tr>
<tr>
<td>Database Migration Admin (<code>roles/datamigration.admin</code>)</td>
<td>Destination</td>
<td>Create &amp; manage DMS jobs/profiles</td>
</tr>
<tr>
<td>Compute Network Admin (<code>roles/compute.networkAdmin</code>)</td>
<td>Destination</td>
<td>Manage authorized networks (allowlist)</td>
</tr>
</tbody></table>
<h2>4. Step-by-Step Migration Guide</h2>
<h3>Step 1: Get the Public IP of the Source Cloud SQL Instance</h3>
<ul>
<li><p>Go to <strong>SQL &gt; Instances</strong> in the <strong>source</strong> project</p>
</li>
<li><p>Open the instance → <strong>Overview</strong> tab</p>
</li>
<li><p>Copy the <strong>Public IP address</strong></p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/f63afd96-d9f7-4c46-8eb8-493198bdf48a.png" alt="" style="display:block;margin:0 auto" />
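<p>The same value can be pulled from the CLI — a sketch with placeholder instance and project names (the first <code>ipAddresses</code> entry is typically the primary/public address):</p>
<pre><code class="language-shell">gcloud sql instances describe SOURCE_INSTANCE_NAME \
  --project=SOURCE_PROJECT_ID \
  --format='value(ipAddresses[0].ipAddress)'
</code></pre>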

<h3>Step 2: Create a Connection Profile in the Destination Project</h3>
<ul>
<li><p>Navigate to <strong>Database Migration &gt; Connection profiles &gt; Create profile</strong></p>
</li>
<li><p>Settings:</p>
<ul>
<li><p><strong>Profile role</strong>: Source</p>
</li>
<li><p><strong>Database engine</strong>: MySQL (or PostgreSQL)</p>
</li>
<li><p><strong>Connection profile name/ID</strong>: e.g. <code>source-db-profile</code></p>
</li>
<li><p><strong>Hostname/IP</strong>: Paste source Cloud SQL <strong>public IP</strong></p>
</li>
<li><p><strong>Port</strong>: 3306 (MySQL) or 5432 (PostgreSQL)</p>
</li>
<li><p><strong>Username/Password</strong>: Source DB credentials (e.g. <code>root</code> user)</p>
</li>
<li><p><strong>Region</strong>: Must match destination Cloud SQL region</p>
</li>
</ul>
</li>
<li><p>Save</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/be2dcf92-f5dc-46d6-ada1-58ddbcd78fcf.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 3: Create the Migration Job</h3>
<ul>
<li><p>Go to <strong>Database Migration &gt; Migration jobs &gt; Create</strong></p>
</li>
<li><p>Fill basics:</p>
<ul>
<li><p><strong>Migration job name/ID</strong>: e.g. <code>cross-project-mig</code></p>
</li>
<li><p><strong>Source database engine</strong>: MySQL</p>
</li>
<li><p><strong>Destination region</strong>: (same as target instance)</p>
</li>
<li><p><strong>Migration job type</strong>: <strong>Continuous</strong> (enables CDC / real-time sync)</p>
</li>
</ul>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/5ff89511-ee94-4ee4-aa38-9b1bed17e837.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 4: Define Source Configuration</h3>
<ul>
<li><p>Select the connection profile created in Step 2</p>
</li>
<li><p><strong>Full dump configuration</strong>:</p>
<ul>
<li><p>Dump method: <strong>Logical</strong></p>
</li>
<li><p>Parallelism: <strong>Optimal</strong> or <strong>Max</strong> (for better performance)</p>
</li>
</ul>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/bd562775-df49-4674-8f00-75776ea36221.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 5: Define the Destination Cloud SQL Instance</h3>
<ul>
<li><p><strong>Option A</strong> — Existing instance: Select it (must match engine/version)</p>
</li>
<li><p><strong>Option B</strong> — New instance: Let DMS create it</p>
<ul>
<li><p>Match source engine &amp; version</p>
</li>
<li><p>Set root password</p>
</li>
<li><p>Choose adequate machine type &amp; storage (under-provisioning slows migration!)</p>
</li>
</ul>
</li>
</ul>
<p><strong>Important</strong>: This choice (existing vs new) is <strong>permanent</strong>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ea40fbc5-e0c5-4980-beb2-ada1755be0ba.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/25f3715d-7ede-439b-9382-64bc68deee06.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 6: Configure IP Allowlist (Public Connectivity)</h3>
<p>DMS requires bidirectional connectivity over public IP.</p>
<ol>
<li><p><strong>Destination instance</strong>:</p>
<ul>
<li><p>Go to <strong>Cloud SQL &gt; Connections</strong></p>
</li>
<li><p>Enable <strong>Public IP</strong> if not already</p>
</li>
<li><p>Note the <strong>Outgoing IP</strong> from Overview tab (this is the IP DMS uses to connect <strong>to source</strong>)</p>
</li>
</ul>
</li>
<li><p><strong>Source instance</strong>:</p>
<ul>
<li><p>Go to <strong>Cloud SQL &gt; Connections &gt; Authorized networks</strong></p>
</li>
<li><p>Add the <strong>destination's outgoing IP</strong> (from step above) as an authorized network</p>
</li>
</ul>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/daba0747-4483-4a65-b565-72abbe51cac6.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/c83dd31c-f6ad-4158-b267-771dc3ca5f51.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 7: Test the Migration Job</h3>
<ul>
<li><p>In the migration job creation wizard → <strong>Test</strong> button</p>
</li>
<li><p>Wait for "Test run complete – successful"</p>
</li>
<li><p>If it fails: double-check credentials, public IPs, allowlist, firewall rules</p>
</li>
</ul>
<p>Once passed → <strong>Create</strong> (you can start immediately or later)</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/7caafd02-bfe0-4294-abe8-1dbe0f216456.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/e986437a-3590-4f87-bf49-a2a8b220b98a.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 8: Start &amp; Monitor the Job and Verify Data Consistency</h3>
<ul>
<li><p>Start the job</p>
</li>
<li><p>Monitor:</p>
<ul>
<li><p>Replication delay / lag</p>
</li>
<li><p>Phase (Full catch-up → CDC)</p>
</li>
</ul>
</li>
</ul>
<p>Source:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/cf28bddc-a461-4efa-99d7-6c56deed408d.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ec6c025b-946f-49d0-8865-988c2d4acc14.png" alt="" style="display:block;margin:0 auto" />

<p>Destination:</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/4e42df5f-612c-47a7-aaee-e5dff7546474.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/200de87b-f70e-4cec-a1dd-967d6435ce08.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 9: Promote the Destination instance</h3>
<p>Once data consistency is verified and the replication delay is near zero, promote the destination database to a writeable instance.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ef3d3635-a306-47c4-bdae-7a4f1f32a0a3.png" alt="" style="display:block;margin:0 auto" />

<h3>Step 10: Check Migration Job Logs or Destination Instance Logs</h3>
<p>If you need the migration job's logs or the destination instance's logs, click <strong>View logs</strong> and select the log type you want.</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/0f0c01af-edee-43e3-b7f9-727fee59578f.png" alt="" style="display:block;margin:0 auto" />

<h2>5. Troubleshooting</h2>
<h3>5.1: Common Issues</h3>
<table>
<thead>
<tr>
<th>Issues</th>
<th>Possible Cause</th>
</tr>
</thead>
<tbody><tr>
<td>Connection Test Fails</td>
<td>Public IP not allowlisted or wrong credentials</td>
</tr>
<tr>
<td>Version Mismatch</td>
<td>Cloud SQL minor version mismatch</td>
</tr>
<tr>
<td>IAM Permission errors</td>
<td>Missing roles in source/destination</td>
</tr>
<tr>
<td>Cutover Fails</td>
<td>Replication lag or ongoing writes on the source</td>
</tr>
</tbody></table>
<h3>5.2: Solutions</h3>
<ul>
<li><p>Re-check the authorized network settings</p>
</li>
<li><p>Verify the SQL version via <code>gcloud sql instances describe</code></p>
</li>
<li><p>Ensure IAM roles and APIs are correctly configured</p>
</li>
</ul>
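<p>For the version check in particular, comparing both sides from the CLI is quick — placeholder instance and project IDs below:</p>
<pre><code class="language-shell"># Engine versions on both sides should match (e.g. MYSQL_8_0)
gcloud sql instances describe SOURCE_INSTANCE --project=SOURCE_PROJECT \
  --format='value(databaseVersion)'
gcloud sql instances describe DEST_INSTANCE --project=DEST_PROJECT \
  --format='value(databaseVersion)'
</code></pre>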
<h2>6. Conclusion</h2>
<p>This blog explained how to migrate a Cloud SQL instance across GCP projects using DMS over public IP.</p>
<p>It covered:</p>
<ol>
<li><p>API Setup</p>
</li>
<li><p>Source/Destination Configuration</p>
</li>
<li><p>DMS Connection profiles and job creation</p>
</li>
<li><p>Troubleshooting the issues</p>
</li>
</ol>
]]></content:encoded></item><item><title><![CDATA[Integrating AWS IAM Identity Center (SSO) with Argo CD and Argo Workflows using SAML 2.0: A Step-by-Step Guide]]></title><description><![CDATA[As organizations increasingly adopt GitOps practices for managing Kubernetes deployments, tools like Argo CD and Argo Workflows have become essential in the modern cloud-native ecosystem. Argo CD auto]]></description><link>https://blog.devopswithpiyush.in/integrating-aws-iam-identity-center-sso-with-argo-cd-and-argo-workflows-using-saml-2-0-a-step-by-step-guide</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/integrating-aws-iam-identity-center-sso-with-argo-cd-and-argo-workflows-using-saml-2-0-a-step-by-step-guide</guid><category><![CDATA[AWS]]></category><category><![CDATA[ArgoCD]]></category><category><![CDATA[gitops]]></category><category><![CDATA[AWS IAM Identity Center]]></category><category><![CDATA[SSO]]></category><category><![CDATA[SSO - Single Sign-On]]></category><category><![CDATA[argoworkflow]]></category><category><![CDATA[workflows]]></category><category><![CDATA[Active Directory]]></category><category><![CDATA[IAM]]></category><category><![CDATA[Security]]></category><category><![CDATA[DevSecOps]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Tue, 10 Mar 2026 20:46:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/b7490023-bf98-4a7b-96b6-947b9d159c2d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As organizations increasingly adopt <strong>GitOps</strong> practices for managing Kubernetes deployments, tools like <strong>Argo CD</strong> and <strong>Argo Workflows</strong> have become essential in the modern cloud-native ecosystem. Argo CD automates application deployments declaratively from Git repositories, while Argo Workflows orchestrates complex, scalable pipelines and batch jobs on Kubernetes.</p>
<p>To make these tools secure and user-friendly, especially in enterprise environments, integrating <strong>AWS IAM Identity Center</strong> (formerly known as AWS SSO) via <strong>SAML 2.0</strong> provides centralized authentication, group-based access control, and a single sign-on (SSO) experience. This eliminates multiple logins, reduces credential sprawl, and aligns with zero-trust security principles.</p>
<p>In this blog post, I'll walk you through the complete setup in a clear, beginner-friendly way — perfect for freshers learning GitOps, experienced DevOps engineers hardening access, or teams pursuing CNCF and AWS community contributions. The guide draws from the official Argo CD documentation and practical implementations, updated for 2026 best practices.</p>
<h3>Why Integrate AWS IAM Identity Center with Argo CD and Argo Workflows?</h3>
<ul>
<li><p><strong>Centralized Access Management</strong> — Manage users and groups in one place (AWS IAM Identity Center) for consistent policies across AWS services and third-party apps.</p>
</li>
<li><p><strong>Enhanced Security</strong> — Leverage SAML 2.0 federation to avoid storing local credentials; enforce MFA and compliance easily.</p>
</li>
<li><p><strong>Improved User Experience</strong> — Users log in once with corporate credentials and access Argo CD's UI and Argo Workflows seamlessly.</p>
</li>
<li><p><strong>Group-Based RBAC</strong> — Map AWS groups to Argo roles (e.g., readonly vs. admin) for fine-grained permissions.</p>
</li>
</ul>
<h3>Architecture Overview</h3>
<p>User → AWS IAM Identity Center (IdP) → SAML Assertion → Argo CD Dex (bundled OIDC provider) → Argo CD / Argo Workflows (Service Providers)</p>
<p>Argo CD uses <strong>Dex</strong> (its embedded identity broker) to handle SAML, while Argo Workflows can federate via the same Dex instance for shared SSO.</p>
<h3>High-Level Architecture</h3>
<p>User logs in via corporate credentials → AWS IAM Identity Center authenticates → issues SAML assertion → Argo CD's <strong>Dex</strong> (built-in identity broker) validates → grants access based on groups.</p>
<p>Here's a simple flow diagram (Mermaid syntax — paste into <a href="http://mermaid.live">mermaid.live</a> or your blog renderer):</p>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/25a05205-9880-4ccb-ac57-b70d0b007a1b.svg" alt="" style="display:block;margin:0 auto" />

<p>This shows the <strong>authentication flow</strong>. AWS acts as Identity Provider (<strong>IdP</strong>), Argo CD/Dex as Service Provider (<strong>SP</strong>).</p>
<h3>Detailed Component Diagram – Infrastructure View</h3>
<img src="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/4d5b923c-8a0c-4beb-875f-f634b4ee55fc.png" alt="" style="display:block;margin:0 auto" />

<h3>Prerequisites</h3>
<ul>
<li><p>A running Kubernetes cluster with Argo CD and (optionally) Argo Workflows installed (preferably via Helm).</p>
</li>
<li><p>Access to <strong>AWS IAM Identity Center</strong> with permissions to create SAML applications.</p>
</li>
<li><p>Argo CD exposed via a domain (e.g., <a href="https://argocd.yourdomain.com">https://argocd.yourdomain.com</a>).</p>
</li>
<li><p>kubectl access to create secrets and edit ConfigMaps.</p>
</li>
<li><p>Basic understanding of YAML and Kubernetes resources.</p>
</li>
</ul>
<h3>Step-by-Step Implementation</h3>
<h4>Step 1: Create a Custom SAML 2.0 Application in AWS IAM Identity Center</h4>
<ol>
<li><p>Go to <strong>AWS IAM Identity Center</strong> → <strong>Applications</strong> → <strong>Add application</strong>.</p>
</li>
<li><p>Choose <strong>Add custom SAML 2.0 application</strong>.</p>
</li>
<li><p>Set <strong>Display name</strong> (e.g., "Argo CD SSO").</p>
</li>
<li><p>Under <strong>Application metadata</strong>:</p>
<ul>
<li><p>Select <strong>Manually type metadata values</strong>.</p>
</li>
<li><p><strong>Application ACS URL</strong>: <a href="https://argocd.yourdomain.com/api/dex/callback">https://argocd.yourdomain.com/api/dex/callback</a></p>
</li>
<li><p><strong>Application SAML audience</strong>: <a href="https://argocd.yourdomain.com/api/dex/callback">https://argocd.yourdomain.com/api/dex/callback</a></p>
</li>
</ul>
</li>
<li><p>(Optional) Set <strong>Application start URL</strong>: <a href="https://argocd.yourdomain.com">https://argocd.yourdomain.com</a></p>
</li>
<li><p>Download the <strong>IAM Identity Center certificate</strong> (you'll need it later).</p>
</li>
<li><p>Submit and go to <strong>Attribute mappings</strong>:</p>
<ul>
<li><p>Add mappings:</p>
<ul>
<li><p><strong>Subject</strong> → ${user:subject} (persistent)</p>
</li>
<li><p><strong>groups</strong> → ${user:groups}</p>
</li>
<li><p><strong>email</strong> → ${user:email}</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Assign users/groups who should access Argo CD.</p>
</li>
</ol>
<p><strong>Note</strong>: Use your actual Argo CD domain. The callback URL is critical for Dex.</p>
<h4>Step 2: Prepare the Certificate</h4>
<p>Base64-encode the downloaded certificate, including the <code>-----BEGIN CERTIFICATE-----</code> and <code>-----END CERTIFICATE-----</code> lines:</p>
<pre><code class="language-bash">base64 -w 0 iam-identity-center-cert.pem &gt; encoded-cert.txt
</code></pre>
<p>Copy the output — you'll paste it into the <code>caData</code> field in the next step.</p>
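<p>As a quick sanity check, the encoded value should decode back to a certificate that <code>openssl</code> can parse (illustrative session — the subject and expiry will be those of your IAM Identity Center certificate):</p>
<pre><code class="language-plaintext">$ base64 -d encoded-cert.txt | openssl x509 -noout -subject -enddate
subject=...   (the IAM Identity Center signing certificate)
notAfter=...  (confirm this date is in the future)
</code></pre>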
<h4>Step 3: Configure Argo CD (via Helm values or argocd-cm ConfigMap)</h4>
<p>Update your Argo CD Helm values (or edit argocd-cm directly):</p>
<pre><code class="language-yaml">configs:
  cm:
    create: true
    url: https://argocd.yourdomain.com
    dex.config: |
      logger:
        level: debug
        format: json
      connectors:
      - type: saml
        id: aws
        name: "AWS IAM Identity Center"
        config:
          ssoURL: "https://portal.sso.&lt;region&gt;.amazonaws.com/saml/assertion/&lt;id&gt;"  # From your SAML app sign-in URL
          caData: "&lt;base64-encoded-cert-from-step-2&gt;"
          entityIssuer: https://argocd.yourdomain.com/api/dex/callback
          redirectURI: https://argocd.yourdomain.com/api/dex/callback
          usernameAttr: email
          emailAttr: email
          groupsAttr: groups

  rbac:
    policy.default: role:readonly
    policy.csv: |
      p, role:readonly, applications, get, */*, allow
      p, role:readonly, certificates, get, *, allow
      p, role:readonly, clusters, get, *, allow
      # ... (add more readonly permissions as needed)

      p, role:admin, applications, create, */*, allow
      p, role:admin, applications, update, */*, allow
      # ... (add admin permissions)

      g, "&lt;your-aws-group-id&gt;", role:admin  # e.g., g, "argocd-admins", role:admin
    scopes: '[groups, email]'
</code></pre>
<p><strong>Key Tips</strong>:</p>
<ul>
<li><p>ssoURL comes from the SAML app's sign-in URL.</p>
</li>
<li><p>For group mapping, use the exact group name/ID from AWS. (AWS IAM Identity Center doesn't officially document passing groups in SAML assertions, but the <code>${user:groups}</code> attribute mapping from Step 1 works reliably in practice.)</p>
</li>
<li><p>Apply changes and restart Dex pod if needed.</p>
</li>
</ul>
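<p>After applying the values, a rollout restart forces Dex to reload its connector configuration. The deployment name below matches a standard Helm/manifests install — adjust it if yours differs:</p>
<pre><code class="language-plaintext">$ kubectl -n argocd rollout restart deployment argocd-dex-server
deployment.apps/argocd-dex-server restarted
</code></pre>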
<h4>Step 4: Create Kubernetes Secret (for Shared Client Secret if Using Argo Workflows)</h4>
<pre><code class="language-bash">kubectl create secret generic argocd-sso-secret \
  --namespace argocd \
  --from-literal=client-id="https://portal.sso.&lt;region&gt;.amazonaws.com/saml/assertion/&lt;id&gt;" \
  --from-literal=client-secret="some-random-secure-string"  # Or generate one
</code></pre>
<p>If Argo Workflows is in a different namespace, recreate the same secret there.</p>
<h4>Step 5: Configure Argo Workflows (Optional but Recommended for Unified SSO)</h4>
<p>In Argo Workflows Helm values:</p>
<pre><code class="language-yaml">server:
  authModes:
    - sso
    - client  # Optional: keep client for backward compat; can remove later
  sso:
    enabled: true
    issuer: https://argocd.yourdomain.com/api/dex
    clientId:
      name: argocd-sso-secret
      key: client-id
    clientSecret:
      name: argocd-sso-secret
      key: client-secret
    redirectUrl: https://argocd.yourdomain.com/oauth2/callback
    sessionExpiry: 8h
</code></pre>
<p>This allows Argo Workflows to use the same Dex instance for SSO.</p>
<h4>Step 6: Test the Integration</h4>
<ul>
<li><p>Access <a href="https://argocd.yourdomain.com">https://argocd.yourdomain.com</a></p>
</li>
<li><p>Click <strong>LOGIN VIA SSO</strong></p>
</li>
<li><p>You should redirect to AWS IAM Identity Center login</p>
</li>
<li><p>After authentication, return to Argo CD with proper permissions based on your group</p>
</li>
</ul>
<p>If issues arise, check the Dex logs for debug info: <code>kubectl logs -n argocd -l app.kubernetes.io/name=argocd-dex-server</code>.</p>
<h3>Troubleshooting Common Issues</h3>
<ul>
<li><p><strong>Authentication fails</strong> → Verify URLs match exactly (case-sensitive); check certificate encoding.</p>
</li>
<li><p><strong>Groups not recognized</strong> → Confirm group names in AWS and in the RBAC <code>policy.csv</code>; use debug logging.</p>
</li>
<li><p><strong>Callback errors</strong> → Ensure ACS URL and audience match Dex callback.</p>
</li>
<li><p><strong>Connectivity</strong> → Confirm network policies allow outbound to AWS endpoints.</p>
</li>
</ul>
<h3>Best Practices</h3>
<ul>
<li><p>Store secrets securely (use external secret managers like AWS Secrets Manager + External Secrets Operator).</p>
</li>
<li><p>Rotate client secrets periodically.</p>
</li>
<li><p>Use least-privilege RBAC: Start with readonly default, grant admin only to specific groups.</p>
</li>
<li><p>Monitor Dex logs and set up alerts for auth failures.</p>
</li>
<li><p>Test group membership changes in a staging environment.</p>
</li>
</ul>
<h3>Conclusion</h3>
<p>Integrating <strong>AWS IAM Identity Center</strong> with <strong>Argo CD</strong> (and optionally Argo Workflows) via SAML 2.0 brings enterprise-grade authentication to your GitOps workflows. It simplifies onboarding, boosts security, and supports scalable team collaboration — key for CNCF-aligned projects and AWS ecosystems.</p>
<p>By following this guide, you can achieve centralized, secure access in minutes (after initial setup). If you're contributing to open-source or building AWS community projects, this pattern is battle-tested and aligns with modern cloud-native security.</p>
<p>Happy GitOps-ing! If you implement this, share your experiences — feedback helps the community grow.</p>
]]></content:encoded></item><item><title><![CDATA[Build Your Own SMTP Mail Server on AWS EC2 Using Node.js — A Complete Hands-On Guide]]></title><description><![CDATA[Introduction
Ever wondered what really happens when you click "Send" on an email? Behind the scenes, a chain of DNS lookups, protocol handshakes, and server communications takes place — all orchestrat]]></description><link>https://blog.devopswithpiyush.in/build-smtp-mail-server-aws-ec2-nodejs</link><guid isPermaLink="true">https://blog.devopswithpiyush.in/build-smtp-mail-server-aws-ec2-nodejs</guid><category><![CDATA[AWS]]></category><category><![CDATA[ec2]]></category><category><![CDATA[Devops]]></category><category><![CDATA[jo]]></category><category><![CDATA[Node.js]]></category><category><![CDATA[smtp]]></category><dc:creator><![CDATA[Piyush Agrawal]]></dc:creator><pubDate>Tue, 10 Mar 2026 17:28:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b00c50abc0d950015d60e7/ab1ad4cf-02b4-4047-85af-e85b9588f877.png" length="0" type="image/png"/><content:encoded><![CDATA[<h2>Introduction</h2>
<p>Ever wondered what really happens when you click "Send" on an email? Behind the scenes, a chain of DNS lookups, protocol handshakes, and server communications takes place — all orchestrated by SMTP (Simple Mail Transfer Protocol).</p>
<p>In this tutorial, we'll demystify email delivery by building a custom SMTP server from scratch — hosted on <strong>Amazon EC2</strong>. By the end, you'll have a working mail server that can receive emails on your own domain, and a deep understanding of how email infrastructure works under the hood.</p>
<blockquote>
<p>🔗 This post is based on my video walkthrough: <a href="https://youtu.be/l3htAzOAx7c?si=HwQNHU9txEmOBboc">Build Your Own Mail Server | SMTP Server</a></p>
</blockquote>
<hr />
<h2>Prerequisites</h2>
<p>Before you begin, make sure you have:</p>
<ul>
<li><p>An <strong>AWS account</strong> with access to the EC2 console</p>
</li>
<li><p>A registered <strong>domain name</strong> (with access to DNS management)</p>
</li>
<li><p>Basic familiarity with <strong>Linux terminal commands</strong></p>
</li>
<li><p>Basic understanding of <strong>Node.js</strong></p>
</li>
</ul>
<hr />
<h2>How Email Delivery Actually Works</h2>
<p>Let's say <strong>Piyush</strong> (using Gmail) wants to send an email to <strong>Abhay</strong> (using Outlook).</p>
<p>Here's the step-by-step flow:</p>
<ol>
<li><p><strong>MX Record Lookup</strong> — Piyush's mail server performs a DNS query for <code>outlook.com</code>'s <strong>MX (Mail Exchanger) Record</strong>. This tells it which server is responsible for handling incoming mail for that domain.</p>
</li>
<li><p><strong>A Record Lookup</strong> — The MX record returns a hostname (e.g., <code>mailserver.outlook.com</code>). A second DNS query resolves this hostname to an <strong>IPv4 address</strong> using the <strong>A Record</strong>.</p>
</li>
<li><p><strong>SMTP Connection</strong> — Piyush's server opens a TCP connection to the resolved IP on <strong>port 25</strong> and begins the SMTP handshake to deliver the message.</p>
</li>
</ol>
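<p>You can watch steps 1 and 2 happen yourself with <code>dig</code>. The output below is illustrative — actual hostnames, priorities, and IPs will vary:</p>
<pre><code class="language-plaintext">$ dig MX outlook.com +short
5 outlook-com.olc.protection.outlook.com.

$ dig A outlook-com.olc.protection.outlook.com +short
52.101.40.29
</code></pre>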
<hr />
<h2>DNS Records Every Mail Server Needs</h2>
<p>Before your server can send or receive email reliably, you need to configure several DNS records:</p>
<table>
<thead>
<tr>
<th>Record</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><strong>MX</strong></td>
<td>Specifies which mail server handles email for your domain</td>
</tr>
<tr>
<td><strong>A</strong></td>
<td>Maps your mail server's hostname to its public IPv4 address</td>
</tr>
<tr>
<td><strong>SPF</strong></td>
<td>Defines which servers are authorized to send email on behalf of your domain (prevents spoofing)</td>
</tr>
<tr>
<td><strong>DKIM</strong></td>
<td>Adds a cryptographic signature to outgoing emails, verifying sender identity and message integrity</td>
</tr>
<tr>
<td><strong>DMARC</strong></td>
<td>Builds on SPF and DKIM to define how receiving servers should handle authentication failures</td>
</tr>
</tbody></table>
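<p>For reference, here is roughly what the three authentication records look like in zone-file form. All values are illustrative — your sending IP, DKIM selector, public key, and reporting address will differ:</p>
<pre><code class="language-plaintext">yourdomain.com.                        TXT  "v=spf1 ip4:203.0.113.10 -all"
selector1._domainkey.yourdomain.com.   TXT  "v=DKIM1; k=rsa; p=MIGfMA0GCSq...AB"
_dmarc.yourdomain.com.                 TXT  "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@yourdomain.com"
</code></pre>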
<hr />
<h2>SMTP Protocol — The Handshake</h2>
<p>SMTP communication follows a structured command sequence. Here's how the conversation between two servers typically flows:</p>
<table>
<thead>
<tr>
<th>Command</th>
<th>Description</th>
</tr>
</thead>
<tbody><tr>
<td><code>HELO</code></td>
<td>Initiates the SMTP session; the client introduces itself to the server</td>
</tr>
<tr>
<td><code>MAIL FROM</code></td>
<td>Declares the sender's email address</td>
</tr>
<tr>
<td><code>RCPT TO</code></td>
<td>Specifies the recipient's email address</td>
</tr>
<tr>
<td><code>DATA</code></td>
<td>Requests permission to begin transmitting the email body</td>
</tr>
<tr>
<td><code>QUIT</code></td>
<td>Terminates the SMTP session</td>
</tr>
</tbody></table>
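<p>You can drive this handshake by hand to see the commands in action. A typical session against a mail server on port 25 looks like this (server responses are illustrative):</p>
<pre><code class="language-plaintext">$ telnet mail.yourdomain.com 25
220 mail.yourdomain.com ESMTP
HELO client.example.com
250 mail.yourdomain.com
MAIL FROM:&lt;piyush@example.com&gt;
250 OK
RCPT TO:&lt;abhay@yourdomain.com&gt;
250 OK
DATA
354 End data with &lt;CR&gt;&lt;LF&gt;.&lt;CR&gt;&lt;LF&gt;
Subject: Hello

Testing SMTP by hand.
.
250 OK: queued
QUIT
221 Bye
</code></pre>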
<p><strong>Default SMTP Ports:</strong></p>
<ul>
<li><p><strong>Port 25</strong> — Standard SMTP</p>
</li>
<li><p><strong>Port 465</strong> — SMTP over implicit SSL/TLS (secure)</p>
</li>
<li><p><strong>Port 587</strong> — Mail submission with STARTTLS (the standard for authenticated client-to-server sending)</p>
</li>
</ul>
<hr />
<h2>Step 1 — Launch an EC2 Instance on AWS</h2>
<p>Head over to the <strong>AWS Management Console</strong> and launch a new EC2 instance.</p>
<p><strong>Recommended Configuration:</strong></p>
<ul>
<li><p><strong>AMI:</strong> Ubuntu Server 22.04 LTS (or latest)</p>
</li>
<li><p><strong>Instance Type:</strong> <code>t2.micro</code> (Free Tier eligible — perfect for this project)</p>
</li>
<li><p><strong>Key Pair:</strong> Create or select an existing SSH key pair</p>
</li>
<li><p><strong>Network:</strong> Ensure the instance has a <strong>public IP address</strong> (we'll configure the security group shortly)</p>
</li>
</ul>
<blockquote>
<p>💡 <strong>Why EC2?</strong> Amazon EC2 gives you full control over your server environment — including the operating system, network configuration, and security policies. It's ideal for running custom services like an SMTP server where you need to open specific ports and manage DNS records pointing to your instance's public IP.</p>
</blockquote>
<hr />
<h2>Step 2 — Install Node.js and npm</h2>
<p>SSH into your EC2 instance and install Node.js:</p>
<pre><code class="language-bash">sudo apt update
sudo apt install nodejs npm -y
node -v
</code></pre>
<p>You should see output similar to:</p>
<pre><code class="language-plaintext">v18.x.x
</code></pre>
<blockquote>
<p><strong>Tip:</strong> For the latest LTS version, consider using <a href="https://github.com/nvm-sh/nvm">nvm (Node Version Manager)</a> instead of the default apt package.</p>
</blockquote>
<hr />
<h2>Step 3 — Install the SMTP Server Package</h2>
<p>Create a project directory and install the <code>smtp-server</code> npm package:</p>
<pre><code class="language-bash">mkdir smtp-server &amp;&amp; cd smtp-server
npm init -y
npm install smtp-server
</code></pre>
<hr />
<h2>Step 4 — Write the SMTP Server Code</h2>
<p>Create a file called <code>index.js</code>:</p>
<pre><code class="language-bash">nano index.js
</code></pre>
<p>Paste the following Node.js code:</p>
<pre><code class="language-javascript">const { SMTPServer } = require("smtp-server");

const server = new SMTPServer({
  allowInsecureAuth: true,
  authOptional: true,

  onConnect(session, cb) {
    console.log(`[CONNECT] Session ID: ${session.id}`);
    cb();
  },

  onMailFrom(address, session, cb) {
    console.log(`[MAIL FROM] ${address.address} | Session: ${session.id}`);
    cb();
  },

  onRcptTo(address, session, cb) {
    console.log(`[RCPT TO] ${address.address} | Session: ${session.id}`);
    cb();
  },

  onData(stream, session, cb) {
    let emailData = "";
    stream.on("data", (chunk) =&gt; {
      emailData += chunk.toString();
    });
    stream.on("end", () =&gt; {
      console.log(`[DATA] Email content:\n${emailData}`);
      cb();
    });
  },
});

server.listen(25, () =&gt; {
  console.log("✅ SMTP Server is running on port 25");
});
</code></pre>
<blockquote>
<p>This creates a minimal SMTP server that logs every incoming email connection, sender, recipient, and message body — great for understanding the protocol in action.</p>
</blockquote>
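<p>To go one step beyond logging raw text, you could split the captured message into headers and a body inside the <code>onData</code> handler. The helper below is a teaching sketch (the name <code>parseEmail</code> is mine, not part of the <code>smtp-server</code> package) and deliberately ignores folded multi-line headers and MIME parts:</p>
<pre><code class="language-javascript">// Split a raw RFC 5322-style message into a header map and a body string.
// Headers end at the first blank line; header names are lower-cased.
function parseEmail(raw) {
  const normalized = raw.replace(/\r\n/g, "\n");
  const splitAt = normalized.indexOf("\n\n");
  const headerBlock = splitAt === -1 ? normalized : normalized.slice(0, splitAt);
  const body = splitAt === -1 ? "" : normalized.slice(splitAt + 2);
  const headers = {};
  for (const line of headerBlock.split("\n")) {
    const idx = line.indexOf(":");
    if (idx !== -1) {
      headers[line.slice(0, idx).toLowerCase()] = line.slice(idx + 1).trim();
    }
  }
  return { headers, body };
}
</code></pre>
<p>In the <code>end</code> callback you could then log, say, <code>parseEmail(emailData).headers.subject</code> instead of dumping the whole message.</p>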
<hr />
<h2>Step 5 — Configure the EC2 Security Group</h2>
<p>Back in the <strong>AWS EC2 Console</strong>, navigate to your instance's <strong>Security Group</strong> and add the following <strong>inbound rule</strong>:</p>
<table>
<thead>
<tr>
<th>Type</th>
<th>Protocol</th>
<th>Port Range</th>
<th>Source</th>
</tr>
</thead>
<tbody><tr>
<td>Custom TCP</td>
<td>TCP</td>
<td>25</td>
<td>0.0.0.0/0 (or restrict as needed)</td>
</tr>
</tbody></table>
<blockquote>
<p>⚠️ <strong>Security Note:</strong> Opening port 25 to <code>0.0.0.0/0</code> is fine for testing, but in production you should restrict access and implement authentication. AWS also throttles port 25 by default on EC2 — you may need to <a href="https://aws.amazon.com/premiumsupport/knowledge-center/ec2-port-25-throttle/">submit a request</a> to remove the restriction for outbound SMTP traffic.</p>
</blockquote>
<hr />
<h2>Step 6 — Configure DNS Records</h2>
<p>Go to your <strong>domain registrar</strong> (or <strong>Amazon Route 53</strong> if you manage DNS through AWS) and add the following records:</p>
<table>
<thead>
<tr>
<th>Record Type</th>
<th>Host</th>
<th>Value</th>
<th>TTL</th>
</tr>
</thead>
<tbody><tr>
<td><strong>A</strong></td>
<td><code>mail.yourdomain.com</code></td>
<td><code>&lt;Your EC2 Public IP&gt;</code></td>
<td>300</td>
</tr>
<tr>
<td><strong>MX</strong></td>
<td><code>yourdomain.com</code></td>
<td><code>mail.yourdomain.com</code> (Priority: 10)</td>
<td>300</td>
</tr>
</tbody></table>
<blockquote>
<p>🔑 <strong>Pro Tip:</strong> If you're using <strong>Amazon Route 53</strong> for DNS management, you can associate an <strong>Elastic IP</strong> with your EC2 instance. This ensures your server's IP remains static, even if you stop/start the instance — critical for reliable mail delivery.</p>
</blockquote>
<hr />
<h2>Step 7 — Start the Server</h2>
<p>You can start the server directly with Node:</p>
<pre><code class="language-bash">sudo node index.js
</code></pre>
<p>For production persistence, use <strong>PM2</strong> (a Node.js process manager):</p>
<pre><code class="language-bash">sudo npm install -g pm2
sudo pm2 start index.js
sudo pm2 save
sudo pm2 startup
</code></pre>
<p>You should see:</p>
<pre><code class="language-plaintext">✅ SMTP Server is running on port 25
</code></pre>
<hr />
<h2>Step 8 — Test by Sending an Email</h2>
<p>Open any email client (Gmail, Yahoo, Outlook) and send a test email to:</p>
<pre><code class="language-plaintext">anything@yourdomain.com
</code></pre>
<p>Check your EC2 terminal — you should see the SMTP handshake logs appear in real time:</p>
<pre><code class="language-plaintext">[CONNECT] Session ID: abc123
[MAIL FROM] sender@gmail.com | Session: abc123
[RCPT TO] anything@yourdomain.com | Session: abc123
[DATA] Email content:
Subject: Test Email
Hello from Gmail!
</code></pre>
<p>🎉 <strong>Congratulations!</strong> You've just built a working SMTP server on AWS EC2.</p>
<hr />
<h2>AWS Services Used</h2>
<table>
<thead>
<tr>
<th>Service</th>
<th>Purpose</th>
</tr>
</thead>
<tbody><tr>
<td><strong>Amazon EC2</strong></td>
<td>Hosts the SMTP server on a virtual Linux machine in the cloud</td>
</tr>
<tr>
<td><strong>Security Groups</strong></td>
<td>Acts as a virtual firewall to control inbound/outbound traffic on port 25</td>
</tr>
<tr>
<td><strong>Elastic IP</strong> <em>(optional)</em></td>
<td>Provides a static public IP for consistent DNS resolution</td>
</tr>
<tr>
<td><strong>Amazon Route 53</strong> <em>(optional)</em></td>
<td>Managed DNS service for configuring MX and A records</td>
</tr>
</tbody></table>
<hr />
<h2>What's Next?</h2>
<p>This tutorial sets up a <strong>basic receive-only SMTP server</strong> for learning purposes. To take it further, consider:</p>
<ul>
<li><p>Adding <strong>TLS encryption</strong> with Let's Encrypt certificates for secure communication</p>
</li>
<li><p>Configuring <strong>SPF, DKIM, and DMARC</strong> records for email authentication</p>
</li>
<li><p>Using <strong>Amazon SES</strong> alongside your custom server for reliable outbound email delivery</p>
</li>
<li><p>Implementing <strong>Postfix</strong> or <strong>Haraka</strong> for a production-grade mail transfer agent</p>
</li>
<li><p>Monitoring server health with <strong>Amazon CloudWatch</strong></p>
</li>
</ul>
<hr />
<h2>Wrapping Up</h2>
<p>Building an SMTP server from scratch is one of the best ways to understand how email really works at the protocol level. By hosting it on <strong>Amazon EC2</strong>, you get the flexibility of full server access combined with the reliability and scalability of AWS infrastructure.</p>
<p>If this post helped you, feel free to drop a ❤️ and share it with someone learning about cloud infrastructure!</p>
<hr />
<p><em>Have questions or want to connect? Find me on</em> <a href="https://linkedin.com"><em>LinkedIn</em></a><em>.</em></p>
]]></content:encoded></item></channel></rss>