Top Five Things to Keep in Mind When Running Serverless in Production
Wed, 03 Jan 2018
So you’ve built your first Serverless app (if you haven’t, start here) and are ready to go to production. Given the rhetoric around Serverless these days (or, more precisely for our purposes, cloud-native apps built with AWS platform services), you might be forgiven for thinking there is zero operational effort. Sadly, that nirvana has not yet arrived. While there is WAY less operational effort, it’s not zero; there are some important things to keep in mind when planning to run Serverless apps in production. Below are the top five that Trek10 has identified over years of managing Serverless in production for our clients.
1. Error monitoring

This is probably the most obvious and most common concern. Apps have errors and you have to watch for them… not exactly groundbreaking. What is a bit different is figuring out the right threshold for alerting. For a low-volume system it may make sense to alert on every error, but for any sufficiently high-volume system you need to do some work to weed out the noise that comes from typical transient errors; in practice, some baseline error rate (often under 0.1%) is always going to occur. With asynchronous Lambda invocations (for example, from an S3 object event), we recommend configuring dead letter queues and alerting only on those, so you can ignore transient errors knowing that AWS’s automatic retries will usually succeed. A sketch of this setup follows.
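As a minimal sketch of that pattern, assuming an SQS queue acting as the function’s dead letter queue and an SNS topic for ops alerts (the function, queue, and topic names/ARNs below are hypothetical placeholders), you might wire it up with boto3 like this:

```python
import boto3

lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

# Route failed asynchronous invocations to an SQS dead letter queue.
lambda_client.update_function_configuration(
    FunctionName="my-function",
    DeadLetterConfig={"TargetArn": "arn:aws:sqs:us-east-1:123456789012:my-function-dlq"},
)

# Alert whenever anything lands in the DLQ; transient errors that
# AWS's retries eventually clear never reach the queue at all.
cloudwatch.put_metric_alarm(
    AlarmName="my-function-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-function-dlq"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # an empty queue should not trip the alarm
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```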
You may also want to consider tools that improve your visibility beyond the built-in basics of CloudWatch metrics and Lambda logs in CloudWatch Logs. Error tracking services like Sentry or Rollbar work just as well here as in traditional architectures. When it comes to tracing, though, you’ll need to look at a new generation of tools: AWS X-Ray and IOpipe are two of the more popular options.
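For example, turning on X-Ray active tracing for an existing function is a one-line configuration change. A sketch with boto3 (the function name is a placeholder, and the function’s execution role also needs permission to write trace data):

```python
import boto3

lambda_client = boto3.client("lambda")

# Enable X-Ray active tracing so invocations are sampled and traced.
lambda_client.update_function_configuration(
    FunctionName="my-function",
    TracingConfig={"Mode": "Active"},
)
```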
2. Scaling limits

While scaling with AWS platform services is mostly transparent, it is not 100% hands-off. There are a few dials in the system, and it is important to know where they are and how to monitor them to optimize scalability and costs. Some are obvious and easily visible, like DynamoDB provisioned throughput (which now supports auto-scaling, by the way) or Kinesis shard counts. Others are slightly more hidden, like Lambda concurrency limits. Still others, like S3 partitioning, are completely hidden and can only be monitored by observing symptoms such as S3 error rates or PUT latency. Carefully review each part of the system to identify all of the relevant dials.
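As one example of watching these dials, here is a sketch using boto3 that alarms when account-wide Lambda concurrency approaches the limit, and reserves capacity for a critical function (the function name, 80% threshold, and SNS topic ARN are all illustrative choices, not recommendations):

```python
import boto3

lambda_client = boto3.client("lambda")
cloudwatch = boto3.client("cloudwatch")

# Look up the account-wide concurrent execution limit.
limit = lambda_client.get_account_settings()["AccountLimit"]["ConcurrentExecutions"]

# Alarm when concurrency across the account sits above 80% of that limit.
cloudwatch.put_metric_alarm(
    AlarmName="lambda-concurrency-near-limit",
    Namespace="AWS/Lambda",
    MetricName="ConcurrentExecutions",
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=limit * 0.8,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)

# Reserve capacity for a critical function so noisier functions cannot starve it.
lambda_client.put_function_concurrency(
    FunctionName="my-critical-function",
    ReservedConcurrentExecutions=100,
)
```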
3. Security

Security is never a solved problem… the risks just shift. With no long-running VM and often no network to manage, Serverless greatly reduces the attack surface of many traditional threats. This doesn’t mean security is solved, though: it just allows you to shift your focus to other threat areas, such as over-privileged IAM roles, vulnerable third-party dependencies, and injection attacks through event data.
You should look at automating security scanning against these threats in your production infrastructure, as well as integrating security analysis into your CI pipeline so that vulnerabilities never get deployed in the first place.
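As a toy illustration of the CI side, the script below walks a JSON-format CloudFormation template and fails the build on wildcard IAM grants. A real pipeline would lean on purpose-built scanners and handle YAML templates and legitimate wildcards; this just shows the shape of the check:

```python
import json
import sys

def find_wildcards(node, path=""):
    """Recursively flag IAM statement fields whose Action or Resource is '*'."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("Action", "Resource") and (
                value == "*" or (isinstance(value, list) and "*" in value)
            ):
                yield f"{path}/{key}"
            yield from find_wildcards(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_wildcards(item, f"{path}[{i}]")

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        template = json.load(f)
    findings = list(find_wildcards(template))
    for finding in findings:
        print(f"over-broad IAM grant at {finding}")
    sys.exit(1 if findings else 0)  # non-zero exit fails the CI step
```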
4. Cost monitoring

This is the flip side of a great benefit of Serverless: cost is truly usage-based… but cost is truly usage-based. If you get unwanted or unexpected traffic, costs can spike quickly. It is therefore important to monitor costs daily so you can detect any spikes, block the offending traffic, or optimize your application to bring costs back down.
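A minimal daily check might pull recent spend from the Cost Explorer API and compare the latest day to the trailing average. A sketch (the 50% threshold is an arbitrary placeholder to tune for your workload):

```python
from datetime import date, timedelta

import boto3

# The Cost Explorer API is served from us-east-1.
ce = boto3.client("ce", region_name="us-east-1")

# Pull the last 8 full days of unblended cost (end date is exclusive).
end = date.today()
start = end - timedelta(days=8)
response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
)

costs = [
    float(day["Total"]["UnblendedCost"]["Amount"])
    for day in response["ResultsByTime"]
]
baseline = sum(costs[:-1]) / len(costs[:-1])  # trailing 7-day average
latest = costs[-1]

# Flag anything more than 50% above the trailing average.
if latest > baseline * 1.5:
    print(f"Cost spike: ${latest:.2f} vs. ${baseline:.2f} daily average")
```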
And finally, everyone’s favorite…

5. Disaster recovery
All of the AWS platform services run by default in multiple AWS Availability Zones (AZs: one or more data centers with independent power and networking within a given AWS Region), so in theory two to three AWS data centers would need to go down simultaneously to cause an outage, a very uncommon (i.e. much less often than once a year) scenario. In reality, though, these services have cross-AZ dependencies and have suffered region-wide outages: in the past 15 months there have been multiple outages to services like DynamoDB, S3, and Lambda. So this is a real thing that you need to plan for. Trek10’s very own Jared Short gave a great talk at Serverlessconf NYC about this if you’d like to learn more.
The first step is determining the extent to which you can build for multi-region failover or possibly even multi-region active-active. Two new features, API Gateway Regional Endpoints and DynamoDB Global Tables, have made this significantly more realistic, but you still need to conduct an RTO/RPO cost sensitivity analysis to decide if it makes sense for you.
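For instance, under the original Global Tables flow, once you have created identical tables (with streams enabled) in each region, joining them into a single replicated global table is one call. A sketch with boto3, using illustrative table and region names:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Prerequisite: "my-table" already exists in both regions with the same
# key schema and with DynamoDB Streams (NEW_AND_OLD_IMAGES) enabled.
dynamodb.create_global_table(
    GlobalTableName="my-table",
    ReplicationGroup=[
        {"RegionName": "us-east-1"},
        {"RegionName": "us-west-2"},
    ],
)
```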
Next, you need to build your operational response plan for these outages. While your ops team may not be able to fix AWS’s issue, it still has a key role to play: identify as early as possible that there is a problem, trace the root cause to the affected AWS services, seek confirmation from AWS that the problem is on their side (usually via AWS Support first, with the AWS Status Page lagging behind), and then communicate clearly to end users, initiate failover plans as appropriate, and monitor status on the AWS side. In other words… they’ll be busy!
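To help with early detection, the Service Health Dashboard exposes per-service RSS feeds you can poll alongside your own alarms. A rough sketch follows; the feed URLs shown follow the pattern linked from status.aws.amazon.com and should be verified against the dashboard itself:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Per-service, per-region feeds from the AWS Service Health Dashboard
# (URLs illustrative; confirm the exact links on status.aws.amazon.com).
FEEDS = [
    "https://status.aws.amazon.com/rss/lambda-us-east-1.rss",
    "https://status.aws.amazon.com/rss/dynamodb-us-east-1.rss",
]

for url in FEEDS:
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    # Feeds list the most recent event first; an empty feed means no recent events.
    item = tree.find("channel/item")
    if item is not None:
        title = item.findtext("title", default="")
        pub_date = item.findtext("pubDate", default="")
        print(f"{url}: {title} ({pub_date})")
```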
So as you can see, Serverless Ops is not entirely new. There are some new things, but it’s mostly just a shift in focus. The overall effort should be significantly less than Ops for an equivalently-sized infrastructure, but it is really important to tackle and master if you are going to be successful in adopting Serverless in your organization.
At Trek10, we are experts in building and operating Serverless infrastructure and enabling others to do so, so let us know if you’d like some help.
Questions/comments? Feel free to reach us at serverless@trek10.com.