Mon, 02 Nov 2015
AWS’s recent outage of DynamoDB and related services in the US East region is a good reminder of some fundamentals of designing for availability on AWS.
If you need a primer on Availability Zones (AZs) vs. Regions, check this out. The distinction matters most for services like EC2 and RDS, where each resource is a single virtual machine living in one AZ. Because AZs are isolated from each other, it is very unlikely that virtual machines in separate AZs will go down at the same time, outside of a major natural disaster in the region. Many other AWS services, like S3 and DynamoDB, are multi-AZ by default. This is great, but it also means that they have more cross-AZ dependencies than one might first assume, and those shared dependencies make them more susceptible to full-region outages.
This DynamoDB outage was the first one in about 4 years… so this is not a common occurrence. Still, an outage is an outage. But if you had designed your use of DynamoDB for region-level redundancy, you could have survived this incident with zero downtime.
Happily, while region-level redundancy is not a trivial task, it really isn’t that bad, at least compared with how hard and expensive it is to build something comparable in a legacy data center world.
So where to begin? Here’s an initial checklist for thinking about region-level redundancy:
If you are doing your own replication with services running on EC2 (rather than using one of the AWS built-in features mentioned above), you will need to figure out how to keep your data private when it is moving between regions. Remember that an AWS VPC (think of it as your LAN) can only exist in a single region… so you essentially need to connect two LANs to keep your data on your own network. You could create your own tunnel between EC2 instances. Trek10’s preferred method for a more robust connection is to use a virtual networking appliance from Cohesive Networks running in EC2. AWS does have a feature called VPC Peering, but it currently only connects VPCs in the same region. AWS has stated that it plans to add cross-region support in the future.
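One practical prerequisite for connecting two LANs like this is that their address ranges must not overlap, or traffic cannot be routed between them. A minimal sketch of that check using Python's standard `ipaddress` module follows. The CIDR blocks are made-up examples, not values from any real setup.

```python
import ipaddress


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share any addresses."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)


# Hypothetical VPC ranges for two regions (illustrative values only).
primary_vpc = "10.0.0.0/16"
standby_vpc = "10.1.0.0/16"

if cidrs_overlap(primary_vpc, standby_vpc):
    print("Ranges overlap: pick a different CIDR before connecting the VPCs")
else:
    print("Ranges are distinct: safe to route between the two VPCs")
```

Running a check like this before creating the second VPC is much cheaper than renumbering a network after the tunnel is already in place.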
Whether or not you want to invest the time to automate the failover process really depends on your Recovery Time Objective (RTO). If you need an RTO of under 1-2 hours, you really should automate. The steps involved (promoting your database to master, updating DNS, and making any other app-specific changes) are relatively straightforward to script.
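As a rough illustration of what such a script could look like, here is a sketch in Python. It assumes an RDS cross-region read replica and a Route 53 CNAME record, which may not match your stack. The function names, identifiers, and hostnames are hypothetical. The AWS calls themselves (`promote_read_replica` and `change_resource_record_sets` in boto3) are real, but `boto3` is imported lazily so the rest of the sketch runs without the SDK installed.

```python
from typing import Any, Callable, Dict, List, Tuple


def build_dns_change(record_name: str, new_target: str, ttl: int = 60) -> Dict[str, Any]:
    """Build a Route 53 ChangeBatch that repoints a CNAME at the standby region."""
    return {
        "Comment": "Failover to standby region",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_target}],
                },
            }
        ],
    }


def run_failover(steps: List[Tuple[str, Callable[[], Any]]]) -> List[str]:
    """Run steps in order, stop at the first failure, return completed step names."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            print(f"step {name!r} failed: {exc}")
            break
        completed.append(name)
    return completed


def aws_steps(replica_id: str, standby_region: str, zone_id: str,
              record_name: str, new_target: str) -> List[Tuple[str, Callable[[], Any]]]:
    """Assemble the AWS failover steps (all arguments are placeholders you supply)."""
    import boto3  # lazy import: only needed when actually talking to AWS

    rds = boto3.client("rds", region_name=standby_region)
    route53 = boto3.client("route53")
    return [
        # Promote the standby region's read replica to a standalone master.
        ("promote-replica",
         lambda: rds.promote_read_replica(DBInstanceIdentifier=replica_id)),
        # Point the application's DNS name at the standby region.
        ("update-dns",
         lambda: route53.change_resource_record_sets(
             HostedZoneId=zone_id,
             ChangeBatch=build_dns_change(record_name, new_target))),
    ]
```

Calling `run_failover(aws_steps(...))` with your own identifiers would execute the steps in order. Replica promotion is not instantaneous, so a real script would likely wait for the database to become available before switching DNS, and would add whatever app-specific steps your system needs.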
And finally, remember that the safest plan for solid region-level redundancy is active-active. Keeping two regions actively serving requests is the most complex option architecturally and carries the most management overhead, but it is the best way to make certain that you will be able to stay up if one region goes down.
So remember the bottom line… full region failures for a service, while infrequent, DO happen. If you need the highest possible level of uptime, multi-region redundancy is absolutely possible. Whether or not you take the time to do it is your design choice!