Karl Robinson
May 1, 2020
Karl is CEO and Co-Founder of Logicata – he’s an AWS Community Builder in the Cloud Operations category, and AWS Certified to Solutions Architect Professional level. Knowledgeable, informal, and approachable, Karl has founded, grown, and sold internet and cloud-hosting companies.
In this post I’m going to give an overview of the AWS Well-Architected Framework, then I’ll do a deep dive on the Reliability pillar—one of the five core pillars that should underpin your AWS architecture.
An Overview of the AWS Well-Architected Framework
The AWS Well-Architected Framework comprises five pillars. Designed by AWS, this set of best practice principles helps customers compare their AWS environments against established best practices to identify areas for improvement. The framework is based on the extensive experience gleaned by AWS solutions architects over the years, across tens of thousands of AWS deployments.
The framework guides AWS customers through a series of questions that helps them understand how well their architecture aligns with best practice. The ultimate goal is to help AWS customers build environments that are secure, high-performing, resilient and efficient.
AWS offers a free-to-use Well-Architected Tool, which guides customers through the questions in relation to their specific AWS workloads and then provides a plan on how best to architect the cloud environment using established best practices.
What Are the AWS Framework Pillars?
Let’s look at the five pillars of the AWS Well-Architected Framework to understand at a high level what each pillar is about.
- Operational Excellence: the operational excellence pillar focuses on the day-to-day operations of a customer’s AWS infrastructure, including change management, deployment automation, monitoring, responding to events and defining standardized operating models.
- Security: the security pillar focuses on protecting systems and data. This includes identity and access management, security information and event management, data confidentiality and integrity, and systems access.
- Reliability: the reliability pillar focuses on architecting for failure. Rapid recovery from failure is essential for modern businesses.
- Performance Efficiency: the performance efficiency pillar focuses on the efficient use of IT and computing resources.
- Cost Optimization: the cost optimization pillar focuses on avoiding unnecessary expenditure on AWS resources.
AWS Well-Architected Framework – Reliability Pillar
The Reliability pillar of the Well-Architected Framework looks at how well a system can recover from infrastructure or service failures. It also considers how a system can automatically scale to meet demand and how disruptions, such as misconfigurations or intermittent network issues, can be mitigated.
The Reliability pillar is based on five foundational architectural principles:
Principle | Description |
---|---|
Test recovery procedures | It is much easier to simulate failure scenarios in the cloud. Automation can be used to simulate failures and to help formulate recovery plans and procedures. |
Automatically recover from failure | By defining business-level KPIs, it is possible to monitor systems and trigger automation when a KPI threshold is breached. This enables automated recovery processes to repair or work around the failure. |
Scale horizontally | Single large resources should be replaced by multiple smaller resources to minimize the impact of a failure. |
Stop guessing capacity | A common cause of IT system failure is insufficient capacity. In the cloud, resource utilization can be monitored, and resources can be added and removed automatically. |
Automate change management | All infrastructure changes should be made via automation. |
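To illustrate the "scale horizontally" and "stop guessing capacity" principles, here is a minimal boto3 sketch that attaches a target tracking scaling policy to an existing Auto Scaling group. It assumes boto3 is installed and AWS credentials are configured; the group name "web-asg" and the 50% CPU target are hypothetical examples, not recommendations.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Target tracking keeps average CPU around the target by adding or
# removing instances automatically, so capacity never has to be guessed.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",  # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,  # example target utilization
    },
)
```

With a policy like this in place, the group grows and shrinks with demand, which is exactly the behaviour the two principles above describe.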
Foundations of AWS Reliability
Limit Management
In order to build a reliable application infrastructure, it is essential to understand the potential limitations of that infrastructure. The ability to monitor when those limits are being reached is equally important, so corrective action can be taken. Limits could be CPU or RAM capacity in an instance, network throughput on a particular connection, number of connections available to a database and so on.
For example, EC2 accounts were historically limited to 20 On-Demand instances per region by default (newer accounts are governed by vCPU-based limits instead). There are many other service limits; the best places to find those that currently apply are AWS Trusted Advisor and the Service Quotas console. You can also use Amazon CloudWatch to set alarms for when limits are being approached, as well as for metrics such as EBS volume capacity, provisioned IOPS and network I/O.
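Service Quotas also exposes these limits programmatically, which makes periodic automated checks straightforward. A minimal boto3 sketch, assuming boto3 is installed and AWS credentials are configured:

```python
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")

# List the quotas currently applied to EC2 in this account and region,
# which can then be compared against observed usage.
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="ec2"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaName"]}: {quota["Value"]}')
```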
Further reading: How to Change or Upgrade an EC2 Instance Type
Networking
It is important to consider future growth requirements when architecting IP address-based networks. Amazon VPC (Virtual Private Cloud) enables customers to build out complex network architectures. It is recommended to use private address ranges (as defined by RFC 1918) for VPC CIDR blocks. Be sure to select ranges that will not conflict with ranges in use elsewhere in your network topology; the sketch after the list below shows one way to lay out a VPC along these lines.
When allocating CIDR blocks, it is important to:
- Allow IP address space for multiple VPCs per region
- Consider connections between AWS accounts—other parts of the business may operate AWS resources in separate AWS accounts, but need to interconnect with shared services
- Allow for subnets in each Availability Zone within a VPC (a subnet resides in a single AZ, so a multi-AZ design needs at least one subnet per zone)
- Leave unused CIDR block space within your VPC
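Here is a minimal boto3 sketch of that kind of layout, assuming boto3 is installed and AWS credentials are configured; the CIDR ranges and Availability Zone names are hypothetical:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A /16 from the RFC 1918 10.0.0.0/8 range leaves plenty of unused
# space in the VPC for subnets you haven't thought of yet.
vpc = ec2.create_vpc(CidrBlock="10.10.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# One subnet per Availability Zone: a subnet lives in a single AZ,
# so a three-AZ design needs at least three subnets.
for i, az in enumerate(["us-east-1a", "us-east-1b", "us-east-1c"]):
    ec2.create_subnet(
        VpcId=vpc_id,
        CidrBlock=f"10.10.{i}.0/24",
        AvailabilityZone=az,
    )
```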
You need to consider how you will connect the rest of your network with your AWS resources. Will you use VPNs? If so, how will these terminate in your VPC and how will you ensure that they are resilient and have sufficient throughput? You may wish to use AWS Direct Connect—again, how will you ensure the resilience of this connection? Perhaps you will require multiple connections back to separate locations outside of the AWS cloud.
Key AWS services for network topology include:
- Amazon Virtual Private Cloud: for creation of subnets and IP address allocation
- Amazon EC2: compute service where any required VPN appliances will run
- Amazon Route 53: Amazon’s DNS service, including health checks that can drive failover routing (a sketch follows this list)
- AWS Global Accelerator: a network acceleration service that directs traffic to optimal AWS network endpoints
- Elastic Load Balancing: load balancing at layer 4 (Network Load Balancer) or layer 7 (Application Load Balancer) that distributes traffic across healthy targets and, paired with Auto Scaling, copes with increases and decreases in demand
- AWS Shield: distributed denial of service (DDoS) mitigation, provided free of charge as Shield Standard, with an optional paid subscription (Shield Advanced) for an enhanced level of protection
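For the Route 53 health checks mentioned above, the sketch below creates an HTTPS check that could back a failover routing policy. It assumes boto3 is installed and AWS credentials are configured; the domain and path are hypothetical:

```python
import time

import boto3

route53 = boto3.client("route53")  # Route 53 is a global service

route53.create_health_check(
    CallerReference=str(time.time()),  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "app.example.com",  # hypothetical
        "Port": 443,
        "ResourcePath": "/health",  # hypothetical health endpoint
        "RequestInterval": 30,      # seconds between checks
        "FailureThreshold": 3,      # consecutive failures before unhealthy
    },
)
```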
Infrastructure High Availability
First of all, you need to decide exactly what ‘high availability’ means for your application. How much downtime can you tolerate for scheduled and unscheduled maintenance? And what budget do you have available to achieve the level of availability you desire?
There is a big difference in how you will approach the architecture of, say, an internal application that requires 99% availability, versus a mission-critical customer-facing application that requires ‘five nines’ (99.999%) availability or higher.
If you are looking to achieve five 9s availability, then every single component of your architecture will need to be able to achieve that level of availability to avoid single points of failure. This will mean adding in a lot of redundancy to the solution, which will of course add to the cost.
Five 9s availability only allows for around five minutes of downtime per year (5.26 minutes, to be precise). This is virtually impossible to achieve without a high degree of automated deployment and automated recovery from failure; human intervention simply won’t be able to keep up. Any changes to the environment need to be thoroughly tested in a full-scale non-production environment, which in itself will significantly add to the overall infrastructure cost.
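The arithmetic behind those numbers is worth internalizing. Here is a quick Python sketch of downtime budgets for common availability targets, and of why chains of dependencies reduce availability while redundancy improves it:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

# The downtime budget shrinks tenfold with each extra nine.
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {MINUTES_PER_YEAR * (1 - target):.1f} min/year")

# Components in series multiply, so availability drops:
series = 0.999 * 0.999            # two 99.9% dependencies ~ 99.8% overall
# Redundant components in parallel fail only if all fail at once:
parallel = 1 - (1 - 0.99) ** 2    # two 99% components ~ 99.99% overall
print(f"series: {series:.4%}, parallel: {parallel:.4%}")
```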
The table below lists common sources of service interruption, which need to be considered in any high availability design:
Category | Description |
---|---|
Hardware | Failure of any hardware component e.g. storage, server, network |
Deployment | Failure of any automated or manual deployments to application code, hardware, network or configuration |
Load | Saturation of any component of the application or of the overall infrastructure itself |
Data | Corrupt data accepted into the system that cannot be processed |
Expired credentials | Expiration of a certificate or credentials e.g. SSL certificate expiry |
Dependency | Failure of a dependent service |
Infrastructure | Power supply or HVAC failure impacting hardware availability |
Identifier exhaustion | Exceeding available capacity, hitting throttling limits, etc. |
Application High Availability
There’s no point designing a five 9s availability infrastructure if the application itself cannot achieve that level of availability. Here are four things to consider when designing a highly available application:
- Fault Isolation Zones: in AWS terms this can mean architecting your application to leverage multiple Regions and Availability Zones. Regions are geographic locations around the globe that contain two or more Availability Zones (AZ). Availability Zones are physically separate datacentres within a region with isolated power, network and cooling. So, in theory, no two Availability Zones should fail at the same time.
- Redundant Components: component redundancy starts right down at the hardware level, with redundant power supplies, hard drives and network interfaces. But it then extends up the stack to the server level, e.g. multiple web servers, multi-AZ or multi-region databases and so on.
- Microservices: service-oriented architecture, where software applications are broken down into smaller, independent service units. Read more about this on our blog post: AWS Microservices.
- Recovery Oriented Computing: Recovery Oriented Computing (ROC) focuses on having the right monitoring in place to detect all types of failure, and on automating recovery procedures so that systems recover without human intervention; a minimal sketch of this pattern follows this list.
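As a concrete example of recovery-oriented automation, the sketch below creates a CloudWatch alarm that invokes the built-in EC2 recover action when an instance’s system status check fails. It assumes boto3 is installed, AWS credentials are configured, and an instance already exists; the instance ID is hypothetical:

```python
import boto3

REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# If the system status check fails for two consecutive minutes,
# CloudWatch triggers the EC2 recover action, which moves the
# instance onto healthy underlying hardware.
cloudwatch.put_metric_alarm(
    AlarmName=f"auto-recover-{INSTANCE_ID}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:recover"],
)
```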
Operational Considerations for Reliability
- Deployment: where possible, deployments should be automated using a deployment methodology (e.g. Blue-Green, Canary, Feature Toggles or Failure Isolation Zone deployments) to decrease the risk of failure.
- Testing: testing should be carried out to match availability goals. One of the most effective methods is canary testing, which runs constantly and simulates customer behaviour; a simple sketch follows this list.
- Monitoring and Alerting: deep monitoring of both your infrastructure and your application is essential to meet availability goals. You need to know the status and availability of each component of the infrastructure and application, as well as the overall user experience being delivered.
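To make the canary idea concrete, here is a toy probe rather than a production tool; the URL and latency budget are hypothetical, and CloudWatch Synthetics offers a managed version of this pattern:

```python
import time
import urllib.request

URL = "https://app.example.com/health"  # hypothetical endpoint
LATENCY_BUDGET_SECONDS = 1.0            # example budget

# Probe the endpoint once a minute, flagging slow or failing responses
# just as a real customer request would experience them.
while True:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=5) as response:
            elapsed = time.monotonic() - start
            if response.status != 200 or elapsed > LATENCY_BUDGET_SECONDS:
                print(f"DEGRADED: status={response.status} latency={elapsed:.2f}s")
    except Exception as exc:
        print(f"FAILED: {exc}")
    time.sleep(60)
```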
So, we’ve touched on a number of the different elements of the AWS Reliability pillar to get you thinking about the architecture of your AWS infrastructure and applications. AWS has a great white paper, which goes into a lot more detail and lists out some hypothetical examples to illustrate some of the concepts.
If you need any help, either in reviewing your current AWS infrastructure against the AWS Well-Architected Framework or in designing highly available AWS systems, then Logicata is more than happy to help. Our AWS Managed Services ensure continuous improvement of your application infrastructure against the Well-Architected Framework. Please reach out to us for more information.