dev-resources.site

Building a multi-region highly available identity provider with the AWS cloud and Ory Hydra

Published at
11/7/2023
Categories
aws
resiliency
cloudskills
highavailability
Author
derekberger

AsurionID is an OpenID Connect (OIDC) compatible identity provider. It lets Asurion developers integrate identity and access management into their applications using a standard protocol (OIDC) and open-source libraries. Our requirements included a custom user experience and low cost, so we decided to build our own solution rather than adopt an off-the-shelf product. We built AsurionID on AWS using open-source Ory Hydra and custom microservices.

High availability using multi-AZ in a single region

As shown in the diagram below, in AsurionID's initial architecture the microservices ran on Amazon Elastic Kubernetes Service (EKS) across three Availability Zones (AZs) in a single region. Amazon ElastiCache for Redis, used for storing temporary session data, was deployed across two AZs (primary in one AZ and replica in another). We used Amazon Aurora multi-AZ features to protect the database against AZ-level failures.

Multi-AZ high availability

This provided AsurionID with availability of up to three nines (99.9%) in a single region. As more applications adopted AsurionID for identity and access management, it became more critical to our business. We wanted to protect AsurionID against region-level service disruptions, which are less frequent than AZ-level failures but can have a greater impact. That is what led us to a multi-region architecture.

Designed for protection against regional service disruptions

In our latest architecture, all microservices now run in active-active mode, in two EKS clusters, across two AWS regions. With active-active, both regions' services are always live and taking traffic, and we use Route 53 weighted routing to distribute customer traffic between the two regions.
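As a rough illustration of how weighted routing splits traffic between two live regions, here is a minimal Python sketch. It is a simplified model, not Route 53 itself: the region labels, the weights, and the `pick_region` helper are hypothetical, and the model ignores details of real DNS resolution such as TTL caching.

```python
import random
from dataclasses import dataclass


@dataclass
class WeightedRecord:
    region: str           # hypothetical region label
    weight: int           # relative share of traffic
    healthy: bool = True  # outcome of the record's health check


def pick_region(records, rng):
    """Pick a region in proportion to its weight, skipping unhealthy records."""
    candidates = [r for r in records if r.healthy and r.weight > 0]
    if not candidates:
        raise RuntimeError("no healthy region available")
    weights = [r.weight for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0].region


records = [WeightedRecord("region-a", 50), WeightedRecord("region-b", 50)]
rng = random.Random(0)

# With both regions healthy, traffic is split roughly evenly.
picks = [pick_region(records, rng) for _ in range(10_000)]
share_a = picks.count("region-a") / len(picks)

# When one region's health check fails, all traffic goes to the other.
records[0].healthy = False
failover_picks = {pick_region(records, rng) for _ in range(1_000)}
```

With equal weights each region receives about half of the requests, and marking one record unhealthy shifts all requests to the remaining region without any change to the records themselves.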

Multi-region, active-active microservices

We use Route 53 inverted health checks, following the Secondary Takes Over Primary (STOP) pattern, to handle failover if the microservices encounter a region-level disruption.

In our implementation of STOP, we associate the weighted DNS records with the inverted health checks, and each health check with an S3 object. We trigger a health check failure for a particular DNS record by uploading its associated object. The failing health check stops Route 53 from forwarding requests to the associated regional Application Load Balancer (ALB).
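The logic of an inverted health check tied to an S3 object can be sketched in a few lines of Python. This is an in-memory model only, under the assumption that the underlying probe succeeds when the object exists; the `FailoverSwitch` class and the region labels are hypothetical, and no AWS calls are made.

```python
def inverted_health_check(object_exists: bool) -> bool:
    """Return True (healthy) only while the probed object is absent.

    The underlying probe succeeds when the object exists; inverting the
    result means that uploading the object makes the check report unhealthy.
    """
    probe_succeeded = object_exists
    return not probe_succeeded


class FailoverSwitch:
    """In-memory stand-in for one S3 object per regional DNS record."""

    def __init__(self, regions):
        self._object_present = {region: False for region in regions}

    def trip(self, region):
        # Stands in for uploading the object associated with this region.
        self._object_present[region] = True

    def reset(self, region):
        # Stands in for deleting the object again.
        self._object_present[region] = False

    def healthy_regions(self):
        return [
            region
            for region, present in self._object_present.items()
            if inverted_health_check(present)
        ]


switch = FailoverSwitch(["region-a", "region-b"])
before = switch.healthy_regions()  # both regions receive traffic
switch.trip("region-a")            # take region-a out of rotation
after = switch.healthy_regions()   # only region-b remains
```

The point of the inversion is that the normal state requires nothing to exist: failover is triggered by a single object upload rather than by editing DNS records.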

STOP pattern for failing over microservices

With this approach, we have achieved static stability and independence from the Route 53 control plane for failing over our microservices, which has resulted in higher availability for AsurionID microservices, up to four nines (99.99%).
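To make the difference between three nines and four nines concrete, the short Python sketch below converts an availability target into a yearly downtime budget. It is plain arithmetic and assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


three_nines = downtime_minutes_per_year(0.999)   # about 525.6 minutes (~8.8 hours)
four_nines = downtime_minutes_per_year(0.9999)   # about 52.6 minutes
```

Moving from 99.9% to 99.99% shrinks the yearly downtime budget by a factor of ten, from roughly 8.8 hours to under an hour.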

We have taken a slightly different approach for the caching layer. Since we cache only ephemeral data such as one-time passcodes (OTPs), we do not replicate this data to the secondary region. Instead, we keep a second ElastiCache for Redis cluster always running in the secondary region. If an AWS regional service interruption impairs our caching layer, we invoke failover using STOP, just as we do for our microservices.

Multi-region caching architecture

This new architecture has helped us achieve static stability and control plane independence for the caching layer as well as the application layer.

For the database, we use Aurora Global Database with a read replica in the secondary region.

Aurora Global database

In case of a region-level Aurora impairment, we would promote the second region's instance to primary.

Future enhancements

We now want the same static stability and control-plane independence in the database layer that we have in our microservices and caching layers. In our current database architecture, promoting the read replica triggers a Lambda function that updates Route 53 CNAME values (a control plane operation) to route all application traffic to the new primary database cluster. We are looking for new approaches to database failover that use only data plane operations.
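For readers unfamiliar with this kind of DNS update, the Python sketch below builds the change batch that such a function might submit. It is not our Lambda code: the record name, endpoint, TTL, and the `build_cname_upsert` helper are hypothetical. With boto3, a dictionary of this shape would be passed as `ChangeBatch` to the Route 53 client's `change_resource_record_sets` call together with a `HostedZoneId`; the sketch only constructs the dictionary and makes no AWS calls.

```python
def build_cname_upsert(record_name: str, new_primary_endpoint: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that repoints a CNAME at a new endpoint."""
    return {
        "Comment": "Point application traffic at the promoted database cluster",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite it
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_primary_endpoint}],
                },
            }
        ],
    }


# Hypothetical names, used for illustration only.
batch = build_cname_upsert(
    record_name="db.example.internal.",
    new_primary_endpoint="secondary-cluster.example.com",
)
```

Because this update goes through the Route 53 control plane, failover depends on that control plane being available at the moment of the failure, which is the dependency a data plane approach would remove.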

One potential option is Amazon Route 53 Application Recovery Controller (ARC). Route 53 ARC works with Route 53 health checks to enable failover using the data plane, with the extra capability of checking the standby database to ensure it is ready for failover. ARC can also fail over an entire application stack in one operation, so we could extend it to our cache and microservice layers.

Conclusion

In this article, we have walked you through how AsurionID started out with a multi-AZ approach to high availability and how we further improved availability with a multi-region architecture. Our architecture protects AsurionID against regional AWS service disruptions, achieves static stability, and uses data plane functions for failing over the microservices and caching layers.

While the primary goals of our multi-region architecture were improved availability and resiliency, the architecture has provided the team with even more benefits. We can now perform releases and infrastructure upgrades during business hours without impacting customers by routing traffic to one region while performing tasks in the other. The ability to perform critical operations during the day has improved the quality of life for the engineers. Of course, we could have realized these capabilities with a single-region architecture, but for us, they became additional benefits of a multi-region architecture.


Asurion is a leading tech care company that provides device protection, tech support, repair, and replacement services to 300 million customers worldwide. It partners with mobile carriers, retailers, and device manufacturers to deliver innovative solutions for smartphones, tablets, computers, and home appliances in over 20 countries worldwide. Asurion is headquartered in Nashville, TN.
