The Secret Behind Non-disruptive Cloud Infrastructure Upgrade
Samsung Account is a global account service that brings the Samsung universe together, from all Samsung services to online and offline stores. It handles large-scale traffic with security and reliability. Because it is a core Samsung service, all tasks on Samsung Account, from general service deployments to cloud infrastructure upgrades, must be carried out without interruption to the service. This blog introduces the architecture designed for an Elastic Kubernetes Service upgrade and shares our experience with upgrading the cloud infrastructure without interruptions to the high-traffic Samsung Account service.
What is Samsung Account?
Samsung Account is an account service that brings together more than 60 services and applications in 256 countries with over 1.7 billion user accounts. It is used for Samsung Electronics services including Samsung Pay, SmartThings, and Samsung Health, as well as for authentication on various devices such as mobile, wearable, TV, and PC. Samsung Account helps deliver a secure and reliable customer experience with one account across a variety of contact points, from online stores (such as Samsung.com) and offline stores to our customer services.
Evolution of current Samsung Account architecture
As the number of user accounts and connected services has grown, the infrastructure and service of Samsung Account have also evolved. It switched to the AWS-based cloud for service stability and efficiency in 2019, and currently serves 4 regions: 3 global regions (EU, US, AP) and China. Currently, Samsung Account consists of more than 70 microservices. In 2022, Samsung Account moved to a Kubernetes base in order to reliably support microservices architecture (MSA). Kubernetes is an open-source orchestration platform that supports the easy deployment, scaling, and management of containerized applications.
In 2023, Samsung Account reinforced disaster recovery (DR) to be able to provide failover across global regions, and expanded the AP region to improve user experience. In other words, Samsung Account has repeatedly evolved its infrastructure and services, and is currently running stably with traffic over 2.7 million requests per second (RPS) and over 200k DB transactions per second (TPS).
Each AWS-based Samsung Account region, with its own virtual private cloud (VPC), is accessible through user devices, server-to-server, or the web. In particular, the web access provides a variety of features such as Samsung.com and TV QR login on AWS CloudFront, a content delivery network (CDN). Samsung Account microservices run in containers within Elastic Kubernetes Service (EKS) clusters, and internal communication between regions uses VPC peering.
Samsung Account uses several managed services from AWS to deliver various features. It uses Aurora, DynamoDB, and Managed Streaming for Apache Kafka (MSK) as storage to build data sync between regions, and it provides account services based on different managed services including ElastiCache, Pinpoint, and Simple Queue Service (SQS).
Let's elaborate on the AWS managed services that Samsung Account uses. The first is EKS, which is a Kubernetes service for running over 70 microservices on MSA. Next, Aurora is used to save and query data as an RDB, and DynamoDB does the same but as a NoSQL database. Along with them, ElastiCache (Redis OSS) is used to manage cache and sessions, and MSK handles delivering events from integrated services and data sync. If you're building an AWS-based service yourself, you would probably use these managed services as well.
Frustrating upgrades contrasting the convenience of managed services
There is a major challenge to consider when you use these managed services, though. End of support comes, on average, after 1.5 years for EKS and 2 years for Aurora.
Various other services like ElastiCache and MSK face the same problem. Such service support termination is natural for AWS, but upgrading these services when support ceases is often a painful task for those running them. Because operational resources are often reduced upon switching to the cloud, large-scale upgrades that come around every 1 or 2 years have to be performed without enough resources for emergency response.
These managed service upgrades put a major burden on Samsung Account. More than 60 integrated services have to be kept running without interruption while the upgrades are rolled out across a total of 4 regions. On top of that, Samsung Account is developing and running more than 70 microservices, so a significant amount of support and cooperation from development teams is required. The most challenging of all is that the upgrades need to be performed while dealing with traffic of over 2.7M RPS and DB traffic of 200k TPS.
EKS upgrade sequence and restrictions
You might think upgrading EKS on AWS is easy. In general, when upgrading EKS, you start with the control plane, including etcd and the APIs that manage EKS. Afterwards, you move to the data plane, where the actual service pods run, and finally to the EKS add-ons. In theory, it is possible to upgrade EKS following this sequence without any impact on service operation.
However, there are restrictions to general EKS upgrades. If an upgrade fails in any of the 3 steps above due to missing EKS API specs or incompatibility issues, a rollback is not available at all. In addition, it is difficult to do a compatibility check for the services and add-ons in advance.
Multi-cluster architecture for non-disruptive EKS upgrades
After much thought, Samsung Account decided to go with a simple but reliable option to perform EKS upgrades. It's possible that many other services are using a similar way to upgrade EKS or run actual services.
Samsung Account chose to upgrade EKS based on a multi-cluster architecture with 2 EKS clusters. The architecture is built to enable an existing EKS version to continue providing the service, while a new EKS version on a separate cluster performs a compatibility validation with various microservices and add-ons before receiving traffic. The advantage of this method is that you can implement a rollback plan where the old EKS version takes over the traffic if any issues occur when switching to the new EKS version.
A lesson we have learned from providing the Samsung Account service under high traffic is that there will be issues when you actually start processing traffic, no matter how perfectly you've built your infrastructure or service. For these reasons, it is essential to have a rollback plan in place whenever you deploy a service or upgrade your infrastructure.
When you perform a multi-cluster upgrade, traffic must be switched between the old and new EKS clusters. Simply put, there are 2 main approaches. One approach is to switch traffic by placing a proxy server between the 2 clusters. The other approach is to switch the target IP using DNS. Of course, there may be a variety of other ways to accomplish this.
In the first option, using a proxy server, you may encounter overload issues when handling high-volume traffic, such as that of Samsung Account. Additionally, there are too many application load balancers (ALBs) used for approximately 70 microservices, making it impractical to create a proxy server for each ALB.
In the second option, using DNS, the actual user, client, and server replace the service IP of the old EKS with that of the new EKS during a DNS lookup, redirecting requests to a different target at the user level. The DNS option does not require a proxy server, and switching traffic is as easy as editing the DNS record. However, there is a risk that the traffic switch might not happen immediately due to propagation-related delays with DNS.
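To make the propagation risk a little more concrete, here is a minimal sketch of an idealized caching model. It assumes that resolvers cache a record for its full TTL and that cache expiry times are spread uniformly over one TTL window; under that assumption the share of resolvers that have picked up a changed record grows linearly until one TTL has passed. The function names and the TTL values are hypothetical and are not taken from the Samsung Account setup. Real clients that ignore TTLs or hold long-lived connections would add a tail that this model does not capture.

```python
def fraction_switched(seconds_since_change: float, ttl_seconds: float) -> float:
    """Share of caching resolvers that have refreshed a changed DNS record.

    Idealized assumption: cache expiries are spread uniformly over one TTL
    window, so the share grows linearly and reaches 1.0 after one full TTL.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    if seconds_since_change <= 0:
        return 0.0
    return min(1.0, seconds_since_change / ttl_seconds)


def seconds_until(share: float, ttl_seconds: float) -> float:
    """Time needed for the given share of resolvers to refresh (same model)."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return share * ttl_seconds


if __name__ == "__main__":
    # Illustrative TTL of 60 seconds (an assumption, not a value from this post).
    for elapsed in (0, 15, 30, 60, 120):
        print(elapsed, fraction_switched(elapsed, ttl_seconds=60))
```

In this model a shorter TTL shortens the window in which old and new targets are both being served, at the cost of more DNS lookups.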
The DNS-based traffic switch architecture was applied to achieve a non-disruptive EKS upgrade for Samsung Account. Let us describe the DNS layers of Samsung Account with a hypothetical example. The top domain is account.samsung.com, and there are 3 global region domains under it, classified based on latency or geolocation. For us.account.samsung.com, the layers are split into service.us-old-eks.a.s.com and service.us-new-eks.a.s.com, representing the old and new domains. This is a simple, hypothetical example. In reality, Samsung Account uses more DNS layers.
During the recent EKS upgrade, we switched traffic between the internal domains of the 2 EKS clusters based on weighted records while adjusting the ratio, rather than switching all at once. For instance, when a user sends a request to account.samsung.com, the lookup goes through us.account.samsung.com, and the actual EKS service IP is returned at the end based on the specified weight.
Retrospective of the non-disruptive EKS upgrade
In summary, I would say "it's a successful upgrade if the connected services haven't noticed." With this EKS upgrade, we deployed and switched traffic for a total of 3 regions, 6 EKS clusters, and more than 210 microservices over the course of one month. The traffic switch was conducted with ratios set based on each service's load and characteristics, and no issues with connected services were reported during this one-month EKS upgrade.
Of course, as they say, "it's not over until it's over." We did have a minor incident where the internal subnet ran short of internal IPs because many EKS nodes and service pods became active simultaneously, which scared us for a moment. We secured the IP resources by reducing the number of pods for kubelet and add-ons by about a thousand and quickly scaling up the EKS nodes.
One thing we realized while switching traffic with DNS is that 99.9% of the entire traffic can be switched within 5 minutes when the DNS weight is adjusted.
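As a rough illustration of the weighted switch described in this section, the following sketch simulates how lookups split between the old and new cluster domains as the weights change, and how a rollback restores the old cluster when a check fails. It is a simulation only, not the DNS provider's API: the functions `resolve`, `observed_share`, and `ramp`, the ramp percentages, and the health-check callback are all hypothetical, and the domain names are the hypothetical ones from the example above.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical domain names, reused from the example in the text.
OLD = "service.us-old-eks.a.s.com"
NEW = "service.us-new-eks.a.s.com"


def resolve(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick one target with probability weight / sum(weights)."""
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets], k=1)[0]


def observed_share(weights: Dict[str, int], lookups: int = 10_000, seed: int = 0) -> float:
    """Fraction of simulated lookups that land on the new cluster."""
    rng = random.Random(seed)
    hits = Counter(resolve(weights, rng) for _ in range(lookups))
    return hits[NEW] / lookups


def ramp(steps: List[int], is_healthy: Callable[[int], bool]) -> List[Dict[str, int]]:
    """Apply new-cluster percentages in order; roll back on a failed check.

    Returns the history of weight settings. If is_healthy returns False at
    any step, the last entry sends all traffic back to the old cluster.
    """
    history = [{OLD: 100, NEW: 0}]  # everything starts on the old cluster
    for pct in steps:
        history.append({OLD: 100 - pct, NEW: pct})
        if not is_healthy(pct):
            history.append({OLD: 100, NEW: 0})  # rollback to the old cluster
            break
    return history


if __name__ == "__main__":
    # Illustrative ramp; real ratios would depend on each service's load.
    for weights in ramp([1, 10, 50, 100], lambda pct: True):
        print(weights, round(observed_share(weights), 3))
```

Keeping the old cluster's record in place with a weight of 0, rather than deleting it, is what makes the rollback a single weight change in this sketch.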
Closing note
Richard Branson, co-founder of Virgin Group, once said, "You don't learn to walk by following rules. You learn by doing, and by falling over." Samsung Account has been growing and evolving, addressing many bumps along the way. We continue to resolve various challenges with the stability of our service as the priority, keeping this "learning while falling over" spirit in mind. Thank you.