• re:viewing AWS re:Invent

    16 Apr 2015 in reinvent | permalink

    We previously wrote about how Route 53 organizes software deployments. We thought it'd be worthwhile to highlight an architecture track talk at AWS re:Invent 2014. In this talk, Chris Munns, an AWS Solutions Architect, discusses various deployment techniques and some of the new AWS services customers can leverage to improve their deployment strategies.

    A useful reference is this video on CodeDeploy and GitHub integration:

  • Exponential Backoff And Jitter

    Introducing OCC

    Optimistic concurrency control (OCC) is a time-honored way for multiple writers to safely modify a single object without losing writes. OCC has three nice properties: it will always make progress as long as the underlying store is available, it’s easy to understand, and it’s easy to implement. DynamoDB’s conditional writes make OCC a natural fit for DynamoDB users, and it’s natively supported by the DynamoDBMapper client.
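
    As an illustration (a minimal sketch, not the DynamoDBMapper implementation), an OCC update loop against DynamoDB using conditional writes might look like the following; the table layout, the "version" attribute, and the transform callback are illustrative and assume the item already exists:

    import boto3

    dynamodb = boto3.client("dynamodb")

    def occ_update(table, key, transform):
        """Read-modify-write with a conditional put; retry until the write wins."""
        while True:
            item = dynamodb.get_item(TableName=table, Key=key)["Item"]
            version = int(item["version"]["N"])
            new_item = transform(dict(item))
            new_item["version"] = {"N": str(version + 1)}
            try:
                # The write succeeds only if nobody else bumped the version first.
                dynamodb.put_item(
                    TableName=table,
                    Item=new_item,
                    ConditionExpression="version = :expected",
                    ExpressionAttributeValues={":expected": {"N": str(version)}},
                )
                return new_item
            except dynamodb.exceptions.ConditionalCheckFailedException:
                continue  # lost the race; re-read and try again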

    While OCC is guaranteed to make progress, it can still perform quite poorly under high contention. The simplest of these contention cases is when a whole lot of clients start at the same time, and try to update the same database row. With one client guaranteed to succeed every round, the time to complete all the updates grows linearly with contention.

    For the graphs in this post, I used a small simulator to model the behavior of OCC on a network with delay (and variance in delay), against a remote database. In this simulation, the network introduces delay with a mean of 10ms and variance of 4ms. The first simulation shows how completion time grows linearly with contention. This linear growth is because one client succeeds every round, so it takes N rounds for all N clients to succeed.

    Unfortunately, that’s not the whole picture. With N clients contending, the total amount of work done by the system increases with N².

    Adding Backoff

    The problem here is that N clients compete in the first round, N-1 in the second round, and so on. Having every client compete in every round is wasteful. Slowing clients down may help, and the classic way to slow clients down is capped exponential backoff. Capped exponential backoff means that clients multiply their backoff by a constant after each attempt, up to some maximum value. In our case, after each unsuccessful attempt, clients sleep for:

    sleep = min(cap, base * 2 ** attempt)
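
    Expressed as runnable Python, the capped backoff might wrap the write attempt like this (a sketch; base, cap, and do_conditional_write are illustrative names, not part of any particular client library):

    import time

    def write_with_backoff(do_conditional_write, base=0.05, cap=5.0):
        """Retry a conditional write, sleeping a capped exponential time between attempts."""
        attempt = 0
        while not do_conditional_write():
            sleep = min(cap, base * 2 ** attempt)  # capped exponential backoff, no jitter yet
            time.sleep(sleep)
            attempt += 1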
    

    Running the simulation again shows that backoff helps a small amount, but doesn’t solve the problem. Client work has only been reduced slightly.

    The best way to see the problem is to look at the times these exponentially backed-off calls happen.

    It’s obvious that the exponential backoff is working, in that the calls are happening less and less frequently. The problem also stands out: there are still clusters of calls. Instead of reducing the number of clients competing in every round, we’ve just introduced times when no client is competing. Contention hasn’t been reduced much, although the natural variance in network delay has introduced some spreading.

    Adding Jitter

    The solution isn’t to remove backoff. It’s to add jitter. Initially, jitter may appear to be a counter-intuitive idea: trying to improve the performance of a system by adding randomness. The time series above makes a great case for jitter – we want to spread out the spikes to an approximately constant rate. Adding jitter is a small change to the sleep function:

    sleep = random_between(0, min(cap, base * 2 ** attempt))
    

    That time series looks a whole lot better. The gaps are gone, and beyond the initial spike, there’s an approximately constant rate of calls. It’s also had a great effect on the total number of calls.

    In the case with 100 contending clients, we’ve reduced our call count by more than half. We’ve also significantly improved the time to completion, when compared to un-jittered exponential backoff.

    There are a few ways to implement these timed backoff loops. Let’s call the algorithm above “Full Jitter”, and consider two alternatives. The first alternative is “Equal Jitter”, where we always keep some of the backoff and jitter by a smaller amount:

    temp = min(cap, base * 2 ** attempt)
    sleep = temp / 2 + random_between(0, temp / 2)
    

    The intuition behind this one is that it prevents very short sleeps, always keeping some of the slowdown from the backoff. A second alternative is “Decorrelated Jitter”, which is similar to “Full Jitter”, but we also increase the maximum jitter based on the last random value.

    sleep = min(cap, random_between(base, sleep * 3))
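
    For concreteness, here is a hedged Python sketch of the three jittered sleep computations above; random_between maps to random.uniform, and the base and cap parameters are illustrative:

    import random

    def full_jitter(base, cap, attempt):
        return random.uniform(0, min(cap, base * 2 ** attempt))

    def equal_jitter(base, cap, attempt):
        temp = min(cap, base * 2 ** attempt)
        return temp / 2 + random.uniform(0, temp / 2)

    def decorrelated_jitter(base, cap, last_sleep):
        # Depends on the previous sleep value rather than the attempt number;
        # start with last_sleep = base.
        return min(cap, random.uniform(base, last_sleep * 3))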
    

    Which approach do you think is best?

    Looking at the amount of client work, the number of calls is approximately the same for “Full” and “Equal” jitter, and higher for “Decorrelated”. All of them cut down work substantially relative to the two approaches without jitter.

    The no-jitter exponential backoff approach is the clear loser. It not only takes more work, but also takes more time than the jittered approaches. In fact, it takes so much more time we have to leave it off the graph to get a good comparison of the other methods.

    Of the jittered approaches, “Equal Jitter” is the loser. It does slightly more work than “Full Jitter”, and takes much longer. The decision between “Decorrelated Jitter” and “Full Jitter” is less clear. The “Full Jitter” approach uses less work, but slightly more time. Both approaches, though, present a substantial decrease in client work and server load.

    It’s worth noting that none of these approaches fundamentally changes the N² nature of the work to be done, but they do substantially reduce work at reasonable levels of contention. The return on implementation complexity of using jittered backoff is huge, and it should be considered a standard approach for remote clients.

    All of the graphs and numbers from this post were generated using a simple simulation of OCC behavior. You can get our simulator code on GitHub, in the aws-arch-backoff-simulator project.

    - Marc Brooker

  • Internet Routing and Traffic Engineering

    15 Dec 2014 in performance, routing | permalink

    Internet Routing

    Internet routing today is handled through the use of a routing protocol known as BGP (Border Gateway Protocol). Individual networks on the Internet are represented as an autonomous system (AS). An autonomous system has a globally unique autonomous system number (ASN), which is allocated by a Regional Internet Registry (RIR); the RIRs also handle allocation of IP addresses to networks. Each individual autonomous system establishes BGP peering sessions to other autonomous systems to exchange routing information. A BGP peering session is a TCP session established between two routers, each one in a particular autonomous system. This BGP peering session rides across a link, such as a 10 Gigabit Ethernet interface between those routers. The routing information contains an IP address prefix and subnet mask, which maps those IP addresses to the autonomous system number that originates them (the AS origin). Routing information propagates across these autonomous systems based upon policies that individual networks define.

    Routing with BGP on the Internet

    This is where things get a bit interesting because various factors influence how routing is handled on the Internet. There are two main types of relationships between autonomous systems today: Transit and Peering.

    Transit is where an autonomous system pays an upstream network (known as a transit provider) to accept its traffic and forward it onward toward its destination. It also provides for the purchasing autonomous system (the customer in this relationship) to have its routing information propagated to the transit provider's adjacencies. Transit involves obtaining direct connectivity from a customer network to an upstream transit provider network. These connections can be multiple 10 Gigabit Ethernet links between each other's routers. Transit pricing is based upon network utilization in the dominant direction with 95th percentile billing. A transit provider will look at a month's worth of utilization and bill on the 95th percentile of utilization in the traffic-dominant direction. The unit used in billing is measured in bits per second (bps) and is communicated as a price per Mbps (for example, $2 per Mbps).
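
    As a worked illustration of 95th percentile billing (a sketch with made-up numbers): given a month of utilization samples for the dominant direction, typically taken at five-minute intervals, the provider discards the top 5% of samples and bills on the highest remaining value:

    def transit_bill(samples_mbps, price_per_mbps):
        """samples_mbps: per-interval utilization (in Mbps) for the dominant direction."""
        ordered = sorted(samples_mbps)
        index = int(len(ordered) * 0.95) - 1      # sample at the 95th percentile
        billable = ordered[max(index, 0)]
        return billable * price_per_mbps

    # e.g. a month of samples peaking at 1200 Mbps but with a 95th percentile of 800 Mbps
    # would be billed as 800 * $2 = $1600 at $2 per Mbps.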

    Transit

    Peering is where an autonomous system connects to another autonomous system and the two agree to exchange traffic (and routing information) for their own networks and for any customers (transit customers) they have. With peering, connectivity is formed in one of two ways. The first is where direct connectivity is established between the individual networks' routers with multiple 10 Gigabit Ethernet or 100 Gigabit Ethernet links. This sort of connectivity is known as "private peering" or PNI (Private Network Interconnect). This sort of connection provides both parties with clear visibility into the interface utilization of traffic in both directions (inbound and outbound). The other form of peering is established via Internet Exchange switches, or IXs. With an Internet Exchange, multiple networks obtain direct connectivity into a set of Ethernet switches. Individual networks can establish BGP sessions across this exchange with other participants. The benefit of the Internet Exchange is that it allows multiple networks to connect to a common location and use it for one-to-many connectivity. A downside is that any given network does not have visibility into the network utilization of other participants.

    Peering

    Most networks will deploy their network equipment (routers, Dense Wavelength Division Multiplexing (DWDM) transport equipment) into colocation facilities, where networks establish direct connectivity to each other. This can be via Internet Exchange switches (which are also found in these colocation facilities) or direct connections, which are fiber optic cables run between the individual suites/racks where the network gear is located.

    Public and Private Peering

    Routing Policy

    Networks will define their routing policy to prefer routing to other networks based upon a variety of factors. The BGP best-path decision process in a router's operating system dictates how a router will prefer one BGP path over another. Network operators write their policy to influence that BGP best-path decision process based upon factors such as the cost to deliver traffic to a destination network in addition to performance.

    A typical routing policy within most networks will dictate that internal routes (their own) and routes learned from their own customers are preferred over all other paths. After that, most networks will prefer peering routes, since peering is typically free and can often provide a shorter, more optimal path to reach a destination. Finally, the least-preferred routes to a destination are those over paid transit links. When it comes to transit paths, both cost and performance are typically factors in determining how to reach a destination network.

    Routing policies themselves are defined on routers in a simple text-based policy language that is specific to the router operating system. They contain two types of functions: matching on one or multiple routes, and an action for that match. The matching can include a list of actual IP prefixes and subnet lengths, ASN origins, AS paths or other types of BGP attributes (communities, next-hop, etc.). The actions can include setting BGP attributes such as local-preference, Multi-Exit Discriminator (MED) and various other values (communities, origin, etc.). Below is a simplified example of a routing policy applied to routes learned from a transit provider. It has multiple terms to permit an operator to match on specific Internet routes and set a different local-preference value to control what traffic should be forwarded through that provider. There are additional actions to set other BGP attributes related to classifying the routes so they can be easily identified and acted upon by other routers in the network.

    policy-statement TRANSIT-1-IN {
        term PREFER-OVER-PEERING {
            from as-path-group TRANSIT-1-OVERRIDE;
            then {
                metric 1000;
                local-preference 2010;
                community set TRANSIT;
                community add LOCATION;
                accept;
            }
        }
        term PREFER-OVER-OTHER-TRANSIT {
            from as-path-group TRANSIT-1-HIGH-PREF;
            then {
                metric 1000;
                local-preference 1010;
                community set TRANSIT;
                community add LOCATION;
                accept;
            }
        }
        term DEPREF-OTHER-TRANSIT {
            from as-path-group TRANSIT-1-LOW-PREF;
            then {
                metric 1000;
                local-preference 990;
                community set TRANSIT;
                community add LOCATION;
                accept;
            }
        }
        term DEFAULT-TERM {
            then {
                metric 1000;
                local-preference 1000;
                community set TRANSIT;
                community add LOCATION;
                accept;
            }
        }
    }
    

    Network operators will tune their routing policy to determine how to send traffic and how to receive traffic through adjacent autonomous systems. This practice is generally known as BGP traffic-engineering. Making outbound traffic changes is by far the easiest to implement, because it involves identifying the particular routes you are interested in directing and increasing the routing preference to egress through a particular adjacency. Operators must take care to examine their traffic and routing before and after any policy change to understand the impact of their actions.

    Inbound traffic-engineering is a bit more difficult, as it requires a network operator to alter the routing announcements leaving their network in order to influence how other autonomous systems on the Internet prefer to route to them. While influencing the networks directly adjacent to you is somewhat trivial, influencing networks beyond those directly connected can be tricky. This technique requires the use of features that a transit provider can grant via BGP. In the BGP protocol, there is a type of attribute known as communities. Communities are values you can pass in a routing update across BGP sessions. Most networks use communities to classify routes as transit vs. peer vs. customer. The transit-customer relationship usually grants the customer certain capabilities to control the further propagation of routes to the provider's adjacencies. This gives a network the ability to traffic-engineer further upstream, to networks it is not directly connected to.

    Traffic-engineering is used for several reasons on the Internet today. The first is to reduce bandwidth costs by preferring particular paths (different transit providers). The other is performance, where a particular transit provider may have a less-congested or lower-latency path to a destination network. Network operators will watch a variety of metrics to determine whether there is a problem, make policy changes, and examine the outcome. Of course, on the Internet, the scale of the traffic being moved around counts. Moving a few Gbps of traffic from one path to another may improve performance, but if you move tens of Gbps you may encounter congestion on the newly selected path. The links between various networks on the Internet are scaled based upon observed utilization. Even though you may be paying a transit provider for connectivity, this doesn't mean every link to external networks is scaled for the amount of traffic you wish to push. As traffic grows, links are added between individual networks. Causing a massive change in utilization on the Internet can therefore result in congestion, as the newly selected paths handle more traffic than they ever had before. The result is that network operators must pay attention when moving traffic, shift it in increments, and communicate with other networks to gauge the impact of any traffic moves.

    Outbound Traffic-Engineering

    Inbound Traffic-Engineering

    Complicating the above traffic-engineering operations is that you are not the only one on the Internet trying to push traffic to certain destinations. Other networks are in a similar position, trying to deliver traffic and performing their own traffic-engineering. There are also many networks that will refuse to peer with other networks for several reasons. For example, some networks may cite an imbalance in inbound vs. outbound traffic (traffic ratios) or feel that traffic is being dumped on their network. In these cases, the only way to reach these destinations is via a transit provider. In some cases, these networks may offer a "paid peering" product to provide direct connectivity. That paid peering product may be priced lower than what you would pay for transit, or may offer an uncongested path compared to what you would normally observe over transit. Just because you have a path via transit doesn't mean the path is uncongested at all hours of the day (such as during peak hours).

    One way to eliminate the hops between networks is to do just that - eliminate them via direct connections. AWS provides a service to do this known as AWS Direct Connect. With Direct Connect, customers can connect their network directly into the AWS network infrastructure. This will enable bypassing the Internet via direct physical connectivity and remove any potential Internet routing or capacity issues.

    Traceroute

    In order to determine the paths traffic is taking, tools such as traceroute are very useful. Traceroute operates by sending packets to a given destination network with the initial IP TTL value set to one. The upstream device will generate an ICMP TTL Exceeded message back to you (the source), which reveals the first hop in your path to the destination. Subsequent packets are sent from the source with incrementing IP TTL values to reveal each hop along the way towards the destination. It is important to remember that Internet routing typically involves asymmetric paths - the traffic going towards a destination will take a separate set of hops on the return path. When performing traceroutes to diagnose routing issues, it is very useful to obtain the reverse path to help isolate which direction of traffic is the issue. With an understanding of the paths traffic is taking in both directions, it is easier to understand what sort of traffic-engineering changes can be made.

    When dealing with Network Operation Centers (NOCs) or support groups, it is important to provide the public IPs of the source and destination addresses involved in the communication. This gives individuals the information they need to reproduce the issue that is being encountered. It is also useful to include any specific details surrounding the communication, such as whether it was HTTP (TCP/80) or HTTPS (TCP/443).

    Some traceroute applications let the user generate probes using a variety of protocols, such as ICMP Echo Request (ping), UDP or TCP packets to a particular port. Several traceroute programs by default use ICMP Echo Request or UDP packets (destined to a particular port range). While these work most of the time, various networks on the Internet may filter these sorts of packets, so it is recommended to use a traceroute probe that replicates the type of traffic you intend to send to the destination network. For example, using traceroute with TCP/80 or TCP/443 can yield better results when dealing with firewalls or other packet filtering.

    An example of a UDP-based traceroute (using the well-known traceroute port range), where multiple routers will generate TTL Exceeded messages for packets bound for those destination ports:

    [ec2-user@ip-10-254-10-179 ~]$ traceroute -q 1 www.amazon.com
    traceroute to www.amazon.com (72.21.215.232), 30 hops max, 60 byte packets
     1  ip-10-6-6-1.us-west-2.compute.internal (10.6.6.1)  0.736 ms
     2  100.70.41.1 (100.70.41.1)  0.691 ms
     3  100.70.41.34 (100.70.41.34)  0.458 ms
     4  ip-10-177-64-9.us-west-2.compute.internal (10.177.64.9)  0.336 ms
     5  100.64.12.64 (100.64.12.64)  0.484 ms
     6  100.64.13.1 (100.64.13.1)  6.067 ms
     7  ec2-50-112-0-20.us-west-2.compute.amazonaws.com (50.112.0.20)  1.521 ms
     8  100.64.1.65 (100.64.1.65)  1.408 ms
     9  100.64.0.84 (100.64.0.84)  0.903 ms
    10  100.64.16.89 (100.64.16.89)  18.491 ms
    11  54.239.48.192 (54.239.48.192)  1.379 ms
    12  205.251.232.196 (205.251.232.196)  1.274 ms
    13  54.239.41.26 (54.239.41.26)  58.497 ms
    14  205.251.244.101 (205.251.244.101)  58.998 ms
    15  72.21.222.157 (72.21.222.157)  68.459 ms
    16  72.21.222.85 (72.21.222.85)  79.995 ms
    17  *
    18  *
    19  *
    

    Note that the last hop does not respond, since it most likely denies UDP packets destined to high ports.

    With the same traceroute using TCP/443 (HTTPS), we find multiple routers do not respond but the destination does respond since it is listening on TCP/443:

    TCP Traceroute to port 443 (HTTPS):

    [root@ip-10-254-10-179 ~]# tcptraceroute -q 1 -w 1 www.amazon.com 443
    Selected device eth0, address 10.254.10.179, port 39691 for outgoing packets
    Tracing the path to www.amazon.com (72.21.215.232) on TCP port 443 (https), 30 hops max
     1  10.6.6.1  0.790 ms
     2  100.70.41.1  0.516 ms
     3  100.70.41.36  0.440 ms
     4  10.177.48.9  0.346 ms
     5  100.64.12.76  0.519 ms
     6  *
     7  *
     8  *
     9  *
    10  100.64.16.75  21.096 ms
    11  54.239.48.194  2.077 ms
    12  205.251.232.214  1.470 ms
    13  54.239.41.28  58.619 ms
    14  205.251.244.101  59.913 ms
    15  72.21.222.87  61.425 ms
    16  *
    17  *
    18  *
    19  100.64.19.27  66.006 ms
    20  72.21.215.232 [open]  59.023 ms
    

    The hops revealed within traceroute provide some insight into the sort of network devices your packets are traversing. Many network operators will add descriptive information in the DNS reverse PTR records, though each network is going to be different. Typically the DNS entries will indicate the router name, some sort of geographical code and the physical or logical router interface the traffic has traversed. Each network names its own routers differently, so the information will usually indicate whether a device is a "core" router (no external or customer interfaces) or an "edge" router (with external network connectivity). Of course, this is not a hard rule, and it is common to find multi-function devices within a network. The geographical identifier can vary between IATA airport codes, telecom CLLI codes (or a variation upon them) or internally generated identifiers that are unique to that particular network. Occasionally shortened versions of a physical address or city names will appear in here as well. The interface portion can indicate the interface type and speed, though these are only as accurate as the operator's willingness to publicly reveal this information and keep their DNS entries up to date.

    Interpreting physical locations in traceroute

    One important caveat is that traceroute data should be taken with some skepticism. Traceroute displays the round-trip time (RTT) of each individual hop as the packets traverse the network to their destination. While this value can provide some insight into the latency to these hops, the actual value can be influenced by a variety of factors. For instance, many modern routers treat packets that TTL-expire on them as low priority compared to the other functions the router is performing (forwarding packets, running routing protocols). As a result, handling the TTL-expired packet and generating the ICMP TTL Exceeded message can take some time. This is why it is common to occasionally see high RTT on intermediate hops within a traceroute (up to hundreds of milliseconds). This does not always indicate a network issue, and individuals should always measure the end-to-end latency (via ping or application tests). In situations where the RTT does increase at a particular hop and continues to increase at the hops beyond it, this can be an indicator of an overall increase in latency at a particular point in the network.

    Another item frequently observed in traceroutes is hops that do not respond, displayed as *'s. This means that the router(s) at that hop either dropped the TTL-expired packet or did not generate the ICMP TTL Exceeded message. This is usually the result of one of two things. The first is that many modern routers implement Control-Plane Policing (CoPP), packet filters on the router that control how certain types of packets are handled. In many modern routers, the use of ASICs (Application-Specific Integrated Circuits) has improved packet lookup and forwarding functions. When a router ASIC receives a packet with a TTL value of one, it punts the packet to another component within the router to handle the ICMP TTL Exceeded generation. On most routers, ICMP TTL Exceeded generation is done on a CPU integrated on a linecard or on the main brain of the router itself (known as a route processor, routing engine or supervisor). Since the CPU of a linecard or routing engine is busy performing things such as forwarding table programming and routing protocols, routers allow protections to be put in place to restrict the rate at which TTL-expired packets reach these components. CoPP allows an operator to limit TTL Exceeded handling to a value such as 100 packets per second, and the router may also have a separate rate-limiter on how many ICMP TTL Exceeded messages can be generated. In this situation, you'll find that hops in your traceroute may sometimes not reply at all because of CoPP. This is also why, when pinging individual hops (routers) on a traceroute, you will see packet loss: CoPP is dropping the packets. The other case is where CoPP is configured to simply deny all TTL-expired packets. Within traceroute, these hops will always respond with *'s no matter how many times you execute traceroute.

    A good presentation that explains using traceroute on the Internet and interpreting its results is found here: https://www.nanog.org/meetings/nanog45/presentations/Sunday/RAS_traceroute_N45.pdf

    Troubleshooting issues on the Internet is no easy task, and it requires examining multiple sets of information (traceroutes, BGP routing tables) to come to a conclusion about what may be occurring. The use of Internet looking glasses or route servers is useful for providing a different vantage point on the Internet when troubleshooting. The Looking Glass Wikipedia page has several links to sites which you can use to perform pings, traceroutes and examine a BGP routing table from different spots around the world in various networks.

    When reaching out to networks or posting in forums looking for support with Internet routing issues, it is important to provide useful information for troubleshooting. This includes the source IP address (the public IP, not a private/NAT-translated one), the destination IP (again, the public IP), the protocol and ports being used (TCP/80, for example) and the specific time and date when you observed the issue. Traceroutes in both directions are incredibly useful, since paths on the Internet can be asymmetric.

  • Selecting Service Endpoints for Reliability and Performance

    Choose Your Route Wisely

    Much like a roadway, the Internet is subject to congestion and blockage that cause slowdowns and at worst prevent packets from arriving at their destination. Like too many cars jamming themselves onto a highway, too much data over a route on the Internet results in slowdowns. Transatlantic cable breaks have much the same effect as road construction, resulting in detours and further congestion and may prevent access to certain sites altogether. Services with multiple points of presence paired with either smart clients or service-side routing logic improve performance and reliability by ensuring your viewers have access to alternative routes when roadways are blocked. This blog post provides you with an overview of service side and client side designs that can improve the reliability and performance of your Internet-facing services.

    In order to understand how smart routing can increase performance, we must first understand why Internet downloads take time. To extend our roadway analogy, imagine we need to drive to the hardware store to obtain supplies. If we can get our supplies in one trip and it takes 10 minutes to drive one way, it takes 20 minutes to get the supplies and return to the house project. (For the sake of simplicity we’ll exclude the time involved with finding a parking spot, wandering aimlessly around the store in search of the right parts, and loading up materials.) Much of the time is taken up by driving the 5 km across multiple types of roadways to the hardware store.

    Packets on the Internet must also transit physical distance and many links to reach their destination. A traceroute from my home network to my website hosted in eu-west-1 traverses 20 different links. It takes time for a packet to traverse the physical distance between each of these nodes. Traveling at a speed of 200,000 km/sec, my packets travel from Seattle to Dublin and back in 170 ms.

    Hop Host                                    Round Trip Time (ms)
    1   192.168.0.1                             2
    2   96.120.-.-                              9
    3   -.seattle.wa.seattle.comcast.net        15
    4   -.burien.wa.seattle.comcast.net         16
    5   -.seattle.wa.seattle.comcast.net        22
    6   -.seattle.wa.ibone.comcast.net          27
    7   4.68.63.65                              11
    8   ae-32-52.ebr2.Seattle1.Level3.net       189
    9   ae-2-2.ebr2.Denver1.Level3.net          186
    10  ae-3-3.ebr1.Chicago2.Level3.net         186
    11  ae-1-100.ebr2.Chicago2.Level3.net       200
    12  ae-6-6.ebr2.Washington12.Level3.net     235
    13  ae-46-46.ebr2.Washington1.Level3.net    186
    14  ae-41-41.ebr2.Paris1.Level3.net         189
    15  ae-2-12.bar1.Dublin3.Level3.net         194
    16  195.16.170.102                          229
    17  178.236.0.80                            345
    18  178.236.3.129                           202
    19  178.236.0.85                            176
    20  server.dub2.r.cloudfront.net            170
    

    Unlike roadways, when congestion occurs on Internet paths, routers discard packets. Discarded packets result in retransmits by the sender, adding additional round trips to the overall download. The additional round trips result in a slower download, just as multiple trips to the hardware store result in more time spent driving and less time building.

    To reduce the time of a round trip, a packet needs to travel a shorter physical distance. To avoid packet loss and additional round trips, a packet needs to avoid broken or congested routes. Performance and reliability of an Internet-facing service are improved by using the shortest physical path between the service and the client and by minimizing the number of round trips for data transfer.
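
    To put rough numbers on the distance component (a back-of-the-envelope sketch using approximate city coordinates, not a measurement): the great-circle distance from Seattle to Dublin is roughly 7,300 km, so at 200,000 km/sec the theoretical minimum round trip is about 73 ms, well under the 170 ms measured above because real paths are longer than the great circle and include queuing and processing delays.

    from math import radians, sin, cos, asin, sqrt

    def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371):
        """Haversine distance between two points on the Earth's surface."""
        p1, p2 = radians(lat1), radians(lat2)
        dlat, dlon = p2 - p1, radians(lon2 - lon1)
        a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
        return 2 * radius_km * asin(sqrt(a))

    distance = great_circle_km(47.6, -122.3, 53.3, -6.3)   # Seattle -> Dublin, approximate
    min_rtt_ms = 2 * distance / 200_000 * 1000             # out and back at 200,000 km/sec
    print(f"{distance:.0f} km, minimum RTT ~{min_rtt_ms:.0f} ms")  # ~7,300 km, ~73 ms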

    Take the path of shortest distance

    Take a path that avoids congested links

    Adding multiple points of presence to your architecture is like opening up a chain of hardware stores around the city. With multiple points of presence, drivers can choose a destination hardware store that allows them to drive the shortest physical distance. If the route to any one hardware store is congested or blocked entirely, drivers can instead drive to the next nearest store.

    Measuring the Internet

    Finding the fastest and most reliable paths across the Internet requires real-world measurement. The closest endpoint as the crow flies often isn’t the fastest via Internet paths. The diagrams below depict an all too common example when driving in the real world. Given the starting point (the white dot), hardware store B (the red marker in the second diagram) is actually closer as the crow flies, but requires a lot more driving distance. A driver who knows the roadways will know that it takes less overall time to drive to store A.

    Hardware store A is 10 minutes away by car

    Hardware store B would be much closer by way of the water, but is 16 minutes by car

    The same is true on the Internet. Here’s a real-world example from CloudFront just after it launched in 2008. Below is a traceroute from an ISP network in Singapore to the SIN2 edge location in Singapore (internally AWS uses airport codes to identify edge locations). The traceroute shows that a packet sourced from a Singapore ISP to the SIN2 edge location was routed by way of Hong Kong! Based on the connectivity of this viewer network, Singapore customers are better served from the HKG1 edge location rather than SIN2. If routing logic assumed that geographically closer edge locations resulted in lower latency, the routing engine would choose SIN2. By measuring latency, the system sees that it should select the Hong Kong edge location instead; routing that way provides the viewer with 32 ms round-trip times versus 39 ms round-trip times.

    Hop Host                                    Round Trip Time (ms)
    1   server-01.sin2.cloudfront.net           <1
    2   203.83.223.83                           <1
    3   Request timed out.                      <1
    4   Request timed out.                      <1
    5   ix-12-3.icore1.S9U-Singapore.as6453.net 1
    6   Vlan32.icore1.S9U-Singapore.as6453.net  13
    7   if-13-46.icore1.HK2-HongKong.as6453.net 32
    8   ge-5-0-6-0.hkgcw-cr3.ix.singtel.com     31
    9   so-3-1-2-0.hkgcw-cr3.ix.singtel.com     40
    10  ge-1-1-0-0.sngtp-ar6.ix.singtel.com     40
    11  202.160.250.225                         32
    12  203.208.232.49                          *
    13  203.208.232.34                          *
    14  203.208.232.54                          39
    15  203.208.249.242                         39
    

    The topology of the Internet is always changing. The SIN-to-HKG example is old; SIN2 is now directly peered with the network above. Continuous or just-in-time measurements allow a system to adapt when Internet topology changes. Once the AWS networking team built a direct connection between our SIN edge location and the Singapore ISP, the CloudFront measurement system automatically observed lower latencies, and SIN2 became the preferred lowest-latency edge location. The direct path results in 2.2 ms round-trip times versus 39 ms round-trip times.

    Hop Host                                    Round Trip Time (ms)
    1   server-01.sin2.cloudfront.net           <1
    2   203.83.223.86                           <1
    3   203.83.223.13                           <1
    4   203.208.175.69                          2
    5   203.208.158.34                          2
    6   203.208.172.94                          2
    7   203.208.131.98                          4
    8   XE-3-0-0.Orchard.singnet.com.sg         2
    9   165.21.255.22                           2
    10  203.208.249.242                         2
    

    Leveraging Real World Measurements for Routing Decisions

    Service-Side Routing

    CloudFront and Route 53 both leverage similar mechanisms to route clients to the closest available endpoints. CloudFront uses routing logic in its DNS layer. When clients resolve DNS for a CloudFront distribution, they receive a set of IPs that map to the closest available edge location. Route 53’s Latency-Based Routing, combined with DNS Failover and health checks, allows customers to build the same routing logic into any service hosted across multiple AWS regions.

    Availability failures are detected via continuous health checks. Each CloudFront edge location checks the availability of every other edge location and reports the health state to CloudFront DNS servers (which also happen to be Route 53 DNS servers). If an edge location starts failing health checks, CloudFront DNS servers respond to client queries with IPs that map to the client’s next-closest healthy edge location. AWS customers can implement similar health-check failover capabilities within their services by using the Route 53 DNS failover and health checks features.

    Lowest-latency routing is achieved by continuously sampling latency measurements from viewer networks across the Internet to each AWS region and edge location. Latency measurement data is compiled into a list of AWS sites for each viewer network, sorted by latency. CloudFront and Route 53 DNS servers use this data to direct traffic. When network topology shifts occur, the latency measurement system picks up the changes and reorders the list of least-latent AWS endpoints for that given network.
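
    A minimal sketch of that idea (not the actual CloudFront or Route 53 implementation): keep a latency-sorted list of sites for each viewer network and answer with the closest site that is currently passing health checks. The function names and the NRT12 edge location below are illustrative; the SIN2 and HKG1 latencies mirror the traceroutes above.

    def preference_list(latency_ms_by_site):
        """Sort candidate sites by measured latency for one viewer network."""
        return sorted(latency_ms_by_site, key=latency_ms_by_site.get)

    def pick_site(latency_ms_by_site, healthy_sites):
        for site in preference_list(latency_ms_by_site):
            if site in healthy_sites:
                return site
        return None  # nothing healthy; a real system would need a fallback answer

    # Example: measurements from one viewer network to three edge locations.
    measurements = {"SIN2": 39.0, "HKG1": 32.0, "NRT12": 55.0}
    print(pick_site(measurements, healthy_sites={"SIN2", "NRT12"}))  # -> "SIN2"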

    Client-Side Routing

    But service-side technology isn’t the only way to perform latency-based routing. Client software also has access to all of the data it needs to select the best endpoint. A smart client can intelligently rotate through a list of endpoints whenever a connection failure occurs. Smart clients perform their own latency measurements and choose the lowest-latency endpoint as their connection destination. Most DNS resolvers do exactly that. Lee’s prior blog post made mention of DNS resolvers and provided a reference to this presentation.

    The DNS protocol is well suited to smart-client endpoint selection. When a DNS resolver resolves a DNS name, say www.internetkitties.com, it must first determine which DNS servers are authoritative to answer queries for the domain. In my case, my domain is hosted by Route 53, which provides me with 4 assigned name servers. Smart DNS resolvers track the query response times to each name server and preferentially query the fastest name server. If the name server fails to respond, the DNS resolver falls back to other name servers on the list.

    There are examples in HTTP as well. The Kindle Fire web browser, Amazon Silk, makes use of backend resources hosted in multiple AWS regions to speed up the web browsing experience for its customers. The Amazon Silk client knows about, and collects latency measurements against, each endpoint of the Amazon Silk backend service. The client connects to the lowest-latency endpoint. As the device moves around from one network to another, the lowest-latency endpoint might change.
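
    As an illustration of client-side endpoint selection (a sketch, not how Amazon Silk is actually implemented), a client might probe each endpoint with a timed TCP connect, prefer the fastest, and rotate to the next one on connection failure:

    import socket, time

    def measure_connect_ms(host, port=443, timeout=2.0):
        """Rough latency probe: time a TCP connect to the endpoint."""
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000

    def rank_endpoints(endpoints):
        timings = {}
        for host in endpoints:
            try:
                timings[host] = measure_connect_ms(host)
            except OSError:
                pass  # unreachable right now; leave it out of the ranking
        return sorted(timings, key=timings.get) or list(endpoints)

    def connect_to_best(endpoints):
        for host in rank_endpoints(endpoints):
            try:
                return socket.create_connection((host, 443), timeout=2.0)
            except OSError:
                continue  # fall back to the next-fastest endpoint
        raise OSError("no endpoint reachable")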

    Service architecture with multiple points of presence on the Internet, combined with latency-based service-side or client-side endpoint selection, can improve the performance and reliability of your service. Let us know if you'd like a follow-up blog on best practices for measuring Internet latencies or lessons learned from the Amazon Silk client-side endpoint selector.

    - Nate Dye

  • Running Multiple HTTP Endpoints as a Highly Available Health Proxy

    13 Jun 2014 in failover, route 53 | permalink

    Route 53 Health Checks provide the ability to verify that endpoints are reachable and that HTTP and HTTPS endpoints successfully respond. However, there are many situations where DNS failover would be useful, but TCP, HTTP, and HTTPS health checks alone can't sufficiently determine the health of the endpoint. In these cases, it's possible for an application to determine its own health, and then use an HTTP endpoint as a health proxy to Route 53 Health Checks to perform DNS failover. Today we'll look at an example using Route 53 Health Checks to provide DNS failover between many pairs of primary and standby databases.

    Determining the Health Status of Resources

    In this example, we have many pairs of primary and standby databases. For each pair, we would like to perform DNS failover from the primary to the standby whenever the primary is unable to respond to queries. Each of these pairs is independent of the others, so one primary database failing over should not have any impact on the remaining databases.

    The existing Route 53 health check types aren't really sufficient to determine the health of a database. A TCP health check could determine if the database was reachable, but not if it was able to respond to queries. In addition, we'd like to switch to the standby not only if the primary is unable to answer queries, but also proactively if we need to perform maintenance or updates on the primary.

    Reporting Health

    Having determined the criteria for failing over a database, we'll then want to provide this information to Route 53 to perform DNS failover. Since the databases themselves can't meaningfully respond to health checks, we set up an HTTP endpoint to respond to health checks on behalf of the databases. This endpoint will act as a health proxy for the database, responding to health checks with HTTP status 200 when the database is healthy and HTTP status 500 when the database is unhealthy. We can also use a single HTTP endpoint to publish the health of all our databases. Since HTTP health checks can specify a resource path in addition to IP address and port, we can assign a unique resource path for each database.
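
    A minimal sketch of such a health proxy (the port, the database names, and the is_healthy check are illustrative; the real health determination would run a test query, check replication, or honor a maintenance flag):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    def is_healthy(db_name):
        """Placeholder: run a test query, check replication lag, honor maintenance flags, etc."""
        return db_name in {"DB1Primary", "DB1Standby"}  # illustrative only

    class HealthProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            db_name = self.path.strip("/")          # e.g. /DB1Primary -> DB1Primary
            status = 200 if is_healthy(db_name) else 500
            self.send_response(status)
            self.end_headers()

    HTTPServer(("0.0.0.0", 8080), HealthProxy).serve_forever()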

    To pull the information into Route 53, we then create an HTTP health check for each database. These health checks will all use the HTTP endpoint's IP address and port, and each will have a unique resource path specific to the database we want to check. For example, in the diagram below, we see that the primary uses a path of "/DB1Primary" while the standby database uses a path of "/DB1Standby" on the same IP address endpoint.

    We also configure the health checks with a failure threshold of 1 and an interval of 10 seconds to better suit our use case. The default interval of 30 seconds and failure threshold of 3 are useful when we want Route 53 to determine the health of an endpoint, but in this case we already know the health of the endpoint and are simply using health checks to surface this information. By setting the failure threshold to 1, Route 53 immediately takes the result of the latest health check request. This means our health check will become unhealthy on the first health check request after the HTTP endpoint reports the database as unhealthy, rather than waiting for three unhealthy responses. We also use the 10-second interval to speed up our reaction time to an unhealthy database. With both of these changes, Route 53 should detect the HTTP endpoint becoming unhealthy within approximately 10 seconds instead of approximately 90 seconds.
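
    Created through the API, one such health check might look like the following sketch (the IP address, port, path, and caller reference are illustrative):

    import boto3

    route53 = boto3.client("route53")
    route53.create_health_check(
        CallerReference="db1-primary-check-001",      # any unique string
        HealthCheckConfig={
            "IPAddress": "192.0.2.10",                # one of the HTTP health-proxy endpoints
            "Port": 8080,
            "Type": "HTTP",
            "ResourcePath": "/DB1Primary",            # path identifying this database
            "RequestInterval": 10,                    # check every 10 seconds
            "FailureThreshold": 1,                    # act on the first unhealthy response
        },
    )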

    Increasing Availability

    At this point we have created health checks for our databases which could be associated with resource records sets for DNS failover. Although this works well for providing the health of the resource to Route 53, it does have a downside that traditional health checks don't. If the health proxy itself fails, all health checks targeting it will time out and fail as well. This would prevent DNS failover for databases from working at all, and could result in all DNS queries responding with the standby database.

    The first step to fixing this is to run more than one HTTP endpoint, each providing the status of the same resources. This way, if one of the endpoints fails, we'll still have others to report the same information. We'll then create a health check for each combination of database and HTTP endpoint. So if we run three HTTP endpoints, we'll create three health checks for each database.

    The diagram below shows how the HTTP endpoints receive health statuses from a single primary and standby database pair, and then pass them on to the Route 53 health checks. The bold line in the diagram depicts how the health status for a single database is sent by multiple HTTP endpoints to multiple Route 53 health checks.

    Component Diagram

    With multiple HTTP endpoints, we'll want to consider the primary database as healthy if at least one HTTP endpoint reports that it is healthy, that is, if at least one of its health checks is healthy. We'll want to fail over if none of the primary's health checks report healthy, but at least one of the standby's health checks reports healthy. If none of the primary's or standby's health checks report healthy, we would rather respond with the primary's IP than return nothing at all. This also provides the desired behavior in case all HTTP endpoints go down: we'll respond with the primary database for all database pairs, rather than failing over all databases.

    To configure this behavior, we use multiple weighted round robin (WRR) resource record sets for each database. Each combination of database and health check gets a separate WRR set, with primaries weighted one and standbys weighted zero. When all of the primary database's health checks are healthy, Route 53 will randomly choose one of the WRR sets for the primary database with a weight of one. Since all of these contain the same value, the IP address of the primary database, it doesn't matter which one is selected and we get the correct response. When all health checks for a primary database are unhealthy, their WRR sets will not be considered and Route 53 will choose randomly from the healthy zero-weighted standby sets. Once again, these all return the same IP address, so it doesn't matter which is chosen. Finally, if all primary and standby health checks are unhealthy, Route 53 will consider all WRR sets and choose among the one-weighted primary records.
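
    A sketch of one such primary/standby pair of WRR sets (illustrative zone ID, names, IPs, and health check IDs); in the full design this pair is repeated once per HTTP endpoint's health check:

    import boto3

    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",                          # illustrative hosted zone
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "db1.example.com.", "Type": "A",
                "SetIdentifier": "db1-primary-via-proxy1",
                "Weight": 1,                               # preferred while healthy
                "TTL": 60,
                "ResourceRecords": [{"Value": "10.1.0.10"}],   # primary database IP
                "HealthCheckId": "primary-health-check-id"}},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "db1.example.com.", "Type": "A",
                "SetIdentifier": "db1-standby-via-proxy1",
                "Weight": 0,                               # used only when no weighted-1 set is healthy
                "TTL": 60,
                "ResourceRecords": [{"Value": "10.1.0.11"}],   # standby database IP
                "HealthCheckId": "standby-health-check-id"}},
        ]},
    )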

    The diagram below shows an example of WRR sets configured for DNS failover of a single database.

    Component Diagram

    Applications and Tradeoffs

    While this post discussed using health proxies in the context of health checking databases, the same technique can be used to apply DNS failover to many other resources. It could be applied to any resource which can't easily be checked by TCP, HTTP, or HTTPS health checks, such as file servers, mail servers, etc. It could also be used to drive DNS failover on other criteria, such as manually moving traffic away from a resource for deployment, or when the error rate for a service rises above 5%.

    It's also worth mentioning that there are some tradeoffs to this approach that may make it less useful for some applications. The biggest of these is that in this approach the health checks no longer provide a connectivity check to the actual endpoint of interest. Since we configured the health checks to target the HTTP endpoints, we would actually be checking the connectivity of the HTTP endpoints instead of the databases themselves. For the use case in this example, this isn't a problem. The databases are usually queried from other hosts in the same region, so global connectivity isn't a concern.

    The other tradeoff is having to maintain additional hosts for the HTTP endpoints. In the above example, we have more database pairs than HTTP endpoints, so this is a relatively small cost. If we only had a few databases, this design would be much less sensible.

    - John Winder

  • Doing Constant Work to Avoid Failures

    Amazon Route 53's DNS Failover feature allows fast, automatic rerouting at the DNS layer based on the health of endpoints. Endpoints are actively monitored from multiple locations, and either application or connectivity issues can trigger failover.

    Trust No One

    One of the goals in designing the DNS Failover feature was making it resilient to large-scale outages. Should a large subset of targets, such as an entire availability zone, fall off the map, the last thing we'd want to do is accumulate a backlog of processing work, delaying the failover. We avoid incurring delays in such cases by making sure that the amount of work we do is bounded regardless of the runtime conditions. This means a bounded workload during all stages of processing: data collection, handling, transmission, aggregation and use. Each of these stages requires some thought for it to not cause delays under any circumstances. On top of its own performance, each stage of the pipeline needs to have its output cardinality bounded in a way that avoids causing trouble for the next stage. For extra protection, that next stage should not trust its input to be bounded; it should implement safety throttling, preferably isolating input by type and tenant, or by whatever dimension of the input may be variable due to a bug or misconfiguration (such as extra upstream stages being configured unexpectedly, each producing output).

    Let's dig into how this works out for our health checking service "data plane," the part that does all the work once health checks have been configured through the "control plane." As an aside, we deploy our health checking data plane and control plane similarly to the methods described previously.

    Health Checkers Example

    The frontline component of the system is a fleet of health checker clusters, running from EC2 regions. Each health check is performed by multiple checkers from multiple regions, repeated at the configured frequency. On failure, we do not retry; retrying immediately would increase the number of performed checks if a large proportion of the targets assigned to a given checker fail. Instead, we just perform the next check as scheduled normally (30 seconds later, by default).

    Component Diagram

    The checkers communicate the results to the aggregators using a steady stream, asynchronously from the checks themselves. It wouldn't matter if, for some reason, we'd be making more checks - the stream is configured separately and does not inflate dynamically.

    Transmitting incremental changes, as opposed to the complete current state, would result in an increase in transmission size whenever there is a spike in the number of status changes. So to comply with the goal, we can either make sure that the transmission pipe does not choke when absolutely everything changes, or we can just transmit the full snapshot of the current state all the time. It turns out that, in our case, the latter is feasible by optimizing the transmission format for batching (we use a few bits per state and we pack densely), so we happily do that.
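
    A sketch of the idea (illustrative, using one bit per state): if each health check result fits in a bit, a full snapshot of tens of thousands of results packs into a few kilobytes, so sending everything every time stays cheap.

    def pack_states(states):
        """Pack a list of 0/1 health states into bytes, eight states per byte."""
        packed = bytearray((len(states) + 7) // 8)
        for i, healthy in enumerate(states):
            if healthy:
                packed[i // 8] |= 1 << (i % 8)
        return bytes(packed)

    def unpack_states(packed, count):
        return [(packed[i // 8] >> (i % 8)) & 1 for i in range(count)]

    # 100,000 health check results fit in 12.5 KB per snapshot.
    snapshot = pack_states([1] * 100_000)
    assert len(snapshot) == 12_500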

    Aggregators on the DNS hosts process large streams of incoming states - we are talking about millions each second - and are very sensitive to processing time, so we had better be sure that neither the content nor the count of the messages impacts that processing time. Here, again, we protect ourselves from spikes in the rate of incoming messages by only performing aggregation once a second per batch of relevant messages. For each batch, only the last result sent by a checker is used, and those received previously are discarded. This contrasts with running aggregation for every received message and depending on the constancy of that incoming stream. It is also in contrast to queuing up the messages to smooth out the bursts of incoming requests. Queues are helpful for this purpose, but only help when bursts are not sustained. Ensuring constancy of work regardless of bursts works better.
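
    A minimal sketch of that aggregation discipline (illustrative names, not the Route 53 code): remember only the latest message from each checker and fold the batch in on a fixed one-second tick, so a burst of messages changes what is aggregated but not how much work is done.

    import time

    latest_by_checker = {}          # checker_id -> most recent snapshot (overwritten on arrival)

    def on_message(checker_id, snapshot):
        latest_by_checker[checker_id] = snapshot   # constant work per arrival; no queue growth

    def aggregation_loop(aggregate):
        while True:
            time.sleep(1.0)                         # fixed cadence, independent of message rate
            batch = dict(latest_by_checker)         # take the latest state from each checker
            aggregate(batch)                        # bounded: at most one entry per checker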

    In addition to coping with potential incoming message bursts, aggregators need to be able to process messages in a well-bounded time. One of the mistakes we made here originally was not testing the performance of one of the code paths handling some specific unexpected message content. Due to another unexpected condition, at one point, one of the checkers sent a large number of those unexpected states to the aggregators. On their own, these states would be harmless, only triggering some diagnostics. However, the side effect was a significant processing slowdown in the aggregators because of the poorly performing code path. Fortunately, because of the constant work design, this simply resulted in skipping some messages during processing and only extended the failure detection latency by a couple of seconds. Of course, we fixed the poorly performing path after this; it would have been better to catch it in testing.

    All together, for each pipeline stage, the cardinality or rate of the input is bounded and the cost of processing of all types of input is uniformly bounded as well. As a result, each stage is doing a constant amount of work and will not fall over or cause delays based on external events - which is exactly the goal.

    - Vadim Meleshuk

  • A Case Study in Global Fault Isolation

    In a previous blog post, we talked about using shuffle sharding to get magical fault isolation. Today, we'll examine a specific use case that Route 53 employs and one of the interesting tradeoffs we decided to make as part of our sharding. Then, we'll discuss how you can employ some of these concepts in your own applications.

    Overview of Anycast DNS

    One of our goals at Amazon Route 53 is to provide low-latency DNS resolution to customers. We do this, in part, by announcing our IP addresses using “anycast” from over 50 edge locations around the globe. Anycast works by routing packets to the closest (network-wise) location that is “advertising” a particular address. In the image below, we can see that there are three locations, all of which can receive traffic for the 205.251.194.72 address.

    Anycast Routing

    [Blue circles represent edge locations; orange circles represent AWS regions]

    For example, if a customer has ns-584.awsdns-09.net assigned as a nameserver, issuing a query to that nameserver could result in that query landing at any one of multiple locations responsible for advertising the underlying IP address. Where the query lands depends on the anycast routing of the Internet, but it is generally going to be the closest network-wise (and hence, low latency) location to the end user.

    Behind the scenes, we have thousands of nameserver names (e.g. ns-584.awsdns-09.net) hosted across four top-level domains (.com, .net, .co.uk, and .org). We refer to all the nameservers in one top-level domain as a ‘stripe’; thus, we have a .com stripe, a .net stripe, a .co.uk stripe, and a .org stripe. This is where shuffle sharding comes in: each Route 53 domain (hosted zone) receives four nameserver names, one from each stripe. As a result, it is unlikely that two zones will overlap completely across all four nameservers. In fact, we enforce a rule during nameserver assignment that no hosted zone can overlap by more than two nameservers with any previously created hosted zone.
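
    A sketch of that assignment rule (the stripe contents and name format below are illustrative, not the real pools): pick one nameserver at random from each stripe, and reject any candidate set that shares more than two nameservers with a previously assigned zone.

    import random

    STRIPES = {            # illustrative; each stripe really holds thousands of names
        ".com": [f"ns-{i}.awsdns-{i % 100:02d}.com" for i in range(256)],
        ".net": [f"ns-{i}.awsdns-{i % 100:02d}.net" for i in range(256, 512)],
        ".org": [f"ns-{i}.awsdns-{i % 100:02d}.org" for i in range(512, 768)],
        ".co.uk": [f"ns-{i}.awsdns-{i % 100:02d}.co.uk" for i in range(768, 1024)],
    }

    def assign_nameservers(existing_zones, max_overlap=2):
        """existing_zones: list of previously assigned 4-nameserver sets."""
        while True:
            candidate = {random.choice(names) for names in STRIPES.values()}
            if all(len(candidate & zone) <= max_overlap for zone in existing_zones):
                return candidate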

    DNS Resolution

    Before continuing, it’s worth quickly explaining how DNS resolution works. Typically, a client, such as your laptop or desktop, has a “stub resolver.” The stub resolver simply contacts a recursive nameserver (resolver), which in turn queries authoritative nameservers on the Internet to find the answers to a DNS query. Typically, resolvers are provided by your ISP or corporate network infrastructure, or you may rely on an open resolver such as Google DNS. Route 53 is an authoritative nameserver, responsible for replying to resolvers on behalf of customers. For example, when a client program attempts to look up amazonaws.com, the stub resolver on the machine will query the resolver. If the resolver has the data in cache and the value hasn’t expired, it will use the cached value. Otherwise, the resolver will query authoritative nameservers to find the answer.

    Route 53 Nameserver Striping

    [Every location advertises one or more stripes, but we only show Sydney, Singapore, and Hong Kong in the above diagram for clarity.]

    Each Route 53 edge location is responsible for serving the traffic for one or more stripes. For example, our edge location in Sydney, Australia could serve both the .com and .net stripes, while Singapore could serve just the .org stripe. Any given location can serve the same stripe as other locations; Hong Kong could serve the .net stripe, too. This means that if a resolver in Australia attempts to resolve a query against a nameserver in the .org stripe, which isn’t provided in Australia, the query will go to the closest location that provides the .org stripe (which is likely Singapore). A resolver in Singapore attempting to query against a nameserver in the .net stripe may go to Hong Kong or Sydney depending on the potential Internet routes from that resolver’s particular network. This is shown in the diagram above.

    For any given domain, in general, resolvers learn the lowest-latency nameserver based upon the round-trip time of their queries (this technique is often called SRTT, for smoothed round-trip time). Over a few queries, a resolver in Australia would gravitate toward using the nameservers on the .net and .com stripes for Route 53 customers’ domains.

    Not all resolvers do this. Some choose randomly amongst the nameservers, and others may even end up choosing the slowest one, but our experiments show that about 80% of resolvers use the lowest-RTT nameserver. For additional information, this presentation describes how various resolvers choose which nameserver they utilize. Additionally, many other resolvers (such as Google Public DNS) use pre-fetching, or have very short timeouts if they fail to resolve against a particular nameserver.
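
    As a rough model of this smoothed-RTT selection (purely illustrative Python; the smoothing factor, probe rate, and nameserver names are invented, and no specific resolver implements exactly this logic), a resolver can keep one SRTT estimate per nameserver, prefer the lowest, and occasionally re-probe the others so the estimates stay fresh:

    import random

    class SrttSelector:
        def __init__(self, nameservers, alpha=0.3, probe_rate=0.1):
            self.alpha = alpha            # weight given to the newest RTT sample
            self.probe_rate = probe_rate  # how often to re-check the other nameservers
            self.srtt = {ns: None for ns in nameservers}

        def pick(self):
            unknown = [ns for ns, rtt in self.srtt.items() if rtt is None]
            if unknown:
                return random.choice(unknown)          # measure everything at least once
            if random.random() < self.probe_rate:
                return random.choice(list(self.srtt))  # occasional re-probe
            return min(self.srtt, key=self.srtt.get)   # otherwise: the fastest so far

        def record(self, ns, rtt_ms):
            prev = self.srtt[ns]
            self.srtt[ns] = rtt_ms if prev is None else (1 - self.alpha) * prev + self.alpha * rtt_ms

    selector = SrttSelector(["ns-1.example", "ns-2.example", "ns-3.example", "ns-4.example"])
    for _ in range(50):
        ns = selector.pick()
        selector.record(ns, random.uniform(5, 80))  # pretend we timed a real query
    print("preferred nameserver:", min(selector.srtt, key=selector.srtt.get))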

    The Latency-Availability Decision

    Given the above resolver behavior, one option for a DNS provider like Route 53 might be to advertise all four stripes from every edge location. This would mean that no matter which nameserver a resolver chooses, it will always go to the closest network location. However, we believe this provides a poor availability model.

    Why? Because edge locations can sometimes fail to provide resolution for a variety of reasons that are very hard to control: the edge location may lose power or Internet connectivity, the resolver may lose connectivity to the edge location, or an intermediary transit provider may lose connectivity. Our experiments have shown that these types of events can cause about 5 minutes of disruption as the Internet updates routing tables. In recent years, another serious risk has arisen: large-scale transit network congestion due to DDoS attacks. Our colleague, Nathan Dye, has a talk from AWS re:Invent that provides more details: www.youtube.com/watch?v=V7vTPlV8P3U.

    In all of these failure scenarios, advertising every nameserver from every location may result in resolvers having no fallback location. All nameservers would route to the same location and resolvers would fail to resolve DNS queries, resulting in an outage for customers.

    In the diagram below, we show the difference for a resolver querying domain X, whose nameservers (NX1, NX2, NX3, NX4) are advertised from all locations and domain Y, whose nameservers (NY1, NY2, NY3, NY4) are advertised in a subset of the locations.

    Route 53 Nameserver Striping

    When the path from the resolver to location A is impaired, all queries to the nameservers for domain X will fail. In comparison, even if the path from the resolver to location A is impaired, there are other transit paths to reach nameservers at locations B, C, and D in order to resolve the DNS for domain Y.

    Route 53 typically advertises only one stripe per edge location. As a result, if anything goes wrong with a resolver being able to reach an edge location, that resolver has three other nameservers in three other locations to which it can fall back. For example, if we deploy bad software that causes the edge location to stop responding, the resolver can still retry elsewhere. This is why we organize our deployments in "stripe order"; Nick Trebon provides a great overview of our deployment strategies in the previous blog post. It also means that queries to Route 53 gain a lot of Internet path diversity, which helps resolvers route around congestion and other intermediary problems on their path to reaching Route 53.

    Route 53's foremost goal is to always meet our promise of a 100% SLA for DNS queries - that all of our customers' DNS names should resolve all the time. Our customers also tell us that latency is the next most important feature of a DNS service provider. Maximizing Internet path and edge location diversity for availability necessarily means that some nameservers will respond from farther-away edge locations. For most resolvers, however, our method has no impact on the minimum RTT - that is, on which nameserver is fastest and how quickly it can respond. As resolvers generally use the fastest nameserver, we’re confident that any compromise in resolution times is small and that this is a good balance between the goals of low latency and high availability.

    On top of our striping across locations, you may have noticed that the four stripes use different top-level domains. We use multiple top-level domains in case one of the three TLD providers (.com and .net are both operated by Verisign) has any sort of DNS outage. While this rarely happens, it means that as a customer, you’ll have increased protection during a TLD’s DNS outage because at least two of your four nameservers will continue to work.

    Applications

    You, too, can apply the same techniques in your own systems and applications. If your system isn't end-user facing, you could also consider utilizing multiple TLDs for resilience. Especially when you control both your API and the clients calling it, there's no reason to place all your eggs in one TLD basket.

    Another application of what we've discussed is minimizing downtime during failovers. For high-availability applications, we recommend customers utilize Route 53 DNS Failover. With failover configured, Route 53 will only return answers for healthy endpoints. In order to determine endpoint health, Route 53 issues health checks against your endpoint. As a result, there is a minimum of 10 seconds (assuming you configured fast health checks with a single failover interval) where the application could be down but failover has not yet triggered. On top of that, there is the additional time incurred for resolvers to expire the DNS entry from their cache based upon the record's TTL. To minimize this failover time, you could write your clients to behave similarly to the resolver behavior described earlier. And, while you may not employ an anycast system, you can host your endpoints in multiple locations (e.g. different availability zones and perhaps even different regions). Your clients would learn the SRTT of the multiple endpoints over time and only issue queries to the fastest endpoint, but fall back to the other endpoints if the fastest is unavailable. And, of course, you could shuffle shard your endpoints to achieve increased fault isolation while doing all of the above.
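
    A minimal sketch of that client behavior (the endpoint hostnames, health path, and timeout are hypothetical; a production client would also want jittered retries and request budgets) tracks the last observed latency per endpoint, tries the fastest first, and falls back immediately on failure instead of waiting for DNS failover and TTL expiry:

    import time
    import urllib.request

    # Hypothetical endpoints for the same service in different zones/regions.
    ENDPOINTS = [
        "https://api-use1a.example.com",
        "https://api-use1b.example.com",
        "https://api-usw2a.example.com",
    ]
    latency = {ep: float("inf") for ep in ENDPOINTS}  # learned over time, like SRTT

    def call_with_fallback(path, timeout=1.0):
        """Try the historically fastest endpoint first; on failure, fall back
        to the next fastest rather than waiting for DNS-level failover."""
        for endpoint in sorted(ENDPOINTS, key=latency.get):
            start = time.monotonic()
            try:
                with urllib.request.urlopen(endpoint + path, timeout=timeout) as resp:
                    latency[endpoint] = time.monotonic() - start
                    return resp.read()
            except OSError:
                latency[endpoint] = float("inf")  # demote until it answers again
        raise RuntimeError("all endpoints failed")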

    - Lee-Ming Zen

  • Organizing Software Deployments to Match Failure Conditions

    Deploying new software into production will always carry some amount of risk, and failed deployments (e.g., software bugs, misconfigurations, etc.) will occasionally occur. As a service owner, the goal is to reduce the number of these incidents and to limit customer impact when they do occur. One method to reduce potential impact is to shape your deployment strategies around the failure conditions of your service. Thus, when a deployment fails, the service owner has more control over the blast radius and the scope of the impact. These strategies require an understanding of how the various components of your system interact, how those components can fail and how those failures impact your customers. This blog post discusses some of the deployment strategy choices we've made on the Route 53 team and how those choices affect the availability of our service.

    To begin, I'll briefly describe some of the deployment procedures and the Route 53 architecture in order to provide some context for the deployment strategies that we have chosen. Hopefully, these examples will reveal strategies that could benefit your own service's availability. Like many services, Route 53 consists of multiple environments or stages: one for active development, one for staging changes to production and the production stage itself. The natural tension is that reducing the number of failed deployments in production usually means adding more rigidity and process, which slows down the release of new code. At Route 53, we do not enforce a strict release or deployment schedule; individual developers are responsible for verifying their changes in the staging environment and pushing their changes into production.

    Typically, our deployments proceed in a pipelined fashion. Each step of the pipeline is referred to as a "wave" and consists of some portion of our fleet. A pipeline is a good abstraction as each wave can be thought of as an independent and separate step. After each wave of the pipeline, the change can be verified -- this can include automatic, scheduled and manual testing as well as the verification of service metrics. Furthermore, we typically space out the earlier waves of production deployment at least 24 hours apart, in order to allow the changes to "bake." Letting our software bake refers to rolling out software changes slowly so that we can validate those changes and verify service metrics with production traffic before pushing the deployment to the next wave. The clear advantage of deploying new code to only a portion of your fleet is that it reduces the impact of a failed deployment to just the portion of the fleet containing the new code. Another benefit of our deployment infrastructure is that it provides us with a mechanism to quickly "roll back" a deployment to a previous software version if any problems are detected, which, in many cases, enables us to quickly mitigate a failed deployment.

    Based on our experience, we have further organized our deployments to match our failure conditions and reduce impact. First, our deployment strategies are tailored to the part of the system that is the target of our deployment. We commonly refer to two main components of Route 53: the control plane and the data plane (pictured below). The control plane consists primarily of our API and DNS change propagation system. Essentially, this is the part of our system that accepts a customer request to create or delete a DNS record and then transmits that update to all of our DNS servers distributed across the world. The data plane consists of our fleet of DNS servers that are responsible for answering DNS queries on behalf of our customers. These servers currently reside in more than 50 locations around the world. Both of these components have their own set of failure conditions and differ in how a failed deployment will impact customers. Further, a failure of one component may not impact the other. For example, an API outage where customers are unable to create new hosted zones or records has no impact on our data plane continuing to answer queries for all records created prior to the outage. Given their distinct sets of failure conditions, the control plane and data plane have their own deployment strategies, which are each discussed in more detail below.

    Control and Data Planes

    Control Plane Deployments

    The bulk of the control plane actually consists of two APIs. The first is our external API that is reachable from the Internet and is the entry point for customers to create, delete and view their DNS records. This external API performs authentication and authorization checks on customer requests before forwarding them to our internal API. The second, internal API supports a much larger set of operations than just the ones needed by the external API; it also includes operations required to monitor and propagate DNS changes to our DNS servers as well as other operations needed to operate and monitor the service. Failed deployments to the external API typically impact a customer's ability to view or modify their DNS records. The availability of this API is critical, as our customers may rely on the ability to update their DNS records quickly and reliably during an operational event for their own service or site.

    Deployments to the external API are fairly straightforward. For increased availability, we host the external API in multiple availability zones. Each wave of deployment consists of the hosts within a single availability zone, and each host in that availability zone is deployed to individually. If any single host deployment fails, the deployment to the entire availability zone is halted automatically. Some host failures may be quickly caught and mitigated by the load balancer for our hosts in that particular availability zone, which is responsible for health checking the hosts. Hosts that fail these load balancer health checks are automatically removed from service by the load balancer. Thus, a failed deployment to just a single host would result in it being removed from service automatically and the deployment halted without any operator intervention. For other types of failed deployments that may not cause the load balancer health checks to fail, restricting waves to a single availability zone allows us to easily flip away from that availability zone as soon as the failure is detected. A similar approach could be applied to services that utilize Route 53 plus ELB in multiple regions and availability zones. ELBs automatically health check their back-end instances and remove unhealthy instances from service. If you create Route 53 alias records marked to evaluate target health (see the ELB documentation for how to set this up) and all instances behind an ELB become unhealthy, Route 53 will fail away from this alias and attempt to find an alternate healthy record to serve. This configuration enables automatic failover at the DNS level for an unhealthy region or availability zone. To enable manual failover, convert the alias resource record set for your ELB to a weighted alias or associate it with a health check whose health you control; to initiate a failover, set the weight to 0 or fail the health check. A weighted alias also lets you slowly increase the traffic to that ELB, which can be useful for verifying your own software deployments to the back-end instances.
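
    As a sketch of the manual failover step (using boto3; the hosted zone IDs, record name, ELB DNS name, and SetIdentifier below are placeholders, and the record is assumed to already be a weighted alias), setting the weight to 0 shifts traffic away from that ELB, while a small positive weight reintroduces it gradually:

    import boto3

    route53 = boto3.client("route53")

    def set_elb_weight(weight):
        """Shift traffic toward or away from one ELB by adjusting its weighted
        alias record; Weight=0 effectively removes it from rotation."""
        route53.change_resource_record_sets(
            HostedZoneId="Z_EXAMPLE_ZONE",  # placeholder: your hosted zone ID
            ChangeBatch={
                "Comment": "manual failover / traffic shift",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.example.com.",
                        "Type": "A",
                        "SetIdentifier": "us-east-1-elb",
                        "Weight": weight,
                        "AliasTarget": {
                            "HostedZoneId": "Z_ELB_ZONE",  # placeholder: the ELB's zone ID
                            "DNSName": "my-elb-1234.us-east-1.elb.amazonaws.com.",
                            "EvaluateTargetHealth": True,
                        },
                    },
                }],
            },
        )

    set_elb_weight(0)     # fail away from this ELB
    # set_elb_weight(10)  # later: slowly reintroduce traffic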

    For our internal API, the deployment strategy is more complicated (pictured below). Here, our fleet is partitioned by the type of traffic it handles. We classify traffic into three types: (1) low-priority, long-running operations used to monitor the service (batch fleet), (2) all other operations used to operate and monitor the service (operations fleet) and (3) all customer operations (customer fleet). Deployments to the production internal API are then organized by how critical their traffic is to the service as a whole. For instance, the batch fleet is deployed to first because their operations are not critical to the running of the service and we can tolerate long outages of this fleet. Similarly, we prioritize the operations fleet below that of customer traffic as we would rather continue accepting and processing customer traffic after a failed deployment to the operations fleet. For the internal API, we have also organized our staging waves differently from our production waves. In the staging waves, all three fleets are split across two waves. This is done intentionally to allow us to verify that the code changes work in a split-world where multiple versions of the software are running simultaneously. We have found this to be useful in catching incompatibilities between software versions. Since we never deploy software in production to 100% of our fleet at the same time, our software updates must be designed to be compatible with the previous version. Finally, as with the external API, all wave deployments proceed with a single host at a time. For this API, we also include a deep application health check as part of the deployment. Similar to the load balancer health checks for the external API, if this health check fails, the entire deployment is immediately halted.

    Internal API Wave Deployments

    Data Plane Deployments

    As mentioned earlier, our data plane consists of Route 53's DNS servers, which are distributed across the world in more than 50 distinct locations (we refer to each location as an 'edge location'). An important consideration with our deployment strategy is how we stripe our anycast IP space across locations. In summary, each hosted zone is assigned four delegation name servers, each of which belongs to a "stripe" (i.e., one quarter of our anycast range). Generally speaking, each edge location announces only a single stripe, so each stripe is announced by roughly 1/4 of our edge locations worldwide. Thus, when a resolver issues a query against each of the four delegation name servers, those queries are directed via BGP to the closest (in a network sense) edge location from each stripe. While the availability and correctness of our API is important, the availability and correctness of our data plane are even more critical. In this case, an outage directly results in an outage for our customers. Furthermore, the impact of serving even a single wrong answer on behalf of a customer is magnified by that answer being cached by both intermediate resolvers and end clients alike. Thus, deployments to our data plane are organized even more carefully to both prevent failed deployments and reduce potential impact.

    The safest way to deploy and minimize impact would be to deploy to a single edge location at a time. However, with manual deployments that are overseen by a developer, this approach is just not scalable given how frequently we deploy new software to over 50 locations (with more added each year). Thus, most of our production deployment waves consist of multiple locations; the one exception is our first wave, which includes just a single location. Furthermore, this location is specifically chosen because it runs our oldest hardware, which gives us quick notification of any unintended performance degradation. It is important to note that while resolvers' caching behavior can cause issues if we serve an incorrect answer, resolvers handle other types of failures well. When a recursive resolver receives a query for a record that is not cached, it will typically issue queries to at least three of the four delegation name servers in parallel and it will use the first response it receives. Thus, in the event where one of our locations is black holing customer queries (i.e., not replying to DNS queries), the resolver should receive a response from one of the other delegation name servers. In this case, the only impact is to resolvers for which the edge location that is not answering would have been the fastest responder; those resolvers will effectively be waiting for the response from the second-fastest stripe. To take advantage of this resiliency, our other waves are organized such that they include edge locations that are geographically diverse, with the intent that for any single resolver, there will be nearby locations that are not included in the current deployment wave. Furthermore, to guarantee that at most one of any customer's four nameservers is affected, waves are organized by stripe. Finally, each stripe is spread across multiple waves so that failures impact only a single name server for a portion of our customers. An example of this strategy is depicted below. A few notes: first, our staging environment consists of a much smaller number of edge locations than production, so single-location waves are possible; second, each stripe is denoted by color, and in this example we see deployments spread across a blue and an orange stripe.

    You, too, can think about organizing your deployment strategy around your failure conditions. For example, if you have a database schema used by both your production system and a warehousing system, deploy the change to the warehousing system first to ensure you haven't broken any compatibility. You might catch problems with the warehousing system before they affect customer traffic.

    Data Plane Wave Deployments

    Conclusions

    Our team's experience operating Route 53 over the last 3+ years has highlighted the importance of reducing the impact of failed deployments. Over the years, we have been able to identify some of the common failure conditions and to organize our software deployments in such a way that mitigation is easier and the potential impact to our customers is smaller.

    - Nick Trebon

  • Shuffle Sharding: massive and magical fault isolation

    A standard deck of cards has 52 different playing cards and 2 jokers. If we shuffle a deck thoroughly and deal out a four card hand, there are over 300,000 different hands. Another way to say the same thing is that if we put the cards back, shuffle and deal again, the odds are worse than 1 in 300,000 that we'll deal the same hand again. It's very unlikely.

    It's also unlikely, less than a 1/4 chance, that even just one of the cards will match between the two hands. And to complete the picture, there's less than a 1/40 chance that two cards will match, and much less than a 1/1000 chance that three cards will be the same.

    Matching cards per hand
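
    These odds are easy to check. A short calculation (treating hands as unordered and using the full 54-card deck, jokers included) reproduces the figures above by computing the chance that exactly k cards of a freshly dealt hand match the previous hand:

    from math import comb

    deck, hand = 54, 4
    total = comb(deck, hand)
    print(f"distinct {hand}-card hands: {total}")  # 316,251 -- "over 300,000"

    for k in range(1, hand + 1):
        p = comb(hand, k) * comb(deck - hand, hand - k) / total
        print(f"exactly {k} matching card(s): {p:.4f}  (about 1 in {round(1 / p)})")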

    In our last post I promised to cover more about how Route 53 Infima can be used to isolate faults that are request-related, such as user or customer specific problems. Route 53 Infima's Shuffle Sharding takes this pattern of rapidly diminishing likelihood for an increasing number of matches - a pattern which underlies many card games, lotteries and even bingo - and combines it with traditional horizontal scaling to produce a kind of fault isolation that can seem almost magical.

    Traditional horizontal scaling

    All but the smallest services usually run on more than one instance (though there are some impressive exceptions). Using multiple instances means that we can have active redundancy: when one instance has a problem the others can take over. Typically traffic and requests are spread across all of the healthy instances.

    Though this pattern is great for balancing traffic and for handling occasional instance-level failure, it’s terrible if there’s something harmful about the requests themselves: every instance will be impacted. If the service is serving many customers for example, then one busy customer may swamp everyone else. Throttling requests on a per-customer basis can help, but even throttling mechanisms can themselves be overwhelmed.

    Worse still, throttling won't help if the problem is some kind of “poison request”. If a particular request happens to trigger a bug that causes the system to fall over, then the caller may trigger a cascading failure by repeatedly trying the same request against instance after instance until they have all fallen over.

    Sharding and bulkheads

    One fault isolating improvement we can make upon traditional horizontal scaling is to use sharding. Sharding is a technique traditionally used with data storage and indexing systems. Instead of spreading traffic from all customers across every instance, we can divide the instances into shards. For example if we have eight instances for our service, we might create four shards of two instances each (two instances for some redundancy within each shard).

    4 Shards

    Next we face the decision of how to shard. One common way is by customer id, assigning customers to particular shards, but other sharding choices are viable such as by operation type or by resource identifier. You can even choose to do multidimensional sharding, and have customer-resource pairs select a shard, or customer-resource-operation-type.

    Whatever makes the most sense for a given service depends on its innards and its particular mix of risks, but it's usually possible to find some combination of id or operation type that will make a big difference if it can be isolated.

    With this kind of sharding in place, if there is a problem caused by the requests, then we get the same kind of bulkhead effect we have seen before; the problem can be contained to one shard. So in our example above, with four shards, around one quarter of customers (or whichever other dimension has been chosen) may be impacted by a problem triggered by one customer. That's considerably better than all customers being impacted.

    If customers (or objects) are given specific DNS names to use (just as customers are given unique DNS names with many AWS services) then DNS can be used to keep per-customer traffic cleanly separated across shards.

    Shuffle Sharding

    With sharding, we are able to reduce customer impact in direct proportion to the number of shards we have. Even if we had 100 shards, 1% of customers would still experience impact in the event of a problem. One sensible response is to build monitoring systems that can detect these problems and, once a problem is detected, re-shard the problem requests to their own fully isolated shard. This is great, but it’s a reactive measure and can usually only mitigate impact, rather than avoid it in the first place.

    With Shuffle Sharding, we can do better. The basic idea of Shuffle Sharding is to generate shards as we might deal hands from a deck of cards. Take the eight-instance example. Previously, we divided it into four shards of two instances each. With Shuffle Sharding, each shard contains two randomly chosen instances, and the shards, just like our hands of cards, may have some overlap.

    Shuffle Sharding

    By choosing two instances from eight there are 28 potential shuffle shards (8 choose 2), far more than the four simple shards we had before.

    At first, it may seem as if these Shuffle Shards are less suited to isolating faults; in the above example diagram, two shuffle shards share instance 5, and so a problem affecting that instance may impact both shards. The key to resolving this is to make the client fault tolerant. By having simple retry logic in the client that causes it to try every endpoint in a Shuffle Shard, until one succeeds, we get a dramatic bulkhead effect.

    With the client trying each instance in the shard, a customer who is causing a problem on Shuffle Shard 1 may impact both instance 3 and instance 5 (and so be impacted themselves), but the customers using Shuffle Shard 2 should experience only negligible (if any) impact, provided the client retries have been carefully tested and implemented to handle this kind of partial degradation correctly. Thus the real impact is constrained to 1/28th of the overall shuffle shards.

    1/28 impact is a great improvement on 1/4, but we can do even better still. Before, with simple sharding, we needed to put two instances in each shard to have some redundancy. With Shuffle Sharding – as in traditional N+1 horizontal scaling – we have more instances available. We can shard to as many instances as we are willing to retry. With 3 retries – a common retry value – we can use four instances in total per shuffle shard.

    With four instances per shuffle shard, there are 70 potential shuffle shards (8 choose 4), so we can reduce the worst-case impact to 1/70 of our total customer base, and we’ve made the “noisy neighbor” problem much more manageable.

    Infima and Shuffle Sharding

    The Route 53 Infima library includes two kinds of Shuffle Sharding. The first kind is simple stateless shuffle sharding. Stateless shuffle sharding uses hashing, much like a bloom filter does, to take a customer, object or other identifier and turn it into a shuffle shard pattern. This technique results in some probability of overlap between customers, just as when we deal hands from a deck of cards. But by being stateless, this kind of shuffle sharding can be easily used, even directly in calling clients.
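
    A minimal sketch of the stateless idea (plain Python hashing for illustration, not Infima's actual algorithm): hash the identifier into a seed, then use that seed to make a deterministic choice of endpoints, so any caller can derive the same shuffle shard without shared state.

    import hashlib
    import random

    def stateless_shuffle_shard(identifier, endpoints, shard_size):
        """Deterministically derive a shuffle shard for `identifier`:
        the same identifier always maps to the same set of endpoints."""
        digest = hashlib.sha256(identifier.encode()).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        return sorted(rng.sample(endpoints, shard_size))

    endpoints = [f"instance-{i}" for i in range(8)]
    print(stateless_shuffle_shard("customer-42", endpoints, 2))
    print(stateless_shuffle_shard("customer-43", endpoints, 2))
    # Different identifiers get different, possibly overlapping shards;
    # with 8 endpoints and shards of 2 there are comb(8, 2) == 28 possibilities.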

    The second kind of Shuffle Sharding included is Stateful Searching Shuffle Sharding. These shuffle shards are generated using random assignment, again like hands from a deck of cards, but there is built-in support for checking each newly generated shard against every previously assigned shard in order to make guarantees about overlap. For example, we might choose to give every shuffle shard 4 out of 20 endpoints, but require that no two shuffle shards ever share more than two particular endpoints.

    Both kinds of shuffle sharding in Infima are compartmentalization-aware. For example, we can ensure that shuffle shards also make use of every availability zone. Say our eight instances are spread across 2 availability zones, 4 in each one. Infima can make sure to choose 2 endpoints from each zone, rather than simply 4 at random (which might land mostly, or even entirely, in a single availability zone).

    Shuffle Sharding across AZs
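
    A compartment-aware shard can be sketched the same way (again an illustration, not Infima's implementation): sample a fixed number of endpoints from each availability zone so that no shard ever collapses into a single zone.

    import random

    def az_aware_shuffle_shard(endpoints_by_az, per_az):
        """Pick per_az endpoints from every availability zone, so the shard
        always spans all zones."""
        shard = []
        for az in sorted(endpoints_by_az):
            shard.extend(random.sample(endpoints_by_az[az], per_az))
        return shard

    endpoints_by_az = {
        "us-east-1a": ["i-a1", "i-a2", "i-a3", "i-a4"],
        "us-east-1b": ["i-b1", "i-b2", "i-b3", "i-b4"],
    }
    print(az_aware_shuffle_shard(endpoints_by_az, per_az=2))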

    Lastly, Infima also makes it easy to use Shuffle Shards along with RubberTrees, so that endpoints can be easily expressed in DNS using Route 53. For example, every customer can be supplied their own DNS name, which maps to a shuffle shard which is handled by a RubberTree.

    Post-script

    The two general principles at work are, first, that it can often be better to use many smaller things, as this lowers the cost of capacity buffers and makes the impact of any contention small; and second, that it can be beneficial to allow shards to partially overlap in their membership, in return for an exponential increase in the number of shards the system can support.

    Those principles mean that Shuffle Sharding is a general-purpose technique, and you can also choose to Shuffle Shard across many kinds of resources, including pure in-memory data-structures such as queues, rate-limiters, locks and other contended resources.

    As it happens, Amazon Route 53, CloudFront and other AWS services use compartmentalization, per-customer Shuffle Sharding and more to provide fault isolation, and we will be sharing some more details about how some of that works in a future blog post.

    Update from the author: an earlier version of this blog post used an incorrect figure for the number of 4-card hands from a 52-card deck (I wrote 7 million, based on permutations, instead of 300,000 based on combinations).

    - Colm MacCárthaigh

  • AWS and Compartmentalization

    Practically every experienced driver has suffered a flat tire. It’s a real nuisance, you pull over, empty the trunk to get out your spare wheel, jack up the car and replace the puncture before driving yourself to a nearby repair shop. For a car that’s ok, we can tolerate the occasional nuisance, and as drivers we’re never that far from a safe place to pull over or a friendly repair shop.

    Using availability terminology, a spare tire is a kind of standby, a component or system that is idly waiting to be deployed when needed. These are common in computer systems too. Many databases rely on standby failover, for example, and some of them even rely on personal intervention, with a human running a script as they might wind a car-jack (though we’d recommend using Amazon Relational Database Service instead, which includes automated failover).

    But when the stakes are higher, things are done a little differently. Take the systems in a modern passenger jet for example, which, despite recent tragic events, have a stellar safety record. A flight can’t pull over, and in the event of a problem an airliner may have to make it several hours before being within range of a runway. For passenger jets it’s common for critical systems to use active redundancy. A twin-engine jet can fly with just one working engine, for example – so if one fails, the other can still easily keep the jet in the air.

    This kind of model is also common in large web systems. There are many EC2 instances handling amazon.com for example, and when one occasionally fails there’s a buffer of capacity spread across the other servers ensuring that customers don’t even notice.

    Jet engines don’t simply fail on their own though. Any one of dozens of components (digital engine controllers, fuel lines and pumps, gears and shafts, and so on) can cause the engine to stop working. For every one of these components, the aircraft designers could try to include some redundancy at the component level (and some do, such as avionics), but there are so many that it’s easier to re-frame the design in terms of fault isolation or compartmentalization: as long as each engine depends on separate instances of each component, then no one component can take out both engines. A fuel line may break, but it can only stop one engine from functioning, and the plane has already been designed to work with one engine out.

    This kind of compartmentalization is particularly useful for complex computer systems. A large website or web service may depend on tens or even hundreds of sub-services. Only so many can themselves include robust active redundancy. By aligning instances of sub-services so that inter-dependencies never go across compartments we can make sure that a problem can be contained to the compartment it started in. It also means that we can try to resolve problems by quarantining whole compartments, without needing to find the root of the problem within the compartment.

    AWS and Compartmentalization

    Amazon Web Services includes some features and offerings that enable effective compartmentalization. Firstly, many Amazon Web Services—for example, Amazon S3 and Amazon RDS—are themselves internally compartmentalized and make use of active redundancy designs so that when failures occur they are hidden.

    Secondly, we offer web services and resources in a range of sizes, along with automation in the form of Auto Scaling, CloudFormation templates, and OpsWorks recipes that make it easy to manage a larger number of instances.

    There is a subtle but important distinction between running a small number of large instances, and a large number of small instances. Four m3.xlarge instances cost as much as two m3.2xlarge instances and provide the same amount of CPU and storage; but for high availability configurations, using four instances requires only a 33% failover capacity buffer and any host-level problem may impact one quarter of your load, whereas using two instances means a 100% buffer and any problem may impact half of your load.
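
    The arithmetic behind that comparison is simple enough to sketch (assuming the goal is to keep serving the full steady-state load after a single host failure):

    def failover_buffer(n_instances, tolerated_failures=1):
        """Extra capacity needed, as a fraction of steady-state load, so the
        surviving instances can absorb the load of the failed ones."""
        survivors = n_instances - tolerated_failures
        return n_instances / survivors - 1.0

    for n in (2, 4, 8):
        print(f"{n} instances: {failover_buffer(n):.0%} buffer, "
              f"{1 / n:.0%} of load behind any single host")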

    Thirdly, Amazon Web Services has pre-made compartments: up to four availability zones per region. These availability zones are deeply compartmentalized down to the datacenter, network and power level.

    Suppose that we create a web site or web service that utilizes four availability zones. This means we need a 33% failover capacity buffer per zone to absorb the loss of any single zone (which compares well to the 100% failover capacity buffer in a standard two data center model). Our service consists of a front end, two dependent backend services (“Foo” and “Bar”) and a data-store (for this example, we’ll use S3).

    By constraining any sub-service calls to stay “within” the availability zone, we make it easier to isolate faults. If backend service “Bar” fails (for example, a software crash) in us-east-1b, this impacts one quarter of our overall capacity.

    4-az

    Initially this may not seem much better than if we had spread calls to the Bar service from all zones across all instances of the Bar service; after all, the failure rate would also be one quarter. But the difference is profound.

    Firstly, experience has shown that small problems can often become amplified in complex systems. For example if it takes the “Foo” service longer to handle a failed call to the “Bar” service, then the initial problem with the “Bar” service begins to impact the behavior of “Foo” and in turn the frontends.

    Secondly, by having a simple all-purpose mechanism to fail away from the infected availability zone, the problem can be reliably, simply, and quickly neutralized, just as a plane can be designed to fly on one engine and many types of failure handled with one procedure—if the engine is malfunctioning and a short checklist’s worth of actions don’t restore it to health, just shut it down and land at the next airport.

    Route 53 Infima

    Our suggested mechanism for handling this kind of failure is Amazon Route 53 DNS Failover. As DNS is the service that turns service/website names into the list of particular front-end IP addresses to connect to, it sits at the start of every request and is an ideal layer to neutralize problems.

    With Route 53 health checks and DNS failover, each front-end is constantly health checked and automatically removed from DNS if there is a problem. Route 53 Health Check URLs are fully customizable and can point to a script that checks every dependency in the availability zone (“Is Foo working, Is Bar working, is S3 reachable, etc …”).

    This brings us to Route 53 Infima. Infima is a library designed to model compartmentalization systematically and to help represent those kinds of configurations in DNS. With Infima, you assign endpoints to specific compartments such as availability zones. For advanced configurations you may also layer in additional compartmentalization dimensions; for example, you may want to run two different software implementations of the same service (perhaps for blue/green deployments, or for application-level redundancy) in each availability zone.

    Once the Infima library has been taught the layout of endpoints within the compartments, failures can be simulated in software and any gaps in capacity identified. But the real power of Infima comes in expressing these configurations in DNS. Our example service had 4 endpoints, in 4 availability zones. One option for expressing this in DNS is to return each endpoint one time in every four. Each answer could also depend on a health check, and when the health check fails, it could be removed from DNS. Infima supports this configuration.

    However, there is a better option. DNS (and naturally Route 53) allows several endpoints to be represented in a single answer, for example:

    ; ANSWER SECTION:
    example.com. 60 IN A 192.0.2.1
    example.com. 60 IN A 192.0.2.2
    example.com. 60 IN A 192.0.2.3
    example.com. 60 IN A 192.0.2.4

    When clients (such as browsers or web service clients) receive these answers, they generally try several endpoints until they find one that successfully connects. So by including all of the endpoints we gain some fault tolerance. When an endpoint is failing, though, as we’ve seen before, the problem can spread and clients can incur retry timeouts and some delay, so it’s still desirable to remove IPs from DNS answers in a timely manner.

    Infima can use the list of compartments, endpoints and their health checks to build what we call a RubberTree, a pre-computed decision tree of DNS answers with responses pre-baked for potential failures: a single node failing, a whole compartment failing, combinations of each and so on. This decision tree is then stored as a Route 53 configuration and can automatically handle any of those failures. So if the 192.0.2.3 endpoint were to fail, then:

    ; ANSWER SECTION:
    example.com. 60 IN A 192.0.2.1
    example.com. 60 IN A 192.0.2.2
    example.com. 60 IN A 192.0.2.4

    will be returned. By having these decision trees pre-baked and always ready and waiting, Route 53 is able to react quickly to endpoint failures, which with compartmentalization means we are also ready to handle failures of any sub-service serving that endpoint.
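
    As a toy model of the concept (not the RubberTree implementation; in Route 53 these answers live in health-check-driven record configurations rather than application code), we can pre-compute an answer for every combination of failed endpoints and simply look up the right one whenever health changes:

    from itertools import combinations

    ENDPOINTS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]

    # Pre-bake an answer for every possible set of failed endpoints,
    # always keeping at least one record in the answer.
    decision_tree = {}
    for k in range(len(ENDPOINTS)):
        for failed in combinations(ENDPOINTS, k):
            decision_tree[frozenset(failed)] = [ip for ip in ENDPOINTS if ip not in failed]

    def answer(failed_endpoints):
        """Return the pre-computed DNS answer for the current failure set."""
        return decision_tree.get(frozenset(failed_endpoints), ENDPOINTS)

    print(answer(set()))            # all four records
    print(answer({"192.0.2.3"}))    # 192.0.2.1, 192.0.2.2, 192.0.2.4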

    The compartmentalization we’ve seen so far is most useful for certain kinds of errors: host-level problems, occasional crashes, application lockups. But if the problem originates with the front-end requests themselves, for example a denial of service attack or a “poison pill” request that triggers a calamitous bug, then it can quickly infect all of your compartments. Infima also includes some neat functionality to assist in isolating even these kinds of faults, and that will be the topic of our next post.

    Bonus content: busting caches

    I wrote that removing failing endpoints from DNS in a timely manner is important, even when there are multiple endpoints in an answer. One problem we respond to in this area is broken application-level DNS caching. Certain platforms, including many versions of Java, do not respect DNS cache lifetimes (the DNS time-to-live or TTL value), and once a DNS response has been resolved it will be used indefinitely.

    One way to mitigate this problem is to use cache “busting”. Route 53 supports wildcard records (and wildcard ALIASes, CNAMEs and more). Instead of using a service name such as “api.example.com”, it is possible to use a wildcard name such as “*.api.example.com”, which will match requests for any name ending in “.api.example.com”.

    An application may then be written in such a way as to resolve a partially random name, e.g. “sdsHdsk3.api.example.com”. Since this name ends in .api.example.com, it will still receive the right answer, but because it is a unique random name every time, it will defeat (or “bust”) any broken platform or OS DNS caching.
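
    A minimal sketch of the client side of this trick (assuming a wildcard record such as *.api.example.com already exists; the label format and helper name are just illustrations):

    import secrets
    import socket

    def busted_lookup(base_domain="api.example.com"):
        """Resolve a unique random label under the wildcard, so caches that
        ignore TTLs can never reuse a stale answer for the same name."""
        hostname = f"{secrets.token_hex(4)}.{base_domain}"  # e.g. 9f3ac1d2.api.example.com
        return hostname, socket.gethostbyname(hostname)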

    - Colm MacCárthaigh