高可用简史 已翻译 100%

oschina 投递于 2018/11/10 08:02 (共 16 段, 翻译完成于 11-15)
阅读 1347
收藏 6
0
加载中

I once went to a website that had “hours of operation,” and was only “open” when its brick and mortar counterpart had its lights on. I felt perplexed and a little frustrated; computers are capable of running all day every day, so why shouldn’t they? I’d been habituated to the internet’s incredible availability guarantees.

However, before the internet, 24⁄7 availability wasn’t “a thing.” Availability was desirable, but not something to which we felt fundamentally entitled. We used computers only when we needed them; they weren’t waiting idly by on the off-chance a request came by. As the internet grew, those previously uncommon requests at 3am local time became prime business hours partway across the globe, and making sure that a computer could facilitate the request was important.

已有 1 人翻译此段
我来翻译

Many systems, though, relied on only one computer to facilitate these requests––which we all know is a story that doesn’t end well. To keep things up and running, we needed to distribute the load among multiple computers that could fulfill our needs. However, distributed computation, for all its well-known upsides, has sharp edges: in particular, synchronization and tolerating partial failures within a system. Each generation of engineers has iterated on these solutions to fit the needs of their time.

How distribution came to databases is of particular interest because it’s a difficult problem that’s been much slower to develop than other areas of computer science. Certainly, software tracked the results of some distributed computation in a local database, but the state of the database itself was kept on a single machine. Why? Replicating state across machines is hard.

In this post, we want to take a look at how distributed databases have historically handled partial failures within a system and understand––at a high level––what high availability looks like.

已有 1 人翻译此段
我来翻译

Working with What We Have: Active-Passive

In the days of yore, databases ran on single machines. There was only one node and it handled all reads and all writes. There was no such thing as a “partial failure”; the database was either up or down.

Total failure of a single database was a two-fold problem for the internet; first, computers were being accessed around the clock, so downtime was more likely to directly impact users; second, by placing computers under constant demand, they were more likely to fail. The obvious solution to this problem is to have more than one computer that can handle the request, and this is where the story of distributed databases truly begins.

已有 1 人翻译此段
我来翻译

Living in a single-node world, the most natural solution was to continue letting a single node serve reads and writes and simply sync its state onto a secondary, passive machine––and thus, Active-Passive replication was born.

Active-Passive improved availability by having an up-to-date backup in cases where the active node failed––you could simply start directing traffic to the passive node, thereby promoting it to being active. Whenever you could, you’d replaced the downed server with a new passive machine (and hope the active one didn’t fail in the interim).

At first, replication from the active to the passive node was a synchronous procedure, i.e., transformations weren’t committed until the Passive node acknowledged them. However, it was unclear what to do if the passive node went down. It certainly didn’t make sense for the entire system to go down if the backup system wasn’t available––but with synchronous replication, that’s what would happen.

已有 1 人翻译此段
我来翻译

To further improve availability, data could instead be replicated asynchronously. While its architecture looks the same, it was capable of handling either the active or the passive node going down without impacting the database’s availability.

While asynchronous Active-Passive was another step forward, there were still significant downsides:

  • When the active node died, any data that wasn’t yet replicated to the passive node could be lost––despite the fact that the client was led to believe the data was fully committed.

  • By relying on a single machine to handle traffic, you were still bound to the maximum available resources of a single machine.

已有 1 人翻译此段
我来翻译

Chasing Five 9s: Scale to Many Machines

As the Internet proliferated, business’ needs grew in scale and complexity. For databases this meant that they needed the ability to handle more traffic than any single node could handle, and that providing “always on” high availability became a mandate.

Given that swaths of engineers now had experience working on other distributed technologies, it was clear that databases could move beyond single-node Active-Passive setups and distribute a database across many machines.

已有 1 人翻译此段
我来翻译

Sharding

Again, the easiest place to start is adapting what you currently have, so engineers adapted Active-Passive replication into something more scalable by developing sharding.

In this scheme, you split up a cluster’s data by some value (such as a number of rows or unique values in a primary key) and distributed those segments among a number of sites, each of which has an Active-Passive pair. You then add some kind of routing technology in front of the cluster to direct clients to the correct site for their requests.

Sharding lets you distribute your workload among many machines, improving throughput, as well as creating even greater resilience by tolerating a greater number of partial failures.

已有 1 人翻译此段
我来翻译

Despite these upsides, sharding a system was complex and posed a substantial operational burden on teams. The deliberate accounting of shards could grow so onerous that the routing ended up creeping into an application’s business logic. And worse, if you needed to modify the way a system was sharded (such as a schema change), it often posed a significant (or even monumental) amount of engineering to achieve.

Single-node Active-Passive systems had also provided transactional support (even if not strong consistency). However, the difficulty of coordinating transactions across shards was so knotted and complex, many sharded systems decided to forgo them completely.

已有 1 人翻译此段
我来翻译

Active-Active

Given that sharded databases were difficult to manage and not fully featured, engineers began developing systems that would at least solve one of the problems. What emerged were systems that still didn’t support transactions, but were dramatically easier to manage. With the increased demand on applications’ uptime, it was a sensible decision to help teams meet their SLAs.

The motivating idea behind these systems was that each site could contain some (or all) of a cluster’s data and serve reads and writes for it. Whenever a node received a write it would propagate the change to all other nodes that would need a copy of it. To handle situations where two nodes received writes for the same key, other nodes’ transformations were fed into a conflict resolution algorithm before committing. Given that each site was “active”, it was dubbed Active-Active.

Because each server could handle reads and writes for all of its data, sharding was easier to accomplish algorithmically and made deployments easier to manage.

已有 1 人翻译此段
我来翻译

In terms of availability, Active-Active was excellent. If a node failed, clients just needed to be redirected to another node that did contain the data. As long as a single replica of the data was live, you could serve both reads and writes for it.

While this scheme is fantastic for availability, its design is fundamentally at odds with consistency. Because each site can handle writes for a key (and would in a failover scenario), it’s incredibly difficult to keep data totally synchronized as it is being processed. Instead, the approach is generally to mediate conflicts between sites through the conflict resolution algorithm that makes coarse-grained decisions about how to “smooth out” inconsistencies.

Because that resolution is done post hoc, after a client has already received an answer about a procedure––and has theoretically executed other business logic based on the response––it’s easy for active-active replication to generate anomalies in your data.

Given the premium on uptime, though, the cost of downtime was deemed greater than the cost of potential anomalies, so Active-Active became the dominant replication type.

已有 1 人翻译此段
我来翻译
本文中的所有译文仅用于学习和交流目的,转载请务必注明文章译者、出处、和本文链接。
我们的翻译工作遵照 CC 协议,如果我们的工作有侵犯到您的权益,请及时联系我们。
加载中

评论(0)

返回顶部
顶部