I once went to a website that had “hours of operation,” and was only “open” when its brick-and-mortar counterpart had its lights on. I felt perplexed and a little frustrated; computers are capable of running all day every day, so why shouldn’t they? I’d been habituated to the internet’s incredible availability guarantees.

However, before the internet, 24/7 availability wasn’t “a thing.” Availability was desirable, but not something to which we felt fundamentally entitled. We used computers only when we needed them; they weren’t waiting idly by on the off-chance a request came by. As the internet grew, those previously uncommon requests at 3am local time became prime business hours partway across the globe, and making sure that a computer could facilitate the request was important.

Many systems, though, relied on only one computer to facilitate these requests––which we all know is a story that doesn’t end well. To keep things up and running, we needed to distribute the load among multiple computers that could fulfill our needs. However, distributed computation, for all its well-known upsides, has sharp edges: in particular, synchronization and tolerating partial failures within a system. Each generation of engineers has iterated on these solutions to fit the needs of their time.

How distribution came to databases is of particular interest because it’s a difficult problem that’s been much slower to develop than other areas of computer science. Certainly, software tracked the results of some distributed computation in a local database, but the state of the database itself was kept on a single machine. Why? Replicating state across machines is hard.

In this post, we want to take a look at how distributed databases have historically handled partial failures within a system and understand––at a high level––what high availability looks like.

Working with What We Have: Active-Passive

In the days of yore, databases ran on single machines. There was only one node and it handled all reads and all writes. There was no such thing as a “partial failure”; the database was either up or down.

Total failure of a single database was a two-fold problem for the internet: first, computers were being accessed around the clock, so downtime was more likely to directly impact users; second, being under constant demand made those computers more likely to fail. The obvious solution to this problem is to have more than one computer that can handle the request, and this is where the story of distributed databases truly begins.

Living in a single-node world, the most natural solution was to continue letting a single node serve reads and writes and simply sync its state onto a secondary, passive machine––and thus, Active-Passive replication was born.

Active-Passive improved availability by having an up-to-date backup in cases where the active node failed––you could simply start directing traffic to the passive node, thereby promoting it to active. Whenever you could, you’d replace the downed server with a new passive machine (and hope the active one didn’t fail in the interim).

At first, replication from the active to the passive node was a synchronous procedure, i.e., transformations weren’t committed until the passive node acknowledged them. However, it was unclear what to do if the passive node went down. It certainly didn’t make sense for the entire system to go down if the backup system wasn’t available––but with synchronous replication, that’s exactly what would happen.

To further improve availability, data could instead be replicated asynchronously. While the architecture looked the same, asynchronous Active-Passive could handle either the active or the passive node going down without impacting the database’s availability.
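To make the difference concrete, here is a minimal sketch of both schemes in Python; the ActiveNode and PassiveNode classes and their methods are hypothetical, invented purely for illustration rather than taken from any real database.

    # A minimal, illustrative sketch of Active-Passive replication.
    # ActiveNode, PassiveNode, and their methods are hypothetical names,
    # not any real database's API.
    import queue
    import threading

    class PassiveNode:
        def __init__(self):
            self.data = {}

        def apply(self, key, value):
            # Apply a replicated write and acknowledge it.
            self.data[key] = value
            return True

    class ActiveNode:
        def __init__(self, passive, synchronous=True):
            self.data = {}
            self.passive = passive
            self.synchronous = synchronous
            self._backlog = queue.Queue()
            if not synchronous:
                # A background thread ships writes to the passive node later.
                threading.Thread(target=self._ship_backlog, daemon=True).start()

        def write(self, key, value):
            self.data[key] = value
            if self.synchronous:
                # Don't report success until the passive node has the write;
                # if the passive node is down, the commit blocks or fails.
                if not self.passive.apply(key, value):
                    raise RuntimeError("replication failed; write not committed")
            else:
                # Acknowledge immediately and replicate in the background;
                # anything still in the backlog is lost if this node dies.
                self._backlog.put((key, value))
            return "committed"

        def _ship_backlog(self):
            while True:
                key, value = self._backlog.get()
                self.passive.apply(key, value)

The failure modes fall out of the sketch: with synchronous=True, an unavailable passive node stalls or fails every commit, which is exactly the problem described above; with synchronous=False, any write still sitting in the backlog when the active node dies is gone, even though the client was already told “committed.”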

While asynchronous Active-Passive was another step forward, there were still significant downsides:

  • When the active node died, any data that wasn’t yet replicated to the passive node could be lost––despite the fact that the client was led to believe the data was fully committed.

  • By relying on a single machine to handle traffic, you were still bound to the maximum available resources of a single machine.

Chasing Five 9s: Scale to Many Machines

As the Internet proliferated, businesses’ needs grew in scale and complexity. For databases, this meant that they needed to handle more traffic than any single node could, and that providing “always on” high availability became a mandate.

Given that swaths of engineers now had experience working on other distributed technologies, it was clear that databases could move beyond single-node Active-Passive setups and distribute a database across many machines.

Sharding

Again, the easiest place to start is adapting what you currently have, so engineers adapted Active-Passive replication into something more scalable by developing sharding.

In this scheme, you split up a cluster’s data by some value (such as the number of rows or the unique values in a primary key) and distribute those segments among a number of sites, each of which has an Active-Passive pair. You then add some kind of routing technology in front of the cluster to direct clients to the correct site for their requests.
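As a rough illustration of that routing layer, here is a minimal range-based router in Python; the split points and site names are invented for the example, and a real deployment would keep this mapping in a dedicated metadata service.

    # Illustrative range-based shard router; the split points and
    # site names below are made up for the example.
    import bisect

    # Upper bounds (exclusive) of each shard's primary-key range.
    SPLIT_POINTS = [1_000_000, 2_000_000, 3_000_000]

    # Each site is an Active-Passive pair responsible for one range.
    SITES = [
        {"active": "site-a-primary", "passive": "site-a-standby"},
        {"active": "site-b-primary", "passive": "site-b-standby"},
        {"active": "site-c-primary", "passive": "site-c-standby"},
        {"active": "site-d-primary", "passive": "site-d-standby"},
    ]

    def route(primary_key: int) -> dict:
        # Find which shard's range the key falls into and return
        # the Active-Passive pair that owns it.
        shard = bisect.bisect_right(SPLIT_POINTS, primary_key)
        return SITES[shard]

    # A request for key 2,500,000 is routed to the third site's active node.
    print(route(2_500_000)["active"])  # site-c-primary

Hash-based sharding works the same way, with something like hash(key) % len(SITES) in place of the range lookup; either way, changing how the data is split means rewriting this mapping and physically moving rows between sites.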

Sharding lets you distribute your workload among many machines, improving throughput, as well as creating even greater resilience by tolerating a greater number of partial failures.

Despite these upsides, sharding a system was complex and posed a substantial operational burden on teams. The deliberate accounting of shards could grow so onerous that the routing ended up creeping into an application’s business logic. And worse, if you needed to modify the way a system was sharded (such as for a schema change), it often required a significant (or even monumental) amount of engineering.

Single-node Active-Passive systems had also provided transactional support (even if not strong consistency). However, coordinating transactions across shards was so knotty and complex that many sharded systems decided to forgo them completely.

Active-Active

Given that sharded databases were difficult to manage and not fully featured, engineers began developing systems that would solve at least one of those problems. What emerged were systems that still didn’t support transactions, but were dramatically easier to manage. Given the growing demands on applications’ uptime, it was a sensible trade-off that helped teams meet their SLAs.

The motivating idea behind these systems was that each site could contain some (or all) of a cluster’s data and serve reads and writes for it. Whenever a node received a write, it would propagate the change to all other nodes that needed a copy of it. To handle situations where two nodes received writes for the same key, other nodes’ transformations were fed into a conflict resolution algorithm before committing. Given that each site was “active”, it was dubbed Active-Active.
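As a sketch of the idea, here is a toy Active-Active node in Python that accepts writes locally, gossips them to its peers, and resolves concurrent writes with last-write-wins; the class and method names are invented, and last-write-wins is just one of many possible conflict resolution algorithms, chosen here for brevity.

    # Illustrative Active-Active node using last-write-wins (LWW)
    # conflict resolution; all names are invented for the example.
    import itertools

    # A shared logical clock, standing in for the timestamps a real
    # system would attach to each write.
    _clock = itertools.count(1)

    class ActiveActiveNode:
        def __init__(self, name):
            self.name = name
            self.data = {}  # key -> (timestamp, value)
            self.log = []   # locally accepted writes, to be gossiped to peers

        def write(self, key, value):
            # Accept a local write; the client is answered immediately,
            # before any other site has seen (or contested) the change.
            ts = next(_clock)
            self._apply(key, ts, value)
            self.log.append((key, ts, value))
            return "committed"

        def gossip_to(self, peer):
            # Propagate locally accepted writes to another node.
            for key, ts, value in self.log:
                peer._apply(key, ts, value)

        def _apply(self, key, ts, value):
            current = self.data.get(key)
            # Last-write-wins: keep the write with the larger timestamp.
            if current is None or ts > current[0]:
                self.data[key] = (ts, value)

Because a write is acknowledged before the rest of the cluster weighs in, conflicts are only discovered and smoothed over after the fact, which is the post-hoc resolution discussed below.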

Because each server could handle reads and writes for all of its data, sharding was easier to accomplish algorithmically and made deployments easier to manage.

In terms of availability, Active-Active was excellent. If a node failed, clients just needed to be redirected to another node that did contain the data. As long as a single replica of the data was live, you could serve both reads and writes for it.

While this scheme is fantastic for availability, its design is fundamentally at odds with consistency. Because each site can handle writes for a key (and would in a failover scenario), it’s incredibly difficult to keep data totally synchronized as it is being processed. Instead, the approach is generally to mediate conflicts between sites through a conflict resolution algorithm that makes coarse-grained decisions about how to “smooth out” inconsistencies.

Because that resolution is done post hoc, after a client has already received an answer about a procedure––and has theoretically executed other business logic based on the response––it’s easy for Active-Active replication to generate anomalies in your data.
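Using the toy ActiveActiveNode sketched above (still purely illustrative), the anomaly looks something like this:

    # Two sites accept writes for the same key before either has gossiped.
    a = ActiveActiveNode("a")
    b = ActiveActiveNode("b")

    a.write("balance", 100)  # client 1 hears "committed"
    b.write("balance", 250)  # client 2 hears "committed"

    # Replication catches up afterwards; the conflict is resolved post hoc.
    a.gossip_to(b)
    b.gossip_to(a)

    # Last-write-wins keeps only one value on both nodes; client 1's
    # "committed" write has silently disappeared.
    print(a.data["balance"], b.data["balance"])  # (2, 250) (2, 250)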

Given the premium on uptime, though, the cost of downtime was deemed greater than the cost of potential anomalies, so Active-Active became the dominant replication type.
