
Abstract: Memcached is a well-known, simple, in-memory caching solution. This paper describes how Facebook leverages memcached as a building block to construct and scale a distributed key-value store that supports the world's largest social network. Our system handles billions of requests per second and holds trillions of items to deliver a rich experience for over a billion users around the world.

1 Introduction

Popular and engaging social networking sites present significant infrastructure challenges. Hundreds of millions of people use these networks every day and impose computational, network, and I/O demands that traditional web architectures struggle to satisfy. A social network's infrastructure needs to (1) allow near real-time communication, (2) aggregate content on-the-fly from multiple sources, (3) be able to access and update very popular shared content, and (4) scale to process millions of user requests per second.

We describe how we improved the open source version of memcached[14] and used it as a building block to construct a distributed key-value store for the largest social network in the world. We discuss our journey scaling from a single cluster of servers to multiple geographically distributed clusters. To the best of our knowledge, this system is the largest memcached installation in the world, processing over a billion requests per second and storing trillions of items.

This paper is the latest in a series of works that have recognized the flexibility and utility of distributed key-value stores [1, 2, 5, 6, 12, 14, 34, 36]. This paper focuses on memcached—an open-source implementation of an in-memory hash table—as it provides low latency access to a shared storage pool at low cost. These qualities enable us to build data-intensive features that would otherwise be impractical. For example, a feature that issues hundreds of database queries per page request would likely never leave the prototype stage because it would be too slow and expensive. In our application, however, web pages routinely fetch thousands of key-value pairs from memcached servers.

One of our goals is to present the important themes that emerge at different scales of our deployment. While qualities like performance, efficiency, fault-tolerance, and consistency are important at all scales, our experience indicates that at specific sizes some qualities require more effort to achieve than others. For example, maintaining data consistency can be easier at small scales if replication is minimal, compared to larger scales where replication is often necessary. Additionally, the importance of finding an optimal communication schedule increases as the number of servers increases and networking becomes the bottleneck.

This paper includes four main contributions: (1) We describe the evolution of Facebook's memcached-based architecture. (2) We identify enhancements to memcached that improve performance and increase memory efficiency. (3) We highlight mechanisms that improve our ability to operate our system at scale. (4) We characterize the production workloads imposed on our system.

2 Overview

The following properties greatly influence our design. First, users consume an order of magnitude more content than they create. This behavior results in a workload dominated by fetching data and suggests that caching can have significant advantages. Second, our read operations fetch data from a variety of sources such as MySQL databases, HDFS installations, and backend services. This heterogeneity requires a flexible caching strategy able to store data from disparate sources.

Memcached provides a simple set of operations (set, get, and delete) that makes it attractive as an elemental component in a large-scale distributed system. The open-source version we started with provides a single-machine in-memory hash table. In this paper, we discuss how we took this basic building block, made it more efficient, and used it to build a distributed key-value store that can process billions of requests per second. Henceforth, we use 'memcached' to refer to the source code or a running binary and 'memcache' to describe the distributed system.
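The three elemental operations can be illustrated with a minimal single-machine hash table. This is a toy sketch of the interface, not memcached's actual implementation; the class name and keys are invented.

```python
# Toy sketch of memcached's elemental interface (set, get, delete),
# backed by a single-machine in-memory hash table. Not memcached's code.
class TinyCache:
    def __init__(self):
        self._table = {}  # key -> value, held entirely in memory

    def set(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table.get(key)  # None signals a cache miss

    def delete(self, key):
        self._table.pop(key, None)  # idempotent: deleting twice is safe

cache = TinyCache()
cache.set("user:42", "Alice")
assert cache.get("user:42") == "Alice"
cache.delete("user:42")
assert cache.get("user:42") is None  # miss after delete
```

The idempotence of `delete` matters later: it is why the paper's write path invalidates by deleting rather than updating.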

Figure 1: Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path.

Query cache: We rely on memcache to lighten the read load on our databases. In particular, we use memcache as a demand-filled look-aside cache as shown in Figure 1. When a web server needs data, it first requests the value from memcache by providing a string key. If the item addressed by that key is not cached, the web server retrieves the data from the database or other backend service and populates the cache with the key-value pair. For write requests, the web server issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. We choose to delete cached data instead of updating it because deletes are idempotent. Memcache is not the authoritative source of the data and is therefore allowed to evict cached data.
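The demand-filled look-aside pattern of Figure 1 can be sketched as follows. The `cache` and `db` dictionaries stand in for memcache and MySQL; the key names are invented for illustration.

```python
# Sketch of the demand-filled look-aside cache (Figure 1).
# `cache` stands in for memcache, `db` for the backing database.
cache = {}
db = {"user:1": "v1"}

def read(key):
    value = cache.get(key)        # 1. ask memcache first
    if value is None:             # 2. miss: fall back to the database
        value = db.get(key)
        if value is not None:
            cache[key] = value    # 3. populate the cache for later reads
    return value

def write(key, value):
    db[key] = value               # 1. the SQL write goes to the database
    cache.pop(key, None)          # 2. delete (not update) the stale entry;
                                  #    deletes are idempotent, so safe to retry

assert read("user:1") == "v1" and "user:1" in cache   # miss, then filled
write("user:1", "v2")
assert "user:1" not in cache                          # invalidated on write
assert read("user:1") == "v2"                         # next read refills
```

Deleting rather than updating on the write path means a retried or reordered invalidation cannot leave the cache holding a stale value.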

While there are several ways to address excessive read traffic on MySQL databases, we chose to use memcache. It was the best choice given limited engineering resources and time. Additionally, separating our caching layer from our persistence layer allows us to adjust each layer independently as our workload changes.

Generic cache: We also leverage memcache as a more general key-value store. For example, engineers use memcache to store pre-computed results from sophisticated machine learning algorithms which can then be used by a variety of other applications. It takes little effort for new services to leverage the existing memcache infrastructure without the burden of tuning, optimizing, provisioning, and maintaining a large server fleet.

As is, memcached provides no server-to-server coordination; it is an in-memory hash table running on a single server. In the remainder of this paper we describe how we built a distributed key-value store based on memcached capable of operating under Facebook’s workload. Our system provides a suite of configuration, aggregation, and routing services to organize memcached instances into a distributed system.

Figure 2: Overall architecture

We structure our paper to emphasize the themes that emerge at three different deployment scales. Our read-heavy workload and wide fan-out is the primary concern when we have one cluster of servers. As it becomes necessary to scale to multiple frontend clusters, we address data replication between these clusters. Finally, we describe mechanisms to provide a consistent user experience as we spread clusters around the world. Operational complexity and fault tolerance are important at all scales. We present salient data that supports our design decisions and refer the reader to work by Atikoglu et al. [8] for a more detailed analysis of our workload. At a high level, Figure 2 illustrates this final architecture in which we organize co-located clusters into a region and designate a master region that provides a data stream to keep non-master regions up-to-date.

While evolving our system we prioritize two major design goals. (1) Any change must impact a user-facing or operational issue. Optimizations that have limited scope are rarely considered. (2) We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.

3 In a Cluster: Latency and Load

We now consider the challenges of scaling to thousands of servers within a cluster. At this scale, most of our efforts focus on reducing either the latency of fetching cached data or the load imposed due to a cache miss.

3.1 Reducing Latency

Whether a request for data results in a cache hit or miss, the latency of memcache's response is a critical factor in the response time of a user's request. A single user web request can often result in hundreds of individual memcache get requests. For example, loading one of our popular pages results in an average of 521 distinct items fetched from memcache.

We provision hundreds of memcached servers in a cluster to reduce load on databases and other services. Items are distributed across the memcached servers through consistent hashing [22]. Thus web servers have to routinely communicate with many memcached servers to satisfy a user request. As a result, all web servers communicate with every memcached server in a short period of time. This all-to-all communication pattern can cause incast congestion [30] or allow a single server to become the bottleneck for many web servers. Data replication often alleviates the single-server bottleneck but leads to significant memory inefficiencies in the common case.
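The paper cites consistent hashing [22] for distributing items across servers. A generic sketch of the idea follows; the ring construction, virtual-node count, and server names are our illustrative choices, not Facebook's implementation.

```python
import bisect
import hashlib

# Generic consistent-hash ring sketch (per [22]); not Facebook's code.
# Each server is hashed to many "virtual node" points; a key is owned by
# the first ring point clockwise from the key's own hash.
class Ring:
    def __init__(self, servers, vnodes=100):
        self._ring = sorted(
            (self._h(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _h(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key):
        # First point clockwise from the key's hash, wrapping past the end.
        i = bisect.bisect(self._points, self._h(key)) % len(self._ring)
        return self._ring[i][1]

ring = Ring(["mc1", "mc2", "mc3"])   # hypothetical server names
owner = ring.server_for("user:42")
assert owner in {"mc1", "mc2", "mc3"}
assert ring.server_for("user:42") == owner  # placement is deterministic
```

Because every client hashes the same way, all web servers agree on which memcached server owns a key without any server-to-server coordination, and adding or removing a server moves only a fraction of the keys.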

We reduce latency mainly by focusing on the memcache client, which runs on each web server. This client serves a range of functions, including serialization, compression, request routing, error handling, and request batching. Clients maintain a map of all available servers, which is updated through an auxiliary configuration system.

Parallel requests and batching: We structure our web-application code to minimize the number of network round trips necessary to respond to page requests. We construct a directed acyclic graph (DAG) representing the dependencies between data. A web server uses this DAG to maximize the number of items that can be fetched concurrently. On average these batches consist of 24 keys per request.
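The dependency-driven batching above can be sketched with Python's standard topological sorter: everything whose dependencies are satisfied is fetched together in one batched round trip. The page items and edges below are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Sketch of DAG-driven batching. An edge means "needs the result of";
# the item names ("feed", "friends", "profile") are hypothetical.
deps = {"friends": set(), "profile": set(), "feed": {"friends", "profile"}}

ts = TopologicalSorter(deps)
ts.prepare()
rounds = []
while ts.is_active():
    batch = list(ts.get_ready())   # every item fetchable right now
    rounds.append(sorted(batch))   # one batched memcache request per round
    ts.done(*batch)

# "friends" and "profile" have no dependencies, so they share one round
# trip; "feed" depends on both and goes in the next one.
assert rounds == [["friends", "profile"], ["feed"]]
```

The web server thus pays one network round trip per DAG level rather than one per key, which is how a page can fetch hundreds of items in a handful of batches.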

Client-server communication: Memcached servers do not communicate with each other. When appropriate, we embed the complexity of the system into a stateless client rather than in the memcached servers. This greatly simplifies memcached and allows us to focus on making it highly performant for a more limited use case. Keeping the clients stateless enables rapid iteration in the software and simplifies our deployment process. Client logic is provided as two components: a library that can be embedded into applications or as a standalone proxy named mcrouter. This proxy presents a memcached server interface and routes the requests/replies to/from other servers.

Clients use UDP and TCP to communicate with memcached servers. We rely on UDP for get requests to reduce latency and overhead. Since UDP is connectionless, each thread in the web server is allowed to communicate with memcached servers directly, bypassing mcrouter, without establishing and maintaining a connection, thereby reducing the overhead. The UDP implementation detects packets that are dropped or received out of order (using sequence numbers) and treats them as errors on the client side. It does not provide any mechanism to try to recover from them. In our infrastructure, we find this decision to be practical. Under peak load, memcache clients observe that 0.25% of get requests are discarded. About 80% of these drops are due to late or dropped packets, while the remainder are due to out-of-order delivery. Clients treat get errors as cache misses, but web servers will skip inserting entries into memcached after querying for data, to avoid putting additional load on a possibly overloaded network or server.
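A rough client-side sketch of this error handling, under our own simplifying assumption that each reply carries a monotonically increasing sequence number: any gap or backwards jump marks the reply as an error, which the client then treats as a cache miss, with no recovery attempted.

```python
# Sketch of the client-side sequence-number check described above.
# The wire format is simplified: `replies` is a list of (seq, value)
# pairs as received; a None result stands for "error, treat as miss".
def classify(replies, expected_start=0):
    results, expected = [], expected_start
    for seq, value in replies:
        if seq == expected:
            results.append(value)   # in-order reply: usable
        else:
            results.append(None)    # gap or reorder: count as a miss,
                                    # no retransmit or recovery attempted
        expected = max(expected, seq) + 1
    return results

assert classify([(0, "a"), (1, "b")]) == ["a", "b"]
# seq 1 was dropped in transit, so its reply slot becomes a miss:
assert classify([(0, "a"), (2, "c")]) == ["a", None]
```

Treating these errors as misses keeps the fast path simple; the cost is bounded because, as noted above, only about 0.25% of get requests are affected even at peak load.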

Figure 3: Get latency for UDP, TCP via mcrouter
