Abstract:Memcached is a well known, simple, in-memory caching solution. This paper describes how Facebook leverages memcached as a building block to construct and scale a distributed key-value store that supports the world’s largest social network. Our system handles billions of requests per second and holds trillions of items to deliver a rich experience for over a billion users around the world.

1 Introduction

Popular and engaging social networking sites present significant infrastructure challenges. Hundreds of mil-lions of people use these networks every day and impose computational, network, and I/O demands that traditional web architectures struggle to satisfy. A social network’s infrastructure needs to (1) allow near real-time communication, (2) aggregate content on-the-fly from multiple sources, (3) be able to access and update very popular shared content, and (4) scale to process millions of user requests per second.

概要:Memcached 是一个知名的,简单的,全内存的缓存方案。这篇文章描述了facebook是如何使用memcached来构建和扩展一个分布式的key-value存储来为世界上最大的社交网站服务的。我们的系统每秒要处理几十亿的请求,同时存储了几万亿的数据项,可以给全世界超过10亿的用户提供丰富体验。

1 介绍

近些年SNS网络大行其道,这对网站基础建设提出了巨大的挑战。每天有亿万的用户在使用这些网络服务,巨大的计算、网络和I/O资源的需求使传统的web架构不堪重 负。SNS网站的基础架构需要满足:1、近乎实时的交流;2、即时聚合不同来源的内容;3、访问和更新非常热门的共享内容;4、每秒处理几百万的用户请求。

We describe how we improved the open source version of memcached[14] and used it as a building block to construct a distributed key-value store for the largest social network in the world. We discuss our journey scaling from a single cluster of servers to multiple geographically distributed clusters. To the best of our knowledge, this system is the largest memcached installation in the world, processing over a billion requests per second and storing trillions of items.

This paper is the latest in a series of works that have recognized the flexibility and utility of distributed key-value stores [1, 2, 5, 6, 12, 14, 34, 36]. This paper focuses on memcached—an open-source implementation of an in-memory hash table—as it provides low latency access to a shared storage pool at low cost. These qualities enable us to build data-intensive features that would otherwise be impractical. For example, a feature that issues hundreds of database queries per page request would likely never leave the prototype stage because it would be too slow and expensive. In our application,however, web pages routinely fetch thousands of key-value pairs from memcached servers.


本文是关于认识分布式key-value存储的灵活性和实用性的系列文章[1, 2, 5, 6, 12, 14, 34, 36]的最后一篇。本文关注于memcached,这是一个全内存哈希表的开源实现,它以较低的开销提供了对共享存储的低迟延访问。有了这些特性我们可以构建数据密集的功能,否则是不可能的。例如,如果一个页面请求会产生数以百计的数据库请求,那么这样的功能只能停止在原型阶段,因为实现起来会太慢,代价也太高。然而,在我们的应用里,web页面通常都会从memcached服务器获取数以千计的key-value对。

One of our goals is to present the important themes that emerge at different scales of our deployment. While qualities like performance, efficiency, fault-tolerance, and consistency are important at all scales, our experience indicates that at specific sizes some qualities re-quire more effort to achieve than others. For example, maintaining data consistency can be easier at small scales if replication is minimal compared to larger ones where replication is often necessary. Additionally, the importance of finding an optimal communication schedule increases as the number of servers increase and net-working becomes the bottleneck.

This paper includes four main contributions: (1)We describe the evolution of Facebook’s memcached-based architecture. (2) We identify enhancements to memcached that improve performance and increase memory efficiency. (3) We highlight mechanisms that improve our ability to operate our system at scale. (4)We characterize the production workloads imposed on our system.


本文包括四个主要贡献:(1)我们描述了Facebook的基于memcach架构的演化。 (2)我们确定memcached的提高性能和增加内存效率的改进。 (3)我们简明扼要地讲述提高我们的经营能力我们的系统规模的机制。 (4)我们对生产工作负载赋予了特色(译者加:对工作负载进行了分类?)。

2 Overview

The following properties greatly influence our design. First, users consume an order of magnitude more con-tent than they create. This behavior results in a workload dominated by fetching data and suggests that caching can have significant advantages. Second, our read operations fetch data from a variety of sources such as MySQL databases, HDFS installations, and backend services. This heterogeneity requires a flexible caching strategy able to store data from disparate sources.

-------------------------------------------page 1-------------------------------------------

Memcached provides a simple set of operations (set, get, and delete) that makes it attractive as an elemental component in a large-scale distributed system. The open-source version we started with provides a single-machine in-memory hash table. In this paper, we discuss how we took this basic building block, made it more efficient, and used it to build a distributed key-value store that can process billions of requests per second. Henceforth, we use 'memcached' to refer to the source code or a running binary and 'memcache' to describe the distributed system.

Figure 1: Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path.





Query cache:We rely on memcache to lighten the read load on our databases. In particular, we use memcache as a demand-filled look-aside cache as shown in Figure 1. When a web server needs data, it first requests the value from memcache by providing a string key. If the item addressed by that key is not cached, the web server retrieves the data from the database or other back-end service and populates the cache with the key-value pair. For write requests, the web server issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. We choose to delete cached data instead of updating it because deletes are idempotent. Memcache is not the authoritative source of the data and is therefore allowed to evict cached data.

While there are several ways to address excessive read traffic on MySQL databases, we chose to use memcache. It was the best choice given limited engineering resources and time. Additionally, separating our caching layer from our persistence layer allows us to ad-just each layer independently as our workload changes.



Generic cache:We also leverage memcacheas a more general key-value store. For example, engineers use memcache to store pre-computed results from sophisticated machine learning algorithms which can then be used by a variety of other applications. It takes little ef-fort for new services to leverage the existing marcher infrastructure without the burden of tuning, optimizing, provisioning, and maintaining a large server fleet.

As is, memcached provides no server-to-server coordination; it is an in-memory hash table running on a single server. In the remainder of this paper we describe how we built a distributed key-value store based on memcached capable of operating under Facebook’s workload. Our system provides a suite of configuration, aggregation, and routing services to organize memcached instances into a distributed system.

Figure 2: Overall architecture




We structure our paper to emphasize the themes that emerge at three different deployment scales. Our read-heavy workload and wide fan-out is the primary concern when we have one cluster of servers. As it becomes necessary to scale to multiple frontend clusters, we ad-dress data replication between these clusters. Finally, we describe mechanisms to provide a consistent user experience as we spread clusters around the world. Operational complexity and fault tolerance is important at all scales. We present salient data that supports our design decisions and refer the reader to work by Atikoglu et al.[8] for a more detailed analysis of our workload. At a high-level, Figure 2 illustrates this final architecture in which we organize co-located clusters into a region and designate a master region that provides a data stream to keep non-master regions up-to-date.

While evolving our system we prioritize two major design goals. (1) Any change must impact a user-facing or operational issue. Optimizations that have limited scope are rarely considered. (2) We treat the prob-ability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.

论文的结构主要描述了在三种不同的规模下出现的问题。当我们拥有第一个服务器集群时,频繁的读负载和广泛的输出是我们最大的担心。当有必要扩展到多个前端集群时,我们解决了集群间的数据备份问题。最后,我们描述了一种机制,这种机制让我们可以在全世界伸展集群的同时提供平滑的用户体验。不论在什么尺度上,容错性和操作复杂性总是很重要的。我们展示了重要的数据参考,这些数据指引我们做出了最终的设计决定,读者如需获得更多细节性的分析,请参看Atikoglu et al.[8]的工作。提纲挈领的解释参看图2,这是最终的架构,我们将并置集群组织起来,形成一个群体(region),指定一个主群体(master),由主群体提供数据流让非主群体保持数据同步。


1. 只有已经对用户或者我们的运维产生影响的问题,才值得改变。我们极少考虑范围有限的优化。

2. 对陈旧数据的瞬态读取,其概率和响应度类似,都将作为参数来调整。我们会暴露轻度陈旧的数据以便后台存储和高强度负载绝缘。

3 In a Cluster: Latency and Load

We now consider the challenges of scaling to thousands of servers within a cluster. At this scale, most of our efforts focus on reducing either the latency of fetching cached data or the load imposed due to a cache miss.

-------------------------------------------page 2-------------------------------------------

3.1 Reducing Latency

Whether a request for data results in a cache hit or miss, the latency of memcache’s response is a critical factor in the response time of a user’s request. A single user web request can often result in hundreds of individual memcache get requests. For example, loading one of our popular pages results in an average of 521 distinct items fetched from memcache.1

We provision hundreds of memcached servers in a cluster to reduce load on databases and other services. Items are distributed across the memcached servers through consistent hashing [22]. Thus web servers have to routinely communicate with many memcached servers to satisfy a user request. As a result, all web servers communicate with every memcached server in a short period of time. This all-to-all communication pattern can cause incast congestion [30] or allow a single server to become the bottleneck for many web servers. Data replication often alleviates the single-server bottleneck but leads to significant memory inefficiencies in the common case.

3 集群之中: 延迟和负载


3.1 减少延迟


为了减少数据库等的负担,我们准备了缓存集群,每个集群都由数百台memcache服务器组成。资源个体经hash后存于不同的memcache服务器中。因此,web服务器必须请求多台memcache服务器,才能满足用户的请求。由此导致在很短的时间里每个web服务器都要和所有的memcache服务器沟通。这种所有对所有的连接模式会导致潮涌堵塞(incast congestion)或者某台服务器不幸成为瓶颈。实时备份可以缓解这种状况,但一般又会引起巨大的内存浪费。(译者:为何?)

We reduce latency mainly by focusing on the memcache client, which runs on each web server. This client serves a range of functions, including serialization, compression, request routing, error handling, and request batching. Clients maintain a map of all available servers, which is updated through an auxiliary configuration system.

Parallel requests and batching: We structure our web-application code to minimize the number of network round trips necessary to respond to page requests. We construct a directed acyclic graph (DAG) representing the dependencies between data. A web server uses this DAG to maximize the number of items that can be fetched concurrently. On average these batches consist of 24 keys per request.2

Client-server communication: Memcached servers do not communicate with each other. When appropriate, we embed the complexity of the system into a stateless client rather than in the memcached servers. This greatly simplifies memcached and allows us to focus on making it highly performant for a more limited use case. Keeping the clients stateless enables rapid iteration in the software and simplifies our deployment process. Client logic is provided as two components: a library that can be embedded into applications or as a standalone proxy named mcrouter. This proxy presents a memcached server interface and routes the requests/replies to/from other servers.




Clients use UDP and TCP to communicate with memcached servers. We rely on UDP for get requests to reduce latency and overhead. Since UDP is connection-less, each thread in the web server is allowed to directly communicate with memcached servers directly, bypassing mcrouter, without establishing and maintaining a connection thereby reducing the overhead. The UDP implementation detects packets that are dropped or received out of order (using sequence numbers) and treats them as errors on the client side. It does not provide any mechanism to try to recover from them. In our infrastructure, we find this decision to be practical. Under peak load, memcache clients observe that 0.25% of get requests are discarded. About 80% of these drops are due to late or dropped packets, while the remainder are due to out of order delivery. Clients treat get errors as cache misses, but web servers will skip inserting entries into memcached after querying for data to avoid putting additional load on a possibly overloaded network or server.

Figure 3: Get latency for UDP, TCP via mcrouter


图 3: 经过mcrouter以后 UDP, TCP得到的延迟