翻译于 2013/04/23 09:55
7 人 顶 此译文
Abstract:Memcached is a well known, simple, in-memory caching solution. This paper describes how Facebook leverages memcached as a building block to construct and scale a distributed key-value store that supports the world’s largest social network. Our system handles billions of requests per second and holds trillions of items to deliver a rich experience for over a billion users around the world.
Popular and engaging social networking sites present significant infrastructure challenges. Hundreds of mil-lions of people use these networks every day and impose computational, network, and I/O demands that traditional web architectures struggle to satisfy. A social network’s infrastructure needs to (1) allow near real-time communication, (2) aggregate content on-the-fly from multiple sources, (3) be able to access and update very popular shared content, and (4) scale to process millions of user requests per second.
We describe how we improved the open source version of memcached and used it as a building block to construct a distributed key-value store for the largest social network in the world. We discuss our journey scaling from a single cluster of servers to multiple geographically distributed clusters. To the best of our knowledge, this system is the largest memcached installation in the world, processing over a billion requests per second and storing trillions of items.
This paper is the latest in a series of works that have recognized the flexibility and utility of distributed key-value stores [1, 2, 5, 6, 12, 14, 34, 36]. This paper focuses on memcached—an open-source implementation of an in-memory hash table—as it provides low latency access to a shared storage pool at low cost. These qualities enable us to build data-intensive features that would otherwise be impractical. For example, a feature that issues hundreds of database queries per page request would likely never leave the prototype stage because it would be too slow and expensive. In our application,however, web pages routinely fetch thousands of key-value pairs from memcached servers.
本文是关于认识分布式key-value存储的灵活性和实用性的系列文章[1, 2, 5, 6, 12, 14, 34, 36]的最后一篇。本文关注于memcached，这是一个全内存哈希表的开源实现，它以较低的开销提供了对共享存储的低迟延访问。有了这些特性我们可以构建数据密集的功能，否则是不可能的。例如，如果一个页面请求会产生数以百计的数据库请求，那么这样的功能只能停止在原型阶段，因为实现起来会太慢，代价也太高。然而，在我们的应用里，web页面通常都会从memcached服务器获取数以千计的key-value对。
One of our goals is to present the important themes that emerge at different scales of our deployment. While qualities like performance, efficiency, fault-tolerance, and consistency are important at all scales, our experience indicates that at specific sizes some qualities re-quire more effort to achieve than others. For example, maintaining data consistency can be easier at small scales if replication is minimal compared to larger ones where replication is often necessary. Additionally, the importance of finding an optimal communication schedule increases as the number of servers increase and net-working becomes the bottleneck.
This paper includes four main contributions: (1)We describe the evolution of Facebook’s memcached-based architecture. (2) We identify enhancements to memcached that improve performance and increase memory efficiency. (3) We highlight mechanisms that improve our ability to operate our system at scale. (4)We characterize the production workloads imposed on our system.
我们的目标之一，是展现部署在不同尺度（系统）上的重要主题。虽然在所有尺度上是很重要的品质，如性能，效率，容错性和一致性，我们的经验表明，在特定大小的一些素质要求比别人更多的努力来实现。举例来说，保持数据的一致性，如果复制的内容是小量的，可以更容易在小尺度的网络上实现，相比较大的网络往往只是复制必要的内容。此外，找到一个最佳的通信调度的重要性增加的数量增加服务器和网络工作成为瓶颈。本文包括四个主要贡献：（1）我们描述了Facebook的基于memcach架构的演化。 （2）我们确定memcached的提高性能和增加内存效率的改进。 （3）我们简明扼要地讲述提高我们的经营能力我们的系统规模的机制。 （4）我们对生产工作负载赋予了特色（译者加：对工作负载进行了分类？）。
The following properties greatly influence our design. First, users consume an order of magnitude more con-tent than they create. This behavior results in a workload dominated by fetching data and suggests that caching can have significant advantages. Second, our read operations fetch data from a variety of sources such as MySQL databases, HDFS installations, and backend services. This heterogeneity requires a flexible caching strategy able to store data from disparate sources.
Memcached provides a simple set of operations (set, get, and delete) that makes it attractive as an elemental component in a large-scale distributed system. The open-source version we started with provides a single-machine in-memory hash table. In this paper, we discuss how we took this basic building block, made it more efficient, and used it to build a distributed key-value store that can process billions of requests per second. Henceforth, we use 'memcached' to refer to the source code or a running binary and 'memcache' to describe the distributed system.
Figure 1: Memcache as a demand-filled look-aside cache. The left half illustrates the read path for a web server on a cache miss. The right half illustrates the write path.
Query cache:We rely on memcache to lighten the read load on our databases. In particular, we use memcache as a demand-filled look-aside cache as shown in Figure 1. When a web server needs data, it first requests the value from memcache by providing a string key. If the item addressed by that key is not cached, the web server retrieves the data from the database or other back-end service and populates the cache with the key-value pair. For write requests, the web server issues SQL statements to the database and then sends a delete request to memcache that invalidates any stale data. We choose to delete cached data instead of updating it because deletes are idempotent. Memcache is not the authoritative source of the data and is therefore allowed to evict cached data.
While there are several ways to address excessive read traffic on MySQL databases, we chose to use memcache. It was the best choice given limited engineering resources and time. Additionally, separating our caching layer from our persistence layer allows us to ad-just each layer independently as our workload changes.
Generic cache:We also leverage memcacheas a more general key-value store. For example, engineers use memcache to store pre-computed results from sophisticated machine learning algorithms which can then be used by a variety of other applications. It takes little ef-fort for new services to leverage the existing marcher infrastructure without the burden of tuning, optimizing, provisioning, and maintaining a large server fleet.
As is, memcached provides no server-to-server coordination; it is an in-memory hash table running on a single server. In the remainder of this paper we describe how we built a distributed key-value store based on memcached capable of operating under Facebook’s workload. Our system provides a suite of configuration, aggregation, and routing services to organize memcached instances into a distributed system.
Figure 2: Overall architecture
We structure our paper to emphasize the themes that emerge at three different deployment scales. Our read-heavy workload and wide fan-out is the primary concern when we have one cluster of servers. As it becomes necessary to scale to multiple frontend clusters, we ad-dress data replication between these clusters. Finally, we describe mechanisms to provide a consistent user experience as we spread clusters around the world. Operational complexity and fault tolerance is important at all scales. We present salient data that supports our design decisions and refer the reader to work by Atikoglu et al. for a more detailed analysis of our workload. At a high-level, Figure 2 illustrates this final architecture in which we organize co-located clusters into a region and designate a master region that provides a data stream to keep non-master regions up-to-date.
While evolving our system we prioritize two major design goals. (1) Any change must impact a user-facing or operational issue. Optimizations that have limited scope are rarely considered. (2) We treat the prob-ability of reading transient stale data as a parameter to be tuned, similar to responsiveness. We are willing to expose slightly stale data in exchange for insulating a backend storage service from excessive load.
We now consider the challenges of scaling to thousands of servers within a cluster. At this scale, most of our efforts focus on reducing either the latency of fetching cached data or the load imposed due to a cache miss.
Whether a request for data results in a cache hit or miss, the latency of memcache’s response is a critical factor in the response time of a user’s request. A single user web request can often result in hundreds of individual memcache get requests. For example, loading one of our popular pages results in an average of 521 distinct items fetched from memcache.1
We provision hundreds of memcached servers in a cluster to reduce load on databases and other services. Items are distributed across the memcached servers through consistent hashing . Thus web servers have to routinely communicate with many memcached servers to satisfy a user request. As a result, all web servers communicate with every memcached server in a short period of time. This all-to-all communication pattern can cause incast congestion  or allow a single server to become the bottleneck for many web servers. Data replication often alleviates the single-server bottleneck but leads to significant memory inefficiencies in the common case.
We reduce latency mainly by focusing on the memcache client, which runs on each web server. This client serves a range of functions, including serialization, compression, request routing, error handling, and request batching. Clients maintain a map of all available servers, which is updated through an auxiliary configuration system.
Parallel requests and batching: We structure our web-application code to minimize the number of network round trips necessary to respond to page requests. We construct a directed acyclic graph (DAG) representing the dependencies between data. A web server uses this DAG to maximize the number of items that can be fetched concurrently. On average these batches consist of 24 keys per request.2
Client-server communication: Memcached servers do not communicate with each other. When appropriate, we embed the complexity of the system into a stateless client rather than in the memcached servers. This greatly simplifies memcached and allows us to focus on making it highly performant for a more limited use case. Keeping the clients stateless enables rapid iteration in the software and simplifies our deployment process. Client logic is provided as two components: a library that can be embedded into applications or as a standalone proxy named mcrouter. This proxy presents a memcached server interface and routes the requests/replies to/from other servers.
Clients use UDP and TCP to communicate with memcached servers. We rely on UDP for get requests to reduce latency and overhead. Since UDP is connection-less, each thread in the web server is allowed to directly communicate with memcached servers directly, bypassing mcrouter, without establishing and maintaining a connection thereby reducing the overhead. The UDP implementation detects packets that are dropped or received out of order (using sequence numbers) and treats them as errors on the client side. It does not provide any mechanism to try to recover from them. In our infrastructure, we find this decision to be practical. Under peak load, memcache clients observe that 0.25% of get requests are discarded. About 80% of these drops are due to late or dropped packets, while the remainder are due to out of order delivery. Clients treat get errors as cache misses, but web servers will skip inserting entries into memcached after querying for data to avoid putting additional load on a possibly overloaded network or server.
Figure 3: Get latency for UDP, TCP via mcrouter
图 3: 经过mcrouter以后 UDP, TCP得到的延迟