
The 80/20 rule is often attributed to an Italian economist named Vilfredo Pareto. Born in 1848, Pareto was (inspirationally at least) one of the early members of the occupy movement: he observed that 80% of Italy’s wealth at that time was owned by fewer than 20% of Italy’s population. As a bit of a tangent, it’s also worth noting that the 80/20 rule is also anecdotally attributed to Pareto’s observation that 80% of the peas in his garden came from 20% of the plants — so he was apparently more of a pea counter than a bean counter (har har). Regardless, Pareto was no fan of uniformity.

Pareto’s Principle, and the resulting statistical idea of a “Pareto distribution,” is an example of what is known in statistics as a power law, and it has incredible relevance to understanding storage access patterns. Here’s why: for virtually all application workloads, accesses to disk are much closer to a Pareto distribution than to a uniform random one: a relatively small amount of hot data is used by a majority of I/O requests, while a much larger amount of cold data is accessed with much lower frequency.
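To make that skew concrete, here is a minimal sketch of a power-law access pattern. The block count, request count, and Zipf-like exponent are all invented for illustration; with this level of skew, the hottest fifth of the blocks ends up absorbing on the order of 80% of the requests.

```python
import numpy as np

# Hypothetical workload: 100,000 blocks, 1,000,000 requests whose targets
# follow a Zipf-like (power-law) popularity distribution.
rng = np.random.default_rng(0)
num_blocks, num_requests = 100_000, 1_000_000
popularity = 1.0 / np.arange(1, num_blocks + 1) ** 0.9   # heavy head, long cold tail
popularity /= popularity.sum()
requests = rng.choice(num_blocks, size=num_requests, p=popularity)

# Share of all I/O that lands on the hottest 20% of blocks.
counts = np.sort(np.bincount(requests, minlength=num_blocks))[::-1]
hot_share = counts[: num_blocks // 5].sum() / num_requests
print(f"hottest 20% of blocks serve {hot_share:.0%} of requests")
```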



We all know that this is intuitively true, that our systems have a mix of hot and cold data. It is a motivating argument for mixed-media (or “hybrid”) storage systems, and it’s also applied at scale in designing storage systems for applications like Facebook. Here’s the thing: Pareto-like distributions are best served by being unfair in the assignment of resources to data. Instead of building a homogenous system out of one type of storage media, these distributions teach us to steal resources from unpopular data to reward the prolific.

A misunderstanding of the Pareto principle leads to all sorts of confusion in building and measuring storage systems. One example of this is flash vendors that argue that building all-flash storage on a single, homogenous layer of flash is a good match for workload demands. From this perspective, homogenous all-flash systems are effectively communist storage. They idealistically decide to invest equal resources in every piece of data resulting in a poor match between resource-level spending and access-level spending. More on this in a bit.


Let’s look at some real workload data.

To explain the degree to which storage workloads are non-uniform, let’s look at some real data. We’ve recently been working with a one-year storage trace of eleven developer desktops. As storage traces go, this is a pretty fun dataset to analyze because it contains a lot of data over a very long period of time: storage traces, such as the ones hosted by SNIA, are typically either much shorter (hours to days in total) or of much lower fidelity. In total, this twelve-month trace describes about 7.6 billion I/O operations and a total transfer of 28TB over about 5TB of stored data.
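As a rough sanity check, here is a back-of-the-envelope pass over those headline numbers (my own arithmetic, not part of the original trace analysis):

```python
# Headline numbers from the trace: 7.6 billion I/Os, 28TB transferred,
# ~5TB stored, 11 desktops, over twelve months.
ios = 7.6e9
transferred_bytes = 28 * 2**40
stored_tb = 5.0
desktops = 11
seconds = 365 * 24 * 3600

print(f"average rate: {ios / seconds:,.0f} IOPS across all {desktops} desktops")
print(f"per desktop:  {ios / seconds / desktops:,.1f} IOPS on average")
print(f"mean request size: {transferred_bytes / ios / 1024:.1f} KiB")
print(f"stored capacity transferred about {28 / stored_tb:.1f}x over the year")
```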

I’d like to quickly summarize this data and try to point out a couple of interesting things that should influence the way you think about architecting for your data.

[Chart: “How old was the data after 1 year?” Cumulative stored capacity bucketed by time since last access (from seconds up through a year and beyond), rising from 17 GB through 129 GB and 627 GB to 2.0 TB, out of 5.1 TB stored in total.]


The first chart, above, shows the age of all the stored data at the end of the trace. Of the 5.1TB of data that was stored on those 11 desktops, 3.1TB wasn’t accessed at all during the entire year. As a result, the performance of the system through that year was completely unchanged by placement decisions regarding where that cold data was stored.

At the other end of the spectrum, we see that only 627GB, or about 12% of all stored data, was accessed in the final month of the trace. We see a similar progression as we move to shorter periods of time. This initial capacity/age analysis really just serves to validate our assumption about access distributions, so now let’s look at a slightly more interesting view…
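A chart like that can be rebuilt from any trace that records when each piece of data was last touched. A minimal sketch of the bucketing, using a few invented (age, size) records rather than the real trace:

```python
# Hypothetical per-object records: (seconds since last access at end of trace, bytes).
records = [
    (20, 4 * 2**30),            # touched seconds before the trace ended
    (10 * 3600, 120 * 2**30),   # touched within the last day
    (40 * 86400, 500 * 2**30),  # touched within the last few months
    (400 * 86400, 3 * 2**40),   # never touched during the year: cold
]

windows = [("sec", 1), ("min", 60), ("hour", 3600), ("day", 86400),
           ("month", 30 * 86400), ("year", 365 * 86400)]

# Cumulative capacity whose last access falls within each window, as in the chart.
for label, limit in windows:
    tb = sum(size for age, size in records if age <= limit) / 2**40
    print(f"accessed within the last {label}: {tb:.2f} TB")
print(f"total stored (forever bucket): {sum(s for _, s in records) / 2**40:.2f} TB")
```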


[Chart: “This size of cache… …would serve this much of the request traffic.”
  32 GB  -> 4.5 TB (35%)
  64 GB  -> 5.9 TB (46%)
  128 GB -> 8.0 TB (62%)
  256 GB -> 10.7 TB (84%)
  512 GB -> 12.6 TB (98%)
  1 TB   -> 12.8 TB]

In the graph above, I’ve correlated the amount of actual access over the year with progressively larger buckets of “hot” data. This graph draws two new insights from the access data over the year. First, it accounts for the number of accesses to the data, which allows us to think about hit rates. Using “least recently used” (LRU) as a model for populating a layer of fast memory, this allows us to reason about what proportion of requests would be served from our top tier (or cache). If you scroll over the graph, you can see how the cumulative hit rate increases as more fast memory is added to the system.
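The curve itself comes from a standard LRU simulation over the request stream. Here is a minimal sketch; the Zipf-skewed synthetic stream and the block-granular cache sizes are stand-ins for illustration, not the actual analysis pipeline behind the graph.

```python
from collections import OrderedDict
import random

def lru_hit_rate(stream, capacity):
    """Fraction of requests in `stream` served by an LRU cache of `capacity` blocks."""
    cache, hits = OrderedDict(), 0
    for block in stream:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # most recently used again
        else:
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(stream)

# Synthetic, Zipf-skewed request stream standing in for the real trace.
random.seed(0)
blocks = list(range(100_000))
weights = [1 / (rank + 1) ** 0.9 for rank in blocks]
stream = random.choices(blocks, weights=weights, k=500_000)

for capacity in (2_000, 4_000, 8_000, 16_000, 32_000):
    print(f"{capacity:>6}-block cache -> {lru_hit_rate(stream, capacity):.0%} hit rate")
```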

Second, the graph allows us to calculate a normalized access cost for the data being stored. Rather than reasoning about storage based on $/GB, let’s consider it based entirely on access. I picked a completely arbitrary value for the smallest cache size: at 32GB in the fast tier, I account one dollar per gigabyte accessed. Now look what happens as you grow the amount of fast storage in order to increase the hit rate. As you repeatedly double the size of the cache to improve the hit rate, each doubling buys relatively fewer additional accesses to data. As a result, data access gets more expensive in a hurry. In the example, a 100% hit rate costs 11x more to provision than the initial 35% at the smallest cache size that we modelled.
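That 11x figure falls straight out of the chart data under the cost model described above: provisioning cost proportional to the size of the fast tier, normalized so the 32GB configuration comes out at one dollar per gigabyte actually accessed. A quick worked calculation, with the last line reproducing the roughly 11x premium of a 100% hit rate over the 35% baseline:

```python
# (fast tier size in GB, traffic served from it in TB), from the chart above.
configs = [(32, 4.5), (64, 5.9), (128, 8.0), (256, 10.7), (512, 12.6), (1024, 12.8)]
total_tb = configs[-1][1]

# Normalize so the 32GB configuration costs $1 per GB of traffic it serves.
base_size, base_traffic = configs[0]
baseline = base_size / (base_traffic * 1024)

for size_gb, traffic_tb in configs:
    relative_cost = (size_gb / (traffic_tb * 1024)) / baseline
    print(f"{size_gb:>5} GB tier: {traffic_tb / total_tb:.0%} hit rate, "
          f"${relative_cost:.2f} per GB accessed")
```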


Deciding to be unfair.

Now let’s be clear on one thing: I am not arguing that you should settle for a 35% hit rate. Instead, I’m arguing that a dollar spent at the tail of your access distribution — spent improving the performance of that 3.1TB of data that was never accessed at all — is probably not being spent on the right thing. I’d argue that that dollar would be much better spent on improving the performance of your hotter data through whatever means are possible.

This is an argument that I recently made in a little bit more detail at Storage Field Day 6, with a lively set of bloggers at the Coho office. I explained some of the broader technical changes that are occurring in storage today, in particular the fact that there are now more than three wildly different connectivity options for solid state storage (SATA/SAS SSDs, PCIe/NVMe, and NVDIMM), each with dramatically different levels of cost and performance.


So even if disks go away, storage systems will still need to mix media, to be hybrid, in order to achieve performance with excellent value. This is a reason that I find terms like “hybrid” and “AFA” to be so misleading. A hybrid system isn’t a cheap storage system that still uses disks, it’s any storage layout that decides to spend more to place hot data in high-performance memory. Similarly, an AFA may be composed from three (or more!) different types of storage media, and may well be hybrid.
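To make “hybrid” concrete in this sense, here is a toy placement sketch: rank data by heat and fill the fastest media first. The tier sizes, object names, and access counts are all invented, and this is not Coho’s placement logic.

```python
# Hypothetical tiers, fastest first: (name, capacity in GB).
tiers = [("NVDIMM", 16), ("NVMe", 256), ("SATA SSD", 1024), ("disk", 16384)]

# Hypothetical objects: (name, size in GB, accesses over the last window).
objects = [("db-log", 8, 200_000), ("vm-image", 40, 90_000),
           ("build-cache", 120, 55_000), ("mail-db", 300, 12_000),
           ("photo-archive", 900, 400), ("old-backups", 3000, 3)]

# Deliberately "unfair" greedy placement: the hottest data (accesses per GB)
# claims the fastest tier that still has room; cold data gets whatever is left.
free = {name: capacity for name, capacity in tiers}
for name, size, accesses in sorted(objects, key=lambda o: o[2] / o[1], reverse=True):
    tier = next(t for t, _ in tiers if free[t] >= size)
    free[tier] -= size
    print(f"{name:>13} ({accesses / size:>8.1f} accesses/GB) -> {tier}")
```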

Coho’s storage stack continuously monitors and characterizes your workloads to appropriately size storage allocations for optimal performance and to report on working set characteristics about your applications. We have recently published exciting new algorithms at a top-tier systems research conference on these results. If you are interested in learning more, my Storage field day presentation (above) provides an overview of our workload monitoring and autotiering design, called Cascade.


Nonuniform distributions are everywhere. Thanks to observations such as Pareto’s, system design at all scales benefits from focussing on serving the most popular things as efficiently as possible. Designs like these lead to the differences between highways and rural roads, hub cities in transport systems, core internet router designs, and most of the Netflix Original Series titles. Storage systems are no different, and building storage systems well requires careful, workload-responsive analysis to size and apply working set characteristics appropriately.

Some closing notes:

1. The image at the top of this post is a satirized version of an old Scott paper towel commercial. Some commentary, for example on the society pages.

2. Enormous thanks are due to Jake Wires and Stephen Ingram, who put in a huge amount of work on trace collection, processing, and analysis for the data that’s behind this post.  A bunch of the analysis here is done using queries against Coho’s Counter Stack engine.  Stephen also deserves thanks for helping develop and debug the visualizations, which were prepared using Mike Bostock’s excellent D3js library.


