The 80/20 rule is often attributed to an Italian economist named Vilfredo Pareto. Born in 1848, Pareto was (in spirit, at least) one of the early members of the occupy movement: he observed that 80% of Italy’s wealth at the time was owned by fewer than 20% of its population. As a bit of a tangent, the 80/20 rule is also anecdotally attributed to Pareto’s observation that 80% of the peas in his garden came from 20% of the plants — so he was apparently more of a pea counter than a bean counter (har har). Regardless, Pareto was no fan of uniformity.

Pareto’s principle, and the resulting statistical idea of a “Pareto distribution,” is an example of what is known in statistics as a power law, and it has incredible relevance in understanding storage access patterns. Here’s why: for virtually all application workloads, accesses to disk follow something much closer to a Pareto distribution than a uniform random one: a relatively small amount of hot data serves the majority of I/O requests, while a much larger amount of cold data is accessed far less frequently.
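As a quick illustration of just how lopsided a power law is, here is a small sketch of my own (not from the post's data) using Python's standard-library Pareto sampler; the shape parameter of roughly 1.16 is the value for which the classic 80/20 split holds:

```python
import random

# Sample from a Pareto distribution with shape alpha ~= 1.16, the value for
# which the top 20% of samples holds roughly 80% of the total.
random.seed(42)
samples = sorted(
    (random.paretovariate(1.16) for _ in range(100_000)), reverse=True
)

top_20_pct = samples[: len(samples) // 5]  # the largest 20% of samples
share = sum(top_20_pct) / sum(samples)
print(f"share of total held by the top 20%: {share:.0%}")
```

Because the distribution is heavy-tailed, the sampled share bounces around from run to run, but it consistently lands far from the 20% that a uniform distribution would give.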



We all know this is intuitively true: our systems hold a mix of hot and cold data. It is a motivating argument for mixed-media (or “hybrid”) storage systems, and it is also applied at scale in designing storage systems for applications like Facebook. Here’s the thing: Pareto-like distributions are best served by being unfair in the assignment of resources to data. Instead of building a homogeneous system out of one type of storage media, these distributions teach us to steal resources from unpopular data to reward the prolific.

A misunderstanding of the Pareto principle leads to all sorts of confusion in building and measuring storage systems. One example is flash vendors arguing that building all-flash storage on a single, homogeneous layer of flash is a good match for workload demands. From this perspective, homogeneous all-flash systems are effectively communist storage: they idealistically invest equal resources in every piece of data, resulting in a poor match between resource-level spending and access-level spending. More on this in a bit.



Let’s look at some real workload data.

To illustrate the degree to which storage workloads are non-uniform, let’s look at some real data. We’ve recently been working with a one-year storage trace of eleven developer desktops. As storage traces go, this is a pretty fun dataset to analyze because it contains a lot of data over a very long period of time: published storage traces, such as the ones hosted by SNIA, are typically either much shorter (hours to days in total) or much lower fidelity. In total, this twelve-month trace describes about 7.6 billion I/O operations and a total transfer of 28TB over about 5TB of stored data.

I’d like to quickly summarize this data and try to point out a couple of interesting things that should influence the way you think about architecting for your data.

[Chart: “How old was the data after 1 year?” Cumulative capacity accessed within windows ranging from seconds up to a month, a year, and forever: 17 GB, 129 GB, 627 GB, 2.0 TB, and 5.1 TB.]




The first chart, above, shows the age of all the stored data at the end of the trace. Of the 5.1TB of data stored on those eleven desktops, 3.1TB wasn’t accessed at all through the entire year. As a result, the performance of the system over that year was completely unchanged by placement decisions about where that cold data was stored.

At the other end of the spectrum, we see that only 627GB, or about 12% of all stored data, was accessed in the past month. We see a similar progression as we move to shorter periods of time. This initial capacity/age analysis really just serves to validate our assumption about access distributions, so now let’s look at a slightly more interesting view…
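The capacity/age arithmetic behind these percentages can be checked directly from the chart's numbers:

```python
# Reproducing the capacity/age arithmetic from the chart above: how much of
# the 5.1 TB of stored data was touched within each window.
total_tb = 5.1
accessed = {"past month": 0.627, "past year": 2.0}  # TB, from the chart

for window, tb in accessed.items():
    print(f"{window}: {tb} TB ({tb / total_tb:.0%} of stored data)")

# Everything not touched within the year is the cold tail.
cold_tb = total_tb - accessed["past year"]
print(f"untouched all year: {cold_tb:.1f} TB ({cold_tb / total_tb:.0%})")
```

The cold tail works out to 3.1 TB, or roughly 61% of everything stored: the majority of the capacity contributed nothing to the year's performance.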



    This size of cache...   ...would serve this much of the request traffic.
      32 GB                   4.5 TB (35%)
      64 GB                   5.9 TB (46%)
     128 GB                   8.0 TB (62%)
     256 GB                  10.7 TB (84%)
     512 GB                  12.6 TB (98%)
       1 TB                  12.8 TB (100%)

In the graph above, I’ve correlated the amount of actual access over the year with progressively larger buckets of “hot” data. This view tries to draw two new insights from the year of access data. First, it accounts for the number of accesses to the data, allowing us to think about hit rates. Using least recently used (LRU) as a model for populating a layer of fast memory, we can reason about what proportion of requests would be served from our top tier (or cache), and we can see how the cumulative hit rate increases as more fast memory is added to the system.
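To illustrate the LRU model, here is a toy simulation of my own (a hypothetical Zipf-like trace, not the desktop dataset itself) showing how hit rate grows, with diminishing returns, as the cache is repeatedly quadrupled:

```python
import random
from collections import OrderedDict

# Simulate an LRU cache over a skewed access trace and measure the fraction
# of requests served from the cache.
def lru_hit_rate(trace, cache_size):
    cache = OrderedDict()  # keys held in recency order, most recent last
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(trace)

random.seed(0)
num_blocks = 10_000
# Block i is requested with probability proportional to 1/(i+1) (Zipf-like).
weights = [1 / (i + 1) for i in range(num_blocks)]
trace = random.choices(range(num_blocks), weights=weights, k=200_000)

rates = {size: lru_hit_rate(trace, size) for size in (100, 400, 1600)}
for size, rate in rates.items():
    print(f"cache of {size:>4} blocks -> hit rate {rate:.0%}")
```

Each jump in cache size buys a smaller improvement in hit rate than the one before it, which is exactly the shape of the table above.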

Second, the graph allows us to calculate a normalized access cost for the data being stored. Rather than reasoning about storage based on $/GB of capacity, let’s consider it entirely in terms of access. I picked a completely arbitrary value for the smallest cache size: with 32GB in the fast tier, I charge one dollar per gigabyte accessed. Now look what happens as you grow the amount of fast storage in order to increase hit rate. As you repeatedly double the size of the cache to improve the hit rate, you get relatively fewer additional accesses to data. As a result, data access gets more expensive in a hurry: in this example, a 100% hit rate costs 11x more to provision, per gigabyte accessed, than the initial 35% at the smallest cache size we modelled.
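Plugging the table's numbers into that accounting, the 11x figure falls out directly (provisioning cost assumed proportional to fast-tier size, normalized to the 32 GB point):

```python
# Normalized access cost: (relative tier size) / (relative traffic served),
# scaled so the 32 GB tier costs 1.0 per gigabyte accessed.
tiers = [  # (fast-tier size in GB, TB served from it over the year)
    (32, 4.5), (64, 5.9), (128, 8.0), (256, 10.7), (512, 12.6), (1024, 12.8),
]

base_size, base_served = tiers[0]
costs = {
    size: (size / base_size) / (served / base_served) for size, served in tiers
}
for size, cost in costs.items():
    print(f"{size:>5} GB tier: {cost:4.1f}x the cost per GB accessed")
```

The 1 TB tier is 32 times larger than the 32 GB one but serves only about 2.8 times the traffic, so each gigabyte actually accessed costs just over 11 times as much.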




Deciding to be unfair.

Let’s be clear about one thing: I am not arguing that you should settle for a 35% hit rate. Instead, I’m arguing that a dollar spent at the tail of your access distribution — spent improving the performance of that 3.1TB of data that was never accessed at all — is probably not being spent on the right thing. That dollar would be much better spent improving the performance of your hotter data through whatever means are possible.

This is an argument that I recently made in a little more detail at Storage Field Day 6, with a lively set of bloggers at the Coho office. I explained some of the broader technical changes that are occurring in storage today, in particular the fact that there are now three (or more) wildly different connectivity options for solid-state storage (SATA/SAS SSDs, PCIe/NVMe, and NVDIMM), each with dramatically different cost and performance.




So even if disks go away, storage systems will still need to mix media, to be hybrid, in order to achieve performance with excellent value. This is a reason that I find terms like “hybrid” and “AFA” to be so misleading. A hybrid system isn’t a cheap storage system that still uses disks, it’s any storage layout that decides to spend more to place hot data in high-performance memory. Similarly, an AFA may be composed from three (or more!) different types of storage media, and may well be hybrid.
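To make the placement idea concrete, here is a minimal sketch of frequency-based tiering (hypothetical tier names and capacities of my own; an illustration of the concept, not Coho's actual Cascade design): rank data by access count and fill the fastest media first, so the hottest data gets the most expensive placement.

```python
# Greedy frequency-based tiering: hottest blocks land on the fastest media.
def place(blocks, tiers):
    """blocks: {block_id: access_count}; tiers: [(tier_name, capacity_in_blocks)]"""
    placement = {}
    ranked = sorted(blocks, key=blocks.get, reverse=True)  # hottest first
    start = 0
    for name, capacity in tiers:
        for block in ranked[start : start + capacity]:
            placement[block] = name
        start += capacity
    return placement

# Hypothetical access counts and a three-tier layout, fastest to slowest.
blocks = {"a": 900, "b": 450, "c": 40, "d": 8, "e": 0}
tiers = [("nvdimm", 1), ("nvme-flash", 2), ("sata-ssd", 2)]
print(place(blocks, tiers))
```

The hottest block lands on the NVDIMM tier and the never-accessed block on the cheapest SATA SSDs, which is the unfairness the post is advocating.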

Coho’s storage stack continuously monitors and characterizes your workloads to appropriately size storage allocations for optimal performance and to report on the working set characteristics of your applications. We have recently published exciting new algorithms at a top-tier systems research conference on these results. If you are interested in learning more, my Storage Field Day presentation (above) provides an overview of our workload monitoring and autotiering design, called Cascade.



Nonuniform distributions are everywhere. Thanks to observations such as Pareto’s, system designs at all scales benefit from focusing on serving the most popular things as efficiently as possible. Designs like these lead to the differences between highways and rural roads, to hub cities in transport systems, to core internet router designs, and to most of the Netflix Original Series titles. Storage systems are no different: building them well requires careful, workload-responsive analysis to measure working set characteristics and size tiers appropriately.

Some closing notes:

1. The image at the top of this post is a satirized version of an old Scott paper towel commercial. There is some commentary on it, for example on the society pages.

2. Enormous thanks are due to Jake Wires and Stephen Ingram, who put in a huge amount of work on trace collection, processing, and analysis for the data that’s behind this post.  A bunch of the analysis here is done using queries against Coho’s Counter Stack engine.  Stephen also deserves thanks for helping develop and debug the visualizations, which were prepared using Mike Bostock’s excellent D3js library.
