## The 80/20 Rule of Storage Systems

The 80/20 rule is often attributed to an Italian economist named Vilfredo Pareto. Born in 1848, Pareto was (inspirationally at least) one of the early members of the occupy movement: he observed that 80% of Italy’s wealth at that time was owned by fewer than 20% of Italy’s population. As a bit of a tangent, it’s also worth noting that the 80/20 rule is also anecdotally attributed to Pareto’s observation that 80% of the peas in his garden came from 20% of the plants — so he was apparently more of a pea counter than a bean counter (har har). Regardless, Pareto was no fan of uniformity.

Pareto’s Principle, and the resulting statistical idea of a “Pareto distribution,” is an example of what is known in statistics as a power law, and it has incredible relevance to understanding storage access patterns. Here’s why: for virtually all application workloads, accesses to disk are much closer to a Pareto distribution than to a uniform random one: a relatively small amount of hot data serves a majority of I/O requests, while a much larger amount of cold data is accessed with much lower frequency.
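To make that concrete, here is a minimal sketch (using synthetic data, not the trace discussed below) of what a Pareto-like access pattern over a set of storage blocks looks like: low block IDs are hot, high IDs form the cold tail, and a small fraction of blocks absorbs almost all of the traffic.

```python
import random
from collections import Counter

# Synthetic Pareto-like access pattern: low block IDs are "hot",
# high block IDs form the long cold tail.
random.seed(42)
NUM_BLOCKS = 10_000
NUM_ACCESSES = 100_000

hits = Counter()
for _ in range(NUM_ACCESSES):
    # paretovariate() returns heavy-tailed values >= 1; clamp to the block range.
    block = min(int(random.paretovariate(1.16)) - 1, NUM_BLOCKS - 1)
    hits[block] += 1

# What fraction of all accesses landed on the hottest 20% of touched blocks?
ranked = sorted(hits.values(), reverse=True)
top_fifth = ranked[: max(1, len(ranked) // 5)]
share = sum(top_fifth) / NUM_ACCESSES
print(f"hottest 20% of blocks served {share:.0%} of accesses")
```

With a heavy-tailed shape parameter like the one above, the hottest fifth of the blocks ends up serving well over 80% of the requests, which is exactly the skew that the real trace below exhibits.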


We all know this is intuitively true: our systems have a mix of hot and cold data. It is a motivating argument for mixed-media (or “hybrid”) storage systems, and it’s also applied at scale in designing storage systems for applications like Facebook. Here’s the thing: Pareto-like distributions are best served by being unfair in the assignment of resources to data. Instead of building a homogeneous system out of one type of storage media, these distributions teach us to steal resources from unpopular data to reward the prolific.

A misunderstanding of the Pareto principle leads to all sorts of confusion in building and measuring storage systems. One example is flash vendors arguing that building all-flash storage on a single, homogeneous layer of flash is a good match for workload demands. From this perspective, homogeneous all-flash systems are effectively communist storage: they idealistically invest equal resources in every piece of data, resulting in a poor match between resource-level spending and access-level spending. More on this in a bit.

# Let’s look at some real workload data.

To explain the degree to which storage workloads are non-uniform, let’s look at some real data. We’ve recently been working with a one-year storage trace of eleven developer desktops. As storage traces go, this is a pretty fun dataset to analyze because it contains a lot of data over a very long period of time: storage traces, such as the ones hosted by SNIA, are typically either much shorter (hours to days in total) or much lower fidelity. In total, this twelve-month trace describes about 7.6 billion I/O operations and a total transfer of 28TB over about 5TB of stored data.

I’d like to quickly summarize this data and try to point out a couple of interesting things that should influence the way you think about architecting for your data.

[Figure: “How old was the data after 1 year?” Cumulative capacity by time since last access, on a sec/min/hour/day/month/year/forever scale: 17 GB (hour), 129 GB (day), 627 GB (month), 2.0 TB (year), 5.1 TB (total stored).]


The first chart, above, shows the age of all the stored data at the end of the trace. Of the 5.1TB of data stored on those 11 desktops, 3.1TB wasn’t accessed at all during the entire year. As a result, the performance of the system over that year was completely unchanged by placement decisions about where that cold data was stored.

At the other end of the spectrum, we see that only 627GB, or about 12% of all stored data, was accessed in the past month. We see a similar progression as we move to shorter periods of time. This initial capacity/age analysis really just serves to validate our assumption about access distributions, so now let’s look at a slightly more interesting view…

| This size of cache… | …would serve this much of the request traffic |
| --- | --- |
| 32 GB | 4.5 TB (35%) |
| 64 GB | 5.9 TB (46%) |
| 128 GB | 8.0 TB (62%) |
| 256 GB | 10.7 TB (84%) |
| 512 GB | 12.6 TB (98%) |
| 1 TB | 12.8 TB (100%) |

In the figure above, I’ve correlated the amount of actual access over the year with progressively larger buckets of “hot” data. This view offers two new insights into the access data over the year. First, it accounts for the number of accesses to the data, allowing us to think about hit rates. Using “least recently used” (LRU) as a model for populating a layer of fast memory, we can reason about what proportion of requests would be served from our top tier (or cache). Reading down the figure, you can see how the cumulative hit rate increases as more fast memory is added to the system.
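The LRU model used here is easy to reproduce. The sketch below (on a synthetic skewed trace, not the desktop trace itself) replays block requests through a small LRU cache and reports the fraction served from the fast tier:

```python
from collections import OrderedDict
import random

def lru_hit_rate(trace, cache_size):
    """Replay a block trace through an LRU cache holding `cache_size`
    blocks and return the fraction of requests served from the cache."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # refresh recency on a hit
        else:
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(trace)

# Synthetic skewed trace: hot blocks are requested far more often than cold ones.
random.seed(7)
trace = [min(int(random.paretovariate(1.2)), 5000) for _ in range(50_000)]

for size in (32, 64, 128, 256):
    print(f"cache={size:4d} blocks  hit rate={lru_hit_rate(trace, size):.0%}")
```

Because LRU has the stack inclusion property (a larger cache always contains everything a smaller one would), the hit rate can only grow as the fast tier grows, which is exactly the cumulative behaviour the figure shows.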

Second, the figure allows us to calculate a normalized access cost for the data being stored. Rather than reasoning about storage based on \$/GB, let’s consider it completely based on access. I picked a completely arbitrary value for the smallest cache size: at 32GB in the fast tier, I account one dollar per gigabyte accessed. Now look what happens as you grow the amount of fast storage in order to increase the hit rate: each doubling of the cache buys relatively fewer additional accesses, so data access gets more expensive in a hurry. In this example, a 100% hit rate costs about 11x more to provision, per gigabyte accessed, than the initial 35% hit rate at the smallest cache size that we modelled.
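That normalized access cost falls straight out of the numbers in the figure. Using the cache sizes and served traffic from the table, and normalizing so that the 32GB tier costs $1 per gigabyte it serves:

```python
# Cache sizes (GB) and the traffic each would serve over the year (TB),
# taken from the trace summary in the figure above.
tiers = [(32, 4.5), (64, 5.9), (128, 8.0), (256, 10.7), (512, 12.6), (1024, 12.8)]

# Normalize so that the 32GB tier costs exactly 1x per GB of data it serves.
base = tiers[0][0] / (tiers[0][1] * 1024)           # GB of cache per GB served
rel_costs = []
for size_gb, served_tb in tiers:
    rel = (size_gb / (served_tb * 1024)) / base     # cost relative to the 32GB tier
    rel_costs.append(rel)
    print(f"{size_gb:5d} GB cache serves {served_tb:5.1f} TB: "
          f"{rel:5.2f}x cost per GB accessed")
```

The final line of output shows the 1TB tier at 11.25x the per-access cost of the 32GB tier, which is the roughly 11x figure quoted in the text: 32x the capacity buys under 3x the served traffic.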


# Deciding to be unfair.

Now let’s be clear about one thing: I am not arguing that you should settle for a 35% hit rate. Instead, I’m arguing that a dollar spent at the tail of your access distribution (spent improving the performance of that 3.1TB of data that was never accessed at all) is probably not being spent on the right thing. That dollar would be much better spent on improving the performance of your hotter data through whatever means are possible.

This is an argument that I recently made in a little more detail at Storage Field Day 6, with a lively set of bloggers at the Coho office. I explained some of the broader technical changes occurring in storage today, in particular the fact that there are now at least three wildly different connectivity options for solid state storage (SATA/SAS SSDs, PCIe/NVMe, and NVDIMM), each with dramatically different levels of cost and performance.


So even if disks go away, storage systems will still need to mix media, to be hybrid, in order to achieve performance with excellent value. This is a reason that I find terms like “hybrid” and “AFA” to be so misleading. A hybrid system isn’t a cheap storage system that still uses disks, it’s any storage layout that decides to spend more to place hot data in high-performance memory. Similarly, an AFA may be composed from three (or more!) different types of storage media, and may well be hybrid.
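The unfair placement a hybrid layout makes can be sketched in a few lines. The tier names, per-GB prices, and greedy frequency-ranked placement below are purely illustrative assumptions, not Coho’s actual algorithm:

```python
# Hypothetical media tiers, fastest first, with illustrative $/GB figures.
TIERS = [
    ("NVDIMM",    8.00),   # tiny, very fast
    ("PCIe/NVMe", 2.50),
    ("SATA SSD",  0.80),
    ("Disk",      0.05),   # large, slow, cheap
]

def place(blocks, budgets):
    """Assign the hottest blocks to the fastest tier until its budget
    (in blocks) is spent, then spill to the next tier down."""
    ranked = sorted(blocks.items(), key=lambda kv: kv[1], reverse=True)
    placement, i = {}, 0
    for (tier, _cost_per_gb), budget in zip(TIERS, budgets):
        for block, _freq in ranked[i : i + budget]:
            placement[block] = tier
        i += budget
    return placement

# Toy access counts per block: a few hot blocks and a long cold tail.
blocks = {f"blk{i}": 1000 // (i + 1) for i in range(100)}
placement = place(blocks, budgets=[2, 8, 20, 70])
print(placement["blk0"], placement["blk99"])   # hottest vs. coldest block
```

Spending is unfair by construction: the handful of hottest blocks land in the most expensive memory, while the never-touched tail sits on the cheapest media, matching resource-level spending to access-level spending.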

Coho’s storage stack continuously monitors and characterizes your workloads to appropriately size storage allocations for optimal performance, and to report on working-set characteristics of your applications. We have recently published new algorithms based on these results at a top-tier systems research conference. If you are interested in learning more, my Storage Field Day presentation (above) provides an overview of our workload monitoring and autotiering design, called Cascade.


Nonuniform distributions are everywhere. Thanks to observations such as Pareto’s, system design at all scales benefits from focusing on serving the most popular things as efficiently as possible. Designs like these lead to the differences between highways and rural roads, to hub cities in transport systems, to core internet router designs, and to most of the Netflix Original Series titles. Storage systems are no different, and building them well requires careful, workload-responsive analysis to size and apply working-set characteristics appropriately.

Some closing notes:

1. The image at the top of this post is a satirized version of an old Scott paper towel commercial; some commentary on it appeared, for example, on the society pages.

2. Enormous thanks are due to Jake Wires and Stephen Ingram, who put in a huge amount of work on trace collection, processing, and analysis for the data that’s behind this post.  A bunch of the analysis here is done using queries against Coho’s Counter Stack engine.  Stephen also deserves thanks for helping develop and debug the visualizations, which were prepared using Mike Bostock’s excellent D3js library.
