The 80/20 rule is often attributed to an Italian economist named Vilfredo Pareto. Born in 1848, Pareto was (inspirationally at least) one of the early members of the Occupy movement: he observed that 80% of Italy’s wealth at that time was owned by fewer than 20% of Italy’s population. As a bit of a tangent, the 80/20 rule is also anecdotally attributed to Pareto’s observation that 80% of the peas in his garden came from 20% of the plants, so he was apparently more of a pea counter than a bean counter (har har). Regardless, Pareto was no fan of uniformity.
Pareto’s Principle, and the resulting statistical idea of a “Pareto distribution”, is an example of what is known in statistics as a power law, and it is highly relevant to understanding storage access patterns. Here’s why: for virtually all application workloads, accesses to disk are much closer to a Pareto distribution than to a uniform random one. A relatively small amount of hot data is used by a majority of I/O requests, while a much larger amount of cold data is accessed with much lower frequency.
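To make the contrast between a skewed and a uniform workload concrete, here is a minimal Python sketch. It does not use any real trace: it assumes a synthetic, Zipf-like popularity curve in which the block at rank r is requested with weight 1/r^alpha, and it measures what share of requests lands on the most popular 20% of blocks. The function name, the parameters and the choice of alpha are all assumptions made for illustration.

```python
import random

def hot_share(num_blocks=10_000, num_requests=100_000, alpha=1.0,
              hot_fraction=0.2, seed=0):
    """Share of requests that hit the most popular `hot_fraction` of blocks.

    Popularity is synthetic (an assumption, not a fit to any trace): the
    block at rank r is drawn with weight 1 / r**alpha, so alpha=0 gives a
    uniform workload and larger alpha gives a more skewed one.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_blocks + 1)]
    requests = rng.choices(range(num_blocks), weights=weights, k=num_requests)
    # Blocks are indexed in popularity order, so the hot set is the first
    # hot_fraction of the index range.
    hot_count = int(num_blocks * hot_fraction)
    hot_hits = sum(1 for block in requests if block < hot_count)
    return hot_hits / num_requests

skewed = hot_share(alpha=1.0)   # power-law-like popularity
uniform = hot_share(alpha=0.0)  # every block equally likely
print(f"skewed workload:  {skewed:.0%} of requests hit the hottest 20% of blocks")
print(f"uniform workload: {uniform:.0%} of requests hit the same 20% of blocks")
```

Under these assumed parameters the hottest fifth of the blocks absorbs a large majority of the requests in the skewed case, while in the uniform case it absorbs only about its proportional share.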
We all know intuitively that this is true: our systems have a mix of hot and cold data. It is a motivating argument for mixed-media (or “hybrid”) storage systems, and it’s also applied at scale in designing storage systems for applications like Facebook. Here’s the thing: Pareto-like distributions are best served by being unfair in the assignment of resources to data. Instead of building a homogeneous system out of one type of storage media, these distributions teach us to steal resources from unpopular data to reward the prolific.
A misunderstanding of the Pareto principle leads to all sorts of confusion in building and measuring storage systems. One example is flash vendors who argue that building all-flash storage on a single, homogeneous layer of flash is a good match for workload demands. From this perspective, homogeneous all-flash systems are effectively communist storage. They idealistically decide to invest equal resources in every piece of data, resulting in a poor match between resource-level spending and access-level spending. More on this in a bit.
To explain the degree to which storage workloads are non-uniform, let’s look at some real data. We’ve recently been working with a one-year storage trace of eleven developer desktops. As storage traces go, this is a pretty fun dataset to analyze because it contains a lot of data over a very long period of time: storage traces, such as the ones hosted by SNIA, are typically either much shorter (hours to days in total) or much lower fidelity. In total, this twelve-month trace describes about 7.6 billion I/O operations and a total transfer of 28TB over about 5TB of stored data.
I’d like to quickly summarize this data and try to point out a couple of interesting things that should influence the way you think about architecting for your data.
[Chart: How old was the data after 1 year? Time since last access runs along the axis (sec, min, hour, day, month, year, forever); labeled values are 17 GB, 129 GB, 627 GB, 2.0 TB and 5.1 TB.]
The first chart, above, shows the age of all the stored data at the end of the trace. Of the 5.1TB of data that was stored on those 11 desktops, 3.1TB was not accessed at all during the entire year. As a result, the performance of the system through that year was completely unaffected by decisions about where that cold data was placed.
At the other end of the spectrum, we see that only 627GB, or about 12% of all stored data, had been accessed in the past month. We see a similar progression as we move to shorter periods of time. This initial capacity/age analysis really just serves to validate our assumption about access distributions, so now let’s look at a slightly more interesting view…
This size of cache...    ...would serve this much of the request traffic.
32 GB                    4.5 TB (35%)
64 GB                    5.9 TB (46%)
128 GB                   8.0 TB (62%)
256 GB                   10.7 TB (84%)
512 GB                   12.6 TB (98%)
1 TB                     12.8 TB
In the graph above, I’ve correlated the amount of actual access over the year with progressively larger buckets of “hot” data. This graph offers two new insights into the access data over the year. First, it accounts for the number of accesses to the data, which lets us think about hit rates. Using “least recently used” (LRU) as a model for populating a layer of fast memory, we can reason about what proportion of requests would be served from our top tier (or cache). If you scroll over the graph, you can see how the cumulative hit rate increases as more fast memory is added to the system.
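As an illustration of what modelling a fast tier with LRU means, the following Python sketch replays a sequence of block IDs through an LRU cache of a fixed number of blocks and reports the hit rate. It is a naive replay written for this explanation, not the method used to analyze the trace behind this post; the function name and the toy trace are hypothetical.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Replay `trace` (an iterable of block IDs) through an LRU cache that
    holds `capacity` blocks; return the fraction of requests that hit."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    total = 0
    for block in trace:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # now the most recently used
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0

# Toy trace (made up): blocks 1 and 2 are hot, blocks 3, 4 and 5 are cold.
toy_trace = [1, 2, 1, 3, 1, 2, 4, 1, 2, 5, 1, 2]
for capacity in (1, 2, 3, 5):
    print(f"cache of {capacity} blocks: hit rate {lru_hit_rate(toy_trace, capacity):.0%}")
```

Running the same replay for a range of cache sizes gives a hit-rate curve of the kind plotted above. Replaying a multi-billion-request trace once per cache size would be slow, so treat this as a way to understand the model rather than as a practical analysis tool.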
Second, the graph allows us to calculate a normalized access cost for the data being stored. Rather than reasoning about storage based on $/GB, let’s consider it based entirely on access. I picked a completely arbitrary value for the smallest size of cache: at 32GB in the fast tier, I account one dollar per gigabyte accessed. Now look at what happens as you grow the amount of fast storage in order to increase the hit rate. Because you have to repeatedly double the size of the cache to improve the hit rate, each doubling buys relatively fewer additional accesses. As a result, data access gets more expensive in a hurry. In the example, a 100% hit rate costs 11x more per gigabyte accessed than the initial 35% in the smallest cache size that we modelled.
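The arithmetic behind that 11x figure can be reproduced from the numbers on the chart. The sketch below assumes that provisioning cost scales linearly with the size of the fast tier and that 1 TB means 1024 GB; both are assumptions made for illustration, and the function name is hypothetical.

```python
# (cache size in GB, request traffic served in TB), taken from the chart.
# Treating 1 TB of cache as 1024 GB is an assumption.
CACHE_POINTS = [
    (32, 4.5),
    (64, 5.9),
    (128, 8.0),
    (256, 10.7),
    (512, 12.6),
    (1024, 12.8),
]

def normalized_access_cost(points):
    """Cost per unit of traffic served, scaled so the first point costs 1.0.

    Assumes provisioning cost is proportional to cache size, so the cost
    per access is proportional to cache size divided by traffic served.
    """
    base_size, base_served = points[0]
    base = base_size / base_served
    return [(size, (size / served) / base) for size, served in points]

costs = normalized_access_cost(CACHE_POINTS)
for size_gb, cost in costs:
    print(f"{size_gb:>5} GB cache: ${cost:.2f} per GB accessed")
```

Under these assumptions the 1 TB cache comes out at about 11.25 times the per-access cost of the 32 GB cache: the cache is 32 times larger but serves less than three times as much traffic.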
Now let’s be clear on one thing above: I am not arguing that you should settle for a 35% hit rate. Instead, I’m arguing that a dollar spent at the tail of your access distribution — spent improving the performance of that 3.1TB of data that was never accessed at all — is probably not being spent on the right thing. I’d argue that that dollar would be much better spent on improving the performance of your hotter data through whatever means are possible.
This is an argument that I recently made in a little bit more detail at Storage Field Day 6, with a lively set of bloggers at the Coho office. I explained some of the broader technical changes that are occurring in storage today, in particular the fact that there are now more than three wildly different connectivity options for solid state storage (SATA/SAS SSDs, PCIe/NVMe, and NVDIMM), each with dramatically different levels of cost and performance.
So even if disks go away, storage systems will still need to mix media, to be hybrid, in order to achieve performance with excellent value. This is one reason that I find terms like “hybrid” and “AFA” (all-flash array) so misleading. A hybrid system isn’t a cheap storage system that still uses disks; it’s any storage layout that decides to spend more to place hot data in high-performance memory. Similarly, an AFA may be composed of three (or more!) different types of storage media, and may well be hybrid.
Coho’s storage stack continuously monitors and characterizes your workloads to appropriately size storage allocations for optimal performance and to report on the working set characteristics of your applications. We have recently published exciting new algorithms based on these results at a top-tier systems research conference. If you are interested in learning more, my Storage Field Day presentation (above) provides an overview of our workload monitoring and autotiering design, called Cascade.
Nonuniform distributions are everywhere. Thanks to observations such as Pareto’s, system design at all scales benefits from focussing on serving the most popular things as efficiently as possible. Designs like these lead to the differences between highways and rural roads, hub cities in transport systems, core internet router designs, and most of the Netflix Original Series titles. Storage systems are no different, and building storage systems well requires careful, workload-responsive analysis to size and apply working set characteristics appropriately.
Some closing notes:
1. The image at the top of this post is a satirized version of an old Scott paper towel commercial. Some commentary on it can be found, for example, on the Society Pages.
2. Enormous thanks are due to Jake Wires and Stephen Ingram, who put in a huge amount of work on trace collection, processing, and analysis for the data that’s behind this post. A bunch of the analysis here was done using queries against Coho’s Counter Stack engine. Stephen also deserves thanks for helping develop and debug the visualizations, which were prepared using Mike Bostock’s excellent D3.js library.