加载中

A lot of people seem to be talking about and performing load tests on HTTP servers, perhaps because there’s a lot more choice of servers these days.

That’s great, but I see a lot of the same mistakes being made, making the conclusions doubtful at best. Having spent a fair amount of time benchmarking high-performance proxy caches and origin servers for my day job, here are a few things that I think are important to keep in mind.

It’s not the final word, but hopefully it’ll help start a discussion.

有很多人在谈论HTTP服务器软件的性能测试,也许是因为现在有太多的服务器选择。

这很好,但是我看到有人很多基本相同的问题,使得测试结果的推论值得怀疑。在日常工作中花费了很多时间在高性能代理缓存和源站性能测试方面之后,这里有我认为比较重要的一些方面来分享。

希望能抛砖引玉。

0. Consistency.

The most important thing to get right is to test the same time, every time. Any changes in the system — whether its an OS upgrade or another app running and stealing bandwidth or CPU — can affect your test results, so you need to be aggressive about nailing down the test environment.

While it’s tempting to say that the way to achieve this is to run everything on VMs, I’m not convinced that adding another layer of abstraction (as well as more processes running on the host OS) is going to lead to more consistent results. Because of this, dedicated hardware is best. Failing that, just run all of the tests you can in one session, and make it clear that comparisons between different sessions don’t work.

0. 一致性

最最重要的是,每次都测试同一个时间点。因为系统发生的每个改变,无论是OS升级还是运行了其它消耗带宽和CPU的应用,都会影响测试的结果,所以一定要把测试环境固定下来。

也许,有人会说那就把测试放虚拟机里做吧,听起来不错。但是,这种方式加多了一个抽象层(而且宿主机上也跑了更多的进程),如果说这样就能得到更加一致的结果,我是无论如何也不会相信的。我觉得,最好的办法是为测试准备一套专用硬件。如果做不到这个,那么一定要把所有测试放在同一个会话里,不要去比较不同会话里的测试结果。

1. One Machine, One Job.

The most common mistake I see people making is benchmarking a server on the same box where the load is generated. This doesn’t just put your results out a little bit, it makes them completely unreliable, because the load generator’s “steal” of resources varies depending on how the server handles the load, which depends on resource availability.

The best way to maintain consistency is to have dedicated, separate hardware for the test subject and load generator, and to test on a closed network. This isn’t very expensive; you don’t need the latest-and-greatest to compare apples to apples, it just has to be consistent.

So, if you see someone saying that they benchmarked on localhost, or if they fail to say how many boxes they used to generate and serve the load, ignore the results; at best they’ll only be a basic indication, and at the worst the’ll be very misleading.

1. 每台机器,各司其职

人们常常会犯另一个错误,他们把负载生成器和被测试的服务器放在同一台机器上。这样做将导致产生不可靠的测试结果,因为,负载生成器实际上是「窃取」了一部分资源,而且这部分资源的量还会随着服务器处理负载情况的变化而变化。

最好的做法是,为测试主体和负载生成器分配不同的硬件,而且将它们放在封闭的网络上。这样做的代价并不是太高,我们并不需要非常高精尖的配置,只需要确保一致性就好。

所以,如果有人跟你说,他们的测试是在localhost上做的,或者拒绝透露测试用了几台机器,那么你尽可以忽略他们的结果。因为,这样的结果,往好了说,只有最基本的定性作用,往坏了说,甚至可能会误人子弟。


2. Check the Network.

Before each test, you need to understand how much capacity your network has, so that you’ll know when it’s limiting your test, rather than the server your’e testing.

One way to do this is with iperf:

qa1:~> iperf -c qa2
------------------------------------------------------------
Client connecting to qa2, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.106 port 56014 connected with 192.168.1.107 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   943 Mbits/sec

… which shows that I have about 943 Mbits a second available on my Gigabit network (it’s not 1,000 because of TCP overheads).

Once you know the bandwidth available, you need to make sure that it isn’t a limiting factor. There are a number of ways to do this, but the easiest is to use a tool that keeps track of the traffic in use. For example, httperf shows bandwidth use like this:

Net I/O: 23399.7 KB/s (191.7*10^6 bps)

… which tells me that I’m only using about 192 Mbits of my Gigabit in this test.

Keep in mind that the numbers you see from a load generation tool are not going to include the overhead of TCP, and if your load varies throughout the test, it can burst higher than the average. Also, sheer bandwidth isn’t the whole story — for example, if you’re using cheap NICs or switches (you are using a switch, right?), they can be swamped by the sheer number of network segments flying around.

Screen Shot 2011-05-17 At 3.51.46 Pm

For all of these reasons and more, it’s a good idea to make sure your tests don’t get very close to the available bandwidth you measure; instead, make sure the bandwidth used doesn’t exceed a proportion of it (e.g., 2/3). Monitoring your network (both the interfaces and the switch) for errors and peak rates is also a good idea.

2. 检查网络

在测试前,一定要知道你的网络有多大容量,这样,你才会知道,什么时候是你测试的服务器制约了测试,什么时候是网络制约了测试。

一种方法是通过iperf:

qa1:~> iperf -c qa2
------------------------------------------------------------
Client connecting to qa2, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.106 port 56014 connected with 192.168.1.107 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   943 Mbits/sec

从上面的输出中可以看到,我的千兆网可以达到943Mbps的速度(之所以不到1000Mbps,是由于TCP的开销)。

知道网络容量后,我们需要确保它不会成为制约测试的因素。有几种方法,最简单的是记录当前使用的带宽。例如,httperf可以像这样展示当前的带宽用量:

Net I/O: 23399.7 KB/s (191.7*10^6 bps)

上例表明,我们目前只用了192Mbps。

记住,我们在负载产生工具中看到的数字并没有包括TCP开销,而且,如果我们的负载在整个测试期间并不固定,那么突发带宽一定会有超过平均值的时候。而且除了带宽以外还有其它的问题,例如,廉价的网卡和交换机很有可能会被大量的数据包淹没。

Screen Shot 2011-05-17 At 3.51.46 Pm

基于以上的种种原因,我们最好不要让测试的带宽逼近网络的可用带宽,最好是不要超过某个比例,比如2/3。对网络(包括网卡和交换机)错误和峰值速率进行监控也是一个很好的办法。

3. Remove OS Limitations.

Likewise, you need to make sure that the operating system doesn’t impose artificial limits on your server’s performance.

TCP tuning is somewhat important here, but it’ll affect all test subjects equally. The important thing is to make sure that your server doesn’t run out of file descriptors.

3. 去除OS限制

同样,我们还需要确保OS不会对服务器的表现构成限制。

TCP参数的调整颇为重要,但它会对所有测试主体产生相同的影响。更为重要的是,不要让你测试的服务器用光文件描述符(file descriptor)。

4. Don’t Test the Client.

Modern, high-performance servers make it very easy to mistake limitations in your load generator for the capacity of the server you’re testing. So, check to make sure your client box isn’t maxed out on CPU, and if there’s any doubt whatsoever, use more than one load generation box to double-check your numbers (autobench makes this relatively painless).

It also helps to assure that your load generation hardware is better than the server hardware you’re testing; e.g., I generate load with a four-core i5-750 box, and run the server on a slower, two-core i3-350 box, often only using one of the cores.

Another factor to be mindful of is client-side errors, especially running out of ephemeral ports. There are lots of strategies for this, from expanding the port range on your box, to setting up multiple interfaces on the box and making sure that the client uses them (sometimes tricker than it sounds). You can also tune the TIME_WAIT period (as long as it’s ONLY a test box!), or just use HTTP persistent connections and aggressive client-side timeouts to make sure your connection rate doesn’t exceed available ports.

One of the things I like about httperf is that it gives a summary of errors at the end of the run:

Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

Here, the server-originated issues are on the first line (such as the request timing out due to the server exceeding the--timeoutoption, or when it refuses or resets a connection), and the client-side errors (like running out of file descriptors or addresses) on the second line.

This helps you know when the test itself is faulty.

4. 避免压测客户端

现代高性能服务器使得容易将负载产生器的限制当作服务器的能力。所以,认真检查以确保你的客户端没有用爆CPU,如果有任何怀疑,就使用更多客户端压力来验证(autobench可以让这件事更简单)。

确保负载产生器的硬件优于要测试的服务端硬件,也很有帮助;比如,使用4核心的i5-750服务器产生负载,将服务器运行在较慢的双核i3-350服务器上,而且经常只用两个核心之一。

另外一个需要注意的因素是客户端错误,尤其是临时端口(ephemeral ports)用尽。处理这个问题有很多策略,扩大服务器的可用端口范围,或设置多个网络接口,确认由客户端使用它们(有时需要一些技巧)。也可以优化TIME_WAIT时间(只有是测试环境就没问题),或仅仅使用HTTP长连接和激进的客户端超时策略,确保连接速率不会用尽端口数。

我喜欢httperf的一个原因是它在结束时提供了错误的概要:

Errors: total 0 client-timo 0 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

这里,服务器的问题位于第一行(比如由于服务器端超过--timeout设置的超时时间而造成的请求超时,或拒绝连接,或连接重置),客户端错误(如文件描述符用尽或地址用尽)在第二行。

在测试本身出错时,这能够帮你及时了解。

5. Overload is not Capacity.

Many — if not most — load generation tools will by default throw as much load as they can at a server, and report that number back.

This is great for finding out how your server reacts to overload — an important thing to know — but it doesn’t really show the capacity of your server. That’s because pretty much every server loses some capacity once you throw more work at it than it can handle.

A better way to get an idea of capacity is to test your server at progressively higher loads, until it reaches capacity and then backs off; you should be able to graph it as a curve that peaks and then backs off. How much it backs off will indicate how well your server deals with overload.

autobench is one way to do this with httperf; it allows you to specify a range of rates to test at, so that you can generate graphs like this:

Apache-Worker

Here, you can see that, for the smallest response size, the server peaks at 16,000 responses/second, but quickly drops down to 14,000 responses/second under overload (with a corresponding jump up to about 60ms response latency). Other response sizes don’t drop as much when overloaded, but you can see the error bars pop up, which shows it struggling.

5. 过载时的容量并不是真正的容量

产生尽可能大的负载,扔给服务器,这是许多负载产生工具的工作模式。

这种方法对于检查服务器在过载情况下的表现也许不错,但它并不能真正帮我们确定服务器的容量。因为,许多服务器在过载的情况下都会损失一部分的容量。

更好的方法是逐步加大负载,直到服务器达到容量上限,出现性能下降为止。我们可以把结果绘制成一条先抵达峰值、随后下降的曲线,而下降的程度意味着服务器应对过载时的表现情况。

autobench是其中的一种方法,我们可以设置测试的范围,然后就可以得到这样一张图:

Apache-Worker

可以看到,在响应消息最小时,服务器的处理峰值为16,000响应/秒,但在过载情况下快速衰落至14,000响应/秒。而在响应消息更大时,过载情况下的衰落并没有这么多,但可以看到错误条不停弹出来,说明了服务器的紧张境况。

6. Thirty Seconds isn’t a Test.

It takes a while for the various layers of buffers and caches in the applications, OS and network stacks to stabilise, so a 30 second test can be very misleading. If you’re going to release numbers, test for at least three minutes, preferably more like five or ten.

7. Do More than Hello World.

Finding out how quickly your implementation can serve a 4-byte response body is an interested but extremely limited look at how it performs. What happens when the response body is 4k — or 100k — is often much more interesting, and more representative of how it’ll handle real-life load.

Another thing to look at is how it handles load with a large number — say, 10,000 — of outstanding idle persistent connections (opened with a separate tool). A decent, modern server shouldn’t be bothered by this, but it causes issues more often than you’d think.

These are just two examples, of course.

6. 30秒的测试不算测试

由于处于应用、OS与网络栈各层的缓冲区需要一定的稳定时间,所以30秒的测试很可能不准确。如果你的测试数据是要正式发布的,请至少测试3分钟,或者更长一些,比如5到10分钟。

7. 不要仅仅测试Hello World

如果要测试服务器的响应是否迅速,用4字节的响应消息过于局限,意义不大,4k甚至100k才更有现实意义。

另外一个需要测试的是服务器在面对大量空闲连接时的表现,比如10,000个连接。这对现代的成熟服务器来说本来不算什么,但往往会导致一些你想象不到的问题。

当然,以上仅仅只是两个例子而已。

8. Not Just Averages.

If someone tells you that a server does 1,000 responses a second with an average latency of 5ms, that’s great. But what if some of those responses took 100ms? They can still achieve that average. What if for 10% of the test period, the server was only able to achieve 500 responses a second, because it was doing garbage collection?

Averages are quick indicators, nothing more. Timelines and histograms contain a lot of critical information that they omit. If your testing tool doesn’t provide this information, find one that does (or submit a patch, if it’s Open Source).

Here’s what httperf shows:

Total: connections 180000 requests 180000 replies 180000 test-duration 179.901 s

Connection rate: 1000.0 conn/s (99.9 ms/conn, <=2 concurrent connections)
Connection time [ms]: min 0.4 avg 0.5 max 12.9 median 0.5 stddev 0.4
Connection time [ms]: connect 0.1
Connection length [replies/conn]: 1.000

Request rate: 1000.0 req/s (.9 ms/req)
Request size [B]: 79.0

Reply rate [replies/s]: min 999.1 avg 1000.0 max 1000.2 stddev 0.1 (35 samples)
Reply time [ms]: response 0.4 transfer 0.0
Reply size [B]: header 385.0 content 1176.0 footer 0.0 (total 1561.0)
Reply status: 1xx=0 2xx=0 3xx=0 4xx=1800 5xx=0

Here, you can see not only the average response rates, but also a min, max and standard deviation. Likewise for connection time.

8. 仅仅平均值是不够的

如果有人告诉你,服务器可以每秒产生1000个响应,平均时延为5ms,听起来是不是很棒?但是,如果这1000个响应里,有些需要100ms呢?或者,如果说在整个测试的10%时间里,由于垃圾收集的关系,只能达到500个响应/秒的速度,你怎么看?

平均值是个快速指标,但是,仅此而已。有许多重要的信息,包含在时间线和直方图(histogram)中,但平均值并没有提供。如果你的测试工具不提供时间线和直方图,那么还是换一个吧(开源的话,还可以选择提交一个patch)。

httperf可以显示:

Total: connections 180000 requests 180000 replies 180000 test-duration 179.901 s

Connection rate: 1000.0 conn/s (99.9 ms/conn, <=2 concurrent connections)
Connection time [ms]: min 0.4 avg 0.5 max 12.9 median 0.5 stddev 0.4
Connection time [ms]: connect 0.1
Connection length [replies/conn]: 1.000

Request rate: 1000.0 req/s (.9 ms/req)
Request size [B]: 79.0

Reply rate [replies/s]: min 999.1 avg 1000.0 max 1000.2 stddev 0.1 (35 samples)
Reply time [ms]: response 0.4 transfer 0.0
Reply size [B]: header 385.0 content 1176.0 footer 0.0 (total 1561.0)
Reply status: 1xx=0 2xx=0 3xx=0 4xx=1800 5xx=0
可以看到,它不仅显示了响应速度的平均值,还显示了最小值、最大值和标准差(deviation)。连接时间也是这样。

9. Publish it All.

A result given without enough information to reproduce it is at best a useless statement that requires people to take it on faith (a bad idea), and at worst an intentional effort to mislead. Publish your results with all of the context of the test; not only the hardware you used, but also OS versions and configurations, network setup (with iperf results), server and load generator versions and configuration, workload used, and source code if necessary.

Ideally, this would take the form of a repository (e.g., on github) that allows anyone to reproduce your results (for their hardware) with as little overhead as possible.

9. 把相关信息全部发布出来

如果只是给出一个结果,而不给出重现它的必要信息,那么往好了说,这个结果只是一个全凭大家靠信仰来相信的没有用的结论,往坏了说,它甚至有故意误人子弟之嫌(译者注: 各种数据库的宣传式的benchmark表示纷纷中枪)。所以,如果要发布测试结果,记得要把测试的相关上下文也一起发布,不光是测试所用的硬件,还应包括OS版本与配置、网络设置、服务器与负载产生器的版本与配置、所用的负载,甚至必要时还要加上源代码。

理想情况下,我们可以采用代码库(比如github)的形式,使任何人都能以最小的代价(用自己的硬件)重现你的结果。

返回顶部
顶部