
It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

And computers are big, too. You can buy a 1000MHz machine with 2 gigabytes of RAM and a 1000Mbit/sec Ethernet card for $1200 or so. Let's see - at 20000 clients, that's 50KHz, 100Kbytes, and 50Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of twenty thousand clients. (That works out to $0.08 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.

In 1999 one of the busiest ftp sites, cdrom.com, actually handled 10000 clients simultaneously through a Gigabit Ethernet pipe. As of 2001, that same speed is now being offered by several ISPs, who expect it to become increasingly popular with large business customers.

And the thin client model of computing appears to be coming back in style -- this time with the server out on the Internet, serving thousands of clients.

With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, as that's my personal area of interest, but Windows is also covered a bit.

Related Sites

See Nick Black's excellent Fast UNIX Servers page for a circa-2009 look at the situation.

In October 2003, Felix von Leitner put together an excellent web page and presentation about network scalability, complete with benchmarks comparing various networking system calls and operating systems. One of his observations is that the 2.6 Linux kernel really does beat the 2.4 kernel, but there are many, many good graphs that will give the OS developers food for thought for some time. (See also the Slashdot comments; it'll be interesting to see whether anyone does followup benchmarks improving on Felix's results.)

Book to Read First

If you haven't read it already, go out and get a copy of Unix Network Programming : Networking Apis: Sockets and Xti (Volume 1) by the late W. Richard Stevens. It describes many of the I/O strategies and pitfalls related to writing high-performance servers. It even talks about the 'thundering herd' problem. And while you're at it, go read Jeff Darcy's notes on high-performance server design.

(Another book which might be more helpful for those who are *using* rather than *writing* a web server is Building Scalable Web Sites by Cal Henderson.)

I/O frameworks

Prepackaged libraries are available that abstract some of the techniques presented below, insulating your code from the operating system and making it more portable.

  • ACE, a heavyweight C++ I/O framework, contains object-oriented implementations of some of these I/O strategies and many other useful things. In particular, its Reactor is an OO way of doing nonblocking I/O, and its Proactor is an OO way of doing asynchronous I/O.
  • ASIO is a C++ I/O framework which is becoming part of the Boost library. It's like ACE updated for the STL era.
  • libevent is a lightweight C I/O framework by Niels Provos. It supports kqueue and select, and soon will support poll and epoll. It's level-triggered only, I think, which has both good and bad sides. Niels has a nice graph of time to handle one event as a function of the number of connections. It shows kqueue and sys_epoll as clear winners. (A minimal usage sketch appears after this list.)
  • My own attempts at lightweight frameworks (sadly, not kept up to date):
    • Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API using whatever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio). It's useful for benchmarks that compare the performance of the various APIs. This document links to Poller subclasses below to illustrate how each of the readiness APIs can be used.
    • rn is a lightweight C I/O framework that was my second try after Poller. It's LGPL (so it's easier to use in commercial apps) and C (so it's easier to use in non-C++ apps). It was used in some commercial products.
  • Matt Welsh wrote a paper in April 2000 about how to balance the use of worker thread and event-driven techniques when building scalable servers. The paper describes part of his Sandstorm I/O framework.
  • Cory Nelson's Scale! library - an async socket, file, and pipe I/O library for Windows
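
To give a feel for what using one of these frameworks looks like, here is a minimal sketch with libevent: it registers a read callback on an existing descriptor and runs the event loop. It is written against the later libevent 2.x API (event_base_new(), event_new()), which postdates the description above; watch_fd() and on_readable() are illustrative names, not part of the library.

    /* Sketch: level-triggered read notification via libevent 2.x. */
    #include <event2/event.h>
    #include <stdio.h>

    static void on_readable(evutil_socket_t fd, short what, void *arg)
    {
        /* Called by libevent each time fd is readable. */
        (void)what; (void)arg;
        printf("fd %d is ready for reading\n", (int)fd);
    }

    int watch_fd(int fd)
    {
        struct event_base *base = event_base_new();      /* one event loop */
        if (base == NULL)
            return -1;

        /* EV_PERSIST keeps the registration alive across callbacks. */
        struct event *ev = event_new(base, fd, EV_READ | EV_PERSIST,
                                     on_readable, NULL);
        if (ev == NULL || event_add(ev, NULL) != 0)
            return -1;

        return event_base_dispatch(base);                /* run the loop */
    }

The point of such a framework is precisely that it decides internally whether to sit on top of kqueue, epoll, poll, or select, so the application code above stays the same across operating systems.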

I/O Strategies

Designers of networking software have many options. Here are a few:
  • Whether and how to issue multiple I/O calls from a single thread
    • Don't; use blocking/synchronous calls throughout, and possibly use multiple threads or processes to achieve concurrency
    • Use nonblocking calls (e.g. write() on a socket set to O_NONBLOCK) to start I/O, and readiness notification (e.g. poll() or /dev/poll) to know when it's OK to start the next I/O on that channel. Generally only usable with network I/O, not disk I/O.
    • Use asynchronous calls (e.g. aio_write()) to start I/O, and completion notification (e.g. signals or completion ports) to know when the I/O finishes. Good for both network and disk I/O.
  • How to control the code servicing each client
    • one process for each client (classic Unix approach, used since 1980 or so)
    • one OS-level thread handles many clients; each client is controlled by:
      • a user-level thread (e.g. GNU state threads, classic Java with green threads)
      • a state machine (a bit esoteric, but popular in some circles; my favorite)
      • a continuation (a bit esoteric, but popular in some circles)
    • one OS-level thread for each client (e.g. classic Java with native threads)
    • one OS-level thread for each active client (e.g. Tomcat with apache front end; NT completion ports; thread pools)
  • Whether to use standard O/S services, or put some code into the kernel (e.g. in a custom driver, kernel module, or VxD)

The following five combinations seem to be popular:

  1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
  2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification
  3. Serve many clients with each server thread, and use asynchronous I/O (a brief sketch follows this list)
  4. Serve one client with each server thread, and use blocking I/O
  5. Build the server code into the kernel
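
For combination 3, the asynchronous calls in question are the POSIX AIO family. The sketch below queues one aio_write() and waits for it with aio_suspend(); async_write() is an illustrative name, error handling is trimmed, and note that support for AIO on network sockets, as opposed to files, varies between operating systems.

    /* Sketch of combination 3: queue an asynchronous write, then wait
     * for completion. A real server would use completion signals or
     * callbacks instead of blocking in aio_suspend(). */
    #include <aio.h>
    #include <string.h>
    #include <sys/types.h>

    ssize_t async_write(int fd, char *buf, size_t len)
    {
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = 0;                /* file offset; used for regular files */

        if (aio_write(&cb) != 0)          /* queue the request and return */
            return -1;

        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);       /* block until this request completes */

        return aio_return(&cb);           /* bytes written, or -1 */
    }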

1. Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification

... set nonblocking mode on all network handles, and use select() or poll() to tell which network handle has data waiting. This is the traditional favorite. With this scheme, the kernel tells you whether a file descriptor is ready, whether or not you've done anything with that file descriptor since the last time the kernel told you about it. (The name 'level triggered' comes from computer hardware design; it's the opposite of 'edge triggered'. Jonathon Lemon introduced the terms in his BSDCON 2000 paper on kqueue().)

Note: it's particularly important to remember that readiness notification from the kernel is only a hint; the file descriptor might not be ready anymore when you try to read from it. That's why it's important to use nonblocking mode when using readiness notification.
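
Concretely, the classic level-triggered loop looks something like the sketch below: each socket is put into nonblocking mode with fcntl(), poll() reports which ones look ready, and EAGAIN/EWOULDBLOCK from the actual read is treated as "not ready after all". The fds array and the commented-out handle_client() are placeholders; this is a sketch of the scheme, not a complete server.

    /* Sketch of combination 1: nonblocking sockets plus a level-triggered
     * poll() loop. The caller fills fds[] with sockets it has already put
     * into nonblocking mode (e.g. via set_nonblocking) with events = POLLIN. */
    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>

    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    void serve(struct pollfd *fds, nfds_t nfds)
    {
        for (;;) {
            if (poll(fds, nfds, -1) < 0)         /* wait for readiness hints */
                continue;                        /* e.g. EINTR */

            for (nfds_t i = 0; i < nfds; i++) {
                if (!(fds[i].revents & POLLIN))
                    continue;
                char buf[4096];
                ssize_t n = read(fds[i].fd, buf, sizeof buf);
                if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                    continue;                    /* hint was stale; poll again */
                /* handle_client(fds[i].fd, buf, n);   placeholder */
            }
        }
    }

Note that with level-triggered notification it is fine to read only once per wakeup; if data remains, poll() simply reports the descriptor as ready again on the next pass.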

An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not in core at the moment; setting nonblocking mode on a disk file handle has no effect. Same thing goes for memory-mapped disk files. The first time a server needs disk I/O, its process blocks, all clients must wait, and that raw nonthreaded performance goes to waste.
This is what asynchronous I/O is for, but on systems that lack AIO, worker threads or processes that do the disk I/O can also get around this bottleneck. One approach is to use memory-mapped files, and if mincore() indicates I/O is needed, ask a worker to do the I/O, and continue handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's 1999 Flash web server uses this trick; they gave a talk at Usenix '99 on it. It looks like mincore() is available in BSD-derived Unixes like FreeBSD and Solaris, but is not part of the Single Unix Specification. It's available as part of Linux as of kernel 2.3.51, thanks to Chuck Lever.
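
Here is a sketch of the mincore() check itself, assuming the file has already been mmap()ed (the Linux prototype takes an unsigned char vector; some BSDs use char). The hand-off to a worker is only indicated by a comment, and page_resident() is an illustrative name.

    /* Sketch of the Flash-style check: before touching a mapped page,
     * ask mincore() whether it is resident; if not, hand the work to a
     * worker thread instead of blocking the event loop. */
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Returns 1 if the page containing addr is in core, 0 if not, -1 on error. */
    int page_resident(void *addr)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        void *page = (void *)((uintptr_t)addr & ~((uintptr_t)pagesize - 1));
        unsigned char vec[1];

        if (mincore(page, (size_t)pagesize, vec) != 0)
            return -1;
        return vec[0] & 1;
    }

    /* In the event loop (placeholder names):
     *     if (!page_resident(p)) queue_for_worker(conn);
     *     else write(client_fd, p, len);
     */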

But in November 2003 on the freebsd-hackers list, Vivek Pai et al reported very good results using system-wide profiling of their Flash web server to attack bottlenecks. One bottleneck they found was mincore (guess that wasn't such a good idea after all). Another was the fact that sendfile blocks on disk access; they improved performance by introducing a modified sendfile() that returns something like EWOULDBLOCK when the disk page it's fetching is not yet in core. (Not sure how you tell the user the page is now resident... seems to me what's really needed here is aio_sendfile().) The end result of their optimizations is a SpecWeb99 score of about 800 on a 1GHz/1GB FreeBSD box, which is better than anything on file at spec.org.

There are several ways for a single thread to tell which of a set of nonblocking sockets are ready for I/O:

  • The traditional select()
    Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled into the standard library and user programs. (Some versions of the C library let you raise this limit at user app compile time.)

    See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness notification schemes.

  • The traditional poll()
    There is no hardcoded limit to the number of file descriptors poll() can handle, but it does get slow beyond a few thousand, since most of the file descriptors are idle at any one time, and scanning through thousands of file descriptors takes time.

    Some OS's (e.g. Solaris 8) speed up poll() et al by use of techniques like poll hinting, which was implemented and benchmarked by Niels Provos for Linux in 1999.

    See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other readiness notification schemes.

  • /dev/poll
    This is the recommended poll replacement for Solaris.

    The idea behind /dev/poll is to take advantage of the fact that often poll() is called many times with the same arguments. With /dev/poll, you get an open handle to /dev/poll, and tell the OS just once what files you're interested in by writing to that handle; from then on, you just read the set of currently ready file descriptors from that handle.

    It appeared quietly in Solaris 7 (see patchid 106541) but its first public appearance was in Solaris 8; according to Sun, at 750 clients, this has 10% of the overhead of poll().

    Various implementations of /dev/poll were tried on Linux, but none of them performed as well as epoll, and none was ever really completed. Using /dev/poll on Linux is not recommended.

    See Poller_devpoll (cc, h, benchmarks) for an example of how to use /dev/poll interchangeably with many other readiness notification schemes. (Caution - the example is for Linux /dev/poll, might not work right on Solaris.)

  • kqueue()
    This is the recommended poll replacement for FreeBSD (and, soon, NetBSD).

    See below. kqueue() can specify either edge triggering or level triggering.
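
A minimal kqueue() sketch (FreeBSD-style API): register one descriptor for read readiness and wait for a single event. wait_readable() is an illustrative name, and a real server would keep the kqueue open and pass many changes and events per call.

    /* Sketch: register fd with kqueue() and wait for read readiness.
     * EV_ADD alone gives level-triggered behavior; EV_ADD | EV_CLEAR
     * would make it edge-triggered. */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <unistd.h>

    int wait_readable(int fd)
    {
        int kq = kqueue();
        if (kq < 0)
            return -1;

        struct kevent change, event;
        EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);

        /* Submit the registration and wait for one event in a single call. */
        int n = kevent(kq, &change, 1, &event, 1, NULL);
        close(kq);
        if (n <= 0)
            return -1;

        return (int)event.data;   /* for EVFILT_READ: bytes available to read */
    }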

2. Serve many clients with each thread, and use nonblocking I/O and readiness change notification

Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file descriptor, and later, when that descriptor transitions from not ready to ready, the kernel notifies you somehow. It then assumes you know the file descriptor is ready, and will not send any more readiness notifications of that type for that file descriptor until you do something that causes the file descriptor to no longer be ready (e.g. until you receive the EWOULDBLOCK error on a send, recv, or accept call, or a send or recv transfers less than the requested number of bytes).

When you use readiness change notification, you must be prepared for spurious events, since one common implementation is to signal readiness whenever any packets are received, regardless of whether the file descriptor was already ready.

This is the opposite of "level-triggered" readiness notification. It's a bit less forgiving of programming mistakes, since if you miss just one event, the connection that event was for gets stuck forever. Nevertheless, I have found that edge-triggered readiness notification made programming nonblocking clients with OpenSSL easier, so it's worth trying.

[Banga, Mogul, Druschel '99] described this kind of scheme in 1999.
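
Linux's epoll, used in EPOLLET mode, is one API that provides this kind of readiness-change notification. The sketch below shows the two pieces the paragraphs above imply: registering a nonblocking descriptor edge-triggered, and draining it until EAGAIN after each notification so the next edge can fire. drain_fd() is an illustrative name, and epfd is assumed to come from epoll_create1(0).

    /* Sketch of readiness-change (edge-triggered) handling with Linux
     * epoll. After each notification the nonblocking socket must be
     * drained until EAGAIN, or no further notification will arrive. */
    #include <errno.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int watch_edge_triggered(int epfd, int fd)   /* epfd from epoll_create1(0) */
    {
        struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);  /* fd must be nonblocking */
    }

    void drain_fd(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0)
                continue;                  /* process buf[0..n) here */
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                return;                    /* drained; wait for the next edge */
            return;                        /* 0 = EOF, or a real error */
        }
    }

The main loop would call epoll_wait() and then drain_fd() on each descriptor it reports; missing the drain step is exactly the "connection gets stuck forever" mistake described above.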
