Java performance has the reputation of being something of a Dark Art. Partly this is due to the sophistication of the platform, which makes it hard to reason about in many cases. However, there has historically also been a trend for Java performance techniques to consist of a body of folk wisdom rather than applied statistics and empirical reasoning. In this article, I hope to address some of the most egregious of these technical fairytales.

1. Java is slow

Of all the most outdated Java Performance fallacies, this is probably the most glaringly obvious.

Sure, back in the 90s and very early 2000s, Java could be slow at times.


However, we have had over 10 years of improvements in virtual machine and JIT technology since then, and Java's overall performance is now screamingly fast.

In six separate web performance benchmarks, Java frameworks took 22 out of the 24 top-four positions.

The JVM's strategy of using profiling to optimize only the commonly used code paths, but to optimize those heavily, has paid off. JIT-compiled Java code is now as fast as C++ in a large (and growing) number of cases.

Despite this, the perception of Java as a slow platform persists, perhaps due to a negative historical bias from people who had experiences with early versions of the Java platform.

We suggest remaining objective and assessing up-to-date performance results before jumping to conclusions.

2. A single line of Java means anything in isolation

Consider the following short line of code:

MyObject obj = new MyObject();

To a Java developer, it seems obvious that this code must allocate an object and run the appropriate constructor.

From that we might begin to reason about performance boundaries. We know that there is some finite amount of work that must be going on, and so we can attempt to calculate performance impact based on our presumptions.

This is a cognitive bias that can trap us into thinking that we know, a priori, that any work will need to be done at all.

In actuality, both javac and the JIT compiler can optimize away dead code. In the case of the JIT compiler, code can even be optimized away speculatively, based on profiling data. In such cases the line of code won't run at all, and so it will have zero performance impact.
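
A minimal sketch of the effect (the class and numbers here are illustrative, not from any real benchmark): a timed computation only stays "alive" to the JIT if its result is actually used, so naive timings of such a line can measure nothing at all.

```java
// Sketch: why a single line of Java means nothing in isolation.
// If the result of compute() were discarded, the JIT could prove the
// loop is dead code and remove it entirely after warmup -- timing it
// would then measure an empty method, i.e. zero real work.
public class DeadCodeDemo {
    static long compute(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;        // only "real" work if 'sum' escapes the method
        }
        return sum;          // returning (and using) the value keeps it live
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long result = compute(1_000_000);
        long elapsed = System.nanoTime() - t0;
        // Printing the result forces the computation to be observable.
        System.out.println(result + " computed in " + elapsed + " ns");
    }
}
```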

Furthermore, in some JVMs, such as JRockit, the JIT compiler can even decompose object operations so that allocations can be avoided even if the code path is not completely dead.

The moral of the story here is that context is significant when dealing with Java performance, and premature optimization can produce counter-intuitive results. For best results don’t attempt to optimize prematurely. Instead always build your code and use performance tuning techniques to locate and correct your performance hot spots.

3. A microbenchmark means what you think it does

As we saw above, reasoning about a small section of code is less accurate than analyzing overall application performance.

Nonetheless developers love to write microbenchmarks. The visceral pleasure that some people derive from tinkering with some low-level aspect of the platform seems to be endless.

Richard Feynman once said: "The first principle is that you must not fool yourself - and you are the easiest person to fool". Nowhere is this truer than when writing Java microbenchmarks.

Writing good microbenchmarks is profoundly difficult. The Java platform is sophisticated and complex, and many microbenchmarks only succeed in measuring transient effects, or other unintended aspects of the platform.

For example, a naively written microbenchmark will frequently end up measuring the timing subsystem or perhaps garbage collection rather than the effect it was trying to capture.

Only developers and teams that have a real need for them should write microbenchmarks. These benchmarks should be published in their entirety (including source code), and should be reproducible and subject to peer review and deep scrutiny.

The Java platform's many optimizations imply that the statistics of individual runs matter. A single benchmark must be run many times and the results aggregated to get a reliable answer.
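
A minimal sketch of that aggregation, assuming the per-run timings have already been collected (the class name and sample numbers are illustrative):

```java
import java.util.Arrays;

// Sketch: aggregate many timed runs instead of trusting one.
public class RunStats {
    // Mean of the per-run timings, in whatever unit they were recorded in.
    static double mean(double[] runs) {
        return Arrays.stream(runs).average().orElse(0.0);
    }

    // Sample standard deviation: a single run says nothing about
    // run-to-run variance; this at least quantifies the spread.
    static double stdDev(double[] runs) {
        double m = mean(runs);
        double ss = Arrays.stream(runs).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(ss / (runs.length - 1));
    }

    public static void main(String[] args) {
        double[] timings = {10.2, 9.8, 10.5, 10.1, 9.9}; // ms, illustrative
        System.out.printf("mean=%.2f ms, sd=%.2f ms%n",
                mean(timings), stdDev(timings));
    }
}
```

This is only the first step; proper treatment (confidence intervals, discarding warmup runs) is exactly what the paper above covers.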

If you feel you must write microbenchmarks, then a good place to start is by reading the paper "Statistically Rigorous Java Performance Evaluation" by Georges, Buytaert, Eeckhout. Without proper treatment of the statistics, it is very easy to be misled.

There are well-developed tools and communities around them (for example, Google's Caliper) - if you absolutely must write microbenchmarks, then do not do so by yourself - you need the viewpoints and experience of your peers.

4. Algorithmic slowness is the most common cause of performance problems

A very familiar cognitive fallacy among developers (and humans in general) is to assume that the parts of a system that they control are the important ones.

In Java performance, this manifests itself by Java developers believing that algorithmic quality is the dominant cause of performance problems. Developers think about code, so they have a natural bias towards thinking about their algorithms.

In practice, when dealing with a range of real-world performance problems, algorithm design was found to be the fundamental issue less than 10% of the time.

Instead, garbage collection, database access and misconfiguration were all much more likely to cause application slowness than algorithms.

Most applications deal with relatively small amounts of data, so that even major algorithmic inefficiencies don't often lead to severe performance problems. To be sure, we are acknowledging that the algorithms were suboptimal; nonetheless the amount of inefficiency they added was small relative to other, much more dominant performance effects from other parts of the application stack.

So our best advice is to use empirical, production data to uncover the true causes of performance problems. Measure; don't guess!

5. Caching solves everything

"Every problem in Computer Science can be solved by adding another level of indirection"

This programmer's aphorism, attributed to David Wheeler (and thanks to the Internet, to at least two other Computer Scientists), is surprisingly common, especially among web developers.

Often this fallacy arises due to analysis paralysis when faced with an existing, poorly understood architecture.

Rather than deal with an intimidating extant system, a developer will frequently choose to hide from it by sticking a cache in front and hoping for the best. Of course, this approach just complicates the overall architecture and makes the situation worse for the next developer who seeks to understand the status quo of production.

Large, sprawling architectures are written one line, and one subsystem, at a time. However, in many cases simpler, refactored architectures are more performant - and they are almost always easier to understand.

So when you are evaluating whether caching is really necessary, plan to collect basic usage statistics (miss rate, hit rate, etc.) to prove that the caching layer is actually adding value.
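
As a sketch of what collecting those statistics might look like (this class is illustrative, not a real caching library; production code would use an instrumented cache such as those in common caching frameworks):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: a cache that counts hits and misses, so its value can be
// demonstrated (or disproved) with numbers rather than hope.
public class CountingCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final Function<K, V> loader; // the expensive lookup being cached
    private long hits, misses;

    public CountingCache(Function<K, V> loader) { this.loader = loader; }

    public V get(K key) {
        V v = map.get(key);
        if (v != null) { hits++; return v; }
        misses++;
        v = loader.apply(key);
        map.put(key, v);
        return v;
    }

    // Hit rate over all lookups so far -- the statistic the text asks for.
    public double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```

If the hit rate stays low after a realistic warmup under production-like traffic, the caching layer is adding complexity without adding value.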

6. All apps need to be concerned about Stop-The-World

A fact of life of the Java platform is that all application threads must periodically stop to allow Garbage Collection to run. This is sometimes brandished as a serious weakness, even in the absence of any real evidence.

Empirical studies have shown that human beings cannot normally perceive changes in numeric data (e.g. price movements) occurring more frequently than once every 200ms. 

Consequently for applications that have a human as their primary user, a useful rule of thumb is that Stop-The-World (STW) pause of 200ms or under is usually of no concern. Some applications (e.g. streaming video) need lower GC jitter than this, but many GUI applications will not.

There are a minority of applications (such as low-latency trading, or mechanical control systems) for which a 200ms pause is unacceptable. Unless your application is in that minority it is unlikely your users will perceive any impact from the garbage collector.

It is also worth mentioning that in any system where there are more application threads than physical cores, the operating system scheduler will have to intervene to time-slice access to the CPUs. Stop-The-World sounds scary, but in practice, every application (whether JVM or not) has to deal with contended access to scarce compute resources.

Without measurement, it isn't clear that the JVM's approach has any meaningful additional impact on application performance.

In summary, determine whether pause times are actually affecting your application by turning on GC logs. Analyze the logs (either by hand, or with scripting or a tool) to determine the pause times. Then decide whether these really pose a problem for your application domain. Most importantly, ask yourself a most poignant question: have any users actually complained?
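
As a starting point for that log analysis, pause durations can be pulled out with a few lines of code. The sketch below assumes the HotSpot log line produced by the classic (pre-JDK 9) flags -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime; GC log formats vary considerably between JVM versions and flag combinations, so treat the pattern as an assumption to adapt:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract stop-the-world pause durations (in seconds) from
// HotSpot GC log lines of the form:
//   "Total time for which application threads were stopped: 0.0123 seconds"
public class PauseExtractor {
    private static final Pattern STOPPED = Pattern.compile(
            "threads were stopped: ([0-9.]+) seconds");

    static List<Double> pauses(List<String> logLines) {
        List<Double> result = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                result.add(Double.parseDouble(m.group(1)));
            }
        }
        return result;
    }
}
```

From there, compare the maximum and typical pauses against the 200ms rule of thumb for your application domain.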

7. Hand-rolled Object Pooling is appropriate for a wide range of apps

One common response to the feeling that Stop-The-World pauses are somehow bad is for application groups to invent their own memory management techniques within the Java heap. Often this boils down to implementing an object pooling (or even full-blown reference-counting) approach and requiring any code using the domain objects to participate.

This technique is almost always misguided. It often has its roots in the distant past, where object allocation was expensive and mutability was deemed inconsequential. The world is very different now.

Modern hardware is incredibly efficient at allocation; the bandwidth to memory is at least 2 to 3 GB per second on recent desktop or server hardware. This is a big number; outside of specialist use cases it is not that easy to make real applications saturate that much bandwidth.

Object pooling is generally difficult to implement correctly (especially when there are multiple threads at work) and has several negative requirements that render it a poor choice for general use:

  • All developers who touch the code must be aware of pooling and handle it correctly
  • The boundary between "pool-aware" and "non-pool-aware" code must be known and documented
  • All of this additional complexity must be kept up to date, and regularly reviewed
  • If any of this fails, the risk of silent corruption (similar to pointer re-use in C) is reintroduced

In summary, object pooling should only be used when GC pauses are unacceptable, and intelligent attempts at tuning and refactoring have been unable to reduce pauses to an acceptable level.
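
To make the burden concrete, here is a minimal sketch of a pool (purely illustrative, and deliberately not a recommendation). Even this tiny version pushes correctness obligations onto every caller, which is exactly the problem the bullet points describe:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

// Sketch: even a minimal thread-safe pool requires every caller to
// acquire, use, and ALWAYS release -- and never touch an object after
// releasing it. The pool cannot enforce either rule, so forgetting one
// reintroduces the silent-corruption risk listed above.
public class SimplePool<T> {
    private final ConcurrentLinkedQueue<T> free = new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;

    public SimplePool(Supplier<T> factory) { this.factory = factory; }

    public T acquire() {
        T obj = free.poll();
        return obj != null ? obj : factory.get(); // allocate if pool is empty
    }

    // Caller promises not to use 'obj' after this call; if another thread
    // acquires it while the old reference is still in use, state is
    // silently shared -- the pooling analogue of pointer re-use in C.
    public void release(T obj) {
        free.offer(obj);
    }
}
```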

8. CMS is always a better choice of GC than Parallel Old

By default, the Oracle JDK will use a parallel, stop-the-world collector for collecting the old generation.

An alternative choice is Concurrent-Mark-Sweep (CMS). This allows application threads to continue running throughout most of the GC cycle, but it comes at a price, and with quite a few caveats.

Allowing application threads to run alongside GC threads invariably results in application threads mutating the object graph in a way that would affect the liveness of objects. This has to be cleaned up after the fact, and so CMS actually has two (usually very short) STW phases.

This has several consequences:

  1. All application threads have to be brought to safe points and stopped twice per full collection;
  2. Whilst the collection is running concurrently, application throughput is reduced (usually by 50%);
  3. The overall amount of bookkeeping (and CPU cycles) in which the JVM engages to collect garbage via CMS is considerably higher than for parallel collection.

Depending on the application circumstances these prices may be worth paying or they may not. But there’s no such thing as a free lunch. The CMS collector is a remarkable piece of engineering, but it is not a panacea.

So before concluding that CMS is your correct GC strategy, you should first determine that STW pauses from Parallel Old are unacceptable and can't be tuned. And finally, (and I can’t stress this enough), be sure that all metrics are obtained on a production-equivalent system.

9. Increasing the heap size will solve your memory problem

When an application is in trouble and GC is suspected, many application groups will respond by just increasing the heap size. Under some circumstances, this can produce quick wins and allow time for a more considered fix. However, without a full understanding of the causes of the performance problem, this strategy can actually make matters worse.

Consider a badly coded application that is producing too many domain objects (with a typical lifespan of say two to three seconds). If the allocation rate is high enough, garbage collections could occur so rapidly that the domain objects are promoted into the tenured (old) generation. Once in tenured, the domain objects die almost immediately, but they would not be collected until the next full collection.

If this application has its heap size increased, then all we're really doing is adding space for relatively short-lived domain objects to be promoted into before they die. This can make Stop-The-World pauses longer for no benefit to the application.

Understanding the dynamics of object allocation and lifetime before changing heap size or tuning other parameters is essential. Acting without measuring can make matters worse. The tenuring distribution information from the garbage collector is especially important here.
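
The allocation pattern described above can be sketched in a few lines (the sizes and counts are illustrative). The flags in the comment are classic HotSpot options; exact names and output differ between JVM versions:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the pathology above: a flood of objects that each live just
// long enough to risk promotion before dying. Run a real application with
// e.g. -verbose:gc -XX:+PrintTenuringDistribution (pre-JDK-9 HotSpot) and
// inspect the object-age histogram before resizing the heap.
public class ShortLivedChurn {
    // Allocates 'iterations' buffers, keeping only a sliding window of
    // 'window' alive at a time; returns how many survive at the end.
    static int churn(int iterations, int window) {
        Deque<byte[]> recent = new ArrayDeque<>();
        for (int i = 0; i < iterations; i++) {
            recent.addLast(new byte[1024]);     // "domain object" allocation
            if (recent.size() > window) {
                recent.removeFirst();           // dies after a short window
            }
        }
        return recent.size();
    }

    public static void main(String[] args) {
        System.out.println("retained=" + churn(100_000, 1_000));
    }
}
```

If the tenuring distribution shows objects dying at low ages, a bigger heap only gives them more room to die in; tuning the young generation is the more promising lever.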
