
Summary:

Valgrind is one of the most misunderstood tools I know of in my community. Valgrind is not a leak checker. It has a leak checking tool. I'd argue that this tool happens to be the least useful component.

Without changing the way you invoke Valgrind, you get so much more useful information than most people realize. Valgrind finds latent bugs even when they don't cause your program to fail/crash; it doesn't just tell you where the bug happened, it tells you why it happened, in English. Valgrind is an undefined behavior checking tool first, a function and memory profiler second, a data-race detection tool third, and a leak checking tool last.

There's a reason why this is the first thing I tell students to do at office hours.

First things first:

To run valgrind, simply go to the directory where your program is and run:

valgrind ./myProgram myProgramsFirstArg myProgramsSecondArg

No special arguments.

You'll see both your program's output as well as the debugging output generated by Valgrind (which is prefixed with ==). The output is most helpful (and includes line numbers) if you compile your program with -g before running valgrind over the executable.

For the purposes of this article, please ignore all Valgrind output after the "HEAP SUMMARY" line. This is the part we don't care about: the memory leak summary.

What can it detect?:

1) Misuse of uninitialized values. At its most basic:

bool condition;
if (condition) {
  //Do the thing
}

This is a fun one. A lot of the time your code is just going to keep going and fail silently if you run this. It might even do exactly what you hoped it would do... most of the time. In a perfect world, when your code is wrong, it fails every time. Hard and fast errors, not silent, latent, and long-running. Knowing that there is a bug is the first step to fixing it. The problem here is that the bool has no value assigned to it. It is NOT automatically initialized to false (or true). Its value is whatever garbage happened to be in memory at that time.

The valgrind output for the example is of the form:

==2364== Conditional jump or move depends on uninitialised value(s)
==2364==    at 0x400916: main (test.cpp:106)

Notice: This tells us why the code exhibits undefined behavior, not just where. What's more, Valgrind catches it even if the undefined behavior wouldn't cause your program to crash.

I doubt something quite so obvious as the above example is written often, but it'd be much harder to see this mistake in code of the form:

bool condition;
if (foo) {
  condition = true;
}
if (bar) {
  condition = false;
}
if (baz) {
  condition = true;
}
if (condition) {
  //Do the thing
}

Here we initialize properly some of the time... but not all of the time. Valgrind still catches it if you have a test that exhibits the undefined behavior.

For what it's worth, you can use defensive coding practices to avoid this type of bug in the first place. Prefer to always initialize your variables with a value. Use the auto keyword to require that you do so (you cannot deduce a type without a value to deduce it from). Take a look at the articles on auto on Herb Sutter's blog to find out more.
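
To make that concrete, here's a minimal sketch of the earlier example written defensively. The function name shouldDoTheThing and its three parameters are placeholders of mine standing in for foo, bar, and baz; the only point is that auto won't compile without an initializer, so condition always starts with a known value.

```cpp
// Hypothetical rewrite of the example above; the names are illustrative only.
bool shouldDoTheThing(bool foo, bool bar, bool baz) {
  auto condition = false;  // 'auto condition;' would not compile: no initializer to deduce from
  if (foo) {
    condition = true;
  }
  if (bar) {
    condition = false;
  }
  if (baz) {
    condition = true;
  }
  return condition;  // always a well-defined value, even when no branch ran
}
```

With all three inputs false, the function returns the initial false instead of reading garbage, so there's nothing left for Valgrind to complain about.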

2) Accessing memory you shouldn't: touching memory that was never allocated, memory that's been freed, memory past the end of an allocation (so, off-by-one errors), and inaccessible parts of the stack.

An example:

  vector<int> v { 1, 2, 3, 4, 5 };
  v[5] = 0; //Oops

Do you see it?

If I run this code normally on my computer, it actually seems to run just fine. No crashes over 20 runs... but it's definitely wrong. Even if I did manage to have it open in GDB (another debugging tool) when it crashed, the best I'd get is a stack trace, and it might not be where the problem was caused, but rather, where it manifested, at the symptom, if you will.

Here's the corresponding Valgrind output:

==2710== Invalid write of size 4
==2710==    at 0x400961: foo() (test.cpp:85)
==2710==    by 0x4009A2: main (test.cpp:89)
==2710==  Address 0x5a1d054 is 0 bytes after a block of size 20 alloc'd
==2710==    at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2710==    by 0x400EDF: __gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*) (new_allocator.h:104)
==2710==    by 0x400DCE: std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long) (in /home/mark/test/a.out)
==2710==    by 0x400C5F: void std::vector<int, std::allocator<int> >::_M_range_initialize<int const*>(int const*, int const*, std::forward_iterator_tag) (stl_vector.h:1201)
==2710==    by 0x400AF4: std::vector<int, std::allocator<int> >::vector(std::initializer_list<int>, std::allocator<int> const&) (stl_vector.h:368)
==2710==    by 0x400943: foo() (test.cpp:84)
==2710==    by 0x4009A2: main (test.cpp:89)

That's a little unwieldy if you're not used to looking at stack traces through the STL. Let's break it down.

The first line tells you why your code exhibited undefined behavior. There was an "Invalid write of size 4". Size 4 means I wrote something 4 bytes big. On my machine, that's probably an int. Invalid write means that I touched memory I shouldn't have. As it happens, this was an off-by-one error: I wrote past the end of my vector.

Now let’s look at the 2nd and 3rd lines. These are Valgrind's best guess at the part of the stack trace that you care about. Indeed, in my case, foo is where the troubled code was, and main is the function that called foo.

The 4th line is more detail on the matter of "you ran off the end of the memory you were using".

And the rest is a more detailed stack trace that includes the STL. For what it's worth, the problem is never in the STL (ok, almost never).
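
If you'd rather have this class of mistake fail loudly even without Valgrind, one option is bounds-checked access. Here's a small sketch, assuming you can afford the check: std::vector::at throws std::out_of_range where operator[] silently invokes undefined behavior. The helper name writePastEndIsRejected is made up for the illustration.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns true if the out-of-range write was caught.
bool writePastEndIsRejected() {
  std::vector<int> v { 1, 2, 3, 4, 5 };
  try {
    v.at(5) = 0;  // valid indices are 0..4, so at() throws here
  } catch (const std::out_of_range&) {
    return true;  // the bug surfaced immediately instead of corrupting memory
  }
  return false;
}
```

This doesn't replace Valgrind (it only covers accesses that go through at()), but it turns one silent error into a hard and fast one.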

3) Misuse of std::memcpy, and of functions that build on top of it, where your source and destination arrays overlap (be sure to read my article about why std::memcpy is deprecated, then remember that you'll still invoke it under the hood of a better abstraction)

Not including an example on this error type or the next; I don't think they're especially common in modern code and if you do run into these, running Valgrind normally, without arguments, will expose both types of problems.

4) Invalid freeing of memory (minimal in modern code where you should be using smart pointers anyway)

5) Data races:

If I run:

  auto x = 0;
  thread([&] {
    ++x;
  }).detach();
  ++x;

with:

valgrind --tool=helgrind ./myProgram

I get some useful information:

==2872== Possible data race during read of size 4 at 0xFFEFFFE8C by thread #1
==2872== Locks held: none
==2872==    at 0x401081: main (test.cpp:96)
==2872== 
==2872== This conflicts with a previous write of size 4 by thread #2
==2872== Locks held: none
==2872==    at 0x40103A: main::{lambda()#1}::operator()() const (test.cpp:94)
==2872==    by 0x401F2D: void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (functional:1732)
==2872==    by 0x401E84: std::_Bind_simple<main::{lambda()#1} ()>::operator()() (functional:1720)
==2872==    by 0x401E1D: std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run() (thread:115)
==2872==    by 0x4EEEBEF: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.19)
==2872==    by 0x4C30E26: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==2872==    by 0x535F181: start_thread (pthread_create.c:312)
==2872==    by 0x566FEFC: clone (clone.S:111)

It tells me that I'm not protecting my data properly. I'm sharing data without synchronizing with a mutex. Bam.

I should mention that although this did find the bug in the code, it also included a ton of false positives on the std::shared_ptr used internally by std::thread. It seems they need to do a bit more work on that front. You could probably write a simple D or Python script to scrape helgrind output for only the useful bits.
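
For what it's worth, here's one way the racy snippet could be repaired; it's a sketch, not the only fix. It guards x with a std::mutex and joins the thread instead of detaching it, so that x is guaranteed to outlive the lambda that captures it by reference. The function name incrementTwiceSafely is hypothetical.

```cpp
#include <mutex>
#include <thread>

// Hypothetical repaired version of the snippet: both increments hold the same lock.
int incrementTwiceSafely() {
  auto x = 0;
  std::mutex m;
  std::thread worker([&] {
    std::lock_guard<std::mutex> lock(m);  // synchronize the worker's write
    ++x;
  });
  {
    std::lock_guard<std::mutex> lock(m);  // synchronize the main thread's write
    ++x;
  }
  worker.join();  // join rather than detach, so x and m stay alive while the worker runs
  return x;
}
```

For a single counter like this, std::atomic<int> would be a lighter alternative; the mutex version is shown because it matches the "Locks held: none" hint in the report.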

6) And yeah... it finds leaks, if you're still not using smart pointers.

Run:

valgrind --leak-check=full ./myProgram

(If you forget that flag, just run valgrind normally once; it'll remind you in the text in the summary area)

On:

auto x = new int(5);

And you'll see:

==2881== 4 bytes in 1 blocks are definitely lost in loss record 1 of 1
==2881==    at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2881==    by 0x400966: main (test.cpp:92)
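
And since I keep bringing up smart pointers, here's a minimal sketch of the same allocation with ownership attached. It assumes a C++14 compiler for std::make_unique; the function name readOwnedValue is just a placeholder.

```cpp
#include <memory>

// Hypothetical helper: the int is owned by a std::unique_ptr instead of a raw pointer.
int readOwnedValue() {
  auto x = std::make_unique<int>(5);  // requires C++14
  return *x;  // the value is copied out; x frees the allocation when it goes out of scope
}
```

There's no delete to forget here, so there's no "definitely lost" block for the leak checker to report.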

Valgrind as a function and memory profiler:

In addition to being able to tell you where you've introduced bugs into your program, Valgrind can also help you optimize. Too often people assume that they know what's eating up their runtime or what their big memory problems are... and they're wrong. Use your time wisely: measure!

Run your program with:

valgrind --tool=callgrind ./myProgram

And it'll spit out a file in the same directory whose name is something like callgrind.out.2887. Download the program KCachegrind to get a GUI visualization of the flow of your program, what functions are eating up your runtime, and generally, a better understanding of where to focus your efforts.

The simplest view lists the cost of each function: how many instructions it executed (by default Callgrind counts instructions rather than measuring wall time), the number of times it was called, and its percentage of the program's total cost. You can Google for some of the more interesting graphs/flow diagrams it generates.

Similarly, I can evaluate where I'm allocating the most memory by running with --tool=massif. This is often useful for leak checking as well, as larger parts of your memory footprint may be indicative of leaks.

Conclusions:

Valgrind is much more than a leak checking tool. Change your perspective: Valgrind is an undefined behavior killer.

Valgrind should be your tool of first resort. It not only tells you where your bugs are happening, but why, and it'll tell you this even if your program doesn't crash (unlike GDB on both counts). For what it's worth, GDB is still a very useful tool for getting full stack traces on failed assertions and for debugging concurrent code, among other things.

You may also find it useful to always compile with -pedantic -Wall -Wextra. Your compiler is often smart enough to flag undefined behavior as well. What the compiler misses, Valgrind should catch.

If this interests you, you may want to take a look at some other tools that perform similar duties, often with less of a runtime hit:
Address Sanitizer for clang and g++
Undefined Behavior Sanitizer for clang and g++
Memory Sanitizer for clang
Thread Sanitizer for clang and g++
