Translated on 2017/06/21 16:56
Post-mortem diagnostics & debugging comes into the picture when you want to figure out what went wrong with your Node.js application in production.
In this chapter of Node.js at Scale we will take a look at node-report, a core project which aims to help you to do post-mortem diagnostics & debugging.
All chapters of Node.js at Scale:
Using npm
Node.js Internals Deep Dive
Building with Node.js
Testing + Node
Node.js in Production
Node.js Post-Mortem Diagnostics & Debugging [this article]
Node.js + Microservices
Request Signing
Distributed Tracing
API Gateways
node-report diagnostics module
The purpose of the module is to produce a human-readable diagnostics summary file. It is meant to be used in both development and production environments.
The generated report includes:
JavaScript and native stack traces,
heap statistics,
system information,
resource usage,
loaded libraries.
Currently node-report supports Node.js v4, v6, and v7 on AIX, Linux, MacOS, SmartOS, and Windows.
Adding it to your project just takes an npm install and require:
npm install node-report --save
// index.js
require('node-report')
Once you add node-report to your application, it will automatically listen for unhandled exceptions and fatal error events, and trigger report generation when one occurs. Report generation can also be triggered by sending a USR2 signal to the Node.js process.
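To make the idea of an automatic trigger more concrete, here is a rough sketch that uses only Node's built-in process events. It is not how node-report is implemented, and the names buildMiniReport and installMiniReportTrigger are hypothetical; the sketch only shows the general shape of "listen for an uncaught exception, then write a summary".

```javascript
const os = require('os');

// Hypothetical helper: builds a tiny plain-text summary for a given event.
// The real node-report output is far richer (native stacks, libraries, etc.).
function buildMiniReport(eventName, error) {
  const lines = [
    'Event: ' + eventName,
    'Process ID: ' + process.pid,
    'Node.js version: ' + process.version,
    'OS: ' + os.type() + ' ' + os.release(),
    'Heap used: ' + process.memoryUsage().heapUsed + ' bytes'
  ];
  if (error && error.stack) {
    lines.push('JavaScript stack:', error.stack);
  }
  return lines.join('\n');
}

// Hypothetical wiring: call once at startup. `write` is any function that
// persists the text, for example a synchronous file write.
function installMiniReportTrigger(write) {
  process.on('uncaughtException', (err) => {
    write(buildMiniReport('exception', err));
    process.exit(1); // the process state is unknown after an uncaught exception
  });
}

// Example (not executed here):
// installMiniReportTrigger((text) => require('fs').writeFileSync('mini-report.txt', text));
```

A handler written in JavaScript like this one only runs when the event loop is free, which is one reason a dedicated module is useful for the busy-process case discussed later.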
node-report
For the sake of simplicity, imagine you have the following endpoint in one of your applications:
function myListener(request, response) {
  switch (request.url) {
    case '/exception':
      throw new Error('*** exception.js: uncaught exception thrown from function myListener()');
  }
}
This code simply throws an exception once the /exception route handler is called. To make sure we get the diagnostics information, we have to add the node-report module to our application, as shown previously.
require('node-report')

function myListener(request, response) {
  switch (request.url) {
    case '/exception':
      throw new Error('*** exception.js: uncaught exception thrown from function myListener()');
  }
}
Let's see what happens once the endpoint gets called! Our report just got written into a file:
Writing Node.js report to file: node-report.20170506.100759.20988.001.txt
Node.js report completed
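The file name in this output appears to encode when and where the report was written. The sketch below splits such a name into its parts. The layout (date, time, process ID, then a sequence number) is inferred from this one example, and parseReportFileName is a hypothetical helper, not part of node-report.

```javascript
// Hypothetical helper: splits a report file name of the form
//   node-report.<YYYYMMDD>.<HHMMSS>.<pid>.<sequence>.txt
// into its components. Returns null when the name does not match.
function parseReportFileName(fileName) {
  const match = /^node-report\.(\d{8})\.(\d{6})\.(\d+)\.(\d+)\.txt$/.exec(fileName);
  if (!match) {
    return null;
  }
  return {
    date: match[1],              // e.g. '20170506'
    time: match[2],              // e.g. '100759'
    pid: Number(match[3]),       // process that wrote the report
    sequence: Number(match[4])   // assumed: counter for reports from one process
  };
}
```

Something like this can be handy when a directory collects reports from many processes and you want to group them by process ID.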
Once you open the file, you'll get something like this:
=================== Node Report ===================
Event: exception, location: "OnUncaughtException"
Filename: node-report.20170506.100759.20988.001.txt
Dump event time: 2017/05/06 10:07:59
Module load time: 2017/05/06 10:07:53
Process ID: 20988
Command line: node demo/exception.js
Node.js version: v6.10.0 (ares: 1.10.1-DEV, http_parser: 2.7.0, icu: 58.2, modules: 48, openssl: 1.0.2k, uv: 1.9.1, v8: 5.1.281.93, zlib: 1.2.8)
node-report version: 2.1.2 (built against Node.js v6.10.0, 64 bit)
OS version: Darwin 16.4.0 Darwin Kernel Version 16.4.0: Thu Dec 22 22:53:21 PST 2016; root:xnu-3789.41.3~3/RELEASE_X86_64
Machine: Gergelys-MacBook-Pro.local x86_64
You can think of this part as a header for your diagnostics summary. It includes:
the main event that caused the report to be created,
how the Node.js application was started (node demo/exception.js),
what Node.js version was used,
the host operating system,
and the version of node-report itself.
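If you collect many reports, it can help to pull these header fields out programmatically. The following is a minimal sketch that assumes the header consists of "Key: value" lines and ends at the next "====" section banner, as in the example above; parseReportHeader is a hypothetical helper and the real report layout may differ between node-report versions.

```javascript
// Hypothetical helper: reads "Key: value" lines from the first section of a
// report and returns them as a plain object. Parsing stops at the second
// "====" banner, which is assumed to start the next section.
function parseReportHeader(reportText) {
  const fields = {};
  let bannersSeen = 0;
  for (const line of reportText.split('\n')) {
    if (line.startsWith('====')) {
      bannersSeen += 1;
      if (bannersSeen > 1) {
        break; // reached the section after the header
      }
      continue;
    }
    const separator = line.indexOf(': ');
    if (separator > 0) {
      // Split on the first ': ' only, so values may contain further colons.
      fields[line.slice(0, separator).trim()] = line.slice(separator + 2).trim();
    }
  }
  return fields;
}
```

With the header shown above, the 'Process ID' and 'Command line' entries would be the quickest way to tell which process and entry point a given report belongs to.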
The next part of the report includes the captured stack traces, both for JavaScript and the native part:
=================== JavaScript Stack Trace ===================
Server.myListener (/Users/gergelyke/Development/risingstack/node-report/demo/exception.js:19:5)
emitTwo (events.js:106:13)
Server.emit (events.js:191:7)
HTTPParser.parserOnIncoming [as onIncoming] (_http_server.js:546:12)
HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
In the JavaScript part, you can see:
the stack trace (which function called which one, with line numbers),
and where the exception occurred.
In the native part, you can see the same thing, just on a lower level, in the native code of Node.js.
=================== Native Stack Trace ===================
 0: [pc=0x103c0bd50] nodereport::OnUncaughtException(v8::Isolate*) [/Users/gergelyke/Development/risingstack/node-report/api.node]
 1: [pc=0x10057d1c2] v8::internal::Isolate::Throw(v8::internal::Object*, v8::internal::MessageLocation*) [/Users/gergelyke/.nvm/versions/node/v6.10.0/bin/node]
 2: [pc=0x100708691] v8::internal::Runtime_Throw(int, v8::internal::Object**, v8::internal::Isolate*) [/Users/gergelyke/.nvm/versions/node/v6.10.0/bin/node]
 3: [pc=0x3b67f8092a7]
 4: [pc=0x3b67f99ab41]
 5: [pc=0x3b67f921533]
In the heap metrics, you can see the state of each heap space at the time the report was created:
new space,
old space,
code space,
map space,
large object space.
These metrics include:
memory size,
committed memory size,
capacity,
used size,
available size.
To better understand how memory handling in Node.js works, check out the following articles:
=================== JavaScript Heap and GC ===================
Heap space name: new_space
    Memory size: 2,097,152 bytes, committed memory: 2,097,152 bytes
    Capacity: 1,031,680 bytes, used: 530,736 bytes, available: 500,944 bytes
Heap space name: old_space
    Memory size: 3,100,672 bytes, committed memory: 3,100,672 bytes
    Capacity: 2,494,136 bytes, used: 2,492,728 bytes, available: 1,408 bytes
Total heap memory size: 8,425,472 bytes
Total heap committed memory: 8,425,472 bytes
Total used heap memory: 4,283,264 bytes
Total available heap memory: 1,489,426,608 bytes
Heap memory limit: 1,501,560,832
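Similar per-space numbers can also be inspected from inside a running process with Node's built-in v8 module. The sketch below uses v8.getHeapSpaceStatistics(); the heapSpaceSummary wrapper and its field names are illustrative only, and the fields do not map one-to-one onto the columns of the report.

```javascript
const v8 = require('v8');

// Illustrative wrapper around v8.getHeapSpaceStatistics(): returns one entry
// per heap space with sizes in bytes.
function heapSpaceSummary() {
  return v8.getHeapSpaceStatistics().map((space) => ({
    name: space.space_name,                 // e.g. 'new_space', 'old_space'
    size: space.space_size,                 // memory reserved for the space
    used: space.space_used_size,            // bytes currently in use
    available: space.space_available_size,  // bytes still free in the space
    physical: space.physical_space_size     // bytes actually committed
  }));
}

// Example: print a line per space.
// heapSpaceSummary().forEach((s) => console.log(s.name, s.used, '/', s.size));
```

The difference is timing: this reads the heap of a live process, while the report captures the heap at the moment something went wrong.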
The resource usage section includes metrics on:
CPU usage,
the resident set size,
page faults,
and file system activity.
=================== Resource usage ===================
Process total resource usage:
  User mode CPU: 0.119704 secs
  Kernel mode CPU: 0.020466 secs
  Average CPU Consumption : 2.33617%
  Maximum resident set size: 21,965,570,048 bytes
  Page faults: 13 (I/O required) 5461 (no I/O required)
  Filesystem activity: 0 reads 3 writes
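For comparison, part of this information is available at runtime through Node's own process API. The sketch below reads CPU time via process.cpuUsage() and the resident set size via process.memoryUsage(); resourceSnapshot is a hypothetical helper. Page faults and file system activity are not covered here (newer Node.js releases expose them through process.resourceUsage(), which the Node.js versions mentioned in this article do not have).

```javascript
// Hypothetical helper: a small runtime counterpart to the "Resource usage"
// section. process.cpuUsage() reports microseconds, so values are converted
// to seconds to match the units used in the report.
function resourceSnapshot() {
  const cpu = process.cpuUsage();
  return {
    userCpuSecs: cpu.user / 1e6,
    kernelCpuSecs: cpu.system / 1e6,
    residentSetBytes: process.memoryUsage().rss // current RSS, not the maximum
  };
}

// Example:
// console.log(resourceSnapshot());
```

Note that the report lists the maximum resident set size, while memoryUsage().rss is the current value, so the two numbers are not directly comparable.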
The system information section includes:
environment variables,
resource limits (like open files, CPU time or max memory size),
and loaded libraries.
The node-report module can also help when you hit a fatal error, such as your application running out of memory.
By default, you will get an error message like this:
<--- Last few GCs --->
23249 ms: Mark-sweep 1380.3 (1420.7) -> 1380.3 (1435.7) MB, 695.6 / 0.0 ms [allocation failure] [scavenge might not succeed].
24227 ms: Mark-sweep 1394.8 (1435.7) -> 1394.8 (1435.7) MB, 953.4 / 0.0 ms (+ 8.3 ms in 231 steps since start of marking, biggest step 1.2 ms) [allocation failure] [scavenge might not succeed].
On its own, this information is not that helpful. You don't know the context or the state of the application. With node-report, it gets better.
First of all, in the generated post-mortem diagnostics summary you will have a more descriptive event:
Event: Allocation failed - JavaScript heap out of memory, location: "MarkCompactCollector: semi-space copy, fallback in old gen"
Secondly, you will get the native stack trace, which can help you better understand why the allocation failed.
Imagine you have the following loops which block your event loop. This is a performance nightmare.
// MyRecord is assumed to be a constructor defined elsewhere, with numeric id and account fields
var list = []
for (let i = 0; i < 10000000000; i++) {
  for (let j = 0; j < 1000; j++) {
    list.push(new MyRecord())
  }
  for (let j = 0; j < 1000; j++) {
    list[j].id += 1
    list[j].account += 2
  }
  for (let j = 0; j < 1000; j++) {
    list.pop()
  }
}
With node-report you can request reports even when your process is busy, by sending the USR2 signal. Once you do that you will receive the stack trace, and you will quickly see where your application spends its time.
(Examples are taken from the node-report repository)
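As an illustration that is not taken from the repository, the signal can be sent from a second Node.js process with the built-in process.kill(). The requestReport helper below is hypothetical. It assumes the target process has node-report loaded and runs on a platform with POSIX signals; a process without a USR2 handler is terminated by this signal by default, so only send it to processes you know are listening.

```javascript
// Hypothetical helper: asks another process to write a report by sending
// SIGUSR2. Returns true if the signal was delivered, false if no process
// with that pid exists.
function requestReport(pid) {
  if (!Number.isInteger(pid) || pid <= 0) {
    throw new TypeError('pid must be a positive integer');
  }
  try {
    process.kill(pid, 'SIGUSR2'); // despite the name, this only sends a signal
    return true;
  } catch (err) {
    if (err.code === 'ESRCH') {
      return false; // no such process
    }
    throw err; // e.g. EPERM: not allowed to signal that process
  }
}

// Example (20988 is the process ID from the report header shown earlier):
// requestReport(20988);
```

From a shell, the usual equivalent is the kill command with the USR2 signal and the process ID.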
node-report
The creation of the report can also be triggered using the JavaScript API. This way your report will be saved in a file, just like when it was triggered automatically.
const nodeReport = require('node-report')
nodeReport.triggerReport()
Using the JavaScript API, the report can also be retrieved as a string.
const nodeReport = require('node-report')
const report = nodeReport.getReport()
If you don't want to use the automatic triggers (like the fatal error or the uncaught exception), you can opt out of them by requiring the API itself. The file name can be specified as well:
const nodeReport = require('node-report/api')
nodeReport.triggerReport('name-of-the-report')
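One way to use the manual API is to write a report only when some condition of your own is met. The sketch below triggers a report when the used heap exceeds a limit. The reportIfHeapAbove helper, the threshold, and the report name are all hypothetical; the only node-report call it relies on is triggerReport(name) as shown above, and the API object is passed in so the helper can be exercised without the module installed.

```javascript
// Hypothetical helper: triggers a report through the given API object when
// the used JavaScript heap is above limitBytes. Returns whether a report
// was requested.
function reportIfHeapAbove(api, limitBytes, reportName) {
  const heapUsed = process.memoryUsage().heapUsed;
  if (heapUsed <= limitBytes) {
    return false;
  }
  api.triggerReport(reportName);
  return true;
}

// Example wiring (assumes node-report is installed; values are illustrative):
// const nodeReport = require('node-report/api');
// setInterval(() => {
//   reportIfHeapAbove(nodeReport, 1024 * 1024 * 1024, 'high-heap-report');
// }, 60 * 1000);
```

Because the example passes a fixed name, a later report would presumably replace an earlier one; include a timestamp in the name if you want to keep each report.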
If you feel like making Node.js even better, please consider joining the Postmortem Diagnostics working group, where you can contribute to the module.
The Postmortem Diagnostics working group is dedicated to the support and improvement of postmortem debugging for Node.js. It seeks to elevate the role of postmortem debugging for Node, to assist in the development of techniques and tools, and to make techniques and tools known and available to Node.js users.
In the next chapter of the Node.js at Scale series, we will discuss profiling Node.js Applications. If you have any questions, please let me know in the comments section below.