How many times have you found yourself viewing a stack trace in your terminal or inside your monitoring systems and been unable to understand anything from it? If the answer is ‘a lot’, then this blog post is for you. If you do not suffer from this problem often, you might still find this interesting.

When dealing with complex flows in a Node.js server, the ability to get the most out of the errors it returns to the requesting party is vital. The problem starts when an error created while handling a request causes another error to be created higher up the chain. When this happens, once you generate the new error and pass it up the chain, you lose all relation to the original error.
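
To make the problem concrete, here is a minimal sketch in plain Node.js. The function names `readUser` and `handleRequest` are hypothetical and exist only for illustration. The handler catches a low-level error and throws a new one, and the caller is left with no trace of the original failure.

```javascript
// Hypothetical low-level helper: fails the way a database driver might.
function readUser(id) {
    throw new Error('connection refused while reading user ' + id);
}

// Hypothetical handler: catches the low-level error and throws a new one.
function handleRequest(id) {
    try {
        return readUser(id);
    } catch (err) {
        // The original `err` is discarded here, together with its stack.
        throw new Error('Could not handle request for user ' + id);
    }
}

var finalError;
try {
    handleRequest('coolId');
} catch (err) {
    finalError = err;
}

// Only the newer error is visible; nothing points back to the original.
console.log(finalError.message);
console.log(finalError.stack.indexOf('connection refused') !== -1); // false
```

Whoever reads the log sees that the request failed, but not why.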

At Codefresh, we spent a lot of time trying to find the best patterns for dealing with these scenarios. What we really wanted was the ability to create an error that can be chained to a previous error and to aggregate information from the whole chain. We also wanted the interface for doing this to be very simple, yet extendable for future enhancements.

We searched for existing modules that could support our needs. The only module we found that answered some of our requirements was WError.

‘WError’ provides you with the ability to wrap existing errors with new ones. The interface is very cool and simple, so we gave it a go. After a period of intensive use, we came to the conclusion that it was not enough:

  • The stack trace of an error would not cover the whole chain; it would only show the stack trace of the topmost error that was generated.

  • It lacked the ability to easily create your own types of errors.

  • Extending the errors with additional behavior would require extending WError’s own code.

Introducing CFError

With our extensive experience, we assembled an error module that answers all our requirements. You can find all the information and documentation here: http://codefresh-io.github.io/cf-errors.

Let’s see how you can use CFError with a real example using Express. We will create an Express app that will handle a single request to a specific route. This request will handle a query for getting a single user from a Mongo database. We will define a single route and an additional function that will be in charge of actually retrieving the user from the DB.

var CFError    = require('cf-errors');
var Errors     = CFError.Errors;
var Q          = require('q');
var express    = require('express');

var UserNotFoundError = {
    name: "UserNotFoundError"
};

var app = express();

app.get('/user/:id', function (request, response, next) {
    var userId = request.params.id;
    if (userId !== "coolId") {
        return next(new CFError(Errors.Http.BadRequest, {
            message: "Id must be coolId.",
            internalCode: "04001", // string: a number literal with a leading zero is parsed as octal
            recognized: true
        }));
    }

    findUserById(userId)
        .done((user) => {
            response.send(user);
        }, (err) => {
            if (err.name === UserNotFoundError.name) {
                next(new CFError(Errors.Http.NotFound, {
                    internalCode: "04041", // string: a number literal with a leading zero is parsed as octal
                    cause: err,
                    message: `User ${userId} could not be found`,
                    recognized: true
                }));
            }
            else {
                next(new CFError(Errors.Http.InternalServer, {
                    internalCode: "05001", // string: a number literal with a leading zero is parsed as octal
                    cause: err
                }));
            }
        });
});

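// `User` is assumed to be a Mongoose model defined elsewhere in the app.
var findUserById = function (userId) {
    // Wrap the Mongoose promise in Q so the caller can use `.done()`.
    return Q(User.findOne({_id: userId}).exec())
        .then((user) => {
            if (user) {
                return user;
            }
            else {
                return Q.reject(new CFError(UserNotFoundError, `Failed to retrieve user: ${userId}`));
            }
        });
};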

A few things to notice:

  • When creating an error, you have the ability to provide predefined HTTP errors that you can then extend.

  • You can add a ‘cause’ property when creating an error that will chain a previous error to the new one. When printing the stack of an error you will receive the full stack trace of the whole chain printed in a readable manner.

  • You can add any additional fields you want to the error object. We will explain the use of ‘internalCode’ and ‘recognized’ later.

  • You have the ability to define your error objects outside of your code and then just reference them when creating an error.
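
To illustrate the idea behind the ‘cause’ property, here is a minimal, dependency-free sketch. `ChainedError` and `fullStack` are hypothetical names invented for this example; this is not how cf-errors is implemented, only one way a chain of causes could be turned into a single readable trace.

```javascript
// Hypothetical class for illustration only, NOT the cf-errors implementation.
class ChainedError extends Error {
    constructor(message, cause) {
        super(message);
        this.cause = cause; // the previous error in the chain, if any
    }

    // Concatenates this error's stack with the stacks of all its causes.
    fullStack() {
        var text = this.stack;
        var current = this.cause;
        while (current) {
            text += '\nCaused by: ' + (current.stack || String(current));
            current = current.cause;
        }
        return text;
    }
}

var dbError = new Error('connection refused');
var lookupError = new ChainedError('Failed to retrieve user: coolId', dbError);
var httpError = new ChainedError('User coolId could not be found', lookupError);

// Prints three stack traces, from the topmost error down to the root cause.
console.log(httpError.fullStack());
```

The topmost error stays small, while the full history remains one property lookup away.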

Let’s go ahead and add an error middleware to our Express app.

app.use(function (err, request, response, next) {
    var error;
    if (!(err instanceof CFError)){
        error = new CFError(Errors.Http.InternalServer, {
            cause: err
        }); 
    }
    else {
        if (!err.statusCode){
            error = new CFError(Errors.Http.InternalServer, {
                cause: err
            });
        }
        else {
            error = err;
        }
    }

    console.error(error.stack);
    return response.status(error.statusCode).send(error.message);
});

A few things to notice:

  • We make sure that the final error that is printed to the log and returned to the user is always a ‘CFError’ object. This will allow you to add additional logic to the error middleware.

  • All predefined HTTP errors have a built-in ‘statusCode’ property and a ‘message’ property already populated for your use.

  • Extending your errors will allow you to have all the error handling logic inside one place. You will not need to worry about having to print every error object when it is created, rather you print the stack trace only once and get the whole execution flow and context.

Let’s now change the way we return errors to our clients and return an object instead of just the top level error message.

return response.status(error.statusCode).send({
    message: error.message,
    statusCode: error.statusCode,
    internalCode: error.internalCode
});

Great! We now have a unified process of returning errors to our clients.
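
The decision logic of the middleware can also be pulled into small functions that are easy to test without Express. The sketch below is an assumption-laden illustration: `AppError` is a hypothetical stand-in for CFError so that the code runs without dependencies, and `normalizeError` and `toResponseBody` are names made up for this example.

```javascript
// Hypothetical stand-in for CFError so the sketch runs without dependencies.
class AppError extends Error {
    constructor(message, fields) {
        super(message);
        Object.assign(this, fields); // e.g. statusCode, internalCode, cause
    }
}

// Always returns an AppError that carries a statusCode.
function normalizeError(err) {
    if (err instanceof AppError && err.statusCode) {
        return err;
    }
    // Anything else is wrapped, keeping the original error as the cause.
    return new AppError('Internal Server Error', { statusCode: 500, cause: err });
}

// Builds the object that is sent back to the client.
function toResponseBody(error) {
    return {
        message: error.message,
        statusCode: error.statusCode,
        internalCode: error.internalCode
    };
}

var notFound = new AppError('User coolId could not be found', {
    statusCode: 404,
    internalCode: '04041'
});
var wrapped = normalizeError(new Error('boom'));

console.log(toResponseBody(normalizeError(notFound)));
console.log(toResponseBody(wrapped));
```

An error that already has a status code passes through untouched, and everything else becomes a 500 that still remembers its cause.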

Reporting Errors to Monitoring Systems

At Codefresh, we use New Relic as our APM monitoring system. We noticed that the errors we generated and reported to New Relic fell into two groups. The first consisted of errors caused by unexpected behavior of our servers, such as thrown exceptions. The second (business exceptions) consisted of errors that our servers generated deliberately, as the result of correctly analyzing and handling a request. Reporting the second type of error to New Relic made our Apdex score drop in unpredictable ways, which triggered false-positive alarms from our alerting systems.

So we came up with a new convention. Whenever we conclude that a generated error is a result of correct behavior of our system, we construct an error and attach an additional field named ‘recognized’ to it. We wanted the ability to put the ‘recognized’ field on a specific error in the chain, but still be able to get its value even if higher errors did not contain this field. We exposed a function on the CFError object named ‘getFirstValue’ which will retrieve the first value it encounters in the whole chain. Let’s see how we use this in Codefresh.

app.use(function (err, request, response, next) {
    var error;
    if (!(err instanceof CFError)){
        error = new CFError(Errors.Http.InternalServer, {
            cause: err
        });
    }
    else {
        if (!err.statusCode){
            error = new CFError(Errors.Http.InternalServer, {
                cause: err
            });
        }
        else {
            error = err;
        }
    }

    // `nr` is assumed to be the New Relic agent: var nr = require('newrelic');
    if (!error.getFirstValue('recognized')){
        nr.noticeError(error); // report to monitoring systems (New Relic in our case)
    }

    console.error(error.stack);
    return response.status(error.statusCode).send({
        message: error.message,
        statusCode: error.statusCode,
        internalCode: error.internalCode
    });
});
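
For intuition, a chain lookup in the spirit of ‘getFirstValue’ could look like the sketch below. This is a hypothetical standalone helper written for this example, not the cf-errors source, and it assumes that each error links to the previous one through a `cause` property.

```javascript
// Hypothetical helper, not the cf-errors source: walks the `cause` chain from
// the topmost error down and returns the first defined value of `field`.
function getFirstValue(error, field) {
    var current = error;
    while (current) {
        if (current[field] !== undefined) {
            return current[field];
        }
        current = current.cause;
    }
    return undefined;
}

// The lower error is marked as recognized; the error on top of it is not.
var notFound = new Error('User not found');
notFound.recognized = true;

var top = new Error('Request failed');
top.cause = notFound;

console.log(getFirstValue(top, 'recognized'));   // true
console.log(getFirstValue(top, 'internalCode')); // undefined
```

Because the lookup starts at the top, a field set on a higher error takes precedence over the same field set deeper in the chain.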

A few things to notice:

  • Because we already know we are only dealing with CFError objects, we only had to add two lines of code to support this.

  • Since we are explicitly deciding which errors we actually want to send, if you are using New Relic you will need to manually disable the automatic sending of all errors. Currently, in order to achieve this, you will need to manually add all HTTP error status codes to the ‘ignore_status_codes’ field inside the ‘newrelic.js’ config file. We have already opened a ticket for the New Relic support team to provide an easier way to do this.

exports.config = {
  error_collector: {
    ignore_status_codes: [400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 421, 422, 423, 426, 428, 429, 431, 451, 500, 501, 502, 503, 504, 505, 506, 507, 508, 510, 511]
  }
};
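
If maintaining that list by hand feels error-prone, one possible alternative is to derive it from the status codes that Node.js itself knows about. The sketch below is only a suggestion: the variable names are made up, and the generated list may not match the hand-written one exactly, since the contents of `http.STATUS_CODES` depend on the Node.js version.

```javascript
var http = require('http');

// Collect every status code Node.js knows about in the 4xx and 5xx ranges.
var errorStatusCodes = Object.keys(http.STATUS_CODES)
    .map(Number)
    .filter(function (code) { return code >= 400; })
    .sort(function (a, b) { return a - b; });

// Same shape as the `error_collector` section of the config above.
var errorCollector = {
    ignore_status_codes: errorStatusCodes
};

console.log(errorCollector.ignore_status_codes);
```

In a real ‘newrelic.js’ file, the resulting object would take the place of the literal ‘error_collector’ value.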

In Conclusion

Getting the best out of your errors requires not only a good error module, but also well-defined processes for when, where and how you handle them. You will need to follow your defined patterns, otherwise things tend to get messy.

Reporting only real errors to a monitoring system is vital for your ability as a company to detect and solve problems after they have occurred.
