Apache Nutch 1.14 Released: Web Crawler

Source: Reader submission
Author: 达尔文
2017-12-27

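Apache Nutch 1.14 has been released. Nutch is a mature, production-ready web crawler. Nutch 1.x allows fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.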

Changes:

Bug fixes

  • [NUTCH-2071] - A parser failure on a single document may fail crawling job

  • [NUTCH-2235] - Classpath discrepancy with protocol-selenium in deploy mode

  • [NUTCH-2269] - Clean not working after crawl

  • [NUTCH-2295] - Nutch master docker container broken

  • [NUTCH-2297] - CrawlDbReader -stats wrong values for earliest fetch time and shortest interval

  • [NUTCH-2316] - Library conflict with Parser-Tika Plugin and Lib Folder

Improvements

  • [NUTCH-1763] - Improving comments on the Injector Class

  • [NUTCH-2034] - CrawlDB filtered documents counter.

  • [NUTCH-2035] - Regex filter using case sensitive rules.

  • [NUTCH-2046] - The crawl script should be able to skip an initial injection.

  • [NUTCH-2135] - Ant Eclipse build does not include protocol-interactiveselenium

  • [NUTCH-2193] - Upgrade feed parser plugin to use rome 1.5

For the full list of changes, see the release notes.

Download:
