Apache Nutch 1.13 Released, Web Crawler

王练
Published on April 3, 2017

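The Apache Nutch Project Management Committee has announced the release of Apache Nutch 1.13. All current users and developers of the 1.X series are advised to upgrade to this release.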

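Nutch is a mature, production-ready Web crawler. Nutch 1.x enables fine-grained configuration and relies on Apache Hadoop™ data structures, which are well suited to batch processing.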

Changes:

Sub-task

  • [NUTCH-2246] - Refactor /seed endpoint for backward compatibility

Bug

  • [NUTCH-1553] - Property 'indexer.delete.robots.noindex' not working when using parser-html.

  • [NUTCH-2242] - lastModified not always set

  • [NUTCH-2291] - Fix mrunit dependencies

  • [NUTCH-2337] - urlnormalizer-basic to strip empty port

  • [NUTCH-2345] - FetchItemQueue logs are logged with wrong class name

  • [NUTCH-2349] - urlnormalizer-basic NPE for ill-formed URL "http:/"

  • [NUTCH-2357] - Index metadata throw Exception because writable object cannot be cast to Text

  • [NUTCH-2359] - Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

  • [NUTCH-2364] - http.agent.rotate: IllegalArgumentException / last element of agent names ignored

  • [NUTCH-2366] - Deprecated Job constructor in hostdb/ReadHostDb.java
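
To make the NUTCH-2337 entry above concrete: an "empty port" is a URL such as `http://example.com:/path`, where the colon is present but no port number follows. The Python sketch below only illustrates that kind of normalization. It is not the code of Nutch's urlnormalizer-basic plugin, and the function name `strip_empty_port` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_empty_port(url: str) -> str:
    """Drop a trailing ':' from the authority when no port number follows it.

    Illustrative only: a hypothetical helper, not Nutch's implementation.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    # "example.com:" has a port delimiter but no port; remove the delimiter.
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(strip_empty_port("http://example.com:/path"))
    print(strip_empty_port("http://example.com:8080/path"))
```

URLs that carry an explicit port number, or no port delimiter at all, pass through unchanged.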

Improvement

  • [NUTCH-1308] - Add main() to ZipParser

  • [NUTCH-2164] - Inconsistent 'Modified Time' in crawl db

  • [NUTCH-2234] - Upgrade to elasticsearch 2.3.3

  • [NUTCH-2236] - Upgrade to Hadoop 2.7.2

  • [NUTCH-2262] - Utilize parameterized logging notation across Fetcher

  • [NUTCH-2272] - Index checker server to optionally keep client connection open

  • [NUTCH-2286] - CrawlDbReader -stats to show fetch time and interval

  • [NUTCH-2287] - Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy

  • [NUTCH-2299] - Remove obsolete properties protocol.plugin.check.*

  • [NUTCH-2300] - Fetcher to optionally save robots.txt

  • [NUTCH-2327] - Seeds injected in REST workflow must be ingested into HDFS

  • [NUTCH-2329] - Update Slf4j logging for Java 8 and upgrade miredot plugin version

  • [NUTCH-2336] - SegmentReader to implement Tool

  • [NUTCH-2352] - Log with Generic Class Name at Nutch 1.x

  • [NUTCH-2355] - Protocol plugins to set cookie if Cookie metadata field is present

  • [NUTCH-2367] - Get single record from HostDB

New Feature

  • [NUTCH-2132] - Publisher/Subscriber model for Nutch to emit events

Task

Download:

http://nutch.apache.org/downloads.html


Latest comments (1)

jungggle
Who is using this? Please raise your hand 😊