Apache Nutch 2.2 Released: a Java Search Engine

Source: OSCHINA
Editor: oschina
2013-06-09

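Apache Nutch 2.2 has been released. Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.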

The new version contains many improvements. The detailed list follows:

* NUTCH-1576 Need to keep hotStore.flush() exception catching (James Sullivan via lewismc)
* NUTCH-1577 Add target for creating eclipse project (tejasp via lewismc)
* NUTCH-1545 capture batchId and remove references to segments in 2.x crawl script. (Feng)
* NUTCH-1575 support solr authentication in nutch 2.x (Feng)
* NUTCH-1569 Upgrade 2.x to Gora 0.3 (lewismc)
* NUTCH-1243 Junit jar removed from lib (lewismc)
* NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac -Xlint argument (tejasp)
* NUTCH-1513 Support Robots.txt for Ftp urls (tejasp)
* NUTCH-1053 Parsing of RSS feeds fails (tejasp)
* NUTCH-1563 FetchSchedule#getFields is never used by GeneratorJob (Feng)
* NUTCH-1573 Upgrade to most recent JUnit 4.x to improve test flexibility (lewismc)
* Added crawler-commons dependency in pom.xml (tejasp)
* NUTCH-956 solrindex issues: add field tld to Solr schema (Alexis via lewismc, snagel)
* NUTCH-1277 Fix [fallthrough] javac warnings (tejasp)
* NUTCH-1514 Phase out the deprecated configuration properties (if possible) (tejasp)
* NUTCH-1273 Fix [deprecation] javac warnings (lewismc + tejasp)
* NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (tejasp)
* NUTCH-346 Improve readability of logs/hadoop.log (Renaud Richardet via tejasp)
* NUTCH-1501 Harmonize behavior of parsechecker and indexchecker (snagel + lewismc)
* NUTCH-1551 Improve WebTableReader field order and display batchId (lewismc)
* NUTCH-1552 possibility of a NPE in index-more plugin (kaveh minooie via lewismc)
* NUTCH-1547 BasicIndexingFilter - Problem to index full title (Feng)
* NUTCH-1389 parsechecker and indexchecker to report truncated content (snagel)
* NUTCH-1419 parsechecker and indexchecker to report protocol status (snagel via lewismc)
* NUTCH-1038 Port IndexingFiltersChecker to 2.0 (snagel via lewismc)
* NUTCH-1532 Replace 'segment' mapping field with batchId (patches v2 + v3) (Feng +via lewismc)
* NUTCH-1533 Implement getPrevModifiedTime(), setPrevModifiedTime(), getBatchId() and setBatchId() accessors in o.a.n.storage.WebPage (Feng via lewismc)
* NUTCH-XX fix Elastic Search Ivy configuration (Binoy d via lewismc)
* NUTCH-1542 "adddays" param for generator not present in 2.x (tejasp)
* NUTCH-1393 Display consistent usage of GeneratorJob with 1.X (Lufeng +via lewismc)
* NUTCH-1540 Add Gora buffered read and write maximum limits to nutch-default.xml configuration. (lewismc)
* NUTCH-842 AutoGenerate WebPage code (jnioche via lewismc)
* NUTCH-1536 Ant build file has hardcoded conf dir location (zm via lewismc)
* NUTCH-XX remove unused db.max.inlinks property in nutch-default.xml (lewismc)
* NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (tejasp)
* NUTCH-1453 Substantiate tests for IndexingFilters (lufeng via lewismc)
* NUTCH-1274 Fix [cast] javac warnings (tejasp via lewismc)
* NUTCH-1516 Nutch 2.x pom.xml out of sync with ivy.xml (lewismc)
* NUTCH-1510 Upgrade to Hadoop 1.1.1 (markus)
* NUTCH-1503 Configuration properties not in sync between FetcherReducer and nutch-default.xml (snagel + lewismc)
* NUTCH-1394 backport NUTCH-1232 Remove site field from index-basic (lewismc)
* NUTCH-1370 Expose exact number of urls injected @runtime (ferdy, snagel and lewismc)
   (includes commit for NUTCH-1471 make explicit which datastore urls are injected to)
* NUTCH-1484 TableUtil unreverseURL fails on file:// URLs (Rogério Pereira Araújo via snagel)
* NUTCH-1451 Upgrade automaton jar to 1.11-8 (lewismc)
* NUTCH-1496 ParserJob logs skipped urls with level info (Nathan Gass via lewismc)
* NUTCH-1488 bin/nutch to run junit from any directory (snagel via lewismc)
* NUTCH-1493 Error adding field 'contentLength'='' during solrindex using index-more (Nathan Gass via lewismc)
* NUTCH-1491 Strip UTF-8 non-character codepoints in title (Nathan Gass via markus)
* NUTCH-1421 RegexURLNormalizer to only skip rules with invalid patterns (snagel)
* NUTCH-1433 Upgrade to Tika 1.2 (jnioche)
* NUTCH-1087 Deprecate crawl command and replace with example script (jnioche)
* NUTCH-874 Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora (part 1) (Kiran Chitturi via lewismc)
* NUTCH-1344 BasicURLNormalizer to normalize https same as http (snagel)
* NUTCH-706 Url regex normalizer: pattern for session id removal not to match "newsId" (Meghna Kukreja via snagel)
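
As a rough illustration of the kind of problem the NUTCH-706 entry above describes (a session-id removal pattern that also matches "newsId"), here is a minimal Python sketch. The two patterns, the function name strip_session_id, and the example URL are assumptions made for illustration only; they are not the rules shipped with Nutch, whose normalizer is written in Java and configured separately.

```python
import re

# Hypothetical patterns for illustration; NOT the rules shipped with Nutch.
# The naive pattern looks for "sid=" or "sessionid=" anywhere, ignoring case,
# so it also fires on the tail of an unrelated parameter such as "newsId=".
NAIVE = re.compile(r"(sid|sessionid)=[0-9a-zA-Z]+&?", re.IGNORECASE)

# The anchored pattern uses a lookbehind so the parameter name must start
# immediately after '?' or '&', i.e. it must be a whole parameter name.
ANCHORED = re.compile(r"(?<=[?&])(sid|sessionid)=[0-9a-zA-Z]+&?", re.IGNORECASE)


def strip_session_id(url, pattern):
    """Remove session-id parameters matched by `pattern` from `url`."""
    cleaned = pattern.sub("", url)
    # Drop a dangling '?' or '&' left behind after the removal.
    return cleaned.rstrip("?&")


if __name__ == "__main__":
    url = "http://example.com/view?newsId=42&sid=abc123"
    print(strip_session_id(url, NAIVE))     # newsId is damaged as well
    print(strip_session_id(url, ANCHORED))  # only the sid parameter is removed
```

With the naive pattern the unrelated newsId parameter is mangled, while the anchored variant leaves it intact and strips only the session id.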

Latest comments (9)

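Quoting the comment by “聂永生”

Quoting the comment by “愁乐天”

Quoting the comment by “聂永生”

I wonder whether there is a Python implementation with functionality similar to Nutch. Java is too hard; I can't use it.

You could try Whoosh + Scrapy.

That's awesome.

For Chinese-language search, I recommend the jieba word segmenter. It can be integrated into Whoosh.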
2013-09-22 11:20
cool
2013-06-09 18:39

Quoting the comment by “大灰狼”

Does Scrapy support distributed crawling?

Someone in China has already built a distributed crawler based on Scrapy. Take a look at https://github.com/gnemoug/distribute_crawler
2013-06-09 11:49
Does Scrapy support distributed crawling?
2013-06-09 10:31

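Quoting the comment by “愁乐天”

Quoting the comment by “聂永生”

I wonder whether there is a Python implementation with functionality similar to Nutch. Java is too hard; I can't use it.

You could try Whoosh + Scrapy.

That's awesome.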
2013-06-09 10:14
There's no .NET version.
2013-06-09 10:09

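Quoting the comment by “聂永生”

I wonder whether there is a Python implementation with functionality similar to Nutch. Java is too hard; I can't use it.

You could try Whoosh + Scrapy.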
2013-06-09 09:11
I wonder whether there is a Python implementation with functionality similar to Nutch. Java is too hard; I can't use it.
2013-06-09 08:24