jsoup 1.10.2 released: a Java HTML parser

红薯
Published on January 5, 2017

jsoup 1.10.2 has been released. This version brings faster startup, extended DOM tree traversal, improved HTTP compatibility, and a number of bug fixes.

Details:

Improvements

  • Improved startup time, particularly on Android, by reducing garbage generation and CPU execution time when loading the HTML entity files. About 1.72x faster in this area.

  • Added Element.is(query) to check if an element matches this CSS query.

  • Added new methods to Elements: next(query), nextAll(query), prev(query), prevAll(query) to select next and previous element siblings from a current selection, with optional selectors.

  • Added Node.root() to get the topmost ancestor of a Node.

  • Added the new selector :containsData(), to find elements that hold data, like script and style tags.

  • Changed Jsoup.isValid(bodyHtml) to validate that the input contains only body HTML that is safe according to the whitelist and is free of HTML errors. The Cleaner.isValid(Document) method now also checks that the document includes only body HTML.

  • In Whitelists, validate that a removed protocol exists before removing said protocol.

  • Allow the Jsoup.Connect thread to be interrupted when reading the input stream; this helps when reading a long stream of data that never triggers the read timeout.

  • Jsoup.Connect now uses a desktop user agent by default. Many developers were getting caught by not specifying the user agent and sending the default Java user agent. That causes many servers to return different content from what they would send to a desktop browser, and from what the developer was expecting.

  • Increased the default connect/read timeout in Jsoup.Connect to 30 seconds.

  • Jsoup.Connect now detects if a header value is actually in UTF-8 rather than the ISO-8859-1 encoding defined by the HTTP spec, and converts the header value appropriately. This improves compatibility with servers that are configured incorrectly.
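
To illustrate the last item, here is a minimal sketch of how a header value that was read as ISO-8859-1 can be re-decoded when its bytes are in fact valid UTF-8. It uses only the Java standard library and is not jsoup's actual implementation; the class name HeaderValueFixer and the method fixEncoding are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the jsoup API.
final class HeaderValueFixer {

    /** Returns the value re-decoded as UTF-8 if its raw bytes form valid UTF-8, else the value unchanged. */
    static String fixEncoding(String value) {
        boolean hasHighChars = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c > 0xFF) return value;      // cannot come from single-byte decoding
            if (c > 0x7F) hasHighChars = true;
        }
        // Pure ASCII is identical in both encodings, so there is nothing to fix.
        if (!hasHighChars) return value;

        // ISO-8859-1 maps every byte to exactly one char, so this recovers the raw bytes.
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);

        // A strict decoder throws on malformed input instead of silently substituting.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return value; // not valid UTF-8: keep the ISO-8859-1 interpretation
        }
    }

    public static void main(String[] args) {
        // "\u00C3\u00A9" is how the UTF-8 bytes of U+00E9 look when read as ISO-8859-1.
        System.out.println(fixEncoding("\u00C3\u00A9"));
    }
}
```

The check is a heuristic: a genuine ISO-8859-1 value whose bytes happen to form valid UTF-8 would also be converted, which is rare in practice but possible.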

Fixes

  • Bugfix: in Jsoup.Connect, URLs containing non-URL-safe characters were not correctly encoded into a URL-safe form.

  • Bugfix: a "SYSTEM" flag in doctype tags would be incorrectly removed.

  • Bugfix: removing attributes from an Element with removeAttr() would cause a ConcurrentModificationException.

  • Bugfix: the contents of Comment nodes were not returned by Element.data().

  • Bugfix: if the source was checked out on Windows with git core.autocrlf=true, Entities.load would fail because of the \r char.
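
As a rough illustration of the URL encoding fix listed first above, the sketch below percent-encodes characters that are not safe in a URL while leaving existing escapes such as %20 untouched. It is a simplified standard-library example, not the code jsoup uses; the class name UrlSafeEncoder and the chosen set of unsafe characters are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the jsoup API.
final class UrlSafeEncoder {

    // ASCII characters treated as unsafe in this sketch (an assumed, non-exhaustive set).
    private static final String UNSAFE_ASCII = "\"<>\\^`{|}";

    /** Percent-encodes spaces, control characters, non-ASCII bytes and the unsafe ASCII set. */
    static String encode(String url) {
        StringBuilder out = new StringBuilder();
        // Work on UTF-8 bytes so that each non-ASCII character becomes one or more %XX escapes.
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean printableAscii = c > 0x20 && c < 0x7F;
            if (printableAscii && UNSAFE_ASCII.indexOf(c) < 0) {
                out.append((char) c);          // safe: copy through, including '%'
            } else {
                out.append(String.format("%%%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("http://example.com/a b?q=\u00E9"));
    }
}
```

Because '%' is copied through, input that is already encoded is not double-encoded. The sketch makes no attempt to handle internationalized host names, which need Punycode rather than percent-encoding.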

Download: https://jsoup.org/download

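Unless marked as a repost, articles on this site are original or compiled by this site. Reposting in any form is welcome, but please credit the source, respect the work of others, and help build the open source community.
When reposting, please note: article reposted from the OSCHINA community [http://www.oschina.net]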

Latest comments (18)

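乌龟壳

Quoting the comment from "乌龟壳":

jsoup's functional support for HTTP is rather weak; I have had to change quite a lot.

Quoting the comment from "今幕明":

What do you use?
I took the jsoup source code and customized part of it myself.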
今幕明

Quoting the comment from "maxid":

jsoup is rather weak
What do you use? Please recommend something, for Java.
今幕明

Quoting the comment from "乌龟壳":

jsoup's functional support for HTTP is rather weak; I have had to change quite a lot.
What do you use?
TonyJian

Quoting the comment from "maxid":

jsoup is rather weak
Where exactly is it weak?
TonyJian

Quoting the comment from "局长":

jsoup... I remember reading a tutorial about it written by the boss
The Java version of jQuery
maxid
jsoup is rather weak
唐代de豆腐
I use it to download animated GIFs
F_L_F
I use it for vote stuffing and prize draws
_vince
Take a look at Jerry from Jodd; its syntax is closer to JavaScript
乌龟壳
jsoup's functional support for HTTP is rather weak; I have had to change quite a lot.