jsoup 1.7.1 has been released. Downloads:
- jsoup-1.7.1.jar (core library)
- jsoup-1.7.1-sources.jar (optional sources jar)
- jsoup-1.7.1-javadoc.jar (optional javadoc jar)
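jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.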
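As a minimal sketch of that API, the snippet below parses an HTML string and pulls out link targets with a CSS selector. It assumes the jsoup jar is on the classpath; the class name `JsoupSketch`, the helper `linkTargets`, and the sample markup are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    // Hypothetical helper: parses an HTML string and collects the href of
    // every <a> element that has one, using a CSS selector.
    public static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<String>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>See <a href=\"http://jsoup.org/\">jsoup</a>.</p></body></html>";
        System.out.println(Jsoup.parse(html).title());
        System.out.println(linkTargets(html));
    }
}
```

To work from a live page instead of a string, `Jsoup.connect(url).get()` returns the same kind of `Document`.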
This release brings substantial improvements in performance and stability, along with a number of functional enhancements:
Improvements:
- Improved parse time, now 2.3x faster than previous release, with lower memory consumption.
- Reduced memory consumption and garbage collection when selecting elements.
- Removed an unnecessary synchronisation in Tag.valueOf, allowing multi-threaded parsing to run faster.
- Introduced finer granularity of exceptions in Jsoup.connect, including HttpStatusException and UnsupportedMimeTypeException, allowing programmers better control of error cases.
- In Jsoup.clean, allow custom Document.OutputSettings, to control pretty printing, character set, and entity escaping.
- Whitespace normalise document.title() output.
- In Jsoup.connect, fail faster if the return content type is not supported.
- Made entity decoding less greedy, so that non-entities are less likely to be incorrectly treated as entities.
- In Jsoup.connect, enforce a connection disconnect after every connect. This precludes keep-alive connections to the same host, but in practice many implementations will leak connections, particularly on error.
- If a server doesn't specify a content-type header, treat that as OK.
- If a server returns an unsupported character-set header, attempt to decode the content with the default charset (UTF8), instead of bailing with an unsupported charset exception.
Bug fixes:
- Fixed an issue when determining the Windows-1254 character-set from a meta tag when run in the Turkish locale.
- Fixed whitespace preservation in textarea tags.
- Fixed an issue that prevented frameset documents from being cleaned by the Cleaner.
- Fixed an issue when normalising whitespace for strings containing high-surrogate characters.
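The last fix concerns whitespace normalisation on strings that contain surrogate pairs. The sketch below is not jsoup's implementation. It only illustrates one way to collapse runs of whitespace while walking the string by code point, so that a high surrogate is never separated from its low surrogate. The class and method names are hypothetical.

```java
public class WhitespaceSketch {
    // Collapses every run of whitespace into a single space. The loop advances
    // by code point, so a surrogate pair is always copied as one unit.
    public static String normaliseWhitespace(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean lastWasSpace = false;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                if (!lastWasSpace) {
                    out.append(' ');
                }
                lastWasSpace = true;
            } else {
                out.appendCodePoint(cp);
                lastWasSpace = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is a single supplementary character encoded as a surrogate pair.
        String text = "a \n\t b \uD83D\uDE00  c";
        System.out.println(normaliseWhitespace(text));
    }
}
```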
Quoting the comment from "sea":
I have a question: how do I extract the characters between two strings? For example:
"<h1></h1><div><span>This is con<a href="">tent te</a>xt info</span><span class="not"><h1>This is the info</h1></span></div>"
Say I want the text between <span class="not"><h1> and </h1></span>. Please don't suggest calling
doc.getElementsByClass and then getElementsByTag, because the start string is passed in as a parameter and cannot be known in advance.
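One possible approach to the question above, sketched with plain string matching rather than jsoup: find the start marker, then the next end marker after it, and take what lies between. This assumes the markers appear in the input exactly as typed (same attribute order, quoting, and whitespace). The class name `BetweenMarkers` and the method `between` are hypothetical.

```java
public class BetweenMarkers {
    // Returns the text between the first occurrence of `start` and the next
    // occurrence of `end` after it, or null when either marker is missing.
    public static String between(String html, String start, String end) {
        int from = html.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = html.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return html.substring(from, to);
    }

    public static void main(String[] args) {
        // Sample input modelled on the string in the comment.
        String html = "<h1></h1><div><span>This is con<a href=\"\">tent te</a>xt info</span>"
                + "<span class=\"not\"><h1>This is the info</h1></span></div>";
        System.out.println(between(html, "<span class=\"not\"><h1>", "</h1></span>"));
    }
}
```

Because this matches raw text, it breaks as soon as the real markup differs from the marker by a single character. Parsing the document and selecting elements is more robust whenever the structure is known.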
Test output:
Jsoup use time:250 ms
Jodd use time:125 ms
Also, jsoup hangs on complex HTML. The file that triggers it is at: https://dl.dropbox.com/u/77543017/stuck.html
Quoting the comment from "gdp8":
Quoting the comment from "白石":
I strongly suggest comparing it with Jerry from Jodd-Wot: http://jodd.org/doc/jerry/index.html
It is far better than jsoup and htmlparser. I chose Jodd-wot after extensive testing across several libraries.
http://htmlparser.sourceforge.net
Quoting the comment from "长江北":
What advantages does it have over htmlcleaner?
Quoting the comment from "君无畏":
Upgrading right away.