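jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.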
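
As a rough illustration of the API described above, the following sketch parses an HTML string and uses a CSS selector to pull out link text. It assumes the jsoup jar is on the classpath; the class name `JsoupQuickStart`, the helper `linkTexts`, and the sample markup are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class JsoupQuickStart {
    // Hypothetical helper: returns the text of every <a href> found inside a <p>.
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);          // build a DOM from the HTML string
        Elements links = doc.select("p a[href]");  // CSS selector, similar to jQuery
        List<String> texts = new ArrayList<>();
        for (Element link : links) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>See <a href=\"/docs\">the docs</a>.</p></body></html>";
        System.out.println(linkTexts(html));
    }
}
```

Fetching from a URL follows the same pattern, with `Jsoup.connect(url).get()` in place of `Jsoup.parse(html)`.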
- Improved parse time, now 2.3x faster than the previous release, with lower memory consumption.
- Reduced memory consumption and garbage collection when selecting elements.
- Removed an unnecessary synchronisation in Tag.valueOf, allowing multi-threaded parsing to run faster.
- Introduced finer granularity of exceptions in Jsoup.connect, including HttpStatusException and UnsupportedMimeTypeException, allowing programmers better control of error cases.
- In Jsoup.clean, allow custom Document.OutputSettings, to control pretty printing, character set, and entity escaping.
- Whitespace-normalise the output of document.title().
- In Jsoup.connect, fail faster if the return content type is not supported.
- Made entity decoding less greedy, so that non-entities are less likely to be incorrectly treated as entities.
- In Jsoup.connect, enforce a connection disconnect after every connect. This precludes keep-alive connections to the same host, but in practice many implementations will leak connections, particularly on error.
- If a server doesn't specify a content-type header, treat that as OK.
- If a server returns an unsupported character-set header, attempt to decode the content with the default charset (UTF-8), instead of bailing with an unsupported charset exception.
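
The charset fallback in the last item can be pictured with a small sketch that uses only the JDK. This is not jsoup's internal code; the class `CharsetFallback` and its methods are hypothetical names chosen to show the idea of falling back to UTF-8 when the declared charset is missing, malformed, or unsupported.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {
    // Returns the declared charset if the JVM supports it, otherwise UTF-8.
    static Charset resolveCharset(String declaredName) {
        if (declaredName == null) {
            return StandardCharsets.UTF_8;
        }
        String name = declaredName.trim();
        try {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException: the name is syntactically invalid.
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes a response body with the resolved charset instead of throwing.
    static String decode(byte[] body, String declaredName) {
        return new String(body, resolveCharset(declaredName));
    }

    public static void main(String[] args) {
        byte[] body = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(body, "x-no-such-charset"));
    }
}
```

The trade-off is that content in a genuinely unsupported encoding may decode with some wrong characters, which is usually preferable to failing the whole request.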
Bug fixes:
- Fixed an issue when determining the Windows-1254 character-set from a meta tag when run in the Turkish locale.
- Fixed whitespace preservation in textarea tags.
- Fixed an issue that prevented frameset documents from being cleaned by the Cleaner.
- Fixed an issue when normalising whitespace for strings containing high-surrogate characters.
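
To show why surrogate characters need care during whitespace normalisation, here is a minimal sketch that walks a string by code point rather than by `char`. It is an illustration under assumptions, not jsoup's actual implementation; `SurrogateSafeWhitespace` and `normaliseWhitespace` are hypothetical names.

```java
public class SurrogateSafeWhitespace {
    // Collapses each run of whitespace into a single space, iterating by code point
    // so that a surrogate pair (two chars) is never split or examined half at a time.
    static String normaliseWhitespace(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        boolean lastWasWhite = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // advance 2 chars for supplementary code points
            if (Character.isWhitespace(cp)) {
                if (!lastWasWhite) {
                    sb.append(' ');
                }
                lastWasWhite = true;
            } else {
                sb.appendCodePoint(cp);
                lastWasWhite = false;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \uD83D\uDE00 is a single supplementary character encoded as a surrogate pair.
        System.out.println(normaliseWhitespace("a \n\t b\uD83D\uDE00  c"));
    }
}
```

Advancing the index by `Character.charCount(cp)` is the key step: incrementing by one would land on the low surrogate and treat it as a separate character.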