HTML parsers are programs and libraries that parse Hypertext Markup Language (HTML) automatically. They serve two main purposes:
- HTML traversal: give programmers an interface for accessing and modifying the parsed HTML source. Canonical example: DOM parsers.
- HTML cleaning: fix invalid HTML and improve the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
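The two purposes can be illustrated with a minimal sketch built on Python's standard-library `html.parser` module, which is not one of the parsers compared below. The class and function names (`LinkCollector`, `Reindenter`, `extract_links`, `reindent`) and the sample markup are hypothetical, chosen only for this example. The cleaning step is a naive repair: it closes elements left open and re-indents the markup, but it does not implement the HTML5 tree-construction rules, it drops comments and doctypes, and it trims whitespace around text.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (a subset, enough for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LinkCollector(HTMLParser):
    """Traversal: record the href of every <a> element encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


class Reindenter(HTMLParser):
    """Cleaning: re-emit one node per line, indented by nesting depth,
    and close any element that the input left open."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.open_tags = []

    def _emit(self, text):
        self.lines.append("  " * len(self.open_tags) + text)

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self._emit(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close every element opened after `tag`, then `tag` itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self._emit(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if data.strip():
            self._emit(escape(data.strip(), quote=False))

    def close(self):
        super().close()
        # Close whatever is still open at the end of the input.
        while self.open_tags:
            self._emit(f"</{self.open_tags.pop()}>")


def extract_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


def reindent(markup):
    parser = Reindenter()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.lines)


# Invalid input: <b> and <div> are never closed.
SAMPLE = '<div><p>See <a href="/docs">the docs</a> and <b>more</p><a href="/home">Home</a>'

print(extract_links(SAMPLE))  # ['/docs', '/home']
print(reindent(SAMPLE))
```

Dedicated tools such as those in the table below go much further than this sketch, for example by following the error-recovery rules of the HTML Standard.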
Parser | License | Implementation language(s) | Latest date* | HTML parsing | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|---|
HTML Tidy | W3C license | ANSI C | 2021-07-17 | Yes | Yes | Yes | Yes |
HtmlUnit | Apache License 2.0 | Java | 2023-10-31 | Yes | ? | No | No |
Beautiful Soup | MIT License | Python | 2023-04-07 | Yes | Yes | ? | No |
jsoup | MIT License | Java | 2025-04-29 | Yes | Yes | Yes | Yes |
References
12.2 Parsing HTML documents, HTML Standard. Archived 2013-01-16 at the Wayback Machine. https://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html
HTML Tidy release 5.8.0. https://github.com/htacg/tidy-html5/releases/tag/5.8.0
What is Tidy? http://www.html-tidy.org/#what_is_tidy
HtmlUnit 3.7.0. https://github.com/HtmlUnit/htmlunit/releases/tag/3.7.0
Beautiful Soup release 4.12. https://www.crummy.com/software/BeautifulSoup/bs4/download/4.12/
jsoup Java HTML Parser release 1.20.1. https://jsoup.org/news/release-1.20.1