XML::LibXML and HTML (in >=v1.67)

Wed Apr 1 11:03:07 BST 2009

On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute <tjc at wintrmute.net> wrote:
> The problem occurs when the html contains (the commonly used) & symbol
> within attributes, such as:
> <a href="/foo?a=b&c=d">
>
> I know that really one should escape the ampersand in those
> circumstances, however real-world web-pages rarely do this.. And this
> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
> versions.. but eh, maybe it's just the way I'm calling the parser?

XML::Liberal [1] addresses exactly this kind of issue. It was also
broken by the error format change in XML::LibXML 1.67, but it works
with 1.69_2 on CPAN.
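
A minimal sketch of what that looks like, going by the XML::Liberal
synopsis. The sample markup and the parse_liberally() helper are made
up for illustration, and whether a given document actually gets
repaired depends on the XML::Liberal and XML::LibXML versions you have
installed, so the call is wrapped in eval:

```perl
use strict;
use warnings;
use XML::Liberal;

# Hypothetical helper: returns an XML::LibXML document, or undef if
# even the liberal parser gives up on the input.
sub parse_liberally {
    my ($xml) = @_;

    # 'LibXML' selects the XML::LibXML-backed driver.
    my $parser = XML::Liberal->new('LibXML');

    my $doc = eval { $parser->parse_string($xml) };
    warn "still unparsable: $@" unless $doc;
    return $doc;
}

# Made-up sample with a bare ampersand inside an attribute value.
my $doc = parse_liberally('<p><a href="/foo?a=b&c=d">link</a></p>');
print $doc->toString, "\n" if $doc;
```

Note that this goes through the XML parser, so it suits XHTML-style
input rather than arbitrary tag soup.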

> Alternatively.. what do YOU use to parse real-world websites that are
> often not totally valid?

I use my own Web::Scraper [2,3] to scrape stuff. It uses
HTML::TreeBuilder (and ::XPath) to build a DOM tree and runs XPath
expressions or CSS selectors against it. It's definitely slower than
LibXML but deals with such broken HTML documents very well. If you
really care about performance, there's also HTML::TreeBuilder::LibXML
on github [4], which is a drop-in replacement for H::TB::XPath but
uses LibXML under the hood.
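
For the link example from your mail, a rough sketch with Web::Scraper
could look like this. The sample HTML and the "hrefs" key are made up
for illustration:

```perl
use strict;
use warnings;
use Web::Scraper;

# Made-up sample page with an unescaped ampersand in an href.
my $html = '<html><body><a href="/foo?a=b&c=d">link</a></body></html>';

# Collect the href attribute of every <a> element into an array.
my $links = scraper {
    process 'a', 'hrefs[]' => '@href';
};

# scrape() also accepts a URI object or an HTTP::Response; a
# reference to a string is treated as HTML content.
my $res = $links->scrape(\$html);

print "$_\n" for @{ $res->{hrefs} || [] };
```

The selector can be an XPath expression instead of 'a' if you prefer.
As far as I know, later releases of HTML::TreeBuilder::LibXML document
a replace_original() class method for swapping it in globally, but
treat that as an assumption and check the version you install.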

Another option would be to clean up such markup errors with
HTML::Tidy before passing the document to LibXML. It would be neat to
do that cleanup only if libxml parsing fails even with the recover
option etc. set.
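
Something along these lines, as an untested sketch. parse_lenient() is
a hypothetical helper name, and the Tidy options are just the ones I
would try first, not a recommendation:

```perl
use strict;
use warnings;
use XML::LibXML;
use HTML::Tidy;

# Hypothetical helper: try libxml's HTML parser in recover mode first,
# and only run the input through HTML::Tidy if that still dies.
sub parse_lenient {
    my ($html) = @_;

    my $parser = XML::LibXML->new;
    $parser->recover(1);    # keep going past recoverable errors

    my $doc = eval { $parser->parse_html_string($html) };
    return $doc if $doc;

    # Fallback: let Tidy rewrite the markup, then parse the result.
    # This second parse is allowed to die if the input is hopeless.
    my $tidy  = HTML::Tidy->new({ output_xhtml => 1, tidy_mark => 0 });
    my $clean = $tidy->clean($html);
    return $parser->parse_html_string($clean);
}

my $doc = parse_lenient(
    '<html><body><a href="/foo?a=b&c=d">link</a></body></html>'
);
print $_->getAttribute('href'), "\n" for $doc->findnodes('//a');
```

With recover set, libxml's HTML parser rarely dies outright, so in
practice the Tidy branch should only run for badly mangled input.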

[1] http://search.cpan.org/dist/XML-Liberal
[2] http://search.cpan.org/dist/Web-Scraper
[3] http://github.com/miyagawa/web-scraper
[4] http://github.com/tokuhirom/html--treebuilder--libxml

-- 
Tatsuhiko Miyagawa