web crawling in perl

Ian Malpass ian at indecorous.com
Mon May 22 19:45:35 BST 2006


Sam Smith wrote:
> 
> What do people think is the "best" perl (or possibly
> otherwise if it's much better) module/script for crawling
> remote websites?
> 
> Some of them are relatively complicated dynamic CGI messes,
> and I'm especially interested in things which aren't html
> documents (doc, pdf, ppt etc).
> 
> Google suggests LWP::RobotUA and HTML::SimpleLinkExtor and
> rolling my own; lots of simple ones which don't use those
> modules and have large caveats. What've I missed?

I've used LWP::Parallel::UserAgent in the past, with decent results;
whether the parallel fetching is worth it depends on how impatient you
are. I notice that there is an LWP::Parallel::RobotUA as well. But beyond
that, I've always rolled my own using HTML::LinkExtor and/or
HTML::Parser.
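
For what it's worth, the roll-your-own part tends to look roughly like
the sketch below. It only shows the link-extraction step with
HTML::LinkExtor, run on an inline HTML string so there is no network
access. The helper names (extract_links, is_document), the example.com
URLs and the doc/pdf/ppt suffix list are all made up for illustration;
the URI module is assumed to be installed alongside HTML::Parser.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical helper: return the absolute URLs of every <a href="...">
# in $html, resolved against $base.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # Only follow anchors; ignore img, link, script and friends.
        return unless $tag eq 'a' && defined $attr{href};
        push @links, URI->new_abs($attr{href}, $base)->as_string;
    });
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Hypothetical helper: true if the URL is http(s) and its path ends in
# one of the document suffixes we care about (illustrative list only).
sub is_document {
    my ($url) = @_;
    my $uri = URI->new($url);
    my $scheme = $uri->scheme;
    return 0 unless defined $scheme && $scheme =~ /\Ahttps?\z/;
    return $uri->path =~ /\.(?:doc|pdf|ppt)\z/i ? 1 : 0;
}

# Made-up page content standing in for a fetched response body.
my $html = <<'HTML';
<a href="/reports/q1.pdf">Q1</a>
<a href="slides/talk.PPT?dl=1">Talk</a>
<a href="about.html">About</a>
<img src="logo.png">
HTML

my @links = extract_links($html, 'http://example.com/docs/');
my @docs  = grep { is_document($_) } @links;
print "$_\n" for @docs;
```

In a real crawler the HTML would come from whatever user agent you
settle on (LWP::RobotUA or one of the parallel ones), the non-document
links would go back onto the queue of pages to visit, and you would
want to track which URLs you have already seen.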

Ian

