Web scraping frameworks?

Jérôme Étévé jerome.eteve at gmail.com
Tue Mar 4 23:35:52 GMT 2014


Web::Scraper is great for hacking something together quickly.

I use it regularly for quick, ad-hoc data scraping.
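
For reference, a minimal Web::Scraper sketch; the URL and the field
names are made up for illustration:

    use strict;
    use warnings;
    use URI;
    use Web::Scraper;

    # Declare what to extract: every <a> href plus the page title.
    my $s = scraper {
        process 'a',     'urls[]' => '@href';
        process 'title', 'title'  => 'TEXT';
    };

    # scrape() takes a URI (or an HTML string) and returns a hashref.
    my $res = $s->scrape( URI->new('http://example.com/') );
    print "$_\n" for @{ $res->{urls} || [] };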

For heavier work, I prefer a combination of the following tools:

- Curl (Net::Curl, or its LWP-style incarnation LWP::Curl). I've found
it to be more resilient than LWP against dodgy HTTP server responses
(see the fetching sketch after this list).

- For the page data scraping itself, LibXML (via its load_html method
in recover mode) + XPath, again for its resilience against crap HTML.
We all know correct HTML is the exception rather than the norm on the
big bad web (see the parsing sketch after this list).

- For queuing jobs, I'm a big fan of Gearman. It's light, very stable
and very simple (a minimal worker sketch follows the list).
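
To illustrate the fetching side, here's a minimal LWP::Curl sketch.
The URL and referer are made up; take it as a sketch rather than a
drop-in fetcher (Net::Curl::Easy exposes the full libcurl option set
if you need more control):

    use strict;
    use warnings;
    use LWP::Curl;

    my $curl = LWP::Curl->new();

    # get() takes the URL plus an optional referer and returns
    # the response body.
    my $content = $curl->get('http://example.com/page',
                             'http://example.com/');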
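
For the parsing side, a small XML::LibXML sketch; the HTML string is
deliberately broken to show recover mode doing its job:

    use strict;
    use warnings;
    use XML::LibXML;

    # Unclosed tags and no </html> - the sort of markup the web
    # actually serves.
    my $html = '<html><body><a href="/one">one<a href="/two">two';

    # recover mode parses broken HTML instead of dying on it;
    # suppress_errors keeps libxml2 quiet about the mess.
    my $dom = XML::LibXML->load_html(
        string          => $html,
        recover         => 2,
        suppress_errors => 1,
    );

    # Plain XPath from there on.
    print $_->value, "\n" for $dom->findnodes('//a/@href');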
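
Finally, a minimal Gearman worker sketch. The function name
'scrape_url' and the server address are made up for illustration:

    use strict;
    use warnings;
    use Gearman::Worker;

    my $worker = Gearman::Worker->new;
    $worker->job_servers('127.0.0.1:4730');

    $worker->register_function(
        scrape_url => sub {
            my ($job) = @_;
            my $url = $job->arg;
            # ... fetch and scrape $url here ...
            return "done: $url";
        },
    );

    # Block forever, picking up jobs as they arrive.
    $worker->work while 1;

On the submitting side, Gearman::Client's do_task('scrape_url', $url)
is enough for a synchronous call.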

Of course, this is only a toolbox. I doubt you'll find a ready-made
"framework" that fits your specific business needs out of the box.

J.



On 4 March 2014 22:55, Pierre M <piemas25 at gmail.com> wrote:
> I love using
>       Web::Scraper
> It's so simple and intuitive to use!
> But it only "goes down" (unless I've missed something), and it doesn't
> let you interact with the page (fill forms, click buttons, etc.), so it
> doesn't handle complex scraping scenarios. For these, I like
>       Mojo::UserAgent
> which gives me more control. An example here:
>
> http://blog.kraih.com/post/43198036449/mojolicious-hack-of-the-day-web-scraping-with



-- 
Jerome Eteve
+44(0)7738864546
http://www.eteve.net/

