Web scraping frameworks?

Stanislaw Pusep creaktive at gmail.com
Wed Mar 5 12:23:06 GMT 2014


Shameless self-promotion, but I could not resist when "parallel" was
mentioned:
http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html
My point is: forking parallel workers to crawl a single domain is a
terrible way of doing things, because it defeats connection persistence.
Reopening the connection for each worker throws away the speed gain of
parallelism in the first place.


On Wed, Mar 5, 2014 at 12:31 PM, Dave Hodgkinson <davehodg at gmail.com> wrote:

> I've tended to use Parallel::Process where remote sites have been able to
> keep up and haven't been throttled, otherwise just let it run.
>
>
> On Tue, Mar 4, 2014 at 11:49 PM, Kieren Diment <diment at gmail.com> wrote:
>
> > Gearman's fine until you need a reliable queue.  It's certainly less of a
> > pain to set up than RabbitMQ, but if you start with Gearman and find you
> > need reliability after a while, there's substantial pain to be experienced
> > (unless you already know all about your reliable job queue implementation
> > of choice).
> >
> > On 05/03/2014, at 10:35 AM, Jérôme Étévé wrote:
> >
> > > - For queuing jobs, I'm a big fan of Gearman. It's light, very stable
> > > and very simple.
> >
> >
> >
>
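Since Gearman came up below: for anyone who hasn't tried it, a minimal
round trip with the classic Gearman::Worker / Gearman::Client CPAN
modules looks roughly like this (server address and function name are
invented for the example):

    # worker.pl -- registers a function and waits for jobs
    use strict;
    use warnings;
    use Gearman::Worker;
    use LWP::UserAgent;

    my $worker = Gearman::Worker->new;
    $worker->job_servers('127.0.0.1:4730');   # hypothetical gearmand
    $worker->register_function(fetch_url => sub {
        my $job = shift;
        return LWP::UserAgent->new->get($job->arg)->decoded_content;
    });
    $worker->work while 1;

    # client.pl -- submits a job and blocks on the result
    use strict;
    use warnings;
    use Gearman::Client;

    my $client = Gearman::Client->new;
    $client->job_servers('127.0.0.1:4730');
    my $ref = $client->do_task(fetch_url => 'http://localhost/');
    print $$ref;

For fire-and-forget crawling you'd use dispatch_background() instead of
do_task(), with Kieren's caveat above: nothing persists the queue if
gearmand goes away.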


More information about the london.pm mailing list