Web scraping frameworks?

Dave Hodgkinson davehodg at gmail.com
Wed Mar 5 12:52:58 GMT 2014


Not in my experience! API and parsing time are almost always the bigger
cost. And your article says we're I/O-bound anyhow.

Unless you have numbers...


On Wed, Mar 5, 2014 at 12:23 PM, Stanislaw Pusep <creaktive at gmail.com> wrote:

> Shameless self-promotion, but I could not resist when "parallel" was
> mentioned:
>
> http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html
>
> My point is: forking parallel workers to crawl a single domain is a
> terrible way of doing things, because of connection persistence. Reopening
> the connection for each worker defeats the speed gain of parallelism in
> the first place.
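>
> A minimal non-blocking sketch of what I mean (Mojo::UserAgent; the URLs
> and counts are placeholders): one process, one user agent, and keep-alive
> reuses the connections that forked workers would keep reopening:
>
>     use Mojo::Base -strict;    # strict, warnings, say()
>     use Mojo::UserAgent;
>     use Mojo::IOLoop;
>
>     # One process, one UA: keep-alive connections to the same host
>     # are cached and reused instead of reopened per worker.
>     my $ua = Mojo::UserAgent->new(max_connections => 4);
>
>     my @urls    = map {"http://example.com/page/$_"} 1 .. 10;
>     my $pending = @urls;
>
>     $ua->get($_ => sub {
>         my ($ua, $tx) = @_;
>         say $tx->req->url, ' -> ', $tx->res->code;
>         Mojo::IOLoop->stop unless --$pending;
>     }) for @urls;
>
>     Mojo::IOLoop->start;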
>
>
> On Wed, Mar 5, 2014 at 12:31 PM, Dave Hodgkinson <davehodg at gmail.com>
> wrote:
>
> > I've tended to use Parallel::Process where remote sites have been able to
> > keep up and haven't throttled us; otherwise I just let it run.
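> >
> > The shape of that, roughly (a sketch with Parallel::ForkManager standing
> > in as the forking module; @urls and the worker cap are placeholders):
> >
> >     use strict;
> >     use warnings;
> >     use Parallel::ForkManager;
> >     use LWP::UserAgent;
> >
> >     my @urls = map {"http://example.com/page/$_"} 1 .. 10;
> >     my $pm   = Parallel::ForkManager->new(4);    # max concurrent children
> >     my $ua   = LWP::UserAgent->new(timeout => 30);
> >
> >     for my $url (@urls) {
> >         $pm->start and next;    # parent forks a child and moves on
> >         my $res = $ua->get($url);
> >         warn "$url: ", $res->status_line, "\n" unless $res->is_success;
> >         $pm->finish;            # child exits
> >     }
> >     $pm->wait_all_children;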
> >
> >
> > On Tue, Mar 4, 2014 at 11:49 PM, Kieren Diment <diment at gmail.com> wrote:
> >
> > > Gearman's fine until you need a reliable queue. It's certainly less of a
> > > pain to set up than RabbitMQ, but if you start with Gearman and find you
> > > need reliability after a while, there's substantial pain to be experienced
> > > (unless you already know all about your reliable job queue implementation
> > > of choice).
> > >
> > > On 05/03/2014, at 10:35 AM, Jérôme Étévé wrote:
> > >
> > > > - For queuing jobs, I'm a big fan of Gearman. It's light, very stable
> > > > and very simple.
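> > > >
> > > > Simple as in: a whole worker is just this (a sketch; the server
> > > > address and task name are made up):
> > > >
> > > >     use strict;
> > > >     use warnings;
> > > >     use Gearman::Worker;
> > > >
> > > >     my $worker = Gearman::Worker->new;
> > > >     $worker->job_servers('127.0.0.1:4730');    # default gearmand port
> > > >
> > > >     # gearmand hands us fetch_url jobs as clients submit them
> > > >     $worker->register_function(fetch_url => sub {
> > > >         my $job = shift;
> > > >         my $url = $job->arg;
> > > >         # ... fetch the page here and return its body ...
> > > >         return "fetched $url";
> > > >     });
> > > >
> > > >     $worker->work while 1;
> > > >
> > > > The client side is little more than Gearman::Client->new plus
> > > > do_task(fetch_url => $url).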
> > >
> > >
> > >
> >
>

