Web scraping frameworks?

Stanislaw Pusep creaktive at gmail.com
Wed Mar 5 13:31:01 GMT 2014


My numbers are right at the top of my article :)
If it is not using 100% CPU, it is definitely not CPU-bound.
Also, if you're just fetching data & storing it without parsing anything,
then you can become CPU-bound by the HTTP agent itself:
https://metacpan.org/pod/AnyEvent::Net::Curl::Queued#BENCHMARK
For instance, one cannot squeeze much more than 100 requests/second out of
WWW::Mechanize (per CPU), even when connecting to localhost.
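
To make the connection-persistence point from my earlier message (quoted
below) concrete, here is a minimal sketch. It is in Python rather than Perl
only so that it runs with nothing but the standard library, and it is not the
benchmark from the article. It starts a throwaway local server, fetches the
same page several times, and counts how many TCP connections the server had
to accept. All names (CountingServer, fetch_persistent, fetch_reconnecting,
count_connections) are made up for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class OkHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 plus an explicit Content-Length lets the server keep the
    # connection open between requests (keep-alive).
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


class CountingServer(ThreadingHTTPServer):
    """Hypothetical helper: counts accepted TCP connections."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.connections = 0

    def get_request(self):
        # Called once per accept(), always from the serve_forever thread.
        request = super().get_request()
        self.connections += 1
        return request


def fetch_persistent(host, port, n):
    """Issue n GETs over one reused connection."""
    conn = http.client.HTTPConnection(host, port)
    try:
        for _ in range(n):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body before reusing
    finally:
        conn.close()


def fetch_reconnecting(host, port, n):
    """Issue n GETs, opening a fresh connection for each one."""
    for _ in range(n):
        conn = http.client.HTTPConnection(host, port)
        try:
            conn.request("GET", "/")
            conn.getresponse().read()
        finally:
            conn.close()


def count_connections(fetch, n):
    """Run fetch against a local server; return connections accepted."""
    server = CountingServer(("127.0.0.1", 0), OkHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        fetch("127.0.0.1", server.server_address[1], n)
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
    return server.connections


if __name__ == "__main__":
    print("persistent:  ", count_connections(fetch_persistent, 20))
    print("reconnecting:", count_connections(fetch_reconnecting, 20))
```

The sketch only counts connections; it does not measure requests/second.
Each extra connection costs a TCP handshake (and a TLS handshake on HTTPS
sites), which is the overhead that forked workers, each reopening its own
connection, pay again and again.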


On Wed, Mar 5, 2014 at 1:52 PM, Dave Hodgkinson <davehodg at gmail.com> wrote:

> Not in my experience! API and parsing time is almost always more. And your
> article says we're IO bound anyhow.
>
> Unless you have numbers...
>
>
> On Wed, Mar 5, 2014 at 12:23 PM, Stanislaw Pusep <creaktive at gmail.com> wrote:
>
> > Shameless self-promotion, but I could not resist when "parallel" was
> > mentioned:
> >
> >
> > http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html
> > My point is: forking parallel workers to crawl one single domain is a
> > terrible way of doing things, because of connection persistence.
> > Reopening the connection for each worker defeats the speed gain of
> > parallelism in the first place.
> >
> >
> > On Wed, Mar 5, 2014 at 12:31 PM, Dave Hodgkinson <davehodg at gmail.com>
> > wrote:
> >
> > > I've tended to use Parallel::Process where remote sites have been
> > > able to keep up and haven't been throttled, otherwise just let it run.
> > >
> > >
> > > On Tue, Mar 4, 2014 at 11:49 PM, Kieren Diment <diment at gmail.com> wrote:
> > >
> > > > Gearman's fine until you need a reliable queue.  It's certainly less
> > > > of a pain to set up than rabbitmq, but if you start with gearman and
> > > > find you need reliability after a while there's substantial pain to
> > > > be experienced (unless you already know all about your reliable job
> > > > queue implementation of choice).
> > > >
> > > > On 05/03/2014, at 10:35 AM, Jérôme Étévé wrote:
> > > >
> > > > > - For queuing jobs, I'm a big fan of Gearman. It's light, very
> > > > > stable and very simple.
> > > >
> > > >
> > > >
> > >
> >
>

