Web scraping frameworks?

Dave Hodgkinson davehodg at gmail.com
Fri Mar 7 13:28:46 GMT 2014


I'll give a talk!

Apropos previous discussions, I'll also try HTTP::Async instead of my usual
route 1. I think it fits better with the approach I'm taking at the moment.




On Fri, Mar 7, 2014 at 1:11 PM, Leo Lapworth <leo at cuckoo.org> wrote:

> Hi Dave,
>
> When you've finished please could you write a blog post?
>
> It would be a better way of sharing what you are doing (and you'd share
> with more people), and we'd also get a summary rather than blow-by-blow
> updates.
>
> Thanks
>
> Leo
>
>
>
> On 7 March 2014 12:58, Dave Hodgkinson <davehodg at gmail.com> wrote:
>
> >  Web::Scraper::LibXML is about 5x faster. I'll take that.
> >
> >
> >
> > On Fri, Mar 7, 2014 at 12:48 PM, Dave Hodgkinson <davehodg at gmail.com>
> > wrote:
> >
> > > 85% of the time is in XML::XPathEngine
> > >
> > >
> > > On Fri, Mar 7, 2014 at 12:40 PM, Dave Hodgkinson <davehodg at gmail.com>
> > > wrote:
> > >
> > >> He's not touched the repo for a couple of years and even then just for
> > >> cosmetic things. I don't hold out much hope there.
> > >>
> > >> I get the feeling I'm missing an XS something somewhere. Suppose I could
> > >> profile it.
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Mar 7, 2014 at 12:29 PM, Hernan Lopes <hernanlopes at gmail.com>
> > >> wrote:
> > >>
> > >>> ask miyagawa =)
> > >>>
> > >>>
> > >>> On Fri, Mar 7, 2014 at 8:48 AM, Dave Hodgkinson <davehodg at gmail.com>
> > >>> wrote:
> > >>>
> > >>> > OK, so I've worked out the DSL and am successfully scraping a page.
> > >>> >
> > >>> > It's taking a second to parse each page. Seems a bit much.
> > >>> >
> > >>> > Installing HTML::TreeBuilder::LibXML seemed like a good idea but didn't
> > >>> > make any difference.
> > >>> >
> > >>> > Any ideas on switches I can flip to make things faster?
> > >>> >
> > >>> >
> > >>> > On Tue, Mar 4, 2014 at 9:44 PM, Dave Cross <dave at dave.org.uk>
> > >>> > wrote:
> > >>> >
> > >>> > > On 04/03/14 21:33, DAVID HODGKINSON wrote:
> > >>> > >
> > >>> > >>
> > >>> > >> Does something exist?
> > >>> > >>
> > >>> > >> If it doesn't, does anyone want to help make it happen?
> > >>> > >>
> > >>> > >> I *really* don't want to have to write the code all over again ten
> > >>> > >> times...
> > >>> > >>
> > >>> > >
> > >>> > > Something like Web::Scraper, perhaps?
> > >>> > >
> > >>> > >   https://metacpan.org/pod/Web::Scraper
> > >>> > >
> > >>> > > Dave...
> > >>> > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

