Web scraping frameworks?

Joel Bernstein joel at fysh.org
Fri Mar 7 13:06:09 GMT 2014


Sounds like you've just got there yourself, but for the record there is also:
https://metacpan.org/pod/Web::Scraper::LibXML
which is a drop-in replacement for Web::Scraper that uses LibXML and ought to
perform better on larger documents.
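
If you want numbers to post, something along these lines might do as a
starting point. It is only a rough sketch: the sample HTML, the 'li.item a'
selector and the Pure/Fast package names are all made up for illustration,
and it assumes both modules are installed from CPAN. The only real change
between the two scrapers is the "use" line.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample page (an assumption); swap in HTML from a site you scrape.
my $html = '<html><body><ul>'
    . join('', map(qq{<li class="item"><a href="/item/$_">Item $_</a></li>}, 1 .. 200))
    . '</ul></body></html>';

# Both modules export the same DSL keywords (scraper, process), so each
# backend is loaded into its own package to keep the imports apart.
{
    package Pure;
    use Web::Scraper;
    sub build { scraper { process 'li.item a', 'titles[]' => 'TEXT'; } }
}
{
    package Fast;
    use Web::Scraper::LibXML;
    sub build { scraper { process 'li.item a', 'titles[]' => 'TEXT'; } }
}

my $pure = Pure::build();
my $fast = Fast::build();

# Sanity check: both backends should find the same number of links.
my $n_pure = @{ $pure->scrape($html)->{titles} };
my $n_fast = @{ $fast->scrape($html)->{titles} };
die "backends disagree: $n_pure vs $n_fast\n" unless $n_pure == $n_fast;

# Run each scraper for at least one CPU second and print a rate table.
cmpthese(-1, {
    'Web::Scraper'         => sub { $pure->scrape($html) },
    'Web::Scraper::LibXML' => sub { $fast->scrape($html) },
});
```

The ratio will depend heavily on the size and shape of the real pages, so
it is worth feeding it one of the documents you are actually scraping.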

/joel


On 7 March 2014 14:04, Joel Bernstein <joel at fysh.org> wrote:

> Can you show your numbers?
>
>
> On 7 March 2014 13:58, Dave Hodgkinson <davehodg at gmail.com> wrote:
>
>>  Web::Scraper::LibXML is about 5x faster. I'll take that.
>>
>>
>>
>> On Fri, Mar 7, 2014 at 12:48 PM, Dave Hodgkinson <davehodg at gmail.com>
>> wrote:
>>
>> > 85% of the time is in XML::XPathEngine
>> >
>> >
>> > On Fri, Mar 7, 2014 at 12:40 PM, Dave Hodgkinson <davehodg at gmail.com>
>> > wrote:
>> >
>> >> He's not touched the repo for a couple of years and even then just for
>> >> cosmetic things. I don't hold out much hope there.
>> >>
>> >> I get the feeling I'm missing an XS something somewhere. Suppose I
>> >> could profile it.
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Mar 7, 2014 at 12:29 PM, Hernan Lopes <hernanlopes at gmail.com>
>> >> wrote:
>> >>
>> >>> ask miyagawa =)
>> >>>
>> >>>
>> >>> On Fri, Mar 7, 2014 at 8:48 AM, Dave Hodgkinson <davehodg at gmail.com>
>> >>> wrote:
>> >>>
>> >>> > OK, so I've worked out the DSL and am successfully scraping a page.
>> >>> >
>> >>> > It's taking a second to parse each page. Seems a bit much.
>> >>> >
>> >>> > Installing HTML::TreeBuilder::LibXML seemed like a good idea but
>> >>> > didn't make any difference.
>> >>> >
>> >>> > Any ideas on switches I can flip to make things faster?
>> >>> >
>> >>> >
>> >>> > On Tue, Mar 4, 2014 at 9:44 PM, Dave Cross <dave at dave.org.uk>
>> >>> > wrote:
>> >>> >
>> >>> > > On 04/03/14 21:33, DAVID HODGKINSON wrote:
>> >>> > >
>> >>> > >>
>> >>> > >> Does something exist?
>> >>> > >>
>> >>> > >> If it doesn't, does anyone want to help make it happen?
>> >>> > >>
>> >>> > >> I *really* don't want to have to write the code all over again
>> >>> > >> ten times...
>> >>> > >>
>> >>> > >
>> >>> > > Something like Web::Scraper, perhaps?
>> >>> > >
>> >>> > >   https://metacpan.org/pod/Web::Scraper
>> >>> > >
>> >>> > > Dave...
>> >>> > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>>
>

