Parse-text-from-HTML CPAN module ?

Stephen Collyer scollyer at netspinner.co.uk
Fri Dec 9 18:00:45 GMT 2005


Ovid wrote:
> --- Stephen Collyer <scollyer at netspinner.co.uk> wrote:
> 
>>> <http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/lib/HTML/TokeParser/Simple/Token/Text.pm>
>>>
>>
>>Thanks. Still rather more low level than what I'd like ideally.
>>Maybe I should stop looking and start coding - it may be quicker.
> 
> 
> Agreed that it's lower level than what you want, but it does make
> extracting text pretty quick:
> 
>   my $parser = HTML::TokeParser::Simple->new( file => $file );
>   my $text   = '';
>   while (my $token = $parser->get_token) {
>       $text .= $token->as_is if $token->is_text;
>   }

Right. It doesn't look like a bad place to start; I guess processing
the HTML via a lexer-like interface gives lots of scope for
building up any required data structure on the fly.
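
For instance, something like the following might collect link text and
hrefs as the tokens go by. This is only a rough sketch: the sample HTML
and the @links/$current names are made up for illustration, and it
assumes the string => constructor argument and the is_start_tag,
is_end_tag and get_attr token methods documented for
HTML::TokeParser::Simple 3.x.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input, only for illustration.
my $html = '<p>See <a href="/one">first</a> and <a href="/two">second</a>.</p>';

my $parser = HTML::TokeParser::Simple->new( string => $html );

my @links;      # one { href => ..., text => ... } record per <a> element
my $current;    # the record being filled while we are inside an <a>

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag('a') ) {
        # Opening <a>: start a new record.
        $current = { href => $token->get_attr('href'), text => '' };
    }
    elsif ( $token->is_end_tag('a') ) {
        # Closing </a>: the record is complete.
        push @links, $current if $current;
        undef $current;
    }
    elsif ( $token->is_text && $current ) {
        # Text inside the link; append in case it arrives in pieces.
        $current->{text} .= $token->as_is;
    }
}

printf "%s => %s\n", $_->{href}, $_->{text} for @links;
```

Appending with .= rather than assigning means the result should not
depend on how the text happens to be split into tokens.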

BTW, I can't figure out from the POD what I get back from as_is.
Is it something à la the SAX characters() method, where the amount of
text returned is not defined, or is it a single whitespace-separated
word, or what? I guess this is covered in the HTML::TokeParser docs?
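
Failing that, a quick experiment along these lines would presumably
show how the text is chunked. Again just a sketch: the sample string is
made up, and it assumes the string => constructor argument from the
HTML::TokeParser::Simple 3.x POD.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample with runs of whitespace, a newline and inline markup.
my $html = "<p>Hello   brave\nnew <b>world</b>, goodbye.</p>";

my $parser = HTML::TokeParser::Simple->new( string => $html );

# Keep each text token separately so the chunk boundaries stay visible.
my @chunks;
while ( my $token = $parser->get_token ) {
    push @chunks, $token->as_is if $token->is_text;
}

# Brackets make leading and trailing whitespace in each chunk obvious.
printf "%d: [%s]\n", $_, $chunks[$_] for 0 .. $#chunks;
```

Whether each chunk turns out to be a word, a line, or everything
between two tags is exactly what the output should reveal.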

-- 
Regards

Stephen Collyer
Netspinner Ltd


More information about the london.pm mailing list