Parse-text-from-HTML CPAN module ?

Fri Dec 9 11:52:57 GMT 2005

On Fri, 2005-12-09 at 11:10, Stephen Collyer wrote:
> I have a search-related requirement to take some arbitrary HTML,
> parse out the text and stem it/apply stop words and so on. Now,
> I can cook something up myself with the usual set of modules, but
> this sounds like such a common requirement that someone will
> already have done it and packaged it up, in a nice reusable form.
> 
> Does anyone know if there's a nice, Pure Perl implementation of
> this that I can pick up and use with no further brain-power required ?
> (I'm wondering if there's something in the WWW::Mechanize area that
> is suitable, as that seems to have grown a lot since I last looked).

Getting just the text is a piece of piss with HTML::Parser:

#!/usr/bin/perl

use strict;
use warnings;

my $the_file =<<EOH;
<html>
<head><title>Test</title>
</head>
<body>
<h1>Test Title</h1>
<p>This is a test</p></body></html>
EOH

use HTML::Parser;
my $parser = HTML::Parser->new( text_h           => [ \&text_handler, "self,dtext" ],
                                start_document_h => [ \&init, "self" ] );

$parser->parse($the_file);
$parser->eof;    # signal end of document so any buffered text is flushed

print @{$parser->{_private}->{text}};

sub init
{
   my ( $self ) = @_;
   $self->{_private}->{text} = [];
}

sub text_handler
{
    my ( $self, $text) = @_;

    push @{$self->{_private}->{text}}, $text;
}
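
For the stop-word half of the question, a minimal sketch of what might come next, in plain Perl with no extra modules. The stop list and the words_from_text name are made up for illustration (they are not from any CPAN module), and stemming is not attempted here. It assumes you feed it the chunks collected in $parser->{_private}->{text} above:

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical stop-word list, for illustration only.
my %stop = map { $_ => 1 } qw(a an and is of the this to);

# Takes the text chunks collected by the parser and returns the
# lower-cased words they contain, with the stop words removed.
sub words_from_text
{
    my ( @chunks ) = @_;

    my @words;
    for my $chunk ( @chunks )
    {
        # A "word" here is a run of \w characters or apostrophes;
        # whitespace-only chunks simply contribute nothing.
        push @words, grep { !$stop{$_} }
                     map  { lc($_) }
                     $chunk =~ /([\w']+)/g;
    }
    return @words;
}

# With the script above this would be called as
# words_from_text( @{ $parser->{_private}->{text} } )
my @words = words_from_text( "Test", "Test Title", "This is a test" );
print join( " ", @words ), "\n";
```

A hash lookup keeps the stop-word check cheap, and a real stop list can be swapped in without touching the rest.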

/J\