Parse-text-from-HTML CPAN module?

Sat Dec 10 10:03:55 GMT 2005

Stephen Collyer wrote:
> I have a search-related requirement to take some arbitrary HTML,
> parse out the text and stem it/apply stop words and so on. Now,
> I can cook something up myself with the usual set of modules, but
> this sounds like such a common requirement that someone will
> already have done it and packaged it up, in a nice reusable form.

Not in a nice reusable form, but I have code you can cut-n-paste.

  http://wardley.org/perl/Search.pm

It's a hacked-up module I wrote as part of a project for a customer.
It's based on code I gleaned from Advanced Perl Programming.  
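
If all you need is the parse/stop/stem step on its own, a minimal sketch
might look like the following. This is not the code in Search.pm; it
assumes HTML::Parser and Lingua::Stem are installed from CPAN, and the
html_to_stems() name and the tiny stop list are made up for illustration.

```perl
use strict;
use warnings;

use HTML::Parser ();            # CPAN: event-driven HTML parser
use Lingua::Stem qw(stem);      # CPAN: stemmer, English by default

# Tiny illustrative stop list (an assumption); a real one is much longer.
my %STOP = map { $_ => 1 } qw(a an and are in is it of on or the to);

# Hypothetical helper: returns a reference to a list of lower-cased,
# stemmed words, with markup, scripts, styles and stop words removed.
sub html_to_stems {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded
        text_h      => [ sub { $text .= shift() . ' ' }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));   # skip non-content text
    $parser->parse($html);
    $parser->eof;

    # Split into ASCII words, lower-case them, drop stop words
    my @words = grep { !$STOP{$_} }
                map  { lc $_ } $text =~ /([A-Za-z]+)/g;

    return stem(@words);        # stem() returns an array reference
}

my $stems = html_to_stems(
    '<html><body><h1>Ringing the bells</h1>'
  . '<script>var x = "ignored";</script>'
  . '<p>The badgers are ringing.</p></body></html>'
);
print join(' ', @$stems), "\n";
```

The word split is ASCII-only to keep the sketch short; anything with
accented characters would need a proper tokeniser and a matching stemmer
locale.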

You can see it working here:

  http://wardray-premise.com/

You'll need to tweak it a bit to get it working.  Change 'WP::Base' to 
'Class::Base', provide your own config values instead of 'WP::Config',
and remove any user-specific search tweaks I may have added (unless 
you happen to be indexing many documents that contain the word "x-ray").

Usage is something like this:

   my $search = WP::Search->new();

   $search->index_file($path, { title    => "The Badger's Bell End",
                                keywords => "Badger, bell, machine gun" });

   my $results = $search->search("badger rabbit bell ringing");

   # $results->{ query   }    # original query
   # $results->{ words   }    # words in query
   # $results->{ stems   }    # stems of words in query
   # $results->{ results }    # list of results, each a hash containing
                              # document, relevance and percent items
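
Walking the results could then look something like this. It's only a
sketch: it assumes 'results' is an array reference and that the document
item can be printed as a plain string, neither of which is confirmed
above. The sample data and the format_hits() helper are made up.

```perl
use strict;
use warnings;

# Stand-in for what $search->search(...) returns. The values are invented
# and the array-ref shape of 'results' is an assumption.
my $results = {
    query   => 'badger rabbit bell ringing',
    results => [
        { document => 'badger.html', relevance => 1, percent => 33  },
        { document => 'bell.html',   relevance => 3, percent => 100 },
    ],
};

# Hypothetical helper: one formatted line per hit, most relevant first.
sub format_hits {
    my ($results) = @_;
    return map  { sprintf('%3d%%  %s', $_->{percent}, $_->{document}) }
           sort { $b->{relevance} <=> $a->{relevance} }
           @{ $results->{results} };
}

print "$_\n" for format_hits($results);
```

The sort is there in case the list doesn't come back ordered by
relevance; if it already does, it's harmless.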

HTH
A