Boolean-style text searching

Stephen Collyer scollyer at netspinner.co.uk
Fri Dec 9 16:05:39 GMT 2005


Paul Makepeace wrote:
> Are there any CPAN goodies that will effect text searching à la "(escort
> OR ford) AND NOT (estuary OR brook OR erotic services)"? Or at least
> some of the way there?
> 
> Paul
> 

Do you want to search unindexed text, or do you intend
to index it first? I have a Parse::RecDescent (P::RD)
grammar that does this; it relies on a binary-coded
inverted index that lives in a MySQL DB, thus:

> my $BooleanExpr = q{
> 
>     boolean : expr /\Z/
>               {
>                   blob2list($item[1]);
>               }
> 
>     expr    : disj
>               {
>                   $item[1];
>               }
> 
>     # \b keeps "or", "and" and "not" from matching the start of a longer word
>     disj    : <leftop: conj /(?:or\b|\|)/ conj>
>               {
>                   or_indices($item[1]);
>               }
> 
>     conj    : <leftop: unary /(?:and\b|\&)/ unary>
>               {
>                   and_indices($item[1]);
>               }
> 
>     unary   : /(?:not\b|\!)/ unary
>               {
>                   get_not_raw_word_index($item[2]);
>               }
>             |
>               atom
>               {
>                   $item[1];
>               }
>             |
>               '(' expr ')'
>               {
>                   $item[2];
>               }
> 
>     atom    : /[\w+#-]+/
>               {
>                   get_raw_word_index($item[1])
>               }
> 
> };

The blob2list/and_indices/etc. functions perform the
DB-related index operations, which I can dig out
if you're sufficiently interested, but this may be enough
to get you going. I hacked the binary index logic myself,
but ISTR there are some CPAN modules that will do this for
you now.
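
To give a rough idea of what those helpers do, here is a minimal
in-memory sketch. It is not the MySQL-backed code: the toy corpus,
the use of vec() bit strings as the "blob" format, and the body of
every function are assumptions made for illustration. Only the
function names come from the grammar. Each word maps to a bit string
in which bit N is set when document N contains the word, so AND, OR
and NOT become bitwise string operations.

```perl
use strict;
use warnings;

# Toy corpus (hypothetical data); the array position is the document id.
my @docs = (
    'fresh fish daily',
    'salmon is a fish',
    'carp and other fish',
    'no seafood here',
);

# Inverted index: word => bit string, bit N set if document N has the word.
my %index;
for my $id (0 .. $#docs) {
    for my $word (split ' ', lc $docs[$id]) {
        $index{$word} //= '';
        vec($index{$word}, $id, 1) = 1;
    }
}

# Mask with one bit set per document; needed so NOT ignores padding bits.
my $all = '';
vec($all, $_, 1) = 1 for 0 .. $#docs;

# Look up a word and pad the result to the full width of the corpus.
sub get_raw_word_index {
    my ($word) = @_;
    my $bits = $index{lc $word} // '';
    return $bits | ("\0" x length($all));
}

# Complement a bit string (assumed here to take an index, not a word).
sub get_not_raw_word_index {
    my ($bits) = @_;
    return ~$bits & $all;
}

# Intersect a list of bit strings, as handed over by <leftop:...>.
sub and_indices {
    my ($list) = @_;
    my $acc = $all;
    $acc &= $_ for @$list;
    return $acc;
}

# Union of a list of bit strings.
sub or_indices {
    my ($list) = @_;
    my $acc = "\0" x length($all);
    $acc |= $_ for @$list;
    return $acc;
}

# Turn the final bit string into a reference to a list of document ids.
sub blob2list {
    my ($bits) = @_;
    return [ grep { vec($bits, $_, 1) } 0 .. $#docs ];
}

# What the grammar would compute for "fish and not(salmon or carp)".
my $hits = blob2list(
    and_indices([
        get_raw_word_index('fish'),
        get_not_raw_word_index(
            or_indices([ get_raw_word_index('salmon'),
                         get_raw_word_index('carp') ])
        ),
    ])
);
print "@$hits\n";    # prints "0"
```

The mask in $all matters because ~ flips the unused padding bits in
the last byte as well, and they would otherwise show up as phantom
documents.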

You can run the grammar above with something like:

>use Parse::RecDescent;
>
>my $parser = Parse::RecDescent->new($BooleanExpr)
>    or die "Bad grammar\n";
>
>my $result = $parser->boolean("fish and not(salmon or carp)");

and so on.
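
If the text is unindexed instead, the same precedence (not binds
tighter than and, which binds tighter than or) can be sketched as a
small hand-written recursive descent matcher in core Perl. This is an
illustration only, not the P::RD grammar: matches() and the parse_*
subs are hypothetical names, and lower-casing both the query and the
text is my assumption.

```perl
use strict;
use warnings;

# disj : conj ( (or | '|') conj )*
sub parse_disj {
    my ($tok, $has) = @_;
    my $val = parse_conj($tok, $has);
    while (@$tok && $tok->[0] =~ /^(?:or|\|)$/) {
        shift @$tok;
        my $rhs = parse_conj($tok, $has);    # always consume the operand
        $val ||= $rhs;
    }
    return $val;
}

# conj : unary ( (and | '&') unary )*
sub parse_conj {
    my ($tok, $has) = @_;
    my $val = parse_unary($tok, $has);
    while (@$tok && $tok->[0] =~ /^(?:and|&)$/) {
        shift @$tok;
        my $rhs = parse_unary($tok, $has);
        $val &&= $rhs;
    }
    return $val;
}

# unary : (not | '!') unary | '(' disj ')' | word
sub parse_unary {
    my ($tok, $has) = @_;
    my $t = shift @$tok;
    die "unexpected end of query\n" unless defined $t;
    return !parse_unary($tok, $has) if $t =~ /^(?:not|!)$/;
    if ($t eq '(') {
        my $val   = parse_disj($tok, $has);
        my $close = shift @$tok;
        die "missing ')'\n" unless defined $close && $close eq ')';
        return $val;
    }
    die "unexpected '$t'\n" if $t =~ /^(?:\)|and|or|[&|])$/;
    return exists $has->{$t} ? 1 : 0;        # a word: is it in the text?
}

# Return 1 if $text satisfies the boolean $query, 0 otherwise.
sub matches {
    my ($query, $text) = @_;
    my @words = lc($text) =~ /[\w+#-]+/g;
    my %has   = map { ($_ => 1) } @words;
    my @tok   = lc($query) =~ /[()|&!]|[\w+#-]+/g;
    my $val   = parse_disj(\@tok, \%has);
    die "trailing input: @tok\n" if @tok;
    return $val ? 1 : 0;
}

print matches("fish and not(salmon or carp)", "Fresh fish daily"), "\n";
print matches("fish and not(salmon or carp)", "Salmon is a fish"), "\n";
```

This rescans the text for every query, so it only makes sense for
small amounts of text; the indexed approach above is the one to use
for anything sizeable.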

-- 
Regards

Stephen Collyer
Netspinner Ltd


More information about the london.pm mailing list