Finding the intersection between two regexes

Paul Makepeace paulm at paulm.com
Tue Apr 22 17:05:58 BST 2014


On Tue, Apr 22, 2014 at 4:16 AM, David Cantrell <david at cantrell.org.uk> wrote:
> On Sun, Apr 20, 2014 at 10:14:48PM -0400, Mark Fowler wrote:
>> On Sunday, April 20, 2014, David Cantrell <david at cantrell.org.uk> wrote:
>> > Can anyone point me at some code on the CPAN that, given two regexes,
>> > can figure out whether there are any bits of text that will be matched
>> > by both?
>> I'm not sure I understand the question here, or moreover why you want to do
>> this..is it just an intellectual exercise?
>
> I do actually have a use for it, which would help to explain the
> question.
>
> A large part of Number::Phone is based on data in google's
> libphonenumber project. That has, for most countries, regular
> expressions that match valid fixed lines and valid mobiles. For some
> countries those two regexes can both match some of the same numbers.
> Here's the data:
>   http://goo.gl/hTBAhZ
>
> If you look at the data for Barbados, they have for fixed lines:
>   246[2-9]\d{6}
>
> and for mobiles:
>   246(?:(?:2[346]|45|82)\d|25[0-4])\d{4}

I'd go out on a limb and say that the complete list of overlapping
situations all share a /^prefix/ like this. (This doesn't necessarily
help you since you'd have to exhaustively falsify it but if you're
going for the quick win I bet just looking for prefixes gets you
most/all of the way.)

> then some strings will match both expressions - 2462303333, for example.
> But if you look at the data for Jamaica there are no strings that match
> both regexes.
>
> At the moment I detect these overlaps (and then throw the regexes away
> as being unfit for my purpose) by just going through each country's
> number space. This is practical for NANP countries as I can do it
> all with only about a million comparisons in the worst possible case. It
> would be impractical to apply this to the whole world though.

If your goal is to simply identify overlaps rather than generate
encompassing regexes, you could try attacking it with
intelligently/heuristically generated random numbers.

Paul

>
> --
> David Cantrell | Bourgeois reactionary pig


More information about the london.pm mailing list