These aren't the characters you're looking for...

Abigail abigail at abigail.be
Tue Aug 19 11:21:08 BST 2008


On Tue, Aug 19, 2008 at 10:46:45AM +0100, Andy Wardley wrote:
> I mistakenly wrote this the other day:
>
>     [\s^\n]
>
> What I wanted was to match a whitespace character that wasn't a newline.
>
> Of course, it doesn't work.  The '^' must be at the start for it to work as a
> character class negatorificator.  And you can't mix "inny" classes with "outy"
> classes.  That's just not allowed.
>
> Of course, I could just write this:
>
>     [ \t]
>
> But that doesn't include the Unicode whitespace characters which \s would
> normally match.  So I ended up writing this:
>
>     [ \t\x{85}\x{2028}\x{2029}]
>
> First question: is it safe to match a regex containing Unicode code points
> against a non-unicode string?  I'm sure it is, and it seems to work OK, but my

Yes. In fact, it's safer to do it this way than to use \s.

That's because whether "\x85" matches \s depends on context. If the source
string has the UTF8 flag set or, in some cases (but not in others), the
pattern has the UTF8 flag set, "\x85" will match \s. Otherwise, it won't.

You are also missing quite a number of characters that would match \s, but
aren't included in [ \t\x{85}\x{2028}\x{2029}]. \s matches 25 characters,
including \r and \cL. NEXT LINE (\x{85}) and NO-BREAK SPACE (\x{A0}) only
match with Unicode semantics.

> subconscious woke me up at 3am this morning to remind me to check.  My Camel
> is a little old (3rd ed - 5.6.0) and talks of problems in Unicode processing
> that "will probably be fixed by the time you read this".  Can I tell my
> subconscious to stop worrying and go back to snuggle-bunny land?
>
> Second: am I missing something obvious?  Is there a better way to do it?

You might want to use:

    (?!\n)[\h\v]

which should match any whitespace except the newline (including the vertical
tab, which isn't matched by \s), and will match the same set of characters
regardless of whether it's using Unicode semantics or not.
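
Here is a rough way to check that claim for the code points below 256,
assuming perl 5.10 or later (needed for \h and \v). Each character is
tested once as a plain byte string and once after utf8::upgrade; the
variable names are made up for the example:

```perl
use strict;
use warnings;

my $ws_not_nl = qr/(?!\n)[\h\v]/;    # needs Perl 5.10+ for \h and \v

my (@native_hits, @upgraded_hits);
for my $cp (0 .. 0xFF) {
    my $native   = chr $cp;          # byte string, no UTF8 flag
    my $upgraded = chr $cp;
    utf8::upgrade($upgraded);        # same character, UTF8 flag set

    push @native_hits,   $cp if $native   =~ $ws_not_nl;
    push @upgraded_hits, $cp if $upgraded =~ $ws_not_nl;
}

# Print the matching code points in hex, one list per string flavour.
printf "native:   %s\n", join ' ', map { sprintf '%02X', $_ } @native_hits;
printf "upgraded: %s\n", join ' ', map { sprintf '%02X', $_ } @upgraded_hits;
```

Both lists should come out identical, and should include 85 and A0 but
not 0A.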

Alternatively, you can use:

   [^\S\n]

but that suffers from the same problem as \s for the code points \x{85} and
\x{A0}: whether they match depends on whether Unicode semantics are in use.
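
A small sketch of the problem with those two code points, again assuming
default regex rules (no unicode_strings feature, no /u, /a or /l) and
using utf8::upgrade to set the UTF8 flag. The hash %result and the other
names are just for illustration:

```perl
use strict;
use warnings;

my $not_S_not_nl = qr/[^\S\n]/;

my %result;
for my $cp (0x85, 0xA0) {
    my $native   = chr $cp;          # byte string, no UTF8 flag
    my $upgraded = chr $cp;
    utf8::upgrade($upgraded);        # same character, UTF8 flag set

    $result{$cp} = {
        native   => ($native   =~ $not_S_not_nl ? 1 : 0),
        upgraded => ($upgraded =~ $not_S_not_nl ? 1 : 0),
    };
    printf "U+%04X native: %d upgraded: %d\n",
        $cp, $result{$cp}{native}, $result{$cp}{upgraded};
}
```

I'd expect the native copies not to match and the upgraded copies to match,
which is the inconsistency the \h and \v version avoids.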


Abigail

