These aren't the characters you're looking for...

Dominic Thoreau dominic.thoreau at googlemail.com
Tue Aug 19 11:55:20 BST 2008


2008/8/19 Abigail <abigail at abigail.be>:
>>
>> First question: is it safe to match a regex containing Unicode code points
>> against a non-unicode string?  I'm sure it is, and it seems to work OK, but my
>
> Yes. In fact, it's safer to do it this way than to use \s.
>
> Because whether "\x85" matches \s depends. If the source string has the
> UTF8 flag set, or, in some cases (but not in other cases), the pattern
> has the UTF8 flag set, "\x85" will match \s. Otherwise, it won't.
>
> You are also missing quite a number of characters that would match \s, but
> aren't included in [ \t\x{85}\x{2028}\x{2029}]. \s matches 25 characters,
> including \r, and \cL. NEXT LINE (\x{85}) and NO-BREAK SPACE (\x{A0}) only
> match with Unicode semantics.

I was trying to write some code the other day to separate out words,
as part of a web spidering project.
Splitting on punctuation and whitespace works just fine for languages
written in Latin script, but falls down a little on others.

Korean isn't so bad, and there are some rules you can apply to Japanese
[1] that help, but Chinese is more of a challenge. Some words are
single characters, some are multiple characters, and whitespace isn't
really used to separate them.

[1] At a minimum, treat every change of character type ( qw{ Kanji
Hiragana Katakana Romaji Numerics Punctuation} ) as a word boundary,
*except* Kanji->Hiragana, which is used for adding suffixes to verbs.
-- 
"Any technology distinguishable from magic is insufficiently advanced"

