UTF-8 + HTML::Template + CGI::Fast

Dirk Koopman djk at tobit.co.uk
Fri Dec 4 14:48:38 GMT 2009


James Laver wrote:

> This is one of the fun things about character sets. There are three
> ways to determine character set:
> 

<snip>

> 3. Checking if it looks like a given character set (very lossy). Eg.
> the is_utf8() function only checks if it *could* be utf-8. If you pass
> it ascii text, it'll pass. Subsets of some other character sets will
> also pass. There are no guarantees, just percentage chances. Not
> exactly the world's best fallback.
>

When I asked a related question on this list and then read the docs with
more educated eyes, I got the impression that the is_utf8 function
merely tells you whether the string is held in Perl's internal utf8 format
(i.e. whether its UTF8 flag is on), which has nothing to do with what
encoding the string came in as. It is very confusing.
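
To illustrate the distinction, here is a minimal sketch (not anything from the original thread; the sample strings and variable names are made up for the example). It assumes the core Encode module and shows that utf8::is_utf8 reports how Perl is storing the string, not what encoding the input was in:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they might arrive off the wire: "cafe" with e-acute, encoded as UTF-8.
my $bytes = "caf\xc3\xa9";

# Valid UTF-8 input, yet the flag is off: Perl sees five plain bytes.
my $flag_on_bytes = utf8::is_utf8($bytes) ? 1 : 0;

# After decoding, Perl holds four characters and the flag is on.
my $chars = decode('UTF-8', $bytes);
my $flag_on_chars = utf8::is_utf8($chars) ? 1 : 0;

# A Latin-1 string that was never UTF-8 on input can carry the flag too:
# utf8::upgrade changes only the internal storage, not the string's value.
my $latin1   = "caf\xe9";
my $upgraded = $latin1;
utf8::upgrade($upgraded);
my $flag_on_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

printf "bytes:    flag=%d length=%d\n", $flag_on_bytes, length($bytes);
printf "decoded:  flag=%d length=%d\n", $flag_on_chars, length($chars);
printf "upgraded: flag=%d same value as latin1: %s\n",
    $flag_on_upgraded, ($upgraded eq $latin1 ? 'yes' : 'no');
```

So a string that arrived as perfectly good UTF-8 can report false, and one that arrived as Latin-1 can report true, which is presumably why the function is no use for guessing the input encoding.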

Because I have mixed input coming into my app, and I can't tell reliably
enough (for me) what it is (it could be any of the ISO variants or
UTF-8), I don't bother with any of it and have removed all attempts to
decode it. I just treat it all as opaque strings. As it is a message
switch, it becomes SEP or a UAP.
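
For what it's worth, the "don't decode, just pass it on" approach might look something like the sketch below. This is only an illustration under assumptions: relay_message is a hypothetical helper, not code from the actual switch, and the in-memory filehandles stand in for whatever sockets or pipes the real application uses.

```perl
use strict;
use warnings;

# Hypothetical helper: copy one message from $in to $out without decoding it.
# Returns the number of bytes relayed.
sub relay_message {
    my ($in, $out) = @_;

    # No encoding layers on either side, so bytes pass through untouched.
    binmode $in;
    binmode $out;

    local $/;                       # slurp the whole message
    my $msg = <$in>;
    $msg = '' unless defined $msg;  # empty input

    print {$out} $msg;
    return length $msg;
}

# Mixed input: Latin-1 e-acute (\xe9) followed by UTF-8 e-acute (\xc3\xa9).
my $input  = "caf\xe9 caf\xc3\xa9\n";
my $output = '';

open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";
my $count = relay_message($in, $out);
close $out;
close $in;

print "relayed $count bytes, unchanged: ", ($output eq $input ? 'yes' : 'no'), "\n";
```

The point of the design is that the switch never has to guess: whatever bytes came in go out again, and interpreting them is left to the receiving end.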

Dirk



