utf8 oddness

Wed Jun 10 17:47:50 BST 2009

Paul Makepeace writes:
> rpix:~$ perl -le 'print ord("À")'
> 195
> 
> What does 195 refer to? 195 is \xC3 which is another character,
> according to http://jeppesn.dk/utf-8.html (A~ versus A`)

For compatibility, Perl assumes that source code is in Latin-1.  I
assume that your terminal uses UTF-8.  So when you type À, you get the
UTF-8 representation of that Unicode character, which consists of the
two bytes 0xC3 and 0x80.  With Latin-1-encoded source, Perl treats a
string literal containing those bytes as the two-character string
"\xC3\x80"; then ord returns the codepoint for the first character in
that string.

You can change this by telling Perl that your source code is in UTF-8:

    $ perl -le 'use utf8; print ord("À")'
    192
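
If it helps to see the two interpretations side by side, here is a
minimal sketch.  It writes the bytes as escapes so that it doesn't
depend on how the script file is encoded, and uses Encode's decode()
to stand in for roughly what "use utf8" arranges for a literal; the
variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes a UTF-8 terminal sends for "À", written as escapes so
# that this example does not depend on how the file itself is encoded.
my $bytes = "\xC3\x80";

# Decode those bytes as UTF-8 to get the character string they represent.
my $chars = decode("UTF-8", $bytes);

# Without decoding: two characters, the first being U+00C3 (195).
printf "%d character(s), first codepoint %d\n", length($bytes), ord($bytes);

# After decoding: one character, U+00C0 (192).
printf "%d character(s), first codepoint %d\n", length($chars), ord($chars);
```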

> rpix:~$ perl -le 'print chr(195)'
> ##
> 
> What's happening here?

Perl also assumes that stdout is Latin-1.  chr returns a one-character
string, which gets printed to the Latin-1-encoded stdout as a single
0xC3 byte.  On its own, that byte is an incomplete UTF-8 sequence, so
a UTF-8 terminal can't display it.

There are various ways to tell Perl that stdout is UTF-8, including these:

    $ perl -CS -le 'print chr(195)'
    Ã
    $ perl -le 'binmode STDOUT, ":utf8"; print chr(195)'
    Ã
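
To see the actual bytes rather than whatever your terminal makes of
them, you can print to an in-memory filehandle and dump the buffer as
hex.  This is only a sketch: bytes_written is a hypothetical helper of
mine, and I'm assuming that an in-memory handle with no layer behaves
like the default stdout.

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory filehandle opened
# with the given PerlIO layer ('' for none) and return the bytes that
# were written, as a hex string.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return unpack 'H*', $buf;
}

print bytes_written(chr(195), ''),      "\n";   # c3   (one Latin-1 byte)
print bytes_written(chr(195), ':utf8'), "\n";   # c383 (UTF-8 for U+00C3)
```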

> rpix:~$ perl -le 'print "\xc3\x80"'
> À
> 
> (So printing utf8 octets produces something reasonable.)

But only by coincidence -- the stream you printed to actually expects
to receive UTF-8-encoded data, but Perl thinks the stream uses Latin-1
encoding.  The only reason it seemed to work is that you just happened
to print a string whose Latin-1-encoded bytes could be reinterpreted
by your terminal as valid UTF-8.

> rpix:~$ perl -MEncode -le 'print decode("iso-8859-1", chr(195))'
> ##
> 
> What's this doing? Presumably chr(195) isn't \xC3 in Latin-1 so what is it?

On the contrary, chr(195) eq chr(0xC3) eq "\xC3" always.

Your code here manufactures a string containing the single character
with codepoint U+00C3; that string is byte-encoded internally, so it
consists of the single byte 0xC3.  Then decode() takes that single-byte
string and decodes it from Latin-1 into a Perl character string, and
print outputs the result.  Whichever of Perl's two internal encodings
decode() picks for its return value, the string is still the single
character U+00C3, so as far as the output is concerned the decode()
step did nothing at all: print still writes the lone byte 0xC3 to the
Latin-1 stdout.

This code is questionable, by the way.  chr returns a string in either
of Perl's two internal encodings, but decode expects a byte-encoded
string.  In this case it won't matter, because chr in all current
Perls produces a byte-encoded string for codepoints <= 255.  But if
the 195 varied at run time, and the actual value could be greater than
255, you'd get an exception.
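
Here's a small sketch of that failure mode.  try_decode is a
hypothetical helper, and I deliberately don't show the text of the
exception, since that has varied between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: build a one-character string with chr, hand it
# to decode() as if it were Latin-1 bytes, and report whether that died.
sub try_decode {
    my ($codepoint) = @_;
    my $string = chr($codepoint);
    my $ok = eval { decode("iso-8859-1", $string); 1 };
    return $ok ? "decoded" : "died";
}

print "195: ", try_decode(195), "\n";   # fits in one byte
print "300: ", try_decode(300), "\n";   # a wide character, not a byte
```

I'd expect the first call to report "decoded" and the second "died".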

> rpix:~$ perl -MEncode -le '$a = chr(195); print decode("iso-8859-1",
> $a, Encode::FB_CROAK)'
> ##
> 
> Why no croaking?

Because it was possible to decode the input without error as Latin-1.
More generally, *any* Perl string which is byte-encoded internally
can be decoded without error as Latin-1, because all single-byte
codepoints have character allocations in Latin-1.
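
As a rough check of that claim, the sketch below tries every byte
value as Latin-1 with FB_CROAK, and then, for contrast, tries a lone
0xC3 byte as UTF-8.  The variable names are mine; each input is a
fresh copy because decode() with a CHECK argument may modify its
input in place.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Try all 256 byte values as Latin-1 and collect any that croak.
my @rejected;
for my $value (0 .. 255) {
    my $byte = chr($value);   # fresh copy for each attempt
    eval { decode("iso-8859-1", $byte, Encode::FB_CROAK); 1 }
        or push @rejected, $value;
}
print scalar(@rejected), " of 256 byte values rejected as Latin-1\n";

# For contrast: a lone 0xC3 is an incomplete sequence in UTF-8.
my $lone = "\xC3";
my $ok = eval { decode("UTF-8", $lone, Encode::FB_CROAK); 1 };
print "lone 0xC3 as UTF-8: ", ($ok ? "decoded" : "croaked"), "\n";
```

None of the byte values should be rejected as Latin-1, while the
UTF-8 attempt should croak.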

> rpix:~$ perl -MEncode=from_to -le '$a = chr(195); from_to($a,
> "iso-8859-1", "utf8", Encode::FB_CROAK); print $a'
> Ã
> rpix:~$
> 
> Ah, from_to works where decode didn't. But why? My understanding is
> that from_to is the same except leaves the utf8 flag off. Reassuringly
> at least, the character printed there IS Latin-1's \xC3 (not the
> slightly different accent).

Your use of from_to() here is roughly equivalent to

    $a = encode("utf8", decode("iso-8859-1", $a))

The important part is the encode() step: it encodes the output string
to the bytes that represent it in UTF-8.  Since your terminal uses
UTF-8, this produces output you can see.  (Telling Perl that stdout
is UTF-8-encoded has the same effect, but the transcoding to UTF-8
happens where you can't see it and don't have to worry about it.)
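
A quick sketch comparing the two spellings, with the result dumped as
hex so that the terminal's encoding doesn't get in the way ($in_place
and $two_step are just illustrative names):

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# from_to() converts its first argument in place.
my $in_place = chr(195);
from_to($in_place, "iso-8859-1", "utf8");

# The same conversion spelled out as decode-then-encode.
my $two_step = encode("utf8", decode("iso-8859-1", chr(195)));

print unpack('H*', $in_place), "\n";                       # c383
print $in_place eq $two_step ? "same\n" : "different\n";   # same
```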

> rpix:~$ perl -MEncode -le 'print Encode::is_utf8("À")'
> 
> How can this not be true?

Because it's a two-character byte-encoded string; there's no UTF-8
here, since you haven't told Perl to expect any, and you haven't used
any characters whose codepoint is high enough to require Perl to use
UTF-8.  And Encode::is_utf8() is documented to just examine the
internal flag that indicates which internal encoding is in use.
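
To illustrate that the flag describes the internal representation
rather than the contents, here is a sketch with three strings: the
two-byte literal from your example (as escapes), a character too big
for one byte, and a one-character string pushed into the other
internal encoding with utf8::upgrade.  The names are mine.

```perl
use strict;
use warnings;
use Encode ();

my $literal  = "\xC3\x80";   # the "À" literal as seen without "use utf8"
my $wide     = chr(0x100);   # codepoint too large for a single byte
my $upgraded = "\xC3";
utf8::upgrade($upgraded);    # same character, other internal encoding

for my $case (["literal", $literal], ["wide", $wide],
              ["upgraded", $upgraded]) {
    my ($name, $string) = @$case;
    printf "%-8s flag %s\n", $name,
        Encode::is_utf8($string) ? "on" : "off";
}

# The upgraded string is still the same string as far as eq can tell.
print "upgraded string still equals \\xC3\n" if $upgraded eq "\xC3";
```

The flag should be off for the literal and on for the other two, even
though the upgraded string still compares equal to "\xC3".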

> rpix:~$ perl -MEncode -le 'print Encode::is_utf8("À", Encode::FB_CROAK)'
> 
> It's not utf8 but it's not croaking either, ...?

The second argument to Encode::is_utf8() isn't for specifying fallback
behaviour, it's for saying that, if the string is internally marked as
using the multi-byte UTF-8-like encoding, its data should also be
examined to see whether it's valid in that encoding.  But since the
internal flag says "no UTF-8 on this string", that doesn't actually
apply.

For more information on all this, I recommend Juerd's perlunitut
documentation, as shipped with Perl 5.8.9 and 5.10.

-- 
Aaron Crane ** http://aaroncrane.co.uk/