Encoding/decoding

Jurgen Pletinckx jurgen.pletinckx at gmail.com
Wed Mar 31 17:50:57 BST 2010


Dave Hodgkinson wrote
| Welcome to perl encoding hell.

Nice and toasty in here, isn't it?


Dirk Koopman wrote
| It's (probably) not actually chewed up. It is what utf8 looks like 
| when you display it in iso-8859-* or some form of ascii or M$/IBM 
| codepage.

| There may actually be nothing to do other than make sure that the 
| language environment variable is set correctly (if using something 
| like a terminal window), I have "LANG=en_US.UTF-8" set on mine.

| Or, if we are talking web pages, make sure that (unlike CPAN) you 
| have a character set declaration in the head, such as:
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

Aha! That does help. Yes, web pages is where it's at. Except I assumed the
text was mangled long before I pushed it out to the website and it got close
to a user's eyeballs.

When I add the meta tag above to a page's head, the W3 validator complains
that meta/http-equiv says UTF-8, but the actual HTTP headers say ISO-8859-1,
and it's inclined to believe those. And the httpd.conf for the site contains

    AddDefaultCharset UTF-8

Bah. Can't even trust what you read in a conf file these days.
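
One way to see what the server really sends, rather than what httpd.conf
promises, is to look at the Content-Type response header directly. The
following is only a sketch: charset_of and served_charset are hypothetical
helper names, HTTP::Tiny is assumed to be installed (it ships with Perl 5.14
and later), and the URL in the comment is a placeholder.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value. Returns undef when no charset is declared.
sub charset_of {
    my ($header) = @_;
    return $header =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i ? uc($1) : undef;
}

# Hypothetical helper: ask the server for its headers only (HEAD request)
# and report the charset it actually declares for that URL.
sub served_charset {
    my ($url) = @_;
    my $res = HTTP::Tiny->new->head($url);
    return charset_of($res->{headers}{'content-type'} // '');
}

# No network needed for this part: parse a header like the one the
# validator complained about.
print charset_of('text/html; charset=ISO-8859-1'), "\n";   # ISO-8859-1

# Against a live site (placeholder URL):
#   print served_charset('http://www.example.com/'), "\n";
```

If the header and the meta tag disagree, browsers and the validator go with
the header, so that is the one worth checking first.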

And Mark Fowler wrote
| It's very hard for anyone to work out the solution to this unless we
| know *exactly* what is in the files, not how it's being rendered.
| 
| What's the exact bytes stored in the files?  Or more bluntly, what
| does this print:
| 
| perl -e 'use Devel::Peek; open my $fh, "<:bytes", "filename" or die
| $!; undef $/; Dump <$fh>'

You say blunt, I call it idiot-proof. Anyway, 

SV = PV(0x703ae8) at 0x72bbb0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x733bb0 "Plat pr\303\251f\303\251r\303\251\nDas M\344dchen Jeanne
d\264Arc (Kr\374ck von Poturzyn, Maria J.)\n"\0
  CUR = 72
  LEN = 88

Am I correct in thinking that \303\251 is correct utf-8 for é (e-aigu), and
\344 correct latin-1 for ä (a-trema)? And that I'm going to burn for using
them mixed up with one another, as \303\251 is _also_ correct latin-1 for Ã©
(A-tilde copyright)?
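
To make the two readings concrete, here is a small sketch that decodes the
same bytes both ways with the Encode module and prints the resulting code
points. The codepoints helper is a made-up name for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $bytes as $encoding and list the code points
# of the resulting characters.
sub codepoints {
    my ($encoding, $bytes) = @_;
    my $text = decode($encoding, $bytes);
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# The same two bytes, read two ways.
print "UTF-8:   ", codepoints('UTF-8',      "\303\251"), "\n";  # U+00E9
print "Latin-1: ", codepoints('ISO-8859-1', "\303\251"), "\n";  # U+00C3 U+00A9

# The single byte from the second line of the dump, read as Latin-1.
print "Latin-1: ", codepoints('ISO-8859-1', "\344"), "\n";      # U+00E4
```

U+00E9 is e-aigu, U+00C3 U+00A9 is A-tilde followed by the copyright sign,
and U+00E4 is a-trema. The bytes are never wrong on their own; only the
label attached to them is.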

Thanks, I feel positively enlightened! Of course, I would still like all
that text to use a single encoding. "How hard could it be?"
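
One possible approach, sketched here under loud assumptions: every line is
wholly in one encoding, and anything that is not valid UTF-8 is Latin-1 (it
could just as well be cp1252 or something else). The idea is to try a strict
UTF-8 decode per line and fall back to Latin-1 when that fails. decode_line
is a hypothetical helper name, and the sample lines are the ones from the
dump above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn one line of raw bytes into a character string.
# Strict UTF-8 is tried first; FB_CROAK makes decode() die on malformed
# input, and LEAVE_SRC keeps $bytes intact for the fallback.
sub decode_line {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;
    # Assumption: whatever is not valid UTF-8 is Latin-1.
    return decode('ISO-8859-1', $bytes);
}

my @lines = (
    "Plat pr\303\251f\303\251r\303\251\n",
    "Das M\344dchen Jeanne d\264Arc (Kr\374ck von Poturzyn, Maria J.)\n",
);

# Write everything back out in a single encoding.
binmode STDOUT, ':encoding(UTF-8)';
print decode_line($_) for @lines;
```

The guess can go wrong in one direction: Latin-1 text that happens to form
valid UTF-8 (someone who really did mean A-tilde copyright) is silently read
as UTF-8. For ordinary prose that is rare, but it is a heuristic, not a proof.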

--
Jurgen Pletinckx




