UTF-8 + HTML::Template + CGI::Fast

James Laver james.laver at gmail.com
Fri Dec 4 12:00:23 GMT 2009


On Fri, Dec 4, 2009 at 11:49 AM, Philip Potter
<philip.g.potter at gmail.com> wrote:
> I don't know if this problem is in general solvable, because user
> agents are not required to declare what encoding they are using to
> submit form contents. Even when the form uses the
> accept-charset="utf-8" attribute to restrict the user agent to only
> one charset, firefox doesn't append charset=utf-8 to the Content-type:
> HTTP header.
>
> I don't see how you're supposed to guess what encoding the user agent
> used if it won't tell you. Does anyone else have any ideas?
>
> Phil
>

This is one of the fun things about character sets. There are three
ways to determine the character set of some data:

1. Separate data that tells you what the character set is. That would
be the HTTP headers; many browsers do set them (and most servers do,
making life easier on browser developers).
2. Character set data embedded in the data itself, with everything
before that point being in a specified, required character set. That
would be HTML declaring charset=utf-8 in a meta http-equiv tag (alas,
you're not a browser, so that isn't going to help).
3. Checking if it looks like a given character set (very lossy). E.g.
the is_utf8() function only checks if it *could* be utf-8. If you pass
it ASCII text, it'll pass. Subsets of some other character sets will
also pass. There are no guarantees, just percentage chances. Not
exactly the world's best fallback.
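
To make options 1 and 3 concrete, here is a rough sketch in Python
(chosen only for illustration; nothing here is specific to
HTML::Template or CGI::Fast). The function names and the iso-8859-1
fallback are my own assumptions, not anything a framework gives you.
It trusts a charset declared in the Content-Type header if there is
one, then tries a strict UTF-8 decode, then falls back.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Option 1: return the charset declared in a Content-Type header, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def could_be_utf8(raw):
    """Option 3: True if the bytes are well-formed UTF-8. Proves nothing more."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def decode_form_value(raw, content_type, fallback="iso-8859-1"):
    """Hypothetical helper: return (text, charset_used) for one form value.

    The fallback charset is an assumption; pick whatever suits your users.
    """
    declared = charset_from_content_type(content_type)
    if declared:
        try:
            return raw.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass  # the declaration was unknown or did not match the bytes
    if could_be_utf8(raw):
        return raw.decode("utf-8"), "utf-8"
    return raw.decode(fallback), fallback


# The same two bytes are valid in both encodings, which is why option 3
# is only ever a guess.
ambiguous = b"\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")         # one character: e with acute accent
as_latin1 = ambiguous.decode("iso-8859-1")  # two characters: A-tilde, copyright sign
```

Note that pure ASCII passes could_be_utf8() as well, so a "yes" only
tells you that decoding as UTF-8 will not blow up.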

Of course, if your choice is between a potentially lying piece of
software (software, it's hateful) and percentage chances, your chances
of it working right most of the time are nonexistent. Better to give
up.

--James

