Encodings

by havoc

Since GTK+ went UTF-8, GTK+ programs have had to figure out text
encoding (even for Latin-1 users, who often happily ignored it
before). The rule we beat people over the head with is that all text
must be either:

  • In some way encoding-tagged, like HTML
  • Defined by specification to be in a particular
    encoding

Joel just posted a nice article expanding on that in more detail.
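To make that concrete in GTK+ terms: GLib ships conversion and
validation routines, so the usual pattern is to convert text to UTF-8
at the boundary where it enters the program, and validate anything of
unclear origin before handing it to GTK+. A minimal sketch (the
command-line argument here is just a stand-in for any locale-encoded
input):

    /* Convert locale-encoded input to the UTF-8 GTK+ expects.
     * Build with: gcc demo.c `pkg-config --cflags --libs glib-2.0` */
    #include <glib.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main (int argc, char **argv)
    {
      GError *error = NULL;
      gchar *utf8;

      setlocale (LC_ALL, "");  /* so GLib knows the locale's charset */

      if (argc < 2)
        return 1;

      /* argv[] arrives in the locale encoding; convert it as soon as
       * it enters the program, not deep inside the GUI code. */
      utf8 = g_locale_to_utf8 (argv[1], -1, NULL, NULL, &error);
      if (utf8 == NULL)
        {
          fprintf (stderr, "conversion failed: %s\n", error->message);
          g_error_free (error);
          return 1;
        }

      /* Redundant after a successful conversion, but this is the
       * right check for any string whose origin is unclear. */
      g_assert (g_utf8_validate (utf8, -1, NULL));

      printf ("%s\n", utf8);
      g_free (utf8);
      return 0;
    }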

For Linux, the ongoing problem seems to be filenames. Lots of disks
out there contain untagged random-locale-encoding filenames. The ext3
filesystem is defined to have UTF-8 filenames, but apparently POSIX
requires the kernel to pass through filenames 8-bit clean without
validation, so of course people have been sticking all kinds of
non-UTF-8 junk on their filesystem. In the bad cases, you have an NFS
mount with 1000 users running different locales. Oops.
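GLib's answer, for what it's worth, is to treat on-disk names as
opaque bytes and convert them for display, failing cleanly when the
bytes don't match the assumed encoding (UTF-8 by default; the
G_FILENAME_ENCODING and G_BROKEN_FILENAMES environment variables can
override that). A minimal sketch that lists a directory this way,
flagging rather than guessing at unconvertible names:

    #include <glib.h>
    #include <locale.h>
    #include <stdio.h>

    int
    main (int argc, char **argv)
    {
      const char *path = argc > 1 ? argv[1] : ".";
      GError *error = NULL;
      const gchar *name;
      GDir *dir;

      setlocale (LC_ALL, "");

      dir = g_dir_open (path, 0, &error);
      if (dir == NULL)
        {
          fprintf (stderr, "%s\n", error->message);
          g_error_free (error);
          return 1;
        }

      while ((name = g_dir_read_name (dir)) != NULL)
        {
          /* `name` is whatever bytes the file's creator stuck on
           * disk; g_filename_to_utf8() interprets them per the
           * assumed filename encoding and fails on junk. */
          gchar *utf8 = g_filename_to_utf8 (name, -1, NULL, NULL,
                                            &error);

          if (utf8 == NULL)
            {
              fprintf (stderr, "unconvertible filename: %s\n",
                       error->message);
              g_clear_error (&error);
              continue;
            }

          printf ("%s\n", utf8);
          g_free (utf8);
        }

      g_dir_close (dir);
      return 0;
    }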

Another relic of this problem: the Character Coding menu in
gnome-terminal. ssh should negotiate the encoding and pass it to the
terminal somehow; then it would just work, without a silly menu.
Or everyone should just use UTF-8 and be done with it – maybe someday.

For Red Hat Linux 8.0, we cut over to UTF-8 for all locales except
CJK, creating a firestorm of whining from European users who suddenly
experienced all the bugs CJK users have been enduring for years. But
the bugs have mostly been fixed now, which just goes to show: to get
the corner cases right, be sure everyone shares the pain equally.

Nice to see Joel’s aside about the “be liberal in what you accept”
myth; web browsers are the huge, glaring example of what a horrible
idea that proverb really is. Imagine a compiler that guessed at your
intent instead of printing errors. Wait, maybe perl already does
that. 😉

(This post was originally found at http://log.ometer.com/2003-10.html#10)
