a memo
Tübingen is a place name that Berylium should be able to handle. So is Nykøbing. And Välsignalandet. But can it?
Web applications require support for international character sets. And to me that means Unicode, including Hebrew, Chinese, Arabic, Greek... I know very little about these languages and character sets, but I need to develop tests to ensure that Berylium applications will support them.
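Something like the following round-trip check is the kind of test I have in mind. It's only a sketch: save_document() and load_document() are in-memory placeholders standing in for whatever Berylium actually does when it writes a document to the database and reads it back out.

```php
<?php
// Round-trip test sketch. The two functions below are placeholders
// (an in-memory store); swap in the real Berylium save/load path and
// see whether each fixture comes back byte-for-byte identical.

$store = array();

function save_document($text) {       // placeholder for the real save path
    global $store;
    $store[] = $text;
    return count($store) - 1;
}

function load_document($id) {         // placeholder for the real load path
    global $store;
    return $store[$id];
}

$fixtures = array(
    'latin'    => "Tübingen, Nykøbing, Välsignalandet",
    'arabic'   => "تم براي",
    'japanese' => "チムブレー",
    'hebrew'   => "שלום",
    'greek'    => "γειά σου",
    'chinese'  => "中文",
);

foreach ($fixtures as $name => $text) {
    $copy = load_document(save_document($text));
    if ($copy === $text) {
        echo "$name: ok\n";
    } else {
        // dump the raw bytes so entity or charset damage is obvious
        echo "$name: FAILED\n  sent: " . bin2hex($text) . "\n  got:  " . bin2hex($copy) . "\n";
    }
}
```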
From On The Goodness of Unicode by Tim Bray:

"So, if I were living in Cairo you'd probably want to send it to تم براي, and if in Osaka, to チムブレー."

Well, that didn't work so well. But at least we have a good test set-- when will Berylium pass the tbray.org test?
The Big Problem is illustrated by the quote above, which was copied and pasted directly from Mr. Bray's correctly-rendered webpage. At what point did Mozilla decide to turn the characters into ISO-8859-1 entities?
- When I pasted them into the form on the document edit page?
- When submitting the page to Berylium for processing?
- Or was it not Mozilla's fault at all: did PHP convert the characters when it decoded them from the POST request?
Whatever happened here, by the time those characters hit the database they were entities. Must investigate.
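One way to narrow it down: hex-dump the raw request body before PHP's form decoding touches it, and compare it with what shows up in $_POST. If the raw bytes already contain literal &#NNNN; sequences, the browser did the converting; if they are multi-byte UTF-8, the damage happens later, in PHP or on the way into the database. A rough sketch, assuming a plain application/x-www-form-urlencoded POST (php://input is empty for multipart forms):

```php
<?php
// Log what the browser actually sent, before any decoding.
// Assumes an ordinary application/x-www-form-urlencoded POST.
$raw = file_get_contents('php://input');
error_log('raw POST (' . strlen($raw) . ' bytes): ' . bin2hex($raw));

// For comparison, what PHP handed the application after decoding:
foreach ($_POST as $key => $value) {
    if (is_string($value)) {
        error_log("decoded $key: " . bin2hex($value));
    }
}
```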
Meanwhile, here are some recommendations from the same article:
# Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it. (A sketch of what this might mean for Berylium follows the list.)
# Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
# If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
# If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
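Here is a rough sketch of what the first recommendation might look like in practice for a PHP application like this one. The header(), mb_internal_encoding(), and accept-charset pieces are standard; the edit.php form and the commented-out SET NAMES line are illustrative assumptions, not how Berylium is currently wired.

```php
<?php
// Sketch: pick UTF-8 and declare it everywhere.

// 1. Tell the browser every page, including the document edit form,
//    is UTF-8. A form posted from a UTF-8 page normally comes back as
//    UTF-8, so the browser has no reason to fall back to entities.
header('Content-Type: text/html; charset=utf-8');

// 2. Tell the mbstring extension to treat strings as UTF-8 internally.
mb_internal_encoding('UTF-8');

// 3. Tell the database connection the bytes are UTF-8 (MySQL 4.1+
//    syntax; older servers just store whatever bytes they are given).
// mysql_query("SET NAMES 'utf8'");
?>
<form method="post" action="edit.php" accept-charset="utf-8">
  <textarea name="body"></textarea>
  <input type="submit" value="Save" />
</form>
```

The Content-Type header is probably the piece that matters most, since browsers key their form submissions off the page's declared charset; accept-charset is there as a backstop.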
By Chris Snyder on May 13, 2003 at 9:34am