Big5 to UTF-8 on Chinese Windows
May 01
Microsoft has yet again been a source of pain for me. And let me say that I find it amusing that English Windows does Unicode better than Chinese Windows. How is that even possible?!
We are in the process of internationalizing our product, and we stumbled across this bizarre issue where a person in Taiwan on a Chinese Windows machine couldn’t enter Simplified Chinese into our product. Doing so would give unpredictable results.
So, I picked 5 random Simplified Chinese characters spread across the character set, listed here with their UTF-16 (16) and UTF-8 (8) encodings in hex:
单 – 5355 (16) / E5 8D 95 (8)
的 – 7684 (16) / E7 9A 84 (8)
简 – 7B80 (16) / E7 AE 80 (8)
汉 – 6C49 (16) / E6 B1 89 (8)
字 – 5B57 (16) / E5 AD 97 (8)
These 5 characters are spread nicely across the radical-stroke index, and they all have things in common with one another in their UTF-16 and UTF-8 encodings and bit patterns.
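For anyone who wants to check the values in the list above, they can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `encodings` is made up for this example and is not part of our product.

```python
def encodings(ch):
    """Return (UTF-16 hex, UTF-8 hex) for a single character."""
    # Big-endian UTF-16 without a byte order mark; for these characters
    # the single 16-bit code unit equals the Unicode code point.
    utf16_hex = ch.encode("utf-16-be").hex().upper()
    # UTF-8 bytes, space-separated to match the list above.
    utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return utf16_hex, utf8_hex


if __name__ == "__main__":
    for ch in "单的简汉字":
        utf16_hex, utf8_hex = encodings(ch)
        print(f"{ch} - {utf16_hex} (16) / {utf8_hex} (8)")
```

Each of the five characters takes one 16-bit unit in UTF-16 and three bytes in UTF-8.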
So, after much rigmarole I secured myself an account on the box and began to test it, and sure enough these 5 Chinese characters (单的简汉字) have proven to me that internationalization stinks. Heh. Depending on the order in which they were placed, random ones would show up. Rearrange them, double them, put spaces between them, whatever: a different set would show every time.
But ONLY on the Chinese version of Windows. Locally, on our own boxes, it works great. No worries. But over there, strangeness. After much ado and testing both locally and in Taiwan, it appears that the Chinese version of Windows tries to concatenate the bits of adjacent symbols into a single character, depending on the order they are in.
In other words, it’s treating X characters after the first one as combining characters, where X appears to be completely random. No amount of staring at the bytes, bits, hex or decimal values has turned up the reason this is happening.
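One hypothesis worth testing (an assumption, not something we have confirmed) is that somewhere along the way the text passes through the machine's legacy code page, which on Traditional Chinese Windows is code page 950 (Big5). The sketch below uses Python's built-in `cp950` codec to check which of the test characters that code page can represent at all, and to show what happens when UTF-8 bytes are decoded as if they were Big5. The function names are made up for illustration.

```python
TEST_CHARS = "单的简汉字"


def representable(ch, codec="cp950"):
    """Return True if `ch` can be encoded in the given legacy codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def misdecode(text, wrong_codec="cp950"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec.

    Invalid byte sequences are replaced rather than raising, so the
    result shows how neighbouring bytes get regrouped or dropped.
    """
    return text.encode("utf-8").decode(wrong_codec, errors="replace")


if __name__ == "__main__":
    for ch in TEST_CHARS:
        status = "fits in cp950" if representable(ch) else "not in cp950"
        print(ch, status)
    print(repr(misdecode(TEST_CHARS)))
```

If the hypothesis holds, characters shared between Traditional and Simplified Chinese would survive while Simplified-only ones would not, and a double-byte decoder pairing up bytes from neighbouring characters would look a lot like the "combining" behaviour described above.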
Something that you’d think the people in Taiwan would know about and maybe give us a clue to. Naw. So now we have to figure out how to make it work correctly. *sigh*
Pain, I tell ya. Clearly there is a solution; other people have crossed this bridge. Though after digging into this, it’s interesting how many applications aren’t UTF-8 compliant at all. Linux handles UTF-8 at the OS level; Windows doesn’t. The result seems to be that the majority of the software in the Windows world doesn’t correctly support it.
Heh, the bug tracking tool we use here at work doesn’t either, which I find amusing. Do you know how hard it is to talk about Chinese characters in a tool that won’t let you write any?!
More as we fight through this. We haven’t solved it yet. It should be interesting, to say the least.
May 01, 2007 @ 17:16:33
*cough* SQL Server *cough*
That bug-tracking train wreck has tempted me to find security holes in various places just so I could see if tracking an XSS or SQL injection bug would break it.
May 01, 2007 @ 18:42:08
Well, the DB in question is DB2. I assumed the bug tracking tool’s DB was also DB2. I would be a little surprised if it wasn’t, but that’d be funny.
May 01, 2007 @ 21:34:07
I actually meant how SQL Server doesn’t actually support UTF-8…at all (http://support.microsoft.com/kb/232580). Not to mention how your queries have to have the SQL in ASCII, with the data in UCS-2.
Yeah, it would kinda shock me if it didn’t run DB2, but you never know…
May 09, 2007 @ 08:48:08
Shawn could you sum up what the problem/solution was to this?
May 09, 2007 @ 20:06:43
The problem basically lies in SQL Server only supporting UCS-2, and not UTF-8. Among Microsoft’s suggestions for storing UTF-8 in SQL Server: use binary (i.e. non-searchable, non-usable) fields to store all of your strings, or convert your entire application to their way (UCS-2) instead of UTF-8. As far as I know, it stays this way until the next version of SQL Server, due for release sometime in the future (2008, I think). Unless, of course, they drop that feature or delay the product.
The real solution: don’t use SQL Server if you need to store UTF-8 bytes.
It does amuse me that Microsoft’s site sends Content-Type: text/html; charset=utf-8, since I have no clue what a browser would do if it sent Content-Type: text/html; charset=ucs-2. But, then, since they use IIS, ASP, and Windows as their server, they have the only platform supported for automatic conversion to UTF-8 for output when using SQL Server.
Also see: “What is the difference between UCS-2 and UTF-16?” http://unicode.org/faq/basic_q.html#25
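As a rough illustration of that difference (a sketch, with made-up helper names): UCS-2 is limited to a single 16-bit unit per character, while UTF-16 can use a surrogate pair for characters outside the Basic Multilingual Plane.

```python
def utf16_units(ch):
    """Return the list of 16-bit UTF-16 code units for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fits_in_ucs2(ch):
    """True if the character needs only one 16-bit unit (no surrogates)."""
    return len(utf16_units(ch)) == 1


if __name__ == "__main__":
    for ch in ["汉", "\U00020BB7"]:  # U+6C49 and U+20BB7
        units = " ".join(f"{u:04X}" for u in utf16_units(ch))
        print(f"U+{ord(ch):04X}: {units} (fits in UCS-2: {fits_in_ucs2(ch)})")
```

The five test characters from the post all fit in a single unit, so UCS-2 versus UTF-16 is not what breaks them; it only matters for rarer characters such as U+20BB7.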