
If you're a web professional and are clueless as to what's causing this, or (worse!) if you're responsible for things like this on a regular basis, go read Joel Spolsky's excellent post on character sets and Unicode first. The rest of this article is for non-programmers anyway.
We carry portable gadgets with features too slick and cool even for the original creators of Star Trek. So why does the newest version of my browser still make pâté of the euro sign and the accents over the é? It started a few years before Star Trek. It is called ASCII, and its legacy is everywhere.
[Images: the Tricorder, 1966 sci-fi gizmo; the iPhone, 2009 real gizmo]
The page you're reading right now reaches your computer as a large row of numbers, and it's the job of your browser, be it Internet Explorer, Safari or Firefox, to convert these numbers into text. To do that, the digital community has agreed on *several* standard ways to get from a bunch of numbers to letters, numbers and punctuation marks. However, if your browser encounters a non-standard number it has no way to display it. Bad luck. Yet how did that "wrong" number get there in the first place? Did the author make a typo? Did the number get mangled in net traffic? Nope. Remember I highlighted the word *several* a few sentences back. Your browser either doesn't know which standard to use and has picked the wrong one, or it's simply being lied to.
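For readers who do want to see this with their own eyes, here is a minimal Python sketch of a browser picking the wrong standard. The sample text and the two standards involved (UTF-8 for the author, Windows-1252 for the wrong guess) are my own illustrative assumptions, not the only way things go wrong.

```python
# Sketch: the same row of numbers, read under the right and the wrong standard.
# The sample text and both encodings are illustrative assumptions.
text = "pâté for €5"
data = text.encode("utf-8")      # the numbers that travel over the wire

right = data.decode("utf-8")     # the reader uses the standard the author used
wrong = data.decode("cp1252")    # the reader guesses a one-byte Western set

print(list(data))   # the raw numbers
print(right)        # the text as the author typed it
print(wrong)        # accents and euro sign turned into several junk characters
```

Nothing was damaged on the way: the numbers are identical in both cases, only the interpretation differs.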
In 1963, after three years of hard work, a subcommittee of the American Standards Association introduced the American Standard Code for Information Interchange (ASCII). Mark the word American. ASCII assigns a number from 0 to 127 to each of the Roman letters, the digits and several punctuation marks. This covers everything you need to display English text, but is not nearly enough to display all the accented letters (letters with diacritics, to use the technical term) used by other languages based on the Roman alphabet, let alone Greek, Russian, Chinese, Japanese, Korean, etc. You may wonder why these wise men didn't do it properly and reserve more than a measly 128 numbers. A hundred thousand should do it and would let the Asians join the party. Well, the sixties were still the era of punch cards. They may have had free love in those days, but data storage came at a hefty price. Nobody wasted bits to placate the Chinese or the Russians, especially not during the Cold War. To suggest that in forty years' time you would have a gigabyte of memory dangling on your key chain would have been too outrageous for words.
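A quick way to check what does and doesn't fit in ASCII, sketched in Python. The helper name `fits_in_ascii` is made up for this example.

```python
# Sketch: ASCII has 128 numbered slots and nothing beyond them.
for ch in "Az9?":
    print(ch, ord(ch))   # every one of these numbers is below 128


def fits_in_ascii(text):
    """Return True if every character has an ASCII number (hypothetical helper)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("pate"))   # plain English letters
print(fits_in_ascii("pâté"))   # accented letters have no ASCII slot
```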
This is how they solved the multilingual problem. Computers express all numbers in ones and zeros, so-called binary digits (bits). Grouping these bits in chunks of eight lets you express 2⁸ = 256 different numbers. These chunks are called bytes, and for the sake of the argument we will assume that a byte is always eight bits long. Remember that the core ASCII set only had 128 entries. If you occupy a single byte per character you can use the numbers 128 to 255 for something else. You could have a set combining Greek and Roman, or Cyrillic and Greek, as long as people agree what character in the extended set each number refers to.
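To make the "something else" concrete, here is a small Python sketch that reads one and the same number from the upper half of the byte through three extended sets. The number 233 and the three ISO 8859 sets are picked purely for illustration.

```python
# Sketch: one byte above the ASCII range, read through three extended sets.
byte = bytes([233])   # an arbitrary number from the upper half of the byte

# Western European, Greek and Cyrillic sets from the ISO 8859 family.
for name in ("iso8859_1", "iso8859_7", "iso8859_5"):
    print(name, byte.decode(name))   # a different letter each time

print(2 ** 8)   # how many different numbers fit in eight bits
```

Same number, three different letters. Which one you see depends entirely on which set the reader assumes.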
People actually agreed to differ, and lots of different character set mappings have since been drawn up to accommodate the various languages of the world. What most of these standards have in common are entries 0 to 127, the original ASCII mapping. Note that character set mapping is not a technical term. The nitty-gritty involves code pages, code points and encodings, but it still boils down to a technical means of getting from
Dear Åsa, Günther, Søren and Małgorzata
to
0110001101100101011001101010111010101000100110100101110010110110010100101*
The string of characters I type on my keyboard is translated into numbers and then stored or transmitted. A different machine or program can only display these numbers correctly if it knows how to map the numbers to readable characters. That means it must know how the original text was encoded. It makes no sense to throw a bunch of numbers at me over the Internet without telling me how to interpret them.
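Here is a rough Python sketch of that translation step. The helper `to_bits` is hypothetical, and the two encodings are just examples. Note that the same greeting produces different bits depending on the encoding chosen, which is exactly why the receiver has to be told.

```python
def to_bits(text, encoding):
    """Encode text and show each byte as eight binary digits (hypothetical helper)."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


greeting = "Dear Åsa"
print(to_bits(greeting, "utf-8"))       # Å takes two bytes here
print(to_bits(greeting, "iso8859_1"))   # Å takes one byte here

# A one-byte Western European set has no slot at all for the Polish ł.
try:
    "Małgorzata".encode("iso8859_1")
except UnicodeEncodeError as err:
    print("cannot encode:", err.object[err.start])
```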
You would think that having as few different character sets as possible is a good thing, because more will only add to the confusion. True, but a big shortcoming of these one-byte character sets is that they have no vacancies for new characters. A case in point: in 2002 the euro was introduced, and with it came the brand new character €.
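A small Python sketch shows where the euro sign ended up in three common one-byte sets. The helper `euro_slot` and the choice of sets are mine, for illustration only.

```python
def euro_slot(encoding):
    """Return the number a one-byte set gives to €, or None (hypothetical helper)."""
    try:
        return "€".encode(encoding)[0]
    except UnicodeEncodeError:
        return None


for name in ("iso8859_1", "cp1252", "iso8859_15"):
    print(name, euro_slot(name))

# The slot that ISO 8859-15 uses for € means something else in ISO 8859-1.
print(bytes([164]).decode("iso8859_1"))
```

One set has no euro at all, and the other two put it in different places, so the confusion only grew.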
There is another problem with one-byte sets. For Chinese and Japanese the ASCII single-byte trick will not do. These so-called logographic writing systems have thousands of characters and need at least two bytes per character, because that gives you 256 x 256 = 65,536 options. If I wanted to add my Chinese friend to the salutation I would need characters from different sets, so I would have to indicate that some bytes are from set A and others from set B. What you really need of course is a character set big enough to accommodate all these characters. That's where Unicode comes in. Unicode has been a long time in the making (starting as early as 1987) and is the much-awaited effort to unite all the languages of the world in one happy digital standard.
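As a sketch of what that buys you, the Python snippet below puts several scripts in one string without any switching between set A and set B. The names in the greeting are invented, and UTF-8 is only one of the ways to store Unicode text. I picked it as an assumption for the example.

```python
def bytes_per_char(text):
    """Return how many UTF-8 bytes each character needs (hypothetical helper)."""
    return [len(ch.encode("utf-8")) for ch in text]


greeting = "Dear Åsa, Γιώργος and 李明"   # invented names, three scripts
data = greeting.encode("utf-8")           # one encoding covers all of them
print(data.decode("utf-8") == greeting)   # the round trip loses nothing

for ch, size in zip("AÅΓ李", bytes_per_char("AÅΓ李")):
    print(ch, ord(ch), size)   # character, Unicode number, bytes needed

print(256 * 256)   # options available with two plain bytes
```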
Unicode is all-inclusive and able to adopt new characters in the future, but even if we ditch all ASCII derivatives today and use nothing but Unicode, that doesn't magically change all the millions of legacy pages still out there. Compared to the speed of hardware improvements, new digital protocols and standards seem to take an eternity before they are universally adopted. What makes it even more treacherous is the fact that Unicode conforms to the first 128 positions of ASCII. An understandable choice if you want backward compatibility: the text “Hello World” maps to the same numbers in Unicode as it does in any of the ASCII-based standards.
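The compatibility is easy to check with a Python sketch. The helper `distinct_bytes` and the list of encodings are assumptions made for this example, with UTF-8 standing in for "Unicode" on the wire.

```python
ENCODINGS = ("ascii", "utf-8", "iso8859_1", "cp1252")


def distinct_bytes(text, names):
    """Return the set of different byte strings the encodings produce (hypothetical helper)."""
    results = set()
    for name in names:
        try:
            results.add(text.encode(name))
        except UnicodeEncodeError:
            pass   # this set has no slot for some character; skip it
    return results


print(len(distinct_bytes("Hello World", ENCODINGS)))   # all encodings agree
print(len(distinct_bytes("Héllo Wörld", ENCODINGS)))   # the accents split them
```

As long as a text sticks to plain English letters, a wrong label goes completely unnoticed. The trouble only shows up at the first accent.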
However, by doing so Unicode allows the situation that a Western European text can pose as one character set when in fact it is encoded in another. A lazy programmer will easily miss it. The root of all evil is the design choice that lets lazy programmers get away with it in the first place. It is called Postel's Law and it will be the topic of my next post.

* My advice to anyone trying to decode the binary gibberish: get a life :-)

