In my previous post I showed you that the legacy of ASCII is still everywhere after more than forty years. It is not even for want of something better, since we have Unicode now. The reason why the world didn't ditch ASCII straight away is that it still does a fine job for English-only texts.
At the heart of the pâté problem is not the fact that ASCII was incomplete to begin with. The problem is that two very dim-witted participants in a conversation make the wrong assumptions about the language the other one speaks, and have to fall back on a not very educated guess. If I send you a book in Finnish with a Post-it note stuck to it saying “Here's the English novel you requested”, you will assume either that I sent you the wrong book or that I wrote “English” where I meant to say “Finnish”. Even though you may not know any Finnish, you would be intelligent enough not to bother looking up the words in an English dictionary. Yet that's exactly what computers do when you send them English and tell them it's Mandarin. They will convert the numbers into a bewildering array of Chinese characters. That's how dumb they are. Makes you feel superior, doesn't it?
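If you want to see that dumbness in action, here is a minimal Python sketch. It assumes the mislabelled "Mandarin" is UTF-16, which is only one of several encodings the receiver could wrongly pick, and the function name `misread_as_utf16` is made up for this illustration.

```python
def misread_as_utf16(text: str) -> str:
    """Encode text as ASCII, then decode the same bytes as little-endian UTF-16."""
    raw = text.encode("ascii")        # one byte per character
    return raw.decode("utf-16-le")    # wrong label: every two bytes become one character


original = "An English novel"         # 16 ASCII characters, so the bytes pair up evenly
garbled = misread_as_utf16(original)  # 8 characters, all from the CJK ideograph blocks

# Code points are printed instead of glyphs so this runs on any terminal.
print([f"U+{ord(ch):04X}" for ch in garbled])

# Nothing was lost: undo the wrong assumption and the English comes back.
restored = garbled.encode("utf-16-le").decode("ascii")
print(restored == original)
```

The bytes never change. Only the label does, and the label is all the computer has to go on.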
I will let you in on a little secret:
I dare you to feed that line to any optical character-recognition software. Human communication is very resilient when presented with crummy data. That is why websites make you decipher these wavy nonsense words as a security measure when you create a Gmail account. It takes a special kind of intelligence (i.e. human) to do this. Calculating pi to a hundred decimals? That's nothing. Having a conversation in a busy pub after three beers, now that's an impressive feat of data processing. We cannot fathom what marvellous mechanisms of extrapolation and pattern recognition go on under the surface. We are simply the users of our brains, not the designers.
Computers hate vagueness, while we humans thrive on it. Try communicating with one, even in a relatively user-friendly programming language, and you'll know what fussy, unfeeling monsters they can be. Put a single semicolon wrong and they'll just shut up on you. What a cheek! For global communication we need a standard that's a bit kinder to us people. Or do we?
In 1991 British physicist Tim Berners-Lee pioneered the technology that paved the way for the World Wide Web as we know it today. Among his inventions were a communications standard for computers to exchange documents, the Hypertext Transfer Protocol (HTTP), and the mother of all Internet document formats, Hypertext Markup Language (HTML). An HTML document contains readable text interleaved with so-called markup tags to indicate bold text, paragraph breaks, table structures, hyperlinks and links to images, video and sound. Markup tags are enclosed in angle brackets and are not displayed as-is, but interpreted.
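To make the difference between markup and readable text concrete, here is a small sketch using `html.parser` from Python's standard library. The sample snippet, the `example.org` address and the class name `TagSplitter` are my own inventions for the occasion, and a real browser does far more with the tags than collect their names.

```python
from html.parser import HTMLParser


class TagSplitter(HTMLParser):
    """Collect the opening markup tags and the readable text separately."""

    def __init__(self):
        super().__init__()
        self.tags = []   # names of the opening tags, in document order
        self.text = []   # the pieces of text a reader would actually see

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


sample = '<p>Read <b>this</b> on <a href="https://example.org/">my page</a>.</p>'

splitter = TagSplitter()
splitter.feed(sample)
splitter.close()

visible = "".join(splitter.text)
print(splitter.tags)   # ['p', 'b', 'a']
print(visible)         # Read this on my page.
```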
The average user doesn't know HTML. They don't know that some HTML documents are actual physical files and that others are tailor-made with bits of data from various databases. They don't know that HTML is an open standard, that there are thousands of software tools available to create HTML documents and dozens of applications to view them, some good and a lot really awful. Why should they want or need to know? I made my first website in 1998 using Word 97, when I was working as a teacher in China. It produced terrible HTML, but I didn't care, since it looked all right in Internet Explorer 3. I just wanted to share my experiences and do it with the tool I knew best, not necessarily with the best tool. I had neither the time nor the inclination to learn this so-called HTML. What's more, I don't recall any problems with accented Dutch vowels as long as I stayed within the Microsoft ecosystem.
Berners-Lee wanted to make HTML a simple and forgiving standard. Simple in the sense that you can read and write raw HTML code without a specialized tool, and forgiving in the sense that the software used for displaying it does the best it can, even when your code is messy and doesn't tell the receiver what character set to use.
Tools that work with open standards should follow the robustness principle, also called Postel's Law, after computer scientist Jonathan Postel.
Be conservative in what you do, be liberal in what you accept from others.
A fine human quality indeed. Not so fine for machines, though. In the harsh reality of web browsers it more likely becomes:
Always tell others what character set you're using, but if they don't tell you theirs and it has the word comrade in it, just assume it's Russian and go with Cyrillic characters.
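In code, that kind of liberal acceptance might look roughly like the sketch below. It is a deliberate simplification: the function `decode_page`, its `declared` parameter and the choice of windows-1252 as the fallback are my assumptions, and real browsers use considerably more elaborate sniffing rules.

```python
def decode_page(raw: bytes, declared=None) -> str:
    """Turn received bytes into text, guessing the character set if need be."""
    if declared:
        # The sender told us the character set: believe them, right or wrong.
        return raw.decode(declared, errors="replace")
    try:
        # No label: UTF-8 is a reasonable first guess, since bytes in another
        # encoding often fail its strict validity rules.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Last resort: a Western European single-byte code page.
        return raw.decode("windows-1252", errors="replace")


word = "pâté"

honest = decode_page(word.encode("utf-8"), declared="utf-8")             # 'pâté'
undeclared = decode_page(word.encode("latin-1"))                         # 'pâté', a lucky guess
mislabeled = decode_page(word.encode("utf-8"), declared="windows-1252")  # 'pÃ¢tÃ©'
```

The last call is the pâté problem in miniature: the bytes are perfectly good, and only the label is wrong.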
The purists will tell you that HTML is not a programming language. They are probably right, but they are also the kind of people who love writing HTML code by hand. If you consider the learning curve (steep) and the number of coding errors you can get away with before your page turns to garbage (small), then HTML has all the features of a programming language. The educated people who provide most of the content for the Internet certainly don't want to learn it. They want a tool that does it for them, and does it well.


