Tuesday, 25 August 2009

Filling in the gaps

In my previous post I showed you that the legacy of ASCII is still everywhere after more than forty years. It is not even for want of something better, since we have Unicode now. The reason why the world didn't ditch ASCII straight away is that it still does a fine job for English-only texts.

At the heart of the pâté problem is not the fact that ASCII was incomplete to begin with. The problem is that two very dim-witted participants in a conversation make the wrong assumptions about the language each of them speaks, and are left to make a not very educated guess at it. If I send you a book in Finnish with a Post-it note stuck to it saying "Here's the English novel you requested", you will either assume I sent you the wrong book or that I wrote “English” where I meant to say “Finnish”. Even though you may not know any Finnish, you would be intelligent enough not to bother looking up the words in an English dictionary. Yet that's exactly what computers do when you send them English and tell them it's Mandarin. They will convert the numbers into a bewildering array of Chinese characters. That's how dumb they are. Makes you feel superior, doesn't it?
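For the curious, here is a minimal Java sketch of that mislabelled parcel. It is only an illustration: I am assuming UTF-16BE as a stand-in for "a Chinese encoding" (it simply reads two bytes per character), and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

class WrongLabelDemo {
    // Encode English text as ASCII bytes, then decode the same bytes under the wrong label.
    static String mislabel(String english) {
        byte[] bytes = english.getBytes(StandardCharsets.US_ASCII);
        // Assumption: UTF-16BE plays the part of the wrongly announced encoding.
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    // True if every character of the text belongs to the Han (Chinese) script.
    static boolean allHan(String s) {
        return s.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        String garbled = mislabel("English text");
        System.out.println(garbled.length() + " characters, all Han: " + allHan(garbled));
    }
}
```

Twelve Latin letters go in, six Chinese characters come out, and the machine never blinks.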

I will let you in on a little secret:

I dare you to feed that line to any optical character-recognition software. Human communication is very resilient when presented with crummy data. That is why websites make you decipher these wavy nonsense words as a security measure when you create a Gmail account. It takes a special kind of intelligence (i.e. human) to do this. Calculating pi up to a hundred decimals? That's nothing. Having a conversation in a busy pub after three beers, now that's an impressive feat of data processing. We cannot fathom what marvellous mechanisms of extrapolation and pattern recognition go on under the surface. We are simply the users of our brains, not the designers.

Computers hate vagueness, while we humans thrive on it. Try communicating with one, even in a relatively user-friendly programming language, and you'll know what fussy, unfeeling monsters they can be. Put a single semicolon wrong and they'll just shut up on you. What a cheek! For global communication we need a standard that's a bit kinder to us people. Or do we?

In 1991 British physicist Tim Berners-Lee pioneered the technology that paved the way for the World Wide Web as we know it today. Among his inventions were a communications standard for computers to exchange documents, the Hypertext Transfer Protocol (HTTP), and the mother of all Internet document formats, Hypertext Markup Language (HTML). An HTML document contains readable text interleaved with so-called markup tags to indicate bold text, paragraph breaks, table structures, hyperlinks and links to images, video and sound. Markup tags are enclosed in angle brackets and not displayed as-is, but interpreted.

The average user doesn't know HTML. They don't know that some HTML documents are actual physical files and that others are tailor-made with bits of data from various databases. They don't know that HTML is an open standard, that there are thousands of software tools available to create HTML documents and dozens of applications to view them, some good and a lot really awful. Why should they want or need to know? I made my first website in 1998 using Word 97, when I was working as a teacher in China. It produced terrible HTML, but I didn't care, since it looked all right in Internet Explorer 3. I just wanted to share my experiences and do it with the tool I knew best, not necessarily with the best tool. I had neither the time nor the inclination to learn this so-called HTML. What's more, I don't recall any problems with accented Dutch vowels as long as I stayed within the Microsoft ecosystem.

Berners-Lee wanted to make HTML a simple and forgiving standard. Simple in the sense that you can read and write raw HTML code without a specialized tool and forgiving in the sense that the software used for displaying it does the best it can, even though your code is messy and doesn't tell the receiver what character set to use.
Tools that work with open standards should follow the robustness principle, also called Postel's Law, after computer scientist Jonathan Postel.

Be conservative in what you do, be liberal in what you accept from others.

A fine human quality indeed. Not so fine for machines, though. In the harsh reality of web browsers it more likely becomes:

Always tell others what character set you're using, but if they don't tell you theirs and it has the word comrade in it, just assume it's Russian and go with Cyrillic characters.
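To make the guessing game concrete, here is a hedged Java sketch of a "liberal" reader. It is not how any particular browser works; the fallback order (strict UTF-8 first, ISO-8859-1 as a last resort) and the class name are my own assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientDecoder {
    // Decode with the declared charset if the sender bothered to name one; otherwise guess.
    static String decode(byte[] bytes, Charset declared) {
        if (declared != null) {
            return new String(bytes, declared);
        }
        try {
            // Guess 1: strict UTF-8, which rejects byte sequences that are not valid UTF-8.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Guess 2: ISO-8859-1 never fails, because every byte maps to some character.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Note that the second guess always "succeeds", which is exactly the trouble: the reader gets text either way and nobody learns that the label was missing.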

The purists will tell you that HTML is not a programming language. They are probably right, but they are also the kind of people who love writing HTML code by hand. If you consider the learning curve (steep) and the coding errors you can get away with before your page turns to garbage (few), then HTML has all the features of a programming language. The educated people who provide most of the content for the Internet certainly don't want to learn it. They want a tool that does it for them, and does it well.

Monday, 24 August 2009

ASCII and the mangled euro

If you regularly visit non-English sites or receive email from exotic places you're more than likely to have come across gems like p?t?: ? 9, where you expected pâté: € 9. If you look up pâté in the Longman online dictionary, this is what your tab looks like in Firefox 3 on Windows:



If you're a web professional and are clueless as to what's causing this, or (worse!) if you're responsible for things like this on a regular basis, go read Joel Spolsky's excellent post on character sets and Unicode first. The rest of this article is for non-programmers anyway.

We carry portable gadgets with features too slick and cool even for the original creators of Star Trek. So why does the newest version of my browser still make pâté of the euro sign and the accents over the é? It started a few years before Star Trek, it is called ASCII and its legacy is everywhere.



The Tricorder, 1966 sci-fi gizmo
The iPhone, 2009 real gizmo





The page you're reading right now reaches your computer as a large row of numbers, and it's the job of your browser, be it Internet Explorer, Safari or Firefox, to convert these numbers into text. To do that, the digital community has agreed on several standard ways to get from a bunch of numbers to letters, numbers and punctuation marks. However, if your browser encounters a non-standard number it has no way to display it. Bad luck. Yet how did that "wrong" number get there in the first place? Did the author make a typo? Did the number get mangled up in net traffic? Nope. Remember I highlighted the word several a few sentences back. Your browser either doesn't know which standard to use and has picked the wrong one, or it's simply being lied to.
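Here is one way those question marks can come about, sketched in Java. I am assuming the text passes through a plain ASCII step somewhere between author and reader; the class name is invented for the occasion.

```java
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Push text through US-ASCII and back; characters ASCII cannot express become '?'.
    static String throughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The escapes spell "pate: EUR 9" with a-circumflex, e-acute and the euro sign.
        System.out.println(throughAscii("p\u00E2t\u00E9: \u20AC 9"));
    }
}
```

The accented letters and the euro sign are gone for good after this round trip; no amount of clever decoding on the receiving end can bring them back.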

In 1963, after three years of hard work, a subcommittee of the American Standards Association introduced the American Standard Code for Information Interchange (ASCII). Mark the word American. ASCII assigns a number between 0 and 127 to all Roman letters, digits and several punctuation marks. This covers everything you need to display English text, but is not nearly enough to display all the accented letters (diacritics, to use the technical term) used by other languages based on the Roman alphabet, let alone Greek, Russian, Chinese, Japanese, Korean, etc. You may wonder why these wise men didn't do it properly and reserve more than a measly 128. A hundred thousand should do it and would let the Asians join the party. Well, the sixties were still the era of punch cards. They may have had free love in those days, but data storage came at a hefty price. Nobody wasted bits to placate the Chinese or the Russians, especially not during the Cold War. To suggest that in forty years' time you would have a gigabyte of memory dangling on your key chain would have been too outrageous for words.

This is how they solved the multilingual problem. Computers express all numbers in ones and zeros, so-called binary digits (bits). Grouping these bits in chunks of eight lets you express 2^8 = 256 different numbers. These chunks are called bytes, and for the sake of the argument we will assume that a byte is always eight bits long. Remember that the core ASCII set only had 128 entries. If you occupy a single byte per character you can use the numbers 128 to 255 for something else. You could have a set combining Greek and Roman, or Cyrillic and Greek. As long as people agree what character in the extended set each number refers to.
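A small Java sketch of the arithmetic and of the "something else" in the upper half. It assumes your Java runtime ships the Greek set ISO-8859-7 (the ones I know of do, but only a handful of charsets are guaranteed); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class ExtendedRangeDemo {
    // Decode a single byte value (0-255) under the named one-byte character set.
    static String decodeByte(int value, String charsetName) {
        return new String(new byte[] {(byte) value}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println("7 bits: " + (1 << 7) + " values, 8 bits: " + (1 << 8) + " values");
        for (String name : new String[] {"ISO-8859-1", "ISO-8859-7"}) {
            // 0x41 sits in the shared ASCII half, 0xE9 in the upper half that differs per set.
            System.out.println(name
                    + ": 0x41 -> U+" + Integer.toHexString(decodeByte(0x41, name).codePointAt(0))
                    + ", 0xE9 -> U+" + Integer.toHexString(decodeByte(0xE9, name).codePointAt(0)));
        }
    }
}
```

The same byte 0xE9 is an e-acute in the Western European set and a Greek iota in the Greek one, while the letter A stays put in both.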

People actually agreed to differ, and lots of different character set mappings have since been drawn up to accommodate the various languages of the world. What most of these standards have in common are entries 0 to 127, the original ASCII mapping. Note that character set mapping is not a technical term. The nitty-gritty involves code pages, code points and encodings, but it still boils down to a technical means of getting from

Dear Åsa, Günther, Søren and Małgorzata

to

0110001101100101011001101010111010101000100110100101110010110110010100101*

The string of characters I type on my keyboard is translated into numbers and then stored or transmitted. A different machine or program can only display these numbers correctly if it knows how to map the numbers to readable characters. That means it must know how the original text was encoded. It makes no sense to throw a bunch of numbers at me over the Internet without telling me how to interpret them.
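The round trip can be sketched in a few lines of Java. The choice of UTF-8 for the sender and ISO-8859-1 for the misinformed receiver is my assumption, picked because it is a common mix-up; the names are made up.

```java
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Render each byte as eight binary digits, the way it travels over the wire.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "S\u00F8ren";                        // Soren with a slashed o
        byte[] wire = name.getBytes(StandardCharsets.UTF_8); // sender encodes as UTF-8
        System.out.println(toBits(wire));
        // Receiver told the truth: the name survives.
        System.out.println(new String(wire, StandardCharsets.UTF_8).equals(name));
        // Receiver told (or guessing) ISO-8859-1: the slashed o turns into two junk characters.
        System.out.println(new String(wire, StandardCharsets.ISO_8859_1).length());
    }
}
```

Five characters go out, six come back, and the bits in between were never damaged. Only the interpretation was.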

You would think that having as few different character sets as possible is a good thing, because more will only add to the confusion. True, but a big shortcoming of these one-byte character sets is that they have no vacancies for new characters. A case in point: in 1999 the euro was introduced, and with it came the brand new character €, but nowhere to put it. It might have gone in an existing, little used slot of the most popular character set ISO-8859-1, but that's not what happened. Standard committees are wary of change. It's like the American constitution: you cannot change the original text, but you can have amendments added to it. So they copied the existing character set, called it ISO-8859-15 and stuck the euro in an existing, little used slot. Problem solved, as long as all software clearly communicates that it is using either the 1 or the 15 flavour, which it is not required to do. You get my point.
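The two flavours are easy to compare in Java. This sketch assumes the runtime provides ISO-8859-15 (it is not one of the charsets Java guarantees, though the common runtimes include it); the class name is invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EuroSlotDemo {
    // Decode one byte value under the given charset and return the resulting code point.
    static int decodeToCodePoint(int byteValue, Charset charset) {
        return new String(new byte[] {(byte) byteValue}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        Charset flavour1 = StandardCharsets.ISO_8859_1;
        Charset flavour15 = Charset.forName("ISO-8859-15"); // assumed to be available
        // Slot 0xA4 holds the generic currency sign in flavour 1 and the euro in flavour 15.
        System.out.printf("ISO-8859-1:  U+%04X%n", decodeToCodePoint(0xA4, flavour1));
        System.out.printf("ISO-8859-15: U+%04X%n", decodeToCodePoint(0xA4, flavour15));
    }
}
```

One byte, two meanings, and nothing in the byte itself to tell you which one the sender had in mind.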

There is another problem with one-byte sets. For Chinese and Japanese the ASCII single-byte trick will not do. These so-called logographic writing systems have thousands of characters and need at least two bytes per character, because that gives you 256 x 256 = 65,536 options. If I wanted to add my Chinese friend to the salutation I would be needing characters from different sets, so I would need to indicate that some bytes are from set A and others from set B. What you really need of course is a character set big enough to accommodate all these characters. That's where Unicode comes in. Unicode has been a long time in the making (starting as early as 1987) and is the much-awaited effort to unite all the languages of the world in one happy digital standard.
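A quick Java sketch of the byte counting. The two Unicode encodings shown (UTF-8 and UTF-16BE) are my choice of illustration, since the paragraph above does not name one, and the helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ByteCountDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as UTF-16 (big-endian, no byte-order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println("Two bytes give " + (256 * 256) + " options");
        String latinA = "A";          // plain ASCII letter
        String aRing = "\u00C5";      // A with ring, as in the Swedish name Asa
        String zhong = "\u4E2D";      // the Chinese character for "middle"
        for (String s : new String[] {latinA, aRing, zhong}) {
            System.out.println(utf8Length(s) + " byte(s) in UTF-8, "
                    + utf16Length(s) + " in UTF-16");
        }
    }
}
```

All three characters fit in one and the same scheme, so nobody has to flag which bytes come from set A and which from set B.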

Unicode is all-inclusive and able to adopt new characters in the future, but even if we ditch all ASCII derivatives today and use nothing but Unicode, that doesn't magically change all the millions of legacy pages still out there. Compared to the speed of hardware improvements, new digital protocols and standards seem to take an eternity before they are universally adopted. What makes it even more treacherous is the fact that Unicode conforms to the first 128 positions of ASCII. An understandable choice if you want backward compatibility: the text “Hello World” maps to the same numbers in Unicode as it does in any of the ASCII-based standards.
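You can check the overlap yourself with a few lines of Java. I am assuming UTF-8 as the Unicode encoding here (UTF-16 would produce different bytes), and the class name is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiOverlapDemo {
    // True if the text yields identical bytes in US-ASCII, ISO-8859-1 and UTF-8.
    static boolean sameBytesEverywhere(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        System.out.println(sameBytesEverywhere("Hello World"));     // pure ASCII text
        System.out.println(sameBytesEverywhere("p\u00E2t\u00E9"));  // text with accents
    }
}
```

As long as a text sticks to plain English letters, its bytes give no clue which encoding produced them. The differences only show up when the first accent arrives.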

However, by doing so Unicode allows the situation that a Western European text can pose as one character set when in fact it is encoded in another. A lazy programmer will easily miss it. The root of all evil is the design choice that lets lazy programmers get away with it in the first place. It is called Postel's Law and it will be the topic of my next post.

* My advice to anyone trying to decode the binary gibberish: get a life :-)


Monday, 17 August 2009

A False Sense of Simplicity

For the past five years I have had the dubious pleasure of using Hibernate in Oracle-backed production environments and more often than not it has made me want to crawl into a cave to sleep off the months of ensuing darkness. For the uninitiated: Hibernate is a popular open source object-relational mapping tool for Java, an interface layer between your (Java) code and a relational database which lets you query and manipulate data by means of the object-oriented paradigm, effectively hiding the SQL it generates to achieve this.

I have nothing against Hibernate or ORM tools in general, but it annoys me how often it is touted as the perfect fit anytime you need to do something with databases. It irks me how often I have seen it used in environments where it is totally unsuited. It gets this undeserved support from people who have been lulled into a false sense of simplicity.

Let's assume that I'm a green and optimistic Java programmer with no experience in relational databases and no wish to acquire any. Let's say I'm writing an application to manage all logging and billing for a large phone company. The Ministry of Love needs me to store the recorded sound data of every conversation. So I bash away:

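import java.util.Date;
import java.util.List;

// One phone subscription and the calls made on it.
public class Subscription {
    long number;                 // phone number; too wide for an int
    List<Call> calls;
    SubscriptionType subscrType;
}

// A single recorded conversation.
public class Call {
    Subscription caller;
    long receiver;
    Date started;
    Date finished;
    byte[] data;                 // raw sound data of the conversation
}

// A tariff plan.
public class SubscriptionType {
    String name;
    double fixedMonthlyFee;
    double callChargePerSecond;
}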

I suppose this is conceptually valid, albeit very simplified. All I do now is insert some clever annotations in the code, let Hibernate create the tables for me and within minutes I can do things like this:

List smallList = session.createQuery("select name from SubscriptionType").list();
List whoppingList = session.createQuery("select data from Call").list();

The SubscriptionType class maps to a table with no more than ten or twenty rows at any time. No problem there. If marketing and sales do a good job the Subscription class will hold millions and the Call class billions of records and terabytes of data, mostly consisting of binary voice data. So within weeks of your launch party that whoppingList is sure to bring your system to its knees, because behind the whoppingList is actually a JDBC result set whose every row ends up as an object in memory. A Java List that mimics a collection of objects which are not already in memory is not forbidden, but at the very least counter-intuitive.
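One common way out is to read such a table a page at a time. The sketch below is plain Java and deliberately not Hibernate code: CallSource and inMemorySource are hypothetical stand-ins for a real paged query, so treat it as an illustration of the idea rather than a recipe.

```java
import java.util.ArrayList;
import java.util.List;

class PagedReadSketch {
    // Hypothetical stand-in for a paged query against the Call table; not a Hibernate API.
    interface CallSource {
        List<byte[]> fetchPage(int offset, int limit);
    }

    // Test double that serves pages from an in-memory list instead of a database.
    static CallSource inMemorySource(List<byte[]> table) {
        return (offset, limit) -> table.subList(
                Math.min(offset, table.size()),
                Math.min(offset + limit, table.size()));
    }

    // Adds up the size of all recordings while holding at most one page of rows at a time.
    static long totalBytes(CallSource source, int pageSize) {
        long total = 0;
        int offset = 0;
        while (true) {
            List<byte[]> page = source.fetchPage(offset, pageSize);
            if (page.isEmpty()) {
                break; // no more rows
            }
            for (byte[] recording : page) {
                total += recording.length;
            }
            offset += page.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<byte[]> fakeTable = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            fakeTable.add(new byte[1000]); // ten fake recordings of 1000 bytes each
        }
        System.out.println(totalBytes(inMemorySource(fakeTable), 3));
    }
}
```

The point is not the loop itself but the fact that somebody has to know it is needed, and the innocent-looking List gives no such hint.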

An intricate schema with hundreds of tables will not break Hibernate. It only takes a few tables, LOTS of data and an ignorant developer who doesn't know which buttons to push. The so-called impedance mismatch between the relational and the object-oriented realm often gets really ugly when your data reaches a sizable volume. That means in production environments, if you skimp on proper testing.

In all fairness, a seasoned Hibernate guru would not be this naive. She would anticipate that the Call table will grow like mad and needs to be archived regularly. Since the raw sound data are for archiving and rarely queried, she would store them in a separate database that doesn't require daily backups. She would have given our newbie one of the fine manuals that tell him how to do things properly. Feeling the sheer weight of the tome would have taught him instantly that serious database solutions are never plain sailing, (not even/especially not) with Hibernate. Why is that?

BECAUSE SIMPLE THINGS DON'T TAKE 880 PAGES TO EXPLAIN

It's time for a little historical context, because I think the confusion started with Plato. He proposed that the clutter of our daily lives is no more than a feeble reflection of the world of ideal Forms. The true essence of the objects we see is beyond space and time, just to give you a nano-summary of his wisdom.

Now skip a few millennia. The enlightened designer likes to think up wildly intricate entity-relationship diagrams that model a fraction of the world as he perceives it, ignorant of the sordid reality of fragmented indices, concurrent modifications and failed backups, because on Mount Olympus nobody pollutes their pristine tables with real-world data. Sadly, you cannot make a living by keeping your databases empty. That means you cannot design with a Platonic eye. You must have some estimate of how many rows each table will hold, how often it will be queried, updated, inserted into and deleted from. You have to allow for scalability within reasonable margins. Databases don't automatically polish up a crummy data definition based on actual query behaviour and rowcounts. That's your job, and a dirty one it is at times.

Object-oriented programming has become the new sliced bread. Java is sooo much cleaner and nicer than eighties Oracle PL/SQL. I want to do it all in Java, why can't I? After all, we have tables and columns, classes and objects, foreign keys and object references. Same thing, isn't it? I should be able to tinker with data records as if they were objects. Not so fast. What we have here are analogies, not essential similarities. I believe the crucial difference between objects and data records lies in their lifespan and numbers. Objects are disposable by nature. We create them, use them for our vile private purpose and let the garbage collector kill them for us. They never survive beyond the lifespan of the operating system process that created them. They're not built to last. Database records, if well nursed, will survive Larry Ellison and his grandchildren. An object heap is to a database as a whiteboard to the British Museum Library. The whiteboard is meant only for the people in the meeting and is erased afterwards. The library is accessible to anyone with a library card. These are essential differences, not accidental ones. If it doesn't look like a duck and doesn't quack like a duck, don't dress it up as one.


                    Objects                          Database records
Average lifespan    milliseconds to hours            minutes to millennia
Used by             one virtual machine at a time    thousands of them
Average count       thousands to millions            thousands to gazillions

Object-relational mapping solutions want you to forget about tables and think objects instead. What you scribble on the whiteboard goes into the database thanks to clever proxy objects that wrap a database connection and take care of all the inserting, deleting and updating for you. These are the standard things you do with objects. You create or receive them, use them and leave them to die. Looking them up is a less frequent operation, and when you do it's usually by flipping through the Rolodex/iterating through the collection, or picking one out by its index. Big deal. Database retrieval is all about speed and efficiency, because libraries are big places. It is a big deal. That's why nobody says:


SELECT * FROM products,items,customers,invoices
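To put a number on that query: without a where-clause the database pairs every row of each table with every row of all the others. The row counts in this little Java sketch are invented for illustration, as is the class name.

```java
class CrossJoinSize {
    // Rows produced by an unrestricted join: the product of the row counts of all tables.
    static long crossJoinRows(long... rowCounts) {
        long rows = 1;
        for (long count : rowCounts) {
            rows = Math.multiplyExact(rows, count); // throws ArithmeticException on overflow
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sizes for products, items, customers and invoices.
        System.out.println(crossJoinRows(1_000L, 10_000L, 100_000L, 1_000_000L));
    }
}
```

With these made-up sizes the result set would hold a billion billion rows, which is a lot of library to carry home.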

Since we don't have unlimited memory and patience, we have where-clauses. Getting what you want means telling the database how to join tables and what restrictions to put on column values, so actually telling it what you don't want. The where-clause is the key to querying, and Hibernate doesn't even pretend that its OO alternative, Hibernate Query Language (HQL), is a perfect substitute. "Unlike many other persistence solutions, Hibernate does not hide the power of SQL from you and guarantees that your investment in relational technology and knowledge is as valid as always" (it's right on the first page). So apparently Hibernate has deemed it necessary to build in a backdoor. Why would you want to know what SQL it generates? Two reasons. One, Hibernate gets it wrong from time to time. Two, even when it doesn't, the RDBMS itself can behave in cruel and unusual ways by choosing an inefficient execution plan. Note again that nastiness of this kind goes unnoticed when you only test with miniature datasets. Either way you're up the creek and you find yourself using native SQL littered with query hints, leaving the object paradigm and thinking tables again.

Is there a moral to this? If you need a car analogy: learn how to change gears manually, because Hibernate's automatic gearbox may work fine on the highway, but if it does abandon you it will be on the rockiest of roads, by which time you have forgotten how to use a gearstick. Under the right circumstances Hibernate can make your business layer look a paragon of cleanliness, a poster child for separation of concerns, but in a less forgiving environment it will behave like the leaky abstraction it is. By all means use it if you know what you're doing, but if you're building serious database-backed stuff don't even think you can get away with not knowing your SQL and the quirks of the RDBMS it is built on. Don't get lulled into a false sense of simplicity.