Sunday, 20 December 2009

Eat your own dog food - part one

I could be really boastful and pretend that I have 25 years' worth of programming experience, starting with the 100-line Commodore VIC-20 game Operation Crocodile when I was fifteen (no copies remain). At least in those days I did the right thing: I built little ramshackle programs entirely for my own use and enjoyment, and for as long as my adolescent attention span could muster.

In May 2000, while I was living in Edinburgh, Scotland, I quit a very boring job doing technical support in order to start a web development company with a fellow Dutchman. I got my feet wet with good old Perl/CGI code interspersed with lots (no really, I mean LOTS) of hardcoded HTML in a riot of highly artistic anti-pattern programming madness. But hey, it worked!
Around the turn of the century they paid you good money to hone your skills in a work environment that consisted almost exclusively of amateurs, and I mean that in the best sense of the word. Nobody could boast years of experience in web development, simply because the whole discipline was in its infancy.

I knew I was destined for better things, and during the time I spent in between contracts my partner in crime and I were hatching the killer app. It was called yourstreet.org (don't bother looking it up). The idea wasn't bad, provided you had the non-technical manpower to pull it off.
Based on Comiston Road in Edinburgh's well-heeled Morningside area we set up a website for local news, including free as well as paid advertising for the local shops. We acted locally, and too quickly thought globally. That's where we went wrong.

Geeks are typically not interested in the non-technical aspects of making a web enterprise successful. Our codebase mushroomed at a frightening rate. I was not just interested in Morningside road. I thought areas, cities, municipalities, the whole of Lothian, Scotland, the UK, Europe, the galaxy. I wasn't interested in actual users. Worse, I wasn't even interested in using it myself.

Take a brief moment to think about that.

A lot of the mistakes I made were textbook stuff:

1 - Re-inventing the wheel.

If you're writing a non-trivial application that needs to contain, say, a calendar to schedule events with multiple attendees, you have what is called a wheel. Wheels have been around a long time. They're available in many sizes, flavours and prices. Here's the obligatory car analogy: even the most revolutionary hydrogen-powered car will probably have standard wheels.

2 - No unit testing (or any of those pesky best practices)

Unit tests? Why should I need them? I don't make bugs! I was astonishingly naïve in those days, but at least I wasn't the only one. I once worked in a team where we had Human Semaphore File Locking (HSFL). “Nobody touch index.html. I'm working on it!”. At least we all worked within shouting distance.

3 - Programming with a blunt saw

I was a very inexperienced programmer. Although I was learning at a steady rate I had no peer supervision. I should have spent more time sharpening my saw by concentrating on the quality of my work instead of the quantity.

But there's one lesson that's not in any of the textbooks. It has to do with motivation.

Be prepared to eat your own dog food.




If you're investing time and energy in open source software, it must be to scratch some personal itch. Thinking up impressive architectures that don't actually do anything is a dead-end street.
If you really think you can out-PhotoShop PhotoShop, then do it, and use it to edit all your pictures. I don't think you stand a chance in hell, to be frank.
Being user-centric means being egocentric. Put yourself first. Don't fall into the trap of thinking what the software can do. Think what you want to do with the software.
To be continued...

Friday, 23 October 2009

Long live the average programmer



They say that great programmers are an order of magnitude more productive than average programmers. Wikipedia would want me to specify who "they" are (Frederick Brooks and Joel Spolsky for starters), but this is not a scientific assertion. Any programmer who has worked with really smart colleagues will, perhaps grudgingly, admit that it is by and large true.
It doesn't only go for programming, though. True excellence in any endeavour is rare (Mozart, Michelangelo, Monty Python). Excellent achievements really stand out. That's why we call them excellent. If excellent programmers weren't rare they would be average. So much for tautology.

Experience doesn't count for much. It doesn't take you a lifetime to become excellent at something, and that's a soothing thought. You find out quickly enough whether you really have great talent. I started playing the piano almost thirty years ago and I quickly made progress during my early teens, but after that it kind of stagnated. Child prodigies, by comparison, don't stagnate after a few years. They start playing when they're four and by the time they're sixteen they baffle the crowds at the major concert halls. Great athletes don't win the Tour de France once or twice: they win it at least five times (Merckx, Hinault, Indurain, Armstrong).

There's no shame in being out of these people's league. Most of us are. You can still be a competent pianist and avoid the fiendishly difficult sonatas of Franz Liszt. You can be a competent programmer and not be able to reverse-engineer the Linux kernel.

Should a software company fire people like myself and hire a genius, who is ten times more productive but takes up the same amount of office space? Of course not. To begin with, geniuses are very rare and even if you paid them triple wages (which would make very good business sense) a genius is primarily motivated by the satisfaction that practising his art gives him, only secondarily by the amount of money he can make with it. If you have a genius in your team, you'd better make sure to give them some real challenges, or they get bored before you can say "private office with Aeron chair".

Most programming work I have done over the years takes smart people, not geniuses. In fact, some of it was pretty mundane and boring. Most software jobs don't require inventing faster sorting algorithms or more reliable floating-point arithmetic. At least mine didn't. I would have been completely clueless anyhow. More likely they require building a user-friendly web form that handles phone numbers intuitively rather than insulting customers with "010-123456" does not match regex validPhoneNumber Javascript popups.

If it takes me a day to finish a simple task, the star could probably do it in half that time, but I guess the simpler the task, the smaller the difference. Partly because the star would get bored or consider it an insult to their intelligence. Molest me not with this pocket calculator stuff, they will reply haughtily when asked to hack a few AJAX web forms. Give them something really difficult to do, because that's where they shine. How much faster than myself could they reverse-engineer the Linux kernel? It does not compute, because I simply wouldn't be up to it. Not in this lifetime, with this set of brains at least.

Friday, 16 October 2009

The Page Paradigm

Following Jeff Atwood's tip I read Dan Ariely's excellent book Predictably Irrational, which gave me all sorts of revealing insights into my own irrational psyche.

Anchoring is the psychological phenomenon explaining how we human beings are reluctant to update our ingrained opinions. This relates to our sense of value (cheap or expensive) as well as our conception of the validity of doing things a certain way. We like to stick to what we know. If you grew up in the fifties, 200,000 euros for a house will always feel like an obscene amount, and if the Commodore VIC-20 was your first computer any file over a megabyte just feels huge. I remember a C programmer lambasting me in very derisive terms for programming with mod_perl (an embedded Perl interpreter in the Apache web server) because it added a mere 4 MB to Apache's memory footprint. Note that this was seven years ago, in 2002.

If you built dynamic web sites around the turn of the century, you'll have the paradigm of HTTP, HTML and JavaScript combined with Perl, JSP, PHP or ASP anchored in your brain, and you'll be used to doing things in a very time-consuming and cumbersome way. I don't care what anyone may say to the contrary: building fast, usable, secure and reliable web applications was and is a very tough job.

You could fill entire bookshelves lamenting the inadequacies in terms of security, usability and browser compatibility nastiness, but what's the point? Clearly the benefits outweigh the drawbacks. Embedded Java applets had their chance and we didn't want them. Web applications are here to stay. The underlying technology has evolved impressively, yet for a revolution we should stop thinking of web applications in terms of pages. It's time to haul that anchor and move on.

Originally the web was conceived as an information universe of static content, randomly accessible through hyperlinks. The back button has always been an indispensable tool. It allows you to follow any link on a page and quickly retrace your steps. It's the equivalent of the undo function and it works well for static content.

A typical web application however does not take kindly to it. The function of a web application is performing a mandatory sequence of actions in which the user and the remote server typically exchange data through HTML forms. You search and pick a book of your choice, you provide shipping and payment details and then you submit your order. Presenting these stages as navigable pages with a back button is asking for trouble. Your browser knows this. It will warn you that you are “attempting to submit a form twice”, or something to that effect, leaving the less technically inclined users clueless. Bad programmers will even book you three seats on the same flight as a penalty for hitting the refresh button a couple of times. Where's the logic in that? The back and forward buttons look like undo and redo functions, but they're not. You can even bookmark a submitted form like this: www.initrode.com/order.cgi?orderId=12345;creditCard=1234234534564567;type=VISA;valid=10_2015.

Don't get me started.
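
For what it's worth, the three-seats-on-one-flight problem has a server-side remedy that doesn't depend on the user's restraint: hand out a one-time token with every form and accept each token only once. Here is a minimal sketch, assuming Java 8 or later. SubmissionGuard, issueToken and claim are names I made up for the occasion, not part of any framework.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard that lets each rendered form be submitted only once. */
final class SubmissionGuard {

    // Tokens that have been handed out but not yet submitted.
    private final Set<String> outstandingTokens = ConcurrentHashMap.newKeySet();

    /** Call while rendering the form and put the result in a hidden field. */
    String issueToken() {
        String token = UUID.randomUUID().toString();
        outstandingTokens.add(token);
        return token;
    }

    /**
     * Returns true the first time a token comes back, false for repeats,
     * unknown tokens and null.
     */
    boolean claim(String token) {
        // remove() reports whether the token was still outstanding.
        return token != null && outstandingTokens.remove(token);
    }
}
```

The order handler would call claim with the token from the hidden field and skip the booking whenever it returns false, so a refresh or a second trip through the back button books nothing.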

Ajax to the rescue. Browser technology has now progressed to the point where you can build an entire client-server application to be operated from a single HTML page. Embedded or linked JavaScript libraries generate page elements on the fly and communicate asynchronously with a remote server without having to refresh the page. No more form submissions. And no more quirky back buttons. The Google Web Toolkit (GWT) framework even has history management built in, so users can hit the back button and the application can be made to behave appropriately, which also means notifying the user when an action is not undoable.

Usability guru Jakob Nielsen still did not think much of AJAX in 2005, when it was still a budding technology (http://www.usabilityviews.com/ajaxsucks.html), but he's slowly coming round and for a thousand bucks he'll tell you why.

Existing page-based technologies like PHP and JSP remain anchored in the pages metaphor. If you're a Java developer with plenty of experience in Swing GUI programming you should give GWT a spin. It's the most radical departure from the old school and I absolutely love it.

Sunday, 11 October 2009

The Curse of Mr Fixit

There are two kinds of programmers. There are those who get back from work, shove a frozen pizza in the microwave and scoff it down while coding their own LDAP server in a little used but cool language. Then there's the motley assortment of CS, physics, history and language graduates with a life that doesn't only involve computers. In the first category you will find Mr Fixit (I have yet to come across a Ms Fixit).

Mr Fixit has a voracious appetite for new technologies. He has a l'art pour l'art attitude when it comes to information technology, where writing your own LDAP server for an address book with thirty entries makes perfect sense. He likes to think himself a free spirit. Anything that impedes his artistic flow (like coding standards and testing procedures) he hates with a passion.

He likes to disparage lesser operating systems and programming languages with a destructive zeal comparable to that of orthodox Muslims looking at a sacrilegious drawing of the prophet Mohammed.

He relishes the kudos that comes from getting a job done, not necessarily from getting the job done properly. He's not afraid to use duct tape to stop a leak, but the duct tape is never replaced by proper solder. Where's the fun in that? He shows great stamina when solving a nasty classpath issue in the web server, but he'll often be curt and impatient explaining his solution to lesser mortals. That's because he likes to be smarter than you.

However, our Mr Fixit is a person of flesh and blood and writing your own Microsoft killer in the wee hours of the morning is going to tell on you one day. When Mr Fixit suffers his first major burnout the company will realize what trouble they are in for not having kept him in better check. Suddenly everybody realizes how they have relied on Mr Fixit and nobody remembers where he left all these nifty undocumented Perl command line utilities to reboot the Oracle servers and archive the log files.

As a manager, you should not always avoid hiring a Mr Fixit. He can be extremely useful when facing a tight deadline, working ungodly hours when the rest of the team would rather save their marriage. But you cannot let him get away with the feeling that therefore the rules don't apply to him. If every programmer in your team has to provide unit tests and documentation, so does Mr Fixit.

Secondly, be aware that any new software technology or working method has a learning curve. Programmers are smart people with usually a fine capacity for self study, but we can also be very conservative in our adherence to doing things a certain way and be prejudiced towards anything newfangled. If you let Mr Fixit install a new source repository and a new build server without proper training and evangelism within the team, you have made him the de facto guru and everyone will turn to him for the most stupid questions. Now you've really pissed him off!

Monday, 28 September 2009

Documentation rant - part two

Last time I desperately tried to convince you to take source code documentation seriously and not treat it as a hurried afterthought, lest the technical debt management unit catches up with you and demands an explanation as to why you executed an explicit commit on that database handle, when it is running in auto-commit mode. Believe me, in a year you will have forgotten the completely valid reason for doing so unless you document it today.

I would like to dwell a little on the difference between API documentation and the rest, because it's an essential difference. In Java, API documentation can be generated from your source files, provided your code comments are properly formatted for the Javadoc standard. Modern development studios like Eclipse and NetBeans make life very easy in that regard.

On the other hand we have these little one-word or one-line comments scattered throughout. Let's call them inline comments. Whereas these are an integral part of your source code, API docs are extracted to be read separately, intended for your users. Users of a third-party software library are programmers, but users nonetheless insofar as they should be expected to use your API without reference to the sources. In many commercial software libraries they will not even have access to them. Therefore make sure that the documentation for each of your classes and methods is a mini-manual.

The manual tells you what your library (or toaster) does and how to operate it. It only touches on technical details where they are relevant for the operation. Documentation for a class or method should cover behaviour (what goes in, what comes out, what can go wrong) but no explanations as to why. If it becomes necessary to explain something unintuitive in your design it's your job to fix the design. Unless of course you're not in a position to do so.
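
To make that concrete, here is a minimal sketch of such a mini-manual. The Postcodes class, its normalise method and the simplified rule of four digits plus two letters are all invented for this example; only the shape of the comment matters. It says what goes in, what comes out and what can go wrong, and nothing about why.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, invented for this example. */
final class Postcodes {

    // Simplified assumption: four digits, an optional space, two letters.
    private static final Pattern POSTCODE =
            Pattern.compile("\\s*(\\d{4})\\s?([A-Za-z]{2})\\s*");

    private Postcodes() {
    }

    /**
     * Normalises a postcode to the form "1234 AB".
     *
     * @param raw the postcode as typed by the user; lower-case letters,
     *            a missing space and surrounding whitespace are accepted
     * @return four digits, one space and two upper-case letters
     * @throws IllegalArgumentException if raw is null or does not consist
     *         of four digits followed by two letters
     */
    static String normalise(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("postcode is null");
        }
        Matcher m = POSTCODE.matcher(raw);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a postcode: " + raw);
        }
        return m.group(1) + " " + m.group(2).toUpperCase(Locale.ROOT);
    }
}
```

A caller who has only the generated docs in front of them knows that normalise("1234ab") hands back "1234 AB" and that garbage input ends in an IllegalArgumentException, without ever opening the source file.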

If you find yourself writing a lot of inline documentation, one or more of the following may be the case:

  • // loop through all Address objects
    for ( Address adr : getAddresses() ){

        // throw a ValidationException if the postcode doesn't match the regular expression
        if ( !POSTCODE_PATTERN.matcher(adr.getPostcode()).matches() ){
            throw new ValidationException(...);
        }
    }
    You underestimate the intelligence of your fellow programmers. Assume that the reader of your code is a competent programmer. If it doesn't pass your own 'duh-test', leave it out.

  • Your typical opening and closing braces are regularly more than a few hundred lines apart. You keep getting tangled up in your own spaghetti and have to tell yourself what you're doing every three lines. This is really bad and I'm not even going to explain why. In fact, I won't even give you an example. It's time for some serious refactoring.

  • You're programming against an undocumented, unintuitive or buggy API (probably all three at the same time). It has methods called doStuff or patrick_solution. You have my sympathy. This is where inline comments are indispensable, because you can tell the world it's not you who is incompetent.
    Dog fido = zoo.getDogByName("Fido");
    // Yeah, it says Id, but it's actually a lookup by name
    Cat minou = zoo.getCatById("Minou");
    proxy.insertAndCommit(newRecord);
    // insertAndCommit doesn't seem to autocommit on MySQL 4.x. AARRGH!!
    proxy.getConnection().commit();

  • You get paid by lines of code. Where do you work, and are they hiring?

As far as inline comments go I see a lot of type one and type two, and not nearly enough of type three, especially in code that makes frequent use of open source libraries. There are a lot of unintuitive and badly documented libraries around that actually work fine once you know where the pitfalls are. Why any programmer would not go that extra mile to make their stuff actually usable is beyond me. Unless they really are only working for their own pleasure.

Friday, 18 September 2009

The passport from hell

This week I was going to post part two of my course in source code documentation, but something far more important has come up to rant about. It's the new Dutch passport, which will hold the owner's digitally encoded fingerprints, in time amounting to a huge biometric database. The year is 1984 again.

This post is not going to be about hackers logging in to the monster database with password "change_on_install"*. I don't doubt the system can be compromised. It's only a matter of time. Those responsible for guarding sensitive data in the Netherlands have proved themselves shockingly cavalier and nothing but an embarrassment of epic proportions is likely to effect a change. Apparently with a cheap set-up and some patience you can produce your own forged prints on plastic foil and wear them to the scene of the crime. It must be true, because I read it on the Internet.

You know, forget about fingerprints. They can be forged and therefore in court they don't always stand up. We all know the holy grail of forensic conclusiveness is DNA. Unless you have an identical twin sibling, your DNA is intimately yours. It's impossible to produce your own fake DNA to throw forensics off the scent. Next time you get a passport you'll be handing over a saliva sample, mark my words.

Anything that can go wrong will go wrong. Any technology or data that can be put to evil use will be abused. In the same way that the usefulness of a mobile phone network grows exponentially with each new user, so will that of a biometric database of all citizens.

Imagine how easy it becomes for prospective wrongdoers to incriminate someone of their personal acquaintance, secure in the knowledge that their DNA is stored and can be pulled from the database in no time at all. Consider how easy it is to obtain a DNA sample from someone you know. I don't mean a printout from the lab, but actual tissue. Any article of clothing is teeming with it. Just steal a hairbrush and carefully place a few hairs (not too conspicuous, of course) over the murdered body of your choice. Make sure the intended suspect has no credible alibi and they have been in contact with the future victim, preferably with some supporting CCTV footage. Bob's your uncle.

* This is the default administrator password for Oracle databases, and it's appalling how often I have found it still used in production systems, despite the unambiguous hint in its name...

Saturday, 12 September 2009

Brush your code after every meal

Programmers have many pet hates -- hardware and software being just two of them. There is however a bewildering paradox that I would like to talk about today. It is the anguish of documenting your own code on the one hand and the torture of having to use someone else's undocumented code on the other. Actually there is one thing worse than not having documentation. That is bad or outdated documentation and the world of open source software is rife with it. Old docs are like an old copy of the Lonely Planet where you travel half a day to visit some must-see haunt only to find out it closes on Sundays.

To many programmers writing source documentation is up there with cleaning the rim of the toilet bowl in terms of satisfaction. It wouldn't be like that if these people took a more selfish approach. Taking your documentation seriously is not for the good of mankind. It's all about doing yourself a favour in the end.

Source documentation is different from functional requirements and specifications in terms of the intended audience: it is written by techies to be read by techies. It is also different from other technical documents such as UML diagrams in terms of its purpose. Requirements and specifications describe what the software should do, whereas source documentation describes what the system actually does. As a consequence it is impossible to document what some class or method does before you have coded and run it. If you do document beforehand you'll need to check carefully that what you have coded is in line with your documentation, otherwise your carefully crafted text is instantly useless. And we already know that bad documentation is worse than none at all.

There is another good reason why you should document after coding. The writing of a piece of software is a fluid process, especially in the early stages. Over a short period of time you will throw away some classes, split them up, merge them and add or remove arguments. The fewer functional requirements you have, the more this will be the case. All the while you keep a clear mental image of the whole, and once you're satisfied you go over it again and describe exactly what you have done. In an ideal world, that is.


The satisfaction of seeing your own code work makes you hungry to write more. Why should I write down what happens when I can see what happens? Try to resist the urge to steam ahead. Take a step back, revise what you have written and describe it. More often than not you'll spot a bug or two in the process. More importantly though, all those classes and their interrelations make perfect sense in your brain now, but won't in a year's time. Pay back your technical debt now before the interest eats up your team's budget. If you don't care for your employer's money at least protect your own future sanity. For those still unconvinced let me make it clear with a little dental metaphor. Brushing your teeth is not as much fun as the meal that preceded it, but nothing compared to the agony and expense this poor man went through:











Next week I'll share with you a way to make documentation more efficient and less of a chore. I have called it 'the duh test' – if that's too juvenile or American to your taste you may call it the “Well, obviously, my dear fellow” test. Whenever you feel you could stick a duh behind your comments, leave it out altogether. Documentation is about stating the non-obvious.

Saturday, 5 September 2009

Would you download a car?

It's already a few years old, but if you ever bought a DVD in the UK you'll remember this one:
You see nasty people stealing all aforementioned items, and then a teenage girl behind her computer downloading a film, thereby instilling the notion in us that downloading content illegally is tantamount to mugging old ladies in the park. I love the British tenuous sense of proportion in what is otherwise an annoying, superfluous yet unskippable clip on a perfectly legal copy, but I won't go into that paradox now. Try YouTube for some of the hilarious parodies it inspired.

Propaganda works with imagery that evokes an emotional response to accompany your message, but can be completely unrelated to it. Don't underestimate the power of association. People will create a context between what they see, hear and smell, however flimsy the connection. Bad breath has nothing to do with a person's character, but it will ruin any date.

At the far end of the propaganda spectrum we have the infamous anti-Semitic pamphlets of the Third Reich and in a less pernicious form we have Michael Moore's controversial editing of George Bush's finest moments in Fahrenheit 9/11. Even the fact of me mentioning Moore in the same paragraph with the Nazis is purposely creating a connection in your brain right now. Objection, your honor!

I'm not a lawyer, but I did study Dutch law for one year and I will tell you what you and the people who commissioned this silly bit of agitprop already know.

Yes, copyright infringement is against the law in most countries, but if you equate downloading with stealing you practise justice by analogy. You may claim that the effect of sneaking a physical disc out of the Virgin Megastore or downloading it from the Pirate Bay boils down to the same thing – leaving Richard Branson out of pocket. You may even have convincing evidence that verbal abuse can be as bad as physical assault. However, a criminal act is defined by what people do, while the harm it causes to society is (or should be) expressed in the punishment.

There are no people more fussy about wording than lawyers and judges, with the exception of good translators. Informally put, in Dutch law stealing means removing (1) a physical item (2) belonging to someone else (3) without permission (4), with the intention of keeping it (5). If the prosecution can't persuade the judges of all five elements that means you're off the hook.
How about not returning my library books? They're physical, they're not mine, I took them out from the library and I don't intend to bring them back. Ah, but my client did have permission to remove them from the library, your worship. He's not a thief, he's a rotten embezzler of books.

If you want a better analogy, then downloading is like getting on a train or in a cinema without a ticket. You're enjoying something for free that other people paid for. Provided there are enough seats, you don't impede their enjoyment.

So much for this overly long pedantic preamble. I have a confession to make. I count many illegal downloaders among my friends, colleagues and acquaintances. None of them steal cars or beat their spouses as far as I know. So why do they do it?

Downloading is just too easy to do and too easy to get away with. Historically, when a crime is ubiquitous and the perpetrators tough to track down the law retaliates with excessive punishment. Charging Jammie Thomas two million dollars ($80,000 per song) reminds you of the practice of killing horse thieves in the Old West. You wouldn't break the speed limit if it cost you your car and your house. In the Netherlands no civilians are bankrupted for ripping a few albums. Although the political climate is set to change, right now the most compelling incentive for people not to download content would be their conviction that it is simply wrong. It appears most people, especially the young, don't feel that strongly.

The effect that being completely invisible would have on a person's morality has been argued by philosophers and explored in literature and film. I myself believe that the getting-away-with-it part of the attraction is weaker than the it's-not-that-big-a-deal conviction. The effect of illegal downloads is not tangible, like punching somebody in the face. When you don't see your victim you cannot contemplate the harm you have caused. Some will argue that their downloading does no harm at all. They still buy as many CDs as before and will claim it is a victimless crime.

If you want to convince people that downloading hurts, show some of the small record stores and video rental shops going out of business. I don't grudge Metallica drummer Lars Ulrich his wealth, but when he spoke out against Napster I didn't feel sorry for him. The average punter has no compunctions about making millionaires a little less wealthy.

I hope the Capitol v. Thomas case will prove to be a Pyrrhic victory in the end for MegaCorp Inc and that the Internet will prove to be a blessing, not a curse. Authors become publishers through print on demand. Bands let you download their music from their own web sites and sell must-have limited editions straight to the fans. Power to the people!

Tuesday, 25 August 2009

Filling in the gaps

In my previous post I showed you that the legacy of ASCII is still everywhere after more than forty years. It is not even for want of something better, since we have Unicode now. The reason why the world didn't ditch ASCII straight away is that it still does a fine job for English-only texts.

At the heart of the pâté problem is not the fact that ASCII was incomplete to begin with. The problem is that two very dim-witted participants in a conversation make the wrong assumptions about the language each of them speaks and that they must have a not very educated guess at it. If I send you a book in Finnish with a Post-it note stuck to it saying Here's the English novel you requested, you will either assume I sent you the wrong book or that I wrote “English” where I meant to say “Finnish”. Even though you may not know any Finnish, you would be intelligent enough not to bother looking up the words in an English dictionary. Yet that's exactly what computers do when you send them English and tell them it's Mandarin. They will convert the numbers into a bewildering array of Chinese characters. That's how dumb they are. Makes you feel superior, doesn't it?
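
Here is roughly what such a wrong assumption looks like from the machine's side. This is a minimal sketch of one common way to produce garbage of this kind, assuming the sender writes UTF-8 and the receiver guesses ISO-8859-1. The class name Mojibake and the method misread are my own inventions.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of a receiver decoding bytes with the wrong charset. */
final class Mojibake {

    private Mojibake() {
    }

    /** Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1. */
    static String misread(String text) {
        // The sender turns characters into bytes using UTF-8 ...
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // ... and the receiver dutifully looks them up in the wrong table.
        // Every accented letter took two bytes, so it comes back as two
        // unrelated characters. Plain ASCII letters survive unchanged.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Feed it pâté (written "p\u00e2t\u00e9" if you don't trust your editor's encoding either) and you get pÃ¢tÃ© back, while unaccented pate survives untouched. No error, no warning, just a confident wrong answer.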

I will let you in on a little secret:

I dare you to feed that line to any optical character-recognition software. Human communication is very resilient when presented with crummy data. That is why websites let you decipher these wavy nonsense words as a security measure when you create a Gmail account. It takes a special kind of intelligence (i.e. human) to do this. Calculating pi up to a hundred decimals? That's nothing. Having a conversation in a busy pub after three beers, now that's an impressive feat of data processing. We cannot fathom what marvellous mechanisms of extrapolation and pattern recognition go on under the surface. We are simply the users of our brains, not the designers.

Computers hate vagueness, while we humans thrive on it. Try communicating with one, even in a relatively user-friendly programming language and you'll know what fussy, unfeeling monsters they can be. Put a single semicolon wrong and they'll just shut up on you. What a cheek! For global communication we need a standard that's a bit more kind to us people. Or do we?

In 1991 British physicist Tim Berners-Lee pioneered the technology that paved the way for the World Wide Web as we know it today. Among his inventions were a communications standard for computers to exchange documents, the Hypertext Transfer Protocol (HTTP), and the mother of all Internet document formats, the Hypertext Markup Language (HTML). An HTML document contains readable text interleaved with so-called markup tags to indicate bold text, paragraph breaks, table structures, hyperlinks and links to images, video and sound. Markup tags are enclosed in angle brackets and not displayed as-is, but interpreted.

The average user doesn't know HTML. They don't know that some HTML documents are actual physical files and that others are tailor-made with bits of data from various databases. They don't know that HTML is an open standard, that there are thousands of software tools available to create HTML documents and dozens of applications to view them, some good and a lot really awful. Why should they want or need to know? I made my first website in 1998 using Word 97, when I was working as a teacher in China. It produced terrible HTML, but I didn't care, since it looked all right in Internet Explorer 3. I just wanted to share my experiences and do it with the tool I knew best, not necessarily with the best tool. I had neither the time nor the inclination to learn this so-called HTML. What's more, I don't recall any problems with accented Dutch vowels as long as I stayed within the Microsoft ecosystem.

Berners-Lee wanted to make HTML a simple and forgiving standard. Simple in the sense that you can read and write raw HTML code without a specialized tool and forgiving in the sense that the software used for displaying it does the best it can, even though your code is messy and doesn't tell the receiver what character set to use.
Tools that work with open standards should follow the robustness principle, also called Postel's Law, after computer scientist Jonathan Postel.

Be conservative in what you do, be liberal in what you accept from others.

A fine human quality indeed. Not so fine for machines, though. In the harsh reality of web browsers it more likely becomes:

Always tell others what character set you're using, but if they don't tell you theirs and it has the word comrade in it, just assume it's Russian and go with Cyrillic characters.
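For the programmers who strayed in here anyway: the "tell others what you're using" half usually travels in a header such as Content-Type: text/html; charset=ISO-8859-15. What follows is only a minimal sketch of the receiving end, not how any real browser does it. The class and method names are made up for illustration, and the fallback guess is simply whatever the caller passes in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class DeclaredCharset {
    // Returns the character set named in a Content-Type value, or the
    // caller's fallback guess when none is declared or the name is unknown.
    static Charset fromContentType(String contentType, Charset fallbackGuess) {
        if (contentType == null) {
            return fallbackGuess;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException unknownOrIllegalName) {
                    return fallbackGuess;
                }
            }
        }
        return fallbackGuess;
    }

    public static void main(String[] args) {
        // The sender was polite and said what it speaks.
        System.out.println(fromContentType("text/html; charset=ISO-8859-15", StandardCharsets.UTF_8));
        // The sender said nothing, so the receiver is left guessing.
        System.out.println(fromContentType("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

Everything that goes wrong with pâté goes wrong in that last line: the guess is just a guess.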

The purists will tell you that HTML is not a programming language. They are probably right, but they are also the kind of people who love writing HTML code by hand. If you consider the learning curve (high) and the coding errors you can get away with before your page turns to garbage (few), then HTML has all the features of a programming language. The educated people who provide most of the content for the Internet certainly don't want to learn it. They want a tool that does it for them, and does it well.

Monday, 24 August 2009

ASCII and the mangled euro

If you regularly visit non-English sites or receive email from exotic places you're more than likely to have come across gems like p?t?: ? 9, where you expected pâté: € 9. If you look up pâté in the Longman online dictionary, this is what your tab looks like in Firefox 3 on Windows:



If you're a web professional and are clueless as to what's causing this, or (worse!) if you're responsible for things like this on a regular basis, go read Joel Spolsky's excellent post on character sets and Unicode first. The rest of this article is for non-programmers anyway.

We carry portable gadgets with features too slick and cool even for the original creators of Star Trek. So why does the newest version of my browser still make pâté of the euro sign and the accents over the é? It started a few years before Star Trek, it is called ASCII and its legacy is everywhere.



[Images: the Tricorder, 1966 sci-fi gizmo; the iPhone, 2009 real gizmo]





The page you're reading right now reaches your computer as a large row of numbers, and it's the job of your browser, be it Internet Explorer, Safari or Firefox, to convert these numbers into text. To do that, the digital community has agreed on several standard ways to get from a bunch of numbers to letters, numbers and punctuation marks. However, if your browser encounters a non-standard number it has no way to display it. Bad luck. Yet how did that "wrong" number get there in the first place? Did the author make a typo? Did the number get mangled up in net traffic? Nope. Remember I highlighted the word several a few sentences back. Your browser either doesn't know which standard to use and has picked the wrong one, or it's simply being lied to.

In 1963, after three years of hard work, a subcommittee of the American Standards Association introduced the American Standard Code for Information Interchange (ASCII). Mark the word American. ASCII assigns a number between 0 and 127 to all Roman letters, digits and several punctuation marks. This covers everything you need to display English text, but is not nearly enough to display all the accented letters (diacritics, to use the technical term) used by other languages based on the Roman alphabet, let alone Greek, Russian, Chinese, Japanese, Korean, etc. You may wonder why these wise men didn't do it properly and reserve room for more than a measly 128 characters. A hundred thousand should do it and would let the Asians join the party. Well, the sixties were still the era of punch cards. They may have had free love in those days, but data storage came at a hefty price. Nobody wasted bits to placate the Chinese or the Russians, especially not during the Cold War. To suggest that in forty years' time you would have a gigabyte of memory dangling on your key chain would have been too outrageous for words.
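If you do happen to program, the mangled p?t?: ? 9 from the top of this post is easy to reproduce. The sketch below is my own illustration in Java (the class and method names are invented). It pushes the menu line through plain ASCII and back. The accented letters are written as \u escapes so the source file itself stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

class AsciiRoundTrip {
    // Encode text as US-ASCII bytes and decode it again. Characters that
    // ASCII has no number for are replaced by the byte for '?'.
    static String throughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // U+00E2 is a-circumflex, U+00E9 is e-acute, U+20AC is the euro sign.
        String menu = "p\u00e2t\u00e9: \u20ac 9";
        System.out.println(throughAscii(menu)); // prints p?t?: ? 9
    }
}
```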

This is how they solved the multilingual problem. Computers express all numbers in ones and zeros, so-called binary digits (bits). Grouping these bits in chunks of eight lets you express 2^8 = 256 different numbers. These chunks are called bytes, and for the sake of the argument we will assume that a byte is always eight bits long. Remember that the core ASCII set only had 128 entries. If you occupy a single byte per character you can use the numbers 128 to 255 for something else. You could have a set combining Greek and Roman, or Cyrillic and Greek. As long as people agree what character in the extended set each number refers to.
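To make the "something else" concrete, here is a small Java sketch (names are mine, and it assumes your Java installation ships the ISO-8859-5 and ISO-8859-7 sets, which a standard JDK normally does). It decodes one and the same byte under three one-byte character sets and prints the resulting Unicode code points rather than the glyphs, so your console's own encoding cannot muddy the result.

```java
import java.nio.charset.Charset;

class OneByteManyMeanings {
    // Decode a single byte value under the named character set.
    static String decode(int value, String charsetName) {
        byte[] oneByte = { (byte) value };
        return new String(oneByte, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println(1 << 8); // 256 different values fit in eight bits

        // Below 128 the sets agree: 0x41 is 'A' everywhere.
        System.out.println(decode(0x41, "ISO-8859-1") + decode(0x41, "ISO-8859-7"));

        // From 128 upwards the meaning depends on the set: 0xE9 is e-acute
        // in Latin-1, iota in the Greek set and shcha in the Cyrillic set.
        for (String name : new String[] { "ISO-8859-1", "ISO-8859-7", "ISO-8859-5" }) {
            System.out.printf("%s: U+%04X%n", name, decode(0xE9, name).codePointAt(0));
        }
    }
}
```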

People actually agreed to differ, and lots of different character set mappings have since been drawn up to accommodate the various languages of the world. What most of these standards have in common are entries 0 to 127, the original ASCII mapping. Note that character set mapping is not a technical term. The nitty-gritty involves code pages, code points and encodings, but it still boils down to a technical means of getting from

Dear Åsa, Günther, Søren and Małgorzata

to

0110001101100101011001101010111010101000100110100101110010110110010100101*

The string of characters I type on my keyboard is translated into numbers and then stored or transmitted. A different machine or program can only display these numbers correctly if it knows how to map the numbers to readable characters. That means it must know how the original text was encoded. It makes no sense to throw a bunch of numbers at me over the Internet without telling me how to interpret them.
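For the programmers among you, a minimal Java sketch of exactly that misunderstanding (the class and method names are invented for the occasion). The sender encodes pâté as UTF-8, the receiver assumes ISO-8859-1, and each accented letter comes out as two unrelated characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongLabel {
    // Turn text into bytes the way the sender does, then back into text
    // the way the receiver believes it should be done.
    static String sendAndReceive(String text, Charset senderUses, Charset receiverAssumes) {
        byte[] wire = text.getBytes(senderUses);
        return new String(wire, receiverAssumes);
    }

    public static void main(String[] args) {
        String word = "p\u00e2t\u00e9"; // p, a-circumflex, t, e-acute

        String right = sendAndReceive(word, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String wrong = sendAndReceive(word, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(word)); // true
        System.out.println(wrong.length());     // 6, because both accented letters doubled
    }
}
```

Nothing was damaged in transit. The bytes are identical in both cases; only the assumption differs.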

You would think that having as few different character sets as possible is a good thing, because more will only add to the confusion. True, but a big shortcoming of these one-byte character sets is that they have no vacancies for new characters. A case in point: when the euro was introduced (on paper in 1999, as cash in 2002), with it came the brand new character €, but nowhere to put it. It might have gone in an existing, little used slot of the most popular character set ISO-8859-1, but that's not what happened. Standard committees are wary of change. It's like the American constitution: you cannot change the original text, but you can have amendments added to it. So they copied the existing character set, called it ISO-8859-15 and stuck the euro in an existing, little used slot. Problem solved, as long as all software clearly communicates that it is using either the 1 or the 15 flavour, which it is not required to do. You get my point.
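The slot in question is number 164 (A4 in hexadecimal). A quick Java sketch, with names of my own choosing and assuming your JDK includes ISO-8859-15 (standard ones normally do), shows the same byte meaning two different things depending on the flavour:

```java
import java.nio.charset.Charset;

class EuroSlot {
    // Unicode code point that a single byte value stands for in the named set.
    static int meaningOf(int byteValue, String charsetName) {
        byte[] oneByte = { (byte) byteValue };
        return new String(oneByte, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        // Generic currency sign in flavour 1, euro sign in flavour 15.
        System.out.printf("ISO-8859-1:  U+%04X%n", meaningOf(0xA4, "ISO-8859-1"));  // U+00A4
        System.out.printf("ISO-8859-15: U+%04X%n", meaningOf(0xA4, "ISO-8859-15")); // U+20AC
    }
}
```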

There is another problem with one-byte sets. For Chinese and Japanese the ASCII single-byte trick will not do. These so-called logographic writing systems have thousands of characters and need at least two bytes per character, because that gives you 256 x 256 = 65,536 options. If I wanted to add my Chinese friend to the salutation I would be needing characters from different sets, so I would need to indicate that some bytes are from set A and others from set B. What you really need of course is a character set big enough to accommodate all these characters. That's where Unicode comes in. Unicode has been a long time in the making (starting as early as 1987) and is the much awaited effort to unite all the languages of the world in one happy digital standard.
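A rough Java illustration of the crowded salutation (my own sketch, with an invented class name and a Chinese surname added as the hypothetical friend). It asks a few character sets whether they have room for every character, and how many bytes the Chinese character needs in two common Unicode encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoomForEveryone {
    // True when every character of the text has a place in the character set.
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Asa with a ring, Guenther with an umlaut, Soren with a slashed o,
        // Malgorzata with a stroked l, plus the Chinese surname U+738B.
        String salutation = "Dear \u00c5sa, G\u00fcnther, S\u00f8ren, Ma\u0142gorzata and \u738b";

        System.out.println(fitsIn(salutation, StandardCharsets.ISO_8859_1));    // false
        System.out.println(fitsIn(salutation, Charset.forName("ISO-8859-2")));  // false
        System.out.println(fitsIn(salutation, StandardCharsets.UTF_8));         // true

        // One Chinese character: two bytes in UTF-16, three in UTF-8.
        System.out.println("\u738b".getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println("\u738b".getBytes(StandardCharsets.UTF_8).length);    // 3
    }
}
```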

Unicode is all-inclusive and able to adopt new characters in the future, but even if we ditch all ASCII derivatives today and use nothing but Unicode that doesn't magically change all the millions of legacy pages still out there. Compared to the speed of hardware improvements new digital protocols and standards seem to take an eternity before they are universally adopted. What makes it even more treacherous is the fact that Unicode conforms to the first 128 positions of ASCII. An understandable choice if you want backward compatibility: the text “Hello World” maps to the same numbers in Unicode as it does in any of the ASCII-based standards.
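You can check this for yourself. The Java sketch below (names invented, and I am assuming UTF-8 as the Unicode encoding, since that is the common one on the web) compares the bytes produced by ASCII, ISO-8859-1 and UTF-8. They agree for plain English and part ways as soon as an accent shows up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SameBytes {
    // True when ASCII, Latin-1 and UTF-8 all turn the text into the same bytes.
    static boolean sameInAll(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        System.out.println(sameInAll("Hello World"));    // true
        System.out.println(sameInAll("p\u00e2t\u00e9")); // false
    }
}
```

So a page that happens to contain only English can carry the wrong label for years without anyone noticing.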

However, by doing so Unicode allows the situation that a Western European text can pose as one character set when in fact it is encoded in another. A lazy programmer will easily miss it. The root of all evil is the design choice that lets lazy programmers get away with it in the first place. It is called Postel's Law and it will be the topic of my next post.

* My advice to anyone trying to decode the binary gibberish: get a life :-)


Monday, 17 August 2009

A False Sense of Simplicity

For the past five years I have had the dubious pleasure of using Hibernate in Oracle-backed production environments and more often than not it has made me want to crawl into a cave to sleep off the months of ensuing darkness. For the uninitiated: Hibernate is a popular open source object-relational mapping tool for Java, an interface layer between your (Java) code and a relational database which lets you query and manipulate data by means of the object-oriented paradigm, effectively hiding the SQL it generates to achieve this.

I have nothing against Hibernate or ORM tools in general, but it annoys me how often it is touted as the perfect fit anytime you need to do something with databases. It irks me how often I have seen it used in environments where it is totally unsuited. It gets this undeserved support from people who have been lulled into a false sense of simplicity.

Let's assume that I'm a green and optimistic Java programmer with no experience in relational databases and no wish to acquire any. Let's say I'm writing an application to manage all logging and billing for a large phone company. The Ministry of Love needs me to store the recorded sound data of every conversation. So I bash away:

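import java.util.Date;
import java.util.List;

// One phone subscriber (each class would live in its own file).
public class Subscriber {
    int number;
    List<Call> calls;            // every call this subscriber ever made
    SubscriptionType subscrType;
}

public class Call {
    Subscriber caller;
    long receiver;               // the number that was called
    Date started;
    Date finished;
    byte[] data;                 // recorded sound of the conversation
}

public class SubscriptionType {
    String name;
    double fixedMonthlyFee;
    double callChargePerSecond;
}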

I suppose this is conceptually valid, albeit very simplified. All I do now is insert some clever annotations in the code, let Hibernate create the tables for me and within minutes I can do things like this:

List smallList = session.createQuery("select name from SubscriptionType").list();
List whoppingList = session.createQuery("select data from Call").list();

The SubscriptionType class maps to a table with no more than ten or twenty rows at any time. No problem there. If marketing and sales do a good job the Subscriber class will hold millions and the Call class billions of records and terabytes of data, mostly consisting of binary voice data. So within weeks of your launch party that whoppingList is sure to bring your system to its knees, because behind the iterator of the whoppingList is actually a JDBC resultset that will add an entry to its in-memory object cache with each call to next(). A Java List that mimics a collection of objects which are not already in memory is not forbidden, but at the very least counter-intuitive.
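One common way out is to never ask for the whole table at once, but to walk through it a page at a time. The sketch below is plain Java without any Hibernate in it: fetchPage is a hypothetical stand-in for a query with an offset and a row limit (Hibernate's Query offers setFirstResult and setMaxResults for that purpose), and the table is faked with a counter. It is an illustration of the idea under those assumptions, not a recipe.

```java
import java.util.ArrayList;
import java.util.List;

class PagedCalls {
    // Hypothetical stand-in for the Call table: we only pretend it has rows.
    static final int TOTAL_CALLS = 1000;

    // Hypothetical stand-in for a limited query: returns at most pageSize
    // row ids, starting at the given offset.
    static List<Integer> fetchPage(int offset, int pageSize) {
        List<Integer> page = new ArrayList<>();
        int end = Math.min(offset + pageSize, TOTAL_CALLS);
        for (int id = offset; id < end; id++) {
            page.add(id);
        }
        return page;
    }

    // Visits every row while holding only one page in memory at a time.
    // Returns { rows processed, size of the largest page seen }.
    static int[] processAll(int pageSize) {
        int processed = 0;
        int largestPage = 0;
        for (int offset = 0; ; offset += pageSize) {
            List<Integer> page = fetchPage(offset, pageSize);
            if (page.isEmpty()) {
                break; // ran off the end of the table
            }
            processed += page.size();
            largestPage = Math.max(largestPage, page.size());
        }
        return new int[] { processed, largestPage };
    }

    public static void main(String[] args) {
        int[] result = processAll(100);
        System.out.println(result[0] + " rows, never more than " + result[1] + " in memory");
    }
}
```

Whether paging by offset is fast enough on a table with billions of rows is a separate question, and one you can only answer by knowing your database.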

An intricate schema with hundreds of tables will not break Hibernate. It only takes a few tables, LOTS of data and an ignorant developer who doesn't know which buttons to push. The so-called impedance mismatch between the relational and the object-oriented realm often gets really ugly when your data reaches a sizable volume. That means in production environments, if you skimp on proper testing.

In all fairness, a seasoned Hibernate guru would not be this naive. She would anticipate that the Call table will grow like mad and needs to be archived regularly. Since the raw sound data are for archiving and rarely queried, she would store them in a separate database that doesn't require daily backups. She would have given our newbie one of the fine manuals that tell him how to do things properly. Feeling the sheer weight of the tome would have taught him instantly that serious database solutions are never plain sailing, not even (or especially not) with Hibernate. Why is that?

BECAUSE SIMPLE THINGS DON'T TAKE 880 PAGES TO EXPLAIN

It's time for a little historical context, because I think the confusion started with Plato. He proposed that the clutter of our daily lives is no more than a feeble reflection of the world of ideal Forms. The true essence of the objects we see is beyond space and time, just to give you a nano-summary of his wisdom.

Now skip a few millennia. The enlightened designer likes to think up wildly intricate entity-relationship diagrams that model a fraction of the world as he perceives it, ignorant of the sordid reality of fragmented indices, concurrent modifications and failed backups, because on Mount Olympus nobody pollutes their pristine tables with real-world data. Sadly, you cannot make a living by keeping your databases empty. That means you cannot design with a Platonic eye. You must have some estimate of how many rows each table will hold, how often it will be queried, updated, inserted into and deleted from. You have to allow for scalability within reasonable margins. Databases don't automatically polish up a crummy data definition based on actual query behaviour and rowcounts. That's your job, and a dirty one it is at times.

Object Oriented Programming has become the new sliced bread. Java is sooo much cleaner and nicer than eighties Oracle PL/SQL. I want to do it all in Java, why can't I? After all, we have tables and columns, classes and objects, foreign keys and object references. Same thing, isn't it? I should be able to tinker with data records as if they were objects. Not so fast. What we have here are analogies, not essential similarities. I believe the crucial difference between objects and data records lies in their lifespan and numbers. Objects are disposable by nature. We create them, use them for our vile private purpose and let the garbage collector kill them for us. They never survive beyond the lifespan of the process that created them. They're not built to last. Database records, if well nursed, will survive Larry Ellison and his grandchildren. An object heap is to a database as a whiteboard is to the British Museum Library. The whiteboard is meant only for the people in the meeting and is erased afterwards. The library is accessible to anyone with a library card. These are essential differences, not accidental ones. If it doesn't look like a duck and doesn't quack like a duck, don't dress it up as one.


                   Objects                          Database records
Average lifespan   milliseconds to hours            minutes to millennia
Used by            one virtual machine at a time    thousands of them
Average count      thousands to millions            thousands to gazillions

Object relational mapping solutions want you to forget about tables and think objects instead. What you scribble on the whiteboard goes into the database thanks to clever proxy objects that wrap a database connection and take care of all the inserting, deletion and updating for you. These are the standard things you do with objects. You create or receive them, use them and leave them to die. Looking them up is a less frequent operation, and when you do it's usually by flipping through the Rolodex (iterating through the collection) or picking one out by its index. Big deal. Database retrieval is all about speed and efficiency, because libraries are big places. It is a big deal. That's why nobody says:


SELECT * FROM products,items,customers,invoices

Since we don't have unlimited memory and patience, we have where-clauses. Getting what you want means telling the database how to join tables and what restrictions to put on column values, so actually telling it what you don't want. The where-clause is the key to querying, and Hibernate doesn't even pretend that its OO alternative, Hibernate Query Language (HQL), is a perfect substitute. “Unlike many other persistence solutions, Hibernate does not hide the power of SQL from you and guarantees that your investment in relational technology and knowledge is as valid as always” (it's right on the first page). So apparently Hibernate has deemed it necessary to build in a backdoor. Why would you want to know what SQL it generates? Two reasons. One, because Hibernate gets it wrong from time to time. Two, because even when it doesn't, the RDBMS itself can behave in cruel and unusual ways by choosing an inefficient execution plan. Note again that nastiness of this kind goes unnoticed when you only test with miniature datasets. Either way you're up the creek and you find yourself using native SQL littered with query hints, leaving the object paradigm and thinking tables again.

Is there a moral to this? If you need a car analogy: learn how to change gears manually, because Hibernate's automatic gearbox may work fine on the highway, but if it does abandon you it will be on the rockiest of roads, by which time you have forgotten how to use a gearstick. Under the right circumstances Hibernate can make your business layer look a paragon of cleanliness, a poster-child for separation of concerns, but in a less forgiving environment it will behave like the leaky abstraction it is. By all means use it if you know what you're doing, but if you're building serious database-backed stuff don't even think you can get away with not knowing your SQL and the quirks of the RDBMS it is built on. Don't get lulled into a false sense of simplicity.