Saturday 8 March 2014

Against Escapism

Many things in the world of computing exist not because they are the best way, or even a good way, but because they were an expedient solution in the historical context. Such practices have a habit of surviving long after the original reasons for them have vanished, distorting and blighting the world of following generations. Douglas Crockford's excellent talk The Early Years identifies some wonderful examples, not least the strong influence that the size of the US dollar bill at the end of the 1800s had on the programming environment up to 100 years later.

There is another aspect of modern programming which I believe has similar roots in technological history, where the initial rationale ceased to exist decades ago, but which is sustained by its ubiquity and the strength of cultural conventions. It is responsible for countless bugs, unnecessary complexity, large swathes of fundamentally pointless code and substantial inefficiencies. This curse from the past is the almost religious reverence programmers seem to have for the symbols that were available on mechanical typewriters, and the associated practice of escaping.

In the web world, the cult of escapism has reached a level where it can be hard to understand what is correct. It results in an unbelievable amount of programmer time, processor cycles, storage and network bandwidth being sacrificed on the altar of "human readability". The real irony is that through the evolution of different sects and canons, human readability has in reality been lost in all but the simplest uses.

It is time to take off the blinkers of past practice and look at the underlying issues afresh.

To strike at the dogma which is at the heart of the madness: human readability is not a magical property of some special subset of the range of possible 8, 16, 32 or 64 bit values. Human readability is a matter of how those values are represented as symbols on the screen or on a piece of paper. That representation is normally mediated by many layers of software, and the rules by which some chunk of binary data is represented to a human are neither fixed, universally applicable, nor necessarily simplistic.

The other dogma that strengthens the belief in escaping is the idea that it is somehow evil to use defined lengths of strings or chunks of data - that it is necessary to set aside one or more special values that serve as end markers. To count bits and bytes by hand is of course tedious and error prone, and would have made editing impossibly slow with the tools of the 1960s. However, nobody today writes any code without an editor that routinely does tasks many orders of magnitude more complex between each keystroke. The real justification for termination markers vanished right after the card punch.
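
To make the contrast concrete, here is a minimal sketch in Python of reading the same payload from an end-marker layout and from a length-prefixed layout. The function names and the 4-byte big-endian length field are my own assumptions for illustration, not part of any particular standard.

```python
import struct

def read_terminated(buf: bytes, offset: int = 0, terminator: int = 0):
    """Read up to the first terminator byte; needs a linear scan of the data."""
    end = buf.index(terminator, offset)
    return buf[offset:end], end + 1

def read_length_prefixed(buf: bytes, offset: int = 0):
    """Read a chunk whose size is declared up front (assumed 4-byte length)."""
    (length,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length

# A payload that happens to contain the reserved end-marker value.
payload = b"AB\x00CD"

terminated = payload + b"\x00"
prefixed = struct.pack(">I", len(payload)) + payload

cut_short, _ = read_terminated(terminated)
intact, next_offset = read_length_prefixed(prefixed)
```

With the end marker, the reader stops at the embedded zero byte and silently loses the rest of the payload unless an escaping scheme is layered on top. With the declared length, every byte value is ordinary data, and the reader knows where the next chunk starts without inspecting the content at all.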

Imagine the world free of Escapism

  • Including binary data does not require encoding/decoding it in forms that impose a 25% or greater size overhead, not to mention the cycles and memory space required to convert between the Escapist form and its useful form.
  • Numbers and dates can be represented in well-defined binary forms, not in the hopelessly inefficient and often ambiguous Escapist forms. Add to that a clear and undeniable distinction between a text that happens to look like a number or a date and an actual number or date.
  • The scope of nested chunks of code in different languages is clear, and unambiguously defined. Within its bounds, a piece of code is unaffected by its surrounding context, and does not in turn impose restrictions or encoding rules on any other code or data that it might contain.
  • Structured data is expressed concisely, without huge space overheads or unreasonably complex parsing rules. Ironically, it would also be easier to present in a way that humans can understand than is the reality with forms such as XML.
  • In most cases, the size of a chunk of data is constant for some given amount of content, irrespective of the content itself. Where overhead is added to allow for streaming and large/indeterminate datasets, it can easily be tuned to be trivial and predictable.
  • Data can be handled and processed without the often pointless obligation to traverse it linearly, byte by byte, applying transformative parsing rules in the process.
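
The size claims in this list are easy to check. The following sketch compares Base64 and JSON string escaping (standing in for the Escapist forms) with a hypothetical framing of my own choosing: a 4-byte length followed by the raw bytes.

```python
import base64
import json
import struct

def length_prefixed(payload: bytes) -> bytes:
    # Hypothetical framing for illustration: 4-byte big-endian length, then raw bytes.
    return struct.pack(">I", len(payload)) + payload

def json_escaped_size(text: str) -> int:
    # Size in bytes of the text once written as a JSON string literal.
    return len(json.dumps(text).encode("utf-8"))

# Binary data: Base64 emits 4 characters for every 3 bytes of input.
binary = bytes(range(256)) * 3              # 768 bytes covering every byte value
base64_size = len(base64.b64encode(binary))
framed_size = len(length_prefixed(binary))

# Text: the escaped size depends on what the text happens to contain.
plain = "a" * 100
quotes = '"' * 100
plain_escaped = json_escaped_size(plain)
quotes_escaped = json_escaped_size(quotes)
plain_framed = len(length_prefixed(plain.encode("utf-8")))
quotes_framed = len(length_prefixed(quotes.encode("utf-8")))
```

Here 768 bytes of binary become 1024 bytes of Base64 but only 772 bytes when framed. A hundred quote characters take 202 bytes as a JSON string against 102 for a hundred letters, while both framed forms are 104 bytes: the size is constant for a given amount of content, irrespective of the content itself.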

What does it take to achieve freedom?

First, define or adjust language and data format definitions, such that data can be represented as length, (type), data. By defining the length in particular, we eliminate the prime evil of reserving some potential data value as special, and then having special case handling when that value is actually a part of the data. Encoding standards already exist that deliver these characteristics, so it doesn't even require massive effort to invent and agree them, or to code them up.
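
As a rough sketch of what length, (type), data could look like, the Python below encodes and decodes nested values. The header layout (4-byte length, 1-byte type tag) and the tag numbers are assumptions invented for this example; an existing encoding standard would define its own.

```python
import struct

# Hypothetical type tags, chosen only for this illustration.
TYPE_BYTES, TYPE_TEXT, TYPE_INT, TYPE_LIST = 0, 1, 2, 3
HEADER = struct.Struct(">IB")  # 4-byte length, 1-byte type tag

def encode(value) -> bytes:
    """Encode a value as length, type, data. No byte value is reserved."""
    if isinstance(value, bytes):
        tag, body = TYPE_BYTES, value
    elif isinstance(value, str):
        tag, body = TYPE_TEXT, value.encode("utf-8")
    elif isinstance(value, int):
        tag, body = TYPE_INT, struct.pack(">q", value)  # 64-bit signed
    elif isinstance(value, list):
        tag, body = TYPE_LIST, b"".join(encode(item) for item in value)
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return HEADER.pack(len(body), tag) + body

def decode(buf: bytes, offset: int = 0):
    """Decode one value; return it with the offset of whatever follows."""
    length, tag = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    end = start + length
    if end > len(buf):
        raise ValueError("declared length runs past the end of the buffer")
    if tag == TYPE_BYTES:
        return buf[start:end], end
    if tag == TYPE_TEXT:
        return buf[start:end].decode("utf-8"), end
    if tag == TYPE_INT:
        return struct.unpack_from(">q", buf, start)[0], end
    if tag == TYPE_LIST:
        items, pos = [], start
        while pos < end:
            item, pos = decode(buf, pos)
            items.append(item)
        return items, end
    raise ValueError(f"unknown type tag: {tag}")

# The text "42" and the number 42 stay distinct; awkward bytes need no escaping.
original = ["42", 42, b'\x00"\\\n', ["nested", b"\x00"]]
encoded = encode(original)
decoded, consumed = decode(encoded)
```

Note that a reader who is not interested in a chunk can step over it using the declared length alone, and that a nested chunk imposes no encoding rules on whatever it contains.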

Second, recognise that converting binary data to a meaningful representation, using a limited set of symbols that humans can easily recognise and understand, is the function of editors. Thus we could use layout, colour coding, and symbolic representations of logical notions in conventions that are easy for humans to type, view and understand. The editor would perform the function of transforming between the human-readable presentation format and a differently structured but logically identical binary form.

We would probably need to have some conventions of how things are represented in human readable form - at least the main aspects of it, to allow for a common language of discourse and understanding. Nevertheless, there is plenty of scope for enhancement and variation, in much the same way as today we have standards for the character representation, but variability in syntax highlighting etc. For some things, it is simple enough to imagine a presentation that is non-ambiguous to readers: making use of colour/font coding for type, folding techniques and inline rendering (e.g. of images) to avoid messy display of binary data, and so on. To write in the new idiom, we just need editor commands to declare what type of data is being entered, and let the machine work out and create the 'raw' encoding in the underlying file.

The biggest issue is backward compatibility. I cannot think of any way to allow older tools, which are dyed-in-the-wool Escapists, to accept a new regime. To be exposed to the full truth is simply beyond their cognitive capacity. Until we feel we can allow them to die with honour, they will just have to live in a sheltered environment where they are restricted to fare that will not upset their world view. This is eminently possible, and in fact much better than the current situation where even enlightened systems that can deal with the full truth of binary data spend so much of their time masquerading as Escapists by communicating in Escapist languages.

The time for a new Enlightenment has come. We must throw off the shackles of the past.
