With the growth of the Internet that started in the mid-‘80s came an attendant growth in networking. That introduced a rapidly increasing need for applications to share data and interact with one another across platforms. The problem was that the hardware vendors could not standardize on very basic things like the way data is represented in digital bits. Thus binary data written on one platform was unreadable on a different platform.
IMO, this is inexcusable. There are only three characteristics of binary data that differentiate data formats: the number of bits in the smallest aggregate of bits (i.e., the byte size in bits); which end of the aggregate has the least significant bit; and the ordering of aggregates within larger aggregates (i.e., where the least significant byte is in a word). Back in the ‘60s, computer vendors vied with one another to make the most efficient machines, and various combinations of these three characteristics evolved. However, today those characteristics are no longer relevant to ALU optimization, so there is no reason not to standardize on a single set of characteristic values. (That’s not quite true for representing human languages, but memory is so cheap today that always using 16 bits for a character is a minor concern.) The overall cost in performance for all computing due to this failure to standardize is mind-boggling.
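To make the third characteristic concrete, here is a minimal C sketch (the value and names are purely illustrative) showing how byte order alone changes what the same 32-bit word looks like when written out as raw bytes:

    /* Minimal sketch: the same 32-bit value viewed as raw bytes.  On a
     * little-endian machine this prints 0D 0C 0B 0A; on a big-endian
     * machine it prints 0A 0B 0C 0D.  Write those bytes to a file or a
     * socket and the other kind of machine reads a different number. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        uint32_t word = 0x0A0B0C0D;
        uint8_t  bytes[4];

        memcpy(bytes, &word, sizeof word);
        printf("%02X %02X %02X %02X\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);
        return 0;
    }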
There are three approaches to resolving the inconsistencies in hardware data formats. The first is that, given the hardware vendors refuse to standardize, they could at least provide firmware instructions that convert a foreign format to the native format on a word-by-word basis. From a performance perspective, this would be far and away the best solution, though it would require a handful of instruction variants to accommodate the various combinations.
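For the byte-order part of the problem, something close to this does exist today: most CPUs have a single-instruction byte swap that compilers expose as an intrinsic. A hedged sketch of word-by-word conversion, assuming GCC or Clang’s __builtin_bswap32 (other toolchains have equivalents, e.g., _byteswap_ulong on MSVC):

    /* Sketch: convert a buffer of 32-bit words from the foreign byte order
     * to the native one, one word at a time.  __builtin_bswap32 typically
     * compiles to a single byte-swap instruction. */
    #include <stdint.h>
    #include <stddef.h>

    void swap_words_in_place(uint32_t *words, size_t count)
    {
        for (size_t i = 0; i < count; ++i)
            words[i] = __builtin_bswap32(words[i]);
    }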
The second approach would be to use a software proxy that converts data coming from an external source to the format of the receiving platform as it arrives. A company where I once worked did this for its internal LANs. It is surprisingly simple to do, and the overhead is relatively small compared to the third alternative. We did it because our software would otherwise have been infeasible due to poor performance; given the machines we were working with in the early ‘80s, you could have had children in the time it would have taken to execute using the third alternative.
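A hedged sketch of the receive side of such a proxy, assuming the agreed wire format is big-endian and a POSIX platform (ntohl/ntohs from <arpa/inet.h>); the record layout is invented for illustration:

    /* Sketch: incoming records are converted to the host's native format
     * the moment they arrive, so the rest of the application never sees
     * the foreign layout.  The fields here are purely illustrative. */
    #include <stdint.h>
    #include <arpa/inet.h>

    struct sensor_record {
        uint32_t timestamp;
        uint16_t channel;
        uint16_t reading;
    };

    void to_host_order(struct sensor_record *r)
    {
        r->timestamp = ntohl(r->timestamp);
        r->channel   = ntohs(r->channel);
        r->reading   = ntohs(r->reading);
    }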
The third approach is the one used by virtually all infrastructures in IT. The basic idea is to convert all binary numeric data to ASCII and transmit that between platforms, because the ASCII format is a standard. One then reconverts it back to binary for computations. This is the worst possible choice for resolving binary data compatibility because the conversions back and forth between binary numbers and ASCII numbers are very expensive. (In fact, there are no individual machine instructions for them; on today’s machines, ASCII conversions can only be done with software routines that execute a large number of instructions, especially for floating-point data.)
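A minimal sketch of the round trip in standard C, just to show where the work goes: the binary route is a copy, while the ASCII route runs a formatting algorithm on the way out and a parsing algorithm on the way back in (the buffer sizes are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        double value = 12345.6789;

        /* Binary route: one copy, no per-digit work. */
        unsigned char wire_binary[sizeof value];
        memcpy(wire_binary, &value, sizeof value);

        /* ASCII route: format to text, then parse the text back. */
        char wire_ascii[32];
        snprintf(wire_ascii, sizeof wire_ascii, "%.17g", value);
        double round_trip = strtod(wire_ascii, NULL);

        printf("ASCII form: %s, parsed back: %.17g\n", wire_ascii, round_trip);
        return 0;
    }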
If the use of ASCII were limited to pure interoperability issues around converting the hardware format of an external platform, the problem would only warrant a small head shake and a tsk-tsk about such foolishness. However, the ASCII approach has become ubiquitous in IT, and it is manifested in things like the overuse of markup and scripting languages (e.g., HTML, XML, etc.). So the use of ASCII is not limited to network port proxies; it permeates IT applications because it allows a generic parser to be built for any platform, thus saving the developer the keystrokes needed to process specific binary data structures in memory.
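To illustrate the trade, here is a hedged C sketch contrasting an in-memory binary record with a markup encoding of the same data; the field names and the XML shape are invented for illustration:

    #include <stdio.h>
    #include <stdint.h>

    struct quote {                /* hypothetical record: 8 bytes of payload */
        uint32_t symbol_id;
        int32_t  price_cents;
    };

    int main(void)
    {
        struct quote q = { 4711u, 10525 };

        char xml[128];
        int xml_len = snprintf(xml, sizeof xml,
            "<quote><symbolId>%u</symbolId><priceCents>%d</priceCents></quote>",
            (unsigned)q.symbol_id, (int)q.price_cents);

        /* The binary form is ready to compute with as-is; the markup form
         * must be parsed character by character before the numbers exist. */
        printf("binary payload: %u bytes, markup payload: %d bytes\n",
               (unsigned)sizeof q, xml_len);
        return 0;
    }

The generic parser buys convenience, but every record pays for the larger text form and the reparse on every trip.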
The result is mind-numbing overhead. In 1984 you could run a spreadsheet in less than a minute on a TRS-80 or Apple II that had 64K of memory, a floppy drive, and a clock rate of a couple of MHz. Today, you could not even load the same spreadsheet on such a machine and, if you could, it would take hours to execute. Some of that, particularly the memory constraint, is due to code and feature bloat, but most of the performance hit is due to ASCII processing in the bowels of the spreadsheet program. Every time I talk to an IT guy about the specific performance of an application where lots of keystrokes were saved by using massive “helpful” infrastructures, I am astounded that the examples take minutes to execute when the equivalent processing in a cycle-counting R-T/E environment would take a few tens of milliseconds. The frightening thing is that the IT people are so used to such infrastructures that they don’t think there is a problem; they think such abysmal performance is normal!
One reason ASCII is so popular in IT is the use of markup and scripting languages that was triggered in the mid-‘90s by the World Wide Web. A few years ago I read a paper in a refereed journal where the author claimed that the first markup language was created in 1986. How soon they forget. (Perhaps more apropos: How soon they repeat the mistakes of the past.) In fact, markup and scripting languages were very popular in the ’50s and ‘60s. But by the mid-‘70s they had pretty much disappeared, and there was a very good reason why. They were great for formatting reports, but they sucked for actual programming because they were slow and very difficult to maintain at the large-application level. Show me a buggy website and I will show you a JavaScript website.