Internet oddity

Have you ever noticed that Google has skipped a UI beat in Google Maps and Google Earth?

Try entering a latitude/longitude value into either.

Unless you get it exactly right, both systems fuss and stonewall.

Why?

Is there some silly patent out there?

Anyway, it turns out that parsing lat/lon gets dirty fast. Sort of like parsing arbitrary time/dates.

Here’s a shot: /cgi-bin/latlon_cgi.py.
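
For flavor, here’s a minimal Python sketch (not the actual latlon_cgi.py code) of the kind of leniency such a parser needs; the function name and the handful of formats it happens to accept are just made up for the example:

import re

def parse_latlon(text):
    """Return (lat, lon) in signed decimal degrees, or None if hopeless."""
    # Split into a latitude half and a longitude half at a comma/semicolon,
    # or, failing that, split the numbers found into two equal groups.
    halves = re.split(r"[,;]", text.strip(), maxsplit=1)
    if len(halves) != 2:
        nums = re.findall(r"[-+]?\d+(?:\.\d+)?", text)
        halves = [" ".join(nums[:len(nums) // 2]), " ".join(nums[len(nums) // 2:])]
    out = []
    for half, neg_letters in zip(halves, ("sS", "wW")):
        nums = [float(n) for n in re.findall(r"[-+]?\d+(?:\.\d+)?", half)]
        if not nums:
            return None
        # Treat any trailing numbers as minutes and seconds.
        deg = abs(nums[0]) + sum(n / 60.0 ** i for i, n in enumerate(nums[1:3], 1))
        if nums[0] < 0 or any(c in half for c in neg_letters):
            deg = -deg
        out.append(deg)
    return tuple(out)

print(parse_latlon("37 25 12 N, 122 4 48 W"))   # roughly (37.42, -122.08)
print(parse_latlon("37.42 -122.08"))            # (37.42, -122.08)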

Har, har. Should someone write a Greasemonkey script to “fix” input to Google Maps with an XMLHttpRequest? That would be in keeping with the Web 2.0 spirit.

Changing a program you don’t know

How might a language help those who wish to safely add code to a big project that they don’t understand very well?

Well, such a language might make it hard to build a program with the “pragma assert”s stripped out. (Stripping asserts is a powerful tool for those who spend time “proving the correctness of programs” or who, in the dark recesses of their minds, agree with the thinking of the early days of computers: “A bug?!? Why, that’s so totally unexpected!”)

And, could you be encouraged to write self-test code that does not execute in-line, but rather works kind of like a conditional hardware breakpoint at a higher, more complex level?

Heck, there’s gotta be some way to keep all these multi-core CPUs busy.
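
As a rough idea of the shape of such a thing, here’s a minimal Python sketch of self-test code that doesn’t execute in-line: checks get queued and run by a background worker instead of in the main flow. (A daemon thread stands in for a spare core here, and all of the names are invented for the sketch.)

import queue
import threading
import traceback

_checks = queue.Queue()

def _check_worker():
    # Run the queued checks off to the side, out of the main code path.
    while True:
        check, snapshot = _checks.get()
        try:
            if not check(snapshot):
                raise AssertionError("self-test failed on %r" % (snapshot,))
        except Exception:
            traceback.print_exc()

threading.Thread(target=_check_worker, daemon=True).start()

def self_test(check, snapshot):
    """Queue an invariant check against a snapshot of state; don't wait for it."""
    _checks.put((check, snapshot))

# The main code path just records the state it wants checked and moves on.
def push_sorted(lst, x):
    lst.append(x)
    lst.sort()
    self_test(lambda snap: snap == sorted(snap), list(lst))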

Now, if you could build a system that wrote such high-level, self-test code – that would be kinda interesting.

The promise of open source

It’s funny how the promise of open source is an unreachable ideal.

Let’s take a popular open source project: Thunderbird.

Gosh, it would be nice to make a few changes to it.

But it’s too big. And, as a guess, you’d need to spend a lot of time (and perhaps money) setting up a development environment to actually work on it.

In an imaginary, ideal world, such a program would be made up of many clearly labeled, independent, smaller parts with clear, decoupling APIs between them.

Yes, there would always be a bucket of shared library, “memcpy” kinds of things. But those things should just be there in an opaque monolith, always available, never obtrusive.

And, in that ideal world, the pieces of the program would be written in a modern language (i.e. not C/C++).

Telling experience: A few years ago, I needed the Perl POP3 module to do something. Since the source was part of Perl, I simply modified it, sent the author/owner (whose contact information was at the top of the source file) the change, and moved on. Soon thereafter, I noticed he had incorporated the change in a better, more general way. That was all good and pleasant. That the program ran directly from the source was the enabling factor there.

I wonder how much Firefox and Thunderbird would be improved if there were a Tools|Advanced button to toggle the UI/Javascript source between the .jar file format it’s in and a fully expanded form. … Or if much more of the program were moved up to Javascript from C++. … Or if writing extensions weren’t so chaotic a process.

EasySay Characters

Talking on the phone recently, it seemed like a good time to note down somewhere the EasySay characters I used for OnlyMe admin passwords and such.

EasySay characters are characters that are quite unambiguous / distinct both in spoken form and in written form.

In short, they are: “AESINO267”.

OnlyMe considered the other characters to be equivalent to their EasySay peers.

Granted, mapping from the other characters to the EasySay characters is ambiguous. Z? Good arguments could be made for it to map to E, 2, or S.

Here’s the table (the EasySay character on the left, the characters it stands in for on the right):

A  a h j k 8
E  b c d e g p t v z 3
S  f s x
I  i l y 1 5 9
N  m n
O  o q r 0 4
2  u w
6  (nothing else)
7  (nothing else)
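
And a minimal Python sketch of the mapping, just to make the table concrete (this isn’t OnlyMe’s code; the names are made up):

EASYSAY = {
    "A": "ahjk8", "E": "bcdegptvz3", "S": "fsx", "I": "ily159",
    "N": "mn", "O": "oqr04", "2": "uw", "6": "", "7": "",
}

# Invert the table: any character -> its EasySay peer.
TO_EASYSAY = {c: easy for easy, peers in EASYSAY.items()
              for c in peers + easy + easy.lower()}

def easysay(text):
    """Squash each character down to its EasySay peer; pass others through."""
    return "".join(TO_EASYSAY.get(c, TO_EASYSAY.get(c.lower(), c)) for c in text)

print(easysay("Bravo7"))   # -> "EOAEO7"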

So, if the world used a 9-character alphabet, we’d spend a lot less time on the phone talking like we’re WWII combat guys with huge radios glued to our ears.

More Expected Characters

Now, it’s expected words.

Or, more exactly: after running the Buffett letters through a program that tracks strings of words (rather than characters), the last letter of the sequence is shown with the words that appear in common strings made small, and unusual words or strings of words made big.

[Image: common strings are small, uncommon strings are big]

The effect is the same: boilerplate paragraphs are small; new stuff is big.
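
Here’s a minimal Python sketch of the word-level idea (not the actual program): count word trigrams seen in the earlier letters, then render each word of the last letter small if its trigram has been seen before and big if it hasn’t. The two-size scheme and the names are just stand-ins:

from collections import Counter

def trigrams(words):
    return zip(words, words[1:], words[2:])

def word_sizes(old_texts, new_text, big=28, small=8):
    """Return (word, font_size) pairs for new_text."""
    seen = Counter(t for doc in old_texts for t in trigrams(doc.split()))
    words = new_text.split()
    out = []
    for i, w in enumerate(words):
        familiar = i >= 2 and seen[tuple(words[i - 2:i + 1])] > 0
        out.append((w, small if familiar else big))   # familiar string -> small font
    return out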

Data Compression

I count three ways to compress data:

  1. Make common quanta of data short, uncommon ones long. e.g. Huffman encoding. I, am, not, be, a, or, prestidigitation, gesticulate, onomatopoeia, redundant. (Note how the common words are short and the rare ones long; there’s a sketch of this after the list.)
  2. Reference known data. e.g. Symbols. ZIP file encoding of references to repeated byte strings. Refer to a whole book’s worth of information by referencing the title. One if by land, two if by sea.
  3. Drop information that is not needed. e.g. JPG images. MP3 music. Forget it all. Don’t do it.
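
And here’s a small Python sketch of way 1 only, using a toy Huffman coder over words; it has nothing to do with any particular compressor, it just shows the frequent words ending up with the short codes:

import heapq
from collections import Counter

def huffman_codes(words):
    """Return {word: bitstring}; more frequent words get shorter codes."""
    heap = [(n, i, (w,)) for i, (w, n) in enumerate(Counter(words).items())]
    heapq.heapify(heap)
    codes = {w: "" for _, _, (w,) in heap}
    tie = len(heap)
    while len(heap) > 1:
        n1, _, group1 = heapq.heappop(heap)   # the two least-common groups...
        n2, _, group2 = heapq.heappop(heap)
        for w in group1:                      # ...each pick up one more code bit
            codes[w] = "0" + codes[w]
        for w in group2:
            codes[w] = "1" + codes[w]
        heapq.heappush(heap, (n1 + n2, tie, group1 + group2))
        tie += 1
    return codes

text = "i am not going to be redundant i am not i am".split()
for w, bits in sorted(huffman_codes(text).items(), key=lambda kv: len(kv[1])):
    print(w, bits)   # "i" and "am" come out short; "redundant" comes out long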

Are there any more?

In a sense, all optimization is data compression, is it not?

Japanese Style User Interface Design

Over the years, I’ve noticed that Japanese devices have a unique style of UI.

What’s that, you ask?

They present a Colossal Cave Adventure Game to the user.

The idea behind Japanese UI, it seems, is to give the user a rich world to explore. “Look what I found!”

Lots of “You are in a twisty maze of passageways, all different”-ness.

Lots of options. Not just a lot of options in simple lists, as one would expect an out-of-control engineer to create, but rather modes, tricks and cross-connect dependencies galore, each affecting available options.

Presumably, the device has fulfilled its function when the user fully explores the device’s UI. That done, the user tosses the device and gets something new.

Expected Characters

After reading all the Buffett yearly letters, it sure seemed like a good idea to experiment with programs to help read repetitive stuff – stuff containing lots of boilerplate, for instance.

So …

There are lots of ways to address the issue. I did something with assembling a large tree that could represent a Markov Model of the text. That took a lot of memory and a lot of CPU. And, things get very interesting when it comes time to prune the tree. More work to be done with that.

Meanwhile, there’s a really quick, easy way to play with this sort of thing:

Starting with, say, 10 documents, ordered in some way, “read” the first 9. Build a memory of sub-strings of those 9 documents. Then display the 10th with each character rendered in a font size that reflects how strongly that character is expected at that position in the document. In particular, render “surprising” characters big and “expected” characters small.

Well, without going into details of the current code in random_byte_strings_in_file.py, here is an example paragraph from the Buffett 1999 letter processed with data from the ’77 through ’98 letters:

[Image: an example paragraph of the Buffett 1999 letter]

Ugly.

But, nice try.

Hmm. Well, let’s note how the script works:

It stores a big dictionary/hash keyed by unique strings. The hash’s values are the number of times the string has been found in a document.

For each document the script reads, it picks lots of random characters in the document.

For each random character, it remembers strings in the document that include that character. It does this by first storing the character as a one-character string. Then it tries to extend the string on both ends, continuing to store the longer strings until either a brand-new string is stored or some length limit is reached.

To process the last document, the script uses the string:value hash table to assign a numeric value to each character of the document. I’ve experimented with several ways to do this. They all lead to words that look like kidnappers created ’em.
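
For what it’s worth, here’s a minimal Python sketch along those lines. It is not the actual random_byte_strings_in_file.py, and both the stop rule and the scoring rule are loose stand-ins for what the script does:

import random
from collections import Counter

def learn(text, seen=None, samples=2000, max_len=12):
    """Remember substrings around random spots in one document."""
    seen = Counter() if seen is None else seen
    for _ in range(samples):
        i = random.randrange(len(text))
        lo, hi = i, i + 1
        while True:
            s = text[lo:hi]
            is_new = s not in seen
            seen[s] += 1
            # Stop once a brand-new string has been stored, or at the length limit.
            if is_new or hi - lo >= max_len:
                break
            new_lo, new_hi = max(lo - 1, 0), min(hi + 1, len(text))
            if (new_lo, new_hi) == (lo, hi):
                break
            lo, hi = new_lo, new_hi
    return seen

def surprise(text, seen, max_len=12):
    """Per-character scores: small means expected, big means surprising."""
    scores = []
    for i in range(len(text)):
        # Longest remembered substring that covers position i.
        covered = max((hi - lo
                       for lo in range(max(i - max_len + 1, 0), i + 1)
                       for hi in range(i + 1, min(i + max_len, len(text)) + 1)
                       if text[lo:hi] in seen), default=1)
        scores.append(max_len - covered + 1)
    return scores

# seen = Counter()
# for doc in first_nine_letters:
#     seen = learn(doc, seen)
# font_sizes = surprise(tenth_letter, seen)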

There are, of course, a gob of ways to make the output more visually appealing, if not usable.

But, what the heck. Another interesting thing I’ve not done is to convert the output font sizes to audio samples and listen to the thing.

One wonders, for instance, whether the ear can distinguish between various texts. But, then, recognizing the differences between the writings of one person and another is an old story, and there sure are better ways to do so than this rather round-about scheme.

CRC / Checksums Again

Shudda noted that the method of doing additive checksums by look-up table values makes for some pretty dumb, fast code able to compute things like 256-bit checksums. Just concatenate sixteen 16-bit checksums, each computed using a different set of tables. Or whatever.
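
A minimal Python sketch of the concatenation idea, assuming the look-up-table checksum from the earlier note amounts to “sum table[b] over all the bytes, mod 2**16” (the tables here are just random stand-ins):

import random

random.seed(1)   # fixed seed so the 16 stand-in tables are reproducible
TABLES = [[random.getrandbits(16) for _ in range(256)] for _ in range(16)]

def checksum_256(data):
    """Concatenate sixteen 16-bit table checksums into one 256-bit hex string."""
    parts = []
    for table in TABLES:
        s = 0
        for b in data:
            s = (s + table[b]) & 0xffff   # keep each running sum to 16 bits
        parts.append("%04x" % s)
    return "".join(parts)

print(checksum_256(b"hello, world"))   # 64 hex digits = 256 bits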

Also thinking about computing a rolling checksum or incremental checksum:

I gather the standard method of distinguishing between, for instance, “xy” and “yx” (which a normal checksum considers equal) is to keep a running sum of the sum in a separate super-sum. … The final checksum is a combination of the sum and the super-sum. The super-sum effectively indexes the position of each byte by shifting the byte’s value up in the super-sum as a function of how far back the byte is in the data stream. Pulling a byte’s effect back out of the super-sum is easy: subtract the byte’s value times the length of the rolling data block. Let’s presume the byte-value multiple is precomputed.

Offhand, I don’t see anything wrong with wrapping the sums something like this (Warning: vanilla C carry-bit kludge ahead.):

uint8_t  b;                          /* next byte from the data stream          */
uint32_t sum = 0, sum_of_sums = 0;   /* 16-bit sums carried in 32-bit variables */

/* per byte (table[] is the 256-entry look-up table from the earlier note): */
sum         += table[b];
sum          = (sum         >> 16) + (sum         & 0xffff);   /* fold the carry back in */
sum_of_sums +=  sum;
sum_of_sums  = (sum_of_sums >> 16) + (sum_of_sums & 0xffff);

Yes, in this 16-bit example, if you have more than 64K bytes of data you could have a problem, but you’d have a problem anyway if the overflow just went to the bit bucket. And, anyway, you’d probably want to use 31 or 32 bits in a real application, giving you 2-4 gig of data without overflow. Etc.
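
And, for completeness, a minimal Python sketch of the rolling version over a fixed window of the last N bytes, using plain mod-2**16 arithmetic rather than the carry-fold kludge above; the table is again just a random stand-in:

import random
from collections import deque

random.seed(1)
TABLE = [random.getrandbits(16) for _ in range(256)]
MOD = 1 << 16

class RollingChecksum:
    def __init__(self, n):
        self.n = n                    # window length
        self.window = deque()
        self.sum = 0                  # sum of table values in the window
        self.sum_of_sums = 0          # position-weighted super-sum

    def push(self, byte):
        t = TABLE[byte]
        self.window.append(t)
        self.sum = (self.sum + t) % MOD
        if len(self.window) > self.n:
            # Pull the oldest byte's effect out: its table value, times the
            # window length, comes back out of the super-sum.
            old = self.window.popleft()
            self.sum = (self.sum - old) % MOD
            self.sum_of_sums = (self.sum_of_sums - self.n * old) % MOD
        self.sum_of_sums = (self.sum_of_sums + self.sum) % MOD

    def value(self):
        # The final checksum is a combination of the sum and the super-sum.
        return (self.sum_of_sums << 16) | self.sum

# Quick check: rolling along a long stream matches a from-scratch checksum
# of just the last 64 bytes.
data = bytes(random.getrandbits(8) for _ in range(1000))
rolling, scratch = RollingChecksum(64), RollingChecksum(64)
for b in data:
    rolling.push(b)
for b in data[-64:]:
    scratch.push(b)
assert rolling.value() == scratch.value()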