After reading all the Buffett yearly letters it sure seemed like a good idea to experiment with programs to help read repetitive stuff – stuff containing lots of boilerplate, for instance.
There are lots of ways to address the issue. I did something with assembling a large tree that could represent a Markov model of the text. That took a lot of memory and a lot of CPU. And things get very interesting when it comes time to prune the tree. More work to be done there.
Meanwhile, there’s a really quick, easy way to play with this sort of thing:
Starting with, say, 10 documents, ordered in some way, “read” the first 9. Build a memory of sub-strings of the 9 documents. Then display the 10th with each character rendered in a font size that represents how expected it is at that position in the document. In particular, render “surprising” characters big and “expected” characters small.
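The display step can be sketched as a tiny HTML emitter. This is just an illustration of the idea, not the original script's output; the function name, the point-size range, and the assumption of per-character "surprise" scores in [0, 1] are all mine:

```python
import html

def render_html(text, scores, min_pt=8, max_pt=28):
    """Render text as HTML, sizing each character by its surprise score.

    scores: one number per character; bigger = more surprising = bigger font.
    (A sketch of the display idea described above; names are made up.)
    """
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0          # avoid divide-by-zero on flat scores
    parts = []
    for ch, s in zip(text, scores):
        pt = min_pt + (max_pt - min_pt) * (s - lo) / span
        parts.append(f'<span style="font-size:{pt:.0f}pt">{html.escape(ch)}</span>')
    return "".join(parts)
```

Dumping the result into a bare `.html` file is enough to eyeball the effect in a browser.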
Well, without going into the details of the current code in random_byte_strings_in_file.py, here is an example paragraph from the Buffett 1999 letter processed with data from the ’77 through ’98 letters:
But, nice try.
Hmm. Well, let’s note how the script works:
It stores a big dictionary/hash keyed by unique strings. Each hash value is the number of times its string has been found in the documents.
For each document the script reads, it picks lots of random characters in the document.
For each random character, it remembers strings in the document that include the character. It does this by first storing the single character as a string, then repeatedly trying to extend the string at both ends, storing each extension, until either a never-before-seen string is stored or some length limit is reached.
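The sampling-and-extension step might look roughly like this. It is a sketch of the scheme described above, assuming a simple alternating left/right extension order and a hard length cutoff; the function name, `samples`, and `max_len` are my inventions, not necessarily what random_byte_strings_in_file.py does:

```python
import random
from collections import defaultdict

def remember_strings(text, samples=1000, max_len=12, counts=None):
    """Count substrings around randomly chosen characters of text.

    For each random character, store the 1-character string, then extend
    it (alternating ends, as an assumed order), counting each substring,
    until a brand-new string is stored or max_len is reached.
    """
    if counts is None:
        counts = defaultdict(int)
    for _ in range(samples):
        i = random.randrange(len(text))
        lo, hi = i, i + 1               # current substring is text[lo:hi]
        while hi - lo <= max_len:
            s = text[lo:hi]
            is_new = s not in counts
            counts[s] += 1
            if is_new:                  # stop once a new string is stored
                break
            # extend on whichever end is possible
            if lo > 0 and (hi - lo) % 2 == 0:
                lo -= 1
            elif hi < len(text):
                hi += 1
            elif lo > 0:
                lo -= 1
            else:
                break
    return counts
```

Passing the same `counts` dict across the first nine documents accumulates the shared memory.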
To process the last document, the script uses the string:value hash table to assign a numeric value to each character of the document. I’ve experimented with several ways to do this. They all lead to words that look like kidnappers created ’em.
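For illustration, here is one plausible scoring rule. The post says several were tried; this particular one, where a character is less surprising the longer the remembered substring ending at it, is my guess, not the script's actual rule:

```python
def surprise_scores(text, counts, max_len=12):
    """Assign each character a surprise value in (0, 1].

    One possible rule (an assumption): find the longest substring ending
    at the character that appears in counts; more remembered context
    means a smaller, i.e. less surprising, score.
    """
    scores = []
    for i in range(len(text)):
        longest = 0
        for length in range(1, max_len + 1):
            if i - length + 1 < 0:
                break
            if counts.get(text[i - length + 1 : i + 1], 0) > 0:
                longest = length
        scores.append(1.0 / (1 + longest))  # big = surprising
    return scores
```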
There are, of course, a gob of ways to make the output more visually appealing, if not usable.
But, what the heck. Another interesting thing I’ve not done is to convert the output font sizes to audio samples and listen to the thing.
One wonders, for instance, whether the ear could distinguish between various texts. But then, recognizing the differences between the writing of one person and another is an old story – and there sure are better ways to do it than this rather roundabout scheme.