{"id":14,"date":"2006-04-27T16:38:19","date_gmt":"2006-04-28T00:38:19","guid":{"rendered":"http:\/\/www.tranzoa.net\/~alex\/blog\/?p=14"},"modified":"2006-04-27T16:38:25","modified_gmt":"2006-04-28T00:38:25","slug":"expected-characters","status":"publish","type":"post","link":"https:\/\/www.tranzoa.net\/~alex\/blog\/?p=14","title":{"rendered":"Expected Characters"},"content":{"rendered":"<p>After reading all the Buffett <a href=\"http:\/\/berkshirehathaway.com\/letters\/letters.html\">yearly letters<\/a> it sure seemed like a good idea to experiment with programs to help read repetitive stuff &#8211; stuff containing lots of boilerplate, for instance.<\/p>\n<p>So &#8230;<\/p>\n<p>There are lots of ways to address the issue. I did something with assembling a large tree that could represent a Markov model of the text. That took a lot of memory and a lot of CPU. And, things get very interesting when it comes time to prune the tree. More work to be done with that.<\/p>\n<p>Meanwhile, there&#8217;s a really quick, easy way to play with this sort of thing:<\/p>\n<p>Starting with, say, 10 documents, ordered in some way, &#8220;read&#8221; the first 9. Build a memory of substrings of the 9 documents. Then display the 10th with each character rendered in a font size that represents how strongly it&#8217;s expected at that position in the document. In particular, render &#8220;surprising&#8221; characters big and &#8220;expected&#8221; characters small.<\/p>\n<p>Well, without going into details of the current code in random_byte_strings_in_file.py, here is an example paragraph from the Buffett 1999 letter processed with data from the &#8217;77 through &#8217;98 letters:<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/www.tranzoa.net\/~alex\/blog\/images\/expected_chars_01.gif\" alt=\"Paragraph of Buffett 1999 letter\" \/><\/p>\n<p>Ugly.<\/p>\n<p>But, nice try.<\/p>\n<p>Hmm. 
Well, let&#8217;s note how the script works:<\/p>\n<p>It stores a big dictionary\/hash keyed by unique strings. The hash&#8217;s values are the number of times the string has been found in a document.<\/p>\n<p>For each document the script reads, it picks lots of random characters in the document.<\/p>\n<p>For each random character, it remembers strings in the document that include the character. It does this by, first, storing the character as a string. Then it tries to extend the string on both ends, continuing to store strings until either a new string is stored or some limitation is reached.<\/p>\n<p>To process the last document, the script uses the string:value hash table to assign a numeric value to each character of the document. I&#8217;ve experimented with several ways to do this. They all lead to words that look like kidnappers created &#8217;em.<\/p>\n<p>There are, of course, a gob of ways to make the output more visually appealing, if not usable.<\/p>\n<p>But, what the heck. Another interesting thing I&#8217;ve not done is to convert the output font sizes to audio samples and listen to the thing.<\/p>\n<p>One wonders, for instance, whether the ear can distinguish between various texts. But, then, recognizing the differences between writings of, for instance, one person and another, is an old story &#8211; and there sure are better ways to do so than this rather roundabout scheme.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>After reading all the Buffett yearly letters it sure seemed like a good idea to experiment with programs to help read repetitive stuff &#8211; stuff containing lots of boilerplate, for instance. 
So &#8230; There are lots of ways to &hellip; <a href=\"https:\/\/www.tranzoa.net\/~alex\/blog\/?p=14\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-14","post","type-post","status-publish","format-standard","hentry","category-programing"],"_links":{"self":[{"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=\/wp\/v2\/posts\/14","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=14"}],"version-history":[{"count":0,"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=\/wp\/v2\/posts\/14\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=14"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=14"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tranzoa.net\/~alex\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=14"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}