It was time to scan the B2 todo.doc idea file (hashes of which are published to alt.security.keydist for lack of a better, public place to dump ’em).
One of the odd items in B2 was a note of curiosity about which words will change in the future. Specifically, which words will be shortened because they are used a lot? And which words will drop out of use because they are too short for their own good? I speculated that words that are too short for how rarely they are used are pompous, fuddy-duddy words, scheduled to go out of use, and that words that are too long for how often they are used are hip-happening words, scheduled to be replaced by shortened forms (“something” becomes “sum’em”, “about” becomes “bout”, “OK” becomes “K”).
The thing is, words that are in common use are short, e.g. “I”, “the”, “you”, “me”. And rare words are usually big words. That makes sense. Huffman-type compression is a natural phenomenon.
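To make the Huffman comparison concrete: a Huffman code gives frequent symbols short codes and rare symbols long ones. The following is only a small Python 3 sketch of that idea, not part of the original script. The function name `huffman_code_lengths` is invented for the example, and the word counts are taken from the results listed further down.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from freqs."""
    tiebreak = count()  # unique counter so heap entries never compare their lists
    heap = [(f, next(tiebreak), [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        # Merge the two least frequent groups into one.
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, (f1 + f2, next(tiebreak), syms1 + syms2))
    return lengths


# Use counts as they appear in the results listing.
counts = {"get": 126849, "from": 59972, "home": 22901,
          "book": 5027, "cat": 1742, "aught": 6}

for word, bits in sorted(huffman_code_lengths(counts).items(), key=lambda kv: kv[1]):
    print(word, bits)
```

With these counts, “get” ends up with a 1-bit code and “aught” with a 5-bit code, which is the same frequent-is-short pattern the word lists show.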
There are lots of word lists out there. I turned to Wiktionary’s TV script word frequency list.
And, from a while back, I just happened to have a copy of all the audio word recordings from Merriam-Webster.
Now, the durations of these recordings are not a very good indication of the duration of the words, but it’s a start. (I considered using a phoneme count from Wiktionary’s pronunciation guides.)
If you sort the words by word count and give each an index corresponding to its place in the list, and do the same for durations, you should be able to figure out which words have very different rankings in the two sorted lists. In the results below, “cnti” and “duri” are those two indices, and “offness” is the difference between them after each is scaled to the range 0 to 1.
Sorting by the absolute value of the ranking difference should order the words by “stability”, least stable first. That is, the words at the top of the list should be either too-short words or too-long words.
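The ranking comparison can be sketched in a few lines of Python 3 (the original script further down is Python 2). This is a minimal illustration under stated assumptions, not the script itself: the names `tied_ranks` and `offness` and the toy numbers are made up for the example. As in the script, tied values share the index of the first tied item, and each index is divided by the largest index before subtracting.

```python
def tied_ranks(values, reverse=False):
    """Map each distinct value to the index of its first occurrence in sorted order."""
    ranks = {}
    for i, v in enumerate(sorted(values, reverse=reverse)):
        ranks.setdefault(v, i)
    return ranks


def offness(words):
    """words: {word: (use_count, duration)}. Returns {word: offness}."""
    cnt_rank = tied_ranks([c for c, _ in words.values()], reverse=True)  # frequent first
    dur_rank = tied_ranks([d for _, d in words.values()])                # short first
    max_cnt = float(max(cnt_rank.values())) or 1.0  # avoid dividing by zero
    max_dur = float(max(dur_rank.values())) or 1.0
    return {
        w: cnt_rank[c] / max_cnt - dur_rank[d] / max_dur
        for w, (c, d) in words.items()
    }


# Toy data (hypothetical numbers, not taken from the real lists).
toy = {
    "the": (1000, 300),
    "understand": (500, 900),
    "cat": (100, 400),
    "aught": (1, 350),
}

# Least "stable" words first, as in the real listing.
for word, off in sorted(offness(toy).items(), key=lambda kv: -abs(kv[1])):
    print("%-12s %8.5f" % (word, off))
```

A positive value means the word is rarer than its length suggests (too short), and a negative value means it is more common than its length suggests (too long).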
Here they are, grouped:
Too-short, fuddy-duddy words:
Word               count  cnti   dur  duri  offness
aught                  6 15795  4074    37  0.99775
kiddy                  6 15795  4331    93  0.99434
chirp                  6 15795  4516   170  0.98965
pomp                   6 15795  4599   220  0.98661
clunk                  6 15795  4844   390  0.97626
teat                   6 15795  4925   473  0.97121
hera                   6 15795  4965   524  0.96810
peat                   7 15269  4161    56  0.96329
pic                    7 15269  4163    57  0.96323
debit                  6 15795  5040   607  0.96305
wort                   6 15795  5101   678  0.95873
cud                    6 15795  5158   750  0.95435
deft                   7 15269  4574   205  0.95422
yolk                   7 15269  4687   267  0.95045
berth                  7 15269  4778   338  0.94612
putter                 7 15269  4804   352  0.94527
aright                 6 15795  5282   925  0.94369
airy                   7 15269  4868   414  0.94150
capper                 6 15795  5345  1013  0.93834
heller                 7 15269  4934   488  0.93699
amuck                  7 15269  4973   542  0.93371
heady                  7 15269  4989   552  0.93310
lite                   7 15269  4990   554  0.93298
punt                   8 14787  4359   107  0.92967
alum                   7 15269  5051   619  0.92902
bauble                 8 14787  4426   132  0.92815
pellet                 7 15269  5070   637  0.92792
breadth                8 14787  4554   188  0.92474
beagle                 8 14787  4554   188  0.92474
erie                   8 14787  4702   278  0.91926
buoy                   7 15269  5176   785  0.91891
bap                    7 15269  5189   802  0.91788
simp                   7 15269  5203   824  0.91654
wisp                   6 15795  5519  1378  0.91612
whist                  6 15795  5537  1402  0.91466
ardent                 8 14787  4821   372  0.91354
beech                  8 14787  4842   385  0.91275
heath                  8 14787  4852   399  0.91189
dewy                   8 14787  4859   404  0.91159
Too-long, hip-happening words:
Word               count  cnti   dur  duri  offness
relationship        3880   543 10302 15572 -0.91352
responsibility      1157  1262 11876 16298 -0.91219
apologize           1932   885 10730 15895 -0.91153
themselves          1117  1287 11861 16296 -0.91048
affair               865  1527 22434 16425 -0.90314
investigation        951  1433 11692 16260 -0.89905
sharon              1750   957 10540 15751 -0.89820
outside             4260   514 10052 15284 -0.89782
situation           3359   595 10107 15354 -0.89695
opportunity         1601  1011 10526 15742 -0.89423
experience          1767   952 10388 15638 -0.89164
mia                 1105  1295 10838 15953 -0.88910
information         3063   635 10028 15251 -0.88815
realize             4114   527  9922 15110 -0.88641
explanation          847  1545 11186 16119 -0.88337
surprise            3439   585  9908 15091 -0.88158
otherwise           1499  1051 10273 15537 -0.87922
suppose             2952   652  9838 15009 -0.87234
security            2237   817  9960 15164 -0.87133
necessary           1215  1219 10288 15561 -0.87005
conversation        2266   808  9919 15107 -0.86843
besides             3355   596  9738 14868 -0.86731
absolutely          4704   482  9648 14724 -0.86576
ridiculous          1943   881  9941 15138 -0.86570
downstairs          1071  1327 10333 15596 -0.86534
girlfriend          2325   789  9846 15014 -0.86397
eventually           985  1400 10380 15634 -0.86303
ourselves           1525  1041  9980 15200 -0.85934
sometime             871  1518 10440 15680 -0.85836
someplace           1165  1253 10132 15384 -0.85712
grandfather         1006  1377 10191 15444 -0.85292
sacrifice            472  2244 11921 16308 -0.85063
recognize            909  1470 10225 15486 -0.84959
psychiatrist         463  2272 11731 16269 -0.84648
necessarily          478  2222 11402 16186 -0.84459
champagne           1085  1316  9998 15220 -0.84315
understand         16724   191  9299 14020 -0.84133
meantime             701  1741 10275 15540 -0.83572
imagination          459  2284 11062 16060 -0.83300
Well, it was a thought, anyway.
Some data and code:
Here is a selection of the results in “stability” order from least to most stable:
; Thu Oct 30 16:38:35 2008
; counts=15795 durations=16428 unique_counts=2110 unique_durations=5812

; Word             count  cnti   dur  duri  offness

aught                  6 15795  4074    37  0.99775
kiddy                  6 15795  4331    93  0.99434
chirp                  6 15795  4516   170  0.98965
pomp                   6 15795  4599   220  0.98661
clunk                  6 15795  4844   390  0.97626
teat                   6 15795  4925   473  0.97121
hera                   6 15795  4965   524  0.96810
peat                   7 15269  4161    56  0.96329
pic                    7 15269  4163    57  0.96323
debit                  6 15795  5040   607  0.96305
wort                   6 15795  5101   678  0.95873
cud                    6 15795  5158   750  0.95435
deft                   7 15269  4574   205  0.95422
yolk                   7 15269  4687   267  0.95045
berth                  7 15269  4778   338  0.94612
putter                 7 15269  4804   352  0.94527
aright                 6 15795  5282   925  0.94369
sometimes           5596   431 10772 15920 -0.94179
airy                   7 15269  4868   414  0.94150
capper                 6 15795  5345  1013  0.93834
heller                 7 15269  4934   488  0.93699
amuck                  7 15269  4973   542  0.93371
heady                  7 15269  4989   552  0.93310
lite                   7 15269  4990   554  0.93298
punt                   8 14787  4359   107  0.92967
alum                   7 15269  5051   619  0.92902
bauble                 8 14787  4426   132  0.92815
pellet                 7 15269  5070   637  0.92792
breadth                8 14787  4554   188  0.92474
beagle                 8 14787  4554   188  0.92474
erie                   8 14787  4702   278  0.91926
buoy                   7 15269  5176   785  0.91891
bap                    7 15269  5189   802  0.91788
simp                   7 15269  5203   824  0.91654
wisp                   6 15795  5519  1378  0.91612
whist                  6 15795  5537  1402  0.91466
ardent                 8 14787  4821   372  0.91354
relationship        3880   543 10302 15572 -0.91352
beech                  8 14787  4842   385  0.91275
responsibility      1157  1262 11876 16298 -0.91219
heath                  8 14787  4852   399  0.91189
dewy                   8 14787  4859   404  0.91159
apologize           1932   885 10730 15895 -0.91153
themselves          1117  1287 11861 16296 -0.91048
catty                  8 14787  4892   433  0.90982
contra                 6 15795  5575  1485  0.90961
droop                  6 15795  5579  1502  0.90857
gluck                  6 15795  5582  1512  0.90796
yammer                 8 14787  4921   468  0.90769
affair               865  1527 22434 16425 -0.90314
cherub                 6 15795  5630  1603  0.90242
inca                   8 14787  5005   567  0.90167
ogle                   7 15269  5385  1082  0.90084
millet                 7 15269  5385  1082  0.90084
bey                    6 15795  5642  1636  0.90041
creak                  8 14787  5025   588  0.90039
bunt                   8 14787  5032   593  0.90009
amah                   6 15795  5644  1645  0.89987
whet                   9 14316  4394   119  0.89912
investigation        951  1433 11692 16260 -0.89905
wrought                8 14787  5050   617  0.89862
sharon              1750   957 10540 15751 -0.89820
dietrich               6 15795  5655  1677  0.89792
outside             4260   514 10052 15284 -0.89782
dour                   7 15269  5412  1140  0.89730
situation           3359   595 10107 15354 -0.89695
velour                 6 15795  5660  1694  0.89688
hatter                 6 15795  5665  1711  0.89585
conk                   8 14787  5109   687  0.89436
opportunity         1601  1011 10526 15742 -0.89423
cooker                 9 14316  4579   210  0.89358
ilk                    9 14316  4620   231  0.89230
sot                    7 15269  5455  1227  0.89201
experience          1767   952 10388 15638 -0.89164
batty                  9 14316  4682   264  0.89029
mia                 1105  1295 10838 15953 -0.88910
thomson                6 15795  5720  1826  0.88885
baroque                6 15795  5724  1831  0.88854
eth                    6 15795  5725  1833  0.88842
information         3063   635 10028 15251 -0.88815
tusk                   7 15269  5487  1294  0.88793
anima                  7 15269  5496  1312  0.88683
realize             4114   527  9922 15110 -0.88641
vigor                  8 14787  5199   820  0.88627
brusque                7 15269  5513  1349  0.88458
corker                 8 14787  5234   867  0.88341
explanation          847  1545 11186 16119 -0.88337
woolly                 8 14787  5237   872  0.88310
demur                  6 15795  5759  1925  0.88282
coolant                6 15795  5760  1928  0.88264
peeve                  6 15795  5765  1939  0.88197

. . .
Somewhere in the middle of the list...
. . .

cockpit               42  8131  6575  4544  0.23818
dummy                166  4147  4858   401  0.23814
hut                  169  4103  4809   356  0.23810
home               22901   156  6450  4073 -0.23805
shorthand             28  9689  9285 13988 -0.23805
twirl                 34  8949  6793  5397  0.23805
free                5433   440  6531  4368 -0.23803
aspire                25 10179  7094  6677  0.23800
sheila               352  2688  7102  6704 -0.23790
component             40  8299  8687 12538 -0.23779
pathetic            1115  1289  6749  5247 -0.23779
tune                 344  2716  7109  6731 -0.23777
sped                  21 10897  7282  7428  0.23775
excel                 16 12021  7561  8598  0.23769
chemotherapy          23 10525  9729 14851 -0.23766
conduct              234  3426  7293  7465 -0.23750
tribal                47  7696  6456  4103  0.23749
hitch                121  4863  5421  1157  0.23745
cult                 170  4091  4808   355  0.23740
envision              13 12907  7797  9526  0.23729
glib                  33  9036  6819  5500  0.23729
libel                 18 11532  7441  8097  0.23723
willed                39  8399  6649  4839  0.23719
lecturing             65  6604  8146 10765 -0.23718
grapefruit            65  6604  8146 10765 -0.23718
severance             36  8727  8846 12973 -0.23717
virtual               57  7018  6261  3403  0.23717
from               59972    74  6420  3973 -0.23716
fungus                58  6968  8257 11143 -0.23714
illinois             153  4326  7511  8395 -0.23713
decoration            27  9842  9351 14131 -0.23707
exploitation          17 11781 11263 16147 -0.23703
champ                144  4460  5151   745  0.23702
compartment           55  7129  8307 11307 -0.23693
iceberg               60  6862  8220 11028 -0.23685

. . .
The most "stable" words...
. . .

loose               1069  1331  5516  1367  0.00106
scottie               56  7070  7258  7336  0.00106
arrival              149  4385  6584  4578 -0.00105
radioactivity          6 15795 13593 16411  0.00103
sung                  65  6604  7147  6885 -0.00099
pilgrimage             9 14316  9766 14906 -0.00099
eyeball               35  8825  7702  9163  0.00095
heal                 466  2260  5904  2335  0.00095
scrabble              39  8399  7596  8751 -0.00094
provoke               93  5547  6886  5784 -0.00089
iron                 314  2875  6128  2976  0.00087
extortionist           7 15269 10730 15895 -0.00086
rubbish               54  7186  7299  7488 -0.00085
cavalier              28  9689  7948 10091 -0.00083
get               126849    37  3948    25  0.00082
nick                2699   704  5132   719  0.00080
integration           15 12307  8785 12813 -0.00078
pedestal              65  6604  7140  6856  0.00077
ringing              281  3078  6201  3213 -0.00071
yo                  1347  1138  5429  1172  0.00071
platter              116  4967  6726  5155  0.00067
stifler               14 12603  8898 13119 -0.00066
cat                 1742   960  5330   988  0.00064
relate               172  4061  6487  4214  0.00059
machismo               8 14787 10116 15370  0.00058
altitude              52  7303  7325  7605 -0.00057
fetish                43  8040  7501  8353  0.00056
ton                  304  2941  6157  3068 -0.00056
twentieth             41  8212  7542  8532  0.00055
telltale              10 13925  9533 14492 -0.00054
montage                7 15269 10700 15873  0.00048
toll                 113  5030  6742  5224  0.00046
fabricate             13 12907  9016 13418  0.00038
misrepresentation      6 15795 15362 16422  0.00037
buster               249  3295  6271  3433 -0.00036
primal                49  7517  7375  7824 -0.00035
alvin                 65  6604  7142  6863  0.00034
tumor                225  3506  6328  3641  0.00034
book                5027   468  4937   492 -0.00032
voyage               106  5215  6799  5429 -0.00030
hug                  697  1751  5720  1826 -0.00029
demon               1703   977  5349  1020 -0.00023
marietta              21 10897  8313 11330  0.00023
empty               1261  1183  5457  1234 -0.00022
bootleg               19 11306  8435 11756  0.00019
jordan               266  3167  6229  3297 -0.00019
theses                 9 14316  9761 14892 -0.00014
letter              1839   925  5311   960  0.00013
cheer                550  2046  5834  2130 -0.00012
maiden                82  5896  6965  6134 -0.00010
toaster               76  6107  7016  6352 -0.00002
castle               408  2451  5977  2549  0.00001
Here is the code:
class a_word(object) :
    def __init__(me, word, cnt, dur) :
        me.word = word      # the word
        me.cnt  = cnt       # the word's use count
        me.dur  = dur       # the word's shortest .wav file byte length
        me.cnti = 0         # the normalized ranking of the count (low rank are frequent words)
        me.duri = 0         # the normalized ranking of the duration (low ranks are short words)
        me.off  = 0.0       # how far off the two rankings are
        pass
    # a_word

#
#
if __name__ == '__main__' :
    import os
    import re
    import sys
    import time

    import TZCommandLineAtFile
    import tzlib

    sys.argv.pop(0)
    TZCommandLineAtFile.expand_at_sign_command_line_files(sys.argv)

    wc_fn = sys.argv.pop(0)
    wcs   = tzlib.read_whole_text_file(wc_fn)    # lines of: "word count (wav_size (...))" - we use the shortest .wav size
    wa    = re.split(r"\n", wcs)
    wa    = [ wc for wc in [ re.split(r"\s+", ln) for ln in wa ] if (len(wc) >= 3) and (wc[0][0] != ';') ]

    words = []
    for wc in wa :
        wc[1] = int(wc[1])
        words.append(a_word(wc[0], wc[1], min([ int(ln) for ln in wc[2:]])))

    words.sort(lambda a, b : cmp(b.cnt, a.cnt))
    i    = 0
    j    = 0
    icnt = 0
    ucnt = 0
    for w in words :
        if icnt != w.cnt :
            icnt  = w.cnt
            i     = j
            ucnt += 1
        w.cnti = i
        j     += 1
    icnt = float(i)

    words.sort(lambda a, b : cmp(a.dur, b.dur))
    i    = 0
    j    = 0
    idur = 0
    udur = 0
    for w in words :
        if idur != w.dur :
            idur  = w.dur
            i     = j
            udur += 1
        w.duri = i
        j     += 1
    idur = float(i)

    for w in words :
        w.off = (w.cnti / icnt) - (w.duri / idur)

    words.sort(lambda a, b : cmp(abs(b.off), abs(a.off)))

    print "; " + time.asctime()
    print "; counts=%i durations=%i unique_counts=%i unique_durations=%i" % ( int(icnt), int(idur), int(ucnt), int(udur) )
    print
    print "; %-30s count cnti dur duri offness" % "Word"
    print
    for w in words :
        print " %-30s %8u %5u %6i %5u %8.5f" % ( w.word, w.cnt, w.cnti, w.dur, w.duri, w.off )
    print
    print ";"
    print "; eof"
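The script leans on two helper modules, `TZCommandLineAtFile` and `tzlib`, that are not shown here. As a rough Python 3 sketch of just the input-parsing step, assuming the format given in the code comment (lines of `word count wav_size ...`, with `;` starting a comment line), something like the following should behave similarly. The name `parse_word_lines` and the sample lines are invented for this example.

```python
def parse_word_lines(text):
    """Parse lines of 'word count wav_size [wav_size ...]'.

    Returns a list of (word, count, shortest_wav_size) tuples.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, ';' comment lines, and words with no .wav size at all.
        if len(fields) < 3 or fields[0].startswith(";"):
            continue
        word = fields[0]
        cnt = int(fields[1])
        shortest = min(int(size) for size in fields[2:])
        rows.append((word, cnt, shortest))
    return rows


# Sample input. The second size on the "aught" line and the "orphan" line
# are made up to exercise the min() and the skip rule.
sample = """\
; word count wav_size ...
aught 6 4074 5000
get 126849 3948
orphan 12
"""

print(parse_word_lines(sample))
```

The resulting tuples could then be fed to the ranking step in place of the `a_word` objects.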