It was time to scan the B2 todo.doc idea file (hashes of which are published to alt.security.keydist for lack of a better, public place to dump ’em).
One of the odd items in B2 was a note of curiosity about which words will change in the future. Specifically, which words will be shortened because they are used a lot? And which words will drop out of use because they are too short for their own good? I speculated that words that are too short for how rarely they are used are pompous, fuddy-duddy words, scheduled to go out of use, and words that are too long for how often they are used are hip-happening words, scheduled to be replaced by shortened forms of the word (“something” becomes “sum’em”, “about” becomes “bout”, “OK” becomes “K”).
The thing is, words that are in common use are short, e.g. “I”, “the”, “you”, “me”. And rare words are usually big words. That makes sense: Huffman-type compression is a natural phenomenon.
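For anyone who hasn’t met Huffman coding, here is a minimal sketch of the idea: build the code tree by repeatedly merging the two rarest items, so frequent items end up with short codes. The function name huffman_code_lengths is made up for this illustration, and the counts are a handful of use counts borrowed from the results further down. It is only an analogy, since spoken words are not bit strings.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    # With a single symbol the loop never runs and its length stays 0.
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)  # the two rarest subtrees
        f2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in a.items()}  # one level deeper
        merged.update({s: d + 1 for s, d in b.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


if __name__ == "__main__":
    # Use counts taken from the results listing (illustration only).
    counts = {"get": 126849, "from": 59972, "home": 22901,
              "understand": 16724, "aught": 6}
    lengths = huffman_code_lengths(counts)
    for word in sorted(counts, key=counts.get, reverse=True):
        print("%-12s count=%6d bits=%d" % (word, counts[word], lengths[word]))
```

By that measure the very common “get” earns the shortest code and the rare “aught” one of the longest, which is the opposite of how long the two take to say.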
There are lots of word lists out there. I turned to Wiktionary’s TV script word frequency list.
And, from a while back, I just happened to have a copy of all the audio word recordings from Merriam-Webster.
Now, the durations of these recordings (really, the byte size of the shortest .wav file for each word) are not a very good indication of the duration of the words, but it’s a start. (I considered using a phoneme count from Wiktionary’s pronunciation guides.)
If you sort the words by word-count and give them each an index corresponding to where they are in the list, and do the same for durations, you should be able to figure out which words have very different indices/rankings in the two sorted lists. Each ranking is divided by its largest value so the two are comparable, and the “offness” of a word is its scaled count ranking minus its scaled duration ranking.
Sorting by the absolute value of that difference should order the words by “stability”, least stable first. That is, the words at the top of the list should either be too-short words, or too-long words.
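A minimal Python 3 sketch of that ranking comparison, on four words whose counts and .wav sizes are taken from the results below. The helper names tied_ranks and offness_by_word are invented for this sketch. Tied values share the rank of their first occurrence, which is my reading of what the full script does.

```python
def tied_ranks(values):
    """Map each value to the position of its first occurrence in sorted order."""
    ranks = {}
    for pos, value in enumerate(sorted(values)):
        ranks.setdefault(value, pos)  # ties keep the earliest position
    return ranks


def offness_by_word(words):
    """words: {word: (count, duration)} -> {word: offness in [-1, 1]}."""
    # Negate counts so the most frequent word gets rank 0.
    cnt_rank = tied_ranks([-cnt for cnt, _ in words.values()])
    dur_rank = tied_ranks([dur for _, dur in words.values()])
    max_cnt = float(max(cnt_rank.values())) or 1.0  # avoid dividing by zero
    max_dur = float(max(dur_rank.values())) or 1.0
    return {
        word: cnt_rank[-cnt] / max_cnt - dur_rank[dur] / max_dur
        for word, (cnt, dur) in words.items()
    }


if __name__ == "__main__":
    sample = {  # word: (use count, shortest .wav size)
        "get": (126849, 3948),
        "home": (22901, 6450),
        "understand": (16724, 9299),
        "aught": (6, 4074),
    }
    off = offness_by_word(sample)
    for word in sorted(off, key=lambda w: abs(off[w]), reverse=True):
        print("%-12s %8.5f" % (word, off[word]))
```

A positive offness means rare but quick to say (too short); a negative one means common but slow to say (too long). Run over the full word list, the top of that ordering holds the least stable words.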
Here they are, grouped (columns: word, use count, count ranking, .wav size, size ranking, offness):
Too-short, fuddy-duddy words:
aught 6 15795 4074 37 0.99775
kiddy 6 15795 4331 93 0.99434
chirp 6 15795 4516 170 0.98965
pomp 6 15795 4599 220 0.98661
clunk 6 15795 4844 390 0.97626
teat 6 15795 4925 473 0.97121
hera 6 15795 4965 524 0.96810
peat 7 15269 4161 56 0.96329
pic 7 15269 4163 57 0.96323
debit 6 15795 5040 607 0.96305
wort 6 15795 5101 678 0.95873
cud 6 15795 5158 750 0.95435
deft 7 15269 4574 205 0.95422
yolk 7 15269 4687 267 0.95045
berth 7 15269 4778 338 0.94612
putter 7 15269 4804 352 0.94527
aright 6 15795 5282 925 0.94369
airy 7 15269 4868 414 0.94150
capper 6 15795 5345 1013 0.93834
heller 7 15269 4934 488 0.93699
amuck 7 15269 4973 542 0.93371
heady 7 15269 4989 552 0.93310
lite 7 15269 4990 554 0.93298
punt 8 14787 4359 107 0.92967
alum 7 15269 5051 619 0.92902
bauble 8 14787 4426 132 0.92815
pellet 7 15269 5070 637 0.92792
breadth 8 14787 4554 188 0.92474
beagle 8 14787 4554 188 0.92474
erie 8 14787 4702 278 0.91926
buoy 7 15269 5176 785 0.91891
bap 7 15269 5189 802 0.91788
simp 7 15269 5203 824 0.91654
wisp 6 15795 5519 1378 0.91612
whist 6 15795 5537 1402 0.91466
ardent 8 14787 4821 372 0.91354
beech 8 14787 4842 385 0.91275
heath 8 14787 4852 399 0.91189
dewy 8 14787 4859 404 0.91159
Too-long, hip-happening words:
relationship 3880 543 10302 15572 -0.91352
responsibility 1157 1262 11876 16298 -0.91219
apologize 1932 885 10730 15895 -0.91153
themselves 1117 1287 11861 16296 -0.91048
affair 865 1527 22434 16425 -0.90314
investigation 951 1433 11692 16260 -0.89905
sharon 1750 957 10540 15751 -0.89820
outside 4260 514 10052 15284 -0.89782
situation 3359 595 10107 15354 -0.89695
opportunity 1601 1011 10526 15742 -0.89423
experience 1767 952 10388 15638 -0.89164
mia 1105 1295 10838 15953 -0.88910
information 3063 635 10028 15251 -0.88815
realize 4114 527 9922 15110 -0.88641
explanation 847 1545 11186 16119 -0.88337
surprise 3439 585 9908 15091 -0.88158
otherwise 1499 1051 10273 15537 -0.87922
suppose 2952 652 9838 15009 -0.87234
security 2237 817 9960 15164 -0.87133
necessary 1215 1219 10288 15561 -0.87005
conversation 2266 808 9919 15107 -0.86843
besides 3355 596 9738 14868 -0.86731
absolutely 4704 482 9648 14724 -0.86576
ridiculous 1943 881 9941 15138 -0.86570
downstairs 1071 1327 10333 15596 -0.86534
girlfriend 2325 789 9846 15014 -0.86397
eventually 985 1400 10380 15634 -0.86303
ourselves 1525 1041 9980 15200 -0.85934
sometime 871 1518 10440 15680 -0.85836
someplace 1165 1253 10132 15384 -0.85712
grandfather 1006 1377 10191 15444 -0.85292
sacrifice 472 2244 11921 16308 -0.85063
recognize 909 1470 10225 15486 -0.84959
psychiatrist 463 2272 11731 16269 -0.84648
necessarily 478 2222 11402 16186 -0.84459
champagne 1085 1316 9998 15220 -0.84315
understand 16724 191 9299 14020 -0.84133
meantime 701 1741 10275 15540 -0.83572
imagination 459 2284 11062 16060 -0.83300
Well, it was a thought, anyway.
Some data and code:
Here is a selection of the results in “stability” order from least to most stable:
; Thu Oct 30 16:38:35 2008
; counts=15795 durations=16428 unique_counts=2110 unique_durations=5812
; Word count cnti dur duri offness
aught 6 15795 4074 37 0.99775
kiddy 6 15795 4331 93 0.99434
chirp 6 15795 4516 170 0.98965
pomp 6 15795 4599 220 0.98661
clunk 6 15795 4844 390 0.97626
teat 6 15795 4925 473 0.97121
hera 6 15795 4965 524 0.96810
peat 7 15269 4161 56 0.96329
pic 7 15269 4163 57 0.96323
debit 6 15795 5040 607 0.96305
wort 6 15795 5101 678 0.95873
cud 6 15795 5158 750 0.95435
deft 7 15269 4574 205 0.95422
yolk 7 15269 4687 267 0.95045
berth 7 15269 4778 338 0.94612
putter 7 15269 4804 352 0.94527
aright 6 15795 5282 925 0.94369
sometimes 5596 431 10772 15920 -0.94179
airy 7 15269 4868 414 0.94150
capper 6 15795 5345 1013 0.93834
heller 7 15269 4934 488 0.93699
amuck 7 15269 4973 542 0.93371
heady 7 15269 4989 552 0.93310
lite 7 15269 4990 554 0.93298
punt 8 14787 4359 107 0.92967
alum 7 15269 5051 619 0.92902
bauble 8 14787 4426 132 0.92815
pellet 7 15269 5070 637 0.92792
breadth 8 14787 4554 188 0.92474
beagle 8 14787 4554 188 0.92474
erie 8 14787 4702 278 0.91926
buoy 7 15269 5176 785 0.91891
bap 7 15269 5189 802 0.91788
simp 7 15269 5203 824 0.91654
wisp 6 15795 5519 1378 0.91612
whist 6 15795 5537 1402 0.91466
ardent 8 14787 4821 372 0.91354
relationship 3880 543 10302 15572 -0.91352
beech 8 14787 4842 385 0.91275
responsibility 1157 1262 11876 16298 -0.91219
heath 8 14787 4852 399 0.91189
dewy 8 14787 4859 404 0.91159
apologize 1932 885 10730 15895 -0.91153
themselves 1117 1287 11861 16296 -0.91048
catty 8 14787 4892 433 0.90982
contra 6 15795 5575 1485 0.90961
droop 6 15795 5579 1502 0.90857
gluck 6 15795 5582 1512 0.90796
yammer 8 14787 4921 468 0.90769
affair 865 1527 22434 16425 -0.90314
cherub 6 15795 5630 1603 0.90242
inca 8 14787 5005 567 0.90167
ogle 7 15269 5385 1082 0.90084
millet 7 15269 5385 1082 0.90084
bey 6 15795 5642 1636 0.90041
creak 8 14787 5025 588 0.90039
bunt 8 14787 5032 593 0.90009
amah 6 15795 5644 1645 0.89987
whet 9 14316 4394 119 0.89912
investigation 951 1433 11692 16260 -0.89905
wrought 8 14787 5050 617 0.89862
sharon 1750 957 10540 15751 -0.89820
dietrich 6 15795 5655 1677 0.89792
outside 4260 514 10052 15284 -0.89782
dour 7 15269 5412 1140 0.89730
situation 3359 595 10107 15354 -0.89695
velour 6 15795 5660 1694 0.89688
hatter 6 15795 5665 1711 0.89585
conk 8 14787 5109 687 0.89436
opportunity 1601 1011 10526 15742 -0.89423
cooker 9 14316 4579 210 0.89358
ilk 9 14316 4620 231 0.89230
sot 7 15269 5455 1227 0.89201
experience 1767 952 10388 15638 -0.89164
batty 9 14316 4682 264 0.89029
mia 1105 1295 10838 15953 -0.88910
thomson 6 15795 5720 1826 0.88885
baroque 6 15795 5724 1831 0.88854
eth 6 15795 5725 1833 0.88842
information 3063 635 10028 15251 -0.88815
tusk 7 15269 5487 1294 0.88793
anima 7 15269 5496 1312 0.88683
realize 4114 527 9922 15110 -0.88641
vigor 8 14787 5199 820 0.88627
brusque 7 15269 5513 1349 0.88458
corker 8 14787 5234 867 0.88341
explanation 847 1545 11186 16119 -0.88337
woolly 8 14787 5237 872 0.88310
demur 6 15795 5759 1925 0.88282
coolant 6 15795 5760 1928 0.88264
peeve 6 15795 5765 1939 0.88197
.
.
. Somewhere in the middle of the list...
.
.
cockpit 42 8131 6575 4544 0.23818
dummy 166 4147 4858 401 0.23814
hut 169 4103 4809 356 0.23810
home 22901 156 6450 4073 -0.23805
shorthand 28 9689 9285 13988 -0.23805
twirl 34 8949 6793 5397 0.23805
free 5433 440 6531 4368 -0.23803
aspire 25 10179 7094 6677 0.23800
sheila 352 2688 7102 6704 -0.23790
component 40 8299 8687 12538 -0.23779
pathetic 1115 1289 6749 5247 -0.23779
tune 344 2716 7109 6731 -0.23777
sped 21 10897 7282 7428 0.23775
excel 16 12021 7561 8598 0.23769
chemotherapy 23 10525 9729 14851 -0.23766
conduct 234 3426 7293 7465 -0.23750
tribal 47 7696 6456 4103 0.23749
hitch 121 4863 5421 1157 0.23745
cult 170 4091 4808 355 0.23740
envision 13 12907 7797 9526 0.23729
glib 33 9036 6819 5500 0.23729
libel 18 11532 7441 8097 0.23723
willed 39 8399 6649 4839 0.23719
lecturing 65 6604 8146 10765 -0.23718
grapefruit 65 6604 8146 10765 -0.23718
severance 36 8727 8846 12973 -0.23717
virtual 57 7018 6261 3403 0.23717
from 59972 74 6420 3973 -0.23716
fungus 58 6968 8257 11143 -0.23714
illinois 153 4326 7511 8395 -0.23713
decoration 27 9842 9351 14131 -0.23707
exploitation 17 11781 11263 16147 -0.23703
champ 144 4460 5151 745 0.23702
compartment 55 7129 8307 11307 -0.23693
iceberg 60 6862 8220 11028 -0.23685
.
.
. The most "stable" words...
.
loose 1069 1331 5516 1367 0.00106
scottie 56 7070 7258 7336 0.00106
arrival 149 4385 6584 4578 -0.00105
radioactivity 6 15795 13593 16411 0.00103
sung 65 6604 7147 6885 -0.00099
pilgrimage 9 14316 9766 14906 -0.00099
eyeball 35 8825 7702 9163 0.00095
heal 466 2260 5904 2335 0.00095
scrabble 39 8399 7596 8751 -0.00094
provoke 93 5547 6886 5784 -0.00089
iron 314 2875 6128 2976 0.00087
extortionist 7 15269 10730 15895 -0.00086
rubbish 54 7186 7299 7488 -0.00085
cavalier 28 9689 7948 10091 -0.00083
get 126849 37 3948 25 0.00082
nick 2699 704 5132 719 0.00080
integration 15 12307 8785 12813 -0.00078
pedestal 65 6604 7140 6856 0.00077
ringing 281 3078 6201 3213 -0.00071
yo 1347 1138 5429 1172 0.00071
platter 116 4967 6726 5155 0.00067
stifler 14 12603 8898 13119 -0.00066
cat 1742 960 5330 988 0.00064
relate 172 4061 6487 4214 0.00059
machismo 8 14787 10116 15370 0.00058
altitude 52 7303 7325 7605 -0.00057
fetish 43 8040 7501 8353 0.00056
ton 304 2941 6157 3068 -0.00056
twentieth 41 8212 7542 8532 0.00055
telltale 10 13925 9533 14492 -0.00054
montage 7 15269 10700 15873 0.00048
toll 113 5030 6742 5224 0.00046
fabricate 13 12907 9016 13418 0.00038
misrepresentation 6 15795 15362 16422 0.00037
buster 249 3295 6271 3433 -0.00036
primal 49 7517 7375 7824 -0.00035
alvin 65 6604 7142 6863 0.00034
tumor 225 3506 6328 3641 0.00034
book 5027 468 4937 492 -0.00032
voyage 106 5215 6799 5429 -0.00030
hug 697 1751 5720 1826 -0.00029
demon 1703 977 5349 1020 -0.00023
marietta 21 10897 8313 11330 0.00023
empty 1261 1183 5457 1234 -0.00022
bootleg 19 11306 8435 11756 0.00019
jordan 266 3167 6229 3297 -0.00019
theses 9 14316 9761 14892 -0.00014
letter 1839 925 5311 960 0.00013
cheer 550 2046 5834 2130 -0.00012
maiden 82 5896 6965 6134 -0.00010
toaster 76 6107 7016 6352 -0.00002
castle 408 2451 5977 2549 0.00001
Here is the code (Python 2):
class a_word(object) :

    def __init__(me, word, cnt, dur) :
        me.word = word      # the word
        me.cnt  = cnt       # the word's use count
        me.dur  = dur       # the word's shortest .wav file byte length
        me.cnti = 0         # the ranking of the count (low ranks are frequent words)
        me.duri = 0         # the ranking of the duration (low ranks are short words)
        me.off  = 0.0       # how far off the two (normalized) rankings are

    pass        # a_word

#
#
if  __name__ == '__main__' :
    import os
    import re
    import sys
    import time

    import TZCommandLineAtFile
    import tzlib

    sys.argv.pop(0)
    TZCommandLineAtFile.expand_at_sign_command_line_files(sys.argv)

    wc_fn = sys.argv.pop(0)
    wcs   = tzlib.read_whole_text_file(wc_fn)       # lines of: "word count (wav_size (...))" - we use the shortest .wav size
    wa    = re.split(r"\n", wcs)
    # keep lines that have a word, a count and at least one size, and that are not ';' comments
    wa    = [ wc for wc in [ re.split(r"\s+", ln) for ln in wa ] if (len(wc) >= 3) and (wc[0][0] != ';') ]

    words = []
    for wc in wa :
        wc[1] = int(wc[1])
        words.append(a_word(wc[0], wc[1], min([ int(ln) for ln in wc[2:]])))

    # rank by count, most frequent first; words with equal counts share the rank of the first of them
    words.sort(lambda a, b : cmp(b.cnt, a.cnt))
    i    = 0
    j    = 0
    icnt = 0
    ucnt = 0
    for w in words :
        if  icnt != w.cnt :
            icnt  = w.cnt
            i     = j
            ucnt += 1
        w.cnti = i
        j     += 1
    icnt = float(i)                                 # the largest count ranking

    # rank by duration, shortest first, with ties handled the same way
    words.sort(lambda a, b : cmp(a.dur, b.dur))
    i    = 0
    j    = 0
    idur = 0
    udur = 0
    for w in words :
        if  idur != w.dur :
            idur  = w.dur
            i     = j
            udur += 1
        w.duri = i
        j     += 1
    idur = float(i)                                 # the largest duration ranking

    # offness: normalized count ranking minus normalized duration ranking
    for w in words :
        w.off = (w.cnti / icnt) - (w.duri / idur)

    # least "stable" words (largest absolute offness) first
    words.sort(lambda a, b : cmp(abs(b.off), abs(a.off)))

    print "; " + time.asctime()
    print "; counts=%i durations=%i unique_counts=%i unique_durations=%i" % ( int(icnt), int(idur), int(ucnt), int(udur) )
    print
    print "; %-30s count cnti dur duri offness" % "Word"
    print
    for w in words :
        print " %-30s %8u %5u %6i %5u %8.5f" % ( w.word, w.cnt, w.cnti, w.dur, w.duri, w.off )
    print
    print ";"
    print "; eof"
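
As a quick sanity check, the offness column in the listing can be reproduced by hand from the cnti and duri columns and the two totals in the “; counts=... durations=...” header line. This Python 3 sketch does that for three rows copied from the listing. The function name offness and the constant names are mine, not part of the script above.

```python
# Largest rankings, taken from the "; counts=15795 durations=16428 ..." header.
MAX_CNTI = 15795
MAX_DURI = 16428


def offness(cnti, duri, max_cnti=MAX_CNTI, max_duri=MAX_DURI):
    """Normalized count ranking minus normalized duration ranking."""
    return cnti / float(max_cnti) - duri / float(max_duri)


# (word, cnti, duri, published offness), copied from the listing.
ROWS = [
    ("aught", 15795, 37, 0.99775),
    ("relationship", 543, 15572, -0.91352),
    ("castle", 2451, 2549, 0.00001),
]

if __name__ == "__main__":
    for word, cnti, duri, published in ROWS:
        print("%-14s computed=%8.5f published=%8.5f"
              % (word, offness(cnti, duri), published))
```

For “aught” that is 15795/15795 - 37/16428, or about 0.99775, matching the first row of the listing.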