Our living language

It was time to scan the B2 todo.doc idea file (hashes of which are published to alt.security.keydist for lack of a better, public place to dump ’em).

One of the odd items in B2 was a note of curiosity about what words would change in the future. Specifically, what words would be shortened because they are used a lot? And what words would drop out of use because they are too short for their own good? I speculated that words that are too short are pompous, fuddy-duddy words, scheduled to go out of use, and words that are too long are hip-happening words, scheduled to be replaced by shortened forms of the word (“something” becomes “sum’em”, “about” becomes “bout”, “OK” becomes “K”).

The thing is, words in common use are short, e.g. “I”, “the”, “you”, “me”, and rare words are usually big words. That makes sense: Huffman-type compression is a natural phenomenon.
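To make the Huffman comparison concrete, here is a minimal Python 3 sketch that builds a Huffman code over six words, using use counts copied from the results tables in this post. The function name `huffman_code_lengths` is made up for this illustration, and treating whole words as the symbols is only an analogy, not a claim about how language actually works.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    # Each heap entry: (total weight, tiebreaker, {symbol: depth so far})
    heap = [(weight, next(tiebreak), {sym: 0}) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Use counts copied from the results tables in this post.
tv_counts = {
    "get": 126849, "from": 59972, "home": 22901,
    "understand": 16724, "cat": 1742, "aught": 6,
}
code_lengths = huffman_code_lengths(tv_counts)
for word in sorted(tv_counts, key=tv_counts.get, reverse=True):
    print("%-12s count=%6d  code length=%d bits"
          % (word, tv_counts[word], code_lengths[word]))
```

The most-used word ends up with the shortest code and the rarest words with the longest, which is the same shape as short common words and long rare ones.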

There are lots of word lists out there. I turned to Wiktionary’s TV script word frequency list.

And, from a while back, I just happened to have a copy of all the audio word recordings from Merriam-Webster.

Now, the durations of these recordings are not a very good indication of the duration of the words, but it’s a start: the script below uses the byte length of each word’s shortest .wav file as its duration. (I considered using a phoneme count from Wiktionary’s pronunciation guides instead.)
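One caveat about using .wav byte length as a stand-in for duration, as the script at the end of this post does: the two only track each other when every file shares one sample rate, sample width, and channel count. Here is a minimal Python 3 sketch of that caveat, using the standard `wave` module on made-up in-memory clips. Whether the Merriam-Webster files are all uncompressed PCM in a single format is an assumption I have not checked, and `make_dummy_wav` and `wav_seconds` are names invented for this example.

```python
import io
import wave

def make_dummy_wav(seconds, framerate=11025, sampwidth=1, nchannels=1):
    """Build an in-memory PCM .wav of filler samples (a stand-in for a real recording)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(nchannels)
        w.setsampwidth(sampwidth)
        w.setframerate(framerate)
        nframes = int(seconds * framerate)
        w.writeframes(b"\x00" * (nframes * sampwidth * nchannels))
    return buf.getvalue()

def wav_seconds(wav_bytes):
    """Playing time from the header: frame count divided by frame rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / float(w.getframerate())

# Two hypothetical half-second clips in different formats.
small_clip = make_dummy_wav(0.5)
big_clip = make_dummy_wav(0.5, framerate=22050, sampwidth=2)
for name, clip in (("8-bit 11 kHz", small_clip), ("16-bit 22 kHz", big_clip)):
    print("%-14s bytes=%6d seconds=%.3f" % (name, len(clip), wav_seconds(clip)))
```

Both clips play for about half a second, yet one is several times the size of the other, so a mixed-format collection would need the header-based duration rather than the raw file size.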

If you sort the words by use count and give each one an index corresponding to its place in the list, then do the same for durations, you can figure out which words have very different rankings in the two sorted lists: scale each ranking to the range 0 to 1 and subtract one from the other.

Sorting by the absolute value of the difference between the two rankings should order the words by “stability”. That is, the words at the top of the list should be either too-short words or too-long words.
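As a small worked example of that comparison, here is a Python 3 sketch run on six rows copied from the results tables in this post. It is an illustration rather than the script itself: the helper names `tied_ranks` and `offness` are invented here, and with only six words the numbers come out very different from the full run.

```python
def tied_ranks(values, reverse=False):
    """Rank values from 0; equal values share the rank of their first occurrence."""
    ordered = sorted(values, reverse=reverse)
    first_index = {}
    for index, value in enumerate(ordered):
        first_index.setdefault(value, index)
    return [first_index[value] for value in values]

def offness(words, counts, durations):
    """Return (word, offness) pairs, most out-of-line words first."""
    count_ranks = tied_ranks(counts, reverse=True)   # rank 0 = most frequent
    duration_ranks = tied_ranks(durations)           # rank 0 = shortest
    max_count_rank = float(max(count_ranks))
    max_duration_rank = float(max(duration_ranks))
    scores = [
        (word, c / max_count_rank - d / max_duration_rank)
        for word, c, d in zip(words, count_ranks, duration_ranks)
    ]
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)

# Six rows copied from the results tables: (word, use count, .wav size).
sample = [
    ("get", 126849, 3948), ("understand", 16724, 9299),
    ("relationship", 3880, 10302), ("cat", 1742, 5330),
    ("aught", 6, 4074), ("radioactivity", 6, 13593),
]
words, counts, durations = zip(*sample)
ranked = offness(words, counts, durations)
for word, off in ranked:
    print("%-14s %8.5f" % (word, off))
```

In this toy run “aught” lands at the top with a positive score (rare but short), “relationship” scores negative (common but long), and “get” and “radioactivity” score zero because their two rankings agree.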

Here are the results from the top of the list, grouped:

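Too-short, fuddy-duddy words (rarely used but with short recordings, so positive offness). The columns are word, use count, count ranking, shortest .wav size in bytes, duration ranking, and offness: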


  aught                                 6 15795   4074    37  0.99775
  kiddy                                 6 15795   4331    93  0.99434
  chirp                                 6 15795   4516   170  0.98965
  pomp                                  6 15795   4599   220  0.98661
  clunk                                 6 15795   4844   390  0.97626
  teat                                  6 15795   4925   473  0.97121
  hera                                  6 15795   4965   524  0.96810
  peat                                  7 15269   4161    56  0.96329
  pic                                   7 15269   4163    57  0.96323
  debit                                 6 15795   5040   607  0.96305
  wort                                  6 15795   5101   678  0.95873
  cud                                   6 15795   5158   750  0.95435
  deft                                  7 15269   4574   205  0.95422
  yolk                                  7 15269   4687   267  0.95045
  berth                                 7 15269   4778   338  0.94612
  putter                                7 15269   4804   352  0.94527
  aright                                6 15795   5282   925  0.94369
  airy                                  7 15269   4868   414  0.94150
  capper                                6 15795   5345  1013  0.93834
  heller                                7 15269   4934   488  0.93699
  amuck                                 7 15269   4973   542  0.93371
  heady                                 7 15269   4989   552  0.93310
  lite                                  7 15269   4990   554  0.93298
  punt                                  8 14787   4359   107  0.92967
  alum                                  7 15269   5051   619  0.92902
  bauble                                8 14787   4426   132  0.92815
  pellet                                7 15269   5070   637  0.92792
  breadth                               8 14787   4554   188  0.92474
  beagle                                8 14787   4554   188  0.92474
  erie                                  8 14787   4702   278  0.91926
  buoy                                  7 15269   5176   785  0.91891
  bap                                   7 15269   5189   802  0.91788
  simp                                  7 15269   5203   824  0.91654
  wisp                                  6 15795   5519  1378  0.91612
  whist                                 6 15795   5537  1402  0.91466
  ardent                                8 14787   4821   372  0.91354
  beech                                 8 14787   4842   385  0.91275
  heath                                 8 14787   4852   399  0.91189
  dewy                                  8 14787   4859   404  0.91159

Too-long, hip-happening words (often used but with long recordings, so negative offness; same column layout as the list above):


  relationship                       3880   543  10302 15572 -0.91352
  responsibility                     1157  1262  11876 16298 -0.91219
  apologize                          1932   885  10730 15895 -0.91153
  themselves                         1117  1287  11861 16296 -0.91048
  affair                              865  1527  22434 16425 -0.90314
  investigation                       951  1433  11692 16260 -0.89905
  sharon                             1750   957  10540 15751 -0.89820
  outside                            4260   514  10052 15284 -0.89782
  situation                          3359   595  10107 15354 -0.89695
  opportunity                        1601  1011  10526 15742 -0.89423
  experience                         1767   952  10388 15638 -0.89164
  mia                                1105  1295  10838 15953 -0.88910
  information                        3063   635  10028 15251 -0.88815
  realize                            4114   527   9922 15110 -0.88641
  explanation                         847  1545  11186 16119 -0.88337
  surprise                           3439   585   9908 15091 -0.88158
  otherwise                          1499  1051  10273 15537 -0.87922
  suppose                            2952   652   9838 15009 -0.87234
  security                           2237   817   9960 15164 -0.87133
  necessary                          1215  1219  10288 15561 -0.87005
  conversation                       2266   808   9919 15107 -0.86843
  besides                            3355   596   9738 14868 -0.86731
  absolutely                         4704   482   9648 14724 -0.86576
  ridiculous                         1943   881   9941 15138 -0.86570
  downstairs                         1071  1327  10333 15596 -0.86534
  girlfriend                         2325   789   9846 15014 -0.86397
  eventually                          985  1400  10380 15634 -0.86303
  ourselves                          1525  1041   9980 15200 -0.85934
  sometime                            871  1518  10440 15680 -0.85836
  someplace                          1165  1253  10132 15384 -0.85712
  grandfather                        1006  1377  10191 15444 -0.85292
  sacrifice                           472  2244  11921 16308 -0.85063
  recognize                           909  1470  10225 15486 -0.84959
  psychiatrist                        463  2272  11731 16269 -0.84648
  necessarily                         478  2222  11402 16186 -0.84459
  champagne                          1085  1316   9998 15220 -0.84315
  understand                        16724   191   9299 14020 -0.84133
  meantime                            701  1741  10275 15540 -0.83572
  imagination                         459  2284  11062 16060 -0.83300

Well, it was a thought, anyway.

Some data and code:

Here is a selection of the results in “stability” order from least to most stable:


; Thu Oct 30 16:38:35 2008
; counts=15795 durations=16428 unique_counts=2110 unique_durations=5812

; Word                              count  cnti    dur  duri  offness

  aught                                 6 15795   4074    37  0.99775
  kiddy                                 6 15795   4331    93  0.99434
  chirp                                 6 15795   4516   170  0.98965
  pomp                                  6 15795   4599   220  0.98661
  clunk                                 6 15795   4844   390  0.97626
  teat                                  6 15795   4925   473  0.97121
  hera                                  6 15795   4965   524  0.96810
  peat                                  7 15269   4161    56  0.96329
  pic                                   7 15269   4163    57  0.96323
  debit                                 6 15795   5040   607  0.96305
  wort                                  6 15795   5101   678  0.95873
  cud                                   6 15795   5158   750  0.95435
  deft                                  7 15269   4574   205  0.95422
  yolk                                  7 15269   4687   267  0.95045
  berth                                 7 15269   4778   338  0.94612
  putter                                7 15269   4804   352  0.94527
  aright                                6 15795   5282   925  0.94369
  sometimes                          5596   431  10772 15920 -0.94179
  airy                                  7 15269   4868   414  0.94150
  capper                                6 15795   5345  1013  0.93834
  heller                                7 15269   4934   488  0.93699
  amuck                                 7 15269   4973   542  0.93371
  heady                                 7 15269   4989   552  0.93310
  lite                                  7 15269   4990   554  0.93298
  punt                                  8 14787   4359   107  0.92967
  alum                                  7 15269   5051   619  0.92902
  bauble                                8 14787   4426   132  0.92815
  pellet                                7 15269   5070   637  0.92792
  breadth                               8 14787   4554   188  0.92474
  beagle                                8 14787   4554   188  0.92474
  erie                                  8 14787   4702   278  0.91926
  buoy                                  7 15269   5176   785  0.91891
  bap                                   7 15269   5189   802  0.91788
  simp                                  7 15269   5203   824  0.91654
  wisp                                  6 15795   5519  1378  0.91612
  whist                                 6 15795   5537  1402  0.91466
  ardent                                8 14787   4821   372  0.91354
  relationship                       3880   543  10302 15572 -0.91352
  beech                                 8 14787   4842   385  0.91275
  responsibility                     1157  1262  11876 16298 -0.91219
  heath                                 8 14787   4852   399  0.91189
  dewy                                  8 14787   4859   404  0.91159
  apologize                          1932   885  10730 15895 -0.91153
  themselves                         1117  1287  11861 16296 -0.91048
  catty                                 8 14787   4892   433  0.90982
  contra                                6 15795   5575  1485  0.90961
  droop                                 6 15795   5579  1502  0.90857
  gluck                                 6 15795   5582  1512  0.90796
  yammer                                8 14787   4921   468  0.90769
  affair                              865  1527  22434 16425 -0.90314
  cherub                                6 15795   5630  1603  0.90242
  inca                                  8 14787   5005   567  0.90167
  ogle                                  7 15269   5385  1082  0.90084
  millet                                7 15269   5385  1082  0.90084
  bey                                   6 15795   5642  1636  0.90041
  creak                                 8 14787   5025   588  0.90039
  bunt                                  8 14787   5032   593  0.90009
  amah                                  6 15795   5644  1645  0.89987
  whet                                  9 14316   4394   119  0.89912
  investigation                       951  1433  11692 16260 -0.89905
  wrought                               8 14787   5050   617  0.89862
  sharon                             1750   957  10540 15751 -0.89820
  dietrich                              6 15795   5655  1677  0.89792
  outside                            4260   514  10052 15284 -0.89782
  dour                                  7 15269   5412  1140  0.89730
  situation                          3359   595  10107 15354 -0.89695
  velour                                6 15795   5660  1694  0.89688
  hatter                                6 15795   5665  1711  0.89585
  conk                                  8 14787   5109   687  0.89436
  opportunity                        1601  1011  10526 15742 -0.89423
  cooker                                9 14316   4579   210  0.89358
  ilk                                   9 14316   4620   231  0.89230
  sot                                   7 15269   5455  1227  0.89201
  experience                         1767   952  10388 15638 -0.89164
  batty                                 9 14316   4682   264  0.89029
  mia                                1105  1295  10838 15953 -0.88910
  thomson                               6 15795   5720  1826  0.88885
  baroque                               6 15795   5724  1831  0.88854
  eth                                   6 15795   5725  1833  0.88842
  information                        3063   635  10028 15251 -0.88815
  tusk                                  7 15269   5487  1294  0.88793
  anima                                 7 15269   5496  1312  0.88683
  realize                            4114   527   9922 15110 -0.88641
  vigor                                 8 14787   5199   820  0.88627
  brusque                               7 15269   5513  1349  0.88458
  corker                                8 14787   5234   867  0.88341
  explanation                         847  1545  11186 16119 -0.88337
  woolly                                8 14787   5237   872  0.88310
  demur                                 6 15795   5759  1925  0.88282
  coolant                               6 15795   5760  1928  0.88264
  peeve                                 6 15795   5765  1939  0.88197
.
.
.       Somewhere in the middle of the list...
.
.
  cockpit                              42  8131   6575  4544  0.23818
  dummy                               166  4147   4858   401  0.23814
  hut                                 169  4103   4809   356  0.23810
  home                              22901   156   6450  4073 -0.23805
  shorthand                            28  9689   9285 13988 -0.23805
  twirl                                34  8949   6793  5397  0.23805
  free                               5433   440   6531  4368 -0.23803
  aspire                               25 10179   7094  6677  0.23800
  sheila                              352  2688   7102  6704 -0.23790
  component                            40  8299   8687 12538 -0.23779
  pathetic                           1115  1289   6749  5247 -0.23779
  tune                                344  2716   7109  6731 -0.23777
  sped                                 21 10897   7282  7428  0.23775
  excel                                16 12021   7561  8598  0.23769
  chemotherapy                         23 10525   9729 14851 -0.23766
  conduct                             234  3426   7293  7465 -0.23750
  tribal                               47  7696   6456  4103  0.23749
  hitch                               121  4863   5421  1157  0.23745
  cult                                170  4091   4808   355  0.23740
  envision                             13 12907   7797  9526  0.23729
  glib                                 33  9036   6819  5500  0.23729
  libel                                18 11532   7441  8097  0.23723
  willed                               39  8399   6649  4839  0.23719
  lecturing                            65  6604   8146 10765 -0.23718
  grapefruit                           65  6604   8146 10765 -0.23718
  severance                            36  8727   8846 12973 -0.23717
  virtual                              57  7018   6261  3403  0.23717
  from                              59972    74   6420  3973 -0.23716
  fungus                               58  6968   8257 11143 -0.23714
  illinois                            153  4326   7511  8395 -0.23713
  decoration                           27  9842   9351 14131 -0.23707
  exploitation                         17 11781  11263 16147 -0.23703
  champ                               144  4460   5151   745  0.23702
  compartment                          55  7129   8307 11307 -0.23693
  iceberg                              60  6862   8220 11028 -0.23685
.
.
.   The most "stable" words...
.
  loose                              1069  1331   5516  1367  0.00106
  scottie                              56  7070   7258  7336  0.00106
  arrival                             149  4385   6584  4578 -0.00105
  radioactivity                         6 15795  13593 16411  0.00103
  sung                                 65  6604   7147  6885 -0.00099
  pilgrimage                            9 14316   9766 14906 -0.00099
  eyeball                              35  8825   7702  9163  0.00095
  heal                                466  2260   5904  2335  0.00095
  scrabble                             39  8399   7596  8751 -0.00094
  provoke                              93  5547   6886  5784 -0.00089
  iron                                314  2875   6128  2976  0.00087
  extortionist                          7 15269  10730 15895 -0.00086
  rubbish                              54  7186   7299  7488 -0.00085
  cavalier                             28  9689   7948 10091 -0.00083
  get                              126849    37   3948    25  0.00082
  nick                               2699   704   5132   719  0.00080
  integration                          15 12307   8785 12813 -0.00078
  pedestal                             65  6604   7140  6856  0.00077
  ringing                             281  3078   6201  3213 -0.00071
  yo                                 1347  1138   5429  1172  0.00071
  platter                             116  4967   6726  5155  0.00067
  stifler                              14 12603   8898 13119 -0.00066
  cat                                1742   960   5330   988  0.00064
  relate                              172  4061   6487  4214  0.00059
  machismo                              8 14787  10116 15370  0.00058
  altitude                             52  7303   7325  7605 -0.00057
  fetish                               43  8040   7501  8353  0.00056
  ton                                 304  2941   6157  3068 -0.00056
  twentieth                            41  8212   7542  8532  0.00055
  telltale                             10 13925   9533 14492 -0.00054
  montage                               7 15269  10700 15873  0.00048
  toll                                113  5030   6742  5224  0.00046
  fabricate                            13 12907   9016 13418  0.00038
  misrepresentation                     6 15795  15362 16422  0.00037
  buster                              249  3295   6271  3433 -0.00036
  primal                               49  7517   7375  7824 -0.00035
  alvin                                65  6604   7142  6863  0.00034
  tumor                               225  3506   6328  3641  0.00034
  book                               5027   468   4937   492 -0.00032
  voyage                              106  5215   6799  5429 -0.00030
  hug                                 697  1751   5720  1826 -0.00029
  demon                              1703   977   5349  1020 -0.00023
  marietta                             21 10897   8313 11330  0.00023
  empty                              1261  1183   5457  1234 -0.00022
  bootleg                              19 11306   8435 11756  0.00019
  jordan                              266  3167   6229  3297 -0.00019
  theses                                9 14316   9761 14892 -0.00014
  letter                             1839   925   5311   960  0.00013
  cheer                               550  2046   5834  2130 -0.00012
  maiden                               82  5896   6965  6134 -0.00010
  toaster                              76  6107   7016  6352 -0.00002
  castle                              408  2451   5977  2549  0.00001

Here is the code:


class   a_word(object) :
    def __init__(me, word, cnt, dur) :
        me.word = word                      # the word
        me.cnt  = cnt                       # the word's use count
        me.dur  = dur                       # the word's shortest .wav file byte length
        me.cnti = 0                         # the ranking of the count    (low ranks are frequent words) - normalized when "off" is computed
        me.duri = 0                         # the ranking of the duration (low ranks are short words)    - normalized when "off" is computed
        me.off  = 0.0                       # how far off the two normalized rankings are (positive: rare but short, negative: frequent but long)
    pass        # a_word

#
#
if  __name__ == '__main__' :
    import  os
    import  re
    import  sys
    import  time

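    import  TZCommandLineAtFile                             # not in the standard library (local helper: expands "@file" command line arguments)
    import  tzlib                                           # not in the standard library (local helper module)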


    sys.argv.pop(0)

    TZCommandLineAtFile.expand_at_sign_command_line_files(sys.argv)

    wc_fn   = sys.argv.pop(0)

    wcs     = tzlib.read_whole_text_file(wc_fn)             # lines of: "word count wav_size [wav_size ...]" - we use the shortest .wav size
    words   = []
    for ln in wcs.split("\n") :
        wc  = ln.split()                                    # split() ignores leading and trailing whitespace
        if  (len(wc) < 3) or wc[0].startswith(';') :        # skip blank lines, short lines, and ';' comment lines
            continue
        words.append(a_word(wc[0], int(wc[1]), min([ int(sz) for sz in wc[2:] ])))


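    words.sort(key = lambda w : w.cnt, reverse = True)      # most-used words first
    i       = 0                                             # the ranking shared by all words with the current count
    j       = 0                                             # the position in the sorted list
    icnt    = 0
    ucnt    = 0                                             # how many distinct counts there are
    for w in words  :
        if  icnt   != w.cnt :
            icnt    = w.cnt
            i       = j
            ucnt   += 1
        w.cnti      = i
        j          += 1
    icnt            = float(i)                              # the highest count ranking, used to normalize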


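    words.sort(key = lambda w : w.dur)                      # shortest recordings first
    i       = 0                                             # the ranking shared by all words with the current duration
    j       = 0                                             # the position in the sorted list
    idur    = 0
    udur    = 0                                             # how many distinct durations there are
    for w in words  :
        if  idur   != w.dur :
            idur    = w.dur
            i       = j
            udur   += 1
        w.duri      = i
        j          += 1
    idur            = float(i)                              # the highest duration ranking, used to normalize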


    for w in words  :
        w.off       = (w.cnti / icnt) - (w.duri / idur)     # both rankings scaled to 0..1 before subtracting

    words.sort(key = lambda w : abs(w.off), reverse = True) # least "stable" words first

    print("; " + time.asctime())
    print("; counts=%i durations=%i unique_counts=%i unique_durations=%i" % ( int(icnt), int(idur), int(ucnt), int(udur) ))
    print("")
    print("; %-30s    count  cnti    dur  duri  offness" % "Word")
    print("")
    for w in words  :
        print("  %-30s %8u %5u %6i %5u %8.5f" % ( w.word, w.cnt, w.cnti, w.dur, w.duri, w.off ))

    print("")
    print(";")
    print("; eof")

