However, it is often exceedingly tedious to search through a dictionary for words with given properties; as a consequence, many specialized works have been developed to aid the logologist: positional word lists, reverse word lists, pattern word lists and the like. The purpose of this article is to acquaint the reader with a specialized word list which is indispensable in any study involving the relative frequency of appearance of words in English-language text: Henry Kucera and W. Nelson Francis's Computational Analysis of Present-Day American English, published by Brown University Press in 1967.

Kucera and Francis have tabulated the relative frequencies of a little more than one million words in two lists -- one in descending order of frequency, the other alphabetical. There are a little more than 50,000 separate words in each list. The million words were taken from 500 separate segments of 1948 to 2246 words apiece, all printed in the United States during 1961; however, only 50 of these 500 segments are specifically referenced. These 500 segments were in turn taken from a wide variety of genres: newspaper reports and editorials, magazine articles, belles lettres, learned and scientific writings, and fiction (but poetry and drama were excluded). Each word in both lists has three numbers associated with it -- the total number of occurrences, the number of different segments in which it appeared, and the number of different genres in which it appeared.

Kucera and Francis define a word as a continuous string of letters, numerals, hyphens, apostrophes, periods and the like, bounded at each end by spaces. In order to avoid counting a word followed by a comma, semicolon or period as different from the same word without such punctuation, no punctuation marks (except hyphens and apostrophes) followed by another punctuation mark or by a space are included in a string.
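
The word definition just quoted can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the program Kucera and Francis used: the function name kf_words, the use of Python's string.punctuation as the set of punctuation marks, and the conversion to capitals (to match the way words are printed in this article) are all assumptions. Leading punctuation such as an opening parenthesis is left alone, since the rule as quoted says nothing about it.

```python
import string
from collections import Counter

# Hyphens and apostrophes are never dropped, per the definition above.
KEPT = {"-", "'"}
# Assumption: "punctuation mark" means any character in string.punctuation.
PUNCT = set(string.punctuation)


def kf_words(text):
    """Split text on spaces, then drop any punctuation mark (other than a
    hyphen or apostrophe) that is followed by punctuation or by a space."""
    words = []
    for token in text.split():
        chars = []
        for i, ch in enumerate(token):
            # The end of a token is, by construction, followed by a space.
            nxt = token[i + 1] if i + 1 < len(token) else " "
            droppable = ch in PUNCT and ch not in KEPT
            if droppable and (nxt == " " or nxt in PUNCT):
                continue
            chars.append(ch)
        if chars:
            # Capitals are an assumption, chosen to match the article's style.
            words.append("".join(chars).upper())
    return words


sample = "He can, of course; he can."
counts = Counter(kf_words(sample))
print(counts["CAN"])
```

Because the rule looks only at what follows a punctuation mark, "can," and "can." both reduce to CAN, while the apostrophe in "don't" and the hyphen in "well-known" survive.
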
Mathematical and chemical formulas are not separately enumerated, but their total number (1010, scattered through 50 of the 500 segments) is noted. The reader is cautioned that different meanings of a word are not separately enumerated; for example, the verb CAN and the noun CAN collectively appear 1772 times. Thus, the book is relatively useless for semantic or stylistic studies.

The relative frequencies in this sample must be regarded as approximations to the true (but unknown) frequencies of words in English prose. If the words had been drawn truly at random (instead of in 2000-word segments), it would have been possible to draw inferences about the range within which the true frequencies lie; however, words in a given segment are to some extent correlated with each other, resulting in an inflated number of appearances. To cite an extreme case, almost no one is likely to believe that B'DIKKAT is 21 times as common as CRUST in English-language text, yet this is what the raw frequencies say. Fortunately, there is a way to identify those words which may be inflated in frequency: note the number of different segments in which the word appeared. If only one or two of 500 segments exhibit the word, it is much more likely to be misplaced in the frequency list. There are a number of single-segment words that appear even more frequently than B'DIKKAT:

   STAINING   37      JESS     33      SKYWAVE    32
   BINOMIAL   36      GORTON   32      MIRIAM     30
   MATSUO     35      SCOTTY   32      WOODRUFF   30

The suspicion that personal names are especially likely to have multiple appearances within a segment is strengthened by the fact that the four most frequent words appearing in exactly two segments are HEARST 48, LINDA 42, HENRIETTA 41 and MARIS 36.
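
The screening idea described above (compare a word's total count with the number of segments that contain it) can be sketched as follows. The toy segments are hypothetical, the function names are invented for this illustration, and the min_count threshold is an assumption; only the "one or two segments" cutoff comes from the text.

```python
from collections import Counter


def occurrences_and_segments(segments):
    """Return {word: (total occurrences, number of segments containing it)}."""
    totals = Counter()
    spread = Counter()
    for words in segments:
        totals.update(words)
        spread.update(set(words))  # a segment counts at most once per word
    return {w: (totals[w], spread[w]) for w in totals}


def possibly_inflated(stats, max_segments=2, min_count=5):
    """Words that are fairly frequent yet confined to one or two segments.
    min_count is an arbitrary threshold chosen for this sketch."""
    return sorted(
        w for w, (n, s) in stats.items() if s <= max_segments and n >= min_count
    )


# Hypothetical toy data: three short "segments" of already-separated words.
toy = [
    ["THE", "CRUST", "THE"],
    ["THE", "SCOTTY", "SCOTTY", "SCOTTY", "SCOTTY", "SCOTTY"],
    ["THE", "MEANWHILE"],
]
stats = occurrences_and_segments(toy)
print(possibly_inflated(stats))
```

In the toy data only SCOTTY is flagged: it is frequent, but every occurrence sits in a single segment, which is the pattern the article warns about.
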
At the other extreme, some words appear once each in many different segments: MEANWHILE 35, INVARIABLY 31, LETTING 31, LOSING 28, DELICATE 27, EAGER 27, PECULIAR 27 and PROFOUND 27. Only a handful of words appear in all 500 segments: THE, OF, AND, TO, A, IN, THAT, FOR, IT, WITH, AS, ON, AT and FROM. (Surprisingly, IS is missing from 15; less surprisingly, personal pronouns such as HE, HER and I are missing from many more.)

According to this list, only eleven words -- THE, OF, AND, TO, A, IN, THAT, IS, WAS, HE and FOR -- appear a total of 25 per cent of the time, and 135 words are sufficient to account for half of all prose. THE alone occurs nearly 7 per cent of the time. At the other extreme, 48 per cent of the 50,000 separate words in the corpus appear only once. Many of these are victims of the particular sequences sampled, and in reality appear more frequently than once in a million words: BUN, WAND, TOIL, INFER, DRIP, MOAN, FARMED, ERR, DONKEY. Others, particularly long hyphenated coinages such as LET'S-MAKE-YOUR-HOUSE-OUR-CLUB and YIELDING-MEDITERRANEAN-WOMAN-FLESH-OF-WATER (the longest word in the corpus), are most unlikely to occur anywhere in English literature outside of the sample segment. The fictional segments generate quite a number of dialect-imitating coinages: MOR-EE-AIR-TEEEEE, S-S-SAHJUNT, SUHTHUHN (Southern?), T'JAWN (to John?), KEE-REIST, AYE-YAH-AH-AH, AAAWWW. Many weird-looking single-occurrence words are probably foreign names: IIJIMA, HELSQ'IYOKOM, IRAQW, TSCHILWYK, RATTZHENFUUT, HAQVIN, SPEGITITGNIND, and NNUOLAPERTAR-IT-VUH-KARTI-BIRI-PITKNOUMEN (the second longest word in the corpus). Sesquipedalian scientific terms are not forgotten: ALKYLBENZENESULFONATES (the longest unhyphenated word in the corpus), SPECTROPHOTOMETER, LEXICOSTATISTIC, GLOTTOCHRONOLOGY.
A few single-occurrence words with a fey charm of their own are DOUGHNUTTERY, HOO-PIG, PUMBLECHOOK and OOPSIE-COLA (look that up in your Funk & Wagnalls!). The specialized words of recreational linguistics appear very rarely; TRANSPOSITION, ANAGRAM, PALINDROMES and WORD-GAMES appear once apiece.

The transcription accuracy of the corpus appears to be very high; however, one apparent spelling error was noted: VIOILN for VIOLIN.

The Kucera-Francis corpus is well-suited for comparing the relative frequencies of groups of related words. A few illustrative examples:

   RED       197      THURSDAY     33
   ORANGE     23      FRIDAY       60
   YELLOW     55      SATURDAY     67
   GREEN     116      SUNDAY      101
   BLUE       43      MONDAY       68
   INDIGO      1      TUESDAY      59
   VIOLET      7      WEDNESDAY    35

   JANUARY     53     NORTH   206     NORTHEAST   16
   FEBRUARY    45     EAST    183     SOUTHEAST   28
   MARCH      120     SOUTH   240     SOUTHWEST   16
   APRIL       71     WEST    235     NORTHWEST   25
   MAY       1400
   JUNE        93
   JULY        65
   AUGUST through DECEMBER: 56, 51, 74, 62 [only four counts survive for these five months]

A few of the relative frequencies on these lists must be interpreted with care. Most of the 1400 occurrences of MAY relate to the auxiliary verb, not the month, and some of the MARCHes are verbs. The interpretation of the apparent differences on these lists is, perhaps, more a matter for the psychologist or sociologist than the linguist. Why does JUNE stand out from the other months? Why is the popularity of a day of the week symmetrically arrayed with respect to SUNDAY? (Apparently, Sundays in June are written about six times more often than Thursdays in February.) Why is RED the most popular color? The points of the compass exhibit a strange inconsistency. The cardinal points WEST and SOUTH are clearly more popular than EAST and NORTH (reflecting historical population movements in the United States?), but SOUTHWEST is much less common than either NORTHWEST or SOUTHEAST.

Similar lists can be made of geographic place-names.
The most obvious set of these are the 50 states (actually, only 47 can be distinguished, for the Dakotas, the Carolinas and the Virginias must be lumped together). A simple tabulation is not too meaningful, for it is to be expected that more populous states will be mentioned more frequently; to correct for this bias, an index number for each state was obtained by dividing the number of occurrences in the corpus by the state's 1960 population in millions. The results are summarized below (using the official Post Office state abbreviations):

   122      RI
    97      AK
    71      WA
    61      DE
    54      VT
   20-29    WY, HI, NV
   10-19    AR, KS, MS, GA, MA, NY, NM, NH
    6-9     OR, CO, OK, LA, MD, AZ, CT, TX, TN, UT, ME
    4-5     MN, IL, OH, NJ, KY, CA, MO, PA, FL, WI, ID, NE
    0-3     AL, IA, IN, MI, MT

The lumped Virginias fall in the 10-19 range, and the lumped Dakotas and Carolinas both fall in the 4-5 range. Why is Rhode Island (actually, the word RHODE) at the head of the list? Apparently the compilers of the corpus, based at Brown University in Providence, selected a number of 2000-word segments from sources close at hand; among the 50 references cited are press reviews from the Providence Journal and documents from the Rhode Island Legislative Council. The high indices for Washington and New York can be explained by the fact that these names also refer to cities. Virginia is a feminine name as well as a state, and Mexico is a country as well as part of a state's name. However, in the absence of a list of references for the other 450 segments, it is impossible to explain the high indices associated with Alaska, Delaware, and Vermont.

The corresponding indices for United States cities are equally variable, ranging from highs of 385 for Providence and 258 for Washington to a low of 8 for San Diego.
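
The index number used for states and cities (occurrences in the corpus divided by 1960 population in millions) is simple enough to sketch directly. The function names are invented for this illustration, the figures in the example call are hypothetical, and rounding the index before placing it in one of the printed ranges is an assumption, since the article does not say how fractional values were handled.

```python
def population_index(occurrences, population_1960):
    """Occurrences in the corpus per million inhabitants (1960 population)."""
    return occurrences / (population_1960 / 1_000_000)


def index_range(index):
    """Place an index in one of the ranges used in the state table.
    Rounding to the nearest integer first is an assumption of this sketch."""
    value = round(index)
    for low, high in [(0, 3), (4, 5), (6, 9), (10, 19), (20, 29)]:
        if low <= value <= high:
            return f"{low}-{high}"
    return str(value)  # values of 30 or more are listed individually


# Hypothetical figures, chosen only to show the arithmetic: 30 occurrences
# for a place with 2.5 million inhabitants.
example = population_index(30, 2_500_000)
print(example, index_range(example))
```

Dividing by population in this way is what lets a small state with a modest number of mentions rise to the head of the list.
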
The median index for the 21 largest cities (those with a 1960 population of 500,000 or more) is 25, considerably larger than the state median index of 7.

Another game that one can play with the Kucera-Francis corpus is to tabulate the number of occurrences of one-letter words. All alphabetic letters but Z are represented:

   A  23237      B  117      C  130      I  5173
   [the counts for the remaining letters are scrambled beyond recovery in this copy]

Large values for A and I are to be expected; however, the rest of the alphabet does not match the normal frequency of letters in English-language text (for example, J exceeds E, and W exceeds T). It is likely that most of these "words" are initials for first names and middle names, explaining the high incidence of letters such as J and W.

Thirty-five of the chemical elements in the periodic table are listed, the most common ones being GOLD 51, IRON 43, OXYGEN 43, HYDROGEN 39, CHLORINE 33, CARBON 30, and SILVER 29. Actually, LEAD 129 heads the list, but it is likely that most of these occurrences refer to the various homonyms of the element.

For mathematicians, perhaps the most interesting aspect of the corpus is the light it sheds on the relative English-language usage of different integers. As one might expect, the larger the integer, the smaller the number of occurrences; the word THREE occurs far more often than the word EIGHTY-SEVEN. However, the rates of decline are obviously different for cardinal numbers (ONE, TWO, THREE, ...), ordinal numbers (FIRST, SECOND, THIRD, ...), and mathematical notation (1, 2, 3, ...).
The unparenthesized numbers in the table below summarize the number of occurrences of the three number-types in the Kucera-Francis corpus:

    n     Cardinal        Ordinal         Mathematical
    1     3292 (3000)     1360 (1500)     496 (800)
    2     1412 (1160)      373 (375)      450 (400)
    3      610 (588)       190 (167)      282 (267)
    4      359 (374)        74 (94)       196 (200)
    5      286 (269)        38 (60)       134 (160)
    6      220 (212)        26 (42)       113 (133)
    7      113 (162)        31 (30)        91 (114)
    8      104 (133)        23 (24)       104 (100)
    9       81 (111)        20 (19)        63 (89)
   10      165 (95)          7 (15)       143 (80)
   11       40               4             62
   12       48 (72)          5 (10.4)      98 (67)
   13       11                             47
   14       31 (57)            (7.7)       55 (57)
   15       56                            109
   16       20 (47)            (5.9)       51 (50)
   17       24                             38
   18       17 (39)            (4.6)       50 (44)
   19       18                             34
   20       80 (34)         20 (3.7)       88 (40)
   21        8                             45
   22        8                             45
   23        7                             46
   24       14                             49
   25       25 (24)            (2.4)       82 (32)

   [Ordinal counts for 13-19 and 21-25 are only partly legible in this copy; the surviving values, in printed order, are 2, 3, 9, 12, 22, 42 (before 20) and 3 (after 20).]

Note that the cardinal sequence starts out at a much higher level than the mathematical one, but the latter soon overtakes it. This suggests that English-language writers are more likely to spell out small integers, but use mathematical notation for large ones (when the words become long and awkward). On the other hand, the ordinal sequence (except for the bulge between 14 and 22) drops off even faster than the cardinal one.

Superimposed on the general decline is a certain amount of anomalous behavior. The aforementioned bulge in the ordinals between 14 and 22 is probably attributable to terminology such as the NINETEENTH century. In both the cardinal and mathematical sequences, the entries for 10 and 20 (and, to a lesser extent, 15 and 25) are much larger than their immediate neighbors, leading to the suspicion that these numbers are often used as approximations of their neighbors.
(A similar effect has been noticed by the Bureau of the Census in the demographic field; the tendency of ages reported to census-takers to pile up at integers divisible by 5 is called "age-heaping".) Finally, the abnormally low entry associated with the cardinal number THIRTEEN suggests that triskaidekaphobia (fear of the number 13) may be playing a role.

If one ignores the "ordinal bulge" and spreads out the peaks at 10, 15, 20 and 25, it is tempting to ask whether or not the general decline can be approximated by a simple mathematical model. Let n be the value of the integer and E denote the expected number of occurrences in the million-word corpus; then a simple predictive model can be written in the form E = A/n^B. The arbitrary constants A and B were adjusted (by trial and error, although more refined statistical procedures could have been used) for each of the three sequences (for n running from 1 to 100) to get the best-fitting model possible:

   cardinal sequence:        E = 3000/n^(3/2)
   ordinal sequence:         E = 1500/n^2
   mathematical sequence:    E = 800/n

Selected predicted values generated by these series are given in parentheses in the table above. The fit is good but far from perfect; the differences cannot be dismissed as statistical fluctuations in all cases.

Time is another dimension that can be explored with the aid of Kucera and Francis.
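
The three power-law models fitted to the integer sequences above can be evaluated with a few lines of Python. This is a sketch for checking the arithmetic, not a fitting procedure: the constants are the ones given in the text, the observed counts are the first ten entries of the mathematical column in the table, and the names MODELS and expected are invented for this illustration.

```python
# (A, B) for each sequence in the model E = A / n**B, as given in the text.
MODELS = {
    "cardinal": (3000, 1.5),
    "ordinal": (1500, 2.0),
    "mathematical": (800, 1.0),
}


def expected(kind, n):
    """Expected number of occurrences of the integer n in the corpus."""
    a, b = MODELS[kind]
    return a / n ** b


# Observed counts for the numerals 1 through 10, taken from the table.
observed_math = [496, 450, 282, 196, 134, 113, 91, 104, 63, 143]

for n, observed in enumerate(observed_math, start=1):
    predicted = expected("mathematical", n)
    print(f"{n:2d}  observed {observed:4d}  predicted {predicted:6.1f}")
```

Printing observed and predicted values side by side makes the peak at 10 easy to see: the model keeps falling smoothly while the observed count jumps.
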
The table below gives the number of times each year (or group of years) was mentioned:

   1961   134      1950   21      1930-39   76      1830-39   22
   1960   170      1949   20      1920-29   89      1820-29   15
   1959    98      1948   20      1910-19   79      1810-19   36
   1958    89      1947   14      1900-09   50      1800-09   12
   1957    46      1946   20      1890-99   38      1750-99   49
   1956    28      1945   18      1880-89   28      1700-49   21
   1955    25      1944   15      1870-79   18      1650-99   11
   1954    46      1943   11      1860-69   28      1600-49   54
   1953    26      1942           1850-59   26      1550-99   24
   1952    31      1941           1840-49   24      1500-49    5
   1951    20      1940

   [Only two counts, 11 and 15, survive for the three years 1942, 1941 and 1940.]

This declining sequence can be quite adequately modeled by the formula E = 200/n, where n is the number of years prior to 1961. For example, the formula predicts 200 occurrences of 1960, 18 of 1950, and 9.5 of 1940; by 1860 it predicts 2 occurrences, and by 1660, only 2/3 (two dates in a three-year interval). In some sense, this formula measures the rate at which interest in the past diminishes, at least as far as written records of a given epoch are concerned.

A. ROSS ECKLER
Morristown, New Jersey

In exploring the recreational byways of the English language, the most important tool of the logologist is an unabridged dictionary. (To enlarge his stockpile of words, he may also find it necessary to consult a wide variety of dictionaries in specialized fields, as well as gazetteers and biographical works -- even such ephemera as telephone directories.)
