It is difficult to produce a useful report on the frequency of English words, because two different words often look identical (e.g. 'lead' the verb and 'lead' the noun; 'to' is sometimes a preposition and sometimes an infinitive verb marker). One of the more useful surveys of a large body of English material is the file ftp://ftp.itri.bton.ac.uk/pub/bnc/all.num.o5.gz, a survey of the British National Corpus prepared and made available by the Information Technology Research Institute at the University of Brighton. The material surveyed includes millions of words of transcribed conversation, printed text, and lectures and oratory.
If we look at the 1996 version of this survey and add together items that are closely related -- for example, if we consider 'this' and 'these' as a single item -- we find that the following items are the most frequent, starting with 'the' which makes up 6.18 percent of the corpus:
6.18% the
4.23% is, was, be, are, 's (= is), were, been, being, 're, 'm, am
2.94% of
2.68% and
2.46% a, an
1.80% in, inside (preposition)
1.62% to (infinitive verb marker)
1.37% have, has, 've, 's (= has), had, having, 'd (= had)
1.27% he, him, his
1.25% it, its
1.17% I, me, my
0.91% to (preposition)
0.86% they, them, their
0.86% not, n't, no (interjection)
0.83% for
0.83% you, your
0.70% she, her
0.65% with
0.64% on
0.62% that (conjunction)
0.58% this, these
0.57% that (demonstrative), those
0.55% do, did, does, done, doing
0.51% we, us, our
0.50% by
0.47% at
0.45% but (conjunction)
0.44% 's (possessive)
0.41% from
0.40% as (many parts of speech)
0.37% which
0.37% or
0.31% will, 'll
0.28% said, say, says, saying
0.25% would
0.25% what
0.23% there (existential, in "there is ..." phrases)
0.23% if
0.23% can
0.22% all
0.22% who, whose
0.21% so (adverb / conjunction)
0.20% go, went, gone, goes
0.20% more
0.19% other, another
0.19% one (numeral)
0.18% see, saw, seen, seeing
0.18% know, knew, known, knows, knowing
The items listed above make up about 43% of the corpus.
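The merging step described above (adding together closely related items and expressing the sum as a share of the corpus) can be sketched as follows. This is only an illustration of the arithmetic, not the procedure actually used to prepare the list: the function name `grouped_shares`, the group labels, the part-of-speech tags, and all of the counts are assumptions invented for the example and are not BNC figures.

```python
def grouped_shares(counts, groups, total_tokens):
    """Return {group label: percent of corpus} after merging related entries.

    counts: mapping of (word, pos) -> number of occurrences
    groups: mapping of group label -> list of (word, pos) keys to add together
    total_tokens: size of the whole corpus, in tokens
    """
    shares = {}
    for label, members in groups.items():
        # Entries missing from the counts simply contribute zero.
        merged = sum(counts.get(key, 0) for key in members)
        shares[label] = 100.0 * merged / total_tokens
    return shares


# Toy counts, invented for illustration only (not taken from the survey).
toy_counts = {
    ("the", "AT0"): 6180,
    ("this", "DT0"): 450,
    ("these", "DT0"): 130,
}
toy_groups = {
    "the": [("the", "AT0")],
    "this, these": [("this", "DT0"), ("these", "DT0")],
}

shares = grouped_shares(toy_counts, toy_groups, total_tokens=100_000)
for label, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{pct:.2f}% {label}")
```

Keying the counts on (word, part of speech) pairs rather than on the bare word is what allows 'to' the preposition and 'to' the infinitive marker to be reported separately.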
This local link will take you to a portion of the survey giving the 3000 most frequent words. Each line consists of four items: number of times the word occurred, the word, its part of speech, and the number of files in which the word was found. Here is the README file from the ftp directory where the original files are housed:
README for ftp.itri.bton.ac.uk/pub/bnc
======================================
Adam Kilgarriff
20 Nov 1995
Updated 15 March 1996
The files in this directory relate to the British National Corpus (BNC).
They are a bibliographical database, various frequency lists,
and a file giving variances of word frequencies (details in
variances.doc).
bib-dbase   a one-line-per-file bibliographic database for the
            4124 files in the BNC. (The first part of the file
            describes the coding scheme.)
Frequency lists:
These are all available in 6 forms:
* sorted alphabetically ('al')
or by frequency (highest frequency first) ('num');
* the complete lists, or a smaller file containing only those
items occurring over five times (suffix 'o5');
* all lists are available compressed using gzip ('.gz'). The
o5 lists are also available uncompressed (no suffix).
The frequencies are for (CLAWS-word, POS) pairs. NB some CLAWS words
- eg 'in spite of' are not orthographic words, while others are
numbers etc, and some POS's are CLAWS 'portmanteau tags', eg NN1-VVB,
where CLAWS was uncertain as to whether the word was a singular common
noun or base form of a verb. See BNC manual for serious documentation,
also my 'Putting frequencies in the dictionary' (available via www home
page, see address below) for detailed discussion of frequency lists.
The format is: four fields, separated by spaces.
1: frequency
2: word
3: pos
4: number of files the word occurs in
For non-orthographic words, spaces are replaced by underscore, giving
eg 'in_spite_of'
cg       'context-governed' spoken material
         (eg meetings, lectures etc)        6.2M tokens,  79,906 types
demog    'demographic' spoken material
         (eg conversation)                  4.2M tokens,  54,652 types
written                                    89.7M tokens, 921,074 types
all                                       100.1M tokens, 939,028 types
Sizes in MB ('al' and 'num' variants all the same size)
         all uncompressed   .gz    o5    o5.gz
-------------------------------------------------------------
all           18.1          4.8    4.0   1.32
cg             1.4          0.39   0.43  0.15
demog          0.9          0.26   0.25  0.09
written       17.8          4.7    3.9   1.30
-------------------------------------------------------------
For further information on the BNC see
http://info.ox.ac.uk/bnc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Research Fellow tel: (44) 1273 642919
Information Technology Research Institute (44) 1273 642900
University of Brighton fax: (44) 1273 642908
Lewes Road
Brighton BN2 4AT email: Adam.Kilgarriff@itri.bton.ac.uk
UK http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
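
The four-field line format described in the README (frequency, word, part of speech, number of files, separated by spaces) could be read with a short script along the following lines. This is a minimal sketch, not a tool supplied with the corpus: the names `Entry`, `parse_line` and `read_entries` are hypothetical, the sample lines and their counts are invented, and the Latin-1 text encoding is an assumption that should be checked against the actual files.

```python
import gzip
from typing import NamedTuple


class Entry(NamedTuple):
    frequency: int   # field 1: number of occurrences
    word: str        # field 2: CLAWS word, underscores kept as in the file
    pos: str         # field 3: part-of-speech tag, possibly a portmanteau
    files: int       # field 4: number of files the word occurs in

    @property
    def display_word(self):
        # Multi-word items are stored as e.g. 'in_spite_of'.
        return self.word.replace("_", " ")

    @property
    def pos_options(self):
        # A portmanteau tag such as 'NN1-VVB' lists both candidate tags.
        return self.pos.split("-")


def parse_line(line):
    """Parse one 'frequency word pos files' line into an Entry."""
    freq, word, pos, files = line.split()  # ValueError if not four fields
    return Entry(int(freq), word, pos, int(files))


def read_entries(path):
    """Yield entries from a gzip-compressed list (encoding is an assumption)."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


# Invented sample lines in the documented format (counts are not real).
sample = ["1234 in_spite_of PRP 567", "89 lead NN1-VVB 42"]
entries = [parse_line(line) for line in sample]
for e in entries:
    print(e.frequency, e.display_word, e.pos_options, e.files)
```

The stored word is left untouched and the space-separated form is exposed separately, so that entries can still be matched against the original file.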