OED Text Annotator: FAQs

Thank you for your interest in this application from Oxford Languages. The OED Text Annotator has been designed to annotate texts written between 1750 and the present day using lexical information derived from the OED.

To use the tool, upload your text as a plain-text file (.txt). You will be asked for the approximate date when the text was written; this can help to improve the accuracy of annotation. If the software that you are using to create or edit text files asks you to choose an "Encoding" when you save the file, please specify UTF-8.

Your text will be annotated with:

  • Etymology type
  • Languages of immediate etymon
  • Languages of ulterior etymon
  • Contemporary frequency
  • Modern frequency
  • Dates in use

If you need to report a problem, or can't find the answer to your question in this FAQ, please email the OED team at oed.uk@oup.com and include "OED Text Annotator" in the subject line.

The “Annotated Tokens Table” is a CSV file which includes a row for each token (word, punctuation mark, or other symbol) in the input text, in the order in which the tokens occur in the input text. Thus, if the input text consists of 10,000 tokens, the Annotated Tokens Table will have 10,000 rows. The first column of each row is the token as it appears in the input text. The subsequent columns contain additional information supplied by the annotation process, such as lemma, part of speech, date, and etymological information.
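Because the table is a plain CSV with one row per token, it can be loaded with any CSV library. A minimal sketch using Python's standard library (the sample rows here are invented for illustration, not real Annotator output):

```python
import csv
import io

# A small invented sample in the shape described above: the first column is
# the token as it appears in the text; later columns carry the annotation.
sample = """token,part_of_speech,lemma,first_known_use,sort_date
The,DT,the,OE—,1000
twilight,NN,twilight,a1400,1400
deepened,VBD,deepen,1535,1535
.,.,,,
"""

rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows))         # one row per token, in text order
print(rows[1]["lemma"])  # twilight
```

In practice you would pass the downloaded file to `csv.DictReader` (opened with `encoding="utf-8"`) instead of the in-memory sample.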

The “Concordance Table” is a CSV file which includes a row for each distinct lemma in the input text. Each row includes the same columns as in the Annotated Tokens Table, and in addition includes an "occurrences" column which gives the number of times that the lemma occurs in the input text. For example, if the word twilight occurs 20 times in the input text, the Concordance Table will contain a single row for the lemma twilight, with the value "20" in the "occurrences" column. Rows are sorted in descending order by the value of the "occurrences" column, i.e. most frequent words first.

Because the Concordance Table only includes one row for each distinct lemma, it is usually significantly smaller than the corresponding Annotated Tokens Table. For example, if the input text consists of 10,000 tokens, the Concordance Table is likely to have 5,000-6,000 rows.
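The relationship between the two tables can be reproduced in a few lines: counting each distinct lemma and sorting by occurrences, most frequent first. A sketch with an invented lemma stream (real input would be the "lemma" column of the Annotated Tokens Table):

```python
from collections import Counter

# Invented token→lemma stream standing in for the Annotated Tokens Table.
lemmas = ["save", "the", "save", "twilight", "the", "the"]

counts = Counter(lemmas)
# One row per distinct lemma, most frequent first — as in the Concordance Table.
concordance = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(concordance)  # [('the', 3), ('save', 2), ('twilight', 1)]
```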

If you have received a message saying your annotation has failed, check the format and size of your text. For best results, texts to be annotated should be plain text (.txt), UTF-8 encoded, and smaller than 4MB in size.
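The three requirements above (a .txt file, UTF-8 encoded, under 4MB) can be checked before uploading. A best-guess pre-flight sketch, not the Annotator's own validation logic:

```python
import os

MAX_BYTES = 4 * 1024 * 1024  # the 4MB limit mentioned above

def check_upload(path):
    """Return a list of problems likely to cause annotation to fail."""
    problems = []
    if not path.lower().endswith(".txt"):
        problems.append("not a .txt file")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("larger than 4MB")
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    return problems  # empty list means the file looks fine
```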

The annotation process can take some time for larger texts (an hour or more for very long novel-length texts). The application puts texts for annotation in a queue, so even if your text is relatively short, it may be delayed by a longer text ahead of it in the queue. You will be notified by email when the annotation is complete.

The following columns are given in the “Annotated Tokens Table”:

token
The original form of the token (word, punctuation mark, or other symbol), as it appears in the original text.
part_of_speech
The part of speech of the token. The OED Text Annotator uses the Penn Treebank POS tagset for part-of-speech tagging.
lemma
The corresponding OED lemma, in the form given in the OED. For example, if the original token is Apple-trees, the corresponding OED lemma will be apple tree.
lemma_part_of_speech
The part of speech of the lemma. For example, if the part of speech of the token is “VBD” (past-tense verb), the part of speech of the corresponding lemma will be “VB” (infinitive verb).
oed_reference
A canonical reference to the lemma in the OED, including part(s) of speech and homograph number where applicable. For example, for the token prayers, the OED reference for the corresponding lemma prayer is “prayer, n.1”.
contemporary_frequency
The frequency of the lemma (per million tokens) in typical written English around the time when the text was written. Frequency data is derived primarily from the Google Books Ngrams data set.
modern_frequency
The frequency of the lemma (per million tokens) in modern written English.
first_known_use
The period or date from which this lemma is first evidenced in English. In most cases, this corresponds to the date of the earliest quotation given in the OED entry or subentry. But see the note below regarding dating of words from the Old English period.
sort_date
An integer equivalent of the first_known_use value. This provides support for sorting or filtering by date.
etymology_type
A classification of the origin of the word in English. The most common values are:
  • inherited: words inherited from the Germanic base of English
  • conversion: words converted from a pre-existing English word with a different part of speech
  • variant: words developed as a variation on a pre-existing English word
  • borrowing: words borrowed from another language
  • derivative: words derived by adding a prefix or suffix to a pre-existing English word
  • compound: words composed from two or more pre-existing English words
language_of_immediate_origin
The language(s) from which this word is immediately derived, according to the etymology given in the OED. For example, if a word originated in Greek, then passed into Latin, but reached English via French, the language_of_immediate_origin value would be “French”.
language_of_ulterior_origin
The language(s) from which this word is ultimately or more distantly derived, according to the etymology given in the OED. For example, if a word originated in Greek, then passed into Latin, and reached English via French, the language_of_ulterior_origin value would be “Greek”.
revised
"True" if the annotation information is taken from an OED entry that has been revised for the OED 3rd edition; "False" if the annotation information is taken from an unrevised OED entry. In general, information taken from an unrevised entry should be regarded as potentially less reliable.
record_uri
The OED Text Annotator runs on top of the OED Researcher API. The "record_uri" field gives a URI for the API record from which the annotation information has been derived.
url
A URL to the OED entry or subentry for the lemma.
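Taken together, these columns make the table easy to filter programmatically, e.g. by sort_date or etymology_type. A sketch with invented rows (the values, including blog's etymology_type, are made up for illustration):

```python
# Invented annotated rows in the shape of the columns described above;
# sort_date is an integer form of first_known_use, so date filters are easy.
rows = [
    {"lemma": "house",  "sort_date": 1000, "etymology_type": "inherited"},
    {"lemma": "prayer", "sort_date": 1300, "etymology_type": "borrowing"},
    {"lemma": "blog",   "sort_date": 1999, "etymology_type": "variant"},
]

# Words first evidenced after 1900:
recent = [r["lemma"] for r in rows if r["sort_date"] > 1900]
print(recent)  # ['blog']

# Borrowings only:
borrowed = [r["lemma"] for r in rows if r["etymology_type"] == "borrowing"]
print(borrowed)  # ['prayer']
```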

The “Concordance Table” has many of the same columns, but with some differences:

  • The token and part_of_speech columns are omitted. The Concordance Table aggregates multiple tokens under a single lemma. For example, the tokens save (VB), saves (VBZ), saving (VBG), and saved (VBD) would all be aggregated into a single row for the lemma save (VB). Hence each row in the Concordance Table represents a lemma, not any specific token.
  • An occurrences column is included, giving the number of times tokens mapped to this lemma occur in the input text.
  • A frequency_in_text column is included, giving the frequency (per million tokens) of this lemma in the input text. This is based on the ratio of occurrences to the total number of tokens in the input text.
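The frequency_in_text calculation is straightforward: occurrences divided by total tokens, scaled to per-million. A minimal sketch:

```python
def frequency_per_million(occurrences, total_tokens):
    # frequency_in_text = occurrences / total tokens, scaled to per-million
    return occurrences * 1_000_000 / total_tokens

# e.g. a lemma occurring 20 times in a 10,000-token text:
print(frequency_per_million(20, 10_000))  # 2000.0
```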

The OED Text Annotator maps each word in the input text to the corresponding OED lexeme (lemma + part of speech). It does not attempt to perform any semantic disambiguation of the token, and therefore does not map the token to a particular sense of the OED lexeme. This means that it cannot identify the appropriate definition to use for a given token.

The OED Text Annotator uses the Penn Treebank POS tagset for part-of-speech tagging. The most common codes are:

  • NN: singular noun
  • NNS: plural noun
  • NNP: proper noun
  • JJ: adjective
  • RB: adverb
  • VB: infinitive verb
  • VBZ: third-person present-tense verb
  • VBG: present-participle verb
  • VBD: past-tense verb
  • VBN: past-participle verb
  • IN: preposition
  • CC: conjunction
  • PRP: personal pronoun
  • PRP$: possessive pronoun

The OED Text Annotator uses the Stanford POS Tagger. The annotator makes some post-processing adjustments to align part-of-speech tags for tokens with OED’s part-of-speech conventions for lemmas.

For words from the Old English period (i.e. words documented before 1150), exact dating is not meaningful, due to the paucity of surviving manuscript evidence. All words from this period are therefore recorded as “OE—” in the "first_known_use" column, and as “1000” in the "sort_date" column. “1000” here should be regarded as an arbitrary value; it’s included to make it easier to sort and filter results by date, and should not be taken to mean that the word in question actually dates from the year 1000.

Essentially, the dating of a word as “OE—” / “1000” just means that the word is at least as old as surviving records of written English.
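When processing the output, it is worth treating the "1000" sentinel explicitly rather than as a real date. A small sketch of one way to do this (the labels are invented; only the "OE—"/"1000" convention comes from the documentation above):

```python
OE_SENTINEL = 1000  # arbitrary sort_date used for all "OE—" words

def date_label(first_known_use, sort_date):
    """Human-readable dating that doesn't over-read the OE sentinel."""
    if first_known_use == "OE—" and sort_date == OE_SENTINEL:
        return "Old English (before 1150; exact date unknown)"
    return str(first_known_use)

print(date_label("OE—", 1000))   # Old English (before 1150; exact date unknown)
print(date_label("1535", 1535))  # 1535
```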

Proper names and other non-lexical tokens (punctuation, numerals, and symbols) are included in both the Annotated Tokens Table and the Concordance Table. However, since these cannot be mapped to OED lexemes, no annotation information is provided, and the annotation columns are left blank.

The handling of compound words can vary depending on whether the compound is closed, hyphenated, or open.

  • A closed compound like marketplace is always treated as a single unit: it will be mapped to the corresponding OED lexeme if included in OED, or left unmapped if it is not included in OED. (This should work regardless of whether the corresponding OED lexeme is itself closed, hyphenated, or open.)
  • A hyphenated compound like market-place is first treated as a single unit, and will be mapped to the corresponding OED lexeme if included in OED. (Again, this should work regardless of whether the corresponding OED lexeme is itself closed, hyphenated, or open.) But if the compound as a whole is not found in OED, it will be tokenized into its constituent parts, and these will be individually mapped to their corresponding OED lexemes. For example, if the hyphenated compound market-intelligence is not found in OED, it will be broken into a sequence of three tokens (market, -, and intelligence), allowing market and intelligence to be individually mapped to their corresponding OED lexemes.
  • An open compound (i.e. one with a space between the two words) may or may not be treated as a unit. For example, the open compound black hole is identified as a single multi-word unit and mapped to the corresponding OED lexeme black hole. But in other cases it may be more difficult to identify whether a two-word sequence should be identified as a single unit. The OED Text Annotator errs decidedly on the side of caution: that’s to say, it will tend to treat two-word sequences as two distinct tokens unless it’s very confident that they form a fixed multi-word unit. We plan to refine this in future iterations of the application.

The annotator is built on top of the OED Researcher API, which does most of the work in lemmatizing the input text, matching each lemma to the corresponding OED entry or subentry, and deriving the information used to fill out the annotation.

The input text is first subdivided into a sequence of short passages (around 500 words each). Each passage is processed in turn, and the results are concatenated at the end. Caching is used to avoid repeated analysis if the same word appears in multiple passages.
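The split-and-cache pattern described above can be sketched as follows. This is an illustration of the general technique, not the Annotator's actual code; the ~500-word passage size is the only detail taken from the documentation:

```python
from functools import lru_cache

def split_into_passages(words, size=500):
    """Split a token list into passages of roughly `size` words each."""
    return [words[i:i + size] for i in range(0, len(words), size)]

@lru_cache(maxsize=None)
def annotate_word(word):
    # Stand-in for the real per-word lookup; caching means a word repeated
    # across passages is only analysed once.
    return word.lower()

words = ["Twilight"] * 1200
passages = split_into_passages(words)
print(len(passages))  # 3 passages: 500 + 500 + 200 words
results = [annotate_word(w) for p in passages for w in p]
print(annotate_word.cache_info().misses)  # the repeated word is looked up once
```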

For each passage, the annotator first tokenizes and part-of-speech tags the text of the passage using the Stanford POS Tagger. The text is then examined for any sequences of words that can be identified as multi-word entities (e.g. the two-word sequence black hole); these are chunked together as a unit, and in subsequent steps are treated as a single token.

Based on part-of-speech tags, each token is categorized as either lexical or non-lexical.

  • Non-lexical tokens include proper names, punctuation marks, numbers, and other symbols.
  • Lexical tokens include everything else: nouns, verbs, adjectives, adverbs, etc.

Non-lexical tokens are assumed not to be covered in OED, and are therefore left unannotated.

For each lexical token, the annotator checks the wordform + part-of-speech against the OED Researcher API’s database of wordforms (the API’s surfaceforms endpoint), which includes every OED lemma and variant spelling, together with their inflected forms.

  • If no match is found, the wordform is counted as "unknown", and the token is left unannotated in the results.
  • If more than one possible match is found, the annotator uses some simple heuristics to select the "best" match – usually, whichever represents the more common word at the time the text was written. For example, the token moles tagged as “NNS” (plural noun) could potentially be matched to any of OED’s ten homograph entries for the noun mole. The most common of these is mole, n.3 (“Any of various small burrowing insectivorous mammals of the subfamily Talpinae...”); accordingly, that’s the one that the annotator is likely to select as the best match.

Having matched a lexical token to the corresponding OED lemma, the annotator uses the OED Researcher API to retrieve the “word record” for that lemma. The record contains all the information that is used to annotate the token (date, etymology, frequency, reference and URL, etc.). The URI of the record itself is included as the "record_uri" column in the annotated results.

The date is used to “tune” the annotator to provide results appropriate to the period of the text. In particular, the date can help with disambiguation. For example, if a wordform in the text could potentially be mapped to several OED entries, the annotator is able to discard any candidate match that was obsolete by the time the text was written, or that had not yet entered the language at the time the text was written.

An approximate date is usually good enough, if you don’t know the exact date.

The OED Text Annotator uses a number of best-guess algorithms that tend to give the right results most of the time, but that can make mistakes in particular cases. Common causes of error include:

  • An error by the part-of-speech tagger can cause a cascade of errors in the annotation of a given token. For example, if an occurrence of the present-tense verb impacts is mistakenly tagged as “NNS” (plural noun), this will cause it to be mapped to the OED entry impact, n., rather than impact, v., which in turn will cause the wrong date, frequency information, etc., to be given in the annotation.
  • A given token may be ambiguous because of homography. For example, OED has ten homograph entries for the noun mole. It’s often unclear which of these homographs a given instance of the token mole should be mapped to, without attempting complex word-sense disambiguation. In such circumstances, the annotator will favour the most common (highest-frequency) of the candidate matches (in this case mole, n.3, the homograph for the burrowing mammal). This will tend to be a good guess most of the time, but may be wrong on occasions.
  • If a token uses an unusual or archaic spelling for a word, the annotator may fail to recognize it, or in some cases may map it to the wrong lemma. The annotator draws on all the variant spellings listed in the OED; but these are not exhaustive, and don’t always account for things like graphic abbreviations, phonetic representations of regional accents, etc.
  • Proper names are sometimes mistaken for lexical items (common nouns and the like), or vice versa. This is especially likely to happen at the beginning of a sentence, where capitalization is ambiguous.
  • Fixed multi-word units are not always correctly identified as such, in which case each unit is analysed and annotated as a discrete token. This is usually tolerable for open compounds made up of two English words (e.g. if the compound mountain guide is not identified as such, and is instead treated as two discrete tokens, mountain and guide). But it can cause problems for multi-word units that cannot be analysed as a pair of English words: terms like habeas corpus, chop suey, or mot juste.
  • Some words are not (yet) covered in OED. These may include foreign-language words not naturalized in English; neologisms; unusual slang and regional terms; and very specialist technical vocabulary. Such words will usually be left unannotated; but in some cases the annotator may misidentify such a word as a variant spelling of an unrelated word.

We would appreciate examples of any other kinds of error that you may spot.
Email us at oed.uk@oup.com and include "OED Text Annotator" in the subject line.

You can submit a pre-1750 text for annotation if the text uses modernized spelling. If the text uses original spelling, however, the accuracy of the results is likely to be poorer than for post-1750 texts. This is because the OED Text Annotator does not yet do a reliable enough job at mapping an archaic spelling of a word to its modern equivalent. We plan to improve this in future iterations of the application.

The links to the downloadable files expire after 7 days.

The maximum file size is 4MB. This should be sufficient for texts up to the length of most long novels (e.g. Moby-Dick is 1.2MB; Bleak House is 1.9MB).

Project Gutenberg is a good source of plain-text files suitable for the OED Text Annotator.
Once you have found a text that you would like to annotate:

  • Download the text file to your device, making sure to choose the “Plain Text UTF-8” format
    (usually, by right-clicking on the “Plain Text UTF-8” link, and then selecting Save Link As...)
  • Project Gutenberg texts usually have boilerplate sections at the start and end containing metadata, legal information, etc.
    Edit the file you just downloaded and remove those sections (note: check both the start and end of the file!).
    If your editing software gives you a choice, make sure that you save the file back in “Plain Text UTF-8” format
  • You can now upload the edited file to the OED Text Annotator
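Removing the boilerplate can also be scripted. Project Gutenberg files conventionally bracket the text with "*** START OF ..." and "*** END OF ..." marker lines, though the exact wording varies between files, so check the result by eye. A sketch:

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the *** START ... *** and *** END ... ***
    markers; if either marker is missing, return the text unchanged."""
    lines = text.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        if "*** START OF" in line.upper():
            start = i + 1
        elif "*** END OF" in line.upper():
            end = i
            break
    if start is not None and end is not None:
        lines = lines[start:end]
    return "\n".join(lines).strip()

sample = """Header metadata
*** START OF THE PROJECT GUTENBERG EBOOK MOBY-DICK ***
Call me Ishmael.
*** END OF THE PROJECT GUTENBERG EBOOK MOBY-DICK ***
Licence text"""
print(strip_gutenberg_boilerplate(sample))  # Call me Ishmael.
```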

Use of the OED Text Annotator is subject to the Terms and Conditions.