Thank you for your interest in this application from Oxford Languages. The OED Text Annotator has been designed to annotate texts written between 1750 and the present day using lexical information derived from the OED.
To use the tool, upload your text as a plain-text file (.txt). You will be asked for the approximate date when the text was written; this can help to improve the accuracy of annotation. If the software that you are using to create or edit text files asks you to choose an "Encoding" when you save the file, please specify UTF-8.
Your text will be annotated with:
If you need to report a problem, or can't find an answer to your problem in this FAQ, please email the OED team at oed.uk@oup.com and include "OED Text Annotator" in the subject line.
The “Annotated Tokens Table” is a CSV file which includes a row for each token (word, punctuation mark, or other symbol) in the input text, in the order in which the tokens occur in the input text. Thus, if the input text consists of 10,000 tokens, the Annotated Tokens Table will have 10,000 rows. The first column of each row is the token as it appears in the input text. The subsequent columns contain additional information supplied by the annotation process, such as lemma, part of speech, date, and etymological information.
The “Concordance Table” is a CSV file which includes a row for each distinct lemma in the input text. Each row includes the same columns as in the Annotated Tokens Table, and in addition includes an "occurrences" column which gives the number of times that the lemma occurs in the input text. For example, if the word twilight occurs 20 times in the input text, the Concordance Table will contain a single row for the lemma twilight, with the value "20" in the "occurrences" column. Rows are sorted in descending order by the value of the "occurrences" column, i.e. most frequent words first.
Because the Concordance Table only includes one row for each distinct lemma, it is usually significantly smaller than the corresponding Annotated Tokens Table. For example, if the input text consists of 10,000 tokens, the Concordance Table is likely to have 5,000-6,000 rows.
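As a rough illustration of how the two tables relate, the sketch below derives concordance-style counts from an Annotated Tokens Table. It assumes the lemma column is headed "lemma" (check the header row of your own file), and it simplifies by skipping rows with no lemma, whereas the real Concordance Table also lists unannotated tokens. The function name is hypothetical.

```python
import csv
import io
from collections import Counter


def build_concordance(tokens_csv, lemma_column="lemma"):
    """Count rows per lemma and return (lemma, occurrences) pairs,
    most frequent first. `lemma_column` is an assumed header name."""
    counts = Counter()
    for row in csv.DictReader(tokens_csv):
        lemma = (row.get(lemma_column) or "").strip()
        if lemma:  # simplification: ignore unannotated tokens
            counts[lemma] += 1
    return counts.most_common()


# Tiny made-up sample standing in for a real Annotated Tokens Table.
SAMPLE = "token,lemma\nTwilight,twilight\nfell,fall\n.,\ntwilight,twilight\n"
concordance = build_concordance(io.StringIO(SAMPLE))
print(concordance)  # [('twilight', 2), ('fall', 1)]
```

With a downloaded file, you would pass an open file handle (opened with encoding="utf-8" and newline="") instead of the in-memory sample.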
If you have received a message saying your annotation has failed, check the format and size of your text. For best results, texts to be annotated should be plain text (.txt), UTF-8 encoded, and smaller than 4MB in size.
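If you want to check a file before uploading it, a minimal sketch along the following lines may help. It tests the three conditions mentioned above. The function name is hypothetical, and reading "4MB" as 4 × 1024 × 1024 bytes is an assumption; the application may count the limit differently.

```python
MAX_BYTES = 4 * 1024 * 1024  # assumption: "4MB" interpreted as 4 MiB


def check_upload(filename, data):
    """Return a list of likely problems; an empty list means none found.
    `data` is the raw content of the file as bytes."""
    problems = []
    if not filename.lower().endswith(".txt"):
        problems.append("not a .txt file")
    if len(data) >= MAX_BYTES:
        problems.append("file is 4MB or larger")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    return problems


print(check_upload("novel.txt", "café".encode("utf-8")))    # []
print(check_upload("novel.txt", "café".encode("latin-1")))  # ['not valid UTF-8']
```

For a file on disk, the bytes can be obtained with pathlib.Path(filename).read_bytes().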
The annotation process can take some time for larger texts (an hour or more for very long novel-length texts). The application puts texts for annotation in a queue, so even if your text is relatively short, it may be delayed by a longer text ahead of it in the queue. You will be notified by email when the annotation is complete.
The following columns are given in the “Annotated Tokens Table”:
The “Concordance Table” has many of the same columns, but with some differences:
The OED Text Annotator maps each word in the input text to the corresponding OED lexeme (lemma + part of speech). It does not attempt to perform any semantic disambiguation of the token, and therefore does not map the token to a particular sense of the OED lexeme. This means that it cannot identify the appropriate definition to use for a given token.
The OED Text Annotator uses the Penn Treebank POS tagset for part-of-speech tagging. The most common codes are:
NN | singular noun |
NNS | plural noun |
NNP | proper noun |
JJ | adjective |
RB | adverb |
VB | base-form (infinitive) verb |
VBZ | third-person singular present-tense verb |
VBG | present-participle verb |
VBD | past-tense verb |
VBN | past-participle verb |
IN | preposition or subordinating conjunction |
CC | coordinating conjunction |
PRP | personal pronoun |
PRP$ | possessive pronoun |
The OED Text Annotator uses the Stanford POS Tagger. The annotator makes some post-processing adjustments to align part-of-speech tags for tokens with OED’s part-of-speech conventions for lemmas.
For words from the Old English period (i.e. words documented before 1150), exact dating is not meaningful, due to the paucity of surviving manuscript evidence. All words from this period are therefore recorded as “OE—” in the "first_known_use" column, and as “1000” in the "sort_date" column. “1000” here should be regarded as an arbitrary value; it’s included to make it easier to sort and filter results by date, and should not be taken to mean that the word in question actually dates from the year 1000.
Essentially, the dating of a word as “OE—” / “1000” just means that the word is at least as old as surviving records of written English.
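When filtering results by date, it can be worth treating the Old English rows separately so that the placeholder value 1000 is not mistaken for a real date. The sketch below shows one way to do this using the "first_known_use" and "sort_date" columns. The function names are hypothetical, the "token" header is an assumption, and the sample values for the non-OE row are placeholders rather than OED data.

```python
import csv
import io


def is_old_english(row):
    """True if the row is dated to the Old English period ("OE—")."""
    return (row.get("first_known_use") or "").strip().startswith("OE")


def first_recorded_between(rows, start, end):
    """Rows whose sort_date lies in [start, end], skipping Old English
    rows (whose sort_date of 1000 is arbitrary) and unannotated rows."""
    selected = []
    for row in rows:
        raw = (row.get("sort_date") or "").strip()
        if not raw.isdigit() or is_old_english(row):
            continue
        if start <= int(raw) <= end:
            selected.append(row)
    return selected


# Made-up sample rows; the header names other than first_known_use
# and sort_date are assumptions.
SAMPLE = (
    "token,first_known_use,sort_date\n"
    "house,OE—,1000\n"
    "gadget,1886,1886\n"
    ".,,\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print([r["token"] for r in first_recorded_between(rows, 1800, 1900)])  # ['gadget']
print([r["token"] for r in rows if is_old_english(r)])                 # ['house']
```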
Proper names and other non-lexical tokens (punctuation, numerals, and symbols) are included in both the Annotated Tokens Table and the Concordance Table. However, since these cannot be mapped to OED lexemes, no annotation information is provided, and the annotation columns are left blank.
The handling of compound words can vary depending on whether the compound is closed, hyphenated, or open.
The annotator is built on top of the OED Researcher API, which does most of the work in lemmatizing the input text, matching each lemma to the corresponding OED entry or subentry, and deriving the information used to fill out the annotation.
The input text is first subdivided into a sequence of short passages (around 500 words each). Each passage is processed in turn, and the results are concatenated at the end. Caching is used to avoid repeated analysis if the same word appears in multiple passages.
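The passage-splitting and caching described above can be pictured with the following minimal sketch. It is not the annotator's actual code: the function names are hypothetical, splitting on whitespace is a simplification, and the cache is keyed on the bare word, whereas the real tool may key on something richer (for example wordform plus part of speech).

```python
def split_into_passages(text, words_per_passage=500):
    """Split text into consecutive passages of about N words each."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def annotate_text(text, analyse_word, words_per_passage=500):
    """Process each passage in turn and concatenate the results.
    `analyse_word` stands in for the expensive per-word analysis;
    each distinct word is analysed only once thanks to the cache."""
    cache = {}
    results = []
    for passage in split_into_passages(text, words_per_passage):
        for word in passage.split():
            if word not in cache:
                cache[word] = analyse_word(word)
            results.append((word, cache[word]))
    return results


calls = []

def fake_analysis(word):
    # Placeholder analysis that records how often it is invoked.
    calls.append(word)
    return word.upper()

annotated = annotate_text("the cat saw the cat", fake_analysis, words_per_passage=2)
print(calls)  # ['the', 'cat', 'saw']
```

In the example, "the" and "cat" each occur twice but are analysed only once, even though the repeats fall in different passages.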
For each passage, the annotator first tokenizes and part-of-speech tags the text of the passage using the Stanford POS Tagger. The text is then examined for any sequences of words that can be identified as multi-word entities (e.g. the two-word sequence black hole); these are chunked together as a unit, and in subsequent steps are treated as a single token.
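One common way to chunk multi-word entities is a greedy longest-match scan over the token sequence, sketched below. This is an illustration of the general idea only; the annotator's actual method is not documented here, and the function name, the phrase list, and the three-word limit are all assumptions.

```python
def chunk_multiword(tokens, known_phrases, max_len=3):
    """Merge runs of tokens that match a known multi-word phrase.
    `known_phrases` is a set of lowercase token tuples (hypothetical)."""
    chunked = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in known_phrases:
                chunked.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here; keep the token on its own.
            chunked.append(tokens[i])
            i += 1
    return chunked


phrases = {("black", "hole")}
print(chunk_multiword(["a", "black", "hole", "formed"], phrases))
# ['a', 'black hole', 'formed']
```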
Based on part-of-speech tags, each token is categorized as either lexical or non-lexical.
Non-lexical tokens are assumed not to be covered in OED, and are therefore left unannotated.
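As a rough sketch of this categorization, the function below treats proper nouns, numerals, symbols, and punctuation as non-lexical, in line with the description of non-lexical tokens elsewhere in this FAQ. The exact set of Penn Treebank tags that the annotator treats as non-lexical is an assumption, as is the function name.

```python
# Assumed non-lexical word-like tags: proper nouns, numerals, symbols.
NON_LEXICAL_TAGS = {"NNP", "NNPS", "CD", "SYM"}


def is_lexical(tag):
    """Guess whether a Penn Treebank tag marks a lexical token.
    Punctuation tags (",", ".", "``", "-LRB-", ...) do not start
    with a letter, so they are rejected by the final check."""
    if tag in NON_LEXICAL_TAGS:
        return False
    return tag[:1].isalpha()


tagged = [("Ahab", "NNP"), ("saw", "VBD"), ("3", "CD"), ("whales", "NNS"), (".", ".")]
print([word for word, tag in tagged if is_lexical(tag)])  # ['saw', 'whales']
```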
For each lexical token, the annotator checks the wordform + part-of-speech against the OED Researcher API’s database of wordforms (the API’s surfaceforms endpoint), which includes every OED lemma, variant spelling, and inflected forms of these.
Having matched a lexical token to the corresponding OED lemma, the annotator uses the OED Researcher API to retrieve the “word record” for that lemma. The record contains all the information that is used to annotate the token (date, etymology, frequency, reference and URL, etc.). The URI of the record itself is included as the "record_uri" column in the annotated results.
The date is used to “tune” the annotator to provide results appropriate to the period of the text. In particular, the date can help with disambiguation. For example, if a wordform in the text could potentially be mapped to several OED entries, the annotator is able to discard any candidate match that was obsolete by the time the text was written, or that had not yet entered the language at the time the text was written.
An approximate date is usually good enough, if you don’t know the exact date.
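The date-based filtering can be pictured as follows. This is a minimal sketch under stated assumptions: the Candidate class and its field names are hypothetical, the sample entries and dates are invented, and the real annotator may weigh other evidence as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    lemma: str                # hypothetical label for an OED entry
    first_use: int            # year the word is first recorded
    last_use: Optional[int]   # year it became obsolete; None if current


def plausible_candidates(candidates, text_year):
    """Discard candidates that were obsolete by text_year or that had
    not yet entered the language at text_year."""
    return [
        c for c in candidates
        if c.first_use <= text_year
        and (c.last_use is None or c.last_use >= text_year)
    ]


# Invented candidates for a single ambiguous wordform.
candidates = [
    Candidate("entry_a", 1400, 1650),  # obsolete before 1850
    Candidate("entry_b", 1600, None),  # in use in 1850
    Candidate("entry_c", 1920, None),  # not yet in the language in 1850
]
print([c.lemma for c in plausible_candidates(candidates, 1850)])  # ['entry_b']
```

Because the comparison only needs to fall on the right side of each candidate's date range, a date that is off by a decade or two will often give the same result, which is why an approximate date is usually sufficient.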
The OED Text Annotator uses a number of best-guess algorithms that tend to give the right results most of the time, but that can make mistakes in particular cases. Common causes of error include:
We would appreciate examples of any other kinds of error that you may spot.
Email us at oed.uk@oup.com and include "OED Text Annotator" in the subject line.
You can submit a pre-1750 text for annotation if the text uses modernized spelling. If the text uses original spelling, however, the accuracy of the results is likely to be poorer than for post-1750 texts. This is because the OED Text Annotator is not yet reliable enough at mapping an archaic spelling of a word to its modern equivalent. We plan to improve this in future iterations of the application.
The links to the downloadable files expire after 7 days.
The maximum file size is 4MB. This should be sufficient for texts up to the length of most long novels (e.g. Moby-Dick is 1.2MB; Bleak House is 1.9MB).
Project Gutenberg is a good source of plain-text files suitable for the OED Text Annotator.
Once you have found a text that you would like to annotate:
Use of the OED Text Annotator is subject to the Terms and Conditions.