Penn Historical Corpora

The Penn Parsed Corpora of Historical English

The Penn Parsed Corpora of Historical English consist of running texts and text samples of British English prose across its history, from the earliest Middle English documents up to the First World War. The series comprises three corpora:

the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2),
the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and
the Penn Parsed Corpus of Modern British English, second edition (PPCMBE2).

The texts come in three forms: simple text, part-of-speech tagged text, and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences, but also for abstract syntactic structures. All of the annotation has been carefully reviewed by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available to individuals, research groups, and libraries.
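To illustrate what searching for structure rather than for word strings can look like, here is a minimal Python sketch. It assumes that the parsed files use Penn Treebank-style labeled bracketing. The sample sentence, the labels in it, and the helper names (`parse_tree`, `find_dominating`, `words`) are invented for illustration; they are not taken from the corpora and are not part of CorpusSearch. Consult the annotation guidelines for the actual label inventory.

```python
import re

def parse_tree(text):
    """Parse one labeled bracketing into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok          # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                # consume the closing ")"
        return node

    return read()

def label(node):
    """Return the label of a tree node, or None for a bare word."""
    return node[0] if isinstance(node, list) else None

def subtrees(node):
    """Yield every labeled node in the tree, top-down."""
    if isinstance(node, list):
        yield node
        for child in node[1:]:
            yield from subtrees(child)

def words(node):
    """Return the words under a node, in order."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node[1:]:
        out.extend(words(child))
    return out

def find_dominating(tree, parent_label, child_label):
    """Find nodes labeled parent_label with an immediate child labeled child_label."""
    return [
        n for n in subtrees(tree)
        if label(n) == parent_label
        and any(label(c) == child_label for c in n[1:])
    ]

# Hypothetical sample tree; the labels are assumptions made for this sketch.
SAMPLE = "(IP-MAT (NP-SBJ (PRO he)) (VBD saw) (NP-OB1 (D the) (N king)))"

tree = parse_tree(SAMPLE)
pronoun_subjects = find_dominating(tree, "NP-SBJ", "PRO")
print([words(n) for n in pronoun_subjects])
```

A query of this kind (a subject noun phrase that immediately dominates a pronoun) cannot be expressed as a search over the plain text, because the relevant information is in the labels and the nesting, not in the words themselves.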

The 2016 release adds 2 million words to the Modern British English corpus, for a total of 3 million words, and includes a substantial number of corrections to the other corpora in the series. In addition, several small changes have been made to streamline the annotation guidelines.

As of July 2025, the 2016 release is superseded by PPCHE2, which again corrects annotation errors and inconsistencies and further streamlines the annotation guidelines. Unlike earlier releases, PPCHE2 contains only tagged and parsed versions of the texts. It is available from the Linguistic Data Consortium (LDC) at the University of Pennsylvania under catalog number LDC2025T09. The 2016 release remains available under catalog number LDC2020T16.

A lemmatized version of PPCEME and PPCMBE2 is also available. Its lemmas are based on the New English Dictionary (NED), the forerunner of the Oxford English Dictionary. The lemmatized version is not part of the LDC release, but as of September 2025, a patch is available on GitHub that generates the lemmatized files from the release version of the parsed files.

For a short time, PPCHE2 was available in its entirety on GitHub. This is no longer the case, because posting the texts infringed on LDC's prior distribution rights. Supporting material for PPCHE2 continues to be available on GitHub:

Philological information (PPCME2 | PPCEME | PPCMBE2)
Annotation guidelines (current). Also available on the web (current | 2016).
Lemmatization patch for PPCEME and PPCMBE2 and other lemmatization-related information
CorpusSearch (the program is unchanged from the SourceForge site, but the documentation has been updated for improved organization and searchability). Documentation also available on the web.

For questions concerning distribution, please contact LDC (ldc AT ldc DOT upenn DOT edu). For other issues, contact Beatrice Santorini (beatrice DOT santorini AT gmail DOT com). We especially welcome reports of annotation errors or inconsistencies, so that we can continue to improve the quality of the corpora.

Acknowledgments

The PPCME2 was created with the support of the National Science Foundation (Grants BNS 89-19701 and SBR 95-11368), with supplementary support from the University of Pennsylvania Research Foundation.
The PPCEME was created with the support of the National Endowment for the Humanities (Grant PA 23382-99) and the National Science Foundation (Grant BCS 99-05488).
The PPCMBE2 was created with the support of the National Science Foundation (Grants BCS 05-08731 and BCS 11-47499).

With respect to the above-listed grants, any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Endowment for the Humanities or the National Science Foundation.

Website update, Nov 5 2024

This website is an updated version of the website that was previously hosted by the University of Pennsylvania. By agreement with Beatrice Santorini, the website was moved to the University of Mannheim, where it will be hosted and maintained by Carola Trips until further notice.