Corpus annotation
Annotation manual for the Penn Historical Corpora and the York-Helsinki Corpus of Early English Correspondence
Beatrice Santorini
(April 2016)
This annotation manual is a revised version of the manual written in connection with the first release of the Penn-Helsinki Parsed Corpus of Early Modern English (Kroch, Santorini, and Delfs 2004). It is heavily indebted to the annotation guidelines developed by Ann Taylor and Tony Kroch for the second edition of the Penn-Helsinki Parsed Corpus of Middle English (Kroch and Taylor 2000) as well as to the guidelines developed for the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993). The current version corrects typos and broken links and adds some examples and clarifications; the substance of the guidelines remains unchanged. The guidelines apply to the following corpora:
- the Penn-Helsinki Parsed Corpus of Middle English, 2nd edition (PPCME2)
- the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME)
- the Penn Parsed Corpus of Modern British English, 2nd edition (PPCMBE2)
- the York-Helsinki Parsed Corpus of Early English Correspondence (PCEEC)
There are slight annotation differences among the above-named corpora (notably, between the PPCME2 and the later corpora).
Acknowledgments
I would like to thank the following institutions and individuals for their support and assistance:
- The National Endowment for the Humanities for financial support under NEH Grant PA 23382-99.
- The National Science Foundation for financial support under NSF Grant BCS 99-05488.
- The National Science Foundation for financial support under NSF Grants BCS 05-08731 and BCS 11-47499.
- The users of the Penn Historical Corpora for their financial support in purchasing the corpora.
- Tony Kroch and Ann Taylor for many helpful discussions concerning the original guidelines for the PPCME2 and their adaptation to modern English.
References
- Kroch, Anthony, and Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, release 4.
- Kroch, Anthony, Beatrice Santorini, and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). Department of Linguistics, University of Pennsylvania. CD-ROM, first edition, release 3.
- Marcus, Mitchell, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics 19, 313-330. Reprinted in Susan Armstrong, ed., 1994, Using large corpora. Cambridge, MA: MIT Press. 273-290.