GerManC

Title

GerManC

Author

Martin Durrell; Paul Bennett; Silke Scheible; Richard J. Whitt

Availability

Distributed by the University of Oxford under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This is a very liberal license that grants certain rights for non-commercial use, especially your right to use GerManC for your own research, but also reserves certain rights for the original creators of GerManC.

Download: zip

Languages

German

Editorial Practice

Encoding format: TEI Lite P5 XML; GATE XML; GATE column format; plain text

OTA keywords

Linguistic corpora
Corpus

LC keywords

Linguistics
Linguistic analysis (Linguistics)

Extent
  • designation: CollectionText
  • size: 1352 files: ca. 1.3 Gb

Creation Date

The corpus was constructed between 2008 and 2011.

Source Description

Expanded and revised version of http://ota.ox.ac.uk/id/2537

Various: see documentation in the download package.

Following the model of the ARCHER corpus and given the aim of representativeness, the GerManC corpus consists of text samples of about 2000 words from eight genres: drama, newspapers, sermons and personal letters (to represent orally oriented registers) and narrative prose (fiction or non-fiction), scholarly (i.e. humanities), scientific and legal texts (to represent more print-oriented registers). To facilitate tracing historical developments, the whole period was divided into three fifty-year sub-periods (1650-1700, 1700-1750 and 1750-1800), and an equal number of texts from each genre was selected for each sub-period.

The complete corpus thus consists of 360 samples, comprising approximately 800,000 words. Appendix 1 in the download package contains a list of the files in the corpus, with full documentation in an Excel spreadsheet.
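The sampling design described above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of the GerManC distribution: the variable names are hypothetical, and it assumes that the 360 samples are spread evenly over the eight genres and three sub-periods, as the description states. How the samples within each genre and sub-period are further subdivided is not covered here; see the documentation in the download package.

```python
from itertools import product

# Genres and sub-periods as listed in the source description.
GENRES = [
    "drama", "newspapers", "sermons", "personal letters",
    "narrative prose", "scholarly", "scientific", "legal",
]
PERIODS = ["1650-1700", "1700-1750", "1750-1800"]

TOTAL_SAMPLES = 360           # stated in the description
APPROX_TOTAL_WORDS = 800_000  # stated as approximate

# Every genre/sub-period combination is one cell of the design grid.
cells = list(product(GENRES, PERIODS))

# Assumption: samples are distributed evenly over the cells.
samples_per_cell, remainder = divmod(TOTAL_SAMPLES, len(cells))

# Rough mean sample length implied by the approximate word total.
mean_words = APPROX_TOTAL_WORDS / TOTAL_SAMPLES

print(f"{len(cells)} genre/period cells")
print(f"{samples_per_cell} samples per cell (remainder {remainder})")
print(f"about {mean_words:.0f} words per sample on average")
```

Under these assumptions each genre contributes the same number of samples to each sub-period, and the implied mean sample length is somewhat above the nominal 2000 words. Because the word total is only approximate, that mean is a rough figure.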