nltk.corpus.reader.ycoe module

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. The corpus is distributed by the Oxford Text Archive (http://www.ota.ahds.ac.uk/) and is not included with NLTK.

The YCOE corpus is divided into 100 files, each representing an Old English prose text. The tags used within each text comply with the YCOE standard: https://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm

class nltk.corpus.reader.ycoe.YCOECorpusReader[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.

__init__(root, encoding='utf8')[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
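
The four cases described for the encoding parameter can be sketched in plain Python. The helper name `resolve_encoding` is hypothetical and is not part of NLTK. It only mirrors the rules listed above, and it assumes that "matches" means a match anchored at the start of the identifier.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name for file_id, or None for raw byte strings.

    Hypothetical helper that mirrors the documented rules; not NLTK's code.
    """
    if encoding is None or isinstance(encoding, str):
        # None: every file is read as bytes. str: one encoding for all files.
        return encoding
    if isinstance(encoding, dict):
        # Identifiers missing from the dictionary fall back to byte strings.
        return encoding.get(file_id)
    # Otherwise treat the value as a list of (regexp, encoding) tuples;
    # the first regexp that matches the identifier wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):  # assumption: match anchored at the start
            return name
    return None
```

Ordering matters in the list form: a catch-all pattern such as `.*` should come last, because the first matching tuple decides the result.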

documents(fileids=None)[source]

Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.

fileids(documents=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that store the given document(s) if specified.
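
The relationship between document identifiers and file identifiers can be illustrated with a minimal sketch. It assumes that each document is stored as a pair of files, one ending in `.pos` and one ending in `.psd`; this layout is an assumption here, not something stated above. The function names `fileids_for` and `documents_for` are hypothetical and do not exist in NLTK.

```python
def fileids_for(documents):
    """Expand document identifiers into their assumed .pos/.psd file names."""
    if isinstance(documents, str):
        documents = [documents]
    return sorted(
        f"{doc}.{ext}" for doc in set(documents) for ext in ("pos", "psd")
    )


def documents_for(fileids):
    """Collapse file names back to document identifiers by dropping the extension."""
    if isinstance(fileids, str):
        fileids = [fileids]
    # Only the final extension is removed, so 'coadrian.o34.pos' -> 'coadrian.o34'.
    return sorted({fid.rsplit(".", 1)[0] for fid in fileids})
```

Note that document identifiers such as `coadrian.o34` contain a dot themselves, which is why only the last extension is stripped.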

words(documents=None)[source]
sents(documents=None)[source]
paras(documents=None)[source]
tagged_words(documents=None)[source]
tagged_sents(documents=None)[source]
tagged_paras(documents=None)[source]
parsed_sents(documents=None)[source]
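
The access methods listed above all take an optional document identifier. The following sketch shows one way they might be combined; `describe_document` and `StubReader` are hypothetical names, and the stub holds invented tokens and tags so that the example runs without the corpus installed. It assumes that `tagged_words()` yields (word, tag) pairs, as the other NLTK tagged corpus readers do.

```python
def describe_document(reader, document):
    """Collect simple counts for one document from a YCOE-style reader.

    `reader` is expected to offer the words(), sents() and tagged_words()
    methods listed above; `document` is a document identifier.
    """
    words = list(reader.words(document))
    sents = list(reader.sents(document))
    tags = {tag for _word, tag in reader.tagged_words(document)}
    return {"words": len(words), "sents": len(sents), "distinct_tags": len(tags)}


class StubReader:
    """Hypothetical stand-in with invented data; not an NLTK class."""

    _tagged = [
        [("he", "PRO"), ("com", "VB")],
        [("he", "PRO"), ("for", "VB")],
    ]

    def words(self, documents=None):
        return [word for sent in self._tagged for word, _tag in sent]

    def sents(self, documents=None):
        return [[word for word, _tag in sent] for sent in self._tagged]

    def tagged_words(self, documents=None):
        return [pair for sent in self._tagged for pair in sent]


summary = describe_document(StubReader(), "example")
print(summary)
```

With the real corpus available, the stub would be replaced by `YCOECorpusReader(root)`, where `root` is the directory that holds the corpus files.
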
class nltk.corpus.reader.ycoe.YCOEParseCorpusReader[source]

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Specialized version of the standard bracket parse corpus reader that strips out (CODE …) and (ID …) nodes.
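
The effect of stripping these nodes can be sketched with a regular expression over a bracketed parse string. This is an illustration only, not necessarily how the reader does it: the function name `strip_code_and_id` is hypothetical, the sample string is invented, and the pattern assumes that (CODE …) and (ID …) nodes contain no nested parentheses.

```python
import re

# Matches a whole (CODE ...) or (ID ...) node, assuming no nested parentheses.
_CODE_OR_ID = re.compile(r"\((?:CODE|ID)\b[^)]*\)")


def strip_code_and_id(bracketed):
    """Remove (CODE ...) and (ID ...) nodes from a bracketed parse string."""
    return _CODE_OR_ID.sub("", bracketed)


# Invented sample in the bracketed style; not taken from the corpus.
example = "( (IP-MAT (NP (PRO he)) (VB com)) (ID example,1.1))"
print(strip_code_and_id(example))
```

The removed nodes leave their surrounding whitespace behind, which a bracket parser normally ignores.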

class nltk.corpus.reader.ycoe.YCOETaggedCorpusReader[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

__init__(root, items, encoding='utf8')[source]

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, '.*', '.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

nltk.corpus.reader.ycoe.documents = {'coadrian.o34': 'Adrian and Ritheus', 'coaelhom.o3': 'Ælfric, Supplemental Homilies', 'coaelive.o3': "Ælfric's Lives of Saints", 'coalcuin': 'Alcuin De virtutibus et vitiis', 'coalex.o23': "Alexander's Letter to Aristotle", 'coapollo.o3': 'Apollonius of Tyre', 'coaugust': 'Augustine', 'cobede.o2': "Bede's History of the English Church", 'cobenrul.o3': 'Benedictine Rule', 'coblick.o23': 'Blickling Homilies', 'coboeth.o2': "Boethius' Consolation of Philosophy", 'cobyrhtf.o3': "Byrhtferth's Manual", 'cocanedgD': 'Canons of Edgar (D)', 'cocanedgX': 'Canons of Edgar (X)', 'cocathom1.o3': "Ælfric's Catholic Homilies I", 'cocathom2.o3': "Ælfric's Catholic Homilies II", 'cochad.o24': 'Saint Chad', 'cochdrul': 'Chrodegang of Metz, Rule', 'cochristoph': 'Saint Christopher', 'cochronA.o23': 'Anglo-Saxon Chronicle A', 'cochronC': 'Anglo-Saxon Chronicle C', 'cochronD': 'Anglo-Saxon Chronicle D', 'cochronE.o34': 'Anglo-Saxon Chronicle E', 'cocura.o2': 'Cura Pastoralis', 'cocuraC': 'Cura Pastoralis (Cotton)', 'codicts.o34': 'Dicts of Cato', 'codocu1.o1': 'Documents 1 (O1)', 'codocu2.o12': 'Documents 2 (O1/O2)', 'codocu2.o2': 'Documents 2 (O2)', 'codocu3.o23': 'Documents 3 (O2/O3)', 'codocu3.o3': 'Documents 3 (O3)', 'codocu4.o24': 'Documents 4 (O2/O4)', 'coeluc1': 'Honorius of Autun, Elucidarium 1', 'coeluc2': 'Honorius of Autun, Elucidarium 1', 'coepigen.o3': "Ælfric's Epilogue to Genesis", 'coeuphr': 'Saint Euphrosyne', 'coeust': 'Saint Eustace and his companions', 'coexodusP': 'Exodus (P)', 'cogenesiC': 'Genesis (C)', 'cogregdC.o24': "Gregory's Dialogues (C)", 'cogregdH.o23': "Gregory's Dialogues (H)", 'coherbar': 'Pseudo-Apuleius, Herbarium', 'coinspolD.o34': "Wulfstan's Institute of Polity (D)", 'coinspolX': "Wulfstan's Institute of Polity (X)", 'cojames': 'Saint James', 'colacnu.o23': 'Lacnunga', 'colaece.o2': 'Leechdoms', 'colaw1cn.o3': 'Laws, Cnut I', 'colaw2cn.o3': 'Laws, Cnut II', 'colaw5atr.o3': 'Laws, Æthelred V', 'colaw6atr.o3': 
'Laws, Æthelred VI', 'colawaf.o2': 'Laws, Alfred', 'colawafint.o2': "Alfred's Introduction to Laws", 'colawger.o34': 'Laws, Gerefa', 'colawine.ox2': 'Laws, Ine', 'colawnorthu.o3': 'Northumbra Preosta Lagu', 'colawwllad.o4': 'Laws, William I, Lad', 'coleofri.o4': 'Leofric', 'colsigef.o3': "Ælfric's Letter to Sigefyrth", 'colsigewB': "Ælfric's Letter to Sigeweard (B)", 'colsigewZ.o34': "Ælfric's Letter to Sigeweard (Z)", 'colwgeat': "Ælfric's Letter to Wulfgeat", 'colwsigeT': "Ælfric's Letter to Wulfsige (T)", 'colwsigeXa.o34': "Ælfric's Letter to Wulfsige (Xa)", 'colwstan1.o3': "Ælfric's Letter to Wulfstan I", 'colwstan2.o3': "Ælfric's Letter to Wulfstan II", 'comargaC.o34': 'Saint Margaret (C)', 'comargaT': 'Saint Margaret (T)', 'comart1': 'Martyrology, I', 'comart2': 'Martyrology, II', 'comart3.o23': 'Martyrology, III', 'comarvel.o23': 'Marvels of the East', 'comary': 'Mary of Egypt', 'coneot': 'Saint Neot', 'conicodA': 'Gospel of Nicodemus (A)', 'conicodC': 'Gospel of Nicodemus (C)', 'conicodD': 'Gospel of Nicodemus (D)', 'conicodE': 'Gospel of Nicodemus (E)', 'coorosiu.o2': 'Orosius', 'cootest.o3': 'Heptateuch', 'coprefcath1.o3': "Ælfric's Preface to Catholic Homilies I", 'coprefcath2.o3': "Ælfric's Preface to Catholic Homilies II", 'coprefcura.o2': 'Preface to the Cura Pastoralis', 'coprefgen.o3': "Ælfric's Preface to Genesis", 'copreflives.o3': "Ælfric's Preface to Lives of Saints", 'coprefsolilo': "Preface to Augustine's Soliloquies", 'coquadru.o23': 'Pseudo-Apuleius, Medicina de quadrupedibus', 'corood': 'History of the Holy Rood-Tree', 'cosevensl': 'Seven Sleepers',
'cosolilo': "St. Augustine's Soliloquies", 'cosolsat1.o4': 'Solomon and Saturn I', 'cosolsat2': 'Solomon and Saturn II', 'cotempo.o3': "Ælfric's De Temporibus Anni", 'coverhom': 'Vercelli Homilies', 'coverhomE': 'Vercelli Homilies (E)', 'coverhomL': 'Vercelli Homilies (L)', 'covinceB': 'Saint Vincent (Bodley 343)', 'covinsal': 'Vindicta Salvatoris', 'cowsgosp.o3': 'West-Saxon Gospels', 'cowulf.o34': "Wulfstan's Homilies"}

A dictionary mapping each YCOE document identifier to the title of the text it contains.
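
Because this is an ordinary dict keyed by document identifier, it can be searched by title. The sketch below uses a four-entry excerpt copied from the mapping so that it runs without NLTK installed; the names `documents_excerpt` and `find_documents` are hypothetical. The same function should work on the full mapping, which is exposed as `nltk.corpus.reader.ycoe.documents`.

```python
# Excerpt copied from the mapping above; the full dict has many more entries.
documents_excerpt = {
    "coadrian.o34": "Adrian and Ritheus",
    "coaelhom.o3": "Ælfric, Supplemental Homilies",
    "coaelive.o3": "Ælfric's Lives of Saints",
    "cobede.o2": "Bede's History of the English Church",
}


def find_documents(titles_by_id, text):
    """Return the identifiers whose title contains `text`, ignoring case."""
    needle = text.casefold()
    return sorted(
        doc for doc, title in titles_by_id.items() if needle in title.casefold()
    )


aelfric_docs = find_documents(documents_excerpt, "ælfric")
print(aelfric_docs)
```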