nltk.corpus.reader package

Submodules

Module contents

NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can be used to read both the corpus files that are distributed in the NLTK corpus package, and corpus files that are part of external corpora.

Corpus Reader Functions

Each corpus module defines one or more “corpus reader functions”, which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:

  • If item is one of the unique identifiers listed in the corpus module’s items variable, then the corresponding document will be loaded from the NLTK corpus package.

  • If item is a fileid, then that file will be read.

Additionally, corpus reader functions can be given lists of item names; in which case, they will return a concatenation of the corresponding documents.

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

  • words(): list of str

  • sents(): list of (list of str)

  • paras(): list of (list of (list of str))

  • tagged_words(): list of (str,str) tuple

  • tagged_sents(): list of (list of (str,str))

  • tagged_paras(): list of (list of (list of (str,str)))

  • chunked_sents(): list of (Tree w/ (str,str) leaves)

  • parsed_sents(): list of (Tree with str leaves)

  • parsed_paras(): list of (list of (Tree with str leaves))

  • xml(): A single xml ElementTree

  • raw(): unprocessed corpus contents

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()[:6])) # only first 6 words
The, Fulton, County, Grand, Jury, said
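
The other reader functions follow the same pattern. As a short sketch of the sentence and tagged-word views (this assumes the Brown Corpus data has been downloaded, e.g. with nltk.download('brown')):

>>> from nltk.corpus import brown
>>> brown.sents()[0][:6]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said']
>>> brown.tagged_words()[:2]
[('The', 'AT'), ('Fulton', 'NP-TL')]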


class nltk.corpus.reader.AlignedCorpusReader[source]

Bases: CorpusReader

Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.

__init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), alignedsent_block_reader=<function read_alignedsent_block>, encoding='latin1')[source]

Construct a new Aligned Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = AlignedCorpusReader(root, '.*', '.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

aligned_sents(fileids=None)[source]
Returns

the given file(s) as a list of AlignedSent objects.

Return type

list(AlignedSent)

sents(fileids=None)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

words(fileids=None)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)
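
As a concrete sketch of this reader in use: NLTK's comtrans sample corpus is distributed in this word-aligned format (this assumes the data has been downloaded, e.g. with nltk.download('comtrans'); the exact contents depend on which file is listed first):

>>> from nltk.corpus import comtrans
>>> als = comtrans.aligned_sents()[0]      # first AlignedSent in the corpus
>>> source, target = als.words, als.mots   # source-language and target-language tokens
>>> pairs = als.alignment                  # Alignment of (source, target) index pairs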

class nltk.corpus.reader.AlpinoCorpusReader[source]

Bases: BracketParseCorpusReader

Reader for the Alpino Dutch Treebank. This corpus has a lexical breakdown structure embedded, as read by _parse. Unfortunately this puts punctuation and some other words out of sentence order in the XML element tree, which is a problem for _tag and _word. These methods are therefore overridden to pass a non-default ‘ordered’ parameter to the overridden _normalize function; the _parse function can then remain untouched.

__init__(root, encoding='ISO-8859-1', tagset=None)[source]
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • comment_char – The character which can appear at the start of a line to indicate that the rest of the line is a comment.

  • detect_blocks – The method that is used to find blocks in the corpus; can be ‘unindented_paren’ (every unindented parenthesis starts a new parse) or ‘sexpr’ (brackets are matched).

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
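
A minimal usage sketch (assuming the alpino sample corpus has been downloaded, e.g. with nltk.download('alpino')):

>>> from nltk.corpus import alpino
>>> first_tagged = alpino.tagged_sents()[0]   # list of (word, tag) tuples, in sentence order
>>> all_words = alpino.words()                # flat list of word strings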

class nltk.corpus.reader.BCP47CorpusReader[source]

Bases: CorpusReader

Parse BCP-47 composite language tags

Supports all the main subtags, and the ‘u-sd’ extension:

>>> from nltk.corpus import bcp47
>>> bcp47.name('oc-gascon-u-sd-fr64')
'Occitan (post 1500): Gascon: Pyrénées-Atlantiques'

Can load a conversion table to Wikidata Q-codes:

>>> bcp47.load_wiki_q()
>>> bcp47.wiki_q['en-GI-spanglis']
'Q79388'

__init__(root, fileids)[source]

Read the BCP-47 database

data_dict(records)[source]

Convert the BCP-47 language subtag registry to a dictionary

lang2str(lg_record)[source]

Concatenate subtag values

load_wiki_q()[source]

Load conversion table to Wikidata Q-codes (only if needed)

morphology()[source]
name(tag)[source]

Convert a BCP-47 tag to a colon-separated string of subtag names

>>> from nltk.corpus import bcp47
>>> bcp47.name('ca-Latn-ES-valencia')
'Catalan: Latin: Spain: Valencian'
parse_tag(tag)[source]

Convert a BCP-47 tag to a dictionary of labelled subtags

subdiv_dict(subdivs)[source]

Convert the CLDR subdivisions list to a dictionary

val2str(val)[source]

Return only first value

wiki_dict(lines)[source]

Convert Wikidata list of Q-codes to a BCP-47 dictionary

class nltk.corpus.reader.BNCCorpusReader[source]

Bases: XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

You can obtain the full version of the BNC corpus at https://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
__init__(root, fileids, lazy=True)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

sents(fileids=None, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

Parameters
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.

tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

Parameters
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.

tagged_words(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

Parameters
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.

words(fileids=None, strip_space=True, stem=False)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

Parameters
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • stem – If true, then use word stems instead of word strings.
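
Putting the pieces above together, a sketch of typical usage (the root path and extraction layout are assumptions; the BNC itself is not distributed with NLTK):

>>> from nltk.corpus.reader import BNCCorpusReader
>>> bnc = BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
>>> simple_tags = bnc.tagged_words()[:5]        # simplified tagset
>>> c5_tags = bnc.tagged_words(c5=True)[:5]     # detailed C5 tags
>>> stems = bnc.words(stem=True)[:5]            # stems instead of surface forms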

class nltk.corpus.reader.BracketParseCorpusReader[source]

Bases: SyntaxCorpusReader

Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.

__init__(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • comment_char – The character which can appear at the start of a line to indicate that the rest of the line is a comment.

  • detect_blocks – The method that is used to find blocks in the corpus; can be ‘unindented_paren’ (every unindented parenthesis starts a new parse) or ‘sexpr’ (brackets are matched).

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
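
For example, the Penn Treebank sample distributed with NLTK is stored in this bracketed format (a sketch assuming nltk.download('treebank') has been run):

>>> from nltk.corpus import treebank
>>> t = treebank.parsed_sents('wsj_0001.mrg')[0]   # an nltk.Tree
>>> t.leaves()[:3]
['Pierre', 'Vinken', ',']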

class nltk.corpus.reader.CHILDESCorpusReader[source]

Bases: XMLCorpusReader

Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at https://childes.talkbank.org/. The XML version of CHILDES is located at https://childes.talkbank.org/data-xml/. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/).

For access to the file text use the usual nltk functions, words(), sents(), tagged_words() and tagged_sents().
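
A sketch of setting up the reader by hand, following the layout described above (the Eng-USA path is only an example; point corpus_root at whichever part of the XML corpus you copied):

>>> import nltk
>>> from nltk.corpus.reader import CHILDESCorpusReader
>>> corpus_root = nltk.data.find('corpora/childes/data-xml/Eng-USA/')
>>> childes = CHILDESCorpusReader(corpus_root, r'.*\.xml')
>>> child_words = childes.words(speaker='CHI')   # tokens produced by the target child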

MLU(fileids=None, speaker='CHI')[source]
Returns

the mean length of utterance (MLU) of the given speaker in each of the given file(s)

Return type

list(float)

__init__(root, fileids, lazy=True)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

age(fileids=None, speaker='CHI', month=False)[source]
Returns

the age of the given speaker in each of the given file(s), as a string or an int

Return type

list or int

Parameters

month – If true, return months instead of year-month-date

childes_url_base = 'https://childes.talkbank.org/browser/index.php?url='
convert_age(age_year)[source]

Calculate age in months from a string in CHILDES format

corpus(fileids=None)[source]
Returns

the given file(s) as a dict of (corpus_property_key, value)

Return type

list(dict)

participants(fileids=None)[source]
Returns

the given file(s) as a dict of (participant_property_key, value)

Return type

list(dict)

sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

Parameters
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)

  • stem – If true, then use word stems instead of word strings.

  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)

tagged_sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

Parameters
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)

  • stem – If true, then use word stems instead of word strings.

  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)

tagged_words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

Parameters
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)

  • stem – If true, then use word stems instead of word strings.

  • relation – If true, then return tuples of (stem, index, dependent_index)

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)

webview_file(fileid, urlbase=None)[source]

Map a corpus file to its web version on the CHILDES website, and open it in a web browser.

The complete URL to be used is:

childes.childes_url_base + urlbase + fileid.replace(‘.xml’, ‘.cha’)

If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???

The function first looks (as a special case) if “Eng-USA” is on the path consisting of <corpus root>+fileid; then if “childes”, possibly followed by “data-xml”, appears. If neither one is found, we use the unmodified fileid and hope for the best. If this is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase=’Eng-USA/Cornell’.

words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns

the given file(s) as a list of words

Return type

list(str)

Parameters
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)

  • stem – If true, then use word stems instead of word strings.

  • relation – If true, then return tuples of (stem, index, dependent_index)

  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.

  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)

class nltk.corpus.reader.CMUDictCorpusReader[source]

Bases: CorpusReader

dict()[source]
Returns

the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.

entries()[source]
Returns

the cmudict lexicon as a list of entries containing (word, transcriptions) tuples.

words()[source]
Returns

a list of all words defined in the cmudict lexicon.
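
A short sketch (assuming the cmudict data package has been downloaded, e.g. with nltk.download('cmudict')):

>>> from nltk.corpus import cmudict
>>> prondict = cmudict.dict()
>>> prondict['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]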

class nltk.corpus.reader.CategorizedBracketParseCorpusReader[source]

Bases: CategorizedCorpusReader, BracketParseCorpusReader

A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>

__init__(*args, **kwargs)[source]

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the BracketParseCorpusReader constructor.

parsed_paras(fileids=None, categories=None)[source]
parsed_sents(fileids=None, categories=None)[source]
parsed_words(fileids=None, categories=None)[source]
tagged_paras(fileids=None, categories=None, tagset=None)[source]
tagged_sents(fileids=None, categories=None, tagset=None)[source]
tagged_words(fileids=None, categories=None, tagset=None)[source]
class nltk.corpus.reader.CategorizedCorpusReader[source]

Bases: object

A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.

Subclasses are expected to:

  • Call __init__() to set up the mapping.

  • Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.

__init__(kwargs)[source]

Initialize this mapping based on keyword arguments, as follows:

  • cat_pattern: A regular expression pattern used to find the category for each file identifier. The pattern will be applied to each file identifier, and the first matching group will be used as the category label for that file.

  • cat_map: A dictionary, mapping from file identifiers to category labels.

  • cat_file: The name of a file that contains the mapping from file identifiers to categories. The argument cat_delimiter can be used to specify a delimiter.

The corresponding argument will be deleted from kwargs. If more than one argument is specified, an exception will be raised.

categories(fileids=None)[source]

Return a list of the categories that are defined for this corpus, or for the given file(s) if specified.

fileids(categories=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that make up the given category(s) if specified.

paras(fileids=None, categories=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None)[source]
words(fileids=None, categories=None)[source]
class nltk.corpus.reader.CategorizedPlaintextCorpusReader[source]

Bases: CategorizedCorpusReader, PlaintextCorpusReader

A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.

__init__(*args, **kwargs)[source]

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the PlaintextCorpusReader constructor.
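
NLTK's movie_reviews corpus is an instance of this reader; for a corpus of your own, one of the categorization keywords described above (here cat_pattern) is supplied at construction time. A sketch (the 'my_reviews/' directory layout is hypothetical):

>>> from nltk.corpus import movie_reviews
>>> movie_reviews.categories()
['neg', 'pos']
>>> neg_ids = movie_reviews.fileids('neg')   # fileids in the 'neg' category
>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('my_reviews/', r'(pos|neg)/.*\.txt',
...                                           cat_pattern=r'(pos|neg)/.*')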

class nltk.corpus.reader.CategorizedSentencesCorpusReader[source]

Bases: CategorizedCorpusReader, CorpusReader

A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences rather than all rows.

Examples using the Subjectivity Dataset:

>>> from nltk.corpus import subjectivity
>>> subjectivity.sents()[23] 
['television', 'made', 'him', 'famous', ',', 'but', 'his', 'biggest', 'hits',
'happened', 'off', 'screen', '.']
>>> subjectivity.categories()
['obj', 'subj']
>>> subjectivity.words(categories='subj')
['smart', 'and', 'alert', ',', 'thirteen', ...]

Examples using the Sentence Polarity Dataset:

>>> from nltk.corpus import sentence_polarity
>>> sentence_polarity.sents() 
[['simplistic', ',', 'silly', 'and', 'tedious', '.'], ["it's", 'so', 'laddish',
'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find',
'it', 'funny', '.'], ...]
>>> sentence_polarity.categories()
['neg', 'pos']
CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8', **kwargs)[source]
Parameters
  • root – The root directory for the corpus.

  • fileids – a list or regexp specifying the fileids in the corpus.

  • word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer

  • sent_tokenizer – a tokenizer for breaking paragraphs into sentences.

  • encoding – the encoding that should be used to read the corpus.

  • kwargs – additional parameters passed to CategorizedCorpusReader.

sents(fileids=None, categories=None)[source]

Return all sentences in the corpus or in the specified file(s).

Parameters
  • fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.

  • categories – a list specifying the categories whose sentences have to be returned.

Returns

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type

list(list(str))

words(fileids=None, categories=None)[source]

Return all words and punctuation symbols in the corpus or in the specified file(s).

Parameters
  • fileids – a list or regexp specifying the ids of the files whose words have to be returned.

  • categories – a list specifying the categories whose words have to be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.CategorizedTaggedCorpusReader[source]

Bases: CategorizedCorpusReader, TaggedCorpusReader

A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.

__init__(*args, **kwargs)[source]

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the TaggedCorpusReader constructor.

tagged_paras(fileids=None, categories=None, tagset=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Return type

list(list(list(tuple(str,str))))

tagged_sents(fileids=None, categories=None, tagset=None)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

tagged_words(fileids=None, categories=None, tagset=None)[source]
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

class nltk.corpus.reader.ChasenCorpusReader[source]

Bases: CorpusReader

__init__(root, fileids, encoding='utf8', sent_splitter=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

paras(fileids=None)[source]
sents(fileids=None)[source]
tagged_paras(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.ChunkedCorpusReader[source]

Bases: CorpusReader

Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.
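
For example, the tagged-and-chunked section of the Penn Treebank sample (treebank_chunk) is read by NLTK with a reader of this kind (a sketch assuming nltk.download('treebank') has been run):

>>> from nltk.corpus import treebank_chunk
>>> tree = treebank_chunk.chunked_sents()[0]   # a shallow Tree of NP chunks
>>> tree.leaves()[:3]
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]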

__init__(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

chunked_paras(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).

Return type

list(list(Tree))

chunked_sents(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).

Return type

list(Tree)

chunked_words(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.

Return type

list(tuple(str,str) and Tree)

paras(fileids=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

sents(fileids=None)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

tagged_paras(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Return type

list(list(list(tuple(str,str))))

tagged_sents(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

tagged_words(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

words(fileids=None)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.ComparativeSentencesCorpusReader[source]

Bases: CorpusReader

Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).

>>> from nltk.corpus import comparative_sentences
>>> comparison = comparative_sentences.comparisons()[0]
>>> comparison.text 
['its', 'fast-forward', 'and', 'rewind', 'work', 'much', 'more', 'smoothly',
'and', 'consistently', 'than', 'those', 'of', 'other', 'models', 'i', "'ve",
'had', '.']
>>> comparison.entity_2
'models'
>>> (comparison.feature, comparison.keyword)
('rewind', 'more')
>>> len(comparative_sentences.comparisons())
853
CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), sent_tokenizer=None, encoding='utf8')[source]
Parameters
  • root – The root directory for this corpus.

  • fileids – a list or regexp specifying the fileids in this corpus.

  • word_tokenizer – tokenizer for breaking sentences or paragraphs into words. Default: WhitespaceTokenizer

  • sent_tokenizer – tokenizer for breaking paragraphs into sentences.

  • encoding – the encoding that should be used to read the corpus.

comparisons(fileids=None)[source]

Return all comparisons in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose comparisons have to be returned.

Returns

the given file(s) as a list of Comparison objects.

Return type

list(Comparison)

keywords(fileids=None)[source]

Return a set of all keywords used in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose keywords have to be returned.

Returns

the set of keywords and comparative phrases used in the corpus.

Return type

set(str)

keywords_readme()[source]

Return the list of words and constituents considered as clues of a comparison (from listOfkeywords.txt).

sents(fileids=None)[source]

Return all sentences in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.

Returns

all sentences of the corpus as lists of tokens (or as plain strings, if no word tokenizer is specified).

Return type

list(list(str)) or list(str)

words(fileids=None)[source]

Return all words and punctuation symbols in the corpus.

Parameters

fileids – a list or regexp specifying the ids of the files whose words have to be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.ConllChunkCorpusReader[source]

Bases: ConllCorpusReader

A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.

__init__(root, fileids, chunk_types, encoding='utf8', tagset=None, separator=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

class nltk.corpus.reader.ConllCorpusReader[source]

Bases: CorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus. By default, columns are split on consecutive whitespace; with the separator argument you can specify a string to split on instead (e.g. ' ').
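
A sketch of reading your own CoNLL-formatted data with an explicit column layout (the root directory, file pattern, and columns below are assumptions about your data):

>>> from nltk.corpus.reader import ConllCorpusReader
>>> reader = ConllCorpusReader('my_conll_corpus/', r'.*\.conll',
...                            columntypes=('words', 'pos', 'chunk'))
>>> tagged = reader.tagged_sents()       # [[(word, pos), ...], ...]
>>> chunked = reader.chunked_sents()     # a shallow chunk Tree per sentence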

@todo: Add support for reading from corpora where different parallel files contain different columns.

@todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (eg words() and parsed_sents() could both share the same grid corpus view object).

@todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (eg parsed_documents()).

CHUNK = 'chunk'

column type for chunk structures

COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')

A list of all column types supported by the conll corpus reader.

IGNORE = 'ignore'

column type for column that should be ignored

NE = 'ne'

column type for named entities

POS = 'pos'

column type for part-of-speech tags

SRL = 'srl'

column type for semantic role labels

TREE = 'tree'

column type for parse trees

WORDS = 'words'

column type for words

__init__(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.tree.Tree'>, tagset=None, separator=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

chunked_sents(fileids=None, chunk_types=None, tagset=None)[source]
chunked_words(fileids=None, chunk_types=None, tagset=None)[source]
iob_sents(fileids=None, tagset=None)[source]
Returns

a list of lists of word/tag/IOB tuples

Return type

list(list)

Parameters

fileids (None or str or list) – the list of fileids that make up this corpus

iob_words(fileids=None, tagset=None)[source]
Returns

a list of word/tag/IOB tuples

Return type

list(tuple)

Parameters

fileids (None or str or list) – the list of fileids that make up this corpus

parsed_sents(fileids=None, pos_in_tree=None, tagset=None)[source]
sents(fileids=None)[source]
srl_instances(fileids=None, pos_in_tree=None, flatten=True)[source]
srl_spans(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.CorpusReader[source]

Bases: object

A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its file identifier, which is the relative path to the file from the root directory.

A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as words() (for a list of words) and parsed_sents() (for a list of parsed sentences). Called with no arguments, these methods will return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such as fileids or categories, which can be used to select which portion of the corpus should be returned.
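
For instance, the gutenberg corpus (a plaintext reader that inherits these methods) can be queried as follows, assuming the data has been downloaded with nltk.download('gutenberg'):

>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()[:2]
['austen-emma.txt', 'austen-persuasion.txt']
>>> emma_path = gutenberg.abspath('austen-emma.txt')   # a PathPointer
>>> emma_raw = gutenberg.raw('austen-emma.txt')        # the whole file as one string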

__init__(root, fileids, encoding='utf8', tagset=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

abspath(fileid)[source]

Return the absolute path for the given file.

Parameters

fileid (str) – The file identifier for the file whose path should be returned.

Return type

PathPointer

abspaths(fileids=None, include_encoding=False, include_fileid=False)[source]

Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.

Parameters
  • fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.

  • include_encoding – If true, then return a list of (path_pointer, encoding) tuples.

Return type

list(PathPointer)

citation()[source]

Return the contents of the corpus citation.bib file, if it exists.

encoding(file)[source]

Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.

ensure_loaded()[source]

Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).

fileids()[source]

Return a list of file identifiers for the fileids that make up this corpus.

license()[source]

Return the contents of the corpus LICENSE file, if it exists.

open(file)[source]

Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.

Parameters

file – The file identifier of the file to read.

raw(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a single string.

Return type

str

readme()[source]

Return the contents of the corpus README file, if it exists.

property root

The directory where this corpus is stored.

Type

PathPointer

class nltk.corpus.reader.CrubadanCorpusReader[source]

Bases: CorpusReader

A corpus reader used to access An Crubadan language n-gram files.

__init__(root, fileids, encoding='utf8', tagset=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

crubadan_to_iso(lang)[source]

Return ISO 639-3 code given internal Crubadan code

iso_to_crubadan(lang)[source]

Return internal Crubadan code based on ISO 639-3 code

lang_freq(lang)[source]

Return n-gram FreqDist for a specific language given ISO 639-3 language code

langs()[source]

Return a list of supported languages as ISO 639-3 codes
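
A minimal sketch (assuming the crubadan data package has been downloaded, e.g. with nltk.download('crubadan')):

>>> from nltk.corpus import crubadan
>>> codes = crubadan.langs()                 # supported ISO 639-3 codes
>>> ngrams = crubadan.lang_freq(codes[0])    # n-gram FreqDist for the first language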

class nltk.corpus.reader.DependencyCorpusReader[source]

Bases: SyntaxCorpusReader

__init__(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

parsed_sents(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.EuroparlCorpusReader[source]

Bases: PlaintextCorpusReader

Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from PlaintextCorpusReader except that:

  • Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.

  • For the same reason, the sentence tokenizer should just split the paragraph at line breaks.

  • There is a new ‘chapters()’ method that returns chapters instead of paragraphs, as sketched below.

  • The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
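
A sketch of the chapters() view (assuming one of the europarl_raw language packages has been downloaded, e.g. with nltk.download('europarl_raw')):

>>> from nltk.corpus.europarl_raw import english
>>> chapter = english.chapters()[0]   # a chapter: a list of sentences
>>> sentence = chapter[0]             # a sentence: a list of word strings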

chapters(fileids=None)[source]
Returns

the given file(s) as a list of chapters, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

paras(fileids=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

class nltk.corpus.reader.FramenetCorpusReader[source]

Bases: XMLCorpusReader

A corpus reader for the Framenet Corpus.

>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
__init__(root, fileids)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

annotations(luNamePattern=None, exemplars=True, full_text=True)[source]

Frame annotation sets matching the specified criteria.

buildindexes()[source]

Build the internal indexes to make look-ups faster.

doc(fn_docid)[source]

Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the docs_metadata() function.

The dict that is returned from this function will contain the following keys:

  • ‘_type’ : ‘fulltextannotation’

  • ‘sentence’ : a list of sentences in the document
    • Each item in the list is a dict containing the following keys:
      • ‘ID’ : the ID number of the sentence

      • ‘_type’ : ‘sentence’

      • ‘text’ : the text of the sentence

      • ‘paragNo’ : the paragraph number

      • ‘sentNo’ : the sentence number

      • ‘docID’ : the document ID number

      • ‘corpID’ : the corpus ID number

      • ‘aPos’ : the annotation position

      • ‘annotationSet’ : a list of annotation layers for the sentence
        • Each item in the list is a dict containing the following keys:
          • ‘ID’ : the ID number of the annotation set

          • ‘_type’ : ‘annotationset’

          • ‘status’ : either ‘MANUAL’ or ‘UNANN’

          • ‘luName’ : (only if status is ‘MANUAL’)

          • ‘luID’ : (only if status is ‘MANUAL’)

          • ‘frameID’ : (only if status is ‘MANUAL’)

          • ‘frameName’: (only if status is ‘MANUAL’)

          • ‘layer’ : a list of labels for the layer
            • Each item in the layer is a dict containing the following keys:
              • ‘_type’: ‘layer’

              • ‘rank’

              • ‘name’

              • ‘label’ : a list of labels in the layer
                • Each item is a dict containing the following keys:
                  • ‘start’

                  • ‘end’

                  • ‘name’

                  • ‘feID’ (optional)

Parameters

fn_docid (int) – The Framenet id number of the document

Returns

Information about the annotated document

Return type

dict

docs(name=None)[source]

Return a list of the annotated full-text documents in FrameNet, optionally filtered by a regex to be matched against the document name.

docs_metadata(name=None)[source]

Return an index of the annotated documents in Framenet.

Details for a specific annotated document can be obtained by passing the value of its ‘ID’ field to this class’s doc() function.

>>> from nltk.corpus import framenet as fn
>>> len(fn.docs()) in (78, 107) # FN 1.5 and 1.7, resp.
True
>>> set([x.corpname for x in fn.docs_metadata()])>=set(['ANC', 'KBEval',
...     'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank'])
True
Parameters

name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”.

Returns

A list of selected (or all) annotated documents

Return type

list of dicts, where each dict object contains the following keys:

  • ’name’

  • ’ID’

  • ’corpid’

  • ’corpname’

  • ’description’

  • ’filename’

exemplars(luNamePattern=None, frame=None, fe=None, fe2=None)[source]

Lexicographic exemplar sentences, optionally filtered by LU name and/or 1-2 FEs that are realized overtly. ‘frame’ may be a name pattern, frame ID, or frame instance. ‘fe’ may be a name pattern or FE instance; if specified, ‘fe2’ may also be specified to retrieve sentences with both overt FEs (in either order).
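
A short sketch (reusing the glint.v lexical unit from the examples above; assumes the FrameNet data package is installed):

>>> from nltk.corpus import framenet as fn
>>> exs = fn.exemplars('glint.v')   # exemplar sentences annotated for the LU glint.v
>>> n_exemplars = len(exs)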

fe_relations()[source]

Obtain a list of frame element relations.

>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels) in (10020, 12393)   # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(ferels[0], breakLines=True) 
{'ID': 14642,
'_type': 'ferelation',
'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>,
'subFE': <fe ID=11370 name=Degree>,
'subFEName': 'Degree',
'subFrame': <frame ID=1904 name=Lively_place>,
'subID': 11370,
'supID': 2271,
'superFE': <fe ID=2271 name=Degree>,
'superFEName': 'Degree',
'superFrame': <frame ID=262 name=Abounding_with>,
'type': <framerelationtype ID=1 name=Inheritance>}
Returns

A list of all of the frame element relations in framenet

Return type

list(dict)

fes(name=None, frame=None)[source]

Lists frame element objects. If ‘name’ is provided, this is treated as a case-insensitive regular expression to filter by frame element name. (Case-insensitivity is because casing of frame element names is not always consistent across frames.) Specify ‘frame’ to filter by a frame name pattern, ID, or object.

>>> from nltk.corpus import framenet as fn
>>> fn.fes('Noise_maker')
[<fe ID=6043 name=Noise_maker>]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound')]) 
[('Cause_to_make_noise', 'Sound_maker'), ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source'), ('Sound_movement', 'Location_of_sound_source'),
 ('Sound_movement', 'Sound'), ('Sound_movement', 'Sound_source'),
 ('Sounds', 'Component_sound'), ('Sounds', 'Location_of_sound_source'),
 ('Sounds', 'Sound_source'), ('Vocalizations', 'Location_of_sound_source'),
 ('Vocalizations', 'Sound_source')]
>>> sorted([(fe.frame.name,fe.name) for fe in fn.fes('sound',r'(?i)make_noise')]) 
[('Cause_to_make_noise', 'Sound_maker'),
 ('Make_noise', 'Sound'),
 ('Make_noise', 'Sound_source')]
>>> sorted(set(fe.name for fe in fn.fes('^sound')))
['Sound', 'Sound_maker', 'Sound_source']
>>> len(fn.fes('^sound$'))
2
Parameters

name (str) – A regular expression pattern used to match against frame element names. If ‘name’ is None, then a list of all frame elements will be returned.

Returns

A list of matching frame elements

Return type

list(AttrDict)

frame(fn_fid_or_fname, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s name or id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation') 
frame (1494): Imposing_obligation...

The dict that is returned from this function will contain the following information about the Frame:

  • ‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)

  • ‘definition’ : textual definition of the Frame

  • ‘ID’ : the internal ID number of the Frame

  • ‘semTypes’ : a list of semantic types for this frame
    • Each item in the list is a dict containing the following keys:
      • ‘name’ : can be used with the semtype() function

      • ‘ID’ : can be used with the semtype() function

  • ‘lexUnit’ : a dict containing all of the LUs for this frame.

    The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)

  • ‘FE’ : a dict containing the Frame Elements that are part of this frame

    The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys

    • ‘definition’ : The definition of the FE

    • ‘name’ : The name of the FE e.g. ‘Body_system’

    • ‘ID’ : The id number

    • ‘_type’ : ‘fe’

    • ‘abbrev’ : Abbreviation e.g. ‘bod’

    • ‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”

    • ‘semType’ : if not None, a dict with the following two keys:
      • ‘name’ : name of the semantic type. can be used with

        the semtype() function

      • ‘ID’ : id number of the semantic type. can be used with

        the semtype() function

    • ‘requiresFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame

      • ‘ID’ : the id of the other FE in this frame

    • ‘excludesFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame

      • ‘ID’ : the id of the other FE in this frame

  • ‘frameRelation’ : a list of objects describing frame relations

  • ‘FEcoreSets’ : a list of Frame Element core sets for this frame
    • Each item in the list is a list of FE objects

Parameters
  • fn_fid_or_fname (int or str) – The Framenet name or id number of the frame

  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)

Returns

Information about a frame

Return type

dict

frame_by_id(fn_fid, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition 
"This frame includes words that name medical specialties and is closely related to the
Medical_professionals frame.  The FE Type characterizing a sub-are in a Specialty may also be
expressed. 'Ralph practices paediatric oncology.'"
Parameters
  • fn_fid (int) – The Framenet id number of the frame

  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)

Returns

Information about a frame

Return type

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_by_name(fn_fname, ignorekeys=[], check_cache=True)[source]

Get the details for the specified Frame using the frame’s name.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition 
 "This frame includes words that name medical specialties and is closely related to the
  Medical_professionals frame.  The FE Type characterizing a sub-are in a Specialty may also be
  expressed. 'Ralph practices paediatric oncology.'"
Parameters
  • fn_fname (str) – The name of the frame

  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)

Returns

Information about a frame

Return type

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_ids_and_names(name=None)[source]

Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.
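
A small sketch (assuming, as in current releases, that the return value is a dict mapping numeric frame IDs to frame names; the pattern is illustrative):

>>> from nltk.corpus import framenet as fn
>>> idnames = fn.frame_ids_and_names(r'(?i)medical')
>>> 'Medical_specialties' in idnames.values()
True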

frame_relation_types()[source]

Obtain a list of frame relation types.

>>> from nltk.corpus import framenet as fn
>>> frts = sorted(fn.frame_relation_types(), key=itemgetter('ID'))
>>> isinstance(frts, list)
True
>>> len(frts) in (9, 10)    # FN 1.5 and 1.7, resp.
True
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1,
 '_type': 'framerelationtype',
 'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...],
 'name': 'Inheritance',
 'subFrameName': 'Child',
 'superFrameName': 'Parent'}
Returns

A list of all of the frame relation types in framenet

Return type

list(dict)

frame_relations(frame=None, frame2=None, type=None)[source]
Parameters
  • frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned

  • frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction

  • type – (optional) frame relation type (name or object); show only relations of this type

Returns

A list of all of the frame relations in framenet

Return type

list(dict)

>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels) in (1676, 2070)  # FN 1.5 and 1.7, resp.
True
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(274), breakLines=True)
[<Parent=Avoiding -- Inheritance -> Child=Dodging>,
 <Parent=Avoiding -- Inheritance -> Child=Evading>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True) 
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>,
<MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
frames(name=None)[source]

Obtain details for frames. Optionally restrict by a frame name pattern.

>>> from nltk.corpus import framenet as fn
>>> len(fn.frames()) in (1019, 1221)    # FN 1.5 and 1.7, resp.
True
>>> x = PrettyList(fn.frames(r'(?i)crim'), maxReprSize=0, breakLines=True)
>>> x.sort(key=itemgetter('ID'))
>>> x
[<frame ID=200 name=Criminal_process>,
 <frame ID=500 name=Criminal_investigation>,
 <frame ID=692 name=Crime_scenario>,
 <frame ID=700 name=Committing_crime>]

A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.

We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).

FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:

  • Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.

  • Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.

  • Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.

  • Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.

Parameters

name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned.

Returns

A list of matching Frames (or all Frames).

Return type

list(AttrDict)

frames_by_lemma(pat)[source]

Returns a list of all frames that contain LUs in which the name attribute of the LU matches the given regular expression pat. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).

Note: if you are going to be doing a lot of this type of searching, you’d want to build an index that maps from lemmas to frames because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db.

>>> from nltk.corpus import framenet as fn
>>> from nltk.corpus.reader.framenet import PrettyList
>>> PrettyList(sorted(fn.frames_by_lemma(r'(?i)a little'), key=itemgetter('ID'))) 
[<frame ID=189 name=Quanti...>, <frame ID=2001 name=Degree>]
Returns

A list of frame objects.

Return type

list(AttrDict)

ft_sents(docNamePattern=None)[source]

Full-text annotation sentences, optionally filtered by document name.
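
A usage sketch (output elided; full-text sentence objects expose a text attribute in current releases):

>>> from nltk.corpus import framenet as fn
>>> ftsents = fn.ft_sents()      # all full-text annotation sentences
>>> ftsents[0].text              # text of the first full-text sentence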

help(attrname=None)[source]

Display help information summarizing the main methods.

lu(fn_luid, ignorekeys=[], luName=None, frameID=None, frameName=None)[source]

Access a lexical unit by its ID. luName, frameID, and frameName are used only in the event that the LU does not have a file in the database (which is the case for LUs with “Problem” status); in this case, a placeholder LU is created which just contains its name, ID, and frame.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> list(map(PrettyDict, fn.lu(256).lexemes))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]
>>> fn.lu(227).exemplars[23] 
exemplar sentence (352962):
[sentNo] 0
[aPos] 59699508

[LU] (227) guess.v in Coming_to_believe

[frame] (23) Coming_to_believe

[annotationSet] 2 annotation sets

[POS] 18 tags

[POS_tagset] BNC

[GF] 3 relations

[PT] 3 phrases

[Other] 1 entry

[text] + [Target] + [FE]

When he was inside the house , Culley noticed the characteristic
                                              ------------------
                                              Content

he would n't have guessed at .
--                ******* --
Co                        C1 [Evidence:INI]
 (Co=Cognizer, C1=Content)

The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:

  • ‘name’ : the name of the LU (e.g. ‘merger.n’)

  • ‘definition’ : textual definition of the LU

  • ‘ID’ : the internal ID number of the LU

  • ‘_type’ : ‘lu’

  • ‘status’ : e.g. ‘Created’

  • ‘frame’ : Frame that this LU belongs to

  • ‘POS’ : the part of speech of this LU (e.g. ‘N’)

  • ‘totalAnnotated’ : total number of examples annotated with this LU

  • ‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)

  • ‘sentenceCount’ : a dict with the following two keys:
    • ‘annotated’: number of sentences annotated with this LU

    • ‘total’ : total number of sentences with this LU

  • ‘lexemes’ : a list of dicts describing the lemma of this LU.

    Each dict in the list contains these keys:

    • ‘POS’ : part of speech e.g. ‘N’

    • ‘name’ : either single-lexeme e.g. ‘merger’ or

      multi-lexeme e.g. ‘a little’

    • ‘order’: the order of the lexeme in the lemma (starting from 1)

    • ‘headword’: a boolean (‘true’ or ‘false’)

    • ‘breakBefore’: Can this lexeme be separated from the previous lexeme?

      Consider: “take over.v” as in:

      Germany took over the Netherlands in 2 days.
      Germany took the Netherlands over in 2 days.
      

      In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:

       Mary takes after her grandmother.
      *Mary takes her grandmother after.
      

      In this case, ‘breakBefore’ would be “false” for the lexeme “after”

  • ‘lemmaID’ : Can be used to connect lemmas in different LUs

  • ‘semTypes’ : a list of semantic type objects for this LU

  • ‘subCorpus’ : a list of subcorpora
    • Each item in the list is a dict containing the following keys:
      • ‘name’ :

      • ‘sentence’ : a list of sentences in the subcorpus
        • each item in the list is a dict with the following keys:
          • ‘ID’:

          • ‘sentNo’:

          • ‘text’: the text of the sentence

          • ‘aPos’:

          • ‘annotationSet’: a list of annotation sets
            • each item in the list is a dict with the following keys:
              • ‘ID’:

              • ‘status’:

              • ‘layer’: a list of layers
                • each layer is a dict containing the following keys:
                  • ‘name’: layer name (e.g. ‘BNC’)

                  • ‘rank’:

                  • ‘label’: a list of labels for the layer
                    • each label is a dict containing the following keys:
                      • ‘start’: start pos of label in sentence ‘text’ (0-based)

                      • ‘end’: end pos of label in sentence ‘text’ (0-based)

                      • ‘name’: name of label (e.g. ‘NN1’)

Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.

Parameters
  • fn_luid (int) – The id number of the lexical unit

  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)

Returns

All information about the lexical unit

Return type

dict

lu_basic(fn_luid)[source]

Returns basic information about the LU whose id is fn_luid. This is basically just a wrapper around the lu() function with “subCorpus” info excluded.

>>> from nltk.corpus import framenet as fn
>>> lu = PrettyDict(fn.lu_basic(256), breakLines=True)
>>> # ellipses account for differences between FN 1.5 and 1.7
>>> lu 
{'ID': 256,
 'POS': 'V',
 'URL': 'https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu256.xml',
 '_type': 'lu',
 'cBy': ...,
 'cDate': '02/08/2001 01:27:50 PST Thu',
 'definition': 'COD: be aware of beforehand; predict.',
 'definitionMarkup': 'COD: be aware of beforehand; predict.',
 'frame': <frame ID=26 name=Expectation>,
 'lemmaID': 15082,
 'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}],
 'name': 'foresee.v',
 'semTypes': [],
 'sentenceCount': {'annotated': ..., 'total': ...},
 'status': 'FN1_Sent'}
Parameters

fn_luid (int) – The id number of the desired LU

Returns

Basic information about the lexical unit

Return type

dict

lu_ids_and_names(name=None)[source]

Uses the LU index, which is much faster than looking up each LU definition if only the names and IDs are needed.
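
A small sketch (assuming, as in current releases, that the return value is a dict mapping numeric LU IDs to LU names):

>>> from nltk.corpus import framenet as fn
>>> idnames = fn.lu_ids_and_names(r'foresee')
>>> 'foresee.v' in idnames.values()
True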

lus(name=None, frame=None)[source]

Obtain details for lexical units. Optionally restrict by lexical unit name pattern, and/or to a certain frame or frames whose name matches a pattern.

>>> from nltk.corpus import framenet as fn
>>> len(fn.lus()) in (11829, 13572) # FN 1.5 and 1.7, resp.
True
>>> PrettyList(sorted(fn.lus(r'(?i)a little'), key=itemgetter('ID')), maxReprSize=0, breakLines=True)
[<lu ID=14733 name=a little.n>,
 <lu ID=14743 name=a little.adv>,
 <lu ID=14744 name=a little bit.adv>]
>>> PrettyList(sorted(fn.lus(r'interest', r'(?i)stimulus'), key=itemgetter('ID')))
[<lu ID=14894 name=interested.a>, <lu ID=14920 name=interesting.a>]

A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.

We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:

  • Apply_heat: “Michelle baked the potatoes for 45 minutes.”

  • Cooking_creation: “Michelle baked her mother a cake for her birthday.”

  • Absorb_heat: “The potatoes have to bake for more than 30 minutes.”

These constitute three different LUs, with different definitions.

Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.

Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.

Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.

In the simplest case, frame-evoking words are verbs such as “fried” in:

“Matilde fried the catfish in a heavy iron skillet.”

Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:

“…the reduction of debt levels to $665 million from $2.6 billion.”

Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:

“They were asleep for hours.”

Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.

Parameters

name (str) –

A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows the dot. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned.

The valid POSes are:

v - verb
n - noun
a - adjective
adv - adverb
prep - preposition
num - numbers
intj - interjection
art - article
c - conjunction
scon - subordinating conjunction

Returns

A list of selected (or all) lexical units

Return type

list of LU objects (dicts). See the lu() function for info about the specifics of LU objects.

propagate_semtypes()[source]

Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)

>>> from nltk.corpus import framenet as fn
>>> x = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> fn.propagate_semtypes()
>>> y = sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
>>> y-x > 1000
True
semtype(key)[source]
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
Parameters

key (string or int) – The name, abbreviation, or id number of the semantic type

Returns

Information about a semantic type

Return type

dict

semtype_inherits(st, superST)[source]
semtypes()[source]

Obtain a list of semantic types.

>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes) in (73, 109) # FN 1.5 and 1.7, resp.
True
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'definitionMarkup', 'name', 'rootType', 'subTypes', 'superType']
Returns

A list of all of the semantic types in framenet

Return type

list(dict)

sents(exemplars=True, full_text=True)[source]

Annotated sentences matching the specified criteria.
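
A sketch of how the two flags combine (no output shown; building these lists can take a while on first access):

>>> from nltk.corpus import framenet as fn
>>> both = fn.sents()                        # exemplar and full-text sentences
>>> exemplars_only = fn.sents(full_text=False)
>>> fulltext_only = fn.sents(exemplars=False)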

warnings(v)[source]

Enable or disable warnings of data integrity issues as they are encountered. If v is truthy, warnings will be enabled.

(This is a function rather than just an attribute/property to ensure that if enabling warnings is the first action taken, the corpus reader is instantiated first.)

class nltk.corpus.reader.IEERCorpusReader[source]

Bases: CorpusReader

docs(fileids=None)[source]
parsed_docs(fileids=None)[source]
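
A usage sketch (assuming the ieer corpus is installed in nltk_data; 'NYT_19980315' is one of its fileids, and output is elided):

>>> from nltk.corpus import ieer
>>> docs = ieer.parsed_docs('NYT_19980315')
>>> docs[0].docno, docs[0].headline
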
class nltk.corpus.reader.IPIPANCorpusReader[source]

Bases: CorpusReader

Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.

The corpus includes information about text domain, channel and categories. You can access the possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g.: fileids(channels='prasa'), fileids(categories='publicystyczny').

The reader supports the methods words, sents, paras and their tagged versions. You can get the part of speech instead of the full tag by passing the parameter “simplify_tags=True”, e.g.: tagged_sents(simplify_tags=True).

You can also get all of the disambiguated tags (rather than a single tag per token) by passing the parameter “one_tag=False”, e.g.: tagged_paras(one_tag=False).

You can get all tags that were assigned by the morphological analyzer by passing the parameter “disamb_only=False”, e.g. tagged_words(disamb_only=False).

The IPI PAN corpus contains tags indicating whether there is a space between two tokens. To add special “no space” markers, pass the parameter “append_no_space=True”, e.g. tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, the pair (‘’, ‘no-space’) is inserted for tagged data, and just ‘’ for the untagged methods.

The corpus reader can also try to append spaces between words. To enable this option, pass the parameter “append_space=True”, e.g. words(append_space=True). As a result, either ‘ ’ or (‘ ’, ‘space’) is inserted between tokens.

By default, xml entities like &quot; and &amp; are replaced by the corresponding characters. You can turn off this feature by passing the parameter “replace_xmlentities=False”, e.g. words(replace_xmlentities=False).
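
A short sketch pulling these options together (the root path and fileid pattern are placeholders, and the channel value is only an example; output is elided):

>>> from nltk.corpus.reader import IPIPANCorpusReader
>>> ipipan = IPIPANCorpusReader('/...path to corpus.../', r'.*\.xml')
>>> ipipan.fileids(channels='prasa')
>>> ipipan.tagged_sents(simplify_tags=True)
>>> ipipan.tagged_words(append_no_space=True)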

__init__(root, fileids)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

categories(fileids=None)[source]
channels(fileids=None)[source]
domains(fileids=None)[source]
fileids(channels=None, domains=None, categories=None)[source]

Return a list of file identifiers for the fileids that make up this corpus.

paras(fileids=None, **kwargs)[source]
sents(fileids=None, **kwargs)[source]
tagged_paras(fileids=None, **kwargs)[source]
tagged_sents(fileids=None, **kwargs)[source]
tagged_words(fileids=None, **kwargs)[source]
words(fileids=None, **kwargs)[source]
class nltk.corpus.reader.IndianCorpusReader[source]

Bases: CorpusReader

Corpus reader for the POS-tagged corpora of Indian languages. Each line contains one tagged sentence; blank lines are ignored.

sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
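
A usage sketch (assuming the indian corpus from nltk_data, whose fileids in current distributions include bangla.pos, hindi.pos, marathi.pos and telugu.pos; output is elided):

>>> from nltk.corpus import indian
>>> indian.words('hindi.pos')[:5]
>>> indian.tagged_words('hindi.pos')[:3]
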
class nltk.corpus.reader.KNBCorpusReader[source]

Bases: SyntaxCorpusReader

This class implements:
  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.

  • _read_block, which reads a block from the input stream.

  • _word, which takes a block and returns a list of list of words.

  • _tag, which takes a block and returns a list of list of tagged words.

  • _parse, which takes a block and returns a list of parsed sentences.

The structure of tagged words:

tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others …)

Usage example

>>> from nltk.corpus.util import LazyCorpusLoader
>>> knbc = LazyCorpusLoader(
...     'knbc/corpus1',
...     KNBCorpusReader,
...     r'.*/KN.*',
...     encoding='euc-jp',
... )
>>> len(knbc.sents()[0])
9
__init__(root, fileids, encoding='utf8', morphs2str=<function <lambda>>)[source]

Initialize KNBCorpusReader. morphs2str is a function that converts a list of morphs to a string, used for the tree representation in _parse().

class nltk.corpus.reader.LinThesaurusCorpusReader[source]

Bases: CorpusReader

Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.

__init__(root, badscore=0.0)[source]

Initialize the thesaurus.

Parameters
  • root (C{string}) – root directory containing thesaurus LISP files

  • badscore (C{float}) – the score to give to words which do not appear in each other’s sets of synonyms

scored_synonyms(ngram, fileid=None)[source]

Returns a list of scored synonyms (tuples of synonyms and scores) for the given ngram.

Parameters
  • ngram (C{string}) – ngram to lookup

  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.

Returns

If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.

similarity(ngram1, ngram2, fileid=None)[source]

Returns the similarity score for two ngrams.

Parameters
  • ngram1 (C{string}) – first ngram to compare

  • ngram2 (C{string}) – second ngram to compare

  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.

Returns

If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.

synonyms(ngram, fileid=None)[source]

Returns a list of synonyms for the given ngram.

Parameters
  • ngram (C{string}) – ngram to lookup

  • fileid (C{string}) – thesaurus fileid to search in. If None, search all fileids.

Returns

If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.
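
A usage sketch (assuming the lin_thesaurus data from nltk_data; 'simN.lsp' is the noun thesaurus file in that distribution, and output is elided):

>>> from nltk.corpus import lin_thesaurus as thes
>>> thes.synonyms('business', fileid='simN.lsp')
>>> thes.similarity('business', 'enterprise')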

class nltk.corpus.reader.MTECorpusReader[source]

Bases: TaggedCorpusReader

Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset.

__init__(root=None, fileids=None, encoding='utf8')[source]

Construct a new MTECorpusReader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = MTECorpusReader(root, 'oana-*.xml', 'utf8') 
Parameters
  • root – The root directory for this corpus. (default points to location in multext config file)

  • fileids – A list or regexp specifying the fileids in this corpus. (default is oana-en.xml)

  • encoding – The encoding of the given files (default is utf8)

lemma_paras(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of tuples of the word and the corresponding lemma (word, lemma)

Return type

list(List(List(tuple(str, str))))

lemma_sents(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of tuples of the word and the corresponding lemma (word, lemma)

Return type

list(list(tuple(str, str)))

lemma_words(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of words, the corresponding lemmas and punctuation symbols, encoded as tuples (word, lemma)

Return type

list(tuple(str,str))

paras(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings

Return type

list(list(list(str)))

sents(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings

Return type

list(list(str))

tagged_paras(fileids=None, tagset='msd', tags='')[source]
Parameters
  • fileids – A list specifying the fileids that should be used.

  • tagset – The tagset that should be used in the returned object, either “universal” or “msd” (“msd” is the default)

  • tags – An MSD tag used to filter out all parts of the corpus whose tags are not at least as precise as the given tag

Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a list of (word,tag) tuples

Return type

list(list(list(tuple(str, str))))

tagged_sents(fileids=None, tagset='msd', tags='')[source]
Parameters
  • fileids – A list specifying the fileids that should be used.

  • tagset – The tagset that should be used in the returned object, either “universal” or “msd” (“msd” is the default)

  • tags – An MSD tag used to filter out all parts of the corpus whose tags are not at least as precise as the given tag

Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of (word,tag) tuples

Return type

list(list(tuple(str, str)))

tagged_words(fileids=None, tagset='msd', tags='')[source]
Parameters
  • fileids – A list specifying the fileids that should be used.

  • tagset – The tagset that should be used in the returned object, either “universal” or “msd” (“msd” is the default)

  • tags – An MSD tag used to filter out all parts of the corpus whose tags are not at least as precise as the given tag

Returns

the given file(s) as a list of tagged words and punctuation symbols encoded as tuples (word, tag)

Return type

list(tuple(str, str))

words(fileids=None)[source]
Parameters

fileids – A list specifying the fileids that should be used.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.MWAPPDBCorpusReader[source]

Bases: WordListCorpusReader

This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al. (2014a, 2014b, 2015).

The original source of the full PPDB corpus can be found at https://www.cis.upenn.edu/~ccb/ppdb/

Returns

a list of tuples of similar lexical terms.

entries(fileids='ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs')[source]
Returns

a tuple of synonym word pairs.

mwa_ppdb_xxxl_file = 'ppdb-1.0-xxxl-lexical.extended.synonyms.uniquepairs'
class nltk.corpus.reader.MacMorphoCorpusReader[source]

Bases: TaggedCorpusReader

A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.

__init__(root, fileids, encoding='utf8', tagset=None)[source]

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, '.*', '.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.
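
A usage sketch (assuming the mac_morpho data from nltk_data; the tags shown come from one revision of the corpus and may differ):

>>> from nltk.corpus import mac_morpho
>>> mac_morpho.tagged_words()[:3]
[('Jersei', 'N'), ('atinge', 'V'), ('média', 'N')]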

class nltk.corpus.reader.NKJPCorpusReader[source]

Bases: XMLCorpusReader

HEADER_MODE = 2
RAW_MODE = 3
SENTS_MODE = 1
WORDS_MODE = 0
__init__(root, fileids='.*')[source]

Corpus reader designed to work with the National Corpus of Polish. See http://nkjp.pl/ for more details about NKJP.

Usage example:

import nltk
import nkjp
from nkjp import NKJPCorpusReader
x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='')  # obtain the whole corpus
x.header()
x.raw()
x.words()
x.tagged_words(tags=['subst', 'comp'])  # link to find more tags: nkjp.pl/poliqarp/help/ense2.html
x.sents()
x = NKJPCorpusReader(root='/home/USER/nltk_data/corpora/nkjp/', fileids='Wilk*')  # obtain particular file(s)
x.header(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'])
x.tagged_words(fileids=['WilkDom', '/home/USER/nltk_data/corpora/nkjp/WilkWilczy'], tags=['subst', 'comp'])

add_root(fileid)[source]

Add root if necessary to specified fileid.

fileids()[source]

Returns a list of file identifiers for the fileids that make up this corpus.

get_paths()[source]
header(fileids=None, **kwargs)[source]

Returns header(s) of specified fileids.

raw(fileids=None, **kwargs)[source]

Returns the raw content of the specified fileids.

sents(fileids=None, **kwargs)[source]

Returns sentences in specified fileids.

tagged_words(fileids=None, **kwargs)[source]

Call with specified tags as a list, e.g. tags=[‘subst’, ‘comp’]. Returns tagged words in specified fileids.

words(fileids=None, **kwargs)[source]

Returns words in specified fileids.

class nltk.corpus.reader.NPSChatCorpusReader[source]

Bases: XMLCorpusReader

__init__(root, fileids, wrap_etree=False, tagset=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

posts(fileids=None)[source]
tagged_posts(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Returns

the given file’s text nodes as a list of words and punctuation symbols

Return type

list(str)

xml_posts(fileids=None)[source]
class nltk.corpus.reader.NombankCorpusReader[source]

Bases: CorpusReader

Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

__init__(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]
Parameters
  • root – The root directory for this corpus.

  • nomfile – The name of the file containing the predicate-argument annotations (relative to root).

  • framefiles – A list or regexp specifying the frameset fileids for this corpus.

  • parse_fileid_xform – A transform that should be applied to the fileids in this corpus. This should be a function of one argument (a fileid) that returns a string (the new fileid).

  • parse_corpus – The corpus containing the parse trees corresponding to this corpus. These parse trees are necessary to resolve the tree pointers used by nombank.

instances(baseform=None)[source]
Returns

a corpus view that acts as a list of NombankInstance objects, one for each noun in the corpus.

lines()[source]
Returns

a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

nouns()[source]
Returns

a corpus view that acts as a list of all noun lemmas in this corpus (from the nombank.1.0.words file).

roleset(roleset_id)[source]
Returns

the xml description for the given roleset.

rolesets(baseform=None)[source]
Returns

list of xml descriptions for rolesets.
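
A brief sketch (assuming the nombank.1.0 data is installed; output is elided):

>>> from nltk.corpus import nombank
>>> inst = nombank.instances()[0]
>>> inst.roleset
>>> nombank.roleset(inst.roleset)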

class nltk.corpus.reader.NonbreakingPrefixesCorpusReader[source]

Bases: WordListCorpusReader

This is a class to read the nonbreaking prefixes textfiles from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses word tokenizer.

available_langs = {'ca': 'ca', 'catalan': 'ca', 'cs': 'cs', 'czech': 'cs', 'de': 'de', 'dutch': 'nl', 'el': 'el', 'en': 'en', 'english': 'en', 'es': 'es', 'fi': 'fi', 'finnish': 'fi', 'fr': 'fr', 'french': 'fr', 'german': 'de', 'greek': 'el', 'hu': 'hu', 'hungarian': 'hu', 'icelandic': 'is', 'is': 'is', 'it': 'it', 'italian': 'it', 'latvian': 'lv', 'lv': 'lv', 'nl': 'nl', 'pl': 'pl', 'polish': 'pl', 'portuguese': 'pt', 'pt': 'pt', 'ro': 'ro', 'romanian': 'ro', 'ru': 'ru', 'russian': 'ru', 'sk': 'sk', 'sl': 'sl', 'slovak': 'sk', 'slovenian': 'sl', 'spanish': 'es', 'sv': 'sv', 'swedish': 'sv', 'ta': 'ta', 'tamil': 'ta'}
words(lang=None, fileids=None, ignore_lines_startswith='#')[source]

This method returns a list of nonbreaking prefixes for the specified language(s).

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')[:10] == [u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J']
True
>>> nbp.words('ta')[:5] == [u'அ', u'ஆ', u'இ', u'ஈ', u'உ']
True
Returns

a list of words for the specified language(s).

class nltk.corpus.reader.OpinionLexiconCorpusReader[source]

Bases: WordListCorpusReader

Reader for Liu and Hu opinion lexicon. Blank lines and readme are ignored.

>>> from nltk.corpus import opinion_lexicon
>>> opinion_lexicon.words()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]

The OpinionLexiconCorpusReader provides shortcuts to retrieve positive/negative words:

>>> opinion_lexicon.negative()
['2-faced', '2-faces', 'abnormal', 'abolish', ...]

Note that words from words() method are sorted by file id, not alphabetically:

>>> opinion_lexicon.words()[0:10] 
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably',
'abominate', 'abomination', 'abort', 'aborted']
>>> sorted(opinion_lexicon.words())[0:10] 
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably',
'abominate', 'abomination', 'abort']
CorpusView

alias of IgnoreReadmeCorpusView

negative()[source]

Return all negative words in alphabetical order.

Returns

a list of negative words.

Return type

list(str)

positive()[source]

Return all positive words in alphabetical order.

Returns

a list of positive words.

Return type

list(str)

words(fileids=None)[source]

Return all words in the opinion lexicon. Note that these words are not sorted in alphabetical order.

Parameters

fileids – a list or regexp specifying the ids of the files whose words have to be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.PPAttachmentCorpusReader[source]

Bases: CorpusReader

Each line in the corpus has the format: sentence_id verb noun1 preposition noun2 attachment

attachments(fileids)[source]
tuples(fileids)[source]
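
A usage sketch (assuming the ppattach data from nltk_data; 'training' is one of its fileids, and output is elided):

>>> from nltk.corpus import ppattach
>>> inst = ppattach.attachments('training')[0]
>>> inst.verb, inst.noun1, inst.prep, inst.noun2, inst.attachment
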
class nltk.corpus.reader.PanLexLiteCorpusReader[source]

Bases: CorpusReader

MEANING_Q = '\n        SELECT dnx2.mn, dnx2.uq, dnx2.ap, dnx2.ui, ex2.tt, ex2.lv\n        FROM dnx\n        JOIN ex ON (ex.ex = dnx.ex)\n        JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n        JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n        WHERE dnx.ex != dnx2.ex AND ex.tt = ? AND ex.lv = ?\n        ORDER BY dnx2.uq DESC\n    '
TRANSLATION_Q = '\n        SELECT s.tt, sum(s.uq) AS trq FROM (\n            SELECT ex2.tt, max(dnx.uq) AS uq\n            FROM dnx\n            JOIN ex ON (ex.ex = dnx.ex)\n            JOIN dnx dnx2 ON (dnx2.mn = dnx.mn)\n            JOIN ex ex2 ON (ex2.ex = dnx2.ex)\n            WHERE dnx.ex != dnx2.ex AND ex.lv = ? AND ex.tt = ? AND ex2.lv = ?\n            GROUP BY ex2.tt, dnx.ui\n        ) s\n        GROUP BY s.tt\n        ORDER BY trq DESC, s.tt\n    '
__init__(root)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

language_varieties(lc=None)[source]

Return a list of PanLex language varieties.

Parameters

lc – ISO 639 alpha-3 code. If specified, filters returned varieties by this code. If unspecified, all varieties are returned.

Returns

the specified language varieties as a list of tuples. The first element is the language variety’s seven-character uniform identifier, and the second element is its default name.

Return type

list(tuple)

meanings(expr_uid, expr_tt)[source]

Return a list of meanings for an expression.

Parameters
  • expr_uid – the expression’s language variety, as a seven-character uniform identifier.

  • expr_tt – the expression’s text.

Returns

a list of Meaning objects.

Return type

list(Meaning)

translations(from_uid, from_tt, to_uid)[source]

Return a list of translations for an expression into a single language variety.

Parameters
  • from_uid – the source expression’s language variety, as a seven-character uniform identifier.

  • from_tt – the source expression’s text.

  • to_uid – the target language variety, as a seven-character uniform identifier.

Returns

a list of translation tuples. The first element is the expression text and the second element is the translation quality.

Return type

list(tuple)
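
A usage sketch (assuming the panlex_lite database has been downloaded via the NLTK downloader; 'eng-000' and 'fra-000' are the PanLex uniform identifiers this sketch assumes for English and French, and output is elided):

>>> from nltk.corpus import panlex_lite as plx
>>> plx.language_varieties('eng')
>>> plx.translations('eng-000', 'book', 'fra-000')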

class nltk.corpus.reader.PanlexSwadeshCorpusReader[source]

Bases: WordListCorpusReader

This is a class to read the PanLex Swadesh list from

David Kamholz, Jonathan Pool, and Susan M. Colowick (2014). PanLex: Building a Resource for Panlingual Lexical Translation. In LREC. http://www.lrec-conf.org/proceedings/lrec2014/pdf/1029_Paper.pdf

License: CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/legalcode

__init__(*args, **kwargs)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

entries(fileids=None)[source]
Returns

a tuple of words for the specified fileids.

get_languages()[source]
get_macrolanguages()[source]
language_codes()[source]
license()[source]

Return the contents of the corpus LICENSE file, if it exists.

words_by_iso639(iso63_code)[source]
Returns

a list of list(str)

words_by_lang(lang_code)[source]
Returns

a list of list(str)
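
A usage sketch (assuming the panlex_swadesh word lists from nltk_data; the exact language-code format accepted by each method may differ, so the values below are illustrative and output is elided):

>>> from nltk.corpus import panlex_swadesh
>>> panlex_swadesh.language_codes()
>>> panlex_swadesh.words_by_iso639('eng')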

class nltk.corpus.reader.Pl196xCorpusReader[source]

Bases: CategorizedCorpusReader, XMLCorpusReader

__init__(*args, **kwargs)[source]

Initialize this mapping based on keyword arguments, as follows:

  • cat_pattern: A regular expression pattern used to find the category for each file identifier. The pattern will be applied to each file identifier, and the first matching group will be used as the category label for that file.

  • cat_map: A dictionary, mapping from file identifiers to category labels.

  • cat_file: The name of a file that contains the mapping from file identifiers to categories. The argument cat_delimiter can be used to specify a delimiter.

The corresponding argument will be deleted from kwargs. If more than one argument is specified, an exception will be raised.

decode_tag(tag)[source]
head_len = 2770
paras(fileids=None, categories=None, textids=None)[source]
sents(fileids=None, categories=None, textids=None)[source]
tagged_paras(fileids=None, categories=None, textids=None)[source]
tagged_sents(fileids=None, categories=None, textids=None)[source]
tagged_words(fileids=None, categories=None, textids=None)[source]
textids(fileids=None, categories=None)[source]

In the pl196x corpus each category is stored in a single file and thus both methods provide identical functionality. In order to accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks, giving much more control to the user.

words(fileids=None, categories=None, textids=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Returns

the given file’s text nodes as a list of words and punctuation symbols

Return type

list(str)

xml(fileids=None, categories=None)[source]
class nltk.corpus.reader.PlaintextCorpusReader[source]

Bases: CorpusReader

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.

CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object>, para_block_reader=<function read_blankline_block>, encoding='utf8')[source]

Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, '.*\.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.

  • sent_tokenizer – Tokenizer for breaking paragraphs into sentences.

  • para_block_reader – The block reader used to divide the corpus into paragraph blocks.

paras(fileids=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

sents(fileids=None)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

words(fileids=None)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.PortugueseCategorizedPlaintextCorpusReader[source]

Bases: CategorizedPlaintextCorpusReader

__init__(*args, **kwargs)[source]

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the PlaintextCorpusReader constructor.

class nltk.corpus.reader.PropbankCorpusReader[source]

Bases: CorpusReader

Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

__init__(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]
Parameters
  • root – The root directory for this corpus.

  • propfile – The name of the file containing the predicate-argument annotations (relative to root).

  • framefiles – A list or regexp specifying the frameset fileids for this corpus.

  • parse_fileid_xform – A transform that should be applied to the fileids in this corpus. This should be a function of one argument (a fileid) that returns a string (the new fileid).

  • parse_corpus – The corpus containing the parse trees corresponding to this corpus. These parse trees are necessary to resolve the tree pointers used by propbank.

instances(baseform=None)[source]
Returns

a corpus view that acts as a list of PropbankInstance objects, one for each verb in the corpus.

lines()[source]
Returns

a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

roleset(roleset_id)[source]
Returns

the xml description for the given roleset.

rolesets(baseform=None)[source]
Returns

list of xml descriptions for rolesets.

verbs()[source]
Returns

a corpus view that acts as a list of all verb lemmas in this corpus (from the verbs.txt file).
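
A brief sketch (assuming the propbank data from nltk_data; 'expose.01' is the roleset of the first instance in the standard sample and may differ otherwise):

>>> from nltk.corpus import propbank
>>> inst = propbank.instances()[0]
>>> inst.roleset            # 'expose.01' in the NLTK sample data
>>> propbank.roleset(inst.roleset)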

class nltk.corpus.reader.ProsConsCorpusReader[source]

Bases: CategorizedCorpusReader, CorpusReader

Reader for the Pros and Cons sentence dataset.

>>> from nltk.corpus import pros_cons
>>> pros_cons.sents(categories='Cons') 
[['East', 'batteries', '!', 'On', '-', 'off', 'switch', 'too', 'easy',
'to', 'maneuver', '.'], ['Eats', '...', 'no', ',', 'GULPS', 'batteries'],
...]
>>> pros_cons.words('IntegratedPros.txt')
['Easy', 'to', 'use', ',', 'economical', '!', ...]
CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8', **kwargs)[source]
Parameters
  • root – The root directory for the corpus.

  • fileids – a list or regexp specifying the fileids in the corpus.

  • word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer

  • encoding – the encoding that should be used to read the corpus.

  • kwargs – additional parameters passed to CategorizedCorpusReader.

sents(fileids=None, categories=None)[source]

Return all sentences in the corpus or in the specified files/categories.

Parameters
  • fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.

  • categories – a list specifying the categories whose sentences have to be returned.

Returns

the given file(s) as a list of sentences. Each sentence is tokenized using the specified word_tokenizer.

Return type

list(list(str))

words(fileids=None, categories=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files/categories.

Parameters
  • fileids – a list or regexp specifying the ids of the files whose words have to be returned.

  • categories – a list specifying the categories whose words have to be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.RTECorpusReader[source]

Bases: XMLCorpusReader

Corpus reader for corpora in RTE challenges.

This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.

pairs(fileids)[source]

Build a list of RTEPairs from a RTE corpus.

Parameters

fileids – a list of RTE corpus fileids

Type

list

Return type

list(RTEPair)
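
A usage sketch (assuming the rte data from nltk_data; 'rte3_dev.xml' is one of its fileids, and output is elided):

>>> from nltk.corpus import rte
>>> pairs = rte.pairs(['rte3_dev.xml'])
>>> pairs[0].text, pairs[0].hyp, pairs[0].value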

class nltk.corpus.reader.ReviewsCorpusReader[source]

Bases: CorpusReader

Reader for the Customer Review Data dataset by Hu and Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization.

>>> from nltk.corpus import product_reviews_1
>>> camera_reviews = product_reviews_1.reviews('Canon_G3.txt')
>>> review = camera_reviews[0]
>>> review.sents()[0] 
['i', 'recently', 'purchased', 'the', 'canon', 'powershot', 'g3', 'and', 'am',
'extremely', 'satisfied', 'with', 'the', 'purchase', '.']
>>> review.features() 
[('canon powershot g3', '+3'), ('use', '+2'), ('picture', '+2'),
('picture quality', '+1'), ('picture quality', '+1'), ('camera', '+2'),
('use', '+2'), ('feature', '+1'), ('picture quality', '+3'), ('use', '+1'),
('option', '+1')]

We can also reach the same information directly from the stream:

>>> product_reviews_1.features('Canon_G3.txt')
[('canon powershot g3', '+3'), ('use', '+2'), ...]

We can compute stats for specific product features:

>>> n_reviews = len([(feat,score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> tot = sum([int(score) for (feat,score) in product_reviews_1.features('Canon_G3.txt') if feat=='picture'])
>>> mean = tot / n_reviews
>>> print(n_reviews, tot, mean)
15 24 1.6
CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL), encoding='utf8')[source]
Parameters
  • root – The root directory for the corpus.

  • fileids – a list or regexp specifying the fileids in the corpus.

  • word_tokenizer – a tokenizer for breaking sentences or paragraphs into words. Default: WordPunctTokenizer

  • encoding – the encoding that should be used to read the corpus.

features(fileids=None)[source]

Return a list of features. Each feature is a tuple made of the specific item feature and the opinion strength about that feature.

Parameters

fileids – a list or regexp specifying the ids of the files whose features have to be returned.

Returns

all features for the item(s) in the given file(s).

Return type

list(tuple)

reviews(fileids=None)[source]

Return all the reviews as a list of Review objects. If fileids is specified, return all the reviews from each of the specified files.

Parameters

fileids – a list or regexp specifying the ids of the files whose reviews have to be returned.

Returns

the given file(s) as a list of reviews.

sents(fileids=None)[source]

Return all sentences in the corpus or in the specified files.

Parameters

fileids – a list or regexp specifying the ids of the files whose sentences have to be returned.

Returns

the given file(s) as a list of sentences, each encoded as a list of word strings.

Return type

list(list(str))

words(fileids=None)[source]

Return all words and punctuation symbols in the corpus or in the specified files.

Parameters

fileids – a list or regexp specifying the ids of the files whose words have to be returned.

Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.SemcorCorpusReader[source]

Bases: XMLCorpusReader

Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

__init__(root, fileids, wordnet, lazy=True)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

chunk_sents(fileids=None)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of chunks.

Return type

list(list(list(str)))

chunks(fileids=None)[source]
Returns

the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit.

Return type

list(list(str))

sents(fileids=None)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of word strings.

Return type

list(list(str))

tagged_chunks(fileids=None, tag='pos')[source]
Returns

the given file(s) as a list of tagged chunks, represented in tree form.

Return type

list(Tree)

Parameters

tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)

tagged_sents(fileids=None, tag='pos')[source]
Returns

the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form).

Return type

list(list(Tree))

Parameters

tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)

words(fileids=None)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)
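
For example (a sketch assuming the semcor package from nltk_data; semantically tagged chunks are returned as small Trees whose labels are WordNet Lemma objects):

from nltk.corpus import semcor

print(semcor.words()[:6])                  # plain token list
sent = semcor.tagged_sents(tag='sem')[0]   # one sentence = a list of chunk Trees
for chunk in sent[:5]:
    print(chunk)                           # labels are Lemmas, or 'NE' for named entities not in WordNet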

class nltk.corpus.reader.SensevalCorpusReader[source]

Bases: CorpusReader

instances(fileids=None)[source]
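
For example, with the senseval package from nltk_data, each file holds the labelled training instances for one ambiguous word (a usage sketch):

from nltk.corpus import senseval

print(senseval.fileids())            # e.g. ['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']
inst = senseval.instances('hard.pos')[0]
print(inst.word, inst.senses)        # the target word and its gold sense label(s)
print(inst.context[:8])              # (word, tag) pairs surrounding the target
print(inst.position)                 # index of the target word within the context
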
class nltk.corpus.reader.SentiSynset[source]

Bases: object

__init__(pos_score, neg_score, synset)[source]
neg_score()[source]
obj_score()[source]
pos_score()[source]
class nltk.corpus.reader.SentiWordNetCorpusReader[source]

Bases: CorpusReader

__init__(root, fileids, encoding='utf-8')[source]

Construct a new SentiWordNet Corpus Reader, using data from the specified file.

all_senti_synsets()[source]
senti_synset(*vals)[source]
senti_synsets(string, pos=None)[source]
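
For example (a sketch; scores are looked up per synset, and obj_score() is 1 minus the sum of pos_score() and neg_score()):

from nltk.corpus import sentiwordnet as swn

breakdown = swn.senti_synset('breakdown.n.03')
print(breakdown.pos_score(), breakdown.neg_score(), breakdown.obj_score())

# all senses of a lemma, optionally filtered by part of speech
for s in swn.senti_synsets('slow', pos='v'):
    print(s)
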
class nltk.corpus.reader.SinicaTreebankCorpusReader[source]

Bases: SyntaxCorpusReader

Reader for the Sinica Treebank.

class nltk.corpus.reader.StringCategoryCorpusReader[source]

Bases: CorpusReader

__init__(root, fileids, delimiter=' ', encoding='utf8')[source]
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • delimiter – Field delimiter

tuples(fileids=None)[source]
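
Each line in such a corpus holds a category label, the delimiter, and a string; tuples() yields the corresponding (category, string) pairs. A sketch using the question-classification corpus qc from nltk_data (the fileid 'train.txt' is an assumption about that package):

from nltk.corpus import qc

category, question = qc.tuples('train.txt')[0]
print(category, '->', question)
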
class nltk.corpus.reader.SwadeshCorpusReader[source]

Bases: WordListCorpusReader

entries(fileids=None)[source]
Returns

a tuple of words for the specified fileids.
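
For example, the entries of two or more languages can be aligned into translation tuples (a sketch; fileids are language codes such as 'fr' and 'en'):

from nltk.corpus import swadesh

fr2en = swadesh.entries(['fr', 'en'])   # aligned (French, English) pairs
print(fr2en[:2])                        # e.g. [('je', 'I'), ('tu, vous', 'you (singular), thou')]
translate = dict(fr2en)
print(translate['chien'])               # 'dog'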

class nltk.corpus.reader.SwitchboardCorpusReader[source]

Bases: CorpusReader

__init__(root, tagset=None)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

discourses()[source]
tagged_discourses(tagset=False)[source]
tagged_turns(tagset=None)[source]
tagged_words(tagset=None)[source]
turns()[source]
words()[source]
class nltk.corpus.reader.SyntaxCorpusReader[source]

Bases: CorpusReader

An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:

  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.

  • _read_block, which reads a block from the input stream.

  • _word, which takes a block and returns a list of list of words.

  • _tag, which takes a block and returns a list of list of tagged words.

  • _parse, which takes a block and returns a list of parsed sentences.

parsed_sents(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
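
A minimal subclass might look like the following sketch. It follows the convention used by BracketParseCorpusReader, where each block read from the stream holds a single bracketed parse; the class name and file pattern are hypothetical.

from nltk.corpus.reader import SyntaxCorpusReader
from nltk.corpus.reader.util import read_blankline_block
from nltk.tree import Tree

class BracketedSentenceReader(SyntaxCorpusReader):
    """Hypothetical reader: one bracketed parse per blank-line-separated block."""

    def _read_block(self, stream):
        # return a list of raw block strings read from the stream
        return read_blankline_block(stream)

    def _parse(self, block):
        # one parsed sentence per block
        return Tree.fromstring(block)

    def _word(self, block):
        # plain tokens are the leaves of the parse
        return Tree.fromstring(block).leaves()

    def _tag(self, block, tagset=None):
        # (word, tag) pairs come from the preterminal nodes
        return Tree.fromstring(block).pos()

# reader = BracketedSentenceReader('/path/to/corpus', r'.*\.mrg')
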
class nltk.corpus.reader.TEICorpusView[source]

Bases: StreamBackedCorpusView

__init__(corpus_file, tagged, group_by_sent, group_by_para, tagset=None, head_len=0, textids=None)[source]

Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.

Parameters
  • fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.

  • startpos – The file position at which the view will start reading. This can be used to skip over preface sections.

  • encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).

read_block(stream)[source]

Read a block from the input stream.

Returns

a block of tokens from the input stream

Return type

list(any)

Parameters

stream (stream) – an input stream

class nltk.corpus.reader.TaggedCorpusReader[source]

Bases: CorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator. I.e., words should have the form:

word1/tag1 word2/tag2 word3/tag3 ...

But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.

__init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)[source]

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

paras(fileids=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

sents(fileids=None)[source]
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

tagged_paras(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Return type

list(list(list(tuple(str,str))))

tagged_sents(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

tagged_words(fileids=None, tagset=None)[source]
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

words(fileids=None)[source]
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)
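
A usage sketch, assuming a hypothetical directory of files in the word/tag format described above (one sentence per line, blank lines between paragraphs):

from nltk.corpus.reader import TaggedCorpusReader

# hypothetical corpus: files such as sample.pos containing lines like
#   The/AT dog/NN barked/VBD ./.
reader = TaggedCorpusReader('/path/to/tagged-corpus', r'.*\.pos')
print(reader.words()[:4])          # ['The', 'dog', 'barked', '.']
print(reader.tagged_words()[:2])   # [('The', 'AT'), ('dog', 'NN')]
print(reader.tagged_sents()[0])    # one list of (word, tag) pairs per line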

class nltk.corpus.reader.TimitCorpusReader[source]

Bases: CorpusReader

Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:

  • timitdic.txt: dictionary of standard transcriptions

  • spkrinfo.txt: table of speaker information

In addition, the root directory should contain one subdirectory for each speaker, containing three files for each utterance:

  • <utterance-id>.txt: text content of utterances

  • <utterance-id>.wrd: tokenized text content of utterances

  • <utterance-id>.phn: phonetic transcription of utterances

  • <utterance-id>.wav: utterance sound file

__init__(root, encoding='utf8')[source]

Construct a new TIMIT corpus reader in the given directory.

Parameters

root – The root directory for this corpus.

audiodata(utterance, start=0, end=None)[source]
fileids(filetype=None)[source]

Return a list of file identifiers for the files that make up this corpus.

Parameters

filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.

phone_times(utterances=None)[source]

Return the phone times for the given utterances. Offsets are represented as numbers of 16 kHz samples.

phone_trees(utterances=None)[source]
phones(utterances=None)[source]
play(utterance, start=0, end=None)[source]

Play the given audio sample.

Parameters

utterance – The utterance id of the sample to play

sent_times(utterances=None)[source]
sentid(utterance)[source]
sents(utterances=None)[source]
spkrid(utterance)[source]
spkrinfo(speaker)[source]
Returns

A record of the speaker information from spkrinfo.txt for the given speaker (id, sex, dialect region, etc.).

spkrutteranceids(speaker)[source]
Returns

A list of all utterances associated with a given speaker.

transcription_dict()[source]
Returns

A dictionary giving the ‘standard’ transcription for each word.

utterance(spkrid, sentid)[source]
utteranceids(dialect=None, sex=None, spkrid=None, sent_type=None, sentid=None)[source]
Returns

A list of the utterance identifiers for all utterances in this corpus, or for the given speaker, dialect region, gender, sentence type, or sentence number, if specified.

wav(utterance, start=0, end=None)[source]
word_times(utterances=None)[source]
words(utterances=None)[source]
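
For example, with the TIMIT sample distributed in nltk_data (a sketch; utterance ids have the form 'dr1-fvmh0/sa1'):

from nltk.corpus import timit

item = timit.utteranceids()[0]             # e.g. 'dr1-fvmh0/sa1'
print(timit.words(item))                   # tokenized text of the utterance
print(timit.phones(item)[:10])             # phonetic transcription
print(timit.spkrid(item))                  # speaker id, e.g. 'dr1-fvmh0'
print(timit.transcription_dict()['dark'])  # standard transcription from timitdic.txt
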
class nltk.corpus.reader.TimitTaggedCorpusReader[source]

Bases: TaggedCorpusReader

A corpus reader for tagged sentences that are included in the TIMIT corpus.

__init__(*args, **kwargs)[source]

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

paras()[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

tagged_paras()[source]
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Return type

list(list(list(tuple(str,str))))

class nltk.corpus.reader.ToolboxCorpusReader[source]

Bases: CorpusReader

entries(fileids, **kwargs)[source]
fields(fileids, strip=True, unwrap=True, encoding='utf8', errors='strict', unicode_fields=None)[source]
words(fileids, key='lx')[source]
xml(fileids, key=None)[source]
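
For example, with the toolbox sample from nltk_data (the Rotokas dictionary; a sketch):

from nltk.corpus import toolbox

# each entry is a (lexeme, fields) pair, where fields is a list of (marker, value) tuples
lexeme, fields = toolbox.entries('rotokas.dic')[0]
print(lexeme)
print(dict(fields).get('ps'))   # e.g. the part-of-speech field, if present
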
class nltk.corpus.reader.TwitterCorpusReader[source]

Bases: CorpusReader

Reader for corpora that consist of Tweets represented as line-delimited JSON.

Individual Tweets can be tokenized using the default tokenizer, or by a custom tokenizer specified as a parameter to the constructor.

Construct a new Tweet corpus reader for a set of documents located at the given root directory.

If you made your own tweet collection in a directory called twitter-files, then you can initialise the reader as:

from nltk.corpus import TwitterCorpusReader
reader = TwitterCorpusReader(root='/path/to/twitter-files', fileids=r'.*\.json')

However, the recommended approach is to set the relevant directory as the value of the environmental variable TWITTER, and then invoke the reader as follows:

import os
root = os.environ['TWITTER']
reader = TwitterCorpusReader(root, r'.*\.json')

If you want to work directly with the raw Tweets, the json library can be used:

import json
for tweet in reader.docs():
    print(json.dumps(tweet, indent=1, sort_keys=True))
CorpusView

alias of StreamBackedCorpusView

__init__(root, fileids=None, word_tokenizer=<nltk.tokenize.casual.TweetTokenizer object>, encoding='utf8')[source]
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • word_tokenizer – Tokenizer for breaking the text of Tweets into smaller units, including but not limited to words.

docs(fileids=None)[source]

Returns the full Tweet objects, as specified by Twitter documentation on Tweets

Returns

the given file(s) as a list of dictionaries deserialised from JSON.

Return type

list(dict)

strings(fileids=None)[source]

Returns only the text content of Tweets in the file(s)

Returns

the given file(s) as a list of Tweets.

Return type

list(str)

tokenized(fileids=None)[source]
Returns

the given file(s) as a list of tokenized Tweets, where each Tweet is a list of words, screen names, hashtags, URLs and punctuation symbols.

Return type

list(list(str))

class nltk.corpus.reader.UdhrCorpusReader[source]

Bases: PlaintextCorpusReader

ENCODINGS = [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('Polish-Latin2', 'cp1250'), ('Polish_Polski-Latin2', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]
SKIP = {'Amharic-Afenegus6..60375', 'Armenian-DallakHelv', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117', 'Bhojpuri-Agra', 'Burmese_Myanmar-UTF8', 'Burmese_Myanmar-WinResearcher', 'Chinese_Mandarin-HZ', 'Chinese_Mandarin-UTF8', 'Czech-Latin2-err', 'Esperanto-T61', 'Gujarati-UTF8', 'Hungarian_Magyar-Unicode', 'Japanese_Nihongo-JIS', 'Lao-UTF8', 'Magahi-Agra', 'Magahi-UTF8', 'Marathi-UTF8', 'Navaho_Dine-Navajo-Navaho-font', 'Russian_Russky-UTF8~', 'Tamil-UTF8', 'Tigrinya_Tigrigna-VG2Main', 'Vietnamese-TCVN', 'Vietnamese-VIQR', 'Vietnamese-VPS'}
__init__(root='udhr')[source]

Construct a new plaintext corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/usr/local/share/nltk_data/corpora/webtext/'
>>> reader = PlaintextCorpusReader(root, '.*\.txt') 
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • word_tokenizer – Tokenizer for breaking sentences or paragraphs into words.

  • sent_tokenizer – Tokenizer for breaking paragraphs into sentences.

  • para_block_reader – The block reader used to divide the corpus into paragraph blocks.
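
For example (fileids encode the language and the original encoding of the source file, e.g. 'English-Latin1'):

from nltk.corpus import udhr

print(udhr.fileids()[:3])
print(udhr.words('English-Latin1')[:6])
# e.g. ['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble']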

class nltk.corpus.reader.UnicharsCorpusReader[source]

Bases: WordListCorpusReader

This class is used to read lists of characters from the Perl Unicode Properties (see https://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from https://search.cpan.org/~bdfoy/Unicode-Tussle-1.11/lib/Unicode/Tussle.pm

available_categories = ['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
chars(category=None, fileids=None)[source]

This method returns a list of characters from the Perl Unicode Properties. They are very useful when porting Perl tokenizers to Python.

>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')[:5] == [u'(', u'[', u'{', u'༺', u'༼']
True
>>> pup.chars('Currency_Symbol')[:5] == [u'$', u'¢', u'£', u'¤', u'¥']
True
>>> pup.available_categories
['Close_Punctuation', 'Currency_Symbol', 'IsAlnum', 'IsAlpha', 'IsLower', 'IsN', 'IsSc', 'IsSo', 'IsUpper', 'Line_Separator', 'Number', 'Open_Punctuation', 'Punctuation', 'Separator', 'Symbol']
Returns

a list of characters given the specific unicode character category

class nltk.corpus.reader.VerbnetCorpusReader[source]

Bases: XMLCorpusReader

An NLTK interface to the VerbNet verb lexicon.

From the VerbNet site: “VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet (Miller, 1990; Fellbaum, 1998), XTAG (XTAG Research Group, 2001), and FrameNet (Baker et al., 1998).”

For details about VerbNet see: https://verbs.colorado.edu/~mpalmer/projects/verbnet.html

__init__(root, fileids, wrap_etree=False)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

classids(lemma=None, wordnetid=None, fileid=None, classid=None)[source]

Return a list of the VerbNet class identifiers. If a file identifier is specified, then return only the VerbNet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only VerbNet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified VerbNet class. If nothing is specified, return all classids within VerbNet

fileids(vnclass_ids=None)[source]

Return a list of fileids that make up this corpus. If vnclass_ids is specified, then return the fileids that make up the specified VerbNet class(es).

frames(vnclass)[source]

Given a VerbNet class, this method returns VerbNet frames

The members returned are: 1) Example 2) Description 3) Syntax 4) Semantics

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

Returns

frames - a list of frame dictionaries

lemmas(vnclass=None)[source]

Return a list of all verb lemmas that appear in any class, or in the classid if specified.

longid(shortid)[source]

Returns longid of a VerbNet class

Given a short VerbNet class identifier (e.g. ‘37.10’), map it to a long id (e.g. ‘confess-37.10’). If shortid is already a long id, then return it as-is.

pprint(vnclass)[source]

Returns pretty printed version of a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class.

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

pprint_frames(vnclass, indent='')[source]

Returns pretty version of all frames in a VerbNet class

Return a string containing a pretty-printed representation of the list of frames within the VerbNet class.

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

pprint_members(vnclass, indent='')[source]

Returns pretty printed version of members in a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s member verbs.

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

pprint_subclasses(vnclass, indent='')[source]

Returns pretty printed version of subclasses of VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s subclasses.

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

pprint_themroles(vnclass, indent='')[source]

Returns pretty printed version of thematic roles in a VerbNet class

Return a string containing a pretty-printed representation of the given VerbNet class’s thematic roles.

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

shortid(longid)[source]

Returns shortid of a VerbNet class

Given a long VerbNet class identifier (e.g. ‘confess-37.10’), map it to a short id (e.g. ‘37.10’). If longid is already a short id, then return it as-is.

subclasses(vnclass)[source]

Returns subclass ids, if any exist

Given a VerbNet class, this method returns subclass ids (if they exist) in a list of strings.

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

Returns

list of subclasses

themroles(vnclass)[source]

Returns thematic roles participating in a VerbNet class

The members returned as part of each role are: 1) Type 2) Modifiers

Parameters

vnclass – A VerbNet class identifier; or an ElementTree containing the xml contents of a VerbNet class.

Returns

themroles: A list of thematic roles in the VerbNet class

vnclass(fileid_or_classid)[source]

Returns VerbNet class ElementTree

Return an ElementTree containing the xml for the specified VerbNet class.

Parameters

fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as 'put-9.1.xml'), or a VerbNet class identifier (such as 'put-9.1') or a short VerbNet class identifier (such as '9.1').

wordnetids(vnclass=None)[source]

Return a list of all wordnet identifiers that appear in any class, or in classid if specified.
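
A usage sketch covering the main entry points (the class id ‘give-13.1’ is used for illustration):

from nltk.corpus import verbnet

# class identifiers containing a given lemma
print(verbnet.classids(lemma='give')[:3])

# fetch a class as an ElementTree element, by long id, short id or filename
vn_class = verbnet.vnclass('give-13.1')
print([m.attrib['name'] for m in vn_class.findall('MEMBERS/MEMBER')][:5])

# human-readable summary of members, thematic roles and frames
print(verbnet.pprint('give-13.1'))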

class nltk.corpus.reader.WordListCorpusReader[source]

Bases: CorpusReader

List of words, one per line. Blank lines are ignored.

words(fileids=None, ignore_lines_startswith='\n')[source]
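
Several bundled corpora are accessed through this reader, for example the words and stopwords word lists:

from nltk.corpus import stopwords, words

print(len(words.words()))               # size of the Unix word list
print(stopwords.words('english')[:5])   # ['i', 'me', 'my', 'myself', 'we']
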
class nltk.corpus.reader.WordNetCorpusReader[source]

Bases: CorpusReader

A corpus reader used to access wordnet or its variants.

ADJ = 'a'
ADJ_SAT = 's'
ADV = 'r'
MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}
NOUN = 'n'
VERB = 'v'
__init__(root, omw_reader)[source]

Construct a new wordnet corpus reader, with the given root directory.

add_exomw()[source]

Add languages from Extended OMW

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> wn.add_exomw()
>>> print(wn.synset('intrinsically.r.01').lemmas(lang="eng_wikt"))
[Lemma('intrinsically.r.01.per_se'), Lemma('intrinsically.r.01.as_such')]
add_omw()[source]
add_provs(reader)[source]

Add languages from Multilingual Wordnet to the provenance dictionary

all_eng_synsets(pos=None)[source]
all_lemma_names(pos=None, lang='eng')[source]

Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

all_omw_synsets(pos=None, lang=None)[source]
all_synsets(pos=None, lang='eng')[source]

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

citation(lang='eng')[source]

Return the contents of the citation.bib file (for the OMW). Use lang=lang to get the citation for an individual language.

custom_lemmas(tab_file, lang)[source]

Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.

See the “Tab files” section at https://omwn.org/omw1.html for documentation on the Multilingual WordNet tab file format.

Parameters
  • tab_file – Tab file as a file or file-like object

  • lang (str) – ISO 639-3 code of the language of the tab file

digraph(inputs, rel=<function WordNetCorpusReader.<lambda>>, pos=None, maxdepth=-1, shapes=None, attr=None, verbose=False)[source]

Produce a graphical representation from ‘inputs’ (a list of start nodes, which can be a mix of Synsets, Lemmas and/or words), and a synset relation, for drawing with the ‘dot’ graph visualisation program from the Graphviz package.

Return a string in the DOT graph file language, which can then be converted to an image by nltk.parse.dependencygraph.dot2img(dot_string).

Optional Parameters
  • rel – Wordnet synset relation

  • pos – for words, restricts Part of Speech to ‘n’, ‘v’, ‘a’ or ‘r’

  • maxdepth – limit the longest path

  • shapes – dictionary of strings that trigger a specified shape

  • attr – dictionary with global graph attributes

  • verbose – warn about cycles

>>> from nltk.corpus import wordnet as wn
>>> print(wn.digraph([wn.synset('dog.n.01')]))
digraph G {
"Synset('animal.n.01')" -> "Synset('organism.n.01')";
"Synset('canine.n.02')" -> "Synset('carnivore.n.01')";
"Synset('carnivore.n.01')" -> "Synset('placental.n.01')";
"Synset('chordate.n.01')" -> "Synset('animal.n.01')";
"Synset('dog.n.01')" -> "Synset('canine.n.02')";
"Synset('dog.n.01')" -> "Synset('domestic_animal.n.01')";
"Synset('domestic_animal.n.01')" -> "Synset('animal.n.01')";
"Synset('living_thing.n.01')" -> "Synset('whole.n.02')";
"Synset('mammal.n.01')" -> "Synset('vertebrate.n.01')";
"Synset('object.n.01')" -> "Synset('physical_entity.n.01')";
"Synset('organism.n.01')" -> "Synset('living_thing.n.01')";
"Synset('physical_entity.n.01')" -> "Synset('entity.n.01')";
"Synset('placental.n.01')" -> "Synset('mammal.n.01')";
"Synset('vertebrate.n.01')" -> "Synset('chordate.n.01')";
"Synset('whole.n.02')" -> "Synset('object.n.01')";
}
disable_custom_lemmas(lang)[source]

Prevent synsets for the given language from being mistakenly added.

doc(file='README', lang='eng')[source]

Return the contents of the readme, license, or citation file. Use lang=lang to get the file for an individual language.

get_version()[source]
ic(corpus, weight_senses_equally=False, smoothing=1.0)[source]

Creates an information content lookup dictionary from a corpus.

Parameters
  • corpus (CorpusReader) – The corpus from which we create an information content dictionary.

  • weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)

  • smoothing (float) – How much do we smooth synset counts (default is 1.0)

Returns

An information content dictionary
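
For example, an information content dictionary can be built directly from another corpus reader and then passed to the IC-based similarity methods (a sketch):

>>> from nltk.corpus import wordnet as wn
>>> from nltk.corpus import genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)
>>> score = wn.res_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'), genesis_ic)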

index_sense(version=None)[source]

Read sense key to synset id mapping from index.sense file in corpus directory

jcn_similarity(synset1, synset2, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters
  • other (Synset) – The Synset that this Synset is being compared to.

  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns

A float score denoting the similarity of the two Synset objects.

langs()[source]

return a list of languages supported by Multilingual Wordnet

lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters
  • other (Synset) – The Synset that this Synset is being compared to.

  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

lemma(name, lang='eng')[source]

Return lemma object that matches the name

lemma_count(lemma)[source]

Return the frequency count for this Lemma

lemma_from_key(key)[source]
lemmas(lemma, pos=None, lang='eng')[source]

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

license(lang='eng')[source]

Return the contents of the LICENSE file (for the OMW). Use lang=lang to get the license for an individual language.

lin_similarity(synset1, synset2, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters
  • other (Synset) – The Synset that this Synset is being compared to.

  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.

map_to_many()[source]
map_to_one()[source]
map_wn30()[source]

Mapping from Wordnet 3.0 to currently loaded Wordnet version

morphy(form, pos=None, check_exceptions=True)[source]

Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
of2ss(of)[source]

Take an id of the form ‘<offset>-<pos>’ (as returned by ss2of()) and return the corresponding synset.

path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters
  • other (Synset) – The Synset that this Synset is being compared to.

  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
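
For example (the same score is also available as a method on Synset objects):

>>> from nltk.corpus import wordnet as wn
>>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
>>> wn.path_similarity(dog, cat)
0.2
>>> dog.path_similarity(cat)
0.2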

readme(lang='eng')[source]

Return the contents of the README file (for the OMW). Use lang=lang to get the readme for an individual language.

res_similarity(synset1, synset2, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters
  • other (Synset) – The Synset that this Synset is being compared to.

  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

ss2of(ss)[source]

return the ID of the synset

synonyms(word, lang='eng')[source]

return nested list with the synonyms of the different senses of word in the given language

synset(name)[source]
synset_from_pos_and_offset(pos, offset)[source]
  • pos: The synset’s part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB (‘a’, ‘s’, ‘r’, ‘n’, or ‘v’).

  • offset: The byte offset of this synset in the WordNet dict file for this pos.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_pos_and_offset('n', 1740))
Synset('entity.n.01')
synset_from_sense_key(sense_key)[source]

Retrieves synset based on a given sense_key. Sense keys can be obtained from lemma.key()

From https://wordnet.princeton.edu/documentation/senseidx5wn: A sense_key is represented as:

lemma % lex_sense (e.g. 'dog%1:18:01::')

where lex_sense is encoded as:

ss_type:lex_filenum:lex_id:head_word:head_id
Lemma

ASCII text of word/collocation, in lower case

Ss_type

synset type for the sense (1 digit int) The synset type is encoded as follows:

1    NOUN
2    VERB
3    ADJECTIVE
4    ADVERB
5    ADJECTIVE SATELLITE
Lex_filenum

name of lexicographer file containing the synset for the sense (2 digit int)

Lex_id

when paired with lemma, uniquely identifies a sense in the lexicographer file (2 digit int)

Head_word

lemma of the first word in satellite’s head synset Only used if sense is in an adjective satellite synset

Head_id

uniquely identifies sense in a lexicographer file when paired with head_word Only used if head_word is present (2 digit int)

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_sense_key("drive%1:04:03::"))
Synset('drive.n.06')
>>> print(wn.synset_from_sense_key("driving%1:04:03::"))
Synset('drive.n.06')
synsets(lemma, pos=None, lang='eng', check_exceptions=True)[source]

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
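
For example, with the standard English WordNet:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog', pos=wn.VERB)
[Synset('chase.v.01')]
>>> wn.synset('dog.n.01').lemma_names()
['dog', 'domestic_dog', 'Canis_familiaris']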

words(lang='eng')[source]

return lemmas of the given language as list of words

wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, though not always for nouns.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters
  • other (Synset) – The Synset that this Synset is being compared to.

  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

class nltk.corpus.reader.WordNetICCorpusReader[source]

Bases: CorpusReader

A corpus reader for the WordNet information content corpus.

__init__(root, fileids)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

ic(icfile)[source]

Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

Parameters

icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”)

Returns

An information content dictionary
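
For example, the standard information content files distributed with the wordnet_ic package can be loaded and then passed as the ic argument of res_similarity(), jcn_similarity() or lin_similarity():

>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')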

class nltk.corpus.reader.XMLCorpusReader[source]

Bases: CorpusReader

Corpus reader for corpora whose documents are xml files.

Note that the XMLCorpusReader constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.

__init__(root, fileids, wrap_etree=False)[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

words(fileid=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes – i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Returns

the given file’s text nodes as a list of words and punctuation symbols

Return type

list(str)

xml(fileid=None)[source]
class nltk.corpus.reader.YCOECorpusReader[source]

Bases: CorpusReader

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.

__init__(root, encoding='utf8')[source]
Parameters
  • root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.

  • fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.

  • encoding

    The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:

    • A string: encoding is the encoding name for all files.

    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.

    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.

    • None: the file contents of all files will be processed using non-unicode byte strings.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

documents(fileids=None)[source]

Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.

fileids(documents=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that store the given document(s) if specified.

paras(documents=None)[source]
parsed_sents(documents=None)[source]
sents(documents=None)[source]
tagged_paras(documents=None)[source]
tagged_sents(documents=None)[source]
tagged_words(documents=None)[source]
words(documents=None)[source]
nltk.corpus.reader.find_corpus_fileids(root, regexp)[source]
nltk.corpus.reader.tagged_treebank_para_block_reader(stream)[source]