nltk.corpus.reader package

Submodules

nltk.corpus.reader.aligned module

class nltk.corpus.reader.aligned.AlignedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), alignedsent_block_reader=read_alignedsent_block, encoding='latin1')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
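As a rough illustration of the kind of data such a reader deals with, the sketch below parses one aligned block using only the standard library. It assumes a three-line layout (source sentence, target sentence, then a line of "i-j" index pairs); that layout, the helper name parse_aligned_block, and the sample sentences are assumptions made for this example and are not taken from the text above.

```python
def parse_aligned_block(lines):
    """Parse one (source, target, alignment) triple into plain Python data.

    Assumed layout: whitespace-separated tokens on the first two lines and
    whitespace-separated "i-j" index pairs on the third.
    """
    source, target, alignment = (line.strip() for line in lines)
    pairs = []
    for item in alignment.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return source.split(), target.split(), pairs


block = [
    "das Haus ist klein",
    "the house is small",
    "0-0 1-1 2-2 3-3",
]
words, mots, pairs = parse_aligned_block(block)
# Show which source word is linked to which target word.
for i, j in pairs:
    print(words[i], "->", mots[j])
```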

aligned_sents(fileids=None)[source]
Returns:the given file(s) as a list of AlignedSent objects.
Return type:list(AlignedSent)
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.aligned.AlignedSentCorpusView(corpus_file, encoding, aligned, group_by_sent, word_tokenizer, sent_tokenizer, alignedsent_block_reader)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A specialized corpus view for aligned sentences. AlignedSentCorpusView objects are typically created by AlignedCorpusReader (not directly by nltk users).

read_block(stream)[source]

nltk.corpus.reader.api module

API for corpus readers.

class nltk.corpus.reader.api.CategorizedCorpusReader(kwargs)[source]

Bases: builtins.object

A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.

Subclasses are expected to:

  • Call __init__() to set up the mapping.
  • Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.
categories(fileids=None)[source]

Return a list of the categories that are defined for this corpus, or for the file(s) if it is given.

fileids(categories=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that make up the given category(s) if specified.
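The relationship between categories() and fileids() can be pictured with a small standard-library sketch. The class TinyCategorizedIndex and its cat_pattern argument are hypothetical names chosen for this example; the sketch only mimics the two lookups described above and is not the mixin's actual implementation.

```python
import re


class TinyCategorizedIndex:
    """Hypothetical stand-in for a fileid <-> category mapping."""

    def __init__(self, fileids, cat_pattern):
        # Map each fileid to the set of categories extracted from its name.
        self._f2c = {}
        for fileid in fileids:
            match = re.match(cat_pattern, fileid)
            if match:
                self._f2c.setdefault(fileid, set()).add(match.group(1))

    def categories(self, fileids=None):
        if fileids is None:
            fileids = self._f2c.keys()
        elif isinstance(fileids, str):
            fileids = [fileids]
        return sorted(set().union(*(self._f2c.get(f, set()) for f in fileids)))

    def fileids(self, categories=None):
        if categories is None:
            return sorted(self._f2c)
        if isinstance(categories, str):
            categories = [categories]
        wanted = set(categories)
        return sorted(f for f, cats in self._f2c.items() if cats & wanted)


index = TinyCategorizedIndex(
    ["news/a.txt", "news/b.txt", "fiction/c.txt"], r"(\w+)/"
)
print(index.categories())
print(index.fileids("news"))
```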

class nltk.corpus.reader.api.CorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: builtins.object

A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its file identifier, which is the relative path to the file from the root directory.

A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as words() (for a list of words) and parsed_sents() (for a list of parsed sentences). Called with no arguments, these methods return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such as fileids or categories, which can be used to select which portion of the corpus should be returned.

abspath(fileid)[source]

Return the absolute path for the given file.

Parameters:fileid (str) – The file identifier for the file whose path should be returned.
Return type:PathPointer
abspaths(fileids=None, include_encoding=False, include_fileid=False)[source]

Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.

Parameters:
  • fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.
  • include_encoding – If true, then return a list of (path_pointer, encoding) tuples.
Return type:

list(PathPointer)
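The way the fileids argument is interpreted (None, a single identifier, or a list) can be sketched as follows. The function resolve_paths is a hypothetical helper written for this example; it returns plain string paths rather than PathPointer objects and ignores the encoding options.

```python
import os


def resolve_paths(root, all_fileids, fileids=None):
    """Return a list of paths, whatever form the fileids argument takes."""
    if fileids is None:
        fileids = all_fileids          # None selects every file in the corpus
    elif isinstance(fileids, str):
        fileids = [fileids]            # a single identifier still yields a list
    return [os.path.join(root, fileid) for fileid in fileids]


corpus_files = ["a.txt", "b.txt"]
print(resolve_paths("corpus_root", corpus_files))
print(resolve_paths("corpus_root", corpus_files, "a.txt"))
```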

encoding(file)[source]

Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.

ensure_loaded()[source]

Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).

fileids()[source]

Return a list of file identifiers for the files that make up this corpus.

open(file)[source]

Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.

Parameters:file – The file identifier of the file to read.
readme()[source]

Return the contents of the corpus README file, if it exists.

root

The directory where this corpus is stored.

Type:PathPointer
unicode_repr()
class nltk.corpus.reader.api.SyntaxCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:

  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of lists of words.
  • _tag, which takes a block and returns a list of lists of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
parsed_sents(fileids=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]

nltk.corpus.reader.bnc module

Corpus reader for the XML version of the British National Corpus.

class nltk.corpus.reader.bnc.BNCCorpusReader(root, fileids, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
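To see which relative paths a fileids pattern like the one above would select, the following sketch applies it with the standard re module. It assumes the pattern must match the whole relative path; the candidate file names are made up for illustration.

```python
import re

pattern = r'[A-K]/\w*/\w*\.xml'

# Hypothetical relative paths under the corpus root.
candidates = [
    "A/A0/A00.xml",
    "K/KS/KSW.xml",
    "A/A0/A00.txt",   # wrong extension
    "Z/Z0/Z00.xml",   # top-level directory outside A-K
]

selected = [path for path in candidates if re.fullmatch(pattern, path)]
print(selected)
```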
sents(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_words(fileids=None, c5=False, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
words(fileids=None, strip_space=True, stem=False)[source]
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
class nltk.corpus.reader.bnc.BNCSentence(num, items)[source]

Bases: builtins.list

A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).
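A list subclass carrying one extra attribute can be sketched in a few lines. NumberedSentence is a hypothetical name used here so that the example is not mistaken for the library class, and the sample identifier is invented.

```python
class NumberedSentence(list):
    """A list of words that also remembers a sentence identifier."""

    def __init__(self, num, items):
        super().__init__(items)
        self.num = num


sent = NumberedSentence("12", ["Hello", "world", "."])
print(sent.num, list(sent))
```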

class nltk.corpus.reader.bnc.BNCWordView(fileid, sent, tag, strip_space, stem)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream backed corpus view specialized for use with the BNC corpus.

author = None

Author of the document.

editor = None

Editor

handle_elt(elt, context)[source]
handle_header(elt, context)[source]
handle_sent(elt)[source]
handle_word(elt)[source]
resps = None

Statement of responsibility

tags_to_ignore = {'align', 'unclear', 'shift', 'pause', 'gap', 'pb', 'vocal', 'event'}

These tags are ignored. For their description refer to the technical documentation, for example, http://www.natcorp.ox.ac.uk/docs/URG/ref-vocal.html

title = None

Title of the document.

nltk.corpus.reader.bracket_parse module

Corpus reader for corpora that consist of parenthesis-delineated parse trees.

class nltk.corpus.reader.bracket_parse.AlpinoCorpusReader(root, encoding='ISO-8859-1', tagset=None)[source]

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Reader for the Alpino Dutch Treebank.

class nltk.corpus.reader.bracket_parse.BracketParseCorpusReader(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for corpora that consist of parenthesis-delineated parse trees.

class nltk.corpus.reader.bracket_parse.CategorizedBracketParseCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>

paras(fileids=None, categories=None)[source]
parsed_paras(fileids=None, categories=None)[source]
parsed_sents(fileids=None, categories=None)[source]
parsed_words(fileids=None, categories=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None)[source]
tagged_paras(fileids=None, categories=None, tagset=None)[source]
tagged_sents(fileids=None, categories=None, tagset=None)[source]
tagged_words(fileids=None, categories=None, tagset=None)[source]
words(fileids=None, categories=None)[source]

nltk.corpus.reader.chasen module

class nltk.corpus.reader.chasen.ChasenCorpusReader(root, fileids, encoding='utf8', sent_splitter=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

paras(fileids=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_paras(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.chasen.ChasenCorpusView(corpus_file, encoding, tagged, group_by_sent, group_by_para, sent_splitter=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A specialized corpus view for ChasenReader. Similar to TaggedCorpusView, but it uses a fixed word tokenizer and sentence tokenizer.

read_block(stream)[source]

Reads one paragraph at a time.

nltk.corpus.reader.chasen.demo()[source]
nltk.corpus.reader.chasen.test()[source]

nltk.corpus.reader.childes module

Corpus reader for the XML version of the CHILDES corpus.

class nltk.corpus.reader.childes.CHILDESCorpusReader(root, fileids, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at http://childes.psy.cmu.edu/. The XML version of CHILDES is located at http://childes.psy.cmu.edu/data-xml/. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/).

For access to the file text use the usual nltk functions, words(), sents(), tagged_words() and tagged_sents().

MLU(fileids=None, speaker='CHI')[source]
Returns:the given file(s) as a floating number
Return type:list(float)
age(fileids=None, speaker='CHI', month=False)[source]
Returns:the given file(s) as string or int
Return type:list or int
Parameters:month – If true, return months instead of year-month-date
childes_url_base = 'http://childes.psy.cmu.edu/browser/index.php?url='
convert_age(age_year)[source]

Calculate age in months from a string in CHILDES format.
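A minimal sketch of such a conversion is given below. It assumes age strings of the form "P<years>Y<months>M<days>D" with an optional day part, and it rounds up to the next month when the day count exceeds 15; both the format and the rounding rule are assumptions of this example, and age_in_months is a hypothetical helper rather than the method itself.

```python
import re


def age_in_months(age):
    """Convert a string such as 'P2Y6M14D' to a whole number of months."""
    match = re.match(r"P(\d+)Y(\d+)M(?:(\d+)D)?", age)
    if match is None:
        raise ValueError("unrecognised age string: %r" % age)
    years, months, days = match.groups()
    total = int(years) * 12 + int(months)
    # Assumed rounding rule: more than half a month counts as a full month.
    if days is not None and int(days) > 15:
        total += 1
    return total


print(age_in_months("P2Y6M14D"))
```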

corpus(fileids=None)[source]
Returns:the given file(s) as a dict of (corpus_property_key, value)
Return type:list(dict)
participants(fileids=None)[source]
Returns:the given file(s) as a dict of (participant_property_key, value)
Return type:list(dict)
sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
webview_file(fileid, urlbase=None)[source]

Map a corpus file to its web version on the CHILDES website, and open it in a web browser.

The complete URL to be used is:
childes.childes_url_base + urlbase + fileid.replace('.xml', '.cha')

If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???

The function first checks (as a special case) whether “Eng-USA” is on the path consisting of <corpus root>+fileid; then whether “childes”, possibly followed by “data-xml”, appears. If neither is found, the unmodified fileid is used as-is. If the result is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase='Eng-USA/Cornell'.
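The URL formula can be tried out with a small helper. build_webview_url is a hypothetical function that concatenates the three parts literally, as in the formula; whether a separator is needed between urlbase and fileid is not stated, so the example passes a urlbase that already ends in a slash, and the file name is invented.

```python
CHILDES_URL_BASE = 'http://childes.psy.cmu.edu/browser/index.php?url='


def build_webview_url(urlbase, fileid, url_root=CHILDES_URL_BASE):
    """Concatenate url_root + urlbase + fileid with .xml swapped for .cha."""
    return url_root + urlbase + fileid.replace('.xml', '.cha')


url = build_webview_url('Eng-USA/Cornell/', 'sample01.xml')
print(url)
```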

words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)[source]
Returns:

the given file(s) as a list of words

Return type:

list(str)

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
nltk.corpus.reader.childes.demo(corpus_root=None)[source]

The CHILDES corpus should be manually downloaded and saved to [NLTK_Data_Dir]/corpora/childes/

nltk.corpus.reader.chunked module

A reader for corpora that contain chunked (and optionally tagged) documents.

class nltk.corpus.reader.chunked.ChunkedCorpusReader(root, fileids, extension='', str2chunktree=tagstr2tree, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=read_blankline_block, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.
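The three default steps (split paragraphs on blank lines, take one sentence per line, turn each sentence string into a shallow chunk structure) can be sketched with the standard library alone. The bracketed "word/TAG" notation, the fixed "NP" label, and both helper names are assumptions of this example; the sketch returns nested lists and tuples instead of Tree objects and expects every token to carry a tag.

```python
import re

TOKEN = re.compile(r"\[|\]|[^\[\]\s]+")


def parse_chunked_sentence(line, chunk_label="NP"):
    """Turn '[ the/DT cat/NN ] sat/VBD' into a shallow, list-based structure."""
    result, chunk = [], None
    for token in TOKEN.findall(line):
        if token == "[":
            chunk = []                       # open a new chunk
        elif token == "]":
            if chunk is not None:
                result.append((chunk_label, chunk))
            chunk = None                     # back to sentence level
        else:
            word, _, tag = token.rpartition("/")
            (result if chunk is None else chunk).append((word, tag))
    return result


def read_chunked(text):
    """Paragraphs are separated by blank lines; sentences are one per line."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    return [
        [parse_chunked_sentence(s) for s in p.splitlines() if s.strip()]
        for p in paragraphs
    ]


sample = "[ the/DT cat/NN ] sat/VBD\n[ it/PRP ] purred/VBD\n\n[ the/DT end/NN ]\n"
paras = read_chunked(sample)
print(paras[0][0])
```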

chunked_paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(list(Tree))
chunked_sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(Tree)
chunked_words(fileids=None)[source]
Returns:the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.
Return type:list(tuple(str,str) and Tree)
paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.chunked.ChunkedCorpusView(fileid, encoding, tagged, group_by_sent, group_by_para, chunked, str2chunktree, sent_tokenizer, para_block_reader)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

nltk.corpus.reader.cmudict module

The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6] ftp://ftp.cs.cmu.edu/project/speech/dict/ Copyright 1998 Carnegie Mellon University

File Format: Each line consists of an uppercased word, a counter (for alternative pronunciations), and a transcription. Vowels are marked for stress (1=primary, 2=secondary, 0=no stress). E.g.: NATURAL 1 N AE1 CH ER0 AH0 L
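The line format can be made concrete with a short parsing sketch based on the example line above. parse_cmudict_line and stress_pattern are hypothetical helpers written for this illustration; the word is lowercased here only to match the lowercase dictionary keys that dict() is described as returning.

```python
def parse_cmudict_line(line):
    """Split a line into (word, counter, phonemes)."""
    fields = line.split()
    word = fields[0].lower()
    counter = int(fields[1])
    phones = fields[2:]
    return word, counter, phones


def stress_pattern(phones):
    """Collect the stress digits carried by the vowel phonemes."""
    return [int(p[-1]) for p in phones if p[-1].isdigit()]


word, counter, phones = parse_cmudict_line("NATURAL 1 N AE1 CH ER0 AH0 L")
print(word, counter, phones, stress_pattern(phones))
```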

The dictionary contains 127069 entries. Of these, 119400 words are assigned a unique pronunciation, 6830 words have two pronunciations, and 839 words have three or more pronunciations. Many of these are fast-speech variants.

Phonemes: There are 39 phonemes, as shown below:

Phoneme Example Translation    Phoneme Example Translation
------- ------- -----------    ------- ------- -----------
AA      odd     AA D           AE      at      AE T
AH      hut     HH AH T        AO      ought   AO T
AW      cow     K AW           AY      hide    HH AY D
B       be      B IY           CH      cheese  CH IY Z
D       dee     D IY           DH      thee    DH IY
EH      Ed      EH D           ER      hurt    HH ER T
EY      ate     EY T           F       fee     F IY
G       green   G R IY N       HH      he      HH IY
IH      it      IH T           IY      eat     IY T
JH      gee     JH IY          K       key     K IY
L       lee     L IY           M       me      M IY
N       knee    N IY           NG      ping    P IH NG
OW      oat     OW T           OY      toy     T OY
P       pee     P IY           R       read    R IY D
S       sea     S IY           SH      she     SH IY
T       tea     T IY           TH      theta   TH EY T AH
UH      hood    HH UH D        UW      two     T UW
V       vee     V IY           W       we      W IY
Y       yield   Y IY L D       Z       zee     Z IY
ZH      seizure S IY ZH ER

class nltk.corpus.reader.cmudict.CMUDictCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

dict()[source]
Returns:the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.

entries()[source]
Returns:the cmudict lexicon as a list of entries containing (word, transcriptions) tuples.

raw()[source]
Returns:the cmudict lexicon as a raw string.
words()[source]
Returns:a list of all words defined in the cmudict lexicon.
nltk.corpus.reader.cmudict.read_cmudict_block(stream)[source]

nltk.corpus.reader.conll module

Read CoNLL-style chunk files.

class nltk.corpus.reader.conll.ConllChunkCorpusReader(root, fileids, chunk_types, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.conll.ConllCorpusReader

A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.

class nltk.corpus.reader.conll.ConllCorpusReader(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.Tree'>, tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus.
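The grid layout can be illustrated with a standard-library sketch in which a columntypes tuple names the columns of each row. read_conll_grid is a hypothetical helper and the two sample sentences are invented; the sketch only shows how sentences, rows and columns relate, not how the reader is implemented.

```python
import re


def read_conll_grid(text, columntypes):
    """Return one list of {column type: value} rows per sentence."""
    sentences = []
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        sentences.append([dict(zip(columntypes, row)) for row in rows])
    return sentences


sample = "He PRP B-NP\nruns VBZ B-VP\n\nStop VB B-VP\n"
grid = read_conll_grid(sample, ("words", "pos", "chunk"))

# Project individual columns out of the grid.
words = [[row["words"] for row in sent] for sent in grid]
tagged = [[(row["words"], row["pos"]) for row in sent] for sent in grid]
print(words)
print(tagged)
```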

@todo: Add support for reading from corpora where different parallel files contain different columns.
@todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (e.g., words() and parsed_sents() could both share the same grid corpus view object).
@todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (e.g., parsed_documents()).
CHUNK = 'chunk'

column type for chunk structures

COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')

A list of all column types supported by the conll corpus reader.

IGNORE = 'ignore'

column type for column that should be ignored

NE = 'ne'

column type for named entities

POS = 'pos'

column type for part-of-speech tags

SRL = 'srl'

column type for semantic role labels

TREE = 'tree'

column type for parse trees

WORDS = 'words'

column type for words

chunked_sents(fileids=None, chunk_types=None, tagset=None)[source]
chunked_words(fileids=None, chunk_types=None, tagset=None)[source]
iob_sents(fileids=None, tagset=None)[source]
Returns:a list of lists of word/tag/IOB tuples
Return type:list(list)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
iob_words(fileids=None, tagset=None)[source]
Returns:a list of word/tag/IOB tuples
Return type:list(tuple)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
parsed_sents(fileids=None, pos_in_tree=None, tagset=None)[source]
raw(fileids=None)[source]
sents(fileids=None)[source]
srl_instances(fileids=None, pos_in_tree=None, flatten=True)[source]
srl_spans(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.conll.ConllSRLInstance(tree, verb_head, verb_stem, roleset, tagged_spans)[source]

Bases: builtins.object

An SRL instance from a CoNLL corpus, which identifies and provides labels for the arguments of a single verb.

arguments = None

A list of (argspan, argid) tuples, specifying the location and type for each of the arguments identified by this instance. argspan is a tuple (start, end), indicating that the argument consists of words[start:end].
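The span convention can be checked with plain list slicing. The sentence and the argument labels below are invented for illustration.

```python
words = ["She", "turned", "on", "the", "light"]

# Hypothetical (argspan, argid) tuples; each span is (start, end).
arguments = [((0, 1), "A0"), ((3, 5), "A1")]

# Each argument covers words[start:end], so end is exclusive.
labelled = {argid: words[start:end] for (start, end), argid in arguments}
print(labelled)
```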

pprint()[source]
tagged_spans = None

A list of (span, id) tuples, specifying the location and type for each of the arguments, as well as the verb pieces, that make up this instance.

tree = None

The parse tree for the sentence containing this instance.

unicode_repr()
verb = None

A list of the word indices of the words that compose the verb whose arguments are identified by this instance. This will contain multiple word indices when multi-word verbs are used (e.g. ‘turn on’).

verb_head = None

The word index of the head word of the verb whose arguments are identified by this instance. E.g., for a sentence that uses the verb ‘turn on,’ verb_head will be the word index of the word ‘turn’.

words = None

A list of the words in the sentence containing this instance.

class nltk.corpus.reader.conll.ConllSRLInstanceList(tree, instances=())[source]

Bases: builtins.list

Set of instances for a single sentence

pprint(include_tree=False)[source]
unicode_repr

Return repr(self).

nltk.corpus.reader.dependency module

class nltk.corpus.reader.dependency.DependencyCorpusReader(root, fileids, encoding='utf8', word_tokenizer=TabTokenizer(), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=read_blankline_block)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

parsed_sents(fileids=None)[source]
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
tagged_sents(fileids=None)[source]
tagged_words(fileids=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.dependency.DependencyCorpusView(corpus_file, tagged, group_by_sent, dependencies, chunk_types=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

nltk.corpus.reader.framenet module

class nltk.corpus.reader.framenet.AttrDict(*args, **kwargs)[source]

Bases: builtins.dict

A class that wraps a dict and allows accessing the keys of the dict as if they were attributes. Taken from here:

>>> foo = {'a':1, 'b':2, 'c':3}
>>> bar = AttrDict(foo)
>>> pprint(dict(bar))
{'a': 1, 'b': 2, 'c': 3}
>>> bar.b
2
>>> bar.d = 4
>>> pprint(dict(bar))
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
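One possible way to get the behaviour shown in the doctest above is to route attribute access through the dictionary itself. AttrDictSketch is a hypothetical class written for this example and is not necessarily how the library class is implemented.

```python
class AttrDictSketch(dict):
    """A dict whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


bar = AttrDictSketch({'a': 1, 'b': 2, 'c': 3})
bar.d = 4
print(bar.b, dict(bar))
```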
unicode_repr()
class nltk.corpus.reader.framenet.FramenetCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

A corpus reader for the Framenet Corpus.

>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
annotated_document(fn_docid)[source]

Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the documents() function.

The dict that is returned from this function will contain the following keys:

  • ‘_type’ : ‘fulltextannotation’

  • ‘sentence’ : a list of sentences in the document
    • Each item in the list is a dict containing the following keys:
      • ‘ID’ : the ID number of the sentence

      • ‘_type’ : ‘sentence’

      • ‘text’ : the text of the sentence

      • ‘paragNo’ : the paragraph number

      • ‘sentNo’ : the sentence number

      • ‘docID’ : the document ID number

      • ‘corpID’ : the corpus ID number

      • ‘aPos’ : the annotation position

      • ‘annotationSet’ : a list of annotation layers for the sentence
        • Each item in the list is a dict containing the following keys:
          • ‘ID’ : the ID number of the annotation set

          • ‘_type’ : ‘annotationset’

          • ‘status’ : either ‘MANUAL’ or ‘UNANN’

          • ‘luName’ : (only if status is ‘MANUAL’)

          • ‘luID’ : (only if status is ‘MANUAL’)

          • ‘frameID’ : (only if status is ‘MANUAL’)

          • ‘frameName’: (only if status is ‘MANUAL’)

          • ‘layer’ : a list of labels for the layer
            • Each item in the layer is a dict containing the following keys:

              • ‘_type’: ‘layer’

              • ‘rank’

              • ‘name’

              • ‘label’ : a list of labels in the layer
                • Each item is a dict containing the following keys:
                  • ‘start’
                  • ‘end’
                  • ‘name’
                  • ‘feID’ (optional)
Parameters:fn_docid (int) – The Framenet id number of the document
Returns:Information about the annotated document
Return type:dict
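The nesting described above (sentences, then annotation sets, then layers, then labels) can be walked with four nested loops. The dictionary below is a hand-built stand-in that uses only key names from the list above; every value in it is invented, and iter_labels is a hypothetical helper.

```python
doc = {
    "_type": "fulltextannotation",
    "sentence": [
        {
            "ID": 1,
            "_type": "sentence",
            "text": "Kim left early .",
            "annotationSet": [
                {
                    "ID": 10,
                    "_type": "annotationset",
                    "status": "MANUAL",
                    "layer": [
                        {
                            "_type": "layer",
                            "rank": 1,
                            "name": "FE",
                            "label": [{"start": 0, "end": 2, "name": "Theme"}],
                        }
                    ],
                }
            ],
        }
    ],
}


def iter_labels(document):
    """Yield (sentence ID, layer name, label name, start, end) tuples."""
    for sent in document["sentence"]:
        for aset in sent["annotationSet"]:
            for layer in aset["layer"]:
                for label in layer["label"]:
                    yield (sent["ID"], layer["name"], label["name"],
                           label["start"], label["end"])


labels = list(iter_labels(doc))
print(labels)
```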
buildindexes()[source]

Build the internal indexes to make look-ups faster.

documents(name=None)[source]

Return a list of the annotated documents in Framenet.

Details for a specific annotated document can be obtained by passing the value of its ‘ID’ field to this class’s annotated_document() function.

>>> from nltk.corpus import framenet as fn
>>> len(fn.documents())
78
>>> set([x.corpname for x in fn.documents()])==set(['ANC', 'C-4', 'KBEval', 'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank', 'QA', 'SemAnno'])
True
Parameters:name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”.
Returns:A list of selected (or all) annotated documents
Return type:list of dicts, where each dict object contains the following keys:
  • ‘name’
  • ‘ID’
  • ‘corpid’
  • ‘corpname’
  • ‘description’
  • ‘filename’
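
The file-name convention described for the name parameter (corpus name, two underscores, document name) can be split apart as follows. This is a small sketch; split_document_filename is a hypothetical helper, not a method of the reader.

```python
def split_document_filename(filename):
    """Split 'corpus__document' into (corpus_name, document_name)."""
    corpus_name, separator, document_name = filename.partition("__")
    if not separator:
        raise ValueError("expected two underscores in %r" % filename)
    return corpus_name, document_name


parts = split_document_filename("LUCorpus-v0.3__20000410_nyt-NEW.xml")
```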
fe_relations()[source]

Obtain a list of frame element relations.

>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels)
10020
>>> PrettyDict(ferels[0], breakLines=True)
{'ID': 14642,
 '_type': 'ferelation',
 'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>,
 'subFE': <fe ID=11370 name=Degree>,
 'subFEName': 'Degree',
 'subFrame': <frame ID=1904 name=Lively_place>,
 'subID': 11370,
 'supID': 2271,
 'superFE': <fe ID=2271 name=Degree>,
 'superFEName': 'Degree',
 'superFrame': <frame ID=262 name=Abounding_with>,
 'type': <framerelationtype ID=1 name=Inheritance>}
Returns:A list of all of the frame element relations in framenet
Return type:list(dict)
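
As a rough illustration of working with the returned list, the relations can be tallied by relation type. The sketch below assumes each relation's 'type' value exposes a 'name' key, as the example output suggests; count_relations_by_type and the sample records are hypothetical and hand-built.

```python
from collections import Counter


def count_relations_by_type(ferels):
    """Count FE relations per relation-type name."""
    return Counter(rel["type"]["name"] for rel in ferels)


# Hand-built stand-ins for FE relation dicts (not real FrameNet data).
sample_ferels = [
    {"ID": 1, "type": {"ID": 1, "name": "Inheritance"}},
    {"ID": 2, "type": {"ID": 1, "name": "Inheritance"}},
    {"ID": 3, "type": {"ID": 4, "name": "Using"}},
]

counts = count_relations_by_type(sample_ferels)
```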
frame(fn_fid_or_fname, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s name or id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation')
frame (1494): Imposing_obligation...

The dict that is returned from this function will contain the following information about the Frame:

  • ‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)

  • ‘definition’ : textual definition of the Frame

  • ‘ID’ : the internal ID number of the Frame

  • ‘semTypes’ : a list of semantic types for this frame
    • Each item in the list is a dict containing the following keys:
      • ‘name’ : can be used with the semtype() function
      • ‘ID’ : can be used with the semtype() function
  • ‘lexUnit’ : a dict containing all of the LUs for this frame.

    The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)

  • ‘FE’ : a dict containing the Frame Elements that are part of this frame

    The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys

    • ‘definition’ : The definition of the FE

    • ‘name’ : The name of the FE e.g. ‘Body_system’

    • ‘ID’ : The id number

    • ‘_type’ : ‘fe’

    • ‘abbrev’ : Abbreviation e.g. ‘bod’

    • ‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”

    • ‘semType’ : if not None, a dict with the following two keys:
      • ‘name’ : name of the semantic type. Can be used with the semtype() function
      • ‘ID’ : id number of the semantic type. Can be used with the semtype() function

    • ‘requiresFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
    • ‘excludesFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
  • ‘frameRelation’ : a list of objects describing frame relations

  • ‘FEcoreSets’ : a list of Frame Element core sets for this frame
    • Each item in the list is a list of FE objects
Parameters:
  • fn_fid_or_fname (int or str) – The Framenet name or id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict
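
To show how the ‘FE’ dict described above might be used, the sketch below groups Frame Element names by their ‘coreType’. The helper name and the sample frame are hypothetical; the sample is hand-built to follow the documented keys and is not real FrameNet data.

```python
def fe_names_by_core_type(frame):
    """Group the frame's FE names by their 'coreType' value."""
    grouped = {}
    for fe_name, fe in frame["FE"].items():
        grouped.setdefault(fe["coreType"], []).append(fe_name)
    # Sort each group so the result does not depend on dict ordering.
    return {core_type: sorted(names) for core_type, names in grouped.items()}


# Hand-built frame following the documented layout (illustrative values only).
sample_frame = {
    "name": "Example_frame",
    "ID": 0,
    "FE": {
        "Agent": {"name": "Agent", "ID": 1, "_type": "fe", "coreType": "Core"},
        "Theme": {"name": "Theme", "ID": 2, "_type": "fe", "coreType": "Core"},
        "Time": {"name": "Time", "ID": 3, "_type": "fe", "coreType": "Peripheral"},
        "Reason": {"name": "Reason", "ID": 4, "_type": "fe",
                   "coreType": "Extra-Thematic"},
    },
}

grouped = fe_names_by_core_type(sample_frame)
```

With the corpus installed, the same helper could presumably be applied to the result of fn.frame('Medical_specialties').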

frame_by_id(fn_fid, ignorekeys=[])[source]

Get the details for the specified Frame using the frame’s id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fid (int) – The Framenet id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_by_name(fn_fname, ignorekeys=[], check_cache=True)[source]

Get the details for the specified Frame using the frame’s name.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fname (str) – The name of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_ids_and_names(name=None)[source]

Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.

frame_relation_types()[source]

Obtain a list of frame relation types.

>>> from nltk.corpus import framenet as fn
>>> frts = list(fn.frame_relation_types())
>>> isinstance(frts, list)
True
>>> len(frts)
9
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1,
 '_type': 'framerelationtype',
 'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...],
 'name': 'Inheritance',
 'subFrameName': 'Child',
 'superFrameName': 'Parent'}
Returns:A list of all of the frame relation types in framenet
Return type:list(dict)
frame_relations(frame=None, frame2=None, type=None)[source]
Parameters:
  • frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned
  • frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction
  • type – (optional) frame relation type (name or object); show only relations of this type
Returns:A list of all of the frame relations in framenet
Return type:list(dict)

>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels)
1676
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(373), breakLines=True)
[<Parent=Topic -- Using -> Child=Communication>,
 <Source=Discussion -- ReFraming_Mapping -> Target=Topic>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True)
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
frames(name=None)[source]

Obtain details for the frames whose names match the given pattern, or for all frames if no pattern is given.

>>> from nltk.corpus import framenet as fn
>>> len(fn.frames())
1019
>>> PrettyList(fn.frames(r'(?i)medical'), maxReprSize=0, breakLines=True)
[<frame ID=256 name=Medical_specialties>,
 <frame ID=257 name=Medical_instruments>,
 <frame ID=255 name=Medical_professionals>,
 <frame ID=239 name=Medical_conditions>]

A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.

We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).

FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:

  • Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
  • Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
  • Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.
  • Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.
Parameters:name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned.
Returns:A list of matching Frames (or all Frames).
Return type:list(AttrDict)
frames_by_lemma(pat)[source]

Returns a list of all frames that contain LUs in which the name attribute of the LU matches the given regular expression pat. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).

Note: if you are going to be doing a lot of this type of searching, you’d want to build an index that maps from lemmas to frames because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db.

>>> from nltk.corpus import framenet as fn
>>> fn.frames_by_lemma(r'(?i)a little')
[<frame ID=189 name=Quantity>, <frame ID=2001 name=Degree>]
Returns:A list of frame objects.
Return type:list(AttrDict)
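
The note above suggests building an index from lemmas to frames when many such searches are needed. One possible shape for that index is sketched below. It assumes each LU record exposes a ‘name’ of the form “lemma.POS” and a ‘frame’ with a ‘name’, as the lu() documentation lists; build_lemma_index is a hypothetical helper and the sample records are hand-built, so their frame assignments are illustrative only.

```python
from collections import defaultdict


def build_lemma_index(lexical_units):
    """Map each lower-cased lemma to the set of frame names it evokes."""
    index = defaultdict(set)
    for lu in lexical_units:
        # LU names look like 'lemma.POS'; split on the last dot only,
        # because the lemma itself may contain spaces or punctuation.
        lemma, separator, _pos = lu["name"].rpartition(".")
        if not separator:
            lemma = lu["name"]
        index[lemma.lower()].add(lu["frame"]["name"])
    return index


# Hand-built stand-ins for LU records (illustrative, not real FrameNet data).
sample_lus = [
    {"name": "a little.n", "frame": {"name": "Quantity"}},
    {"name": "a little.adv", "frame": {"name": "Degree"}},
    {"name": "bake.v", "frame": {"name": "Apply_heat"}},
    {"name": "bake.v", "frame": {"name": "Cooking_creation"}},
]

lemma_index = build_lemma_index(sample_lus)
```

Once built, a lookup is a plain dict access and no longer needs to scan the frame files. Note that this index matches exact lemmas, whereas frames_by_lemma() accepts a regular expression.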
lu(fn_luid, ignorekeys=[])[source]

Get information about a specific Lexical Unit using the id number fn_luid. This function reads the LU information from the xml file on disk each time it is called. You may want to cache this info if you plan to call this function with the same id number multiple times.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> pprint(list(map(PrettyDict, fn.lu(256).lexemes)))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]

The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:

  • ‘name’ : the name of the LU (e.g. ‘merger.n’)

  • ‘definition’ : textual definition of the LU

  • ‘ID’ : the internal ID number of the LU

  • ‘_type’ : ‘lu’

  • ‘status’ : e.g. ‘Created’

  • ‘frame’ : Frame that this LU belongs to

  • ‘POS’ : the part of speech of this LU (e.g. ‘N’)

  • ‘totalAnnotated’ : total number of examples annotated with this LU

  • ‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)

  • ‘sentenceCount’ : a dict with the following two keys:
    • ‘annotated’: number of sentences annotated with this LU
    • ‘total’ : total number of sentences with this LU
  • ‘lexemes’ : a list of dicts describing the lemma of this LU.

    Each dict in the list contains these keys:

    • ‘POS’ : part of speech e.g. ‘N’

    • ‘name’ : either single-lexeme e.g. ‘merger’ or multi-lexeme e.g. ‘a little’

    • ‘order’: the order of the lexeme in the lemma (starting from 1)

    • ‘headword’: a boolean (‘true’ or ‘false’)

    • ‘breakBefore’: Can this lexeme be separated from the previous lexeme?
      Consider: “take over.v” as in:

      Germany took over the Netherlands in 2 days.

      Germany took the Netherlands over in 2 days.

      In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:

      Mary takes after her grandmother.

      *Mary takes her grandmother after.

      In this case, ‘breakBefore’ would be “false” for the lexeme “after”

  • ‘lemmaID’ : Can be used to connect lemmas in different LUs

  • ‘semTypes’ : a list of semantic type objects for this LU

  • ‘subCorpus’ : a list of subcorpora
    • Each item in the list is a dict containing the following keys:
      • ‘name’ :

      • ‘sentence’ : a list of sentences in the subcorpus
        • each item in the list is a dict with the following keys:
          • ‘ID’:

          • ‘sentNo’:

          • ‘text’: the text of the sentence

          • ‘aPos’:

          • ‘annotationSet’: a list of annotation sets
            • each item in the list is a dict with the following keys:
              • ‘ID’:

              • ‘status’:

              • ‘layer’: a list of layers
                • each layer is a dict containing the following keys:
                  • ‘name’: layer name (e.g. ‘BNC’)

                  • ‘rank’:

                  • ‘label’: a list of labels for the layer
                    • each label is a dict containing the following keys:
                      • ‘start’: start pos of label in sentence ‘text’ (0-based)
                      • ‘end’: end pos of label in sentence ‘text’ (0-based)
                      • ‘name’: name of label (e.g. ‘NN1’)

Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.

Parameters:
  • fn_luid (int) – The id number of the lexical unit
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

All information about the lexical unit

Return type:

dict
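
Since this function reads from disk on each call, repeated look-ups of the same id may benefit from a small cache, as suggested above. The sketch below wraps any one-argument look-up function with functools.lru_cache; make_cached_lookup and fake_lu are hypothetical names, and fake_lu only stands in for fn.lu so that the caching behaviour can be observed without the corpus.

```python
from functools import lru_cache


def make_cached_lookup(lookup, maxsize=1024):
    """Return a version of `lookup` that remembers results per id."""
    @lru_cache(maxsize=maxsize)
    def cached(lu_id):
        return lookup(lu_id)
    return cached


calls = []


def fake_lu(lu_id):
    """Stand-in for an expensive look-up; records every real call."""
    calls.append(lu_id)
    return {"ID": lu_id, "_type": "lu"}


cached_lu = make_cached_lookup(fake_lu)
first = cached_lu(256)
second = cached_lu(256)  # served from the cache, fake_lu is not called again
```

With the corpus installed, make_cached_lookup(fn.lu) would presumably behave the same way. The cache returns the same object each time, so callers should avoid mutating the result. The ignorekeys argument is left out here because lists are not hashable and cannot be used as cache keys.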

lu_basic(fn_luid)[source]

Returns basic information about the LU whose id is fn_luid. This is basically just a wrapper around the lu() function with “subCorpus” info excluded.

>>> from nltk.corpus import framenet as fn
>>> PrettyDict(fn.lu_basic(256), breakLines=True)
{'ID': 256,
 'POS': 'V',
 '_type': 'lu',
 'definition': 'COD: be aware of beforehand; predict.',
 'frame': <frame ID=26 name=Expectation>,
 'lemmaID': 15082,
 'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}],
 'name': 'foresee.v',
 'semTypes': [],
 'sentenceCount': {'annotated': 44, 'total': 227},
 'status': 'FN1_Sent'}
Parameters:fn_luid (int) – The id number of the desired LU
Returns:Basic information about the lexical unit
Return type:dict
lu_ids_and_names(name=None)[source]

Uses the LU index, which is much faster than looking up each LU definition if only the names and IDs are needed.

lus(name=None)[source]

Obtain details for the lexical units whose names match the given pattern, or for all lexical units if no pattern is given.

>>> from nltk.corpus import framenet as fn
>>> len(fn.lus())
11829
>>> PrettyList(fn.lus(r'(?i)a little'), maxReprSize=0, breakLines=True)
[<lu ID=14744 name=a little bit.adv>,
 <lu ID=14733 name=a little.n>,
 <lu ID=14743 name=a little.adv>]

A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.

We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:

  • Apply_heat: “Michelle baked the potatoes for 45 minutes.”
  • Cooking_creation: “Michelle baked her mother a cake for her birthday.”
  • Absorb_heat: “The potatoes have to bake for more than 30 minutes.”

These constitute three different LUs, with different definitions.

Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.

Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.

Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.

In the simplest case, frame-evoking words are verbs such as “fried” in:

“Matilde fried the catfish in a heavy iron skillet.”

Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:

“...the reduction of debt levels to $665 million from $2.6 billion.”

Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:

“They were asleep for hours.”

Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.

Parameters:name (str) –

A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows the dot. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned.

The valid POSes are:

  • v - verb
  • n - noun
  • a - adjective
  • adv - adverb
  • prep - preposition
  • num - numbers
  • intj - interjection
  • art - article
  • c - conjunction
  • scon - subordinating conjunction
Returns:A list of selected (or all) lexical units
Return type:list of LU objects (dicts). See the lu() function for info about the specifics of LU objects.
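
Because the name parameter is a regular expression, the dot in an LU name matches any character unless it is escaped, and an unanchored pattern such as 'a little' also matches 'a little bit.adv'. The sketch below builds a stricter pattern from a lemma and an optional POS taken from the list above. lu_name_pattern is a hypothetical helper, and the sketch assumes the pattern is applied with regular-expression search semantics.

```python
import re

# The POS abbreviations listed in the documentation above.
VALID_POS = ("v", "n", "a", "adv", "prep", "num", "intj", "art", "c", "scon")


def lu_name_pattern(lemma, pos=None):
    """Build a case-insensitive regex matching exactly 'lemma.POS'."""
    if pos is not None and pos not in VALID_POS:
        raise ValueError("unknown POS %r" % pos)
    if pos is None:
        pos_part = "(?:%s)" % "|".join(VALID_POS)
    else:
        pos_part = re.escape(pos)
    # Escape the lemma and the dot so they are matched literally.
    return r"(?i)^%s\.%s$" % (re.escape(lemma), pos_part)


names = ["a little bit.adv", "a little.n", "a little.adv", "little.a"]
any_pos = [n for n in names if re.search(lu_name_pattern("a little"), n)]
adverbs = [n for n in names if re.search(lu_name_pattern("a little", "adv"), n)]
```

The resulting string could then be passed as the name argument, e.g. fn.lus(lu_name_pattern('a little', 'adv')).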
propagate_semtypes()[source]

Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)

>>> from nltk.corpus import framenet as fn
>>> sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
4241
>>> fn.propagate_semtypes()
>>> sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
5252
readme()[source]

Return the contents of the corpus README.txt (or README) file.

semtype(key)[source]
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
Parameters:key (string or int) – The name, abbreviation, or id number of the semantic type
Returns:Information about a semantic type
Return type:dict
semtype_inherits(st, superST)[source]
semtypes()[source]

Obtain a list of semantic types.

>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes)
73
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'name', 'rootType', 'subTypes', 'superType']
Returns:A list of all of the semantic types in framenet
Return type:list(dict)
exception nltk.corpus.reader.framenet.FramenetError[source]

Bases: builtins.Exception

An exception class for framenet-related errors.

class nltk.corpus.reader.framenet.Future(loader, *args, **kwargs)[source]

Bases: builtins.object

Wraps and acts as a proxy for a value to be loaded lazily (on demand). Adapted from https://gist.github.com/sergey-miryanov/2935416

class nltk.corpus.reader.framenet.PrettyDict(*args, **kwargs)[source]

Bases: nltk.corpus.reader.framenet.AttrDict

Displays an abbreviated repr of values where possible. Inherits from AttrDict, so a callable value will be lazily converted to an actual value.

unicode_repr()
class nltk.corpus.reader.framenet.PrettyLazyMap(function, *lists, **config)[source]

Bases: nltk.util.LazyMap

Displays an abbreviated repr of only the first several elements, not the whole list.

unicode_repr()

Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.

class nltk.corpus.reader.framenet.PrettyList(*args, **kwargs)[source]

Bases: builtins.list

Displays an abbreviated repr of only the first several elements, not the whole list.

unicode_repr()

Return a string representation for this corpus view that is similar to a list’s representation; but if it would be more than 60 characters long, it is truncated.

nltk.corpus.reader.framenet.demo()[source]

nltk.corpus.reader.ieer module

Corpus reader for the Information Extraction and Entity Recognition Corpus.

NIST 1999 Information Extraction: Entity Recognition Evaluation http://www.itl.nist.gov/iad/894.01/tests/ie-er/er_99/er_99.htm

This corpus contains the NEWSWIRE development test data for the NIST 1999 IE-ER Evaluation. The files were taken from the subdirectory: /ie_er_99/english/devtest/newswire/*.ref.nwt and filenames were shortened.

The corpus contains the following files: APW_19980314, APW_19980424, APW_19980429, NYT_19980315, NYT_19980403, and NYT_19980407.

class nltk.corpus.reader.ieer.IEERCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

docs(fileids=None)[source]
parsed_docs(fileids=None)[source]
raw(fileids=None)[source]
class nltk.corpus.reader.ieer.IEERDocument(text, docno=None, doctype=None, date_time=None, headline='')[source]

Bases: builtins.object

unicode_repr()
nltk.corpus.reader.ieer.documents = ['APW_19980314', 'APW_19980424', 'APW_19980429', 'NYT_19980315', 'NYT_19980403', 'NYT_19980407']

A list of all documents in this corpus.

nltk.corpus.reader.ieer.titles = {'APW_19980424': 'Associated Press Weekly, 24 April 1998', 'NYT_19980403': 'New York Times, 3 April 1998', 'APW_19980314': 'Associated Press Weekly, 14 March 1998', 'NYT_19980315': 'New York Times, 15 March 1998', 'NYT_19980407': 'New York Times, 7 April 1998', 'APW_19980429': 'Associated Press Weekly, 29 April 1998'}

A dictionary whose keys are the names of documents in this corpus; and whose values are descriptions of those documents’ contents.

nltk.corpus.reader.indian module

Indian Language POS-Tagged Corpus
Collected by A Kumaran, Microsoft Research, India
Distributed with permission

Contents:
  • Bangla: IIT Kharagpur
  • Hindi: Microsoft Research India
  • Marathi: IIT Bombay
  • Telugu: IIIT Hyderabad
class nltk.corpus.reader.indian.IndianCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.

raw(fileids=None)[source]
sents(fileids=None)[source]
tagged_sents(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
class nltk.corpus.reader.indian.IndianCorpusView(corpus_file, encoding, tagged, group_by_sent, tag_mapping_function=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

nltk.corpus.reader.ipipan module

class nltk.corpus.reader.ipipan.IPIPANCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.

The corpus includes information about text domain, channel and categories. You can access possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g.: fileids(channels='prasa'), fileids(categories='publicystyczny').

The reader supports the methods words, sents, paras and their tagged versions. You can get the part of speech instead of the full tag by passing the parameter “simplify_tags=True”, e.g.: tagged_sents(simplify_tags=True).

You can also get all disambiguated tags by specifying the parameter “one_tag=False”, e.g.: tagged_paras(one_tag=False).

You can get all tags that were assigned by a morphological analyzer by specifying the parameter “disamb_only=False”, e.g. tagged_words(disamb_only=False).

The IPIPAN Corpus contains tags indicating whether there is a space between two tokens. To add special “no space” markers, specify the parameter “append_no_space=True”, e.g. tagged_words(append_no_space=True). As a result, wherever there should be no space between two tokens, a new pair (‘’, ‘no-space’) is inserted (for tagged data), or just ‘’ for methods without tags.

The corpus reader can also try to append spaces between words. To enable this option, specify parameter “append_space=True”, e.g. words(append_space=True). As a result either ‘ ‘ or (‘ ‘, ‘space’) will be inserted between tokens.

By default, xml entities like &quot; and &amp; are replaced by corresponding characters. You can turn off this feature, specifying parameter “replace_xmlentities=False”, e.g. words(replace_xmlentities=False).
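
As an illustration of the “no space” markers described above, the sketch below rebuilds a text string from an untagged token list such as the one words(append_no_space=True) is described as returning. It assumes the empty-string marker sits between the two tokens that should be joined without a space; join_tokens and the sample list are hypothetical.

```python
def join_tokens(tokens):
    """Join tokens with spaces, except across '' (no-space) markers."""
    pieces = []
    glue_next = True  # no space is needed before the very first token
    for token in tokens:
        if token == "":
            # Marker: the next token attaches directly to the previous one.
            glue_next = True
            continue
        if not glue_next:
            pieces.append(" ")
        pieces.append(token)
        glue_next = False
    return "".join(pieces)


# Hand-built token list in the shape described above (illustrative only).
text = join_tokens(["Ala", "ma", "kota", "", "."])
```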

categories(fileids=None)[source]
channels(fileids=None)[source]
domains(fileids=None)[source]
fileids(channels=None, domains=None, categories=None)[source]
paras(fileids=None, **kwargs)[source]
raw(fileids=None)[source]
sents(fileids=None, **kwargs)[source]
tagged_paras(fileids=None, **kwargs)[source]
tagged_sents(fileids=None, **kwargs)[source]
tagged_words(fileids=None, **kwargs)[source]
words(fileids=None, **kwargs)[source]
class nltk.corpus.reader.ipipan.IPIPANCorpusView(filename, startpos=0, **kwargs)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

PARAS_MODE = 2
SENTS_MODE = 1
WORDS_MODE = 0
read_block(stream)[source]

nltk.corpus.reader.knbc module

class nltk.corpus.reader.knbc.KNBCorpusReader(root, fileids, encoding='utf8', morphs2str=<function <lambda> at 0x10811a378>)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

This class implements:
  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of list of words.
  • _tag, which takes a block and returns a list of list of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others ...)
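
The tuple layout above can be turned into a dict keyed by field name, which is often easier to read than positional access. This is a minimal sketch that takes the documented structure at face value; describe_tagged_word is a hypothetical helper and the sample values are placeholders, not corpus data.

```python
# Field names in the order given by the documented 'tags' tuple.
TAG_FIELDS = ("surface", "reading", "lemma",
              "pos1", "posid1", "pos2", "posid2", "pos3", "posid3")


def describe_tagged_word(tagged_word):
    """Convert (word, tags) into a dict with one entry per tag field."""
    word, tags = tagged_word
    info = dict(zip(TAG_FIELDS, tags))
    # Anything beyond the nine named fields is kept under 'others'.
    info["others"] = tuple(tags[len(TAG_FIELDS):])
    info["word"] = word
    return info


# Placeholder values standing in for a real tagged word.
sample = ("w", ("s", "r", "l", "p1", "1", "p2", "2", "p3", "3", "x", "y"))
info = describe_tagged_word(sample)
```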
nltk.corpus.reader.knbc.demo()[source]
nltk.corpus.reader.knbc.test()[source]

nltk.corpus.reader.lin module

class nltk.corpus.reader.lin.LinThesaurusCorpusReader(root, badscore=0.0)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.

scored_synonyms(ngram, fileid=None)[source]

Returns a list of scored synonyms (tuples of synonyms and scores) for the current ngram.

Parameters:
  • ngram (string) – ngram to lookup
  • fileid (string) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.

similarity(ngram1, ngram2, fileid=None)[source]

Returns the similarity score for two ngrams.

Parameters:
  • ngram1 (string) – first ngram to compare
  • ngram2 (string) – second ngram to compare
  • fileid (string) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.
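
Because the return shape depends on whether fileid is given, calling code may want to normalize both cases. The sketch below converts either shape into a dict from fileid to score. similarity_by_fileid is a hypothetical helper, and the file names and scores in the sample are made up for illustration.

```python
def similarity_by_fileid(result, fileid=None):
    """Normalize a similarity() result to a {fileid: score} dict."""
    if fileid is not None:
        # A single score was returned for the requested fileid.
        return {fileid: result}
    # Otherwise the result is a list of (fileid, score) tuples.
    return dict(result)


# Made-up values in the two documented shapes.
all_files = similarity_by_fileid([("thesaurus_a", 0.12), ("thesaurus_b", 0.0)])
one_file = similarity_by_fileid(0.12, fileid="thesaurus_a")
best = max(all_files, key=all_files.get)
```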

synonyms(ngram, fileid=None)[source]

Returns a list of synonyms for the current ngram.

Parameters:
  • ngram (string) – ngram to lookup
  • fileid (string) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.

nltk.corpus.reader.lin.demo()[source]

nltk.corpus.reader.nombank module

class nltk.corpus.reader.nombank.NombankChainTreePointer(pieces)[source]

Bases: nltk.corpus.reader.nombank.NombankPointer

pieces = None

A list of the pieces that make up this chain. Elements may be either NombankSplitTreePointer or NombankTreePointer pointers.

select(tree)[source]
unicode_repr()
class nltk.corpus.reader.nombank.NombankCorpusReader(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

instances(baseform=None)[source]
Returns:a corpus view that acts as a list of NombankInstance objects, one for each noun in the corpus.

lines()[source]
Returns:a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

nouns()[source]
Returns:a corpus view that acts as a list of all noun lemmas in this corpus (from the nombank.1.0.words file).

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)[source]
Returns:the xml description for the given roleset.
rolesets(baseform=None)[source]
Returns:list of xml descriptions for rolesets.
class nltk.corpus.reader.nombank.NombankInstance(fileid, sentnum, wordnum, baseform, sensenumber, predicate, predid, arguments, parse_corpus=None)[source]

Bases: builtins.object

arguments = None

A list of tuples (argloc, argid), specifying the location and identifier for each of the predicate’s arguments in the containing sentence. Argument identifiers are strings such as 'ARG0' or 'ARGM-TMP'. This list does not contain the predicate.

baseform = None

The baseform of the predicate.

fileid = None

The name of the file containing the parse tree for this instance’s sentence.

static parse(s, parse_fileid_xform=None, parse_corpus=None)[source]
parse_corpus = None

A corpus reader for the parse trees corresponding to the instances in this nombank corpus.

predicate = None

A NombankTreePointer indicating the position of this instance’s predicate within its containing sentence.

predid = None

Identifier of the predicate.

roleset[source]

The name of the roleset used by this instance’s predicate. Use NombankCorpusReader.roleset() to look up information about the roleset.

sensenumber = None

The sense number of the predicate.

sentnum = None

The sentence number of this sentence within fileid. Indexing starts from zero.

tree

The parse tree corresponding to this instance, or None if the corresponding tree is not available.

unicode_repr()
wordnum = None

The word number of this instance’s predicate within its containing sentence. Word numbers are indexed starting from zero, and include traces and other empty parse elements.

class nltk.corpus.reader.nombank.NombankPointer[source]

Bases: builtins.object

A pointer used by nombank to identify one or more constituents in a parse tree. NombankPointer is an abstract base class with three concrete subclasses:

  • NombankTreePointer is used to point to single constituents.
  • NombankSplitTreePointer is used to point to ‘split’ constituents, which consist of a sequence of two or more NombankTreePointer pointers.
  • NombankChainTreePointer is used to point to entire trace chains in a tree. It consists of a sequence of pieces, which can be NombankTreePointer or NombankSplitTreePointer pointers.
class nltk.corpus.reader.nombank.NombankSplitTreePointer(pieces)[source]

Bases: nltk.corpus.reader.nombank.NombankPointer

pieces = None

A list of the pieces that make up this chain. Elements are all NombankTreePointer pointers.

select(tree)[source]
unicode_repr()
class nltk.corpus.reader.nombank.NombankTreePointer(wordnum, height)[source]

Bases: nltk.corpus.reader.nombank.NombankPointer

wordnum:height*wordnum:height*... wordnum:height,

static parse(s)[source]
select(tree)[source]
treepos(tree)[source]

Convert this pointer to a standard ‘tree position’ pointer, given that it points to the given tree.

unicode_repr()
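The pointer notation can be illustrated with a small parser. This is a minimal sketch, not the library's implementation: it assumes that '*' joins the pieces of a trace chain, ',' joins the pieces of a split constituent, and a single constituent is written wordnum:height. The function name parse_pointer is hypothetical, and it returns plain tuples rather than NombankPointer objects.

```python
def parse_pointer(s):
    """Parse pointer notation into nested tuples (hypothetical helper)."""
    # Assumption: '*' joins the pieces of a trace chain.
    if "*" in s:
        return ("chain", [parse_pointer(piece) for piece in s.split("*")])
    # Assumption: ',' joins the pieces of a split constituent.
    if "," in s:
        return ("split", [parse_pointer(piece) for piece in s.split(",")])
    # A single constituent is written wordnum:height.
    wordnum, height = s.split(":")
    return ("tree", int(wordnum), int(height))


print(parse_pointer("3:1"))
print(parse_pointer("2:0,4:1"))
print(parse_pointer("1:0*2:0,4:1"))
```

Checking for '*' before ',' means a chain can contain split pieces, which matches the description of NombankChainTreePointer above.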

nltk.corpus.reader.nps_chat module

class nltk.corpus.reader.nps_chat.NPSChatCorpusReader(root, fileids, wrap_etree=False, tagset=None)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

posts(fileids=None)[source]
tagged_posts(fileids=None, tagset=None)[source]
tagged_words(fileids=None, tagset=None)[source]
words(fileids=None)[source]
xml_posts(fileids=None)[source]

nltk.corpus.reader.pl196x module

class nltk.corpus.reader.pl196x.Pl196xCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.xmldocs.XMLCorpusReader

decode_tag(tag)[source]
headLen = 2770
paras(fileids=None, categories=None, textids=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None, textids=None)[source]
tagged_paras(fileids=None, categories=None, textids=None)[source]
tagged_sents(fileids=None, categories=None, textids=None)[source]
tagged_words(fileids=None, categories=None, textids=None)[source]
textids(fileids=None, categories=None)[source]

In the pl196x corpus each category is stored in a single file, so selecting by fileids and selecting by categories provide identical functionality. In order to accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks, giving much more control to the user.

words(fileids=None, categories=None, textids=None)[source]
xml(fileids=None, categories=None)[source]
class nltk.corpus.reader.pl196x.TEICorpusView(corpus_file, tagged, group_by_sent, group_by_para, tagset=None, headLen=0, textids=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]

nltk.corpus.reader.plaintext module

A reader for corpora that consist of plaintext documents.

class nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.plaintext.PlaintextCorpusReader

A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.

paras(fileids=None, categories=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None)[source]
words(fileids=None, categories=None)[source]
class nltk.corpus.reader.plaintext.EuroparlCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x10804ccc0>, para_block_reader=<function read_blankline_block at 0x10805b1e0>, encoding='utf8')[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from PlaintextCorpusReader except that:

  • Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.
  • For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
  • There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
  • The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
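The differences listed above can be sketched without NLTK. The following is an illustrative stand-in, not the reader's own code: the function name chapters and the sample text are hypothetical. It splits chapters at blank lines, sentences at line breaks, and words at whitespace.

```python
import re


def chapters(text):
    """Split pre-tokenized text into chapters -> sentences -> words."""
    result = []
    # Chapters are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        # One sentence per line; words are separated by whitespace.
        sents = [line.split() for line in block.splitlines() if line.strip()]
        if sents:
            result.append(sents)
    return result


# Hypothetical sample in the pre-tokenized layout described above.
sample = (
    "Resumption of the session\n"
    "I declare resumed the session .\n"
    "\n"
    "Agenda\n"
    "The next item is the vote .\n"
)
print(chapters(sample))
```

The result has the list(list(list(str))) shape that chapters() is documented to return.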
chapters(fileids=None)[source]
Returns:the given file(s) as a list of chapters, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
paras(fileids=None)[source]
class nltk.corpus.reader.plaintext.PlaintextCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x10804ccc0>, para_block_reader=<function read_blankline_block at 0x10805b1e0>, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.
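As a rough illustration of the default behaviour, the sketch below splits text into paragraphs at blank lines and tokenizes each paragraph with the word pattern shown in the constructor signature. It deliberately skips the sentence level, because the default sentence tokenizer is a trained Punkt model that cannot be reproduced in a few lines. The function name paragraphs is hypothetical.

```python
import re

# Word pattern taken from the WordPunctTokenizer default in the signature.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def paragraphs(text):
    """Return each blank-line-separated paragraph as a flat list of tokens."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [WORD_PUNCT.findall(block) for block in blocks if block.strip()]


sample = "Hello, world!\nSecond line.\n\nNew paragraph here."
print(paragraphs(sample))
```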

CorpusView

The corpus view class used by this reader. Subclasses of PlaintextCorpusReader may specify alternative corpus view classes (e.g., to skip the preface sections of documents.)

alias of StreamBackedCorpusView

paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.plaintext.PortugueseCategorizedPlaintextCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader

nltk.corpus.reader.ppattach module

Read lines from the Prepositional Phrase Attachment Corpus.

The PP Attachment Corpus contains several files having the format:

sentence_id verb noun1 preposition noun2 attachment

For example:

42960 gives authority to administration V
46742 gives inventors of microchip N

The PP attachment is to the verb phrase (V) or noun phrase (N), i.e.:

(VP gives (NP authority) (PP to administration))
(VP gives (NP inventors (PP of microchip)))
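The six-field line format can be read with a few lines of Python. This is a sketch under the assumption that fields are separated by whitespace and contain no internal spaces; PPRecord and parse_line are hypothetical names, not part of the module.

```python
from collections import namedtuple

# Hypothetical record type mirroring the six fields of the file format.
PPRecord = namedtuple("PPRecord", "sent verb noun1 prep noun2 attachment")


def parse_line(line):
    """Split one corpus line into its six whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return PPRecord(*fields)


rec = parse_line("42960 gives authority to administration V")
print(rec.verb, rec.prep, rec.attachment)
```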

The corpus contains the following files:

  • training: training set
  • devset: development test set, used for algorithm development
  • test: test set, used to report results
  • bitstrings: word classes derived from Mutual Information Clustering for the Wall Street Journal

Ratnaparkhi, Adwait (1994). A Maximum Entropy Model for Prepositional Phrase Attachment. Proceedings of the ARPA Human Language Technology Conference. [http://www.cis.upenn.edu/~adwait/papers/hlt94.ps]

The PP Attachment Corpus is distributed with NLTK with the permission of the author.

class nltk.corpus.reader.ppattach.PPAttachment(sent, verb, noun1, prep, noun2, attachment)[source]

Bases: builtins.object

unicode_repr()
class nltk.corpus.reader.ppattach.PPAttachmentCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

sentence_id verb noun1 preposition noun2 attachment

attachments(fileids)[source]
raw(fileids=None)[source]
tuples(fileids)[source]

nltk.corpus.reader.propbank module

class nltk.corpus.reader.propbank.PropbankChainTreePointer(pieces)[source]

Bases: nltk.corpus.reader.propbank.PropbankPointer

pieces = None

A list of the pieces that make up this chain. Elements may be either PropbankSplitTreePointer or PropbankTreePointer pointers.

select(tree)[source]
unicode_repr()
class nltk.corpus.reader.propbank.PropbankCorpusReader(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

instances(baseform=None)[source]
Returns:a corpus view that acts as a list of PropbankInstance objects, one for each verb in the corpus.

lines()[source]
Returns:a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)[source]
Returns:the xml description for the given roleset.
rolesets(baseform=None)[source]
Returns:list of xml descriptions for rolesets.
verbs()[source]
Returns:a corpus view that acts as a list of all verb lemmas in this corpus (from the verbs.txt file).

class nltk.corpus.reader.propbank.PropbankInflection(form='-', tense='-', aspect='-', person='-', voice='-')[source]

Bases: builtins.object

ACTIVE = 'a'
FINITE = 'v'
FUTURE = 'f'
GERUND = 'g'
INFINITIVE = 'i'
NONE = '-'
PARTICIPLE = 'p'
PASSIVE = 'p'
PAST = 'p'
PERFECT = 'p'
PERFECT_AND_PROGRESSIVE = 'b'
PRESENT = 'n'
PROGRESSIVE = 'o'
THIRD_PERSON = '3'
static parse(s)[source]
unicode_repr()
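Several constants share the code 'p', so a code can only be interpreted by its position. The sketch below shows one way to decode an inflection string. It assumes the string has five characters in the constructor's argument order (form, tense, aspect, person, voice), and it assigns each constant to a field by its name; both points are assumptions rather than documented behaviour. decode_inflection is a hypothetical helper.

```python
# Per-field code tables, in the constructor's argument order.
# Which constant belongs to which field is inferred from its name (assumption).
FIELDS = [
    ("form", {"i": "INFINITIVE", "g": "GERUND", "p": "PARTICIPLE", "v": "FINITE"}),
    ("tense", {"f": "FUTURE", "p": "PAST", "n": "PRESENT"}),
    ("aspect", {"p": "PERFECT", "o": "PROGRESSIVE", "b": "PERFECT_AND_PROGRESSIVE"}),
    ("person", {"3": "THIRD_PERSON"}),
    ("voice", {"a": "ACTIVE", "p": "PASSIVE"}),
]


def decode_inflection(s):
    """Decode a five-character inflection string (hypothetical helper)."""
    if len(s) != len(FIELDS):
        raise ValueError("expected %d characters, got %r" % (len(FIELDS), s))
    decoded = {}
    for char, (field, codes) in zip(s, FIELDS):
        if char == "-":
            decoded[field] = "NONE"
        elif char in codes:
            decoded[field] = codes[char]
        else:
            raise ValueError("bad %s code %r in %r" % (field, char, s))
    return decoded


print(decode_inflection("vp--a"))
```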
class nltk.corpus.reader.propbank.PropbankInstance(fileid, sentnum, wordnum, tagger, roleset, inflection, predicate, arguments, parse_corpus=None)[source]

Bases: builtins.object

arguments = None

A list of tuples (argloc, argid), specifying the location and identifier for each of the predicate’s arguments in the containing sentence. Argument identifiers are strings such as 'ARG0' or 'ARGM-TMP'. This list does not contain the predicate.

baseform[source]

The baseform of the predicate.

fileid = None

The name of the file containing the parse tree for this instance’s sentence.

inflection = None

A PropbankInflection object describing the inflection of this instance’s predicate.

static parse(s, parse_fileid_xform=None, parse_corpus=None)[source]
parse_corpus = None

A corpus reader for the parse trees corresponding to the instances in this propbank corpus.

predicate = None

A PropbankTreePointer indicating the position of this instance’s predicate within its containing sentence.

predid[source]

Identifier of the predicate.

roleset = None

The name of the roleset used by this instance’s predicate. Use PropbankCorpusReader.roleset() to look up information about the roleset.

sensenumber[source]

The sense number of the predicate.

sentnum = None

The sentence number of this sentence within fileid. Indexing starts from zero.

tagger = None

An identifier for the tagger who tagged this instance; or 'gold' if this is an adjudicated instance.

tree

The parse tree corresponding to this instance, or None if the corresponding tree is not available.

unicode_repr()
wordnum = None

The word number of this instance’s predicate within its containing sentence. Word numbers are indexed starting from zero, and include traces and other empty parse elements.

class nltk.corpus.reader.propbank.PropbankPointer[source]

Bases: builtins.object

A pointer used by propbank to identify one or more constituents in a parse tree. PropbankPointer is an abstract base class with three concrete subclasses:

  • PropbankTreePointer is used to point to single constituents.
  • PropbankSplitTreePointer is used to point to ‘split’ constituents, which consist of a sequence of two or more PropbankTreePointer pointers.
  • PropbankChainTreePointer is used to point to entire trace chains in a tree. It consists of a sequence of pieces, which can be PropbankTreePointer or PropbankSplitTreePointer pointers.
class nltk.corpus.reader.propbank.PropbankSplitTreePointer(pieces)[source]

Bases: nltk.corpus.reader.propbank.PropbankPointer

pieces = None

A list of the pieces that make up this chain. Elements are all PropbankTreePointer pointers.

select(tree)[source]
unicode_repr()
class nltk.corpus.reader.propbank.PropbankTreePointer(wordnum, height)[source]

Bases: nltk.corpus.reader.propbank.PropbankPointer

wordnum:height*wordnum:height*... wordnum:height,

static parse(s)[source]
select(tree)[source]
treepos(tree)[source]

Convert this pointer to a standard ‘tree position’ pointer, given that it points to the given tree.

unicode_repr()

nltk.corpus.reader.rte module

Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora.

The files were taken from the RTE1, RTE2 and RTE3 datasets and then regularized.

Filenames are of the form rte*_dev.xml and rte*_test.xml. The latter are the gold standard annotated files.

Each entailment corpus is a list of ‘text’/‘hypothesis’ pairs. The following example is taken from RTE3:

<pair id="1" entailment="YES" task="IE" length="short" >

   <t>The sale was made to pay Yukos' US$ 27.5 billion tax bill,
   Yuganskneftegaz was originally sold for US$ 9.4 billion to a little known
   company Baikalfinansgroup which was later bought by the Russian
   state-owned oil company Rosneft .</t>

  <h>Baikalfinansgroup was sold to Rosneft.</h>
</pair>

In order to provide globally unique IDs for each pair, a new attribute challenge has been added to the root element entailment-corpus of each file, taking values 1, 2 or 3. The GID is formatted ‘m-n’, where ‘m’ is the challenge number and ‘n’ is the pair ID.
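The file layout and the GID scheme can be sketched with the standard library's ElementTree. This is an illustration only, not the reader's code: read_pairs is a hypothetical function, and the sample document is a shortened version of the pair shown above wrapped in an entailment-corpus root.

```python
import xml.etree.ElementTree as ET


def read_pairs(xml_text):
    """Return one dict per <pair>, adding a 'gid' of the form 'm-n'."""
    root = ET.fromstring(xml_text)
    challenge = root.get("challenge")  # attribute on the root element
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "gid": "%s-%s" % (challenge, pair.get("id")),
            "entailment": pair.get("entailment"),
            # Collapse line breaks and runs of spaces inside the text.
            "text": " ".join(pair.findtext("t").split()),
            "hyp": " ".join(pair.findtext("h").split()),
        })
    return pairs


# Shortened, hypothetical sample document.
doc = """<entailment-corpus challenge="3">
<pair id="1" entailment="YES" task="IE" length="short">
  <t>Baikalfinansgroup was later bought by the Russian
  state-owned oil company Rosneft .</t>
  <h>Baikalfinansgroup was sold to Rosneft.</h>
</pair>
</entailment-corpus>"""

for p in read_pairs(doc):
    print(p["gid"], p["entailment"], p["hyp"])
```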

class nltk.corpus.reader.rte.RTECorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for corpora in RTE challenges.

This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.

pairs(fileids)[source]

Build a list of RTEPairs from a RTE corpus.

Parameters:fileids – a list of RTE corpus fileids
Type:list
Return type:list(RTEPair)
class nltk.corpus.reader.rte.RTEPair(pair, challenge=None, id=None, text=None, hyp=None, value=None, task=None, length=None)[source]

Bases: builtins.object

Container for RTE text-hypothesis pairs.

The entailment relation is signalled by the value attribute in RTE1, and by entailment in RTE2 and RTE3. These both get mapped on to the entailment attribute of this class.

unicode_repr()
nltk.corpus.reader.rte.norm(value_string)[source]

Normalize the string value in an RTE pair’s value or entailment attribute as an integer (1, 0).

Parameters:value_string (str) – the label used to classify a text/hypothesis pair
Return type:int
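A normalization of this kind might look like the sketch below. The accepted labels (YES/NO and TRUE/FALSE, in any letter case) are an assumption based on the example pair above and on the two attribute names; norm_label is a hypothetical stand-in, not the function documented here.

```python
def norm_label(value_string):
    """Map an entailment label to 1 or 0 (hypothetical stand-in for norm())."""
    # Assumed label set; the real corpora may use other spellings.
    labels = {"TRUE": 1, "YES": 1, "FALSE": 0, "NO": 0}
    key = value_string.strip().upper()
    if key not in labels:
        raise ValueError("unrecognized entailment label: %r" % value_string)
    return labels[key]


print(norm_label("YES"), norm_label("false"))
```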

nltk.corpus.reader.semcor module

Corpus reader for the SemCor Corpus.

class nltk.corpus.reader.semcor.SemcorCorpusReader(root, fileids, wordnet, lazy=True)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

chunk_sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of chunks.
Return type:list(list(list(str)))
chunks(fileids=None)[source]
Returns:the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit.
Return type:list(list(str))
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of word strings.
Return type:list(list(str))
tagged_chunks(fileids=None, tag='pos')[source]
Returns:the given file(s) as a list of tagged chunks, represented in tree form.
Return type:list(Tree)
Parameters:tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part-of-speech tag.)
tagged_sents(fileids=None, tag='pos')[source]
Returns:the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form).
Return type:list(list(Tree))
Parameters:tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part-of-speech tag.)
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.semcor.SemcorSentence(num, items)[source]

Bases: builtins.list

A list of words, augmented by an attribute num used to record the sentence identifier (the n attribute from the XML).

class nltk.corpus.reader.semcor.SemcorWordView(fileid, unit, bracket_sent, pos_tag, sem_tag, wordnet)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusView

A stream-backed corpus view specialized for use with the SemCor corpus.

handle_elt(elt, context)[source]
handle_sent(elt)[source]
handle_word(elt)[source]

nltk.corpus.reader.senseval module

Read from the Senseval 2 Corpus.

SENSEVAL [http://www.senseval.org/] Evaluation exercises for Word Sense Disambiguation. Organized by ACL-SIGLEX [http://www.siglex.org/]

Prepared by Ted Pedersen <tpederse@umn.edu>, University of Minnesota, http://www.d.umn.edu/~tpederse/data.html Distributed with permission.

The NLTK version of the Senseval 2 files uses well-formed XML. Each instance of the ambiguous words “hard”, “interest”, “line”, and “serve” is tagged with a sense identifier, and supplied with context.

class nltk.corpus.reader.senseval.SensevalCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

instances(fileids=None)[source]
raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
class nltk.corpus.reader.senseval.SensevalCorpusView(fileid, encoding)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)[source]
class nltk.corpus.reader.senseval.SensevalInstance(word, position, context, senses)[source]

Bases: builtins.object

unicode_repr()

nltk.corpus.reader.sentiwordnet module

An NLTK interface for SentiWordNet

SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity.

For details about SentiWordNet see: http://sentiwordnet.isti.cnr.it/

>>> from nltk.corpus import sentiwordnet as swn
>>> print(swn.senti_synset('breakdown.n.03'))
<breakdown.n.03: PosScore=0.0 NegScore=0.25>
>>> list(swn.senti_synsets('slow'))
[SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('slow.a.04'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]
>>> happy = swn.senti_synsets('happy', 'a')
>>> happy0 = list(happy)[0]
>>> happy0.pos_score()
0.875
>>> happy0.neg_score()
0.0
>>> happy0.obj_score()
0.125
class nltk.corpus.reader.sentiwordnet.SentiSynset(pos_score, neg_score, synset)[source]

Bases: builtins.object

neg_score()[source]
obj_score()[source]
pos_score()[source]
unicode_repr()
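The three scores in the example above sum to 1, so the objectivity score can be derived from the other two. The sketch below shows that relationship with a hypothetical ScoreTriple class; it illustrates the arithmetic and is not the SentiSynset implementation.

```python
class ScoreTriple:
    """Hypothetical holder for positivity and negativity scores."""

    def __init__(self, pos_score, neg_score):
        if not 0.0 <= pos_score + neg_score <= 1.0:
            raise ValueError("pos + neg must lie in [0, 1]")
        self._pos = pos_score
        self._neg = neg_score

    def pos_score(self):
        return self._pos

    def neg_score(self):
        return self._neg

    def obj_score(self):
        # Objectivity is whatever remains after positivity and negativity.
        return 1.0 - (self._pos + self._neg)


happy = ScoreTriple(0.875, 0.0)
print(happy.obj_score())
```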
class nltk.corpus.reader.sentiwordnet.SentiWordNetCorpusReader(root, fileids, encoding='utf-8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

all_senti_synsets()[source]
senti_synset(*vals)[source]
senti_synsets(string, pos=None)[source]
unicode_repr()

nltk.corpus.reader.sinica_treebank module

Sinica Treebank Corpus Sample

http://rocling.iis.sinica.edu.tw/CKIP/engversion/treebank.htm

10,000 parsed sentences, drawn from the Academia Sinica Balanced Corpus of Modern Chinese. Parse tree notation is based on Information-based Case Grammar. Tagset documentation is available at http://www.sinica.edu.tw/SinicaCorpus/modern_e_wordtype.html

Language and Knowledge Processing Group, Institute of Information Science, Academia Sinica

It is distributed with the Natural Language Toolkit under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike License [http://creativecommons.org/licenses/by-nc-sa/2.5/].

References:

Feng-Yi Chen, Pi-Fang Tsai, Keh-Jiann Chen, and Chu-Ren Huang (1999) The Construction of Sinica Treebank. Computational Linguistics and Chinese Language Processing, 4, pp 87-104.

Huang Chu-Ren, Keh-Jiann Chen, Feng-Yi Chen, Keh-Jiann Chen, Zhao-Ming Gao, and Kuang-Yu Chen. 2000. Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface. Proceedings of 2nd Chinese Language Processing Workshop, Association for Computational Linguistics.

Chen Keh-Jiann and Yu-Ming Hsieh (2004) Chinese Treebanks and Grammar Extraction, Proceedings of IJCNLP-04, pp560-565.

class nltk.corpus.reader.sinica_treebank.SinicaTreebankCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for the sinica treebank.

nltk.corpus.reader.string_category module

Read tuples from a corpus consisting of categorized strings. For example, from the question classification corpus:

NUM:dist How far is it from Denver to Aspen ?
LOC:city What county is Modesto , California in ?
HUM:desc Who was Galileo ?
DESC:def What is an atom ?
NUM:date When did Hawaii become a state ?
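Reading this format amounts to splitting each line once at the delimiter. A minimal sketch follows, assuming one item per line and a (category, string) tuple order; parse_categorized is a hypothetical name, and the reader's actual tuple order may differ.

```python
def parse_categorized(text, delimiter=" "):
    """Return (category, string) pairs, one per non-empty line."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Split only at the first delimiter so the question stays intact.
            category, _, string = line.partition(delimiter)
            pairs.append((category, string))
    return pairs


sample = "NUM:dist How far is it from Denver to Aspen ?\nHUM:desc Who was Galileo ?\n"
print(parse_categorized(sample))
```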

class nltk.corpus.reader.string_category.StringCategoryCorpusReader(root, fileids, delimiter=' ', encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

raw(fileids=None)[source]
Returns:the text contents of the given fileids, as a single string.
tuples(fileids=None)[source]

nltk.corpus.reader.switchboard module

class nltk.corpus.reader.switchboard.SwitchboardCorpusReader(root, tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

discourses()[source]
tagged_discourses(tagset=False)[source]
tagged_turns(tagset=None)[source]
tagged_words(tagset=None)[source]
turns()[source]
words()[source]
class nltk.corpus.reader.switchboard.SwitchboardTurn(words, speaker, id)[source]

Bases: builtins.list

A specialized list object used to encode switchboard utterances. The elements of the list are the words in the utterance; and two attributes, speaker and id, are provided to retrieve the speaker identifier and utterance id. Note that utterance ids are only unique within a given discourse.

unicode_repr()
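The data structure described above is a list subclass with two extra attributes. The sketch below uses a hypothetical Turn class and invented sample words to show the shape, and to show why a turn id must be combined with a discourse index to obtain a unique key.

```python
class Turn(list):
    """A list of words carrying a speaker and an utterance id (hypothetical)."""

    def __init__(self, words, speaker, id):
        super().__init__(words)
        self.speaker = speaker
        self.id = id


# Two invented discourses; utterance ids restart in each one.
discourses = [
    [Turn(["Okay", "."], "A", 1), Turn(["Hi", "."], "B", 2)],
    [Turn(["Hello", "."], "A", 1)],
]

# Ids repeat across discourses, so key each turn by (discourse index, id).
index = {(d, turn.id): turn for d, disc in enumerate(discourses) for turn in disc}
print(index[(1, 1)], index[(1, 1)].speaker)
```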

nltk.corpus.reader.tagged module

A reader for corpora whose documents contain part-of-speech-tagged words.

class nltk.corpus.reader.tagged.CategorizedTaggedCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.tagged.TaggedCorpusReader

A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.

paras(fileids=None, categories=None)[source]
raw(fileids=None, categories=None)[source]
sents(fileids=None, categories=None)[source]
tagged_paras(fileids=None, categories=None, tagset=None)[source]
tagged_sents(fileids=None, categories=None, tagset=None)[source]
tagged_words(fileids=None, categories=None, tagset=None)[source]
words(fileids=None, categories=None)[source]
class nltk.corpus.reader.tagged.MacMorphoCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.

class nltk.corpus.reader.tagged.TaggedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block at 0x10805b1e0>, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator. I.e., words should have the form:

word1/tag1 word2/tag2 word3/tag3 ...

But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.
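The word/tag convention can be sketched in plain Python. This is an illustrative stand-in for the parsing step, not a copy of nltk.tag.str2tuple: it assumes the tag follows the last occurrence of the separator and that a token without a separator has no tag. split_tagged and tagged_sentence are hypothetical names.

```python
def split_tagged(token, sep="/"):
    """Split 'word/tag' at the last separator and upper-case the tag."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator present: treat the token as untagged (assumption).
        return (token, None)
    return (word, tag.upper())


def tagged_sentence(line, sep="/"):
    """Turn one whitespace-tokenized line into (word, tag) tuples."""
    return [split_tagged(token, sep) for token in line.split()]


print(tagged_sentence("The/at dog/nn barked/vbd ./."))
print(tagged_sentence("casa_n", sep="_"))
```

Splitting at the last separator keeps words that themselves contain the separator intact, for example '1/2/cd'.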

paras(fileids=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)[source]
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)[source]
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, tagset=None)[source]
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)[source]
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.tagged.TaggedCorpusView(corpus_file, encoding, tagged, group_by_sent, group_by_para, sep, word_tokenizer, sent_tokenizer, para_block_reader, tag_mapping_function=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A specialized corpus view for tagged documents. It can be customized via flags to divide the tagged corpus documents up by sentence or paragraph, and to include or omit part of speech tags. TaggedCorpusView objects are typically created by TaggedCorpusReader (not directly by nltk users).

read_block(stream)[source]

Reads one paragraph at a time.

class nltk.corpus.reader.tagged.TimitTaggedCorpusReader(*args, **kwargs)[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for tagged sentences that are included in the TIMIT corpus.

paras()[source]
tagged_paras()[source]

nltk.corpus.reader.timit module

Read tokens, phonemes and audio data from the NLTK TIMIT Corpus.

This corpus contains a selected portion of the TIMIT corpus.

  • 16 speakers from 8 dialect regions
  • 1 male and 1 female from each dialect region
  • 130 sentences in total (10 sentences per speaker; some sentences are shared among speakers, and sa1 and sa2 in particular are spoken by all speakers)
  • 160 sentence recordings in total (10 recordings per speaker)
  • audio format: NIST Sphere, single channel, 16kHz sampling, 16 bit sample, PCM encoding

Module contents

The timit corpus reader provides 4 functions and 4 data items.

  • utterances

    List of utterances in the corpus. There are 160 utterances in total, each of which corresponds to a unique utterance of a speaker. Here’s an example of an utterance identifier in the list:

    dr1-fvmh0/sx206
      - _----  _---
      | |  |   | |
      | |  |   | |
      | |  |   | `--- sentence number
      | |  |   `----- sentence type (a:all, i:shared, x:exclusive)
      | |  `--------- speaker ID
      | `------------ sex (m:male, f:female)
      `-------------- dialect region (1..8)
    
  • speakers

    List of speaker IDs. An example of speaker ID:

    dr1-fvmh0
    

    Note that if you split an item ID at the slash and take the first element of the result, you will get a speaker ID.

    >>> itemid = 'dr1-fvmh0/sx206'
    >>> spkrid , sentid = itemid.split('/')
    >>> spkrid
    'dr1-fvmh0'
    

    The second element of the result is a sentence ID.

  • dictionary()

    Phonetic dictionary of words contained in this corpus. This is a Python dictionary from words to phoneme lists.

  • spkrinfo()

    Speaker information table. It’s a Python dictionary from speaker IDs to records of 10 fields. Speaker IDs are the same as the ones in timit.speakers. Each record is a dictionary from field names to values, and the fields are as follows:

    id         speaker ID as defined in the original TIMIT speaker info table
    sex        speaker gender (M:male, F:female)
    dr         speaker dialect region (1:new england, 2:northern,
               3:north midland, 4:south midland, 5:southern, 6:new york city,
               7:western, 8:army brat (moved around))
    use        corpus type (TRN:training, TST:test)
               in this sample corpus only TRN is available
    recdate    recording date
    birthdate  speaker birth date
    ht         speaker height
    race       speaker race (WHT:white, BLK:black, AMR:american indian,
               SPN:spanish-american, ORN:oriental,???:unknown)
    edu        speaker education level (HS:high school, AS:associate degree,
               BS:bachelor's degree (BS or BA), MS:master's degree (MS or MA),
               PHD:doctorate degree (PhD,JD,MD), ??:unknown)
    comments   comments by the recorder
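The utterance identifier layout shown in the diagram above can be decomposed with a regular expression. This is a sketch that assumes every identifier follows the pattern of the example 'dr1-fvmh0/sx206'; ITEM_RE and parse_item are hypothetical names.

```python
import re

# dr<region>-<sex><speaker>/s<type><number>, following the diagram.
ITEM_RE = re.compile(
    r"^dr(?P<region>[1-8])-(?P<sex>[mf])(?P<speaker>\w+)"
    r"/s(?P<type>[aix])(?P<number>\d+)$"
)


def parse_item(itemid):
    """Split an utterance identifier into its labelled parts."""
    match = ITEM_RE.match(itemid)
    if match is None:
        raise ValueError("unexpected utterance identifier: %r" % itemid)
    parts = match.groupdict()
    parts["region"] = int(parts["region"])
    parts["number"] = int(parts["number"])
    return parts


print(parse_item("dr1-fvmh0/sx206"))
```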
    

The 4 functions are as follows.

  • tokenized(sentences=items, offset=False)

    Given a list of items, returns an iterator of a list of word lists, each of which corresponds to an item (sentence). If offset is set to True, each element of the word list is a tuple of word(string), start offset and end offset, where offset is represented as a number of 16kHz samples.

  • phonetic(sentences=items, offset=False)

    Given a list of items, returns an iterator of a list of phoneme lists, each of which corresponds to an item (sentence). If offset is set to True, each element of the phoneme list is a tuple of phoneme(string), start offset and end offset, where offset is represented as a number of 16kHz samples.

  • audiodata(item, start=0, end=None)

    Given an item, returns a chunk of audio samples formatted into a string. When the function is called, if start and end are omitted, all samples of the recording will be returned. If only end is omitted, samples from the start offset to the end of the recording will be returned.

  • play(data)

    Play the given audio samples. The audio samples can be obtained from the timit.audiodata function.
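Because the offsets above are counted in 16kHz samples, converting them to seconds is a single division. The following is a minimal sketch of that arithmetic; the helper name samples_to_seconds and the example tuples are hypothetical and are not part of the timit module.

```python
SAMPLE_RATE_HZ = 16000  # offsets are expressed as a number of 16kHz samples


def samples_to_seconds(offset, sample_rate=SAMPLE_RATE_HZ):
    """Convert a sample offset into seconds."""
    return offset / sample_rate


# Hypothetical (word, start offset, end offset) tuples, shaped like the
# elements returned when offset is set to True.
word_times = [("she", 2000, 5600), ("had", 5600, 8000)]

# Duration of each word in seconds.
durations = [(word, samples_to_seconds(end - start))
             for word, start, end in word_times]
```

The same conversion applies to phoneme offsets, since both use the same unit.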

class nltk.corpus.reader.timit.SpeakerInfo(id, sex, dr, use, recdate, birthdate, ht, race, edu, comments=None)[source]

Bases: builtins.object

unicode_repr()
class nltk.corpus.reader.timit.TimitCorpusReader(root, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:

  • timitdic.txt: dictionary of standard transcriptions
  • spkrinfo.txt: table of speaker information

In addition, the root directory should contain one subdirectory for each speaker, containing four files for each utterance:

  • <utterance-id>.txt: text content of utterances
  • <utterance-id>.wrd: tokenized text content of utterances
  • <utterance-id>.phn: phonetic transcription of utterances
  • <utterance-id>.wav: utterance sound file
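The layout above can be pictured with a small path helper. This is only a sketch: it assumes that an utterance identifier has the form "<speaker-directory>/<utterance-id>", and the function name utterance_files and the example identifier are hypothetical rather than part of TimitCorpusReader.

```python
from pathlib import PurePosixPath

# The four per-utterance file types listed above.
TIMIT_EXTENSIONS = ("txt", "wrd", "phn", "wav")


def utterance_files(root, utterance):
    """Map each file type to the path expected for one utterance.

    Assumes `utterance` looks like "<speaker-directory>/<utterance-id>".
    """
    return {ext: str(PurePosixPath(root, utterance + "." + ext))
            for ext in TIMIT_EXTENSIONS}


# Example identifier chosen for illustration only.
paths = utterance_files("timit", "dr1-fvmh0/sa1")
```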
audiodata(utterance, start=0, end=None)[source]
fileids(filetype=None)[source]

Return a list of file identifiers for the files that make up this corpus.

Parameters:filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.
phone_times(utterances=None)[source]

Offsets are represented as a number of 16kHz samples.

phone_trees(utterances=None)[source]
phones(utterances=None)[source]
play(utterance, start=0, end=None)[source]

Play the given audio sample.

Parameters:utterance – The utterance id of the sample to play
sent_times(utterances=None)[source]
sentid(utterance)[source]
sents(utterances=None)[source]
spkrid(utterance)[source]
spkrinfo(speaker)[source]
Returns:A dictionary mapping .. something.
spkrutteranceids(speaker)[source]
Returns:A list of all utterances associated with a given speaker.

transcription_dict()[source]
Returns:A dictionary giving the ‘standard’ transcription for each word.

utterance(spkrid, sentid)[source]
utteranceids(dialect=None, sex=None, spkrid=None, sent_type=None, sentid=None)[source]
Returns:A list of the utterance identifiers for all utterances in this corpus, or for the given speaker, dialect region, gender, sentence type, or sentence number, if specified.

wav(utterance, start=0, end=None)[source]
word_times(utterances=None)[source]
words(utterances=None)[source]
nltk.corpus.reader.timit.read_timit_block(stream)[source]

Block reader for timit tagged sentences, which are preceded by a sentence number that will be ignored.

nltk.corpus.reader.toolbox module

Module for reading, writing and manipulating Toolbox databases and settings files.

class nltk.corpus.reader.toolbox.ToolboxCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

entries(fileids, **kwargs)[source]
fields(fileids, strip=True, unwrap=True, encoding='utf8', errors='strict', unicode_fields=None)[source]
raw(fileids)[source]
words(fileids, key='lx')[source]
xml(fileids, key=None)[source]
nltk.corpus.reader.toolbox.demo()[source]

nltk.corpus.reader.udhr module

UDHR corpus reader. It mostly deals with encodings.

class nltk.corpus.reader.udhr.UdhrCorpusReader(root='udhr')[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

ENCODINGS = [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]
SKIP = {'Chinese_Mandarin-HZ', 'Vietnamese-VPS', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117', 'Vietnamese-TCVN', 'Hungarian_Magyar-Unicode', 'Gujarati-UTF8', 'Armenian-DallakHelv', 'Esperanto-T61', 'Lao-UTF8', 'Tigrinya_Tigrigna-VG2Main', 'Czech-Latin2-err', 'Magahi-Agra', 'Burmese_Myanmar-WinResearcher', 'Chinese_Mandarin-UTF8', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Bhojpuri-Agra', 'Japanese_Nihongo-JIS', 'Russian_Russky-UTF8~', 'Vietnamese-VIQR', 'Amharic-Afenegus6..60375', 'Tamil-UTF8', 'Navaho_Dine-Navajo-Navaho-font', 'Magahi-UTF8', 'Marathi-UTF8', 'Burmese_Myanmar-UTF8'}
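ENCODINGS pairs fileid patterns with codec names. A minimal sketch of how such a table could be consulted is shown below, assuming that patterns are tried in order with re.match and that the first match wins. The helper pick_encoding is hypothetical, and the table is a three-entry excerpt of the list above.

```python
import re

# Excerpt of the (pattern, encoding) table; order matters in this sketch.
ENCODINGS_EXCERPT = [
    (r".*-Latin1$", "latin-1"),
    (r"Czech_Cesky-UTF8", "cp1250"),
    (r".*-UTF8$", "utf-8"),
]


def pick_encoding(fileid, table, default=None):
    """Return the encoding of the first pattern that matches fileid."""
    for pattern, encoding in table:
        if re.match(pattern, fileid):
            return encoding
    return default


latin = pick_encoding("English-Latin1", ENCODINGS_EXCERPT)
czech = pick_encoding("Czech_Cesky-UTF8", ENCODINGS_EXCERPT)
tamil = pick_encoding("Tamil-UTF8", ENCODINGS_EXCERPT)
```

Under the first-match assumption, the specific Czech_Cesky-UTF8 entry takes precedence over the generic UTF8 pattern because it appears earlier in the list.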

nltk.corpus.reader.util module

class nltk.corpus.reader.util.ConcatenatedCorpusView(corpus_views)[source]

Bases: nltk.util.AbstractLazySequence

A ‘view’ of a corpus file that joins together one or more StreamBackedCorpusView objects. At most one file handle is left open at any time.

close()[source]
iterate_from(start_tok)[source]
class nltk.corpus.reader.util.PickleCorpusView(fileid, delete_on_gc=False)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A stream backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump). One use case for this class is to store the result of running feature detection on a corpus to disk. This can be useful when performing feature detection is expensive (so we don’t want to repeat it); but the corpus is too large to store in memory. The following example illustrates this technique:

>>> from nltk.corpus.reader.util import PickleCorpusView
>>> from nltk.util import LazyMap
>>> feature_corpus = LazyMap(detect_features, corpus) 
>>> PickleCorpusView.write(feature_corpus, some_fileid)  
>>> pcv = PickleCorpusView(some_fileid) 
BLOCK_SIZE = 100
PROTOCOL = -1
classmethod cache_to_tempfile(sequence, delete_on_gc=True)[source]

Write the given sequence to a temporary file as a pickle corpus; and then return a PickleCorpusView view for that temporary corpus file.

Parameters:delete_on_gc – If true, then the temporary file will be deleted whenever this object gets garbage-collected.
read_block(stream)[source]
classmethod write(sequence, output_file)[source]
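The idea of storing a sequence as consecutive pickles and reading it back block by block can be sketched with the standard library alone. This is an illustration of the technique, not the PickleCorpusView implementation; write_pickles and read_pickle_block are hypothetical names, and an in-memory buffer stands in for the corpus file.

```python
import io
import pickle


def write_pickles(sequence, stream):
    """Serialize each item of sequence as its own pickle, back to back."""
    for item in sequence:
        pickle.dump(item, stream, protocol=-1)  # -1 selects the highest protocol


def read_pickle_block(stream, block_size=100):
    """Read up to block_size pickled objects; stop early at end of file."""
    block = []
    for _ in range(block_size):
        try:
            block.append(pickle.load(stream))
        except EOFError:
            break
    return block


# Stand-in for an expensive feature detector applied to a corpus.
buffer = io.BytesIO()
write_pickles(({"length": len(w)} for w in ["corpus", "view"]), buffer)

buffer.seek(0)
features = read_pickle_block(buffer)
```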
class nltk.corpus.reader.util.StreamBackedCorpusView(fileid, block_reader=None, startpos=0, encoding='utf8')[source]

Bases: nltk.util.AbstractLazySequence

A ‘view’ of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc. However, the tokens are only constructed as-needed – the entire corpus is never stored in memory at once.

The constructor to StreamBackedCorpusView takes two arguments: a corpus fileid (specified as a string or as a PathPointer); and a block reader. A “block reader” is a function that reads zero or more tokens from a stream, and returns them as a list. A very simple example of a block reader is:

>>> def simple_block_reader(stream):
...     return stream.readline().split()

This simple block reader reads a single line at a time, and returns a single token (consisting of a string) for each whitespace-separated substring on the line.
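To see what the block reader returns, it can be called repeatedly on an in-memory stream. The sketch below uses io.StringIO in place of a corpus file and stops when a call no longer advances the stream position; the driving loop is an illustration only, not how the corpus view itself iterates.

```python
import io


def simple_block_reader(stream):
    return stream.readline().split()


stream = io.StringIO("the quick fox\njumps over\n")

blocks = []
while True:
    start = stream.tell()
    block = simple_block_reader(stream)
    if stream.tell() == start:  # nothing was consumed: end of stream
        break
    blocks.append(block)
```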

When deciding how to define the block reader for a given corpus, careful consideration should be given to the size of blocks handled by the block reader. Smaller block sizes will increase the memory requirements of the corpus view’s internal data structures (by 2 integers per block). On the other hand, larger block sizes may decrease performance for random access to the corpus. (But note that larger block sizes will not decrease performance for iteration.)

Internally, CorpusView maintains a partial mapping from token index to file position, with one entry per block. When a token with a given index i is requested, the CorpusView constructs it as follows:

  1. First, it searches the toknum/filepos mapping for the token index closest to (but less than or equal to) i.
  2. Then, starting at the file position corresponding to that index, it reads one block at a time using the block reader until it reaches the requested token.

The toknum/filepos mapping is created lazily: it is initially empty, but every time a new block is read, the block’s initial token is added to the mapping. (Thus, the toknum/filepos map has one entry per block.)
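The two lookup steps and the lazily built mapping can be sketched as a toy class. This is a simplified illustration under stated assumptions, not NLTK's implementation: TinyCorpusView is a hypothetical name, the mapping is seeded with block 0 for convenience, there is no cache, and end of file is detected by a read that does not advance the stream.

```python
import bisect
import io


def line_block_reader(stream):
    """Read one line and return its whitespace-separated tokens."""
    return stream.readline().split()


class TinyCorpusView:
    """Toy token view backed by a stream and a toknum/filepos mapping."""

    def __init__(self, stream, block_reader):
        self._stream = stream
        self._block_reader = block_reader
        self._toknum = [0]   # token index of the first token of each known block
        self._filepos = [0]  # stream position where each known block starts

    def __getitem__(self, i):
        if i < 0:
            raise IndexError("negative indices are not supported in this sketch")
        # Step 1: find the known block start closest to, but not after, i.
        block = bisect.bisect_right(self._toknum, i) - 1
        toknum = self._toknum[block]
        self._stream.seek(self._filepos[block])
        # Step 2: read one block at a time until token i is reached.
        while True:
            start = self._stream.tell()
            tokens = self._block_reader(self._stream)
            end = self._stream.tell()
            if end == start:
                raise IndexError("token index out of range")
            if i < toknum + len(tokens):
                return tokens[i - toknum]
            toknum += len(tokens)
            # Record where the next block starts, if it is not known yet.
            if toknum > self._toknum[-1]:
                self._toknum.append(toknum)
                self._filepos.append(end)


view = TinyCorpusView(io.StringIO("a b c\nd e\nf\n"), line_block_reader)
```

Only positions returned by tell() are passed to seek(), in line with the warning about seeking in encoded streams.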

In order to increase efficiency for random access patterns that have high degrees of locality, the corpus view may cache one or more blocks.

Note:

Each CorpusView object internally maintains an open file object for its underlying corpus file. This file should be automatically closed when the CorpusView is garbage collected, but if you wish to close it manually, use the close() method. If you access a CorpusView’s items after it has been closed, the file object will be automatically re-opened.

Warning:

If the contents of the file are modified during the lifetime of the CorpusView, then the CorpusView’s behavior is undefined.

Warning:

If a unicode encoding is specified when constructing a CorpusView, then the block reader may only call stream.seek() with offsets that have been returned by stream.tell(); in particular, calling stream.seek() with relative offsets, or with offsets based on string lengths, may lead to incorrect behavior.

Variables:
  • _block_reader – The function used to read a single block from the underlying file stream.
  • _toknum – A list containing the token index of each block that has been processed. In particular, _toknum[i] is the token index of the first token in block i. Together with _filepos, this forms a partial mapping between token indices and file positions.
  • _filepos – A list containing the file position of each block that has been processed. In particular, _filepos[i] is the file position of the first character in block i. Together with _toknum, this forms a partial mapping between token indices and file positions.
  • _stream – The stream used to access the underlying corpus file.
  • _len – The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known.
  • _eofpos – The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached.
  • _cache – A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block.
close()[source]

Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view). If the corpus view is accessed after it is closed, it will be automatically re-opened.

fileid

The fileid of the file that is accessed by this view.

Type:str or PathPointer
iterate_from(start_tok)[source]
read_block(stream)[source]

Read a block from the input stream.

Returns:a block of tokens from the input stream
Return type:list(any)
Parameters:stream (stream) – an input stream
nltk.corpus.reader.util.concat(docs)[source]

Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.

nltk.corpus.reader.util.find_corpus_fileids(root, regexp)[source]
nltk.corpus.reader.util.read_alignedsent_block(stream)[source]
nltk.corpus.reader.util.read_blankline_block(stream)[source]
nltk.corpus.reader.util.read_line_block(stream)[source]
nltk.corpus.reader.util.read_regexp_block(stream, start_re, end_re=None)[source]

Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching start_re or EOF is found.
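The behavior described above can be sketched as follows. This is an illustration written from the description, not NLTK's source: regexp_block_sketch is a hypothetical name, each call returns at most one token, and dropping the line that matches end_re is a choice made by the sketch.

```python
import io
import re


def regexp_block_sketch(stream, start_re, end_re=None):
    """Return a list holding the next token, or [] at end of stream."""
    # Skip ahead to the first line that matches start_re.
    while True:
        line = stream.readline()
        if not line:
            return []
        if re.match(start_re, line):
            break
    lines = [line]
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of stream closes the token
        if end_re is not None and re.match(end_re, line):
            break  # explicit end line closes the token
        if end_re is None and re.match(start_re, line):
            stream.seek(pos)  # leave the next start line for the next call
            break
        lines.append(line)
    return ["".join(lines)]


stream = io.StringIO("# one\na\nb\n# two\nc\n")
first = regexp_block_sketch(stream, r"#")
second = regexp_block_sketch(stream, r"#")
third = regexp_block_sketch(stream, r"#")
```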

nltk.corpus.reader.util.read_sexpr_block(stream, block_size=16384, comment_char=None)[source]

Read a sequence of s-expressions from the stream, and leave the stream’s file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.

If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.

Parameters:
  • block_size – The default block size for reading. If an s-expression is longer than one block, then more than one block will be read.
  • comment_char – A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.)
nltk.corpus.reader.util.read_whitespace_block(stream)[source]
nltk.corpus.reader.util.read_wordpunct_block(stream)[source]
nltk.corpus.reader.util.tagged_treebank_para_block_reader(stream)[source]

nltk.corpus.reader.verbnet module

An NLTK interface to the VerbNet verb lexicon

For details about VerbNet see: http://verbs.colorado.edu/~mpalmer/projects/verbnet.html

class nltk.corpus.reader.verbnet.VerbnetCorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

classids(lemma=None, wordnetid=None, fileid=None, classid=None)[source]

Return a list of the verbnet class identifiers. If a file identifier is specified, then return only the verbnet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only verbnet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified verbnet class.

fileids(vnclass_ids=None)[source]

Return a list of fileids that make up this corpus. If vnclass_ids is specified, then return the fileids that make up the specified verbnet class(es).

lemmas(classid=None)[source]

Return a list of all verb lemmas that appear in any class, or in the classid if specified.

longid(shortid)[source]

Given a short verbnet class identifier (eg ‘37.10’), map it to a long id (eg ‘confess-37.10’). If shortid is already a long id, then return it as-is.

pprint(vnclass)[source]

Return a string containing a pretty-printed representation of the given verbnet class.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.

pprint_description(vnframe, indent='')[source]

Return a string containing a pretty-printed representation of the given verbnet frame description.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_frame(vnframe, indent='')[source]

Return a string containing a pretty-printed representation of the given verbnet frame.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_members(vnclass, indent='')[source]

Return a string containing a pretty-printed representation of the given verbnet class’s member verbs.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
pprint_semantics(vnframe, indent='')[source]

Return a string containing a pretty-printed representation of the given verbnet frame semantics.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_subclasses(vnclass, indent='')[source]

Return a string containing a pretty-printed representation of the given verbnet class’s subclasses.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
pprint_syntax(vnframe, indent='')[source]

Return a string containing a pretty-printed representation of the given verbnet frame syntax.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_themroles(vnclass, indent='')[source]

Return a string containing a pretty-printed representation of the given verbnet class’s thematic roles.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
shortid(longid)[source]

Given a long verbnet class identifier (eg ‘confess-37.10’), map it to a short id (eg ‘37.10’). If longid is already a short id, then return it as-is.
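The long-to-short mapping is a pure string transformation and can be sketched with a regular expression. The pattern below is an assumption about the identifier format (a name, a hyphen, then a numeric part), and shortid_sketch is a hypothetical helper rather than the method's actual implementation. The reverse mapping is not shown because it needs to look up class names in the corpus.

```python
import re

# Assumed shape of a long id: "<name>-<numeric part>", e.g. "confess-37.10".
_LONG_ID = re.compile(r"^[A-Za-z_]+-(\d[\d.\-]*)$")


def shortid_sketch(classid):
    """Strip the name prefix from a long id; return short ids unchanged."""
    match = _LONG_ID.match(classid)
    return match.group(1) if match else classid


from_long = shortid_sketch("confess-37.10")
from_short = shortid_sketch("37.10")
```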

vnclass(fileid_or_classid)[source]

Return an ElementTree containing the xml for the specified verbnet class.

Parameters:fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as 'put-9.1.xml'), or a verbnet class identifier (such as 'put-9.1') or a short verbnet class identifier (such as '9.1').
wordnetids(classid=None)[source]

Return a list of all wordnet identifiers that appear in any class, or in classid if specified.

nltk.corpus.reader.wordlist module

class nltk.corpus.reader.wordlist.SwadeshCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

entries(fileids=None)[source]
Returns:a tuple of words for the specified fileids.
class nltk.corpus.reader.wordlist.WordListCorpusReader(root, fileids, encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.

raw(fileids=None)[source]
words(fileids=None)[source]

nltk.corpus.reader.wordnet module

An NLTK interface for WordNet

WordNet is a lexical database of English. Using synsets, it helps find conceptual relationships between words such as hypernyms, hyponyms, synonyms and antonyms.

For details about WordNet see: http://wordnet.princeton.edu/

class nltk.corpus.reader.wordnet.Lemma(wordnet_corpus_reader, synset, name, lexname_index, lex_id, syntactic_marker)[source]

Bases: nltk.corpus.reader.wordnet._WordNetObject

The lexical entry for a single morphological form of a sense-disambiguated word.

Create a Lemma from a “<word>.<pos>.<number>.<lemma>” string where:

  • <word> is the morphological stem identifying the synset
  • <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
  • <number> is the sense number, counting from 0
  • <lemma> is the morphological form of interest

Note that <word> and <lemma> can be different, e.g. the Synset ‘salt.n.03’ has the Lemmas ‘salt.n.03.salt’, ‘salt.n.03.saltiness’ and ‘salt.n.03.salinity’.
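The four-part name can be taken apart with a regular expression. The sketch below is illustrative only: parse_lemma_name is a hypothetical helper, the part of speech is assumed to be a single lowercase letter, and the lazy match on the word part is a guess at how to treat words that contain periods.

```python
import re

# word . pos . number . lemma; pos is assumed to be one lowercase letter.
_LEMMA_NAME = re.compile(
    r"^(?P<word>.+?)\.(?P<pos>[a-z])\.(?P<number>\d+)\.(?P<lemma>.+)$"
)


def parse_lemma_name(name):
    """Split a lemma name into (word, pos, number, lemma)."""
    match = _LEMMA_NAME.match(name)
    if match is None:
        raise ValueError("not a <word>.<pos>.<number>.<lemma> string: %r" % name)
    return (match.group("word"), match.group("pos"),
            int(match.group("number")), match.group("lemma"))


parts = parse_lemma_name("salt.n.03.salinity")
```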

Lemma attributes, accessible via methods with the same name:

  • name: The canonical name of this lemma.
  • synset: The synset that this lemma belongs to.
  • syntactic_marker: For adjectives, the WordNet string identifying the syntactic position relative to the modified noun. See: http://wordnet.princeton.edu/man/wninput.5WN.html#sect10 For all other parts of speech, this attribute is None.
  • count: The frequency of this lemma in wordnet.

Lemma methods:

Lemmas have the following methods for retrieving related Lemmas. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Lemmas:

  • antonyms
  • hypernyms, instance_hypernyms
  • hyponyms, instance_hyponyms
  • member_holonyms, substance_holonyms, part_holonyms
  • member_meronyms, substance_meronyms, part_meronyms
  • topic_domains, region_domains, usage_domains
  • attributes
  • derivationally_related_forms
  • entailments
  • causes
  • also_sees
  • verb_groups
  • similar_tos
  • pertainyms
antonyms()[source]
count()[source]

Return the frequency count for this Lemma

frame_ids()[source]
frame_strings()[source]
key()[source]
lang()[source]
name()[source]
pertainyms()[source]
synset()[source]
syntactic_marker()[source]
unicode_repr()
class nltk.corpus.reader.wordnet.Synset(wordnet_corpus_reader)[source]

Bases: nltk.corpus.reader.wordnet._WordNetObject

Create a Synset from a “<lemma>.<pos>.<number>” string where:

  • <lemma> is the word’s morphological stem
  • <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
  • <number> is the sense number, counting from 0

Synset attributes, accessible via methods with the same name:

  • name: The canonical name of this synset, formed using the first lemma of this synset. Note that this may be different from the name passed to the constructor if that string used a different lemma to identify the synset.
  • pos: The synset’s part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB.
  • lemmas: A list of the Lemma objects for this synset.
  • definition: The definition for this synset.
  • examples: A list of example strings for this synset.
  • offset: The offset in the WordNet dict file of this synset.
  • lexname: The name of the lexicographer file containing this synset.

Synset methods:

Synsets have the following methods for retrieving related Synsets. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Synsets.

  • hypernyms, instance_hypernyms
  • hyponyms, instance_hyponyms
  • member_holonyms, substance_holonyms, part_holonyms
  • member_meronyms, substance_meronyms, part_meronyms
  • attributes
  • entailments
  • causes
  • also_sees
  • verb_groups
  • similar_tos

Additionally, Synsets support the following methods specific to the hypernym relation:

  • root_hypernyms
  • common_hypernyms
  • lowest_common_hypernyms

Note that Synsets do not support the following relations because these are defined by WordNet as lexical relations:

  • antonyms
  • derivationally_related_forms
  • pertainyms
closure(rel, depth=-1)[source]

Return the transitive closure of this synset under the rel relationship, in breadth-first order.

>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:s.hypernyms()
>>> list(dog.closure(hyp))
[Synset('canine.n.02'), Synset('domestic_animal.n.01'),
Synset('carnivore.n.01'), Synset('animal.n.01'),
Synset('placental.n.01'), Synset('organism.n.01'),
Synset('mammal.n.01'), Synset('living_thing.n.01'),
Synset('vertebrate.n.01'), Synset('whole.n.02'),
Synset('chordate.n.01'), Synset('object.n.01'),
Synset('physical_entity.n.01'), Synset('entity.n.01')]
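The breadth-first traversal behind a transitive closure can be sketched on a toy graph without WordNet data. The function bfs_closure and the small hypernym dictionary are hypothetical; the sketch ignores the depth argument and is not the method's actual implementation.

```python
from collections import deque


def bfs_closure(start, rel):
    """Yield every node reachable from start via rel, breadth-first, once each."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in rel(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                yield nxt


# Toy hypernym graph: each key maps to its direct hypernyms.
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["animal"],
    "animal": [],
}

reachable = list(bfs_closure("dog", lambda node: HYPERNYMS[node]))
```

Nodes at distance one come first, then nodes at distance two, which matches the ordering visible in the doctest output above.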
common_hypernyms(other)[source]

Find all synsets that are hypernyms of this synset and the other synset.

Parameters:other (Synset) – other input synset.
Returns:The synsets that are hypernyms of both synsets.
definition()[source]
examples()[source]
frame_ids()[source]
hypernym_distances(distance=0, simulate_root=False)[source]

Get the path(s) from this synset to the root, counting the distance of each node from the initial node on the way. A set of (synset, distance) tuples is returned.

Parameters:distance (int) – the distance (number of edges) from this hypernym to the original hypernym Synset on which this method was called.
Returns:A set of (Synset, int) tuples where each Synset is a hypernym of the first Synset.
hypernym_paths()[source]

Get the path(s) from this synset to the root, where each path is a list of the synset nodes traversed on the way to the root.

Returns:A list of lists, where each list gives the node sequence connecting the initial Synset node and a root node.
jcn_similarity(other, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.

lch_similarity(other, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
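The relationship -log(p/2d) can be checked with plain arithmetic. This sketch only evaluates the formula for given values of p and d; lch_score is a hypothetical helper, and how p and d are obtained from the taxonomy is not shown.

```python
import math


def lch_score(path_length, taxonomy_depth):
    """Evaluate -log(p / 2d) for path length p and taxonomy depth d."""
    return -math.log(path_length / (2.0 * taxonomy_depth))


# For a fixed depth, a shorter connecting path gives a higher score.
close = lch_score(1, 10)
far = lch_score(4, 10)
```

Since the score depends on d, scores from taxonomies of different depth are not directly comparable.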

lemma_names(lang='en')[source]

Return all the lemma_names associated with the synset

lemmas(lang='en')[source]

Return all the lemma objects associated with the synset

lexname()[source]
lin_similarity(other, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
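The Lin equation above, and the Jiang-Conrath equation given for jcn_similarity, are simple functions of three information content values. The sketch below evaluates both for hand-picked numbers; lin_score and jcn_score are hypothetical helpers, the IC values are made up for illustration, and edge cases such as a zero denominator are not handled.

```python
def lin_score(ic1, ic2, ic_lcs):
    """2 * IC(lcs) / (IC(s1) + IC(s2))"""
    return 2.0 * ic_lcs / (ic1 + ic2)


def jcn_score(ic1, ic2, ic_lcs):
    """1 / (IC(s1) + IC(s2) - 2 * IC(lcs)); undefined when the denominator is 0."""
    return 1.0 / (ic1 + ic2 - 2.0 * ic_lcs)


# Made-up information content values for two synsets and their LCS.
lin = lin_score(4.0, 6.0, 3.0)
jcn = jcn_score(4.0, 6.0, 3.0)
```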

lowest_common_hypernyms(other, simulate_root=False, use_min_depth=False)[source]

Get a list of the lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False, this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned; if there are multiple such synsets at the same depth, they are all returned.

However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

By setting the use_min_depth flag to True, the behavior of NLTK2 can be preserved. This was changed in NLTK3 to give more accurate results in a small set of cases, generally with synsets concerning people. (eg: ‘chef.n.01’, ‘fireman.n.01’, etc.)

This method is an implementation of Ted Pedersen’s “Lowest Common Subsumer” method from the Perl Wordnet module. It can return either “self” or “other” if they are a hypernym of the other.

Parameters:
  • other (Synset) – other input synset
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (False by default) creates a fake root that connects all the taxonomies. Set it to True to enable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will need to be added for nouns as well.
  • use_min_depth (bool) – This setting mimics older (v2) behavior of NLTK wordnet. If True, will use the min_depth function to calculate the lowest common hypernyms. This is known to give strange results for some synset pairs (eg: ‘chef.n.01’, ‘fireman.n.01’) but is retained for backwards compatibility.
Returns:

The synsets that are the lowest common hypernyms of both synsets

max_depth()[source]
Returns:The length of the longest hypernym path from this synset to the root.

min_depth()[source]
Returns:The length of the shortest hypernym path from this synset to the root.

name()[source]
offset()[source]
path_similarity(other, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.

pos()[source]
res_similarity(other, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

root_hypernyms()[source]

Get the topmost hypernyms of this synset in WordNet.

shortest_path_distance(other, simulate_root=False)[source]

Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, None is returned. If a node is compared with itself 0 is returned.

Parameters:other (Synset) – The Synset to which the shortest path will be found.
Returns:The number of edges in the shortest path connecting the two nodes, or None if no path exists.
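The procedure described above (record every ancestor with its distance, then pick the common ancestor with the smallest combined distance) can be sketched on a toy graph. The function names and the PARENTS dictionary are hypothetical, and the sketch leaves out the simulate_root option.

```python
from collections import deque


def ancestor_distances(node, parents):
    """Map node and each of its ancestors to the fewest edges needed to reach it."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def shortest_path_sketch(a, b, parents):
    """Smallest combined distance through a shared ancestor, or None."""
    dist_a = ancestor_distances(a, parents)
    dist_b = ancestor_distances(b, parents)
    common = dist_a.keys() & dist_b.keys()
    if not common:
        return None
    return min(dist_a[n] + dist_b[n] for n in common)


# Toy taxonomy: each key maps to its direct hypernyms.
PARENTS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": [],
    "run": [],
}

dog_cat = shortest_path_sketch("dog", "cat", PARENTS)
```

Because each node counts as its own ancestor at distance 0, comparing a node with itself gives 0 and comparing a node with one of its hypernyms gives the direct distance.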
tree(rel, depth=-1, cut_mark=None)[source]
>>> from nltk.corpus import wordnet as wn
>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:s.hypernyms()
>>> from pprint import pprint
>>> pprint(dog.tree(hyp))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'),
  [Synset('animal.n.01'),
   [Synset('organism.n.01'),
    [Synset('living_thing.n.01'),
     [Synset('whole.n.02'),
      [Synset('object.n.01'),
       [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]]]
unicode_repr()
wup_similarity(other, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although the scores for nouns still do not always agree.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
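The description above does not spell out the formula. Wu-Palmer similarity is commonly stated as 2 * depth(lcs) / (depth(s1) + depth(s2)), and the sketch below applies that statement to a toy taxonomy. It is an illustration only: the PARENT table and helper names are hypothetical, every node has a single parent, and counting the root at depth 1 is an assumption of this sketch.

```python
# Hypothetical taxonomy: child -> parent (None marks a root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "canine": "animal",
    "feline": "animal",
    "dog": "canine",
    "cat": "feline",
    "move": None,  # a separate root, like a disconnected verb taxonomy
    "run": "move",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth with the root counted as 1 (an assumption of this sketch)."""
    return len(path_to_root(node))


def least_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks from b upwards, deepest first
        if node in ancestors_of_a:
            return node
    return None


def wup(a, b):
    lcs = least_common_subsumer(a, b)
    if lcs is None:
        return None  # no connecting path between the two senses
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

With these tables, wup("dog", "cat") uses animal (depth 2) as the LCS of two depth-4 nodes, and wup("dog", "run") has no common ancestor because no fake root joins the two taxonomies.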

nltk.corpus.reader.wordnet.VERB_FRAME_STRINGS = (None, 'Something %s', 'Somebody %s', 'It is %sing', 'Something is %sing PP', 'Something %s something Adjective/Noun', 'Something %s Adjective/Noun', 'Somebody %s Adjective', 'Somebody %s something', 'Somebody %s somebody', 'Something %s somebody', 'Something %s something', 'Something %s to somebody', 'Somebody %s on something', 'Somebody %s somebody something', 'Somebody %s something to somebody', 'Somebody %s something from somebody', 'Somebody %s somebody with something', 'Somebody %s somebody of something', 'Somebody %s something on somebody', 'Somebody %s somebody PP', 'Somebody %s something PP', 'Somebody %s PP', "Somebody's (body part) %s", 'Somebody %s somebody to INFINITIVE', 'Somebody %s somebody INFINITIVE', 'Somebody %s that CLAUSE', 'Somebody %s to somebody', 'Somebody %s to INFINITIVE', 'Somebody %s whether INFINITIVE', 'Somebody %s somebody into V-ing something', 'Somebody %s something with something', 'Somebody %s INFINITIVE', 'Somebody %s VERB-ing', 'It %s that CLAUSE', 'Something %s INFINITIVE')

A table of strings that are used to express verb frames.
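Each entry is an ordinary %-format string with a single %s slot for the verb. As a small illustration, using only the first few entries of the table above (FRAMES and render_frames are hypothetical names introduced here):

```python
# A short excerpt of the table above; index 0 is the None placeholder.
FRAMES = (None, 'Something %s', 'Somebody %s', 'It is %sing')


def render_frames(verb, frame_ids, frames=FRAMES):
    """Fill the %s slot of each selected frame string with the verb."""
    return [frames[i] % verb for i in frame_ids]


examples = render_frames("rain", [1, 3])
```

The substitution is purely textual, so the verb is inserted as given and is not inflected.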

class nltk.corpus.reader.wordnet.WordNetCorpusReader(root, omw_reader)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader used to access wordnet or its variants.

ADJ = 'a'
ADJ_SAT = 's'
ADV = 'r'
MORPHOLOGICAL_SUBSTITUTIONS = {'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]}
NOUN = 'n'
VERB = 'v'
all_lemma_names(pos=None, lang='en')[source]

Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

all_synsets(pos=None)[source]

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

get_version()[source]
ic(corpus, weight_senses_equally=False, smoothing=1.0)[source]

Creates an information content lookup dictionary from a corpus.

Parameters:
  • corpus (CorpusReader) – The corpus from which we create an information content dictionary.
  • weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
  • smoothing (float) – How much to smooth synset counts (default is 1.0).
Returns:

An information content dictionary.
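The effect of weight_senses_equally on the counts can be sketched on a toy sense inventory. This is a simplified illustration, not NLTK's implementation: sense_counts and senses_of are hypothetical names, the sense identifiers are made up, and starting every known sense at the smoothing value is an assumption of this sketch.

```python
def sense_counts(words, senses_of, weight_senses_equally=False, smoothing=1.0):
    """Accumulate a count per sense from a list of word occurrences."""
    # Assumption: every known sense starts at the smoothing value.
    counts = {s: smoothing for senses in senses_of.values() for s in senses}
    for word in words:
        senses = senses_of.get(word, [])
        if not senses:
            continue  # word is not in the toy inventory
        # Either give each sense a full count or split one count between them.
        weight = 1.0 if weight_senses_equally else 1.0 / len(senses)
        for sense in senses:
            counts[sense] += weight
    return counts


senses_of = {"bank": ["bank.n.01", "bank.n.02", "bank.n.03"]}
split = sense_counts(["bank"], senses_of, smoothing=0.0)
equal = sense_counts(["bank"], senses_of, weight_senses_equally=True, smoothing=0.0)
```

Here split gives each of the three senses one third of an appearance, while equal gives each sense a full appearance.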

jcn_similarity(synset1, synset2, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.
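The equation above can be checked by hand on made-up numbers. A minimal sketch, assuming the three IC values are already known: jcn_score is a hypothetical helper, and returning infinity when the denominator is zero is this sketch's own choice, not a statement about what NLTK returns in that case.

```python
def jcn_score(ic_s1, ic_s2, ic_lcs):
    """Jiang-Conrath score: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."""
    difference = ic_s1 + ic_s2 - 2 * ic_lcs
    if difference == 0:
        # Both senses carry exactly the information of their LCS.
        return float("inf")
    return 1 / difference


score = jcn_score(5.0, 7.0, 4.0)  # 1 / (5 + 7 - 8)
```

The score grows as the LCS becomes more informative relative to the two senses, that is, as the senses sit closer to their common ancestor.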

langs()[source]

Return a list of languages supported by Multilingual Wordnet.

lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
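The relationship -log(p/2d) can be tried out directly. A minimal sketch, assuming p and d are already known and that p is at least 1 (for example, because the path is measured so that a sense compared with itself still has a nonzero length); lch_score is a hypothetical helper name.

```python
import math


def lch_score(path_length, taxonomy_depth):
    """Leacock-Chodorow score: -log(p / (2 * d))."""
    if path_length <= 0 or taxonomy_depth <= 0:
        raise ValueError("path_length and taxonomy_depth must be positive")
    return -math.log(path_length / (2.0 * taxonomy_depth))


closest = lch_score(1, 10)   # shortest possible path in a depth-10 taxonomy
farther = lch_score(4, 10)
```

Because d appears inside the logarithm, the largest attainable score depends on the depth of the taxonomy, which matches the note above that the maximum varies with taxonomy depth.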

lemma(name, lang='en')[source]

Return lemma object that matches the name

lemma_count(lemma)[source]

Return the frequency count for this Lemma

lemma_from_key(key)[source]
lemmas(lemma, pos=None, lang='en')[source]

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

lin_similarity(synset1, synset2, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
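A quick numeric check of the equation above, on made-up IC values. This is only a sketch: lin_score is a hypothetical helper, and returning 0.0 when both senses have zero information content is an assumption made here to avoid dividing by zero.

```python
def lin_score(ic_s1, ic_s2, ic_lcs):
    """Lin score: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    total = ic_s1 + ic_s2
    if total == 0:
        return 0.0  # assumption of this sketch
    return 2 * ic_lcs / total


score = lin_score(5.0, 7.0, 4.0)  # 8 / 12
```

Since the LCS is an ancestor of both senses, its information content does not exceed theirs, which keeps the ratio between 0 and 1.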

morphy(form, pos=None)[source]

Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.
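The lookup order described above (exception list first, then suffix rules) can be sketched with the noun entries of MORPHOLOGICAL_SUBSTITUTIONS. This is a simplified, single-pass illustration, not the library's recursive implementation; toy_morphy, LEXICON and EXCEPTIONS are hypothetical names with made-up contents.

```python
# Noun suffix rules copied from MORPHOLOGICAL_SUBSTITUTIONS['n'].
NOUN_SUBSTITUTIONS = [
    ('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'),
    ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y'),
]

# Hypothetical stand-ins for WordNet's lemma index and exception list.
LEXICON = {"dog", "church", "aardwolf", "abacus", "book"}
EXCEPTIONS = {"abaci": "abacus"}


def candidate_forms(form, substitutions):
    """Apply every rule whose old suffix matches the end of form."""
    return [form[:-len(old)] + new
            for old, new in substitutions if form.endswith(old)]


def toy_morphy(form, substitutions=NOUN_SUBSTITUTIONS):
    if form in EXCEPTIONS:          # 1. exceptional forms
        return EXCEPTIONS[form]
    if form in LEXICON:             # 2. already a base form
        return form
    for candidate in candidate_forms(form, substitutions):
        if candidate in LEXICON:    # 3. first stripped form that is known
            return candidate
    return None                     # nothing found
```

For churches, the rule ('s', '') yields churche, which is not in the toy lexicon, so the later rule ('ches', 'ch') supplies church.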

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
of2ss(of)[source]

Take an id and return the synsets.

path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e., comparing a sense with itself will return 1.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
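One way to obtain a score with these properties is 1 / (distance + 1), where distance is the number of edges on the shortest path through a shared hypernym. The sketch below applies that to a toy hypernym table; the table and helper names are hypothetical, and the formula is stated as an assumption of the sketch, since the description above does not spell it out.

```python
from collections import deque

# Hypothetical hypernym table: node -> direct hypernyms.
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
    "run": ["move"],   # a separate taxonomy with its own root
    "move": [],
}


def hypernym_distances(node):
    """Breadth-first search upwards: ancestor -> fewest edges from node."""
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in HYPERNYMS[current]:
            if parent not in distances:
                distances[parent] = distances[current] + 1
                queue.append(parent)
    return distances


def shortest_path_distance(a, b):
    """Fewest edges from a up to a shared ancestor and down to b, or None."""
    from_a, from_b = hypernym_distances(a), hypernym_distances(b)
    shared = from_a.keys() & from_b.keys()
    if not shared:
        return None
    return min(from_a[n] + from_b[n] for n in shared)


def path_similarity(a, b):
    distance = shortest_path_distance(a, b)
    return None if distance is None else 1 / (distance + 1)
```

In this table dog and cat meet at carnivore after two edges each, so the distance is 4, while dog and run share no ancestor and therefore have no score.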

res_similarity(synset1, synset2, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
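The root-node case mentioned above can be illustrated on a toy taxonomy. In this sketch the IC values are invented for illustration, the PARENT table and helper names are hypothetical, and picking the shared ancestor with the highest IC as the LCS is an assumption of the sketch.

```python
# Hypothetical taxonomy (child -> parent) with invented IC values.
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "table": "artifact",
}
IC = {"entity": 0.0, "animal": 2.5, "dog": 6.0, "cat": 6.2,
      "artifact": 2.0, "table": 5.5}


def ancestors(node):
    """Return the node itself and all of its ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def resnik(a, b):
    shared = set(ancestors(a)) & set(ancestors(b))
    if not shared:
        return None
    # Assumption: the LCS is the shared ancestor carrying the most information.
    return max(IC[n] for n in shared)
```

The only ancestor shared by dog and table is the root, whose information content is 0, so their score is 0; dog and cat share animal and score 2.5.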

ss2of(ss)[source]

Return the ILI of the synset.

synset(name)[source]
synsets(lemma, pos=None, lang='en')[source]

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.

wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although the scores for nouns still do not always agree.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

exception nltk.corpus.reader.wordnet.WordNetError[source]

Bases: builtins.Exception

An exception class for wordnet-related errors.

class nltk.corpus.reader.wordnet.WordNetICCorpusReader(root, fileids)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for the WordNet information content corpus.

ic(icfile)[source]

Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

Parameters:icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”)
Returns:An information content dictionary
nltk.corpus.reader.wordnet.demo()[source]
nltk.corpus.reader.wordnet.information_content(synset, ic)[source]
nltk.corpus.reader.wordnet.jcn_similarity(synset1, synset2, ic, verbose=False)[source]

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.

nltk.corpus.reader.wordnet.lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

nltk.corpus.reader.wordnet.lin_similarity(synset1, synset2, ic, verbose=False)[source]

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.

nltk.corpus.reader.wordnet.path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e., comparing a sense with itself will return 1.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.

nltk.corpus.reader.wordnet.res_similarity(synset1, synset2, ic, verbose=False)[source]

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

nltk.corpus.reader.wordnet.teardown_module(module=None)[source]
nltk.corpus.reader.wordnet.wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although the scores for nouns still do not always agree.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

nltk.corpus.reader.xmldocs module

Corpus reader for corpora whose documents are xml files.

(Note: this module is not named ‘xml’, to avoid conflicting with the standard xml package.)

class nltk.corpus.reader.xmldocs.XMLCorpusReader(root, fileids, wrap_etree=False)[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for corpora whose documents are xml files.

Note that the XMLCorpusReader constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.

raw(fileids=None)[source]
words(fileid=None)[source]

Returns all of the words and punctuation symbols in the specified file that were in text nodes, i.e., tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
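The idea of reading only text nodes can be sketched with the standard library. This is not the reader's own code: words_from_xml is a hypothetical helper, and tokenizing each text node with the word/punctuation pattern \w+|[^\w\s]+ is an assumption made for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Words, or runs of punctuation (assumed tokenization for this sketch).
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")


def words_from_xml(xml_text):
    """Collect tokens from every text node, ignoring the tags themselves."""
    root = ET.fromstring(xml_text)
    words = []
    for text in root.itertext():  # yields element text and tail text in order
        words.extend(WORD_PATTERN.findall(text))
    return words


sample = "<doc><p>Hello, <b>world</b>!</p></doc>"
tokens = words_from_xml(sample)
```

The tags p and b contribute nothing to the result; only the character data between them is tokenized.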
xml(fileid=None)[source]
class nltk.corpus.reader.xmldocs.XMLCorpusView(fileid, tagspec, elt_handler=None)[source]

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

A corpus view that selects out specified elements from an XML file, and provides a flat list-like interface for accessing them. (Note: XMLCorpusView is not used by XMLCorpusReader itself, but may be used by subclasses of XMLCorpusReader.)

Every XML corpus view has a “tag specification”, indicating what XML elements should be included in the view; and each (non-nested) element that matches this specification corresponds to one item in the view. Tag specifications are regular expressions over tag paths, where a tag path is a list of element tag names, separated by ‘/’, indicating the ancestry of the element. Some examples:

  • 'foo': A top-level element whose tag is foo.
  • 'foo/bar': An element whose tag is bar and whose parent is a top-level element whose tag is foo.
  • '.*/foo': An element whose tag is foo, appearing anywhere in the xml tree.
  • '.*/(foo|bar)': An element whose tag is foo or bar, appearing anywhere in the xml tree.

The view items are generated from the selected XML elements via the method handle_elt(). By default, this method returns the element as-is (i.e., as an ElementTree object); but it can be overridden, either via subclassing or via the elt_handler constructor parameter.
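Tag specifications and element handlers can be tried out on an in-memory document. The sketch below is a non-streaming illustration built on xml.etree.ElementTree, not the corpus view's implementation: select_elements is a hypothetical helper, and requiring the regular expression to match the whole tag path is an assumption of the sketch.

```python
import re
import xml.etree.ElementTree as ET


def select_elements(xml_text, tagspec, elt_handler=None):
    """Return one item per non-nested element whose tag path matches tagspec."""
    pattern = re.compile(tagspec)
    items = []

    def walk(elt, ancestors):
        path = "/".join(ancestors + [elt.tag])
        if pattern.fullmatch(path):
            # Matching elements are kept whole; their children are not visited.
            items.append(elt_handler(elt, path) if elt_handler else elt)
            return
        for child in elt:
            walk(child, ancestors + [elt.tag])

    walk(ET.fromstring(xml_text), [])
    return items


sample = "<corpus><doc><w>a</w><w>b</w></doc><w>c</w></corpus>"
texts = select_elements(sample, ".*/w", lambda elt, context: elt.text)
docs = select_elements(sample, "corpus/doc")
```

Without a handler the items are the elements themselves; with one, each item is whatever the handler returns for the element and its slash-separated context.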

handle_elt(elt, context)[source]

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Returns:

The view value corresponding to elt.

Parameters:
  • elt (ElementTree) – The element that should be converted.
  • context (str) – A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
read_block(stream, tagspec=None, elt_handler=None)[source]

Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found.

nltk.corpus.reader.ycoe module

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. The corpus is distributed by the Oxford Text Archive: http://www.ota.ahds.ac.uk/ It is not included with NLTK.

The YCOE corpus is divided into 100 files, each representing an Old English prose text. Tags used within each text comply with the YCOE standard: http://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm

class nltk.corpus.reader.ycoe.YCOECorpusReader(root, encoding='utf8')[source]

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.

documents(fileids=None)[source]

Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.

fileids(documents=None)[source]

Return a list of file identifiers for the files that make up this corpus, or that store the given document(s) if specified.

paras(documents=None)[source]
parsed_sents(documents=None)[source]
sents(documents=None)[source]
tagged_paras(documents=None)[source]
tagged_sents(documents=None)[source]
tagged_words(documents=None)[source]
words(documents=None)[source]
class nltk.corpus.reader.ycoe.YCOEParseCorpusReader(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)[source]

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Specialized version of the standard bracket parse corpus reader that strips out (CODE ...) and (ID ...) nodes.

class nltk.corpus.reader.ycoe.YCOETaggedCorpusReader(root, items, encoding='utf8')[source]

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

nltk.corpus.reader.ycoe.documents = {'comart3.o23': 'Martyrology, III', 'colawnorthu.o3': 'Northumbra Preosta Lagu', 'coinspolX': "Wulfstan's Institute of Polity (X)", 'colaw6atr.o3': 'Laws, Æthelred VI', 'coalcuin': 'Alcuin De virtutibus et vitiis', 'cochristoph': 'Saint Christopher', 'conicodC': 'Gospel of Nicodemus (C)', 'coaelive.o3': "Ælfric's Lives of Saints", 'cosolsat1.o4': 'Solomon and Saturn I', 'coprefcath1.o3': "Ælfric's Preface to Catholic Homilies I", 'cochronD': 'Anglo-Saxon Chronicle D', 'colsigewZ.o34': "Ælfric's Letter to Sigeweard (Z)", 'comargaT': 'Saint Margaret (T)', 'coleofri.o4': 'Leofric', 'cootest.o3': 'Heptateuch', 'cosevensl': 'Seven Sleepers', 'cochronA.o23': 'Anglo-Saxon Chronicle A', 'codocu2.o2': 'Documents 2 (O2)', 'cowulf.o34': "Wulfstan's Homilies", 'coprefcura.o2': 'Preface to the Cura Pastoralis', 'cowsgosp.o3': 'West-Saxon Gospels', 'codicts.o34': 'Dicts of Cato', 'conicodE': 'Gospel of Nicodemus (E)', 'coadrian.o34': 'Adrian and Ritheus', 'coalex.o23': "Alexander's Letter to Aristotle", 'colsigewB': "Ælfric's Letter to Sigeweard (B)", 'cocanedgX': 'Canons of Edgar (X)', 'colwstan1.o3': "Ælfric's Letter to Wulfstan I", 'coeluc1': 'Honorius of Autun, Elucidarium 1', 'codocu4.o24': 'Documents 4 (O2/O4)', 'cobenrul.o3': 'Benedictine Rule', 'cochronC': 'Anglo-Saxon Chronicle C', 'colacnu.o23': 'Lacnunga', 'covinsal': 'Vindicta Salvatoris', 'coeust': 'Saint Eustace and his companions', 'coquadru.o23': 'Pseudo-Apuleius, Medicina de quadrupedibus', 'coeluc2': 'Honorius of Autun, Elucidarium 1', 'coverhom': 'Vercelli Homilies', 'colawafint.o2': "Alfred's Introduction to Laws", 'coepigen.o3': "Ælfric's Epilogue to Genesis", 'coprefsolilo': "Preface to Augustine's Soliloquies", 'conicodA': 'Gospel of Nicodemus (A)', 'comart1': 'Martyrology, I', 'comarvel.o23': 'Marvels of the East', 'colwsigeXa.o34': "Ælfric's Letter to Wulfsige (Xa)", 'cocura.o2': 'Cura Pastoralis', 'codocu2.o12': 'Documents 2 (O1/O2)', 'coverhomL': 'Vercelli Homilies (L)', 'coaelhom.o3': 'Ælfric, Supplemental Homilies', 'comart2': 'Martyrology, II', 'comargaC.o34': 'Saint Margaret (C)', 'codocu1.o1': 'Documents 1 (O1)', 'coapollo.o3': 'Apollonius of Tyre', 'cochdrul': 'Chrodegang of Metz, Rule', 'coprefcath2.o3': "Ælfric's Preface to Catholic Homilies II", 'coblick.o23': 'Blickling Homilies', 'cocathom1.o3': "Ælfric's Catholic Homilies I", 'cogregdC.o24': "Gregory's Dialogues (C)", 'cojames': 'Saint James', 'cochad.o24': 'Saint Chad', 'corood': 'History of the Holy Rood-Tree', 'coaugust': 'Augustine', 'cocanedgD': 'Canons of Edgar (D)', 'colawine.ox2': 'Laws, Ine', 'cosolilo': "St. Augustine's Soliloquies", 'conicodD': 'Gospel of Nicodemus (D)', 'codocu3.o3': 'Documents 3 (O3)', 'cotempo.o3': "Ælfric's De Temporibus Anni", 'cochronE.o34': 'Anglo-Saxon Chronicle E', 'colaw5atr.o3': 'Laws, Æthelred V', 'colsigef.o3': "Ælfric's Letter to Sigefyrth", 'cobede.o2': "Bede's History of the English Church", 'colawaf.o2': 'Laws, Alfred', 'colwsigeT': "Ælfric's Letter to Wulfsige (T)", 'coverhomE': 'Vercelli Homilies (E)', 'cogregdH.o23': "Gregory's Dialogues (H)", 'colawger.o34': 'Laws, Gerefa', 'colaece.o2': 'Leechdoms', 'coeuphr': 'Saint Euphrosyne', 'cogenesiC': 'Genesis (C)', 'colwgeat': "Ælfric's Letter to Wulfgeat", 'coherbar': 'Pseudo-Apuleius, Herbarium', 'comary': 'Mary of Egypt', 'cocathom2.o3': "Ælfric's Catholic Homilies II", 'colaw2cn.o3': 'Laws, Cnut II', 'coorosiu.o2': 'Orosius', 'copreflives.o3': "Ælfric's Preface to Lives of Saints", 'coprefgen.o3': "Ælfric's Preface to Genesis", 'colawwllad.o4': 'Laws, William I, Lad', 'covinceB': 'Saint Vincent (Bodley 343)', 'codocu3.o23': 'Documents 3 (O2/O3)', 'cocuraC': 'Cura Pastoralis (Cotton)', 'coneot': 'Saint Neot', 'coinspolD.o34': "Wulfstan's Institute of Polity (D)", 'colaw1cn.o3': 'Laws, Cnut I', 'coexodusP': 'Exodus (P)', 'cobyrhtf.o3': "Byrhtferth's Manual", 'cosolsat2': 'Solomon and Saturn II', 'coboeth.o2': "Boethius' Consolation of Philosophy", 'colwstan2.o3': "Ælfric's Letter to Wulfstan II"}

A dictionary mapping each document identifier in ycoe to its title.

Module contents

NLTK corpus readers. The modules in this package provide functions that can be used to read corpus fileids in a variety of formats. These functions can be used to read both the corpus fileids that are distributed in the NLTK corpus package, and corpus fileids that are part of external corpora.

Corpus Reader Functions

Each corpus module defines one or more “corpus reader functions”, which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:

  • If item is one of the unique identifiers listed in the corpus module’s items variable, then the corresponding document will be loaded from the NLTK corpus package.
  • If item is a fileid, then that file will be read.

Additionally, corpus reader functions can be given lists of item names; in which case, they will return a concatenation of the corresponding documents.

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

  • words(): list of str
  • sents(): list of (list of str)
  • paras(): list of (list of (list of str))
  • tagged_words(): list of (str,str) tuple
  • tagged_sents(): list of (list of (str,str))
  • tagged_paras(): list of (list of (list of (str,str)))
  • chunked_sents(): list of (Tree w/ (str,str) leaves)
  • parsed_sents(): list of (Tree with str leaves)
  • parsed_paras(): list of (list of (Tree with str leaves))
  • xml(): A single xml ElementTree
  • raw(): unprocessed corpus contents

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...
class nltk.corpus.reader.CorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: builtins.object

A base class for “corpus reader” classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory. Each file is identified by its file identifier, which is the relative path to the file from the root directory.

A separate subclass is defined for each corpus format. These subclasses define one or more methods that provide ‘views’ on the corpus contents, such as words() (for a list of words) and parsed_sents() (for a list of parsed sentences). Called with no arguments, these methods will return the contents of the entire corpus. For most corpora, these methods define one or more selection arguments, such as fileids or categories, which can be used to select which portion of the corpus should be returned.
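The notion of a file identifier as a path relative to the corpus root can be sketched with the standard library alone. The helpers below are hypothetical stand-ins written for this illustration (they are not the package's functions), and matching the regular expression against the whole identifier is an assumption of the sketch.

```python
import os
import re
import tempfile


def find_fileids(root, regexp):
    """Return sorted identifiers (paths relative to root) that match regexp."""
    fileids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            relative = os.path.relpath(os.path.join(dirpath, name), root)
            fileid = relative.replace(os.sep, "/")  # forward slashes everywhere
            if re.fullmatch(regexp, fileid):
                fileids.append(fileid)
    return sorted(fileids)


def abspath(root, fileid):
    """Join the corpus root and a file identifier into an absolute path."""
    return os.path.abspath(os.path.join(root, *fileid.split("/")))


# Build a throwaway corpus directory to try the helpers on.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "news"))
    for fileid in ("a.txt", "news/b.txt", "notes.xml"):
        with open(abspath(root, fileid), "w", encoding="utf8") as stream:
            stream.write("sample text\n")
    txt_fileids = find_fileids(root, r".*\.txt")
    first_path_exists = os.path.exists(abspath(root, txt_fileids[0]))
```

Only the two .txt files are selected, and each identifier can be turned back into a concrete path by joining it with the root.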

abspath(fileid)

Return the absolute path for the given file.

Parameters:fileid (str) – The file identifier for the file whose path should be returned.
Return type:PathPointer
abspaths(fileids=None, include_encoding=False, include_fileid=False)

Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.

Parameters:
  • fileids (None or str or list) – Specifies the set of fileids for which paths should be returned. Can be None, for all fileids; a list of file identifiers, for a specified set of fileids; or a single file identifier, for a single file. Note that the return value is always a list of paths, even if fileids is a single file identifier.
  • include_encoding – If true, then return a list of (path_pointer, encoding) tuples.
Return type:

list(PathPointer)

encoding(file)

Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (str), then return None.

ensure_loaded()

Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded – e.g., in case a user wants to do help(some_corpus).

fileids()

Return a list of file identifiers for the fileids that make up this corpus.

open(file)

Return an open stream that can be used to read the given file. If the file’s encoding is not None, then the stream will automatically decode the file’s contents into unicode.

Parameters:file – The file identifier of the file to read.
readme()

Return the contents of the corpus README file, if it exists.

root

The directory where this corpus is stored.

Type:PathPointer
unicode_repr()
class nltk.corpus.reader.CategorizedCorpusReader(kwargs)

Bases: builtins.object

A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides fileids() to take a categories argument, restricting the set of fileids to be returned.

Subclasses are expected to:

  • Call __init__() to set up the mapping.
  • Override all view methods to accept a categories parameter, which can be used instead of the fileids parameter, to select which fileids should be included in the returned view.
categories(fileids=None)

Return a list of the categories that are defined for this corpus, or for the file(s) if it is given.

fileids(categories=None)

Return a list of file identifiers for the files that make up this corpus, or that make up the given category(s) if specified.

class nltk.corpus.reader.PlaintextCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x10804ccc0>, para_block_reader=<function read_blankline_block at 0x10805b1e0>, encoding='utf8')

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.

This corpus reader can be customized (e.g., to skip preface sections of specific document formats) by creating a subclass and overriding the CorpusView class variable.
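
As a rough illustration of the defaults described above, the following standard-library sketch splits text into paragraphs on blank lines and into words with the same pattern that the default WordPunctTokenizer shows in the signature. The helper names are hypothetical and this is not the reader's own implementation; sentence splitting is left out because the default Punkt tokenizer is a trained model.

```python
import re

# Pattern copied from the default WordPunctTokenizer in the signature above.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def split_paragraphs(text):
    """Split text into paragraphs on blank lines (hypothetical helper)."""
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]


def tokenize_words(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    return WORD_PUNCT.findall(text)


sample = "Hello, world!\nSecond line.\n\nNew paragraph here."
paragraphs = split_paragraphs(sample)
tokens = [tokenize_words(p) for p in paragraphs]
```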

CorpusView

alias of StreamBackedCorpusView

paras(fileids=None)
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
nltk.corpus.reader.find_corpus_fileids(root, regexp)
class nltk.corpus.reader.TaggedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block at 0x10805b1e0>, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator. I.e., words should have the form:

word1/tag1 word2/tag2 word3/tag3 ...

But custom separators may be specified as parameters to the constructor. Part of speech tags are case-normalized to upper case.
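
The word/tag convention above can be sketched with the standard library alone. This is an illustrative approximation, not the code of nltk.tag.str2tuple: the helper names are hypothetical, and splitting at the last separator is an assumption made so that words containing the separator stay intact.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word<sep>tag' at the last separator and upper-case the tag."""
    index = token.rfind(sep)
    if index < 0:
        # Assumption: a token without a separator is treated as untagged.
        return (token, None)
    return (token[:index], token[index + len(sep):].upper())


def parse_tagged_line(line, sep="/"):
    """Parse one whitespace-separated line of word/tag tokens."""
    return [split_tagged_token(token, sep) for token in line.split()]


tagged = parse_tagged_line("The/at dog/nn barked/vbd ./.")
```

Passing a different sep, for example parse_tagged_line("casa_n", sep="_"), mirrors the custom-separator option mentioned above.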

paras(fileids=None)
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None, tagset=None)
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None, tagset=None)
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None, tagset=None)
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.CMUDictCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

dict()
Returns:the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations.

entries()
Returns:the cmudict lexicon as a list of entries containing (word, transcriptions) tuples.

raw()
Returns:the cmudict lexicon as a raw string.
words()
Returns:a list of all words defined in the cmudict lexicon.
class nltk.corpus.reader.ConllChunkCorpusReader(root, fileids, chunk_types, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.conll.ConllCorpusReader

A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.

class nltk.corpus.reader.WordListCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.

raw(fileids=None)
words(fileids=None)
class nltk.corpus.reader.PPAttachmentCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

sentence_id verb noun1 preposition noun2 attachment
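
A minimal sketch of reading one record in the six-field layout named above, assuming the fields are separated by whitespace. The PPInstance tuple, the helper name, and the sample line are all invented for illustration and are not taken from the reader itself.

```python
from collections import namedtuple

# Field order follows the layout line above; the type name is hypothetical.
PPInstance = namedtuple(
    "PPInstance", ["sentence_id", "verb", "noun1", "preposition", "noun2", "attachment"]
)


def parse_ppattach_line(line):
    """Turn one whitespace-separated record into a PPInstance."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    return PPInstance(*fields)


# Invented sample record.
example = parse_ppattach_line("0 join board as director V")
```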

attachments(fileids)
raw(fileids=None)
tuples(fileids)
class nltk.corpus.reader.SensevalCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

instances(fileids=None)
raw(fileids=None)
Returns:the text contents of the given fileids, as a single string.
class nltk.corpus.reader.IEERCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

docs(fileids=None)
parsed_docs(fileids=None)
raw(fileids=None)
class nltk.corpus.reader.ChunkedCorpusReader(root, fileids, extension='', str2chunktree=<function tagstr2tree at 0x10812d730>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block at 0x10805b1e0>, encoding='utf8')

Bases: nltk.corpus.reader.api.CorpusReader

Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.

chunked_paras(fileids=None)
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(list(Tree))
chunked_sents(fileids=None)
Returns:the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
Return type:list(Tree)
chunked_words(fileids=None)
Returns:the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.
Return type:list(tuple(str,str) and Tree)
paras(fileids=None)
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
raw(fileids=None)
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
tagged_paras(fileids=None)
Returns:the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.
Return type:list(list(list(tuple(str,str))))
tagged_sents(fileids=None)
Returns:the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.
Return type:list(list(tuple(str,str)))
tagged_words(fileids=None)
Returns:the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).
Return type:list(tuple(str,str))
words(fileids=None)
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.SinicaTreebankCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for the sinica treebank.

class nltk.corpus.reader.BracketParseCorpusReader(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for corpora that consist of parenthesis-delineated parse trees.

class nltk.corpus.reader.IndianCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

List of words, one per line. Blank lines are ignored.

raw(fileids=None)
sents(fileids=None)
tagged_sents(fileids=None, tagset=None)
tagged_words(fileids=None, tagset=None)
words(fileids=None)
class nltk.corpus.reader.ToolboxCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

entries(fileids, **kwargs)
fields(fileids, strip=True, unwrap=True, encoding='utf8', errors='strict', unicode_fields=None)
raw(fileids)
words(fileids, key='lx')
xml(fileids, key=None)
class nltk.corpus.reader.TimitCorpusReader(root, encoding='utf8')

Bases: nltk.corpus.reader.api.CorpusReader

Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:

  • timitdic.txt: dictionary of standard transcriptions
  • spkrinfo.txt: table of speaker information

In addition, the root directory should contain one subdirectory for each speaker, containing four files for each utterance:

  • <utterance-id>.txt: text content of utterances
  • <utterance-id>.wrd: tokenized text content of utterances
  • <utterance-id>.phn: phonetic transcription of utterances
  • <utterance-id>.wav: utterance sound file
audiodata(utterance, start=0, end=None)
fileids(filetype=None)

Return a list of file identifiers for the files that make up this corpus.

Parameters:filetype – If specified, then filetype indicates that only the files that have the given type should be returned. Accepted values are: txt, wrd, phn, wav, or metadata.
phone_times(utterances=None)

offset is represented as a number of 16kHz samples!
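
Because offsets are sample counts rather than seconds, a small conversion is often handy. The sketch below assumes only the 16 kHz rate stated above; the function names are hypothetical and not part of the reader.

```python
SAMPLE_RATE_HZ = 16000  # rate stated in the description above


def samples_to_seconds(offset, sample_rate=SAMPLE_RATE_HZ):
    """Convert an offset expressed in samples to seconds."""
    return offset / sample_rate


def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Convert a time in seconds to the nearest sample offset."""
    return int(round(seconds * sample_rate))
```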

phone_trees(utterances=None)
phones(utterances=None)
play(utterance, start=0, end=None)

Play the given audio sample.

Parameters:utterance – The utterance id of the sample to play
sent_times(utterances=None)
sentid(utterance)
sents(utterances=None)
spkrid(utterance)
spkrinfo(speaker)
Returns:A dictionary mapping .. something.
spkrutteranceids(speaker)
Returns:A list of all utterances associated with a given speaker.

transcription_dict()
Returns:A dictionary giving the ‘standard’ transcription for each word.

utterance(spkrid, sentid)
utteranceids(dialect=None, sex=None, spkrid=None, sent_type=None, sentid=None)
Returns:A list of the utterance identifiers for all utterances in this corpus, or for the given speaker, dialect region, gender, sentence type, or sentence number, if specified.

wav(utterance, start=0, end=None)
word_times(utterances=None)
words(utterances=None)
class nltk.corpus.reader.YCOECorpusReader(root, encoding='utf8')

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.

documents(fileids=None)

Return a list of document identifiers for all documents in this corpus, or for the documents with the given file(s) if specified.

fileids(documents=None)

Return a list of file identifiers for the files that make up this corpus, or that store the given document(s) if specified.

paras(documents=None)
parsed_sents(documents=None)
sents(documents=None)
tagged_paras(documents=None)
tagged_sents(documents=None)
tagged_words(documents=None)
words(documents=None)
class nltk.corpus.reader.MacMorphoCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.
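
The line-per-word layout described above can be sketched as follows. This is a standard-library illustration under the stated assumptions (one word_tag pair per line, a sentence ends at a token whose tag is '.'); the function name and the sample lines are invented, and it is not the reader's own code.

```python
def read_word_per_line(lines, sep="_", end_tag="."):
    """Group one-token-per-line input into sentences of (word, tag) tuples."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if sep in line:
            word, _, tag = line.rpartition(sep)
        else:
            # Assumption: a line without the separator is an untagged word.
            word, tag = line, None
        current.append((word, tag))
        if tag == end_tag:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Invented sample lines.
sample_lines = ["Jersei_N", "atinge_V", "média_N", "._.", "", "Um_ART", "teste_N", "._."]
sentences = read_word_per_line(sample_lines)
# With no paragraph information, each paragraph wraps exactly one sentence.
paragraphs = [[sentence] for sentence in sentences]
```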

class nltk.corpus.reader.SyntaxCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:

  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of list of words.
  • _tag, which takes a block and returns a list of list of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
parsed_sents(fileids=None)
raw(fileids=None)
sents(fileids=None)
tagged_sents(fileids=None, tagset=None)
tagged_words(fileids=None, tagset=None)
words(fileids=None)
class nltk.corpus.reader.AlpinoCorpusReader(root, encoding='ISO-8859-1', tagset=None)

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Reader for the Alpino Dutch Treebank.

class nltk.corpus.reader.RTECorpusReader(root, fileids, wrap_etree=False)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for corpora in RTE challenges.

This is just a wrapper around the XMLCorpusReader. See module docstring above for the expected structure of input documents.

pairs(fileids)

Build a list of RTEPairs from a RTE corpus.

Parameters:fileids – a list of RTE corpus fileids
Type:list
Return type:list(RTEPair)
class nltk.corpus.reader.StringCategoryCorpusReader(root, fileids, delimiter=' ', encoding='utf8')

Bases: nltk.corpus.reader.api.CorpusReader

raw(fileids=None)
Returns:the text contents of the given fileids, as a single string.
tuples(fileids=None)
class nltk.corpus.reader.EuroparlCorpusReader(root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\w+|[^\w\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x10804ccc0>, para_block_reader=<function read_blankline_block at 0x10805b1e0>, encoding='utf8')

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from PlaintextCorpusReader except that:

  • Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces.
  • For the same reason, the sentence tokenizer should just split the paragraph at line breaks.
  • There is a new ‘chapters()’ method that returns chapters instead of paragraphs.
  • The ‘paras()’ method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl.
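
The tokenization rules in the list above amount to three nested splits. The following sketch shows them with the standard library; the function name and sample text are hypothetical, and the real reader delegates this work to tokenizer objects rather than to code like this.

```python
import re


def split_chapters(text):
    """Return chapters -> sentences -> words for pre-tokenized text.

    Chapters are separated by blank lines, sentences by line breaks,
    and words by whitespace.
    """
    chapters = []
    for block in re.split(r"\n\s*\n", text):
        sentences = [line.split() for line in block.splitlines() if line.strip()]
        if sentences:
            chapters.append(sentences)
    return chapters


# Invented sample text.
sample = (
    "Resumption of the session\n"
    "I declare the session resumed .\n"
    "\n"
    "Agenda\n"
    "The next item is the vote .\n"
)
chapters = split_chapters(sample)
```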
chapters(fileids=None)
Returns:the given file(s) as a list of chapters, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
Return type:list(list(list(str)))
paras(fileids=None)
class nltk.corpus.reader.CategorizedBracketParseCorpusReader(*args, **kwargs)

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>

paras(fileids=None, categories=None)
parsed_paras(fileids=None, categories=None)
parsed_sents(fileids=None, categories=None)
parsed_words(fileids=None, categories=None)
raw(fileids=None, categories=None)
sents(fileids=None, categories=None)
tagged_paras(fileids=None, categories=None, tagset=None)
tagged_sents(fileids=None, categories=None, tagset=None)
tagged_words(fileids=None, categories=None, tagset=None)
words(fileids=None, categories=None)
class nltk.corpus.reader.CategorizedTaggedCorpusReader(*args, **kwargs)

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.tagged.TaggedCorpusReader

A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.

paras(fileids=None, categories=None)
raw(fileids=None, categories=None)
sents(fileids=None, categories=None)
tagged_paras(fileids=None, categories=None, tagset=None)
tagged_sents(fileids=None, categories=None, tagset=None)
tagged_words(fileids=None, categories=None, tagset=None)
words(fileids=None, categories=None)
class nltk.corpus.reader.CategorizedPlaintextCorpusReader(*args, **kwargs)

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.plaintext.PlaintextCorpusReader

A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.

paras(fileids=None, categories=None)
raw(fileids=None, categories=None)
sents(fileids=None, categories=None)
words(fileids=None, categories=None)
class nltk.corpus.reader.PortugueseCategorizedPlaintextCorpusReader(*args, **kwargs)

Bases: nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader

nltk.corpus.reader.tagged_treebank_para_block_reader(stream)
class nltk.corpus.reader.PropbankCorpusReader(root, propfile, framefiles='', verbsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-verb basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

instances(baseform=None)
Returns:a corpus view that acts as a list of PropBankInstance objects, one for each verb in the corpus.

lines()
Returns:a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

raw(fileids=None)
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)
Returns:the xml description for the given roleset.
rolesets(baseform=None)
Returns:list of xml descriptions for rolesets.
verbs()
Returns:a corpus view that acts as a list of all verb lemmas in this corpus (from the verbs.txt file).

class nltk.corpus.reader.VerbnetCorpusReader(root, fileids, wrap_etree=False)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

classids(lemma=None, wordnetid=None, fileid=None, classid=None)

Return a list of the verbnet class identifiers. If a file identifier is specified, then return only the verbnet class identifiers for classes (and subclasses) defined by that file. If a lemma is specified, then return only verbnet class identifiers for classes that contain that lemma as a member. If a wordnetid is specified, then return only identifiers for classes that contain that wordnetid as a member. If a classid is specified, then return only identifiers for subclasses of the specified verbnet class.

fileids(vnclass_ids=None)

Return a list of fileids that make up this corpus. If vnclass_ids is specified, then return the fileids that make up the specified verbnet class(es).

lemmas(classid=None)

Return a list of all verb lemmas that appear in any class, or in the classid if specified.

longid(shortid)

Given a short verbnet class identifier (eg ‘37.10’), map it to a long id (eg ‘confess-37.10’). If shortid is already a long id, then return it as-is.

pprint(vnclass)

Return a string containing a pretty-printed representation of the given verbnet class.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.

pprint_description(vnframe, indent='')

Return a string containing a pretty-printed representation of the given verbnet frame description.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_frame(vnframe, indent='')

Return a string containing a pretty-printed representation of the given verbnet frame.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_members(vnclass, indent='')

Return a string containing a pretty-printed representation of the given verbnet class’s member verbs.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
pprint_semantics(vnframe, indent='')

Return a string containing a pretty-printed representation of the given verbnet frame semantics.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_subclasses(vnclass, indent='')

Return a string containing a pretty-printed representation of the given verbnet class’s subclasses.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
pprint_syntax(vnframe, indent='')

Return a string containing a pretty-printed representation of the given verbnet frame syntax.

Parameters:vnframe – An ElementTree containing the xml contents of a verbnet frame.
pprint_themroles(vnclass, indent='')

Return a string containing a pretty-printed representation of the given verbnet class’s thematic roles.

Parameters:vnclass – A verbnet class identifier; or an ElementTree containing the xml contents of a verbnet class.
shortid(longid)

Given a long verbnet class identifier (eg ‘confess-37.10’), map it to a short id (eg ‘37.10’). If longid is already a short id, then return it as-is.
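
The long-to-short direction is a pure string operation and can be sketched with a regular expression. The patterns and the function name below are assumptions made for illustration, not the reader's own code. The opposite direction cannot be computed from the string alone, since the class name has to be looked up.

```python
import re

# Hypothetical patterns: a long id is '<name>-<numeric part>'.
LONG_ID = re.compile(r"^[^\-.]+-([\d.\-]+)$")
SHORT_ID = re.compile(r"^[\d.\-]+$")


def to_short_id(class_id):
    """Map 'confess-37.10' to '37.10'; return short ids unchanged."""
    if SHORT_ID.match(class_id):
        return class_id
    match = LONG_ID.match(class_id)
    if match is None:
        raise ValueError("not a recognizable class identifier: %r" % class_id)
    return match.group(1)
```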

vnclass(fileid_or_classid)

Return an ElementTree containing the xml for the specified verbnet class.

Parameters:fileid_or_classid – An identifier specifying which class should be returned. Can be a file identifier (such as 'put-9.1.xml'), or a verbnet class identifier (such as 'put-9.1') or a short verbnet class identifier (such as '9.1').
wordnetids(classid=None)

Return a list of all wordnet identifiers that appear in any class, or in classid if specified.

class nltk.corpus.reader.BNCCorpusReader(root, fileids, lazy=True)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
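
To see which files a fileids pattern like the one above would select, the regular expression can be tried against candidate paths on its own. The sketch assumes the pattern must match the whole path relative to root; the candidate paths are invented for illustration.

```python
import re

# Same pattern as in the constructor call above.
FILEIDS_PATTERN = re.compile(r"[A-K]/\w*/\w*\.xml")

# Invented relative paths.
candidates = [
    "A/A0/A00.xml",
    "K/KS/KSW.xml",
    "A/A0/A00.txt",
    "Z/Z0/Z00.xml",
    "README",
]

selected = [path for path in candidates if FILEIDS_PATTERN.fullmatch(path)]
```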
sents(fileids=None, strip_space=True, stem=False)
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_words(fileids=None, c5=False, strip_space=True, stem=False)
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
words(fileids=None, strip_space=True, stem=False)
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
class nltk.corpus.reader.ConllCorpusReader(root, fileids, columntypes, chunk_types=None, root_label='S', pos_in_tree=False, srl_includes_roleset=True, encoding='utf8', tree_class=<class 'nltk.tree.Tree'>, tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus; the ConllCorpusReader constructor therefore takes an argument, columntypes, which is used to specify the columns that are used by a given corpus.
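
The grid layout described above can be sketched with the standard library: sentences are blank-line-separated blocks, each line is one word, and columntypes names the columns. The function name and the sample data are hypothetical, columns are assumed to be whitespace-separated, and this is an illustration rather than the reader's implementation.

```python
import re


def parse_grid(text, columntypes):
    """Parse CoNLL-style text into sentences of {column type: value} rows."""
    sentences = []
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [line.split() for line in block.splitlines() if line.strip()]
        if rows:
            sentences.append([dict(zip(columntypes, row)) for row in rows])
    return sentences


# Invented sample with three columns: words, pos, chunk.
sample = "He PRP B-NP\nreckons VBZ B-VP\n\nThe DT B-NP\ndeficit NN I-NP\n"
grid = parse_grid(sample, ("words", "pos", "chunk"))
second_sentence_words = [row["words"] for row in grid[1]]
```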

@todo: Add support for reading from corpora where different parallel files contain different columns.
@todo: Possibly add caching of the grid corpus view? This would allow the same grid view to be used by different data access methods (eg words() and parsed_sents() could both share the same grid corpus view object).
@todo: Better support for -DOCSTART-. Currently, we just ignore it, but it could be used to define methods that retrieve a document at a time (eg parsed_documents()).
CHUNK = 'chunk'
COLUMN_TYPES = ('words', 'pos', 'tree', 'chunk', 'ne', 'srl', 'ignore')
IGNORE = 'ignore'
NE = 'ne'
POS = 'pos'
SRL = 'srl'
TREE = 'tree'
WORDS = 'words'
chunked_sents(fileids=None, chunk_types=None, tagset=None)
chunked_words(fileids=None, chunk_types=None, tagset=None)
iob_sents(fileids=None, tagset=None)
Returns:a list of lists of word/tag/IOB tuples
Return type:list(list)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
iob_words(fileids=None, tagset=None)
Returns:a list of word/tag/IOB tuples
Return type:list(tuple)
Parameters:fileids (None or str or list) – the list of fileids that make up this corpus
parsed_sents(fileids=None, pos_in_tree=None, tagset=None)
raw(fileids=None)
sents(fileids=None)
srl_instances(fileids=None, pos_in_tree=None, flatten=True)
srl_spans(fileids=None)
tagged_sents(fileids=None, tagset=None)
tagged_words(fileids=None, tagset=None)
words(fileids=None)
class nltk.corpus.reader.XMLCorpusReader(root, fileids, wrap_etree=False)

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for corpora whose documents are xml files.

Note that the XMLCorpusReader constructor does not take an encoding argument, because the unicode encoding is specified by the XML files themselves. See the XML specs for more info.

raw(fileids=None)
words(fileid=None)

Returns all of the words and punctuation symbols in the specified file that were in text nodes – ie, tags are ignored. Like the xml() method, fileid can only specify one file.

Returns:the given file’s text nodes as a list of words and punctuation symbols
Return type:list(str)
xml(fileid=None)
class nltk.corpus.reader.NPSChatCorpusReader(root, fileids, wrap_etree=False, tagset=None)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

posts(fileids=None)
tagged_posts(fileids=None, tagset=None)
tagged_words(fileids=None, tagset=None)
words(fileids=None)
xml_posts(fileids=None)
class nltk.corpus.reader.SwadeshCorpusReader(root, fileids, encoding='utf8', tagset=None)

Bases: nltk.corpus.reader.wordlist.WordListCorpusReader

entries(fileids=None)
Returns:a tuple of words for the specified fileids.
class nltk.corpus.reader.WordNetCorpusReader(root, omw_reader)

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader used to access wordnet or its variants.

ADJ = 'a'
ADJ_SAT = 's'
ADV = 'r'
MORPHOLOGICAL_SUBSTITUTIONS = {'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]}
NOUN = 'n'
VERB = 'v'
all_lemma_names(pos=None, lang='en')

Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

all_synsets(pos=None)

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

get_version()
ic(corpus, weight_senses_equally=False, smoothing=1.0)

Creates an information content lookup dictionary from a corpus.

Parameters:
  • corpus (CorpusReader) – The corpus from which we create an information content dictionary.
  • weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
  • smoothing (float) – How much do we smooth synset counts (default is 1.0).
Returns:

An information content dictionary.
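
The effect of weight_senses_equally can be illustrated with a toy counter. The function name, the sense labels, and the word-to-senses mapping are all invented, and smoothing is deliberately left out; this is a sketch of the weighting rule only, not the ic() implementation.

```python
from collections import defaultdict


def accumulate_sense_counts(words, senses_by_word, weight_senses_equally=False):
    """Add a weight to every sense of each word occurrence."""
    counts = defaultdict(float)
    for word in words:
        senses = senses_by_word.get(word, [])
        if not senses:
            continue
        # Either a full count per sense, or one count shared among the senses.
        weight = 1.0 if weight_senses_equally else 1.0 / len(senses)
        for sense in senses:
            counts[sense] += weight
    return dict(counts)


# Toy data: one word with three invented sense labels.
toy_senses = {"bank": ["bank-1", "bank-2", "bank-3"]}
shared = accumulate_sense_counts(["bank"], toy_senses)
equal = accumulate_sense_counts(["bank"], toy_senses, weight_senses_equally=True)
```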

jcn_similarity(synset1, synset2, ic, verbose=False)

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects.
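
The Jiang-Conrath equation given above can be evaluated directly once the three information content values are known. The sketch below takes them as plain floats; the function name is hypothetical, and returning infinity for a zero denominator is an assumption rather than documented behavior.

```python
def jcn_score(ic1, ic2, ic_lcs):
    """Evaluate 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)) for plain float inputs."""
    denominator = ic1 + ic2 - 2.0 * ic_lcs
    if denominator == 0:
        # Assumption: treat identical information content as unbounded similarity.
        return float("inf")
    return 1.0 / denominator
```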

langs()

Return a list of the languages supported by Multilingual Wordnet.

lch_similarity(synset1, synset2, verbose=False, simulate_root=True)

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • other (Synset) – The Synset that this Synset is being compared to.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
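
The Leacock-Chodorow relationship -log(p/2d) can be checked numerically with a few lines. The function name is hypothetical, and the sketch leaves open how the path length p is counted (the description above does not say whether edges or nodes are counted), so the caller supplies p directly.

```python
import math


def lch_score(path_length, taxonomy_depth):
    """Evaluate -log(p / (2 * d)) for a positive path length p and depth d."""
    if path_length <= 0 or taxonomy_depth <= 0:
        raise ValueError("path_length and taxonomy_depth must be positive")
    return -math.log(path_length / (2.0 * taxonomy_depth))
```

Shorter paths give larger scores for a fixed depth, which matches the statement that the maximum score depends on the taxonomy depth.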

lemma(name, lang='en')

Return lemma object that matches the name

lemma_count(lemma)

Return the frequency count for this Lemma

lemma_from_key(key)
lemmas(lemma, pos=None, lang='en')

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

lin_similarity(synset1, synset2, ic, verbose=False)

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • synset1, synset2 (Synset) – The two Synsets to compare.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
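The equation above can be checked on plain numbers. lin_score is a hypothetical helper and the IC values are made up for illustration.

```python
def lin_score(ic_s1, ic_s2, ic_lcs):
    """Lin score: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    # Assumption: the two IC values do not sum to zero.
    return 2.0 * ic_lcs / (ic_s1 + ic_s2)


# Made-up information content values for two senses and their LCS.
score = lin_score(ic_s1=8.0, ic_s2=7.0, ic_lcs=6.0)
identical = lin_score(ic_s1=5.0, ic_s2=5.0, ic_lcs=5.0)
```

When a sense is compared with itself, the LCS is the sense itself and the score reaches the top of the range.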

morphy(form, pos=None)

Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.
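The lookup order described above (exception list first, then recursive affix stripping) can be sketched with toy data. This is not the library's implementation: toy_morphy, the word set, the exception table, and the suffix rules are all hypothetical.

```python
def toy_morphy(form, known_words, exceptions, suffix_rules):
    """Return a base form for `form`, or None if nothing is found."""
    # 1. Irregular forms are looked up first.
    if form in exceptions:
        return exceptions[form]
    # 2. The form itself may already be a base form.
    if form in known_words:
        return form
    # 3. Strip affixes recursively until a known word appears.
    for suffix, replacement in suffix_rules:
        if form.endswith(suffix) and len(form) > len(suffix):
            candidate = form[: -len(suffix)] + replacement
            result = toy_morphy(candidate, known_words, exceptions, suffix_rules)
            if result is not None:
                return result
    return None


# Made-up stand-ins for the WordNet lexicon and its exception list.
KNOWN = {"dog", "church", "abacus", "book"}
EXCEPTIONS = {"abaci": "abacus"}
RULES = [("ches", "ch"), ("s", "")]

base_forms = [toy_morphy(w, KNOWN, EXCEPTIONS, RULES)
              for w in ("dogs", "churches", "abaci", "hardrock")]
```

Every rule shortens the candidate, so the recursion always terminates. A form that matches nothing yields None, which mirrors the empty results in the session below.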

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
of2ss(of)

Take an id and return the synsets.

path_similarity(synset1, synset2, verbose=False, simulate_root=True)

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:
  • synset1, synset2 (Synset) – The two Synsets to compare.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.

res_similarity(synset1, synset2, ic, verbose=False)

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • synset1, synset2 (Synset) – The two Synsets to compare.
  • ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).
Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

ss2of(ss)

Return the ILI of the synset.

synset(name)
synsets(lemma, pos=None, lang='en')

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.

wup_similarity(synset1, synset2, verbose=False, simulate_root=True)

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. With the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, but those for nouns still do not always agree.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • synset1, synset2 (Synset) – The two Synsets to compare.
  • simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
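The text above does not spell out the formula. The sketch below assumes the standard Wu-Palmer formulation, 2 * depth(lcs) / (depth(s1) + depth(s2)); wup_score is a hypothetical helper and the depths are made up, so results need not match the library's output exactly.

```python
def wup_score(depth_s1, depth_s2, depth_lcs):
    """Standard Wu-Palmer formulation on precomputed depths (assumed here)."""
    # Assumption: depths are counted from the root and are positive.
    return 2.0 * depth_lcs / (depth_s1 + depth_s2)


# Made-up depths: two senses at depths 5 and 7 sharing an LCS at depth 3.
score = wup_score(depth_s1=5, depth_s2=7, depth_lcs=3)
```

A deeper LCS relative to the two senses gives a higher score.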

class nltk.corpus.reader.WordNetICCorpusReader(root, fileids)

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for the WordNet information content corpus.

ic(icfile)

Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

Parameters:icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”)
Returns:An information content dictionary
class nltk.corpus.reader.SwitchboardCorpusReader(root, tagset=None)

Bases: nltk.corpus.reader.api.CorpusReader

discourses()
tagged_discourses(tagset=False)
tagged_turns(tagset=None)
tagged_words(tagset=None)
turns()
words()
class nltk.corpus.reader.DependencyCorpusReader(root, fileids, encoding='utf8', word_tokenizer=<nltk.tokenize.simple.TabTokenizer object at 0x1080fdac8>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), para_block_reader=<function read_blankline_block at 0x10805b1e0>)

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

parsed_sents(fileids=None)
raw(fileids=None)
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)
tagged_sents(fileids=None)
tagged_words(fileids=None)
words(fileids=None)
class nltk.corpus.reader.NombankCorpusReader(root, nomfile, framefiles='', nounsfile=None, parse_fileid_xform=None, parse_corpus=None, encoding='utf8')

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of “frameset files” which define the argument labels used by the annotations, on a per-noun basis. Each “frameset file” contains one or more predicates, such as 'turn' or 'turn_on', each of which is divided into coarse-grained word senses called “rolesets”. For each “roleset”, the frameset file provides descriptions of the argument roles, along with examples.

instances(baseform=None)
Returns:a corpus view that acts as a list of NombankInstance objects, one for each noun in the corpus.

lines()
Returns:a corpus view that acts as a list of strings, one for each line in the predicate-argument annotation file.

nouns()
Returns:a corpus view that acts as a list of all noun lemmas in this corpus (from the nombank.1.0.words file).

raw(fileids=None)
Returns:the text contents of the given fileids, as a single string.
roleset(roleset_id)
Returns:the xml description for the given roleset.
rolesets(baseform=None)
Returns:list of xml descriptions for rolesets.
class nltk.corpus.reader.IPIPANCorpusReader(root, fileids)

Bases: nltk.corpus.reader.api.CorpusReader

Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.

The corpus includes information about text domain, channel and categories. You can access possible values using domains(), channels() and categories(). You can also use this metadata to filter files, e.g.: fileids(channels='prasa'), fileids(categories='publicystyczny').

The reader supports methods: words, sents, paras and their tagged versions. You can get part of speech instead of full tag by giving “simplify_tags=True” parameter, e.g.: tagged_sents(simplify_tags=True).

You can also get all disambiguated tags by specifying the parameter “one_tag=False”, e.g.: tagged_paras(one_tag=False).

You can get all tags that were assigned by a morphological analyzer specifying parameter “disamb_only=False”, e.g. tagged_words(disamb_only=False).

The IPIPAN Corpus contains tags indicating whether there is a space between two tokens. To add special “no space” markers, specify the parameter “append_no_space=True”, e.g. tagged_words(append_no_space=True). Wherever there should be no space between two tokens, a new pair (‘’, ‘no-space’) is then inserted (for tagged data), or just ‘’ for methods without tags.

The corpus reader can also try to append spaces between words. To enable this option, specify parameter “append_space=True”, e.g. words(append_space=True). As a result either ‘ ‘ or (‘ ‘, ‘space’) will be inserted between tokens.

By default, xml entities like &quot; and &amp; are replaced by corresponding characters. You can turn off this feature, specifying parameter “replace_xmlentities=False”, e.g. words(replace_xmlentities=False).

categories(fileids=None)
channels(fileids=None)
domains(fileids=None)
fileids(channels=None, domains=None, categories=None)
paras(fileids=None, **kwargs)
raw(fileids=None)
sents(fileids=None, **kwargs)
tagged_paras(fileids=None, **kwargs)
tagged_sents(fileids=None, **kwargs)
tagged_words(fileids=None, **kwargs)
words(fileids=None, **kwargs)
class nltk.corpus.reader.Pl196xCorpusReader(*args, **kwargs)

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.xmldocs.XMLCorpusReader

decode_tag(tag)
headLen = 2770
paras(fileids=None, categories=None, textids=None)
raw(fileids=None, categories=None)
sents(fileids=None, categories=None, textids=None)
tagged_paras(fileids=None, categories=None, textids=None)
tagged_sents(fileids=None, categories=None, textids=None)
tagged_words(fileids=None, categories=None, textids=None)
textids(fileids=None, categories=None)

In the pl196x corpus each category is stored in a single file, so both methods (selecting by fileids and selecting by categories) provide identical functionality. To accommodate finer granularity, a non-standard textids() method was implemented. All the main functions can be supplied with a list of required chunks, giving much more control to the user.

words(fileids=None, categories=None, textids=None)
xml(fileids=None, categories=None)
class nltk.corpus.reader.TEICorpusView(corpus_file, tagged, group_by_sent, group_by_para, tagset=None, headLen=0, textids=None)

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

read_block(stream)
class nltk.corpus.reader.KNBCorpusReader(root, fileids, encoding='utf8', morphs2str=<function <lambda> at 0x10811a378>)

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

This class implements:
  • __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
  • _read_block, which reads a block from the input stream.
  • _word, which takes a block and returns a list of list of words.
  • _tag, which takes a block and returns a list of list of tagged words.
  • _parse, which takes a block and returns a list of parsed sentences.
The structure of tagged words:
tagged_word = (word(str), tags(tuple))
tags = (surface, reading, lemma, pos1, posid1, pos2, posid2, pos3, posid3, others ...)
class nltk.corpus.reader.ChasenCorpusReader(root, fileids, encoding='utf8', sent_splitter=None)

Bases: nltk.corpus.reader.api.CorpusReader

paras(fileids=None)
raw(fileids=None)
sents(fileids=None)
tagged_paras(fileids=None)
tagged_sents(fileids=None)
tagged_words(fileids=None)
words(fileids=None)
class nltk.corpus.reader.CHILDESCorpusReader(root, fileids, lazy=True)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at http://childes.psy.cmu.edu/. The XML version of CHILDES is located at http://childes.psy.cmu.edu/data-xml/. Copy the needed parts of the CHILDES XML corpus into the NLTK data directory (nltk_data/corpora/CHILDES/).

For access to the file text use the usual nltk functions, words(), sents(), tagged_words() and tagged_sents().

MLU(fileids=None, speaker='CHI')
Returns:the given file(s) as a floating number
Return type:list(float)
age(fileids=None, speaker='CHI', month=False)
Returns:the given file(s) as string or int
Return type:list or int
Parameters:month – If true, return months instead of year-month-date
childes_url_base = 'http://childes.psy.cmu.edu/browser/index.php?url='
convert_age(age_year)

Calculate age in months from a string in CHILDES format
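The conversion can be sketched as follows. The format is not described above, so this sketch assumes that ages are duration strings such as 'P2Y6M14D' (years, months, days); age_in_months is a hypothetical helper, and rounding up when more than 15 days remain is likewise an assumption.

```python
import re


def age_in_months(age):
    """Convert a duration such as 'P2Y6M14D' to whole months (illustrative)."""
    match = re.fullmatch(r"P(\d+)Y(?:(\d+)M)?(?:(\d+)D)?", age)
    if match is None:
        raise ValueError(f"unrecognised age string: {age!r}")
    years = int(match.group(1))
    months = int(match.group(2) or 0)
    days = int(match.group(3) or 0)
    total = years * 12 + months
    # Assumption: more than half a month of days rounds up to the next month.
    if days > 15:
        total += 1
    return total


months = age_in_months("P2Y6M14D")
```

Months and days are optional in this sketch, so a string such as 'P3Y' is also accepted.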

corpus(fileids=None)
Returns:the given file(s) as a dict of (corpus_property_key, value)
Return type:list(dict)
participants(fileids=None)
Returns:the given file(s) as a dict of (participant_property_key, value)
Return type:list(dict)
sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_sents(fileids=None, speaker='ALL', stem=False, relation=None, strip_space=True, replace=False)
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (str,pos,relation_list). If there is manually-annotated relation info, it will return tuples of (str,pos,test_relation_list,str,pos,gold_relation_list)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
tagged_words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
webview_file(fileid, urlbase=None)

Map a corpus file to its web version on the CHILDES website, and open it in a web browser.

The complete URL to be used is:
childes.childes_url_base + urlbase + fileid.replace('.xml', '.cha')

If no urlbase is passed, we try to calculate it. This requires that the childes corpus was set up to mirror the folder hierarchy under childes.psy.cmu.edu/data-xml/, e.g.: nltk_data/corpora/childes/Eng-USA/Cornell/??? or nltk_data/corpora/childes/Romance/Spanish/Aguirre/???

The function first looks (as a special case) if “Eng-USA” is on the path consisting of <corpus root>+fileid; then if “childes”, possibly followed by “data-xml”, appears. If neither one is found, we use the unmodified fileid and hope for the best. If this is not right, specify urlbase explicitly, e.g., if the corpus root points to the Cornell folder, urlbase=’Eng-USA/Cornell’.
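The URL composition described above can be sketched without a browser. build_webview_url and the file name are hypothetical; the text does not say how urlbase and fileid are separated, so inserting a single slash between them is an assumption.

```python
CHILDES_URL_BASE = "http://childes.psy.cmu.edu/browser/index.php?url="


def build_webview_url(fileid, urlbase):
    """Compose url_base + urlbase + fileid, with '.xml' swapped for '.cha'."""
    # Assumption: urlbase and fileid are joined with exactly one slash.
    prefix = urlbase.rstrip("/") + "/"
    return CHILDES_URL_BASE + prefix + fileid.replace(".xml", ".cha")


# 'sample.xml' is a made-up file name used only for illustration.
url = build_webview_url("sample.xml", urlbase="Eng-USA/Cornell")
```

Passing urlbase explicitly, as here, avoids relying on the path-guessing rules described above.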

words(fileids=None, speaker='ALL', stem=False, relation=False, strip_space=True, replace=False)
Returns:

the given file(s) as a list of words

Return type:

list(str)

Parameters:
  • speaker – If specified, select specific speaker(s) defined in the corpus. Default is ‘ALL’ (all participants). Common choices are ‘CHI’ (the child), ‘MOT’ (mother), [‘CHI’,’MOT’] (exclude researchers)
  • stem – If true, then use word stems instead of word strings.
  • relation – If true, then return tuples of (stem, index, dependent_index)
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • replace – If true, then use the replaced (intended) word instead of the original word (e.g., ‘wat’ will be replaced with ‘watch’)
class nltk.corpus.reader.AlignedCorpusReader(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\s+', gaps=True, discard_empty=True, flags=56), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=56), alignedsent_block_reader=<function read_alignedsent_block at 0x10805b268>, encoding='latin1')

Bases: nltk.corpus.reader.api.CorpusReader

Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.

aligned_sents(fileids=None)
Returns:the given file(s) as a list of AlignedSent objects.
Return type:list(AlignedSent)
raw(fileids=None)
Returns:the given file(s) as a single string.
Return type:str
sents(fileids=None)
Returns:the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
Return type:list(list(str))
words(fileids=None)
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.TimitTaggedCorpusReader(*args, **kwargs)

Bases: nltk.corpus.reader.tagged.TaggedCorpusReader

A corpus reader for tagged sentences that are included in the TIMIT corpus.

paras()
tagged_paras()
class nltk.corpus.reader.LinThesaurusCorpusReader(root, badscore=0.0)

Bases: nltk.corpus.reader.api.CorpusReader

Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.

scored_synonyms(ngram, fileid=None)

Returns a list of scored synonyms (tuples of synonyms and scores) for the current ngram

Parameters:
  • ngram (str) – ngram to lookup
  • fileid (str) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of tuples of scores and synonyms; otherwise, list of tuples of fileids and lists, where inner lists consist of tuples of scores and synonyms.

similarity(ngram1, ngram2, fileid=None)

Returns the similarity score for two ngrams.

Parameters:
  • ngram1 (str) – first ngram to compare
  • ngram2 (str) – second ngram to compare
  • fileid (str) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, just the score for the two ngrams; otherwise, list of tuples of fileids and scores.

synonyms(ngram, fileid=None)

Returns a list of synonyms for the current ngram.

Parameters:
  • ngram (str) – ngram to lookup
  • fileid (str) – thesaurus fileid to search in. If None, search all fileids.
Returns:

If fileid is specified, list of synonyms; otherwise, list of tuples of fileids and lists, where inner lists contain synonyms.

class nltk.corpus.reader.SemcorCorpusReader(root, fileids, wordnet, lazy=True)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

chunk_sents(fileids=None)
Returns:the given file(s) as a list of sentences, each encoded as a list of chunks.
Return type:list(list(list(str)))
chunks(fileids=None)
Returns:the given file(s) as a list of chunks, each of which is a list of words and punctuation symbols that form a unit.
Return type:list(list(str))
sents(fileids=None)
Returns:the given file(s) as a list of sentences, each encoded as a list of word strings.
Return type:list(list(str))
tagged_chunks(fileids=None, tag='pos')
Returns:the given file(s) as a list of tagged chunks, represented in tree form.
Return type:list(Tree)
Parameters:tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
tagged_sents(fileids=None, tag='pos')
Returns:the given file(s) as a list of sentences. Each sentence is represented as a list of tagged chunks (in tree form).
Return type:list(list(Tree))
Parameters:tag – ‘pos’ (part of speech), ‘sem’ (semantic), or ‘both’ to indicate the kind of tags to include. Semantic tags consist of WordNet lemma IDs, plus an ‘NE’ node if the chunk is a named entity without a specific entry in WordNet. (Named entities of type ‘other’ have no lemma. Other chunks not in WordNet have no semantic tag. Punctuation tokens have None for their part of speech tag.)
words(fileids=None)
Returns:the given file(s) as a list of words and punctuation symbols.
Return type:list(str)
class nltk.corpus.reader.FramenetCorpusReader(root, fileids)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

A corpus reader for the Framenet Corpus.

>>> from nltk.corpus import framenet as fn
>>> fn.lu(3238).frame.lexUnit['glint.v'] is fn.lu(3238)
True
>>> fn.frame_by_name('Replacing') is fn.lus('replace.v')[0].frame
True
>>> fn.lus('prejudice.n')[0].frame.frameRelations == fn.frame_relations('Partiality')
True
annotated_document(fn_docid)

Returns the annotated document whose id number is fn_docid. This id number can be obtained by calling the documents() function.

The dict that is returned from this function will contain the following keys:

  • ‘_type’ : ‘fulltextannotation’

  • ‘sentence’ : a list of sentences in the document
    • Each item in the list is a dict containing the following keys:
      • ‘ID’ : the ID number of the sentence

      • ‘_type’ : ‘sentence’

      • ‘text’ : the text of the sentence

      • ‘paragNo’ : the paragraph number

      • ‘sentNo’ : the sentence number

      • ‘docID’ : the document ID number

      • ‘corpID’ : the corpus ID number

      • ‘aPos’ : the annotation position

      • ‘annotationSet’ : a list of annotation layers for the sentence
        • Each item in the list is a dict containing the following keys:
          • ‘ID’ : the ID number of the annotation set

          • ‘_type’ : ‘annotationset’

          • ‘status’ : either ‘MANUAL’ or ‘UNANN’

          • ‘luName’ : (only if status is ‘MANUAL’)

          • ‘luID’ : (only if status is ‘MANUAL’)

          • ‘frameID’ : (only if status is ‘MANUAL’)

          • ‘frameName’: (only if status is ‘MANUAL’)

          • ‘layer’ : a list of labels for the layer
            • Each item in the layer is a dict containing the following keys:

              • ‘_type’: ‘layer’

              • ‘rank’

              • ‘name’

              • ‘label’ : a list of labels in the layer
                • Each item is a dict containing the following keys:
                  • ‘start’
                  • ‘end’
                  • ‘name’
                  • ‘feID’ (optional)
Parameters:fn_docid (int) – The Framenet id number of the document
Returns:Information about the annotated document
Return type:dict
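The nested structure listed above can be walked with plain loops. The sketch below uses a hand-built stand-in dict rather than real FrameNet data; iter_labels is a hypothetical helper and every value in toy_document is made up.

```python
def iter_labels(document):
    """Yield (sentence_id, layer_name, label_name, start, end) tuples."""
    for sentence in document["sentence"]:
        for annotation_set in sentence["annotationSet"]:
            for layer in annotation_set["layer"]:
                for label in layer["label"]:
                    yield (sentence["ID"], layer["name"], label["name"],
                           label["start"], label["end"])


# Hand-built stand-in for a returned document; all values are made up.
toy_document = {
    "_type": "fulltextannotation",
    "sentence": [{
        "ID": 1, "_type": "sentence", "text": "She boiled the eggs .",
        "annotationSet": [{
            "ID": 10, "_type": "annotationset", "status": "MANUAL",
            "layer": [{
                "_type": "layer", "rank": 1, "name": "FE",
                "label": [
                    {"start": 0, "end": 2, "name": "Cook"},
                    {"start": 11, "end": 18, "name": "Food"},
                ],
            }],
        }],
    }],
}

labels = list(iter_labels(toy_document))
```

Only keys named in the list above are used, so the same loop should apply to a real returned dict, provided it follows that layout.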
buildindexes()

Build the internal indexes to make look-ups faster.

documents(name=None)

Return a list of the annotated documents in Framenet.

Details for a specific annotated document can be obtained by passing the value of its ‘ID’ field to this class’s annotated_document() function.

>>> from nltk.corpus import framenet as fn
>>> len(fn.documents())
78
>>> set([x.corpname for x in fn.documents()])==set(['ANC', 'C-4', 'KBEval',                     'LUCorpus-v0.3', 'Miscellaneous', 'NTI', 'PropBank', 'QA', 'SemAnno'])
True
Parameters:name (str) – A regular expression pattern used to search the file name of each annotated document. The document’s file name contains the name of the corpus that the document is from, followed by two underscores “__” followed by the document name. So, for example, the file name “LUCorpus-v0.3__20000410_nyt-NEW.xml” is from the corpus named “LUCorpus-v0.3” and the document name is “20000410_nyt-NEW.xml”.
Returns:A list of selected (or all) annotated documents
Return type:list of dicts, where each dict object contains the following keys:
  • ‘name’
  • ‘ID’
  • ‘corpid’
  • ‘corpname’
  • ‘description’
  • ‘filename’
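The file name convention and the pattern search described for the name parameter can be sketched on plain strings. split_document_filename and filter_filenames are hypothetical helpers, the second file name is made up, and matching the pattern anywhere in the file name is an assumption.

```python
import re


def split_document_filename(filename):
    """Split 'corpus__document' into (corpus_name, document_name)."""
    corpus_name, document_name = filename.split("__", 1)
    return corpus_name, document_name


def filter_filenames(filenames, name=None):
    """Keep file names matching the pattern; keep everything if name is None."""
    if name is None:
        return list(filenames)
    # Assumption: the pattern may match anywhere in the file name (re.search).
    return [f for f in filenames if re.search(name, f)]


filenames = [
    "LUCorpus-v0.3__20000410_nyt-NEW.xml",
    "ANC__example-document.xml",  # made-up file name
]
parts = split_document_filename(filenames[0])
selected = filter_filenames(filenames, r"^LUCorpus")
```

Splitting only on the first double underscore keeps any later underscores inside the document name.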
fe_relations()

Obtain a list of frame element relations.

>>> from nltk.corpus import framenet as fn
>>> ferels = fn.fe_relations()
>>> isinstance(ferels, list)
True
>>> len(ferels)
10020
>>> PrettyDict(ferels[0], breakLines=True)
{'ID': 14642,
'_type': 'ferelation',
'frameRelation': <Parent=Abounding_with -- Inheritance -> Child=Lively_place>,
'subFE': <fe ID=11370 name=Degree>,
'subFEName': 'Degree',
'subFrame': <frame ID=1904 name=Lively_place>,
'subID': 11370,
'supID': 2271,
'superFE': <fe ID=2271 name=Degree>,
'superFEName': 'Degree',
'superFrame': <frame ID=262 name=Abounding_with>,
'type': <framerelationtype ID=1 name=Inheritance>}
Returns:A list of all of the frame element relations in framenet
Return type:list(dict)
frame(fn_fid_or_fname, ignorekeys=[])

Get the details for the specified Frame using the frame’s name or id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame(256)
>>> f.name
'Medical_specialties'
>>> f = fn.frame('Medical_specialties')
>>> f.ID
256
>>> # ensure non-ASCII character in definition doesn't trigger an encoding error:
>>> fn.frame('Imposing_obligation')
frame (1494): Imposing_obligation...

The dict that is returned from this function will contain the following information about the Frame:

  • ‘name’ : the name of the Frame (e.g. ‘Birth’, ‘Apply_heat’, etc.)

  • ‘definition’ : textual definition of the Frame

  • ‘ID’ : the internal ID number of the Frame

  • ‘semTypes’ : a list of semantic types for this frame
    • Each item in the list is a dict containing the following keys:
      • ‘name’ : can be used with the semtype() function
      • ‘ID’ : can be used with the semtype() function
  • ‘lexUnit’ : a dict containing all of the LUs for this frame.

    The keys in this dict are the names of the LUs and the value for each key is itself a dict containing info about the LU (see the lu() function for more info.)

  • ‘FE’ : a dict containing the Frame Elements that are part of this frame

    The keys in this dict are the names of the FEs (e.g. ‘Body_system’) and the values are dicts containing the following keys

    • ‘definition’ : The definition of the FE

    • ‘name’ : The name of the FE e.g. ‘Body_system’

    • ‘ID’ : The id number

    • ‘_type’ : ‘fe’

    • ‘abbrev’ : Abbreviation e.g. ‘bod’

    • ‘coreType’ : one of “Core”, “Peripheral”, or “Extra-Thematic”

    • ‘semType’ : if not None, a dict with the following two keys:
      • ‘name’ : name of the semantic type. Can be used with the semtype() function
      • ‘ID’ : id number of the semantic type. Can be used with the semtype() function

    • ‘requiresFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
    • ‘excludesFE’ : if not None, a dict with the following two keys:
      • ‘name’ : the name of another FE in this frame
      • ‘ID’ : the id of the other FE in this frame
  • ‘frameRelation’ : a list of objects describing frame relations

  • ‘FEcoreSets’ : a list of Frame Element core sets for this frame
    • Each item in the list is a list of FE objects
Parameters:
  • fn_fid_or_fname (int or str) – The Framenet name or id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

frame_by_id(fn_fid, ignorekeys=[])

Get the details for the specified Frame using the frame’s id number.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_id(256)
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fid (int) – The Framenet id number of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_by_name(fn_fname, ignorekeys=[], check_cache=True)

Get the details for the specified Frame using the frame’s name.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> f = fn.frame_by_name('Medical_specialties')
>>> f.ID
256
>>> f.name
'Medical_specialties'
>>> f.definition
"This frame includes words that name ..."
Parameters:
  • fn_fname (str) – The name of the frame
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

Information about a frame

Return type:

dict

Also see the frame() function for details about what is contained in the dict that is returned.

frame_ids_and_names(name=None)

Uses the frame index, which is much faster than looking up each frame definition if only the names and IDs are needed.

frame_relation_types()

Obtain a list of frame relation types.

>>> from nltk.corpus import framenet as fn
>>> frts = list(fn.frame_relation_types())
>>> isinstance(frts, list)
True
>>> len(frts)
9
>>> PrettyDict(frts[0], breakLines=True)
{'ID': 1,
 '_type': 'framerelationtype',
 'frameRelations': [<Parent=Event -- Inheritance -> Child=Change_of_consistency>, <Parent=Event -- Inheritance -> Child=Rotting>, ...],
 'name': 'Inheritance',
 'subFrameName': 'Child',
 'superFrameName': 'Parent'}
Returns:A list of all of the frame relation types in framenet
Return type:list(dict)
frame_relations(frame=None, frame2=None, type=None)
Parameters:
  • frame (int or str or AttrDict) – (optional) frame object, name, or ID; only relations involving this frame will be returned
  • frame2 – (optional; ‘frame’ must be a different frame) only show relations between the two specified frames, in either direction
  • type – (optional) frame relation type (name or object); show only relations of this type
Returns:A list of all of the frame relations in framenet
Return type:list(dict)

>>> from nltk.corpus import framenet as fn
>>> frels = fn.frame_relations()
>>> isinstance(frels, list)
True
>>> len(frels)
1676
>>> PrettyList(fn.frame_relations('Cooking_creation'), maxReprSize=0, breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>,
 <MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
>>> PrettyList(fn.frame_relations(373), breakLines=True)
[<Parent=Topic -- Using -> Child=Communication>,
 <Source=Discussion -- ReFraming_Mapping -> Target=Topic>, ...]
>>> PrettyList(fn.frame_relations(fn.frame('Cooking_creation')), breakLines=True)
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>,
 <Parent=Apply_heat -- Using -> Child=Cooking_creation>, ...]
>>> PrettyList(fn.frame_relations('Cooking_creation', type='Inheritance'))
[<Parent=Intentionally_create -- Inheritance -> Child=Cooking_creation>]
>>> PrettyList(fn.frame_relations('Cooking_creation', 'Apply_heat'), breakLines=True)
[<Parent=Apply_heat -- Using -> Child=Cooking_creation>,
<MainEntry=Apply_heat -- See_also -> ReferringEntry=Cooking_creation>]
frames(name=None)

Obtain details for all frames, or for the frames whose names match a pattern.

>>> from nltk.corpus import framenet as fn
>>> len(fn.frames())
1019
>>> PrettyList(fn.frames(r'(?i)medical'), maxReprSize=0, breakLines=True)
[<frame ID=256 name=Medical_specialties>,
 <frame ID=257 name=Medical_instruments>,
 <frame ID=255 name=Medical_professionals>,
 <frame ID=239 name=Medical_conditions>]

A brief intro to Frames (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and props that are needed for that Frame. For example, the “Apply_heat” frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.

We call the roles of a Frame “frame elements” (FEs) and the frame-evoking words are called “lexical units” (LUs).

FrameNet includes relations between Frames. Several types of relations are defined, of which the most important are:

  • Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the “Revenge” frame which inherits from the “Rewards_and_punishments” frame.
  • Using: The child frame presupposes the parent frame as background, e.g. the “Speed” frame “uses” (or presupposes) the “Motion” frame; however, not all parent FEs need to be bound to child FEs.
  • Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the “Criminal_process” frame has subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.
  • Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the “Hiring” and “Get_a_job” frames, which perspectivize the “Employment_start” frame from the Employer’s and the Employee’s point of view, respectively.
Parameters:name (str) – A regular expression pattern used to match against Frame names. If ‘name’ is None, then a list of all Framenet Frames will be returned.
Returns:A list of matching Frames (or all Frames).
Return type:list(AttrDict)
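
The Inheritance relation described above is transitive in the IS-A sense, so it is sometimes useful to collect every frame that a given frame inherits from. The following is a minimal sketch, not part of the FrameNet reader: it assumes the relations have already been reduced to plain (parent, relation type, child) tuples, and the helper name inheritance_ancestors is hypothetical. The sample tuples reuse the frame pairs mentioned in the list above.

```python
from collections import defaultdict


def inheritance_ancestors(relations, frame_name):
    """Return the set of frames that frame_name inherits from, directly or indirectly.

    relations is an iterable of (parent, relation_type, child) tuples.
    This tuple layout is an assumption made for illustration only.
    """
    # Map each child frame to its direct Inheritance parents.
    parents = defaultdict(set)
    for parent, rel_type, child in relations:
        if rel_type == "Inheritance":
            parents[child].add(parent)

    # Walk upward through the parents, guarding against cycles.
    seen = set()
    stack = [frame_name]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Frame pairs taken from the relation descriptions above.
relations = [
    ("Rewards_and_punishments", "Inheritance", "Revenge"),
    ("Motion", "Using", "Speed"),
    ("Criminal_process", "Subframe", "Arrest"),
]

print(inheritance_ancestors(relations, "Revenge"))
```

Only Inheritance edges are followed, so “Speed” has no ancestors in this sample even though it uses “Motion”.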
frames_by_lemma(pat)

Returns a list of all frames that contain LUs in which the name attribute of the LU matches the given regular expression pat. Note that LU names are composed of “lemma.POS”, where the “lemma” part can be made up of either a single lexeme (e.g. ‘run’) or multiple lexemes (e.g. ‘a little’).

Note: if you are going to be doing a lot of this type of searching, you’d want to build an index that maps from lemmas to frames because each time frames_by_lemma() is called, it has to search through ALL of the frame XML files in the db.

>>> from nltk.corpus import framenet as fn
>>> fn.frames_by_lemma(r'(?i)a little')
[<frame ID=189 name=Quantity>, <frame ID=2001 name=Degree>]
Returns:A list of frame objects.
Return type:list(AttrDict)
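
The note above suggests building an index from lemmas to frames when many lookups are needed. A minimal sketch of such an index follows. It assumes each LU object exposes a name of the form “lemma.POS” and a frame with a name attribute; the helper build_lemma_index is hypothetical, and the stand-in LU objects (including which frame each one belongs to) are invented so the example runs without the corpus.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_lemma_index(lus):
    """Map each lower-cased lemma to the set of frame names whose LUs use it.

    Assumes every item has .name ("lemma.POS") and .frame.name.
    """
    index = defaultdict(set)
    for lu in lus:
        # Split on the last dot so lemmas containing spaces stay intact.
        lemma, sep, _pos = lu.name.rpartition(".")
        if not sep:
            lemma = lu.name  # no POS suffix present; keep the whole name
        index[lemma.lower()].add(lu.frame.name)
    return index


def make_lu(name, frame_name):
    """Build a stand-in LU object (hypothetical data, for illustration)."""
    return SimpleNamespace(name=name, frame=SimpleNamespace(name=frame_name))


sample_lus = [
    make_lu("a little.n", "Quantity"),
    make_lu("a little.adv", "Degree"),
    make_lu("a little bit.adv", "Degree"),
]

index = build_lemma_index(sample_lus)
print(sorted(index["a little"]))
```

Unlike frames_by_lemma(), this index answers exact lemma lookups only, not regular expression searches. With the corpus installed, passing fn.lus() as the argument may work, assuming those objects carry the frame attribute.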
lu(fn_luid, ignorekeys=[])

Get information about a specific Lexical Unit using the id number fn_luid. This function reads the LU information from the xml file on disk each time it is called. You may want to cache this info if you plan to call this function with the same id number multiple times.

Usage examples:

>>> from nltk.corpus import framenet as fn
>>> fn.lu(256).name
'foresee.v'
>>> fn.lu(256).definition
'COD: be aware of beforehand; predict.'
>>> fn.lu(256).frame.name
'Expectation'
>>> pprint(list(map(PrettyDict, fn.lu(256).lexemes)))
[{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}]
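
The caching advice above can be followed with the standard library. The sketch below wraps any lookup function that takes an ID in functools.lru_cache; make_cached_lookup and slow_lookup are hypothetical names, and slow_lookup stands in for fn.lu so the example runs without the corpus.

```python
from functools import lru_cache


def make_cached_lookup(lookup, maxsize=1024):
    """Return a version of lookup that remembers results for repeated IDs."""

    @lru_cache(maxsize=maxsize)
    def cached(luid):
        return lookup(luid)

    return cached


# Stand-in for fn.lu: records every call so the cache effect is visible.
calls = []


def slow_lookup(luid):
    calls.append(luid)
    return {"ID": luid}


cached_lookup = make_cached_lookup(slow_lookup)
first = cached_lookup(256)
second = cached_lookup(256)  # served from the cache, slow_lookup is not called again
print(len(calls))
```

With the corpus installed, make_cached_lookup(fn.lu) would be the corresponding call. Cached calls return the same object each time, so callers should avoid mutating the result.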

The dict that is returned from this function will contain most of the following information about the LU. Note that some LUs do not contain all of these pieces of information - particularly ‘totalAnnotated’ and ‘incorporatedFE’ may be missing in some LUs:

  • ‘name’ : the name of the LU (e.g. ‘merger.n’)

  • ‘definition’ : textual definition of the LU

  • ‘ID’ : the internal ID number of the LU

  • ‘_type’ : ‘lu’

  • ‘status’ : e.g. ‘Created’

  • ‘frame’ : Frame that this LU belongs to

  • ‘POS’ : the part of speech of this LU (e.g. ‘N’)

  • ‘totalAnnotated’ : total number of examples annotated with this LU

  • ‘incorporatedFE’ : FE that incorporates this LU (e.g. ‘Ailment’)

  • ‘sentenceCount’ : a dict with the following two keys:
    • ‘annotated’: number of sentences annotated with this LU
    • ‘total’ : total number of sentences with this LU
  • ‘lexemes’ : a list of dicts describing the lemma of this LU.

    Each dict in the list contains these keys:

    • ‘POS’ : part of speech e.g. ‘N’

    • ‘name’ : either single-lexeme e.g. ‘merger’ or multi-lexeme e.g. ‘a little’

    • ‘order’: the order of the lexeme in the lemma (starting from 1)

    • ‘headword’: a boolean (‘true’ or ‘false’)

    • ‘breakBefore’: Can this lexeme be separated from the previous lexeme?
      Consider: “take over.v” as in:

      Germany took over the Netherlands in 2 days.

      Germany took the Netherlands over in 2 days.

      In this case, ‘breakBefore’ would be “true” for the lexeme “over”. Contrast this with “take after.v” as in:

      Mary takes after her grandmother.

      *Mary takes her grandmother after.

      In this case, ‘breakBefore’ would be “false” for the lexeme “after”

  • ‘lemmaID’ : Can be used to connect lemmas in different LUs

  • ‘semTypes’ : a list of semantic type objects for this LU

  • ‘subCorpus’ : a list of subcorpora
    • Each item in the list is a dict containing the following keys:
      • ‘name’ :

      • ‘sentence’ : a list of sentences in the subcorpus
        • each item in the list is a dict with the following keys:
          • ‘ID’:

          • ‘sentNo’:

          • ‘text’: the text of the sentence

          • ‘aPos’:

          • ‘annotationSet’: a list of annotation sets
            • each item in the list is a dict with the following keys:
              • ‘ID’:

              • ‘status’:

              • ‘layer’: a list of layers
                • each layer is a dict containing the following keys:
                  • ‘name’: layer name (e.g. ‘BNC’)

                  • ‘rank’:

                  • ‘label’: a list of labels for the layer
                    • each label is a dict containing the following keys:
                      • ‘start’: start pos of label in sentence ‘text’ (0-based)
                      • ‘end’: end pos of label in sentence ‘text’ (0-based)
                      • ‘name’: name of label (e.g. ‘NN1’)
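
The nested ‘subCorpus’ structure listed above can be flattened with a small generator. This is a sketch under the assumption that the dict follows exactly the key layout described in the list; iter_labels is a hypothetical helper, and the sample dict is hand-built for illustration rather than taken from FrameNet.

```python
def iter_labels(lu):
    """Yield (sentence text, layer name, label name, start, end) for every label.

    Missing levels are treated as empty lists, since some LUs lack
    parts of this structure.
    """
    for subcorpus in lu.get("subCorpus", []):
        for sentence in subcorpus.get("sentence", []):
            for annotation_set in sentence.get("annotationSet", []):
                for layer in annotation_set.get("layer", []):
                    for label in layer.get("label", []):
                        yield (
                            sentence["text"],
                            layer["name"],
                            label["name"],
                            label["start"],
                            label["end"],
                        )


# Hand-built stand-in for an LU dict (all values are made up).
sample_lu = {
    "name": "example.n",
    "subCorpus": [
        {
            "name": "hypothetical-subcorpus",
            "sentence": [
                {
                    "ID": 1,
                    "sentNo": 0,
                    "text": "An example sentence.",
                    "annotationSet": [
                        {
                            "ID": 10,
                            "status": "MANUAL",
                            "layer": [
                                {
                                    "name": "BNC",
                                    "rank": 1,
                                    "label": [{"start": 0, "end": 1, "name": "AT0"}],
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

for row in iter_labels(sample_lu):
    print(row)
```

The generator reports the start and end offsets as stored and does not slice the sentence text, because the list above does not say whether ‘end’ is inclusive.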

Under the hood, this implementation looks up the lexical unit information in the frame definition file. That file does not contain corpus annotations, so the LU files will be accessed on demand if those are needed. In principle, valence patterns could be loaded here too, though these are not currently supported.

Parameters:
  • fn_luid (int) – The id number of the lexical unit
  • ignorekeys (list(str)) – The keys to ignore. These keys will not be included in the output. (optional)
Returns:

All information about the lexical unit

Return type:

dict

lu_basic(fn_luid)

Returns basic information about the LU whose id is fn_luid. This is basically just a wrapper around the lu() function with “subCorpus” info excluded.

>>> from nltk.corpus import framenet as fn
>>> PrettyDict(fn.lu_basic(256), breakLines=True)
{'ID': 256,
 'POS': 'V',
 '_type': 'lu',
 'definition': 'COD: be aware of beforehand; predict.',
 'frame': <frame ID=26 name=Expectation>,
 'lemmaID': 15082,
 'lexemes': [{'POS': 'V', 'breakBefore': 'false', 'headword': 'false', 'name': 'foresee', 'order': 1}],
 'name': 'foresee.v',
 'semTypes': [],
 'sentenceCount': {'annotated': 44, 'total': 227},
 'status': 'FN1_Sent'}
Parameters:fn_luid (int) – The id number of the desired LU
Returns:Basic information about the lexical unit
Return type:dict
lu_ids_and_names(name=None)

Uses the LU index, which is much faster than looking up each LU definition if only the names and IDs are needed.

lus(name=None)

Obtain details for all lexical units, or for the lexical units whose names match a pattern.

>>> from nltk.corpus import framenet as fn
>>> len(fn.lus())
11829
>>> PrettyList(fn.lus(r'(?i)a little'), maxReprSize=0, breakLines=True)
[<lu ID=14744 name=a little bit.adv>,
 <lu ID=14733 name=a little.n>,
 <lu ID=14743 name=a little.adv>]

A brief intro to Lexical Units (excerpted from “FrameNet II: Extended Theory and Practice” by Ruppenhofer et al., 2010):

A lexical unit (LU) is a pairing of a word with a meaning. For example, the “Apply_heat” Frame describes a common situation involving a Cook, some Food, and a Heating Instrument, and is _evoked_ by words such as bake, blanch, boil, broil, brown, simmer, steam, etc. These frame-evoking words are the LUs in the Apply_heat frame. Each sense of a polysemous word is a different LU.

We have used the word “word” in talking about LUs. The reality is actually rather complex. When we say that the word “bake” is polysemous, we mean that the lemma “bake.v” (which has the word-forms “bake”, “bakes”, “baked”, and “baking”) is linked to three different frames:

  • Apply_heat: “Michelle baked the potatoes for 45 minutes.”
  • Cooking_creation: “Michelle baked her mother a cake for her birthday.”
  • Absorb_heat: “The potatoes have to bake for more than 30 minutes.”

These constitute three different LUs, with different definitions.

Multiword expressions such as “given name” and hyphenated words like “shut-eye” can also be LUs. Idiomatic phrases such as “middle of nowhere” and “give the slip (to)” are also defined as LUs in the appropriate frames (“Isolated_places” and “Evading”, respectively), and their internal structure is not analyzed.

Framenet provides multiple annotated examples of each sense of a word (i.e. each LU). Moreover, the set of examples (approximately 20 per LU) illustrates all of the combinatorial possibilities of the lexical unit.

Each LU is linked to a Frame, and hence to the other words which evoke that Frame. This makes the FrameNet database similar to a thesaurus, grouping together semantically similar words.

In the simplest case, frame-evoking words are verbs such as “fried” in:

“Matilde fried the catfish in a heavy iron skillet.”

Sometimes event nouns may evoke a Frame. For example, “reduction” evokes “Cause_change_of_scalar_position” in:

“...the reduction of debt levels to $665 million from $2.6 billion.”

Adjectives may also evoke a Frame. For example, “asleep” may evoke the “Sleep” frame as in:

“They were asleep for hours.”

Many common nouns, such as artifacts like “hat” or “tower”, typically serve as dependents rather than clearly evoking their own frames.

Parameters:name (str) –

A regular expression pattern used to search the LU names. Note that LU names take the form of a dotted string (e.g. “run.v” or “a little.adv”) in which a lemma precedes the “.” and a POS follows it. The lemma may be composed of a single lexeme (e.g. “run”) or of multiple lexemes (e.g. “a little”). If ‘name’ is not given, then all LUs will be returned.

The valid POSes are:

  • v - verb
  • n - noun
  • a - adjective
  • adv - adverb
  • prep - preposition
  • num - numbers
  • intj - interjection
  • art - article
  • c - conjunction
  • scon - subordinating conjunction
Returns:A list of selected (or all) lexical units
Return type:list of LU objects (dicts). See the lu() function for info about the specifics of LU objects.
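
Because LU names have the dotted “lemma.POS” form described above, a name can be split into its two parts on the last dot. The following sketch uses the POS abbreviations from the list above; split_lu_name is a hypothetical helper, not part of the reader.

```python
# POS abbreviations as listed in the documentation above.
VALID_POS = {
    "v": "verb",
    "n": "noun",
    "a": "adjective",
    "adv": "adverb",
    "prep": "preposition",
    "num": "numbers",
    "intj": "interjection",
    "art": "article",
    "c": "conjunction",
    "scon": "subordinating conjunction",
}


def split_lu_name(name):
    """Split an LU name such as 'a little.adv' into (lemma, pos)."""
    # rpartition splits on the last dot, so multi-lexeme lemmas stay whole.
    lemma, sep, pos = name.rpartition(".")
    if not sep or pos not in VALID_POS:
        raise ValueError(f"not a valid LU name: {name!r}")
    return lemma, pos


print(split_lu_name("a little.adv"))
print(split_lu_name("foresee.v"))
```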
propagate_semtypes()

Apply inference rules to distribute semtypes over relations between FEs. For FrameNet 1.5, this results in 1011 semtypes being propagated. (Not done by default because it requires loading all frame files, which takes several seconds. If this needed to be fast, it could be rewritten to traverse the neighboring relations on demand for each FE semtype.)

>>> from nltk.corpus import framenet as fn
>>> sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
4241
>>> fn.propagate_semtypes()
>>> sum(1 for f in fn.frames() for fe in f.FE.values() if fe.semType)
5252
readme()

Return the contents of the corpus README.txt (or README) file.

semtype(key)
>>> from nltk.corpus import framenet as fn
>>> fn.semtype(233).name
'Temperature'
>>> fn.semtype(233).abbrev
'Temp'
>>> fn.semtype('Temperature').ID
233
Parameters:key (string or int) – The name, abbreviation, or id number of the semantic type
Returns:Information about a semantic type
Return type:dict
semtype_inherits(st, superST)
semtypes()

Obtain a list of semantic types.

>>> from nltk.corpus import framenet as fn
>>> stypes = fn.semtypes()
>>> len(stypes)
73
>>> sorted(stypes[0].keys())
['ID', '_type', 'abbrev', 'definition', 'name', 'rootType', 'subTypes', 'superType']
Returns:A list of all of the semantic types in framenet
Return type:list(dict)
class nltk.corpus.reader.UdhrCorpusReader(root='udhr')

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

ENCODINGS = [('.*-Latin1$', 'latin-1'), ('.*-Hebrew$', 'hebrew'), ('.*-Arabic$', 'cp1256'), ('Czech_Cesky-UTF8', 'cp1250'), ('.*-Cyrillic$', 'cyrillic'), ('.*-SJIS$', 'SJIS'), ('.*-GB2312$', 'GB2312'), ('.*-Latin2$', 'ISO-8859-2'), ('.*-Greek$', 'greek'), ('.*-UTF8$', 'utf-8'), ('Hungarian_Magyar-Unicode', 'utf-16-le'), ('Amahuaca', 'latin1'), ('Turkish_Turkce-Turkish', 'latin5'), ('Lithuanian_Lietuviskai-Baltic', 'latin4'), ('Japanese_Nihongo-EUC', 'EUC-JP'), ('Japanese_Nihongo-JIS', 'iso2022_jp'), ('Chinese_Mandarin-HZ', 'hz'), ('Abkhaz\\-Cyrillic\\+Abkh', 'cp1251')]
SKIP = {'Chinese_Mandarin-HZ', 'Vietnamese-VPS', 'Azeri_Azerbaijani_Latin-Az.Times.Lat0117', 'Vietnamese-TCVN', 'Hungarian_Magyar-Unicode', 'Gujarati-UTF8', 'Armenian-DallakHelv', 'Esperanto-T61', 'Lao-UTF8', 'Tigrinya_Tigrigna-VG2Main', 'Czech-Latin2-err', 'Magahi-Agra', 'Burmese_Myanmar-WinResearcher', 'Chinese_Mandarin-UTF8', 'Azeri_Azerbaijani_Cyrillic-Az.Times.Cyr.Normal0117', 'Bhojpuri-Agra', 'Japanese_Nihongo-JIS', 'Russian_Russky-UTF8~', 'Vietnamese-VIQR', 'Amharic-Afenegus6..60375', 'Tamil-UTF8', 'Navaho_Dine-Navajo-Navaho-font', 'Magahi-UTF8', 'Marathi-UTF8', 'Burmese_Myanmar-UTF8'}
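
ENCODINGS is a list of (pattern, encoding) pairs. The sketch below shows one way such a list can be resolved for a file id, assuming the first pattern that matches from the start of the file id wins; that rule is an assumption here, and resolve_encoding is a hypothetical helper rather than the reader's own code. The pairs are a subset of ENCODINGS in their original order, and the ‘Example-…’ file ids are made up.

```python
import re

# Subset of ENCODINGS above, kept in the same relative order.
ENCODINGS_SUBSET = [
    (".*-Latin1$", "latin-1"),
    ("Czech_Cesky-UTF8", "cp1250"),
    (".*-UTF8$", "utf-8"),
    ("Hungarian_Magyar-Unicode", "utf-16-le"),
]


def resolve_encoding(fileid, encodings, default=None):
    """Return the encoding of the first pattern that matches fileid."""
    for pattern, encoding in encodings:
        if re.match(pattern, fileid):
            return encoding
    return default


for fileid in ("Example-Latin1", "Czech_Cesky-UTF8", "Example-UTF8"):
    print(fileid, resolve_encoding(fileid, ENCODINGS_SUBSET))
```

Under this first-match rule the order of the list matters: ‘Czech_Cesky-UTF8’ resolves to cp1250 because its specific entry comes before the general ‘.*-UTF8$’ pattern.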
class nltk.corpus.reader.BNCCorpusReader(root, fileids, lazy=True)

Bases: nltk.corpus.reader.xmldocs.XMLCorpusReader

Corpus reader for the XML version of the British National Corpus.

For access to the complete XML data structure, use the xml() method. For access to simple word lists and tagged word lists, use words(), sents(), tagged_words(), and tagged_sents().

You can obtain the full version of the BNC corpus at http://www.ota.ox.ac.uk/desc/2554

If you extracted the archive to a directory called BNC, then you can instantiate the reader as:

BNCCorpusReader(root='BNC/Texts/', fileids=r'[A-K]/\w*/\w*\.xml')
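
To see which files the fileids pattern in the call above would select, the pattern can be tried against candidate paths relative to the root. This sketch assumes the pattern must match the whole relative path; the paths are hypothetical examples, not a listing of the actual BNC archive.

```python
import re

# The fileids pattern from the instantiation example above.
FILEIDS_PATTERN = r'[A-K]/\w*/\w*\.xml'


def is_selected(relative_path):
    """Return True if relative_path matches the fileids pattern in full."""
    return re.fullmatch(FILEIDS_PATTERN, relative_path) is not None


# Hypothetical relative paths for illustration.
for path in ("A/A0/A00.xml", "A/A0/A00.txt", "L/L0/L00.xml", "A/A00.xml"):
    print(path, is_selected(path))
```

The pattern requires a top-level directory named with a single letter from A to K, one subdirectory level, and a file name ending in .xml.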
sents(fileids=None, strip_space=True, stem=False)
Returns:

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type:

list(list(str))

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_sents(fileids=None, c5=False, strip_space=True, stem=False)
Returns:

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type:

list(list(tuple(str,str)))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
tagged_words(fileids=None, c5=False, strip_space=True, stem=False)
Returns:

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type:

list(tuple(str,str))

Parameters:
  • c5 – If true, then the tags used will be the more detailed c5 tags. Otherwise, the simplified tags will be used.
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
words(fileids=None, strip_space=True, stem=False)
Returns:

the given file(s) as a list of words and punctuation symbols.

Return type:

list(str)

Parameters:
  • strip_space – If true, then strip trailing spaces from word tokens. Otherwise, leave the spaces on the tokens.
  • stem – If true, then use word stems instead of word strings.
class nltk.corpus.reader.SentiWordNetCorpusReader(root, fileids, encoding='utf-8')

Bases: nltk.corpus.reader.api.CorpusReader

all_senti_synsets()
senti_synset(*vals)
senti_synsets(string, pos=None)
unicode_repr()
class nltk.corpus.reader.SentiSynset(pos_score, neg_score, synset)

Bases: builtins.object

neg_score()
obj_score()
pos_score()
unicode_repr()