nltk.corpus.reader.chunked module

A reader for corpora that contain chunked (and optionally tagged) documents.

class nltk.corpus.reader.chunked.ChunkedCorpusReader

Bases: nltk.corpus.reader.api.CorpusReader

Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.
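The three default stages described above can be sketched with the standard library alone. This is a simplified stand-in, not NLTK's implementation: the helper names read_blankline_paragraphs, split_sentences and parse_chunks are hypothetical, chunks are represented as plain (label, leaves) tuples rather than Tree objects, and the sample text is made up for illustration.

```python
import re


def read_blankline_paragraphs(text):
    """Stage 1 (stand-in): split raw text into paragraphs on blank lines."""
    return [p for p in re.split(r"\n\s*\n", text.strip()) if p]


def split_sentences(paragraph):
    """Stage 2 (stand-in): one sentence per line, empty lines discarded."""
    return [line for line in paragraph.split("\n") if line.strip()]


def parse_chunks(sentence, chunk_label="NP"):
    """Stage 3 (stand-in): parse '[ The/DT cat/NN ] sat/VBD' into a list whose
    items are (word, tag) tuples or (chunk_label, [(word, tag), ...]) chunks."""
    result, current = [], None
    for token in re.findall(r"\[|\]|[^\[\]\s]+", sentence):
        if token == "[":
            if current is not None:
                raise ValueError("nested chunks are not supported in this sketch")
            current = []
        elif token == "]":
            if current is None:
                raise ValueError("unexpected ']'")
            result.append((chunk_label, current))
            current = None
        else:
            # Split on the last '/' so that './.' becomes ('.', '.').
            word, sep, tag = token.rpartition("/")
            item = (word, tag) if sep else (token, None)
            (current if current is not None else result).append(item)
    if current is not None:
        raise ValueError("missing ']'")
    return result


# Hypothetical corpus text: two paragraphs, one sentence per line.
text = (
    "[ The/DT cat/NN ] sat/VBD ./.\n"
    "[ It/PRP ] purred/VBD ./.\n"
    "\n"
    "[ Dogs/NNS ] bark/VBP ./.\n"
)

paras = [
    [parse_chunks(sent) for sent in split_sentences(para)]
    for para in read_blankline_paragraphs(text)
]
```

Each stage is a separate function here to mirror the way the reader accepts a custom block reader, sentence tokenizer, or string-to-chunktree function in place of its defaults.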

__init__(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

words(fileids=None)
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

sents(fileids=None)
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

paras(fileids=None)
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))
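The three return types above differ only in nesting depth. A minimal sketch of that relationship, using made-up data and plain lists (an assumption for illustration; the reader returns its own sequence objects rather than literal lists):

```python
from itertools import chain

# Hypothetical content shaped like the result of paras(): list(list(list(str)))
paras = [
    [["The", "cat", "sat", "."], ["It", "purred", "."]],
    [["Dogs", "bark", "."]],
]

# Dropping the paragraph level gives the shape of sents(): list(list(str))
sents = list(chain.from_iterable(paras))

# Dropping the sentence level as well gives the shape of words(): list(str)
words = list(chain.from_iterable(sents))
```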

tagged_words(fileids=None, tagset=None)
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

tagged_sents(fileids=None, tagset=None)
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

tagged_paras(fileids=None, tagset=None)
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Return type

list(list(list(tuple(str,str))))
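The tagset argument of the tagged_* methods is not described above. A reasonable reading, stated here as an assumption, is that it names a target tagset into which the corpus's own tags are converted. The sketch below illustrates that idea with a small hand-written mapping; the dictionary PTB_TO_COARSE and the function map_tags are hypothetical and are not part of NLTK.

```python
# Hypothetical excerpt of a fine-to-coarse tag mapping (assumption, for illustration).
PTB_TO_COARSE = {"DT": "DET", "NN": "NOUN", "VBD": "VERB", ".": "."}


def map_tags(tagged_sent, mapping, unknown="X"):
    """Return a copy of a (word, tag) sentence with each tag converted
    through `mapping`; tags missing from the mapping become `unknown`."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_sent]


tagged_sent = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")]
coarse_sent = map_tags(tagged_sent, PTB_TO_COARSE)
```

The words are left untouched and only the second element of each tuple changes, so the return types listed above stay the same whichever tagset is requested.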

chunked_words(fileids=None, tagset=None)
Returns

the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word,tag) tuples or word strings.

Return type

list(tuple(str,str) or str or Tree)

chunked_sents(fileids=None, tagset=None)
Returns

the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).

Return type

list(Tree)

chunked_paras(fileids=None, tagset=None)
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, each of which is in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).

Return type

list(list(Tree))

class nltk.corpus.reader.chunked.ChunkedCorpusView

Bases: nltk.corpus.reader.util.StreamBackedCorpusView

__init__(fileid, encoding, tagged, group_by_sent, group_by_para, chunked, str2chunktree, sent_tokenizer, para_block_reader, source_tagset=None, target_tagset=None)

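Create a new corpus view based on the file fileid. Paragraphs are read with para_block_reader, split into sentences with sent_tokenizer, and parsed with str2chunktree. See the documentation of the base class, StreamBackedCorpusView, for more information.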

Parameters
  • fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.

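  • startpos – The file position at which the view will start reading. This can be used to skip over preface sections. This parameter is described by the base class, StreamBackedCorpusView, and does not appear in the signature above.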

  • encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).

read_block(stream)

Read a block from the input stream.

Returns

a block of tokens from the input stream

Return type

list(any)

Parameters
  • stream (stream) – an input stream
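The shape of a block plausibly depends on the tagged, chunked, group_by_sent and group_by_para flags passed to __init__. The sketch below shows one way those flags could reshape an already parsed paragraph list. It is an illustration under stated assumptions, not the library's code: Chunk is a hypothetical stand-in for a depth-one chunk tree, sentences are plain lists rather than Tree objects, and shape_sentence and shape_block are made-up helper names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a depth-one chunk tree."""
    label: str
    leaves: list


def shape_sentence(sent, tagged, chunked):
    """Drop tags and/or chunk structure from one parsed sentence."""
    shaped = []
    for item in sent:
        if isinstance(item, Chunk):
            leaves = item.leaves if tagged else [word for word, _tag in item.leaves]
            if chunked:
                shaped.append(Chunk(item.label, leaves))
            else:
                shaped.extend(leaves)  # flatten the chunk into the sentence
        else:
            shaped.append(item if tagged else item[0])
    return shaped


def shape_block(paras, tagged, chunked, group_by_sent, group_by_para):
    """Assemble a block, keeping or flattening sentence and paragraph levels."""
    block = []
    for para in paras:
        para_out = []
        for sent in para:
            shaped = shape_sentence(sent, tagged, chunked)
            if group_by_sent:
                para_out.append(shaped)
            else:
                para_out.extend(shaped)
        if group_by_para:
            block.append(para_out)
        else:
            block.extend(para_out)
    return block


# One paragraph containing one sentence (made-up data).
parsed = [[[Chunk("NP", [("The", "DT"), ("cat", "NN")]), ("sat", "VBD")]]]

words_view = shape_block(parsed, tagged=False, chunked=False,
                         group_by_sent=False, group_by_para=False)
tagged_sents_view = shape_block(parsed, tagged=True, chunked=False,
                                group_by_sent=True, group_by_para=False)
chunked_words_view = shape_block(parsed, tagged=True, chunked=True,
                                 group_by_sent=False, group_by_para=False)
```

Under this reading, each reader method such as words() or chunked_sents() corresponds to one combination of the four flags applied to the same parsed data.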