nltk.corpus.reader.chunked module
A reader for corpora that contain chunked (and optionally tagged) documents.
- class nltk.corpus.reader.chunked.ChunkedCorpusReader
Bases: CorpusReader
Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function. Each of these steps can be performed using a default function or a custom function. By default, paragraphs are split on blank lines; sentences are listed one per line; and sentences are parsed into chunk trees using nltk.chunk.tagstr2tree.
- __init__(root, fileids, extension='', str2chunktree=<function tagstr2tree>, sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)
- Parameters:
root – The root directory for this corpus.
fileids – A list or regexp specifying the fileids in this corpus.
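As a minimal sketch of the constructor, the following builds a reader over a small corpus written to a scratch directory and passes the three default pipeline functions explicitly. The file name sample.chunk and its contents are assumptions made for illustration; the square-bracket and word/tag notation is the input that tagstr2tree parses.

```python
import os
import tempfile

from nltk.chunk import tagstr2tree
from nltk.corpus.reader.chunked import ChunkedCorpusReader
from nltk.corpus.reader.util import read_blankline_block
from nltk.tokenize import RegexpTokenizer

# Hypothetical corpus: one sentence per line, a blank line between
# paragraphs, NP chunks in square brackets, tokens written as word/tag.
SAMPLE = (
    "[ the/DT cat/NN ] sat/VBD\n"
    "[ it/PRP ] purred/VBD\n"
    "\n"
    "[ dogs/NNS ] bark/VBP\n"
)

root = tempfile.mkdtemp()  # scratch directory; not cleaned up in this sketch
with open(os.path.join(root, "sample.chunk"), "w", encoding="utf8") as f:
    f.write(SAMPLE)

# fileids is given as a regexp. The keyword arguments below simply restate
# the documented defaults to show where each pipeline step plugs in.
reader = ChunkedCorpusReader(
    root,
    r".*\.chunk",
    str2chunktree=tagstr2tree,                        # sentence string -> chunk tree
    sent_tokenizer=RegexpTokenizer("\n", gaps=True),  # one sentence per line
    para_block_reader=read_blankline_block,           # paragraphs split on blank lines
    encoding="utf8",
)

fileids = reader.fileids()
first_sentence = list(reader.chunked_sents())[0]
print(fileids)
print(first_sentence)
```

Any of the three functions can be swapped for a custom callable with the same calling convention; the defaults are shown here only to make the pipeline visible.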
- chunked_paras(fileids=None, tagset=None)
- Returns:
the given file(s) as a list of paragraphs, each encoded as a list of sentences, each of which is in turn encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
- Return type:
list(list(Tree))
- chunked_sents(fileids=None, tagset=None)
- Returns:
the given file(s) as a list of sentences, each encoded as a shallow Tree. The leaves of these trees are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags).
- Return type:
list(Tree)
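The difference in nesting between chunked_paras and chunked_sents can be sketched as follows. The corpus file and its contents are hypothetical, and the reader is built with its default settings.

```python
import os
import tempfile

from nltk.corpus.reader.chunked import ChunkedCorpusReader

# Hypothetical corpus: two paragraphs, three sentences in total.
SAMPLE = (
    "[ the/DT cat/NN ] sat/VBD\n"
    "[ it/PRP ] purred/VBD\n"
    "\n"
    "[ dogs/NNS ] bark/VBP\n"
)

root = tempfile.mkdtemp()  # scratch directory; not cleaned up in this sketch
with open(os.path.join(root, "sample.chunk"), "w", encoding="utf8") as f:
    f.write(SAMPLE)

reader = ChunkedCorpusReader(root, r".*\.chunk")

# Paragraph view: list of paragraphs -> list of sentence trees.
chunked_paras = [list(para) for para in reader.chunked_paras()]
# Sentence view: the same trees, without the paragraph grouping.
chunked_sents = list(reader.chunked_sents())

sentences_per_para = [len(para) for para in chunked_paras]
print(sentences_per_para)
print(chunked_sents[2])
```

Both views hold the same sentence trees; only the grouping differs, so flattening chunked_paras should give back chunked_sents.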
- chunked_words(fileids=None, tagset=None)
- Returns:
the given file(s) as a list of tagged words and chunks. Words are encoded as (word, tag) tuples (if the corpus has tags) or word strings (if the corpus has no tags). Chunks are encoded as depth-one trees over (word, tag) tuples or word strings.
- Return type:
list(tuple(str,str) and Tree)
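Because chunked_words returns a flat list that mixes Tree objects with plain tokens, callers usually need to tell the two apart. A short sketch, using a hypothetical one-sentence corpus and an isinstance check:

```python
import os
import tempfile

from nltk.corpus.reader.chunked import ChunkedCorpusReader
from nltk.tree import Tree

# Hypothetical corpus: two NP chunks with two unchunked tokens between them.
SAMPLE = "[ the/DT cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]\n"

root = tempfile.mkdtemp()  # scratch directory; not cleaned up in this sketch
with open(os.path.join(root, "sample.chunk"), "w", encoding="utf8") as f:
    f.write(SAMPLE)

reader = ChunkedCorpusReader(root, r".*\.chunk")

items = list(reader.chunked_words())

# Chunks arrive as depth-one trees; everything else is a (word, tag) tuple.
chunks = [item for item in items if isinstance(item, Tree)]
loose_tokens = [item for item in items if not isinstance(item, Tree)]

chunk_texts = [" ".join(word for word, tag in chunk.leaves()) for chunk in chunks]
print(chunk_texts)
print(loose_tokens)
```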
- paras(fileids=None)
- Returns:
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.
- Return type:
list(list(list(str)))
- sents(fileids=None)
- Returns:
the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
- Return type:
list(list(str))
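paras and sents discard both the tags and the chunk brackets, leaving plain word strings. The sketch below illustrates this on a hypothetical tagged corpus read with the default settings.

```python
import os
import tempfile

from nltk.corpus.reader.chunked import ChunkedCorpusReader

# Hypothetical corpus: two paragraphs separated by a blank line.
SAMPLE = (
    "[ the/DT cat/NN ] sat/VBD\n"
    "[ it/PRP ] purred/VBD\n"
    "\n"
    "[ dogs/NNS ] bark/VBP\n"
)

root = tempfile.mkdtemp()  # scratch directory; not cleaned up in this sketch
with open(os.path.join(root, "sample.chunk"), "w", encoding="utf8") as f:
    f.write(SAMPLE)

reader = ChunkedCorpusReader(root, r".*\.chunk")

# Sentences as lists of word strings; tags and chunk structure are dropped.
sents = [list(sent) for sent in reader.sents()]
# Paragraphs as lists of such sentences.
paras = [[list(sent) for sent in para] for para in reader.paras()]

print(sents)
print(paras)
```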
- tagged_paras(fileids=None, tagset=None)
- Returns:
the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word, tag) tuples.
- Return type:
list(list(list(tuple(str,str))))
- tagged_sents(fileids=None, tagset=None)
- Returns:
the given file(s) as a list of sentences, each encoded as a list of (word, tag) tuples.
- Return type:
list(list(tuple(str,str)))
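The tagged views keep the tags but flatten away the chunk brackets. A minimal sketch on a hypothetical corpus, leaving tagset at its default of None so that no tag mapping is attempted:

```python
import os
import tempfile

from nltk.corpus.reader.chunked import ChunkedCorpusReader

# Hypothetical corpus: two paragraphs, word/tag tokens, NP chunks in brackets.
SAMPLE = (
    "[ the/DT cat/NN ] sat/VBD\n"
    "\n"
    "[ dogs/NNS ] bark/VBP\n"
)

root = tempfile.mkdtemp()  # scratch directory; not cleaned up in this sketch
with open(os.path.join(root, "sample.chunk"), "w", encoding="utf8") as f:
    f.write(SAMPLE)

reader = ChunkedCorpusReader(root, r".*\.chunk")

# Sentences as flat lists of (word, tag) tuples; the NP grouping is gone.
tagged_sents = [list(sent) for sent in reader.tagged_sents()]
# The same sentences, grouped by paragraph.
tagged_paras = [[list(sent) for sent in para] for para in reader.tagged_paras()]

print(tagged_sents)
print(tagged_paras)
```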
- class nltk.corpus.reader.chunked.ChunkedCorpusView
Bases: StreamBackedCorpusView
- __init__(fileid, encoding, tagged, group_by_sent, group_by_para, chunked, str2chunktree, sent_tokenizer, para_block_reader, source_tagset=None, target_tagset=None)
Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.
- Parameters:
fileid – The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos – The file position at which the view will start reading. This can be used to skip over preface sections.
encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).
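ChunkedCorpusView is normally created by the reader methods rather than by hand, but constructing one directly shows how the flags in the signature select the shape of the output. The sketch below assumes the flags are treated as booleans and that the three callables are the reader's documented defaults; the file name and contents are hypothetical.

```python
import os
import tempfile

from nltk.chunk import tagstr2tree
from nltk.corpus.reader.chunked import ChunkedCorpusView
from nltk.corpus.reader.util import read_blankline_block
from nltk.tokenize import RegexpTokenizer

# Hypothetical single-sentence corpus file.
path = os.path.join(tempfile.mkdtemp(), "sample.chunk")
with open(path, "w", encoding="utf8") as f:
    f.write("[ the/DT cat/NN ] sat/VBD\n")

# Keep tags, group by sentence, ignore paragraphs, drop chunk structure:
# this combination yields one list of (word, tag) tuples per sentence.
view = ChunkedCorpusView(
    path,
    "utf8",
    tagged=True,
    group_by_sent=True,
    group_by_para=False,
    chunked=False,
    str2chunktree=tagstr2tree,
    sent_tokenizer=RegexpTokenizer("\n", gaps=True),
    para_block_reader=read_blankline_block,
)

view_sents = [list(sent) for sent in view]
print(view_sents)
```

Setting chunked=True instead would leave each sentence as a Tree, which is the shape described for chunked_sents.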