nltk.corpus.reader.tagged module

A reader for corpora whose documents contain part-of-speech-tagged words.

class nltk.corpus.reader.tagged.CategorizedTaggedCorpusReader

Bases: CategorizedCorpusReader, TaggedCorpusReader

A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.

__init__(*args, **kwargs)

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the TaggedCorpusReader.
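A minimal sketch of how the categorization arguments might be used, assuming NLTK is installed. The directory layout, file names, and tagged text are invented for illustration; cat_pattern is a regular expression whose first group, matched against each file identifier, is taken as that file's category.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedTaggedCorpusReader

# Hypothetical corpus: one sub-directory per category.
files = {
    "news/a.txt": "Stocks/nns rose/vbd\n",
    "fiction/b.txt": "She/prp smiled/vbd\n",
}

with tempfile.TemporaryDirectory() as root:
    for fileid, text in files.items():
        path = os.path.join(root, *fileid.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf8") as f:
            f.write(text)

    # cat_pattern goes to CategorizedCorpusReader; root and the fileid
    # regexp go to TaggedCorpusReader.
    reader = CategorizedTaggedCorpusReader(
        root, r".*\.txt", cat_pattern=r"(\w+)/.*"
    )

    # Materialize the lazy views while the temporary files still exist.
    categories = reader.categories()
    fiction_ids = reader.fileids(categories="fiction")
    news_words = list(reader.tagged_words(categories="news"))

print(categories)   # ['fiction', 'news']
print(fiction_ids)  # ['fiction/b.txt']
print(news_words)   # [('Stocks', 'NNS'), ('rose', 'VBD')]
```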

tagged_paras(fileids=None, categories=None, tagset=None)
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Return type

list(list(list(tuple(str,str))))

tagged_sents(fileids=None, categories=None, tagset=None)
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

tagged_words(fileids=None, categories=None, tagset=None)
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

class nltk.corpus.reader.tagged.MacMorphoCorpusReader

Bases: TaggedCorpusReader

A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using ‘_’ as a separator. Sentence boundaries are based on the end-sentence tag (‘_.’). Paragraph information is not included in the corpus, so each paragraph returned by self.paras() and self.tagged_paras() contains a single sentence.
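The one-word-per-line format described above can be illustrated with a small standard-library sketch. This is not the reader's implementation: parse_word_per_line is a hypothetical helper, the sample text and its tags are invented, and keeping the boundary token inside the sentence is a choice made here rather than a claim about what the reader returns.

```python
def parse_word_per_line(text, sep="_", end_tag="."):
    """Group 'word_tag' lines into sentences that end on end_tag."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last separator so the word may contain sep itself.
        word, found, tag = line.rpartition(sep)
        if not found:
            word, tag = line, None
        current.append((word, tag))
        if tag == end_tag:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


sample = "O_ART\ngato_N\ndorme_V\n._.\nEle_PROPESS\nsonha_V\n._.\n"
sentences = parse_word_per_line(sample)

# No paragraph information exists, so each paragraph wraps one sentence.
paragraphs = [[sentence] for sentence in sentences]

print(len(sentences))    # 2
print(sentences[0][-1])  # ('.', '.')
print(paragraphs[1])     # [[('Ele', 'PROPESS'), ('sonha', 'V'), ('.', '.')]]
```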

__init__(root, fileids, encoding='utf8', tagset=None)

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')

Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

class nltk.corpus.reader.tagged.TaggedCorpusReader

Bases: CorpusReader

Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be separated by blank lines. Sentences and words are tokenized with the default tokenizers, or with custom tokenizers passed to the constructor. Words are parsed using nltk.tag.str2tuple. By default, '/' is used as the separator, so words should have the form:

word1/tag1 word2/tag2 word3/tag3 ...

A custom separator may be specified through the constructor's sep parameter. Part-of-speech tags are case-normalized to upper case.
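The word/tag parsing described above can be tried directly with nltk.tag.str2tuple. A short sketch, assuming NLTK is installed; the sample tokens are invented.

```python
from nltk.tag import str2tuple

# Default separator: the tag is upper-cased, the word is left untouched.
default_pair = str2tuple("fly/nn")
print(default_pair)   # ('fly', 'NN')

# Custom separator, as a reader constructed with sep='_' would use.
custom_pair = str2tuple("fly_nn", sep="_")
print(custom_pair)    # ('fly', 'NN')

# A token with no separator gets a tag of None.
untagged_pair = str2tuple("fly")
print(untagged_pair)  # ('fly', None)

# Only the last separator splits, so the word itself may contain '/'.
slash_pair = str2tuple("1/2/cd")
print(slash_pair)     # ('1/2', 'CD')
```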

__init__(root, fileids, sep='/', word_tokenizer=WhitespaceTokenizer(pattern='\\s+', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), sent_tokenizer=RegexpTokenizer(pattern='\n', gaps=True, discard_empty=True, flags=re.UNICODE|re.MULTILINE|re.DOTALL), para_block_reader=<function read_blankline_block>, encoding='utf8', tagset=None)

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')

Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.
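A runnable sketch of the constructor with its default tokenizers, assuming NLTK is installed. The corpus is a throwaway file written to a temporary directory, and its contents are invented: one sentence per line, with a blank line between paragraphs.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

text = (
    "The/dt cat/nn sat/vbd ./.\n"
    "It/prp purred/vbd ./.\n"
    "\n"
    "Dogs/nns bark/vbp ./.\n"
)

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.txt"), "w", encoding="utf8") as f:
        f.write(text)

    # fileids is given as a regexp matching every .txt file under root.
    reader = TaggedCorpusReader(root, r".*\.txt")

    # The reader returns lazy views, so materialize them while the
    # temporary files still exist.
    words = list(reader.words())
    tagged = list(reader.tagged_words())
    sents = [list(sent) for sent in reader.sents()]
    paras = [[list(sent) for sent in para] for para in reader.tagged_paras()]

print(words[:4])   # ['The', 'cat', 'sat', '.']
print(tagged[:2])  # [('The', 'DT'), ('cat', 'NN')]
print(len(sents))  # 3
print(len(paras))  # 2
```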

paras(fileids=None)
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings.

Return type

list(list(list(str)))

sents(fileids=None)
Returns

the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

Return type

list(list(str))

tagged_paras(fileids=None, tagset=None)
Returns

the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of (word,tag) tuples.

Return type

list(list(list(tuple(str,str))))

tagged_sents(fileids=None, tagset=None)
Returns

the given file(s) as a list of sentences, each encoded as a list of (word,tag) tuples.

Return type

list(list(tuple(str,str)))

tagged_words(fileids=None, tagset=None)
Returns

the given file(s) as a list of tagged words and punctuation symbols, encoded as tuples (word,tag).

Return type

list(tuple(str,str))

words(fileids=None)
Returns

the given file(s) as a list of words and punctuation symbols.

Return type

list(str)

class nltk.corpus.reader.tagged.TaggedCorpusView

Bases: StreamBackedCorpusView

A specialized corpus view for tagged documents. It can be customized via flags to divide the tagged corpus documents up by sentence or paragraph, and to include or omit part-of-speech tags. TaggedCorpusView objects are typically created by TaggedCorpusReader (not directly by NLTK users).

__init__(corpus_file, encoding, tagged, group_by_sent, group_by_para, sep, word_tokenizer, sent_tokenizer, para_block_reader, tag_mapping_function=None)

Create a new corpus view based on the file corpus_file, read in blocks with para_block_reader. See the class documentation for more information.

Parameters
  • corpus_file – The path to the file that is read by this corpus view. corpus_file can either be a string or a PathPointer.

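  • startpos – The file position at which the view will start reading. This can be used to skip over preface sections. (A parameter of the base class StreamBackedCorpusView; it does not appear in this constructor's signature.)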

  • encoding – The unicode encoding that should be used to read the file’s contents. If no encoding is specified, then the file’s contents will be read as a non-unicode string (i.e., a str).

read_block(stream)

Reads one paragraph at a time.
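The effect of the tagged, group_by_sent, and group_by_para flags can be pictured with a standard-library sketch. shape_block is a hypothetical helper that works on already-parsed paragraphs; it illustrates the grouping logic under that assumption and is not the view's actual code. The pairing of flag combinations with reader methods in the comments is inferred from the method descriptions above.

```python
def shape_block(paras, tagged, group_by_sent, group_by_para):
    """Flatten or nest parsed paragraphs according to the three flags."""
    block = []
    for para in paras:
        shaped_para = []
        for sent in para:
            if not tagged:
                # Drop the tags and keep only the word strings.
                sent = [word for word, _tag in sent]
            if group_by_sent:
                shaped_para.append(sent)
            else:
                shaped_para.extend(sent)
        if group_by_para:
            block.append(shaped_para)
        else:
            block.extend(shaped_para)
    return block


# One paragraph holding two sentences (invented sample data).
paras = [[[("The", "DT"), ("cat", "NN")], [("It", "PRP")]]]

flat_words = shape_block(paras, False, False, False)  # like words()
tagged_sents = shape_block(paras, True, True, False)  # like tagged_sents()
plain_paras = shape_block(paras, False, True, True)   # like paras()

print(flat_words)    # ['The', 'cat', 'It']
print(tagged_sents)  # [[('The', 'DT'), ('cat', 'NN')], [('It', 'PRP')]]
print(plain_paras)   # [[['The', 'cat'], ['It']]]
```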

class nltk.corpus.reader.tagged.TimitTaggedCorpusReader

Bases: TaggedCorpusReader

A corpus reader for tagged sentences that are included in the TIMIT corpus.

__init__(*args, **kwargs)

Construct a new Tagged Corpus reader for a set of documents located at the given root directory. Example usage:

>>> root = '/...path to corpus.../'
>>> reader = TaggedCorpusReader(root, r'.*\.txt')

Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

paras()
Not supported for this corpus: calling this method raises NotImplementedError. Use sents() instead.

tagged_paras()
Not supported for this corpus: calling this method raises NotImplementedError. Use tagged_sents() instead.