nltk.corpus.reader.bracket_parse module

Corpus reader for corpora that consist of parenthesis-delineated parse trees.

class nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Bases: nltk.corpus.reader.api.SyntaxCorpusReader

Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the “combined” section of the Penn Treebank, e.g. “(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))”.
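As a rough usage sketch, the reader can be pointed at a directory that holds one bracketed tree. The temporary directory, the file name sample.mrg, and the .mrg fileid pattern are assumptions made for illustration; they are not part of any distributed corpus.

```python
import os
import tempfile

from nltk.corpus.reader.bracket_parse import BracketParseCorpusReader

# Hypothetical one-file corpus, written to a scratch directory.
root = tempfile.mkdtemp()
with open(os.path.join(root, "sample.mrg"), "w", encoding="utf8") as f:
    f.write("(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))\n")

# fileids is given as a regexp that matches the sample file.
reader = BracketParseCorpusReader(root, r".*\.mrg")

tree = reader.parsed_sents()[0]       # an nltk.tree.Tree rooted at S
words = list(reader.words())          # leaf tokens in sentence order
tagged = list(reader.tagged_words())  # (word, tag) pairs

print(tree.label())
print(words)
print(tagged)
```

The views returned by the reader are read lazily from disk, which is why the sketch wraps them in list() before printing.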

__init__(root, fileids, comment_char=None, detect_blocks='unindented_paren', encoding='utf8', tagset=None)
Parameters
  • root – The root directory for this corpus.

  • fileids – A list or regexp specifying the fileids in this corpus.

  • comment_char – The character which can appear at the start of a line to indicate that the rest of the line is a comment.

  • detect_blocks – The method that is used to find blocks in the corpus; can be ‘unindented_paren’ (every unindented parenthesis starts a new parse) or ‘sexpr’ (brackets are matched).

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
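The difference between the two detect_blocks strategies can be illustrated with a small standalone sketch. This is a simplified illustration written for this page, not NLTK's implementation: the function names split_unindented_paren and split_sexpr are hypothetical, and the sample text is invented.

```python
def split_unindented_paren(text):
    """Start a new block at every line that begins with '(' in column 0."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.startswith("(") and current:
            blocks.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def split_sexpr(text):
    """Close a block whenever the parenthesis depth returns to zero."""
    blocks, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                blocks.append(text[start:i + 1])
    return blocks


# The second tree continues on an unindented line, which the
# line-based strategy mistakes for the start of a new parse.
text = (
    "(S (NP (DT the) (NN dog))\n"
    "   (VP (VBD barked)))\n"
    "(S (NP (NN rain))\n"
    "(VP (VBD fell)))\n"
)

print(len(split_unindented_paren(text)))  # 3 blocks
print(len(split_sexpr(text)))             # 2 blocks
```

Under these assumptions, the line-based strategy is adequate when continuation lines are always indented, while bracket matching is the safer choice for files that do not follow that layout.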

class nltk.corpus.reader.bracket_parse.CategorizedBracketParseCorpusReader

Bases: nltk.corpus.reader.api.CategorizedCorpusReader, nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

A reader for parsed corpora whose documents are divided into categories based on their file identifiers.

Author: Nathan Schneider <nschneid@cs.cmu.edu>

__init__(*args, **kwargs)

Initialize the corpus reader. Categorization arguments (cat_pattern, cat_map, and cat_file) are passed to the CategorizedCorpusReader constructor. The remaining arguments are passed to the BracketParseCorpusReader constructor.
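A minimal sketch of how the two groups of arguments combine, assuming a scratch directory with two invented files whose names carry the category as a prefix. The file names, their contents, and the cat_pattern regexp are illustrative assumptions only.

```python
import os
import tempfile

from nltk.corpus.reader.bracket_parse import CategorizedBracketParseCorpusReader

# Hypothetical corpus: the category is the part of the file name before "_".
root = tempfile.mkdtemp()
samples = {
    "news_a.mrg": "(S (NP (NN stocks)) (VP (VBD fell)))\n",
    "fiction_a.mrg": "(S (NP (DT the) (NN dog)) (VP (VBD barked)))\n",
}
for name, text in samples.items():
    with open(os.path.join(root, name), "w", encoding="utf8") as f:
        f.write(text)

# root and the fileid regexp go to BracketParseCorpusReader;
# cat_pattern is consumed by CategorizedCorpusReader.
reader = CategorizedBracketParseCorpusReader(
    root, r".*\.mrg", cat_pattern=r"([a-z]+)_.*"
)

categories = reader.categories()
news_tagged = list(reader.tagged_words(categories="news"))
fiction_tree = reader.parsed_sents(categories="fiction")[0]

print(categories)
print(news_tagged)
print(fiction_tree.label())
```

The first group of cat_pattern is taken as the category name, so each file in this sketch belongs to exactly one category.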

tagged_words(fileids=None, categories=None, tagset=None)
tagged_sents(fileids=None, categories=None, tagset=None)
tagged_paras(fileids=None, categories=None, tagset=None)
parsed_words(fileids=None, categories=None)
parsed_sents(fileids=None, categories=None)
parsed_paras(fileids=None, categories=None)

class nltk.corpus.reader.bracket_parse.AlpinoCorpusReader

Bases: nltk.corpus.reader.bracket_parse.BracketParseCorpusReader

Reader for the Alpino Dutch Treebank. This corpus has an embedded lexical breakdown structure, which is read by _parse. Unfortunately, this puts punctuation and some other words out of sentence order in the XML element tree, which is a problem for the tag_ and word_ methods. The _tag and _word methods are therefore overridden to pass a new, non-default parameter, ‘ordered’, to the overridden _normalize function, so the _parse function can remain untouched.

__init__(root, encoding='ISO-8859-1', tagset=None)
Parameters
  • root – The root directory for this corpus.

  • tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.