nltk.parse package

Submodules

nltk.parse.api module

class nltk.parse.api.ParserI[source]

Bases: object

A processing class for deriving trees that represent possible structures for a sequence of tokens. These tree structures are known as “parses”. Typically, parsers are used to derive syntax trees for sentences. But parsers can also be used to derive other kinds of tree structure, such as morphological trees and discourse structures.

Subclasses must define:
  • at least one of: parse(), parse_sents().
Subclasses may define:
  • grammar()
grammar()[source]
Returns:The grammar used by this parser.
parse(sent, *args, **kwargs)[source]
Returns:An iterator that generates parse trees for the sentence.

When possible this list is sorted from most likely to least likely.

Parameters:sent (list(str)) – The sentence to be parsed
Return type:iter(Tree)
parse_all(sent, *args, **kwargs)[source]
Return type:list(Tree)
parse_one(sent, *args, **kwargs)[source]
Return type:Tree or None
parse_sents(sents, *args, **kwargs)[source]

Apply self.parse() to each element of sents. :rtype: iter(iter(Tree))

nltk.parse.bllip module

class nltk.parse.bllip.BllipParser(parser_model=None, reranker_features=None, reranker_weights=None, parser_options=None, reranker_options=None)[source]

Bases: nltk.parse.api.ParserI

Interface for parsing with BLLIP Parser. BllipParser objects can be constructed with the BllipParser.from_unified_model_dir class method or manually using the BllipParser constructor.

classmethod from_unified_model_dir(this_class, model_dir, parser_options=None, reranker_options=None)[source]

Create a BllipParser object from a unified parsing model directory. Unified parsing model directories are a standardized way of storing BLLIP parser and reranker models together on disk. See bllipparser.RerankingParser.get_unified_model_parameters() for more information about unified model directories.

Returns:A BllipParser object using the parser and reranker

models in the model directory.

Parameters:
  • model_dir (str) – Path to the unified model directory.
  • parser_options – optional dictionary of parser options, see

bllipparser.RerankingParser.RerankingParser.load_parser_options() for more information. :type parser_options: dict(str) :param reranker_options: optional dictionary of reranker options, see bllipparser.RerankingParser.RerankingParser.load_reranker_model() for more information. :type reranker_options: dict(str) :rtype: BllipParser

parse(sentence)[source]

Use BLLIP Parser to parse a sentence. Takes a sentence as a list of words; it will be automatically tagged with this BLLIP Parser instance’s tagger.

Returns:An iterator that generates parse trees for the sentence

from most likely to least likely.

Parameters:sentence (list(str)) – The sentence to be parsed
Return type:iter(Tree)
tagged_parse(word_and_tag_pairs)[source]

Use BLLIP to parse a sentence. Takes a sentence as a list of (word, tag) tuples; the sentence must have already been tokenized and tagged. BLLIP will attempt to use the tags provided but may use others if it can’t come up with a complete parse subject to those constraints. You may also specify a tag as None to leave a token’s tag unconstrained.

Returns:An iterator that generates parse trees for the sentence

from most likely to least likely.

Parameters:sentence (list(tuple(str, str))) – Input sentence to parse as (word, tag) pairs
Return type:iter(Tree)

nltk.parse.chart module

Data classes and parser implementations for “chart parsers”, which use dynamic programming to efficiently parse a text. A chart parser derives parse trees for a text by iteratively adding “edges” to a “chart.” Each edge represents a hypothesis about the tree structure for a subsequence of the text. The chart is a “blackboard” for composing and combining these hypotheses.

When a chart parser begins parsing a text, it creates a new (empty) chart, spanning the text. It then incrementally adds new edges to the chart. A set of “chart rules” specifies the conditions under which new edges should be added to the chart. Once the chart reaches a stage where none of the chart rules adds any new edges, parsing is complete.

Charts are encoded with the Chart class, and edges are encoded with the TreeEdge and LeafEdge classes. The chart parser module defines three chart parsers:

  • ChartParser is a simple and flexible chart parser. Given a set of chart rules, it will apply those rules to the chart until no more edges are added.
  • SteppingChartParser is a subclass of ChartParser that can be used to step through the parsing process.
class nltk.parse.chart.AbstractChartRule[source]

Bases: nltk.parse.chart.ChartRuleI

An abstract base class for chart rules. AbstractChartRule provides:

  • A default implementation for apply.
  • A default implementation for apply_everywhere, (Currently, this implementation assumes that ``NUM_EDGES``<=3.)
  • A default implementation for __str__, which returns a name based on the rule’s class name.
apply(chart, grammar, *edges)[source]
apply_everywhere(chart, grammar)[source]
unicode_repr

Return repr(self).

class nltk.parse.chart.BottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser using a bottom-up parsing strategy. See ChartParser for more information.

class nltk.parse.chart.BottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser using a bottom-up left-corner parsing strategy. This strategy is often more efficient than standard bottom-up. See ChartParser for more information.

class nltk.parse.chart.BottomUpPredictCombineRule[source]

Bases: nltk.parse.chart.BottomUpPredictRule

A rule licensing any edge corresponding to a production whose right-hand side begins with a complete edge’s left-hand side. In particular, this rule specifies that [A -> alpha \*] licenses the edge [B -> A \* beta] for each grammar production B -> A beta.

Note:This is like BottomUpPredictRule, but it also applies the FundamentalRule to the resulting edge.
NUM_EDGES = 1
apply(chart, grammar, edge)[source]
class nltk.parse.chart.BottomUpPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule licensing any edge corresponding to a production whose right-hand side begins with a complete edge’s left-hand side. In particular, this rule specifies that [A -> alpha \*] licenses the edge [B -> \* A beta] for each grammar production B -> A beta.

NUM_EDGES = 1
apply(chart, grammar, edge)[source]
class nltk.parse.chart.CachedTopDownPredictRule[source]

Bases: nltk.parse.chart.TopDownPredictRule

A cached version of TopDownPredictRule. After the first time this rule is applied to an edge with a given end and next, it will not generate any more edges for edges with that end and next.

If chart or grammar are changed, then the cache is flushed.

apply(chart, grammar, edge)[source]
class nltk.parse.chart.Chart(tokens)[source]

Bases: object

A blackboard for hypotheses about the syntactic constituents of a sentence. A chart contains a set of edges, and each edge encodes a single hypothesis about the structure of some portion of the sentence.

The select method can be used to select a specific collection of edges. For example chart.select(is_complete=True, start=0) yields all complete edges whose start indices are 0. To ensure the efficiency of these selection operations, Chart dynamically creates and maintains an index for each set of attributes that have been selected on.

In order to reconstruct the trees that are represented by an edge, the chart associates each edge with a set of child pointer lists. A child pointer list is a list of the edges that license an edge’s right-hand side.

Variables:
  • _tokens – The sentence that the chart covers.
  • _num_leaves – The number of tokens.
  • _edges – A list of the edges in the chart
  • _edge_to_cpls – A dictionary mapping each edge to a set of child pointer lists that are associated with that edge.
  • _indexes – A dictionary mapping tuples of edge attributes to indices, where each index maps the corresponding edge attribute values to lists of edges.
child_pointer_lists(edge)[source]

Return the set of child pointer lists for the given edge. Each child pointer list is a list of edges that have been used to form this edge.

Return type:list(list(EdgeI))
dot_digraph()[source]
edges()[source]

Return a list of all edges in this chart. New edges that are added to the chart after the call to edges() will not be contained in this list.

Return type:list(EdgeI)
See:iteredges, select
initialize()[source]

Clear the chart.

insert(edge, *child_pointer_lists)[source]

Add a new edge to the chart, and return True if this operation modified the chart. In particular, return true iff the chart did not already contain edge, or if it did not already associate child_pointer_lists with edge.

Parameters:
  • edge (EdgeI) – The new edge
  • child_pointer_lists (sequence of tuple(EdgeI)) – A sequence of lists of the edges that were used to form this edge. This list is used to reconstruct the trees (or partial trees) that are associated with edge.
Return type:

bool

insert_with_backpointer(new_edge, previous_edge, child_edge)[source]

Add a new edge to the chart, using a pointer to the previous edge.

iteredges()[source]

Return an iterator over the edges in this chart. It is not guaranteed that new edges which are added to the chart before the iterator is exhausted will also be generated.

Return type:iter(EdgeI)
See:edges, select
leaf(index)[source]

Return the leaf value of the word at the given index.

Return type:str
leaves()[source]

Return a list of the leaf values of each word in the chart’s sentence.

Return type:list(str)
num_edges()[source]

Return the number of edges contained in this chart.

Return type:int
num_leaves()[source]

Return the number of words in this chart’s sentence.

Return type:int
parses(root, tree_class=<class 'nltk.tree.Tree'>)[source]

Return an iterator of the complete tree structures that span the entire chart, and whose root node is root.

pretty_format(width=None)[source]

Return a pretty-printed string representation of this chart.

Parameters:width – The number of characters allotted to each index in the sentence.
Return type:str
pretty_format_edge(edge, width=None)[source]

Return a pretty-printed string representation of a given edge in this chart.

Return type:str
Parameters:width – The number of characters allotted to each index in the sentence.
pretty_format_leaves(width=None)[source]

Return a pretty-printed string representation of this chart’s leaves. This string can be used as a header for calls to pretty_format_edge.

select(**restrictions)[source]

Return an iterator over the edges in this chart. Any new edges that are added to the chart before the iterator is exahusted will also be generated. restrictions can be used to restrict the set of edges that will be generated.

Parameters:
  • span – Only generate edges e where e.span()==span
  • start – Only generate edges e where e.start()==start
  • end – Only generate edges e where e.end()==end
  • length – Only generate edges e where e.length()==length
  • lhs – Only generate edges e where e.lhs()==lhs
  • rhs – Only generate edges e where e.rhs()==rhs
  • nextsym – Only generate edges e where e.nextsym()==nextsym
  • dot – Only generate edges e where e.dot()==dot
  • is_complete – Only generate edges e where e.is_complete()==is_complete
  • is_incomplete – Only generate edges e where e.is_incomplete()==is_incomplete
Return type:

iter(EdgeI)

trees(edge, tree_class=<class 'nltk.tree.Tree'>, complete=False)[source]

Return an iterator of the tree structures that are associated with edge.

If edge is incomplete, then the unexpanded children will be encoded as childless subtrees, whose node value is the corresponding terminal or nonterminal.

Return type:list(Tree)
Note:If two trees share a common subtree, then the same Tree may be used to encode that subtree in both trees. If you need to eliminate this subtree sharing, then create a deep copy of each tree.
class nltk.parse.chart.ChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.chart.EmptyPredictRule object>, <nltk.parse.chart.BottomUpPredictCombineRule object>, <nltk.parse.chart.SingleEdgeFundamentalRule object>], trace=0, trace_chart_width=50, use_agenda=True, chart_class=<class 'nltk.parse.chart.Chart'>)[source]

Bases: nltk.parse.api.ParserI

A generic chart parser. A “strategy”, or list of ChartRuleI instances, is used to decide what edges to add to the chart. In particular, ChartParser uses the following algorithm to parse texts:

Until no new edges are added:
For each rule in strategy:
Apply rule to any applicable edges in the chart.
Return any complete parses in the chart
chart_parse(tokens, trace=None)[source]

Return the final parse Chart from which all possible parse trees can be extracted.

Parameters:tokens (list(str)) – The sentence to be parsed
Return type:Chart
grammar()[source]
parse(tokens, tree_class=<class 'nltk.tree.Tree'>)[source]
class nltk.parse.chart.ChartRuleI[source]

Bases: object

A rule that specifies what new edges are licensed by any given set of existing edges. Each chart rule expects a fixed number of edges, as indicated by the class variable NUM_EDGES. In particular:

  • A chart rule with NUM_EDGES=0 specifies what new edges are licensed, regardless of existing edges.
  • A chart rule with NUM_EDGES=1 specifies what new edges are licensed by a single existing edge.
  • A chart rule with NUM_EDGES=2 specifies what new edges are licensed by a pair of existing edges.
Variables:NUM_EDGES – The number of existing edges that this rule uses to license new edges. Typically, this number ranges from zero to two.
apply(chart, grammar, *edges)[source]

Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.

Parameters:edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type:iter(EdgeI)
apply_everywhere(chart, grammar)[source]

Return a generator that will add all edges licensed by this rule, given the edges that are currently in the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.

Return type:iter(EdgeI)
class nltk.parse.chart.EdgeI[source]

Bases: object

A hypothesis about the structure of part of a sentence. Each edge records the fact that a structure is (partially) consistent with the sentence. An edge contains:

  • A span, indicating what part of the sentence is consistent with the hypothesized structure.
  • A left-hand side, specifying what kind of structure is hypothesized.
  • A right-hand side, specifying the contents of the hypothesized structure.
  • A dot position, indicating how much of the hypothesized structure is consistent with the sentence.

Every edge is either complete or incomplete:

  • An edge is complete if its structure is fully consistent with the sentence.
  • An edge is incomplete if its structure is partially consistent with the sentence. For every incomplete edge, the span specifies a possible prefix for the edge’s structure.

There are two kinds of edge:

  • A TreeEdge records which trees have been found to be (partially) consistent with the text.
  • A LeafEdge records the tokens occurring in the text.

The EdgeI interface provides a common interface to both types of edge, allowing chart parsers to treat them in a uniform manner.

dot()[source]

Return this edge’s dot position, which indicates how much of the hypothesized structure is consistent with the sentence. In particular, self.rhs[:dot] is consistent with tokens[self.start():self.end()].

Return type:int
end()[source]

Return the end index of this edge’s span.

Return type:int
is_complete()[source]

Return True if this edge’s structure is fully consistent with the text.

Return type:bool
is_incomplete()[source]

Return True if this edge’s structure is partially consistent with the text.

Return type:bool
length()[source]

Return the length of this edge’s span.

Return type:int
lhs()[source]

Return this edge’s left-hand side, which specifies what kind of structure is hypothesized by this edge.

See:TreeEdge and LeafEdge for a description of the left-hand side values for each edge type.
nextsym()[source]

Return the element of this edge’s right-hand side that immediately follows its dot.

Return type:Nonterminal or terminal or None
rhs()[source]

Return this edge’s right-hand side, which specifies the content of the structure hypothesized by this edge.

See:TreeEdge and LeafEdge for a description of the right-hand side values for each edge type.
span()[source]

Return a tuple (s, e), where tokens[s:e] is the portion of the sentence that is consistent with this edge’s structure.

Return type:tuple(int, int)
start()[source]

Return the start index of this edge’s span.

Return type:int
class nltk.parse.chart.EmptyPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule that inserts all empty productions as passive edges, in every position in the chart.

NUM_EDGES = 0
apply(chart, grammar)[source]
class nltk.parse.chart.FilteredBottomUpPredictCombineRule[source]

Bases: nltk.parse.chart.BottomUpPredictCombineRule

apply(chart, grammar, edge)[source]
class nltk.parse.chart.FilteredSingleEdgeFundamentalRule[source]

Bases: nltk.parse.chart.SingleEdgeFundamentalRule

class nltk.parse.chart.FundamentalRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule that joins two adjacent edges to form a single combined edge. In particular, this rule specifies that any pair of edges

  • [A -> alpha \* B beta][i:j]
  • [B -> gamma \*][j:k]

licenses the edge:

  • [A -> alpha B * beta][i:j]
NUM_EDGES = 2
apply(chart, grammar, left_edge, right_edge)[source]
class nltk.parse.chart.LeafEdge(leaf, index)[source]

Bases: nltk.parse.chart.EdgeI

An edge that records the fact that a leaf value is consistent with a word in the sentence. A leaf edge consists of:

  • An index, indicating the position of the word.
  • A leaf, specifying the word’s content.

A leaf edge’s left-hand side is its leaf value, and its right hand side is (). Its span is [index, index+1], and its dot position is 0.

dot()[source]
end()[source]
is_complete()[source]
is_incomplete()[source]
length()[source]
lhs()[source]
nextsym()[source]
rhs()[source]
span()[source]
start()[source]
unicode_repr()
class nltk.parse.chart.LeafInitRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 0
apply(chart, grammar)[source]
class nltk.parse.chart.LeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

class nltk.parse.chart.SingleEdgeFundamentalRule[source]

Bases: nltk.parse.chart.FundamentalRule

A rule that joins a given edge with adjacent edges in the chart, to form combined edges. In particular, this rule specifies that either of the edges:

  • [A -> alpha \* B beta][i:j]
  • [B -> gamma \*][j:k]

licenses the edge:

  • [A -> alpha B * beta][i:j]

if the other edge is already in the chart.

Note:This is basically FundamentalRule, with one edge left unspecified.
NUM_EDGES = 1
apply(chart, grammar, edge)[source]
class nltk.parse.chart.SteppingChartParser(grammar, strategy=[], trace=0)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser that allows you to step through the parsing process, adding a single edge at a time. It also allows you to change the parser’s strategy or grammar midway through parsing a text.

The initialize method is used to start parsing a text. step adds a single edge to the chart. set_strategy changes the strategy used by the chart parser. parses returns the set of parses that has been found by the chart parser.

Variables:_restart – Records whether the parser’s strategy, grammar, or chart has been changed. If so, then step must restart the parsing algorithm.
chart()[source]

Return the chart that is used by this parser.

current_chartrule()[source]

Return the chart rule used to generate the most recent edge.

grammar()[source]

Return the grammar used by this parser.

initialize(tokens)[source]

Begin parsing the given tokens.

parse(tokens, tree_class=<class 'nltk.tree.Tree'>)[source]
parses(tree_class=<class 'nltk.tree.Tree'>)[source]

Return the parse trees currently contained in the chart.

set_chart(chart)[source]

Load a given chart into the chart parser.

set_grammar(grammar)[source]

Change the grammar used by the parser.

set_strategy(strategy)[source]

Change the strategy that the parser uses to decide which edges to add to the chart.

Parameters:strategy (list(ChartRuleI)) – A list of rules that should be used to decide what edges to add to the chart.
step()[source]

Return a generator that adds edges to the chart, one at a time. Each time the generator is resumed, it adds a single edge and yields that edge. If no more edges can be added, then it yields None.

If the parser’s strategy, grammar, or chart is changed, then the generator will continue adding edges using the new strategy, grammar, or chart.

Note that this generator never terminates, since the grammar or strategy might be changed to values that would add new edges. Instead, it yields None when no more edges can be added with the current strategy and grammar.

strategy()[source]

Return the strategy used by this parser.

class nltk.parse.chart.TopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser using a top-down parsing strategy. See ChartParser for more information.

class nltk.parse.chart.TopDownInitRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule licensing edges corresponding to the grammar productions for the grammar’s start symbol. In particular, this rule specifies that [S -> \* alpha][0:i] is licensed for each grammar production S -> alpha, where S is the grammar’s start symbol.

NUM_EDGES = 0
apply(chart, grammar)[source]
class nltk.parse.chart.TopDownPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule licensing edges corresponding to the grammar productions for the nonterminal following an incomplete edge’s dot. In particular, this rule specifies that [A -> alpha \* B beta][i:j] licenses the edge [B -> \* gamma][j:j] for each grammar production B -> gamma.

Note:This rule corresponds to the Predictor Rule in Earley parsing.
NUM_EDGES = 1
apply(chart, grammar, edge)[source]
class nltk.parse.chart.TreeEdge(span, lhs, rhs, dot=0)[source]

Bases: nltk.parse.chart.EdgeI

An edge that records the fact that a tree is (partially) consistent with the sentence. A tree edge consists of:

  • A span, indicating what part of the sentence is consistent with the hypothesized tree.
  • A left-hand side, specifying the hypothesized tree’s node value.
  • A right-hand side, specifying the hypothesized tree’s children. Each element of the right-hand side is either a terminal, specifying a token with that terminal as its leaf value; or a nonterminal, specifying a subtree with that nonterminal’s symbol as its node value.
  • A dot position, indicating which children are consistent with part of the sentence. In particular, if dot is the dot position, rhs is the right-hand size, (start,end) is the span, and sentence is the list of tokens in the sentence, then tokens[start:end] can be spanned by the children specified by rhs[:dot].

For more information about edges, see the EdgeI interface.

dot()[source]
end()[source]
static from_production(production, index)[source]

Return a new TreeEdge formed from the given production. The new edge’s left-hand side and right-hand side will be taken from production; its span will be (index,index); and its dot position will be 0.

Return type:TreeEdge
is_complete()[source]
is_incomplete()[source]
length()[source]
lhs()[source]
move_dot_forward(new_end)[source]

Return a new TreeEdge formed from this edge. The new edge’s dot position is increased by 1, and its end index will be replaced by new_end.

Parameters:new_end (int) – The new end index.
Return type:TreeEdge
nextsym()[source]
rhs()[source]
span()[source]
start()[source]
unicode_repr()
nltk.parse.chart.demo(choice=None, print_times=True, print_grammar=False, print_trees=True, trace=2, sent='I saw John with a dog with my cookie', numparses=5)[source]

A demonstration of the chart parsers.

nltk.parse.chart.demo_grammar()[source]

nltk.parse.corenlp module

class nltk.parse.corenlp.CoreNLPDependencyParser(url='http://localhost:9000', encoding='utf8')[source]

Bases: nltk.parse.corenlp.GenericCoreNLPParser

Dependency parser.

>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parse, = dep_parser.raw_parse(
...     'The quick brown fox jumps over the lazy dog.'
... )
>>> print(parse.to_conll(4))  
The     DT      4       det
quick   JJ      4       amod
brown   JJ      4       amod
fox     NN      5       nsubj
jumps   VBZ     0       ROOT
over    IN      9       case
the     DT      9       det
lazy    JJ      9       amod
dog     NN      5       nmod
.       .       5       punct
>>> print(parse.tree())  
(jumps (fox The quick brown) (dog over the lazy) .)
>>> for governor, dep, dependent in parse.triples():
...     print(governor, dep, dependent)  
    ('jumps', 'VBZ') nsubj ('fox', 'NN')
    ('fox', 'NN') det ('The', 'DT')
    ('fox', 'NN') amod ('quick', 'JJ')
    ('fox', 'NN') amod ('brown', 'JJ')
    ('jumps', 'VBZ') nmod ('dog', 'NN')
    ('dog', 'NN') case ('over', 'IN')
    ('dog', 'NN') det ('the', 'DT')
    ('dog', 'NN') amod ('lazy', 'JJ')
    ('jumps', 'VBZ') punct ('.', '.')
>>> (parse_fox, ), (parse_dog, ) = dep_parser.raw_parse_sents(
...     [
...         'The quick brown fox jumps over the lazy dog.',
...         'The quick grey wolf jumps over the lazy fox.',
...     ]
... )
>>> print(parse_fox.to_conll(4))  
The DT      4       det
quick       JJ      4       amod
brown       JJ      4       amod
fox NN      5       nsubj
jumps       VBZ     0       ROOT
over        IN      9       case
the DT      9       det
lazy        JJ      9       amod
dog NN      5       nmod
.   .       5       punct
>>> print(parse_dog.to_conll(4))  
The DT      4       det
quick       JJ      4       amod
grey        JJ      4       amod
wolf        NN      5       nsubj
jumps       VBZ     0       ROOT
over        IN      9       case
the DT      9       det
lazy        JJ      9       amod
fox NN      5       nmod
.   .       5       punct
>>> (parse_dog, ), (parse_friends, ) = dep_parser.parse_sents(
...     [
...         "I 'm a dog".split(),
...         "This is my friends ' cat ( the tabby )".split(),
...     ]
... )
>>> print(parse_dog.to_conll(4))  
I   PRP     4       nsubj
'm  VBP     4       cop
a   DT      4       det
dog NN      0       ROOT
>>> print(parse_friends.to_conll(4))  
This        DT      6       nsubj
is  VBZ     6       cop
my  PRP$    4       nmod:poss
friends     NNS     6       nmod:poss
'   POS     4       case
cat NN      0       ROOT
-LRB-       -LRB-   9       punct
the DT      9       det
tabby       NN      6       appos
-RRB-       -RRB-   9       punct
>>> parse_john, parse_mary, = dep_parser.parse_text(
...     'John loves Mary. Mary walks.'
... )
>>> print(parse_john.to_conll(4))  
John        NNP     2       nsubj
loves       VBZ     0       ROOT
Mary        NNP     2       dobj
.   .       2       punct
>>> print(parse_mary.to_conll(4))  
Mary        NNP     2       nsubj
walks       VBZ     0       ROOT
.   .       2       punct

Non-breaking space inside of a token.

>>> len(
...     next(
...         dep_parser.raw_parse(
...             'Anhalt said children typically treat a 20-ounce soda bottle as one '
...             'serving, while it actually contains 2 1/2 servings.'
...         )
...     ).nodes
... )
21

Phone numbers.

>>> len(
...     next(
...         dep_parser.raw_parse('This is not going to crash: 01 111 555.')
...     ).nodes
... )
10
>>> print(
...     next(
...         dep_parser.raw_parse('The underscore _ should not simply disappear.')
...     ).to_conll(4)
... )  
The         DT  3   det
underscore  VBP 3   amod
_           NN  7   nsubj
should      MD  7   aux
not         RB  7   neg
simply      RB  7   advmod
disappear   VB  0   ROOT
.           .   7   punct
>>> print(
...     '\n'.join(
...         next(
...             dep_parser.raw_parse(
...                 'for all of its insights into the dream world of teen life , and its electronic expression through '
...                 'cyber culture , the film gives no quarter to anyone seeking to pull a cohesive story out of its 2 '
...                 '1/2-hour running time .'
...             )
...         ).to_conll(4).split('\n')[-8:]
...     )
... )
its PRP$    40      nmod:poss
2 1/2       CD      40      nummod
-   :       40      punct
hour        NN      31      nmod
running     VBG     42      amod
time        NN      40      dep
.   .       24      punct
make_tree(result)[source]
parser_annotator = 'depparse'
class nltk.parse.corenlp.CoreNLPParser(url='http://localhost:9000', encoding='utf8')[source]

Bases: nltk.parse.corenlp.GenericCoreNLPParser

>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> next(
...     parser.raw_parse('The quick brown fox jumps over the lazy dog.')
... ).pretty_print()  
                     ROOT
                      |
                      S
       _______________|__________________________
      |                         VP               |
      |                _________|___             |
      |               |             PP           |
      |               |     ________|___         |
      NP              |    |            NP       |
  ____|__________     |    |     _______|____    |
 DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
 |    |     |    |    |    |    |       |    |   |
The quick brown fox jumps over the     lazy dog  .
>>> (parse_fox, ), (parse_wolf, ) = parser.raw_parse_sents(
...     [
...         'The quick brown fox jumps over the lazy dog.',
...         'The quick grey wolf jumps over the lazy fox.',
...     ]
... )
>>> parse_fox.pretty_print()  
                     ROOT
                      |
                      S
       _______________|__________________________
      |                         VP               |
      |                _________|___             |
      |               |             PP           |
      |               |     ________|___         |
      NP              |    |            NP       |
  ____|__________     |    |     _______|____    |
 DT   JJ    JJ   NN  VBZ   IN   DT      JJ   NN  .
 |    |     |    |    |    |    |       |    |   |
The quick brown fox jumps over the     lazy dog  .
>>> parse_wolf.pretty_print()  
                     ROOT
                      |
                      S
       _______________|__________________________
      |                         VP               |
      |                _________|___             |
      |               |             PP           |
      |               |     ________|___         |
      NP              |    |            NP       |
  ____|_________      |    |     _______|____    |
 DT   JJ   JJ   NN   VBZ   IN   DT      JJ   NN  .
 |    |    |    |     |    |    |       |    |   |
The quick grey wolf jumps over the     lazy fox  .
>>> (parse_dog, ), (parse_friends, ) = parser.parse_sents(
...     [
...         "I 'm a dog".split(),
...         "This is my friends ' cat ( the tabby )".split(),
...     ]
... )
>>> parse_dog.pretty_print()  
        ROOT
         |
         S
  _______|____
 |            VP
 |    ________|___
 NP  |            NP
 |   |         ___|___
PRP VBP       DT      NN
 |   |        |       |
 I   'm       a      dog
>>> parse_friends.pretty_print()  
     ROOT
      |
      S
  ____|___________
 |                VP
 |     ___________|_____________
 |    |                         NP
 |    |                  _______|_________
 |    |                 NP               PRN
 |    |            _____|_______      ____|______________
 NP   |           NP            |    |        NP         |
 |    |     ______|_________    |    |     ___|____      |
 DT  VBZ  PRP$   NNS       POS  NN -LRB-  DT       NN  -RRB-
 |    |    |      |         |   |    |    |        |     |
This  is   my  friends      '  cat -LRB- the     tabby -RRB-
>>> parse_john, parse_mary, = parser.parse_text(
...     'John loves Mary. Mary walks.'
... )
>>> parse_john.pretty_print()  
      ROOT
       |
       S
  _____|_____________
 |          VP       |
 |      ____|___     |
 NP    |        NP   |
 |     |        |    |
NNP   VBZ      NNP   .
 |     |        |    |
John loves     Mary  .
>>> parse_mary.pretty_print()  
      ROOT
       |
       S
  _____|____
 NP    VP   |
 |     |    |
NNP   VBZ   .
 |     |    |
Mary walks  .
>>> next(
...     parser.raw_parse(
...         'NASIRIYA, Iraq—Iraqi doctors who treated former prisoner of war '
...         'Jessica Lynch have angrily dismissed claims made in her biography '
...         'that she was raped by her Iraqi captors.'
...     )
... ).height()
20
>>> next(
...     parser.raw_parse(
...         "The broader Standard & Poor's 500 Index <.SPX> was 0.46 points lower, or "
...         '0.05 percent, at 997.02.'
...     )
... ).height()
9
make_tree(result)[source]
parser_annotator = 'parse'
class nltk.parse.corenlp.CoreNLPServer(path_to_jar=None, path_to_models_jar=None, verbose=False, java_options=None, corenlp_options=None, port=None)[source]

Bases: object

start()[source]
stop()[source]
exception nltk.parse.corenlp.CoreNLPServerError[source]

Bases: OSError

Exceptions associated with the Core NLP server.

class nltk.parse.corenlp.GenericCoreNLPParser(url='http://localhost:9000', encoding='utf8')[source]

Bases: nltk.parse.api.ParserI, nltk.tokenize.api.TokenizerI

Interface to the CoreNLP Parser.

api_call(data, properties=None)[source]
parse_sents(sentences, *args, **kwargs)[source]

Parse multiple sentences.

Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this CoreNLPParser instance’s tagger.

If a whitespace exists inside a token, then the token will be treated as several tokens.

Parameters:sentences (list(list(str))) – Input sentences to parse
Return type:iter(iter(Tree))
parse_text(text, *args, **kwargs)[source]

Parse a piece of text.

The text might contain several sentences which will be split by CoreNLP.

Parameters:text (str) – text to be split.
Returns:an iterable of syntactic structures. # TODO: should it be an iterable of iterables?
raw_parse(sentence, properties=None, *args, **kwargs)[source]

Parse a sentence.

Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the CoreNLP Parser.

Parameters:sentence (str) – Input sentence to parse
Return type:iter(Tree)
raw_parse_sents(sentences, verbose=False, properties=None, *args, **kwargs)[source]

Parse multiple sentences.

Takes multiple sentences as a list of strings. Each sentence will be automatically tokenized and tagged.

Parameters:sentences (list(str)) – Input sentences to parse.
Return type:iter(iter(Tree))
tokenize(text, properties=None)[source]

Tokenize a string of text.

>>> parser = CoreNLPParser(url='http://localhost:9000')
>>> text = 'Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'
>>> list(parser.tokenize(text))
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> list(
...     parser.tokenize(
...         'The colour of the wall is blue.',
...             properties={'tokenize.options': 'americanize=true'},
...     )
... )
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
nltk.parse.corenlp.setup_module(module)[source]
nltk.parse.corenlp.teardown_module(module)[source]
nltk.parse.corenlp.transform(sentence)[source]
nltk.parse.corenlp.try_port(port=0)[source]

nltk.parse.dependencygraph module

Tools for reading and writing dependency trees. The input is assumed to be in Malt-TAB format (http://stp.lingfil.uu.se/~nivre/research/MaltXML.html).

class nltk.parse.dependencygraph.DependencyGraph(tree_str=None, cell_extractor=None, zero_based=False, cell_separator=None, top_relation_label='ROOT')[source]

Bases: object

A container for the nodes and labelled edges of a dependency structure.

add_arc(head_address, mod_address)[source]

Adds an arc from the node specified by head_address to the node specified by the mod address.

add_node(node)[source]
connect_graph()[source]

Fully connects all non-root nodes. All nodes are set to be dependents of the root node.

contains_address(node_address)[source]

Returns true if the graph contains a node with the given node address, false otherwise.

contains_cycle()[source]

Check whether there are cycles.

>>> dg = DependencyGraph(treebank_data)
>>> dg.contains_cycle()
False
>>> cyclic_dg = DependencyGraph()
>>> top = {'word': None, 'deps': [1], 'rel': 'TOP', 'address': 0}
>>> child1 = {'word': None, 'deps': [2], 'rel': 'NTOP', 'address': 1}
>>> child2 = {'word': None, 'deps': [4], 'rel': 'NTOP', 'address': 2}
>>> child3 = {'word': None, 'deps': [1], 'rel': 'NTOP', 'address': 3}
>>> child4 = {'word': None, 'deps': [3], 'rel': 'NTOP', 'address': 4}
>>> cyclic_dg.nodes = {
...     0: top,
...     1: child1,
...     2: child2,
...     3: child3,
...     4: child4,
... }
>>> cyclic_dg.root = top
>>> cyclic_dg.contains_cycle()
[3, 1, 2, 4]
get_by_address(node_address)[source]

Return the node with the given address.

get_cycle_path(curr_node, goal_node_index)[source]
left_children(node_index)[source]

Returns the number of left children under the node specified by the given address.

static load(filename, zero_based=False, cell_separator=None, top_relation_label='ROOT')[source]
Parameters:
  • filename – a name of a file in Malt-TAB format
  • zero_based – nodes in the input file are numbered starting from 0

rather than 1 (as produced by, e.g., zpar) :param str cell_separator: the cell separator. If not provided, cells are split by whitespace. :param str top_relation_label: the label by which the top relation is identified, for examlple, ROOT, null or TOP.

Returns:a list of DependencyGraphs
nx_graph()[source]

Convert the data in a nodelist into a networkx labeled directed graph.

redirect_arcs(originals, redirect)[source]

Redirects arcs to any of the nodes in the originals list to the redirect node address.

remove_by_address(address)[source]

Removes the node with the given address. References to this node in others will still exist.

right_children(node_index)[source]

Returns the number of right children under the node specified by the given address.

to_conll(style)[source]

The dependency graph in CoNLL format.

Parameters:style (int) – the style to use for the format (3, 4, 10 columns)
Return type:str
to_dot()[source]

Return a dot representation suitable for using with Graphviz.

>>> dg = DependencyGraph(
...     'John N 2\n'
...     'loves V 0\n'
...     'Mary N 2'
... )
>>> print(dg.to_dot())
digraph G{
edge [dir=forward]
node [shape=plaintext]

0 [label="0 (None)"]
0 -> 2 [label="ROOT"]
1 [label="1 (John)"]
2 [label="2 (loves)"]
2 -> 1 [label=""]
2 -> 3 [label=""]
3 [label="3 (Mary)"]
}
tree()[source]

Starting with the root node, build a dependency tree using the NLTK Tree constructor. Dependency labels are omitted.

triples(node=None)[source]

Extract dependency triples of the form: ((head word, head tag), rel, (dep word, dep tag))

unicode_repr()
exception nltk.parse.dependencygraph.DependencyGraphError[source]

Bases: Exception

Dependency graph exception.

nltk.parse.dependencygraph.conll_demo()[source]

A demonstration of how to read a string representation of a CoNLL format dependency tree.

nltk.parse.dependencygraph.conll_file_demo()[source]
nltk.parse.dependencygraph.cycle_finding_demo()[source]
nltk.parse.dependencygraph.demo()[source]
nltk.parse.dependencygraph.malt_demo(nx=False)[source]

A demonstration of the result of reading a dependency version of the first sentence of the Penn Treebank.

nltk.parse.earleychart module

Data classes and parser implementations for incremental chart parsers, which use dynamic programming to efficiently parse a text. A “chart parser” derives parse trees for a text by iteratively adding “edges” to a “chart”. Each “edge” represents a hypothesis about the tree structure for a subsequence of the text. The “chart” is a “blackboard” for composing and combining these hypotheses.

A parser is “incremental”, if it guarantees that for all i, j where i < j, all edges ending at i are built before any edges ending at j. This is appealing for, say, speech recognizer hypothesis filtering.

The main parser class is EarleyChartParser, which is a top-down algorithm, originally formulated by Jay Earley (1970).

class nltk.parse.earleychart.CompleteFundamentalRule[source]

Bases: nltk.parse.chart.SingleEdgeFundamentalRule

class nltk.parse.earleychart.CompleterRule[source]

Bases: nltk.parse.earleychart.CompleteFundamentalRule

apply(chart, grammar, edge)[source]
class nltk.parse.earleychart.EarleyChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.FeatureCompleteFundamentalRule[source]

Bases: nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule

class nltk.parse.earleychart.FeatureCompleterRule[source]

Bases: nltk.parse.earleychart.CompleterRule

class nltk.parse.earleychart.FeatureEarleyChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeatureIncrementalBottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeatureIncrementalBottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeatureIncrementalChart(tokens)[source]

Bases: nltk.parse.earleychart.IncrementalChart, nltk.parse.featurechart.FeatureChart

select(end, **restrictions)[source]
class nltk.parse.earleychart.FeatureIncrementalChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.featurechart.FeatureEmptyPredictRule object>, <nltk.parse.featurechart.FeatureBottomUpPredictCombineRule object>, <nltk.parse.earleychart.FeatureCompleteFundamentalRule object>], trace_chart_width=20, chart_class=<class 'nltk.parse.earleychart.FeatureIncrementalChart'>, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser, nltk.parse.featurechart.FeatureChartParser

class nltk.parse.earleychart.FeatureIncrementalTopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeaturePredictorRule[source]

Bases: nltk.parse.featurechart.FeatureTopDownPredictRule

class nltk.parse.earleychart.FeatureScannerRule[source]

Bases: nltk.parse.earleychart.ScannerRule

class nltk.parse.earleychart.FilteredCompleteFundamentalRule[source]

Bases: nltk.parse.chart.FilteredSingleEdgeFundamentalRule

apply(chart, grammar, edge)[source]
class nltk.parse.earleychart.IncrementalBottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.IncrementalBottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.IncrementalChart(tokens)[source]

Bases: nltk.parse.chart.Chart

edges()[source]
initialize()[source]
iteredges()[source]
select(end, **restrictions)[source]
class nltk.parse.earleychart.IncrementalChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.chart.EmptyPredictRule object>, <nltk.parse.chart.BottomUpPredictCombineRule object>, <nltk.parse.earleychart.CompleteFundamentalRule object>], trace=0, trace_chart_width=50, chart_class=<class 'nltk.parse.earleychart.IncrementalChart'>)[source]

Bases: nltk.parse.chart.ChartParser

An incremental chart parser implementing Jay Earley’s parsing algorithm:

For each index end in [0, 1, ..., N]:
For each edge such that edge.end = end:
If edge is incomplete and edge.next is not a part of speech:
Apply PredictorRule to edge
If edge is incomplete and edge.next is a part of speech:
Apply ScannerRule to edge
If edge is complete:
Apply CompleterRule to edge
Return any complete parses in the chart
chart_parse(tokens, trace=None)[source]
class nltk.parse.earleychart.IncrementalLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.IncrementalTopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.PredictorRule[source]

Bases: nltk.parse.chart.CachedTopDownPredictRule

class nltk.parse.earleychart.ScannerRule[source]

Bases: nltk.parse.earleychart.CompleteFundamentalRule

apply(chart, grammar, edge)[source]
nltk.parse.earleychart.demo(print_times=True, print_grammar=False, print_trees=True, trace=2, sent='I saw John with a dog with my cookie', numparses=5)[source]

A demonstration of the Earley parsers.

nltk.parse.evaluate module

class nltk.parse.evaluate.DependencyEvaluator(parsed_sents, gold_sents)[source]

Bases: object

Class for measuring labelled and unlabelled attachment score for dependency parsing. Note that the evaluation ignores punctuation.

>>> from nltk.parse import DependencyGraph, DependencyEvaluator
>>> gold_sent = DependencyGraph("""
... Pierre  NNP     2       NMOD
... Vinken  NNP     8       SUB
... ,       ,       2       P
... 61      CD      5       NMOD
... years   NNS     6       AMOD
... old     JJ      2       NMOD
... ,       ,       2       P
... will    MD      0       ROOT
... join    VB      8       VC
... the     DT      11      NMOD
... board   NN      9       OBJ
... as      IN      9       VMOD
... a       DT      15      NMOD
... nonexecutive    JJ      15      NMOD
... director        NN      12      PMOD
... Nov.    NNP     9       VMOD
... 29      CD      16      NMOD
... .       .       9       VMOD
... """)
>>> parsed_sent = DependencyGraph("""
... Pierre  NNP     8       NMOD
... Vinken  NNP     1       SUB
... ,       ,       3       P
... 61      CD      6       NMOD
... years   NNS     6       AMOD
... old     JJ      2       NMOD
... ,       ,       3       AMOD
... will    MD      0       ROOT
... join    VB      8       VC
... the     DT      11      AMOD
... board   NN      9       OBJECT
... as      IN      9       NMOD
... a       DT      15      NMOD
... nonexecutive    JJ      15      NMOD
... director        NN      12      PMOD
... Nov.    NNP     9       VMOD
... 29      CD      16      NMOD
... .       .       9       VMOD
... """)
>>> de = DependencyEvaluator([parsed_sent],[gold_sent])
>>> las, uas = de.eval()
>>> las
0.8...
>>> abs(uas - 0.6) < 0.00001
True
eval()[source]

Return the Labeled Attachment Score (LAS) and Unlabeled Attachment Score (UAS)

:return : tuple(float,float)

nltk.parse.featurechart module

Extension of chart parsing implementation to handle grammars with feature structures as nodes.

class nltk.parse.featurechart.FeatureBottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.featurechart.FeatureChartParser

class nltk.parse.featurechart.FeatureBottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.featurechart.FeatureChartParser

class nltk.parse.featurechart.FeatureBottomUpPredictCombineRule[source]

Bases: nltk.parse.chart.BottomUpPredictCombineRule

apply(chart, grammar, edge)[source]
class nltk.parse.featurechart.FeatureBottomUpPredictRule[source]

Bases: nltk.parse.chart.BottomUpPredictRule

apply(chart, grammar, edge)[source]
class nltk.parse.featurechart.FeatureChart(tokens)[source]

Bases: nltk.parse.chart.Chart

A Chart for feature grammars. :see: Chart for more information.

parses(start, tree_class=<class 'nltk.tree.Tree'>)[source]
select(**restrictions)[source]

Returns an iterator over the edges in this chart. See Chart.select for more information about the restrictions on the edges.

class nltk.parse.featurechart.FeatureChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object>, <nltk.parse.featurechart.FeatureEmptyPredictRule object>, <nltk.parse.featurechart.FeatureBottomUpPredictCombineRule object>, <nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule object>], trace_chart_width=20, chart_class=<class 'nltk.parse.featurechart.FeatureChart'>, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

class nltk.parse.featurechart.FeatureEmptyPredictRule[source]

Bases: nltk.parse.chart.EmptyPredictRule

apply(chart, grammar)[source]
class nltk.parse.featurechart.FeatureFundamentalRule[source]

Bases: nltk.parse.chart.FundamentalRule

A specialized version of the fundamental rule that operates on nonterminals whose symbols are FeatStructNonterminal``s.  Rather tha simply comparing the nonterminals for equality, they are unified.  Variable bindings from these unifications are collected and stored in the chart using a ``FeatureTreeEdge. When a complete edge is generated, these bindings are applied to all nonterminals in the edge.

The fundamental rule states that:

  • [A -> alpha \* B1 beta][i:j]
  • [B2 -> gamma \*][j:k]

licenses the edge:

  • [A -> alpha B3 \* beta][i:j]

assuming that B1 and B2 can be unified to generate B3.

apply(chart, grammar, left_edge, right_edge)[source]
class nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule[source]

Bases: nltk.parse.chart.SingleEdgeFundamentalRule

A specialized version of the completer / single edge fundamental rule that operates on nonterminals whose symbols are ``FeatStructNonterminal``s. Rather than simply comparing the nonterminals for equality, they are unified.

class nltk.parse.featurechart.FeatureTopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.featurechart.FeatureChartParser

class nltk.parse.featurechart.FeatureTopDownInitRule[source]

Bases: nltk.parse.chart.TopDownInitRule

apply(chart, grammar)[source]
class nltk.parse.featurechart.FeatureTopDownPredictRule[source]

Bases: nltk.parse.chart.CachedTopDownPredictRule

A specialized version of the (cached) top down predict rule that operates on nonterminals whose symbols are ``FeatStructNonterminal``s. Rather than simply comparing the nonterminals for equality, they are unified.

The top down expand rule states that:

  • [A -> alpha \* B1 beta][i:j]

licenses the edge:

  • [B2 -> \* gamma][j:j]

for each grammar production B2 -> gamma, assuming that B1 and B2 can be unified.

apply(chart, grammar, edge)[source]
class nltk.parse.featurechart.FeatureTreeEdge(span, lhs, rhs, dot=0, bindings=None)[source]

Bases: nltk.parse.chart.TreeEdge

A specialized tree edge that allows shared variable bindings between nonterminals on the left-hand side and right-hand side.

Each FeatureTreeEdge contains a set of bindings, i.e., a dictionary mapping from variables to values. If the edge is not complete, then these bindings are simply stored. However, if the edge is complete, then the constructor applies these bindings to every nonterminal in the edge whose symbol implements the interface SubstituteBindingsI.

bindings()[source]

Return a copy of this edge’s bindings dictionary.

static from_production(production, index)[source]
Returns:A new TreeEdge formed from the given production. The new edge’s left-hand side and right-hand side will be taken from production; its span will be (index,index); and its dot position will be 0.
Return type:TreeEdge
move_dot_forward(new_end, bindings=None)[source]
Returns:

A new FeatureTreeEdge formed from this edge. The new edge’s dot position is increased by 1, and its end index will be replaced by new_end.

Return type:

FeatureTreeEdge

Parameters:
  • new_end (int) – The new end index.
  • bindings (dict) – Bindings for the new edge.
next_with_bindings()[source]
unicode_repr()
variables()[source]
Returns:The set of variables used by this edge.
Return type:set(Variable)
class nltk.parse.featurechart.InstantiateVarsChart(tokens)[source]

Bases: nltk.parse.featurechart.FeatureChart

A specialized chart that ‘instantiates’ variables whose names start with ‘@’, by replacing them with unique new variables. In particular, whenever a complete edge is added to the chart, any variables in the edge’s lhs whose names start with ‘@’ will be replaced by unique new ``Variable``s.

initialize()[source]
insert(edge, child_pointer_list)[source]
inst_vars(edge)[source]
instantiate_edge(edge)[source]

If the edge is a FeatureTreeEdge, and it is complete, then instantiate all variables whose names start with ‘@’, by replacing them with unique new variables.

Note that instantiation is done in-place, since the parsing algorithms might already hold a reference to the edge for future use.

nltk.parse.featurechart.demo(print_times=True, print_grammar=True, print_trees=True, print_sentence=True, trace=1, parser=<class 'nltk.parse.featurechart.FeatureChartParser'>, sent='I saw John with a dog with my cookie')[source]
nltk.parse.featurechart.demo_grammar()[source]
nltk.parse.featurechart.run_profile()[source]

nltk.parse.generate module

nltk.parse.generate.demo(N=23)[source]
nltk.parse.generate.generate(grammar, start=None, depth=None, n=None)[source]

Generates an iterator of all sentences from a CFG.

Parameters:
  • grammar – The Grammar used to generate sentences.
  • start – The Nonterminal from which to start generate sentences.
  • depth – The maximal depth of the generated tree.
  • n – The maximum number of sentences to return.
Returns:

An iterator of lists of terminal tokens.

nltk.parse.malt module

class nltk.parse.malt.MaltParser(parser_dirname, model_filename=None, tagger=None, additional_java_args=None)[source]

Bases: nltk.parse.api.ParserI

A class for dependency parsing with MaltParser. The input is the paths to: - a maltparser directory - (optionally) the path to a pre-trained MaltParser .mco model file - (optionally) the tagger to use for POS tagging before parsing - (optionally) additional Java arguments

Example:
>>> from nltk.parse import malt
>>> # With MALT_PARSER and MALT_MODEL environment set.
>>> mp = malt.MaltParser('maltparser-1.7.2', 'engmalt.linear-1.7.mco') 
>>> mp.parse_one('I shot an elephant in my pajamas .'.split()).tree() 
(shot I (elephant an) (in (pajamas my)) .)
>>> # Without MALT_PARSER and MALT_MODEL environment.
>>> mp = malt.MaltParser('/home/user/maltparser-1.7.2/', '/home/user/engmalt.linear-1.7.mco') 
>>> mp.parse_one('I shot an elephant in my pajamas .'.split()).tree() 
(shot I (elephant an) (in (pajamas my)) .)
generate_malt_command(inputfilename, outputfilename=None, mode=None)[source]

This function generates the maltparser command use at the terminal.

Parameters:
  • inputfilename (str) – path to the input file
  • outputfilename (str) – path to the output file
parse_sents(sentences, verbose=False, top_relation_label='null')[source]

Use MaltParser to parse multiple sentences. Takes a list of sentences, where each sentence is a list of words. Each sentence will be automatically tagged with this MaltParser instance’s tagger.

Parameters:sentences – Input sentences to parse
Returns:iter(DependencyGraph)
parse_tagged_sents(sentences, verbose=False, top_relation_label='null')[source]

Use MaltParser to parse multiple POS tagged sentences. Takes multiple sentences where each sentence is a list of (word, tag) tuples. The sentences must have already been tokenized and tagged.

Parameters:sentences – Input sentences to parse
Returns:iter(iter(DependencyGraph)) the dependency graph

representation of each sentence

train(depgraphs, verbose=False)[source]

Train MaltParser from a list of DependencyGraph objects

Parameters:depgraphs (DependencyGraph) – list of DependencyGraph objects for training input data
train_from_file(conll_file, verbose=False)[source]

Train MaltParser from a file :param conll_file: str for the filename of the training input data :type conll_file: str

nltk.parse.malt.find_malt_model(model_filename)[source]

A module to find pre-trained MaltParser model.

nltk.parse.malt.find_maltparser(parser_dirname)[source]

A module to find MaltParser .jar file and its dependencies.

nltk.parse.malt.malt_regex_tagger()[source]

nltk.parse.nonprojectivedependencyparser module

class nltk.parse.nonprojectivedependencyparser.DemoScorer[source]

Bases: nltk.parse.nonprojectivedependencyparser.DependencyScorerI

score(graph)[source]
train(graphs)[source]
class nltk.parse.nonprojectivedependencyparser.DependencyScorerI[source]

Bases: object

A scorer for calculated the weights on the edges of a weighted dependency graph. This is used by a ProbabilisticNonprojectiveParser to initialize the edge weights of a DependencyGraph. While typically this would be done by training a binary classifier, any class that can return a multidimensional list representation of the edge weights can implement this interface. As such, it has no necessary fields.

score(graph)[source]
Parameters:graph (DependencyGraph) – A dependency graph whose set of edges need to be

scored. :rtype: A three-dimensional list of numbers. :return: The score is returned in a multidimensional(3) list, such that the outer-dimension refers to the head, and the inner-dimension refers to the dependencies. For instance, scores[0][1] would reference the list of scores corresponding to arcs from node 0 to node 1. The node’s ‘address’ field can be used to determine its number identification.

For further illustration, a score list corresponding to Fig.2 of Keith Hall’s ‘K-best Spanning Tree Parsing’ paper:

scores = [[[], [5], [1], [1]],
[[], [], [11], [4]], [[], [10], [], [5]], [[], [8], [8], []]]

When used in conjunction with a MaxEntClassifier, each score would correspond to the confidence of a particular edge being classified with the positive training examples.

train(graphs)[source]
Parameters:graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.

Typically the edges present in the graphs can be used as positive training examples, and the edges not present as negative examples.

class nltk.parse.nonprojectivedependencyparser.NaiveBayesDependencyScorer[source]

Bases: nltk.parse.nonprojectivedependencyparser.DependencyScorerI

A dependency scorer built around a MaxEnt classifier. In this particular class that classifier is a NaiveBayesClassifier. It uses head-word, head-tag, child-word, and child-tag features for classification.

>>> from nltk.parse.dependencygraph import DependencyGraph, conll_data2
>>> graphs = [DependencyGraph(entry) for entry in conll_data2.split('\n\n') if entry]
>>> npp = ProbabilisticNonprojectiveParser()
>>> npp.train(graphs, NaiveBayesDependencyScorer())
>>> parses = npp.parse(['Cathy', 'zag', 'hen', 'zwaaien', '.'], ['N', 'V', 'Pron', 'Adj', 'N', 'Punc'])
>>> len(list(parses))
1
score(graph)[source]

Converts the graph into a feature-based representation of each edge, and then assigns a score to each based on the confidence of the classifier in assigning it to the positive label. Scores are returned in a multidimensional list.

Parameters:graph (DependencyGraph) – A dependency graph to score.
Return type:3 dimensional list
Returns:Edge scores for the graph parameter.
train(graphs)[source]

Trains a NaiveBayesClassifier using the edges present in graphs list as positive examples, the edges not present as negative examples. Uses a feature vector of head-word, head-tag, child-word, and child-tag.

Parameters:graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.
class nltk.parse.nonprojectivedependencyparser.NonprojectiveDependencyParser(dependency_grammar)[source]

Bases: object

A non-projective, rule-based, dependency parser. This parser will return the set of all possible non-projective parses based on the word-to-word relations defined in the parser’s dependency grammar, and will allow the branches of the parse tree to cross in order to capture a variety of linguistic phenomena that a projective parser will not.

parse(tokens)[source]

Parses the input tokens with respect to the parser’s grammar. Parsing is accomplished by representing the search-space of possible parses as a fully-connected directed graph. Arcs that would lead to ungrammatical parses are removed and a lattice is constructed of length n, where n is the number of input tokens, to represent all possible grammatical traversals. All possible paths through the lattice are then enumerated to produce the set of non-projective parses.

param tokens: A list of tokens to parse. type tokens: list(str) return: An iterator of non-projective parses. rtype: iter(DependencyGraph)

class nltk.parse.nonprojectivedependencyparser.ProbabilisticNonprojectiveParser[source]

Bases: object

A probabilistic non-projective dependency parser.

Nonprojective dependencies allows for “crossing branches” in the parse tree which is necessary for representing particular linguistic phenomena, or even typical parses in some languages. This parser follows the MST parsing algorithm, outlined in McDonald(2005), which likens the search for the best non-projective parse to finding the maximum spanning tree in a weighted directed graph.

>>> class Scorer(DependencyScorerI):
...     def train(self, graphs):
...         pass
...
...     def score(self, graph):
...         return [
...             [[], [5],  [1],  [1]],
...             [[], [],   [11], [4]],
...             [[], [10], [],   [5]],
...             [[], [8],  [8],  []],
...         ]
>>> npp = ProbabilisticNonprojectiveParser()
>>> npp.train([], Scorer())
>>> parses = npp.parse(['v1', 'v2', 'v3'], [None, None, None])
>>> len(list(parses))
1
>>> from nltk.grammar import DependencyGrammar
>>> grammar = DependencyGrammar.fromstring('''
... 'taught' -> 'play' | 'man'
... 'man' -> 'the' | 'in'
... 'in' -> 'corner'
... 'corner' -> 'the'
... 'play' -> 'golf' | 'dachshund' | 'to'
... 'dachshund' -> 'his'
... ''')
>>> ndp = NonprojectiveDependencyParser(grammar)
>>> parses = ndp.parse(['the', 'man', 'in', 'the', 'corner', 'taught', 'his', 'dachshund', 'to', 'play', 'golf'])
>>> len(list(parses))
4
best_incoming_arc(node_index)[source]

Returns the source of the best incoming arc to the node with address: node_index

Parameters:node_index (integer.) – The address of the ‘destination’ node,

the node that is arced to.

collapse_nodes(new_node, cycle_path, g_graph, b_graph, c_graph)[source]

Takes a list of nodes that have been identified to belong to a cycle, and collapses them into on larger node. The arcs of all nodes in the graph must be updated to account for this.

Parameters:
  • new_node (Node.) – A Node (Dictionary) to collapse the cycle nodes into.
  • cycle_path (A list of integers.) – A list of node addresses, each of which is in the cycle.
  • b_graph, c_graph (g_graph,) – Graphs which need to be updated.
compute_max_subtract_score(column_index, cycle_indexes)[source]

When updating scores the score of the highest-weighted incoming arc is subtracted upon collapse. This returns the correct amount to subtract from that edge.

Parameters:column_index (integer.) – A index representing the column of incoming arcs

to a particular node being updated :type cycle_indexes: A list of integers. :param cycle_indexes: Only arcs from cycle nodes are considered. This is a list of such nodes addresses.

compute_original_indexes(new_indexes)[source]

As nodes are collapsed into others, they are replaced by the new node in the graph, but it’s still necessary to keep track of what these original nodes were. This takes a list of node addresses and replaces any collapsed node addresses with their original addresses.

Parameters:new_indexes (A list of integers.) – A list of node addresses to check for

subsumed nodes.

initialize_edge_scores(graph)[source]

Assigns a score to every edge in the DependencyGraph graph. These scores are generated via the parser’s scorer which was assigned during the training process.

Parameters:graph (DependencyGraph) – A dependency graph to assign scores to.
original_best_arc(node_index)[source]
parse(tokens, tags)[source]

Parses a list of tokens in accordance to the MST parsing algorithm for non-projective dependency parses. Assumes that the tokens to be parsed have already been tagged and those tags are provided. Various scoring methods can be used by implementing the DependencyScorerI interface and passing it to the training algorithm.

Parameters:
  • tokens (list(str)) – A list of words or punctuation to be parsed.
  • tags (list(str)) – A list of tags corresponding by index to the words in the tokens list.
Returns:

An iterator of non-projective parses.

Return type:

iter(DependencyGraph)

train(graphs, dependency_scorer)[source]

Trains a DependencyScorerI from a set of DependencyGraph objects, and establishes this as the parser’s scorer. This is used to initialize the scores on a DependencyGraph during the parsing procedure.

Parameters:
  • graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.
  • dependency_scorer (DependencyScorerI) – A scorer which implements the DependencyScorerI interface.
update_edge_scores(new_node, cycle_path)[source]

Updates the edge scores to reflect a collapse operation into new_node.

Parameters:
  • new_node (A Node.) – The node which cycle nodes are collapsed into.
  • cycle_path (A list of integers.) – A list of node addresses that belong to the cycle.
nltk.parse.nonprojectivedependencyparser.demo()[source]
nltk.parse.nonprojectivedependencyparser.hall_demo()[source]
nltk.parse.nonprojectivedependencyparser.nonprojective_conll_parse_demo()[source]
nltk.parse.nonprojectivedependencyparser.rule_based_demo()[source]

nltk.parse.pchart module

Classes and interfaces for associating probabilities with tree structures that represent the internal organization of a text. The probabilistic parser module defines BottomUpProbabilisticChartParser.

BottomUpProbabilisticChartParser is an abstract class that implements a bottom-up chart parser for PCFG grammars. It maintains a queue of edges, and adds them to the chart one at a time. The ordering of this queue is based on the probabilities associated with the edges, allowing the parser to expand more likely edges before less likely ones. Each subclass implements a different queue ordering, producing different search strategies. Currently the following subclasses are defined:

  • InsideChartParser searches edges in decreasing order of their trees’ inside probabilities.
  • RandomChartParser searches edges in random order.
  • LongestChartParser searches edges in decreasing order of their location’s length.

The BottomUpProbabilisticChartParser constructor has an optional argument beam_size. If non-zero, this controls the size of the beam (aka the edge queue). This option is most useful with InsideChartParser.

class nltk.parse.pchart.BottomUpProbabilisticChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.api.ParserI

An abstract bottom-up parser for PCFG grammars that uses a Chart to record partial results. BottomUpProbabilisticChartParser maintains a queue of edges that can be added to the chart. This queue is initialized with edges for each token in the text that is being parsed. BottomUpProbabilisticChartParser inserts these edges into the chart one at a time, starting with the most likely edges, and proceeding to less likely edges. For each edge that is added to the chart, it may become possible to insert additional edges into the chart; these are added to the queue. This process continues until enough complete parses have been generated, or until the queue is empty.

The sorting order for the queue is not specified by BottomUpProbabilisticChartParser. Different sorting orders will result in different search strategies. The sorting order for the queue is defined by the method sort_queue; subclasses are required to provide a definition for this method.

Variables:
  • _grammar – The grammar used to parse sentences.
  • _trace – The level of tracing output that should be generated when parsing a text.
grammar()[source]
parse(tokens)[source]
sort_queue(queue, chart)[source]

Sort the given queue of Edge objects, placing the edge that should be tried first at the beginning of the queue. This method will be called after each Edge is added to the queue.

Parameters:
  • queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule; but that has not yet been added.
  • chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type:

None

trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
class nltk.parse.pchart.InsideChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries edges in descending order of the inside probabilities of their trees. The “inside probability” of a tree is simply the probability of the entire tree, ignoring its context. In particular, the inside probability of a tree generated by production p with children c[1], c[2], ..., c[n] is P(p)P(c[1])P(c[2])...P(c[n]); and the inside probability of a token is 1 if it is present in the text, and 0 if it is absent.

This sorting order results in a type of lowest-cost-first search strategy.

sort_queue(queue, chart)[source]

Sort the given queue of edges, in descending order of the inside probabilities of the edges’ trees.

Parameters:
  • queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule; but that has not yet been added.
  • chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type:

None

class nltk.parse.pchart.LongestChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries longer edges before shorter ones. This sorting order results in a type of best-first search strategy.

sort_queue(queue, chart)[source]
class nltk.parse.pchart.ProbabilisticBottomUpInitRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 0
apply(chart, grammar)[source]
class nltk.parse.pchart.ProbabilisticBottomUpPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 1
apply(chart, grammar, edge)[source]
class nltk.parse.pchart.ProbabilisticFundamentalRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 2
apply(chart, grammar, left_edge, right_edge)[source]
class nltk.parse.pchart.ProbabilisticLeafEdge(leaf, index)[source]

Bases: nltk.parse.chart.LeafEdge

prob()[source]
class nltk.parse.pchart.ProbabilisticTreeEdge(prob, *args, **kwargs)[source]

Bases: nltk.parse.chart.TreeEdge

static from_production(production, index, p)[source]
prob()[source]
class nltk.parse.pchart.RandomChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries edges in random order. This sorting order results in a random search strategy.

sort_queue(queue, chart)[source]
class nltk.parse.pchart.SingleEdgeProbabilisticFundamentalRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 1
apply(chart, grammar, edge1)[source]
unicode_repr

Return repr(self).

class nltk.parse.pchart.UnsortedChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries edges in whatever order.

sort_queue(queue, chart)[source]
nltk.parse.pchart.demo(choice=None, draw_parses=None, print_parses=None)[source]

A demonstration of the probabilistic parsers. The user is prompted to select which demo to run, and how many parses should be found; and then each parser is run on the same demo, and a summary of the results are displayed.

nltk.parse.projectivedependencyparser module

class nltk.parse.projectivedependencyparser.ChartCell(x, y)[source]

Bases: object

A cell from the parse chart formed when performing the CYK algorithm. Each cell keeps track of its x and y coordinates (though this will probably be discarded), and a list of spans serving as the cell’s entries.

add(span)[source]

Appends the given span to the list of spans representing the chart cell’s entries.

Parameters:span (DependencySpan) – The span to add.
unicode_repr()
Returns:A concise string representation of this ChartCell.
Return type:str.
class nltk.parse.projectivedependencyparser.DependencySpan(start_index, end_index, head_index, arcs, tags)[source]

Bases: object

A contiguous span over some part of the input string representing dependency (head -> modifier) relationships amongst words. An atomic span corresponds to only one word so it isn’t a ‘span’ in the conventional sense, as its _start_index = _end_index = _head_index for concatenation purposes. All other spans are assumed to have arcs between all nodes within the start and end indexes of the span, and one head index corresponding to the head word for the entire span. This is the same as the root node if the dependency structure were depicted as a graph.

head_index()[source]
Returns:An value indexing the head of the entire DependencySpan.
Return type:int
unicode_repr()
Returns:A concise string representatino of the DependencySpan.
Return type:str.
class nltk.parse.projectivedependencyparser.ProbabilisticProjectiveDependencyParser[source]

Bases: object

A probabilistic, projective dependency parser.

This parser returns the most probable projective parse derived from the probabilistic dependency grammar derived from the train() method. The probabilistic model is an implementation of Eisner’s (1996) Model C, which conditions on head-word, head-tag, child-word, and child-tag. The decoding uses a bottom-up chart-based span concatenation algorithm that’s identical to the one utilized by the rule-based projective parser.

>>> from nltk.parse.dependencygraph import conll_data2
>>> graphs = [
... DependencyGraph(entry) for entry in conll_data2.split('\n\n') if entry
... ]
>>> ppdp = ProbabilisticProjectiveDependencyParser()
>>> ppdp.train(graphs)
>>> sent = ['Cathy', 'zag', 'hen', 'wild', 'zwaaien', '.']
>>> list(ppdp.parse(sent))
[Tree('zag', ['Cathy', 'hen', Tree('zwaaien', ['wild', '.'])])]
compute_prob(dg)[source]

Computes the probability of a dependency graph based on the parser’s probability model (defined by the parser’s statistical dependency grammar).

Parameters:dg (DependencyGraph) – A dependency graph to score.
Returns:The probability of the dependency graph.
Return type:int
concatenate(span1, span2)[source]

Concatenates the two spans in whichever way possible. This includes rightward concatenation (from the leftmost word of the leftmost span to the rightmost word of the rightmost span) and leftward concatenation (vice-versa) between adjacent spans. Unlike Eisner’s presentation of span concatenation, these spans do not share or pivot on a particular word/word-index.

Returns:A list of new spans formed through concatenation.
Return type:list(DependencySpan)
parse(tokens)[source]

Parses the list of tokens subject to the projectivity constraint and the productions in the parser’s grammar. This uses a method similar to the span-concatenation algorithm defined in Eisner (1996). It returns the most probable parse derived from the parser’s probabilistic dependency grammar.

train(graphs)[source]

Trains a ProbabilisticDependencyGrammar based on the list of input DependencyGraphs. This model is an implementation of Eisner’s (1996) Model C, which derives its statistics from head-word, head-tag, child-word, and child-tag relationships.

Parameters:graphs – A list of dependency graphs to train from.
Type:list(DependencyGraph)
class nltk.parse.projectivedependencyparser.ProjectiveDependencyParser(dependency_grammar)[source]

Bases: object

A projective, rule-based, dependency parser. A ProjectiveDependencyParser is created with a DependencyGrammar, a set of productions specifying word-to-word dependency relations. The parse() method will then return the set of all parses, in tree representation, for a given input sequence of tokens. Each parse must meet the requirements of the both the grammar and the projectivity constraint which specifies that the branches of the dependency tree are not allowed to cross. Alternatively, this can be understood as stating that each parent node and its children in the parse tree form a continuous substring of the input sequence.

concatenate(span1, span2)[source]

Concatenates the two spans in whichever way possible. This includes rightward concatenation (from the leftmost word of the leftmost span to the rightmost word of the rightmost span) and leftward concatenation (vice-versa) between adjacent spans. Unlike Eisner’s presentation of span concatenation, these spans do not share or pivot on a particular word/word-index.

Returns:A list of new spans formed through concatenation.
Return type:list(DependencySpan)
parse(tokens)[source]

Performs a projective dependency parse on the list of tokens using a chart-based, span-concatenation algorithm similar to Eisner (1996).

Parameters:tokens (list(str)) – The list of input tokens.
Returns:An iterator over parse trees.
Return type:iter(Tree)
nltk.parse.projectivedependencyparser.arity_parse_demo()[source]

A demonstration showing the creation of a DependencyGrammar in which a specific number of modifiers is listed for a given head. This can further constrain the number of possible parses created by a ProjectiveDependencyParser.

nltk.parse.projectivedependencyparser.demo()[source]
nltk.parse.projectivedependencyparser.projective_prob_parse_demo()[source]

A demo showing the training and use of a projective dependency parser.

nltk.parse.projectivedependencyparser.projective_rule_parse_demo()[source]

A demonstration showing the creation and use of a DependencyGrammar to perform a projective dependency parse.

nltk.parse.recursivedescent module

class nltk.parse.recursivedescent.RecursiveDescentParser(grammar, trace=0)[source]

Bases: nltk.parse.api.ParserI

A simple top-down CFG parser that parses texts by recursively expanding the fringe of a Tree, and matching it against a text.

RecursiveDescentParser uses a list of tree locations called a “frontier” to remember which subtrees have not yet been expanded and which leaves have not yet been matched against the text. Each tree location consists of a list of child indices specifying the path from the root of the tree to a subtree or a leaf; see the reference documentation for Tree for more information about tree locations.

When the parser begins parsing a text, it constructs a tree containing only the start symbol, and a frontier containing the location of the tree’s root node. It then extends the tree to cover the text, using the following recursive procedure:

  • If the frontier is empty, and the text is covered by the tree, then return the tree as a possible parse.
  • If the frontier is empty, and the text is not covered by the tree, then return no parses.
  • If the first element of the frontier is a subtree, then use CFG productions to “expand” it. For each applicable production, add the expanded subtree’s children to the frontier, and recursively find all parses that can be generated by the new tree and frontier.
  • If the first element of the frontier is a token, then “match” it against the next token from the text. Remove the token from the frontier, and recursively find all parses that can be generated by the new tree and frontier.
See:nltk.grammar
grammar()[source]
parse(tokens)[source]
trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
class nltk.parse.recursivedescent.SteppingRecursiveDescentParser(grammar, trace=0)[source]

Bases: nltk.parse.recursivedescent.RecursiveDescentParser

A RecursiveDescentParser that allows you to step through the parsing process, performing a single operation at a time.

The initialize method is used to start parsing a text. expand expands the first element on the frontier using a single CFG production, and match matches the first element on the frontier against the next text token. backtrack undoes the most recent expand or match operation. step performs a single expand, match, or backtrack operation. parses returns the set of parses that have been found by the parser.

Variables:
  • _history – A list of (rtext, tree, frontier) tripples, containing the previous states of the parser. This history is used to implement the backtrack operation.
  • _tried_e – A record of all productions that have been tried for a given tree. This record is used by expand to perform the next untried production.
  • _tried_m – A record of what tokens have been matched for a given tree. This record is used by step to decide whether or not to match a token.
See:

nltk.grammar

backtrack()[source]

Return the parser to its state before the most recent match or expand operation. Calling undo repeatedly return the parser to successively earlier states. If no match or expand operations have been performed, undo will make no changes.

Returns:true if an operation was successfully undone.
Return type:bool
currently_complete()[source]
Returns:Whether the parser’s current state represents a complete parse.
Return type:bool
expand(production=None)[source]

Expand the first element of the frontier. In particular, if the first element of the frontier is a subtree whose node type is equal to production‘s left hand side, then add a child to that subtree for each element of production‘s right hand side. If production is not specified, then use the first untried expandable production. If all expandable productions have been tried, do nothing.

Returns:The production used to expand the frontier, if an expansion was performed. If no expansion was performed, return None.
Return type:Production or None
expandable_productions()[source]
Returns:A list of all the productions for which expansions are available for the current parser state.
Return type:list(Production)
frontier()[source]
Returns:A list of the tree locations of all subtrees that have not yet been expanded, and all leaves that have not yet been matched.
Return type:list(tuple(int))
initialize(tokens)[source]

Start parsing a given text. This sets the parser’s tree to the start symbol, its frontier to the root node, and its remaining text to token['SUBTOKENS'].

match()[source]

Match the first element of the frontier. In particular, if the first element of the frontier has the same type as the next text token, then substitute the text token into the tree.

Returns:The token matched, if a match operation was performed. If no match was performed, return None
Return type:str or None
parse(tokens)[source]
parses()[source]
Returns:An iterator of the parses that have been found by this parser so far.
Return type:list of Tree
remaining_text()[source]
Returns:The portion of the text that is not yet covered by the tree.
Return type:list(str)
set_grammar(grammar)[source]

Change the grammar used to parse texts.

Parameters:grammar (CFG) – The new grammar.
step()[source]

Perform a single parsing operation. If an untried match is possible, then perform the match, and return the matched token. If an untried expansion is possible, then perform the expansion, and return the production that it is based on. If backtracking is possible, then backtrack, and return True. Otherwise, return None.

Returns:None if no operation was performed; a token if a match was performed; a production if an expansion was performed; and True if a backtrack operation was performed.
Return type:Production or String or bool
tree()[source]
Returns:A partial structure for the text that is currently being parsed. The elements specified by the frontier have not yet been expanded or matched.
Return type:Tree
untried_expandable_productions()[source]
Returns:A list of all the untried productions for which expansions are available for the current parser state.
Return type:list(Production)
untried_match()[source]
Returns:Whether the first element of the frontier is a token that has not yet been matched.
Return type:bool
nltk.parse.recursivedescent.demo()[source]

A demonstration of the recursive descent parser.

nltk.parse.shiftreduce module

class nltk.parse.shiftreduce.ShiftReduceParser(grammar, trace=0)[source]

Bases: nltk.parse.api.ParserI

A simple bottom-up CFG parser that uses two operations, “shift” and “reduce”, to find a single parse for a text.

ShiftReduceParser maintains a stack, which records the structure of a portion of the text. This stack is a list of strings and Trees that collectively cover a portion of the text. For example, while parsing the sentence “the dog saw the man” with a typical grammar, ShiftReduceParser will produce the following stack, which covers “the dog saw”:

[(NP: (Det: 'the') (N: 'dog')), (V: 'saw')]

ShiftReduceParser attempts to extend the stack to cover the entire text, and to combine the stack elements into a single tree, producing a complete parse for the sentence.

Initially, the stack is empty. It is extended to cover the text, from left to right, by repeatedly applying two operations:

  • “shift” moves a token from the beginning of the text to the end of the stack.
  • “reduce” uses a CFG production to combine the rightmost stack elements into a single Tree.

Often, more than one operation can be performed on a given stack. In this case, ShiftReduceParser uses the following heuristics to decide which operation to perform:

  • Only shift if no reductions are available.
  • If multiple reductions are available, then apply the reduction whose CFG production is listed earliest in the grammar.

Note that these heuristics are not guaranteed to choose an operation that leads to a parse of the text. Also, if multiple parses exists, ShiftReduceParser will return at most one of them.

See:nltk.grammar
grammar()[source]
parse(tokens)[source]
trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
class nltk.parse.shiftreduce.SteppingShiftReduceParser(grammar, trace=0)[source]

Bases: nltk.parse.shiftreduce.ShiftReduceParser

A ShiftReduceParser that allows you to setp through the parsing process, performing a single operation at a time. It also allows you to change the parser’s grammar midway through parsing a text.

The initialize method is used to start parsing a text. shift performs a single shift operation, and reduce performs a single reduce operation. step will perform a single reduce operation if possible; otherwise, it will perform a single shift operation. parses returns the set of parses that have been found by the parser.

Variables:_history – A list of (stack, remaining_text) pairs, containing all of the previous states of the parser. This history is used to implement the undo operation.
See:nltk.grammar
initialize(tokens)[source]

Start parsing a given text. This sets the parser’s stack to [] and sets its remaining text to tokens.

parse(tokens)[source]
parses()[source]
Returns:An iterator of the parses that have been found by this parser so far.
Return type:iter(Tree)
reduce(production=None)[source]

Use production to combine the rightmost stack elements into a single Tree. If production does not match the rightmost stack elements, then do nothing.

Returns:The production used to reduce the stack, if a reduction was performed. If no reduction was performed, return None.
Return type:Production or None
reducible_productions()[source]
Returns:A list of the productions for which reductions are available for the current parser state.
Return type:list(Production)
remaining_text()[source]
Returns:The portion of the text that is not yet covered by the stack.
Return type:list(str)
set_grammar(grammar)[source]

Change the grammar used to parse texts.

Parameters:grammar (CFG) – The new grammar.
shift()[source]

Move a token from the beginning of the remaining text to the end of the stack. If there are no more tokens in the remaining text, then do nothing.

Returns:True if the shift operation was successful.
Return type:bool
stack()[source]
Returns:The parser’s stack.
Return type:list(str and Tree)
step()[source]

Perform a single parsing operation. If a reduction is possible, then perform that reduction, and return the production that it is based on. Otherwise, if a shift is possible, then perform it, and return True. Otherwise, return False.

Returns:False if no operation was performed; True if a shift was performed; and the CFG production used to reduce if a reduction was performed.
Return type:Production or bool
undo()[source]

Return the parser to its state before the most recent shift or reduce operation. Calling undo repeatedly return the parser to successively earlier states. If no shift or reduce operations have been performed, undo will make no changes.

Returns:true if an operation was successfully undone.
Return type:bool
nltk.parse.shiftreduce.demo()[source]

A demonstration of the shift-reduce parser.

nltk.parse.stanford module

class nltk.parse.stanford.GenericStanfordParser(path_to_jar=None, path_to_models_jar=None, model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', encoding='utf8', verbose=False, java_options='-mx1000m', corenlp_options='')[source]

Bases: nltk.parse.api.ParserI

Interface to the Stanford Parser

parse_sents(sentences, verbose=False)[source]

Use StanfordParser to parse multiple sentences. Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this StanfordParser instance’s tagger. If whitespaces exists inside a token, then the token will be treated as separate tokens.

Parameters:sentences (list(list(str))) – Input sentences to parse
Return type:iter(iter(Tree))
raw_parse(sentence, verbose=False)[source]

Use StanfordParser to parse a sentence. Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged by the Stanford Parser.

Parameters:sentence (str) – Input sentence to parse
Return type:iter(Tree)
raw_parse_sents(sentences, verbose=False)[source]

Use StanfordParser to parse multiple sentences. Takes multiple sentences as a list of strings. Each sentence will be automatically tokenized and tagged by the Stanford Parser.

Parameters:sentences (list(str)) – Input sentences to parse
Return type:iter(iter(Tree))
tagged_parse(sentence, verbose=False)[source]

Use StanfordParser to parse a sentence. Takes a sentence as a list of (word, tag) tuples; the sentence must have already been tokenized and tagged.

Parameters:sentence (list(tuple(str, str))) – Input sentence to parse
Return type:iter(Tree)
tagged_parse_sents(sentences, verbose=False)[source]

Use StanfordParser to parse multiple sentences. Takes multiple sentences where each sentence is a list of (word, tag) tuples. The sentences must have already been tokenized and tagged.

Parameters:sentences (list(list(tuple(str, str)))) – Input sentences to parse
Return type:iter(iter(Tree))
class nltk.parse.stanford.StanfordDependencyParser(path_to_jar=None, path_to_models_jar=None, model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', encoding='utf8', verbose=False, java_options='-mx1000m', corenlp_options='')[source]

Bases: nltk.parse.stanford.GenericStanfordParser

>>> dep_parser=StanfordDependencyParser(
...     model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"
... )
>>> [parse.tree() for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] 
[Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy'])])]
>>> [list(parse.triples()) for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] 
[[((u'jumps', u'VBZ'), u'nsubj', (u'fox', u'NN')), ((u'fox', u'NN'), u'det', (u'The', u'DT')),
((u'fox', u'NN'), u'amod', (u'quick', u'JJ')), ((u'fox', u'NN'), u'amod', (u'brown', u'JJ')),
((u'jumps', u'VBZ'), u'nmod', (u'dog', u'NN')), ((u'dog', u'NN'), u'case', (u'over', u'IN')),
((u'dog', u'NN'), u'det', (u'the', u'DT')), ((u'dog', u'NN'), u'amod', (u'lazy', u'JJ'))]]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.raw_parse_sents((
...     "The quick brown fox jumps over the lazy dog.",
...     "The quick grey wolf jumps over the lazy fox."
... ))], []) 
[Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy'])]),
Tree('jumps', [Tree('wolf', ['The', 'quick', 'grey']), Tree('fox', ['over', 'the', 'lazy'])])]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.parse_sents((
...     "I 'm a dog".split(),
...     "This is my friends ' cat ( the tabby )".split(),
... ))], []) 
[Tree('dog', ['I', "'m", 'a']), Tree('cat', ['This', 'is', Tree('friends', ['my', "'"]), Tree('tabby', ['the'])])]
>>> sum([[list(parse.triples()) for parse in dep_graphs] for dep_graphs in dep_parser.tagged_parse_sents((
...     (
...         ("The", "DT"),
...         ("quick", "JJ"),
...         ("brown", "JJ"),
...         ("fox", "NN"),
...         ("jumped", "VBD"),
...         ("over", "IN"),
...         ("the", "DT"),
...         ("lazy", "JJ"),
...         ("dog", "NN"),
...         (".", "."),
...     ),
... ))],[]) 
[[((u'jumped', u'VBD'), u'nsubj', (u'fox', u'NN')), ((u'fox', u'NN'), u'det', (u'The', u'DT')),
((u'fox', u'NN'), u'amod', (u'quick', u'JJ')), ((u'fox', u'NN'), u'amod', (u'brown', u'JJ')),
((u'jumped', u'VBD'), u'nmod', (u'dog', u'NN')), ((u'dog', u'NN'), u'case', (u'over', u'IN')),
((u'dog', u'NN'), u'det', (u'the', u'DT')), ((u'dog', u'NN'), u'amod', (u'lazy', u'JJ'))]]
class nltk.parse.stanford.StanfordNeuralDependencyParser(*args, **kwargs)[source]

Bases: nltk.parse.stanford.GenericStanfordParser

>>> from nltk.parse.stanford import StanfordNeuralDependencyParser
>>> dep_parser=StanfordNeuralDependencyParser(java_options='-mx3g')
>>> [parse.tree() for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] 
[Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over', 'the', 'lazy']), '.'])]
>>> [list(parse.triples()) for parse in dep_parser.raw_parse("The quick brown fox jumps over the lazy dog.")] 
[[((u'jumps', u'VBZ'), u'nsubj', (u'fox', u'NN')), ((u'fox', u'NN'), u'det',
(u'The', u'DT')), ((u'fox', u'NN'), u'amod', (u'quick', u'JJ')), ((u'fox', u'NN'),
u'amod', (u'brown', u'JJ')), ((u'jumps', u'VBZ'), u'nmod', (u'dog', u'NN')),
((u'dog', u'NN'), u'case', (u'over', u'IN')), ((u'dog', u'NN'), u'det',
(u'the', u'DT')), ((u'dog', u'NN'), u'amod', (u'lazy', u'JJ')), ((u'jumps', u'VBZ'),
u'punct', (u'.', u'.'))]]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.raw_parse_sents((
...     "The quick brown fox jumps over the lazy dog.",
...     "The quick grey wolf jumps over the lazy fox."
... ))], []) 
[Tree('jumps', [Tree('fox', ['The', 'quick', 'brown']), Tree('dog', ['over',
'the', 'lazy']), '.']), Tree('jumps', [Tree('wolf', ['The', 'quick', 'grey']),
Tree('fox', ['over', 'the', 'lazy']), '.'])]
>>> sum([[parse.tree() for parse in dep_graphs] for dep_graphs in dep_parser.parse_sents((
...     "I 'm a dog".split(),
...     "This is my friends ' cat ( the tabby )".split(),
... ))], []) 
[Tree('dog', ['I', "'m", 'a']), Tree('cat', ['This', 'is', Tree('friends',
['my', "'"]), Tree('tabby', ['-LRB-', 'the', '-RRB-'])])]
tagged_parse_sents(sentences, verbose=False)[source]

Currently unimplemented because the neural dependency parser (and the StanfordCoreNLP pipeline class) doesn’t support passing in pre- tagged tokens.

class nltk.parse.stanford.StanfordParser(path_to_jar=None, path_to_models_jar=None, model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', encoding='utf8', verbose=False, java_options='-mx1000m', corenlp_options='')[source]

Bases: nltk.parse.stanford.GenericStanfordParser

>>> parser=StanfordParser(
...     model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"
... )
>>> list(parser.raw_parse("the quick brown fox jumps over the lazy dog")) 
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']),
Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']),
Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])])]
>>> sum([list(dep_graphs) for dep_graphs in parser.raw_parse_sents((
...     "the quick brown fox jumps over the lazy dog",
...     "the quick grey wolf jumps over the lazy fox"
... ))], []) 
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']),
Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']),
Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]), Tree('ROOT', [Tree('NP',
[Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP',
[Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']),
Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]
>>> sum([list(dep_graphs) for dep_graphs in parser.parse_sents((
...     "I 'm a dog".split(),
...     "This is my friends ' cat ( the tabby )".split(),
... ))], []) 
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('PRP', ['I'])]), Tree('VP', [Tree('VBP', ["'m"]),
Tree('NP', [Tree('DT', ['a']), Tree('NN', ['dog'])])])])]), Tree('ROOT', [Tree('S', [Tree('NP',
[Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('PRP$', ['my']),
Tree('NNS', ['friends']), Tree('POS', ["'"])]), Tree('NN', ['cat'])]), Tree('PRN', [Tree('-LRB-', ['-LRB-']),
Tree('NP', [Tree('DT', ['the']), Tree('NN', ['tabby'])]), Tree('-RRB-', ['-RRB-'])])])])])])]
>>> sum([list(dep_graphs) for dep_graphs in parser.tagged_parse_sents((
...     (
...         ("The", "DT"),
...         ("quick", "JJ"),
...         ("brown", "JJ"),
...         ("fox", "NN"),
...         ("jumped", "VBD"),
...         ("over", "IN"),
...         ("the", "DT"),
...         ("lazy", "JJ"),
...         ("dog", "NN"),
...         (".", "."),
...     ),
... ))],[]) 
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['quick']), Tree('JJ', ['brown']),
Tree('NN', ['fox'])]), Tree('VP', [Tree('VBD', ['jumped']), Tree('PP', [Tree('IN', ['over']), Tree('NP',
[Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])]), Tree('.', ['.'])])])]
nltk.parse.stanford.setup_module(module)[source]

nltk.parse.transitionparser module

class nltk.parse.transitionparser.Configuration(dep_graph)[source]

Bases: object

Class for holding configuration which is the partial analysis of the input sentence. The transition based parser aims at finding set of operators that transfer the initial configuration to the terminal configuration.

The configuration includes:
  • Stack: for storing partially proceeded words
  • Buffer: for storing remaining input words
  • Set of arcs: for storing partially built dependency tree

This class also provides a method to represent a configuration as list of features.

extract_features()[source]

Extract the set of features for the current configuration. Implement standard features as describe in Table 3.2 (page 31) in Dependency Parsing book by Sandra Kubler, Ryan McDonal, Joakim Nivre. Please note that these features are very basic. :return: list(str)

class nltk.parse.transitionparser.Transition(alg_option)[source]

Bases: object

This class defines a set of transition which is applied to a configuration to get another configuration Note that for different parsing algorithm, the transition is different.

LEFT_ARC = 'LEFTARC'
REDUCE = 'REDUCE'
RIGHT_ARC = 'RIGHTARC'
SHIFT = 'SHIFT'
left_arc(conf, relation)[source]
Note that the algorithm for left-arc is quite similar except for precondition for both arc-standard and arc-eager
param configuration:
 is the current configuration

:return : A new configuration or -1 if the pre-condition is not satisfied

reduce(conf)[source]
Note that the algorithm for reduce is only available for arc-eager
param configuration:
 is the current configuration

:return : A new configuration or -1 if the pre-condition is not satisfied

right_arc(conf, relation)[source]
Note that the algorithm for right-arc is DIFFERENT for arc-standard and arc-eager
param configuration:
 is the current configuration

:return : A new configuration or -1 if the pre-condition is not satisfied

shift(conf)[source]
Note that the algorithm for shift is the SAME for arc-standard and arc-eager
param configuration:
 is the current configuration

:return : A new configuration or -1 if the pre-condition is not satisfied

class nltk.parse.transitionparser.TransitionParser(algorithm)[source]

Bases: nltk.parse.api.ParserI

Class for transition based parser. Implement 2 algorithms which are “arc-standard” and “arc-eager”

ARC_EAGER = 'arc-eager'
ARC_STANDARD = 'arc-standard'
parse(depgraphs, modelFile)[source]
Parameters:
  • depgraphs (list(DependencyGraph)) – the list of test sentence, each sentence is represented as a dependency graph where the ‘head’ information is dummy
  • modelfile (str) – the model file
Returns:

list (DependencyGraph) with the ‘head’ and ‘rel’ information

train(depgraphs, modelfile, verbose=True)[source]

:param depgraphs : list of DependencyGraph as the training data :type depgraphs : DependencyGraph :param modelfile : file name to save the trained model :type modelfile : str

nltk.parse.transitionparser.demo()[source]
>>> from nltk.parse import DependencyGraph, DependencyEvaluator
>>> from nltk.parse.transitionparser import TransitionParser, Configuration, Transition
>>> gold_sent = DependencyGraph("""
... Economic  JJ     2      ATT
... news  NN     3       SBJ
... has       VBD       0       ROOT
... little      JJ      5       ATT
... effect   NN     3       OBJ
... on     IN      5       ATT
... financial       JJ       8       ATT
... markets    NNS      6       PC
... .    .      3       PU
... """)
>>> conf = Configuration(gold_sent)

###################### Check the Initial Feature ########################

>>> print(', '.join(conf.extract_features()))
STK_0_POS_TOP, BUF_0_FORM_Economic, BUF_0_LEMMA_Economic, BUF_0_POS_JJ, BUF_1_FORM_news, BUF_1_POS_NN, BUF_2_POS_VBD, BUF_3_POS_JJ

###################### Check The Transition ####################### Check the Initialized Configuration >>> print(conf) Stack : [0] Buffer : [1, 2, 3, 4, 5, 6, 7, 8, 9] Arcs : []

  1. Do some transition checks for ARC-STANDARD
>>> operation = Transition('arc-standard')
>>> operation.shift(conf)
>>> operation.left_arc(conf, "ATT")
>>> operation.shift(conf)
>>> operation.left_arc(conf,"SBJ")
>>> operation.shift(conf)
>>> operation.shift(conf)
>>> operation.left_arc(conf, "ATT")
>>> operation.shift(conf)
>>> operation.shift(conf)
>>> operation.shift(conf)
>>> operation.left_arc(conf, "ATT")

Middle Configuration and Features Check >>> print(conf) Stack : [0, 3, 5, 6] Buffer : [8, 9] Arcs : [(2, ‘ATT’, 1), (3, ‘SBJ’, 2), (5, ‘ATT’, 4), (8, ‘ATT’, 7)]

>>> print(', '.join(conf.extract_features()))
STK_0_FORM_on, STK_0_LEMMA_on, STK_0_POS_IN, STK_1_POS_NN, BUF_0_FORM_markets, BUF_0_LEMMA_markets, BUF_0_POS_NNS, BUF_1_FORM_., BUF_1_POS_., BUF_0_LDEP_ATT
>>> operation.right_arc(conf, "PC")
>>> operation.right_arc(conf, "ATT")
>>> operation.right_arc(conf, "OBJ")
>>> operation.shift(conf)
>>> operation.right_arc(conf, "PU")
>>> operation.right_arc(conf, "ROOT")
>>> operation.shift(conf)

Terminated Configuration Check >>> print(conf) Stack : [0] Buffer : [] Arcs : [(2, ‘ATT’, 1), (3, ‘SBJ’, 2), (5, ‘ATT’, 4), (8, ‘ATT’, 7), (6, ‘PC’, 8), (5, ‘ATT’, 6), (3, ‘OBJ’, 5), (3, ‘PU’, 9), (0, ‘ROOT’, 3)]

  1. Do some transition checks for ARC-EAGER
>>> conf = Configuration(gold_sent)
>>> operation = Transition('arc-eager')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'ATT')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'SBJ')
>>> operation.right_arc(conf,'ROOT')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'ATT')
>>> operation.right_arc(conf,'OBJ')
>>> operation.right_arc(conf,'ATT')
>>> operation.shift(conf)
>>> operation.left_arc(conf,'ATT')
>>> operation.right_arc(conf,'PC')
>>> operation.reduce(conf)
>>> operation.reduce(conf)
>>> operation.reduce(conf)
>>> operation.right_arc(conf,'PU')
>>> print(conf)
Stack : [0, 3, 9]  Buffer : []   Arcs : [(2, 'ATT', 1), (3, 'SBJ', 2), (0, 'ROOT', 3), (5, 'ATT', 4), (3, 'OBJ', 5), (5, 'ATT', 6), (8, 'ATT', 7), (6, 'PC', 8), (3, 'PU', 9)]

###################### Check The Training Function #######################

A. Check the ARC-STANDARD training >>> import tempfile >>> import os >>> input_file = tempfile.NamedTemporaryFile(prefix=’transition_parse.train’, dir=tempfile.gettempdir(), delete=False)

>>> parser_std = TransitionParser('arc-standard')
>>> print(', '.join(parser_std._create_training_examples_arc_std([gold_sent], input_file)))
 Number of training examples : 1
 Number of valid (projective) examples : 1
SHIFT, LEFTARC:ATT, SHIFT, LEFTARC:SBJ, SHIFT, SHIFT, LEFTARC:ATT, SHIFT, SHIFT, SHIFT, LEFTARC:ATT, RIGHTARC:PC, RIGHTARC:ATT, RIGHTARC:OBJ, SHIFT, RIGHTARC:PU, RIGHTARC:ROOT, SHIFT
>>> parser_std.train([gold_sent],'temp.arcstd.model', verbose=False)
 Number of training examples : 1
 Number of valid (projective) examples : 1
>>> remove(input_file.name)
  1. Check the ARC-EAGER training
>>> input_file = tempfile.NamedTemporaryFile(prefix='transition_parse.train', dir=tempfile.gettempdir(),delete=False)
>>> parser_eager = TransitionParser('arc-eager')
>>> print(', '.join(parser_eager._create_training_examples_arc_eager([gold_sent], input_file)))
 Number of training examples : 1
 Number of valid (projective) examples : 1
SHIFT, LEFTARC:ATT, SHIFT, LEFTARC:SBJ, RIGHTARC:ROOT, SHIFT, LEFTARC:ATT, RIGHTARC:OBJ, RIGHTARC:ATT, SHIFT, LEFTARC:ATT, RIGHTARC:PC, REDUCE, REDUCE, REDUCE, RIGHTARC:PU
>>> parser_eager.train([gold_sent],'temp.arceager.model', verbose=False)
 Number of training examples : 1
 Number of valid (projective) examples : 1
>>> remove(input_file.name)

###################### Check The Parsing Function ########################

  1. Check the ARC-STANDARD parser
>>> result = parser_std.parse([gold_sent], 'temp.arcstd.model')
>>> de = DependencyEvaluator(result, [gold_sent])
>>> de.eval() >= (0, 0)
True

B. Check the ARC-EAGER parser >>> result = parser_eager.parse([gold_sent], ‘temp.arceager.model’) >>> de = DependencyEvaluator(result, [gold_sent]) >>> de.eval() >= (0, 0) True

Remove test temporary files >>> remove(‘temp.arceager.model’) >>> remove(‘temp.arcstd.model’)

Note that result is very poor because of only one training example.

nltk.parse.util module

Utility functions for parsers.

class nltk.parse.util.TestGrammar(grammar, suite, accept=None, reject=None)[source]

Bases: object

Unit tests for CFG.

run(show_trees=False)[source]
Sentences in the test suite are divided into two classes:
  • grammatical (accept) and
  • ungrammatical (reject).

If a sentence should parse accordng to the grammar, the value of trees will be a non-empty list. If a sentence should be rejected according to the grammar, then the value of trees will be None.

nltk.parse.util.extract_test_sentences(string, comment_chars='#%;', encoding=None)[source]

Parses a string with one test sentence per line. Lines can optionally begin with:

  • a bool, saying if the sentence is grammatical or not, or
  • an int, giving the number of parse trees is should have,

The result information is followed by a colon, and then the sentence. Empty lines and lines beginning with a comment char are ignored.

Returns:

a list of tuple of sentences and expected results, where a sentence is a list of str, and a result is None, or bool, or int

Parameters:
  • comment_charsstr of possible comment characters.
  • encoding – the encoding of the string, if it is binary
nltk.parse.util.load_parser(grammar_url, trace=0, parser=None, chart_class=None, beam_size=0, **load_args)[source]

Load a grammar from a file, and build a parser based on that grammar. The parser depends on the grammar format, and might also depend on properties of the grammar itself.

The following grammar formats are currently supported:
  • 'cfg' (CFGs: CFG)
  • 'pcfg' (probabilistic CFGs: PCFG)
  • 'fcfg' (feature-based CFGs: FeatureGrammar)
Parameters:
  • grammar_url (str) – A URL specifying where the grammar is located. The default protocol is "nltk:", which searches for the file in the the NLTK data package.
  • trace (int) – The level of tracing that should be used when parsing a text. 0 will generate no tracing output; and higher numbers will produce more verbose tracing output.
  • parser – The class used for parsing; should be ChartParser or a subclass. If None, the class depends on the grammar format.
  • chart_class – The class used for storing the chart; should be Chart or a subclass. Only used for CFGs and feature CFGs. If None, the chart class depends on the grammar format.
  • beam_size (int) – The maximum length for the parser’s edge queue. Only used for probabilistic CFGs.
  • load_args – Keyword parameters used when loading the grammar. See data.load for more information.
nltk.parse.util.taggedsent_to_conll(sentence)[source]

A module to convert a single POS tagged sentence into CONLL format.

>>> from nltk import word_tokenize, pos_tag
>>> text = "This is a foobar sentence."
>>> for line in taggedsent_to_conll(pos_tag(word_tokenize(text))):
...     print(line, end="")
1       This    _       DT      DT      _       0       a       _       _
2       is      _       VBZ     VBZ     _       0       a       _       _
3       a       _       DT      DT      _       0       a       _       _
4       foobar  _       JJ      JJ      _       0       a       _       _
5       sentence        _       NN      NN      _       0       a       _       _
6       .               _       .       .       _       0       a       _       _
Parameters:sentence (list(tuple(str, str))) – A single input sentence to parse
Return type:iter(str)
Returns:a generator yielding a single sentence in CONLL format.
nltk.parse.util.taggedsents_to_conll(sentences)[source]

A module to convert the a POS tagged document stream (i.e. list of list of tuples, a list of sentences) and yield lines in CONLL format. This module yields one line per word and two newlines for end of sentence.

>>> from nltk import word_tokenize, sent_tokenize, pos_tag
>>> text = "This is a foobar sentence. Is that right?"
>>> sentences = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> for line in taggedsents_to_conll(sentences):
...     if line:
...         print(line, end="")
1       This    _       DT      DT      _       0       a       _       _
2       is      _       VBZ     VBZ     _       0       a       _       _
3       a       _       DT      DT      _       0       a       _       _
4       foobar  _       JJ      JJ      _       0       a       _       _
5       sentence        _       NN      NN      _       0       a       _       _
6       .               _       .       .       _       0       a       _       _


1       Is      _       VBZ     VBZ     _       0       a       _       _
2       that    _       IN      IN      _       0       a       _       _
3       right   _       NN      NN      _       0       a       _       _
4       ?       _       .       .       _       0       a       _       _

Parameters:sentences – Input sentences to parse
Return type:iter(str)
Returns:a generator yielding sentences in CONLL format.

nltk.parse.viterbi module

class nltk.parse.viterbi.ViterbiParser(grammar, trace=0)[source]

Bases: nltk.parse.api.ParserI

A bottom-up PCFG parser that uses dynamic programming to find the single most likely parse for a text. The ViterbiParser parser parses texts by filling in a “most likely constituent table”. This table records the most probable tree representation for any given span and node value. In particular, it has an entry for every start index, end index, and node value, recording the most likely subtree that spans from the start index to the end index, and has the given node value.

The ViterbiParser parser fills in this table incrementally. It starts by filling in all entries for constituents that span one element of text (i.e., entries where the end index is one greater than the start index). After it has filled in all table entries for constituents that span one element of text, it fills in the entries for constitutants that span two elements of text. It continues filling in the entries for constituents spanning larger and larger portions of the text, until the entire table has been filled. Finally, it returns the table entry for a constituent spanning the entire text, whose node value is the grammar’s start symbol.

In order to find the most likely constituent with a given span and node value, the ViterbiParser parser considers all productions that could produce that node value. For each production, it finds all children that collectively cover the span and have the node values specified by the production’s right hand side. If the probability of the tree formed by applying the production to the children is greater than the probability of the current entry in the table, then the table is updated with this new tree.

A pseudo-code description of the algorithm used by ViterbiParser is:

Create an empty most likely constituent table, MLC.
For width in 1...len(text):
For start in 1...len(text)-width:
For prod in grammar.productions:
For each sequence of subtrees [t[1], t[2], ..., t[n]] in MLC,
where t[i].label()==prod.rhs[i],
and the sequence covers [start:start+width]:
old_p = MLC[start, start+width, prod.lhs]
new_p = P(t[1])P(t[1])...P(t[n])P(prod)
if new_p > old_p:
new_tree = Tree(prod.lhs, t[1], t[2], ..., t[n])
MLC[start, start+width, prod.lhs] = new_tree
Return MLC[0, len(text), start_symbol]
Variables:
  • _grammar – The grammar used to parse sentences.
  • _trace – The level of tracing output that should be generated when parsing a text.
grammar()[source]
parse(tokens)[source]
trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
unicode_repr()
nltk.parse.viterbi.demo()[source]

A demonstration of the probabilistic parsers. The user is prompted to select which demo to run, and how many parses should be found; and then each parser is run on the same demo, and a summary of the results are displayed.

Module contents

NLTK Parsers

Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as “parsing” the text, and the resulting tree structures are called the text’s “parses”. Typically, the text is a single sentence, and the tree structure represents the syntactic structure of the sentence. However, parsers can also be used in other domains. For example, parsers can be used to derive the morphological structure of the morphemes that make up a word, or to derive the discourse structure for a set of utterances.

Sometimes, a single piece of text can be represented by more than one tree structure. Texts represented by more than one tree structure are called “ambiguous” texts. Note that there are actually two ways in which a text can be ambiguous:

  • The text has multiple correct parses.
  • There is not enough information to decide which of several candidate parses is correct.

However, the parser module does not distinguish these two types of ambiguity.

The parser module defines ParserI, a standard interface for parsing texts; and two simple implementations of that interface, ShiftReduceParser and RecursiveDescentParser. It also contains three sub-modules for specialized kinds of parsing:

  • nltk.parser.chart defines chart parsing, which uses dynamic programming to efficiently parse texts.
  • nltk.parser.probabilistic defines probabilistic parsing, which associates a probability with each parse.