nltk.tgrep module

TGrep search implementation for NLTK trees

This module supports TGrep2 syntax for matching parts of NLTK Trees. Note that many tgrep operators require the tree passed to be a ParentedTree.

Usage

>>> from nltk.tree import ParentedTree
>>> from nltk.tgrep import tgrep_nodes, tgrep_positions
>>> tree = ParentedTree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP bit) (NP (DT a) (NN cat)))')
>>> list(tgrep_nodes('NN', [tree]))
[[ParentedTree('NN', ['dog']), ParentedTree('NN', ['cat'])]]
>>> list(tgrep_positions('NN', [tree]))
[[(0, 2), (2, 1)]]
>>> list(tgrep_nodes('DT', [tree]))
[[ParentedTree('DT', ['the']), ParentedTree('DT', ['a'])]]
>>> list(tgrep_nodes('DT $ JJ', [tree]))
[[ParentedTree('DT', ['the'])]]

This implementation adds syntax to select nodes based on their NLTK tree position. This syntax is N plus a Python tuple representing the tree position. For instance, N(), N(0,), N(0,0) are valid node selectors. Example:

>>> tree = ParentedTree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP bit) (NP (DT a) (NN cat)))')
>>> tree[0,0]
ParentedTree('DT', ['the'])
>>> tree[0,0].treeposition()
(0, 0)
>>> list(tgrep_nodes('N(0,0)', [tree]))
[[ParentedTree('DT', ['the'])]]

Caveats:

  • Link modifiers: “?” and “=” are not implemented.

  • Tgrep compatibility: using “@” for “!”, “{” for “<”, and “}” for “>” is not implemented.

  • The “=” and “~” links are not implemented.

Known Issues:

  • There are some issues with link relations involving leaf nodes (which are represented as bare strings in NLTK trees). For instance, consider the tree:

    (S (A x))
    

    The search string * !>> S should select all nodes which are not dominated in some way by an S node (i.e., all nodes which are not descendants of an S). Clearly, in this tree, the only node which fulfills this criterion is the top node (since it is not dominated by anything). However, the code here will find both the top node and the leaf node x. This is because we cannot recover the parent of the leaf, since it is stored as a bare string.

    A possible workaround, when performing this kind of search, would be to filter out all leaf nodes.
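    The workaround can be sketched in two ways: by passing search_leaves=False (the parameter documented for tgrep_nodes and tgrep_positions), or by filtering bare strings out of the results afterwards. This is a minimal sketch that assumes the pyparsing package is installed; the variable names are illustrative only.

```python
from nltk.tree import ParentedTree
from nltk.tgrep import tgrep_nodes

tree = ParentedTree.fromstring('(S (A x))')

# Option 1: skip leaf positions, so only real tree nodes are tested.
without_leaves = list(tgrep_nodes('* !>> S', [tree], search_leaves=False))[0]

# Option 2: search everything, then drop results that are bare strings.
with_leaves = list(tgrep_nodes('* !>> S', [tree]))[0]
filtered = [node for node in with_leaves if not isinstance(node, str)]

print([node.label() for node in without_leaves])
print([node.label() for node in filtered])
```

    Both options should leave only the top S node for this tree. Option 1 avoids evaluating the pattern on leaves at all.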

Implementation notes

This implementation is (somewhat awkwardly) based on lambda functions which are predicates on a node. A predicate is a function which returns either True or False; using a predicate function, we can identify sets of nodes with particular properties. A predicate function could, for instance, return True only if a particular node has a label matching a particular regular expression and has a daughter node which has no sisters. Because tgrep2 search strings can do things statefully (such as substituting in macros, and binding nodes with node labels), the actual predicate function is declared with three arguments:

pred = lambda n, m=None, l=None: True  # some logic here

  • n – a node in a tree; this argument must always be given

  • m – a dictionary mapping macro names onto predicate functions

  • l – a dictionary mapping node labels onto nodes in the tree

m and l are declared to default to None, and so need not be specified in a call to a predicate. Predicates which call other predicates must always pass the value of these arguments on. The top-level predicate (constructed by _tgrep_exprs_action) binds the macro definitions to m and initialises l to an empty dictionary.
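The three-argument convention can be illustrated without NLTK. The sketch below is not the module's actual code: the helper names (label_is, has_child, use_macro, both, top_level) and the dictionary-based nodes are hypothetical stand-ins, chosen only to show how m and l are threaded through nested predicates.

```python
# Hypothetical stand-ins; none of these names come from nltk.tgrep itself.

def label_is(expected):
    """Predicate factory: true when the node's label equals `expected`."""
    return lambda n, m=None, l=None: n["label"] == expected

def has_child(child_pred):
    """Predicate factory: true when some child satisfies `child_pred`."""
    # m and l are passed on unchanged to the nested predicate.
    return lambda n, m=None, l=None: any(
        child_pred(c, m, l) for c in n["children"]
    )

def use_macro(name):
    """Predicate factory: look up another predicate in the macro table m."""
    return lambda n, m=None, l=None: m[name](n, m, l)

def both(p, q):
    """Predicate factory: true when both p and q hold for the node."""
    return lambda n, m=None, l=None: p(n, m, l) and q(n, m, l)

def top_level(pred, macros):
    """Bind the macro table and start each call with an empty label table."""
    return lambda n: pred(n, macros, {})

# Toy nodes represented as plain dictionaries.
dog = {"label": "NN", "children": []}
np = {"label": "NP", "children": [dog]}

macros = {"NOUN": label_is("NN")}
np_with_noun = top_level(
    both(label_is("NP"), has_child(use_macro("NOUN"))), macros
)

result_np = np_with_noun(np)    # NP node that has an NN child
result_dog = np_with_noun(dog)  # NN node, fails the label test
print(result_np, result_dog)
```

Only the top-level wrapper supplies concrete values for m and l; every inner predicate simply forwards whatever it received.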

exception nltk.tgrep.TgrepException

Bases: Exception

Tgrep exception type.

nltk.tgrep.ancestors(node)

Returns the list of all nodes dominating the given tree node. This function will not work with leaf nodes, since they are stored as bare strings and there is no way to recover their parent.
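A short sketch of calling ancestors on a ParentedTree node and on a leaf; the tree and variable names are illustrative only.

```python
from nltk.tree import ParentedTree
from nltk.tgrep import ancestors

tree = ParentedTree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP bit))')
nn = tree[0, 2]  # the (NN dog) subtree

# Ancestors are collected by following parent() upwards from the node.
labels = [node.label() for node in ancestors(nn)]

# A leaf is a bare string with no parent() method.
leaf_ancestors = ancestors(nn[0])

print(labels)
print(leaf_ancestors)
```

For the NN node this should list NP and then S, nearest ancestor first. For the leaf the list comes back empty.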

nltk.tgrep.tgrep_compile(tgrep_string)

Parses (and tokenizes, if necessary) a TGrep search string into a lambda function.
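Since tgrep_nodes and tgrep_positions accept the output of tgrep_compile as their pattern, a search string can be compiled once and reused. The following is a minimal sketch, assuming pyparsing is installed; the trees and the pattern are made up for illustration.

```python
from nltk.tree import ParentedTree
from nltk.tgrep import tgrep_compile, tgrep_positions

trees = [
    ParentedTree.fromstring('(S (NP (DT the) (NN dog)) (VP barked))'),
    ParentedTree.fromstring('(S (NP (NN cats)) (VP sleep))'),
]

# Compile the search string once: NN nodes whose parent is an NP.
pattern = tgrep_compile('NN > NP')

# Reuse the compiled predicate across all trees.
positions = list(tgrep_positions(pattern, trees))

# The compiled pattern is itself a predicate on a single node.
is_match = bool(pattern(trees[0][0, 1]))

print(positions)
print(is_match)
```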

nltk.tgrep.tgrep_nodes(pattern, trees, search_leaves=True)

Return the tree nodes in the trees which match the given pattern.

Parameters
  • pattern (str or output of tgrep_compile()) – a tgrep search pattern

  • trees (iter(ParentedTree) or iter(Tree)) – a sequence of NLTK trees (usually ParentedTrees)

  • search_leaves (bool) – whether to return matching leaf nodes

Return type

iter(tree nodes)

nltk.tgrep.tgrep_positions(pattern, trees, search_leaves=True)

Return the tree positions in the trees which match the given pattern.

Parameters
  • pattern (str or output of tgrep_compile()) – a tgrep search pattern

  • trees (iter(ParentedTree) or iter(Tree)) – a sequence of NLTK trees (usually ParentedTrees)

  • search_leaves (bool) – whether to return matching leaf nodes

Return type

iter(tree positions)

nltk.tgrep.tgrep_tokenize(tgrep_string)

Tokenizes a TGrep search string into separate tokens.
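A small sketch of inspecting how a search string is split into tokens, assuming pyparsing is installed; the search string is an arbitrary example.

```python
from nltk.tgrep import tgrep_tokenize

# Node names, link operators and parentheses become separate tokens.
tokens = tgrep_tokenize('NP < (NN $ DT)')
print(tokens)
```

This can help when a pattern fails to compile and it is unclear where the parser stopped.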

nltk.tgrep.treepositions_no_leaves(tree)

Returns all the tree positions in the given tree which are not leaf nodes.
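A brief comparison with Tree.treepositions(), using the small tree from the Known Issues section; the variable names are illustrative.

```python
from nltk.tree import Tree
from nltk.tgrep import treepositions_no_leaves

tree = Tree.fromstring('(S (A x))')

all_positions = tree.treepositions()             # includes the leaf 'x'
inner_positions = treepositions_no_leaves(tree)  # leaf position dropped

print(all_positions)
print(inner_positions)
```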

nltk.tgrep.unique_ancestors(node)

Returns the list of all nodes dominating the given node, where there is only a single path of descent.
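The sketch below contrasts a non-branching chain with a branching tree; both trees are made up for illustration.

```python
from nltk.tree import ParentedTree
from nltk.tgrep import unique_ancestors

# Every ancestor of NN has exactly one child: a single path of descent.
chain = ParentedTree.fromstring('(S (NP (NN dog)))')
chain_labels = [n.label() for n in unique_ancestors(chain[0, 0])]

# The NP here has two children, so the walk upwards stops immediately.
branching = ParentedTree.fromstring('(S (NP (DT the) (NN dog)))')
branching_labels = [n.label() for n in unique_ancestors(branching[0, 1])]

print(chain_labels)
print(branching_labels)
```

In the chain, both NP and S should be reported. In the branching tree the result should be empty, because the immediate parent already has more than one child.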