nltk.sem.relextract module

Code for extracting relational triples from the ieer and conll2002 corpora.

Relations are stored internally as dictionaries (‘reldicts’).

The two serialization outputs are “rtuple” and “clause”.

  • An rtuple is a tuple of the form (subj, filler, obj), where subj and obj are pairs of Named Entity mentions, and filler is the string of words occurring between subj and obj (with no intervening NEs). Strings are printed via repr() to circumvent locale variations in rendering UTF-8 encoded strings.

  • A clause is an atom of the form relsym(subjsym, objsym), where the relation, subject and object have been canonicalized to single strings.
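The two output shapes can be illustrated with a minimal sketch. The helpers format_rtuple and format_clause below are hypothetical names written for this illustration and are not the module's own functions; the dictionary keys follow the reldict keys listed under semi_rel2reldict, and the sample values are made up. The exact strings produced by the library may differ.

```python
# Hypothetical helpers; they only imitate the two output shapes described above.

def format_rtuple(reldict):
    """Render a reldict as a (subj, filler, obj) style string, using repr() for text."""
    return "[%s: %r] %r [%s: %r]" % (
        reldict["subjclass"], reldict["subjtext"],
        reldict["filler"],
        reldict["objclass"], reldict["objtext"],
    )

def format_clause(reldict, relsym):
    """Render a reldict as an atom of the form relsym(subjsym, objsym)."""
    return "%s(%r, %r)" % (relsym, reldict["subjsym"], reldict["objsym"])

# Made-up example values, not taken from either corpus.
example = {
    "subjclass": "ORGANIZATION", "subjtext": "Acme Corp", "subjsym": "acme_corp",
    "filler": "is based in",
    "objclass": "LOCATION", "objtext": "Springfield", "objsym": "springfield",
}

print(format_rtuple(example))
print(format_clause(example, "based_in"))
```

The rtuple form keeps the surface text of the mentions and the filler, while the clause form keeps only the canonical symbols and the relation label.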

nltk.sem.relextract.class_abbrev(type)[source]

Abbreviate an NE class name.

Parameters

type (str) – an NE class name

Return type

str
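As a rough sketch of what abbreviating a class name could look like: the mapping and the function name abbreviate_class below are assumptions made for illustration, and the library's actual table may differ.

```python
# Assumed mapping for illustration; the library's actual table may differ.
LONG_TO_SHORT = {"LOCATION": "LOC", "ORGANIZATION": "ORG", "PERSON": "PER"}

def abbreviate_class(ne_class):
    """Return a short form of an NE class name, or the name itself if none is known."""
    return LONG_TO_SHORT.get(ne_class, ne_class)

print(abbreviate_class("LOCATION"))
print(abbreviate_class("DATE"))
```

In this sketch, names without a known short form pass through unchanged.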

nltk.sem.relextract.clause(reldict, relsym)[source]

Print the relation in clausal form.

Parameters
  • reldict (defaultdict) – a relation dictionary

  • relsym (str) – a label for the relation

nltk.sem.relextract.conllesp()[source]

nltk.sem.relextract.conllned(trace=1)[source]

Find the copula+‘van’ relation (‘of’) in the Dutch tagged training corpus from CoNLL 2002.

nltk.sem.relextract.descape_entity(m, defs=html.entities.entitydefs)[source]

Translate one entity to its ISO Latin value. Inspired by an example from effbot.org.
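A minimal sketch of entity translation with a name-to-character table follows. It assumes that m is a regular-expression match whose first group is the entity name, and it uses Python's standard html.entities.entitydefs as the table; the names ENTITY, descape_one and descape are hypothetical and exist only in this example.

```python
import re
from html.entities import entitydefs

# Matches named entities such as "&eacute;" and captures the name.
ENTITY = re.compile(r"&(\w+?);")

def descape_one(match, defs=entitydefs):
    """Translate a single matched entity; leave unknown entities untouched."""
    try:
        return defs[match.group(1)]
    except KeyError:
        return match.group(0)

def descape(text):
    """Replace every known named entity in text with its character."""
    return ENTITY.sub(descape_one, text)

print(descape("caf&eacute; &amp; bar"))
print(descape("&bogus; stays as is"))
```

Passing the translation function as the replacement argument of sub() lets one pattern handle every entity in a string.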

nltk.sem.relextract.extract_rels(subjclass, objclass, doc, corpus='ace', pattern=None, window=10)[source]

Filter the output of semi_rel2reldict according to specified NE classes and a filler pattern.

The parameters subjclass and objclass can be used to restrict the Named Entities to particular types (any of ‘LOCATION’, ‘ORGANIZATION’, ‘PERSON’, ‘DURATION’, ‘DATE’, ‘CARDINAL’, ‘PERCENT’, ‘MONEY’, ‘MEASURE’).

Parameters
  • subjclass (str) – the class of the subject Named Entity.

  • objclass (str) – the class of the object Named Entity.

  • doc (ieer document or a list of chunk trees) – input document

  • corpus (str) – name of the corpus to take as input; the documented values are ‘ieer’ and ‘conll2002’ (note that the default in the signature is ‘ace’)

  • pattern (SRE_Pattern) – a compiled regular expression for filtering the fillers of retrieved triples.

  • window (int) – filters out fillers whose length in words exceeds this threshold

Returns

see semi_rel2reldict

Return type

list(defaultdict)
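The filtering step described for extract_rels can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: filter_reldicts is a hypothetical name, the window is assumed to count whitespace-separated words of the filler, and the candidate dictionaries are made up.

```python
import re

def filter_reldicts(reldicts, subjclass, objclass, pattern, window=10):
    """Keep reldicts whose NE classes match and whose filler is short and matches pattern."""
    return [
        rel for rel in reldicts
        if rel["subjclass"] == subjclass
        and rel["objclass"] == objclass
        and len(rel["filler"].split()) <= window   # assumed: length counted in words
        and pattern.match(rel["filler"])
    ]

# A filler pattern that requires the word "in" somewhere in the filler.
IN = re.compile(r".*\bin\b")

# Made-up candidates, not taken from either corpus.
candidates = [
    {"subjclass": "ORGANIZATION", "filler": "is based in", "objclass": "LOCATION"},
    {"subjclass": "ORGANIZATION", "filler": "visited", "objclass": "LOCATION"},
    {"subjclass": "PERSON", "filler": "lives in", "objclass": "LOCATION"},
]

kept = filter_reldicts(candidates, "ORGANIZATION", "LOCATION", IN)
print([rel["filler"] for rel in kept])
```

With the library itself, the corresponding call would be along the lines of extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN), where doc is a parsed ieer document.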

nltk.sem.relextract.ieer_headlines()[source]

nltk.sem.relextract.in_demo(trace=0, sql=True)[source]

Select pairs of organizations and locations whose mentions occur with an intervening occurrence of the preposition “in”.

If the sql parameter is set to True, then the entity pairs are loaded into an in-memory database, and subsequently pulled out using an SQL “SELECT” query.
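The load-then-query step can be sketched with the standard sqlite3 module. The table name, column names and sample pairs below are assumptions chosen for illustration; the demo's own schema and data may differ.

```python
import sqlite3

# Made-up (organization, location) pairs standing in for extracted relations.
pairs = [
    ("acme_corp", "springfield"),
    ("globex", "springfield"),
    ("initech", "austin"),
]

# ":memory:" creates a database that lives only for this connection.
connection = sqlite3.connect(":memory:")
cur = connection.cursor()
cur.execute("CREATE TABLE Locations (OrgName TEXT, LocationName TEXT)")
cur.executemany("INSERT INTO Locations VALUES (?, ?)", pairs)
connection.commit()

# Pull the organizations for one location back out with a SELECT query.
cur.execute(
    "SELECT OrgName FROM Locations WHERE LocationName = ? ORDER BY OrgName",
    ("springfield",),
)
orgs = [row[0] for row in cur.fetchall()]
print(orgs)
connection.close()
```

Storing the canonical symbols rather than the surface text keeps the query independent of how a mention happened to be spelled in the source document.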

nltk.sem.relextract.list2sym(lst)[source]

Convert a list of strings into a canonical symbol.

Parameters

lst (list) – a list of strings

Returns

a Unicode string without whitespace

Return type

unicode
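One plausible canonicalization is sketched below. The conventions used here (underscore as separator, lowercasing, dropping periods) are assumptions for illustration, and to_symbol is a hypothetical name; the library's exact rules may differ.

```python
def to_symbol(words):
    """Join words into one lowercase, whitespace-free symbol (assumed conventions)."""
    # Split on any internal whitespace first so the result never contains spaces.
    tokens = " ".join(words).split()
    return "_".join(tokens).lower().replace(".", "")

print(to_symbol(["New", "York"]))
print(to_symbol(["U.S.", "Navy"]))
```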

nltk.sem.relextract.ne_chunked()[source]

nltk.sem.relextract.roles_demo(trace=0)[source]

nltk.sem.relextract.rtuple(reldict, lcon=False, rcon=False)[source]

Pretty print the reldict as an rtuple.

Parameters

reldict (defaultdict) – a relation dictionary

nltk.sem.relextract.semi_rel2reldict(pairs, window=5, trace=False)[source]

Converts the pairs generated by tree2semi_rel into a ‘reldict’: a dictionary which stores information about the subject and object NEs plus the filler between them. Additionally, left and right contexts of at most window items each are captured (within a given input sentence).

Parameters
  • pairs – a pair of list(str) and Tree, as generated by tree2semi_rel

  • window (int) – a threshold for the number of items to include in the left and right context

Returns

‘relation’ dictionaries whose keys are ‘lcon’, ‘subjclass’, ‘subjtext’, ‘subjsym’, ‘filler’, ‘objclass’, ‘objtext’, ‘objsym’ and ‘rcon’

Return type

list(defaultdict)
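The construction of a reldict from consecutive pairs can be sketched without the library. In this illustration, plain (label, words) tuples stand in for Tree objects, pairs_to_reldicts is a hypothetical name, the symbol fields use a simplified canonicalization, and the sketch assumes that a third pair must follow to supply the right context. It is not the library's implementation.

```python
from collections import defaultdict

def pairs_to_reldicts(pairs, window=5):
    """Build relation dictionaries from (words, (ne_label, ne_words)) pairs.

    Plain tuples stand in for the Tree objects of the real pairs.
    Assumes window >= 1.
    """
    reldicts = []
    # A relation uses three consecutive pairs: the first supplies the left
    # context and subject, the second the filler and object, and the third
    # the right context.
    for i in range(len(pairs) - 2):
        left, (subj_label, subj_words) = pairs[i]
        filler, (obj_label, obj_words) = pairs[i + 1]
        right = pairs[i + 2][0]

        rel = defaultdict(str)
        rel["lcon"] = " ".join(left[-window:])    # at most `window` words
        rel["subjclass"] = subj_label
        rel["subjtext"] = " ".join(subj_words)
        rel["subjsym"] = "_".join(subj_words).lower()
        rel["filler"] = " ".join(filler)
        rel["objclass"] = obj_label
        rel["objtext"] = " ".join(obj_words)
        rel["objsym"] = "_".join(obj_words).lower()
        rel["rcon"] = " ".join(right[:window])    # at most `window` words
        reldicts.append(rel)
    return reldicts

# Made-up input, not taken from either corpus.
pairs = [
    (["Yesterday"], ("ORGANIZATION", ["Acme", "Corp"])),
    (["opened", "an", "office", "in"], ("LOCATION", ["Springfield"])),
    (["according", "to"], ("PERSON", ["Jane", "Doe"])),
]
reldicts = pairs_to_reldicts(pairs)
print(reldicts[0]["filler"])
```

Because each dictionary is a defaultdict(str), looking up a key that was never set yields an empty string instead of raising KeyError.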

nltk.sem.relextract.tree2semi_rel(tree)[source]

Group a chunk structure into a list of ‘semi-relations’ of the form (list(str), Tree).

In order to facilitate the construction of (Tree, string, Tree) triples, this identifies pairs whose first member is a list (possibly empty) of terminal strings, and whose second member is a Tree of the form (NE_label, terminals).

Parameters

tree – a chunk tree

Returns

a list of pairs (list(str), Tree)

Return type

list of tuple
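The grouping can be sketched on a flat sequence in which plain strings are ordinary words and (label, words) tuples stand in for NE subtrees. The function name group_semi_relations and the sample input are hypothetical; this is an illustration of the idea rather than the library's code.

```python
def group_semi_relations(chunks):
    """Group a flat chunk sequence into (preceding_words, ne_node) pairs.

    chunks mixes plain word strings with (ne_label, words) tuples that
    stand in for NE subtrees.
    """
    semi_rels = []
    words = []
    for item in chunks:
        if isinstance(item, tuple):
            # An NE node closes the current group; the word list may be empty.
            semi_rels.append((words, item))
            words = []
        else:
            words.append(item)
    # In this sketch, words after the last NE node are not attached to any pair.
    return semi_rels

# Made-up input, not taken from either corpus.
chunks = [
    ("ORGANIZATION", ["Acme"]),
    "opened", "in",
    ("LOCATION", ["Springfield"]),
    "today",
]
semi_rels = group_semi_relations(chunks)
print(semi_rels)
```

Each resulting pair carries the words that precede an NE, so two adjacent pairs together provide a subject, the filler between the mentions, and an object.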