NLTK :: nltk.corpus.reader.wordnet module

class nltk.corpus.reader.wordnet.Lemma[source]¶

Bases: _WordNetObject

The lexical entry for a single morphological form of a sense-disambiguated word.

Create a Lemma from a “<word>.<pos>.<number>.<lemma>” string where:

<word> is the morphological stem identifying the synset
<pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
<number> is the sense number, counting from 1 (as in ‘salt.n.03’)
<lemma> is the morphological form of interest

Note that <word> and <lemma> can be different, e.g. the Synset ‘salt.n.03’ has the Lemmas ‘salt.n.03.salt’, ‘salt.n.03.saltiness’ and ‘salt.n.03.salinity’.
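As a rough illustration of this naming scheme, the sketch below splits a lemma name into its four parts. The helper split_lemma_name and its regular expression are assumptions made for this example; they are not part of the NLTK API.

```python
import re

# Hypothetical helper (not an NLTK function): splits "<word>.<pos>.<number>.<lemma>".
_LEMMA_NAME_RE = re.compile(
    r"^(?P<word>.+)\.(?P<pos>[nvasr])\.(?P<number>\d+)\.(?P<lemma>.+)$"
)


def split_lemma_name(name):
    """Return (word, pos, sense_number, lemma) for a lemma name string."""
    match = _LEMMA_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a lemma name: {name!r}")
    return (match["word"], match["pos"], int(match["number"]), match["lemma"])


# The synset part ('salt.n.03') and the lemma part ('salinity') can differ.
print(split_lemma_name("salt.n.03.salinity"))
```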

Lemma attributes, accessible via methods with the same name:

name: The canonical name of this lemma.
synset: The synset that this lemma belongs to.
syntactic_marker: For adjectives, the WordNet string identifying the syntactic position relative to the modified noun (see https://wordnet.princeton.edu/documentation/wninput5wn). For all other parts of speech, this attribute is None.
count: The frequency of this lemma in wordnet.

Lemma methods:

Lemmas have the following methods for retrieving related Lemmas. They correspond to the names for the pointer symbols defined in the WordNet documentation (https://wordnet.princeton.edu/documentation/wninput5wn). These methods all return lists of Lemmas:

antonyms
hypernyms, instance_hypernyms
hyponyms, instance_hyponyms
member_holonyms, substance_holonyms, part_holonyms
member_meronyms, substance_meronyms, part_meronyms
topic_domains, region_domains, usage_domains
attributes
derivationally_related_forms
entailments
causes
also_sees
verb_groups
similar_tos
pertainyms

__init__(wordnet_corpus_reader, synset, name, lexname_index, lex_id, syntactic_marker)[source]¶

antonyms()[source]¶

count()[source]¶: Return the frequency count for this Lemma

derivationally_related_forms()[source]¶

frame_ids()[source]¶

frame_strings()[source]¶

key()[source]¶

lang()[source]¶

name()[source]¶

pertainyms()[source]¶

synset()[source]¶

syntactic_marker()[source]¶

class nltk.corpus.reader.wordnet.Synset[source]¶

Bases: _WordNetObject

Create a Synset from a “<lemma>.<pos>.<number>” string where:

<lemma> is the word’s morphological stem
<pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB
<number> is the sense number, counting from 1 (as in ‘dog.n.01’)

Synset attributes, accessible via methods with the same name:

name: The canonical name of this synset, formed using the first lemma of this synset. Note that this may be different from the name passed to the constructor if that string used a different lemma to identify the synset.
pos: The synset’s part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB.
lemmas: A list of the Lemma objects for this synset.
definition: The definition for this synset.
examples: A list of example strings for this synset.
offset: The offset in the WordNet dict file of this synset.
lexname: The name of the lexicographer file containing this synset.

Synset methods:

Synsets have the following methods for retrieving related Synsets. They correspond to the names for the pointer symbols defined in the WordNet documentation (https://wordnet.princeton.edu/documentation/wninput5wn). These methods all return lists of Synsets.

hypernyms, instance_hypernyms
hyponyms, instance_hyponyms
member_holonyms, substance_holonyms, part_holonyms
member_meronyms, substance_meronyms, part_meronyms
attributes
entailments
causes
also_sees
verb_groups
similar_tos

Additionally, Synsets support the following methods specific to the hypernym relation:

root_hypernyms
common_hypernyms
lowest_common_hypernyms

Note that Synsets do not support the following relations because these are defined by WordNet as lexical relations:

antonyms
derivationally_related_forms
pertainyms

__init__(wordnet_corpus_reader)[source]¶

acyclic_tree(children=<built-in function iter>, depth=-1, cut_mark=None, traversed=None, verbose=False)¶

Parameters:

tree – the tree root
children – a function taking as argument a tree node
depth – the maximum depth of the search
cut_mark – the mark to add when cycles are truncated
traversed – the set of traversed nodes
verbose – to print warnings when cycles are discarded

Returns:

the tree in depth-first order

Traverse the nodes of a tree in depth-first order, discarding any cycles within a branch and adding cut_mark (when specified) where a cycle was truncated. The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.

Catches all cycles:

>>> import nltk
>>> from nltk.util import acyclic_depth_first as acyclic_tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(acyclic_tree(wn.synset('dog.n.01'), lambda s:sorted(s.hypernyms()),cut_mark='...'))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'), "Cycle(Synset('animal.n.01'),-3,...)"]]

closure(rel, depth=-1)[source]¶

Return the transitive closure of source under the rel relationship, breadth-first, discarding cycles:

>>> from nltk.corpus import wordnet as wn
>>> computer = wn.synset('computer.n.01')
>>> topic = lambda s:s.topic_domains()
>>> print(list(computer.closure(topic)))
[Synset('computer_science.n.01')]

UserWarning: Discarded redundant search for Synset(‘computer.n.01’) at depth 2

Include redundant paths (but only once), avoiding duplicate searches (from ‘animal.n.01’ to ‘entity.n.01’):

>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:sorted(s.hypernyms())
>>> print(list(dog.closure(hyp)))
[Synset('canine.n.02'), Synset('domestic_animal.n.01'), Synset('carnivore.n.01'), Synset('animal.n.01'), Synset('placental.n.01'), Synset('organism.n.01'), Synset('mammal.n.01'), Synset('living_thing.n.01'), Synset('vertebrate.n.01'), Synset('whole.n.02'), Synset('chordate.n.01'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

UserWarning: Discarded redundant search for Synset(‘animal.n.01’) at depth 7
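The breadth-first order of these results can be reproduced on a toy graph. The following is a minimal sketch, assuming a plain dictionary stands in for the relation; the function bfs_closure and the HYPERNYMS table are hypothetical and do not reproduce NLTK's implementation or its warnings.

```python
from collections import deque

# Toy hypernym table (an assumption for this example, not WordNet data).
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["animal"],
    "animal": [],
}


def bfs_closure(start, rel, depth=-1):
    """Yield every node reachable from start via rel, breadth-first, once each."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, level = queue.popleft()
        if depth >= 0 and level >= depth:
            continue  # do not expand nodes at the depth limit
        for child in rel(node):
            if child not in seen:  # discard redundant searches
                seen.add(child)
                yield child
                queue.append((child, level + 1))


print(list(bfs_closure("dog", lambda node: HYPERNYMS[node])))
```

Here 'animal' is reachable along two branches but is reported only once, the first time the search meets it.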

common_hypernyms(other)[source]¶

Find all synsets that are hypernyms of this synset and the other synset.

Parameters: other (Synset) – other input synset.
Returns: The synsets that are hypernyms of both synsets.

definition(lang='eng')[source]¶: Return definition in specified language

examples(lang='eng')[source]¶: Return examples in specified language

frame_ids()[source]¶

hypernym_distances(distance=0, simulate_root=False)[source]¶

Get the path(s) from this synset to the root, counting the distance of each node from the initial node on the way. A set of (synset, distance) tuples is returned.

Parameters: distance (int) – the distance (number of edges) from this hypernym to the original hypernym Synset on which this method was called.
Returns: A set of (Synset, int) tuples where each Synset is a hypernym of the first Synset.

hypernym_paths()[source]¶

Get the path(s) from this synset to the root, where each path is a list of the synset nodes traversed on the way to the root.

Returns: A list of lists, where each list gives the node sequence connecting the initial Synset node and a root node.
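A node with several hypernyms has several paths to the root. The sketch below enumerates them on a toy graph; the function toy_hypernym_paths, the HYPERNYMS table, and the choice to list each path starting from the initial node are all assumptions of this example.

```python
# Toy hypernym table (an assumption for this example, not WordNet data).
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": [],
}


def toy_hypernym_paths(node):
    """Return every path from node up to a root, one list of nodes per path."""
    parents = HYPERNYMS[node]
    if not parents:
        return [[node]]  # a root is a path of its own
    return [
        [node] + path
        for parent in parents
        for path in toy_hypernym_paths(parent)
    ]


for path in toy_hypernym_paths("dog"):
    print(path)
```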

jcn_similarity(other, ic, verbose=False)[source]¶

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects.
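The equation can be checked with plain arithmetic. This is a sketch only: jcn_score is a hypothetical helper that takes the three IC values directly, and returning infinity for a zero denominator is a choice made for this example rather than documented behavior.

```python
import math


def jcn_score(ic1, ic2, ic_lcs):
    """Jiang-Conrath score: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."""
    denominator = ic1 + ic2 - 2 * ic_lcs
    if denominator == 0:
        # Sketch-only convention: identical information content on all three
        # nodes means zero distance, reported here as infinite similarity.
        return math.inf
    return 1 / denominator


# IC values below are made up for illustration.
print(jcn_score(5.0, 7.0, 4.0))
```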

lch_similarity(other, verbose=False, simulate_root=True)[source]¶

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
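The relationship -log(p/2d) can be evaluated directly. In this sketch, lch_score is a hypothetical helper that takes p and d as already-computed positive numbers; how they are measured on the taxonomy is left out.

```python
import math


def lch_score(p, d):
    """Leacock-Chodorow score: -log(p / (2 * d)) for positive p and d."""
    if p <= 0 or d <= 0:
        raise ValueError("p and d must be positive")
    return -math.log(p / (2 * d))


# A shorter path in a taxonomy of the same depth gives a higher score.
print(lch_score(1, 10), lch_score(4, 10))
```

For a fixed p, the score grows with d, which is why the maximum score varies with the taxonomy depth.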

lemma_names(lang='eng')[source]¶: Return all the lemma_names associated with the synset

lemmas(lang='eng')[source]¶: Return all the lemma objects associated with the synset

lexname()[source]¶

lin_similarity(other, ic, verbose=False)[source]¶

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
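The equation can be checked with plain arithmetic. This is a sketch only: lin_score is a hypothetical helper that takes the three IC values directly, and returning 0.0 when both input IC values are zero is a choice made for this example.

```python
def lin_score(ic1, ic2, ic_lcs):
    """Lin score: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    denominator = ic1 + ic2
    if denominator == 0:
        return 0.0  # sketch-only convention for an undefined ratio
    return 2 * ic_lcs / denominator


# IC values below are made up for illustration.
print(lin_score(5.0, 7.0, 4.0))
print(lin_score(4.0, 4.0, 4.0))  # a sense compared with itself
```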

lowest_common_hypernyms(other, simulate_root=False, use_min_depth=False)[source]¶

Get a list of lowest synset(s) that both synsets have as a hypernym. When use_min_depth == False, this means that the synset which appears as a hypernym of both self and other with the lowest maximum depth is returned; if there are multiple such synsets at the same depth, they are all returned.

However, if use_min_depth == True then the synset(s) which has/have the lowest minimum depth and appear(s) in both paths is/are returned.

By setting the use_min_depth flag to True, the behavior of NLTK2 can be preserved. This was changed in NLTK3 to give more accurate results in a small set of cases, generally with synsets concerning people (e.g. ‘chef.n.01’, ‘fireman.n.01’).

This method is an implementation of Ted Pedersen’s “Lowest Common Subsumer” method from the Perl Wordnet module. It can return either “self” or “other” if they are a hypernym of the other.

Parameters:

other (Synset) – other input synset
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (False by default) creates a fake root that connects all the taxonomies. Set it to True to enable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will need to be added for nouns as well.
use_min_depth (bool) – This setting mimics the older (v2) behavior of NLTK wordnet. If True, the min_depth function is used to calculate the lowest common hypernyms. This is known to give strange results for some synset pairs (e.g. ‘chef.n.01’, ‘fireman.n.01’) but is retained for backwards compatibility.

Returns:

The synsets that are the lowest common hypernyms of both synsets
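The default case (use_min_depth == False) can be sketched on a toy graph: collect the ancestors shared by both nodes, then keep those lying deepest in the taxonomy. The names toy_lowest_common_hypernyms, ancestors, max_depth and the HYPERNYMS table are assumptions of this example, not NLTK's implementation.

```python
# Toy hypernym table (an assumption for this example, not WordNet data).
HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def ancestors(node):
    """Return node together with all of its direct and indirect hypernyms."""
    found = {node}
    for parent in HYPERNYMS[node]:
        found |= ancestors(parent)
    return found


def max_depth(node):
    """Length of the longest hypernym path from node to a root."""
    parents = HYPERNYMS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)


def toy_lowest_common_hypernyms(first, second):
    """Return the shared ancestors that lie deepest in the taxonomy."""
    common = ancestors(first) & ancestors(second)
    if not common:
        return []
    deepest = max(max_depth(node) for node in common)
    return sorted(node for node in common if max_depth(node) == deepest)


print(toy_lowest_common_hypernyms("dog", "cat"))
print(toy_lowest_common_hypernyms("dog", "canine"))
```

Because each node counts as its own ancestor here, the second call returns one of the inputs, matching the note that the method can return either “self” or “other”.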

max_depth()[source]¶

Returns: The length of the longest hypernym path from this synset to the root.

min_depth()[source]¶

Returns: The length of the shortest hypernym path from this synset to the root.

mst(children=<built-in function iter>)¶

Parameters:

tree – the tree root
children – a function taking as argument a tree node

Output a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding any cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node’s children.

>>> import nltk
>>> from nltk.util import unweighted_minimum_spanning_tree as mst
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(mst(wn.synset('bound.a.01'), lambda s:sorted(s.also_sees())))
[Synset('bound.a.01'),
 [Synset('unfree.a.02'),
  [Synset('confined.a.02')],
  [Synset('dependent.a.01')],
  [Synset('restricted.a.01'), [Synset('classified.a.02')]]]]

name()[source]¶

offset()[source]¶

path_similarity(other, verbose=False, simulate_root=True)[source]¶

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
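One formula consistent with this description is 1 / (distance + 1), where distance is the number of edges on the shortest connecting path. The sketch below assumes that formula; path_score is a hypothetical helper, not an NLTK function.

```python
def path_score(distance):
    """Map a shortest-path distance (in edges) to a score in (0, 1].

    Returns None when no connecting path exists (distance is None).
    """
    if distance is None:
        return None
    return 1 / (distance + 1)


print(path_score(0))     # a sense compared with itself
print(path_score(4))     # senses four edges apart
print(path_score(None))  # no connecting path
```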

pos()[source]¶

res_similarity(other, ic, verbose=False)[source]¶

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

root_hypernyms()[source]¶: Get the topmost hypernyms of this synset in WordNet.

shortest_path_distance(other, simulate_root=False)[source]¶

Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, None is returned. If a node is compared with itself 0 is returned.

Parameters: other (Synset) – The Synset to which the shortest path will be found.
Returns: The number of edges in the shortest path connecting the two nodes, or None if no path exists.
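The procedure described above (record each node's ancestors with their distances, then pick the common ancestor with the smallest combined distance) can be sketched on a toy graph. The function names and the HYPERNYMS table are assumptions of this example.

```python
# Toy hypernym table with two unconnected taxonomies (not WordNet data).
HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": [],
    "run": ["move"],
    "move": [],
}


def ancestor_distances(node):
    """Map each ancestor (including node itself) to its minimum edge distance."""
    distances = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in HYPERNYMS[current]:
                if parent not in distances:  # first visit is the shortest
                    distances[parent] = distances[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return distances


def toy_shortest_path_distance(first, second):
    """Return the fewest edges linking the two nodes, or None if unconnected."""
    first_distances = ancestor_distances(first)
    second_distances = ancestor_distances(second)
    common = first_distances.keys() & second_distances.keys()
    if not common:
        return None
    return min(first_distances[n] + second_distances[n] for n in common)


print(toy_shortest_path_distance("dog", "cat"))
print(toy_shortest_path_distance("dog", "run"))
```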

tree(rel, depth=-1, cut_mark=None)[source]¶

Return the full relation tree, including self, discarding cycles:

>>> from nltk.corpus import wordnet as wn
>>> from pprint import pprint
>>> computer = wn.synset('computer.n.01')
>>> topic = lambda s:sorted(s.topic_domains())
>>> pprint(computer.tree(topic))
[Synset('computer.n.01'), [Synset('computer_science.n.01')]]

UserWarning: Discarded redundant search for Synset(‘computer.n.01’) at depth -3

But keep duplicate branches (from ‘animal.n.01’ to ‘entity.n.01’):

>>> dog = wn.synset('dog.n.01')
>>> hyp = lambda s:sorted(s.hypernyms())
>>> pprint(dog.tree(hyp))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'),
  [Synset('animal.n.01'),
   [Synset('organism.n.01'),
    [Synset('living_thing.n.01'),
     [Synset('whole.n.02'),
      [Synset('object.n.01'),
       [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]]]
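The nested-list shape of this output can be reproduced with a short recursive function. This is a minimal sketch on a toy graph; relation_tree and the HYPERNYMS table are hypothetical, and the depth and cut_mark parameters are omitted.

```python
# Toy hypernym table (an assumption for this example, not WordNet data).
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["animal"],
    "domestic_animal": ["animal"],
    "animal": [],
}


def relation_tree(node, rel, path=()):
    """Return [node, subtree, subtree, ...], skipping nodes already on the path."""
    branch = [node]
    for child in rel(node):
        if child == node or child in path:
            continue  # following this edge would close a cycle
        branch.append(relation_tree(child, rel, path + (node,)))
    return branch


print(relation_tree("dog", lambda node: HYPERNYMS[node]))
```

As in the example above, the branch below 'animal' appears twice because it is reached through two different parents.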

wup_similarity(other, verbose=False, simulate_root=True)[source]¶

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although the scores for nouns still do not always agree.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
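The usual statement of the Wu-Palmer measure is 2 * depth(lcs) / (depth(s1) + depth(s2)). The sketch below assumes that formula and takes the three depths as given; wup_score is a hypothetical helper, and the exact depth-counting convention used by a particular implementation may differ.

```python
def wup_score(depth1, depth2, depth_lcs):
    """Wu-Palmer style score from precomputed depths (all assumed positive)."""
    if depth1 <= 0 or depth2 <= 0:
        raise ValueError("depths must be positive")
    return 2 * depth_lcs / (depth1 + depth2)


# Depths below are made up for illustration.
print(wup_score(4, 4, 2))
print(wup_score(4, 4, 4))  # the LCS is as deep as both senses
```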

class nltk.corpus.reader.wordnet.WordNetCorpusReader[source]¶

Bases: CorpusReader

A corpus reader used to access wordnet or its variants.

ADJ = 'a'¶

ADJ_SAT = 's'¶

ADV = 'r'¶

MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'n': [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'), ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'), ('men', 'man'), ('ies', 'y')], 'r': [], 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')], 'v': [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')]}¶

NOUN = 'n'¶

VERB = 'v'¶

__init__(root, omw_reader)[source]¶: Construct a new wordnet corpus reader, with the given root directory.

add_exomw()[source]¶

Add languages from Extended OMW

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> wn.add_exomw()
>>> print(wn.synset('intrinsically.r.01').lemmas(lang="eng_wikt"))
[Lemma('intrinsically.r.01.per_se'), Lemma('intrinsically.r.01.as_such')]

add_omw()[source]¶

add_provs(reader)[source]¶: Add languages from Multilingual Wordnet to the provenance dictionary

all_eng_synsets(pos=None)[source]¶

all_lemma_names(pos=None, lang='eng')[source]¶: Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

all_omw_synsets(pos=None, lang=None)[source]¶

all_synsets(pos=None, lang='eng')[source]¶: Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

citation(lang='eng')[source]¶: Return the contents of the citation.bib file (for omw). Use lang=lang to get the citation for an individual language.

custom_lemmas(tab_file, lang)[source]¶

Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK’s WordNet functions to then be used with that language.

See the “Tab files” section at https://omwn.org/omw1.html for documentation on the Multilingual WordNet tab file format.

Parameters:

tab_file – Tab file as a file or file-like object
lang (str) – ISO 639-3 code of the language of the tab file

digraph(inputs, rel=<function WordNetCorpusReader.<lambda>>, pos=None, maxdepth=-1, shapes=None, attr=None, verbose=False)[source]¶

Produce a graphical representation from ‘inputs’ (a list of start nodes, which can be a mix of Synsets, Lemmas and/or words), and a synset relation, for drawing with the ‘dot’ graph visualisation program from the Graphviz package.

Return a string in the DOT graph file language, which can then be converted to an image by nltk.parse.dependencygraph.dot2img(dot_string).

Optional Parameters:

rel – Wordnet synset relation
pos – for words, restricts Part of Speech to ‘n’, ‘v’, ‘a’ or ‘r’
maxdepth – limit the longest path
shapes – dictionary of strings that trigger a specified shape
attr – dictionary with global graph attributes
verbose – warn about cycles

>>> from nltk.corpus import wordnet as wn
>>> print(wn.digraph([wn.synset('dog.n.01')]))
digraph G {
"Synset('animal.n.01')" -> "Synset('organism.n.01')";
"Synset('canine.n.02')" -> "Synset('carnivore.n.01')";
"Synset('carnivore.n.01')" -> "Synset('placental.n.01')";
"Synset('chordate.n.01')" -> "Synset('animal.n.01')";
"Synset('dog.n.01')" -> "Synset('canine.n.02')";
"Synset('dog.n.01')" -> "Synset('domestic_animal.n.01')";
"Synset('domestic_animal.n.01')" -> "Synset('animal.n.01')";
"Synset('living_thing.n.01')" -> "Synset('whole.n.02')";
"Synset('mammal.n.01')" -> "Synset('vertebrate.n.01')";
"Synset('object.n.01')" -> "Synset('physical_entity.n.01')";
"Synset('organism.n.01')" -> "Synset('living_thing.n.01')";
"Synset('physical_entity.n.01')" -> "Synset('entity.n.01')";
"Synset('placental.n.01')" -> "Synset('mammal.n.01')";
"Synset('vertebrate.n.01')" -> "Synset('chordate.n.01')";
"Synset('whole.n.02')" -> "Synset('object.n.01')";
}
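The edge list above follows a simple pattern: one quoted "source" -> "target" line per relation edge, sorted. The following sketch produces the same shape from a toy graph; to_dot and the HYPERNYMS table are hypothetical, and the shapes, attr and maxdepth options are not modeled.

```python
# Toy hypernym table (an assumption for this example, not WordNet data).
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["animal"],
    "domestic_animal": ["animal"],
    "animal": [],
}


def to_dot(inputs, rel):
    """Return a DOT-language string with one edge per relation link."""
    edges = set()
    seen = set(inputs)
    stack = list(inputs)
    while stack:
        node = stack.pop()
        for target in rel(node):
            edges.add((node, target))
            if target not in seen:  # expand each node only once
                seen.add(target)
                stack.append(target)
    lines = ["digraph G {"]
    lines += [f'"{source}" -> "{target}";' for source, target in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)


dot_string = to_dot(["dog"], lambda node: HYPERNYMS[node])
print(dot_string)
```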

disable_custom_lemmas(lang)[source]¶: prevent synsets from being mistakenly added

doc(file='README', lang='eng')[source]¶: Return the contents of the readme, license or citation file. Use lang=lang to get the file for an individual language.

get_version()[source]¶

ic(corpus, weight_senses_equally=False, smoothing=1.0)[source]¶

Creates an information content lookup dictionary from a corpus.

Parameters:

corpus (CorpusReader) – The corpus from which we create an information content dictionary.
weight_senses_equally (bool) – If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
smoothing (float) – How much do we smooth synset counts (default is 1.0)

Returns:

An information content dictionary
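The effect of weight_senses_equally can be illustrated with a small counting routine. This sketch covers only the sense-weighting rule described above; sense_counts and the senses_of mapping are hypothetical, and treating smoothing as the starting count of every observed sense is an assumption of this example.

```python
from collections import defaultdict


def sense_counts(words, senses_of, weight_senses_equally=False, smoothing=1.0):
    """Accumulate a weighted count per sense over a sequence of words."""
    # Assumption of this sketch: each sense starts from the smoothing value.
    counts = defaultdict(lambda: smoothing)
    for word in words:
        senses = senses_of.get(word, [])
        if not senses:
            continue  # unknown words contribute nothing
        weight = 1.0 if weight_senses_equally else 1.0 / len(senses)
        for sense in senses:
            counts[sense] += weight
    return dict(counts)


# Made-up sense inventory for illustration.
senses_of = {"bank": ["bank.n.01", "bank.n.02", "bank.n.03"]}
print(sense_counts(["bank"], senses_of, smoothing=0.0))
print(sense_counts(["bank"], senses_of, weight_senses_equally=True, smoothing=0.0))
```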

index_sense(version=None)[source]¶: Read sense key to synset id mapping from index.sense file in corpus directory

jcn_similarity(synset1, synset2, ic, verbose=False)[source]¶

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:

synset1, synset2 (Synset) – The two Synsets being compared.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects.

langs()[source]¶: return a list of languages supported by Multilingual Wordnet

lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.

Parameters:

synset1, synset2 (Synset) – The two Synsets being compared.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

lemma(name, lang='eng')[source]¶: Return lemma object that matches the name

lemma_count(lemma)[source]¶: Return the frequency count for this Lemma

lemma_from_key(key)[source]¶

lemmas(lemma, pos=None, lang='eng')[source]¶: Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

license(lang='eng')[source]¶: Return the contents of LICENSE (for omw). Use lang=lang to get the license for an individual language.

lin_similarity(synset1, synset2, ic, verbose=False)[source]¶

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:

synset1, synset2 (Synset) – The two Synsets being compared.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.

map_to_many(version='wordnet')[source]¶

map_to_one(version='wordnet')[source]¶

map_wn(version='wordnet')[source]¶: Mapping from Wordnet ‘version’ to currently loaded Wordnet version

merged_synsets(version='wordnet')[source]¶

morphy(form, pos=None, check_exceptions=True)[source]¶

Find a possible base form for the given form, with the given part of speech, by checking WordNet’s list of exceptional forms, or by substituting suffixes for this part of speech. If pos=None, try every part of speech until lemmas are found. Return the first form found in WordNet, or None if no form is found.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
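The lookup order described above (exception list first, then suffix substitution) can be sketched without WordNet data. The noun rules below are copied from MORPHOLOGICAL_SUBSTITUTIONS; the VOCABULARY and EXCEPTIONS tables and the function base_form are toy stand-ins assumed for this example, not NLTK's implementation.

```python
# Noun rules as listed in MORPHOLOGICAL_SUBSTITUTIONS['n'].
NOUN_SUBSTITUTIONS = [
    ("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
    ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y"),
]

# Toy stand-ins for WordNet's noun index and exception list (assumptions).
VOCABULARY = {"dog", "church", "aardwolf", "abacus", "book"}
EXCEPTIONS = {"abaci": ["abacus"]}


def base_form(form):
    """Return the first known base form of a noun, or None if there is none."""
    # 1. Irregular forms come from the exception list.
    for candidate in EXCEPTIONS.get(form, []):
        if candidate in VOCABULARY:
            return candidate
    # 2. The form may already be a base form.
    if form in VOCABULARY:
        return form
    # 3. Otherwise try each suffix substitution in order.
    for old, new in NOUN_SUBSTITUTIONS:
        if form.endswith(old):
            candidate = form[: len(form) - len(old)] + new
            if candidate in VOCABULARY:
                return candidate
    return None


for word in ["dogs", "churches", "aardwolves", "abaci", "hardrock"]:
    print(word, "->", base_form(word))
```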

of2ss(of)[source]¶: take an id and return the synsets

path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (will only be true for verbs as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:

synset1, synset2 (Synset) – The two Synsets being compared.
simulate_root (bool) – The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.

Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.

readme(lang='eng')[source]¶: Return the contents of README (for omw). Use lang=lang to get the readme for an individual language.

res_similarity(synset1, synset2, ic, verbose=False)[source]¶

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:

synset1, synset2 (Synset) – The two Synsets being compared.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).

split_synsets(version='wordnet')[source]¶

ss2of(ss)[source]¶: return the ID of the synset

synonyms(word, lang='eng')[source]¶: return nested list with the synonyms of the different senses of word in the given language

synset(name)[source]¶

synset_from_pos_and_offset(pos, offset)[source]¶

pos: The synset’s part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB (‘a’, ‘s’, ‘r’, ‘n’, or ‘v’).
offset: The byte offset of this synset in the WordNet dict file for this pos.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_pos_and_offset('n', 1740))
Synset('entity.n.01')

synset_from_sense_key(sense_key)[source]¶

Retrieves synset based on a given sense_key. Sense keys can be obtained from lemma.key()

From https://wordnet.princeton.edu/documentation/senseidx5wn: A sense_key is represented as:

lemma % lex_sense (e.g. 'dog%1:18:01::')

where lex_sense is encoded as:

ss_type:lex_filenum:lex_id:head_word:head_id

Lemma:

ASCII text of word/collocation, in lower case

Ss_type:

synset type for the sense (1 digit int). The synset type is encoded as follows:

  1 NOUN
  2 VERB
  3 ADJECTIVE
  4 ADVERB
  5 ADJECTIVE SATELLITE

Lex_filenum:

name of lexicographer file containing the synset for the sense (2 digit int)

Lex_id:

when paired with lemma, uniquely identifies a sense in the lexicographer file (2 digit int)

Head_word:

lemma of the first word in satellite’s head synset. Only used if sense is in an adjective satellite synset.

Head_id:

uniquely identifies sense in a lexicographer file when paired with head_word (2 digit int). Only used if head_word is present.

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_sense_key("drive%1:04:03::"))
Synset('drive.n.06')

>>> print(wn.synset_from_sense_key("driving%1:04:03::"))
Synset('drive.n.06')
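The field layout described above can be unpacked with plain string operations. This is a sketch only: parse_sense_key and the dictionary it returns are hypothetical, and the mapping from ss_type to a part-of-speech letter assumes the module constants NOUN, VERB, ADJ, ADV and ADJ_SAT.

```python
# ss_type digit -> part-of-speech letter, following the encoding listed above.
SS_TYPE_TO_POS = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}


def parse_sense_key(sense_key):
    """Split 'lemma%ss_type:lex_filenum:lex_id:head_word:head_id' into fields."""
    lemma, lex_sense = sense_key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPE_TO_POS[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # The head fields are empty unless the sense is an adjective satellite.
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }


print(parse_sense_key("drive%1:04:03::"))
```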

synsets(lemma, pos=None, lang='eng', check_exceptions=True)[source]¶: Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.

words(lang='eng')[source]¶: return lemmas of the given language as list of words

wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although the scores for nouns still do not always agree.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root, which prevents this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to False to disable this behavior. For the noun taxonomy, there is usually a default root, except in WordNet version 1.6. If you are using WordNet 1.6, a fake root will be added for nouns as well.

Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
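
As a rough illustration of how depth and the Least Common Subsumer combine, here is a minimal sketch on a toy taxonomy that does not use NLTK. It assumes the commonly cited form of the Wu-Palmer score, 2 * depth(lcs) / (depth(s1) + depth(s2)), with the root counted as depth 1. The PARENT table and the function names are hypothetical, and the real implementation also has to handle multiple hypernym paths, which this single-inheritance tree avoids.

```python
# Hypothetical toy taxonomy: each node maps to its single hypernym (None for the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "table": "artifact",
}


def path_to_root(node):
    """Return [node, hypernym, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Number of nodes from the root down to node, so the root has depth 1."""
    return len(path_to_root(node))


def wup(a, b):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node that also lies above b is the deepest
    # common ancestor in a single-inheritance tree.
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("dog", "cat"))    # LCS is "animal": 2 * 2 / (3 + 3)
print(wup("dog", "table"))  # LCS is "entity": 2 * 1 / (3 + 3)
print(wup("dog", "dog"))    # identical senses score 1.0
```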

exception nltk.corpus.reader.wordnet.WordNetError[source]¶

Bases: Exception

An exception class for wordnet-related errors.

class nltk.corpus.reader.wordnet.WordNetICCorpusReader[source]¶

Bases: CorpusReader

A corpus reader for the WordNet information content corpus.

__init__(root, fileids)[source]¶

Parameters:

root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding –
The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.

ic(icfile)[source]¶

Load an information content file from the wordnet_ic corpus and return a dictionary. This dictionary has just two keys, NOUN and VERB, whose values are dictionaries that map from synsets to information content values.

Parameters:

icfile (str) – The name of the wordnet_ic file (e.g. “ic-brown.dat”)

Returns:

An information content dictionary

nltk.corpus.reader.wordnet.information_content(synset, ic)[source]¶

nltk.corpus.reader.wordnet.jcn_similarity(synset1, synset2, ic, verbose=False)[source]¶

Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects.
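
The equation can be checked by hand with a few numbers. The sketch below is not the library code: jcn is a hypothetical helper that takes three IC values directly, the values themselves are invented for illustration, and returning infinity when the denominator is zero is a choice made for this sketch only.

```python
import math


def jcn(ic_s1, ic_s2, ic_lcs):
    """Jiang-Conrath score: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."""
    distance = ic_s1 + ic_s2 - 2 * ic_lcs
    if distance == 0:
        # Assumption of this sketch: a zero distance (e.g. a synset compared
        # with itself) is reported as infinite similarity.
        return math.inf
    return 1 / distance


# Invented IC values: s1 = 8.0, s2 = 7.0, their LCS = 5.0.
print(jcn(8.0, 7.0, 5.0))  # 1 / (8 + 7 - 10) = 0.2
```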

nltk.corpus.reader.wordnet.lch_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶

Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as in path_similarity) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/(2d)) where p is the shortest path length and d is the taxonomy depth.

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root, which prevents this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to False to disable this behavior. For the noun taxonomy, there is usually a default root, except in WordNet version 1.6. If you are using WordNet 1.6, a fake root will be added for nouns as well.

Returns:

A score denoting the similarity of the two Synset objects, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.
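
To see why the maximum score depends on the taxonomy depth, the formula can be evaluated directly. This is a minimal sketch, not the library code: lch is a hypothetical helper, the depth of 20 is an invented value, and it assumes that p is counted so that a synset compared with itself has p = 1 (otherwise the logarithm would be undefined).

```python
import math


def lch(path_length, taxonomy_depth):
    """Leacock-Chodorow score: -log(p / (2 * d))."""
    if path_length <= 0 or taxonomy_depth <= 0:
        raise ValueError("path_length and taxonomy_depth must be positive")
    return -math.log(path_length / (2 * taxonomy_depth))


d = 20  # invented taxonomy depth
# Assumption: identical senses have p = 1, so the maximum score is log(2 * d).
print(lch(1, d) == -math.log(1 / 40))  # True
# Longer paths give lower scores.
print(lch(2, d) > lch(4, d))  # True
```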

nltk.corpus.reader.wordnet.lin_similarity(synset1, synset2, ic, verbose=False)[source]¶

Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects, in the range 0 to 1.
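
The 0 to 1 range follows from the equation: the LCS is an ancestor of both synsets, so its IC cannot exceed either of theirs. A minimal sketch with invented IC values, where lin is a hypothetical helper rather than the library function, and returning 0.0 for a zero denominator is a choice made for this sketch only:

```python
def lin(ic_s1, ic_s2, ic_lcs):
    """Lin score: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    total = ic_s1 + ic_s2
    if total == 0:
        # Assumption of this sketch: two zero-information synsets score 0.0.
        return 0.0
    return 2 * ic_lcs / total


print(lin(8.0, 7.0, 5.0))  # 2 * 5 / (8 + 7)
print(lin(6.0, 6.0, 6.0))  # a synset compared with itself: 1.0
print(lin(8.0, 7.0, 0.0))  # LCS carries no information: 0.0
```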

nltk.corpus.reader.wordnet.path_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (which will only happen for verbs, as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself will return 1.

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root, which prevents this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to False to disable this behavior. For the noun taxonomy, there is usually a default root, except in WordNet version 1.6. If you are using WordNet 1.6, a fake root will be added for nouns as well.

Returns:

A score denoting the similarity of the two Synset objects, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.
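
The effect of simulate_root is easiest to see on two taxonomies that share no ancestor. The sketch below does not use NLTK: it assumes the score is 1 / (shortest_path + 1), with the path measured in edges through a common hypernym, and the HYPERNYMS table, FAKE_ROOT marker and function names are hypothetical.

```python
# Hypothetical hypernym links forming two unconnected taxonomies, as can happen for verbs.
HYPERNYMS = {
    "walk": ["travel"],
    "run": ["travel"],
    "travel": [],
    "ponder": ["think"],
    "think": [],
}
FAKE_ROOT = "*ROOT*"


def hypernym_distances(node, simulate_root):
    """Map node and each of its ancestors to the distance (in edges) from node."""
    distances = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            parents = HYPERNYMS.get(current, [])
            # A node without hypernyms is a taxonomy root; optionally hang it
            # under a shared fake root.
            if not parents and simulate_root and current != FAKE_ROOT:
                parents = [FAKE_ROOT]
            for parent in parents:
                if parent not in distances:
                    distances[parent] = distances[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return distances


def path_similarity(a, b, simulate_root=True):
    """Return 1 / (shortest path + 1), or None if a and b share no ancestor."""
    dist_a = hypernym_distances(a, simulate_root)
    dist_b = hypernym_distances(b, simulate_root)
    common = dist_a.keys() & dist_b.keys()
    if not common:
        return None
    shortest = min(dist_a[n] + dist_b[n] for n in common)
    return 1 / (shortest + 1)


print(path_similarity("walk", "walk"))                         # 1.0
print(path_similarity("walk", "ponder"))                       # 0.2, via the fake root
print(path_similarity("walk", "ponder", simulate_root=False))  # None
```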

nltk.corpus.reader.wordnet.res_similarity(synset1, synset2, ic, verbose=False)[source]¶

Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
ic (dict) – an information content object (as returned by nltk.corpus.wordnet_ic.ic()).

Returns:

A float score denoting the similarity of the two Synset objects. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N[‘dog’][0] and N[‘table’][0]).
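
The zero score at the root follows from how information content is usually defined, IC(c) = -log p(c), since the root subsumes everything and so has p = 1. Here is a minimal sketch with invented counts that does not use NLTK or the wordnet_ic files; COUNTS, PARENT, toy_ic and res are hypothetical names.

```python
import math

# Invented cumulative counts: each concept's count includes those of its hyponyms.
COUNTS = {"entity": 1000, "animal": 200, "dog": 50, "cat": 30, "table": 40}
PARENT = {"entity": None, "animal": "entity", "dog": "animal",
          "cat": "animal", "table": "entity"}


def toy_ic(concept):
    """IC(c) = -log p(c), written as log(count(root) / count(c))."""
    return math.log(COUNTS["entity"] / COUNTS[concept])


def ancestors(concept):
    """Return [concept, hypernym, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def res(a, b):
    """Resnik score: the IC of the most informative common ancestor."""
    above_b = set(ancestors(b))
    return max(toy_ic(n) for n in ancestors(a) if n in above_b)


print(res("dog", "cat"))    # IC of "animal": log(1000 / 200)
print(res("dog", "table"))  # LCS is the root: 0.0
```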

nltk.corpus.reader.wordnet.wup_similarity(synset1, synset2, verbose=False, simulate_root=True)[source]¶

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did not always agree with those given by Pedersen’s Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, although those for nouns still do not always.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:

other (Synset) – The Synset that this Synset is being compared to.
simulate_root (bool) – The various verb taxonomies do not share a single root, which prevents this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to False to disable this behavior. For the noun taxonomy, there is usually a default root, except in WordNet version 1.6. If you are using WordNet 1.6, a fake root will be added for nouns as well.

Returns:

A float score denoting the similarity of the two Synset objects, normally greater than zero. If no connecting path between the two senses can be found, None is returned.
