nltk.cluster.util module

class nltk.cluster.util.VectorSpaceClusterer[source]

Bases: nltk.cluster.api.ClusterI

Abstract clusterer which takes tokens and maps them into a vector space. Optionally performs singular value decomposition to reduce the dimensionality.

__init__(normalise=False, svd_dimensions=None)[source]
  • normalise (boolean) – whether vectors should be normalised to length 1

  • svd_dimensions (int) – number of dimensions to use when reducing vector dimensionality with SVD
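
To make the normalise option concrete, the following is a minimal pure-Python sketch of scaling a vector to length 1. The helper name normalise_sketch is hypothetical and is not part of NLTK; the library's own code may differ in detail.

```python
import math


def normalise_sketch(vector):
    """Scale a vector to Euclidean length 1 (hypothetical helper, not NLTK's)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        # A zero vector has no direction, so return it unchanged.
        return list(vector)
    return [x / length for x in vector]


unit = normalise_sketch([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```

After normalisation only the direction of each vector is kept, so vectors that differ only in magnitude become identical.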

cluster(vectors, assign_clusters=False, trace=False)[source]

Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.

abstract cluster_vectorspace(vectors, trace)[source]

Finds the clusters using the given set of vectors.


classify(vector)[source]

Classifies the token into a cluster, setting the token’s CLUSTER parameter to that cluster identifier.

abstract classify_vectorspace(vector)[source]

Returns the index of the appropriate cluster for the vector.

likelihood(vector, label)[source]

Returns the likelihood (a float) of the token having the corresponding cluster.

likelihood_vectorspace(vector, cluster)[source]

Returns the likelihood of the vector belonging to the cluster.


Returns the vector after normalisation and dimensionality reduction.

nltk.cluster.util.euclidean_distance(u, v)[source]

Returns the euclidean distance between vectors u and v. This is equivalent to the length of the vector (u - v).
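
The definition above can be sketched in plain Python as the length of the difference vector. The function name euclidean_sketch is an assumption made for illustration; it is not the library function itself.

```python
import math


def euclidean_sketch(u, v):
    """Length of (u - v) for two equal-length sequences of numbers."""
    diff = [a - b for a, b in zip(u, v)]
    return math.sqrt(sum(d * d for d in diff))


distance = euclidean_sketch([0.0, 0.0], [3.0, 4.0])
```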

nltk.cluster.util.cosine_distance(u, v)[source]

Returns 1 minus the cosine of the angle between vectors u and v. This is equal to 1 - (u.v) / (|u| |v|), where u.v is the dot product and |u| is the length of u.
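
As a rough illustration of this formula, the sketch below computes the dot product and the two vector lengths in plain Python. The name cosine_distance_sketch is hypothetical, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine_distance_sketch(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumes both norms are non-zero; a zero vector would divide by zero.
    return 1.0 - dot / (norm_u * norm_v)


orthogonal = cosine_distance_sketch([1.0, 0.0], [0.0, 1.0])
parallel = cosine_distance_sketch([1.0, 2.0], [2.0, 4.0])
opposite = cosine_distance_sketch([1.0, 0.0], [-1.0, 0.0])
```

The result ranges from 0 for vectors pointing the same way, through 1 for orthogonal vectors, to 2 for vectors pointing in opposite directions.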

class nltk.cluster.util.Dendrogram[source]

Bases: object

Represents a dendrogram, a tree with a specified branching order. It must be initialised with the leaf items; merge is then called once for each branch. The class constructs a tree that records the order of the calls to merge.


__init__(items=[])[source]

items (sequence of (any)) – the items at the leaves of the dendrogram


merge(*indices)[source]

Merges the nodes at the given indices in the dendrogram. The nodes are combined into a single node, which replaces the first node specified. All other nodes involved in the merge are removed.


indices (seq of int) – indices of the items to merge (at least two)
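
The merge bookkeeping described above can be sketched with nested tuples. The class MergeTree and its nodes attribute are hypothetical stand-ins written for illustration; they are not NLTK's Dendrogram and make no claim about its internals.

```python
class MergeTree:
    """Hypothetical stand-in for the merge bookkeeping, not NLTK's class."""

    def __init__(self, items):
        # Each leaf starts out as its own top-level node.
        self.nodes = list(items)

    def merge(self, *indices):
        if len(indices) < 2:
            raise ValueError("merge needs at least two indices")
        # Combine the selected nodes into one nested tuple.
        combined = tuple(self.nodes[i] for i in indices)
        # The combined node replaces the first node specified.
        self.nodes[indices[0]] = combined
        # The remaining nodes are removed, highest index first, so that
        # earlier deletions do not shift the positions still to be deleted.
        for i in sorted(indices[1:], reverse=True):
            del self.nodes[i]


tree = MergeTree(["a", "b", "c", "d"])
tree.merge(0, 1)  # nodes: [("a", "b"), "c", "d"]
tree.merge(1, 2)  # nodes: [("a", "b"), ("c", "d")]
tree.merge(0, 1)  # nodes: [(("a", "b"), ("c", "d"))]
```

Note that indices refer to the current list of top-level nodes, so they shift after every merge.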


Finds the n-groups of items (leaves) reachable from a cut at depth n.

n (int) – number of groups
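
One plausible reading of such a cut is to start from the root and repeatedly undo the most recent merge until n groups remain. The sketch below follows that reading. The tree encoding (an internal node is a pair of merge order and child list, a leaf is a plain item) and the function names are assumptions made for illustration, and the order of the returned groups may differ from what the library produces.

```python
def leaves(node):
    """Collect the leaf items under a node, left to right."""
    if not isinstance(node, tuple):
        return [node]
    collected = []
    for child in node[1]:
        collected.extend(leaves(child))
    return collected


def cut(root, n):
    """Split the tree into at most n groups by undoing the latest merges."""
    groups = [root]
    while len(groups) < n:
        internal = [g for g in groups if isinstance(g, tuple)]
        if not internal:
            # Only leaves remain, so no further split is possible.
            break
        # The node with the highest merge order was created last.
        latest = max(internal, key=lambda g: g[0])
        position = groups.index(latest)
        # Replace that node with its children, keeping left-to-right order.
        groups[position:position + 1] = latest[1]
    return [leaves(g) for g in groups]


# Merge 1 joined a and b, merge 2 joined c and d, merge 3 joined both pairs.
example = (3, [(1, ["a", "b"]), (2, ["c", "d"])])
two_groups = cut(example, 2)
three_groups = cut(example, 3)
```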


Print the dendrogram in ASCII art to standard out.


leaf_labels (list) – an optional list of strings to use for labeling the leaves