nltk.cluster.util module

class nltk.cluster.util.Dendrogram[source]

Bases: object

Represents a dendrogram, a tree with a specified branching order. Initialise it with the leaf items, then call merge once for each branch. The class constructs a tree that records the order of the calls to merge.

__init__(items=[])[source]
Parameters

items (sequence of (any)) – the items at the leaves of the dendrogram

groups(n)[source]

Finds the n-groups of items (leaves) reachable from a cut at depth n.

Parameters

n (int) – number of groups

merge(*indices)[source]

Merges the nodes at the given indices in the dendrogram. The nodes are combined into a single new node, which replaces the first node specified. All other nodes involved in the merge are removed.

Parameters

indices (seq of int) – indices of the items to merge (at least two)
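The merge behaviour described above can be illustrated with a minimal sketch in plain Python. This is not NLTK's implementation: the function name `merge_nodes` and the use of nested tuples to stand for internal nodes are assumptions made only for illustration.

```python
def merge_nodes(nodes, *indices):
    """Combine the nodes at `indices` into one tuple stored at the first index.

    Hypothetical helper: it mimics the documented behaviour of merge on a
    plain list and is not part of NLTK.
    """
    if len(indices) < 2:
        raise ValueError("need at least two indices to merge")
    combined = tuple(nodes[i] for i in indices)
    # The combined node replaces the first node specified.
    nodes[indices[0]] = combined
    # Remove the other merged nodes, highest index first, so that earlier
    # deletions do not shift the positions still waiting to be removed.
    for i in sorted(indices[1:], reverse=True):
        del nodes[i]
    return nodes


nodes = ["a", "b", "c", "d"]
merge_nodes(nodes, 0, 1)  # nodes is now [("a", "b"), "c", "d"]
merge_nodes(nodes, 1, 2)  # nodes is now [("a", "b"), ("c", "d")]
merge_nodes(nodes, 0, 1)  # nodes is now [(("a", "b"), ("c", "d"))]
```

Because each merge shortens the list, the indices passed to later calls refer to positions in the already-merged list, not to the original leaf positions.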

show(leaf_labels=[])[source]

Print the dendrogram in ASCII art to standard out.

Parameters

leaf_labels (list) – an optional list of strings to use for labeling the leaves

class nltk.cluster.util.VectorSpaceClusterer[source]

Bases: ClusterI

Abstract clusterer which takes tokens and maps them into a vector space. Optionally performs singular value decomposition to reduce the dimensionality.

__init__(normalise=False, svd_dimensions=None)[source]
Parameters
  • normalise (boolean) – whether vectors should be normalised to length 1

  • svd_dimensions (int) – number of dimensions to keep when reducing vector dimensionality with SVD
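As a rough illustration of what normalising a vector to length 1 means, the sketch below scales a vector by its Euclidean length. The function name `normalise_sketch` and the choice to return zero vectors unchanged are assumptions of this example, not documented NLTK behaviour.

```python
import math


def normalise_sketch(vector):
    """Scale `vector` to unit Euclidean length.

    Hypothetical helper; zero vectors are returned unchanged, which is an
    assumption of this sketch.
    """
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return list(vector)
    return [x / length for x in vector]


unit = normalise_sketch([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))  # approximately 1.0
```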

classify(vector)[source]

Classifies the token into a cluster, setting the token’s CLUSTER parameter to that cluster identifier.

abstract classify_vectorspace(vector)[source]

Returns the index of the appropriate cluster for the vector.
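Since this method is abstract, each concrete clusterer decides how the index is chosen. One possible strategy, shown here only as a hedged sketch, is to return the index of the nearest cluster mean. The function `classify_by_nearest_mean` and the `means` list are hypothetical and are not part of this module.

```python
import math


def classify_by_nearest_mean(vector, means):
    """Return the index of the mean closest to `vector`.

    Hypothetical stand-in for what a concrete classify_vectorspace might
    do; uses Euclidean distance as an assumed metric.
    """
    def distance(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    return min(range(len(means)), key=lambda i: distance(vector, means[i]))


means = [[0.0, 0.0], [10.0, 10.0]]
index = classify_by_nearest_mean([1.0, 2.0], means)  # 0
```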

cluster(vectors, assign_clusters=False, trace=False)[source]

Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.

abstract cluster_vectorspace(vectors, trace)[source]

Finds the clusters using the given set of vectors.

likelihood(vector, label)[source]

Returns the likelihood (a float) of the token having the corresponding cluster.

likelihood_vectorspace(vector, cluster)[source]

Returns the likelihood of the vector belonging to the cluster.

vector(vector)[source]

Returns the vector after normalisation and dimensionality reduction.

nltk.cluster.util.cosine_distance(u, v)[source]

Returns 1 minus the cosine of the angle between vectors u and v. This is equal to 1 - (u · v) / (|u| |v|).
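The formula can be checked by hand with a small pure-Python sketch. The name `cosine_distance_sketch` is hypothetical and the code is not NLTK's implementation; it assumes both vectors are non-zero and of equal length.

```python
import math


def cosine_distance_sketch(u, v):
    """Return 1 - (u . v) / (|u| |v|) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


orthogonal = cosine_distance_sketch([1.0, 0.0], [0.0, 1.0])  # 1.0
parallel = cosine_distance_sketch([1.0, 2.0], [2.0, 4.0])    # close to 0
opposite = cosine_distance_sketch([1.0, 0.0], [-1.0, 0.0])   # 2.0
```

The result ranges from 0 for vectors pointing the same way to 2 for vectors pointing in opposite directions, and it ignores vector length.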

nltk.cluster.util.euclidean_distance(u, v)[source]

Returns the Euclidean distance between vectors u and v. This is equivalent to the length of the vector (u - v).
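As a quick worked example of "the length of the vector (u - v)", the following pure-Python sketch computes the same quantity. The name `euclidean_distance_sketch` is hypothetical, and the code assumes both vectors have the same number of components.

```python
import math


def euclidean_distance_sketch(u, v):
    """Return the length of the difference vector (u - v)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


distance = euclidean_distance_sketch([0.0, 0.0], [3.0, 4.0])  # 5.0
```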