nltk.cluster.kmeans module

class nltk.cluster.kmeans.KMeansClusterer

Bases: nltk.cluster.util.VectorSpaceClusterer

The K-means clusterer starts with k arbitrarily chosen means, then allocates each vector to the cluster with the closest mean. It then recalculates the mean of each cluster as the centroid of the vectors in that cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm that may converge to a local optimum rather than the global one. Hence the clustering is often repeated with random initial means, and the most commonly occurring output means are chosen.
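The loop described above can be sketched in plain Python. This is an illustrative sketch only, not the NLTK implementation: the function name `kmeans_sketch`, the `max_iterations` cap, and the choice to keep the old mean when a cluster becomes empty are assumptions made for the example.

```python
import math


def kmeans_sketch(vectors, initial_means, max_iterations=100):
    """Illustrative k-means loop (hypothetical helper, not part of NLTK)."""
    means = [list(m) for m in initial_means]
    assignments = None
    for _ in range(max_iterations):
        # Allocate each vector to the cluster with the closest mean.
        new_assignments = [
            min(range(len(means)), key=lambda i: math.dist(v, means[i]))
            for v in vectors
        ]
        # Stop once the cluster memberships no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recalculate each mean as the centroid of its member vectors.
        for i in range(len(means)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:  # assumption: an empty cluster keeps its old mean
                means[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, means


points = [(2, 1), (1, 3), (4, 7), (6, 7)]
labels, centroids = kmeans_sketch(points, initial_means=[(4, 3), (5, 5)])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[1.5, 2.0], [5.0, 7.0]]
```

Different initial means can lead to different final clusters on less cleanly separated data, which is why repeated randomised trials are useful.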

__init__(num_means, distance, repeats=1, conv_test=1e-06, initial_means=None, normalise=False, svd_dimensions=None, rng=None, avoid_empty_clusters=False)
Parameters
  • num_means (int) – the number of means to use (may use fewer)

  • distance (function taking two vectors and returning a float) – measure of distance between two vectors

  • repeats (int) – number of randomised clustering trials to use

  • conv_test (number) – maximum change in the means between iterations for the clustering to be deemed converged

  • initial_means (sequence of vectors) – set of k initial means

  • normalise (boolean) – whether vectors should be normalised to length 1

  • svd_dimensions (int) – number of dimensions to use when reducing vector dimensionality with SVD

  • rng (Random) – random number generator (or None)

  • avoid_empty_clusters (boolean) – include current centroid in computation of next one; avoids undefined behavior when clusters become empty
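A minimal usage sketch of the constructor, assuming NLTK and NumPy are installed. The toy vectors and initial means are illustrative values, and `cluster()` is the entry point inherited from `VectorSpaceClusterer` rather than a method documented on this page.

```python
import numpy
from nltk.cluster import KMeansClusterer, euclidean_distance

# Toy data chosen for illustration: two well-separated groups of 2-D points.
vectors = [numpy.array(v, dtype=float) for v in [[2, 1], [1, 3], [4, 7], [6, 7]]]

# Supplying initial_means makes this run deterministic.
clusterer = KMeansClusterer(
    2,                   # num_means
    euclidean_distance,  # distance function taking two vectors
    initial_means=[numpy.array([4.0, 3.0]), numpy.array([5.0, 5.0])],
)

# With assign_clusters=True, cluster() returns one cluster index per vector.
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)  # [0, 0, 1, 1]
```

When `initial_means` is left as `None`, the starting means are drawn using `rng`, so passing a seeded `random.Random` instance is one way to make such runs reproducible.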

cluster_vectorspace(vectors, trace=False)

Finds the clusters using the given set of vectors.

classify_vectorspace(vector)

Returns the index of the appropriate cluster for the vector.

num_clusters()

Returns the number of clusters.

means()

The means used for clustering.
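The following sketch shows how the query methods might be used once clustering has run, assuming NLTK and NumPy are installed. The data and initial means are illustrative values, and `cluster()` and `classify()` are inherited from `VectorSpaceClusterer`.

```python
import numpy
from nltk.cluster import KMeansClusterer, euclidean_distance

# Illustrative toy data and fixed starting means (assumed values).
vectors = [numpy.array(v, dtype=float) for v in [[2, 1], [1, 3], [4, 7], [6, 7]]]
clusterer = KMeansClusterer(
    2,
    euclidean_distance,
    initial_means=[numpy.array([4.0, 3.0]), numpy.array([5.0, 5.0])],
)
clusterer.cluster(vectors)

# Classify an unseen vector: the result is the index of the closest mean.
cluster_index = clusterer.classify(numpy.array([3.0, 3.0]))
print(cluster_index)             # 0

print(clusterer.num_clusters())  # 2

# The final means are the centroids of the two groups.
for mean in clusterer.means():
    print(mean)
```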

nltk.cluster.kmeans.demo()