nltk.cluster.gaac module

class nltk.cluster.gaac.GAAClusterer[source]

Bases: nltk.cluster.util.VectorSpaceClusterer

The Group Average Agglomerative Clusterer (GAAC) starts with each of the N vectors as a singleton cluster. It then iteratively merges the pair of clusters with the closest centroids, continuing until only one cluster remains. The order of merges gives rise to a dendrogram: a tree in which earlier merges sit lower than later merges. The membership for a given number of clusters c, 1 <= c <= N, can be found by cutting the dendrogram at depth c.

This clusterer uses only the cosine similarity metric, which allows the clustering process to be sped up.
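The merge loop described above can be sketched in plain Python. This is an illustrative sketch only, not NLTK's implementation: the helper names (cosine_distance, group_average, agglomerate) are hypothetical, the sample points are made up, and the distance between two clusters is assumed here to be the mean cosine distance over all cross-cluster pairs of vectors.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def group_average(vectors, a, b):
    """Mean cosine distance over all pairs with one member in a, one in b."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(vectors, num_clusters=1):
    """Merge the closest pair of clusters until num_clusters remain.

    Returns (clusters, merges). clusters is a list of lists of vector
    indices; merges is the ordered merge history, i.e. the information
    a dendrogram would be built from.
    """
    clusters = [[i] for i in range(len(vectors))]  # start with N singletons
    merges = []
    while len(clusters) > max(num_clusters, 1):
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(vectors, clusters[p[0]], clusters[p[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical sample data: two vectors near the x-axis, two near the y-axis.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.2, 0.8)]
clusters, merges = agglomerate(points, num_clusters=2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3]]
```

Stopping the loop at num_clusters corresponds to cutting the merge history at that level; running it down to a single cluster records the whole history.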

__init__(num_clusters=1, normalise=True, svd_dimensions=None)[source]
Parameters
  • num_clusters (int) – number of clusters at which merging stops

  • normalise (boolean) – should vectors be normalised to length 1

  • svd_dimensions (int) – number of dimensions to use in reducing vector dimensionality with SVD

cluster(vectors, assign_clusters=False, trace=False)[source]

Assigns the vectors to clusters, learning the clustering parameters from the data. If assign_clusters is True, returns a cluster identifier for each vector.
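A minimal usage sketch of cluster() follows. It assumes NLTK and NumPy are installed and that GAAClusterer is importable from nltk.cluster; the four 2-D vectors are made up for illustration. The cluster identifiers are printed without an expected value, since which group receives which identifier is not specified here.

```python
import numpy
from nltk.cluster import GAAClusterer

# Toy 2-D data (hypothetical): two vectors pointing roughly along the
# x-axis and two pointing roughly along the y-axis.
vectors = [
    numpy.array(v) for v in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
]

# Ask for two clusters instead of the default of one.
clusterer = GAAClusterer(num_clusters=2)

# assign_clusters=True requests one cluster identifier per input vector.
assignments = clusterer.cluster(vectors, assign_clusters=True)

print(assignments)               # one identifier per vector
print(clusterer.num_clusters())  # number of clusters after clustering
```

Because only cosine similarity is used, vectors that point in similar directions end up together regardless of their lengths.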

cluster_vectorspace(vectors, trace=False)[source]

Finds the clusters using the given set of vectors.

update_clusters(num_clusters)[source]

classify_vectorspace(vector)[source]

Returns the index of the appropriate cluster for the vector.

dendrogram()[source]

Returns
  The dendrogram representing the current clustering.

Return type
  Dendrogram
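The following sketch shows how the dendrogram might be used to choose the number of clusters after the fact. It assumes NLTK and NumPy are installed, that clustering with the default num_clusters=1 records the full merge history, and that update_clusters() re-cuts that stored history; classify() is taken from the VectorSpaceClusterer base class. The data is made up for illustration.

```python
import numpy
from nltk.cluster import GAAClusterer

# Toy 2-D data (hypothetical): two directions, two vectors each.
vectors = [
    numpy.array(v) for v in [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
]

# With the default num_clusters=1, merging continues down to one cluster.
clusterer = GAAClusterer()
clusterer.cluster(vectors)

# The Dendrogram object holding the merge history.
tree = clusterer.dendrogram()

# Re-cut the stored dendrogram into two clusters (assumed behaviour).
clusterer.update_clusters(2)
print(clusterer.num_clusters())

# Map a previously unseen vector to the index of its nearest cluster.
new_vector = numpy.array([0.8, 0.1])
print(clusterer.classify(new_vector))
```

Clustering once down to a single cluster and cutting afterwards avoids re-running the merge loop for each candidate number of clusters.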

num_clusters()[source]

Returns the number of clusters.

nltk.cluster.gaac.demo()[source]

Non-interactive demonstration of the clusterers with simple 2-D data.