Hierarchical Agglomerative Clustering


A. Features

  • Builds a hierarchical tree of clusters, where each depth has a different number of clusters; a suitable depth can be chosen depending on the desired total number of clusters (see the scipy sketch after this list).
  • A child node represents, in more detail, one of the meanings its parent node covers.
    • When clustering sentences, keywords extracted from each cluster (node) can be read as a hierarchical topic tree, from major topics down to subtopics, via the parent-child relationships.
  • The results can vary greatly depending on which metric you use to calculate the distance.
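
  For example, the same tree can be cut at different depths to get different numbers of clusters; a minimal sketch with scipy (toy data; method and metric are chosen only for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.rand(30, 4)
    Z = linkage(X, method="average", metric="cosine")
    for k in (2, 5, 10):
      # cut the same tree so that at most k clusters remain
      labels = fcluster(Z, t=k, criterion="maxclust")
      print(k, "->", len(set(labels)), "clusters")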

B. Process

  1. Start with n data points as n singleton clusters.
  2. Find the two clusters with the closest distance and merge them into one cluster.
  3. Repeat step 2 until only one cluster remains (see the sketch below).
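
  A minimal sketch of this loop, using single linkage on toy 2-D points (variable and helper names here are illustrative, not from the post):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    points = np.random.rand(6, 2)
    dist = squareform(pdist(points))              # pairwise euclidean distances
    clusters = [[i] for i in range(len(points))]  # 1. every point is its own cluster

    def cluster_dist(a, b):
      # single linkage: distance between the closest points of the two clusters
      return min(dist[i, j] for i in a for j in b)

    while len(clusters) > 1:                      # 3. repeat until one cluster remains
      # 2. find the closest pair of clusters and merge them
      pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
      i, j = min(pairs, key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
      clusters[i] += clusters.pop(j)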

C. Options

  • distance metric
    • cosine distance
    • euclidean distance
  • linkage (which points you use to calculate the distance between clusters; compared below)
    • complete (max): the distance between the furthest points of each cluster
    • single (min): the distance between the closest points of each cluster
    • average: the average distance over all pairs of points, one from each cluster
    • ward: merge the pair of clusters whose union gives the smallest increase in within-cluster variance (euclidean distance only)
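
  The linkage choice visibly changes the resulting labels; a small comparison sketch with scikit-learn (random toy data; "affinity" was renamed to "metric" in scikit-learn 1.2):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    X = np.random.rand(50, 8)
    for linkage in ("complete", "single", "average"):
      # "ward" is left out because it only supports euclidean distance
      model = AgglomerativeClustering(n_clusters=3, affinity="cosine", linkage=linkage)
      print(linkage, model.fit_predict(X)[:10])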

Python example

1. Clustering w/ scikit-learn

import numpy as np
from sklearn.cluster import AgglomerativeClustering

data = np.random.rand(500, 64)
n_cluster = 20
# note: "affinity" was renamed to "metric" in scikit-learn 1.2
cosine_cl_model = AgglomerativeClustering(affinity="cosine", n_clusters=n_cluster, linkage="complete")
cosine_cl_model.fit(data)
cosine_label_list = list(cosine_cl_model.labels_)
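
The fitted model's children_ array drives the tree construction in step 2: in scikit-learn, row i of children_ holds the two nodes merged to form internal node n_samples + i, and ids below n_samples are original data points. A quick inspection:

n_samples = data.shape[0]
for i, (left, right) in enumerate(cosine_cl_model.children_[:3]):
  print("node", n_samples + i, "= merge of", left, "and", right)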

2. Make Tree

  • Make tree
    import itertools
    ii = itertools.count(data.shape[0])        # internal node ids start at n_samples
    tree = [[] for i in range(2*len(data)-1)]  # 2n-1 nodes: n leaves + n-1 merges
    for x in cosine_cl_model.children_:
      parent_node = next(ii)
      tree[parent_node].append(x[0]) # left child
      tree[parent_node].append(x[1]) # right child
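
  • Define get_depth (the post calls it but leaves it undefined; a minimal recursive sketch is assumed here)
    def get_depth(node, depth):
      # record the node at its depth, then descend into its children (leaves have none)
      depth_list[depth].append(node)
      for child in tree[node]:
        get_depth(child, depth+1)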
    
  • Get depth list
    depth_list = [[] for i in range(len(data))]  # one slot per possible depth (< n_samples)
    root_node = len(data) + len(cosine_cl_model.children_) - 1
    get_depth(root_node, 0)
    

3. Arrange Clusters by Depth

  • Get max depth
    # the deepest level that actually contains nodes
    max_depth = max(depth for depth, node_list in enumerate(depth_list) if node_list)
    
  • number of clusters by depth
    cluster_list = [[] for i in range(max_depth+1)]
    for depth in range(0, max_depth+1):
      if depth==0:
          cluster_node = set(depth_list[depth])
          cluster_list[depth] = list(cluster_node)
      else:
          # clusters at this depth = clusters carried over from the previous depth,
          # plus nodes first appearing at this depth, minus previous-depth nodes that split
          cluster_node = set(cluster_list[depth-1] + depth_list[depth])
          parent_node = set([node for node in depth_list[depth-1] if len(tree[node])>0])
          cluster_node = cluster_node.difference(parent_node)
          cluster_list[depth] = list(cluster_node)
      print("depth {depth}: {cluster_num}".format(depth=depth, cluster_num=len(cluster_node)))
    

Reference and Implementation: #TODO

  • reference: [Data Mining] Hierarchical agglomerative Clustering



© 2022.03. by bigshane