Clustering Techniques for Categorical Data in Python
k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the better-known k-means algorithm, which clusters numerical data based on distance measures such as Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical/categorical data.
Introduction
Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (objects within a cluster are similar) and low inter-cluster similarity (objects from different clusters are dissimilar). This is an internal criterion for the quality of a clustering. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest. For search-result clustering, for example, we may want to measure the time it takes users to find an answer with different clustering algorithms. This is the most direct evaluation, but it is expensive, especially if large user studies are necessary.
Types of categorical clustering techniques
- The k-modes algorithm
- The k-prototypes algorithm
Please feel free to comment with other algorithms and packages that make working with categorical clustering easy.
Why do we need the k-modes algorithm?
Ralambondrainy (1995) presented an approach to using the k-means algorithm to cluster categorical data. His approach converts multiple-category attributes into binary attributes (using 0 and 1 to represent a category being absent or present) and treats the binary attributes as numeric in the k-means algorithm. In data mining, this approach has to handle a large number of binary attributes, because data sets often have categorical attributes with hundreds or thousands of categories. This inevitably increases both the computational and space costs of the k-means algorithm. The other drawback is that the cluster means, given by real values between 0 and 1, do not indicate the characteristics of the clusters.
Categorical data has a different structure than numerical data, and the distance functions used for numerical data are generally not applicable to it, so algorithms for clustering numerical data cannot be applied directly to categorical data. The k-means algorithm is well known for its efficiency in clustering large data sets; however, because it works only on numeric values, it cannot be used to cluster real-world data containing categorical values.
1. The k-modes algorithm
The reason the k-means algorithm cannot cluster categorical objects is its dissimilarity measure. These barriers can be removed by making the following modifications to the k-means algorithm:
- Using a simple matching dissimilarity measure for categorical objects.
- Replacing the means of clusters with modes.
- Using a frequency-based method to find the modes of clusters.
The sections below discuss these modifications.
A. Dissimilarity measure
A clustering algorithm is free to choose any distance metric or similarity score. Euclidean distance is the most popular, but any other metric that scales according to the data distribution in each dimension/attribute can be used, for example the Mahalanobis metric.
Let X, Y be two categorical objects described by m categorical attributes. The dissimilarity measure between X and Y can be defined by the total mismatches of the corresponding attribute categories of the two objects. The smaller the number of mismatches, the more similar the two objects. This measure is often referred to as simple matching (Kaufman and Rousseeuw, 1990). Formally,

d(X, Y) = Σj=1…m δ(xj, yj),    (5)

where δ(xj, yj) = 0 if xj = yj and 1 otherwise.
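As a quick illustration, here is a minimal sketch of the simple matching dissimilarity in NumPy (the function name and toy objects are my own):

```python
import numpy as np

def simple_matching(x, y):
    # Simple matching dissimilarity (5): number of attributes
    # on which the two objects take different categories.
    x, y = np.asarray(x), np.asarray(y)
    return int(np.sum(x != y))

# Two objects described by three categorical attributes; they
# differ only on the second attribute, so the dissimilarity is 1.
print(simple_matching(["red", "small", "round"], ["red", "large", "round"]))
```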
B. Mode of a set
Let X be a set of categorical objects described by categorical attributes A1, A2, …, Am. Definition 1. A mode of X = {X1, X2, …, Xn} is a vector Q = [q1, q2, …, qm] that minimizes

D(X, Q) = Σi=1…n d(Xi, Q).

Note that Q is not necessarily an element of X.
C. Find a mode for a set
Theorem 1 states that D(X, Q) is minimized if and only if each qj is one of the most frequent categories of attribute Aj in X. It thus defines a way to find Q from a given X, which is important because it allows the k-means paradigm to be used to cluster categorical data. The theorem also implies that the mode of a data set X is not unique. For example, the mode of the set {[a, b], [a, c], [c, b], [b, c]} can be either [a, b] or [a, c].
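A minimal sketch of finding a mode via Theorem 1 (the function name is mine; ties between equally frequent categories are broken arbitrarily, which is exactly why the mode is not unique):

```python
import numpy as np

def mode_of_set(X):
    # Per Theorem 1: pick a most frequent category of each attribute.
    X = np.asarray(X)
    mode = []
    for j in range(X.shape[1]):
        vals, counts = np.unique(X[:, j], return_counts=True)
        mode.append(str(vals[np.argmax(counts)]))  # ties broken arbitrarily
    return mode

# The example from above: the second attribute is tied between b and c,
# so either [a, b] or [a, c] is a valid mode.
print(mode_of_set([["a", "b"], ["a", "c"], ["c", "b"], ["b", "c"]]))
```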
D. The k-modes algorithm
When (5) is used as the dissimilarity measure for categorical objects, the cost function becomes

P(W, Q) = Σl=1…k Σi=1…n wi,l d(Xi, Ql),

where wi,l ∈ {0, 1} indicates whether object Xi belongs to cluster l.
To minimize the cost function, the basic k-means algorithm can be modified by using the simple matching dissimilarity measure (5) to solve P1 (the allocation subproblem), and by using modes for clusters instead of means, selected according to Theorem 1, to solve P2 (the mode-update subproblem).
In the basic algorithm we need to recalculate the total cost P against the whole data set each time a new Q or W is obtained. To make the computation more efficient, the following algorithm is used instead in practice:
1. Select k initial modes, one for each cluster.
2. Allocate an object to the cluster whose mode is nearest to it according to (5). Update the mode of the cluster after each allocation according to Theorem 1.
3. After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes. If an object is found such that its nearest mode belongs to another cluster rather than its current one, reallocate the object to that cluster and update the modes of both clusters.
4. Repeat step 3 until no object has changed clusters after a full-cycle test of the whole data set.
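Putting the pieces together, the sketch below is a minimal NumPy implementation of these four steps (all names are my own; this is illustrative, not the reference implementation):

```python
import numpy as np

def matching_dissimilarity(a, b):
    # Simple matching dissimilarity (5): count of mismatched attributes.
    return np.sum(a != b, axis=-1)

def update_mode(modes, X, labels, cluster):
    # Theorem 1: the mode is a most frequent category per attribute.
    members = X[labels == cluster]
    if len(members) == 0:
        return
    for j in range(X.shape[1]):
        vals, counts = np.unique(members[:, j], return_counts=True)
        modes[cluster, j] = vals[np.argmax(counts)]

def k_modes(X, k, max_cycles=100, seed=None):
    X = np.asarray(X)
    n = len(X)
    rng = np.random.default_rng(seed)
    # Step 1: pick k records as initial modes (distinct indices; a
    # production version would ensure distinct *records*).
    modes = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(max_cycles):
        moved = 0
        for i in range(n):  # steps 2-3: (re)allocate and update modes
            nearest = int(np.argmin(
                [matching_dissimilarity(X[i], q) for q in modes]))
            if nearest != labels[i]:
                previous, labels[i] = labels[i], nearest
                moved += 1
                update_mode(modes, X, labels, nearest)
                if previous >= 0:
                    update_mode(modes, X, labels, previous)
        if moved == 0:  # step 4: stop after a full cycle with no changes
            break
    return modes, labels
```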
The proof of convergence for this algorithm is not yet available (Anderberg, 1973). However, its practical use has shown that it always converges.
Like the k-means algorithm, the k-modes algorithm produces locally optimal solutions that depend on the initial modes and the order of objects in the data set. The current implementation of the k-modes algorithm includes two initial mode selection methods. The first method selects the first k distinct records from the data set as the initial k modes.
The second method is implemented with the following steps.
1. Calculate the frequencies of all categories for all attributes and store them in a category array in descending order of frequency, as shown in figure 1. Here, ci,j denotes category i of attribute j, and f(ci,j) ≥ f(ci+1,j), where f(ci,j) is the frequency of category ci,j.
2. Assign the most frequent categories equally to the initial k modes. For example, in figure 1, assume k = 3. We assign Q1 = [q1,1 = c1,1, q1,2 = c2,2, q1,3 = c3,3, q1,4 = c1,4], and spread the remaining most frequent categories across Q2 and Q3 in the same fashion.
3. Start with Q1. Select the record most similar to Q1 and replace Q1 with that record as the first initial mode. Then select the record most similar to Q2 and replace Q2 with that record as the second initial mode. Continue this process until Qk is replaced. In these selections, Ql ≠ Qt for l ≠ t.
Step 3 is taken to avoid the occurrence of empty clusters. The purpose of this selection method is to make the initial modes diverse, which can lead to better clustering results. A rough sketch of the method is shown below.
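This sketch of the frequency-based selection method uses names of my own, and the tie-handling details are simplified relative to the paper:

```python
import numpy as np

def frequency_based_init(X, k):
    X = np.asarray(X)
    n, m = X.shape
    modes = np.empty((k, m), dtype=X.dtype)
    # Steps 1-2: rank the categories of each attribute by frequency
    # and spread the most frequent ones across the k modes.
    for j in range(m):
        vals, counts = np.unique(X[:, j], return_counts=True)
        ranked = vals[np.argsort(-counts)]
        for l in range(k):
            modes[l, j] = ranked[l % len(ranked)]
    # Step 3: replace each mode with the most similar record, skipping
    # records already chosen, so the initial modes stay distinct.
    chosen = []
    for l in range(k):
        dists = np.sum(X != modes[l], axis=1).astype(float)
        dists[chosen] = np.inf
        idx = int(np.argmin(dists))
        chosen.append(idx)
        modes[l] = X[idx]
    return modes
```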
2. The k-prototypes algorithm
It is straightforward to integrate the k-means and k-modes algorithms into the k-prototypes algorithm, which is used to cluster mixed-type objects. The k-prototypes algorithm is practically more useful because objects in real-world databases are frequently mixed-type.
The dissimilarity between a mixed-type object X and a prototype Q is

d(X, Q) = Σj=1…p (xj − qj)² + γ Σj=p+1…m δ(xj, qj),

where the first p attributes are numeric and the remaining m − p are categorical: the first term is the squared Euclidean distance measure on the numeric attributes, and the second term is the simple matching dissimilarity measure on the categorical attributes. The weight γ is used to avoid favoring either type of attribute. The influence of γ on the clustering process is discussed in Huang (1997a).
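A minimal sketch of this mixed dissimilarity (function and variable names are mine):

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, q_num, q_cat, gamma):
    # Squared Euclidean distance on the numeric part plus
    # gamma times simple matching on the categorical part.
    numeric_term = np.sum((np.asarray(x_num) - np.asarray(q_num)) ** 2)
    categorical_term = np.sum(np.asarray(x_cat) != np.asarray(q_cat))
    return numeric_term + gamma * categorical_term

# An object with two numeric and two categorical attributes, compared
# to a prototype; gamma balances the two terms.
print(mixed_dissimilarity([1.0, 2.0], ["red", "oval"],
                          [1.5, 2.0], ["red", "round"], gamma=0.5))
```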
Implementation of k-modes
A Python implementation of the k-modes and k-prototypes clustering algorithms is available as the kmodes library, which relies on NumPy for a lot of the heavy lifting:
https://github.com/nicodv/kmodes
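For example, clustering the toy data from earlier with the library might look like this (assuming the package is installed with pip install kmodes; the toy data is mine):

```python
import numpy as np
from kmodes.kmodes import KModes

# Toy data: each row is an object, each column a categorical attribute.
X = np.array([
    ["red",  "small", "round"],
    ["red",  "small", "oval"],
    ["blue", "large", "round"],
    ["blue", "large", "oval"],
])

km = KModes(n_clusters=2, init="Huang", n_init=5)
labels = km.fit_predict(X)
print(labels)                 # cluster assignment per record
print(km.cluster_centroids_)  # the learned modes
```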
Conclusion
The two algorithms are efficient when clustering very large complex data sets in terms of both the number of records and the number of clusters.
In general, the k-modes algorithm is much faster than the k-prototypes algorithm. The key reason is that, because of its discrete nature, the k-modes algorithm needs far fewer iterations to converge.
Things to explore:
There are a number of clustering algorithms that can appropriately handle mixed data types. Some possibilities include the following:
- Partitioning-based algorithms: k-Prototypes, Squeezer.
- Hierarchical algorithms: ROCK, Agglomerative single, average, and complete linkage.
- Density-based algorithms: HIERDENC, MULIC, CLIQUE.
- Model-based algorithms: SVM clustering, Self-organizing maps.
If you would like to learn more about these algorithms, the survey "Survey of Clustering Algorithms" by Rui Xu and Donald Wunsch offers a comprehensive introduction to cluster analysis.
References
[1] http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf
[2] https://arxiv.org/ftp/cs/papers/0603/0603120.pdf
[3] https://www.ee.columbia.edu/~wa2171/MULIC/AndreopoulosPAKDD2007.pdf