1. What are sample applications of clustering in which the objective is to cluster the samples/observations (this is the most common use of clustering)?

    1. Machine learning, data mining, information retrieval, bioinformatics, data compression

  1. What are sample applications of clustering in which the objective is to cluster the variables/features?

    1. Clustering is used in market segmentation; where we try to find customers that are similar to each other whether in terms of behaviors or attributes, image segmentation/compression; where we try to group similar regions together, document clustering based on topics, etc.

  1. Clustering can be loosely defined as “grouping items that are similar/close to each other”. What components do we need to formulate to have a well-defined computational framework for clustering?

    1. Proximity measure(how closely related they are)

    2. Criterion function to defined good and bad clustering

    3. Algorithm to compute clustering

  1. What are possible ways of choosing the number of clusters?

    1. K-means

      1. Assume we know how many clusters there are

      2. Find the number of clusters that minimizes the SSE (sum of squared errors)

    2. Incorporate the trade-off between complexity and fit into the objective function

      1. Evaluate the trade-off between complexity and fit after computing clusterings by trying different values of K

      2. Hierarchal clustering

        1. Pick number after a distance tree is made 

  2. What is the objective function of K-means clustering? Can K-means still be applied if our notion of distance is not Euclidian distance?

    1. Creates clusters based on distance from centroids

    2. K-means finds a locally optimal solution and is not guaranteed to find the (globally) optimal solution

    3. Yes the distance function can be changed depending on the circumstances

  1. Is K-means guaranteed to identify the optimal clustering according to its objective function? If not, what can we say about the optimality of the clustering it identifies?

    1. No it is not guaranteed. 

    2. The optimal solution is the clustering that minimizes the objective function

  1. Considering an instance of clustering where we are trying to cluster  points/observations in 2-dimensional space, can you give a sample instance and an initialization that will make K-means converge to an undesirable solution?

    1. Spherical clustering 

      1. Bc k-means draws a line

  1. Consider a run of K-means in 1-dimensional space, with say K=2 clusters. If you are shown the configurations of the cluster centroids and assignment of data points to clusters at the end of each iteration, can you order those iterations? (work out an example for yourself)

    1. The slides 

  1. What are the weaknesses of K-means?

    1. Sensitive to outliers

    2. User needs to specify k

    3. Only applicable if the mean is supplied

  1. What are possible strategies for making algorithms more robust to outliers in the application of K-means?

    1. Remove outliers 

    2. Perform random sampling, small subset of data points

  1. What are the main components of an agglomerative hierarchical clustering algorithm (i.e., what choices do we need to make to define an algorithm?)

    1. A distance matrix specifying pairwise distance 

    2. Either bottom up or top down 

    3. Decisions cannot be undone 

  1. What are the possible ways of assessing the distance between groups of items given the distances between pairs of items?

    1. Closest pair

    2. Farthest pair

    3. Avg of all pairs 

  1. Can you give an example of hierarchical clustering in 2-dimensional space with Euclidian distance as the distance function, such that single-linkage and complete-linkage will produce the same dendrogram?

  2. Can you give an example of hierarchical clustering in 2-dimensional space with Euclidian distance as the distance function, such that single-linkage and complete-linkage will produce different dendrograms?

  1. What is the key idea of Ward’s method?

    1. Consider joining 2 clusters, how far does it change the total distance from the centroid?

    2. Is it worth it to combine these two?

  1. What is the common requirement of K-means and hierarchical clustering using Ward’s method?

    1. They both have to have computable means 

  1. What are the pros and cons of applying a divisive hierarchical clustering algorithm that, for example, uses K-means with K=2 to split the points at each node of the dendrogram (as compared to agglomerative clustering)?

    1. Agglomerative is faster to compute 

      1. Only looks at pairwise

    2. Divisive is “less blind” to all data

      1. Can find the best possible split between 2 

  1. What is the (common) key idea behind Bayesian information criterion (BIC)and Aikaike information criterion (AIC) in formulating a measure to guide the selection of the number of clusters?

    1. They determine how good a model fits in a maximum-liklihood sense

    2. How many parameters are need to get a good fit

  1. What is the key difference between AIC and BIC?

    1. AIC (Akaike information criterion) tries to select the model that most adequately describes an unknown, high dimensional reality. 

    2. BIC (Bayesian information criterion) tries to find the TRUE model among the set of candidates. 

    3. AIC better predictability

    4. BIC finds more efficient solution

  1. Assume that you are given a dendrogram and you are trying to translate this into a disjoint set of clusters such that the model complexity and model fit are (locally) optimized together. You are also given a procedure that computes AIC for a given clustering. Describe an algorithm/method you would use to identify the final set of clusters.

    1. Find the largest ΔSSE between how many clusters you have

    2. between having x clusters or y clusters, indicates that x clusters divides the cells into much more homogenous groups than does y groups. From this reasoning it could be possible to pick x clusters as the final solution.