# test2

What are sample applications of clustering in which the objective is to cluster the samples/observations (this is the most common use of clustering)?

Market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; image segmentation/compression, where we try to group similar regions together; document clustering based on topics; etc.

What are sample applications of clustering in which the objective is to cluster the variables/features?

Clustering genes based on their expression profiles across samples in bioinformatics, or grouping highly correlated/redundant features together to reduce dimensionality before further analysis.

Clustering can be loosely defined as “grouping items that are similar/close to each other”. What components do we need to formulate to have a well-defined computational framework for clustering?

Proximity measure (how similar/close two items are)

Criterion function to define good and bad clusterings

Algorithm to compute clustering

What are possible ways of choosing the number of clusters?

K-means

Assume we know how many clusters there are

Find the number of clusters that minimizes the SSE (sum of squared errors)

Incorporate the trade-off between complexity and fit into the objective function

Evaluate the trade-off between complexity and fit after computing clusterings by trying different values of K

Hierarchical clustering

Pick number after a distance tree is made
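The SSE-based option above is often applied as the "elbow" heuristic: compute the clustering for a range of K values and look for the point where the SSE stops dropping sharply. A minimal NumPy-only sketch (the blob data, seeds, and helper names are illustrative, not from the course):

```python
import numpy as np

def kmeans_sse(X, k, n_iter=50, seed=0):
    """Run Lloyd's algorithm once and return the final sum of squared errors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids (keep the old one if a cluster goes empty).
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

def best_sse(X, k, restarts=5):
    # Best of several random restarts, since one run may hit a bad local optimum.
    return min(kmeans_sse(X, k, seed=s) for s in range(restarts))

# Three well-separated 2-D blobs (synthetic data for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(0, 0), (5, 5), (10, 0)]])

sse = {k: best_sse(X, k) for k in range(1, 6)}
# SSE drops sharply up to k = 3 (the true number of blobs) and then
# flattens out: the "elbow" at k = 3 suggests three clusters.
```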

What is the objective function of K-means clustering? Can K-means still be applied if our notion of distance is not Euclidian distance?

The objective is to minimize the sum of squared errors (SSE): the sum, over all points, of the squared distance between each point and the centroid of its assigned cluster

K-means finds a locally optimal solution and is not guaranteed to find the (globally) optimal solution

Yes, the distance function can be changed depending on the circumstances, but then the centroid-update step must change to match (e.g., use the median for Manhattan distance, as in K-medians)
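A small sketch of the objective on toy data (the points and assignments are illustrative), which also shows why the update step uses the mean: the mean minimizes the sum of squared Euclidean distances, so nudging a centroid away from it can only increase the objective.

```python
import numpy as np

# K-means objective: the sum of squared distances from each point to the
# centroid of its assigned cluster (points and assignments are illustrative).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [10.0, 9.0]])
labels = np.array([0, 0, 0, 1, 1])
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])

sse = sum(np.sum((x - centroids[c]) ** 2) for x, c in zip(X, labels))

# Shifting a centroid away from the cluster mean increases the objective.
shifted = centroids.copy()
shifted[0] += 0.5
sse_shifted = sum(np.sum((x - shifted[c]) ** 2) for x, c in zip(X, labels))
assert sse_shifted > sse
```

If the distance is not Euclidean, the same loop applies but the "centroid" must be redefined so that it still minimizes the within-cluster sum of distances under the new metric.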

Is K-means guaranteed to identify the optimal clustering according to its objective function? If not, what can we say about the optimality of the clustering it identifies?

No, it is not guaranteed.

K-means converges to a locally optimal solution: each iteration can only decrease (or preserve) the objective, but the final clustering may not be the one that globally minimizes the objective function.

Considering an instance of clustering where we are trying to cluster points/observations in 2-dimensional space, can you give a sample instance and an initialization that will make K-means converge to an undesirable solution?

Yes. K-means separates clusters with linear (Voronoi) boundaries and favors compact, spherical clusters. For example, take two tight pairs of points far apart horizontally: if the two initial centroids both lie midway between the pairs, one slightly above and one slightly below, K-means immediately converges to a top/bottom split with a much higher SSE than the natural left/right split.
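A concrete instance of a bad initialization (a minimal Lloyd's-algorithm sketch; the points and starting centroids are illustrative):

```python
import numpy as np

def lloyd(X, centroids, n_iter=20):
    """Plain Lloyd iterations from a given initialization."""
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    sse = (np.linalg.norm(X - centroids[labels], axis=1) ** 2).sum()
    return centroids, labels, sse

# Two tight vertical pairs, far apart: the natural clusters are left and right.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Good initialization: one centroid per true cluster.
_, _, sse_good = lloyd(X, np.array([[0.0, 0.5], [10.0, 0.5]]))   # SSE = 1.0

# Bad initialization: both centroids on the horizontal midline.
# K-means converges immediately to a top/bottom split and stays there.
_, _, sse_bad = lloyd(X, np.array([[5.0, 0.0], [5.0, 1.0]]))     # SSE = 100.0
```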

Consider a run of K-means in 1-dimensional space, with say K=2 clusters. If you are shown the configurations of the cluster centroids and assignment of data points to clusters at the end of each iteration, can you order those iterations? (work out an example for yourself)

Yes. The SSE never increases from one iteration to the next: both the assignment step and the centroid-update step can only decrease or preserve it. So the snapshots can be ordered by decreasing SSE.
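A minimal 1-D run with K = 2 demonstrating the ordering principle (the data and starting centroids are illustrative): the per-iteration SSE values form a non-increasing sequence, which is exactly what lets you order the snapshots.

```python
import numpy as np

# 1-D K-means, K = 2, tracking the objective after every iteration.
X = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
centroids = np.array([0.0, 2.0])          # deliberately poor start
history = []
for _ in range(10):
    labels = np.abs(X[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([X[labels == j].mean() for j in (0, 1)])
    history.append(((X - centroids[labels]) ** 2).sum())

# Each assignment step and each update step can only lower (or keep) the
# SSE, so the recorded values never increase.
assert all(a >= b - 1e-12 for a, b in zip(history, history[1:]))
```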

What are the weaknesses of K-means?

Sensitive to outliers

User needs to specify k

Only applicable when a mean is defined, i.e., the data lives in a space where a centroid can be computed

What are possible strategies for making algorithms more robust to outliers in the application of K-means?

Remove outliers

Perform random sampling: cluster a small random subset of the data points (an outlier has a low chance of being sampled), then assign the remaining points to the nearest centroid
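As a small illustration of why outliers hurt centroid-based methods (and why removal or sampling helps): a single far-away point drags the mean, while the median, the 1-D analogue of a medoid, barely moves. The numbers here are illustrative.

```python
import numpy as np

# A tight 1-D cluster plus one distant outlier.
cluster = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
with_outlier = np.append(cluster, 100.0)

assert abs(cluster.mean() - 1.04) < 1e-9        # clean mean sits in the bulk
assert with_outlier.mean() > 17                  # mean pulled toward the outlier
assert abs(np.median(with_outlier) - 1.05) < 1e-9  # median is barely affected
```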

What are the main components of an agglomerative hierarchical clustering algorithm (i.e., what choices do we need to make to define an algorithm?)

A distance measure specifying pairwise distances between items (a distance matrix)

A linkage criterion defining the distance between groups of items, used to decide which pair of clusters to merge next

Note that the algorithm is greedy and bottom-up: merge decisions cannot be undone

What are the possible ways of assessing the distance between groups of items given the distances between pairs of items?

Closest pair (single linkage)

Farthest pair (complete linkage)

Average of all pairs (average linkage)
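The three group distances can be computed directly from the pairwise distances (a small NumPy sketch; the two clusters are illustrative):

```python
import numpy as np

# Two 1-D clusters: A = {0, 1}, B = {4, 6}.
A = np.array([[0.0], [1.0]])
B = np.array([[4.0], [6.0]])

# All pairwise distances between a point of A and a point of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pairwise.min()    # 3.0: closest pair (1 vs 4)
complete = pairwise.max()    # 6.0: farthest pair (0 vs 6)
average  = pairwise.mean()   # 4.5: average of all four pairs
```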

Can you give an example of hierarchical clustering in 2-dimensional space with Euclidian distance as the distance function, such that single-linkage and complete-linkage will produce the same dendrogram?

Can you give an example of hierarchical clustering in 2-dimensional space with Euclidian distance as the distance function, such that single-linkage and complete-linkage will produce different dendrograms?
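A tiny agglomerative implementation can exhibit both cases: on two tight, well-separated pairs the merge sequence is identical under single and complete linkage, while on a "chain" of points the two linkages merge in different orders, producing different dendrograms. All point sets and helper names here are illustrative (1-D points are used for readability; the same behavior occurs in 2-D).

```python
def pair_dists(points, a, b):
    return [abs(points[i] - points[j]) for i in a for j in b]

single   = lambda p, a, b: min(pair_dists(p, a, b))   # closest pair
complete = lambda p, a, b: max(pair_dists(p, a, b))   # farthest pair

def agglomerate(points, group_dist):
    """Naive agglomerative clustering; records each merge as the
    frozenset of member indices of the newly formed cluster."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Greedily pick the closest pair of clusters under the linkage.
        a, b = min(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda ab: group_dist(points, ab[0], ab[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges

# Two tight, well-separated pairs: both linkages merge identically.
separated = [0.0, 1.0, 10.0, 11.0]
assert agglomerate(separated, single) == agglomerate(separated, complete)

# A chain of points: single linkage grows one cluster point by point,
# while complete linkage pairs up the two ends first.
chain = [0.0, 1.0, 2.1, 3.3]
assert agglomerate(chain, single) != agglomerate(chain, complete)
```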

What is the key idea of Ward’s method?

When considering joining two clusters, measure how much the merge would increase the total squared distance of points to their cluster centroids (the SSE).

At each step, merge the pair of clusters whose merge causes the smallest increase.
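Ward's merge cost has a closed form, ΔSSE = n₁n₂/(n₁+n₂) · ‖μ₁ − μ₂‖², which can be checked against the direct SSE difference (the two clusters below are illustrative):

```python
import numpy as np

def sse(X):
    """Within-cluster sum of squared distances to the centroid."""
    return ((X - X.mean(axis=0)) ** 2).sum()

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[5.0, 5.0], [6.0, 5.0]])

# Direct computation: SSE of the merged cluster minus the parts.
direct = sse(np.vstack([A, B])) - sse(A) - sse(B)

# Closed form used by Ward's method.
n1, n2 = len(A), len(B)
formula = n1 * n2 / (n1 + n2) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

assert abs(direct - formula) < 1e-9
```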

What is the common requirement of K-means and hierarchical clustering using Ward’s method?

Both require that cluster means (centroids) be computable, i.e., the data must live in a space where averaging points makes sense

What are the pros and cons of applying a divisive hierarchical clustering algorithm that, for example, uses K-means with K=2 to split the points at each node of the dendrogram (as compared to agglomerative clustering)?

Agglomerative clustering is faster to compute, but each merge decision only looks at pairwise (cluster-to-cluster) distances

Divisive clustering is “less blind”: each split is made with access to all the points in the node, so it can find the best possible split into two groups, at a higher computational cost

What is the (common) key idea behind Bayesian information criterion (BIC)and Aikaike information criterion (AIC) in formulating a measure to guide the selection of the number of clusters?

Both score a model by trading off fit against complexity: they reward how well the model fits the data in a maximum-likelihood sense, and penalize the number of parameters needed to achieve that fit

What is the key difference between AIC and BIC?

AIC (Akaike information criterion) tries to select the model that most adequately describes an unknown, high dimensional reality.

BIC (Bayesian information criterion) tries to find the TRUE model among the set of candidates.

AIC tends to favor predictive accuracy

BIC penalizes complexity more heavily as the sample size grows, so it tends to select smaller, more parsimonious models
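The two criteria share the same fit term and differ only in the complexity penalty, which a direct comparison of the standard formulas makes visible (the log-likelihood value below is a placeholder):

```python
import math

# AIC = 2k - 2 ln L ;  BIC = k ln(n) - 2 ln L
# k: number of free parameters, n: sample size, L: maximized likelihood.
def aic(log_lik, k):
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik

# For n > e^2 (about 7.4 samples), BIC's per-parameter penalty ln(n)
# exceeds AIC's constant 2, so BIC favors smaller models on larger data.
assert bic(0.0, k=5, n=100) > aic(0.0, k=5)
assert bic(0.0, k=5, n=7) < aic(0.0, k=5)
```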

Assume that you are given a dendrogram and you are trying to translate this into a disjoint set of clusters such that the model complexity and model fit are (locally) optimized together. You are also given a procedure that computes AIC for a given clustering. Describe an algorithm/method you would use to identify the final set of clusters.

Cut the dendrogram at successive levels to produce clusterings with k = 1, 2, 3, ... clusters, compute AIC for each resulting set of clusters using the given procedure, and select the level whose clustering minimizes AIC. Intuitively, a large ΔSSE between having x clusters and y clusters indicates that x clusters divides the points into much more homogeneous groups than y clusters does; AIC formalizes this by accepting extra clusters only when the improvement in fit outweighs the added complexity.
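One way to sketch such a method, with a stand-in `compute_aic` for the provided procedure and hand-built dendrogram cuts on deterministic toy data (all names and the SSE-based score are illustrative assumptions, not the course's definitions):

```python
import numpy as np

def compute_aic(X, labels):
    # Stand-in for the provided AIC procedure: a crude SSE-based fit term
    # plus a penalty of 2 parameters per cluster mean per dimension.
    n, d = X.shape
    ks = set(labels.tolist())
    sse = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
              for c in ks)
    return n * np.log(sse / n) + 2 * len(ks) * d

def best_cut(X, cuts):
    """cuts: one label array per dendrogram level (k = 1, 2, ...).
    Returns the index of the level whose clustering minimizes AIC."""
    scores = [compute_aic(X, labels) for labels in cuts]
    return int(np.argmin(scores)), scores

# Two symmetric 2-D blobs of four points each, and three candidate cuts.
blob = np.array([[-0.1, -0.1], [-0.1, 0.1], [0.1, -0.1], [0.1, 0.1]])
X = np.vstack([blob, blob + 5.0])
cuts = [np.zeros(8, dtype=int),                   # k = 1: everything together
        np.array([0, 0, 0, 0, 1, 1, 1, 1]),       # k = 2: the true split
        np.array([0, 0, 2, 2, 1, 1, 1, 1])]       # k = 3: one blob over-split
best, scores = best_cut(X, cuts)                  # best == 1: k = 2 wins
```

The k = 1 cut underfits (huge SSE) and the k = 3 cut pays a complexity penalty its small SSE improvement cannot justify, so the AIC minimum lands at the true two-cluster structure.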