Until now, we have explored supervised learning methods, such as linear regression and classification, where we had $n$ observations on $p$ features $X_1, X_2, \dots, X_p$ and a response variable $Y$. In unsupervised learning, we have only the features $X_1, X_2, \dots, X_p$ and [no associated response]{.underline}. The goal is to discover interesting things about the data, such as the presence of subgroups among the observations, or the relationships between the variables.
We already encountered PCA in the context of principal components regression. When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. The principal component directions are the directions in feature space along which the original data are highly variable. To perform principal components regression, we simply use the principal components as predictors in a regression model in place of the original, larger set of variables.
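As a rough, self-contained sketch of this idea (not taken from the text), the snippet below builds a principal components regression on simulated data: the predictors are replaced by their first `M` principal component scores, and an ordinary least-squares model is fit on those scores. The simulated data, the choice `M = 2`, and all variable names are illustrative assumptions.

```python
# Minimal sketch of principal components regression (PCR),
# assuming numpy and scikit-learn are available.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # n = 100 observations, p = 10 features
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

M = 2                                          # number of components kept (arbitrary here)
Z = PCA(n_components=M).fit_transform(StandardScaler().fit_transform(X))
pcr = LinearRegression().fit(Z, y)             # regress y on the M score vectors instead of X
print(pcr.coef_, pcr.intercept_)
```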
::: definition Principal components analysis (PCA) refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features $X_1, X_2, \dots, X_p$, and no associated response $Y$. :::
In other words, PCA produces a low-dimensional representation of a
dataset. It finds a sequence of linear combinations of the variables
that have maximal variance, and are mutually uncorrelated.
The first principal component $Z_1$ is the linear combination of the features that has the
highest variance across the dataset:
$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p$. The
coefficients $\phi_{11}, \phi_{21}, \dots, \phi_{p1}$ are called the
loadings of the first principal component. The first principal
component vector $\phi_1 = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})$
defines the direction in feature space along which the data varies the
most. Projecting the $n$ data points onto this direction gives the first principal component scores
$z_{11}, z_{21}, \dots, z_{n1}$.
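A minimal numpy sketch of this construction, on simulated data (the toy dataset and all names are assumptions): the first loading vector $\phi_1$ is the top eigenvector of the sample covariance matrix, and the scores are the projections of the centred data onto it.

```python
# Sketch: first principal component from the covariance matrix (numpy only).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[2, 0, 0], [1, 1, 0], [0, 0, 0.3]])

Xc = X - X.mean(axis=0)                    # centre each feature
cov = np.cov(Xc, rowvar=False)             # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
phi1 = eigvecs[:, -1]                      # loading vector phi_1 of the first PC
z1 = Xc @ phi1                             # scores z_11, ..., z_n1
print(phi1, z1.var(ddof=1), eigvals[-1])   # score variance equals the top eigenvalue
```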
Equivalently, we look for the line that maximizes the variance of the
projected data (so that the projections retain as much information as
possible), which is the same line that minimizes the projection error
(the error between the original data and its reconstruction). The total
reconstruction error is the average squared Euclidean distance between
the original data and the reconstructed data:
\(\frac{1}{n} \sum_{i=1}^{n} \|x_{i} - \hat{x}_{i}\|^2\) where
$\hat{x}_{i}$ is the projection of $x_i$ onto the line spanned by the
first principal component. It can be shown that the first principal
component is the line that minimizes the reconstruction error and that
maximizes the variance of the projected data.
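A quick numerical check of this equivalence (reusing `Xc`, `phi1`, and `z1` from the sketch above): the variance of the projections and the average squared reconstruction error add up to the fixed total variance of the data, which is why minimizing one term is the same as maximizing the other.

```python
# Sketch: (variance of projected data) + (average squared reconstruction error)
# equals the total variance of the centred data.
proj = np.outer(z1, phi1)                                  # reconstructions of the centred points
recon_err = np.mean(np.sum((Xc - proj) ** 2, axis=1))      # average squared distance to the line
total_var = np.mean(np.sum(Xc ** 2, axis=1))               # average squared distance to the centre
print(np.isclose(z1.var(ddof=0) + recon_err, total_var))   # True
```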
The geometrical interpretation of PCA is that the first principal
component vector $\phi_1$ defines the line that is as close (in terms of
Euclidean distance) as possible to the data. The second principal
component is the line that is orthogonal to the first and that is
as close as possible to the data; the third is orthogonal to the first
two, and so on. So, for instance, the first two principal components of
a data set span the plane that is closest to the $n$ observations, in
terms of average squared Euclidean distance. Moreover, since the
projection of a point onto the principal component line is orthogonal to
that line, by Pythagoras the squared distance between a point and the
centre of the data decomposes into the squared distance between the
point and its projection plus the squared distance between the
projection and the centre; this is why minimizing the reconstruction
error is equivalent to maximizing the variance of the projected points.
Remark: If the variables are in different units, scaling each to
have standard deviation equal to one is recommended.
We can now ask a natural question: how much of the information in a
given data set is lost by projecting the observations onto the first few
principal components? That is, how much of the variance in the data is
not contained in the first few principal components?
In general, we are interested in knowing the proportion of variance
explained (PVE) by each principal component, to understand the
strength of each component. Assuming the variables have been centred to
have mean zero, the PVE of the $m$th principal component is:
\(\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2}\)
The PVE of each principal component is a positive quantity, and the PVEs
of all $p$ components sum to one. Reporting the explained variance ratio
of, say, the first two principal components therefore tells us how much
of the dataset's variance these two components capture. Understanding
explained variance is essential because we aim to reduce dimensions
without losing too much information: a high explained variance in the
first few PCs indicates that we can capture most of the dataset's
information with fewer dimensions, simplifying analysis and
visualization.
Moreover, components with high explained variance reveal the dominant
patterns in the data, which are useful for understanding relationships
or clustering trends.
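As a hedged illustration on simulated data (all names are illustrative), the PVE can be computed directly from the formula above and compared with scikit-learn's `explained_variance_ratio_`.

```python
# Sketch: proportion of variance explained (PVE), assuming centred data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

pca = PCA().fit(Xc)
Z = pca.transform(Xc)                                  # score vectors z_im
pve_manual = (Z ** 2).sum(axis=0) / (Xc ** 2).sum()    # the formula above
print(np.allclose(pve_manual, pca.explained_variance_ratio_))  # True
print(np.cumsum(pve_manual))                           # cumulative PVE
```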
Clustering refers to a very broad set of techniques for finding
subgroups, or clusters, in a data set. We seek a partition of the data
into distinct groups so that the observations within each group are
quite similar to each other. Therefore, we must define what it means for
two or more observations to be similar or different. Clustering looks
for homogeneous subgroups among the observations.
There are two main types of clustering:
In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.
In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.
::: definition Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:
$C_1 \cup C_2 \cup \ldots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the $K$ clusters.
$C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.
For instance, if the $i$th observation is in the $k$th cluster, then $i \in C_k$. :::
The idea behind K-means clustering is that a good clustering is one for
which the within-cluster variation is as small as possible. The
within-cluster variation for the $k$th cluster, $W(C_k)$, is a measure
of the amount by which the observations within that cluster differ from
each other.
Hence, we want to solve the problem:
\(\min_{C_1, \ldots, C_K} \left( \sum_{k=1}^{K} W(C_k) \right)\) (6.1)
In words, this formula says that we want to partition the observations into
$K$ clusters such that the total within-cluster variation, summed over all
$K$ clusters, is as small as possible.
How is $W(C_k)$ defined? We use the Euclidean distance:
\(W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\) (6.2)
where $|C_k|$ denotes the number of observations in the $k$th cluster.
In other words, $W(C_k)$ is the sum of all of the pairwise squared
Euclidean distances between the observations in the $k$th cluster,
divided by the total number of observations in the $k$th cluster.
Thus, combining (6.1) and (6.2), we can write the K-means clustering
problem as:
\(\min_{C_1, \ldots, C_K} \left( \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)\) (6.3)
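A small sketch of this objective on simulated data with an arbitrary assignment (both are assumptions made for the example): it computes $W(C_k)$ from the pairwise formula and sums it over the clusters.

```python
# Sketch: total within-cluster variation for a given cluster assignment.
import numpy as np

def within_cluster_variation(X, labels, k):
    """W(C_k): sum of pairwise squared Euclidean distances, divided by |C_k|."""
    C = X[labels == k]
    diffs = C[:, None, :] - C[None, :, :]          # |C_k| x |C_k| x p array of differences
    return (diffs ** 2).sum() / len(C)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
labels = rng.integers(0, 3, size=60)               # an arbitrary assignment into K = 3 clusters
objective = sum(within_cluster_variation(X, labels, k) for k in range(3))
print(objective)                                    # the quantity minimized in (6.3)
```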
The algorithm for K-means clustering is as follows:
Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
Iterate until the cluster assignments stop changing:
For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster.
Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
Figure: K-means clustering algorithm example. Final result after 10 iterations.
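The following is a minimal sketch of the algorithm above on simulated data, not a production implementation (for simplicity it assumes no cluster becomes empty during the iterations; the toy data and names are assumptions).

```python
# Sketch: K-means following the two steps above
# (random initial assignment, then alternate centroid / reassignment steps).
import numpy as np

def kmeans(X, K, rng):
    labels = rng.integers(0, K, size=len(X))             # step 1: random initial assignment
    while True:                                          # step 2: iterate until stable
        # step 2(a): centroid = vector of feature means per cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2(b): assign each observation to the closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, K=2, rng=rng)
print(centroids)
```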
This algorithm is guaranteed to decrease the value of the objective (6.3)
at each step, because:
\(\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2\) (6.4)
where $\bar{x}_{kj}$ is the mean of feature $j$ over the observations in the $k$th cluster.
The right-hand side of (6.4) shows that, for a fixed assignment, the
cluster means minimize the within-cluster sum of squares, and reallocating
each observation to the cluster with the closest centroid can only
decrease it. This means that as the algorithm runs, the clustering
obtained will continually improve until the result no longer changes; at
that point a local optimum has been reached. One then selects the best
solution, i.e. the one for which the objective (6.3) is smallest.
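A quick numerical check of identity (6.4) on a small simulated cluster (the data are an arbitrary assumption):

```python
# Sketch: verify the pairwise form equals twice the sum of squared
# deviations from the centroid, for one cluster of observations.
import numpy as np

rng = np.random.default_rng(5)
C = rng.normal(size=(7, 3))                              # observations in one cluster
diffs = C[:, None, :] - C[None, :, :]
lhs = (diffs ** 2).sum() / len(C)                        # pairwise form, left side of (6.4)
rhs = 2 * ((C - C.mean(axis=0)) ** 2).sum()              # centroid form, right side of (6.4)
print(np.isclose(lhs, rhs))                              # True
```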
Because the K-means algorithm finds a local rather than a global optimum,
the results obtained will depend on the initial (random) cluster
assignment of each observation in Step 1 of the algorithm. For this
reason, it is important to run the algorithm multiple times from
different random initial configurations.
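A sketch of this strategy using scikit-learn's `KMeans` on simulated data: run the algorithm from several random initialisations and keep the fit with the smallest objective (`inertia_`, the total within-cluster sum of squares). In practice, the `n_init` parameter of `KMeans` does this internally.

```python
# Sketch: keep the best of several K-means runs (smallest objective).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
best = min(
    (KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X) for seed in range(20)),
    key=lambda km: km.inertia_,          # inertia_ = total within-cluster sum of squares
)
print(best.inertia_)
```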
Scaling of the variables matters!
How many clusters should we choose? For K-means, the Elbow method is
popular, but it is subjective and depends on the application; it also
measures only a global clustering characteristic. For an in-depth
discussion, see The Elements of Statistical Learning, Chapter 14.
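A sketch of the Elbow method on simulated data with three true clusters (an assumption made for the example): plot the total within-cluster sum of squares against $K$ and look for the "elbow" where the decrease flattens.

```python
# Sketch: Elbow method via the K-means inertia for a range of K.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 3, 6)])   # 3 true clusters

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Total within-cluster sum of squares")
plt.show()
```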