Principal Component Analysis PCA using Singular Value Decomposition SVD - Tech It Yourself

## Friday, 30 April 2021

1. Principal Component Analysis (PCA)

- Unsupervised estimator

- Dimensionality reduction algorithm: PCA reduces the dimensionality of a dataset by transforming a large set of features into a smaller one that still keeps most of the information in the dataset. This improves computational performance at the cost of a small loss in accuracy. Lower-dimensional data is also easier and faster to explore and visualize.

- Noise filtering

2. Tools for PCA

2.1 Standardization

Standardize the range of the continuous initial features so that each one contributes equally to the analysis. This prevents features with larger ranges from dominating those with smaller ranges.
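As a minimal sketch of this step, assuming z-score standardization (subtract the mean, divide by the standard deviation) and a small made-up dataset with observations in rows:

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows), 2 features (columns)
# whose ranges differ by several orders of magnitude.
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0, 1200.0],
              [4.0, 1800.0],
              [5.0, 2000.0]])

# z-score: subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # close to 0 for every feature
print(X_std.std(axis=0))   # close to 1 for every feature
```

After this step both columns are on the same scale, so neither dominates the covariance computed next.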

2.2 Covariance Matrix:

It shows how the features of the dataset vary from the mean with respect to each other (that is, the relationship between them). This matters because features are sometimes so highly correlated that they contain redundant information.

The sign of the covariance means:

- if positive: the two variables increase or decrease together (correlated)

- if negative: one increases when the other decreases (inversely correlated)
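The sign rule can be checked on a small made-up example. This sketch assumes NumPy's `np.cov` with `rowvar=False`, so that columns are treated as features; the variable names are illustrative only.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0       # increases when x increases
y_down = -3.0 * x + 10.0   # decreases when x increases

# Columns are features, rows are observations
data = np.column_stack([x, y_up, y_down])
cov = np.cov(data, rowvar=False)

print(cov[0, 1])  # positive: x and y_up move together
print(cov[0, 2])  # negative: x and y_down move in opposite directions
```

The diagonal of `cov` holds the variance of each feature, and the matrix is symmetric.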

2.3 Eigenvectors and Eigenvalues

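An eigenvector of a transformation is a nonzero vector that stays on its own line when the transformation is applied; it is only scaled, and the scale factor is the eigenvalue.

For a square matrix A, an eigenvector v and its eigenvalue λ make this equation true:

Av = λv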

Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components.

Principal components are new features that are constructed as linear combinations of the initial features. These combinations are done in such a way that the new features (i.e., principal components) are uncorrelated and most of the information within the initial features is compressed into the first components. PCA tries to put maximum possible information in the first component, then maximum remaining information in the second and so on.

PCA lets you reduce dimensionality without losing much information by discarding the components that carry little information and treating the remaining components as your new variables.

The eigenvectors of the covariance matrix are the directions of greatest variance, and the eigenvalues give the amount of variance carried along each eigenvector. Ranking the eigenvectors by their eigenvalues, highest to lowest, gives the principal components in order of significance.

In order to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of eigenvalues.
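The ranking and the percentage computation can be sketched as follows. The data is randomly generated for illustration, and `np.linalg.eigh` is used on the assumption that the covariance matrix is symmetric (which it always is).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustration only

# Hypothetical data: 200 observations, 3 features; the third feature is
# almost a combination of the first two, so it adds little information.
base = rng.normal(size=(200, 2))
third = base[:, 0] + base[:, 1] + 0.05 * rng.normal(size=200)
X = np.column_stack([base[:, 0], base[:, 1], third])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i belongs to eigenvalues[i]

# Share of the total variance carried by each principal component
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)
```

Because the third feature is nearly redundant, the last entry of `explained_ratio` should come out very small.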

2.4 Feature vector

Once we have the principal components, we choose whether to keep all of them or discard those of lesser significance (low eigenvalues). From the remaining ones, we form a matrix whose columns are the kept eigenvectors; this matrix is called the feature vector.

2.5 Project the data along the principal component axes

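Project the data by multiplying the transpose of the feature vector (the matrix with the kept eigenvectors as its columns) by the transpose of the standardized data set: FinalData = FeatureVectorᵀ × StandardizedDataᵀ. With observations stored as rows, this is the same as StandardizedData × FeatureVector.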
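A small sketch of the projection, assuming observations are stored as rows and the kept eigenvectors are stored as columns of `feature_vector` (a name chosen here for illustration). It computes both the row form and the transposed form and shows that they agree.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, illustration only

# Hypothetical data: 100 observations, 3 correlated features
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.1]])
X = rng.normal(size=(100, 3)) @ mixing
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]

k = 2  # number of principal components to keep
feature_vector = eigenvectors[:, order[:k]]  # shape (3, k)

# Row form: one projected observation per row
projected = X_std @ feature_vector            # shape (100, k)
# Transposed form: one projected observation per column
projected_t = feature_vector.T @ X_std.T      # shape (k, 100)

print(projected.shape, projected_t.shape)
```

The projected columns are uncorrelated, and their variances equal the kept eigenvalues.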

3. Singular Value Decomposition - SVD

Singular Value Decomposition is a matrix factorization method used in PCA. It provides a way to extract the principal components.

The Singular Value Decomposition states that any matrix A (n x p) can be factored into A = UΣVᵀ where:
- U is an orthogonal n x n matrix whose columns uᵢ are orthonormal eigenvectors of AAᵀ.
- V is an orthogonal p x p matrix whose columns vᵢ are orthonormal eigenvectors of AᵀA; Vᵀ is its transpose.
- Σ is a diagonal n x p matrix whose diagonal elements are ordered so that Σii ≥ Σjj for all i < j (descending order). The nonzero elements, the singular values σᵢ, are the square roots of the positive eigenvalues of AAᵀ or AᵀA (AAᵀ and AᵀA have the same positive eigenvalues).
- uᵢ and vᵢ have unit length.
Example to find SVD of a matrix:
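A minimal numerical sketch with NumPy, using a small matrix chosen here for illustration. `np.linalg.svd` returns the singular values as a 1-D array, so Σ has to be rebuilt by hand.

```python
import numpy as np

# Hypothetical 2 x 3 matrix chosen for illustration
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# full_matrices=True (the default) gives U as 2 x 2 and VT as 3 x 3
U, s, VT = np.linalg.svd(A)

# Rebuild the 2 x 3 Sigma matrix from the singular values
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(s)               # approximately [5, 3]
print(U @ Sigma @ VT)  # reproduces A up to rounding

# The squared singular values are the eigenvalues of A Aᵀ
print(np.linalg.eigvalsh(A @ A.T)[::-1])  # approximately [25, 9]
```

Here AAᵀ = [[17, 8], [8, 17]] has eigenvalues 25 and 9, whose square roots are the singular values 5 and 3.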
Applying A to a vector x (Ax = UΣVᵀx) can be visualized geometrically as three successive steps:
- Vᵀ represents a rotation or reflection of vectors.
- Σ represents a scaling (dilation) along the coordinate axes.
- U represents another rotation or reflection.
Figure: geometry form of SVD
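Another form of SVD writes A as a sum of rank-one matrices: A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + … + σᵣuᵣvᵣᵀ, where r is the rank of A.
The σᵢ are in descending order, so the most significant terms come first. Example: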
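A short sketch of the form in which A is written as a sum of rank-one terms σᵢuᵢvᵢᵀ, using a small matrix chosen here for illustration:

```python
import numpy as np

# Hypothetical 2 x 3 matrix chosen for illustration
A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# full_matrices=False keeps only the singular vectors that pair with a singular value
U, s, VT = np.linalg.svd(A, full_matrices=False)

# One rank-one matrix per singular value: sigma_i * u_i * v_i^T
terms = [s[i] * np.outer(U[:, i], VT[i, :]) for i in range(len(s))]

A_sum = sum(terms)
print(A_sum)  # reproduces A up to rounding
```

Each entry of `terms` has rank one, and the first term carries the largest singular value.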

We can use SVD for dimensionality reduction by keeping only the important components. This can be applied when a dataset has more features (columns) than observations (rows), reducing it to a smaller number of features. If we keep only the k largest singular values, we get an approximation B of the matrix A: B = UₖΣₖVₖᵀ, where Uₖ and Vₖ hold the first k columns of U and V and Σₖ is the top-left k x k block of Σ.
Example:
σ₂ is very small, so we can ignore it.
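To illustrate why a small singular value can be dropped, here is a sketch on a made-up matrix that is almost rank one. The noise level and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, illustration only

# Hypothetical 4 x 2 matrix: an exact rank-one pattern plus a little noise
A = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0]) + 0.01 * rng.normal(size=(4, 2))

U, s, VT = np.linalg.svd(A, full_matrices=False)
print(s)  # the second singular value is tiny compared to the first

# Keep only the k largest singular values
k = 1
B = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]

# Frobenius norm of the approximation error
error = np.linalg.norm(A - B)
print(error, np.sqrt(np.sum(s[k:] ** 2)))  # the two numbers agree
```

The Frobenius norm of the error equals the square root of the sum of the squared singular values that were dropped, so ignoring a tiny σ₂ changes the matrix only slightly.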
Example: the dataset follows the relation y = 2x; we use SVD to reduce it.
import numpy as np
from scipy.linalg import svd

# define a 2 x 4 matrix whose second row is twice the first (y = 2x)
A = np.array([[1, 2], [2, 4], [3, 6], [4, 8]]).T

# singular value decomposition: U is 2 x 2, s holds the singular values, VT is 4 x 4
U, s, VT = svd(A)
# create the m x n Sigma matrix
Sigma = np.zeros(A.shape)
# populate its top-left block with the diagonal matrix of singular values
Sigma[:len(s), :len(s)] = np.diag(s)
# keep only the largest singular value
n_elements = 1
Sigma = Sigma[:, :n_elements]
VT = VT[:n_elements, :]
# reconstruct A from the kept component
B = U.dot(Sigma.dot(VT))
print(B)
# transform: the reduced representation of the rows of A
T = U.dot(Sigma)
print(T)

Input: [[1. 2. 3. 4.] [2. 4. 6. 8.]]
Output: [[ -5.47722558] [-10.95445115]]
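To tie this section back to PCA, the following sketch compares the two routes on randomly generated data: eigenvalues of the covariance matrix versus the SVD of the centered data matrix. It assumes observations are stored as rows and uses the sample covariance (division by n - 1), which is what `np.cov` computes by default.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed, illustration only

# Hypothetical data: 50 observations, 3 correlated features
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)  # center each feature
n = Xc.shape[0]

# Route 1: eigenvalues of the covariance matrix, sorted in descending order
eigvals = np.linalg.eigh(np.cov(Xc, rowvar=False))[0][::-1]

# Route 2: SVD of the centered data
U, s, VT = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s ** 2 / (n - 1)

print(eigvals)
print(svd_vals)  # same values as eigvals

# Rows of VT are the principal axes; U * s gives the projected data
scores = U * s
print(np.allclose(scores, Xc @ VT.T))
```

The squared singular values divided by n - 1 match the covariance eigenvalues, so the principal components can be obtained without forming the covariance matrix explicitly.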
4. Practice
4.1 Simple PCA
By eye, it is clear that there is a nearly linear relationship between the x and y variables.
Instead of attempting to predict the y values from the x values, the unsupervised learning problem attempts to learn about the relationship between the x and y values.
Principal component analysis quantifies this relationship by finding a list of the principal axes in the data, and using those axes to describe the dataset.

The explained variance tells you how much information (variance) can be attributed to each of the principal components. This matters because projecting data onto fewer dimensions loses some of the variance (information). By using the attribute explained_variance_ratio_, you can see that the first principal component contains 98% of the variance and the second principal component contains 2% of the variance. Together, the two components contain 100% of the information.
The red and green vectors represent the principal axes of the data, and the length of each vector indicates how "important" that axis is in describing the distribution of the data; more precisely, it is a measure of the variance of the data when projected onto that axis. The projections of the data points onto the principal axes are the "principal components" of the data.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

sns.set()

rng = np.random.RandomState(8)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.figure(1)
plt.scatter(X[:, 0], X[:, 1])

pca = PCA(n_components=2)
pca.fit(X)
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    comp = comp * np.sqrt(var)  # scale the unit axis by the standard deviation along it
    plt.plot([0, comp[0]], [0, comp[1]], label=f"Component {i}", linewidth=5,
             color=f"C{i + 2}")
plt.gca().set(aspect='equal',
              title="2-dimensional dataset with principal components",
              xlabel='first feature', ylabel='second feature')
plt.legend()

fig = plt.figure(2)
fig.suptitle('projected')
# coordinates of the centered data along the two principal axes
X_pca = pca.transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1])

plt.show()

4.2 Dimensionality reduction
Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.
From the example above, the second component contains only 2% of the variance, so it can be removed.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

sns.set()

rng = np.random.RandomState(8)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.figure(1)
plt.scatter(X[:, 0], X[:, 1])

pca = PCA(n_components=1)
pca.fit(X)
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    comp = comp * np.sqrt(var)  # scale the unit axis by the standard deviation along it
    plt.plot([0, comp[0]], [0, comp[1]], label=f"Component {i}", linewidth=5,
             color=f"C{i + 2}")
plt.gca().set(aspect='equal',
              title="2-dimensional dataset with principal components",
              xlabel='first feature', ylabel='second feature')

# reduce to 1 dimension, then map back to the original 2-dimensional space
X_pca = pca.transform(X)
X_projected = pca.inverse_transform(X_pca)
# mean squared reconstruction error caused by dropping the second component
loss = ((X - X_projected) ** 2).mean()
print(loss)

fig = plt.figure(2)
fig.suptitle('projected')
plt.scatter(X_projected[:, 0], X_projected[:, 1])
plt.show()

4.3 PCA for visualization
from sklearn.datasets import load_digits
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

sns.set()

# load the 8 x 8 digit images (64 features per sample)
digits = load_digits()

pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('RdBu', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()
plt.show()


4.4 Choosing the number of components
We can determine this by looking at the cumulative explained variance ratio as a function of the number of components.
We can see that the first 10 components contain approximately 75% of the variance, while around 50 components are needed to describe close to 100% of the variance.
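A minimal sketch of how such a cumulative curve can be computed, assuming the digits dataset from section 4.3 and scikit-learn's `PCA` fitted with all components kept. The thresholds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Fit PCA without limiting the number of components (64 for this dataset)
pca = PCA().fit(digits.data)

# Running total of the variance ratio as components are added
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches each threshold
for threshold in (0.75, 0.90, 0.99):
    n_needed = int(np.searchsorted(cumulative, threshold) + 1)
    print(f"{threshold:.0%} of variance: {n_needed} components")
```

Plotting `cumulative` against the component count (for example with `plt.plot`) gives the curve described above. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` to select the number of components for a target variance ratio directly.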