#06 | Principal Component Analysis (PCA) & Dimensionality Reduction Techniques
Learn how to group seven variables into two components so you can plot them in a 2-dimensional chart, and why PCA is an essential technique for visualizing clustering models.
© Jesús López 2022
Ask him any questions on Twitter or LinkedIn
Chapter Importance
We used just two variables out of the seven we had in the whole DataFrame.
We could have computed better cluster models by giving more information to the Machine Learning model. Nevertheless, it would have been harder to plot seven variables with seven axes in a graph.
Is there anything we can do to compute a clustering model with more than two variables and later represent all the points along with their variables?
- Yes, everything is possible with data. As one of my teachers told me: "you can torture the data until it gives you what you want" (sometimes it's unethical, so behave).
We'll develop the code to show you the need for dimensionality reduction techniques, specifically Principal Component Analysis (PCA).
Load the Data
Imagine for a second you are the president of the United States of America, and you are considering creating campaigns to reduce car accidents.
You won't create 51 TV campaigns, one for each of the States of the USA (rows). Instead, you will see which States behave similarly to cluster them into 3 groups based on the variation across their features (columns).
import seaborn as sns
df_crashes = sns.load_dataset(name='car_crashes', index_col='abbrev')
df_crashes
Check this website to understand the measures of the following data.
Data Preprocessing
From the previous chapter, we should know that we need to preprocess the Data so that variables with different scales can be compared.
For example, an increase of 1 kg in weight is not comparable to an increase of 1 m in height.
We will use the StandardScaler() algorithm:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_crashes)
data_scaled[:5]
array([[ 0.73744574, 1.1681476 , 0.43993758, 1.00230055, 0.27769155,
-0.58008306, 0.4305138 ],
[ 0.56593556, 1.2126951 , -0.21131068, 0.60853209, 0.80725756,
0.94325764, -0.02289992],
[ 0.68844283, 0.75670887, 0.18761539, 0.45935701, 1.03314134,
0.0708756 , -0.98177845],
[ 1.61949811, -0.48361373, 0.54740815, 1.67605228, 1.95169961,
-0.33770122, 0.32112519],
[-0.92865317, -0.39952407, -0.8917629 , -0.594276 , -0.89196792,
-0.04841772, 1.26617765]])
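Under the hood, StandardScaler subtracts each column's mean and divides by its (population) standard deviation. Here is a minimal sketch reproducing the same numbers by hand, purely as a sanity check:

# Manual standardization: (value - column mean) / column std (population std, ddof=0)
manual_scaled = (df_crashes - df_crashes.mean()) / df_crashes.std(ddof=0)
manual_scaled.head()   # matches data_scaled[:5] up to floating-point precision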
Let's turn the array into a DataFrame for better understanding:
import pandas as pd
df_scaled = pd.DataFrame(data_scaled, index=df_crashes.index, columns=df_crashes.columns)
df_scaled
Now all the variables share the same scale (i.e., they move within similar limits):
df_scaled.agg(['min', 'max'])
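We can also confirm that each column now has (approximately) zero mean and unit standard deviation; a quick optional check:

# Mean ~0 and standard deviation ~1 for every column after scaling
df_scaled.agg(['mean', 'std']).round(2)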
k-Means Model in Python
We follow the usual Scikit-Learn procedure to develop Machine Learning models.
Import the Class
from sklearn.cluster import KMeans
Instantiate the Class
model_km = KMeans(n_clusters=3)
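Note that KMeans starts from random centroids, so the cluster labels can change between runs. If you want reproducible results, you can fix the seed (random_state=42 below is just an arbitrary example value, not part of the original code):

# Optional: fix the random seed so the cluster labels are reproducible across runs
model_km = KMeans(n_clusters=3, random_state=42)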
Fit the Model
model_km.fit(X=df_scaled)
KMeans(n_clusters=3)
Calculate Predictions
model_km.predict(X=df_scaled)
array([1, 1, 1, 1, 2, 0, 2, 1, 2, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2, 2,
2, 2, 0, 1, 1, 0, 0, 0, 2, 0, 2, 1, 1, 0, 1, 0, 1, 2, 1, 1, 1, 1,
0, 0, 0, 0, 1, 0, 1], dtype=int32)
Create a New DataFrame for the Predictions
df_pred = df_scaled.copy()
Create a New Column for the Predictions
df_pred.insert(0, 'pred', model_km.predict(X=df_scaled))
df_pred
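Before plotting, it can be useful to see how many States fall into each cluster (a quick, optional check):

# Number of States assigned to each cluster
df_pred.pred.value_counts()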
Visualize the Model
Now let's visualize the clusters with a 2-axis plot:
sns.scatterplot(x='total', y='speeding', hue='pred',
data=df_pred, palette='Set1');
Model Interpretation
Does the visualization make sense?
- No, because the clusters should separate their points from others. Nevertheless, we see some green points in the middle of the blue cluster.
Why is this happening?
- We are only plotting 2 variables, while the model was fitted with 7. We can't see the points properly separated because 5 variables are missing from the plot.
Why don't we add 5 variables to the plot then?
- We could, but it would be way too hard to interpret.
Then, what could we do?
- We can apply PCA, a dimensionality reduction technique. Take a look at the following video to understand this concept:
Grouping Variables with PCA()
Transform Data to Components
PCA() is another technique used to transform data.
How has the data been manipulated so far?
- Original data: df_crashes
- Normalized data: df_scaled
- Principal Components data: df_pca (computed now)
from sklearn.decomposition import PCA
pca = PCA()
data_pca = pca.fit_transform(df_scaled)
data_pca[:5]
array([[ 1.60367129, 0.13344927, 0.31788093, -0.79529296, -0.57971878,
0.04622256, 0.21018495],
[ 1.14421188, 0.85823399, 0.73662642, 0.31898763, -0.22870123,
-1.00262531, 0.00896585],
[ 1.43217197, -0.42050562, 0.3381364 , 0.55251314, 0.16871805,
-0.80452278, -0.07610742],
[ 2.49158352, 0.34896812, -1.78874742, 0.26406388, -0.37238226,
-0.48184939, -0.14763646],
[-1.75063825, 0.63362517, -0.1361758 , -0.97491605, -0.31581147,
0.17850962, -0.06895829]])
df_pca = pd.DataFrame(data_pca)
df_pca
cols_pca = [f'PC{i}' for i in range(1, pca.n_components_+1)]
cols_pca
['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7']
df_pca = pd.DataFrame(data_pca, columns=cols_pca, index=df_crashes.index)
df_pca
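Each Principal Component is uncorrelated with the others, and its variance equals the amount of variability it captures. A small sketch to verify this against the fitted pca object:

# Each component's variance equals pca.explained_variance_ (both use the n-1 convention)
print(df_pca.var().round(3).values)
print(pca.explained_variance_.round(3))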
Visualize Components & Clusters
Let's visualize a scatterplot with PC1 & PC2 and colour the points by cluster:
import plotly.express as px
px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred)
Are they mixed now?
- No, they aren't.
That's because PC1 and PC2 together represent almost 80% of the variability of the original seven variables.
You can see the following array, where every element represents the amount of variability explained by every component:
pca.explained_variance_ratio_
array([0.57342168, 0.22543042, 0.07865743, 0.05007557, 0.04011 ,
0.02837999, 0.00392491])
And the cumulative variability (79.89% up to PC2):
pca.explained_variance_ratio_.cumsum()
array([0.57342168, 0.7988521 , 0.87750953, 0.9275851 , 0.9676951 ,
0.99607509, 1. ])
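A common way to decide how many components to keep is to plot this cumulative curve; a minimal sketch (using matplotlib, which seaborn already depends on):

import matplotlib.pyplot as plt

# Cumulative explained variance: how much variability we keep with k components
plt.plot(range(1, pca.n_components_ + 1),
         pca.explained_variance_ratio_.cumsum(), marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()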
Which variables represent these two components?
Relationship between Original Variables & Components
Loading Vectors
Each Principal Component is produced by a mathematical equation (once again): a weighted sum of the original variables, using the following weights (also called loadings):
df_weights = pd.DataFrame(pca.components_.T, columns=df_pca.columns, index=df_scaled.columns)
df_weights
We can observe that:
- Socio-demographical features (total, speeding, alcohol, not_distracted & no_previous) have higher coefficients (higher influence) in PC1.
- Whereas insurance features (ins_premium & ins_losses) have higher coefficients in PC2.
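You can confirm this interpretation by sorting the absolute weights of each component (a quick sketch):

# Which variables weigh most in each component? Sort the absolute loadings
print(df_weights['PC1'].abs().sort_values(ascending=False), '\n')
print(df_weights['PC2'].abs().sort_values(ascending=False))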
Principal Component Analysis is a technique that gathers the maximum variability of a set of features (variables) into Components.
Therefore, the first two Principal Components capture a good amount of the shared information because we see two sets of variables that are strongly correlated with each other:
Correlation Matrix
df_corr = df_scaled.corr()
sns.heatmap(df_corr, annot=True, vmin=0, vmax=1);
I hope that everything is making sense so far.
To complete the explanation, you can see below how the df_pca values are computed:
Calculating One PCA Value
For example, we can multiply the weights of PC1 by the scaled variables for Alabama (AL):
(df_weights['PC1']*df_scaled.loc['AL']).sum()
1.6036712920638672
This gives the transformed value of Principal Component 1 for the State of Alabama:
df_pca.head()
The same operation applies to any value of df_pca.
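In matrix form, the whole df_pca table is just the scaled data multiplied by the weight matrix. A short sketch of that check, reusing df_scaled and df_weights from above:

# Every row of df_pca = scaled data (1x7) · weight matrix (7x7)
df_check = df_scaled @ df_weights    # pandas aligns columns and index automatically
df_check.head().round(4)             # should match df_pca.head() up to floating-point error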
PCA & Cluster Interpretation
Now, let's go back to the PCA plot:
px.scatter(data_frame=df_pca, x='PC1', y='PC2', color=df_pred.pred.astype(str))
How can we interpret the clusters with the components?
Let's add information to the points thanks to interactive plots from the plotly library:
hover = '''
<b>%{customdata[0]}</b><br><br>
PC1: %{x}<br>
Total: %{customdata[1]}<br>
Alcohol: %{customdata[2]}<br><br>
PC2: %{y}<br>
Ins Losses: %{customdata[3]}<br>
Ins Premium: %{customdata[4]}
'''
fig = px.scatter(data_frame=df_pca, x='PC1', y='PC2',
color=df_pred.pred.astype(str),
hover_data=[df_pca.index, df_crashes.total, df_crashes.alcohol,
df_crashes.ins_losses, df_crashes.ins_premium])
fig.update_traces(hovertemplate = hover)
If you hover the mouse over the two most extreme points along the x-axis, you can see that their values coincide with the min and max values across the socio-demographical features:
df_crashes.agg(['min', 'max'])
df_crashes.loc[['DC', 'SC'],:]
Apply the same reasoning to the two most extreme points along the y-axis. You will see the same for the insurance variables because they determine the positioning along PC2 (y-axis).
df_crashes.agg(['min', 'max'])
df_crashes.loc[['ID', 'LA'],:]
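Instead of hovering, you can also locate those extreme States programmatically (a small optional sketch):

# States at the extremes of each Principal Component axis
print('PC1 extremes:', df_pca['PC1'].idxmin(), df_pca['PC1'].idxmax())
print('PC2 extremes:', df_pca['PC2'].idxmin(), df_pca['PC2'].idxmax())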
Is there a way to represent the weights of the original variables for the Principal Components together with the points?
That's called a Biplot, which we will see next.
Biplot
We can observe how the points are positioned along the loading vectors. Friendly reminder: the loading vectors are the weights of the original variables in each Principal Component.
import numpy as np

# Loading vectors scaled by the standard deviation of each component
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Explained variance ratio as a percentage for the axis labels
evr = (pca.explained_variance_ratio_ * 100).round(2)

fig = px.scatter(df_pca, x='PC1', y='PC2',
                 color=model_km.labels_.astype(str),
                 hover_name=df_pca.index,
                 labels={
                     'PC1': f'PC1 ~ {evr[0]}%',
                     'PC2': f'PC2 ~ {evr[1]}%'
                 })

# Draw one arrow (line + label) per original variable
for i, feature in enumerate(df_scaled.columns):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0],
        y1=loadings[i, 1],
        line=dict(color="red", width=3)
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )
fig.show()
Conclusion
Dimensionality Reduction techniques have many more applications, but I hope you got the essence: they are great for grouping variables that behave similarly and later visualizing many variables in just a few components.
In short, you are simplifying the information in the data. In this example, we went from plotting seven dimensions to only two. This simplification doesn't come for free, though: the two components retain around 80% of the data's original variability, not all of it.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.