Principal Component Analysis in 6 steps

Read this to understand how PCA works. To skip to the steps, Ctrl+F “step 1”. To perform PCA on R, click here.

What is PCA?

Principal Component Analysis, or PCA, is a statistical method used to reduce the number of variables in a dataset. It does so by lumping highly correlated variables together. Naturally, this comes at the expense of accuracy. However, if you have 50 variables and realize that 40 of them are highly correlated, you will gladly trade a little accuracy for simplicity.

How does PCA work?

Say you have a dataset of two variables, and want to simplify it. You will probably not see a pressing need to reduce such an already succinct dataset, but let us use this example for the sake of simplicity.

The two variables are:

Dow Jones Industrial Average, or DJIA, a stock market index that comprises 30 of America’s biggest companies, such as Hewlett Packard and Boeing.
S&P 500 index, a similar aggregate of 500 stocks of large American-listed companies. It contains many of the companies that the DJIA comprises.

Not surprisingly, the DJIA and S&P 500 are highly correlated. Just look at how their daily readings move together. To be precise, the plot below shows their daily % changes.

The above points are represented in 2 axes: X and Y. In theory, PCA will allow us to represent the data along one axis. This axis will be called the principal component, and is represented by the black line.

DJIA vs S&P with principal component line

Note 1: In reality, you will not use PCA to transform two-dimensional data into one dimension. Rather, you will simplify data of higher dimensions into lower dimensions.

Note 2: Reducing the data to a single dimension/axis will reduce accuracy somewhat. This is because the data does not hug the axis neatly; it varies about the axis. But this is the trade-off we are making with PCA, and perhaps we were never concerned about pinpoint accuracy. For two series this tightly correlated, the single axis still does a fine job of summarizing how they move together, so you wouldn’t complain too much.

How do I do a PCA?

In this illustrative example, I will use PCA to re-orient 2D data onto two new axes (so the result is still 2D) in six steps.

Step 1 – Standardize:

Standardize the scale of the data. I have already done this, by transforming the data into daily % change. Now, both DJIA and S&P data are expressed in the same unit (percent), regardless of the very different levels of the two indices.
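As a rough sketch of that transformation, the snippet below turns a list of closing levels into daily % changes. The function name `pct_change` and the closing levels are invented for illustration; they are not the St Louis Fed data used in this post.

```python
def pct_change(levels):
    """Return day-over-day percentage changes for a list of index levels."""
    return [100.0 * (today - prev) / prev for prev, today in zip(levels, levels[1:])]

# Hypothetical closing levels, made up purely for illustration.
djia_close = [17000.0, 17170.0, 17084.15]
sp500_close = [2000.0, 2010.0, 1989.9]

djia_pct = pct_change(djia_close)    # roughly +1.0 and -0.5 (percent)
sp500_pct = pct_change(sp500_close)  # roughly +0.5 and -1.0 (percent)
print(djia_pct, sp500_pct)
```

Note that n levels give n - 1 daily changes, since the first day has no previous day to compare against.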

Step 2 – Calculate covariance:

Find the covariance matrix for the data. As a reminder, the covariance between DJIA and S&P – Cov(DJIA, S&P) or equivalently, Cov(S&P, DJIA) – is a measure of how the two variables move together.

By the way, $cor(X,Y) = \frac{cov(X,Y)}{\sigma _{X}\cdot \sigma_{Y}}$

The covariance matrix for my data will look like:

$\begin{bmatrix} Cov(DJIA,DJIA) & Cov(DJIA,S\&P)\\ Cov(S\&P,DJIA) & Cov(S\&P,S\&P) \end{bmatrix} = \begin{bmatrix} Var(DJIA) & Cov(DJIA,S\&P)\\ Cov(S\&P,DJIA) & Var(S\&P) \end{bmatrix} = \begin{bmatrix} 0.7846 & 0.8012\\ 0.8012 & 0.8970 \end{bmatrix}$
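For readers who want to see the arithmetic, here is a minimal sketch of a 2x2 covariance matrix in plain Python. The helper names are hypothetical, and dividing by n - 1 (the sample covariance, which is what R's `cov` computes) is an assumption about how the numbers above were produced. The last lines plug the matrix above into the correlation formula.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def covariance_matrix(xs, ys):
    """2x2 covariance matrix: variances on the diagonal, covariance off it."""
    cxy = covariance(xs, ys)
    return [[covariance(xs, xs), cxy], [cxy, covariance(ys, ys)]]

# Covariance matrix reported in this post.
cov = [[0.7846, 0.8012], [0.8012, 0.8970]]

# cor(X, Y) = cov(X, Y) / (sd(X) * sd(Y)), with sd = sqrt(variance)
correlation = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])
print(round(correlation, 3))  # 0.955
```

A correlation this close to 1 is the numerical version of "the two indices move together".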

Step 3 – Deduce eigens:

Do you remember we graphically identified the principal component for our data?

The main principal component, depicted by the black line, will become our new X-axis. Naturally, a line perpendicular to the black line will be our new Y axis, the other principal component. The below lines are perpendicular; don’t let the aspect ratio fool you.

Thus, we are going to rotate our data to fit these new axes. But what will the coordinates of the rotated data be?

To convert the data into the new axes, we will multiply the original DJIA, S&P data by eigenvectors, which indicate the direction of the new axes (principal components).

But first, we need to deduce the eigenvectors (there are two – one per axis). Each eigenvector will correspond to an eigenvalue, whose magnitude indicates how much of the data’s variability is explained by its eigenvector.

As per the definition of eigenvalue and eigenvector:

$\begin{bmatrix} Covariance \: matrix \end{bmatrix} \cdot \begin{bmatrix} Eigenvector \end{bmatrix} = \begin{bmatrix} eigenvalue \end{bmatrix} \cdot \begin{bmatrix} Eigenvector \end{bmatrix}$

We know the covariance matrix from step 2. Solving the above equation by some clever math will yield the below eigenvalues (e) and eigenvectors (E):

$e_{1}=1.644$

$E_{1}= \begin{bmatrix} 0.6819 \\ 0.7314 \end{bmatrix}$

$e_{2}=0.0376$

$E_{2}= \begin{bmatrix} -0.7314 \\ 0.6819 \end{bmatrix}$
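The "clever math" can be spelled out for a 2x2 symmetric matrix: the eigenvalues are the two roots of λ² − (trace)·λ + (determinant) = 0, and each eigenvector then follows from one row of the defining equation. The sketch below does this for the covariance matrix from step 2. The function name is hypothetical, and it assumes the off-diagonal entry is non-zero. An eigenvector is only defined up to an overall sign flip, so the code may return the negative of a vector; the direction of the axis is the same either way.

```python
import math

def eigen_symmetric_2x2(m):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, d]], b != 0."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    trace = a + d
    det = a * d - b * b
    # Roots of lambda^2 - trace * lambda + det = 0
    root = math.sqrt(trace * trace - 4.0 * det)
    values = [(trace + root) / 2.0, (trace - root) / 2.0]
    vectors = []
    for lam in values:
        # First row of (M - lam * I) v = 0 reads (a - lam) * x + b * y = 0,
        # which is satisfied by (x, y) = (b, lam - a).
        x, y = b, lam - a
        norm = math.hypot(x, y)
        vectors.append((x / norm, y / norm))
    return values, vectors

cov = [[0.7846, 0.8012], [0.8012, 0.8970]]
values, vectors = eigen_symmetric_2x2(cov)
print([round(v, 4) for v in values])
print([tuple(round(c, 4) for c in vec) for vec in vectors])
```

The first eigenvector has both entries with the same sign, which is what a positive covariance implies: the main axis points in the direction where both indices rise together.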

Step 4 – Re-orient data:

Since the eigenvectors indicate the direction of the principal components (new axes), we will multiply the original data by the eigenvectors to re-orient our data onto the new axes. The re-oriented data points are called scores.

$Sc = \begin{bmatrix} Orig. \: data \end{bmatrix} \cdot \begin{bmatrix} Eigenvectors \end{bmatrix} = \begin{bmatrix} DJIA_{1} & S\&P_{1}\\ DJIA_{2} & S\&P_{2}\\ ... & ...\\ DJIA_{150} & S\&P_{150} \end{bmatrix} \cdot \begin{bmatrix} 0.6819 & -0.7314\\ 0.7314 & 0.6819 \end{bmatrix}$
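A small sketch of this multiplication, with the eigenvectors placed as the columns of a 2x2 matrix (first column for the main principal component). The three sample rows are invented for illustration, and the helper names are hypothetical. Many PCA routines also subtract each column's mean before multiplying; that step is left out here to mirror the formula above. The second half checks a useful property: applying the same rotation to the covariance matrix from step 2 gives a nearly diagonal matrix with the eigenvalues on the diagonal, meaning the scores on the two new axes are uncorrelated.

```python
def matmul(rows, matrix):
    """Multiply an n x 2 list of rows by a 2 x 2 matrix."""
    return [
        [sum(row[k] * matrix[k][j] for k in range(2)) for j in range(2)]
        for row in rows
    ]

# Columns are the eigenvectors: column 0 = main component, column 1 = second.
components = [[0.6819, -0.7314],
              [0.7314, 0.6819]]

# Hypothetical (DJIA, S&P) daily % changes, made up for illustration.
sample = [[1.0, 1.1], [-0.5, -0.4], [0.2, -0.1]]
scores = matmul(sample, components)
print(scores[0])  # first day's coordinates on the new axes

# Covariance of the scores = W^T * Cov * W
cov = [[0.7846, 0.8012], [0.8012, 0.8970]]
components_t = [list(col) for col in zip(*components)]
score_cov = matmul(components_t, matmul(cov, components))
print([[round(v, 3) for v in row] for row in score_cov])
```

In the invented sample, a day where both indices rise by about 1% lands far along the first axis and almost at zero on the second.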

Step 5 – Plot re-oriented data:

We can now plot the rotated data, or scores.

Original data, re-oriented to fit new axes
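How much would be lost by keeping only the first new axis can be read off the eigenvalues from step 3, since each eigenvalue is the variance along its axis. A minimal sketch using the rounded values quoted above:

```python
# Eigenvalues from step 3 (rounded as quoted in the post).
eigenvalues = [1.644, 0.0376]

total_variance = sum(eigenvalues)
shares = [value / total_variance for value in eigenvalues]
print([round(share, 3) for share in shares])  # [0.978, 0.022]
```

So the first principal component alone carries roughly 98% of the total variance of the two series, which is why dropping the second axis costs so little here.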

Step 6 – Bi-plot:

A PCA would not be complete without a bi-plot. This is basically the plot above, except the axes are standardized on the same scale, and arrows are added to depict the original variables, lest we forget.

Axes: In this bi-plot, the X and Y axes are the principal components.
Points: These are the DJIA and S&P points, re-oriented to the new axes.
Arrows: The arrows point in the direction of increasing values for each original variable. For example, points in the top right quadrant will have higher DJIA readings than points in the bottom left quadrant. The closeness of the arrows means that the two variables are highly correlated.
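One common way to obtain the arrows, sketched below, is to take each variable's row of the eigenvector matrix and scale each entry by the square root of the matching eigenvalue. This scaling is an assumption for illustration; plotting packages differ in how they scale arrows, and the bi-plot in this post may use a different convention. With this scaling, the cosine of the angle between two arrows equals the correlation between the two variables, which is why near-parallel arrows signal high correlation.

```python
import math

eigenvalues = [1.644, 0.0376]
# Row 0: DJIA's weights on (PC1, PC2); row 1: S&P's weights on (PC1, PC2).
components = [[0.6819, -0.7314],
              [0.7314, 0.6819]]

# Assumed scaling: weight on each component times sqrt of that component's eigenvalue.
arrows = [
    [components[i][j] * math.sqrt(eigenvalues[j]) for j in range(2)]
    for i in range(2)
]

def cosine(u, v):
    """Cosine of the angle between two 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

djia_arrow, sp500_arrow = arrows
print(round(cosine(djia_arrow, sp500_arrow), 3))  # 0.955
```

The result matches the correlation implied by the covariance matrix in step 2.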

Data from St Louis Federal Reserve; PCA performed on R, with help of ggplot2 package for graphs.

Abbas Keshvani

17 thoughts on “Principal Component Analysis in 6 steps”

Josh Simons says:

March 21, 2015 at 8:58 pm

Thank you very much for this intuitive explanation of PCA — I found it very helpful.

As a small point, I think it should be “principal” rather than “principle” throughout the article.

1. Abbas Keshvani says:
  
  March 22, 2015 at 6:40 am
  
  Glad you found the post useful, Josh.
  
  Good spot there! The P in PCA should indeed be Principal (adj.), rather than Principle (noun). Made the corrections.
  
mandeep15p says:

October 3, 2015 at 5:58 pm

Thanks for the information.
Would you please elaborate more about the interpretation of data. What does -ve and +ve component values of variables signifies?

1. Abbas Keshvani says:
  
  June 4, 2016 at 4:57 am
  
  Do you mean the -ve and +ve values of the score? They signify nothing, really. It is simply data that has been re-oriented on a different x and y axis.
  
Anonymous says:

October 13, 2016 at 7:51 am

Hello, this article really helps me understand PCA a lot.
However, some of the content confuse me and I don’t know if they are typos.
The last line of step 3, it should be E2?
The eigenvectors in step 4 should be
0.6819 -0.7314
-0.7314 0.6819
?
Thanks!

Mano says:

June 1, 2017 at 6:21 pm

This is an amazing article on PCA! loved it. Please continue writing such articles 😉

Anonymous says:

October 15, 2017 at 7:14 pm

Very helpful to someone who has a limited stats background and just wants to understand professional research papers!

Anna says:

March 6, 2018 at 5:16 pm

If we have assets in our portfolio with different currencies should we try to make every asset prices or returns into same currency or it doesn’t matter?

Anonymous says:

April 29, 2018 at 8:34 pm

How do I do a PCA?
In this illustrative example, I will use PCA to transform 2D data into 2D data in five steps.
transform 2D data into 2D data in five steps.?????

Dr. Mukesh Srivastava says:

June 18, 2018 at 5:06 am

Very useful explanation in six steps.
I want to know one thing that how to get the arrows for variables from eigen values or original values? What are its calculation steps. I am using STATISTICA 7.0 software.

Sheldon says:

November 16, 2019 at 1:21 pm

Awesome things here. I am very satisfied to look your article.
Thanks so much and I am looking forward to contact you.
Will you kindly drop me a mail?

Anonymous says:

March 26, 2020 at 11:52 pm

This is awesome, thank you!!!

Paulette says:

August 17, 2020 at 4:10 pm

It’s nearly impossible to find knowledgeable people in this particular topic, however, you seem like you know what you’re talking about!

Thanks