# Principal Component Analysis in 6 steps

What is PCA?

Principal Component Analysis, or PCA, is a statistical method used to reduce the number of variables in a dataset. It does so by lumping highly correlated variables together. Naturally, this comes at the expense of accuracy. However, if you have 50 variables and realize that 40 of them are highly correlated, you will gladly trade a little accuracy for simplicity.

How does PCA work?

Say you have a dataset of two variables, and want to simplify it. You will probably not see a pressing need to reduce such an already succinct dataset, but let us use this example for the sake of simplicity.

The two variables are:

1. Dow Jones Industrial Average, or DJIA, a stock market index that comprises 30 of America’s biggest companies, such as Hewlett Packard and Boeing.
2. S&P 500 index, a similar aggregate of 500 large American-listed companies. It contains many of the companies that make up the DJIA.

Not surprisingly, the DJIA and S&P 500 are highly correlated. Just look at how their daily readings move together. To be precise, the below is a plot of their daily % changes.

The above points are represented in 2 axes: X and Y. In theory, PCA will allow us to represent the data along one axis. This axis will be called the principal component, and is represented by the black line.

Note 1: In reality, you will not use PCA to transform two-dimensional data into one dimension. Rather, you will use it to reduce higher-dimensional data to fewer dimensions.

Note 2: Reducing the data to a single dimension/axis will reduce accuracy somewhat. This is because the data is not neatly hugging the axis. Rather, it varies about the axis. But this is the trade-off we are making with PCA, and perhaps we were never concerned about needle-point accuracy. The above linear model would do a fine job of predicting the movement of a stock and making you a decent profit, so you wouldn’t complain too much.

How do I do a PCA?

In this illustrative example, I will use PCA to transform 2D data into 1D data in six steps.

Step 1 – Standardize:

Standardize the scale of the data. I have already done this, by transforming the data into daily % change. Now, both the DJIA and S&P data are measured on the same percentage scale.
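To make this step concrete, here is a minimal sketch in Python with NumPy. The closing prices are invented for illustration (the real data came from the St Louis Federal Reserve):

```python
import numpy as np

# Hypothetical closing prices for four consecutive days
# (illustrative values only, not real DJIA/S&P quotes).
djia_close = np.array([34000.0, 34170.0, 33900.0, 34240.0])
sp_close = np.array([4400.0, 4422.0, 4390.0, 4434.0])

# Daily % change: (today - yesterday) / yesterday * 100.
# This puts both indices on the same percentage scale.
djia_pct = np.diff(djia_close) / djia_close[:-1] * 100
sp_pct = np.diff(sp_close) / sp_close[:-1] * 100

print(djia_pct.round(2))  # values: 0.5, -0.79, 1.0
print(sp_pct.round(2))    # values: 0.5, -0.72, 1.0
```

Whatever the raw scales of your variables, this standardization step is what keeps one large-valued variable from dominating the covariance matrix in step 2.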

Step 2 – Calculate covariance:

Find the covariance matrix for the data. As a reminder, the covariance between DJIA and S&P – Cov(DJIA, S&P) or equivalently, Cov(S&P, DJIA) – is a measure of how the two variables move together.

By the way, $cor(X,Y) = \frac{cov(X,Y)}{\sigma _{X}\cdot \sigma_{Y}}$

The covariance matrix for my data will look like:

$\begin{bmatrix} Cov(DJIA,DJIA) & Cov(DJIA,S\&P)\\ Cov(S\&P,DJIA) & Cov(S\&P,S\&P) \end{bmatrix} = \begin{bmatrix} Var(DJIA) & Cov(DJIA,S\&P)\\ Cov(S\&P,DJIA) & Var(S\&P) \end{bmatrix} = \begin{bmatrix} 0.7846 & 0.8012\\ 0.8012 & 0.8970 \end{bmatrix}$
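If you want to check these numbers yourself, a short NumPy sketch can recover the correlation from the covariance matrix above, using the cor/cov formula from earlier (with the raw data in hand, `np.cov` would compute the matrix directly):

```python
import numpy as np

# Covariance matrix of the daily % changes, as computed in this step.
cov = np.array([[0.7846, 0.8012],
                [0.8012, 0.8970]])

# With the raw data (rows = days, columns = [DJIA, S&P]) you would instead use:
#   cov = np.cov(data, rowvar=False)

# Correlation from the formula cor(X,Y) = cov(X,Y) / (sd_X * sd_Y).
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(round(corr, 3))  # 0.955: the two indices move almost in lockstep
```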

Step 3 – Deduce eigens:

Do you remember we graphically identified the principal component for our data?

The main principal component, depicted by the black line, will become our new X-axis. Naturally, a line perpendicular to the black line will be our new Y axis, the other principal component. The below lines are perpendicular; don’t let the aspect ratio fool you.

Thus, we are going to rotate our data to fit these new axes. But what will the coordinates of the rotated data be?

To convert the data into the new axes, we will multiply the original DJIA, S&P data by eigenvectors, which indicate the direction of the new axes (principal components).

But first, we need to deduce the eigenvectors (there are two – one per axis). Each eigenvector will correspond to an eigenvalue, whose magnitude indicates how much of the data’s variability is explained by its eigenvector.

As per the definition of eigenvalue and eigenvector:

$\begin{bmatrix} Covariance \: matrix \end{bmatrix} \cdot \begin{bmatrix} Eigenvector \end{bmatrix} = \begin{bmatrix} Eigenvalue \end{bmatrix} \cdot \begin{bmatrix} Eigenvector \end{bmatrix}$

We know the covariance matrix from step 2. Solving the above equation by some clever math will yield the below eigenvalues (e) and eigenvectors (E):

$e_{1}=1.644$

$E_{1}= \begin{bmatrix} 0.6819 \\ 0.7314 \end{bmatrix}$

$e_{2}=0.0376$

$E_{2}= \begin{bmatrix} -0.7314 \\ 0.6819 \end{bmatrix}$
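The “clever math” can be delegated to NumPy: `np.linalg.eigh` decomposes a symmetric matrix such as our covariance matrix. (The signs of the eigenvectors are arbitrary, so your output may come back flipped.)

```python
import numpy as np

cov = np.array([[0.7846, 0.8012],
                [0.8012, 0.8970]])

# eigh is for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
e2, e1 = eigenvalues             # e1 = 1.644, e2 = 0.0376
E1 = eigenvectors[:, 1]          # eigenvector of the larger eigenvalue
E2 = eigenvectors[:, 0]

# Share of total variance explained by the first principal component:
print(round(e1 / (e1 + e2), 3))  # 0.978 -- PC1 captures ~98% of the variability
```

That last line quantifies the eigenvalue’s meaning: since $e_1$ dwarfs $e_2$, nearly all of the data’s variability lies along the first principal component.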

Step 4 – Re-orient data:

Since the eigenvectors indicate the direction of the principal components (new axes), we will multiply the original data by the eigenvectors to re-orient our data onto the new axes. This re-oriented data is called a score.

$Sc = \begin{bmatrix} Orig. \: data \end{bmatrix} \cdot \begin{bmatrix} Eigenvectors \end{bmatrix} = \begin{bmatrix} DJIA_{1} & S\&P_{1}\\ DJIA_{2} & S\&P_{2}\\ ... & ...\\ DJIA_{150} & S\&P_{150} \end{bmatrix} \cdot \begin{bmatrix} 0.6819 & -0.7314\\ 0.7314 & 0.6819 \end{bmatrix}$
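A sketch of this multiplication, again with invented % changes (the columns of the eigenvector matrix are the two principal component directions):

```python
import numpy as np

# Hypothetical daily % changes: rows = days, columns = [DJIA, S&P].
data = np.array([[ 0.50,  0.55],
                 [-0.79, -0.72],
                 [ 1.00,  0.98]])

# Eigenvector matrix from step 3: columns are the principal components.
E = np.array([[0.6819, -0.7314],
              [0.7314,  0.6819]])

# Each row of the score matrix gives one day's coordinates on the new axes:
# column 1 = position along PC1, column 2 = position along PC2.
scores = data @ E
print(scores.round(3))
```

Notice that almost all of the movement lands in the first column (PC1) while the second column hovers near zero, which is exactly the trade-off described in Note 2: dropping PC2 loses very little information.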

Step 5 – Plot re-oriented data:

We can now plot the rotated data, or scores.

Step 6 – Bi-plot:

A PCA would not be complete without a bi-plot. This is basically the plot above, except the axes are standardized on the same scale, and arrows are added to depict the original variables, lest we forget.

• Axes: In this bi-plot, the X and Y axes are the principal components.
• Points: These are the DJIA and S&P points, re-oriented to the new axes.
• Arrows: The arrows point in the direction of increasing values for each original variable. For example, points in the top right quadrant will have higher DJIA readings than points in the bottom left quadrant. The closeness of the arrows means that the two variables are highly correlated.

Data from the St Louis Federal Reserve; PCA performed in R, with the help of the ggplot2 package for graphs.

Abbas Keshvani

# Introducing Statwing

Recently, Greg Laughlin, the founder of a new statistical software package called Statwing, let me try his product for free. I happen to like free things very much (the college student is strong within me), so I gave it a try.

I mostly like how easy it is to use: For instance, to relate two attributes like Age and Income, you click Age, click Income, and click Relate.

So what can Statwing do?

1. Summarize an attribute (like “age”): totals, averages, standard deviation, confidence intervals, percentiles, visual graphs like the one below
2. Relate two columns together (“Openness” vs “Extraversion”)
• Plots the two attributes against each other to show how they relate. The plot includes the formula of the regression line and the R-squared value.
• Sometimes a chi-square-style table is more appropriate; the software determines how best to represent the data.
• Tests the null hypothesis that the attributes are independent, using a t-test, F-test (ANOVA), or chi-square test. Statwing determines which one is appropriate.
• Repeats the above for a ranked correlation.

For now, you can’t forecast a time series or represent data on maps. But Greg told me that the team is adding new features as I type this.

If you’d like to try the software yourself, click here. They’ve got three sample datasets to play with:

1. Titanic passengers information
2. The results of a psychological survey
3. A list of congressmen, their voting records, and donations.

Abbas Keshvani