This is a practical tutorial on performing PCA on R. If you would like to understand *how *PCA works, please see my plain English explainer here.

Reminder: Principal Component Analysis (PCA) is a method used to reduce the number of variables in a dataset.

We are using R’s **USArrests** dataset, a dataset from 1973 showing, for each US state, the:

- rate per 100,000 residents of murder
- rate per 100,000 residents of rape
- rate per 100,000 residents of assault
- % of the population that is urban

Now, we will simplify the data into two-variables data. This does not mean that we are eliminating two variables and keeping two; it means that we are replacing the four variables with two brand new ones called “principal components”.

This time we will use R’s princomp function to perform PCA.

Preamble: you will need the **stats **package.

Step 1: Standardize the data. You may skip this step if you would rather use princomp’s inbuilt standardization tool*.

Step 2: Run **pca=princomp(USArrests, cor=TRUE)** if your data needs standardizing / **princomp(USArrests)** if your data is already standardized.

Step 3: Now that R has computed 4 new variables (“principal components”), you can choose the two (or one, or three) principal components with the highest variances.

You can run **summary(pca) **to do this. The output will look like this:

As you can see, principal components 1 and 2 have the highest standard deviation / variance, so we should use them.

Step 4: Finally, to obtain the actual principal component coordinates (“scores”) for each state, run **pca$scores**:

Step 5: To produce the biplot, a visualization of the principal components against the original variables, run **biplot(pca)**:

The closeness of the Murder, Assault, Rape arrows indicates that these three types of crime are, intuitively, correlated. There is also some correlation between urbanization and incidence of rape; the urbanization-murder correlation is weaker.

***princomp** will turn your data into z-scores (i.e. subtract the mean, then divide by the standard deviation). But in doing so, one is not just standardizing the data, but also rescaling it. I do not see the need to rescale, so I choose to manually translate the data onto a standard range of [0,1] using the equation:

Abbas Keshvani