# How to perform PCA on R

This is a practical tutorial on performing PCA on R. If you would like to understand how PCA works, please see my plain English explainer here.

Reminder: Principal Component Analysis (PCA) is a method used to reduce the number of variables in a dataset.

We are using R’s USArrests dataset, a dataset from 1973 showing, for each US state, the:

1. rate per 100,000 residents of murder
2. rate per 100,000 residents of rape
3. rate per 100,000 residents of assault
4. % of the population that is urban

Now, we will simplify the data down to two variables. This does not mean that we are eliminating two variables and keeping two; it means that we are replacing the four variables with two brand new ones called “principal components”.

This time we will use R’s princomp function to perform PCA.

Preamble: you will need the stats package (it is loaded by default in R).

Step 1: Standardize the data. You may skip this step if you would rather use princomp’s inbuilt standardization tool*.
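If you do standardize manually, here is a minimal R sketch of the min-max rescaling described in the footnote (the helper name rescale01 is my own, not part of any package):

```r
# Rescale each column of USArrests onto [0,1]: (x - min) / (max - min)
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(USArrests, rescale01))
```

After this, every column runs from exactly 0 to exactly 1, so no variable dominates the analysis purely because of its units.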

Step 2: Run pca=princomp(USArrests, cor=TRUE) if your data needs standardizing, or pca=princomp(USArrests) if your data is already standardized.
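As a runnable sketch of this step on R’s built-in dataset (the object name pca is just a choice):

```r
# Step 2: PCA on USArrests; cor=TRUE makes princomp work off the
# correlation matrix, i.e. standardize the variables internally
pca <- princomp(USArrests, cor = TRUE)
```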

Step 3: Now that R has computed 4 new variables (“principal components”), you can choose the two (or one, or three) principal components with the highest variances.

You can run summary(pca) to do this. The output will look like this:

As you can see, principal components 1 and 2 have the highest standard deviation / variance, so we should use them.
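To see the variance split numerically, a short sketch (using R’s built-in USArrests; on this dataset roughly 62% and 25% of the variance fall on the first two components, which is why keeping two of the four is reasonable):

```r
pca <- princomp(USArrests, cor = TRUE)
summary(pca)  # standard deviation and proportion of variance per component

# Proportion of total variance carried by each component
prop <- pca$sdev^2 / sum(pca$sdev^2)
round(prop, 2)  # Comp.1 and Comp.2 dominate
```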

Step 4: Finally, to obtain the actual principal component coordinates (“scores”) for each state, run pca$scores.

Step 5: To produce the biplot, a visualization of the principal components against the original variables, run biplot(pca).

The closeness of the Murder, Assault and Rape arrows indicates that these three types of crime are, intuitively, correlated. There is also some correlation between urbanization and the incidence of rape; the urbanization-murder correlation is weaker.

*princomp will turn your data into z-scores (i.e. subtract the mean, then divide by the standard deviation). But in doing so, one is not just standardizing the data, but also rescaling it. I do not see the need to rescale, so I choose to manually translate the data onto a standard range of [0,1] using the equation: $\frac{x_{i}-x_{min}}{x_{max}-x_{min}}$

Abbas Keshvani

# The end of the oil glut

In my last post, I talked about how America had depressed oil prices by increasing its supply. Recall this graph, which shows that the supply glut is primarily caused by increased American supply (the top pink line is America):

The glut is mostly due to America producing more oil.

Since low prices are mainly caused by American oversupply, a decrease in American supply will have a major impact on prices. And it does look like American supply might wind down. The next graph shows how American oil production responds (eventually) to the number of oil rigs in America. Just to clarify, “rigs” here refers to rotary rigs – the machines that drill new oil wells. The actual extraction is done by wells, not rigs. But American oil supply shows a remarkable (lagged) covariance with rig count. From 1990 to 2000, the number of rigs decreased, and oil supply followed it down. Then, when the number of rigs jumped in 2007, oil supply rose with it.

Note that the number of American rigs has plummeted since the start of 2015. It is no coincidence that oil prices hit a record low in January 2015.
At these paltry prices, oil companies have less of an impetus to drill for more oil.

The number of oil rigs in America has halved since January 2015.

The break-even price for shale oil varies according to the basin (reservoir) it comes from. A barrel from Bakken-Parshall-Sanish (proven reserves: 1 billion barrels) costs $60, while a barrel from Utica-Condensate (4.5 billion barrels) costs $95. The reserve-weighted average price is $76.50. These figures were calculated by Wood Mackenzie, an oil consulting firm, and can be viewed in detail here.

As the number of rigs has halved to 800, the United States will not be able to keep up its record supply. Keep in mind that wells are running dry all the time, so less rigging will eventually mean less oil. Perhaps, finally, the glut is about to end, with consequences for oil prices. To put things in perspective, the last time America had only 800 rigs (end January 2011), oil was at $97 a barrel.

Oil probably will not return to $100 a barrel. If it does, shale oil will become profitable again (the threshold is $76), American rigs will come online again, supply will increase and prices will come down again. So oil will have to find a new equilibrium price to be stable. A reasonable level to expect for this equilibrium is around $70, a little below the break-even price for shale.

There will probably be a lag in the reduction of American supply: note how oil supply does not immediately respond to the number of rigs. But things move faster when expectations are at play. On the 6th of April, traders realized Iranian rigs were not going to come online as fast as they thought, and oil prices rose 5% in one night. American supply does not actually have to fall for prices to rise: traders simply have to realize that it will.

Data from the US Energy Information Administration and Baker Hughes, an oil rig services provider. Graphs plotted on R.

Abbas Keshvani

# Why oil prices came down, and won’t anymore

You have probably heard that the price of crude oil has tumbled from $115 per barrel (159 litres, an archaic but established unit of measurement) in June 2014 to $54 in March 2015.

The price of oil has halved in 9 months.

Why oil has plunged so far: The drop has been caused by a supply glut (oversupply), as the below graph shows. The top line in pink is America, not Saudi Arabia:

The glut is mostly due to America producing more oil.

Although most of us think of Saudi Arabia as the world’s largest oil supplier, in actual fact the United States has held this title since 2013. In 2014, America was responsible for around half of the net increase in world oil output, due to a boom in the shale gas industry there. Its increase was akin to adding one Kazakhstan to the world! And all of this excludes the natural gas the US gets out of fracking, which also makes it the #1 gas supplier.

Historically, Saudi Arabia has played a stabilizing role in world oil prices, by adjusting its output to ensure global supply is stable. The below graph shows how Saudi output increased to lower prices when they were high, and vice versa. However, since July, the Saudis have not responded to newly low oil prices by decreasing output. In fact, the Kingdom has insisted that it would rather bear lower oil prices than decrease its market share (read: be squeezed out by shale).

The Saudis have historically stabilized prices, but no more.

Saudi Arabia is backed by the other members of the Gulf Cooperation Council – UAE, Qatar and Kuwait. Together, the GCC are responsible for more than a fifth of world oil output, so their inaction towards falling prices has been instrumental in ensuring that oil prices remain low. But why have the Saudis and their allies been so passive?

Motive 1 – Shale: One reason the Kingdom is depressing prices is to thwart the growth of the nascent shale gas and bitumen oil industries in America and Canada. The threat from these new industries to Saudi Arabia is real – in October, America ceased Nigerian oil imports, even though Nigeria exported almost as much oil to America as Saudi Arabia did as recently as 2010. Meanwhile, Canada steadily increased its exports of bitumen oil to America over the same period.

However, new shale projects require $65 oil to break even. At $53 a barrel, the shale boom has been paused, and several investments have been called off, their returns in doubt (although many existing wells remain online). If the Saudis allow prices to increase, the threat of shale will likely resume, so it does not look like they will allow prices to return to their pre-June levels. But the current price level is sufficiently low to keep the threat at bay, so the Saudis need not increase output further. At the same time, $53 oil will stop new shale projects from coming online, so it is unlikely that North America can contribute to the supply glut any further, either. It is for these reasons that oil prices are unlikely to tumble much further.

Motive 2 – Iran: However, I believe the Saudis have also depressed prices to hurt Iran and Russia, both of whom make most of their export revenue from oil. Iran’s expanding influence in the Middle East has rattled the Saudis considerably. In addition, both Iran and Russia remain staunch defenders of the Syrian government, which the Saudis and Qataris despise. The Saudis’ reserves of $900bln provide the kingdom with a buffer, and surely low prices will force Iran and Russia to think twice about expensive foreign projects like Syria, right?

But it does not look like low oil prices have reduced Iranian, or even Russian, involvement in Iraq and Syria. Iranian General Soleimani is openly marching through Iraq as an “advisor”, while Iran-backed militia have made the bulk of gains against IS. Meanwhile Assad has held onto power, two years after most media outlets pronounced him as good as overthrown. All of this has happened against the backdrop of low oil prices. Thus, it does not look like there is much value in continuing the Saudi strategy of depressing oil prices to curb Iranian influence.

Other producers, like Nigeria: The second graph shows that other oil producers like Nigeria (which produces 2.3m barrels a day, or 2.6% of world oil: more than Qatar but less than the UAE) have generally kept output constant. Most major oil producers – nations like Nigeria, Venezuela and Iraq – cannot afford to decrease oil sales, which are critical to their economies. They are probably not too happy about low oil prices, but have little choice in the matter. Finally, fortunately or unfortunately, the conflict in Libya has not depressed its oil output.

Wild card Iran: Iran exported 3m barrels of oil per day in 2006, but sanctions have since reduced this number to a meager 1.2m per day. A barrage of nuclear-related sanctions since 2006 has imposed an embargo on Iranian oil exports to the EU, prohibited investments in Iran’s oil industry, and barred banks from mediating transactions involving Iranian oil. But as sanctions are eased, Iran’s oil exports will certainly increase, and this may lower prices even further.

However, the timelines for increased Iranian oil exports are unclear. They depend on the speed at which sanctions are repealed and the pace at which Iran can ramp up output: the timelines for repealing nuclear-related sanctions imposed by the P5+1 will only be unveiled on 30th June 2015; Iran has 30m barrels of oil ready to ship out immediately, but beyond this stockpile, it will take years for Iran to bring its oil industry up to speed.

If sanctions are eased and Iran increases oil exports within a year, Saudi Arabia may actually reduce its output. Allowing prices to drop further will not serve the kingdom’s interests: current prices are already low enough to keep shale at bay. The kingdom could very well lower prices further to hurt Iran, but low oil prices do not seem to have curbed Iranian influence so far.

Any Iran-related decreases in oil prices will also be bounded by the $50 psychological resistance (although this was breached in January) and the 2008 low of $34.

In summary, I do not think we will see $100 oil any time soon, but I also do not think oil prices will drop much further than they already have.

Data from the US Energy Information Administration; graphs produced on R.

Abbas Keshvani

@abbaskesh

# Principal Component Analysis in 6 steps

Read this to understand how PCA works. To skip to the steps, Ctrl+F “step 1”. To perform PCA on R, click here.

What is PCA? Principal Component Analysis, or PCA, is a statistical method used to reduce the number of variables in a dataset. It does so by lumping highly correlated variables together. Naturally, this comes at the expense of accuracy. However, if you have 50 variables and realize that 40 of them are highly correlated, you will gladly trade a little accuracy for simplicity.

How does PCA work? Say you have a dataset of two variables, and want to simplify it. You will probably not see a pressing need to reduce such an already succinct dataset, but let us use this example for the sake of simplicity. The two variables are:

1. the Dow Jones Industrial Average, or DJIA, a stock market index that comprises 30 of America’s biggest companies, such as Hewlett-Packard and Boeing.
2. the S&P 500 index, a similar aggregate of 500 stocks of large American-listed companies. It contains many of the companies that the DJIA comprises.

Not surprisingly, the DJIA and S&P 500 are highly correlated. Just look at how their daily readings move together. To be precise, the below is a plot of their daily % changes.

DJIA vs S&P

The above points are represented on 2 axes: X and Y. In theory, PCA will allow us to represent the data along one axis. This axis will be called the principal component, and is represented by the black line.

DJIA vs S&P with principal component line

Note 1: In reality, you will not use PCA to transform two-dimensional data into one dimension. Rather, you will simplify data of higher dimensions into lower dimensions.
Note 2: Reducing the data to a single dimension/axis will reduce accuracy somewhat. This is because the data is not neatly hugging the axis. Rather, it varies about the axis. But this is the trade-off we are making with PCA, and perhaps we were never concerned about needle-point accuracy. The above linear model would do a fine job of predicting the movement of a stock and making you a decent profit, so you wouldn’t complain too much.

How do I do a PCA? In this illustrative example, I will use PCA to transform 2D data into a new set of 2D coordinates in six steps.

Step 1 – Standardize: Standardize the scale of the data. I have already done this, by transforming the data into daily % change. Now, both the DJIA and S&P data occur on the same percentage scale.

Step 2 – Calculate covariance: Find the covariance matrix for the data. As a reminder, the covariance between DJIA and S&P – Cov(DJIA, S&P) or, equivalently, Cov(S&P, DJIA) – is a measure of how the two variables move together. By the way,

$cor(X,Y) = \frac{cov(X,Y)}{\sigma_{X}\cdot \sigma_{Y}}$

The covariance matrix for my data will look like:

$\begin{bmatrix} Cov(DJIA,DJIA) & Cov(DJIA,S\&P)\\ Cov(S\&P,DJIA) & Cov(S\&P,S\&P) \end{bmatrix} = \begin{bmatrix} Var(DJIA) & Cov(DJIA,S\&P)\\ Cov(S\&P,DJIA) & Var(S\&P) \end{bmatrix} = \begin{bmatrix} 0.7846 & 0.8012\\ 0.8012 & 0.8970 \end{bmatrix}$

Step 3 – Deduce eigens: Do you remember we graphically identified the principal component for our data?

DJIA vs S&P with principal component line

The main principal component, depicted by the black line, will become our new X-axis. Naturally, a line perpendicular to the black line will be our new Y-axis, the other principal component. The below lines are perpendicular; don’t let the aspect ratio fool you.

If we used the black lines as our x and y axes, our data would look simpler

Thus, we are going to rotate our data to fit these new axes. But what will the coordinates of the rotated data be?
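The numbers in Steps 2 and 3 can be sanity-checked with R’s eigen function. A minimal sketch, using the covariance matrix from Step 2 (note that eigenvector signs are only determined up to a factor of ±1, so your signs may be flipped relative to mine):

```r
# Covariance matrix of the standardized DJIA / S&P daily % changes
S <- matrix(c(0.7846, 0.8012,
              0.8012, 0.8970), nrow = 2, byrow = TRUE)

e <- eigen(S)  # eigenvalues come back in decreasing order
e$values       # the two eigenvalues: ~1.644 and ~0.038
e$vectors      # columns are the eigenvectors (principal component directions)

# Re-orienting the data is then one matrix multiplication:
# scores <- as.matrix(data) %*% e$vectors   # 'data' = 150x2 DJIA/S&P matrix
```

The first eigenvalue dwarfs the second, which is the algebraic version of the picture above: almost all of the variability lies along the first principal component.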
To convert the data into the new axes, we will multiply the original DJIA, S&P data by eigenvectors, which indicate the direction of the new axes (principal components). But first, we need to deduce the eigenvectors (there are two – one per axis). Each eigenvector will correspond to an eigenvalue, whose magnitude indicates how much of the data’s variability is explained by its eigenvector.

As per the definition of eigenvalues and eigenvectors:

$\begin{bmatrix} Covariance \: matrix \end{bmatrix} \cdot \begin{bmatrix} Eigenvector \end{bmatrix} = \begin{bmatrix} eigenvalue \end{bmatrix} \cdot \begin{bmatrix} Eigenvector \end{bmatrix}$

We know the covariance matrix from step 2. Solving the above equation by some clever math will yield the below eigenvalues (e) and eigenvectors (E):

$e_{1}=1.644 \qquad E_{1}= \begin{bmatrix} 0.6819 \\ 0.7314 \end{bmatrix}$

$e_{2}=0.0376 \qquad E_{2}= \begin{bmatrix} -0.7314 \\ 0.6819 \end{bmatrix}$

Step 4 – Re-orient data: Since the eigenvectors indicate the directions of the principal components (new axes), we will multiply the original data by the matrix whose columns are the eigenvectors, to re-orient our data onto the new axes. This re-oriented data is called a score.

$Sc = \begin{bmatrix} Orig. \: data \end{bmatrix} \cdot \begin{bmatrix} Eigenvectors \end{bmatrix} = \begin{bmatrix} DJIA_{1} & S\&P_{1}\\ DJIA_{2} & S\&P_{2}\\ ... & ...\\ DJIA_{150} & S\&P_{150} \end{bmatrix} \cdot \begin{bmatrix} 0.6819 & -0.7314\\ 0.7314 & 0.6819 \end{bmatrix}$

Step 5 – Plot re-oriented data: We can now plot the rotated data, or scores.

Original data, re-oriented to fit the new axes

Step 6 – Bi-plot: A PCA would not be complete without a bi-plot. This is basically the plot above, except the axes are standardized on the same scale, and arrows are added to depict the original variables, lest we forget.

• Axes: In this bi-plot, the X and Y axes are the principal components.
• Points: These are the DJIA and S&P points, re-oriented to the new axes.
• Arrows: The arrows point in the direction of increasing values for each original variable. For example, points in the top right quadrant will have higher DJIA readings than points in the bottom left quadrant. The closeness of the arrows means that the two variables are highly correlated.

Bi-plot

Data from the St Louis Federal Reserve; PCA performed on R, with help of the ggplot2 package for graphs.

Abbas Keshvani

# World Wide Wage

South Koreans earn, on average, $33,140 per year (PPP), making them almost as rich as Britons. However, Koreans also work 30% more hours than Britons, making their per-hour wage considerably less than a British wage. In fact, the Korean wage of $15 per hour (PPP) is comparable to that of a Czech or a Slovakian.

Here is a map of the working hours of mostly OECD countries.

Hours worked per week

As you can see, people in developing countries have to work longer hours. The exception is South Korea, which works pretty hard for a rich country – harder than Brazil and Poland do.

If you divide per-capita income by working hours, you get a person’s hourly wage:

World Wide Wage

The top 10 earners by wage are mostly northern Europeans – Norway, Germany, Denmark, Sweden, Switzerland, the Netherlands – and the small financial centres Singapore and Luxembourg. As the first to industrialize, Europeans found they were able to mechanize ploughing, assembling and number-crunching – boosting incomes while simultaneously decreasing working hours.

The bottom earners are developing countries – such as Mexico, Brazil and Poland. Again, Korea stands out as a rich country with low wages. This could be because Korea exported its way into prosperity by winning over Western consumers from the likes of General Motors and General Electric. It did this by combining industrialization with low wages, which are therefore responsible for the ascent of its economy.

Data from World Bank, OECD, and BBC. Maps created on R.
Abbas Keshvani

# If Scotland becomes a country

On the 18th of September 2014, Scottish people will vote on secession from the United Kingdom, potentially ending a union that has existed since 1707. If Scots vote “Yes” to end the union, the United Kingdom will consist of England, Wales and Northern Ireland, while the newly created country of Scotland may look like this:

Basically, Scotland would look a lot like Finland. The two countries have similar populations and GDPs, and even their respective largest cities are about the same size.

Abbas Keshvani

# I am probably watching TV right now

I got an email from a gentleman boasting labor data that even the Bureau of Labor Statistics hasn’t published yet! Retale, a tool that helps you scour advertisements for deals in your area, has produced a cool infographic on what Americans are doing right now. It tells you what activities Americans are doing at different times of the day: at 9PM, 34% of Americans are watching TV, 7% are still working, etc…

The data is pretty reliable, as it is from the Bureau of Labor Statistics. You can even break down the graphs into various demographics: men, women, 45-54 year olds, retirees…

Even though I am not American, the model predicts correctly that I am watching TV right now. The TV is on in the background as I type this, so close enough. Check out the infographic here.

Abbas Keshvani

# Indian Elections 2014 – a Summary

India conducted general elections between 7th April and 12th May, which elected a Member of Parliament to represent each of the 543 constituencies that make up the country. The opposition BJP won 31% of the votes, which yielded them 282 out of 543 seats in parliament, or 52% of all seats. The BJP allied with smaller parties, such as the Telugu Desam Party, to form the National Democratic Alliance (NDA). Altogether, the NDA won 39% of the votes and 336 seats (62%).
India’s parties, topped up by their allies

Turnout was pretty good: 541 million Indians, or 66% of the total vote bank, participated in the polls.

Google and Bing both performed excellent analytics on the election results, but I thought Bing’s was easier to use, since their visual is a clean and simple India-only map. They actually out-simpled Google this time.

Bing: A constituency is more likely to elect the BJP (orange) if its people speak Hindi

Interestingly, the BJP’s victories seem to come largely from Hindi speakers, traditionally concentrated in the north and west parts of India. Plenty of non-Hindi speakers voted for the BJP too, such as in Gujarat and Maharashtra, but votes in the south and east of the country generally went to a more diverse pantheon of parties.

Abbas Keshvani

# How to do a chi-square test in 7 steps

What is a chi-square test: A chi-square test examines the relationship between two attributes. Suppose we suspect that rural Americans tended to vote Romney, and urban Americans tended to vote Obama. In this case, we suspect a relationship between where you live and whom you vote for. The full name for this test is Pearson’s Chi-Square Test for Independence, named after Karl Pearson, the founder of modern statistics.

How to do it: We will use a fictitious poll of 100 people. Each person was asked where they live and whom they voted for.

1. Draw a chi-square table. Each column will be whom you voted for, giving us two columns for Obama and Romney. Each row will be where you live, giving us three rows – rural, suburban and urban.

2. Calculate totals for each row and column. The purpose of the first column total is to find out how many votes Obama got from all areas. Similarly, the purpose of the first row total is to find out how many rural votes were cast for either candidate.

3. Calculate probabilities for each row and column.
These will be the individual probabilities of voting Obama, voting Romney, living in the country, etc… For example, the Obama column total tells us that 54 out of 100 people polled voted Obama, so the probability of voting Obama is 0.54.

4. Calculate the joint probabilities of belonging to each category. For example, the probability of being rural and an Obama voter is found by multiplying the probability of voting Obama (0.54) with the probability of living in the country (0.13). So, 0.54 × 0.13 = 0.0702: a person has a 0.0702 chance of being a rural Obama voter. In doing so, we assume that where you live and whom you voted for are independent. This assumption, called the null hypothesis, may well be wrong, and we will test it later by testing the joint probabilities it yielded.

5. Based on these joint probabilities, how many people do we expect to belong to each category? We multiply the joint probability for each category by 100, the number of people.

6. These expected numbers are based on the assumption (hypothesis) that whom you voted for and where you live are independent. We can test this hypothesis by holding these expected numbers against the actual numbers we have. First, we need our chi-square value $\chi^{2}$:

$\chi^{2}=\sum_{i=1}^{K}\frac{(O_{i}-E_{i})^{2}}{E_{i}}$

Basically, the equation asks that, for each category, you find the discrepancy between the observed number $O_{i}$ and the expected number $E_{i}$, square it, and then divide it by the expected number $E_{i}$. Finally, add up these figures for all categories. I got 0.769 as my chi-square value.

7. Look at a chi-square table. Note that our degrees of freedom in a chi-square test is $\left ( N_{row}-1 \right )\times \left ( N_{col}-1 \right )$. In our case, with 3 rows and 2 columns, we get 2 degrees of freedom. For a 0.05 level of significance and 2 degrees of freedom, we get a threshold (minimum) chi-square value of 5.991.
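Steps 6 and 7 can be reproduced in R. A sketch – the vote counts below are hypothetical, chosen only to be consistent with the poll’s stated totals (54 Obama voters, 13 rural respondents, 100 people); qchisq gives the step-7 critical value:

```r
# Critical value for significance level 0.05 and 2 degrees of freedom
qchisq(0.95, df = 2)  # ~5.991

# The whole test in one call, on a (hypothetical) 3x2 table of counts
votes <- matrix(c( 7,  6,
                  22, 18,
                  25, 22),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Rural", "Suburban", "Urban"),
                                c("Obama", "Romney")))
chisq.test(votes)  # compare the reported X-squared against 5.991
```

chisq.test computes the expected counts, the statistic and the p-value in one go, so steps 2 through 7 collapse into a single line once the table is built.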
Since our chi-square value of 0.769 is smaller than this minimum, we cannot reject the null hypothesis that where you live and whom you voted for are independent. That is the end of the chi-square test for independence.

Abbas Keshvani

# Introducing Statwing

Recently, Greg Laughlin, the founder of a new statistical software called Statwing, let me try his product for free. I happen to like free things very much (the college student is strong within me), so I gave it a try. I mostly like how easy it is to use: for instance, to relate two attributes like Age and Income, you click Age, click Income, and click Relate.

So what can Statwing do?

1. Summarize an attribute (like “age”): totals, averages, standard deviation, confidence intervals, percentiles, visual graphs like the one below
2. Relate two columns together (“Openness” vs “Extraversion”)

• Plots the two attributes against each other to see how they relate. It will include the formula of the regression line and the R-squared value.
• Sometimes a chi-square-style table is more appropriate. The software determines how best to represent the data.
• Tests the null hypothesis that the attributes are independent, by a t-test, F-test (ANOVA) or chi-square test. Statwing determines which one is appropriate.
• Repeats the above for a ranked correlation.

For now, you can’t forecast a time series or represent data on maps. But Greg told me that the team is adding new features as I type this. If you’d like to try the software yourself, click here. They’ve got three sample datasets to play with:

1. Titanic passenger information
2. The results of a psychological survey
3. A list of congressmen, their voting records and donations

Abbas Keshvani

# Crime map for the City of London

In my experience, central London is generally a safe place, but I was robbed there two years ago. A friend and I got lost on our way to a pancake house (serving, not made of), so I took my new iPhone out to consult a map.
In a flash, a bicyclist zoomed past and plucked my phone out of my hands. Needless to say, I lost my appetite for pancakes that day.

But I am far from alone. Here, I have plotted 506 instances of theft, violence, arson, drug trade, and anti-social behaviour onto a map of London. The data I am using only lists crimes in the City of London, a small area within central London which hosts the global HQs of many banks and law firms, for the month of February 2014.

Crime in the City of London – February 2014

Each point on this map is not a single instance of crime – recall that the data lists over 500 instances. Rather, each point corresponds to multiple instances of crime which happened at a particular spot. So, it is probably best to split the map into hexagons (no particular reason for my choice of shape) which are colour-coded by how dense the crime in that area is.

Heatmap of crime in Central London – Feb 2014

A particular hotspot for crime appears to be the area around the Gherkin, or 30 St Mary Axe, Britain’s most expensive office building.

Data from data.police.uk; graphics produced on R using the ggplot2 package; map from Google Maps.

Abbas Keshvani

# CO2 Emissions per Dollar

For all the flak China receives about its greenhouse gas emissions, the average Chinese produces less than a third the amount of CO2 that his American counterpart does. It just so happens that there are 1.3 billion Chinese and 0.3 billion Americans, so China ends up producing more CO2.

Carbon dioxide and other greenhouse gases, such as methane and carbon monoxide, are produced from burning petrol, growing rice, and raising cattle. These greenhouse gases let in sun rays, but do not let out the heat that the rays generate on earth. This results in a greenhouse effect, where global temperatures are purported to be rising as a result of human activities.
The below map shows the per-capita emissions of greenhouse gases: As you can see, the least damage is done by people in Africa, South Asia, and Latin America. But these places also happen to be the poorest places:…

# How Countries Fare, 2010

The Current Account Balance is a measure of a country’s “profitability”. It is the sum of profits (losses) made from trading with other countries, profits (losses) made from investments in other countries, and cash transfers, such as remittances from expatriates.

As the infographic shows, there isn’t much middle ground when it comes to a current account balance. Most countries have:

• large deficits (America, most of Europe, Australia, Brazil, India)
• large surpluses (China, most of Southeast Asia, northern European countries, Russia, Gulf oil producers)

There are a few countries with

• small deficits (most Central American states, Pakistan)
• small surpluses (most Baltics)

…but they are largely outnumbered by the clear winners and losers of world trade. The above is not a per-capita infographic, so larger countries tend to be clear winners or losers, while smaller countries are more likely to straddle the divide. Here is the per-capita Current Account Balance map:

# Daily/monthly/yearly tallies for your data

Say you have a dataset where each row has a date or time, and something is recorded for that date and time. If each row is a unique date – great! If not, you may have rows with the same date, and you have to combine records for the same date to get a daily tally. Here is how you can make a daily tally (or a monthly or yearly one; the frequency of tallies is not important):

1. convert the dates to numbers. R will say 01/01/1970 is day 0, 02/01/1970 is day 1, …, 07/03/2010 is day 14675; 31/12/1969 is day -1.
2. use a “for loop” to lump entries from the same date together
3. calculate the daily tally by counting the number of rows in each daily lump (I do this below), or by adding up all entries in a particular column of the daily lump

To get the daily total:

```r
rott[,2] <- as.numeric(as.Date(rott[,2], format="%m/%d/%Y"))
daily <- matrix(NA, 184, 1)
for(i in 1:184) # my data spans 184 days, from 7th March to 6th Sept 2010
{
  rott.i <- rott[rott[,2]==14674+i,] # the rows recorded on day i
  daily[i,1] <- nrow(rott.i) # 7th March 2010 is day 14675 after 01/01/1970, the day the R calendar starts
}
acf(daily, main="Autocorrelation of Timeseries") # ACF!
```

Abbas Keshvani

# Using ggplot2

Made on ggplot

I have a standard code for ggplot2 which I use to make line graphs, scatter plots, and histograms. For lines or scatters:

```r
p <- ggplot(x, aes(x=Year, y=Rank, colour=Uni, group=Uni)) # colour lines by variable Uni; group Uni-labelled points into the same line
```

Then:

```r
p + # you get an error if not for this step
  geom_line(size=1.2) +
  geom_point(data=QS[QS[,2]=="2013",]) +
  geom_text(data=QS[QS[,2]=="2013"&QS[,1]!="Princeton",], aes(label=paste(paste(Rank,".",sep=""),Uni)), hjust=-0.2) +
  ylim(17,0.5) +
  scale_x_continuous(limit=c(2004,2014), breaks=seq(2004,2014,1)) +
  theme(legend.position="none") +
  ggtitle("QS University Rankings 2008-2013") +
  theme(plot.title=element_text(size=rel(1.5))) +
  theme_bw() +
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank()) +
  geom_text(aes(label=country), size=6, vjust=-1) +
  annotate("text", x=2011, y=16.5, label="Abbas Keshvani")
```

For a bar chart:

```r
ggplot(Dist, aes(x=B, y=C, fill=A)) + # stacked bars, column A contains the stacks
  geom_bar(stat="identity", width=0.9)
```

Abbas Keshvani

# CO2 Emissions per Dollar of GDP

For all the flak China receives about its greenhouse gas emissions, the average Chinese produces less than a third the amount of CO2 that his American counterpart does. It just so happens that there are 1.3 billion Chinese and 0.3 billion Americans, so China ends up producing more CO2.
Carbon dioxide and other greenhouse gases, such as methane and carbon monoxide, are produced by burning petrol, growing rice, and raising cattle. These gases let in the sun's rays, but do not let out the heat those rays generate on earth. The result is the greenhouse effect, by which global temperatures are purported to be rising as a result of human activity.

The below map shows the per-capita emissions of greenhouse gases:

Greenhouse Gas Emissions per capita

As you can see, the least damage is done by people in Africa, South Asia, and Latin America. But these places also happen to be the poorest places: because they don't have much industry, they don't churn out much CO2. The below plot shows the correlation between poverty and green-ness. As you can see, each dollar of a rich person is attached to a smaller carbon cost than the dollar of a poor person. This is partly because rich people get most of their manufacturing done by poor people, but also because rich people are more environmentally conscious.

Plot: CO2 per Dollar vs. GDP per capita

Lastly, here is a map of CO2 emissions per dollar of GDP, which shows how green different economies are:

CO2 Emissions per Dollar

CO2 emissions per dollar of output are lowest in:

• the EU and Japan: highly regulated and environmentally conscious
• sub-Saharan Africa: subsistence-based economies

…and highest in the industrializing economies of Asia. Kudos to Brazil for being so green despite its middle-income status. Were these statistics to factor in the CO2 absorbed by rainforests, Brazil and other equatorial countries would appear even greener.

Data from the World Bank. Graphics produced on R.

Abbas Keshvani

# University Rankings over Time

The QS Rankings are an influential score sheet of universities around the world. They are published annually by Quacquarelli Symonds (QS), a British research company based in London.
The rankings for 2013 are out, and I have charted the rankings of this year's top 10 over the last five years:

QS's top 10 from 2008 to 2013; the label is the 2013 rank. Columbia is included because it was in the top 10 in 2008 and 2010.

Observations from this year's ranking:

• MIT (#1 in 2013) has shot up the rankings. This is in line with the increasing demand for technical and computer science education. At Harvard, enrollment in the college's introductory computer science course went from around 300 students in 2008 to almost 800 students in 2013!
• Asia's top scorer is the National University of Singapore.

Method: How QS Ranks Universities

The QS Rankings produce an aggregate score, on a scale of 0-100, for each university. The aggregate score is a sum of six weighted scores:

• Academic reputation: from a global survey of 62,000 academics (40%)
• Student:Faculty ratio (20%)
• Citations per Faculty: how many times the university's research is cited in other sources on Scopus, a database of research (20%)
• Employer reputation: from a global survey of 28,000 employers (20%)
• International Faculty (Students): the proportion of faculty (students) from abroad (5% each)

Note that many of the universities are separated by tiny margins (MIT, Harvard, Cambridge, UCL, and Imperial are all within 1.3 points of each other), which increases the likelihood of bias or error influencing the ranking. In any case, it appears futile to try to compare massive multi-disciplinary institutions by a single statistic. However, larger trends – like MIT's and Stanford's ascendancy – are noteworthy.

Data from QS Rankings. Graphics produced on R.

Abbas Keshvani

# What is the "Average" American Salary?

In America, the richest 1% of households earned almost 20% of the income in 2012, which points to a very wide income gap. This presents many social and economic problems, but also a statistical problem: what is the "average" American's salary?
This average is often reported as a mean, like GDP per capita or the mean of household incomes. In 2011, the mean household earned $70,000. However, the majority of Americans earned well below $70,000 that year. The reason for this misrepresentation is rich people: in 2011, Oracle CEO Larry Ellison alone made almost $100 million – enough to add a dollar to every household's income, were his salary distributed among them, which is exactly how the mean makes it appear.

Here is a graphic of American inequity:

Income Distribution in America: the blue part of the last bar represents the earnings of the top 5% of households

As you can see, the mean would not be such a poor representation (or rich representation) of the average salary if we discounted the top 5%.

The trimmed mean addresses this: it removes extreme values before calculating the mean. Unfortunately, the trimmed mean is not widely used by the agencies that report incomes – the IRS, the Bureau of Economic Analysis, and the US Census.

In this case, the median is a much better average. This is simply the income right in the middle of the list of incomes.
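A quick sketch of the difference in R, using a made-up list of incomes (the numbers are purely illustrative, not Census data):

```r
incomes <- c(20, 35, 40, 45, 50, 55, 60, 70, 90, 700)  # in thousands; one outlier at the top

mean(incomes)            # 116.5 -- dragged far up by the outlier
median(incomes)          # 52.5  -- the income right in the middle of the list
mean(incomes, trim=0.1)  # 55.625 -- drops the top and bottom 10% before averaging
```

One extreme earner is enough to pull the mean to more than double the median; the median and the trimmed mean barely move.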

American Household Income: the Mean is much higher than the Median

As you can see, whether you use the Mean or Median makes a very big difference. The median household income is $20,000 lower than the mean household income.

Of course, America is not the only country with a wide economic divide. China, Mexico, and Malaysia have similar disparities between rich and poor, while most of South America and Southern Africa are even more polarized, as measured by the Gini coefficient, an index of economic inequality.
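One common way to compute the Gini coefficient is from its definition as half the mean absolute difference between all pairs of incomes, relative to the mean income. A minimal sketch in R – the `gini` function name and the income vectors are my own, for illustration only:

```r
gini <- function(x) {
  n <- length(x)
  # sum of |x_i - x_j| over all pairs, scaled by 2 * n^2 * mean(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

gini(c(50, 50, 50, 50))  # 0 -- perfect equality
gini(c(0, 0, 0, 100))    # 0.75 -- one person holds everything
```

A Gini of 0 means everyone earns the same; values approaching 1 mean income is concentrated in very few hands.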

Data from the US Census. Available income data typically lags by two years, which is why the graphs stop at 2011; 2012 data is projected. Graphics produced on R.

Abbas Keshvani

# Types of Data on R

Handling data on R can be a simple process

There are different types of data on R. I use type here as a technical term, rather than merely a synonym of “variety”.  There are three main types of data:

1. Numeric: ordinary numbers
2. Character: not treated as a number, but as a word. You cannot add two characters, even if they appear to be numerical. Characters have “inverted commas” around them.
3. Date: can be used in time series analysis, like a time series plot.

To diagnose the type of data you're dealing with, use class().
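For instance (the values here are arbitrary examples):

```r
x <- 5
y <- "5"

class(x)  # "numeric"
class(y)  # "character"

x + x     # 10
# y + y   # Error: non-numeric argument to binary operator
```

Even though y looks like a number, R treats it as a word because of the inverted commas.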

You can convert data between types. To convert to:

1. Numeric: as.numeric()
2. Character: as.character()
3. Date: as.Date()
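A quick sketch of these conversions (the values are arbitrary):

```r
y <- "5"
as.numeric(y) + 1  # 6 -- the character now behaves as a number

z <- 15706  # numbers count days since 01/01/1970
as.Date(z, origin="1970-01-01")                # "2013-01-01"
as.character(as.Date(z, origin="1970-01-01"))  # back to a character string
```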

Note that to convert a character or numeric to a date, you may need to specify the format of the date:

• yyyy-mm-dd: as.Date(x) *R's default format, so format needn't be specified
• ddmmyyyy: as.Date(x, format="%d%m%Y")
• mmddyyyy: as.Date(x, format="%m%d%Y")
• dd-mm-yyyy: as.Date(x, format="%d-%m-%Y")
• dd/mm/yyyy: as.Date(x, format="%d/%m/%Y")
• if the month is named, like 12February1989: as.Date(x, format="%d%B%Y")
• if the month is short-form named, like 12Feb1989: as.Date(x, format="%d%b%Y")
• if the year is in two-digit form, like 12Feb89: as.Date(x, format="%d%b%y")
• if the date is in mmyyyy form: as.yearmon(x, format="%m%Y") *from the zoo package
• if the date includes a time, like 21/05/2012 21:20:30: as.POSIXct(x, format="%d/%m/%Y %H:%M:%S") *as.Date would drop the time
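To illustrate, here are a few of the formats above in action (the dates are arbitrary examples):

```r
as.Date("12February1989", format="%d%B%Y")  # parses a full month name (locale-dependent)
as.Date("12Feb89", format="%d%b%y")         # "1989-02-12"
as.Date("21/05/2012", format="%d/%m/%Y")    # "2012-05-21"

# with a time component, as.POSIXct keeps the time of day:
as.POSIXct("21/05/2012 21:20:30", format="%d/%m/%Y %H:%M:%S")
```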

Abbas Keshvani