# Introducing Statwing

Recently, Greg Laughlin, the founder of a new statistical software called Statwing, let me try his product for free. I happen to like free things very much (the college student is strong within me) so I gave it a try.

I mostly like how easy it is to use: For instance, to relate two attributes like Age and Income, you click Age, click Income, and click Relate.

So what can Statwing do?

1. Summarize an attribute (like “age”): totals, averages, standard deviation, confidence intervals, percentiles, visual graphs like the one below
2. Relate two columns together (“Openness” vs “Extraversion”)
• Plots the two attributes against each other to see how they relate. It will include the formula of the regression line and the R-squared value.
• Sometimes a chi-square-style table is more appropriate. The software determines how best to represent the data.
• Tests the null hypothesis that the attributes are independent, by a T-test, F-test (ANOVA) or chi-square test. Statwing determines which one is appropriate.
• Repeat the above for a ranked correlation.

For now, you can’t forecast a time series or represent data on maps. But Greg told me that the team is adding new features as I type this.

If you’d like to try the software yourself, click here. They’ve got three sample datasets to play with:

1. Titanic passengers information
2. The results of a psychological survey
3. A list of congressmen, their voting records and donations.

Abbas Keshvani

# Crime map for the City of London

In my experience, central London is generally a safe place, but I was robbed there two years ago. A friend and I got lost on our way to a pancake house (serving, not made of), so I took my new iPhone out to consult a map. In a flash, a bicyclist zoomed past and plucked my phone out of my hands.  Needless to say, I lost my appetite for pancakes that day.

But I am far from alone. Here, I have plotted 506 instances of theft, violence, arson, drug trade, and anti-social behaviour onto a map of London. The data I am using only lists crimes in the City of London, a small area within central London which hosts the global HQs of many banks and law firms, for the month of February 2014.

Each point on this map is not a single instance of crime: the data lists over 500 instances, so each point corresponds to multiple crimes that happened at a particular spot. It is therefore probably best to split the map into hexagons (no particular reason for my choice of shape), colour-coded to show how dense the crime in each area is.

A particular hotspot for crime appears to be the area around the Gherkin, or 30 St Mary Axe, Britain’s most expensive office building.

Data from data.police.uk; Graphics produced on R using ggplot2 package; Map from Google maps.

Abbas Keshvani


# How Countries Fare, 2010

The Current Account Balance is a measure of a country’s “profitability”. It is the sum of profits (losses) made from trading with other countries, profits (losses) made from investments in other countries, and cash transfers, such as remittances from expatriates.

As the infographic shows, there isn’t much middle ground when it comes to a current account balance. Most countries have:

• large deficits (America, most of Europe, Australia, Brazil, India)
• large surpluses (China, most of Southeast Asia, Northern European countries, Russia, Gulf oil producers).

There are a few countries with

• small deficits (most Central American states, Pakistan)
• small surpluses (most Baltics)

…but they are largely outnumbered by the clear winners and losers of world trade.

The above is not a per-capita infographic, so larger countries tend to be clear winners or losers, while smaller countries are more likely to straddle the divide. Here is the per-capita Current Account Balance map:

This…


# Daily/monthly/yearly tallies for your data

Say you have a dataset, where each row has a date or time, and something is recorded for that date and time. If each row is a unique date – great! If not, you may have rows with the same date, and you have to combine records for the same date to get a daily tally.

Here is how you can make a daily tally (or a monthly or yearly one; the frequency of tallies is not important):

1. convert the dates to numbers. R counts days from 01/01/1970, which is day 0: 02/01/1970 is day 1, …, 07/03/2010 is day 14675; 31/12/1969 is day -1.
2. use a “for loop” to lump entries from the same date together
3. calculate the daily tally by counting the rows in each daily lump (I do this below), or by adding all entries in a particular column of the daily lump

To get the daily total:

rott[,2] <- as.numeric(as.Date(rott[,2], format="%m/%d/%Y")) #convert mm/dd/yyyy text dates to day numbers

daily <- matrix(NA,184,1) #one row per day

for(i in 1:184) #my data spans 184 days from 7th March to 6th Sept 2010
{
rott.i <- rott[rott[,2]==14674+i,] #rows recorded on day i; 7th March 2010 is day 14675, counting from 01/01/1970, the day the R calendar starts
daily[i,1] <- nrow(rott.i) #daily tally = number of rows in the daily lump
}

acf(daily,main="Autocorrelation of Timeseries") #ACF!
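The same tally can be sketched without a loop. This is only an illustration on a made-up data frame (the column names `id` and `date`, and the four-day window, are my assumptions, not the original `rott` data); it uses base R's `tabulate()` to count rows per day number.

```r
# Hypothetical data: one row per event, dates stored as mm/dd/yyyy text
rott <- data.frame(
  id   = 1:6,
  date = c("03/07/2010", "03/07/2010", "03/08/2010",
           "03/10/2010", "03/10/2010", "03/10/2010")
)

# Convert text dates to day numbers (days since 1970-01-01)
day <- as.numeric(as.Date(rott$date, format = "%m/%d/%Y"))

# Day number of the first day of the window, 7th March 2010
start <- as.numeric(as.Date("2010-03-07"))

# Count rows per day; days with no rows get a zero
n_days <- 4
daily <- tabulate(day - start + 1, nbins = n_days)
daily
```

Days without any record show up as zeros, which matters if the tallies are later passed to `acf()`.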

Abbas Keshvani

# Using ggplot2

I have a standard code for ggplot2 which I use to make line graphs, scatter plots, and histograms.

For lines or scatters:

p <- ggplot(QS, aes(x=Year, y=Rank, colour=Uni, group=Uni)) #QS is the data frame of rankings #colour lines by variable Uni #group Uni labelled variables in the same line

Then:

p + #you get an error if not for this step
geom_line(size=1.2) +
geom_point(data=QS[QS[,2]=="2013",]) +
geom_text(data=QS[QS[,2]=="2013"&QS[,1]!="Princeton",],aes(label=paste(paste(Rank,".",sep=""),Uni)),hjust=-0.2)+
ylim(17,0.5) + #reversed limits, so rank 1 sits at the top
scale_x_continuous(limits=c(2004,2014),breaks=seq(2004,2014,1)) +
ggtitle("QS University Rankings 2008-2013") +
theme_bw() + #apply the complete theme first, otherwise it resets the theme() tweaks below
theme(legend.position="none") +
theme(plot.title=element_text(size=rel(1.5))) +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank()) +
geom_text(aes(label=country),size=6,vjust=-1) +
annotate("text",x=2011,y=16.5,label="Abbas Keshvani")

For a bar chart:

ggplot(Dist, aes(x=B,y=C,fill=A)) + #stacked bars, column A contains stacks
geom_bar(stat="identity", width=0.9)

Abbas Keshvani

# CO2 Emissions per Dollar of GDP

For all the flak China receives about its greenhouse gas emissions, the average Chinese produces less than a third as much CO2 as his American counterpart. It just so happens that there are 1.3 billion Chinese, and 0.3 billion Americans, so China ends up producing more CO2.

Carbon dioxide and other greenhouse gases, such as methane and carbon monoxide, are produced from burning petrol, growing rice, and raising cattle. These greenhouse gases let in sun rays, but do not let out the heat that the rays generate on earth. This results in a greenhouse effect, where global temperatures are purported to be rising as a result of human activities.

The below map shows the per-capita emissions of greenhouse gases:

As you can see, the least damage is done by people in Africa, South Asia, and Latin America. But these places also happen to be the poorest places: Because they don’t have much industry, they don’t churn out much CO2.

The below plot shows the correlation between poverty and green-ness. As you can see, each dollar of a rich person is attached to a smaller carbon cost than the dollar of a poor person. This is partially because rich people get most of their manufacturing done by poor people, but also because rich people are more environmentally conscious.

Lastly, here is a map of CO2 emissions per dollar of GDP, which shows how green different economies are:

CO2 emissions per Dollar of output are lowest in:

• EU and Japan: highly regulated and environmentally conscious
• sub-Saharan Africa: subsistence-based economies

…and highest in the industrializing economies of Asia.

Kudos to Brazilian output for being so green, despite the country’s middle-income status. Were these statistics to factor in the CO2 absorption from rainforests, Brazil and other equatorial countries would appear even greener.

Data from the World Bank. Graphics produced on R.

Abbas Keshvani

# University Rankings over Time

The QS Rankings are an influential score sheet of universities around the world. They are published annually by Quacquarelli Symonds (QS), a British research company based in London. The rankings for 2013 are out, and I have charted the rankings of this year’s top 10 over the last five years:

Observations from this year’s ranking:

• MIT (#1 in 2013) has shot up in the rankings. This is in line with the increasing demand for technical and computer science education. At Harvard, enrollment into the college’s introductory computer science course went up, from around 300 students in 2008 to almost 800 students in 2013!
• Asia’s top scorer is National University of Singapore

Method:

The QS Rankings produce an aggregate score, on a scale of 0-100, for each university. The aggregate score is a sum of six weighted scores:

• Student:Faculty ratio (20%)
• Citations per Faculty: How many times the university’s research is cited in other sources on Scopus, a database of research (20%)
• Employer reputation: from a global survey of 28,000 employers (20%)
• Int’l Faculty (Students): proportion of faculty (students) from abroad (5% each)

Note that many of the universities are separated by tiny margins (MIT, Harvard, Cambridge, UCL, Imperial are all within 1.3 points of each other), which increases the likelihood of bias or error influencing the ranking.

In any case, it appears futile to try and compare massive multi-disciplinary institutions by a single statistic.

However, larger trends – like MIT’s and Stanford’s ascendancy – are noteworthy.

Data from QS Ranking. Graphics produced on R.

Abbas Keshvani

# What is the “Average” American Salary?

In America, the richest 1% of households earned almost 20% of the income in 2012, which points to a very wide income gap. This presents many social and economic problems, but also a statistical problem: what is the “average” American’s salary?

This average is often reported as GDP per capita: the mean of household incomes. In 2011, the mean household earned $70,000. However, the majority of Americans earned well below $70K that year. The reason for this misrepresentation is rich people: in 2011, Oracle CEO Larry Ellison made almost $100 million, alone adding a dollar to each household’s income, were his salary distributed among everyone, as indeed the mean makes it appear it is.

Here is a graphic of American inequity:

As you can see, the mean would not be such a poor representation (or rich representation) of the average salary if we discounted the top 5%. In fact, the trimmed mean removes extreme values before calculating the mean. Unfortunately, the trimmed mean is not widely used in data reporting by the agencies that report incomes: the IRS, Bureau of Economic Analysis and the US Census.

In this case, the median is a much better average. This is simply the income right in the middle of the list of incomes.

As you can see, whether you use the mean or the median makes a very big difference. The median household income is $20,000 lower than the mean household income.
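To make the difference between the three averages concrete, here is a small sketch in base R. The incomes are made-up numbers (in thousands of dollars), chosen only so that one very rich household sits among nine ordinary ones; they are not Census figures.

```r
# Hypothetical household incomes, in thousands of dollars
incomes <- c(20, 30, 35, 40, 45, 50, 55, 60, 80, 1000)

mean_income    <- mean(incomes)              # pulled up by the one rich household
median_income  <- median(incomes)            # the income in the middle of the list
trimmed_income <- mean(incomes, trim = 0.1)  # drops the lowest and highest 10% first

c(mean = mean_income, median = median_income, trimmed = trimmed_income)
```

With these numbers the mean is roughly three times the median, while the trimmed mean lands close to the median.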

Of course, America is not the only country with a wide economic divide. China, Mexico and Malaysia have similar disparities between rich and poor, while most of South America and Southern Africa are even more polarized, as measured by the Gini coefficient, a measure of economic inequality.

Data from the US Census. Available income data typically lags by two years, which is why graphs stop at 2011; 2012 Data is projected. Graphics produced on R.

Abbas Keshvani

# Types of Data on R

There are different types of data on R. I use type here as a technical term, rather than merely a synonym of “variety”.  There are three main types of data:

1. Numeric: ordinary numbers
2. Character: not treated as a number, but as a word. You cannot add two characters, even if they appear to be numerical. Characters have “inverted commas” around them.
3. Date: can be used in time series analysis, like a time series plot.

To diagnose the type of data you’re dealing with, use class()

You can convert data between types. To convert to:

1. Numeric: as.numeric()
2. Character: as.character()
3. Date: as.Date()

Note that to convert a character or numeric to a date, you may need to specify the format of the date:

• ddmmyyyy: as.Date(x, format="%d%m%Y") *the format must be specified; only yyyy-mm-dd (or yyyy/mm/dd) is read by default
• mmddyyyy: as.Date(x, format="%m%d%Y")
• dd-mm-yyyy: as.Date(x, format="%d-%m-%Y")
• dd/mm/yyyy: as.Date(x, format="%d/%m/%Y")
• if the month is named, like 12February1989: as.Date(x, format="%d%B%Y")
• if the month is short-form named, like 12Feb1989: as.Date(x, format="%d%b%Y")
• if the year is in two digit form, like 12Feb89: as.Date(x, format="%d%b%y")
• if the date is in mmyyyy form: as.yearmon(x, format="%m%Y") *from zoo package
• if the date includes time, like 21/05/2012 21:20:30: as.Date(x, format="%d/%m/%Y %H:%M:%S") *as.Date() keeps only the date; use as.POSIXct() with the same format to keep the time
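Here is a short sketch of these conversions in base R. The example values are my own; the time zone `"UTC"` is an assumption I add so that the result does not depend on the machine's settings.

```r
x <- "21/05/2012"
class(x)                                   # a character, not a date

d <- as.Date(x, format = "%d/%m/%Y")       # character -> Date
class(d)
as.numeric(d)                              # Date -> number of days since 1970-01-01
format(d, "%Y-%m-%d")                      # Date -> character, in yyyy-mm-dd form

# Characters that look like numbers must be converted before arithmetic
total <- as.numeric("3") + as.numeric("4")

# Keeping the time of day needs as.POSIXct() rather than as.Date()
t <- as.POSIXct("21/05/2012 21:20:30", format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
format(t, "%H:%M:%S")
```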

Abbas Keshvani

# Forecasting a Timeseries

Suppose you have decided on a suitable model for a timeseries. In this case, we have selected an ARIMA(2,1,3) model for the DJIA, using the Akaike Information Criterion (AIC) as our sole criterion for choosing between various models (the comparison is in my post on using AIC to test ARIMA models).

Note: There are many criteria for choosing a model, and the AIC is only one of them. Thus, the AIC should be used heuristically, in conjunction with t-tests and the Coefficient of Determination, among other statistics. Nonetheless, let us assume that we ran all these tests, and were still satisfied with ARIMA(2,1,3).

An ARIMA(2,1,3) looks like this:

$\Delta Y_t = \phi_2 \Delta Y_{t-2} + \phi_1 \Delta Y_{t-1} + \theta_{3} \epsilon_{t-3} + \theta_{2} \epsilon_{t-2} + \theta_1 \epsilon_{t-1} + \epsilon_{t}$

This is not very informative for forecasting future realizations of a timeseries, because we need to know the values of the coefficients $\phi_2$, $\phi_1$, etcetera. So we use R’s arima() function, which spits out the following output:

Thus, we revise our model to:

$\Delta Y_t = -0.992 \Delta Y_{t-2} + 0.1840 \Delta Y_{t-1} - 0.0511 \epsilon_{t-3} + 1.0101 \epsilon_{t-2} - 0.2483 \epsilon_{t-1} + \epsilon_{t}$

Then, we can forecast the next, say 20, realizations of the DJIA, to produce a forecast plot. We are forecasting values for January 1st 1990 to January 26th 1990, dates for which we have the real values. So, we can overlay these values on our forecast plot:

Note that the forecast is more accurate for predicting the DJIA a few days ahead than later dates. This could be due to:

1. the model we use
2. fundamental market movements that could not be forecasted
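The fit-then-forecast steps can be sketched with base R's `arima()` and `predict()`. This is only an illustration on a simulated series, not the DJIA data, and I use a simpler ARIMA(1,1,1) with made-up coefficients so the example fits reliably; swap in your own series and order.

```r
set.seed(1)

# Simulated stand-in for a price series (NOT the DJIA data)
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3), n = 300)

# Estimate the coefficients by maximum likelihood
fit <- arima(y, order = c(1, 1, 1), method = "ML")
coef(fit)

# Forecast the next 20 realizations, with their standard errors
fc <- predict(fit, n.ahead = 20)
fc$pred
fc$se
```

The standard errors in `fc$se` grow with the forecast horizon, which is one way to see why forecasts a few days ahead are more reliable than later ones.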

Which is why data in a vacuum is always pleasant to work with. Next: Data in a vacuum. I will look at data from the biggest vacuum of all – space.

Abbas Keshvani

# Using AIC to Test ARIMA Models

The Akaike Information Criterion (AIC) is a widely used measure of a statistical model. It basically quantifies 1) the goodness of fit, and 2) the simplicity/parsimony, of the model into a single statistic.

When comparing two models, the one with the lower AIC is generally “better”. Now, let us apply this powerful tool in comparing various ARIMA models, often used to model time series.
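For reference, the usual definition of the AIC, which is not spelled out in this post, is:

$\mathrm{AIC} = 2k - 2\ln(\hat{L})$

Here $k$ is the number of estimated parameters (the parsimony penalty) and $\hat{L}$ is the maximized value of the likelihood (the goodness of fit). A model earns a lower AIC either by fitting better or by using fewer parameters.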

The dataset we will use is the Dow Jones Industrial Average (DJIA), a stock market index that comprises 30 of America’s biggest companies, such as Hewlett Packard and Boeing. First, let us perform a time plot of the DJIA data. This massive dataframe comprises almost 32000 records, going back to the index’s founding in 1896. There was an actual lag of 3 seconds between me calling the function and R spitting out the below graph!

But it immediately becomes apparent that there is a lot more at play here than an ARIMA model. Since 1896, the DJIA has seen several periods of rapid economic growth, the Great Depression, two World Wars, the Oil shock, the early 2000s recession, the current recession, etcetera. Therefore, I opted to narrow the dataset to the period 1988-1989, which saw relative stability. As is clear from the timeplot, and slow decay of the ACF, the DJIA 1988-1989 timeseries is not stationary:

So, we may want to take the first difference of the DJIA 1988-1989 index. This is expressed in the equation below:

$\Delta Y_t = Y_t - Y_{t-1}$

The first difference is thus the difference between an entry and the entry preceding it. The timeseries and ACF of the First Difference are shown below. They indicate a stationary time series.
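In R, the first difference is one call to `diff()`. A tiny sketch on made-up numbers (not the DJIA):

```r
# Hypothetical index levels on five consecutive days
y <- c(100, 103, 101, 106, 110)

# First difference: each entry minus the entry preceding it
dy <- diff(y)
dy

# The differenced series is one entry shorter than the original
length(dy) == length(y) - 1
```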

Now, we can test various ARMA models against the DJIA 1988-1989 First Difference. I will test 25 ARMA models: ARMA(1,1); ARMA(1,2), … , ARMA(3,3), … , ARMA(5,5). To compare these 25 models, I will use the AIC.

I have highlighted in green the two models with the lowest AICs. Their low AIC values suggest that these models nicely straddle the requirements of goodness-of-fit and parsimony. I have also highlighted in red the worst two models: i.e. the models with the highest AICs. Since ARMA(2,3) is the best model for the First Difference of DJIA 1988-1989, we use ARIMA(2,1,3) for DJIA 1988-1989.

The AIC works as such: Some models, such as ARIMA(3,1,3), may offer better fit than ARIMA(2,1,3), but that fit is not worth the loss in parsimony imposed by the addition of additional AR and MA lags. Similarly, models such as ARIMA(1,1,1) may be more parsimonious, but they do not explain DJIA 1988-1989 well enough to justify such an austere model.
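A comparison of this kind can be sketched as a small grid search. The series below is simulated, not the DJIA first difference, and I only try orders 1 to 2 to keep the example quick; the `tryCatch()` wrapper is my own precaution in case a fit fails.

```r
set.seed(42)

# Simulated stand-in for a differenced series (NOT the DJIA data)
dy <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 300)

max_order <- 2
aic <- matrix(NA_real_, max_order, max_order,
              dimnames = list(paste0("AR", 1:max_order),
                              paste0("MA", 1:max_order)))

# Fit ARMA(p,q) for every combination and store its AIC
for (p in 1:max_order) {
  for (q in 1:max_order) {
    fit <- tryCatch(arima(dy, order = c(p, 0, q), method = "ML"),
                    error = function(e) NULL)
    if (!is.null(fit)) aic[p, q] <- fit$aic
  }
}

aic

# Row and column of the model with the lowest AIC
best <- which(aic == min(aic, na.rm = TRUE), arr.ind = TRUE)
best
```

Extending `max_order` to 5 gives the 25-model table described above.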

Note that the AIC has limitations and should be used heuristically. The above is merely an illustration of how the AIC is used. Nonetheless, it suggests that between 1988 and 1989, the DJIA followed the below ARIMA(2,1,3) model:

$\Delta Y_t = \phi_2 \Delta Y_{t-2} + \phi_1 \Delta Y_{t-1} + \theta_{3} \epsilon_{t-3} + \theta_{2} \epsilon_{t-2} + \theta_1 \epsilon_{t-1} + \epsilon_{t}$

Next: Determining the above coefficients, and forecasting the DJIA.

Analysis conducted on R. Credits to the St Louis Fed for the DJIA data.

Abbas Keshvani

# How to Use Autocorrelation Function (ACF) to Determine Seasonality

In my previous post, I wrote about using the autocorrelation function (ACF) to determine if a timeseries is stationary. Now, let us use the ACF to determine seasonality. This is a relatively straightforward procedure.

Firstly, seasonality in a timeseries refers to predictable and recurring trends and patterns over a period of time, normally a year. An  example of a seasonal timeseries is retail data, which sees spikes in sales during holiday seasons like Christmas. Another seasonal timeseries is box office data, which sees a spike in sales of movie tickets over the summer season. Yet another example is sales of Hallmark cards, which spike in February for Valentine’s Day.

The below graphs show sales of clothing in the UK, and how these sales follow seasonal trends, spiking in the holiday season:

Note the spikes in sales, which obediently occur every December, in time for Christmas. This is evident in the trail of December plot points (Graph 1), which hover significantly above the sales data for other months, and also in the actual spikes of the line graph (Graph 2).

The above is a simple example of a seasonal timeseries. However, timeseries are not always simply seasonal. For example, a SARMA process comprises seasonal, autoregressive, and moving average components, hence the acronym. This will not look as obviously seasonal, as the AR and MA processes may overlap with the seasonal process. Thus, a simple timeseries plot, as shown above, will not allow us to appreciate and identify the seasonal element in the series.

Thus, it may be advisable to use an autocorrelation function to determine seasonality. In the case of seasonality, we will observe an ACF as below:

Note that the ACF shows an oscillation, indicative of a seasonal series. Note the peaks occur at lags of 12 months, because April 2011 correlates with April 2012, and 24 months, because April 2011 correlates with April 2013, and so on.
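The oscillating pattern is easy to reproduce on simulated data. The series below is a made-up monthly series with a yearly cycle plus noise (a smooth sine wave, not the UK clothing sales), so treat it as a sketch of the idea only.

```r
set.seed(7)

months <- 1:120                                   # ten years of monthly data
sales  <- 10 + 3 * sin(2 * pi * months / 12) + rnorm(120, sd = 0.5)

a <- acf(sales, lag.max = 36, plot = FALSE)
r <- drop(a$acf)                                  # r[1] is lag 0, r[k + 1] is lag k

# Autocorrelation at lags 6, 12, 18 and 24 months
round(r[c(7, 13, 19, 25)], 2)
```

The autocorrelation peaks at lags 12 and 24 and dips in between, which is the signature of a 12-month season.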

The above analyses were conducted on R. Credits to data.gov.uk and the Office for National Statistics, UK for the data.

Abbas Keshvani

# How to use the Autocorrelation Function (ACF)?

The Autocorrelation function is one of the most widely used tools in timeseries analysis. It is used to determine stationarity and seasonality.

Stationarity:

This refers to whether the series is “going anywhere” over time. Stationary series have a constant mean and variance over time.

Below is what a non-stationary series looks like. Note the changing mean.

And below is what a stationary series looks like. This is the first difference of the above series, FYI. Note the constant mean (long term).

The above time series provide strong indications of (non-)stationarity, but the ACF helps us confirm this indication.

If a series is non-stationary (moving), its ACF may look a little like this:

The above ACF is “decaying”, or decreasing, very slowly, and remains well above the significance range (dotted blue lines). This is indicative of a non-stationary series.

On the other hand, observe the ACF of a stationary (not going anywhere) series:

Note that the ACF shows exponential decay. This is indicative of a stationary series.
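The two decay patterns can be compared on simulated series. Here I assume a random walk as the non-stationary example and an AR(1) process with coefficient 0.7 as the stationary one; neither is the series plotted above.

```r
set.seed(123)
n <- 1000

walk <- cumsum(rnorm(n))                          # random walk: non-stationary
ar1  <- arima.sim(model = list(ar = 0.7), n = n)  # AR(1): stationary

acf_walk <- drop(acf(walk, lag.max = 20, plot = FALSE)$acf)
acf_ar1  <- drop(acf(ar1,  lag.max = 20, plot = FALSE)$acf)

# Autocorrelation at lags 1, 10 and 20 (element k + 1 holds lag k)
round(acf_walk[c(2, 11, 21)], 2)
round(acf_ar1[c(2, 11, 21)], 2)
```

The random walk's ACF stays high across all 20 lags, while the AR(1) ACF has largely died out by lag 10.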

Consider the case of a simple stationary series, like the process shown below:

$Y_t = \epsilon_t$

We do not expect the ACF to be above the significance range for lags 1, 2, … This is intuitively satisfactory, because the above  process is purely random, and therefore whether you are looking at a lag of 1 or a lag of 20, the correlation should be theoretically zero, or at least insignificant.
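A quick check of this on simulated white noise: the sketch below counts how many of the first 20 autocorrelations fall inside the usual 95% significance range, which I compute as plus or minus 1.96 divided by the square root of the sample size.

```r
set.seed(2024)
n <- 500
y <- rnorm(n)                                     # purely random process

# Autocorrelations at lags 1 to 20 (lag 0 dropped)
r <- drop(acf(y, lag.max = 20, plot = FALSE)$acf)[-1]

# Approximate 95% significance range for white noise
bound <- qnorm(0.975) / sqrt(n)

# Share of lags that stay inside the range
inside <- mean(abs(r) < bound)
inside
```

Around 95% of the lags should stay inside the range; the odd lag poking out is expected by chance alone.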

Next: ACF for Seasonality

Abbas Keshvani

# Random Variables from a non-Parametric distribution know their limits

You produce a non-parametric distribution. Then you obtain, say, 10 random variables (RV) from this non-parametric distribution- much the same way as you would obtain random variables from a (parametric) normal distribution with stated mean and variance. But unlike the parametric distribution, where our RVs would occur around the mean (our parameter), RVs from a non-parametric distribution occur within the range bound by the lowest and highest mass point. This was not necessarily an intuitive concept to me, when I first stumbled across it. Which is why this mathematical proof of this range made me feel so much more comfortable:

If our estimate of the RV is a simple weighted-mean of the mass points:

$\hat{\beta} = z_{1}w_{1} + ... + z_{k}w_{k}$

Furthermore, since $z_1 \leq z_i \leq z_k$ for every mass point $z_i$, and the weights $w_i$ are non-negative:

$\left[w_{1}+...+w_{k} \right]z_{1}\leq \hat{\beta}\leq \left[w_{1}+...+w_{k} \right]z_{k}$

Since $\sum w_i=1$, we can express the inequality as:

$z_1 \leq \hat{\beta} \leq z_k$

On the other hand, if we know further information, like individual weights:

$\hat{\beta}=z_1w_1+...+z_kw_k$

Furthermore, since $z_1 \leq z_i \leq z_k$ for every mass point $z_i$, for intercept $\beta$:

$\left(w_{1}+...+w_{k}\right)z_1\leq \hat{\beta}\leq \left(w_{1}+...+w_{k}\right)z_k$

Since $\sum w_i=1$, we can express the inequality as:

$z_1\leq \hat{\beta}\leq z_k$

Thus, it is proven that any estimates of an RV drawn from a non-parametric distribution will be bound by the highest and lowest mass point.
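A numerical check of the result, on made-up mass points and weights (the values of `z` and `w` are mine, chosen only so the weights are non-negative and sum to 1):

```r
# Hypothetical mass points (sorted) and their weights
z <- c(1.2, 2.5, 4.0, 7.3)
w <- c(0.1, 0.4, 0.3, 0.2)

# The proof relies on non-negative weights that sum to 1
stopifnot(all(w >= 0), isTRUE(all.equal(sum(w), 1)))

# Weighted mean of the mass points
beta_hat <- sum(z * w)
beta_hat

# The estimate lies between the lowest and highest mass point
in_range <- min(z) <= beta_hat && beta_hat <= max(z)
in_range
```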

Abbas Keshvani

# Limits of Akaike Information Criterion (AIC)

We often use AIC to discern the best model among candidates.

Now suppose we have two (non-parametric) models, which use mass points and weights to model a random variable:

• model A uses 4 mass points to model a random variable (i.e. the height of men in Singapore)
• model B uses 5 mass points to model the same random variable

We consider model A to be nested in model B. This is because model A is basically a version of model B, where one mass point is “de-activated”.

Thus, we must not use small differences in AIC or BIC alone to judge between these models. If the model with a constraint on one or more parameters (model A) is regarded as nested within the model without the constraint (model B), a chi-square difference test, or Likelihood Ratio (LR) test, is performed to test the reasonableness of the constraint, using a central chi-square with degrees of freedom equal to the number of parameter constraints.

However, under the null hypothesis, the parameter of interest takes its value on the boundary of the parameter space (next post). For this reason, the asymptotic distribution of the chi-square difference, or Likelihood Ratio (LR) statistic, is not that of a central chi-square distributed random variable with one degree of freedom. This boundary problem affects goodness of fit measures like AIC and BIC. As a result, the AIC and BIC should be used heuristically, in conjunction with graphs and other criteria to evaluate estimates from the chosen model.

Abbas Keshvani

# Parametric vs non-Parametric Linear Models (LM)

[Figure 1 panels: histograms of LM estimates of Intercepts and of Gradient; QQ plots of LM estimates of Intercepts and of Gradient]

Figure 1: Gradient appears to follow a normal distribution more than Intercept.

When do we use a parametric model, and when do we use a non-parametric one? In the above example, “Intercept” is one random variable, and “Gradient” is another. I will show you why “Intercept” is better modeled by a non-parametric model, and “Gradient” is better modeled by a parametric one.

In Figure 1, histograms and QQ plots of “Intercept” and “Gradient” show that the latter appears to follow a normal distribution whereas the former does not. As such, a parametric (normal) distribution would not be appropriate for modelling “Intercept”. This leads us to believe that a non-parametric distribution is a better method for estimating “Intercept”.

However, a parametric (normal) distribution might be appropriate for modelling “Gradient”, which appears to follow a normal distribution, according to both its histogram and QQ plot.
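The QQ-plot judgement can be roughly quantified. The sketch below uses simulated stand-ins for the two sets of estimates (a normal sample for “Gradient” and a two-humped sample for “Intercept”; both are my assumptions, not the estimates behind Figure 1) and measures how close each QQ plot is to a straight line through the correlation of its points. The helper `qq_cor` is my own, not a built-in function.

```r
set.seed(99)

# Simulated stand-ins for the LM estimates (NOT the data in Figure 1)
gradient_like  <- rnorm(200, mean = 2, sd = 0.5)                   # normal
intercept_like <- c(rnorm(100, mean = -3), rnorm(100, mean = 3))   # two humps

# Correlation between theoretical and sample quantiles of a normal QQ plot;
# values very close to 1 mean the points lie nearly on a straight line
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)
  cor(q$x, q$y)
}

qq_cor(gradient_like)
qq_cor(intercept_like)
```

The normal sample gives a correlation closer to 1 than the two-humped one, matching what the eye sees in a QQ plot. This is a rough aid rather than a formal test.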

Abbas Keshvani