Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(knitr)
library(scales)
library(reshape)

Load data

load("gss.Rdata")
attach(gss)
set.seed(123)

Part 1: Data

The General Social Survey (GSS) aims to gather the best quality data related to social demographics and opinions about topics of special interest. The way they select participants is the following:

Your household address was randomly selected from a scientific sample designed to represent a cross-section of the country.
The random selection of households from across the United States ensures that the results of the survey are scientifically valid.
All households from across the country had an equal chance of being selected for this survey.

All of these conditions put together ensure that we can generalize any findings to the total population of U.S.

However, the sample might be exposed to non-response bias since a non-random fraction of the randomly chosen subjects might not respond the phone call, such as low-income families.

The data collected is an observational study, it is not an experiment. People are not randomly assigned into a control group and a treatment group, they are all offered the same treatment (questionnaire). Hence, we cannot infere any causality.

Part 2: Research question

Are coastal territories more developed than continentanl territories?

For this research question I will use the variable coninc which is the total family income in constant dollars, and the variable region, which is the region of interview.

The concept of development is very broad, therefore, I will just measure development attending to income. I will try to find out wether there is regional disparity income wise. History tells us that costs offered sea natural resources as well as maritime trading routes; coastal cities could use both maratime and land resources as well as their respective trading routes. I will try to find out if this advantage is still paying off nowadays.

Part 3: Exploratory data analysis

First of all, let’s select and take a look at the first lines of data:

datos <- gss %>%
  select(region,coninc) %>%
  filter(complete.cases(.))

region	coninc
E. Nor. Central	25926
E. Nor. Central	33333
E. Nor. Central	33333
E. Nor. Central	41667
E. Nor. Central	69444
E. Nor. Central	60185

Incomplete cases were not many and since they create noise it is preferable to get rid of them. Data has been split into just two columns: region and family income. It would be desirable to observe how the destributions of family income per region look like; the boxplot is the most appropriated visualization tool for this purpose since the response variable is numeric and the explanatory variable is categorical.

ggplot(data = datos, aes(x = region, y = coninc, color = region)) + 
  geom_boxplot() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("FAMILY INCOME PER REGION") +
  xlab('Region') + 
  ylab('Family Income ($)')

There are a few things to mentioned here. First, family income around $150,000 or more in any of the regions seem to follow an exact pattern. You can count the same number of observations in every region above that threshold and they seem to have the same numerical value; this looks like an error in data collection as well as being obvious outliers so I will get rid of those values. Second, data is right skewed, which is natural since there is a lower bound for the income a family can earn and it’s zero dollars; this might imply some problems later related to normality assumption of data. And last, does this first data exploration show us any information about average family income per region? If we take a quick look at the graph and we compare the median of each distribution (which is a robust measure of centrality; the approriate measure when it comes to represent skewed data) we can observe that New England, Middle Atlantic and Pacific regions have medians superior to the rest of regions. This is enlightening but not conclusive. Let’s take a look at summary statistics now:

region	sample size	median	interquartile range
New England	2281	42215	44298
Middle Atlantic	7006	38414	40400
Pacific	6754	38141	43267
E. Nor. Central	9318	36135	37765
W. Nor. Central	3784	33221	35366
Mountain	3131	33079	35940
South Atlantic	9386	32273	36643
W. Sou. Central	4706	30193	36544
E. Sou. Central	3304	26991	35701

As we expected because of the boxplot, apparently, coastal regions have higher medians (East North Central region has the Great Lakes’ coasts). To formalize what looks like visually apparent, I will run ANOVA on data trying to group regions income wise into different segments; these segments or groups will contains homogeneous regions. I expect to find the following:

A group composed of coastal regions with similar averages and homoscedastic.
A group composed of internal regions with similar averages and homoscedastic.

Why ANOVA? because I want to compare more than two sample means and this can only be achieved with ANOVA. I expect to find two populations that will contain all of the regional samples. The method consists in checking wether the regions involved in ANOVA proceed from a population with the same mean (which could even be the same population). This is achieved by calculating the F statistic which is the ratio of the mean square variability between groups and the mean square variability within groups (also called MSE, mean square error or unexplained mean variability). Then we will compute the probability of having observed a at least as large F ratio given that the means of all groups are equal. To compute this probability, since F-statistic follows a F distribution, we need two parameters which are the degrees of freedom of the groups and the degrees of freedom of the residuals. We do not need to compute any of these since R will do this for us. If this probability is lower than the significance level, then we will reject the possibilty that the means of the populations where the samples come from are equal if we do in fact observe the collected data.

If for some reason ANOVA fails and I need to keep comparing means pair wise, I will use Hypothesis test for the mean difference between two samples, which consists in finding the probability of having observed the sample mean difference given that both samples come from a population with the same mean, and if this probability is bellow our significance level (typically 5%) then there is evidence that it is very unlikely to observe this data under those conditions. Therefore, having observed this data would be due to sampling variability.

Before I can go on, I need to transform data to normalize it, otherwise I will not be able to run an ANOVA test on averages. There are many transformations that we can perform on right skewed data but I chose to take the square root of coninc, since log transformations rescaled data so much that the distributions of family income per region shifted from right skewed to left skewed, therefore, a less aggressive rescaling of data can be achieved by taking the square root. I also changed the origin of data to achieve a minium value of 1.0 in each distribution, because transformations will have the greatest effect if the distribution is anchored at 1.0.

For more information about data transformation, check this article.

sqrt.data <- gss %>%
  select(region,coninc) %>%
  filter(coninc < 140000) %>%
  mutate(coninc, coninc = sqrt(coninc-382))

sqrt.data <- sqrt.data[complete.cases(sqrt.data),]

ggplot(data = sqrt.data, aes(x = region, y = coninc, color = region)) + 
  geom_boxplot() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("RESCALED FAMILY INCOME PER REGION") +
  xlab('Region') + 
  ylab('Sqrt(Family Income ($))')

This looks much better skewness wise; nothing but scaling has changed but the main difference is that now we can compare averages and standard deviations which is what we need to perform ANOVA. This interpretations should not be used to infere any conclusions about income since talking about the square root of income or the standard deviation of the square root of income makes no sense, these calculations are just for the purpose of comparing different data distributions.

region	sample size	mean	standard deviation
New England	2281	205.1212	76.15888
Middle Atlantic	7006	195.5604	75.48633
Pacific	6754	194.6940	76.71136
E. Nor. Central	9318	189.2445	72.19854
Mountain	3131	183.0695	73.11425
W. Nor. Central	3784	182.7367	71.01149
South Atlantic	9386	182.2202	74.24686
W. Sou. Central	4706	175.6166	75.70251
E. Sou. Central	3304	166.3804	72.00616

Everything is set now; I am going to divide the regions into two groups attending to the standard deviation criteria for achieving homoscedastic groups and perform ANOVA on both groups. ANOVA is sensitive to heteroscedastic data, therefore, even if I want to split groups attending to wether they are coastal or interior territories it would fail if they are not homoscedastic, but if we observe the first three entries of the table, New England, Middle Atlantic and Pacific have very similar standard deviation. Therefore, my division of groups will consist of a group with those three regions and a group with the rest of regions which also seem to have roughly the same standard deviation.

Part 4: Inference

Before checking the conditions for performing ANOVA the data must be split into the two groups of interest.

group1 <- sqrt.data %>%
  filter(region %in% c('Pacific','New England','Middle Atlantic'))

group2 <- sqrt.data %>%
  filter(!region %in% c('Pacific','New England','Middle Atlantic')) 

stats1 <- summary.sqrt.stats %>% filter(region %in% c('Pacific','New England','Middle Atlantic'))
stats2 <- summary.sqrt.stats %>% filter(!region %in% c('Pacific','New England','Middle Atlantic'))

ggplot(data = group1, aes(x = region, y = coninc, color = region)) + 
  geom_boxplot() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("GROUP 1") +
  xlab('Region') + 
  ylab('Sqrt(Family Income ($))')

region	sample size	mean	standard deviation
New England	2281	205.1212	76.15888
Middle Atlantic	7006	195.5604	75.48633
Pacific	6754	194.6940	76.71136

ggplot(data = group2, aes(x = region, y = coninc, color = region)) + 
  geom_boxplot() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("GROUP 2") +
  xlab('Region') + 
  ylab('Sqrt(Family Income ($))')

region	sample size	mean	standard deviation
E. Nor. Central	9318	189.2445	72.19854
Mountain	3131	183.0695	73.11425
W. Nor. Central	3784	182.7367	71.01149
South Atlantic	9386	182.2202	74.24686
W. Sou. Central	4706	175.6166	75.70251
E. Sou. Central	3304	166.3804	72.00616

Conditions for ANOVA (I): Independency

Within: sample observations must be independent from each other.

Random sampling has been used to obtain data.
Each sample size is less than 10% of the total U.S. population

Between: groups must be independent from each other.

Observations have been obtained from individuals belonging to one and only one region among the nine possible ones in just one period of time which makes the data unpaired. Therefore, there is independency between groups.

Conditions for ANOVA (II): Normality

Approximate normality: distributions should be nearly normal within each group.

normal.dist <- rnorm(n = 2281, mean = 205.1212, sd = 76.15888) %>% as.data.frame() 

# This is the theoretical normal distribution of New England data which has as parameters the ones shown in the previous table, but we could have used the standard normal distribution too which yields the same outcome, it just changes scaling.    

New.England <- sqrt.data %>%
  filter(region %in% 'New England') %>%
  select(coninc) %>%
  as.data.frame()

data <- data.frame(normal.distribution = normal.dist$., New.England = New.England$coninc)
data.merged <- melt(data)

# Everything is transformed to data frame since ggplot2 handles data frame objects and not numeric class type.

ggplot(data = normal.dist, aes(sample = New.England)) + 
  stat_qq(geom="line", size=0.5) +
  geom_abline(intercept = mean(New.England$coninc), slope = sd(New.England$coninc)) +
  ggtitle('QQ PLOT') +
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = data.merged, aes(x = variable, y = value, color = variable)) + 
  geom_boxplot() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  ggtitle("SAMPLE VS THEORETICAL DATA") +
  xlab('Box Plots') + 
  ylab('Sqrt(Family Income ($))')

We have a light-tailed sample distribution. The QQ plot shows that right extreme values are not as big as expected (they are below the qq line instead of on the qq line), and left extreme values are not as low as expected (they remain above the qqline instead of on the qqline). The boxplot comparison confirms this analysis; extreme values are closer to the interquantile frontier values in the sample data distribution than in its theoretical normal distribution. However, the one-way ANOVA is considered a robust test when it comes to violating the approximately normal assumption; as long as the distribution is not platykurtic and the sample size is not small, we can go ahead using this data. A lack of independence would be the most serious assumption to fail; failing in assesing normality would imply a type I error, but since my success would be to accept the null hypothesis and not to reject it, the cost of type I error is not too high. Worst case scenario is I will not be able to segmentate properly the nine regions into groups because I will have more splits than expected due to this type I error.

When distributions have fat tails, it is easier to not reject the null hypothesis (that is the reason to use a t-student over a normal distribution to compute confidence intervals, they are more conservative when things are uncertain), because the quantile that holds 5% of the right-end tail area is a lower value in a platykurtic distribution than in a mesokurtic distribution. Therefore, the opposite happens if the distribution is leptokurtic; we might incur in type I error since the quantile that holds 5% of the right-end tail area is a higher value in a leptokurtic distribution than in a normal distribution, hence, it is easier to reject the null hypothesis of all samples belonging to the same population.

The rest of regional data follow similar QQ plots; showing 18 plots, discussing them and obtaining the same conclussions as the first one would make following this project very dull, therefore, I encourage the lector to take for granted that a similar analysis and conclusions would follow the rest of regional distributions.

Conditions for ANOVA (III): Equal Variance

Equal variance: variability should be roughly equal among groups, specially important when sample sizes differ between groups.

There are many tests for homoscedasticity but since they are not covered in this course, I will asses equality of variances by observing and comparing the standard deviations previously calculated in the summary statics:

region	sample size	mean	standard deviation
New England	2281	205.1212	76.15888
Middle Atlantic	7006	195.5604	75.48633
Pacific	6754	194.6940	76.71136

region	sample size	mean	standard deviation
E. Nor. Central	9318	189.2445	72.19854
Mountain	3131	183.0695	73.11425
W. Nor. Central	3784	182.7367	71.01149
South Atlantic	9386	182.2202	74.24686
W. Sou. Central	4706	175.6166	75.70251
E. Sou. Central	3304	166.3804	72.00616

The maximum difference between standard deviations is 6.6% in group two; I believe that when it comes to samples who are as big as 9000 observations and as small as 2000 observations, we can consider a difference of 6.6% in standard deviation to not be high enough to assume that data is homoscedastic between regions in both groups.

Note that the conditions for ANOVA also apply to further pair wise testing.

Now that the conditions have been stated and almost-fully fulfilled with some considerations, it is time to define the hypothesis test:

\[ \begin{aligned} \\ &H_0:\mu_1=\mu_2=\ ...\ =\mu_n \\ &H_1: any\ of\ them\ is\ different \\\ \end{aligned} \] These conditions would need to be fulfilled withing group 1 and group 2 separately. Let’s start with group 1:

\[ \begin{aligned} \\ &H_0:\mu_{Pacific}=\mu_{Middle\ Atlantic}=\mu_{New\ England} \\ &H_1: any\ of\ them\ is\ different \\\ \end{aligned} \]

anova1 <- summary(aov(coninc~region, data = group1))

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
region	2	197701	98850.48	17.0691130655386	3.93420354472118e-08
Residuals	16038	92879111	5791.19	-	-

The probability of observing an F ratio at least as large as roughly 17 given that the samples come from populations with the same mean is very very low.

\[ \begin{aligned} \\ P(F\ge17.07\ |\ \mu_{Pacific}=\mu_{Middle\ Atlantic}=\mu_{New\ England})<\alpha=0.05 \\\\ \end{aligned} \]

We have to reject the Null Hypothesis of all region samples within group 1 coming from populations with the same mean and try pair wise mean comparisons. Since rejecting the Null Hypothesis means that at least one sample comes from a population with a different mean, this does not imply that they all come from populations with different means. ANOVA is an omnibus test, meaning that it contains many items within that test and it does not tell you which one makes things heterogeneous.

New England has an abnormally high average, so I will try comparing the other two means now:

group1.1 <- group1 %>%
  filter(!region %in% 'New England')

\[ \begin{aligned} \\ &H_0:\mu_{Pacific}=\mu_{Middle\ Atlantic} \\ &H_1:\mu_{Pacific}\neq\mu_{Middle\ Atlantic} \\ \\\ \end{aligned} \]

inference(y = coninc, x = region, data = group1.1, statistic = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_Middle Atlantic = 7006, y_bar_Middle Atlantic = 195.5604, s_Middle Atlantic = 75.4863
## n_Pacific = 6754, y_bar_Pacific = 194.694, s_Pacific = 76.7114
## H0: mu_Middle Atlantic =  mu_Pacific
## HA: mu_Middle Atlantic != mu_Pacific
## t = 0.6676, df = 6753
## p_value = 0.5044

The probability of both samples coming from populations with with the same mean (and same variance since it is a condition for performing ANOVA) is roughly 0.5, this implies that these two territories can be grouped into the same segment when splitting income earned by a family region wise.

\[ \begin{aligned} \\ P(\bar{x}_{middle\ atlantic}\ -\ \bar{x}_{pacific}\ge 0.8664\ |\ \mu_{Pacific}=\mu_{Middle\ Atlantic})>\alpha=0.05 \\\\ \end{aligned} \]

Hence we can accpet the Null Hypothesis; both region samples come from populations with the same mean.

Let’s also compute a confidence interval and see if both the hypothesis test and the confidence interval for the mean difference yield congruent outcomes:

inference(x = region, y = coninc, data = group1.1, statistic = "mean", type = "ci", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Middle Atlantic = 7006, y_bar_Middle Atlantic = 195.5604, s_Middle Atlantic = 75.4863
## n_Pacific = 6754, y_bar_Pacific = 194.694, s_Pacific = 76.7114
## 95% CI (Middle Atlantic - Pacific): (-1.6779 , 3.4108)

The observed sample mean of the difference is included in this 95% confidence interval:

\[ \begin{aligned} \\ \bar{x}_{middle\ atlantic}\ -\ \bar{x}_{pacific}=\ 0.8664\subset(-1.6779 , 3.4108) \\\ \end{aligned} \]

95% of random samples of 7006 Americans from Middle Atlantic and 6754 Americans from Pacific will yield confidence intervals that contain the true mean value for the family income difference between these two regions ,which happens to be zero and this value has been ratified by the previous hypothesis test that accepted the Null Hypothesis of the value being zero for a 5% significance level. Another way of interpreting the confidence interval is, we are 95% confident that Americans on average earn 338.17 extra dollars if they live in Middle Atlantic relative to Pacific region. For this calculations I had to reverse the square root transformation that I used initially to normalize data.

I could repeat the same analysis over and over again with group 2 which is even bigger than group 1, so if the lecturer feels satisfied with my work up until now, allow me to jump to conclusions and take for granted that the rest of the analysis followed the same procedures.

After repeating ANOVA with group 2 and pair wise tests, I got the following segments:

region	rank
New England	1
Middle Atlantic	2
Pacific	2
E. Nor. Central	3
Mountain	4
W. Nor. Central	4
South Atlantic	4
W. Sou. Central	5
E. Sou. Central	6

The regions sharing rank number is due to either the Null Hypothesis was accepted in an ANOVA test or in a pair wise test. I cannot embed a shiny app to a static html file report, but I will attach a screenshot of my shiny app that shows how these ranks look like in a US map:

The lighter the red colors are, the better rank they have. Can we conclude that coastal territories perform better than continental territories? we cannot due to Mountain region scoring better than West South Central which has the Atlantic coast. What could I have missed out? probably a second determining factor: latitud. This map reminds me a bit of Europe with northern countries performing better than southern countries. Probably the income map could be better described with a two factor (coast, latitud) map where northern coastal territories are the best to perform (New England) and southern territories with little or no coast (East South Cnetral) might be the worst to perform. Analogously, this pattern seems to be repeated in Europe where Scandinavian countries perform the best; they have coast and are located in the north, and on the other hand, the Balkan countries which many of them have no coast and are located in the south, perform the worst.