library(ggplot2)
library(dplyr)
library(statsr)
library(knitr)
library(scales)
library(reshape)load("gss.Rdata")
attach(gss)
set.seed(123)The General Social Survey (GSS) aims to gather the best quality data related to social demographics and opinions about topics of special interest. The way they select participants is the following:
All of these conditions put together ensure that we can generalize any findings to the total population of U.S.
However, the sample might be exposed to non-response bias since a non-random fraction of the randomly chosen subjects might not respond the phone call, such as low-income families.
The data collected is an observational study, it is not an experiment. People are not randomly assigned into a control group and a treatment group, they are all offered the same treatment (questionnaire). Hence, we cannot infere any causality.
Are coastal territories more developed than continentanl territories?
For this research question I will use the variable coninc which is the total family income in constant dollars, and the variable region, which is the region of interview.
The concept of development is very broad, therefore, I will just measure development attending to income. I will try to find out wether there is regional disparity income wise. History tells us that costs offered sea natural resources as well as maritime trading routes; coastal cities could use both maratime and land resources as well as their respective trading routes. I will try to find out if this advantage is still paying off nowadays.
First of all, let’s select and take a look at the first lines of data:
datos <- gss %>%
select(region,coninc) %>%
filter(complete.cases(.))| region | coninc |
|---|---|
| E. Nor. Central | 25926 |
| E. Nor. Central | 33333 |
| E. Nor. Central | 33333 |
| E. Nor. Central | 41667 |
| E. Nor. Central | 69444 |
| E. Nor. Central | 60185 |
Incomplete cases were not many and since they create noise it is preferable to get rid of them. Data has been split into just two columns: region and family income. It would be desirable to observe how the destributions of family income per region look like; the boxplot is the most appropriated visualization tool for this purpose since the response variable is numeric and the explanatory variable is categorical.
ggplot(data = datos, aes(x = region, y = coninc, color = region)) +
geom_boxplot() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ggtitle("FAMILY INCOME PER REGION") +
xlab('Region') +
ylab('Family Income ($)')There are a few things to mentioned here. First, family income around $150,000 or more in any of the regions seem to follow an exact pattern. You can count the same number of observations in every region above that threshold and they seem to have the same numerical value; this looks like an error in data collection as well as being obvious outliers so I will get rid of those values. Second, data is right skewed, which is natural since there is a lower bound for the income a family can earn and it’s zero dollars; this might imply some problems later related to normality assumption of data. And last, does this first data exploration show us any information about average family income per region? If we take a quick look at the graph and we compare the median of each distribution (which is a robust measure of centrality; the approriate measure when it comes to represent skewed data) we can observe that New England, Middle Atlantic and Pacific regions have medians superior to the rest of regions. This is enlightening but not conclusive. Let’s take a look at summary statistics now:
| region | sample size | median | interquartile range |
|---|---|---|---|
| New England | 2281 | 42215 | 44298 |
| Middle Atlantic | 7006 | 38414 | 40400 |
| Pacific | 6754 | 38141 | 43267 |
| E. Nor. Central | 9318 | 36135 | 37765 |
| W. Nor. Central | 3784 | 33221 | 35366 |
| Mountain | 3131 | 33079 | 35940 |
| South Atlantic | 9386 | 32273 | 36643 |
| W. Sou. Central | 4706 | 30193 | 36544 |
| E. Sou. Central | 3304 | 26991 | 35701 |
As we expected because of the boxplot, apparently, coastal regions have higher medians (East North Central region has the Great Lakes’ coasts). To formalize what looks like visually apparent, I will run ANOVA on data trying to group regions income wise into different segments; these segments or groups will contains homogeneous regions. I expect to find the following:
Why ANOVA? because I want to compare more than two sample means and this can only be achieved with ANOVA. I expect to find two populations that will contain all of the regional samples. The method consists in checking wether the regions involved in ANOVA proceed from a population with the same mean (which could even be the same population). This is achieved by calculating the F statistic which is the ratio of the mean square variability between groups and the mean square variability within groups (also called MSE, mean square error or unexplained mean variability). Then we will compute the probability of having observed a at least as large F ratio given that the means of all groups are equal. To compute this probability, since F-statistic follows a F distribution, we need two parameters which are the degrees of freedom of the groups and the degrees of freedom of the residuals. We do not need to compute any of these since R will do this for us. If this probability is lower than the significance level, then we will reject the possibilty that the means of the populations where the samples come from are equal if we do in fact observe the collected data.
If for some reason ANOVA fails and I need to keep comparing means pair wise, I will use Hypothesis test for the mean difference between two samples, which consists in finding the probability of having observed the sample mean difference given that both samples come from a population with the same mean, and if this probability is bellow our significance level (typically 5%) then there is evidence that it is very unlikely to observe this data under those conditions. Therefore, having observed this data would be due to sampling variability.
Before I can go on, I need to transform data to normalize it, otherwise I will not be able to run an ANOVA test on averages. There are many transformations that we can perform on right skewed data but I chose to take the square root of coninc, since log transformations rescaled data so much that the distributions of family income per region shifted from right skewed to left skewed, therefore, a less aggressive rescaling of data can be achieved by taking the square root. I also changed the origin of data to achieve a minium value of 1.0 in each distribution, because transformations will have the greatest effect if the distribution is anchored at 1.0.
For more information about data transformation, check this article.
sqrt.data <- gss %>%
select(region,coninc) %>%
filter(coninc < 140000) %>%
mutate(coninc, coninc = sqrt(coninc-382))
sqrt.data <- sqrt.data[complete.cases(sqrt.data),]ggplot(data = sqrt.data, aes(x = region, y = coninc, color = region)) +
geom_boxplot() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ggtitle("RESCALED FAMILY INCOME PER REGION") +
xlab('Region') +
ylab('Sqrt(Family Income ($))')This looks much better skewness wise; nothing but scaling has changed but the main difference is that now we can compare averages and standard deviations which is what we need to perform ANOVA. This interpretations should not be used to infere any conclusions about income since talking about the square root of income or the standard deviation of the square root of income makes no sense, these calculations are just for the purpose of comparing different data distributions.
| region | sample size | mean | standard deviation |
|---|---|---|---|
| New England | 2281 | 205.1212 | 76.15888 |
| Middle Atlantic | 7006 | 195.5604 | 75.48633 |
| Pacific | 6754 | 194.6940 | 76.71136 |
| E. Nor. Central | 9318 | 189.2445 | 72.19854 |
| Mountain | 3131 | 183.0695 | 73.11425 |
| W. Nor. Central | 3784 | 182.7367 | 71.01149 |
| South Atlantic | 9386 | 182.2202 | 74.24686 |
| W. Sou. Central | 4706 | 175.6166 | 75.70251 |
| E. Sou. Central | 3304 | 166.3804 | 72.00616 |
Everything is set now; I am going to divide the regions into two groups attending to the standard deviation criteria for achieving homoscedastic groups and perform ANOVA on both groups. ANOVA is sensitive to heteroscedastic data, therefore, even if I want to split groups attending to wether they are coastal or interior territories it would fail if they are not homoscedastic, but if we observe the first three entries of the table, New England, Middle Atlantic and Pacific have very similar standard deviation. Therefore, my division of groups will consist of a group with those three regions and a group with the rest of regions which also seem to have roughly the same standard deviation.
Before checking the conditions for performing ANOVA the data must be split into the two groups of interest.
group1 <- sqrt.data %>%
filter(region %in% c('Pacific','New England','Middle Atlantic'))
group2 <- sqrt.data %>%
filter(!region %in% c('Pacific','New England','Middle Atlantic'))
stats1 <- summary.sqrt.stats %>% filter(region %in% c('Pacific','New England','Middle Atlantic'))
stats2 <- summary.sqrt.stats %>% filter(!region %in% c('Pacific','New England','Middle Atlantic')) ggplot(data = group1, aes(x = region, y = coninc, color = region)) +
geom_boxplot() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ggtitle("GROUP 1") +
xlab('Region') +
ylab('Sqrt(Family Income ($))')| region | sample size | mean | standard deviation |
|---|---|---|---|
| New England | 2281 | 205.1212 | 76.15888 |
| Middle Atlantic | 7006 | 195.5604 | 75.48633 |
| Pacific | 6754 | 194.6940 | 76.71136 |
ggplot(data = group2, aes(x = region, y = coninc, color = region)) +
geom_boxplot() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ggtitle("GROUP 2") +
xlab('Region') +
ylab('Sqrt(Family Income ($))')| region | sample size | mean | standard deviation |
|---|---|---|---|
| E. Nor. Central | 9318 | 189.2445 | 72.19854 |
| Mountain | 3131 | 183.0695 | 73.11425 |
| W. Nor. Central | 3784 | 182.7367 | 71.01149 |
| South Atlantic | 9386 | 182.2202 | 74.24686 |
| W. Sou. Central | 4706 | 175.6166 | 75.70251 |
| E. Sou. Central | 3304 | 166.3804 | 72.00616 |
Conditions for ANOVA (I): Independency
Within: sample observations must be independent from each other.
Between: groups must be independent from each other.
Conditions for ANOVA (II): Normality
Approximate normality: distributions should be nearly normal within each group.
normal.dist <- rnorm(n = 2281, mean = 205.1212, sd = 76.15888) %>% as.data.frame()
# This is the theoretical normal distribution of New England data which has as parameters the ones shown in the previous table, but we could have used the standard normal distribution too which yields the same outcome, it just changes scaling.
New.England <- sqrt.data %>%
filter(region %in% 'New England') %>%
select(coninc) %>%
as.data.frame()
data <- data.frame(normal.distribution = normal.dist$., New.England = New.England$coninc)
data.merged <- melt(data)
# Everything is transformed to data frame since ggplot2 handles data frame objects and not numeric class type.
ggplot(data = normal.dist, aes(sample = New.England)) +
stat_qq(geom="line", size=0.5) +
geom_abline(intercept = mean(New.England$coninc), slope = sd(New.England$coninc)) +
ggtitle('QQ PLOT') +
theme(plot.title = element_text(hjust = 0.5))ggplot(data = data.merged, aes(x = variable, y = value, color = variable)) +
geom_boxplot() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)) +
ggtitle("SAMPLE VS THEORETICAL DATA") +
xlab('Box Plots') +
ylab('Sqrt(Family Income ($))')We have a light-tailed sample distribution. The QQ plot shows that right extreme values are not as big as expected (they are below the qq line instead of on the qq line), and left extreme values are not as low as expected (they remain above the qqline instead of on the qqline). The boxplot comparison confirms this analysis; extreme values are closer to the interquantile frontier values in the sample data distribution than in its theoretical normal distribution. However, the one-way ANOVA is considered a robust test when it comes to violating the approximately normal assumption; as long as the distribution is not platykurtic and the sample size is not small, we can go ahead using this data. A lack of independence would be the most serious assumption to fail; failing in assesing normality would imply a type I error, but since my success would be to accept the null hypothesis and not to reject it, the cost of type I error is not too high. Worst case scenario is I will not be able to segmentate properly the nine regions into groups because I will have more splits than expected due to this type I error.
When distributions have fat tails, it is easier to not reject the null hypothesis (that is the reason to use a t-student over a normal distribution to compute confidence intervals, they are more conservative when things are uncertain), because the quantile that holds 5% of the right-end tail area is a lower value in a platykurtic distribution than in a mesokurtic distribution. Therefore, the opposite happens if the distribution is leptokurtic; we might incur in type I error since the quantile that holds 5% of the right-end tail area is a higher value in a leptokurtic distribution than in a normal distribution, hence, it is easier to reject the null hypothesis of all samples belonging to the same population.
The rest of regional data follow similar QQ plots; showing 18 plots, discussing them and obtaining the same conclussions as the first one would make following this project very dull, therefore, I encourage the lector to take for granted that a similar analysis and conclusions would follow the rest of regional distributions.
Conditions for ANOVA (III): Equal Variance
Equal variance: variability should be roughly equal among groups, specially important when sample sizes differ between groups.
| region | sample size | mean | standard deviation |
|---|---|---|---|
| New England | 2281 | 205.1212 | 76.15888 |
| Middle Atlantic | 7006 | 195.5604 | 75.48633 |
| Pacific | 6754 | 194.6940 | 76.71136 |
| region | sample size | mean | standard deviation |
|---|---|---|---|
| E. Nor. Central | 9318 | 189.2445 | 72.19854 |
| Mountain | 3131 | 183.0695 | 73.11425 |
| W. Nor. Central | 3784 | 182.7367 | 71.01149 |
| South Atlantic | 9386 | 182.2202 | 74.24686 |
| W. Sou. Central | 4706 | 175.6166 | 75.70251 |
| E. Sou. Central | 3304 | 166.3804 | 72.00616 |
The maximum difference between standard deviations is 6.6% in group two; I believe that when it comes to samples who are as big as 9000 observations and as small as 2000 observations, we can consider a difference of 6.6% in standard deviation to not be high enough to assume that data is homoscedastic between regions in both groups.
Note that the conditions for ANOVA also apply to further pair wise testing.
Now that the conditions have been stated and almost-fully fulfilled with some considerations, it is time to define the hypothesis test:
\[ \begin{aligned} \\ &H_0:\mu_1=\mu_2=\ ...\ =\mu_n \\ &H_1: any\ of\ them\ is\ different \\\ \end{aligned} \] These conditions would need to be fulfilled withing group 1 and group 2 separately. Let’s start with group 1:
\[ \begin{aligned} \\ &H_0:\mu_{Pacific}=\mu_{Middle\ Atlantic}=\mu_{New\ England} \\ &H_1: any\ of\ them\ is\ different \\\ \end{aligned} \]
anova1 <- summary(aov(coninc~region, data = group1))| Df | Sum Sq | Mean Sq | F value | Pr(>F) | |
|---|---|---|---|---|---|
| region | 2 | 197701 | 98850.48 | 17.0691130655386 | 3.93420354472118e-08 |
| Residuals | 16038 | 92879111 | 5791.19 | - | - |
The probability of observing an F ratio at least as large as roughly 17 given that the samples come from populations with the same mean is very very low.
\[ \begin{aligned} \\ P(F\ge17.07\ |\ \mu_{Pacific}=\mu_{Middle\ Atlantic}=\mu_{New\ England})<\alpha=0.05 \\\\ \end{aligned} \]
We have to reject the Null Hypothesis of all region samples within group 1 coming from populations with the same mean and try pair wise mean comparisons. Since rejecting the Null Hypothesis means that at least one sample comes from a population with a different mean, this does not imply that they all come from populations with different means. ANOVA is an omnibus test, meaning that it contains many items within that test and it does not tell you which one makes things heterogeneous.
New England has an abnormally high average, so I will try comparing the other two means now:
group1.1 <- group1 %>%
filter(!region %in% 'New England')\[ \begin{aligned} \\ &H_0:\mu_{Pacific}=\mu_{Middle\ Atlantic} \\ &H_1:\mu_{Pacific}\neq\mu_{Middle\ Atlantic} \\ \\\ \end{aligned} \]
inference(y = coninc, x = region, data = group1.1, statistic = "mean", type = "ht", null = 0,
alternative = "twosided", method = "theoretical")## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Middle Atlantic = 7006, y_bar_Middle Atlantic = 195.5604, s_Middle Atlantic = 75.4863
## n_Pacific = 6754, y_bar_Pacific = 194.694, s_Pacific = 76.7114
## H0: mu_Middle Atlantic = mu_Pacific
## HA: mu_Middle Atlantic != mu_Pacific
## t = 0.6676, df = 6753
## p_value = 0.5044
The probability of both samples coming from populations with with the same mean (and same variance since it is a condition for performing ANOVA) is roughly 0.5, this implies that these two territories can be grouped into the same segment when splitting income earned by a family region wise.
\[ \begin{aligned} \\ P(\bar{x}_{middle\ atlantic}\ -\ \bar{x}_{pacific}\ge 0.8664\ |\ \mu_{Pacific}=\mu_{Middle\ Atlantic})>\alpha=0.05 \\\\ \end{aligned} \]
Hence we can accpet the Null Hypothesis; both region samples come from populations with the same mean.
Let’s also compute a confidence interval and see if both the hypothesis test and the confidence interval for the mean difference yield congruent outcomes:
inference(x = region, y = coninc, data = group1.1, statistic = "mean", type = "ci", method = "theoretical")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Middle Atlantic = 7006, y_bar_Middle Atlantic = 195.5604, s_Middle Atlantic = 75.4863
## n_Pacific = 6754, y_bar_Pacific = 194.694, s_Pacific = 76.7114
## 95% CI (Middle Atlantic - Pacific): (-1.6779 , 3.4108)
The observed sample mean of the difference is included in this 95% confidence interval:
\[ \begin{aligned} \\ \bar{x}_{middle\ atlantic}\ -\ \bar{x}_{pacific}=\ 0.8664\subset(-1.6779 , 3.4108) \\\ \end{aligned} \]
95% of random samples of 7006 Americans from Middle Atlantic and 6754 Americans from Pacific will yield confidence intervals that contain the true mean value for the family income difference between these two regions ,which happens to be zero and this value has been ratified by the previous hypothesis test that accepted the Null Hypothesis of the value being zero for a 5% significance level. Another way of interpreting the confidence interval is, we are 95% confident that Americans on average earn 338.17 extra dollars if they live in Middle Atlantic relative to Pacific region. For this calculations I had to reverse the square root transformation that I used initially to normalize data.
I could repeat the same analysis over and over again with group 2 which is even bigger than group 1, so if the lecturer feels satisfied with my work up until now, allow me to jump to conclusions and take for granted that the rest of the analysis followed the same procedures.
After repeating ANOVA with group 2 and pair wise tests, I got the following segments:
| region | rank |
|---|---|
| New England | 1 |
| Middle Atlantic | 2 |
| Pacific | 2 |
| E. Nor. Central | 3 |
| Mountain | 4 |
| W. Nor. Central | 4 |
| South Atlantic | 4 |
| W. Sou. Central | 5 |
| E. Sou. Central | 6 |
The regions sharing rank number is due to either the Null Hypothesis was accepted in an ANOVA test or in a pair wise test. I cannot embed a shiny app to a static html file report, but I will attach a screenshot of my shiny app that shows how these ranks look like in a US map:
The lighter the red colors are, the better rank they have. Can we conclude that coastal territories perform better than continental territories? we cannot due to Mountain region scoring better than West South Central which has the Atlantic coast. What could I have missed out? probably a second determining factor: latitud. This map reminds me a bit of Europe with northern countries performing better than southern countries. Probably the income map could be better described with a two factor (coast, latitud) map where northern coastal territories are the best to perform (New England) and southern territories with little or no coast (East South Cnetral) might be the worst to perform. Analogously, this pattern seems to be repeated in Europe where Scandinavian countries perform the best; they have coast and are located in the north, and on the other hand, the Balkan countries which many of them have no coast and are located in the south, perform the worst.