This is part of the data analysis project for the course “Data analysis and statistical inference”. I will be working with data from the General Social Survey (GSS) [1], which tracks societal change and the growing complexity of American society. Political philosophy is a hot topic, and it is of interest to explore whether a person’s political views differ with the highest year of school completed. For instance, a person with more years of education might tend to be more open minded and lean towards moderate or liberal political philosophies. These are speculations; a proper data analysis will show what relationship, if any, exists between the variables of interest. My research question is: “Is there a direct relation between the level of education and one’s political views in the year 2012?”
The survey comprises the responses of non-institutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States, collected over the period from 1972 to 2012. The data were collected through computer-assisted personal interviews (CAPI), face-to-face interviews and telephone interviews. The cases, i.e., the units of observation, are the 57061 individuals who responded to the survey.
Since the set of respondents varies from year to year and the survey methodology has been modified over the survey period, I will confine my data to the year 2012. The variables of interest in this study are:
educ: Highest year of school completed. This is a discrete numerical variable.
polviews: Whether the respondent thinks of themselves as liberal or conservative. This is a categorical variable with levels “Liberal”, “Moderate” and “Conservative”. The original seven levels have been combined into these three using the levels() function, to suit the problem.
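As a minimal sketch of how such a merge can be done, assigning a named list to levels() collapses several old levels into one new level. The seven level labels and the toy vector below are assumptions made for illustration; the exact labels in the GSS file may differ.

```r
# Seven hypothetical level labels (assumed for illustration, not read from the GSS file)
seven <- c("Extremely Liberal", "Liberal", "Slightly Liberal", "Moderate",
           "Slightly Conservative", "Conservative", "Extremely Conservative")

# Toy factor standing in for the raw polviews column
pv <- factor(c("Liberal", "Moderate", "Slightly Conservative", "Extremely Liberal"),
             levels = seven)

# A named list maps each new level to the old levels it absorbs
levels(pv) <- list(Liberal      = seven[1:3],
                   Moderate     = seven[4],
                   Conservative = seven[5:7])

print(table(pv))
```

The order of the list entries also fixes the order of the new levels, which keeps later tables and plots in the Liberal, Moderate, Conservative order.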
The cases for this study are the respondents in the survey year 2012, of which there are 1974 (starting with caseid 55088). The 2012 methodology consists of sampling the English- and Spanish-speaking population, non-respondent sub-sampling and a biennial double-sample design. The idea behind restricting the analysis to 2012 is to reduce the extent of bias in the sample.
This investigation is an observational study because, as mentioned earlier, the data were collected by means of a survey that uses random (probability) sampling, and no random assignment to treatment and control groups was involved, as an experimental design would require.
Since the data come from a random sample, the conclusions can be generalized to the population of interest, which in this case is English- and Spanish-speaking adults living in the United States in the year 2012. As discussed earlier, the lack of an experimental design prevents one from deducing causal links between the variables of interest using this data.
The potential sources of bias are convenience bias, voluntary response bias and non-response bias. As described briefly before, the sampling methodology used in 2012 tries to minimize sampling bias. Together with the large sample size, this makes it reasonable to assume that the findings are generalizable to the population of interest.
The proportion table below illustrates the share of NA responses in the data. The proportion of non-response (NA) comes out to be about 5%.
polviews
Liberal Moderate Conservative <NA>
0.27001013 0.36119554 0.31813576 0.05065856
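A table of this kind can be produced with table() and prop.table(). The sketch below rebuilds the column from group counts that are inferred here from the printed proportions and the 1974 respondents (533, 713, 628 and 100 NA); the counts are a reconstruction, not values read from the data file.

```r
# Counts inferred from the printed proportions times 1974 respondents (an assumption)
counts <- c(Liberal = 533, Moderate = 713, Conservative = 628)
n_missing <- 100

# Rebuild a polviews-like factor, including the missing responses
polviews <- factor(c(rep(names(counts), counts), rep(NA, n_missing)),
                   levels = names(counts))

# useNA = "ifany" keeps the NA category so that its share is reported too
props <- prop.table(table(polviews, useNA = "ifany"))
print(round(props, 4))
```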
The NA responses have been removed for the analysis, and the cleaned subsets of the variables of interest are defined using the following code:
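# Drop respondents with no recorded political view
gss.clean <- gss[!is.na(gss$polviews), ]
# Keep only the 2012 responses for each variable of interest
Educ <- subset(gss.clean[, "educ"], gss.clean$year == 2012)
Polviews <- subset(gss.clean[, "polviews"], gss.clean$year == 2012)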
A portion of the final data set used for the purpose of analysis is displayed in the appendix.
Before moving on to the inference section, we have to find out whether something interesting is going on in the data at hand. For this purpose, brief summary statistics followed by some visualizations of the data are presented in this section. The histogram for the Educ variable is shown in Fig 1:
A look at the histogram suggests that the Educ data is slightly right skewed. Nonetheless, we will make sure that the conditions required for the chosen test are satisfied before performing the inference. The summary statistics and the side by side bar plot of the number of years of education for the three levels of political views (Liberal, Moderate and Conservative) are shown below:
Polviews: Liberal
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.0 12.0 14.0 14.2 16.0 20.0 1
--------------------------------------------------------
Polviews: Moderate
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 12.00 13.00 13.31 15.00 20.00
--------------------------------------------------------
Polviews: Conservative
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 12.00 14.00 13.61 16.00 20.00
The means for the liberal, moderate and conservative groups are 14.2, 13.31 and 13.61 respectively. From these results we can hypothesize that the level of education varies across the three categories of political philosophy. To verify this, we have to carry out inference with an appropriate method, which is discussed in detail in the next section.
Group | n | mean | sd |
---|---|---|---|
Liberal | 532 | 14.199 | 3.330 |
Moderate | 713 | 13.310 | 2.802 |
Conservative | 628 | 13.613 | 2.981 |
Overall | 1873 | 13.664 | 3.039 |
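Per-group counts, means and standard deviations of this kind can be computed with table() and tapply(). As a small sketch, the code below uses only the 20 rows shown in the appendix, so its numbers differ from the full-sample table; the object names are illustrative.

```r
# The 20 appendix rows (caseid 55088 to 55108), typed in by hand
sample_rows <- data.frame(
  educ = c(16, 12, 12, 13, 16, 19, 15, 11, 9, 17,
           10, 16, 12, 4, 13, 12, 13, 12, 12, 0),
  polviews = factor(
    c("Moderate", "Conservative", "Conservative", "Conservative", "Liberal",
      "Moderate", "Moderate", "Moderate", "Conservative", "Liberal",
      "Liberal", "Liberal", "Moderate", "Conservative", "Liberal",
      "Moderate", "Conservative", "Moderate", "Moderate", "Conservative"),
    levels = c("Liberal", "Moderate", "Conservative"))
)

# One row per group: size, mean and standard deviation of educ
group_stats <- data.frame(
  n    = as.vector(table(sample_rows$polviews)),
  mean = as.vector(tapply(sample_rows$educ, sample_rows$polviews, mean)),
  sd   = as.vector(tapply(sample_rows$educ, sample_rows$polviews, sd)),
  row.names = levels(sample_rows$polviews)
)
print(round(group_stats, 3))
```

With the full Educ and Polviews vectors in place of the 20-row sample, the same calls would be expected to give the table above, provided the one missing educ value is dropped first (for example with na.rm = TRUE).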
The data we are dealing with comprise a numerical variable and a categorical variable with more than two levels. To compare the means between these groups we will use the analysis of variance (ANOVA) test. If the ANOVA test provides convincing evidence that at least one pair of population means differ from each other, we will follow up with pairwise comparisons between the groups using T statistics.
Null Hypothesis; \(H_0\): The mean outcome is the same across all categories, i.e., the average level of education is the same for the populations with liberal, moderate and conservative political views.
Alternative Hypothesis; \(H_A\): At least one pair of means differ from each other, i.e., the average level of education is not the same for the populations with liberal, moderate and conservative political views.
Test for independence: As discussed previously, the data were collected by means of random sampling, and the sample size for this study is 1874 after removing the NA responses for polviews (1873 once the one respondent with a missing educ value is excluded), which is definitely less than 10% of the population of interest. We can also safely assume that the groups (Liberal, Moderate and Conservative) are independent of each other. Together, these points support treating the sample observations as independent of each other.
Condition of normality: The distribution of the response variable, level of education, within each group must be approximately normal. Let us check whether this condition is satisfied by looking at the quantile plots:
From the above plots we can assume that the distribution of the number of years of education within each group is approximately normal.
Condition of constant variance:
A quick look at the side by side plot (Fig 2) and the summary in Table 1 shows that the variability is roughly consistent across the groups.
To summarize, the conditions for the ANOVA test seem to be satisfied by the data associated with the research question. We will go ahead and perform the inference.
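One quick numerical check of this condition is the ratio of the largest to the smallest group standard deviation, using the values from Table 1. A common rule of thumb, which is an assumption here and not something the analysis itself states, treats a ratio below about 2 as acceptable.

```r
# Group standard deviations taken from Table 1
group_sd <- c(Liberal = 3.330, Moderate = 2.802, Conservative = 2.981)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(group_sd) / min(group_sd)
print(round(sd_ratio, 3))
```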
First let us define all the quantities of interest to perform the analysis.
Source | Term | Df | Sum Sq | Mean Sq | F value |
---|---|---|---|---|---|
GROUP | Political views | 2 | 243.414 | 121.707 | 13.348 |
ERROR | Residuals | 1870 | 17050.352 | 9.118 | |
Total | | 1872 | 17293.766 | | |
The p-value for this analysis turns out to be:
pf(F, dfG, dfE, lower.tail = FALSE)
[1] 1.75356e-06
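The ANOVA quantities can be reconstructed approximately from the group sizes, means and standard deviations in Table 1. This is only a sketch: because it starts from rounded summary values, the sums of squares and the F statistic agree with the table to about two decimal places rather than exactly, and the variable names are illustrative.

```r
# Group summaries from Table 1 (rounded values)
n <- c(Liberal = 532, Moderate = 713, Conservative = 628)
m <- c(Liberal = 14.199, Moderate = 13.310, Conservative = 13.613)
s <- c(Liberal = 3.330, Moderate = 2.802, Conservative = 2.981)

grand_mean <- sum(n * m) / sum(n)

# Between-group and within-group sums of squares
SSG <- sum(n * (m - grand_mean)^2)
SSE <- sum((n - 1) * s^2)

# Degrees of freedom: k - 1 for groups, N - k for error
dfG <- length(n) - 1
dfE <- sum(n) - length(n)

MSG <- SSG / dfG
MSE <- SSE / dfE
F_stat <- MSG / MSE

# Upper-tail probability of the F distribution
p_value <- pf(F_stat, dfG, dfE, lower.tail = FALSE)
print(c(SSG = SSG, SSE = SSE, F = F_stat, p = p_value))
```

The statistic is stored as F_stat rather than F so that it does not mask R's built-in shorthand for FALSE.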
The p-value is the probability of observing a ratio between the “between” and “within” group variabilities at least as large as the one observed, if in fact the means of all groups are equal. Since the p-value is less than \(\alpha\) (significance level of 5%), we reject \(H_0\) and conclude that the data provide convincing evidence that at least one pair of population means differ from each other. To find out which pairs differ, we will test the different pairs of groups using T statistics.
Pairwise comparisons using T statistics: We apply the Bonferroni correction to the significance level, \(\alpha^*=\alpha/K\), where \(K=\frac{k(k-1)}{2}\) is the number of comparisons among \(k\) groups. The pooled standard error for multiple pairwise comparisons is \(SE=\sqrt{\frac{MSE}{n_1}+\frac{MSE}{n_2}}\) and the degrees of freedom are \(df=df_E\).
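For the three groups in this study (\(k=3\)) and the 5% significance level used in the ANOVA, the correction and the standard error for the first comparison work out as follows, with the mean square error rounded to 9.118 and the group sizes 532 (Liberal) and 713 (Moderate) taken from the tables:

\[
K=\frac{3(3-1)}{2}=3, \qquad \alpha^*=\frac{0.05}{3}\approx 0.0167, \qquad SE=\sqrt{\frac{9.118}{532}+\frac{9.118}{713}}\approx 0.173
\]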
Null Hypothesis; \(H_0: \mu_{Liberal}-\mu_{Moderate}=0\)
Alternative Hypothesis; \(H_A: \mu_{Liberal}-\mu_{Moderate}\neq0\) (Two sided test)
The T statistic is computed from the sample means: \(T=\frac{\bar{x}_{Liberal}-\bar{x}_{Moderate}}{SE}=5.141\), and the corresponding p-value turns out to be
[1] 3.024891e-07
The p-value is much less than \(\alpha^{*}=0.0167\). Since p-value \(< \alpha^*\), we reject \(H_0\). This implies that there is a difference in the average level of education between the populations with liberal and moderate political views.
Null Hypothesis; \(H_0: \mu_{Liberal}-\mu_{Conservative}=0\)
Alternative Hypothesis; \(H_A: \mu_{Liberal}-\mu_{Conservative}\neq0\) (Two sided test)
The T statistic is computed from the sample means: \(T=\frac{\bar{x}_{Liberal}-\bar{x}_{Conservative}}{SE}=3.295\), and the p-value comes out to be 0.001. Since p-value \(< \alpha^*\), we reject \(H_0\). This implies that there is a difference in the average level of education between the populations with liberal and conservative political views.
Null Hypothesis; \(H_0: \mu_{Moderate}-\mu_{Conservative}=0\)
Alternative Hypothesis; \(H_A: \mu_{Moderate}-\mu_{Conservative}\neq0\) (Two sided test)
The T statistic is computed from the sample means: \(T=\frac{\bar{x}_{Moderate}-\bar{x}_{Conservative}}{SE}=-1.834\), and the corresponding p-value is 0.067. Since p-value \(>\alpha^*\), we fail to reject \(H_0\). This implies that the data do not provide convincing evidence that the average level of education differs between the populations with moderate and conservative political views. The T statistics and the corresponding p-values for the different pairs of groups are displayed in Table 3:
Comparison | n1 | n2 | mean1 | mean2 | SE | T | p-value |
---|---|---|---|---|---|---|---|
Liberal vs Moderate | 532 | 713 | 14.199 | 13.310 | 0.173 | 5.141 | 0.000 |
Liberal vs Conservative | 532 | 628 | 14.199 | 13.613 | 0.178 | 3.295 | 0.001 |
Moderate vs Conservative | 713 | 628 | 13.310 | 13.613 | 0.165 | -1.834 | 0.067 |
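The three comparisons can be reproduced from the group sizes and means together with the mean square error and error degrees of freedom from the ANOVA table. The sketch below uses the rounded group means, so the T statistics match the table to about two decimal places; the variable names are illustrative.

```r
# Group sizes and (rounded) means from Table 1
n <- c(Liberal = 532, Moderate = 713, Conservative = 628)
m <- c(Liberal = 14.199, Moderate = 13.310, Conservative = 13.613)

# Mean square error and error degrees of freedom from the ANOVA table
MSE <- 9.11783525557335
dfE <- 1870

# All pairs of groups: Liberal-Moderate, Liberal-Conservative, Moderate-Conservative
pairs <- combn(names(n), 2)
g1 <- pairs[1, ]
g2 <- pairs[2, ]

# Pooled standard error, T statistic and two-sided p-value for each pair
SE <- sqrt(MSE / n[g1] + MSE / n[g2])
T_stat <- (m[g1] - m[g2]) / SE
p_value <- 2 * pt(abs(T_stat), df = dfE, lower.tail = FALSE)

# Bonferroni-corrected significance level for K = 3 comparisons
alpha_star <- 0.05 / ncol(pairs)

result <- data.frame(
  comparison = paste(g1, "vs", g2),
  SE         = round(unname(SE), 3),
  T_stat     = round(unname(T_stat), 3),
  p_value    = signif(unname(p_value), 3),
  reject_H0  = unname(p_value < alpha_star)
)
print(result)
```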
In the ANOVA testing we used hypothesis testing to arrive at the conclusions. Since there is no corresponding confidence interval method for ANOVA, we cannot use one as an alternative to compare against the results obtained.
To summarize, we first hypothesized that something interesting might be going on in the relationship between the number of years of education and the political philosophy of the population of interest. This was verified by means of ANOVA testing, from which we could conclude that there is a difference in the average level of education between the populations with different political philosophies. Multiple pairwise testing was then used to pinpoint which pairs of groups actually showed the difference. We concluded that there is a difference in the average number of years of education between the populations with “liberal and moderate political views” and “liberal and conservative political views”, but no convincing evidence of a difference was found for the groups with “moderate and conservative political views”.
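From this research question, it can be inferred that the level of education is associated with one’s political views: respondents with liberal views had, on average, more years of education than those with moderate or conservative views. Because the study is observational, this does not establish that more education causes a more open-minded outlook.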
I feel that most of the current generation are more liberal. Therefore, it would be of interest to collect data over recent years (2008 to the present) following the same sample design, and to classify respondents into only three categories from the start (as opposed to the original seven). I would also be curious to see whether the level of education relates differently to men’s and women’s political views.
Citation for the data:
[1] Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1
A portion of the data set used in the analysis is displayed here:
educ polviews
55088 16 Moderate
55089 12 Conservative
55090 12 Conservative
55091 13 Conservative
55092 16 Liberal
55093 19 Moderate
55094 15 Moderate
55095 11 Moderate
55096 9 Conservative
55097 17 Liberal
55098 10 Liberal
55099 16 Liberal
55100 12 Moderate
55102 4 Conservative
55103 13 Liberal
55104 12 Moderate
55105 13 Conservative
55106 12 Moderate
55107 12 Moderate
55108 0 Conservative