My research question is whether the average number of years of education differed between White and Black respondents in the United States in 2012. I chose this topic because I am interested in whether different races receive fair opportunities and resources for education. This analysis gives a brief picture of the most recent education levels of White and Black Americans, the two largest racial groups in the US. For other researchers, it can help describe the current situation; if a difference does exist, further work can examine what drives it and what actions could balance educational opportunities and resources.
This analysis uses an extract of the General Social Survey (GSS), a large-scale US survey. The GSS is a sociological survey that collects data on the demographic characteristics and attitudes of residents of the United States, and the dataset is a cumulative file of 29 surveys conducted between 1972 and 2012. The items appearing in the surveys are of three types: permanent questions that occur on each survey, rotating questions that appear on two out of every three surveys (1973, 1974, and 1976, or 1973, 1975, and 1976), and a few occasional questions, such as split ballot experiments, that occur in a single survey.
Please note that this extract was modified for the Data Analysis and Statistical Inference course of Duke University. The dataset has a total of 57,061 cases and 114 variables. The unit of observation is the individual respondent. For this analysis, we study only two variables and their relevant cases in 2012 (the most recent year in the survey), excluding NA values. After restricting to White and Black respondents, this gives a subset of 1,776 cases, i.e. 1,776 respondents.
In this analysis, we study one categorical variable called “race” and one numeric variable called “educ”. The “race” variable records the race of the respondent and takes the values “White”, “Black” and “Other”; we study only the “White” and “Black” categories. The “educ” variable records the highest year of school completed by the respondent, i.e. the number of years of education. The “race” variable is the explanatory variable (x) of the analysis, and the “educ” variable is the response variable (y).
This is an observational study: the data were collected by random sampling, and there is no random assignment.
The population of interest is White and Black adults living in the United States in 2012. Since the GSS is a large-scale US survey that has used random sampling for 40 years, the results of this analysis can reasonably be generalized to the population of interest. However, the dataset does have some missing (NA) values, and it may be affected by non-response bias, since participation is voluntary. We expect the impact to be small given the large sample, so we still assume that the analysis can be generalized.
On the other hand, these data cannot be used to establish causal links between the “race” and “educ” variables, since there is no random assignment and the study is not an experiment. We also cannot block for any confounding variables. The level of education can be explained by many other variables (factors), such as sex, family income level, religion, etc.
First, we select a subset with only two variables (“race” and “educ”), where race is either “White” or “Black” and the year of the survey is 2012. We also exclude NA values. This gives us a dataset of 1,776 cases.
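# Load the course extract of the GSS (creates the data frame `gss`)
load(url("http://bit.ly/dasi_gss_data"))
# Keep White and Black respondents from the 2012 survey
a1 <- subset(gss, race == "White" | race == "Black")
a2 <- subset(a1, year == "2012")
# Keep only the two variables of interest, selected by name rather than position
race_edu <- a2[c("race", "educ")]
# Drop the now-empty "Other" level and remove missing education values
race_edu$race <- droplevels(race_edu$race)
race_edu <- subset(race_edu, !is.na(educ))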
Now we take a look at our dataset and the summaries of our two variables.
summary(race_edu)
##     race           educ
##  White:1475   Min.   : 0.00
##  Black: 301   1st Qu.:12.00
##               Median :13.00
##               Mean   :13.61
##               3rd Qu.:16.00
##               Max.   :20.00
prop.table(table(race_edu$race))
##
## White Black
## 0.830518 0.169482
boxplot(race_edu$educ)
From the results, we see that the majority of our respondents are White (83%), compared to 17% who are Black. According to the 2010 U.S. census distribution by race and ethnicity, White Americans made up 72.4% of the population and African Americans 12.6%. Restricted to these two groups only, that corresponds to about 85% White and 15% Black, so our sample is quite consistent with the census data. From the boxplot, the distribution of “educ” appears right skewed, which makes sense since fewer people reach each higher level of education. There is a very small number of outliers. In the next step, we compare the years of education of White and Black respondents using side-by-side boxplots.
boxplot(race_edu$educ ~ race_edu$race)
From the side-by-side boxplots, we can see that the median years of education of White respondents is slightly higher than that of Black respondents. The distribution for White respondents is fairly symmetric and close to a normal distribution. The distribution for Black respondents is right skewed, and its range and IQR are narrower than those for White respondents.
Based on these results, the exploratory analysis suggests that there may be a difference between the average years of education of White and Black respondents.
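The boxplot reading above can be backed by per-group numbers. The following is only a minimal sketch: it uses the first ten rows of `race_edu` as printed in the listing at the end of this report, so the values it produces describe those ten rows, not the full sample. The names `sample_rows` and `group_stats` are made up for this illustration.

```r
# First ten rows of race_edu, copied from the listing at the end of this report
# (illustration only; the full analysis uses all 1,776 rows)
sample_rows <- data.frame(
  race = factor(c("White", "White", "White", "Black", "White",
                  "White", "Black", "White", "White", "Black"),
                levels = c("White", "Black")),
  educ = c(16, 12, 13, 16, 19, 15, 9, 17, 10, 16)
)

# Size, mean, median and IQR of education within each race group
group_stats <- sapply(
  split(sample_rows$educ, sample_rows$race),
  function(x) c(n = length(x), mean = mean(x), median = median(x), IQR = IQR(x))
)
round(group_stats, 2)
```

Replacing `sample_rows` with the full `race_edu` data frame would give the group medians and IQRs that the side-by-side boxplots display.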
The null hypothesis (H0) is that there is no difference between the average years of education of White and Black adults in the US in 2012. The alternative hypothesis (HA) is that there is a difference between the average years of education of White and Black adults in the US in 2012.
$H_0: \mu_{White} - \mu_{Black} = 0$
$H_A: \mu_{White} - \mu_{Black} \neq 0$
Conditions:
1. Independence:
   - Within groups: All cases are selected randomly from the population, based on the data collection method of the GSS, so the observations are independent within groups. Both samples are also far smaller than 10% of their respective populations.
   - Between groups: Since no respondent is recorded as both White and Black, the two groups are independent of each other.
2. Sample size/skew: Each sample has more than 30 observations (1,475 and 301), so the skew seen in the boxplots is not a concern.
In conclusion, all the conditions for inference for comparing two independent means are met.
Since we are comparing two independent means and the conditions are met, we use a theoretical method to conduct a two-sided hypothesis test and report the associated p-value. We also calculate the confidence interval for (μwhite − μblack).
Two-sided hypothesis test at a significance level of 5%:
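For reference, the theoretical method for a difference of two independent means is usually written as follows. This is a sketch of the standard large-sample formulas, assuming the normal approximation that the reported Z statistic suggests; the symbols $\bar{x}$, $s$, $n$ and $z^{\star}$ (sample mean, sample standard deviation, sample size and critical value) are notation introduced here, not taken from the course function.

```latex
SE = \sqrt{\frac{s_{White}^{2}}{n_{White}} + \frac{s_{Black}^{2}}{n_{Black}}}
\qquad
Z = \frac{(\bar{x}_{White} - \bar{x}_{Black}) - 0}{SE}
\qquad
\text{CI: } (\bar{x}_{White} - \bar{x}_{Black}) \pm z^{\star} \cdot SE
```

The 0 in the numerator of $Z$ is the null value of $\mu_{White} - \mu_{Black}$ under H0.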
source("http://bit.ly/dasi_inference")
inference(y = race_edu$educ, x = race_edu$race, est = "mean", type = "ht", null = 0, alternative = "twosided", method = "theoretical", order = c("White","Black"), eda_plot=FALSE)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_White = 1475, mean_White = 13.7031, sd_White = 3.031
## n_Black = 301, mean_Black = 13.1628, sd_Black = 2.7911
## Observed difference between means (White-Black) = 0.5403
## H0: mu_White - mu_Black = 0
## HA: mu_White - mu_Black != 0
## Standard error = 0.179
## Test statistic: Z = 3.015
## p-value = 0.0026
Since the p-value is 0.0026 (0.26%), which is smaller than the significance level of 5%, we reject the null hypothesis. In other words, if there were no difference between the average years of education of White and Black adults in the US in 2012, there would be only a 0.26% chance of obtaining random samples of 1,475 White respondents and 301 Black respondents whose mean years of education differ by at least 0.5403 years in either direction.
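The reported test result can be checked by hand from the summary statistics alone. The sketch below assumes the standard large-sample Z test for two independent means; it does not show how the course's `inference()` function is implemented, and the variable names are chosen only for this illustration.

```r
# Summary statistics copied from the inference() output above
n_white <- 1475; mean_white <- 13.7031; sd_white <- 3.031
n_black <- 301;  mean_black <- 13.1628; sd_black <- 2.7911

# Observed difference (White - Black) and its standard error
diff_obs <- mean_white - mean_black
se <- sqrt(sd_white^2 / n_white + sd_black^2 / n_black)

# Z statistic under H0 (difference = 0) and two-sided p-value
z <- (diff_obs - 0) / se
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

round(c(diff = diff_obs, se = se, z = z, p = p_value), 4)
```

Up to rounding, these values should agree with the standard error, test statistic and p-value printed above.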
The 95% confidence interval:
inference(y = race_edu$educ, x = race_edu$race, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical", order = c("White","Black"), eda_plot=FALSE)
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_White = 1475, mean_White = 13.7031, sd_White = 3.031
## n_Black = 301, mean_Black = 13.1628, sd_Black = 2.7911
## Observed difference between means (White-Black) = 0.5403
## Standard error = 0.1792
## 95 % Confidence interval = ( 0.1891 , 0.8915 )
The 95% confidence interval is (0.1891, 0.8915). That is, we are 95% confident that in 2012 White adults in the US had on average 0.1891 to 0.8915 more years of education than Black adults.
The confidence interval doesn’t include our null value 0, which is consistent with the rejection of the null hypothesis in the previous step.
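The interval can be reproduced from the same summary statistics. This is a minimal sketch assuming a normal critical value (about 1.96 for 95% confidence), which matches the reported interval; the variable names are illustrative only.

```r
# Summary statistics copied from the inference() output above
n_white <- 1475; mean_white <- 13.7031; sd_white <- 3.031
n_black <- 301;  mean_black <- 13.1628; sd_black <- 2.7911

diff_obs <- mean_white - mean_black
se <- sqrt(sd_white^2 / n_white + sd_black^2 / n_black)

# 95% confidence interval: point estimate +/- critical value * standard error
z_star <- qnorm(0.975)
ci <- diff_obs + c(-1, 1) * z_star * se
round(ci, 4)

# The interval excludes 0 exactly when the two-sided test rejects at the 5% level
excludes_zero <- ci[1] > 0 || ci[2] < 0
excludes_zero
```

With the full data frame available, base R's `t.test(educ ~ race, data = race_edu)` would be a natural cross-check; it uses a t distribution rather than the normal, so with samples this large its interval should be very close but need not match to the last digit.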
In conclusion, this project provides strong evidence that there was a difference between the average years of education of White and Black adults in the US in 2012: on average, White adults had more years of education than Black adults. We also saw that White respondents greatly outnumber Black respondents, in both the sample data and the 2010 US census data. The observed gap is consistent with the possibility that Black Americans do not have equally fair opportunities and resources for education compared to White Americans, but this observational analysis cannot establish that. For future research, we can study other characteristics of the two groups and examine the confounding variables that may contribute to the difference. Because race cannot be randomly assigned, questions about causal links would need study designs that account for these confounders rather than a randomized experiment.
Data citation: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1
Persistent URL: http://doi.org/10.3886/ICPSR34802.v1
The dataset can also be accessed in R with load(url("http://bit.ly/dasi_gss_data")) or at the URL http://d396qusza40orc.cloudfront.net/statistics/project/gss.Rdata .
race_edu[1:30,]
## race educ
## 55088 White 16
## 55089 White 12
## 55091 White 13
## 55092 Black 16
## 55093 White 19
## 55094 White 15
## 55096 Black 9
## 55097 White 17
## 55098 White 10
## 55099 Black 16
## 55100 White 12
## 55101 Black 12
## 55103 Black 13
## 55104 White 12
## 55105 Black 13
## 55110 White 14
## 55112 White 12
## 55113 White 17
## 55114 White 15
## 55115 White 10
## 55116 Black 16
## 55117 White 13
## 55118 White 16
## 55119 White 14
## 55120 White 19
## 55122 White 14
## 55123 White 18
## 55124 White 11
## 55125 Black 12
## 55127 Black 14