library(ggplot2)
library(dplyr)
library(statsr)
## Loading the gss dataset
load("gss.Rdata")
In 1972, the General Social Survey (GSS) was designed as part of a data diffusion project. The survey data was used to monitor societal change through 2012.
Since the data is used to study patterns and trends in various categories across the United States, we assume that it was collected through random sampling, so the results of the analysis are generalizable to the entire United States population.
However, because this is an observational study with no random assignment, the results cannot be used to establish causality.
Does being married help someone minimize drug addiction?
This question interests me mainly because most of my friends consume drugs whenever possible. A few of them are single and a few are in relationships. I would like to see whether getting married eventually helps them reduce their drug consumption. This question can be approached in many ways, but I will use the techniques we have learned so far to address it.
## Let's look at the structure and dimensions of the dataset to understand the kinds of variables we have
dim(gss)
## [1] 57061   114
# Structure of variables
# absingle: Not married
str(gss$absingle)
##  Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 2 ...
# natdrug: dealing with drug addiction
str(gss$natdrug)
##  Factor w/ 3 levels "Too Little","About Right",..: NA NA NA NA NA NA NA NA NA NA ...
Since both variables are categorical and the natdrug variable has more than two levels, we will analyse this case using the chi-square test of independence.
# First, let's eliminate the NAs in the absingle and natdrug variables
p <- gss %>%
  select(absingle, natdrug) %>%
  filter(!is.na(absingle), !is.na(natdrug))
slice(p, 1:10)
## # A tibble: 10 x 2
##    absingle    natdrug
##      <fctr>     <fctr>
##  1       No Too Little
##  2      Yes Too Little
##  3      Yes Too Little
##  4       No Too Little
##  5      Yes Too Little
##  6       No Too Little
##  7       No Too Little
##  8      Yes Too Little
##  9       No Too Little
## 10       No Too Little
# Let's save the number of rows that remain after removing NAs into a single_drug_count variable
single_drug_count <- nrow(p)
single_drug_count
## [1] 23470
# Let's take a final look at the data to confirm that it is in the required form to proceed further
str(p)
## 'data.frame':    23470 obs. of  2 variables:
##  $ absingle: Factor w/ 2 levels "Yes","No": 2 1 1 2 1 2 2 1 2 2 ...
##  $ natdrug : Factor w/ 3 levels "Too Little","About Right",..: 1 1 1 1 1 1 1 1 1 1 ...
# The data looks good for analysis. Let's find the number of values under each level of the natdrug variable
p %>%
  group_by(natdrug) %>%
  summarise(count_drug = n())
## # A tibble: 3 x 2
##       natdrug count_drug
##        <fctr>      <int>
## 1  Too Little      14726
## 2 About Right       6832
## 3    Too Much       1912
# Let's find the number of values under each level of the absingle variable
p %>%
  group_by(absingle) %>%
  summarise(count_single = n())
## # A tibble: 2 x 2
##   absingle count_single
##     <fctr>        <int>
## 1      Yes        10590
## 2       No        12880
# Let's find the number of values under each level of the natdrug variable that fall under the "Yes" or "No" category of the absingle variable
table(p$absingle, p$natdrug)
##
##       Too Little About Right Too Much
##   Yes       6369        3302      919
##   No        8357        3530      993
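Because the two absingle groups differ in size (10590 and 12880 respondents), raw counts are hard to compare directly. The following is a minimal sketch, not part of the original analysis, that re-enters the counts from the table above by hand and converts each row to proportions; the object names `obs` and `row_props` are illustrative assumptions.

```r
# Observed counts copied by hand from the table above
# (rows: absingle, columns: natdrug)
obs <- matrix(c(6369, 3302, 919,
                8357, 3530, 993),
              nrow = 2, byrow = TRUE,
              dimnames = list(absingle = c("Yes", "No"),
                              natdrug  = c("Too Little", "About Right", "Too Much")))

# margin = 1 divides each cell by its row total, so every row sums to 1
row_props <- prop.table(obs, margin = 1)
round(row_props, 3)
```

Read row by row, about 60% of the "Yes" group and about 65% of the "No" group fall under Too Little.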
# Visualizing the above results with a stacked bar chart
ggplot(p, aes(x = absingle, fill = natdrug)) +
  geom_bar()
# The results above give us an idea of the counts in each scenario. Let's examine whether our data is fit for a chi-square test of independence
# CONDITIONS FOR THE CHI-SQUARE TEST
# 1. The observations are independent of each other, as one respondent's relationship status and drug consumption do not depend on another respondent's relationship status or drug consumption.
# 2. Every cell/scenario has an expected count of at least 5.
# 3. We have a random sample.
# 4. The sample of 23470 respondents is less than 10% of the US population that consumes drugs and has a relationship status of either single or not single.
# 5. Each case contributes to only one cell in the table.
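The sample-size condition can be checked directly. The sketch below is an illustration rather than the author's code: it re-enters the observed counts by hand and computes the expected count of every cell under independence as row total times column total divided by the grand total. The names `obs` and `expected` are assumptions introduced here.

```r
# Observed counts copied by hand from the contingency table
obs <- matrix(c(6369, 3302, 919,
                8357, 3530, 993),
              nrow = 2, byrow = TRUE,
              dimnames = list(absingle = c("Yes", "No"),
                              natdrug  = c("Too Little", "About Right", "Too Much")))

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)

# The condition holds when every expected count is at least 5
all(expected >= 5)
```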
#-------------------------------------------------------------#
### Reason for not calculating a confidence interval ###
# Since one of the categorical variables has more than two levels, and our focus is on whether the two categorical variables are independent or dependent, we do not need to calculate a confidence interval for this analysis.
#-------------------------------------------------------------#
# It looks like there is a difference in the level of drug consumption depending on whether a person is single or not. Let's check whether this difference is statistically significant using hypothesis testing.
# Null hypothesis: Relationship status has no effect on drug consumption. Drug consumption is independent of relationship status.
# Alternative hypothesis: Relationship status has some effect on drug consumption. Drug consumption is dependent on relationship status.
# We need to save each of the values from the above results into individual variables to calculate the proportions and the expected and observed counts, which in turn let us compute the chi-square statistic and conduct the hypothesis test.
# Number of absingle - Yes results
single_yes <- p %>%
  filter(absingle == "Yes")
count_single_yes <- nrow(single_yes)
count_single_yes
## [1] 10590
# Number of absingle - No results
single_no <- p %>%
  filter(absingle == "No")
count_single_no <- nrow(single_no)
count_single_no
## [1] 12880
# Proportion of the absingle-Yes count in the total count
prop_yes <- p %>% summarise(p_yes = count_single_yes/single_drug_count)
prop_yes
##       p_yes
## 1 0.4512143
# Proportion of the absingle-No count in the total count
prop_no <- p %>% summarise(p_no = count_single_no/single_drug_count)
prop_no
##        p_no
## 1 0.5487857
# Number of natdrug - Too Little results
drug_little <- p %>%
  filter(natdrug == "Too Little")
count_drug_little <- nrow(drug_little)
count_drug_little
## [1] 14726
# Number of natdrug - Too Much results
drug_much <- p %>%
  filter(natdrug == "Too Much")
count_drug_much <- nrow(drug_much)
count_drug_much
## [1] 1912
# Number of natdrug - About Right results
drug_right <- p %>%
  filter(natdrug == "About Right")
count_drug_right <- nrow(drug_right)
count_drug_right
## [1] 6832
# Number of natdrug - Too Little and absingle - Yes results
yes_little <- p %>%
  filter(absingle == "Yes", natdrug == "Too Little")
obs_yes_little <- nrow(yes_little)
obs_yes_little
## [1] 6369
# Number of natdrug - About Right and absingle - Yes results
yes_right <- p %>%
  filter(absingle == "Yes", natdrug == "About Right")
obs_yes_right <- nrow(yes_right)
obs_yes_right
## [1] 3302
# Number of natdrug - Too Much and absingle - Yes results
yes_much <- p %>%
  filter(absingle == "Yes", natdrug == "Too Much")
obs_yes_much <- nrow(yes_much)
obs_yes_much
## [1] 919
# Number of natdrug - Too Much and absingle - No results
no_much <- p %>%
  filter(absingle == "No", natdrug == "Too Much")
obs_no_much <- nrow(no_much)
obs_no_much
## [1] 993
# Number of natdrug - About Right and absingle - No results
no_right <- p %>%
  filter(absingle == "No", natdrug == "About Right")
obs_no_right <- nrow(no_right)
obs_no_right
## [1] 3530
# Number of natdrug - Too Little and absingle - No results
no_little <- p %>%
  filter(absingle == "No", natdrug == "Too Little")
obs_no_little <- nrow(no_little)
obs_no_little
## [1] 8357
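All of the row totals, column totals and cell counts saved one by one above can also be read from a single table. As a hedged alternative sketch (the names `obs` and `totals` are illustrative, and the counts are re-entered by hand rather than taken from `p`), base R's `addmargins()` appends the sums to a contingency table:

```r
# Observed counts copied by hand from the contingency table
obs <- matrix(c(6369, 3302, 919,
                8357, 3530, 993),
              nrow = 2, byrow = TRUE,
              dimnames = list(absingle = c("Yes", "No"),
                              natdrug  = c("Too Little", "About Right", "Too Much")))

# addmargins() adds a "Sum" row and a "Sum" column holding the totals
totals <- addmargins(obs)
totals
```

With the original data frame, the same call could presumably be applied to `table(p$absingle, p$natdrug)` instead of the hand-entered matrix.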
# Expected count of absingle-Yes and natdrug-Too Little
exp_yes_little <- count_drug_little * prop_yes
exp_yes_little
##      p_yes
## 1 6644.582
# Expected count of absingle-Yes and natdrug-Too Much
exp_yes_much <- count_drug_much * prop_yes
exp_yes_much
##      p_yes
## 1 862.7218
# Expected count of absingle-Yes and natdrug-About Right
exp_yes_right <- count_drug_right * prop_yes
exp_yes_right
##      p_yes
## 1 3082.696
# Checking that the expected absingle-Yes counts for all three levels of natdrug sum to the total absingle-Yes count
Total_yes <- exp_yes_right + exp_yes_much + exp_yes_little
Total_yes
##   p_yes
## 1 10590
# Expected count of absingle-No and natdrug-About Right
exp_no_right <- count_drug_right * prop_no
exp_no_right
##       p_no
## 1 3749.304
# Expected count of absingle-No and natdrug-Too Little
exp_no_little <- count_drug_little * prop_no
exp_no_little
##       p_no
## 1 8081.418
# Expected count of absingle-No and natdrug-Too Much
exp_no_much <- count_drug_much * prop_no
exp_no_much
##       p_no
## 1 1049.278
# Checking that the expected absingle-No counts for all three levels of natdrug sum to the total absingle-No count
Total_no <- exp_no_right + exp_no_much + exp_no_little
Total_no
##    p_no
## 1 12880
# View all the expected counts together in a data frame
e <- data.frame(exp_yes_little, exp_yes_right, exp_yes_much, exp_no_little, exp_no_right, exp_no_much)
e
##      p_yes  p_yes.1  p_yes.2     p_no   p_no.1   p_no.2
## 1 6644.582 3082.696 862.7218 8081.418 3749.304 1049.278
# View all the observed counts together in a data frame
f <- data.frame(obs_yes_little, obs_yes_right, obs_yes_much, obs_no_little, obs_no_right, obs_no_much)
f
##   obs_yes_little obs_yes_right obs_yes_much obs_no_little obs_no_right
## 1           6369          3302          919          8357         3530
##   obs_no_much
## 1         993
# Computing the chi-square statistic: the sum over all six cells of ((observed - expected)^2) / expected
chi_square <- ((obs_yes_little - exp_yes_little)^2 / exp_yes_little +
               (obs_no_little - exp_no_little)^2 / exp_no_little +
               (obs_yes_right - exp_yes_right)^2 / exp_yes_right +
               (obs_no_right - exp_no_right)^2 / exp_no_right +
               (obs_yes_much - exp_yes_much)^2 / exp_yes_much +
               (obs_no_much - exp_no_much)^2 / exp_no_much)
chi_square
##      p_yes
## 1 55.94575
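Written out, the statistic computed above is the standard Pearson chi-square sum. The symbols are notation introduced here for illustration and do not appear in the original code: O_ij and E_ij are the observed and expected counts in row i (absingle) and column j (natdrug), R_i and C_j are the row and column totals, and n = 23470 is the number of respondents.

```latex
\[
\chi^2 \;=\; \sum_{i=1}^{2}\sum_{j=1}^{3} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
E_{ij} \;=\; \frac{R_i \, C_j}{n}
\]
```

The expected counts computed earlier follow the same definition, since each one multiplies a natdrug column total by an absingle row proportion R_i / n.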
# Compute the degrees of freedom: (number of rows - 1) * (number of columns - 1)
deg_of_freedom <- (2 - 1) * (3 - 1)
deg_of_freedom
## [1] 2
# Computing the p-value for the test of association between absingle and natdrug, to be compared with the 5% significance level
pchisq(55.94575, 2, lower.tail = FALSE)
## [1] 7.10452e-13
# This p-value is far below 0.05, so we reject the null hypothesis in favour of the alternative hypothesis: there is a statistically significant association between being single or not and drug addiction.
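As an additional cross-check of the hand calculation, base R's `chisq.test()` can be run on the observed counts. This is a sketch under the assumption that the counts are re-entered by hand from the contingency table; the names `obs` and `res` are illustrative and not part of the original analysis.

```r
# Observed counts copied by hand from the contingency table
obs <- matrix(c(6369, 3302, 919,
                8357, 3530, 993),
              nrow = 2, byrow = TRUE,
              dimnames = list(absingle = c("Yes", "No"),
                              natdrug  = c("Too Little", "About Right", "Too Much")))

# Pearson chi-square test of independence.
# correct = FALSE turns off the continuity correction, which R only
# applies to 2 x 2 tables, so it has no effect on this 2 x 3 table.
res <- chisq.test(obs, correct = FALSE)

res$statistic  # chi-square statistic
res$parameter  # degrees of freedom
res$p.value    # p-value
```

The statistic and p-value should agree with the manual results above up to rounding.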
# Let's verify that the above result was computed correctly, using the inference function at the 5% significance level
inference(y = absingle, x = natdrug, data = p, type = "ht", statistic = "proportion", method = "theoretical", null = NULL, alternative = "greater", success = "No")
## Response variable: categorical (2 levels)
## Explanatory variable: categorical (3 levels)
## Observed:
##              y
## x              Yes   No
##   Too Little  6369 8357
##   About Right 3302 3530
##   Too Much     919  993
##
## Expected:
##              y
## x                   Yes       No
##   Too Little  6644.5820 8081.418
##   About Right 3082.6962 3749.304
##   Too Much     862.7218 1049.278
##
## H0: natdrug and absingle are independent
## HA: natdrug and absingle are dependent
## chi_sq = 55.9457, df = 2, p_value = 0
Both techniques used above for hypothesis testing produced the same chi-square statistic (55.9457 with 2 degrees of freedom) and a p-value far below 0.05; the inference function prints the p-value as 0 because it is rounded.
Based on these results, we reject the null hypothesis in favour of the alternative that there is a difference in the level of drug consumption depending on whether the person is married or not.
However, we cannot conclude that being married or not being married helps someone minimize their drug addiction. Even though more people fall under the Too Little level among married people than among unmarried people, the above results are not enough to establish an effect. The test shows only an association, and because this is an observational study the difference could be due to other factors, so the results do not let us credit marriage for a person's minimal drug consumption. Further research is required to arrive at that conclusion.