Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

## Loading the gss dataset

load("gss.Rdata")

Part 1: Data

The General Social Survey (GSS) was designed in 1972 as part of a data diffusion project, and its data have been used to monitor societal change through 2012.

Since the data are used to study patterns and trends in various categories across the United States, we assume they were collected through random sampling, so the results of this analysis are generalizable to the entire United States population.

However, because this is an observational study, no random assignment took place, so the results cannot be used to establish causality.

Part 2: Research question

Does being married help someone minimize drug addiction?

This question interests me mainly because most of my friends consume drugs whenever possible. A few of them are single and a few are in a relationship. I would like to see whether getting married eventually helps them reduce their drug consumption. This question can be approached in many ways, but I will use the techniques we have learned so far to address it.

Part 3: Exploratory data analysis

## Let's look at the structure and dimensions of the dataset to understand the kinds of variables we have

dim(gss)

## [1] 57061   114

#Structure of variables

#absingle: Not married

str(gss$absingle)

##  Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 2 ...

#natdrug: dealing with drug addiction

str(gss$natdrug)

##  Factor w/ 3 levels "Too Little","About Right",..: NA NA NA NA NA NA NA NA NA NA ...

Since both variables are categorical and natdrug has more than two levels, we will analyse this case using the chi-square test of independence.

# First, let's eliminate the NAs in the absingle and natdrug variables

p <- gss %>%
  select(absingle, natdrug) %>%
  filter(!is.na(absingle), !is.na(natdrug)) 
  
slice(p, 1:10)

## # A tibble: 10 x 2
##    absingle    natdrug
##      <fctr>     <fctr>
##  1       No Too Little
##  2      Yes Too Little
##  3      Yes Too Little
##  4       No Too Little
##  5      Yes Too Little
##  6       No Too Little
##  7       No Too Little
##  8      Yes Too Little
##  9       No Too Little
## 10       No Too Little

# Let's save the number of remaining rows in a single_drug_count variable

single_drug_count <- nrow(p)

single_drug_count

## [1] 23470

# Let's take a final look at the data to confirm it is in the required form before proceeding
str(p)

## 'data.frame':    23470 obs. of  2 variables:
##  $ absingle: Factor w/ 2 levels "Yes","No": 2 1 1 2 1 2 2 1 2 2 ...
##  $ natdrug : Factor w/ 3 levels "Too Little","About Right",..: 1 1 1 1 1 1 1 1 1 1 ...

# The data looks good for analysis. Let's find the number of observations at each level of the natdrug variable

p %>% 
  group_by(natdrug) %>% 
  summarise(count_drug = n())

## # A tibble: 3 x 2
##       natdrug count_drug
##        <fctr>      <int>
## 1  Too Little      14726
## 2 About Right       6832
## 3    Too Much       1912

# Let's find the number of observations at each level of the absingle variable
p %>% 
  group_by(absingle) %>% 
  summarise(count_single = n())

## # A tibble: 2 x 2
##   absingle count_single
##     <fctr>        <int>
## 1      Yes        10590
## 2       No        12880

# Let's find the number of observations at each level of natdrug within the "Yes" and "No" categories of absingle.

table(p$absingle, p$natdrug)

##      
##       Too Little About Right Too Much
##   Yes       6369        3302      919
##   No        8357        3530      993
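
The two groups differ in size, so the raw counts are easier to compare as within-group shares. The following is a minimal sketch, assuming the counts printed above are typed into a matrix by hand; the names obs_counts and row_props are introduced here for illustration and are not part of the original analysis. With the real data, table(p$absingle, p$natdrug) could be passed to prop.table() instead.

```r
# Observed counts copied from the contingency table above
obs_counts <- matrix(
  c(6369, 3302, 919,
    8357, 3530, 993),
  nrow = 2, byrow = TRUE,
  dimnames = list(absingle = c("Yes", "No"),
                  natdrug  = c("Too Little", "About Right", "Too Much"))
)

# Share of each natdrug level within each absingle group (each row sums to 1)
row_props <- prop.table(obs_counts, margin = 1)

round(row_props, 3)
```

Comparing the two rows of row_props shows how the distribution of natdrug differs between the "Yes" and "No" groups, independent of group size.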

# Visualizing the above results with a stacked bar chart

ggplot(p, aes(x = absingle, fill = natdrug)) +
  geom_bar()

# We now have an idea of the counts in each scenario. Let's examine whether our data is suitable for a chi-square test of independence

#CONDITIONS FOR THE CHI SQUARE TEST

#1. The observations are independent of each other, as one respondent's relationship status and drug consumption do not depend on another respondent's relationship status or drug consumption.
#2. Every cell/scenario has at least 5 expected cases.
#3. We have a random sample.
#4. The sample of 23470 is less than 10% of the US population that consumes drugs and has a relationship status of either single or not single.
#5. Each case contributes to only one cell in the table.
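
The cell-size condition is usually checked against the expected counts under independence rather than the observed ones. The sketch below assumes the observed counts from the contingency table are entered by hand (obs_counts and expected_counts are illustrative names) and reads the expected counts from the $expected component returned by base R's chisq.test().

```r
# Observed counts copied from the contingency table in the exploratory analysis
obs_counts <- matrix(
  c(6369, 3302, 919,
    8357, 3530, 993),
  nrow = 2, byrow = TRUE,
  dimnames = list(absingle = c("Yes", "No"),
                  natdrug  = c("Too Little", "About Right", "Too Much"))
)

# chisq.test() stores the expected counts under independence in $expected
expected_counts <- chisq.test(obs_counts)$expected

# Smallest expected cell count, and whether every cell reaches 5
min(expected_counts)
all(expected_counts >= 5)
```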

#-------------------------------------------------------------#
###Reason for not calculating Confidence Interval###
#Since one of the categorical variables has more than 2 levels, and our focus is mainly on whether the two categorical variables are independent, we need not calculate a confidence interval for this analysis.
#-------------------------------------------------------------#

# It looks like the level of drug consumption differs depending on whether a person is single or not. Let's check whether this difference holds up under hypothesis testing.

# Null Hypothesis: Relationship status is not associated with drug consumption; drug consumption is independent of relationship status.
# Alternative Hypothesis: Relationship status is associated with drug consumption; drug consumption is dependent on relationship status.

# We need to save each of the values from the above results in individual variables to calculate the proportions and the expected and observed counts, which in turn let us calculate the chi-square statistic and conduct the hypothesis test.

#number of absingle - yes results

single_yes <- p %>%
  filter(absingle == "Yes") 

count_single_yes <- nrow(single_yes)

count_single_yes

## [1] 10590

#number of absingle - no results

single_no <- p %>%
  filter(absingle == "No") 

count_single_no <- nrow(single_no)

count_single_no

## [1] 12880

# Proportion of the absingle-yes count out of the total count

prop_yes <- p %>% summarise(p_yes = count_single_yes/single_drug_count)

prop_yes

##       p_yes
## 1 0.4512143

# Proportion of the absingle-no count out of the total count

prop_no <- p %>% summarise(p_no = count_single_no/single_drug_count)

prop_no

##        p_no
## 1 0.5487857

#number of natdrug - Too Little results

drug_little <- p %>%
  filter(natdrug == "Too Little") 

count_drug_little <- nrow(drug_little)

count_drug_little

## [1] 14726

#number of natdrug - Too Much results

drug_much <- p %>%
  filter(natdrug == "Too Much") 

count_drug_much <- nrow(drug_much)

count_drug_much

## [1] 1912

#number of natdrug - About Right results

drug_right <- p %>%
  filter(natdrug == "About Right") 

count_drug_right <- nrow(drug_right)

count_drug_right

## [1] 6832

#number of natdrug - Too Little and absingle - yes results

yes_little <- p %>%
  filter(absingle =="Yes", natdrug == "Too Little") 

obs_yes_little <- nrow(yes_little)

obs_yes_little

## [1] 6369

#number of natdrug - About Right and absingle - yes results

yes_right <- p %>%
  filter(absingle =="Yes", natdrug == "About Right") 

obs_yes_right <- nrow(yes_right)

obs_yes_right

## [1] 3302

#number of natdrug - Too Much and absingle - yes results

yes_much <- p %>%
  filter(absingle =="Yes", natdrug == "Too Much") 

obs_yes_much <- nrow(yes_much)

obs_yes_much

## [1] 919

#number of natdrug - Too Much and absingle - No results

no_much <- p %>%
  filter(absingle =="No", natdrug == "Too Much") 

obs_no_much <- nrow(no_much)

obs_no_much

## [1] 993

#number of natdrug - About Right and absingle - No results

no_right <- p %>%
  filter(absingle =="No", natdrug == "About Right") 

obs_no_right <- nrow(no_right)

obs_no_right

## [1] 3530

#number of natdrug - Too Little and absingle - No results

no_little <- p %>%
  filter(absingle =="No", natdrug == "Too Little") 

obs_no_little <- nrow(no_little)

obs_no_little

## [1] 8357

#Expected count of absingle-yes and natdrug- Too Little 

exp_yes_little <- count_drug_little * prop_yes

exp_yes_little

##      p_yes
## 1 6644.582

#Expected count of absingle-yes and natdrug- Too Much 

exp_yes_much <- count_drug_much * prop_yes

exp_yes_much

##      p_yes
## 1 862.7218

#Expected count of absingle-yes and natdrug- About Right 

exp_yes_right <- count_drug_right * prop_yes

exp_yes_right

##      p_yes
## 1 3082.696

# Checking if the sum of the expected absingle-yes counts for all three levels of the natdrug  is equal to the total absingle-yes

Total_yes <- exp_yes_right + exp_yes_much + exp_yes_little

Total_yes

##   p_yes
## 1 10590

#Expected count of absingle-no and natdrug-About Right 

exp_no_right <- count_drug_right * prop_no

exp_no_right

##       p_no
## 1 3749.304

#Expected count of absingle-no and natdrug-Too Little 

exp_no_little <- count_drug_little * prop_no

exp_no_little

##       p_no
## 1 8081.418

#Expected count of absingle-no and natdrug-Too Much 

exp_no_much <- count_drug_much * prop_no

exp_no_much

##       p_no
## 1 1049.278

# Checking if the sum of the expected absingle-no counts for all three levels of the natdrug  is equal to the total absingle-no

Total_no <- exp_no_right + exp_no_much + exp_no_little

Total_no

##    p_no
## 1 12880

# View all the expected counts together in a data frame

e <- data.frame(exp_yes_little, exp_yes_right, exp_yes_much, exp_no_little, exp_no_right, exp_no_much)

e

##      p_yes  p_yes.1  p_yes.2     p_no   p_no.1   p_no.2
## 1 6644.582 3082.696 862.7218 8081.418 3749.304 1049.278

# View all the Observed counts together in a data frame

f <- data.frame(obs_yes_little,obs_yes_right, obs_yes_much,obs_no_little,obs_no_right, obs_no_much)

f

##   obs_yes_little obs_yes_right obs_yes_much obs_no_little obs_no_right
## 1           6369          3302          919          8357         3530
##   obs_no_much
## 1         993

Part 4: Inference

# Computing the chi-square statistic: the sum over all cells of (observed - expected)^2 / expected

chi_square <- (
  ((obs_yes_little - exp_yes_little)^2 / exp_yes_little) +
  ((obs_no_little - exp_no_little)^2 / exp_no_little) +
  ((obs_yes_right - exp_yes_right)^2 / exp_yes_right) +
  ((obs_no_right - exp_no_right)^2 / exp_no_right) +
  ((obs_yes_much - exp_yes_much)^2 / exp_yes_much) +
  ((obs_no_much - exp_no_much)^2 / exp_no_much)
)

chi_square

##      p_yes
## 1 55.94575
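
The cell-by-cell calculation can also be written in matrix form, which avoids one variable per cell. This is a sketch rather than the method used in this report: it assumes the observed counts are entered by hand, and the names obs_counts, expected_counts and chi_sq_stat are illustrative.

```r
# Observed counts copied from the contingency table in the exploratory analysis
obs_counts <- matrix(
  c(6369, 3302, 919,
    8357, 3530, 993),
  nrow = 2, byrow = TRUE,
  dimnames = list(absingle = c("Yes", "No"),
                  natdrug  = c("Too Little", "About Right", "Too Much"))
)

# Expected count for each cell: (row total * column total) / grand total
expected_counts <- outer(rowSums(obs_counts), colSums(obs_counts)) / sum(obs_counts)

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi_sq_stat <- sum((obs_counts - expected_counts)^2 / expected_counts)

round(expected_counts, 3)
chi_sq_stat
```

Because expected_counts is a plain numeric matrix, chi_sq_stat is a single number rather than a one-cell data frame.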

#Compute degrees of freedom 

deg_of_freedom <- (2 - 1) * (3 - 1)

deg_of_freedom

## [1] 2

# Computing the p-value to test whether absingle and natdrug are associated at the 5% significance level.

pchisq(55.94575, 2, lower.tail = FALSE)

## [1] 7.10452e-13
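
As a sanity check on this p-value, the upper tail of a chi-square distribution with 2 degrees of freedom has the closed form exp(-x / 2). The sketch below takes the statistic reported above as given and compares pchisq() with that closed form; the variable names are illustrative.

```r
# Statistic and degrees of freedom as reported above
chi_sq_stat <- 55.94575
deg_of_freedom <- (2 - 1) * (3 - 1)

# Upper-tail probability from the chi-square distribution
p_value <- pchisq(chi_sq_stat, df = deg_of_freedom, lower.tail = FALSE)

# With exactly 2 degrees of freedom the upper tail equals exp(-x / 2)
closed_form <- exp(-chi_sq_stat / 2)

p_value
closed_form
p_value < 0.05
```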

# The p-value is far below 0.05, so we reject the null hypothesis in favour of the alternative hypothesis that being single or not is significantly associated with drug addiction.

# Let's verify that the above result was computed correctly using the inference function at the 5% significance level

inference(y = absingle, x = natdrug, data = p, type = "ht", statistic = "proportion", method = "theoretical", null = NULL, alternative = "greater", success = "No")

## Response variable: categorical (2 levels) 
## Explanatory variable: categorical (3 levels) 
## Observed:
##              y
## x              Yes   No
##   Too Little  6369 8357
##   About Right 3302 3530
##   Too Much     919  993
## 
## Expected:
##              y
## x                   Yes       No
##   Too Little  6644.5820 8081.418
##   About Right 3082.6962 3749.304
##   Too Much     862.7218 1049.278
## 
## H0: natdrug and absingle are independent
## HA: natdrug and absingle are dependent
## chi_sq = 55.9457, df = 2, p_value = 0
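
The p_value = 0 printed above is presumably a rounded display, since the manual calculation gave a very small but non-zero value. As a further cross-check that does not depend on statsr, base R's chisq.test() can be run on the observed counts. This is a sketch assuming the counts are entered by hand; obs_counts and res are illustrative names.

```r
# Observed counts copied from the contingency table in the exploratory analysis
obs_counts <- matrix(
  c(6369, 3302, 919,
    8357, 3530, 993),
  nrow = 2, byrow = TRUE,
  dimnames = list(absingle = c("Yes", "No"),
                  natdrug  = c("Too Little", "About Right", "Too Much"))
)

# Pearson chi-square test of independence on the 2 x 3 table
res <- chisq.test(obs_counts)

res$statistic  # chi-square statistic
res$parameter  # degrees of freedom
res$p.value    # unrounded p-value
```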

Both techniques used above for hypothesis testing produced the same values and gave a p-value below 0.05.

Based on the above results, we reject the null hypothesis in favour of the alternative that the level of drug consumption differs depending on whether the person is married or not.

But we cannot conclude that being married or unmarried helps someone minimize their drug addiction. Even though more married people than unmarried people have a Too Little drug consumption status, the above results are not enough to establish an effect. The difference could be due to other factors, so the results do not let us credit marriage for a person's minimal drug consumption. Further research is required to reach that conclusion.