Inferential Statistics Final Project

The General Social Survey

The GSS is a survey that has been conducted every year since 1972 by NORC at the university of Chicago with support of the National Science Foundation in order to monitor and explain trends and constraints in attitudes, behaviors, and attributes of Americans.

This process has had two main goals:

To conduct basic scientific research on the structure and development of American society
To distribute up-to-date, important, high-quality data to social scientists, students, policy makers, and others.

Data collection method

The vast majority of GSS data is obtained in face-to-face interviews. Computer-assisted personal interviewing (CAPI) began in the 2002 GSS. Under some conditions when it has proved difficult to arrange an in-person interview with a sampled respondent, GSS interviews may be conducted by telephone.

Population target

The target population of the GSS is adults (18+) living in households in the United States.

Those unable to do the survey in either English or Spanish are out-of-scope.

Residents of institutions and group quarters are out-of-scope. Those with mental and/or physical conditions that prevent them from doing the survey, but who live in households are part of the target population and in-scope. It is really important that people with disabilities are in-scope so a more in depth study can be done with this group in focus.

In the reinterviews those who have died, moved out of the United States, or who no longer live in a household have left the target population and are out-of-scope.

Research type

The goal when this research was conducted was to gather data so the results could be generalized beyond the research itself and other circumstances could be analyzed. So we can say this project has generalization purpose.

This work

The Idea of this project is to provide a visualization on how the habit of watching TV can impact people’s incomes.
Furthermore point some differences between men and woman regarding annual income, financial satisfaction and hours watching TV per day.

Research questions

Due to the fact that people are watching more TV a research regarding this habit is useful to point out some consequences and plausible conclusions to motivate a lifestyle change.

Question 1

Who watches more tv on average? Men or woman?

This data would allow TV channels to better advertise and target audience.

Question 2

Are men more satisfied finacialy than woman on average?

The goal in this question was to find out if there is a significant difference between men and woman financial satisfaction. Despite the variable that represents the financial satisfaction level has three levels we are considering two only: Satisfied and Not satisfied at all. The level More or less was added to Satisfied level since they have closer meanings.

Question 3

Is there a relationship between the amount of tv hours per day and income?

Here we can demonstrate if one’s income is affected by the amount of TV they watch per day.

Note: Dependent Variable x Independent Variable:

Independent variable: This is the x axis in a chart. Also known as predictor variable it represents the variable where changes happen.
Dependent variable: This is the y axis in a chart. Also known as outcome variable it represents the changes that happen due to changes in the independent variable axis.

Exploratory data analysis [ EDA ]

In this section we would understand the main characteristics of the data by visualization. The charts would allow us to gain insights into the data, identify trends, check outliers and assess assumptions on which our analysis will be based.

Variables used in this research

Variable	Type	Meaning	Range or Levels
`age`	Interval	Age of respondent	`18 - 89`
`sex`	Dichotomous	Gender of respondent	`Male`, `Female`
`tvhours`	Interval	Hours the respondent watch TV per day	`0 - 24`
`rincome16`	Ordinal	Level of respondent income	`lt $1000` - `$170000 or more` with intervals
`satfin`	Dichotomous	Level of financial satisfaction	`Satisfied`, `Not satisfied`

Note: You might be wondering why the types used were interval instead of continuous and dichotomous instead of categorical in the table above. It is worth an explanation for clarification purposes.

A categorical variable can be categorized as nominal, ordinal or dichotomous.

Nominal: Two or more categories but there is no order. E.g., Skin color;
Ordinal: Two or more categories but there is an order (rank). E.g., Shirt sizes;
Dichotomous: Only two categories and there is no order. E.g., Respondent gender;

A continuous (quantitative) variable can be categorizes as interval or ratio.

Interval: Equal intervals between between measures. E.g., Temperatures. The difference between 9 and 10 degrees is 1. And the difference between 20 and 21 is also 1;
Ratio: Is an interval variable but with the condition that a value 0 represents the absence of that variable. E.g., Temperature in Kelvin;

EDA - Age x Sex

Interquartile Range [ IQR ]

Both sex groups have the same interquartile ranges and there is no outlier.
The younger participants were 18 years old, as it is the minimum age to participate to the survey, and the maximum were 89.
On average participants were roughly 49 years old.

75% of respondents are less than 62 years old.
50% of respondents are between 33 and 62 years old.
25% of respondents are less than 33 years old.

by(gss$age, gss$sex, summary)

## gss$sex: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   33.00   44.00   44.61   56.00   89.00 
## -------------------------------------------------------- 
## gss$sex: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   32.00   43.00   43.45   55.00   86.00

Boxplot

The colorful rectangle represents 50% of the data.
The vertical lines above and below the rectangles represents 25% of the data each.
The middle line represents the median.
The x close to the middle lines represents the mean.

ggplot(data = gss, aes(x = sex, y = age, fill = sex)) +
    geom_boxplot() +
    guides(fill = FALSE) +
    xlab('Sex') +
    ylab('Age') +
    ggtitle('Age boxplot by sex') +
    scale_y_continuous(breaks = seq(min(gss$age), max(gss$age), 15)) +
    stat_summary(fun.y = mean, geom = 'point', shape = 4, size = 2) +
    theme_minimal()

Distribution

Since we got roughly the same IQR results for both groups the distributions are similar.
It is a bimodal distribution which means that it has two peaks. These values are the ones that occur most.
The solid red line represents the mean of ages and the dashed yellow line represents the median age. I.g, the age that cuts all the ages in half.

ggplot(aes(x = age), data = gss) +
    geom_histogram(aes(y = ..density..), binwidth = 2, color = 'black', fill = '#44679F') +
    geom_density(alpha = .3, fill = '#DDF5F7') +
    facet_grid(sex ~ .) +
    geom_vline(aes(xintercept = mean(age, na.rm = T)), color = 'red', linetype = 'solid', size = 1) +
    geom_vline(aes(xintercept = median(age, na.rm = T)), color = 'yellow', linetype = 'twodash', size = 1) +
    scale_x_continuous(limits = c(min(gss$age), max(gss$age) + 1), breaks = seq(min(gss$age), max(gss$age) + 1, 10)) +
    ggtitle('Age distribution by sex') +
    xlab('Age') +
    ylab('Density') +
    theme_minimal()

EDA - TV hours x Sex

Interquartile Range [ IQR ]

Both sex groups have the same interquartile ranges with outliers.
There are participants who do not watch TV and participants who watch TV for more than 20.
On average participants watch roughly 3 hours of TV per day.

75% of respondents watch less than 4 hours of TV per day;
50% of respondents watch between 1 and 4 hours of TV per day;
25% of respondents watch less than 1 hours of TV per day;

by(gss$tvhours, gss$sex, summary)

## gss$sex: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.436   3.000  20.000 
## -------------------------------------------------------- 
## gss$sex: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.392   3.000  24.000

Boxplot

The colorful rectangle represents 50% of the data.
The vertical lines above and below the rectangles represents 25% of the data each.
The dots represent outliers. I.g., an extreme value which we have to be more careful when evaluating.
The middle line represents the median.
The x close to the middle lines represents the mean.

ggplot(aes(x = sex, y = tvhours, fill = sex), data = gss) +
    geom_boxplot() +
    guides(fill = FALSE) +
    xlab('Sex') +
    ylab('Tv hours') +
    ggtitle('Tv hours boxplot by sex') +
    scale_y_continuous(breaks = seq(0, max(gss$tvhours), 2)) +
    stat_summary(fun.y = mean, geom = 'point', shape = 4, size = 2) +
    theme_minimal()

Distribution

Since we got roughly the same IQR results for both groups the distributions are similar.
It is a very right skewed distribution due to the fact that most people watch two or less hours of TV per day. The solid red line represents the mean and the dashed line represents the median. I.g, the point that cuts the data in half.

ggplot(aes(x = tvhours), data = gss) +
    geom_histogram(aes(y = ..density..), binwidth = 1, color = 'black', fill = '#44679F') +
    stat_density(adjust = 2, alpha = .2, color = 'black', fill = '#DDF5F7') +
    facet_grid(sex ~ .) +
    geom_vline(aes(xintercept = mean(tvhours, na.rm = T)), color = 'red', linetype = 'solid', size = 1) +
    geom_vline(aes(xintercept = median(tvhours, na.rm = T)), color = 'yellow', linetype = 'twodash', size = 1) +
    scale_x_continuous(limits = c(0, max(gss$tvhours) + 1), breaks = seq(0, max(gss$tvhours) + 1, 3)) +
    xlab('Tv hours') +
    ylab('Density') +
    ggtitle('Tv hours distribution by sex') +
    theme_minimal()

EDA - Income x Sex

Bubble chart

By looking at the chart below we can see both distributions are right skewed due to the fact that most occurrences happen in the upper part. I.g., most distant from 0.

The solid red lines and dashed purple lines represent the range of most income occurrences for male and female respectively.

ggplot(data = gss) +
    geom_count(aes(x = sex, y = rincom16), color = '#44679F') +
    xlab('Sex') +
    ylab('Anual income') +
    ggtitle('Anual income bar chart by sex') +
    geom_hline(yintercept = 23, color = 'red') +
    geom_hline(yintercept = 12, color = 'red') +
    geom_hline(yintercept = 21, color = 'purple', linetype = 'dashed') +
    geom_hline(yintercept = 8, color = 'purple', linetype = 'dashed') +
    theme_minimal()

EDA - Financial satisfaction x Sex

As it was explained in the Research questions, question 3 the levels satisfied and more or less were merged.
By the bar chart we can assess that men are more financially satisfied than woman. And as the bubble chart above showed men have higher incomes. That might be due to gender discrimination or the kind of jobs performed by each respondent.

gss$satfin[gss$satfin == 'more or less' ] <- 'satisfied'
satfinBySex <- data.frame(gss$satfin, gss$sex)
ggplot(aes(gss$satfin, ..count..), data = satfinBySex) +
    geom_bar(aes(fill = gss$sex), position = 'dodge') +
    xlab('Level of financial satisfaction') +
    ylab('Respondents')+
    ggtitle('Level of financial satisfaction bar chart') +
    guides(fill = guide_legend(title = 'Sex'))

EDA - Tv hours x Anual income

The graph has 6 as the maximum TV hours so we do not consider upper outliers.
As we can see most people with the higher incomes watch from 1 to 3 hours of TV per day. It means that watching TV helps somehow to increase income. But watching too much or too little might not be helpful.

df = data.frame(gss$rincom16, gss$tvhours)
ggplot(aes(gss$tvhours, ..count..), data = df) +
    geom_bar(aes(fill = gss$rincom16), position = 'dodge')+
    scale_x_continuous(limits = c(0, 6), breaks = seq(0, 6, 0.5)) +
    xlab('Tv hours') +
    ylab('Respondents') +
    guides(fill = guide_legend(title = 'Anual income')) +
    theme_minimal()

Inferences

Question 1 - Who watches more tv on average? Men or woman?

Method

Difference of two means was chosen since we are testing the means of different groups. Men and Women average hours watching TV per day.

Conditions

$n_1 > 5$
$n_2 > 5$
$n_1$ and $n_2$ roughly the same size
No outliers

Inference

$CI = 0.95$
$\alpha = 0.05$

new_gss = gss
new_gss$rincom16 = factor(new_gss$rincom16)
#new_gss$tvhours = factor(new_gss$tvhours)
men = subset(new_gss, new_gss$sex == 'male')
men = men$tvhours[!men$tvhours %in% boxplot.stats(men$tvhours)$out]
woman = subset(new_gss, new_gss$sex == 'female')
woman = woman$tvhours[!woman$tvhours %in% boxplot.stats(woman$tvhours)$out]
t_test_tv = t.test(men, woman)
df = t_test_tv$parameter
tcrit = qt(0.025, df = df)
dum = seq(-3.5, 3.5, length = 10^4) #For the plot

plot(dum, dt(dum, df = df), type = 'l', xlab = 't', ylab = 'f(t)') +
    abline(v = t_test_tv$statistic, lty = 5, col = 'red') +
    abline(v = tcrit, lty = 1) +
    abline(v = -tcrit, lty = 1)

## numeric(0)

$p =$ 0.7335475

Interpretation

The dotted red line represents the $t$ value which is the measurement of the size of the difference relative to the variation in the sample data.
The solid black lines represent the $t^*$ value (critical value) which is the point on the distribution that is compared to determine whether to reject the null hypothesis.

As $p > 0.05$ we reject the $H_0$ or null hypothesis.

Conclusion

We are 95% confident that there is no significant difference on how much TV men and women watch on average per day.

Question 2 - Men more satisfied finacialy than woman.

Here we have two proportions. The financial satisfaction rate in men and women. And have to compare that in order to find whether the hypothesis is true.

table(male$satfin)

## 
##      satisfied not at all sat 
##            393            128

table(female$satfin)

## 
##      satisfied not at all sat 
##            391            153

z = 0.0538 / sqrt((0.73 * 0.27 / 521) + (0.73 * 0.27 / 544))
2 * pnorm(-abs(z))

## [1] 0.04805408

$H_0: pMen - pWomen = 0$

Pooled proportion:
$\hat{p} = \frac{\#countYes}{totalCount} =$ 0.7361502

Conditions:

Independence: Ok;
Success-Failure:
$n\hat{p} ≥ 10$ and $n( 1 - \hat{p} ) ≥ 10$

Men group:
$521 \times 0.73 =$ 383.5342723 $≥ 10$
$521 (1 - 0.73) =$ 137.4657277 $≥ 10$

Women group:
$544 \times 0.73 =$ 400.4657277 $≥ 10$
$544 (1 - 0.73) =$ 143.5342723 $≥ 10$

Point estimate

Formula: $\hat{p}_x = \frac{\#success}{n_x}$
$\hat{p}Men - \hat{p}Women = \frac{393}{521} - \frac{391}{544}$ 0.0355686
Men are 3.56% more financially satisfied than women.

Standard error

$SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2}}$
$SE = \sqrt{\frac{0.1971}{521} + \frac{0.1971}{544}} =$ 0.0270158

P value

$Z = \frac{pointEstimate - H_0}{SE}$
$Z = \frac{0.0355 - 0}{0.027}$ 1.316585

$p =$ 0.1879778

Conclusion

As $p > 0.05$ We fail to reject $H_0$. We are 95% confident that men and woman have the same level of financial satisfaction.

Question 3 - Is there a relationship between the how much TV one’s watch per day and anual income?

Chi-square Independence:

$H_0:$ tvhours and income are independent
$H_A:$ tvhours and income are dependent

new_gss = gss
new_gss$rincom16 = factor(new_gss$rincom16) # Removed the levels with 0 frequency
tv_income = table(new_gss$tvhours, new_gss$rincom16)

#chisq.test(tv_income)$expected # Test if there is at least 5 in each cell
## As we did not meet the expectations of at least 5 in each cell. Due to that we will let the correct param in the above function as #TRUE, the default.
(xsqr = chisq.test(tv_income, simulate.p.value = T))

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  tv_income
## X-squared = 453.51, df = NA, p-value = 0.01149

Note: Simulation was used due to too small expected values and it would be given some rstudio warnings.

Conclusion

Because $p =$ 0.0114943 $< 0.05$ we conclude that these variables are dependent.

Conclusion

As the study shows it is good to watch about one to three hours of TV per day. This habit and annual income are related. Men watch TV as much as women and both are equally financially satisfied.

References

Smith, Tom W, Peter Marsden, Michael Hout, and Jibum Kim. General Social Surveys, 1972-2014 [machine-readable data file] /Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Co-Principal Investigator, Michael Hout; Sponsored by National Science Foundation. –NORC ed.– Chicago: NORC at the University of Chiago [producer]; Storrs, CT: The Roper Center for Public Opinion Research, University of Connecticut [distributor], 2015. 1 data file (57,061 logical records) + 1 codebook (3,567p.). – (National Data Program for the Social Sciences, No. 22).

Statistics How to

Institute for Digital Research and Edutacion

R bloggers

R for data science

Inferential Statistics Final Project

Victor Augusto

5/2/2017

This work

Research questions

Question 1

Question 2

Question 3

Exploratory data analysis [ EDA ]

Variables used in this research

EDA - Age x Sex

Interquartile Range [ IQR ]

Boxplot

Distribution

EDA - TV hours x Sex

Interquartile Range [ IQR ]

Boxplot

Distribution

EDA - Income x Sex

Bubble chart

EDA - Financial satisfaction x Sex

EDA - Tv hours x Anual income

Inferences

Question 1 - Who watches more tv on average? Men or woman?

Method

Conditions

Inference

Interpretation

Conclusion

Question 2 - Men more satisfied finacialy than woman.

Conditions:

Point estimate

Standard error

P value

Conclusion

Question 3 - Is there a relationship between the how much TV one’s watch per day and anual income?

Conclusion

Conclusion

References