The Idea of this project is to provide a visualization on how the habit of watching TV can impact people’s incomes.
Furthermore point some differences between men and woman regarding annual income, financial satisfaction and hours watching TV per day.
Due to the fact that people are watching more TV a research regarding this habit is useful to point out some consequences and plausible conclusions to motivate a lifestyle change.
Who watches more tv on average? Men or woman?
This data would allow TV channels to better advertise and target audience.
Are men more satisfied finacialy than woman on average?
The goal in this question was to find out if there is a significant difference between men and woman financial satisfaction. Despite the variable that represents the financial satisfaction level has three levels we are considering two only: Satisfied and Not satisfied at all. The level More or less was added to Satisfied level since they have closer meanings.
Is there a relationship between the amount of tv hours per day and income?
Here we can demonstrate if one’s income is affected by the amount of TV they watch per day.
In this section we would understand the main characteristics of the data by visualization. The charts would allow us to gain insights into the data, identify trends, check outliers and assess assumptions on which our analysis will be based.
| Variable | Type | Meaning | Range or Levels |
|---|---|---|---|
age |
Interval | Age of respondent | 18 - 89 |
sex |
Dichotomous | Gender of respondent | Male, Female |
tvhours |
Interval | Hours the respondent watch TV per day | 0 - 24 |
rincome16 |
Ordinal | Level of respondent income | lt $1000 - $170000 or more with intervals |
satfin |
Dichotomous | Level of financial satisfaction | Satisfied, Not satisfied |
Note: You might be wondering why the types used were interval instead of continuous and dichotomous instead of categorical in the table above. It is worth an explanation for clarification purposes.
A categorical variable can be categorized as nominal, ordinal or dichotomous.Both sex groups have the same interquartile ranges and there is no outlier.
The younger participants were 18 years old, as it is the minimum age to participate to the survey, and the maximum were 89.
On average participants were roughly 49 years old.
75% of respondents are less than 62 years old.
50% of respondents are between 33 and 62 years old.
25% of respondents are less than 33 years old.
by(gss$age, gss$sex, summary)
## gss$sex: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 44.00 44.61 56.00 89.00
## --------------------------------------------------------
## gss$sex: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 32.00 43.00 43.45 55.00 86.00
The colorful rectangle represents 50% of the data.
The vertical lines above and below the rectangles represents 25% of the data each.
The middle line represents the median.
The x close to the middle lines represents the mean.
ggplot(data = gss, aes(x = sex, y = age, fill = sex)) +
geom_boxplot() +
guides(fill = FALSE) +
xlab('Sex') +
ylab('Age') +
ggtitle('Age boxplot by sex') +
scale_y_continuous(breaks = seq(min(gss$age), max(gss$age), 15)) +
stat_summary(fun.y = mean, geom = 'point', shape = 4, size = 2) +
theme_minimal()
Since we got roughly the same IQR results for both groups the distributions are similar.
It is a bimodal distribution which means that it has two peaks. These values are the ones that occur most.
The solid red line represents the mean of ages and the dashed yellow line represents the median age. I.g, the age that cuts all the ages in half.
ggplot(aes(x = age), data = gss) +
geom_histogram(aes(y = ..density..), binwidth = 2, color = 'black', fill = '#44679F') +
geom_density(alpha = .3, fill = '#DDF5F7') +
facet_grid(sex ~ .) +
geom_vline(aes(xintercept = mean(age, na.rm = T)), color = 'red', linetype = 'solid', size = 1) +
geom_vline(aes(xintercept = median(age, na.rm = T)), color = 'yellow', linetype = 'twodash', size = 1) +
scale_x_continuous(limits = c(min(gss$age), max(gss$age) + 1), breaks = seq(min(gss$age), max(gss$age) + 1, 10)) +
ggtitle('Age distribution by sex') +
xlab('Age') +
ylab('Density') +
theme_minimal()
Both sex groups have the same interquartile ranges with outliers.
There are participants who do not watch TV and participants who watch TV for more than 20.
On average participants watch roughly 3 hours of TV per day.
75% of respondents watch less than 4 hours of TV per day;
50% of respondents watch between 1 and 4 hours of TV per day;
25% of respondents watch less than 1 hours of TV per day;
by(gss$tvhours, gss$sex, summary)
## gss$sex: male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.436 3.000 20.000
## --------------------------------------------------------
## gss$sex: female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.392 3.000 24.000
The colorful rectangle represents 50% of the data.
The vertical lines above and below the rectangles represents 25% of the data each.
The dots represent outliers. I.g., an extreme value which we have to be more careful when evaluating.
The middle line represents the median.
The x close to the middle lines represents the mean.
ggplot(aes(x = sex, y = tvhours, fill = sex), data = gss) +
geom_boxplot() +
guides(fill = FALSE) +
xlab('Sex') +
ylab('Tv hours') +
ggtitle('Tv hours boxplot by sex') +
scale_y_continuous(breaks = seq(0, max(gss$tvhours), 2)) +
stat_summary(fun.y = mean, geom = 'point', shape = 4, size = 2) +
theme_minimal()
Since we got roughly the same IQR results for both groups the distributions are similar.
It is a very right skewed distribution due to the fact that most people watch two or less hours of TV per day. The solid red line represents the mean and the dashed line represents the median. I.g, the point that cuts the data in half.
ggplot(aes(x = tvhours), data = gss) +
geom_histogram(aes(y = ..density..), binwidth = 1, color = 'black', fill = '#44679F') +
stat_density(adjust = 2, alpha = .2, color = 'black', fill = '#DDF5F7') +
facet_grid(sex ~ .) +
geom_vline(aes(xintercept = mean(tvhours, na.rm = T)), color = 'red', linetype = 'solid', size = 1) +
geom_vline(aes(xintercept = median(tvhours, na.rm = T)), color = 'yellow', linetype = 'twodash', size = 1) +
scale_x_continuous(limits = c(0, max(gss$tvhours) + 1), breaks = seq(0, max(gss$tvhours) + 1, 3)) +
xlab('Tv hours') +
ylab('Density') +
ggtitle('Tv hours distribution by sex') +
theme_minimal()
By looking at the chart below we can see both distributions are right skewed due to the fact that most occurrences happen in the upper part. I.g., most distant from 0.
The solid red lines and dashed purple lines represent the range of most income occurrences for male and female respectively.
ggplot(data = gss) +
geom_count(aes(x = sex, y = rincom16), color = '#44679F') +
xlab('Sex') +
ylab('Anual income') +
ggtitle('Anual income bar chart by sex') +
geom_hline(yintercept = 23, color = 'red') +
geom_hline(yintercept = 12, color = 'red') +
geom_hline(yintercept = 21, color = 'purple', linetype = 'dashed') +
geom_hline(yintercept = 8, color = 'purple', linetype = 'dashed') +
theme_minimal()
As it was explained in the Research questions, question 3 the levels satisfied and more or less were merged.
By the bar chart we can assess that men are more financially satisfied than woman. And as the bubble chart above showed men have higher incomes. That might be due to gender discrimination or the kind of jobs performed by each respondent.
gss$satfin[gss$satfin == 'more or less' ] <- 'satisfied'
satfinBySex <- data.frame(gss$satfin, gss$sex)
ggplot(aes(gss$satfin, ..count..), data = satfinBySex) +
geom_bar(aes(fill = gss$sex), position = 'dodge') +
xlab('Level of financial satisfaction') +
ylab('Respondents')+
ggtitle('Level of financial satisfaction bar chart') +
guides(fill = guide_legend(title = 'Sex'))
The graph has 6 as the maximum TV hours so we do not consider upper outliers.
As we can see most people with the higher incomes watch from 1 to 3 hours of TV per day. It means that watching TV helps somehow to increase income. But watching too much or too little might not be helpful.
df = data.frame(gss$rincom16, gss$tvhours)
ggplot(aes(gss$tvhours, ..count..), data = df) +
geom_bar(aes(fill = gss$rincom16), position = 'dodge')+
scale_x_continuous(limits = c(0, 6), breaks = seq(0, 6, 0.5)) +
xlab('Tv hours') +
ylab('Respondents') +
guides(fill = guide_legend(title = 'Anual income')) +
theme_minimal()
Difference of two means was chosen since we are testing the means of different groups. Men and Women average hours watching TV per day.
\(n_1 > 5\)
\(n_2 > 5\)
\(n_1\) and \(n_2\) roughly the same size
No outliers
\(CI = 0.95\)
\(\alpha = 0.05\)
new_gss = gss
new_gss$rincom16 = factor(new_gss$rincom16)
#new_gss$tvhours = factor(new_gss$tvhours)
men = subset(new_gss, new_gss$sex == 'male')
men = men$tvhours[!men$tvhours %in% boxplot.stats(men$tvhours)$out]
woman = subset(new_gss, new_gss$sex == 'female')
woman = woman$tvhours[!woman$tvhours %in% boxplot.stats(woman$tvhours)$out]
t_test_tv = t.test(men, woman)
df = t_test_tv$parameter
tcrit = qt(0.025, df = df)
dum = seq(-3.5, 3.5, length = 10^4) #For the plot
plot(dum, dt(dum, df = df), type = 'l', xlab = 't', ylab = 'f(t)') +
abline(v = t_test_tv$statistic, lty = 5, col = 'red') +
abline(v = tcrit, lty = 1) +
abline(v = -tcrit, lty = 1)
## numeric(0)
\(p =\) 0.7335475
The dotted red line represents the \(t\) value which is the measurement of the size of the difference relative to the variation in the sample data.
The solid black lines represent the \(t^*\) value (critical value) which is the point on the distribution that is compared to determine whether to reject the null hypothesis.
As \(p > 0.05\) we reject the \(H_0\) or null hypothesis.
We are 95% confident that there is no significant difference on how much TV men and women watch on average per day.
Here we have two proportions. The financial satisfaction rate in men and women. And have to compare that in order to find whether the hypothesis is true.
table(male$satfin)
##
## satisfied not at all sat
## 393 128
table(female$satfin)
##
## satisfied not at all sat
## 391 153
z = 0.0538 / sqrt((0.73 * 0.27 / 521) + (0.73 * 0.27 / 544))
2 * pnorm(-abs(z))
## [1] 0.04805408
\(H_0: pMen - pWomen = 0\)
Pooled proportion:
\(\hat{p} = \frac{\#countYes}{totalCount} =\) 0.7361502
Independence: Ok;
Success-Failure:
\(n\hat{p} ≥ 10\) and \(n( 1 - \hat{p} ) ≥ 10\)
Men group:
\(521 \times 0.73 =\) 383.5342723 \(≥ 10\)
\(521 (1 - 0.73) =\) 137.4657277 \(≥ 10\)
Women group:
\(544 \times 0.73 =\) 400.4657277 \(≥ 10\)
\(544 (1 - 0.73) =\) 143.5342723 \(≥ 10\)
Formula: \(\hat{p}_x = \frac{\#success}{n_x}\)
\(\hat{p}Men - \hat{p}Women = \frac{393}{521} - \frac{391}{544}\) 0.0355686
Men are 3.56% more financially satisfied than women.
\(SE = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2}}\)
\(SE = \sqrt{\frac{0.1971}{521} + \frac{0.1971}{544}} =\) 0.0270158
\(Z = \frac{pointEstimate - H_0}{SE}\)
\(Z = \frac{0.0355 - 0}{0.027}\) 1.316585
\(p =\) 0.1879778
As \(p > 0.05\) We fail to reject \(H_0\). We are 95% confident that men and woman have the same level of financial satisfaction.
Chi-square Independence:
\(H_0:\) tvhours and income are independent
\(H_A:\) tvhours and income are dependent
new_gss = gss
new_gss$rincom16 = factor(new_gss$rincom16) # Removed the levels with 0 frequency
tv_income = table(new_gss$tvhours, new_gss$rincom16)
#chisq.test(tv_income)$expected # Test if there is at least 5 in each cell
## As we did not meet the expectations of at least 5 in each cell. Due to that we will let the correct param in the above function as #TRUE, the default.
(xsqr = chisq.test(tv_income, simulate.p.value = T))
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: tv_income
## X-squared = 453.51, df = NA, p-value = 0.01149
Note: Simulation was used due to too small expected values and it would be given some rstudio warnings.
Because \(p =\) 0.0114943 \(< 0.05\) we conclude that these variables are dependent.
As the study shows it is good to watch about one to three hours of TV per day. This habit and annual income are related. Men watch TV as much as women and both are equally financially satisfied.
Smith, Tom W, Peter Marsden, Michael Hout, and Jibum Kim. General Social Surveys, 1972-2014 [machine-readable data file] /Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Co-Principal Investigator, Michael Hout; Sponsored by National Science Foundation. –NORC ed.– Chicago: NORC at the University of Chiago [producer]; Storrs, CT: The Roper Center for Public Opinion Research, University of Connecticut [distributor], 2015. 1 data file (57,061 logical records) + 1 codebook (3,567p.). – (National Data Program for the Social Sciences, No. 22).