This exploratory data analysis was conducted by Ruchi Sharma, an undergraduate student at the Indian Institute of Technology Roorkee, using the Behavioral Risk Factor Surveillance System (BRFSS) 2013 data, as part of the course project for the online course ‘Introduction to Probability and Data’, the first module of the Statistics with R Specialization offered by Duke University on Coursera.
library(ggplot2)
library(dplyr)
load("brfss2013.RData")
BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US.
A risk factor is any attribute, characteristic or exposure of an individual that increases the likelihood of developing a disease or injury. Some examples of the more important risk factors are underweight, unsafe sex, high blood pressure, tobacco and alcohol consumption, and unsafe water, sanitation and hygiene.
The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Data are collected through monthly telephone interviews. Since 2011, BRFSS has conducted both landline and cellular telephone surveys.
Random sampling was used for data collection, so results from the sample can be generalized to the population. However, this is an observational study with no random assignment, so causal inferences cannot be drawn from the data.
Research question 1: Is a respondent’s opinion of their health status related to their Body Mass Index (BMI)? Does the relationship differ by internet use in the past 30 days?
This question looks for a link between BMI, which is a defined measure of health, and the respondent’s self-reported health. It further relates internet usage to health status.
Research question 2: What is the relationship between age and the smoking habits of respondents?
This question helps us understand whether age is a dominant factor in whether people smoke.
Research question 3: Does the time of year have any effect on one’s general health?
Different interview months may be grouped into different seasons to understand if one’s general health is affected by seasons. Here, I chose to check this for the Summer season.
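As a minimal sketch of the grouping idea, the helper below maps month names to seasons. The function name `season_of` is hypothetical, and the season boundaries (meteorological seasons for the Northern Hemisphere) are an assumption rather than something defined by BRFSS; the analysis itself only needs the Summer group.

```r
# Hypothetical helper: map month names to meteorological seasons.
# The season boundaries are an assumption (Northern Hemisphere).
season_of <- function(month) {
  lookup <- c(
    December = "Winter", January = "Winter", February = "Winter",
    March = "Spring", April = "Spring", May = "Spring",
    June = "Summer", July = "Summer", August = "Summer",
    September = "Fall", October = "Fall", November = "Fall"
  )
  # as.character() guards against factor input: indexing a named vector
  # with a factor would use its integer codes instead of its labels
  unname(lookup[as.character(month)])
}

# Toy input, not BRFSS data
months <- factor(c("January", "July", "October", "June"))
season_of(months)
```

Applied to an interview-month column, the result could be tabulated against general health in the same way as any other categorical variable.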
Research question 1: Is a respondent’s opinion of their health status related to their Body Mass Index (BMI)? Does the relationship differ by internet use in the past 30 days?
q1 <- select(brfss2013,genhlth,internet,X_bmi5cat) %>% na.omit()
prop.table(table(q1$genhlth, q1$X_bmi5cat), 2)
##
##             Underweight Normal weight Overweight      Obese
##   Excellent  0.19945959    0.26029673 0.17368377 0.07928978
##   Very good  0.26406288    0.35088062 0.35428999 0.26843448
##   Good       0.26185212    0.24659382 0.30684152 0.37082956
##   Fair       0.15843773    0.09741254 0.11937323 0.19917096
##   Poor       0.11618767    0.04481628 0.04581149 0.08227521
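The table above is column-normalized: with `margin = 2`, `prop.table()` divides each cell by its column total, so each BMI category sums to 1 and the columns can be compared directly. A small sketch with made-up counts (not BRFSS data) shows the mechanics:

```r
# Made-up counts, for illustration only (not BRFSS data)
counts <- matrix(
  c(30, 50, 20,   # first column: "Normal"
    10, 40, 50),  # second column: "Obese"
  nrow = 3,
  dimnames = list(health = c("Good", "Fair", "Poor"),
                  bmi = c("Normal", "Obese"))
)

# margin = 2 divides each cell by its column total
col_props <- prop.table(counts, margin = 2)
col_props

# Each column of proportions sums to 1
colSums(col_props)
```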
g1 <- ggplot(q1) + aes(x=X_bmi5cat, fill=genhlth) + geom_bar(position = "fill")
g1 <- g1 + xlab("BMI category") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g1
g1 <- ggplot(q1) + aes(x=internet, fill=genhlth) + geom_bar(position = "fill") +facet_grid(.~X_bmi5cat)
g1 <- g1 + xlab("BMI Category per internet usage in past 30 days") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g1
Research question 2: What is the relationship between age and the smoking habits of respondents?
q2 <- select(brfss2013, X_rfsmok3, X_ageg5yr) %>% na.omit()
prop.table(table(q2$X_ageg5yr, q2$X_rfsmok3), 2)
##
##                           No        Yes
##   Age 18 to 24    0.05308642 0.06757075
##   Age 25 to 29    0.04270450 0.06783281
##   Age 30 to 34    0.05113428 0.07933700
##   Age 35 to 39    0.05509411 0.06974581
##   Age 40 to 44    0.06267031 0.07517034
##   Age 45 to 49    0.07070108 0.09394654
##   Age 50 to 54    0.09057092 0.13022799
##   Age 55 to 59    0.10372320 0.13172170
##   Age 60 to 64    0.11135748 0.10662998
##   Age 65 to 69    0.10690267 0.08324161
##   Age 70 to 74    0.08843443 0.05052411
##   Age 75 to 79    0.06858984 0.02608753
##   Age 80 or older 0.09503077 0.01796384
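The table above gives the age distribution within smokers and within non-smokers. A complementary view, offered here only as a sketch, is the share of smokers within each age group, which `prop.table()` returns with `margin = 1`. The data frame `toy` and its values are made up for illustration and are not BRFSS data:

```r
# Made-up responses, for illustration only (not BRFSS data)
toy <- data.frame(
  age    = c("18-24", "18-24", "18-24", "18-24", "65+", "65+", "65+", "65+"),
  smoker = c("Yes", "Yes", "No", "No", "No", "No", "No", "Yes")
)

# margin = 1 normalizes within rows, so each age group sums to 1
row_props <- prop.table(table(toy$age, toy$smoker), margin = 1)
row_props
```

Read this way, each row answers "what fraction of this age group smokes?", which does not depend on how many respondents fall in each age group.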
g2 <- ggplot(q2) + aes(fill=X_ageg5yr, x=X_rfsmok3) + geom_bar(position = "fill")
g2 <- g2 + scale_fill_discrete(name = "Age group") + ylab("proportion") + xlab("Currently a smoker")
g2
Research question 3: Does the time of year have any effect on one’s general health?
summer <- c("June", "July", "August")
q3 <- select(brfss2013, imonth, genhlth) %>%
  na.omit() %>%
  mutate(summer = imonth %in% summer)
prop.table(table(q3$genhlth, q3$summer), 2)
##
##                  FALSE       TRUE
##   Excellent 0.17551062 0.17161053
##   Very good 0.32475278 0.32487704
##   Good      0.30710127 0.30823455
##   Fair      0.13580769 0.13749909
##   Poor      0.05682764 0.05777879
g3 <- ggplot(q3) + aes(x=genhlth, fill=summer) + geom_bar(position = "fill")
g3 <- g3 + scale_fill_discrete(name = "Summer") + ylab("proportion") + xlab("Reported General Health")
g3