Executive Summary

This exploratory data analysis was conducted by Ruchi Sharma, an undergraduate student at the Indian Institute of Technology Roorkee, using the Behavioral Risk Factor Surveillance System (BRFSS) 2013 data. It is the course project for the online course ‘Introduction to Probability and Data’, the first module of the Statistics with R Specialization offered by Duke University on Coursera.

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

About BRFSS

BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US.

Risk Factors

A risk factor is any attribute, characteristic or exposure of an individual that increases the likelihood of developing a disease or injury. Some examples of the more important risk factors are underweight, unsafe sex, high blood pressure, tobacco and alcohol consumption, and unsafe water, sanitation and hygiene.

Project Objective

The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases affecting the adult population. Data are collected through monthly telephone interviews. Since 2011, BRFSS has conducted both landline and cellular telephone-based surveys.

Random sampling was used for data collection, so findings from the sample can be generalized to the population. However, this is an observational study with no random assignment, so causal inferences cannot be drawn from the data.


Part 2: Research questions

Research question 1: Is a respondent’s opinion of their health status related to their Body Mass Index (BMI)? Is there any difference on the basis of Internet Use In The Past 30 Days?

This question looks for a link between BMI, which is a defined measure of health, and the respondent’s own opinion of their health. It further relates internet usage to health status.

Research question 2: What is the relationship between age and the smoking habits of respondents?

This question helps us understand whether age is a dominant factor in whether people smoke.

Research question 3: Does the time of year have any effect on one’s general health?

Different interview months may be grouped into different seasons to understand if one’s general health is affected by seasons. Here, I chose to check this for the Summer season.
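
The grouping of months into seasons described above can be sketched in base R as a named lookup vector. This is only an illustration: the season boundaries (meteorological, Northern Hemisphere) and the names `season_of` and `season` are assumptions, and the `imonth` values here are made up rather than taken from BRFSS.

```r
# Hypothetical month-to-season lookup (Northern Hemisphere, meteorological seasons).
season_of <- c(
  December = "Winter", January = "Winter", February = "Winter",
  March = "Spring", April = "Spring", May = "Spring",
  June = "Summer", July = "Summer", August = "Summer",
  September = "Fall", October = "Fall", November = "Fall"
)

# Made-up interview months standing in for the imonth column.
imonth <- factor(c("January", "June", "October", "July"))

# Convert to character before indexing so month names, not factor codes, are used.
season <- unname(season_of[as.character(imonth)])
season
```

Assuming the real `imonth` column holds full English month names, the same lookup could be applied to it to compare all four seasons rather than summer alone.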


Part 3: Exploratory data analysis

Research question 1: Is a respondent’s opinion of their health status related to their Body Mass Index (BMI)? Is there any difference on the basis of Internet Use In The Past 30 Days?

q1 <- select(brfss2013,genhlth,internet,X_bmi5cat) %>% na.omit()
prop.table(table(q1$genhlth, q1$X_bmi5cat), 2)
##            
##             Underweight Normal weight Overweight      Obese
##   Excellent  0.19945959    0.26029673 0.17368377 0.07928978
##   Very good  0.26406288    0.35088062 0.35428999 0.26843448
##   Good       0.26185212    0.24659382 0.30684152 0.37082956
##   Fair       0.15843773    0.09741254 0.11937323 0.19917096
##   Poor       0.11618767    0.04481628 0.04581149 0.08227521
g1 <- ggplot(q1) + aes(x=X_bmi5cat, fill=genhlth) + geom_bar(position = "fill")
g1 <- g1 + xlab("BMI category") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g1

g1 <- ggplot(q1) + aes(x=internet, fill=genhlth) + geom_bar(position = "fill") +facet_grid(.~X_bmi5cat)
g1 <- g1 + xlab("BMI Category per internet usage in past 30 days") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g1

As might be expected, the largest proportion of “Excellent” health status is seen in the “Normal weight” BMI category. In general, the majority of people in every BMI category rate their health as either “Good” or “Very good”.
Internet usage in the past 30 days does make some difference, though it is unexpected that the proportion of “Poor” health is highest for underweight people who did not use the internet in the past 30 days. Since a considerable difference is seen only at the extremes, no general effect of internet usage can be claimed from these data; for most of the dataset it appears to make little difference.
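
The faceted plot is read by eye; a three-way proportion table can put numbers on the same comparison. The sketch below uses made-up counts (not BRFSS figures) and only two levels per variable to stay short; the names `counts`, `tab3` and `prop3` are hypothetical.

```r
# Made-up counts for illustration only; these are not BRFSS figures.
counts <- expand.grid(
  genhlth   = c("Good", "Poor"),
  internet  = c("Yes", "No"),
  X_bmi5cat = c("Underweight", "Normal weight"),
  stringsAsFactors = FALSE
)
counts$n <- c(80, 20, 60, 40, 90, 10, 85, 15)

# Build a genhlth x internet x BMI contingency table from the counts.
tab3 <- xtabs(n ~ genhlth + internet + X_bmi5cat, data = counts)

# Proportions of each health rating within every (internet, BMI) combination,
# which is what each stacked bar in the faceted plot displays.
prop3 <- prop.table(tab3, margin = c(2, 3))
prop3
```

On the real data, the equivalent call would presumably be `prop.table(table(q1$genhlth, q1$internet, q1$X_bmi5cat), c(2, 3))`.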

Research question 2: What is the relationship between age and the smoking habits of respondents?

q2 <- select(brfss2013, X_rfsmok3, X_ageg5yr) %>% na.omit()
prop.table(table(q2$X_ageg5yr, q2$X_rfsmok3), 2)
##                  
##                           No        Yes
##   Age 18 to 24    0.05308642 0.06757075
##   Age 25 to 29    0.04270450 0.06783281
##   Age 30 to 34    0.05113428 0.07933700
##   Age 35 to 39    0.05509411 0.06974581
##   Age 40 to 44    0.06267031 0.07517034
##   Age 45 to 49    0.07070108 0.09394654
##   Age 50 to 54    0.09057092 0.13022799
##   Age 55 to 59    0.10372320 0.13172170
##   Age 60 to 64    0.11135748 0.10662998
##   Age 65 to 69    0.10690267 0.08324161
##   Age 70 to 74    0.08843443 0.05052411
##   Age 75 to 79    0.06858984 0.02608753
##   Age 80 or older 0.09503077 0.01796384
g2 <- ggplot(q2) + aes(fill=X_ageg5yr, x=X_rfsmok3) + geom_bar(position = "fill") 
g2 <- g2 + scale_fill_discrete(name = "Age group") + ylab("proportion") + xlab("Currently a smoker")
g2

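The largest shares of smokers fall in the 50 to 54 and 55 to 59 age groups (about 13% each), while non-smokers are most concentrated in the 55 to 69 range, where each five-year group holds 10 to 11%. Note that the table gives the age distribution within each smoking status, not the smoking rate within each age group. The differences can be seen in the bar plot, and the exact values are given in the table.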
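
The table above uses column proportions (`margin = 2`), so it describes how smokers and non-smokers are spread across age groups. To ask how common smoking is within each age group, row proportions (`margin = 1`) are needed instead. The sketch below contrasts the two on made-up counts; the data frame `toy` and its values are purely illustrative.

```r
# Toy data with made-up counts; the labels mimic X_ageg5yr and X_rfsmok3.
toy <- data.frame(
  age   = rep(c("Age 18 to 24", "Age 50 to 54", "Age 80 or older"),
              times = c(100, 200, 100)),
  smoke = c(rep(c("No", "Yes"), times = c(80, 20)),
            rep(c("No", "Yes"), times = c(160, 40)),
            rep(c("No", "Yes"), times = c(95, 5)))
)
tab <- table(toy$age, toy$smoke)

# Column proportions: age distribution within each smoking status.
col_prop <- prop.table(tab, 2)

# Row proportions: share of smokers within each age group.
row_prop <- prop.table(tab, 1)

col_prop
row_prop
```

In this toy example the 50 to 54 group holds the largest share of smokers only because it is the largest group; its smoking rate equals that of the youngest group. On the real data, `prop.table(table(q2$X_ageg5yr, q2$X_rfsmok3), 1)` would give the rate per age group.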

Research question 3: Does the time of year have any effect on one’s general health?

summer <- c("June", "July", "August")

q3 <- select(brfss2013, imonth, genhlth) %>% 
  na.omit() %>%
  mutate(summer= imonth %in% summer)
prop.table(table(q3$genhlth, q3$summer), 2)
##            
##                  FALSE       TRUE
##   Excellent 0.17551062 0.17161053
##   Very good 0.32475278 0.32487704
##   Good      0.30710127 0.30823455
##   Fair      0.13580769 0.13749909
##   Poor      0.05682764 0.05777879
g3 <- ggplot(q3) + aes(x=genhlth, fill=summer) + geom_bar(position = "fill")
g3 <- g3 + scale_fill_discrete(name = "Summer") + ylab("proportion") + xlab("Reported General Health")
g3

The proportions in the table are nearly identical for summer and non-summer interviews: the largest gap, for “Excellent”, is about 0.4 percentage points. These data therefore show no notable difference in reported general health during the summer months.