setwd("E:/Coursera/Statistics with R - Duke University/01_Introduction to Probability and Data/Project")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
load("brfss2013.RData")
According to the documentation, the data for this project come from the Behavioral Risk Factor Surveillance System (BRFSS). These are surveillance data on the risk behaviors of U.S. adults residing in all 50 states, the District of Columbia, Puerto Rico, and Guam. Overall, the data set contains 491,775 observations of 330 variables collected in 2013 and 2014 through telephone interviews. The following summaries show the years and states for which the data are available and the counts of observations across them:
summary(brfss2013$iyear)
## 2013 2014 NA's
## 486088 5682 5
summary(brfss2013$X_state)
## 0 Alabama Alaska
## 1 6505 4578
## Arizona Arkansas California
## 4253 5268 11518
## Colorado Connecticut Delaware
## 13649 7710 5206
## District of Columbia Florida Georgia
## 4931 33668 8658
## Hawaii Idaho Illinois
## 7858 5630 5608
## Indiana Iowa Kansas
## 10338 8157 23282
## Kentucky Louisiana Maine
## 10877 5215 8273
## Maryland Massachusetts Michigan
## 13011 15071 12761
## Minnesota Mississippi Missouri
## 14340 7453 7118
## Montana Nebraska Nevada
## 9693 17139 5101
## New Hampshire New Jersey New Mexico
## 6076 13776 9316
## New York North Carolina North Dakota
## 8979 8860 7806
## Ohio Oklahoma Oregon
## 11971 8244 5949
## Pennsylvania Rhode Island South Carolina
## 11429 6531 10717
## South Dakota Tennessee Texas
## 6895 5815 10917
## Utah Vermont Virginia
## 12769 6392 8464
## Washington West Virginia Wisconsin
## 11162 5899 6589
## Wyoming Guam Puerto Rico
## 6454 1897 5997
## 80
## 1
As follows from the BRFSS Data User Guide (2013), the data for the project are collected using random sampling techniques. More specifically, the dataset consists of two samples: a sample of landline telephone respondents and a sample of cell phone respondents. The landline sample uses disproportionate stratified sampling, and the cellular telephone sample uses random selection in which each respondent has an equal probability of selection. The sampling method also uses geographic stratification based on substate geographic regions.
Given the random sampling design, the sample should be broadly representative of the U.S. adult population, and conclusions derived from this sample can reasonably be extended to that population. Also, the sample size (at least 4,000 interviews per state) is large enough to achieve reasonably small margins of error. However, it must be noted that the sample contains data on telephone users only and omits adults who don’t use telephones. In addition, the generalizability of the sample is only as good as the quality and completeness of the lists of telephone numbers that the project relies upon. Finally, it is important to note that such data might be vulnerable to non-response bias.
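To give a rough sense of what a "reasonably small" margin of error means at this sample size, the calculation can be sketched as follows. This is an illustration only: it assumes simple random sampling and the worst case of a proportion equal to 0.5, and it ignores the BRFSS survey weights and design effects, so the real margins are likely somewhat larger. The function name `margin_of_error` is hypothetical and not part of the project code.

```r
# Approximate margin of error for a proportion under simple random sampling.
# Assumption: no design effect or weighting (BRFSS itself uses a complex design).
margin_of_error <- function(n, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)  # critical value of the standard normal
  z * sqrt(p * (1 - p) / n)
}

moe_state <- margin_of_error(4000)
round(100 * moe_state, 2)  # margin of error in percentage points
```

Under these simplifying assumptions, 4,000 interviews give a worst-case margin of roughly 1.5 percentage points at the 95% confidence level.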
The sample contains only observational, cross-sectional data. Because this is not an experimental design (no random assignment of individuals to groups was used), the scope of analysis will be limited to determining associations between the explanatory and response variables without the possibility of establishing causal relationships.
Research question 1:
Sleep is an interesting phenomenon that attracts the attention of many researchers across the globe. There are many claims that sleep is crucially important for individual health and well-being. The quality and quantity of sleep can relate to how we feel, think, and function. As a matter of common wisdom, many believe that it is important to get good sleep to be focused and productive. Furthermore, the National Sleep Foundation (NSF) has specific recommendations on optimal sleep time. For instance, the NSF recommends that adults aged 18-64 sleep 7-9 hours. Overall, sleep duration can be an important factor to consider in addressing problems of public health, organizational productivity, and personal well-being. Therefore, it’s important to know whether observational data support these claims.
In this project, I will explore some links between sleep duration and individual health and behaviors. My first step is to explore whether there might be a relationship between individual sleep duration and health. My conjecture is that otherwise healthy individuals who experience systematic sleep deprivation or other sleep disorders might perform worse on measures of physical or mental health than those who sleep an optimal amount of time, to the extent that their health condition could keep them from doing their usual daily activities. It is interesting to know whether this association holds for individuals who are in good general health and who also live healthy lifestyles. Hence, the first research question (RQ1) asks:
The expectation here is that among individuals who are well and exercise, those who sleep an optimal amount of time will tend to be physically or mentally healthier than those who sleep too little or too much.
Research question 2:
Regardless of whether the discussed sleep disorders are linked to individual physical or mental health, it is even more reasonable to argue that not having enough sleep (or sleeping too much) might be consequential for a person’s ability to stay focused, remember, and make decisions. As a first step towards confirming this claim, it can be useful to explore whether there is an association between the amount of sleep one gets and whether the person has difficulty concentrating, remembering, or making decisions. Therefore, the second research question (RQ2) asks the following:
Research question 3:
If there is a relationship between the duration of sleep and the conditions and behaviors outlined above, it is also interesting to understand what manageable individual behaviors or conditions might potentially influence sleep patterns. For instance, eating, drinking, or smoking habits, employment status, or certain medical conditions might be worth studying in this context. As a first step, an exploratory analysis could show whether there are differences in the sleeping patterns of those who smoke compared to those who do not; employed individuals might have different sleep patterns from the unemployed; individuals with medical conditions such as asthma might differ from those who don’t suffer from asthma. The third research question (RQ3) will focus on one of the discussed conditions and ask the following:
To answer the three research questions, I will use the following variables from the BRFSS dataset: sleptim1, genhlth, exerany2, poorhlth, decide, and asthnow.
The following section describes each variable.
sleptim1: How Much Time Do You Sleep. The survey question asked: “On average, how many hours of sleep do you get in a 24-hour period?”
summary(brfss2013$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 7.000 7.052 8.000 450.000 7387
genhlth: General Health. The survey question asked: Would you say that in general your health is: Excellent/Very Good/Good/Fair/Poor?
summary(brfss2013$genhlth)
## Excellent Very good Good Fair Poor NA's
## 85482 159076 150555 66726 27951 1985
exerany2: Exercise (Physical Activity) In Past 30 Days. The survey question asked: During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise? Yes/No
summary(brfss2013$exerany2)
## Yes No NA's
## 332464 125282 34029
poorhlth: Poor Physical Or Mental Health. The survey question asked: During the past 30 days, for about how many days did poor physical or mental health keep you from doing your usual activities, such as self-care, work, or recreation?
summary(brfss2013$poorhlth)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 5.3 5.0 7000.0 243153
decide: Difficulty Concentrating Or Remembering. The survey question asked: Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?
summary(brfss2013$decide)
## Yes No NA's
## 50647 428138 12990
asthnow: Still Have Asthma. The survey question asked: Do you still have asthma? Yes/No
summary(brfss2013$asthnow)
## Yes No NA's
## 45644 19696 426435
The structure of the data used in this exploratory analysis is the following:
str(select(brfss2013, genhlth, poorhlth, sleptim1, decide, exerany2, asthnow))
## 'data.frame': 491775 obs. of 6 variables:
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ poorhlth: int 30 NA 0 0 0 NA 0 10 NA NA ...
## $ sleptim1: int NA 6 9 8 6 8 7 6 8 8 ...
## $ decide : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ exerany2: Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
## $ asthnow : Factor w/ 2 levels "Yes","No": 1 NA NA NA 2 NA NA NA NA NA ...
As we can see, the subset of data contains variables of different types, and there are missing values and possible errors. Therefore, it is important to prepare the data for the analysis.
First, I will save the needed subset of data in a new dataset named my.data.RQ1:
my.data.RQ1 <- select(brfss2013, genhlth, poorhlth, sleptim1, decide, exerany2, asthnow)
The first research question focuses on the following four variables: sleptim1, poorhlth, genhlth, and exerany2. To prepare the data for analysis, it is important to remove any missing values from the data. It is also important to remove any apparent data errors. For this analysis, the valid range of values for sleptim1 is taken to be 1 to 23 hours a day, and the valid range of values for poorhlth is 0 to 30 days a month. Let’s check what values are present in the data on those two variables:
table(my.data.RQ1$sleptim1)
##
## 0 1 2 3 4 5 6 7 8 9
## 1 228 1076 3496 14261 33436 106197 142469 141102 23800
## 10 11 12 13 14 15 16 17 18 19
## 12102 833 3675 199 447 367 369 35 164 13
## 20 21 22 23 24 103 450
## 64 3 10 4 35 1 1
table(my.data.RQ1$poorhlth)
##
## 0 1 2 3 4 5 6 7 8 9
## 141506 12147 13815 8504 4778 8899 1556 4882 1132 210
## 10 11 12 13 14 15 16 17 18 19
## 7475 79 754 91 2634 7888 167 128 203 41
## 20 21 22 23 24 25 26 27 28 29
## 5366 759 115 64 98 2045 100 139 519 196
## 30 7000
## 22331 1
As we can see, the data on both variables contain some clearly invalid values (103, 450, 7000), which might be the result of data entry errors. These need to be removed along with the sleep values outside the chosen range (0 and 24 hours) and the missing values:
my.data.RQ1 <- my.data.RQ1 %>%
filter(sleptim1 <= 23 & sleptim1 >= 1 & !is.na(sleptim1) & !is.na(exerany2) & poorhlth <= 30 & !is.na(poorhlth))
## Warning: package 'bindrcpp' was built under R version 3.3.3
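The behavior of this kind of range filter can be illustrated on a small made-up data frame. The sketch below uses base R's `subset()` rather than dplyr so that it runs without extra packages; the data frame `toy` and all of its values are invented for illustration. Like `filter()`, `subset()` drops rows where the condition evaluates to `NA`, so rows with a missing sleptim1 or poorhlth are removed by the range comparisons themselves, while the explicit `!is.na()` check is still needed for exerany2 because no comparison is applied to it.

```r
# Hypothetical toy data with the same column names as the BRFSS subset.
toy <- data.frame(
  sleptim1 = c(0, 6, 8, 24, 450, NA, 7, 7, 8),
  poorhlth = c(0, 30, 7000, 2, 1, 5, NA, 0, 3),
  exerany2 = factor(c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", NA, "Yes"),
                    levels = c("Yes", "No"))
)

# Keep rows with plausible sleep (1-23 hours), at most 30 poor-health days,
# and a non-missing exercise answer.
clean <- subset(
  toy,
  sleptim1 >= 1 & sleptim1 <= 23 &
    poorhlth <= 30 &
    !is.na(exerany2)
)

nrow(clean)  # number of rows that survive the filter
```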
Now, the data on the sleptim1 (Duration of Sleep) variable are numeric, ranging from 1 to 23 hours. However, to answer the research question, we need to compare individuals who sleep 7-9 hours on average to those who sleep less and those who sleep more than that. A new categorical variable that defines different ranges for the duration of sleep is needed. The following code creates a new variable how.much.sleep with four levels: “0-5” (fewer than 5 hours), “5-7” (5 or 6 hours), “7-9” (7 or 8 hours), and “9 or more” (9 hours or more). Because sleptim1 is recorded in whole hours, a respondent reporting exactly 9 hours falls into the “9 or more” group:
my.data.RQ1 <- my.data.RQ1 %>%
  mutate(how.much.sleep = ifelse(sleptim1 < 5, "0-5",
                          ifelse(sleptim1 >= 5 & sleptim1 < 7, "5-7",
                          ifelse(sleptim1 >= 7 & sleptim1 < 9, "7-9", "9 or more"))))
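As an aside, the same binning rule can also be expressed with base R's `cut()`, which some readers may find easier to check than nested `ifelse()` calls. The sketch below is an alternative formulation, not the code used to produce the results in this report; the vector `hours` is made up to cover every bin boundary. With `right = FALSE` the intervals are closed on the left, which matches the rule above (exactly 9 hours lands in "9 or more").

```r
# Integer sleep durations covering every bin boundary (illustrative values).
hours <- c(1, 4, 5, 6, 7, 8, 9, 10, 23)

# Nested ifelse(), mirroring the binning rule used in the analysis.
by_ifelse <- ifelse(hours < 5, "0-5",
             ifelse(hours < 7, "5-7",
             ifelse(hours < 9, "7-9", "9 or more")))

# Equivalent binning with cut(): right = FALSE gives the intervals
# [-Inf, 5), [5, 7), [7, 9), [9, Inf).
by_cut <- cut(hours,
              breaks = c(-Inf, 5, 7, 9, Inf),
              labels = c("0-5", "5-7", "7-9", "9 or more"),
              right = FALSE)

identical(by_ifelse, as.character(by_cut))  # both approaches agree
table(by_cut)
```

One practical difference is that `cut()` returns a factor whose levels follow the order of the breaks, so the groups keep their natural order in tables and plots.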
Now we can use the newly created variable to group individuals by how much they sleep and compute summary statistics for how many days individuals’ poor physical or mental health kept them from doing their usual activities, such as self-care, work, or recreation:
my.data.RQ1 %>%
group_by(how.much.sleep) %>%
summarise(median.poor.days = median(poorhlth), iqr.poor.days = IQR(poorhlth), mean.poor.days = mean(poorhlth), n=n())
## # A tibble: 4 x 5
## how.much.sleep median.poor.days iqr.poor.days mean.poor.days n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 0-5 7 28 11.878948 13234
## 2 5-7 0 6 5.531490 74912
## 3 7-9 0 3 3.794159 119641
## 4 9 or more 1 14 7.574111 21252
As we can see, adults who sleep less than 5 hours or 9 hours or more a day fare substantially worse than those in the 5-7 and 7-9 groups.
We can also use boxplots to demonstrate the relationship graphically:
ggplot(my.data.RQ1, aes(x = how.much.sleep, y = poorhlth)) + geom_boxplot()
According to the research question, it’s more important to know about the relationship between sleep duration and health conditions for those who are healthy in other respects and have healthy lifestyles. To accomplish that, I will filter individuals who report that their general health is good, very good, or excellent and who say they exercised in the past 30 days. After the filtering, I will repeat the analysis:
my.data.RQ1 %>%
  # Note: & binds more tightly than |, so the exercise condition applies only to genhlth == "Good"
  filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
group_by(how.much.sleep) %>%
summarise(Median.Poor.days = median(poorhlth), IQR.Poor.days = IQR(poorhlth), Mean.Poor.days = mean(poorhlth), n=n())
## # A tibble: 4 x 5
## how.much.sleep Median.Poor.days IQR.Poor.days Mean.Poor.days n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 0-5 0 6 5.316426 4219
## 2 5-7 0 2 2.450234 42047
## 3 7-9 0 2 1.997520 81462
## 4 9 or more 0 3 3.246155 10079
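One caveat about the filter used here: in R, `&` has higher precedence than `|`, so the condition as written keeps everyone who reports excellent or very good health regardless of exercise, plus those in good health who exercised. The sketch below contrasts that with a version matching the stated intent (at least good health and exercise). The data frame `toy` is invented for illustration, and the results in this report have not been recomputed with the stricter filter, so how much the numbers would change is unknown.

```r
# Hypothetical respondents: general health and exercise status.
toy <- data.frame(
  genhlth  = c("Excellent", "Excellent", "Very good", "Good", "Good", "Fair"),
  exerany2 = c("Yes",       "No",        "No",        "Yes",  "No",   "Yes")
)

# As written: & is evaluated before |, so only the "Good" test is tied to exercise.
as_written <- with(toy, genhlth == "Excellent" | genhlth == "Very good" |
                        genhlth == "Good" & exerany2 == "Yes")

# Intended: (at least good health) AND (exercised).
intended <- with(toy, genhlth %in% c("Excellent", "Very good", "Good") &
                      exerany2 == "Yes")

sum(as_written)  # rows kept by the filter as written
sum(intended)    # rows kept by the stricter version
```

In a dplyr pipeline the stricter condition would read `filter(genhlth %in% c("Excellent", "Very good", "Good"), exerany2 == "Yes")`.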
This time, the statistics indicate a less clear relationship, but the differences in the IQRs and means show that those who sleep 7-9 hours tend to have the fewest days when poor physical or mental health kept them from doing their usual activities, such as self-care, work, or recreation.
The following boxplot represents the relationship graphically:
my.data.RQ1 %>%
filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
ggplot(aes(x = how.much.sleep, y = poorhlth)) + geom_boxplot()
As we can see on the boxplot, the medians for all four groups are equal, but the IQRs and the overall spread of values differ across the four groups. Based on the boxplot, it is practically impossible to see any difference between the two middle groups (5-7 and 7-9 hours). Only the means are slightly different, but means are not very informative for heavily skewed data such as those on poorhlth:
ggplot(data = my.data.RQ1, aes(x = poorhlth)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
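Why the mean can mislead for data shaped like this is easy to see on a tiny made-up sample. The vector `poor_days` below is invented; it only mimics the general shape of poorhlth, where most respondents report 0 days and a few report the maximum of 30.

```r
# Invented, zero-heavy sample resembling the shape of poorhlth.
poor_days <- c(rep(0, 8), 30, 30)

median(poor_days)  # the typical respondent
mean(poor_days)    # pulled upward by the two extreme values
```

Here the median is 0 while the mean is 6, even though eight of the ten hypothetical respondents reported no such days at all.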
The overall conclusion for research question 1 is that among individuals whose general health is at least good and who exercise, there are some differences in the physical and mental health conditions of those in the 5-7 and 7-9 hour groups compared to those who sleep less than 5 hours or 9 hours or more on average. The healthy adults who sleep 5-7 and 7-9 hours perform similarly on this measure of health.
The second research question explores the relationship between the amount of sleep and difficulty concentrating, remembering, or making decisions. The response variable of interest is decide - a binary variable indicating whether a respondent has serious difficulty concentrating, remembering, or making decisions:
summary(my.data.RQ1$decide)
## Yes No NA's
## 38662 188935 1442
As we can see, there are 1,442 missing values that need to be removed from the dataset before the analysis. The filtered data are saved in a new data frame:
my.data.RQ2 <- my.data.RQ1 %>%
filter(!is.na(decide))
summary(my.data.RQ2$decide)
## Yes No
## 38662 188935
Because the response variable is binary, I need to calculate the relative frequencies of those who responded “Yes” to the question across the different sleep durations:
my.data.RQ2 %>%
filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
group_by(how.much.sleep) %>%
summarise(n.Count=n(), HaveDifficulty.Proportion= sum(decide == "Yes")/n())
## # A tibble: 4 x 3
## how.much.sleep n.Count HaveDifficulty.Proportion
## <chr> <int> <dbl>
## 1 0-5 4182 0.27594452
## 2 5-7 41851 0.10761989
## 3 7-9 81152 0.06640625
## 4 9 or more 10005 0.15072464
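The idea behind this calculation, that the share of "Yes" answers in a group is just the mean of a logical vector, can be sketched in base R on invented data. The data frame `toy` below is hypothetical and far smaller than the real sample.

```r
# Invented data: sleep bin and answer to the "decide" question.
toy <- data.frame(
  how.much.sleep = c("0-5", "0-5", "5-7", "5-7", "5-7", "5-7",
                     "7-9", "7-9", "7-9", "7-9"),
  decide         = c("Yes", "No", "Yes", "No", "No", "No",
                     "No", "No", "No", "Yes")
)

# The mean of a logical vector is the share of TRUE values,
# so this gives the proportion answering "Yes" within each sleep bin.
prop_yes <- tapply(toy$decide == "Yes", toy$how.much.sleep, mean)
prop_yes
```

This is equivalent in spirit to `sum(decide == "Yes") / n()` inside `summarise()`.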
The results show that 27.6 percent of those generally healthy adults in the sample who sleep less than 5 hours, 15.1 percent of those who sleep 9 hours or more, 10.8 percent of those who sleep 5-7 hours, and only 6.6 percent of those who get 7-9 hours of sleep have difficulty concentrating, remembering, or making decisions. In other words, among the generally healthy individuals in this sample, those who sleep the recommended 7-9 hours are least likely to experience serious difficulty concentrating, remembering, or making decisions compared to those adults who, on average, sleep more or less than that.
The discussed differences can be clearly seen in the following relative frequency bar plot, where the size of the red areas represents the proportion of those who have difficulty concentrating, remembering, or making decisions:
my.data.RQ2 %>%
filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
ggplot(aes(x = how.much.sleep, fill = decide)) + geom_bar(position="fill") + labs(y = "Proportion")
The overall conclusion following from the analysis is that in this sample the duration of sleep is related to the ability to concentrate, remember, or make decisions among adults whose general health is at least good and who exercise. In this sample, individuals who get 7-9 hours of sleep perform the best on this measure. Adults who sleep less than 5 hours or 9 hours or more are most likely to have serious difficulty concentrating, remembering, or making decisions. Even those who sleep 5-7 hours on average are more likely to have such a difficulty compared to those who sleep the recommended 7-9 hours.
The third research question explores the relationship between having a condition such as asthma and the average duration of sleep an individual gets. Here, having asthma (asthnow) is the explanatory variable, and the duration of sleep (how.much.sleep) is the response variable. The following is the summary for the variable asthnow:
summary(my.data.RQ2$asthnow)
## Yes No NA's
## 28816 10828 187953
The filtered data without the missing values on the asthnow variable are saved into a new data frame my.data.RQ3:
my.data.RQ3 <- my.data.RQ2 %>%
filter(!is.na(asthnow))
summary(my.data.RQ3$asthnow)
## Yes No
## 28816 10828
The following code groups the data by the values of the asthnow variable and calculates the proportion of respondents in each sleep-duration group for the two groups of individuals - with and without asthma:
my.data.RQ3 %>%
#filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
group_by(asthnow) %>%
summarise(n=n(), sl.0to5h= sum(how.much.sleep == "0-5")/n(), sl.5to7h = sum(how.much.sleep == "5-7")/n(), sl.7to9h= sum(how.much.sleep == "7-9")/n(), sl.9plus= sum(how.much.sleep == "9 or more")/n())
## # A tibble: 2 x 6
## asthnow n sl.0to5h sl.5to7h sl.7to9h sl.9plus
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Yes 28816 0.10254720 0.3646932 0.4314964 0.10126319
## 2 No 10828 0.06464721 0.3511267 0.4958441 0.08838197
The summary shows that 49.6 percent of those who do not currently have asthma and 43.1 percent of those with asthma fall into the 7-9 hour group. In other words, those adults in the sample who have asthma are less likely to get the recommended amount of sleep.
This relationship is clearly seen on the relative frequency, stacked bar plot:
ggplot(my.data.RQ3, aes(x = asthnow, fill = how.much.sleep)) + geom_bar(position="fill") + labs(y = "Proportion")
The red area of the plot shows that the respondents who reported having asthma are substantially more likely to get less than 5 hours of sleep.
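To put a number on "substantially more likely", the two proportions for the shortest sleep group can be compared directly. The sketch below only reuses the values printed in the summary table above; the variable names are hypothetical, and no test of statistical significance is attempted.

```r
# Proportions copied from the summary table (share sleeping fewer than 5 hours).
p_asthma    <- 0.10254720  # asthnow == "Yes"
p_no_asthma <- 0.06464721  # asthnow == "No"

diff_pp <- 100 * (p_asthma - p_no_asthma)  # difference in percentage points
ratio   <- p_asthma / p_no_asthma          # relative difference

round(diff_pp, 1)
round(ratio, 2)
```

In this sample, respondents with asthma are about 3.8 percentage points, or roughly 1.6 times, more likely to be in the shortest sleep group.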
Overall, the findings of the exploratory analysis show that there are links between the following variables of interest:
- the number of days when poor physical or mental health kept one from doing usual activities, such as self-care, work, or recreation, and the average duration of sleep, for healthy individuals who exercise;
- the probability of having serious difficulty concentrating, remembering, or making decisions and the amount of sleep one gets, for healthy individuals who exercise;
- the probability that an adult in the sample gets the recommended amount of sleep (7-9 hours) and whether one has a condition such as asthma.
Because the sample was collected using random sampling techniques, the discussed relationships may hold in the population of U.S. adults. However, the discussed links are correlational, which means that the analysis doesn’t provide evidence for causal relationships between the explanatory and response variables examined in this project.