Setup

Load packages

setwd("E:/Coursera/Statistics with R - Duke University/01_Introduction to Probability and Data/Project")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3

Load data

load("brfss2013.RData")

Part 1: Data

According to the documentation, the data for this project come from the Behavioral Risk Factor Surveillance System (BRFSS). These are surveillance data on risk behaviors of U.S. adults residing in all 50 states, the District of Columbia, Puerto Rico, and Guam. Overall, the data set contains 491,775 observations of 330 variables collected in 2013 and 2014 through telephone interviews. The following summaries show the years and states for which data are available and the number of observations in each:

summary(brfss2013$iyear)
##   2013   2014   NA's 
## 486088   5682      5
summary(brfss2013$X_state)
##                    0              Alabama               Alaska 
##                    1                 6505                 4578 
##              Arizona             Arkansas           California 
##                 4253                 5268                11518 
##             Colorado          Connecticut             Delaware 
##                13649                 7710                 5206 
## District of Columbia              Florida              Georgia 
##                 4931                33668                 8658 
##               Hawaii                Idaho             Illinois 
##                 7858                 5630                 5608 
##              Indiana                 Iowa               Kansas 
##                10338                 8157                23282 
##             Kentucky            Louisiana                Maine 
##                10877                 5215                 8273 
##             Maryland        Massachusetts             Michigan 
##                13011                15071                12761 
##            Minnesota          Mississippi             Missouri 
##                14340                 7453                 7118 
##              Montana             Nebraska               Nevada 
##                 9693                17139                 5101 
##        New Hampshire           New Jersey           New Mexico 
##                 6076                13776                 9316 
##             New York       North Carolina         North Dakota 
##                 8979                 8860                 7806 
##                 Ohio             Oklahoma               Oregon 
##                11971                 8244                 5949 
##         Pennsylvania         Rhode Island       South Carolina 
##                11429                 6531                10717 
##         South Dakota            Tennessee                Texas 
##                 6895                 5815                10917 
##                 Utah              Vermont             Virginia 
##                12769                 6392                 8464 
##           Washington        West Virginia            Wisconsin 
##                11162                 5899                 6589 
##              Wyoming                 Guam          Puerto Rico 
##                 6454                 1897                 5997 
##                   80 
##                    1

According to the BRFSS Data User Guide (2013), the data for the project are collected using random sampling techniques. More specifically, the dataset consists of two samples: a sample of landline telephone respondents and a sample of cell phone respondents. The landline sample uses disproportionate stratified sampling, and the cellular telephone sample uses random selection in which each respondent has an equal probability of selection. The sampling method also uses geographic stratification based on substate geographic regions.

Given the random sampling design, the sample can reasonably be expected to be representative of the U.S. adult population, and conclusions derived from this sample can be extended to that larger population. Also, the sample size (more than 4,000 interviews in every state, the District of Columbia, and Puerto Rico, and 1,897 in Guam) is large enough to achieve reasonably small margins of error. However, it must be noted that the sample covers only telephone users and omits adults who don’t use telephones. In addition, the sample generalizes only as well as the quality and completeness of the lists of telephone numbers that the project relies upon allow. Finally, it is important to note that such data might be vulnerable to non-response bias.
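To make the sample-size remark concrete, the following is a rough sketch of the 95% margin of error for an estimated proportion. It assumes simple random sampling, which BRFSS does not use (the design is stratified and weighted), so the real margins would be somewhat wider. The function name margin_of_error is introduced here for illustration only.

```r
# Approximate margin of error for a proportion under simple random sampling.
# Assumption: this ignores the BRFSS stratification and weighting.
margin_of_error <- function(n, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  z * sqrt(p * (1 - p) / n)        # p = 0.5 gives the widest margin
}

moe_4000 <- margin_of_error(4000)  # a state with about 4,000 interviews
moe_guam <- margin_of_error(1897)  # Guam's count from the summary above

c(n4000 = moe_4000, guam = moe_guam)
```

With about 4,000 interviews the worst-case margin is roughly 1.5 percentage points under these assumptions. Smaller samples, such as Guam's, give a wider margin.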

The sample contains only observational, cross-sectional data. Because this is not an experimental design (no random assignment of individuals to groups was used), the scope of analysis will be limited to determining associations between the explanatory and response variables, without the possibility of establishing causal relationships.


Part 2: Research questions

Research question 1:

Sleep is an interesting phenomenon that attracts the attention of many researchers across the globe. There are many claims that sleep is crucially important for individual health and well-being. The quality and quantity of sleep can relate to how we feel, think, and function. It is common wisdom that good sleep is important for staying focused and productive. Furthermore, the National Sleep Foundation (NSF) has specific recommendations on optimal sleep time. For instance, the NSF recommends that adults aged 18-64 sleep 7-9 hours. Overall, sleep duration can be an important factor to consider in addressing problems of public health, organizational productivity, and personal well-being. Therefore, it’s important to know whether observational data support these claims.

In this project, I will explore some links between sleep duration and individual health and behaviors. My first step is to explore whether there might be a relationship between individual sleep duration and health. My conjecture is that otherwise healthy individuals who experience systematic sleep deprivation or other sleep disorders might perform worse on measures of physical or mental health than those who sleep an optimal amount of time, to the extent that their health condition could keep them from doing their usual daily activities. It is interesting to know whether this association holds for individuals who are in good general health and who also live healthy lifestyles. Hence, the first research question (RQ1) asks whether, among adults who are in good general health and exercise, the amount of sleep is associated with the number of days on which poor physical or mental health kept them from their usual activities.

The expectation here is that among individuals who are well and exercise, those who sleep an optimal amount of time will tend to be physically or mentally healthier than those who sleep too little or too much.

Research question 2:

Regardless of whether the discussed sleep disorders are linked to individual physical or mental health, it is even more reasonable to argue that not having enough sleep (or sleeping too much) might be consequential for a person’s ability to stay focused, remember, and make decisions. As a first step towards confirming this claim, it can be useful to explore whether there is an association between the amount of sleep one gets and whether the person has difficulty concentrating, remembering, or making decisions. Therefore, the second research question (RQ2) asks whether, among generally healthy adults who exercise, the amount of sleep is associated with having serious difficulty concentrating, remembering, or making decisions.

Research question 3:

If there is a relationship between the duration of sleep and the conditions and behaviors outlined above, it is also interesting to understand what manageable individual behaviors or conditions might potentially influence sleep patterns. For instance, eating, drinking, or smoking habits, employment status, or certain medical conditions might be worth studying in this context. As a first step, an exploratory analysis could show whether the sleep patterns of those who smoke differ from those who do not, whether employed individuals sleep differently from the unemployed, or whether individuals with a medical condition such as asthma differ from those who don’t suffer from it. The third research question (RQ3) focuses on one of these conditions and asks whether adults who currently have asthma differ in their sleep duration from those who do not.


Part 3: Exploratory data analysis

To answer the three research questions, I will use the following variables from the BRFSS dataset: sleptim1, genhlth, exerany2, poorhlth, decide, and asthnow.

The following section describes each variable.

sleptim1: How Much Time Do You Sleep. The survey question asked: “On average, how many hours of sleep do you get in a 24-hour period?”

summary(brfss2013$sleptim1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.000   7.000   7.052   8.000 450.000    7387

genhlth: General Health. The survey question asked: Would you say that in general your health is: Excellent/Very Good/Good/Fair/Poor?

summary(brfss2013$genhlth)
## Excellent Very good      Good      Fair      Poor      NA's 
##     85482    159076    150555     66726     27951      1985

exerany2: Exercise (Physical Activity) In Past 30 Days. The survey question asked: During the past month, other than your regular job, did you participate in any physical activities or exercises such as running, calisthenics, golf, gardening, or walking for exercise? Yes/No

summary(brfss2013$exerany2)
##    Yes     No   NA's 
## 332464 125282  34029

poorhlth: Poor Physical Or Mental Health. The survey question asked: During the past 30 days, for about how many days did poor physical or mental health keep you from doing your usual activities, such as self-care, work, or recreation?

summary(brfss2013$poorhlth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0     0.0     5.3     5.0  7000.0  243153

decide: Difficulty Concentrating Or Remembering. The survey question asked: Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?

summary(brfss2013$decide)
##    Yes     No   NA's 
##  50647 428138  12990

asthnow: Still Have Asthma. The survey question asked: Do you still have asthma? Yes/No

summary(brfss2013$asthnow)
##    Yes     No   NA's 
##  45644  19696 426435

The structure of the data used in this exploratory analysis is the following:

str(select(brfss2013, genhlth, poorhlth, sleptim1, decide, exerany2, asthnow))
## 'data.frame':    491775 obs. of  6 variables:
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ poorhlth: int  30 NA 0 0 0 NA 0 10 NA NA ...
##  $ sleptim1: int  NA 6 9 8 6 8 7 6 8 8 ...
##  $ decide  : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ exerany2: Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
##  $ asthnow : Factor w/ 2 levels "Yes","No": 1 NA NA NA 2 NA NA NA NA NA ...

As we can see, the subset of data contains variables of different types, and there are missing values and possible errors. Therefore, it is important to prepare the data for the analysis.
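One quick way to see how much cleaning is needed is to count missing values per column and look at the observed range of each numeric variable. The sketch below uses a small made-up data frame (toy) in place of the BRFSS subset; its values are illustrative only.

```r
# Toy data frame standing in for the selected BRFSS columns (values are made up).
toy <- data.frame(
  sleptim1 = c(NA, 6, 9, 8, 450),
  poorhlth = c(30, NA, 0, 0, 7000),
  decide   = factor(c("No", "No", NA, "Yes", "No"), levels = c("Yes", "No"))
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Minimum and maximum of each numeric column, ignoring NAs,
# which makes impossible values such as 450 hours easy to spot.
ranges <- sapply(toy[c("sleptim1", "poorhlth")], range, na.rm = TRUE)

na_counts
ranges
```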

Research question 1:

First, I will save the needed subset of data in a new dataset named my.data.RQ1:

my.data.RQ1 <- select(brfss2013, genhlth, poorhlth, sleptim1, decide, exerany2, asthnow)

The first research question focuses on the following four variables: sleptim1, poorhlth, genhlth, and exerany2. To prepare the data for analysis, it is important to remove any missing values and any apparent data errors. The valid range of values for sleptim1 is taken to be 1 to 23 hours a day, and the valid range for poorhlth is 0 to 30 days a month. Let’s check what values are present in the data on those two variables:

table(my.data.RQ1$sleptim1)
## 
##      0      1      2      3      4      5      6      7      8      9 
##      1    228   1076   3496  14261  33436 106197 142469 141102  23800 
##     10     11     12     13     14     15     16     17     18     19 
##  12102    833   3675    199    447    367    369     35    164     13 
##     20     21     22     23     24    103    450 
##     64      3     10      4     35      1      1
table(my.data.RQ1$poorhlth)
## 
##      0      1      2      3      4      5      6      7      8      9 
## 141506  12147  13815   8504   4778   8899   1556   4882   1132    210 
##     10     11     12     13     14     15     16     17     18     19 
##   7475     79    754     91   2634   7888    167    128    203     41 
##     20     21     22     23     24     25     26     27     28     29 
##   5366    759    115     64     98   2045    100    139    519    196 
##     30   7000 
##  22331      1

As we can see, both variables contain some invalid values (103, 450, 7000), which are probably data entry errors and need to be removed along with the missing values. The reported sleep times of 0 and 24 hours also fall outside the 1 to 23 range and are dropped by the same filter:

my.data.RQ1 <- my.data.RQ1 %>% 
    filter(sleptim1 <= 23 & sleptim1 >= 1 & !is.na(sleptim1) & !is.na(exerany2) & poorhlth <= 30 & !is.na(poorhlth))
## Warning: package 'bindrcpp' was built under R version 3.3.3
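A comparison involving NA evaluates to NA rather than TRUE or FALSE, and dplyr's filter() drops rows whose condition is NA. The range checks in the filter therefore already remove missing values, and the explicit !is.na() calls serve mainly as documentation. The base R sketch below illustrates this on made-up values. The data frame toy and the use of which() are choices made for this example, not the code used above.

```r
# Made-up values covering valid, out-of-range, and missing cases.
toy <- data.frame(
  sleptim1 = c(0, 6, 8, 24, 450, NA, 7),
  poorhlth = c(0, 30, NA, 2, 5, 1, 7000)
)

# Range checks only: rows with a missing value yield NA, not TRUE.
keep <- toy$sleptim1 >= 1 & toy$sleptim1 <= 23 & toy$poorhlth <= 30

# which() keeps only positions where the condition is TRUE,
# so NA rows are dropped just like FALSE rows.
clean <- toy[which(keep), , drop = FALSE]

keep
clean
```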

Now the sleptim1 (Duration of Sleep) variable is numeric, ranging from 1 to 23 hours. However, to answer the research question, we need to compare individuals who sleep about 7-9 hours on average to those who sleep less and those who sleep more than that. A new categorical variable that defines ranges for the duration of sleep is needed. The following code creates a new variable how.much.sleep with four levels: "0-5" (less than 5 hours), "5-7" (5 to less than 7 hours), "7-9" (7 to less than 9 hours), and "9 or more". Note that with these boundaries respondents who report exactly 9 hours fall into the "9 or more" group:

my.data.RQ1 <- my.data.RQ1 %>%
    mutate(how.much.sleep = ifelse(sleptim1 < 5, "0-5",
                            ifelse(sleptim1 >= 5 & sleptim1 < 7, "5-7",
                            ifelse(sleptim1 >= 7 & sleptim1 < 9, "7-9", "9 or more"))))
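The same binning can also be written with base R's cut(), which may be easier to read and check than nested ifelse() calls. This is an alternative sketch rather than the code used in this analysis. The vector hours is made up, and cut() returns a factor where ifelse() returns a character vector.

```r
# Made-up sleep durations that hit every boundary.
hours <- c(1, 4, 5, 6, 7, 8, 9, 23)

sleep_group <- cut(
  hours,
  breaks = c(-Inf, 5, 7, 9, Inf),
  labels = c("0-5", "5-7", "7-9", "9 or more"),
  right  = FALSE   # intervals are [lower, upper), matching the ifelse() conditions
)

table(sleep_group)
```

Setting right = FALSE matters here: with the default (right-closed intervals), a respondent reporting exactly 5 hours would land in "0-5" instead of "5-7".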

Now we can use the newly created variable to group individuals by how much they sleep and compute summary statistics for how many days individuals’ poor physical or mental health kept them from doing their usual activities, such as self-care, work, or recreation:

my.data.RQ1 %>% 
    group_by(how.much.sleep) %>%
    summarise(median.poor.days = median(poorhlth), iqr.poor.days = IQR(poorhlth), mean.poor.days = mean(poorhlth), n=n())
## # A tibble: 4 x 5
##   how.much.sleep median.poor.days iqr.poor.days mean.poor.days      n
##            <chr>            <dbl>         <dbl>          <dbl>  <int>
## 1            0-5                7            28      11.878948  13234
## 2            5-7                0             6       5.531490  74912
## 3            7-9                0             3       3.794159 119641
## 4      9 or more                1            14       7.574111  21252

As we can see, adults who sleep less than 5 hours or 9 or more hours a day report substantially more days of poor health, on average, than those in the two middle groups.

We can also use boxplots to demonstrate the relationship graphically:

ggplot(my.data.RQ1, aes(x = how.much.sleep, y = poorhlth)) + geom_boxplot()

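According to the research question, it’s more important to know about the relationship between sleep duration and health conditions for those who are healthy in other respects and have healthy lifestyles. To accomplish that, I will filter for individuals who report that their general health is good, very good, or excellent and who say they exercised in the past 30 days. After the filtering, I will repeat the analysis. Note that in R the & operator is evaluated before |, so the filter as written below keeps all respondents who report excellent or very good health and applies the exercise condition only to those who report good health. The same filter is reused in the later steps, and the outputs shown reflect the filter as written: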

my.data.RQ1 %>% 
    filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
    group_by(how.much.sleep) %>%
    summarise(Median.Poor.days = median(poorhlth), IQR.Poor.days = IQR(poorhlth), Mean.Poor.days = mean(poorhlth), n=n())
## # A tibble: 4 x 5
##   how.much.sleep Median.Poor.days IQR.Poor.days Mean.Poor.days     n
##            <chr>            <dbl>         <dbl>          <dbl> <int>
## 1            0-5                0             6       5.316426  4219
## 2            5-7                0             2       2.450234 42047
## 3            7-9                0             2       1.997520 81462
## 4      9 or more                0             3       3.246155 10079
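The effect of operator precedence in the filter above can be checked on a few made-up rows. The sketch below compares the condition as written with a grouped version that uses %in%, which applies the exercise requirement to all three health levels. The data frame toy and both variable names are illustrative. How much the BRFSS results would change under the grouped version has not been computed here.

```r
# Made-up respondents: one row per combination of interest.
toy <- data.frame(
  genhlth  = c("Excellent", "Excellent", "Good", "Good", "Fair"),
  exerany2 = c("Yes", "No", "Yes", "No", "Yes")
)

# As written: & is evaluated before |, so exercise is checked only for "Good".
as_written <- with(toy,
  genhlth == "Excellent" | genhlth == "Very good" |
    genhlth == "Good" & exerany2 == "Yes")

# Grouped: the exercise condition applies to every retained health level.
grouped <- with(toy,
  genhlth %in% c("Excellent", "Very good", "Good") & exerany2 == "Yes")

toy[as_written, ]   # includes the "Excellent" respondent who did not exercise
toy[grouped, ]      # only respondents who exercised
```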

This time, the statistics indicate a less clear relationship, but the differences in the IQRs and means show that those who sleep 7-9 hours tend to have the fewest days when poor physical or mental health kept them from doing their usual activities, such as self-care, work, or recreation.

The following boxplot represents the relationship graphically:

my.data.RQ1 %>% 
    filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
    ggplot(aes(x = how.much.sleep, y = poorhlth)) + geom_boxplot()

As we can see on the boxplot, the medians for all four groups are equal, but the IQRs and the overall spreads of values differ across the four groups. Based on the boxplot, it is practically impossible to see any difference between the two groups in the middle. Only the means are slightly different, but means are not informative for heavily skewed data such as those on poorhlth:

ggplot(data = my.data.RQ1, aes(x = poorhlth)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
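A small made-up example shows why the mean can mislead for data shaped like poorhlth, where most respondents report zero days and a few report many. The vector days below is illustrative, not taken from the survey.

```r
# Mostly zeros with a long right tail, loosely mimicking poorhlth.
days <- c(rep(0, 7), 2, 5, 30)

mean_days   <- mean(days)    # pulled upward by the single value of 30
median_days <- median(days)  # unaffected by the tail

c(mean = mean_days, median = median_days)
```

The mean is several days while the median is zero, even though seven of the ten made-up respondents reported no such days at all.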

The overall conclusion for research question 1 is that among individuals whose general health is at least good and who exercise, there are some differences in the physical and mental health conditions of those in the two middle sleep groups compared to those who sleep less than 5 hours or 9 or more hours on average. The healthy adults in the 5-7 and 7-9 hour groups perform similarly on this measure of health.

Research question 2:

The second research question explores the relationship between the amount of sleep and difficulty concentrating, remembering, or making decisions. The response variable of interest is decide - a binary variable indicating whether a respondent has serious difficulty concentrating, remembering, or making decisions:

summary(my.data.RQ1$decide)
##    Yes     No   NA's 
##  38662 188935   1442

As we can see, there are 1,442 missing values that need to be removed from the dataset before the analysis. The filtered data are saved in a new data frame:

my.data.RQ2 <- my.data.RQ1 %>%
    filter(!is.na(decide))
summary(my.data.RQ2$decide)
##    Yes     No 
##  38662 188935

Because the response variable is binary, I need to calculate the relative frequency of those who responded “Yes” to the question within each sleep-duration group:

my.data.RQ2 %>% 
    filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
    group_by(how.much.sleep) %>%
    summarise(n.Count=n(), HaveDifficulty.Proportion= sum(decide == "Yes")/n())
## # A tibble: 4 x 3
##   how.much.sleep n.Count HaveDifficulty.Proportion
##            <chr>   <int>                     <dbl>
## 1            0-5    4182                0.27594452
## 2            5-7   41851                0.10761989
## 3            7-9   81152                0.06640625
## 4      9 or more   10005                0.15072464

The results show that 27.6 percent of those generally healthy adults in the sample who sleep less than 5 hours, 15.1 percent of those who sleep 9 or more hours, 10.8 percent of those in the 5-7 hour group, and only 6.6 percent of those in the 7-9 hour group have difficulty concentrating, remembering, or making decisions. In other words, among the generally healthy individuals in this sample, those in the 7-9 hour group are least likely to experience serious difficulty concentrating, remembering, or making decisions compared to those adults who, on average, sleep more or less than that.
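The proportion of “Yes” answers in each group is simply the mean of a logical vector, because TRUE counts as 1 and FALSE as 0. The base R sketch below shows the idea on a made-up data frame. The object names toy and prop_yes are illustrative, and tapply() is used here only to avoid depending on any package.

```r
# Made-up responses for two sleep groups.
toy <- data.frame(
  how.much.sleep = c("0-5", "0-5", "7-9", "7-9", "7-9", "7-9"),
  decide         = c("Yes", "No", "No", "No", "Yes", "No")
)

# Mean of (decide == "Yes") within each group = share answering "Yes".
prop_yes <- tapply(toy$decide == "Yes", toy$how.much.sleep, mean)

prop_yes
```

This gives the same quantity as sum(decide == "Yes") / n() inside summarise().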

The discussed differences can be clearly seen in the following relative frequency bar plot, where the size of the red areas represents the proportion of those who have difficulty concentrating, remembering, or making decisions:

my.data.RQ2 %>% 
    filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
    ggplot(aes(x = how.much.sleep, fill = decide)) + geom_bar(position="fill") + labs(y = "Proportion")

The overall conclusion following from the analysis is that in this sample the duration of sleep is related to the ability to concentrate, remember, or make decisions among adults whose general health is at least good and who exercise. In this sample, individuals in the 7-9 hour group perform the best on this measure. Adults who sleep less than 5 hours or 9 or more hours are most likely to have serious difficulty concentrating, remembering, or making decisions. Even those who sleep 5-7 hours on average are more likely to have such a difficulty compared to those in the 7-9 hour group.

Research question 3:

The third research question explores the relationship between having a condition such as asthma and the average duration of sleep an individual gets. Here, having asthma (asthnow) is the explanatory variable, and the duration of sleep (how.much.sleep) is the response variable. The following is the summary for the variable asthnow:

summary(my.data.RQ2$asthnow)
##    Yes     No   NA's 
##  28816  10828 187953

The filtered data without the missing values on the asthnow variable are saved into a new data frame my.data.RQ3:

my.data.RQ3 <- my.data.RQ2 %>%
    filter(!is.na(asthnow))
summary(my.data.RQ3$asthnow)
##   Yes    No 
## 28816 10828

The following code groups the data by the values of the asthma variable and calculates the proportion of respondents in each sleep-duration group, separately for individuals with and without asthma:

my.data.RQ3 %>%
    #filter(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good" & exerany2 == "Yes") %>%
    group_by(asthnow) %>%
    summarise(n=n(), sl.0to5h= sum(how.much.sleep == "0-5")/n(), sl.5to7h = sum(how.much.sleep == "5-7")/n(), sl.7to9h= sum(how.much.sleep == "7-9")/n(), sl.9plus= sum(how.much.sleep == "9 or more")/n())
## # A tibble: 2 x 6
##   asthnow     n   sl.0to5h  sl.5to7h  sl.7to9h   sl.9plus
##    <fctr> <int>      <dbl>     <dbl>     <dbl>      <dbl>
## 1     Yes 28816 0.10254720 0.3646932 0.4314964 0.10126319
## 2      No 10828 0.06464721 0.3511267 0.4958441 0.08838197

The summary shows that 49.6 percent of those who have no asthma and 43.1 percent of those with asthma are in the 7-9 hour group. In other words, those adults in the sample who have asthma are less likely to get the recommended amount of sleep.
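The same kind of row-wise breakdown can be produced compactly from a contingency table. The sketch below is a base R alternative on made-up data, not the code used above: prop.table() with margin = 1 divides each row by its total, so each asthma group's proportions sum to 1.

```r
# Made-up respondents: four with asthma, four without.
toy <- data.frame(
  asthnow        = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
  how.much.sleep = c("0-5", "5-7", "7-9", "7-9", "5-7", "7-9", "7-9", "7-9")
)

# Counts of sleep group within each asthma status.
counts <- table(toy$asthnow, toy$how.much.sleep)

# Row-wise proportions: the distribution of sleep duration within each status.
row_props <- prop.table(counts, margin = 1)

row_props
```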

This relationship is clearly seen on the relative frequency, stacked bar plot:

ggplot(my.data.RQ3, aes(x = asthnow, fill = how.much.sleep)) + geom_bar(position="fill") + labs(y = "Proportion")  

The red area of the plot shows that the respondents who reported having asthma are more likely to get less than 5 hours of sleep (10.3 percent versus 6.5 percent).

Overall, the findings of the exploratory analysis show that there are links between the following variables of interest:
- the average number of days when physical or mental health kept one from doing usual activities, such as self-care, work, or recreation, varies with the average duration of sleep for healthy individuals who exercise.
- the probability that one has a serious difficulty concentrating, remembering, or making decisions is associated with the amount of sleep one gets in healthy individuals who exercise.
- the probability that an adult in the sample gets the recommended amount of sleep (7-9 hours) is related to whether one has a condition such as asthma.

Because the sample was collected using random sampling techniques, the discussed relationships may hold in the population of U.S. adults. However, the discussed links are correlational, which means that the analysis doesn’t provide evidence for causal relationships between the explanatory and response variables examined in this project.