Exploring the BRFSS data

Part 3: Exploratory data analysis

Research quesion 1:

To answer the first question it is first important to characterize the variables of interest. These include the following, height3 weight2 fruitju1 fruit1 fvbeans fvgreen fvorang vegetab1

A newdata set with these variables called newdata was prepared using select function.

newdata<-brfss2013%>%
  select(height3,weight2,fruitju1,fruit1,fvbeans,fvgreen,fvorang,vegetab1)

To begin, we must first find BMI with variables of interest being weight2 and height3. Let us elucidate the structure of these two variables.

str(brfss2013$weight2)

##  Factor w/ 570 levels "",".b","100",..: 154 30 63 31 169 128 9 1 139 73 ...

str(brfss2013$height3)

##  int [1:491775] 507 510 504 504 600 503 500 505 602 505 ...

height3 is an integer varibale while weight2 is reported as a factor with 570 levels. Thus, while computing bmi it will be necessary to convert the weight2 varibale to an integer/ numeric variable.

Let us consider the height3 variable first

summary(brfss2013$height3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0   504.0   506.0   551.1   510.0  9509.0    7623

brfss2013%>%
  group_by(height3)%>%
  summarise(n())

## # A tibble: 156 x 2
##    height3 `n()`
##      <int> <int>
##  1       1     1
##  2       3     2
##  3     206     1
##  4     210     1
##  5     300     5
##  6     301     1
##  7     302     1
##  8     303     1
##  9     304     2
## 10     305     2
## # ... with 146 more rows

It was evident from the above, that the height3 variable contained 7623 values with NA and 3 values less than 200. To further investigate on this varibale, codebook was referred.

Codebook stated that the values reported between 200-711 were data in feet and inches where the first number corresponded with the feets and the last two with the inches.(However from the data it was evident from the data that the values reported were between 200-811) Other coditions included data with value 7777 indicating person not knowing or not sure. The data values from 9000-9998 indicated values in centimeters where the last three digits of the value where the actual height in cm.(9 indicated that reported value is in standard units) A value of 9999 indicated that the person refused to answer.

Thus, it was important to convert the raw height data into actual heights of the people in meters (as bmi = weight(in kg)/ height^2(in m)).

To do so, the following function was designed and named heightf.

The heightf function would first sort the values into whether it was reported in cm or ft (acomplished using if function) and then convert them into m. To convert form ft and inch to m, the height was first converted into inches using formula ((first_digit)*12)+(last two digits) and then converted into m by multiplying with 0.0254. To convert from cm to m, the last three digits were divided by 100.

heightf<-function(x)
{
  if((x>=200)&(x<=211))
  {
    x=((x-200)+24)*0.0254
  }
  else if((x>=300)&(x<=311))
  {
    x=((x-300)+36)*0.0254
  }
  else if((x>=400)&(x<=411))
  {
    x=((x-400)+48)*0.0254
  }  
  else if((x>=500)&(x<=511))
  {
    x=((x-500)+60)*0.0254
  }
  else if((x>=600)&(x<=611))
  {
    x=((x-600)+72)*0.0254
  }
  else if((x>=700)&(x<=711))
  {
    x=((x-700)+84)*0.0254
  }
  else if((x>=800)&(x<=811))
  {
    x=((x-800)+96)*0.0254
  }
  else if(x>=9000)
  {
    x=(x-9000)/100
  }
}

Thus, the newdata has to be first filtered to remove any values less than 200 (can be due to error), data with values 7777, 9999 an NA

newdata<-newdata%>%
  filter(!is.na(height3))%>%
  filter(height3>=200)%>%
  filter(height3!=7777)%>%
  filter(height3!=9999)

The function heightf was now run on the height3 values to convert height values in meter which were then stored in new variable ht_m. A ‘for’ loop was used to convert each height3 value into meters and store in ht_m.

ht_m=1
for(i in 1:length(newdata$height3))
{
  ht_m[i]=heightf(newdata$height3[i])
}

To add these values to newdata with cbind function was used.

newdata<-cbind(newdata,ht_m)

With height3 variable being taken care of, now let’s move onto weight2 variable. It is first important to convert the variable type to numeric. To do this, as.numeric and as.character function is used. If as.numeric function is directly used then the values would correspod to factor levels and not the actual value.

A numeric value of weight is stored in variable named wt.This is done using the mutate function.

newdata<-newdata%>%
  mutate(wt=as.numeric(as.character(weight2)))

## Warning in evalq(as.numeric(as.character(weight2)), <environment>): NAs
## introduced by coercion

Now let us analyise the wt varibale,

newdata%>%
  group_by(wt)%>%
  summarise(n())

## # A tibble: 559 x 2
##       wt `n()`
##    <dbl> <int>
##  1    50     8
##  2    51     2
##  3    53     2
##  4    54     2
##  5    55     6
##  6    57     2
##  7    58     3
##  8    60     4
##  9    62     2
## 10    63     2
## # ... with 549 more rows

So the minimum value reported is 50 and there is data with NA values. Codebook was now examined.

Codebook states that the values from 50-0999 are the weights of respective individuals in pounds. Values between 9000-9998 indicate that the weight is reported in kg where the last three digits of the number are the actual values. Values of 7777 indicate that the individual did not know their weight and value of 9999 indicated that the individual refused to answer.

Thus, the data needs to be first filtered to remove data with values 7777, 9999 and NA

newdata<-newdata%>%
  filter(!is.na(wt))%>%
  filter(wt!=7777)%>%
  filter(wt!=9999)

A function weightf was created to convert the weight values into standard weight in kg. The function first sorts the data based on wether it was reported in pounds or kg using the if statment. It then converts the data reported in pounds by multiplying with 0.4535924.

weightf<-function(x)
{
  if(x<9000)
  {
    x=x*0.4535924
  }
  else
  {
    x=x-9000
  }
}

The function weightf was now run on the weight3 values to convert weight values in kg which were then stored in new variable wt_kg. A ‘for’ loop was used to convert each weight2 value and store it in wt_kg. To add these values to newdata with cbind function was used.

wt_kg=1
for(i in 1:length(newdata$wt))
{
wt_kg[i]=weightf(newdata$wt[i])
}
newdata<-cbind(newdata,wt_kg)

New variable bmi was created to store the calcuated values of BMI. The values were calculated and stored in a single step using mutate function.

newdata<-newdata%>%
  mutate(bmi=wt_kg/(ht_m*ht_m))

To bmi values were summarised ad below,

summary(newdata$bmi)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    4.722   23.630   26.631   27.830   30.814 2825.143

The calculated values does seem to contain a few errrors. This is evident from the mimimum and maximum values as it highly unikely that the individuals have a bmi as high as 2800.

An inspection into the calculated variable code book indicated that permissible values of bmi were between 12 and 99.99 (See. _bmi5 variable). Thus, the data needed to be filtered to only contain these values.

newdata<-newdata%>%
  filter(bmi>=12)%>%
  filter(bmi<=99.99)

To visualise the bmi data a histogram plot was made using the following code.

ggplot(data=newdata,aes(x=bmi))+geom_histogram(binwidth = 5)

Summary of calculated BMI results,

summary(newdata$bmi)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.02   23.63   26.63   27.82   30.81   97.69

The summary statistics above indicate that more than 50% of the US citizens have a BMI of more than 27. (The data sheds light on the prevelance of obesity epidemic in US)

Moving on to calculating parameters concerning the quantification of amount of fruits and vegetables consumed by the individuals.

Let us fist analyze the structure of all of them.

str(newdata$fruitju1)

##  int [1:467915] 304 305 301 202 0 205 320 0 202 303 ...

str(newdata$fruit1)

##  int [1:467915] 104 301 203 306 302 206 325 101 202 301 ...

str(newdata$fvbeans)

##  int [1:467915] 303 310 202 202 101 0 330 202 203 302 ...

str(newdata$fvgreen)

##  int [1:467915] 310 203 202 310 310 203 315 203 201 303 ...

str(newdata$fvorang)

##  int [1:467915] 303 202 310 305 303 0 310 0 201 0 ...

str(newdata$vegetab1)

##  int [1:467915] NA 203 330 204 101 207 310 101 203 308 ...

All the above variables are integer type. This implies that they can be directly used to compute the relevant data.

Let us first consider the variable ‘fruitjui1’ which gives the number of times an individual drinks 100% pure fruit juice.

Summary statistics were considered to view any errors in reporting

summary(newdata$fruitju1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0   101.0   132.7   300.0   399.0   30895

newdata%>%
  group_by(fruitju1)%>%
  summarise(n())

## # A tibble: 118 x 2
##    fruitju1  `n()`
##       <int>  <int>
##  1        0 165321
##  2        1      1
##  3      101  71469
##  4      102  11392
##  5      103   3716
##  6      104   1028
##  7      105    646
##  8      106    145
##  9      107    135
## 10      108     42
## # ... with 108 more rows

Codebook states that the values from 101-199 indicate the number of times an individual consumes 100% pure fruit in a day. Values from 201-299 indicate the number of times an individual consumes 100% pure fruit juice within a week. Values from 301-399 indicate the number of times an individual consumes 100% pure fr/it juice within a month. Data value of 300 indicate that the individual consumes 100% pure fruit juice less than once per month while the value 555 indicate that the individual never drinks 100% pure fruit juice. Value of 999 implies individual refused to answer.

Thus, the data frame first needs to be filtered to remove the data with values 999 and NA. Also values of 1 and 0 need to removed as codebook does not specify any noting pattern from them.

newdata<-newdata%>%
  filter(!is.na(fruitju1))%>%
  filter(fruitju1!=999)%>%
  filter(fruitju1!=1)%>%
  filter(fruitju1!=0)

To convert the reported data into the actual number of times individuals consumed 100% fruit juice within a day, function juicef was created. juicef function sort the data based on whether it was reported to denote the number of times per day or week or month one consumed juice and then converts them into number of times one consumes 100% fruit juice in a day.

juicef<-function(x)
{
  if((x>=101)&(x<=199))
  {
    x=(x-100)
  }
  else if((x>=201)&(x<=299))
  {
    x=((x-200)/7)
  }
  else if(x==300)
  {
    x=(0.9/30)
  }
  else if((x>=300)&(x<=399))
  {
    x=(x-300)/30
  }
  else if(x==555)
  {
    x=0
  }
}

The function juicef was now run on the variable fruitjui1, whose values were stored in jui_day which were then added to newdata data frame using cbind.

jui_day=1
for(i in 1:length(newdata$fruitju1))
{
  jui_day[i]=juicef(newdata$fruitju1[i])
}
newdata<-cbind(newdata,jui_day)

A thorough investigation of Codebook for variables fruitjui1, fruit1, fvbeans, fvgreen, fvorang and vegetab1 indicate that the data entry pattern for all them are same. The pattern is as below, values between 101 and 199 indicate the number of times the concerned variable is consumed within a day. For example a fruit1 data value of 108 indicate, fruit consumption 8 times a day. Values between 201-299 indicate the number of times the concerned variable is consumed within a week. Values between 301-399 indicate the number of times the concerned varibale is consumed within a month. Value 300 indicate consumption less than one time per month while value of 555 indicate never consumed.Value of 999 indicate refusal to answer.

Thus, juicef function can be used to convert the reported values for variales fruit1, fvbeans, fvgreen, fvorang and vegetab1 into the actual number of times a particular variable is consumed within a day.

However, data frame first needs to analaysed and filtered for unnecessary data points. Consider variable fruit1,

newdata%>%
  group_by(fruit1)%>%
  summarise(n())

## # A tibble: 125 x 2
##    fruit1 `n()`
##     <int> <int>
##  1      0  5572
##  2      2     1
##  3    101 71624
##  4    102 36549
##  5    103 13153
##  6    104  2923
##  7    105  1487
##  8    106   289
##  9    107   158
## 10    108    49
## # ... with 115 more rows

Summarizing results for fruit1 shows, that there are values less than 101 which need to be filtered including values 999 and NA.

newdata<-newdata%>%
  filter(!is.na(fruit1))%>%
  filter(fruit1!=999)%>%
  filter(fruit1!=2)%>%
  filter(fruit1!=0)

The function juicef was used to convert raw data into usable values and cbind function was used to add these values to new data frame. The actual values are stored in variable frut_day.

frut_day=1
for(i in 1:length(newdata$fruit1))
{
  frut_day[i]=juicef(newdata$fruit1[i])
}
newdata<-cbind(newdata,frut_day)

Now analyzing variable fvbeans,

newdata%>%
  group_by(fvbeans)%>%
  summarise(n())

## # A tibble: 103 x 2
##    fvbeans `n()`
##      <int> <int>
##  1       0 34945
##  2     101 12287
##  3     102  2313
##  4     103   650
##  5     104   174
##  6     105   132
##  7     106     8
##  8     107    16
##  9     108     4
## 10     109     1
## # ... with 93 more rows

From the above, it is evident that data needs to be filtered for values 0, 999 and NA

newdata<-newdata%>%
  filter(!is.na(fvbeans))%>%
  filter(fvbeans!=999)%>%
  filter(fvbeans!=0)

The function juicef was used to convert raw data into usable values and cbind function was used to add these values to new data frame. The actual values are stored in variable bean_day.

bean_day=1
for(i in 1:length(newdata$fvbeans))
{
  bean_day[i]=juicef(newdata$fvbeans[i])
}
newdata<-cbind(newdata,bean_day)

Now analysing variable fvgreen,

newdata%>%
  group_by(fvgreen)%>%
  summarise(n())

## # A tibble: 110 x 2
##    fvgreen `n()`
##      <int> <int>
##  1       0  9194
##  2     101 31258
##  3     102  8144
##  4     103  1536
##  5     104   350
##  6     105   261
##  7     106    26
##  8     107    29
##  9     108     1
## 10     110     5
## # ... with 100 more rows

From the above, it is evident that data needs to be filtered for values 0, 999 and NA

newdata<-newdata%>%
  filter(!is.na(fvgreen))%>%
  filter(fvgreen!=999)%>%
  filter(fvgreen!=0)

The function juicef was used to convert raw data into usable values and cbind function was used to add these values to new data frame. The actual values are stored in variable green_day.

green_day=1
for(i in 1:length(newdata$fvgreen))
{
  green_day[i]=juicef(newdata$fvgreen[i])
}
newdata<-cbind(newdata,green_day)

Now analysing variable fvorang

newdata%>%
  group_by(fvorang)%>%
  summarise(n())

## # A tibble: 93 x 2
##    fvorang `n()`
##      <int> <int>
##  1       0 15835
##  2     101 12552
##  3     102  1833
##  4     103   437
##  5     104   127
##  6     105    89
##  7     106    10
##  8     107     7
##  9     109     1
## 10     110     2
## # ... with 83 more rows

From the above, it is evident that data needs to be filtered for values 0, 999 and NA

newdata<-newdata%>%
  filter(!is.na(fvorang))%>%
  filter(fvorang!=999)%>%
  filter(fvorang!=0)

The function juicef was used to convert raw data into usable values and cbind function was used to add these values to new data frame. The actual values are stored in variable orang_day.

orang_day=1
for(i in 1:length(newdata$fvorang))
{
  orang_day[i]=juicef(newdata$fvorang[i])
}
newdata<-cbind(newdata,orang_day)

Now let us analayse the variable vegetab1

newdata%>%
  group_by(vegetab1)%>%
  summarise(n())

## # A tibble: 109 x 2
##    vegetab1 `n()`
##       <int> <int>
##  1        0   723
##  2      101 48611
##  3      102 17100
##  4      103  3078
##  5      104   620
##  6      105   341
##  7      106    56
##  8      107    50
##  9      108     8
## 10      109     3
## # ... with 99 more rows

From the above, it is evident that data needs to be filtered for values 0, 999 and NA

newdata<-newdata%>%
  filter(!is.na(vegetab1))%>%
  filter(vegetab1!=999)%>%
  filter(vegetab1!=0)

The function juicef was used to convert raw data into usable values and cbind function was used to add these values to new data frame. The actual values are stored in variable veg_day.

veg_day=1
for(i in 1:length(newdata$vegetab1))
{
  veg_day[i]=juicef(newdata$vegetab1[i])
}
newdata<-cbind(newdata,veg_day)

Now the data seems to be ready to be analysed for relation between eating habits and BMI. However an inspection in calculated variable Codebook suggested that data needs to be screened to remove data values where the total fruit intake in a day of an individual is greater than 16 and total vegetable intake per day of an individual is greater than 23 per day.

This task accomplished using mutate function to create two new variables total_fruit and total_veg to account for total fruit intake and total vegetable intake of an individual. Data frame is then screened to remove total_fruit values greater than 16 and total_veg values greater than 23.

newdata<-newdata%>%
  mutate(total_fruit=jui_day+frut_day)
newdata<-newdata%>%
  mutate(total_veg=bean_day+green_day+orang_day+veg_day)
newdata<-newdata%>%
  filter(total_fruit<=16)%>%
  filter(total_veg<=23)

Let us summarise and visualize the above two parameters

summary(newdata$total_fruit)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0600  0.9333  1.4286  1.7048  2.0667 16.0000

ggplot(data=newdata,aes(x=total_fruit))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(newdata$total_veg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.120   1.333   1.857   2.144   2.619  22.000

ggplot(data=newdata,aes(x=total_veg))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the above, it is evident that average total fruit and vegetable consumption is twice in a day.

Plotting total_veg and total_fruit against bmi to visualize relation between the two.(log transformation is done for BMI for better visulaization of data)

ggplot(data=newdata,aes(x=total_fruit,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

ggplot(data=newdata,aes(x=total_veg,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

Evident from the graphs below, the results indicate that there is a negative correlation between vegetable and food intake (times a day one consumes fruits or vegetables) and BMI. This implies lower BMI values are correlated with higher vegetable and fruit intake.

To quantify whether the correlation is significant a linear model can be built and the main effects can then be tested for statistical significance.

model1<-lm(log(newdata$bmi)~newdata$total_veg*newdata$total_fruit)
summary(model1)

## 
## Call:
## lm(formula = log(newdata$bmi) ~ newdata$total_veg * newdata$total_fruit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82410 -0.13537 -0.01518  0.11778  1.27895 
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            3.3387975  0.0013219 2525.72
## newdata$total_veg                     -0.0131660  0.0005574  -23.62
## newdata$total_fruit                   -0.0136330  0.0005905  -23.09
## newdata$total_veg:newdata$total_fruit  0.0019016  0.0001622   11.72
##                                       Pr(>|t|)    
## (Intercept)                             <2e-16 ***
## newdata$total_veg                       <2e-16 ***
## newdata$total_fruit                     <2e-16 ***
## newdata$total_veg:newdata$total_fruit   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.199 on 195866 degrees of freedom
## Multiple R-squared:  0.008531,   Adjusted R-squared:  0.008516 
## F-statistic: 561.8 on 3 and 195866 DF,  p-value: < 2.2e-16

The small p value indicates high significance of the model. The main effects total_veg and total_fruit are highly significant (indicated by very small p value). Also the variables have negative coefficient suggesting a negative correlation. Total fruit intake seems to be slightly more correlated with BMI as suggested by magnitude of main effects, but the difference seems to be rather pretty small.

Now let us find out effect of individual parameters i.e jui_day, frut_day, bean_day, green_day, orang_day and veg_day; and find out which of them has the most significant correlation with BMI.

ggplot(data=newdata,aes(x=jui_day,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

ggplot(data=newdata,aes(x=frut_day,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

ggplot(data=newdata,aes(x=bean_day,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

ggplot(data=newdata,aes(x=green_day,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

ggplot(data=newdata,aes(x=orang_day,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

ggplot(data=newdata,aes(x=veg_day,y=log(bmi)))+geom_point()+geom_smooth(method=lm)

From the above graphs line orang_day vs log(bmi) has the steepest slope followed by green_day indicating a greater relation with BMI.

Now to find the variable with highest correlation, we can develop a linear model and then compare which effects are significant and which are not.

model2<-lm(log(newdata$bmi)~newdata$jui_day+newdata$frut_day+newdata$bean_day+newdata$green_day+newdata$orang_day+newdata$veg_day)
summary(model2)

## 
## Call:
## lm(formula = log(newdata$bmi) ~ newdata$jui_day + newdata$frut_day + 
##     newdata$bean_day + newdata$green_day + newdata$orang_day + 
##     newdata$veg_day)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82761 -0.13556 -0.01536  0.11785  1.28771 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        3.326e+00  9.532e-04 3489.115   <2e-16 ***
## newdata$jui_day   -5.982e-03  6.992e-04   -8.556   <2e-16 ***
## newdata$frut_day  -9.755e-03  5.486e-04  -17.781   <2e-16 ***
## newdata$bean_day  -1.072e-03  1.220e-03   -0.879    0.379    
## newdata$green_day -1.583e-02  9.538e-04  -16.599   <2e-16 ***
## newdata$orang_day -1.806e-02  1.349e-03  -13.380   <2e-16 ***
## newdata$veg_day   -7.225e-05  7.650e-04   -0.094    0.925    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1989 on 195863 degrees of freedom
## Multiple R-squared:  0.009127,   Adjusted R-squared:  0.009097 
## F-statistic: 300.7 on 6 and 195863 DF,  p-value: < 2.2e-16

INFERENCES:

The findings from the model are in accordance with the grpahs. All of the main effects have negative coefficients suggesting an increase of the main effect correlating with lower BMI values.

Overall low p-value suggest that the correlation between individual variables and BMI is statistically significant.

However surprisingly other types of vegetables that are consumed and beans that are consumed have no statistically significant correlation with lower BMI.

The data also suggests that the fruit intake is more strongly correlated with lower BMI than drinking fruit juice.

Thus, exploratory data analysis of the above variables suggest that vegetable and fruit intake per day are negatively correlated with BMI values. However the consumption orange vegetables per day (orang_day variable) seems to be most strongly corelated with lower BMI values whereas number of times one consumes beans and other vegetables(bean_day variable and veg_day variable respectively) are not correlated with BMI.

Although the data suggests a correlation and no causal conclusions can be drawn, it however provides evidence for eating more orange vegetables and green vegetables as doing so is correlated with lower BMI values. Data also suggests eating other vegetables than green and orange is not significantly correlated with lower BMI. (This makes sense to some degree given that the type of vegetables that might fall in this category would be potatoes, which although contain an array of complex carbohydrates still contain a lot of starch which can be attributed to somewhat higher BMI’s). Also eating fruits rather than consuming 100% fruit juices seems to be more beneficial, as eating fruits is more strongly correlated with lower BMI than fruit juices (It seems logical as fruits will also contain a lot more fiber than fruit juices; consumption of which can significantly lower body weight and thus overall BMI).

Research quesion 2: In order to answer question 2, variables physhlth, menthlth, genhlth and sleptim1 needs to be considered. The above variables represent number of physical days not good, number of mental days not good, general health and amount of time one sleeps, respectively.

These variables are analysed first using str() function.

str(brfss2013$genhlth)

##  Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...

str(brfss2013$physhlth)

##  int [1:491775] 30 0 3 2 10 0 1 5 0 0 ...

str(brfss2013$menthlth)

##  int [1:491775] 29 0 2 0 2 0 15 0 0 0 ...

str(brfss2013$sleptim1)

##  int [1:491775] NA 6 9 8 6 8 7 6 8 8 ...

Except general health all are integer variables while genhlth is a factor with 5 levels.

A newdata frame is created to analyse these variables further.

newdata1<-brfss2013%>%
  select(genhlth,physhlth,menthlth,sleptim1)

Let us first analyse the general health variable

newdata1%>%
  group_by(genhlth)%>%
  summarise(n())

## # A tibble: 6 x 2
##     genhlth  `n()`
##      <fctr>  <int>
## 1 Excellent  85482
## 2 Very good 159076
## 3      Good 150555
## 4      Fair  66726
## 5      Poor  27951
## 6      <NA>   1985

The above results imply that data has to filtered to remove NA values.

newdata1<-newdata1%>%
  filter(!is.na(genhlth))

Now analysing variables physhlth and menthlth

newdata1%>%
  group_by(physhlth)%>%
  summarise(n())

## # A tibble: 32 x 2
##    physhlth  `n()`
##       <int>  <int>
##  1        0 303519
##  2        1  20068
##  3        2  26731
##  4        3  15805
##  5        4   8330
##  6        5  14162
##  7        6   2422
##  8        7   9016
##  9        8   1538
## 10        9    349
## # ... with 22 more rows

newdata1%>%
  group_by(menthlth)%>%
  summarise(n())

## # A tibble: 33 x 2
##    menthlth  `n()`
##       <int>  <int>
##  1        0 333336
##  2        1  15162
##  3        2  23461
##  4        3  13556
##  5        4   6643
##  6        5  16612
##  7        6   1854
##  8        7   6332
##  9        8   1239
## 10        9    196
## # ... with 23 more rows

physhlth and menthlth varibales have to be screened to remove NA entries but menthlth also has to be screened to remove values greater than 30. Although codebook suggests that value 88 represents no unhealthy days and actual values of both varibales to be between 1-30, above data suggests that 0 is used instead of 88 to represent no bad days.

newdata1<-newdata1%>%
  filter(!is.na(physhlth))%>%
  filter(!is.na(menthlth))%>%
  filter(menthlth<=30)

Now consider variable sleptim1,

newdata1%>%
  group_by(sleptim1)%>%
  summarise(n())

## # A tibble: 25 x 2
##    sleptim1  `n()`
##       <int>  <int>
##  1        1    211
##  2        2    986
##  3        3   3248
##  4        4  13483
##  5        5  31868
##  6        6 102640
##  7        7 138794
##  8        8 136145
##  9        9  22903
## 10       10  11445
## # ... with 15 more rows

Even sleptim1 variable has to be screened for NA values.

newdata1<-newdata1%>%
  filter(!is.na(sleptim1))

Rather than considering the number of days person felt unhealthy in a month, number of days where person felt healthy can be calculated by subtracting physhlth and menthlth from 30. phy_healthy variable is created to denote the number of days an individual was physically healthy within a month and ment_healthy variable represents number of days an individual had good mental health within a month.

newdata1<-newdata1%>%
  mutate(phys_healthy=30-physhlth)%>%
  mutate(ment_healthy=30-menthlth)

First let us find out relation between physical health, mental health and sleep.

ggplot(data=newdata1,aes(x=sleptim1,y=phys_healthy))+geom_point()+geom_smooth()

## `geom_smooth()` using method = 'gam'

ggplot(data=newdata1,aes(x=sleptim1,y=ment_healthy))+geom_point()+geom_smooth()

## `geom_smooth()` using method = 'gam'

Both plots indicate a curved relationship implying that an increase in sleep time is correlated with increase in number of days but increasing sleep time even more is correlated with a decrease in number of days a person is both physically and mentally healthy.

Now let us find the relation bewteen sleep time and general health

newdata1%>%
  select(sleptim1,genhlth)%>%
  group_by(genhlth)%>%
  summarise(mean(sleptim1),median(sleptim1),IQR(sleptim1),n())

## # A tibble: 5 x 5
##     genhlth `mean(sleptim1)` `median(sleptim1)` `IQR(sleptim1)`  `n()`
##      <fctr>            <dbl>              <dbl>           <dbl>  <int>
## 1 Excellent         7.190473                  7               1  83786
## 2 Very good         7.103658                  7               2 155135
## 3      Good         7.036858                  7               2 143251
## 4      Fair         6.888660                  7               2  60562
## 5      Poor         6.728866                  6               3  24759

ggplot(data=newdata1,aes(x=genhlth,y=sleptim1))+geom_boxplot()

From the above results it is evident that people who got around 7 hours of sleep reported in general good health, while people getting less than 7 hours of sleep reported having poor health.

Now, let us find out sleep has more impact on physical health or mental health.

model3<-lm(newdata1$phys_healthy~newdata1$sleptim1+I(newdata1$sleptim1^2))
summary(model3)

## 
## Call:
## lm(formula = newdata1$phys_healthy ~ newdata1$sleptim1 + I(newdata1$sleptim1^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.741   1.259   3.259   3.730  66.079 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.48245    0.13220   64.17   <2e-16 ***
## newdata1$sleptim1       4.35184    0.03327  130.82   <2e-16 ***
## I(newdata1$sleptim1^2) -0.25869    0.00212 -122.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.565 on 467490 degrees of freedom
## Multiple R-squared:  0.03586,    Adjusted R-squared:  0.03586 
## F-statistic:  8694 on 2 and 467490 DF,  p-value: < 2.2e-16

model4<-lm(newdata1$ment_healthy~newdata1$sleptim1+I(newdata1$sleptim1^2))
summary(model4)

## 
## Call:
## lm(formula = newdata1$ment_healthy ~ newdata1$sleptim1 + I(newdata1$sleptim1^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.087   1.063   2.175   2.891  52.547 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.36106    0.11475   81.58   <2e-16 ***
## newdata1$sleptim1       4.12679    0.02888  142.91   <2e-16 ***
## I(newdata1$sleptim1^2) -0.22734    0.00184 -123.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.435 on 467490 degrees of freedom
## Multiple R-squared:  0.04791,    Adjusted R-squared:  0.04791 
## F-statistic: 1.176e+04 on 2 and 467490 DF,  p-value: < 2.2e-16

A quadratic term was added to take care of the bell shape of the predicted curve. Due to such low p-value it is evident that sleep does have a positive correlation with number of days a person is mentally and physically healthy. From the model’s coefficients it is evident that the sleep is slight more correlated with physical health than mental health. It can also be observed that sleep more than around 8 hours tends to linked with decreased number of healthy days a person experiences.

Research quesion 3:

To answer question three, variables diabedu, diabete3 and genhlth are of concern. Diabedu varibale tells if the person has taken a class in managing diabetes or not whereas variable diabete3 tells if the person has diabetes or not. genhlth variable tells about the general health of the individual.

First a data frame with these three variables is created.

newdata2<-brfss2013%>%
  select(genhlth,diabete3,diabedu)

From previous question it could be noticed that genhlth variable is a factor with 5 levels and that data also needs to be screened for NA values.

newdata2<-newdata2%>%
  filter(!is.na(genhlth))

Now let us analyse the variable diabedu and diabete3.

str(newdata2$diabedu)

##  Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA NA NA NA NA ...

str(newdata2$diabete3)

##  Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 3 3 3 3 3 3 3 3 3 3 ...

Both variables are factors with 2 levels and 4 levels respectively. Since the question of interest only concerns the people who have been told that they have diabetes, the data has to be filtered to only include values of diabete3 as Yes.

varibale diabedu has to be screened to remove NA values then.

newdata2<-newdata2%>%
  filter(diabete3=="Yes")%>%
  filter(!is.na(diabedu))

People with Good health can be defined as people who have not reported their health status as fair and poor. Thus people with fair or poor general health status are considered to be unhealthy. A new variable good_hlth can be created to indicate if the person is healthy or not. If the person reports fair or poor health status then good_hlth variable report value no or else yes. Mutate dunction and ifelse statement is used to accomplish the above task.

newdata2<-newdata2%>%
  mutate(good_hlth=ifelse(((genhlth=="Fair")|(genhlth=="Poor")),"No","Yes"))

Now we can answer the research question by comparing the people who have taken a diabetes management class and their good_health status.

newdata2%>%
  group_by(diabedu,good_hlth)%>%
  summarise(n())

## # A tibble: 4 x 3
## # Groups:   diabedu [?]
##   diabedu good_hlth `n()`
##    <fctr>     <chr> <int>
## 1     Yes        No  8931
## 2     Yes       Yes 10596
## 3      No        No  8162
## 4      No       Yes  8412

The above results can be visualized in a grouped bar plot,

barplot(table(newdata2$diabedu,newdata2$good_hlth),beside=TRUE,ylab="Number of People",xlab="Status on diabetes management class", legend=rownames(table(newdata2$diabedu,newdata2$good_hlth)))

It seems from the bar plot above that there are more number of people who have good health and taken a diabetes management class.

To quantify this difference statistically, prop.test() function can be used.

prop.test(x=c(10596,8412),n=c(10596+8931,8412+8162),correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  c(10596, 8412) out of c(10596 + 8931, 8412 + 8162)
## X-squared = 44.282, df = 1, p-value = 2.843e-11
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.02475913 0.04542356
## sample estimates:
##    prop 1    prop 2 
## 0.5426333 0.5075419

Given that all have diabetes, 54.26% of people have taken a diabetes management class and have reported good health. However only 50.75% of the people have not taken a diabetes management class but have still reported good health.

A low p value does indicate that the difference between the two proportions is significant. This means that greater number of people have reported their health status as good who have taken diabetes management class than the people who have not taken a diabetes management class given that the individuals have been told that they have diabetes. Thus evident from statistical analysis it can be concluded that taking a diabetes management class given that you have diabetes is correlated with improved health status of the individual.

                                 *END*

Exploring the BRFSS data

Setup

Load packages

Load data

Part 1: Data

Part 2: Research questions

Part 3: Exploratory data analysis