Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(DiagrammeR)

Load data

load("brfss2013.RData")
attach(brfss2013)
set.seed(1)

Part 1: Data

The data are gathered every month through phone calls to U.S. residents in each state. BRFSS uses a standardized questionnaire that guarantees the homogeneity of the data collected. More than 500,000 interviews were conducted across the United States.

This is a sample of the U.S. population in which the people called are chosen at random; therefore, this is random sampling. This implies that the information obtained from the data is generalizable to the U.S. residents of each state: if people are randomly chosen from the population, each subject is equally likely to be called, and the resulting sample is likely to be representative of the population.

The data are collected with a questionnaire, and everybody answers the same questions without being split into treatment groups; there is no random assignment. This implies that we can only find out whether there is an association between variables.

Putting both together, we can conclude that this is an observational study whose associations between variables can be generalized to the population; no causality can be inferred.

However, this sample could be exposed to non-response bias, since a non-random fraction of the randomly chosen subjects might not answer the phone call. People in the lower socioeconomic class are less likely to answer this questionnaire: they might not earn enough to afford a telephone line, or they might earn enough for a telephone line but not enough for healthcare, leading them to disregard the questionnaire.

Part 2: Research questions

Research question 1: Does having Medicare influence whether a person has ever taken an HIV test?

I assume that a person who has taken an HIV test is more likely to engage in risky behavior. Risky behavior is in turn correlated with default, temporary jobs and, in general, instability. If we could somehow assess this risk just by knowing whether the subject has Medicare, it would add significant value to future risk modelling. For this question I will use the variables hivtst6 (whether the respondent has ever taken an HIV test) and medicare (whether the respondent has Medicare).

Research question 2: What about the influence of having ever taken an HIV test on having Medicare?

Knowing the likelihood of a person having Medicare just by observing public data on HIV tests is a cheap way to gain insight into whom to target when offering future healthcare services. For this question I will again use the variables hivtst6 (whether the respondent has ever taken an HIV test) and medicare (whether the respondent has Medicare).

Research question 3: Are people concerned about reducing sugar consumption between January and June?

Finding sugar consumption patterns would help in designing a more suitable marketing strategy, for example a shift from sugary beverages between July and December to healthy or sugar-free beverages between January and June. I chose the variable ssbsugar (How often do you drink regular soda or pop?) and imonth, which contains the interview month. The variable ssbsugar takes many values; I kept the range [101 - 199] (times per day), which records daily consumption, and I count these responses per month as my measure of monthly consumption.

Part 3: Exploratory data analysis

Research question 1: Does having Medicare influence whether a person has ever taken an HIV test?

This question does not have an intuitive a priori answer, but let’s think about the behaviors that give rise to it. Having Medicare or not may reflect an expectation of future diseases or injuries regardless of risk preferences, or it may be directly related to risk preferences. If the latter is true, risk-averse people might be more willing to have Medicare, which suggests a character that keeps them from making risky decisions that might end in taking an HIV test. Hence, having Medicare could be related to the likelihood of a person having ever taken an HIV test, which is our approach to assessing risk.

Let’s have a look at the data:

##               hivtst6_yes hivtst6_no total_medicare
## medicare_yes        22138      98786         120924
## medicare_no         61354      99303         160657
## total_hivtst6       83492     198089         281581

Simply by looking at the matrix, we can see that the share of people who have ever taken an HIV test is smaller among people who have Medicare than among people who do not. We can confirm this by computing the relative frequencies:

\[ \begin{aligned} &P(hivtst6:yes\ |\ medicare:no)=\frac{61354}{160657}=0.3818943 \\ &P(hivtst6:yes\ |\ medicare:yes)=\frac{22138}{120924}=0.1830737 \\ &P(hivtst6:yes\ |\ medicare:no)-P(hivtst6:yes\ |\ medicare:yes)=0.1988206 \end{aligned} \]

The difference is almost 20 percentage points. Is it due to chance, or does having Medicare give us information about how likely it is for a person to have taken an HIV test? The following hypotheses reflect this dilemma:

\[ \begin{aligned} &H_0:Having\ medicare\ gives\ no\ information \\ &H_1:Having\ medicare\ gives\ information \end{aligned} \]

Since we will use the difference between the two conditional probabilities to test the hypotheses, we can rewrite them as follows:

\[ \begin{aligned} \\ &H_0:P(hivtst6:yes\ |\ medicare:no)=P(hivtst6:yes\ |\ medicare:yes) \\ &H_1:P(hivtst6:yes\ |\ medicare:no)\neq P(hivtst6:yes\ |\ medicare:yes) \\\\ \end{aligned} \]
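The relative frequencies above can be reproduced directly from the printed counts. The following is a minimal sketch that rebuilds the contingency table by hand instead of loading brfss2013; the object names (counts, p_hiv_given_medicare, observed_difference) are illustrative and not part of the original analysis.

```r
# Contingency table rebuilt from the counts printed above
# (rows: medicare, columns: hivtst6).
counts <- matrix(
  c(22138, 98786,
    61354, 99303),
  nrow = 2, byrow = TRUE,
  dimnames = list(medicare = c("yes", "no"), hivtst6 = c("yes", "no"))
)

# Row-wise proportions, i.e. P(hivtst6 | medicare)
p_hiv_given_medicare <- prop.table(counts, margin = 1)

p_yes_no_medicare  <- p_hiv_given_medicare["no", "yes"]   # 61354 / 160657
p_yes_has_medicare <- p_hiv_given_medicare["yes", "yes"]  # 22138 / 120924

# Observed difference that the simulated differences are compared against
observed_difference <- p_yes_no_medicare - p_yes_has_medicare

print(round(c(p_yes_no_medicare, p_yes_has_medicare, observed_difference), 7))
```

Using prop.table with margin = 1 keeps the conditioning explicit: each row sums to one, so every entry is a probability given the Medicare status of that row.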

I will randomly divide the observed data into two groups. One group will have 120,924 observations, the size of the group of people having Medicare. The other group will have 160,657 observations, the size of the group of people not having Medicare. I will calculate the proportion of people who said they have ever taken an HIV test in each group and compute the difference. I will repeat this random split many times and check how likely our observed difference is by comparing it with the simulated differences.

Before that, I have to prepare the data for the simulation, since it is more convenient to work with a data frame than with a matrix:

data.medicare <- brfss2013 %>% 
  select(hivtst6, medicare)
data.medicare <- na.exclude(data.medicare)
head(data.medicare)

##   hivtst6 medicare
## 1      No      Yes
## 2     Yes       No
## 3     Yes       No
## 4      No       No
## 5      No      Yes
## 6     Yes       No

This is what the data look like right now. I am going to create a vector containing 120,924 zeros and 160,657 ones, which are the sizes of the two groups into which I will randomly divide the data. Then I will shuffle this index so that the zeros and ones land in random positions, split the data frame using the shuffled index, compute the proportion of people who have taken an HIV test within each group, and compute the difference between these two values. Finally, I will repeat this process 1,000 times to obtain a distribution of simulated differences:

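# Group labels: 0 marks the 120,924 slots sized like the medicare == "Yes" group,
# 1 marks the 160,657 slots sized like the medicare == "No" group
index <- rep(0:1, c(120924, 160657))

# Preallocate the vector of simulated differences
differences <- numeric(1000)
for (i in seq_along(differences)) {
  # Shuffle the labels and split the data accordingly
  index <- sample(index)
  data1 <- data.medicare[index == 0, ]
  data2 <- data.medicare[index == 1, ]
  # Proportion of hivtst6 == "Yes" in the group sized like medicare == "Yes"
  value1 <- mean(data1$hivtst6 == "Yes")
  # Proportion of hivtst6 == "Yes" in the group sized like medicare == "No"
  value2 <- mean(data2$hivtst6 == "Yes")
  differences[i] <- value1 - value2
}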

Now it is time to plot this distribution to see how likely our sample difference is compared to the simulated data:

Since I am randomly splitting the data into two groups, we can expect similar proportions in both groups, so the simulated differences are centered on zero. The sample difference suggests a dependency between hivtst6 and medicare, and the simulation confirms that this figure is very unlikely to have arisen by chance: just by looking at the plot we can see that the likelihood of our sample difference under the null hypothesis is almost zero. Hence, we can reject the null hypothesis in favor of the alternative: there is a dependency between hivtst6 and medicare.

Taken together, this implies that people who have Medicare are less likely to have ever taken an HIV test, while people who do not have Medicare are more likely to have taken one.
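The visual judgment can be complemented with an empirical two-sided p-value. The sketch below is a shortcut under stated assumptions rather than the loop used in this report: instead of shuffling 281,581 labels, it draws the number of "Yes" answers that fall in the first group from a hypergeometric distribution with rhyper, which describes the same random split. All object names are illustrative, and the counts are copied from the table in this section.

```r
set.seed(1)

n_medicare_yes <- 120924  # size of the "medicare: yes" group
n_medicare_no  <- 160657  # size of the "medicare: no" group
n_hiv_yes      <- 83492   # respondents who have ever taken an HIV test
n_hiv_no       <- 198089  # respondents who have never taken one

# Randomly assigning respondents to a group of size n_medicare_yes is
# sampling without replacement, so the count of "Yes" answers in that
# group follows a hypergeometric distribution.
yes_in_group1 <- rhyper(nn = 1000, m = n_hiv_yes, n = n_hiv_no,
                        k = n_medicare_yes)

# Difference in proportions for each simulated split
sim_differences <- yes_in_group1 / n_medicare_yes -
  (n_hiv_yes - yes_in_group1) / n_medicare_no

# Observed difference: P(yes | medicare: no) - P(yes | medicare: yes)
observed_difference <- 61354 / n_medicare_no - 22138 / n_medicare_yes

# Two-sided empirical p-value: share of simulated differences that are
# at least as extreme as the observed one
p_value <- mean(abs(sim_differences) >= abs(observed_difference))

print(range(sim_differences))
print(p_value)
```

With 1,000 simulations, an empirical p-value of zero should be read as "smaller than 1 in 1,000" rather than as exactly zero.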

Research question 2: What about the influence of having ever taken an HIV test on having Medicare?

The previous question already showed the dependency between hivtst6 and medicare, so this time I will focus on computing the inverse conditional probabilities using probability trees, and on interpreting them.

Having computed the probability tree, we can derive the probability of having Medicare given whether or not the person took an HIV test:

\[ \begin{aligned} &P(medicare:yes\ |\ hivtst6:no)= \frac{P(medicare:yes\cap hivtst6:no)} {P(hivtst6:no\cap medicare:yes)+P(hivtst6:no\cap medicare:no)} \\ &=\frac{0.3508262}{0.3508262+0.3526623}=\frac{0.3508262}{0.7034885}=0.498695 \\\\ &P(medicare:yes\ |\ hivtst6:yes)= \frac{P(medicare:yes\cap hivtst6:yes)} {P(hivtst6:yes\cap medicare:yes)+P(hivtst6:yes\cap medicare:no)} \\ &=\frac{0.07862038}{0.07862038+0.2178911}=\frac{0.07862038}{0.2965115}=0.2651512 \end{aligned} \]

We can infer that almost 50% of the people who did not take an HIV test have Medicare, while around 27% of the people who took an HIV test have Medicare. This is similar to the conclusion obtained in the previous question: people who never took an HIV test are more likely to have Medicare, and people who did take an HIV test are less likely to have it.
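The inverse conditional probabilities above can also be checked numerically. This is a minimal sketch that rebuilds the joint probabilities from the counts of the previous question instead of reading them off the probability tree; the object names (counts, joint, p_medicare_given_hiv) are illustrative.

```r
# Counts from the contingency table of research question 1
# (rows: medicare, columns: hivtst6).
counts <- matrix(
  c(22138, 98786,
    61354, 99303),
  nrow = 2, byrow = TRUE,
  dimnames = list(medicare = c("yes", "no"), hivtst6 = c("yes", "no"))
)

# Joint probabilities, e.g. P(medicare: yes and hivtst6: no)
joint <- counts / sum(counts)

# Marginal probabilities of hivtst6 (the denominators in the formulas)
p_hiv <- colSums(joint)

# Divide each column by its marginal: P(medicare | hivtst6)
p_medicare_given_hiv <- sweep(joint, 2, p_hiv, "/")

print(round(p_medicare_given_hiv, 7))
```

Each column of p_medicare_given_hiv sums to one, so the "yes" row holds the two probabilities derived above.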

This insight might be useful for a health insurance company looking for potential customers. Although these customers carry implicit risk, charging higher fees to compensate for it would yield risk-premium profits.

Research question 3: Are people concerned about reducing sugar consumption before summer?

It seems reasonable to think that there might be a relationship between drinking sugary beverages and the month in which these drinks are consumed. If you think about your own consumption, your eating habits might vary depending on whether Christmas has just ended or summer is approaching.

datos <- brfss2013 %>%
  select(imonth, ssbsugar) %>%
  filter(ssbsugar %in% 101:199) %>%
  group_by(imonth) %>% 
  summarise(total = length(ssbsugar))

It is clear that consumption decreases between January and June, but one could ask: are there other underlying variables, such as sex, that influence these consumption behaviours and need to be blocked? It is best to first plot the monthly consumption of sugary beverages filtered by sex and compare the outputs.

datos1 <- brfss2013 %>%
  select(imonth, ssbsugar,sex) %>%
  filter(ssbsugar %in% 101:199, sex == 'Male') %>%
  group_by(imonth) %>% 
  summarise(total = n())

datos2 <- brfss2013 %>%
  select(imonth, ssbsugar,sex) %>%
  filter(ssbsugar %in% 101:199, sex == 'Female') %>%
  group_by(imonth) %>% 
  summarise(total = n())

Both seem to follow the same trend, so sex does not appear to bias the pattern. The only apparent difference is that male consumption decreases at an increasing rate while female consumption decreases at a decreasing rate; there is no need to block on the variable sex. Many other variables could affect the consumption pattern and could be checked, such as age, level of income, type of work, hours dedicated to exercise, etc. For didactic purposes, however, I will only check the sex variable.

Continuing with my research question, I will define the following hypotheses:

\[ \begin{aligned} &H_0: People\ are\ not\ concerned\ about\ reducing\ sugar\ consumption\ between\ January\ and\ June\\ &H_1: People\ are\ concerned\ about\ reducing\ sugar\ consumption\ between\ January\ and\ June \end{aligned} \]

The sample difference between the consumption in January and June is calculated as follows:

\[ \begin{aligned} &consumption\ in\ June\ -\ consumption\ in\ January \\ &=1137\ -\ 1425\ = -288\ units\ of\ sugary\ beverage \end{aligned} \]

The sample figures indicate that consumption fell by 288 units of sugary beverage. Is this significant? Does it imply that there is a month bias in the consumption pattern, or is it due to chance? To find out, it is reasonable to assume that monthly consumption of sugary beverages follows a multinomial distribution of size 14,697, the total number of units consumed, in which each unit falls into one of twelve equally likely months. This respects our null hypothesis: people are not concerned about decreasing their sugar consumption for summer, meaning that there is no consumption pattern of sugary beverages and no preference for any month.

\[ \begin{aligned} &(C_1,C_2,...,C_{12})\sim\ M(14697,P_1,P_2,...,P_{12}) \\ &P_1=P_2=\ ...\ =P_{12}=\frac{1}{12} \end{aligned} \]

We can rewrite the hypotheses:

\[ \begin{aligned} \\ &H_0: C_1= C_6\\ &H_1: C_1\neq C_6\\ \\ \end{aligned} \]
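As a rough analytic cross-check, assuming the standard multinomial variance and covariance formulas with n = 14697 and p = 1/12, the spread of the difference under the null model can be worked out by hand:

\[ \begin{aligned} &Var(C_6-C_1)=Var(C_6)+Var(C_1)-2\,Cov(C_6,C_1) \\ &=2np(1-p)+2np^2=2np=\frac{14697}{6}=2449.5 \\ &SD(C_6-C_1)=\sqrt{2449.5}\approx 49.5 \end{aligned} \]

Under these assumptions, the sample difference of -288 would lie roughly 5.8 standard deviations away from zero.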

Next, I will simulate data under the null hypothesis. I will first simulate monthly consumption figures and then compute the simulated difference in consumption between January and June.

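# Simulate monthly consumption under the null: twelve equally likely months
sim <- rmultinom(n = 10000, size = sum(datos$total), prob = rep(1/12, 12))
# Simulated difference between June (row 6) and January (row 1)
difference <- sim[6, ] - sim[1, ]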

Finally, let’s plot the distribution of differences to see how probable our sample difference is:

Our sample difference of -288 is an outlier in this distribution of differences. Thus, we can reject the null hypothesis in favor of the alternative: people care about reducing sugar consumption between January and June.
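How extreme the sample difference is can also be expressed as an empirical two-sided p-value. The sketch below repeats the multinomial simulation with the total of 14,697 and the January and June figures hard-coded from the text, so that it runs without the data set; the object names are illustrative.

```r
set.seed(1)

total_consumption   <- 14697        # total reported in the text
observed_difference <- 1137 - 1425  # June minus January

# Simulate monthly counts under the null: twelve equally likely months
sim <- rmultinom(n = 10000, size = total_consumption, prob = rep(1/12, 12))

# Simulated June (row 6) minus January (row 1) differences
sim_difference <- sim[6, ] - sim[1, ]

# Share of simulated differences at least as extreme as the observed one
p_value <- mean(abs(sim_difference) >= abs(observed_difference))

print(sd(sim_difference))
print(p_value)
```

A p-value of zero here only means that none of the 10,000 simulated differences was as extreme as -288, that is, the p-value is below 1 in 10,000.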

The marketing insight obtained would be valuable when choosing the orientation of future beverage advertisements. Since people take time to react after receiving a stimulus, it would be appropriate to start advertising sugar-free beverages during December, which is the last period in which consumption of sugary beverages increases. Similarly, I would start the sugary beverages’ advertising campaign during June, which is the last month in which consumption of sugary beverages decreases.