Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called brfss2013. Delete this note when before you submit your work.

load("brfss2013.RData")

Part 1: Data

Behavioral Risk Factor Surveillance System (BRFSS) a telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical methodological assistance from CDC. It is used to collect prevalence data among adult U.S. residents regarding their risk behaviors and preventive health practices that can affect their health status.

The data collection is conducted according to BRFSS protocols and states generally follow a suggested BRDSS interviewing schedule and complete all calls for a given survey month within the same sample month. In general, surveys are conducted using the following calling occasions:

20% of the interviews are conducted on weekdays (just after dinner hour)
80% of the interviews are conducted on weeknights and weekends
Appointments are made for callbacks during hours that are not scheduled for other interviews
Change schedules to accommodate holidays and special events

Disproportionate stratified sampling (DSS) has been used for the landline sample. DSS draws telephone numbers from two strata (lists) that are based on the presumed density of known telephone household numbers (High vs. Medium density). DSS sampling telephone numbers is more efficient than simple random sampling.

The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations and then respondents are randomly selected with each having equal probability of selection. States complete ~20% of their completed interviews with respondents on cell phones.

The sample size in BRFSS is at least 4000 interviews per state each year (at least 200k per year).

For more info refer back to the survey data documentation.

Thoughts on the Data collection

I think since the data is collected using random sampling of telephone numbers to conduct surveys across states, the results are generalizable to the adult population (selection and volunteer bias). They tried to call in hours of day and days of the week to be able to complete interviews with all respondents (non-response bias). Also, since this is observational study, the results could merely be correlation and does not show causality for which a random treatment experiment is required.

A possible source of bias could come from the fact that some people who are willing to respond to the interview and answer truthfully while there are sub populations that either don’t respond or for reasons such as illegal immigration don’t answer truthfully which introduces the bias that some people are under represented in the data.

Part 2: Research questions and Summary

Research question 1:

How long do people on average sleep and it’s relation to gender, race, age group, income and education level?

Research question 2:

How does income level, gender and living in college housing affects Health Care Access?

Research question 3:

How does income level and other factors like gender, education level, internet use and number of children related?

Part 3: Exploratory data analysis

Research question 1:

Average sleep time and it’s relation to age group, gender, race, income and education level

The main variable is sleptim1 which according to the BRFSS code book is the data gathered by asking the respondents “How Much Time Do You Sleep?”

Other dataset attributes to work with for this question are gender sex, education level educa and income level income2. Race and Age group are not among the available dataset attributes.

Summary of EDA approach:

To explore the data for this question, I create a dataset with relevant columns and then look at the missing values and the distribution of the data. Next, I decide whether to drop rows with missing values and then I table summary stats and visualize with suitable graphs.

# First create a dataframe with selected columns for the analysis

brfss_sleep <- brfss2013[c("sleptim1", "sex", "educa", "income2")]

# dataset dimension
dim(brfss_sleep)

## [1] 491775      4

Data Cleaning, Data Distribution

Sleep data is right skewed, with couple of outliers over 24 hours.

# length(brfss_sleep$sleptim1)

length(brfss_sleep$sleptim1)

## [1] 491775

# Distribution of sleep hours in the dataset

table(brfss_sleep$sleptim1)

## 
##      0      1      2      3      4      5      6      7      8      9     10 
##      1    228   1076   3496  14261  33436 106197 142469 141102  23800  12102 
##     11     12     13     14     15     16     17     18     19     20     21 
##    833   3675    199    447    367    369     35    164     13     64      3 
##     22     23     24    103    450 
##     10      4     35      1      1

ggplot(data = brfss_sleep, aes(x = sleptim1)) +
  geom_boxplot() +
  xlab("Sleep time")

To clean I’ll dropped rows from Sleep_time, Education and Sex but left it for Income, because there was 71426 rows and it could result in data loss. Also, I filtered sleep data to only include 2-14 hours. 10723 rows are dropped in total from 481052 rows.

# Missing values in each column before
sapply(brfss_sleep, function(x) sum(is.na(x)))

## sleptim1      sex    educa  income2 
##     7387        7     2274    71426

# Drop rows with missing sleep time, sex value

sleep_clean <- brfss_sleep %>%
  drop_na(sex) %>%
  drop_na(sleptim1)

# Drop rows with sleep values above 14 hours and below 2 hours

sleep_clean <- subset(sleep_clean, sleep_clean$sleptim1 >= 2 & sleep_clean$sleptim1 <= 14)

# Drop missing values in education 

sleep_clean <- sleep_clean %>%
   drop_na(educa)

# Missing value count after
sapply(sleep_clean, function(x) sum(is.na(x)))

## sleptim1      sex    educa  income2 
##        0        0        0    66660

# dataset dimension
dim(sleep_clean)

## [1] 481052      4

Distribution of Sleep time in the clean dataset:

# Distribution of sleep hours in the clean dataset

table(sleep_clean$sleptim1)

## 
##      2      3      4      5      6      7      8      9     10     11     12 
##   1067   3472  14203  33290 105777 141904 140471  23709  12044    824   3650 
##     13     14 
##    198    443

ggplot(data = sleep_clean, aes(x = sleptim1)) +
  geom_boxplot() +
  xlab("Sleep time")

There is about 20% difference in data records for males and female respondents, which makes the the data imbalanced. (Genders are not equally represented in the data)

# Frequecy Table for gender

table(brfss_sleep$sex)

## 
##   Male Female 
## 201313 290455

# Frequecy chart for gender

ggplot(data = subset(brfss_sleep, !is.na(sex)), aes(x = factor(sex))) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  scale_y_continuous(labels = scales::percent) +
  ylab("relative frequencies") +
  xlab("Gender")

Income level distribution of survey respondents

# Frequecy Table for income level

table(brfss_sleep$income2)

## 
## Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000 
##             25441             26794             34873             41732 
## Less than $35,000 Less than $50,000 Less than $75,000   $75,000 or more 
##             48867             61509             65231            115902

# Frequecy chart for income level

ggplot(data = subset(brfss_sleep, !is.na(income2)), aes(x = income2)) +
  geom_bar(aes(y = (..count..)/sum(..count..))) +
  ylab("relative frequencies") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylim(0,0.3)+
  xlab("Income")

Eduvation level distribution of survey respondents

# Frequecy Table for education level

table(brfss_sleep$educa)

## 
##                   Never attended school or only kindergarten 
##                                                          677 
##                              Grades 1 through 8 (Elementary) 
##                                                        13395 
##                        Grades 9 though 11 (Some high school) 
##                                                        28141 
##                       Grade 12 or GED (High school graduate) 
##                                                       142971 
## College 1 year to 3 years (Some college or technical school) 
##                                                       134197 
##                   College 4 years or more (College graduate) 
##                                                       170120

# Frequecy chart for education level

ggplot(data = subset(brfss_sleep, !is.na(educa)), aes(x = educa)) +
  geom_bar(aes(y = (..count..)/sum(..count..))) +
   ylab("relative frequencies") +
  xlab("Education") +
  coord_flip()

Distribution of Sleep time between Sex, Education and Income categories

Sleep time distribution between males and females:

Both histogram and box plot shows that both distributions are similar (similar mid point (median or mean) and data spread (Standard deviation)). Both genders report almost same hours of sleep.

In this dataset, it seems there a no relation between the sleep time and the gender of the respondents.

# Summary stats for sleep time between genders

  sleep_clean %>%
  group_by(sex) %>%
  summarise(mean_sleep = mean(sleptim1), median_sleep = median(sleptim1),  max_sleep = max(sleptim1), min_sleep = min(sleptim1),
            SD_sleep = sd(sleptim1))

## # A tibble: 2 × 6
##   sex    mean_sleep median_sleep max_sleep min_sleep SD_sleep
##   <fct>       <dbl>        <dbl>     <int>     <int>    <dbl>
## 1 Male         7.01            7        14         2     1.37
## 2 Female       7.05            7        14         2     1.40

# Box plot of distribution
ggplot(sleep_clean, aes(x = factor(sex), y = sleptim1)) +
  geom_boxplot() +
  xlab("Gender")+
  ylab("Sleep Time")

# Histogram of the distribution

ggplot(data = sleep_clean, aes(x = sleptim1, fill = sex, colour = sex)) +       geom_histogram(binwidth = 1, alpha = 0.5, position = "identity") +
  xlab("Sleep Time")

Sleep time distribution in education level of respondents:

The sleep time distribution box plot and summary stats, are similar between education level.

It seems there also no relationship between sleep time and education level.

# Summary stats for sleep time between different education levels

  sleep_clean %>%  
  group_by(educa) %>% 
  summarise(mean_sleep = mean(sleptim1), median_sleep = median(sleptim1),  max_sleep = max(sleptim1), min_sleep = min(sleptim1), 
            SD_sleep = sd(sleptim1))

## # A tibble: 6 × 6
##   educa                     mean_sleep median_sleep max_sleep min_sleep SD_sleep
##   <fct>                          <dbl>        <dbl>     <int>     <int>    <dbl>
## 1 Never attended school or…       6.94            7        14         2     1.76
## 2 Grades 1 through 8 (Elem…       7.04            7        14         2     1.78
## 3 Grades 9 though 11 (Some…       6.94            7        14         2     1.77
## 4 Grade 12 or GED (High sc…       7.02            7        14         2     1.50
## 5 College 1 year to 3 year…       6.98            7        14         2     1.39
## 6 College 4 years or more …       7.10            7        14         2     1.17

# Boxplot of distribution
ggplot(sleep_clean, aes(x = factor(educa), y = sleptim1)) +
  geom_boxplot() +
  # theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Education") +
  ylab("Sleep Time")+
  coord_flip()

Sleep time distribution in income level of respondents:

Sleep time between income groups is similar but there is small difference in the average mean and spread of sleep time between lower income groups and top income groups. The mean of sleep time slightly increases in top groups and the spread of sleep time is slightly higher in lower income groups (less centered around the mean).

It seems there is a small relationship between Sleep time and income group the respondents belong.

# Summary stats for sleep time between different income levels

sleep_clean %>%  
  group_by(income2) %>% 
  summarise(mean_sleep = mean(sleptim1), median_sleep = median(sleptim1),  max_sleep = max(sleptim1), min_sleep = min(sleptim1),SD_sleep = sd(sleptim1))

## # A tibble: 9 × 6
##   income2           mean_sleep median_sleep max_sleep min_sleep SD_sleep
##   <fct>                  <dbl>        <dbl>     <int>     <int>    <dbl>
## 1 Less than $10,000       6.81            7        14         2     1.87
## 2 Less than $15,000       6.91            7        14         2     1.77
## 3 Less than $20,000       6.99            7        14         2     1.65
## 4 Less than $25,000       7.01            7        14         2     1.52
## 5 Less than $35,000       7.05            7        14         2     1.40
## 6 Less than $50,000       7.06            7        14         2     1.30
## 7 Less than $75,000       7.04            7        14         2     1.20
## 8 $75,000 or more         7.04            7        14         2     1.10
## 9 <NA>                    7.13            7        14         2     1.45

# Boxplot of distribution
ggplot(subset(sleep_clean, !is.na(income2)), aes(x = factor(income2), y = sleptim1)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Income")+
  ylab("Sleep Time")

Research question 2:

How does income level, gender and living in college housing affects Health Care Access?

I’m going to explore the variables to check if there is a correlation between independent and dependent variables.

Explanatory Variables:

colghous (Do You Live In College Housing?)

sex (Gender)

income2 (Income level)

Health Care Access variables in the dataset (Response Variables):

hlthpln1 (Have Any Health Care Coverage?)

persdoc2 (Multiple Health Care Professionals)

medcost (Could Not See Dr Because Of Cost)

checkup1 (Length Of Time Since Last Routine Checkup)

Data Cleaning, Data Distribution

# create a data frame with selected columns for the analysis

brfss_health <- brfss2013[c("colghous", "sex", "hlthpln1", "income2", "persdoc2", "medcost", "checkup1" )]

Distribution of records in the response variables (Health care coverage, Number of health care professionals, Whether someone could not see a doc due to medical cost and Time since last checkup)

# Distribution of response variables


# 1.Health Care coverage

health_care_plan = data.frame(table(brfss_health$hlthpln1))
names(health_care_plan )[1] = "health_care_plan"
health_care_plan

##   health_care_plan   Freq
## 1              Yes 434571
## 2               No  55300

ggplot(data = brfss_health, aes(x = factor(hlthpln1))) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("relative frequencies") +
  xlab("Health Care converage")

#=====
# 2. Multiple Health care Professionals

health_prof = data.frame(table(brfss_health$persdoc2))
names(health_prof)[1] = "health_care_professionals"
health_prof

##   health_care_professionals   Freq
## 1             Yes, only one 369056
## 2             More than one  41334
## 3                        No  79584

ggplot(data = brfss_health, aes(x = factor(persdoc2))) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("relative frequencies") +
  xlab("Multiple Health Care Professionals")

#=====
# 3. Could Not See Dr. Because Of Cost

Med_cost = data.frame(table(brfss_health$medcost))
names(Med_cost)[1] = "Not see a dr due to Med cost"
Med_cost

##   Not see a dr due to Med cost   Freq
## 1                          Yes  60107
## 2                           No 430447

ggplot(data = brfss_health, aes(x = factor(medcost))) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("relative frequencies") +
  xlab("Could not see Dr due to cost")

#=====
# 4. Time Since Last Routine Checkup

checkup = data.frame(table(brfss_health$checkup1))
names(checkup)[1] = "Last checkup"
checkup

##          Last checkup   Freq
## 1    Within past year 356228
## 2 Within past 2 years  57298
## 3 Within past 5 years  33674
## 4 5 or more years ago  33748
## 5               Never   4522

ggplot(data = brfss_health, aes(x = factor(checkup1))) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("relative frequencies") +
  xlab("Time Since Last Routine Checkup") +
  coord_flip()

Since there was too few data records for people in college housing, I will drop the column.

# Do You Live In College Housing?

table(brfss_health$colghous)

## 
## Yes 
##  45

cat("\n")

sapply(brfss_health, function(x) sum(is.na(x)))

## colghous      sex hlthpln1  income2 persdoc2  medcost checkup1 
##   491730        7     1904    71426     1801     1221     6305

# drop colghous
brfss_health <- subset(brfss_health, select = -c(colghous))

There is small number of missing records in each column so I just dropped the rows.

# drop missing values

health_clean <- brfss_health %>%
  drop_na(sex) %>%
  drop_na(hlthpln1) %>% 
  drop_na(persdoc2) %>% 
  drop_na(medcost) %>% 
  drop_na(checkup1)

# after cleaning
sapply(health_clean, function(x) sum(is.na(x)))

##      sex hlthpln1  income2 persdoc2  medcost checkup1 
##        0        0    67756        0        0        0

Correlation

Health care coverage relation to income and gender

It seems between male and female there is not a relationship to health care coverage but income show correlation. As we go up in income categories, the percentage of respondents who are have health care plans increases.

# Checking correlation between resopnse and explanatory variables

# Gender and Health care coverage

# Percentage of health care coverage between gender
gender_hlthpln <- table(health_clean$hlthpln1, health_clean$sex)
prop.table(gender_hlthpln, margin = 2)

##      
##            Male    Female
##   Yes 0.8730228 0.8988280
##   No  0.1269772 0.1011720

cat("\n")

# Proportion of Males and Females that are covered by a Health care plan
ggplot(health_clean, aes(x = sex, fill = hlthpln1)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Gender") +
  labs(fill = "Health care coverage") +
  scale_y_continuous(labels = scales::percent, name = "Proportion")

# Income and health care coverage

# Percentage of health care coverage between gender
income_hlthpln <- table(health_clean$hlthpln1, health_clean$income2)
cat("\n")

prop.table(income_hlthpln , margin = 2)

##      
##       Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
##   Yes        0.74134345        0.79169866        0.76882605        0.80166846
##   No         0.25865655        0.20830134        0.23117395        0.19833154
##      
##       Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
##   Yes        0.85710116        0.90563424        0.94795060      0.97633553
##   No         0.14289884        0.09436576        0.05204940      0.02366447

# Proportion of Males and Females that are covered by a Health care plan
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = hlthpln1)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Income") +
  labs(fill = "Health care coverage") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()

Relation between Not seeing a doc due to medical cost and income and gender

There is a slight difference between Men and Women who answered Yes to the medcost question (Could Not See Dr Because Of Cost). There are more women who said yes.

A possible contributing factor could be income gap, between men and women, which I is analyzed in the third research question and in the visuals below show a correlation between income and saying yes to medcost question.

#Gender and whether could not see a dr due to cost (Medcost)

# Percentage of Medcost between gender 
gender_medcost <- table(health_clean$medcost, health_clean$sex)
prop.table(gender_medcost, margin = 2)

##      
##            Male    Female
##   Yes 0.1053367 0.1334215
##   No  0.8946633 0.8665785

cat("\n")

# Proportion of Males and Females that could not see a dr due to cost
ggplot(health_clean, aes(x = sex, fill = medcost)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Gender") +
  labs(fill = "Not see a doc for cost") +
  scale_y_continuous(labels = scales::percent, name = "Proportion")

# Income and whether could not see a dr due to cost (Medcost)
income_medcost <- table(health_clean$medcost, health_clean$income2)
cat("\n")

# Percentage of medcost issue between income levels 
prop.table(income_medcost , margin = 2)

##      
##       Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
##   Yes        0.28798075        0.25552910        0.23240844        0.19916332
##   No         0.71201925        0.74447090        0.76759156        0.80083668
##      
##       Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
##   Yes        0.14565090        0.10528138        0.07032486      0.03494404
##   No         0.85434910        0.89471862        0.92967514      0.96505596

# Proportion of income levels that have medcost issue
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = medcost)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Income") +
  labs(fill = "Not see a doc for cost") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()

Time since last routine checkup relation to income and gender

There is a gap between genders in relation to routine checkups. A smaller group of males have done a routine checkup in the past 1 year and since males are a bit under represented in the dataset, the gap could be even bigger.

Percentage of respondents who did a check up within last year between income groups is similar (proportionally, higher income groups do slightly more ofter routine checkups compare to lower income groups) and the same true for within past 2 years. As we look to within 5 years, more than 5 years and never categories, a bigger proportion in lower income groups, belong to these categories.

There seems to be a small relation between Income and time since last checkup, but overall all income groups look very similar.

#Gender and Time since last routine checkup (check up)
gender_checkup <- table(health_clean$checkup1, health_clean$sex)
gender_checkup

##                      
##                         Male Female
##   Within past year    135026 218402
##   Within past 2 years  24690  32025
##   Within past 5 years  16498  16806
##   5 or more years ago  18299  15001
##   Never                 2294   2143

# Percentage of check up between gender 
prop.table(gender_checkup, margin = 2)

##                      
##                              Male      Female
##   Within past year    0.686083320 0.768001632
##   Within past 2 years 0.125452855 0.112614593
##   Within past 5 years 0.083828319 0.059097606
##   5 or more years ago 0.092979416 0.052750398
##   Never               0.011656089 0.007535771

cat("\n")

# Proportion of Males and Females in check up categories
ggplot(health_clean, aes(x = sex, fill = checkup1)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Gender") +
  labs(fill = "Since last checkup") +
  scale_y_continuous(labels = scales::percent, name = "Proportion")

# Income and Time since last routine checkup (check up)
income_checkup <- table(health_clean$checkup1, health_clean$income2)
cat("\n")

# Percentage of routine checkup categories between income levels 
prop.table(income_checkup , margin = 2)

##                      
##                       Less than $10,000 Less than $15,000 Less than $20,000
##   Within past year          0.689261389       0.720549839       0.701399095
##   Within past 2 years       0.109629267       0.101251728       0.113573570
##   Within past 5 years       0.082425874       0.072915067       0.078978308
##   5 or more years ago       0.102695869       0.090577484       0.092028687
##   Never                     0.015987601       0.014705882       0.014020340
##                      
##                       Less than $25,000 Less than $35,000 Less than $50,000
##   Within past year          0.713254722       0.723667751       0.730736887
##   Within past 2 years       0.112706723       0.113793679       0.119560737
##   Within past 5 years       0.078897152       0.073013093       0.069055353
##   5 or more years ago       0.083814463       0.079476274       0.072996191
##   Never                     0.011326940       0.010049204       0.007650832
##                      
##                       Less than $75,000 $75,000 or more
##   Within past year          0.740807968     0.745539522
##   Within past 2 years       0.123661919     0.131951818
##   Within past 5 years       0.067532347     0.066606904
##   5 or more years ago       0.061745633     0.050296785
##   Never                     0.006252133     0.005604971

# Proportion of income levels to categories in checkup time
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = checkup1)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Income") +
  labs(fill = "Since last checkup") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()

Number of health care professionals relation to income and gender

Between genders, females among the respondents report having at least one health professionals whereas that proportion is lower among males.

The percent of respondents to have at least one health professionals among higher income groups, higher. There is positive relationship between income and whether having one or more health professionals.

#Gender and number of health care professional indiduals have
gender_docnum <- table(health_clean$persdoc2, health_clean$sex)
gender_docnum

##                
##                   Male Female
##   Yes, only one 138575 225004
##   More than one  15711  24781
##   No             42521  34592

# Percentage of doc_num between gender 
prop.table(gender_docnum, margin = 2)

##                
##                       Male     Female
##   Yes, only one 0.70411622 0.79121729
##   More than one 0.07982948 0.08714137
##   No            0.21605431 0.12164134

cat("\n")

# Proportion of Males and Females in doc_num categories
ggplot(health_clean, aes(x = sex, fill = persdoc2)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Gender") +
  labs(fill = "Health care Pro") +
  scale_y_continuous(labels = scales::percent, name = "Proportion")

# Income and number of health care professional indiduals have
income_docnum <- table(health_clean$persdoc2, health_clean$income2)
cat("\n")

# Percentage of doc_num categories between income levels 
prop.table(income_docnum , margin = 2)

##                
##                 Less than $10,000 Less than $15,000 Less than $20,000
##   Yes, only one        0.65797953        0.69682076        0.68967139
##   More than one        0.09090909        0.09902473        0.08741403
##   No                   0.25111138        0.20415451        0.22291458
##                
##                 Less than $25,000 Less than $35,000 Less than $50,000
##   Yes, only one        0.71582347        0.73855392        0.76358270
##   More than one        0.08442607        0.08187391        0.08132307
##   No                   0.19975046        0.17957218        0.15509423
##                
##                 Less than $75,000 $75,000 or more
##   Yes, only one        0.79039064      0.80898710
##   More than one        0.07736821      0.07851312
##   No                   0.13224115      0.11249978

# Proportion of income levels to doc_num categories
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = persdoc2)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Income") +
  labs(fill = "Health care Pro") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()

Research question 3:

How does income level and other factors like gender, education level, internet use and number of children related?

Explanatory variable:

income2 (Income level)

sex (Gender)

educa (Education level)

Response variable:

income2 (Income level)

children (Number Of Children In Household)

internet (Internet Use In The Past 30 Days?)

sex (Gender)

More specifically, it’s interesting to see if there correlation between attributes below:

Gender vs Income
Income vs Number of children
Education vs Income
Education vs Internet use
Gender vs Internet use

After creating a dataset with relevant columns, let’s look at the distribution of Number of Children and Internet use in the data.

About 74% of the respondents have no children and 24% didn’t use Internet in the last 30 days.

brfss_demog <- brfss2013[c("children", "sex", "internet", "income2", "educa")]

# Missing values
sapply(brfss_demog, function(x) sum(is.na(x)))

## children      sex internet  income2    educa 
##     2274        7     4926    71426     2274

# 1. Number of children
child_num <- data.frame(table(brfss_demog$children))
names(child_num)[1] = "Children"
child_num

##    Children   Freq
## 1         0 359478
## 2         1  53208
## 3         2  46682
## 4         3  19828
## 5         4   6924
## 6         5   2099
## 7         6    827
## 8         7    240
## 9         8    103
## 10        9     45
## 11       10     34
## 12       11      5
## 13       12      6
## 14       13     11
## 15       14      4
## 16       15      2
## 17       16      1
## 18       17      2
## 19       24      1
## 20       47      1

ggplot(data = brfss_demog, aes(x = factor(children))) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("relative frequencies") +
  xlab("Children") +
  coord_flip()

# 2. Internet Use
table(brfss_demog$internet)

## 
##    Yes     No 
## 367510 119339

ggplot(data = brfss_demog, aes(x = factor(internet))) +
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels = scales::percent) +
  ylab("relative frequencies") +
  xlab("Internet use") +
  coord_flip()

After checking amount of missing data, I dropped all of them except for the income column (71426 rows of missing data).

# droping missing values except for the income
demog_clean <- brfss_demog%>%
  drop_na(sex) %>%
  drop_na(internet) %>% 
  drop_na(children) %>% 
  drop_na(educa) 

# missing value after cleaning
sapply(demog_clean, function(x) sum(is.na(x)))

## children      sex internet  income2    educa 
##        0        0        0    67086        0

Correlation between data attributes

1. Income vs. Gender

Since the data is imbalanced in relation to the Sex of the respondents, the relative proportions of male to female in each Income category could be incorrect (Not representative of the population) but the trend could be still valid between income levels. It seems that as we go up in income level, male representation is increasing. In top 2 income levels, the proportion of men to women is almost the same, although the proportion of men to women in the dataset in 4:6 in 10 respondents.

Another question that comes to mind is that, if the female respondents were reporting their individual income or the household income, in which case the meaning of this relation is going to be different.

gender_income <- table(demog_clean$income2, demog_clean$sex)
gender_income

##                    
##                      Male Female
##   Less than $10,000  8200  16991
##   Less than $15,000  9101  17445
##   Less than $20,000 12420  22118
##   Less than $25,000 15570  25756
##   Less than $35,000 19467  29028
##   Less than $50,000 26635  34473
##   Less than $75,000 29180  35637
##   $75,000 or more   56114  58994

# Percentage of males and females in each income category
prop.table(gender_income, margin = 1)

##                    
##                          Male    Female
##   Less than $10,000 0.3255131 0.6744869
##   Less than $15,000 0.3428388 0.6571612
##   Less than $20,000 0.3596039 0.6403961
##   Less than $25,000 0.3767604 0.6232396
##   Less than $35,000 0.4014228 0.5985772
##   Less than $50,000 0.4358676 0.5641324
##   Less than $75,000 0.4501905 0.5498095
##   $75,000 or more   0.4874900 0.5125100

# Proportion of Males and Females in each income category

ggplot(demog_clean, aes(x = income2, fill = sex)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Income") +
  labs(fill = "Sex") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()

2. Income vs Number of children in the Household

Since there more representation of higher income groups in the data, I created percentages of different number of children, for each income group (not across groups). From relative proportions, it seems top two income category tend to have at least one children compare to other groups.

The proportions in other income groups are similar.

# People with 0-3 children (most of the data)
three_child <- subset(demog_clean, demog_clean$children <4)

# Percentage of income categories for each children count
child_income <- table(three_child$children, three_child$income2)
prop.table(child_income, margin = 2)

##    
##     Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
##   0        0.77262379        0.81027000        0.77599597        0.77761822
##   1        0.11232621        0.09225333        0.10659236        0.10242634
##   2        0.07573786        0.06471560        0.07760256        0.07957415
##   3        0.03931214        0.03276107        0.03980911        0.04038128
##    
##     Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
##   0        0.79043405        0.77628348        0.73148148      0.64626443
##   1        0.09839865        0.10246087        0.11959745      0.14058858
##   2        0.07526338        0.08388326        0.10552716      0.15540985
##   3        0.03590392        0.03737240        0.04339390      0.05773713

# Proportion of number of children in income categories
ggplot(three_child, aes(x = income2, fill = factor(children))) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Income") +
  labs(fill = "Child Num") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()

3. Education vs Income

It seems there is a correlation between, not reporting income level and education. As we go up in income groups the rate of missing income data, decreases.

There is positive correlation between income and education level. General trend shows, as we go up in income level, bigger proportions have higher education level. Like in the top income category, almost half of the respondents at least graduated college and only 6 percent never attended school.

edu_income <- table( demog_clean$educa, demog_clean$income2)

# Percentage of income categories in each education category
prop.table(edu_income, margin = 1)

##                                                               
##                                                                Less than $10,000
##   Never attended school or only kindergarten                          0.26696833
##   Grades 1 through 8 (Elementary)                                     0.25458489
##   Grades 9 though 11 (Some high school)                               0.18965363
##   Grade 12 or GED (High school graduate)                              0.07927833
##   College 1 year to 3 years (Some college or technical school)        0.05246461
##   College 4 years or more (College graduate)                          0.01892866
##                                                               
##                                                                Less than $15,000
##   Never attended school or only kindergarten                          0.16063348
##   Grades 1 through 8 (Elementary)                                     0.19453757
##   Grades 9 though 11 (Some high school)                               0.16513022
##   Grade 12 or GED (High school graduate)                              0.09121845
##   College 1 year to 3 years (Some college or technical school)        0.06104548
##   College 4 years or more (College graduate)                          0.02000159
##                                                               
##                                                                Less than $20,000
##   Never attended school or only kindergarten                          0.17194570
##   Grades 1 through 8 (Elementary)                                     0.19631236
##   Grades 9 though 11 (Some high school)                               0.18280677
##   Grade 12 or GED (High school graduate)                              0.12231199
##   College 1 year to 3 years (Some college or technical school)        0.08276633
##   College 4 years or more (College graduate)                          0.02930034
##                                                               
##                                                                Less than $25,000
##   Never attended school or only kindergarten                          0.10407240
##   Grades 1 through 8 (Elementary)                                     0.13833563
##   Grades 9 though 11 (Some high school)                               0.15613533
##   Grade 12 or GED (High school graduate)                              0.14298444
##   College 1 year to 3 years (Some college or technical school)        0.10770284
##   College 4 years or more (College graduate)                          0.04710970
##                                                               
##                                                                Less than $35,000
##   Never attended school or only kindergarten                          0.11538462
##   Grades 1 through 8 (Elementary)                                     0.11003747
##   Grades 9 though 11 (Some high school)                               0.12776336
##   Grade 12 or GED (High school graduate)                              0.15298970
##   College 1 year to 3 years (Some college or technical school)        0.13277803
##   College 4 years or more (College graduate)                          0.07368798
##                                                               
##                                                                Less than $50,000
##   Never attended school or only kindergarten                          0.06787330
##   Grades 1 through 8 (Elementary)                                     0.05501873
##   Grades 9 though 11 (Some high school)                               0.08690593
##   Grade 12 or GED (High school graduate)                              0.15913373
##   College 1 year to 3 years (Some college or technical school)        0.17066385
##   College 4 years or more (College graduate)                          0.13336159
##                                                               
##                                                                Less than $75,000
##   Never attended school or only kindergarten                          0.05203620
##   Grades 1 through 8 (Elementary)                                     0.02346677
##   Grades 9 though 11 (Some high school)                               0.04712253
##   Grade 12 or GED (High school graduate)                              0.12523974
##   College 1 year to 3 years (Some college or technical school)        0.17139192
##   College 4 years or more (College graduate)                          0.19187618
##                                                               
##                                                                $75,000 or more
##   Never attended school or only kindergarten                        0.06108597
##   Grades 1 through 8 (Elementary)                                   0.02770657
##   Grades 9 though 11 (Some high school)                             0.04448223
##   Grade 12 or GED (High school graduate)                            0.12684363
##   College 1 year to 3 years (Some college or technical school)      0.22118693
##   College 4 years or more (College graduate)                        0.48573397

# Proportion of income categories in each education category

ggplot(demog_clean, aes(x =  educa, fill = income2)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Eduaction") +
  labs(fill = "Income") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

4. Education vs Internet use

Internet use also have a positive correlation with education level. As we go up in education level, rate of internet use in last 30 days increases. Only 7.9 percent didn’t use Internet among respondents that graduated college while that number is 75 percent among “Never attended school or only kindergarten” group.

edu_internet <- table( demog_clean$educa, demog_clean$internet)

# Percentage of males and females in each income category
prop.table(edu_internet, margin = 1)

##                                                               
##                                                                       Yes
##   Never attended school or only kindergarten                   0.25460123
##   Grades 1 through 8 (Elementary)                              0.19426317
##   Grades 9 though 11 (Some high school)                        0.39641291
##   Grade 12 or GED (High school graduate)                       0.62266527
##   College 1 year to 3 years (Some college or technical school) 0.81987708
##   College 4 years or more (College graduate)                   0.92065699
##                                                               
##                                                                        No
##   Never attended school or only kindergarten                   0.74539877
##   Grades 1 through 8 (Elementary)                              0.80573683
##   Grades 9 though 11 (Some high school)                        0.60358709
##   Grade 12 or GED (High school graduate)                       0.37733473
##   College 1 year to 3 years (Some college or technical school) 0.18012292
##   College 4 years or more (College graduate)                   0.07934301

# Proportion of Males and Females in each income category

ggplot(demog_clean, aes(x =  educa, fill = internet)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Education") +
  labs(fill = "Internet use") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

5. Gender vs Internet use

It seems there is no correlation between gender of the respondents and whether they used Internet in the past 30 days.

gender_internet <- table( demog_clean$sex, demog_clean$internet)

# Percentage of males and females that either used internet in last 30 days or not
prop.table(gender_internet, margin = 1)

##         
##               Yes       No
##   Male   0.768993 0.231007
##   Female 0.745686 0.254314

# Proportion of Males and Females that either used internet in last 30 days or not

ggplot(demog_clean, aes(x = sex, fill = internet)) +
  geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
  xlab("Sex") +
  labs(fill = "Internet use") +
  scale_y_continuous(labels = scales::percent, name = "Proportion") +
  coord_flip()