library(ggplot2)
library(dplyr)
library(tidyr)
Make sure your data and R Markdown files are in the same directory.
When loaded your data file will be called brfss2013
. Delete
this note when before you submit your work.
load("brfss2013.RData")
Behavioral Risk Factor Surveillance System (BRFSS) a telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical methodological assistance from CDC. It is used to collect prevalence data among adult U.S. residents regarding their risk behaviors and preventive health practices that can affect their health status.
The data collection is conducted according to BRFSS protocols and states generally follow a suggested BRDSS interviewing schedule and complete all calls for a given survey month within the same sample month. In general, surveys are conducted using the following calling occasions:
20% of the interviews are conducted on weekdays (just after dinner hour)
80% of the interviews are conducted on weeknights and weekends
Appointments are made for callbacks during hours that are not scheduled for other interviews
Change schedules to accommodate holidays and special events
Disproportionate stratified sampling (DSS) has been used for the landline sample. DSS draws telephone numbers from two strata (lists) that are based on the presumed density of known telephone household numbers (High vs. Medium density). DSS sampling telephone numbers is more efficient than simple random sampling.
The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations and then respondents are randomly selected with each having equal probability of selection. States complete ~20% of their completed interviews with respondents on cell phones.
The sample size in BRFSS is at least 4000 interviews per state each year (at least 200k per year).
For more info refer back to the survey data documentation.
I think since the data is collected using random sampling of telephone numbers to conduct surveys across states, the results are generalizable to the adult population (selection and volunteer bias). They tried to call in hours of day and days of the week to be able to complete interviews with all respondents (non-response bias). Also, since this is observational study, the results could merely be correlation and does not show causality for which a random treatment experiment is required.
A possible source of bias could come from the fact that some people who are willing to respond to the interview and answer truthfully while there are sub populations that either don’t respond or for reasons such as illegal immigration don’t answer truthfully which introduces the bias that some people are under represented in the data.
Research question 1:
How long do people on average sleep and it’s relation to gender, race, age group, income and education level?
Research question 2:
How does income level, gender and living in college housing affects Health Care Access?
Research question 3:
How does income level and other factors like gender, education level, internet use and number of children related?
Average sleep time and it’s relation to age group, gender, race, income and education level
The main variable is sleptim1
which according to the
BRFSS code book is the data gathered by asking the respondents “How Much
Time Do You Sleep?”
Other dataset attributes to work with for this question are gender
sex
, education level educa
and income level
income2
. Race and Age group are not among the available
dataset attributes.
Summary of EDA approach:
To explore the data for this question, I create a dataset with relevant columns and then look at the missing values and the distribution of the data. Next, I decide whether to drop rows with missing values and then I table summary stats and visualize with suitable graphs.
# First create a dataframe with selected columns for the analysis
<- brfss2013[c("sleptim1", "sex", "educa", "income2")]
brfss_sleep
# dataset dimension
dim(brfss_sleep)
## [1] 491775 4
Sleep data is right skewed, with couple of outliers over 24 hours.
# length(brfss_sleep$sleptim1)
length(brfss_sleep$sleptim1)
## [1] 491775
# Distribution of sleep hours in the dataset
table(brfss_sleep$sleptim1)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1 228 1076 3496 14261 33436 106197 142469 141102 23800 12102
## 11 12 13 14 15 16 17 18 19 20 21
## 833 3675 199 447 367 369 35 164 13 64 3
## 22 23 24 103 450
## 10 4 35 1 1
ggplot(data = brfss_sleep, aes(x = sleptim1)) +
geom_boxplot() +
xlab("Sleep time")
To clean I’ll dropped rows from Sleep_time, Education and Sex but left it for Income, because there was 71426 rows and it could result in data loss. Also, I filtered sleep data to only include 2-14 hours. 10723 rows are dropped in total from 481052 rows.
# Missing values in each column before
sapply(brfss_sleep, function(x) sum(is.na(x)))
## sleptim1 sex educa income2
## 7387 7 2274 71426
# Drop rows with missing sleep time, sex value
<- brfss_sleep %>%
sleep_clean drop_na(sex) %>%
drop_na(sleptim1)
# Drop rows with sleep values above 14 hours and below 2 hours
<- subset(sleep_clean, sleep_clean$sleptim1 >= 2 & sleep_clean$sleptim1 <= 14)
sleep_clean
# Drop missing values in education
<- sleep_clean %>%
sleep_clean drop_na(educa)
# Missing value count after
sapply(sleep_clean, function(x) sum(is.na(x)))
## sleptim1 sex educa income2
## 0 0 0 66660
# dataset dimension
dim(sleep_clean)
## [1] 481052 4
Distribution of Sleep time in the clean dataset:
# Distribution of sleep hours in the clean dataset
table(sleep_clean$sleptim1)
##
## 2 3 4 5 6 7 8 9 10 11 12
## 1067 3472 14203 33290 105777 141904 140471 23709 12044 824 3650
## 13 14
## 198 443
ggplot(data = sleep_clean, aes(x = sleptim1)) +
geom_boxplot() +
xlab("Sleep time")
There is about 20% difference in data records for males and female respondents, which makes the the data imbalanced. (Genders are not equally represented in the data)
# Frequecy Table for gender
table(brfss_sleep$sex)
##
## Male Female
## 201313 290455
# Frequecy chart for gender
ggplot(data = subset(brfss_sleep, !is.na(sex)), aes(x = factor(sex))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent) +
ylab("relative frequencies") +
xlab("Gender")
Income level distribution of survey respondents
# Frequecy Table for income level
table(brfss_sleep$income2)
##
## Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
## 25441 26794 34873 41732
## Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
## 48867 61509 65231 115902
# Frequecy chart for income level
ggplot(data = subset(brfss_sleep, !is.na(income2)), aes(x = income2)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
ylab("relative frequencies") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylim(0,0.3)+
xlab("Income")
Eduvation level distribution of survey respondents
# Frequecy Table for education level
table(brfss_sleep$educa)
##
## Never attended school or only kindergarten
## 677
## Grades 1 through 8 (Elementary)
## 13395
## Grades 9 though 11 (Some high school)
## 28141
## Grade 12 or GED (High school graduate)
## 142971
## College 1 year to 3 years (Some college or technical school)
## 134197
## College 4 years or more (College graduate)
## 170120
# Frequecy chart for education level
ggplot(data = subset(brfss_sleep, !is.na(educa)), aes(x = educa)) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
ylab("relative frequencies") +
xlab("Education") +
coord_flip()
Sleep time distribution between males and females:
Both histogram and box plot shows that both distributions are similar (similar mid point (median or mean) and data spread (Standard deviation)). Both genders report almost same hours of sleep.
# Summary stats for sleep time between genders
%>%
sleep_clean group_by(sex) %>%
summarise(mean_sleep = mean(sleptim1), median_sleep = median(sleptim1), max_sleep = max(sleptim1), min_sleep = min(sleptim1),
SD_sleep = sd(sleptim1))
## # A tibble: 2 × 6
## sex mean_sleep median_sleep max_sleep min_sleep SD_sleep
## <fct> <dbl> <dbl> <int> <int> <dbl>
## 1 Male 7.01 7 14 2 1.37
## 2 Female 7.05 7 14 2 1.40
# Box plot of distribution
ggplot(sleep_clean, aes(x = factor(sex), y = sleptim1)) +
geom_boxplot() +
xlab("Gender")+
ylab("Sleep Time")
# Histogram of the distribution
ggplot(data = sleep_clean, aes(x = sleptim1, fill = sex, colour = sex)) + geom_histogram(binwidth = 1, alpha = 0.5, position = "identity") +
xlab("Sleep Time")
Sleep time distribution in education level of respondents:
The sleep time distribution box plot and summary stats, are similar between education level.
# Summary stats for sleep time between different education levels
%>%
sleep_clean group_by(educa) %>%
summarise(mean_sleep = mean(sleptim1), median_sleep = median(sleptim1), max_sleep = max(sleptim1), min_sleep = min(sleptim1),
SD_sleep = sd(sleptim1))
## # A tibble: 6 × 6
## educa mean_sleep median_sleep max_sleep min_sleep SD_sleep
## <fct> <dbl> <dbl> <int> <int> <dbl>
## 1 Never attended school or… 6.94 7 14 2 1.76
## 2 Grades 1 through 8 (Elem… 7.04 7 14 2 1.78
## 3 Grades 9 though 11 (Some… 6.94 7 14 2 1.77
## 4 Grade 12 or GED (High sc… 7.02 7 14 2 1.50
## 5 College 1 year to 3 year… 6.98 7 14 2 1.39
## 6 College 4 years or more … 7.10 7 14 2 1.17
# Boxplot of distribution
ggplot(sleep_clean, aes(x = factor(educa), y = sleptim1)) +
geom_boxplot() +
# theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Education") +
ylab("Sleep Time")+
coord_flip()
Sleep time distribution in income level of respondents:
Sleep time between income groups is similar but there is small difference in the average mean and spread of sleep time between lower income groups and top income groups. The mean of sleep time slightly increases in top groups and the spread of sleep time is slightly higher in lower income groups (less centered around the mean).
It seems there is a small relationship between Sleep time and income group the respondents belong.
# Summary stats for sleep time between different income levels
%>%
sleep_clean group_by(income2) %>%
summarise(mean_sleep = mean(sleptim1), median_sleep = median(sleptim1), max_sleep = max(sleptim1), min_sleep = min(sleptim1),SD_sleep = sd(sleptim1))
## # A tibble: 9 × 6
## income2 mean_sleep median_sleep max_sleep min_sleep SD_sleep
## <fct> <dbl> <dbl> <int> <int> <dbl>
## 1 Less than $10,000 6.81 7 14 2 1.87
## 2 Less than $15,000 6.91 7 14 2 1.77
## 3 Less than $20,000 6.99 7 14 2 1.65
## 4 Less than $25,000 7.01 7 14 2 1.52
## 5 Less than $35,000 7.05 7 14 2 1.40
## 6 Less than $50,000 7.06 7 14 2 1.30
## 7 Less than $75,000 7.04 7 14 2 1.20
## 8 $75,000 or more 7.04 7 14 2 1.10
## 9 <NA> 7.13 7 14 2 1.45
# Boxplot of distribution
ggplot(subset(sleep_clean, !is.na(income2)), aes(x = factor(income2), y = sleptim1)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Income")+
ylab("Sleep Time")
How does income level, gender and living in college housing affects Health Care Access?
I’m going to explore the variables to check if there is a correlation between independent and dependent variables.
Explanatory Variables:
colghous
(Do You Live In College Housing?)
sex
(Gender)
income2
(Income level)
Health Care Access variables in the dataset (Response Variables):
hlthpln1
(Have Any Health Care Coverage?)
persdoc2
(Multiple Health Care Professionals)
medcost
(Could Not See Dr Because Of Cost)
checkup1
(Length Of Time Since Last Routine Checkup)
# create a data frame with selected columns for the analysis
<- brfss2013[c("colghous", "sex", "hlthpln1", "income2", "persdoc2", "medcost", "checkup1" )] brfss_health
Distribution of records in the response variables (Health care coverage, Number of health care professionals, Whether someone could not see a doc due to medical cost and Time since last checkup)
# Distribution of response variables
# 1.Health Care coverage
= data.frame(table(brfss_health$hlthpln1))
health_care_plan names(health_care_plan )[1] = "health_care_plan"
health_care_plan
## health_care_plan Freq
## 1 Yes 434571
## 2 No 55300
ggplot(data = brfss_health, aes(x = factor(hlthpln1))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent) +
ylab("relative frequencies") +
xlab("Health Care converage")
#=====
# 2. Multiple Health care Professionals
= data.frame(table(brfss_health$persdoc2))
health_prof names(health_prof)[1] = "health_care_professionals"
health_prof
## health_care_professionals Freq
## 1 Yes, only one 369056
## 2 More than one 41334
## 3 No 79584
ggplot(data = brfss_health, aes(x = factor(persdoc2))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent) +
ylab("relative frequencies") +
xlab("Multiple Health Care Professionals")
#=====
# 3. Could Not See Dr. Because Of Cost
= data.frame(table(brfss_health$medcost))
Med_cost names(Med_cost)[1] = "Not see a dr due to Med cost"
Med_cost
## Not see a dr due to Med cost Freq
## 1 Yes 60107
## 2 No 430447
ggplot(data = brfss_health, aes(x = factor(medcost))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent) +
ylab("relative frequencies") +
xlab("Could not see Dr due to cost")
#=====
# 4. Time Since Last Routine Checkup
= data.frame(table(brfss_health$checkup1))
checkup names(checkup)[1] = "Last checkup"
checkup
## Last checkup Freq
## 1 Within past year 356228
## 2 Within past 2 years 57298
## 3 Within past 5 years 33674
## 4 5 or more years ago 33748
## 5 Never 4522
ggplot(data = brfss_health, aes(x = factor(checkup1))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent) +
ylab("relative frequencies") +
xlab("Time Since Last Routine Checkup") +
coord_flip()
Since there was too few data records for people in college housing, I will drop the column.
# Do You Live In College Housing?
table(brfss_health$colghous)
##
## Yes
## 45
cat("\n")
sapply(brfss_health, function(x) sum(is.na(x)))
## colghous sex hlthpln1 income2 persdoc2 medcost checkup1
## 491730 7 1904 71426 1801 1221 6305
# drop colghous
<- subset(brfss_health, select = -c(colghous)) brfss_health
There is small number of missing records in each column so I just dropped the rows.
# drop missing values
<- brfss_health %>%
health_clean drop_na(sex) %>%
drop_na(hlthpln1) %>%
drop_na(persdoc2) %>%
drop_na(medcost) %>%
drop_na(checkup1)
# after cleaning
sapply(health_clean, function(x) sum(is.na(x)))
## sex hlthpln1 income2 persdoc2 medcost checkup1
## 0 0 67756 0 0 0
It seems between male and female there is not a relationship to health care coverage but income show correlation. As we go up in income categories, the percentage of respondents who are have health care plans increases.
# Checking correlation between resopnse and explanatory variables
# Gender and Health care coverage
# Percentage of health care coverage between gender
<- table(health_clean$hlthpln1, health_clean$sex)
gender_hlthpln prop.table(gender_hlthpln, margin = 2)
##
## Male Female
## Yes 0.8730228 0.8988280
## No 0.1269772 0.1011720
cat("\n")
# Proportion of Males and Females that are covered by a Health care plan
ggplot(health_clean, aes(x = sex, fill = hlthpln1)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Gender") +
labs(fill = "Health care coverage") +
scale_y_continuous(labels = scales::percent, name = "Proportion")
# Income and health care coverage
# Percentage of health care coverage between gender
<- table(health_clean$hlthpln1, health_clean$income2)
income_hlthpln cat("\n")
prop.table(income_hlthpln , margin = 2)
##
## Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
## Yes 0.74134345 0.79169866 0.76882605 0.80166846
## No 0.25865655 0.20830134 0.23117395 0.19833154
##
## Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
## Yes 0.85710116 0.90563424 0.94795060 0.97633553
## No 0.14289884 0.09436576 0.05204940 0.02366447
# Proportion of Males and Females that are covered by a Health care plan
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = hlthpln1)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Income") +
labs(fill = "Health care coverage") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
coord_flip()
There is a slight difference between Men and Women who answered Yes
to the medcost
question (Could Not See Dr Because Of Cost).
There are more women who said yes.
A possible contributing factor could be income gap, between men and
women, which I is analyzed in the third research question and in the
visuals below show a correlation between income and saying yes to
medcost
question.
#Gender and whether could not see a dr due to cost (Medcost)
# Percentage of Medcost between gender
<- table(health_clean$medcost, health_clean$sex)
gender_medcost prop.table(gender_medcost, margin = 2)
##
## Male Female
## Yes 0.1053367 0.1334215
## No 0.8946633 0.8665785
cat("\n")
# Proportion of Males and Females that could not see a dr due to cost
ggplot(health_clean, aes(x = sex, fill = medcost)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Gender") +
labs(fill = "Not see a doc for cost") +
scale_y_continuous(labels = scales::percent, name = "Proportion")
# Income and whether could not see a dr due to cost (Medcost)
<- table(health_clean$medcost, health_clean$income2)
income_medcost cat("\n")
# Percentage of medcost issue between income levels
prop.table(income_medcost , margin = 2)
##
## Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
## Yes 0.28798075 0.25552910 0.23240844 0.19916332
## No 0.71201925 0.74447090 0.76759156 0.80083668
##
## Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
## Yes 0.14565090 0.10528138 0.07032486 0.03494404
## No 0.85434910 0.89471862 0.92967514 0.96505596
# Proportion of income levels that have medcost issue
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = medcost)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Income") +
labs(fill = "Not see a doc for cost") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
coord_flip()
There is a gap between genders in relation to routine checkups. A smaller group of males have done a routine checkup in the past 1 year and since males are a bit under represented in the dataset, the gap could be even bigger.
Percentage of respondents who did a check up within last year between income groups is similar (proportionally, higher income groups do slightly more ofter routine checkups compare to lower income groups) and the same true for within past 2 years. As we look to within 5 years, more than 5 years and never categories, a bigger proportion in lower income groups, belong to these categories.
There seems to be a small relation between Income and time since last checkup, but overall all income groups look very similar.
#Gender and Time since last routine checkup (check up)
<- table(health_clean$checkup1, health_clean$sex)
gender_checkup gender_checkup
##
## Male Female
## Within past year 135026 218402
## Within past 2 years 24690 32025
## Within past 5 years 16498 16806
## 5 or more years ago 18299 15001
## Never 2294 2143
# Percentage of check up between gender
prop.table(gender_checkup, margin = 2)
##
## Male Female
## Within past year 0.686083320 0.768001632
## Within past 2 years 0.125452855 0.112614593
## Within past 5 years 0.083828319 0.059097606
## 5 or more years ago 0.092979416 0.052750398
## Never 0.011656089 0.007535771
cat("\n")
# Proportion of Males and Females in check up categories
ggplot(health_clean, aes(x = sex, fill = checkup1)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Gender") +
labs(fill = "Since last checkup") +
scale_y_continuous(labels = scales::percent, name = "Proportion")
# Income and Time since last routine checkup (check up)
<- table(health_clean$checkup1, health_clean$income2)
income_checkup cat("\n")
# Percentage of routine checkup categories between income levels
prop.table(income_checkup , margin = 2)
##
## Less than $10,000 Less than $15,000 Less than $20,000
## Within past year 0.689261389 0.720549839 0.701399095
## Within past 2 years 0.109629267 0.101251728 0.113573570
## Within past 5 years 0.082425874 0.072915067 0.078978308
## 5 or more years ago 0.102695869 0.090577484 0.092028687
## Never 0.015987601 0.014705882 0.014020340
##
## Less than $25,000 Less than $35,000 Less than $50,000
## Within past year 0.713254722 0.723667751 0.730736887
## Within past 2 years 0.112706723 0.113793679 0.119560737
## Within past 5 years 0.078897152 0.073013093 0.069055353
## 5 or more years ago 0.083814463 0.079476274 0.072996191
## Never 0.011326940 0.010049204 0.007650832
##
## Less than $75,000 $75,000 or more
## Within past year 0.740807968 0.745539522
## Within past 2 years 0.123661919 0.131951818
## Within past 5 years 0.067532347 0.066606904
## 5 or more years ago 0.061745633 0.050296785
## Never 0.006252133 0.005604971
# Proportion of income levels to categories in checkup time
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = checkup1)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Income") +
labs(fill = "Since last checkup") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
coord_flip()
Between genders, females among the respondents report having at least one health professionals whereas that proportion is lower among males.
The percent of respondents to have at least one health professionals among higher income groups, higher. There is positive relationship between income and whether having one or more health professionals.
#Gender and number of health care professional indiduals have
<- table(health_clean$persdoc2, health_clean$sex)
gender_docnum gender_docnum
##
## Male Female
## Yes, only one 138575 225004
## More than one 15711 24781
## No 42521 34592
# Percentage of doc_num between gender
prop.table(gender_docnum, margin = 2)
##
## Male Female
## Yes, only one 0.70411622 0.79121729
## More than one 0.07982948 0.08714137
## No 0.21605431 0.12164134
cat("\n")
# Proportion of Males and Females in doc_num categories
ggplot(health_clean, aes(x = sex, fill = persdoc2)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Gender") +
labs(fill = "Health care Pro") +
scale_y_continuous(labels = scales::percent, name = "Proportion")
# Income and number of health care professional indiduals have
<- table(health_clean$persdoc2, health_clean$income2)
income_docnum cat("\n")
# Percentage of doc_num categories between income levels
prop.table(income_docnum , margin = 2)
##
## Less than $10,000 Less than $15,000 Less than $20,000
## Yes, only one 0.65797953 0.69682076 0.68967139
## More than one 0.09090909 0.09902473 0.08741403
## No 0.25111138 0.20415451 0.22291458
##
## Less than $25,000 Less than $35,000 Less than $50,000
## Yes, only one 0.71582347 0.73855392 0.76358270
## More than one 0.08442607 0.08187391 0.08132307
## No 0.19975046 0.17957218 0.15509423
##
## Less than $75,000 $75,000 or more
## Yes, only one 0.79039064 0.80898710
## More than one 0.07736821 0.07851312
## No 0.13224115 0.11249978
# Proportion of income levels to doc_num categories
ggplot(subset(health_clean, !is.na(income2)), aes(x = income2, fill = persdoc2)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Income") +
labs(fill = "Health care Pro") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
coord_flip()
How does income level and other factors like gender, education level, internet use and number of children related?
Explanatory variable:
income2
(Income level)
sex
(Gender)
educa
(Education level)
Response variable:
income2
(Income level)
children
(Number Of Children In Household)
internet
(Internet Use In The Past 30 Days?)
sex
(Gender)
More specifically, it’s interesting to see if there correlation between attributes below:
After creating a dataset with relevant columns, let’s look at the distribution of Number of Children and Internet use in the data.
About 74% of the respondents have no children and 24% didn’t use Internet in the last 30 days.
<- brfss2013[c("children", "sex", "internet", "income2", "educa")]
brfss_demog
# Missing values
sapply(brfss_demog, function(x) sum(is.na(x)))
## children sex internet income2 educa
## 2274 7 4926 71426 2274
# 1. Number of children
<- data.frame(table(brfss_demog$children))
child_num names(child_num)[1] = "Children"
child_num
## Children Freq
## 1 0 359478
## 2 1 53208
## 3 2 46682
## 4 3 19828
## 5 4 6924
## 6 5 2099
## 7 6 827
## 8 7 240
## 9 8 103
## 10 9 45
## 11 10 34
## 12 11 5
## 13 12 6
## 14 13 11
## 15 14 4
## 16 15 2
## 17 16 1
## 18 17 2
## 19 24 1
## 20 47 1
ggplot(data = brfss_demog, aes(x = factor(children))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent) +
ylab("relative frequencies") +
xlab("Children") +
coord_flip()
# 2. Internet Use
table(brfss_demog$internet)
##
## Yes No
## 367510 119339
ggplot(data = brfss_demog, aes(x = factor(internet))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
scale_y_continuous(labels = scales::percent) +
ylab("relative frequencies") +
xlab("Internet use") +
coord_flip()
After checking amount of missing data, I dropped all of them except for the income column (71426 rows of missing data).
# droping missing values except for the income
<- brfss_demog%>%
demog_clean drop_na(sex) %>%
drop_na(internet) %>%
drop_na(children) %>%
drop_na(educa)
# missing value after cleaning
sapply(demog_clean, function(x) sum(is.na(x)))
## children sex internet income2 educa
## 0 0 0 67086 0
Since the data is imbalanced in relation to the Sex of the respondents, the relative proportions of male to female in each Income category could be incorrect (Not representative of the population) but the trend could be still valid between income levels. It seems that as we go up in income level, male representation is increasing. In top 2 income levels, the proportion of men to women is almost the same, although the proportion of men to women in the dataset in 4:6 in 10 respondents.
Another question that comes to mind is that, if the female respondents were reporting their individual income or the household income, in which case the meaning of this relation is going to be different.
<- table(demog_clean$income2, demog_clean$sex)
gender_income gender_income
##
## Male Female
## Less than $10,000 8200 16991
## Less than $15,000 9101 17445
## Less than $20,000 12420 22118
## Less than $25,000 15570 25756
## Less than $35,000 19467 29028
## Less than $50,000 26635 34473
## Less than $75,000 29180 35637
## $75,000 or more 56114 58994
# Percentage of males and females in each income category
prop.table(gender_income, margin = 1)
##
## Male Female
## Less than $10,000 0.3255131 0.6744869
## Less than $15,000 0.3428388 0.6571612
## Less than $20,000 0.3596039 0.6403961
## Less than $25,000 0.3767604 0.6232396
## Less than $35,000 0.4014228 0.5985772
## Less than $50,000 0.4358676 0.5641324
## Less than $75,000 0.4501905 0.5498095
## $75,000 or more 0.4874900 0.5125100
# Proportion of Males and Females in each income category
ggplot(demog_clean, aes(x = income2, fill = sex)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Income") +
labs(fill = "Sex") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
coord_flip()
Since there more representation of higher income groups in the data, I created percentages of different number of children, for each income group (not across groups). From relative proportions, it seems top two income category tend to have at least one children compare to other groups.
The proportions in other income groups are similar.
# People with 0-3 children (most of the data)
<- subset(demog_clean, demog_clean$children <4)
three_child
# Percentage of income categories for each children count
<- table(three_child$children, three_child$income2)
child_income prop.table(child_income, margin = 2)
##
## Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
## 0 0.77262379 0.81027000 0.77599597 0.77761822
## 1 0.11232621 0.09225333 0.10659236 0.10242634
## 2 0.07573786 0.06471560 0.07760256 0.07957415
## 3 0.03931214 0.03276107 0.03980911 0.04038128
##
## Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
## 0 0.79043405 0.77628348 0.73148148 0.64626443
## 1 0.09839865 0.10246087 0.11959745 0.14058858
## 2 0.07526338 0.08388326 0.10552716 0.15540985
## 3 0.03590392 0.03737240 0.04339390 0.05773713
# Proportion of number of children in income categories
ggplot(three_child, aes(x = income2, fill = factor(children))) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Income") +
labs(fill = "Child Num") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
coord_flip()
It seems there is a correlation between, not reporting income level and education. As we go up in income groups the rate of missing income data, decreases.
There is positive correlation between income and education level. General trend shows, as we go up in income level, bigger proportions have higher education level. Like in the top income category, almost half of the respondents at least graduated college and only 6 percent never attended school.
<- table( demog_clean$educa, demog_clean$income2)
edu_income
# Percentage of income categories in each education category
prop.table(edu_income, margin = 1)
##
## Less than $10,000
## Never attended school or only kindergarten 0.26696833
## Grades 1 through 8 (Elementary) 0.25458489
## Grades 9 though 11 (Some high school) 0.18965363
## Grade 12 or GED (High school graduate) 0.07927833
## College 1 year to 3 years (Some college or technical school) 0.05246461
## College 4 years or more (College graduate) 0.01892866
##
## Less than $15,000
## Never attended school or only kindergarten 0.16063348
## Grades 1 through 8 (Elementary) 0.19453757
## Grades 9 though 11 (Some high school) 0.16513022
## Grade 12 or GED (High school graduate) 0.09121845
## College 1 year to 3 years (Some college or technical school) 0.06104548
## College 4 years or more (College graduate) 0.02000159
##
## Less than $20,000
## Never attended school or only kindergarten 0.17194570
## Grades 1 through 8 (Elementary) 0.19631236
## Grades 9 though 11 (Some high school) 0.18280677
## Grade 12 or GED (High school graduate) 0.12231199
## College 1 year to 3 years (Some college or technical school) 0.08276633
## College 4 years or more (College graduate) 0.02930034
##
## Less than $25,000
## Never attended school or only kindergarten 0.10407240
## Grades 1 through 8 (Elementary) 0.13833563
## Grades 9 though 11 (Some high school) 0.15613533
## Grade 12 or GED (High school graduate) 0.14298444
## College 1 year to 3 years (Some college or technical school) 0.10770284
## College 4 years or more (College graduate) 0.04710970
##
## Less than $35,000
## Never attended school or only kindergarten 0.11538462
## Grades 1 through 8 (Elementary) 0.11003747
## Grades 9 though 11 (Some high school) 0.12776336
## Grade 12 or GED (High school graduate) 0.15298970
## College 1 year to 3 years (Some college or technical school) 0.13277803
## College 4 years or more (College graduate) 0.07368798
##
## Less than $50,000
## Never attended school or only kindergarten 0.06787330
## Grades 1 through 8 (Elementary) 0.05501873
## Grades 9 though 11 (Some high school) 0.08690593
## Grade 12 or GED (High school graduate) 0.15913373
## College 1 year to 3 years (Some college or technical school) 0.17066385
## College 4 years or more (College graduate) 0.13336159
##
## Less than $75,000
## Never attended school or only kindergarten 0.05203620
## Grades 1 through 8 (Elementary) 0.02346677
## Grades 9 though 11 (Some high school) 0.04712253
## Grade 12 or GED (High school graduate) 0.12523974
## College 1 year to 3 years (Some college or technical school) 0.17139192
## College 4 years or more (College graduate) 0.19187618
##
## $75,000 or more
## Never attended school or only kindergarten 0.06108597
## Grades 1 through 8 (Elementary) 0.02770657
## Grades 9 though 11 (Some high school) 0.04448223
## Grade 12 or GED (High school graduate) 0.12684363
## College 1 year to 3 years (Some college or technical school) 0.22118693
## College 4 years or more (College graduate) 0.48573397
# Proportion of income categories in each education category
ggplot(demog_clean, aes(x = educa, fill = income2)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Eduaction") +
labs(fill = "Income") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
Internet use also have a positive correlation with education level. As we go up in education level, rate of internet use in last 30 days increases. Only 7.9 percent didn’t use Internet among respondents that graduated college while that number is 75 percent among “Never attended school or only kindergarten” group.
<- table( demog_clean$educa, demog_clean$internet)
edu_internet
# Percentage of males and females in each income category
prop.table(edu_internet, margin = 1)
##
## Yes
## Never attended school or only kindergarten 0.25460123
## Grades 1 through 8 (Elementary) 0.19426317
## Grades 9 though 11 (Some high school) 0.39641291
## Grade 12 or GED (High school graduate) 0.62266527
## College 1 year to 3 years (Some college or technical school) 0.81987708
## College 4 years or more (College graduate) 0.92065699
##
## No
## Never attended school or only kindergarten 0.74539877
## Grades 1 through 8 (Elementary) 0.80573683
## Grades 9 though 11 (Some high school) 0.60358709
## Grade 12 or GED (High school graduate) 0.37733473
## College 1 year to 3 years (Some college or technical school) 0.18012292
## College 4 years or more (College graduate) 0.07934301
# Proportion of Males and Females in each income category
ggplot(demog_clean, aes(x = educa, fill = internet)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Education") +
labs(fill = "Internet use") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
It seems there is no correlation between gender of the respondents and whether they used Internet in the past 30 days.
<- table( demog_clean$sex, demog_clean$internet)
gender_internet
# Percentage of males and females that either used internet in last 30 days or not
prop.table(gender_internet, margin = 1)
##
## Yes No
## Male 0.768993 0.231007
## Female 0.745686 0.254314
# Proportion of Males and Females that either used internet in last 30 days or not
ggplot(demog_clean, aes(x = sex, fill = internet)) +
geom_bar(position = "fill", aes(y = (..count..)/sum(..count..))) +
xlab("Sex") +
labs(fill = "Internet use") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
coord_flip()