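We use the chi-squared test of independence to check whether two categorical variables are associated, that is, whether the proportions of one variable differ across the levels of the other. The null hypothesis is that the two variables are independent; the alternative is that they are associated.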

The code to conduct a chi-squared test looks like this:

chisq.test(data1, data2)

where data1 and data2 are two categorical columns from the same data frame.
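As a minimal sketch of that call pattern, the example below builds a small made-up data frame (the `toy` data frame and its `smoker` and `exercise` columns are hypothetical, not part of NHANES) and runs the test two ways. It assumes only base R.

```r
# Hypothetical toy data: two categorical columns in one data frame
toy <- data.frame(
  smoker   = rep(c("yes", "no"), times = c(40, 60)),
  exercise = c(rep(c("low", "high"), times = c(30, 10)),   # smokers
               rep(c("low", "high"), times = c(20, 40)))   # non-smokers
)

# Passing two columns: chisq.test() cross-tabulates them internally
res_cols <- chisq.test(toy$smoker, toy$exercise)

# Passing the contingency table directly runs the same test
res_tab <- chisq.test(table(toy$smoker, toy$exercise))

res_cols
res_tab
```

Both calls should report the same statistic and p-value, since the two-column form is a shortcut for building the table first.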

Here, we use the NHANES dataset to test for an association between household income and education level:

df <- read.csv('Data/NHANES.csv')
chisq.test(df$HHIncome, df$Education)
## 
##  Pearson's Chi-squared test
## 
## data:  df$HHIncome and df$Education
## X-squared = 1576.8, df = 60, p-value < 2.2e-16

Since the p-value is less than 0.05, we reject the null hypothesis of independence and conclude that there is an association between household income and education level.
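The printed output can also be read off programmatically. The sketch below uses a small table of made-up counts (the `counts` matrix is hypothetical) to show where the statistic, degrees of freedom, p-value, and expected counts live in the object returned by `chisq.test()`. The degrees of freedom are (rows - 1) × (columns - 1), which is how the NHANES result above reaches df = 60 from a 13 × 6 table, counting the unlabelled row and column.

```r
# Hypothetical counts: 3 income bands (rows) by 2 education levels (columns)
counts <- matrix(c(30, 20, 10,    # first column: school
                   10, 20, 30),   # second column: college
                 nrow = 3,
                 dimnames = list(income    = c("low", "mid", "high"),
                                 education = c("school", "college")))

res <- chisq.test(counts)

res$statistic   # X-squared value
res$parameter   # degrees of freedom: (3 - 1) * (2 - 1) = 2
res$p.value     # p-value, compared against 0.05
res$expected    # counts expected if the two variables were independent
```

Comparing `res$expected` with the observed counts shows which cells drive a large statistic.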

Contingency table

We can create a contingency table by using the following code:

table(data1, data2)
table(df$HHIncome, df$Education)
##              
##                   8th Grade 9 - 11th Grade College Grad High School
##               212        69            104          117         125
##    0-4999      70        20             30           17          26
##    5000-9999   86        27             43           10          47
##   10000-14999 144        41             86           21         130
##   15000-19999 152        56             87           43          95
##   20000-24999 221        59             77           35         100
##   25000-34999 264        69            117          107         167
##   35000-44999 230        40             79          110         195
##   45000-54999 218        18             58          135         140
##   55000-64999 157        18             49          126         122
##   65000-74999 120        11             42          138          71
##   75000-99999 301        13             58          332         108
##   more 99999  604        10             58          907         191
##              
##               Some College
##                        184
##    0-4999               29
##    5000-9999            41
##   10000-14999          121
##   15000-19999           94
##   20000-24999          125
##   25000-34999          234
##   35000-44999          209
##   45000-54999          215
##   55000-64999          149
##   65000-74999          144
##   75000-99999          272
##   more 99999           450

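The table shows the number of observations for each combination of income band and education level. The unlabelled first row and first column count the records whose income or education value is blank in the data.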
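Because the test is about proportions, it can help to convert the counts to row proportions, and to decide whether blank values should count as a category of their own. The sketch below is one possible approach on made-up data (the `toy` data frame is hypothetical): it recodes empty strings as `NA` so that `table()` drops them, then uses `prop.table()`. When reading a CSV, passing `na.strings = c("NA", "")` to `read.csv()` should have a similar effect.

```r
# Hypothetical toy data; "" stands in for a blank cell in the CSV
toy <- data.frame(
  income    = c("low", "low", "high", "", "high", "low", "high", "high"),
  education = c("school", "college", "college", "school", "",
                "school", "college", "school"),
  stringsAsFactors = FALSE
)

# Recode blanks as NA so table() leaves them out (its default is useNA = "no")
toy[toy == ""] <- NA

tab <- table(toy$income, toy$education)
tab

# Row proportions: share of each education level within each income band
prop.table(tab, margin = 1)
```

Each row of the proportion table sums to 1, which makes income bands of different sizes directly comparable.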

©2021 by Daiki Tagami. All rights reserved.