Published by

Nicole Radziwill

Contingency tables provide a way to display the frequencies and relative frequencies of observations, which are classified according to two categorical variables. The elements of one category are displayed across the columns; the elements of the other category are displayed over the rows.

For many semesters now, I’ve asked my students to prepare contingency tables that include row percentages and column percentages. Oh, and also the marginal distributions… you know, the totals on the right margin and the bottom margin. There’s an easy way to do this using Minitab, but I’m not a fan of proprietary software… I prefer open source whenever possible, and I wasn’t aware of a way to do this in R. As a result, I let them build their contingency tables by hand, and type them up in Microsoft Word. (Yeah, not efficient at all.)

Then I found gmodels. After installing gmodels and using:

library(gmodels)

to bring the package into active memory, I was able to create a contingency table SO easily that I can’t bear to think about all the hours I spent doing this sort of thing manually. First, I loaded some data describing the colors and defects associated with over 1200 M&M candies that my students observed. This data set has four variables: student (who collected the data), id (the number of the M&M that the student observed, in the order they encountered it), color (whether the candy was Blue, Red, Brown, Green, Orange, or Yellow), and defect (Letter incomplete or missing, Chipped or Cracked, Multiple defects, or No defects):

> mnms <- read.csv("mnm-clean.csv", header=TRUE)
> head(mnms)
   student id color defect
1 wilburld  1     B      L
2 wilburld  2     B      N
3 wilburld  3     B      N
4 wilburld  4     B      N
5 wilburld  5     B      N
6 wilburld  6     B      C

Then, I constructed a really fancy contingency table IN JUST ONE LINE!!! This was very exciting.

> CrossTable(mnms$color, mnms$defect, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)

You can control whether row percentages (prop.r), column percentages (prop.c), or table percentages (prop.t) show up by setting them to TRUE or FALSE in your call to CrossTable. All three default to TRUE, so I didn’t strictly need to spell them out. Here’s what it looked like:
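Each cell in the CrossTable output stacks several numbers: the count, then (because prop.chisq also defaults to TRUE) that cell’s contribution to the chi-square statistic, then the row, column, and table proportions. As a rough sketch of where the three proportions come from, here is the same arithmetic in base R on a tiny table. The counts are invented for illustration and are not the M&M data.

```r
# Toy counts, invented for illustration (not the M&M data)
counts <- as.table(matrix(c(8, 2,
                            6, 4),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(color  = c("B", "R"),
                                          defect = c("C", "N"))))

row_pct   <- prop.table(counts, margin = 1)  # N / row total    (prop.r)
col_pct   <- prop.table(counts, margin = 2)  # N / column total (prop.c)
table_pct <- prop.table(counts)              # N / table total  (prop.t)

# Marginal totals on the right and bottom edges
with_totals <- addmargins(counts)

print(with_totals)
print(round(row_pct, 3))
print(round(col_pct, 3))
print(round(table_pct, 3))
```

Reading the B/C cell: 8 of the 10 B candies are in column C (row proportion 0.8), 8 of the 14 C candies are B (column proportion about 0.571), and 8 of all 20 candies land in that cell (table proportion 0.4).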

You can also do a full Chi-square test of independence WHILE you’re displaying your contingency table… all you need to do is specify the chisq=TRUE argument to CrossTable. Here’s what I got for that:

Statistics for All Table Factors

Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 = 14.47214 d.f. = 15 p = 0.4900641

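So, with a p-value that high, we can’t reject the hypothesis that the color of an M&M and whether it has a defect are INDEPENDENT. (Strictly speaking, that isn’t proof of independence, just a lack of evidence against it.) That makes sense. If they were not independent, then maybe there would be a problem with the production process.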
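If you want to see what goes into that statistic, here is a minimal sketch that computes a Pearson chi-square by hand and compares it with base R’s chisq.test(). The counts are made up for illustration and the variable names are my own. The degrees of freedom are (rows - 1) x (columns - 1), which for the M&M table of 6 colors by 4 defect categories gives the 15 shown above.

```r
# Invented counts for illustration (not the M&M data)
counts <- as.table(matrix(c(20, 30,
                            30, 20),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(color  = c("B", "R"),
                                          defect = c("C", "N"))))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi2 <- sum((counts - expected)^2 / expected)
df   <- (nrow(counts) - 1) * (ncol(counts) - 1)
p    <- pchisq(chi2, df = df, lower.tail = FALSE)

# Same test via base R; correct = FALSE turns off the 2x2 continuity correction
res <- chisq.test(counts, correct = FALSE)

print(c(chi2 = chi2, df = df, p = p))
print(res)
```

The object returned by chisq.test() also carries residuals and stdres components, which are one common place to look for the cells that depart most from independence when a test does come out significant.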

I also found this fantastic paper that describes how one researcher is exploring alternative (and hopefully better!) ways to visualize categorical data. In addition to being an interesting read, it demonstrates alternatives like the mosaic.

Postscript: I just put a copy of the M&M data on a GitHub repository. I think I’ve started a new habit… this is fantastic. I can actually pull my data directly into R from GitHub. It’s like magic!! Here is the incantation:

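library(RCurl)
# Build the raw-file URL in two pieces so the lines stay short
url <- "https://raw.githubusercontent.com/NicoleRadziwill/Data-"
url <- paste0(url, "for-R-Examples/master/mnm-clean.csv")
# Fetch the file as one string (ssl.verifypeer=FALSE skips certificate checks)
x <- getURL(url, ssl.verifypeer=FALSE)
# Parse the downloaded text as CSV
mnms <- read.csv(text = x)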
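The last line works because read.csv() can parse a character string passed through its text argument instead of reading a file. Here is a small offline sketch of that step, using the six rows printed by head(mnms) earlier. I’m assuming the file is comma-separated with those four column names, and csv_text and sample_rows are names made up for this example.

```r
# The six rows shown by head(mnms), typed in as CSV text
# (assumption: comma-separated, with a header row)
csv_text <- "student,id,color,defect
wilburld,1,B,L
wilburld,2,B,N
wilburld,3,B,N
wilburld,4,B,N
wilburld,5,B,N
wilburld,6,B,C
"

# Parse the string exactly as if it were the contents of a file
sample_rows <- read.csv(text = csv_text)

# Plain counts of color by defect for these six rows
defect_counts <- table(sample_rows$color, sample_rows$defect)
print(defect_counts)
```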


3 responses to “Contingency Tables with gmodels in R”

Shahini

February 8, 2016

What if the p-value from the chi-square test came out as significant? Then how do we tell which color produces the most defects? What test do we perform?

SFer

February 24, 2016

Good post!

At each cell (intersection of a Color row with a Defect column),
there are 5 numbers.

Example:
Take the 1st cell, (intersection of color: Blue with defect: C):

The top # = 83.
Clearly, that is the total number of Blue M&Ms with defect: C.
…fine!

Q:
but, what do the other 4 #s in that same cell represent?
0.822
0.239
0.253
0.067…all in the 1st cell.

Same question about the 2nd numbers (ie: 0.280)
shown in each “marginal” cell.
What do they represent?

Thanks, Nicole!


Quality and Innovation

Contingency Tables with gmodels in R
