Tag Archives: Lean Six Sigma

How the Baldrige Process Can Enrich Any Management System

The “Baldrige Crystal” in a hall at NIST (Gaithersburg, MD). Image Credit: me.

Another wave of reviewing applications for the Malcolm Baldrige National Quality Award (MBNQA) is complete, and I am exhausted — and completely fulfilled and enriched!

That’s the way this process works. As a National Examiner, you will be frustrated, you may cry, and you may think your team of examiners will never come to consensus on the right words to say to the applicant! But because there is a structured process and a discipline, it always happens, and everyone learns.

I’ve been working with the Baldrige Excellence Framework (BEF) for almost 20 years. In the beginning, I used it as a template. Need to develop a Workforce Management Plan that’s solid, and integrates well with leadership, governance, and operations? There’s a framework for that (Criterion 5). Need to beef up your strategic planning process so you do the right thing and get it done right? There’s a framework for that (Criterion 2).

Need to develop Standard Work in any area of your organization, and don’t know where to start (or, want to make sure you covered all the bases)? There’s a framework for that.

Every year, 300 National Examiners are competitively selected from industry experts and senior leaders who care about performance and improvement, and want to share their expertise with others. The stakes are high… after all, this is the only award of its kind sponsored by the highest levels of government!

Once you become a National Examiner (my first year was 2009), you get to look at the Criteria Questions through a completely different lens. You start to see the rich layers of its structure. You begin to appreciate that this guidebook was carefully and iteratively crafted over three decades, drawing from the experiences of executives and senior leaders across a wide swath of industries, faced with both common and unique challenges.

The benefits to companies that are assessed for the award are clear and actionable, but helping others helps examiners, too. Yes, we put in a lot of volunteer hours on evenings and weekends (56 total, for me, this year) — but I got to go deep with one more organization. I got to see how they think of themselves, how they designed their organization to meet their strategic goals, how they act on that design. Our team of examiners got to discuss the strengths we noticed individually, the gaps that concerned us, and we worked together to come to consensus on the most useful and actionable recommendations for the applicant so they can advance to the next stage of quality maturity.

One of the things I learned this year was how well Baldrige complements other frameworks like ISO 9001 and lean. You may have a solid process in place for managing operations, leading continuous improvement events, and sustaining the improvements. You may have a robust strategic planning process, with clear connections between overall objectives and individual actions.

What Baldrige can help you do, even if you’re already a high performance organization, is:

  • tighten the gaps
  • call out places where standard work should be defined
  • identify new breakthrough opportunities for improvement
  • help everyone in your workforce see and understand the connections between people, processes, and technologies

The whitespace (those connections and seams) is where the greatest opportunities for improvement and innovation are hiding. The Criteria Questions in the Baldrige Excellence Framework (BEF) can help you illuminate them.

If Japan Can, Why Can’t We? A Retrospective

June 24, 1980 is kind of like July 4, 1776 for quality management… that’s the pivotal day that NBC News aired its one hour and 16 minute documentary called “If Japan Can, Why Can’t We?” introducing W. Edwards Deming and his methods to the American public.

The video has been unavailable for years, but as of 2018, it’s posted on YouTube. So my sophomore undergrads in Production & Operations Management took a step back in time to get a taste of the environment in the manufacturing industry in the late 1970s, and watched it during class.

The last time I watched it was in 1997, in a graduate industrial engineering class. It didn’t feel quite as dated as it does now, nor did I have the extensive experience in industry as a lens to view the interviews through.

What did surprise me is that the challenges they were facing then aren’t that much different from the ones we face today, and the groundbreaking good advice from Deming is still good advice today.

  • Before 1980, it was common practice to produce a whole bunch of stuff and then check and see which ones were bad, and throw them out. The video provides a clear and consistent story around the need to design quality in to products and processes, which then reduces (or eliminates) the need to inspect bad quality out.
  • It was also common to tamper with a process that was just exhibiting random variation. As one of the line workers in the documentary said, “We didn’t know. If we felt like there might be a problem with the process, we would just go fix it.” Deming’s applications of Shewhart’s methods made it clear that there is no need to tamper with a process that’s exhibiting only random variation.
  • Both workers and managers seemed frustrated with the sheer volume of regulations they had to address, and noted that it served to increase costs, decrease the rate of innovation, and disproportionately hurt small businesses. They noted that there was a great need for government and industry to partner to resolve these issues, and that Japan was a model for making these interactions successful.
  • Narrator Lloyd Dobyns remarked that “the Japanese operate by consensus… we, by competition.” He made the point that one reason industrial reforms were so powerful and positive was that Japanese culture naturally supported working together towards shared goals. He cautioned managers that they couldn’t just drop in statistical quality control and expect a rosy outcome: improving quality is a cultural commitment, and the methods are not as useful in the absence of buy-in and engagement.

The video also sheds light on ASQ’s November question to the Influential Voices, which is: “What’s the key to talking quality with the C-Suite?” Typical responses include: think at the strategic level; create compelling arguments using the language of money; learn the art of storytelling and connect your case with what is important to the executives.

But I think the answer is much more subtle. In the 1980 video, workers comment on how amazed their managers were when Deming proclaimed that management was responsible for improving productivity. How could that be??!? Many managers at that time were convinced that if a productivity problem existed, it was because the workers didn’t work fast enough, or with enough skill — or maybe they had attitude problems! Certainly not because the managers were not managing well.

Simple techniques like improving training programs and establishing quality circles (which demonstrated values like increased transparency, considering all ideas, putting executives on the factory floor so they could learn and appreciate the work being done, increasing worker participation and engagement, encouraging work/life balance, and treating workers with respect and integrity) were already demonstrating benefits in some U.S. companies. But surprisingly, these simple techniques were not widespread, and not common sense.

Just like Deming advocated, quality belongs to everyone. You can’t go to a CEO and suggest that there are quality issues that he or she does not care about. More likely, the CEO believes that he or she is paying a lot of attention to quality. They won’t like it if you accuse them of not caring, or not having the technical background to improve quality. The C-Suite is in a powerful position where they can, through policies and governance, influence not only the actions and operating procedures of the system, but also its values and core competencies — through business model selection and implementation. 

What you can do, as a quality professional, is acknowledge and affirm their commitment to quality. Communicate quickly, clearly, and concisely when you do. Executives have to find the quickest ways to decompose and understand complex problems in rapidly changing external environments, and then make decisions that affect thousands (and sometimes, millions!) of people. Find examples and stories from other organizations who have created huge ripples of impact using quality tools and technologies, and relate them concretely to your company.

Let the C-Suite know that you can help them leverage their organization’s talent to achieve their goals, then continually build their trust.

The key to talking quality with the C-suite is empathy.

You may also be interested in “Are Deming’s 14 Points Still Valid?” from Nov 19, 2012.

A Simple Intro to Bayesian Change Point Analysis

The purpose of this post is to demonstrate change point analysis by stepping through an example of the technique in R presented in Rizzo’s excellent, comprehensive, and very mathy book, Statistical Computing with R, and then showing alternative ways to process this data using the changepoint and bcp packages. Much of the commentary is simplified, and that’s on purpose: I want to make this introduction accessible if you’re just learning the method. (Most of the code is straight from Rizzo who provides a much more in-depth treatment of the technique. I’ve added comments in the code to make it easier for me to follow, and that’s about it.)

The idea itself is simple: you have a sample of observations from a Poisson (counting) process (where events occur randomly over a period of time). You probably have a chart that shows time on the horizontal axis, and how many events occurred on the vertical axis. You suspect that the rate at which events occur has changed somewhere over that range of time… either the event is increasing in frequency, or it’s slowing down — but you want to know with a little more certainty. (Alternatively, you could check to see if the variance has changed, which would be useful for process improvement work in Six Sigma projects.)

You want to estimate the rate at which events occur BEFORE the shift (mu), the rate at which events occur AFTER the shift (lambda), and the time when the shift happens (k). To do it, you can apply a Markov Chain Monte Carlo (MCMC) sampling approach to estimate the population parameters at each possible k, from the beginning of your data set to the end of it. The values you get at each time step will be dependent only on the values you computed at the previous timestep (that’s where the Markov Chain part of this problem comes in). There are lots of different ways to hop around the parameter space, and each hopping strategy has a fancy name (e.g. Metropolis-Hastings, Gibbs, “reversible jump”).

In one example, Rizzo (p. 271-277) uses a Markov Chain Monte Carlo (MCMC) method that applies a Gibbs sampler to do the hopping – with the goal of figuring out the change point in number of coal mine disasters from 1851 to 1962. (Looking at a plot of the frequency over time, it appears that the rate of coal mining disasters decreased… but did it really? And if so, when? That’s the point of her example.) She gets the coal mining data from the boot package. Here’s how to get it, and what it looks like:

library(boot)
data(coal)
y <- tabulate(floor(coal[[1]]))
y <- y[1851:length(y)]
barplot(y,xlab="years", ylab="frequency of disasters")

[Figure: bar plot of the frequency of coal mining disasters by year]

First, we initialize all of the data structures we’ll need to use:

# initialization
n <- length(y) # number of data elements to process
m <- 1000 # target length of the chain
# set up blank 1000 element arrays for mu, lambda, and k
# (they have to exist before their first elements are assigned)
mu <- lambda <- k <- numeric(m)
L <- numeric(n) # likelihood fxn has one slot per year
k[1] <- sample(1:n,1) # pick 1 random year to start at
mu[1] <- 1
lambda[1] <- 1
b1 <- 1
b2 <- 1

Here are the models for prior (hypothesized) distributions that she uses, based on the Gibbs sampler approach:

  • mu comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies UP TO the point in time, k, you’re currently at) and a rate of (k + b1)
  • lambda comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies AFTER the point in time, k, you’re currently at) and a rate of (n – k + b2) where n is the total number of years in your data
  • b1 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (mu + 1)
  • b2 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (lambda + 1)
  • a likelihood function L is also provided, and is a function of k, mu, lambda, and the sum of all the frequencies up until that point in time, k
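Written out as formulas, the draws in the list above look like this. The shorthand S_k for the running total of counts is mine, added for compactness, and everything is transcribed from the code in this post (with Gamma written as shape, rate) rather than quoted from Rizzo, so treat it as a reading aid and not as her exact notation:

```latex
\begin{align*}
S_k &= \sum_{i=1}^{k} y_i \\
\mu \mid k, b_1 &\sim \mathrm{Gamma}\left(0.5 + S_k,\; k + b_1\right) \\
\lambda \mid k, b_2 &\sim \mathrm{Gamma}\left(0.5 + S_n - S_k,\; n - k + b_2\right) \\
b_1 \mid \mu &\sim \mathrm{Gamma}\left(0.5,\; \mu + 1\right) \\
b_2 \mid \lambda &\sim \mathrm{Gamma}\left(0.5,\; \lambda + 1\right) \\
L(k; \mu, \lambda) &= e^{k(\lambda - \mu)} \left(\frac{\mu}{\lambda}\right)^{S_k},
\qquad
P(k = j \mid \mu, \lambda) = \frac{L(j; \mu, \lambda)}{\sum_{i=1}^{n} L(i; \mu, \lambda)}
\end{align*}
```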

At each iteration, you pick a value of k to represent a point in time where a change might have occurred. You slice your data into two chunks: the chunk that happened BEFORE this point in time, and the chunk that happened AFTER this point in time. Using your data, you apply a Poisson Process with a (Hypothesized) Gamma Distributed Rate as your model. This is a pretty common model for this particular type of problem. It’s like randomly cutting a deck of cards and taking the average of the values in each of the two cuts… then doing the same thing again… a thousand times. Here is Rizzo’s (commented) code:

# start at 2, so you can use initialization values as seeds
# and go through this process once for each of your m iterations
for (i in 2:m) {
 kt <- k[i-1] # start w/random year from initialization
 # set your shape parameter to pick mu from, based on the characteristics
 # of the early ("before") chunk of your data
 r <- .5 + sum(y[1:kt]) 
 # now use it to pick mu
 mu[i] <- rgamma(1,shape=r,rate=kt+b1) 
 # if you're at the end of the time periods, set your shape parameter
 # to 0.5 + the sum of all the frequencies, otherwise, just set the shape
 # parameter that you will use to pick lambda based on the later ("after")
 # chunk of your data
 if (kt+1 > n) r <- 0.5 + sum(y) else r <- 0.5 + sum(y[(kt+1):n])
 lambda[i] <- rgamma(1,shape=r,rate=n-kt+b2)
 # now use the mu and lambda values that you got to set b1 and b2 for next iteration
 b1 <- rgamma(1,shape=.5,rate=mu[i]+1)
 b2 <- rgamma(1,shape=.5,rate=lambda[i]+1)
 # for each year, find value of LIKELIHOOD function which you will 
 # then use to determine what year to hop to next
 for (j in 1:n) {
 L[j] <- exp((lambda[i]-mu[i])*j) * (mu[i]/lambda[i])^sum(y[1:j])
 }
 L <- L/sum(L)
 # determine which year to hop to next
 k[i] <- sample(1:n,prob=L,size=1)
}

Knowing the distributions of mu, lambda, and k from hopping around our data will help us estimate values for the true population parameters. At the end of the simulation, we have an array of 1000 values of k, an array of 1000 values of mu, and an array of 1000 values of lambda — we use these to estimate the real values of the population parameters. Typically, algorithms that do this automatically throw out a whole bunch of them in the beginning (the “burn-in” period) — Rizzo tosses out 200 observations — even though some statisticians (e.g. Geyer) say that the burn-in period is unnecessary:

> b <- 201 # treats time until the 200th iteration as "burn-in"
> mean(k[b:m])
[1] 39.765
> mean(lambda[b:m])
[1] 0.9326437
> mean(mu[b:m])
[1] 3.146413

The change point happened between the 39th and 40th observations, the arrival rate before the change point was 3.14 arrivals per unit time, and the rate after the change point was 0.93 arrivals per unit time. (Cool!)

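One way to build some trust in a sampler like this is to run it on simulated data where you already know the answer. Here is a minimal sketch of that idea; it is my own variation and not Rizzo’s code. The simulated rates, the seed, the function name gibbs_changepoint, and the chain length are all assumptions I made for illustration. It also differs from the listing above in two ways: it works with the log of the likelihood so that large exponents don’t overflow, and when k lands on the last observation it treats the "after" chunk as empty.

```r
# Sanity check on simulated data with a known change point.
# All names and settings here are illustrative assumptions.
set.seed(2024)

true_k <- 40                          # assumed change point for the simulation
y_sim <- c(rpois(true_k, 3),          # 40 periods at a rate of 3 events
           rpois(60, 1))              # 60 periods at a rate of 1 event

gibbs_changepoint <- function(y, m = 2000) {
  n <- length(y)
  S <- cumsum(y)                      # running totals: S[j] = sum(y[1:j])
  mu <- lambda <- k <- numeric(m)
  k[1] <- n %/% 2                     # start in the middle of the series
  mu[1] <- lambda[1] <- 1
  b1 <- b2 <- 1
  for (i in 2:m) {
    kt <- k[i - 1]
    # "before" chunk sets the shape for mu; "after" chunk sets it for lambda
    # (if kt == n the "after" chunk is empty, so the shape is just 0.5)
    mu[i] <- rgamma(1, shape = 0.5 + S[kt], rate = kt + b1)
    lambda[i] <- rgamma(1, shape = 0.5 + S[n] - S[kt], rate = n - kt + b2)
    b1 <- rgamma(1, shape = 0.5, rate = mu[i] + 1)
    b2 <- rgamma(1, shape = 0.5, rate = lambda[i] + 1)
    # log-likelihood of every candidate change point j = 1..n
    logL <- (lambda[i] - mu[i]) * (1:n) + S * (log(mu[i]) - log(lambda[i]))
    p <- exp(logL - max(logL))        # subtract the max to avoid overflow
    k[i] <- sample(1:n, size = 1, prob = p / sum(p))
  }
  list(k = k, mu = mu, lambda = lambda)
}

fit <- gibbs_changepoint(y_sim)
keep <- 201:length(fit$k)             # drop the first 200 draws as burn-in

k_hat <- median(fit$k[keep])
mu_hat <- mean(fit$mu[keep])
lambda_hat <- mean(fit$lambda[keep])
k_interval <- quantile(fit$k[keep], c(0.025, 0.975))

print(c(k = k_hat, mu = mu_hat, lambda = lambda_hat))
print(k_interval)                     # rough 95% interval for the change point
```

If the estimates land near the values used to simulate the data (a change around index 40, rates near 3 and 1), the sampler is at least behaving sensibly. For the coal data, y[1] is the count for 1851, so an index k corresponds to the year 1850 + k, which puts an index of about 40 at roughly 1890.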
After I went through this example, I discovered the changepoint package, which let me run through a similar process in just a few lines of code. Fortunately, the results were very similar! I chose the “AMOC” method which stands for “at most one change”. Other methods are available which can help identify more than one change point (PELT, BinSeg, and SegNeigh – although I got an error message every time I attempted that last method).

> library(changepoint)
> results <- cpt.mean(y,method="AMOC")
> cpts(results)
cpt 
 36 
> param.est(results)
$mean
[1] 3.2500000 0.9736842
> plot(results,cpt.col="blue",xlab="Index",cpt.width=4)

[Figure: plot of the changepoint results for the coal mining data]
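To get a feel for what an “at most one change” search in the mean is doing, here is a base R sketch that tries every possible split and keeps the one where the two segments are best described by their own averages. This is not the changepoint package’s implementation, which (as far as I understand) also decides whether any change is warranted at all; this version always returns its best split. The simulated data, the seed, and the helper name split_cost are assumptions for illustration.

```r
# Brute-force search for a single change in mean, on simulated counts.
set.seed(7)
y_sim <- c(rpois(40, 3), rpois(60, 1))   # true change after index 40
n <- length(y_sim)

# Cost of splitting after index tau: squared error around each
# segment's own mean (smaller means the split fits better)
split_cost <- function(tau) {
  before <- y_sim[1:tau]
  after <- y_sim[(tau + 1):n]
  sum((before - mean(before))^2) + sum((after - mean(after))^2)
}

taus <- 1:(n - 1)                        # every split that leaves both sides non-empty
costs <- vapply(taus, split_cost, numeric(1))
tau_hat <- taus[which.min(costs)]        # best single split

segment_means <- c(mean(y_sim[1:tau_hat]), mean(y_sim[(tau_hat + 1):n]))
no_change_cost <- sum((y_sim - mean(y_sim))^2)

print(tau_hat)
print(segment_means)
print(c(best_split = min(costs), no_change = no_change_cost))
```

The gap between the best split cost and the no-change cost is the raw material a package would compare against a penalty before declaring a change point.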

I decided to explore a little further and found even MORE change point analysis packages! So I tried this example using bcp (which I presume stands for “Bayesian Change Point”) and voila… the output looks very similar to each of the previous two methods:

[Figure: bcp output for the coal mining data]

It’s at this point that the HARD part of the data science project would begin… WHY? Why does it look like the rate of coal mining accidents decreased suddenly? Was there a change in policy or regulatory requirements in Great Britain, where this data was collected? Was there some sort of mass exodus away from working in the mines, and so there’s a covariate in the number of opportunities for a mining disaster to occur? Don’t know… the original paper from 1979 doesn’t reveal the true story behind the data.

There are also additional resources on R Bloggers that discuss change point analysis:

(Note: If I’ve missed anything, or haven’t explained anything right, please provide corrections and further insights in the comments! Thank you.)

What (Really) is a Data Scientist?

Drew Conway’s very popular Data Science Venn Diagram. From http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

What is a data scientist? What makes for a good (or great!) data scientist? It’s been challenging enough to determine what a data scientist really is (several people have proposed ways to look at this). The Guardian (a UK publication) said, however, that a true data scientist is as “rare as a unicorn”.

I believe that the data scientist “unicorn” is hidden right in front of our faces; the purpose of this post is to help you find it. First, we’ll take a look at some models, and then I’ll present my version of what a data scientist is (and how this person can become “great”).

#1 Drew Conway’s popular “Data Science Venn Diagram” (created in 2010) characterizes the data scientist as a person with some combination of skills and expertise in three categories (and preferably, depth in all of them): 1) Hacking, 2) Math and Statistics, and 3) Substantive Expertise (also called “domain knowledge”).

Later, he added that there was a critical missing element in the diagram: that effective storytelling with data is fundamental. The real value-add, he says, is being able to construct actionable knowledge that facilitates effective decision making. How to get the “actionable” part? Be able to communicate well with the people who have the responsibility and authority to act.

“To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. On the flip-side, substantive expertise plus math and statistics knowledge is where most traditional researcher falls. Doctoral level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology. Part of this is the culture of academia, which does not reward researchers for understanding technology. That said, I have met many young academics and graduate students that are eager to bucking that tradition.” – Drew Conway, March 26, 2013

#2 In 2013, Harlan Harris (along with his two colleagues, Sean Patrick Murphy and Marck Vaisman) published a fantastic study where they surveyed approximately 250 professionals who self-identified with the “data science” label. Each person was asked to rank their proficiency in each of 22 skills (for example, Back-End Programming, Machine Learning, and Unstructured Data). Using clustering, they identified four distinct “personality types” among data scientists: Data Businesspeople, Data Creatives, Data Developers, and Data Researchers.

As a manager, you might try to cut corners by hiring all Data Creatives(*). But then, you won’t benefit from the ultra-awareness that theorists provide. They can help you avoid choosing techniques that are inappropriate, if (say) your data violates the assumptions of the methods. This is a big deal! You can generate completely bogus conclusions by using the wrong tool for the job. You would not benefit from the stress relief that the Data Developers will provide to the rest of the data science team. You would not benefit from the deep domain knowledge that the Data Businessperson can provide… that critical tacit and explicit knowledge that can save you from making a potentially disastrous decision.

Although most analysts and researchers who screw up their analyses do so innocently, by stumbling into misuses of statistical techniques, some unscrupulous folks might mislead others on purpose. For an extreme case, see I Fooled Millions Into Thinking Chocolate Helps Weight Loss.

Their complete results are available as a 30-page report (available in print or on Kindle).

#3 The Guardian is, in my opinion, a little more rooted in realistic expectations:

“The data scientist’s skills – advanced analytics, data integration, software development, creativity, good communications skills and business acumen – often already exist in an organisation. Just not in a single person… likely to be spread over different roles, such as statisticians, bio-chemists, programmers, computer scientists and business analysts. And they’re easier to find and hire than data scientists.”

They cite British Airways as an exemplar:

“[British Airways] believes that data scientists are more effective and bring more value to the business when they work within teams. Innovation has usually been found to occur within team environments where there are multiple skills, rather than because someone working in isolation has a brilliant idea, as often portrayed in TV dramas.”

Their position is you can’t get all those skills in one person, so don’t look for it. Just yesterday I realized that if I learn one new amazing thing in R every single day of my life, by the time I die, I will probably be an expert in about 2% of the package (assuming it’s still around).

#4 Others have chimed in on this question and provided outlines of skill sets, such as:

  • Six Qualities of a Great Data Scientist: statistical thinking, technical acumen, multi-modal communication skills, curiosity, creativity, grit
  • The Udacity blog: basic tools (R, Python), software engineering, statistics, machine learning, multivariate calculus, linear algebra, data munging, data visualization and communication, and the ultimately nebulous “thinking like a data scientist”
  • IBM: “part analyst, part artist” skilled in “computer science and applications, modeling, statistics, analytics and math… [and] strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.”
  • SAS: “a new breed of analytical data expert who have the technical skills to solve complex problems – and the curiosity to explore what problems need to be solved. They’re part mathematician, part computer scientist and part trend-spotter.” (Doesn’t that sound exciting?)
  • DataJobs.Com: well, these guys just took Drew Conway’s Venn diagram and relabeled it.

#5 My Answer to “What is a Data Scientist?”:  A data scientist is a sociotechnical boundary spanner who helps convert data and information into actionable knowledge.

Based on all of the perspectives above, I’d like to add that the data scientist must have an awareness of the context of the problems being solved: social, cultural, economic, political, and technological. Who are the stakeholders? What’s important to them? How are they likely to respond to the actions we take in response to the new knowledge data science brings our way? What’s best for everyone involved so that we can achieve sustainability and the effective use of our resources? And what’s with the word “helps” in the definition above? This is intended to reflect that in my opinion, a single person can’t address the needs of a complex data science challenge. We need each other to be “great” at it.

A data scientist is someone who can effectively span the boundaries between

1) understanding social+ context, 

2) correctly selecting and applying techniques from math and statistics,

3) leveraging hacking skills wherever necessary,

4) applying domain knowledge, and

5) creating compelling and actionable stories and connections that help decision-makers achieve their goals. This person has a depth of knowledge and technical expertise in at least one of these five areas, and a high level of familiarity with each of the other areas (commensurate with Harris’ T-model). They are able to work productively within a small team whose deep skills span all five areas.

It’s data-driven decision making embedded in a rich social, cultural, economic, political, and technological context… where the challenges may be complex, and the stakes (and ultimately, the benefits) may be high. 


(*) Disclosure: I am a Data Creative!

(**)Quality professionals (like Six Sigma Black Belts) have been doing this for decades. How can we enhance, expand, and leverage our skills to address the growing need for data scientists?

My New Favorite Statistics & Data Analysis Book Using R


NOTE: The 2nd Edition (Red Swan) was released in 2017. There is a companion book that presents end-to-end examples of each of the methods.


As of today, I now have a NEW FAVORITE introductory statistics textbook… the one I’ve always dreamed of having. I’ve been looking for a book to use in my classes for undergraduate sophomores and juniors, but none of the textbooks I considered over the past three years (and I’ve looked at over a hundred!) had all of the things I really, really wanted. So I had to go make it happen myself. These things are:

download the preview here (first ~100 pages)

1) An integrated treatment of theory and practice. All of my stats textbooks have a lot of formulas, and no information about how to do what the formulas do in the R statistical software. All of my R textbooks have a lot of information about how to run the commands, but not really much information about what formulas are being used. I wanted a book that would show how to solve problems analytically (using the equations), and then show how they’re done in R. If there were discrepancies between the stats textbook answers and the R answers, I wanted to know why. A lot of times, the developers of R packages use very sophisticated adjustments and corrections, which I only became aware of because my analytical solutions didn’t match the R output. At first, I thought I was wrong. But later, I realized I was right, and R was right: we were just doing different things. I wanted my students to know what was going on under the hood, and have an awareness of exactly which methods R was using at every moment.

2) An easy way to develop research questions for observational studies and organize the presentation of results. We always do small research projects in my classes, and in my opinion, this is the best way for students to get a strong grasp of the fundamental statistical concepts. But they always have the same questions: Which statistical test should I use? How should I phrase my research question? What should I include in my report? I wanted a book that made developing statistical research questions easy. In fact, I know a lot of people I went to PhD school with that would have loved to have this book while they were proposing, conducting, and defending their dissertations.

3) A confidence interval cookbook. This is probably one of the most important things I want my students to leave my class remembering: that from whatever sample you collect, you can construct a confidence interval that will give you an idea of what the true population parameter should be. You don’t even need to do a hypothesis test! But it can be difficult to remember which formula to use… so I wanted an easy reference where I’d be able to look things up, and find out really easily how to use R to construct those confidence intervals for me. Furthermore, some of the confidence intervals that everyone is taught in an introductory statistics course are wildly inaccurate – and statisticians know this. But they hesitate to scare away novice data analysts with long, scary looking equations, and so students keep learning those inaccurate methods and believing they’re good. Since so many people never get beyond introductory statistics and still turn into researchers in other fields, I thought this was horrible. I want to make sure my students know the best way to do each confidence interval in their first class… even if the equations are not as friendly.

4) An inference test cookbook. I wanted a book that stepped me through each of the primary parametric inference tests analytically (using the equations), and then showed me how it was done in R. If there were discrepancies, I wanted to know why. I wanted an easy way to remember the assumptions for each test, and when to use a pooled standard deviation versus an unpooled one. There’s a lot to keep track of! I wanted a reference that would make it easy to keep track of all of it: assumptions, tests for assumptions, equations, R code, and diagnostic plots.

5) No step left behind. It’s really frustrating to me how so many R books assume you can do a psychic fill-in-the-blank for missing code. Since I’ve been using R for several years now, I’ve gotten to the point where my psychic abilities are pretty good, and at least 60% of the time I can figure out the missing pieces. But wow, what a waste of time! So I wanted a book that had all of the steps for each example. Even if it was a little repetitive. I may have missed this in a few places, but I think beginners will have a much easier time with this book. Also, I put all my data and functions on GitHub for people to run the examples with. I’m growing this slowly, but I don’t want people to be left in the lurch.

6) An easy way to produce any of the charts and graphs in the book. One of my pet peeves about R books is that the authors generate beautiful charts and graphs, and then you’re reading through the book and say “Yes!! Yes!! That’s the chart I need for my report… I want to do that… how did they do that?” and they don’t tell you anywhere how they did it. I did not want there to be any secrets in this book. If I generated a page of interesting looking simulated distributions, I wanted you to know how I did it (just in case you want to do it later).

GRANTED… I am sure it will not be perfect – no book is. (For example, Google Forms changes a lot and there are a couple examples that use it that will probably be outdated when the book gets to press… and I just found out this morning that you don’t need the source_https trick in R 3.2.0 and beyond.) [Note: data access has been fully updated in the 2nd Edition.] However, I will keep updating my blog with posts about useful things as they evolve.

In any case, I hope you enjoy my book as much as I’ve been enjoying using it as a reference for myself… it really is all my most important notes, neatly organized into just over 500 pages of everything I want to remember. And everything I want to make sure my students take with them after they leave my class.

[Note: Any errors and omissions from earlier printings (which have been taken care of in later printings) are being recorded at http://qualityandinnovation.com/errata/.]

Why the Ban on P-Values? And What Now?

Just recently, the editors of the academic journal Basic and Applied Social Psychology decided to ban p-values: that’s right, the nexus for inferential decision making… gone! This has created quite a fuss among anyone who relies on significance testing and p-values to do research (especially those, presumably, in social psychology who were hoping to submit a paper to that journal any time soon). The Royal Statistical Society even shared six interesting letters from academics to see how they felt about the decision.

These letters all tell a similar story: yes, p-values can be misused and misinterpreted, and we need to be more careful about how we plan for, and interpret the results of, just one study! But not everyone advocated throwing out the method in its entirety. I think a lot of the issues could be avoided if people had a better gut sense of what sampling error is… and how easy it is to encounter (and as a result, how easy it can be to accidentally draw a wrong conclusion in an inference study just based on sampling error). I wanted to do a simulation study to illustrate this for my students.

The Problem

You’re a student at a university in a class full of other students. I tell you to go out and randomly sample 100 students, asking them what their cumulative GPA is. You come back to me with 100 different values, and some mean value that represents the average of all the GPAs you went out and collected. You can also find the standard deviation of all the values in your sample of 100. Everyone has their own unique sample.

“It is misleading to emphasize the statistically significant findings of any single team. What matters is the totality of the evidence.” – John P. A. Ioannidis in Why Most Published Research Findings Are False

It’s pretty intuitive that everyone will come back with a different sample… and thus everyone will have a different point estimate of the average GPA that students have at your university. But, according to the central limit theorem, we also know that if we take the collection of all the average GPAs and plot a histogram, it will be approximately normally distributed with a peak around the real average GPA. Some students’ estimates will be really close to the real average GPA. Some students’ estimates will be much lower (for example, if you collected the data at a meeting for students who are on academic probation). Some students’ estimates will be much higher (for example, if you collected the data at a meeting for honors students). This is sampling error, which can lead to incorrect inferences during significance testing.
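To get a feel for this before touching real data, here is a minimal sketch in base R. It assumes a made-up, skewed population of GPAs (a scaled Beta distribution, chosen only for illustration and not taken from any university) and a hypothetical class of 2,000 students who each sample 100 GPAs. If the central limit theorem is doing its job, the spread of the students' averages should be close to the population standard deviation divided by the square root of the sample size.

```r
# Sketch: sampling distribution of the mean for a skewed, GPA-like population.
# The Beta(5, 2) shape is an illustrative assumption, not real GPA data.
set.seed(42)
population <- 4 * rbeta(100000, shape1 = 5, shape2 = 2)  # values between 0 and 4

n.students <- 2000   # hypothetical number of students collecting samples
sample.size <- 100   # GPAs collected by each student

# Each student's point estimate is the mean of their own random sample
sample.means <- replicate(n.students, mean(sample(population, sample.size)))

pop.mean <- mean(population)
se.theory <- sd(population) / sqrt(sample.size)  # standard error suggested by the CLT

cat("Population mean:        ", round(pop.mean, 3), "\n")
cat("Mean of sample means:   ", round(mean(sample.means), 3), "\n")
cat("SD of sample means:     ", round(sd(sample.means), 3), "\n")
cat("Theoretical std. error: ", round(se.theory, 3), "\n")

# hist(sample.means) draws the histogram of all the students' estimates
```

Even though the population itself is lopsided, the averages pile up around the population mean, and a handful of students still land noticeably high or low just by chance.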

Inferential statistics is good because it lets us make decisions about a whole population just based on one sample. It would require a lot of time, or a lot of effort, to go out and collect a whole bunch of samples. Inferential statistics is bad if your sample size is too small (and thus you haven’t captured the variability in the population within your sample) or if you happen to draw one of these unfortunate too-high or too-low samples, because you can make incorrect inferences. Like this.

The Input Distribution

Let’s test this using simulation in R. Since we want to randomly sample the cumulative GPAs of students, let’s choose a distribution that reasonably reflects the distribution of all GPAs at a university. To do this, I searched the web to see if I could find data that might help me get this distribution. I found some data from the University of Colorado Boulder that describes GPAs and their corresponding percentile ranks. From this data, I could put together an empirical CDF, and then since the CDF is the integral of the PDF, I approximated the PDF by taking the derivatives of the CDF. (I know this isn’t the most efficient way to do it, but I wanted to see both plots):

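# GPA scores and their percentile ranks (from the Boulder data)
score <- c(.06,2.17,2.46,2.67,2.86,3.01,3.17,3.34,3.43,3.45,
3.46,3.48,3.5,3.52,3.54,3.56,3.58,3.6,3.62,3.65,3.67,3.69,
3.71,3.74,3.77,3.79,3.82,3.85,3.88,3.91,3.94,3.96,4.0,4.0)
perc.ranks <- c(0,10,20,30,40,50,60,70,75,76,77,78,79,80,81,
82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100)
xs <- score
# A percentile rank is already the cumulative percentage at that score,
# so the empirical CDF is perc.ranks/100. (ecdf(perc.ranks) would return
# rank/n for the 34 tabulated points, which is not the CDF of the GPAs.)
ys <- perc.ranks/100
# Approximate the PDF as the slope of the CDF between adjacent points
slope <- rep(NA,length(xs))
for (i in 2:length(xs)) {
 slope[i] <- (ys[i]-ys[i-1])/(xs[i]-xs[i-1])
}
slope[1] <- 0
# The last two scores are both 4.0 (zero width), so reuse the previous slope
slope[length(xs)] <- slope[length(xs)-1]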

Then, I plotted them both together:

par(mfrow=c(1,2))
plot(xs,slope,type="l",main="Estimated PDF")
plot(xs,ys,type="l",main="Estimated CDF")
dev.off()

[Figure: two panels, “Estimated PDF” and “Estimated CDF”, from the Boulder GPA data]
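As an aside, the finite-difference step (slope of the CDF between neighboring points) can be written without a loop. The sketch below uses a small hypothetical table of scores and cumulative proportions, not the Boulder data, and relies only on base R's diff().

```r
# Sketch: approximate a PDF from tabulated CDF points with diff().
# The scores and cumulative proportions below are a small hypothetical table.
score <- c(2.0, 2.5, 3.0, 3.5, 4.0)
cdf   <- c(0.10, 0.25, 0.50, 0.80, 1.00)   # cumulative proportion at each score

# Slope of the CDF between adjacent points: change in probability / change in score
slope <- diff(cdf) / diff(score)

# Each slope describes the interval between two scores, so pair it with the midpoint
midpoints <- head(score, -1) + diff(score) / 2

print(data.frame(midpoints, slope))
```

One caveat with this shortcut: repeated scores (such as two entries at 4.0) create zero-width intervals, so they would need to be dropped or merged before dividing.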

I looked around for a distribution that might approximate what I saw. (Yes, I am eyeballing.) I found the Stable Distribution, and then played around with the parameters until I plotted something that looked like the empirical PDF from the Boulder data:

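library(stabledist)  # assumed source of dstable() and rstable(); the package isn't named in this post
x <- seq(0,4,length=100)
hx <- dstable(x, alpha=0.5, beta=0.75, gamma=1, delta=3.2)
plot(x,hx,type="l",lty=2,lwd=2)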

[Figure: density of the stable distribution with the parameters above]

The Simulation

First, I used pwr.t.test to do a power analysis to see what sample size I needed to obtain a power of 0.8, assuming a small but not tiny effect size, at a level of significance of 0.05. It told me I needed at least 89. So I’ll tell my students to each collect a sample of 100 other students.
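The exact call isn't shown here, so the following is only a sketch of what a power analysis like this can look like. It swaps in base R's power.t.test() (from the stats package) in place of pwr.t.test so that it runs without extra packages, and it assumes a standardized effect size of 0.3, a one-sample test, and a two-sided alternative. Those settings are my guesses at "small but not tiny", not values confirmed by the text above.

```r
# Sketch: sample size for a one-sample t-test using base R's power.t.test().
# delta = 0.3 with sd = 1 is an ASSUMED standardized effect size (Cohen's d = 0.3);
# the two-sided alternative is also an assumption.
result <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80,
                       type = "one.sample", alternative = "two.sided")

n.required <- result$n
cat("Required sample size (unrounded):", round(n.required, 2), "\n")
cat("Rounded up:", ceiling(n.required), "\n")
```

Under these assumed settings the answer should land in the neighborhood of the figure quoted above. Shrinking the effect size raises the required sample size quickly, which is why rounding up to 100 leaves a little breathing room.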

Now that I have a distribution to sample from, I can pretend like I’m sending 10,000 students out to collect a sample of 100 students’ cumulative GPAs. I want each of my 10,000 students to run a one-sample t-test to evaluate the null hypothesis that the real cumulative GPA is 3.0 against the alternative hypothesis that the actual cumulative GPA is greater than 3.0. (Fortunately, R makes it easy for me to pretend I have all these students.)

sample.size <- 100
numtrials <- 10000
p.vals <- rep(NA,numtrials)
gpa.means <- rep(NA,numtrials)
compare.to <- 3.00
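for (j in 1:numtrials) {
     # draw candidate GPAs from the stable distribution and keep only valid ones
     r <- rstable(n=1000,alpha=0.5,beta=0.75,gamma=1,delta=3.2)
     meets.conds <- r[r>0 & r<4.001]
     # each simulated student uses the first 100 valid values as their sample
     my.sample <- round(meets.conds[1:sample.size],3)
     gpa.means[j] <- round(mean(my.sample),3)
     # one-sided, one-sample t-test of H0: mean = 3.0 vs. Ha: mean > 3.0
     p.vals[j] <- t.test(my.sample,mu=compare.to,alternative="greater")$p.value
     if (p.vals[j] < 0.02) {
          # capture the last one of these data sets to look at later
          capture <- my.sample
     }
}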

For all of my 10,000 students’ significance tests, look at the spread of p-values! They are all over the place! And there are 46 students (out of 10,000, or about 0.5%) whose p-values were less than 0.05… and they rejected the null. One of the distributions of observed GPAs for a student who would have rejected the null is shown below, and it looks just fine (right?). Even though the bulk of the p-values are well over 0.05, and would have led to the accurate inference that you can’t reject the null in this case, there are still dozens of values that DO fall below that 0.05 threshold.

> summary(p.vals)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.005457 0.681300 0.870900 0.786200 0.959300 1.000000 
> p.vals.under.pointohfive <- p.vals[p.vals<0.05]
> length(p.vals.under.pointohfive)
[1] 46
> par(mfrow=c(1,2))
> hist(capture,main="One Rogue Sample",col="purple")
> boxplot(p.vals,main="All P-Values")

[Figure: histogram “One Rogue Sample” and boxplot “All P-Values”]
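For comparison, here is a sketch of the textbook baseline: what happens when the null hypothesis is exactly true and the t-test's assumptions hold. It uses a normal population with a mean of 3.0 and a standard deviation of 0.5, both of which are illustrative assumptions and not the stable distribution used in the simulation. In that idealized setting, roughly 5% of samples should reject the null at the 0.05 level purely because of sampling error.

```r
# Sketch: how often does a one-sided t-test reject a TRUE null at alpha = 0.05?
# The normal population (mean 3.0, sd 0.5) is an illustrative assumption.
set.seed(123)
numtrials <- 10000
sample.size <- 100
true.mean <- 3.0

p.vals <- replicate(numtrials, {
  x <- rnorm(sample.size, mean = true.mean, sd = 0.5)
  t.test(x, mu = true.mean, alternative = "greater")$p.value
})

false.positive.rate <- mean(p.vals < 0.05)
cat("Share of samples rejecting a true null:", false.positive.rate, "\n")
```

Every one of those rejections is a "rogue sample": nothing was wrong with the data collection, and the conclusion is still wrong.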

Even though the p-value shouted “reject the null!” from this rogue sample, a 99% confidence interval shows that the value I’m testing against… that average cumulative GPA of 3.0… is still contained within the confidence interval. So I really shouldn’t have ruled it out:

> mean(capture) + c(-1*qt(0.995,df=(sample.size-1))*(sd(capture)/sqrt(sample.size)),
+ qt(0.995,df=(sample.size-1))*(sd(capture)/sqrt(sample.size)))
[1] 2.989259 3.218201
> t.test(capture,mu=compare.to,alternative="greater")$p.value
[1] 0.009615011
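If you'd rather not build the interval by hand, t.test() returns one in its conf.int element. The sketch below uses simulated stand-in data (the mean and standard deviation are hypothetical values, not the captured sample) and checks that the built-in interval matches the manual formula.

```r
# Sketch: get the p-value and a confidence interval from t.test() directly.
# The data are simulated stand-ins for one student's sample (hypothetical values).
set.seed(7)
my.sample <- rnorm(100, mean = 3.1, sd = 0.45)
compare.to <- 3.0

# One-sided test, as in the simulation
one.sided <- t.test(my.sample, mu = compare.to, alternative = "greater")

# Two-sided 99% interval; conf.int holds the lower and upper bounds
two.sided <- t.test(my.sample, mu = compare.to, conf.level = 0.99)
ci <- two.sided$conf.int

# Same interval by hand, matching the formula used above
n <- length(my.sample)
half.width <- qt(0.995, df = n - 1) * sd(my.sample) / sqrt(n)
manual.ci <- mean(my.sample) + c(-1, 1) * half.width

cat("One-sided p-value:", one.sided$p.value, "\n")
cat("99% CI from t.test():", ci[1], ci[2], "\n")
cat("99% CI by hand:      ", manual.ci[1], manual.ci[2], "\n")
```

One thing to keep in mind when comparing the two: a two-sided 99% interval excludes the test value only when the two-sided p-value is below 0.01, which is a stricter bar than a one-sided test at 0.05. So the test and the interval can disagree on the very same sample.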

If you’re doing real research, how likely are you to replicate your study so that you know if this happened to you? Not very likely at all, especially if collecting more data is costly (in terms of money or effort). Replication would alleviate the issues that can arise due to the inevitable sampling error… it’s just that we don’t typically do it ourselves, and we don’t make it easy for others to do it. Hence the p-value controversy.

What Now?

What can we do to improve the quality of our research so that we either avoid the pitfalls associated with null hypothesis testing completely, or make sure that we’re using p-values more appropriately?

  • Make sure your sample size is big enough. This usually involves deciding what you want the power of the test to be, given a certain effect size that you’re trying to detect. A power of 0.80 means you’ll have an 80% chance of detecting an effect that’s actually there. However, knowing what your effect size is prior to your research can be difficult (if not impossible).
  • Be aware of biases that can be introduced by not having a random enough or representative enough sample.
  • Estimation. In our example above we might ask “How much greater than 3.0 is the average cumulative GPA at our university?” Check out Geoff Cumming’s article entitled “The New Statistics: Why and How” for a roadmap that will help you think more in terms of estimation (using effect sizes, confidence intervals, and meta-analysis).
  • Support open science. Make it easy for others to replicate your study. If you’re a journal reviewer, consider accepting more articles that replicate other studies, even if they aren’t “novel enough”.

I am certain that my argument has holes, but it seems to be a good example for students to better embrace the notion of sampling error (and become scared of it… or at least more mindful). Please feel free to suggest alternatives that could make this a more informative example. Thank you!

[ALSO… this xkcd cartoon pretty much nails the whole process of why p-values can be misleading. Enjoy.]

An Easy Trick to Reduce Your Resistance to Losing Weight

(Image Credit: Doug Buckley of http://hyperactive.to)

Measurement is an important aspect of assuring and improving quality(*). As a result, I think about it often, especially in the context of maintaining and losing weight. My BMI is not bad (23.5) but I don’t like to exercise, so I try to eat without reckless abandon. But I have one little tiny problem.

“Weigh Yourself Often” is a commonly reported success strategy for losing weight. But what if you’re too scared to step on the scale???  That kind of gets in the way of being able to weigh yourself a lot.

I hadn’t stepped on the scale to weigh myself in about… well… a year. I admit, I’m scared of it. In fact, every time I go to the doctor I specifically tell them NOT to tell me what I weigh – unless it’s REALLY GOOD. (Usually they say nothing, which I’ve never been able to interpret. I’m hoping that they just don’t want to speculate what I would consider “good”.) I don’t want to hop on the scale and see a number that makes me feel lousy about myself all day (and maybe the next day… and the next).

I just know that it’s an invitation to disaster to see those HUGE numbers upon which I’ll allow an entire coral reef of self-loathing to grow uninhibited, attracting the slithery fish of dismay.

But a few days ago, I put on a pair of dress pants that I hadn’t worn in a while, and they almost fell off. I had to make sure I didn’t stand up too straight or accidentally suck in my gut while I was wearing them, otherwise they would have fallen off. (I have to wear them again next Friday and I’m going to safety pin them together to be safe.) As you can imagine, this made me feel pretty good, and stirred belligerence in the face of the bathroom scale!! So I climbed on the scale this morning in optimistic defiance and saw a number that was pretty darn good. If I lose 10 lbs, I will weigh the same as I did in junior high. So I think I’m pretty motivated to bump off those extra 10 just to say “I did it”.

I did have a contingency plan, though. I realized that the thing holding me back from actively monitoring and reducing my weight was the NEGATIVE EMOTION associated with getting on the scale.

The key is to MEASURE in a way that doesn’t stimulate those negative emotions. So if you live in the US and want to lose POUNDS, set your scale to KILOGRAMS. Start weighing yourself using a measurement scale that you have no psychological or emotional attachment (or resistance) to. The first number you see will mean nothing to you, and as you actively work to reduce your weight, that number will go down. You will not be scared of the scale any more. After you start feeling good, then feel free to convert your new weight back into the measurement scale you’re more familiar with. The new number you weigh might not be your target weight, but at least you will know it’s a weight at which you feel good.

And isn’t that the point?

(*) “Measurements provide critical data and information about key processes, outputs, and results. When supported by sound analytical approaches that project trends and infer cause-and-effect relationships, measurements provide an objective foundation for learning, leading to better customer, operational, and financial performance.” – Evans & Dean, “Total Quality: 3rd Ed.”
