Category Archives: Applied Statistics

Logistic Growth, S Curves, Bifurcations, and Lyapunov Exponents in R

If you’ve ever wondered how logistic population growth (the Verhulst model), S curves, the logistic map, bifurcation diagrams, sensitive dependence on initial conditions, “orbits”, deterministic chaos, and Lyapunov exponents are related to one another… this post explains it in just 10 steps, each with some code in R so you can explore it all yourself. I’ve included some code written by other people who have explored this problem (cited below) as portions of my own code.

It all starts with a hypothesized population… and a process where the size of the population changes over time. We want to understand how (and under what conditions) those changes occur, so we choose a model that characterizes population changes: the logistic growth model. It’s been used in biology, ecology, econometrics, marketing, and other areas.


1. The logistic growth model describes how the size of a population (N) changes over time (t), based on some maximum population growth rate (r). There is a limiting factor called the carrying capacity (K) which represents the total population that the environment could support, based on the amount of available resources. dN/dt is the rate of change of the population over time.

logistic-growth-dndt


2. You can simplify the logistic growth model by defining a new variable x to represent the portion of the population that’s alive, compared to the total population that the environment could support (and keep alive). So with x = N/K, you get a new differential equation in terms of x. Now we are looking at the rate of change of the population fraction over time. Once x = N/K = 1, the environment can’t support any more members in the population:

logistic-growth-dxdt


3. You can solve this equation by integration! Then, you’ll have an expression that you can use to calculate x (which is still the population fraction) for any time t. This is called the sigmoid or (more commonly), the S Curve. To compute x at any time t, all we need to know is how big the population was when we started looking at it (x0) and the maximum growth rate r:

logistic-growth-xt-solution


4. The equation for the S Curve is deterministic and continuous. If we want to solve it numerically, we have to discretize it by chopping up that continuous axis that contains time into little tiny pieces of time. That’s what produces the difference equation that we recognize as the logistic map. It’s a map because it “maps” each value of the sequence onto the next value in the sequence. As long as you know one of those values for x (indicated by the subscript n), you’ll be able to figure out the next value of x (indicated by the subscript n+1). The value x[n] is the population fraction of the current generation, and the value x[n+1] is the population fraction for the next generation. This makes the logistic map a Markov chain. If you plot x[n] on the x axis and x[n+1] on the y axis, this expression will produce the familiar upside down parabola:

logistic-markov-chain


5. The logistic map behaves differently depending upon the maximum growth rate (r) that describes your population. This parameter is also called fecundity and represents how rabbit-like your population is reproducing. The higher the r, the more productive, like rabbits (although I’m not sure precisely which r you’d choose if you were studying rabbits). Here is an R function that you can use to generate the last M iterations from a sequence of N total, developed and described at Mage’s Blog:

logistic.map <- function(r, x, N, M) {
 ## from http://www.magesblog.com/2012/03/logistic-map-feigenbaum-diagram.html
 ## r: bifurcation parameter
 ## x: initial value, something greater than 0 and less than 1
 ## N: number of iterations (total)
 ## M: number of iteration points to be returned
   z <- 1:N
   z[1] <- x
   for(i in c(1:(N-1))){
     z[i+1] <- r *z[i] * (1 - z[i])
   }
   ## Return the last M iterations 
   z[c((N-M):N)]
}

6. The logistic map has many interesting properties, but here are two in particular (the first in Step 6 and the second in Step 7). First, for several values you can choose for r, the chain converges to a single value (or fixed point) when n gets really big. For other values of r, the value of x will eventually bounce between two values instead of converging (a limit cycle of 2). For other values of r, the value of x will eventually bounce between four values instead of converging. Sometimes, x will bounce around a near limitless collection of values (a condition called deterministic chaos). The eventual values (or collection of eventual values, if they bounce between values) is called an orbit. For example, when the growth rate r is 2.6, the logistic map rapidly converges to an orbit of about 0.615:

plot(logistic.map(2.6,.01,20,20), type="l")

logistic-map-converges


7. Sometimes, it can be nice to take a look at how the values bounce around, and where they eventually converge (or not). To do this, we use cobweb diagrams (which are also sometimes called web diagrams). I used a function that I found at http://bayesianbiologist.com to plot the behavior of the orbits for r=2.6, r=3.2, and r=3.9:

logistic.cobweb <- function(r) {
# code adapted from http://bayesianbiologist.com/tag/cobweb-plot/
 x<-seq(0,1,length=100)
 x_next <- lapply(2:N, function(i) r*x[i]*(1-x[i]))
 plot(x[2:N],x_next,type="l",xlim=c(0,1), ylim=c(0,1), main=paste("r=",r),
 xlab=expression(x[t]),ylab=expression(x[t+1]), col="red", lwd=2)
 abline(0,1)

 # start at your random spot on the x-axis and start with a vertical line:
 start=runif(1,0,1)
 vert=FALSE
 lines(x=c(start,start),y=c(0,r*start*(1-start)) )
 for(i in 1:(2*N)) {
 if(vert) {
   lines(x=c(start,start),y=c(start,r*start*(1-start)) )
   vert=FALSE
 } else {
   lines(x=c(start, r*start*(1-start)), y=c(r*start*(1-start), r*start*(1-start)) )
   vert=TRUE
   start=r*start*(1-start)
 }
 }
}

par(mfrow=c(1,3))
logistic.cobweb(2.6)
logistic.cobweb(3.3)
logistic.cobweb(3.9)

logistic-cobwebs


8. (Remember to dev.off() before you continue.) Second, for some values of r, the logistic map shows sensitive dependence on initial conditions. For example, let’s see what happens for two different growth rates (r=3 and r=3.9) when we start one iteration with an x[n]  of 0.5 COLORED BLACK, and another one with an x[n] of 0.5001 COLORED RED. It’s a small, small difference that can lead to big, BIG variations in the orbits. In the r=3 case, the chain produced by the logistic map with x[n] of 0.5 (in black) is IDENTICAL to the chain produced by the logistic map with x[n] of 0.5001 (in red). That’s why you can’t see the black… the values are the same! But for the r=3.9 case, the chain produced by the logistic map with x[n] of 0.5 (in black) RAPIDLY DIVERGES from the chain produced by the logistic map with x[n] of 0.5001 (in red). They are very different, despite a very tiny difference in initial conditions! The logistic map for r=3.9 shows a very sensitive dependence on initial conditions

par(mfrow=c(2,1))
first <- logistic.map(3,.5,120,100)
second <- logistic.map(3,.5001,120,100)
plot(1:length(first),first,type="l",main="r=3 is not sensitive to initial conditions")
lines(1:length(second),second,type="l",col="red")
first <- logistic.map(3.9,.5,120,100)
second <- logistic.map(3.9,.5001,120,100)
plot(1:length(first),first,type="l",main="but r=3.9 is EXTREMELY sensitive")
lines(1:length(second),second,type="l",col="red")

logistic-sensitivity


9. For any chain, we can determine just how sensitive the logistic map is to initial conditions by looking at the Lyapunov exponent. Very simplistically, if the Lyapunov exponent is negative, the chain will converge to one or more fixed points for that value of r. If the Lyapunov exponent is positive, the chain will demonstrate deterministic chaos for that value of r. If the Lyapunov exponent is zero, there is a bifurcation: a 1-cycle is doubling to a 2-cycle, a 2-cycle is doubling to a 4-cycle, or so forth. The top chart shows an approximation of the Lyapunov exponent based on the first 500 iterations (ideally, you’d use an infinite number, but that would eat up too much computing time), and the bottom chart shows a bifurcation diagramYou’ll notice that the Lyapunov exponents are zero where a bifurcation occurs. To interpret the bifurcation diagram, just remember that each vertical slice through it represents the results of ONE COMPLETELY CONVERGED CHAIN from the logistic map. So it shows the results from many, many, many completely converged chains – and provides an excellent way for us to look at the behavior of MANY different types of populations in just one chart:

n <- 400
XI <- lya <- 0
x <- seq(0,4,0.01)
for (i in 1:n) {
 xi <- logistic.map(x[i],.01,500,500)
 XI <- rbind(XI,xi)
}
for (i in 1:length(x)) { 
 lya[i] <- sum(log(abs(x[i]-(2*x[i]*XI[i,]))))/length(x) 
}
plot(x,lya,ylim=c(-4,1),xlim=c(0,4),type="l",main="Lyapunov Exponents for Logistic Map")
abline(h=0, lwd=2, col="red")
# next 3 lines from http://www.magesblog.com/2012/03/logistic-map-feigenbaum-diagram.html:
my.r <- seq(0, 4, by=0.003)
Orbit <- sapply(my.r, logistic.map, x=0.1, N=1000, M=300)
r <- sort(rep(my.r, 301))

par(mfrow=c(2,1))
plot(x,lya,ylim=c(-5,1),xlim=c(0,4),type="l",main="Lyapunov Exponents for Logistic Map")
abline(h=0, col="red", lwd=2)
abline(v=3, col="blue", lwd=2)
plot(r, Orbit, pch=".", cex=0.5, main="Bifurcation Diagram for r=0 to r=4 Logistic Maps")
abline(v=3, col="blue", lwd=2)

logistic-lyapunov-bifurcation

10. Notice that in the bifurcation diagram, we can easily see that when r is between 0 and 1, the population converges to extinction. This makes sense, because the growth rate is smaller than what’s required to sustain the size of the population. You might like to zoom in, though, and see what the orbits look like for some smaller portions of the diagram. Here’s how you can do it (but be sure to refresh your graphics window with dev.off() before you try it). Try changing the plot character (pch) too, or maybe the size of the characters with cex=0.2 or cex=0.5 in the last line:

# adapted from http://www.magesblog.com/2012/03/logistic-map-feigenbaum-diagram.html:
my.r <- seq(3.5, 4, by=0.003)
Orbit <- sapply(my.r, logistic.map, x=0.1, N=1000, M=300)
multiplier <- length(Orbit)/length(my.r)
r <- sort(rep(my.r, multiplier))
plot(r, Orbit, pch=".")

logistic-lyapunov-bifurcation-2

That’s it!


Find out more information on these other web pages, which are listed in order of difficulty level:

A 15-Week Intro Statistics Course Featuring R

Morgan at Burning Man 2014. (Image Credit: Nicole Radziwill)

Morgan at Burning Man 2014. (Image Credit: Nicole Radziwill)

Do you teach introductory statistics or data science? Need some help planning your fall class?

I apply the 10 Principles of Burning Man in the design and conduct of all my undergraduate and graduate-level courses, including my introductory statistics class (which has a heavy focus on R and data science) at JMU. This means that I consider learning to be emergent, and as a result, it often doesn’t follow a prescribed path of achieving specified learning objectives. However, in certain courses, I still feel like it’s important to provide a general structure to help guide the way! This also helps the students get a sense of our general trajectory over the course of the semester, and do readings in advance if they’re ready.

Since several people have asked for a copy, here is the SYLLABUS that I use for my 15-week class (that also uses the “informal” TEXTBOOK I wrote this past spring). We meet twice a week for an hour and 15 minutes each session. The class is designed for undergraduate sophomores, but there are always students from all levels enrolled. The course is intended to provide an introduction to (frequentist) statistical thinking, but with an applied focus that has practical data analysis at its core.

My goal is simple. At the end of the semester, I want students to be able to:

Please let me know if this syllabus is helpful to you! I’ll be posting my intensive (5-session) version of this tomorrow or the next day.

Feel free to join our class Facebook group at https://www.facebook.com/groups/262216220608559/ if you want to play along at home.

A Simple Intro to Bayesian Change Point Analysis

The purpose of this post is to demonstrate change point analysis by stepping through an example of the technique in R presented in Rizzo’s excellent, comprehensive, and very mathy book, Statistical Computing with R, and then showing alternative ways to process this data using the changepoint and bcp packages. Much of the commentary is simplified, and that’s on purpose: I want to make this introduction accessible if you’re just learning the method. (Most of the code is straight from Rizzo who provides a much more in-depth treatment of the technique. I’ve added comments in the code to make it easier for me to follow, and that’s about it.)

The idea itself is simple: you have a sample of observations from a Poisson (counting) process (where events occur randomly over a period of time). You probably have a chart that shows time on the horizontal axis, and how many events occurred on the vertical axis. You suspect that the rate at which events occur has changed somewhere over that range of time… either the event is increasing in frequency, or it’s slowing down — but you want to know with a little more certainty. (Alternatively, you could check to see if the variance has changed, which would be useful for process improvement work in Six Sigma projects.)

You want to estimate the rate at which events occur BEFORE the shift (mu), the rate at which events occur AFTER the shift (lambda), and the time when the shift happens (k). To do it, you can apply a Markov Chain Monte Carlo (MCMC) sampling approach to estimate the population parameters at each possible k, from the beginning of your data set to the end of it. The values you get at each time step will be dependent only on the values you computed at the previous timestep (that’s where the Markov Chain part of this problem comes in). There are lots of different ways to hop around the parameter space, and each hopping strategy has a fancy name (e.g. Metropolis-Hastings, Gibbs, “reversible jump”).

In one example, Rizzo (p. 271-277) uses a Markov Chain Monte Carlo (MCMC) method that applies a Gibbs sampler to do the hopping – with the goal of figuring out the change point in number of coal mine disasters from 1851 to 1962. (Looking at a plot of the frequency over time, it appears that the rate of coal mining disasters decreased… but did it really? And if so, when? That’s the point of her example.) She gets the coal mining data from the boot package. Here’s how to get it, and what it looks like:

library(boot)
data(coal)
y <- tabulate(floor(coal[[1]]))
y <- y[1851:length(y)]
barplot(y,xlab="years", ylab="frequency of disasters")

coalmine-freq

First, we initialize all of the data structures we’ll need to use:

# initialization
n <- length(y) # number of data elements to process
m <- 1000 # target length of the chain
L <- numeric(n) # likelihood fxn has one slot per year
k[1] <- sample(1:n,1) # pick 1 random year to start at
mu[1] <- 1
lambda[1] <- 1
b1 <- 1
b2 <- 1
# now set up blank 1000 element arrays for mu, lambda, and k
mu <- lambda <- k <- numeric(m)

Here are the models for prior (hypothesized) distributions that she uses, based on the Gibbs sampler approach:

  • mu comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies UP TO the point in time, k, you’re currently at) and a rate of (k + b1)
  • lambda comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies AFTER the point in time, k, you’re currently at) and a rate of (n – k + b1) where n is the number of the year you’re currently processing
  • b1 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (mu + 1)
  • b2 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (lambda + 1)
  • a likelihood function L is also provided, and is a function of k, mu, lambda, and the sum of all the frequencies up until that point in time, k

At each iteration, you pick a value of k to represent a point in time where a change might have occurred. You slice your data into two chunks: the chunk that happened BEFORE this point in time, and the chunk that happened AFTER this point in time. Using your data, you apply a Poisson Process with a (Hypothesized) Gamma Distributed Rate as your model. This is a pretty common model for this particular type of problem. It’s like randomly cutting a deck of cards and taking the average of the values in each of the two cuts… then doing the same thing again… a thousand times. Here is Rizzo’s (commented) code:

# start at 2, so you can use initialization values as seeds
# and go through this process once for each of your m iterations
for (i in 2:m) {
 kt <- k[i-1] # start w/random year from initialization
 # set your shape parameter to pick mu from, based on the characteristics
 # of the early ("before") chunk of your data
 r <- .5 + sum(y[1:kt]) 
 # now use it to pick mu
 mu[i] <- rgamma(1,shape=r,rate=kt+b1) 
 # if you're at the end of the time periods, set your shape parameter
 # to 0.5 + the sum of all the frequencies, otherwise, just set the shape
 # parameter that you will use to pick lambda based on the later ("after")
 # chunk of your data
 if (kt+1 > n) r <- 0.5 + sum(y) else r <- 0.5 + sum(y[(kt+1):n])
 lambda[i] <- rgamma(1,shape=r,rate=n-kt+b2)
 # now use the mu and lambda values that you got to set b1 and b2 for next iteration
 b1 <- rgamma(1,shape=.5,rate=mu[i]+1)
 b2 <- rgamma(1,shape=.5,rate=lambda[i]+1)
 # for each year, find value of LIKELIHOOD function which you will 
 # then use to determine what year to hop to next
 for (j in 1:n) {
 L[j] <- exp((lambda[i]-mu[i])*j) * (mu[i]/lambda[i])^sum(y[1:j])
 }
 L <- L/sum(L)
 # determine which year to hop to next
 k[i] <- sample(1:n,prob=L,size=1)
}

Knowing the distributions of mu, lambda, and k from hopping around our data will help us estimate values for the true population parameters. At the end of the simulation, we have an array of 1000 values of k, an array of 1000 values of mu, and an array of 1000 values of lambda — we use these to estimate the real values of the population parameters. Typically, algorithms that do this automatically throw out a whole bunch of them in the beginning (the “burn-in” period) — Rizzo tosses out 200 observations — even though some statisticians (e.g. Geyer) say that the burn-in period is unnecessary:

> b <- 201 # treats time until the 200th iteration as "burn-in"
> mean(k[b:m])
[1] 39.765
> mean(lambda[b:m])
[1] 0.9326437
> mean(mu[b:m])
[1] 3.146413

The change point happened between the 39th and 40th observations, the arrival rate before the change point was 3.14 arrivals per unit time, and the rate after the change point was 0.93 arrivals per unit time. (Cool!)
After I went through this example, I discovered the changepoint package, which let me run through a similar process in just a few lines of code. Fortunately, the results were very similar! I chose the “AMOC” method which stands for “at most one change”. Other methods are available which can help identify more than one change point (PELT, BinSeg, and SegNeigh – although I got an error message every time I attempted that last method).

> results <- cpt.mean(y,method="AMOC")
> cpts(results)
cpt 
 36 
> param.est(results)
$mean
[1] 3.2500000 0.9736842
> plot(results,cpt.col="blue",xlab="Index",cpt.width=4)

coalmine-changepoint

I decided to explore a little further and found even MORE change point analysis packages! So I tried this example using bcp (which I presume stands for “Bayesian Change Point”) and voila… the output looks very similar to each of the previous two methods!!!):

coalmine-bcp

It’s at this point that the HARD part of the data science project would begin… WHY? Why does it look like the rate of coal mining accidents decreased suddenly? Was there a change in policy or regulatory requirements in Australia, where this data was collected? Was there some sort of mass exodus away from working in the mines, and so there’s a covariate in the number of opportunities for a mining disaster to occur? Don’t know… the original paper from 1979 doesn’t reveal the true story behind the data.

There are also additional resources on R Bloggers that discuss change point analysis:

(Note: If I’ve missed anything, or haven’t explained anything right, please provide corrections and further insights in the comments! Thank you.

My New Favorite Statistics & Data Analysis Book Using R

very-quick-cover-outline

NOTE: The 2nd Edition (Red Swan) was released in 2017. There is a companion book that presents end-to-end examples of each of the methods.


As of today, I now have a NEW FAVORITE introductory statistics textbook… the one I’ve always dreamed of having. I’ve been looking for a book to use in my classes for undergraduate sophomores and juniors, but none of the textbooks I considered over the past three years (and I’ve looked at over a hundred!) had all of the things I really, really wanted. So I had to go make it happen myself. These things are:

download the preview here (first ~100 pages)

1) An integrated treatment of theory and practice. All of my stats textbooks have a lot of formulas, and no information about how to do what the formulas do in the R statistical software. All of my R textbooks have a lot of information about how to run the commands, but not really much information about what formulas are being used. I wanted a book that would show how to solve problems analytically (using the equations), and then show how they’re done in R. If there were discrepancies between the stats textbook answers and the R answers, I wanted to know why. A lot of times, the developers of R packages use very sophisticated adjustments and corrections, which I only became aware of because my analytical solutions didn’t match the R output. At first, I thought I was wrong. But later, I realized I was right, and R was right: we were just doing different things. I wanted my students to know what was going on under the hood, and have an awareness of exactly which methods R was using at every moment.

2) An easy way to develop research questions for observational studies and organize the presentation of results. We always do small research projects in my classes, and in my opinion, this is the best way for students to get a strong grasp of the fundamental statistical concepts. But they always have the same questions: Which statistical test should I use? How should I phrase my research question? What should I include in my report? I wanted a book that made developing statistical research questions easy. In fact, I know a lot of people I went to PhD school with that would have loved to have this book while they were proposing, conducting, and defending their dissertations.

3) A confidence interval cookbook. This is probably one of the most important things I want my students to leave my class remembering: that from whatever sample you collect, you can construct a confidence interval that will give you an idea of what the true population parameter should be. You don’t even need to do a hypothesis test! but it can be difficult to remember which formula to use… so I wanted an easy reference where I’d be able to look things up, and find out really easily how to use R to construct those confidence intervals for me. Furthermore, some of the confidence intervals that everyone is taught in an introductory statistics course are wildly inaccurate – and statisticians know this. But they hesitate to scare away novice data analysts with long, scary looking equations, and so students keep learning those inaccurate methods and believing they’re good. Since so many people never get beyond introductory statistics and still turn into researchers in other fields, I thought this was horrible. I want to make sure my students know the best way to do each confidence interval in their first class… even if the equations are not as friendly.

4) An inference test cookbook. I wanted a book that stepped me through each of the primary parametric inference tests analytically (using the equations), and then showed me how it was done in R. If there were discrepancies, I wanted to know why. I wanted an easy way to remember the assumptions for each test, and when to use a pooled standard deviation versus an unpooled one. There’s a lot to keep track of! I wanted a reference that it would make it easy to keep track of all of it: assumptions, tests for assumptions, equations, R code, and diagnostic plots.

5) No step left behind. It’s really frustrating to me how so many R books assume you can do a psychic fill-in-the-blank for missing code. Since I’ve been using R for several years now, I’ve gotten to the point where my psychic abilities are pretty good, and at least 60% of the time I can figure out the missing pieces. But wow, what a waste of time! So I wanted a book that had all of the steps for each example. Even if it was a little repetitive. I may have missed this in a few places, but I think beginners will have a much easier time with this book. Also, I put all my data and functions on GitHub for people to run the examples with. I’m growing this slowly, but I don’t want people to be left in the lurch.

6) An easy way to produce any of the charts and graphs in the book. One of my pet peeves about R books is that the authors generate beautiful charts and graphs, and then you’re reading through the book and say “Yes!! Yes!! That’s the chart I need for my report… I want to do that… how did they do that?” and they don’t tell you anywhere how they did it. I did not want there to be any secrets in this book. If I generated a page of interesting looking simulated distributions, I wanted you to know how I did it (just in case you want to do it later).

GRANTED… I am sure it will not be perfect – no book is. (For example, Google Forms changes a lot and there are a couple examples that use it that will probably be outdated when the book gets to press… and I just found out this morning that you don’t need the source_https trick in R 3.2.0 and beyond.) [Note: data access has been fully updated in the 2nd Edition.] However, I will keep updating my blog with posts about useful things as they evolve.

In any case, I hope you enjoy my book as much as I’ve been enjoying using it as a reference for myself… it really is all my most important notes, neatly organized into just over 500 pages of everything I want to remember. And everything I want to make sure my students take with them after they leave my class.

[Note: Any errors and omissions from earlier printings (which have been taken care of in later printings) are being recorded at http://qualityandinnovation.com/errata/.]

Sampling Distributions and Central Limit Theorem in R

The Central Limit Theorem (CLT), and the concept of the sampling distribution, are critical for understanding why statistical inference works. There are at least a handful of problems that require you to invoke the Central Limit Theorem on every ASQ Certified Six Sigma Black Belt (CSSBB) exam. The CLT says that if you take many repeated samples from a population, and calculate the averages or sum of each one, the collection of those averages will be normally distributed… and it doesn’t matter what the shape of the source distribution is! (Caveat: so long as the data comes from a distribution with finite variance… so that means the Cauchy distribution doesn’t count.)

I wrote some R code to help illustrate this principle for my students. This code allows you to choose a sample size (n), a source distribution, and parameters for that source distribution, and generate a plot of the sampling distributions of the mean, sum, and variance. (Note: the sampling distribution for the variance is a Chi-square distribution — if your source distribution is normal!)

sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
   r <- 10000  # Number of replications/samples - DO NOT ADJUST
   # This produces a matrix of observations with  
   # n columns and r rows. Each row is one sample:
   my.samples <- switch(src.dist,
	"E" = matrix(rexp(n*r,param1),r),
	"N" = matrix(rnorm(n*r,param1,param2),r),
	"U" = matrix(runif(n*r,param1,param2),r),
	"P" = matrix(rpois(n*r,param1),r),
        "B" = matrix(rbinom(n*r,param1,param2),r),
	"G" = matrix(rgamma(n*r,param1,param2),r),
	"X" = matrix(rchisq(n*r,param1),r),
	"T" = matrix(rt(n*r,param1),r))
   all.sample.sums <- apply(my.samples,1,sum)
   all.sample.means <- apply(my.samples,1,mean)   
   all.sample.vars <- apply(my.samples,1,var) 
   par(mfrow=c(2,2))
   hist(my.samples[1,],col="gray",main="Distribution of One Sample")
   hist(all.sample.sums,col="gray",main="Sampling Distribution\nof
	the Sum")
   hist(all.sample.means,col="gray",main="Sampling Distribution\nof the Mean")
   hist(all.sample.vars,col="gray",main="Sampling Distribution\nof
	the Variance")
}

There are 8 population distributions to choose from: exponential (E), normal (N), uniform (U), Poisson (P), binomial (B), gamma (G), Chi-Square (X), and the Student’s t distribution (T). Note also that you have to provide either one or two parameters, depending upon what distribution you are selecting. For example, a normal distribution requires that you specify the mean and standard deviation to describe where it’s centered, and how fat or thin it is (that’s two parameters). A Chi-square distribution requires that you specify the degrees of freedom (that’s only one parameter). You can find out exactly what distributions require what parameters by going here: http://en.wikibooks.org/wiki/R_Programming/Probability_Distributions.

Here is an example that draws from an exponential distribution with a mean of 1/1 (you specify the number you want in the denominator of the mean):

sdm.sim(50,src.dist="E",param1=1)

The code above produces this sequence of plots:

sd-blog-1

You aren’t allowed to change the number of replications in this simulation because of the nature of the sampling distribution: it’s a theoretical model that describes the distribution of statistics from an infinite number of samples. As a result, if you increase the number of replications, you’ll see the mean of the sampling distribution bounce around until it converges on the mean of the population. This is just an artifact of the simulation process: it’s not a characteristic of the sampling distribution, because to be a sampling distribution, you’ve got to have an infinite number of samples. Watkins et al. have a great description of this effect that all statistics instructors should be aware of. I chose 10,000 for the number of replications because 1) it’s close enough to infinity to ensure that the mean of the sampling distribution is the same as the mean of the population, but 2) it’s far enough away from infinity to not crash your computer, even if you only have 4GB or 8GB of memory.

Here are some more examples to try. You can see that as you increase your sample size (n), the shapes of the sampling distributions become more and more normal, and the variance decreases, constraining your estimates of the population parameters more and more.

sdm.sim(10,src.dist="E",1)
sdm.sim(50,src.dist="E",1)
sdm.sim(100,src.dist="E",1)
sdm.sim(10,src.dist="X",14)
sdm.sim(50,src.dist="X",14)
sdm.sim(100,src.dist="X",14)
sdm.sim(10,src.dist="N",param1=20,param2=3)
sdm.sim(50,src.dist="N",param1=20,param2=3)
sdm.sim(100,src.dist="N",param1=20,param2=3)
sdm.sim(10,src.dist="G",param1=5,param2=5)
sdm.sim(50,src.dist="G",param1=5,param2=5)
sdm.sim(100,src.dist="G",param1=5,param2=5)

One-proportion z test in R

For quite a while, I’ve been confused by the behavior of the prop.test function in R: 1) the p-values always seem to be slightly off, and 2) the confidence intervals never seem to be symmetric around my estimate of the population proportion. Today, I decided I was going to figure out what was up… and whether I should trust my own analytical calculations for this test or what I get in R. I started by checking R Bloggers to see if anyone had explored this problem, but I only found one post about the two proportion z-test, and found none focusing on the one proportion z-test.

Here’s the essence of the test: you observe x subjects out of a sample size of n observations. Perhaps you just want to create a confidence interval to get a sense of where the true population proportion lies. Or maybe, you’d like to test the null hypothesis that the true population proportion is some value p0, against the alternative that it is really greater than, less than, or not equal to that value.

To do this, you can use the normal approximation to the binomial distribution to calculate a test statistic z. The estimated population proportion, p-hat, is just x/n. You have to find the standard error (SE) of your estimate as well to get the value of z:

one-prop-z-ts-formula

The confidence interval is constructed from the estimated population proportion, and the margin of error, which is determined by the standard error of the estimate and a scaling factor (z*) that establishes the width of the confidence interval:

ci-general-form

ci-one-proportion-i

ci-ine-proportion-iiHere’s an example. Let’s say I know that nationally, 90% of home buyers would select a home with energy efficient features, even paying a premium of 2-3%, to reduce their utility costs long term. As the hypothetical president of an engineering firm that installs these energy efficient features, I want to know if less than 90% of buyers in my local area feel this way (because if so, I might want to launch an education campaign to help my business). I randomly sample 360 potential buyers and find that 312 of them would indeed select a home with energy efficient features.

I want to test my sample against the null hypothesis that the true population proportion is 0.90, with the alternative that the proportion is less than 0.90. I use a estimated population proportion (p-hat) of 312/360=0.867, and using the equations above, find that my test statistic z turns out to be -2.108, with a corresponding p-value of 0.0175. I reject the null hypothesis that the true population proportion is 0.90 in favor of the alternative, and start making plans to launch my education program.  

(I don’t get the same values when I use prop.test. But later in this post, I’ll show you why I shouldn’t have expected that in the first place!!)

I wrote an R function to automate my z test (and also, to double check my calculations):

z.test <- function(x,n,p=NULL,conf.level=0.95,alternative="less") {
   ts.z <- NULL
   cint <- NULL
   p.val <- NULL
      phat <- x/n
      qhat <- 1 - phat
      # If you have p0 from the population or H0, use it.
      # Otherwise, use phat and qhat to find SE.phat:
      if(length(p) > 0) { 
         q <- 1-p
         SE.phat <- sqrt((p*q)/n) 
         ts.z <- (phat - p)/SE.phat
         p.val <- pnorm(ts.z)
         if(alternative=="two.sided") {
            p.val <- p.val * 2
         }
         if(alternative=="greater") {
            p.val <- 1 - p.val
         }
      } else {
      # If all you have is your sample, use phat to find
      # SE.phat, and don't run the hypothesis test:
         SE.phat <- sqrt((phat*qhat)/n)
      }
      cint <- phat + c( 
         -1*((qnorm(((1 - conf.level)/2) + conf.level))*SE.phat),
         ((qnorm(((1 - conf.level)/2) + conf.level))*SE.phat) )
   return(list(estimate=phat,ts.z=ts.z,p.val=p.val,cint=cint))
}

When I run it on my example above, I get what I expect:

> z.test(312,360,p=0.9)
$estimate
[1] 0.867

$ts.z
[1] -2.11

$p.val
[1] 0.0175

$cint
[1] 0.836 0.898

But when I run prop.test, this isn’t what I see at all!

> prop.test(312,360,p=0.9)

 1-sample proportions test with continuity correction

data: 312 out of 360, null probability 0.9
X-squared = 4.08, df = 1, p-value = 0.04335
alternative hypothesis: true p is not equal to 0.9
95 percent confidence interval:
 0.826 0.899
sample estimates:
 p 
0.867 

Why are there differences? Here’s what I found out:

  • The prop.test function doesn’t even do a z test. It does a Chi-square test, based on there being one categorical variable with two states (success and failure)! Which I should have noticed immediately from the line that starts with “X-squared” 🙂
  • The Yates continuity correction, which adjusts for the differences that come from using a normal approximation to the binomial distribution, is also automatically applied. This removes 0.5/n from the lower limit of the confidence interval, and adds 0.5/n to the upper limit.
  • Most notably, I was really disturbed to see that the confidence interval given by prop.test wasn’t even centered around the sample estimate, p-hat. Oh no! But again, this is no concern, since prop.test uses the Wilson score interval to construct the confidence interval. This results in an asymmetric, but presumably more accurate (with respect to the real population) confidence interval.

Moral of the story? There are many ways to test an observed proportion against a standard, target, or recommended value. If you’re doing research, and especially if you’re using R, be sure to know what’s going on under the hood before you report your results.

(Other moral? Try binom.test.)

Why the Ban on P-Values? And What Now?

Just recently, the editors of the academic journal Basic and Applied Social Psychology have decided to ban p-values: that’s right, the nexus for inferential decision making… gone! This has created quite a fuss among anyone who relies on significance testing and p-values to do research (especially those, presumably, in social psychology who were hoping to submit a paper to that journal any time soon). The Royal Statistical Society even shared six interesting letters from academics to see how they felt about the decision.

These letters all tell a similar story: yes, p-values can be mis-used and mis-interpreted, and we need to be more careful about how we plan for — and interpret the results of — just one study! But not everyone advocated throwing out the method in its entirety. I think a lot of the issues could be avoided if people had a better gut sense of what sampling error is… and how easy it is to encounter (and as a result, how easy it can be to accidentally draw a wrong conclusion in an inference study just based on sampling error). I wanted to do a simulation study to illustrate this for my students.

The Problem

You’re a student at a university in a class full of other students. I tell you to go out and randomly sample 100 students, asking them what their cumulative GPA is. You come back to me with 100 different values, and some mean value that represents the average of all the GPAs you went out and collected. You can also find the standard deviation of all the values in your sample of 100. Everyone has their own unique sample.

“It is misleading to emphasize the statistically significant findings of any single team. What matters is the totality of the evidence.” – John P. A. Ioannidis in Why Most Published Research Findings are False

It’s pretty intuitive that everyone will come back with a different sample… and thus everyone will have a different point estimate of the average GPA that students have at your university. But, according to the central limit theorem, we also know that if we take the collection of all the average GPAs and plot a histogram, it will be normally distributed with a peak around the real average GPA. Some students’ estimates will be really close to the real average GPA. Some students’ estimates will be much lower (for example, if you collected the data at a meeting for students who are on academic probation). Some students’ estimates will be much higher (for example, if you collected the data at a meeting for honors students). This is sampling error, which can lead to incorrect inferences during significance testing.

Inferential statistics is good because it lets us make decisions about a whole population just based on one sample. It would require a lot of time, or a lot of effort, to go out and collect a whole bunch of samples. Inferential statistics is bad if your sample size is too small (and thus you haven’t captured the variability in the population within your sample) or have one of these unfortunate too-high or too-low samples, because you can make incorrect inferences. Like this.

The Input Distribution

Let’s test this using simulation in R. Since we want to randomly sample the cumulative GPAs of students, let’s choose a distribution that reasonably reflects the distribution of all GPAs at a university. To do this, I searched the web to see if I could find data that might help me get this distribution. I found some data from the University of Colorado Boulder that describes GPAs and their corresponding percentile ranks. From this data, I could put together an empirical CDF, and then since the CDF is the integral of the PDF, I approximated the PDF by taking the derivatives of the CDF. (I know this isn’t the most efficient way to do it, but I wanted to see both plots):

score <- c(.06,2.17,2.46,2.67,2.86,3.01,3.17,3.34,3.43,3.45,
3.46,3.48,3.5,3.52,3.54,3.56,3.58,3.6,3.62,3.65,3.67,3.69,
3.71,3.74,3.77,3.79,3.82,3.85,3.88,3.91,3.94,3.96,4.0,4.0)
perc.ranks <- c(0,10,20,30,40,50,60,70,75,76,77,78,79,80,81,
82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100)
fn <- ecdf(perc.ranks)
xs <- score
ys <- fn(perc.ranks)
slope <- rep(NA,length(xs))
for (i in 2:length(xs)) {
 slope[i] <- (ys[i]-ys[i-1])/(xs[i]-xs[i-1])
}
slope[1] <- 0
slope[length(xs)] <- slope[length(xs)-1]

Then, I plotted them both together:

par(mfrow=c(1,2))
plot(xs,slope,type="l",main="Estimated PDF")
plot(xs,ys,type="l",main="Estimated CDF")
dev.off()

pdf-cdf-empirical

I looked around for a distribution that might approximate what I saw. (Yes, I am eyeballing.) I found the Stable Distribution, and then played around with the parameters until I plotted something that looked like the empirical PDF from the Boulder data:

x <- seq(0,4,length=100)
hx <- dstable(x, alpha=0.5, beta=0.75, gamma=1, delta=3.2)
plot(x,hx,type="l",lty=2,lwd=2)

stable-dist

The Simulation

First, I used pwr.t.test to do a power analysis to see what sample size I needed to obtain a power of 0.8, assuming a small but not tiny effect size, at a level of significance of 0.05. It told me I needed at least 89. So I’ll tell my students to each collect a sample of 100 other students.

Now that I have a distribution to sample from, I can pretend like I’m sending 10,000 students out to collect a sample of 100 students’ cumulative GPAs. I want each of my 10,000 students to run a one-sample t-test to evaluate the null hypothesis that the real cumulative GPA is 3.0 against the alternative hypothesis that the actual cumulative GPA is greater than 3.0. (Fortunately, R makes it easy for me to pretend I have all these students.)

sample.size <- 100
numtrials <- 10000
p.vals <- rep(NA,numtrials)
gpa.means <- rep(NA,numtrials)
compare.to <- 3.00
for (j in 1:numtrials) {
     r <- rstable(n=1000,alpha=0.5,beta=0.75,gamma=1,delta=3.2)
     meets.conds <- r[r>0 & r<4.001]
     my.sample <- round(meets.conds[1:sample.size],3)
     gpa.means[j] <- round(mean(my.sample),3)
     p.vals[j] <- t.test(my.sample,mu=compare.to,alternative="greater")$p.value
     if (p.vals[j] < 0.02) {
          # capture the last one of these data sets to look at later
     capture <- my.sample
     }
}

For all of my 10,000 students’ significance tests, look at the spread of p-values! They are all over the place! And there are 46 students whose p-values were less than 0.05… and they rejected the null. One of the distributions of observed GPAs for a student who would have rejected the null is shown below, and it looks just fine (right?) Even though the bulk of the P-Values are well over 0.05, and would have led to the accurate inference that you can’t reject the null in this case, there are still plenty of values that DO fall below that 0.05 threshold. 

> summary(p.vals)
 Min. 1st Qu. Median Mean 3rd Qu. Max. 
0.005457 0.681300 0.870900 0.786200 0.959300 1.000000 
> p.vals.under.pointohfive <- p.vals[p.vals<0.05]
> length(p.vals.under.pointohfive)
[1] 46
> par(mfrow=c(1,2))
> hist(capture,main="One Rogue Sample",col="purple")
> boxplot(p.vals,main="All P-Values")

rogue-sample

Even though the p-value shouted “reject the null!” from this rogue sample, a 99% confidence interval shows that the value I’m testing against… that average cumulative GPA of 3.0… is still contained within the confidence interval. So I really shouldn’t have ruled it out:

> mean(capture) + c(-1*qt(0.995,df=(sample.size-1))*(sd(capture)/sqrt(sample.size)),
+ qt(0.995,df=(sample.size-1))*(sd(capture)/sqrt(sample.size)))
[1] 2.989259 3.218201
> t.test(capture,mu=compare.to,alternative="greater")$p.value
[1] 0.009615011

If you’re doing real research, how likely are you to replicate your study so that you know if this happened to you? Not very likely at all, especially if collecting more data is costly (in terms of money or effort). Replication would alleviate the issues that can arise due to the inevitable sampling error… it’s just that we don’t typically do it ourselves, and we don’t make it easy for others to do it. Hence the p-value controversy.

What Now?

What can we do to improve the quality of our research so that we avoid the pitfalls associated with null hypothesis testing completely, or, to make sure that we’re using p-values more appropriately?

  • Make sure your sample size is big enough. This usually involves deciding what you want the power of the test to be, given a certain effect size that you’re trying to detect. A power of 0.80 means you’ll have an 80% chance of detecting an effect that’s actually there. However, knowing what your effect size is prior to your research can be difficult (if not impossible).
  • Be aware of biases that can be introduced by not having a random enough or representative enough sample.
  • Estimation. In our example above we might ask “How much greater than 3.0 is the average cumulative GPA at our university?” Check out Geoff Cummings’ article entitled “The New Statistics: Why and How” for a roadmap that will help you think more in terms of estimation (using effect sizes, confidence intervals, and meta-analysis).
  • Support open science. Make it easy for others to replicate your study. If you’re a journal reviewer, consider accepting more articles that replicate other studies, even if they aren’t “novel enough”.

I am certain that my argument has holes, but it seems to be a good example for students to better embrace the notion of sampling error (and become scared of it… or at least more mindful). Please feel free to suggest alternatives that could make this a more informative example. Thank you!

[ALSO… this xkcd cartoon pretty much nails the whole process of why p-values can be misleading. Enjoy.]

« Older Entries Recent Entries »