Applied Statistics

A Simple Intro to Bayesian Change Point Analysis

The purpose of this post is to demonstrate change point analysis by stepping through an example of the technique in R presented in Rizzo’s excellent, comprehensive, and very mathy book, Statistical Computing with R, and then showing alternative ways to process this data using the changepoint and bcp packages. Much of the commentary is simplified, and that’s on purpose: I want to make this introduction accessible if you’re just learning the method. (Most of the code is straight from Rizzo who provides a much more in-depth treatment of the technique. I’ve added comments in the code to make it easier for me to follow, and that’s about it.)

The idea itself is simple: you have a sample of observations from a Poisson (counting) process (where events occur randomly over a period of time). You probably have a chart that shows time on the horizontal axis, and how many events occurred on the vertical axis. You suspect that the rate at which events occur has changed somewhere over that range of time… either the event is increasing in frequency, or it’s slowing down — but you want to know with a little more certainty. (Alternatively, you could check to see if the variance has changed, which would be useful for process improvement work in Six Sigma projects.)

You want to estimate the rate at which events occur BEFORE the shift (mu), the rate at which events occur AFTER the shift (lambda), and the time when the shift happens (k). To do it, you can apply a Markov Chain Monte Carlo (MCMC) sampling approach to estimate the population parameters at each possible k, from the beginning of your data set to the end of it. The values you get at each time step will be dependent only on the values you computed at the previous timestep (that’s where the Markov Chain part of this problem comes in). There are lots of different ways to hop around the parameter space, and each hopping strategy has a fancy name (e.g. Metropolis-Hastings, Gibbs, “reversible jump”).

In one example, Rizzo (p. 271-277) uses a Markov Chain Monte Carlo (MCMC) method that applies a Gibbs sampler to do the hopping – with the goal of figuring out the change point in number of coal mine disasters from 1851 to 1962. (Looking at a plot of the frequency over time, it appears that the rate of coal mining disasters decreased… but did it really? And if so, when? That’s the point of her example.) She gets the coal mining data from the boot package. Here’s how to get it, and what it looks like:

library(boot)
data(coal)
y <- tabulate(floor(coal[[1]]))
y <- y[1851:length(y)]
barplot(y,xlab="years", ylab="frequency of disasters")

coalmine-freq

First, we initialize all of the data structures we’ll need to use:

# initialization
n <- length(y) # number of data elements to process
m <- 1000 # target length of the chain
L <- numeric(n) # likelihood fxn has one slot per year
k[1] <- sample(1:n,1) # pick 1 random year to start at
mu[1] <- 1
lambda[1] <- 1
b1 <- 1
b2 <- 1
# now set up blank 1000 element arrays for mu, lambda, and k
mu <- lambda <- k <- numeric(m)

Here are the models for prior (hypothesized) distributions that she uses, based on the Gibbs sampler approach:

  • mu comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies UP TO the point in time, k, you’re currently at) and a rate of (k + b1)
  • lambda comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies AFTER the point in time, k, you’re currently at) and a rate of (n – k + b1) where n is the number of the year you’re currently processing
  • b1 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (mu + 1)
  • b2 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (lambda + 1)
  • a likelihood function L is also provided, and is a function of k, mu, lambda, and the sum of all the frequencies up until that point in time, k

At each iteration, you pick a value of k to represent a point in time where a change might have occurred. You slice your data into two chunks: the chunk that happened BEFORE this point in time, and the chunk that happened AFTER this point in time. Using your data, you apply a Poisson Process with a (Hypothesized) Gamma Distributed Rate as your model. This is a pretty common model for this particular type of problem. It’s like randomly cutting a deck of cards and taking the average of the values in each of the two cuts… then doing the same thing again… a thousand times. Here is Rizzo’s (commented) code:

# start at 2, so you can use initialization values as seeds
# and go through this process once for each of your m iterations
for (i in 2:m) {
 kt <- k[i-1] # start w/random year from initialization
 # set your shape parameter to pick mu from, based on the characteristics
 # of the early ("before") chunk of your data
 r <- .5 + sum(y[1:kt]) 
 # now use it to pick mu
 mu[i] <- rgamma(1,shape=r,rate=kt+b1) 
 # if you're at the end of the time periods, set your shape parameter
 # to 0.5 + the sum of all the frequencies, otherwise, just set the shape
 # parameter that you will use to pick lambda based on the later ("after")
 # chunk of your data
 if (kt+1 > n) r <- 0.5 + sum(y) else r <- 0.5 + sum(y[(kt+1):n])
 lambda[i] <- rgamma(1,shape=r,rate=n-kt+b2)
 # now use the mu and lambda values that you got to set b1 and b2 for next iteration
 b1 <- rgamma(1,shape=.5,rate=mu[i]+1)
 b2 <- rgamma(1,shape=.5,rate=lambda[i]+1)
 # for each year, find value of LIKELIHOOD function which you will 
 # then use to determine what year to hop to next
 for (j in 1:n) {
 L[j] <- exp((lambda[i]-mu[i])*j) * (mu[i]/lambda[i])^sum(y[1:j])
 }
 L <- L/sum(L)
 # determine which year to hop to next
 k[i] <- sample(1:n,prob=L,size=1)
}

Knowing the distributions of mu, lambda, and k from hopping around our data will help us estimate values for the true population parameters. At the end of the simulation, we have an array of 1000 values of k, an array of 1000 values of mu, and an array of 1000 values of lambda — we use these to estimate the real values of the population parameters. Typically, algorithms that do this automatically throw out a whole bunch of them in the beginning (the “burn-in” period) — Rizzo tosses out 200 observations — even though some statisticians (e.g. Geyer) say that the burn-in period is unnecessary:

> b <- 201 # treats time until the 200th iteration as "burn-in"
> mean(k[b:m])
[1] 39.765
> mean(lambda[b:m])
[1] 0.9326437
> mean(mu[b:m])
[1] 3.146413

The change point happened between the 39th and 40th observations, the arrival rate before the change point was 3.14 arrivals per unit time, and the rate after the change point was 0.93 arrivals per unit time. (Cool!)
After I went through this example, I discovered the changepoint package, which let me run through a similar process in just a few lines of code. Fortunately, the results were very similar! I chose the “AMOC” method which stands for “at most one change”. Other methods are available which can help identify more than one change point (PELT, BinSeg, and SegNeigh – although I got an error message every time I attempted that last method).

> results <- cpt.mean(y,method="AMOC")
> cpts(results)
cpt 
 36 
> param.est(results)
$mean
[1] 3.2500000 0.9736842
> plot(results,cpt.col="blue",xlab="Index",cpt.width=4)

coalmine-changepoint

I decided to explore a little further and found even MORE change point analysis packages! So I tried this example using bcp (which I presume stands for “Bayesian Change Point”) and voila… the output looks very similar to each of the previous two methods!!!):

coalmine-bcp

It’s at this point that the HARD part of the data science project would begin… WHY? Why does it look like the rate of coal mining accidents decreased suddenly? Was there a change in policy or regulatory requirements in Australia, where this data was collected? Was there some sort of mass exodus away from working in the mines, and so there’s a covariate in the number of opportunities for a mining disaster to occur? Don’t know… the original paper from 1979 doesn’t reveal the true story behind the data.

There are also additional resources on R Bloggers that discuss change point analysis:

(Note: If I’ve missed anything, or haven’t explained anything right, please provide corrections and further insights in the comments! Thank you.

7 replies »

  1. “But you got different numbers than the ones in Rizzo’s book!” — Yeah, I’m pretty sure we must have had different random number seeds. That’s not uncommon or troubling (in my opinion)… as long as the values are close to within a couple decimal points, depending on the seriousness of the problem I’m studying.

  2. Thank you a lot for this article. I need to solve a very similar problem and I found it very helpful!

    I’d like to point out just a few things. The full conditional of the lambda parameter should be a Gamma distribution with rate parameter equal to (n-k+b2) instead of (n-k+b1) as it is written in the article. Is it correct?

    I’ve read Rizzo’s code carefully. I might be wrong but isn’t there a mistake in it? In my opinion the calculation of the shape parameter of the Gamma distribution of the lambda parameter isn’t correct. As you said correctly in the article <>. If k is greater or equal to n (where n is the length of the data), shouldn’t the shape parameter be just 0.5? Why should we add the sum of all frequencies to 0.5, if k is greater or equal to n?
    In other words, I’d change this code:

    if (kt+1 > n) r <- 0.5 + sum(y) else r n) r <- 0.5 else r <- 0.5 + sum(y[(kt+1):n])

    I'd be pleased to know what you think about it. Thanking you in advance.
    Your sincerely,
    Simone

  3. Thank you a lot for this article. I need to solve a very similar problem and I found it very helpful!

    I’d like to point out just a few things. The full conditional of the lambda parameter should be a Gamma distribution with rate parameter equal to (n-k+b2) instead of (n-k+b1) as it is written in the article. Is it correct?

    I’ve read Rizzo’s code carefully. I might be wrong but isn’t there a mistake in it? In my opinion the calculation of the shape parameter of the Gamma distribution of the lambda parameter isn’t correct. As you said correctly in the article “”lambda comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies AFTER the point in time, k, you’re currently at)””. If k is greater or equal to n (where n is the length of the data), shouldn’t the shape parameter be just 0.5? Why should we add the sum of all frequencies to 0.5, if k is greater or equal to n?
    In other words, I’d change this code:
    if (kt+1 > n) r <- 0.5 + sum(y) else r n) r <- 0.5 else r <- 0.5 + sum(y[(kt+1):n])
    I'd be pleased to know what you think about it. Thanking you in advance.
    Your sincerely,
    Simone

  4. Thank you a lot for this article. I need to solve a very similar problem and I found it very helpful!

    I’d like to point out just a few things. The full conditional of the lambda parameter should be a Gamma distribution with rate parameter equal to (n-k+b2) instead of (n-k+b1) as it is written in the article. Is it correct?

    I’ve read Rizzo’s code carefully. I might be wrong but isn’t there a mistake in it? In my opinion the calculation of the shape parameter of the Gamma distribution of the lambda parameter isn’t correct. As you said correctly in the article “”lambda comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies AFTER the point in time, k, you’re currently at)””. If k is greater or equal to n (where n is the length of the data), shouldn’t the shape parameter be just 0.5? Why should we add the sum of all frequencies to 0.5, if k is greater or equal to n?
    In other words, I’d change this code:
    ” if (kt+1 > n) r <- 0.5 + sum(y) else r n) r <- 0.5 else r <- 0.5 + sum(y[(kt+1):n])"
    I'd be pleased to know what you think about it. Thanking you in advance.
    Your sincerely,
    Simone

  5. Thank you a lot for this article. I need to solve a very similar problem and I found it very helpful!

    I’d like to point out just a few things. The full conditional of the lambda parameter should be a Gamma distribution with rate parameter equal to (n-k+b2) instead of (n-k+b1) as it is written in the article. Is it correct?

    I’ve read Rizzo’s code carefully. I might be wrong but isn’t there a mistake in it? In my opinion the calculation of the shape parameter of the Gamma distribution of the lambda parameter isn’t correct. As you said correctly in the article “”lambda comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies AFTER the point in time, k, you’re currently at)””. If k is greater or equal to n (where n is the length of the data), shouldn’t the shape parameter be just 0.5? Why should we add the sum of all frequencies to 0.5, if k is greater or equal to n?
    In other words, I’d change this code:

    if (kt+1 > n) r = 0.5 + sum(y) else r = 0.5 + sum(y[(kt+1):n])

    into this:

    if (kt+1 > n) r = 0.5 else r = 0.5 + sum(y[(kt+1):n])

    I’d be pleased to know what you think about it. Thanking you in advance.
    Your sincerely,
    Simone

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s