## Course Materials for Statistics (The Easier Way) With R Are you an instructor with a Spring 2016 intro to statistics course coming up… and yet you haven’t started preparing? If so, I have a potential solution for you to consider. The materials (lecture slides, in-class practice problems in R, exams, syllabus) go with my book and are about 85% compiled, but good enough to get a course started this week (I will be finishing the collection by mid-January).
There is also a 36MB .zip file if you would like to download the materials.
If you would like permission to edit the materials, I can do that as well — I know a couple of you have expressed that you would like to add to the collection (e.g translate them to another language). If you see any issues or errors, please either fix and/or email to tell me to fix!
Thanks for your interest and participation! Also, Happy New Year!

## My Second (R) Shiny App: Sampling Distributions & CLT Image Credit: Doug Buckley of http://hyperactive.to

I was so excited about my initial foray into Shiny development using jennybc‘s amazing package that I stayed up half the night last night (again) working on my second Shiny app: a Shiny-fied version of the function I shared in March to do simulations illustrating sampling distributions and the Central Limit Theorem using many different source distributions. (Note that Cauchy doesn’t play by the rules!) Hope this info is useful to all new Shiny developers.

If the app doesn’t work for you, it’s possible that I’ve exhausted my purchased hours at http://shinyapps.io — no idea how much traffic this post might generate. So if that happens to you, please try getting Shiny to work locally, cutting and pasting the code below into server.R and ui.R files, and then launching the simulation from your R console.

Here are some important lessons I learned on my 2nd attempt at Shiny development:

• Creating a container (rv) for the server-side values that would change as a result of inputs from the UI was important. That container was then available to the portions of my Shiny code that prepared data for the UI, e.g. output\$plotSample.
• Because switch only takes arguments that are 1 character long, using radio buttons in the Shiny UI was really useful: I can map the label on each radio button to one character that will get passed into the data processing on the server side.
• I was able to modify the CSS for the page by adding a couple lines to mainPanel() in my UI.
• Although it was not mentally easy (for me) to convert from an R function to a Shiny app when initially presented with the problem, in retrospect, it was indeed straightforward. All I had to do was take the original function, split out the data processing from the presentation (par & hist commands), put the data processing code on the server side and the presentation code on the UI side, change the variable names on the server side so that they had the input\$ prefix, and make sure the variable names were consistent between server and UI.
• I originally tried writing one app.R file, but http://shinyapps.io did not seem to like that, so I put all the code that was not UI into the server side and tried deploying with server.R and ui.R, which worked. I don’t know what I did wrong.
• If you want to publish to http://shinyapps.io, the directory name that hosts your files must be at least 4 characters long or you will get a “validation error” when you attempt to deployApp().
```## Nicole's Second Shiny Demo App
## Used code from http://github.com/homerhanumat as a base
###########################################################
## ui
###########################################################

ui <- fluidPage(
titlePanel('Sampling Distributions and the Central Limit Theorem'),
sidebarPanel(
helpText('Choose your source distribution and number of items, n, in each
sample. 10000 replications will be run when you click "Sample Now".'),
href="http://www.r-bloggers.com/sampling-distributions-and-central-limit-theorem-in-r/", target="_blank")),
sliderInput(inputId="n","Sample Size n",value=30,min=5,max=100,step=2),
c("Exponential: Param1 = mean, Param2 = not used" = "E",
"Normal: Param1 = mean, Param2 = sd" = "N",
"Uniform: Param1 = min, Param2 = max" = "U",
"Poisson: Param1 = lambda, Param2 = not used" = "P",
"Cauchy: Param1 = location, Param2 = scale" = "C",
"Binomial: Param1 = size, Param2 = success prob" = "B",
"Gamma: Param1 = shape, Param2 = scale" = "G",
"Chi Square: Param1 = df, Param2 = ncp" = "X",
"Student t: Param1 = df, Param2 = not used" = "T")),
numericInput("param1","Parameter 1:",10),
numericInput("param2","Parameter 2:",2),
actionButton("takeSample","Sample Now")
), # end sidebarPanel
mainPanel(
# Use CSS to control the background color of the entire page
tags\$style("body {background-color: #9999aa; }")
),
plotOutput("plotSample")
) # end mainPanel
) # end UI

##############################################################
## server
##############################################################

library(shiny)
r <- 10000 # Number of replications... must be ->inf for sampling distribution!

palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
"#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

server <- function(input, output) {
set.seed(as.numeric(Sys.time()))

# Create a reactive container for the data structures that the simulation
# will produce. The rv\$variables will be available to the sections of your
# server code that prepare output for the UI, e.g. output\$plotSample
rv <- reactiveValues(sample = NULL,
all.sums = NULL,
all.means = NULL,
all.vars = NULL)

# Note: We are giving observeEvent all the output connected to the UI actionButton.
# We can refer to input variables from our UI as input\$variablename
observeEvent(input\$takeSample,
{
my.samples <- switch(input\$src.dist,
"E" = matrix(rexp(input\$n*r,input\$param1),r),
"N" = matrix(rnorm(input\$n*r,input\$param1,input\$param2),r),
"U" = matrix(runif(input\$n*r,input\$param1,input\$param2),r),
"P" = matrix(rpois(input\$n*r,input\$param1),r),
"C" = matrix(rcauchy(input\$n*r,input\$param1,input\$param2),r),
"B" = matrix(rbinom(input\$n*r,input\$param1,input\$param2),r),
"G" = matrix(rgamma(input\$n*r,input\$param1,input\$param2),r),
"X" = matrix(rchisq(input\$n*r,input\$param1),r),
"T" = matrix(rt(input\$n*r,input\$param1),r))

# It was very important to make sure that rv contained numeric values for plotting:
rv\$sample <- as.numeric(my.samples[1,])
rv\$all.sums <- as.numeric(apply(my.samples,1,sum))
rv\$all.means <- as.numeric(apply(my.samples,1,mean))
rv\$all.vars <- as.numeric(apply(my.samples,1,var))
}
)

output\$plotSample <- renderPlot({
# Plot only when user input is submitted by clicking "Sample Now"
if (input\$takeSample) {
# Create a 2x2 plot area & leave a big space (5) at the top for title
par(mfrow=c(2,2), oma=c(0,0,5,0))
hist(rv\$sample, main="Distribution of One Sample",
ylab="Frequency",col=1)
hist(rv\$all.sums, main="Sampling Distribution of the Sum",
ylab="Frequency",col=2)
hist(rv\$all.means, main="Sampling Distribution of the Mean",
ylab="Frequency",col=3)
hist(rv\$all.vars, main="Sampling Distribution of the Variance",
ylab="Frequency",col=4)
mtext("Simulation Results", outer=TRUE, cex=3)
}
}, height=660, width=900) # end plotSample

} # end server```

## Control Charts in R: A Guide to X-Bar/R Charts in the qcc Package Statistical process control provides a mechanism for measuring, managing, and controlling processes. There are many different flavors of control charts, but if data are readily available, the X-Bar/R approach is often used. The following PDF describes X-Bar/R charts and shows you how to create them in R and interpret the results, and uses the fantastic qcc package that was developed by Luca Scrucca. Please let me know if you find it helpful!

Creating and Interpreting X-Bar/R Charts in R

## A Simple Intro to Bayesian Change Point Analysis

The purpose of this post is to demonstrate change point analysis by stepping through an example of the technique in R presented in Rizzo’s excellent, comprehensive, and very mathy book, Statistical Computing with R, and then showing alternative ways to process this data using the changepoint and bcp packages. Much of the commentary is simplified, and that’s on purpose: I want to make this introduction accessible if you’re just learning the method. (Most of the code is straight from Rizzo who provides a much more in-depth treatment of the technique. I’ve added comments in the code to make it easier for me to follow, and that’s about it.)

The idea itself is simple: you have a sample of observations from a Poisson (counting) process (where events occur randomly over a period of time). You probably have a chart that shows time on the horizontal axis, and how many events occurred on the vertical axis. You suspect that the rate at which events occur has changed somewhere over that range of time… either the event is increasing in frequency, or it’s slowing down — but you want to know with a little more certainty. (Alternatively, you could check to see if the variance has changed, which would be useful for process improvement work in Six Sigma projects.)

You want to estimate the rate at which events occur BEFORE the shift (mu), the rate at which events occur AFTER the shift (lambda), and the time when the shift happens (k). To do it, you can apply a Markov Chain Monte Carlo (MCMC) sampling approach to estimate the population parameters at each possible k, from the beginning of your data set to the end of it. The values you get at each time step will be dependent only on the values you computed at the previous timestep (that’s where the Markov Chain part of this problem comes in). There are lots of different ways to hop around the parameter space, and each hopping strategy has a fancy name (e.g. Metropolis-Hastings, Gibbs, “reversible jump”).

In one example, Rizzo (p. 271-277) uses a Markov Chain Monte Carlo (MCMC) method that applies a Gibbs sampler to do the hopping – with the goal of figuring out the change point in number of coal mine disasters from 1851 to 1962. (Looking at a plot of the frequency over time, it appears that the rate of coal mining disasters decreased… but did it really? And if so, when? That’s the point of her example.) She gets the coal mining data from the boot package. Here’s how to get it, and what it looks like:

```library(boot)
data(coal)
y <- tabulate(floor(coal[]))
y <- y[1851:length(y)]
barplot(y,xlab="years", ylab="frequency of disasters")``` First, we initialize all of the data structures we’ll need to use:

```# initialization
n <- length(y) # number of data elements to process
m <- 1000 # target length of the chain
L <- numeric(n) # likelihood fxn has one slot per year
k <- sample(1:n,1) # pick 1 random year to start at
mu <- 1
lambda <- 1
b1 <- 1
b2 <- 1
# now set up blank 1000 element arrays for mu, lambda, and k
mu <- lambda <- k <- numeric(m)```

Here are the models for prior (hypothesized) distributions that she uses, based on the Gibbs sampler approach:

• mu comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies UP TO the point in time, k, you’re currently at) and a rate of (k + b1)
• lambda comes from a Gamma distribution with shape parameter of (0.5 + the sum of all your frequencies AFTER the point in time, k, you’re currently at) and a rate of (n – k + b1) where n is the number of the year you’re currently processing
• b1 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (mu + 1)
• b2 comes from a Gamma distribution with a shape parameter of 0.5 and a rate of (lambda + 1)
• a likelihood function L is also provided, and is a function of k, mu, lambda, and the sum of all the frequencies up until that point in time, k

At each iteration, you pick a value of k to represent a point in time where a change might have occurred. You slice your data into two chunks: the chunk that happened BEFORE this point in time, and the chunk that happened AFTER this point in time. Using your data, you apply a Poisson Process with a (Hypothesized) Gamma Distributed Rate as your model. This is a pretty common model for this particular type of problem. It’s like randomly cutting a deck of cards and taking the average of the values in each of the two cuts… then doing the same thing again… a thousand times. Here is Rizzo’s (commented) code:

```# start at 2, so you can use initialization values as seeds
# and go through this process once for each of your m iterations
for (i in 2:m) {
kt <- k[i-1] # start w/random year from initialization
# set your shape parameter to pick mu from, based on the characteristics
# of the early ("before") chunk of your data
r <- .5 + sum(y[1:kt])
# now use it to pick mu
mu[i] <- rgamma(1,shape=r,rate=kt+b1)
# if you're at the end of the time periods, set your shape parameter
# to 0.5 + the sum of all the frequencies, otherwise, just set the shape
# parameter that you will use to pick lambda based on the later ("after")
if (kt+1 > n) r <- 0.5 + sum(y) else r <- 0.5 + sum(y[(kt+1):n])
lambda[i] <- rgamma(1,shape=r,rate=n-kt+b2)
# now use the mu and lambda values that you got to set b1 and b2 for next iteration
b1 <- rgamma(1,shape=.5,rate=mu[i]+1)
b2 <- rgamma(1,shape=.5,rate=lambda[i]+1)
# for each year, find value of LIKELIHOOD function which you will
# then use to determine what year to hop to next
for (j in 1:n) {
L[j] <- exp((lambda[i]-mu[i])*j) * (mu[i]/lambda[i])^sum(y[1:j])
}
L <- L/sum(L)
# determine which year to hop to next
k[i] <- sample(1:n,prob=L,size=1)
}```

Knowing the distributions of mu, lambda, and k from hopping around our data will help us estimate values for the true population parameters. At the end of the simulation, we have an array of 1000 values of k, an array of 1000 values of mu, and an array of 1000 values of lambda — we use these to estimate the real values of the population parameters. Typically, algorithms that do this automatically throw out a whole bunch of them in the beginning (the “burn-in” period) — Rizzo tosses out 200 observations — even though some statisticians (e.g. Geyer) say that the burn-in period is unnecessary:

```> b <- 201 # treats time until the 200th iteration as "burn-in"
> mean(k[b:m])
 39.765
> mean(lambda[b:m])
 0.9326437
> mean(mu[b:m])
 3.146413```

The change point happened between the 39th and 40th observations, the arrival rate before the change point was 3.14 arrivals per unit time, and the rate after the change point was 0.93 arrivals per unit time. (Cool!)
After I went through this example, I discovered the changepoint package, which let me run through a similar process in just a few lines of code. Fortunately, the results were very similar! I chose the “AMOC” method which stands for “at most one change”. Other methods are available which can help identify more than one change point (PELT, BinSeg, and SegNeigh – although I got an error message every time I attempted that last method).

```> results <- cpt.mean(y,method="AMOC")
> cpts(results)
cpt
36
> param.est(results)
\$mean
 3.2500000 0.9736842
> plot(results,cpt.col="blue",xlab="Index",cpt.width=4)``` I decided to explore a little further and found even MORE change point analysis packages! So I tried this example using bcp (which I presume stands for “Bayesian Change Point”) and voila… the output looks very similar to each of the previous two methods!!!): It’s at this point that the HARD part of the data science project would begin… WHY? Why does it look like the rate of coal mining accidents decreased suddenly? Was there a change in policy or regulatory requirements in Australia, where this data was collected? Was there some sort of mass exodus away from working in the mines, and so there’s a covariate in the number of opportunities for a mining disaster to occur? Don’t know… the original paper from 1979 doesn’t reveal the true story behind the data.

There are also additional resources on R Bloggers that discuss change point analysis:

(Note: If I’ve missed anything, or haven’t explained anything right, please provide corrections and further insights in the comments! Thank you.

## What (Really) is a Data Scientist?

What is a data scientist? What makes for a good (or great!) data scientist? It’s been challenging enough to determine what a data scientist really is (several people have proposed ways to look at this). The Guardian (a UK publication) said, however, that a true data scientist is as “rare as a unicorn”.

I believe that the data scientist “unicorn” is hidden right in front of our faces; the purpose of this post is to help you find it. First, we’ll take a look at some models, and then I’ll present my version of what a data scientist is (and how this person can become “great”).

#1 Drew Conway’s popularData Science Venn Diagram” — created in 2010 — characterizes the data scientist as a person with some combination of skills and expertise in three categories (and preferably, depth in all of them): 1) Hacking, 2) Math and Statistics, and 3) Substantive Expertise (also called “domain knowledge”).

Later, he added that there was a critical missing element in the diagram: that effective storytelling with data is fundamental. The real value-add, he says, is being able to construct actionable knowledge that facilitates effective decision making. How to get the “actionable” part? Be able to communicate well with the people who have the responsibility and authority to act.

“To me, data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science. Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. On the flip-side, substantive expertise plus math and statistics knowledge is where most traditional researcher falls. Doctoral level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology. Part of this is the culture of academia, which does not reward researchers for understanding technology. That said, I have met many young academics and graduate students that are eager to bucking that tradition.”Drew Conway, March 26, 2013

#2 In 2013, Harlan Harris (along with his two colleagues, Sean Patrick Murphy and Marck Vaisman) published a fantastic study where they surveyed approximately 250 professionals who self-identified with the “data science” label. Each person was asked to rank their proficiency in each of 22 skills (for example, Back-End Programming, Machine Learning, and Unstructured Data). Using clustering, they identified four distinct “personality types” among data scientists:

As a manager, you might try to cut corners by hiring all Data Creatives(*). But then, you won’t benefit from the ultra-awareness that theorists provide. They can help you avoid choosing techniques that are inappropriate, if (say) your data violates the assumptions of the methods. This is a big deal! You can generate completely bogus conclusions by using the wrong tool for the job. You would not benefit from the stress relief that the Data Developers will provide to the rest of the data science team. You would not benefit from the deep domain knowledge that the Data Businessperson can provide… that critical tacit and explicit knowledge that can save you from making a potentially disastrous decision.

Although most analysts and researchers who do screw up very innocently screw up their analyses by stumbling into misuses of statistical techniques, some unscrupulous folks might mislead other on purpose; although an extreme case, see I Fooled Millions Into Thinking Chocolate Helps Weight Loss.

Their complete results are available as a 30-page report (available in print or on Kindle).

“The data scientist’s skills – advanced analytics, data integration, software development, creativity, good communications skills and business acumen – often already exist in an organisation. Just not in a single person… likely to be spread over different roles, such as statisticians, bio-chemists, programmers, computer scientists and business analysts. And they’re easier to find and hire than data scientists.”

They cite British Airways as an exemplar:

“[British Airways] believes that data scientists are more effective and bring more value to the business when they work within teams. Innovation has usually been found to occur within team environments where there are multiple skills, rather than because someone working in isolation has a brilliant idea, as often portrayed in TV dramas.”

Their position is you can’t get all those skills in one person, so don’t look for it. Just yesterday I realized that if I learn one new amazing thing in R every single day of my life, by the time I die, I will probably be an expert in about 2% of the package (assuming it’s still around).

#4 Others have chimed in on this question and provided outlines of skill sets, such as:

• : statistical thinking, technical acumen, multi-modal communication skills, curiosity, creativity, grit
• The Udacity blog: basic tools (R, Python), software engineering, statistics, machine learning, multivariate calculus, linear algebra, data munging, data visualization and communication, and the ultimately nebulous “thinking like a data scientist”
• IBM: “part analyst, part artist” skilled in “computer science and applications, modeling, statistics, analytics and math… [and] strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.”
• SAS: “a new breed of analytical data expert who have the technical skills to solve complex problems – and the curiosity to explore what problems need to be solved. They’re part mathematician, part computer scientist and part trend-spotter.” (Doesn’t that sound exciting?)
• DataJobs.Com: well, these guys just took Drew Conway’s Venn diagram and relabeled it.

#5 My Answer to “What is a Data Scientist?”:  A data scientist is a sociotechnical boundary spanner who helps convert data and information into actionable knowledge.

Based on all of the perspectives above, I’d like to add that the data scientist must have an awareness of the context of the problems being solved: social, cultural, economic, political, and technological. Who are the stakeholders? What’s important to them? How are they likely to respond to the actions we take in response to the new knowledge data science brings our way? What’s best for everyone involved so that we can achieve sustainability and the effective use of our resources? And what’s with the word “helps” in the definition above? This is intended to reflect that in my opinion, a single person can’t address the needs of a complex data science challenge. We need each other to be “great” at it.

A data scientist is someone who can effectively span the boundaries between

1) understanding social+ context,

2) correctly selecting and applying techniques from math and statistics,

3) leveraging hacking skills wherever necessary,

4) applying domain knowledge, and

5) creating compelling and actionable stories and connections that help decision-makers achieve their goals. This person has a depth of knowledge and technical expertise in at least one of these five areas, and a high level of familiarity with each of the other areas (commensurate with Harris’ T-model). They are able to work productively within a small team whose deep skills span all five areas.

It’s data-driven decision making embedded in a rich social, cultural, economic, political, and technological context… where the challenges may be complex, and the stakes (and ultimately, the benefits) may be high.

(*) Disclosure: I am a Data Creative!

(**)Quality professionals (like Six Sigma Black Belts) have been doing this for decades. How can we enhance, expand, and leverage our skills to address the growing need for data scientists?

## My New Favorite Statistics & Data Analysis Book Using R ## NOTE: The 3rd Edition (Purple Swan) was released in April 2019. There is a companion book that presents end-to-end examples of each of the methods.

As of today, I now have a NEW FAVORITE introductory statistics textbook… the one I’ve always dreamed of having. I’ve been looking for a book to use in my classes for undergraduate sophomores and juniors, but none of the textbooks I considered over the past three years (and I’ve looked at over a hundred!) had all of the things I really, really wanted. So I had to go make it happen myself. These things are:

1) An integrated treatment of theory and practice. All of my stats textbooks have a lot of formulas, and no information about how to do what the formulas do in the R statistical software. All of my R textbooks have a lot of information about how to run the commands, but not really much information about what formulas are being used. I wanted a book that would show how to solve problems analytically (using the equations), and then show how they’re done in R. If there were discrepancies between the stats textbook answers and the R answers, I wanted to know why. A lot of times, the developers of R packages use very sophisticated adjustments and corrections, which I only became aware of because my analytical solutions didn’t match the R output. At first, I thought I was wrong. But later, I realized I was right, and R was right: we were just doing different things. I wanted my students to know what was going on under the hood, and have an awareness of exactly which methods R was using at every moment.

2) An easy way to develop research questions for observational studies and organize the presentation of results. We always do small research projects in my classes, and in my opinion, this is the best way for students to get a strong grasp of the fundamental statistical concepts. But they always have the same questions: Which statistical test should I use? How should I phrase my research question? What should I include in my report? I wanted a book that made developing statistical research questions easy. In fact, I know a lot of people I went to PhD school with that would have loved to have this book while they were proposing, conducting, and defending their dissertations.

3) A confidence interval cookbook. This is probably one of the most important things I want my students to leave my class remembering: that from whatever sample you collect, you can construct a confidence interval that will give you an idea of what the true population parameter should be. You don’t even need to do a hypothesis test! but it can be difficult to remember which formula to use… so I wanted an easy reference where I’d be able to look things up, and find out really easily how to use R to construct those confidence intervals for me. Furthermore, some of the confidence intervals that everyone is taught in an introductory statistics course are wildly inaccurate – and statisticians know this. But they hesitate to scare away novice data analysts with long, scary looking equations, and so students keep learning those inaccurate methods and believing they’re good. Since so many people never get beyond introductory statistics and still turn into researchers in other fields, I thought this was horrible. I want to make sure my students know the best way to do each confidence interval in their first class… even if the equations are not as friendly.

4) An inference test cookbook. I wanted a book that stepped me through each of the primary parametric inference tests analytically (using the equations), and then showed me how it was done in R. If there were discrepancies, I wanted to know why. I wanted an easy way to remember the assumptions for each test, and when to use a pooled standard deviation versus an unpooled one. There’s a lot to keep track of! I wanted a reference that it would make it easy to keep track of all of it: assumptions, tests for assumptions, equations, R code, and diagnostic plots.

5) No step left behind. It’s really frustrating to me how so many R books assume you can do a psychic fill-in-the-blank for missing code. Since I’ve been using R for several years now, I’ve gotten to the point where my psychic abilities are pretty good, and at least 60% of the time I can figure out the missing pieces. But wow, what a waste of time! So I wanted a book that had all of the steps for each example. Even if it was a little repetitive. I may have missed this in a few places, but I think beginners will have a much easier time with this book. Also, I put all my data and functions on GitHub for people to run the examples with. I’m growing this slowly, but I don’t want people to be left in the lurch.

6) An easy way to produce any of the charts and graphs in the book. One of my pet peeves about R books is that the authors generate beautiful charts and graphs, and then you’re reading through the book and say “Yes!! Yes!! That’s the chart I need for my report… I want to do that… how did they do that?” and they don’t tell you anywhere how they did it. I did not want there to be any secrets in this book. If I generated a page of interesting looking simulated distributions, I wanted you to know how I did it (just in case you want to do it later).

GRANTED… I am sure it will not be perfect – no book is. (For example, Google Forms changes a lot and there are a couple examples that use it that will probably be outdated when the book gets to press… and I just found out this morning that you don’t need the source_https trick in R 3.2.0 and beyond.) [Note: data access has been fully updated in the 2nd Edition.] However, I will keep updating my blog with posts about useful things as they evolve.

In any case, I hope you enjoy my book as much as I’ve been enjoying using it as a reference for myself… it really is all my most important notes, neatly organized into just over 500 pages of everything I want to remember. And everything I want to make sure my students take with them after they leave my class.

[Note: Any errors and omissions from earlier printings (which have been taken care of in later printings) are being recorded at http://qualityandinnovation.com/errata/.]

## Sampling Distributions and Central Limit Theorem in R

The Central Limit Theorem (CLT), and the concept of the sampling distribution, are critical for understanding why statistical inference works. There are at least a handful of problems that require you to invoke the Central Limit Theorem on every ASQ Certified Six Sigma Black Belt (CSSBB) exam. The CLT says that if you take many repeated samples from a population, and calculate the averages or sum of each one, the collection of those averages will be normally distributed… and it doesn’t matter what the shape of the source distribution is! (Caveat: so long as the data comes from a distribution with finite variance… so that means the Cauchy distribution doesn’t count.)

I wrote some R code to help illustrate this principle for my students. This code allows you to choose a sample size (n), a source distribution, and parameters for that source distribution, and generate a plot of the sampling distributions of the mean, sum, and variance. (Note: the sampling distribution for the variance is a Chi-square distribution — if your source distribution is normal!)

```sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
r <- 10000  # Number of replications/samples - DO NOT ADJUST
# This produces a matrix of observations with
# n columns and r rows. Each row is one sample:
my.samples <- switch(src.dist,
"E" = matrix(rexp(n*r,param1),r),
"N" = matrix(rnorm(n*r,param1,param2),r),
"U" = matrix(runif(n*r,param1,param2),r),
"P" = matrix(rpois(n*r,param1),r),
"B" = matrix(rbinom(n*r,param1,param2),r),
"G" = matrix(rgamma(n*r,param1,param2),r),
"X" = matrix(rchisq(n*r,param1),r),
"T" = matrix(rt(n*r,param1),r))
all.sample.sums <- apply(my.samples,1,sum)
all.sample.means <- apply(my.samples,1,mean)
all.sample.vars <- apply(my.samples,1,var)
par(mfrow=c(2,2))
hist(my.samples[1,],col="gray",main="Distribution of One Sample")
hist(all.sample.sums,col="gray",main="Sampling Distribution\nof
the Sum")
hist(all.sample.means,col="gray",main="Sampling Distribution\nof the Mean")
hist(all.sample.vars,col="gray",main="Sampling Distribution\nof
the Variance")
}```

There are 8 population distributions to choose from: exponential (E), normal (N), uniform (U), Poisson (P), binomial (B), gamma (G), Chi-Square (X), and the Student’s t distribution (T). Note also that you have to provide either one or two parameters, depending upon what distribution you are selecting. For example, a normal distribution requires that you specify the mean and standard deviation to describe where it’s centered, and how fat or thin it is (that’s two parameters). A Chi-square distribution requires that you specify the degrees of freedom (that’s only one parameter). You can find out exactly what distributions require what parameters by going here: http://en.wikibooks.org/wiki/R_Programming/Probability_Distributions.

Here is an example that draws from an exponential distribution with a mean of 1/1 (you specify the number you want in the denominator of the mean):

`sdm.sim(50,src.dist="E",param1=1)`

The code above produces this sequence of plots: You aren’t allowed to change the number of replications in this simulation because of the nature of the sampling distribution: it’s a theoretical model that describes the distribution of statistics from an infinite number of samples. As a result, if you increase the number of replications, you’ll see the mean of the sampling distribution bounce around until it converges on the mean of the population. This is just an artifact of the simulation process: it’s not a characteristic of the sampling distribution, because to be a sampling distribution, you’ve got to have an infinite number of samples. Watkins et al. have a great description of this effect that all statistics instructors should be aware of. I chose 10,000 for the number of replications because 1) it’s close enough to infinity to ensure that the mean of the sampling distribution is the same as the mean of the population, but 2) it’s far enough away from infinity to not crash your computer, even if you only have 4GB or 8GB of memory.

Here are some more examples to try. You can see that as you increase your sample size (n), the shapes of the sampling distributions become more and more normal, and the variance decreases, constraining your estimates of the population parameters more and more.

```sdm.sim(10,src.dist="E",1)
sdm.sim(50,src.dist="E",1)
sdm.sim(100,src.dist="E",1)
sdm.sim(10,src.dist="X",14)
sdm.sim(50,src.dist="X",14)
sdm.sim(100,src.dist="X",14)
sdm.sim(10,src.dist="N",param1=20,param2=3)
sdm.sim(50,src.dist="N",param1=20,param2=3)
sdm.sim(100,src.dist="N",param1=20,param2=3)
sdm.sim(10,src.dist="G",param1=5,param2=5)
sdm.sim(50,src.dist="G",param1=5,param2=5)
sdm.sim(100,src.dist="G",param1=5,param2=5)```