Category Archives: Applied Statistics

Deploying Your Very Own Shiny Server

Nicole has been having a lot of fun the last few days creating her own Shiny apps. We work in the same space, and let’s just say her enthusiasm is very contagious. While she focused on deploying R-based web apps on ShinyApps, I’m more of a web development geek, so I put my energy towards setting up a server where she could host her apps. This should come in handy, since she blew through all of her free server time on ShinyApps after just a couple of days!

Before you begin, you can see a working example of this at

In this tutorial, I’m going to walk you through the process of:

  1. Setting up an Ubuntu 14.04 + NGINX server at DigitalOcean
  2. Installing and configuring R
  3. Installing and configuring Shiny and the open-source edition of Shiny Server
  4. Installing a free SSL certificate from Let’s Encrypt
  5. Securing the Shiny Server using the SSL cert and reverse proxy through NGINX
  6. Setting appropriate permissions on the files to be served
  7. Creating and launching the app Nicole created in her recent post

Setting Up an Ubuntu 14.04 Server at DigitalOcean

DigitalOcean is my new favorite web host. They specialize in high-performance, low-cost VPS (virtual private server) hosting targeted at developers. If you want full control over your server, you can’t beat their $5/month offering. They also provide excellent documentation. In order to set up your server, you should start by following these tutorials:

  1. How to Create Your First DigitalOcean Droplet Virtual Server
  2. How to Connect to Your Droplet with SSH
  3. Initial Server Setup with Ubuntu 14.04
  4. Additional Recommended Steps for New Ubuntu 14.04 Servers
  5. How To Protect SSH with Fail2Ban on Ubuntu 14.04

I followed these pretty much exactly without any difficulties. I did make a few changes to their procedure, which I’ll describe next.

Allowing HTTPS with UFW

I found that the instructions for setting up ufw needed a tweak. Since HTTPS traffic uses port 443 on the server, I thought that sudo ufw allow 443/tcp would take care of letting HTTPS traffic through the firewall. Unfortunately, it doesn’t. In addition, you should run the following:

$ sudo ufw allow https

$ sudo ufw enable

Your web server may not accept incoming HTTPS traffic if you do not do this. Note: you may not have noticed, but you also installed NGINX as part of the UFW tutorial.
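If you want to double-check the firewall afterwards, UFW can list its active rules (run this on the server; the output format varies a little between UFW versions):

```shell
sudo ufw status verbose
# look for 443/tcp (or "https") in the ALLOW list
```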

Setting up Automatic Updates on Your Server

The default install of Ubuntu at DigitalOcean comes with the automatic updates package already installed. This means your server will get security packages and upgrades without you having to do it manually. However, this package needs to be configured. First, edit /etc/apt/apt.conf.d/50unattended-upgrades to look like this:

Unattended-Upgrade::Allowed-Origins {
        "${distro_id}:${distro_codename}-security";
        "${distro_id}:${distro_codename}-updates";
};
Unattended-Upgrade::Mail "";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "02:00";

Note that this configuration will install upgrades and security updates, will automatically reboot your server at 2:00 AM if necessary, and will completely purge unused packages from your system. Some people don’t like to have that much happen automatically without supervision, so adjust to taste. My /etc/apt/apt.conf.d/10periodic file looks like this:

APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Download-Upgradeable-Packages "1";
APT::Periodic::AutocleanInterval "7";
APT::Periodic::Unattended-Upgrade "1";

This sets the package lists to update (and upgrades to run) daily, and the package cache to be cleaned once a week.

Installing and Configuring R

Okay, now that your server is set up (you should be able to view the default NGINX page by visiting your server’s IP address in a browser), it’s time to install R.

Set the CRAN Repository in Ubuntu’s sources.list

The first step is to add your favorite CRAN repository to Ubuntu’s sources list. This will ensure that you get the latest version of R when you install it. To open and edit the sources list, type the following:

$ sudo nano /etc/apt/sources.list

Move the cursor down to the bottom of this file using the arrow keys, and add the following line at the bottom:

deb https://<your-CRAN-mirror>/bin/linux/ubuntu trusty/

Of course, you can substitute your favorite CRAN repo here; I like Berkeley. Don’t miss that there is a space between “ubuntu” and “trusty”. Hit CTRL+X to exit from this file, and say “yes” when nano asks if you want to save your changes. The official docs on installing R packages on Ubuntu also recommend activating the backports repositories, but I found that this was already done on my DigitalOcean install.

Add the Public Key for the Ubuntu R Package

In order for Ubuntu to be able to recognize, and therefore trust, download, and install the R packages from the CRAN repo, we need to install the public key. This can be done with the following command:

$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9

Install R

Run the following:

$ sudo apt-get update

$ sudo apt-get install r-base

When this is finished, you should be able to type R --version and get back the following message:

$ R --version

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see http://www.gnu.org/licenses/.

If you get this, you’ll know that R was successfully installed on your server. If not, you’ll need to do some troubleshooting.

Configure R to Use curl and Your CRAN Repository of Choice

Type the following to open R’s site-wide profile file for editing:

$ sudo pico /etc/R/Rprofile.site

You may delete all of the content and add the following (put the URL of your CRAN repository of choice between the quotes):


    r <- getOption("repos")
    r["CRAN"] <- ""  # your CRAN mirror URL goes here
    options(repos = r)
    options(download.file.method = "curl")

This will allow us to run install.packages('packagename') without specifying the repository later.
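To confirm the change took effect (a quick sanity check of mine, not part of the original post), ask a fresh R session for its repos option from the shell:

```shell
R -e 'getOption("repos")'
# the CRAN entry should show the mirror URL you just configured
```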

Install Dependencies and Packages Needed by Shiny Server

We’re going to need the devtools package, which means we need to install the libraries upon which it depends first (libcurl and libxml2):

$ sudo apt-get -y build-dep libcurl4-gnutls-dev

$ sudo apt-get -y install libcurl4-gnutls-dev

$ sudo apt-get -y build-dep libxml2-dev

$ sudo apt-get -y install libxml2-dev

Now we can install devtools, rsconnect, and rmarkdown:

$ sudo su - -c "R -e \"install.packages('devtools')\""

$ sudo su - -c "R -e \"devtools::install_github('rstudio/rsconnect')\""

$ sudo su - -c "R -e \"install.packages('rmarkdown')\""

$ sudo su - -c "R -e \"install.packages('shiny')\""

Install Shiny Server

Okay! Now we’re finally ready to install Shiny Server. Run the following:

$ cd ~ 
$ sudo apt-get install gdebi-core
$ wget <URL of the current Shiny Server .deb from RStudio's download page>
$ sudo gdebi shiny-server-<version>.deb

At this point, your Shiny Server should be up and running, but we can’t visit it on the web yet because by default, it runs on port 3838, which is blocked by the firewall we set up earlier. We’re now going to secure it, and use a reverse proxy to run it through NGINX.
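Before wiring up the proxy, you can confirm that Shiny Server is answering locally (from a shell on the server; this assumes curl is installed):

```shell
curl -I http://localhost:3838
# an HTTP 200 response here means Shiny Server is listening on its default port
```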

Install an SSL Certificate with Let’s Encrypt

Let’s Encrypt is a new, free service that will allow you to install a trusted SSL certificate on your server. Since Google and Mozilla are working hard to phase out all non-HTTPS traffic on the web, it’s a good idea to get into the habit of installing SSL certs whenever you set up a new website. First install git, then use it to download letsencrypt:

$ sudo apt-get install git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt

Now, before we install the certificate, we have to stop our web server (NGINX). In the code below, replace example.com with the actual domain name that you registered for this site.

$ sudo service nginx stop
$ sudo ./letsencrypt-auto certonly --standalone -d example.com -d www.example.com

If all goes well, it should have installed your new certificates in the /etc/letsencrypt/live/example.com/ folder.

Configure the Reverse Proxy on NGINX

Open up the following file for editing:

$ sudo nano /etc/nginx/nginx.conf

And add the following lines near the bottom of the main http block, just before the section labeled “Virtual Host Configs”. In my file, this started around line 62:


# Map proxy settings for RStudio
map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

# Virtual Host Configs

And then open up the default site config file:

$ sudo nano /etc/nginx/sites-available/default

And replace its contents with the following. Note you should replace example.com with your actual domain name, and SERVER_IP with the actual IP address of your server.

server {
   listen 80 default_server;
   listen [::]:80 default_server ipv6only=on;
   server_name example.com;
   return 301 https://$server_name$request_uri;
}

server {
   listen 443 ssl;
   server_name example.com;

   ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem;
   ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;
   ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
   ssl_prefer_server_ciphers on;
   ssl_ciphers AES256+EECDH:AES256+EDH:!aNULL;

   location / {
       proxy_pass http://SERVER_IP:3838;
       proxy_redirect http://SERVER_IP:3838/ https://$host/;
       proxy_http_version 1.1;
       proxy_set_header Upgrade $http_upgrade;
       proxy_set_header Connection $connection_upgrade;
       proxy_read_timeout 20d;
   }
}
Now start NGINX up again:

$ sudo service nginx start

And if all went well, your new Shiny Server should be up and running at your domain!

Note that even if you try to go to the insecure URL, traffic will be automatically redirected through HTTPS.

Setting Appropriate Permissions

Sometimes, your Shiny apps will need access to the filesystem to read or write files. Since the Shiny server runs as the user shiny, and all the files being served are owned by root, your apps will crash when they try to access those files. I like Dean Attali’s solution. Run the following commands, substituting yourusername with the username you are using to access the server:

$ sudo groupadd shiny-apps
$ sudo usermod -aG shiny-apps yourusername
$ sudo usermod -aG shiny-apps shiny
$ cd /srv/shiny-server
$ sudo chown -R yourusername:shiny-apps .
$ sudo chmod g+w .
$ sudo chmod g+s .

In the future, any time you add files under /srv/shiny-server, you may need to change the permissions so the Shiny server can read them. We’ll do that in a moment.
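To convince yourself the setgid bit is working, you can create a throwaway file (the file name here is just for illustration) and check which group it inherits:

```shell
cd /srv/shiny-server
touch permtest.txt
ls -l permtest.txt   # the group column should now read shiny-apps
rm permtest.txt
```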

Installing a New App

Finally, I’m going to show you how to put a new app on the server. We’re going to use the app that Nicole created and add it into the “sample apps” folder. Run the following:

$ cd /srv/shiny-server/sample-apps
$ mkdir sampdistclt
$ cd sampdistclt
$ nano server.R

This will create a new file called server.R and open it for editing. Copy and paste the second half of the code from Nicole’s post (the part that starts with ## server) into this file. Save and exit. Now create a second file in this directory called ui.R and paste the code from the first half of Nicole’s post (the part that starts with ## ui up to but not including the part that starts ## server). Save and exit.

Now you need to make sure that the permissions are set correctly:

$ chown -R :shiny-apps .

You may also need to restart the Shiny and/or NGINX servers. The commands to do that are:

$ sudo service nginx restart
$ sudo service shiny-server restart

If all has gone well, you can now view the app up and running at your domain!


I haven’t had a lot of time to use this configuration, so please let me know if you find any bugs, or things that need to be tweaked. On the plus side, this configuration may be cheaper than using ShinyApps, but it also doesn’t have all the cool bells and whistles that you get there, either, like their user interface and traffic monitoring. At the very least, it should be a way to experiment, and put things out in the public for others to play with. Enjoy!

A Discrete Time Markov Chain (DTMC) SIR Model in R

Image Credit: Doug Buckley

There are many different techniques that can be used to model physical, social, economic, and conceptual systems. The purpose of this post is to show how the Kermack-McKendrick (1927) formulation of the SIR Model for studying disease epidemics (where S stands for Susceptible, I stands for Infected, and R for Recovered) can be easily implemented in R as a discrete time Markov Chain using the markovchain package.

A Discrete Time Markov Chain (DTMC) is a model for a random process where one or more entities can change state between distinct timesteps. For example, in SIR, people can be labeled as Susceptible (haven’t gotten a disease yet, but aren’t immune), Infected (they’ve got the disease right now), or Recovered (they’ve had the disease, but no longer have it, and can’t get it because they have become immune). If they get the disease, they change states from Susceptible to Infected. If they get well, they change states from Infected to Recovered. It’s impossible to change states between Susceptible and Recovered, without first going through the Infected state. It’s totally possible to stay in the Susceptible state between successive checks on the population, because there’s not a 100% chance you’ll actually be infected between any two timesteps. You might have a particularly good immune system, or maybe you’ve been hanging out by yourself for several days programming.

Discrete time means you’re not continuously monitoring the state of the people in the system. It would get really overwhelming if you had to ask them every minute “Are you sick yet? Did you get better yet?” It makes more sense to monitor individuals’ states on a discrete basis rather than continuously, for example, like maybe once a day. (Ozgun & Barlas (2009) provide a more extensive illustration of the difference between discrete and continuous modeling, using a simple queuing system.)

To create a Markov Chain in R, all you need to know are the 1) transition probabilities, or the chance that an entity will move from one state to another between successive timesteps, 2) the initial state (that is, how many entities are in each of the states at time t=0), and 3) the markovchain package in R. Be sure to install markovchain before moving forward.

Imagine that there’s a 10% infection rate and a 20% recovery rate. That implies that, between successive timesteps, 90% of Susceptible people will remain Susceptible (the other 10% become Infected), and 20% of those who are Infected will move to the Recovered category (the other 80% stay Infected). 100% of those Recovered will stay Recovered; none of the people who are Recovered will become Susceptible again.
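Before building the chain, it’s worth sanity-checking these numbers: written as a matrix, each row must sum to 1, because every entity has to land in some state at the next timestep. A quick check in R:

```r
P <- matrix(c(0.9, 0.1, 0.0,
              0.0, 0.8, 0.2,
              0.0, 0.0, 1.0), byrow=TRUE, nrow=3)
rowSums(P)  # each row must sum to 1
```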

Say that you start with a population of 100 people, and only 1 is infected. That means your “initial state” is that 99 are Susceptible, 1 is Infected, and 0 are Recovered. Here’s how you set up your Markov Chain:

mcSIR <- new("markovchain", states=c("S","I","R"),
    transitionMatrix=matrix(data=c(0.9,0.1,0,
                                   0,0.8,0.2,
                                   0,0,1),
    byrow=TRUE, nrow=3), name="SIR")
initialState <- c(99,1,0)

At this point, you can ask R to see your transition matrix, which shows the probability of moving FROM each of the three states (that form the rows) TO each of the three states (that form the columns).

> show(mcSIR)
 A  3 - dimensional discrete Markov Chain with following states
 S I R 
 The transition matrix   (by rows)  is defined as follows
    S   I   R
S 0.9 0.1 0.0
I 0.0 0.8 0.2
R 0.0 0.0 1.0

You can also plot your transition probabilities:
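The transition diagram that appeared here can be generated straight from the chain object with the package’s plot method, which draws the states as nodes and the transition probabilities as labeled edges:

```r
plot(mcSIR)
```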



But all we’ve done so far is create our model. We haven’t yet run a simulation, which would show us how many people are in each of the three states as we move from one discrete timestep to the next. We can set up a data frame to contain labels for each timestep, and a count of how many people are in each state at each timestep. Then we fill that data frame with the results after each timestep i, calculated by initialState*mcSIR^i:

timesteps <- 100
sir.df <- data.frame("timestep" = numeric(),
                     "S" = numeric(), "I" = numeric(),
                     "R" = numeric(), stringsAsFactors=FALSE)
for (i in 0:timesteps) {
  newrow <- as.list(c(i, round(as.numeric(initialState * mcSIR ^ i), 0)))
  sir.df[nrow(sir.df) + 1, ] <- newrow
}
Now that we have a data frame containing our SIR results (sir.df), we can display them to see what the values look like:

> head(sir.df)
  timestep  S  I  R
1        0 99  1  0
2        1 89 11  0
3        2 80 17  2
4        3 72 22  6
5        4 65 25 10
6        5 58 26 15

And then plot them to view our simulation results using this DTMC SIR Model:

plot(sir.df$timestep, sir.df$S, col="blue",
     xlab="timestep", ylab="people", main="DTMC SIR Simulation")
points(sir.df$timestep, sir.df$I, col="red")
points(sir.df$timestep, sir.df$R, col="green")


It’s also possible to use the markovchain package to identify elements of your system as it evolves over time:

> absorbingStates(mcSIR)
[1] "R"
> transientStates(mcSIR)
[1] "S" "I"
> steadyStates(mcSIR)
     S I R
[1,] 0 0 1

And you can calculate the first timestep that your Markov Chain reaches its steady state (the “time to absorption”), which your plot should corroborate:

> ab.state <- absorbingStates(mcSIR)
> row <- min(which(sir.df[,ab.state]==max(sir.df[,ab.state])))
> (sir.df[row,]$timestep)+1
[1] 58

You can use this code to change the various transition probabilities to see what the effects are on the outputs yourself (sensitivity analysis). Also, there are methods you can use to perform uncertainty analysis, e.g. putting confidence intervals around your transition probabilities. We won’t do either of these here, nor will we create a Shiny app to run this simulation, despite the significant temptation.
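As a sketch of what that sensitivity analysis might look like (the helper function below is hypothetical, not from the original post), you could wrap the chain construction in a function and vary the infection rate:

```r
library(markovchain)

# Hypothetical helper: time to absorption as a function of the infection
# rate p, holding the recovery rate fixed at 0.2 as in the example above
time.to.absorb <- function(p, recovery = 0.2, timesteps = 200) {
  mc <- new("markovchain", states = c("S","I","R"),
            transitionMatrix = matrix(c(1-p, p,          0,
                                        0,   1-recovery, recovery,
                                        0,   0,          1),
                                      byrow = TRUE, nrow = 3))
  # Track the Recovered count over time, starting from 99/1/0 as before
  R.count <- sapply(0:timesteps,
                    function(i) round(as.numeric(c(99,1,0) * mc^i)[3]))
  min(which(R.count == max(R.count)))  # first timestep at which R peaks
}

sapply(c(0.05, 0.1, 0.2), time.to.absorb)
```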

My Second (R) Shiny App: Sampling Distributions & CLT

Image Credit: Doug Buckley

I was so excited about my initial foray into Shiny development using jennybc‘s amazing googlesheets package that I stayed up half the night last night (again) working on my second Shiny app: a Shiny-fied version of the function I shared in March to do simulations illustrating sampling distributions and the Central Limit Theorem using many different source distributions. (Note that Cauchy doesn’t play by the rules!) Hope this info is useful to all new Shiny developers.

If the app doesn’t work for you, it’s possible that I’ve exhausted my purchased hours at ShinyApps — no idea how much traffic this post might generate. If that happens to you, please try getting Shiny to work locally: cut and paste the code below into server.R and ui.R files, and then launch the simulation from your R console.

Here are some important lessons I learned on my 2nd attempt at Shiny development:

  • Creating a container (rv) for the server-side values that would change as a result of inputs from the UI was important. That container was then available to the portions of my Shiny code that prepared data for the UI, e.g. output$plotSample.
  • Using radio buttons in the Shiny UI was really useful: I map the label on each radio button to a one-character code that gets passed to switch for the data processing on the server side.
  • I was able to modify the CSS for the page by adding a couple lines to mainPanel() in my UI.
  • Although it was not mentally easy (for me) to convert from an R function to a Shiny app when initially presented with the problem, in retrospect, it was indeed straightforward. All I had to do was take the original function, split out the data processing from the presentation (par & hist commands), put the data processing code on the server side and the presentation code on the UI side, change the variable names on the server side so that they had the input$ prefix, and make sure the variable names were consistent between server and UI.
  • I originally tried writing one app.R file, but the deployment did not seem to like that, so I put all the code that was not UI into the server side and deployed with separate server.R and ui.R files, which worked. I don’t know what I did wrong.
  • If you want to publish to ShinyApps, the directory name that hosts your files must be at least 4 characters long, or you will get a “validation error” when you attempt to deployApp().
## Nicole's Second Shiny Demo App
## N. Radziwill, 12/6/2015,
## Used code from as a base
## ui

ui <- fluidPage(
titlePanel('Sampling Distributions and the Central Limit Theorem'),
sidebarLayout(
sidebarPanel(
helpText('Choose your source distribution and number of items, n, in each
sample. 10000 replications will be run when you click "Sample Now".'),
h6(a("Read an article about this simulation at",
href="", target="_blank")),
sliderInput(inputId="n","Sample Size n",value=30,min=5,max=100,step=2),
radioButtons("src.dist", "Distribution type:",
c("Exponential: Param1 = mean, Param2 = not used" = "E",
"Normal: Param1 = mean, Param2 = sd" = "N",
"Uniform: Param1 = min, Param2 = max" = "U",
"Poisson: Param1 = lambda, Param2 = not used" = "P",
"Cauchy: Param1 = location, Param2 = scale" = "C",
"Binomial: Param1 = size, Param2 = success prob" = "B",
"Gamma: Param1 = shape, Param2 = scale" = "G",
"Chi Square: Param1 = df, Param2 = ncp" = "X",
"Student t: Param1 = df, Param2 = not used" = "T")),
numericInput("param1","Parameter 1:",10),
numericInput("param2","Parameter 2:",2),
actionButton("takeSample","Sample Now")
), # end sidebarPanel
mainPanel(
plotOutput("plotSample"),
# Use CSS to control the background color of the entire page
tags$style("body {background-color: #9999aa; }")
) # end mainPanel
) # end sidebarLayout
) # end UI

## server

r <- 10000 # Number of replications... must be ->inf for sampling distribution!

palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
"#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

server <- function(input, output) {

# Create a reactive container for the data structures that the simulation
# will produce. The rv$variables will be available to the sections of your
# server code that prepare output for the UI, e.g. output$plotSample
rv <- reactiveValues(sample = NULL,
all.sums = NULL,
all.means = NULL,
all.vars = NULL)

# Note: We are wrapping the simulation in observeEvent, tied to the UI actionButton.
# We can refer to input variables from our UI as input$variablename
observeEvent(input$takeSample, {
my.samples <- switch(input$src.dist,
"E" = matrix(rexp(input$n*r,input$param1),r),
"N" = matrix(rnorm(input$n*r,input$param1,input$param2),r),
"U" = matrix(runif(input$n*r,input$param1,input$param2),r),
"P" = matrix(rpois(input$n*r,input$param1),r),
"C" = matrix(rcauchy(input$n*r,input$param1,input$param2),r),
"B" = matrix(rbinom(input$n*r,input$param1,input$param2),r),
"G" = matrix(rgamma(input$n*r,input$param1,input$param2),r),
"X" = matrix(rchisq(input$n*r,input$param1),r),
"T" = matrix(rt(input$n*r,input$param1),r))

# It was very important to make sure that rv contained numeric values for plotting:
rv$sample <- as.numeric(my.samples[1,])
rv$all.sums <- as.numeric(apply(my.samples,1,sum))
rv$all.means <- as.numeric(apply(my.samples,1,mean))
rv$all.vars <- as.numeric(apply(my.samples,1,var))
}) # end observeEvent

output$plotSample <- renderPlot({
# Plot only when user input is submitted by clicking "Sample Now"
if (input$takeSample) {
# Create a 2x2 plot area & leave a big space (5) at the top for title
par(mfrow=c(2,2), oma=c(0,0,5,0))
hist(rv$sample, main="Distribution of One Sample")
hist(rv$all.sums, main="Sampling Distribution of the Sum")
hist(rv$all.means, main="Sampling Distribution of the Mean")
hist(rv$all.vars, main="Sampling Distribution of the Variance")
mtext("Simulation Results", outer=TRUE, cex=3)
}
}, height=660, width=900) # end plotSample

} # end server

My First (R) Shiny App: An Annotated Tutorial

Image Credit: Doug Buckley

I’ve been meaning to learn Shiny for 2 years now… and thanks to a fortuitous email from @ImADataGuy this morning and a burst of wild coding energy about 5 hours ago, I am happy to report that I have completely fallen in love again. The purpose of this post is to share how I got my first Shiny app up and running tonight on localhost, how I deployed it to the ShinyApps service, and how you can create a “Hello World” style program of your own that actually works on data that’s meaningful to you.

If you want to create a “Hello World!” app with Shiny (and your own data!) just follow these steps:

0. Install R 3.2.0+ first! This will save you time.
1. I signed up for an account at ShinyApps.
2. Then I clicked the link in the email they sent me.
3. That allowed me to set up my location.
4. Then I followed the instructions in the shinyapps documentation. I had lots
of problems with devtools::install_github('rstudio/shinyapps') - had to go
into my R directory, manually delete RCurl and digest, then
reinstall both RCurl and digest... then installing shinyapps worked.
Note: this last command they tell you to do WILL NOT WORK because you do not have an app yet! 
If you try it, this is what you'll see:
> shinyapps::deployApp('path/to/your/app')
Error in shinyapps::deployApp("path/to/your/app") : 
C:\Users\Nicole\Documents\path\to\your\app does not exist
5. Then I installed rsconnect.
6. I clicked on my name and gravatar in the upper right hand corner of the window I had opened, and then clicked 
"tokens". I realized I'd already done this part, so I skipped down to read 
"A Demo App".
7. Then, I re-installed ggplot2 and shiny using this command:
install.packages(c('ggplot2', 'shiny'))
8. I created a new directory (C:/Users/Nicole/Documents/shinyapps) and used
setwd to get to it.
9. I pasted the demo code to create two files, 
server.R and ui.R, which I put into my new shinyapps directory 
under a subdirectory called demo. The subdirectory name IS your app name.
10. I typed runApp("demo") into my R console, and voila! The GUI appeared in 
my browser window on my localhost.
-- Don't just try to close the browser window to get the Shiny app 
to stop. R will hang. To get out of this, I had to use Task Manager and kill R.
--- Use the main menu, and do Misc -> Stop Current Computation
11. I did the same with the "Hello Shiny" code at 
But what I REALLY want is to deploy a hello world app with MY OWN data. You know, something that's 
meaningful to me. You probably want to do a test app with data that is meaningful to you... here's 
how you can do that.
12. A quick search shows that I need jennybc's (Github) googlesheets package to get 
data from Google Drive viewable in my new Shiny app.
13. So I tried to get the googlesheets package, but then 
found out it requires R version 3.2.0. If you already have 3.2.0 you can skip 
to step 16 now.
14. So I reinstalled R using the installr package (highly advised if you want to 
overcome the agony of upgrading on Windows) -- all it requires is that you type installR() -- really!
15. After installing R I restarted my machine. This is probably the first time in a month that 
I've shut all my browser windows, documents, spreadsheets, PDFs, and R sessions. I got the feeling 
that this made my computer happy.
16. Then, I created a Google Sheet with my data. While viewing that document, I went to 
File -> "Publish to the Web". I also discovered that my DOCUMENT KEY is that 
looooong string in the middle of the address, so I copied it for later.
17. Then I created a new directory in C:/Users/Nicole/Documents/shinyapps to test out 
jennybc's googlesheets package, and called it jennybc
18. I copied and pasted the code in her server.R file and ui.R file
into files with the same names in my jennybc directory
19. I went into my R console, used getwd() to make sure I was in the
C:/Users/Nicole/Documents/shinyapps directory, and then typed runApp("jennybc")
20. A browser window popped up on localhost with her test Shiny app! I played with it, and then 
closed that browser tab.
21. When I went back into the R console, it was still hanging, so I went to the menu bar 
to Misc -> Stop Current Computation. This brought my R prompt back.
22. Now it was time to write my own app. I looked around online and 
found a layout I liked, then copied the 
server.R and ui.R code into C:/Users/Nicole/Documents/shinyapps/my-hello -- 
and finally, tweaked the code and engaged in about 100 iterations of: 1) edit the two files, 
2) type runApp("my-hello") in the R console, 3) test my Shiny app in the 
browser window, 4) kill browser window, 5) do Misc -> Stop Current Computation 
in R. ALL of the computation happens in server.R, and all the display happens in ui.R:



## server.R (reconstructed structure; the library calls and parts of this
## listing were truncated in the original)
library(shiny)
library(googlesheets)

my_key <- "1Bs0OH6F-Pdw5BG8yVo2t_VS9Wq1F7vb_VovOmnDSNf4"
my_ss <- gs_key(my_key)
my_data <- gs_read(my_ss)

shinyServer(function(input, output, session) {
 output$plot <- renderPlot({
 my_data$type <- ordered(my_data$type,levels=c("PRE","POST"))
 # (the plotting call itself was truncated in the original; a simple stand-in:)
 plot(score ~ type, data = my_data)
 })
 output$summary <- renderPrint({
 aggregate(score~type,data=my_data, summary)
 })
 output$the_data <- renderDataTable({
 my_data
 })
})




## ui.R (reconstructed structure; some lines were truncated in the original)
library(shiny)

shinyUI(fluidPage(
 # Application title
 titlePanel("Nicole's First Shiny App"),
 sidebarLayout(
 # Sidebar with controls to select the random distribution type
 # and number of observations to generate. Note the use of the
 # br() element to introduce extra vertical spacing
 sidebarPanel(
     helpText("This is my first Shiny app!! It grabs some of my data 
from a Google Spreadsheet, and displays it here. I 
also used lots of examples from around the web."),
     h6(a("Click Here for a Tutorial on How It Was Made", 
          href="", target="_blank"))
 ),
 # Show a tabset that includes a plot, summary, and table view
 # of the generated distribution
 mainPanel(
    tabsetPanel(type = "tabs", 
    tabPanel("Plot", plotOutput("plot")), 
    tabPanel("Summary", verbatimTextOutput("summary")), 
    tabPanel("Table", DT::dataTableOutput("the_data"))
    )
 )
 )
))

23. Once I decided my app was good enough for my practice round, it was time to 
deploy it to the cloud.
24. This part of the process requires the shinyapps and dplyr 
packages, so be sure to install them:

25. To deploy, all I did was this:
    setwd("C:/Users/Nicole/Documents/shinyapps/my-hello/")
    shinyapps::deployApp()


Control Charts in R: A Guide to X-Bar/R Charts in the qcc Package

Statistical process control provides a mechanism for measuring, managing, and controlling processes. There are many different flavors of control charts, but if data are readily available, the X-Bar/R approach is often used. The following PDF describes X-Bar/R charts, shows you how to create them in R using the fantastic qcc package developed by Luca Scrucca, and explains how to interpret the results. Please let me know if you find it helpful!

Creating and Interpreting X-Bar/R Charts in R
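If you would rather start from code than the PDF, here is a minimal sketch using the qcc package with simulated data (the data set from the PDF is not reproduced here, so these numbers are made up for illustration):

```r
library(qcc)   # install.packages("qcc") first, if needed

# Simulated process data: 20 subgroups of 5 measurements each
set.seed(42)
d <- matrix(rnorm(20 * 5, mean = 10, sd = 0.5), nrow = 20)

qcc(d, type = "xbar")  # X-Bar chart: monitors the subgroup means
qcc(d, type = "R")     # R chart: monitors the within-subgroup ranges
```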

Simulation for Data Science With R

Image Credit: Doug Buckley

Hey everyone! I just wanted to give you the heads up on a book project that I’ve been working on (which should be available by Spring 2016). It’s all about using the R programming language to do simulation — which I think is one of the most powerful (and overlooked) tools in data science. Please feel free to email or write comments below if you have any suggestions for material you’d like to have included in it!

Originally, this project was supposed to be a secret… I’ve been working on it for about two years now, along with two other writing projects, and was approached in June by a traditional publishing company (who I won’t mention by name) who wanted to brainstorm with me about possibly publishing and distributing my next book. After we discussed the intersection of their wants and my needs, I prepared a full outline for them, and they came up with a work schedule and sent me a contract. While I was reading the contract, I got cold feet. It was the part about giving up “all moral rights” to my work, which sounds really frightening (and is not something I have to do under creative commons licensing, which I prefer). I shared the contract with a few colleagues and a lawyer, hoping that they’d say “don’t worry… it sounds a lot worse than it really is.” But the response I got was “it sounds pretty much like it is.”

While deliberating the past two weeks, I’ve been moving around a lot and haven’t been in touch with the publisher. I got an email this morning asking for my immediate decision on the matter (paraphrased, because there’s a legal disclaimer at the bottom of their emails that says “this information may be privileged” and I don’t want to violate any laws):

If we don’t hear from you, unfortunately we’ll be moving forward with this project. Do you still want to be on board?

The answer is YEAH – of COURSE I’m “on board” with my own project. But this really made me question the value of a traditional publisher over an indie publisher, or even self-publishing. And if they’re moving forward anyway, does that mean they take my outline (and supporting information about what I’m planning for each chapter) and just have someone else write to it? That doesn’t sound very nice. Since all the content on my blog is copyrighted by ME, I’m sharing the entire contents of what I sent to them on July 6th to establish the copyright on my outline in a public forum.

So if you see this chapter structure in someone ELSE’S book… you know what happened. The publisher came up with the idea for the main title (“Simulation for Data Science With R”), so I might publish under a different title that still has the words Simulation and R in it.

I may still publish with them, but I’ll make that decision after I have the full manuscript in place in a couple of months, and after I have the chance to reflect more on what’s best for everyone. What do you think is the best route forward?


Simulation for Data Science With R

Effective Data-Driven Decision Making for Business Analysis by Nicole M. Radziwill


Simulation is an essential (yet often overlooked) tool in data science – an interdisciplinary approach to problem-solving that leverages computer science, statistics, and domain expertise. This easy-to-understand introductory text for new and intermediate-level programmers, data scientists, and business analysts surveys five different simulation techniques (Monte Carlo, Discrete Event Simulation, System Dynamics, Agent-Based Modeling, and Resampling). The book focuses on practical and illustrative examples using the R Statistical Software, presented within the context of structured methodologies for problem solving (such as DMAIC and DMADV) that will enable you to more easily use simulation to make effective data-driven decisions. Readers should have exposure to basic concepts in programming but can be new to the R Statistical Software.


This book helps its readers 1) formulate research questions that simulation can help solve, 2) choose an appropriate problem-solving methodology, 3) choose one or more simulation techniques to help solve that problem, 4) perform basic simulations using the R Statistical Software, and 5) present results and conclusions clearly and effectively.

Objectives and achievements

The reader will:

  • Learn about essential and foundational concepts in modeling and simulation
  • Determine whether a simulation project is also a data science project
  • Choose an appropriate problem-solving methodology for effective data-driven decision making
  • Select suitable simulation techniques to provide insights about a given problem
  • Build and interpret the results from basic simulations using the R Statistical Software


  1. Introduction to Simulation for Data Science
  2. Foundations for Decision-Making


  1. Variation and Random Variable Generation
  2. Distribution Fitting
  3. Data Generating Processes


  1. Monte Carlo Simulation
  2. Discrete Event Simulation
  3. System Dynamics
  4. Agent-Based Modeling
  5. Resampling Methods


  1. Case Study 1: Possibly modeling opinion dynamics… specific example still TBD
  2. Case Study 2: A Really Practical Application of Simulation (especially for women)

Chapter 1: Introduction to Simulation for Data Science – 35 pages


This chapter explains the role of simulation in data science, and provides the context for understanding the differences between simulation techniques and their philosophical underpinnings.



Topics covered

Variation and Data-Driven Decision Making

What are Complex Systems?

What are Complex Dynamical Systems? What is systems thinking? Why is a systems perspective critical for data-driven decision making? Where do we encounter complex systems in business or day-to-day life?

What is Data Science?

A Taxonomy of Data Science. The Data Science Venn Diagram. What are the roles of modeling and simulation in data science projects? “Is it a Data Science Project?” — a Litmus Test. How modeling and simulation align with data science.

What is a Model?

Conceptual Models. Equations. Deterministic Models, Stochastic Models. Endogenous and Exogenous Variables.

What is Simulation?

Types of Simulation: Static vs. Dynamic, Stochastic vs. Deterministic, Discrete vs. Continuous, Terminating and Non-Terminating (Steady State). Philosophical Principles: Holistic vs. Reductionist, Kadanoff’s Universality, Parsimony, Sensitivity to Initial Conditions

Why Use Simulation?

Simulation and Big Data

Choosing the Right Simulation Technique

Skills learned

The reader will be able to:

  • Distinguish a model from a simulation
  • Explain how simulation can provide a valuable perspective in data-driven decision making
  • Understand how simulation fits into the taxonomy of data science
  • Determine whether a simulation project is also a data science project
  • Determine which simulation technique to apply to various kinds of real-world problems

Chapter 2: Foundations for Decision Making – 25 pages


In this chapter, the reader will learn how to plan and structure a simulation project to aid in the decision-making process as well as the presentation of results. The social context of data science will be explained, emphasizing the growing importance of collaborative data and information sharing.



Topics covered

The Social Context of Data Science

Ethics and Provenance. Data Curation. Replicability, Reproducibility, and Open Science. Open, interoperable frameworks for collaborative data and information sharing. Problem-Centric Habits of Mind.

Selecting Key Performance Indicators (KPIs)

Determining the Number of Replications

Methodologies for Simulation Projects

A General Problem-Solving Approach



Root Cause Analysis (RCA)


Verification and Validation Techniques

Output Analysis

Skills learned

The reader will be able to:

  • Plan a simulation study that is supported by effective and meaningful metadata
  • Select an appropriate methodology to guide the simulation project
  • Choose activities to ensure that verification and validation requirements are met
  • Construct confidence intervals for reporting simulation output

Chapter 3: Variability and Random Variate Generation – 25 pages


Simulation is powerful because it provides a way to closely examine the random behavior in systems that arises due to interdependencies and variability. This requires being able to generate random numbers and random variates that come from populations with known statistical characteristics. This chapter describes how random numbers and random variates are generated, and shows how they are applied to perform simple simulations.
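To give a flavor of the kind of example this chapter might include (a sketch of my own, not material from the book): the inverse transformation method turns uniform pseudorandom numbers into random variates from a target distribution by pushing them through the inverse CDF. For the exponential distribution with a given rate, the inverse CDF is -log(1-u)/rate:

```r
# Inverse transformation method (sketch): draw uniform random numbers,
# then transform them with the inverse CDF of the target distribution.
# For Exponential(rate): F(x) = 1 - exp(-rate*x), so Finv(u) = -log(1-u)/rate.
inverse.exp <- function(n, rate) {
  u <- runif(n)           # pseudorandom numbers on (0,1)
  -log(1 - u) / rate      # transformed into exponential random variates
}

set.seed(42)
x <- inverse.exp(100000, rate = 2)
mean(x)   # should be close to the theoretical mean, 1/rate = 0.5
```

The function name `inverse.exp` is mine, chosen for illustration; the same idea works for any distribution with an invertible CDF.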



Topics covered

Variability in Stochastic Processes

Why Generate Random Variables?

Pseudorandom Number Generation

Linear Congruential Generators

Inverse Transformation Method

Using sample for Discrete Distributions

Is this Sequence Random? Tests for Randomness

Autocorrelation, Frequency, Runs Tests. Using the randtests package

Tests for homogeneity

Simple Simulations with Random Numbers


Skills learned

The reader will be able to:

  • Generate pseudorandom numbers that are uniformly distributed
  • Use random numbers to generate random variates from a target distribution
  • Perform simple simulations using streams of random numbers

Chapter 4: Data Generating Processes – 30 pages


To execute a simulation, you must be able to generate random variates that represent the physical process you are trying to emulate. In this chapter, we cover several common statistical distributions that can be used to represent real physical processes, and explain which physical processes are often modeled using those distributions.
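As a rough illustration of the chapter’s theme (my own sketch with made-up parameters, not the book’s examples), base R’s r* generators map directly onto common physical processes:

```r
# Matching physical processes to data generating processes (sketch):
set.seed(7)
defects  <- rbinom(1000, size = 50, prob = 0.02)  # defective items per lot of 50
arrivals <- rpois(1000, lambda = 4)               # customers arriving per hour
service  <- rexp(1000, rate = 1/3)                # service times, mean of 3 minutes
lifetime <- rweibull(1000, shape = 1.5, scale = 10000)  # component hours to failure
mean(arrivals)   # should land close to lambda = 4
```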



Topics covered

What is a Data Generating Process?

Continuous, Discrete, and Multivariate Distributions

Discrete Distributions

Binomial Distribution

Geometric Distribution

Hypergeometric Distribution

Poisson Distribution

Continuous Distributions

Exponential Distribution

F Distribution

Lognormal Distribution

Normal Distribution

Student’s t Distribution

Uniform Distribution

Weibull Distribution

Chi-Square Distribution

Stochastic Processes

Markov. Poisson. Gaussian, Bernoulli. Brownian Motion. Random Walk.

Stationary and Autoregressive Processes.


Skills learned

The reader will be able to:

  • Understand the characteristics of several common discrete and continuous data generating processes
  • Use those distributions to generate streams of random variates
  • Describe several common types of stochastic processes

Chapter 5: Distribution Fitting – 30 pages


An effective simulation is driven by data generating processes that accurately reflect real physical populations. This chapter shows how to use a sample of data to determine which statistical distribution best represents the real population. The resulting distribution is used to generate random variates for the simulation.
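A minimal sketch of the workflow this chapter describes (my own example, not the book’s; `obs` stands in for a sample of real measurements): estimate parameters by maximum likelihood with MASS::fitdistr (MASS ships with base R), then check the fit:

```r
# Distribution fitting (sketch): estimate parameters, then test the fit.
set.seed(11)
obs <- rweibull(200, shape = 2, scale = 5)   # stand-in for real measurements

fit <- MASS::fitdistr(obs, "weibull")        # maximum likelihood estimates
fit$estimate                                 # estimated shape and scale

# Kolmogorov-Smirnov test against the fitted distribution (caveat: p-values
# are optimistic when the parameters were estimated from the same sample):
ks.test(obs, "pweibull", shape = fit$estimate["shape"],
        scale = fit$estimate["scale"])
```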



Topics covered

Why is Distribution Fitting Essential?

Techniques for Distribution Fitting

Shapiro-Wilk Test for Normality

Anderson-Darling Test

Lilliefors Test

Kolmogorov-Smirnov Test

Chi-Square Goodness of Fit Test

Other Goodness Of Fit Tests

Transforming Your Data

When There’s No Data, Use Interviews

Skills learned

The reader will be able to:

  • Use a sample of real data to determine which data generating process is required in a simulation
  • Transform data to find a more effective data generating process
  • Estimate appropriate distributions when samples of real data are not available

Chapter 6: Monte Carlo Simulation – 30 pages


This chapter explains how to set up and execute simple Monte Carlo simulations, using data generating processes to represent random inputs.
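The topic list below includes estimating pi by the hit-or-miss method; here is a quick sketch of that classic (my own code, not the book’s): scatter random points in the unit square, and the fraction landing inside the quarter circle estimates pi/4.

```r
# Hit-or-miss Monte Carlo (sketch): estimating pi.
set.seed(314)
n <- 1000000
x <- runif(n)
y <- runif(n)
hits <- sum(x^2 + y^2 <= 1)   # points inside the quarter circle
pi.hat <- 4 * hits / n
pi.hat   # should be close to 3.14159
```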



Topics covered

Anatomy of a Monte Carlo Project

The Many Flavors of Monte Carlo

The Hit-or-Miss Method

Example: Estimating Pi

Monte Carlo Integration

Example: Numerical Integration of y = x²

Estimating Variables

Monte Carlo Confidence Intervals

Example: Projecting Profits

Sensitivity Analysis

Example: Projecting Variability of Profits

Example: Projecting Yield of a Process

Markov Chain Monte Carlo

Skills learned

The reader will be able to:

  • Plan and execute a Monte Carlo simulation in R
  • Construct confidence intervals using the Monte Carlo method
  • Determine the sensitivity of process outputs and interpret the results

Chapter 7: Discrete Event Simulation – 30 pages


This chapter introduces discrete event simulation (DES), which models a system as a sequence of discrete events (such as arrivals, service completions, and departures) unfolding over time. The reader will learn how to characterize queueing systems, build and verify event calendars, and execute discrete event simulations in R.
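The topic list below includes an M/M/1 system in R; here is a bare-bones sketch of that model in base R (my own illustration — the chapter itself uses the queueing and simmer packages): Poisson arrivals, one server with exponential service times, and a recursion for when each customer starts service.

```r
# M/M/1 queue (sketch): single server, Poisson arrivals, exponential service.
set.seed(8)
lambda <- 0.8; mu <- 1.0            # arrival and service rates (utilization 0.8)
n <- 50000
arrive  <- cumsum(rexp(n, rate = lambda))   # arrival times
service <- rexp(n, rate = mu)               # service durations
start <- finish <- numeric(n)
start[1]  <- arrive[1]
finish[1] <- start[1] + service[1]
for (i in 2:n) {
  start[i]  <- max(arrive[i], finish[i-1])  # wait if the server is still busy
  finish[i] <- start[i] + service[i]
}
wait <- start - arrive
mean(wait)   # M/M/1 theory predicts lambda/(mu*(mu-lambda)) = 4
```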



Topics covered

Anatomy of a DES Project

Entities, Locations, Resources and Events

System Performance Metrics

Queuing Models and Kendall’s Notation

The Event Calendar

Manual Event Calendar Generation

Example: An M/M/1 system in R

Using the queueing package

Using the simmer package

Arrival-Counting Processes with the NHPoisson Package

Survival Analysis with the survival Package

Example: When Will the Bagels Run Out?

Skills learned

The reader will be able to:

  • Plan and execute discrete event simulation in R
  • Choose an appropriate model for a queueing problem
  • Manually generate an event calendar to verify simulation results
  • Use arrival counting processes for practical problem-solving
  • Execute a survival analysis in R and interpret the results

Chapter 8: System Dynamics – 30 pages


This chapter presents system dynamics, a powerful technique for characterizing the effects of multiple nested feedback loops in a dynamical system. This technique helps uncover the large-scale patterns in a complex system where interdependencies and variation are critical.
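A minimal sketch of the stock-and-flow idea (my own example in base R; the chapter itself uses the deSolve package for proper ODE integration): one stock with a logistic net inflow, stepped forward with Euler’s method.

```r
# System dynamics (sketch): a single stock with logistic growth,
# integrated by hand with Euler's method.
dt <- 0.1
times <- seq(0, 20, by = dt)
r <- 0.5; K <- 100                 # growth rate and carrying capacity
N <- numeric(length(times))
N[1] <- 5                          # initial stock
for (i in 2:length(times)) {
  inflow <- r * N[i-1] * (1 - N[i-1]/K)   # net flow into the stock
  N[i] <- N[i-1] + inflow * dt
}
plot(times, N, type = "l", main = "Stock approaching carrying capacity")
```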



Topics covered

Anatomy of a SD Project

The Law of Unintended Consequences and Policy Resistance

Introduction to Differential Equations

Causal Loop Diagrams (CLDs)

Stock and Flow Diagrams (SFDs)

Using the deSolve Package

Example: Lotka-Volterra Equations

Dynamic Archetypes

Linear Growth

Exponential Growth and Collapse

S-Shaped Growth

S-Shaped Growth with Overshoot

Overshoot and Collapse

Delays and Oscillations

Using the stellaR and simecol Packages

Skills learned

The reader will be able to:

  • Plan and execute a system dynamics project
  • Create causal loop diagrams and stock-and-flow diagrams
  • Set up simple systems of differential equations and solve them with deSolve in R
  • Predict the evolution of stocks using dynamic archetypes in CLDs
  • Convert STELLA models to R

Chapter 9: Agent-Based Modeling – 25 pages


Agent-Based Modeling (ABM) provides a unique perspective on simulation, illuminating the emergent behavior of the whole system by simply characterizing the rules by which each participant in the system operates. This chapter provides an overview of ABM, compares and contrasts it with the other simulation techniques, and demonstrates how to set up a simulation using an ABM in R.
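As a tiny taste of the agent-based idea (my own base-R sketch — the chapter itself uses RNetLogo): give each “turtle” one simple local rule (take a step in a random direction each tick) and watch a system-level pattern emerge.

```r
# Agent-based modeling (sketch): 100 turtle agents, each following one rule.
set.seed(99)
n.agents <- 100
n.ticks  <- 50
x <- numeric(n.agents)             # all agents start at the origin
y <- numeric(n.agents)
for (tick in 1:n.ticks) {
  heading <- runif(n.agents, 0, 2*pi)   # each agent picks a random heading
  x <- x + cos(heading)
  y <- y + sin(heading)
}
mean(sqrt(x^2 + y^2))   # emergent spread grows roughly like sqrt(n.ticks)
```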



Topics covered

Anatomy of an ABM Project

Emergent Behavior

PAGE (Percepts, Actions, Goals, and Environment)

Turtles and Patches

Using the RNetLogo package

Skills learned

The reader will be able to:

  • Plan and execute an ABM project in R
  • Create specifications for the ABM using PAGE

Chapter 10: Resampling – 25 pages


Resampling methods are related to Monte Carlo simulation, but serve a different purpose: to help us characterize a data generating process or make inferences about the population our data came from when all we have is a small sample. In this chapter, resampling methods (and some practical problems that use them) are explained.
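The topic list below includes permutation tests; here is a compact sketch of one (my own illustration with simulated groups): shuffle the group labels many times to build the null distribution of the difference in means.

```r
# Permutation test (sketch): is the difference between two group means real?
set.seed(5)
a <- rnorm(30, mean = 10)
b <- rnorm(30, mean = 11)
observed <- mean(b) - mean(a)
pooled <- c(a, b)

# Re-shuffle the group labels to build the null distribution:
diffs <- replicate(5000, {
  s <- sample(60)
  mean(pooled[s[31:60]]) - mean(pooled[s[1:30]])
})
p.value <- mean(abs(diffs) >= abs(observed))
p.value   # a small p suggests the group difference isn't due to chance
```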



Topics covered

Anatomy of a Resampling Project



Permutation Tests

Skills learned

The reader will be able to:

  • Plan and execute a resampling project in R
  • Understand how to select and use a resampling technique for real data

Chapter 11: Comparing the Simulation Techniques – 15 pages


In this chapter, the simulation techniques will be compared and contrasted in terms of their strengths, weaknesses, biases, and computational complexity.



Topics covered

TBD – at least two simulation approaches will be applied

Skills learned

The reader will learn how to:

  • Think about a simulation study holistically
  • Select an appropriate combination of techniques for a real simulation study

Logistic Growth, S Curves, Bifurcations, and Lyapunov Exponents in R

If you’ve ever wondered how logistic population growth (the Verhulst model), S curves, the logistic map, bifurcation diagrams, sensitive dependence on initial conditions, “orbits”, deterministic chaos, and Lyapunov exponents are related to one another… this post attempts to provide a simplified explanation(!) in just 10 steps, each with some code in R so you can explore it all yourself. I’ve included some code written by other people who have explored this problem (cited below) as portions of my own code.

It all starts with a hypothesized population… and a process where the size of the population changes over time. We want to understand how (and under what conditions) those changes occur, so we choose a model that characterizes population changes: the logistic growth model. It’s been used in biology, ecology, econometrics, marketing, and other areas.

1. The logistic growth model describes how the size of a population (N) changes over time (t), based on some maximum population growth rate (r). There is a limiting factor called the carrying capacity (K) which represents the total population that the environment could support, based on the amount of available resources. dN/dt is the rate of change of the population over time.
dN/dt = r * N * (1 - N/K)

2. You can simplify the logistic growth model by defining a new variable x to represent the portion of the population that’s alive, compared to the total population that the environment could support (and keep alive). So with x = N/K, you get a new differential equation in terms of x. Now we are looking at the rate of change of the population fraction over time. Once x = N/K = 1, the environment can’t support any more members in the population:
dx/dt = r * x * (1 - x)

3. You can solve this equation by integration! Then, you’ll have an expression that you can use to calculate x (which is still the population fraction) for any time t. This is called the sigmoid or (more commonly) the S Curve. To compute x at any time t, all we need to know is how big the population was when we started looking at it (x0) and the maximum growth rate r:
x(t) = 1 / (1 + ((1/x0) - 1) * exp(-r*t))

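That closed-form S curve is easy to compute and plot directly. A quick sketch (the function name s.curve is mine, for illustration):

```r
# The S curve (sketch): closed-form solution of the logistic growth model.
s.curve <- function(t, x0, r) {
  1 / (1 + ((1/x0) - 1) * exp(-r * t))
}

t <- seq(0, 10, 0.1)
plot(t, s.curve(t, x0 = 0.01, r = 1), type = "l",
     main = "Logistic (S) curve", ylab = "population fraction x")
```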
4. The equation for the S Curve is deterministic and continuous. If we want to solve it numerically, we have to discretize it by chopping up that continuous axis that contains time into little tiny pieces of time. That’s what produces the difference equation that we recognize as the logistic map. It’s a map because it “maps” each value of the sequence onto the next value in the sequence. As long as you know one of those values for x (indicated by the subscript n), you’ll be able to figure out the next value of x (indicated by the subscript n+1). The value x[n] is the population fraction of the current generation, and the value x[n+1] is the population fraction for the next generation. This makes the logistic map a Markov chain. If you plot x[n] on the x axis and x[n+1] on the y axis, this expression will produce the familiar upside down parabola:
x[n+1] = r * x[n] * (1 - x[n])

5. The logistic map behaves differently depending upon the maximum growth rate (r) that describes your population. This parameter is also called fecundity and represents how rabbit-like your population is at reproducing. The higher the r, the more productive, like rabbits (although I’m not sure precisely which r you’d choose if you were studying rabbits). Here is an R function that you can use to generate the last M iterations from a sequence of N total, developed and described at Mage’s Blog (I’ve named it logistic.map here):

logistic.map <- function(r, x, N, M) {
 ## r: bifurcation parameter (the maximum growth rate)
 ## x: initial value, something greater than 0 and less than 1
 ## N: number of iterations (total)
 ## M: number of iteration points to be returned
   z <- 1:N
   z[1] <- x
   for(i in c(1:(N-1))){
     z[i+1] <- r * z[i] * (1 - z[i])
   }
   ## Return the last M iterations
   z[c((N-M):N)]
}
6. The logistic map has many interesting properties, but here are two in particular (the first in Step 6 and the second in Step 8). First, for several values you can choose for r, the chain converges to a single value (or fixed point) when n gets really big. For other values of r, the value of x will eventually bounce between two values instead of converging (a limit cycle of 2). For other values of r, the value of x will eventually bounce between four values instead of converging. Sometimes, x will bounce around a near limitless collection of values (a condition called deterministic chaos). The eventual value (or collection of eventual values, if they bounce between values) is called an orbit. For example, when the growth rate r is 2.6, the logistic map rapidly converges to an orbit of about 0.615:

plot(logistic.map(2.6, .01, 20, 20), type="l")


7. Sometimes, it can be nice to take a look at how the values bounce around, and where they eventually converge (or not). To do this, we use cobweb diagrams (which are also sometimes called web diagrams). I adapted a function I found online to plot the behavior of the orbits for r=2.6, r=3.2, and r=3.9:

logistic.cobweb <- function(r) {
 # code adapted from a function found online
 N <- 100
 x <- seq(0, 1, length=N)
 x_next <- sapply(2:N, function(i) r*x[i]*(1-x[i]))
 plot(x[2:N], x_next, type="l", xlim=c(0,1), ylim=c(0,1), main=paste("r=",r),
 xlab=expression(x[t]), ylab=expression(x[t+1]), col="red", lwd=2)
 abline(0, 1)  # the diagonal x[t+1] = x[t]

 # start at a random spot on the x-axis and start with a vertical line:
 start <- runif(1, 0, 1)
 vert <- FALSE
 lines(x=c(start,start), y=c(0, r*start*(1-start)))
 for(i in 1:(2*N)) {
  if(vert) {
    lines(x=c(start,start), y=c(start, r*start*(1-start)))
    vert <- FALSE
  } else {
    lines(x=c(start, r*start*(1-start)), y=c(r*start*(1-start), r*start*(1-start)))
    vert <- TRUE
    start <- r*start*(1-start)
  }
 }
}

par(mfrow=c(1,3))
logistic.cobweb(2.6); logistic.cobweb(3.2); logistic.cobweb(3.9)



8. (Remember to reset your plot layout with par(mfrow=c(1,1)) before you continue.) Second, for some values of r, the logistic map shows sensitive dependence on initial conditions. For example, let’s see what happens for two different growth rates (r=3 and r=3.9) when we start one iteration with an x[n] of 0.5 COLORED BLACK, and another one with an x[n] of 0.5001 COLORED RED. It’s a small, small difference that can lead to big, BIG variations in the orbits. In the r=3 case, the chain produced by the logistic map with x[n] of 0.5 (in black) is IDENTICAL to the chain produced by the logistic map with x[n] of 0.5001 (in red). That’s why you can’t see the black… the values are the same! But for the r=3.9 case, the chain produced by the logistic map with x[n] of 0.5 (in black) RAPIDLY DIVERGES from the chain produced by the logistic map with x[n] of 0.5001 (in red). They are very different, despite a very tiny difference in initial conditions! The logistic map for r=3.9 shows a very sensitive dependence on initial conditions.

par(mfrow=c(2,1))
first  <- logistic.map(3, .5, 120, 100)
second <- logistic.map(3, .5001, 120, 100)
plot(1:length(first), first, type="l", main="r=3 is not sensitive to initial conditions")
lines(1:length(second), second, col="red")
first  <- logistic.map(3.9, .5, 120, 100)
second <- logistic.map(3.9, .5001, 120, 100)
plot(1:length(first), first, type="l", main="but r=3.9 is EXTREMELY sensitive")
lines(1:length(second), second, col="red")


9. For any chain, we can determine just how sensitive the logistic map is to initial conditions by looking at the Lyapunov exponent. Very simplistically, if the Lyapunov exponent is negative, the chain will converge to one or more fixed points for that value of r. If the Lyapunov exponent is positive, the chain will demonstrate deterministic chaos for that value of r. If the Lyapunov exponent is zero, there is a bifurcation: a 1-cycle is doubling to a 2-cycle, a 2-cycle is doubling to a 4-cycle, and so forth. The top chart shows an approximation of the Lyapunov exponent based on the first 500 iterations (ideally, you’d use an infinite number, but that would eat up too much computing time), and the bottom chart shows a bifurcation diagram. You’ll notice that the Lyapunov exponents are zero where a bifurcation occurs. To interpret the bifurcation diagram, just remember that each vertical slice through it represents the results of ONE COMPLETELY CONVERGED CHAIN from the logistic map. So it shows the results from many, many, many completely converged chains – and provides an excellent way for us to look at the behavior of MANY different types of populations in just one chart:

x <- seq(0, 4, 0.01)
n <- length(x)
XI <- NULL
lya <- numeric(n)
for (i in 1:n) {
 xi <- logistic.map(x[i], .01, 500, 500)
 XI <- rbind(XI, xi)
}
for (i in 1:n) {
 # average of log|d/dx (r*x*(1-x))| = log|r - 2*r*x| over the chain
 lya[i] <- mean(log(abs(x[i] - (2 * x[i] * XI[i,]))))
}
plot(x, lya, ylim=c(-4,1), xlim=c(0,4), type="l", main="Lyapunov Exponents for Logistic Map")
abline(h=0, lwd=2, col="red")

# the bifurcation diagram: one fully converged chain for each value of r
my.r <- seq(0, 4, by=0.003)
Orbit <- sapply(my.r, logistic.map, x=0.1, N=1000, M=300)
r <- sort(rep(my.r, 301))

par(mfrow=c(2,1))
plot(x, lya, ylim=c(-5,1), xlim=c(0,4), type="l", main="Lyapunov Exponents for Logistic Map")
abline(h=0, col="red", lwd=2)
abline(v=3, col="blue", lwd=2)
plot(r, Orbit, pch=".", cex=0.5, main="Bifurcation Diagram for r=0 to r=4 Logistic Maps")
abline(v=3, col="blue", lwd=2)


10. Notice that in the bifurcation diagram, we can easily see that when r is between 0 and 1, the population converges to extinction. This makes sense, because the growth rate is smaller than what’s required to sustain the size of the population. You might like to zoom in, though, and see what the orbits look like for some smaller portions of the diagram. Here’s how you can do it (but be sure to reset your graphics window with par(mfrow=c(1,1)) before you try it). Try changing the plot character (pch) too, or maybe the size of the characters with cex=0.2 or cex=0.5 in the last line:

# adapted from the bifurcation diagram code above
my.r <- seq(3.5, 4, by=0.003)
Orbit <- sapply(my.r, logistic.map, x=0.1, N=1000, M=300)
multiplier <- length(Orbit)/length(my.r)
r <- sort(rep(my.r, multiplier))
plot(r, Orbit, pch=".")



That’s it!

Find out more information on these other web pages, which are listed in order of difficulty level:
