Category Archives: Education

A Discrete Time Markov Chain (DTMC) SIR Model in R

Image Credit: Doug Buckley of http://hyperactive.to

There are many different techniques that can be used to model physical, social, economic, and conceptual systems. The purpose of this post is to show how the Kermack-McKendrick (1927) formulation of the SIR Model for studying disease epidemics (where S stands for Susceptible, I for Infected, and R for Recovered) can be easily implemented in R as a discrete time Markov Chain, using the markovchain package.

A Discrete Time Markov Chain (DTMC) is a model for a random process where one or more entities can change state between distinct timesteps. For example, in SIR, people can be labeled as Susceptible (haven’t gotten the disease yet, but aren’t immune), Infected (they’ve got the disease right now), or Recovered (they’ve had the disease, no longer have it, and can’t get it again because they’ve become immune). If they get the disease, they change states from Susceptible to Infected. If they get well, they change states from Infected to Recovered. It’s impossible to move from Susceptible to Recovered without first passing through the Infected state. It’s totally possible to stay in the Susceptible state between successive checks on the population, because there’s not a 100% chance you’ll actually be infected between any two timesteps. You might have a particularly good immune system, or maybe you’ve been hanging out by yourself for several days programming.

Discrete time means you’re not continuously monitoring the state of the people in the system. It would get really overwhelming if you had to ask everyone, every minute, “Are you sick yet? Did you get better yet?” It makes more sense to monitor individuals’ states at discrete intervals rather than continuously, for example, once a day. (Ozgun & Barlas (2009) provide a more extensive illustration of the difference between discrete and continuous modeling, using a simple queuing system.)

To create a Markov Chain in R, all you need are 1) the transition probabilities, or the chance that an entity will move from one state to another between successive timesteps, 2) the initial state (that is, how many entities are in each of the states at time t=0), and 3) the markovchain package in R. Be sure to install markovchain (it’s on CRAN) before moving forward:
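
install.packages("markovchain")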

Imagine that there’s a 10% infection rate, and a 20% recovery rate. That implies that, between successive timesteps, 90% of Susceptible people will remain Susceptible (the other 10% become Infected), and 80% of those who are Infected will remain Infected (the other 20% move to the Recovered category). 100% of those Recovered will stay Recovered. None of the people who are Recovered will become Susceptible again.

Say that you start with a population of 100 people, and only 1 is infected. That means your “initial state” is that 99 are Susceptible, 1 is Infected, and 0 are Recovered. Here’s how you set up your Markov Chain:

library(markovchain)

# transition matrix rows are the FROM states (S, I, R), and
# columns are the TO states, in the same order
mcSIR <- new("markovchain", states=c("S","I","R"),
    transitionMatrix=matrix(data=c(0.9,0.1,0,0,0.8,0.2,0,0,1),
    byrow=TRUE, nrow=3), name="SIR")

# initial state: 99 Susceptible, 1 Infected, 0 Recovered
initialState <- c(99,1,0)

At this point, you can ask R to see your transition matrix, which shows the probability of moving FROM each of the three states (that form the rows) TO each of the three states (that form the columns).

> show(mcSIR)
SIR
 A  3 - dimensional discrete Markov Chain with following states
 S I R 
 The transition matrix   (by rows)  is defined as follows
    S   I   R
S 0.9 0.1 0.0
I 0.0 0.8 0.2
R 0.0 0.0 1.0

You can also plot your transition probabilities:

plot(mcSIR,package="diagram")

[Figure: transition network for the SIR model, with arrows showing the transition probabilities between the S, I, and R states]

But all we’ve done so far is create our model. We haven’t yet run a simulation, which would show us how many people are in each of the three states at each discrete timestep. We can set up a data frame to contain a label for each timestep, and a count of how many people are in each state at that timestep. Then, we fill the data frame with the results after each timestep i, calculated as initialState*mcSIR^i:

timesteps <- 100
sir.df <- data.frame("timestep" = numeric(),
    "S" = numeric(), "I" = numeric(),
    "R" = numeric(), stringsAsFactors=FALSE)
for (i in 0:timesteps) {
    newrow <- as.list(c(i, round(as.numeric(initialState * mcSIR ^ i), 0)))
    sir.df[nrow(sir.df) + 1, ] <- newrow
}

Now that we have a data frame containing our SIR results (sir.df), we can display them to see what the values look like:

> head(sir.df)
  timestep  S  I  R
1        0 99  1  0
2        1 89 11  0
3        2 80 17  2
4        3 72 22  6
5        4 65 25 10
6        5 58 26 15

And then plot them to view our simulation results using this DTMC SIR Model:

# set ylim explicitly so the R counts (which eventually exceed the
# initial S count) aren't clipped by the default axis range
plot(sir.df$timestep,sir.df$S,ylim=c(0,100))
points(sir.df$timestep,sir.df$I, col="red")
points(sir.df$timestep,sir.df$R, col="green")

[Figure: DTMC SIR simulation results over 100 timesteps, with S in black, I in red, and R in green]

It’s also possible to use the markovchain package to identify elements of your system as it evolves over time:

> absorbingStates(mcSIR)
[1] "R"
> transientStates(mcSIR)
[1] "S" "I"
> steadyStates(mcSIR)
     S I R
[1,] 0 0 1

And you can calculate the first timestep that your Markov Chain reaches its steady state (the “time to absorption”), which your plot should corroborate:

> ab.state <- absorbingStates(mcSIR)
> occurs.at <- min(which(sir.df[,ab.state]==max(sir.df[,ab.state])))
> (sir.df[occurs.at,]$timestep)+1
[1] 58

You can use this code to change the various transition probabilities to see what the effects are on the outputs yourself (sensitivity analysis). There are also methods you can use to perform uncertainty analysis, e.g., putting confidence intervals around your transition probabilities. We won’t do either of these here, nor will we create a Shiny app to run this simulation, despite the significant temptation.
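
If you’d like a starting point for that kind of experimentation, here’s a minimal sketch that wraps the model construction in a helper function. The sir.chain function and its infect/recover arguments are my own naming for illustration, not part of the markovchain package:

# rebuild the SIR chain for any infection/recovery rate
sir.chain <- function(infect=0.1, recover=0.2) {
  new("markovchain", states=c("S","I","R"),
      transitionMatrix=matrix(c(1-infect, infect, 0,
                                0, 1-recover, recover,
                                0, 0, 1),
                              byrow=TRUE, nrow=3),
      name="SIR")
}

# for example, a faster-spreading disease: 30% infection rate
mcFast <- sir.chain(infect=0.3)
round(as.numeric(initialState * mcFast ^ 10), 0)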

My First (R) Shiny App: An Annotated Tutorial

Image Credit: Doug Buckley of http://hyperactive.to

I’ve been meaning to learn Shiny for 2 years now… and thanks to a fortuitous email from @ImADataGuy this morning and a burst of wild coding energy about 5 hours ago, I am happy to report that I have completely fallen in love again. The purpose of this post is to share how I got my first Shiny app up and running tonight on localhost, how I deployed it to the http://shinyapps.io service, and how you can create a “Hello World” style program of your own that actually works on data that’s meaningful to you.

If you want to create a “Hello World!” app with Shiny (and your own data!) just follow these steps:

0. Install R 3.2.0+ first! This will save you time.
1. I signed up for an account at http://shinyapps.io.
2. Then I clicked the link in the email they sent me.
3. That allowed me to set up my https://radziwill.shinyapps.io location.
4. Then I followed the instructions at https://www.shinyapps.io/admin/#/dashboard
(This page has SPECIAL SECRET INFO CUSTOMIZED JUST FOR YOU ON IT!!) I had lots 
of problems with devtools::install_github('rstudio/shinyapps') - Had to go 
into my R directory, manually delete RCurl and digest, then 
reinstall both RCurl and digest... then installing shinyapps worked.
Note: this last command they tell you to do WILL NOT WORK because you do not have an app yet! 
If you try it, this is what you'll see:
> shinyapps::deployApp('path/to/your/app')
Error in shinyapps::deployApp("path/to/your/app") : 
C:\Users\Nicole\Documents\path\to\your\app does not exist
5. Then I went to http://shiny.rstudio.com/articles/shinyapps.html and installed rsconnect.
6. I clicked on my name and gravatar in the upper right hand corner of the 
https://www.shinyapps.io/admin/#/dashboard window I had opened, and then clicked 
"tokens". I realized I'd already done this part, so I skipped down to read 
"A Demo App" on http://shiny.rstudio.com/articles/shinyapps.html
7. Then, I re-installed ggplot2 and shiny using this command:
install.packages(c('ggplot2', 'shiny'))
8. I created a new directory (C:/Users/Nicole/Documents/shinyapps) and used
setwd to get to it:
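setwd("C:/Users/Nicole/Documents/shinyapps")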
9. I pasted the code at http://shiny.rstudio.com/articles/shinyapps.html to create two files, 
server.R and ui.R, which I put into my new shinyapps directory 
under a subdirectory called demo. The subdirectory name IS your app name.
10. I typed runApp("demo") into my R console, and voila! The GUI appeared in 
my browser window on my localhost.
-- Don't just try to close the browser window to get the Shiny app 
to stop. R will hang. To get out of this, I had to use Task Manager and kill R.
--- Use the main menu, and do Misc -> Stop Current Computation
11. I did the same with the "Hello Shiny" code at http://shiny.rstudio.com/articles/shinyapps.html. 
But what I REALLY want is to deploy a hello world app with MY OWN data. You know, something that's 
meaningful to me. You probably want to do a test app with data that is meaningful to you... here's 
how you can do that.
12. A quick search shows that I need jennybc's (Github) googlesheets package to get 
data from Google Drive viewable in my new Shiny app.
13. So I tried to get the googlesheets package with this command:
devtools::install_github('jennybc/googlesheets')
but then found out it requires R version 3.2.0. If you already have 3.2.0 you can skip 
to step 16 now.
14. So I reinstalled R using the installr package (highly advised if you want to 
overcome the agony of upgrading on Windows). 
See http://www.r-statistics.com/2013/03/updating-r-from-r-on-windows-using-the-installr-package/
for info -- all it requires is that you type updateR() -- really!
15. After installing R I restarted my machine. This is probably the first time in a month that 
I've shut all my browser windows, documents, spreadsheets, PDFs, and R sessions. I got the feeling 
that this made my computer happy.
16. Then, I created a Google Sheet with my data. While viewing that document, I went to 
File -> "Publish to the Web". I also discovered that my DOCUMENT KEY is that 
looooong string in the middle of the address, so I copied it for later:
1Bs0OH6F-Pdw5BG8yVo2t_VS9Wq1F7vb_VovOmnDSNf4
17. Then I created a new directory in C:/Users/Nicole/Documents/shinyapps to test out 
jennybc's googlesheets package, and called it jennybc
18. I copied and pasted the code in her server.R file and ui.R file
from https://github.com/jennybc/googlesheets/tree/master/inst/shiny-examples/01_read-public-sheet 
into files with the same names in my jennybc directory
19. I went into my R console, used getwd() to make sure I was in the
C:/Users/Nicole/Documents/shinyapps directory, and then typed
 runApp("jennybc")
20. A browser window popped up on localhost with her test Shiny app! I played with it, and then 
closed that browser tab.
21. When I went back into the R console, it was still hanging, so I went to the menu bar 
to Misc -> Stop Current Computation. This brought my R prompt back.
22. Now it was time to write my own app. I went to http://shiny.rstudio.com/gallery/ and
found a layout I liked (http://shiny.rstudio.com/gallery/tabsets.html), then copied the 
server.R and ui.R code into C:/Users/Nicole/Documents/shinyapps/my-hello -- 
and finally, tweaked the code and engaged in about 100 iterations of: 1) edit the two files, 
2) type runApp("my-hello") in the R console, 3) test my Shiny app in the 
browser window, 4) kill browser window, 5) do Misc -> Stop Current Computation 
in R. ALL of the computation happens in server.R, and all the display happens in ui.R:

server.R:

library(shiny)
library(googlesheets)
library(DT)

# read my data from the published Google Sheet, using the
# document key I copied in step 16
my_key <- "1Bs0OH6F-Pdw5BG8yVo2t_VS9Wq1F7vb_VovOmnDSNf4"
my_ss <- gs_key(my_key)
my_data <- gs_read(my_ss)

shinyServer(function(input, output, session) {

  # boxplots of scores, with PRE displayed before POST
  output$plot <- renderPlot({
    my_data$type <- ordered(my_data$type, levels=c("PRE","POST"))
    boxplot(my_data$score ~ my_data$type, ylim=c(0,100), boxwex=0.6)
  })

  # summary statistics of the scores, grouped by type
  output$summary <- renderPrint({
    aggregate(score ~ type, data=my_data, summary)
  })

  # interactive table of the raw data
  output$the_data <- renderDataTable({
    datatable(my_data)
  })

})

ui.R:

library(shiny)
library(shinythemes)
library(googlesheets)

shinyUI(fluidPage(

  # Application title
  titlePanel("Nicole's First Shiny App"),

  # Sidebar with help text and links; br() adds extra vertical spacing
  sidebarLayout(
    sidebarPanel(
      helpText("This is my first Shiny app!! It grabs some of my data
from a Google Spreadsheet, and displays it here. I also used lots
of examples from"),
      h6(a("http://shiny.rstudio.com/gallery/",
        href="http://shiny.rstudio.com/gallery/", target="_blank")),
      br(),
      h6(a("Click Here for a Tutorial on How It Was Made",
        href="https://qualityandinnovation.com/2015/12/08/my-first-shiny-app-an-annotated-tutorial/",
        target="_blank")),
      br()
    ),

    # Show a tabset with plot, summary, and table views of the data
    mainPanel(
      tabsetPanel(type = "tabs",
        tabPanel("Plot", plotOutput("plot")),
        tabPanel("Summary", verbatimTextOutput("summary")),
        tabPanel("Table", DT::dataTableOutput("the_data"))
      )
    )
  )
))


23. Once I decided my app was good enough for my practice round, it was time to 
deploy it to the cloud.
24. This part of the process requires the shinyapps and dplyr 
packages, so be sure to install them:

devtools::install_github('hadley/dplyr')
library(dplyr)
devtools::install_github('rstudio/shinyapps')
library(shinyapps)
25. To deploy, all I did was this:
setwd("C:/Users/Nicole/Documents/shinyapps/my-hello/")
deployApp()

CHECK OUT MY SHINY APP!!

What if Your Job Was Focused on Play?

James Siegal (picture from his Twitter profile, @jsiegal at http://twitter.com/jsiegal)

Last weekend, I had the opportunity to talk to James Siegal, the President of KaBOOM! – a non-profit whose mission is lighthearted, but certainly not frivolous: to bring balanced and active play into the daily lives of all kids! James is another new Business Innovation Factory (BIF) storyteller for 2015… and I wanted to find out how I could learn from his experiences to bring a sense of play into the work environment. (For me, that’s at a university, interacting with students on a daily basis.)

Over the past 20 years, KaBOOM! has built thousands of playgrounds, focusing on children growing up in poverty. By enlisting the help of over a million volunteers, James and his organization have mobilized communities using a model that starts with kids designing their dream playgrounds. It’s a form of crowdsourced placemaking.

Now, KaBOOM! is thinking about a vision that’s a little broader: driving social change at the city level. Doing this, they’ve found, requires answering one key question: How can you integrate play into the daily routine for kids and families? If play is a destination, there are “hassle factors” that must be overcome: safety, travel time, good lighting, and restroom facilities, for starters. So, in addition to building playgrounds, KaBOOM! is challenging cities to think about integrating play everywhere — on the sidewalk, at the bus stop, and beyond.

How can this same logic apply to organizations integrating play into their cultures? Although KaBOOM! focuses on kids, he had some more generalizable advice:

  • The desire for play has to be authentic, not forced. “We truly value kids, and we truly value families. Our policies and our culture strive to reflect that.” What does your organization value at its core? Seek to amplify the enjoyment of that.
  • “We take our work really seriously,” he said. “We don’t take ourselves too seriously. You have to leave your ego at the door.” Can your organization engage in more playful collaboration?
  • We drive creativity out of kids as they grow older, he noted. “Kids expect to play everywhere,” and so even ordinary elements like sidewalks can turn into experiences. (This reminded me of how people decorate the Porta-Potties at Burning Man with lights and music… although I wouldn’t necessarily do the same thing to the restrooms at my university, it did make me think about how we might make ordinary places or situations more fun for our students.)

KaBOOM! is such a unique organization that I had to ask James: what’s the most amazing thing you’ve ever observed in your role as President? He says it’s something that hasn’t just happened once… but happens every time KaBOOM! organizes a new playground build. When people from diverse backgrounds come together with a strong shared mission, vision, and purpose, you foster intense community engagement that yields powerful, tangible results — and this is something that so many organizations strive to achieve.

If you haven’t made plans already to hear James and the other storytellers at BIF, there may be a few tickets left — but this event always sells out! Check the BIF registration page and share a memorable experience with the BIF community this year: http://www.businessinnovationfactory.com/summit/register

A Chat with Jaime Casap, Google’s Chief Education Evangelist

“The classroom of the future does not exist!”

That’s the word from Jaime Casap (@jcasap), Google’s Chief Education Evangelist — and a highly anticipated new Business Innovation Factory (BIF) storyteller for 2015. In advance of the summit, which takes place on September 16 and 17, Morgan and I had the opportunity to chat with Jaime about a form of business model innovation that’s close to our hearts – improving education. He’s a native New Yorker, so he’s naturally outspoken and direct. But his caring and considerate tone makes it clear he’s got everyone’s best interests at heart.

At Google, he’s the connector and boundary spanner… the guy the organization trusts to “predict the future” where education is concerned. He makes sure that the channels of communication are open between everyone working on education-related projects. Outside of Google, he advocates smart and innovative applications of technology in education that will open up educational opportunities for everyone.  Most recently, he visited the White House on this mission.

The current educational system is not broken, he says. It’s doing exactly what it was designed to do: prepare workers for a hierarchical, industrialized production economy. The problem is that the system can’t be high-performing, because it’s not doing what we need it to do for the coming decades: leveraging the skills and capabilities of everyone.

He points out that low-income minorities now have a 9% chance of graduating from college… whereas a couple decades ago, they had a 6% chance. This startling statistic reflects an underlying deficiency in how education is designed and delivered in this country today.

So how do we fix it?

“Technology gives us the ability to question everything,” he says.  As we shift to performance-based assessments, we can create educational experiences that are practical, iterative, and focused on continuous improvement — where we measure iteration, innovation, and sustained incremental progress.

Measuring these, he says, will be a lot more interesting than what we tend to measure now: whether a learner gets something right the first time — or how long it took for a competency to emerge. From this new perspective, we’ll finally be able to answer questions like: What is an excellent school? What does a high-performing educational system look (and feel) like?

Jaime’s opportunity-driven vision for inclusiveness  is an integral part of Google’s future. And you can hear more about his personal story and how it shaped this vision next month at BIF.

If you haven’t made plans already to hear Jaime and the other storytellers at BIF, there may be a few tickets left — but this event always sells out! Check the BIF registration page and share a memorable experience with the BIF community this year: http://www.businessinnovationfactory.com/summit/register

Quality and Diversity, Especially Women in Tech

Image Credit: Doug Buckley of http://hyperactive.to

The newly launched R Consortium has announced its inaugural Board members, and not one of them is a woman. (Even more unfortunately, I don’t think any of them are active R users; although I’m sure he’s used it, the new President’s bio establishes him as a SAS and S-PLUS user.)

Although I’m sure the lack of diversity is an oversight (as it so often is), I’ve gotten my knickers in a knot a lot more about this issue lately. It’s probably just because I’m getting older (I’ll be 40 next year), but it’s also due to the fact that I’ve been reflecting an awful lot more lately: about what I’ve done, and what I’ve chosen not to do. About how I’ve struggled, and the battles I’ve chosen (versus those I’ve chosen to ignore). About how the subtle and unspoken climate for women in technology is keeping them out, and chasing them away, even though the industry needs more of them.

I really love programming. I’ve been doing it since 1982, when I realized that I could make my Atari 800 beep on command.

But in the workplace, I never really felt comfortable as a programmer. Whether they intended to or not, male colleagues always gave off a vibe of mistrust when they integrated my code… they always had a better way to design a new module, or a better approach to resolve a troubleshooting issue. When I got an instrumentation job that required field work on the hardware, I’d hear comments like “maybe you can stay here… girls don’t like to get dirty.” I felt uncomfortable geeking out with other women because I even felt like I’d be judged by them… like if they were some technical rock star, they would find my skills an embarrassment to other women like themselves who were trying to become experts.

So I went into software development management, where my role was much more accepted. My job was to let the coders do their job, and just keep everyone else out of their hair. I remember hearing comments like “you know a lot more about code than I thought you would.” I wanted to get a lot deeper into the technical aspects of the work, but I never felt like one of the guys. So I stopped trying.

Even while working as a manager, the organizations I was a part of were always male-dominated, in both the hierarchy and the style and tone of the work environment. (It was much like the masculine, emotionally void environment of so many of the classrooms I’d spent time in during my youth.) I felt lots of pressure to be firm and decisive, to never show emotions, and to work a 60 hour week even when I had a newborn at home. When I was firm and unyielding, I was called “difficult” and “strident.” I changed my approach and became “not assertive enough.” The women who I saw as being successful were all decidedly masculine, and I couldn’t transform my personality to become an ultra-productive, emotion-suppressing machine. (I’ve got the personality of an artist, and I’ve got to flow with my ideas and inspiration.)

Eventually I lost my mojo, switched careers entirely and went into higher education. (What do I teach? Mostly R… so I’m having fun, and I get to code pretty much every day.) But I still fantasize about getting back into the technical workforce and being one of those rare women leaders in technology (which I try to rationalize is not that rare at all, because I know plenty of women scientists, engineers, and technicians). But yeah, comparatively, we are a minority.

My situation is not unique. So why does this tend to happen? Gordon Hunt of Silicon Republic reports that gender stereotypes, a small talent pool, and in-group favoritism are to blame. I’ll agree with the gender stereotyping – even women do it to each other. My college roommate called me “Nerdcole” and it was sort of endearing, and sort of not. As a hiring manager, I remember being surprised every time a resume from a woman crossed my email box, and giving it a second look no matter what. I remember feeling guilty every time I thought “oh, well, she can’t be as serious about doing this as the guys are.” As for in-group favoritism, I think it’s hard not to favor naturally masculine people for jobs in a naturally masculine environment. 

The role of diversity in achieving quality and stimulating innovation has not been deeply explored in the research. Doing a quick literature search, I could only find a few examples. Liang et al. (2013) found that diversity does influence innovation, but due to inconsistent outcomes they couldn’t recommend a management intervention. Feldman & Audretsch (1999) found that more innovation occurs in cities because of greater diversity. Ostergaard et al. (2011) explored the breadth of a firm’s knowledge base and its influence on innovation. And in one of my favorite papers ever, Bassett-Jones (2005) explains that diversity creates a “combustible cocktail of creative tension” that, although difficult to manage, ultimately enhances a firm’s innovation performance.

I found no papers that looked at a link between diversity and quality performance.

But I would love to have a combustible cocktail of creative tension right now.

A 15-Week Course to Introduce Machine Learning and Intelligent Systems in R

Every fall, I teach a survey course for advanced undergraduates that covers one of the most critical themes in data science: intelligent systems. According to the IEEE, these are “systems that perceive, reason, learn, and act intelligently.” While data science is focused on analyzing data (often quite a lot of it) to make effective data-driven decisions, intelligent systems use those decisions to accomplish goals. As more and more devices join the Internet of Things (IoT), collecting data and sharing it with other “things” to make even more complex decisions, the role of intelligent systems will become even more pronounced.

So by the end of my course, I want students to have some practical skills that will be useful in analyzing, specifying, building, testing, and using intelligent systems:

  • Know whether a system they’re building (or interacting with) is intelligent… and how it could be made more intelligent
  • Be sensitized to ethical, social, political, and legal aspects of building and using intelligent systems 
  • Use regression techniques to uncover relationships in data using R (including linear, nonlinear, and neural network approaches)
  • Use classification and clustering methods to categorize observations (neural networks, k-means/KNN, Naive Bayes, support vector machines)
  • Be able to handle structured and unstructured data, using both supervised and unsupervised approaches
  • Understand what “big data” is, know when (and when not) to use it, and be familiar with some tools that help them deal with it

My course uses Brett Lantz’s VERY excellent book, Machine Learning with R (now also available in Kindle format), which I praise effusively at https://qualityandinnovation.com/2014/04/14/the-best-book-ever-on-machine-learning-and-intelligent-systems-in-r/

One of the things I like the MOST about my class is that we actually cover the link between how your brain works and how neural networks are set up. (Other classes and textbooks typically just show you a picture of a neuron superimposed with inputs, a summation, an activation, and outputs, implying, “See? They’re pretty much the same!”) But it goes much deeper than this… we actually model error-correction learning and observational learning through the different algorithms we employ. To make this point real, we have an amazing guest lecture every year by Dr. Anne Henriksen, who is also a faculty member in the Department of Integrated Science and Technology at JMU. She also does research in neuroscience at the University of Virginia. After we do an exercise where we use a spreadsheet to iteratively determine the equation for a single-layer perceptron’s decision boundary (a small R version of that exercise is sketched below), we watch a video by Dr. Mark Gluck that shows how what we’re doing is essentially error-correction learning… and then he explains the chemistry that supports the process. We’re going to videotape Anne’s lecture this fall so you can see it!
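
If you’d like to try that exercise in code instead of a spreadsheet, here’s a minimal R sketch of error-correction learning for a single-layer perceptron. The toy dataset (the AND function), learning rate, and epoch count are my own illustrative choices, not from the course materials:

# error-correction learning for a single-layer perceptron,
# learning the AND function; column 1 of X is the bias input
X <- matrix(c(1,0,0, 1,0,1, 1,1,0, 1,1,1), ncol=3, byrow=TRUE)
y <- c(0, 0, 0, 1)   # target outputs
w <- rep(0, 3)       # initial weights
eta <- 0.1           # learning rate

for (epoch in 1:25) {
  for (i in 1:nrow(X)) {
    yhat <- as.numeric(sum(w * X[i,]) > 0)  # step activation
    w <- w + eta * (y[i] - yhat) * X[i,]    # error-correction update
  }
}

w  # the decision boundary is w[1] + w[2]*x1 + w[3]*x2 = 0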

Here is the syllabus I am using for Fall 2015. Please feel free to use it (in full or in part) if you are planning a similar class… but do let me know!

Simulation for Data Science With R

Image Credit: Doug Buckley of http://hyperactive.to

Hey everyone! I just wanted to give you the heads up on a book project that I’ve been working on (which should be available by Spring 2016). It’s all about using the R programming language to do simulation — which I think is one of the most powerful (and overlooked) tools in data science. Please feel free to email or write comments below if you have any suggestions for material you’d like to have included in it!

Originally, this project was supposed to be a secret… I’ve been working on it for about two years now, along with two other writing projects, and was approached in June by a traditional publishing company (who I won’t mention by name) who wanted to brainstorm with me about possibly publishing and distributing my next book. After we discussed the intersection of their wants and my needs, I prepared a full outline for them, and they came up with a work schedule and sent me a contract. While I was reading the contract, I got cold feet. It was the part about giving up “all moral rights” to my work, which sounds really frightening (and is not something I have to do under Creative Commons licensing, which I prefer). I shared the contract with a few colleagues and a lawyer, hoping that they’d say “don’t worry… it sounds a lot worse than it really is.” But the response I got was: it sounds pretty much like it is.

While deliberating the past two weeks, I’ve been moving around a lot and haven’t been in touch with the publisher. I got an email this morning asking for my immediate decision on the matter (paraphrased, because there’s a legal disclaimer at the bottom of their emails that says “this information may be privileged” and I don’t want to violate any laws):

If we don’t hear from you, unfortunately we’ll be moving forward with this project. Do you still want to be on board?

The answer is YEAH – of COURSE I’m “on board” with my own project. But this really made me question the value of a traditional publisher over an indie publisher, or even self-publishing. And if they’re moving forward anyway, does that mean they take my outline (and supporting information about what I’m planning for each chapter) and just have someone else write to it? That doesn’t sound very nice. Since all the content on my blog is copyrighted by ME, I’m sharing the entire contents of what I sent to them on July 6th to establish the copyright on my outline in a public forum.

So if you see this chapter structure in someone ELSE’S book… you know what happened. The publisher came up with the idea for the main title (“Simulation for Data Science With R”), so I might publish under a different title that still has the words Simulation and R in it.

I may still publish with them, but I’ll make that decision after I have the full manuscript in place in a couple months. And after I have the chance to reflect more on what’s best for everyone. What do you think is the best route forward?

Simulation for Data Science With R

Effective Data-Driven Decision Making for Business Analysis by Nicole M. Radziwill

Audience

Simulation is an essential (yet often overlooked) tool in data science – an interdisciplinary approach to problem-solving that leverages computer science, statistics, and domain expertise. This easy-to-understand introductory text for new and intermediate-level programmers, data scientists, and business analysts surveys five different simulation techniques (Monte Carlo, Discrete Event Simulation, System Dynamics, Agent-Based Modeling, and Resampling). The book focuses on practical and illustrative examples using the R Statistical Software, presented within the context of structured methodologies for problem solving (such as DMAIC and DMADV) that will enable you to more easily use simulation to make effective data-driven decisions. Readers should have exposure to basic concepts in programming but can be new to the R Statistical Software.

Mission

This book helps its readers 1) formulate research questions that simulation can help solve, 2) choose an appropriate problem-solving methodology, 3) choose one or more simulation techniques to help solve that problem,  4) perform basic simulations using the R Statistical Software, and 5) present results and conclusions clearly and effectively.

Objectives and achievements

The reader will:

  • Learn about essential and foundational concepts in modeling and simulation
  • Determine whether a simulation project is also a data science project
  • Choose an appropriate problem-solving methodology for effective data-driven decision making
  • Select suitable simulation techniques to provide insights about a given problem
  • Build and interpret the results from basic simulations using the R Statistical Software

SECTION I: BASIC CONCEPTS

  1. Introduction to Simulation for Data Science
  2. Foundations for Decision-Making
  3. SECRET NEW CHAPTER THAT YOU WILL BE REALLY EXCITED ABOUT

SECTION II: STOCHASTIC PROCESSES

  1. Variability and Random Variate Generation
  2. Data Generating Processes
  3. Distribution Fitting

SECTION III: SIMULATION TECHNIQUES

  1. Monte Carlo Simulation
  2. Discrete Event Simulation
  3. System Dynamics
  4. Agent-Based Modeling
  5. Resampling Methods
  6. SECRET NEW CHAPTER THAT YOU WILL BE REALLY EXCITED ABOUT

SECTION IV: CASE STUDIES

  1. Case Study 1: Possibly modeling opinion dynamics… specific example still TBD
  2. Case Study 2: A Really Practical Application of Simulation (especially for women)

Chapter 1: Introduction to Simulation for Data Science – 35 pages

Description

This chapter explains the role of simulation in data science, and provides the context for understanding the differences between simulation techniques and their philosophical underpinnings.

Level

BASIC

Topics covered

Variation and Data-Driven Decision Making

What are Complex Systems?

What are Complex Dynamical Systems? What is systems thinking? Why is a systems perspective critical for data-driven decision making? Where do we encounter complex  systems in business or day-to-day life?

What is Data Science?

A Taxonomy of Data Science. The Data Science Venn Diagram. What are the roles of modeling and simulation in data science projects? “Is it a Data Science Project?” — a Litmus Test. How modeling and simulation align with data science.

What is a Model?

Conceptual Models. Equations. Deterministic Models, Stochastic Models. Endogenous and Exogenous Variables.

What is Simulation?

Types of Simulation: Static vs. Dynamic, Stochastic vs. Deterministic, Discrete vs. Continuous, Terminating and Non-Terminating (Steady State). Philosophical Principles: Holistic vs. Reductionist, Kadanoff’s Universality, Parsimony, Sensitivity to Initial Conditions

Why Use Simulation?

Simulation and Big Data

Choosing the Right Simulation Technique

Skills learned

The reader will be able to:

  • Distinguish a model from a simulation
  • Explain how simulation can provide a valuable perspective in data-driven decision making
  • Understand how simulation fits into the taxonomy of data science
  • Determine whether a simulation project is also a data science project
  • Determine which simulation technique to apply to various kinds of real-world problems

Chapter 2: Foundations for Decision Making – 25 pages

Description

In this chapter, the reader will learn how to plan and structure a simulation project to aid in the decision-making process as well as the presentation of results. The social context of data science will be explained, emphasizing the growing importance of collaborative data and information sharing.

Level

BASIC

Topics covered

The Social Context of Data Science

Ethics and Provenance. Data Curation. Replicability, Reproducibility, and Open Science. Open, interoperable frameworks for collaborative data and information sharing. Problem-Centric Habits of Mind.

Selecting Key Performance Indicators (KPIs)

Determining the Number of Replications

Methodologies for Simulation Projects

A General Problem-Solving Approach

DMAIC

DMADV

Root Cause Analysis (RCA)

PDSA

Verification and Validation Techniques

Output Analysis

Skills learned

The reader will be able to:

  • Plan a simulation study that is supported by effective and meaningful metadata
  • Select an appropriate methodology to guide the simulation project
  • Choose activities to ensure that verification and validation requirements are met
  • Construct confidence intervals for reporting simulation output

Chapter 3: Variability and Random Variate Generation – 25 pages

Description

Simulation is powerful because it provides a way to closely examine the random behavior in systems that arises due to interdependencies and variability. This requires being able to generate random numbers and random variates that come from populations with known statistical characteristics. This chapter describes how random numbers and random variates are generated, and shows how they are applied to perform simple simulations.

Level

MEDIUM

Topics covered

Variability in Stochastic Processes

Why Generate Random Variables?

Pseudorandom Number Generation

Linear Congruential Generators

Inverse Transformation Method

Using sample for Discrete Distributions

Is this Sequence Random? Tests for Randomness

Autocorrelation, Frequency, Runs Tests. Using the randtests package

Tests for Homogeneity

Simple Simulations with Random Numbers

Skills learned

The reader will be able to:

  • Generate pseudorandom numbers that are uniformly distributed
  • Use random numbers to generate random variates from a target distribution
  • Perform simple simulations using streams of random numbers

Chapter 4: Data Generating Processes – 30 pages

Description

To execute a simulation, you must be able to generate random variates that represent the physical process you are trying to emulate. In this chapter, we cover several common statistical distributions that can be used to represent real physical processes, and explain which physical processes are often modeled using those distributions.

Level

MEDIUM

Topics covered

What is a Data Generating Process?

Continuous, Discrete, and Multivariate Distributions

Discrete Distributions

Binomial Distribution

Geometric Distribution

Hypergeometric Distribution

Poisson Distribution

Continuous Distributions

Exponential Distribution

F Distribution

Lognormal Distribution

Normal Distribution

Student’s t Distribution

Uniform Distribution

Weibull Distribution

Chi² Distribution

Stochastic Processes

Markov. Poisson. Gaussian, Bernoulli. Brownian Motion. Random Walk.

Stationary and Autoregressive Processes.

Skills learned

The reader will be able to:

  • Understand the characteristics of several common discrete and continuous data generating processes
  • Use those distributions to generate streams of random variates
  • Describe several common types of stochastic processes

Chapter 5: Distribution Fitting – 30 pages

Description

An effective simulation is driven by data generating processes that accurately reflect real physical populations. This chapter shows how to use a sample of data to determine which statistical distribution best represents the real population. The resulting distribution is used to generate random variates for the simulation.

Level

MEDIUM

Topics covered

Why is Distribution Fitting Essential?

Techniques for Distribution Fitting

Shapiro-Wilk Test for Normality

Anderson-Darling Test

Lilliefors Test

Kolmogorov-Smirnov Test

Chi² Goodness of Fit Test

Other Goodness Of Fit Tests

Transforming Your Data

When There’s No Data, Use Interviews

Skills learned

The reader will be able to:

  • Use a sample of real data to determine which data generating process is required in a simulation
  • Transform data to find a more effective data generating process
  • Estimate appropriate distributions when samples of real data are not available

Chapter 6: Monte Carlo Simulation – 30 pages

Description

This chapter explains how to set up and execute simple Monte Carlo simulations, using data generating processes to represent random inputs.

Level

ADVANCED

Topics covered

Anatomy of a Monte Carlo Project

The Many Flavors of Monte Carlo

The Hit-or-Miss Method

Example: Estimating Pi

Monte Carlo Integration

Example: Numerical Integration of y = x²

Estimating Variables

Monte Carlo Confidence Intervals

Example: Projecting Profits

Sensitivity Analysis

Example: Projecting Variability of Profits

Example: Projecting Yield of a Process

Markov Chain Monte Carlo

Skills learned

The reader will be able to:

  • Plan and execute a Monte Carlo simulation in R
  • Construct confidence intervals using the Monte Carlo method
  • Determine the sensitivity of process outputs and interpret the results

Chapter 7: Discrete Event Simulation – 30 pages

Description

This chapter explains how to plan and execute a discrete event simulation (DES), which models systems where entities arrive, wait in queues, and compete for limited resources, and shows how to build and analyze these models in R.

Level

ADVANCED

Topics covered

Anatomy of a DES Project

Entities, Locations, Resources and Events

System Performance Metrics

Queuing Models and Kendall’s Notation

The Event Calendar

Manual Event Calendar Generation

Example: An M/M/1 system in R

Using the queueing package

Using the simmer package

Arrival-Counting Processes with the NHPoisson Package

Survival Analysis with the survival Package

Example: When Will the Bagels Run Out?

Skills learned

The reader will be able to:

  • Plan and execute discrete event simulation in R
  • Choose an appropriate model for a queueing problem
  • Manually generate an event calendar to verify simulation results
  • Use arrival counting processes for practical problem-solving
  • Execute a survival analysis in R and interpret the results

Chapter 8: System Dynamics – 30 pages

Description

This chapter presents system dynamics, a powerful technique for characterizing the effects of multiple nested feedback loops in a dynamical system. This technique helps uncover the large-scale patterns in a complex system where interdependencies and variation are critical.

Level

ADVANCED

Topics covered

Anatomy of an SD Project

The Law of Unintended Consequences and Policy Resistance

Introduction to Differential Equations

Causal Loop Diagrams (CLDs)

Stock and Flow Diagrams (SFDs)

Using the deSolve Package

Example: Lotka-Volterra Equations

Dynamic Archetypes

Linear Growth

Exponential Growth and Collapse

S-Shaped Growth

S-Shaped Growth with Overshoot

Overshoot and Collapse

Delays and Oscillations

Using the stellaR and simecol Packages

Skills learned

The reader will be able to:

  • Plan and execute a system dynamics project
  • Create causal loop diagrams and stock-and-flow diagrams
  • Set up simple systems of differential equations and solve them with deSolve in R
  • Predict the evolution of stocks using dynamic archetypes in CLDs
  • Convert STELLA models to R

Chapter 9: Agent-Based Modeling – 25 pages

Description

Agent-Based Modeling (ABM) provides a unique perspective on simulation, illuminating the emergent behavior of the whole system by simply characterizing the rules by which each participant in the system operates. This chapter provides an overview of ABM, compares and contrasts it with the other simulation techniques, and demonstrates how to set up a simulation using an ABM in R.

Level

ADVANCED

Topics covered

Anatomy of an ABM Project

Emergent Behavior

PAGE (Percepts, Actions, Goals, and Environment)

Turtles and Patches

Using the RNetLogo package

Skills learned

The reader will be able to:

  • Plan and execute an ABM project in R
  • Create specifications for the ABM using PAGE

Chapter 10: Resampling – 25 pages

Description

Resampling methods are related to Monte Carlo simulation, but serve a different purpose: to help us characterize a data generating process or make inferences about the population our data came from when all we have is a small sample. In this chapter, resampling methods (and some practical problems that use them) are explained.

Level

MEDIUM

Topics covered

Anatomy of a Resampling Project

Bootstrapping

Jackknifing

Permutation Tests

Skills learned

The reader will be able to:

  • Plan and execute a resampling project in R
  • Understand how to select and use a resampling technique for real data

Chapter 11: Comparing the Simulation Techniques – 15 pages

Description

In this chapter, the simulation techniques will be compared and contrasted in terms of their strengths, weaknesses, biases, and computational complexity.

Level

ADVANCED

Topics covered

TBD – at least two simulation approaches will be applied

Skills learned

The reader will learn how to:

  • Think about a simulation study holistically
  • Select an appropriate combination of techniques for a real simulation study