Simulation for Data Science With R

Image Credit: Doug Buckley of http://hyperactive.to

Hey everyone! I just wanted to give you the heads up on a book project that I’ve been working on (which should be available by Spring 2016). It’s all about using the R programming language to do simulation — which I think is one of the most powerful (and overlooked) tools in data science. Please feel free to email or write comments below if you have any suggestions for material you’d like to have included in it!

Originally, this project was supposed to be a secret… I’ve been working on it for about two years now, along with two other writing projects, and was approached in June by a traditional publishing company (who I won’t mention by name) who wanted to brainstorm with me about possibly publishing and distributing my next book. After we discussed the intersection of their wants and my needs, I prepared a full outline for them, and they came up with a work schedule and sent me a contract. While I was reading the contract, I got cold feet. It was the part about giving up “all moral rights” to my work, which sounds really frightening (and is not something I have to do under Creative Commons licensing, which I prefer). I shared the contract with a few colleagues and a lawyer, hoping that they’d say “don’t worry… it sounds a lot worse than it really is.” But the response I got was “it sounds pretty much like it is.”

While deliberating the past two weeks, I’ve been moving around a lot and haven’t been in touch with the publisher. I got an email this morning asking for my immediate decision on the matter (paraphrased, because there’s a legal disclaimer at the bottom of their emails that says “this information may be privileged” and I don’t want to violate any laws):

If we don’t hear from you, unfortunately we’ll be moving forward with this project. Do you still want to be on board?

The answer is YEAH – of COURSE I’m “on board” with my own project. But this really made me question the value of a traditional publisher over an indie publisher, or even self-publishing. And if they’re moving forward anyway, does that mean they take my outline (and supporting information about what I’m planning for each chapter) and just have someone else write to it? That doesn’t sound very nice. Since all the content on my blog is copyrighted by ME, I’m sharing the entire contents of what I sent to them on July 6th to establish the copyright on my outline in a public forum.

So if you see this chapter structure in someone ELSE’S book… you know what happened. The publisher came up with the idea for the main title (“Simulation for Data Science With R”) so I might publish under a different title that still has the words Simulation and R in them.

I may still publish with them, but I’ll make that decision after I have the full manuscript in place in a couple months. And after I have the chance to reflect more on what’s best for everyone. What do you think is the best route forward?


 

Simulation for Data Science With R

Effective Data-Driven Decision Making for Business Analysis by Nicole M. Radziwill

Audience

Simulation is an essential (yet often overlooked) tool in data science – an interdisciplinary approach to problem-solving that leverages computer science, statistics, and domain expertise. This easy-to-understand introductory text for new and intermediate-level programmers, data scientists, and business analysts surveys five different simulation techniques (Monte Carlo, Discrete Event Simulation, System Dynamics, Agent-Based Modeling, and Resampling). The book focuses on practical and illustrative examples using the R Statistical Software, presented within the context of structured methodologies for problem solving (such as DMAIC and DMADV) that will enable you to more easily use simulation to make effective data-driven decisions. Readers should have exposure to basic concepts in programming but can be new to the R Statistical Software.

Mission

This book helps its readers 1) formulate research questions that simulation can help solve, 2) choose an appropriate problem-solving methodology, 3) choose one or more simulation techniques to help solve that problem, 4) perform basic simulations using the R Statistical Software, and 5) present results and conclusions clearly and effectively.

Objectives and achievements

The reader will:

  • Learn about essential and foundational concepts in modeling and simulation
  • Determine whether a simulation project is also a data science project
  • Choose an appropriate problem-solving methodology for effective data-driven decision making
  • Select suitable simulation techniques to provide insights about a given problem
  • Build and interpret the results from basic simulations using the R Statistical Software

SECTION I: BASIC CONCEPTS

  1. Introduction to Simulation for Data Science
  2. Foundations for Decision Making
  3. SECRET NEW CHAPTER THAT YOU WILL BE REALLY EXCITED ABOUT

SECTION II: STOCHASTIC PROCESSES

  1. Variability and Random Variate Generation
  2. Data Generating Processes
  3. Distribution Fitting

SECTION III: SIMULATION TECHNIQUES

  1. Monte Carlo Simulation
  2. Discrete Event Simulation
  3. System Dynamics
  4. Agent-Based Modeling
  5. Resampling Methods
  6. SECRET NEW CHAPTER THAT YOU WILL BE REALLY EXCITED ABOUT

SECTION IV: CASE STUDIES

  1. Case Study 1: Possibly modeling opinion dynamics… specific example still TBD
  2. Case Study 2: A Really Practical Application of Simulation (especially for women)

Chapter 1: Introduction to Simulation for Data Science – 35 pages

Description

This chapter explains the role of simulation in data science, and provides the context for understanding the differences between simulation techniques and their philosophical underpinnings.

Level

BASIC

Topics covered

Variation and Data-Driven Decision Making

What are Complex Systems?

What are Complex Dynamical Systems? What is systems thinking? Why is a systems perspective critical for data-driven decision making? Where do we encounter complex systems in business or day-to-day life?

What is Data Science?

A Taxonomy of Data Science. The Data Science Venn Diagram. What are the roles of modeling and simulation in data science projects? “Is it a Data Science Project?” — a Litmus Test. How modeling and simulation align with data science.

What is a Model?

Conceptual Models. Equations. Deterministic Models. Stochastic Models. Endogenous and Exogenous Variables.

What is Simulation?

Types of Simulation: Static vs. Dynamic, Stochastic vs. Deterministic, Discrete vs. Continuous, Terminating and Non-Terminating (Steady State). Philosophical Principles: Holistic vs. Reductionist, Kadanoff’s Universality, Parsimony, Sensitivity to Initial Conditions.

Why Use Simulation?

Simulation and Big Data

Choosing the Right Simulation Technique

Skills learned

The reader will be able to:

  • Distinguish a model from a simulation
  • Explain how simulation can provide a valuable perspective in data-driven decision making
  • Understand how simulation fits into the taxonomy of data science
  • Determine whether a simulation project is also a data science project
  • Determine which simulation technique to apply to various kinds of real-world problems

Chapter 2: Foundations for Decision Making – 25 pages

Description

In this chapter, the reader will learn how to plan and structure a simulation project to aid in the decision-making process as well as the presentation of results. The social context of data science will be explained, emphasizing the growing importance of collaborative data and information sharing.

Level

BASIC

Topics covered

The Social Context of Data Science

Ethics and Provenance. Data Curation. Replicability, Reproducibility, and Open Science. Open, interoperable frameworks for collaborative data and information sharing. Problem-Centric Habits of Mind.

Selecting Key Performance Indicators (KPIs)

Determining the Number of Replications

Methodologies for Simulation Projects

A General Problem-Solving Approach

DMAIC

DMADV

Root Cause Analysis (RCA)

PDSA

Verification and Validation Techniques

Output Analysis

Skills learned

The reader will be able to:

  • Plan a simulation study that is supported by effective and meaningful metadata
  • Select an appropriate methodology to guide the simulation project
  • Choose activities to ensure that verification and validation requirements are met
  • Construct confidence intervals for reporting simulation output

Chapter 3: Variability and Random Variate Generation – 25 pages

Description

Simulation is powerful because it provides a way to closely examine the random behavior in systems that arises due to interdependencies and variability. This requires being able to generate random numbers and random variates that come from populations with known statistical characteristics. This chapter describes how random numbers and random variates are generated, and shows how they are applied to perform simple simulations.

Level

MEDIUM

Topics covered

Variability in Stochastic Processes

Why Generate Random Variables?

Pseudorandom Number Generation

Linear Congruential Generators

Inverse Transformation Method

Using sample() for Discrete Distributions

Is this Sequence Random? Tests for Randomness

Autocorrelation, Frequency, Runs Tests. Using the randtests package

Tests for homogeneity

Simple Simulations with Random Numbers

 

Skills learned

The reader will be able to:

  • Generate pseudorandom numbers that are uniformly distributed
  • Use random numbers to generate random variates from a target distribution
  • Perform simple simulations using streams of random numbers
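
To give a flavor of the two core ideas in this chapter, here is a minimal sketch in base R: a linear congruential generator that produces uniform pseudorandom numbers, followed by the inverse transformation method to turn them into exponential variates. The function name lcg, the seed, and the constants (the classic "minimal standard" multiplier and modulus) are illustrative assumptions, not the book's own example. In practice you would just call runif() and rexp().

```r
# Linear congruential generator: state[n+1] = (a * state[n] + c) mod m.
# Constants are the "minimal standard" values, used here only to illustrate.
lcg <- function(n, seed = 12345, a = 16807, c = 0, m = 2^31 - 1) {
  u <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (a * state + c) %% m
    u[i] <- state / m            # scale the integer state into (0, 1)
  }
  u
}

# Step 1: uniform pseudorandom numbers
u <- lcg(10000)

# Step 2: inverse transformation method for the exponential distribution.
# The CDF is F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate.
rate <- 2
x <- -log(1 - u) / rate

mean(x)   # should land close to 1 / rate
```

The same two-step pattern (uniform numbers first, then a transformation through the inverse CDF) works for any distribution whose CDF can be inverted in closed form.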

Chapter 4: Data Generating Processes – 30 pages

Description

To execute a simulation, you must be able to generate random variates that represent the physical process you are trying to emulate. In this chapter, we cover several common statistical distributions that can be used to represent real physical processes, and explain which physical processes are often modeled using those distributions.

Level

MEDIUM

Topics covered

What is a Data Generating Process?

Continuous, Discrete, and Multivariate Distributions

Discrete Distributions

Binomial Distribution

Geometric Distribution

Hypergeometric Distribution

Poisson Distribution

Continuous Distributions

Exponential Distribution

F Distribution

Lognormal Distribution

Normal Distribution

Student’s t Distribution

Uniform Distribution

Weibull Distribution

Chi-Square Distribution

Stochastic Processes

Markov. Poisson. Gaussian. Bernoulli. Brownian Motion. Random Walk.

Stationary and Autoregressive Processes.

 

Skills learned

The reader will be able to:

  • Understand the characteristics of several common discrete and continuous data generating processes
  • Use those distributions to generate streams of random variates
  • Describe several common types of stochastic processes

Chapter 5: Distribution Fitting – 30 pages

Description

An effective simulation is driven by data generating processes that accurately reflect real physical populations. This chapter shows how to use a sample of data to determine which statistical distribution best represents the real population. The resulting distribution is used to generate random variates for the simulation.

Level

MEDIUM

Topics covered

Why is Distribution Fitting Essential?

Techniques for Distribution Fitting

Shapiro-Wilk Test for Normality

Anderson-Darling Test

Lilliefors Test

Kolmogorov-Smirnov Test

Chi-Square Goodness of Fit Test

Other Goodness Of Fit Tests

Transforming Your Data

When There’s No Data, Use Interviews

Skills learned

The reader will be able to:

  • Use a sample of real data to determine which data generating process is required in a simulation
  • Transform data to find a more effective data generating process
  • Estimate appropriate distributions when samples of real data are not available
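
As a rough illustration of the workflow this chapter describes, the sketch below uses only base R's shapiro.test() and ks.test(). The data are simulated (not real measurements), and the variable names are made up for the example. It checks a skewed sample for normality, applies a log transformation, and checks again.

```r
set.seed(7)

# Simulated "observed" sample: lognormal, so the raw values are right-skewed
obs <- rlnorm(200, meanlog = 1, sdlog = 0.5)

# Shapiro-Wilk on the raw data: a small p-value argues against normality
p_raw <- shapiro.test(obs)$p.value

# Transform the data and test again
log_obs <- log(obs)
p_log <- shapiro.test(log_obs)$p.value

# Kolmogorov-Smirnov against a normal with parameters estimated from the
# sample. Estimating the parameters from the same data makes this test
# conservative, which is the problem the Lilliefors test corrects.
ks <- ks.test(log_obs, "pnorm", mean = mean(log_obs), sd = sd(log_obs))

c(raw = p_raw, logged = p_log, ks = ks$p.value)
```

If the transformed data look normal, a lognormal data generating process is a reasonable candidate for driving the simulation.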

Chapter 6: Monte Carlo Simulation – 30 pages

Description

This chapter explains how to set up and execute simple Monte Carlo simulations, using data generating processes to represent random inputs.

Level

ADVANCED

Topics covered

Anatomy of a Monte Carlo Project

The Many Flavors of Monte Carlo

The Hit-or-Miss Method

Example: Estimating Pi

Monte Carlo Integration

Example: Numerical Integration of y = x^2

Estimating Variables

Monte Carlo Confidence Intervals

Example: Projecting Profits

Sensitivity Analysis

Example: Projecting Variability of Profits

Example: Projecting Yield of a Process

Markov Chain Monte Carlo

Skills learned

The reader will be able to:

  • Plan and execute a Monte Carlo simulation in R
  • Construct confidence intervals using the Monte Carlo method
  • Determine the sensitivity of process outputs and interpret the results
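
The hit-or-miss method and Monte Carlo confidence intervals listed above can be sketched together in a few lines of base R. This is a minimal illustration with an arbitrary seed and sample size, not the worked example from the book: scatter points uniformly over the unit square, count the fraction that fall inside the quarter circle, and scale by 4.

```r
set.seed(42)
n <- 100000

# Random points in the unit square
x <- runif(n)
y <- runif(n)

# A "hit" is a point inside the quarter circle of radius 1
hits <- (x^2 + y^2) <= 1
p_hat <- mean(hits)

# The quarter circle has area pi / 4, so scale the hit fraction by 4
pi_hat <- 4 * p_hat

# Normal-approximation 95% confidence interval for the estimate
se <- 4 * sqrt(p_hat * (1 - p_hat) / n)
ci <- pi_hat + c(-1, 1) * qnorm(0.975) * se

pi_hat
ci
```

Because the standard error shrinks with the square root of n, getting one more decimal place of accuracy takes roughly 100 times as many points.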

Chapter 7: Discrete Event Simulation – 30 pages

Description

What is this chapter about?

Level

ADVANCED

Topics covered

Anatomy of a DES Project

Entities, Locations, Resources and Events

System Performance Metrics

Queueing Models and Kendall’s Notation

The Event Calendar

Manual Event Calendar Generation

Example: An M/M/1 system in R

Using the queueing package

Using the simmer package

Arrival-Counting Processes with the NHPoisson package

Survival Analysis with the survival Package

Example: When Will the Bagels Run Out?

Skills learned

The reader will be able to:

  • Plan and execute discrete event simulation in R
  • Choose an appropriate model for a queueing problem
  • Manually generate an event calendar to verify simulation results
  • Use arrival counting processes for practical problem-solving
  • Execute a survival analysis in R and interpret the results
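
For a quick sense of what an M/M/1 simulation looks like in base R, here is a minimal sketch. It uses Lindley's recursion for waiting times rather than a full event calendar or the queueing and simmer packages, so treat it as a stand-in and not as the chapter's method. The arrival rate, service rate, and seed are arbitrary illustrative values.

```r
set.seed(1)
n      <- 100000   # number of customers to simulate
lambda <- 0.5      # arrival rate (customers per unit time)
mu     <- 1.0      # service rate (customers per unit time)

interarrival <- rexp(n, rate = lambda)  # time between successive arrivals
service      <- rexp(n, rate = mu)      # service time for each customer

# Lindley's recursion: customer i waits for whatever is left of customer
# i-1's wait plus service after the gap between their arrivals has elapsed.
wait <- numeric(n)                      # the first customer does not wait
for (i in 2:n) {
  wait[i] <- max(0, wait[i - 1] + service[i - 1] - interarrival[i])
}

mean_wait <- mean(wait)

# Steady-state M/M/1 result for the mean wait in queue
theory <- lambda / (mu * (mu - lambda))

c(simulated = mean_wait, theoretical = theory)
```

Comparing the simulated mean against the closed-form result is a simple verification step, in the same spirit as checking a simulation against a manually generated event calendar.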

Chapter 8: System Dynamics – 30 pages

Description

This chapter presents system dynamics, a powerful technique for characterizing the effects of multiple nested feedback loops in a dynamical system. This technique helps uncover the large-scale patterns in a complex system where interdependencies and variation are critical.

Level

ADVANCED

Topics covered

Anatomy of an SD Project

The Law of Unintended Consequences and Policy Resistance

Introduction to Differential Equations

Causal Loop Diagrams (CLDs)

Stock and Flow Diagrams (SFDs)

Using the deSolve Package

Example: Lotka-Volterra Equations

Dynamic Archetypes

Linear Growth

Exponential Growth and Collapse

S-Shaped Growth

S-Shaped Growth with Overshoot

Overshoot and Collapse

Delays and Oscillations

Using the stellaR and simecol Packages

Skills learned

The reader will be able to:

  • Plan and execute a system dynamics project
  • Create causal loop diagrams and stock-and-flow diagrams
  • Set up simple systems of differential equations and solve them with deSolve in R
  • Predict the evolution of stocks using dynamic archetypes in CLDs
  • Convert STELLA models to R
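
The Lotka-Volterra example is normally solved with deSolve, but the idea can be sketched in base R with a hand-written fourth-order Runge-Kutta stepper. Everything here (the function names lv and rk4, the parameter values, the step size) is an assumption made for illustration and is not taken from the book.

```r
# Lotka-Volterra derivatives: prey x grows and is eaten, predator y
# grows by eating prey and otherwise dies off.
lv <- function(state, p) {
  x <- state[[1]]
  y <- state[[2]]
  c(p$alpha * x - p$beta * x * y,
    p$delta * x * y - p$gamma * y)
}

# Classic fixed-step RK4 integrator; returns one row per time point
rk4 <- function(f, state, p, dt, steps) {
  out <- matrix(NA_real_, nrow = steps + 1, ncol = length(state))
  out[1, ] <- state
  for (i in seq_len(steps)) {
    k1 <- f(state, p)
    k2 <- f(state + dt / 2 * k1, p)
    k3 <- f(state + dt / 2 * k2, p)
    k4 <- f(state + dt * k3, p)
    state <- state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    out[i + 1, ] <- state
  }
  out
}

p <- list(alpha = 1.0, beta = 0.1, delta = 0.075, gamma = 1.5)
traj <- rk4(lv, c(10, 5), p, dt = 0.01, steps = 5000)

# This quantity is conserved along exact solutions, so its drift is a
# handy check on the numerical integration.
V <- p$delta * traj[, 1] - p$gamma * log(traj[, 1]) +
     p$beta  * traj[, 2] - p$alpha * log(traj[, 2])
range(V)
```

Plotting traj[, 1] against traj[, 2] shows the closed predator-prey cycle, which is the kind of feedback-driven oscillation that a stock and flow diagram makes explicit.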

Chapter 9: Agent-Based Modeling – 25 pages

Description

Agent-Based Modeling (ABM) provides a unique perspective on simulation, illuminating the emergent behavior of the whole system by simply characterizing the rules by which each participant in the system operates. This chapter provides an overview of ABM, compares and contrasts it with the other simulation techniques, and demonstrates how to set up a simulation using an ABM in R.

Level

ADVANCED

Topics covered

Anatomy of an ABM Project

Emergent Behavior

PAGE (Percepts, Actions, Goals, and Environment)

Turtles and Patches

Using the RNetLogo package

Skills learned

The reader will be able to:

  • Plan and execute an ABM project in R
  • Create specifications for the ABM using PAGE

Chapter 10: Resampling – 25 pages

Description

Resampling methods are related to Monte Carlo simulation, but serve a different purpose: to help us characterize a data generating process or make inferences about the population our data came from when all we have is a small sample. In this chapter, resampling methods (and some practical problems that use them) are explained.

Level

MEDIUM

Topics covered

Anatomy of a Resampling Project

Bootstrapping

Jackknifing

Permutation Tests

Skills learned

The reader will be able to:

  • Plan and execute a resampling project in R
  • Understand how to select and use a resampling technique for real data
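
Here is a minimal sketch of bootstrapping in base R, assuming a small made-up sample (simulated here, not real data) and a percentile confidence interval for the median. The sample size, number of resamples, and seed are arbitrary choices for illustration.

```r
set.seed(123)

# A small illustrative sample; in a real project this would be observed data
obs <- rexp(30, rate = 1 / 10)

# Resample the data with replacement many times and recompute the statistic
B <- 2000
boot_medians <- replicate(B, median(sample(obs, replace = TRUE)))

# Percentile bootstrap 95% confidence interval for the median
ci <- quantile(boot_medians, c(0.025, 0.975))

median(obs)
ci
```

The only change needed to bootstrap a different statistic is to swap median() for another function, which is a big part of why the technique is so practical when all you have is a small sample.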

Chapter 11: Comparing the Simulation Techniques – 15 pages

Description

In this chapter, the simulation techniques will be compared and contrasted in terms of their strengths, weaknesses, biases, and computational complexity.

Level

ADVANCED

Topics covered

TBD – at least two simulation approaches will be applied

Skills learned

The reader will learn how to:

  • Think about a simulation study holistically
  • Select an appropriate combination of techniques for a real simulation study

8 replies »

  1. Hi Nicole,
    The book structure you described here sounds nice.
    I wish you very big success in this your challenge.
    I’ve been following your articles and I’m sure you will be achieving huge success upon it.

  2. Nicole, I really like your outline. I would not proceed and get better terms to your liking. I’d definitely buy this book. I’m new to R and can perform basic statistical and reliability analysis with R. I’d really like to step up my game with R. This seems to be the ticket.

    Michael

    • Well, it’s in process, so it’s going to get done — maybe even before next spring! Let me know if you want to be a reviewer… I can get you an early draft.

  3. Hi Nicole,

    I really like the outline. Simulation is something that has intrigued me as a tool to create reasonable datasets, especially in health data. I just ran across an R package by the name of wakefield (it’s not in CRAN yet, but here: https://github.com/trinker/wakefield) that is intended to generate random data sets – have you heard of it? I’ve been learning R over the last few years and now I’m taking the Data Science Specialization from Johns Hopkins through Coursera. I’m definitely interested in expanding my understanding of R!

    As far as your experience with the publisher is concerned, I think you have to go with your gut feeling on it. It certainly seems that this publisher’s intentions do not align with yours! However, even if they found someone else to write a book on simulation using R you have a unique style of presenting material that will set you apart from anyone else (I’ve just started reading and enjoying your book on Statistics).

    Paul

    • Thanks Paul. It’s still going to get written this fall, regardless! I have a class I can teach with it in the spring.

      That package (wakefield) sounds really interesting — I’ll have to check it out. Definitely one of the topics I want to cover is the power of being able to simulate data sets or streams. When I worked at the NRAO on large, multi-million dollar telescopes, we relied on simulated data to test our software because it’s just too expensive to use live telescope time for testing.

      I took some of the data science MOOC classes, but I really need a greater sense of community to be able to learn. I don’t get the happy touchy feelies from MOOCs 😦

      Keep in touch!
      Nicole

  4. Dear Nicole

    I really like your outline. And I want to buy your book!

    Do you have an estimated date? I could proofread for you, or I could help you in some other way.
    I think that this is a very important (and needed) topic in the community.

    Kind regards

    Adolfo

  5. Hi Nicole

    I have two suggestions:

    1). About your publishing approach.
    What value do you get from a publisher? I suspect that for most academic textbooks, the leanpub approach suits authors and readers the best – for example see:
    https://leanpub.com/rprogramming

    2). About the content.
    What I would like to see is a section on how to validate a model, and apply it to assess the impact a change could have on the modelled system. Perhaps this will be covered in your case studies.

    Best wishes for your project: it will be challenging, but well worthwhile, and I look forward to seeing the result.
