Tag Archives: data analysis

Imperfect Action is Better Than Perfect Inaction: What Harry Truman Can Teach Us About Loss Functions (with an intro to ggplot)

One of the heuristics we use at Intelex to guide decision making is former US President Truman’s advice that “imperfect action is better than perfect inaction.” What it means is — don’t wait too long to take action, because you don’t want to miss opportunities. Good advice, right?

When I share this with colleagues, I often hear a response like: “that’s dangerous!” To which my answer is “well sure, sometimes, but it can be really valuable depending on how you apply it!” The trick is: knowing how and when.

Here’s how it can be dangerous. For example, statistical process control (SPC) exists to keep us from tampering with processes — from taking imperfect action based on random variation, which will not only get us nowhere, but can exacerbate the problem we were trying to solve. The secret is to apply Truman’s heuristic based on an understanding of exactly how imperfect is OK with your organization, based on your risk appetite. And this is where loss functions can help.

Along the way, we’ll demonstrate how to do a few important things related to plotting with the ggplot package in R, gradually adding in new elements to the plot so you can see how it’s layered, including:

  • Plot a function based on its equation
  • Add text annotations to specific locations on a ggplot
  • Draw horizontal and vertical lines on a ggplot
  • Draw arrows on a ggplot
  • Add extra dots to a ggplot
  • Eliminate axis text and axis tick marks

What is a Loss Function?

A loss function quantifies how unhappy you’ll be based on the accuracy or effectiveness of a prediction or decision. In the simplest case, you control one variable (x) which leads to some cost or loss (y). For the case we’ll examine in this post, the variables are:

  • How much time and effort you put in to scoping and characterizing the problem (x); we assume that time+effort invested leads to real understanding
  • How much it will cost you (y); can be expressed in terms of direct costs (e.g. capex + opex) as well as opportunity costs or intangible costs (e.g. damage to reputation)

Here is an example of what this might look like, if you have a situation where overestimating (putting in too much x) OR underestimating (putting in too little x) are both equally bad. In this case, x=10 is the best (least costly) decision or prediction:

plot of a typical squared loss function
# describe the equation we want to plot
parabola <- function(x) ((x-10)^2)+10  

# initialize ggplot with a dummy dataset
library(ggplot)
p <- ggplot(data = data.frame(x=0), mapping = aes(x=x)) 

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) +
     xlab("x = the variable you can control") + 
     ylab("y = cost of loss ($$)")

In regression (and other techniques where you’re trying to build a model to predict a quantitative dependent variable), mean square error is a squared loss function that helps you quantify error. It captures two facts: the farther away you are from the correct answer the worse the error is — and both overestimating and underestimating is bad (which is why you square the values). Across this and related techniques, the loss function captures these characteristics:

From http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote10.html

Not all loss functions have that general shape. For classification, for example, the 0-1 loss function tells the story that if you get a classification wrong (x < 0) you incur all the penalty or loss (y=1), whereas if you get it right (x > 0) there is no penalty or loss (y=0):

# set up data frame of red points
d.step <- data.frame(x=c(-3,0,0,3), y=c(1,1,0,0))

# note that the loss function really extends to x=-Inf and x=+Inf
ggplot(d.step) + geom_step(mapping=aes(x=x, y=y), direction="hv") +
     geom_point(mapping=aes(x=x, y=y), color="red") + 
     xlab("y* f(x)") + ylab("Loss (Cost)") +  
     ggtitle("0-1 Loss Function for Classification")

Use the Loss Function to Make Strategic Decisions

So let’s get back to Truman’s advice. Ideally, we want to choose the x (the amount of time and effort to invest into project planning) that results in the lowest possible cost or loss. That’s the green point at the nadir of the parabola:

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen")

Costs get higher as we move up the x-axis:

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green")

And time+effort grows as we move along the x-axis (we might spend minutes on a problem at the left of the plot, or weeks to years by the time we get to the right hand side):

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3)

Planning too Little = Planning too Much = Costly

What this means is — if we don’t plan, or we plan just a little bit, we incur high costs. We might make the wrong decision! Or miss critical opportunities! But if we plan too much — we’re going to spend too much time, money, and/or effort compared to the benefit of the solution we provide.


p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3) +
     annotate(geom="text",x=3, y=85, label="Little (or no) Planning\nHIGH COST", color="red") +
     annotate(geom="text", x=18, y=85, label="Paralysis by Planning\nHIGH COST", color="red") +
     geom_vline(xintercept=0, linetype="dotted") + geom_hline(yintercept=0, linetype="dotted")

The trick is to FIND THAT CRITICAL LEVEL OF TIME and EFFORT invested to gain information and understanding about your problem… and then if you’re going to err, make sure you err towards the left — if you’re going to make a mistake, make the mistake that costs less and takes less time to make:

arrow.x <- c(10, 10, 10, 10)
arrow.y <- c(35, 50, 65, 80)
arrow.x.end <- c(6, 6, 6, 6)
arrow.y.end <- arrow.y
d <- data.frame(arrow.x, arrow.y, arrow.x.end, arrow.y.end)

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3) +
     annotate(geom="text",x=3, y=85, label="Little (or no) Planning\nHIGH COST", color="red") +
     annotate(geom="text", x=18, y=85, label="Paralysis by Planning\nHIGH COST", color="red") +
     geom_vline(xintercept=0, linetype="dotted") + 
     geom_hline(yintercept=0, linetype="dotted") +
     geom_vline(xintercept=10) +
     geom_segment(data=d, mapping=aes(x=arrow.x, y=arrow.y, xend=arrow.x.end, yend=arrow.y.end),
     arrow=arrow(), color="blue", size=2) +
     annotate(geom="text", x=8, y=95, size=2.3, color="blue",
     label="we prefer to be\non this side of the\nloss function")

Moral of the Story

The moral of the story is… imperfect action can be expensive, but perfect action is ALWAYS expensive. Spend less to make mistakes and learn from them, if you can! This is one of the value drivers for agile methodologies… agile practices can help improve communication and coordination so that the loss function is minimized.

## FULL CODE FOR THE COMPLETELY ANNOTATED CHART ##
# If you change the equation for the parabola, annotations may shift and be in the wrong place.
parabola <- function(x) ((x-10)^2)+10

my.title <- expression(paste("Imperfect Action Can Be Expensive. But Perfect Action is ", italic("Always"), " Expensive."))

arrow.x <- c(10, 10, 10, 10)
arrow.y <- c(35, 50, 65, 80)
arrow.x.end <- c(6, 6, 6, 6)
arrow.y.end <- arrow.y
d <- data.frame(arrow.x, arrow.y, arrow.x.end, arrow.y.end)

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3) +
     annotate(geom="text",x=3, y=85, label="Little (or no) Planning\nHIGH COST", color="red") +
     annotate(geom="text", x=18, y=85, label="Paralysis by Planning\nHIGH COST", color="red") +
     geom_vline(xintercept=0, linetype="dotted") + 
     geom_hline(yintercept=0, linetype="dotted") +
     geom_vline(xintercept=10) +
     geom_segment(data=d, mapping=aes(x=arrow.x, y=arrow.y, xend=arrow.x.end, yend=arrow.y.end),
     arrow=arrow(), color="blue", size=2) +
     annotate(geom="text", x=8, y=95, size=2.3, color="blue",
     label="we prefer to be\non this side of the\nloss function") +
     ggtitle(my.title) +
     theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(),
     axis.text.y=element_blank(), axis.ticks.y=element_blank()) 

Now sometimes you need to make this investment! (Think nuclear power plants, or constructing aircraft carriers or submarines.) Don’t get caught up in getting your planning investment perfectly optimized — but do be aware of the trade-offs, and go into the decision deliberately, based on the risk level (and regulatory nature) of your industry, and your company’s risk appetite.

Object of Type Closure is Not Subsettable

I started using R in 2004. I started using R religiously on the day of the annular solar eclipse in Madrid (October 3, 2005) after being inspired by David Hunter’s talk at ADASS.

It took me exactly 4,889 days to figure out what this vexing error means, even though trial and error helped me move through it most every time it happened! I’m so moved by what I learned (and embarrassed that it took so long to make sense), that I have to share it with you.

This error happens when you’re trying to treat a function like a list, vector, or data frame. To fix it, start treating the function like a function.

Let’s take something that’s very obviously a function. I picked the sampling distribution simulator from a 2015 blog post I wrote. Cut and paste it into your R console:

sdm.sim <- function(n,src.dist=NULL,param1=NULL,param2=NULL) {
   r <- 10000  # Number of replications/samples - DO NOT ADJUST
   # This produces a matrix of observations with  
   # n columns and r rows. Each row is one sample:
   my.samples <- switch(src.dist,
	"E" = matrix(rexp(n*r,param1),r),
	"N" = matrix(rnorm(n*r,param1,param2),r),
	"U" = matrix(runif(n*r,param1,param2),r),
	"P" = matrix(rpois(n*r,param1),r),
        "B" = matrix(rbinom(n*r,param1,param2),r),
	"G" = matrix(rgamma(n*r,param1,param2),r),
	"X" = matrix(rchisq(n*r,param1),r),
	"T" = matrix(rt(n*r,param1),r))
   all.sample.sums <- apply(my.samples,1,sum)
   all.sample.means <- apply(my.samples,1,mean)   
   all.sample.vars <- apply(my.samples,1,var) 
   par(mfrow=c(2,2))
   hist(my.samples[1,],col="gray",main="Distribution of One Sample")
   hist(all.sample.sums,col="gray",main="Sampling Distribution\nof
	the Sum")
   hist(all.sample.means,col="gray",main="Sampling Distribution\nof the Mean")
   hist(all.sample.vars,col="gray",main="Sampling Distribution\nof
	the Variance")
}

The right thing to do with this function is use it to simulate a bunch of distributions and plot them using base R. You write the function name, followed by parenthesis, followed by each of the four arguments the function needs to work. This will generate a normal distribution with mean of 20 and standard deviation of 3, along with three sampling distributions, using a sample size of 100 and 10000 replications:

sdm.sim(100, src.dist="N", param1=20, param2=3)

(You should get four plots, arranged in a 2×2 grid.)

But what if we tried to treat sdm.sim like a list, and call the 3rd element of it? Or what if we tried to treat it like a data frame, and we wanted to call one of the variables in the column of the data frame?

> sdm.sim[3]
Error in sdm.sim[3] : object of type 'closure' is not subsettable

> sdm.sim$values
Error in sdm.sim$values : object of type 'closure' is not subsettable

SURPRISE! Object of type closure is not subsettable. This happens because sdm.sim is a function, and its data type is (shockingly) something called “closure”:

> class(sdm.sim)
[1] "function"

> typeof(sdm.sim)
[1] "closure"

I had read this before on Stack Overflow a whole bunch of times, but it never really clicked until I saw it like this! And now that I had a sense for why the error was occurring, turns out it’s super easy to reproduce with functions in base R or functions in anyfamiliar packages you use:

> ggplot$a
Error in ggplot$a : object of type 'closure' is not subsettable

> table$a
Error in table$a : object of type 'closure' is not subsettable

> ggplot[1]
Error in ggplot[1] : object of type 'closure' is not subsettable

> table[1]
Error in table[1] : object of type 'closure' is not subsettable

As a result, if you’re pulling your hair out over this problem, check and see where in your rogue line of code you’re treating something like a non-function. Or maybe you picked a variable name that competes with a function name. Or maybe you got your operations out of order. In any case, change your notation so that your function is doing function things, and your code should start working.

But as Luke Smith pointed out, this is not true for functional sequences (which you can also write as functions). Functional sequences are those chains of commands you write when you’re in tidyverse mode, all strung together with %>% pipes:

Luke’s code that you can cut and paste (and try), with the top line that got cut off by Twitter, is:

fun1 <- . %>% 
    group_by(col1) %>% 
    mutate(n=n+h) %>% 
    filter(n==max(n)) %>% 
    ungroup() %>% 
    summarize(new_col=mean(n))

fun2 <- fun1[-c(2,5)] 

Even though Luke’s fun1 and fun2 are of type closure, they are subsettable because they contain a sequence of functions:

> typeof(fun1)
[1] "closure"

> typeof(fun2)
[1] "closure"

> fun1[1]
Functional sequence with the following components:

 1. group_by(., col1)

Use 'functions' to extract the individual functions. 

> fun1[[1]]
function (.) 
group_by(., col1)

Don’t feel bad! This problem has plagued all of us for many, many hours (me: years), and yet it still happens to us more often than we would like to admit. Awareness of this issue will not prevent you from attempting things that give you this error in the future. It’s such a popular error that there have been memes about it and sad valentines written about it:

SCROLL DOWN PAST STEPH’S TWEET TO SEE THE JOKE!!

(Also, if you’ve made it this far, FOLLOW THESE GOOD PEOPLE ON TWITTER: @stephdesilva @djnavarro @lksmth – They all share great information on data science in general, and R in particular. Many thanks too to all the #rstats crowd who shared in my glee last night and didn’t make me feel like an idiot for not figuring this out for ALMOST 14 YEARS. It seems so simple now.

Steph is also a Microsoft whiz who you should definitely hire if you need anything R+ Microsoft. Thanks to all of you!)

A 15-Week Intro Statistics Course Featuring R

Morgan at Burning Man 2014. (Image Credit: Nicole Radziwill)

Morgan at Burning Man 2014. (Image Credit: Nicole Radziwill)

Do you teach introductory statistics or data science? Need some help planning your fall class?

I apply the 10 Principles of Burning Man in the design and conduct of all my undergraduate and graduate-level courses, including my introductory statistics class (which has a heavy focus on R and data science) at JMU. This means that I consider learning to be emergent, and as a result, it often doesn’t follow a prescribed path of achieving specified learning objectives. However, in certain courses, I still feel like it’s important to provide a general structure to help guide the way! This also helps the students get a sense of our general trajectory over the course of the semester, and do readings in advance if they’re ready.

Since several people have asked for a copy, here is the SYLLABUS that I use for my 15-week class (that also uses the “informal” TEXTBOOK I wrote this past spring). We meet twice a week for an hour and 15 minutes each session. The class is designed for undergraduate sophomores, but there are always students from all levels enrolled. The course is intended to provide an introduction to (frequentist) statistical thinking, but with an applied focus that has practical data analysis at its core.

My goal is simple. At the end of the semester, I want students to be able to:

Please let me know if this syllabus is helpful to you! I’ll be posting my intensive (5-session) version of this tomorrow or the next day.

Feel free to join our class Facebook group at https://www.facebook.com/groups/262216220608559/ if you want to play along at home.

Top 20 Data Visualization Tools

Every researcher or practitioner of quality (or pretty much any other subject, for that matter) needs a great toolbox packed with flexible visualization tools. I am very happy to see this list of “Top 20 Data Visualization Tools” that came out last week. For me, it’s like a TO DO list! Although I am THRILLED to see #18 (R) and #19 (Weka) on the list, I’m also happy to get some new ideas for what to learn next. I’m thinking #16 (Processing) and #20 (Gephi) and next, it’s a toss up between #7 (Visual.ly) and #9 (Tangle). Preeeeeety.

Why I <3 R: My Not-So-Secret Valentine

My valentine is unique. It will not provide me with flowers, or chocolates, or a romantic dinner tonight, and will certainly not whisper sweet nothings into my good ear. And yet – I will feel no less loved. In contrast, my valentine will probably give me some routines for identifying control limits on control charts, and maybe a way to classify time series. I’m really looking forward to spending some quality time today with this great positive force in my life that saves me so much time and makes me so productive.

Today, on Valentine’s Day, I am serenading one of the loves of my life – R. Technically, R is a statistical software package, but for me, it’s the nirvana of data analysis. I am not a hardcore geek programmer, you see. I don’t like to spend hours coding, admiring the elegance of the syntax and data structures, or finding more compact ways to get the job done. I just want to crack open my data and learn cool things about it, and the faster and more butter-like the better.

Here are a few of the reasons why I love R:

  • R did not play hard to get. The first time I downloaded R from http://www.r-project.org, it only took about 3 minutes, I was able to start playing with it immediately, and it actually worked without a giant installation struggle.
  • R is free. I didn’t have to pay to download it. I don’t have to pay its living expenses in the form of license fees, upgrade fees, or rental charges (like I did when I used SPSS). If I need more from R, I can probably download a new package, and get that too for free.
  • R blended into my living situation rather nicely, and if I decide to move, I’m confident that R will be happy in my new place. As a Windows user, I’m accustomed to having hellacious issues installing software, keeping it up to date, loading new packages, and so on. But R works well on Windows. And when I want to move to Linux, R works well there too. And on the days when I just want to get touchy feely with a Mac, R works well there too.
  • R gets a lot of exercise, so it’s always in pretty good shape. There is an enthusiastic global community of R users who number in the tens of thousands (and maybe more), and report issues to the people who develop and maintain the individual packages. It’s rare to run into an error with R, especially when you’re using a package that is very popular.
  • R is very social; in fact, it’s on Facebook. And if you friendR Bloggersyou’ll get updates about great things you can do with the software (some basic techniques, but some really advanced ones too). Most updates from R Bloggers come with working code.
  • Instead of just having ONE nice package, R has HUNDREDS of nice packages. And each performs a different and unique function, from graphics, to network analysis, to machine learning, to bioinformatics, to super hot-off-the-press algorithms that someone just developed and published. (I even learned how to use the “dtw” package over the weekend, which provides algorithms for time series clustering and classification using a technique called Dynamic Time Warping. Sounds cool, huh!) If you aren’t happy with one package, you can probably find a comparable package that someone else wrote that implements your desired functions in a different way.
  • (And if you aren’t satisfied by those packages, there’s always someone out there coding a new one.)
  • R helps me meditate. OK, so we can’t go to tai chi class together, but I do find it very easy to get into the flow (a la Csikzentmihalyi) when I’m using R.
  • R doesn’t argue with me for no reason. Most of the error messages actually make sense and mean something.
  • R always has time to spend with me. All I have to do is turn it on by double-clicking that nice R icon on my desktop. I don’t ever have to compete with other users or feel jealous of them. R never turns me down or says it’s got other stuff to do. R always makes me feel important and special, because it helps me accomplish great things that I would not be able to do on my own. R supports my personal and professional goals.
  • R has its own journal. Wow. Not only is it utilitarian and fun to be around, but it’s also got a great reputation and is recognized and honored as a solid citizen of the software community.
  • R always remembers me. I can save the image of my entire session with it and pick it up at a later time.
  • R will never leave me. (Well, I hope.)

The most important reason I like R is that I just like spending time with it, learning more about it, and feeling our relationship deepen as it gently helps me analyze all my new data. (Seriously geeky – yeah, I know. At least I won’t be disappointed by the object of MY affection today : )

<3