
My First R Package (Part 3)

After refactoring my code so that it was only about 10 lines, calling 12 functions I had written and loaded in via the source command, I went through all the steps in Part 1 and Part 2 of this blog series to set up the R package infrastructure using usethis in RStudio. Then things started humming along with the rest of the setup:

> use_mit_license("Nicole Radziwill")
✔ Setting active project to 'D:/R/easyMTS'
✔ Setting License field in DESCRIPTION to 'MIT + file LICENSE'
✔ Writing 'LICENSE.md'
✔ Adding '^LICENSE\\.md$' to '.Rbuildignore'
✔ Writing 'LICENSE'

> use_testthat()
✔ Adding 'testthat' to Suggests field in DESCRIPTION
✔ Creating 'tests/testthat/'
✔ Writing 'tests/testthat.R'
● Call `use_test()` to initialize a basic test file and open it for editing.

> use_vignette("easyMTS")
✔ Adding 'knitr' to Suggests field in DESCRIPTION
✔ Setting VignetteBuilder field in DESCRIPTION to 'knitr'
✔ Adding 'inst/doc' to '.gitignore'
✔ Creating 'vignettes/'
✔ Adding '*.html', '*.R' to 'vignettes/.gitignore'
✔ Adding 'rmarkdown' to Suggests field in DESCRIPTION
✔ Writing 'vignettes/easyMTS.Rmd'
● Modify 'vignettes/easyMTS.Rmd'

> use_citation()
✔ Creating 'inst/'
✔ Writing 'inst/CITATION'
● Modify 'inst/CITATION'

Add Your Dependencies

> use_package("ggplot2")
✔ Adding 'ggplot2' to Imports field in DESCRIPTION
● Refer to functions with `ggplot2::fun()`
> use_package("dplyr")
✔ Adding 'dplyr' to Imports field in DESCRIPTION
● Refer to functions with `dplyr::fun()`

> use_package("magrittr")
✔ Adding 'magrittr' to Imports field in DESCRIPTION
● Refer to functions with `magrittr::fun()`
> use_package("tidyr")
✔ Adding 'tidyr' to Imports field in DESCRIPTION
● Refer to functions with `tidyr::fun()`

> use_package("MASS")
✔ Adding 'MASS' to Imports field in DESCRIPTION
● Refer to functions with `MASS::fun()`

> use_package("qualityTools")
✔ Adding 'qualityTools' to Imports field in DESCRIPTION
● Refer to functions with `qualityTools::fun()`

> use_package("highcharter")
Registered S3 method overwritten by 'xts':
  method     from
  as.zoo.xts zoo 
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
✔ Adding 'highcharter' to Imports field in DESCRIPTION
● Refer to functions with `highcharter::fun()`

> use_package("cowplot")
✔ Adding 'cowplot' to Imports field in DESCRIPTION
● Refer to functions with `cowplot::fun()`
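Those `pkg::fun()` reminders describe how package code should call its imports: with an explicit prefix, rather than attaching packages via library(). Here's a hypothetical sketch (the function name and the md column are invented, not part of easyMTS):

# in a file under R/ -- imports get the pkg:: prefix, so the
# package code itself never needs any library() calls
plot_distances <- function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x = md)) +
    ggplot2::geom_histogram(bins = 30)
}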

Adding Data to the Package

I want to include two datasets: one data frame containing 50 observations of a healthy group (with 5 predictors each), and another data frame containing 15 observations from an abnormal or unhealthy group (also with 5 predictors). I made sure the two CSV files I wanted to add to the package were in my working directory first by using dir().

> use_data_raw()
✔ Creating 'data-raw/'
✔ Adding '^data-raw$' to '.Rbuildignore'
✔ Writing 'data-raw/DATASET.R'
● Modify 'data-raw/DATASET.R'
● Finish the data preparation script in 'data-raw/DATASET.R'
● Use `usethis::use_data()` to add prepared data to package

> mtsdata1 <- read.csv("MTS-Abnormal.csv") %>% mutate(abnormal=1)
> usethis::use_data(mtsdata1)
✔ Creating 'data/'
✔ Saving 'mtsdata1' to 'data/mtsdata1.rda'

> mtsdata2 <- read.csv("MTS-Normal.csv") %>% mutate(normal=1)
> usethis::use_data(mtsdata2)
✔ Saving 'mtsdata2' to 'data/mtsdata2.rda'
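For reference, here's a sketch of what the finished 'data-raw/DATASET.R' script could look like, combining the two commands above (a hypothetical version following the usethis convention, not my actual file):

# data-raw/DATASET.R -- data preparation script
library(dplyr)  # provides mutate() and the %>% pipe

mtsdata1 <- read.csv("MTS-Abnormal.csv") %>% mutate(abnormal = 1)
mtsdata2 <- read.csv("MTS-Normal.csv") %>% mutate(normal = 1)

usethis::use_data(mtsdata1, mtsdata2, overwrite = TRUE)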

Magically, this added my two files (in .rda format) into my /data directory. (Now, though, I don’t know why the /data-raw directory is there… maybe we’ll figure that out later.) I decided it was time to commit these to my repository again.

Following the instructions, I re-knit the README.Rmd, and then it was possible to commit everything to Github again. At that point I ended up in a fistfight with git, saved again only by my software engineer partner, who uses Github all the time.

I think it should be working. The next test will be if anyone can install this from github using devtools. Let me know if it works for you… it works for me locally, but you know how that goes. The next post will show you how to use it 🙂

install.packages("devtools")
install_github("NicoleRadziwill/easyMTS")

SEE WHAT WILL BECOME THE easyMTS VIGNETTE –>

My First R Package (Part 2)

In Part 1, I set up RStudio with usethis, and created my first Minimum Viable R Package (MVRP?) which was then pushed to Github to create a new repository.

I added a README:

> use_readme_rmd()
✔ Writing 'README.Rmd'
✔ Adding '^README\\.Rmd$' to '.Rbuildignore'
● Modify 'README.Rmd'
✔ Writing '.git/hooks/pre-commit'

Things were moving along just fine, until I got this unkind message (what do you mean NOT an R package???!! What have I been doing the past hour?)

> use_testthat()
Error: `use_testthat()` is designed to work with packages.
Project 'easyMTS' is not an R package.

> use_mit_license("Nicole Radziwill")
✔ Setting active project to 'D:/R/easyMTS'
Error: `use_mit_license()` is designed to work with packages.
Project 'easyMTS' is not an R package.

Making easyMTS a Real Package

I sent out a tweet hoping to find some guidance, because Stack Overflow and Google and the RStudio community were coming up blank. As soon as I did, I discovered the Build pane in RStudio.

The first time I ran it, it complained that I needed Rtools, but that Rtools didn’t exist for version 3.6.1. I decided to try finding and installing Rtools anyway, because what did I have to lose? I went to my favorite CRAN mirror and found a link for Rtools just under the link for the base install.

I’m on Windows 10, so this downloaded an .exe which I quickly right-clicked on to run… the installer did its thing, and I clicked “Finish”, assuming that all was well. Then I went back into RStudio and tried to do Build -> Clean and Rebuild… and here’s what happened:

IT WORKED!! (I think!!!)

It created a package (top right) and then loaded it into my RStudio session (bottom left)! It loaded the package name into the Packages pane (bottom right)!

I feel like this is a huge accomplishment for now, so I’m going to move to Part 3 of my blog post. We’ll figure out how to close the gaps that I’ve invariably introduced by veering off-tutorial.

GO TO PART 3 –>

My First R Package (Part 1)

(What does this new package do? Find out here.)

I have had package-o-phobia for years, and have skillfully resisted learning how to build a new R package. However, I do have a huge collection of scripts on my hard drive with functions in them, and I keep a bunch of useful functions up on Github so anyone who wants can source and use them. I source them myself! So, really, there’s no reason to package them up and (god forbid) submit them to CRAN. I’m doing fine without packages!

Reality check: NO. As I’ve been told by so many people, if you have functions you use a lot, you should write a package. You don’t even have to think about a package as something you write so that other people can use. It is perfectly fine to write a package for an audience of one — YOU.

But I kept making excuses for myself until very recently, when I couldn’t find a package to do something I needed to do, and all the other packages were either not getting the same answers as in book examples OR they were too difficult to use. It was time.

So armed with moral support and some exciting code, I began the journey of a thousand miles with the first step, guided by Tomas Westlake and Emil Hvitfeldt and of course Hadley. I already had some of the packages I needed, but did not have the most magical one of all, usethis:

install.packages("usethis")

library(usethis)
library(roxygen2)
library(devtools)

Finding a Package Name

First, I checked to see if the package name I wanted was available. It was not available on CRAN, which was sad:

> available("MTS")
Urban Dictionary can contain potentially offensive results,
  should they be included? [Y]es / [N]o:
1: Y
-- MTS -------------------------------------------------------------------------
Name valid: ✔
Available on CRAN: ✖ 
Available on Bioconductor: ✔
Available on GitHub:  ✖ 
Abbreviations: http://www.abbreviations.com/MTS
Wikipedia: https://en.wikipedia.org/wiki/MTS
Wiktionary: https://en.wiktionary.org/wiki/MTS

My second package name was available though, and I think it’s even better. I’ve written code to easily create and evaluate diagnostic algorithms using the Mahalanobis-Taguchi System (MTS), so my target package name is easyMTS:

> available("easyMTS")
-- easyMTS ------------------------------------------------------------
Name valid: ✔
Available on CRAN: ✔ 
Available on Bioconductor: ✔
Available on GitHub:  ✔ 
Abbreviations: http://www.abbreviations.com/easy
Wikipedia: https://en.wikipedia.org/wiki/easy
Wiktionary: https://en.wiktionary.org/wiki/easy
Sentiment:+++

Create Minimum Viable Package

Next, I set up the directory structure locally. Another RStudio session started up on its own; I’m hoping this is OK.

> create_package("D:/R/easyMTS")
✔ Creating 'D:/R/easyMTS/'
✔ Setting active project to 'D:/R/easyMTS'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: easyMTS
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
    * First Last <first.last@example.com> [aut, cre] (<https://orcid.org/YOUR-ORCID-ID>)
Description: What the package does (one paragraph).
License: What license it uses
Encoding: UTF-8
LazyData: true
✔ Writing 'NAMESPACE'
✔ Writing 'easyMTS.Rproj'
✔ Adding '.Rproj.user' to '.gitignore'
✔ Adding '^easyMTS\\.Rproj$', '^\\.Rproj\\.user$' to '.Rbuildignore'
✔ Opening 'D:/R/easyMTS/' in new RStudio session
✔ Setting active project to '<no active project>'

Syncing with Github

use_git_config(user.name = "nicoleradziwill", user.email = "nicole.radziwill@gmail.com")

browse_github_token()

This took me to a page on Github where I entered my password, and then had to go down to the bottom of the page to click on the green button that said “Generate Token.” They said I would never be able to see it again, so I gmailed it to myself for easy searchability. Next, I put this token where it is supposed to be locally:

edit_r_environ()

A blank file popped up in RStudio, and I added this line, then saved the file to its default location (not my real token):

GITHUB_PAT=e54545x88f569fff6c89abvs333443433d

Then I had to restart R and confirm it worked:

github_token()

This revealed my token! I must have done the Github setup right.
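(Equivalently, base R can read the environment variable directly:)

Sys.getenv("GITHUB_PAT")  # returns the token string if .Renviron was loaded

Finally I could proceed with the rest of the git setup: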

> use_github()
✔ Setting active project to 'D:/R/easyMTS'
Error: Cannot detect that project is already a Git repository.
Do you need to run `use_git()`?
> use_git()
✔ Initialising Git repo
✔ Adding '.Rhistory', '.RData' to '.gitignore'
There are 5 uncommitted files:
* '.gitignore'
* '.Rbuildignore'
* 'DESCRIPTION'
* 'easyMTS.Rproj'
* 'NAMESPACE'
Is it ok to commit them?

1: No
2: Yeah
3: Not now

Selection: use_github()
Enter an item from the menu, or 0 to exit
Selection: 2
✔ Adding files
✔ Commit with message 'Initial commit'
● A restart of RStudio is required to activate the Git pane
Restart now?

1: No way
2: For sure
3: Nope

Selection: 2

When I tried to commit to Github, it was asking me if the description was OK, but it was NOT. Every time I said no, it kicked me out. Turns out it wanted me to go directly into the DESCRIPTION file and edit it, so I did. I used Notepad because this was crashing RStudio. But this caused a new problem:

Error: Uncommited changes. Please commit to git before continuing.

This is the part of the exercise where it’s great to be living with a software engineer who uses git and Github all the time. He pointed me to a tiny little tab that said “Terminal” in the bottom left corner of RStudio, just to the right of “Console”. He told me what to type there, and it unstuck me.

THEN, when I went back to the Console, it all worked:

> use_git()
> use_github()
✔ Checking that current branch is 'master'
Which git protocol to use? (enter 0 to exit) 

1: ssh   <-- presumes that you have set up ssh keys
2: https <-- choose this if you don't have ssh keys (or don't know if you do)

Selection: 2
● Tip: To suppress this menu in future, put
  `options(usethis.protocol = "https")`
  in your script or in a user- or project-level startup file, '.Rprofile'.
  Call `usethis::edit_r_profile()` to open it for editing.
● Check title and description
  Name:        easyMTS
  Description: 
Are title and description ok?

1: Yes
2: Negative
3: No

Selection: 1
✔ Creating GitHub repository
✔ Setting remote 'origin' to 'https://github.com/NicoleRadziwill/easyMTS.git'
✔ Pushing 'master' branch to GitHub and setting remote tracking branch
✔ Opening URL 'https://github.com/NicoleRadziwill/easyMTS'

This post is getting long, so I’ll split it into parts. See you in Part 2.

GO TO PART 2 –>

A Decade of PhD: What I’ve Learned from Academia and Industry

Today is Cinco de Mayo! It’s also the 10th Anniversary of my PhD defense (in Quality Systems)… something I carefully timed for late afternoon on this day in 2009. (I wanted to make sure I could celebrate the joyful occasion — or drown my sorrows — with 2-for-1 margaritas. Fortunately, the situation was liquid joy; unfortunately, I still got a hangover.)

I’m writing this post to share what I’ve learned about the value of getting a PhD (is there value?) and the applicability of PhD-level work to industry. If you’re considering more education, maybe this will help you decide whether it’s the right choice. If you’re in industry and trying to figure out whether to hire PhDs, some of what I write here might help. But first, some background!

I never even thought I’d get a PhD — it certainly didn’t happen out of intent or design. My family was poor (my dad was an East Prussian refugee whose family lost a couple hundred years’ worth of assets and had to start from scratch in the U.S. in the 1960s, and my mom’s grandparents were very poor Irish laborers who came to the U.S. in the early 1900s), so I studied ridiculously hard to “escape”. I didn’t think I was smart enough for a PhD, even though I started college at 16 taking half undergrad classes and half grad classes in meteorology. I aced my grad classes and very maturely ignored my required classes, so I got kicked out. (At the same time, I wasn’t really fitting in with people… my roommate called me “Nerdcole”.) When I was let back in, the department head wouldn’t let me take any grad classes, so I got bored and burned out… not surprising, since I was supporting myself and working three jobs to make that happen. I quit school to work at an e-commerce startup when I was 18. A few months later, thanks to (good) peer pressure, I took three credit-by-exam tests to see if they would get me over the finish line, and thanks to some side skills I had picked up in vector calculus and statistics, it worked and I got the BS. But I was still left with a pretty bad GPA, and even worse self-esteem, and I was convinced no one would ever let me into grad school.

I figured I’d focus on industry and help companies grow. There was no other choice.

The Back Story

After spending a couple years building web sites and storefronts (a huge feat in 1995 and 1996!) I took a job at a national lab as a systems analyst, supporting established scientists and engineers and helping them get work done. The main lesson I learned during this time was: Alignment between strategy and objectives doesn’t come for free (teams of people have to spend dedicated time on it), and most people are really disorganized. There had to be a better way to get work done.

A few years later, I was a traveling Solutions Architect, parachuted once or twice a month into CRM software implementation fiascos around the globe. My job was to figure out what to do to turn these jobs around — was it a people problem? An architecture problem? A training problem? A systems thinking problem? A little of everything? I had a couple weeks to make a recommendation, and then I was on to the next project (results were usually pretty good). But since this required evaluating technology decisions in the context of business and financial constraints, my boss suggested that I use the tuition benefits offered by my job to get an MBA. I had taken 9 credits of science and industrial engineering classes since I’d graduated, so I contacted two of the local MBA schools to see if they’d accept me and my credits. Sure enough, one of them did! I took evening classes for a year and a half, and eventually ended up with an MBA. But I never thought I could (or would) go farther — I’m not that smart, I’d tell myself. Also, it’s expensive. Also, a PhD would probably make me less marketable. (All lies, spoken by a lack of confidence and a heavy dose of impostor syndrome.)

Shortly thereafter, the travel started to get to me (I was flying at least three days a week), so I looked for an opportunity to grow and cultivate a software development organization. (That’s how I ended up leading monitor and control systems and data management at NRAO.) A little management led to a lot of management. A few years later one of the organization’s leaders casually said it was “too bad I didn’t have a PhD” — because in a highly scientific and technical organization like NRAO, it would give me more credibility and make me a better leader.

“Will you pay for it?” I asked. “Sure,” they said. I just had to find a suitable program that wouldn’t require me to go full time. I’ve always loved learning, and I couldn’t resist the temptation of free education — even if it meant I’d have to balance the demands of a challenging full-time job and a first-time baby at the same time. That’s how much I love learning, just for learning’s sake! I still didn’t think a PhD had that much value, unless you were studying to be a lab scientist or you were dead set on becoming a historian and teaching for the rest of your life. None of these personas was me, but the free education thing sold me, and I didn’t really think about how relevant this step was to my career direction until much later.

Fortunately, I found the perfect program for me — a hybrid academic/practitioner PhD that would help me develop the research and analytical skills to solve practical problems in business and industrial technology.

The next few years were pretty rough. By the time I got my PhD, I was in my 14th year of post-college professional employment. First lesson learned: it’s probably not the best move to start PhD coursework when you have a three-month-old. I have no idea how I made it through.

Shortly after graduating, the impacts of the financial crisis hit our federally funded organization and I was able to segue into a second career as a college professor, teaching data science, manufacturing, and EHSQ classes. For the past year, I’ve been back in industry (maybe permanently; we’ll see) and have a better sense of the value of PhDs in industry.

Value of Getting a PhD

There are lots of reasons I’m happy with the time I spent getting a PhD, other than the fact that it helped me get an entirely new job when the economy was down:

  • First and foremost, I’m a better critical thinker. It’s now my nature to look at all parts of a problem, examine the interactions between them, and make sure I have all the background information required before I start working on a problem.
  • I’m a better writer too. I look at reports and presentations I wrote years ago, and can see all the holes and places where I made assumptions that weren’t valid.
  • I developed a new appreciation for clarity. Researchers want to make sure their messages, methodologies, and models are clear and unambiguous… and through that contrast, I was able to recognize that in industry, there’s often pressure to skip due diligence and move fast to perform. This pressure leads to ambiguity, which tends towards what I call “intellectual waste” – people assuming that they see a problem or a project in the same way because they haven’t taken the time to guarantee clarity.
  • It’s easier for me to quickly determine whether information might be true or false, or whether there are gaps that need to be closed before moving forward. (It’s possible that this skill is more from grading and evaluating student work… something that’s orders of magnitude harder than it seems.)
  • I realized that words matter. Really thinking about how one person will respond to a word or phrase, and whether it conveys the meaning that you intend, is a craft — one that’s enhanced by working with collaborators.
  • And although I knew this one prior to the PhD, I found that data matters. Where did your data come from? Can you access the original sources? What kind of people (or instruments) gathered it? Can you trust them? The quality of your data — and the suitability of the methods you choose — will impact the quality and integrity of the conclusions you generate from it. Awareness of these factors is essential.

Value of Caution

One of the biggest lessons was the most surprising. Early on in the PhD program I was told that my opinion didn’t count — regardless of how many years of experience I had. Every statement I made had to be backed up and cited, preferably using material that had been peer-reviewed by other qualified people. At first I was kind of offended by this… didn’t these academics have ANY sense of the value of actual real-world employment? Apparently not.

But something funny happened as I developed the habit of looking for solid references, distilling their messages, and citing them accurately: I became more careful. And in the evolution of my caution and attention to detail, the quality of my work — ANY work — improved tremendously. I was able to learn from what other people had discovered, and anticipate (and resolve) problems in advance. I learned that “standing on the shoulders of giants” actually means figuring out when solved problems already exist so you don’t waste time reinventing wheels.

Something else funny happened as soon as I graduated: all of a sudden, people were asking me for my opinion. But the habit of due diligence was so ingrained that I couldn’t express my opinion… I was compelled to back up any opinion with facts!

(I think this was the point all along. Go figure.)

The beauty of going through the entire messy process of PhD coursework and comps and research and defense and editing — the entire end-to-end process, not cutting out in the middle anywhere — is that it gave me the discipline and process to root out accurate and complete answers to problems. Or at the very least, to call out the gaps that need to be closed to get there.

There’s a lot of pressure in industry to move fast, but due diligence is still critical for accurate self-assessment and effective cross-functional communication. Slowing down and figuring out how you know what you know — and making sure everyone is literally on the same page — can help your organization achieve its goals more quickly.

Value of PhDs to Industry

So employers (especially in tech) — should you hire PhDs? Yes. Here’s why:

  • PhDs are trained to find gaps in knowledge and understanding. Is your strategic plan grounded in reality, or is it just wishful thinking? Are your Project Charters well scoped, budgeted, and planned out? Is your workforce prepared to carry out your strategic initiatives?
  • Many PhDs with experience teaching undergrads are great at making complex topics accessible to other audiences. This is fantastic for training, cross-training, and marketing.
  • PhDs love research and writing, and can help you with gathering and interpreting data, and with content marketing.
  • PhDs love learning. Want to be on the cutting edge? They’re great in R&D… they can help you distill new insights from research papers and interpret and apply them accurately.
  • If you want to do AI or machine learning, or anything that uses Big Data, make sure you have at least one PhD statistician with practical analytical experience. They can prevent you from spending millions on dead ends and help you apply Occam’s Razor to avoid unnecessary complexity (the kind that can lead to technical debt later).

Will there be drawbacks? Sure. The habit of caution may need to be tempered somewhat — you don’t have to probe all the way to the bottom of an issue to generate useful information that a business can use to make progress. (This is where the perception comes from that “PhDs are slow.” If you’re a PhD, always ask yourself: How can I move this project forward as quickly as possible? Your job is to find a path forward, not find objections or cause the work to stagnate. Of course this is good guidance for anyone on a team or in a workgroup.)

Bottom line… don’t be afraid of PhDs! We are mere mortals who just happen to have spent several years trying to figure out how to get to the core — the fundamental truths — of a complex problem. As a result we know how to approach complex problems like this — problems that many businesses have lots of. (We are not overqualified at all… we just have an extra skill set in something you desperately need, but may not realize you need it.)

Getting a PhD was challenging, frustrating, and maddening at times (especially the final part of getting your camera-ready text ready for ProQuest). I never planned to do it, but I’d totally do it again. I think my only regret is that I got a PhD in a hybrid business/industrial engineering discipline… it allowed me the freedom to pursue my interests, but if I was at the same crossroads now, I’d get a PhD in statistics to complement my MBA. Overall, this is a pretty tiny regret.

Imperfect Action is Better Than Perfect Inaction: What Harry Truman Can Teach Us About Loss Functions (with an intro to ggplot)

One of the heuristics we use at Intelex to guide decision making is former US President Truman’s advice that “imperfect action is better than perfect inaction.” What it means is — don’t wait too long to take action, because you don’t want to miss opportunities. Good advice, right?

When I share this with colleagues, I often hear a response like: “that’s dangerous!” To which my answer is “well sure, sometimes, but it can be really valuable depending on how you apply it!” The trick is: knowing how and when.

Here’s how it can be dangerous. For example, statistical process control (SPC) exists to keep us from tampering with processes — from taking imperfect action based on random variation, which will not only get us nowhere, but can exacerbate the problem we were trying to solve. The secret is to apply Truman’s heuristic based on an understanding of exactly how imperfect is OK with your organization, based on your risk appetite. And this is where loss functions can help.

Along the way, we’ll demonstrate how to do a few important things related to plotting with the ggplot2 package in R, gradually adding in new elements to the plot so you can see how it’s layered, including:

  • Plot a function based on its equation
  • Add text annotations to specific locations on a ggplot
  • Draw horizontal and vertical lines on a ggplot
  • Draw arrows on a ggplot
  • Add extra dots to a ggplot
  • Eliminate axis text and axis tick marks

What is a Loss Function?

A loss function quantifies how unhappy you’ll be based on the accuracy or effectiveness of a prediction or decision. In the simplest case, you control one variable (x) which leads to some cost or loss (y). For the case we’ll examine in this post, the variables are:

  • How much time and effort you put into scoping and characterizing the problem (x); we assume that time+effort invested leads to real understanding
  • How much it will cost you (y); can be expressed in terms of direct costs (e.g. capex + opex) as well as opportunity costs or intangible costs (e.g. damage to reputation)

Here is an example of what this might look like, if you have a situation where overestimating (putting in too much x) OR underestimating (putting in too little x) are both equally bad. In this case, x=10 is the best (least costly) decision or prediction:

[Plot: a typical squared loss function]
# describe the equation we want to plot
parabola <- function(x) ((x-10)^2)+10  

# initialize ggplot with a dummy dataset
library(ggplot2)
p <- ggplot(data = data.frame(x=0), mapping = aes(x=x)) 

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) +
     xlab("x = the variable you can control") + 
     ylab("y = cost of loss ($$)")

In regression (and other techniques where you’re trying to build a model to predict a quantitative dependent variable), mean square error is a squared loss function that helps you quantify error. It captures two facts: the farther away you are from the correct answer, the worse the error is — and both overestimating and underestimating are bad (which is why you square the values). Across this and related techniques, the loss function captures these characteristics:

[Figure from http://www.cs.cornell.edu/courses/cs4780/2015fa/web/lecturenotes/lecturenote10.html]
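As a toy illustration of squared loss (all of these numbers are invented for the example):

# hypothetical observed values and model predictions
actual    <- c(10, 12, 15, 11)
predicted <- c( 9, 14, 15,  8)

squared.error <- (actual - predicted)^2
squared.error        # 1 4 0 9: an error of 3 costs 9x an error of 1
mean(squared.error)  # mean square error (MSE) = 3.5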

Not all loss functions have that general shape. For classification, for example, the 0-1 loss function tells the story that if you get a classification wrong (the quantity y*f(x) on the horizontal axis is negative) you incur all the penalty or loss (a loss of 1), whereas if you get it right (y*f(x) is positive) there is no penalty or loss (a loss of 0):

# set up data frame of red points
d.step <- data.frame(x=c(-3,0,0,3), y=c(1,1,0,0))

# note that the loss function really extends to x=-Inf and x=+Inf
ggplot(d.step) + geom_step(mapping=aes(x=x, y=y), direction="hv") +
     geom_point(mapping=aes(x=x, y=y), color="red") + 
     xlab("y* f(x)") + ylab("Loss (Cost)") +  
     ggtitle("0-1 Loss Function for Classification")

Use the Loss Function to Make Strategic Decisions

So let’s get back to Truman’s advice. Ideally, we want to choose the x (the amount of time and effort to invest into project planning) that results in the lowest possible cost or loss. That’s the green point at the nadir of the parabola:

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen")

Costs get higher as we move up the y-axis:

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green")

And time+effort grows as we move along the x-axis (we might spend minutes on a problem at the left of the plot, or weeks to years by the time we get to the right hand side):

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3)

Planning too Little = Planning too Much = Costly

What this means is — if we don’t plan, or we plan just a little bit, we incur high costs. We might make the wrong decision! Or miss critical opportunities! But if we plan too much — we’re going to spend too much time, money, and/or effort compared to the benefit of the solution we provide.


p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3) +
     annotate(geom="text",x=3, y=85, label="Little (or no) Planning\nHIGH COST", color="red") +
     annotate(geom="text", x=18, y=85, label="Paralysis by Planning\nHIGH COST", color="red") +
     geom_vline(xintercept=0, linetype="dotted") + geom_hline(yintercept=0, linetype="dotted")

The trick is to FIND THAT CRITICAL LEVEL OF TIME and EFFORT invested to gain information and understanding about your problem… and then if you’re going to err, make sure you err towards the left — if you’re going to make a mistake, make the mistake that costs less and takes less time to make:

arrow.x <- c(10, 10, 10, 10)
arrow.y <- c(35, 50, 65, 80)
arrow.x.end <- c(6, 6, 6, 6)
arrow.y.end <- arrow.y
d <- data.frame(arrow.x, arrow.y, arrow.x.end, arrow.y.end)

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3) +
     annotate(geom="text",x=3, y=85, label="Little (or no) Planning\nHIGH COST", color="red") +
     annotate(geom="text", x=18, y=85, label="Paralysis by Planning\nHIGH COST", color="red") +
     geom_vline(xintercept=0, linetype="dotted") + 
     geom_hline(yintercept=0, linetype="dotted") +
     geom_vline(xintercept=10) +
     geom_segment(data=d, mapping=aes(x=arrow.x, y=arrow.y, xend=arrow.x.end, yend=arrow.y.end),
     arrow=arrow(), color="blue", size=2) +
     annotate(geom="text", x=8, y=95, size=2.3, color="blue",
     label="we prefer to be\non this side of the\nloss function")

Moral of the Story

The moral of the story is… imperfect action can be expensive, but perfect action is ALWAYS expensive. Spend less to make mistakes and learn from them, if you can! This is one of the value drivers for agile methodologies… agile practices can help improve communication and coordination so that the loss function is minimized.

## FULL CODE FOR THE COMPLETELY ANNOTATED CHART ##
# If you change the equation for the parabola, annotations may shift and be in the wrong place.
library(ggplot2)

parabola <- function(x) ((x-10)^2)+10

# initialize ggplot with a dummy dataset so this block runs on its own
p <- ggplot(data = data.frame(x=0), mapping = aes(x=x))

my.title <- expression(paste("Imperfect Action Can Be Expensive. But Perfect Action is ", italic("Always"), " Expensive."))

arrow.x <- c(10, 10, 10, 10)
arrow.y <- c(35, 50, 65, 80)
arrow.x.end <- c(6, 6, 6, 6)
arrow.y.end <- arrow.y
d <- data.frame(arrow.x, arrow.y, arrow.x.end, arrow.y.end)

p + stat_function(fun=parabola) + xlim(-2,23) + ylim(-2,100) + 
     xlab("Time Spent and Information Gained (e.g. person-weeks)") + ylab("$$ COST $$") +
     annotate(geom="text", x=10, y=5, label="Some Effort, Lowest Cost!!", color="darkgreen") +
     geom_point(aes(x=10, y=10), colour="darkgreen") +
     annotate(geom="text", x=0, y=100, label="$$$$$", color="green") +
     annotate(geom="text", x=0, y=75, label="$$$$", color="green") +
     annotate(geom="text", x=0, y=50, label="$$$", color="green") +
     annotate(geom="text", x=0, y=25, label="$$", color="green") +
     annotate(geom="text", x=0, y=0, label="$ 0", color="green") +
     annotate(geom="text", x=2, y=0, label="minutes\nof effort", size=3) +
     annotate(geom="text", x=20, y=0, label="months\nof effort", size=3) +
     annotate(geom="text",x=3, y=85, label="Little (or no) Planning\nHIGH COST", color="red") +
     annotate(geom="text", x=18, y=85, label="Paralysis by Planning\nHIGH COST", color="red") +
     geom_vline(xintercept=0, linetype="dotted") + 
     geom_hline(yintercept=0, linetype="dotted") +
     geom_vline(xintercept=10) +
     geom_segment(data=d, mapping=aes(x=arrow.x, y=arrow.y, xend=arrow.x.end, yend=arrow.y.end),
     arrow=arrow(), color="blue", size=2) +
     annotate(geom="text", x=8, y=95, size=2.3, color="blue",
     label="we prefer to be\non this side of the\nloss function") +
     ggtitle(my.title) +
     theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(),
     axis.text.y=element_blank(), axis.ticks.y=element_blank()) 

Now sometimes you need to make this investment! (Think nuclear power plants, or constructing aircraft carriers or submarines.) Don’t get caught up in getting your planning investment perfectly optimized — but do be aware of the trade-offs, and go into the decision deliberately, based on the risk level (and regulatory nature) of your industry, and your company’s risk appetite.
