My New Favorite Statistics & Data Analysis Book Using R
NOTE: The 3rd Edition (Purple Swan) was released in April 2019. There is a companion book that presents end-to-end examples of each of the methods.
As of today, I now have a NEW FAVORITE introductory statistics textbook… the one I’ve always dreamed of having. I’ve been looking for a book to use in my classes for undergraduate sophomores and juniors, but none of the textbooks I considered over the past three years (and I’ve looked at over a hundred!) had all of the things I really, really wanted. So I had to go make it happen myself. These things are:
1) An integrated treatment of theory and practice. All of my stats textbooks have a lot of formulas, and no information about how to do what the formulas do in the R statistical software. All of my R textbooks have a lot of information about how to run the commands, but not really much information about what formulas are being used. I wanted a book that would show how to solve problems analytically (using the equations), and then show how they’re done in R. If there were discrepancies between the stats textbook answers and the R answers, I wanted to know why. A lot of times, the developers of R packages use very sophisticated adjustments and corrections, which I only became aware of because my analytical solutions didn’t match the R output. At first, I thought I was wrong. But later, I realized I was right, and R was right: we were just doing different things. I wanted my students to know what was going on under the hood, and have an awareness of exactly which methods R was using at every moment.
2) An easy way to develop research questions for observational studies and organize the presentation of results. We always do small research projects in my classes, and in my opinion, this is the best way for students to get a strong grasp of the fundamental statistical concepts. But they always have the same questions: Which statistical test should I use? How should I phrase my research question? What should I include in my report? I wanted a book that made developing statistical research questions easy. In fact, I know a lot of people I went to PhD school with that would have loved to have this book while they were proposing, conducting, and defending their dissertations.
3) A confidence interval cookbook. This is probably one of the most important things I want my students to leave my class remembering: that from whatever sample you collect, you can construct a confidence interval that will give you an idea of what the true population parameter should be. You don’t even need to do a hypothesis test! but it can be difficult to remember which formula to use… so I wanted an easy reference where I’d be able to look things up, and find out really easily how to use R to construct those confidence intervals for me. Furthermore, some of the confidence intervals that everyone is taught in an introductory statistics course are wildly inaccurate – and statisticians know this. But they hesitate to scare away novice data analysts with long, scary looking equations, and so students keep learning those inaccurate methods and believing they’re good. Since so many people never get beyond introductory statistics and still turn into researchers in other fields, I thought this was horrible. I want to make sure my students know the best way to do each confidence interval in their first class… even if the equations are not as friendly.
4) An inference test cookbook. I wanted a book that stepped me through each of the primary parametric inference tests analytically (using the equations), and then showed me how it was done in R. If there were discrepancies, I wanted to know why. I wanted an easy way to remember the assumptions for each test, and when to use a pooled standard deviation versus an unpooled one. There’s a lot to keep track of! I wanted a reference that it would make it easy to keep track of all of it: assumptions, tests for assumptions, equations, R code, and diagnostic plots.
5) No step left behind. It’s really frustrating to me how so many R books assume you can do a psychic fill-in-the-blank for missing code. Since I’ve been using R for several years now, I’ve gotten to the point where my psychic abilities are pretty good, and at least 60% of the time I can figure out the missing pieces. But wow, what a waste of time! So I wanted a book that had all of the steps for each example. Even if it was a little repetitive. I may have missed this in a few places, but I think beginners will have a much easier time with this book. Also, I put all my data and functions on GitHub for people to run the examples with. I’m growing this slowly, but I don’t want people to be left in the lurch.
6) An easy way to produce any of the charts and graphs in the book. One of my pet peeves about R books is that the authors generate beautiful charts and graphs, and then you’re reading through the book and say “Yes!! Yes!! That’s the chart I need for my report… I want to do that… how did they do that?” and they don’t tell you anywhere how they did it. I did not want there to be any secrets in this book. If I generated a page of interesting looking simulated distributions, I wanted you to know how I did it (just in case you want to do it later).
GRANTED… I am sure it will not be perfect – no book is. (For example, Google Forms changes a lot and there are a couple examples that use it that will probably be outdated when the book gets to press… and I just found out this morning that you don’t need the source_https trick in R 3.2.0 and beyond.) [Note: data access has been fully updated in the 2nd Edition.] However, I will keep updating my blog with posts about useful things as they evolve.
In any case, I hope you enjoy my book as much as I’ve been enjoying using it as a reference for myself… it really is all my most important notes, neatly organized into just over 500 pages of everything I want to remember. And everything I want to make sure my students take with them after they leave my class.
[Note: Any errors and omissions from earlier printings (which have been taken care of in later printings) are being recorded at http://qualityandinnovation.com/errata/.]