Tag Archives: applied statistics

Where Do Z-Score Tables Come From? (+ how to make them in R)

Every student learns how to look up areas under the normal curve using Z-Score tables in their first statistics class. But what is less commonly covered, especially in courses where calculus is not a prerequisite, is where those Z-Score tables come from: figuring out the area under the normal curve for all possible places you could chop it into two, then making a table from it.

You get the z-score by evaluating the integral of the equation for the bell-shaped normal curve, usually from -Inf to the z-score of interest. This is the same thing that the R command pnorm does when you provide it with a z-score. Here is the slide presentation I put together to explain the use and origin of the Z-Score table, and how it relates to pnorm and qnorm (the command that lets you input an area to find the z-score at which the area to the left is swiped out). It’s free to use under Creative Commons, and is part of the course materials that is available for use with this 2017 book.

One of the fun things I did was to make my own z-score table in R. I don’t know why anyone would WANT to do this — they are easy to find in books, and online, and if you know how to use pnorm and qnorm, you don’t need one at all. But, you can, and here’s how.

First, let’s create a z-score table just with left-tail areas. Using symmetry, we can also use this to get any areas in the right tail, because the area to the left of any -z is the same as any area to the right of any +z. Even though the z-score table contains areas in its cells, our first step is to create a table just of the z-scores that correspond to each cell:

c0 <- seq(-3.4,0,.1)
c1 <- seq(-3.41,0,.1)
c2 <- seq(-3.42,0,.1)
c3 <- seq(-3.43,0,.1)
c4 <- seq(-3.44,0,.1)
c5 <- seq(-3.45,0,.1)
c6 <- seq(-3.46,0,.1)
c7 <- seq(-3.47,0,.1)
c8 <- seq(-3.48,0,.1)
c9 <- seq(-3.49,0,.1)


z <- cbind(c0,c1,c2,c3,c4,c5,c6,c7,c8,c9)
z</pre>
<pre>c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
[1,] -3.4 -3.41 -3.42 -3.43 -3.44 -3.45 -3.46 -3.47 -3.48 -3.49
[2,] -3.3 -3.31 -3.32 -3.33 -3.34 -3.35 -3.36 -3.37 -3.38 -3.39
[3,] -3.2 -3.21 -3.22 -3.23 -3.24 -3.25 -3.26 -3.27 -3.28 -3.29
[4,] -3.1 -3.11 -3.12 -3.13 -3.14 -3.15 -3.16 -3.17 -3.18 -3.19
[5,] -3.0 -3.01 -3.02 -3.03 -3.04 -3.05 -3.06 -3.07 -3.08 -3.09
[6,] -2.9 -2.91 -2.92 -2.93 -2.94 -2.95 -2.96 -2.97 -2.98 -2.99
[7,] -2.8 -2.81 -2.82 -2.83 -2.84 -2.85 -2.86 -2.87 -2.88 -2.89
[8,] -2.7 -2.71 -2.72 -2.73 -2.74 -2.75 -2.76 -2.77 -2.78 -2.79
[9,] -2.6 -2.61 -2.62 -2.63 -2.64 -2.65 -2.66 -2.67 -2.68 -2.69
[10,] -2.5 -2.51 -2.52 -2.53 -2.54 -2.55 -2.56 -2.57 -2.58 -2.59
[11,] -2.4 -2.41 -2.42 -2.43 -2.44 -2.45 -2.46 -2.47 -2.48 -2.49
[12,] -2.3 -2.31 -2.32 -2.33 -2.34 -2.35 -2.36 -2.37 -2.38 -2.39
[13,] -2.2 -2.21 -2.22 -2.23 -2.24 -2.25 -2.26 -2.27 -2.28 -2.29
[14,] -2.1 -2.11 -2.12 -2.13 -2.14 -2.15 -2.16 -2.17 -2.18 -2.19
[15,] -2.0 -2.01 -2.02 -2.03 -2.04 -2.05 -2.06 -2.07 -2.08 -2.09
[16,] -1.9 -1.91 -1.92 -1.93 -1.94 -1.95 -1.96 -1.97 -1.98 -1.99
[17,] -1.8 -1.81 -1.82 -1.83 -1.84 -1.85 -1.86 -1.87 -1.88 -1.89
[18,] -1.7 -1.71 -1.72 -1.73 -1.74 -1.75 -1.76 -1.77 -1.78 -1.79
[19,] -1.6 -1.61 -1.62 -1.63 -1.64 -1.65 -1.66 -1.67 -1.68 -1.69
[20,] -1.5 -1.51 -1.52 -1.53 -1.54 -1.55 -1.56 -1.57 -1.58 -1.59
[21,] -1.4 -1.41 -1.42 -1.43 -1.44 -1.45 -1.46 -1.47 -1.48 -1.49
[22,] -1.3 -1.31 -1.32 -1.33 -1.34 -1.35 -1.36 -1.37 -1.38 -1.39
[23,] -1.2 -1.21 -1.22 -1.23 -1.24 -1.25 -1.26 -1.27 -1.28 -1.29
[24,] -1.1 -1.11 -1.12 -1.13 -1.14 -1.15 -1.16 -1.17 -1.18 -1.19
[25,] -1.0 -1.01 -1.02 -1.03 -1.04 -1.05 -1.06 -1.07 -1.08 -1.09
[26,] -0.9 -0.91 -0.92 -0.93 -0.94 -0.95 -0.96 -0.97 -0.98 -0.99
[27,] -0.8 -0.81 -0.82 -0.83 -0.84 -0.85 -0.86 -0.87 -0.88 -0.89
[28,] -0.7 -0.71 -0.72 -0.73 -0.74 -0.75 -0.76 -0.77 -0.78 -0.79
[29,] -0.6 -0.61 -0.62 -0.63 -0.64 -0.65 -0.66 -0.67 -0.68 -0.69
[30,] -0.5 -0.51 -0.52 -0.53 -0.54 -0.55 -0.56 -0.57 -0.58 -0.59
[31,] -0.4 -0.41 -0.42 -0.43 -0.44 -0.45 -0.46 -0.47 -0.48 -0.49
[32,] -0.3 -0.31 -0.32 -0.33 -0.34 -0.35 -0.36 -0.37 -0.38 -0.39
[33,] -0.2 -0.21 -0.22 -0.23 -0.24 -0.25 -0.26 -0.27 -0.28 -0.29
[34,] -0.1 -0.11 -0.12 -0.13 -0.14 -0.15 -0.16 -0.17 -0.18 -0.19
[35,] 0.0 -0.01 -0.02 -0.03 -0.04 -0.05 -0.06 -0.07 -0.08 -0.09

Now that we have slots for all the z-scores, we can use pnorm to transform all those values into the areas that are swiped out to the left of that z-score. This part is easy, and only takes one line. The remaining three lines format and display the z-score table:

zscore.df <- round(pnorm(z),4)
row.names(zscore.df) <- sprintf("%.2f", c0)
colnames(zscore.df) <- seq(0,0.09,0.01)


zscore.df

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.40 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
-3.30 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
-3.20 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
-3.10 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
-3.00 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
-2.90 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
-2.80 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
-2.70 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
-2.60 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
-2.50 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
-2.40 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
-2.30 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
-2.20 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
-2.10 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
-2.00 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
-1.90 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
-1.80 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
-1.70 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
-1.60 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
-1.50 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
-1.40 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
-1.30 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
-1.20 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
-1.10 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
-1.00 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
-0.90 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
-0.80 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
-0.70 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
-0.60 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
-0.50 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
-0.40 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
-0.30 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
-0.20 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
-0.10 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.00 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

You can also draw a picture to go along with your z-score table, so that people remember which area they are looking up:

x <- seq(-4,4,0.1)
y <- dnorm(x)
plot(x,dnorm(x),type="l", col="black", lwd=3)
abline(v=-1,lwd=3,col="blue")
abline(h=0,lwd=3,col="black")
polygon(c(x[1:31],rev(x[1:31])), c(rep(0,31),rev(y[1:31])), col="lightblue")
[/code]
It looks like this:

z-score-table-icon-small

In the slides, code to produce a giant-tail z-score table is also provided (where the areas are > 50%).

Taking a Subset of a Data Frame in R

Note: Everything in this article is easier with dplyr and magrittr in tidyverse. I’ll write a followup sometime this year.


I just wrote a new chapter for my students describing how to subset a data frame in R. The full text is available at https://docs.google.com/document/d/1K5U11-IKRkxNmitu_lS71Z6uLTQW_fp6QNbOMMwA5J8/edit?usp=sharing but here’s a preview:

Let’s load in ChickWeight, one of R’s built in datasets. This contains the weights of little chickens at 12 different times throughout their lives. The chickens are on different diets, numbered 1, 2, 3, and 4. Using the str command, we find that there are 578 observations in this data frame, and two different categorical variables: Chick and Diet.

> data(ChickWeight)
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame': 578 obs. of 4 variables:
$ weight: num 42 51 59 64 76 93 106 125 149 171 ...
$ Time : num 0 2 4 6 8 10 12 14 16 18 ...
$ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
$ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "formula")=Class 'formula' length 3 weight ~ Time | Chick
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "outer")=Class 'formula' length 2 ~Diet
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "labels")=List of 2
..$ x: chr "Time"
..$ y: chr "Body weight"
- attr(*, "units")=List of 2
..$ x: chr "(days)"
..$ y: chr "(gm)"

Get One Column: Now that we have a data frame named ChickWeight loaded into R, we can take subsets of these 578 observations. First, let’s assume we just want to pull out the column of weights. There are two ways we can do this: specifying the column by name, or specifying the column by its order of appearance. The general form for pulling information from data frames is data.frame[rows,columns] so you can get the first column in either of these two ways:

ChickWeight[,1] # get all rows, but only the first column
ChickWeight[,c("weight")] # get all rows, and only the column named “weight”

Get Multiple Columns: If you want more than one column, you can specify the column numbers or the names of the variables that you want to extract. If you want to get the weight and diet columns, you would do this:

ChickWeight[,c(1,4)] # get all rows, but only 1st and 4th columns
ChickWeight[,c("weight","Diet")] # get all rows, only “weight” & “Diet” columns

If you want more than one column and those columns are next to each other, you can do this:

ChickWeight[,c(1:3)]

Get One Row: You can get the first row similarly to how you got the first column, and any other row the same way:

ChickWeight[1,] # get first row, and all columns
ChickWeight[82,] # get 82nd row, and all columns

Get Multiple Rows: If you want more than one row, you can specify the row numbers you want like this:

> ChickWeight[c(1:6,15,18,27),]
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
15 58 4 2 1
18 103 10 2 1
27 55 4 3 1

Using xda with googlesheets in R

Image Credit: Doug Buckley of http://hyperactive.to

Image Credit: Doug Buckley of http://hyperactive.to

Want to do a quick, exploratory data analysis in R of your data that’s stored in a spreadsheet on Google Drive? You’re in luck, because now you can use the new xda package in conjunction with Jenny Bryan‘s googlesheets. There are some quirks, though, and that’s what this post is all about.

Before proceeding, you should review this recent article from R-Bloggers called “Introducing xda”.

First, be sure to install the googlesheets and xda packages. Although googlesheets is on CRAN, xda is not, and you’ll have to bring it in directly from github. You can actually do the same for googlesheets if you like:

install.packages("devtools")
library(devtools)
install_github("jennybc/googlesheets")
install_github("ujjwalkarn/xda")
library(googlesheets)
library(xda)

Next, you’ll have to show R how to access your Google spreadsheet. While you are looking at your spreadsheet, go to File -> Publish to the Web. The URL that’s in the text box is the one you want to capture. Just to make sure it works, copy and paste it into a new browser address window and see if you can display your spreadsheet in your browser.

If you want to import the data at https://docs.google.com/spreadsheets/d/1DO0ksD8d-rn_j2Yn7DQKZDPBrhrvZTpgszewxokjWKU/pubhtml into R, for example, you’ll need to know the spreadsheet’s key. That’s the long string of unintelligible numbers and letters between the “d” and the “pubhtml”. So, my key would be “1DO0ksD8d-rn_j2Yn7DQKZDPBrhrvZTpgszewxokjWKU” — which you’ll see in this next block of code:

> my.gs <- gs_key("1DO0ksD8d-rn_j2Yn7DQKZDPBrhrvZTpgszewxokjWKU") 
> my.data <- gs_read(my.gs) # Retrieves data from googlesheets and places it into an R object. 
> my.df <- as.data.frame(my.data)  # Important! xda needs you to extract only the data in a data frame.

Now, you can access your data. Try head(my.df) to make sure you’ve imported it properly.

Next, it’s time for exploratory data analysis. There are three commands available:

  • numSummary – takes a data frame as an argument, provides descriptive statistics, quantiles, and missing data info for quantitative variables
  • charSummary – takes a data frame as an argument, provides counts, missing data info, and number of unique factors for quantitative variables
  • bivariate – takes a data frame and two quantitative variables as an argument, and performs a quick bivariate analysis (giving this categorical variables, or giving this one categorical and one quantitative variable, will throw an error)

Here’s what happens when you run those commands on the data you just loaded in from your Google spreadsheet:

> numSummary(my.df)
               n   mean    sd median   max  min  mode miss miss%    1%   5%   25%   50%   75%   90%   95%   99%
obs          200 100.50 57.88  100.5 200.0  1.0   1.0    0     0  2.99 11.0  50.8 100.5 150.2 180.1 190.0 198.0
heartrates   200  73.01  7.43   73.0  96.3 53.4  71.2    0     0 56.49 61.2  68.4  73.0  77.4  82.6  85.7  90.3
systolics    200 139.27 29.27  138.0 221.0 59.0 139.0    0     0 79.98 96.0 117.0 138.0 160.0 177.2 188.1 205.0
diastolics   200  87.76  9.74   87.7 116.4 62.4  85.2    0     0 66.01 72.2  81.9  87.7  93.7 100.3 104.5 108.3
bmis         200  25.53  3.06   25.0  33.1 18.4  24.7    0     0 19.00 21.0  23.5  25.0  27.7  29.6  31.2  32.8
ages         200  44.41 14.59   45.0  70.0 18.0  30.0    0     0 18.00 22.0  32.0  45.0  57.0  64.1  67.0  70.0
heartpm      200  72.26  3.55   72.2  83.8 64.2  74.2    0     0 64.72 66.2  69.8  72.2  74.2  76.4  78.7  81.4
fitnesslevel 200   2.62  1.17    3.0   4.0  1.0   4.0    0     0  1.00  1.0   2.0   3.0   4.0   4.0   4.0   4.0

> charSummary(my.df)
          n miss miss% unique
genders 200    0     0      2
smokers 200    0     0      2
group   200    0     0      4

> bivariate(my.df,'heartrates','bmis')
     bin_bmis min_heartrates max_heartrates mean_heartrates
1   (18.3,22]          53.40          85.60           72.80
2   (22,25.7]          55.70          90.70           72.87
3 (25.7,29.4]          60.30          96.30           73.45
4 (29.4,33.1]          56.50          90.30           72.46

Observations:

  • There is a fourth “Plot” command but I couldn’t get it to work on any googlesheetsdata. The xda package is looking for class(range) to be anything other than “function”, which it was for every sheet I attempted to load.
  • There really should be an extra column in xda that displays the enumeration of all the unique values for the factors. It felt great to know how many unique values there were, but I would love to be reminded of what they are too, unless there are too many of them.

Please share your experiences using xda & googlesheets together in the comments! Thanks!

Analytic Hierarchy Process (AHP) using preferenceFunction in ahp

Yesterday, I wrote about how to use gluc‘s new ahp package on a simple Tom-Dick-Harry one level decision making problem using Analytic Hierarchy Process (AHP). One of the cool things about that package is that in addition to specifying the pairwise comparisons directly using Saaty’s scale (below, from https://kristalaace2014.wordpress.com/2014/05/14/w12_al_vendor-evaluation/)…

saaty-scale

…you can also describe each of the Alternatives in terms of descriptive variables which you can use inside a function to make the pairwise comparisons automatically. This is VERY helpful if you have lots of criteria, subcriteria, or alternatives to evaluate!! For example, I used preferenceFunction to compare 55 alternatives using 6 criteria and 4 subcriteria, and was very easily able to create functions to represent my judgments. This was much easier than manually entering all the comparisons.

This post shows HOW I replaced some of my manual comparisons with automated comparisons using preferenceFunction. (The full YAML file is included at the bottom of this post for you to use if you want to run this example yourself.) First, recall that the YAML file starts with specifying the alternatives that you are trying to choose from (at the bottom level of the decision hierarchy) and some variables that characterize those alternatives. I used the descriptions in the problem statement to come up with some assessments between 1=not great and 10=great:

#########################
# Alternatives Section
# THIS IS FOR The Tom, Dick, & Harry problem at
# https://en.wikipedia.org/wiki/Analytic_hierarchy_process_%E2%80%93_leader_example
#
Alternatives: &alternatives
# 1= not well; 10 = best possible
# Your assessment based on the paragraph descriptions may be different.
  Tom:
    age: 50
    experience: 7
    education: 4
    leadership: 10
  Dick:
    age: 60
    experience: 10
    education: 6
    leadership: 6
  Harry:
    age: 30
    experience: 5
    education: 8
    leadership: 6
#
# End of Alternatives Section
#####################################

Here is a snippet from my original YAML file specifying my AHP problem manually ():

  children: 
    Experience:
      preferences:
        - [Tom, Dick, 1/4]
        - [Tom, Harry, 4]
        - [Dick, Harry, 9]
      children: *alternatives
    Education:
      preferences:
        - [Tom, Dick, 3]
        - [Tom, Harry, 1/5]
        - [Dick, Harry, 1/7]
      children: *alternatives

And here is what I changed that snippet to, so that it would do my pairwise comparisons automatically. The functions are written in standard R (fortunately), and each function has access to a1 and a2 (the two alternatives). Recursion is supported which makes this capability particularly useful. I tried to write a function using two of the characteristics in the decision (a1$age and a1$experience) but this didn’t seen to work. I’m not sure whether the package supports it or not. Here are my comparisons rewritten as functions:

  children: 
    Experience:
          preferenceFunction: >
            ExperiencePreference <- function(a1, a2) {
              if (a1$experience < a2$experience) return (1/ExperiencePreference(a2, a1))
              ratio <- a1$experience / a2$experience
              if (ratio < 1.05) return (1)
              if (ratio < 1.2) return (2)
              if (ratio < 1.5) return (3)
              if (ratio < 1.8) return (4)
              if (ratio < 2.1) return (5) return (6) } children: *alternatives Education: preferenceFunction: >
            EducPreference <- function(a1, a2) {
              if (a1$education < a2$education) return (1/EducPreference(a2, a1))
              ratio <- a1$education / a2$education
              if (ratio < 1.05) return (1)
              if (ratio < 1.15) return (2)
              if (ratio < 1.25) return (3)
              if (ratio < 1.35) return (4)
              if (ratio < 1.55) return (5)
              return (5)
            }
          children: *alternatives

To run the AHP with functions in R, I used this code (I am including the part that gets the ahp package, in case you have not done that yet). BE CAREFUL and make sure, like in FORTRAN, that you line things up so that the words START in the appropriate columns. For example, the “p” in preferenceFunction MUST be immediately below the 7th character of your criterion’s variable name.

devtools::install_github("gluc/ahp", build_vignettes = TRUE)
install.packages("data.tree")

library(ahp)
library(data.tree)

setwd("C:/AHP/artifacts")
nofxnAhp <- LoadFile("tomdickharry.txt")
Calculate(nofxnAhp)
fxnAhp <- LoadFile("tomdickharry-fxns.txt")
Calculate(fxnAhp)

print(nofxnAhp, "weight")
print(fxnAhp, "weight")

You can see that the weights are approximately the same, indicating that I did a good job at developing functions that represent the reality of how I used the variables attached to the Alternatives to make my pairwise comparisons. The results show that Dick is now the best choice, although there is some inconsistency in our judgments for Experience that we should examine further. (I have not examined this case to see whether rank reversal could be happening).

> print(nofxnAhp, "weight")
                         levelName     weight
1  Choose the Most Suitable Leader 1.00000000
2   ¦--Experience                  0.54756924
3   ¦   ¦--Tom                     0.21716561
4   ¦   ¦--Dick                    0.71706504
5   ¦   °--Harry                   0.06576935
6   ¦--Education                   0.12655528
7   ¦   ¦--Tom                     0.18839410
8   ¦   ¦--Dick                    0.08096123
9   ¦   °--Harry                   0.73064467
10  ¦--Charisma                    0.26994992
11  ¦   ¦--Tom                     0.74286662
12  ¦   ¦--Dick                    0.19388163
13  ¦   °--Harry                   0.06325174
14  °--Age                         0.05592555
15      ¦--Tom                     0.26543334
16      ¦--Dick                    0.67162545
17      °--Harry                   0.06294121

> print(fxnAhp, "weight")
                         levelName     weight
1  Choose the Most Suitable Leader 1.00000000
2   ¦--Experience                  0.54756924
3   ¦   ¦--Tom                     0.25828499
4   ¦   ¦--Dick                    0.63698557
5   ¦   °--Harry                   0.10472943
6   ¦--Education                   0.12655528
7   ¦   ¦--Tom                     0.08273483
8   ¦   ¦--Dick                    0.26059839
9   ¦   °--Harry                   0.65666678
10  ¦--Charisma                    0.26994992
11  ¦   ¦--Tom                     0.74286662
12  ¦   ¦--Dick                    0.19388163
13  ¦   °--Harry                   0.06325174
14  °--Age                         0.05592555
15      ¦--Tom                     0.26543334
16      ¦--Dick                    0.67162545
17      °--Harry                   0.06294121

> ShowTable(fxnAhp)

tomdick-ahp-fxns

Here is the full YAML file for the “with preferenceFunction” case.

#########################
# Alternatives Section
# THIS IS FOR The Tom, Dick, & Harry problem at
# https://en.wikipedia.org/wiki/Analytic_hierarchy_process_%E2%80%93_leader_example
#
Alternatives: &alternatives
# 1= not well; 10 = best possible
# Your assessment based on the paragraph descriptions may be different.
  Tom:
    age: 50
    experience: 7
    education: 4
    leadership: 10
  Dick:
    age: 60
    experience: 10
    education: 6
    leadership: 6
  Harry:
    age: 30
    experience: 5
    education: 8
    leadership: 6
#
# End of Alternatives Section
#####################################
# Goal Section
#
Goal:
# A Goal HAS preferences (within-level comparison) and HAS Children (items in level)
  name: Choose the Most Suitable Leader
  preferences:
    # preferences are defined pairwise
    # 1 means: A is equal to B
    # 9 means: A is highly preferrable to B
    # 1/9 means: B is highly preferrable to A
    - [Experience, Education, 4]
    - [Experience, Charisma, 3]
    - [Experience, Age, 7]
    - [Education, Charisma, 1/3]
    - [Education, Age, 3]
    - [Age, Charisma, 1/5]
  children: 
    Experience:
          preferenceFunction: >
            ExperiencePreference <- function(a1, a2) {
              if (a1$experience < a2$experience) return (1/ExperiencePreference(a2, a1))
              ratio <- a1$experience / a2$experience
              if (ratio < 1.05) return (1)
              if (ratio < 1.2) return (2)
              if (ratio < 1.5) return (3)
              if (ratio < 1.8) return (4)
              if (ratio < 2.1) return (5) return (6) } children: *alternatives Education: preferenceFunction: >
            EducPreference <- function(a1, a2) {
              if (a1$education < a2$education) return (1/EducPreference(a2, a1))
              ratio <- a1$education / a2$education
              if (ratio < 1.05) return (1)
              if (ratio < 1.15) return (2)
              if (ratio < 1.25) return (3)
              if (ratio < 1.35) return (4)
              if (ratio < 1.55) return (5)
              return (5)
            }
          children: *alternatives
    Charisma:
      preferences:
        - [Tom, Dick, 5]
        - [Tom, Harry, 9]
        - [Dick, Harry, 4]
      children: *alternatives
    Age:
      preferences:
        - [Tom, Dick, 1/3]
        - [Tom, Harry, 5]
        - [Dick, Harry, 9]
      children: *alternatives
#
# End of Goal Section
#####################################

Course Materials for Statistics (The Easier Way) With R

very-quick-cover-outlineAre you an instructor with a Spring 2016 intro to statistics course coming up… and yet you haven’t started preparing? If so, I have a potential solution for you to consider. The materials (lecture slides, in-class practice problems in R, exams, syllabus) go with my book and are about 85% compiled, but good enough to get a course started this week (I will be finishing the collection by mid-January).
There is also a 36MB .zip file if you would like to download the materials.
Whether you will be using them or just considering them, please fill in the Google Form at https://docs.google.com/forms/d/1Z7djuKHg1L4k7bTtktHXI7Juduad3fUW9G69bxN6jRA/viewform so I can keep track of everyone and provide you with updates. Also, I want to make sure that I’m providing the materials to INSTRUCTORS (and not students), so please use the email account from your institution when you sign up. Once I get your contact information, I will email you the link to the materials.
If you would like permission to edit the materials, I can do that as well — I know a couple of you have expressed that you would like to add to the collection (e.g translate them to another language). If you see any issues or errors, please either fix and/or email to tell me to fix!
Thanks for your interest and participation! Also, Happy New Year!

Deploying Your Very Own Shiny Server

Nicole has been having a lot of fun the last few days creating her own Shiny apps. We work in the same space, and let’s just say her enthusiasm is very contagious. While she focused on deploying R-based web apps on ShinyApps.io, I’m more of a web development geek, so I put my energy towards setting up a server where she could host her apps. This should come in handy, since she blew through all of her free server time on ShinyApps after just a couple of days!

Before you begin, you can see a working example of this at https://shinyisat.net/sample-apps/sampdistclt/.

In this tutorial, I’m going to walk you through the process of:

  1. Setting up an Ubuntu 14.04 + NGINX server at DigitalOcean
  2. Installing and configuring R
  3. Installing and configuring Shiny and the open-source edition of Shiny Server
  4. Installing a free SSL certificate from Let’s Encrypt
  5. Securing the Shiny Server using the SSL cert and reverse proxy through NGINX
  6. Setting appropriate permissions on the files to be served
  7. Creating and launching the app Nicole created in her recent post

Setting Up an Ubuntu 14.04 Server at DigitalOcean

DigitalOcean is my new favorite web host. (Click this link to get a $10 credit when you sign up!) They specialize in high-performance, low-cost, VPS (virtual private servers) targeted at developers. If you want full control over your server, you can’t beat their $5/month offering. They also provide excellent documentation. In order to set up your server, you should start by following these tutorials:

  1. How to Create Your First DigitalOcean Droplet Virtual Server
  2. How to Connect to Your Droplet with SSH
  3. Initial Server Setup with Ubuntu 14.04
  4. Additional Recommended Steps for New Ubuntu 14.04 Servers
  5. How To Protect SSH with Fail2Ban on Ubuntu 14.04

I followed these pretty much exactly without any difficulties. I did make a few changes to their procedure, which I’ll describe next.

Allowing HTTPS with UFW

I found that the instructions for setting up ufw needed a tweak. Since HTTPS traffic uses port 443 on the server, I thought that sudo ufw allow 443/tcp should take care of letting HTTPS traffic through the firewall. Unfortunately, it doesn’t. In addition you should run the following:


$ sudo ufw allow https

$ sudo ufw enable

Your web server may not accept incoming HTTPS traffic if you do not do this. Note: you may not have noticed, but you also installed NGINX as part of the UFW tutorial.

Setting up Automatic Updates on Your Server

The default install of Ubuntu at DigitalOcean comes with the automatic updates package already installed. This means your server will get security packages and upgrades without you having to do it manually. However, this package needs to be configured. First, edit /etc/apt/apt.conf.d/50unattended-upgrades to look like this:

Unattended-Upgrade::Allowed-Origins {
   "${distro_id}:${distro_codename}-security";
   "${distro_id}:${distro_codename}-updates";
};
Unattended-Upgrade::Mail "admin@mydomain.com";
Unattended-Upgrade::Remove-Unused-Dependencies "true";
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "02:00";

Note, that this configuration will install upgrades and security updates, and will automatically reboot your server, if necessary, at 2:00AM, and it will purge unused packages from your system completely. Some people don’t like to have that much stuff happen automatically without supervision. Also, my /etc/apt/apt.conf.d/10periodic file looks like:

APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Download-Upgradeable-Packages "1";
APT::Periodic::AutocleanInterval "7";
APT::Periodic::Unattended-Upgrade "1";

Which sets upgrades to happen daily, and purges to happen once a week.

Installing and Configuring R

Okay, now that your server is set up (you should be able to view the default NGINX page at http://your-domain-name.com), it’s time to install R.

Set the CRAN Repository in Ubuntu’s sources.list

The first step is to add your favorite CRAN repository to Ubuntu’s sources list. This will ensure that you get the latest version of R when you install it. To open and edit the sources list, type the following:


$ sudo nano /etc/apt/sources.list

Move the cursor down to the bottom of this file using the arrow keys, and add the following line at the bottom:


deb https://cran.cnr.berkeley.edu/bin/linux/ubuntu trusty/

Of course, you can substitute your favorite CRAN repo here. I like Berkeley. Don’t miss that there is a space between “ubuntu” and “trusty”. Hit CTRL+x to exit from this file. Say “yes” when they ask if you want to save your changes. The official docs on installing R packages on Ubuntu also recommend that you activate the backports repositories as well, but I found that this was already done on my DigitalOcean install.

Add the Public Key for the Ubuntu R Package

In order for Ubuntu to be able to recognize, and therefore trust, download, and install the R packages from the CRAN repo, we need to install the public key. This can be done with the following command:


$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9

Install R

Run the following:


$ sudo apt-get update

$ sudo apt-get install r-base

When this is finished, you should be able to type R –version and get back the following message:


$ R --version

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

If you get this, you’ll know that R was successfully installed on your server. If not, you’ll need to do some troubleshooting.

Configure R to Use curl and Your CRAN Repository of Choice

Type the following to open up the Rprofile.site file:


$ sudo pico /etc/R/Rprofile.site

You may delete all of the content and add the following:


options(download.file.method="libcurl")

local({
    r <- getOption("repos")
    r["CRAN"] <- "https://cran.rstudio.com/"
    options(repos=r)
})

This will allow us to run install.packages('packagename') without specifying the repository later.

Install Dependencies and Packages Needed by Shiny Server

We’re going to need the devtools package, which means we need to install the libraries upon which it depends first (libcurl and libxml2):


$ sudo apt-get -y build-dep libcurl4-gnutls-dev

$ sudo apt-get -y install libcurl4-gnutls-dev

$ sudo apt-get -y build-dep libxml2-dev

$ sudo apt-get -y install libxml2-dev

Now we can install devtools, rsconnect, and rmarkdown:


$ sudo su - -c "R -e \"install.packages('devtools')\""

$ sudo su - -c "R -e \"devtools::install_github('rstudio/rsconnect')\""

$ sudo su - -c "R -e \"install.packages('rmarkdown')\""

$ sudo su - -c "R -e \"install.packages('shiny')\""

Install Shiny Server

Okay! Now we’re finally ready to install Shiny Server. Run the following:


$ cd ~ 
$ sudo apt-get install gdebi-core
$ wget https://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.4.1.759-amd64.deb
$ sudo gdebi shiny-server-1.4.1.759-amd64.deb

At this point, your Shiny Server should be up and running, but we can’t visit it on the web yet because by default, it runs on port 3838, which is blocked by the firewall we set up earlier. We’re now going to secure it, and use a reverse proxy to run it through NGINX.

Install an SSL Certificate with Let’s Encrypt

Let’s Encrypt is a new, free service that will allow you to install a trusted SSL certificate on your server. Since Google and Mozilla are working hard to phase out all non-HTTPS traffic on the web, it’s a good idea to get into the habit of installing SSL certs whenever you set up a new website. First install git, then use it to download letsencrypt:


$ sudo apt-get install git
$ git clone https://github.com/letsencrypt/letsencrypt
$ cd letsencrypt

Now before we install the certificate, we have to stop our web server (NGINX). In the code below, replace yourdomain.com with your actual domain name that you registered for this site.


$ sudo service nginx stop
$ sudo ./letsencrypt-auto certonly --standalone -d yourdomain.com -d www.yourdomain.com

If all goes well, it should have installed your new certificates in the /etc/letsencrypt/live/yourdomain.com folder.

Configure the Reverse Proxy on NGINX

Open up the following file for editing:


$ sudo nano /etc/nginx/nginx.conf

And add the following lines near the bottom of the main http block, just before the section labeled “Virtual Host Configs”. In my file, this started around line 62:


...

##
# Map proxy settings for RStudio
##
map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

##
# Virtual Host Configs
##
...

And then open up the default site config file:


$ sudo nano /etc/nginx/sites-available/default

And replace its contents with the following. Note you should replace yourdomain.com with your actual domain name, and 123.123.123.123 with the actual IP address of your server.


server {
   listen 80 default_server;
   listen [::]:80 default_server ipv6only=on;
   server_name yourdomain.com www.yourdomain.com;
   return 301 https://$server_name$request_uri;
}
server {
   listen 443 ssl;
   server_name yourdomain.com www.yourdomain.com;
   ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
   ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;
   ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
   ssl_prefer_server_ciphers on;
   ssl_ciphers AES256+EECDH:AES256+EDH:!aNULL;

   location / {
       proxy_pass http://123.123.123.123:3838;
       proxy_redirect http://123.123.123.123:3838/ https://$host/;
       proxy_http_version 1.1;
       proxy_set_header Upgrade $http_upgrade;
       proxy_set_header Connection $connection_upgrade;
       proxy_read_timeout 20d;
   }
}

Now start NGINX up again:


$ sudo service nginx start

And if all went well, your new Shiny Server should be up and running at https://yourdomain.com!

Note that even if you try to go to the insecure URL, traffic will be automatically redirected through HTTPS.

Setting Appropriate Permissions

Sometimes, your Shiny apps will need access to the filesystem to read or write files. Since the Shiny server runs as the user shiny, and since all the files that are being served are owned by root, then your apps will crash when they try to access files. I like Dean Attali’s solution. Run the following commands, substituting yourusername with the username you are using to access the server:


$ sudo groupadd shiny-apps
$ sudo usermod -aG shiny-apps yourusername
$ sudo usermod -aG shiny-apps shiny
$ cd /srv/shiny-server
$ sudo chown -R yourusername:shiny-apps .
$ sudo chmod g+w .
$ sudo chmod g+s .

In the future, any time you add files under /srv/shiny-server, you may need to change the permissions so the Shiny server can read them. We’ll do that in a moment.

Installing a New App

Finally, I’m going to show you how to put a new app on the server. We’re going to use the app that Nicole created and add it into the “sample apps” folder. Run the following:


$ cd /srv/shiny-server/sample-apps
$ mkdir sampdistclt
$ cd sampdistclt
$ nano server.R

This will create a new file called server.R and open it for editing. Copy and paste the second half of the code from Nicole’s post (the part that starts with ## server) into this file. Save and exit. Now create a second file in this directory called ui.R and paste the code from the first half of Nicole’s post (the part that starts with ## ui up to but not including the part that starts ## server). Save and exit.

Now you need to make sure that the permissions are set correctly:


$ chown -R :shiny-apps .

You may also need to restart the Shiny and/or NGINX servers. The commands to do that are:


$ sudo service nginx restart
$ sudo service shiny-server restart

If all has gone well, you can now view the app up and running at https://yourdomain.com/sample-apps/sampdistclt!

Conclusion

I haven’t had a lot of time to use this configuration, so please let me know if you find any bugs, or things that need to be tweaked. On the plus side, this configuration may be cheaper than using ShinyApps.io, but it also doesn’t have all the cool bells and whistles that you get there, either, like their user interface and monitoring traffic. At the very least, it should be a way to experiment, and put things out in the public for others to play with. Enjoy!

« Older Entries