Using xda with googlesheets in R

Image Credit: Doug Buckley of http://hyperactive.to

Want to do a quick, exploratory data analysis in R of your data that’s stored in a spreadsheet on Google Drive? You’re in luck, because now you can use the new xda package in conjunction with Jenny Bryan‘s googlesheets. There are some quirks, though, and that’s what this post is all about.

Before proceeding, you should review this recent article from R-Bloggers called “Introducing xda”.

First, be sure to install the googlesheets and xda packages. Although googlesheets is on CRAN, xda is not, and you’ll have to bring it in directly from github. You can actually do the same for googlesheets if you like:

install.packages("devtools")
library(devtools)
install_github("jennybc/googlesheets")
install_github("ujjwalkarn/xda")
library(googlesheets)
library(xda)

Next, you’ll have to show R how to access your Google spreadsheet. While you are looking at your spreadsheet, go to File -> Publish to the Web. The URL that’s in the text box is the one you want to capture. Just to make sure it works, copy and paste it into a new browser address window and see if you can display your spreadsheet in your browser.

If you want to import the data at https://docs.google.com/spreadsheets/d/1DO0ksD8d-rn_j2Yn7DQKZDPBrhrvZTpgszewxokjWKU/pubhtml into R, for example, you’ll need to know the spreadsheet’s key. That’s the long string of unintelligible numbers and letters between the “d” and the “pubhtml”. So, my key would be “1DO0ksD8d-rn_j2Yn7DQKZDPBrhrvZTpgszewxokjWKU” — which you’ll see in this next block of code:

> my.gs <- gs_key("1DO0ksD8d-rn_j2Yn7DQKZDPBrhrvZTpgszewxokjWKU") 
> my.data <- gs_read(my.gs) # Retrieves data from googlesheets and places it into an R object. 
> my.df <- as.data.frame(my.data)  # Important! xda needs you to extract only the data in a data frame.

Now, you can access your data. Try head(my.df) to make sure you’ve imported it properly.

Next, it’s time for exploratory data analysis. There are three commands available:

numSummary – takes a data frame as an argument, provides descriptive statistics, quantiles, and missing data info for quantitative variables
charSummary – takes a data frame as an argument, provides counts, missing data info, and number of unique factors for quantitative variables
bivariate – takes a data frame and two quantitative variables as an argument, and performs a quick bivariate analysis (giving this categorical variables, or giving this one categorical and one quantitative variable, will throw an error)

Here’s what happens when you run those commands on the data you just loaded in from your Google spreadsheet:

> numSummary(my.df)
               n   mean    sd median   max  min  mode miss miss%    1%   5%   25%   50%   75%   90%   95%   99%
obs          200 100.50 57.88  100.5 200.0  1.0   1.0    0     0  2.99 11.0  50.8 100.5 150.2 180.1 190.0 198.0
heartrates   200  73.01  7.43   73.0  96.3 53.4  71.2    0     0 56.49 61.2  68.4  73.0  77.4  82.6  85.7  90.3
systolics    200 139.27 29.27  138.0 221.0 59.0 139.0    0     0 79.98 96.0 117.0 138.0 160.0 177.2 188.1 205.0
diastolics   200  87.76  9.74   87.7 116.4 62.4  85.2    0     0 66.01 72.2  81.9  87.7  93.7 100.3 104.5 108.3
bmis         200  25.53  3.06   25.0  33.1 18.4  24.7    0     0 19.00 21.0  23.5  25.0  27.7  29.6  31.2  32.8
ages         200  44.41 14.59   45.0  70.0 18.0  30.0    0     0 18.00 22.0  32.0  45.0  57.0  64.1  67.0  70.0
heartpm      200  72.26  3.55   72.2  83.8 64.2  74.2    0     0 64.72 66.2  69.8  72.2  74.2  76.4  78.7  81.4
fitnesslevel 200   2.62  1.17    3.0   4.0  1.0   4.0    0     0  1.00  1.0   2.0   3.0   4.0   4.0   4.0   4.0

> charSummary(my.df)
          n miss miss% unique
genders 200    0     0      2
smokers 200    0     0      2
group   200    0     0      4

> bivariate(my.df,'heartrates','bmis')
     bin_bmis min_heartrates max_heartrates mean_heartrates
1   (18.3,22]          53.40          85.60           72.80
2   (22,25.7]          55.70          90.70           72.87
3 (25.7,29.4]          60.30          96.30           73.45
4 (29.4,33.1]          56.50          90.30           72.46

Observations:

There is a fourth “Plot” command but I couldn’t get it to work on any googlesheetsdata. The xda package is looking for class(range) to be anything other than “function”, which it was for every sheet I attempted to load.
There really should be an extra column in xda that displays the enumeration of all the unique values for the factors. It felt great to know how many unique values there were, but I would love to be reminded of what they are too, unless there are too many of them.

Please share your experiences using xda & googlesheets together in the comments! Thanks!