Applied Statistics

Taking a Subset of a Data Frame in R

I just wrote a new chapter for my students describing how to subset a data frame in R. The full text is available at https://docs.google.com/document/d/1K5U11-IKRkxNmitu_lS71Z6uLTQW_fp6QNbOMMwA5J8/edit?usp=sharing but here’s a preview:

Let’s load in ChickWeight, one of R’s built in datasets. This contains the weights of little chickens at 12 different times throughout their lives. The chickens are on different diets, numbered 1, 2, 3, and 4. Using the str command, we find that there are 578 observations in this data frame, and two different categorical variables: Chick and Diet.

> data(ChickWeight)
> head(ChickWeight)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':      578 obs. of  4 variables:
 $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
 $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
 $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "formula")=Class 'formula' length 3 weight ~ Time | Chick
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "outer")=Class 'formula' length 2 ~Diet
  .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
 - attr(*, "labels")=List of 2
  ..$ x: chr "Time"
  ..$ y: chr "Body weight"
 - attr(*, "units")=List of 2
  ..$ x: chr "(days)"
  ..$ y: chr "(gm)"

Get One Column: Now that we have a data frame named ChickWeight loaded into R, we can take subsets of these 578 observations. First, let’s assume we just want to pull out the column of weights. There are two ways we can do this: specifying the column by name, or specifying the column by its order of appearance. The general form for pulling information from data frames is data.frame[rows,columns] so you can get the first column in either of these two ways:

ChickWeight[,1]   		# get all rows, but only the first column
ChickWeight[,c("weight")]	# get all rows, and only the column named “weight”

Get Multiple Columns: If you want more than one column, you can specify the column numbers or the names of the variables that you want to extract. If you want to get the weight and diet columns, you would do this:

ChickWeight[,c(1,4)]   		# get all rows, but only 1st and 4th columns
ChickWeight[,c("weight","Diet")]	# get all rows, only “weight” & “Diet” columns

If you want more than one column and those columns are next to each other, you can do this:

ChickWeight[,c(1:3)]

Get One Row: You can get the first row similarly to how you got the first column, and any other row the same way:

ChickWeight[1,]   		# get first row, and all columns
ChickWeight[82,]   		# get 82nd row, and all columns

Get Multiple Rows: If you want more than one row, you can specify the row numbers you want like this:

> ChickWeight[c(1:6,15,18,27),] 
   weight Time Chick Diet      
1      42    0     1    1   
2      51    2     1    1 
3      59    4     1    1    
4      64    6     1    1    
5      76    8     1    1 
6      93   10     1    1    
15     58    4     2    1    
18    103   10     2    1 
27     55    4     3    1    

6 replies »

  1. Dr. Radziwill,

    This is a very nice resource. Thank you for sharing. Quick question regarding this bit:

    > chick13 chick28 t.test(chick13,chick28)

    Given how this data is set up, there are only 2 chicks here, with some number of weights attached to each 9e.g. the number of rows for each chick). The t.test command you have assumes independent samples by default, but these weights are dependent, no? You wrote a book. I didn’t LOL! So I’m guess there is something obvious here I’m missing, just want to know what it is.

    thx in advance,
    jeff

    • There’s no way to know whether the weights are dependent or paired, so I defaulted to independent samples. Some of the chick-weight-vectors don’t even report all the weights for each time! So it’s just an assumption based on missing context… no magic 🙂

    • What a good (and clearly logical) idea! I’ll add a section on subset() too — never hurts to be able to do things many, many different ways.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s