Taking a Subset of a Data Frame in R

Note: Everything in this article is easier with dplyr and magrittr from the tidyverse. I’ll write a follow-up sometime this year.


I just wrote a new chapter for my students describing how to subset a data frame in R. The full text is available at https://docs.google.com/document/d/1K5U11-IKRkxNmitu_lS71Z6uLTQW_fp6QNbOMMwA5J8/edit?usp=sharing but here’s a preview:

Let’s load ChickWeight, one of R’s built-in datasets. It contains the weights of chicks measured at up to 12 different times throughout their lives. The chicks are on different diets, numbered 1, 2, 3, and 4. Using the str command, we find that there are 578 observations in this data frame, and two categorical variables: Chick and Diet.

> data(ChickWeight)
> head(ChickWeight)
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
> str(ChickWeight)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame': 578 obs. of 4 variables:
$ weight: num 42 51 59 64 76 93 106 125 149 171 ...
$ Time : num 0 2 4 6 8 10 12 14 16 18 ...
$ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
$ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "formula")=Class 'formula' length 3 weight ~ Time | Chick
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "outer")=Class 'formula' length 2 ~Diet
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "labels")=List of 2
..$ x: chr "Time"
..$ y: chr "Body weight"
- attr(*, "units")=List of 2
..$ x: chr "(days)"
..$ y: chr "(gm)"
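The counts quoted above can also be checked directly instead of being read off the str output. Here is a minimal sketch that uses only base R functions on the built-in dataset:

```r
data(ChickWeight)

nrow(ChickWeight)                  # number of observations: 578
nlevels(ChickWeight$Chick)         # number of chicks: 50
levels(ChickWeight$Diet)           # the diet labels: "1" "2" "3" "4"
length(unique(ChickWeight$Time))   # distinct measurement times: 12
```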

Get One Column: Now that ChickWeight is loaded into R, we can take subsets of its 578 observations. First, let’s assume we just want to pull out the column of weights. We can specify the column either by name or by its position. The general form for pulling information from data frames is data.frame[rows,columns], and leaving the rows part blank means "all rows", so you can get the first column in either of these two ways:

ChickWeight[,1] # get all rows, but only the first column
ChickWeight[,c("weight")] # get all rows, and only the column named "weight"
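One detail worth knowing: when you ask for a single column this way, R hands back a plain vector rather than a data frame. If you want to keep the data frame structure, the drop = FALSE argument should do it. A small sketch (w_vec and w_df are just illustrative names, not part of the dataset):

```r
data(ChickWeight)

# A single column comes back as a numeric vector by default
w_vec <- ChickWeight[, "weight"]
is.numeric(w_vec)
length(w_vec)

# drop = FALSE keeps the result as a one-column data frame
w_df <- ChickWeight[, "weight", drop = FALSE]
is.data.frame(w_df)
dim(w_df)
```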

Get Multiple Columns: If you want more than one column, you can specify the column numbers or the names of the variables that you want to extract. If you want to get the weight and diet columns, you would do this:

ChickWeight[,c(1,4)] # get all rows, but only 1st and 4th columns
ChickWeight[,c("weight","Diet")] # get all rows, only "weight" & "Diet" columns

If you want more than one column and those columns are next to each other, you can do this:

ChickWeight[,1:3] # get all rows, but only columns 1 through 3
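Selecting by position and selecting by name are two routes to the same columns. The sketch below pulls the first three columns both ways and compares the column names; by_position and by_name are illustrative variable names:

```r
data(ChickWeight)

# First three columns, chosen by position
by_position <- ChickWeight[, 1:3]

# The same three columns, chosen by name
by_name <- ChickWeight[, c("weight", "Time", "Chick")]

names(by_position)
identical(names(by_position), names(by_name))
```

Selecting by name is usually the safer habit, since it keeps working if the column order ever changes.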

Get One Row: Rows work the same way as columns, except that the index goes before the comma instead of after it:

ChickWeight[1,] # get first row, and all columns
ChickWeight[82,] # get 82nd row, and all columns
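Unlike a single column, a single row stays a data frame, because its values can have different types. You can also fill in both sides of the comma to pull out one cell. A quick sketch (first_row is an illustrative name):

```r
data(ChickWeight)

# One row is still a data frame with all four columns
first_row <- ChickWeight[1, ]
is.data.frame(first_row)
dim(first_row)

# Row and column together give a single value
ChickWeight[1, "weight"]   # 42, matching the first row of the head() output
```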

Get Multiple Rows: If you want more than one row, you can specify the row numbers you want like this:

> ChickWeight[c(1:6,15,18,27),]
   weight Time Chick Diet
1      42    0     1    1
2      51    2     1    1
3      59    4     1    1
4      64    6     1    1
5      76    8     1    1
6      93   10     1    1
15     58    4     2    1
18    103   10     2    1
27     55    4     3    1
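Row and column selections can be combined in one call, and rows can be picked by a condition instead of by number. The sketch below is one way to do it, not the only one; it assumes you want the observations for diet 4, and the names some_cells, diet4 and diet4_alt are made up for illustration:

```r
data(ChickWeight)

# Specific rows and specific columns at the same time
some_cells <- ChickWeight[c(1:6, 15, 18, 27), c("weight", "Diet")]

# Rows chosen by a logical condition: keep only the rows where Diet is 4
diet4 <- ChickWeight[ChickWeight$Diet == 4, ]

# The same selection written with subset()
diet4_alt <- subset(ChickWeight, Diet == 4)

dim(some_cells)
nrow(diet4)
```

The condition inside the brackets produces one TRUE or FALSE per row, and only the TRUE rows are kept.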

7 comments

  • I usually also use it with a logical condition, like ChickWeight[ChickWeight$Chick == 1, ]

  • Dr. Radziwill,

    This is a very nice resource. Thank you for sharing. Quick question regarding this bit:

    > chick13 chick28 t.test(chick13,chick28)

    Given how this data is set up, there are only 2 chicks here, with some number of weights attached to each (e.g. the number of rows for each chick). The t.test command you have assumes independent samples by default, but these weights are dependent, no? You wrote a book. I didn’t LOL! So I’m guessing there is something obvious here I’m missing, just want to know what it is.

    thx in advance,
    jeff

    • There’s no way to know whether the weights are dependent or paired, so I defaulted to independent samples. Some of the chick-weight-vectors don’t even report all the weights for each time! So it’s just an assumption based on missing context… no magic 🙂

  • Alright,

    so that cut-n-paste is borked in my previous post. The syntax I’m referring to is at the bottom of page 5, where you subset the weights for two different chicks (13 and 28), save each chick’s weight to a different vector (chick13 and chick28), then run a t.test.

  • From the title, I thought it would be about subset() and logical indexing.

    • What a good (and clearly logical) idea! I’ll add a section on subset() too — never hurts to be able to do things many, many different ways.

  • A third way to access a single column is through the $ symbol, e.g. ChickWeight$weight.
