Subsetting in base R is a fundamental skill for any R user. Whether you’re working with vectors, lists, matrices, data frames, or other more specific object types (e.g., an S4 class ExpressionSet), the ability to effectively subset these is crucial for efficient data manipulation.
In this post, we will explore the key differences between [] (single square brackets) and [[]] (double square brackets), their application across different R object types, and common subsetting techniques.
Indices are the numeric positions used to identify and access elements within data structures, allowing you to retrieve or manipulate specific parts of an object; these are the numbers that you are about to see used inside [] and [[]].
In R, indexing starts at 1 (unlike Python, where it starts at 0), meaning the first element of any object is accessed with 1, not 0.
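As a quick illustration of 1-based indexing, here is a minimal sketch using a throwaway vector called x (a name chosen only for this example). It also shows what happens if you reach for index 0 out of habit from another language.

```r
# a small vector used only for this illustration
x <- c(10, 20, 30)

x[1]          # first element: 10
x[length(x)]  # last element: 30
x[0]          # index 0 selects nothing: numeric(0)
```

Index 0 does not raise an error; it simply returns a zero-length vector, which can make this kind of slip easy to miss.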
Vectors are the simplest data structures in R, storing atomic data types like numbers, characters, or logical values.
[]
Using [] allows you to subset single or multiple elements from a vector.
# create a numeric vector
vec <- c(10, 20, 30, 40)
# return a single element (2nd)
vec[2]
# return non-consecutive elements (1st and 3rd)
vec[c(1, 3)]
# subset using logical conditions
vec[vec > 25]
You can of course store the output of a logical condition in an object and use this to subset a vector too. When subsetting with logical vectors, the elements that evaluate to TRUE are those which are retained.
# same output as the previous example
keep <- vec > 25
vec[keep]
In all of these cases, and indeed whenever subsetting a vector, using [] always returns a vector.
[[]]
In contrast, [[]] extracts a single element and removes the vector structure.
# returns an atomic value
vec[[2]]
One point that often confuses people is that atomic values (like 10 or “a”) are considered vectors of length 1. This is because vectors are the most basic data structure in R. An atomic value without attributes is technically indistinguishable from a vector of length 1. If you are in any doubt about these statements, then run the following to check:
# check data structures - both return TRUE
is.vector(vec[2])
is.vector(vec[[2]])
# check data types (storage mode) - both return "double"
typeof(vec[2])
typeof(vec[[2]])
If there were a more basic structure than a vector, i.e., if a scalar were a formal data structure in R (which it isn’t), one might expect the second is.vector() call to return FALSE (which it doesn’t); R does not have a distinct scalar type because its core philosophy revolves around vectorisation and treating all objects as vectors, even when their length is 1.
In case you are wondering about the output of the typeof() calls, i.e., why these are not integer: numeric literals (e.g., 10, 20) in R default to being stored as double-precision floating-point numbers unless you explicitly request an integer, e.g., with the L suffix (10L).
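The storage mode point can be checked directly. The sketch below compares a plain numeric literal with an integer literal written using the L suffix, and with a sequence created by the colon operator.

```r
typeof(10)           # "double"
typeof(10L)          # "integer"
typeof(1:3)          # "integer"
typeof(c(10L, 20L))  # "integer"
```

Note that the colon operator produces integers when given whole-number literals, so an object built from 1:9 is stored as integer rather than double.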
One final thing to be clear on is that you can’t do the following with double brackets:
# returns an error
vec[[c(1, 3)]]
If you don’t understand why this is, go back and read the section header again.
So to summarise: when applied to vectors, [] returns a vector of any length and [[]] returns exactly one atomic value (which is itself a vector of length 1).
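One practical consequence of this difference is that [[]] is stricter than []. The following is a minimal sketch, using tryCatch() to capture the error messages so that the script keeps running; the object names oob and multi are just illustrative.

```r
vec <- c(10, 20, 30, 40)

# [] is lenient: an out-of-range index returns NA
vec[5]

# [[]] is strict: the same index is an error, captured here as a message
oob <- tryCatch(vec[[5]], error = function(e) conditionMessage(e))
oob

# asking [[]] for two elements of an atomic vector is also an error
multi <- tryCatch(vec[[c(1, 3)]], error = function(e) conditionMessage(e))
multi
```

This strictness can be useful in scripts, where a loud failure is often preferable to an NA quietly propagating through later calculations.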
Vector elements can be named. Adding names to a vector can make subsetting more intuitive, especially in cases where indices are difficult to remember or interpret.
In the example shown here, the vector elements 10, 20 and 30 are given the names “a”, “b” and “c”, respectively.
# create a named numeric vector
named_vec <- c(a = 10, b = 20, c = 30)
Subsetting named vectors follows the same rules as given above. I have provided a couple of examples here, but you should experiment to understand how names can be used for subsetting as this will be relevant when we come to look at other data structures shortly.
# return a single element (1st)
named_vec["a"]
# return non-consecutive elements (1st and 3rd)
named_vec[c("a", "c")]
# return an atomic value
named_vec[["a"]]
Note that the vector names that are evident when subsetting using [] are not present when using [[]], denoting the fact that a single atomic value (bereft of its name attribute) is returned.
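If you want to confirm the point about names for yourself, a short check with names() makes the difference explicit:

```r
named_vec <- c(a = 10, b = 20, c = 30)

names(named_vec["a"])    # "a"  (name kept)
names(named_vec[["a"]])  # NULL (name dropped)
```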
Also note that you can use names with logical conditions too, but you will have to access these using the names() function in order to test the logical condition; you can only do this using [] (and not [[]]) as the logical condition will always return a logical vector that is used for the subsetting operation.
# return a single element (1st)
named_vec[names(named_vec) == "a"]
# return multiple elements (1st and 3rd)
named_vec[names(named_vec) %in% c("a", "c")]
Unfortunately, R does not have a built-in shorthand equivalent to the colon operator for subsetting named vectors by a sequence of names (something like "a":"c" does not work). There are workarounds that can help achieve similar functionality, but these are outside the scope of this post.
Matrices are two-dimensional structures with homogeneous data types i.e., a matrix has rows and columns, and the data stored in these must all be of the same type.
[]
You can subset specific rows, columns, or elements using [row, column] indexing: the row index always comes first, followed by the column index. Think of these like a set of coordinates, but beware that the order is the reverse of a map grid reference, where you read the horizontal (east-west) position before the vertical (north-south) one. The rule I learned for maps as a youth was “along the hallway and up the stairs”, which was consequently a path I frequently took as a child having been caught doing something wrong, but that’s another story… For matrices it is the other way round: pick the row (how far down) first, then the column (how far along).
# create a numeric matrix
mat <- matrix(1:9, nrow = 3)
If we print mat we can see that it is a 3 x 3 grid of numbers, just in case that wasn’t obvious from the above matrix() call. Note that the values shown in the row and column names/headers are the respective indices or numeric position IDs.
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
It is these indices that we use when subsetting matrices. Try to look at the matrix above and anticipate the expected output of the calls below before running them (clue: the last call is the intersection of row 2 and column 3).
# subset rows (all values in row two)
mat[2, ]
# subset columns (all values in column three)
mat[ , 3]
# return a specific single element
mat[2, 3]
Note here that the returned objects are vectors. To prevent this automatic simplification of the data structure, i.e., if you want a matrix back (which is more relevant to the first two examples than the final one above), add the argument drop = FALSE.
# subset a matrix and return a matrix
mat[2, , drop = FALSE]
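One way to see what drop = FALSE changes is to inspect the dimensions of the result with dim(). This is a small sketch using the same mat object as above:

```r
mat <- matrix(1:9, nrow = 3)

dim(mat[2, ])                  # NULL: simplified to a plain vector
dim(mat[2, , drop = FALSE])    # 1 3: still a matrix with one row
dim(mat[ , 3, drop = FALSE])   # 3 1: still a matrix with one column
```

Keeping the matrix structure tends to matter most inside functions, where later code may rely on nrow() or ncol() returning a value.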
You can also use sequences of numbers to subset multiple rows and / or columns. Note the difference in syntax available where the indices are sequential versus where they are non-sequential. You could, of course, still use the c() function to subset to sequential indices; the : operator just saves time if you have a long sequence. For example, 1:5 is clearly quicker to type and easier to read than c(1, 2, 3, 4, 5).
# subset to rows 1 and 2 (sequential indices)
mat[1:2, ]
# subset to rows 1 and 3 (non-sequential indices)
mat[c(1, 3), ]
# subset to columns 1 and 2
mat[ , 1:2]
# subset to columns 1 and 3
mat[ , c(1, 3)]
It is worth noting that you can subset unnamed matrices using logical conditions and either return vectors or matrices. These are edge cases and could make following this post even more confusing than it might already be for novice learners, so I will omit any examples here. I am just mentioning this here to say that it is technically possible should you wish to investigate further, though I don’t think I have ever found a use case.
Naming rows and columns in matrices allows subsetting based on meaningful labels rather than indices, improving code readability. There are alternative ways to set the dimension names, but the method I have shown is the most basic available using only base R.
# create a matrix with row and column names
named_mat <- matrix(1:9, nrow = 3)
# set row and column names
rownames(named_mat) <- c("row_1", "row_2", "row_3")
colnames(named_mat) <- c("col_1", "col_2", "col_3")
Setting the dimension names means that these are shown in the printed output instead of the respective indices.
      col_1 col_2 col_3
row_1     1     4     7
row_2     2     5     8
row_3     3     6     9
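As one example of the alternative ways of setting dimension names, the names can be supplied when the matrix is created via the dimnames argument of matrix(). The sketch below builds the matrix both ways and compares them; named_mat2 is just an illustrative object name.

```r
# approach used in this post: create first, then name
named_mat <- matrix(1:9, nrow = 3)
rownames(named_mat) <- c("row_1", "row_2", "row_3")
colnames(named_mat) <- c("col_1", "col_2", "col_3")

# alternative: supply a list of row names and column names up front
named_mat2 <- matrix(
  1:9,
  nrow = 3,
  dimnames = list(
    c("row_1", "row_2", "row_3"),
    c("col_1", "col_2", "col_3")
  )
)

# check that both approaches give the same object
identical(named_mat, named_mat2)
```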
Subsetting using the dimension names that we have just applied follows the same rules that we have just learned for unnamed matrices.
# subset rows (all values in row two in this case)
named_mat["row_2", ]
# subset columns (all values in column three in this case)
named_mat[ , "col_3"]
# subset to rows 1 and 3
named_mat[c("row_1", "row_3"), ]
# subset to columns 1 and 3
named_mat[ , c("col_1", "col_3")]
# return a specific single element
named_mat["row_2", "col_3"]
# return specific row and column combinations
named_mat[c("row_2", "row_3"), c("col_1", "col_3")]
Matrices do support [[]], e.g., mat[[2, 3]], but only for extracting a single element. There’s rarely any need for it, as single element extraction is already taken care of using [], as we have seen.
Lists are flexible data structures, allowing storage of heterogeneous data types (e.g., vectors, data frames and other lists, as well as many other objects and data structures).
Everyone seems to hate lists when they set out learning R because their behaviour can be confusing - seemingly a byproduct of the flexibility they offer and ironically the reason you will learn to love lists if you stick with R for long enough. Possibly my second favourite data structure in R after data frames and their enhanced variants.
[]
Recall that using [] with a vector always returns a vector. It’s the same with lists: [] applied to a list always returns a list, albeit a smaller sub-list, which retains the list data structure.
# create a list (heterogeneous data types)
lst <- list(1, "text", c(1, 2, 3))
As we have not named our list, the printed output shows the list element indices - those numbers shown within [[]] below; the indices for each list entry may or may not be shown depending on the stored object and whether any names have been set. In this case, we have stored three unnamed vectors, so no names are shown.
[[1]]
[1] 1
[[2]]
[1] "text"
[[3]]
[1] 1 2 3
If you have mastered the operations for subsetting vectors using [] then lists should be straightforward, as the same rules apply to lists; we can use single values, sequential and non-sequential values, and the colon operator : is available as a convenient shorthand for sequential values.
# return a sublist using indices
lst[1]
lst[2:3]
lst[c(1, 3)]
Note that the printed output in all of these instances is a list - if you don’t believe me, pass the output to is.list().
[[]]
Similar to the examples we saw with vectors, [[]] extracts the element directly, removing the list structure. This works with either the element index or the element name if assigned (see below).
# return single list element
lst[[1]]
To summarise: [] applied to lists returns a sub-list and [[]] applied to lists extracts a single list element.
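The distinction is easy to verify with is.list(). The sketch below also shows a common follow-on pattern: because [[]] hands back the element itself, you can carry on subsetting it with [].

```r
lst <- list(1, "text", c(1, 2, 3))

is.list(lst[3])    # TRUE: a list of length 1
is.list(lst[[3]])  # FALSE: the numeric vector itself

# because [[]] returns the element itself, you can keep subsetting it
lst[[3]][2]        # 2
```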
Named lists provide the additional convenience of subsetting by names, which can make your code more readable and intuitive. Named lists are especially useful when you need to access elements repeatedly or want your code to be more self-documenting, as names clearly indicate the purpose of each element.
As an aside, named lists of data frames and list-columns (a topic for another blog post) are a staple of my workflows, enabling iterative functional programming (and swerving ever having to use loops); if you aspire to R mastery, learn to love lists.
# create a named list
named_lst <- list(a = 1, b = "text", c = c(1, 2, 3))
# return a sublist using list element names
named_lst["a"]
named_lst[c("a", "c")]
# return single list element
named_lst[["b"]]
As with vectors, you can subset using logical operations, including using the element names as part of logical conditions, but with the caveat that we saw earlier: this only works with [] and not [[]]. The earlier note also applies here: there is no convenient shorthand equivalent to the colon operator for sequential element names, and the possible workarounds are not shown in this post.
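A minimal sketch of logical subsetting on the named list is given below. The first call filters on the element names; the second filters on the contents, using vapply() with is.numeric() to build the logical vector (is_num is just an illustrative object name).

```r
named_lst <- list(a = 1, b = "text", c = c(1, 2, 3))

# keep elements whose names match
named_lst[names(named_lst) %in% c("a", "c")]

# keep elements that satisfy a condition on their contents
is_num <- vapply(named_lst, is.numeric, logical(1))
named_lst[is_num]
```

Both calls return a sub-list containing elements a and c, since [] always preserves the list structure.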
Data frames are “special” lists of equal-length vectors, with each column being treated as a list element. Data frames require that the list elements, i.e., columns or variables, are named. If you don’t supply these when creating (or reading) a data frame, they will be automatically generated. These auto-created names are usually unintuitive, so it is best to supply your own meaningful names.
$ Operator for Accessing Columns
The $ operator provides a convenient way to access individual columns of a data frame by name. It is often used in place of [] or [[]] for quick and readable subsetting when working with named columns.
# create a data frame
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
# access a column using the $ operator
df$b
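Note that this operation requires the column name to be typed literally in your code, so it does not work with dynamically generated names (e.g., a name stored in a variable). It is also best to write the name in full: $ can partially match an abbreviated column name, which may return a column you did not intend.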
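To illustrate the limitation with dynamic names, the sketch below stores a column name in a variable (col_name, chosen for this example) and tries to use it with $.

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

col_name <- "b"

# $ looks for a column literally called "col_name", which does not exist
df$col_name     # NULL

# [[]] evaluates the variable and uses its value as the column name
df[[col_name]]
```

Note that the failed $ lookup returns NULL rather than raising an error, so this kind of mistake can go unnoticed.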
[]
You can subset data frame rows and columns similarly to matrices, using indices, variable names, or logical conditions.
# subset a column (returned data structure is a data frame)
df[1]
df["b"]
# subset rows (returned data structure is a data frame)
df[1, ]
# subset a specific element
df[1, 2]
# subset a specific element combining indices and names
df[1, "b"]
As with the other data structures shown above, it is possible to use ranges of row and column indices or names for subsetting data frames. These can be created either using c() or the : operator where appropriate.
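A few examples of ranges applied to the same df object are sketched here, mixing the : operator, c(), and column names:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

df[1:2, ]            # rows 1 and 2, all columns
df[c(1, 3), ]        # rows 1 and 3, all columns
df[ , c("a", "b")]   # all rows, columns selected by name
df[2:3, 1:2]         # rows 2 and 3, columns 1 and 2
```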
Logical conditions are a powerful way to subset rows in a data frame, often used for filtering data based on column values. The $ operator is frequently combined with logical conditions to access specific columns for filtering.
# subset rows where column 'a' is greater than 1
df[df$a > 1, ]
# subset rows where column 'b' equals "y"
df[df$b == "y", ]
# subset rows where 'a' is greater than 1 AND 'b' is "z"
df[df$a > 1 & df$b == "z", ]
# subset rows where 'a' is less than 3 OR 'b' equals "x"
df[df$a < 3 | df$b == "x", ]
Logical conditions can be combined with column selection to create more complex queries:
# subset rows where 'a' > 1, selecting only column 'b'
df[df$a > 1, "b"]
# subset rows where 'b' is "y", extracting column 'a' as a vector
df[df$b == "y", "a"]
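Selecting a single column in this way simplifies the result to a vector, much as we saw with matrices. If you would rather keep a data frame, the drop = FALSE argument works here too; the sketch below compares the two (as_vector and as_df are illustrative names).

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

as_vector <- df[df$a > 1, "b"]
as_df     <- df[df$a > 1, "b", drop = FALSE]

is.data.frame(as_vector)  # FALSE
is.data.frame(as_df)      # TRUE
```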
As I am sure you can appreciate, even within the limitations of the base R syntax and the operators we have seen here, there are a vast number of ways of subsetting data frames, particularly when building complex or dynamic queries. This topic could be a series of posts in and of itself. We will leave it here for now - go forth and experiment.
[[]]
Subsetting data frames with [[]] extracts a column as a vector. This is particularly useful when you need to work with the raw values of a column rather than its data frame structure. Unlike the $ operator, [[]] allows for dynamic column selection by name or index.
# extract column 'a' as a vector
df[[1]]
# extract column 'b' as a vector
df[["b"]]
# dynamic column extraction by name or index
col_name <- "a"
df[[col_name]]
Although $ is simpler and more readable, it only works natively with hardcoded column names. The [[]] method can be more versatile, allowing dynamic column extraction by name or index.
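This versatility is most apparent inside functions, where the column to use is passed in as an argument. The following is a minimal sketch; describe_col() is a hypothetical helper written for this example, not a base R function.

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# describe_col() is a hypothetical helper: it reports the class of one column,
# identified either by name or by index
describe_col <- function(data, col) {
  class(data[[col]])
}

describe_col(df, "a")  # "integer"
describe_col(df, 2)    # class of the second column

# apply the helper to every column by name
col_classes <- vapply(names(df), function(nm) describe_col(df, nm), character(1))
col_classes
```

Writing data$col inside the function would not work here, for the reason given above.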
[] and [[]]
Below is a handy summary of all the key behaviours that I have covered in this short post.
Type | [] Behaviour | [[]] Behaviour |
---|---|---|
Vector | Returns a vector | Extracts a single atomic value |
Matrix | Subsets rows, columns, or elements | Extracts a single element (rarely needed) |
List | Returns a sublist | Extracts the element itself |
Data Frame | Subsets rows, columns, or elements | Extracts a single column as a vector |
Understanding the nuances of [] and [[]] in base R is essential for data manipulation and analysis. While [] is versatile and generally retains the object structure, [[]] is precise, extracting individual elements or columns directly, i.e., removing the data structure to which the operation is applied. Mastering these subsetting techniques is a crucial step towards efficient R scripting.
See you next time 😉
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.2 (2024-10-31)
## os macOS Sequoia 15.2
## system aarch64, darwin20
## ui RStudio
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/London
## date 2025-01-25
## rstudio 2024.12.0+467 Kousa Dogwood (desktop)
## pandoc NA
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────
Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.