R User's Guide to Python Data Structures

As an R user who has started to explore Python, you might find yourself wondering how the basic building blocks of these two languages compare. After all, one of the fastest ways to get comfortable in a new programming ecosystem is to relate new ideas back to what you already know.

In this article, I will walk you through the basic data structures in Python. I’ll cover tuples, lists, dictionaries, sets, and a few others, and look at how they stack up against data structures found in base R.

By the end, you should have a solid foundation for starting to manipulate data in Python just as naturally as you do in R.

It should also serve as a springboard if you’re looking to move swiftly into the world of NumPy, Pandas and beyond.

Python Tuples

Let’s get tuples out of the way first because they have no direct R equivalent.

Tuples in Python look like Python lists but are immutable: once created, you cannot change them. Tuples are often used for things like fixed-length records (e.g., x, y points), which can make your code safer by reducing accidental changes.

Note that tuples do not have to be of length 2 in Python. This common misconception probably comes from examples where they are used to represent pairs such as coordinate points (x, y), or key-value pairs in contexts such as dictionaries, which we will see shortly.

A tuple can have:

1) Zero elements, i.e., an empty tuple

empty = ()

2) One element (note you must include a trailing comma, otherwise Python treats it as just a value in parentheses)

single = (42, )

3) Two elements (often used for pairs, e.g., coordinates)

coordinates = (10.0, 20.0)

4) Three, four, or many elements

triple = (1, 2, 3)
quadruple = (1, 2, 3, 4)
many = (1, "apple", True, 3.14)

The general rule is that a tuple is just an immutable ordered collection of any number of elements.
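To make the immutability and the trailing-comma rule concrete, here is a minimal sketch. The variable names (point, not_a_tuple, error_message) are mine, purely for illustration.

```python
# A tuple rejects item assignment; Python raises a TypeError.
point = (10.0, 20.0)
try:
    point[0] = 99.0
except TypeError as err:
    error_message = str(err)
print(error_message)

# Without the trailing comma, the parentheses are just grouping.
not_a_tuple = (42)
single = (42,)
print(type(not_a_tuple).__name__)  # int
print(type(single).__name__)       # tuple

# Tuples can have any length, including zero.
print(len(()), len(single), len(point))  # 0 1 2
```

The tuple is left untouched after the failed assignment, which is exactly the safety net described above.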

Note that tuples can hold heterogeneous data types, i.e., each element can be of a different type (as seen in the many example above). Note too that “ordered” here means that elements keep the positions in which they were written, not that they are sorted alphabetically or numerically.

This kind of enforced immutability doesn’t have a direct equivalent among R’s everyday data structures. You can promise not to modify a list or vector, but R won’t stop you programmatically.

If you want a mental map, think of a tuple as an R list you have pledged not to mutate.

Python Sets ≈ unique(R Vectors)

A Python set is a collection of unique, unordered items:

unique_names = {"Jon", "Bob", "Dave", "Steve"}

There is no directly analogous data structure in base R. Instead, you would create a vector and call unique() to obtain similar behaviour:

names <- c("Jon", "Jon", "Bob", "Bob", "Bob", "Dave", "Steve")
unique(names)

[1] "Jon"   "Bob"   "Dave"  "Steve"
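For comparison, here is a rough Python counterpart to that R snippet. It is a sketch rather than the one true way: set() is the direct route, while dict.fromkeys() is a common idiom I have added for the case where you want to keep first-seen order, as unique() does in R.

```python
names = ["Jon", "Jon", "Bob", "Bob", "Bob", "Dave", "Steve"]

# set() drops duplicates but makes no promise about order.
unique_names = set(names)
print(unique_names == {"Jon", "Bob", "Dave", "Steve"})  # True

# dict.fromkeys() keeps the first-seen order (Python 3.7+).
ordered_unique = list(dict.fromkeys(names))
print(ordered_unique)  # ['Jon', 'Bob', 'Dave', 'Steve']
```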

In Python, sets are a proper object type with built-in methods for set operations such as union, intersection, and difference. There’s no dedicated “set” structure in R’s base language; you achieve the same effects by applying functions like union(), intersect(), and setdiff() to ordinary vectors.
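As a quick illustration of those set operations on the Python side, the sketch below uses two made-up sets (team_a and team_b are hypothetical names). I wrap the results in sorted() only so that the printed output is predictable.

```python
team_a = {"Jon", "Bob", "Dave"}
team_b = {"Dave", "Steve"}

union = team_a | team_b           # like union() in R
intersection = team_a & team_b    # like intersect() in R
difference = team_a - team_b      # like setdiff() in R

print(sorted(union))         # ['Bob', 'Dave', 'Jon', 'Steve']
print(sorted(intersection))  # ['Dave']
print(sorted(difference))    # ['Bob', 'Jon']

# Adding an item that is already present changes nothing.
team_a.add("Jon")
print(len(team_a))  # 3
```

The same operations are also available as methods, e.g. team_a.union(team_b), if you prefer names over operators.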

Python Lists ≈ R Lists

In Python, the list is your all-purpose, flexible container.

Just like in R, a list can hold heterogeneous data types: numbers, strings, other lists etc. Python lists maintain order, allow duplicates, and are fully mutable: you can add, remove, or change elements.

# homogeneous types
fruits = ["apple", "banana", "cherry"]
print(fruits)

['apple', 'banana', 'cherry']

# heterogeneous types
mixed_list = [42, "hello", 3.14, [1, 2, 3]]
print(mixed_list)

[42, 'hello', 3.14, [1, 2, 3]]

# maintains order
print(mixed_list[0])
print(mixed_list[1])

42
hello

# allows duplicates
dup_list = [1, 2, 2, "a", "a"]
print(dup_list)

[1, 2, 2, 'a', 'a']

# add an element
mixed_list.append("new item")
print(mixed_list)

[42, 'hello', 3.14, [1, 2, 3], 'new item']

# remove an element
mixed_list.pop(0)
print(mixed_list)

['hello', 3.14, [1, 2, 3], 'new item']

# change an element
mixed_list[1] = "world"
print(mixed_list)

['hello', 'world', [1, 2, 3], 'new item']

If you use a homogeneous list in Python - say, all integers - it behaves somewhat like an R vector:

numbers = [1, 2, 3, 4, 5]

However, and importantly, Python doesn’t enforce homogeneity like R vectors do. A Python list never coerces its elements to a common type, so it’s up to you to be disciplined if you want vector-like behaviour.

Creating the following vector in R would result in a character vector of length 3 despite the heterogeneous input data types:

# R
c(1, 2.1, "three")

[1] "1"     "2.1"   "three"

By contrast, the underlying data type would be preserved in Python (like it would in an R list):

print([1, 2.1, "three"])

[1, 2.1, 'three']
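One practical consequence worth seeing for yourself: arithmetic on a list is not vectorised the way it is on an R vector. The sketch below uses names of my own choosing (repeated, doubled, mixed) and only built-in Python.

```python
numbers = [1, 2, 3, 4, 5]

# Multiplying a list repeats it; it does not multiply each element.
repeated = numbers * 2
print(repeated)  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

# Element-wise work on a plain list needs a comprehension (or a loop).
doubled = [n * 2 for n in numbers]
print(doubled)  # [2, 4, 6, 8, 10]

# Each element keeps its own type; nothing is coerced.
mixed = [1, 2.1, "three"]
print([type(x).__name__ for x in mixed])  # ['int', 'float', 'str']
```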

In short, think of Python and R lists as equivalent in terms of data type heterogeneity. Homogeneous Python lists can be thought of as R vector equivalents but without enforced type checking and coercion. There are some closer R vector analogues in Python that we will see later, but for now, let’s stick with the list analogy and look at Python dictionaries.

Python Dictionaries ≈ R Named Lists

A Python dictionary is a key-value mapping. Those of you familiar with the JSON format will recognise the concept.

person = {
    "name": "Lewis",
    "age": 36,
    "is_male": True
}

print(person)

{'name': 'Lewis', 'age': 36, 'is_male': True}

Keys are typically strings (but can technically be any hashable type, which in practice means immutable values such as numbers, strings, and tuples). Values are fully heterogeneous: strings, numbers, lists, even other dictionaries.
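Here is a small sketch of what that means in practice. The dictionaries grid and record are invented for illustration, and I have deliberately not printed an exact error message because its wording varies between Python versions.

```python
# A tuple is hashable, so it can serve as a dictionary key.
grid = {(0, 0): "origin", (10, 20): "point A"}
print(grid[(10, 20)])  # point A

# A list is mutable and unhashable, so Python refuses it as a key.
try:
    bad = {[0, 0]: "origin"}
except TypeError as err:
    key_error_message = str(err)
print(key_error_message)

# Values can be anything, including lists and other dictionaries.
record = {"name": "Lewis", "scores": [1, 2, 3], "meta": {"active": True}}
print(record["scores"][0])       # 1
print(record["meta"]["active"])  # True
```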

In R, a named list is the closest concept:

person <- list(name = "Lewis", age = 36, is_male = TRUE)

print(person)

$name
[1] "Lewis"

$age
[1] 36

$is_male
[1] TRUE

You access values in both languages using the key or name.

# python
person["name"]

'Lewis'

# R
person[["name"]]

[1] "Lewis"
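Beyond simple lookups, a few everyday dictionary operations are worth knowing early on. This is a minimal sketch using only built-in behaviour; the "height" and "language" keys are hypothetical additions of mine, not part of the example above.

```python
person = {"name": "Lewis", "age": 36, "is_male": True}

# Square brackets raise a KeyError when the key is missing.
try:
    person["height"]
    found_height = True
except KeyError:
    found_height = False
print(found_height)  # False

# .get() returns None (or a default you supply) instead of raising.
height = person.get("height")
print(height)                           # None
print(person.get("height", "unknown"))  # unknown

# Dictionaries are mutable: update or add keys in place.
person["age"] = 37
person["language"] = "R"

# Loop over key-value pairs in insertion order.
for key, value in person.items():
    print(key, value)
```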

Dictionaries in Python are hugely important. Among other things, they are one of the most common ways to build a Pandas DataFrame, which itself behaves much like a dictionary that maps column names to columns. More on that another time.

NumPy ndarrays ≈ R Vectors and Matrices

To find true equivalents to R vectors and matrices, which are one- and two-dimensional homogeneous data structures, respectively, we can look to the NumPy library.

The N-dimensional array or ndarray, which is produced when you call np.array(), is a truly homogeneous, fixed-type, and vectorised data structure.

# load dependencies
import numpy as np

# create a 1D array ≈ R vector
np.array([1, 2, 3, 4])

# create a 2D array ≈ R matrix
np.array([
    ["a", "b", "c"],
    ["d", "e", "f"]
])

These 1D and 2D NumPy arrays are far closer to an R vector or matrix than a basic Python list. NumPy ndarrays provide fast element-wise operations (e.g. array * 2 multiplies every element) and enforce a single data type internally; if you mix types inside the initial list passed to np.array(), NumPy will automatically coerce them to a common type, just like R does inside c().
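The sketch below shows both behaviours, assuming NumPy is installed. The variable names are mine, and I check dtype.kind rather than the full string dtype because the exact string width NumPy picks is an implementation detail.

```python
import numpy as np

# Element-wise arithmetic, as with an R vector.
values = np.array([1, 2, 3, 4])
doubled = values * 2
print(doubled)  # [2 4 6 8]

# Mixing integers and floats coerces everything to float.
as_float = np.array([1, 2.5])
print(as_float.dtype)  # float64

# Mixing numbers and strings coerces everything to a string type.
as_text = np.array([1, 2.1, "three"])
print(as_text.dtype.kind)  # U (a Unicode string dtype)

# A nested list becomes a 2D array with a shape, like dim() in R.
letters = np.array([["a", "b", "c"], ["d", "e", "f"]])
print(letters.shape)  # (2, 3)
```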

One (minor) point to note: NumPy does have a dedicated matrix class (np.matrix), but its use is generally discouraged. Multi-dimensional ndarrays are the prevalent and idiomatic way to handle multi-dimensional data in this ecosystem.

Pandas Series ≈ R Vector

The Pandas library, the ubiquitous (though lesser in my opinion) equivalent to R’s dplyr package, also has a one-dimensional, homogeneous data structure called a series.

# load dependencies
import pandas as pd

# create a pandas series
pd.Series([10, 20, 30, 40])

Type enforcement in Pandas is relatively soft. Although a data type (dtype) is automatically assigned, a series can still hold mixed types: Pandas falls back to the catch-all object dtype and stores each element as an ordinary Python object. In contrast, R enforces types more strictly and will coerce them to the most general compatible type, as we have already seen.
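To see that softness in action, here is a minimal sketch, assuming Pandas is installed. The series names numeric and mixed are my own.

```python
import pandas as pd

# A homogeneous series gets a specific numeric dtype.
numeric = pd.Series([10, 20, 30, 40])
print(numeric.dtype)  # int64

# A mixed series falls back to the generic object dtype...
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)  # object

# ...and each element keeps its original Python type.
element_types = [type(x).__name__ for x in mixed]
print(element_types)  # ['int', 'str', 'float']
```

Compare this with c(1, "two", 3.0) in R, which would quietly turn everything into character strings.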

Pandas provides each element in a series with an explicit index, which can be either numeric (starting at 0) or labelled. R vectors, however, use an implicit positional index that starts at 1, with no built-in mechanism for assigning labels unless one explicitly assigns names (which are technically an additional attribute). One could argue that this enables more expressive and intuitive data access in a Pandas series, but it is a trivial point at best and similar behaviour can be achieved in R if desired.
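A short sketch of label-based versus position-based access, assuming Pandas is installed. The labels "a", "b", and "c" are arbitrary choices of mine.

```python
import pandas as pd

# Without an explicit index, Pandas numbers the elements from 0.
default = pd.Series([10, 20, 30])
print(list(default.index))  # [0, 1, 2]

# With labels, a series behaves much like a named vector in R.
scores = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(scores["b"])      # 20, lookup by label
print(scores.loc["c"])  # 30, explicit label-based lookup
print(scores.iloc[0])   # 10, explicit position-based lookup (0-based)
```

Using .loc and .iloc makes it explicit whether you mean a label or a position, which helps when the labels themselves are integers.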

The underlying object model also differs. A Pandas series is built on top of a NumPy ndarray, meaning it inherits performance and array-processing capabilities from NumPy. R vectors, by contrast, are native to the R language and tightly integrated into its own data structures.

In handling missing data, Pandas by default uses NaN (a floating-point representation for “Not a Number”), which is inherited from NumPy. R uses NA as its native indicator of missing values, which is more consistently applied across different data types in the language.
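Here is a minimal sketch of how that looks in practice, assuming Pandas is installed and using a made-up series called readings.

```python
import pandas as pd

# None in numeric data is stored as NaN, and the dtype becomes float.
readings = pd.Series([1.0, None, 3.0])
print(readings.dtype)  # float64

# isna() plays the role of is.na() in R.
missing_flags = readings.isna().tolist()
print(missing_flags)  # [False, True, False]

# Aggregations skip NaN by default, like mean(x, na.rm = TRUE) in R.
average = readings.mean()
print(average)  # 2.0
```

Note the difference in defaults: R’s mean() returns NA unless you ask it to drop missing values, whereas Pandas drops them unless you pass skipna=False.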

In short, you can think of a series as a “named vector” when you’re first moving across from R - that’s the easiest mental model.

Final Thoughts

If you’re coming from R, the key takeaway is this: Python’s core data structures aren’t alien; they are mostly just dressed a little differently. Lists and dictionaries should feel particularly familiar, tuples and sets are easy to grasp, and NumPy gives you that familiar feeling of vectors and matrices.

You don’t need to master every nuance to be productive. Grasp the essentials here, and you’ll be well-armed to tackle real-world data problems, and ready to level up with Pandas, scikit-learn, and beyond.

If there’s appetite, I might follow up with a deep dive into Pandas DataFrames and how they stack up against R’s data.frame. That one’s a real gem 👀

. . . . .

Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.

Happy Data Analysis!
