There’s a very particular kind of pain that comes with finding a great data science resource only to discover that a critical dependency has aged like milk in the sun. I was recently afflicted with such a pain when I wanted to re-run a tutorial I had covered years ago for some machine learning teaching materials, only to find the tools underpinning it had been scratched (buckle up, they don’t get any better 🫣).
The post in question was the PCA for hip hop songs analysis by Posit’s own Julia Silge, a person who’s a frequent recommendation to my students and one of my personal data science heroes. The issue with this particular post is that almost the entire workflow relies on retrieving audio features from the Spotify API via the spotifyr package, and, in their eternal wisdom, Spotify have recently changed their API behaviour and authentication requirements in a way that has left part of the package broken.
According to Spotify’s own changelog, access to certain audio-feature data has been tightened, and client authentication now mandates loopback redirects using an explicit IP address such as 127.0.0.1 (rather than the localhost alias). Both pose issues, but the first one is the biggie 😉.
Now, if we want data science to be taken seriously - even in the highly academic field of ranking the best hip-hop tracks - then our work must remain runnable years later without having to reverse-engineer half the internet. It would have been nice if Julia had thought to cache the data for the original tutorial and make it available (which, for the record, she didn’t), but we can fortunately get things back on track.
To get things working again, we need to:
✔ Patch the broken auth function
✔ Retrieve Spotify track IDs reliably
✔ Fetch audio features from a different API
That API is ReccoBeats, a third-party service that accepts Spotify Track IDs as query parameters and returns the familiar audio descriptors - acousticness, danceability, energy, and so on. It does have a few limitations though, namely a limit on how many tracks you can query per request and restrictions on how quickly you can send those requests. It’s not perfect, but a bit of tidyverse wizardry will work around these limitations and get us the data we need. Bring da ruckus.
Setup here is pretty light: we need the rankings table from the Rap Artists data set originally posted as part of the TidyTuesday project on 2020-04-14 and to load a couple of packages.
# load packages
library(tidyverse)
library(spotifyr)
# load data
rankings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/refs/heads/main/data/2020/2020-04-14/rankings.csv")
You’ll also need to follow the short guide to setting up a Spotify developer account with spotifyr that you can find here.
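The short version, once you have a client ID and secret from the developer dashboard, is to expose them as the environment variables the code below expects and to register http://127.0.0.1:1410/ as your app’s redirect URI. A minimal sketch, with placeholder values:
# placeholder values - swap in your own credentials from the developer dashboard
Sys.setenv(SPOTIFY_CLIENT_ID = "your-client-id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your-client-secret")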
Let’s resolve authentication first by re-writing the get_spotify_authorization_code() function from spotifyr so that it sends the OAuth loopback redirect to 127.0.0.1 rather than localhost.
# my next pull request
get_spotify_auth_code <-
  function(
    client_id = Sys.getenv("SPOTIFY_CLIENT_ID"),
    client_secret = Sys.getenv("SPOTIFY_CLIENT_SECRET"),
    scope = spotifyr::scopes()
  ) {
    # spotify's oauth endpoints
    endpoint <-
      httr::oauth_endpoint(
        authorize = "https://accounts.spotify.com/authorize",
        access = "https://accounts.spotify.com/api/token"
      )

    # loopback redirect must use an explicit ip address, not localhost
    app <- httr::oauth_app("spotifyr", client_id, client_secret, redirect_uri = "http://127.0.0.1:1410/")

    # request the token, capturing any error rather than failing outright
    token <- (purrr::safely(.f = httr::oauth2.0_token))(endpoint = endpoint, app = app, scope = scope)

    if (!is.null(token$error)) {
      token$error
    } else {
      token$result
    }
  }
We can now authenticate successfully and sanity-check the access token:
# get auth code
auth <- get_spotify_auth_code(scope = "user-read-email user-read-private")
# retrieve user profile info
get_my_profile(authorization = auth)
# A tibble: 1 × 18
country display_name email explicit_content.fil…¹ explicit_content.fil…² external_urls.spotify followers.total href
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 GB ########## ###### FALSE FALSE https://open.spotify… 0 http…
# ℹ abbreviated names: ¹explicit_content.filter_enabled, ²explicit_content.filter_locked
# ℹ 10 more variables: id <chr>, images.height1 <chr>, images.height2 <chr>, images.url1 <chr>, images.url2 <chr>,
# images.width1 <chr>, images.width2 <chr>, product <chr>, type <chr>, uri <chr>
Next, we need Spotify Track IDs for each song in the rankings dataset. We can thankfully leverage a native spotifyr function for this - search_spotify() - but we will need to wrap it in some additional code so that we can add the IDs to our data set elegantly.
get_spotify_id <- function(title, artist) {
  res <-
    str_c(title, artist, sep = " ") %>%
    search_spotify("track") %>%
    # one row per track-artist pairing
    unnest(artists, names_sep = "_") %>%
    # keep only hits where the listed artist matches the queried artist
    filter(str_detect(artists_name, fixed(artist, ignore_case = TRUE))) %>%
    # prefer the earliest release where multiple versions exist
    arrange(album.release_date) %>%
    slice_head(n = 1) %>%
    pull(id)

  # safeguard: return NA rather than character(0) when nothing matches
  if (length(res) == 0) {
    return(NA_character_)
  } else {
    return(res)
  }
}
In the original tutorial, Julia concatenated title and artist into a single query, which caused some mismatches and duplications that went unnoticed. I have refined this by using pre-cleaned artist names (keeping collaborations separate) and track titles, matching only where the listed artist actually includes the queried artist, and selecting the earliest release when multiple versions exist.
With the function defined, we can add the Spotify track ID.
# register parallel backend
future::plan(future::multisession, workers = parallel::detectCores() - 1)

# add spotify track ids to data
rankings_ids <-
  rankings %>%
  # drop the surrogate key; we don't need it
  select(!"ID") %>%
  # normalise case so it can't interfere with matching
  mutate(across(title:artist, str_to_lower)) %>%
  # split featured artists out into their own column
  separate_wider_delim(artist, delim = " ft. ", names = c("artist", "collab"), too_few = "align_start") %>%
  mutate(
    spotify_id = furrr::future_map2_chr(
      title,
      artist,
      get_spotify_id,
      .progress = TRUE
    )
  )
Before we call the lookup, I register a parallel backend via futures. This spins up one background R session per core (leaving one free so the machine stays responsive) and makes furrr mapping functions run those requests concurrently. I use multisession() rather than multicore() because it’s cross-platform and behaves consistently on macOS, Windows, and Linux.
The wrangling that follows is about making searches deterministic. After dropping the irrelevant surrogate key ID, I normalise the two text fields that matter for matching (title and artist) to lower case so that case differences don’t create spurious mismatches. Featured artists are then split away from the primary artist. Keeping collaborators in a separate column is important because the search function I wrote filters on the listed artist to avoid pulling a remix or a compilation where the same track title appears under a different primary artist. This is the main reason my approach produces fewer duplicates and fewer “near miss” IDs than concatenating title and artist raw.
With the cleaning in place, the final mutate() call adds a spotify_id by calling furrr::future_map2_chr(). Using map2_chr() means the function is applied pairwise, row by row, to each title–artist combination, preserving row order and returning a character vector. Because it’s a furrr function, each API search is sent from a worker session, so we resolve the whole column much faster than a single-threaded loop (which is what we’d get with the purrr equivalent), and the progress bar keeps you in the loop about how it’s going.
A small practical note: search_spotify() uses httr::RETRY() under the hood, which introduces randomised back-off. If you prefer fully reproducible parallel streams (and want to silence the occasional warning about RNG safety), pass .options = furrr::furrr_options(seed = TRUE) to future_map2_chr(). You could also wrap get_spotify_id() with purrr::possibly() to guard against transient HTTP errors; I didn’t bother for the run above because I like to live life on the edge, but a sketch of the more careful version follows.
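Here’s roughly what that belt-and-braces variant would look like, reusing the columns from rankings_ids above (the wrapped function name is mine, not part of spotifyr or the original post):
# a more defensive variant (not used for the run above): possibly() swaps
# errors for NA rather than failing the whole map
safe_get_spotify_id <- purrr::possibly(get_spotify_id, otherwise = NA_character_)

# same lookup as before, with parallel-safe, reproducible RNG streams
spotify_id <- furrr::future_map2_chr(
  rankings_ids$title,
  rankings_ids$artist,
  safe_get_spotify_id,
  .options = furrr::furrr_options(seed = TRUE),
  .progress = TRUE
)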
Let’s see how we got on:
# check for missing or duplicate ids
rankings_ids %>%
  summarise(
    tot = length(spotify_id),
    miss = sum(is.na(spotify_id)),
    dup = sum(duplicated(na.omit(spotify_id))),
    avail_uniq = length(unique(na.omit(spotify_id)))
  )
# A tibble: 1 × 4
tot miss dup avail_uniq
<int> <int> <int> <int>
1 311 41 25 245
I know I could do better here if I didn’t have the time penalty incurred by the minor inconvenience of a full-time job, but even with missing IDs (almost unavoidable for some obscure tracks), we’re in better shape than the original tutorial: after cleaning out duplicates, we recover 79% of the track IDs in the data set, where the original tutorial only managed 71%. Juicy.
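For the record, the 79% falls straight out of the summary above:
# recovery rate = unique usable ids / total tracks (from the summary above)
245 / 311  # ~0.79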
The last thing to do here is quickly clean up the data and get it ready for the next step.
# clean-up the results
rankings_ids_cln <-
  rankings_ids %>%
  filter(!is.na(spotify_id)) %>%
  distinct(spotify_id, .keep_all = TRUE)
print(rankings_ids_cln, n = 5)
# A tibble: 244 × 13
title artist collab year gender points n n1 n2 n3 n4 n5 spotify_id
<chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 juicy the notorious b.i.g. NA 1994 male 140 18 9 3 3 1 2 5ByAIlEEn…
2 fight the power public enemy NA 1989 male 100 11 7 3 1 0 0 6JmrKTzhh…
3 shook ones (part ii) mobb deep NA 1995 male 94 13 4 5 1 1 2 1VIG2mUBL…
4 the message grandmaster flash & t… NA 1982 male 90 14 5 3 1 0 5 6XJWGeJws…
5 c.r.e.a.m. wu-tang clan NA 1993 male 62 10 3 1 1 4 1 119c93MHj…
# ℹ 239 more rows
# ℹ Use `print(n = ...)` to see more rows
Now we’re ready to extract audio features. Spotify locked the door on multi-track audio features…so we’ll go in through the side window.
What we need here is an alternative to the spotifyr function get_track_audio_features() that plays nicely with the ReccoBeats API.
The ReccoBeats endpoint accepts multiple Spotify IDs in a single query, so the function takes a vector of IDs, collapses them into a comma-separated string, and builds a GET request using httr::modify_url(). The response is JSON, which we convert to plain text, parse, extract the “content” element that contains the audio features we care about, and then jam into something tidyverse-friendly to work with downstream.
The whole thing is wrapped in purrr::slowly(). This is important because the ReccoBeats API has a limit on how fast you can send requests; without some throttling, you’ll quickly earn yourself an HTTP 429, a stern look from their server and some NA values instead of data. The rate_delay(pause = 0.5) argument ensures we leave at least half a second between calls, which keeps us on the right side of the API police.
get_recco_audio_features <-
  slowly(
    function(ids) {
      # build the request: ids go in as a single comma-separated query parameter
      url <- httr::modify_url(
        "https://api.reccobeats.com/v1/audio-features",
        query = list(ids = str_c(ids, collapse = ","))
      )

      # fire the request and tidy the json response
      url %>%
        httr::GET() %>%
        httr::content("text", encoding = "UTF-8") %>%
        jsonlite::fromJSON() %>%
        magrittr::extract2("content") %>%
        as_tibble()
    },
    # leave at least half a second between calls
    rate = rate_delay(pause = 0.5)
  )
Next, we need to batch the data and run the query. To do this, I use nesting: instead of sending hundreds of individual requests (which would almost certainly upset the server), we group track IDs into small batches and tuck each batch into a list-column; furrr::future_map() can then iterate over those groups in parallel, making one API call per batch rather than one per row.
ReccoBeats don’t publish a clear, authoritative batch-size limit for this endpoint, and I’ve seen different numbers mentioned in various threads, but the server starts complaining if you get too ambitious. Rather than trying to reverse-engineer the exact ceiling, I’m using a conservative batch size of five IDs per request. It’s not that I think anyone should seriously limit an API to five-track chunks - only that I prefer a method that always works over one that only sometimes works.
# get audio features for cleaned rankings data
rankings_features <-
  rankings_ids_cln %>%
  # assign each track to a small batch
  mutate(group_id = row_number() %/% 5) %>%
  select(group_id, spotify_id) %>%
  # tuck each batch of ids into a list-column
  nest(data = !c(group_id)) %>%
  mutate(
    audio_features = furrr::future_map(
      data,
      ~ get_recco_audio_features(pull(.x, spotify_id)),
      .progress = TRUE
    )
  )
With parallelisation and chunking, we avoid hammering their server and still finish in reasonable time. Once retrieved, we merge the audio features back into the cleaned dataset:
# extract audio features
audio_features <-
  rankings_features %>%
  unnest(audio_features) %>%
  # reccobeats returns the spotify track link as href; strip it back to the bare id
  mutate(href = str_remove(href, "https://open.spotify.com/track/")) %>%
  select(spotify_id = href, acousticness:last_col())

# add audio features to cleaned rankings data
rankings_audio <-
  rankings_ids_cln %>%
  inner_join(audio_features, by = "spotify_id")
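ReccoBeats won’t necessarily return features for every ID we send it, so if you run this yourself it’s worth a quick look at what, if anything, fell through the cracks:
# which tracks (if any) came back without audio features?
rankings_ids_cln %>%
  anti_join(audio_features, by = "spotify_id") %>%
  select(title, artist, spotify_id)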
You can get your own copy of the data from here.
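And if you do re-run the pipeline yourself, it’s worth caching the result locally (pick whatever filename you like) so the analysis stops depending on two APIs staying in a good mood:
# cache the final dataset so the analysis no longer depends on live APIs
write_csv(rankings_audio, "rankings_audio.csv")

# in future sessions, read the cached copy back instead
rankings_audio <- read_csv("rankings_audio.csv")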
APIs evolve - often without warning - and reproducibility can break overnight. One minute everything’s working; the next, a critical endpoint takes a dirt-nap.
This is exactly why open data science matters: we debug, we document, we adapt - and we make sure the next person doesn’t fall into the same hole. In this case, a small authentication tweak, a more careful approach to ID lookup, and a polite workaround for Spotify’s new restrictions are enough to keep at least one excellent tutorial (Julia’s, not mine) running well into 2025 and beyond.
So, if you’ve been trying to resurrect that tutorial and felt like Spotify had a personal vendetta against your interest in audio features, hopefully this guide has saved you a bit of grief (and a few new grey hairs). And maybe it’s a gentle reminder to all of us that if a workflow depends on volatile APIs, ship the dataset too.
The analysis is once again reproducible. The future of hip-hop PCA tutorials is secure. Reproducibility: up and to the right.
Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.
If this tutorial has helped you, consider supporting the blog on Ko-fi!
Happy Data Analysis!
Disclaimer: All views expressed on this site are exclusively my own and do not represent the opinions of any entity whatsoever with which I have been, am now or will be affiliated.