RNA-Seq End-to-End Pt. 4 - Lewis Does Data

Introduction

Welcome back to the fourth and final part in this step-by-step guide to bulk RNA sequencing data analysis covering the streamlined workflow from raw reads to publication-ready figures for differential expression analysis.

In part one of this series we undertook raw sequencing (FASTQ) data processing on a HPC. In part two we read the resultant transcript counts matrix and experimental metadata into R and carried out exploratory and quality control (QC) analyses, before undertaking differential expression testing in part three.

This final post in the series will explore some common visualisation techniques that can be used to summarise the statistical testing results and enable conveyance of the insights obtained from our data.

Note: This tutorial is going to continue directly on from the previous tutorial in this series, so I will be assuming that you have all the requisite files downloaded, your project directory setup, software and dependencies installed and loaded, and the relevant objects created previously available in your R environment.

If you have not followed the previous tutorials in this series and just want to experiment with making some plots, the requisite DESeqDataSet object and annotated DE testing result for this tutorial can be downloaded as an RData file from here. Place this into the ./data/processed folder of your project directory tree and read the file by running the code below:

# load dds and res_cln objects
load(file = here("data", "processed", "ir_versus_ctr.rda"))

Now that everyone is on the same page, let’s make some plots!

Volcano Plot

A volcano plot is one of the most effective ways to visualise differential expression results. It combines statistical significance and the magnitude of the difference between conditions by plotting the –log₁₀ adjusted p-value (y-axis) against the log₂ fold-change (x-axis). Points far from the origin in either direction stand out as both highly significant and biologically interesting.

To begin, we can generate a simple volcano plot using the cleaned differential expression results stored in the res_cln data frame.

# basic volcano plot
ggplot(res_cln, aes(x = lfc, y = -log10(pvalue))) +
  geom_point()

This plot is a good starting point but lacks detail. For instance, it doesn’t distinguish between significant and non-significant results. By adding a new column to indicate significance, we can colour the points accordingly, making the plot more informative.

# simple coloured points
res_cln %>% 
  mutate(signif = padj < 0.05 & abs(lfc) > 1) %>% 
  
  ggplot(aes(x = lfc, y = -log10(pvalue), colour = signif)) +
  geom_point()

This improvement highlights significant genes, but we can take it a step further by distinguishing between up- and down-regulated genes. By using the case_when() function to categorise points, we can colour genes based on their direction of regulation.

# differential coloured points
res_cln %>% 
  mutate(
    signif = case_when(
      padj < 0.05 & lfc > 1 ~ "Up",
      padj < 0.05 & lfc < -1 ~ "Down",
      padj > 0.05 ~ "NS"
    )
  ) %>% 
  
  ggplot(aes(x = lfc, y = -log10(pvalue), colour = signif)) +
  geom_point() +
  scale_color_discrete(type = c("lightsteelblue", "grey60", "royalblue"))

Labelling specific genes can further enhance the plot’s usefulness, especially when highlighting biologically significant findings. Instead of overwhelming the plot with all gene names, we can selectively label the top 10 up- and down-regulated genes. Using the geom_text_repel() function from the ggrepel package ensures labels don’t overlap.

# annotated points
up <- 
  res_cln %>% 
  filter(lfc > 0) %>% 
  slice_head(n = 10) %>% 
  pull(symbol)

dn <- 
  res_cln %>% 
  filter(lfc < 0) %>% 
  slice_head(n = 10) %>% 
  pull(symbol)

res_cln %>% 
  mutate(
    signif = case_when(
      padj < 0.05 & lfc > 1 ~ "Up",
      padj < 0.05 & lfc < -1 ~ "Down",
      padj > 0.05 ~ "NS"
    ),
    label = if_else(symbol %in% c(up, dn), true = symbol, false = "")
  ) %>% 
  
  ggplot(aes(x = lfc, y = -log10(pvalue), label = label)) +
  geom_point(aes(colour = signif)) +
  scale_color_discrete(type = c("lightsteelblue", "grey60", "royalblue")) +
  geom_text_repel()

By combining these elements, we can customise the plot further to suit our dataset and biological questions. For example, instead of the most significant genes, you might choose to highlight genes in a specific pathway or those related to a hypothesis. In the plot below, I have tinkered with the ggplot2 aesthetics a little more to create a publication-ready figure.

Expression Plot

Expression plots are a powerful yet underutilised tool for visualising the expression levels of individual genes across experimental conditions, offering an intuitive way to validate statistical results by showing the actual distribution of normalised counts. These both direct complement the volcano plot and visual aid confirmation of differential expression.

To build an expression plot, we first extract the normalised counts for genes of interest. Here, I’ve chosen the top 15 most significant genes, but this can be tailored to highlight a specific pathway or gene set.

# subset DE testing result to top n most significant genes
top_res <- 
  res_cln %>% 
  slice_min(order_by = padj, n = 15)

# extract normalised counts matrix and subset to genes selected above
top_res_counts <- 
  dds %>% 
  estimateSizeFactors() %>% 
  counts(normalized = TRUE) %>% 
  as.data.frame() %>% 
  filter(rownames(.) %in% top_res$ensembl)

# inspect the result
head(top_res_counts, n = 10)

                  ctr_bc_2  tgf_bc_4    ir_bc_5   ctr_bc_6  tgf_bc_7  ir_bc_12  ctr_bc_13 tgf_bc_14   ir_bc_15
ENSG00000011426  3732.2509 4020.6575 1176.88878  3548.5447 3981.9387 1193.7137  3694.6614 3460.2323  969.52236
ENSG00000046653   640.7015  145.8357   58.90107   521.1213  133.2389   87.3737   473.6476  142.8151   67.11284
ENSG00000066279  1367.9589 1588.9313  329.61948  1433.0837 1697.8438  343.5912  1794.8199 1238.6642  365.50683
ENSG00000089685   904.1324 1124.2918  207.28647   790.7925 1113.4962  220.7957   756.1558 1131.3195  199.27350
ENSG00000105989   989.7474 1666.0886  306.96522  1099.6389 1569.3635  247.9524   979.8519 1487.8905  230.24865
ENSG00000107562  9112.8265 5766.4470 1012.64540  8877.2837 6169.9112 1438.1239  8339.7686 5622.9941  861.10932
ENSG00000112984  1338.7933 1629.6296  251.46228  1111.4826 1541.7640  328.2418  1134.2337 1489.7574  287.03644
ENSG00000113368   774.2986 1217.5589   87.21890   754.3505 1235.3146  168.8438   751.9550 1121.9852  114.60807
ENSG00000115594 11402.7934 3480.5565 3084.37743 10936.2595 4072.3508 3437.0926 10596.6837 4346.9926 2739.23624
ENSG00000116774  4830.1932 6388.7925 1319.61061  4954.2969 6127.0844 1479.4493  4430.8631 6288.5311 1333.99669

The next step is to reformat the data into a long format suitable for plotting. Since tibbles in R do not support rownames, we first move the Ensembl IDs from the rownames into a dedicated column in the data frame before pivoting.

# pivot to long format data frame
top_res_counts_long <- 
  top_res_counts %>% 
  rownames_to_column(var = "ensembl") %>% 
  pivot_longer(cols = !ensembl, names_to = "sample_id", values_to = "norm_count")

# inspect the result
print(top_res_counts_long)

# A tibble: 135 × 3
   ensembl         sample_id norm_count
   <chr>           <chr>          <dbl>
 1 ENSG00000046653 ctr_bc_2       641. 
 2 ENSG00000046653 tgf_bc_4       146. 
 3 ENSG00000046653 ir_bc_5         58.9
 4 ENSG00000046653 ctr_bc_6       521. 
 5 ENSG00000046653 tgf_bc_7       133. 
 6 ENSG00000046653 ir_bc_12        87.4
 7 ENSG00000046653 ctr_bc_13      474. 
 8 ENSG00000046653 tgf_bc_14      143. 
 9 ENSG00000046653 ir_bc_15        67.1
10 ENSG00000107562 ctr_bc_2      9113. 
# ℹ 125 more rows
# ℹ Use `print(n = ...)` to see more rows

At this stage, we can create a basic expression plot. We’ll create a grouping variable from the sample IDs to colour the data points by replicate groups and enable exclusion of the TGF group data, as this is not part of the contrast of interest.

# basic expression plot
top_res_counts_long %>% 
  mutate(group = str_remove(sample_id, "_.*")) %>% 
  filter(group != "tgf") %>% 
  
  ggplot(aes(x = norm_count, y = ensembl, colour = group)) +
  geom_point() +
  scale_x_log10()

To refine the plot, we can swap the Ensembl IDs for gene symbol annotation and reorder genes by the significance of differential expression (adjusted p-value) and add the metadata for additonal annotation e.g., colouring by group.

# join with metadata and columns from de testing result
top_res_counts_annot <- 
  top_res_counts_long %>% 
  left_join(top_res, by = "ensembl") %>% 
  left_join(as.data.frame(colData(dds)), by = "sample_id") %>% 
  filter(group != "TGF") %>% 
  mutate(symbol = fct_reorder(symbol, padj, .desc = TRUE))

# expression plot
top_res_counts_annot %>% 
  ggplot(aes(x = norm_count, y = symbol, colour = group)) +
  geom_point() +
  scale_x_log10()

Finally, just as with the volcano plot we created earlier, spending time refining the plot aesthetics - such as adjusting colours, axis labels, or titles - can transform this preliminary version into a polished, publication-ready figure.

When interpreting expression plots, consistent clustering of samples within the same condition indicates low variability and reliability, while separation between groups reflects robust differential expression. Outliers and overlapping distributions should be carefully examined, as they could highlight data issues.

Clustered Heatmap

Clustered heatmaps are a powerful way to visualise patterns of gene expression across multiple samples. By clustering genes and samples with similar expression profiles, they reveal underlying structure in the data, such as co-regulated genes or the effects of experimental conditions. This visualisation is especially useful for identifying broader trends that might not be apparent from other plots.

Although ggplot2 has a geom_tile() function to make heatmaps, specialised packages such as pheatmap offer more functionality such as scaling values and clustering the data.

For our heatmap, we’ll use variance-stabilised counts, as these are more suitable for visualisation, particularly those using distance measures such as heatmaps with hierarchical clustering. We’ll start by selecting the top 50 differentially expressed genes based on significance.

# top 50 de results
top_results <- 
  res_cln %>% 
  slice_min(n = 50, order_by = padj)

# extract subsetted vst counts matrix
top_counts <- 
  dds %>% 
  vst() %>% 
  assay() %>% 
  subset(subset = rownames(.) %in% pull(top_results, ensembl))

With these data, we can create a basic heatmap. Note that the default colour palette goes from low expression in blue to high expression in red, which is a good alternative to the traditional red/green heatmaps which are not suitable for those with red-green colour-vision deficiency.

# basic heatmap
pheatmap(
  mat = top_counts,
  scale = "row"
)

This plot provides a good starting point but isn’t very interpretable due to the Ensembl IDs in the row labels. To make it more user-friendly, we can replace these with HUGO gene symbols.

# change the ensembl ids to gene names
tc_symbol <- 
  top_counts %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "ensembl") %>%
  left_join(select(top_results, ensembl, symbol), by = "ensembl") %>% 
  column_to_rownames(var = "symbol") %>% 
  select(!ensembl) %>% 
  as.matrix()

pheatmap(
  mat = tc_symbol,
  scale = "row"
)

Note that there are a few ways of doing this, one of which is to use the labels_row argument. However, I personally prefer the above approach to using the labels_row option, mainly because this method enables dendrogram customisation and optimisation (as shown in my final example that uses MOLO method with average linkage) without the risk of mislabelling, or the hassle required to reorder the lables to avoid mislabelling.

The column names can also be updated to reflect the experimental group and replicate to improve clarity.

# modify column names
col_labs <-
  dds %>% 
  colData() %>% 
  as.data.frame() %>% 
  select(sample_id:rep) %>% 
  mutate(label = str_c(group, rep, sep = " ")) %>%
  pull(label)

pheatmap(
  mat = tc_symbol,
  scale = "row",
  labels_col = col_labs
)

To add another layer of information, we can annotate the columns with sample metadata, such as experimental groups, using the annotation_col argument - this can be a little trickier than our previous modifications but essentially requires a data frame where the rownames match the colnames in the matrix passed to the pheatmap function.

# add sample info
sample_info <- 
  dds %>% 
  colData() %>% 
  as.data.frame() %>% 
  select("Group" = group)

pheatmap(
  mat = tc_symbol,
  scale = "row",
  labels_col = col_labs,
  annotation_col = sample_info,
  annotation_names_col = FALSE
)

This annotated heatmap offers a much clearer view of the patterns in the data, highlighting how genes and samples cluster together.

For publication-ready figures, you can customise the heatmap further. Options include modifying the clustering algorithm, scaling, colour schemes, or even defining the order of rows and columns (manually or using other algorithms); doing so relies on a more in-depth understanding of unsupervised clustering and empirical clustering coefficient calculation, as well as lower-level familiarity with pheatmap and other great packages such as cluster or dendsort; all topics which are outside the scope of this tutorial. One for future, perhaps? Let me know in the tutorial video comments.

To save your heatmap, you can assign it to an object and use ggsave() for export.

# save output
heatmap <- 
  pheatmap(
    mat = tc_symbol,
    scale = "row",
    annotation_col = sample_info,
    annotation_names_col = FALSE,
    silent = TRUE
  )

ggsave(
  plot = heatmap,
  filename = "heatmap.png",
  device = "png",
  path = here("results", "figures"),
  width = 8.27,
  height = 11.69,
  units = "in",
  dpi = 300
)

Adding the silent = TRUE argument here prevents the assignment operation also printing output to the plot viewer.

Summary

In this tutorial, we explored three core visualisation techniques for RNA-Seq differential expression analysis: volcano plots, expression plots, and clustered heatmaps. Each serves a unique purpose, from highlighting significant genes to uncovering broader patterns in the data.

From this point, there are many more analyses and visualisations to consider, such as pathway over-representation analysis, gene set enrichment analysis (GSEA) and variants thereof, myriad graph network analyses, or even preparing data for machine learning applications. I plan to cover some of these in future blog posts, so stay tuned!

If you’ve found this post or series on bulk RNA-Seq data analysis helpful, I’d love to hear your feedback. Please let me know if there’s anything you’d like me to explore in future tutorials.

See you next time.

. . . . .

Thanks for reading. I hope you enjoyed the article and that it helps you to get a job done more quickly or inspires you to further your data science journey. Please do let me know if there’s anything you want me to cover in future posts.

If this tutorials has helped you, consider supporting the blog on Ko-fi!

Happy Data Analysis!

. . . . .

Disclaimer: All views expressed on this site are exclusively my own and do not represent the opinions of any entity whatsoever with which I have been, am now or will be affiliated.

← Previous Post