Raw Data Analysis Pipeline Overview

The nf-core RNA-Seq analysis pipeline was used to process raw sequencing data. This pipeline offers gold-standard alignment and gene/transcript quantification tools (STAR, RSEM, HISAT2 or Salmon) in various combinations with gene/isoform counts and extensive quality control metrics as outputs. This analysis was run using STAR-RSEM. An overview of the pipeline and the tools used in each alignment-quantification tool combination is shown below.

Experimental Groups and Samples

The experimental groups and samples for this experiment are summarise and described in the below table.

Table 1. Experimental Groups and Samples
Sample ID Barcode Sequence Sample Type Sample Description Group Group Description
NEG_1 CATGGC Cell Line MCF-7 Control Vybrant DiD- (proliferating)
NEG_2 CATTTT Cell Line MCF-7 Control Vybrant DiD- (proliferating)
NEG_3 CCAACA Cell Line MCF-7 Control Vybrant DiD- (proliferating)
NEG_4 TAATCG Cell Line MCF-7 Control Vybrant DiD- (proliferating)
POS_1 CTTGTA Cell Line MCF-7 Experimental Vybrant DiD+ (quiescent)
POS_2 GTAGAG Cell Line MCF-7 Experimental Vybrant DiD+ (quiescent)
POS_3 ATTCCT Cell Line MCF-7 Experimental Vybrant DiD+ (quiescent)
POS_4 CAGGCG Cell Line MCF-7 Experimental Vybrant DiD+ (quiescent)

Sequencing Data Processing

Raw Data Quality Control

The first stage in processing the FASTQ files generated during sequencing is a quality control check. A set of standard quality metrics for the raw data were generated as pipeline outputs. The most important metrics are summarised below.

Table 2. Raw Data Quality Control
Sample ID Barcode Sequence Total Reads Mean Read Length Total Low Quality Reads Per Base Sequence Check Per Sequence Check Duplicates (%) GC Content (%)
NEG_1 CATGGC 45951026 101 0 pass pass 60.590 50
NEG_2 CATTTT 53673001 101 0 pass pass 60.489 49
NEG_3 CCAACA 48940458 101 0 pass pass 59.771 49
NEG_4 TAATCG 50770052 101 0 pass pass 59.755 49
POS_1 CTTGTA 47420041 101 0 pass pass 59.638 49
POS_2 GTAGAG 47327421 101 0 pass pass 60.375 49
POS_3 ATTCCT 50320353 101 0 pass pass 61.093 49
POS_4 CAGGCG 45211787 101 0 pass pass 59.391 49

Trimming

Sequence reads were trimmed to remove adapter sequences and nucleotides with poor quality using Cutadapt which automates low-quality read and quality and adapter trimming. Cutadapt retains bases called with Phred quality scores ≥ 20 and uses the first 13 bp of Illumina standard adapters (‘AGATCGGAAGAGC’) by default which are suitable for both ends of paired-end libraries.

Table 3. Trimming Quality Control
Sample ID Barcode Sequence Total BP Trimmed (%) Adapter Content Check Total Reads Mean Read Length Duplicates (%) GC Content (%)
NEG_1 CATGGC 0.575 pass 45946306 99.721 60.580 50
NEG_2 CATTTT 0.574 pass 53667486 99.724 60.479 49
NEG_3 CCAACA 0.575 pass 48935413 99.723 59.767 49
NEG_4 TAATCG 0.573 pass 50765009 99.725 59.745 49
POS_1 CTTGTA 0.574 pass 47415372 99.723 59.631 49
POS_2 GTAGAG 0.594 pass 47322217 99.696 60.375 49
POS_3 ATTCCT 0.574 pass 50315184 99.723 61.089 50
POS_4 CAGGCG 0.573 pass 45207539 99.724 59.384 50

Alignment to Reference Genome

The trimmed reads were mapped to the Homo sapiens GRCh38 reference genome available through the genomic analysis toolkit (GATK) using the STAR aligner. The STAR aligner is a splice-aware aligner that detects splice junctions and incorporates them to help align the entire read sequences. Below are the key statistics for mapping of trimmed reads to the reference genome.

Table 4. Alignment to Reference Genome
Sample ID Barcode Sequence RSEM Alignable Reads Uniquely Mapped Reads Aligned to Multiple Loci Unalignable Reads Error Rate 5’-3’ Bias
NEG_1 CATGGC 40138589 38876003 1262586 379939 0.003 1.16
NEG_2 CATTTT 44532086 43119716 1412370 436471 0.003 1.16
NEG_3 CCAACA 40721028 39443892 1277136 403007 0.003 1.16
NEG_4 TAATCG 41911578 40101042 1810536 397038 0.003 1.18
POS_1 CTTGTA 40174585 38908831 1265754 388634 0.003 1.11
POS_2 GTAGAG 40471461 39193465 1277996 385575 0.003 1.14
POS_3 ATTCCT 42823253 41477206 1346047 413662 0.003 1.15
POS_4 CAGGCG 38454399 37247091 1207308 356789 0.003 1.16

Transcript Quantification

Read counts were quantified using RSEM. Transcript counts files were combined into a matrix of non-normalised transcript counts for each sample that was used for downstream analyses.

Table 5. Transcript Count Matrix Preview
NEG_1 NEG_2 NEG_3 NEG_4 POS_1 POS_2 POS_3 POS_4
A1BG 131.72 141.27 135.91 148.63 167.90 169.50 174.53 152.10
A1BG-AS1 134.03 161.78 146.61 171.47 145.51 135.91 145.06 110.88
A1CF 0.00 5.00 1.00 5.00 1.00 1.00 2.00 3.00
A2M 5.13 6.11 5.00 2.00 2.00 5.03 12.18 9.05
A2M-AS1 20.87 21.89 25.00 30.00 17.00 22.97 39.82 24.95
A2ML1 20.00 21.00 27.00 33.00 30.00 27.00 35.00 29.00
A2MP1 0.00 0.00 0.00 0.00 2.00 1.00 0.00 0.00
A3GALT2 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00
A4GALT 895.00 1007.00 860.00 950.00 797.00 781.00 943.00 788.00
A4GNT 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00

Dimensionality Reduction Analyses

For RNA-seq count data, variance increases with the mean. Transforming the count data to the log2 scale and applying a small pseudo count acts to minimize large differences in sequencing depth and helps normalise all samples to a similar dynamic range, thereby accounting for large variations seen between the highest expressing genes so that these do not dominate the dimensionality reduction plots.

Due to the strong noise among low count values owing to the Poisson distribution, a general log2 transformation would act to amplify this noise, meaning that low count genes would instead dominate the dimensionality reduction plots. Data is therefore transformed with a regularized log2 transformation that gives similar results for high counts as a log2 transformation but also shrinks the values of low counts towards the genes’ average across samples.

In practice, data transformation is achieved using the variance stabilizing transformation (VST) function in DESeq2. Transformations are calculated from the fitted dispersion-mean relation(s) and then the count data is transformed (normalised by division by the size factors or normalisation factors), yielding a matrix of values which are now have approximately constant variance along the range of mean values. This transformation also normalises with respect to library size. The log2 transformed counts matrix is then used in dimensionality reduction analyses.

Principal Component Analysis

Principal Component Analysis (PCA) is a multivariate dimensionality reduction technique that allows us to summarise the systematic patterns of variations in the data. PCA takes the expression levels for genes and transforms it in principal component space, reducing each sample into one point. This facilitates separation of samples by expression variation and enables identify potential sample outliers. The PCA plot is a way to look at how samples are clustering.

Inter-Sample Correlation

Inter-sample correlation analysis is another way to look at how well samples cluster. Agglomerative hierarchical clustering based on Euclidean distance computed from the normalised log2-transformed count values matrix allows us to see overall similarities and differences between the gene expression profiles of the samples.

Differential Gene Expression Analysis

Comparison of gene expression between the experimental sample groups was performed using DESeq2. Median ratio normalisation was first applied to account for differences in library size before estimation of the per gene dispersion and differential expression testing based on the negative binomial distribution; the Wald test was used to generate p-values and log2 fold-changes.

Volcano Plot

During differential expression testing, genes with an absolute fold-change > 1.5 and an adjusted p-value < 0.05 were called as differentially expressed. The results of differential expression testing are summarised in the volcano plot shown below.

Differentially Expressed Genes

Genes called as differentially expressed genes between experimental conditions (groups) are shown in the below table, ordered by padj by default. The direction column indicates the direction of the log2 fold-change (lfc) in gene expression between groups; down-regulated genes are denoted as “-1” and up-regulated genes are denoted as “1”.

Heatmap Analysis

A heatmap depicting the 50 most significantly differentially expressed genes between groups is shown below. Agglomerative hierarchical clustering based on Euclidean distance computed with an average-linkage matrix was performed using normalised log2-transformed count values.

Downstream Analysis

Bespoke downstream analyses are provided on a case-by-case basis upon request and consultation with an investigator. These might include anything from rudimentary pathway enrichment or gene set enrichment analysis (GSEA) and visualisation to protein-protein interaction network construction, validation and interrogation, or integration with large-scale public experimental and clinical data sets, including survival analyses.

Session Information

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.1 (2022-06-23)
##  os       macOS Big Sur ... 10.16
##  system   x86_64, darwin17.0
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/London
##  date     2022-12-01
##  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package              * version   date (UTC) lib source
##  annotate               1.74.0    2022-04-26 [1] Bioconductor
##  AnnotationDbi        * 1.58.0    2022-04-26 [1] Bioconductor
##  assertthat             0.2.1     2019-03-21 [1] CRAN (R 4.2.0)
##  Biobase              * 2.56.0    2022-04-26 [1] Bioconductor
##  BiocGenerics         * 0.42.0    2022-04-26 [1] Bioconductor
##  BiocParallel           1.30.4    2022-10-13 [1] Bioconductor
##  Biostrings             2.64.1    2022-08-18 [1] Bioconductor
##  bit                    4.0.5     2022-11-15 [1] CRAN (R 4.2.0)
##  bit64                  4.0.5     2020-08-30 [1] CRAN (R 4.2.0)
##  bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.2.0)
##  blob                   1.2.3     2022-04-10 [1] CRAN (R 4.2.0)
##  bslib                  0.4.1     2022-11-02 [1] CRAN (R 4.2.0)
##  cachem                 1.0.6     2021-08-19 [1] CRAN (R 4.2.0)
##  cli                    3.4.1     2022-09-23 [1] CRAN (R 4.2.0)
##  codetools              0.2-18    2020-11-04 [1] CRAN (R 4.2.1)
##  colorspace             2.0-3     2022-02-21 [1] CRAN (R 4.2.0)
##  crayon                 1.5.2     2022-09-29 [1] CRAN (R 4.2.0)
##  crosstalk              1.2.0     2021-11-04 [1] CRAN (R 4.2.0)
##  DBI                    1.1.3     2022-06-18 [1] CRAN (R 4.2.0)
##  DelayedArray           0.22.0    2022-04-26 [1] Bioconductor
##  dendextend           * 1.16.0    2022-07-04 [1] CRAN (R 4.2.0)
##  dendsort             * 0.3.4     2021-04-20 [1] CRAN (R 4.2.0)
##  DESeq2               * 1.36.0    2022-04-26 [1] Bioconductor
##  digest                 0.6.30    2022-10-18 [1] CRAN (R 4.2.0)
##  dplyr                * 1.0.10    2022-09-01 [1] CRAN (R 4.2.0)
##  ellipsis               0.3.2     2021-04-29 [1] CRAN (R 4.2.0)
##  evaluate               0.18      2022-11-07 [1] CRAN (R 4.2.0)
##  fansi                  1.0.3     2022-03-24 [1] CRAN (R 4.2.0)
##  farver                 2.1.1     2022-07-06 [1] CRAN (R 4.2.0)
##  fastmap                1.1.0     2021-01-25 [1] CRAN (R 4.2.0)
##  genefilter             1.78.0    2022-04-26 [1] Bioconductor
##  geneplotter            1.74.0    2022-04-26 [1] Bioconductor
##  generics               0.1.3     2022-07-05 [1] CRAN (R 4.2.0)
##  GenomeInfoDb         * 1.32.4    2022-09-06 [1] Bioconductor
##  GenomeInfoDbData       1.2.8     2022-10-10 [1] Bioconductor
##  GenomicRanges        * 1.48.0    2022-04-26 [1] Bioconductor
##  ggplot2              * 3.4.0     2022-11-04 [1] CRAN (R 4.2.0)
##  ggrepel              * 0.9.2     2022-11-06 [1] CRAN (R 4.2.0)
##  glue                   1.6.2     2022-02-24 [1] CRAN (R 4.2.0)
##  gridExtra              2.3       2017-09-09 [1] CRAN (R 4.2.0)
##  gtable                 0.3.1     2022-09-01 [1] CRAN (R 4.2.0)
##  here                 * 1.0.1     2020-12-13 [1] CRAN (R 4.2.0)
##  highr                  0.9       2021-04-16 [1] CRAN (R 4.2.0)
##  hms                    1.1.2     2022-08-19 [1] CRAN (R 4.2.0)
##  htmltools              0.5.3     2022-07-18 [1] CRAN (R 4.2.0)
##  htmlwidgets            1.5.4     2021-09-08 [1] CRAN (R 4.2.0)
##  httr                   1.4.4     2022-08-17 [1] CRAN (R 4.2.0)
##  IRanges              * 2.30.1    2022-08-18 [1] Bioconductor
##  jquerylib              0.1.4     2021-04-26 [1] CRAN (R 4.2.0)
##  jsonlite               1.8.3     2022-10-21 [1] CRAN (R 4.2.0)
##  KEGGREST               1.36.3    2022-07-14 [1] Bioconductor
##  knitr                  1.41      2022-11-18 [1] CRAN (R 4.2.0)
##  lattice                0.20-45   2021-09-22 [1] CRAN (R 4.2.1)
##  lifecycle              1.0.3     2022-10-07 [1] CRAN (R 4.2.1)
##  locfit                 1.5-9.6   2022-07-11 [1] CRAN (R 4.2.0)
##  magrittr             * 2.0.3     2022-03-30 [1] CRAN (R 4.2.0)
##  Matrix                 1.5-3     2022-11-11 [1] CRAN (R 4.2.0)
##  MatrixGenerics       * 1.8.1     2022-06-30 [1] Bioconductor
##  matrixStats          * 0.63.0    2022-11-18 [1] CRAN (R 4.2.0)
##  memoise                2.0.1     2021-11-26 [1] CRAN (R 4.2.0)
##  munsell                0.5.0     2018-06-12 [1] CRAN (R 4.2.0)
##  org.Hs.eg.db         * 3.15.0    2022-10-31 [1] Bioconductor
##  pheatmap             * 1.0.12    2019-01-04 [1] CRAN (R 4.2.0)
##  pillar                 1.8.1     2022-08-19 [1] CRAN (R 4.2.0)
##  pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 4.2.0)
##  png                    0.1-7     2013-12-03 [1] CRAN (R 4.2.0)
##  purrr                  0.3.5     2022-10-06 [1] CRAN (R 4.2.0)
##  R6                     2.5.1     2021-08-19 [1] CRAN (R 4.2.0)
##  RColorBrewer         * 1.1-3     2022-04-03 [1] CRAN (R 4.2.0)
##  Rcpp                   1.0.9     2022-07-08 [1] CRAN (R 4.2.0)
##  RCurl                  1.98-1.9  2022-10-03 [1] CRAN (R 4.2.0)
##  reactable              0.3.0     2022-05-26 [1] CRAN (R 4.2.0)
##  reactR                 0.4.4     2021-02-22 [1] CRAN (R 4.2.0)
##  readr                * 2.1.3     2022-10-01 [1] CRAN (R 4.2.0)
##  rlang                  1.0.6     2022-09-24 [1] CRAN (R 4.2.0)
##  rmarkdown              2.18      2022-11-09 [1] CRAN (R 4.2.0)
##  rprojroot              2.0.3     2022-04-02 [1] CRAN (R 4.2.0)
##  RSQLite                2.2.19    2022-11-24 [1] CRAN (R 4.2.1)
##  rstudioapi             0.14      2022-08-22 [1] CRAN (R 4.2.0)
##  S4Vectors            * 0.34.0    2022-04-26 [1] Bioconductor
##  sass                   0.4.4     2022-11-24 [1] CRAN (R 4.2.1)
##  scales                 1.2.1     2022-08-20 [1] CRAN (R 4.2.0)
##  sessioninfo            1.2.2     2021-12-06 [1] CRAN (R 4.2.0)
##  stringi                1.7.8     2022-07-11 [1] CRAN (R 4.2.0)
##  stringr              * 1.4.1     2022-08-20 [1] CRAN (R 4.2.0)
##  SummarizedExperiment * 1.26.1    2022-05-01 [1] Bioconductor
##  survival               3.4-0     2022-08-09 [1] CRAN (R 4.2.0)
##  tibble                 3.1.8     2022-07-22 [1] CRAN (R 4.2.0)
##  TidyMultiqc          * 1.0.3     2022-09-25 [1] CRAN (R 4.2.0)
##  tidyr                * 1.2.1     2022-09-08 [1] CRAN (R 4.2.0)
##  tidyselect             1.2.0     2022-10-10 [1] CRAN (R 4.2.0)
##  tzdb                   0.3.0     2022-03-28 [1] CRAN (R 4.2.0)
##  utf8                   1.2.2     2021-07-24 [1] CRAN (R 4.2.0)
##  vctrs                  0.5.1     2022-11-16 [1] CRAN (R 4.2.0)
##  viridis                0.6.2     2021-10-13 [1] CRAN (R 4.2.0)
##  viridisLite            0.4.1     2022-08-22 [1] CRAN (R 4.2.0)
##  vroom                  1.6.0     2022-09-30 [1] CRAN (R 4.2.0)
##  withr                  2.5.0     2022-03-03 [1] CRAN (R 4.2.0)
##  xfun                   0.35      2022-11-16 [1] CRAN (R 4.2.0)
##  XML                    3.99-0.12 2022-10-28 [1] CRAN (R 4.2.0)
##  xtable                 1.8-4     2019-04-21 [1] CRAN (R 4.2.0)
##  XVector                0.36.0    2022-04-26 [1] Bioconductor
##  yaml                   2.3.6     2022-10-18 [1] CRAN (R 4.2.0)
##  zlibbioc               1.42.0    2022-04-26 [1] Bioconductor
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────