The nf-core RNA-Seq analysis pipeline was used to process raw sequencing data. This pipeline offers gold-standard alignment and gene/transcript quantification tools (STAR, RSEM, HISAT2 or Salmon) in various combinations with gene/isoform counts and extensive quality control metrics as outputs. This analysis was run using STAR-RSEM. An overview of the pipeline and the tools used in each alignment-quantification tool combination is shown below.
The experimental groups and samples for this experiment are summarise and described in the below table.
Sample ID | Barcode Sequence | Sample Type | Sample Description | Group | Group Description |
---|---|---|---|---|---|
NEG_1 | CATGGC | Cell Line | MCF-7 | Control | Vybrant DiD- (proliferating) |
NEG_2 | CATTTT | Cell Line | MCF-7 | Control | Vybrant DiD- (proliferating) |
NEG_3 | CCAACA | Cell Line | MCF-7 | Control | Vybrant DiD- (proliferating) |
NEG_4 | TAATCG | Cell Line | MCF-7 | Control | Vybrant DiD- (proliferating) |
POS_1 | CTTGTA | Cell Line | MCF-7 | Experimental | Vybrant DiD+ (quiescent) |
POS_2 | GTAGAG | Cell Line | MCF-7 | Experimental | Vybrant DiD+ (quiescent) |
POS_3 | ATTCCT | Cell Line | MCF-7 | Experimental | Vybrant DiD+ (quiescent) |
POS_4 | CAGGCG | Cell Line | MCF-7 | Experimental | Vybrant DiD+ (quiescent) |
The first stage in processing the FASTQ files generated during sequencing is a quality control check. A set of standard quality metrics for the raw data were generated as pipeline outputs. The most important metrics are summarised below.
Sample ID | Barcode Sequence | Total Reads | Mean Read Length | Total Low Quality Reads | Per Base Sequence Check | Per Sequence Check | Duplicates (%) | GC Content (%) |
---|---|---|---|---|---|---|---|---|
NEG_1 | CATGGC | 45951026 | 101 | 0 | pass | pass | 60.590 | 50 |
NEG_2 | CATTTT | 53673001 | 101 | 0 | pass | pass | 60.489 | 49 |
NEG_3 | CCAACA | 48940458 | 101 | 0 | pass | pass | 59.771 | 49 |
NEG_4 | TAATCG | 50770052 | 101 | 0 | pass | pass | 59.755 | 49 |
POS_1 | CTTGTA | 47420041 | 101 | 0 | pass | pass | 59.638 | 49 |
POS_2 | GTAGAG | 47327421 | 101 | 0 | pass | pass | 60.375 | 49 |
POS_3 | ATTCCT | 50320353 | 101 | 0 | pass | pass | 61.093 | 49 |
POS_4 | CAGGCG | 45211787 | 101 | 0 | pass | pass | 59.391 | 49 |
Sequence reads were trimmed to remove adapter sequences and nucleotides with poor quality using Cutadapt which automates low-quality read and quality and adapter trimming. Cutadapt retains bases called with Phred quality scores ≥ 20 and uses the first 13 bp of Illumina standard adapters (‘AGATCGGAAGAGC’) by default which are suitable for both ends of paired-end libraries.
Sample ID | Barcode Sequence | Total BP Trimmed (%) | Adapter Content Check | Total Reads | Mean Read Length | Duplicates (%) | GC Content (%) |
---|---|---|---|---|---|---|---|
NEG_1 | CATGGC | 0.575 | pass | 45946306 | 99.721 | 60.580 | 50 |
NEG_2 | CATTTT | 0.574 | pass | 53667486 | 99.724 | 60.479 | 49 |
NEG_3 | CCAACA | 0.575 | pass | 48935413 | 99.723 | 59.767 | 49 |
NEG_4 | TAATCG | 0.573 | pass | 50765009 | 99.725 | 59.745 | 49 |
POS_1 | CTTGTA | 0.574 | pass | 47415372 | 99.723 | 59.631 | 49 |
POS_2 | GTAGAG | 0.594 | pass | 47322217 | 99.696 | 60.375 | 49 |
POS_3 | ATTCCT | 0.574 | pass | 50315184 | 99.723 | 61.089 | 50 |
POS_4 | CAGGCG | 0.573 | pass | 45207539 | 99.724 | 59.384 | 50 |
The trimmed reads were mapped to the Homo sapiens GRCh38 reference genome available through the genomic analysis toolkit (GATK) using the STAR aligner. The STAR aligner is a splice-aware aligner that detects splice junctions and incorporates them to help align the entire read sequences. Below are the key statistics for mapping of trimmed reads to the reference genome.
Sample ID | Barcode Sequence | RSEM Alignable Reads | Uniquely Mapped Reads | Aligned to Multiple Loci | Unalignable Reads | Error Rate | 5’-3’ Bias |
---|---|---|---|---|---|---|---|
NEG_1 | CATGGC | 40138589 | 38876003 | 1262586 | 379939 | 0.003 | 1.16 |
NEG_2 | CATTTT | 44532086 | 43119716 | 1412370 | 436471 | 0.003 | 1.16 |
NEG_3 | CCAACA | 40721028 | 39443892 | 1277136 | 403007 | 0.003 | 1.16 |
NEG_4 | TAATCG | 41911578 | 40101042 | 1810536 | 397038 | 0.003 | 1.18 |
POS_1 | CTTGTA | 40174585 | 38908831 | 1265754 | 388634 | 0.003 | 1.11 |
POS_2 | GTAGAG | 40471461 | 39193465 | 1277996 | 385575 | 0.003 | 1.14 |
POS_3 | ATTCCT | 42823253 | 41477206 | 1346047 | 413662 | 0.003 | 1.15 |
POS_4 | CAGGCG | 38454399 | 37247091 | 1207308 | 356789 | 0.003 | 1.16 |
Read counts were quantified using RSEM. Transcript counts files were combined into a matrix of non-normalised transcript counts for each sample that was used for downstream analyses.
NEG_1 | NEG_2 | NEG_3 | NEG_4 | POS_1 | POS_2 | POS_3 | POS_4 | |
---|---|---|---|---|---|---|---|---|
A1BG | 131.72 | 141.27 | 135.91 | 148.63 | 167.90 | 169.50 | 174.53 | 152.10 |
A1BG-AS1 | 134.03 | 161.78 | 146.61 | 171.47 | 145.51 | 135.91 | 145.06 | 110.88 |
A1CF | 0.00 | 5.00 | 1.00 | 5.00 | 1.00 | 1.00 | 2.00 | 3.00 |
A2M | 5.13 | 6.11 | 5.00 | 2.00 | 2.00 | 5.03 | 12.18 | 9.05 |
A2M-AS1 | 20.87 | 21.89 | 25.00 | 30.00 | 17.00 | 22.97 | 39.82 | 24.95 |
A2ML1 | 20.00 | 21.00 | 27.00 | 33.00 | 30.00 | 27.00 | 35.00 | 29.00 |
A2MP1 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 1.00 | 0.00 | 0.00 |
A3GALT2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
A4GALT | 895.00 | 1007.00 | 860.00 | 950.00 | 797.00 | 781.00 | 943.00 | 788.00 |
A4GNT | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
For RNA-seq count data, variance increases with the mean. Transforming the count data to the log2 scale and applying a small pseudo count acts to minimize large differences in sequencing depth and helps normalise all samples to a similar dynamic range, thereby accounting for large variations seen between the highest expressing genes so that these do not dominate the dimensionality reduction plots.
Due to the strong noise among low count values owing to the Poisson distribution, a general log2 transformation would act to amplify this noise, meaning that low count genes would instead dominate the dimensionality reduction plots. Data is therefore transformed with a regularized log2 transformation that gives similar results for high counts as a log2 transformation but also shrinks the values of low counts towards the genes’ average across samples.
In practice, data transformation is achieved using the variance stabilizing transformation (VST) function in DESeq2. Transformations are calculated from the fitted dispersion-mean relation(s) and then the count data is transformed (normalised by division by the size factors or normalisation factors), yielding a matrix of values which are now have approximately constant variance along the range of mean values. This transformation also normalises with respect to library size. The log2 transformed counts matrix is then used in dimensionality reduction analyses.
Principal Component Analysis (PCA) is a multivariate dimensionality reduction technique that allows us to summarise the systematic patterns of variations in the data. PCA takes the expression levels for genes and transforms it in principal component space, reducing each sample into one point. This facilitates separation of samples by expression variation and enables identify potential sample outliers. The PCA plot is a way to look at how samples are clustering.
Inter-sample correlation analysis is another way to look at how well samples cluster. Agglomerative hierarchical clustering based on Euclidean distance computed from the normalised log2-transformed count values matrix allows us to see overall similarities and differences between the gene expression profiles of the samples.
Comparison of gene expression between the experimental sample groups was performed using DESeq2. Median ratio normalisation was first applied to account for differences in library size before estimation of the per gene dispersion and differential expression testing based on the negative binomial distribution; the Wald test was used to generate p-values and log2 fold-changes.
During differential expression testing, genes with an absolute fold-change > 1.5 and an adjusted p-value < 0.05 were called as differentially expressed. The results of differential expression testing are summarised in the volcano plot shown below.
Genes called as differentially expressed genes between experimental conditions (groups) are shown in the below table, ordered by padj by default. The direction column indicates the direction of the log2 fold-change (lfc) in gene expression between groups; down-regulated genes are denoted as “-1” and up-regulated genes are denoted as “1”.
A heatmap depicting the 50 most significantly differentially expressed genes between groups is shown below. Agglomerative hierarchical clustering based on Euclidean distance computed with an average-linkage matrix was performed using normalised log2-transformed count values.
Bespoke downstream analyses are provided on a case-by-case basis upon request and consultation with an investigator. These might include anything from rudimentary pathway enrichment or gene set enrichment analysis (GSEA) and visualisation to protein-protein interaction network construction, validation and interrogation, or integration with large-scale public experimental and clinical data sets, including survival analyses.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os macOS Big Sur ... 10.16
## system x86_64, darwin17.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/London
## date 2022-12-01
## pandoc 2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## annotate 1.74.0 2022-04-26 [1] Bioconductor
## AnnotationDbi * 1.58.0 2022-04-26 [1] Bioconductor
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
## Biobase * 2.56.0 2022-04-26 [1] Bioconductor
## BiocGenerics * 0.42.0 2022-04-26 [1] Bioconductor
## BiocParallel 1.30.4 2022-10-13 [1] Bioconductor
## Biostrings 2.64.1 2022-08-18 [1] Bioconductor
## bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.0)
## bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
## bitops 1.0-7 2021-04-24 [1] CRAN (R 4.2.0)
## blob 1.2.3 2022-04-10 [1] CRAN (R 4.2.0)
## bslib 0.4.1 2022-11-02 [1] CRAN (R 4.2.0)
## cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
## cli 3.4.1 2022-09-23 [1] CRAN (R 4.2.0)
## codetools 0.2-18 2020-11-04 [1] CRAN (R 4.2.1)
## colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
## crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
## crosstalk 1.2.0 2021-11-04 [1] CRAN (R 4.2.0)
## DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
## DelayedArray 0.22.0 2022-04-26 [1] Bioconductor
## dendextend * 1.16.0 2022-07-04 [1] CRAN (R 4.2.0)
## dendsort * 0.3.4 2021-04-20 [1] CRAN (R 4.2.0)
## DESeq2 * 1.36.0 2022-04-26 [1] Bioconductor
## digest 0.6.30 2022-10-18 [1] CRAN (R 4.2.0)
## dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
## evaluate 0.18 2022-11-07 [1] CRAN (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
## farver 2.1.1 2022-07-06 [1] CRAN (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
## genefilter 1.78.0 2022-04-26 [1] Bioconductor
## geneplotter 1.74.0 2022-04-26 [1] Bioconductor
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
## GenomeInfoDb * 1.32.4 2022-09-06 [1] Bioconductor
## GenomeInfoDbData 1.2.8 2022-10-10 [1] Bioconductor
## GenomicRanges * 1.48.0 2022-04-26 [1] Bioconductor
## ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)
## ggrepel * 0.9.2 2022-11-06 [1] CRAN (R 4.2.0)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0)
## gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.2.0)
## highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
## hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
## htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
## htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.0)
## httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
## IRanges * 2.30.1 2022-08-18 [1] Bioconductor
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
## jsonlite 1.8.3 2022-10-21 [1] CRAN (R 4.2.0)
## KEGGREST 1.36.3 2022-07-14 [1] Bioconductor
## knitr 1.41 2022-11-18 [1] CRAN (R 4.2.0)
## lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
## lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.1)
## locfit 1.5-9.6 2022-07-11 [1] CRAN (R 4.2.0)
## magrittr * 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
## Matrix 1.5-3 2022-11-11 [1] CRAN (R 4.2.0)
## MatrixGenerics * 1.8.1 2022-06-30 [1] Bioconductor
## matrixStats * 0.63.0 2022-11-18 [1] CRAN (R 4.2.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
## org.Hs.eg.db * 3.15.0 2022-10-31 [1] Bioconductor
## pheatmap * 1.0.12 2019-01-04 [1] CRAN (R 4.2.0)
## pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
## png 0.1-7 2013-12-03 [1] CRAN (R 4.2.0)
## purrr 0.3.5 2022-10-06 [1] CRAN (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
## RColorBrewer * 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
## Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
## RCurl 1.98-1.9 2022-10-03 [1] CRAN (R 4.2.0)
## reactable 0.3.0 2022-05-26 [1] CRAN (R 4.2.0)
## reactR 0.4.4 2021-02-22 [1] CRAN (R 4.2.0)
## readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0)
## rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
## rmarkdown 2.18 2022-11-09 [1] CRAN (R 4.2.0)
## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.0)
## RSQLite 2.2.19 2022-11-24 [1] CRAN (R 4.2.1)
## rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
## S4Vectors * 0.34.0 2022-04-26 [1] Bioconductor
## sass 0.4.4 2022-11-24 [1] CRAN (R 4.2.1)
## scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
## stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
## stringr * 1.4.1 2022-08-20 [1] CRAN (R 4.2.0)
## SummarizedExperiment * 1.26.1 2022-05-01 [1] Bioconductor
## survival 3.4-0 2022-08-09 [1] CRAN (R 4.2.0)
## tibble 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
## TidyMultiqc * 1.0.3 2022-09-25 [1] CRAN (R 4.2.0)
## tidyr * 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
## tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
## vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.0)
## viridis 0.6.2 2021-10-13 [1] CRAN (R 4.2.0)
## viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
## vroom 1.6.0 2022-09-30 [1] CRAN (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
## xfun 0.35 2022-11-16 [1] CRAN (R 4.2.0)
## XML 3.99-0.12 2022-10-28 [1] CRAN (R 4.2.0)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
## XVector 0.36.0 2022-04-26 [1] Bioconductor
## yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
## zlibbioc 1.42.0 2022-04-26 [1] Bioconductor
##
## [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────