Single-cell RNA-seq essentials (legacy)
Note: this is a LEGACY tutorial, meaning that it was written several years ago and may use outdated software.
Introduction
This document is the first in a series of tutorials covering the essentials of single-cell transcriptomics analysis. The following figure from Luecken & Theis illustrates many of the analysis techniques that are commonly employed in my lab and by many others.
The following topics are covered in this tutorial:
- Loading the UMI counts matrix into R/Seurat
- Quality Control
- Normalization
- Feature selection
- Dimensionality Reduction
- Clustering
- Visualization
Seurat
The following section covers the absolute basics of single-cell analysis using the R package Seurat. Most of this tutorial is inspired by Seurat’s clustering tutorial; however, we will be using a dataset from the human brain, since that is much more relevant to the lab’s research.
Loading data into Seurat
In many cases, we work with single-cell data generated from the 10X Genomics platform. Single-cell analysis packages such as Seurat and Scanpy make it easy to load UMI counts matrices from the output of cellranger. The following commands create a Seurat object from the output of cellranger:
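In Seurat this looks something like the following (the cellranger output path here is a placeholder; point it at your own filtered matrix directory):

```r
library(Seurat)

# Placeholder path; use your own cellranger outs/filtered_feature_bc_matrix
counts <- Read10X(data.dir = "cellranger_out/outs/filtered_feature_bc_matrix/")

# Create the Seurat object, keeping genes detected in at least 3 cells
# and cells with at least 200 detected genes
seurat_obj <- CreateSeuratObject(
  counts = counts,
  project = "tutorial",
  min.cells = 3,
  min.features = 200
)
```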
For this tutorial I am using a published dataset from a study of TREM2 in AD using single-cell transcriptomics in human and mouse samples (link to study). In this case a UMI counts matrix is stored separately for each sample in .mtx format, so we have to load each of these files before merging them into one big Seurat object. This is a more general use case than using the Read10X function.
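A sketch of this step, assuming each sample sits in its own directory containing matrix.mtx, features.tsv, and barcodes.tsv files (the directory layout and the SampleID column are hypothetical; ReadMtx requires Seurat v4, while older versions can use Matrix::readMM directly):

```r
library(Seurat)

# Hypothetical layout: one sub-directory per sample under data/
sample_dirs <- list.dirs("data", recursive = FALSE)

seurat_list <- lapply(sample_dirs, function(dir) {
  counts <- ReadMtx(
    mtx      = file.path(dir, "matrix.mtx"),
    features = file.path(dir, "features.tsv"),
    cells    = file.path(dir, "barcodes.tsv")
  )
  obj <- CreateSeuratObject(counts = counts, project = basename(dir))
  obj$SampleID <- basename(dir)  # record which sample each cell came from
  obj
})

# Merge into one big Seurat object, prefixing barcodes with the sample
# name so that cell barcodes stay unique
seurat_obj <- merge(
  seurat_list[[1]],
  y = seurat_list[-1],
  add.cell.ids = basename(sample_dirs)
)
```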
Let’s get a sense for the size of our dataset by plotting the number of cells in each sample.
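For example, with ggplot2 (using the SampleID column defined during loading):

```r
library(ggplot2)

# Tabulate the number of cells per sample
cell_counts <- as.data.frame(table(seurat_obj$SampleID))
colnames(cell_counts) <- c("Sample", "nCells")

ggplot(cell_counts, aes(x = reorder(Sample, -nCells), y = nCells)) +
  geom_col() +
  labs(x = "Sample", y = "Number of cells") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```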
We can see that the number of cells varies quite a bit between samples: the largest number of cells in one sample is 9,718 and the smallest is 1,075. Based on how the experiment was set up, we reasonably expect a maximum of 10,000 cells for a single sample and a minimum of a few thousand, so overall this is not out of the ordinary. Next we will remove some outliers that do not pass our quality control criteria.
Quality Control
In this section we will remove low quality cells based on several quality control criteria, such as the percentage of reads in mitochondrial genes, the number of genes detected per cell, and the number of UMIs detected per cell. First we compute the percentage of mitochondrial reads for each cell, and then we plot the distribution of these metrics in each sample.
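Something like the following (the ^MT- pattern matches human mitochondrial gene names; mouse data would use ^mt-):

```r
# Percentage of counts in mitochondrial genes, per cell
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(seurat_obj, pattern = "^MT-")

# Distributions of UMIs, detected genes, and mitochondrial percentage per sample
VlnPlot(
  seurat_obj,
  features = c("nCount_RNA", "nFeature_RNA", "percent.mt"),
  group.by = "SampleID",
  pt.size = 0
)
```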
These distributions provide insight into the overall quality of these samples. For example, we observe that sample AD8 has much lower UMI and gene detection than the other samples, as well as a much higher percentage of mitochondrial reads. The relatively low quality of sample AD8 may indicate that something went wrong with the single-cell RNA-seq experiment itself, and that it should be removed from the analysis entirely. In this published dataset, the authors included all of the cells before quality control, so we have to apply the QC filters ourselves.
Here we apply QC filtering so that the downstream analysis contains only very high-quality cells. I want to emphasize how important this step is, as outliers can heavily influence all of the downstream analysis and interpretation. Later on, if something doesn’t look right, this is always a good step to revisit.
Additionally, this step serves as a great example of how data analysis is often both an art and a science. There isn’t a one-size-fits-all solution to properly filtering every single-cell dataset, meaning that some level of personal judgement is required.
Here I will apply the following filters:
- Remove cells with more than 15% mitochondrial reads.
- Remove cells with more than 30,000 UMIs or fewer than 250 UMIs.
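Translating those criteria into Seurat’s subset function:

```r
# Keep cells with <= 15% mitochondrial reads and 250-30,000 UMIs
seurat_obj <- subset(
  seurat_obj,
  subset = percent.mt <= 15 & nCount_RNA >= 250 & nCount_RNA <= 30000
)
```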
Using these filters we removed 20,125 low-quality cells, retaining a total of 94,847 cells for downstream analysis. The paper actually applied a more stringent QC cutoff, retaining a total of 66,311 nuclei.
Normalization
In this section we apply a scale factor and a logarithmic transformation to the UMI counts matrix. We then scale the data so that each feature has a mean expression of 0 and a variance of 1 (this is standard for many machine learning pre-processing pipelines).
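A sketch using Seurat’s standard functions (note that by default ScaleData only scales the variable features, so here we explicitly pass all genes):

```r
# Divide counts by per-cell totals, multiply by a scale factor, log-transform
seurat_obj <- NormalizeData(
  seurat_obj,
  normalization.method = "LogNormalize",
  scale.factor = 10000
)

# Center and scale every gene to mean 0 and variance 1
seurat_obj <- ScaleData(seurat_obj, features = rownames(seurat_obj))
```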
Feature Selection
In this section we identify a set of genes that are highly variable across the dataset. These genes tend to be more cell-type and cell-population specific, so we expect them to have high expression levels in some cells and low in others. We will use these genes for downstream analysis such as dimensionality reduction and clustering. Much like the previous QC step, the appropriate number of features to select varies from dataset to dataset.
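For example, selecting 2,000 variable genes (a common default; the exact number here is my choice, not necessarily what the original study used):

```r
seurat_obj <- FindVariableFeatures(
  seurat_obj,
  selection.method = "vst",
  nfeatures = 2000
)

# Inspect the top 10 most variable genes
head(VariableFeatures(seurat_obj), 10)
```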
Linear Dimensionality Reduction
In this section we perform Principal Components Analysis (PCA) to construct a linear dimensionality reduction of the dataset. Specifically, we perform PCA on the scaled dataset using only the genes that we have identified as highly variable in the previous section.
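In Seurat:

```r
# PCA on the scaled data, restricted to the highly variable genes
seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(seurat_obj))
```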
Now that we have constructed a PCA matrix, we can inspect the genes that contribute the most to the variance of the PCs.
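Seurat offers a couple of ways to inspect the loadings:

```r
# Print the top genes loading onto the first 5 PCs
print(seurat_obj[["pca"]], dims = 1:5, nfeatures = 10)

# Visualize the loadings for PCs 1 and 2
VizDimLoadings(seurat_obj, dims = 1:2, reduction = "pca")
```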
We can also visualize the data as a scatter plot showing each cell in PCA space.
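For example:

```r
# Scatter plot of cells in the space of the first two PCs
DimPlot(seurat_obj, reduction = "pca")
```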
This scatter plot shows all ~90k cells in PCA space. While there are some patterns that we can observe here, the cells do not form easily distinguishable clusters. Applying a non-linear dimensionality reduction on top of this PCA transformation should yield better results for visualization and clustering purposes.
For downstream analysis, we do not necessarily need the entire PCA matrix. We can use a number of methods to identify the PCs that contain most of the complexity of the dataset, and discard the remaining PCs. First we can simply visualize heatmaps of the PCA matrix.
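For example:

```r
# Expression heatmaps for the top genes of the first 9 PCs,
# drawn using 500 cells sampled from the extremes of each PC
DimHeatmap(seurat_obj, dims = 1:9, cells = 500, balanced = TRUE)
```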
Now we inspect the standard deviation of each PC in an ‘elbow plot’.
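In Seurat:

```r
# Standard deviation explained by each of the first 50 PCs
ElbowPlot(seurat_obj, ndims = 50)
```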
Based on these plots we can decide how many PCs to retain for downstream analysis. Once again, there is no one-size-fits-all solution for determining the number of PCs for a particular single-cell dataset, so you may have to try different numbers of PCs in the downstream analysis to figure out what is appropriate. Based on the elbow plot, I am going to use a cutoff of 30 PCs for this dataset.
Clustering and non-linear dimensionality reduction
Here we cluster the cells using a graph-based approach within PCA space, and then apply non-linear dimensionality reductions for visualization purposes. An important parameter for clustering is resolution, where a higher value yields a larger number of clusters. For the purpose of this tutorial I am using resolution=0.5.
Following clustering, we apply UMAP and t-SNE for non-linear dimensionality reduction. These algorithms are fairly complex, so you may be interested in reading up on them; however, it is not necessary to know how they work in detail to use them effectively. The overall goal of these approaches is to construct low-dimensional manifolds of high-dimensional datasets such that data entries (single cells in this case) that are similar end up closer together in manifold space. Of course, there are many more manifold learning approaches aside from UMAP and t-SNE, each with their own pros and cons. UMAP is a newer approach than t-SNE, and it generally does a better job at preserving the global structure of the data instead of just local structure. Both t-SNE and UMAP have a variety of hyperparameters that can be tweaked ad infinitum, but for this tutorial (and in most use cases) we can use the default parameters. I have a short blog post going over UMAP hyperparameters, and for the most part the defaults are suitable.
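Putting the clustering and embeddings together (using the 30 PCs chosen above):

```r
# Build a k-nearest-neighbor graph in PCA space, then cluster it
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)

# Non-linear embeddings for visualization, from the same 30 PCs
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
seurat_obj <- RunTSNE(seurat_obj, dims = 1:30)
```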
Now let’s look at our clusters using our UMAP and t-SNE embeddings.
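For example, side by side with patchwork:

```r
library(patchwork)

p1 <- DimPlot(seurat_obj, reduction = "umap", label = TRUE)
p2 <- DimPlot(seurat_obj, reduction = "tsne", label = TRUE)
p1 + p2
```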
By coloring these plots by cluster assignment, we can immediately see that both methods do a decent job of spatially separating cells by cluster in this low-dimensional space. In general we don’t expect a perfect separation of the clusters in this space. Qualitatively, the UMAP plot separates the clusters further apart from one another, while the t-SNE plot looks more like a bunch of blobs stuck together. In this particular t-SNE I can tell that there are some funky things going on; for example, part of cluster 2 sits inside of cluster 0.
Cell-type annotation
Up to this point, we have only needed to wear our data scientist hat to understand the different parts of the single-cell data analysis workflow. We have gone from a tangled mess of billions of nucleotide sequences to a much more organized and interpretable format consisting of a cells-by-genes expression matrix, a linear dimensionality reduction matrix, clusters, and a two-dimensional data manifold. So now it is our job to start interpreting the biology within this dataset.
We have to consider some of the known biology from published literature with regards to the samples that were sequenced for this analysis. Zhou et al. sequenced postmortem human brain samples from the dorsolateral prefrontal cortex (DLPFC). Based on what we know about the cellular composition of the DLPFC, we expect to primarily recover neurons and glia. Since we have transcriptomic data from individual cells, we can actually provide much more precise annotations than neuronal/glial. You may wish to consult a neurobiology textbook to get a better idea of what cell types you expect to find in the DLPFC and in other regions of the brain.
In this section we use a literature-curated list of known cell-type marker genes to assign a cell-type annotation to each cluster. A critical assumption of this method is that all cells within the same cluster are of the same cell type. This assumption mostly holds true at a high enough clustering resolution, but of course there are some exceptions that we have to live with. Here we visualize the expression of these marker genes directly on the UMAP embedding, and we also look at the distributions of these marker genes in each cluster.
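A sketch with a hypothetical marker list; the genes below are commonly used markers for these brain cell types, not necessarily the exact curated list used for the figures:

```r
markers <- c(
  "GFAP", "AQP4",     # astrocytes
  "SNAP25", "SYT1",   # pan-neuronal
  "SLC17A7",          # excitatory neurons
  "GAD1", "GAD2",     # inhibitory neurons
  "CSF1R", "CX3CR1",  # microglia
  "PLP1", "MBP",      # oligodendrocytes
  "PDGFRA", "CSPG4"   # oligodendrocyte progenitors
)

# Expression of each marker overlaid on the UMAP embedding
FeaturePlot(seurat_obj, features = markers, reduction = "umap", ncol = 4)
```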
Next we can plot the distributions of these genes in each cluster.
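One compact way to do this is a stacked violin plot (the stack argument requires Seurat >= 3.2; the original figures show one panel per cell-type marker set):

```r
# Stacked violin plot of the marker genes across clusters
VlnPlot(seurat_obj, features = markers, stack = TRUE, flip = TRUE) +
  NoLegend()
```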
(Per-cluster violin plots are shown for each marker panel: astrocyte, pan-neuronal, excitatory neuron, inhibitory neuron, microglia, oligodendrocyte, and oligodendrocyte progenitor markers.)
Finally, we can use all of this gene expression information to annotate each cluster with a cell type. Here we use a shorthand to name each cluster. For clusters that have high expression of more than one cell-type marker gene, we assign a combined label.
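A sketch of the annotation step with hypothetical labels; the actual assignments depend on the marker expression patterns observed above, and the vector must contain one label per cluster:

```r
# Hypothetical shorthand labels, one per cluster in cluster order;
# pad or trim this vector to match the number of clusters in your data
new_ids <- c("ODC1", "EX1", "ASC1", "INH1", "MG1", "OPC1", "EX2")
names(new_ids) <- levels(seurat_obj)

seurat_obj <- RenameIdents(seurat_obj, new_ids)
seurat_obj$celltype <- Idents(seurat_obj)

DimPlot(seurat_obj, reduction = "umap", label = TRUE)
```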
Identifying cluster biomarkers
In this section, we perform differential gene expression to find cluster marker genes. Marker genes are identified by iteratively comparing gene expression in each cluster to all other clusters. These comparisons can be done using a variety of statistical tests, such as logistic regression. Here we are using MAST, a hurdle model specifically tailored to the nuances of scRNA-seq data.
Seurat provides FindAllMarkers, a convenient function for iteratively performing these tests for each cluster. Alternatively, we could use the FindMarkers function to compare just two groups of cells. These functions have a lot of different options that affect the downstream results. Here I will use the default settings, which only look at genes that are up-regulated in the cluster of interest and show non-zero expression in at least 25% of cells in that cluster.
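For example (MAST must be installed separately, e.g. via BiocManager::install("MAST")):

```r
cluster_markers <- FindAllMarkers(
  seurat_obj,
  test.use = "MAST",
  only.pos = TRUE,   # up-regulated genes only
  min.pct  = 0.25    # expressed in at least 25% of cells in the cluster
)
head(cluster_markers)
```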
Now that we have performed these tests, we can create some plots that summarize the results. Below we plot the number of DEGs per cluster as a bar plot, and a heatmap of the top 3 DEGs per cluster.
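A sketch of both plots with dplyr and ggplot2 (the fold-change column is avg_log2FC in Seurat v4 and avg_logFC in older versions):

```r
library(dplyr)
library(ggplot2)

# Bar plot: number of DEGs per cluster
cluster_markers %>%
  count(cluster) %>%
  ggplot(aes(x = cluster, y = n)) +
  geom_col() +
  labs(x = "Cluster", y = "Number of DEGs")

# Heatmap of the top 3 DEGs per cluster, ranked by log fold-change
top3 <- cluster_markers %>%
  group_by(cluster) %>%
  slice_max(avg_log2FC, n = 3)

DoHeatmap(seurat_obj, features = top3$gene)
```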
Saving and loading Seurat objects
This concludes the very basics of exploratory data analysis using Seurat. Finally, we will save the processed object so we can use it again later.
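For example:

```r
# Save the processed object (path is arbitrary)
saveRDS(seurat_obj, file = "data/seurat_processed.rds")

# Later, load it back with:
seurat_obj <- readRDS("data/seurat_processed.rds")
```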