Introduction

In this tutorial, I will cover how to use the Python package scVelo to perform RNA velocity analysis in single-cell RNA-seq data (scRNA-seq). scVelo was published in 2020 in Nature Biotechnology, making several improvements from the original RNA velocity study and its accomanpying software velocyto.

Briefly, RNA velocity analysis allows us to infer transcriptional dynamics that are not directly observed in a scRNA-seq experiment using a mathematical model of transcriptional kinetics. We can use RNA velocity to determine if a gene of interest is being induced or repressed in a give cell population of interest. Moreover, we can extrapolate this information to predict cell fate decision via pseudotime trajectories.

The majority of this tutorial is taken from the scVelo documentation.

Step -1: Convert data from Seurat to Python / anndata

For this tutorial, I am starting with a mouse brain dataset that contains cells from disease and control samples. I have already performed the primary data processing (filtering, normalization, clustering, batch alignment, dimensionality reduction) using Seurat in R. First I will go over the code that I used to convert my Seurat object into a format that is usable in scVelo (anndata format). It is possible to use the SeuratDisk R package to convert between Seurat and anndata formats, but this software is in early development stages and I have had mixed results when using it.

# assuming that you have some Seurat object called seurat_obj:

# save metadata table:
seurat_obj$barcode <- colnames(seurat_obj)
seurat_obj$UMAP_1 <- seurat_obj@reductions$umap@cell.embeddings[,1]
seurat_obj$UMAP_2 <- seurat_obj@reductions$umap@cell.embeddings[,2]
write.csv(seurat_obj@meta.data, file='metadata.csv', quote=F, row.names=F)

# write expression counts matrix
library(Matrix)
counts_matrix <- GetAssayData(seurat_obj, assay='RNA', slot='counts')
writeMM(counts_matrix, file=paste0(out_data_dir, 'counts.mtx'))

# write dimesnionality reduction matrix, in this example case pca matrix
write.csv(seurat_obj@reductions$pca@cell.embeddings, file='pca.csv'), quote=F, row.names=F)

# write gene names
write.table(
  data.frame('gene'=rownames(counts_matrix)),file='gene_names.csv',
  quote=F,row.names=F,col.names=F
)

import scanpy as sc
import anndata
from scipy import io
from scipy.sparse import coo_matrix, csr_matrix
import numpy as np
import os
import pandas as pd

# load sparse matrix:
X = io.mmread("counts.mtx")

# create anndata object
adata = anndata.AnnData(
    X=X.transpose().tocsr()
)

# load cell metadata:
cell_meta = pd.read_csv("metadata.csv")

# load gene names:
with open("gene_names.csv", 'r') as f:
    gene_names = f.read().splitlines()

# set anndata observations and index obs by barcodes, var by gene names
adata.obs = cell_meta
adata.obs.index = adata.obs['barcode']
adata.var.index = gene_names

# load dimensional reduction:
pca = pd.read_csv("pca.csv")
pca.index = adata.obs.index

# set pca and umap
adata.obsm['X_pca'] = pca.to_numpy()
adata.obsm['X_umap'] = np.vstack((adata.obs['UMAP_1'].to_numpy(), adata.obs['UMAP_2'].to_numpy())).T

# plot a UMAP colored by sampleID to test:
sc.pl.umap(adata, color=['SampleID'], frameon=False, save=True)

# save dataset as anndata format
adata.write('my_data.h5ad')

# reload dataset
adata = sc.read_h5ad('my_data.h5ad')

Step 0: Constructing spliced and unspliced counts matrices

Rather than using the same UMI-based genes-by-counts matrix that we used in Seurat, we need to have a matrix for spliced and unspliced transcripts. We can construct this matrix using the velocyto command line tool, or using Kallisto-Bustools. Here I am using the velocyto command line tool, simply because I had a working script from before Kallisto supported RNA velocity, so I have personally never bothered to try Kallisto.

The velocyto command line tool has a function that works directly from the cellranger output directory, but it also can be used on any single-cell platform as long as you provide a .bam file. We also have to supply a reference .gtf annotation for your species (mm10 used here), and optionally you can provide a .gtf to mask repeat regions (recommended by velocyto).

repeats="/path/to/repeats/mm10_rmsk.gtf"
transcriptome="/path/to/annoation/file/gencode.vM25.annotation.gtf"
cellranger_output="/path/to/cellranger/output/"

velocyto run10x -m $repeats \
                $cellranger_output \
                $transcriptome

Step 1: Load data

Now that we have our input data properly formatted, we can load it into python. Velocyto created a separate spliced and unspliced matrix for each sample, so we first have to merge the different samples into one object. Additionally, I am reformatting the cell barcodes to match my anndata object with the full genes-by-cells data.

import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np
import pandas as pd
import anndata as ad

scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo', facecolor='white', dpi=100, frameon=False)
cr.settings.verbosity = 2

adata = sc.read_h5ad('my_data.h5ad')

# load loom files for spliced/unspliced matrices for each sample:
ldata1 = scv.read('Sample1.loom', cache=True)
ldata2 = scv.read('Sample2.loom', cache=True)
ldata3 = scv.read('Sample3.loom', cache=True)

Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.

# rename barcodes in order to merge:
barcodes = [bc.split(':')[1] for bc in ldata1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_10' for bc in barcodes]
ldata1.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_11' for bc in barcodes]
ldata2.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata3.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_9' for bc in barcodes]
ldata3.obs.index = barcodes

# make variable names unique
ldata1.var_names_make_unique()
ldata2.var_names_make_unique()
ldata3.var_names_make_unique()

# concatenate the three loom
ldata = ldata1.concatenate([ldata2, ldata3])

# merge matrices into the original adata object
adata = scv.utils.merge(adata, ldata)

# plot umap to check
sc.pl.umap(adata, color='celltype', frameon=False, legend_loc='on data', title='', save='_celltypes.pdf')

Part 2: Computing RNA velocity using scVelo

Finally we can actually use scVelo to compute RNA velocity. scVelo allows the user to use the steady-state model from the original 2018 publication as well as their updated dynamical model from the 2020 publication.

First we inspect the in each of our cell clusters. Next we perform some pre-processing, and then we compute RNA velocity using the steady-state model (stochastic option).

scv.pl.proportions(adata, groupby='celltype_full')

# pre-process
scv.pp.filter_and_normalize(adata)
scv.pp.moments(adata)

WARNING: Did not normalize X as it looks processed already. To enforce normalization, set `enforce=True`.
Normalized count data: spliced, unspliced.
WARNING: Did not modify X as it looks preprocessed already.
computing neighbors
    finished (0:00:14) --> added
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:17) --> added
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)

# compute velocity
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)

computing velocities
    finished (0:01:15) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph
    finished (0:15:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

Part 2.1: Visualize velocity fields

We can get a broad visualiztion of RNA velocity across all genes and all cells by visualizing a vector field on top of the 2D dimensional reduction.

scv.pl.velocity_embedding(adata, basis='umap', frameon=False, save='embedding.pdf')

computing velocity embedding
    finished (0:00:06) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
saving figure to file ./figures/scvelo_embedding.pdf

scv.pl.velocity_embedding_grid(adata, basis='umap', color='celltype', save='embedding_grid.pdf', title='', scale=0.25)

saving figure to file ./figures/scvelo_embedding_grid.pdf

scv.pl.velocity_embedding_stream(adata, basis='umap', color=['celltype', 'condition'], save='embedding_stream.pdf', title='')

We can also visualize the dynamics of our favorite genes. Here we show the ratio of unspliced to spliced transcripts for Ptgds, as well as the velocity and expression values as feature plots.

# plot velocity of a selected gene
scv.pl.velocity(adata, var_names=['Ptgds'], color='celltype')

Part 3: Downstream analysis

Here we show how to identify highly dynamic genes, compute a measure of coherence among neighboring cells in terms of velocity, and perform pseudotime inference. Using the pseudotime trajectory, we can identify predicted ancestors of individual cells, and we can orient the directionality of partition-based graph abstractions (PAGA).

scv.tl.rank_velocity_genes(adata, groupby='celltype', min_corr=.3)

df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()

ranking velocity genes
    finished (0:00:23) --> added
    'rank_velocity_genes', sorted scores by group ids (adata.uns)
    'spearmans_score', spearmans correlation scores (adata.var)

	AF1	AF2	AF3	AF4	AF5	AF6	CP	EC1	EC2	MM1	MM2	MM3	OA	PE	PF1	PF2	SMC	UF	UI
0	Adam12	Zfpm2	Zfpm2	Ptprd	Pcdh7	Nav1	Sgip1	Tmtc2	St6galnac3	Elmo1	Rnf150	Tmcc3	Eya4	Cobll1	Nr3c2	Slc1a3	Dmd	Cadm2	Adam10
1	Ghr	Efemp2	Mmp14	Mtch1	Rtl4	Zfpm2	Dlc1	Ptprg	Plcb1	Mir142hg	Ophn1	Gm5086	Soga3	Atp13a5	Fbxl7	Frmpd4	Ctnna3	Dync1i1	Trim37
2	Dpyd	Tmem208	Efna5	Cped1	Adamtsl1	Arl6ip4	Grm7	Mecom	Cyyr1	Rbm47	Frmd4b	Srgap2	Lrrc4c	Lnx2	Slc47a2	4930594M22Rik	Cacnb2	Nrp2	Esam
3	Tcf4	Rpl29	Lrrc7	Ccbe1	Slc24a3	Rpl29	Rgs6	Dnm3	Ptprm	Gramd1b	Plekhg5	Slc7a8	Frmd4a	Apbb2	Tbx15	Acss3	Slc38a11	Ust	Sgms1
4	Rbms3	Mtch1	Dclk1	Cers6	Slit3	Efna5	Trpc3	Tjp1	Lrch1	Lyn	Runx1	Qk	Ptprt	Egflam	Mmp16	Kcnk2	Sox6	Sorbs1	Emcn

kwargs = dict(frameon=False, size=10, linewidth=1.5,
              add_outline='AF6, AF1')

scv.pl.scatter(adata, df['AF6'][:3], ylabel='AF6', frameon=False, color='celltype', size=10, linewidth=1.5)
scv.pl.scatter(adata, df['AF1'][:3], ylabel='AF1', frameon=False, color='celltype', size=10, linewidth=1.5)

Speed: length of the velocity vector
Coherence: how well a velocity vector correlates to its neighbors

scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95])

--> added 'velocity_length' (adata.obs)
--> added 'velocity_confidence' (adata.obs)

scv.pl.velocity_graph(adata, threshold=.1, color='celltype')

x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=70)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False)
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax)

scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')

# this is needed due to a current bug - bugfix is coming soon.
adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']

scv.tl.paga(adata, groups='celltype')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
#df.style.background_gradient(cmap='Blues').format('{:.2g}')

running PAGA using priors: ['velocity_pseudotime']
    finished (0:00:04) --> added
    'paga/connectivities', connectivities adjacency (adata.uns)
    'paga/connectivities_tree', connectivities subtree (adata.uns)
    'paga/transitions_confidence', velocity transitions (adata.uns)

scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
            min_edge_width=2, node_size_scale=1.5)

Part 4: Analyzing a specific cell population

In a scRNA-seq dataset comprised of multiple cell lineages, it may be more relevant to perfom RNA velocity analysis separately for each major cell population Here we are going to perform RNA velocity analysis using only our fibroblast clusters. Additionally, we are going to use the dynamical model (from 2020 paper) whereas earlier we used the steady-state model (2018 paper).

cur_celltypes = ['AF1', 'AF2', 'AF3', 'AF4', 'AF5', 'AF6', 'PF1', 'PF2']
adata_subset = adata[adata.obs['celltype'].isin(cur_celltypes)]

sc.pl.umap(adata_subset, color=['celltype', 'condition'], frameon=False, title=['', ''])

sc.pp.neighbors(adata_subset, n_neighbors=15, use_rep='X_pca')

# pre-process
scv.pp.filter_and_normalize(adata_subset)
scv.pp.moments(adata_subset)

WARNING: Did not normalize X as it looks processed already. To enforce normalization, set `enforce=True`.
WARNING: Did not normalize spliced as it looks processed already. To enforce normalization, set `enforce=True`.
WARNING: Did not normalize unspliced as it looks processed already. To enforce normalization, set `enforce=True`.
WARNING: Did not modify X as it looks preprocessed already.
computing neighbors
    finished (0:00:03) --> added
    'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
    finished (0:00:11) --> added
    'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
computing velocities
    finished (0:00:31) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph
    finished (0:04:00) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

In this step, transcriptional dynamics are computed. This can take a lot longer than other steps, and may require the use of a high performance compute cluster such as HPC3 at UCI. This step took about 90 minutes to run on my laptop for ~15k cells.

scv.tl.recover_dynamics(adata_subset)

recovering dynamics
    finished (0:55:01) --> added
    'fit_pars', fitted parameters for splicing dynamics (adata.var)

scv.tl.velocity(adata_subset, mode='dynamical')
scv.tl.velocity_graph(adata_subset)

computing velocities
    finished (0:01:54) --> added
    'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph
    finished (0:01:37) --> added
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)

scv.pl.velocity_embedding_stream(adata_subset, basis='umap', color=['celltype', 'condition'], save='embedding_stream.pdf', title='')

computing velocity embedding
    finished (0:00:08) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)
figure cannot be saved as pdf, using png instead.
saving figure to file ./figures/scvelo_embedding_stream.pdf.png

Using the dynamical model, we can actually look at the transcritpion rate, splicing rate, and degradation rate.

df = adata_subset.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]

kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
    pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs)
    pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
    pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)

scv.get_df(adata_subset, 'fit*', dropna=True).head()

	fit_alpha	fit_beta	fit_gamma	fit_t_	fit_scaling	fit_std_u	fit_std_s	fit_likelihood	fit_pval_steady	fit_steady_u	fit_steady_s	fit_variance	fit_alignment_scaling	fit_r2
Sox17	0.898695	6.110614	0.484805	10.475681	0.105126	0.037980	0.421499	0.000018	0.492005	0.104282	1.640273	1.071412	1.816156	0.661397
Pcmtd1	0.211919	0.720424	0.264111	12.202008	0.248776	0.086398	0.164096	0.232746	0.336931	0.270257	0.446798	0.749243	2.372727	0.108137
Sgk3	0.031516	0.226350	0.197472	10.975034	1.652986	0.050927	0.038999	0.177014	0.159321	0.164692	0.126664	1.650863	3.383189	0.087770
Prex2	0.209454	0.229509	0.231223	17.965210	1.644350	0.334659	0.288491	0.303137	0.430445	0.757209	0.657950	0.545063	3.398314	0.796931
Sulf1	0.184210	0.221469	0.183979	14.692180	1.693538	0.300075	0.217056	0.243579	0.311456	0.848147	0.743920	0.904572	3.110569	0.164060

Similar to pseudotime, ‘latent time’ is computed from the dynamical model.

scv.tl.latent_time(adata_subset)
scv.pl.scatter(adata_subset, color='latent_time', color_map='gnuplot', size=80)

computing latent time using root_cells as prior
    finished (0:00:48) --> added
    'latent_time', shared time (adata.obs)

top_genes = adata_subset.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata_subset, var_names=top_genes, sortby='latent_time', col_color='celltype', n_convolve=100)

top_genes = adata_subset.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata_subset, color='celltype', basis=top_genes[:15], ncols=5, frameon=False)

var_names = ['Rps3', 'Col1a2', 'Tmeff2']
scv.pl.scatter(adata_subset, var_names, color='celltype', frameon=False)
scv.pl.scatter(adata_subset, x='latent_time', y=var_names, color='celltype', frameon=False)

Conclusion and future directions:

In this tutorial we have learned how to perform RNA velocity analysis in scRNA-seq data using scVelo, as well as some downstream applications of RNA velocity such as identifying potential cell-state transitions and identifying genes that are being induced / repressed. RNA velocity is the starting point for a cell fate analysis program called CellRank, which can provide further insights for the analysis of cell lineages.