Enrichment analysis • hdWGCNA

Compiled: 10-01-2025

Introduction

In this tutorial, we aim to provide further biological context for our co-expression modules by performing different enrichment tests. We leverage the R pacakge enrichR to perform enrichment tests on a wide range of curated gene lists. We also demonstrate using the R package fgsea another type of enrichment analysis (GSEA).

First, we first need to load the data and the required libraries.

library(Seurat)
library(tidyverse)
library(cowplot)
library(patchwork)
library(WGCNA)
library(hdWGCNA)

# gene enrichment packages
library(enrichR)
library(GeneOverlap)

# using the cowplot theme for ggplot (optional)
theme_set(theme_cowplot())

# set random seed for reproducibility
set.seed(12345)

# re-load the Zhou et al snRNA-seq dataset, which was already processed 
# as shown in the hdWGCNA basics tutorial
seurat_obj <- readRDS('data/Zhou_control.rds')

Enrichr

In this section we discuss how to perform Enrichr enrichment tests and how to visualize the results using hdWGCNA. hdWGCNA includes the function RunEnrichr to compare the set of genes in each module with any of the gene lists hosted by Enrichr. You can take a look at the list of different Enrichr gene lists here.

Run Enrichr

In the following example, we perform the Enrichr enrichment test with three Gene Ontology datbases:

GO_Biological_Process_2023
GO_Cellular_Component_2023
GO_Molecular_Function_2023

# define the enrichr databases to test
dbs <- c('GO_Biological_Process_2023','GO_Cellular_Component_2023','GO_Molecular_Function_2023')

# perform enrichment tests
seurat_obj <- RunEnrichr(
  seurat_obj,
  dbs=dbs,
  max_genes = 100 # use max_genes = Inf to choose all genes
)

# retrieve the output table
enrich_df <- GetEnrichrTable(seurat_obj)

# look at the results
head(enrich_df)

                                                                  Term Overlap
1           Central Nervous System Neuron Differentiation (GO:0021953)    2/31
2                                Response To Amyloid-Beta (GO:1904645)    2/43
3                                      Cardiac Conduction (GO:0061337)    2/46
4                          Cellular Response To Radiation (GO:0071478)    2/48
5 Regulation Of Growth Hormone Receptor Signaling Pathway (GO:0060398)     1/5
6                 Negative Regulation Of Neuron Migration (GO:2001223)     1/5
     P.value Adjusted.P.value Old.P.value Old.Adjusted.P.value Odds.Ratio
1 0.01047245        0.4204923           0                    0  13.983814
2 0.01956379        0.4204923           0                    0   9.885017
3 0.02220804        0.4204923           0                    0   9.209647
4 0.02404833        0.4204923           0                    0   8.808341
5 0.02475353        0.4204923           0                    0  50.242424
6 0.02475353        0.4204923           0                    0  50.242424
  Combined.Score         Genes                         db module
1       63.75231   ZC4H2;KNDC1 GO_Biological_Process_2023 INH-M1
2       38.88840 CACNA1B;GRIN1 GO_Biological_Process_2023 INH-M1
3       35.06390  AKAP9;CTNNA3 GO_Biological_Process_2023 INH-M1
4       32.83476   SPIDR;RAD9A GO_Biological_Process_2023 INH-M1
5      185.83604          MBD5 GO_Biological_Process_2023 INH-M1
6      185.83604        NEXMIF GO_Biological_Process_2023 INH-M1

Term: The name of the term (ie biological process, etc).
Overlap: The fraction of genes overlapping between the module and the gene list.
P.value: Fisher’s exact test p-value.
Adjusted.P.value: Benjamini-Hochberg multiple testing correction for the Fisher’s exact test p-values.
Odds.Ratio: statistic to quantify the association between the gene list in the current module and the gene list for the current Term.
Combined.Score: natural log of the p-value multiplied by the z-score, where the z-score is the deviation from the expected rank.
Genes: semicolon delimited list of gene symbols for the overlapping genes.
db: the name of the Enrichr gene list.
module: the name of the hdWGCNA module.

Visualize enrichments

Now that we have done the enrichment tests, there are several ways we can go about visualizing the results.

`EnrichrBarPlot`

Next we use EnrichrBarPlot to summarize the results of every Enrichr database and every module. This function outputs a .pdf figure for each module, containing a barplot showing the top N enriched terms. The following example will plot the top 10 terms in each module and will output the results to a folder called enrichr_plots.

# make GO term plots:
EnrichrBarPlot(
  seurat_obj,
  outdir = "enrichr_plots", # name of output directory
  n_terms = 10, # number of enriched terms to show (sometimes more are shown if there are ties)
  plot_size = c(5,7), # width, height of the output .pdfs
  logscale=TRUE # do you want to show the enrichment as a log scale?
)

The following bar plot is a single example of the EnrichrBarPlot output:

Interpreting Enrichr results

Each of the enrichment bar plots are colored by the module’s unique color, and each term is sorted by the enrichment (combined score). We encourage users to carefully inspect the results of the enrichment tests, and use prior biological knowledge before jumping to conclusions. In this example, we see some terms that make sense for inhibitory neurons, such as “inhibitory synapse assembly” and “synaptic transmission, GABAergic”. On the other hand, we see several cardiac related terms that are realistically not at all related to our system in this example (human brain). Many genes take part in distinct biological processes in different tissues within the same organism, which leads to enrichment results like this.

`EnrichrDotPlot`

hdWGCNA includes an additional visualization function for enrichment results, EnrichrDotPlot, which shows the top results for one Enrichr database in each module. In the following example, we plot the top term in the GO_Biological_Process_2021 database.

# enrichr dotplot
EnrichrDotPlot(
  seurat_obj,
  mods = "all", # use all modules (default)
  database = "GO_Biological_Process_2023", # this must match one of the dbs used previously
  n_terms=2, # number of terms per module
  term_size=8, # font size for the terms
  p_adj = FALSE # show the p-val or adjusted p-val?
)  + scale_color_stepsn(colors=rev(viridis::magma(256)))

In this plot, the dot size shows the enrichment and the color shows the significance level. Terms are only shown in a module if the enrichment is significant (p < 0.05).

Gene Set Enrichment Analysis (GSEA)

Aside from Enrichr, we can also perform Gene Set Enrichment Analysis (GSEA) to provide further biological context to our co-expression modules. For this section, we use the Bioconductor package fgsea. Note that this analysis simply uses the functions provided by fgsea rather than including new functions specific to hdWGCNA.

First, install fgsea.

BiocManager::install("fgsea")
library(fgsea)

While the Enrichr package queries their web database for gene lists, fgsea does not have a similar functionality and we need to directly supply the gene lists of interest. For this example, we download the GO_Biological_Process_2023 list from the Enrichr website as a .txt file. Please see this link to download other Enrichr gene sets. Now we will load the gene lists using fgsea.

# load the GO Biological Pathways file (downloaded from EnrichR website)
pathways <- fgsea::gmtPathways('GO_Biological_Process_2023.txt')

# optionally, remove the GO term ID from the pathway names to make the downstream plots look cleaner
names(pathways) <- stringr::str_replace(names(pathways), " \\s*\\([^\\)]+\\)", "")

Expand to see explanation of pathways

pathways is a named list where the names represent the pathways, and each element of the list is a character vector representing the genes associated with this pathway. The order of genes in the character vector does not matter. You can easily use this structure to create custom lists to query with fgsea.

head(pathways)

$`'De Novo' AMP Biosynthetic Process`
[1] "ATIC"  "PAICS" "PFAS"  "ADSS1" "ADSS2" "GART" 

$`'De Novo' Post-Translational Protein Folding`
 [1] "SDF2L1"  "HSPA9"   "CCT2"    "HSPA6"   "ST13"    "ENTPD5"  "HSPA1L" 
 [8] "HSPA5"   "PTGES3"  "HSPA8"   "HSPA7"   "DNAJB13" "HSPA2"   "DNAJB14"
[15] "HSPE1"   "DNAJC18" "GAK"     "DNAJC7"  "DNAJB12" "HSPA1A"  "ST13P5" 
[22] "HSPA1B"  "ERO1A"   "SELENOF" "HSPA14"  "HSPA13"  "DNAJB1"  "CHCHD4" 
[29] "DNAJB5"  "DNAJB4"  "SDF2"    "UGGT1"  

$`2-Oxoglutarate Metabolic Process`
 [1] "IDH1"   "PHYH"   "GOT2"   "MRPS36" "GOT1"   "IDH2"   "ADHFE1" "GPT2"  
 [9] "TAT"    "DLST"   "OGDHL"  "L2HGDH" "D2HGDH" "OGDH"  

$`3'-UTR-mediated mRNA Destabilization`
 [1] "UPF1"    "TRIM71"  "RC3H1"   "ZFP36L1" "ZFP36L2" "MOV10"   "KHSRP"  
 [8] "ZC3H12D" "ZFP36"   "ZC3H12A" "DHX36"   "DND1"    "PLEKHN1" "RBM24"  
[15] "TARDBP" 

$`3'-UTR-mediated mRNA Stabilization`
 [1] "DAZ4"     "TIRAP"    "DAZ3"     "YBX3"     "DAZ2"     "DAZ1"    
 [7] "ELAVL1"   "ELAVL4"   "MAPK14"   "DAZL"     "MYD88"    "BOLL"    
[13] "ZFP36"    "MAPKAPK2" "HNRNPC"   "RBM47"    "TARDBP"   "HNRNPA0" 

$`3'-Phosphoadenosine 5'-Phosphosulfate Metabolic Process`
 [1] "PAPSS1"  "PAPSS2"  "SULT1A2" "SULT1C4" "SULT2A1" "SULT1A1" "SULT1C3"
 [8] "SULT2B1" "TPST2"   "SULT1B1" "SULT1E1" "ENPP1"   "TPST1"   "BPNT1"  
[15] "SULT1A3"

Example 1: using all co-expression network genes

Next, we need to provide a “ranked” list of genes to compare against our gene lists. In this example, we focus on module INH-M5, and we use all genes in our co-expression analysis except for genes in the grey module. We will rank the genes by their eigengene-based connectivity (kME) in the module.

# get the modules table and remove grey genes
modules <- GetModules(seurat_obj) %>% subset(module != 'grey')

# rank all genes in this kME
cur_mod <- 'INH-M5'
cur_genes <- modules[,(c('gene_name', 'module', paste0('kME_', cur_mod)))]
ranks <- cur_genes$kME; names(ranks) <- cur_genes$gene_name
ranks <- ranks[order(ranks)]

head(ranks)

  KIAA1217     SPOCK3      GRIK3      TENM2    SPARCL1       SOX6 
-0.4932557 -0.4605339 -0.3541557 -0.3448085 -0.3354791 -0.3145449

ranks is a numeric where the values represent kMEs, and the names correspond to genes. They are ordered from low to high.

Next we run the GSEA test using the fgsea function. Please consult thefgsea documentation for information about the options for this function.

# run fgsea to compute enrichments
gsea_df <- fgsea::fgsea(
  pathways = pathways, 
  stats = ranks,
  minSize = 10,
  maxSize = 500
)
head(gsea_df)

                                        pathway       pval      padj    log2err
                                         <char>      <num>     <num>      <num>
1: 'De Novo' Post-Translational Protein Folding 0.68875740 0.9861627 0.03871667
2:                     ATP Biosynthetic Process 0.01018654 0.1789733 0.38073040
3:                        ATP Metabolic Process 0.98579882 1.0000000 0.02048936
4:               Actin Filament Bundle Assembly 0.45405405 0.8872106 0.15114876
5:                  Actin Filament Organization 0.79468085 0.9954510 0.02660528
6:     Actin Polymerization Or Depolymerization 0.02005858 0.2209742 0.35248786
           ES        NES  size  leadingEdge
        <num>      <num> <int>       <list>
1:  0.4162606  0.8738619    12 ERO1A, H....
2: -0.6001491 -1.8153572    14 ATP5F1B,....
3:  0.2417035  0.5074116    12 AK4, ATP....
4: -0.3595058 -0.9972337    10 MYO1B, S....
5:  0.3299864  0.8015173    31 COBL, FA....
6:  0.7482025  1.4964736    10 COBL, DI....

Next we use the fgsea function plotGseaTable to visualize the top 25 pathways enriched in this module.

top_pathways <- gsea_df %>% 
    subset(pval < 0.05) %>% 
    slice_max(order_by=NES, n=25) %>% 
    .$pathway

plotGseaTable(
    pathways[top_pathways], 
    ranks, 
    gsea_df, 
    gseaParam=0.5,
    colwidths = c(10, 4, 1, 1, 1)
)

We can also visualize one pathway at a time using the function plotEnrichment.

# name of the pathway to plot 
selected_pathway <- 'Neuron Projection Morphogenesis'

plotEnrichment(
    pathways[[selected_pathway]],
    ranks
) + labs(title=selected_pathway)

Example 2: using only the genes in one module

Here we show a similar example as above, but only using the genes in one module rather than all the genes in our co-expression network. These enrichment results may be more specific to each module, but the downside is that for small modules there may be no significant results.

We will use module INH-M5 again, which has 287 genes.

# rank by INH-M5 genes only by kME
cur_mod <- 'INH-M5'
modules <- GetModules(seurat_obj) %>% subset(module == cur_mod)
cur_genes <- modules[,(c('gene_name', 'module', paste0('kME_', cur_mod)))]
ranks <- cur_genes$kME; names(ranks) <- cur_genes$gene_name
ranks <- ranks[order(ranks)]

# run fgsea to compute enrichments
gsea_df2 <- fgsea::fgsea(
  pathways = pathways, 
  stats = ranks,
  minSize = 3,
  maxSize = 500
)

Let’s see how many significant results there are in this table versus the previous table.

n_signif1 <- subset(gsea_df, pval < 0.05) %>% nrow 
n_signif2 <- subset(gsea_df2, pval < 0.05) %>% nrow

print(paste0("All genes: ", n_signif1, ", ", cur_mod, " only: ", n_signif2))

[1] "All genes: 138 | INH-M5 only: 18"

We can see that there are more significant hits when we use all of the genes, so we encourage users to think about which approach better suits your particular use case. We can use these results to make similar plots as above using the same functions.

top_pathways <- gsea_df2 %>% 
    subset(pval < 0.05) %>% 
    slice_max(order_by=NES, n=25) %>% 
    .$pathway

plotGseaTable(
    pathways[top_pathways], 
    ranks, 
    gsea_df2, 
    gseaParam=0.5,
    colwidths = c(10, 4, 1, 1, 1)
)

Finally, let’s demonstrate what happens when we run GSEA on a very small module. We will use module INH-M17, which contains 57 genes.

# rank by INH-M17 genes only by kME
cur_mod <- 'INH-M17'
modules <- GetModules(seurat_obj) %>% subset(module == cur_mod)
cur_genes <- modules[,(c('gene_name', 'module', paste0('kME_', cur_mod)))]
ranks <- cur_genes$kME; names(ranks) <- cur_genes$gene_name
ranks <- ranks[order(ranks)]

# run fgsea to compute enrichments
gsea_df3 <- fgsea::fgsea(
  pathways = pathways, 
  stats = ranks,
  minSize = 3,
  maxSize = 500
)

n_signif3 <- subset(gsea_df3, pval < 0.05) %>% nrow 
print(n_signif3)

[1] 2

For a small module like this one, we do not expect many significant results with this approach. Only 2 significant results were found in this example. It would likely be better to use the ranked list of genes by kME for all genes in the co-expression network as in the first example.