Q&A-style analysis for stickleback scRNA data

This file does the Q&A-style analysis for the stickleback single-cell RNA sequencing (scRNA) data in Google Drive.

Questions

As far as I am concerned, the ultimate goal of Dan’s project is to discover the potential factors (e.g. genotype, environment, celltype) that have significant effects to the gene-gene (GxG) network in stickleback. Note that we do not have genotype and environment information with current data. Therefore, following questions focus on the general and celltype-specified GxG network.

General GxG network:
- Q1: How to efficiently reconstruct the general GxG network from scRNA data? Are there any community structures for the general GxG network?
Celltype-specified GxG network:
- Q2: How to reconstruct the GxG network for each celltype? Are there any community structures for the GxG networks in each celltype?
- Q3: Are there any significant network differences among the celltypes?
- Q4: Are there any significant community differences among different celltype-specified networks?

In this file, I answered the questions Q1, Q2 and Q3 with quick-and-dirty methods using R package WGCNA and sLED. We mainly follow the WGCNA Tutorial for the network analysis.

0. Preliminary

0.1 Load dependencies

library(WGCNA)
options(stringsAsFactors = FALSE);

# install.packages("devtools") ## if not installed
# library("devtools")
# devtools::install_github("lingxuez/sLED")
library(sLED)

library(RColorBrewer)

0.2 Read pre-processed data

Due to the huge dimension of the scRNA data, we consider a subset of stickleback scRNA data based on the preliminary celltype clustering analysis. We use the function FindVariableFeatures() in Seurat and select 500 high variable genes with large dispersion. We also randomly sample 2% cells from 8 celltypes. The reduced dataset involves the expressions of 500 genes in 1075 cells and the celltype labels for each cell.

# read data
load("stk_red.RData")

stk_red = as.data.frame(t(stk_red_df))
rownames(stk_red) =  gsub("\\.", "-", rownames(stk_red))
rm(stk_red_df)

dim(stk_red)

## [1] 1075  500

Following the tutorial of WGCNA, we exclude 108 genes with excessive 0 expressions.

# delete bad samples and genes in the dataset
gsg = goodSamplesGenes(stk_red, verbose = 3);

##  Flagging genes and samples with too many missing values...
##   ..step 1
##   ..Excluding 108 genes from the calculation due to too many missing samples or zero variance.
##   ..step 2

gsg$allOK

## [1] FALSE

if (!gsg$allOK)
{
  # Optionally, print the gene and sample names that were removed:
  if (sum(!gsg$goodGenes)>0) 
    printFlush(paste("Removing genes:", paste(names(stk_red)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples)>0) 
    printFlush(paste("Removing samples:", paste(rownames(stk_red)[!gsg$goodSamples], collapse = ", ")));
  # Remove the offending genes and samples from the data:
  stk_red = stk_red[gsg$goodSamples, gsg$goodGenes]
}

## Removing genes: cd59, plp1b, ENSGACG00000002897, ENSGACG00000009609, flj13639, LYZ, zgc:92745, krt5, krt98, ENSGACG00000009880, il13ra2, ENSGACG00000015947, ENSGACG00000017777, gpnmb, gch2, ace, tppp2, ENSGACG00000020509, lmo1, ENSGACG00000015962, si:ch211-165b10.3, glulc, PNMT, serpina1l, maker-chrY-snap-gene-93.110, pltp, fgg, ENSGACG00000012644, ftr83, ENSGACG00000016730, ENSGACG00000020510, maker-chrY-snap-gene-2.21, KCNH7, th, cpa4, frem1b, ambp, ENSGACG00000006160, filip1a, cort, si:ch211-43f4.1, phox2a, ndufa4l2a, apoda.2, ENSGACG00000004236, scg3, ENSGACG00000014196, abcc9, aplp2, maker-chrY-snap-gene-127.48, ENSGACG00000011672, si:dkey-243k1.3, igfbp6a, ptprna, dbh, rgs16, hand2, f2, ENSGACG00000014054, scn1laa, hpdb, HAND1, thrsp, ttc12, ENSGACG00000017123, ENSGACG00000003732, npas4a, c6, rhbg, dhrs3b, ttc36, CPXM2, gapdh, si:ch1073-126c3.2, nr5a1a, ENSGACG00000017516, ebf3a, rimbp2, cfbl, maker-chrIV-exonerate-protein2genome-gene-335.6, adcyap1r1a.1, plvapa, MFAP2, maker-chrY-snap-gene-3.14, TYMP, fsta, apoc2, paqr5a, fga, ENSGACG00000006505, slc51a, ENSGACG00000007230, ENSGACG00000006138, eomesb, fhod3a, spp1, postna, vtnb, ttn.1, kcnk3a, ENSGACG00000019595, ENSGACG00000003600, ENSGACG00000016104, ENSGACG00000019789, ENSGACG00000010000, sema6e, ddc, angpt2b

dim(stk_red)

## [1] 1075  392

No more than 30 cells come from celltype c6, c7, and c8, which may lead to an unstable result. We merge celltypes c6, c7, and c8 for a more robust analysis.

table(random_cell$cluster)

## 
##  c1  c3  c4  c2  c5  c6  c8  c7 
## 571 208 111  99  64  14   6   2

# merged c6, c7, c8 
random_cell$cluster[random_cell$cluster == "c7"|random_cell$cluster == "c8"] = "c6"

Q1: General network reconstruction and community detection

1.1 Network reconstruction using topological overlap matrix

Following WCGNA tutorial, we reconstruct the GxG network using the topological overlap matrix (TOM) of the soft-thresholding adjacency matrix. Specifically, the soft-thresholding adjacency matrix \(A\) has entries \[A_{ij} = |\text{Cor}(\text{Gene}_i, \text{Gene}_j)|^{\beta},\] where \(\text{Gene}_i\) is the \(i\)-th gene expression across the cells, and \(\beta\) is the soft-thresholding parameter that can be automatically determined by the function pickSoftThreshold(). With given \(A\), we generate the TOM \(\Omega\) using TOMsimilarity(). A larger \(\Omega_{ij}\) indicates a higher similarity between gene \(i\) and gene \(j\). See the reference Zhang and Horvath, 2005 for details.

We take TOM as the weighted GxG network.

# select the soft-thresholding parameter
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(stk_red, powerVector = powers, verbose = 5)

## pickSoftThreshold: will use block size 392.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 392 of 392

## Warning: executing %dopar% sequentially: no parallel backend registered

##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1    0.358 -0.552         0.2390   45.50  41.70000  108.0
## 2      2    0.517 -0.604         0.3950   19.10  13.20000   60.0
## 3      3    0.604 -0.619         0.4930   11.00   6.09000   38.9
## 4      4    0.663 -0.731         0.5680    7.29   2.68000   29.9
## 5      5    0.223 -2.950         0.1370    5.29   1.46000   29.0
## 6      6    0.266 -3.210         0.1740    4.14   0.93200   28.3
## 7      7    0.235 -3.090         0.0171    3.44   0.67200   27.8
## 8      8    0.211 -3.330         0.0118    3.00   0.41800   27.5
## 9      9    0.216 -3.180         0.0155    2.71   0.28800   27.2
## 10    10    0.209 -3.200         0.0562    2.51   0.16800   26.9
## 11    12    0.216 -3.000         0.0439    2.27   0.07100   26.5
## 12    14    0.173 -2.300         0.0267    2.14   0.02970   26.2
## 13    16    0.238 -2.920         0.1080    2.06   0.01090   26.0
## 14    18    0.248 -2.810         0.1140    2.01   0.00456   25.8
## 15    20    0.171 -2.310         0.1790    1.97   0.00183   25.7

softPower = which.max(-sign(sft$fitIndices[,3])*sft$fitIndices[,2])

# generate soft-thresholding adjacency matrix
adjacency = adjacency(stk_red, power = softPower);

# transfer to TOM
TOM = TOMsimilarity(adjacency)

## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.

# dissimilarity
dissTOM = 1-TOM

1.2 Community detection

We use the hierarchical clustering based on the TOM dissimilarity (\(1 - \Omega\)) to produce a hierarchical clustering tree of genes. Then, we merge the clusters whose expression profiles are very similar using mergeCloseModules() .

geneTree = hclust(as.dist(dissTOM), method = "average")

minModuleSize = 10;
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                            deepSplit = 2, pamRespectsDendro = FALSE,
                            minClusterSize = minModuleSize)

##  ..cutHeight not given, setting it to 0.991  ===>  99% of the (truncated) height range in dendro.
##  ..done.

dynamicColors = labels2colors(dynamicMods)

# merge the modules
merge = mergeCloseModules(stk_red, dynamicColors, verbose = 3)

##  mergeCloseModules: Merging modules whose distance is less than 0.2
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 6 module eigengenes in given set.
##    Calculating new MEs...
##    multiSetMEs: Calculating module MEs.
##      Working on set 1 ...
##      moduleEigengenes: Calculating 6 module eigengenes in given set.

mergedColors = merge$colors

table(mergedColors)

## mergedColors
##      blue     brown     green      grey turquoise    yellow 
##        46        36        10       129       159        12

1.3 Visualization

We plot the general GxG network using heatmap with gene cluster labels. For a better visualization, we do the log transformation of the TOM and reorder the genes based on the clustering assignment. The color bar indicates the cluster labels for each gene.

plotTOM = TOM*(10^8);
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;

gene_order = order(dynamicMods)
#pdf("general_network.pdf", height = 8, width = 8)
heatmap(log(plotTOM[gene_order, gene_order]), 
        ColSideColors = dynamicColors[gene_order],
        RowSideColors= dynamicColors[gene_order], 
        main = paste0("General Network"),Colv=NA, Rowv=NA, scale="none")

#dev.off()

The heatmap indicates there exists a significant block structure in the TOM with 392 genes.

Q2: Cell-type network reconstruction and community detection

We repeat the network reconstruction and community detection with the expression data for each celltype. Note that for each celltype, there are some genes keep 0 expression across the cells, which will return NA in the TOM and be reflected as the blank space in the heatmap of TOM.

TOM_list = list()
plotTOM_list = list()
geneTree_list = list()
dynamicColors_list = list()

cell_type_name = c("c1", "c2","c3","c4","c5","c6")

for (i in 1:length(cell_type_name)) {
  cat("cell type = ", i, "\n")
  
  c_cell = random_cell$cell[random_cell$cluster == cell_type_name[i]]
  stk_c = stk_red[rownames(stk_red) %in% c_cell, ]
  
  powers = c(c(1:10), seq(from = 12, to=20, by=2))
  sft = pickSoftThreshold(stk_c, powerVector = powers, verbose = 5)
  softPower = which.max(-sign(sft$fitIndices[,3])*sft$fitIndices[,2])

  # generate soft-thresholding adjacency matrix
  adjacency = adjacency(stk_c, power = softPower);

  # transfer to TOM
  TOM = TOMsimilarity(adjacency)
  TOM_list[[i]] = TOM
  
  plotTOM = TOM*(10^8);
  # Set diagonal to NA for a nicer plot
  diag(plotTOM) = NA;
  plotTOM_list[[i]] = plotTOM
  
  # dissimilarity
  dissTOM = 1-TOM

  geneTree = hclust(as.dist(dissTOM), method = "average")
  geneTree_list[[i]] = geneTree
  
  minModuleSize = 20;
  # Module identification using dynamic tree cut:
  dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
                              deepSplit = 2, pamRespectsDendro = FALSE,
                              minClusterSize = minModuleSize)
  dynamicColors = labels2colors(dynamicMods)
  dynamicColors_list[[i]] = dynamicColors
}

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .

## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .

We visualize the networks and communities for each celltype. The order of genes are the same as the general network.

#pdf(file = "c_network.pdf", height = 8, width = 8)
for (i in 1:length(cell_type_name)) {
  heatmap(log(plotTOM_list[[i]][gene_order,gene_order]),
          ColSideColors = dynamicColors_list[[i]][gene_order],
          RowSideColors= dynamicColors_list[[i]][gene_order], 
          main = paste0("Network Heatmap for Cell type ", cell_type_name[[i]]),
          Colv=NA, Rowv=NA, scale = "none")

}

#dev.off()

In general, the community structures of each celltype are not super clear when the gene follows the same order of general network. The heatmap for the celltype c6 (merged with c7 and c8) shows a significantly different pattern than other celltypes. The celltype c6 shows strong connections among the genes that rarely expressed in other celltypes. Though other 5 celltypes show the similar pattern of the blank space, there exist network connection differences among the expressed gene. We will take a closer look in the next section.

Q3: Cell-type network difference testing

We check the differences among the celltype networks by visualizing the differential TOM networks and the covariance-based hypothesis testing in Zhu et.al.

3.1 Visualizing differential TOM networks

Checking the network heatmaps for 6 celltypes, the heatmaps of type c1, c2, c4 are most similar with each other. We plot the centralized differential TOM matrices; i.e., \(\tilde D = D - \bar D, D = \Omega_i - \Omega_j\), \(\Omega_i\) is the TOM of celltype \(i\), and \(\bar D\) is the mean value of all the entries in \(D\). Blue indicates the negative entry and red indicates the positive entry.

# c1 vs c2
d12 = (plotTOM_list[[1]][gene_order,gene_order]- plotTOM_list[[2]][gene_order,gene_order] )*(10^(-5))
heatmap(d12 - mean(d12,na.rm = T),
          main = paste0("Differential network of ", cell_type_name[[1]], " and ", cell_type_name[[2]]),
          Colv=NA, Rowv=NA, scale = "none", col = colorRampPalette(brewer.pal(10, "RdBu"))(7))

# c1 vs c4
d14 = (plotTOM_list[[1]][gene_order,gene_order]- plotTOM_list[[4]][gene_order,gene_order] )*(10^(-5))
heatmap(d14 - mean(d14,na.rm = T),
          main = paste0("Differential network of ", cell_type_name[[1]], " and ", cell_type_name[[4]]),
          Colv=NA, Rowv=NA, scale = "none", col = colorRampPalette(brewer.pal(10, "RdBu"))(7))

# c2 vs c4
d24 = (plotTOM_list[[2]][gene_order,gene_order]- plotTOM_list[[4]][gene_order,gene_order] )*(10^(-5))
heatmap(d24 - mean(d24,na.rm = T),
          main = paste0("Differential network of ", cell_type_name[[2]], " and ", cell_type_name[[4]]),
          Colv=NA, Rowv=NA, scale = "none", col = colorRampPalette(brewer.pal(10, "RdBu"))(7))

The differential plots indicate that there exist network differences among celltypes c1, c2, and c4.

3.2 Covariance-based hypothesis testing

We apply the pairwise test for high-dimensional covariance matrices for all the pairs \((i,j),\ i,j, \in \{ \text{c1, c2, c3, c4, c5, c6} \}.\)

stk_c_list = list()
for (i in  1:length(cell_type_name)) {
  c_cell = random_cell$cell[random_cell$cluster == cell_type_name[i]]
  stk_c_list[[i]] = stk_red[rownames(stk_red) %in% c_cell, ]
}

for (i in 1:(length(cell_type_name)-1) ){
  for (j in (i+1):length(cell_type_name)) {
    cat("Compare cell type", cell_type_name[i], " and ", cell_type_name[j],
        " : p value = ", sLED(stk_c_list[[2]], stk_c_list[[3]])$pVal, "\n" )
  }
}

## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1  and  c2  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1  and  c3  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1  and  c4  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1  and  c5  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1  and  c6  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2  and  c3  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2  and  c4  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2  and  c5  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2  and  c6  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c3  and  c4  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c3  and  c5  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c3  and  c6  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c4  and  c5  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c4  and  c6  : p value =  0 
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c5  and  c6  : p value =  0

We have 0 p-values for all the pairwise tests. This indicates that there are significant difference among the gene expression covariance matrices for each celltype. Note that TOM is also generated from the covariance matrix. We can believe there also exist significant differences among the TOMs for each celltype.

Discussion

Though there exists a clear community structure in the general GxG network, we can not use the celltype-specified network to explain the community structure very well. Specifically, celltypes c1 - c5 all show strong connections among the genes in the middle part (labeled as light blue in the general network). We may further focus on that subset of genes in the future analysis.
The celltype-specified networks do not show a straightforward differences among celltypes c1 - c5. Though covariance-based testing supports the existence of differences, the Gaussian assumption may not be true for the scRNA data and the TOM network is not exactly equal to the covariance matrix. Therefore, we may find another test or method to directly and statistically check the differences among TOM networks.
We do not check the community differences among the celltypes. Need to find a test or method for the community testing.