This file does the Q&A-style analysis for the stickleback single-cell RNA sequencing (scRNA) data in Google Drive.
As far as I am concerned, the ultimate goal of Dan’s project is to discover the potential factors (e.g. genotype, environment, celltype) that have significant effects to the gene-gene (GxG) network in stickleback. Note that we do not have genotype and environment information with current data. Therefore, following questions focus on the general and celltype-specified GxG network.
In this file, I answered the questions Q1, Q2 and Q3 with
quick-and-dirty methods using R package WGCNA
and
sLED
. We mainly follow the WGCNA
Tutorial for the network analysis.
library(WGCNA)
options(stringsAsFactors = FALSE);
# install.packages("devtools") ## if not installed
# library("devtools")
# devtools::install_github("lingxuez/sLED")
library(sLED)
library(RColorBrewer)
Due to the huge dimension of the scRNA data, we consider a subset of
stickleback scRNA data based on the preliminary celltype clustering
analysis. We use the function FindVariableFeatures()
in
Seurat
and select 500 high variable genes with large
dispersion. We also randomly sample 2% cells from 8 celltypes. The
reduced dataset involves the expressions of 500 genes in 1075 cells and
the celltype labels for each cell.
# read data
load("stk_red.RData")
stk_red = as.data.frame(t(stk_red_df))
rownames(stk_red) = gsub("\\.", "-", rownames(stk_red))
rm(stk_red_df)
dim(stk_red)
## [1] 1075 500
Following the tutorial of WGCNA, we exclude 108 genes with excessive 0 expressions.
# delete bad samples and genes in the dataset
gsg = goodSamplesGenes(stk_red, verbose = 3);
## Flagging genes and samples with too many missing values...
## ..step 1
## ..Excluding 108 genes from the calculation due to too many missing samples or zero variance.
## ..step 2
gsg$allOK
## [1] FALSE
if (!gsg$allOK)
{
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(stk_red)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(stk_red)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
stk_red = stk_red[gsg$goodSamples, gsg$goodGenes]
}
## Removing genes: cd59, plp1b, ENSGACG00000002897, ENSGACG00000009609, flj13639, LYZ, zgc:92745, krt5, krt98, ENSGACG00000009880, il13ra2, ENSGACG00000015947, ENSGACG00000017777, gpnmb, gch2, ace, tppp2, ENSGACG00000020509, lmo1, ENSGACG00000015962, si:ch211-165b10.3, glulc, PNMT, serpina1l, maker-chrY-snap-gene-93.110, pltp, fgg, ENSGACG00000012644, ftr83, ENSGACG00000016730, ENSGACG00000020510, maker-chrY-snap-gene-2.21, KCNH7, th, cpa4, frem1b, ambp, ENSGACG00000006160, filip1a, cort, si:ch211-43f4.1, phox2a, ndufa4l2a, apoda.2, ENSGACG00000004236, scg3, ENSGACG00000014196, abcc9, aplp2, maker-chrY-snap-gene-127.48, ENSGACG00000011672, si:dkey-243k1.3, igfbp6a, ptprna, dbh, rgs16, hand2, f2, ENSGACG00000014054, scn1laa, hpdb, HAND1, thrsp, ttc12, ENSGACG00000017123, ENSGACG00000003732, npas4a, c6, rhbg, dhrs3b, ttc36, CPXM2, gapdh, si:ch1073-126c3.2, nr5a1a, ENSGACG00000017516, ebf3a, rimbp2, cfbl, maker-chrIV-exonerate-protein2genome-gene-335.6, adcyap1r1a.1, plvapa, MFAP2, maker-chrY-snap-gene-3.14, TYMP, fsta, apoc2, paqr5a, fga, ENSGACG00000006505, slc51a, ENSGACG00000007230, ENSGACG00000006138, eomesb, fhod3a, spp1, postna, vtnb, ttn.1, kcnk3a, ENSGACG00000019595, ENSGACG00000003600, ENSGACG00000016104, ENSGACG00000019789, ENSGACG00000010000, sema6e, ddc, angpt2b
dim(stk_red)
## [1] 1075 392
No more than 30 cells come from celltype c6, c7, and c8, which may lead to an unstable result. We merge celltypes c6, c7, and c8 for a more robust analysis.
table(random_cell$cluster)
##
## c1 c3 c4 c2 c5 c6 c8 c7
## 571 208 111 99 64 14 6 2
# merged c6, c7, c8
random_cell$cluster[random_cell$cluster == "c7"|random_cell$cluster == "c8"] = "c6"
Following WCGNA tutorial, we reconstruct the GxG network using the
topological overlap matrix (TOM) of the soft-thresholding adjacency
matrix. Specifically, the soft-thresholding adjacency matrix \(A\) has entries \[A_{ij} = |\text{Cor}(\text{Gene}_i,
\text{Gene}_j)|^{\beta},\] where \(\text{Gene}_i\) is the \(i\)-th gene expression across the cells,
and \(\beta\) is the soft-thresholding
parameter that can be automatically determined by the function
pickSoftThreshold()
. With given \(A\), we generate the TOM \(\Omega\) using
TOMsimilarity()
. A larger \(\Omega_{ij}\) indicates a higher similarity
between gene \(i\) and gene \(j\). See the reference Zhang
and Horvath, 2005 for details.
We take TOM as the weighted GxG network.
# select the soft-thresholding parameter
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(stk_red, powerVector = powers, verbose = 5)
## pickSoftThreshold: will use block size 392.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 392 of 392
## Warning: executing %dopar% sequentially: no parallel backend registered
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.358 -0.552 0.2390 45.50 41.70000 108.0
## 2 2 0.517 -0.604 0.3950 19.10 13.20000 60.0
## 3 3 0.604 -0.619 0.4930 11.00 6.09000 38.9
## 4 4 0.663 -0.731 0.5680 7.29 2.68000 29.9
## 5 5 0.223 -2.950 0.1370 5.29 1.46000 29.0
## 6 6 0.266 -3.210 0.1740 4.14 0.93200 28.3
## 7 7 0.235 -3.090 0.0171 3.44 0.67200 27.8
## 8 8 0.211 -3.330 0.0118 3.00 0.41800 27.5
## 9 9 0.216 -3.180 0.0155 2.71 0.28800 27.2
## 10 10 0.209 -3.200 0.0562 2.51 0.16800 26.9
## 11 12 0.216 -3.000 0.0439 2.27 0.07100 26.5
## 12 14 0.173 -2.300 0.0267 2.14 0.02970 26.2
## 13 16 0.238 -2.920 0.1080 2.06 0.01090 26.0
## 14 18 0.248 -2.810 0.1140 2.01 0.00456 25.8
## 15 20 0.171 -2.310 0.1790 1.97 0.00183 25.7
softPower = which.max(-sign(sft$fitIndices[,3])*sft$fitIndices[,2])
# generate soft-thresholding adjacency matrix
adjacency = adjacency(stk_red, power = softPower);
# transfer to TOM
TOM = TOMsimilarity(adjacency)
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
# dissimilarity
dissTOM = 1-TOM
We use the hierarchical clustering based on the TOM dissimilarity
(\(1 - \Omega\)) to produce a
hierarchical clustering tree of genes. Then, we merge the clusters whose
expression profiles are very similar using
mergeCloseModules()
.
geneTree = hclust(as.dist(dissTOM), method = "average")
minModuleSize = 10;
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
## ..cutHeight not given, setting it to 0.991 ===> 99% of the (truncated) height range in dendro.
## ..done.
dynamicColors = labels2colors(dynamicMods)
# merge the modules
merge = mergeCloseModules(stk_red, dynamicColors, verbose = 3)
## mergeCloseModules: Merging modules whose distance is less than 0.2
## multiSetMEs: Calculating module MEs.
## Working on set 1 ...
## moduleEigengenes: Calculating 6 module eigengenes in given set.
## Calculating new MEs...
## multiSetMEs: Calculating module MEs.
## Working on set 1 ...
## moduleEigengenes: Calculating 6 module eigengenes in given set.
mergedColors = merge$colors
table(mergedColors)
## mergedColors
## blue brown green grey turquoise yellow
## 46 36 10 129 159 12
We plot the general GxG network using heatmap with gene cluster labels. For a better visualization, we do the log transformation of the TOM and reorder the genes based on the clustering assignment. The color bar indicates the cluster labels for each gene.
plotTOM = TOM*(10^8);
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
gene_order = order(dynamicMods)
#pdf("general_network.pdf", height = 8, width = 8)
heatmap(log(plotTOM[gene_order, gene_order]),
ColSideColors = dynamicColors[gene_order],
RowSideColors= dynamicColors[gene_order],
main = paste0("General Network"),Colv=NA, Rowv=NA, scale="none")
#dev.off()
The heatmap indicates there exists a significant block structure in the TOM with 392 genes.
We repeat the network reconstruction and community detection with the
expression data for each celltype. Note that for each celltype, there
are some genes keep 0 expression across the cells, which will return
NA
in the TOM and be reflected as the blank space in the
heatmap of TOM.
TOM_list = list()
plotTOM_list = list()
geneTree_list = list()
dynamicColors_list = list()
cell_type_name = c("c1", "c2","c3","c4","c5","c6")
for (i in 1:length(cell_type_name)) {
cat("cell type = ", i, "\n")
c_cell = random_cell$cell[random_cell$cluster == cell_type_name[i]]
stk_c = stk_red[rownames(stk_red) %in% c_cell, ]
powers = c(c(1:10), seq(from = 12, to=20, by=2))
sft = pickSoftThreshold(stk_c, powerVector = powers, verbose = 5)
softPower = which.max(-sign(sft$fitIndices[,3])*sft$fitIndices[,2])
# generate soft-thresholding adjacency matrix
adjacency = adjacency(stk_c, power = softPower);
# transfer to TOM
TOM = TOMsimilarity(adjacency)
TOM_list[[i]] = TOM
plotTOM = TOM*(10^8);
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
plotTOM_list[[i]] = plotTOM
# dissimilarity
dissTOM = 1-TOM
geneTree = hclust(as.dist(dissTOM), method = "average")
geneTree_list[[i]] = geneTree
minModuleSize = 20;
# Module identification using dynamic tree cut:
dynamicMods = cutreeDynamic(dendro = geneTree, distM = dissTOM,
deepSplit = 2, pamRespectsDendro = FALSE,
minClusterSize = minModuleSize)
dynamicColors = labels2colors(dynamicMods)
dynamicColors_list[[i]] = dynamicColors
}
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .
## Warning in eval(xpr, envir = envir): Some correlations are NA in block 1 : 392 .
We visualize the networks and communities for each celltype. The order of genes are the same as the general network.
#pdf(file = "c_network.pdf", height = 8, width = 8)
for (i in 1:length(cell_type_name)) {
heatmap(log(plotTOM_list[[i]][gene_order,gene_order]),
ColSideColors = dynamicColors_list[[i]][gene_order],
RowSideColors= dynamicColors_list[[i]][gene_order],
main = paste0("Network Heatmap for Cell type ", cell_type_name[[i]]),
Colv=NA, Rowv=NA, scale = "none")
}
#dev.off()
In general, the community structures of each celltype are not super clear when the gene follows the same order of general network. The heatmap for the celltype c6 (merged with c7 and c8) shows a significantly different pattern than other celltypes. The celltype c6 shows strong connections among the genes that rarely expressed in other celltypes. Though other 5 celltypes show the similar pattern of the blank space, there exist network connection differences among the expressed gene. We will take a closer look in the next section.
We check the differences among the celltype networks by visualizing the differential TOM networks and the covariance-based hypothesis testing in Zhu et.al.
Checking the network heatmaps for 6 celltypes, the heatmaps of type c1, c2, c4 are most similar with each other. We plot the centralized differential TOM matrices; i.e., \(\tilde D = D - \bar D, D = \Omega_i - \Omega_j\), \(\Omega_i\) is the TOM of celltype \(i\), and \(\bar D\) is the mean value of all the entries in \(D\). Blue indicates the negative entry and red indicates the positive entry.
# c1 vs c2
d12 = (plotTOM_list[[1]][gene_order,gene_order]- plotTOM_list[[2]][gene_order,gene_order] )*(10^(-5))
heatmap(d12 - mean(d12,na.rm = T),
main = paste0("Differential network of ", cell_type_name[[1]], " and ", cell_type_name[[2]]),
Colv=NA, Rowv=NA, scale = "none", col = colorRampPalette(brewer.pal(10, "RdBu"))(7))
# c1 vs c4
d14 = (plotTOM_list[[1]][gene_order,gene_order]- plotTOM_list[[4]][gene_order,gene_order] )*(10^(-5))
heatmap(d14 - mean(d14,na.rm = T),
main = paste0("Differential network of ", cell_type_name[[1]], " and ", cell_type_name[[4]]),
Colv=NA, Rowv=NA, scale = "none", col = colorRampPalette(brewer.pal(10, "RdBu"))(7))
# c2 vs c4
d24 = (plotTOM_list[[2]][gene_order,gene_order]- plotTOM_list[[4]][gene_order,gene_order] )*(10^(-5))
heatmap(d24 - mean(d24,na.rm = T),
main = paste0("Differential network of ", cell_type_name[[2]], " and ", cell_type_name[[4]]),
Colv=NA, Rowv=NA, scale = "none", col = colorRampPalette(brewer.pal(10, "RdBu"))(7))
The differential plots indicate that there exist network differences among celltypes c1, c2, and c4.
We apply the pairwise test for high-dimensional covariance matrices for all the pairs \((i,j),\ i,j, \in \{ \text{c1, c2, c3, c4, c5, c6} \}.\)
stk_c_list = list()
for (i in 1:length(cell_type_name)) {
c_cell = random_cell$cell[random_cell$cluster == cell_type_name[i]]
stk_c_list[[i]] = stk_red[rownames(stk_red) %in% c_cell, ]
}
for (i in 1:(length(cell_type_name)-1) ){
for (j in (i+1):length(cell_type_name)) {
cat("Compare cell type", cell_type_name[i], " and ", cell_type_name[j],
" : p value = ", sLED(stk_c_list[[2]], stk_c_list[[3]])$pVal, "\n" )
}
}
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1 and c2 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1 and c3 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1 and c4 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1 and c5 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c1 and c6 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2 and c3 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2 and c4 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2 and c5 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c2 and c6 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c3 and c4 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c3 and c5 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c3 and c6 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c4 and c5 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c4 and c6 : p value = 0
## 100 permutation started:
## 10 ,20 ,30 ,40 ,50 ,60 ,70 ,80 ,90 ,100 ,permutations finished.
## Compare cell type c5 and c6 : p value = 0
We have 0 p-values for all the pairwise tests. This indicates that there are significant difference among the gene expression covariance matrices for each celltype. Note that TOM is also generated from the covariance matrix. We can believe there also exist significant differences among the TOMs for each celltype.
Though there exists a clear community structure in the general GxG network, we can not use the celltype-specified network to explain the community structure very well. Specifically, celltypes c1 - c5 all show strong connections among the genes in the middle part (labeled as light blue in the general network). We may further focus on that subset of genes in the future analysis.
The celltype-specified networks do not show a straightforward differences among celltypes c1 - c5. Though covariance-based testing supports the existence of differences, the Gaussian assumption may not be true for the scRNA data and the TOM network is not exactly equal to the covariance matrix. Therefore, we may find another test or method to directly and statistically check the differences among TOM networks.
We do not check the community differences among the celltypes. Need to find a test or method for the community testing.