Foreword
In many papers on single-cell data, after the researchers have annotated all cell clusters, they isolate the clusters of interest and cluster them again (subclustering) so that those cells can be discussed in finer detail. Different analysts subcluster in different ways, and there is no single right or wrong approach: as long as the logic is sound and the method serves the research question, I think the analysis steps can be considered correct. Below are my own steps and methods for subclustering a specific cell cluster. If anything is inaccurate, please leave a comment.
To help promote my articles: if you are interested, you can read the Seurat tutorials I wrote:
- Seurat V4.9.9 – A powerful R suite for single-cell analysis
- Single-cell data analysis: detailed explanation of batch effect and Seurat Integration analysis
Sample data
In this example we analyze two peripheral blood mononuclear cell (PBMC) single-cell datasets from 10X Genomics, containing 2,700 and 5,025 cells respectively. The saved .rds file of the original data can be downloaded from my cloud storage. If you have doubts about the file, you can download the data yourself from the 10X Genomics database and then process it by following the article "Single-cell data analysis: detailed explanation of batch effect and Seurat Integration analysis".
Dataset download location:
5k Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (v3 chemistry)
1. Load the package and load the .rds save file
Load the required packages into the R environment, then load pbmc3k5K_final.rds into R with the readRDS() function. The path in the code is only an example and must be changed to wherever the file is located on your computer.
# Load packages
library(Seurat)
library(dplyr)
library(patchwork)

# Use the correct path on your computer; every "\" must be changed to "/"
pbmc.combined <- readRDS(file = "C:/Users/Administrator/Desktop/pbmc3k5K_final.rds")

# Check the loaded result
DimPlot(pbmc.combined, reduction = "umap", group.by = "orig.ident")
DimPlot(pbmc.combined, reduction = "umap")
DimPlot(pbmc.combined, reduction = "umap", label = TRUE, split.by = "orig.ident")
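The comment in the code above notes that Windows backslashes must become forward slashes. A minimal sketch of doing that conversion in R instead of by hand follows; the helper name `to_forward_slashes` is hypothetical, and the path is the article's example path written in Windows style.

```r
# Hypothetical helper: convert a Windows-style path to forward slashes
# so that it can be passed to readRDS().
to_forward_slashes <- function(path) {
  gsub("\\", "/", path, fixed = TRUE)
}

# Inside an R string literal each backslash has to be written as "\\".
win_path <- "C:\\Users\\Administrator\\Desktop\\pbmc3k5K_final.rds"
rds_path <- to_forward_slashes(win_path)
print(rds_path)

# Read the file only if it exists; otherwise report the path to adjust.
if (file.exists(rds_path)) {
  pbmc.combined <- readRDS(file = rds_path)
} else {
  message("File not found, adjust the path: ", rds_path)
}
```

In R 4.0 or later, a raw string such as r"(C:\Users\Administrator\Desktop)" can be pasted without doubling the backslashes and then passed to the same helper.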
The clustering resolution in this saved result is too fine for the subclustering demonstration that follows, so we first lower the resolution and "pretend" that the result below is the one we obtained:
# Lower the resolution to suit the subsequent analysis
pbmc.combined <- FindClusters(pbmc.combined, resolution = 0.1)
DimPlot(pbmc.combined, reduction = "umap", label = TRUE)
A reminder: the resolution is lowered here only for teaching purposes. When analyzing your own single-cell data, adjust it to the resolution that best suits your data!
2. Extract the desired cell population
Next, we extract the cells that we are interested in and want to analyze further. Suppose we are particularly interested in the cluster labeled "1"; we can extract those cells with the subset() function.
# Extract the cell cluster whose identity is "1"
cluster1 <- subset(x = pbmc.combined, idents = c("1"))

# Check the extraction result
DimPlot(cluster1, reduction = "umap", label = TRUE)
Briefly check that the extracted cell population is the right one. If there is no problem, proceed to the next step.
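Besides looking at the plot, the extraction can be checked programmatically. The sketch below is one possible check, not part of the original workflow: the helper name `subset_matches_ident` is hypothetical, and it assumes the Seurat package is installed and that both arguments are Seurat objects.

```r
# Hypothetical helper: TRUE when `child` holds exactly the cells that carry
# identity `ident` in `parent`. Assumes both are Seurat objects.
subset_matches_ident <- function(parent, child, ident) {
  expected_cells <- Seurat::WhichCells(parent, idents = ident)
  setequal(expected_cells, colnames(child))
}

# Usage with the objects from this article (not run here):
# subset_matches_ident(pbmc.combined, cluster1, ident = "1")
# table(Seurat::Idents(cluster1))  # only identity "1" should have cells
```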
3. Data integration (Seurat integration) and PCA analysis
The next analysis follows the same basic workflow as Seurat integration. First, switch the default assay to RNA and rerun the data integration (Seurat integration) for the extracted cluster from the original RNA data. Then scale the data, run PCA for dimensionality reduction, determine the number of dimensions to use, and cluster the cells.
# Switch the assay to RNA so that the integration is rerun from the original RNA data
DefaultAssay(cluster1) <- "RNA"
cluster1

# Split the object by sample, then normalize each dataset and select highly variable features
cluster1.list <- SplitObject(cluster1, split.by = "orig.ident")
cluster1.list
cluster1.list <- lapply(X = cluster1.list, FUN = function(x) {
  x <- NormalizeData(x, normalization.method = "LogNormalize", scale.factor = 10000)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# Select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = cluster1.list)

# Find integration anchors
cluster1.anchors <- FindIntegrationAnchors(object.list = cluster1.list, dims = 1:15, anchor.features = features)

# Create the integrated data assay
cluster1.combined <- IntegrateData(anchorset = cluster1.anchors, dims = 1:15)

# Run the downstream analysis on the integrated assay.
# The original unmodified data still exist in the RNA assay and can be
# selected with DefaultAssay(cluster1.combined) <- "RNA"
DefaultAssay(cluster1.combined) <- "integrated"
cluster1.combined <- ScaleData(cluster1.combined, verbose = TRUE)
cluster1.combined <- RunPCA(cluster1.combined, npcs = 50, verbose = TRUE)

# Determine the dimensionality of the dataset
ElbowPlot(cluster1.combined, ndims = 50)

# Cluster the cells (dims 1:15)
cluster1.combined <- FindNeighbors(cluster1.combined, reduction = "pca", dims = 1:15)
cluster1.combined <- FindClusters(cluster1.combined, resolution = 0.2)
table(cluster1.combined@active.ident)
table(Idents(cluster1.combined), cluster1.combined$orig.ident)
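The last table() call gives raw cell counts per subcluster and sample. Because the two datasets differ in size, proportions within each sample can be easier to compare. The sketch below shows the idea on a toy metadata table; the sample names and cluster assignments are made up for illustration, and with the real object the table would come from cluster1.combined@meta.data, where FindClusters() stores its result in the seurat_clusters column.

```r
# Toy stand-in for a Seurat metadata table; all values are made up.
meta <- data.frame(
  seurat_clusters = factor(c("0", "0", "1", "2", "0", "1", "1", "2")),
  orig.ident      = c("pbmc3k", "pbmc3k", "pbmc3k", "pbmc3k",
                      "pbmc5k", "pbmc5k", "pbmc5k", "pbmc5k")
)

# Cell counts per subcluster (rows) and sample (columns).
counts <- table(meta$seurat_clusters, meta$orig.ident)

# Column-wise proportions: within each sample the values sum to 1.
props <- prop.table(counts, margin = 2)
print(round(props, 2))
```

A subcluster whose proportion differs strongly between samples may reflect a real biological difference or a residual batch effect, so it is worth inspecting before interpretation.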
At a resolution of 0.2, the extracted cluster is further divided into three subgroups. The number of subgroups can be adjusted through the resolution parameter of FindClusters(), and clustree can also be used to help choose it. For details, see "Clustree (0.5.0) - A good thing for clustering hierarchies!"
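To see how the resolution setting changes the number of subgroups, the clustering can be repeated over a few values. This is a rough sketch, not the author's procedure: the helper name `count_clusters_by_resolution` and the resolution values are hypothetical, and it assumes Seurat is installed, FindNeighbors() has already been run, and the default assay is the one whose neighbor graph should be used.

```r
# Hypothetical helper: cluster a Seurat object at several resolutions and
# report how many clusters each resolution yields. The input object is
# not modified; each run works on a copy.
count_clusters_by_resolution <- function(obj, resolutions) {
  n_clusters <- vapply(resolutions, function(res) {
    clustered <- Seurat::FindClusters(obj, resolution = res, verbose = FALSE)
    nlevels(Seurat::Idents(clustered))
  }, integer(1))
  data.frame(resolution = resolutions, n_clusters = n_clusters)
}

# Usage with the object from this article (not run here):
# count_clusters_by_resolution(cluster1.combined, c(0.1, 0.2, 0.4, 0.8))
```

The resulting table is only a starting point; whether a given number of subgroups is meaningful still depends on the markers and the biology.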
4. Nonlinear Dimensionality Reduction Analysis Method (UMAP/tSNE)
The next step is nonlinear dimensionality reduction for visualization. The code below runs both UMAP and tSNE; the illustration uses the UMAP plot as an example.
# UMAP (dims 1:15)
cluster1.combined <- RunUMAP(cluster1.combined, reduction = "pca", dims = 1:15)
DimPlot(cluster1.combined, reduction = "umap", group.by = "orig.ident")
DimPlot(cluster1.combined, reduction = "umap")
DimPlot(cluster1.combined, reduction = "umap", split.by = "orig.ident")
DimPlot(cluster1.combined, reduction = "umap", label = TRUE, split.by = "orig.ident")

# tSNE (dims 1:15)
cluster1.combined <- RunTSNE(cluster1.combined, reduction = "pca", dims = 1:15)
DimPlot(cluster1.combined, reduction = "tsne", group.by = "orig.ident")
DimPlot(cluster1.combined, reduction = "tsne")
DimPlot(cluster1.combined, reduction = "tsne", split.by = "orig.ident")
DimPlot(cluster1.combined, reduction = "tsne", label = TRUE, split.by = "orig.ident")
5. Analysis of differentially expressed genes
Once all the results have been confirmed, the analysis of differentially expressed genes (DEGs) can be carried out. Because the integrated assay is intended mainly for integration, my habit is to switch the assay back to RNA for the DEG analysis, normalize the RNA assay again, select highly variable features, and scale the data, and then use the FindAllMarkers() function to find the differentially expressed genes in each subcluster.
# Differentially expressed gene analysis
DefaultAssay(cluster1.combined) <- "RNA"
cluster1.combined <- NormalizeData(cluster1.combined, normalization.method = "LogNormalize", scale.factor = 10000)
cluster1.combined <- FindVariableFeatures(cluster1.combined, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(cluster1.combined)
cluster1.combined <- ScaleData(cluster1.combined, features = all.genes)
cluster1.combined.markers <- FindAllMarkers(cluster1.combined, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
View(cluster1.combined.markers)

# Take the 10 markers with the highest avg_log2FC in each cluster and draw a heatmap
top10 <- cluster1.combined.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(cluster1.combined, features = top10$gene, label = FALSE)

# Save the most recent plot
library(ggplot2)
ggsave(path = "C:/Users/Administrator/Desktop", filename = "Heatmaptop.jpeg", width = 40, height = 20, dpi = 300, units = "cm", limitsize = FALSE)
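The code above ranks markers by avg_log2FC only. The sketch below shows, on a toy table, one way to also drop markers that fail an adjusted p-value cutoff before taking the top genes per cluster. The filtering step, the helper name `top_markers`, the 0.05 cutoff, and all gene names and numbers are assumptions made for illustration; only the column names follow the FindAllMarkers() output.

```r
# Toy stand-in for FindAllMarkers() output; genes and numbers are made up.
markers <- data.frame(
  cluster    = factor(c("0", "0", "0", "1", "1", "1")),
  gene       = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"),
  avg_log2FC = c(1.8, 0.6, 2.4, 0.9, 3.1, 1.2),
  p_val_adj  = c(0.001, 0.20, 0.0005, 0.03, 0.0001, 0.60)
)

# Hypothetical helper: keep markers below the adjusted p-value cutoff, then
# the n markers with the largest avg_log2FC within each cluster.
top_markers <- function(markers, n = 10, max_p_adj = 0.05) {
  kept <- markers[markers$p_val_adj < max_p_adj, , drop = FALSE]
  per_cluster <- lapply(split(kept, kept$cluster), function(df) {
    head(df[order(df$avg_log2FC, decreasing = TRUE), , drop = FALSE], n)
  })
  do.call(rbind, unname(per_cluster))
}

top2 <- top_markers(markers, n = 2)
print(top2[, c("cluster", "gene", "avg_log2FC")])
```

With the real results, top_markers(cluster1.combined.markers, n = 10)$gene could be passed to the features argument of DoHeatmap() in place of top10$gene.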
Epilogue
In this article we focused on the methods and steps for subclustering across multiple single-cell datasets. The parameter values used in each function depend entirely on the analyst's own consideration and judgment of the data. Analysts should choose functions and parameter settings that fit the characteristics of the data, the experimental design, and the research purpose, and avoid blindly following the parameters used in tutorials.
References
1. Malte D Luecken and Fabian J Theis, Current best practices in single‐cell RNA‐seq analysis: a tutorial, Mol Syst Biol. 2019 Jun; 15(6): e8746.
2. Tim Stuart, Andrew Butler, Paul Hoffman, Christoph Hafemeister et al., Comprehensive integration of single-cell data, Cell. 2019 Jun 13;177(7):1888-1902.e21.
3. Seurat official website: https://satijalab.org/seurat/articles/integration_introduction.html