Article directory
- foreword
- How to download the Seurat suite?
- sample data
- 1. Load the suite and load the PBMC dataset
- 2. Filter out low-quality cells
- 3. Normalization
- 4. Select highly variable features (Highly Variable Features, HVGs)
- 5. Feature scaling (Scaling data)
- 6. Linear dimensional reduction analysis (linear dimensional reduction)
- 7. Determine the dimensions of the data
- 8. Cell Clustering
- 9. Nonlinear Dimensionality Reduction Analysis Method (UMAP/tSNE)
- 10. Analysis of differentially expressed genes
- 11. Visualize the results
- 12. Identification of the cell type of each cluster
- 13. save
- epilogue
- references
foreword
Seurat is an R suite for single-cell data analysis, originally developed bySatija LabDeveloped in 2015, it is one of the first R suites for the integration, analysis and visualization of single-cell RNA-seq data, and has been developed to the V5 beta version.
How to download the Seurat suite?
First, you must be on the computerInstall R language and RStudio, and then open the RStudio window and enter the command to install the Seurat suite:
# download remotes package install.packages("remotes") # download Rtools4.3 https://cran.r-project.org/bin/windows/Rtools/ # download Seurat remotes::install_github("satijalab/seurat", " seurat5", quiet = TRUE)
If you want to download an earlier version of the Seurat suite, you can do so in the following ways:
# Download remotes package install.packages('remotes') # Example Replace '2.3.0' with your desired version remotes::install_version(package = 'Seurat', version = package_version('2.3.0')) library(Seurat)
sample data
In this example, we will analyze the peripheral blood (PBMC) single cell data set from 10X Genomics, which is a data set containing 2700 single cells sequenced by Illumina NextSeq 500. The raw data can bedownload here.
1. Load the suite and load the PBMC dataset
# load package library(Seurat) library(dplyr) library(patchwork) # check version packageVersion("Seurat") # load PBMC data set, need to use the correct path in the computer, you must change "\" to "/" pbmc .data <- Read10X(data.dir = "C:/Users/Administrator/Desktop/hg19") # Create Seurat object pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200) pbmc
use hereCreateSeuratObject()
The function imports data, where the counts parameter is the input data, the project parameter is the name used to identify the Seurat object, and the min.cells and min.features parameters are used to filter out some low-quality cells and genes. Then we can continue further analysis, the analysis method used here is the standard analysis method.
2. Filter out low-quality cells
Low-quality or dying cells often have extensive mitochondrial contamination, so we can usePercentageFeatureSet()
function to calculate the indicators of mitochondria, and further remove abnormal cells, so as not to affect subsequent data analysis.As for the cut off value of the removal parameter, it is up to the researcher.
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
The [[ ]] notation adds columns to the object data, in this example,PBMC
After generating the Seurat object, by running the above command, the QC index of mitochondrial gene composition can be added to thePBMC
in the data.If it is a human dataset, pattern is "^MT-"; if it is a mouse dataset, pattern is set to "^mt-".
Then execute the QC indicator visualization command:
# uses a violin plot to visualize QC indicators VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
- nFeature_RNA: The number of genes per cell, most cells have about 1000 genes.
- nCount_RNA: The total UMI count (Unique Molecular Identifier, which is a quantitative marker of gene expression) of each cell, and the count of most cells falls around 2300.
- percent.mt: The proportion of mitochondrial genes (total UMI count of mitochondrial genes ÷ total UMI count of cells), the proportion of mitochondria in most cells falls below 5%.
Finally, the Seurat paradigm screening condition is that cells with nFeature_RNA between 200 and 2500 and mitochondrial gene ratio less than 5% are used for subsequent analysis.
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) pbmc
After performing the filtering conditions, there were 2638 cells left to be used in the subsequent analysis.
3. Normalization
In single-cell RNA data analysis, due to factors such as differences in RNA molecule counts and batch effects between different cells, it is necessary to standardize the raw data for subsequent analysis. In Seurat, the commonly used normalization method is to perform total normalization on the gene expression data (UMI count) of each cell, that is, divide the count value of all genes in each cell by the total count of the cell, and multiply the obtained quotient by By a scaling factor (usually choose 10000). This process can be achieved through the NormalizeData() function in Seurat. After normalization, the gene expression data of all cells are at the same magnitude, which is convenient for subsequent analysis.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) # If all parameters are default values, it can be abbreviated as the following command # pbmc <- NormalizeData(pbmc)
4. Select highly variable features (Highly Variable Features, HVGs)
In the analysis of single-cell RNA data, the expression pattern of each gene can be different. The expression of some genes may vary slightly between cells, while the expression of some genes may vary greatly from cell to cell. For the latter, differences in the expression of these genes may correlate with differences in different cell types or states. Therefore, in the analysis of single-cell RNA data, it is necessary to identify highly differential genes and use them for downstream clustering and differential expression analysis. In Seurat, HVGs can be selected by the FindVariableFeatures() function, which identifies HVGs by calculating the mean expression and variance-to-mean ratio of each gene. By selecting these HVGs, calculations for genes that are not differential can be reduced, thereby improving the efficiency and accuracy of downstream analysis.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000) # View the 10 genes with the highest variation top10 <- head(VariableFeatures(pbmc), 10) # Draw the marked 10 hypervariable genes plot1 <- VariableFeaturePlot(pbmc) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) plot1 + plot2
5. Feature scaling (Scaling data)
Feature scaling is an important step used to adjust the magnitude of differences in gene expression between cells to better capture differences between cells in subsequent analyses. Feature scaling is transformed by z-score, which makes each gene have a mean of 0 and a standard deviation of 1. In Seurat, usescale. data()
function to achieve this step.
all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features = all.genes)
6. Linear dimensional reduction analysis (linear dimensional reduction)
Linear dimensionality reduction refers to a data dimensionality reduction method that converts high-dimensional data into low-dimensional data. By mapping the original high-dimensional feature space to a low-dimensional feature space, redundant features can be removed while retaining useful features. information. In the analysis of single-cell RNA data, linear dimensionality reduction can be used to explore the main variation direction of the data and find the similarities and differences between cells. In Seurat we can use RunPCA()
function (PCA, Principle Component Analysis) to perform PCA dimension reduction, you can set the verbose parameter to TRUE to view more information about PCA calculations, and use VizDimReduction(), DimPlot()
as well asDimHeatmap()
function to visualize the results of PCA.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) # View the results of PCA, here we take DimHeatmap as an example # VizDimLoadings(pbmc, dims = 1:2, reduction = "pca") # DimPlot(pbmc, reduction = "pca") DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE) # call the result of PC1 to PC15 DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
7. Determine the dimensions of the data
Seurat clusters cells based on their PCA scores, where each PC basically represents a principal component, which combines information from related feature sets. Next, how many PCA principal components should we choose for analysis? There are two methods in the Seurat suite to determine the choice of PCA principal components, one isJackStraw()
function, and the other isElbowplot()
function, here I will useElbowplot()
function as an example:
ElbowPlot (pbmc)
As can be seen from the results in the figure, there is an inflection point (elbow) near PC9-10, which indicates that most of the real signal is captured in the first 10 PCs, so 10 PCA principal components can be selected as parameters for subsequent analysis .
8. Cell Clustering
Cell clustering refers to the analytical process of grouping together similar cell populations in single-cell RNA-seq data. Through cell aggregation, a large amount of single-cell data can be divided into different subpopulations, so as to better understand the heterogeneity of cells.
pbmc <- FindNeighbors(pbmc, dims = 1:10) pbmc <- FindClusters(pbmc, resolution = 0.5) # View IDs of the first five cells head(Idents(pbmc), 5)
dims = 1:10 means that the first ten principal components are selected for cell grouping,Don't forget to add ":", which means that 1-10 PCA are selected!!!
9. Nonlinear Dimensionality Reduction Analysis Method (UMAP/tSNE)
Nonlinear dimensionality reduction analysis is a method to convert high-dimensional data into low-dimensional data, and preserve the local structure and relative distance of the data as much as possible. In single-cell RNA data, UMAP (Uniform Manifold Approximation and Projection) and tSNE (t-Distributed Stochastic Neighbor Embedding) are commonly used non-linear dimensionality reduction analysis methods for mapping data to two-dimensional or three-dimensional space for visual quantify and understand relationships between data.
# Use UMAP clustering pbmc <- RunUMAP(pbmc, dims = 1:10) DimPlot(pbmc, reduction = "umap") # Display cluster labels DimPlot(pbmc, reduction = "umap", label = TRUE) # Use TSNE aggregation Class pbmc <- RunTSNE(pbmc, dims = 1:10) DimPlot(pbmc, reduction = "tsne") # Display cluster labels DimPlot(pbmc, reduction = "tsne", label = TRUE)
10. Analysis of differentially expressed genes
Differentially expressed gene analysis (DEGs, Differentially expression genes analysis) refers to dividing cells into different clusters (clusters), and comparing the expression of genes in these clusters, so as to determine which genes have significant differential expression in different clusters.
# Find all the characteristic genes in cluster 2 cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25) head(cluster2.markers, n = 5) # Find out the difference between cluster 5 and cluster 0 and All feature genes of cluster 3 cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25) head(cluster5.markers, n = 5) # find List all cluster feature genes, and only report positive feature genes (only positive markers) pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) pbmc.markers % >% group_by(cluster) %>% slice_max(n = 2, order_by = avg_log2FC)
11. Visualize the results
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# Use counts to draw the graph and take the log value VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
# heatmap(heatmap) top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC) DoHeatmap(pbmc, features = top10$gene) + NoLegend()
12. Identification of the cell type of each cluster
If there are specific genes that distinguish each type of cell, then the step of identifying the cell type can then be performed directly.Genes specific to different cell types can use databases, literature searches, or the R suite to assist in cell type annotation.In the entire single-cell data analysis process, this step is the most time-consuming, because once the judgment is wrong, it is basically equivalent to doing it all over again.
After the cell type is identified, the name of the cell can be labeled.
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet ") names(new.cluster.ids) <- levels(pbmc) pbmc <- RenameIdents(pbmc, new.cluster.ids) DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
13. save
# needs to use the correct path in the computer, "\" must be changed to "/" saveRDS(pbmc, file = "C:/Users/Administrator/Desktop/pbmc3k_final.rds")
epilogue
Seurat is a very powerful single-cell analysis R suite, and it can be said that it is the big brother in single-cell analysis. But whenIn the process of analysis, it is the homework that researchers must do to understand the background, principle and setting considerations of various parameters, so that they will not realize that they have done a lonely job at the end of the analysis..
references
1. Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5), 411–420.
2. Seurat official website https://satijalab.org/seurat/index.html