Article directory
- foreword
- Fundamentals of DoubletFinder
- How to download the DoubletFinder suite?
- sample data
- 1. Load the suite and load the PBMC dataset
- 2. Filter out low-quality cells
- 3. Perform standard single-cell analysis workflow
- 4. Identification of pK values
- 5. Estimating the proportion of homotypic doublets
- 6. Identification of Double Cells
- 7. Removal of double cells
- 8. Precautions for using the DoubletFinder suite
- external link
- references
foreword
Single Cell RNA Sequencing (scRNA-seq) can reveal the heterogeneity of gene expression in single cells. However, this technique faces a common problem, that is, the interference of doublets. Double cells are mixtures of RNA from two or more different cells, the presence of which can lead to inaccurate interpretation of gene expression. Therefore, we can use the DoubletFinder suite to remove these double cells to obtain more accurate and cleaner scRNA-seq data.
Fundamentals of DoubletFinder
The four main steps of the DoubletFinder algorithm:
(1) Generate artificial doublets: DoubletFinder creates artificial doublets by combining the gene expression profiles of randomly selected real single cells. This step aims to model the gene expression patterns of potential doublet cells.
(2) Merge real and artificial data: Combine artificial double-cell and real single-cell data to create a merged dataset, and then preprocess this merged dataset, including normalization and standardization, to prepare for subsequent analysis.
(3) Perform PCA and calculate pANN: Apply Principal Component Analysis (PCA) to the merged dataset to reduce the dimensionality of the data. Use the principal component distance matrix to find the proportion of artificial k-nearest neighbors (pANN) for each cell. This metric estimates the likelihood of a cell being a double cell based on its similarity to an artificial double cell.
(4) Ranking and thresholding pANN values: Ranking and thresholding the pANN values, cells with pANN values above the threshold are considered as potential double cells, while those below the threshold are considered as single cells.
How to download the DoubletFinder suite?
# Download DoubletFinder package remotes::install_github('chris-mcginnis-ucsf/DoubletFinder') # Check version packageVersion("DoubletFinder")
This teaching uses theDoubletFinder
v2.0.3 version, noremotes
Kit friends can useinstall.packages("remotes")
Download the package first.
sample data
In this example, we analyze the peripheral blood (PBMC) single cell data set from 10X Genomics, which is a data set containing 11,439 single cells sequenced by Illumina NovaSeq 6000. The raw data can bedownload here(Feature / cell matrix (filtered)). If you want to know the detailed Seurat single cell analysis steps, you can refer toSeurat V4.9.9 – A powerful R suite for single-cell analysisteaching.
1. Load the suite and load the PBMC dataset
# load package library(Seurat) library(ggplot2) library(tidyverse) library(DoubletFinder) # load PBMC data set, need to use the correct path in the computer, you must change "\" to "/" pbmc.data <- Read10X (data.dir = "C:/Users/USER/Desktop/filtered_feature_bc_matrix") # Create Seurat object pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc", min.cells = 3, min.features = 100 ) pbmc
use hereCreateSeuratObject()
The function imports data, where the counts parameter is the input data, the project parameter is the name used to identify the Seurat object, and the min.cells and min.features parameters are used to filter out some low-quality cells and genes. Then we can continue to analyze further, the analysis method used here is the standard analysis method.
2. Filter out low-quality cells
Low-quality or dying cells often have extensive mitochondrial contamination, so we can usePercentageFeatureSet()
function to calculate the indicators of mitochondria, and further remove abnormal cells, so as not to affect subsequent data analysis.As for the cut off value of the removal parameter, it is up to the researcher.
# Filter low-quality cells pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # Use violin plot to visualize QC indicators VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) # filter cells pbmc <- subset(pbmc, subset = nCount_RNA > 800 & nFeature_RNA > 500 & percent.mt < 10) pbmc
After performing the filter conditions, there were 9,983 cells left for subsequent analysis.
3. Perform standard single-cell analysis workflow
Before performing DoubletFinder analysis, the standard single-cell analysis process must be performed first:
# for standard analysis process pbmc <- NormalizeData(object = pbmc) pbmc <- FindVariableFeatures(object = pbmc) pbmc <- ScaleData(object = pbmc) pbmc <- RunPCA(object = pbmc) ElbowPlot(pbmc) pbmc <- FindNeighbors( object = pbmc, dims = 1:20) pbmc <- FindClusters(object = pbmc) pbmc <- RunUMAP(object = pbmc, dims = 1:20) DimPlot(pbmc, reduction = "umap", label = T)
4. Identification of pK values
This parameter is for processed scRNA-seq data pbmc
, carry out pK (PC neighborhood size) identification, and use parametric search to estimate the best pK value. The explanation is as follows:
paramSweep_v3(
, PCs = 1:20, sct = FALSE): This function is used to perform a parameter search, looking for a range of the specified number of principal components (1 to 20 principal components), using preprocessed scRNA-seq datapbmc
pbmc
Perform PCA with different parameter combinations.parametersct = FALSE
Indicates that SCTransform is not used for analysis.summarizeSweep(
, GT = FALSE): This function is used to summarize the results of the parameter search, returning statistics for different parameter combinations under each number of principal components.parametersweep.res.list_pbmc
GT = FALSE
Indicates that there is no ground-truth data for comparison, that is, there is no known double-cell information.find.pK(sweep.stats_pbmc)
: This function is used to estimate the best pK value based on the statistical results of the parameter search. pK is an important parameter used in DoubletFinder to calculate the pANN of each cell.find.pK()
The function finds the optimal pK value under different principal component numbers by estimating the BCmvn (two-cell detection performance indicator) value corresponding to different pK values.
# pK value identification (no ground-truth) sweep.res.list_pbmc <- paramSweep_v3(pbmc, PCs = 1:20, sct = FALSE) sweep.stats_pbmc <- summarizeSweep(sweep.res.list_pbmc, GT = FALSE) bcmvn_pbmc < - find.pK(sweep.stats_pbmc) # select pK with maximum BCmvn (two-cell assay performance metric) value ggplot(bcmvn_pbmc, aes(pK, BCmetric, group = 1)) + geom_point() + geom_line() pK <- bcmvn_pbmc %>% filter(BCmetric == max(BCmetric)) %>% select(pK) pK <- as.numeric(as.character(pK[[1]])) head(pK)
From the above figure we can see that the pK with the maximum BCmvn (two-cell detection performance indicator) value falls at the value of 0.14, so we will usepK=0.14
parameters for further analysis.
5. Estimating the proportion of homotypic doublets
# estimates the proportion of homotypic double cells annotations <- pbmc@meta.data$seurat_clusters homotypic.prop <- modelHomotypic(annotations) # assumes that the double cell formation rate is 7.5%, this value can be found from the protocol of 10X genomics nExp_poi <- round(0.075 *nrow(pbmc@meta.data)) nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop)) head(nExp_poi.adj)
The number of homotypic doublets estimated by the kit is 680, and then the doublet identification analysis can be carried out.
6. Identification of Double Cells
# run DoubletFinder pbmc <- doubletFinder_v3(pbmc, PCs = 1:20, pN = 0.25, pK = pK, nExp = nExp_poi.adj, reuse.pANN = FALSE, sct = FALSE) head(pbmc) # visualize double cells DimPlot( pbmc, reduction = 'umap', group.by = "DF.classifications_0.25_0.14_680") # check the result table of single cell and double cell(pbmc@meta.data$DF.classifications_0.25_0.14_680)
usehead()
The function checks the results of the first few lines in the pbmc data, and selectsDF.classifications_0.25_0.14_680
The results are visualized.
7. Removal of double cells
# Remove double cells pbmc_NoDoublets <- subset(pbmc, subset = DF.classifications_0.25_0.14_680 == "Singlet") DimPlot(pbmc_NoDoublets, reduction = "umap", group.by="DF.classifications_0.25_0.14_680" )pbmc_NoDoublets
After removing the double cells, you can continue to do subsequent single cell data analysis!
8. Precautions for using the DoubletFinder suite
- Do not use DoubletFinder forIntegrated scRNA-seq data from multiple diverse samples. Running the DoubletFinder suite on a single sample and then bringing all the data together is the correct analytical workflow.
- Ensure that low-quality cell populations are not included in the input data, data can be filtered and screened according to the following methods:
- Data were preprocessed using standard analysis procedures.
- Filter cell populations with (A) cells with very low RNA UMIs, (B) high proportion of mitochondrial reads, and/or (C) marker genes that are not informative.
external link
references
1. Christopher S McGinnis, Lyndsay M Murrow, Zev J Gartner. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst. 2019 Apr 24;8(4):329-337.e4.
2. DoubletFinder – GitHub teaching official website