Use DoubletFinder to remove double cells in single cell data

foreword

Single Cell RNA Sequencing (scRNA-seq) can reveal the heterogeneity of gene expression in single cells. However, this technique faces a common problem, that is, the interference of doublets. Double cells are mixtures of RNA from two or more different cells, the presence of which can lead to inaccurate interpretation of gene expression. Therefore, we can use the DoubletFinder suite to remove these double cells to obtain more accurate and cleaner scRNA-seq data.

Fundamentals of DoubletFinder

The four main steps of the DoubletFinder algorithm:

(1) Generate artificial doublets: DoubletFinder creates artificial doublets by combining the gene expression profiles of randomly selected real single cells. This step aims to model the gene expression patterns of potential doublet cells.

(2) Merge real and artificial data: Combine artificial double-cell and real single-cell data to create a merged dataset, and then preprocess this merged dataset, including normalization and standardization, to prepare for subsequent analysis.

(3) Perform PCA and calculate pANN: Apply Principal Component Analysis (PCA) to the merged dataset to reduce the dimensionality of the data. Use the principal component distance matrix to find the proportion of artificial k-nearest neighbors (pANN) for each cell. This metric estimates the likelihood of a cell being a double cell based on its similarity to an artificial double cell.

(4) Ranking and thresholding pANN values: Ranking and thresholding the pANN values, cells with pANN values above the threshold are considered as potential double cells, while those below the threshold are considered as single cells.

How to download the DoubletFinder suite?

# Download DoubletFinder package remotes::install_github('chris-mcginnis-ucsf/DoubletFinder') # Check version packageVersion("DoubletFinder")

This teaching uses theDoubletFinder v2.0.3 version, noremotesKit friends can useinstall.packages("remotes")Download the package first.

sample data

In this example, we analyze the peripheral blood (PBMC) single cell data set from 10X Genomics, which is a data set containing 11,439 single cells sequenced by Illumina NovaSeq 6000. The raw data can bedownload here(Feature / cell matrix (filtered)). If you want to know the detailed Seurat single cell analysis steps, you can refer toSeurat V4.9.9 – A powerful R suite for single-cell analysisteaching.

1. Load the suite and load the PBMC dataset

# load package library(Seurat) library(ggplot2) library(tidyverse) library(DoubletFinder) # load PBMC data set, need to use the correct path in the computer, you must change "\" to "/" pbmc.data <- Read10X (data.dir = "C:/Users/USER/Desktop/filtered_feature_bc_matrix") # Create Seurat object pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc", min.cells = 3, min.features = 100 ) pbmc

use hereCreateSeuratObject()The function imports data, where the counts parameter is the input data, the project parameter is the name used to identify the Seurat object, and the min.cells and min.features parameters are used to filter out some low-quality cells and genes. Then we can continue to analyze further, the analysis method used here is the standard analysis method.

2. Filter out low-quality cells

Low-quality or dying cells often have extensive mitochondrial contamination, so we can usePercentageFeatureSet()function to calculate the indicators of mitochondria, and further remove abnormal cells, so as not to affect subsequent data analysis.As for the cut off value of the removal parameter, it is up to the researcher.

# Filter low-quality cells pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") # Use violin plot to visualize QC indicators VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3) # filter cells pbmc <- subset(pbmc, subset = nCount_RNA > 800 & nFeature_RNA > 500 & percent.mt < 10) pbmc

After performing the filter conditions, there were 9,983 cells left for subsequent analysis.

3. Perform standard single-cell analysis workflow

Before performing DoubletFinder analysis, the standard single-cell analysis process must be performed first:

# for standard analysis process pbmc <- NormalizeData(object = pbmc) pbmc <- FindVariableFeatures(object = pbmc) pbmc <- ScaleData(object = pbmc) pbmc <- RunPCA(object = pbmc) ElbowPlot(pbmc) pbmc <- FindNeighbors( object = pbmc, dims = 1:20) pbmc <- FindClusters(object = pbmc) pbmc <- RunUMAP(object = pbmc, dims = 1:20) DimPlot(pbmc, reduction = "umap", label = T)
The legend is the UMAP result graph

4. Identification of pK values

This parameter is for processed scRNA-seq data pbmc, carry out pK (PC neighborhood size) identification, and use parametric search to estimate the best pK value. The explanation is as follows:

  1. paramSweep_v3( pbmc, PCs = 1:20, sct = FALSE): This function is used to perform a parameter search, looking for a range of the specified number of principal components (1 to 20 principal components), using preprocessed scRNA-seq datapbmcPerform PCA with different parameter combinations.parameter sct = FALSE Indicates that SCTransform is not used for analysis.
  2. summarizeSweep( sweep.res.list_pbmc, GT = FALSE): This function is used to summarize the results of the parameter search, returning statistics for different parameter combinations under each number of principal components.parameter GT = FALSE Indicates that there is no ground-truth data for comparison, that is, there is no known double-cell information.
  3. find.pK(sweep.stats_pbmc): This function is used to estimate the best pK value based on the statistical results of the parameter search. pK is an important parameter used in DoubletFinder to calculate the pANN of each cell.find.pK()The function finds the optimal pK value under different principal component numbers by estimating the BCmvn (two-cell detection performance indicator) value corresponding to different pK values.
# pK value identification (no ground-truth) sweep.res.list_pbmc <- paramSweep_v3(pbmc, PCs = 1:20, sct = FALSE) sweep.stats_pbmc <- summarizeSweep(sweep.res.list_pbmc, GT = FALSE) bcmvn_pbmc < - find.pK(sweep.stats_pbmc) # select pK with maximum BCmvn (two-cell assay performance metric) value ggplot(bcmvn_pbmc, aes(pK, BCmetric, group = 1)) + geom_point() + geom_line() pK <- bcmvn_pbmc %>% filter(BCmetric == max(BCmetric)) %>% select(pK) pK <- as.numeric(as.character(pK[[1]])) head(pK)

From the above figure we can see that the pK with the maximum BCmvn (two-cell detection performance indicator) value falls at the value of 0.14, so we will usepK=0.14parameters for further analysis.

5. Estimating the proportion of homotypic doublets

# estimates the proportion of homotypic double cells annotations <- pbmc@meta.data$seurat_clusters homotypic.prop <- modelHomotypic(annotations) # assumes that the double cell formation rate is 7.5%, this value can be found from the protocol of 10X genomics nExp_poi <- round(0.075 *nrow(pbmc@meta.data)) nExp_poi.adj <- round(nExp_poi*(1-homotypic.prop)) head(nExp_poi.adj)

The number of homotypic doublets estimated by the kit is 680, and then the doublet identification analysis can be carried out.

6. Identification of Double Cells

# run DoubletFinder pbmc <- doubletFinder_v3(pbmc, PCs = 1:20, pN = 0.25, pK = pK, nExp = nExp_poi.adj, reuse.pANN = FALSE, sct = FALSE) head(pbmc) # visualize double cells DimPlot( pbmc, reduction = 'umap', group.by = "DF.classifications_0.25_0.14_680") # check the result table of single cell and double cell(pbmc@meta.data$DF.classifications_0.25_0.14_680)

usehead()The function checks the results of the first few lines in the pbmc data, and selectsDF.classifications_0.25_0.14_680The results are visualized.

7. Removal of double cells

# Remove double cells pbmc_NoDoublets <- subset(pbmc, subset = DF.classifications_0.25_0.14_680 == "Singlet") DimPlot(pbmc_NoDoublets, reduction = "umap", group.by="DF.classifications_0.25_0.14_680" )pbmc_NoDoublets

After removing the double cells, you can continue to do subsequent single cell data analysis!

8. Precautions for using the DoubletFinder suite

  1. Do not use DoubletFinder forIntegrated scRNA-seq data from multiple diverse samples. Running the DoubletFinder suite on a single sample and then bringing all the data together is the correct analytical workflow.
  2. Ensure that low-quality cell populations are not included in the input data, data can be filtered and screened according to the following methods:
    • Data were preprocessed using standard analysis procedures.
    • Filter cell populations with (A) cells with very low RNA UMIs, (B) high proportion of mitochondrial reads, and/or (C) marker genes that are not informative.

external link

references

1. Christopher S McGinnis, Lyndsay M Murrow, Zev J Gartner. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst. 2019 Apr 24;8(4):329-337.e4.

2. DoubletFinder – GitHub teaching official website

I am very grateful for your sharing!!!
MillionQuesn
Million Quesn

A foreigner living in Taiwan, sharing the highlights of a sudden flash of inspiration.

Articles: 46

Leave a Reply

Your email address will not be published. Required fields are marked *