Introduction to CellID

Akira Cortal

2020-07-02

Introduction

We present here some usage case of Cell-ID, a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell RNA-seq. Cell-ID allows unbiased cell identity recognition across different donors, tissues-of-origin, model organisms and single-cell omics protocols.

Installation

In order to install CellID, the user will have to set the repositories option to enable downloading Bioconductor dependencies.

Load library

Download data

To illustrate CellID usage we will use two different pancreatic data sets from Baron and Segerstolpe.

Preprocessing and Create Seurat Object

CellID can be either used on Seurat or SingleCellExperiment object, here before creating the Seurat object we filter the matrix genes with protein coding genes.

Running Standard Seurat Protocole

Before running CellID we perform standard seurat workflow as desribed here.

BaronSeurat <- NormalizeData(BaronSeurat)
BaronSeurat <- ScaleData(BaronSeurat, features = rownames(BaronSeurat))
#> Centering and scaling data matrix
BaronSeurat <- RunPCA(BaronSeurat, features = rownames(BaronSeurat))
#> PC_ 1 
#> Positive:  IFITM3, ZFP36L1, CDC42EP1, TMSB10, PMEPA1, YBX3, SOX4, IL32, TMSB4X, RHOC 
#>     S100A11, TACSTD2, TM4SF1, KRT7, IFITM2, CD44, FLNA, MSN, S100A16, MYH9 
#>     SDC4, LGALS3, SERPING1, PRSS8, SERPINA3, ITGA2, COL18A1, GSTP1, SAT1, DDIT4 
#> Negative:  PCSK1N, CPE, SCG5, PTPRN, CHGA, CHGB, GNAS, SCGN, SCG2, CD99 
#>     PAM, PEMT, VGF, PPP1R1A, BEX1, PCSK2, KIAA1324, TTR, PAX6, IDS 
#>     NLRP1, SLC30A8, EEF1A2, UCHL1, GAD2, APLP1, SYT7, NEUROD1, ERO1B, TMOD1 
#> PC_ 2 
#> Positive:  CELA3A, PRSS1, CELA3B, CTRC, CTRB1, PRSS2, CLPS, CPA1, PNLIPRP1, PLA2G1B 
#>     CPA2, CELA2A, CTRB2, CPB1, KLK1, SYCN, REG1A, PNLIP, REG1B, GP2 
#>     CUZD1, SPINK1, PRSS3, CTRL, CELA2B, FOXD2, REG3G, MGST1, PDIA2, AMY2A 
#> Negative:  PSAP, IGFBP7, COL18A1, APP, CALR, ITGB1, PKM, LGALS3BP, CD81, ITM2B 
#>     GNAI2, GRN, APLP2, CTSD, CDK2AP1, LAMP1, PXDN, SPARC, ATP1A1, MMP14 
#>     ITGA5, PLXNB2, TIMP1, FSTL1, COL4A1, CALD1, CTSB, NBL1, HSPG2, FLNA 
#> PC_ 3 
#> Positive:  SPARC, COL4A1, ENG, IGFBP4, COL6A2, PDGFRB, BGN, PXDN, COL15A1, COL3A1 
#>     COL1A2, CYGB, AEBP1, ADAMTS4, COL6A3, HTRA3, EMILIN1, LRRC32, CRISPLD2, COL1A1 
#>     COL5A3, C11orf96, LAMA4, MMP2, COL6A1, COL5A1, ITGA1, TIMP3, HIC1, COL5A2 
#> Negative:  KRT19, TACSTD2, KRT7, SPINT2, ELF3, KRT8, ANXA4, PRSS8, SERINC2, SPINT1 
#>     CDH1, LCN2, CLDN7, CLDN4, CFB, CD24, SERPINA5, CFTR, KRT18, CLDN1 
#>     MMP7, LGALS4, ST14, C3, SDC4, LAD1, S100A14, SLC4A4, PTPRF, ABCC3 
#> PC_ 4 
#> Positive:  INS, HADH, IAPP, PCSK1, ADCYAP1, TIMP2, IGF2, DLK1, GSN, NPTX2 
#>     TSPAN13, PDX1, VAT1L, RPS25, WLS, RPL7, RPL24, MAFA, SLC6A6, SYT13 
#>     ELMO1, DNAJB9, UCHL1, RPL3, RGS16, RPS6, RBP4, PKIB, TGFBR3, RPL17 
#> Negative:  GCG, GC, TM4SF4, CLU, TMEM176B, IRX2, TTR, TMEM176A, FEV, F10 
#>     FXYD5, CRYBA2, SMIM24, SERPINA1, FXYD3, PCSK2, ARX, ALDH1A1, LOXL4, B2M 
#>     PAPPA2, CD82, GPX3, FXYD6, GLS, MUC13, VSTM2L, EGFL7, RGS4, COTL1 
#> PC_ 5 
#> Positive:  COL6A3, COL1A2, PDGFRB, COL1A1, BGN, COL3A1, CRLF1, COL5A1, SFRP2, CYGB 
#>     THBS2, EMILIN1, COL6A1, FMOD, VCAN, COL6A2, PRRX1, COL5A2, COL5A3, ADAMTS12 
#>     CRISPLD2, LAMC3, AEBP1, NOTCH3, MXRA8, LUM, VSTM4, TPM2, INHBA, FN1 
#> Negative:  PLVAP, PECAM1, CD93, FLT1, KDR, SOX18, PODXL, ACVRL1, RGCC, VWF 
#>     ROBO4, TIE1, ESM1, ADGRL4, ESAM, ECSCR, S1PR1, SEMA3F, CXCR4, CLDN5 
#>     PASK, NOTCH4, ERG, DYSF, CDH5, CYP1A1, ANGPT2, BCL6B, IFI27, PTPRB
BaronSeurat <- RunUMAP(BaronSeurat, dims = 1:30)
#> Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
#> To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
#> This message will be shown once per session
#> 10:56:40 UMAP embedding parameters a = 0.9922 b = 1.112
#> 10:56:40 Read 8569 rows and found 30 numeric columns
#> 10:56:40 Using Annoy for neighbor search, n_neighbors = 30
#> 10:56:40 Building Annoy index with metric = cosine, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 10:56:41 Writing NN index file to temp file /tmp/Rtmp9xQMzV/file27ad36ef2b93
#> 10:56:41 Searching Annoy index using 1 thread, search_k = 3000
#> 10:56:44 Annoy recall = 100%
#> 10:56:45 Commencing smooth kNN distance calibration using 1 thread
#> 10:56:47 Initializing from normalized Laplacian + noise
#> 10:56:47 Commencing optimization for 500 epochs, with 352400 positive edges
#> 10:57:09 Optimization finished
BaronSeurat <- RunTSNE(BaronSeurat, dims = 1:30)
PCA <- DimPlot(BaronSeurat, reduction = "pca", group.by = "cell.type")  + ggtitle("PCA") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
tSNE <- DimPlot(BaronSeurat, reduction = "tsne", group.by = "cell.type")+ ggtitle("tSNE") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
UMAP <- DimPlot(BaronSeurat, reduction = "umap", group.by = "cell.type") + ggtitle("UMAP") + theme(legend.text = element_text(size =10), aspect.ratio = 1)
ggarrange(tSNE, UMAP, common.legend = T, legend = "top")

Running MCA

CellID is based on Multiple Correspondence Analysis (MCA), a dimensionality reduction methods that allows the representation of both cells and genes in a single euclidean space. Running MCA enables to compute afterwards the distance betwwen genes and cells to obtain a ranking of gene specificity for each cells. MCA can be Run like other dimensionality reduction methods in Seurat by using the command RunMCA. Using the DimPlotMC it is also possible to plot both cells and genes in the same space.

Using pre-established gene sets

After running MCA, the user can perform enrichment analysis on each cells in the data by performing Hypergeometric test. This enables using pre-established cell types gene sets in the litterature to predict cell types in the data. Here we will use the panglao database curated signature database to predict the cell types in the Baron data. We will use two different collection signatures: one with only known pancreatic cell types and one with all cell types.

Cell type predictions using preestablished gene sets

After downloading and formatting the signatures, we perform the Hypergeometric test to test each cells against each signatures. By assigning the most highly enriched signature to each cells, it is possible to determine the cell type composition of the dataset. The results quality is highly dependant on the quality of the signatures.

Gene Set Extraction from Reference dataset

CellID allows signatures extraction from data set both at group level and cell level. These signatures can later be used as reference gene sets. Here we use Baron as a Reference dataset to extract both cell signatures and cell type signatures, and we use the Segerstolpe dataset as a query dataset to prdict the cell type composition.

Predictions using cell type gene sets

Here we perform Hypergeometric test on Segerstolpe dataset using Baron cell type signatures. As we did for the pre-established signatures, we choose the signature with the highest enrichment for each cells to assign cell types to the query dataset.

Predictions using cell matching

Alternatively we can also use Baron individual cell signatures to match one Baron data cells to a Segerstolpe data cells and the transfer the label to predict cell types in Segerstolpe.

Functionnal Enrichment

Once MCA is performed, any type of signatures can be tested and that includes functionnal pathways. Here we use the Hallmark and KEGG pathways in HyperGeometric test and integrate the results into the Seurat object to visualise the -log10 pvalue of the enrichment into an UMAP.