Impact of similarity metrics on single-cell RNA-seq data clustering

T Kim, IR Chen, Y Lin, AYY Wang, JYH Yang, P Yang
Briefings in Bioinformatics, 2019 - academic.oup.com
Abstract
Advances in high-throughput sequencing of single-cell gene expression [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling of individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, there is currently no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics on clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson’s correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test whether the use of correlation-based metrics can benefit recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) to use Pearson’s correlation as a similarity measure and found a significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson’s correlation as a favourable choice. Further comparison of different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.
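The distinction between correlation-based and distance-based metrics can be illustrated with a toy example. The sketch below is not the authors' benchmarking framework. The function names, the made-up expression values and the use of 1 - r as a correlation-derived distance are all assumptions chosen for illustration. It shows one well-known property that separates the two families: Pearson's correlation is unchanged when a profile is multiplied by a positive constant (for example, a cell sequenced at greater depth), whereas Euclidean distance is not.

```python
import numpy as np


def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles."""
    return float(np.linalg.norm(x - y))


def pearson_distance(x, y):
    """Distance derived from Pearson's correlation as 1 - r (an assumed convention)."""
    r = np.corrcoef(x, y)[0, 1]
    return float(1.0 - r)


# Made-up expression values over five hypothetical genes.
cell_a = np.array([1.0, 5.0, 0.0, 3.0, 8.0])
cell_b = 4.0 * cell_a                          # same relative profile, 4x the counts
cell_c = np.array([6.0, 1.0, 7.0, 2.0, 1.0])   # a different profile at similar depth

d_euc_ab = euclidean_distance(cell_a, cell_b)
d_euc_ac = euclidean_distance(cell_a, cell_c)
d_cor_ab = pearson_distance(cell_a, cell_b)
d_cor_ac = pearson_distance(cell_a, cell_c)

print(f"Euclidean: d(a, b) = {d_euc_ab:.2f}, d(a, c) = {d_euc_ac:.2f}")
print(f"1 - Pearson: d(a, b) = {d_cor_ab:.2f}, d(a, c) = {d_cor_ac:.2f}")
```

In this toy setting, Euclidean distance places cell_a closer to cell_c than to its rescaled copy cell_b, while the correlation-derived distance does the opposite. Whether this property accounts for the performance differences reported in the abstract is not something the sketch can establish.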
Oxford University Press