Identification of differentially expressed gene sets using the Generalized Berk–Jones statistic

SM Gaynor, R Sun, X Lin, J Quackenbush - Bioinformatics, 2019 - academic.oup.com
Motivation
Cancer genomics studies frequently aim to identify genes that are differentially expressed between clinically distinct patient subgroups, generally by testing single genes one at a time. However, the results of any individual transcriptomic study are often not fully reproducible. A particular challenge impeding statistical analysis is the difficulty of distinguishing between differential expression comprising part of the genomic disease etiology and that induced by downstream effects. More robust analytical approaches that are well-powered to detect potentially causative genes, are less prone to discovering spurious associations, and can deliver reproducible findings across different studies are needed.
Results
We propose a set-based procedure for testing differential expression and show that this approach can produce more robust results by aggregating information across multiple, correlated genomic markers. Specifically, we adapt the Generalized Berk–Jones statistic to test for transcription factors that may contribute to the progression of estrogen receptor positive breast cancer. We demonstrate the ability of our method to produce reproducible findings by applying the same analysis to 21 publicly available datasets, which yields a similar list of significant transcription factors across most studies. Our Generalized Berk–Jones approach produces results that show improved consistency over three set-based testing algorithms: Generalized Higher Criticism, Gene Set Analysis and Gene Set Enrichment Analysis.
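To give a sense of how a set-based statistic aggregates evidence across the genes in a set, the following is a minimal sketch of the classical Berk–Jones statistic computed from per-gene z-scores. It assumes the z-scores are independent and standard normal under the null, which is the simplification that the Generalized Berk–Jones statistic is designed to relax for correlated markers; the correlation adjustment and the p-value calculation are not shown. The function names (`two_sided_tail`, `bernoulli_kl`, `berk_jones`) are illustrative and do not come from the authors' code.

```python
import math


def two_sided_tail(t):
    """Null probability P(|Z| >= t) for Z ~ N(0, 1)."""
    return math.erfc(t / math.sqrt(2.0))


def bernoulli_kl(x, p):
    """KL divergence between Bernoulli(x) and Bernoulli(p), with 0 * log(0) = 0."""
    total = 0.0
    if x > 0.0:
        total += x * math.log(x / p)
    if x < 1.0:
        total += (1.0 - x) * math.log((1.0 - x) / (1.0 - p))
    return total


def berk_jones(z_scores):
    """Classical Berk-Jones statistic for a set of z-scores.

    Assumption (not the paper's setting): the z-scores are independent.
    For each j up to d/2, compare the observed fraction j/d of statistics
    at least as large as the j-th largest |z| with the fraction expected
    under the null, and keep the largest binomial log-likelihood ratio.
    """
    d = len(z_scores)
    ordered = sorted((abs(z) for z in z_scores), reverse=True)
    best = 0.0
    for j in range(1, d // 2 + 1):
        null_prob = two_sided_tail(ordered[j - 1])
        observed = j / d
        # Only count thresholds where more statistics exceed it than expected;
        # skip null_prob == 0.0, which can occur through floating-point underflow.
        if 0.0 < null_prob < observed:
            best = max(best, d * bernoulli_kl(observed, null_prob))
    return best


# Hypothetical z-scores for one gene set: two strong signals among ten genes.
signal_set = [5.0, 4.5] + [0.1] * 8
null_like_set = [0.5] * 10
print(berk_jones(signal_set), berk_jones(null_like_set))
```

A larger value indicates stronger evidence that at least some genes in the set are differentially expressed. In practice the statistic would be converted to a p-value, and for correlated expression data the generalized version described in the paper would be the appropriate choice.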
Availability and implementation
Data are in the MetaGxBreast R package. Code is available at github.com/ryanrsun/gaynor_sun_GBJ_breast_cancer.
Supplementary information
Supplementary data are available at Bioinformatics online.
Oxford University Press