Second-generation PLINK: rising to the challenge of larger and richer datasets

CC Chang, CC Chow, LCAM Tellier, S Vattikuti, SM Purcell, JJ Lee

GigaScience, 2015 • academic.oup.com

Background

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.

Findings

To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0).
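
As a rough illustration of what bit-level parallelism can mean for genotype data, the sketch below tallies genotype codes by operating on a whole machine word at once instead of looping over samples. It is not PLINK's implementation: the 2-bit code per genotype, the packing of 32 codes into a 64-bit word, and the names `pack`, `popcount`, and `count_codes` are all assumptions made for this example.

```python
# Illustrative sketch only; the encoding and all names are hypothetical.
MASK_LOW = 0x5555555555555555  # selects the low bit of every 2-bit field


def pack(codes):
    """Pack up to 32 two-bit codes (0..3) into one integer, first code in the lowest bits."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 3) << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def count_codes(word, n):
    """Return [count of code 0, 1, 2, 3] among the first n fields of word.

    Each count comes from a few bitwise operations plus one popcount,
    so all fields in the word are examined together.
    """
    lo = word & MASK_LOW         # low bit of each field
    hi = (word >> 1) & MASK_LOW  # high bit of each field, shifted into the low position
    c3 = popcount(lo & hi)       # both bits set
    c2 = popcount(hi & ~lo)      # high bit only
    c1 = popcount(lo & ~hi)      # low bit only
    c0 = n - c1 - c2 - c3        # remaining fields; unused padding fields are excluded via n
    return [c0, c1, c2, c3]


codes = [0, 1, 2, 3, 3, 2, 2, 0]
print(count_codes(pack(codes), len(codes)))
```

The same masking idea extends to arrays of words, where hardware popcount instructions do the counting that `bin(x).count("1")` stands in for here.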

Conclusions

The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

Oxford University Press
