Abstract
Genetic correlation is a central parameter for understanding shared genetic architecture between complex traits. By using summary statistics from genome-wide association studies (GWAS), linkage disequilibrium score regression (LDSC) was developed for unbiased estimation of genetic correlations. Although easy to use, LDSC only partially utilizes LD information. By fully accounting for LD across the genome, we develop a high-definition likelihood (HDL) method to improve precision in genetic correlation estimation. Compared to LDSC, HDL reduces the variance of genetic correlation estimates by about 60%, equivalent to a 2.5-fold increase in sample size. We apply HDL and LDSC to estimate 435 genetic correlations among 30 behavioral and disease-related phenotypes measured in the UK Biobank (UKBB). In addition to 154 significant genetic correlations observed for both methods, HDL identified another 57 significant genetic correlations, compared to only another 2 significant genetic correlations identified by LDSC. HDL brings more power to genomic analyses and better reveals the underlying connections across human complex traits.
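The link between the 60% variance reduction and the 2.5-fold sample-size gain can be checked with a short calculation. It rests on a simplifying assumption that is not stated in the abstract: that the sampling variance of a genetic correlation estimate scales inversely with the GWAS sample size N. The symbol N_equiv (the LDSC sample size that would match HDL's precision) is introduced here for illustration only.

```latex
\[
\frac{\operatorname{Var}(\hat{r}_{g,\mathrm{HDL}})}{\operatorname{Var}(\hat{r}_{g,\mathrm{LDSC}})}
\approx 1 - 0.6 = 0.4,
\qquad
\operatorname{Var}(\hat{r}_{g}) \propto \frac{1}{N}
\;\Longrightarrow\;
\frac{N_{\mathrm{equiv}}}{N} \approx \frac{1}{0.4} = 2.5 .
\]
```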
Data availability
The individual-level genotype and phenotype data are available by application from the UKBB (http://www.ukbiobank.ac.uk/). The UKBB GWAS summary statistics by the Neale laboratory can be obtained from http://www.nealelab.is/ukbiobank/. Source data are provided with this paper.
Code availability
HDL software is available at https://github.com/zhenin/HDL/. LDSC software is available at https://github.com/bulik/ldsc/. PLINK 2.0 (https://www.cog-genomics.org/plink/2.0/) was used to extract individual-level data of imputed SNPs from the UKBB. PLINK 1.9 (https://www.cog-genomics.org/plink/) and LDAK (http://dougspeed.com/ldak/) were used in LD correlation calculation and simulations.
Acknowledgements
We thank the UKBB resource, approved under application nos. 14302 and 19655, for the individual-level genotype data used in LD correlation calculation and simulations. X.S. was in receipt of a Swedish Research Council starting grant (no. 2017-02543). Y.P. received a Swedish Research Council grant (no. 2016-04194). We thank the Edinburgh Compute and Data Facility (ECDF) for providing high-performance computing resources.
Ethics declarations
Competing interests
The authors declare no competing interests.
Extended data
Extended Data Fig. 1 Relative efficiency of HDL against LDSC when 100% of SNPs are causal.
In each heritability group, we generated 100 pairs of traits, where the true genetic correlation and phenotypic correlation are 0.5. In the high heritability group, the heritabilities of the two traits are 0.6 and 0.8, respectively; in the low heritability group, they are 0.2 and 0.4, respectively. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene's test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
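The variance comparison behind these P-values can be illustrated with a minimal sketch of Levene's statistic. It uses the classical mean-centred form; the caption does not say which variant the authors used, and the function name and toy inputs are hypothetical. The P-value would follow from comparing W with an F distribution on (k - 1, N - k) degrees of freedom, which is omitted here because the standard library has no F distribution.

```python
from statistics import mean


def levene_statistic(*groups):
    """Levene's W statistic (mean-centred form) for k groups of estimates.

    Assumes at least two groups and some within-group spread in the
    absolute deviations (otherwise the denominator is zero).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean.
    centres = [mean(g) for g in groups]
    deviations = [[abs(x - c) for x in g] for g, c in zip(groups, centres)]
    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total
    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for d, m in zip(deviations, group_means) for z in d)
    return (n_total - k) / (k - 1) * between / within


# Toy example: the second group of estimates is far more dispersed.
w = levene_statistic([1, 2, 3], [0, 10, 20])
```

SciPy's `scipy.stats.levene` provides the same test together with a P-value; note that its default centres on the median rather than the mean.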
Extended Data Fig. 2 Relative efficiency of HDL against LDSC under different model setups when 10% of SNPs with MAF > 1% are causal.
52,914 out of 529,139 array SNPs with MAF > 1% were randomly selected as causal variants. 100 pairs of traits were generated, where the true genetic correlation and phenotypic correlation are 0.5. The true phenotype of trait i is generated from the model \(\mathbf{y}_i = \sum\nolimits_{k = 1}^{M} \mathbf{X}_{ik}\beta_{ik} + \epsilon_i\), where \(\mathbf{X}_{ik} = (\mathbf{Z}_{ik} - 2p_k 1)[2p_k(1 - p_k)]^{\alpha/2}\); \(\mathbf{Z}_{ik}\) are the original genotypes of SNP k for trait i; \(p_k\) is the MAF of SNP k; M is the number of causal variants. Four scenarios were simulated: (1) \(\alpha = -1\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); (2) \(\alpha = -1\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\), where \(w_k\) is the LDAK weight of SNP k, which is inversely proportional to its LD score; (3) \(\alpha = -0.25\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); and (4) \(\alpha = -0.25\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\). After the \(\beta_i\) were generated, they were rescaled by multiplying by the same constant so that the true heritabilities were 0.5 for both traits. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene's test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
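The genotype scaling and effect-size distributions in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the function names are hypothetical, the 1 in the scaling formula is read as a vector of ones (so 2p_k is subtracted from every genotype), and the final rescaling of effects to a heritability of 0.5 is left out.

```python
import random


def scale_genotypes(z, p, alpha):
    """Return X_k = (Z_k - 2 p_k) * [2 p_k (1 - p_k)]^(alpha / 2) for one SNP.

    z     : genotype counts (0, 1 or 2) across individuals
    p     : minor allele frequency of the SNP
    alpha : -1 gives fully standardized genotypes; -0.25 shrinks the
            per-allele scaling of rare variants less strongly
    """
    factor = (2.0 * p * (1.0 - p)) ** (alpha / 2.0)
    return [(g - 2.0 * p) * factor for g in z]


def draw_effect(h2, m, weight=1.0, rng=random):
    """Draw one causal effect from N(0, weight * h2 / m).

    weight = 1 corresponds to scenarios (1) and (3); passing an LDAK-style
    weight w_k corresponds to scenarios (2) and (4).
    """
    return rng.gauss(0.0, (weight * h2 / m) ** 0.5)


# Toy usage: one SNP with MAF 0.5 under alpha = -1 (hypothetical values).
x = scale_genotypes([0, 1, 2], p=0.5, alpha=-1.0)
beta = draw_effect(h2=0.5, m=52914, rng=random.Random(1))
```

With alpha = -1 the scaled genotypes have unit variance under Hardy-Weinberg proportions, which is why that setting is usually described as standardization.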
Extended Data Fig. 3 Relative efficiency of HDL against LDSC under different model setups when 10% of SNPs with 5% > MAF > 1% are causal.
52,914 out of 221,620 array SNPs with 5% > MAF > 1% were randomly selected as causal variants. 100 pairs of traits were generated, where the true genetic correlation and phenotypic correlation are 0.5. The true phenotype of trait i is generated from the model \(\mathbf{y}_i = \sum\nolimits_{k = 1}^{M} \mathbf{X}_{ik}\beta_{ik} + \epsilon_i\), where \(\mathbf{X}_{ik} = (\mathbf{Z}_{ik} - 2p_k 1)[2p_k(1 - p_k)]^{\alpha/2}\); \(\mathbf{Z}_{ik}\) are the original genotypes of SNP k for trait i; \(p_k\) is the MAF of SNP k; M is the number of causal variants. Four scenarios were simulated: (1) \(\alpha = -1\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); (2) \(\alpha = -1\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\), where \(w_k\) is the LDAK weight of SNP k, which is inversely proportional to its LD score; (3) \(\alpha = -0.25\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); and (4) \(\alpha = -0.25\), and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\). After the \(\beta_i\) were generated, they were rescaled by multiplying by the same constant so that the true heritabilities were 0.5 for both traits. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene's test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
Extended Data Fig. 4 Relative efficiency of HDL using imputed reference panel against LDSC.
100 pairs of traits were generated, where the true heritabilities are 0.5 and the true genetic and phenotypic correlations are 0.5. The 1,029,876 imputed SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes. LDSC and LDSC.1kG stand for the LDSC software using the UKBB imputed reference panel and the default 1000 Genomes reference panel, respectively. 102,988 (10% of 1,029,876) randomly sampled SNPs are set to be causal variants. The P-values are from Levene's test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
Extended Data Fig. 5 Relative efficiency and standard error of LDSC estimate among 30 phenotypes in UK Biobank.
Each dot represents genetic correlation results for one pair of traits among 435 pairs. The x-axis represents the standard error of the LDSC estimate. The y-axis represents the relative efficiency of HDL against LDSC. HDL reference panel: UKBB imputed SNPs; LDSC reference panel: 1000 Genomes (default). Colors indicate the number of binary traits in the pair.
Extended Data Fig. 6 Genetic correlation estimates from HDL and LDSC among 30 phenotypes in UK Biobank based on directly genotyped variants on the array.
Lower triangle: HDL estimates; upper triangle: LDSC estimates. The areas of the squares represent the absolute values of the corresponding genetic correlations. After Bonferroni correction for 435 tests at the 5% significance level, genetic correlation estimates that are significantly different from zero in both methods are marked with a dot; estimates that are significantly different from zero in only one method are marked with an asterisk and a black square. HDL reference panel: UKBB array SNPs; LDSC reference panel: UKBB array SNPs.
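The Bonferroni threshold implied by this caption is simple to reproduce. The sketch below derives the number of tests from the 30 phenotypes and computes the per-test threshold; the variable and function names are illustrative and do not come from the HDL or LDSC software.

```python
from math import comb

n_traits = 30
n_tests = comb(n_traits, 2)         # number of unordered trait pairs: 435
family_alpha = 0.05                 # 5% family-wise significance level
threshold = family_alpha / n_tests  # per-test threshold, about 1.15e-4


def is_significant(p_value, cutoff=threshold):
    """True if a genetic correlation P-value passes the Bonferroni cutoff."""
    return p_value < cutoff
```

A pair of traits is therefore flagged only when its P-value falls below roughly 1.15 × 10⁻⁴.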
Extended Data Fig. 7 Relative efficiency of HDL using imputed reference panel against LDSC for the estimation of heritability.
a, 100 traits were generated using 14,867 imputed SNPs on chromosome 22 of ~336,000 UKBB genomic British individuals, where true heritability was set to 0.05. LDSC and LDSC.1kG stand for the LDSC software using UKBB imputed reference panel and default 1kG reference panel, respectively. 1,487 (10% of 14,867) randomly sampled SNPs are set to be causal variants. b, The relative efficiency, calculated as the ratio of the estimated variances of the LDSC estimates to those of the HDL estimates, was evaluated for 30 GWAS of real phenotypes in UKBB. HDL reference panel: UKBB imputed SNPs; LDSC reference panel: 1000 Genomes (default). Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
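Relative efficiency as defined in panel b (variance of the LDSC estimate divided by variance of the HDL estimate) can be computed directly from reported standard errors, since the estimated variance is the squared standard error. The function name and the numbers in the usage line are hypothetical.

```python
def relative_efficiency(se_ldsc, se_hdl):
    """Ratio of the LDSC estimate's variance to the HDL estimate's variance.

    Values above 1 mean the HDL estimate is the more precise of the two.
    """
    return (se_ldsc / se_hdl) ** 2


# Hypothetical standard errors for a single trait.
re = relative_efficiency(se_ldsc=0.05, se_hdl=0.02)
```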
Extended Data Fig. 8 Comparison of the heritability estimates from HDL and default LDSC across 30 UKBB phenotypes.
The default LDSC uses the 1000 Genomes reference panel. HDL uses UKBB imputed markers as reference. R represents the correlation between the two sets of estimates. The red dashed line represents identity.
Extended Data Fig. 9 Example of the eigenvalues of an LD matrix.
5,420 genotyped variants on chromosome 22 for UKBB genomic British individuals were used to generate the LD matrix. The red dashed line represents the cutoff where the leading eigenvalues and corresponding eigenvectors capture 90% of the information of the LD matrix.
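One way to locate such a cutoff is to count how many leading eigenvalues are needed for their cumulative sum to reach 90% of the total. The sketch below assumes that "information" means the share of the eigenvalue sum (the trace of the LD matrix); that reading, the function name and the toy eigenvalues are assumptions for illustration.

```python
def n_leading_eigenvalues(eigenvalues, fraction=0.9):
    """Smallest number of leading eigenvalues whose sum reaches
    `fraction` of the total eigenvalue sum."""
    ordered = sorted(eigenvalues, reverse=True)
    target = fraction * sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= target:
            return k
    return len(ordered)


# Toy spectrum: the three largest eigenvalues carry 95% of the total.
k = n_leading_eigenvalues([5.0, 3.0, 1.5, 0.5], fraction=0.9)
```

In practice the eigenvalues would come from an eigendecomposition of the LD matrix, for example with `numpy.linalg.eigh`, which is suited to symmetric matrices.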
Extended Data Fig. 10 HDL results where the LD matrix is approximated by different numbers of leading eigenvalues and eigenvectors.
After eigendecomposition of the LD matrix, the leading eigenvalues explaining different proportions of the variance of the LD matrix, together with their corresponding eigenvectors, were taken to approximate the LD matrix. In each heritability group, we generated 100 pairs of traits, where the true genetic correlation and phenotypic correlation are 0.5. In the high heritability group, the heritabilities of the two traits are 0.6 and 0.8, respectively; in the low heritability group, they are 0.2 and 0.4, respectively. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for HDL. 30,752 SNPs are causal (10% of 307,519). Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
Cite this article
Ning, Z., Pawitan, Y. & Shen, X. High-definition likelihood inference of genetic correlations across human complex traits. Nat. Genet. (2020). https://doi.org/10.1038/s41588-020-0653-y
