High-definition likelihood inference of genetic correlations across human complex traits

Abstract

Genetic correlation is a central parameter for understanding shared genetic architecture between complex traits. By using summary statistics from genome-wide association studies (GWAS), linkage disequilibrium score regression (LDSC) was developed for unbiased estimation of genetic correlations. Although easy to use, LDSC only partially utilizes LD information. By fully accounting for LD across the genome, we develop a high-definition likelihood (HDL) method to improve precision in genetic correlation estimation. Compared to LDSC, HDL reduces the variance of genetic correlation estimates by about 60%, equivalent to a 2.5-fold increase in sample size. We apply HDL and LDSC to estimate 435 genetic correlations among 30 behavioral and disease-related phenotypes measured in the UK Biobank (UKBB). In addition to 154 significant genetic correlations observed for both methods, HDL identified another 57 significant genetic correlations, compared to only another 2 significant genetic correlations identified by LDSC. HDL brings more power to genomic analyses and better reveals the underlying connections across human complex traits.
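
The equivalence between a 60% variance reduction and a 2.5-fold sample size can be checked with a short calculation. It is sketched here under the assumption, not stated in the abstract, that the sampling variance of a genetic correlation estimate is inversely proportional to the GWAS sample size N. The symbol N_equiv is introduced only for this illustration. The count of 435 is the number of unordered pairs among 30 phenotypes.

```latex
\[
\frac{\operatorname{Var}\bigl(\hat r_g^{\mathrm{HDL}}\bigr)}{\operatorname{Var}\bigl(\hat r_g^{\mathrm{LDSC}}\bigr)} \approx 1 - 0.6 = 0.4,
\qquad
\operatorname{Var}(\hat r_g) \propto \frac{1}{N}
\;\Longrightarrow\;
\frac{N_{\mathrm{equiv}}}{N} = \frac{1}{0.4} = 2.5,
\qquad
\binom{30}{2} = \frac{30 \times 29}{2} = 435.
\]
```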

Data availability

The individual-level genotype and phenotype data are available by application from the UKBB (http://www.ukbiobank.ac.uk/). The UKBB GWAS summary statistics by the Neale laboratory can be obtained from http://www.nealelab.is/uk-biobank/. Source data are provided with this paper.

Code availability

HDL software is available at https://github.com/zhenin/HDL/. LDSC software is available at https://github.com/bulik/ldsc/. PLINK 2.0 (https://www.cog-genomics.org/plink/2.0/) was used to extract individual-level data of imputed SNPs from the UKBB. PLINK 1.9 (https://www.cog-genomics.org/plink/) and LDAK (http://dougspeed.com/ldak/) were used in LD correlation calculation and simulations.

Acknowledgements

We thank the UKBB resource, approved under application nos. 14302 and 19655, for the individual-level genotype data used in LD correlation calculation and simulations. X.S. was in receipt of a Swedish Research Council starting grant (no. 2017-02543). Y.P. received a Swedish Research Council grant (no. 2016-04194). We thank the Edinburgh Compute and Data Facility (ECDF) for providing high-performance computing resources.

Author information

Affiliations

  1. Biostatistics Group, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

    Zheng Ning & Xia Shen

  2. Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden

    Zheng Ning, Yudi Pawitan & Xia Shen

  3. Centre for Global Health Research, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK

    Xia Shen

Contributions

X.S. and Y.P. initiated and coordinated the study. Z.N. performed data analysis. All authors contributed to method development and manuscript writing.

Corresponding author

Correspondence to Xia Shen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Relative efficiency of HDL against LDSC when 100% SNPs are causal.

In each heritability group, we generated 100 pairs of traits with a true genetic correlation and a phenotypic correlation of 0.5. In the high heritability group, the heritabilities of the two traits are 0.6 and 0.8, respectively; in the low heritability group, they are 0.2 and 0.4, respectively. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.

Extended Data Fig. 2 Relative efficiency of HDL against LDSC under different model setups when 10% SNPs with MAF > 1% are causal.

Of the 529,139 array SNPs with MAF > 1%, 52,914 were randomly selected as causal variants. We generated 100 pairs of traits with a true genetic correlation and a phenotypic correlation of 0.5. The true phenotype of trait i is generated from the model \(\mathbf{y}_i = \sum_{k=1}^{M} \mathbf{X}_{ik}\beta_{ik} + \epsilon_i\), where \(\mathbf{X}_{ik} = (\mathbf{Z}_{ik} - 2p_k 1)[2p_k(1 - p_k)]^{\alpha/2}\); \(\mathbf{Z}_{ik}\) are the original genotypes of SNP k for trait i; \(p_k\) is the MAF of SNP k; and M is the number of causal variants. Four scenarios were simulated: (1) α = −1, and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); (2) α = −1, and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\), where \(w_k\) is the LDAK weight of SNP k, which is inversely proportional to its LD score; (3) α = −0.25, and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); and (4) α = −0.25, and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\). After the \(\beta_i\) were generated, they were rescaled by a common constant so that the true heritabilities were 0.5 for both traits. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
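
The phenotype model in this caption can be illustrated with a minimal NumPy sketch for a single trait. It is not the authors' simulation code: the function name, argument names and the unit-variance phenotype scale are assumptions made for illustration, and the cross-trait genetic and phenotypic correlations are omitted for brevity.

```python
import numpy as np


def simulate_phenotype(Z, maf, h2, alpha=-1.0, weights=None, rng=None):
    """Simulate y = sum_k X_k * beta_k + eps for one trait (illustrative only).

    Z       : (n, M) genotype matrix coded 0/1/2 at the M causal SNPs
    maf     : (M,) allele frequencies p_k
    h2      : target heritability of the trait
    alpha   : exponent in X_k = (Z_k - 2 p_k) * [2 p_k (1 - p_k)]^(alpha / 2)
    weights : optional (M,) per-SNP weights w_k; None means w_k = 1 for all SNPs
    """
    rng = np.random.default_rng() if rng is None else rng
    n, M = Z.shape
    w = np.ones(M) if weights is None else np.asarray(weights, dtype=float)

    # Centre each SNP by 2 p_k and scale by [2 p_k (1 - p_k)]^(alpha / 2).
    X = (Z - 2.0 * maf) * (2.0 * maf * (1.0 - maf)) ** (alpha / 2.0)

    # Marginal effect distribution: beta_k ~ N(0, w_k * h2 / M).
    beta = rng.normal(0.0, np.sqrt(w * h2 / M))

    # Rescale all effects by one constant so the genetic variance equals h2
    # (assumption: the phenotype is on a unit-variance scale).
    g = X @ beta
    g *= np.sqrt(h2 / g.var())

    # Residual with variance 1 - h2 completes the phenotype.
    eps = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return g + eps, g
```

Scenarios (1) and (3) correspond to leaving `weights` unset with `alpha=-1.0` and `alpha=-0.25`; scenarios (2) and (4) would pass per-SNP weights, which this sketch treats as an externally supplied array.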

Extended Data Fig. 3 Relative efficiency of HDL against LDSC under different model setups when 10% SNPs with 5% > MAF > 1% are causal.

Of the 221,620 array SNPs with 5% > MAF > 1%, 52,914 were randomly selected as causal variants. We generated 100 pairs of traits with a true genetic correlation and a phenotypic correlation of 0.5. The true phenotype of trait i is generated from the model \(\mathbf{y}_i = \sum_{k=1}^{M} \mathbf{X}_{ik}\beta_{ik} + \epsilon_i\), where \(\mathbf{X}_{ik} = (\mathbf{Z}_{ik} - 2p_k 1)[2p_k(1 - p_k)]^{\alpha/2}\); \(\mathbf{Z}_{ik}\) are the original genotypes of SNP k for trait i; \(p_k\) is the MAF of SNP k; and M is the number of causal variants. Four scenarios were simulated: (1) α = −1, and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); (2) α = −1, and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\), where \(w_k\) is the LDAK weight of SNP k, which is inversely proportional to its LD score; (3) α = −0.25, and the marginal distribution of \(\beta_{ik}\) is \(N(0, h_i^2/M)\); and (4) α = −0.25, and the marginal distribution of \(\beta_{ik}\) is \(N(0, w_k h_i^2/M)\). After the \(\beta_i\) were generated, they were rescaled by a common constant so that the true heritabilities were 0.5 for both traits. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for both HDL and LDSC. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.

Extended Data Fig. 4 Relative efficiency of HDL using imputed reference panel against LDSC.

We generated 100 pairs of traits with true heritabilities of 0.5 and with a true genetic correlation and a phenotypic correlation of 0.5. The 1,029,876 imputed SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes. LDSC and LDSC.1kG stand for the LDSC software using the UKBB imputed reference panel and the default 1000 Genomes reference panel, respectively. 102,988 (10% of 1,029,876) randomly sampled SNPs are set to be causal variants. The P-values are from Levene’s test for variance heterogeneity. Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.

Extended Data Fig. 5 Relative efficiency and standard error of LDSC estimate among 30 phenotypes in UK Biobank.

Each dot represents genetic correlation results for one pair of traits among 435 pairs. The x-axis represents the standard error of the LDSC estimate. The y-axis represents the relative efficiency of HDL against LDSC. HDL reference panel: UKBB imputed SNPs; LDSC reference panel: 1000 Genomes (default). Colors indicate the number of binary traits in the pair.

Extended Data Fig. 6 Genetic correlation estimates from HDL and LDSC among 30 phenotypes in UK Biobank based on directly genotyped variants on the array.

Lower triangle: HDL estimates; upper triangle: LDSC estimates. The areas of the squares represent the absolute values of the corresponding genetic correlations. After Bonferroni correction for 435 tests at the 5% significance level, genetic correlation estimates that are significantly different from zero in both methods are marked with a dot; estimates that are significantly different from zero in only one method are marked with an asterisk and a black square. HDL reference panel: UKBB array SNPs; LDSC reference panel: UKBB array SNPs.
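
The Bonferroni threshold used in this caption can be reproduced with a few lines of Python. The sketch below assumes a two-sided Wald test under a normal approximation for deciding whether an estimate differs from zero; the caption does not state which test was used, and the function names are hypothetical.

```python
import math

N_TRAITS = 30
N_TESTS = N_TRAITS * (N_TRAITS - 1) // 2   # 435 unordered trait pairs
ALPHA = 0.05
THRESHOLD = ALPHA / N_TESTS                # Bonferroni-adjusted per-test level


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: estimate = 0 (normal approximation, assumed)."""
    z = abs(estimate / std_error)
    # 2 * (1 - Phi(z)) expressed through the complementary error function.
    return math.erfc(z / math.sqrt(2.0))


def is_significant(estimate, std_error, threshold=THRESHOLD):
    """True if the estimate passes the Bonferroni-adjusted threshold."""
    return wald_p_value(estimate, std_error) < threshold
```

With 435 tests the per-test level is 0.05/435, so an estimate needs a much larger z-score than the usual 1.96 to be marked as significant.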

Extended Data Fig. 7 Relative efficiency of HDL using imputed reference panel against LDSC for the estimation of heritability.

a, 100 traits were generated using 14,867 imputed SNPs on chromosome 22 of ~336,000 UKBB genomic British individuals, where the true heritability was set to 0.05. LDSC and LDSC.1kG stand for the LDSC software using the UKBB imputed reference panel and the default 1000 Genomes (1kG) reference panel, respectively. 1,487 (10% of 14,867) randomly sampled SNPs are set to be causal variants. b, The relative efficiency, calculated as the ratio of the estimated variances of the LDSC estimates to those of the HDL estimates, was evaluated for 30 GWAS of real phenotypes in UKBB. HDL reference panel: UKBB imputed SNPs; LDSC reference panel: 1000 Genomes (default). Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.
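
The relative efficiency defined in panel b is a ratio of variances, which can be sketched as follows. The example assumes that each estimated variance is the square of the reported standard error; both function names are hypothetical.

```python
def relative_efficiency(se_ldsc, se_hdl):
    """Estimated variance of the LDSC estimate divided by that of the HDL estimate.

    Assumes each estimated variance is the squared standard error.
    """
    return (se_ldsc / se_hdl) ** 2


def variance_reduction(rel_eff):
    """Fraction by which the variance shrinks for a given relative efficiency."""
    return 1.0 - 1.0 / rel_eff
```

Under this definition a relative efficiency above 1 favors HDL, and a value of 2.5 corresponds to a 60% reduction in variance.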

Extended Data Fig. 8 Comparison of the heritability estimates from HDL and default LDSC across 30 UKBB phenotypes.

The default LDSC uses the 1000 Genomes reference panel. HDL uses UKBB imputed markers as reference. R represents the correlation between the two sets of estimates. The red dashed line represents identity.

Extended Data Fig. 9 Example of the eigenvalues of an LD matrix.

5,420 genotyped variants on chromosome 22 for UKBB genomic British individuals were used to generate the LD matrix. The red dashed line represents the cutoff where the leading eigenvalues and corresponding eigenvectors capture 90% of the information of the LD matrix.
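
The cutoff described here can be illustrated with a minimal NumPy sketch of a truncated eigen-decomposition. It assumes that "information" means the share of the sum of eigenvalues retained by the leading components; that interpretation, the function name and its arguments are illustrative and are not taken from the HDL software.

```python
import numpy as np


def truncate_ld_matrix(R, proportion=0.90):
    """Low-rank approximation of a symmetric LD correlation matrix R.

    Keeps the smallest number k of leading eigenvalues whose sum reaches
    `proportion` of the total (assumed meaning of "information").
    """
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # First position at which the cumulative share reaches the target.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, proportion)) + 1
    k = min(k, len(eigvals))

    lead_vals, lead_vecs = eigvals[:k], eigvecs[:, :k]
    # Reconstruct R from the k leading components: V diag(lambda) V^T.
    R_approx = (lead_vecs * lead_vals) @ lead_vecs.T
    return k, lead_vals, lead_vecs, R_approx
```

Because LD matrices have many small eigenvalues, k is typically far smaller than the number of SNPs, which is what makes the approximation attractive.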

Extended Data Fig. 10 HDL results where the LD matrix is approximated by different numbers of leading eigenvalues and eigenvectors.

After eigen-decomposition of the LD matrix, the leading eigenvalues explaining different proportions of the variance of the LD matrix, together with their corresponding eigenvectors, were taken to approximate the LD matrix. In each heritability group, we generated 100 pairs of traits with a true genetic correlation and a phenotypic correlation of 0.5. In the high heritability group, the heritabilities of the two traits are 0.6 and 0.8, respectively; in the low heritability group, they are 0.2 and 0.4, respectively. The 307,519 array SNPs of ~336,000 UKBB genomic British individuals were used to simulate true phenotypes and to compute the LD matrix for HDL. 30,752 SNPs are causal (10% of 307,519). Inside each box, the line indicates the median value, the central box indicates the interquartile range (IQR), and whiskers extend up to 1.5 times the IQR.

Cite this article

Ning, Z., Pawitan, Y. & Shen, X. High-definition likelihood inference of genetic correlations across human complex traits. Nat Genet (2020). https://doi.org/10.1038/s41588-020-0653-y
