Efficient strategies for leave-one-out cross validation for genomic best linear unbiased prediction
- Hao Cheng^{1, 2},
- Dorian J. Garrick^{1, 3} and
- Rohan L. Fernando^{1}Email author
https://doi.org/10.1186/s40104-017-0164-6
© The Author(s) 2017
Received: 21 September 2016
Accepted: 27 March 2017
Published: 2 May 2017
Abstract
Background
A random multiple-regression model that simultaneously fit all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction, using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model.
Methods
Naive application of Leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient Leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.
Results
Efficient Leave-one-out cross validation strategies is 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model will increase with increases in the number of observations.
Conclusions
Efficient Leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.
Keywords
Background
A random multiple-regression model that simultaneously fit all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction (BLUP) [1], using whole-genome data. Breeding values are defined as the sum of the effects of all the markers or haplotypes, and their estimates are widely used for prediction of the merit of selection candidates. Estimates of marker or haplotype effects are used to predict breeding values of individuals that were not present in a previous analysis commonly referred to as training. An alternative earlier published approach to use marker or haplotype information fits breeding values as random effects based on covariances defined by a “genomic relationship matrix” computed from genotypes [2]. These two models have been shown to be equivalent in terms of predicting breeding values [3, 4] and we refer to them here as marker effect models (MEM) or breeding value models (BVM), the latter often known as Genomic Best Linear Unbiased Prediction (GBLUP).
Cross validation is often used to quantify the predictive ability of a statistical model. In k-fold cross validation, the whole dataset is partitioned into k parts with k analyses, where one part is omitted for training with validation on the omitted part. Leave-one-out cross validation (LOOCV) is a special case of k-fold cross validation with k=n, the number of observations. When the dataset is small, leave-one-out cross validation is appealing as the size of the training set is maximized. However, naive application of LOOCV is computationally intensive, requiring n analyses.
We show below how LOOCV can be performed using either the MEM or BVM with little more effort than is required for a single analysis with n observations.
Methods
Use of the MEM is more efficient when the number n of individuals is larger than the number p of markers, because for this model the mixed model equations are of order p plus the number of other effects. When n<p, estimated breeding values can be obtained more efficiently by solving the mixed model equations for the BVM of order n plus the number of other effects. We deal with the special case where the only other effect is a general mean and phenotypes have been pre-corrected for other nuisance variables. Efficient strategies for LOOCV using this special case for MEM when n≥p and BVM when p≥n are shown below.
Marker effect models
where y, a n×1 vector for phenotypes, has been pre-corrected for all fixed effects other than μ, the overall mean, X is the n×p matrix of marker covariates, \(\mathcal {\boldsymbol {\beta }}\) is a p×1 random vector of the allele substitution effects and e is a n×1 random vector of residuals. Often it is assumed that marker effects are identically and independently distributed (iid) random variables with null means and variances \(\sigma _{\beta }^{2}\). Thus, under the usual assumption that the residuals are iid with null means and variances \(\sigma _{e}^{2}\), E(y)=1 μ. When MEM is used, LOOCV can be performed by using a well-known strategy used in least-squares regression to compute the predicted residual sum of square (PRESS) [5] statistic.
LOOCV strategy for MEM
where X ^{∗}= [ 1 X], \({\hat {\boldsymbol {\beta }^{*}}}=\left [ \begin {array}{c} \hat {\mathbf {\mu }}\\ \hat {\mathbf {\boldsymbol {\beta }}} \end {array}\right ]\), D is a diagonal matrix whose elements are 0 followed by a p vector of 1s and \(\lambda =\frac {\sigma _{e}^{2}}{\sigma _{\beta }^{2}}\).
where \(\boldsymbol {X}_{-j}^{*}\) is X ^{∗} with the jth row removed and y _{−j } is y with the jth element removed.
where the quadratic \(H_{jj}=\boldsymbol {x}^{*}_{j'}\left (\boldsymbol {X}^{*{\prime }}\boldsymbol {X}^{*}+\boldsymbol {D}\lambda \right)^{-1}\boldsymbol {x}^{*}_{j}\) is the jth diagonal element of H=X ^{∗}(X ^{∗′} X ^{∗}+D λ)^{−1} X ^{∗′}.
These prediction errors can be squared and accumulated over n realizations to compute PRESS defined as \(\sum _{j=1}^{n}\hat {e_{j}}^{2}\). The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of y _{ j }, and that correlation can be estimated from the values of \(\hat {y}_{j}\), which can be computed efficiently as \(\hat {y}_{j}=y_{j}-\hat {e}_{j}\), using the observed values of y _{ j }. When a specific group of individuals is of interest, prediction accuracies and PRESS can also be calculate using \(\hat {e_{j}}\) for individuals in that group.
Breeding value models
where u=X β, \({\text {var}\left (\mathbf {u}\right)=\mathbf {XX'}\sigma _{\beta }^{2}}\), Z is the identity matrix of order n and other variables are as in the MEM. Further, in both models E(y)=1 μ, and \(\text {var}(\mathbf {y})=\mathbf {X}\mathbf {X}'\sigma _{\beta }^{2}+\mathbf {I}\sigma _{e}^{2}\). These two models are said to be equivalent [6], and linear functions predicted from one model are identical to corresponding predictions from the other model. Two efficient strategies for LOOCV using the BVM are shown below.
LOOCV strategy I for BVM
where the quadratic \(\mathbf {z}^{*{\prime }}_{j}\left (\mathbf {Z}^{*{\prime }}\mathbf {Z}^{*}+\mathbf {G}\lambda \right)^{-1}\mathbf {z}^{*}_{j}\) is the jth diagonal element of C=Z ^{∗}(Z ^{∗′} Z ^{∗}+G λ)^{−1} Z ^{∗′}.
LOOCV strategy II for BVM
Now V _{ j,−j } in element (2,1) of the above inverse matrix is the vector of covariances between y _{ j } and y _{−j } and \(\mathbf {V_{-j,-j}^{-1}}\) is the inverse of the covariance matrix of y _{−j }. Thus, \(\hat {y}_{j}=\boldsymbol {V}_{j,-j}\boldsymbol {V}_{-j,-j}^{-1}\mathbf {y}_{-j}\) is the Best Linear Predictor (BLP) of y _{ j } given y _{−j }, and element (2,1) of (10) is the prediction error of y _{ j }. The element (2,2) in (10) is the prediction error variance (PEV) for y _{ j }, where \(PEV=var(y_{j}-\hat {y}_{j})\). PEV can also be used to calculate theoretical reliability for individual i as \(1-\frac {PEV_{i}}{V_{jj}}\), and characterizing the distributions of reliability for all the individuals in a dataset has a number of practical applications. Note this allows us to obtain the PEV of every individual and the distribution of these values provide information as to the robustness of genomic predictions across the population of individuals represented in the dataset. This PEV is determined by the genomic variance-covariance matrix and does not depend on y. Two different datasets could generate the same PRESS statistic but with different distributions of PEV.
where q ^{ i,j } is the element from row i and column j of Q ^{−1}. Thus, once Q ^{−1} is computed, \(\hat {e}_{j}\) for all j can be computed using (12), and these values can be used to compute PRESS as \(\sum _{j=1}^{n}\hat {e_{j}}^{2}\). To estimate the correlation between the predicted and observed values of y _{ j }, the value of \(\hat {y}_{j}\) is efficiently obtained as the difference \(\hat {y}_{j}=y_{j}-\hat {e}_{j}\).
for sufficiently large value of \(\sigma _{L}^{2}\). Then under this assumption, E(y)=0 and \(var\left (\boldsymbol {y}\right)=\mathbf {X}^{*}\boldsymbol {\Sigma }\boldsymbol {X}^{*'}+\mathbf {I}\sigma _{e}^{2}=\boldsymbol {V}^{*}\), and thus \(\hat {y}_{j}=\boldsymbol {V}_{j,-j}^{*}\boldsymbol {V}_{-j,-j}^{*-1}\mathbf {y}_{-j}\) is the BLP from the random effects rather than mixed effects model of y _{ j } given y _{−j }. This BLP obtained from the model with random μ will be numerically very close to the BLUP obtained from the mixed model with fixed μ. The Q matrix corresponding to the BLP with random μ is constructed as \(\left [\begin {array}{cc} \mathbf {y'y} & \mathbf {y^{{\prime }}}\\ \mathbf {y} & \mathbf {V}^{*} \end {array}\right ]\) and prediction residuals are obtained as (12).
Numerical example
Phenotypes and genotypes at 5 markers for 3 individuals used in the numerical example
M1 | M2 | M3 | M4 | M5 | Phenotypes | |
---|---|---|---|---|---|---|
1 | 1 | 2 | 1 | 2 | 2 | 1.97 |
2 | 2 | 1 | 0 | 1 | 1 | 2.12 |
3 | 0 | 0 | 2 | 1 | 2 | –0.62 |
Diagonal elements of H in LOOCV strategy for MEM and C for BVM
j=1 | j=2 | j=3 | |
---|---|---|---|
H _{ jj } | 0.46 | 0.51 | 0.55 |
C _{ jj } | 0.46 | 0.51 | 0.55 |
Q matrix in strategy II for BVM
1 | 2 | 3 | 4 | |
---|---|---|---|---|
1 | 8.75 | 1.97 | 2.12 | –0.62 |
2 | 1.97 | 1,002.40 | 1,000.80 | 1,000.80 |
3 | 2.12 | 1,000.80 | 1,001.70 | 1,000.30 |
4 | –0.62 | 1,000.80 | 1,000.30 | 1,001.90 |
Prediction errors from different LOOCV strategies (different strategies gave identical prediction errors)
j=1 | j=2 | j=3 | |
---|---|---|---|
\(\hat {e_{j}}\) | 1.13 | 1.21 | –2.66 |
Simulation to compare efficiency
Two datasets were simulated using XSim [8], where 1,000 offspring were sampled from random mating of 100 parents for 10 non-overlapping generations, to compare the computational efficiencies for naive and efficient strategies using BVM or MEM for LOOCV in GBLUP. Dataset I was simulated with 1,000 observations and 10,000 SNP markers for a p≫n scenario. Dataset II was simulated with 1,000 observations and 100 markers for a n≫p scenario. The processor used in the analyses was a 1.4 GHz Intel Core i5 with 4 GB of memory.
Efficiency of alternative LOOCV strategies for GBLUP
Alternative LOOCV strategies | |||||
---|---|---|---|---|---|
Naive MEM | Naive BVM | Efficient MEM | Efficient BVM I | Efficient BVM II | |
n=1,000; p=10,000 | 9,490.608 | 2,442.590 | 105.141 | 3.107 | 5.945 |
n=1,000; p=100 | 2.979 | 169.928 | 0.030 | 2.725 | 0.217 |
Discussion
In genomic prediction, the candidates to be predicted are often offspring that are genotyped but not yet phenotyped. In this situation, LOOCV using all individuals in the training dataset will provide an upper bound for the accuracy of prediction, because ancestors in the training dataset with large numbers of descendants have more accurate predictions than descendants. A better estimate of the accuracy of prediction can be obtained by applying LOOCV to only terminal offspring in the training dataset.
Conclusions
Efficient strategies for LOOCV in GBLUP are presented in this paper. LOOCV strategy I and II for BVM are more efficient when p≫n. LOOCV strategy for MEM is more efficient when n≫p. The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of y _{ j }, and this correlation can be estimated efficiently using LOOCV strategies. Compared to naive application of LOOCV, which is computationally intensive, LOOCV can be implemented efficiently.
Declarations
Acknowledgements
Not applicable.
Funding
This work was supported by the US Department of Agriculture, Agriculture and Food Research Initiative National Institute of Food and Agriculture Competitive grant no. 2015-67015-22947. The funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing of the manuscript.
Availability of data and materials
The datasets during and/or analysed during the current study available from the corresponding author on reasonable request.
Authors’ contributions
All authors contributed to the development of the statistical methods. HC wrote the program code and conducted the analyses. The manuscript was prepared by HC, RLF and DG. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001; 157:1819–29.PubMedPubMed CentralGoogle Scholar
- Nejati-Javaremi A, Smith C, Gibson JP. Effect of total allelic relationship on accuracy of evaluation and response to selection. J Anim Sci. 1997; 75:1738–45.View ArticlePubMedGoogle Scholar
- Fernando RL. Genetic evaluation and selection using genotypic, phenotypic and pedigree information. Proc 6th Wld Cong Genet Appl Livest Prod. 1998; 26:329–36.Google Scholar
- Strandén I, Garrick DJ. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci. 2009; 92(6):2971–75.View ArticlePubMedGoogle Scholar
- Allen DM. The relationship between variable selection and data augmentation and a method for prediction. Technometrics. 1974; 16(1):125–7.View ArticleGoogle Scholar
- Henderson CR. Applications of Linear Models in Animal Breeding. Guelph: Univ. Guelph; 1984.Google Scholar
- Searle SR. Matrix Algebra Useful for Statistics. New York: John Wildy and Sons, Inc; 1982.Google Scholar
- Cheng H, Garrick D, Fernando R. Xsim: Simulation of descendants from ancestors with sequence data. G3. 2015; 5(7):1415–17.View ArticlePubMedPubMed CentralGoogle Scholar