Efficient strategies for leave-one-out cross validation for genomic best linear unbiased prediction

Background: A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction, using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model.

Methods: Naive application of leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.

Results: The efficient leave-one-out cross validation strategies were 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers, and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model will increase with the number of observations.

Conclusions: Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.


Background
A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction (BLUP) [1], using whole-genome data. Breeding values are defined as the sum of the effects of all the markers or haplotypes, and their estimates are widely used for prediction of the merit of selection candidates. Estimates of marker or haplotype effects are used to predict breeding values of individuals that were not present in a previous analysis, commonly referred to as training. An alternative, earlier published approach to using marker or haplotype information fits breeding values as random effects with covariances defined by a "genomic relationship matrix" computed from genotypes [2]. These two models have been shown to be equivalent in terms of predicting breeding values [3,4], and we refer to them here as marker effect models (MEM) or breeding value models (BVM), the latter often known as Genomic Best Linear Unbiased Prediction (GBLUP).
Cross validation is often used to quantify the predictive ability of a statistical model. In k-fold cross validation, the whole dataset is partitioned into k parts and k analyses are performed; in each analysis one part is omitted from training and used for validation. Leave-one-out cross validation (LOOCV) is the special case of k-fold cross validation with k = n, the number of observations. When the dataset is small, LOOCV is appealing because the size of the training set is maximized. However, naive application of LOOCV is computationally intensive, requiring n analyses.
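To make the cost of the naive approach concrete, the n-fold refitting can be sketched as follows. This is an illustrative Python/NumPy example on simulated toy data; the sizes, seed, the ridge parameter `lam`, and the helper `fit_mem` are all hypothetical choices of this sketch, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 50                                  # hypothetical sizes
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))   # marker covariates (0/1/2 genotypes)
y = X @ rng.normal(0, 0.1, p) + rng.normal(0, 1, n)
lam = 10.0                                     # lambda = sigma2_e / sigma2_beta

def fit_mem(X, y, lam):
    """Solve the MEM mixed model equations for [mu, beta]."""
    Xs = np.hstack([np.ones((len(y), 1)), X])  # X* = [1 X]
    D = np.eye(Xs.shape[1]); D[0, 0] = 0.0     # no shrinkage on the mean
    return np.linalg.solve(Xs.T @ Xs + D * lam, Xs.T @ y)

def naive_loocv(X, y, lam):
    """Refit the model n times, each time leaving one observation out."""
    e = np.empty(len(y))
    for j in range(len(y)):
        keep = np.arange(len(y)) != j
        b = fit_mem(X[keep], y[keep], lam)
        e[j] = y[j] - np.hstack([1.0, X[j]]) @ b
    return e

press = np.sum(naive_loocv(X, y, lam) ** 2)
```

Each of the n refits solves a full system of equations, which is what the strategies below avoid.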
We show below how LOOCV can be performed using either the MEM or BVM with little more effort than is required for a single analysis with n observations.

Methods
Use of the MEM is more efficient when the number n of individuals is larger than the number p of markers, because for this model the mixed model equations are of order p plus the number of other effects. When n < p, estimated breeding values can be obtained more efficiently by solving the mixed model equations for the BVM of order n plus the number of other effects. We deal with the special case where the only other effect is a general mean and phenotypes have been pre-corrected for other nuisance variables. Efficient strategies for LOOCV using this special case for MEM when n ≥ p and BVM when p ≥ n are shown below.

Marker effect models
The MEM for GBLUP can be written as

    y = 1μ + Xβ + e,  (1)

where y, an n × 1 vector of phenotypes, has been pre-corrected for all fixed effects other than μ, the overall mean, X is the n × p matrix of marker covariates, β is a p × 1 random vector of allele substitution effects, and e is an n × 1 random vector of residuals. Often it is assumed that marker effects are identically and independently distributed (iid) random variables with null means and variances σ²_β. Thus, under the usual assumption that the residuals are iid with null means and variances σ²_e, E(y) = 1μ. When the MEM is used, LOOCV can be performed using a well-known strategy from least-squares regression for computing the predicted residual sum of squares (PRESS) [5] statistic.

LOOCV strategy for MEM
BLUP of β*' = [μ β'] can be obtained by solving the mixed model equations

    (X*'X* + Dλ) β̂* = X*'y,  (2)

where X* = [1 X], D is a diagonal matrix whose elements are 0 followed by a vector of p ones, and λ = σ²_e/σ²_β. When the jth observation is left out, the BLUP of β* becomes

    β̂*_{-j} = (X*_{-j}'X*_{-j} + Dλ)^{-1} X*_{-j}'y_{-j} = (X*'X* + Dλ − x*_j x*_j')^{-1} (X*'y − x*_j y_j),  (3)

where X*_{-j} is X* with the jth row removed, y_{-j} is y with the jth element removed, and x*_j' is the jth row of X*. From the matrix inverse lemma,

    (X*'X* + Dλ − x*_j x*_j')^{-1} = (X*'X* + Dλ)^{-1} + (X*'X* + Dλ)^{-1} x*_j x*_j' (X*'X* + Dλ)^{-1} / (1 − H_jj),  (4)

where the quadratic H_jj = x*_j'(X*'X* + Dλ)^{-1} x*_j is the jth diagonal element of H = X*(X*'X* + Dλ)^{-1}X*'. Using (4) in (3) gives

    x*_j'β̂*_{-j} = (x*_j'β̂* − H_jj y_j) / (1 − H_jj),  (5)

so the prediction residual for the jth observation can be written as

    ê_j = y_j − x*_j'β̂*_{-j} = (y_j − x*_j'β̂*) / (1 − H_jj).  (6)

These prediction errors can be squared and accumulated over the n realizations to compute PRESS, defined as Σ_{j=1}^{n} ê_j². The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of y_j, and that correlation can be estimated from the values of ŷ_j, which can be computed efficiently as ŷ_j = y_j − ê_j, using the observed values of y_j. When a specific group of individuals is of interest, prediction accuracies and PRESS can also be calculated using the ê_j for individuals in that group.
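The one-pass computation of all the LOOCV residuals from a single fit can be sketched as follows. This is an illustrative Python/NumPy example on simulated toy data (all names, sizes, and the seed are hypothetical); a naive refit is used to check one residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 25, 40
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
y = X @ rng.normal(0, 0.1, p) + rng.normal(0, 1, n)
lam = 10.0                                   # lambda = sigma2_e / sigma2_beta

Xs = np.hstack([np.ones((n, 1)), X])         # X* = [1 X]
D = np.eye(p + 1); D[0, 0] = 0.0             # mean is unpenalized
Cinv = np.linalg.inv(Xs.T @ Xs + D * lam)    # (X*'X* + D lambda)^{-1}
beta_hat = Cinv @ Xs.T @ y                   # BLUP from the full data
h = np.einsum('ij,jk,ik->i', Xs, Cinv, Xs)   # diagonal of H = X* Cinv X*'
e_hat = (y - Xs @ beta_hat) / (1 - h)        # all n LOO residuals, one pass
press = np.sum(e_hat ** 2)

# naive check: refit without observation j and compare
j = 0
keep = np.arange(n) != j
b_j = np.linalg.solve(Xs[keep].T @ Xs[keep] + D * lam, Xs[keep].T @ y[keep])
assert np.isclose(e_hat[j], y[j] - Xs[j] @ b_j)
```

A single solve plus the diagonal of H replaces the n refits of the naive approach.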

Breeding value models
When n < p, the genomic prediction of the breeding value x_j'β can be obtained more efficiently by solving the mixed model equations for the BVM:

    y = 1μ + Zu + e,  (7)

where u = Xβ, var(u) = XX'σ²_β, Z is the identity matrix of order n, and the other variables are as in the MEM. Further, in both models E(y) = 1μ and var(y) = XX'σ²_β + Iσ²_e. These two models are said to be equivalent [6], and linear functions predicted from one model are identical to the corresponding predictions from the other model. Two efficient strategies for LOOCV using the BVM are shown below.
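The equivalence of the two models can be checked numerically. The following Python/NumPy sketch (toy simulated data; all variable names are assumptions of this example) solves the order-(p + 1) MEM equations and the order-(n + 1) BVM equations and compares the fitted values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 50                        # n < p, where the BVM is cheaper
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
y = X @ rng.normal(0, 0.1, p) + rng.normal(0, 1, n)
lam = 10.0                           # lambda = sigma2_e / sigma2_beta

# MEM: solve for (mu, beta), system of order p + 1
Xs = np.hstack([np.ones((n, 1)), X])
D = np.eye(p + 1); D[0, 0] = 0.0
sol_mem = np.linalg.solve(Xs.T @ Xs + D * lam, Xs.T @ y)
yhat_mem = Xs @ sol_mem              # mu_hat + X beta_hat

# BVM: solve for (mu, u) with var(u) = XX' sigma2_beta, system of order n + 1
G = X @ X.T
ones = np.ones((n, 1))
lhs = np.block([[ones.T @ ones, ones.T],
                [ones,          np.eye(n) + np.linalg.inv(G) * lam]])
rhs = np.concatenate([ones.T @ y, y])
sol_bvm = np.linalg.solve(lhs, rhs)
yhat_bvm = sol_bvm[0] + sol_bvm[1:]  # mu_hat + u_hat

assert np.allclose(yhat_mem, yhat_bvm)
```

The fitted values agree to machine precision, which is the equivalence exploited below.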

LOOCV strategy I for BVM
The mixed model equations for this model are

    (Z*'Z* + G*λ) [μ̂' û']' = Z*'y,  (8)

where Z* = [1 Z] and G* is a block-diagonal matrix whose leading element, corresponding to μ, is 0, followed by G^{-1} = (XX')^{-1}. This coefficient matrix is of order n + 1 rather than the order p + 1 of the MEM. Due to the equivalence of the MEM and the BVM, x*_j'β̂* = μ̂ + û_j, and H_jj σ²_e = var(x*_j'β* − x*_j'β̂*) = var(u*_j − û*_j), with u*_j = μ + u_j, can be obtained more efficiently as z*_j'(Z*'Z* + G*λ)^{-1} z*_j σ²_e, where z*_j' is the jth row of Z*. Using these two equalities, the formula for ê_j becomes

    ê_j = (y_j − μ̂ − û_j) / (1 − C_jj),  (9)

where the quadratic C_jj = z*_j'(Z*'Z* + G*λ)^{-1} z*_j is the jth diagonal element of C = Z*(Z*'Z* + G*λ)^{-1}Z*'.
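Strategy I can be sketched as follows, again as an illustrative Python/NumPy example on hypothetical simulated data, verifying one residual against a naive leave-one-out refit of the equivalent MEM:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 60                                 # n < p
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
y = X @ rng.normal(0, 0.1, p) + rng.normal(0, 1, n)
lam = 10.0                                    # lambda = sigma2_e / sigma2_beta

G = X @ X.T
ones = np.ones((n, 1))
M = np.block([[np.array([[n]]), ones.T],      # Z*'Z* + G* lambda, Z = I
              [ones, np.eye(n) + np.linalg.inv(G) * lam]])
Minv = np.linalg.inv(M)
sol = Minv @ np.concatenate([[y.sum()], y])   # right-hand side Z*'y
yhat = sol[0] + sol[1:]                       # mu_hat + u_hat

Ws = np.hstack([ones, np.eye(n)])             # rows z*_j' = [1, z_j']
c = np.einsum('ij,jk,ik->i', Ws, Minv, Ws)    # diagonal of C
e_hat = (y - yhat) / (1 - c)                  # LOO residuals from one solve
press = np.sum(e_hat ** 2)

# naive check via the equivalent MEM, refit without observation 0
Xs = np.hstack([ones, X]); D = np.eye(p + 1); D[0, 0] = 0.0
keep = np.arange(n) != 0
b0 = np.linalg.solve(Xs[keep].T @ Xs[keep] + D * lam, Xs[keep].T @ y[keep])
assert np.isclose(e_hat[0], y[0] - Xs[0] @ b0)
```

Only the order-(n + 1) system is solved and inverted, regardless of the number of markers.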

LOOCV strategy II for BVM
Another efficient strategy for the BVM is shown here. First we consider the situation where y has been pre-corrected for μ in addition to the nuisance effects, so that E(y) = 0, and we define var(y) = XX'σ²_β + Iσ²_e = V. Now the matrix Q is constructed by augmenting the covariance matrix of y with one leading row and column as

    Q = [y'y  y'; y  V].
To obtain the prediction error for observation j, the second row and column of Q are permuted with row and column j + 1. In this manner, Q has its rows and columns symmetrically permuted as P_j Q P_j = W, where the permutation matrix P_j is obtained by permuting the second row of the identity matrix of order n + 1 with row j + 1. So the permuted matrix is

    W = [A  B; B'  C],

where we define the leading 2 × 2 matrix as A = [y'y  y_j; y_j  V_jj], the other partitions as B = [y_{-j}'; V_{j,-j}] and C = V_{-j,-j}, where −j denotes that the jth element, row or column has been removed. Defining W^{11} as the top left or leading 2 × 2 sub-matrix of W^{-1}, corresponding to the position of A in W, and using partitioned inverse-matrix identities [7], the inverse of W^{11} can be written as

    (W^{11})^{-1} = A − B C^{-1} B' = [y'y − y_{-j}'V^{-1}_{-j,-j}y_{-j}   y_j − y_{-j}'V^{-1}_{-j,-j}V_{-j,j}; y_j − V_{j,-j}V^{-1}_{-j,-j}y_{-j}   V_jj − V_{j,-j}V^{-1}_{-j,-j}V_{-j,j}].  (10)

Now V_{j,-j} in element (2, 1) of the above inverse matrix is the vector of covariances between y_j and y_{-j}, and V^{-1}_{-j,-j} is the inverse of the covariance matrix of y_{-j}. Thus, ŷ_j = V_{j,-j}V^{-1}_{-j,-j}y_{-j} is the Best Linear Predictor (BLP) of y_j given y_{-j}, and element (2, 1) of (10) is the prediction error ê_j = y_j − ŷ_j. Element (2, 2) of (10) is the prediction error variance (PEV) for y_j, where PEV_j = var(y_j − ŷ_j). PEV can also be used to calculate the theoretical reliability for individual j as 1 − PEV_j/V_jj, and characterizing the distribution of reliability across all individuals in a dataset has a number of practical applications. Note that this allows us to obtain the PEV of every individual, and the distribution of these values provides information on the robustness of genomic predictions across the population of individuals represented in the dataset. The PEV is determined by the genomic variance-covariance matrix and does not depend on y; two different datasets could therefore generate the same PRESS statistic but with different distributions of PEV.
Now, because the permutation matrix P_j is orthogonal, W^{-1} = (P_j Q P_j)^{-1} = P_j Q^{-1} P_j, and the elements of W^{11} that are of interest for predicting individual j can be obtained directly from Q^{-1} as

    W^{11} = [q^{1,1}  q^{1,(1+j)}; q^{(1+j),1}  q^{(1+j),(1+j)}],  (11)

where q^{i,j} is the element from row i and column j of Q^{-1}. It follows that ê_j, which is the off-diagonal element of the inverse of the 2 × 2 matrix W^{11}, can be written in terms of Q^{-1} as

    ê_j = −q^{(1+j),1} / (q^{1,1} q^{(1+j),(1+j)} − q^{1,(1+j)} q^{(1+j),1}).  (12)

Thus, once Q^{-1} is computed, ê_j for all j can be computed with little additional effort. To estimate the correlation between the predicted and observed values of y_j, the value of ŷ_j is efficiently obtained as the difference ŷ_j = y_j − ê_j.

Now we consider the situation without pre-correcting y for μ, where E(y) = 1μ, so that the mixed model (7) contains both fixed and random effects. The mixed model equations that correspond to this mixed effects model can be derived by treating μ as "random" with null mean and a sufficiently large variance σ²_L. Then, under this assumption, E(y) = 0 and

    var(y) = 11'σ²_L + XX'σ²_β + Iσ²_e = V*,

and thus ŷ_j = V*_{j,-j} V*^{-1}_{-j,-j} y_{-j} is the BLP, from the random effects rather than the mixed effects model, of y_j given y_{-j}. This BLP obtained from the model with random μ will be numerically very close to the BLUP obtained from the mixed model with fixed μ. The Q matrix corresponding to the BLP with random μ is constructed as

    Q = [y'y  y'; y  V*],

and prediction residuals are obtained as in (12).
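Strategy II reduces all n leave-one-out analyses to a single inversion of the order-(n + 1) matrix Q. The following is a minimal Python/NumPy sketch for the pre-corrected case, on hypothetical toy data, that also recovers the PEV for each observation and checks one residual against the direct BLP:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 15, 80
X = rng.choice([0.0, 1.0, 2.0], size=(n, p))
X = X - X.mean(axis=0)                # center covariates
s2b, s2e = 0.1, 1.0                   # sigma2_beta, sigma2_e (hypothetical)
y = X @ rng.normal(0, np.sqrt(s2b), p) + rng.normal(0, np.sqrt(s2e), n)

V = X @ X.T * s2b + np.eye(n) * s2e   # var(y)
Q = np.block([[np.array([[y @ y]]), y[None, :]],
              [y[:, None], V]])
Qi = np.linalg.inv(Q)                 # one inversion gives all n LOO results

e_hat = np.empty(n); pev = np.empty(n)
for j in range(n):
    a, b, d = Qi[0, 0], Qi[0, 1 + j], Qi[1 + j, 1 + j]
    det = a * d - b * b
    e_hat[j] = -b / det               # element (2,1) of inv(W11), eq. (12)
    pev[j] = a / det                  # element (2,2): PEV of y_j

# check observation 0 against the direct BLP V[j,-j] V[-j,-j]^{-1} y[-j]
keep = np.arange(n) != 0
yhat0 = V[0, keep] @ np.linalg.solve(V[np.ix_(keep, keep)], y[keep])
assert np.isclose(e_hat[0], y[0] - yhat0)
```

The same loop yields the full distribution of PEV, and hence of reliability, at no extra cost.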

Numerical example
Phenotypes y and genotypes X at 5 markers for 3 individuals are in Table 1. Assume σ²_β = σ²_e/10 and that the overall mean μ is the only fixed effect. For the LOOCV strategy for the MEM and strategy I for the BVM, the diagonal elements of H for the MEM and of C for the BVM, which appear in the denominators of (6) and (9), are in Table 2. The numerators of (6) and (9) are obtained by solving the mixed model equations (2) and (8). The prediction errors are then calculated as in (6) and (9) and shown in Table 4. For LOOCV strategy II for the BVM, the Q matrix (Table 3) is constructed using σ²_L = 1,000, which is sufficiently large relative to σ²_e for μ to be indistinguishable from a fixed effect with a flat prior. The prediction errors are calculated as in (12) and shown in Table 4. The MEM strategy and BVM strategy I gave identical prediction errors and identical PRESS for this numerical example, and these were numerically very close to those from BVM strategy II.
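Because the values in Tables 1, 2, 3 and 4 are not reproduced here, the following Python/NumPy sketch uses hypothetical stand-in data of the same shape (3 individuals, 5 markers) to illustrate the comparison: the MEM strategy and BVM strategy II with a large σ²_L give numerically very close prediction errors.

```python
import numpy as np

# hypothetical stand-in for Table 1 (the paper's actual values are not reproduced)
X = np.array([[1.0, 0.0, 2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 2.0, 1.0],
              [2.0, 1.0, 0.0, 0.0, 1.0]])
y = np.array([1.2, 0.4, -0.6])
n, p = X.shape
s2e = 1.0; s2b = s2e / 10.0; lam = s2e / s2b   # lambda = 10

# MEM strategy: e_j = (y_j - x*_j' beta_hat) / (1 - H_jj), eq. (6)
Xs = np.hstack([np.ones((n, 1)), X])
D = np.eye(p + 1); D[0, 0] = 0.0
Ainv = np.linalg.inv(Xs.T @ Xs + D * lam)
h = np.einsum('ij,jk,ik->i', Xs, Ainv, Xs)
e_mem = (y - Xs @ (Ainv @ Xs.T @ y)) / (1 - h)

# BVM strategy II with mu treated as random, variance sigma2_L = 1000
s2L = 1000.0
Vstar = np.ones((n, n)) * s2L + X @ X.T * s2b + np.eye(n) * s2e
Q = np.block([[np.array([[y @ y]]), y[None, :]], [y[:, None], Vstar]])
Qi = np.linalg.inv(Q)
e_bvm2 = np.array([-Qi[1 + j, 0] /
                   (Qi[0, 0] * Qi[1 + j, 1 + j] - Qi[0, 1 + j] * Qi[1 + j, 0])
                   for j in range(n)])

# the two strategies agree closely, but not exactly, because mu is
# only approximately a fixed effect when sigma2_L is finite
assert np.allclose(e_mem, e_bvm2, rtol=1e-2, atol=1e-3)
```

Increasing `s2L` further drives the two sets of residuals still closer together.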

Simulation to compare efficiency
Two datasets were simulated using XSim [8], where 1,000 offspring were sampled from random mating of 100 parents for 10 non-overlapping generations, to compare the computational efficiencies of the naive and efficient strategies using the BVM or MEM for LOOCV in GBLUP. Dataset I was simulated with 1,000 observations and 10,000 SNP markers for a p ≫ n scenario. Dataset II was simulated with 1,000 observations and 100 markers for an n ≫ p scenario. The processor used in the analyses was a 1.4 GHz Intel Core i5 with 4 GB of memory.
All strategies were implemented in Julia, a scientific programming language, and gave virtually identical prediction accuracies, defined as the correlation between y and ŷ, for each dataset. For dataset I, the efficient BVM strategy was 786 times faster than the naive application (3.107 s versus 2,442.59 s; Table 5). For dataset II, the efficient MEM strategy was 99 times faster than the naive application (0.030 s versus 2.979 s; Table 5).

Discussion
In genomic prediction, the candidates to be predicted are often offspring that are genotyped but not yet phenotyped. In this situation, LOOCV using all individuals in the training dataset will provide an upper bound for the accuracy of prediction, because ancestors in the training dataset with large numbers of descendants are predicted more accurately than their descendants. A better estimate of the accuracy of prediction can be obtained by applying LOOCV only to the terminal offspring in the training dataset.

Conclusions
Efficient strategies for LOOCV in GBLUP are presented in this paper. LOOCV strategies I and II for the BVM are more efficient when p ≫ n, and the LOOCV strategy for the MEM is more efficient when n ≫ p. The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of y_j, and this correlation can be estimated efficiently using these LOOCV strategies. In contrast to the naive application of LOOCV, which is computationally intensive, they require little more effort than a single analysis.