Methodology Article | Open | Published:

# Efficient strategies for leave-one-out cross validation for genomic best linear unbiased prediction

*Journal of Animal Science and Biotechnology***volume 8**, Article number: 38 (2017)

## Abstract

### Background

A random multiple-regression model that simultaneously fit all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction, using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model.

### Methods

Naive application of Leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient Leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.

### Results

Efficient Leave-one-out cross validation strategies is 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model will increase with increases in the number of observations.

### Conclusions

Efficient Leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.

## Background

A random multiple-regression model that simultaneously fit all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction (BLUP) [1], using whole-genome data. Breeding values are defined as the sum of the effects of all the markers or haplotypes, and their estimates are widely used for prediction of the merit of selection candidates. Estimates of marker or haplotype effects are used to predict breeding values of individuals that were not present in a previous analysis commonly referred to as training. An alternative earlier published approach to use marker or haplotype information fits breeding values as random effects based on covariances defined by a “genomic relationship matrix” computed from genotypes [2]. These two models have been shown to be equivalent in terms of predicting breeding values [3, 4] and we refer to them here as marker effect models (MEM) or breeding value models (BVM), the latter often known as Genomic Best Linear Unbiased Prediction (GBLUP).

Cross validation is often used to quantify the predictive ability of a statistical model. In k-fold cross validation, the whole dataset is partitioned into k parts with k analyses, where one part is omitted for training with validation on the omitted part. Leave-one-out cross validation (LOOCV) is a special case of k-fold cross validation with *k*=*n*, the number of observations. When the dataset is small, leave-one-out cross validation is appealing as the size of the training set is maximized. However, naive application of LOOCV is computationally intensive, requiring *n* analyses.

We show below how LOOCV can be performed using either the MEM or BVM with little more effort than is required for a single analysis with *n* observations.

## Methods

Use of the MEM is more efficient when the number *n* of individuals is larger than the number *p* of markers, because for this model the mixed model equations are of order *p* plus the number of other effects. When *n*<*p*, estimated breeding values can be obtained more efficiently by solving the mixed model equations for the BVM of order *n* plus the number of other effects. We deal with the special case where the only other effect is a general mean and phenotypes have been pre-corrected for other nuisance variables. Efficient strategies for LOOCV using this special case for MEM when *n*≥*p* and BVM when *p*≥*n* are shown below.

### Marker effect models

The MEM for GBLUP can be written as

where ** y**, a

*n*×1 vector for phenotypes, has been pre-corrected for all fixed effects other than

*μ*, the overall mean,

**is the**

*X**n*×

*p*matrix of marker covariates, $\mathcal {\boldsymbol {\beta }}$ is a

*p*×1 random vector of the allele substitution effects and

**is a**

*e**n*×1 random vector of residuals. Often it is assumed that marker effects are identically and independently distributed (iid) random variables with null means and variances $\sigma _{\beta }^{2}$. Thus, under the usual assumption that the residuals are iid with null means and variances $\sigma _{e}^{2}$,

*E*(

**y**)=

**1**

*μ*. When MEM is used, LOOCV can be performed by using a well-known strategy used in least-squares regression to compute the predicted residual sum of square (PRESS) [5] statistic.

#### LOOCV strategy for MEM

BLUP of $\mathcal {\boldsymbol {\beta }}^{*}=\left [ \begin {array}{c} \mathbf {\mu }\\ \boldsymbol {\beta } \end {array}\right ]$ can be obtained by solving the mixed model equations

where *X*^{∗}= [ **1**
** X**], ${\hat {\boldsymbol {\beta }^{*}}}=\left [ \begin {array}{c} \hat {\mathbf {\mu }}\\ \hat {\mathbf {\boldsymbol {\beta }}} \end {array}\right ]$,

**is a diagonal matrix whose elements are 0 followed by a**

*D**p*vector of 1s and $\lambda =\frac {\sigma _{e}^{2}}{\sigma _{\beta }^{2}}$.

Now, BLUP for ${\boldsymbol {\beta }_{-j}^{*}}$, where observation *j* is left out, can be obtained as

where $\boldsymbol {X}_{-j}^{*}$ is *X*^{∗} with the *j*th row removed and *y*_{−j
} is ** y** with the

*j*th element removed.

Suppose $\boldsymbol {x}^{*{\prime }}_{j}$ is the *j*th row of *X*^{∗}, then from the matrix inverse lemma [4],

where the quadratic $H_{jj}=\boldsymbol {x}^{*}_{j'}\left (\boldsymbol {X}^{*{\prime }}\boldsymbol {X}^{*}+\boldsymbol {D}\lambda \right)^{-1}\boldsymbol {x}^{*}_{j}$ is the *j*th diagonal element of ** H**=

*X*^{∗}(

*X*^{∗′}

*X*^{∗}+

*D**λ*)

^{−1}

*X*^{∗′}.

Using (3) in (4), the prediction residual for the *j*th observation can be written as

These prediction errors can be squared and accumulated over *n* realizations to compute PRESS defined as $\sum _{j=1}^{n}\hat {e_{j}}^{2}$. The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of *y*
_{
j
}, and that correlation can be estimated from the values of $\hat {y}_{j}$, which can be computed efficiently as $\hat {y}_{j}=y_{j}-\hat {e}_{j}$, using the observed values of *y*
_{
j
}. When a specific group of individuals is of interest, prediction accuracies and PRESS can also be calculate using $\hat {e_{j}}$ for individuals in that group.

### Breeding value models

When *n*<*p*, the genomic prediction of the breeding value $\mathbf {x}_{j}'\hat {\mathbf {\boldsymbol {\beta }}}$ can be obtained more efficiently by solving the mixed model equations for the BVM:

where **u**=**X**
**β**, ${\text {var}\left (\mathbf {u}\right)=\mathbf {XX'}\sigma _{\beta }^{2}}$, **Z** is the identity matrix of order *n* and other variables are as in the MEM. Further, in both models E(**y**)=**1**
*μ*, and $\text {var}(\mathbf {y})=\mathbf {X}\mathbf {X}'\sigma _{\beta }^{2}+\mathbf {I}\sigma _{e}^{2}$. These two models are said to be equivalent [6], and linear functions predicted from one model are identical to corresponding predictions from the other model. Two efficient strategies for LOOCV using the BVM are shown below.

#### LOOCV strategy I for BVM

The mixed model equations for this model are:

where $\boldsymbol {Z}^{*}=[\!\mathbf {1}~~ \boldsymbol {Z}],\boldsymbol {\hat {u}}^{*}=\left [ \begin {array}{c} \hat {\mathbf {\mu }}\\ \hat {\mathbf {\boldsymbol {u}}} \end {array}\right ], \mathbf {G}=\left [ \begin {array}{cc} 0 & \mathbf {0^{'}}\\ \mathbf {0} & \left (\boldsymbol {XX'}\right)^{-1} \end {array}\right ]$. Due to the relative order of the coefficient matrices for the MEM and the BVM, when *n*<*p*, $\boldsymbol {x}^{*{\prime }}_{j}\hat {\boldsymbol {\beta }^{*}}$ is more efficiently obtained as $\hat {u^{*}}_{j}$. Similarly, $\text {var}(\boldsymbol {x}^{*{\prime }}_{j}\boldsymbol {\beta }^{*}-\boldsymbol {x}^{*{\prime }}_{j}\hat {\boldsymbol {\beta }^{*}})=\mathbf {x^{*{\prime }}}_{j} \left (\mathbf {X}^{*{\prime }}\mathbf {X}^{*}+\mathbf {D}\lambda \right)^{-1}\mathbf {x}^{*}_{j}\sigma _{e}^{2}$ can be obtained more efficiently as $\text {var}(u_{j}^{*}-\hat {u^{*}}_{j})=\mathbf {z^{*}}_{j'}\left (\mathbf {Z}^{*{\prime }}\mathbf {Z^{*}}+\mathbf {G}\lambda \right)^{-1} \mathbf {z^{*}}_{j}\sigma _{e}^{2}$. Using these two equalities, the formula for $\hat {e}_{j}$ becomes:

where the quadratic $\mathbf {z}^{*{\prime }}_{j}\left (\mathbf {Z}^{*{\prime }}\mathbf {Z}^{*}+\mathbf {G}\lambda \right)^{-1}\mathbf {z}^{*}_{j}$ is the *j*th diagonal element of ** C**=

*Z*^{∗}(

*Z*^{∗′}

*Z*^{∗}+

*G**λ*)

^{−1}

*Z*^{∗′}.

#### LOOCV strategy II for BVM

Another efficient strategy for BVM is shown here. First we consider the situation where ** y** has been pre-corrected for

*μ*in addition to nuisance effects so that E(

**y**)=

**0**and we define $\mathbf {var\left (\boldsymbol {y}\right)=X}\mathbf {X}'\sigma _{\beta }^{2}+\mathbf {I}\sigma _{e}^{2}=\boldsymbol {V}$. Now matrix

**is constructed by augmenting the covariance matrix of**

*Q***with one leading row and column as**

*y*To obtain the prediction error for observation *j*, the second row and column of **Q** are permuted with row and column *j*+1. In this manner **Q** has its rows and columns symmetrically permuted as **P**
*j*′**Q**
**P**
_{
j
}
** =W**, where the permutation matrix

**P**

_{ j }is obtained by permuting the second row of the

*n*order identity matrix with row

*j*+1. So the permuted matrix is:

where we will define the leading 2×2 matrix as $\mathbf {A}=\left [ \begin {array}{cc}\mathbf {y'y} & y_{j}\\ y_{j} & V_{jj} \end {array}\right ]$, and the other partitions as $\mathbf {B}=\left [ \begin {array}{c}\mathbf {y}_{-j}^{{\prime }}\\ \mathbf {V}_{j,-j} \end {array}\right ]$, and **C**=**V**
_{−j,−j
}, where −*j* denotes that the *j*th element, row or column has been removed. Defining *W*
^{11} as the top left or leading 2×2 sub-matrix in *W*
^{−1} corresponding to the position of **A** in **W**, and using partitioned inverse-matrix identities [7], the inverse of *W*
^{11} can be written as,

Now *V*_{
j,−j
} in element (2,1) of the above inverse matrix is the vector of covariances between *y*
_{
j
} and **y**
_{−j
} and $\mathbf {V_{-j,-j}^{-1}}$ is the inverse of the covariance matrix of **y**
_{−j
}. Thus, $\hat {y}_{j}=\boldsymbol {V}_{j,-j}\boldsymbol {V}_{-j,-j}^{-1}\mathbf {y}_{-j}$ is the Best Linear Predictor (BLP) of *y*
_{
j
} given **y**
_{−j
}, and element (2,1) of (10) is the prediction error of *y*
_{
j
}. The element (2,2) in (10) is the prediction error variance (PEV) for *y*
_{
j
}, where $PEV=var(y_{j}-\hat {y}_{j})$. PEV can also be used to calculate theoretical reliability for individual *i* as $1-\frac {PEV_{i}}{V_{jj}}$, and characterizing the distributions of reliability for all the individuals in a dataset has a number of practical applications. Note this allows us to obtain the PEV of every individual and the distribution of these values provide information as to the robustness of genomic predictions across the population of individuals represented in the dataset. This PEV is determined by the genomic variance-covariance matrix and does not depend on **y**. Two different datasets could generate the same PRESS statistic but with different distributions of PEV.

Now, because the permutation matrix **P**
_{
j
} is orthogonal, $\mathbf {W}^{-1}=(\mathbf {P}_{j}^{{\prime }}\mathbf {Q}\mathbf {P}_{j})^{-1}=\mathbf {P}_{j}^{{\prime }}\mathbf {Q}^{-1}\mathbf {P}_{j}$, and the elements of **W**
^{11} that are of interest in terms of predicting individual *j* can be obtained directly from **Q**
^{−1} as

It follows that $\hat {e}_{j}$, which is the off-diagonal element of the inverse of the 2×2 matrix **W**
^{11}, can be written in terms of **Q**
^{−1} as

where *q*
^{i,j} is the element from row *i* and column *j* of **Q**
^{−1}. Thus, once **Q**
^{−1} is computed, $\hat {e}_{j}$ for all *j* can be computed using (12), and these values can be used to compute PRESS as $\sum _{j=1}^{n}\hat {e_{j}}^{2}$. To estimate the correlation between the predicted and observed values of *y*
_{
j
}, the value of $\hat {y}_{j}$ is efficiently obtained as the difference $\hat {y}_{j}=y_{j}-\hat {e}_{j}$.

Now we consider the situation without pre-correcting **y** for *μ*, where E(**y**)=**1**
*μ*. Now the mixed model (7) contains both fixed and random effects. Note that the mixed model equations that correspond to this mixed effects model can be derived by treating *μ* as “random” with null mean and large variance. So, let

for sufficiently large value of $\sigma _{L}^{2}$. Then under this assumption, E(**y**)=**0** and $var\left (\boldsymbol {y}\right)=\mathbf {X}^{*}\boldsymbol {\Sigma }\boldsymbol {X}^{*'}+\mathbf {I}\sigma _{e}^{2}=\boldsymbol {V}^{*}$, and thus $\hat {y}_{j}=\boldsymbol {V}_{j,-j}^{*}\boldsymbol {V}_{-j,-j}^{*-1}\mathbf {y}_{-j}$ is the BLP from the random effects rather than mixed effects model of *y*
_{
j
} given **y**
_{−j
}. This BLP obtained from the model with random *μ* will be numerically very close to the BLUP obtained from the mixed model with fixed *μ*. The **Q** matrix corresponding to the BLP with random *μ* is constructed as $\left [\begin {array}{cc} \mathbf {y'y} & \mathbf {y^{{\prime }}}\\ \mathbf {y} & \mathbf {V}^{*} \end {array}\right ]$ and prediction residuals are obtained as (12).

### Numerical example

Phenotypes ** y** and genotypes

**at 5 markers for 3 individuals are in Table 1. Assume $\sigma _{\beta }^{2}=\frac {\sigma _{e}^{2}}{10}$ and the overall mean**

*X**μ*is the only fixed effect. In LOOCV strategy for MEM and strategy I for BVM, the diagonal elements of

**for MEM and**

*H***for BVM, which are in the denominators of (6) and (9), are in Table 2. The numerators of (6) and (9) are obtained by solving the MME (2) and (8). Then prediction errors are calculated as in (6) and (9) and shown in Table 4. In LOOCV strategy II for BVM, the**

*C***matrix (Table 3) is constructed using $\sigma _{L}^{2}=1000$, which is sufficiently large relative to $\sigma ^{2}_{e}$ for**

*Q**μ*to be indistinguishable from a fixed effect with a flat prior. The prediction errors are calculated as (12) and shown in Table 4. The MEM strategy and BVM strategy I gave identical prediction errors and identical PRESS for this numerical example were numerically very close to those from the BVM strategy II.

### Simulation to compare efficiency

Two datasets were simulated using XSim [8], where 1,000 offspring were sampled from random mating of 100 parents for 10 non-overlapping generations, to compare the computational efficiencies for naive and efficient strategies using BVM or MEM for LOOCV in GBLUP. Dataset I was simulated with 1,000 observations and 10,000 SNP markers for a *p*≫*n* scenario. Dataset II was simulated with 1,000 observations and 100 markers for a *n*≫*p* scenario. The processor used in the analyses was a 1.4 GHz Intel Core i5 with 4 GB of memory.

For dataset II, efficient MEM is 99 times faster than the naive application (2.979 s versus 0.030 s) (Table 5). All strategies implemented in Julia, a scientific programming language, gave virtually identical prediction accuracies defined as the correlation between **y** and $\hat {\mathbf {y}}$ for each dataset. For dataset I, efficient BVM is 786 times faster than the naive application (3.107 s versus 2,442.59 s) (Table 5).

## Discussion

In genomic prediction, the candidates to be predicted are often offspring that are genotyped but not yet phenotyped. In this situation, LOOCV using all individuals in the training dataset will provide an upper bound for the accuracy of prediction, because ancestors in the training dataset with large numbers of descendants have more accurate predictions than descendants. A better estimate of the accuracy of prediction can be obtained by applying LOOCV to only terminal offspring in the training dataset.

## Conclusions

Efficient strategies for LOOCV in GBLUP are presented in this paper. LOOCV strategy I and II for BVM are more efficient when *p*≫*n*. LOOCV strategy for MEM is more efficient when *n*≫*p*. The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of *y*
_{
j
}, and this correlation can be estimated efficiently using LOOCV strategies. Compared to naive application of LOOCV, which is computationally intensive, LOOCV can be implemented efficiently.

## References

- 1
Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001; 157:1819–29.

- 2
Nejati-Javaremi A, Smith C, Gibson JP. Effect of total allelic relationship on accuracy of evaluation and response to selection. J Anim Sci. 1997; 75:1738–45.

- 3
Fernando RL. Genetic evaluation and selection using genotypic, phenotypic and pedigree information. Proc 6th Wld Cong Genet Appl Livest Prod. 1998; 26:329–36.

- 4
Strandén I, Garrick DJ. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci. 2009; 92(6):2971–75.

- 5
Allen DM. The relationship between variable selection and data augmentation and a method for prediction. Technometrics. 1974; 16(1):125–7.

- 6
Henderson CR. Applications of Linear Models in Animal Breeding. Guelph: Univ. Guelph; 1984.

- 7
Searle SR. Matrix Algebra Useful for Statistics. New York: John Wildy and Sons, Inc; 1982.

- 8
Cheng H, Garrick D, Fernando R. Xsim: Simulation of descendants from ancestors with sequence data. G3. 2015; 5(7):1415–17.

## Acknowledgements

Not applicable.

### Funding

This work was supported by the US Department of Agriculture, Agriculture and Food Research Initiative National Institute of Food and Agriculture Competitive grant no. 2015-67015-22947. The funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing of the manuscript.

### Availability of data and materials

The datasets during and/or analysed during the current study available from the corresponding author on reasonable request.

### Authors’ contributions

All authors contributed to the development of the statistical methods. HC wrote the program code and conducted the analyses. The manuscript was prepared by HC, RLF and DG. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Consent for publication

Not applicable.

### Ethics approval and consent to participate

Not applicable.

## Author information

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Leave-one-out cross validation
- GBLUP