Regularized quantile regression for SNP marker estimation of pig growth curves

Background Genomic growth curves are generally defined only in terms of population mean; an alternative approach that has not yet been exploited in genomic analyses of growth curves is the Quantile Regression (QR). This methodology allows for the estimation of marker effects at different levels of the variable of interest. We aimed to propose and evaluate a regularized quantile regression for SNP marker effect estimation of pig growth curves, as well as to identify the chromosome regions of the most relevant markers and to estimate the genetic individual weight trajectory over time (genomic growth curve) under different quantiles (levels). Results The regularized quantile regression (RQR) enabled the discovery, at different levels of interest (quantiles), of the most relevant markers allowing for the identification of QTL regions. We found the same relevant markers simultaneously affecting different growth curve parameters (mature weight and maturity rate): two (ALGA0096701 and ALGA0029483) for RQR(0.2), one (ALGA0096701) for RQR(0.5), and one (ALGA0003761) for RQR(0.8). Three average genomic growth curves were obtained and the behavior was explained by the curve in quantile 0.2, which differed from the others. Conclusions RQR allowed for the construction of genomic growth curves, which is the key to identifying and selecting the most desirable animals for breeding purposes. Furthermore, the proposed model enabled us to find, at different levels of interest (quantiles), the most relevant markers for each trait (growth curve parameter estimates) and their respective chromosomal positions (identification of new QTL regions for growth curves in pigs). These markers can be exploited under the context of marker assisted selection while aiming to change the shape of pig growth curves.


Background
In general, the study of growth curves is carried out by fitting nonlinear models to weight (dependent variable) and age (independent variable) data. These models are used because they are flexible and have parameters with biological interpretations, such as maturity rate and adult weight.
With the goal of estimating SNP marker effects on parameter estimates of growth curves, Pong-Wong and Hadjipavlou [1] proposed a two-step approach. In the first step, nonlinear models were fitted to the weight-age data of each animal. In the second step, genomic regression models were fitted while considering the parameter estimates from the previous step as the dependent variable. Such an approach allows for the estimation of marker effects based only on the conditional mean of the dependent variable. Specifically, genomic growth curves are defined only in terms of population mean, i.e., the identification of genetically superior individuals in relation to the growth efficiency is based on population mean distribution (quantile 0.5 of a normal distribution of the sampled data).
An alternative approach for the second step that has not yet been exploited in genomic analyses of growth curves is the Quantile Regression (QR) [2]. This methodology allows for the estimation of marker effects at different levels (quantiles) of the variable of interest. Obtaining these effects in specific quantiles allows for a more informative study on the chromosomal regions affecting the growth curve trajectory.
In general, the larger number of markers and the dependence between them due to linkage disequilibrium leads to multicolinearity estimation problems. Thus, methods such as shrinkage estimation, which highlight the high dimensionality and multicollinearity issues, are required. Under a QR framework, this method is named regularized quantile regression (RQR), since the shrinkage (or penalty) parameter regularizes the variance of the markers' effects, thus performing a direct variable selection framework.
We aimed to propose and evaluate a regularized quantile regression for SNP marker effect estimation of pig growth curves, as well as to identify the chromosome regions of the most relevant markers and to estimate the genetic individual weight trajectory over time (genomic growth curve) under different quantiles (levels).

Animals and genotyping data
Phenotypic data was obtained from the Pig Breeding Farm of the Department of Animal Science of the Federal University of Viçosa, Minas Gerais, and refer to the weights at birth, 21, 42, 63, 77, 105 and 150 days of age. These weights were measured in 345 animals from a F2 outbred population (Brazilian Piau X commercial). More details about this population are found by Azevedo et al. [3] and Band et al. [4].
DNA was extracted at the Animal Biotechnology Lab from Animal Science Department of Federal University of Viçosa. The low-density customized SNPChip with 384 markers was based on the Illumina Porcine SNP60 BeadChip (San Diego, CA, USA, [5]). The number of SNP markers was distributed as follows in the pig chromosomes: (Sus scrofa; SSC): SSC1 (n = 56), SSC4 (n = 54), SSC7 (n = 59), SSC8 (n = 31), SSC17 (n = 25), and SSCX (n = 12), totaling 237 SNPs. These markers were selected according to QTL positions that were previously identified in this population by using metaanalyses [6] and fine mapping [7,8]. Thus, although a small number of markers have been used, the customized SNPchip based on previously identified QTL positions ensures appropriate coverage of the relevant genome regions in this population.

Statistical analysis
Initially, the logistic nonlinear regression model [9] was fitted to the individual weight-age data: where w ij is the weight of the animal i at age t j (0, 21, 42, 63, 77, 105 and 150); α 1i , α 2i and α 3i are the parameters. If α 3i > 0 then α 1i is the horizontal asymptote as t j → ∞ (mature weight) and 0 is the horizontal asymptote ast j → − ∞. If α 3i < 0 these roles are reversed. The parameter α 2i is the t j value at which the response is α 1i /2. It is the inflection point of the curve. The scale parameter α 3i (growth scale) represents the distance on the t-axis between this inflection point and the point where the response is α 1i /(1 + e −1 ) ≈ 0.73α 1i ; e ij is the independent and normally distributed residual term, e ij eN 0; σ 2 e À Á : In this parameterization, the growth scale parameter is the reciprocal of growth rate on the model presented by Ratkowsky [10].
After obtaining parameter estimates of the logistic model, they were used as dependent variables in a linear model to carry out fixed effect corrections (sex, lot, and halothane gene). The corrected variables were identified based on the residual of the fitted linear model plus the overall mean. Subsequently, the corrected variables (α̂Ã 1i ; α̂Ã 2i and α̂Ã 3i ) were used as dependent variables in a multiple regression model while using SNP markers as the independent variables. This procedure is known in the literature as a two-step approach: in the first step, a growth curve is fitted to the data of each animal, and in the second step, the parameter estimates from the previous step are used as phenotypic values [1,11].
In the second step, the following genomic model proposed by Meuwissen et al. [12] was fitted separately for each trait (parameter estimates from previous step): in which y i is the corrected phenotype α̂Ã 1i ; α̂Ã 2i and α̂Ã 3i from the first step; μis the general mean; x ik is the SNP marker, encoded as 2 (AA), 1 (Aa), or 0 (aa); β k is the effect of the marker k; and ε i corresponds to the residual term, ε i eN 0; σ 2 e À Á . To obtain the markers' effects at different levels of the variables (traits defined by α̂Ã 1 ; α̂Ã 2 and α̂Ã 3 ), the regularized quantile regression [13] was used. This method consists of obtaining the marker effects (β k ) that solve the following optimization problem: where s = 1, 2, and 3 (respectively for each assumed trait, α̂Ã 1 ; α Ã 2 and α̂Ã 3 ); P k¼1 237 β sk0 is the sum of the absolute values of the regression coefficients; λ s is the regularization parameter for each trait; and τ ∈ (0, 1) indicates the quantile of interest. This parameter (λ s ) is required to avoid multicollinearity problems that are a result of the larger number of highly dependent markers associated with linkage disequilibrium. It leads to the formulation of the RQR. The parameter ρ τs (.) is denoted as a check function [2] and is defined by: ; otherwise: in which τ ∈ (0, 1) indicates the quantile of interest. Thus, the values of β sk (τ) represent the markers' effects in the τ th quantile of interest for s th trait.
In this study, for each trait (α̂Ã 1 ; α Ã 2 and α̂Ã 3 ), the quantiles τ = 0.2 , 0.5 and 0.8 were used to generate results at three distinct levels that may characterize the low, average, and high distribution of the phenotypic values under study ( α̂Ã 1i ; α̂Ã 2i and α̂Ã 3i ). Furthermore, these quantiles were chosen to minimize the residual term in previous studies (pilot analysis) by using the same datasets.
In order to verify whether marker effects differ between the quantile levels of the traits ( α̂Ã 1 ; α Ã 2 and α̂Ã 3 ), the 2.5% most relevant SNPs (highest absolute values) and their p values, based on bootstrapped standard error values, were presented. In addition, these SNPs were used to identify possible QTL regions affecting growth traits in pigs.
The Genomic Estimated Breeding Values (GEBV) from RQR were obtained through GEBV τ ð Þ ¼ û ¼ P k x ik β̂k τ ð Þ, in which τ represents the quantile of interest. Subsequently, the genomic growth curves were obtained for each animal based on GEBV (û) according to the following expression: in which ŷi j is the predicted breeding value for each animal i for the weight at each age (t ij ) (j = 0 to 150 d); μ̂αÃ 1 ; μ α̂Ã 2 and μ̂αÃ 3 are the means of each trait (parameter estimates for the logistic model); and ûÃ α1 i ; ûÃ α2 i and ûÃ α3 i are the GEBV of these traits.
Finally, the genetic parameters for the interpretable traits derived from the logistic model (α 1 and α 3 ) as well as the original traits associated with slaughter weight (SW) and average daily gain (ADG) were estimated by using the following multi-trait model: where y 1 y 2 ! is the vector of response variables of traits I and II (α 1 and α 3 with SW and ADG), X 1 and X 2 are the fixed-effects design matrix (Sex, Batch, and Halothane presence), Z 1 and Z 2 are the random-effects design matrix, and e 1 e 2 ! is the vector of random residuals of the two traits. It is assumed that is the additive genetic variance and covariance matrix of the two traits, and e 1 e 2

! eN
is the residual variance and covariance matrix of the two traits. Finally, G is the additive relationship matrix constructed by using 501 pigs and I is the identity matrix.

Computational features
Fitting of the models was carried out by using the nls (to fit the logistic nonlinear model in the first step) and rq (to fit the regularized quartile regression in the second step) functions of the stats and quantreg packages [14] of R software [15], respectively. The Mixed Model Analyses were performed in ASReml 3.0 [16].
To obtain the shrinkage parameter values (λ), a grid of λ values between 0 and 50 was utilized, varying in 0.5 increments. The predictive capacity, defined as the correlation between the estimated and observed values (curve parameters that were obtained from fitting the Logistic model to the weight-age data), was used as a criterion to define the optimal value λ.
The computational codes that were implemented in the R software are found on the website of the Statistics Department of the Federal University of Viçosa (2017): https://licaeufv.wordpress.com/scriptrqr_jasb/.

Results
The summary containing the descriptive statistics of the adjusted phenotypic data is presented in Table 1.
The summary containing the correlation and descriptive statistics of the adjusted phenotypic data ( α̂Ã 1i ; α̂Ã 2i and α̂Ã 3i ) is presented in Table 2.
Considering the aforementioned grid (0 to 50, by 0.5), the shrinkage parameter value that showed the best results in terms of predictive capacity was λ = 0.5.
The mean and standard error for marker effects (β̂k ′ s) and R 1 goodness of fit measure for each quantile adjusted model are present in Table 4. The goodness of fit ranged between 0.67 and 0.75 (Table 4).
In order to verify whether the most relevant SNPs for the three approaches (RQR (0.2), RQR (0.5), and RQR (0.8)) were the same, the 2.5% most relevant SNPs for each phenotype (α̂Ã 1i ; α̂Ã 2i and α̂Ã 3i ) were reported (Table 5). Table 5 describes the most relevant markers considering the fitting through RQR (0.2). For the mature weight (α 1 ), the markers are located on chromosomes SSC1, SSC4, SSC7, SSC8, and SSC17 ( Table 5). The position of the marker ALGA0096701 on chromosome 17 (55.81 cM) is in accordance with the results of Pierzchala et al. [17], in which the authors found QTL for the slaughter weight at the position 51.1 cM with the cross between Meishan, Pietrain, and European Wild Boar. For birth weight (α 2 ), the marker ALGA0044519 stands out, which is found in the SSC7 at the position 115.23 cM, next to the QTL for the birth weight found by Guo et al. [18] at the position 120.9 cM for crosses of Large white and Meishan. In terms of growth rate (α 3 ), the marker that presented with the highest effect is found on chromosome 8. The position of the marker ALGA0049546 at SSC8 (60.04 cM) is close to the position 62.2 cM, as reported by Casas-Carrillo et al. [19] for average daily gain when using families from outbred lines that were selected for high (fast) and low (slow) growth rates.
Considering the RQR (0.5) in Table 5, the most important markers for α 1 , α 2 and α 3 are located on chromosomes SSC4 and SSC8 (Table 5; RQR (0.5)). For α 1 ,, the marker ALGA0047992 stands out, which is found on SSC8 at the position 30.17 cM, which is close to the QTL for slaughter weight found by Beeckmann et al. [20], and at the position 33.9 cM on chromosome 8 in pigs obtained from crosses between Meishan, Pietrain, and European Wild Boar. For the birth weight trait (α 2 ), the marker with the greatest estimated effect was ALGA0026100. The position of this marker at SSC4 (75.53 cM) is close to the position at 74.4 cM reported by Walling et al. [21] for body weight at birth. For α 3 , the position of marker ALGA0048131 on SSC8 (35.02 cM) was close to the position 33.1 cM reported by Beeckmann et al. [20] who used data from an experimental cross between Meishan, Pietrain, and European Wild Boar for average daily gain (Table 5; RQR (0.5)).
Considering the RQR (0.8) in Table 5, the most significant SNPs for α 1 , α 2 and α 3 are located on chromosomes SSC1 and SSC8 (Table 5; RQR (0.8)). Regarding the mature weight trait (α 1 ), the marker with the highest absolute value pertaining to the estimates of the parameter effect is ALGA0007216. This marker is located on chromosome 1 (160.61 cM). Chen et al. [22] used a pig population comprised of Yorkshires and Meishans to find significant QTLs for slaughter weight at the position 122.4 cM of SSC1, i.e., close to the position 160.61 cM of the ALGA0007216 marker (Table 5; RQR (0.8)).     Another interesting result that was observed through RQR is the simultaneous existence of important markers for different traits (Table 5). This fact is important for breeding, since pleiotropy is the main factor in genetic correlation. Specifically, for RQR (0.5) ( Table 5), two markers (ALGA0047992 and ALGA0047995) were simultaneously important for the mature weight (α 1 ) and birth weight (α 2 ) traits. In addition, three SNPs for RQR (0.2) (ALGA0096701, ALGA0026109, and ALGA0029483) and one for RQR (0.8) (ALGA0042986) were simultaneously relevant for α 1 and α 2 .
The three genomic growth curves (τ = 0.2 , 0.5 , 0.8) that were obtained based on all of the data are shown in Fig. 1b. The estimated curve based on the three quantiles showed a similar pattern until 100 d. After that, differences in the estimated growth curves increased with time (Fig. 1b). This result was expected given the increase in the heterogeneity of variances that were presented at the final evaluated times, 100 and 150 d (Fig. 1a).
The genomic growth curves for each RQR, for quantiles 0.2, 0.5, and 0.8 and their confidence intervals showed significant differences (based on non-overlapping confidence intervals) only in terms of mature weight (Fig. 2a). These differences are highlighted in Fig. 2b.
Estimates of genetic parameters (heritability, and genetic and phenotypic correlations) are presented in Table 6. Estimates of heritability for growth curve parameters were moderate, with 0.447 ± 0.200 and 0.4991 ± 0.164, for parameters α 1 and α 3 , respectively. The original traits (SW and ADG) had low heritability estimates, with 0.214 ± 0.127 and 0.094 ± 0.087, for SW and ADG, respectively.
Estimates of genetic and phenotypic correlations are presented in the off-diagonals (Table 6). Between the interpretable growth curve parameters (α 1 and α 3 ) with the original correspondent traits (SW and ADG), correlations were, respectively, highly positive and negative, with a positive genetic correlation estimated for parameters α 1 and SW (0.404 ± 0.113) and a negative genetic correlation estimated for α 3 with ADG (−0.681 ± 0.229). Phenotypic correlations between interpretable growth curve parameters with slaughter weight (SW) and average daily gain (ADG) traits were also moderately positive and negative, with 0.662 ± 0.051 for α 1 with SW and −0.451 ± 0.06 for α 3 with ADG.

Discussion
In this study, we aimed to propose and evaluate a regularized quantile regression (RQR) for SNP marker effect estimation on pig growth curves and to estimate the genetic weight trajectory over time (genomic growth curve) under different quantiles (levels). In order to do so, a real data set consisting of 345 animals from an F2 outbred population with information on 237 SNP markers, randomly distributed over six chromosomes, was used. The phenotypic data refers to the weight at birth, 21, 42, 63, 77, 105, and 150 days of age. To estimate SNP marker effects for growth curves, we used a two-step approach [1]. In the first step, we fitted logistic nonlinear models to the data of each animal, and in the second step, genomic regression models were fitted while considering the estimated parameters from the previous step as the phenotypic values. We obtained the three genomic growth curves for the three evaluated quantiles (τ = 0.2 , 0.5 , 0.8). Finally, the genetic parameters for the interpretable traits of the logistic model (α 1 and α 3 ) and the original traits, slaughter weight and average daily gain, were estimated. Quantile regression (QR) can be used to provide a more complete statistical analysis of the stochastic relationships among random variables. In general, the chosen quantiles depend entirely on the purpose of the study, i.e., we can study all distributions or only some parts by defining specific quantiles. In this study, with the aim of representing three distinct levels that characterize low, average, and high distributions of the phenotypic values (estimated parameters while considering a logistic nonlinear model), we choose, τ = 0.2 , 0.5 , 0.8.
The use of RQR to estimate SNP marker effects and obtain the estimated genomic growth curve was efficient since it was possible to construct genomic growth curves and find the most relevant markers, which thus allows for the identification of QTL regions at different levels of interest. Besides that, R 1 goodness of fit measures ranging from 0.67 to 0.75 indicating that the model fits well for the observations. Unlike traditional methods that are based on conditional expectations, E(Y|X), RQR allows us to fit regression models on different parts of the distribution of the variable response, therefore enabling a more complete understanding of the phenomenon under study [2,23]. Besides, the heterogeneous variance over time (Fig. 1a) indicates that there is not a single rate of change that characterizes changes in the probability distribution, therefore indicating that RQR is a good tool to deal with those situations. Also, the predictive capacity that was obtained by means of RQR (Table 3) was better than that obtained by Silva et al. [24].
The advantages of RQR, such as studying different parts of the distribution of the variable response, can be combined with those from the two-step approach. Specifically, the two-step approach enables us to obtain the genomic values for each observed time (t j ), as well as to estimate the weight for any other time of interest within the measured range before this weight is attained [24].
Based on the results, it is possible to note that RQR allows for the identification of markers close to QTLs at The use of quantile regression to estimate genomic curves based on three contrasting quantiles in our population was efficient when it came to producing distinct growth curves. Specifically, we can see in Fig. 2b that the final BW of the genomic growth curves was statistically different; in other words, the growth behavior over time changed in terms of mature weight. In fact, this result shows that RQR is a statistical method that could be effectively used to estimate more than a single mean behavior, thereby providing a more complete picture of the relationships between variables.
The genetic correlations between α 1 and α 3 with BW and ADG had, respectively, a high positive and negative genetic correlation, which indicates that α 1 and α 3 have the potential to be used as selection tools to improve SW and ADG. Additionally, the high genetic correlation between α 1 with α 3 and SW with ADG enable us to understand causes of SNPs' pleiotropic effects. These results are in agreement with Silva et al. [24], who found significant genetic correlation between the interpretable traits of logistic model ( r α 1 ;α 3 ¼ −0:69) in the same populations that were used in this study. The difference between the signals of genetic correlation estimates observed in the present study is due to the different Logistic model parameterizations. Specifically, our approach uses the parameterization presented in Pinheiro and Bates [9], where the growth scale parameter (α 3 ) is the reciprocal of growth rate [10,24].
The study of different distribution levels of the variable of interest using QR has been successfully performed in medicine by Beyerlein et al. [25], who used QR in GWAS (Genome-Wide Association Study) analysis in human genetics where they emphasized statistical and biological advantages when estimating marker effects in different quantiles of the phenotypic distribution. Sun et al. [26] proposed to use QR to identify hypermethylated CpG islands (CGIs) that can be associated with breast and ovarian cancer. They concluded that the quantile level between 80 and 90% is the best strategy to identify methylated and unmethylated CGIs. Moreover, regularized quantile regression has already been successfully evaluated for analyzing ultra-high dimension data [27]. These authors demonstrated that QR greatly enhances existing tools for large dimensional data analysis, since it revealed a substantial reduction in model complexity when compared with alternative methods.
However, even though the use of RQR is promising and efficient, more studies are needed to address the choice of the shrinkage parameter value, which is always critical to find as it can be defined by using a grid of values, cross-validation, or by using a Bayesian approach. Another issue about the use of RQR is the choice of the quantile. There are a lot of quantiles that can be used; therefore, finding the best one to explain the functional relationship is a challenge.

Conclusions
The proposed model enabled the discovery, at different levels of interest (quantiles), of the most relevant markers for each trait (growth curve parameter estimates) and their respective chromosomal positions (identification of new QTL regions for growth curves in pigs). Furthermore, RQR enabled the construction of genomic growth curves, which identified genetically superior individuals in relation to growth efficiency.  Table 6 Genetic parameters a (standard error) for growth curve parameters, ADG, and SW  Author details