Research  Open  Published:
Haplotype phasing after joint estimation of recombination and linkage disequilibrium in breeding populations
Journal of Animal Science and Biotechnologyvolume 4, Article number: 30 (2013)
Abstract
A novel method for haplotype phasing in families after joint estimation of recombination fraction and linkage disequilibrium is developed. Results from Monte Carlo computer simulations show that the newly developed E.M. algorithm is accurate if true recombination fraction is 0 even for single families of relatively small sizes. Estimates of recombination fraction and linkage disequilibrium were 0.00 (SD 0.00) and 0.19 (SD 0.03) for simulated recombination fraction and linkage disequilibrium of 0.00 and 0.20, respectively. A genome fragmentation phasing strategy was developed and used for phasing haplotypes in a sire and 36 progeny using the 50 k Illumina BeadChip by: a) estimation of the recombination fraction and LD in consecutive SNPs using family information, b) linkage analyses between fragments, c) phasing of haplotypes in parents and progeny and in following generations. Homozygous SNPs in progeny allowed determination of paternal fragment inheritance, and deduction of SNP sequence information of haplotypes from dams. The strategy also allowed detection of genotyping errors. A total of 613 recombination events were detected after linkage analysis was carried out between fragments. Hot and cold spots were identified at the individual (sire level). SNPs for which the sire and calf were heterozygotes became informative (over 90%) after the phasing of haplotypes. Average of regions of identity between halfsibs when comparing its maternal inherited haplotypes (with at least 20 SNP) in common was 0.11 with a maximum of 0.29 and a minimum of 0.05. A MonteCarlo simulation of BTA1 with the same linkage disequilibrium structure and genetic linkage as the cattle family yielded a 99.98 and 99.94% of correct phases for informative SNPs in sire and calves, respectively.
Background
Advances in molecular biology have allowed rapid and massive genotyping of Single Nucleotide Polymorphisms (SNP) in animal and plant species. SNP arrays have been implemented for genomewide selection to enhance genetic improvement of farm animals [1]. In many instances, it has been done without consideration of the nature of the SNP information. That is, SNPs have been assumed to be unlinked in spite of providing low information (maximum PIC values of 0.375 at intermediate frequencies for two alleles). In reality, SNP polymorphisms are arranged in aligned sequences forming haplotypes where the order of the alleles for consecutive SNPs within each homologue chromosome contains relevant information. To make full use of SNP microarray technology, haplotype phasing must be estimated because genotyping only generates information for each single SNP. Haplotype phasing consists of arranging the order of allelic variants in a chromosomal segment within each of the two homologous chromosomes of diploid species. Phasing knowledge can be applied to trace SNP inheritance and to account for regions that are identical by descent in genomic evaluations aimed at the genetic improvement of agricultural species [2, 3]. Haplotype phasing can be assessed in the laboratory or computationally [4]. Computational methods can make use of the population structure of linkage disequilibrium assuming no relationships among individuals or may also establish inheritance of haplotypes within pedigrees in order to reconstruct haplotypes. In the latter, haplotypes can be traced up to ancestors in the top of the pedigree [5–7]. PHASE [7], FASTPHASE [6], and BEAGLE [5] implement a Bayesian statistical method for reconstructing using coalescent and hidden Markov models. FASTPHASE [6] cannot provide estimates of recombination rates but PHASE [7] does. BEAGLE [5] can infer haplotypes from unrelated individuals, parentoffspring pairs, and parentoffspring trios. Rohde and Fuerst [8] developed methods for haplotype inference based on maximum likelihood, which could be used with nuclear families. Their method was based on searching long genomic segments for which an allele was shared for two individuals and has been also used for phasing haplotypes. A linkage program like GENEHUNTER can extract multiple relative’s information [9], although it is done by assuming linkage equilibrium between polymorphisms, which may lead to incorrect phasing. If the genotype information is from sibs and genotyping information on one of the parents is missing then linkage disequilibrium (from gametes from the other parent) may not be separated from genetic linkage (within the parent with genotype information). Another group of methods makes use of relative’s information [10, 11] together with laws of Mendelian inheritance to infer haplotypes. Haplotypes inferred from data sets from families are accurate across extended genomic distances if family sizes are large. Williams et al.[12] proposed a new haplotype inference method for nuclear families based on maximum likelihood and minimum recombination using linear programming (implemented in software Hapi). More recently, Lai et al.[13] proposed an algorithm for reconstructing haplotypes in a pedigree by assuming zero recombinants. Their method can be applied to general pedigrees but it is not designed to make use of nuclear families with one single parent.
All of the above methods for haplotype inference cannot be readily applied to the situation most commonly found in farm animals. This is, genotyping information is available from just one of the parents which in turn have large progeny groups (e.g., dairy bulls with up to one million progeny). Therefore, none of the above methods can estimate recombination and linkage disequilibrium simultaneously because they assume that there is linkage equilibrium. As stated by Browning and Bronwing [4]: “When sites are in LD, linkage programs that assume linkage equilibrium may falsely infer IBD in situations in which is not present”.
In this study, we develop a computational method to estimate haplotype phases in breeding populations using closely linked biallelic markers densely distributed in the entire genome. The strategy is based on making use of large families of breeding populations in order to estimate recombination fraction and linkage disequilibrium (LD) simultaneously in order to phase parents. Then, progeny are phased using Mendelian laws of inheritance together with genotyping information from parents and progeny as it has been done elsewhere e.g., [9–11]. A new EM algorithm is developed to simultaneously estimate recombination fraction and linkage disequilibrium in halfsib families (required for accurate phasing). The method to phase haplotypes is tested in a cattle family with 36 calves. Montecarlocomputer simulations were carried out for joint estimation of recombination and linkage equilibrium in a family resembling recombination and linkage disequilibrium estimated from real data in BTA1.
Methods
Genome fragmentation phasing strategy
Assume SNP microarrays with a dense coverage of the animal genome are used for largescale genotyping of animals. The situation considered is dairy cattle in which bull sires and bull dams are mated (usually by artificial insemination) and have usually one progeny from each dam (halfsibs). Among the progeny of those elite bulls are young bulls that are typed for microarrays in order to carry out preselection based on genomic information. Some of the bull dams are also progeny of elite bulls and this process repeats itself in successive generations. Therefore, all male selection candidates are typed for microarrays. A method named genome fragmentation phasing strategy (GFPS) is proposed to phase haplotypes when information is available in families as it is common in breeding populations. GFPS steps assumptions are: 1) families are large with progeny sharing at least one parent; 2) both linkage disequilibrium and recombination events are modeled; and 3) use is made of SNP arrays with high coverage of the genome and with known SNP location. The steps for GFPS are:

A.
Estimation of the recombination fraction in consecutive SNPs using family information in the first generation
There are two possibilities depending on the available genotype information from one or from two parents. If both parents are available then standard linkage analyses can be performed. Noninformative progeny can be ignored and the resulting recombination fraction estimates will be unbiased. If only one parent is known then joint estimation of estimation of recombination fraction and linkage disequilibrium (LD) is needed for each pair of consecutive closely linked SNPs. An EM algorithm is proposed to estimate recombination and LD (Appendix 1). MonteCarlo computer simulation was carried out to validate joint estimation of LD. If recombination fraction between two consecutive SNP, T/t and M/m, is zero then:

A1
Establish linkage phase in the parent or parents by eq. A3 :
$$\begin{array}{l}\mathrm{Pr}\phantom{\rule{0.5em}{0ex}}\mathit{of}\phantom{\rule{0.5em}{0ex}}\mathit{\text{Phase}}\phantom{\rule{0.5em}{0ex}}\left(\mathit{TM}/\mathit{tm}\right)\\ \phantom{\rule{0.5em}{0ex}}=\frac{{L}_{\mathit{TM}/\mathit{tm}}\left(\delta ,{f}_{T},{f}_{M},c\mathrm{data}\right)}{{L}_{\mathit{TM}/\mathit{tm}}\left(\delta ,{f}_{T},{f}_{M},c\text{data}\right)+{L}_{\mathit{Tm}/\mathit{tM}}\left(\delta ,{f}_{T},{f}_{M},c\text{data}\right)}\end{array}$$where L_{TM/tm}(δ, f_{ T }, f_{ M }, cdata) and L_{Tm/tM}(δ, f_{ T }, f_{ M }, cdata) are the likelihoods of phase (TM/tm) and phase (Tm/tM) under a model estimating linkage disequilibrium (δ), allele frequencies for one of the alleles at markers T/t and M/m (f_{ T }, and f_{ M }), and the recombination fraction (c). See Appendix 1 for a full description of the maximum likelihood method. Repeat this process for each pair of consecutive SNPs in which the last SNP in a pair is the first SNP of the following until an estimate of recombination fraction is greater than 0. If recombination fraction is greater than 0 then it means that either a recombination event has taken place or it is a genotyping error. This is a signal for the termination of one fragment and the starting of the next one. Genotyping errors can be of two kinds: failure of genotyping itself or incorrect alignment of SNPs in the array. At this point, all consecutive SNPs fully linked and aligned two by two are considered a fragment, or haplotype block.

A2
Reconstruct phases of parent(s) for each fragment
The phases for each pair of SNPs in the sire are reconstructed by aligning the most likely phases for each pair of SNPs until fragment phasing information is completed. Therefore, a fragment is a piece of a chromosome in which no recombination events are detected between any consecutive markers. If the sire is homozygote then the same allele would appear in the two homologous chromosomes. Figure 1 illustrates reconstruction of parent haploypes for a sire. The blue arrow indicates heterozygote SNPs that are concatenated according the linkage phases having the higher probability.

A3
Reconstruct phases in progeny with paternal and maternal haplotypes
The sire has two fragments and the progeny has one paternal fragment and another maternal fragment that it is deducted from the genotype information. Progeny phases can be reconstructed by the following rules. If a progeny is homozygous for one of the alleles for which the parent is heterozygous then the full haplotype of that fragment is inherited by that progeny (square in the Figure 1). The contribution from the mother completes the genotype information for each SNP within the fragment. Homozygous calf’s genotypes from an heterozygous sire are represented in red in Figure 1. If the progeny is heterozygous at one SNP but the parent is homozygous then that paternal allele in the progeny is the sire allele while the maternal allele is the other allele (green in Figure 1).

A1

B.
Linkage analyses between fragments
Fragments with no recombination events detected between SNPs can be used in linkage analyses considering long haplotypes as highly informative “superalleles” (haplotypes of fragments whose inheritance can be traced to their paternal and maternal contributions), which allows detection of recombination events. Also changes in the haplotype configuration inherited in the calf when compared to sire’s phased chromosomes may reveal a recombination event in that calf.

C.
Phasing of haplotypes in following generations
Once the phase is established in the parents and progeny, phases of individuals in the following generation is straightforward. Phases of progeny that become a parent and contributes to the next generation are used to reconstruct phases in the progeny of the progeny. Linkage analyses are carried out in the next generation using family information in order to establish haplotype phases in each of the subsequent generations.
MonteCarlo computer simulation of a halfsib family for joint estimation of LD and genetic linkage
A MonteCarlo computer simulation was carried out in order to validate methods to estimate recombination fraction and LD jointly in halfsib families. The sire was assumed to be a double heterozygote. A random generator from the uniform distribution was used to assign progeny with the genotypes TTMM, TTMm, TTmm, TtMM, TtMm, Ttmm, ttMM, ttMm, and ttmm according to their probability (frequency): ${\varphi}_{\mathit{\text{TTMM}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{TM}}$, ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{Tm}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{TM}}$, ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{Tm}}$, ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{tM}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{TM}}$, ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right)\left({f}_{\mathit{tm}}+{f}_{\mathit{TM}}\right)+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}\left({f}_{\mathit{tM}}+{f}_{\mathit{Tm}}\right)$, ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{Tm}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{tm}}$, ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{tM}}$, ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{tM}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{tm}}$, and ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{tm}}$. If the drawing of the uniform distribution was between 0 and ϕ_{ TTMM }, then the offspring had genotype TTMM. If the drawing of the uniform distribution was between ϕ_{ TTMM } and ϕ_{ TTMM }_{+}ϕ_{ TTMm } then the offspring genotype was TTMm. Assigning other genotypes to offspring was done following the same rule.
Multifamily estimation of linkage disequilibrium
A total of six families with sizes 94, 77, 106, 81, 79, and 100 halfsib progeny were simulated resembling the sire Norwegian cattle population (after pooling selected and culled bulls in Table 1 of [14]). The allele frequencies were intermediate, recombination fraction was 0, 0.25, or 0.50, and linkage disequilibrium ranged from 0 to 0.25. The sires were simulated as if they were coming from a population with the same linkage disequilibrium and allele frequencies as used to generate the halfsib progeny. In order to do so, the two haplotypes of each sire were generated following the same principles as above, with probabilities according to the simulated frequencies: f_{ TM } = δ + f_{ T }f_{ M }, f_{ Tm } =  δ + f_{ T }f_{ m }, f_{ tM } =  δ + f_{ t }f_{ M }, and f_{ tm } = δ + f_{ t }f_{ m }, in which allele frequencies f_{ M }, f_{ m }, f_{ T }, f_{ t }, and δ were input parameters. Thus, the sire could be a double homozygote, homoheterozygote or double heterozygote after assigning the two haplotypes. The halfsib progeny was generated as described in the previous section. Join estimation of linkage disequilibrium and recombination fraction was carried out using the developed E.M. algorithm for multiple families (Appendix 1). Each experiment was replicated 10,000 times.
Programming source code for GFPS
Subroutines were written in Fortran 90 to compute join estimates of LD and recombination fraction using the E.M. algorithm as well as to perform all steps of the GFPS algorithm. All source code is available on request to the authors (gomez.luis@inia.es).
Genome analyses of LD in a beef cattle halfsib family
A halfsib family consisting of 36 calves from commercial beef cattle at the Gund ranch in Nevada was used to illustrate and to compare alternative methods for estimation of linkage disequilibrium. The first step was to determine paternity of the calves at the ranch. A set of 25 microsatellites (BMS410, BMS499, BMS650, BMS1244, BMS1634, TGLA227, BMS601, BMS1789, BMS2005, ILSTS081, BMS1315, BMS1226, BMS2573, ILSTS058, TGLA126, CSSM66, SPS115, TGLA53, BM1824, BM2113, ETH3R, TGLA122, INRA023, ETH225, ETH10) was used to assign paternity that was carried out using Cervus software. The procedures are fully described in [15]. The Illumina bovine 50K BeadChip was used with bull #302 and his 36 calves in order to compare methods to estimate LD in halfsib families. Only SNPs with a call rate higher than 0.80 in at least 24 calves and with MAF of 0.10 or more were used. The data was also filtered for SNPs that were not consistent for inheritance from sire to progeny. If a SNP was not consistent for one progeny then the SNP information was discarded for the entire family. Only pairs of SNPs within the same chromosome and within a distance of 50Mb or less were used for estimating linkage disequilibrium and recombination fraction.
The GFPS algorithm was used to reconstruct haplotypes of fragments in the sire and its 36 calves. Recombination events between fragments were detected. In addition, regions of identity (ROI) were used as a measure of molecular relatedness. ROIs are a generalization of runs of homozygosity (ROH) and are the proportion (in respect to the total of the genome) of long haplotypes shared by two individuals. ROH have been thoroughly investigated in human [16, 17] and animal populations [18], and in this study were estimated after reconstruction of haplotypes via GFPS. Thus, only haplotypes made up by 20 or more identical SNP alleles for each two individuals were used. ROIs were used to estimate patterns of Mendelian segregation as well as genetic relatedness between two individuals due to the inheritance of paternal or maternal gametes.
MonteCarlo computer simulation of a phased chromosome
A MonteCarlo computer simulation of BTA1 using phasing results of the cattle family was performed. A total number of progeny was 36. Only heterozygous SNPs (for the sire) were simulated (sire homozygous SNPs are trivial for phasing purposes). The chromosome coming from the sire to their calves was assumed to come from meiosis with a probability of 0, 1, 2, or 3 recombination events of 0.511494, 0.398467, 0.081418, and 0.00862, respectively. These are the genomewide values found for recombination events per chromosome in our study. Once, the two resulting gametes from the sires were created then a drawing from the uniform distribution was used to assign either of the two chromosomes to a calf. The chromosome coming from the dams was generated using the haplotype frequencies (of chromosome 1) of the two first SNPs as estimated in this study. The next SNP was simulated using haplotype frequencies (for SNPs second and third) but conditional to the SNP allele at the second SNP. This process was repeated until the entire chromosome was terminated. The same procedure was followed to generate BTA1 for each of the 36 calves. The SNP data was randomly allocate to disturb the order of alleles and the resulting data was analyzed with the methods developed in this paper. The location of each (base pair) was the same as SNPs in the real data. Also, markers with MAF<0.10 were excluded. A total of 7,000 replicates were performed.
Results
Monte Carlo computer results from joint estimation of recombination fraction and linkage disequilibrium
An E.M. algorithm was implemented for the joint estimation of recombination fraction and linkage disequilibrium in both single and multiple family situations (Appendix 1). Tables 1 and 2 show the results for the joint estimation of recombination fraction in a halfsib family with varying family sizes (36 to 2,000). The results show that both recombination fraction and linkage disequilibrium are accurately estimated when either true recombination or true disequilibrium is 0 even for relative small family sizes (36). On the contrary, estimates tend to be biased when both recombination fraction and linkage disequilibrium (absolute value) are greater than 0. Substituting the observed counts in equation A1 by their expected values according to true parameters (recombination fraction, linkage disequilibrium, allele frequencies) was carried out to plot the log of the likelihood against those parameters in order to investigate if the maximum likelihood method can or cannot separate effects of recombination fraction and linkage disequilibrium (i.e., behavior of the likelihood when sample size is infinite). Figure 2 shows several scenarios (A to D) regarding true parameters. If the method would work properly then the highest peak (maximum value of the likelihood) should correspond to true parameters. If either true recombination fraction or linkage disequilibrium is 0 then there is one single maximum value, which corresponds to the true parameters (Figure 2A, C and D). If the true values of both recombination fraction and linkage disequilibrium are different from 0 then estimates of both parameters are biased (Figure 2B). This occurs because the estimation of both parameters is confounded. Nevertheless, estimation of recombination fraction would be unbiased if true recombination is 0, which is necessary for phasing parents.
Table 3 shows the average of the estimate of the recombination fraction after assuming linkage equilibrium as described by [19] and as it has been assumed for all published QTL mapping experiments [20]. It is shown that estimates of the recombination fraction are generally biased even for true recombination fraction of 0 when family size is small. Table 4 shows computer simulation results for multifamily joint estimation of recombination fraction and linkage disequilibrium. The results show that estimates of both linkage disequilibrium and recombination fraction are unbiased in this setting regardless of the true (simulated) value of those parameters.
Genome fragmentation phasing strategy in a cattle halfsib family
A method named genome fragmentation phasing strategy was developed for phasing parents and progeny and used for phasing a cattle family comprising 36 calves to illustrate GFPS. The first step was to estimate recombination fraction between each of two consecutive SNPs. The genomewide distribution of recombination events between each of two consecutive SNPs is depicted in Figure 3. Although the great majority (over 92%) of the estimates were 0.00, there were many estimates of recombination fraction too high for the physical distance separating them. It may be attributed to either miss location of SNPs during the sequencing and alignment or genotyping errors. A recombination fraction larger than 0 was used (as a signal) to terminate a fragment and to initiate the next one. The fragmentation yielded a distribution of fragment size across the genome (number of SNPs per fragment) shown in Figure 4. Most of the fragments were rather small but some relatively large fragments (more than 200 SNPs) allowed genomewide identification of cold spots (Figure 5). These cold spots are for the sire producing meiosis and if tested in multiple families would allow distinguishing between cold spots at the individual and population levels.
The number of recombination events per chromosome per individual (gamete) is shown in Figure 6. There were some calves (189 and 284) with only single recombination event per chromosome. The distribution of all recombination events between fragments generated by GFPS along the 29 autosomes is given in Figure 7. There were 613 recombination events detected. In some situations, recombinations are evenly distributed in the genome. However, some areas have higher values suggesting either hot spots or miss location of whole fragments during the process of sequencing and assembling to produce Illumina’s array. For example, recombination fraction between the first fragment in chromosome 17 (unlinked with nearby fragments) was genomewide tested (all fragments in the entire genome) and resulted in a recombination fraction estimate of 0.07 with another fragment in chromosome 19 which suggests an error of assembling in the Illumina array.
The informativity of SNPs greatly increased after haplotype reconstruction using GFPS. More than 90% (range 86.1 to 95.1% for the 29 chromosomes) of noninformative SNPs become informative due to the use of the information of SNPs linked to them.
The analysis of genome regions of identity shared by two individuals is given in Figure 8. The analyses compare regions of identity (ROIs) inherited from sire or from dam for each of two individuals in a halfsib family. As expected for halfsibs, the regions shared or identical from paternal origin between two individuals is on average 0.52, with a range of values between 0.40 and 0.68. The maternally inherited ROIs between individuals ranged between 0.05 and 0.29 with a mean of 0.11. These results suggest that there is a significant amount of fragments with a relative large variation that are identical by descent among unrelated dams. Figure 9 shows the tail of the distributions of paternal and maternal ROIs. Fragments of more than 20 Mb of maternal autosomes are commonly shared by individuals with unrelated mothers.
The results of the simulation of BTA1 with the same LD and genetic linkage structure of our data are depicted in Table 5. The results of correct phasing are for informative segments. That is, homozygote calves allowed identification of origin of haplotype (paternal or maternal). Non informative areas are detectable and cannot be phased out. The phasing method was very accurate to identify haplotypes in both sires and calves. Recombination events were considered identified if they were less that 3Mb apart from the true (simulated) event. The average distance between the true (simulated) and identified recombination events was 0.79 Mb.
Discussion
This paper describes a new method, GFPS, for reconstructing haplotypes after the use of SNP arrays in breeding populations in which large groups of progeny from at least one progenitor are available. The method permits reconstruction of long haplotypes utilizing consecutive SNPs within a fragment in a similar fashion to the process of sequencing and assembling small fragments of DNA that are aligned using common sequences in the extremes in order to generate larger fragments. In turn, linkage analyses of long fragments can identify recombination events and pinpoint genotyping or assembling errors. The application of this method provides a new insight into the genome of highly used sires or dams by allowing identification of individual hot and cold spots or identifying chromosomes with non Mendelian segregation which might be caused by chromosomal abnormalities. The application of this method allows a high control of the haplotypes passing from generation to generation and can provide a better understanding of the genetic basis of production and diseases in breeding populations.
The alternative to the proposed GFPS is longrange phasing and imputation methods [21, 22] which make use of surrogate fathers and mothers in order to identify haplotypes. In general, imputation methods utilize allele frequencies in order to reconstruct haplotypes by assuming that the larger the allele frequencies, the larger the chance of being represented in the haplotypes. Consequently, there is always an error associated to imputed haplotypes and rare alleles are missed. The advantage of the method presented here and described elsewhere is that information on sibs is used together with the laws of inheritance [5, 10]. The novelty of our method is that it includes joint estimation of recombination and LD for reconstruction of parents and progeny haplotypes. As shown in this paper, if only one parent is available then both parameters can be accurately estimated if at least one of them is 0 as has been shown here with Monte Carlo computer simulations. Homozygous progeny from heterozygous sires flag which allele and haplotype is inhered from the sire, and consequently, allow identification of haplotypes of paternal and maternal origin. It dramatically increases informativity of heterozygous SNPs between parent and progeny (over 90%).
Many of the applications of SNP arrays in farm animals ignore the sequential nature of polymorphism in haplotypes. For example, in many instances genomic selection [1] assumes a large number of unlinked loci. As discussed in previous work it has implications in genomic evaluations that are systematically ignored [2, 3, 23]: a) progeny from the same parent share large haplotypes (not just single SNP) as shown with ROIs in this paper, and b) single heterozygous SNPs for both sire and halfsib progeny are noninformative, and consequently information is reduced or lost in genomic evaluations. Other advantages and potential applications of the proposed method are: 1) requires only currently available genotypic information on progeny as required for genomic selection when using juveniles for shortening generation intervals, 2) can be applied to any farm animal with large families even if only one parent is common to a group of progeny, 3) facilitates haplotyping the entire breeding population for new investigations such as signatures of selection or high order linkage disequilibrium, 4) can be used to estimate molecular relatedness more precisely by using haplotypes of given length rather than the sum of single SNPs (large fragments are likely identical by descent), and 5) can help to tracing up allele and haplotypes through generations which may facilitate the detection of genes involved in diseases or production.
The methods developed are designed for haplotyping farm animals with large families. However, they can also be applied to wild animals as long as large families and dense SNP arrays are available in those species. After all, the cattle used in our study came from a free range production system where the only samples or records taken were from DNA which was first used for paternity testing and then for phasing haplotypes.
Conclusions
Haplotype phasing is possible and highly accurate when estimating jointly linkage disequilibrium and genetic linkage in animal breeding populations as long as large families are available.
Appendix 1
Joint estimation of genetic linkage and linkage disequilibrium in halfsib families
Let the sire have genotype TtMm at two SNPs, T/t, and M/m and linkage phase (TM/tm) and n_{ j,i } be the genotype counts from offspring from the ith sire family (j=TTMM, TTMm, TTmm, TtMM, TtMm, Ttmm, ttMM, ttMm, and ttmm). The recombination fraction, c, is estimated simultaneously with linkage disequilibrium (δ), and allele frequencies (f_{ T }, f_{ M }). The likelihood equation for linkage phase TM/tm for data of the ith halfsib family is:
where the probabilities of offspring genotypes among halfsib offspring are obtained from Table 6: ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{TM}}$; ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{Tm}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{TM}}$; ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{Tm}}$; ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{tM}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{TM}}$; ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right)\left({f}_{\mathit{tm}}+{f}_{\mathit{TM}}\right)+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}\left({f}_{\mathit{tM}}+{f}_{\mathit{Tm}}\right)$; ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{Tm}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{tm}}$; ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{tM}}$${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{tM}}+\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.c\phantom{\rule{0.5em}{0ex}}{f}_{\mathit{tm}}$; and ${\varphi}_{\mathit{\text{TTMm}}}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.\left(1c\right){f}_{\mathit{tm}}$.
Likelihood equation [A1] can be solved by applying the E.M algorithm:
where N_{ i } is the size of the ith halfsib family. Using initial values of the haplotype frequencies and iterating over equation A2 will converge to ML estimates of haplotype frequencies. Linkage disequilibrium is estimated by $\widehat{\delta}={\widehat{f}}_{\mathit{TM}}^{i}{\widehat{f}}_{\mathit{tm}}^{i}{\widehat{f}}_{\mathit{Tm}}^{i}{\widehat{f}}_{\mathit{tM}}^{i}$.
If the linkage phase of the sire is Tm/tM then the E.M. equations are:
The probability of phase TM/tm is:
where L_{TM/tm}(δ, f_{ T }, f_{ M }, cdata) and L_{Tm/tM}(δ, f_{ T }, f_{ M }, cdata) are the likelihoods of phase (TM/tm) and phase (Tm/tM) under a model estimating linkage disequilibrium (δ), allele frequencies for one of the alleles at markers T/t and M/m (f_{ T }, and f_{ M }), and the recombination fraction (c).
Estimation of LD across multiple halfsib families
The likelihood equation to estimate LD across halfsib families is:
where L_{ i }(δ, f_{ T }, f_{ M }, c nG) is the likelihood for the ith halfsib family conditional to genotype marker information (nG) and nf is the number of families. Note that depending on the sire genotype, allele frequencies for T and M (double homozygote) or M (homoheterozygote) do not need to be estimated. The E.M. algorithm can be applied to multiple families by iterating on the four haplotype frequencies and recombination fraction:
where equations for haplotype frequencies for each single family varies depending on the sire genotype (see [15] for a full description of all possible situations). Equation [A4] was solved iteratively after giving a starting value to the haplotype frequencies and by estimating in each iteration ${\widehat{f}}_{T}={\widehat{f}}_{\mathit{Tm}}^{i}+{\widehat{f}}_{\mathit{TM}}^{i}$ and ${\widehat{f}}_{M}={\widehat{f}}_{\mathit{TM}}^{i}+{\widehat{f}}_{\mathit{tM}}^{i}$.
Abbreviations
 SNP:

Single Nucleotide Polymorphism
 LD:

Linkage disequilibrium
 ROI:

regions of identity
 GFPS:

Genome fragmentation phasing strategy.
References
 1.
Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome wide dense marker maps. Genetics. 2001, 157: 18191829.
 2.
Edriss V, Fernando RL, Su G, Lund MS, Guldbrandtsen B: The effect of using genealogybased haplotypes for genomic prediction. Genet Sel Evol. 2013, 45: 510.1186/12979686455.
 3.
Mulder HA, Calus MP, Veerkamp RF: Prediction of haplotypes for ungenotyped animals and its effect on markerassisted breeding value estimation. Genet Sel Evol. 2010, 42: 1010.1186/129796864210.
 4.
Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. Nat Rev Genet. 2011, 12: 703714.
 5.
Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am J Hum Genet. 2007, 81: 10841097. 10.1086/521987.
 6.
Scheet P, Stephens M: A fast and flexible statistical model for largescale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006, 78: 629644. 10.1086/502802.
 7.
Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missingdata imputation. Am J Hum Genet. 2005, 76: 449462. 10.1086/428594.
 8.
Rohde K, Fuerst R: Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. Hum Mutat. 2001, 17: 289295. 10.1002/humu.26.
 9.
Kruglyak L, Daly MJ, ReeveDaly MP, Lander ES: Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. H. Genet. 1996, 58: 13471363.
 10.
Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin  rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97101. 10.1038/ng786.
 11.
Lander ES, Green P: Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA. 1987, 84: 23632367. 10.1073/pnas.84.8.2363.
 12.
Williams AL, Housman DE, Rinard MC, Gifford DK: Rapid haplotype inference for nuclear families. Genome Biol. 2010, 11: R10810.1186/gb20101110r108.
 13.
Lai EY, Wang WB, Jiang T, Wu KP: A lineartime algorithm for reconstructing zerorecombinant haplotype configuration on a pedigree. BMC Bioinforma. 2012, 13 (Suppl 17): S1910.1186/1471210513S17S19.
 14.
GomezRaya L, Olsen HG, Klungland H, Våge DI, Olsaker I, Talle SB, Aasland M, Lien S: The use of genetic markers to measure genomics response to selection in livestock. Genetics. 2002, 162: 13811388.
 15.
GomezRaya L: Maximum likelihood estimation of linkage disequilibrium in halfsib families. Genetics. 2012, 191: 195213. 10.1534/genetics.111.137521.
 16.
Pemberton TJ, Absher D, Feldman MW, Myers RM, Rosenberg NA, Li JZ: Genomic patterns of homozygosity in worldwide human populations. Am J Hum Genet. 2012, 91: 275292. 10.1016/j.ajhg.2012.06.014.
 17.
Szpiech ZA, Xu J, Pemberton TJ, Peng W, Zöllner S, Rosenberg NA, Li JZ: Long Runs of Homozygosity Are Enriched for Deleterious Variation. Am J Hum Genet. 2013, 10.1016/j.ajhg.2013.05.003.
 18.
Purfield DC, Berry DP, McParland S, Bradley DG:Runs of homozygosity and population history in cattle. BMC Genet. 2012, 10.1186/147121561370.
 19.
GomezRaya L: Biased estimation of the recombination fraction using halfsib families and informative offspring. Genetics. 2001, 157: 13571367.
 20.
Georges M, Nielsen D, Mackinnon M, Mishra A, Okimoto R, et al: Mapping quantitative trait loci controlling milk production in dairy cattle by exploting progeny testing. Genetics. 1995, 139: 907920.
 21.
Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JHJ: A combined longrange phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011, 43: 1210.1186/129796864312.
 22.
Kong A, Masson G, Frigge ML, Gylfason A, Zusmanovich P, Thorleifsson G, Olason PI, Ingason A, Steinberg S, Rafnar T, Sulem P, Mouy M, Jonsson F, Thorsteinsdottir U, Gudbjartsson DF, Stefansson H, Stefansson K:Detection of sharing by descent, longrange phasing and haplotype imputation. Nature Genet. 2008, 40: 10681075. 10.1038/ng.216.
 23.
Meuwissen THE, Goddard ME: Prediction of identity by descent probabilities from markerhaplotypes. Genet Sel Evol. 2001, 33: 605634. 10.1186/12979686336605.
Acknowledgements
WMR acknowledges support from a Marie Curie International Reintegration Grant from the European Union, project no. PIRG08GA2010277031 “SelectionForWelfare”. LGR and WMR acknowledge support from project AGL201239137.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LGR contributed to the developed the idea and carried out the mathematical modeling for the joint estimation of LD and linkage disequilibrium. He drafted the first version of the manuscript. AMH participated in the design of the study, contributed with preparation of DNA samples and writing of several parts of the manuscript. DT carried out the tissue sampling, and contributed to the writing of the manuscript. WMR contributed to the development of the idea, helped to carry out tissue sampling and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Breeding
 Haplotype phasing
 Linkage disequilibrium
 SNP