Duo Shared Genomic Segment analysis identifies a genome-wide significant risk locus at 18q21.33 in myeloma pedigrees

Aim: High-risk pedigrees (HRPs) are a powerful design to map highly penetrant risk genes. We previously described Shared Genomic Segment (SGS) analysis, a mapping method for single large extended pedigrees that also addresses genetic heterogeneity inherent in complex diseases. SGS identifies shared segregating chromosomal regions that may be inherited by only a subset of cases. However, single large pedigrees that are individually powerful (at least 15 meioses between studied cases) are scarce. Here, we expand the SGS strategy to incorporate evidence from two extended HRPs by identifying the same segregating risk locus in both pedigrees and allowing for some relaxation in the size of each HRP. Methods: Duo-SGS is a procedure to combine single-pedigree SGS evidence. It implements statistically rigorous duo-pedigree thresholding to determine genome-wide significance levels that account for optimization across pedigree pairs. Single-pedigree SGS identifies optimal segments shared by case subsets at each locus across the genome, with nominal significance assessed empirically. Duo-SGS combines the statistical evidence for SGS segments at the same genomic location in two pedigrees using Fisher’s method. One pedigree is paired with all others and the best duo-SGS evidence at each locus across the genome is established. Genome-wide significance thresholds are determined through distribution-fitting and the Theory of Large Deviations. We applied the duo-SGS strategy to eleven extended myeloma HRPs. Results: We identified one genome-wide significant region at 18q21.33 (0.85 Mb, P = 7.3 × 10⁻⁹), which contains one gene, CDH20. Thirteen regions were genome-wide suggestive: 1q42.2, 2p16.1, 3p25.2, 5q21.3, 5q31.1, 6q16.1, 6q26, 7q11.23, 12q24.31, 13q13.3, 18p11.22, 18q22.3 and 19p13.12.
Conclusion: Our results provide novel risk loci with segregating evidence from multiple HRPs and offer compelling targets and specific segment carriers to focus a future search for functional variants involved in inherited risk for myeloma.

We recently described a novel strategy, called Shared Genomic Segment (SGS) analysis, to map genes involved in complex disease risk using extremely large high-risk pedigrees while allowing for intra-familial heterogeneity [20]. Cases sharing genomic segments from a common ancestor through 15 meioses or more are unexpected at a genome-wide level [23], and hence a single large high-risk pedigree (HRP) can provide the power to identify novel loci with genome-wide significance [24]. Our resource of eleven large myeloma pedigrees included several with 3-4 cases and meioses in the 8-14 range [20]. While these remain extremely large families, they may lack sufficient power individually for genome-wide significance. A multi-pedigree strategy is therefore attractive: evidence for the same risk locus in two extended pedigrees adds confidence to the locus and can build on the power of both. The remaining challenge for any multi-pedigree approach, however, is to adequately address heterogeneity between pedigrees [25].
Here, we expand the SGS method by combining evidence from pairs of HRPs, while still allowing for intra-familial heterogeneity within each pedigree. In our approach, duo-SGS, we fix one pedigree and optimize over all pedigree pairs that include it to balance discovery with multiple testing. Both pedigrees must have a segregating genomic segment at the same risk locus. The method is robust to allelic heterogeneity, as different alleles at the same locus may be shared within each pedigree. We apply the duo-SGS approach to eleven multiple myeloma (MM) HRPs to identify novel loci involved in myeloma risk.

Duo-SGS method
An overview of the duo-SGS approach can be found in Figure 1. After identifying HRPs and genotyping cases, the observed shared genomic segments in single pedigrees are established and compared between pedigrees, and genome-wide thresholds are determined.

Observed duo-SGS sharing
The single pedigree SGS approach has been described previously [20]. Briefly, the single SGS approach identifies observed shared genomic segments by defining consecutive runs of SNPs that are identical-by-state in a group of cases (Figure 1, Step 2). If an observed segment is significantly longer than expected by chance, inherited sharing (identity-by-descent) is implied. The nominal significance of each segment is assessed empirically. Expected sharing lengths under the null hypothesis are generated using a gene-dropping algorithm (Figure 1, Step 3). Chromosomes are assigned to the pedigree founders (those with no parents in the pedigree) randomly and according to a population linkage disequilibrium model. These simulated chromosomes are "dropped" through the pedigree structure using Mendelian inheritance expectations according to a genetic map for recombination. All members of the pedigree receive genotypes under the null hypothesis, and simulated genomic segments from this null configuration are established. These simulations are repeated at least one million times. The empirical P-value for an observed segment is the proportion of simulations in which the simulated segment is identical to or encompasses the observed segment. All subsets of at least two cases within a pedigree are assessed for observed segments. Then, at every position across the genome, the best evidence (lowest empirical P-value) for an excessive length of sharing is established (Figure 1, Step 4). This process results in a final optimized set of shared segments for a single pedigree. Each optimal segment corresponds to a specific subset of cases and has a nominal empirical P-value.
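The empirical P-value and the per-position optimization described above can be sketched as follows. This is a minimal illustration and not the authors' software: the function names, the tuple layout used for a segment, and the idea of passing in a precomputed count of qualifying simulations are all assumptions made here for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical segment record: (start_bp, end_bp, case_subset, empirical_p)
Segment = Tuple[int, int, FrozenSet[str], float]


def empirical_p(n_sims_matching: int, n_simulations: int) -> float:
    """Proportion of null simulations whose simulated segment is identical
    to, or encompasses, the observed segment."""
    if n_simulations <= 0:
        raise ValueError("n_simulations must be positive")
    return n_sims_matching / n_simulations


def best_segment_per_position(
    segments: List[Segment], positions: List[int]
) -> Dict[int, Segment]:
    """At each genomic position, keep the covering segment (over all case
    subsets) with the lowest empirical P-value."""
    best: Dict[int, Segment] = {}
    for pos in positions:
        covering = [s for s in segments if s[0] <= pos <= s[1]]
        if covering:
            best[pos] = min(covering, key=lambda s: s[3])
    return best
```

Positions not covered by any shared segment are simply absent from the result, so downstream steps can treat them as carrying no evidence.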
For two pedigrees, the duo-SGS evidence is the combination of the nominal empirical P-values for the optimal segments at the same genome position in the two pedigrees. Specifically, the Fisher method to combine P-values was used. All possible pedigree pairs could be considered as separate analyses, but there are n(n − 1)/2 pedigree pairs (the number of ways to select 2 pedigrees from n total pedigrees), and hence multiple testing can rapidly become an issue. Alternatively, a single analysis comprising optimization across all pedigree pairs could be considered, but this global approach may cloud individual pedigree-pair findings. To balance these two extremes, we propose a fixed-pedigree duo-SGS strategy (Figure 1, Step 5). The procedure is as follows: (1) fix a pedigree of interest; (2) calculate genome-wide duo-SGS evidence for the fixed pedigree with each of the other pedigrees; and (3) optimize across the duo-SGS findings to identify the most significant duo-SGS result at each point across the genome. The optimized findings over pedigree pairs are the duo-SGS results for the fixed pedigree. In this approach, we identify the best two-pedigree results that include the fixed pedigree. The procedure is then repeated for each pedigree, thus producing duo-SGS results for each pedigree.
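The three-step fixed-pedigree procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names and the dictionary layout (pedigree name mapped to a list of per-locus optimal P-values on a common grid) are hypothetical, and P-values are assumed to be strictly positive. For exactly two P-values, Fisher's statistic follows a chi-squared distribution with 4 degrees of freedom, whose upper tail has a closed form, so no statistics library is needed.

```python
import math
from typing import Dict, List, Tuple


def fisher_combine_two(p1: float, p2: float) -> float:
    """Fisher's method for two independent P-values.

    X = -2 * (ln p1 + ln p2) is chi-squared with 4 degrees of freedom under
    the null; its upper tail equals exp(-X/2) * (1 + X/2).
    """
    if not (0.0 < p1 <= 1.0 and 0.0 < p2 <= 1.0):
        raise ValueError("P-values must lie in (0, 1]")
    x = -2.0 * (math.log(p1) + math.log(p2))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def fixed_pedigree_duo(
    fixed: str, p_by_pedigree: Dict[str, List[float]]
) -> List[Tuple[float, str]]:
    """For the fixed pedigree, return at each locus the lowest combined
    P-value over all partner pedigrees, with the partner that produced it."""
    fixed_p = p_by_pedigree[fixed]
    results = []
    for locus, p_fixed in enumerate(fixed_p):
        candidates = [
            (fisher_combine_two(p_fixed, p_list[locus]), partner)
            for partner, p_list in p_by_pedigree.items()
            if partner != fixed
        ]
        results.append(min(candidates))
    return results
```

Calling `fixed_pedigree_duo` once per pedigree name reproduces the "repeat for each pedigree" step; the same pair can then appear in two result sets, once with each member fixed.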

Genome-wide thresholds for duo-SGS
Critical to interpreting the observed duo-SGS results are genome-wide significance duo-SGS thresholds for each pedigree (Figure 1, Step 6). To establish these, we echo the same optimization process in null data. Establishing these thresholds is similar to the calculation described for the single pedigree SGS method [20]. Under the reasonable assumption that the vast majority of the genome represents chance sharing (i.e., most of the genome does not contain a disease risk gene), we model the distribution for null sharing on the distribution of the empirical P-values for each pedigree. To avoid comparing the findings to themselves or skewing to possible true positives, the empirical P-values are perturbed, and the distribution-fitting is performed at 1 million simulations. The latter avoids inappropriately fitting the distribution to extreme outliers, namely the few results arising under the alternative hypothesis, which would be included if their final resolution were used. To perturb an empirical P-value we determine its Wilson score 95% confidence interval (CI) (Equation 1) and randomly sample a value from within it:

$$\mathrm{CI} = \frac{p + \frac{z^{2}}{2n} \pm z\sqrt{\frac{p(1-p)}{n} + \frac{z^{2}}{4n^{2}}}}{1 + \frac{z^{2}}{n}} \qquad (1)$$

where p is the empirical P-value, z is 1.96 (for the 95% CI), and n is the number of simulations (here, 1,000,000). The Wilson interval was selected because it always produces non-negative confidence bounds for the P-values. The genome-wide set of perturbed empirical P-values for a pedigree is considered the set of "null" P-values for that single pedigree.
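The perturbation step can be sketched as follows, using the standard Wilson score interval for a proportion. The function names are hypothetical, and drawing uniformly within the interval is an assumption made here: the text says only that a value is randomly sampled from within the interval.

```python
import math
import random
from typing import Optional, Tuple


def wilson_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score confidence interval for a proportion p from n trials."""
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1.0 - p) / n + z * z / (4.0 * n * n)
    )
    # Clamp only to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def perturb_p(
    p: float, n: int = 1_000_000, rng: Optional[random.Random] = None
) -> float:
    """Draw a perturbed P-value from within the Wilson 95% interval.

    Uniform sampling is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random()
    low, high = wilson_interval(p, n)
    return rng.uniform(low, high)
```

With p = 0 and n = 1,000,000 the lower bound is 0 and the upper bound is roughly 3.8 × 10⁻⁶, which illustrates why the interval stays non-negative even for the smallest empirical P-values.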
The duo-SGS procedure (described above) is performed using the single pedigree genome-wide null P-values. The result of this process is a set of optimal duo-SGS null P-values.
Genome-wide significant and suggestive thresholds are determined following our previously described method for single pedigree SGS [20]. Briefly, the null duo-SGS P-values are log-transformed and fitted to a gamma distribution. The shape (k) and rate (σ) parameters of the fitted distribution are applied using the Theory of Large Deviations to calculate the significance thresholds by solving for X, where μ(X) is the genome-wide false-positive rate, C is the number of chromosomes, α(X) is the probability of exceeding X for X ∼ χ² with 2k degrees of freedom, and G is the genome length in Morgans [26]. The false-positive rate is set to 0.05 for the genome-wide significant threshold and 1.0 for the genome-wide suggestive threshold. After solving for X, the threshold T is determined from X. Thresholds are specific to each fixed pedigree and are used to assess that pedigree's duo-SGS results.
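The link between the fitted gamma distribution and the chi-squared tail probability α(X) rests on a standard distributional identity, sketched below. The symbol Y, standing for a log-transformed null duo-SGS P-value expressed on a positive scale, is notation introduced here for illustration and does not appear in the source.

```latex
Y \sim \mathrm{Gamma}(\text{shape } k,\ \text{rate } \sigma)
\quad\Longrightarrow\quad
X = 2\sigma Y \sim \chi^{2}_{2k},
\qquad
\alpha(X) = \Pr\!\left(\chi^{2}_{2k} > X\right).
```

Under this identity, a value found on the X scale corresponds to Y = X/(2σ) on the log-transformed scale. The base of the log transform is not stated in this section, so the final conversion to a P-value threshold is not spelled out here.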

MM high-risk pedigrees
The statewide Utah Cancer Registry (UCR) has been an NCI-supported Surveillance, Epidemiology, and End Results (SEER) Program registry since its inception in 1966.
The UCR was utilized to invite all individuals with myeloma in the state to participate. Peripheral blood was collected for DNA extraction from individuals who completed informed consent.
The Utah Population Database (UPDB) is a unique resource [27]. It includes a 16-generation genealogy of approximately 5 million people with at least one event in Utah that is record-linked to the UCR and state vital records. Using the UPDB, ancestors whose descendants have an excess of disease based on internal cancer rates and years at risk can be identified and studied as HRPs. The UPDB was used to identify ancestors whose descendants showed a statistical excess of MM (P < 0.05). The expectation was calculated from internal disease rates by birth cohort, sex, and birthplace (in/outside Utah), together with years at risk. The total number of myeloma cases in each HRP identified ranged from 4 to 37. After annotating the pedigrees with the cases for whom DNA was available, 11 pedigrees were identified that contained 3 or 4 myeloma cases with DNA (28 individuals; 8 individuals were in more than one pedigree). In each pedigree, the cases were separated by 8 to 23 meioses.
DNA from the 28 cases was genotyped on the Illumina Omni Express high-density SNP array at the University of Utah. Only high-quality bi-allelic SNPs and individuals with adequate call rates across the genome were included. The PLINK software [28] was used for quality control. SNPs with < 95% call rate across the 28 individuals were removed. After filtering, 678,447 SNPs remained. These SNPs were transformed to match 1000 Genomes strand orientation.
Individuals were removed if < 90% of the filtered SNPs were called. One myeloma case had a < 90% call rate and was eliminated from the study. We also checked for sex inconsistency based on the genotypes; all cases passed. PLINK relationship estimates were compared with the UPDB pedigree structures, and no issues were found.
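The two-stage call-rate filter described above (SNPs first, then individuals on the retained SNPs) can be sketched in plain Python. This is an illustration only: the study used PLINK for quality control, and the function name, the in-memory data layout, and the use of `None` for a missing call are assumptions of this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: sample ID -> genotype calls per SNP (None = no call)
Genotypes = Dict[str, List[Optional[int]]]


def filter_by_call_rate(
    genotypes: Genotypes, snp_min: float = 0.95, sample_min: float = 0.90
) -> Tuple[List[int], List[str]]:
    """Return (indices of retained SNPs, IDs of retained samples)."""
    samples = list(genotypes)
    n_snps = len(next(iter(genotypes.values())))

    # Stage 1: keep SNPs called in at least snp_min of all samples.
    kept_snps = [
        j
        for j in range(n_snps)
        if sum(genotypes[s][j] is not None for s in samples) / len(samples)
        >= snp_min
    ]

    # Stage 2: keep samples called at >= sample_min of the retained SNPs.
    kept_samples = [
        s
        for s in samples
        if kept_snps
        and sum(genotypes[s][j] is not None for j in kept_snps) / len(kept_snps)
        >= sample_min
    ]
    return kept_snps, kept_samples
```

The default thresholds mirror the 95% and 90% cut-offs in the text; a toy data set with only a handful of samples needs looser values to retain anything.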
The duo-SGS method was applied to the MM pedigrees to identify regions with genome-wide suggestive or significant evidence. Post-hoc, some duo-SGS regions were removed from consideration. Duplicate regions occur when the same pair of pedigrees identifies the same region in both of their fixed-pedigree results. In these situations, duo-SGS P-values are identical, but thresholds vary by which pedigree is fixed, potentially leading to different significance levels. The most significant result was reported, and the lesser removed. If an individual resided in two pedigrees and also shared the region in both pedigrees, the region was removed. If the region spanned a centromere, it was removed. In total, forty-two suggestive regions were removed because they were duplicates, involved an overlapping individual, or spanned a centromere.
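The three post-hoc exclusion rules can be sketched as a single pass over candidate regions. Everything named here is hypothetical: the `DuoRegion` record, its fields, the centromere lookup table, and the coordinates in any example data. Two further assumptions are made: "most significant" among duplicates is read as the lower expected rate per genome μ, and "spanned" is read as fully covering the centromere interval.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class DuoRegion:
    chrom: str
    start: int
    end: int
    fixed: str                      # pedigree that was fixed
    partner: str                    # pedigree it was paired with
    mu: float                       # expected rate per genome (lower = stronger)
    sharers_fixed: FrozenSet[str]   # cases sharing in the fixed pedigree
    sharers_partner: FrozenSet[str]  # cases sharing in the partner pedigree


def post_hoc_filter(
    regions: List[DuoRegion], centromeres: Dict[str, Tuple[int, int]]
) -> List[DuoRegion]:
    best: Dict[tuple, DuoRegion] = {}
    for r in regions:
        # Rule 1: drop regions shared by the same individual in both pedigrees.
        if r.sharers_fixed & r.sharers_partner:
            continue
        # Rule 2: drop regions that span the chromosome's centromere.
        cen_start, cen_end = centromeres[r.chrom]
        if r.start <= cen_start and r.end >= cen_end:
            continue
        # Rule 3: for the same pair and region found with either pedigree
        # fixed, keep only the more significant result.
        key = (r.chrom, r.start, r.end, frozenset((r.fixed, r.partner)))
        if key not in best or r.mu < best[key].mu:
            best[key] = r
    return list(best.values())
```

Keying duplicates on the unordered pedigree pair means the result does not depend on which member of the pair was processed first.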

RESULTS
Duo-SGS findings were identified for each of the eleven MM HRPs. The significance thresholds for each fixed pedigree are in Table 1. One region at 18q21.33 reached genome-wide significance and 13 regions were genome-wide suggestive. Table 2 shows the details of the significant or suggestive regions identified, including the duo-SGS P-value, expected rate per genome μ(t), the two pedigrees involved, each segregating shared region in the pedigrees, and the overlapping region. Figure 2 shows the two regions and the overlap.
Thirteen loci were found with genome-wide suggestive evidence [Table 2]. In four of these loci, several pedigree pairs provide duo-SGS evidence beyond genome-wide suggestive. The locus at 6q16.1 was previously identified as significant in single pedigree SGS in UT-571744, with risk variants in USP45 implicated [20]. Here, we find five pedigree pairs, all including UT-571744, that provide suggestive evidence, including one pair which achieves the second-highest duo-SGS significance in the study [μ(t) = 0.121, P = 7.  Table 3.

DISCUSSION
We expanded the shared genomic segment method to identify segregating chromosomal segments with overlapping statistical evidence from two HRPs. The strategy allows for genetic heterogeneity within each pedigree and provides formal significance thresholds for interpretation. The approach circumvents issues of intra-familial heterogeneity that can hinder mapping in large pedigrees. For complex diseases, large HRPs are likely enriched for multiple susceptibility variants [24] and sprinkled with sporadic cases; hence methods that require all cases to share to attain discovery power are not suitable. Here we optimize over subsets within pedigrees and consider pairing with all other pedigrees to provide the flexibility required. The method also specifically defines which pedigrees and cases share evidence at a locus, which is imperative for follow-up sequencing. Additional value may be gained by comparing demographic or clinical characteristics of the sharers in each pedigree to nuance the phenotype which may aid future gene mapping and provide insight into the nature of the mechanism of risk at a locus.
Application of the novel duo-SGS approach to eleven MM HRPs implicated a novel genome-wide significant region at 18q21.33 in MM risk, as well as 13 suggestive regions. Other than 6q16.1, which overlaps with our previous single pedigree SGS study, all loci identified in this study provide novel regions of interest in myeloma. None of the regions overlapped with existing genome-wide association study loci or other prior rare risk variants implicated in MM. A next step would be to investigate the loci for rare and deleterious coding variants or regulatory variants. Pedigree segregation methods can provide statistically compelling regions to concentrate efforts to identify and characterize regulatory risk variants. Also, SGS results can be used as genomic annotations of prior evidence to layer with additional omic information or provide a focused region for interrogating regulatory risk variants.
The literature supports a role of some of the genes found in our duo-SGS regions in MM.
The genome-wide significant region at 18q21.33 contained CDH20, a gene that plays a role in intercellular adhesion by forming cadherin junctions. Cadherins have been implicated in solid tumor invasion and metastasis, as disruption of cell-cell junctions is a prerequisite [29]. Solid tumors co-aggregate in MM families, suggesting a shared genetic background [10]. At 6q26, several pedigree pairs were genome-wide suggestive, and the overlapping segments fall in PARK2, which mediates proteasomal degradation. PARK2 is a tumor suppressor [30], and the gene harbors risk variants for lung cancer [31].
While the duo-SGS approach is useful for analyzing pedigrees smaller than those typically required for the single pedigree SGS approach, a large number of meioses is still required. The HRPs in this study (8-23 meioses between sampled cases) are still substantially larger than those available in most family-based resources. Hence the method has practical limitations in other settings. Nonetheless, the interesting regions identified in large pedigrees provide evidence that can be used to narrow the search for risk variants in smaller families as well, as demonstrated in our prior study [20].
As in all family-based genetic studies, our results could be sensitive to inaccurate pedigree structures. However, relationship and ethnicity checks are standard protocols and mitigate the possibility of error. Another limitation of this study is its observational nature.
Additional functional studies will be required to describe causation and characterize the mechanisms involved in these loci and myeloma risk.
We have identified several novel loci that segregate in at least two myeloma HRPs. These loci are likely to harbor genes and rare risk variants for MM and are compelling new targets for inherited risk to MM.
In conclusion, we developed a novel strategy for gene mapping in complex traits that uses multiple large high-risk pedigrees. The approach is robust to heterogeneity both within and between pedigrees and formally corrects for multiple testing to allow for statistically rigorous discovery. We applied this strategy to MM, a complex cancer of plasma cells, and identified one novel genome-wide significant locus at 18q21.33 and 13 suggestive loci. Our study offers a new technique for gene mapping and demonstrates its utility to narrow the search for risk variants in complex traits.