Phasing of de novo mutations using a scaled‐up multiple amplicon long‐read sequencing approach

Abstract De novo mutations (DNMs) play an important role in severe genetic disorders that reduce fitness. To better understand their role in disease, it is important to determine the parent‐of‐origin and timing of mutational events that give rise to these mutations, especially in sex‐specific developmental disorders such as male infertility. However, currently available short‐read sequencing approaches are not ideally suited for phasing, as this requires long continuous DNA strands that span both the DNM and one or more informative single‐nucleotide polymorphisms. To overcome these challenges, we optimized and implemented a multiplexed long‐read sequencing approach using Oxford Nanopore technologies MinION platform. We focused on improving target amplification, integrating long‐read sequenced data with high‐quality short‐read sequence data, and developing an anchored phasing computational method. This approach handled the inherent phasing challenges of long‐range target amplification and the normal accumulation of sequencing error associated with long‐read sequencing. In total, 77 of 109 DNMs (71%) were successfully phased and parent‐of‐origin identified. The majority of phased DNMs were prezygotic (90%), the accuracy of which is highlighted by an average mutant allele frequency of 49.6% and standard error of 0.84%. This study demonstrates the benefits of employing an integrated short‐read and long‐read sequencing approach for large‐scale DNM phasing.

2012). It has been shown that approximately 80% of DNMs are of paternal origin (Goldmann et al., 2016;Kong et al., 2012;Oud et al., 2022;Yuen et al., 2016). A major factor known to contribute to an increase in DMNs in individuals is advanced parental age at the time of conception, particularly paternal age (Goldmann et al., 2016;Kong et al., 2012). Investigating the parental origin and timing of DNMs provides not only biological insight into the generation and ability of these DNMs to underlie genetic disorders, it has also been shown to be important for determining the recurrency risk of these disorders (Almobarak et al., 2021;Campbell et al., 2014).
Phasing analysis interrogates the diploid genome, allowing allele separation of the parental chromosomes. This helps not only to determine the parental origin and timing of DNMs, but is also critical to identify compound heterozygous mutations and look into allelespecific expression, linked variants, and structural variation (Ebert et al., 2021;Soifer et al., 2020;Tewhey et al., 2011). With short-read whole genome sequencing (WGS) of parent-offspring trios, 15%-20% of DNMs can be successfully phased and parent-oforigin called (Goldmann et al., 2016). However, this percentage is expected to be even lower in whole exome sequencing (WES).
Phasing challenges can be attributed to the limited sequencing read lengths, the presence of intronic gaps, and the reduced amount of genetic variation in the exonic regions compared to intronic regions (Frigola et al., 2017). By definition, germline DNMs need to be absent in the parental somatic cells, requiring trio-based exome or genome sequencing of parent-offspring trios for discovery. In the next step, the parent-of-origin and zygosity of a DNM can be identified by targeted amplification and long-read sequencing of a region spanning the DNM as well as one or more parentally informative singlenucleotide polymorphism (iSNP). While this appears straightforward, long-read sequencing has both random and positional error, which may result in false variants used for phasing, reducing the reliability of downstream analysis (Magi et al., 2018;Watson & Warr, 2019).
There are numerous methodologies to target genomic regions for enrichment before sequencing, with the majority being divided into PCR-or CRISPR-based approaches (Gilpatrick et al., 2020;Hafford-Tear et al., 2019;Player et al., 2020), in addition, a novel "adaptive sampling" method has recently been developed (Loose et al., 2016;Payne et al., 2021). Adaptive sampling takes advantage of the realtime individual pore control, sequencing, and basecalling of the Oxford nanopore technologies (ONT) platform to selectively sequence DNA. Briefly, adaptive sampling uses a known library to interpret each DNA molecule as it is ratcheted through a protein pore, reversing the polarity through the pore, and rejecting the molecule if it fails to match its reference library. Comparative to PCR enrichment, adaptive sampling or "ReadUntil" is limited in coverage, with reduced flow cell yields of up to 90% (S. Martin et al., 2022).
This results in greater consumable costs, which would be additional to the greater computational demand of the approach. Similarly, when mapping sequence data to the reference genome from CRISPR targeting approaches, the off-target mapping of the sequences is several fold greater than PCR-based methods and target coverage is therefore often many factors lower (Hafford-Tear et al., McDonald et al., 2021), with costs per target significantly higher.
Innate challenges also exist with PCR approaches, including the presence of inhibitors, variable target length, optimisation time, amplification bias, and nucleotide errors (Potapov & Ong, 2017;Shagin et al., 2017). However, despite these challenges with longrange PCR enrichment, the approach is arguably more effective for scaled-up targeted phasing at present.
This study aims to identify the DNM parent-of-origin and zygosity using a targeted long-range and standard PCR approach for phasing 109 distinct DNMs previously identified in infertile men by patient-parent trio exome sequencing . Targeted amplification of regions encompassing each unique DNM is performed using an optimized PCR workflow designed to quickly increase PCR success rates and reduce presequencing base error from standard and long-range PCR. The combination of exome patient-parent trio data, targeted ONT sequencing, and validated DNMs are used to improve phasing and allele frequency confidence. Critical aspects of the process are assessed to ascertain practical application for large-scale use of the approach, including amplification length, long-read sequencing error rates, and overall phasing performance.

| WES
Genomic DNA was extracted from the whole blood for all patients at the time of evaluation and treatment at the fertility center. Parental DNA was obtained from saliva using the Oragene OG-500 kit (DNA Genotek). Exomes of patient-parent trios were prepared for sequencing using 1 µg of high-quality genomic DNA, which was quantified with the Qubit dsDNA HS kit (Thermo Fisher Scientific).

| DNM/iSNP calling and DNM validation
Sequence reads were aligned to the Human Genome Reference Assembly GCRh37 using Burrows-Wheeler Alignment (BWA) version 0.7.17 (Li, 2013) and indexed using SAMtools (Li et al., 2009) McKenna et al., 2010). De novo analysis was performed using custom-designed in-house analysis pipelines and validation of DNMs in the proband and parents was performed using standard Sanger Sequencing . iSNP calling was also performed using an in-house pipeline, which called proband and parent variants using Clair v2 (Luo et al., 2020), proband variants were selected based on unique parental inheritance with a minimum parental coverage of 5×. A single iSNP is selected for parent-of-origin assignment, see Section 2.7.

| Target amplification
In 77 infertile males analyzed, 109 rare protein-coding DNMs were identified and PCR amplified targeted phasing was performed, for a basic schematic of the methods see Supporting Information: Figure S1.
Irrespective of known iSNP presence, primers were designed for the target regions (determined as per Figure 1b) of all 109 DNMs (Supporting Information: Table S1) using our in-house pipeline which designs primers using Primer3 version 2.3.6 (Untergasser et al., 2012) and Human Genome Reference Assembly GCRh37. Automated rounds of primer creation were run for each target, selecting the most optimal primer pair based on predicted fragment size, self-binding, hairpin factors, reverse primer to forward primer binding, and nonspecific genome binding. All expected fragment sizes were limited to a maximum length of 10 kb for greater enrichment success and speed, F I G U R E 1 (a) Pedigree chart illustrating the bases of an iSNP and a DNM. To phase a DNM and determine the parent-of-origin, single reads must span from the DNM to the iSNP without fragmentation. The distance between these two points of interest dictates the feasibility of phasing the target region with short or long-read sequencers. PCR amplification followed by long-read sequencing allows target read lengths of up to 10,000 bp. (b) Illustration of the three categories in which targets are grouped and phased. The gray boxes upon the thin line in each category (1, 2, and 3) represent existing exome read coverage. Each category is based on the distance between the iSNP and DNM, which determines whether it can be phased with only the existing short-read WES trio data (1), WES and long-read proband data (2), or WES and longread trio data (3). The percentage of samples that fit each category is highlighted in blue. DNM, de novo mutation; iSNP, informative singlenucleotide polymorphism; WES, whole exome sequencing. though optimisation steps were required for 50% of the targets (Supporting Information: Figure S2).
For this group primers were designed in 5 kb windows around the DNM. iSNPs from the~5 kb region were identified in the introns of parental ONT data in the downstream bioinformatics. Standard and long-range PCR target enrichment were carried out using optimized running conditions of two separate supermixes/enzymes, BioRad iproof and TakaraBio PrimeSTAR GLX (Supporting Information: Table S2). Primary steps use dilution to reduce contaminants and inhibitors, where dilution was not enough a hotstart supermix (Quantabio RepliQa HiFi ToughMix) was used which reduces the need for annealing temperature specificity (Supporting Information: Table S2). The number of primer pairs that required bespoke optimisation was reduced to just 12% of the samples by using broad and basic steps as per the Supporting Information:

| Optimising scalability
On average 30 targets per MinION run were amplified, pooled, and sequenced. Initially, a limited number of 11 samples were barcoded and sequenced, but a reduction in usable data was noted (Supporting Information: Table S3). As such, this approach was replaced with careful sample pooling instead. The 11 samples were not resequenced, as the small number meant there was still ample data per target (mean of 19 million bases per target). Targets were checked to ensure no overlap in genome location, and demultiplexing was performed based on region extraction of the mapped BAM files (see Section 2.7). Using this approach, a loss of 17.8% (SEM 7.2) of the total basecalled yield was avoided by bypassing barcode ligations, demultiplexing binning issues, and barcode trimming (Supporting Information: Table S3).

| Long-read sequencing
The PCR target enrichments of >20 ng were prepared for sequencing with the ONT ligation sequencing kit (SQK-LSK109) following the manufacturer's protocol, with adjustments for sample type and yield.
Individual sample libraries with <20 ng were concentrated at given

| Bioinformatics: Variant calling
Variants were called using Clair v2 (Luo et al., 2020) and filter parameters for variants required >20× coverage, allele frequency 20%-80%, 96% confidence, and allele agreement with the validated DNM. The remaining variants were error-polished using the high base accurate WES data, removing variants that disagree with any overlapping WES data. WES data for patients and parents were used in filtering ONT called variants illustrated in Figure 2 and Supporting Information: Tables S4 and S5. Intronic variants identified were further error-polished where the inheritance information was available from long-read parental data.

| Bioinformatics: Phasing
To identify reliable variants, reads were split based on DNM base type, and the remaining variants were checked for agreement with the split reads using bamql (Masella et al., 2016). The iSNP with the greatest combinations of split read agreement, coverage, and additional supporting sequencing data was selected for final phasing, see Supporting Information: Table S6 for iSNP information. This preliminary variant sorting and iSNP selection approach provided a more reliable variant list. The next step split the raw reads by base type at DNM and selected iSNP positions using bamql. After this step, the allele frequencies, parent-of-origin, and timing of the DNM event (pre/postzygosity) were determined. The parent-of-origin conclusion was re-affirmed manually in Integrative Genomics Viewer F I G U R E 2 Using validated DNM for an anchored phasing pipeline. Pipeline overview for phasing, including illustration of variant strength and filtering. Reads are split by their Sanger-validated DNM nucleotide, then homozygous variants are identified in the two groups. These variants are compared to additional supporting sequencing data and prioritized for phasing based on their primary, secondary, tertiary, and unvalidated categories. These categories are based on the extent of the variant validation from WES and ONT Trio data. Small deletions that have no alternative supporting data are removed due to their high false call rate. An iSNP is selected based on the sequencing support category and used to phase the original proband ONT mapped reads, after which the allele frequencies, zygosity, and parent-of-origin are determined. DNM, de novo mutation; iSNP, informative single-nucleotide polymorphism.
(IGV) by visualizing the reads containing the DNM and the chosen iSNP from the pipeline.
Additional to manual analysis in IGV, postzygotic and prezygotic DNMs were determined by three category assessments, those that appeared postzygotic are detailed in Supporting Information: Table S7. The categories are; WES DNM base information (coverage/frequency), ONT DNM base information (coverage/frequency), and ONT allele information (coverage and allele frequencies). This information also helped determine background allele error, by presenting reads in the data that represent third and fourth allelic forms.
Allele and base error were calculated in two ways, one approach represented total error, which was calculated for a target DNM by combining all known false base frequencies for total base position error, and for total allele error, combining all known false allele frequencies.
The other approach used the most prevalent base or allele frequency that could be determined as false. Assessment of error and noise in data helped support prezygotic and postzygotic DNM calls.
To provide an assessment of basecalling background error that is allele relevant, we ignored checking every base within every target as that would be unnecessarily extensive and time-consuming. Instead, we used the highest false base percentage of each iSNP used in phasing each target and calculated the mean false base percentage (Supporting Information: Table S8). Furthermore, the false base could be qualified as it had parental data support. A quality assessment of the bases for each target within the BAM files was performed using Picard "QualityScoreDistributions" (PicardToolkit, 2019).

| RESULTS
WES of 77 infertile males and their unaffected parents identified 109 rare DNMs, all of which were independently validated by Sanger sequencing. Accurate phasing and parent-of-origin calling of the DNMs requires DNA molecules spanning a parentally iSNP and the DNM (Figure 1a). As such, the ability to call the parent-of-origin is primarily reliant on read lengths. This is notable in the WES data, where only 8% of DNMs could be phased as most iSNP were located >300 bp away from the DNM (Figure 1b-(1) and Table 1).

| Target amplification groups
For phasing, all DNM regions were targeted with either standard or long-range PCR, depending on required fragment size, and sequenced using ONT long-read sequencing. Primer pairs were designed for targeted long-read phasing of the 109 targets (Supporting Information: Table S1). The fragment size was capped at~10 kb to simplify PCR optimisation. Unfortunately, amplification success was still impacted by Maternal 01974 PLEKHA1 C > G 2 C-G C-G target length within the 10 kb range, with ideal lengths found to be <4 kb (Supporting Information: Figure S2a). PCR optimisation steps were required for 50% of all targets (Supporting Information: Figure S2b). The amplification size had no impact on error or quality in base calls or allele assignment (Supporting Information: Figure S3 and Table S8). Importantly, for 71% of cases, an iSNP was identified in the available trio-based WES data within 10 kb of the DNM (Figure 1b- (1) and (2)). For this group of DNMs, phasing can be done by targeted longread sequencing of the proband only, since the iSNP is already typed in patients and their parents.
Cases, where an iSNP could not be found in the coding region, had primers designed to cover 5 kb regions around the DNM position for parent and proband samples (Figure 1b-(3)). We chose the 5 kb region based on the analysis of 4344 DNMs identified in 53 whole genome sequenced individuals/children . This revealed the presence of at least 1 iSNP within 5 kb of any given DNM in 81% of cases (Supporting Information: Tables S9 and S10). In the end, we obtained long-read sequencing data with iSNPs for 77 out of 109 DNMs selected (71%, Supporting Information: Tables S9 and S11).  Table S4). Our data contain a high percentage of false variants, this is largely due to handling raw data rather than consensus polished data. We decided to use this raw data approach to increase sensitivity to detect mosaic mutations. Small indels made up on average 25% of the false positives, and on average 94% of all indels detected in the long-read sequencing data were likely false positives (see Supporting Information: Table S5, and for an example of a false indel see Supporting Information: Figure S4). Because of this high error rate for indel calling, we decided to remove all indels without supporting data available, which is noted in the illustration of our approach (Figure 2). For simplicity, these indels include those that occur over homopolymer regions, which may exaggerate the error of nonhomopolymer indel calls, though recent "Guppy" basecaller updates have been suggested to reduce homopolymer error.

| Validation of the phasing approach using WES data
The parent-of-origin for the DNMs was initially assessed from the short-read WES data. Short-reads, 150 bp in length, encompassing DNM and iSNPs in the WES data were available for 14 of the 109 DNMs, but only 9 (8%) (Figure 1b- (1)) had acceptable coverage (>10× per allele, Table 1). These DNMs were also target amplified and sequenced using ONT long-read sequencing to provide a control cohort for our long-read phasing method. For all 9 DNMs with acceptable coverage in the WES data, the parent-of-origin assignment agreed between the two approaches and the DNM allele frequencies obtained were comparable (  Table S8), and that impacted the allele frequency significantly. After removal of these false allelic data, we noticed that the average ONT allele frequency for prezygotic DNMs was around 50% with an average of 49.6%, and a standard error margin (SEM) of 0.8% (Supporting Information: Table S13 and Figure S5). For germline DNMs, we expect the mutant allele frequency to be approximately 50%. While raw allele frequencies were expectedly much worse than DNM base frequencies, the polished allele frequencies are slightly more accurate compared to the WES DNM and ONT DNM prezygotic nucleotide base frequencies of 48.6% (SEM 1.0%) and 48.8% (SEM 0.8%), respectively (Supporting Information: Table S13). As expected, postzygotic frequencies of mutant alleles deviated significantly from this, with an average of 15.9% and a standard error margin of 2.4% in ONT allele data. Comparatively the frequencies of postzygotic mutations are less consistent in WES data, where the postzygotic frequency average was 28.8% (SEM 4.9%) (Supporting Information: Table S13).
The parent-of-origin was also investigated within the pre and postzygotic fractions, where we observe the prezygotic paternal preference of 83% and the postzygotic paternal origin of 62% (five out of eight), see Supporting Information: Figure S6. Classification of postzygotic DNMs was determined as per Supporting Information: Table S7 and a detailed IGV illustration is depicted in Figure 3. All postzygotic calls were corroborated by the DNM base frequencies in the short-read and long-read data, and further supported by all F I G U R E 3 Example of the postzygotic DNM in C10orf71 visualized in IGV. The iSNP and DNM base associated (colored) counts and percentages are displayed at the top of the plot as a descriptive extension of the coverage bar. This postzygotic DNM is observed in the ONT and WES allele data and is supported by the discrepancies in the DNM positional base frequencies, while the iSNP base frequencies remain close to the expected 50:50 ratio. The IGV visual of the reads presents them grouped by iSNP base type. This read view illustrates a third allelic form in the reads that matches the base and iSNP combination expected, had the DNM not occurred (C-A 33% in ONT, 41% in WES). DNM, de novo mutation; iSNP, informative single-nucleotide polymorphism; WES, whole exome sequencing.
identified allele frequencies and the degree of error in target data (see Supporting Information: Tables S8 and S7).

| DISCUSSION
In this study, we optimized and applied a multiplexed long-read sequencing approach which makes use of high-quality short-read exome data to perform routine phasing of DNMs. We report the phasing results of 77 DNMs from 64 of our 77 patient-parent trios linked to male infertility, achieving successful phasing for 71% of the 109 DNMs investigated using this long-read targeted approach. In contrast, only 9 of these DNMs (8%) could be reliably phased based on short-read WES data alone.
Short-read exome sequencing has become an increasingly DNMs are known to arise from mutational events occurring during gametogenesis, predominantly during spermatogenesis rather than during oogenesis. This is assumed to be associated to the scale of male gamete production and failure of DNA repair mechanisms which lead to the increased opportunity for mutational events to occur (Aitken & Baker, 2020;Evenson et al., 2020;Grégoire et al., 2013;Haldane, 1947;Kong et al., 2012). Previous literature has shown that DNMs occur on the paternal allele approximately 80% of the time (Goldmann et al., 2016;Kong et al., 2012;Yuen et al., 2016).
In agreement with this literature, 83% of all phased DNMs in this study were determined to be of paternal origin.
Parent-of-origin and zygosity information adds another layer to our understanding of potential disease-causing variants. This is important when investigating genetic diseases, especially those that likely have complex and varied mechanisms. In our cohort of 77 patients, 51 patients were confirmed to suffer from nonobstructive severe oligospermic or azoospermic phenotypes. In the original publication related to this study , we showed that six out of the eight likely causative DNMs identified in these patients were of paternal origin (Supporting Information: Tables S9 and S14, Figure S7). As DNMs predominantly cause a dominant effect, it suggests that DNMs with a deleterious effect on the health of an individual can escape negative selection in spermatids from the paternal germline.
Accurate detection of the DNM allele frequency is critical to differentiate prezygotic from postzygotic mutational events, important in clinical settings for estimating the recurrence risk (Almobarak et al., 2021;Scanga et al., 2021). Our approach yielded a highly accurate allele frequency average of 49.6% in the prezygotic mutations, with an SEM of 0.84% (Supporting Information: Nonetheless, CRISPR-Cas target enrichment shows great promise, and will likely be the best approach for targets larger than 10-20 kb.
Comparative to PCR, adaptive sampling would also significantly increase costs due to the drop in flow cell yield and the specific computational infrastructure required. Despite this, adaptive sampling is an intriguing technology, with advantages over challenging genomic regions and swift targeted diagnostics.
No correlation was observed in basecalling error from PCR extension in targets of greater sizes. However, it is worth considering the potential increases in base error and bias from PCR approaches which would compound the lower accuracy inherent to long-read sequencing. Our data support the importance of minimizing target region sizes when performing PCR-based amplification for targeted sequencing, especially when performing primer optimisation for >100 bespoke primer pairs. Limiting target sizes will reduce labor-intensive PCR optimisation and though not observed in our study it may also reduce base error from PCR fidelity issues. We should, however, be mindful that for 11% of the DNMs studied no iSNPs were found within the 5 kb window, so minimizing the target region can also negatively impact phasing. For another 18% of DNMs, however, the sequencing data was of insufficient quality for phasing purposes, so clearly a balance must be found between sequencing quality and target size.
Since ONT released the MinION platform in 2014, there have been extensive leaps in advancing both the chemistry and the bioinformatic tools. This has resulted in raw base accuracy moving from as low as~60% (Loman & Watson, 2015) to the current 92%-97% in the 9.4.1 flow cell chemistry used in this investigation. It should be noted that further increases in accuracy have also been suggested in recent flow cell chemistry, such as the release this year of the R10.4.1 flow cell. Bioinformatic tools that include the variant caller "Clair," used here, have also shown increased confidence in variant calls but are thought to be reaching their limit, with greater confidence requiring significant alternative algorithms or improvements in chemistry (Luo et al., 2020). Despite the bioinformatic improvements in base calling and variant calling, we observe that the accuracy of long-read data on long-range and standard PCR products still causes far greater false positives than WES short-read data. After filtering ONT variants by read depth and quality scores, our anchored approach filtered an additional 50% of the remaining variants on average. If false variants that were missed before our anchored filtering approach were included in the phasing process it is likely some targets would be phased incorrectly or not phased at all. Many phasing tools such as "whatshap" carry out phasing with the understanding that variants within the vcf file are correct, so the removal of false variants is important.
Our study provides an approach for accurately phasing and parent-of-origin calling DNMs in a set of 77 patients. To our knowledge, this is the first time that phasing of DNMs has been investigated on this scale using long and standard range PCR targeted ONT sequencing, where each sample has a uniquely specific target.
We optimized the method for efficiency and streamlined the laboratory and computational pipelines for processing large numbers of DNMs for detailed phasing analysis. We incorporate additional short-read sequencing patient-parent trio data and Sanger-validated DNMs that are commonly available from DNM discovery pipelines like ours. This approach enabled us to improve DNM phasing and postzygotic calling. This data-supported and anchored phasing approach can be of great use in both research and diagnostic settings where DNMs are routinely studied and interpreted.

ACKNOWLEDGMENTS
We appreciate the participation of all patients and their parents in this study. We thank Arron Scott and Bryan Hepworth (Newcastle University) for technical support, the genomics core facility for whole exome sequencing (WES), the Bioinformatic support unit (BSU) for initial WES processing. This project was funded by an Investigator Award in Science from the Wellcome Trust (209451) and a Royal Society and Wolfson Foundation award (WM160091) to J. A. V.