Vast amounts of genome sequencing data generated by large-scale research studies such as HostSeq provide an opportunity to summarise the spectrum of pathogenic variation in a subset of the Canadian population. Sharing variant-level data with public databases such as ClinVar is crucial for advancing our understanding of genomic variants related to Mendelian diseases. However, database entries are often incomplete or contain discrepancies that make variant interpretations and classifications less useful.
The GENCOV and HostSeq cohorts were annotated and summarised using custom workflows to identify variants with a pathogenic or likely pathogenic (P/LP) classification in ClinVar. These variants were further filtered using custom gene panels, gnomAD frequency, variant type, gene-disease relationship and the number of reputable ClinVar laboratory submissions. Variants manually assessed in the GENCOV study were then compared with their ClinVar classifications to identify discrepancies.
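The filtering step described above can be sketched roughly as follows. This is a minimal illustration only, not the study's workflow: the field names, the placeholder gene symbols, the allele-frequency cut-off, the accepted variant types and the minimum number of submissions are all assumptions introduced here, since the abstract does not state them.

```python
from dataclasses import dataclass

# ClinVar classification labels treated as P/LP in this sketch.
PLP_LABELS = {"Pathogenic", "Likely pathogenic", "Pathogenic/Likely pathogenic"}


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record; all field names are assumptions."""
    variant_id: str
    gene: str
    clinvar_class: str
    gnomad_af: float                 # gnomAD allele frequency
    consequence: str                 # e.g. "truncating" or "missense"
    gene_disease_established: bool   # curated gene-disease relationship
    reputable_submissions: int       # ClinVar submissions from reputable laboratories


def passes_filters(v, gene_panel, max_af=0.01,
                   consequences=("truncating", "missense"),
                   min_submissions=1):
    """Return True if a variant passes every filter.

    The default thresholds are placeholders, not values from the study.
    """
    return (
        v.clinvar_class in PLP_LABELS
        and v.gene in gene_panel
        and v.gnomad_af <= max_af
        and v.consequence in consequences
        and v.gene_disease_established
        and v.reputable_submissions >= min_submissions
    )


def filter_plp(variants, gene_panel, **thresholds):
    """Keep only the variants that pass all filters, preserving input order."""
    return [v for v in variants if passes_filters(v, gene_panel, **thresholds)]


if __name__ == "__main__":
    panel = {"GENE_A"}  # placeholder gene panel
    cohort = [
        Variant("v1", "GENE_A", "Pathogenic", 0.0001, "truncating", True, 2),
        Variant("v2", "GENE_A", "Likely pathogenic", 0.05, "missense", True, 2),
        Variant("v3", "GENE_B", "Pathogenic", 0.0001, "missense", True, 2),
    ]
    for v in filter_plp(cohort, panel):
        print(v.variant_id)
```

Expressing each criterion as one term of a single conjunction keeps the filters independent, so a threshold can be tightened or relaxed without touching the others.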
A total of 1956 unique variants were manually classified as P/LP through the GENCOV study. Of these, 65% (1276) were also identified in ClinVar, and among those, 69% (889) had concordant P/LP classifications. Manual assessment of unique exonic variants in GENCOV yielded more putative P/LP variants (1595), including truncating and missense variants, than the 127 unique exonic P/LP variants identified in HostSeq.
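As a quick check of the arithmetic, the reported counts imply that 680 of the manually classified P/LP variants were not found in ClinVar and a further 387 were found without a concordant P/LP classification, together 1067 variants, or a little over half of the 1956. The short sketch below recomputes these figures; it uses only the counts given above, and the variable names are introduced here for illustration.

```python
# Counts reported for the GENCOV manual assessment.
manual_plp = 1956   # unique variants manually classified as P/LP
in_clinvar = 1276   # of those, also identified in ClinVar
concordant = 889    # of those in ClinVar, concordant P/LP classification

absent = manual_plp - in_clinvar                     # 680 not found in ClinVar
non_concordant = in_clinvar - concordant             # 387 without a concordant P/LP call
absent_or_non_concordant = absent + non_concordant   # 1067 in total


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"Found in ClinVar: {pct(in_clinvar, manual_plp):.1f}% of {manual_plp}")
    print(f"Concordant P/LP: {pct(concordant, in_clinvar):.1f}% of {in_clinvar}")
    print(f"Absent or not concordant: "
          f"{pct(absent_or_non_concordant, manual_plp):.1f}% of {manual_plp}")
```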
These results highlight that a large proportion of P/LP variation is either absent from ClinVar or has conflicting evidence for pathogenicity there, emphasising the importance of periodic reassessment, discrepancy resolution and updates to ensure the completeness and accuracy of public databases.
