STRC Pseudogene Problem

The Problem in One Sentence

STRCP1 (STRC pseudogene 1) is 99.6% identical to STRC at the nucleotide level. Standard variant pathogenicity tools cannot reliably distinguish between them, causing systematic failures in STRC variant classification.

What is STRCP1?

Property | STRC | STRCP1
Location | chr15q15.3 | chr15q15.3 (adjacent)
Nucleotide identity | reference | ~99.6% identical to STRC
Protein product | Stereocilin (functional) | None (pseudogene, premature stops)
Expression | Outer hair cells | Not expressed
Clinical significance | Disease-causing | Not relevant

STRCP1 arose as a duplication of STRC that became non-functional (it gained frameshift mutations and premature stops) but retains near-identical sequence. At 99.6% identity, the two sequences differ at roughly one nucleotide every ~250 bp.
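The ~250 bp figure follows directly from the identity percentage. The sketch below works through that arithmetic and, as an illustration only, estimates how often a short sequencing read would contain no distinguishing base at all. The 150 bp read length and the assumption that differences are spread uniformly and independently are mine, not from the source; real differences may be clustered.

```python
def expected_spacing(identity: float) -> float:
    """Average number of bases per difference for a fractional identity."""
    return 1.0 / (1.0 - identity)


def expected_differences(identity: float, length: int) -> float:
    """Expected number of differing bases in a window of the given length."""
    return (1.0 - identity) * length


def prob_no_difference(identity: float, length: int) -> float:
    """Chance that a window contains no differing base at all.

    Assumes each base differs independently with the same probability,
    which is a simplification.
    """
    return identity ** length


IDENTITY = 0.996   # from the text
READ_LENGTH = 150  # assumed short-read length, not from the source

spacing = expected_spacing(IDENTITY)
diffs_per_read = expected_differences(IDENTITY, READ_LENGTH)
p_identical_read = prob_no_difference(IDENTITY, READ_LENGTH)

print(f"one difference every ~{spacing:.0f} bp")
print(f"expected differences per {READ_LENGTH} bp read: {diffs_per_read:.2f}")
print(f"fraction of reads with no distinguishing base: {p_identical_read:.2f}")
```

Under these simplifying assumptions, a little over half of such reads would be identical between the two loci, which is why a read alone often cannot say which copy it came from.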

Why This Breaks Standard Tools

SIFT

How SIFT works: compares the mutated amino acid against amino acids found at the same position in aligned homologous sequences across species and paralogs.

What goes wrong: STRCP1 sequence is included in alignment databases as a “paralog” of STRC. Because a pseudogene is no longer under selective constraint, the substitutions it has accumulated enter the alignment as apparent “tolerance” for variation at STRC positions. This artificially inflates the number of observed amino acids and makes real pathogenic variants look benign.

Result for E1659A: SIFT incorrectly scores as “Tolerated” because the pseudogene sequence provides false variation evidence.
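The contamination effect can be illustrated with a toy alignment column. This is not SIFT's actual scoring algorithm. It is a minimal sketch that uses Shannon entropy as a stand-in for "how variable does this position look". The residues are hypothetical: nine orthologs that all carry glutamate, plus one pseudogene-derived entry assumed to carry alanine.

```python
import math
from collections import Counter


def column_entropy(residues):
    """Shannon entropy (bits) of one alignment column; 0.0 means fully conserved."""
    counts = Counter(residues)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical column: nine true orthologs, all glutamate (E).
orthologs = ["E"] * 9

# Same column after a pseudogene-derived sequence is added (assumed to carry A).
contaminated = orthologs + ["A"]

clean_entropy = column_entropy(orthologs)
contaminated_entropy = column_entropy(contaminated)

print(f"clean column:        {sorted(set(orthologs))}, entropy {clean_entropy:.3f} bits")
print(f"contaminated column: {sorted(set(contaminated))}, entropy {contaminated_entropy:.3f} bits")
```

A single extra sequence is enough to turn an invariant column into one that appears to accept a second amino acid, which is the pattern an alignment-based predictor reads as tolerance.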

PolyPhen-2

How PolyPhen-2 works: uses multiple sequence alignment + structural information to assess conservation.

What goes wrong: Same as SIFT — pseudogene sequences contaminate the alignment, making conserved positions appear variable. PolyPhen-2 structural features partially compensate but the alignment signal dominates.

Result for E1659A: Unreliable score; may predict benign despite true conservation.

CADD (Combined Annotation-Dependent Depletion)

How CADD works: a machine-learning model trained on >1000 genomic features, which include sequence conservation features.

What goes wrong: Conservation features depend on sequence alignment and, indirectly, on read-depth-based data. At STRC, read depth is confounded because reads map to both STRC and STRCP1, and alignment artifacts reduce the apparent conservation signal.

Result: Underestimates pathogenicity for STRC variants.
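The read-mapping ambiguity behind this can be shown with a toy example. The two 20-base sequences below are invented for illustration and differ at a single position. A read is treated as ambiguous when it covers no position that distinguishes the two copies. Real aligners express this through mapping quality, which the sketch does not model.

```python
def distinguishing_positions(ref_a: str, ref_b: str):
    """0-based positions where two equal-length sequences differ."""
    if len(ref_a) != len(ref_b):
        raise ValueError("sequences must have equal length for this toy comparison")
    return [i for i, (a, b) in enumerate(zip(ref_a, ref_b)) if a != b]


def read_is_ambiguous(start: int, length: int, diff_positions) -> bool:
    """True if the read window [start, start + length) covers no distinguishing base."""
    end = start + length
    return not any(start <= pos < end for pos in diff_positions)


# Hypothetical toy sequences, not real STRC / STRCP1 sequence.
gene = "ACGTACGTACGTACGTACGT"
pseudogene = "ACGTACGTACGTACGAACGT"

diffs = distinguishing_positions(gene, pseudogene)
first_read_ambiguous = read_is_ambiguous(0, 10, diffs)
second_read_ambiguous = read_is_ambiguous(10, 10, diffs)

print("distinguishing positions:", diffs)
print("read covering 0-9 ambiguous:", first_read_ambiguous)
print("read covering 10-19 ambiguous:", second_read_ambiguous)
```

Only the second read overlaps the differing base, so only that read can be assigned to one copy with any confidence.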

REVEL (as partial backup)

How REVEL works: ensemble of multiple tools (SIFT, PolyPhen-2, MutationTaster, FATHMM, etc.).

Why partially better: averaging across tools smooths some of the pseudogene noise. Some tools in the ensemble use orthogonal features.

REVEL score for E1659A: 0.65 (mildly elevated; 0.644 is the threshold for supporting-level evidence per Pejaver 2022, and moderate-level evidence requires a higher score). This is borderline: not strongly pathogenic, not benign.

Limitation: REVEL still incorporates SIFT and PolyPhen-2 scores, so it inherits pseudogene contamination. The score is informative but not definitive.

Why AlphaMissense Works

AlphaMissense (DeepMind, 2023) uses protein structure, not DNA sequence alignment.

How it works:

  1. Takes the protein sequence (not the DNA) as input
  2. Builds on the AlphaFold network, which encodes the structural context of each residue
  3. Assesses how the amino acid change fits the predicted structural context and evolutionary fitness of the protein fold
  4. Draws its evolutionary signal from protein-sequence alignments fed to the AlphaFold-derived network, not from nucleotide-level genome alignments

Why pseudogene doesn’t matter:

  • STRCP1 is not expressed → has no protein structure → cannot contaminate AlphaMissense training
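  • AlphaMissense was fine-tuned on missense variants in human protein sequences, labeled by population frequency rather than by clinical annotation; an unexpressed pseudogene should contribute no protein sequence to that set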
  • The DNA identity between STRC and STRCP1 is irrelevant at the protein structure level

Result for E1659A: AlphaMissense 0.9016 — Likely Pathogenic. No pseudogene noise.
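To make the class label concrete, the sketch below maps an AlphaMissense score to a category. The cutoffs 0.34 and 0.564 are the ones I recall from the AlphaMissense release and are an assumption here, since the source does not state them. The function name and the category strings are hypothetical.

```python
def alphamissense_class(score: float,
                        benign_cutoff: float = 0.34,
                        pathogenic_cutoff: float = 0.564) -> str:
    """Map a 0-1 AlphaMissense score to a coarse class.

    Cutoffs are assumed defaults; confirm them against the release in use.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < benign_cutoff:
        return "likely_benign"
    if score > pathogenic_cutoff:
        return "likely_pathogenic"
    return "ambiguous"


e1659a_class = alphamissense_class(0.9016)  # score quoted in the text
print("E1659A:", e1659a_class)
```

With these cutoffs, 0.9016 sits well inside the upper class rather than near a boundary, in contrast to the borderline REVEL score.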

The Diagnostic Failure Chain

For Egor’s son Misha (and for many STRC patients):

WES → finds c.4976A>C variant →
Clinical lab runs SIFT + PolyPhen-2 + CADD →
Pseudogene contaminates alignment scores →
Tools underpredict pathogenicity →
Lab classifies as VUS →
VUS = no trial access, no disability recognition →
Correct diagnosis buried under bioinformatics artifact

This is a systematic failure affecting thousands of STRC patients worldwide. HK Children’s Hospital classified Misha’s variant as VUS in December 2022. The reclassification required stepping outside the standard toolset.
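The two names used for the variant in this note, c.4976A>C and E1659A, can be cross-checked with simple codon arithmetic. The sketch below is a sanity check, not a variant annotator. The source does not state the reference codon, so both glutamate codons are tried, and the codon table is deliberately limited to the four codons needed.

```python
# Minimal excerpt of the standard genetic code (only the codons used here).
CODONS = {"GAA": "E", "GAG": "E", "GCA": "A", "GCG": "A"}


def codon_of(c_pos: int):
    """Return (1-based codon number, 0-based offset within the codon) for a coding position."""
    return (c_pos - 1) // 3 + 1, (c_pos - 1) % 3


def substitute(codon: str, offset: int, ref: str, alt: str) -> str:
    """Apply a single-base substitution, checking that the reference base matches."""
    if codon[offset] != ref:
        raise ValueError(f"reference base {ref} not found at offset {offset} of {codon}")
    return codon[:offset] + alt + codon[offset + 1:]


codon_number, offset = codon_of(4976)

# Try both glutamate codons, since the actual reference codon is not given in the text.
outcomes = {
    codon: CODONS[substitute(codon, offset, "A", "C")]
    for codon in ("GAA", "GAG")
}

print("codon:", codon_number, "offset in codon:", offset)
print("resulting amino acid per reference codon:", outcomes)
```

Position 4976 is the middle base of codon 1659, and an A>C change there turns either glutamate codon into an alanine codon, so the two notations agree.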

The Fix

  1. AlphaMissense: bypass pseudogene problem entirely (protein structure, not DNA alignment)
  2. Manual conservation: align STRC sequences from UniProt (not genome alignment databases), 9+ mammalian species, confirm 100% conservation at position 1659
  3. ACMG in trans evidence: PM3 criterion (in trans with confirmed pathogenic allele) is immune to pseudogene contamination
  4. REVEL as corroborating: score of 0.65 still provides mild supporting evidence despite limitations
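The manual conservation check in step 2 can be sketched as follows. The main detail is that residue numbers refer to the ungapped human sequence, so the position must first be mapped to an alignment column. The three short aligned sequences are invented toy data. The real check at position 1659 across 9+ mammalian species is not reproduced here, and the function names are hypothetical.

```python
def alignment_column(aligned_ref: str, residue_number: int, gap: str = "-") -> int:
    """Map a 1-based residue number in the ungapped reference to a 0-based alignment column."""
    seen = 0
    for col, ch in enumerate(aligned_ref):
        if ch != gap:
            seen += 1
            if seen == residue_number:
                return col
    raise ValueError("residue number exceeds reference length")


def conservation_at(alignment: dict, ref_name: str, residue_number: int):
    """Return (reference residue, fraction of sequences matching it) at that position."""
    col = alignment_column(alignment[ref_name], residue_number)
    ref_residue = alignment[ref_name][col]
    matches = sum(1 for seq in alignment.values() if seq[col] == ref_residue)
    return ref_residue, matches / len(alignment)


# Hypothetical toy alignment (equal-length, gapped), not real stereocilin sequence.
toy_alignment = {
    "human": "MK-ELV",
    "mouse": "MKAELV",
    "dog":   "MR-ELI",
}

residue, fraction = conservation_at(toy_alignment, "human", 3)
print(f"residue {residue}: conserved in {fraction:.0%} of sequences")
```

In the toy data, human residue 3 falls in column 3 because of the gap, and all three sequences carry the same residue there. A fraction of 1.0 at the real position is what "100% conservation" in step 2 refers to.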

Why This Matters Broadly

STRC is not the only gene with this problem. Any gene with a nearby pseudogene or near-identical paralog (PARKIN, PMS2, SBDS, CYP21A2, SMN1) can show similar bioinformatics failures. The lesson: always check for pseudogenes before trusting SIFT/PolyPhen-2 scores on a variant.

For STRC specifically: the ~99.6% identity means many STRC variants are systematically misclassified as VUS. This has downstream consequences for hundreds of families globally.

Connections