STRC Pseudogene Problem
The Problem in One Sentence
STRCP1 (STRC pseudogene 1) is 99.6% identical to STRC at the nucleotide level. Standard variant pathogenicity tools cannot reliably distinguish between them, causing systematic failures in STRC variant classification.
What is STRCP1?
| Property | STRC | STRCP1 |
|---|---|---|
| Location | chr15q15.3 | chr15q15.3 (adjacent) |
| Nucleotide identity | reference | ~99.6% identical to STRC |
| Protein product | Stereocilin (functional) | None (pseudogene — premature stops) |
| Expression | Outer hair cells | Not expressed |
| Clinical significance | Disease-causing | Not relevant |
STRCP1 arose from a duplication of STRC that later became non-functional (it gained frameshift mutations and premature stop codons) but retains near-identical sequence. At 99.6% identity, the two loci differ on average by only one nucleotide every ~250 bp.
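The spacing figure is simple arithmetic, and the same arithmetic hints at why short reads are hard to assign to one locus. The sketch below assumes differences are spread uniformly and independently along the sequence (a simplification; real paralog differences tend to cluster). The 150 bp read length is an illustrative assumption, not a figure from this note.

```python
# Back-of-envelope arithmetic for 99.6% nucleotide identity.
# Assumption: mismatches are uniform and independent (real ones cluster).
IDENTITY = 0.996
READ_LENGTH_BP = 150  # hypothetical short-read length, for illustration only

mismatch_rate = 1.0 - IDENTITY
mean_spacing_bp = 1.0 / mismatch_rate
expected_diffs_per_read = mismatch_rate * READ_LENGTH_BP
# Probability that a read covers zero distinguishing positions
p_ambiguous_read = IDENTITY ** READ_LENGTH_BP

print(f"mean spacing: {mean_spacing_bp:.0f} bp")
print(f"expected differences per read: {expected_diffs_per_read:.2f}")
print(f"P(read has no distinguishing base): {p_ambiguous_read:.2f}")
```

Under these assumptions roughly half of such reads would carry no base that tells STRC apart from STRCP1.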
Why This Breaks Standard Tools
SIFT
How SIFT works: compares the mutated amino acid against amino acids found at the same position in aligned homologous sequences across species and paralogs.
What goes wrong: STRCP1 sequence is included in alignment databases as a “paralog” of STRC. Because a pseudogene is no longer under selective constraint, the changes it has accumulated were never filtered by function. The alignment therefore shows apparent “tolerance” for variation at STRC positions, artificially inflating the number of observed amino acids and making real pathogenic variants look benign.
Result for E1659A: SIFT incorrectly scores as “Tolerated” because the pseudogene sequence provides false variation evidence.
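To make the mechanism concrete, here is a toy illustration of how a single unconstrained sequence changes what an alignment column looks like. This is not SIFT's actual scoring; it only measures column variability with Shannon entropy. The residues in the column are invented for illustration and are not real STRC or STRCP1 data.

```python
import math
from collections import Counter

def column_entropy(residues: str) -> float:
    """Shannon entropy (bits) of the amino acids in one alignment column."""
    counts = Counter(residues)
    total = len(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical column: eight orthologs that all carry glutamate (E).
clean_column = "EEEEEEEE"
# Same column after adding one pseudogene-derived translation carrying A.
contaminated_column = clean_column + "A"

print(column_entropy(clean_column))         # zero entropy: fully conserved
print(column_entropy(contaminated_column))  # positive: position looks variable
print(sorted(set(contaminated_column)))     # residues a tool would now "observe"
```

A conservation-based predictor that sees the second column has evidence, however spurious, that alanine is tolerated at this position.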
PolyPhen-2
How PolyPhen-2 works: uses multiple sequence alignment + structural information to assess conservation.
What goes wrong: Same as SIFT — pseudogene sequences contaminate the alignment, making conserved positions appear variable. PolyPhen-2 structural features partially compensate but the alignment signal dominates.
Result for E1659A: Unreliable score; may predict benign despite true conservation.
CADD (Combined Annotation-Dependent Depletion)
How CADD works: a machine-learning model trained on >1000 genomic features, which include sequence conservation features.
What goes wrong: Conservation features are computed from read depth and sequence alignment. At STRC, read depth confusion (reads mapping to both STRC and STRCP1) and alignment artifacts reduce apparent conservation signal.
Result: Underestimates pathogenicity for STRC variants.
REVEL (as partial backup)
How REVEL works: ensemble of multiple tools (SIFT, PolyPhen-2, MutationTaster, FATHMM, etc.).
Why partially better: averaging across tools smooths some of the pseudogene noise. Some tools in the ensemble use orthogonal features.
REVEL score for E1659A: 0.65 (mildly elevated; just above the 0.644 threshold for supporting evidence per Pejaver 2022). This is borderline: not strongly pathogenic, not benign.
Limitation: REVEL still incorporates SIFT and PolyPhen-2 scores, so it inherits some of the pseudogene contamination. The score is usable as corroboration but is not definitive evidence of pathogenicity.
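For reference, a REVEL score can be mapped to a PP3 evidence strength with a small lookup. The lower bounds below are the pathogenic-direction REVEL cut-offs as commonly quoted from Pejaver et al. 2022; treat them as an assumption and check them against the paper before any clinical use. The function name is hypothetical.

```python
# REVEL lower bounds for PP3 strength, as commonly quoted from
# Pejaver et al. 2022. Assumption: verify against the paper before use.
PP3_THRESHOLDS = [
    (0.932, "strong"),
    (0.773, "moderate"),
    (0.644, "supporting"),
]

def revel_pp3_strength(score: float) -> str:
    """Return the PP3 strength implied by a REVEL score, or 'none'."""
    for lower_bound, strength in PP3_THRESHOLDS:
        if score >= lower_bound:
            return strength
    return "none"

print(revel_pp3_strength(0.65))  # E1659A score quoted in this note
```

A result of "none" only means PP3 does not apply; it says nothing about benign-direction evidence, which has its own separate cut-offs.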
Why AlphaMissense Works
AlphaMissense (DeepMind, 2023) uses protein structure, not DNA sequence alignment.
How it works:
- Takes the protein sequence (not the DNA)
- Runs AlphaFold to predict protein structure
- Assesses how the amino acid change affects the predicted structure and evolutionary fitness of the protein fold
- Does NOT depend on genomic DNA alignments: its evolutionary signal comes from protein-sequence data, combined with the structural context learned by the AlphaFold architecture
Why pseudogene doesn’t matter:
- STRCP1 is not expressed → has no protein structure → cannot contaminate AlphaMissense training
- AlphaMissense was trained on protein variants in UniProt, which only contains expressed, functional proteins
- The DNA identity between STRC and STRCP1 is irrelevant at the protein structure level
Result for E1659A: AlphaMissense 0.9016 — Likely Pathogenic. No pseudogene noise.
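The "Likely Pathogenic" label follows from the class cut-offs published with the AlphaMissense release (below 0.34 likely benign, above 0.564 likely pathogenic, ambiguous in between). A minimal sketch of that mapping, with a hypothetical function name; confirm the cut-offs against the version of the score file actually used.

```python
# AlphaMissense class cut-offs as published with the 2023 release.
# Assumption: confirm against the score file version in use.
LIKELY_BENIGN_MAX = 0.34
LIKELY_PATHOGENIC_MIN = 0.564

def alphamissense_class(score: float) -> str:
    """Map an AlphaMissense pathogenicity score to its three-way class."""
    if score < LIKELY_BENIGN_MAX:
        return "likely_benign"
    if score > LIKELY_PATHOGENIC_MIN:
        return "likely_pathogenic"
    return "ambiguous"

print(alphamissense_class(0.9016))  # E1659A score quoted in this note
```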
The Diagnostic Failure Chain
For Egor’s son Misha (and for many STRC patients):
WES sequencing → finds c.4976A>C variant →
Clinical lab runs SIFT + PolyPhen-2 + CADD →
Pseudogene contaminates alignment scores →
Tools underpredict pathogenicity →
Lab classifies as VUS →
VUS = no trial access, no disability recognition →
Correct diagnosis buried under bioinformatics artifact
This is a systematic failure affecting thousands of STRC patients worldwide. HK Children’s Hospital classified Misha’s variant as VUS in December 2022. The reclassification required stepping outside the standard toolset.
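This note refers to the same variant both as c.4976A>C and as E1659A. The sketch below checks that the two names are consistent, assuming standard coding-DNA numbering (position 1 is the A of the start codon). The reference codon itself is not given in this note, so both glutamate codons are tried; the function names are hypothetical.

```python
# Minimal codon table: only the codons needed for this example.
CODON_TO_AA = {"GAA": "E", "GAG": "E", "GCA": "A", "GCG": "A"}

def locate_cdna_position(cdna_pos: int) -> tuple[int, int]:
    """Map a 1-based coding position to (codon number, position in codon)."""
    codon_number = (cdna_pos - 1) // 3 + 1
    position_in_codon = (cdna_pos - 1) % 3 + 1
    return codon_number, position_in_codon

def apply_substitution(codon: str, position_in_codon: int, alt: str) -> str:
    """Return the codon with one base replaced (1-based position)."""
    i = position_in_codon - 1
    return codon[:i] + alt + codon[i + 1:]

codon_number, pos_in_codon = locate_cdna_position(4976)
print(codon_number, pos_in_codon)  # 1659 2

# The reference codon is not stated in this note, so try both Glu codons.
for ref_codon in ("GAA", "GAG"):
    alt_codon = apply_substitution(ref_codon, pos_in_codon, "C")
    print(ref_codon, "->", alt_codon,
          CODON_TO_AA[ref_codon], "->", CODON_TO_AA[alt_codon])
```

Position 4976 is the second base of codon 1659, and an A>C change at that base turns either glutamate codon into an alanine codon, which matches E1659A.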
The Fix
- AlphaMissense: bypass pseudogene problem entirely (protein structure, not DNA alignment)
- Manual conservation: align STRC sequences from UniProt (not genome alignment databases), 9+ mammalian species, confirm 100% conservation at position 1659
- ACMG in trans evidence: PM3 criterion (in trans with confirmed pathogenic allele) is immune to pseudogene contamination
- REVEL as corroborating: score of 0.65 still provides mild supporting evidence despite limitations
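The manual conservation step ends with a simple per-column check once the protein sequences have been retrieved and aligned with an external aligner. A minimal sketch of that final check follows. The species fragments are hypothetical placeholders, not real STRC sequence, and the function name is invented for illustration.

```python
def fraction_matching_reference(aligned: dict[str, str],
                                reference: str,
                                column: int) -> float:
    """Fraction of non-reference sequences whose residue at `column`
    (0-based alignment index) equals the reference residue.
    Gap characters count as mismatches."""
    ref_residue = aligned[reference][column]
    others = [seq for name, seq in aligned.items() if name != reference]
    matches = sum(1 for seq in others if seq[column] == ref_residue)
    return matches / len(others)

# Hypothetical pre-aligned protein fragments of equal length. NOT real data.
toy_alignment = {
    "human": "LKDEAV",
    "mouse": "LKDEAV",
    "dog":   "LRDEAV",
    "cow":   "LKDEAI",
}

print(fraction_matching_reference(toy_alignment, "human", 3))  # the E column
print(fraction_matching_reference(toy_alignment, "human", 1))  # a variable column
```

A value of 1.0 at the column of interest corresponds to the "100% conservation" criterion above. Because the input is a curated set of orthologous proteins, no pseudogene-derived sequence enters the comparison.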
Why This Matters Broadly
STRC is not the only gene with this problem. Any gene with a nearby pseudogene (PARKIN, PMS2, SBDS, CYP21A2, SMN1) will have similar bioinformatics failures. The lesson: always check for pseudogenes before trusting SIFT/PolyPhen-2 scores on a variant.
For STRC specifically: the ~99.6% identity means many STRC variants are systematically misclassified as VUS. This has downstream consequences for hundreds of families globally.
Connections
- STRC E1659A Conservation and Reclassification: why standard tools failed; why AlphaMissense was used
- STRC Hearing Loss: why Misha’s variant was initially VUS
- STRC Research Methodology: pseudogene awareness was step 1 of the methodology
- STRC Electrostatic Analysis E1659A: structural approach also avoids pseudogene problem
- Misha (about)
- Paralog Off-Target Rule: the therapy-design generalisation of this same paralog problem