74,000 variants and counting

If you sequence your genome, you'll find roughly 4 million single-nucleotide variants. Most are silent. Some are intergenic. A few thousand land in coding regions or known regulatory elements. Across the whole set, several tens of thousands have at least one peer-reviewed study attached. A few hundred have enough research behind them that a careful clinician would say "this matters for this person's care."

Most consumer genetic-health products show you about ten.

We sat with that gap for a while. It's not that the products are wrong about the ten. For most users, the most-studied variants (APOE, MTHFR, CYP2C19, BRCA1/2, a few diabetes hits) are the most clinically actionable. But "we'll tell you about ten" misrepresents what's actually in your file. The other ~74,000 variants we have research on aren't background noise. They include known eQTLs that shape how your tissues express specific proteins. They include pharmacogenomic markers, the kind cataloged in PharmGKB, that affect dose response for medications you might take in five years. They include rare-disease carrier signals from ClinVar that matter for family planning.

What is a SNP, and why does the count get so big

A single nucleotide polymorphism (SNP), explained in one line: it is a position in the genome where humans are known to differ by one DNA letter. The SNPs that matter here are the ones where that one-letter difference has been associated in the literature with a measurable trait, disease risk, or drug response. Each is given a stable rsID and tracked in dbSNP. Most of the ones we surface have at least one entry in the NHGRI-EBI GWAS Catalog or a curated clinical database, plus one or more papers indexed in PubMed.

When you ask how SNPs affect health, the honest answer is: it depends on the variant, the genotype (homozygous vs heterozygous changes the interpretation), the effect size in the original GWAS, and whether subsequent studies replicated the finding. The same rsID can be a confident clinical signal for one trait and a weak suggestion for another. Keeping evidence quality visible on every claim is the only way we know to present that without misleading people.
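To make those dependencies concrete, here is a minimal Python sketch of how one rsID can carry different evidence tiers for different traits. It is an illustration only, not our production schema: the `Association` fields, the tier names, the thresholds, and the placeholder rsID `rs0000001` are all assumptions made for this example, and none of the numbers are clinical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One variant-trait association (hypothetical schema for illustration)."""
    rsid: str           # dbSNP identifier (placeholder value below)
    trait: str
    risk_allele: str
    odds_ratio: float   # effect size reported by the original study
    replications: int   # independent studies that replicated the finding


def zygosity(genotype: str, risk_allele: str) -> str:
    """Classify a two-letter genotype by how many risk alleles it carries."""
    copies = genotype.count(risk_allele)
    return {0: "non-carrier", 1: "heterozygous", 2: "homozygous"}[copies]


def evidence_tier(assoc: Association) -> str:
    """Assign an illustrative tier; the thresholds are assumptions, not clinical cutoffs."""
    effect = abs(assoc.odds_ratio - 1.0)
    if assoc.replications >= 2 and effect >= 0.2:
        return "replicated"
    if assoc.replications >= 1:
        return "suggestive"
    return "preliminary"


# The same placeholder rsID, two traits, two very different evidence levels.
strong = Association("rs0000001", "trait A", "A", odds_ratio=1.35, replications=3)
weak = Association("rs0000001", "trait B", "A", odds_ratio=1.05, replications=0)

for assoc in (strong, weak):
    print(assoc.trait, zygosity("AG", assoc.risk_allele), evidence_tier(assoc))
```

The point of the sketch is that the tier belongs to the variant-trait pair, not to the rsID, so the label can travel with every sentence that cites it.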

The engineering problem

The reason most products don't surface the long tail is engineering. Each variant needs:

Doing this by hand at the per-variant level isn't scalable. There aren't enough clinically-literate writers in the world to author 74,000 variant pages by Monday. Doing it with LLMs naively gets you the failure mode the previous post talked about: confident-sounding pages with citations that don't hold up under inspection.

So we built a pipeline that splits the problem. The encyclopedia layer, what the page says about each variant, runs through Anthropic's Sonnet model with a strict prompt that forces every claim to anchor on a study from the variant's article corpus, with explicit hedging rules, em-dash bans (long story), and a refusal to substitute paralog gene names. The recommendations layer, what the variant suggests you might do about it, runs through Haiku with an even stricter "refuse to invent" rule and a four-kind citation requirement.

Both layers are RAG-strict: the model can't make claims that aren't sourceable to the curated evidence we feed it. We've spent a lot of engineering on validators that catch the failure modes: the gene-name substitutions that smaller models drift toward, the Mendelian-vs-common-variant conflations that careless prompting produces, the em-dash and curly-quote artifacts that signal "this is LLM output" to a careful reader.
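A rough idea of what such a validator can look like, as a minimal Python sketch. This is not our actual validator code: the function name `validate_page`, the `PMID: <digits>` citation format, the paralog list passed in by the caller, and the placeholder PMIDs are all assumptions for illustration.

```python
import re

# Assumed inline citation format: "PMID: 12345678".
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")

# Typographic artifacts, written as escapes so this file stays clean itself.
BANNED_CHARS = [
    ("em-dash", "\u2014"),
    ("curly quote", "\u2018\u2019\u201c\u201d"),
]


def validate_page(text, corpus_pmids, paralogs):
    """Return a list of problems found in a generated variant page."""
    problems = []

    # 1. Every cited PMID must come from the variant's curated corpus.
    cited = set(PMID_PATTERN.findall(text))
    if not cited:
        problems.append("no PMID cited")
    for pmid in sorted(cited - set(corpus_pmids)):
        problems.append(f"PMID {pmid} not in corpus")

    # 2. No typographic artifacts.
    for name, chars in BANNED_CHARS:
        if any(ch in text for ch in chars):
            problems.append(f"contains {name}")

    # 3. No paralog gene names standing in for the target gene.
    for gene in paralogs:
        if re.search(rf"\b{re.escape(gene)}\b", text):
            problems.append(f"mentions paralog {gene}")

    return problems


corpus = {"00000001"}  # placeholder PMID, not a real citation
clean = 'CYP2C19 poor metabolizers may respond differently (PMID: 00000001).'
drifted = 'CYP2C9 carriers respond poorly \u2014 see PMID: 99999999.'

print(validate_page(clean, corpus, paralogs=["CYP2C9"]))
print(validate_page(drifted, corpus, paralogs=["CYP2C9"]))
```

A check like this only proves that a citation points into the curated corpus. Whether the cited study supports the sentence is a separate question that still needs a review pass.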

This is what we mean by evidence-based genomics. Not "an LLM read some papers and wrote a confident summary." Anchored to a PMID, cited inline, with the hedging the underlying study actually warrants. We don't prescribe, we describe.

Where we are today

About 1,300 variants have full recommendation sets that have passed both LLM generation and an explicit "Sonnet review of Haiku's output" gate. About 50,000 have full encyclopedia entries. The remaining 24,000 are either waiting for the curated catalogue rows to mature, or have been routed to "no actionable recommendations" because the evidence genuinely doesn't support a concrete change.

The corpus skews toward the genes you would expect on a top-of-funnel reading list: APOE for cognitive aging, MTHFR for folate metabolism, CYP2C19 and SLCO1B1 for pharmacogenomics, HFE for iron, LCT for lactase persistence, FTO and TCF7L2 for metabolic risk. These are over-represented because they are over-studied, and they are over-studied because they have replicated effect sizes large enough to matter. But the tail is where the personalization lives. A single rare variant a user carries can change a single recommendation in a way the top-ten products will never see.

The throughput we're seeing today is ~1,500 variants per day of high-quality generation across both layers. At that rate the corpus closes the long tail in roughly six weeks. The economics work because the corpus is reusable: every variant we generate is a permanent asset across every user who carries it. The marginal cost per user approaches zero once we hit the tail.

What we'd do differently if we started over

Two things we got right early: anchoring on PMIDs from the start (not retrofitted), and refusing to ship recommendations whose citations couldn't be re-verified. Those two decisions are what make the corpus citation-defensible to any researcher or clinician who reads it.

Two things we'd do differently. We'd start the structured-evidence corpus (GWAS catalog + ClinVar + PharmGKB) before the article corpus, not in parallel. The structured layer is the honest anchor for high-magnitude variants whose cited papers don't name them in prose. We figured that out four months in; doing it from the start would have saved about three weeks of generator iteration. The other thing: we'd have started writing the public catalogue earlier. Every variant page that goes up is a permanent inbound for the relevant researchers, clinicians, and (increasingly) AI assistants that cite our pages back to their users. Building corpus value compounds; building a moat doesn't.

We'll write more about the action plan v2 consolidation engine in a follow-up. The short version: if you have full recommendation sets for the variants a user carries, the per-user action plan becomes a JOIN and a small contradiction-resolution pass. No per-user LLM call, sub-second response, costs nothing at the margin. Your genome stays yours, your interpretation gets richer every week the corpus grows, and the corpus is the product. The personalization is the cherry on top.
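As a preview of the "JOIN plus contradiction-resolution" idea, here is a minimal Python sketch using an in-memory SQLite database. It is not the v2 engine: the table names, the `topic` / `direction` / `strength` columns, the placeholder rsIDs, and the rule "strongest recommendation wins, equal-strength opposites are flagged" are all assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_variants (user_id TEXT, rsid TEXT);
    CREATE TABLE recommendations (
        rsid TEXT, topic TEXT, direction TEXT, strength INTEGER
    );
""")
# Placeholder rows: rsIDs, topics, and strengths are invented for illustration.
conn.executemany("INSERT INTO recommendations VALUES (?, ?, ?, ?)", [
    ("rs_a", "topic_x", "increase", 3),
    ("rs_b", "topic_x", "decrease", 1),
    ("rs_c", "topic_y", "avoid", 2),
    ("rs_d", "topic_z", "increase", 2),
    ("rs_e", "topic_x", "decrease", 3),
])
conn.executemany("INSERT INTO user_variants VALUES (?, ?)", [
    ("u1", "rs_a"), ("u1", "rs_b"), ("u1", "rs_c"),
    ("u2", "rs_a"), ("u2", "rs_e"),
])


def action_plan(conn, user_id):
    """Join a user's variants to precomputed recommendations, then resolve conflicts."""
    rows = conn.execute(
        """
        SELECT r.topic, r.direction, r.strength, r.rsid
        FROM user_variants AS u
        JOIN recommendations AS r ON r.rsid = u.rsid
        WHERE u.user_id = ?
        ORDER BY r.topic, r.strength DESC, r.rsid
        """,
        (user_id,),
    ).fetchall()

    plan = {}
    for topic, direction, strength, rsid in rows:
        entry = plan.get(topic)
        if entry is None:
            # Strongest row for this topic arrives first because of ORDER BY.
            plan[topic] = {"direction": direction, "strength": strength,
                           "sources": [rsid]}
        elif entry["direction"] == direction:
            entry["sources"].append(rsid)
        elif entry["strength"] == strength:
            # Equal-strength disagreement: flag it instead of picking a side.
            entry["direction"] = "conflicting"
        # Otherwise a weaker opposing recommendation is dropped.
    return plan


print(action_plan(conn, "u1"))
print(action_plan(conn, "u2"))
```

Nothing in this path calls a model: the expensive work happened once, when the recommendation rows were generated and reviewed.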

