Why your DNA is the most sensitive data you own

Most of the data privacy debate we're used to having is about recoverable data. Your email address can be changed. Your phone number can be changed. Even your social security number, though painful, can be reissued after identity theft. Your credit card gets a new number after a breach. Your password gets rotated.

Your genome is in a different category. You have one. You can't rotate it. If it leaks, it leaks for life. And, increasingly, it leaks for the lives of your blood relatives, who share enough of it that a leak of your genome is partially a leak of theirs.

This essay is about why that category difference matters more than people usually think, and what it implies for how you should think about storing and sharing genetic data. If you have ever wondered who owns your genetic data, or whether 23andMe data is safe after the bankruptcy headlines, this is the framing we use when we think about it.

What's actually in a genome that makes it sensitive

A complete sequence of your DNA contains roughly 3 billion base pairs. The actionable information density isn't uniform; some regions are highly informative about identity, ancestry, traits, and disease risk, others are not. But the parts that are informative are uniquely so:

Identifiable. A genome is among the most identifying datasets there is. About sixty statistically independent SNPs are enough to uniquely identify an individual, even in a very large population (Lin et al. 2004, and many follow-ups since). A typical consumer test reports hundreds of thousands of SNPs. The notion that genetic data can be meaningfully "de-identified" by stripping a name and date of birth is, in the modern threat model, a fiction. Re-identification risk in genetic data is one of the few areas where "the math gets worse every year" is a literal description, not a metaphor.

Inheritable. Your genome is half-shared with each of your parents and with each of your biological children, on average half-shared with full siblings, and on average a quarter-shared with grandparents. When you upload your genome to a service, you upload partial copies of your relatives' genomes too, without their consent. They never agreed to whatever data sharing the service does.

Permanent. A leaked credit card gets canceled in an hour. A leaked password gets rotated in a minute. A leaked genome stays leaked. Every advance in re-identification, in genetic-trait inference, and in genomic forensics applies retroactively to leaked data.

Predictive about your future self. Your DNA contains information about your risk for diseases you don't have yet. Your APOE status reveals something about what your future self is at risk of. Your BRCA status bears on what your future self might choose to do prophylactically. A leak today potentially reveals information about your health twenty years from now.

Predictive about your relatives' future selves. Genetic information cascades. A genome containing a BRCA pathogenic variant tells you something about your siblings' likelihood of carrying the same. A genome with Lynch syndrome markers tells you something about your children's risk.

The point isn't that any one of these properties is uniquely scary. It's that they're all simultaneously true and they compound.
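To make the "Identifiable" point concrete, here is a rough back-of-the-envelope sketch of how quickly chance matches disappear as SNPs are added. It is not the calculation from the cited paper. It assumes the SNPs are statistically independent, genotypes follow Hardy-Weinberg proportions, and every SNP has the same minor-allele frequency; the function names and the population figure are illustrative.

```python
# Sketch: how many other people would match one profile purely by chance?
# Assumptions (ours, for illustration): independent SNPs, Hardy-Weinberg
# genotype proportions, and one shared minor-allele frequency `maf`.


def genotype_match_probability(maf: float) -> float:
    """Probability that two unrelated people share a genotype at one SNP."""
    p, q = 1.0 - maf, maf
    genotype_freqs = (p * p, 2 * p * q, q * q)  # AA, Aa, aa
    # Two people match if both draw the same genotype.
    return sum(f * f for f in genotype_freqs)


def expected_chance_matches(n_snps: int, maf: float, population: float) -> float:
    """Expected number of unrelated people matching at all n_snps by chance."""
    return population * genotype_match_probability(maf) ** n_snps


if __name__ == "__main__":
    for n in (10, 30, 60):
        matches = expected_chance_matches(n, maf=0.5, population=8e9)
        print(f"{n} SNPs: ~{matches:.2g} expected chance matches")
```

Under these simplified assumptions, ten SNPs still leave hundreds of thousands of chance matches in a population of eight billion, while thirty leave fewer than one. Real SNPs are less informative than this idealized case, and relatives match far more often than strangers, but the exponential shape is the point.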

The threat models people don't think about

Most people who think about genetic privacy at all focus on a small set of obvious threats: discrimination in health insurance (largely illegal in the US under the Genetic Information Nondiscrimination Act, GINA), employment discrimination (also covered by GINA), or government surveillance (rare but real). GINA does not cover life, disability, or long-term care insurance, which is one of the real gaps in the patchwork of US genetic privacy law.

The threat models that are actually more likely:

Commercial data brokers building inferred profiles. If your genome is part of a database that's sold or licensed for "de-identified research," then in practice the data flows through partners whose security posture and downstream uses are opaque to you. The "de-identification" rests on assumptions about future re-identification that don't hold up.

Forensic identification via relatives. Multiple cold cases have been solved using consumer-grade genetic databases that the suspect never personally submitted to. A first cousin's upload to a public genealogy database can be sufficient to identify you uniquely if other information is available. This isn't hypothetical; it's well-documented (the Golden State Killer case being the most famous example, but not the only one). Law enforcement access to genetic data no longer stops at the FBI's CODIS database. The question "can the government access your DNA" now has to include "via a third cousin you have never met".

Bankruptcy, acquisition, and breach. Companies fail. Companies get acquired. Companies get breached. The privacy policy you signed in 2017 binds the original entity, but what happens to the data when the entity changes hands is largely shaped by bankruptcy court priorities and breach-disclosure law. 23andMe's own privacy notice is worth reading in full, and we wrote a separate post about the 23andMe bankruptcy case that walks through one recent example. Note that HIPAA does not cover direct-to-consumer genetic testing companies, because they are not "covered entities" under the statute. The protections people assume exist often do not.

Subpoena and law enforcement access. Any service holding plaintext access to your genetic data can be compelled to produce it by a properly issued legal request. The legal standard for compelling production varies by jurisdiction and case type but is generally lower than people assume.

Long-tail scientific use cases. "We use your data for research" sounds benign and often is. But the boundary of what research is acceptable shifts over time. A consent given in 2015 may cover uses that wouldn't have been imagined then but are routine in 2030. Most consumer privacy policies don't tie consent to a specific research use case; they license a broad future-use right.

What the right design looks like

If you accept the threat model above, then the design principle for privacy-first genetic testing is simple: minimize the amount of plaintext genome that exists outside the user's direct control. Genetic data sovereignty is the property you are trying to preserve. Encrypted genetic data storage with user-held keys is one of the ways to do it.

In practice this means:

End-to-end encryption with user-held keys. Sometimes called client-side encryption, this means the service stores your genome, but the encryption key lives with you. The service cannot decrypt without your participation. A breach yields ciphertext, not data.

Computation that doesn't require plaintext. Where possible, design the system so the operations the service needs to do (variant lookup, genotype matching, report generation) can happen against encrypted indices or hashed lookups rather than raw genome reads. The frontier here is zero-knowledge proofs applied to genetics, where a service can verify a claim about your genome without ever seeing the underlying bases. Not every operation is amenable to that yet; the ones that aren't should be minimized.

Granular consent. Different operations have different sensitivities. Generating a personal report is one consent. Contributing to a research dataset is a separate consent. Allowing partner-firm access is a third. Bundling all of these into "by signing up you agree" is a design choice that prioritizes the service's flexibility over the user's understanding.

Retention that the user controls. When a user decides they're done, the data should be destroyable in a way that's verifiable. Not "we'll delete it from our active databases but retain backups for X years"; actually destroyable, end-to-end.

These are the principles behind data sovereignty in this category. The implementations differ across services. Most consumer-genetic-health products today don't implement most of them. The industry was built before the threat model was clear.
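The granular-consent principle can be sketched as a small data structure. This is a minimal illustration, not any particular service's schema: the `Purpose` values and the `ConsentLedger` class are hypothetical names chosen for this example. The design choice it shows is default-deny, where each purpose must be granted on its own and can be revoked on its own.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Purpose(Enum):
    """Hypothetical purposes, one per separate consent."""
    PERSONAL_REPORT = "personal_report"
    RESEARCH_DATASET = "research_dataset"
    PARTNER_ACCESS = "partner_access"


@dataclass
class ConsentLedger:
    """Per-user record of which purposes are currently consented to."""
    granted: dict = field(default_factory=dict)  # Purpose -> time of grant

    def grant(self, purpose: Purpose) -> None:
        self.granted[purpose] = datetime.now(timezone.utc)

    def revoke(self, purpose: Purpose) -> None:
        self.granted.pop(purpose, None)

    def allows(self, purpose: Purpose) -> bool:
        # Default-deny: anything not explicitly granted is refused.
        return purpose in self.granted


ledger = ConsentLedger()
ledger.grant(Purpose.PERSONAL_REPORT)
print(ledger.allows(Purpose.PERSONAL_REPORT))   # True
print(ledger.allows(Purpose.RESEARCH_DATASET))  # False
```

Signing up for a report grants nothing about research or partner access; each of those would need its own explicit `grant` call, which is the opposite of a bundled "by signing up you agree" clause.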

How Expressive's design reflects this

We're not the only service taking this seriously, but it's the central architectural decision we made and we think it's worth being explicit about.

The encryption key for your genome is derived from a signature you produce with a wallet you control. It doesn't live on our servers in any form. We can't decrypt your data without your signature, which means we can't be compelled to produce plaintext data by a third party, can't sell access to it, and can't accidentally leak it in a breach.
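For readers who want a feel for what "derived from a signature" can look like, here is a minimal sketch using HKDF (RFC 5869) built from the Python standard library. It is an illustration under stated assumptions, not a description of our production code: it assumes the wallet produces a deterministic signature over a fixed message (otherwise the key would differ on every login), and the `derive_genome_key` name, the context string, and the placeholder signature bytes are all hypothetical.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256, written out with stdlib primitives."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF-SHA256")
    # Extract: concentrate the input keying material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand: stretch the pseudorandom key to the requested length.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_genome_key(wallet_signature: bytes, user_salt: bytes) -> bytes:
    """Hypothetical helper: turn a wallet signature into a 32-byte data key."""
    # The context string is illustrative; it separates this key from any other
    # key that might be derived from the same signature.
    return hkdf_sha256(wallet_signature, user_salt, b"genome-encryption-v1")


# Placeholder bytes standing in for a real signature produced client-side.
signature = bytes.fromhex("ab" * 65)
key = derive_genome_key(signature, user_salt=b"per-user-salt")
print(len(key))  # 32
```

The property that matters is that the same signature always yields the same key, and nothing stored server-side is enough to recompute it.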

We process the file server-side because we have to: running a pipeline against gigabyte-scale genomic data inside a browser is slow and battery-draining. The reads, however, are over encrypted blocks, with the keys flowing through the user's session rather than living at rest on our infrastructure.

We use HMAC lookups rather than plaintext indices for any operation that needs to find your records. The HMAC is computed from your wallet-derived key, so we can identify your records to serve them, but we can't reverse the HMAC to recover identifying information.
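A keyed-hash lookup of this kind can be sketched in a few lines. This is a simplified illustration, not our actual schema: `lookup_tag`, `BlindIndex`, and the record label format are hypothetical names, and the in-memory dictionary stands in for whatever store a real service would use.

```python
import hashlib
import hmac
from typing import Optional


def lookup_tag(user_key: bytes, record_label: str) -> str:
    """Compute an opaque lookup tag from a user-held key and a record label."""
    return hmac.new(user_key, record_label.encode("utf-8"), hashlib.sha256).hexdigest()


class BlindIndex:
    """Server-side store keyed only by opaque tags, never by user identifiers."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def put(self, tag: str, ciphertext: bytes) -> None:
        self._rows[tag] = ciphertext

    def get(self, tag: str) -> Optional[bytes]:
        return self._rows.get(tag)


# Placeholder key; in the design described above it would come from the wallet.
user_key = b"\x11" * 32
index = BlindIndex()

tag = lookup_tag(user_key, "genome/chr1")
index.put(tag, b"<encrypted block>")

print(index.get(tag) is not None)                                    # True
print(index.get(lookup_tag(b"\x22" * 32, "genome/chr1")) is None)    # True
```

The store can serve a record to whoever presents the right tag, but without the key a tag cannot be recomputed from a label or traced back to a person.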

We don't share your data with research partners. We don't aggregate it for trait studies. We don't have a "you opt into research at signup" clause hidden in our terms of service. We may eventually offer an explicit opt-in for specific research collaborations, but that will be its own consent flow, separate from signup, with the specific research described concretely.

We have a detailed technical post on how this actually works for the security-curious. The short version: the keys live with you, the data is encrypted at rest and in transit with keys we don't hold, and there is no architecturally privileged "we can see everything" position.

What this means for you, the data owner

A few practical takeaways:

The era of treating genetic data like email-tier personal information is ending. The properties of the data make that framing wrong. The shift toward designs that respect those properties is slow but underway. We built Expressive to be genetic data you actually own: your genome stays yours, by construction, not by promise. We hope more services join.

