Bias in Genomic Data

Riya Shet

A critical look at ancestry bias in genome-wide studies and the risk scores built from them. Using the 2024 All of Us controversy, I argue that bias enters genomics in two ways: who gets sampled, and how their data is shown.


Critical analysis · GWAS · Polygenic risk scores · Health equity · Gulf focus

Figure: donut charts comparing ancestry composition. Prior GWAS in aggregate are around 86% European; All of Us is around 54% European.

The problem

Genome-wide association studies (GWAS) link genetic variants to disease risk, and they have now been run on millions of people. As of 2021, about 86% of those participants were of European ancestry. The main clinical product of this work, the polygenic risk score, is several-fold less accurate outside that group. Middle Eastern and Gulf populations are barely represented at all.
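To make the dependence on the training cohort concrete, here is a minimal sketch of how a polygenic risk score is commonly computed: a weighted sum of risk-allele counts, with the weights taken from GWAS effect sizes. The variant names, weights, and genotype below are hypothetical placeholders, not values from any study.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant).

    `weights` stands in for per-variant GWAS effect sizes. All names and
    numbers used with this function here are illustrative assumptions.
    """
    missing = set(weights) - set(dosages)
    if missing:
        raise KeyError(f"no genotype for variants: {sorted(missing)}")
    return sum(weights[v] * dosages[v] for v in weights)


# Hypothetical effect sizes, as if estimated in one GWAS cohort.
example_weights = {"variant_a": 0.12, "variant_b": -0.05, "variant_c": 0.30}

# Hypothetical genotype for one person: copies of each risk allele.
example_dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

example_score = polygenic_risk_score(example_dosages, example_weights)
```

The formula itself is ancestry-neutral, but the weights are not: they are estimated in whichever cohort the GWAS sampled. Applied to a patient from a poorly represented ancestry, the same sum runs on weights that may not fit.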

All of Us, and the figure that became the story

In 2024 the All of Us program published a landmark paper. It sequenced 245,000 genomes, about 46% from non-European ancestry, which is a real step toward closing the gap. The cover charts show that shift away from the old 86% European aggregate. The problem was the paper's headline figure: a UMAP coloured by self-described race, which made a continuous genetic landscape look like a handful of discrete races. The same paper also included an admixture plot showing the continuous reality, yet the UMAP became the press image, and geneticists pushed back within weeks. Bias entered not only through who was sampled, but also through which figure reached the front page.

Why this matters for the Gulf

Even All of Us, with its 46% non-European cohort, includes almost no Middle Eastern ancestry. A polygenic risk score trained on it will not transfer cleanly to an Emirati patient. That is why the Emirati Genome Programme and the Qatar Genome Programme matter. Regional cohorts are a precondition for safe genomic medicine here, not a redundant copy of the international ones.

Recommendations

The talk closes with five recommendations: treat ancestry as a continuum rather than fixed categories, add a bias-and-equity review step for genomic publications, default Gulf clinical tools to local reference data, train clinicians to read risk scores as ancestry-dependent confidence intervals, and pre-register visualisation choices, not just analyses. The slides and the full reference list are in the repository.
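As a rough illustration of the fourth recommendation, the sketch below reports a risk score as an interval whose width depends on how well the score is calibrated for the patient's ancestry. It assumes a simple normal approximation, and the standard errors are hypothetical placeholders chosen only to show the effect. They are not estimates from any cohort.

```python
from dataclasses import dataclass


@dataclass
class ScoreReport:
    point: float
    low: float
    high: float


def report_with_interval(score, standard_error, z=1.96):
    """Return the point score with a symmetric interval around it.

    Assumes a normal approximation. `standard_error` stands in for an
    ancestry-specific measure of uncertainty (a hypothetical input here).
    """
    if standard_error < 0:
        raise ValueError("standard_error must be non-negative")
    half_width = z * standard_error
    return ScoreReport(score, score - half_width, score + half_width)


# Same point score, two hypothetical levels of uncertainty.
well_represented = report_with_interval(0.19, standard_error=0.02)
under_represented = report_with_interval(0.19, standard_error=0.08)
```

Both reports carry the same point score, but the second interval is four times as wide. That is the reading the recommendation asks clinicians to make: the number alone hides how much the training data knew about this patient's ancestry.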