The dirty secret of structural databases: Garbage in, garbage out

20 Apr 2026

By Dr. Neil Taylor, Founder, DesertSci

The Protein Data Bank (PDB) now holds over 250,000 structures. AlphaFold’s database extends that coverage with well over a million predicted models. On paper, we have never had more structural knowledge at our fingertips.

And yet, in practice, there’s a truth that doesn’t get discussed nearly enough in the excitement around AI and structural biology: much of this data is neither high-resolution nor error-free. Building on problematic structures can lead directly to flawed decisions in structure-based drug design (SBDD).

The quality problem nobody advertises

Not all structural data is created equal. Resolution varies enormously. Many PDB entries were deposited under challenging experimental conditions, and refinement standards are far from consistent. Electron density maps can be poorly resolved at the very binding site you care about most. Ligands are sometimes modelled using sub-optimal parameter files, leading to incorrect conformations – geometries that do not reflect biological reality.

Post-translational modifications, mutations, missing density, crystal packing artefacts, variable occupancy, flexibility, soaking versus co-crystallisation differences – each introduces another layer of interpretive complexity.

And yet, these structures routinely flow into computational pipelines, virtual screening campaigns, and AI-driven drug discovery models with surprisingly little critical filtering.

I have seen drug design programmes anchor on binding modes that later turned out to be crystallographic artefacts. I have also watched virtual screening campaigns built on structures with poorly resolved loop regions near the active site produce biologically irrelevant hit lists.

The issue is not just about avoiding particular dead ends. It is about recognising how these errors influence downstream decisions in medicinal chemistry, and learning how to better inform future structure-based drug design programmes.

The AI amplification problem

This challenge becomes more acute in the context of AI in drug discovery. AI models are only as good as the data they are trained on. As the field accelerates the use of structure-based generative models, docking scoring functions, and binding affinity predictors, the quality of structural datasets becomes critical.

Garbage in, garbage out is not a new principle. But at scale, the consequences change.

When models are trained on hundreds of thousands of structures, errors are not just propagated – they are amplified and obscured. Biases embedded in structural biology datasets do not announce themselves. They quietly shape outputs, influencing predictions in ways that are difficult to detect downstream.

What good structural data practice actually looks like

The answer is not to slow down innovation. It is to apply greater rigour. The most effective structure-based drug design teams I have worked with treat data curation as a discipline in its own right. That includes:

  • Applying strict validation criteria to structural data
  • Rejecting or flagging structures that do not meet quality thresholds
  • Validating sequence records and ligand binding poses
  • Cross-checking against supporting datasets
  • Being explicit about uncertainty and limitations in structural models
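As a rough illustration of the first two points, a triage step could look something like the sketch below. The record fields, the three metrics chosen, and every cut-off value are assumptions made for this example, not recommended standards; each team would set its own criteria, and a real pipeline would draw on far richer validation data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class StructureRecord:
    """Quality metadata for one structure (hypothetical schema)."""
    structure_id: str
    resolution: Optional[float]   # in angstroms; None if not reported
    r_free: Optional[float]       # None if not reported
    ligand_rscc: Optional[float]  # real-space correlation of the ligand to density


@dataclass(frozen=True)
class Thresholds:
    """Illustrative cut-offs only; each team should set its own."""
    max_resolution: float = 2.5
    max_r_free: float = 0.30
    min_ligand_rscc: float = 0.80


def triage(record: StructureRecord, t: Thresholds = Thresholds()) -> Tuple[str, List[str]]:
    """Return ("accept" | "flag" | "reject", reasons)."""
    failures: List[str] = []
    missing: List[str] = []

    checks = [
        ("resolution", record.resolution, lambda v: v <= t.max_resolution),
        ("r_free", record.r_free, lambda v: v <= t.max_r_free),
        ("ligand_rscc", record.ligand_rscc, lambda v: v >= t.min_ligand_rscc),
    ]
    for name, value, passes in checks:
        if value is None:
            # A missing metric is uncertainty, not failure: surface it for review.
            missing.append(f"{name} not reported")
        elif not passes(value):
            failures.append(f"{name}={value} outside threshold")

    if failures:
        return "reject", failures + missing
    if missing:
        return "flag", missing
    return "accept", []
```

The sketch returns reasons alongside the verdict so that a rejected or flagged structure carries an explicit statement of what is wrong or unknown, in line with the last point in the list.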

It also means investing in the right infrastructure.

Structural data management platforms that track provenance, surface quality metrics, and allow scientists to interrogate the evidence behind each structure are not a luxury – they are essential for modern SBDD workflows.
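To make "tracking provenance" concrete, here is a minimal sketch of the kind of record such a system might keep. The class names, fields, and methods are all hypothetical, chosen for illustration; they do not describe any particular platform.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class CurationEvent:
    """One recorded decision about a structure (hypothetical schema)."""
    when: date
    who: str
    action: str      # e.g. "imported", "ligand pose re-checked", "flagged"
    note: str = ""


@dataclass
class CuratedStructure:
    structure_id: str
    source: str                       # where the coordinates came from
    method: str                       # e.g. "X-ray", "cryo-EM", "predicted"
    metrics: Dict[str, float] = field(default_factory=dict)
    history: List[CurationEvent] = field(default_factory=list)

    def record(self, event: CurationEvent) -> None:
        """Append to the audit trail; earlier events are never rewritten."""
        self.history.append(event)

    def evidence_summary(self) -> str:
        """Render the metrics and decision trail a scientist would review."""
        lines = [f"{self.structure_id} ({self.method}, from {self.source})"]
        lines += [f"  {name}: {value}" for name, value in sorted(self.metrics.items())]
        lines += [f"  {e.when.isoformat()} {e.who}: {e.action}" for e in self.history]
        return "\n".join(lines)
```

Keeping the history as an append-only list means a later reader can see not only the current quality metrics but also who examined the structure, when, and what they concluded.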

A call for higher standards

The field is at an inflection point. Structural data is now used to train AI systems, prioritise drug discovery programmes, and inform multi-million dollar decisions. Increasingly, outputs from AI models will help determine the direction of research programmes themselves. The standards we apply to structural data need to reflect that reality.

That means being honest – with ourselves, our teams, and our leadership – about what the data is actually telling us. It means building cultures where questioning a structure’s quality is encouraged, not seen as an obstruction.

The promise of structure-based drug design is real. But it is only as strong as the foundations it is built on. From what I continue to see across the industry, too many of those foundations are not being examined nearly carefully enough.

Dr Neil Taylor

For more insights into AI in drug discovery, computational chemistry and research-grade scientific software, follow Dr Neil Taylor on LinkedIn. Or, if you’d like to arrange a demonstration of DesertSci’s Proasis, please get in touch with our team.
