Embrace the Mess

Three new papers from Roche, accompanied by a perspective by Elspeth Garman, describe a fantastic ligand–protein structural dataset. They focus on several related fatty acid-binding proteins (FABPs) and report 229 high-resolution ligand-bound structures. The first paper explores how protein dynamics (and allosteric interactions with membrane mimics) change when empty lipid-binding sites are filled by natural or synthetic ligands. The second examines detailed atomistic interactions across isoforms, with lessons for selectivity. Along the way, there is great crystallographic lore: nearly isomorphous crystals, twinning, and more. The third paper focuses on how ligand chemistry can often be misassigned due to complexities in synthesis, isomerization, or transformation within the crystal. We have observed this problem ourselves with our macrodomain ligands, and I expect it to become more common as “make on demand” chemistry democratizes ligand soaking.

The authors distill these findings into a set of very conservative lessons: prioritize chemical certainty, gate on full occupancy, and filter out ambiguous cases before using structural data for machine learning. Elspeth echoes some of these concerns in the perspective. I disagree. I think we need to consider these lessons differently depending on the use case:

  1. Downstream medicinal chemistry or mechanistic interpretation without a structural biologist in the loop: If someone is simply taking the PDB as “truth” to guide synthesis, then rigorous filtering to avoid incorrect ligand identities is absolutely appropriate. I don’t think this actually happens anywhere, but it is always the straw man invoked by papers “warning” about the pollution of the PDB, including papers criticizing our work (see also our response).
  2. Medicinal chemistry fully integrated with structural biology: Here, strict filtering risks losing huge opportunities. Alternate conformations in both ligand and binding site, unusual B-factors, and subtle occupancy differences are not nuisances. These are the very signals that can inspire new design strategies and suggest unexplored mechanisms of selectivity.
  3. Input to machine learning to predict protein–ligand complexes: Curating only “gold-standard” data is going to be incredibly limiting. Disagreements between models and experimental data are opportunities to improve algorithms and better understand real-world uncertainty. I doubt that many of the structures that will be deposited by the OpenBind consortium will pass these filters.
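
To make the contrast between these use cases concrete, here is a minimal sketch of the difference between gating on full occupancy and simply labeling what is there. It is my own illustration, not anything from the Roche papers: it assumes fixed-column PDB-format HETATM records, and the function names, the 0.99 occupancy threshold, and the flag names are all hypothetical.

```python
from collections import defaultdict


def summarize_ligands(pdb_lines, skip_resnames=("HOH",)):
    """Collect occupancy and altloc info per hetero residue.

    Assumes fixed-column PDB format: altLoc in column 17, resName in 18-20,
    chain in 22, resSeq in 23-26, occupancy in 55-60.
    """
    groups = defaultdict(lambda: {"occupancies": [], "altlocs": set()})
    for line in pdb_lines:
        if not line.startswith("HETATM"):
            continue
        resname = line[17:20].strip()
        if resname in skip_resnames:
            continue
        key = (resname, line[21], int(line[22:26]))
        groups[key]["occupancies"].append(float(line[54:60]))
        altloc = line[16].strip()
        if altloc:
            groups[key]["altlocs"].add(altloc)
    return {
        key: {
            "min_occupancy": min(g["occupancies"]),
            "altlocs": sorted(g["altlocs"]),
        }
        for key, g in groups.items()
    }


def annotate(summary, full_occupancy=0.99):
    """Label every ligand rather than discarding the ambiguous ones.

    The 0.99 cutoff is an arbitrary illustrative choice. A ligand modeled in
    two conformers gets both flags, since each conformer is partially occupied.
    """
    labels = {}
    for key, info in summary.items():
        flags = []
        if info["min_occupancy"] < full_occupancy:
            flags.append("partial_occupancy")
        if info["altlocs"]:
            flags.append("alternate_conformations")
        labels[key] = flags or ["clean"]
    return labels


def strict_filter(summary, full_occupancy=0.99):
    """The conservative alternative: keep only ligands with no flags at all."""
    labels = annotate(summary, full_occupancy)
    return [key for key, flags in labels.items() if flags == ["clean"]]
```

For use case 1, `strict_filter` is a sensible default. For use cases 2 and 3, `annotate` keeps every ligand and carries the flags along as metadata, so the decision about what counts as usable is left to whoever consumes the data.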

Real experimental data is messy, full of alternate conformations, unexpected chemistries, and crystallization “oddities”. Filtering exclusively for perfection may feel safe, but it also limits discovery. Using the coordinates of partial-occupancy ligands and static alternative conformations will improve things, but I’m hoping that the ML for structural biology field will increasingly embrace the mess of experimental data more directly. I’ve written about this before from conceptual, practical, and policy perspectives. This trio of papers is a tremendous teaching text that guides the reader through many of the complexities of protein–ligand data sets, but I disagree with the jeremiads at the end of these papers about the potential for misuse. I truly wish there were more careful papers like this out there.