Three new papers from Roche, accompanied by a perspective by Elspeth Garman, describe a fantastic ligand–protein structural dataset. They focus on several related fatty acid-binding proteins (FABPs) and report 229 high-resolution ligand-bound structures. The first paper explores how protein dynamics (and allosteric interactions with membrane mimics) change when empty lipid-binding sites are filled by natural or synthetic ligands. The second examines detailed atomistic interactions across isoforms, with lessons for selectivity. Along the way, there is great crystallographic lore: nearly isomorphous crystals, twinning, and more. The third paper focuses on how ligand chemistry can often be misassigned due to complexities in synthesis, isomerization, or transformation within the crystal. This is a problem we have observed ourselves with our macrodomain ligands, and one I expect to see more often as “make on demand” chemistry democratizes ligand soaking.
The authors distill these findings into a set of very conservative lessons: prioritize chemical certainty, gate on full occupancy, and filter out ambiguous cases before using structural data for machine learning. Elspeth echoes some of these concerns in the perspective. I disagree. I think we need to consider these lessons differently depending on the use case:
Real experimental data is messy, full of alternate conformations, unexpected chemistries, and crystallization “oddities”. Filtering exclusively for perfection may feel safe, but it also limits discovery. Using the coordinates of partial-occupancy ligands and static alternate conformations will improve things, but I’m hoping that the ML for structural biology field will increasingly embrace the mess of experimental data more directly. I’ve written about this before from conceptual, practical, and policy perspectives. This trio of papers is a tremendous teaching text that guides the reader through many of the complexities of ligand–protein datasets, but I disagree with the jeremiads at the end of these papers about the potential for misuse. I truly wish there were more careful papers like this out there.