News from the Fraser Lab


Embrace the Mess

James Fraser
29 July 2025
tags: #papers

Three new papers from Roche, accompanied by a perspective by Elspeth Garman, describe a fantastic ligand–protein structural dataset. They focus on several related fatty acid-binding proteins (FABPs) and report 229 high-resolution ligand-bound structures. The first paper explores how protein dynamics (and allosteric interactions with membrane mimics) change when empty lipid-binding sites are filled by natural or synthetic ligands. The second examines detailed atomistic interactions across isoforms, with lessons for selectivity. Along the way, there is great crystallographic lore: nearly isomorphous crystals, twinning, and more. The third paper focuses on how ligand chemistry can often be mis-assigned due to complexities in synthesis, isomerization, or transformation within the crystal. This is a problem we have observed ourselves with our macrodomain ligands, and one I expect to see more often as “make on demand” chemistry democratizes ligand soaking.

The authors distill these findings into a set of very conservative lessons: prioritize chemical certainty, gate on full occupancy, and filter out ambiguous cases before using structural data for machine learning. Elspeth echoes some of these concerns in the perspective. I disagree. I think we need to consider these lessons differently depending on the use case:

  1. Downstream medicinal chemistry or mechanistic interpretation without a structural biologist in the loop: If someone is simply taking the PDB as “truth” to guide synthesis, then rigorous filtering to avoid incorrect ligand identities is absolutely appropriate. I don’t think this actually happens anywhere, but it is the perennial straw man of papers warning about the “pollution” of the PDB. This includes papers criticizing our work (see also our response).
  2. Medicinal chemistry fully integrated with structural biology: Here, strict filtering risks losing huge opportunities. Alternate conformations in both ligand and binding site, unusual B-factors, and subtle occupancy differences are not nuisances. These are the very signals that can inspire new design strategies and suggest unexplored mechanisms of selectivity.
  3. Input to machine learning to predict protein–ligand complexes: Curating only “gold-standard” data is going to be incredibly limiting. Disagreements between models and experimental data are opportunities to improve algorithms and better understand real-world uncertainty. I doubt that many of the structures that will be deposited by the OpenBind consortium will pass these filters.

Real experimental data is messy, full of alternate conformations, unexpected chemistries, and crystallization “oddities”. Filtering exclusively for perfection may feel safe, but it also limits discovery. Even though using the coordinates of partial-occupancy ligands and static alternate conformations will improve things, I’m hoping that the ML for structural biology field will increasingly embrace the mess of experimental data more directly. I’ve written about this before from conceptual, practical, and policy perspectives. While this trio of papers represents a tremendous teaching text that guides the reader through many of the complexities of protein–ligand datasets, I disagree with the jeremiads at the end of these papers about the potential for misuse. I truly wish there were more careful papers like this out there.


The Tortured Proteins Department, Episode 5

James Fraser
12 July 2025
tags: #podcast

We chatted about the latest news, including the new NIH open access policy, trends observed with scientists using LLMs, and the Montpelier Mile race.

“The new Administration began weaponizing what should not be weaponized - the health of all Americans…creating chaos and promoting an…unreasoned agenda of blacklisting certain topics, that…has absolutely nothing to do with the promotion of science.” - Judge William Young (reported by Max Kozlov, Nature News)


The Tortured Proteins Department, Episode 4

James Fraser
14 June 2025
tags: #podcast

The fourth episode of The Tortured Proteins Department is out now!

We chatted about the latest news, our Conformational Ensembles Conference, vibe coding in science, and the Dipsea and Montpelier Mile races.

The pre-prints discussed in this episode:


Developing a foundation in the scientific literature

Gabriella Estevam
26 May 2025
tags: #teaching

As the saying goes, “chance favors the prepared mind.” The process of scientific discovery begins with a deep understanding of the knowns, such that we can address the unknowns. Each project I’ve worked on as a biochemist has required its own literature foundation, and as a scientist who likes to study a variety of proteins and work on multiple projects in parallel, I’ve developed a system for rapidly building a foundation in the scientific literature.

Entering a new scientific field is as exciting as it is challenging, and in my career, it happens often. While the unifying theme is always structural biology and enzymology, the exact systems I’ve studied have been distinct, each requiring a careful understanding of its specific scientific history – and, really, of what works and what doesn’t. The faster I can identify gaps in knowledge and develop a hypothesis, the faster I can get to the best part: testing it.

So, what are the right papers? Where are the papers? Who are the scientists in the field? What are the key discoveries?

Here is my method:

Define the field and focus

When it’s clear why I’m reading, it makes it that much easier to identify what to read. A project is constructed from three things: a core question and hypothesis, a set of methods, and a broader field. Defining the contents of those categories allows for targeted and goal-oriented reading.

To use my PhD as an example, the core question of my project was: how can we comprehensively map MET kinase resistance mutations?

To address this question, there are several things I need to know, which for me spiral into an extended list of questions like: what kind of protein is MET? What is its role? To what receptor tyrosine kinase (RTK) family does it belong? What is the current status of pathologic mutation annotation in MET-associated diseases? How does resistance develop? What methods have been used to study MET, and what were the caveats? What model systems have been used to study MET? Is there a structure? How was that structure solved and what is the resolution? What are the motifs, domains, PTMs, protein-protein interactions? What is the state of the art for comprehensively identifying sensitizing and resistance mutations? How have these questions been addressed in other proteins? You get the point…

By outlining learning objectives in this way, I can mentally organize and group questions based on theme. From there, it is a matter of tackling each topic like a to-do list and generating an intellectually fulfilling reading strategy. For the questions above, this is how I might group and define them:

  • MET kinase (core focus of project)
    • Biochemistry
    • Structure
    • Model systems
    • Disease mutations
  • Deep Mutational Scanning (central method)
    • DNA library construction
    • Selection-based pooled screening
    • NGS
    • Coding & data analysis
    • Molecules studied to date
  • RTKs, protein kinases, signaling (broader field)
    • Protein kinase phylogeny
    • Phosphorylation relay
    • RTK and protein kinase subfamilies
    • Structure-function similarities
    • Activation mechanisms
    • Disease implications

Expect overlap when organizing. For instance, while reading for methods, there might be a paper that performed a deep mutational scan (DMS) on a different kinase, and found potential mechanisms of resistance through exhaustive mutagenesis and selection – two birds, one scone! This is an opportunity to learn more about my broader field in the context of the exact method I want to apply. I can focus on the results, caveats, data interpretation, and begin to develop a realistic picture of how things might work for my project.

When I’m in the early stages of conceptualizing a project and am independently generating a hypothesis – the position I’m mostly in now – I use this strategy in the context of a high-level problem, but that might need its own post in the future. Nevertheless, blueprinting the required literature is the first step.

Scan and collect titles

Before deeply reading papers, I first collect them. If there is even one paper already acting as a starting point, which I’ve gotten as a suggestion or otherwise, I will begin collecting titles from its references as a strong pre-filtered list. However, my favorite way to collect reading material is simply searching keywords or questions in Google Scholar. For instance, to broadly understand crystallography, I’ll type “crystallography” into the search engine. If I have an abstract set of concepts in mind and want to understand what exists, whatever it is, I’ll type that. Collecting titles is becoming easier and more reliable with AI tools, and my approach there is the same, but with more prompting, .bib importing, and PDF attaching.

The goal of this stage is coverage of the literature, which means that early, foundational papers are some of the most important. Therefore, I ensure my searches are sorted by relevance and not date. By sampling papers across time, I can map out discovery trends and potentially anticipate the next wave of research.
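
As an illustration of what this collection step can look like when scripted – a minimal sketch assuming the public Crossref REST API as the search backend, my stand-in here rather than anything prescribed above – a relevance-sorted query plus a tally of publication years gives both a title list and a rough picture of activity over time:

```python
# A sketch of automated title collection: query the public Crossref API,
# ranked by relevance rather than date, then tally publication years to
# see how work on a topic is spread over time.
import collections

import requests


def collect_titles(query: str, rows: int = 50) -> list[dict]:
    """Return relevance-ranked title/year/DOI records for a keyword query."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query": query, "rows": rows, "sort": "relevance"},
        timeout=30,
    )
    resp.raise_for_status()
    records = []
    for item in resp.json()["message"]["items"]:
        year = item.get("issued", {}).get("date-parts", [[None]])[0][0]
        title = (item.get("title") or ["(untitled)"])[0]
        records.append({"title": title, "year": year, "doi": item.get("DOI")})
    return records


# The same keyword habit as above: one broad term, relevance-sorted.
papers = collect_titles("crystallography")
for year, count in sorted(
    collections.Counter(p["year"] for p in papers if p["year"]).items()
):
    print(year, "#" * count)
```

Swapping in any keyword or question reproduces the search habit above, and the year histogram is exactly the sampling-across-time check.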

From there, I scan titles, the first two sentence snippets, authors, and date – again, to sample my reading across time and avoid recency bias. If the title looks relevant, I’ll open it and give the whole paper a visual scan. At this point, I’m looking at figure content, skimming the abstract, skimming the discussion, and taking note of authors, but not spending more than a couple minutes per paper on this process.

If the paper content looks relevant or interesting, I save the PDF through a paper manager (Zotero, Mendeley, Paperpile, etc.) using their browser extension. My philosophy here is to use the manager as a “paper bank.” Each project I work on is given a dedicated folder, and as I make my way through the literature, I keep the papers I want to reference and read again, but remove the ones I don’t.
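
The paper bank itself can be scripted too. As a minimal sketch – assuming a Zotero library, the community pyzotero client, and placeholder credentials, none of which this workflow requires – listing what currently sits in one project’s folder keeps the pruning easy:

```python
# A sketch of the paper bank as code: list what currently sits in one
# project's Zotero collection via the community pyzotero client
# (pip install pyzotero). LIBRARY_ID, API_KEY, and the collection
# name are placeholders to fill in.
from pyzotero import zotero

zot = zotero.Zotero("LIBRARY_ID", "user", "API_KEY")

# Collections are the dedicated per-project folders; pick one by name.
project = next(c for c in zot.collections() if c["data"]["name"] == "MET kinase")

# Print each saved paper so stale entries are easy to spot and remove.
for item in zot.collection_items(project["key"]):
    data = item["data"]
    if data["itemType"] not in ("attachment", "note"):
        print(data.get("date", "n.d."), "|", data.get("title", "(untitled)"))
```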

When I’m building a literature foundation, it is a daily process of collecting and filtering. Reading volume varies as I trade off between deeply absorbing a handful of papers and covering three dozen at lesser depth. Ultimately, my highest reading value is interest.

Attenuate attention

The most immediate outcome of reading primary research is the development of hypotheses, approaches, experimental implementation, and iteration. However, one of the most important long-term outcomes is a curated reference list for your next paper.

Reading papers word-by-word is unnecessary and can distract from developing the literature breadth needed to build a reference list. The key is sampling and reading enough to filter out the most relevant papers.

Interact with papers

An effective way to understand literature is to engage with it. Whether conceptualizing, leading, or joining a project, there will be unfamiliar topics. When reading for comprehension, stop when more information is needed.

Stop reading to understand acronyms, look up terminology, and quickly find summaries of methods. If there is a larger concept that is unclear or piques interest, open and skim the cited works. Dig until ready to jump back into the paper. This can lead to a tangent of unexpected primary research reading, but that’s often when I branch out and discover the most across the scientific literature.

Interact with the literature based on how you practice science. Since I work with proteins, when I read a structure paper, one of the first things I do is open the cited PDB files and reference the model as I’m reading. I’ll highlight specific residues and ligands, and toggle between different visual representations. I’ll find all the structural models of the same protein through its UniProt reference ID, generate ensembles, and build a broader understanding of the authors and methods used for a given protein. If there is complex data interpretation or visualization, I’ll visit the published repository and scan the code to understand the analysis process.
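
For the find-all-the-structural-models step, here is a minimal sketch – assuming PDBe’s SIFTS best_structures endpoint, one possible route among several, with human MET’s UniProt accession P08581 as the example from the project above:

```python
# A sketch of the "find all structural models" step: PDBe's SIFTS
# best_structures endpoint maps a UniProt accession to every PDB entry
# that covers it. P08581 is human MET, the project example above.
import requests

accession = "P08581"
resp = requests.get(
    f"https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/{accession}",
    timeout=30,
)
resp.raise_for_status()

# The response is keyed by the accession; walk its list of mappings.
for entry in next(iter(resp.json().values())):
    resolution = entry.get("resolution")
    print(
        entry["pdb_id"],
        entry["experimental_method"],
        f"{resolution} Å" if resolution else "(no resolution)",
        f"chain {entry['chain_id']}",
    )
```

The printed PDB IDs can then be fetched into a molecular viewer one by one while reading.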

At this stage, it is also important to know who is in the field – learn who the authors are. What are their affiliations? What else have they written? Visit their websites, ORCID profiles, Google Scholar, etc. There is a good chance the authors will be at the next conference you’re attending.
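
This who’s-who scan can also be jump-started in code. A minimal sketch, again leaning on Crossref (its query.author parameter is real; the name is a placeholder), pulls up what else someone has written, newest first:

```python
# A sketch of scanning an author's other work via Crossref's
# query.author parameter; "Jane Doe" is a placeholder name.
import requests

resp = requests.get(
    "https://api.crossref.org/works",
    params={
        "query.author": "Jane Doe",
        "rows": 20,
        "sort": "issued",
        "order": "desc",
    },
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["message"]["items"]:
    year = item.get("issued", {}).get("date-parts", [[None]])[0][0]
    title = (item.get("title") or ["(untitled)"])[0]
    print(year, "-", title)
```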

Keep reading

After building a literature foundation, maintain it. Revisit core papers if a refresh or reframe is due. Expand or contract reference lists. Stay current with developments. To this day I use the Fraser Lab method of following the scientific literature, which I adopted during my time in the lab and highly recommend.

Then repeat the process in another scientific space! This keeps things exciting, fresh, and creative, as ideas draw from multiple scientific domains.

If you’ve made it this far, thanks for reading, and find this co-posted at Gabriella’s site!


The Tortured Proteins Department, Episode 3

James Fraser
16 May 2025
tags: #podcast

The third episode of The Tortured Proteins Department is out now!

We chatted about grant cancellations, exciting regional meetings and reunions, two fun new preprints, community norms around code release, and the importance of giving kudos.

The pre-prints discussed in this episode: