News from the Fraser Lab


My Approach to Evaluating Faculty Applications

James Fraser
19 July 2023

As a faculty member at the University of California, San Francisco (UCSF), I am often asked about my approach to evaluating faculty applications. In writing it out, I not only clarify my thinking, but also provide transparency about how one faculty member evaluates applications. Additionally, by sharing this, I hope to get feedback to help improve my own process for evaluating applications in the future.

Protein Folding Funnel

Evaluating faculty applications, in my view, is akin to the process of protein folding, as described by Levinthal’s paradox. Levinthal’s paradox suggests that it would be virtually impossible for a protein to achieve its functional structure by exhaustively exploring every possible conformation due to the sheer number of potential configurations. Instead, proteins navigate through a funnel-like process, where a sequence of favorable local interactions steers the protein toward its final, folded ensemble. When I evaluate faculty applications, I adopt a similar approach. I don’t undertake an exhaustive examination of every single detail of all applications. Instead, I employ a funnel-like process, starting with broader criteria, then progressively narrowing down to more specific aspects of the proposed research program. I strive to do this without resorting to traditional markers of prestige such as the reputation of the journals where they’ve published or their academic pedigree. This process guides me toward the most promising applications that resonate with me both scientifically and in terms of shared scientific values.

The first step in my evaluation process is to review the Diversity, Equity, and Inclusion (DEI) statement. Based on other published rubrics, I assess the applicant’s awareness and involvement in DEI initiatives. I’ll also look over any teaching or mentoring track record as part of this, recognizing that not everyone has had the chance or environment to fully engage in these activities. This is a critical step for me. If an applicant does not demonstrate a strong commitment to DEI, I do not proceed further with their application. This initial screening takes less than five minutes per candidate and typically eliminates about half of the applicants.

Next, I turn my attention to the research statement. The opening page (and especially the opening paragraph!) is crucial here. I look for a clearly articulated problem or a set of problems that the applicant intends to address. If the scientific problem statement, its significance, or the applicant’s approach to solving it are unclear to me, I do not proceed with considering the candidate. This step takes less than two minutes per candidate and usually eliminates another half of the remaining applicants.

For the remaining candidates, I undertake a thorough review of the entire research statement and cover letter. I study the applicant’s key preprints and papers to familiarize myself with their specific scientific questions and approaches. Interestingly, many of the faculty members I’ve been involved in hiring at UCSF had not yet published their major work in a peer-reviewed journal at the time of their application. This is not a deterrent for me; in fact, I embrace preprints wholeheartedly. Preprints provide an open and immediate insight into a researcher’s latest work, and I am fully capable of evaluating them on their own merits. However, what I find less favorable are “confidential manuscripts in review”. Because these do not offer me the same level of transparency as preprints, I won’t review them as part of the application. Including such “confidential manuscripts” demonstrates a disconnect with the open science principles that I value in future colleagues.

During this stage, I also try to evaluate how successful they have been in making progress on key problems in prior career stages by scanning letters of reference and additional papers by the applicant (and often other papers in the applicant’s field).

I also want to clarify what I look for in reference letters, even though they are a minor factor relative to the research proposal and papers of the applicant. It’s common for every applicant to be described as “the best person who has passed through the lab in years,” so overall praise isn’t the differentiator for me. Instead, I focus on three key things:

1 - Context for the scientific barrier the candidate overcame in their prior work.

2 - Discussion of how the candidate’s FUTURE work will differentiate from the thrust of their current lab.

3 - Corroborating data on teaching, mentorship, and outreach.

Letters can add depth to these three dimensions, but rarely detract from them. While it’s not a strict requirement, a well-crafted letter that resonates on these three issues can be immensely helpful in painting a comprehensive picture of the candidate.

This overall step of evaluating the research statement and papers (with a scan of letters of reference and other papers) is time-intensive, taking approximately 20 to 40 minutes per candidate. However, this is the point where I decide if a candidate should be evaluated by the entire committee, generally nominating about 10-15 candidates.

At this point, I also receive the shortlists of the other committee members. Some of my colleagues may weigh other factors such as the prestige of journals where the applicant has published, their academic pedigree, or the likelihood of securing funding. This diversity in evaluation criteria is a strength of a committee approach, provided we are all aware of and acknowledge our biases. We typically get about 100-300 applicants in a cycle, but there is usually a significant overlap in shortlists. Generally, the committee process leads to a shortlist of ~25 candidates.

The next step involves a deeper reflection on each shortlisted application. I spend an additional 30 minutes per application, contemplating the fit of the research statement with our institution and gauging my excitement level about the proposed research. I again consider the DEI and teaching/mentoring efforts. My aim is to identify 5 to 7 applicants that I am extremely enthusiastic about, 10 applicants that I am open to learning more about if other committee members are sufficiently enthusiastic, and 5 to 10 applicants that I am skeptical about but am willing to be convinced by other committee members.

Finally, we (the hiring committee) engage in a comprehensive discussion and ranking process. Each committee member presents their shortlisted candidates, and we collectively rank them for Zoom and/or on-site interviews. This process tries to offer a balanced assessment of each candidate, helping us identify the most promising faculty members for UCSF.

In conclusion, my approach to faculty application evaluation is designed to be rigorous and thorough, while being efficient and minimizing proxies of prestige like journal name or institution. I’m cognizant that I have my own implicit and explicit biases, but what is outlined here is a reflection of how I try to identify candidates who not only excel in their research but also share our values. I believe it’s important to share my process, not as a standard, but as an example of one possible approach. I encourage anyone serving on a hiring committee to outline their own unique criteria and detail the process they use to arrive at a shortlist.

Thank you to Prachee Avasthi, Zara Weinberg, Willow Coyote-Maestas, Stephanie Wankowicz, Chuck Sanders, Brian Kelch, and Jeanne Hardy for feedback and discussions about this topic.


Fraser Lab DEIJ Journal Club - Blinding Grant Peer Review

Eric Greene
04 November 2022
tags: #deij_jc

Background
A group of scientists within the Fraser, Coyote-Maestas, and Pinney labs have begun a journal club centered around issues of diversity, equity, inclusion, and justice within academia, specifically in the biological sciences.

Our goal is to provide an environment for continued learning, critical discussion, and brainstorming action items that individuals and labs can implement. Our discussions and proposed interventions reflect our own opinions based on our personal identities and lived experiences, and may differ from the identities and experiences of others. We will recap our discussions and proposed action items through a series of blog posts, and encourage readers to directly engage with DEIJ practitioners and their scholarship to improve your environment.

November 4th, 2022 – Blinding peer review

Discussion Leader: Eric Greene

Articles:

Summary Article: “Funding: Blinding peer review”

Primary Article: “An experimental test of the effects of redacting grant applicant identifiers on peer review outcomes”

Bonus Article: “Strategies for inclusive grantmaking”

Summary and Key Points:

STEM research funding is a highly competitive space that has a persistent lack of diversity and representation, especially at the faculty level. I chose this case study as it discusses one of the largest current racial disparities in STEM, highlights a source of white privilege that directly impacts lab funding, and provides experimental evidence towards one mitigation strategy.

The NIH is a substantial funding source for biomedical research in the US, and NIH funding is foundational to the existence of many laboratories that are driving biomedical scientific discovery. However, there is a large and persistent funding gap between White and Black investigators: Black PIs are funded at 55-60% of the rate of White PIs.

In response to this disparity, the NIH conducted a study on the effects of blinding applicants’ identity and institution on the review of R01 proposals. The goal of this large experiment was to gain an understanding about the role of peer review in facilitating racial bias in grant awards and to understand the extent to which blinding applicant identity could blunt racial bias. The experiment uncovered the following:

  • Scores for applications from Black PIs were unaffected by blinding, but scores for applications from White PIs were significantly lower when the White PI’s identity was blinded, such that the racial gap was cut in half. This finding could be due to the “Halo effect”, where personal/institutional prestige dramatically upweights advantaged/privileged individuals, and can be seen as another mechanism fueling a ‘winners keep winning’ phenomenon. Indeed, the “Halo effect” has been indicated to be a potent factor in manuscript peer review.

  • The principal critique of invoking the “Halo effect” to rationalize the findings of this study is that applicants did not write their proposals with identifying information redacted. Redaction was done administratively on previously reviewed R01 applications, leaving uncertainty regarding the impact of administrative redaction on ‘grantsmanship’. However, we discussed the likelihood that applicants who benefit from individual/institutional prestige would write favorably toward this status in their applications, thus in effect working to entrench any positive “Halo effect” benefit.

Blinding applicant identification on grant proposals is not a silver bullet that solves racial disparity in NIH funding. Blinding is itself imperfect, with ~22% of reviewers able to positively identify the blinded applicant. However, it is one tool with a demonstrated effect here in blunting reviewer bias. While blinding was somewhat effective here, double blinding and/or tiered blinding of application materials could be used instead and may hold greater potential.

A key part of our discussion was about the review criteria for NIH funding that explicitly required a numerical evaluation of the individual and institution. Evaluation of a person contributes to an obligate entanglement of one’s past scientific accomplishments with their future potential during the grant review process. Not only can this equivalence be false (people often succeed past initial setbacks), but it can also be harmful by tying an applicant’s self-worth to their productivity. Funding requires accounting for the equipment available to carry out the research, which is important for accountability on the part of the investigator, but this does not necessarily require a numerical score. This detailed level of evaluation would prompt reviewers to score prestigious/well-resourced institutions higher even if the same research could be carried out elsewhere. We discussed as an alternative whether equipment/facilities categories could be scored as ‘sufficient’ or ‘insufficient’ and not influence the overall impact score of the application.

Open Questions:

  • How does one justly judge an application as fundable?
  • The ‘Halo effect’ in consequential academic evaluation processes has amassed supportive evidence beyond grant funding. How do we best de-leverage this effect towards a level playing field?
  • Blinding applicant identity can help, even if it is not perfect. How do we improve blinding processes through an equity lens?
  • Another explanation for lower Black PI funding rates stems from the subject matter of study, such as studying health care topics of interest for communities of color, which though important may not necessarily be of high funding value to reviewer or reviewing institute. How can these health care topics be adequately elevated and funded?
  • To what extent do non-NIH funding mechanisms also incur racial disparity? What have other organizations tried to mitigate? Have these strategies worked?

Proposed Action Items:

While trainees may have limited influence to change the course of NIH peer review, there are nonetheless actions that one can take:

  • Call your Representative/Senators to implore them to raise the NIH budget. The value of NIH-sponsored research to the general public is high, and with more funds, the 10-30% funding rate will increase and be less demoralizing to independent investigators and trainees.
  • Should you find yourself in the position of power as a peer reviewer, practice empathy during the review process and familiarize yourself with biases that can crop up in the process.
  • Vote. The NIH is a government entity and is not immune to political authority figures.
  • Encourage unsuccessful applicants to pursue resubmission. Rejection is hard but community can help.
  • Encourage other non-federal funding mechanisms to blind reviewers or if they have the budget, to do a study where each application is evaluated blinded and open. Compare the scores and who gets funded.

Multi-state models from PanDDA

Galen Correy
08 August 2022
tags: #how_to

Background

The pan-dataset density analysis (PanDDA) tool developed by Nick Pearce and colleagues at the XChem facility of the Diamond Light Source is a super powerful method for identifying low occupancy states in X-ray crystallography data [1,2]. Why do we care about low occupancy states? For one thing, the field of fragment-based drug discovery relies on tools to identify weakly bound ligands [3,4]. When fragments are soaked into protein crystals, the occupancy of the fragment (i.e. the proportion of protein molecules with a fragment bound) can often be relatively low (e.g. 10-20%). PanDDA helps to identify low occupancy fragments by subtracting the ground-state electron density (i.e. the electron density when no ligand is present) from the changed-state electron density (i.e. the electron density when the ligand is present) [1]. In addition to transforming crystallographic fragment screening, PanDDA can also help to identify and model larger ligands that may bind with relatively high affinity compared to fragments, but still have relatively low occupancy. This discrepancy can arise because ligand occupancy in soaking experiments does not necessarily correlate with binding affinity as measured by solution-based methods. One reason for this is low ligand solubility; it may be difficult to reach 1:1 stoichiometry in a soaking experiment. Another reason is that a binding site may be partially obstructed, or otherwise stabilized in a conformation that decreases the ligand occupancy. The presence of low occupancy states is a fundamental challenge of using crystallographic soaking experiments for determining ligand structures: identifying and resolving these states is the reason that PanDDA is such a powerful method.
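To make the subtraction idea concrete, here is a toy sketch in plain Python. It assumes, purely for illustration, that the observed density is a linear, occupancy-weighted mixture of a bound state and a ground state, and that the occupancy is already known. PanDDA itself works on 3D maps from many datasets and estimates the weighting from the data [1]; the function and variable names below are hypothetical and are not part of PanDDA.

```python
def recover_bound_density(observed, ground, occupancy):
    """Recover the bound-state density from an occupancy-weighted mixture.

    Assumes (for illustration only) that each observed value equals
    occupancy * bound + (1 - occupancy) * ground.
    """
    if not 0.0 < occupancy <= 1.0:
        raise ValueError("occupancy must be in (0, 1]")
    return [
        (obs - (1.0 - occupancy) * gnd) / occupancy
        for obs, gnd in zip(observed, ground)
    ]


# Toy 1D "density" sampled along a line through a binding site (arbitrary units).
ground = [1.0, 1.0, 0.2, 0.2, 1.0]  # weak solvent density where the ligand would sit
bound = [1.0, 1.0, 2.0, 2.0, 1.0]   # ligand density at positions 2 and 3
occupancy = 0.2                     # only 20% of molecules have the ligand bound

# What the crystal reports: dominated by the ground state.
observed = [occupancy * b + (1.0 - occupancy) * g for b, g in zip(bound, ground)]

# Subtracting the weighted ground state and rescaling recovers the ligand signal.
recovered = recover_bound_density(observed, ground, occupancy)
```

In the observed mixture the ligand positions barely rise above the solvent level, which is why a conventional map at 10-20% occupancy can look uninterpretable until the ground state is subtracted.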

PanDDA is a powerful tool for identifying low occupancy states, but it presents crystallographers with a new challenge: actually modeling the states it identifies! The best option is to model both states using alternate location (altloc) identifiers in the coordinate file to distinguish ligand-bound and ligand-free states [1,5] (this results in what we call a multi-state model). However, these multi-state models can be difficult to interpret/visualize, especially for the vast majority of users who are only interested in the ligand-bound state. A related issue is that we want to ensure that users can easily examine the PanDDA event maps that were used to model a ligand. For our recent preprint describing the design and structure-based optimization of ligands targeting the Nsp3 macrodomain, we modeled all the structures using a multi-state approach [6]. We’ve taken the following steps to disseminate the structures and maps as rapidly and helpfully as possible.

  1. Multi-state coordinate files and structure factor intensities have been deposited in the PDB (with RELEASE NOW selected)

  2. Structure factor intensities in MTZ format, Dimple output, PanDDA event/Z-maps, refined structures and ligand-bound states are available to download from Zenodo

  3. Diffraction images are available to download from https://proteindiffraction.org (search by PDB code)

How to extract the ligand-bound state in our multi-state models

Option 1

  • Download coordinates from PDB (e.g. fetch 5SQP in PyMOL)

  • Remove the altloc A coordinates - these correspond to the ligand-free state (remove alt A in PyMOL)

  • The coordinates can then be visualized or saved as a coordinate file (save 5SQP_ligand-bound.pdb in PyMOL)
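If PyMOL is not at hand, the same filtering can be sketched in plain Python on PDB-format text. This is a minimal sketch, assuming a fixed-width PDB file in which the altloc identifier sits in column 17 of ATOM, HETATM, and ANISOU records. The `atom_line` helper and the three-atom toy model are hypothetical and exist only to make the example self-contained.

```python
def atom_line(serial, name, altloc, resname, chain, resseq, xyz, occ):
    """Format one fixed-width PDB ATOM record (helper for the toy example)."""
    x, y, z = xyz
    return (f"ATOM  {serial:5d} {name:<4s}{altloc}{resname:>3s} {chain}{resseq:4d}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{20.0:6.2f}")


def drop_altloc(pdb_text, altloc="A"):
    """Return PDB text without the atom records that carry the given altloc."""
    kept = []
    for line in pdb_text.splitlines():
        is_atom = line.startswith(("ATOM", "HETATM", "ANISOU"))
        # Column 17 (index 16) of the fixed-width format holds the altloc.
        if is_atom and len(line) > 16 and line[16] == altloc:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Toy multi-state model: residue 1 has a ligand-free (A) and a ligand-bound (B)
# conformer, residue 2 has a single conformer with no altloc.
toy_model = "\n".join([
    atom_line(1, " CA ", "A", "ALA", "A", 1, (10.0, 10.0, 10.0), 0.80),
    atom_line(2, " CA ", "B", "ALA", "A", 1, (10.5, 10.0, 10.0), 0.20),
    atom_line(3, " CA ", " ", "GLY", "A", 2, (13.0, 10.0, 10.0), 1.00),
])

ligand_bound = drop_altloc(toy_model, altloc="A")
```

Note that the remaining atoms keep their original altloc labels and occupancies, so the output is the ligand-bound state as modeled rather than a cleaned-up single-conformer file.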

Option 2

  • Use this PyMOL script to fetch the coordinates using the PDB code and extract the ligand-bound state

  • This script removes the altloc records for residues that only have a single conformation modeled in the ligand-bound state and renames the altloc records for residues with multiple conformations (Alternatively: the ligand-bound states can be downloaded directly from Zenodo)
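The relabeling described in Option 2 can be sketched as follows. This is not the linked PyMOL script; it is a plain-Python approximation that assumes the ligand-free altloc has already been removed, blanks the altloc of residues left with a single conformer, and renames the remaining labels of multi-conformer residues alphabetically starting from A. Occupancies are left untouched, which is an assumption of this sketch. The `atom_line` helper and toy model are hypothetical.

```python
import string
from collections import defaultdict

ATOM_RECORDS = ("ATOM", "HETATM", "ANISOU")


def atom_line(serial, name, altloc, resname, chain, resseq, xyz, occ):
    """Format one fixed-width PDB ATOM record (helper for the toy example)."""
    x, y, z = xyz
    return (f"ATOM  {serial:5d} {name:<4s}{altloc}{resname:>3s} {chain}{resseq:4d}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{20.0:6.2f}")


def has_altloc(line):
    """True for atom records that carry a non-blank altloc in column 17."""
    return line.startswith(ATOM_RECORDS) and len(line) > 26 and line[16] != " "


def relabel_altlocs(pdb_text):
    """Blank single-conformer altlocs; rename multi-conformer ones to A, B, ..."""
    lines = pdb_text.splitlines()
    # Collect the altloc labels present in each residue (chain, resSeq + iCode).
    labels = defaultdict(set)
    for line in lines:
        if has_altloc(line):
            labels[(line[21], line[22:27])].add(line[16])
    relabelled = []
    for line in lines:
        if has_altloc(line):
            present = sorted(labels[(line[21], line[22:27])])
            if len(present) == 1:
                new_label = " "  # only one conformer left: drop the label
            else:
                new_label = string.ascii_uppercase[present.index(line[16])]
            line = line[:16] + new_label + line[17:]
        relabelled.append(line)
    return "\n".join(relabelled) + "\n"


# Toy ligand-bound state after altloc A was removed: residue 1 keeps only B,
# residue 2 keeps B and C, residue 3 never had an altloc.
toy_bound = "\n".join([
    atom_line(1, " CA ", "B", "ALA", "A", 1, (10.5, 10.0, 10.0), 0.20),
    atom_line(2, " CA ", "B", "SER", "A", 2, (13.0, 10.0, 10.0), 0.10),
    atom_line(3, " CA ", "C", "SER", "A", 2, (13.4, 10.2, 10.0), 0.10),
    atom_line(4, " CA ", " ", "GLY", "A", 3, (16.0, 10.0, 10.0), 1.00),
])

cleaned = relabel_altlocs(toy_bound)
```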

How to inspect PanDDA event maps

Option 1

  • Use this script to extract the PanDDA event map from the deposited structure factor CIFs (discussed here)
  • The resulting map coefficients in MTZ format can be converted to CCP4 format using phenix.mtz2map.

Option 2

  • Download the PanDDA event map in .ccp4 format from Zenodo. (Note: use COOT version 0.8.9.2 to visualize maps.)

Where to next?

Our goal is to use macromolecular structural information to make ligand discovery more efficient. We think that identifying and modeling low occupancy states is critical to this endeavor. Developing automated ways to model the low occupancy states identified by PanDDA is a long-term goal. This will speed up ligand modeling and reduce the error/bias that is often associated with manual approaches.

References

[1] Pearce, N. M., Krojer, T., Bradley, A. R., Collins, P., Nowak, R. P., Talon, R., Marsden, B. D., Kelm, S., Shi, J., Deane, C. M. & von Delft, F. A multi-crystal method for extracting obscured crystallographic states from conventionally uninterpretable electron density. Nat. Commun. 8, 15123 (2017).

[2] Schuller, M., Correy, G. J., Gahbauer, S., Fearon, D., Wu, T., Díaz, R. E., Young, I. D., Carvalho Martins, L., Smith, D. H., Schulze-Gahmen, U., Owens, T. W., Deshpande, I., Merz, G. E., Thwin, A. C., Biel, J. T., Peters, J. K., Moritz, M., Herrera, N., Kratochvil, H. T., QCRG Structural Biology Consortium, Aimon, A., Bennett, J. M., Brandao Neto, J., Cohen, A. E., Dias, A., Douangamath, A., Dunnett, L., Fedorov, O., Ferla, M. P., Fuchs, M. R., Gorrie-Stone, T. J., Holton, J. M., Johnson, M. G., Krojer, T., Meigs, G., Powell, A. J., Rack, J. G. M., Rangel, V. L., Russi, S., Skyner, R. E., Smith, C. A., Soares, A. S., Wierman, J. L., Zhu, K., O’Brien, P., Jura, N., Ashworth, A., Irwin, J. J., Thompson, M. C., Gestwicki, J. E., von Delft, F., Shoichet, B. K., Fraser, J. S. & Ahel, I. Fragment binding to the Nsp3 macrodomain of SARS-CoV-2 identified through crystallographic screening and computational docking. Sci Adv 7, (2021).

[3] Erlanson, D. A., McDowell, R. S. & O’Brien, T. Fragment-based drug discovery. J. Med. Chem. 47, 3463–3482 (2004).

[4] Murray, C. W. & Rees, D. C. The rise of fragment-based drug discovery. Nat. Chem. 1, 187–192 (2009).

[5] Pearce, N. M., Krojer, T. & von Delft, F. Proper modelling of ligand binding requires an ensemble of bound and unbound states. Acta Crystallogr D Struct Biol 73, 256–266 (2017).

[6] Gahbauer, S., Correy, G. J., Schuller, M., Ferla, M. P., Doruk, Y. U., Rachman, M., Wu, T., Diolaiti, M., Wang, S., Jeffrey Neitz, R., Fearon, D., Radchenko, D., Moroz, Y., Irwin, J. J., Renslo, A. R., Taylor, J. C., Gestwicki, J. E., von Delft, F., Ashworth, A., Ahel, I., Shoichet, B. K. & Fraser, J. S. Structure-based inhibitor optimization for the Nsp3 Macrodomain of SARS-CoV-2. bioRxiv 2022.06.27.497816 (2022). doi:10.1101/2022.06.27.497816


Fraser Lab DEIJ Journal Club - Examining the STEM Pipeline Metaphor

Christian Macdonald
10 June 2022
tags: #deij_jc

Background
A group of scientists within the Fraser lab have begun a journal club centered around issues of diversity, equity, inclusion, and justice within academia, specifically in the biological sciences.

Our goal is to provide an environment for continued learning, critical discussion, and brainstorming action items that individuals and labs can implement. Our discussions and proposed interventions reflect our own opinions based on our personal identities and lived experiences, and may differ from the identities and experiences of others. We will recap our discussions and proposed action items through a series of blog posts, and encourage readers to directly engage with DEIJ practitioners and their scholarship to improve your environment.

June 10th, 2022 – The STEM Pipeline

Discussion Leader: Chris Macdonald

Articles:

  • Problematizing the STEM Pipeline Metaphor: Is the STEM Pipeline Metaphor Serving Our Students and the STEM Workforce? Cannady MA, Greenwald E, and Harris KN. DOI: 10.1002/sce.21108
  • Reimagining the Pipeline: Advancing STEM Diversity, Persistence, and Success. Allen-Ramdial SAA, and Campbell AG. DOI: 10.1093/biosci/biu076
  • Improving Underrepresented Minority Student Persistence in STEM. Estrada et al. DOI: 10.1187/cbe.16-01-0038

Bonus Article: Planting Equity: Using What We Know to Cultivate Growth as a Plant Biology Community. Montgomery BL. DOI: 10.1105/tpc.20.00589

Summary:

STEM graduates require extensive education, and progressively demand more specialized and advanced training. This has some implications for DEI work. One important one is that each educational level has compounding effects on the following ones. The common metaphor of a “STEM pipeline” has been used to capture this idea, where learners who move away from a STEM career trajectory are the leaks. In a DEI context, this means differential leakiness would be important to consider. Metaphors can be useful by simplifying complex systems and helping us reason about them. That assumes they accurately capture the important dynamics of the system, however. If they don’t, they can hinder our thinking. Some have claimed that the pipeline metaphor is such a case, challenging both its accuracy and the helpfulness of the interventions it suggests.

I picked these three papers because they critically evaluate the value and accuracy of the metaphor and suggest policies to achieve the outcomes we want (a diverse and equitable environment) but that might not come directly from thinking about leaks.

  • [Cannady et al.] uses longitudinal data on students in the US to see if the metaphor is accurate, and claims it is not.
  • [Allen-Ramdial et al.] builds off the inaccuracy of the metaphor and suggests policies that the “pipeline” might not suggest.
  • [Estrada et al.] is a product of the Joint Working Group on Improving Underrepresented Minorities (URMs) Persistence in Science, Technology, Engineering, and Mathematics (STEM), which was convened by NIGMS and HHMI. It is an example of how a large working group can adapt the criticisms of the previous two papers and propose policies to achieve an equitable environment.

As I was picking the papers for our discussion, I also thought about alternative metaphors we might use and whether they would help us think differently. I discovered the article by [Beronda L. Montgomery], which offered a wonderful example of a very different way of thinking about education that would lead us to do different things as a result.

Key Points:

  • The metaphor may not be accurate: similar numbers of underrepresented minority students and non-underrepresented minority students enter STEM majors, and similar proportions remain through undergraduate education.
  • The metaphor leads us to think that trajectories are strictly one way (you can’t unleak), while in fact there is much more fluidity in practice.
  • The metaphor focuses our attention on individual failures (the leaks) rather than institutional ones (the pipes).
  • There is an important distinction between an institution’s culture, which is essentially the beliefs, policies, and values that guide behavior, and its climate, which is the result of the actual implementation of them. An institution may have an unwelcoming or harmful climate while still having a healthy culture, but the pipeline metaphor focuses our attention on policy rather than implementation.

Open Questions:

  • Is “STEM” a useful category, or is it too broad?
  • What sorts of trajectories do “typical” successful scientists follow? What is the definition of “success” in STEM?
  • What differentiates “leaky” institutions from others?
  • How can we take the useful features of the pipeline metaphor and avoid the harmful ones?
  • How does the overall educational landscape influence DEI efforts at the post-secondary levels and beyond?

Proposed Action Items:

We broadly agree with the policies suggested by [Allen-Ramdial et al.] and [Estrada et al.], although they are larger-scale interventions. In particular:

  • Engage across institutions. Faculty at minority-serving institutions play essential but often ignored roles in diversifying STEM, and DEI initiatives at research-intensive institutions sometimes only engage with other research-intensive institutions. Programs that connect faculty across institutional boundaries can contribute to diversifying trainee access to career opportunities.
  • Focus on aligning culture and climate. Ask how students and trainees feel, and listen to them. A failure of good intentions may be a result of both culture and climate.
  • Take faculty involvement in DEI seriously. Effective and long-term DEI efforts are much more useful than broad but shallow activities. Institutions can encourage deep engagement by evaluating faculty DEI work on par with teaching and research.
  • At an individual level, we found rethinking our metaphors can be a useful exercise. Ask yourself: what sort of environments would I like to create? Are the concepts I deploy sufficient to get there? Are they accurate? Are there alternatives?

So you want to do a structural bioinformatic analysis…

Stephanie Wankowicz
10 May 2022
tags: #how_to

Over the past two years I have done a bunch of structural bioinformatic work, resulting in the paper Ligand binding remodels protein side chain conformational heterogeneity. And I made A LOT of mistakes.

Below are many of the lessons, guidelines, and pitfalls for a structural bioinformatic analysis. While many of the principles below are specifically tailored to a paired analysis (such as apo versus holo or peptide bound versus small molecule bound), these guidelines can help with any structural bioinformatics project.

For specific suggestions, I have the code I created linked at the bottom of each section. This code is built on bash, python, Phenix/cctbx, and qFit. The code should be easily adaptable to other projects/inquiries. If there are any questions, feel free to contact me.

Define your selection criteria early.

Before you start downloading structures, you need to decide what structures you would like to highlight. Some of these items can be subsetted using the PDB advanced selection criteria, including:

  1. Method of structure (X-ray, CryoEM, NMR, Neutron, etc.)
  2. Resolution
  3. Cryo or Room Temperature
  4. Size of the protein
  5. Type of ligands
  6. Single or multidomain proteins

You may also want to cross-check these structures with external databases (ChEMBL, UniProt, etc.). You can do much of this work on the PDB website in their advanced search section.

Once you get a list of structures with your initial criteria, you can parse the header of the PDB file, or get other statistics about the data from the MTZ file with a program like phenix.mtz_dump.
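As a rough illustration of header parsing, the sketch below pulls the resolution, unit cell, and space group out of PDB-format text using the fixed-width REMARK 2 and CRYST1 records. It is a minimal sketch, not the linked pipeline; `parse_header` is a hypothetical name and the header values are made up.

```python
def parse_header(pdb_text):
    """Extract resolution, unit cell, and space group from PDB-format text."""
    info = {"resolution": None, "cell": None, "space_group": None}
    cell_columns = [(6, 15), (15, 24), (24, 33), (33, 40), (40, 47), (47, 54)]
    for line in pdb_text.splitlines():
        if line.startswith("REMARK   2 RESOLUTION."):
            fields = line.split()
            try:
                info["resolution"] = float(fields[3])
            except (IndexError, ValueError):
                pass  # e.g. "NOT APPLICABLE" for entries without a resolution
        elif line.startswith("CRYST1"):
            # a, b, c (Angstrom) then alpha, beta, gamma (degrees)
            info["cell"] = tuple(float(line[i:j]) for i, j in cell_columns)
            info["space_group"] = line[55:66].strip()
    return info


# Made-up header, built with format specifiers so the columns line up.
toy_header = "\n".join([
    "REMARK   2 RESOLUTION.    1.05 ANGSTROMS.",
    "CRYST1" + f"{88.9:9.3f}{88.9:9.3f}{39.5:9.3f}{90:7.2f}{90:7.2f}{90:7.2f} "
    + f"{'P 43':<11s}{8:4d}",
])

header = parse_header(toy_header)
```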

This is the stage where you will start creating pairs of structures. Some criteria you will want to think of at this stage include:

  • Unit cell dimensions and angles
  • Space group
  • Sequence (get this from the PDB and not from another database to know which residues were actually resolved in the structure)
  • Ligand types/crystallographic additives (how much overlap do you want between the paired structures?)
  • Experimental methods such as crystallographic conditions (this will be trickier, but may be important and worth going through headers manually)

At this stage, I suggest keeping duplicate pairs (i.e. if you have multiple apo or wildtype proteins for each holo or mutant protein). Many structures will be thrown out downstream and it can be helpful to have ‘back ups’.
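A pairing step along these lines might look like the sketch below. It is an illustration under stated assumptions, not the linked pipeline: the `Entry` record, the 1 Å and 1° tolerances, the requirement of an identical sequence, and the PDB-like IDs are all hypothetical choices you would tune for your own project. Every matching apo is kept for each holo, so duplicate pairs survive as back ups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    pdb_id: str
    space_group: str
    cell: tuple      # (a, b, c, alpha, beta, gamma)
    sequence: str    # resolved sequence taken from the structure itself


def cells_match(cell_a, cell_b, length_tol=1.0, angle_tol=1.0):
    """True if all cell lengths and angles agree within the given tolerances."""
    lengths_ok = all(abs(x - y) <= length_tol for x, y in zip(cell_a[:3], cell_b[:3]))
    angles_ok = all(abs(x - y) <= angle_tol for x, y in zip(cell_a[3:], cell_b[3:]))
    return lengths_ok and angles_ok


def find_pairs(apos, holos):
    """Return every (apo, holo) ID pair that passes all criteria."""
    pairs = []
    for holo in holos:
        for apo in apos:
            if (apo.space_group == holo.space_group
                    and apo.sequence == holo.sequence
                    and cells_match(apo.cell, holo.cell)):
                pairs.append((apo.pdb_id, holo.pdb_id))
    return pairs


apos = [
    Entry("APO1", "P 43", (88.9, 88.9, 39.5, 90.0, 90.0, 90.0), "MKVLA"),
    Entry("APO2", "P 43", (89.2, 89.2, 39.7, 90.0, 90.0, 90.0), "MKVLA"),
    Entry("APO3", "P 21 21 21", (50.0, 60.0, 70.0, 90.0, 90.0, 90.0), "MKVLA"),
]
holos = [Entry("HOL1", "P 43", (88.7, 88.7, 39.4, 90.0, 90.0, 90.0), "MKVLA")]

pairs = find_pairs(apos, holos)
```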

Here is a pipeline you can use to select the PDBs to move forward in your analysis.

Re-refine structures.

The PDB has a lot of structures refined with many different software packages and versions. To ensure that you are comparing apples to apples, pick one refinement software version and re-refine all of your structures.

The software that I used was phenix.refine.

Unless you know exactly how you want to refine your structures, spend some time with ~15 structures and play around with refinement strategies. Some things to think about:

  • Do you want different resolution cutoffs to have different refinement strategies?
  • Are you going to refine anisotropically or with hydrogens?
  • How are different refinement strategies impacting the R-free or R-gap of structures?

Once you have a refinement script you are happy with, test it on ~50 PDBs, find errors, and adjust from there. As the PDB files you are feeding your refinement strategy may be labeled in many different ways, you are likely going to have to build in flags as well as if statements to refine the structures.

If 80% of your structures are re-refined, move on. Send bugs to the respective software groups, and accept your losses (trust me, they are not worth it!).
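One way to keep track of that success rate is a small batch driver that records which structures fail and why. The sketch below assumes you wrap your own refinement call in a function that raises an exception on failure; `run_batch`, `fake_refine`, and the IDs are hypothetical stand-ins, and no real refinement program is called here.

```python
def run_batch(pdb_ids, refine_one):
    """Run refine_one on each ID; collect successes, failures, and the success fraction."""
    succeeded, failed = [], {}
    for pdb_id in pdb_ids:
        try:
            refine_one(pdb_id)
        except Exception as err:  # keep going so one bad entry cannot stop the batch
            failed[pdb_id] = f"{type(err).__name__}: {err}"
        else:
            succeeded.append(pdb_id)
    fraction = len(succeeded) / len(pdb_ids) if pdb_ids else 0.0
    return succeeded, failed, fraction


def fake_refine(pdb_id):
    """Stand-in for a real refinement call; fails for IDs ending in 'X'."""
    if pdb_id.endswith("X"):
        raise RuntimeError("unrecognized ligand restraint")


ids = ["AAA1", "AAA2", "AAAX", "AAA3", "AAA4"]
succeeded, failed, fraction = run_batch(ids, fake_refine)

# Apply the 80% rule of thumb from the text.
ready_to_move_on = fraction >= 0.8
```

Grouping the messages in `failed` by error type is a quick way to see which flags or if statements are still missing from the refinement script.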

Here is an example re-refinement pipeline that works with Phenix version 1.19.

Here is a pipeline that will re-refine your structures, run qFit, and then refine your qFit structure.

Quality control of structures.

The first, and easiest part of quality control of structures has to do with refinement metrics. Are you decreasing the R-free or R-gap? This can be extracted through refinement log files or running additional analysis. Are there any clashes in the structure? Are there Ramachandran outliers?
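For the R-value check, the arithmetic is simple once the values are extracted. The sketch below takes R-gap to be R-free minus R-work and compares deposited with re-refined values; the numbers are made up, and how you pull them from your log files will depend on your refinement software.

```python
def r_gap(r_work, r_free):
    """Gap between R-free and R-work (taken here as R-free minus R-work)."""
    return r_free - r_work


def compare_refinement(before, after):
    """Summarize how R-free and the R-gap changed after re-refinement."""
    return {
        "delta_r_free": after["r_free"] - before["r_free"],
        "gap_before": r_gap(before["r_work"], before["r_free"]),
        "gap_after": r_gap(after["r_work"], after["r_free"]),
    }


# Hypothetical values for one structure.
deposited = {"r_work": 0.190, "r_free": 0.240}
rerefined = {"r_work": 0.185, "r_free": 0.225}

summary = compare_refinement(deposited, rerefined)
improved = summary["delta_r_free"] < 0 and summary["gap_after"] <= summary["gap_before"]
```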

With pairs, you want to assess how well they align together in 3D. Aligning them is a beast in and of itself. Because structures with the same sequence can have different chain IDs or residue numbers, you will need to match those up, as all downstream analyses rely on this.

There are many different methods to align structures, but I landed on alpha-carbon alignment in PyMOL. This did not work well for 100% of the paired structures I had, but it worked for the majority of them.

I then required all chains start with residue 1. Then, as I was working with paired structures, I based all holo structures off of apo structures. Therefore, I reassigned the closest geometric chain in the holo to the chain in the apo.
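Reassigning chains by geometry can be sketched as a nearest-centroid match. This is an illustration, not the code from the linked pipeline: it assumes the two structures are already superposed, represents each chain as a plain list of coordinates, and makes no attempt to stop two holo chains from mapping to the same apo chain.

```python
import math


def centroid(coords):
    """Mean position of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def match_chains(apo_chains, holo_chains):
    """Map each holo chain ID to the apo chain ID with the nearest centroid."""
    apo_centroids = {cid: centroid(xyz) for cid, xyz in apo_chains.items()}
    mapping = {}
    for holo_id, xyz in holo_chains.items():
        holo_centroid = centroid(xyz)
        mapping[holo_id] = min(
            apo_centroids,
            key=lambda cid: math.dist(apo_centroids[cid], holo_centroid),
        )
    return mapping


# Toy example: the holo structure uses swapped chain labels.
apo_chains = {
    "A": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "B": [(30.0, 0.0, 0.0), (32.0, 0.0, 0.0)],
}
holo_chains = {
    "A": [(30.2, 0.1, 0.0), (32.1, 0.0, 0.0)],
    "B": [(0.1, 0.0, 0.0), (2.2, 0.1, 0.0)],
}

chain_map = match_chains(apo_chains, holo_chains)
```

Here holo chain A is relabeled as apo chain B and vice versa, after which residue-level comparisons can use matching chain IDs.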

Some additional criteria you will want to think about in this stage include:

  • How well do the backbones of structures line up? This can be assessed by alpha carbon RMSD between the structures. While some analyses may want to keep large changes, others may want to throw them out.
  • How much do the ligands overlap between the structures?
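The alpha carbon RMSD itself is a short calculation once the structures are superposed and the residue numbering matches. The sketch below assumes both of those steps are done and takes the alpha carbon coordinates as dictionaries keyed by residue number; it does no superposition of its own, and the coordinates are made up.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD over alpha carbons present in both structures (no superposition)."""
    shared = sorted(set(coords_a) & set(coords_b))
    if not shared:
        raise ValueError("no residues in common")
    squared = sum(math.dist(coords_a[r], coords_b[r]) ** 2 for r in shared)
    return math.sqrt(squared / len(shared))


# Residue number -> alpha carbon (x, y, z). Residue 3 is missing from the holo
# structure, so it is skipped rather than treated as a large deviation.
apo_ca = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0), 3: (7.6, 0.0, 0.0)}
holo_ca = {1: (3.0, 4.0, 0.0), 2: (6.8, 4.0, 0.0)}

rmsd = ca_rmsd(apo_ca, holo_ca)
```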

Here is a pipeline that will extract and compare R-values, align your pairs, and spit out alpha RMSD and ligand overlap between the pairs.

Analysis of structures

Now we get to the fun stuff!

Before we can run any analysis, you need to think about how you want to extract information from the structures. Are you going to do it based on chain and residue numbering or based on location? I chose the former as it is easier downstream. However, this required me to reassign the chain and residue numbers for many structures (see above).

The other thing to think about when comparing structures is if there are duplicates. In my case, I had multiple holo assigned to multiple apo. Therefore in the analysis, it was important to keep track of not just the PDB, but also the PDB’s matched pair.

Finally, you also need to consider how you are going to look at certain sections of the PDB. For example, I wanted to examine binding site residues. But my criteria (any residue heavy atom within 5 Å of any ligand heavy atom) sometimes gave me one or two different residues in the holo or apo depending on how much those residues moved. I decided to look at the union of those two lists, but you could also look at the intersection of those two lists.
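The binding site criterion and the union/intersection choice can be sketched as below. This is a minimal sketch rather than the linked analysis code. It assumes the apo and holo structures are superposed and share numbering, and that the ligand coordinates from the holo structure are used to define the site in both; the atom tuples and coordinates are hypothetical.

```python
import math


def binding_site(protein_atoms, ligand_atoms, cutoff=5.0):
    """Residues with any heavy atom within `cutoff` Angstrom of any ligand heavy atom.

    protein_atoms: iterable of (chain, resseq, element, (x, y, z))
    ligand_atoms:  iterable of (element, (x, y, z))
    """
    ligand_heavy = [xyz for element, xyz in ligand_atoms if element != "H"]
    site = set()
    for chain, resseq, element, xyz in protein_atoms:
        if element == "H":
            continue
        if any(math.dist(xyz, lig_xyz) <= cutoff for lig_xyz in ligand_heavy):
            site.add((chain, resseq))
    return site


ligand = [("C", (0.0, 0.0, 0.0))]
holo_atoms = [
    ("A", 10, "C", (4.0, 0.0, 0.0)),
    ("A", 11, "N", (4.5, 0.0, 0.0)),
    ("A", 12, "O", (7.0, 0.0, 0.0)),
    ("A", 13, "H", (1.0, 0.0, 0.0)),  # hydrogens are ignored
]
apo_atoms = [
    ("A", 10, "C", (4.0, 0.0, 0.0)),
    ("A", 11, "N", (6.0, 0.0, 0.0)),  # moved away in the apo structure
    ("A", 12, "O", (4.8, 0.0, 0.0)),  # moved closer in the apo structure
]

holo_site = binding_site(holo_atoms, ligand)
apo_site = binding_site(apo_atoms, ligand)

site_union = holo_site | apo_site          # every residue near the ligand in either state
site_intersection = holo_site & apo_site   # only residues near the ligand in both states
```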

Here are a bunch of analyses that I ran on my pairs or individual models.

Quality control of the analysis

For almost every single analysis I did, I would plot the result and have a few outrageous outliers. This was always a clue that I had coded something wrong in the analysis, or that something was incorrect about the labeling of the PDBs.

When looking at the result of your analysis, always look at the minimum and maximum values on both an individual basis (i.e. if you are looking at some sort of residue metric) and a structure basis. Take at least the top and bottom five metric values and go through checking for the following:

  • Is residue 1, chain A of structure 1 within your RMSD cutoff of residue 1, chain A of structure 2?
  • If you manually calculate the metric you are measuring, does it match what your code says?
  • Look at the structure in PyMOL or Chimera. Does the numerical value of the metric line up with what you are visualizing?

Repeat this process until you can visually/biologically explain at least the top and bottom five metric values.
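Pulling out the top and bottom five values is easy to automate so that the manual inspection always starts from the same list. The sketch below assumes the metric is stored in a dictionary keyed by (PDB ID, chain, residue number); the IDs and values are made up.

```python
def extremes(metrics, n=5):
    """Return the n lowest and n highest (key, value) entries of a metric."""
    ranked = sorted(metrics.items(), key=lambda item: item[1])
    lowest = ranked[:n]
    highest = ranked[-n:][::-1]  # largest value first
    return lowest, highest


# Hypothetical per-residue metric for one structure.
values = [0.1, 0.3, 0.2, 5.0, 0.4, 0.25, 0.15, 0.35, 0.05, 0.45, 0.5, 0.6]
metrics = {("HOL1", "A", resseq): value for resseq, value in enumerate(values, start=1)}

lowest, highest = extremes(metrics, n=5)

# Print the entries to inspect by hand and in a molecular viewer.
for key, value in highest + lowest:
    print(key, value)
```

In this toy data, residue 4 stands out with a value of 5.0, which is exactly the kind of entry to open in a viewer before trusting the rest of the distribution.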