Blind (CAPRI-style) predictions with RosettaDock


Introduction

    In the refinement tutorial, you learned how to sample around a binding site of a protein, which is a small defined area.  This is the simplest case of large-scale docking.  However, the complete conformational space of protein-protein docking is immense, with three translational and three rotational degrees of freedom for each docking partner producing millions of possible coarse-grained conformations.  A single Monte Carlo simulation beginning from a random start position is likely to sample only a small fraction of this available conformational space.  RosettaDock partially reduces this sampling problem by coupling each Monte Carlo simulation with explicit minimization, thus ensuring that each Monte Carlo simulation finds the nearest local minimum, though this may or may not be the global minimum.  Nonetheless, accurate sampling of conformational space generally requires several thousand decoys for a blind prediction run.

    With large-scale simulations, RosettaDock consistently predicts complex structures to within 5.0A rmsd for most docking targets of fewer than 450 residues, as demonstrated on Chen's docking benchmark and in five rounds of the Critical Assessment of PRedicted Interactions (CAPRI) blind docking challenge.

    RosettaDock has many applications of experimental and commercial interest.  Experimentalists may want to predict the structure of a complex they are studying in order to discover insights into the function of the complex.  Pharmaceutical researchers may want to predict a complex between a therapeutic antibody they have developed and the antigen target in order to understand the mechanism of action of the antibody and to suggest experiments for improving antigen-antibody affinity.


<T15 capri prediction>

CAPRI target 15 (round 5) blind RosettaDock prediction
Blue: colicin D; red: immD (predicted model 7); green: immD (1v74.pdb)
79% native contacts, 0.55A rmsd, 0.24A interface rmsd


Constrain the docking search when possible

The docking search can be simplified in many cases when there is biological information to indicate the general region of binding on one or both partners.  There are three major ways to reduce the docking search space:

1) Pre-orient the suspected binding site toward the other partner.  This means you must design your starting structure intelligently (see preparing PDB files), but then you can apply -dock_pert to that partner rather than randomizing it.  This places the least constraint on docking, so it is the safest and generally the smartest way to limit the search space.

2) If the protein is multidomain and biological information strongly indicates that only one of the domains binds the other partner, then reduce that docking partner to the high-affinity domain.  This must be done with great caution.  Read the biological information carefully and do not remove portions of the protein that might interact with the other partner.  Consider the size of the two partners and be sure to leave a large enough portion of the trimmed partner to bind the entire width of the other partner. 

Note:  this also introduces another problem.  By trimming the protein, you have created a molecule that does not exist in reality.  Portions of the molecule that were previously buried will become hydrophobic surfaces that might attract the other binding partner.  For this reason it is appropriate to set a repulsive site constraint on newly exposed surfaces so docking does not occur there.

3) If binding sites are known on both partners, set a loose distance constraint (25-30A) between the two binding sites. 

Docking flags for blind runs

If you need to review docking flags, see details of docking flags.

The three key flags here are '-dock_pert', '-spin', and '-randomize'.

In the default case, where you want to put no constraints on either docking partner, use '-randomize1 -randomize2' as search flags.

If you have pre-oriented partner 1 but wish to place no constraints on docking partner 2, you will still want to perturb partner 1.  The way to do this is to use '-dock_pert 5 15 25 -randomize2'.  The numbers on -dock_pert are arbitrary, but it is better to use larger numbers for a blind search than for a refinement run.

If you have pre-oriented both partners 1 and 2, you still want to perturb both partners and spin around the line of centers because you do not usually know the orientation of the two patches relative to one another.  The search flags in this case should be '-dock_pert 5 15 25 -spin'.
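The three scenarios above can be collected into a small wrapper.  This is an illustrative sketch only: the `scenario` variable and the case statement are hypothetical, not part of the docking scripts, but the flag strings are taken verbatim from the scenarios above.

```shell
# Illustrative sketch: map each blind-run scenario to its search flags.
# The wrapper is hypothetical; the flag strings come from the text above.
scenario="both_oriented"   # one of: none, partner1_oriented, both_oriented
case "$scenario" in
  none)              flags='-randomize1 -randomize2' ;;
  partner1_oriented) flags='-dock_pert 5 15 25 -randomize2' ;;
  both_oriented)     flags='-dock_pert 5 15 25 -spin' ;;
esac
echo "search_flags='$flags'"
```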

Method 1:  Find the best score on a score vs. arbitrary rms plot

If you have not already done so, read Using condor to run RosettaDock before you attempt a global run.

This method is the creation of Ora Furman at the University of Washington.  With no prior knowledge of the complex structure, it is good to generate 5K-10K decoys and then look for low-scoring outliers.  The physical basis for this is that the vast majority of orientational configurations will generate an unfavorable interaction between the partners, while the native orientation, if it is found, should stand out from the cloud.

Technically, this is done by simply generating 5000 decoys according to the protocol used for refinement but with the appropriate flags for a blind run as described above.

I suggest making a new config file called cal.config:

cp $rosetta_scripts/condor_scripts/test.config cal.config

Then, change the following parameters:

prefix='ZZ'
nstruct=5000
search_flags='-dock_pert 5 15 25 -spin'  (or the proper flag as described in the previous section)
Njobs=100

This example is already provided in test.config (commented out).
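The edits above can also be applied non-interactively.  Here is a minimal sketch, assuming the config uses shell-style name=value lines as shown; a stand-in test.config is generated so the commands are self-contained, but in practice you would copy the real file with cp as described above.

```shell
# Stand-in for $rosetta_scripts/condor_scripts/test.config, for illustration
# only; in practice, cp the real file as shown above.
cat > cal.config <<'EOF'
prefix='AA'
nstruct=10
search_flags='-dock_pert 3 8 8'
Njobs=10
EOF

# Apply the blind-run parameters from the text:
sed -i \
  -e "s/^prefix=.*/prefix='ZZ'/" \
  -e "s/^nstruct=.*/nstruct=5000/" \
  -e "s/^search_flags=.*/search_flags='-dock_pert 5 15 25 -spin'/" \
  -e "s/^Njobs=.*/Njobs=100/" cal.config
```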

Then, run crun.bash

crun.bash test cal

Condor should launch directly.

A pictorial illustration of a blind 5000-decoy run is given for the Baker lab's prediction of T12 (by Ora Furman). The following is a score vs. rmsd plot for 10,000 initial decoys from this run:

<score vs. rms for T12>

Remember that rmsd has no meaning in this case, since the native is not defined; rather, it is just a way to spread the points out to identify outliers.

In this case, the method identifies a group of points around (19,-230) that stands out clearly from the cloud.  Ora then refined these decoys and submitted the results to CAPRI.  This point turned out to be very accurate, within 1.0A rmsd, which qualifies as a high-accuracy prediction by CAPRI standards.
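One way to make the notion of an "outlier" concrete is a simple statistical cut, e.g. flagging decoys whose score falls more than three standard deviations below the mean.  This is only a sketch on synthetic data: the file name, scores, and 3-sd threshold are all assumptions for illustration, and visual inspection of the plot remains the real check.

```shell
# Synthetic example: 100 decoys scoring about -100, plus one outlier at -230.
{ for i in $(seq 1 100); do echo "decoy_$i -100"; done
  echo "decoy_101 -230"; } > scores.txt

# Flag any decoy scoring more than 3 standard deviations below the mean.
awk '{ name[NR]=$1; s[NR]=$2; sum+=$2; sq+=$2*$2 }
     END { n=NR; m=sum/n; sd=sqrt(sq/n - m*m)
           for (i=1; i<=n; i++) if (s[i] < m - 3*sd) print name[i], s[i] }' scores.txt
```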

Refine the outlier(s)
If this method identifies clear outliers (I would say the case above is clear), the next step is to refine the outlier structures as shown in the previous tutorial, refinement runs.  You should test for a binding funnel around that location.  If the refinement produces a better-scoring structure than the outlier, then use the refined structure as your model.

Please note that only the initial random search combined with the refinement can reliably identify correct answers.   The presence of a strong binding funnel in the refinement stage is a strong case that the decoy is a plausible model; the identification of outliers in the first stage is not strong evidence by itself.

If this method fails to find an outlier, the next step is to generate more decoys to improve the sampling.  If, after ~20K decoys, no outliers are found, it is best to move on to method 2 below.

Method 2:  Generate 100,000+ decoys and cluster the top 200

The logic of method 2 is to overwhelm the sampling problem with hundreds of thousands of decoys and then to analyze only the very best (top 1%) of these.  This is technically challenging because 100K+ decoys require a lot of disk space, so we get around the problem by using a 'smart scorefilter', which checks decoys against a reference score and keeps only the best 1 percent.

This protocol must be done in two steps.  The first is a calibration run of 5,000 decoys, which serves as a reference set for the 100K run.  You would have generated the calibration run in the course of trying method 1, so just use it as your calibration set.  Then you will perform the 100K run.

If you have not already done a calibration run, make and launch the config file 'cal.config' as described above.

Starting with cal.config, create a new condor config file called 'global.config'.  You will want to keep the search flags exactly the same as in cal.config.  Make the following changes from 'cal.config' to 'global.config':

prefix='GS'  (do not use 'ZZ' - this will mix calibration and global run decoys)
nstruct=1000
scorefilter='smart_scorefilter 0.01'  (This option is provided for you in test.config)

search_flags and Njobs are the same as cal.config.

Note:  The '-smart_scorefilter' flag is very important.  This is how you generate 100K decoys but only save 1,000 of them.  This is the only case in which you want to use this flag.  It will only work after the calibration has been run; otherwise there is no way to calculate a 1% reference score.
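The reference score itself is just the score at the top 1% of the calibration set.  A hedged sketch of that idea follows; the score file name and format are invented for illustration, and the actual filter is implemented inside the docking scripts.

```shell
# Synthetic calibration set: 200 scores from -1 down to -200 (lower = better).
seq 1 200 | awk '{ print -$1 }' > cal_scores.txt

n=$(wc -l < cal_scores.txt)
k=$(( (n + 99) / 100 ))                  # index of the 1% quantile, at least 1
ref=$(sort -g cal_scores.txt | sed -n "${k}p")
echo "reference score: $ref"             # keep only decoys scoring below this
```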

Now, launch the global script:

crun.bash test global

After this run, you will have 1000 decoys in the directory 'GS'.  Next, extract the top 200 of these 1000 and cluster them by rmsd.  A set of post-processing scripts is provided in $rosetta_scripts/docking for this purpose.  The protocol is described in post-processing large scale runs.
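Extracting the top 200 by score can be sketched with standard tools.  The two-column score table generated here is a stand-in (the file name and format are assumptions); the provided post-processing scripts should be preferred for the real workflow.

```shell
# Stand-in score table "decoy_name score" for the 1000 global-run decoys.
seq 1 1000 | awk '{ printf "GS_%04d %.1f\n", $1, -$1 }' > GS.scores

# Keep the 200 best-scoring decoys (lowest score first).
sort -g -k2,2 GS.scores | head -n 200 > top200.list
```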

Manually rank output structures

Whether your final set of decoys is a set of refined models from the outliers of a score vs. rms plot or a set of clustered decoys from a 100K run, you should carefully visually examine your final structures before ranking the models and/or performing further analysis.  Read manual structure analysis to complete your understanding of large-scale RosettaDock runs.
