Post-processing large scale runs

Back to index

Setup

Begin by copying the directory containing all of your pdb's from your cluster to your home machine. You then need to create a directory named after your pdb (the four-letter code you gave the starting structure), and place the directory of 1000 decoys within it.

If your pdb name is 'aa11' and your directory of 1000 decoys is called 'GS', I suggest the architecture

runs -> aa11 -> GS

I will use this directory convention in the rest of the tutorial.

Extracting

The idea of extraction is to focus your attention on the top 200 or so decoys from your large-scale run. These lowest-scoring decoys are the most likely to contain the correct answer, provided you have sampled enough of the search space to find the global minimum.

In 'runs', first type

pp_pdb.sh aa11

(pp stands for post-processing). In the directory 'aa11', you will now find directories called 'top10', 'top200', and 'top1000'. The most useful of these is 'top200'.

Clustering

The logic of clustering is that the more often a structure is found by Rosetta, the more likely that structure is true to reality. In terms of energy surfaces, the low-scoring structures represent deep portions of the energy surface, and the large clusters of structures (50-100 of the top 200) represent broad potential wells. It has been suggested that the cluster size of a structure might correspond to its entropy; that is, a large cluster has more substates than a small cluster, and thus a higher entropy and a lower free energy.
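To make the entropy argument slightly more explicit, the two standard statistical-mechanics relations behind it (not stated in the original, given here for reference) are:

```latex
S = k_B \ln W, \qquad G = H - T S
```

For two wells of comparable depth $H$, the broader well has more substates $W$, hence a higher entropy $S$ and a lower free energy $G$; a large cluster is the sampling signature of such a broad well.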

In runs:
Use pp_pdb2.sh, a wrapper script that calls pp_cluster_set.sh, which does the clustering. pp_cluster_set.sh uses the statistical program R to cluster the structures. You may get this error if you attempt to run pp_pdb2.sh on jazz or on a new machine in the lab.

R not available on this workstation ... no clustering possible
Consider pushing the files home with pp_push set.sh

This means you need to install R on your machine in a standard location like '/usr/bin/R'. R can be downloaded freely from the R Project website.

In runs, type
pp_pdb2.sh aa11 2.5

2.5 is the rmsd cutoff (in Angstroms) for clustering. You can raise it (e.g. to 5.0) if your cluster sizes are too small.
This clusters the top 10 and top 200 sets of structures. To cluster the top 1000 (this takes a while), type

pp_cluster_set.sh aa11 1000 2.5

The algorithm of (hierarchical) clustering is basically as follows:

1) calculate the pairwise rmsd of all 200 structures
2) find the closest pair and group them together (now there are 199 objects instead of 200)
3) repeat this pairing until there is only one object
4) go back and split all pairings that exceed the rmsd cutoff (2.5A in this case)
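The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual pp_cluster_set.sh/R implementation; in particular, it assumes single linkage (closest pair between clusters) and uses a made-up toy distance matrix in place of real pairwise rmsd's.

```python
def cluster(dist, cutoff):
    """Agglomerative clustering of a symmetric distance matrix, cut at `cutoff`."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # single-linkage: smallest pairwise distance between two clusters
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # find the closest pair of clusters
        d, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > cutoff:          # equivalent to cutting the tree at the rmsd cutoff
            break
        clusters[a] += clusters[b]
        del clusters[b]
    return sorted(clusters, key=len, reverse=True)

# toy 5x5 symmetric "rmsd" matrix: structures 0-2 are similar, 3-4 are similar
D = [[0.0, 1.0, 2.0, 9.0, 9.5],
     [1.0, 0.0, 1.5, 8.5, 9.0],
     [2.0, 1.5, 0.0, 9.2, 8.8],
     [9.0, 8.5, 9.2, 0.0, 2.0],
     [9.5, 9.0, 8.8, 2.0, 0.0]]

print(cluster(D, 2.5))   # -> [[0, 1, 2], [3, 4]]
```

With the 2.5 cutoff the five toy structures fall into two clusters, ordered by descending size just as the cluster1.pdb, cluster2.pdb numbering described below.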

Now, go into <pdbcode>/top200 and look at the file clusterscores.bysize.  It should look something like this:

The columns go as follows:

1) score rank of the top-scoring structure in that cluster
2) size of the cluster
3) pdb file of the top-scoring structure in that cluster
4) score of the top-scoring structure in that cluster
5) rms (meaningless because native is not defined)

  1 124  ba12a1.ppk_0003.pdb -297.40  34.14
  4  24  bR12a1.ppk_0001.pdb -292.81  31.39
  5  16  bk12a1.ppk_0008.pdb -292.33  33.30
  2  14  bp12a1.ppk_0008.pdb -293.22  32.59
  3   4  aU12a1.ppk_0003.pdb -293.12  32.98
  6   3  aG12a1.ppk_0003.pdb -291.65  52.69
  7   1  aL12a1.ppk_0009.pdb -291.48  34.73
#cluster score rank -  cluster size - best decoy - best score - rms
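If you want to manipulate this file programmatically, a small Python sketch follows. It assumes only the whitespace-separated five-column layout shown above; the dictionary keys and the 15-member significance threshold (the rule of thumb used in this tutorial) are illustrative choices.

```python
# Sketch of reading clusterscores.bysize, assuming the whitespace-separated
# columns shown above (score rank, size, pdb file, score, rms).
sample = """\
  1 124  ba12a1.ppk_0003.pdb -297.40  34.14
  4  24  bR12a1.ppk_0001.pdb -292.81  31.39
  5  16  bk12a1.ppk_0008.pdb -292.33  33.30
"""

clusters = []
for line in sample.splitlines():
    rank, size, pdb, score, rms = line.split()
    clusters.append({"score_rank": int(rank), "size": int(size),
                     "pdb": pdb, "score": float(score), "rms": float(rms)})

# flag clusters meeting the ~15-member significance rule of thumb
significant = [c for c in clusters if c["size"] >= 15]
print([c["pdb"] for c in significant])
```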


Generally, I consider a cluster significant if it has a size of ~15 or larger. I consider larger clusters to be more significant than small ones, provided that they do not conflict with biological information. In this case, the first cluster heavily dominates and is probably the most likely to be the correct answer. Ideally, the largest cluster should be in the top 2 or 3 by score rank.

The cluster numbers in the files 'cluster1.pdb', 'cluster2.pdb', etc. that are produced refer to the order of the clusters when ranked in descending order by cluster size.

When considering clusters, one should weigh both score rank and cluster size. For example, I would give cluster 4 about the same weight as cluster 3, or higher: even though cluster 4 is two members smaller than cluster 3, cluster 4 is ranked second by score while cluster 3 is only ranked fifth.

Even though Rosetta clusters with a 2.5A rmsd radius, sometimes you will find clusters that are only 1-3A rmsd from each other. You should check the similarity of all possible pairs of clusters in order to avoid redundancy in your model set. There are two ways to do this:

1) Use the Rosetta script rms2.pl. This requires you to enter all of the cluster file names into a file and type

rms2.pl <clusterlist>

This will give you output with all of the pairwise rmsd's. If two clusters are within 2.5A rmsd of each other in this output, consider them as one model and combine their cluster sizes.
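The combining step might look like the following hypothetical sketch. The cluster sizes come from the table above, but the pairwise rmsd values (of the kind rms2.pl reports) and the cluster labels are made-up illustrative data.

```python
# Combine clusters whose representatives are within the 2.5A rmsd radius.
sizes = {"cluster1": 124, "cluster2": 24, "cluster3": 16}
pairwise = {("cluster1", "cluster3"): 1.8,   # within 2.5A -> redundant pair
            ("cluster1", "cluster2"): 7.4,
            ("cluster2", "cluster3"): 6.9}

merged = dict(sizes)
for (a, b), rmsd in pairwise.items():
    if rmsd <= 2.5 and a in merged and b in merged:
        merged[a] += merged.pop(b)   # treat the pair as one model; add sizes

print(merged)   # -> {'cluster1': 140, 'cluster2': 24}
```

Here cluster3 folds into cluster1, whose effective size grows to 140, strengthening the case for that model.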

2) If you want to visualize the similarity between two Rosetta models, simply concatenate them into one pdb file using 'cat' and then view that file in rasmol.

Footnote: if you want a pictorial view of the clustering algorithm, take a look at the file cluster.ps (use kghostview).

It is best to visually examine the output structures afterward before making your final rankings.  See manual structure analysis.
