Post-processing large-scale runs
Setup
Begin by copying the directory with all your pdbs from your cluster to
your home machine. Create a directory named after your pdb (the
four-letter code you gave to the starting structure), and then place the
directory of 1000 within that.
If your pdb name is 'aa11' and your directory of 1000 is called 'GS', I
suggest the architecture
runs -> aa11 -> GS
I will use this directory convention in the rest of the tutorial.
Extracting
The idea of extraction is to focus your attention on the top 200
or so decoys out of your large-scale run. These lowest-scoring
decoys are the most likely to contain the correct answer,
provided you have sampled enough search space to find the global
minimum.
In 'runs', first type
pp_pdb.sh aa11
(pp stands for post-processing). In the directory 'aa11', you
will now find directories called 'top10',
'top200', and 'top1000'. The most useful of these is the top 200.
Clustering
The logic of clustering is that the more often a structure is found by
Rosetta, the more likely that structure is true to reality. In
terms of energy surfaces, the low-scoring structures represent deep
portions of the energy surface, and the large clusters of structures
(50-100 of the top 200) represent broad potential wells. It has
been suggested that the cluster size of
a structure might correspond to its entropy, that is, that a large
cluster has more substates than a small cluster, and thus a more
positive entropy and a lower free energy.
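The entropy argument can be written out in standard statistical-mechanics notation. The symbols are textbook conventions and are not taken from the Rosetta scripts: \Omega is the number of substates in a cluster, E its energy, T the temperature, and k_B Boltzmann's constant.

```latex
S = k_B \ln \Omega, \qquad F = E - T S
```

For two clusters of comparable E, the one with more substates (larger \Omega) has the larger S and therefore the lower F.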
In runs, use pp_pdb2.sh, a wrapper script which calls pp_cluster_set.sh,
which does the clustering. pp_cluster_set.sh uses the statistical
program R to cluster the structures. You might get this error if
you attempt to run pp_pdb2.sh on jazz or on a new machine in the lab:
R not available on this workstation ... no clustering possible
Consider pushing the files home with pp_push set.sh
This means you need to install R on your machine in a standard location
like '/usr/bin/R'. R can be freely downloaded from the R project
website.
In runs, type
pp_pdb2.sh aa11 2.5
2.5A is the rmsd cutoff for clustering. You can change this to
another number (e.g. 5.0) if your cluster sizes are too small.
This clusters the top 10 and top 200 sets of structures. To
cluster the top 1000 (this takes a while), type
pp_cluster_set.sh aa11 1000 2.5
The (hierarchical) clustering algorithm is basically as follows:
1) Calculate the pairwise rmsd of all 200 structures.
2) Find the closest pair and group them together (now there are 199
objects instead of 200).
3) Repeat this pairing until there is only one object.
4) Go back and split all pairings that exceed the rmsd cutoff (2.5A in
this case).
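The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the R code that pp_cluster_set.sh runs. The function name, the toy distance matrix, and the choice of average linkage are all assumptions, since the tutorial does not say how the distance between two groups is measured. Instead of building the whole tree and splitting it afterward, the sketch stops merging once the closest pair exceeds the cutoff; for average linkage this gives the same groups.

```python
from itertools import combinations


def cluster_by_cutoff(dist, cutoff):
    """Agglomerative clustering that stops once the closest pair exceeds cutoff.

    dist   -- symmetric matrix (list of lists) of pairwise rmsd values
    cutoff -- rmsd cutoff (2.5 in the tutorial)
    Average linkage is an assumption made for this sketch.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average rmsd over all cross-group pairs of structures.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of current groups.
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[x], clusters[y]) > cutoff:
            break  # no remaining pair is within the cutoff
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    # Largest clusters first, members listed in index order.
    return sorted((sorted(c) for c in clusters), key=len, reverse=True)


# Toy rmsd matrix for five structures (made-up numbers).
toy = [
    [0.0, 1.0, 1.5, 8.0, 9.0],
    [1.0, 0.0, 2.0, 8.5, 9.5],
    [1.5, 2.0, 0.0, 7.5, 8.0],
    [8.0, 8.5, 7.5, 0.0, 2.0],
    [9.0, 9.5, 8.0, 2.0, 0.0],
]
print(cluster_by_cutoff(toy, 2.5))
```

With a smaller cutoff the same structures fall into more, smaller clusters, which is why raising the cutoff helps when cluster sizes are too small.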
Now, go into <pdbcode>/top200 and look at the file
clusterscores.bysize. It should look something like this:
1 124 ba12a1.ppk_0003.pdb -297.40 34.14
4 24 bR12a1.ppk_0001.pdb -292.81 31.39
5 16 bk12a1.ppk_0008.pdb -292.33 33.30
2 14 bp12a1.ppk_0008.pdb -293.22 32.59
3 4 aU12a1.ppk_0003.pdb -293.12 32.98
6 3 aG12a1.ppk_0003.pdb -291.65 52.69
7 1 aL12a1.ppk_0009.pdb -291.48 34.73
#cluster score rank - cluster size - best decoy - best score - rms
The columns go as follows:
1) score rank of the top-scoring structure in that cluster
2) size of the cluster
3) pdb file of the top-scoring structure in that cluster
4) score of the top-scoring structure in that cluster
5) rms (meaningless because native is not defined)
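If you want to work with these numbers in a script, the listing is easy to parse. The following is a minimal sketch that assumes the file is five whitespace-separated columns plus a legend line starting with '#', as in the sample above; the names ClusterRow and parse_clusterscores are made up for this example and are not part of the pp_ scripts.

```python
from typing import NamedTuple


class ClusterRow(NamedTuple):
    score_rank: int   # column 1
    size: int         # column 2
    pdb: str          # column 3
    score: float      # column 4
    rms: float        # column 5 (meaningless without a native)


def parse_clusterscores(text):
    """Parse the text of a clusterscores.bysize file into ClusterRow records."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the '#cluster ...' legend, and anything malformed.
        if len(fields) != 5 or line.lstrip().startswith("#"):
            continue
        rank, size, pdb, score, rms = fields
        rows.append(ClusterRow(int(rank), int(size), pdb,
                               float(score), float(rms)))
    return rows


sample = """\
1 124 ba12a1.ppk_0003.pdb -297.40 34.14
4 24 bR12a1.ppk_0001.pdb -292.81 31.39
5 16 bk12a1.ppk_0008.pdb -292.33 33.30
2 14 bp12a1.ppk_0008.pdb -293.22 32.59
#cluster score rank - cluster size - best decoy - best score - rms
"""
for row in parse_clusterscores(sample):
    print(row.pdb, row.size, row.score_rank)
```

To read a real file, pass open('aa11/top200/clusterscores.bysize').read() instead of the sample string.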
Generally, I consider a cluster significant if it has a size of ~15 or
larger. I consider larger clusters to be more significant than
small ones, provided that they do not conflict with biological
information. In this case, the first structure heavily dominates
and is the most likely to be the correct answer. Ideally,
the largest cluster should be in the top 2 or 3 by score rank.
The cluster numbers in the files 'cluster1.pdb', 'cluster2.pdb', etc.
that are produced refer to the order of the clusters when ranked in
descending order by cluster size.
When considering clusters, one should consider both score rank and
cluster size. For example, I would give cluster 4 about the same
or higher weight than cluster 3, because even though cluster 4 is two
structures smaller than cluster 3, cluster 4 is the second-ranked
structure by score and cluster 3 is only 5th-ranked by score.
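To see both criteria side by side, it can help to number the clusters by size (as the cluster1.pdb, cluster2.pdb, ... files are numbered) and then list them by score rank. This is a small sketch using the (score rank, size) pairs from the sample listing; the function name is made up, and how the real scripts break ties in size is not covered here.

```python
# (score rank, cluster size) pairs copied from the sample listing
sample_pairs = [(1, 124), (4, 24), (5, 16), (2, 14), (3, 4), (6, 3), (7, 1)]


def number_and_rank(pairs):
    """Return (cluster_number, score_rank, size) tuples ordered by score rank.

    cluster_number is the position when sorted by descending size, which is
    how the tutorial says the clusterN.pdb files are numbered.
    """
    by_size = sorted(pairs, key=lambda p: p[1], reverse=True)
    numbered = [(n, rank, size)
                for n, (rank, size) in enumerate(by_size, start=1)]
    return sorted(numbered, key=lambda t: t[1])


for number, rank, size in number_and_rank(sample_pairs):
    print(f"cluster{number}.pdb  score rank {rank}  size {size}")
```

In this listing cluster 4 appears second and cluster 3 fifth, which is the comparison made above.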
Even though Rosetta clusters with a 2.5A rmsd radius, sometimes you
will find clusters that are 1-3A rmsd from each other. You should
check the similarity of all possible pairs of clusters in order to
avoid redundancy in your model set. There are two ways to do
this:
1) Use the Rosetta script rms2.pl. This requires you to enter all of
the cluster names into a file and type
rms2.pl <clusterlist>
This will give you output with all of the pairwise
rmsds. If two clusters are within 2.5A rmsd of each other on
this chart, then consider them as one model and combine the cluster
sizes together.
2) If you want to visualize the similarity between two rosetta models,
simply put them together into one pdb file using 'cat' and then view
the file in rasmol.
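The bookkeeping for combining clusters in option 1 can be sketched as follows. The rms2.pl output format is not reproduced here, so the pairwise rmsds are passed in as a plain dictionary that you would fill in by hand; the function name and the rmsd values in the example are made up for illustration. One design choice to be aware of: merging is transitive, so if A is close to B and B is close to C, all three end up in one model.

```python
def merge_close_clusters(sizes, pair_rmsd, cutoff=2.5):
    """Group clusters whose pairwise rmsd is within cutoff and sum their sizes.

    sizes     -- {cluster name: cluster size}
    pair_rmsd -- {(name_a, name_b): rmsd between the two cluster models}
    Returns a list of (member names, combined size), largest first.
    """
    parent = {name: name for name in sizes}

    def find(name):
        # Follow parent links to the group representative (path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), rmsd in pair_rmsd.items():
        if rmsd <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for name in sizes:
        groups.setdefault(find(name), []).append(name)
    models = [(sorted(members), sum(sizes[m] for m in members))
              for members in groups.values()]
    return sorted(models, key=lambda m: m[1], reverse=True)


# Cluster sizes from the sample listing; the rmsd values are invented.
sizes = {"cluster1.pdb": 124, "cluster2.pdb": 24,
         "cluster3.pdb": 16, "cluster4.pdb": 14}
pair_rmsd = {
    ("cluster1.pdb", "cluster2.pdb"): 6.1,
    ("cluster1.pdb", "cluster3.pdb"): 7.0,
    ("cluster1.pdb", "cluster4.pdb"): 5.5,
    ("cluster2.pdb", "cluster3.pdb"): 1.8,
    ("cluster2.pdb", "cluster4.pdb"): 9.2,
    ("cluster3.pdb", "cluster4.pdb"): 8.8,
}
for members, total in merge_close_clusters(sizes, pair_rmsd):
    print(total, members)
```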
Footnote: if you want a pictorial view of the clustering
algorithm, take a look at the file cluster.ps (use kghostview).
It is best to visually examine the output structures before
making your final rankings. See manual structure analysis.