## here are some tools for clustering

step 1: if you have pdb files, first convert them into a
     silent-mode output file:

     make a list of the decoys, e.g. /bin/ls aa*pdb > tmp.list

     then combine them into a .out file:
     ./python/compose_score_silent.py tmp.out tmp.list

step 2: (optional) pre-process your native pdb file so that the
     clustering program can read it

     ./python/make_coords_file.py nat.pdb A tmp.out > nat.coords

step 3: cluster

     ./C/cluster_info_silent.out tmp.out nat.coords cluster/tmp 5,15,45,75 3,4

     if you don't have a native, replace "nat.coords" with "-"

     this will write a large number of files into the directory cluster/,
     all with names starting with "tmp" (the 3rd argument, cluster/tmp, is
     the output prefix)

step 4: make a contacts plot

     ./python/make_new_plot.py cluster/tmp.contacts 

step 5: make a dendrogram of the clusters

     ./python/make_color_trees.py cluster/tmp 1 25

###################################

you can get some info about the scripts by running each one without
any arguments

The clustering algorithm is very simple: given an RMSD threshold,
find the decoy with the most neighbors within this threshold. This decoy
and its neighbors are your first cluster. Now delete all the members of
this cluster, and repeat: find the decoy with the most neighbors within
this threshold among the remaining decoys. This is your second cluster.
And so on.
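
A minimal Python sketch of the procedure described above. This is not the
code in ./C/cluster_info_silent.out. It assumes you already have a full
matrix of pairwise RMSDs. The name greedy_cluster, counting a decoy as its
own neighbor, reading "within" as <=, and breaking ties by lowest index are
all assumptions made for illustration.

```python
def greedy_cluster(dist, threshold):
    """Greedy neighbor-count clustering (illustrative sketch).

    dist      -- square, symmetric list of lists of pairwise RMSDs
    threshold -- RMSD cutoff; decoys with dist <= threshold are neighbors
    Returns a list of (center_index, member_indices), largest-first in the
    order the clusters were picked.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        # Find the remaining decoy with the most remaining neighbors.
        # Each decoy counts itself, so best_members is never empty.
        for i in sorted(remaining):
            members = [j for j in sorted(remaining) if dist[i][j] <= threshold]
            if len(members) > len(best_members):  # ties: lowest index wins
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        # Delete all members of this cluster before picking the next one.
        remaining -= set(best_members)
    return clusters


if __name__ == "__main__":
    # Toy example: "decoys" are points on a line, distance = |x - y|.
    pts = [0, 1, 2, 10, 11, 50]
    dist = [[abs(p - q) for q in pts] for p in pts]
    for center, members in greedy_cluster(dist, 1.5):
        print(center, members)
```

Note that this naive version rescans all remaining decoys for every cluster,
so it is only meant to show the logic, not to be fast.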

The only tricky part is how you decide what the clustering threshold
should be. You could say you want N decoys in the top cluster. Or you
could say you want the threshold to be 3 Angstroms. The complicated
command line arguments to the clustering program are designed to
allow the program to make a smart decision:

./C/cluster_info_silent.out <silent-file> <coords-file> <prefix> a,b,c,d e,f

a is the smallest cluster size you want to see
b and d bound the size of the top cluster, and c is the top cluster size
the program tries first
e and f bound the clustering threshold

The program will try to get a top cluster of size c. This will define
some initial clustering threshold t. If e <= t <= f, you're done.
If t < e, the program tries a top cluster of size c+1, and keeps increasing
the top cluster size until either the clustering threshold falls within [e,f]
or a top cluster size of d is reached. If instead t > f, the program
decreases the top cluster size until the threshold falls below f or a top
cluster of size b is reached.

In short: the top cluster size will lie between b and d, and if possible the
clustering threshold will lie between e and f.
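
The search described above can be sketched in Python as follows. This is a
guess at the logic, not the program's actual code. In particular, the helper
threshold_for_size assumes that "the threshold for a top cluster of size n"
means the smallest radius at which some decoy has n members (itself
included), and the sketch assumes d is no larger than the number of decoys.
Both function names are made up for illustration.

```python
def threshold_for_size(dist, size):
    """Smallest threshold at which some decoy has `size` members.

    Assumption: each decoy counts itself (distance 0), so for one decoy the
    answer is the size-th smallest entry in its row of the RMSD matrix.
    """
    return min(sorted(row)[size - 1] for row in dist)


def choose_threshold(dist, b, c, d, e, f):
    """Pick (top_cluster_size, threshold) from the arguments b,c,d and e,f."""
    size = c
    t = threshold_for_size(dist, size)
    if t < e:
        # Threshold too tight: grow the top cluster, but not past d.
        # Assumption: stop as soon as t is no longer below e.
        while t < e and size < d:
            size += 1
            t = threshold_for_size(dist, size)
    elif t > f:
        # Threshold too loose: shrink the top cluster, but not below b.
        while t > f and size > b:
            size -= 1
            t = threshold_for_size(dist, size)
    return size, t


if __name__ == "__main__":
    # Toy example: ten "decoys" evenly spaced on a line.
    pts = list(range(10))
    dist = [[abs(p - q) for q in pts] for p in pts]
    print(choose_threshold(dist, 2, 3, 8, 2.5, 3.5))
```

The returned size always lies between b and d. The returned threshold lies
in [e,f] only when some size in that range allows it.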

Memory use and run time are most sensitive to the setting "d" and to the
total number of decoys, so try not to set "d" too big. If it turns out to be
too small the first time, you can always run the program again.

For 1000 decoys and a smallish protein I might use:

5,10,50,150 3,4

Of course the thresholds 3,4 should be scaled with the length of the protein.


