# -*- tab-width:2;indent-tabs-mode:t;show-trailing-whitespace:t;rm-trailing-spaces:t -*-
# vi: set ts=2 noet:
#
# (c) Copyright Rosetta Commons Member Institutions.
# (c) This file is part of the Rosetta software suite and is made available under license.
# (c) The Rosetta software is developed by the contributing members of the Rosetta Commons.
# (c) For more information, see http://www.rosettacommons.org. Questions about this can be
# (c) addressed to University of Washington UW TechTransfer, email: license@u.washington.edu.

## @file test/scientific/README.txt


____________________________________________________

HOW TO RUN
____________________________________________________

0) Check Prerequisites

   a) Rosetta
   b) Input Structures ( see next )
   c) R  version >= 1.11
      c.1) with packages 'plyr', 'ggplot2', 'reshape', 'Design'  (see ?install.packages in R)

1) Download input structures

   cd mini_base_dir
   svn checkout https://svn.rosettacommons.org/source/trunk/mini.data/tests/scientific/tests/rotamer_recovery/inputs inputs

2) Compile Rosetta

   cd mini_base_dir
   ./scons.py bin mode=release -j<num_processors>

3) Run test

   cd mini_base_dir/test/scientific
   ./scientific.py rotamer_recovery -d mini_database_dir

4) Investigate results

   cd mini_base_dir/test/scientific/statistics/rotamer_recovery
   -> results.out




____________________________________________________

ABOUT
____________________________________________________



 Rotamer Recovery Scientific Benchmark

The rotamer recovery scientific benchmark addresses the question,
"Given the sequence identity and backbone conformation of an
experimentally characterized protein structure, how
accurately can Rosetta predict the conformation of the sidechains?"
Having an accurate answer will:

 a) Inform researchers on the trustworthiness of Rosetta as a molecular modeler
 b) Guide tuning Rosetta to improve structural predictions


The input structures are a selection from the Richardson Lab's Top5200
dataset.  The Top5200 set was constructed as follows.  Daniel Keedy
and others in the Richardson Lab clustered all structures in the PDB on
April 5th 2007 into 70% sequence homology clusters.  Each structure
that had resolution 2.2A or better and was not filtered out by hand was
run through MolProbity.  Then from each cluster the structure with the
best average resolution and MolProbity score was selected, provided it
had resolution of 2.0A or better.  All structures from the Top5200
having between 50 and 200 residues and resolution below 1.2A were
selected for this benchmark.  This leaves 152 structures with 17463
residues.
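
The final selection step above can be sketched in Python as follows.
This is only an illustration: the Structure record, its field names and
the example entries are hypothetical, and treating the 50 to 200 residue
range as inclusive is an assumption, since the text does not say.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    pdb_id: str        # hypothetical identifier, not a real Top5200 entry
    n_residues: int    # number of residues in the structure
    resolution: float  # in Angstroms; a lower value is a better resolution


def select_for_benchmark(structures, min_residues=50, max_residues=200,
                         resolution_cutoff=1.2):
    """Keep structures with 50-200 residues (assumed inclusive) and
    resolution strictly below the cutoff."""
    return [s for s in structures
            if min_residues <= s.n_residues <= max_residues
            and s.resolution < resolution_cutoff]


# Made-up candidates, for illustration only.
candidates = [
    Structure("aaaa", 120, 1.05),  # kept
    Structure("bbbb", 320, 1.00),  # too many residues
    Structure("cccc", 80, 1.50),   # resolution not good enough
    Structure("dddd", 50, 1.19),   # kept (lower size bound assumed inclusive)
]
selected = select_for_benchmark(candidates)
print([s.pdb_id for s in selected])
```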


The strategy taken here is to model how accurately Rosetta can
predict the conformation of sidechains, using properties of the native
structures as predictors.  To assess the accuracy, we observe that
protein sidechains populate well-defined conformations called rotamers.
If at a given residue the sidechain conformations in the native
structure and the predicted structure occupy the same rotameric bin,
then we say it is a rotameric match, or that the rotamer is recovered.
The model we will build predicts the per-residue rotameric match 0/1
indicator variable.  The properties of the reference structure we
will use as model predictors include:

 a) Residue amino acid identity 
 b) Solvent exposure with neighbor count as a proxy
 c) Crystallographic B-Factor
 d) Number of Degrees of Freedom in sidechain
 e) Some or all of the components of the Rosetta Score function.
 
I will now describe the predictor variables in detail. To represent
the amino acid identity of each residue, we will create a collection
of 0/1 indicator variables.  Note that glycine, having no heavy atoms
in its sidechain, and alanine, having only a single heavy atom as a
sidechain, do not have any torsional degrees of freedom in their
sidechains, so Rosetta will trivially predict their conformations
correctly.  For
the remaining 18 amino acid types we will create a variable that
takes on 1 when a residue has the particular amino acid and 0
otherwise.  
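
As a rough illustration of the indicator encoding just described, the
sketch below builds the 18 variables for a single residue.  The
three-letter codes, the ordering of the list and the function name
encode_residue are choices made for this example, not part of the
benchmark code.

```python
# The 20 standard amino acids minus glycine and alanine, by three-letter
# code. The ordering is arbitrary and chosen only for this example.
AMINO_ACIDS = [
    "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "HIS", "ILE", "LEU",
    "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]


def encode_residue(name):
    """Return 18 indicator values: 1 for the matching amino acid, else 0."""
    return [1 if name == aa else 0 for aa in AMINO_ACIDS]


leu = encode_residue("LEU")
print(len(leu), sum(leu))          # 18 variables, exactly one of them set
print(sum(encode_residue("GLY")))  # glycine sets none of the indicators
```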

Residues that are exposed to solvent are in general less constrained
than residues that are packed on the interior of a protein structure.
With fewer structural constraints, the likelihood of multiple
realistic rotameric conformations increases and therefore the
expected recovery rate decreases.

The crystallographic data comes with estimates of the uncertainty of
the atomic coordinates.  Roughly, the R-Factor measures how well the
structural model as a whole agrees with the diffraction data, and the
B-Factor measures the atom-level spread of the electron density.  A
high B-Factor may indicate flexibility at that part of the structure
or it may indicate imprecision due to the experimental method.  In
either case, a high B-Factor means the accuracy of the native rotamer
is uncertain, so the expected rotamer recovery rate should be low.

Here we will model the connection between the predictors and the
rotamer match statistic with a logistic regression model.  In this
case we assume the probability of match Prob{ match=1 } given the
predictor vector X=[x_1, x_2, ..., x_n] is

 P := Prob{ match=1 | X } = [ 1 + exp ( -X*b ) ]^-1

where b is a vector of model parameters to be estimated from data.
Under this model, the logarithm of the odds of Rosetta correctly
predicting a rotamer match for a particular residue is,

 logit(P) := log[P/(1-P)] = X*b

From this, each model parameter b_i can be interpreted as the change
in logit(P) given a unit of change in the i'th predictor, X_i,

  dlogit(P) / dX_i = b_i
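
The formulas above can be checked numerically with a small Python
sketch.  The parameter vector b and predictor vector x below are
made-up numbers chosen for illustration, not fitted values from the
benchmark, and the function names are hypothetical.

```python
import math


def match_probability(x, b):
    """P = [1 + exp(-X*b)]^-1 for predictor vector x and parameters b."""
    xb = sum(x_i * b_i for x_i, b_i in zip(x, b))
    return 1.0 / (1.0 + math.exp(-xb))


def logit(p):
    """Log-odds of a probability p, log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))


b = [0.5, -0.8, 1.2]  # made-up parameters, for illustration only
x = [1.0, 2.0, 0.5]   # made-up predictor values

p = match_probability(x, b)

# Raise the second predictor by one unit: logit(P) should change by b[1].
x_plus = [1.0, 3.0, 0.5]
delta = logit(match_probability(x_plus, b)) - logit(p)
print(p, delta)
```

Because logit(P) is linear in X, the unit change in the second predictor
shifts the log-odds by exactly b[1]; equivalently, the odds of a match
are multiplied by exp(b[1]).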


