Rosetta and difficult molecular replacement problems
October 26, 2010

This document briefly walks through the use of Rosetta to solve difficult molecular replacement problems.  These tools assume that the user has access to the Phenix suite of crystallographic software (in particular, phaser and the mapbuilding script mtz2map); however, all intermediate files are included so that if the user does not, most of the demo may still be run.

The basic protocol is done in 5 steps; each step has a corresponding script in the folder:

1) Using HHSearch, find potential homologues to the target sequence.  Use a Rosetta "helper script" to prepare templates (and Rosetta inputs for subsequent computations).

2) Use PHASER to search for placement of the trimmed templates within the unit cell.

3) Generate a map corresponding to each putative MR solution.

4) (optional) If the choice of template/orientation is not clear, a "gapless" refinement protocol may be used to eliminate false positives.  Resulting models may be rescored in PHASER.

5) Finally, rebuild gaps and refine each template/orientation in Rosetta, constrained by the density of each solution.  After rescoring with PHASER, the best template/orientation should be clear (if the correct solution was among the starting models).



==================================
step 1. prepare_template_for_MR.sh
==================================

This command line illustrates the use of my script for preparing templates for an initial phaser run.  Functionally, it does the same thing as the crystallographic software 'Sculptor', but it doesn't remap the residues as Sculptor does (and it makes it easier to run with different alignments).  The script takes just one argument: an HHR format alignment file.

Alignments generally come from HHsearch's web interface (http://toolkit.tuebingen.mpg.de/hhpred).  After submitting the sequence through their website, export the results to a .hhr file.  Results may be trimmed so only alignments with a reasonable e-value and sequence coverage are included.
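
As a rough illustration of what "trimming" the hit list could look like, the sketch below filters already-parsed hits by e-value and coverage.  It is not part of the demo scripts: the 'Hit' record, the function name, and both default thresholds are assumptions made up for this example, and appropriate cutoffs depend on the target.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one HHsearch hit (not an HHsearch data type)."""
    template: str      # template identifier
    evalue: float      # e-value reported for the hit
    aligned_cols: int  # number of aligned columns


def filter_hits(hits, query_length, max_evalue=1e-3, min_coverage=0.5):
    """Keep hits with a small e-value that cover enough of the query.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for hit in hits:
        coverage = hit.aligned_cols / query_length
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            kept.append(hit)
    return kept
```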

The script parses the .hhr file, downloads each template PDB, and trims the PDB to the aligned residues.  In addition, the script produces a 'rosetta-style' alignment file; the format is briefly introduced below.  These alignment files are used in Rosetta model-building.

## 1CRB_ 2qo4b
# hhsearch
scores_from_program: 0 1.00
2 DFNGYWKMLSNENFEEYLRALDVNVALRKIANLLKPDKEIVQDGDHMIIRTLSTFRNYIMDFQVGKEFEEDLTGIDDRKCMTTVSWDGDKLQCVQKGEKEGRGWTQWIEGDELHLEMRAEGVTCKQVFKKV
0 AFSGTWQVYAQENYEEFLRAISLPEEVIKLAKDVKPVTEIQQNGSDFTITSKTPGKTVTNSFTIGKEAEIT--TMDGKKLKCIVKLDGGKLVCRTD----RFSHIQEIKAGEMVETLTVGGTTMIRKSKKI
--

The first line is '##' followed by a code for the target and one for the template.  The second line identifies the source of the alignment; the third should be kept as it is.  The fourth line is the target sequence and the fifth is the template; the number at the start of each is an 'offset', identifying where the sequence starts.  However, the number doesn't use the PDB resid but just counts residues _starting at 0_.  The sixth line is '--'.
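
To make the six-line layout and the 0-based offsets concrete, here is a minimal parsing sketch.  It is not the parser Rosetta uses; the function names and the dictionary keys are invented for this example, and it handles only a single block laid out exactly as described above.

```python
def parse_alignment_block(text):
    """Parse one six-line 'rosetta-style' alignment block into a dict."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != 6 or not lines[0].startswith("##") or lines[5] != "--":
        raise ValueError("expected a six-line block starting with '##' and ending with '--'")
    target_id, template_id = lines[0][2:].split()
    target_offset, target_seq = lines[3].split()
    template_offset, template_seq = lines[4].split()
    if len(target_seq) != len(template_seq):
        raise ValueError("target and template lines must have the same length")
    return {
        "target_id": target_id,
        "template_id": template_id,
        "source": lines[1].lstrip("#").strip(),
        "target_offset": int(target_offset),
        "target_seq": target_seq,
        "template_offset": int(template_offset),
        "template_seq": template_seq,
    }


def aligned_pairs(aln):
    """Return (target_index, template_index) pairs for aligned columns.

    Indices count residues starting at 0 (plus the offset), not PDB resids.
    """
    pairs = []
    t = aln["target_offset"]
    m = aln["template_offset"]
    for a, b in zip(aln["target_seq"], aln["template_seq"]):
        if a != "-" and b != "-":
            pairs.append((t, m))
        if a != "-":
            t += 1
        if b != "-":
            m += 1
    return pairs
```

For a made-up block whose sequence lines are '2 ACDEF' and '0 AC-EF', aligned_pairs returns [(2, 0), (3, 1), (5, 2), (6, 3)]: the target residue opposite the '-' has no template partner.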

The results for this demo appear in the folder 'templates'.  For each alignment in the starting .hhr file, 3 files are produced.



==========================================
steps 2-3. run_phaser.sh and make_maps.sh
==========================================

This command line shows the use of Phaser to generate initial molecular replacement solutions.  For each template, we run phaser to find potential placements of that template in the unit cell.

The example scripts here only generate a single model from a single template, but for a real-world case, one will often want to use many different templates and may want to generate more than one possible solution using 'TOPFILES n'.  In general, though, we have found it is better to use fewer potential solutions from more templates than many solutions from few templates.

Sometimes weak hits may be found by lowering the rotation function cutoff in phaser by adding the line 'SELECT ROT FINAL PERCENT 0.65' (or even 0.5) to the phaser script.  Increasing the packing function threshold (with PACK 10) may also help in some cases.

Finally, for each template/orientation, we generate the 2fo-fc map for input to Rosetta in the next step.



===============================
step 4. run_rosetta_mr_nogap.sh
===============================

The next two command lines illustrate the use of rosetta's comparative modelling into density.  After the template-preparation script and an initial phaser run, density maps are generated from each phaser hit, and cm-into-density is run.  The flag -MR::mode cm selects this mode.  This first application does not try to rebuild gaps in the alignment; it just performs the threading and runs relax into density.  Thus, in addition to the density map, the only inputs needed are the target fasta file, the rosetta-style ali file, and the template pdb.  Because there is no rebuilding, not many models are needed to adequately cover conformational space; generally 10-20 is sufficient.

Because it is so fast (and requires so few trajectories) I will generally do this first on all phaser hits.  If one of the hits gives a strong signal when rescored with phaser, I will proceed to the next step (in which gaps are rebuilt) using that hit alone.

A brief overview of flags is given below:

-in::file::extended_pose 1
-in::file::fasta inputs/1crb.fasta
-in::file::alignment inputs/1crb_2qo4.ali
-in::file::template_pdb inputs/2qo4.PHASER.1.pdb
	The fasta, alignment and template PDBs.  The flag 'extended_pose 1' must always be given.

-relax::default_repeats 4
-relax::jump_move true
-edensity:mapfile inputs/sculpt_2QO4_A.PHASER.1.map
-edensity:mapreso 3.0
-edensity:grid_spacing 1.5
-edensity::sliding_window_wt 1.0
-edensity::sliding_window 5
	This is how the density map and scorefunction parameters are given to Rosetta.  The input map (-edensity:mapfile) is CCP4 format.  The flags 'mapreso' and 'grid_spacing' define the resolution of the calculated density and the grid spacing on which correlations are computed (these need not match the input file's grid; they define the grid on which Rosetta does its computations).  Generally I don't go lower than these values (with grid_spacing 1/2 of mapreso); for very large structures I may run on an even coarser grid to speed things up a bit.

	The flag 'sliding_window_wt' is the weight on the fit-to-density term.  1.0 is generally fine.  If the experimental data is low resolution (3A or worse), then you might want to drop this to 0.5.  'sliding_window' is the residue-width over which local correlations are computed.  5 is generally fine.

-cm::aln_format grishin
	The input format of the 'ali' file.  Just keep this for now.

-MR::max_gaplength_to_model 0
	This tells Rosetta not to rebuild gaps in the alignment.

-nstruct 10
	The number of output structures.  If we are not doing gap rebuilding, models converge pretty well so not many models are needed.

-ignore_unrecognized_res
	If the template contains nonstandard residues/ligands/waters, this tells Rosetta to ignore them.
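
The rules of thumb given above for the density flags (grid_spacing at half of mapreso, mapreso no finer than 3.0, and a lower sliding_window_wt for data at 3A or worse) can be written down as a small helper.  This is only a sketch: the function name and its arguments are invented for this example, and it emits only flags that appear in the list above.

```python
def density_flags(mapfile, data_resolution, mapreso=3.0):
    """Build the density-related flag lines following the text's rules of thumb.

    data_resolution is the resolution of the experimental data in Angstroms.
    """
    # Assumption: "don't go lower than these values" is read as a floor of 3.0.
    mapreso = max(float(mapreso), 3.0)
    grid_spacing = mapreso / 2.0
    # Lower the density weight for low-resolution data (3A or worse).
    weight = 0.5 if data_resolution >= 3.0 else 1.0
    return [
        f"-edensity:mapfile {mapfile}",
        f"-edensity:mapreso {mapreso}",
        f"-edensity:grid_spacing {grid_spacing}",
        f"-edensity::sliding_window_wt {weight}",
        "-edensity::sliding_window 5",
    ]
```

Joining the returned lines with newlines gives text that can be pasted into a flags file; for a very large structure one might pass a larger mapreso to get the coarser grid mentioned above.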


======================================
step 5. run_rosetta_mr_rebuild_gaps.sh
======================================

This script is the same as above, but also rebuilds gaps in the alignment.  The main difference is that a non-zero value is given for '-MR::max_gaplength_to_model'; additionally, some flags must be given that describe how rosetta should rebuild gaps.

Several additional input files must be provided as well.  Rebuilding of gaps is done by fragment insertion (as in Rosetta ab initio); thus two backbone fragment files (3-mers and 9-mers) must be given.  The application for building these is included with rosetta but requires a bunch of external tools/databases.  The easiest way to generate fragments is to use the Robetta server (http://robetta.bakerlab.org/fragmentsubmit.jsp).  The fragment files should be built with the full-length sequence; rosetta handles remapping the fragments if not all gaps are rebuilt.

The additional flags are given below:
-MR::max_gaplength_to_model 8
	Rosetta will close gaps up to this width; the larger this value is, the more sampling is required.  I wouldn't recommend higher than 8 or 10.

-loops::frag_sizes 9 3 1
-loops::frag_files inputs/aa1crb_09_05.200_v1_3.gz inputs/aa1crb_03_05.200_v1_3.gz none
	Fragment files and sizes.  9 & 3 are the default produced by Robetta; 'none' means to automatically generate the 1-mer fragments from the 3-mers.

-loops::remodel quick_ccd
-loops::random_grow_loops_by 5
-loops::random_order
-loops::extended
	Parameters of the loopbuilding; these parameters work well.

-loops::relax fastrelax
	Tells Rosetta to relax the structure after loopbuilding.

-nstruct 100
	Number of output models should be much higher.
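
To get a feel for what a given '-MR::max_gaplength_to_model' value would cover, one can list the runs of '-' in the template line of the alignment and compare their lengths to the cutoff.  The sketch below does only that; it is an assumption-laden illustration (function names are invented, the cutoff is treated as inclusive, and Rosetta's own definition of a gap may differ, for example at the termini).

```python
import re


def template_gaps(template_seq):
    """Return (start_column, length) for each run of '-' in the template line."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", template_seq)]


def split_by_max_length(gaps, max_gaplength):
    """Split gaps into those at or below the cutoff and those above it."""
    within = [g for g in gaps if g[1] <= max_gaplength]
    too_long = [g for g in gaps if g[1] > max_gaplength]
    return within, too_long
```

For the 2qo4 template line shown in step 1, this finds two gaps, of lengths 2 and 4, both below a cutoff of 8.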

Generally, I will make about 2000 models building all gaps up to length 10.  The top 10% are selected by Rosetta energy; each of these structures is rescored with phaser; the best is relaxed again using the updated density.
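
The selection step can be sketched as follows, assuming the model names and their Rosetta energies have already been collected into (name, energy) pairs.  How the energies are read from Rosetta's output is not shown, and the function name is invented for this example.

```python
def top_fraction_by_energy(models, fraction=0.10):
    """Return the lowest-energy fraction of models (lower Rosetta energy is better).

    models is an iterable of (name, energy) pairs; at least one model is kept
    whenever the input is non-empty.
    """
    ranked = sorted(models, key=lambda m: m[1])
    if not ranked:
        return []
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]
```

With 2000 models and the default fraction, this keeps the 200 lowest-energy models for rescoring with phaser.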

Since each model is independently generated, multiple processes are run to produce these 2000 models.  To manage the output, either each process can be run from a separate directory, or '-out:prefix <prefix>' can be used to keep jobs from overwriting each other's structures.
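
One way to divide the work is sketched below: it splits a total model count across a number of processes and gives each its own '-out:prefix' value.  The function name and the 'job' prefix scheme are assumptions made for this example; only the '-nstruct' and '-out:prefix' flags come from this document.

```python
def plan_jobs(total_models, n_processes, prefix="job"):
    """Return one flag string per process, each with its own output prefix."""
    base, extra = divmod(total_models, n_processes)
    jobs = []
    for i in range(n_processes):
        # The first 'extra' processes take one additional model each.
        nstruct = base + (1 if i < extra else 0)
        if nstruct == 0:
            continue
        jobs.append(f"-nstruct {nstruct} -out:prefix {prefix}{i:03d}_")
    return jobs
```

Each returned string would be appended to the command line of one process, so that no two processes write structures with the same name.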

Alternatively, there is a compact output format, 'silent files', to which structures can be written.  Simply add the flags '-out:file:silent <silent_filename> -out:file:silent_struct_type binary' and all structures from one process will be written to this compact file.  Then the rosetta program 'extract_pdbs' can be used to extract the individual structures:
> bin/extract_pdbs.default.linuxgccrelease -database $DB -in:file:silent <silent_filename> -silent_struct_type binary

