Rosetta++ Coding Guidelines                                 Last Updated: June 8, 2005    

Introduction

Rosetta++ is a C++ implementation of Rosetta that was converted from the Fortran implementation, and it currently shares the design and programming style of the Fortran version. These guidelines are intended to help us maintain and improve the reliability, clarity, and performance of the code while we continue its development and modernization. The guidelines will evolve along with the design of Rosetta++ and our accumulated knowledge.

A modest familiarity with the C++ language and with the ObjexxFCL array and string types will be needed to write new Rosetta++ code. Less detailed knowledge is required to modify existing code: in many cases merely observing the usage in the existing code will suffice.

See the Introduction to Rosetta++ document for an overview of the C++ implementation and migration from the Fortran Rosetta.

The ObjexxFCL documentation is provided separately.

C++ design and development is a much larger subject than we can hope to cover here. Some good references worth exploring include:


General

  • Use structs to group data that is associated with the same entity, paving the way for the migration to an object-oriented design.

  • Use const qualifiers on declarations of variables and function arguments that should not be changed.

  • Namespaces can be used to wrap associated objects that make up a conceptual component and to wrap related classes.

  • Avoid raw pointers. A range of smart pointers is available including std::auto_ptr and Boost::smart_ptr. The choice of pointer type is important: they vary with respect to value vs. pointer copy semantics and data ownership. The performance impact of smart pointers should be assessed.

  • Avoid C-style arrays. Especially avoid writing to C-style char arrays with the scanf family of functions: this invariably introduces buffer overflow bugs that can go unnoticed and are hard to track down. Use std::string buffers and prefer C++ stream i/o.

  • Use include guards in all header files.

  • Header files should include all of the other header files that they need rather than requiring those headers to be included before them.

  • Use forward declarations in header files instead of header file includes when possible to reduce dependencies and speed builds.

  • Practice RAII (Resource Acquisition Is Initialization): constructors should be used to acquire resources and destructors should release them. Function-scope resources should be released in the same scope in which they were acquired.

  • Check, handle, and clear stream condition states during i/o operations: the rules for this are a bit tricky.

  • Use assertions to test for events that should not occur for any normal program inputs and operations: function pre- and post-conditions, object invariants, divisions by zero, and so forth:

    #include <cassert>
    ...
    assert( condition );

  • Use if blocks to test for errors that could happen in normal program operation. Errors should generate useful reports that include the values of relevant variables.

  • Never put a using declaration or directive at global or namespace scope in a header file unless you are certain that you want all users of that header to have those names brought into scope. Doing so can force name ambiguity problems into unsuspecting source code. Instead limit using declarations and directives to the smallest scope possible, preferably no larger than function scope.

  • Fix everything that causes a legitimate compiler warning on any platform.
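
The stream condition state guideline above can be illustrated with a minimal sketch. The helper names read_ints and read_ints_from_string are hypothetical, and skipping unreadable text is only one possible recovery policy. The point is the order of the checks: test the extraction, distinguish end of input from a format error, and clear the failure state before reading again.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads whitespace-separated integers from a stream,
// skipping text that cannot be parsed as an integer.
std::vector< int > read_ints( std::istream & in )
{
	std::vector< int > values;
	int value;
	while ( true ) {
		if ( in >> value ) { // Extraction succeeded
			values.push_back( value );
		} else if ( in.bad() || in.eof() ) { // Unrecoverable error or no more input
			break;
		} else { // failbit only: the next token is not an integer
			in.clear(); // Reset the state so the stream can be read again
			std::string skipped;
			in >> skipped; // Discard the offending token
		}
	}
	return values;
}

// Hypothetical convenience wrapper for parsing a string
std::vector< int > read_ints_from_string( std::string const & text )
{
	std::istringstream in( text );
	return read_ints( in );
}
```

For example, read_ints_from_string( "1 2 x 3" ) yields the three values 1, 2, and 3. Code that must reject bad input instead of skipping it would report the error in the final branch, including the offending token in the message.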


Modernization

  • Migrate and merge the existing (converted COMMON block) namespaces to structs.

  • Use std::string, not ObjexxFCL::Fstring, in all new code. Be aware that std::strings are indexed starting from 0 and that trailing spaces are significant when comparing std::strings.

  • Migrate away from the use of multidimensional arrays that pass slices of themselves to functions using Fortran-style FArray argument passing tricks. It is better for the function call to indicate the data that the function will use. In many cases the right hierarchical data structure can provide the "slice" of interest in a valid sub-object (e.g., a std::vector< std::vector< T > >).

  • Use C++ Standard Library and Boost collection classes instead of multidimensional arrays where appropriate. The choice of data structures can have a big impact on performance and the interoperability of different parts of Rosetta++ so consult with a software engineer if necessary.

  • Specialized array types such as Blitz++, Boost::MultiArray, POOMA, TNT, and MTL can be considered for future use with Rosetta++ but such choices should be coordinated with the developer community. These array types do not support the Fortran array passing tricks that FArray does so care must be used in planning such a migration for existing code.

  • More specific guidance on the choice of data structures and array types will evolve as the modernization of Rosetta++ proceeds. Developers are encouraged to work on modernization in consultation with the whole development team so that the shared knowledge and experience will lead to robust solutions that interoperate smoothly.
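
As a sketch of the "slice as a valid sub-object" idea from the guideline on array slices above: with a nested std::vector, each row is itself a complete object that can be passed to a function, so the call shows exactly which data the function uses. The names Row, Table, and row_sum are illustrative only and do not come from Rosetta++.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector< double > Row; // One "slice"
typedef std::vector< Row > Table; // A collection of slices

// Sums one row. The signature states exactly what data the function uses:
// a single row, not a whole array plus an implied starting offset.
double row_sum( Row const & row )
{
	double sum = 0.0;
	for ( std::size_t i = 0; i < row.size(); ++i ) {
		sum += row[ i ];
	}
	return sum;
}
```

A caller with a Table named table would write row_sum( table[ 1 ] ) to process the second row (std::vector indices start from 0).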


Build Speed

  • Don't include headers that your code doesn't need. Header files should only include headers that they need to compile (test this by compiling a source file that only includes the header), not those that their corresponding source file needs, and should forward declare the other types that appear.

  • Use forward declarations instead of including header files wherever possible, especially in header files. If an object is only used by name in a reference or pointer context then a forward declaration is sufficient. Base and member classes require header inclusion.

  • STL containers (std::list, std::map, std::string, etc.) cannot be portably forward declared but including <set> and <map> can greatly slow down build times. Using a wrapper class that accesses the std::set or std::map through a pointer can insulate your code from those headers: contact a support person for help on this.
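
The wrapper approach mentioned in the last guideline above might look roughly like the following sketch. The class names NameSet and NameSetImpl and the file split are hypothetical. The two parts are shown in one listing for brevity: in practice the first part would be the header and the second part the only source file that includes <set>.

```cpp
// ---- NameSet.h (hypothetical header): does not include <set> ----
#include <string>

class NameSetImpl; // Forward declaration: defined only in the source file

class NameSet
{
public:
	NameSet();
	~NameSet();
	void insert( std::string const & name );
	bool contains( std::string const & name ) const;
private:
	NameSet( NameSet const & ); // Copying disabled: declared but not defined
	NameSet & operator =( NameSet const & );
	NameSetImpl * impl_; // Owned: acquired in the constructor, released in the destructor
};

// ---- NameSet.cc (hypothetical source): the only place <set> is included ----
#include <set>

class NameSetImpl
{
public:
	std::set< std::string > names;
};

NameSet::NameSet() : impl_( new NameSetImpl ) {}

NameSet::~NameSet() { delete impl_; }

void NameSet::insert( std::string const & name ) { impl_->names.insert( name ); }

bool NameSet::contains( std::string const & name ) const
{
	return impl_->names.count( name ) > 0;
}
```

Code that includes only the header part can use NameSet without ever seeing <set>. The cost is one extra pointer indirection per call, so this suits code that is not performance critical.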


Runtime Performance

  • Make your code work correctly before you worry about making it fast. Most of the computational time is spent inside loops, so it is essential that the innermost parts of the loops are fast. Move if statements outside of loops if possible. Be especially careful with the calls to pairenergy and fast_pairenergy; these functions are called on the order of 10^9 times.

  • As in Fortran arrays, FArrays of rank greater than one are column-major ordered and so are most efficiently accessed in loops where the innermost loop varies the first index.

  • FArray linear indexing can provide a big speed boost for performance-critical code sections at some cost to code clarity.

  • Numeric and bool function arguments can be passed by value rather than reference when the function does not need to modify the actual argument's value in the calling function. This is a little more efficient than passing these types by reference. You can declare the argument as passed by constant value to document and check at compile-time that the function does not alter the argument.

  • Avoid dynamic allocation in loops and frequently called functions. This can include the construction of objects, such as FArrays, that allocate heap memory, as well as explicit calls to new.

  • Declare function-local arrays of fixed size static in high call count functions to avoid the overhead of heap allocation on each call (and move any initialization from construction-time to run-time).

  • Function-local arrays sized by Dimensions that don't change during the function should use the Dimensions' values by adding the ():

        Dimension d;
        ...
        void f()
        {
           FArray2D_int A( 2 * d(), 2 * d() );
           ...
        }

    to avoid the overhead of the automatic sizing system and the need to include DimensionExpressions.h if an expression is used.

  • If you have a code section that is performance critical, help is available for profiling and tuning.
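
The column-major guideline above can be made concrete with a small sketch. ColumnMajor2D is a hypothetical stand-in written for this example, not the ObjexxFCL FArray2D interface. It only mimics the memory layout, in which the first index varies fastest, to show why the innermost loop should run over the first index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major, 1-based 2D array.
// This is NOT the ObjexxFCL FArray2D interface, only an illustration of the layout.
class ColumnMajor2D
{
public:
	ColumnMajor2D( std::size_t n1, std::size_t n2 ) :
		n1_( n1 ), n2_( n2 ), data_( n1 * n2, 0.0 )
	{}

	std::size_t size1() const { return n1_; }
	std::size_t size2() const { return n2_; }

	// Offset of element ( i, j ) in memory: the first index varies fastest
	std::size_t offset( std::size_t i, std::size_t j ) const
	{
		assert( ( i >= 1 ) && ( i <= n1_ ) && ( j >= 1 ) && ( j <= n2_ ) );
		return ( j - 1 ) * n1_ + ( i - 1 );
	}

	double & operator ()( std::size_t i, std::size_t j ) { return data_[ offset( i, j ) ]; }
	double operator ()( std::size_t i, std::size_t j ) const { return data_[ offset( i, j ) ]; }

private:
	std::size_t n1_, n2_;
	std::vector< double > data_;
};

// Sums all elements while visiting memory contiguously: the outer loop runs
// over the second index and the innermost loop over the first index.
double sum_all( ColumnMajor2D const & a )
{
	double sum = 0.0;
	for ( std::size_t j = 1; j <= a.size2(); ++j ) {
		for ( std::size_t i = 1; i <= a.size1(); ++i ) {
			sum += a( i, j );
		}
	}
	return sum;
}
```

Swapping the two loops gives the same result but strides through memory by size1() elements per step, which tends to use the cache poorly for large arrays.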


Stylistic

  • Prefer  if  blocks to  goto  (avoids scoping issues with local declarations).

  • Use this standard if block format:

        if ( condition ) {
           action();
        } else {
           other_action();
        }

  • Use brace-wrapped statements in for, do, while, and if statements that take more than one line:

        if ( condition ) // Don't do this!
           action();

        if ( condition ) action(); // OK

        if ( condition ) { // OK
           action();
        }

    With the discouraged form, the next developer might add other actions thinking that they are in the same statement block.

  • Put const on the right side of types:
        Type const t; // Preferred
    instead of:
        const Type t; // Discouraged

    There are two reasons:
    1. C++ types are read from right to left.
    2. To declare a const pointer the const has to be on the right of the *.

  • Use a coherent indenting scheme for loops, conditionals, etc.

  • Use tab indenting and keep line lengths to 80 characters with 2 characters per tab.

  • Use spaces for readability around binary operators, inside parentheses and template argument <> brackets, and between items in function argument lists and variable declaration lists.

  • Comment the purpose and usage of each function.

  • Give variables meaningful, non-cryptic names.

  • Comment all variables that are not excruciatingly obvious.

  • Remove unused variables and functions; if functions are left in for debugging purposes, add a comment saying so.

  • Remove obsolete code (it can still be retrieved from the repository).

  • Write your code in small, manageable pieces. Each function should encompass one concept and fit on one page of code if possible. If your functions are lengthy, they can probably be broken into pieces that are easier to manage. If you find yourself duplicating a chunk of code, break it off into a new function.

  • Conditional checks should happen inside the called function rather than in the calling function when possible. For example, instead of:
      if ( condition_exists ) my_function();
    use:
      my_function();
    where my_function begins with:
      if ( !condition_exists ) return;

    This helps keep things a bit more modular, and also ensures that your function has no bad side effects if someone calls it but forgets to check for the essential condition. The only exceptions should be when the function would have to be called many times and cannot be inlined. If you can avoid an entire loop by checking a condition outside the function, great, but you should probably still have a check inside the function too for other instances when the function is called.

  • Every new file should have the emacs mode information and the CVS information at the top of it. Copy it from a similar file. CVS will automatically put the correct revision, date and author in, so don't worry about this.

  • Vim users might want to use the .vimrc file contained in the CVS Rosetta++ package. This file contains an autocmd for removing trailing spaces, as well as proper definitions of tab space and indent patterns. Copy this .vimrc file to your home directory if you don't already have one there, or modify your .vimrc file accordingly.
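
The two reasons given above for putting const on the right of the type can be seen in a short sketch. Reading each declaration from right to left gives its meaning directly. The function name const_placement_demo is illustrative only, and the commented-out lines are the ones a compiler would reject.

```cpp
// Returns the sum of the three pointed-to values after the permitted changes
int const_placement_demo()
{
	int a = 1;
	int b = 2;

	int const * p = &a; // Pointer to const int: the int cannot be changed through p
	p = &b; // OK: p itself is not const
	// *p = 5; // Error: *p is const

	int * const q = &a; // Const pointer to int: q cannot be made to point elsewhere
	*q = 7; // OK: the int is not const
	// q = &b; // Error: q is const

	int const * const r = &b; // Const pointer to const int

	return *p + *q + *r; // 2 + 7 + 2
}
```

With const always on the right, the declaration of r reads consistently as "r is a const pointer to a const int", whereas const int * const r mixes the two placements.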


Rosetta++ Specifics

  • Runlevels should be determined by command line options only. Hardcoded runlevels will not be checked in.

  • The more system calls we have inside Rosetta++, the more likely it is to be non-functional on some platform. Currently, there are system calls inside Rosetta++ (and some of them do cause failures on some clusters), but please do not add any more unless there is absolutely no other way to solve your problem.

  • Do not add anything to misc.h/.cc that does not absolutely have to go there. Note that the Monte Carlo scores and the rms values are currently in misc.h and do not belong there. This should be changed but is kind of a pain to fix. If anyone wants to volunteer...

  • Logical variables declared in files_paths.h control the initialization and output behavior of Rosetta++ (i.e., what happens in initialize, input_pdb, output_decoy, make_pdb, and a few other places). If your variable doesn't determine global initialization or output behavior, don't declare it there. The value of these variables can be set in pretty much any way you choose in options.cc, but actions that happen as a result of these flags cannot depend on the mode you're running. That is, flags defined in files_paths.h must produce the same results for everyone.

  • Modes determine the protocols that will be run (e.g., what routines are called from main_rosetta). Flags define conditions that are independent of the mode/protocol being used. For example, fullatom_flag determines whether or not complete sidechain coordinates are being used, while relax mode determines that the fullatom_relax protocol will run. There may be both a mode and a flag that is appropriate: e.g., loop mode indicates that the protocols for screening, permuting, and folding loops will run, but any mode can run with the loop_flag true (the loop flag controls such things as refolding in the presence of chain breaks, how wobble moves work, whether a loop library should be input, etc.). Similarly for docking: dock mode uses the set of protocols developed by Jeff, but the docking flag (which can be used by any mode) indicates that multiple chains are in use and need to be tracked (among other things). Note that while in theory any flag should work in any mode, in practice this is probably not true. Usually, a flag has been tested only in the primary mode for which it is used (e.g., docking flag in docking mode) and the score mode.

    The upshot of all this is: if you want something to happen conditionally in Rosetta++, but it isn't specific to a mode (i.e., a protocol), name your variables as flags (e.g., ssblocks, disulfides). A good test is: if you want something to happen both in protocol X and when you just want to score a bunch of structures, it should be a flag.

  • Don't attempt to defeat centralized bookkeeping features. These are probably the trickiest part of Rosetta++ and the easiest way to introduce a bug. Be exceedingly careful when modifying monte_carlo.cc, recover.cc, initialize.cc, and output_decoy.cc.

    In general, only monte_carlo is allowed to modify the best and low arrays, and only initialize is allowed to modify the start arrays.

    Additionally, be careful with frag_begin, frag_size, new_rotamer (these are globals in fairly protected namespaces and should be set by function call only). count_pair and count_pair_position are essential to ensure that functions and derivatives are in agreement.

    The best way to resolve bookkeeping issues, now that the code is in an object-oriented language, is to convert the bookkeeping variables to a structure or object and then create a new instance for your particular calculation. In this way, a Monte Carlo cycle could be embedded within another Monte Carlo cycle. Be sure to consult with the original author (probably Carol) when converting bookkeeping data structures.

  • For checkins to CVS, it is helpful if changes are grouped conceptually so they can be checked in in layers. This allows people to 'undo' changes more easily should it ever be required. It also really helps document what's happening to the code base. Checking in independent changes individually is particularly important when you're making one set of changes that affects all of Rosetta++ and another that maybe only affects a subset of the protocols.

    Say you want to make two sets of changes that are pretty much independent -- perhaps adding a new scorefxn term and changing the way the pair energy is evaluated. Make a decision about which of these changes should happen first (i.e., does one require the other?). Order may not matter. Make the first set of changes starting from the current checkout of CVS. For the second set of changes, you've got two choices -- you can make the second set on top of the first, or you can make them independently in another copy of the CVS version. After you check in the first change, you'll need to update the code with the second set of changes to let CVS merge the changes together. CVS has fewer problems (i.e., conflicts) when you make the two sets of changes completely independently, rather than one on top of the other. YMMV.
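
The bookkeeping guideline earlier in this section suggests converting the bookkeeping variables to a structure or object and creating a new instance for each calculation. A minimal sketch of that idea follows. ScoreBookkeeping and run_inner_cycle are hypothetical names, the field meanings are assumptions made for illustration, and none of this reflects the actual contents of monte_carlo.cc.

```cpp
// Hypothetical bookkeeping object. The field names echo the "best" and "low"
// arrays mentioned above, but their meanings here are assumptions.
struct ScoreBookkeeping
{
	double best_score; // Score of the most recently accepted state (assumed meaning)
	double low_score; // Lowest score seen so far (assumed meaning)
	int n_accepted; // Number of accepted moves

	explicit ScoreBookkeeping( double const start_score ) :
		best_score( start_score ),
		low_score( start_score ),
		n_accepted( 0 )
	{}

	// Records an accepted move; the acceptance test itself is left to the caller
	void record_accept( double const score )
	{
		best_score = score;
		if ( score < low_score ) low_score = score;
		++n_accepted;
	}
};

// An inner cycle gets its own instance, so it cannot disturb the outer one.
// The two recorded scores are stand-ins for a real trial/accept loop.
double run_inner_cycle( double const start_score )
{
	ScoreBookkeeping inner( start_score ); // Independent of any outer bookkeeping
	inner.record_accept( start_score - 1.0 );
	inner.record_accept( start_score + 0.5 );
	return inner.low_score;
}
```

Because each cycle owns its state, an outer ScoreBookkeeping instance is unchanged after run_inner_cycle returns, which is what allows one Monte Carlo cycle to be embedded within another.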


Building Rosetta++

Rosetta++ currently builds with GCC 3.3 & 3.4, and Intel C++ 8.0 & 8.1 and should be buildable with future versions of those compilers and other highly ANSI/ISO standard compliant C++ compilers.

Debug Builds

  • Debug builds (with NDEBUG not defined) should always be used to test new code to get array bounds checking and other assertion checks enabled. Rosetta++ debug builds do a lot of runtime checks and so run quite a bit slower than Fortran debug builds.

  • Suggested GCC debug build g++ command line switches:
    -march=<your_arch> -malign-double -ffor-scope -fno-exceptions -fstack-check -O0 -ggdb

Release Builds

  • Define NDEBUG (using a -DNDEBUG switch on the g++ command line) to disable the runtime array bounds checks and other assertion checks. The normal C++ debug build (without -DNDEBUG) runs quite slowly due to the bounds checks on each array access.

  • Aggressive inlining is essential for maximum Rosetta++ performance. With GCC we suggest using the g++ inlining switches:
    -finline-functions -finline-limit=20000

  • Suggested GCC release build g++ command line switches:
    -march=<your_arch> -malign-double -ffor-scope -fno-exceptions -ffast-math -finline-functions -finline-limit=20000 -funroll-loops -O3 -DNDEBUG -s

Profiling Builds

  • Profiling builds are release builds with extra profiling switches. On GCC this means adding the g++ command line switches:
    -ggdb -pg
    and removing the -s switch.