Introduction to Rosetta++                                 Last Updated: May 15, 2005    

What Is Rosetta++?

Rosetta++ is a reimplementation of the Rosetta biomolecular modeling program in the C++ language. It is named Rosetta++ to distinguish it from the Fortran implementation. Rosetta++ as released in mid-2004 is the first phase of an evolutionary modernization to a more reliable and extensible design. The goals of this phase were to produce a C++ Rosetta that preserved the "look and feel" of the Fortran and had near-Fortran performance. Future phases will focus on migration to an object-oriented architecture that localizes behavioral properties and provides simpler and safer interfaces for development of new scientific algorithms within a modular, extensible framework.


Rosetta++ vs. Rosetta

Rosetta++ preserves the "look and feel" of the Fortran implementation by using a library, ObjexxFCL, to provide Fortran-compatible array, string, i/o, and intrinsic function support. Modifying and extending Rosetta++ code should be straightforward for researchers familiar with the Fortran version: array and string indexing, array passing "tricks", and other Fortran "features" are unchanged. The key issues to be aware of when working with the Rosetta++ code are discussed below.

At this point Rosetta++ still has the stylistic and design attributes of the Fortran Rosetta. The major reliability and extensibility improvements will come later, but even within the current design, C++ compilers will catch many of the typical errors that Fortran compilers miss.

Obtaining near-Fortran performance with Rosetta++ required mode-specific profiling and hand tuning of the bottleneck code. For the modes tuned so far (docking, design, and relax cases), Rosetta++ is 0-10% slower than the Fortran Rosetta when a comparable workload is performed (in individual cases where small numerical differences lead to a different computational sequence, the speed can vary by more). Further tuning and the planned move to right-sized arrays should narrow or eliminate the gap. Due to the performance impact of C++ stream-based i/o, quick test cases perform worse, relatively, than real-world runs.

One area where Rosetta++ suffers by comparison is compile time. C++ is simply a much more complex language than FORTRAN 77 and compiles much more slowly. GCC is notoriously slow, but even with faster compilers a major Rosetta++ rebuild will provide a good opportunity for a coffee break. The slowness of C++ compiles is somewhat ameliorated by the fact that Rosetta++ constants and array dimensions are defined in source (.cc) files, not header (.h) files, so modifying them will not trigger a large number of compiles. On the other hand, changes to function interfaces will trigger many compiles due to the required presence of C++ function prototypes in the headers.


Rosetta++ Overview

Source Code Organization

The file structure of Rosetta++ is similar to that of the Fortran Rosetta with some additions. A source (.cc) file exists for each Fortran source (.f) file. Header (.h) files with function prototypes are used for each source file containing functions. Additional source and header files are used for the namespaces that correspond to the COMMON blocks of the Fortran Rosetta. The ObjexxFCL library source is contained in the ObjexxFCL subdirectory of the main source directory.

Fortran -> C++ Conversion Highlights

Looking at the Fortran and C++ versions of a simple Rosetta routine shows some of the flavor of the conversion.

Fortran: contact_order
   subroutine contact_order(co)
   implicit none
   include 'param.h'
   include 'param_aa.h'
   include 'misc.h'
   include 'cenlist.h'
   real co
   integer i,j,kk,nco
   co=0
   nco=0
   do i=1,total_residue
     if (is_protein(res(i))) then
       do kk=1,cen12up(i)
         j=cen_list(kk,i)
         if (cendist(i,j).lt.64.0 .and. abs(j-i).gt.2) then
           co=co+abs(j-i)
           nco=nco+1
         endif
       enddo
     endif
   enddo
   if (nco.gt.0) co=co/(real(nco))
   return
   end


C++: contact_order
void
contact_order( float & co )
{
  using namespace cenlist_ns;
  using namespace misc;
  using namespace param_aa;

  co = 0.0;
  int nco = 0;
  for ( int i = 1; i <= total_residue; ++i ) {
    if ( is_protein(res(i)) ) {
      for ( int kk = 1; kk <= cen12up(i); ++kk ) {
        int j = cen_list(kk,i);
        if ( cendist(i,j) < 64.0 && std::abs(j-i) > 2 ) {
          co += std::abs(j-i);
          ++nco;
        }
      }
    }
  }
  if ( nco > 0 ) co /= static_cast< float >( nco );
}

Comparing the Fortran and C++ we can see that there is a lot of similarity and that:

  • Fortran subroutines become C++ functions returning void.
  • C++ function arguments are declared in the argument list, not in the function body.
  • Fortran header file include statements are replaced, in general, by using directives for the corresponding namespace (whose headers are included at the top of the C++ source file).
  • Fortran DO loops become C++ for loops.
  • Fortran IF blocks become C++ if blocks.

Fortran -> C++ Conversion Attributes

Some of the main attributes of the conversion are shown below.

Data Types
Fortran Types                C++ Types
LOGICAL                      bool
INTEGER                      int
INTEGER*2                    short
INTEGER*1                    ObjexxFCL::byte
REAL/REAL*4                  float
DOUBLE PRECISION/REAL*8      double
COMPLEX/COMPLEX*8            std::complex<float>
DOUBLE COMPLEX/COMPLEX*16    std::complex<double>
CHARACTER*N                  ObjexxFCL::Fstring(N)
N-dimensional array          ObjexxFCL::FArrayND
PARAMETER constant           const variable
SAVE variable                static variable


Language Constructs
Fortran Construct    C++ Construct
SUBROUTINE           Function returning void
type FUNCTION        Function returning type
COMMON block         namespace
DATA statement       Scalar or array with uniform initial value: initializer value assignment at point of construction
                     Array with nonuniform initial values: initializer function passed to array constructor
EQUIVALENCE          Built-in type: union
                     Array: ObjexxFCL::FArrayNDa proxy array associated with an ObjexxFCL::FArrayND array
DO loop              for loop
IF block             if block


In addition to a basic knowledge of the C++ language, working with the Rosetta++ code requires some familiarity with the ObjexxFCL array, string, i/o, and intrinsic function support and with the conventions used in the conversion from Fortran. These topics are introduced in the sections that follow.


Arrays

The ObjexxFCL library provides a system of Fortran-like arrays of up to five dimensions called FArrays. The FArrays are contiguous, column-major arrays that support the array passing "tricks" (legal and otherwise) common in FORTRAN 77 code. There are four types of arrays of each rank (number of dimensions):

  1. Base FArrays: FArray1DB, …, FArray5DB:
    These are abstract base classes of the three concrete FArray types of the same rank and are used for reference function arguments that can be passed FArrays of any of those types.
  2. Real FArrays: FArray1D, …, FArray5D:
    These are "real" arrays that control their own data.
  3. Proxy FArrays: FArray1Dp, …, FArray5Dp:
    These are "proxy" arrays that attach to all or part of the data of another array. Proxy FArrays support dynamic sizing via Dimension size parameter objects like real FArrays and can automatically reattach to the data of a reallocated source array when used as whole-array proxies.
  4. Argument FArrays: FArray1Da, …, FArray5Da:
    These are "proxy" arrays that use all or part of the data of another array. For performance reasons, unlike proxy FArrays they do not support Dimension-based dynamic sizing and cannot automatically reattach to reallocated source arrays. Argument arrays are primarily intended for function arguments to support passing arrays or parts of arrays of possibly different rank or dimensions than used in the function.

The FArrays are template-based and so can hold elements of any type. Shorthand typedef names are provided in the ObjexxFCL header files for FArrays of common types, such as FArray2D_float (which is short for the type FArray2D< float >).

Here are simplified Fortran and C++ versions of the same small function that show the basics of array usage in Rosetta++:

Fortran
  subroutine angles_get_template( Bxyz2, Bcentroids2 )

  real*8 Bxyz2(3,8), Bcentroids2(3,20)

  real*8 Bxyz(3,8), Bcentroids(3,20)
  common /angles_standard_residue/ Bxyz, Bcentroids

  integer i,j

  do i = 1,8
    do j = 1,3
      Bxyz2(j,i) = Bxyz(j,i)
    enddo
  enddo
  do i = 1,20
    do j = 1,3
      Bcentroids2(j,i) = Bcentroids(j,i)
    enddo
  enddo
  return
  end


C++: Argument Arrays Passed by Value
void
angles_get_template(
  FArray2Da_double Bxyz2,
  FArray2Da_double Bcentroids2
)
{
  using namespace angles_standard_residue;

  Bxyz2.dimension( 3, 8 );
  Bcentroids2.dimension( 3, 20 );

  for ( int i = 1; i <= 8; ++i ) {
    for ( int j = 1; j <= 3; ++j ) {
      Bxyz2(j,i) = Bxyz(j,i);
    }
  }
  for ( int i = 1; i <= 20; ++i ) {
    for ( int j = 1; j <= 3; ++j ) {
      Bcentroids2(j,i) = Bcentroids(j,i);
    }
  }
}

The index-based array element access is the same in the Fortran and C++ versions. Because FArrays are stored in column-major order like Fortran arrays (and unlike C-style arrays) the same nested loop sequencing is still efficient in Rosetta++.
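The storage order can be made concrete with a small sketch. The class below is a hypothetical stand-in written only with the standard library; it is not the ObjexxFCL FArray2D implementation. It assumes 1-based indexes and column-major storage, so the first index varies fastest in memory, and it checks index bounds with assert.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Fortran-style 2D array (not the ObjexxFCL class):
// 1-based indexes, column-major storage
class ColumnMajor2D
{
public:
  ColumnMajor2D( int n_rows, int n_cols ) :
    n_rows_( n_rows ),
    n_cols_( n_cols ),
    data_( static_cast< std::size_t >( n_rows ) * n_cols, 0.0 )
  {}

  // Offset of element (i,j) in the data block: the first index varies fastest
  std::size_t offset( int i, int j ) const
  {
    assert( i >= 1 && i <= n_rows_ && j >= 1 && j <= n_cols_ );
    return static_cast< std::size_t >( j - 1 ) * n_rows_ + ( i - 1 );
  }

  double & operator ()( int i, int j ) { return data_[ offset( i, j ) ]; }
  double operator ()( int i, int j ) const { return data_[ offset( i, j ) ]; }

private:
  int n_rows_, n_cols_;
  std::vector< double > data_;
};
```

With a 3 x 8 array, elements (1,1), (2,1), and (3,1) are adjacent in memory, which is why a loop nest with the first index innermost walks the data sequentially.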

FArrays support arbitrary index ranges like Fortran arrays, unlike C-style arrays, which only support 0-based indexes. As in Fortran, when a single dimension value is used the starting index defaults to 1. The DRange (for real and proxy FArrays) and SRange (for argument FArrays) types are used to specify index ranges with a starting value other than 1. Here are some sample array definitions:

Array Definitions
Fortran                   C++
REAL*8 A(3,8)             FArray2D_double A(3,8);
INTEGER B(-10:10,50)      FArray2D_int B(DRange(-10,10),50);
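How an index range with a starting value other than 1 maps onto storage can be sketched as follows. RangedArray1D is a hypothetical class used only for illustration; it does not reproduce the ObjexxFCL DRange or FArray interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D array with an arbitrary index range [lower, upper]
// (an illustration of the idea, not the ObjexxFCL DRange/FArray API)
class RangedArray1D
{
public:
  RangedArray1D( int lower, int upper ) :
    lower_( lower ),
    upper_( upper ),
    data_( static_cast< std::size_t >( upper - lower + 1 ), 0 )
  {}

  int size() const { return upper_ - lower_ + 1; }

  // Subtracting the lower bound maps the user index to a 0-based offset
  int & operator ()( int i )
  {
    assert( i >= lower_ && i <= upper_ );
    return data_[ static_cast< std::size_t >( i - lower_ ) ];
  }

private:
  int lower_, upper_;
  std::vector< int > data_;
};
```

An array built with the range -10 to 10 holds 21 elements, and index -10 refers to the first of them.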


Array Function Arguments

In the C++ version of the angles_get_template function above argument FArrays (indicated by the "a" in the FArray2Da_double type name) are used. Argument FArrays are pass-by-value proxies for the passed array that allow the function to declare the array of any rank or dimensions. The dimension calls set the dimensions of the array in the function. Although the argument arrays are passed by value, being proxies they hold a pointer to the actual array data so the array data is not copied and no dynamic allocation occurs during the argument array construction. But there is still a performance cost for the construction and dimensioning of argument arrays so faster methods should be used when Fortran-style array passing tricks are not needed.
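The proxy idea can be illustrated with a minimal non-owning view. View2D and fill_as_3_by_8 are hypothetical names, and the sketch makes no claim about how the ObjexxFCL argument FArrays are implemented; it only shows why passing such an object by value does not copy the array data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2D view over existing data (illustration only, not
// the ObjexxFCL argument FArray): 1-based, column-major, cheap to pass by value
class View2D
{
public:
  // Attach to a data block; the caller keeps ownership of the storage
  View2D( double * data, std::size_t size ) :
    data_( data ), size_( size ), n_rows_( 0 ), n_cols_( 0 )
  {}

  // Declare the shape the function wants to see; it must fit in the source data
  void dimension( int n_rows, int n_cols )
  {
    assert( n_rows > 0 && n_cols > 0 );
    assert( static_cast< std::size_t >( n_rows ) * n_cols <= size_ );
    n_rows_ = n_rows;
    n_cols_ = n_cols;
  }

  double & operator ()( int i, int j ) const
  {
    assert( i >= 1 && i <= n_rows_ && j >= 1 && j <= n_cols_ );
    return data_[ static_cast< std::size_t >( j - 1 ) * n_rows_ + ( i - 1 ) ];
  }

private:
  double * data_;
  std::size_t size_;
  int n_rows_, n_cols_;
};

// The view is passed by value, but writes go through to the caller's data
void fill_as_3_by_8( View2D v, double value )
{
  v.dimension( 3, 8 );
  for ( int j = 1; j <= 8; ++j ) {
    for ( int i = 1; i <= 3; ++i ) {
      v( i, j ) = value;
    }
  }
}
```

Copying the view copies a pointer and a few integers, yet constructing and dimensioning it still costs something on every call, which is the overhead the paragraph above refers to.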

For functions that are always going to be passed arrays of the rank and dimensions expected, array arguments can be declared as pass-by-reference base FArrays for greater efficiency. Since the passed array dimensions are used, no dimension call is required. Here is what the angles_get_template function looks like using base FArrays:

C++: Base FArrays Passed by Reference
void
angles_get_template(
  FArray2DB_double & Bxyz2,
  FArray2DB_double & Bcentroids2
)
{
  using namespace angles_standard_residue;

  for ( int i = 1; i <= 8; ++i ) {
    for ( int j = 1; j <= 3; ++j ) {
      Bxyz2(j,i) = Bxyz(j,i);
    }
  }
  for ( int i = 1; i <= 20; ++i ) {
    for ( int j = 1; j <= 3; ++j ) {
      Bcentroids2(j,i) = Bcentroids(j,i);
    }
  }
}

Because only a reference is passed for each array and no dimension calls are needed this is much more efficient. Note that the array use in the body of the function is unchanged. For small functions that are called many times using base FArrays instead of argument FArrays can provide a significant performance boost. Base FArray arguments have been used where possible in Rosetta++'s performance-critical functions.

Real FArrays passed by reference could be used for function arguments but these can only be passed real arrays, not argument or base arrays, so there is usually no reason to do this. The only case where using real FArrays as function arguments makes sense is if the function needs to resize the passed array, which cannot be done to base FArrays (dimension calls have such different meaning for real and proxy FArrays that code should know which type it is operating on).

Linear Array Indexing for Speed

Accessing multidimensional FArrays by standard multi-index accessors requires some multiplication and addition operations to locate the position of the desired element in the array's data block. As the array rank increases so does the cost of such accesses. In performance-critical loops such accesses can dominate the running time of a program. Fortran compilers can use their knowledge of Fortran array storage to optimize away many of the indexing computations from inner loops. But the C++ compiler does not have any such built-in knowledge about the FArrays or any other array libraries.

The FArray classes provide a linear indexing capability to get Fortran-like speed in performance-critical code sections. Linear indexing uses the [] operator with a single index that specifies the 0-based linear offset of an element in the array. The angles_get_template function using linear indexing is shown below.

C++: Linear Array Indexing
void
angles_get_template(
  FArray2DB_double & Bxyz2,
  FArray2DB_double & Bcentroids2
)
{
  using namespace angles_standard_residue;

  for ( int i = 1, l = 0; i <= 8; ++i ) {
    for ( int j = 1; j <= 3; ++j, ++l ) {
      Bxyz2[l] = Bxyz[l];
    }
  }
  for ( int i = 1, l = 0; i <= 20; ++i ) {
    for ( int j = 1; j <= 3; ++j, ++l ) {
      Bcentroids2[l] = Bcentroids[l];
    }
  }
}

Linear indexing can obscure loop code semantics so its use should be limited to performance bottleneck code as indicated by profiling and comments should be used to explain the usage. Where linear indexing has been used in Rosetta++ the equivalent multi-index is indicated in comments. See the ObjexxFCL documentation for more information on linear indexing.
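The correspondence between the running linear index and the multi-index can be checked in a standalone sketch. Here a plain std::vector stands in for the array data, and copy_linear is a hypothetical function; the assert inside the loop states the multi-index equivalent that a comment would normally document.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy a 3 x n_cols column-major block using a running linear index.
// Plain std::vector stands in for the array data (assumption for illustration).
void copy_linear(
  std::vector< double > const & src,
  std::vector< double > & dst,
  int n_cols
)
{
  assert( src.size() == dst.size() );
  assert( src.size() == static_cast< std::size_t >( 3 ) * n_cols );
  std::size_t l = 0;
  for ( int i = 1; i <= n_cols; ++i ) {
    for ( int j = 1; j <= 3; ++j, ++l ) {
      // Multi-index equivalent: dst(j,i) = src(j,i)
      assert( l == static_cast< std::size_t >( i - 1 ) * 3 + ( j - 1 ) );
      dst[ l ] = src[ l ];
    }
  }
}
```

The inner loop body needs no multiplication to locate an element: the offset simply advances by one per iteration because the loop order matches the storage order.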

FArray Notes

FArrays have some features that FORTRAN 77 arrays lack:

  • Default and copy construction.
  • Index range and size information is carried with FArrays and can be accessed.
  • Whole array assignments can assign, add, or subtract the value of one FArray to another (A=B; A+=B; A-=B;) or assign or modify each element by a constant value (A=0.0; A+=3; A-=4; A*=5; A/=6;). This eliminates loops and is more efficient for multidimensional arrays. Whole array assignments have been used to replace some loop assignments in Rosetta++.
  • Linear indexing for high performance (as described above).
  • Some useful common functions are provided as member or friend functions for the appropriate rank FArrays, including dot_product, length, length_squared, distance, distance_squared, normalize, identity, and transpose.
  • Real FArrays can be dynamically sized, via dimension constructor arguments, and resized, via dimension function calls, at run time.
  • Argument FArrays can be redimensioned at run time via dimension function calls.
  • Real and proxy FArrays support automated dynamic sizing via Dimension objects and expressions: the arrays are automatically resized, and if appropriate reinitialized, when a Dimension they depend on is changed.

FArray bounds checking is active in debug builds (when NDEBUG is undefined). This also checks that specified argument FArray dimensions are within the actual passed array dimensions when such information is available.


Strings

The Fstring string type provided by the ObjexxFCL is analogous to a Fortran character string: it has a fixed length, characters are indexed from 1, and comparisons ignore trailing blanks.
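These semantics can be approximated with std::string to see what they mean in practice. The helper names below are hypothetical and are not part of the ObjexxFCL; the sketch only mimics fixed-length assignment (truncate or pad with blanks) and comparison that ignores trailing blanks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of s with trailing spaces ignored
std::size_t trimmed_length( std::string const & s )
{
  std::size_t const last = s.find_last_not_of( ' ' );
  return ( last == std::string::npos ? 0 : last + 1 );
}

// Fortran-style equality: trailing blanks are not significant
bool equal_ignoring_trailing_blanks( std::string const & a, std::string const & b )
{
  return a.compare( 0, trimmed_length( a ), b, 0, trimmed_length( b ) ) == 0;
}

// Fortran-style assignment into a fixed-length string: truncate or pad with blanks
std::string assign_fixed( std::string const & value, std::size_t length )
{
  std::string result( value, 0, length ); // Copies at most length characters
  result.resize( length, ' ' );           // Pads with blanks if shorter
  assert( result.size() == length );
  return result;
}
```

Assigning "Fred" to a length-20 string under these rules yields "Fred" followed by 16 blanks, which still compares equal to "Fred".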

The examples below demonstrate some simple string declarations in Fortran and their equivalent in C++ using Fstring.

Fortran: CHARACTER String
CHARACTER*20 first
CHARACTER*20 last
CHARACTER*41 full
CHARACTER*23 short
CHARACTER*4 code

first = 'Fred'
last = 'Flintstone'
full = first(1:INDEX(first,' ')-1)//' '//last ! "Fred Flintstone"
short = first(1:1)//'. '//last ! "F. Flintstone"
code = first(1:1)//last(1:3) ! "FFli"

C++: Fstring
Fstring first( 20 );
Fstring last( 20 );
Fstring full( 41 );
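Fstring shortname( 23 ); // "short" is a reserved word in C++
Fstring code( 4 );

first = "Fred";
last = "Flintstone";
full = first.trimmed() + ' ' + last; // "Fred Flintstone"
// Use a substring here: a char plus a string literal would be pointer arithmetic
shortname = first(1,1) + ". " + last; // "F. Flintstone"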
code = first[1] + last(1,3); // "FFli"

Fstring Notes:

  • Single characters are accessed using the [] operator, as with std::string.
  • Substrings are formed by a two-index function, so the Fortran substring name(i:j) becomes name(i,j) with Fstring.
  • Concatenation follows the std::string usage: Fortran's first//space//last  becomes  first+space+last  with Fstring.
  • In C++, Fstring (and other) declarations can be mixed in with executable statements, unlike Fortran, and Fstrings can be constructed with an initial value. So in the example we could have had:
      Fstring full( first.trimmed() + ' ' + last );
    in place of the assignment to full.

char or Fstring(1)?

Many of the arrays and variables holding single characters have been migrated from Fstring of length one to char to save some space and time overhead. This may be done for more instances within Rosetta++ in future releases. An Fstring of length one behaves more like a Fortran CHARACTER*1 than char does, so some care must be used when making this change. A char can be silently converted to integer types in C++, so you lose some type safety with char, and for char variables c and d, c+d is an integer, not a two-character string (use std::string(1,c)+d or Fstring(c)+d to get concatenation).
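The integer promotion of char can be demonstrated with the standard library alone. This sketch assumes a C++11 compiler for static_assert and decltype; concat_chars is a hypothetical helper.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Adding two chars promotes both to int, so the result is a number, not text
static_assert(
  std::is_same< decltype( char() + char() ), int >::value,
  "char + char has type int"
);

// Concatenate two characters into a two-character std::string
std::string concat_chars( char c, char d )
{
  std::string s( 1, c ); // One copy of c
  s += d;
  assert( s.size() == 2 );
  return s;
}
```

Note that std::string has no constructor taking a single char, which is why the count argument is needed when building a string from one character.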

std::string or Fstring?

For some uses std::string may be more appropriate than Fstring, particularly when the fixed length nature of Fstrings is either not needed or not desired. Fstring includes conversions to and from std::string. The main differences between Fstring and std::string are:

  • std::strings can change length during their lifetime.
  • The indexing of characters in std::string begins with index 0, not 1.
  • Comparisons between std::strings do not ignore trailing spaces.
  • The std::string::substr function must be used to generate substrings and these substrings are copies, not active "windows" onto the original std::string.
  • Some member functions are different.

Non-Array Function Arguments

The initial conversion from Fortran to C++ used pass-by-reference for all non-array function arguments since this most closely matches the behavior of Fortran. Some cases where arguments of built-in numeric or bool types are not modified by the function were changed to pass-by-value for efficiency. This was necessary for arguments that were passed literal constants.

Some Fstring arguments were changed to pass by const reference. This was necessary for arguments that were passed literal or const strings.
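The reason literals force these changes is a C++ binding rule: a literal or temporary cannot bind to a non-const reference. The sketch below uses hypothetical functions and std::string in place of Fstring to show the three argument styles side by side.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pass-by-value: fine for small built-in types that the function only reads
int doubled( int n )
{
  return 2 * n;
}

// Pass-by-reference: needed when the function modifies the caller's variable,
// but the argument must then be a variable, not a literal
void increment( int & n )
{
  ++n;
}

// Pass-by-const-reference: avoids a copy and accepts literals and const objects
std::size_t label_length( std::string const & label )
{
  return label.size();
}

// Usage sketch
void reference_demo()
{
  int count = 1;
  increment( count );                     // OK: count is a modifiable variable
  // increment( 1 );                      // Would not compile: literal cannot bind to int &
  assert( count == 2 );
  assert( doubled( 21 ) == 42 );          // OK: literals can be passed by value
  assert( label_length( "LOOP" ) == 4u ); // OK: a temporary binds to a const reference
}
```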


Input/Output (I/O) Usage

C++ stream-based i/o has been used in Rosetta++ to provide a native C++ style. Formatted i/o with a Fortran-like usage is provided by the ObjexxFCL. The i/o statements look less like the Fortran than the rest of Rosetta++ but the correspondence is fairly clear:

Fortran: Formatted I/O
    ! Read
    read(funit,98,end=100) isize,begin,end,line
 98 format(i3,1x,I4,1x,I4,1x,A)

    ! Write
    do i=1,num_loop
      write(iunit,100) loop_begin(i), loop_end(i)
    enddo
100 format ("LOOP: ",2I6)

C++: Formatted I/O
// Read
funit >> bite( 3, isize ) >> skip( 1 ) >> bite( 4, begin ) >>
 skip( 1 ) >> bite( 4, end ) >> skip( 1 ) >> bite( line ) >> skip;
if ( funit.eof() ) goto L100;

// Write
for ( int i = 1; i <= num_loop; ++i ) {
  iunit << "LOOP: " << I( 6, loop_begin(i) ) << I( 6, loop_end(i) )
   << '\n';
}

The bite function calls have the form bite(w,v) where w is the field width and v is the variable that is assigned the value. The skip(w) calls skip over the next w characters.
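Fixed-width field reading can be pictured with a standard-library sketch. FieldReader is a hypothetical helper that works on one line held in a std::string; it is not the ObjexxFCL bite/skip implementation, and unlike Fortran it does not treat an all-blank integer field as zero (std::stoi throws in that case).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Cursor over one input line for fixed-width field reads (hypothetical helper,
// standing in for the stream-based bite/skip calls)
struct FieldReader
{
  explicit FieldReader( std::string const & line ) : line_( line ), pos_( 0 ) {}

  // Read the next w characters as an integer field
  int read_int( std::size_t w )
  {
    assert( pos_ + w <= line_.size() );
    int const value = std::stoi( line_.substr( pos_, w ) );
    pos_ += w;
    return value;
  }

  // Skip over the next w characters
  void skip( std::size_t w )
  {
    pos_ += w;
  }

  // Read the remainder of the line as text
  std::string read_rest()
  {
    std::string const rest( pos_ < line_.size() ? line_.substr( pos_ ) : std::string() );
    pos_ = line_.size();
    return rest;
  }

  std::string line_;
  std::size_t pos_;
};
```

Reading a line laid out as i3,1x,I4,1x,I4,1x,A then becomes read_int(3), skip(1), read_int(4), skip(1), read_int(4), skip(1), read_rest().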

There is a range of integer and floating point formatted output functions that correspond to Fortran format specifications. The integer output function calls have the form I(w,v) and I(w,m,v) where w is the field width, m is the minimum number of digits, and v is the value to output. The floating point output functions have the form F(w,d,v), E(w,d,v), and G(w,d,v) where w is the field width, d is the number of digits, and v is the value to output.
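For comparison, similar fixed-width output can be produced with the standard stream manipulators. The helpers int_field, fixed_field, and loop_line are hypothetical and only approximate the Fortran Iw and Fw.d edit descriptors: when a value does not fit, std::setw widens the field, whereas Fortran fills it with asterisks.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Integer right-justified in a field of width w (cf. Fortran Iw).
// Hypothetical helper built on the standard library, not the ObjexxFCL function.
std::string int_field( int w, int v )
{
  std::ostringstream out;
  out << std::setw( w ) << v;
  return out.str();
}

// Fixed-point value in a field of width w with d digits after the decimal
// point (cf. Fortran Fw.d)
std::string fixed_field( int w, int d, double v )
{
  std::ostringstream out;
  out << std::fixed << std::setprecision( d ) << std::setw( w ) << v;
  return out.str();
}

// Usage sketch: one line in the layout of the "LOOP: ",2I6 format
std::string loop_line( int loop_begin, int loop_end )
{
  std::string const line( "LOOP: " + int_field( 6, loop_begin ) + int_field( 6, loop_end ) );
  assert( line.size() == 18u ); // Holds while both values fit in six characters
  return line;
}
```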

The ObjexxFCL formatted i/o types and functions are declared in the ObjexxFCL::fmt nested namespace to provide the means to disambiguate the short output function names such as I, F, G, and E from other identifiers in scope.

Unfortunately, C++ stream-based i/o is currently significantly slower than Fortran formatted i/o so the performance impact of large i/o operations should be evaluated. For reads of a few large database files Rosetta++ uses C-style sscanf operations to avoid a large performance penalty.


DATA Initialization

The DATA statement does one-time initialization at the time of creation of a Fortran variable or array. The ObjexxFCL classes support the analogous construction-time initialization via constructors that accept a uniform initial value and the FArray constructors also accept an initializer function argument for nonuniform initialization.
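The two initialization styles can be sketched with a small stand-in class. Array1D and init_weights are hypothetical, and the constructor signatures are chosen for illustration rather than copied from the FArray interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Initializer function for nonuniform values (hypothetical example data)
void init_weights( std::vector< double > & w )
{
  assert( w.size() == 3 );
  w[ 0 ] = 0.5;
  w[ 1 ] = 1.0;
  w[ 2 ] = 2.0;
}

// Minimal array type whose constructors mirror the two initialization styles.
// This is an illustration, not the ObjexxFCL FArray interface.
class Array1D
{
public:
  typedef void ( *Initializer )( std::vector< double > & );

  // Uniform initial value
  Array1D( std::size_t n, double value ) : data_( n, value ) {}

  // Nonuniform initial values: the function fills the array once at construction
  Array1D( std::size_t n, Initializer init ) : data_( n ) { init( data_ ); }

  // 1-based element access
  double operator ()( std::size_t i ) const
  {
    assert( i >= 1 && i <= data_.size() );
    return data_[ i - 1 ];
  }

private:
  std::vector< double > data_;
};
```

Constructing with `Array1D zeros( 3, 0.0 )` corresponds to a DATA statement that sets every element to the same value, while `Array1D weights( 3, init_weights )` corresponds to one that lists individual values.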


COMMON Blocks

Rosetta's Fortran COMMON blocks were converted to C++ namespaces. All namespaces are declared and defined in header and source files separate from the Rosetta++ function source files. Function sources that use a namespace include its header file at the top of the source file. To bring the namespace identifiers into the function-local scope as in the Fortran, a using directive is used in place of the Fortran include statement or COMMON declaration in a function/subroutine. At this point there are many separate namespaces and namespace files but these will be consolidated in future releases.
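A single-file sketch of the pattern follows. The namespace and variable names are hypothetical; in Rosetta++ the declarations would sit in a header and the definitions in a separate source file rather than next to the function.

```cpp
#include <cassert>

// Stand-in for a converted COMMON block: in Rosetta++ the declarations would
// live in a header and the definitions in a separate source file. The names
// here are hypothetical.
namespace loops_ns {
  int num_loop = 0;
  int loop_begin[ 10 ] = { 0 };
}

// A function brings the namespace names into its local scope with a using
// directive, much as a Fortran routine includes the COMMON declaration
int first_loop_begin()
{
  using namespace loops_ns;

  assert( num_loop > 0 );
  return loop_begin[ 0 ];
}
```

Code outside the function can still reach the shared data with a qualified name such as loops_ns::num_loop.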


The ObjexxFCL Library

Separate documentation for the ObjexxFCL library provided with Rosetta++ has a description of the classes and functions. The ObjexxFCL library licensed with Rosetta++ is a copyrighted product of Objexx Engineering, Inc. but it is provided in source code form and with licensing that allows future extension and modification of the ObjexxFCL source by Rosetta++ developers.

The ObjexxFCL types and functions are contained in the ObjexxFCL namespace. The Rosetta++ ObjexxFCL header includes a
    using namespace ObjexxFCL;
directive to bring the ObjexxFCL identifiers into scope. The examples here also assume this directive has been specified. To use the ObjexxFCL without the using directive each identifier must be prefixed with ObjexxFCL::, as in ObjexxFCL::Fstring.


Fortran Intrinsic and Additional Functions in The ObjexxFCL

Character and string intrinsic functions are provided by the ObjexxFCL Fstring, char_functions, and string_functions modules. Member function equivalents are provided for most of these to facilitate migration to a more native C++ style. Other useful character and string functions are also provided.

Mathematical intrinsic functions for which no standard C++ equivalent exists are provided as inline functions in the ObjexxFCL Fmath.h header. This includes max and min functions with up to six arguments, two-argument max and min functions for numeric types that return by value (for efficiency), Fortran 95 bit functions, and equivalents to the Fortran MOD and SIGN functions. Additional mathematical functions are also provided in Fmath.h.
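As an illustration of what such equivalents compute, here is a sketch of MOD and SIGN written with the standard library. The names fortran_mod and fortran_sign are hypothetical and are not the names used in Fmath.h; the SIGN sketch treats a zero second argument as positive.

```cpp
#include <cassert>
#include <cmath>

// Fortran MOD for integers: remainder with the sign of the first argument.
// Since C++11 the built-in % operator has the same truncating behavior.
int fortran_mod( int a, int p )
{
  assert( p != 0 );
  return a % p;
}

// Fortran MOD for reals: std::fmod also takes the sign of the first argument
double fortran_mod( double a, double p )
{
  return std::fmod( a, p );
}

// Fortran SIGN(a,b): magnitude of a with the sign of b
double fortran_sign( double a, double b )
{
  return ( b >= 0.0 ? std::abs( a ) : -std::abs( a ) );
}
```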


Support

Support for building and developing Rosetta++ will be provided by software engineers in some/all of the labs and by Objexx Engineering. In the meantime, limited help can be obtained from: