Introduction to Rosetta++
Last Updated: May 15, 2005
What Is Rosetta++?
Rosetta++ is a reimplementation of the Rosetta biomolecular modeling program in the C++ language. It is named Rosetta++ to distinguish it from the Fortran implementation.
Rosetta++ as released in mid 2004 is the first phase of an evolutionary modernization to a more reliable and extensible design. The goals of this phase were to produce a C++ Rosetta that preserved the "look and feel" of the Fortran and had near-Fortran performance. Future phases will focus on migration to an object-oriented architecture that localizes behavioral properties and provides simpler and safer interfaces for development of new scientific algorithms within a modular, extensible framework.
Rosetta++ vs. Rosetta
Rosetta++ preserves the "look and feel" of the Fortran implementation by using a library, ObjexxFCL, to provide Fortran-compatible array, string, i/o, and intrinsic function support. Modifying and extending Rosetta++ code should be straightforward for researchers familiar with the Fortran version: array and string indexing, array passing "tricks", and other Fortran "features" are unchanged. The key issues to be aware of when working with the Rosetta++ code are discussed below.
At this point Rosetta++ still has the stylistic and design attributes of the Fortran Rosetta. The major reliability and extensibility improvements will come later, but even while working within the current design C++ compilers will catch many of the typical errors that Fortran compilers don't.
Obtaining near-Fortran performance with Rosetta++ required mode-specific profiling and hand tuning the bottleneck code. For the modes we tuned for so far (docking, design, and relax cases) Rosetta++ is 0-10% slower than the Fortran Rosetta when a comparable workload is performed (in individual cases where small numerical differences lead to a different computational sequence the speed can vary by more). Further tuning and the planned move to right-sized arrays should narrow or eliminate the gap.
Due to the performance impact of C++ stream-based i/o, quick test cases perform worse, relatively, than real-world runs.
One area where Rosetta++ suffers by comparison is in compile times. C++ is simply a much more complex language than FORTRAN 77 and compiles a lot slower. GCC is notoriously slow, but even with faster compilers a major Rosetta++ rebuild will provide a good opportunity for a coffee break. The slowness of C++ compiles is somewhat ameliorated by the fact that Rosetta++ constants and array dimensions are defined in source (.cc), not header (.h), files so modification of these will not trigger a large number of compiles. On the other hand, changes to function interfaces will trigger many compiles due to the required presence of C++ function prototypes in the headers.
Rosetta++ Overview
Source Code Organization
The file structure of Rosetta++ is similar to that of the Fortran Rosetta with some additions. A source (.cc) file exists for each Fortran source (.f) file. Header (.h) files with function prototypes are used for each source file containing functions. Additional source and header files are used for the namespaces that correspond to the COMMON blocks of the Fortran Rosetta. The ObjexxFCL library source is contained in the ObjexxFCL subdirectory of the main source directory.
Fortran -> C++ Conversion Highlights
Looking at the Fortran and C++ versions of a simple Rosetta routine shows some of the flavor of the conversion.
Fortran: contact_order
C++: contact_order
Comparing the Fortran and C++ we can see that there is a lot of similarity and that:
Fortran -> C++ Conversion Attributes
Some of the main attributes of the conversion are shown below.
Data Types
Language Constructs
Arrays
The ObjexxFCL library provides a system of Fortran-like arrays of up to five dimensions called FArrays. The FArrays are contiguous, column-major arrays that support the array passing "tricks" (legal and otherwise) common in FORTRAN 77 code. There are four types of arrays of each rank (number of dimensions):
The FArrays are template-based and so can hold elements of any type. Shorthand typedef names are provided in the ObjexxFCL header files for FArrays of common types, such as FArray2D_float (which is short for the type FArray2D< float >).
Here are simplified Fortran and C++ versions of the same small function that show the basics of array usage in Rosetta++:
Fortran
C++: Argument Arrays Passed by Value
The index-based array element access is the same in the Fortran and C++ versions. Because FArrays are stored in column-major order like Fortran arrays (and unlike C-style arrays) the same nested loop sequencing is still efficient in Rosetta++.
FArrays support arbitrary index ranges like Fortran arrays, unlike C-style arrays, which only support 0-based indexes. As in Fortran, when a single dimension value is used the starting index defaults to 1. The DRange (for real and proxy FArrays) and SRange (for argument FArrays) types are used to specify index ranges with a starting value other than 1. Here are some sample array definitions:
Array Definitions
|
void
angles_get_template(
	FArray2DB_double & Bxyz2,
	FArray2DB_double & Bcentroids2
)
{
	using namespace angles_standard_residue;
	for ( int i = 1; i <= 8; ++i ) {
		for ( int j = 1; j <= 3; ++j ) {
			Bxyz2(j,i) = Bxyz(j,i);
		}
	}
	for ( int i = 1; i <= 20; ++i ) {
		for ( int j = 1; j <= 3; ++j ) {
			Bcentroids2(j,i) = Bcentroids(j,i);
		}
	}
}
Because only a reference is passed for each array and no dimension calls are needed, this is much more efficient. Note that the array use in the body of the function is unchanged. For small functions that are called many times, using base FArrays instead of argument FArrays can provide a significant performance boost. Base FArray arguments have been used where possible in Rosetta++'s performance-critical functions.
Real FArrays passed by reference could be used for function arguments, but these can only be passed real arrays, not argument or base arrays, so there is usually no reason to do this. The only case where using real FArrays as function arguments makes sense is if the function needs to resize the passed array, which cannot be done to base FArrays (dimension calls have such different meanings for real and proxy FArrays that code should know which type it is operating on).
Accessing multidimensional FArrays by standard multi-index accessors requires some multiplication and addition operations to locate the position of the desired element in the array's data block. As the array rank increases so does the cost of such accesses. In performance-critical loops such accesses can dominate the running time of a program. Fortran compilers can use their knowledge of Fortran array storage to optimize away many of the indexing computations from inner loops. But the C++ compiler does not have any such built-in knowledge about the FArrays or any other array libraries.
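The arithmetic behind a multi-index access can be sketched as follows. This is a minimal illustration that assumes 1-based indexes in every dimension; the function names offset_2d and offset_3d are hypothetical and are not part of the ObjexxFCL.

```cpp
#include <cstddef>

// Hypothetical helpers, not ObjexxFCL code: 0-based position of an element
// in the data block of a column-major array whose indexes all start at 1.

// Rank 2: element (i1,i2) of an array with first dimension size n1
inline std::size_t
offset_2d( int i1, int i2, int n1 )
{
	return static_cast< std::size_t >( ( i1 - 1 ) + n1 * ( i2 - 1 ) );
}

// Rank 3: each added dimension costs one more multiply and add per access
inline std::size_t
offset_3d( int i1, int i2, int i3, int n1, int n2 )
{
	return static_cast< std::size_t >(
		( i1 - 1 ) + n1 * ( ( i2 - 1 ) + n2 * ( i3 - 1 ) ) );
}
```

For a 3 x 8 array, offset_2d( 1, 2, 3 ) is 3: the first element of the second column directly follows the three elements of the first column.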
The FArray classes provide a linear indexing capability to get Fortran-like speed in performance-critical code sections. Linear indexing uses the [] operator with a single index that specifies the 0-based linear offset of an element in the array. The angles_get_template function using linear indexing is shown below.
C++: Linear Array Indexing
void
angles_get_template(
	FArray2DB_double & Bxyz2,
	FArray2DB_double & Bcentroids2
)
{
	using namespace angles_standard_residue;
	for ( int i = 1, l = 0; i <= 8; ++i ) {
		for ( int j = 1; j <= 3; ++j, ++l ) {
			Bxyz2[l] = Bxyz[l];
		}
	}
	for ( int i = 1, l = 0; i <= 20; ++i ) {
		for ( int j = 1; j <= 3; ++j, ++l ) {
			Bcentroids2[l] = Bcentroids[l];
		}
	}
}
Linear indexing can obscure loop code semantics, so its use should be limited to performance bottleneck code identified by profiling, and comments should explain the usage. Where linear indexing has been used in Rosetta++ the equivalent multi-index is indicated in comments. See the ObjexxFCL documentation for more information on linear indexing.
FArrays have some features that FORTRAN 77 arrays lack:
FArray bounds checking is active in debug builds (when NDEBUG is not defined). This also checks that specified argument FArray dimensions are within the actual passed array dimensions when such information is available.
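The general technique of debug-only bounds checking can be sketched with the standard assert macro, which is compiled out when NDEBUG is defined. The CheckedArray1 class below is a hypothetical stand-in written for illustration; it is not how the FArray classes are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-based array with debug-only bounds checking (not an FArray)
class CheckedArray1
{
public:

	explicit
	CheckedArray1( int n ) :
		data_( static_cast< std::size_t >( n ), 0.0 )
	{}

	// 1-based element access
	double &
	operator ()( int i )
	{
		// Active in debug builds; removed when NDEBUG is defined
		assert( ( i >= 1 ) && ( i <= size() ) );
		return data_[ static_cast< std::size_t >( i - 1 ) ];
	}

	int
	size() const
	{
		return static_cast< int >( data_.size() );
	}

private:

	std::vector< double > data_;
};
```

In a debug build an access such as a(0) stops the program at the assert, while a release build pays no checking cost.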
The Fstring string type provided by the ObjexxFCL is analogous to a Fortran character string: it has a fixed length, characters are indexed from 1, and comparisons ignore trailing blanks.
The examples below demonstrate some simple string declarations in Fortran and their equivalent in C++ using Fstring.
Fortran: CHARACTER String
CHARACTER*20 first
CHARACTER*20 last
CHARACTER*41 full
CHARACTER*23 short
CHARACTER*4 code
first = 'Fred'
last = 'Flintstone'
full = first(1:INDEX(first,' ')-1)//' '//last ! "Fred Flintstone"
short = first(1:1)//'. '//last ! "F. Flintstone"
code = first(1:1)//last(1:3) ! "FFli"
C++: Fstring
Fstring first( 20 );
Fstring last( 20 );
Fstring full( 41 );
Fstring short_name( 23 ); // "short" is a reserved word in C++
Fstring code( 4 );
first = "Fred";
last = "Flintstone";
full = first.trimmed() + ' ' + last; // "Fred Flintstone"
short_name = first(1,1) + ". " + last; // "F. Flintstone"
code = first[1] + last(1,3); // "FFli"
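The Fortran-like string behavior described above (fixed length, blank padding, comparisons that ignore trailing blanks) can be approximated with std::string for illustration. The helper functions below are hypothetical and are not the Fstring implementation; they only sketch the semantics.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers that mimic Fortran CHARACTER semantics with
// std::string; they are not part of the ObjexxFCL.

// Store s in a fixed-length field: truncate or pad with blanks on the right
inline std::string
fixed_assign( std::string const & s, std::size_t len )
{
	std::string field( s.substr( 0, len ) );
	field.resize( len, ' ' );
	return field;
}

// Copy of s without its trailing blanks
inline std::string
trimmed( std::string const & s )
{
	std::size_t const last = s.find_last_not_of( ' ' );
	return ( last == std::string::npos ? std::string() : s.substr( 0, last + 1 ) );
}

// Equality that ignores trailing blanks, as Fortran CHARACTER comparison does
inline bool
equal_ignoring_trailing_blanks( std::string const & a, std::string const & b )
{
	return trimmed( a ) == trimmed( b );
}
```

With these helpers a 20-character field holding "Fred" still compares equal to the 4-character string "Fred", which is the behavior a Fortran programmer expects.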
Fstring Notes:
Many of the arrays and variables holding single characters have been migrated from Fstring of length one to char to save some space and time overhead. This may be done for more instances within Rosetta++ in future releases. An Fstring of length one behaves more like a Fortran CHARACTER*1 than char does, so some care must be used when making this change. A char can be silently converted to integer types in C++, so you lose some type safety with char, and for char variables c and d, c+d is an integer, not a two-character string (use std::string(1,c)+d or Fstring(c)+d to get concatenation; std::string has no constructor that takes a single char).
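The char pitfall can be shown with standard C++ alone. In this short sketch the function names concat_chars and char_sum are hypothetical and exist only for illustration.

```cpp
#include <string>

// Concatenate two characters into a two-character string
inline std::string
concat_chars( char c, char d )
{
	// Build a one-character std::string from c first; c + d alone would
	// be an int holding the sum of the two character codes
	return std::string( 1, c ) + d;
}

// Adding two chars promotes both to int: the result is a number, not a string
inline int
char_sum( char c, char d )
{
	return c + d;
}
```

Here concat_chars( 'A', 'B' ) gives the string "AB", while char_sum( 'A', 'B' ) gives the sum of the two character codes.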
For some uses std::string may be more appropriate than Fstring, particularly when the fixed length nature of Fstrings is either not needed or not desired. Fstring includes conversions to and from std::string. The main differences between Fstring and std::string are:
The initial conversion from Fortran to C++ used pass-by-reference for all non-array function arguments since this most closely matches the behavior of Fortran. Some arguments of built-in numeric or bool types that are not modified by the function were changed to pass-by-value for efficiency. This change was necessary for arguments that are passed literal constants, because a literal cannot be bound to a non-const reference.
Some Fstring arguments were changed to pass by const reference. This was necessary for arguments that were passed literal or const strings.
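The three argument-passing styles mentioned above can be compared in a small sketch. The function names are hypothetical, and std::string stands in for Fstring so that the example needs only the standard library.

```cpp
#include <cstddef>
#include <string>

// Non-const reference: matches Fortran argument passing and lets the
// function modify the caller's variable, but a literal such as 3 cannot
// be passed because it cannot bind to int &
inline void
increment( int & n )
{
	++n;
}

// Pass by value: cheap for built-in types and accepts literal constants
inline int
doubled( int n )
{
	return 2 * n;
}

// Pass by const reference: avoids a copy and accepts literal or const strings
inline std::size_t
name_length( std::string const & name )
{
	return name.size();
}
```

A call such as increment( 3 ) is rejected by the compiler, whereas doubled( 3 ) and name_length( "Fred" ) are both accepted.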
C++ stream-based i/o has been used in Rosetta++ to provide a native C++ style. Formatted i/o with a Fortran-like usage is provided by the ObjexxFCL. The i/o statements look less like the Fortran than the rest of Rosetta++ but the correspondence is fairly clear:
Fortran: Formatted I/O
! Read
read(funit,98,end=100) isize,begin,end,line
98 format(i3,1x,I4,1x,I4,1x,A)
! Write
do i=1,num_loop
	write(iunit,100) loop_begin(i), loop_end(i)
enddo
100 format ("LOOP: ",2I6)
C++: Formatted I/O
// Read
funit >> bite( 3, isize ) >> skip( 1 ) >> bite( 4, begin ) >>
	skip( 1 ) >> bite( 4, end ) >> skip( 1 ) >> bite( line ) >> skip;
if ( funit.eof() ) goto L100;
// Write
for ( int i = 1; i <= num_loop; ++i ) {
	iunit << "LOOP: " << I( 6, loop_begin(i) ) << I( 6, loop_end(i) )
		<< '\n';
}
The bite function calls have the form bite(w,v) where w is the field width and v is the variable that is assigned the value. The skip(w) calls skip over the next w characters.
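The idea of width-based input can be sketched with standard streams. The functions read_int_field and skip_chars below are hypothetical illustrations of reading a fixed-width field and skipping characters; they are not the ObjexxFCL bite and skip implementations.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: read exactly w characters from the stream and parse
// them as a decimal integer, in the manner of a Fortran Iw edit descriptor
inline int
read_int_field( std::istream & in, int w )
{
	std::string field( static_cast< std::size_t >( w ), ' ' );
	in.read( &field[ 0 ], w );
	field.resize( static_cast< std::size_t >( in.gcount() ) ); // Short read at end of input
	std::istringstream parser( field );
	int value = 0;
	parser >> value;
	return value;
}

// Hypothetical helper: skip over the next w characters
inline void
skip_chars( std::istream & in, int w )
{
	in.ignore( w );
}
```

Reading by width rather than by whitespace matters for fixed-column files, where adjacent fields may not be separated by blanks.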
There is a range of integer and floating point formatted output functions that correspond to Fortran format specifications. The integer output function calls have the form I(w,v) and I(w,m,v) where w is the field width, m is the minimum number of digits, and v is the value to output. The floating point output functions have the form F(w,d,v), E(w,d,v), and G(w,d,v) where w is the field width, d is the number of digits, and v is the value to output.
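Width-based output of this kind can be approximated with the standard stream manipulators. The helpers int_field and fixed_field below are hypothetical names chosen for this sketch; they are not the ObjexxFCL I and F functions and may differ from them in detail.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper, similar in spirit to Fortran Iw: right-justify an
// integer in a field of width w
inline std::string
int_field( int w, int v )
{
	std::ostringstream out;
	out << std::setw( w ) << v;
	return out.str();
}

// Hypothetical helper, similar in spirit to Fortran Fw.d: fixed-point
// output with d digits after the decimal point in a field of width w
inline std::string
fixed_field( int w, int d, double v )
{
	std::ostringstream out;
	out << std::fixed << std::setprecision( d ) << std::setw( w ) << v;
	return out.str();
}
```

One difference from Fortran is worth noting: when a value does not fit in the field, std::setw lets the output grow wider, whereas Fortran fills the field with asterisks.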
The ObjexxFCL formatted i/o types and functions are declared in the ObjexxFCL::fmt nested namespace to provide the means to disambiguate the short output function names such as I, F, G, and E from other identifiers in scope.
Unfortunately, C++ stream-based i/o is currently significantly slower than Fortran formatted i/o so the performance impact of large i/o operations should be evaluated. For reads of a few large database files Rosetta++ uses C-style sscanf operations to avoid a large performance penalty.
The DATA statement does one-time initialization at the time of creation of a Fortran variable or array. The ObjexxFCL classes support the analogous construction-time initialization via constructors that accept a uniform initial value and the FArray constructors also accept an initializer function argument for nonuniform initialization.
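Construction-time initialization of both kinds can be sketched with a small class. Table1 and fill_squares are hypothetical names used only for this illustration, and the constructor signatures shown are assumptions rather than the FArray interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1-based table supporting uniform and function-based
// initialization at construction (not an FArray)
class Table1
{
public:

	typedef void ( *Initializer )( Table1 & );

	// Uniform initialization: every element starts with the same value
	Table1( int n, double value ) :
		data_( static_cast< std::size_t >( n ), value )
	{}

	// Nonuniform initialization: the initializer function fills the
	// table once, when the table is constructed
	Table1( int n, Initializer init ) :
		data_( static_cast< std::size_t >( n ), 0.0 )
	{
		init( *this );
	}

	double &
	operator ()( int i )
	{
		return data_[ static_cast< std::size_t >( i - 1 ) ];
	}

	int
	size() const
	{
		return static_cast< int >( data_.size() );
	}

private:

	std::vector< double > data_;
};

// Example initializer: element i holds i squared
inline void
fill_squares( Table1 & t )
{
	for ( int i = 1; i <= t.size(); ++i ) {
		t( i ) = static_cast< double >( i * i );
	}
}
```

A table declared as Table1 squares( 4, fill_squares ) is fully populated before its first use, which is the role the DATA statement plays in Fortran.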
Rosetta's Fortran COMMON blocks were converted to C++ namespaces. All namespaces are declared and defined in header and source files separate from the Rosetta++ function source files. Function sources that use a namespace include its header file at the top of the source file. To bring the namespace identifiers into the function-local scope as in the Fortran, a using directive (using namespace) is used in place of the Fortran include statement or COMMON declaration in a function/subroutine. At this point there are many separate namespaces and namespace files but these will be consolidated in future releases.
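The layout described above can be sketched in a single listing, with comments marking where each part would live. The namespace params, its variables, and the function within_cutoff are hypothetical and do not come from the Rosetta++ source.

```cpp
// --- params.h (hypothetical header): declarations only ---
namespace params {
	extern int max_res;
	extern double cutoff;
}

// --- params.cc (hypothetical source): definitions and initial values ---
// Changing a value here does not touch the header, so files that include
// the header do not need to be recompiled
namespace params {
	int max_res = 1000;
	double cutoff = 8.0;
}

// --- a function source file that includes params.h ---
inline bool
within_cutoff( double distance )
{
	using namespace params; // Takes the place of the Fortran include/COMMON
	return ( distance <= cutoff );
}
```

Because the using directive sits inside the function body, the namespace names are visible only there, much as a COMMON declaration is local to one Fortran routine.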
Separate documentation for the ObjexxFCL library provided with Rosetta++ has a description of the classes and functions. The ObjexxFCL library licensed with Rosetta++ is a copyrighted product of Objexx Engineering, Inc. but it is provided in source code form and with licensing that allows future extension and modification of the ObjexxFCL source by Rosetta++ developers.
The ObjexxFCL types and functions are contained in the ObjexxFCL namespace. The Rosetta++ ObjexxFCL header includes a
using namespace ObjexxFCL;
directive to bring the ObjexxFCL identifiers into scope. The examples here also assume this directive has been specified. To use the ObjexxFCL without the using directive each identifier must be prefixed with ObjexxFCL::, as in ObjexxFCL::Fstring.
Character and string intrinsic functions are provided by the ObjexxFCL Fstring, char_functions, and string_functions modules. Member function equivalents are provided for most of these to facilitate migration to a more native C++ style. Other useful character and string functions are also provided.
Mathematical intrinsic functions for which no standard C++ equivalent exists are provided as inline functions in the ObjexxFCL Fmath.h header. This includes max and min functions with up to six arguments, two-argument max and min functions for numeric types that return by value (for efficiency), Fortran 95 bit functions, and equivalents to the Fortran MOD and SIGN functions. Additional mathematical functions are also provided in Fmath.h.
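The semantics of a few of these intrinsics can be sketched in standard C++. The names f_mod, f_sign, and max3 are hypothetical and chosen for this illustration; they are not the names or implementations used in Fmath.h.

```cpp
#include <algorithm>
#include <cmath>

// Fortran MOD(a,p) for integers: the remainder carries the sign of a.
// The C++ % operator truncates toward zero (guaranteed since C++11), so it
// gives the same result.
inline int
f_mod( int a, int p )
{
	return a % p;
}

// Fortran MOD(a,p) for reals: std::fmod has the same sign convention
inline double
f_mod( double a, double p )
{
	return std::fmod( a, p );
}

// Fortran SIGN(a,b): the magnitude of a with the sign of b
inline double
f_sign( double a, double b )
{
	return ( b >= 0.0 ? std::fabs( a ) : -std::fabs( a ) );
}

// Three-argument max built from the standard two-argument std::max
template< typename T >
inline T
max3( T a, T b, T c )
{
	return std::max( a, std::max( b, c ) );
}
```

For example, f_mod( -7, 3 ) is -1 and f_sign( 3.0, -2.0 ) is -3.0, matching the Fortran MOD and SIGN results for the same arguments.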
Support for building and developing Rosetta++ will be provided by software engineers in some/all of the labs and by Objexx Engineering. In the meantime, limited help can be obtained from: