PyRosettaCluster is a class for reproducible, high-throughput distribution of
user-defined PyRosetta protocols, efficiently parallelized on the user's
local computer, high-performance computing (HPC) cluster, or elastic cloud
computing infrastructure with available compute resources.
Args:
tasks: A `list` of `dict` objects, a callable or called function returning
a `list` of `dict` objects, or a callable or called generator yielding
a `list` of `dict` objects. Each dictionary object element of the list
is accessible via kwargs in the user-defined PyRosetta protocols.
To initialize PyRosetta with user-defined PyRosetta command line options
at the start of each user-defined PyRosetta protocol, `extra_options`
and/or `options` must be a key of each dictionary object, where the value
is a `str`, `tuple`, `list`, `set`, or `dict` of PyRosetta command line
options.
Default: [{}]
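For instance, `tasks` may be supplied as a plain list of dictionaries or as a
generator function. A minimal sketch (the option strings and the "resolution"
key below are illustrative placeholders, not required keys):

```python
# Sketch of the `tasks` argument: each dictionary becomes the kwargs of the
# user-defined PyRosetta protocols. The "options"/"extra_options" values and
# the "resolution" key are illustrative placeholders.
def create_tasks():
    """Yield one task dictionary per independent trajectory."""
    for resolution in ("low", "high"):
        yield {
            "options": "-ex1",                  # PyRosetta command line options
            "extra_options": "-out:level 300",  # additional command line options
            "resolution": resolution,           # arbitrary user-defined kwarg
        }

tasks = list(create_tasks())
```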
input_packed_pose: Optional input `PackedPose` object that is accessible via
the first argument of the first user-defined PyRosetta protocol.
Default: None
seeds: A `list` of `int` objects specifying the random number generator seeds
to use for each user-defined PyRosetta protocol. The number of seeds
provided must be equal to the number of user-defined input PyRosetta
protocols. Seeds are used in the same order that the user-defined PyRosetta
protocols are executed.
Default: None
decoy_ids: A `list` of `int` objects specifying the decoy numbers to keep after
executing user-defined PyRosetta protocols. User-provided PyRosetta
protocols may return a list of `Pose` and/or `PackedPose` objects, or
yield multiple `Pose` and/or `PackedPose` objects. To reproduce a
particular decoy generated via the chain of user-provided PyRosetta
protocols, the decoy number to keep for each protocol may be specified,
where other decoys are discarded. Decoy numbers use zero-based indexing,
so `0` is the first decoy generated from a particular PyRosetta protocol.
The number of decoy_ids provided must be equal to the number of
user-defined input PyRosetta protocols, so that one decoy is saved for each
user-defined PyRosetta protocol. Decoy ids are applied in the same order
that the user-defined PyRosetta protocols are executed.
Default: None
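The pairing of `seeds` and `decoy_ids` with a protocol chain can be sketched as
follows (the protocol names are hypothetical placeholders; each list must have
exactly one entry per user-defined PyRosetta protocol):

```python
# Hypothetical two-protocol chain: seeds and decoy_ids are applied in the
# same order that the protocols are executed.
protocols = ["design_protocol", "relax_protocol"]  # hypothetical names
seeds = [-111111111, -222222222]  # one RNG seed per protocol
decoy_ids = [0, 2]  # keep the 1st decoy from design, the 3rd from relax

# Both lists must match the number of protocols.
assert len(seeds) == len(protocols)
assert len(decoy_ids) == len(protocols)
```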
client: An initialized dask `distributed.client.Client` object to be used as
the dask client interface to the local or remote compute cluster. If `None`,
then PyRosettaCluster initializes its own dask client based on the
`PyRosettaCluster(scheduler=...)` class attribute. Deprecated by the
`PyRosettaCluster(clients=...)` class attribute, but supported for legacy
purposes. At least one of the `client` and `clients` attribute parameters
must be `None`; they cannot both be set.
Default: None
clients: A `list` or `tuple` object of initialized dask `distributed.client.Client`
objects to be used as the dask client interface(s) to the local or remote compute
cluster(s). If `None`, then PyRosettaCluster initializes its own dask client based
on the `PyRosettaCluster(scheduler=...)` class attribute. Optionally used in
combination with the `PyRosettaCluster().distribute(clients_indices=...)` method.
At least one of the `client` and `clients` attribute parameters must be `None`; they cannot both be set.
See the `PyRosettaCluster().distribute()` method docstring for usage examples.
Default: None
scheduler: A `str` of either "sge" or "slurm", or `None`. If "sge", then
PyRosettaCluster schedules jobs using `SGECluster` with `dask-jobqueue`.
If "slurm", then PyRosettaCluster schedules jobs using `SLURMCluster` with
`dask-jobqueue`. If `None`, then PyRosettaCluster schedules jobs using
`LocalCluster` with `dask.distributed`. If `PyRosettaCluster(client=...)`
or `PyRosettaCluster(clients=...)` is provided, then
`PyRosettaCluster(scheduler=...)` is ignored.
Default: None
cores: An `int` object specifying the total number of cores per job, which
is input to the `dask_jobqueue.SLURMCluster(cores=...)` argument or
the `dask_jobqueue.SGECluster(cores=...)` argument.
Default: 1
processes: An `int` object specifying the total number of processes per job,
which is input to the `dask_jobqueue.SLURMCluster(processes=...)` argument
or the `dask_jobqueue.SGECluster(processes=...)` argument.
This divides each job into the specified number of processes.
Default: 1
memory: A `str` object specifying the total amount of memory per job, which
is input to the `dask_jobqueue.SLURMCluster(memory=...)` argument or
the `dask_jobqueue.SGECluster(memory=...)` argument.
Default: "4g"
scratch_dir: A `str` object specifying the path to a scratch directory where
dask litter may go.
Default: "/temp" if it exists, otherwise the current working directory
min_workers: An `int` object specifying the minimum number of workers to
which to adapt during parallelization of user-provided PyRosetta protocols.
Default: 1
max_workers: An `int` object specifying the maximum number of workers to
which to adapt during parallelization of user-provided PyRosetta protocols.
Default: 1000 if the initial number of `tasks` is less than 1000, otherwise
the initial number of `tasks`
dashboard_address: A `str` object specifying the port over which the dask
dashboard is forwarded. Particularly useful for diagnosing PyRosettaCluster
performance in real-time.
Default: ":8787"
nstruct: An `int` object specifying the number of repeats of the first
user-provided PyRosetta protocol. The user can control the number of
repeats of subsequent user-provided PyRosetta protocols via returning
multiple clones of the output pose(s) from a user-provided PyRosetta
protocol run earlier, or cloning the input pose(s) multiple times in a
user-provided PyRosetta protocol run later.
Default: 1
compressed: A `bool` object specifying whether or not to compress the output
'.pdb' files with bzip2, resulting in '.pdb.bz2' files.
Default: True
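The output compression is standard bzip2, so a '.pdb.bz2' file written by
PyRosettaCluster can be read back with Python's `bz2` module. A round-trip
sketch with a placeholder PDB line:

```python
import bz2

# Round-trip sketch of the bzip2 compression applied to output '.pdb' files.
pdb_text = "ATOM      1  N   ALA A   1      11.104   6.134  -6.504\n"
compressed = bz2.compress(pdb_text.encode())   # contents of a '.pdb.bz2' file
restored = bz2.decompress(compressed).decode() # reading a decoy back
assert restored == pdb_text
```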
compression: A `str` object of 'xz', 'zlib' or 'bz2', or a `bool` or `NoneType`
object representing the internal compression library for pickled `PackedPose`
objects and user-defined PyRosetta protocol `kwargs` objects. The default of
`True` uses 'xz' for serialization if it's installed, otherwise uses 'zlib'
for serialization.
Default: True
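The default `compression=True` behavior can be sketched as an import fallback
(a simplified approximation, not the library's actual code):

```python
import importlib

# Simplified approximation of the default `compression=True` behavior:
# prefer 'xz' (the lzma module) if installed, otherwise fall back to 'zlib'.
try:
    importlib.import_module("lzma")
    compression = "xz"
except ImportError:
    compression = "zlib"
```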
system_info: A `dict` or `NoneType` object specifying the system information
required to reproduce the simulation. If `None` is provided, then PyRosettaCluster
automatically detects the platform and returns this attribute as a dictionary
{'sys.platform': `sys.platform`} (for example, {'sys.platform': 'linux'}).
If a `dict` is provided, then validate that the 'sys.platform' key has a value
equal to the current `sys.platform`, and log a warning message if not.
Additional system information such as Amazon Machine Image (AMI) identifier
and compute fleet instance type identifier may be stored in this dictionary,
but is not validated. This information is stored in the simulation records for
accounting.
Default: None
pyrosetta_build: A `str` or `NoneType` object specifying the PyRosetta build as
output by `pyrosetta._version_string()`. If `None` is provided, then PyRosettaCluster
automatically detects the PyRosetta build and sets this attribute as the `str`.
If a `str` is provided, then validate that the input PyRosetta build is equal
to the active PyRosetta build, and log a warning message if not.
Default: None
sha1: A `str` or `NoneType` object specifying the git SHA1 hash string of the
particular git commit being simulated. If a non-empty `str` object is provided,
then it is validated to match the SHA1 hash string of the current HEAD,
and then it is added to the simulation record for accounting. If an empty string
is provided, then ensure that everything in the working directory is committed
to the repository. If `None` is provided, then bypass SHA1 hash string
validation and set this attribute to an empty string.
Default: ""
project_name: A `str` object specifying the project name of this simulation.
This option just adds the user-provided project_name to the scorefile
for accounting.
Default: datetime.now().strftime("%Y.%m.%d.%H.%M.%S.%f") if not specified,
or "PyRosettaCluster" if explicitly set to `None`
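The timestamp default follows the `datetime.strftime` format shown, producing
a microsecond-resolution name such as "2024.01.31.13.45.59.123456":

```python
from datetime import datetime

# The default project name is a microsecond-resolution timestamp with
# seven dot-separated fields: year, month, day, hour, minute, second,
# and microsecond.
project_name = datetime.now().strftime("%Y.%m.%d.%H.%M.%S.%f")
assert len(project_name.split(".")) == 7
```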
simulation_name: A `str` object specifying the name of this simulation.
This option just adds the user-provided `simulation_name` to the scorefile
for accounting.
Default: `project_name` if not specified, or "PyRosettaCluster" if explicitly set to `None`
environment: A `NoneType` or `str` object specifying the active conda environment
YML file string. If a `NoneType` object is provided, then generate a YML file
string for the active conda environment and save it to the full simulation
record. If a `str` object is provided, then validate it against the active
conda environment YML file string and save it to the full simulation record.
Default: None
output_path: A `str` object specifying the full path of the output directory
(to be created if it doesn't exist) where the output results will be saved
to disk.
Default: "./outputs"
scorefile_name: A `str` object specifying the name of the output JSON-formatted
scorefile. The scorefile location is always `output_path`/`scorefile_name`.
Default: "scores.json"
simulation_records_in_scorefile: A `bool` object specifying whether or not to
write full simulation records to the scorefile. If `True`, then write
full simulation records to the scorefile. This results in some redundant
information on each line, allowing downstream reproduction of a decoy from
the scorefile, but a larger scorefile. If `False`, then write
curtailed simulation records to the scorefile. This results in minimally
redundant information on each line, disallowing downstream reproduction
of a decoy from the scorefile, but a smaller scorefile. If `False`, also
write the active conda environment to a YML file in 'output_path'. Full
simulation records are always written to the output '.pdb' or '.pdb.bz2'
file(s), which can be used to reproduce any decoy without the scorefile.
Default: False
decoy_dir_name: A `str` object specifying the directory name where the
output decoys will be saved. The directory location is always
`output_path`/`decoy_dir_name`.
Default: "decoys"
logs_dir_name: A `str` object specifying the directory name where the
output log files will be saved. The directory location is always
`output_path`/`logs_dir_name`.
Default: "logs"
logging_level: A `str` object specifying the logging level of Python tracer
output to write to the log file: one of "NOTSET", "DEBUG", "INFO",
"WARNING", "ERROR", or "CRITICAL". The output log file is always written
to `output_path`/`logs_dir_name`/`simulation_name`.log on disk.
Default: "INFO"
logging_address: A `str` object specifying the socket endpoint for sending and receiving
log messages across a network, so log messages from user-provided PyRosetta
protocols may be written to a single log file on the host node. The `str` object
must take the format 'host:port' where 'host' is either an IP address, 'localhost',
or Domain Name System (DNS)-accessible domain name, and 'port' is an integer
greater than or equal to 0. If the 'port' is '0', then the next free port is selected.
Default: 'localhost:0' if `scheduler=None` or either the `client` or `clients`
keyword argument parameters specify instances of `dask.distributed.LocalCluster`,
otherwise '0.0.0.0:0'
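The "port '0' selects the next free port" convention follows standard socket
behavior, which can be demonstrated directly:

```python
import socket

# Demonstrates the 'host:port' convention: binding to port 0 asks the
# operating system for the next free port, as with 'localhost:0'.
host, port = "localhost", 0
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.bind((host, port))
    chosen_port = sock.getsockname()[1]  # the OS-assigned free port

assert chosen_port > 0
```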
ignore_errors: A `bool` object specifying for PyRosettaCluster to ignore errors
raised in the user-provided PyRosetta protocols. This comes in handy when
well-defined errors are sparse and sporadic (such as rare Segmentation Faults),
and the user would like PyRosettaCluster to run without raising the errors.
Default: False
timeout: A `float` or `int` object specifying how many seconds PyRosettaCluster
waits between check-ins on the running user-provided PyRosetta protocols.
If each user-provided PyRosetta protocol is expected to run quickly, then
0.1 seconds seems reasonable. If each user-provided PyRosetta protocol is
expected to run slowly, then >1 second seems reasonable.
Default: 0.5
max_delay_time: A `float` or `int` object specifying the maximum number of seconds to
sleep before returning the result(s) from each user-provided PyRosetta protocol
back to the client. If a dask worker returns the result(s) from a user-provided
PyRosetta protocol too quickly, the dask scheduler needs to first register that
the task is processing before it completes. In practice, in each user-provided
PyRosetta protocol the runtime is subtracted from `max_delay_time`, and the dask
worker sleeps for the remainder of the time, if any, before returning the result(s).
It's recommended to set this option to at least 1 second, but longer times may
be used as a safety throttle in cases of overwhelmed dask scheduler processes.
Default: 3.0
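The `max_delay_time` throttle can be sketched as a simple remainder
computation (a simplified approximation of the described behavior; the
function name is illustrative):

```python
import time

def throttled_delay(runtime: float, max_delay_time: float = 3.0) -> float:
    """Return how long a worker sleeps after a protocol finishes: the
    protocol's runtime is subtracted from `max_delay_time`, and the worker
    sleeps for the non-negative remainder before returning results."""
    return max(0.0, max_delay_time - runtime)

time.sleep(throttled_delay(runtime=2.9))  # sleeps for the ~0.1 s remainder
```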
filter_results: A `bool` object specifying whether or not to filter out empty
`PackedPose` objects between user-provided PyRosetta protocols. When a protocol
returns or yields `NoneType`, PyRosettaCluster converts it to an empty `PackedPose`
object that gets passed to the next protocol. If `True`, then filter out any empty
`PackedPose` objects where there are no residues in the conformation as given by
`Pose.empty()`, otherwise if `False` then continue to pass empty `PackedPose` objects
to the next protocol. This is used for filtering out decoys mid-trajectory through
user-provided PyRosetta protocols if protocols return or yield any `None`, empty
`Pose`, or empty `PackedPose` objects.
Default: True
save_all: A `bool` object specifying whether or not to save all of the returned
or yielded `Pose` and `PackedPose` objects from all user-provided
PyRosetta protocols. This option may be used for checkpointing trajectories.
To save arbitrary poses to disk, from within any user-provided PyRosetta
protocol:
`pose.dump_pdb(os.path.join(kwargs["PyRosettaCluster_output_path"], "checkpoint.pdb"))`
Default: False
dry_run: A `bool` object specifying whether or not to save '.pdb' files to
disk. If `True`, then do not write '.pdb' or '.pdb.bz2' files to disk.
Default: False
cooldown_time: A `float` or `int` object specifying how many seconds to sleep after the
simulation is complete to allow loggers to flush. For very slow network filesystems,
2.0 or more seconds may be reasonable.
Default: 0.5
Returns:
A PyRosettaCluster instance.