PyRosettaCluster is a class for reproducible, high-throughput job distribution
of user-defined PyRosetta protocols efficiently parallelized on the user's
local computer, high-performance computing (HPC) cluster, or elastic cloud
computing infrastructure with available compute resources.
Args:
tasks: A `list` of `dict` objects, a callable or called function returning
a `list` of `dict` objects, or a callable or called generator yielding
a `list` of `dict` objects. Each dictionary object element of the list
is accessible via kwargs in the user-defined PyRosetta protocols.
In order to initialize PyRosetta with user-defined PyRosetta command line
options at the start of each user-defined PyRosetta protocol, `extra_options`
and/or `options` must be a key of each dictionary object, where the value
is a `str`, `tuple`, `list`, `set`, or `dict` of PyRosetta command line
options.
Default: [{}]
input_packed_pose: Optional input `PackedPose` object that is accessible via
the first argument of the first user-defined PyRosetta protocol.
Default: None
seeds: A `list` of `int` objects specifying the random number generator seeds
to use for each user-defined PyRosetta protocol. The number of seeds
provided must be equal to the number of user-defined input PyRosetta
protocols. Seeds are used in the same order that the user-defined PyRosetta
protocols are executed.
Default: None
decoy_ids: A `list` of `int` objects specifying the decoy numbers to keep after
executing user-defined PyRosetta protocols. User-provided PyRosetta
protocols may return a list of `Pose` and/or `PackedPose` objects, or
yield multiple `Pose` and/or `PackedPose` objects. To reproduce a
particular decoy generated via the chain of user-provided PyRosetta
protocols, the decoy number to keep for each protocol may be specified,
where other decoys are discarded. Decoy numbers use zero-based indexing,
so `0` is the first decoy generated from a particular PyRosetta protocol.
The number of decoy_ids provided must be equal to the number of
user-defined input PyRosetta protocols, so that one decoy is saved for each
user-defined PyRosetta protocol. Decoy ids are applied in the same order
that the user-defined PyRosetta protocols are executed.
Default: None
client: An initialized dask `distributed.client.Client` object to be used as
the dask client interface to the local or remote compute cluster. If `None`,
then PyRosettaCluster initializes its own dask client based on the
`PyRosettaCluster(scheduler=...)` class attribute.
Default: None
scheduler: A `str` of either "sge" or "slurm", or `None`. If "sge", then
PyRosettaCluster schedules jobs using `SGECluster` with `dask-jobqueue`.
If "slurm", then PyRosettaCluster schedules jobs using `SLURMCluster` with
`dask-jobqueue`. If `None`, then PyRosettaCluster schedules jobs using
`LocalCluster` with `dask.distributed`. If `PyRosettaCluster(client=...)`
is provided, then `PyRosettaCluster(scheduler=...)` is ignored.
Default: None
cores: An `int` object specifying the total number of cores per job, which
is input to the `dask_jobqueue.SLURMCluster(cores=...)` argument.
Default: 1
processes: An `int` object specifying the total number of processes per job,
which is input to the `dask_jobqueue.SLURMCluster(processes=...)` argument.
This cuts the job up into this many processes.
Default: 1
memory: A `str` object specifying the total amount of memory per job, which
is input to the `dask_jobqueue.SLURMCluster(memory=...)` argument.
Default: "4g"
scratch_dir: A `str` object specifying the path to a scratch directory where
dask may write its temporary files.
Default: "/temp" if it exists, otherwise the current working directory
min_workers: An `int` object specifying the minimum number of workers to
which to adapt during parallelization of user-provided PyRosetta protocols.
Default: 1
max_workers: An `int` object specifying the maximum number of workers to
which to adapt during parallelization of user-provided PyRosetta protocols.
Default: 1000 if the initial number of `tasks` is <1000, else use the
initial number of `tasks`
dashboard_address: A `str` object specifying the port over which the dask
dashboard is forwarded. Particularly useful for diagnosing PyRosettaCluster
performance in real-time.
Default: ":8787"
nstruct: An `int` object specifying the number of repeats of the first
user-provided PyRosetta protocol. The user can control the number of
repeats of subsequent user-provided PyRosetta protocols via returning
multiple clones of the output pose(s) from a user-provided PyRosetta
protocol run earlier, or cloning the input pose(s) multiple times in a
user-provided PyRosetta protocol run later.
Default: 1
compressed: A `bool` object specifying whether or not to compress the output
.pdb files with bzip2, resulting in .pdb.bz2 files.
Default: True
system_info: A `dict` or `NoneType` object specifying the system information
required to reproduce the simulation. If `None` is provided, then PyRosettaCluster
automatically detects the platform and sets this attribute to the dictionary
{'sys.platform': `sys.platform`} (for example, {'sys.platform': 'linux'}).
If a `dict` is provided, then PyRosettaCluster validates that the 'sys.platform'
key has a value equal to the current `sys.platform`, and logs a warning message if not.
Additional system information such as Amazon Machine Image (AMI) identifier
and compute fleet instance type identifier may be stored in this dictionary,
but is not validated. This information is stored in the simulation records for
accounting.
Default: None
pyrosetta_build: A `str` or `NoneType` object specifying the PyRosetta build as
output by `pyrosetta._version_string()`. If `None` is provided, then PyRosettaCluster
automatically detects the PyRosetta build and sets this attribute as the `str`.
If a `str` is provided, then validate that the input PyRosetta build is equal
to the active PyRosetta build, and log a warning message if not.
Default: None
sha1: A `str` or `NoneType` object specifying the git SHA1 hash string of the
particular git commit being simulated. If a non-empty `str` object is provided,
then it is validated to match the SHA1 hash string of the current HEAD,
and then it is added to the simulation record for accounting. If an empty string
is provided, then ensure that everything in the working directory is committed
to the repository. If `None` is provided, then bypass SHA1 hash string
validation and set this attribute to an empty string.
Default: ""
project_name: A `str` object specifying the project name of this simulation.
This option just adds the user-provided project_name to the scorefile
for accounting.
Default: datetime.now().strftime("%Y.%m.%d.%H.%M.%S.%f") if not specified,
else "PyRosettaCluster" if None
simulation_name: A `str` object specifying the name of this simulation.
This option just adds the user-provided simulation_name to the scorefile
for accounting.
Default: `project_name` if not specified, else "PyRosettaCluster" if None
environment: A `NoneType` or `str` object specifying the active conda environment
YML file string. If a `NoneType` object is provided, then generate a YML file
string for the active conda environment and save it to the full simulation
record. If a `str` object is provided, then validate it against the active
conda environment YML file string and save it to the full simulation record.
Default: None
output_path: A `str` object specifying the full path of the output directory
(to be created if it doesn't exist) where the output results will be saved
to disk.
Default: "./outputs"
scorefile_name: A `str` object specifying the name of the output JSON-formatted
scorefile. The scorefile location is always `output_path`/`scorefile_name`.
Default: "scores.json"
simulation_records_in_scorefile: A `bool` object specifying whether or not to
write full simulation records to the scorefile. If `True`, then write
full simulation records to the scorefile. This results in some redundant
information on each line, allowing downstream reproduction of a decoy from
the scorefile, but a larger scorefile. If `False`, then write
curtailed simulation records to the scorefile. This results in minimally
redundant information on each line, disallowing downstream reproduction
of a decoy from the scorefile, but a smaller scorefile. If `False`, also
write the active conda environment to a YML file in 'output_path'. Full
simulation records are always written to the output '.pdb' or '.pdb.bz2'
file(s), which can be used to reproduce any decoy without the scorefile.
Default: False
decoy_dir_name: A `str` object specifying the directory name where the
output decoys will be saved. The directory location is always
`output_path`/`decoy_dir_name`.
Default: "decoys"
logs_dir_name: A `str` object specifying the directory name where the
output log files will be saved. The directory location is always
`output_path`/`logs_dir_name`.
Default: "logs"
logging_level: A `str` object specifying the logging level of python tracer
output to write to the log file of either "NOTSET", "DEBUG", "INFO",
"WARNING", "ERROR", or "CRITICAL". The output log file is always written
to `output_path`/`logs_dir_name`/`simulation_name`.log on disk.
Default: "INFO"
ignore_errors: A `bool` object specifying whether or not PyRosettaCluster ignores
errors raised in the user-provided PyRosetta protocols. This comes in handy when
well-defined errors are sparse and sporadic (such as rare Segmentation Faults),
and the user would like PyRosettaCluster to run without raising the errors.
Default: False
timeout: A `float` or `int` object specifying how many seconds to wait between
PyRosettaCluster checking in on the running user-provided PyRosetta protocols.
If each user-provided PyRosetta protocol is expected to run quickly, then
0.1 seconds seems reasonable. If each user-provided PyRosetta protocol is
expected to run slowly, then >1 second seems reasonable.
Default: 0.5
save_all: A `bool` object specifying whether or not to save all of the returned
or yielded `Pose` and `PackedPose` objects from all user-provided
PyRosetta protocols. This option may be used for checkpointing trajectories.
To save arbitrary poses to disk, from within any user-provided PyRosetta
protocol:
`pose.dump_pdb(os.path.join(kwargs["output_path"], "checkpoint.pdb"))`
Default: False
dry_run: A `bool` object specifying whether or not to run without saving .pdb
files to disk. If `True`, then do not write .pdb or .pdb.bz2 files to disk.
Default: False
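The `tasks`, `seeds`, and `decoy_ids` conventions above can be sketched in plain
Python, without PyRosetta installed. This is only an illustration under stated
assumptions, not PyRosettaCluster's implementation: `create_tasks`, `my_protocol`,
`check_reproduction_inputs`, and `keep_decoy` are hypothetical names, the `weights`
key is a made-up user key, and the stand-in protocol returns strings in place of
`Pose` or `PackedPose` objects.

```python
def create_tasks():
    """Return a list of dicts; each dict reaches the protocols via kwargs."""
    tasks = []
    for weights in ("ref2015", "ref2015_cart"):
        tasks.append(
            {
                # Command line options used to initialize PyRosetta.
                "options": "-ex1 -ex2",
                "extra_options": "-out:level 300",
                # Hypothetical user-defined key read by the protocol below.
                "weights": weights,
            }
        )
    return tasks


def my_protocol(packed_pose=None, **kwargs):
    """Stand-in protocol: returns three string labels instead of poses."""
    return ["{0}_decoy_{1}".format(kwargs["weights"], i) for i in range(3)]


def check_reproduction_inputs(protocols, seeds=None, decoy_ids=None):
    """Require one seed and one decoy id per protocol, when provided."""
    for name, values in (("seeds", seeds), ("decoy_ids", decoy_ids)):
        if values is not None and len(values) != len(protocols):
            raise ValueError(
                "Expected {0} {1}, got {2}".format(len(protocols), name, len(values))
            )


def keep_decoy(decoys, decoy_id):
    """Keep a single decoy; decoy ids are zero-based."""
    return decoys[decoy_id]


tasks = create_tasks()
protocols = [my_protocol]
check_reproduction_inputs(protocols, seeds=[1234], decoy_ids=[2])
kept = keep_decoy(my_protocol(None, **tasks[0]), decoy_id=2)
print(kept)  # ref2015_decoy_2
```

With one protocol, exactly one seed and one decoy id are expected; passing two
seeds for a single protocol makes the hypothetical check raise a `ValueError`.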
Returns:
A PyRosettaCluster instance.
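A few of the documented defaults and output locations can be restated as small
helper functions. The sketch below only mirrors the rules written in the argument
descriptions above; the helper names (`default_max_workers`, `output_locations`,
`resolve_system_info`) are hypothetical and are not part of PyRosettaCluster.

```python
import logging
import os
import sys


def default_max_workers(n_tasks):
    """1000 if there are fewer than 1000 initial tasks, else the task count."""
    return 1000 if n_tasks < 1000 else n_tasks


def output_locations(
    output_path="./outputs",
    scorefile_name="scores.json",
    decoy_dir_name="decoys",
    logs_dir_name="logs",
    simulation_name="PyRosettaCluster",
):
    """Compose the output paths as described for each argument."""
    return {
        "scorefile": os.path.join(output_path, scorefile_name),
        "decoy_dir": os.path.join(output_path, decoy_dir_name),
        "log_file": os.path.join(output_path, logs_dir_name, simulation_name + ".log"),
    }


def resolve_system_info(system_info=None):
    """Detect the platform if None; otherwise warn on a platform mismatch."""
    if system_info is None:
        return {"sys.platform": sys.platform}
    if system_info.get("sys.platform") != sys.platform:
        logging.warning(
            "system_info 'sys.platform' is %r but the current platform is %r",
            system_info.get("sys.platform"),
            sys.platform,
        )
    # Any additional keys are stored as-is and are not validated.
    return system_info


print(default_max_workers(10))
print(default_max_workers(5000))
print(output_locations()["scorefile"])
print(resolve_system_info())
```

Using `os.path.join` keeps the composed paths portable across operating systems,
which is a design choice of this sketch rather than a statement about how
PyRosettaCluster builds its paths internally.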