Rosetta
pyrosetta.distributed.cluster.core Namespace Reference

Classes

class  PyRosettaCluster
 

Variables

string __author__ = "Jason C. Klima"
 
 G = TypeVar("G")
 

Detailed Description

PyRosettaCluster is a class for reproducible, high-throughput distribution of
user-defined PyRosetta protocols, efficiently parallelized on the user's
local computer, high-performance computing (HPC) cluster, or elastic cloud
computing infrastructure with available compute resources.

Args:
    tasks: A `list` of `dict` objects, a callable or called function returning
        a `list` of `dict` objects, or a callable or called generator yielding
        a `list` of `dict` objects. Each dictionary element of the list is
        accessible via kwargs in the user-defined PyRosetta protocols.
        To initialize PyRosetta with user-defined PyRosetta command line
        options at the start of each user-defined PyRosetta protocol,
        `extra_options` or `options` (or both) must be a key of each dictionary,
        where the value is a `str`, `tuple`, `list`, `set`, or `dict` of
        PyRosetta command line options.
        Default: [{}]
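For illustration, a hypothetical `tasks` generator might look like the following sketch (the option values and the `task_id` key are placeholders, not required names):

```python
def create_tasks():
    """Yield task dictionaries; each dictionary is passed as kwargs
    to the user-defined PyRosetta protocols by PyRosettaCluster."""
    for i in range(3):
        yield {
            "options": "-ex1",                  # PyRosetta command line options
            "extra_options": "-out:level 300",  # additional command line options
            "task_id": i,                       # arbitrary key, accessible via kwargs
        }

tasks = list(create_tasks())
```

A generator is convenient when the task list is large, since tasks can be produced lazily instead of materialized up front.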
    input_packed_pose: Optional input `PackedPose` object that is accessible via
        the first argument of the first user-defined PyRosetta protocol.
        Default: None
    seeds: A `list` of `int` objects specifying the random number generator seeds
        to use for each user-defined PyRosetta protocol. The number of seeds
        provided must be equal to the number of user-defined input PyRosetta
        protocols. Seeds are used in the same order that the user-defined PyRosetta
        protocols are executed.
        Default: None
    decoy_ids: A `list` of `int` objects specifying the decoy numbers to keep after
        executing user-defined PyRosetta protocols. User-provided PyRosetta
        protocols may return a list of `Pose` and/or `PackedPose` objects, or
        yield multiple `Pose` and/or `PackedPose` objects. To reproduce a
        particular decoy generated via the chain of user-provided PyRosetta
        protocols, the decoy number to keep for each protocol may be specified,
        where other decoys are discarded. Decoy numbers use zero-based indexing,
        so `0` is the first decoy generated from a particular PyRosetta protocol.
        The number of decoy_ids provided must be equal to the number of
        user-defined input PyRosetta protocols, so that one decoy is saved for each
        user-defined PyRosetta protocol. Decoy ids are applied in the same order
        that the user-defined PyRosetta protocols are executed.
        Default: None
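As a sketch, with three user-defined protocols the `seeds` and `decoy_ids` lists must each contain three elements, applied in protocol execution order (the protocol names and values below are arbitrary examples):

```python
# Hypothetical protocol names; one seed and one decoy id per protocol.
protocols = ["relax", "design", "minimize"]
seeds = [1234567890, 987654321, 1111111111]  # one RNG seed per protocol
decoy_ids = [0, 2, 0]  # zero-based: keep the 1st, 3rd, and 1st decoy, respectively

# Both lists must match the number of user-defined protocols.
assert len(seeds) == len(protocols)
assert len(decoy_ids) == len(protocols)
```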
    client: An initialized dask `distributed.client.Client` object to be used as
        the dask client interface to the local or remote compute cluster. If `None`,
        then PyRosettaCluster initializes its own dask client based on the
        `PyRosettaCluster(scheduler=...)` class attribute. Deprecated by the
        `PyRosettaCluster(clients=...)` class attribute, but supported for legacy
        purposes. The `client` and `clients` attribute parameters cannot both
        be provided; at least one must be `None`.
        Default: None
    clients: A `list` or `tuple` object of initialized dask `distributed.client.Client`
        objects to be used as the dask client interface(s) to the local or remote compute
        cluster(s). If `None`, then PyRosettaCluster initializes its own dask client based
        on the `PyRosettaCluster(scheduler=...)` class attribute. Optionally used in
        combination with the `PyRosettaCluster().distribute(clients_indices=...)` method.
        The `client` and `clients` attribute parameters cannot both be provided;
        at least one must be `None`.
        See the `PyRosettaCluster().distribute()` method docstring for usage examples.
        Default: None
    scheduler: A `str` of either "sge" or "slurm", or `None`. If "sge", then
        PyRosettaCluster schedules jobs using `SGECluster` with `dask-jobqueue`.
        If "slurm", then PyRosettaCluster schedules jobs using `SLURMCluster` with
        `dask-jobqueue`. If `None`, then PyRosettaCluster schedules jobs using
        `LocalCluster` with `dask.distributed`. If `PyRosettaCluster(client=...)`
        or `PyRosettaCluster(clients=...)` is provided, then 
        `PyRosettaCluster(scheduler=...)` is ignored.
        Default: None
    cores: An `int` object specifying the total number of cores per job, which
        is input to the `dask_jobqueue.SLURMCluster(cores=...)` argument or
        the `dask_jobqueue.SGECluster(cores=...)` argument.
        Default: 1
    processes: An `int` object specifying the total number of processes per job,
        which is input to the `dask_jobqueue.SLURMCluster(processes=...)` argument
        or the `dask_jobqueue.SGECluster(processes=...)` argument.
        This divides each job into the specified number of processes.
        Default: 1
    memory: A `str` object specifying the total amount of memory per job, which
        is input to the `dask_jobqueue.SLURMCluster(memory=...)` argument or
        the `dask_jobqueue.SGECluster(memory=...)` argument.
        Default: "4g"
    scratch_dir: A `str` object specifying the path to a scratch directory where
        dask may write temporary files.
        Default: "/temp" if it exists, otherwise the current working directory
    min_workers: An `int` object specifying the minimum number of workers to
        which to adapt during parallelization of user-provided PyRosetta protocols.
        Default: 1
    max_workers: An `int` object specifying the maximum number of workers to
        which to adapt during parallelization of user-provided PyRosetta protocols.
        Default: 1000 if the initial number of `tasks` is <1000, otherwise
            the initial number of `tasks`
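The documented `max_workers` default can be sketched as (a simplification of the behavior described above, not PyRosettaCluster's exact implementation):

```python
def default_max_workers(n_tasks: int) -> int:
    """Sketch of the documented max_workers default: 1000 when the
    initial number of tasks is below 1000, otherwise the task count."""
    return 1000 if n_tasks < 1000 else n_tasks
```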
    dashboard_address: A `str` object specifying the port over which the dask
        dashboard is forwarded. Particularly useful for diagnosing PyRosettaCluster
        performance in real-time.
        Default=":8787"
    nstruct: An `int` object specifying the number of repeats of the first
        user-provided PyRosetta protocol. The user can control the number of
        repeats of subsequent user-provided PyRosetta protocols via returning
        multiple clones of the output pose(s) from a user-provided PyRosetta
        protocol run earlier, or cloning the input pose(s) multiple times in a
        user-provided PyRosetta protocol run later.
        Default: 1
    compressed: A `bool` object specifying whether or not to compress the output
        '.pdb' files with bzip2, resulting in '.pdb.bz2' files.
        Default: True
    compression: A `str` object of 'xz', 'zlib' or 'bz2', or a `bool` or `NoneType`
        object representing the internal compression library for pickled `PackedPose` 
        objects and user-defined PyRosetta protocol `kwargs` objects. The default of
        `True` uses 'xz' for serialization if it's installed, otherwise uses 'zlib'
        for serialization.
        Default: True
    system_info: A `dict` or `NoneType` object specifying the system information
        required to reproduce the simulation. If `None` is provided, then PyRosettaCluster
        automatically detects the platform and returns this attribute as a dictionary
        {'sys.platform': `sys.platform`} (for example, {'sys.platform': 'linux'}).
        If a `dict` is provided, then validate that the 'sys.platform' key has a value
        equal to the current `sys.platform`, and log a warning message if not.
        Additional system information such as Amazon Machine Image (AMI) identifier
        and compute fleet instance type identifier may be stored in this dictionary,
        but is not validated. This information is stored in the simulation records for
        accounting.
        Default: None
    pyrosetta_build: A `str` or `NoneType` object specifying the PyRosetta build as
        output by `pyrosetta._version_string()`. If `None` is provided, then PyRosettaCluster
        automatically detects the PyRosetta build and sets this attribute as the `str`.
        If a `str` is provided, then validate that the input PyRosetta build is equal
        to the active PyRosetta build, and log a warning message if not.
        Default: None
    sha1: A `str` or `NoneType` object specifying the git SHA1 hash string of the
        particular git commit being simulated. If a non-empty `str` object is provided,
        then it is validated to match the SHA1 hash string of the current HEAD,
        and then it is added to the simulation record for accounting. If an empty string
        is provided, then ensure that everything in the working directory is committed
        to the repository. If `None` is provided, then bypass SHA1 hash string
        validation and set this attribute to an empty string.
        Default: ""
    project_name: A `str` object specifying the project name of this simulation.
        This option adds the user-provided `project_name` to the scorefile
        for accounting.
        Default: datetime.now().strftime("%Y.%m.%d.%H.%M.%S.%f") if not specified,
            else "PyRosettaCluster" if None
    simulation_name: A `str` object specifying the name of this simulation.
        This option adds the user-provided `simulation_name` to the scorefile
        for accounting.
        Default: `project_name` if not specified, else "PyRosettaCluster" if None
    environment: A `NoneType` or `str` object specifying the active conda environment
        YML file string. If a `NoneType` object is provided, then generate a YML file
        string for the active conda environment and save it to the full simulation
        record. If a `str` object is provided, then validate it against the active
        conda environment YML file string and save it to the full simulation record.
        Default: None
    output_path: A `str` object specifying the full path of the output directory
        (to be created if it doesn't exist) where the output results will be saved
        to disk.
        Default: "./outputs"
    scorefile_name: A `str` object specifying the name of the output JSON-formatted
        scorefile. The scorefile location is always `output_path`/`scorefile_name`.
        Default: "scores.json"
    simulation_records_in_scorefile: A `bool` object specifying whether or not to
        write full simulation records to the scorefile. If `True`, then write
        full simulation records to the scorefile. This results in some redundant
        information on each line, allowing downstream reproduction of a decoy from
        the scorefile, but a larger scorefile. If `False`, then write
        curtailed simulation records to the scorefile. This results in minimally
        redundant information on each line, disallowing downstream reproduction
        of a decoy from the scorefile, but a smaller scorefile. If `False`, also
        write the active conda environment to a YML file in 'output_path'. Full
        simulation records are always written to the output '.pdb' or '.pdb.bz2'
        file(s), which can be used to reproduce any decoy without the scorefile.
        Default: False
    decoy_dir_name: A `str` object specifying the directory name where the
        output decoys will be saved. The directory location is always
        `output_path`/`decoy_dir_name`.
        Default: "decoys"
    logs_dir_name: A `str` object specifying the directory name where the
        output log files will be saved. The directory location is always
        `output_path`/`logs_dir_name`.
        Default: "logs"
    logging_level: A `str` object specifying the logging level of Python tracer
        output written to the log file: one of "NOTSET", "DEBUG", "INFO",
        "WARNING", "ERROR", or "CRITICAL". The output log file is always written
        to `output_path`/`logs_dir_name`/`simulation_name`.log on disk.
        Default: "INFO"
    logging_address: A `str` object specifying the socket endpoint for sending and receiving
        log messages across a network, so log messages from user-provided PyRosetta
        protocols may be written to a single log file on the host node. The `str` object
        must take the format 'host:port' where 'host' is either an IP address, 'localhost',
        or Domain Name System (DNS)-accessible domain name, and 'port' is an integer
        greater than or equal to 0. If 'port' is '0', then the next free port is selected.
        Default: 'localhost:0' if `scheduler=None` or either the `client` or `clients`
            keyword argument parameters specify instances of `dask.distributed.LocalCluster`,
            otherwise '0.0.0.0:0'
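A minimal sketch of the expected 'host:port' format (the parsing logic here is illustrative, not PyRosettaCluster's own validation code):

```python
def parse_logging_address(address: str) -> tuple:
    """Split a 'host:port' string; port 0 means 'pick the next free port'."""
    host, sep, port = address.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"Expected 'host:port', got {address!r}")
    return host, int(port)
```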
    ignore_errors: A `bool` object specifying for PyRosettaCluster to ignore errors
        raised in the user-provided PyRosetta protocols. This comes in handy when
        well-defined errors are sparse and sporadic (such as rare Segmentation Faults),
        and the user would like PyRosettaCluster to run without raising the errors.
        Default: False
    timeout: A `float` or `int` object specifying how many seconds to wait between
        PyRosettaCluster checking-in on the running user-provided PyRosetta protocols.
        If each user-provided PyRosetta protocol is expected to run quickly, then
        0.1 seconds seems reasonable. If each user-provided PyRosetta protocol is
        expected to run slowly, then >1 second seems reasonable.
        Default: 0.5
    max_delay_time: A `float` or `int` object specifying the maximum number of seconds to 
        sleep before returning the result(s) from each user-provided PyRosetta protocol
        back to the client. If a dask worker returns the result(s) from a user-provided
        PyRosetta protocol too quickly, the dask scheduler needs to first register that
        the task is processing before it completes. In practice, in each user-provided
        PyRosetta protocol the runtime is subtracted from `max_delay_time`, and the dask
        worker sleeps for the remainder of the time, if any, before returning the result(s).
        It's recommended to set this option to at least 1 second, but longer times may
        be used as a safety throttle in cases of overwhelmed dask scheduler processes.
        Default: 3.0
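The delay behavior described above can be sketched as follows (a simplification, not PyRosettaCluster's exact implementation):

```python
def remaining_delay(max_delay_time: float, runtime: float) -> float:
    """Seconds a dask worker sleeps before returning results: the protocol
    runtime is subtracted from max_delay_time, floored at zero."""
    return max(0.0, max_delay_time - runtime)
```

A protocol that ran for 0.5 s with the default `max_delay_time` of 3.0 s would thus sleep for the remaining 2.5 s; a protocol that ran longer than `max_delay_time` returns immediately.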
    filter_results: A `bool` object specifying whether or not to filter out empty
        `PackedPose` objects between user-provided PyRosetta protocols. When a protocol
        returns or yields `NoneType`, PyRosettaCluster converts it to an empty `PackedPose`
        object that gets passed to the next protocol. If `True`, then filter out any empty
        `PackedPose` objects where there are no residues in the conformation as given by
        `Pose.empty()`, otherwise if `False` then continue to pass empty `PackedPose` objects
        to the next protocol. This is used for filtering out decoys mid-trajectory through
        user-provided PyRosetta protocols if protocols return or yield any `None`, empty
        `Pose`, or empty `PackedPose` objects.
        Default: True
    save_all: A `bool` object specifying whether or not to save all of the returned
        or yielded `Pose` and `PackedPose` objects from all user-provided
        PyRosetta protocols. This option may be used for checkpointing trajectories.
        To save arbitrary poses to disk, from within any user-provided PyRosetta
        protocol:
            `pose.dump_pdb(
                os.path.join(kwargs["PyRosettaCluster_output_path"], "checkpoint.pdb"))`
        Default: False
    dry_run: A `bool` object specifying whether or not to save '.pdb' files to
        disk. If `True`, then do not write '.pdb' or '.pdb.bz2' files to disk.
        Default: False
    cooldown_time: A `float` or `int` object specifying how many seconds to sleep after the
        simulation is complete to allow loggers to flush. For very slow network filesystems,
        2.0 or more seconds may be reasonable.
        Default: 0.5

Returns:
    A PyRosettaCluster instance.

Variable Documentation

◆ __author__

string pyrosetta.distributed.cluster.core.__author__ = "Jason C. Klima"
private

◆ G

pyrosetta.distributed.cluster.core.G = TypeVar("G")