SLURMEnvironment

class lightning.pytorch.plugins.environments.SLURMEnvironment(auto_requeue=True, requeue_signal=None)[source]

Bases: lightning.fabric.plugins.environments.cluster_environment.ClusterEnvironment

Cluster environment for training on a cluster managed by SLURM.

Parameters
  • auto_requeue (bool) – Whether automatic job resubmission is enabled. How and under which conditions a job is rescheduled is determined by the owner of this plugin.

  • requeue_signal (Optional[Signals]) – The signal that SLURM will send to indicate that the job should be requeued. Defaults to SIGUSR1 on Unix.

static detect()[source]

Returns True if the current process was launched on a SLURM cluster.

It is possible to use the SLURM scheduler to request resources and then launch processes manually using a different environment. To do this, set the job name in SLURM to ‘bash’ or ‘interactive’ (srun --job-name=interactive). SLURMEnvironment is then not detected, and another environment can be detected automatically.

Return type

bool
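
The detection rule described above can be approximated with the standard library alone. This is a minimal sketch, not the plugin's actual implementation: the helper name looks_like_slurm_launch is hypothetical, and it assumes that the presence of SLURM_NTASKS marks a SLURM-launched process and that the job name is exposed through SLURM_JOB_NAME.

```python
import os
from typing import Mapping, Optional


def looks_like_slurm_launch(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper approximating the check described for detect()."""
    env = os.environ if env is None else env
    # Assumption: SLURM sets SLURM_NTASKS for processes it launches.
    launched_by_slurm = "SLURM_NTASKS" in env
    # Job names 'bash' and 'interactive' opt out of detection, as described above.
    opted_out = env.get("SLURM_JOB_NAME") in ("bash", "interactive")
    return launched_by_slurm and not opted_out


if __name__ == "__main__":
    print(looks_like_slurm_launch({"SLURM_NTASKS": "4", "SLURM_JOB_NAME": "train"}))
    print(looks_like_slurm_launch({"SLURM_NTASKS": "4", "SLURM_JOB_NAME": "interactive"}))
```

Passing the environment as a mapping keeps the check easy to exercise outside a real SLURM allocation.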

global_rank()[source]

The rank (index) of the currently running process across all nodes and devices.

Return type

int

local_rank()[source]

The rank (index) of the currently running process inside of the current node.

Return type

int

node_rank()[source]

The rank (index) of the node on which the current process runs.

Return type

int
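
To make the three rank values concrete, the sketch below reads them from the environment variables that SLURM sets for each task. It is an illustration under stated assumptions, not the plugin's code: the names SlurmRanks and read_slurm_ranks are hypothetical, and the mapping of global, local, and node rank to SLURM_PROCID, SLURM_LOCALID, and SLURM_NODEID is assumed rather than taken from this page.

```python
import os
from typing import Mapping, NamedTuple, Optional


class SlurmRanks(NamedTuple):
    """Hypothetical container for the three rank values."""
    global_rank: int
    local_rank: int
    node_rank: int


def read_slurm_ranks(env: Optional[Mapping[str, str]] = None) -> SlurmRanks:
    """Read the ranks from SLURM task variables (assumed mapping)."""
    env = os.environ if env is None else env
    return SlurmRanks(
        global_rank=int(env["SLURM_PROCID"]),   # index across all nodes and devices
        local_rank=int(env["SLURM_LOCALID"]),   # index within the current node
        node_rank=int(env["SLURM_NODEID"]),     # index of the current node
    )


if __name__ == "__main__":
    # Example: 2 nodes with 4 tasks each; the sixth task runs on the second node.
    fake_env = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1", "SLURM_NODEID": "1"}
    print(read_slurm_ranks(fake_env))
```

With four tasks per node, global rank 5 corresponds to node rank 1 and local rank 1, which is the relationship the three methods express.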

static resolve_root_node_address(nodes)[source]

SLURM supports several formats for specifying a selection of nodes.

This function selects the first host name from:

  • a space-separated list of host names, e.g., ‘host0 host1 host3’ yields ‘host0’ as the root

  • a comma-separated list of host names, e.g., ‘host0,host1,host3’ yields ‘host0’ as the root

  • the range notation with brackets, e.g., ‘host[5-9]’ yields ‘host5’ as the root

Return type

str
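
The three node-list formats listed above can be handled with two regular-expression steps. The function below is a minimal sketch of that idea; the name first_host is hypothetical, and the code is not claimed to match the library's own implementation.

```python
import re


def first_host(nodes: str) -> str:
    """Hypothetical helper: return the first host name in a SLURM node list."""
    # Collapse a bracketed range or list to its first element,
    # so that 'host[5-9]' becomes 'host5'.
    nodes = re.sub(r"\[([^\],-]+)[^\]]*\]", r"\1", nodes)
    # Host names may be separated by spaces or commas; keep the first one.
    return re.split(r"[ ,]+", nodes.strip())[0]


if __name__ == "__main__":
    for example in ("host0 host1 host3", "host0,host1,host3", "host[5-9]"):
        print(example, "->", first_host(example))
```

Expanding the bracket notation first matters because a bracketed list can itself contain commas, which would otherwise be mistaken for host separators.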

world_size()[source]

The number of processes across all devices and nodes.

Return type

int

property creates_processes_externally: bool

Whether the environment creates the subprocesses or not.

Return type

bool

property main_address: str

The main address through which all processes connect and communicate.

Return type

str

property main_port: int

An open and configured port on the main node through which all processes communicate.

Return type

int