FSDPStrategy

class lightning.pytorch.strategies.FSDPStrategy(accelerator=None, parallel_devices=None, cluster_environment=None, checkpoint_io=None, precision_plugin=None, process_group_backend=None, cpu_offload=None, mixed_precision=None, activation_checkpointing=None, **kwargs)[source]

Bases: lightning.pytorch.strategies.parallel.ParallelStrategy

Strategy for Fully Sharded Data Parallel provided by torch.distributed.

Warning

This is an experimental feature.

Fully Sharded Training shards the entire model across all available GPUs, allowing you to scale model size, whilst using efficient communication to reduce overhead. In practice, this means we can remain at parity with PyTorch DDP, whilst scaling our model sizes dramatically. The technique is similar to ZeRO-Stage 3.

For more information, check out this blog post.

Defaults have been set and options have been exposed, but they may require configuration depending on the memory/speed trade-off you need. We suggest having a look at this tutorial for more information.
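
For illustration, here is a minimal sketch of enabling the strategy. The device count and the CPU-offload setting are assumptions for the example, not defaults; CPUOffload comes from torch.distributed.fsdp.

    from torch.distributed.fsdp import CPUOffload
    from lightning.pytorch import Trainer
    from lightning.pytorch.strategies import FSDPStrategy

    # Illustrative configuration: offload parameters to CPU to save GPU memory.
    # All values here are examples, not recommended defaults.
    strategy = FSDPStrategy(cpu_offload=CPUOffload(offload_params=True))
    trainer = Trainer(accelerator="gpu", devices=4, strategy=strategy)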

barrier(name=None)[source]

Synchronizes all processes, blocking them until the whole group enters this function.

Parameters

name (Optional[str]) – an optional name to pass into barrier.

Return type

None
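
A hedged usage sketch, assuming access to the strategy through a trainer inside a LightningModule hook; download_data is a hypothetical helper that only rank zero runs:

    # Rank 0 does one-time work while the other ranks wait at the barrier.
    if self.trainer.is_global_zero:
        download_data()                    # hypothetical one-time setup on rank 0
    self.trainer.strategy.barrier("data")  # all ranks wait here until rank 0 is done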

broadcast(obj, src=0)[source]

Broadcasts an object to all processes.

Parameters
  • obj (TypeVar(TBroadcast)) – the object to broadcast

  • src (int) – source rank

Return type

TypeVar(TBroadcast)
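
A minimal sketch, again assuming access to the strategy through a trainer; compute_run_id is a hypothetical helper executed only on rank zero:

    # Decide a value on rank 0, then share the same Python object with all ranks.
    run_id = compute_run_id() if self.trainer.is_global_zero else None  # hypothetical
    run_id = self.trainer.strategy.broadcast(run_id, src=0)
    # After the call, every process holds rank 0's run_id.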

model_sharded_context()[source]

Provides a hook to create modules in a distributed-aware context. This is useful when we want to shard the model immediately, which for extremely large models can save memory and initialization time.

Returns

Model parallel context.

Return type

Generator
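
The Trainer normally enters this context for you; as a hedged sketch of direct use (MyVeryLargeModel is a hypothetical module class):

    # Instantiate the model inside the context so parameters are sharded as they
    # are created, instead of being materialized in full on a single device.
    with strategy.model_sharded_context():
        model = MyVeryLargeModel()  # hypothetical; built in a distributed-aware way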

model_to_device()[source]

Moves the model to the correct device.

Return type

None

predict_step(*args, **kwargs)[source]

The actual predict step.

See predict_step() for more details.

Return type

Union[Tensor, Dict[str, Any]]

reduce(tensor, group=None, reduce_op='mean')[source]

Reduces a tensor from several distributed processes to one aggregated tensor.

Parameters
  • tensor (Union[Tensor, Any]) – the tensor to sync and reduce

  • group (Optional[Any]) – the process group to gather results from. Defaults to all processes (world)

  • reduce_op (Union[ReduceOp, str, None]) – the reduction operation. Defaults to ‘mean’/‘avg’. Can also be a string ‘sum’ to calculate the sum during reduction.

Return type

Tensor

Returns

The reduced value, except when the input was not a tensor, in which case the output remains unchanged.
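
A minimal sketch, assuming a scalar metric computed on each rank inside a LightningModule:

    import torch

    # Average a per-rank scalar across all processes (reduce_op defaults to 'mean').
    local_loss = torch.tensor(0.5, device=self.trainer.strategy.root_device)
    mean_loss = self.trainer.strategy.reduce(local_loss, reduce_op="mean")
    # mean_loss now holds the average of local_loss over the whole world.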

setup(trainer)[source]

Sets up plugins for trainer fit and creates optimizers.

Parameters

trainer (Trainer) – the trainer instance

Return type

None

setup_environment()[source]

Sets up any processes or distributed connections.

This is called before the LightningModule/DataModule setup hook, which allows the user to access the accelerator environment before setup is complete.

Return type

None

setup_optimizers(trainer)[source]

Creates optimizers and schedulers.

Parameters

trainer (Trainer) – the Trainer to which these optimizers should be connected

Return type

None

teardown()[source]

This method is called to tear down the training process.

It is the right place to release memory and free other resources.

Return type

None

test_step(*args, **kwargs)[source]

The actual test step.

See test_step() for more details.

Return type

Union[Tensor, Dict[str, Any], None]

training_step(*args, **kwargs)[source]

The actual training step.

See training_step() for more details.

Return type

Union[Tensor, Dict[str, Any]]

validation_step(*args, **kwargs)[source]

The actual validation step.

See validation_step() for more details.

Return type

Union[Tensor, Dict[str, Any], None]

property root_device: torch.device

Return the root device.

Return type

device