persia.distributed
Module Contents
- class persia.distributed.BaguaDistributedOption(algorithm, **options)
Bases:
DistributedBaseOption
Implements an option to convert a torch model to a Bagua distributed model.
Example for BaguaDistributedOption:

```python
from persia.distributed import BaguaDistributedOption

kwargs = {"enable_bagua_net": True}
bagua_option = BaguaDistributedOption("gradient_allreduce", **kwargs)
```
Algorithms supported in Bagua:
- gradient_allreduce
- decentralized
- low_precision_decentralized
- qadam
- bytegrad
- async
Note
See the Bagua Algorithm documentation for more details, especially the arguments each algorithm accepts.
Note
BaguaDistributedOption only supports the CUDA environment. If you want to run a PERSIA task on a CPU cluster, use DDPOption with backend="gloo" instead of BaguaDistributedOption.
- Parameters
algorithm (str) – name of the Bagua algorithm.
options (dict) – additional options for the Bagua algorithm.
- convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
- Parameters
model (torch.nn.Module) – the PyTorch model to be converted to a data-parallel model.
world_size (int) – total number of processes.
rank_id (int) – rank of the current process.
device_id (int, optional) – device id for the current process.
master_addr (str, optional) – IP address of the collective communication master.
optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.
- init_with_env_file()
Check whether the current option was initialized with a DDP environment file.
- Returns
Whether the distributed option was initialized with an environment file.
- Return type
bool
- class persia.distributed.DDPOption(initialization_method='tcp', backend='nccl', **options)
Bases:
DistributedBaseOption
Implements an option to convert a torch model to a DDP (DistributedDataParallel) model.
DDPOption currently supports only the nccl and gloo backends. Set backend="nccl" if your PERSIA task trains on a cluster with CUDA devices, or backend="gloo" if your PERSIA task trains on a CPU-only cluster.
For example:

```python
from persia.distributed import DDPOption

ddp_option = DDPOption(backend="nccl")
```

If you want to change the default master_port or master_addr, pass the kwargs to DDPOption:

```python
from persia.distributed import DDPOption

ddp_option = DDPOption(backend="nccl", master_port=23333, master_addr="localhost")
```
- Parameters
initialization_method (str) – the PyTorch distributed initialization method; tcp and file are currently supported. See PyTorch initialization for more details.
backend (str) – backend for collective communication. Currently supports nccl and gloo.
options (dict) – options such as master_port or master_addr.
- convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
- Parameters
model (torch.nn.Module) – the PyTorch model to be converted to a data-parallel model.
world_size (int) – total number of processes.
rank_id (int) – rank of the current process.
device_id (int, optional) – device id for the current process.
master_addr (str, optional) – IP address of the collective communication master.
optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.
- init_with_env_file()
Check whether the current option was initialized with a DDP environment file.
- Returns
True if the current option was initialized with a DDP environment file.
- Return type
bool
- class persia.distributed.DistributedBaseOption(master_port, master_addr=None)
Bases:
abc.ABC
Implements a common option to convert a torch model to a distributed data-parallel model, e.g. Bagua or PyTorch DDP.
This class should not be instantiated directly.
- Parameters
master_port (int) – service port of the collective communication master.
master_addr (str, optional) – IP address of the collective communication master.
- abstract convert2distributed_model(model, world_size, rank_id, device_id=None, master_addr=None, optimizer=None)
- Parameters
model (torch.nn.Module) – the PyTorch model to be converted to a data-parallel model.
world_size (int) – total number of processes.
rank_id (int) – rank of the current process.
device_id (int, optional) – device id for the current process.
master_addr (str, optional) – IP address of the collective communication master.
optimizer (torch.optim.Optimizer, optional) – the PyTorch optimizer that may need to be converted alongside the model.
- abstract init_with_env_file()
Check whether the current option was initialized with a DDP environment file.
- Returns
True if the current option was initialized with a DDP environment file.
- Return type
bool
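Because DistributedBaseOption is abstract, concrete options such as DDPOption and BaguaDistributedOption must implement both abstract methods before they can be instantiated. A minimal self-contained sketch of this pattern (using an illustrative stand-in for the real persia classes; MyOption and its trivial method bodies are hypothetical):

```python
import abc

# Illustrative stand-in for persia.distributed.DistributedBaseOption.
class DistributedBaseOption(abc.ABC):
    def __init__(self, master_port, master_addr=None):
        self.master_port = master_port
        self.master_addr = master_addr

    @abc.abstractmethod
    def convert2distributed_model(self, model, world_size, rank_id,
                                  device_id=None, master_addr=None,
                                  optimizer=None):
        ...

    @abc.abstractmethod
    def init_with_env_file(self):
        ...

# A hypothetical concrete option: it becomes instantiable only once
# both abstract methods are implemented.
class MyOption(DistributedBaseOption):
    def convert2distributed_model(self, model, world_size, rank_id,
                                  device_id=None, master_addr=None,
                                  optimizer=None):
        return model  # a real option would wrap the model here

    def init_with_env_file(self):
        return False  # this option does not read a DDP environment file
```

Attempting to instantiate DistributedBaseOption itself raises TypeError, which is why the class should not be instantiated directly.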
- persia.distributed.get_default_distributed_option(device_id=None)
Get the default distributed option.
- Parameters
device_id (int, optional) – CUDA device id. Applies backend="nccl" to the DDPOption if device_id is not None; otherwise uses backend="gloo" for CPU-only mode.
- Returns
Default distributed option.
- Return type
DDPOption
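The backend selection described above can be sketched in plain Python (a simplified stand-in for the actual persia implementation, shown only to make the device_id rule concrete):

```python
def get_default_distributed_option(device_id=None):
    # Simplified stand-in: pick the DDP backend from the device id,
    # as documented — nccl for a CUDA device, gloo for CPU-only mode.
    backend = "nccl" if device_id is not None else "gloo"
    return {"backend": backend, "device_id": device_id}
```

Note that device_id=0 is a valid CUDA device and therefore selects nccl; only a missing device_id falls back to gloo.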