flytekitplugins.kfpytorch.PyTorch
- class flytekitplugins.kfpytorch.PyTorch(master=<factory>, worker=<factory>, run_policy=None, num_workers=None, increase_shared_mem=True)
Configuration for an executable PyTorch Job. Use this to run distributed PyTorch training on Kubernetes. In most cases you do not need to touch the configuration of the master and worker groups; the default configuration works, and the only field you typically need to change is the number of workers. Both replica groups use the same image and the same resources inherited from the task function decoration (a minimal usage sketch follows the parameter list below).
- Parameters:
master (Master) – Configuration for the master replica group.
worker (Worker) – Configuration for the worker replica group.
run_policy (RunPolicy | None) – Configuration for the run policy.
num_workers (int | None) – [DEPRECATED] This argument is deprecated. Use worker.replicas instead.
increase_shared_mem (bool) – [DEPRECATED] This argument is deprecated. Use @task(shared_memory=…) instead (see the sketch at the end of this section). PyTorch uses shared memory to share data between processes. If torch multiprocessing is used (e.g. for multi-processed data loaders), the default shared memory segment size that the container runs with might not be enough, and you might have to increase it. This option configures the task’s pod template to mount an emptyDir volume with medium Memory to /dev/shm. The shared memory size upper limit is the sum of the memory limits of the containers in the pod.
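A minimal usage sketch, assuming flytekit and the kfpytorch plugin are installed; the task name, body, resource values, and worker count are illustrative assumptions:

```python
from flytekit import Resources, task
from flytekitplugins.kfpytorch import PyTorch, Worker


@task(
    # Only the worker replica count usually needs to be set explicitly.
    task_config=PyTorch(worker=Worker(replicas=4)),
    # Both the master and worker replicas inherit these resources.
    requests=Resources(cpu="4", mem="16Gi", gpu="1"),
)
def train_model() -> None:
    # Placeholder for the actual torch.distributed training loop.
    ...
```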
Methods
Attributes
- increase_shared_mem: bool = True
- run_policy: RunPolicy | None = None
- master: Master
- worker: Worker
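Because increase_shared_mem is deprecated, the shared memory mount is requested directly on the task decorator instead. A hedged sketch, assuming @task(shared_memory=…) accepts True (an uncapped /dev/shm mount, bounded by the pod memory limits) or a size string; the values shown are illustrative:

```python
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch, Worker


@task(
    task_config=PyTorch(worker=Worker(replicas=2)),
    # Mounts an emptyDir volume with medium Memory at /dev/shm for multi-processed data loaders.
    # Passing a size string such as "8Gi" instead of True caps the segment size (assumed behavior).
    shared_memory=True,
)
def train_with_dataloaders() -> None:
    ...
```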