Polyaxon lets you schedule distributed MPI experiments and supports tracking their metrics, outputs, and models.

Overview

To use the mpi backend, users first need to install the MPIJob operator in their cluster.

To enable distributed runs, you need to set the backend field to mpi and update the environment section.

You can optionally annotate your experiments with the framework you are using.

The environment section lets you customize the resources and define the topology (replicas) of the experiment.
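
As a rough sketch of how the fields described above fit together at the top level of a polyaxonfile: the `framework` value below is purely illustrative, and the empty `environment` mapping is a placeholder, not a recommended configuration.

```yaml
# Sketch only: top-level fields relevant to an MPI run.
backend: mpi            # selects the MPI backend for distributed runs
framework: tensorflow   # optional annotation; value chosen for illustration
environment: {}         # resources and replicas are defined here
```

Only `backend: mpi` is needed to select the backend. The framework annotation can be omitted.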

Define the distributed topology

To define a cluster in Polyaxon with 2 workers, add a replicas subsection to the environment section of your polyaxonfile:

...
backend: mpi
...
environment:
  replicas:
    n_workers: 2
    default_worker:
      resources:
        gpu:
          requests: 1
          limits: 1

Because the MPIOperator does not support exposing different resources for individual workers, you can only use the default_worker subsection, which defines the resources shared by all workers.
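
To illustrate the shared-resources constraint, here is a hedged sketch in which every worker inherits the same `default_worker` settings. The `cpu` and `memory` entries are assumed to follow the same `requests`/`limits` shape as the `gpu` entry above, and all numeric values, including the worker count, are made up for illustration.

```yaml
# Sketch only: all 4 workers get identical resources from default_worker.
backend: mpi
environment:
  replicas:
    n_workers: 4          # illustrative worker count
    default_worker:
      resources:
        cpu:              # assumed to mirror the gpu requests/limits shape
          requests: 2
          limits: 4
        memory:           # assumed key; values are illustrative
          requests: 4096
          limits: 8192
        gpu:
          requests: 1     # 4 workers x 1 GPU = 4 GPUs requested in total
          limits: 1
```

Because there is no per-worker override, size `default_worker` for the most demanding worker in the job.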