Polyaxon can schedule and manage Kubeflow operators to run distributed experiments.

Overview

Polyaxon supports and simplifies distributed training. By default, when a user creates a distributed experiment, Polyaxon uses its native behaviour for creating, tracking, and monitoring the experiment.

Starting with v0.4.2, Polyaxon can create distributed experiments on Kubeflow.

Running distributed experiments on Kubeflow

In order to use Kubeflow as a backend for running distributed experiments, the user needs to have a running Kubeflow deployment.

Depending on the framework you are using, you can easily switch the backend used for scheduling a distributed experiment from native to kubeflow:

version: 1
kind: experiment
backend: kubeflow
...

By setting the backend field to kubeflow in your polyaxonfile, Polyaxon will submit your experiment to the Kubeflow operator corresponding to the framework used. For example:

version: 1
kind: experiment
backend: kubeflow
framework: tensorflow
...

This will create a TFJob. To learn more about how to start a distributed TensorFlow experiment, please check this guide.
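For illustration only, here is a minimal sketch of what such a polyaxonfile could look like for a distributed TensorFlow experiment. The replicas layout under environment, the image, and the run command are assumptions for this example, not the authoritative schema; the guide linked above documents the exact fields:

version: 1
kind: experiment
backend: kubeflow
framework: tensorflow
environment:
  # assumed layout for illustration; see the distributed TensorFlow guide
  replicas:
    n_workers: 2  # worker replicas scheduled through the TFJob operator
    n_ps: 1       # parameter server replicas
build:
  image: tensorflow/tensorflow:1.13.1  # hypothetical training image
run:
  cmd: python train.py  # hypothetical entrypoint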

version: 1
kind: experiment
backend: kubeflow
framework: pytorch
...

This will create a PyTorchJob. To learn more about how to start a distributed PyTorch experiment, please check this guide.
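As above, a minimal, hypothetical sketch of a polyaxonfile for a distributed PyTorch experiment; the replicas layout, image, and command are assumptions for this example, and the linked guide documents the exact fields:

version: 1
kind: experiment
backend: kubeflow
framework: pytorch
environment:
  # assumed layout for illustration; see the distributed PyTorch guide
  replicas:
    n_workers: 3  # worker replicas scheduled through the PyTorchJob operator
build:
  image: pytorch/pytorch:latest  # hypothetical training image
run:
  cmd: python train.py  # hypothetical entrypoint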

We have an independent section for the MPIJob integration.

Also, since Polyaxon already supports distributed experiments on MXNet and Horovod, we intend to support the MXNetJob operator, as well as other operators, in the future to give users the option to switch from the native backend to the kubeflow backend.

The next sections will show you how to deploy the supported Kubeflow operators.

Deploying Kubeflow operators

For teams that are not running/using Kubeflow and want to use this integration, Polyaxon provides Helm charts for the currently supported Kubeflow operators.

These Helm charts will be maintained and supported by Polyaxon to allow users to easily deploy and manage them in a similar way to how they manage Polyaxon.

These operators require Helm to be installed; you can have a look at this guide to set up Helm. Note that the commands below use Helm 2 syntax (--name, helm del --purge).

We are also distributing these charts directly on our official charts repo: https://charts.polyaxon.com

$ helm repo add polyaxon https://charts.polyaxon.com
$ helm repo update

Deploying/Deleting TFJob

In order to use Kubeflow as a backend for running distributed TensorFlow experiments, you need to deploy TFJob in the same namespace where Polyaxon was deployed:

$ helm install polyaxon/tfjob --name=plxtf --namespace=polyaxon
$ helm del plxtf --purge

Deploying/Deleting PyTorchJob

In order to use Kubeflow as a backend for running distributed PyTorch experiments, you need to deploy PyTorchJob in the same namespace where Polyaxon was deployed:

$ helm install polyaxon/pytorchjob --name=plxpytorch --namespace=polyaxon
$ helm del plxpytorch --purge

Deploying/Deleting MPIJob

In order to use Kubeflow as a backend for running distributed experiments using MPI, you need to deploy MPIJob in the same namespace where Polyaxon was deployed:

$ helm install polyaxon/mpijob --name=plxmpi --namespace=polyaxon
$ helm del plxmpi --purge