Customize Node Scheduling
Polyaxon provides a list of options to select which nodes should be used for the core platform, for the dependencies, and for the experiments.
Polyaxon comes with 4 node selectors to assign pods to nodes:
core: the core Polyaxon platform
experiments: all user experiments scheduled by Polyaxon
jobs: all generic user jobs scheduled by Polyaxon
builds: all build jobs scheduled by Polyaxon
Additionally, every dependency in our Helm package exposes a node selector option.
By providing these values, or some of them, you can constrain the pods belonging to that category to only run on particular nodes or to prefer to run on particular nodes.
For example, if you have some GPU nodes, you might want to only use them for training your experiments. In this case you should label your nodes:
$ kubectl label nodes <node-name> <label-key>=<label-value>
And use the same label for the experiments node selector. For example:
$ kubectl label nodes worker_1 worker_2 polyaxon.com=experiments
nodeSelectors:
  experiments:
    polyaxon.com: experiments
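The same pattern presumably extends to the other three categories. The following is a minimal sketch, assuming each category sits under `nodeSelectors` in the same way as `experiments`; the label values other than `experiments` are hypothetical and should match whatever labels you applied to your nodes.

```yaml
# Sketch only: the four category keys come from the list above,
# the label values for core, jobs, and builds are made up for illustration.
nodeSelectors:
  core:
    polyaxon.com: core
  experiments:
    polyaxon.com: experiments
  jobs:
    polyaxon.com: jobs
  builds:
    polyaxon.com: builds
```

Any category left out keeps its default behavior, so you only need to set the selectors you actually want to constrain.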
Experiments and Jobs node selectors
In some cases, providing default node selectors for scheduling experiments on specific nodes is not enough. For example, suppose the user has labelled 3 nodes with the following label:
$ kubectl label nodes node1 node2 node3 polyaxon=experiments
One of these nodes has a specific GPU that the user wishes to use for a particular experiment or for running a Jupyter notebook.
The user can relabel that node; the --overwrite flag is required because the polyaxon key is already set on it:
$ kubectl label nodes node3 polyaxon=specific-gpu --overwrite
And use that label to override the default scheduling behavior:
---
version: 1

kind: experiment

environment:
  node_selector:
    polyaxon: specific-gpu

build:
  image: tensorflow/tensorflow:1.4.1-gpu-py3
  build_steps:
    - pip3 install --no-cache-dir -U polyaxon-helper

run:
  cmd: python3 model.py  # Use default params
This will force Polyaxon to schedule this particular experiment on that specific node.
This definition can be used in a very similar way to schedule a notebook or a tensorboard on that node:
---
version: 1

kind: notebook

environment:
  node_selector:
    polyaxon: specific-gpu

build:
  image: tensorflow/tensorflow:1.4.1-gpu-py3
  build_steps:
    - pip3 install jupyter
You can even use this to schedule a particular job of a distributed experiment on that node. For example, suppose the user runs a distributed experiment with a master, 2 workers, and one ps, and wishes to schedule one of the workers on that node:
---
version: 1

kind: experiment

environment:
  tensorflow:
    n_workers: 2
    n_ps: 1

    worker:
      - index: 1
        node_selector:
          polyaxon: specific-gpu

build:
  image: tensorflow/tensorflow:1.4.1
  build_steps:
    - pip install --no-cache-dir -U polyaxon-helper

run:
  cmd: python run.py --train-steps=400 --sync
This will schedule the master, the ps, and the first worker on any of the experiment nodes, and will force the second worker (index 1) to be scheduled on the node with the label polyaxon: specific-gpu.
If one or more taints are applied to a node, and you want to make sure some pods are not deployed on it, Polyaxon provides a tolerations option for the core platform. All dependencies, e.g. the database and the broker, also expose their own tolerations option.
Example for core platform:
tolerations:
  core:
    ...
Example for Rabbitmq:
rabbitmq:
  tolerations:
    ...
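To make the elided values more concrete, here is a hedged sketch of what a filled-in toleration might look like. It assumes that `tolerations.core` accepts a list of standard Kubernetes toleration objects (`key`, `operator`, `value`, `effect`); the taint `dedicated=polyaxon:NoSchedule` is hypothetical and only serves as an illustration.

```yaml
# Sketch only: assumes tolerations.core takes a list of standard
# Kubernetes tolerations. The taint key and value are hypothetical.
tolerations:
  core:
    - key: dedicated        # must match the key of the taint on the node
      operator: Equal       # match on both key and value
      value: polyaxon       # must match the value of the taint
      effect: NoSchedule    # must match the effect of the taint
```

With a toleration like this, the core pods are allowed onto nodes carrying the matching taint, while pods without it are kept off those nodes. A toleration does not attract pods to a node by itself, so it is usually combined with a node selector.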
Affinity allows you to constrain which nodes your pods are eligible to be scheduled on, based on the nodes' labels.
Polyaxon has default affinity values for both the dependencies and the core platform, to ensure that they are deployed on the same node.
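As an illustration of how pods can be steered onto the same node, the sketch below uses the standard Kubernetes pod affinity fields. It is not taken from the Polyaxon chart: the top-level `affinity` key and the pod label `type: polyaxon-core` are assumptions made for this example, so check your chart's default values before overriding them.

```yaml
# Sketch only: standard Kubernetes podAffinity fields.
# The top-level "affinity" key and the "type: polyaxon-core" label
# are assumptions, not confirmed Polyaxon defaults.
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: type
                operator: In
                values:
                  - polyaxon-core
          # Co-locate at the granularity of a single node
          topologyKey: kubernetes.io/hostname
```

The `preferred...` form is a soft constraint: the scheduler tries to place the pods together but still schedules them elsewhere if that node has no room.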