Sections

Version

Represents the polyaxon file specification version.

Example:

version: 1

Kind

Represents the polyaxon file specification kind, i.e. one of the values experiment, group, job, notebook, tensorboard, pipeline.

Example:

kind: experiment

logging

Defines the logging behavior for your execution. This subsection accepts:

  • level: The log level.
  • formatter: The log formatter regex.

Example:

logging:
  level: INFO

Or

logging:
  level: WARNING
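
You can also set a formatter together with the level; the format string below is an illustrative assumption rather than a value taken from this spec:

logging:
  level: INFO
  formatter: "%(asctime)s %(levelname)s %(message)s"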

hptuning

The hptuning section defines the seed, concurrency, search algorithm, early_stopping, and matrix subsections. In general, hptuning defines values that must be unique for all experiments created from the polyaxonfile.

seed

A seed to use when generating random values during the hyperparameters search.

Example:

seed: 3234

concurrency

Defines how many experiments to run concurrently when the polyaxonfile uses a matrix section. This option is ignored if the polyaxonfile only defines one independent experiment.

Example:

concurrency: 3
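
Note that seed and concurrency nest under the hptuning section, together with the other subsections described below; a minimal sketch:

hptuning:
  seed: 3234
  concurrency: 3
  matrix:
    lr:
      values: [0.01, 0.05, 0.1]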

matrix

The matrix section works the same way as the Travis CI matrix section: it creates multiple specifications. How it does so depends on the methods used to define the hyperparameters and on the search algorithm:

  • In the case of grid search, the matrix space is defined by the Cartesian product of all your defined parameters. It is important that only discrete methods are used with grid search.

  • In the case of random search, hyperband, and Bayesian optimization, the search space is defined by sampling from the provided distributions. If discrete values are provided, sampling is done uniformly, unless pvalues is provided.

The matrix also defines variables the same way the declarations section does; the only difference is that all the values generated by the matrix contribute to the definition of an experiment group. Each experiment in this group is defined based on a combination of the values declared in the matrix.

The matrix is defined as a {key: value} object where the key is the name of the parameter you are defining and the value is one of these options:

Discrete values

  • values: a list of values, e.g.

    • [1, 2, 3, 4]
  • range: [start, stop, step] same way you would define a range in python, e.g.

    • [1, 10, 2]
    • {start: 1, stop: 10, step: 2}
    • '1:10:2'
  • linspace: [start, stop, num] steps from start to stop spaced evenly on a linear scale, e.g.

    • [1, 10, 5]
    • {start: 1, stop: 10, num: 20}
    • '1:2:20'
  • logspace: [start, stop, num] steps from start to stop spaced evenly on a log scale, e.g.

    • [1, 10, 5]
    • {start: 1, stop: 10, num: 20}
    • '1:2:20'
  • geomspace: [start, stop, num] steps from start to stop, numbers spaced evenly on a log scale (a geometric progression), e.g.

    • [1, 10, 5]
    • {start: 1, stop: 10, num: 20}
    • '1:2:20'
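
For illustration, the list, dict, and string notations above are interchangeable; a sketch using hypothetical parameter names:

matrix:
  num_layers:
    range: [1, 10, 2]
  units:
    linspace: {start: 1, stop: 10, num: 5}
  lr:
    logspace: '0.01:0.1:5'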

Distributions

  • pvalues: Draws a value_i from values with probability prob_i, e.g.

    • [(value1, prob1), (value2, prob2), (value3, prob3), ...]
  • uniform: Draws samples from a uniform distribution over the half-open interval [low, high), e.g.

    • 0:1
    • [0, 1]
    • {'low': 0, 'high': 1}
  • quniform: Draws samples from a quantized uniform distribution over [low, high], round(uniform(low, high) / q) * q, e.g.

    • 0:1
    • [0, 1]
    • {'low': 0, 'high': 1}
  • loguniform: Draws samples from a log uniform distribution over [low, high], e.g.

    • 0:1
    • [0, 1]
    • {'low': 0, 'high': 1}
  • qloguniform: Draws samples from a quantized log uniform distribution over [low, high]

    • 0:1
    • [0, 1]
    • {'low': 0, 'high': 1}
  • normal: Draws random samples from a normal (Gaussian) distribution defined by [loc, scale]

    • 0:1
    • [0, 1]
    • {'loc': 0, 'scale': 1}
  • qnormal: Draws random samples from a quantized normal (Gaussian) distribution defined by [loc, scale]

    • 0:1
    • [0, 1]
    • {'loc': 0, 'scale': 1}
  • lognormal: Draws random samples from a log normal (Gaussian) distribution defined by [loc, scale]

    • 0:1
    • [0, 1]
    • {'loc': 0, 'scale': 1}
  • qlognormal: Draws random samples from a quantized log normal (Gaussian) distribution defined by [loc, scale]

    • 0:1
    • [0, 1]
    • {'loc': 0, 'scale': 1}
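
The normal and quantized variants accept the same three notations; a short sketch with a hypothetical parameter name:

matrix:
  lr:
    normal: {'loc': 0, 'scale': 1}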

Example:

matrix:
  lr:
    logspace: 0.01:0.1:5

  loss:
    values: [MeanSquaredError, AbsoluteDifference]

Or

matrix:
  lr:
    uniform: 0.01:0.8

  loss:
    pvalues: [(MeanSquaredError, 0.2), (AbsoluteDifference, 0.8)]

These values can be accessed in the following way:

--lr={{ lr }} --loss={{ loss }}

You can, of course, only access one generated value at a time; the value is chosen by the search algorithm defined in the hptuning section.
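
In a polyaxonfile, this templating typically appears in the run section's cmd (described below); the train.py entrypoint here is hypothetical:

run:
  cmd: python train.py --lr={{ lr }} --loss={{ loss }}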

For each experiment generated during the hyperparameters search, Polyaxon will also add these values to your declarations, and will export them under the environment variable name POLYAXON_DECLARATIONS.

Check how you can get the experiment declarations to use them with your models.

search algorithm: grid_search

Hyperparameters search using grid search. This is the default when no algorithm is provided. By default, grid search traverses all possible combinations based on the Cartesian product, unless n_experiments is provided.

Example:

grid_search:
  n_experiments: 10

search algorithm: random_search

Hyperparameters search using random search.

Example:

random_search:
  n_experiments: 10

search algorithm: hyperband

Hyperparameters search using hyperband.

Example:

hyperband:
  max_iter: 81
  eta: 3
  resource:
    name: num_steps
    type: int
  metric:
    name: loss
    optimization: minimize
  resume: False

search algorithm: bo

Hyperparameters search using bayesian optimization.

Example:

bo:
  n_iterations: 15
  n_initial_trials: 30
  metric:
    name: loss
    optimization: minimize
  utility_function:
    acquisition_function: ucb
    kappa: 1.2
    gaussian_process:
      kernel: matern
      length_scale: 1.0
      nu: 1.9
      n_restarts_optimizer: 0

early_stopping

Defines a list of metrics with target values; once a metric reaches its value, the search algorithm stops.

Example:

early_stopping:
  - metric: loss
    value: 0.01
    optimization: minimize

  - metric: accuracy
    value: 0.97
    optimization: maximize
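
Like the other subsections, early_stopping nests under hptuning next to the search algorithm it constrains; a minimal sketch:

hptuning:
  concurrency: 2
  random_search:
    n_experiments: 10
  early_stopping:
    - metric: loss
      value: 0.01
      optimization: minimize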

Environment

The environment section allows you to alter the resources and configuration of the runtime of your experiments.

In this section you can define how many workers/ps to run, as well as the resources, node selectors, and configs of each job.

The values of this section are:

resources

The resources to use for the job. In the case of a distributed run, these are the resources of the master job. A resources definition is optional and made of three optional fields:

  • cpu: {limits: value, requests: value}
  • memory: {limits: value, requests: value}
  • gpu: {limits: value, requests: value}
environment:
  resources:
    cpu:
      requests: 1
      limits: 2
    memory:
      requests: 256
      limits: 1024
    gpu:
      requests: 1
      limits: 1

outputs

Sometimes your experiment or job might depend on previous jobs or experiments, and you need their outputs, either for fine-tuning or for post-processing.

Outputs gives you a way to reference outputs from previous experiments and jobs, using either their ids or their names (if you gave them names).

This will both mount the necessary outputs volumes and expose the paths of those outputs in the experiment/job that requested them.

If you referenced outputs from several jobs and experiments, the paths follow the same order in which they were provided.

environment:
  outputs:
    jobs: [1, 234, 'job_name1', 'another_username/another_project/job_name2']
    experiments: [12, 'experiment_name', 'my_other_project/experiment_name2']

persistence

The volumes to mount for data and outputs. This is only needed when Polyaxon was deployed with multiple data volumes, multiple outputs volumes, or both.

environment:
  persistence:
    data: ['data_volume_name1', 'data_volume_name2', 'data_volume_name3']
    outputs: 'outputs_volume_name2'

node selectors

The labels to use as node selectors for scheduling the job on a specific node. You can also set default node selectors during the deployment and use this subsection to override the default values.

environment:
  node_selector:
    node_label: node_value

tolerations

The tolerations to use for scheduling the job. You can also set default tolerations during the deployment and use this subsection to override the default values.

environment:
  tolerations:
    - key: "key"
      operator: "Exists"
      effect: "NoSchedule"

affinity

The affinity to use for scheduling the job. You can also set default affinity during the deployment and use this subsection to override the default values.

environment:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: type
                operator: In
                values:
                - "polyaxon-experiments"
            topologyKey: "kubernetes.io/hostname"

configmap_refs

A list of config map references to mount during the scheduling of a job/build/experiment.

environment:
  configmap_refs: ['configmap1', 'configmap3']

secret_refs

A list of secret references to mount during the scheduling of a job/build/experiment.

environment:
  secret_refs: ['secret1', 'secret2']

To enable a distributed run, you can define one of the following frameworks:

tensorflow

n_workers

The number of workers to use for an experiment.

n_ps

The number of parameter servers to use for an experiment.
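
A minimal sketch of both fields:

environment:
  tensorflow:
    n_workers: 4
    n_ps: 1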

default_worker

Default environment specification to use for all workers.

environment:
  default_worker:
    resources:
    node_selector:
    affinity:
    tolerations:

default_ps

Default environment specification to use for all ps.

environment:
  default_ps:
    resources:
    node_selector:
    affinity:
    tolerations:

worker

Defines the environment section of a specific worker, indicated by its index:

environment:
  worker:
    - index: i
      resources:
      node_selector:
      affinity:
      tolerations:

ps

Defines the environment section of a specific ps, indicated by its index:

environment:
  ps:
    - index: i
      resources:
      node_selector:
      affinity:
      tolerations:

Example:

environment:

  node_selector:
    polyaxon: experiments

  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: type
                operator: In
                values:
                - "polyaxon-experiments"
            topologyKey: "kubernetes.io/hostname"

  tolerations:
    - key: "key"
      operator: "Exists"
      effect: "NoSchedule"

  resources:
    cpu:
      requests: 2
      limits: 4
    memory:
      requests: 512
      limits: 2048

  tensorflow:
      n_workers: 4
      n_ps: 1

      default_worker:
        resources:
          cpu:
            requests: 1
            limits: 2
          memory:
            requests: 256
            limits: 1024
          gpu:
            requests: 1
            limits: 1
        tolerations:
          - operator: "Exists"

      worker:
        - index: 2
          resources:
            cpu:
              requests: 1
              limits: 2
            memory:
              requests: 256
              limits: 1024
        - index: 3
          node_selector:
            polyaxon: special_node
          tolerations:
            - key: "key"
              operator: "Exists"

      ps:
        - index: 0
          resources:
            cpu:
              requests: 1
              limits: 1
            memory:
              requests: 256
              limits: 1024

mxnet

n_workers

The number of workers to use for an experiment.

n_ps

The number of parameter servers to use for an experiment.

default_worker

Default environment specification to use for all workers.

environment:
  default_worker:
    resources:
    node_selector:
    affinity:
    tolerations:

default_ps

Default environment specification to use for all ps.

environment:
  default_ps:
    resources:
    node_selector:
    affinity:
    tolerations:

worker

Defines the environment section of a specific worker, indicated by its index:

environment:
  worker:
    - index: i
      resources:
      node_selector:
      affinity:
      tolerations:

ps

Defines the environment section of a specific ps, indicated by its index:

environment:
  ps:
    - index: i
      resources:
      node_selector:
      affinity:
      tolerations:

Example:

environment:
  mxnet:
    n_workers: 4
    n_ps: 1

    default_ps:
      node_selector:
        polyaxon: nodes_for_param_servers

pytorch

n_workers

The number of workers to use for an experiment.

default_worker

Default environment specification to use for all workers.

environment:
  default_worker:
    resources:
    node_selector:
    affinity:
    tolerations:

worker

Defines the environment section of a specific worker, indicated by its index:

environment:
  worker:
    - index: i
      resources:
      node_selector:
      affinity:
      tolerations:

Example:

environment:
  pytorch:
    n_workers: 4

horovod

n_workers

The number of workers to use for an experiment.

default_worker

Default environment specification to use for all workers.

environment:
  default_worker:
    resources:
    node_selector:
    affinity:
    tolerations:

worker

Defines the environment section of a specific worker, indicated by its index:

environment:
  worker:
    - index: i
      resources:
      node_selector:
      affinity:
      tolerations:

Example:

environment:
  horovod:
    n_workers: 4

declarations

This section is the appropriate place to declare constants and variables that will be used by the rest of our specification file.

To declare a simple constant value:

declarations:
  batch_size: 128

List of values or nested values:

declarations:
  layer_units: [100, 200, 10]

Or

declarations:
  convolutions:
    conv1:
       kernels: [32, 32]
       size: [2, 2]
       strides: [1, 1]
    conv2:
       kernels: [64, 64]
       size: [2, 2]
       strides: [1, 1]

This declaration can be used to pass values to our program:

 ... --batch-size={{ batch_size }}
--unit1="{{ layer_units[0] }}" --unit2="{{ layer_units[1] }}" --unit3="{{ layer_units[2] }}"
--conv1_kernels="{{ convolutions.conv1.kernels }}" --conv1_strides="{{ convolutions.conv1.strides }}" ...

The double brackets are important; they indicate that we want to use our declared values.

The declarations are particularly important for descriptive models.

All your declarations will be exported under the environment variable name POLYAXON_DECLARATIONS.

Check how you can get the experiment declarations to use them with your models.

build

This is where you define how you build an image to run your code. This section defines the following values/subsections:

  • image [required]: the base image Polyaxon will use to build an image for running your code.
  • build_steps [optional]: a list of commands that Polyaxon runs with docker RUN to install or execute the operations you define in the list.
  • env_vars [optional]: a list of 2-element tuples that Polyaxon uses to set environment variables in the docker image.
  • git [optional]: the git url of an external repo.
  • ref [optional]: the commit/branch/treeish to use for the build.
  • nocache [optional]: to force a rebuild of the image.

Example:
build:
  image: my_image
  build_steps:
    - pip install PILLOW
    - pip install scikit-learn
  env_vars:
    - [KEY1, VALUE1]
    - [KEY2, VALUE2]
  ref: 14e9d652151eb058afa0b51ba110671f2ca10cbf
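
To force a rebuild of a previously built image, a minimal sketch setting nocache:

build:
  image: my_image
  nocache: true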

External repo

build:
  image: ubuntu
  git: https://github.com/user/repo
  ref: 14e9d652151eb058afa0b51ba110671f2ca10cbf

run

This is where you define how you want to run your code.

  • cmd [required]: The command(s) to run during the execution of your code.

Some examples of valid cmd commands to pass in the polyaxonfile:

run:
  cmd: video_prediction_train --num_masks=1

Or

run:
  cmd: video_prediction_train --num_masks=1 && video_prediction_train --num_masks=2

Or

run:
  cmd: ./file1.sh || ./file2.sh

Or

run:
  cmd: ./file1.sh ; ./file2.sh

Or

run:
  cmd: 
    - video_prediction_train --num_masks=1
    - video_prediction_train --num_masks=2
    - video_prediction_train --num_masks=3