Polyaxon is structured as a modern, decoupled, microservice-oriented architecture made up of the following components:
- A robust core JSON API
- An asynchronous, customizable, and scalable scheduler
- An extensive tracking API
- An event/action oriented interface
- A pipeline engine capable of authoring workflows as directed acyclic graphs (DAGs)
- An optimization engine that automatically and concurrently searches a search space for the best hyperparameters, based on state-of-the-art algorithms
- A CI system that automatically triggers experiments, hyperparameter tuning, and pipelines based on events, tracks their execution, and reports results to users
These components work together to make every Polyaxon deployment function smoothly, but because they are decoupled there is plenty of room for customization.
For example, users can deploy only the core and tracking APIs, and replace the built-in scheduler, pipeline engine, and optimization engine with other platforms.
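The decoupling described above can be illustrated with a small sketch, assuming a hypothetical `Scheduler` interface. The core API depends only on that interface, so the built-in scheduler can be swapped for another platform without changing the core. `Scheduler`, `BuiltinScheduler`, `ExternalScheduler`, and `CoreAPI` are illustrative names, not Polyaxon classes.

```python
from typing import Protocol


class Scheduler(Protocol):
    """Hypothetical interface that any scheduler backend would satisfy."""

    def submit(self, run_id: str) -> str: ...


class BuiltinScheduler:
    """Stand-in for the built-in scheduler."""

    def submit(self, run_id: str) -> str:
        return f"builtin:{run_id}"


class ExternalScheduler:
    """Stand-in for a third-party platform replacing the built-in scheduler."""

    def submit(self, run_id: str) -> str:
        return f"external:{run_id}"


class CoreAPI:
    """Depends only on the Scheduler interface, never on a concrete backend."""

    def __init__(self, scheduler: Scheduler) -> None:
        self.scheduler = scheduler
        self.runs: dict[str, str] = {}  # run id -> handle returned by the scheduler

    def create_run(self, run_id: str) -> str:
        handle = self.scheduler.submit(run_id)
        self.runs[run_id] = handle
        return handle
```

Swapping the backend only changes the constructor argument, for example `CoreAPI(ExternalScheduler())` instead of `CoreAPI(BuiltinScheduler())`.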
Polyaxon relies on several components to function smoothly.
Depending on the version you are deploying, you may also need:
- Kubernetes cluster(s) for deploying Polyaxon
- Docker, Docker Compose, or a container management platform for deploying a scalable Polyaxon (tracking-only) version
- A Linux machine for installing the platform from source
To understand how Polyaxon can help you organize your workflow, you need to understand how Polyaxon abstracts the best practices of data science jobs.
Polyaxon runs both in the cloud and on-premises, and provides access via:
- Polyaxon command line interface
- Polyaxon dashboard
- Polyaxon SDKs targeting the Polyaxon API
- Polyaxon Webhooks
These interfaces hide the powerful abstractions provided by the Polyaxon architecture. When a machine learning engineer or a data scientist deploys a model, Polyaxon relies on Kubernetes for:
- Managing the resources of your cluster (memory, CPU, and GPU)
- Creating easy, repeatable, portable deployments
- Scaling up and down as needed
Polyaxon does the heavy lifting of:
- Scheduling the jobs
- Versioning the code
- Creating Docker images
- Monitoring the statuses and resources
- Tracking params, logs, configurations, and tags
- Reporting metrics, outputs, and other results to the user
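The tracking and reporting duties listed above can be pictured with a minimal in-memory tracker. This is a sketch under the assumption that a run records params, tags, and step-indexed metrics. `RunTracker` and its methods are hypothetical and are not Polyaxon's tracking API.

```python
from dataclasses import dataclass, field


@dataclass
class RunTracker:
    """Hypothetical in-memory tracker for a single run (not Polyaxon's API)."""

    params: dict[str, object] = field(default_factory=dict)
    tags: set[str] = field(default_factory=set)
    # metric name -> list of (step, value) pairs, in logging order
    metrics: dict[str, list[tuple[int, float]]] = field(default_factory=dict)

    def log_params(self, **params: object) -> None:
        self.params.update(params)

    def add_tags(self, *tags: str) -> None:
        self.tags.update(tags)

    def log_metric(self, name: str, value: float, step: int) -> None:
        self.metrics.setdefault(name, []).append((step, value))

    def report(self) -> dict[str, float]:
        """Summarize each metric by the last value that was logged for it."""
        return {name: points[-1][1] for name, points in self.metrics.items()}
```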
Using Docker containers to run jobs is an important choice: it gives users a wide range of possibilities to customize the run environment to fit the requirements and dependencies of their experiments.
Polyaxon relies on a set of concepts to manage the experimentation process. This section provides a high-level introduction to these concepts, and each concept is covered in more detail on its dedicated page.
A user is the entity that creates projects, starts experiments, creates jobs and pipelines, and manages teams and clusters.
A user has a set of permissions and can be either a normal user or a superuser.
Please refer to the user management section for more details.
A team provides a way to manage groups of users, their access roles, and their resource quotas.
This entity exists only in Polyaxon EE.
A quota is attached to a user, team, or project. The entities created under it (builds, jobs, experiments, notebooks) cannot exceed the quota's parallelism and may not consume more
resources than the quota specification allows.
This entity exists only in Polyaxon EE.
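The quota rule above (a bound on parallelism plus a bound on resources) can be expressed as a simple admission check. This is a minimal sketch. `Quota`, `Usage`, `can_schedule`, and their field names are hypothetical and do not reflect Polyaxon EE's actual quota schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    """Hypothetical quota spec; field names are illustrative only."""

    max_parallel_runs: int
    cpu: float
    memory_gb: float
    gpu: int


@dataclass(frozen=True)
class Usage:
    """What the user/team/project is currently consuming."""

    running: int
    cpu: float
    memory_gb: float
    gpu: int


def can_schedule(quota: Quota, usage: Usage, cpu: float, memory_gb: float, gpu: int) -> bool:
    """Return True if one more run with these resource requests fits within the quota."""
    if usage.running + 1 > quota.max_parallel_runs:
        return False  # would exceed the allowed parallelism
    # every resource must stay within its limit after adding the new run
    return (
        usage.cpu + cpu <= quota.cpu
        and usage.memory_gb + memory_gb <= quota.memory_gb
        and usage.gpu + gpu <= quota.gpu
    )
```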
A project in Polyaxon is very similar to a project on GitHub:
it aims to organize your efforts to solve a specific problem.
A project consists of a name, a description, the code to execute, the data, and a polyaxonfile.yml.
Please refer to the projects section for more details.
An experiment is the execution of your model on the cluster, using your data and the provided parameters.
An experiment job is the Kubernetes pod running on the cluster for a specific experiment;
if an experiment runs in a distributed way, it creates multiple instances of experiment jobs.
Please refer to the experiments section for more details.
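The relationship between an experiment and its experiment jobs can be sketched as follows. A non-distributed experiment maps to one job, and a distributed one maps to one job per replica. The `master`/`worker`/`ps` roles are an assumption borrowed from TensorFlow-style distributed training, and `ExperimentJob`, `plan_jobs`, and the naming scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentJob:
    """One pod-like unit of work belonging to an experiment (illustrative only)."""

    experiment_id: int
    role: str
    index: int

    @property
    def name(self) -> str:
        return f"experiment-{self.experiment_id}-{self.role}-{self.index}"


def plan_jobs(experiment_id: int, n_workers: int = 0, n_ps: int = 0) -> list[ExperimentJob]:
    """Return one job for a non-distributed experiment, or one job per replica otherwise."""
    jobs = [ExperimentJob(experiment_id, "master", 0)]
    jobs += [ExperimentJob(experiment_id, "worker", i) for i in range(n_workers)]
    jobs += [ExperimentJob(experiment_id, "ps", i) for i in range(n_ps)]
    return jobs
```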
An experiment group provides two interfaces:
- An automatic and practical way to run a version of your model and data with different hyperparameters, based on a hyperparameter search algorithm.
- A selection of experiments to compare.
Please refer to the experiment groups - selection section for more details on how to create group selections.
Please refer to the experiment groups - hyperparameters optimization section for more details on how to run hyperparameter search.
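To make the hyperparameter-search interface concrete, here is a sketch of the two simplest search strategies, grid search and random search, plus a helper that splits the resulting configurations into waves bounded by a concurrency limit. It does not describe which algorithms Polyaxon ships or how it schedules them. The function names and the `space` format (parameter name mapped to a list of candidate values) are assumptions.

```python
import itertools
import random


def grid_search(space: dict[str, list]) -> list[dict]:
    """One configuration (i.e. one experiment) per combination of candidate values."""
    names = sorted(space)
    return [dict(zip(names, combo)) for combo in itertools.product(*(space[n] for n in names))]


def random_search(space: dict[str, list], n_experiments: int, seed: int = 0) -> list[dict]:
    """Sample n_experiments configurations uniformly from the candidate values."""
    rng = random.Random(seed)  # seeded so the search is reproducible
    return [{name: rng.choice(values) for name, values in space.items()} for _ in range(n_experiments)]


def batches(configs: list[dict], concurrency: int) -> list[list[dict]]:
    """Split configurations into waves of at most `concurrency` concurrently running experiments."""
    return [configs[i:i + concurrency] for i in range(0, len(configs), concurrency)]
```

Grid search is exhaustive and grows multiplicatively with each parameter. Random search caps the number of experiments regardless of the size of the space.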
A job is the execution of your code to do data processing or any other generic operation.
Please refer to the jobs section for more details.
A BuildJob is the process of creating containers; Polyaxon provides different backends for creating them.
Please refer to the build jobs section for more details.
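One way a build backend could turn a base image and a list of build steps into a container definition is to render a Dockerfile. This sketch assumes that model. `render_dockerfile`, its parameters, and the `/code` working directory are hypothetical and are not Polyaxon's build configuration.

```python
def render_dockerfile(base_image: str, build_steps: list[str], workdir: str = "/code") -> str:
    """Render a Dockerfile that copies the project code, then runs each build step in order."""
    lines = [f"FROM {base_image}", f"WORKDIR {workdir}", f"COPY . {workdir}"]
    lines += [f"RUN {step}" for step in build_steps]  # one RUN instruction per build step
    return "\n".join(lines) + "\n"
```

Copying the code before the build steps keeps the sketch short. A real backend would likely install dependencies first so that Docker's layer cache survives code changes.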
A Tensorboard is a job that visualizes the metrics of a single experiment,
of all experiments created during a hyperparameter optimization group,
of all experiments in a selection group, or of all experiments in a project.
Please refer to the tensorboards section for more details.
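The four Tensorboard scopes above can be read as four ways of choosing which experiments' metrics to load. This sketch resolves a scope to a list of experiment ids. `Experiment`, `tensorboard_experiments`, the scope strings, and the `selections` mapping (selection name to experiment ids) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Experiment:
    id: int
    project: str
    group: Optional[int] = None  # hyperparameter-optimization group, if any


def tensorboard_experiments(
    experiments: list[Experiment],
    scope: str,
    key: object,
    selections: Optional[dict[str, set[int]]] = None,
) -> list[int]:
    """Return the ids of the experiments a Tensorboard with this scope would visualize."""
    if scope == "experiment":
        chosen = [e for e in experiments if e.id == key]
    elif scope == "group":
        chosen = [e for e in experiments if e.group == key]
    elif scope == "selection":
        members = (selections or {}).get(key, set())
        chosen = [e for e in experiments if e.id in members]
    elif scope == "project":
        chosen = [e for e in experiments if e.project == key]
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return [e.id for e in chosen]
```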
A notebook is a project-wide job that provides a fast and easy way to explore data and start experiments.
Polyaxon provides different backends to start notebooks or JupyterLab.
Please refer to the project notebooks section for more details.
Checkpointing is a very important concept in machine learning: it prevents losing progress and makes it possible to resume an experiment from a specific state.
Polyaxon provides some structure and organization for checkpointing and saving outputs.
Please refer to the save, resume & restart section for more details.
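The save-and-resume idea can be sketched with plain files. Each step's state is written atomically to a numbered checkpoint, and a restarted run picks up from the most recent one. This is a generic pattern, not Polyaxon's outputs layout. The file naming, the JSON format, and the toy `train` loop are assumptions.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(directory: str, step: int, state: dict) -> str:
    """Write the state for `step` via a temp file + rename, so a crash never leaves a partial checkpoint."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"checkpoint-{step:08d}.json")  # zero-padded so names sort by step
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    return path


def latest_checkpoint(directory: str) -> Optional[tuple[int, dict]]:
    """Return (step, state) from the most recent checkpoint, or None if there is none."""
    if not os.path.isdir(directory):
        return None
    names = sorted(n for n in os.listdir(directory) if n.startswith("checkpoint-") and n.endswith(".json"))
    if not names:
        return None
    with open(os.path.join(directory, names[-1])) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(directory: str, total_steps: int) -> dict:
    """Resume from the latest checkpoint if one exists, otherwise start from step 0."""
    resumed = latest_checkpoint(directory)
    step, state = resumed if resumed else (0, {"loss": 1.0})
    while step < total_steps:
        step += 1
        state = {"loss": state["loss"] * 0.9}  # stand-in for one real training step
        save_checkpoint(directory, step, state)
    return {"step": step, **state}
```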