To understand how Polyaxon can help you organize your workflow, you first need to understand how Polyaxon abstracts the best practices of data science work.
Polyaxon runs both in the cloud and on premise, and provides access via:
- Polyaxon command line interface
- Polyaxon dashboard
- Polyaxon SDK targeting the Polyaxon API
These interfaces hide the powerful abstraction provided by the Polyaxon architecture. When a machine learning engineer or a data scientist deploys a model, Polyaxon relies on Kubernetes for:
- Managing the resources of your cluster (Memory, CPU, and GPU)
- Creating easy, repeatable, portable deployments
- Scaling up and down as needed
Polyaxon does the heavy lifting of:
- Scheduling the jobs
- Versioning the code
- Creating Docker images
- Monitoring the statuses and resources
- Reporting back the results to the user
The choice of using Docker containers to run your jobs is important: it gives the user a wide range of possibilities to customize the run environment to fit the requirements and dependencies needed for the experiments.
Polyaxon relies on a set of concepts to manage an experimentation workflow. This section provides a high-level introduction to these concepts; each concept is covered in more detail on its own page.
User is the entity that creates projects, starts experiments, and manages teams and clusters.
A user has a set of permissions and can be either a normal user or a superuser.
Please refer to the management section for more details.
Team provides a way to manage a group of users, their access roles, and their resource quotas.
This feature is still a work in progress. If you want to be notified when it is released, please subscribe to our progress updates.
Project in Polyaxon is very similar to a project in GitHub:
it aims at organizing your efforts to solve a specific problem.
A project consists of a name and a description, the code to execute, the data, and a polyaxonfile.yml.
Please refer to the projects section for more details.
Experiment Group is an automatic and practical way to run a version of your model and data with different hyperparameters.
Please refer to the experiment groups and hyperparameters search section for more details.
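To make the idea of an experiment group concrete, here is a minimal sketch of how one set of hyperparameter values can be expanded into one parameter set per experiment. It is not Polyaxon's implementation: the function `expand_matrix` and the `lr` and `dropout` names are assumptions made for illustration only.

```python
from itertools import product


def expand_matrix(matrix):
    """Return one parameter dict per combination of the values in `matrix`."""
    names = sorted(matrix)  # fixed order so the expansion is deterministic
    combos = product(*(matrix[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical search space: two learning rates and three dropout values.
matrix = {"lr": [0.01, 0.1], "dropout": [0.2, 0.3, 0.5]}
experiments = expand_matrix(matrix)

for i, params in enumerate(experiments, start=1):
    print(f"experiment {i}: {params}")
```

Each dictionary in `experiments` stands for one experiment of the group: the same code and data, run with a different combination of values.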
Experiment is the execution of your model with data and the provided parameters on the cluster.
Please refer to the experiments and distributed runs section for more details.
Experiment Job is the Kubernetes pod running on the cluster for a specific experiment.
If an experiment runs in a distributed way, it creates multiple experiment jobs.
Please refer to the experiment jobs section for more details.
Distributed Experiment is the execution of a model or a computation graph across a cluster.
Please refer to the distributed experiments section for more details.
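The relationship between an experiment and its jobs can be pictured with a small data model. This is only a sketch under stated assumptions: the `ExperimentJob` dataclass, the `plan_jobs` helper, and the `master` and `worker` role labels are hypothetical and are not taken from the Polyaxon API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentJob:
    experiment_id: int
    role: str   # hypothetical role label, e.g. "master" or "worker"
    index: int  # position of the replica within its role


def plan_jobs(experiment_id: int, n_workers: int = 0) -> List[ExperimentJob]:
    """One master job, plus one job per additional worker replica."""
    if n_workers < 0:
        raise ValueError("n_workers must be non-negative")
    jobs = [ExperimentJob(experiment_id, "master", 0)]
    jobs += [ExperimentJob(experiment_id, "worker", i) for i in range(n_workers)]
    return jobs


single = plan_jobs(experiment_id=1)                    # non-distributed run
distributed = plan_jobs(experiment_id=2, n_workers=3)  # distributed run
print(len(single), len(distributed))
```

In this model a non-distributed experiment maps to a single job, while a distributed one maps to several jobs that all belong to the same experiment.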
Job is the execution of your code to do some data processing or any generic operation.
Please refer to the jobs section for more details.
Finding good hyperparameters can be very challenging: it requires efficiently searching the space of possible hyperparameters as well as managing a large set of experiments for hyperparameter tuning.
Please refer to the hyperparameters search section for more details.
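As an illustration of searching a hyperparameter space while keeping track of every trial, the following sketch runs a plain random search. It is a generic example, not the search algorithm Polyaxon uses: `toy_loss` is a stand-in for a real training run, and the parameter names and ranges are assumptions.

```python
import random


def toy_loss(lr, dropout):
    """Stand-in for a real training run; minimized at lr=0.1, dropout=0.3."""
    return (lr - 0.1) ** 2 + (dropout - 0.3) ** 2


def random_search(n_trials, seed=0):
    rng = random.Random(seed)  # seeded so the run is reproducible
    trials = []
    for _ in range(n_trials):
        params = {"lr": rng.uniform(0.001, 0.5), "dropout": rng.uniform(0.0, 0.8)}
        trials.append((toy_loss(**params), params))
    # Keep every trial, ordered by loss, so results can be compared afterwards.
    return sorted(trials, key=lambda t: t[0])


trials = random_search(n_trials=20)
best_loss, best_params = trials[0]
print(f"best loss {best_loss:.4f} with {best_params}")
```

Even in this toy form, the two concerns named above are visible: sampling the space, and recording each trial so the whole set of experiments can be ranked and compared.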
Checkpointing, resuming and restarting experiments
Checkpointing is a very important concept in machine learning: it prevents losing progress. It also provides the possibility to resume an experiment from a specific state.
Polyaxon provides some structure and organization regarding checkpointing and outputs saving.
Please refer to the save, resume & restart section for more details.
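The general pattern of checkpointing and resuming can be sketched in a few lines. This example does not use any Polyaxon API: the JSON file layout, the helper names, and the toy "training" update are all assumptions chosen to keep the sketch self-contained.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it into place atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def train(path, until_step):
    state = load_checkpoint(path)  # resume from the last saved step, if any
    while state["step"] < until_step:
        state["step"] += 1
        state["total"] += state["step"]  # stand-in for a real training update
        save_checkpoint(path, state)
    return state


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first_run = train(checkpoint_path, until_step=3)    # stops after step 3
resumed_run = train(checkpoint_path, until_step=5)  # continues from step 3
print(first_run, resumed_run)
```

The second call does not repeat steps 1 to 3; it picks up from the saved state, which is the behavior that resuming an experiment relies on.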
Tensorboard is a job that runs to visualize the metrics of an experiment,
the experiments of a group, or the experiments of a project.
Please refer to the tensorboard section for more details.
Project plugin is a job that runs project-wide.
Please refer to the project plugins section for more details.