Polyaxon uses an ephemeral token to authenticate jobs and experiments before granting the client access to other APIs related to that job or experiment. These ephemeral tokens have a TTL with a default value (e.g. 3 hours), after which the token is invalidated, leaving the job or experiment unable to authenticate.

Use experiment groups to control the concurrency

You can use experiment groups to schedule only as many experiments at a time as your cluster has the resources to schedule and run, so that experiments do not wait long enough for their tokens to expire.
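As a rough illustration, a group polyaxonfile that caps concurrency might look like the sketch below. It assumes the `hptuning` section of the v0.x polyaxonfile specification; the image, the `train.py` script, and the `lr` parameter are hypothetical placeholders, and the concurrency value should be chosen to fit your own cluster.

```yaml
version: 1
kind: group

hptuning:
  # Run at most 2 experiments of this group at the same time;
  # the rest are only scheduled as earlier ones finish.
  concurrency: 2
  matrix:
    lr:
      values: [0.001, 0.01, 0.1]

build:
  image: tensorflow/tensorflow:1.4.1-py3  # hypothetical image

run:
  cmd: python3 train.py --lr={{ lr }}  # hypothetical script and parameter
```

With a setup like this, the group defines three experiments but only two are running at any moment, so the remaining one is not created and left waiting while its token ages.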

Increase the ephemeral token TTL

If you want to schedule a large number of experiments or jobs on Kubernetes and avoid this issue, you can increase the ephemeral token TTL. The value is in seconds:

ttl:
  ephemeralToken:  # in seconds, e.g. 3600
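For example, assuming you wanted tokens to stay valid for 6 hours, the value would be 6 × 3600 = 21600 seconds. The 6-hour figure is purely illustrative, not a recommendation; choose a value that covers the longest time an experiment may wait before it starts.

```yaml
ttl:
  ephemeralToken: 21600  # 6 hours, expressed in seconds (illustrative value)
```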