Hi,

We run a mixed inference/training cluster with a fixed number of GPUs. We use PriorityClasses to manage GPU allocation, but a Volcano job with a higher priority class does not seem to preempt or evict inference pods that have a lower one.

Example: a job/PodGroup with priority 1300000 is stuck in Pending with the message `volcano 0/101 nodes are unavailable: 100 Insufficient nvidia.com/gpu.`, even though more than enough GPUs are held by workloads with priority 1200000.

When I deploy a pod that does not use `schedulerName: volcano`, it successfully evicts those pods and takes the GPUs. Is there a way to configure Volcano to evict lower-priority pods that were scheduled by the default Kubernetes scheduler?

Attaching my queue and example job config.
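For readers without the attachments, here is a rough sketch of the kind of resources involved. All names, images, priority values, and queue settings below are illustrative assumptions, not the actual configs attached to the issue.

```yaml
# Illustrative sketch only -- names, images, and values are assumptions,
# not the configs attached to this issue.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high          # hypothetical priority class for training jobs
value: 1300000
globalDefault: false
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  weight: 1
  reclaimable: true
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: train-example          # hypothetical training job
spec:
  schedulerName: volcano
  queue: default
  minAvailable: 1
  priorityClassName: training-high
  tasks:
    - replicas: 1
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example.com/train:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```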
Theoretically it is possible: the scheduler's `--scheduler-name` parameter can be given a list of scheduler names. I don't recommend doing this, though, because you would then have two schedulers active in the cluster at the same time, and since each maintains its own internal cache, their scheduling decisions may conflict. Volcano supports scheduling for Deployment/StatefulSet/DaemonSet/Job, etc., and it fully supports the native filtering and scoring of kube-scheduler: https://volcano.sh/en/docs/unified_scheduling/, so you can schedule training and inference jobs with Volcano at the same time.

By the way, are you saying that your inference services are scheduled by kube-scheduler? Are they deployed with a Deployment, and have those inference services already occupied all the GPU resources in the cluster?
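For reference, preemption in Volcano is driven by the action list in the scheduler configuration (the `volcano-scheduler.conf` key of the scheduler ConfigMap). Below is a rough sketch with the `preempt` action enabled; the plugin set shown follows the commonly documented default, so verify the exact action and plugin names against the docs for your Volcano version. Whether this also lets Volcano evict pods placed by kube-scheduler is exactly the open question in this thread.

```yaml
# Sketch of a volcano-scheduler.conf with the preempt action enabled.
# Verify action and plugin names against your Volcano version's documentation.
actions: "enqueue, allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority       # orders jobs/tasks by priority class
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
```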