
Can the volcano scheduler evict pods not managed by volcano? #4039

Open
ironman5366 opened this issue Feb 24, 2025 · 1 comment
Labels
kind/question Categorizes issue related to a new question

Comments

@ironman5366

Please describe your problem in detail

Hi,

We run a mixed inference/training cluster with a fixed number of GPUs. We use PriorityClasses to manage GPU allocation, but it doesn't seem that a Volcano job with a higher PriorityClass can preempt or evict inference jobs with lower ones.

Example: a job/PodGroup with priority 1300000 is stuck in Pending with the message volcano 0/101 nodes are unavailable: 100 Insufficient nvidia.com/gpu., even though more than enough GPUs are occupied by workloads with the lower priority 1200000.

When I deploy a pod that doesn't use schedulerName: volcano, I see it successfully evict those pods and use the GPUs. Is there a way to configure Volcano to evict pods that were scheduled by the default Kubernetes scheduler and have lower priority classes?
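
For context, the two PriorityClasses look roughly like this (the values and the training-high name are as described above; the inference class name here is hypothetical):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high        # referenced by the Volcano job below
value: 1300000
globalDefault: false
description: High-priority training workloads
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference            # hypothetical name for the inference workloads
value: 1200000
globalDefault: false
description: Lower-priority inference workloads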

Attaching my queue and example job config:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 1
  reclaimable: false
  capability:
    nvidia.com/gpu: 320
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-02-24T02:45:39Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: train-example
  namespace: default
  resourceVersion: "209718960"
  uid: df1b9ff9-c9a2-4795-8cda-10d8a5350a0f
spec:
  maxRetry: 3
  minAvailable: 4
  plugins:
    pytorch:
    - --master=main
    - --worker=worker
    - --port=23456
    svc: []
  priorityClassName: training-high
  queue: training
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: main
    policies:
    - action: CompleteJob
      event: TaskCompleted
    replicas: 1
    template:
      metadata:
        name: train-example-main
      spec:
        containers:
        - env:
          - name: MULTINODE
            value: "true"
          image: example-image
          name: train-example
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              nvidia.com/gpu: "8"
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        priorityClassName: training-high
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
  - maxRetry: 3
    minAvailable: 3
    name: worker
    replicas: 3
    template:
      metadata:
        name: train-example-worker
      spec:
        containers:
        - env:
          - name: MULTINODE
            value: "true"
          image: example-image
          name: train-example
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              nvidia.com/gpu: "8"
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        priorityClassName: training-high
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
  volumes:
  - mountPath: /mount
    volumeClaimName: example-pvc
status:
  conditions:
  - lastTransitionTime: "2025-02-24T02:45:39Z"
    status: Pending
  - lastTransitionTime: "2025-02-24T02:45:39Z"
    status: Pending
  controlledResources:
    plugin-pytorch: pytorch
    plugin-svc: svc
    volume-pvc-example-pvc: example-pvc
  minAvailable: 4
  pending: 4
  state:
    lastTransitionTime: "2025-02-24T02:45:39Z"
    phase: Pending
  taskStatusCount:
    main:
      phase:
        Pending: 1
    worker:
      phase:
        Pending: 3

Any other relevant information

No response

ironman5366 added the kind/question label on Feb 24, 2025
@JesseStutler
Member

Theoretically it is possible: the --scheduler-name parameter of the scheduler accepts a list, so Volcano can also pick up pods that use other scheduler names. However, I don't recommend doing this, because with two schedulers running in the cluster at the same time, their internal caches differ and their scheduling decisions may conflict. Volcano supports scheduling Deployments/StatefulSets/DaemonSets/Jobs, etc., and it fully supports the native filtering and scoring of kube-scheduler: https://volcano.sh/en/docs/unified_scheduling/, so you can schedule training and inference jobs together with Volcano. By the way, are you saying that your inference service is scheduled by kube-scheduler? Is it deployed as a Deployment, and have these inference services already occupied all of the GPU resources in the cluster?
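
For readers who want to try this, a rough sketch of the two knobs involved, assuming a standard Helm install (the ConfigMap name, namespace, and container args may differ in your deployment). Preemption in Volcano is driven by the preempt action (and reclaim, for cross-queue eviction) together with the priority plugin, so those need to be active in volcano-scheduler.conf:

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    # add "preempt" (and "reclaim" if eviction across queues is needed) to the default action list
    actions: "enqueue, allocate, preempt, backfill"
    tiers:
    - plugins:
      - name: priority      # compares PriorityClass values when choosing victims
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

The multi-scheduler-name setup mentioned above would then look roughly like this on the vc-scheduler container args; this is a sketch of the idea rather than a recommended setup, given the cache-conflict caveat:

    args:
    - --scheduler-name=volcano
    - --scheduler-name=default-scheduler   # also consider pods created with the default scheduler name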
