
Can the volcano scheduler evict pods not managed by volcano? #4039

Open
ironman5366 opened this issue Feb 24, 2025 · 1 comment
Labels
kind/question Categorizes issue related to a new question

Comments

@ironman5366

Please describe your problem in detail

Hi,

We run a mixed inference/training cluster with a fixed number of GPUs. We use PriorityClasses to manage GPU allocation, but it doesn't seem that a Volcano job with a higher PriorityClass can preempt or evict inference jobs with lower ones.

Example: a job/PodGroup with priority 1300000 is stuck in Pending with the message volcano 0/101 nodes are unavailable: 100 Insufficient nvidia.com/gpu., even though more than enough GPUs are occupied by workloads with the lower priority 1200000.

When I deploy a pod that doesn't use schedulerName: volcano, I see it successfully evict those pods and use the GPUs. Is there a way to configure Volcano to evict pods that were scheduled by the default Kubernetes scheduler and have lower priority classes?
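
For context, the two PriorityClasses look roughly like this (the values and the training-high name are as described above; the inference class name here is hypothetical):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high        # referenced by the Volcano job below
value: 1300000
globalDefault: false
description: High-priority training workloads
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference            # hypothetical name for the inference workloads
value: 1200000
globalDefault: false
description: Lower-priority inference workloads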

Attaching my queue and example job config:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 1
  reclaimable: false
  capability:
    nvidia.com/gpu: 320
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2025-02-24T02:45:39Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: train-example
  namespace: default
  resourceVersion: "209718960"
  uid: df1b9ff9-c9a2-4795-8cda-10d8a5350a0f
spec:
  maxRetry: 3
  minAvailable: 4
  plugins:
    pytorch:
    - --master=main
    - --worker=worker
    - --port=23456
    svc: []
  priorityClassName: training-high
  queue: training
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: main
    policies:
    - action: CompleteJob
      event: TaskCompleted
    replicas: 1
    template:
      metadata:
        name: train-example-main
      spec:
        containers:
        - env:
          - name: MULTINODE
            value: "true"
          image: example-image
          name: train-example
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              nvidia.com/gpu: "8"
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        priorityClassName: training-high
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
  - maxRetry: 3
    minAvailable: 3
    name: worker
    replicas: 3
    template:
      metadata:
        name: train-example-worker
      spec:
        containers:
        - env:
          - name: MULTINODE
            value: "true"
          image: example-image
          name: train-example
          resources:
            limits:
              nvidia.com/gpu: "8"
            requests:
              nvidia.com/gpu: "8"
          volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        priorityClassName: training-high
        volumes:
        - emptyDir:
            medium: Memory
          name: dshm
  volumes:
  - mountPath: /mount
    volumeClaimName: example-pvc
status:
  conditions:
  - lastTransitionTime: "2025-02-24T02:45:39Z"
    status: Pending
  - lastTransitionTime: "2025-02-24T02:45:39Z"
    status: Pending
  controlledResources:
    plugin-pytorch: pytorch
    plugin-svc: svc
    volume-pvc-example-pvc: example-pvc
  minAvailable: 4
  pending: 4
  state:
    lastTransitionTime: "2025-02-24T02:45:39Z"
    phase: Pending
  taskStatusCount:
    main:
      phase:
        Pending: 1
    worker:
      phase:
        Pending: 3

Any other relevant information

No response

ironman5366 added the kind/question label on Feb 24, 2025
@JesseStutler
Member

Theoretically it is possible: the --scheduler-name parameter of the scheduler accepts a list, so Volcano can also pick up pods that use other scheduler names. However, I don't recommend doing this, because with two schedulers running in the cluster at the same time, their internal caches differ and their scheduling decisions may conflict. Volcano supports scheduling Deployments/StatefulSets/DaemonSets/Jobs, etc., and it fully supports the native filtering and scoring of kube-scheduler: https://volcano.sh/en/docs/unified_scheduling/, so you can schedule training and inference jobs together with Volcano. By the way, are you saying that your inference service is scheduled by kube-scheduler? Is it deployed as a Deployment, and have these inference services already occupied all of the GPU resources in the cluster?
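
For readers who want to try this, a rough sketch of the two knobs involved, assuming a standard Helm install (the ConfigMap name, namespace, and container args may differ in your deployment). Preemption in Volcano is driven by the preempt action (and reclaim, for cross-queue eviction) together with the priority plugin, so those need to be active in volcano-scheduler.conf:

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    # add "preempt" (and "reclaim" if eviction across queues is needed) to the default action list
    actions: "enqueue, allocate, preempt, backfill"
    tiers:
    - plugins:
      - name: priority      # compares PriorityClass values when choosing victims
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

The multi-scheduler-name setup mentioned above would then look roughly like this on the vc-scheduler container args; this is a sketch of the idea rather than a recommended setup, given the cache-conflict caveat:

    args:
    - --scheduler-name=volcano
    - --scheduler-name=default-scheduler   # also consider pods created with the default scheduler name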
