[Bug] GPU optimizer bug fix and document fix #656

Merged: 82 commits merged on Feb 14, 2025

Commits
5cbd968
Bug fix
Dec 10, 2024
11be286
Merge commit '0d40fbd19ba01daf1aa6267515814c18f19aaa09' into jingyuan…
Dec 11, 2024
7adb1b7
Fix configuration for domain podautoscaler
Dec 11, 2024
c159726
Lint fix
Dec 11, 2024
f9f1d99
Add license for new files.
Dec 11, 2024
6f717c5
Lint fix on added unit test.
Dec 11, 2024
5c5225b
Add authorization support
Dec 13, 2024
6a7584d
Support parameterized benchmark
Dec 15, 2024
bd46cc3
Remove next_in parameter
Dec 15, 2024
401be9f
Bug fix
Dec 15, 2024
f39e4b4
Fix typo
Dec 15, 2024
22e7db5
Bug fix
Dec 15, 2024
4296ce1
Apply stream parameter
Dec 16, 2024
2c40b7c
Cleaning up responses.
Dec 16, 2024
17a0798
Bug fix
Dec 16, 2024
4df3b76
If an error is not reported as a temporary error, we will not retry.
Dec 16, 2024
ee494b7
GPU profile now supports TPAT (time per all tokens)
Dec 18, 2024
36cbf87
Debug optimizer
Dec 20, 2024
d59f12a
bird prompt dataset generation
nwangfw Dec 20, 2024
deee544
update benchmark to support prompt dataset loading
nwangfw Dec 20, 2024
7da1be8
Benchmark now supports workload parameter
Dec 20, 2024
32e2ba9
Bug fix
Dec 21, 2024
3d3e929
Log control
Dec 21, 2024
cf47cab
Improve stability and lint fix.
Dec 21, 2024
31b4b0e
Bug fix
Dec 21, 2024
28f2521
switch logs for gpu-optimizer to json format
Dec 23, 2024
2b672cf
added BIRD dataset with Azure timestamp script
nwangfw Dec 23, 2024
e2f58d8
add BIRD burst pattern workload generation
nwangfw Dec 27, 2024
7c2d455
Visualizer now supports workload file
Dec 30, 2024
ab20a58
Print out workload input
Dec 31, 2024
ccd5a40
Bug fix
Dec 31, 2024
b5b14dc
lint fix
Jan 1, 2025
132260d
remove timestamp offset
Jan 1, 2025
3018bb4
Bug fix: call _parse_profiles without parameter out_records will not …
Jan 1, 2025
28b69eb
Using current ts to load profile may be too early, revert to use an interva…
Jan 1, 2025
0b28588
Use the larger of average request rate in window and current request …
Jan 2, 2025
6c26386
Tuning up request rate temporarily.
Jan 2, 2025
ad7fb37
Bug fix
Jan 2, 2025
b014b23
Remove fixed rate
Jan 2, 2025
2908082
changing load profile back
nwangfw Jan 7, 2025
56feb2f
Merge branch 'main' into jingyuan/gpu_optimizer
Jan 8, 2025
beb6de4
Provide compatibility to v3 gateway profiles.
Jan 8, 2025
762a506
Adjust development config
Jan 8, 2025
ba20697
Add config for gateway-plugin development
Jan 8, 2025
a9007b0
delayed scale in deployment added
nwangfw Jan 9, 2025
fa4fbcf
Add trace to benchmark
Jan 9, 2025
a141cd6
rollback to old version without delayed scale in
nwangfw Jan 9, 2025
9d514e4
Merge branches 'jingyuan/gpu_optimizer' and 'jingyuan/gpu_optimizer' …
Jan 9, 2025
152915f
Disregard pending requests for now.
Jan 9, 2025
f13e0e0
Bug fix
Jan 9, 2025
444ca30
Bug fix
Jan 9, 2025
3df5e82
Adapt to latest profile about pending requests and update unittest.
Jan 11, 2025
81cfb63
Output correct timestamp
Jan 13, 2025
cc2ac94
Output pending and total requests from load reader
Jan 14, 2025
4f2b2ee
Ignore pending for now.
Jan 14, 2025
169047c
Add throughput filter.
Jan 15, 2025
cd08af9
bug and lint fix
Jan 15, 2025
5dac822
Fix a bug that when mat_tputs are 0
Jan 15, 2025
1d3f44d
Lint fix
Jan 15, 2025
6edb872
fix benchmark on count num_requests
Jan 16, 2025
8e9d9b0
Optimizer now can adopt deployment changes using "kubectl apply"
Jan 22, 2025
9fda28a
Add comments
Jan 22, 2025
921a9fe
bug fix
Jan 22, 2025
ff22ab9
Make signature prefer higher index when choosing profiles.
Jan 22, 2025
186fcdc
Bug fix, watch ScalingReplicaSet for label changes
Jan 23, 2025
ae27970
Bug fix
Jan 23, 2025
ca0f020
Change back SLO preference.
Jan 23, 2025
e697adc
Merge branch 'main' into jingyuan/gpu_optimizer
Jan 24, 2025
e50a343
Merge branch 'main' into jingyuan/gpu_optimizer
zhangjyr Jan 27, 2025
1911a73
Merge branch 'main' into jingyuan/gpu_optimizer
Jan 28, 2025
6ab316d
Merge branch 'main' into jingyuan/gpu_optimizer
Jan 31, 2025
737dd12
Refine gpu optimizer document and apply more generic default parameters.
Jan 31, 2025
895eccc
Update document to use production vllm configuration example
Jan 31, 2025
ca48d52
Add samples/heterogenous
Jan 31, 2025
a63d28a
Clean up
Jan 31, 2025
524f00d
Modify load reader to support latest workload
Feb 11, 2025
3a8b31c
Merge commit '1d3473a418a044c788c70fcfcf9d79f732893381' into jingyuan…
Feb 12, 2025
d2f45dc
Fix doc and example
Feb 12, 2025
f0088e1
Use 100 instead 1 as scale fraction.
Feb 12, 2025
a2f9b80
remove unnecessary samples
Feb 12, 2025
6d04c8f
Lint fix
Feb 12, 2025
8bea71d
Merge branch 'main' into jingyuan/gpu_optimizer
zhangjyr Feb 13, 2025
@@ -16,4 +16,4 @@ spec:
endpoint: aibrix-gpu-optimizer.aibrix-system.svc.cluster.local:8080
path: /metrics/default/simulator-llama2-7b-a40
targetMetric: "vllm:deployment_replicas"
targetValue: "1"
targetValue: "100" # For stable workloads. Set to a fraction to tolerate bursts.
Collaborator:

It used to be a static value. Now you mean it can be the same as other autoscalers? I didn't get the KV cache example idea. Could you elaborate a little bit more?

Collaborator (Author):

In the KV cache example, the targetValue is set to 50, meaning that once the average KV cache utilization surpasses 50%, scaling out is triggered.
The targetValue for the GPU optimizer's output metric now behaves similarly. What used to be reported as 1 GPU is now output as a value between 1 and 100. If the targetValue is 70 and the GPU optimizer outputs 80, the PodAutoscaler will scale to 2 pods instead of the previous 1 pod. This gives users the freedom to reserve some buffer pods for short bursts.

Collaborator:

Hmm, there are two questions:

  1. In that case, shouldn't targetValue be a more meaningful number rather than "100"? 100 means it has to be completely full before scaling out, right?

  2. targetValue is paired with targetMetric, and the key is currently vllm:deployment_replicas; if the value ranges from 1-100, this is kind of confusing.

Collaborator (Author):

I think there is some misunderstanding. What was previously reported as "1" is now output as 1-100, while a value of "2" is now output as 101-200. targetValue: "100" still makes sense for a stable workload, but for a real production workload, "50" or "80" may be more reasonable.
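A minimal sketch of the scaling arithmetic described above (assuming the PodAutoscaler computes desired replicas roughly as `ceil(metric / targetValue)`; the helper name and values are hypothetical):

```python
import math

def desired_replicas(optimizer_metric: float, target_value: float) -> int:
    # The GPU optimizer reports ~100 units per recommended replica
    # (1 replica -> 1-100, 2 replicas -> 101-200), so a targetValue
    # below 100 reserves headroom for short bursts.
    return math.ceil(optimizer_metric / target_value)

print(desired_replicas(80, 100))  # 1 pod: no buffer reserved
print(desired_replicas(80, 70))   # 2 pods: the example from the comment above
```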

Collaborator (Author):

The metric is like CPU, not KV cache utilization.

Collaborator:

We had an offline discussion; we can merge this one first. Our short-term goal is to have clear documentation since this will be public soon. You can also consider using other annotations, etc., to make it clearer.

@@ -16,4 +16,4 @@ spec:
endpoint: aibrix-gpu-optimizer.aibrix-system.svc.cluster.local:8080
path: /metrics/default/simulator-llama2-7b-a100
targetMetric: "vllm:deployment_replicas"
targetValue: "1"
targetValue: "100" # For stable workloads. Set to a fraction to tolerate bursts.
18 changes: 9 additions & 9 deletions docs/source/features/heterogeneous-gpu.rst
@@ -24,7 +24,7 @@ Step 1: Deploy the heterogeneous deployments.

One deployment and corresponding PodAutoscaler should be deployed for each GPU type.
See `sample heterogeneous configuration <https://github.com/aibrix/aibrix/tree/main/samples/heterogeneous>`_ for an example of heterogeneous configuration composed of two GPU types. The following codes
deploy heterogeneous deployments using L20 and A10 GPU.
deploy heterogeneous deployments using L20 and V100 GPU.

.. code-block:: bash

@@ -45,9 +45,10 @@ Incoming requests are routed through the gateway and directed to the optimal pod

kubectl get pods
NAME READY STATUS RESTARTS AGE
deepseek-coder-7b-a10-96667667c-6gjql 2/2 Running 0 33s
deepseek-coder-7b-v100-96667667c-6gjql 2/2 Running 0 33s
deepseek-coder-7b-l20-96667667c-7zj7k 2/2 Running 0 33s

Step 2: Install aibrix python module:

Step 2: Install aibrix python module:

@@ -74,32 +75,31 @@ Step 4: Decide SLO and generate profile, run `aibrix_gen_profile -h` for help.

kubectl -n aibrix-system port-forward svc/aibrix-redis-master 6379:6379 1>/dev/null 2>&1 &
# Wait for port-forward taking effect.
aibrix_gen_profile deepseek-coder-7b-a10 --cost [cost1] [SLO-metric] [SLO-value] -o "redis://localhost:6379/?model=deepseek-coder-7b"
aibrix_gen_profile deepseek-coder-7b-v100 --cost [cost1] [SLO-metric] [SLO-value] -o "redis://localhost:6379/?model=deepseek-coder-7b"
aibrix_gen_profile deepseek-coder-7b-l20 --cost [cost2] [SLO-metric] [SLO-value] -o "redis://localhost:6379/?model=deepseek-coder-7b"

Now the GPU Optimizer is ready to work. You should observe that the number of workload pods changes in response to the requests sent to the gateway. Once the GPU optimizer finishes the scaling optimization, the output of the GPU optimizer is passed to PodAutoscaler as a metricSource via a designated HTTP endpoint for the final scaling decision. The following is an example of PodAutoscaler spec.

A simple example of PodAutoscaler spec for a10 GPU is as follows:
A simple example of PodAutoscaler spec for v100 GPU is as follows:

.. literalinclude:: ../../../samples/heterogeneous/deepseek-coder-7b-l20-podautoscaler.yaml
.. literalinclude:: ../../../samples/heterogeneous/deepseek-coder-7b-v100-podautoscaler.yaml
:language: yaml


Miscellaneous
-------------

A new label label ``model.aibrix.ai/min_replicas`` is added to specifies the minimum number of replicas to maintain when there is no workload. We recommend setting this to 1 for at least one Deployment spec to ensure there is always one READY pod available. For example, while the GPU optimizer might recommend 0 replicas for an a10 GPU during periods of no activity, setting ``model.aibrix.ai/min_replicas: "1"`` will maintain one a10 replica. This label only affects the system when there is no workload - it is ignored when there are active requests.
A new label ``model.aibrix.ai/min_replicas`` is added to specify the minimum number of replicas to maintain when there is no workload. We recommend setting this to 1 for at least one Deployment spec to ensure there is always one READY pod available. For example, while the GPU optimizer might recommend 0 replicas for a v100 GPU during periods of no activity, setting ``model.aibrix.ai/min_replicas: "1"`` will maintain one v100 replica. This label only affects the system when there is no workload - it is ignored when there are active requests.

.. code-block:: yaml

apiVersion: apps/v1
kind: Deployment
metadata:
name: deepseek-coder-7b-a10
name: deepseek-coder-7b-v100
labels:
model.aibrix.ai/name: "deepseek-coder-7b"
model.aibrix.ai/min_replicas: "1" # min replica for gpu optimizer when no workloads.
... rest yaml deployments

Important: The ``minReplicas`` field in the PodAutoscaler spec must be set to 0 to allow proper scaling behavior. Setting it to any value greater than 0 will interfere with the GPU optimizer's scaling decisions. For instance, if the GPU optimizer determines an optimal configuration of ``{a10: 0, l20: 4}`` but the a10 PodAutoscaler has ``minReplicas: 1``, the system won't be able to scale the a10 down to 0 as recommended.
Important: The ``minReplicas`` field in the PodAutoscaler spec must be set to 0 to allow proper scaling behavior. Setting it to any value greater than 0 will interfere with the GPU optimizer's scaling decisions. For instance, if the GPU optimizer determines an optimal configuration of ``{v100: 0, l20: 4}`` but the v100 PodAutoscaler has ``minReplicas: 1``, the system won't be able to scale the v100 down to 0 as recommended.

12 changes: 11 additions & 1 deletion python/aibrix/aibrix/gpu_optimizer/load_monitor/load_reader.py
@@ -14,6 +14,7 @@

import json
import logging
import math
import re
from datetime import datetime
from typing import Any, List, Optional, Protocol, Tuple, Union
@@ -147,7 +148,12 @@ class WorkloadReader:

def __init__(self, filepath, scale: float = 1.0, interval: int = 10) -> None:
if filepath != unittest_filepath:
self.df = pd.read_json(filepath)
try:
self.df = pd.read_json(filepath)
except Exception:
self.df = pd.read_json(filepath, lines=True)
self.df["Timestamp"] = self.df["timestamp"]
self.df["Requests"] = self.df["requests"]

self.scale = scale
self.interval = interval
@@ -180,6 +186,10 @@ def read(self, ts: float = 0.0) -> Tuple[List[LoadRecord], float]:
self.log2_aggregate(self.tick_df["Prompt Length"] * self.scale, 1),
self.log2_aggregate(self.tick_df["Output Length"] * self.scale, 1),
):
# Unlikely, just in case.
if math.isinf(output_tokens) or math.isinf(input_tokens):
continue

records.append(
LoadRecord(
(self.tick - self.start),
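For readers following the diff, a minimal standalone sketch of the fallback parsing added above (assuming a JSON-lines workload file whose records use lowercase `timestamp`/`requests` keys, as the column rename suggests):

```python
import pandas as pd

def load_workload(filepath: str) -> pd.DataFrame:
    # Try the original JSON-array format first, then fall back to
    # JSON-lines and normalize the lowercase column names.
    try:
        df = pd.read_json(filepath)
    except Exception:
        df = pd.read_json(filepath, lines=True)
        df["Timestamp"] = df["timestamp"]
        df["Requests"] = df["requests"]
    return df
```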
6 changes: 5 additions & 1 deletion python/aibrix/aibrix/gpu_optimizer/load_monitor/monitor.py
@@ -106,6 +106,7 @@ def __init__(
deployment: Optional[DeploymentStates] = None,
namespace: Optional[str] = None,
profile_reader: Optional[ProfileReader] = None,
gpu_fraction: float = 100.0,
debug: bool = False,
):
"""Initialize the model monitor.
@@ -119,6 +120,7 @@ def __init__(
replicas: (optional) The initial number of replicas for the model deployment.
interval: (optional) The interval (in seconds) at which to monitor the model. Defaults to 10 seconds.
window: (optional) The window (in seconds) to consider for clustering. Defaults to 240 seconds.
gpu_fraction: (optional) The number of fractional units one GPU is counted as. Defaults to 100.
debug: (optional) Whether to enable debugging behavior. Defaults to False.
"""
self.model_name = model_name
@@ -129,6 +131,7 @@ def __init__(
self.debug = debug
self.done = False
self.window = float(window)
self.gpu_fraction = gpu_fraction
self._lock = threading.Lock()

# Load reader
@@ -139,7 +142,7 @@ def __init__(

# Optimizer
self._profiles: Dict[str, GPUProfile] = {}
self._optimizer = Optimizer()
self._optimizer = Optimizer(self.gpu_fraction)

# Monitor states
self._centers: Iterable[Centeroid] = Empty_Array
@@ -276,6 +279,7 @@ def load_profiles(self, profile_reader: Optional[ProfileReader] = None) -> bool:

profiles = profile_reader.read()
for profile in profiles:
profile.cost /= self.gpu_fraction
if self._update_profile(profile):
logger.debug(f"Profile of {profile.gpu} updated.")

10 changes: 7 additions & 3 deletions python/aibrix/aibrix/gpu_optimizer/optimizer/optimizer.py
@@ -27,11 +27,14 @@


class Optimizer:
def __init__(self, profiles: Optional[Iterable[GPUProfile]] = None):
def __init__(
self, gpu_fraction: float, profiles: Optional[Iterable[GPUProfile]] = None
):
self._config = MelangConfig()
self._workload_distribution_template: Optional[np.ndarray] = None
self._indexes: Optional[list] = None # Values ticks of tputs columns and rows
self._log_indexes: Optional[list] = None # Cache the log2 value of index
self._gpu_fraction = gpu_fraction
if profiles is not None:
for profile in profiles:
self.set_profile(profile)
@@ -73,7 +76,7 @@ def set_workload_distribution(
self._workload_distribution_template.fill(0)

# Maintain the overall request scale disregard some request are not covered.
self._config.total_request_rate = total_request_rate
self._config.total_request_rate = total_request_rate * self._gpu_fraction
# covered_request_rate is used to calculate the workload distribution.
covered_request_rate = reduce(
lambda cnt, center: cnt + center.rate, profiles, 0.0
@@ -82,7 +85,8 @@
for profile in profiles:
try:
signature = self._validate_workload_signature(profile)
self._workload_distribution_template[signature] = (
# Merge possibly multiple patterns (out-of-range patterns coincident with border patterns)
self._workload_distribution_template[signature] += (
profile.rate / covered_request_rate
) # type: ignore
logger.debug(
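A small illustration of why the `+=` above matters (signatures and rates are hypothetical): two profiles whose out-of-range patterns clamp to the same border signature must have their shares added rather than overwritten:

```python
import numpy as np

template = np.zeros((2, 2))      # hypothetical signature grid
covered_request_rate = 10.0

# Two profiles that end up with the same (clamped) signature.
profiles = [((1, 1), 6.0), ((1, 1), 4.0)]  # (signature, rate)

for signature, rate in profiles:
    template[signature] += rate / covered_request_rate

print(template[1, 1])  # 1.0; with '=', the second share (0.4) would replace the first
```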
@@ -3,7 +3,6 @@ kind: Deployment
metadata:
labels:
adapter.model.aibrix.ai/enabled: "true"
model.aibrix.ai/min_replicas: "4"
model.aibrix.ai/name: deepseek-coder-7b
model.aibrix.ai/port: "8000"
model.aibrix.ai/min_replicas: "1" # min replicas when there is no workload.
@@ -15,8 +15,8 @@ spec:
path: /metrics/default/deepseek-coder-7b-l20
protocolType: http
targetMetric: vllm:deployment_replicas
targetValue: "1"
minReplicas: 1
targetValue: "100" # For stable workloads. Set to a fraction to tolerate bursts.
minReplicas: 0
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
@@ -15,7 +15,7 @@ spec:
path: /metrics/default/deepseek-coder-7b-v100
protocolType: http
targetMetric: vllm:deployment_replicas
targetValue: "1"
targetValue: "100" # For stable workloads. Set to a fraction to tolerate bursts.
minReplicas: 0
scaleTargetRef:
apiVersion: apps/v1
32 changes: 32 additions & 0 deletions samples/heterogeneous/kustomization.yaml
@@ -0,0 +1,32 @@
kind: Kustomization

resources:
- deepseek-coder-7b-service.yaml
- deepseek-coder-7b-l20-deployment.yaml
- deepseek-coder-7b-l20-podautoscaler.yaml
- deepseek-coder-7b-v100-deployment.yaml
- deepseek-coder-7b-v100-podautoscaler.yaml

patches:
- patch: |- # Inline patch: keep at least one v100 replica when there is no workload
apiVersion: apps/v1
kind: Deployment
metadata:
name: deepseek-coder-7b-v100
labels:
model.aibrix.ai/min_replicas: "1"
target:
kind: Deployment
name: deepseek-coder-7b-v100
- patch: |- # Inline patch: allow l20 to scale to zero when there is no workload
apiVersion: apps/v1
kind: Deployment
metadata:
name: deepseek-coder-7b-l20
labels:
model.aibrix.ai/min_replicas: "0"
target:
kind: Deployment
name: deepseek-coder-7b-l20

apiVersion: kustomize.config.k8s.io/v1beta1
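
As a usage sketch (assuming the sample layout above), the whole heterogeneous sample and its patches can be applied together with `kubectl apply -k samples/heterogeneous`, after which the `model.aibrix.ai/min_replicas` labels on the two Deployments should reflect the patched values.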