[Bug] GPU optimizer bug fix and document fix #656
Merged
82 commits
5cbd968
Bug fix
11be286
Merge commit '0d40fbd19ba01daf1aa6267515814c18f19aaa09' into jingyuan…
7adb1b7
Fix configuration for domain podautoscaler
c159726
Lint fix
f9f1d99
Add license for new files.
6f717c5
Lint fix on added unit test.
5c5225b
Add authorization support
6a7584d
Support parameterized benchmark
bd46cc3
Remove next_in parameter
401be9f
Bug fix
f39e4b4
Fix typo
22e7db5
Bug fix
4296ce1
Apply stream parameter
2c40b7c
Cleaning up responses.
17a0798
Bug fix
4df3b76
If an error is not reported as a temporary error, we will not retry.
ee494b7
GPU profile now supports TPAT (time per all tokens)
36cbf87
Debug optimizer
d59f12a
bird prompt dataset generation
nwangfw deee544
update benchmark to support prompt dataset loading
nwangfw 7da1be8
Benchmark now supports workload parameter
32e2ba9
Bug fix
3d3e929
Log control
cf47cab
Improve stability and lint fix.
31b4b0e
Bug fix
28f2521
switch logs for gpu-optimizer to json format
2b672cf
added BIRD dataset with Aruze timestamp script
nwangfw e2f58d8
add BIRD burst pattern workload generation
nwangfw 7c2d455
Visualizer now supports workload file
ab20a58
Print out workload input
ccd5a40
Bug fix
b5b14dc
lint fix
132260d
remove timestamp offset
3018bb4
Bug fix: call _parse_profiles without parameter out_records will not …
28b69eb
Use current ts to load profile may to early, revert to use an interva…
0b28588
Use the larger of average request rate in window and current request …
6c26386
Tuning up request rate temporarily.
ad7fb37
Bug fix
b014b23
Remove fixed rate
2908082
changing load profile back
nwangfw 56feb2f
Merge branch 'main' into jingyuan/gpu_optimizer
beb6de4
Provide compatibility to v3 gateway profiles.
762a506
Adjust development config
ba20697
Add config for gateway-plugin development
a9007b0
delayed scale in deployment added
nwangfw fa4fbcf
Add trace to benchmark
a141cd6
rollback to old version without delayed scale in
nwangfw 9d514e4
Merge branches 'jingyuan/gpu_optimizer' and 'jingyuan/gpu_optimizer' …
152915f
Disregard pending requests for now.
f13e0e0
Bug fix
444ca30
Bug fix
3df5e82
Adapt to latest profile about pending requests and update unittest.
81cfb63
Output correct timestamp
cc2ac94
Output pending and total requests from load reader
4f2b2ee
Ignore pending for now.
169047c
Add throughput filter.
cd08af9
bug and lint fix
5dac822
Fix a bug when mat_tputs are 0
1d3f44d
Lint fix
6edb872
fix benchmark on count num_requests
8e9d9b0
Optimizer now can adopt deployment changes using "kubectl apply"
9fda28a
Add comments
921a9fe
bug fix
ff22ab9
Make signature prefer higher index on choose profiles.
186fcdc
Bug fix, watch ScalingReplicaSet for label changes
ae27970
Bug fix
ca0f020
Change back SLO preference.
e697adc
Merge branch 'main' into jingyuan/gpu_optimizer
e50a343
Merge branch 'main' into jingyuan/gpu_optimizer
zhangjyr 1911a73
Merge branch 'main' into jingyuan/gpu_optimizer
6ab316d
Merge branch 'main' into jingyuan/gpu_optimizer
737dd12
Refine gpu optimizer document and apply more generic default parameters.
895eccc
Update document to use production vllm configuration example
ca48d52
Add samples/heterogenous
a63d28a
Clean up
524f00d
Modify load reader to support latest workload
3a8b31c
Merge commit '1d3473a418a044c788c70fcfcf9d79f732893381' into jingyuan…
d2f45dc
Fix doc and example
f0088e1
Use 100 instead of 1 as scale fraction.
a2f9b80
remove unnecessary samples
6d04c8f
Lint fix
8bea71d
Merge branch 'main' into jingyuan/gpu_optimizer
zhangjyr
New file (+32 lines):

```yaml
kind: Kustomization

resources:
- deepseek-coder-7b-service.yaml
- deepseek-coder-7b-l20-deployment.yaml
- deepseek-coder-7b-l20-podautoscaler.yaml
- deepseek-coder-7b-v100-deployment.yaml
- deepseek-coder-7b-v100-podautoscaler.yaml

patches:
- patch: |- # Use the '|' and '-' for inline patching, warm up 10 hosts and start with 7
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: deepseek-coder-7b-v100
      labels:
        model.aibrix.ai/min_replicas: "1"
  target:
    kind: Deployment
    name: deepseek-coder-7b-v100
- patch: |- # Use the '|' and '-' for inline patching, warm up 10 hosts and start with 7
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: deepseek-coder-7b-l20
      labels:
        model.aibrix.ai/min_replicas: "0"
  target:
    kind: Deployment
    name: deepseek-coder-7b-l20

apiVersion: kustomize.config.k8s.io/v1beta1
```
It used to be a static value. Now you mean it can be the same as other autoscalers? I didn't get the KV cache example idea. Could you elaborate a little bit more?
In the KV cache example, the targetValue is set to 50, meaning that once the average KV cache utilization surpasses 50%, scaling out is triggered.
The targetValue for the GPU optimizer output metric can now behave similarly. What was previously output as 1 GPU is now output as a value between 1 and 100. If the targetValue is 70 and the GPU optimizer outputs 80, the PodAutoscaler will scale to 2 pods instead of the previous 1 pod. This gives users the freedom to reserve some buffer pods for short bursts.
Hmm, there are two questions:
1. In that case, should `targetValue: "100"` be a more meaningful number rather than 100? 100 means it has to be full to be scaled out, right?
2. `targetValue` is paired with `targetMetric`, and currently the key is `vllm:deployment_replicas`. If the value is 1-100, this is kind of confusing.
I think there is some misunderstanding. The previous value "1" is now output as 1-100, while a value of "2" is now output as 101-200. targetValue: "100" still makes sense for a stable workload, but for real production workloads "50" or "80" may be more reasonable.
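The scaling semantics discussed above can be sketched with the usual HPA-style rule, desired replicas = ceil(metric / target). This is a minimal illustration of the numbers in this thread, not code from the AIBrix repository; the function name is hypothetical.

```python
import math

def desired_replicas(metric_value: float, target_value: float) -> int:
    # HPA-style rule: desired pods = ceil(metric / target), at least 1.
    return max(1, math.ceil(metric_value / target_value))

# Optimizer outputs 80 with targetValue 70 -> 2 pods (buffer for bursts).
print(desired_replicas(80, 70))    # 2
# With targetValue 100, the same output still fits in 1 pod.
print(desired_replicas(80, 100))   # 1
# 150 (i.e. "between 1 and 2 GPUs" in the new 1-100-per-GPU scale) -> 2 pods.
print(desired_replicas(150, 100))  # 2
```

Under this reading, lowering targetValue below 100 simply reserves headroom: the autoscaler adds a pod before the optimizer's recommendation is fully saturated.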
The metric is like CPU, not KV cache utilization.
We had an offline discussion; we can merge this one first. Our short-term goal is to have clear documentation, since this will be public soon. You can also consider using other annotations etc. to make it