
[Bug] GPU optimizer bug fix and document fix #656

Merged: 82 commits into main from jingyuan/gpu_optimizer on Feb 14, 2025

Conversation

zhangjyr
Collaborator

Pull Request Description

This PR fixes:

  1. When the GPU profiles used by the GPU optimizer do not cover all input/output patterns in the workload, out-of-scope patterns can overlap with an in-scope pattern during profile matching and cause unexpected data loss. This PR aggregates the statistics of overlapping patterns instead (see the sketch after this list).

  2. The sample configuration in the GPU optimizer document contains an error: a duplicate "model.aibrix.ai/min_replicas" label in the L20 deployments. The document also states that the heterogeneous sample includes L20 and A100 GPUs; it should be L20 and V100.
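The gist of fix 1, as a minimal sketch. The names here (match_signature, PatternStats, aggregate_stats, and the workload tuples) are hypothetical illustrations of the idea, not the actual aibrix code:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class PatternStats:
    requests: int = 0
    total_tokens: int = 0

def match_signature(pattern: float, profiled: list) -> float:
    # Map a workload pattern to the closest profiled signature; an
    # out-of-scope pattern can collide with an in-scope one here.
    return min(profiled, key=lambda s: abs(s - pattern))

def aggregate_stats(workload, profiled):
    stats = defaultdict(PatternStats)
    for pattern, requests, tokens in workload:
        sig = match_signature(pattern, profiled)
        # Before the fix, a collision overwrote the earlier match's
        # statistics; accumulating instead preserves both patterns' data.
        stats[sig].requests += requests
        stats[sig].total_tokens += tokens
    return stats

# Two patterns (1.9 and 2.4) both match signature 2.0; their stats add up.
print(aggregate_stats([(1.9, 10, 500), (2.4, 5, 200)], [2.0, 8.0]))
```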

This PR also uses 100 as the TargetValue for the PodAutoscaler to scale against, replacing the previous value of 1. This lets users customize the scaling threshold the way other PodAutoscaler metrics do. For example, a PodAutoscaler driven by KV cache utilization can set the threshold to 50 so that it scales up once KV cache utilization surpasses 50; setting TargetValue to 50 now achieves the same effect here.

Related Issues

Resolves: #[Insert issue number(s)]

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Review thread on the changed line: targetValue: "100" # For stable workloads. Set to a fraction to tolerate bursts.
Collaborator

It used to be a static value. Now you mean it can be the same as other autoscalers? I didn't get the KV cache example idea. Could you elaborate a little bit more?

Collaborator Author

In the KV cache example, the targetValue is set to 50, meaning that once the average KV cache utilization surpasses 50%, scale-out is triggered.
The targetValue for the GPU optimizer's output metric can now behave similarly. What used to be output as 1 GPU is now output as a value between 1 and 100. If the targetValue is 70 and the GPU optimizer outputs 80, the PodAutoscaler will scale to 2 pods instead of the previous 1 pod. This gives users the freedom to reserve some buffer pods for short bursts.
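A worked version of the numbers above, as a minimal sketch; the ceiling rule is the standard HPA-style desired-replicas formula, assumed here to apply to this metric as well:

```python
import math

def desired_replicas(metric_value: float, target_value: float) -> int:
    # HPA-style rule: scale so that metric_value / replicas <= target_value.
    return math.ceil(metric_value / target_value)

print(desired_replicas(80, 100))  # 1 pod: 80 stays under the 100 threshold
print(desired_replicas(80, 70))   # 2 pods: 80 exceeds 70, so a buffer pod is added
```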

Collaborator

Hmm, there are two questions:

  1. In that case, shouldn't targetValue be a more meaningful number than "100"? A value of 100 means it has to be completely full before scaling out, right?

  2. targetValue is paired with targetMetric, and currently the key is vllm:deployment_replicas; if the value ranges over 1-100, this is kind of confusing.

Collaborator Author

I think there is some misunderstanding. A previous value of "1" is now output as 1-100, while a value of "2" is now output as 101-200. TargetValue: "100" still makes sense for a stable workload, but for real production workloads, "50" or "80" may be more reasonable.
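Putting the two comments together, the metric appears to encode fractional GPU demand scaled by 100. A sketch of that reading, where the ×100 encoding is an assumption inferred from the ranges quoted above, not confirmed code:

```python
import math

def optimizer_metric(gpu_demand: float) -> float:
    # Assumed encoding: a demand of 1 GPU maps into 1-100 (e.g. 0.85 -> 85),
    # a demand of up to 2 GPUs maps into 101-200 (e.g. 1.5 -> 150).
    return gpu_demand * 100

for demand, target in [(0.85, 100), (0.85, 80), (1.5, 100)]:
    pods = math.ceil(optimizer_metric(demand) / target)
    print(f"demand={demand} GPUs, targetValue={target} -> {pods} pod(s)")
# 0.85 GPUs with target 100 -> 1 pod; with target 80 -> 2 pods (buffer);
# 1.5 GPUs with target 100 -> 2 pods.
```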

Collaborator Author

The metric is like CPU, not KV cache utilization.

Collaborator

We had an offline discussion; we can merge this one first. Our short-term goal is to have clear documentation since this will be public soon. You can also consider using another annotation etc. to make it

@zhangjyr zhangjyr requested a review from Jeffwan February 13, 2025 18:30
@Jeffwan Jeffwan merged commit bbb148c into main Feb 14, 2025
10 checks passed
@Jeffwan Jeffwan deleted the jingyuan/gpu_optimizer branch February 14, 2025 01:39
varungup90 pushed a commit that referenced this pull request Feb 20, 2025
* Bug fix

* Fix configuration for domain podautoscaler
Add test case verifying that the URL created from metricSource is as expected: the endpoint should include the port; if it does not and a port is specified, the port will be appended to the endpoint.

* Lint fix

* Add license for new files.

* Lint fix on added unit test.

* Add authorization support

* Support parameterized benchmark

* Remove next_in parameter

* Bug fix

* Fix typo

* Bug fix

* Apply stream parameter

* Cleaning up responses.

* Bug fix

* If an error is not reported as a temporary error, we will not retry.

* GPU profile now supports TPAT (time per all tokens)
Fix an error in the benchmark that could occur when token_latencies is missing some data.

* Debug optimizer

* bird prompt dataset generation

* update benchmark to support prompt dataset loading

* Benchmark now supports the workload parameter

* Bug fix

* Log control

* Improve stability and lint fix.

* Bug fix

* switch logs for gpu-optimizer to json format

* added BIRD dataset with Azure timestamp script

* add BIRD burst pattern workload generation

* Visualizer now supports workload file

* Print out workload input

* Bug fix

* lint fix

* remove timestamp offset

* Bug fix: calling _parse_profiles without the out_records parameter will not accumulate return values.

* Using the current timestamp to load the profile may be too early; revert to using an interval ago.

* Use the larger of the average request rate in the window and the current request rate to ensure sufficient resources.

* Tuning up request rate temporarily.

* Bug fix
Fix request rate to 8 temporarily

* Remove fixed rate

* changing load profile back

* Provide compatibility with v3 gateway profiles.

* Adjust development config

* Add config for gateway-plugin development

* delayed scale in deployment added

* Add trace to benchmark

* rollback to old version without delayed scale in

* Disregard pending requests for now.

* Bug fix

* Bug fix

* Adapt to latest profile about pending requests and update unittest.

* Output correct timestamp

* Output pending and total requests from load reader

* Ignore pending for now.

* Add throughput filter.

* bug and lint fix

* Fix a bug that occurs when mat_tputs are 0

* Lint fix

* fix benchmark on count num_requests

* Optimizer can now adopt deployment changes made using "kubectl apply"

* Add comments

* bug fix

* Make signature prefer the higher index when choosing profiles.

* Bug fix, watch ScalingReplicaSet for label changes

* Bug fix

* Change back SLO preference.
Optimize update logic.

* Refine gpu optimizer document and apply more generic default parameters.

* Update document to use production vllm configuration example
Fix benchmark and gen_profile to work inside python module.

* Add samples/heterogenous

* Clean up

* Modify load reader to support the latest workload
Fix a potential bug where, in corner cases, out-of-profile patterns are mapped to the closest profiled patterns, causing possible data loss.

* Fix doc and example

* Use 100 instead of 1 as the scale fraction.

* remove unnecessary samples

* Lint fix

---------

Signed-off-by: Jingyuan <[email protected]>
Co-authored-by: Jingyuan Zhang <[email protected]>
Co-authored-by: Ning Wang <[email protected]>
Signed-off-by: Varun Gupta <[email protected]>