Offline inference benchmark #50
base: main
Conversation
ssraghavan-oci commented on Apr 22, 2025
- Docs for offline inference using vLLM and SGLang
- Includes a sample config file used for deployment (a rough sketch of such a config is shown below)
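For context, here is a rough sketch of the kind of fields such an offline benchmark config might carry. Only `benchmark_type: offline` is taken from the diff in this PR; every other field name and value is an illustrative assumption, not the actual schema shipped with the blueprint.

```yaml
# Hypothetical offline benchmark config sketch.
# Only benchmark_type comes from the PR diff; all other fields are
# illustrative assumptions, not the schema shipped with this blueprint.
benchmark_type: offline
engine: sglang                             # assumed; vllm would be the other engine discussed
model: meta-llama/Llama-3.1-8B-Instruct    # assumed model identifier
input_len: 1024                            # assumed prompt length in tokens
output_len: 256                            # assumed generation length in tokens
num_prompts: 1000                          # assumed number of requests in the offline batch
output_dir: /mlcommons_output              # matches the mount_location discussed later in this PR
```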
## Sample Config File (`example_sglang.yaml`)

```yaml
benchmark_type: offline
```
I think this YAML / config approach is a great v0 for this. However, I do think we should add a field to the recipe json that accepts these parameters somehow so the user doesn't have to maintain the deployment JSON and the YAML.
I defer to @vishnukam3 on this one... can you weigh in here?
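To make that concrete, one possible shape is sketched below: nest the YAML parameters under a new field in the recipe/deployment JSON so a single document drives both. The `benchmark_config` and `recipe_id` fields are hypothetical names for illustration, not an agreed schema; only `output_object_storage` and its keys appear in this PR.

```json
{
  "_note": "illustrative sketch only; benchmark_config and recipe_id are hypothetical field names",
  "recipe_id": "offline_inference_benchmark",
  "output_object_storage": [
    {
      "bucket_name": "inference_output",
      "mount_location": "/mlcommons_output"
    }
  ],
  "benchmark_config": {
    "benchmark_type": "offline",
    "engine": "sglang",
    "model": "meta-llama/Llama-3.1-8B-Instruct"
  }
}
```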
@@ -0,0 +1,104 @@
# Online Inference Blueprint (LLMPerf)

This blueprint benchmarks **online inference performance** of large language models using **LLMPerf**, a standardized benchmarking tool. It is designed to evaluate LLM APIs served via platforms such as OpenAI-compatible interfaces, including self-hosted LLM inference endpoints.
Maybe we include a section on what offline vs online means here? And link between the offline version and the online version so people can choose after they understand the difference?
Yes. I do have a "When to use Offline Inference" section in the offline README and a "when this recipe helps" section in the online README, which outline when to use which.
"output_object_storage": [ | ||
{ | ||
"bucket_name": "inference_output", | ||
"mount_location": "/mlcommons_output", |
Maybe change this to benchmarking output? Is this specific to MLCommons? It looks like re-use, perhaps.
Yeah, that makes sense. It can be anything the customer wants to call it. I just named it mlcommons_output because the metrics there and the JSON file here have the same data.
"output_object_storage": [ | ||
{ | ||
"bucket_name": "inference_output", | ||
"mount_location": "/mlcommons_output", |
similar here - maybe make /benchmarking_output
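For reference, the rename being suggested would look like this (same fields as in the diff, with only the mount path changed; the bucket name is left as-is):

```json
"output_object_storage": [
  {
    "bucket_name": "inference_output",
    "mount_location": "/benchmarking_output"
  }
]
```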
## Prerequisites

Before running this blueprint:
- You **must have an inference server already running**, compatible with the OpenAI API format.
I'd suggest adding something like:
"You can first deploy this blueprint for an online inference server to test the workflow".
Then, I would provide an example config + blueprint which utilizes this to test the workflow. That way, someone could have a pre-canned end-to-end workflow which we provide.
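As a rough sketch of that pre-canned end-to-end flow: deploy the online inference blueprint first, then point this benchmark at its OpenAI-compatible endpoint. The field names, placeholder URL, and values below are assumptions for illustration, not part of this PR.

```yaml
# Hypothetical end-to-end sketch: benchmark an endpoint deployed from the
# online inference blueprint. All fields and values are illustrative assumptions.
benchmark_type: online
api_base: http://<online-blueprint-endpoint>/v1    # assumed OpenAI-compatible base URL
api_key_env: API_KEY                               # assumed: key supplied via environment variable
model: meta-llama/Llama-3.1-8B-Instruct            # assumed model identifier
num_concurrent_requests: 8                         # assumed concurrency for the load test
```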
Noted! will make the changes
…he PR for offline inference