[WIP][AQUA] GPU Shape Recommendation #1221

Open: wants to merge 10 commits into base: main

elizjo (Member) commented Jul 7, 2025

Adds a POST API and an aqua CLI command that recommend GPU shapes for a given model.

ads aqua recommend shapes --model_ocid 'ocid1.datasciencemodel.oc1.<ocid>' "" 

Returns:

[Screenshot 2025-08-01 at 2:19:10 PM]

POST request body:

{
  "model_ocid": "ocid1.datasciencemodel.oc1.<ocid>"
}

Returns

{
    "display_name": "Almawave/Velvet-14B",
    "recommendations": [
        {
            "shape_details": {
                "available": false,
                "core_count": null,
                "memory_in_gbs": null,
                "name": "BM.GPU.MI300X.8",
                "shape_series": "GPU",
                "gpu_specs": {
                    "gpu_memory_in_gbs": 1536,
                    "gpu_count": 8,
                    "gpu_type": "MI300X",
                    "quantization": [
                        "fp8",
                        "gguf"
                    ],
                    "ranking": {
                        "cost": 90,
                        "performance": 90
                    }
                }
            },
            "configurations": [
                {
                    "model_details": {
                        "model_size_gb": 28.16,
                        "kv_cache_size_gb": 26.84,
                        "total_model_gb": 55.0
                    },
                    "deployment_params": {
                        "quantization": "bfloat16",
                        "max_model_len": 131072,
                        "params": ""
                    },
                    "recommendation": "No override PARAMS needed. \n\nModel fits well within the allowed compute shape (55.0GB used / 1536.0GB allowed)."
                }
            ]
        }
    ],
    "troubleshoot": ""
}
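
The recommendation text in the sample response follows from simple arithmetic: the weight size and the KV cache size are summed and compared with the GPU memory of the shape. A minimal sketch of that check, using the numbers from the response above. The function name `fits_shape` is an illustrative assumption, not code from this PR:

```python
from typing import Tuple


def fits_shape(
    model_size_gb: float, kv_cache_size_gb: float, gpu_memory_in_gbs: float
) -> Tuple[float, bool]:
    """Return the total memory estimate and whether it fits on the shape."""
    # Hypothetical helper: total = weights + KV cache, rounded for display.
    total_gb = round(model_size_gb + kv_cache_size_gb, 2)
    return total_gb, total_gb <= gpu_memory_in_gbs


# Values taken from the sample response (Almawave/Velvet-14B on BM.GPU.MI300X.8).
total, fits = fits_shape(28.16, 26.84, 1536)
print(f"{total}GB used / {float(1536)}GB allowed, fits={fits}")
```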

Status: business logic works, API works, unit tests finished, rich diff CLI table finished
[Screenshot 2025-07-29 at 12:04:57 PM]

oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on Jul 7, 2025

github-actions bot commented Jul 7, 2025

📌 Cov diff with main: 0%

📌 Overall coverage: 18.62%

mrDzurb changed the title from "GPU Shape Recommendation" to "[AQUA] GPU Shape Recommendation" on Jul 7, 2025

github-actions bot commented Jul 8, 2025

📌 Cov diff with main: 0%

📌 Overall coverage: 18.62%

mrDzurb changed the title from "[AQUA] GPU Shape Recommendation" to "[WIP][AQUA] GPU Shape Recommendation" on Jul 15, 2025
elizjo force-pushed the ODSC-74228/GPU-Shape-Recommendation branch from 2f54f8b to 26e08a2 on July 25, 2025 17:29

📌 Cov diff with main: 32%

📌 Overall coverage: 58.28%

📌 Cov diff with main: 54%

📌 Overall coverage: 58.46%

📌 Cov diff with main: 0%

📌 Overall coverage: 18.43%

📌 Cov diff with main: 70%

📌 Overall coverage: 58.58%

github-actions bot commented Aug 1, 2025

📌 Cov diff with main: 70%

📌 Overall coverage: 58.58%

  },
  "VM.GPU.A10.1": {
    "gpu_count": 1,
    "gpu_memory_in_gbs": 24,
-   "gpu_type": "A10"
+   "gpu_type": "A10",
Member commented:

Let's add FP8 for the A10 shapes as well.

@@ -1287,6 +1287,7 @@ def load_gpu_shapes_index(

     # Merge: remote shapes override local
     local_shapes = local_data.get("shapes", {})
+    remote_data = {}
Member commented:

Why do we need this?
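
For context, the "remote shapes override local" merge in this hunk can be sketched roughly as below. This is an illustration only: `merge_shapes` is a hypothetical helper, the `quantization` entry is sample data, and defaulting the remote index to an empty dict is an assumption about how a missing remote file might be handled, not the PR's actual code.

```python
from typing import Any, Dict, Optional


def merge_shapes(
    local_data: Dict[str, Any], remote_data: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Merge two shape indexes; entries from the remote index win on conflict."""
    local_shapes = local_data.get("shapes", {})
    # Assumption: a missing remote index is treated as an empty one.
    remote_shapes = (remote_data or {}).get("shapes", {})
    # Later keys override earlier ones, so remote entries replace local ones.
    return {**local_shapes, **remote_shapes}


local = {"shapes": {"VM.GPU.A10.1": {"gpu_count": 1, "gpu_memory_in_gbs": 24, "gpu_type": "A10"}}}
remote = {
    "shapes": {
        "VM.GPU.A10.1": {
            "gpu_count": 1,
            "gpu_memory_in_gbs": 24,
            "gpu_type": "A10",
            "quantization": ["fp8"],  # sample data only
        }
    }
}
merged = merge_shapes(local, remote)
```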

@@ -13,6 +13,7 @@
 from ads.aqua.extension.evaluation_handler import __handlers__ as __eval_handlers__
 from ads.aqua.extension.finetune_handler import __handlers__ as __finetune_handlers__
 from ads.aqua.extension.model_handler import __handlers__ as __model_handlers__
+from ads.aqua.extension.recommend_handler import __handlers__ as __gpu_handlers__
Member commented:

Maybe we can name it __shape_handler?

    Detects quantization bit-width as a string (e.g., '4bit', '8bit') from Hugging Face config dict.
    """
    if raw.get("load_in_8bit"):
        return "8bit"
Member commented:

It would be better to move this into constants.
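
One way the suggestion could look, as a rough sketch: the config keys and the returned labels live in module-level constants, and the detection function only references them. All constant names and `detect_bit_width` are hypothetical. The `load_in_4bit` key is assumed here by analogy with `load_in_8bit` from the snippet above.

```python
from typing import Any, Dict, Optional

# Hypothetical constants; the real names in the PR may differ.
LOAD_IN_8BIT_KEY = "load_in_8bit"
LOAD_IN_4BIT_KEY = "load_in_4bit"  # assumed counterpart of the 8-bit flag
QUANT_8BIT = "8bit"
QUANT_4BIT = "4bit"


def detect_bit_width(raw: Dict[str, Any]) -> Optional[str]:
    """Return '8bit' or '4bit' if the config dict sets the matching flag, else None."""
    if raw.get(LOAD_IN_8BIT_KEY):
        return QUANT_8BIT
    if raw.get(LOAD_IN_4BIT_KEY):
        return QUANT_4BIT
    return None
```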

If model is un-quantized, uses the weight size.
If model is pre-quantized, uses the quantization level.
"""
key = (self.quantization or self.weight_dtype or "float32").lower()
Member commented:

Let's move "float32" to constants
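
A minimal sketch of what this could look like, assuming the key is used to look up a bytes-per-parameter value. `DEFAULT_WEIGHT_DTYPE`, `BYTES_PER_PARAM`, and `bytes_per_parameter` are hypothetical names, and the table values are illustrative rather than the PR's:

```python
from typing import Optional

# Hypothetical constant replacing the inline "float32" literal.
DEFAULT_WEIGHT_DTYPE = "float32"

# Approximate bytes per parameter; illustrative values, not the PR's table.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "fp8": 1.0,
    "8bit": 1.0,
    "4bit": 0.5,
}


def bytes_per_parameter(quantization: Optional[str], weight_dtype: Optional[str]) -> float:
    """Prefer the quantization level, then the weight dtype, then the default."""
    key = (quantization or weight_dtype or DEFAULT_WEIGHT_DTYPE).lower()
    return BYTES_PER_PARAM[key]
```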

"""
vals = []
curr = min_len
max_seq_len = 16384 if not self.max_seq_len else self.max_seq_len
Member commented:

Let's move the numbers like 16384 to constants and add some description there
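
For illustration, a sketch with the fallback pulled into a documented constant. It assumes the surrounding loop doubles the length on each step, which the visible diff lines do not confirm. `DEFAULT_MAX_SEQ_LEN` and `candidate_context_lengths` are hypothetical names.

```python
from typing import List, Optional

# Fallback context length used when the model config does not declare one.
DEFAULT_MAX_SEQ_LEN = 16384


def candidate_context_lengths(min_len: int, max_seq_len: Optional[int] = None) -> List[int]:
    """Return lengths doubling from min_len up to and including the limit."""
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    limit = max_seq_len or DEFAULT_MAX_SEQ_LEN
    vals = []
    curr = min_len
    while curr <= limit:
        vals.append(curr)
        curr *= 2  # assumed step: doubling
    return vals
```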

github-actions bot commented Aug 2, 2025

📌 Cov diff with main: 69%

📌 Overall coverage: 58.57%

"""

@handle_exceptions
def post(self, *args, **kwargs): # noqa: ARG002
Member commented:

Would it make sense to move this handler to deployment_handler.py under the AquaDeploymentHandler class?

We already have a list_shape method there, so I suggest adding a new method called list_recommended_shape.

The endpoint path could be /aqua/deployments/recommended_shapes.

The main reason for this change is that we’ll be implementing similar logic for service models as well. Having a unified handler will allow us to return the recommended list of shapes for both service and custom models in a consistent way.

    ShapeRecommendationReport,
    ShapeReport,
)
from ads.config import COMPARTMENT_OCID
Member commented:

It looks like this constant is not used?

        return self.rich_diff_table(shape_recommend_report)

    @staticmethod
    def validate_model_ocid(ocid: str) -> DataScienceModel:
Member commented:

If we are not planning to use this method outside of the class, it would be better to make it private: def _validate_model_ocid. Please check the other methods as well.

from ads.model.datascience_model import DataScienceModel


class AquaRecommendApp(AquaApp):
Member commented:

NIT: Maybe we can name it AquaShapeRecommendApp?

Member commented:

Technically, the shape recommendation feature falls under the broader Model Deployment functionality. Instead of creating a new App class, I suggest we implement the business logic in a regular helper class and then integrate it into the existing AquaDeploymentApp.

Since AquaDeploymentApp already includes a list_shapes method, we can add a new method called list_recommended_shapes to maintain backward compatibility.

Looking ahead, if we need similar logic for other modules like fine-tuning or evaluation, we can follow the same pattern and add the corresponding methods in those apps as well.

As for naming, we could rename the current AquaRecommendApp to AquaShapeRecommend. I don't think we need to inherit from AquaApp in this case, since it's primarily a utility class focused on recommendation logic.
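
The structure proposed above could be sketched roughly as follows. Every class here is a simplified stand-in written for this example (including the report type and the `AquaDeploymentApp` stub), `which_shapes` is a hypothetical method name, and the returned values are placeholders:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ShapeRecommendationReport:
    """Stand-in for the PR's report type, reduced to two fields."""

    display_name: str
    recommendations: List[str] = field(default_factory=list)


class AquaShapeRecommend:
    """Plain helper class: holds recommendation logic, no AquaApp inheritance."""

    def which_shapes(self, model_ocid: str) -> ShapeRecommendationReport:
        # Placeholder logic; a real implementation would inspect the model config.
        return ShapeRecommendationReport(
            display_name=model_ocid, recommendations=["BM.GPU.MI300X.8"]
        )


class AquaDeploymentApp:
    """Stand-in for the existing app; only the two relevant methods are shown."""

    def list_shapes(self, **kwargs) -> List[str]:
        # Existing method keeps its behaviour, so current callers are unaffected.
        return ["VM.GPU.A10.1"]

    def list_recommended_shapes(self, model_ocid: str) -> ShapeRecommendationReport:
        # Delegate to the helper instead of introducing a new App subclass.
        return AquaShapeRecommend().which_shapes(model_ocid)


report = AquaDeploymentApp().list_recommended_shapes("ocid1.datasciencemodel.oc1.<ocid>")
```

Other apps (fine-tuning, evaluation) could delegate to the same helper in the same way.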

Use `ads aqua recommend which_gpu --help` to get more details on available parameters.
"""

def which_gpu(self, **kwargs) -> ShapeRecommendationReport:
Member commented:

Maybe make it a bit more generic: which_shape?

        return recommendations

    @staticmethod
    def rich_diff_table(shape_report: ShapeRecommendationReport) -> Table:
Member commented:

I think we can make it private?

    ValueError
        If the file cannot be opened, parsed, or the 'shapes' key is missing.
    """
    user_shapes = AquaDeploymentApp().list_shapes(compartment_id=compartment_id)
Member commented:

To avoid calling AquaDeploymentApp().list_shapes, I think it would be cleaner to add a shapes method directly to the OCIDataScienceModelDeployment class.

For example:

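from typing import List, Optional

import oci

from ads.config import COMPARTMENT_OCID


# OCIDataScienceMixin is the existing base class of this model.
class OCIDataScienceModelDeployment(
    OCIDataScienceMixin,
    oci.data_science.models.ModelDeployment,
):

    @classmethod
    def shapes(
        cls,
        compartment_id: Optional[str] = None,
        **kwargs,
    ) -> List[oci.data_science.models.ModelDeploymentShapeSummary]:
        """List the model deployment shapes available in the compartment."""
        # Fall back to the default compartment when none is given.
        return oci.pagination.list_call_get_all_results(
            cls().client.list_model_deployment_shapes,
            compartment_id or COMPARTMENT_OCID,
            **kwargs,
        ).data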

This makes the logic more self-contained within the deployment model and avoids relying on external app instances just to retrieve shape info.

The usage could be:

user_shapes = OCIDataScienceModelDeployment.shapes(compartment_id=compartment_id)

if name in set_user_shapes:
    compute_shape = set_user_shapes.get(name)
    compute_shape.available = True
    compute_shape.shape_series = "GPU"
Member commented:

oci.data_science.models.ModelDeploymentShapeSummary already has a shape_series property with a setter (def shape_series(self, shape_series):). Maybe we can use it?
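
A rough sketch of the idea, assuming the shape objects returned for the user already carry a series value: keep it when present and fall back to "GPU" only when it is missing. `ComputeShape` and `mark_available` are hypothetical stand-ins, not the PR's or the OCI SDK's types.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class ComputeShape:
    """Hypothetical stand-in for the PR's shape summary model."""

    name: str
    available: bool = False
    shape_series: Optional[str] = None


def mark_available(
    index_names: Iterable[str], user_shapes: Dict[str, ComputeShape]
) -> List[ComputeShape]:
    """Flag shapes from the GPU index that the user can actually deploy to."""
    result = []
    for name in index_names:
        if name in user_shapes:
            shape = user_shapes[name]
            shape.available = True
            # Keep the series reported by the service; default only if absent.
            shape.shape_series = shape.shape_series or "GPU"
        else:
            # Known to the index but not offered to this user.
            shape = ComputeShape(name=name, shape_series="GPU")
        result.append(shape)
    return result


shapes = mark_available(
    ["VM.GPU.A10.1", "BM.GPU.MI300X.8"],
    {"VM.GPU.A10.1": ComputeShape("VM.GPU.A10.1", shape_series="GPU")},
)
```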

github-actions bot commented Aug 4, 2025

📌 Cov diff with main: 66%

📌 Overall coverage: 58.55%

Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.

2 participants