Replies: 8 comments 2 replies
-
| There are ways to do this. DeepJavaLibrary (DJL) currently supports your use case. Using this container with a serving.properties file and a requirements.txt file will work for you. Tested on G5 and P4D instances. | 
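As a rough illustration of the serving.properties approach mentioned above, the sketch below renders such a file from a Python dict. The option names (`engine`, `option.model_id`, `option.rolling_batch`, `option.tensor_parallel_degree`) follow the DJL LMI configuration docs as far as I know; the model id and the parallel degree of 4 are assumed example values, not something stated in this thread, and `render_serving_properties` is a hypothetical helper.

```python
def render_serving_properties(options: dict) -> str:
    """Render a dict as key=value lines, one per option, in insertion order."""
    return "\n".join(f"{key}={value}" for key, value in options.items()) + "\n"


# Assumed example values; adjust the model id and parallel degree to your setup.
options = {
    "engine": "Python",
    "option.model_id": "mistralai/Mistral-7B-Instruct-v0.2",
    "option.rolling_batch": "vllm",
    "option.tensor_parallel_degree": 4,
}

serving_properties = render_serving_properties(options)
print(serving_properties)
```

The resulting text would be saved as serving.properties in the model directory, next to a requirements.txt listing any extra pip packages the container should install.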
-
| Are we supposed to mention anything in the model.py? | 
-
| From what I see, https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/vllm_deploy_llama_13b.html contains the tutorial for doing that. | 
-
| Is there any option to use vLLM in the model.py file? When I try this I got
%%writefile models2/model.py
from djl_python import Input, Output
from vllm import LLM, SamplingParams

# vLLM engine, created lazily on the first real request
client = None

# Model loader function
def load_model():
  global client
  client = LLM("mistralai/Mistral-7B-Instruct-v0.2", trust_remote_code=True, tensor_parallel_size=8)

# Handler function
def handle(inputs: Input):
  print('handler called', flush=True)
  # DJL Serving sends an empty request at startup; there is nothing to generate
  if inputs.is_empty():
    return None
  data = inputs.get_as_json()
  print("input", data, flush=True)
  input_prompt = str(data.get('prompt', ''))
  if len(input_prompt) < 1:
    return None
  # Load the model on first use
  if client is None:
    load_model()
  # Sampling settings were not defined in the original snippet; max_tokens=256 is a placeholder
  sampling_params = SamplingParams(max_tokens=256)
  results = client.generate([input_prompt], sampling_params=sampling_params)
  # Send the generated text for the single prompt to the output
  output = Output()
  output.add_as_json({"generated_text": results[0].outputs[0].text})
  return output | 
-
| 
 hey @lanking520 , https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/sample-llm/vllm_rollingbatch_deploy_customized_processing.ipynb | 
-
| The sample that Qing shared is quite old at this point. I recommend that you follow our guide here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/deployment_guide/deploying-your-endpoint.md#option-2-configuration---environment-variables. Replace HF_MODEL_ID with the Hugging Face model id you are trying to deploy. If you have a custom model, or artifacts stored in S3, we have some details on using SageMaker's support for uncompressed model artifacts here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/deployment_guide/deploying-your-endpoint.md#option-2-configuration---environment-variables. Hope this helps. | 
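To make the environment-variable option above more concrete, here is a minimal sketch, assuming the SageMaker Python SDK v2 and an LMI container. Only `HF_MODEL_ID` comes from this thread; `OPTION_ROLLING_BATCH` and `TENSOR_PARALLEL_DEGREE` are taken from the LMI docs as I understand them. `build_lmi_env` and `deploy` are hypothetical helpers, and the image URI, IAM role, and instance type are placeholders you would need to supply.

```python
def build_lmi_env(model_id: str, tensor_parallel_degree: str = "max") -> dict:
    """Build the container environment for an LMI endpoint backed by vLLM."""
    return {
        "HF_MODEL_ID": model_id,                  # Hugging Face model id (or artifact path)
        "OPTION_ROLLING_BATCH": "vllm",           # assumed: selects the vLLM backend
        "TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),  # assumed: "max" uses all GPUs
    }


def deploy(image_uri: str, role: str, env: dict, instance_type: str = "ml.g5.12xlarge"):
    """Create the model and endpoint. Not called here; needs AWS credentials."""
    import sagemaker  # imported lazily so the sketch can be read without the SDK installed

    model = sagemaker.Model(image_uri=image_uri, env=env, role=role)
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


env = build_lmi_env("mistralai/Mistral-7B-Instruct-v0.2")
print(env)
```

Calling `deploy(image_uri, role, env)` with a real LMI image URI and execution role would then start the endpoint; check the linked guide for the current image and the full list of supported variables.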
-
| How can this be run without djl-serving? Can you run the vllm/vllm-openai:latest container on AWS SageMaker? If not, what needs to be changed to make it work? | 
-
| When a model is deployed as a SageMaker endpoint using DJL+vLLM, is it deployed as an OpenAI-compatible server, or does it run offline inference within the endpoint? | 
-
Hi, I am trying to test the throughput of vLLM during inference. I am using Amazon SageMaker. My typical notebook example is this one: https://github.com/huggingface/notebooks/blob/5ef609e9078e6248d73f28106e60ddafa9359db1/sagemaker/24_train_bloom_peft_lora/sagemaker-notebook.ipynb . Are there any resources I can use as a reference for deploying an endpoint with vLLM on SageMaker?