[vllm] support base_url parameter for vLLM client initialization #3324


Open
wants to merge 17 commits into main from vllm_client_custom_url

Conversation


@re-imagined re-imagined commented Apr 18, 2025

Related issue: #3322

  • Introduce base_url parameter to simplify server connection
  • Update init method to support both base_url and host+port configurations
  • Modify URL construction in various methods to use base_url
  • Update documentation to include new initialization examples
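
For illustration, a minimal sketch of the two initialization styles (the import path and the host/server_port keyword names are assumptions about the existing client API; base_url and connection_timeout match the test code further down):

    from trl.extras.vllm_client import VLLMClient

    # New in this PR: connect through a single base URL (e.g. when the server sits behind a proxy)
    client = VLLMClient(base_url="http://localhost:8000", connection_timeout=120)

    # Existing style: separate host + port configuration
    client = VLLMClient(host="localhost", server_port=8000, connection_timeout=120)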

@re-imagined re-imagined changed the title feat(draft): add base_url parameter for vLLM server connection feat(draft): support base_url parameter for vLLM client initialization Apr 18, 2025
@re-imagined re-imagined force-pushed the vllm_client_custom_url branch from 7219281 to 38c9c8e Compare April 18, 2025 05:31
@re-imagined re-imagined changed the title feat(draft): support base_url parameter for vLLM client initialization [vllm] support base_url parameter for vLLM client initialization Apr 22, 2025
@re-imagined re-imagined marked this pull request as ready for review April 22, 2025 13:03
Comment on lines 241 to 301


@pytest.mark.slow
@require_3_gpus
class TestVLLMClientServerTPBaseURL(unittest.TestCase):
    model_id = "Qwen/Qwen2.5-1.5B"

    @classmethod
    def setUpClass(cls):
        # We want the server to run on GPU 1 and 2, so we set CUDA_VISIBLE_DEVICES to "1,2"
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = "1,2"  # Restrict to GPU 1 and 2

        # Start the server process
        cls.server_process = subprocess.Popen(
            ["trl", "vllm-serve", "--model", cls.model_id, "--tensor_parallel_size", "2"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            env=env,
        )

        # Initialize the client with base_url
        cls.client = VLLMClient(base_url="http://localhost:8000", connection_timeout=120)

    def test_generate(self):
        prompts = ["Hello, AI!", "Tell me a joke"]
        outputs = self.client.generate(prompts)

        # Check that the output is a list
        self.assertIsInstance(outputs, list)

        # Check that the number of generated sequences is equal to the number of prompts
        self.assertEqual(len(outputs), len(prompts))

        # Check that the generated sequences are lists of integers
        for seq in outputs:
            self.assertTrue(all(isinstance(tok, int) for tok in seq))

    def test_update_model_params(self):
        model = AutoModelForCausalLM.from_pretrained(self.model_id, device_map="cuda")
        self.client.update_model_params(model)

    def test_reset_prefix_cache(self):
        # Test resetting the prefix cache
        self.client.reset_prefix_cache()

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()

        # Close the client
        cls.client.close_communicator()

        # vLLM x pytest (or Popen) seems not to handle process termination well. To avoid zombie processes, we need to
        # kill the server process and its children explicitly.
        parent = psutil.Process(cls.server_process.pid)
        children = parent.children(recursive=True)
        for child in children:
            child.send_signal(signal.SIGTERM)
        cls.server_process.terminate()
        cls.server_process.wait()
Suggested change: remove the TestVLLMClientServerTPBaseURL test class above.

I think one test is enough

@qgallouedec qgallouedec (Member) left a comment

I think this change makes sense, thanks for suggesting it.
Sorry for the delayed review.
I've made a few comments.
Have you tested on your infrastructure?

@re-imagined re-imagined force-pushed the vllm_client_custom_url branch from cd35db0 to 0bb3624 Compare April 27, 2025 14:07
@re-imagined (Author)

I think this change makes sense, thanks for suggesting it. Sorry for the delayed review. I've made a few comments. Have you tested on your infrastructure?

Thanks for your review. 😄
I am testing it on different cases and checking whether the model output is correct, but I am struggling with some NCCL errors when executing the tests. (It worked well on the last commit 🤣, until I merged the main branch.)
I will let you know as soon as I have an update.

@qgallouedec (Member)

Hey! any update?

@re-imagined (Author)

Hey! any update?

Hi @qgallouedec,
Sorry for the late update, I've been out for a few days.

After several tests, I realized that using 0.0.0.0 as the host address in StatelessProcessGroup.create isn't viable; it requires the server's actual IP address to function properly. But if we use a base URL to initialize the vLLM client, we don't know the server's IP.

Proposed approach: use the base URL to resolve the server's IP programmatically in the client (when calling /health).
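
A minimal sketch of that resolution step (hypothetical helper, not the PR's code; it assumes the hostname in base_url resolves via ordinary DNS):

    import socket
    from urllib.parse import urlparse

    def resolve_host_ip(base_url: str) -> str:
        # Extract the hostname from the base URL and resolve it to a concrete IP,
        # since StatelessProcessGroup.create needs a real address rather than 0.0.0.0.
        hostname = urlparse(base_url).hostname
        return socket.gethostbyname(hostname)

    # e.g. resolve_host_ip("http://localhost:8000") -> "127.0.0.1"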

I’m currently testing this approach and diving deeper into how StatelessProcessGroup interacts with PyTorch’s TCPStore.

I’ll update this thread once I’ve narrowed down the root cause and validated the fix. Let me know if you have insights or suggestions!

@re-imagined (Author)

@qgallouedec
Hi, it's ready for review now.

"seconds. Make sure the server is running by running `trl vllm-serve`."
) from exc
else:
if response.status_code == 200:
if "X-Forwarded-For" in response.headers:
@re-imagined (Author)

We still need the IP address when using a base URL. There are two ways to obtain it (option 1 is sketched below):

  1. An HTTP proxy can add a header containing the backend IP so that the client can read it from the response; the most common header for passing client IPs is X-Forwarded-For.
  2. Get the IP via a request.
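
A rough sketch of option 1 (assumed snippet, not the PR's exact code; it presumes a reverse proxy injects the backend IP into the X-Forwarded-For header of the /health response):

    import requests

    base_url = "http://localhost:8000"  # example value
    response = requests.get(f"{base_url}/health")
    if response.status_code == 200 and "X-Forwarded-For" in response.headers:
        # The header may carry a comma-separated chain of addresses; take the first entry.
        server_ip = response.headers["X-Forwarded-For"].split(",")[0].strip()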
