[vllm] support base_url parameter for vLLM client initialization #3324


Open
wants to merge 17 commits into main from vllm_client_custom_url

Conversation


@re-imagined re-imagined commented Apr 18, 2025

Related issue: #3322

  • Introduce base_url parameter to simplify server connection
  • Update init method to support both base_url and host+port configurations
  • Modify URL construction in various methods to use base_url
  • Update documentation to include new initialization examples
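
For illustration, a minimal sketch of the two initialization styles (the import path and the host/server_port keyword names are assumptions about the existing client API; base_url and connection_timeout match the test code further down):

    from trl.extras.vllm_client import VLLMClient

    # New in this PR: connect through a single base URL (e.g. when the server sits behind a proxy)
    client = VLLMClient(base_url="http://localhost:8000", connection_timeout=120)

    # Existing style: separate host + port configuration
    client = VLLMClient(host="localhost", server_port=8000, connection_timeout=120)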

@re-imagined re-imagined changed the title feat(draft): add base_url parameter for vLLM server connection feat(draft): support base_url parameter for vLLM client initialization Apr 18, 2025
@re-imagined re-imagined force-pushed the vllm_client_custom_url branch from 7219281 to 38c9c8e Compare April 18, 2025 05:31
@re-imagined re-imagined changed the title feat(draft): support base_url parameter for vLLM client initialization [vllm] support base_url parameter for vLLM client initialization Apr 22, 2025
@re-imagined re-imagined marked this pull request as ready for review April 22, 2025 13:03
Comment on lines 241 to 301


@pytest.mark.slow
@require_3_gpus
class TestVLLMClientServerTPBaseURL(unittest.TestCase):
    model_id = "Qwen/Qwen2.5-1.5B"

    @classmethod
    def setUpClass(cls):
        # We want the server to run on GPU 1 and 2, so we set CUDA_VISIBLE_DEVICES to "1,2"
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = "1,2"  # Restrict to GPU 1 and 2

        # Start the server process
        cls.server_process = subprocess.Popen(
            ["trl", "vllm-serve", "--model", cls.model_id, "--tensor_parallel_size", "2"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            env=env,
        )

        # Initialize the client with base_url
        cls.client = VLLMClient(base_url="http://localhost:8000", connection_timeout=120)

    def test_generate(self):
        prompts = ["Hello, AI!", "Tell me a joke"]
        outputs = self.client.generate(prompts)

        # Check that the output is a list
        self.assertIsInstance(outputs, list)

        # Check that the number of generated sequences is equal to the number of prompts
        self.assertEqual(len(outputs), len(prompts))

        # Check that the generated sequences are lists of integers
        for seq in outputs:
            self.assertTrue(all(isinstance(tok, int) for tok in seq))

    def test_update_model_params(self):
        model = AutoModelForCausalLM.from_pretrained(self.model_id, device_map="cuda")
        self.client.update_model_params(model)

    def test_reset_prefix_cache(self):
        # Test resetting the prefix cache
        self.client.reset_prefix_cache()

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()

        # Close the client
        cls.client.close_communicator()

        # vLLM x pytest (or Popen) seems not to handle process termination well. To avoid zombie processes, we need to
        # kill the server process and its children explicitly.
        parent = psutil.Process(cls.server_process.pid)
        children = parent.children(recursive=True)
        for child in children:
            child.send_signal(signal.SIGTERM)
        cls.server_process.terminate()
        cls.server_process.wait()
Suggested change: remove the TestVLLMClientServerTPBaseURL test class above.

I think one test is enough

@qgallouedec qgallouedec (Member) left a comment

I think this change makes sense, thanks for suggesting it.
Sorry for the delayed review.
I've made a few comments.
Have you tested on your infrastructure?

@re-imagined re-imagined force-pushed the vllm_client_custom_url branch from cd35db0 to 0bb3624 Compare April 27, 2025 14:07
@re-imagined (Author)

I think this change makes sense, thanks for suggesting it. Sorry for the delayed review. I've made a few comments. Have you tested on your infrastructure?

Thanks for your review. 😄
I am testing it on different cases and checking whether the model output is correct, but I am struggling with some NCCL errors when executing the tests. (It worked well on the last commit 🤣, until I merged the main branch.)
I will let you know as soon as I have an update.

@qgallouedec (Member)

Hey! any update?

@re-imagined (Author)

Hey! any update?

Hi @qgallouedec,
Sorry for the late update, I've been out for a few days.

After several tests, I realized that using 0.0.0.0 as the host address in StatelessProcessGroup.create isn't viable; it requires the server's actual IP address to function properly. But if we use a base URL to initialize the vLLM client, we don't know the server's IP.

Proposed approach: use the base URL to resolve the server's IP programmatically in the client (when calling /health).
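
A minimal sketch of that resolution step (hypothetical helper, not the PR's code; it assumes the hostname in base_url resolves via ordinary DNS):

    import socket
    from urllib.parse import urlparse

    def resolve_host_ip(base_url: str) -> str:
        # Extract the hostname from the base URL and resolve it to a concrete IP,
        # since StatelessProcessGroup.create needs a real address rather than 0.0.0.0.
        hostname = urlparse(base_url).hostname
        return socket.gethostbyname(hostname)

    # e.g. resolve_host_ip("http://localhost:8000") -> "127.0.0.1"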

I’m currently testing this approach and diving deeper into how StatelessProcessGroup interacts with PyTorch’s TCPStore.

I’ll update this thread once I’ve narrowed down the root cause and validated the fix. Let me know if you have insights or suggestions!

@re-imagined (Author)

@qgallouedec
Hi, it's ready for review now.

"seconds. Make sure the server is running by running `trl vllm-serve`."
) from exc
else:
if response.status_code == 200:
if "X-Forwarded-For" in response.headers:
@re-imagined (Author)

We still need the IP address when using a base URL. There are two ways to obtain it (option 1 is sketched below):

  1. An HTTP proxy can add a header containing the backend IP so that the client can read it from the response; the most common header for passing client IPs is X-Forwarded-For.
  2. Get the IP via a request.
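
A rough sketch of option 1 (assumed snippet, not the PR's exact code; it presumes a reverse proxy injects the backend IP into the X-Forwarded-For header of the /health response):

    import requests

    base_url = "http://localhost:8000"  # example value
    response = requests.get(f"{base_url}/health")
    if response.status_code == 200 and "X-Forwarded-For" in response.headers:
        # The header may carry a comma-separated chain of addresses; take the first entry.
        server_ip = response.headers["X-Forwarded-For"].split(",")[0].strip()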
