
Conversation

Contributor

@yangqinghao-cmss commented Sep 16, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request adds unit tests for distributed patches. The tests are a good addition, but I've found a couple of issues. One test case has a bug that will cause a NameError and also suffers from test pollution by not cleaning up monkey-patched global objects. Another test is incomplete as it doesn't verify an important side effect of the function under test, which could hide a potential bug. I've provided suggestions to fix these issues to make the test suite more robust and reliable.

Comment on lines 49 to 67
class TestCommunicationAdaptation(unittest.TestCase):

    def setUp(self):
        import torch.distributed as dist
        from vllm_ascend.patch.platform.patch_common.patch_distributed import communication_adaptation_310p

        self.original_broadcast = dist.broadcast
        self.original_all_reduce = dist.all_reduce

    def test_communication_adaptation_310p(self):
        import torch.distributed as dist
        from vllm_ascend.patch.platform.patch_common.patch_distributed import is_310p
        if is_310p():
            communication_adaptation_310p()
            self.assertNotEqual(dist.broadcast, self.original_broadcast)
            self.assertNotEqual(dist.all_reduce, self.original_all_reduce)
        else:
            self.assertEqual(dist.broadcast, self.original_broadcast)
            self.assertEqual(dist.all_reduce, self.original_all_reduce)
Contributor


critical

This test case has two issues:

  1. Test Pollution: It modifies global state by patching torch.distributed.broadcast and torch.distributed.all_reduce (and their distributed_c10d counterparts), but it does not clean up these changes after the test runs. This can lead to test pollution, where the outcome of subsequent tests is affected, making test failures difficult to debug.

  2. NameError Bug: The function communication_adaptation_310p is imported within setUp but used in test_communication_adaptation_310p, which will cause a NameError because the import is not in the correct scope.

To ensure test isolation and fix the bug, you should:

  • Move the import of communication_adaptation_310p into test_communication_adaptation_310p.
  • In setUp, also save the original torch.distributed.distributed_c10d.broadcast and torch.distributed.distributed_c10d.all_reduce functions.
  • Implement a tearDown method to restore all patched functions to their original state after the test completes.
  • Update the test assertions to also verify the patching of the distributed_c10d functions for completeness.
class TestCommunicationAdaptation(unittest.TestCase):
    def setUp(self):
        import torch.distributed as dist

        self.original_broadcast = dist.broadcast
        self.original_all_reduce = dist.all_reduce
        self.original_c10d_broadcast = dist.distributed_c10d.broadcast
        self.original_c10d_all_reduce = dist.distributed_c10d.all_reduce

    def tearDown(self):
        import torch.distributed as dist
        dist.broadcast = self.original_broadcast
        dist.all_reduce = self.original_all_reduce
        dist.distributed_c10d.broadcast = self.original_c10d_broadcast
        dist.distributed_c10d.all_reduce = self.original_c10d_all_reduce

    def test_communication_adaptation_310p(self):
        import torch.distributed as dist
        from vllm_ascend.patch.platform.patch_common.patch_distributed import (
            communication_adaptation_310p, is_310p)
        if is_310p():
            communication_adaptation_310p()
            self.assertNotEqual(dist.broadcast, self.original_broadcast)
            self.assertNotEqual(dist.all_reduce, self.original_all_reduce)
            self.assertNotEqual(dist.distributed_c10d.broadcast,
                                self.original_c10d_broadcast)
            self.assertNotEqual(dist.distributed_c10d.all_reduce,
                                self.original_c10d_all_reduce)
        else:
            self.assertEqual(dist.broadcast, self.original_broadcast)
            self.assertEqual(dist.all_reduce, self.original_all_reduce)
            self.assertEqual(dist.distributed_c10d.broadcast,
                             self.original_c10d_broadcast)
            self.assertEqual(dist.distributed_c10d.all_reduce,
                             self.original_c10d_all_reduce)
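
As a side note on the design choice, the manual save/restore in setUp and tearDown could also be delegated to unittest.mock: patch.object re-installs the original attributes automatically when the with block exits, which removes the risk of forgetting a tearDown. The sketch below is only illustrative and reuses the module path and helper names from the snippet above; it is not part of the suggested change.

import unittest
from unittest import mock

import torch.distributed as dist
from vllm_ascend.patch.platform.patch_common.patch_distributed import (
    communication_adaptation_310p, is_310p)


class TestCommunicationAdaptationIsolated(unittest.TestCase):

    def test_communication_adaptation_310p(self):
        original_broadcast = dist.broadcast
        original_all_reduce = dist.all_reduce
        # patch.object restores the saved attributes on exit, so no state
        # leaks into other tests even if an assertion fails mid-test.
        with mock.patch.object(dist, "broadcast", dist.broadcast), \
                mock.patch.object(dist, "all_reduce", dist.all_reduce), \
                mock.patch.object(dist.distributed_c10d, "broadcast",
                                  dist.distributed_c10d.broadcast), \
                mock.patch.object(dist.distributed_c10d, "all_reduce",
                                  dist.distributed_c10d.all_reduce):
            if is_310p():
                communication_adaptation_310p()
                self.assertNotEqual(dist.broadcast, original_broadcast)
                self.assertNotEqual(dist.all_reduce, original_all_reduce)
            else:
                self.assertEqual(dist.broadcast, original_broadcast)
                self.assertEqual(dist.all_reduce, original_all_reduce)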

Comment on lines +39 to +46
config = ParallelConfig()
config.data_parallel_master_port = 29500

port = config.get_next_dp_init_port()
self.assertEqual(port, 30000)

port2 = config.get_next_dp_init_port()
self.assertEqual(port2, 30000)
Contributor


high

This test correctly checks that the port from the environment variable is used. However, the underlying implementation in parallel_config_get_dp_port has an undesirable side effect: it increments config.data_parallel_master_port even when that value is not used because the environment variable takes precedence. This can lead to unexpected behavior, such as non-contiguous port allocation if the environment variable is not always set.

To make this behavior explicit and prevent future regressions, the test should also assert the state of config.data_parallel_master_port. This will document the current behavior and highlight the side effect. Ideally, the implementation should be refactored to avoid modifying state unnecessarily.

Suggested change

config = ParallelConfig()
config.data_parallel_master_port = 29500

port = config.get_next_dp_init_port()
self.assertEqual(port, 30000)
# The current implementation has a side effect of incrementing the port
# even when the environment variable is used. This makes it explicit.
self.assertEqual(config.data_parallel_master_port, 29501)

port2 = config.get_next_dp_init_port()
self.assertEqual(port2, 30000)
self.assertEqual(config.data_parallel_master_port, 29502)
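
If the refactor hinted at above were pursued, one possible shape is sketched below. This is only an illustration: the environment-variable name (VLLM_DP_MASTER_PORT) and the function body are assumptions for the sake of the example, not the actual vllm_ascend implementation. The idea is to consume the counter only when no override is present, so the side effect disappears.

import os


# Hypothetical sketch of a patched ParallelConfig.get_next_dp_init_port.
# "VLLM_DP_MASTER_PORT" is assumed as the override variable; the real patch
# may use a different name or lookup mechanism.
def parallel_config_get_dp_port(self) -> int:
    override = os.environ.get("VLLM_DP_MASTER_PORT")
    if override:
        # Leave data_parallel_master_port untouched when the override is
        # used, so port allocation stays contiguous once the variable is
        # unset again.
        return int(override)
    port = self.data_parallel_master_port
    self.data_parallel_master_port += 1
    return port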
