Changes to TRT-LLM download tool for multi-GPU distributed case #3830
base: main
Conversation
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py 2025-09-22 06:35:28.523784+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py 2025-09-22 06:36:00.657186+00:00
@@ -863,6 +863,6 @@
    return False
def is_thor() -> bool:
    if torch.cuda.get_device_capability() in [(11, 0)]:
-        return True
\ No newline at end of file
+        return True
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py 2025-09-25 19:33:28.176615+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py 2025-09-25 19:34:02.325958+00:00
@@ -100,11 +100,10 @@
        return True
    except Exception as e:
        logger.warning(f"Failed to detect CUDA version: {e}")
        return False
-
    return True
def _cache_root() -> Path:
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-10-13 18:42:31.890493+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py 2025-10-13 18:43:14.026641+00:00
@@ -23,10 +23,11 @@
if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
    )
+
class DistributedGatherModel(nn.Module):
    def __init__(self, input_dim, world_size, group_name):
        super().__init__()
        self.fc = nn.Linear(input_dim, input_dim)
def check_tensor_parallel_device_number(world_size: int) -> None:
    if world_size % 2 != 0:
        raise ValueError(
Why would it matter what the examples need? This is supposed to be user-facing code.
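For context, if this check moves out of the library, the example script can carry it directly before setting up the process group. A minimal sketch, assuming the even-world-size requirement quoted above (the function placement and the extra GPU-count guard are illustrative, not from this PR):

import os

import torch


def validate_world_size(world_size: int) -> None:
    # Mirror the quoted library check in user-facing example code.
    if world_size % 2 != 0:
        raise ValueError(
            f"This tensor-parallel example expects an even world size, got {world_size}"
        )
    # Illustrative extra guard: don't ask for more ranks than visible GPUs.
    if world_size > torch.cuda.device_count():
        raise ValueError(
            f"world_size={world_size} exceeds the {torch.cuda.device_count()} visible GPUs"
        )


# e.g. at the top of the example's main():
# validate_world_size(int(os.environ.get("WORLD_SIZE", 1)))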
def initialize_logger(rank: int, logger_file_name: str) -> logging.Logger:
    logger = logging.getLogger()
Where is this logger used? Typically this is the user's responsibility; we should not be creating the actual logger handlers in our library. We can do this in the examples code, though.
OK, this was meant for the case where we want separate log outputs for the two different GPUs. But I agree these can all be moved to the user-facing code.
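As a rough sketch of what that could look like once moved into the example script, using a per-rank file handler so each GPU writes its own log (the function name and file naming are illustrative):

import logging
import os


def initialize_logger(rank: int, logger_file_name: str) -> logging.Logger:
    # One log file per rank keeps multi-GPU output readable.
    logger = logging.getLogger(f"tensor_parallel_example_rank{rank}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(f"{logger_file_name}_rank_{rank}.log")
    handler.setFormatter(
        logging.Formatter(f"%(asctime)s [rank {rank}] %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger


# e.g. in the example script:
# rank = int(os.environ.get("RANK", 0))
# logger = initialize_logger(rank, "tensor_parallel_simple_example")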
…T-LLM wheel by using lock file
…ding race conditions
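For reference, the intent of the lock file is that only one local process downloads and installs the TRT-LLM wheel while the other ranks block until it is in place. A minimal sketch of that pattern using an advisory file lock (the paths, file names, and the fetch callable are illustrative, not the PR's actual implementation):

import fcntl
from pathlib import Path
from typing import Callable


def download_wheel_once(cache_dir: Path, fetch: Callable[[Path], None]) -> Path:
    # All ranks call this; only the first to take the lock performs the download.
    cache_dir.mkdir(parents=True, exist_ok=True)
    wheel_path = cache_dir / "tensorrt_llm.whl"      # illustrative file name
    lock_path = cache_dir / ".trtllm_download.lock"  # illustrative lock name

    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the current holder releases it
        try:
            if not wheel_path.exists():
                fetch(wheel_path)  # caller-supplied download step
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return wheel_path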
There are some changes that do not conform to Python style guidelines:
--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-11-19 02:00:51.654228+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py 2025-11-19 02:01:22.401857+00:00
@@ -60,38 +60,40 @@
return True
def create_distributed_config(item: Dict[str, Any]) -> Dict[str, Any]:
"""Create distributed test configuration from a regular config.
-
+
Takes a standard test config and modifies it for distributed testing:
- Changes runner to multi-GPU instance
- Adds num_gpus field
- Adds config marker
"""
import sys
-
+
# Create a copy to avoid modifying the original
dist_item = item.copy()
-
+
# Debug: Show original config
print(f"[DEBUG] Creating distributed config from:", file=sys.stderr)
print(f"[DEBUG] Python: {item.get('python_version')}", file=sys.stderr)
print(f"[DEBUG] CUDA: {item.get('desired_cuda')}", file=sys.stderr)
- print(f"[DEBUG] Original runner: {item.get('validation_runner')}", file=sys.stderr)
-
+ print(
+ f"[DEBUG] Original runner: {item.get('validation_runner')}", file=sys.stderr
+ )
+
# Override runner to use multi-GPU instance
dist_item["validation_runner"] = "linux.g4dn.12xlarge.nvidia.gpu"
-
+
# Add distributed-specific fields
dist_item["num_gpus"] = 2
dist_item["config"] = "distributed"
-
+
# Debug: Show modified config
print(f"[DEBUG] New runner: {dist_item['validation_runner']}", file=sys.stderr)
print(f"[DEBUG] GPUs: {dist_item['num_gpus']}", file=sys.stderr)
-
+
return dist_item
def main(args: list[str]) -> None:
parser = argparse.ArgumentParser()
@@ -131,38 +133,43 @@
raise ValueError(f"Invalid matrix structure: {e}")
includes = matrix_dict["include"]
filtered_includes = []
distributed_includes = [] # NEW: separate list for distributed configs
-
+
print(f"[DEBUG] Processing {len(includes)} input configs", file=sys.stderr)
for item in includes:
if filter_matrix_item(
item,
options.jetpack == "true",
options.limit_pr_builds == "true",
):
filtered_includes.append(item)
-
+
# NEW: Create distributed variant for specific configs
# Only Python 3.10 + CUDA 13.0 for now
if item["python_version"] == "3.10" and item["desired_cuda"] == "cu130":
- print(f"[DEBUG] Creating distributed config for py3.10+cu130", file=sys.stderr)
+ print(
+ f"[DEBUG] Creating distributed config for py3.10+cu130",
+ file=sys.stderr,
+ )
distributed_includes.append(create_distributed_config(item))
-
+
# Debug: Show summary
print(f"[DEBUG] Final counts:", file=sys.stderr)
print(f"[DEBUG] Regular configs: {len(filtered_includes)}", file=sys.stderr)
- print(f"[DEBUG] Distributed configs: {len(distributed_includes)}", file=sys.stderr)
+ print(
+ f"[DEBUG] Distributed configs: {len(distributed_includes)}", file=sys.stderr
+ )
# NEW: Output both regular and distributed configs
filtered_matrix_dict = {
"include": filtered_includes,
- "distributed_include": distributed_includes # NEW field
+ "distributed_include": distributed_includes, # NEW field
}
-
+
# Output to stdout (consumed by GitHub Actions)
print(json.dumps(filtered_matrix_dict))
if __name__ == "__main__":
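To make the new output shape concrete, the script now prints a single JSON object containing both lists, which the workflow reads from stdout. An abbreviated, illustrative example of that structure (fields other than the distributed ones added in this diff are placeholders):

# Illustrative shape of filter-matrix.py stdout after this change (abbreviated):
filtered_matrix_dict = {
    "include": [
        {
            "python_version": "3.10",
            "desired_cuda": "cu130",
            "validation_runner": "linux.g5.4xlarge.nvidia.gpu",  # placeholder single-GPU runner
        },
    ],
    "distributed_include": [
        {
            "python_version": "3.10",
            "desired_cuda": "cu130",
            "validation_runner": "linux.g4dn.12xlarge.nvidia.gpu",
            "num_gpus": 2,
            "config": "distributed",
        },
    ],
}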
TRT-LLM installation tool for distributed