Conversation

@apbose
Collaborator

@apbose apbose commented Sep 22, 2025

TRT-LLM installation tool for the distributed (multi-GPU) case

  1. The download should be performed by only one process (GPU rank) to avoid redundant downloads
  2. The tool uses lock files to enforce this
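The lock-file approach described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function and artifact names (`ensure_downloaded`, `trt_llm_plugin.so`) are hypothetical, and it assumes a POSIX system where `fcntl.flock` is available.

```python
import fcntl
from pathlib import Path

def ensure_downloaded(cache_dir: Path, download_fn) -> Path:
    """Download once per machine, even when many ranks race to call this."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / "trt_llm_plugin.so"   # placeholder artifact name
    lock_path = cache_dir / ".download.lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until the lock is free, so only one process downloads;
        # the rest wait here and then find the file already present.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not target.exists():
                download_fn(target)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

The check-after-acquire pattern (`if not target.exists()` inside the lock) is what prevents a second rank from repeating a download that finished while it was waiting.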

@github-actions github-actions bot added component: tests Issues re: Tests component: conversion Issues re: Conversion stage component: api [Python] Issues re: Python API component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Sep 22, 2025
@meta-cla meta-cla bot added the cla signed label Sep 22, 2025
@github-actions github-actions bot requested a review from peri044 September 22, 2025 06:34

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py	2025-09-22 06:35:28.523784+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/utils.py	2025-09-22 06:36:00.657186+00:00
@@ -863,6 +863,6 @@
    return False


def is_thor() -> bool:
    if torch.cuda.get_device_capability() in [(11, 0)]:
-        return True
\ No newline at end of file
+        return True

@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from 6e99bbc to 7134053 Compare September 22, 2025 22:47
@github-actions github-actions bot added the component: build system Issues re: Build system label Sep 25, 2025
@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from 3f1fa7e to 54948d9 Compare September 25, 2025 19:33

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py	2025-09-25 19:33:28.176615+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/distributed/utils.py	2025-09-25 19:34:02.325958+00:00
@@ -100,11 +100,10 @@
        return True

    except Exception as e:
        logger.warning(f"Failed to detect CUDA version: {e}")
        return False
-

    return True


def _cache_root() -> Path:

@apbose apbose force-pushed the abose/trt_llm_installation_dist branch 3 times, most recently from 2bbc423 to 5beefc0 Compare September 25, 2025 22:13
@apbose apbose changed the title Changes to TRT-LLM download tool for multigpu distributed case Changes to TRT-LLM download tool for multigpu distributed case [WIP] Sep 25, 2025
@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from 5beefc0 to 809c7ee Compare September 26, 2025 00:11
@apbose apbose changed the title Changes to TRT-LLM download tool for multigpu distributed case [WIP] Changes to TRT-LLM download tool for multigpu distributed case Sep 26, 2025
@apbose apbose force-pushed the abose/trt_llm_installtion branch 5 times, most recently from b96b9ee to 2f2cd31 Compare October 7, 2025 17:27
@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from 809c7ee to 5fb74da Compare October 11, 2025 02:04
@github-actions github-actions bot added documentation Improvements or additions to documentation component: lowering Issues re: The lowering / preprocessing passes component: converters Issues re: Specific op converters component: runtime component: torch_compile labels Oct 11, 2025
@apbose apbose changed the base branch from abose/trt_llm_installtion to main October 11, 2025 02:05
@github-actions github-actions bot removed documentation Improvements or additions to documentation component: lowering Issues re: The lowering / preprocessing passes component: conversion Issues re: Conversion stage component: converters Issues re: Specific op converters component: runtime component: torch_compile labels Oct 13, 2025

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-10-13 18:42:31.890493+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-10-13 18:43:14.026641+00:00
@@ -23,10 +23,11 @@
if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
    )
+

class DistributedGatherModel(nn.Module):
    def __init__(self, input_dim, world_size, group_name):
        super().__init__()
        self.fc = nn.Linear(input_dim, input_dim)

@apbose apbose force-pushed the abose/trt_llm_installation_dist branch 3 times, most recently from bd02455 to 38224c5 Compare October 13, 2025 20:38

def check_tensor_parallel_device_number(world_size: int) -> None:
    if world_size % 2 != 0:
        raise ValueError(
Collaborator

Why would it matter what the examples need? This is supposed to be user-facing code.
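For reference, the truncated snippet above could be completed along these lines. The actual error message in the PR is cut off, so the one below is a hypothetical placeholder:

```python
def check_tensor_parallel_device_number(world_size: int) -> None:
    # Hypothetical completion: the PR's real error message is not shown above.
    if world_size % 2 != 0:
        raise ValueError(
            f"Tensor parallelism requires an even world size, got {world_size}"
        )
```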



def initialize_logger(rank: int, logger_file_name: str) -> logging.Logger:
    logger = logging.getLogger()
Collaborator

Where is this logger used? Typically this is a user's responsibility; we should not be creating the actual logger handlers in our library. We can do this in the example code, though.

Collaborator Author

OK, this was meant for the use case where we want separate log outputs for the two different GPUs. But I agree all of this can be moved to the user-facing code.
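Moved to example code, the per-rank logging discussed here might look like the sketch below. It differs from the snippet above in one deliberate way: it uses a named per-rank logger rather than the root logger, so repeated calls don't stack handlers on the root. This is an illustration, not the PR's final code:

```python
import logging

def initialize_logger(rank: int, logger_file_name: str) -> logging.Logger:
    # Per-rank log files so output from different GPUs doesn't interleave.
    # Lives in example code, not the library, per the review discussion.
    logger = logging.getLogger(f"rank_{rank}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(f"{logger_file_name}_{rank}.log", mode="w")
    handler.setFormatter(logging.Formatter(f"[rank {rank}] %(message)s"))
    logger.addHandler(handler)
    return logger
```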

@github-actions github-actions bot removed component: build system Issues re: Build system component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Oct 17, 2025
@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from f4f338a to f07b5cb Compare October 17, 2025 22:57
@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from f07b5cb to 4118355 Compare November 11, 2025 03:31

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-11-19 02:00:51.654228+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-11-19 02:01:22.401857+00:00
@@ -60,38 +60,40 @@
        return True


def create_distributed_config(item: Dict[str, Any]) -> Dict[str, Any]:
    """Create distributed test configuration from a regular config.
-    
+
    Takes a standard test config and modifies it for distributed testing:
    - Changes runner to multi-GPU instance
    - Adds num_gpus field
    - Adds config marker
    """
    import sys
-    
+
    # Create a copy to avoid modifying the original
    dist_item = item.copy()
-    
+
    # Debug: Show original config
    print(f"[DEBUG] Creating distributed config from:", file=sys.stderr)
    print(f"[DEBUG]   Python: {item.get('python_version')}", file=sys.stderr)
    print(f"[DEBUG]   CUDA: {item.get('desired_cuda')}", file=sys.stderr)
-    print(f"[DEBUG]   Original runner: {item.get('validation_runner')}", file=sys.stderr)
-    
+    print(
+        f"[DEBUG]   Original runner: {item.get('validation_runner')}", file=sys.stderr
+    )
+
    # Override runner to use multi-GPU instance
    dist_item["validation_runner"] = "linux.g4dn.12xlarge.nvidia.gpu"
-    
+
    # Add distributed-specific fields
    dist_item["num_gpus"] = 2
    dist_item["config"] = "distributed"
-    
+
    # Debug: Show modified config
    print(f"[DEBUG]   New runner: {dist_item['validation_runner']}", file=sys.stderr)
    print(f"[DEBUG]   GPUs: {dist_item['num_gpus']}", file=sys.stderr)
-    
+
    return dist_item


def main(args: list[str]) -> None:
    parser = argparse.ArgumentParser()
@@ -131,38 +133,43 @@
        raise ValueError(f"Invalid matrix structure: {e}")

    includes = matrix_dict["include"]
    filtered_includes = []
    distributed_includes = []  # NEW: separate list for distributed configs
-    
+
    print(f"[DEBUG] Processing {len(includes)} input configs", file=sys.stderr)

    for item in includes:
        if filter_matrix_item(
            item,
            options.jetpack == "true",
            options.limit_pr_builds == "true",
        ):
            filtered_includes.append(item)
-            
+
            # NEW: Create distributed variant for specific configs
            # Only Python 3.10 + CUDA 13.0 for now
            if item["python_version"] == "3.10" and item["desired_cuda"] == "cu130":
-                print(f"[DEBUG] Creating distributed config for py3.10+cu130", file=sys.stderr)
+                print(
+                    f"[DEBUG] Creating distributed config for py3.10+cu130",
+                    file=sys.stderr,
+                )
                distributed_includes.append(create_distributed_config(item))
-    
+
    # Debug: Show summary
    print(f"[DEBUG] Final counts:", file=sys.stderr)
    print(f"[DEBUG]   Regular configs: {len(filtered_includes)}", file=sys.stderr)
-    print(f"[DEBUG]   Distributed configs: {len(distributed_includes)}", file=sys.stderr)
+    print(
+        f"[DEBUG]   Distributed configs: {len(distributed_includes)}", file=sys.stderr
+    )

    # NEW: Output both regular and distributed configs
    filtered_matrix_dict = {
        "include": filtered_includes,
-        "distributed_include": distributed_includes  # NEW field
+        "distributed_include": distributed_includes,  # NEW field
    }
-    
+
    # Output to stdout (consumed by GitHub Actions)
    print(json.dumps(filtered_matrix_dict))


if __name__ == "__main__":
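The diff above ends with the matrix being emitted to stdout for GitHub Actions. The shape of that output can be illustrated in isolation (field values are copied from the diff; the surrounding argparse and filtering logic is omitted, and the regular runner name is a placeholder):

```python
import json

# A regular config as filter-matrix.py would see it (runner name is illustrative).
item = {
    "python_version": "3.10",
    "desired_cuda": "cu130",
    "validation_runner": "linux.4xlarge.nvidia.gpu",
}

# Mirror create_distributed_config: copy, swap runner, add distributed fields.
dist_item = item.copy()
dist_item["validation_runner"] = "linux.g4dn.12xlarge.nvidia.gpu"
dist_item["num_gpus"] = 2
dist_item["config"] = "distributed"

matrix_json = json.dumps({"include": [item], "distributed_include": [dist_item]})
print(matrix_json)
```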

@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from 41e8582 to b3f62c7 Compare November 19, 2025 07:21
@apbose apbose force-pushed the abose/trt_llm_installation_dist branch from b3f62c7 to a61153d Compare November 19, 2025 08:43

Labels

cla signed component: api [Python] Issues re: Python API component: tests Issues re: Tests


3 participants