Error running a VLM model on 910B3 when Tensor Parallel is enabled #215

@wly-115

Description

ATB logs

[2025-09-28 20:14:33.201600] [info] [710583] [operation_base.cpp:430] WordEmbedding_3 inTensorsSize:2outTensorsSize:1
[2025-09-28 20:14:33.202085] [info] [710583] [gather_ops_runner.cpp:21] GatherOpsRunner::GatherOpsRunner called
[2025-09-28 20:14:33.202486] [debug] [710583] [probe.cpp:30] save ids:3_0_ opType:GatherOperation
[2025-09-28 20:14:33.202514] [info] [710583] [lcal_runner.cpp:26] AllGatherLcclRunner:0 LcalRunner::LcalRunner 19 called, rank : 3/4 commMode: 1 commDomain: 0 magicNumberDisabled: 0
[2025-09-28 20:14:33.202529] [info] [710583] [comm_pool.h:31] GetComm Key: 3_0
[2025-09-28 20:14:33.202731] [info] [710583] [lcal_comm.cpp:372] rank 3/4 running devId:3uid: 0
[2025-09-28 20:14:33.202748] [debug] [710583] [lcal_comm.cpp:95] rtGetSocVersion -- The result after converting ver to string is:Ascend910B3
[2025-09-28 20:14:33.202825] [debug] [710583] [lcal_comm.cpp:432] 3 <-----> 0, halGetPairDevicesInfo: *value = 1
[2025-09-28 20:14:33.202857] [error] [710583] [lcal_comm.cpp:444] do not support pcie > 2 rank! rankSize_ = 4
[2025-09-28 20:14:33.202893] [error] [710583] [lcal_comm.cpp:241] EnablePeerAccess failed!
[2025-09-28 20:14:33.202901] [error] [710583] [lcal_comm.cpp:375] init common failed!
[2025-09-28 20:14:33.202902] [info] [710508] [lcal_comm.cpp:372] rank 2/4 running devId:2uid: 0
[2025-09-28 20:14:33.202907] [error] [710583] [lcal_runner.cpp:111] AllGatherLcclRunner:0 init LcalComm failed, lcalErrorCode : -4
[2025-09-28 20:14:33.202961] [error] [710583] [comm_pool.h:42] CommPool commCreateFunc fail
[2025-09-28 20:14:33.202968] [error] [710583] [lcal_runner.cpp:46] AllGatherLcclRunner:0 get lcal comm from comm pool failed, rank : 3
[2025-09-28 20:14:33.202979] [info] [710583] [lccl_runner.cpp:18] AllGatherLcclRunner:0 LcclRunner::LcclRunner called, rank : 3/4
[2025-09-28 20:14:33.202988] [error] [710583] [lccl_runner.cpp:28] AllGatherLcclRunner:0 GetLcalComm failed, rank: 3
[2025-09-28 20:14:33.202996] [info] [710583] [all_gather_lccl_runner.cpp:23] AllGatherLcclRunner::AllGatherLcclRunner called
[2025-09-28 20:14:33.202996] [debug] [710508] [lcal_comm.cpp:432] 2 <-----> 0, halGetPairDevicesInfo: *value = 1
[2025-09-28 20:14:33.202990] [info] [710431] [lcal_comm.cpp:372] rank 1/4 running devId:1uid: 0
[2025-09-28 20:14:33.203007] [debug] [710583] [probe.cpp:30] save ids:3_1_ opType:AllGatherOperation
[2025-09-28 20:14:33.203025] [error] [710508] [lcal_comm.cpp:444] do not support pcie > 2 rank! rankSize_ = 4
[2025-09-28 20:14:33.203039] [error] [710508] [lcal_comm.cpp:241] EnablePeerAccess failed!
[2025-09-28 20:14:33.203050] [error] [710508] [lcal_comm.cpp:375] init common failed!
[2025-09-28 20:14:33.203061] [error] [710508] [lcal_runner.cpp:111] AllGatherLcclRunner:0 init LcalComm failed, lcalErrorCode : -4
[2025-09-28 20:14:33.203087] [error] [710508] [comm_pool.h:42] CommPool commCreateFunc fail
[2025-09-28 20:14:33.203099] [error] [710508] [lcal_runner.cpp:46] AllGatherLcclRunner:0 get lcal comm from comm pool failed, rank : 2
[2025-09-28 20:14:33.203173] [debug] [710431] [lcal_comm.cpp:432] 1 <-----> 0, halGetPairDevicesInfo: *value = 1
[2025-09-28 20:14:33.203188] [error] [710431] [lcal_comm.cpp:444] do not support pcie > 2 rank! rankSize_ = 4
[2025-09-28 20:14:33.203154] [info] [710508] [lccl_runner.cpp:18] AllGatherLcclRunner:0 LcclRunner::LcclRunner called, rank : 2/4
[2025-09-28 20:14:33.203198] [error] [710431] [lcal_comm.cpp:241] EnablePeerAccess failed!
[2025-09-28 20:14:33.203204] [error] [710431] [lcal_comm.cpp:375] init common failed!
[2025-09-28 20:14:33.203210] [error] [710431] [lcal_runner.cpp:111] AllGatherLcclRunner:0 init LcalComm failed, lcalErrorCode : -4
[2025-09-28 20:14:33.203216] [error] [710508] [lccl_runner.cpp:28] AllGatherLcclRunner:0 GetLcalComm failed, rank: 2
[2025-09-28 20:14:33.203229] [error] [710431] [comm_pool.h:42] CommPool commCreateFunc fail
[2025-09-28 20:14:33.203239] [info] [710508] [all_gather_lccl_runner.cpp:23] AllGatherLcclRunner::AllGatherLcclRunner called
[2025-09-28 20:14:33.203245] [error] [710431] [lcal_runner.cpp:46] AllGatherLcclRunner:0 get lcal comm from comm pool failed, rank : 1
[2025-09-28 20:14:33.203255] [info] [710431] [lccl_runner.cpp:18] AllGatherLcclRunner:0 LcclRunner::LcclRunner called, rank : 1/4
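
Because several processes (PIDs 710583, 710508, 710431) write to the log at once, their lines are interleaved and the first failure in each process is easy to miss. The following is a minimal triage sketch, not part of ATB: it assumes every line follows the `[timestamp] [level] [pid] [file:line] message` layout seen above, and the helper name `first_error_per_process` is hypothetical. It reports the first `[error]` line per process. In the excerpt above that is `lcal_comm.cpp:444` ("do not support pcie > 2 rank! rankSize_ = 4") for each process, and the later errors appear to follow from it.

```python
import re
from collections import OrderedDict

# Assumed layout, taken from the log excerpt above:
# [timestamp] [level] [pid] [file:line] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[(?P<pid>\d+)\] "
    r"\[(?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def first_error_per_process(lines):
    """Map each pid to the (source, message) of its first [error] line.

    Hypothetical helper for reading interleaved logs; lines that do not
    match the assumed layout are skipped.
    """
    first = OrderedDict()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") == "error" and m.group("pid") not in first:
            first[m.group("pid")] = (m.group("src"), m.group("msg"))
    return first


# A few lines copied from the log above, used as sample input.
sample = [
    "[2025-09-28 20:14:33.202825] [debug] [710583] [lcal_comm.cpp:432] 3 <-----> 0, halGetPairDevicesInfo: *value = 1",
    "[2025-09-28 20:14:33.202857] [error] [710583] [lcal_comm.cpp:444] do not support pcie > 2 rank! rankSize_ = 4",
    "[2025-09-28 20:14:33.202893] [error] [710583] [lcal_comm.cpp:241] EnablePeerAccess failed!",
    "[2025-09-28 20:14:33.203025] [error] [710508] [lcal_comm.cpp:444] do not support pcie > 2 rank! rankSize_ = 4",
]

for pid, (src, msg) in first_error_per_process(sample).items():
    print(pid, src, msg)
```

To run it on a full log, pass an open file instead of `sample`, for example `first_error_per_process(open("atb.log"))`, where the file name is a placeholder.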

Labels: bug