
Conversation

@qgallouedec
Member

For some reason, we have a memory spike with liger that regularly breaks the CI. Marking this test as xfail; the allocation trace below shows where the spike occurs, and a sketch of the marker follows it.

[Screenshot, 2025-09-23 1:27 PM: memory profiler view of the failing allocation; the trace text is reproduced below.]
13970 Addr: b7fa734000000_0, Size: 777.6MiB (815351040 bytes) allocation, Total memory used after allocation: 11.5GiB (12341773448 bytes), timestamp Tue Sep 23 2025 13:14:13 GMT-0600 (Mountain Daylight Time)
CUDACachingAllocator.cpp:0:c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::malloc(signed char, unsigned long, CUstream_st*)
:0:c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::malloc(void**, signed char, unsigned long, CUstream_st*)
:0:c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::allocate(unsigned long)
:0:c10::StorageImpl::StorageImpl(c10::StorageImpl::use_byte_size_t, c10::SymInt const&, c10::Allocator*, bool)
EmptyTensor.cpp:0:at::TensorBase at::detail::_empty_generic<long>(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&)
RegisterCUDA_0.cpp:0:at::(anonymous namespace)::create_out(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&)
RegisterCUDA_0.cpp:0:at::(anonymous namespace)::structured_ufunc_add_CUDA_functional::set_output_raw_strided(long, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions, c10::ArrayRef<at::Dimname>)
??:0:at::TensorIteratorBase::fast_set_up(at::TensorIteratorConfig const&)
??:0:at::TensorIteratorBase::build(at::TensorIteratorConfig&)
??:0:at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&)
??:0:at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&)
RegisterCUDA_0.cpp:0:at::(anonymous namespace)::wrapper_CUDA_add_Tensor(at::Tensor const&, at::Tensor const&, c10::Scalar const&)
RegisterCUDA_0.cpp:0:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::Scalar const&), &at::(anonymous namespace)::wrapper_CUDA_add_Tensor>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::Scalar const&> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
??:0:at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:torch::autograd::VariableType::(anonymous namespace)::add_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::add_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
ADInterpreters.cpp:0:c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0]
ADInterpreters.cpp:0:at::functorch::autogradBasedTransformSendToNext(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, at::functorch::Interpreter const&, at::functorch::TransformType, std::optional<bool>, std::optional<bool>, bool)
??:0:at::functorch::GradInterpreterPtr::sendToNextInterpreterImpl(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, bool)
:0:at::functorch::Interpreter::sendToNextInterpreter(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, bool)
DynamicLayer.cpp:0:at::functorch::dynamicLayerBack(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, bool)
??:0:at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:torch::autograd::VariableType::(anonymous namespace)::add_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::add_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
ADInterpreters.cpp:0:c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0]
ADInterpreters.cpp:0:at::functorch::autogradBasedTransformProcess(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, long, at::functorch::TransformType)
:0:at::functorch::Interpreter::process(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
DynamicLayer.cpp:0:at::functorch::dynamicLayerFrontFallback(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
??:0:at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&)
input_buffer.cpp:0:torch::autograd::accumulate(std::vector<at::Tensor, std::allocator<at::Tensor> >&, unsigned long, at::Tensor&&)
??:0:torch::autograd::InputBuffer::add(unsigned long, at::Tensor&&, std::optional<c10::Stream> const&, std::optional<c10::Stream> const&)
??:0:torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
??:0:torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
??:0:torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
:0:torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
thread.cc:0:execute_native_thread_routine
/build/glibc-B3wQXB/glibc-2.31/nptl/pthread_create.c:477:start_thread
??:0:clone
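
Concretely, the change boils down to an xfail marker along these lines (a minimal sketch, not the literal diff; the test name here is a hypothetical stand-in for the real GKD trainer test):

```python
import pytest

# Hypothetical test name; the real one lives in the GKD trainer test suite.
@pytest.mark.xfail(
    reason="Memory spike with liger regularly OOMs the CI runner",
    strict=False,  # the spike is intermittent, so the test may still pass
)
def test_gkd_trainer_with_liger():
    ...
```

With `strict=False`, runs where the spike does not occur simply report XPASS instead of failing the suite.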

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova
Member

Thanks for raising the issue.

Don't you think it would be worth investigating the underlying issue causing this test to fail, rather than hiding it with xfail?

@qgallouedec
Member Author

Yes, I agree, but it's probably due to an internal error in liger, and I'm not familiar with Triton. I think it would be best for Kashif to investigate, but he's away at the moment.
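
In the meantime, for whoever picks this up: the per-allocation trace above looks like output from PyTorch's CUDA memory snapshot tooling, so reproducing it locally should only need something like the following (a sketch, assuming one training step with liger enabled triggers the spike; `_record_memory_history` is a private but documented PyTorch API):

```python
import torch

# Record a stack trace for every CUDA allocation, like the trace above.
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run one GKD training step with liger enabled here ...

# Dump the snapshot; open the pickle at https://pytorch.org/memory_viz
# to locate the ~777.6 MiB allocation from the trace.
torch.cuda.memory._dump_snapshot("liger_oom_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```

The frames in the trace show the allocation happening while autograd accumulates gradients (`InputBuffer::add` → `add_Tensor`) under a functorch transform, which at least narrows where to look.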

@qgallouedec mentioned this pull request Sep 24, 2025
@albertvillanova
Member

OK. Then we could merge this PR as a hotfix, but I would suggest opening an issue to address the underlying problem, so we don't forget it.

What do you think?

@qgallouedec
Copy link
Member Author

qgallouedec commented Sep 24, 2025

Agreed, done here: #4140

@qgallouedec changed the title from "Mark GKD trainer test as expected failure due to OOM issue" to "🌵 Mark GKD trainer test as expected failure due to OOM issue" Sep 24, 2025
@qgallouedec merged commit 094e076 into main Sep 24, 2025
11 of 12 checks passed
@qgallouedec deleted the x-fail-oom-test branch September 24, 2025 18:26