
Conversation

@qgallouedec
Member

For some reason, we have a memory spike with liger that regularly breaks the CI. Marking this test as xfail; the allocation trace below shows where the spike occurs, and a sketch of the marker follows it.

[Screenshot, 2025-09-23 1:27 PM: memory profiler view of the failing allocation; the trace text is reproduced below.]
13970 Addr: b7fa734000000_0, Size: 777.6MiB (815351040 bytes) allocation, Total memory used after allocation: 11.5GiB (12341773448 bytes), timestamp Tue Sep 23 2025 13:14:13 GMT-0600 (Mountain Daylight Time)
CUDACachingAllocator.cpp:0:c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::malloc(signed char, unsigned long, CUstream_st*)
:0:c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::malloc(void**, signed char, unsigned long, CUstream_st*)
:0:c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::allocate(unsigned long)
:0:c10::StorageImpl::StorageImpl(c10::StorageImpl::use_byte_size_t, c10::SymInt const&, c10::Allocator*, bool)
EmptyTensor.cpp:0:at::TensorBase at::detail::_empty_generic<long>(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, std::optional<c10::Device>, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_cuda(c10::ArrayRef<long>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, std::optional<c10::MemoryFormat>)
??:0:at::detail::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&)
RegisterCUDA_0.cpp:0:at::(anonymous namespace)::create_out(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&)
RegisterCUDA_0.cpp:0:at::(anonymous namespace)::structured_ufunc_add_CUDA_functional::set_output_raw_strided(long, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions, c10::ArrayRef<at::Dimname>)
??:0:at::TensorIteratorBase::fast_set_up(at::TensorIteratorConfig const&)
??:0:at::TensorIteratorBase::build(at::TensorIteratorConfig&)
??:0:at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&)
??:0:at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&)
RegisterCUDA_0.cpp:0:at::(anonymous namespace)::wrapper_CUDA_add_Tensor(at::Tensor const&, at::Tensor const&, c10::Scalar const&)
RegisterCUDA_0.cpp:0:c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&, c10::Scalar const&), &at::(anonymous namespace)::wrapper_CUDA_add_Tensor>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&, c10::Scalar const&> >, at::Tensor (at::Tensor const&, at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
??:0:at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:torch::autograd::VariableType::(anonymous namespace)::add_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::add_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
ADInterpreters.cpp:0:c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0]
ADInterpreters.cpp:0:at::functorch::autogradBasedTransformSendToNext(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, at::functorch::Interpreter const&, at::functorch::TransformType, std::optional<bool>, std::optional<bool>, bool)
??:0:at::functorch::GradInterpreterPtr::sendToNextInterpreterImpl(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, bool)
:0:at::functorch::Interpreter::sendToNextInterpreter(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, bool)
DynamicLayer.cpp:0:at::functorch::dynamicLayerBack(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, bool)
??:0:at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:torch::autograd::VariableType::(anonymous namespace)::add_Tensor(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&)
VariableType_2.cpp:0:c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::add_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
ADInterpreters.cpp:0:c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const [clone .isra.0]
ADInterpreters.cpp:0:at::functorch::autogradBasedTransformProcess(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*, long, at::functorch::TransformType)
:0:at::functorch::Interpreter::process(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
DynamicLayer.cpp:0:at::functorch::dynamicLayerFrontFallback(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*)
??:0:at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&)
input_buffer.cpp:0:torch::autograd::accumulate(std::vector<at::Tensor, std::allocator<at::Tensor> >&, unsigned long, at::Tensor&&)
??:0:torch::autograd::InputBuffer::add(unsigned long, at::Tensor&&, std::optional<c10::Stream> const&, std::optional<c10::Stream> const&)
??:0:torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
??:0:torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
??:0:torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
:0:torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
thread.cc:0:execute_native_thread_routine
/build/glibc-B3wQXB/glibc-2.31/nptl/pthread_create.c:477:start_thread
??:0:clone
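
Concretely, the change boils down to an xfail marker along these lines (a minimal sketch, not the literal diff; the test name here is a hypothetical stand-in for the real GKD trainer test):

```python
import pytest

# Hypothetical test name; the real one lives in the GKD trainer test suite.
@pytest.mark.xfail(
    reason="Memory spike with liger regularly OOMs the CI runner",
    strict=False,  # the spike is intermittent, so the test may still pass
)
def test_gkd_trainer_with_liger():
    ...
```

With `strict=False`, runs where the spike does not occur simply report XPASS instead of failing the suite.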

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova
Member

Thanks for raising the issue.

Don't you think it would be worth investigating the underlying issue causing this test to fail, rather than hiding it with xfail?

@qgallouedec
Member Author

Yes, I agree, but it's probably due to an internal error in liger, and I'm not familiar with Triton. I think it would be best for Kashif to investigate, but he's away at the moment.
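
In the meantime, for whoever picks this up: the per-allocation trace above looks like output from PyTorch's CUDA memory snapshot tooling, so reproducing it locally should only need something like the following (a sketch, assuming one training step with liger enabled triggers the spike; `_record_memory_history` is a private but documented PyTorch API):

```python
import torch

# Record a stack trace for every CUDA allocation, like the trace above.
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run one GKD training step with liger enabled here ...

# Dump the snapshot; open the pickle at https://pytorch.org/memory_viz
# to locate the ~777.6 MiB allocation from the trace.
torch.cuda.memory._dump_snapshot("liger_oom_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```

The frames in the trace show the allocation happening while autograd accumulates gradients (`InputBuffer::add` → `add_Tensor`) under a functorch transform, which at least narrows where to look.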

@qgallouedec mentioned this pull request Sep 24, 2025
@albertvillanova
Member

OK. Then we could merge this PR as a hotfix, but I would suggest opening an issue to address the underlying problem, so we don't forget it.

What do you think?

@qgallouedec
Copy link
Member Author

qgallouedec commented Sep 24, 2025

Agreed, done here: #4140

@qgallouedec changed the title from "Mark GKD trainer test as expected failure due to OOM issue" to "🌵 Mark GKD trainer test as expected failure due to OOM issue" Sep 24, 2025
@qgallouedec merged commit 094e076 into main Sep 24, 2025
11 of 12 checks passed
@qgallouedec deleted the x-fail-oom-test branch September 24, 2025 18:26