[BUG] Outdated Janus implementation needs refactoring #194

Open
3 tasks done
htlou opened this issue Apr 27, 2025 · 8 comments · May be fixed by #197
Member

htlou commented Apr 27, 2025

Required prerequisites

What version of align-anything are you using?

Newest (cutoff 4.27)

System information

System version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0] linux
Align Anything version: newest, cutoff date 0427, after commit eea5af6

Problem description

The existing Janus implementation was merged in February 2025. Since then, align-anything has undergone several major refactorings, which have resulted in the following issues with the Janus implementation:

  • The image output finetuning contains an outdated version of the chat template, which prevents it from simultaneously supporting both Janus and Janus Pro models
  • The image output DPO has a naming inconsistency with DPOTextTrainer, causing a TypeError: DPOTrainer.loss() got an unexpected keyword argument 'batch' error
  • The image input finetuning has issues with dataset loading and inconsistent function naming in the backend, causing a data loading error.
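The `unexpected keyword argument 'batch'` error in the second bullet is the kind of failure Python raises when a caller passes an argument by keyword and an override has renamed that parameter. A minimal sketch of the mechanism follows; the class names here are hypothetical and are not the actual align-anything trainers.

```python
class BaseTrainer:
    """Hypothetical base class; its train_step passes the batch by keyword."""

    def loss(self, batch):
        raise NotImplementedError

    def train_step(self, batch):
        # The keyword name used here must match the parameter name in overrides.
        return self.loss(batch=batch)


class MismatchedTrainer(BaseTrainer):
    # The parameter was renamed, so loss(batch=...) no longer binds.
    def loss(self, sample):
        return len(sample)


class AlignedTrainer(BaseTrainer):
    # Same parameter name as the base class, so the keyword call works.
    def loss(self, batch):
        return len(batch)
```

Calling `MismatchedTrainer().train_step(...)` raises a `TypeError` about the unexpected keyword `batch`, while `AlignedTrainer` works. Keeping parameter names identical across the base class and its overrides avoids this class of error.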

Reproducible example code

/

Traceback

Expected behavior

No response

Additional context

I'm currently fixing these bugs. To be specific:

  • All the bugs in the image output fine-tuning/DPO are fixed.
  • The dataset loading error in the image input fine-tuning has been successfully located. I will fix this error in the next few days.

After all the bugs are fixed and tested, I will open a PR and merge these modifications. All the updates during this process will be reported in this thread.

htlou added the bug label on Apr 27, 2025
htlou self-assigned this on Apr 27, 2025
Member Author

htlou commented Apr 27, 2025

This issue is created to consolidate discussions about these problems from recent issues, such as #187 and #184.

htlou linked a pull request on Apr 29, 2025 that will close this issue
9 tasks

NROwind commented May 5, 2025

> This issue is created to consolidate discussions about these problems from recent issues, such as #187 and #184.

Hello, has the `TypeError: deepspeed.utils.nvtx.instrument_w_nvtx.<locals>.wrapped_fn() got multiple values for keyword argument 'task'` problem mentioned in #184 been resolved?

Member Author

htlou commented May 6, 2025

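> Hello, has the `TypeError: deepspeed.utils.nvtx.instrument_w_nvtx.<locals>.wrapped_fn() got multiple values for keyword argument 'task'` problem mentioned in #184 been resolved?

This problem has been resolved in this PR. The PR currently supports SFT and DPO training for both Janus and Janus Pro models.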
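For readers unfamiliar with the `got multiple values for keyword argument 'task'` message: Python raises it when the same keyword reaches a function twice in one call, for example once explicitly and once through an unpacked dictionary. The sketch below shows that mechanism only. The function and the batch layout are hypothetical, and whether this is the exact cause in #184 is an assumption not confirmed in this thread.

```python
def forward(input_ids, task="understanding"):
    """Hypothetical model call that accepts a 'task' keyword."""
    return f"{task}:{len(input_ids)}"


def call_with_duplicate_task(batch):
    # The batch dict already carries 'task', so passing task= again collides.
    return forward(**batch, task="generation")


def call_deduplicated(batch):
    # Drop the conflicting key before unpacking, then pass task= exactly once.
    inputs = {key: value for key, value in batch.items() if key != "task"}
    return forward(**inputs, task="generation")


example_batch = {"input_ids": [1, 2, 3], "task": "understanding"}
```

`call_with_duplicate_task(example_batch)` raises the `TypeError`, while `call_deduplicated(example_batch)` succeeds because `task` is supplied only once.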


NROwind commented May 8, 2025

> > Hello, has the `TypeError: deepspeed.utils.nvtx.instrument_w_nvtx.<locals>.wrapped_fn() got multiple values for keyword argument 'task'` problem mentioned in #184 been resolved?
>
> This problem has been resolved in this PR. The PR currently supports SFT and DPO training for both Janus and Janus Pro models.

Hello, when I run the latest code I get the following error:
[rank0]: Traceback (most recent call last):
[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]: File "/home/czh/code/align-anything/align_anything/trainers/janus/sft.py", line 115, in <module>
[rank0]: sys.exit(main())
[rank0]: ^^^^^^
[rank0]: File "/home/czh/code/align-anything/align_anything/trainers/janus/sft.py", line 109, in main
[rank0]: trainer = SuperviseTrainer(cfgs=cfgs, ds_cfgs=ds_cfgs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/czh/code/align-anything/align_anything/trainers/text_to_text/sft.py", line 59, in __init__
[rank0]: self.init_datasets()
[rank0]: File "/home/czh/code/align-anything/align_anything/trainers/janus/sft.py", line 50, in init_datasets
[rank0]: self.train_dataloader, self.eval_dataloader = self.get_dataloaders(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/czh/code/align-anything/align_anything/trainers/base/supervised_trainer.py", line 93, in get_dataloaders
[rank0]: train_dataset = train_data_dtype(
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/home/czh/code/align-anything/align_anything/datasets/janus/supervised.py", line 78, in __init__
[rank0]: self.raw_data = load_dataset(
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/home/czh/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 2062, in load_dataset
[rank0]: builder_instance = load_dataset_builder(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/czh/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 1782, in load_dataset_builder
[rank0]: dataset_module = dataset_module_factory(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/czh/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 1519, in dataset_module_factory
[rank0]: ).get_module()
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/home/czh/anaconda3/lib/python3.12/site-packages/datasets/load.py", line 822, in get_module
[rank0]: data_files = DataFilesDict.from_patterns(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/czh/anaconda3/lib/python3.12/site-packages/datasets/data_files.py", line 690, in from_patterns
[rank0]: else DataFilesList.from_patterns(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/czh/anaconda3/lib/python3.12/site-packages/datasets/data_files.py", line 583, in from_patterns
[rank0]: resolve_pattern(
[rank0]: File "/home/czh/anaconda3/lib/python3.12/site-packages/datasets/data_files.py", line 384, in resolve_pattern
[rank0]: raise FileNotFoundError(error_msg)
[rank0]: FileNotFoundError: Unable to find '/home/czh/code/align-anything/projects/janus/example/supervised/text_to_image/train_tokenized.pt' with any supported extension ['.csv', '.tsv', '.json', '.jsonl', '.ndjson', '.parquet', '.geoparquet', '.gpq', '.arrow', '.txt', '.tar', '.xml', '.blp', '.bmp', '.dib', '.bufr', '.cur', '.pcx', '.dcx', '.dds', '.ps', '.eps', '.fit', '.fits', '.fli', '.flc', '.ftc', '.ftu', '.gbr', '.gif', '.grib', '.png', '.apng', '.jp2', '.j2k', '.jpc', '.jpf', '.jpx', '.j2c', '.icns', '.ico', '.im', '.iim', '.tif', '.tiff', '.jfif', '.jpe', '.jpg', '.jpeg', '.mpg', '.mpeg', '.msp', '.pcd', '.pxr', '.pbm', '.pgm', '.ppm', '.pnm', '.psd', '.bw', '.rgb', '.rgba', '.sgi', '.ras', '.tga', '.icb', '.vda', '.vst', '.webp', '.wmf', '.emf', '.xbm', '.xpm', '.BLP', '.BMP', '.DIB', '.BUFR', '.CUR', '.PCX', '.DCX', '.DDS', '.PS', '.EPS', '.FIT', '.FITS', '.FLI', '.FLC', '.FTC', '.FTU', '.GBR', '.GIF', '.GRIB', '.PNG', '.APNG', '.JP2', '.J2K', '.JPC', '.JPF', '.JPX', '.J2C', '.ICNS', '.ICO', '.IM', '.IIM', '.TIF', '.TIFF', '.JFIF', '.JPE', '.JPG', '.JPEG', '.MPG', '.MPEG', '.MSP', '.PCD', '.PXR', '.PBM', '.PGM', '.PPM', '.PNM', '.PSD', '.BW', '.RGB', '.RGBA', '.SGI', '.RAS', '.TGA', '.ICB', '.VDA', '.VST', '.WEBP', '.WMF', '.EMF', '.XBM', '.XPM', '.aiff', '.au', '.avr', '.caf', '.flac', '.htk', '.svx', '.mat4', '.mat5', '.mpc2k', '.ogg', '.paf', '.pvf', '.raw', '.rf64', '.sd2', '.sds', '.ircam', '.voc', '.w64', '.wav', '.nist', '.wavex', '.wve', '.xi', '.mp3', '.opus', '.AIFF', '.AU', '.AVR', '.CAF', '.FLAC', '.HTK', '.SVX', '.MAT4', '.MAT5', '.MPC2K', '.OGG', '.PAF', '.PVF', '.RAW', '.RF64', '.SD2', '.SDS', '.IRCAM', '.VOC', '.W64', '.WAV', '.NIST', '.WAVEX', '.WVE', '.XI', '.MP3', '.OPUS', '.mkv', '.mp4', '.avi', '.mov', '.MKV', '.MP4', '.AVI', '.MOV', '.pdf', '.PDF', '.zip']

Does it run correctly on your side?
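The traceback above ends in `datasets.load_dataset` being given a path that ends in `train_tokenized.pt`, and `.pt` does not appear in the list of supported extensions that the error prints. As a rough illustration only, a loader could choose a backend from the file suffix before calling `load_dataset`. The helper name, the returned labels, and the suffix set below are assumptions for this sketch and are not taken from the align-anything code or from the fix in the PR.

```python
from pathlib import Path

# Assumption: pre-tokenized tensors are saved with these suffixes.
TORCH_SUFFIXES = {".pt", ".pth"}


def choose_loader(data_path: str) -> str:
    """Return a hypothetical backend label based on the file suffix.

    "torch" means the file would be read with torch.load;
    "datasets" means the path would be passed to datasets.load_dataset.
    """
    suffix = Path(data_path).suffix.lower()
    return "torch" if suffix in TORCH_SUFFIXES else "datasets"
```

Under this sketch, a path such as `train_tokenized.pt` would be routed away from `load_dataset`. The same error can also appear when the file simply does not exist at that path, so checking that the pre-tokenized file has actually been generated is worthwhile too.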


NROwind commented May 8, 2025

The code I pulled is as follows:
git clone git@github.com:PKU-Alignment/align-anything.git
git fetch origin pull/197/head:pr-197
git checkout pr-197
git checkout d66557d
Could you please take a look at what is wrong?

Member Author

htlou commented May 8, 2025

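This problem was found during testing earlier this month. It was caused by compatibility not being preserved when ti2t was developed. The fix has already been written and will be submitted in the next commit of this PR.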


NROwind commented May 9, 2025

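> This problem was found during testing earlier this month. It was caused by compatibility not being preserved when ti2t was developed. The fix has already been written and will be submitted in the next commit of this PR.

Understood. Could you tell me roughly when the next commit will be? Thank you!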


NROwind commented May 17, 2025

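> This problem was found during testing earlier this month. It was caused by compatibility not being preserved when ti2t was developed. The fix has already been written and will be submitted in the next commit of this PR.

Is there an update planned soon?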
