
【Hackathon 8th No.16】 data_efficient_nopt Paper Reproduction #1111


Open

wants to merge 40 commits into develop from data_effient_nopt

Conversation

xiaoyewww (Contributor)

PR types

New Features

PR changes

Others

Describe

Support data_efficient_nopt.


paddle-bot bot commented Mar 23, 2025

Thanks for your contribution!

@wangguan1995 (Contributor):

If there are reproducible accuracy results, please post log screenshots to GitHub and upload the log files; testing can then start on our side.

@xiaoyewww (Contributor, Author):

> https://github.com/delta-lab-ai/data_efficient_nopt/blob/main/pretrain_basic.py#L262 @wangguan1995 Does Paddle currently have no implementation of gaussian_blur?

> [screenshot] Not at the moment.

OK, I have implemented gaussian_blur for Paddle by referring to the original.
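For reference, a gaussian_blur equivalent can be assembled in Paddle from a separable Gaussian kernel applied as a depthwise convolution. A minimal sketch, assuming NCHW input and not necessarily matching the PR's actual implementation:

```python
import paddle
import paddle.nn.functional as F


def gaussian_blur(x: paddle.Tensor, kernel_size: int = 5, sigma: float = 1.0) -> paddle.Tensor:
    """Blur an NCHW tensor with a depthwise Gaussian convolution."""
    half = (kernel_size - 1) / 2.0
    coords = paddle.arange(kernel_size, dtype="float32") - half
    gauss = paddle.exp(-(coords**2) / (2.0 * sigma**2))
    gauss = gauss / gauss.sum()                      # normalized 1-D Gaussian
    kernel2d = paddle.outer(gauss, gauss)            # separable -> 2-D kernel
    channels = x.shape[1]
    weight = kernel2d.reshape([1, 1, kernel_size, kernel_size]).tile([channels, 1, 1, 1])
    pad = kernel_size // 2
    x = F.pad(x, [pad, pad, pad, pad], mode="reflect")
    return F.conv2d(x, weight, groups=channels)      # one kernel per channel
```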

@xiaoyewww (Contributor, Author) commented Apr 2, 2025

> If there are reproducible accuracy results, please post log screenshots to GitHub and upload the log files; testing can then start on our side.

[screenshot: training loss curves] I have reproduced the Poisson FNO pretraining. The Paddle and PyTorch runs do not fix the random seed, so the early losses differ, but after a few hundred steps the trends agree.

The reproduced results differ slightly from the paper's; I suspect some hyperparameter differs, but the paper does not describe the relevant settings:
[screenshot: comparison against the paper's numbers]

@xiaoyewww (Contributor, Author):

Poisson FNO inference results, using the officially provided weights:

# torch
RMSE: 0.25861763998531323 RMSE (normalized) 0.14146761527157586 R2: 0.9765389656726264 Slope: 0.9752451781576813

# paddle
RMSE: 0.25861764924824066 RMSE (normalized) 0.14146758505387425 R2: 0.9765389632378143 Slope: 0.9752452311012886

@xiaoyewww (Contributor, Author) commented Apr 6, 2025

Comparison over the first 10 training steps (plot: poisson_fno_combined_train_loss_10steps):

paddle:

Epoch 1 Batch 0 Train Loss 0.3359823226928711 train_l2 loss 1.0018577575683594 train_rmse loss 0.7787399888038635
Total Times. Global step: 0, Batch: 0, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 1.274991750717163, Forward: 0.5737528800964355, Backward: 0.18942928314208984, Optimizer: 0.023642539978027344
Epoch 1 Batch 1 Train Loss 0.3453991115093231 train_l2 loss 0.9957258105278015 train_rmse loss 0.7938657999038696
Total Times. Global step: 1, Batch: 1, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08668327331542969, Forward: 0.019646406173706055, Backward: 0.012578487396240234, Optimizer: 0.02925419807434082
Epoch 1 Batch 2 Train Loss 0.33508121967315674 train_l2 loss 0.9866492748260498 train_rmse loss 0.7812024354934692
Total Times. Global step: 2, Batch: 2, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08733677864074707, Forward: 0.016697168350219727, Backward: 0.011176347732543945, Optimizer: 0.032007694244384766
Epoch 1 Batch 3 Train Loss 0.3373328149318695 train_l2 loss 0.9720052480697632 train_rmse loss 0.7818474769592285
Total Times. Global step: 3, Batch: 3, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08364057540893555, Forward: 0.01732492446899414, Backward: 0.011551856994628906, Optimizer: 0.0324702262878418
Epoch 1 Batch 4 Train Loss 0.3260154128074646 train_l2 loss 0.9649569988250732 train_rmse loss 0.7634750604629517
Total Times. Global step: 4, Batch: 4, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08973145484924316, Forward: 0.017143964767456055, Backward: 0.011417150497436523, Optimizer: 0.0321955680847168
Epoch 1 Batch 5 Train Loss 0.33446627855300903 train_l2 loss 0.9470340609550476 train_rmse loss 0.7787452936172485
Total Times. Global step: 5, Batch: 5, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08418393135070801, Forward: 0.015747785568237305, Backward: 0.010698318481445312, Optimizer: 0.03490447998046875
Epoch 1 Batch 6 Train Loss 0.31356751918792725 train_l2 loss 0.9271667003631592 train_rmse loss 0.7467784285545349
Total Times. Global step: 6, Batch: 6, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08341240882873535, Forward: 0.015986919403076172, Backward: 0.010836601257324219, Optimizer: 0.03387713432312012
Epoch 1 Batch 7 Train Loss 0.32571274042129517 train_l2 loss 0.918164074420929 train_rmse loss 0.7629383206367493
Total Times. Global step: 7, Batch: 7, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08514046669006348, Forward: 0.01694965362548828, Backward: 0.011275529861450195, Optimizer: 0.033127784729003906
Epoch 1 Batch 8 Train Loss 0.3198857605457306 train_l2 loss 0.8946603536605835 train_rmse loss 0.7452360391616821
Total Times. Global step: 8, Batch: 8, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08858108520507812, Forward: 0.01703476905822754, Backward: 0.011278867721557617, Optimizer: 0.032628536224365234
Epoch 1 Batch 9 Train Loss 0.28028005361557007 train_l2 loss 0.8539849519729614 train_rmse loss 0.6707751154899597
Total Times. Global step: 9, Batch: 9, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.08603501319885254, Forward: 0.017177581787109375, Backward: 0.011240959167480469, Optimizer: 0.032387495040893555
Epoch 1 Batch 10 Train Loss 0.303079217672348 train_l2 loss 0.8385870456695557 train_rmse loss 0.7334427833557129
Total Times. Global step: 10, Batch: 10, Rank: 0, Data Shape: [128, 4, 64, 64], Data time: 0.07994580268859863, Forward: 0.01645064353942871, Backward: 0.011035442352294922, Optimizer: 0.03404521942138672

torch:

Epoch 1 Batch 0 Train Loss 0.3359190821647644
Total Times. Batch: 0, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 1.5983459949493408, Forward: 2.262549877166748, Backward: 0.5132086277008057, Optimizer: 0.012012720108032227
Epoch 1 Batch 1 Train Loss 0.34538960456848145
Total Times. Batch: 1, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03945159912109375, Forward: 0.03422045707702637, Backward: 0.045007944107055664, Optimizer: 0.010618925094604492
Epoch 1 Batch 2 Train Loss 0.33507877588272095
Total Times. Batch: 2, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03759288787841797, Forward: 0.011688470840454102, Backward: 0.06759905815124512, Optimizer: 0.010719060897827148
Epoch 1 Batch 3 Train Loss 0.3374229967594147
Total Times. Batch: 3, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03585243225097656, Forward: 0.011916399002075195, Backward: 0.06729936599731445, Optimizer: 0.010690450668334961
Epoch 1 Batch 4 Train Loss 0.32614579796791077
Total Times. Batch: 4, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03865694999694824, Forward: 0.011200428009033203, Backward: 0.06806373596191406, Optimizer: 0.010601520538330078
Epoch 1 Batch 5 Train Loss 0.33475780487060547
Total Times. Batch: 5, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03668999671936035, Forward: 0.011771440505981445, Backward: 0.06752943992614746, Optimizer: 0.010657072067260742
Epoch 1 Batch 6 Train Loss 0.3140023946762085
Total Times. Batch: 6, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.0358736515045166, Forward: 0.011792898178100586, Backward: 0.0673990249633789, Optimizer: 0.010664939880371094
Epoch 1 Batch 7 Train Loss 0.3263723850250244
Total Times. Batch: 7, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03751945495605469, Forward: 0.011594295501708984, Backward: 0.06769108772277832, Optimizer: 0.010838031768798828
Epoch 1 Batch 8 Train Loss 0.3208008110523224
Total Times. Batch: 8, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03836321830749512, Forward: 0.011111259460449219, Backward: 0.06809735298156738, Optimizer: 0.010608196258544922
Epoch 1 Batch 9 Train Loss 0.28148674964904785
Total Times. Batch: 9, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.03702974319458008, Forward: 0.01204824447631836, Backward: 0.06731843948364258, Optimizer: 0.010616540908813477
Epoch 1 Batch 10 Train Loss 0.30438655614852905
Total Times. Batch: 10, Rank: 0, Data Shape: torch.Size([128, 4, 64, 64]), Data time: 0.034720659255981445, Forward: 0.011367082595825195, Backward: 0.06775617599487305, Optimizer: 0.010625123977661133
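For future step-level parity checks, fixing every RNG makes each framework's run reproducible (exact cross-framework agreement would additionally require identical initial weights and deterministic ops). A minimal sketch:

```python
import random

import numpy as np
import paddle
import torch


def seed_everything(seed: int = 42) -> None:
    """Fix all RNGs so each framework's run is reproducible step by step."""
    random.seed(seed)
    np.random.seed(seed)
    paddle.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```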

@xiaoyewww (Contributor, Author):

The helmholtz_64 FNO matches the possion_64 FNO; both use the same model architecture.

@luotao1 changed the title from 【Hackathon 8th No.13】 data_efficient_nopt Paper Reproduction to 【Hackathon 8th No.16】 data_efficient_nopt Paper Reproduction on May 8, 2025
@@ -0,0 +1,49 @@
import paddle
import torch
Contributor:

Please clean up the torch-related code.

Contributor (Author):

This script is for checkpoint weight conversion, since there are currently no resources to fully pretrain a model.

@@ -0,0 +1,13 @@
# Automatically generated by https://github.com/damnever/pigar.

adan-pytorch==0.1.0
Contributor:

Same as above.

Contributor (Author):

done. thx.

"""
loss functions
# """
# import logging
Contributor:

The commented-out content needs to be cleaned up before merging.

Contributor (Author):

done. thx.



def get_forcing(S):
# x1 = (
Contributor:

Same as above.

Contributor (Author):

done. thx.

@@ -0,0 +1,40 @@
# Usage

## 1. Data Download
Contributor:

https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/development/#3
The documentation needs to follow the reproduction guide (convert the README into doc form).

Contributor (Author):

The documentation part is done. How should the results section be written?

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from timm.models.layers import DropPath
Contributor:

The torch-ecosystem library imported here raises an error:

Traceback (most recent call last):
  File "/workspace/PaddleScience_repo/data_efficient_nopt/examples/data_efficient_nopt/inference_fno_helmholtz_poisson.py", line 17, in <module>
    from models.fno import build_fno
  File "/workspace/PaddleScience_repo/data_efficient_nopt/examples/data_efficient_nopt/models/fno.py", line 9, in <module>
    from timm.models.layers import DropPath
ModuleNotFoundError: No module named 'timm'

Contributor (Author):

The dependency has been removed.
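For reference, removing the timm dependency usually means re-implementing DropPath (stochastic depth) in plain Paddle. A minimal sketch of such a replacement; this is an assumption about the approach, not the PR's verified code:

```python
import paddle
import paddle.nn as nn


class DropPath(nn.Layer):
    """Stochastic depth: randomly drop whole residual branches per sample."""

    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: paddle.Tensor) -> paddle.Tensor:
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        shape = [x.shape[0]] + [1] * (x.ndim - 1)
        mask = paddle.bernoulli(paddle.full(shape, keep_prob, dtype=x.dtype))
        return x / keep_prob * mask  # rescale so the expectation is unchanged
```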

@wangguan1995 (Contributor):

- The model folder needs to be moved into arch.
- Data-related code needs to be moved into data.
- The scripts need to be reorganized.

See: https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/development/

@wangguan1995 (Contributor) left a review:

To fix


The paper addresses the problems mentioned above with the following approaches:

1. Unsupervised pretraining
@wangguan1995 (Contributor) commented Jun 13, 2025:

Documentation is essential for scientific-computing examples and is critical to how well a case and its code spread. Please write it in more detail (see the DrivAerNet++ docs, which are very thorough). Package the images that appear in the docs and send them to me; I will create hosted links for you.

Contributor (Author):

Added.

k = k.replace("running_var", "_variance")
k = k.replace("running_mean", "_mean")
k = k.replace("module.", "")
# 添加到飞桨权重字典中
Contributor:

The Chinese comments need to be cleaned up.

Contributor (Author):

Fixed.
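For reference, the key renames under review are the core of a torch-to-paddle checkpoint converter. A minimal sketch, where the function name and the Linear-transpose heuristic are assumptions rather than the PR's exact script:

```python
import paddle
import torch


def convert_checkpoint(torch_path: str, paddle_path: str) -> None:
    """Assumes the checkpoint is a flat name->tensor state dict."""
    state = torch.load(torch_path, map_location="cpu")
    paddle_state = {}
    for k, v in state.items():
        k = k.replace("running_var", "_variance")
        k = k.replace("running_mean", "_mean")
        k = k.replace("module.", "")  # drop the DataParallel prefix
        w = v.numpy()
        # Heuristic: Paddle stores Linear weights as [in, out], the transpose
        # of PyTorch's [out, in]; 4-D conv kernels already match.
        if k.endswith(".weight") and w.ndim == 2:
            w = w.T
        paddle_state[k] = paddle.to_tensor(w)
    paddle.save(paddle_state, paddle_path)
```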

and len(self.masking) == 2
): # and self.masking[1] > 0.:
mask = self.mask_generator()
# return x, file_idx, paddle.to_tensor(self.subset_dict[self.sub_dsets[file_idx].get_name()]), bcs, y, mask, x_blur
@wangguan1995 (Contributor) commented Jun 13, 2025:

The comments need to be cleaned up.

Contributor (Author):

Fixed.

@@ -0,0 +1,376 @@
default: &DEFAULT
@xiaoyewww (Contributor, Author) commented Jun 14, 2025:

I don't think this YAML file can be optimized further; it is all just different training-related configuration parameters. The main part has been put in data_efficient_nopt.yaml.

Contributor (Author):

Fixed.

paddle.set_device(device)

# Modify params
params["batch_size"] = int(params.batch_size // world_size)
Contributor:

These parameters should be moved into the corresponding YAML file to improve code readability.

Contributor (Author):

Fixed.

from ruamel.yaml.comments import CommentedMap as ruamelDict
from scipy.stats import linregress
from tqdm import tqdm
from utils import logging_utils

Contributor (Author):

Fixed.

from tqdm import tqdm


def _get_act(activation):
Contributor:

Could the activation-function code reuse the following file, so that the reproduction code is simpler and more compact?
https://github.com/PaddlePaddle/PaddleScience/blob/develop/ppsci/arch/activation.py

Contributor (Author):

done.
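For context, the replaced helper amounted to a name-to-layer table like the sketch below; the reviewer's point is that ppsci/arch/activation.py already maintains such a mapping, so the example need not duplicate it (the names here are illustrative):

```python
import paddle.nn as nn

_ACTS = {
    "relu": nn.ReLU,
    "gelu": nn.GELU,
    "silu": nn.Silu,
    "tanh": nn.Tanh,
}


def _get_act(activation: str) -> nn.Layer:
    """Map an activation name to a Paddle layer instance."""
    try:
        return _ACTS[activation.lower()]()
    except KeyError as err:
        raise ValueError(f"Unsupported activation: {activation}") from err
```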

# Update the file paths in `examples/data_efficient_nopt/config/data_efficient_nopt.yaml`, set the mode to `train`, and then specify `train_path`, `val_path`, `test_path`, `scales_path` and `train_rand_idx_path`

# Pretrain or finetune, for possion_64 or helmholtz_64.
# Set config_name to `data_efficient_nopt_fno_poisson` for fno_possion, or to `data_efficient_nopt_fno_helmholtz` for fno_helmholtz
Contributor:

The reproduced accuracy metrics need to be added.

Contributor (Author):

done.


=== "模型评估命令"

暂无
Contributor:

Model inference metrics and the example checkpoint need to be added.

Contributor (Author):

done.


Some experimental results are shown below:

| Model | Checkpoint | **RMSE** | **RMSE (normalized)** | **R2** | **Slope** |
Contributor:

Move this to the beginning.

Contributor (Author):

done.


In summary, this paper proposes an innovative and efficient neural-operator learning framework: unsupervised pretraining learns general representations from large amounts of cheap, unlabeled physics data, while in-context learning uses a few similar examples at inference time to improve OOD generalization. The framework significantly reduces the need for expensive simulation data, improves adaptability and generalization on complex physics problems, and opens a new path toward data-efficient scientific machine learning.

Some experimental results are shown below:
Contributor:

A visualized comparison of results needs to be added.

Contributor (Author):

The source code provides no usable visualization script.

self.split_offset = 0
self.len = self.offsets[-1]
else:
print("Using train/val/test split: {}".format(self.train_val_test))
Contributor:

Is this output actually needed? Consider removing it or switching to logger-based printing.

Contributor (Author):

done.

for d in queue:
yield d
except Exception as err:
print("ERRRR", err)
Contributor:

Clean up the print.

Contributor (Author):

done.

except Exception as err:
print("ERRRR", err)
sampler_choices.pop(index_sampled)
print(
Contributor:

Clean up the print.

Contributor (Author):

done.

try:
x, y = self.sub_dsets[file_idx][local_idx]
except: # noqa
print(
Contributor:

Same as above.

Contributor (Author):

done.

"""
self.model.eval()
if full:
cutoff = 999999999999
Contributor:

The hard-coded value needs to be handled.

Contributor (Author):

done.
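A minimal sketch of one way to handle the magic number, assuming `cutoff` only bounds the evaluation loop: use an explicit "no limit" sentinel.

```python
import math


def resolve_cutoff(full: bool, cutoff: float) -> float:
    """Return an explicit 'no limit' sentinel instead of 999999999999."""
    # math.inf compares greater than any batch counter, so `count > cutoff`
    # never triggers when evaluating the full dataset.
    return math.inf if full else cutoff
```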

HydrogenSulfate previously approved these changes Jul 8, 2025

@HydrogenSulfate (Collaborator) left a review:

There is one small issue; please fix it.

Comment on lines 31 to 32

logger = logging.getLogger(__name__)
Collaborator:

Please do not use a custom logger; ppsci.utils provides a logger.

Contributor (Author):

Thx. Fixed.
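For reference, the standard pattern across PaddleScience examples is the shared module-level logger in ppsci.utils:

```python
from ppsci.utils import logger

# Replaces logging.getLogger(__name__); the message here is illustrative.
logger.info("validation finished")
```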

@wangguan1995 (Contributor) left a review:

To fix

count = 0
for _, data in enumerate(temp_loader):
if count > cutoff:
del temp_loader
Contributor:

This causes a memory leak.

Contributor (Author):

This was caused by the DataLoader's num_workers being set to 0; setting it to 1 or 4 works normally.
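A minimal sketch of the workaround described above (the dataset variable is a placeholder):

```python
import paddle

# num_workers=0 reproduced the leak in this code path; 1 or 4 ran normally.
temp_loader = paddle.io.DataLoader(
    dataset,            # placeholder: the evaluation dataset used above
    batch_size=128,
    num_workers=4,
    shuffle=False,
)
```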


1. Similarity via predictions. The paper finds spatially and temporally similar demonstrations by computing their distances in output space: for two input locations in the space-time domain, if the trained neural operator's outputs are similar, they are treated as similar samples. Following [24, 25], the paper assumes the demonstrations share the same physical-parameter distribution as the query.

2. Aggregation. For each space-time location of the query, after finding its similar samples among the demonstrations, the paper aggregates and averages their solutions as the prediction (see the sketch after this list).
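A minimal sketch of the similarity-then-aggregate step described above; tensor shapes, names, and the use of paddle.cdist are assumptions, not the paper's code:

```python
import paddle


def icl_aggregate(query_out, demo_out, demo_sol, k: int = 8):
    """query_out: [Nq, D] operator outputs at query locations;
    demo_out:  [Nd, D] operator outputs at demonstration locations;
    demo_sol:  [Nd]    demonstration solution values.
    Returns the [Nq] mean solution of each query's k nearest demos."""
    dist = paddle.cdist(query_out, demo_out)                 # [Nq, Nd] L2 distances
    _, idx = paddle.topk(dist, k=k, axis=1, largest=False)   # k most similar demos
    gathered = paddle.gather(demo_sol, idx.flatten()).reshape(idx.shape)
    return gathered.mean(axis=1)                             # aggregate by averaging
```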
Contributor:

Put more effort into the documentation: continue translating and add figures.

Contributor (Author):

Added.

@xiaoyewww force-pushed the data_effient_nopt branch from e57264b to 5d858dc on July 11, 2025 17:16
xiaoyewww and others added 6 commits July 12, 2025 01:36
Signed-off-by: WG <39621324+wangguan1995@users.noreply.github.com>
Signed-off-by: WG <39621324+wangguan1995@users.noreply.github.com>
Update data_efficient_nopt.py