Description
### Bug description
I am using the default configs, code, and data to train a model within the BioNeMo framework. An NCCL collective timeout occurs in the middle of training.
### What version are you seeing the problem on?
v2.2
### How to reproduce the bug
The trainer config that may be relevant to the problem:
trainer:
  devices: 8 # number of GPUs or CPUs
  num_nodes: 1
  accelerator: gpu # gpu or cpu
  precision: 16 # 16 or 32
  logger: False # logger is provided by NeMo exp_manager
  enable_checkpointing: False # checkpointing is done by NeMo exp_manager
  replace_sampler_ddp: False # use NeMo Megatron samplers
  max_epochs: null # use max_steps instead with NeMo Megatron model
  log_every_n_steps: 10 # number of iterations between logging
  val_check_interval: 15e4
  limit_val_batches: 50 # number of batches in validation step, use fraction for fraction of data, 0 to disable
  limit_test_batches: 500 # number of batches in test step, use fraction for fraction of data, 0 to disable
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  benchmark: False
  max_steps: 500000
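As a sanity check on these values, here is a minimal sketch of how they appear to combine into the step total shown in the progress bar in the log (500150). It assumes that validation runs once every `val_check_interval` training steps and that the progress bar counts training and validation batches together. That assumption is not stated in this report; it is only consistent with the numbers shown.

```python
# Values copied from the trainer config above.
max_steps = 500_000
val_check_interval = int(15e4)  # 150000 training steps between validation runs
limit_val_batches = 50          # batches per validation run

# Assumption: one validation run each time the step count reaches a
# multiple of val_check_interval, and the progress bar total includes
# both training steps and validation batches.
validation_runs = max_steps // val_check_interval
progress_bar_total = max_steps + validation_runs * limit_val_batches

print(f"validation runs: {validation_runs}")
print(f"expected progress bar total: {progress_bar_total}")
```

If that reading is right, the hang at step 32040 happened well before the first validation run at step 150000, so validation itself is unlikely to be involved.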
### Error messages and logs
Epoch 0: 6%|██ | 32040/500150 [6:28:43<94:39:17, 1.37it/s, loss=2.6, v_num=95nc, reduced_train_loss=2.590, global_step=3.2e+4, consumed_samples=2.56e+7]
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624886 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96125, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625593 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:472] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:478] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:830] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96117, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625613 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800733 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800769 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800781 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800847 milliseconds before timing out.
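The watchdog lines above repeat heavily, so it can help to collapse them into one entry per rank. The following is a rough sketch of such a summary using only the Python standard library; the function name `summarize_watchdog` and the three sample lines (copied from the log above) are illustrative, not part of BioNeMo, NeMo, or PyTorch.

```python
import re
from collections import defaultdict

# Matches the watchdog message format seen in the log above.
WATCHDOG = re.compile(
    r"\[Rank (\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(\d+), OpType=(\w+), Timeout\(ms\)=\d+\) "
    r"ran for (\d+) milliseconds"
)


def summarize_watchdog(log_text):
    """Return {rank: {(seq_num, op_type, elapsed_ms), ...}}, deduplicated."""
    per_rank = defaultdict(set)
    for rank, seq, op, elapsed in WATCHDOG.findall(log_text):
        per_rank[int(rank)].add((int(seq), op, int(elapsed)))
    return dict(per_rank)


# Three lines copied verbatim from the log in this report.
sample = """
[E ProcessGroupNCCL.cpp:458] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96120, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25624887 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64088, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 25625251 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:458] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=64090, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800741 milliseconds before timing out.
"""

summary = summarize_watchdog(sample)
for rank in sorted(summary):
    for seq, op, elapsed_ms in sorted(summary[rank]):
        hours = elapsed_ms / 3_600_000  # milliseconds to hours
        print(f"rank {rank}: {op} SeqNum={seq} ran for {hours:.2f} h")
```

Run over the full log, this kind of summary makes two things easier to see: several ranks report elapsed times far above the 1800000 ms (30 minute) timeout, and rank 0 reports sequence numbers near 96120 while the other ranks report 64088 to 64090. Whether those sequence numbers belong to different process groups cannot be determined from the log alone.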
### Environment
a03-zpeng@m3dgx01:~$ pip list
Package Version Location
absl-py 1.4.0
accessible-pygments 0.0.4
aiohttp 3.9.0
aiosignal 1.3.1
alabaster 0.7.13
aniso8601 9.0.1
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
apex 0.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.2.1
astunparse 1.6.3
async-timeout 4.0.2
attrdict 2.0.1
attrs 23.1.0
audioread 3.0.0
awscli 1.29.67
Babel 2.12.1
backcall 0.2.0
beautifulsoup4 4.12.2
bionemo 0.2.0.dev0 /workspace/bionemo
biopandas 0.4.1
biopython 1.79
black 23.1.0
bleach 6.0.0
blinker 1.6.2
blis 0.7.9
boto3 1.28.10
botocore 1.31.67
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.3.1
catalogue 2.0.8
cdifflib 1.2.6
certifi 2023.7.22
cffi 1.15.1
cfgv 3.4.0
charset-normalizer 3.1.0
click 8.1.7
cloudpickle 2.2.1
cmake 3.24.1.1
colorama 0.4.4
coloredlogs 15.0.1
comm 0.1.3
commonmark 0.9.1
confection 0.0.4
contourpy 1.0.7
coverage 7.4.0
crc32c 2.3.post0
cubinlinker 0.3.0+2.g87b01ae
cuda-python 12.1.0rc5+1.g38940ef
cudf 23.4.0
cugraph 23.4.0
cugraph-dgl 23.4.0
cugraph-service-client 23.4.0
cugraph-service-server 23.4.0
cuml 23.4.0
cupy-cuda12x 12.0.0b3
cycler 0.11.0
cymem 2.0.7
Cython 0.29.35
dacite 1.8.1
dask 2023.3.2
dask-cuda 23.4.0
dask-cudf 23.4.0
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dgl 1.1.3
dgllife 0.2.8
diffdock 0.0.5
dill 0.3.7
Distance 0.1.3
distlib 0.3.8
distributed 2023.3.2.1
DLLogger 1.0.0
docker-pycreds 0.4.0
docopt 0.6.2
docutils 0.16
e3nn 0.5.1
editdistance 0.6.2
einops 0.6.1
exceptiongroup 1.1.1
execnet 1.9.0
executing 1.2.0
expecttest 0.1.3
fair-esm 2.0.0
faiss-cpu 1.7.4
fastjsonschema 2.17.1
fastrlock 0.8.1
fasttext 0.9.2
filelock 3.12.2
fire 0.5.0
flash-attn 1.0.7
Flask 2.2.5
Flask-RESTful 0.3.10
flatbuffers 23.5.26
fonttools 4.47.2
frozenlist 1.3.3
fsspec 2023.5.0
ftfy 6.1.1
future 0.18.3
g2p-en 2.1.0
gast 0.4.0
gdown 4.7.1
gevent 23.9.1
geventhttpclient 2.0.2
gitdb 4.0.10
GitPython 3.1.41
google-auth 2.20.0
google-auth-oauthlib 0.4.6
graphsurgeon 0.4.6
graphviz 0.20.1
greenlet 3.0.3
grpcio 1.56.0
h5py 3.9.0
huggingface-hub 0.20.2
humanfriendly 10.0
hydra-core 1.2.0
hyperopt 0.2.7
hypothesis 5.35.1
identify 2.5.33
idna 3.4
ijson 3.2.3
imagesize 1.4.1
importlib-metadata 6.6.0
inflect 7.0.0
iniconfig 2.0.0
intel-openmp 2021.4.0
ipadic 1.0.0
ipdb 0.13.11
ipykernel 6.23.3
ipython 8.14.0
ipython-genutils 0.2.0
ipywidgets 8.0.7
isort 5.12.0
itsdangerous 2.1.2
jedi 0.18.2
jieba 0.42.1
Jinja2 3.1.2
jiwer 2.5.2
jmespath 1.0.1
joblib 1.2.0
json5 0.9.14
jsonlines 4.0.0
jsonschema 4.17.3
jupyter_client 8.3.0
jupyter_core 5.3.1
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab-pygments 0.2.2
jupyterlab-server 1.2.0
jupyterlab-widgets 3.0.8
jupytext 1.14.6
k2 1.24.3.dev20230725+cuda12.1.torch2.1.0a0
kaldi-python-io 1.2.2
kaldiio 2.18.0
kiwisolver 1.4.4
kornia 0.6.12
langcodes 3.3.0
latexcodec 2.0.1
Levenshtein 0.21.1
librosa 0.9.2
lightning-utilities 0.9.0
llvmlite 0.39.1
locket 1.0.0
loguru 0.7.0
lxml 4.9.3
Markdown 3.4.3
markdown-it-py 2.2.0
markdown2 2.4.9
MarkupSafe 2.1.3
marshmallow 3.20.1
matplotlib 3.4.3
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
mecab-python3 1.0.5
megatron-core 0.2.0
mistune 3.0.1
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
mock 5.0.2
more-itertools 10.1.0
mpmath 0.19
msgpack 1.0.5
multidict 6.0.4
murmurhash 1.0.9
mypy-extensions 1.0.0
nbclient 0.8.0
nbconvert 7.6.0
nbformat 5.9.0
nemo-text-processing 0.1.8rc0
nemo-toolkit 1.20.0
nest-asyncio 1.5.6
networkx 2.6.3
ninja 1.11.1
nltk 3.8.1
nodeenv 1.8.0
notebook 6.4.10
numba 0.56.4+1.g5f1bc7084
numpy 1.22.2
nvidia-dali-cuda120 1.26.0
nvidia-pyindex 1.0.9
nvidia-pytriton 0.4.0
nvtx 0.2.5
oauthlib 3.2.2
omegaconf 2.2.3
onnx 1.14.1
onnx-graphsurgeon 0.3.27
onnxruntime-gpu 1.16.3
onnxscript 0.1.0.dev20240113
OpenCC 1.1.6
opencv 4.6.0
opt-einsum 3.3.0
opt-einsum-fx 0.1.4
packaging 23.1
pandas 1.5.2
pandocfilters 1.5.0
pangu 4.0.6.1
parameterized 0.9.0
parso 0.8.3
partd 1.4.0
pathspec 0.11.1
pathtools 0.1.2
pathy 0.10.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.1
pip 21.2.4
pipdeptree 2.13.0
plac 1.3.5
platformdirs 4.1.0
pluggy 1.2.0
ply 3.11
polars 0.16.7
polygraphy 0.47.1
pooch 1.7.0
portalocker 2.7.0
POT 0.7.0
pre-commit 3.4.0
preshed 3.0.8
prettytable 3.8.0
progress 1.6
prometheus-client 0.17.0
prompt-toolkit 3.0.38
protobuf 3.20.3
psutil 5.9.4
ptxcompiler 0.8.1+1.gbe9fca5
ptyprocess 0.7.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 9.0.0
py4j 0.10.9.7
pyannote.core 5.0.0
pyannote.database 5.0.1
pyannote.metrics 3.2.1
pyarrow 14.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.10.4
pybtex 0.24.0
pybtex-docutils 1.0.2
pycocotools 2.0+nv0.7.3
pycparser 2.21
pydantic 2.5.3
pydantic_core 2.14.6
pydata-sphinx-theme 0.13.1
pydub 0.25.1
pyfaidx 0.7.2
pyfastx 1.1.0
Pygments 2.15.1
pylibcugraph 23.4.0
pylibcugraphops 23.4.0
pylibraft 23.4.0
Pympler 1.0.1
pynini 2.1.5
pynvml 11.4.1
pyparsing 3.0.9
pypinyin 0.49.0
pypinyin-dict 0.6.0
pyrsistent 0.19.3
PySocks 1.7.1
pytest 7.4.0
pytest-cov 4.1.0
pytest-dependency 0.5.1
pytest-forked 1.6.0
pytest-rerunfailures 11.1.2
pytest-runner 6.0.0
pytest-shard 0.1.2
pytest-timeout 2.2.0
pytest-xdist 3.3.1
python-dateutil 2.8.2
python-hostlist 1.23.0
python-rapidjson 1.14
python-slugify 8.0.1
pytorch-lightning 1.9.4
pytorch-quantization 2.1.2
pytz 2023.3
PyYAML 6.0
pyzmq 23.2.1
raft-dask 23.4.0
rapidfuzz 2.13.7
rdkit 2023.9.1
rdkit-pypi 2022.9.5
regex 2023.6.3
requests 2.31.0
requests-mock 1.11.0
requests-oauthlib 1.3.1
resampy 0.4.2
rich 12.6.0
rmm 23.4.0
rouge-score 0.1.2
rsa 4.7.2
ruamel.yaml 0.17.32
ruamel.yaml.clib 0.2.7
ruff 0.0.292
s3transfer 0.7.0
sacrebleu 2.3.1
sacremoses 0.0.53
safetensors 0.3.1
scikit-learn 1.2.0
scipy 1.10.1
seaborn 0.12.2
Send2Trash 1.8.2
sentence-transformers 2.2.2
sentencepiece 0.1.99
sentry-sdk 1.28.1
setproctitle 1.3.2
setuptools 65.5.1
sh 1.14.3
shellingham 1.5.0.post1
six 1.16.0
smart-open 6.3.0
smmap 5.0.0
snowballstemmer 2.2.0
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.4.1
sox 1.4.1
spacy 3.5.3
spacy-legacy 3.0.12
spacy-loggers 1.0.4
Sphinx 5.3.0
sphinx-book-theme 1.0.0
sphinx-copybutton 0.5.2
sphinx-glpi-theme 0.3
sphinxcontrib-applehelp 1.0.4
sphinxcontrib-bibtex 2.5.0
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.1
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
sphinxext-opengraph 0.8.2
spyrmsd 0.5.2
srsly 2.4.6
stack-data 0.6.2
sympy 1.12
tabulate 0.9.0
tbb 2021.9.0
tblib 1.7.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorrt 8.6.1
termcolor 2.3.0
terminado 0.17.1
testbook 0.4.2
text-unidecode 1.3
textdistance 4.5.0
texterrors 0.4.4
tfrecord 1.14.1
thinc 8.1.10
threadpoolctl 3.1.0
thriftpy2 0.4.16
tinycss2 1.2.1
tokenizers 0.15.0
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.1.0a0+4136153
torch-cluster 1.6.1
torch-geometric 2.3.0
torch-scatter 2.0.9
torch-sparse 0.6.17
torch-tensorrt 1.5.0.dev0
torchaudio 2.1.0
torchdata 0.7.0a0
torchmetrics 1.0.1
torchvision 0.16.0a0
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformer-engine 0.9.0
transformers 4.36.0
treelite 3.2.0
treelite-runtime 3.2.0
triton 2.0.0.dev20221202
triton-model-navigator 0.7.4
tritonclient 2.41.1
typed-ast 1.5.5
typer 0.7.0
types-dataclasses 0.6.6
typing_extensions 4.6.3
typing-inspect 0.6.0
ucx-py 0.31.0
uff 0.6.9
urllib3 1.26.16
virtualenv 20.25.0
wandb 0.15.6
wasabi 1.1.2
wcwidth 0.2.6
webdataset 0.2.33
webencodings 0.5.1
Werkzeug 2.3.6
wget 3.2
wheel 0.40.0
widgetsnbextension 4.0.8
wrapt 1.14.1
xdoctest 1.0.2
xgboost 1.7.5
yarl 1.9.2
youtokentome 1.0.6
zict 3.0.0
zipp 3.15.0
zope.event 5.0
zope.interface 6.1
### More info
I am using the NVIDIA BioNeMo framework.