Open
Description
Basic information
- Python version: (2.7 / 3.6)
- MoXing version: (1.8.2)
- Browser: Chrome
Problem description / Steps to reproduce
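The program starts normally and trains a ResNet50 model (the model file is about 300 MB), but after running for several epochs it suddenly prints the message shown in the log below and the job fails. The reported cause is "Unable to connect to endpoint", which may be due to an unstable OBS connection.
(Briefly describe the problem; if it is a bug, describe the steps to reproduce.)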
Job information
- Job type:
- Job ID: resnet-42586680-10
- Engine type: (TensorFlow)
- Run parameters: none
- Number of compute nodes: 1
- Compute node specification: single node, 8 cards
Related source code / Output log
Caused by op u'ModelSaver/save/SaveV2', defined at:
  File "resnet_cloud_multi/imagenet_resnet_cloud.py", line 218, in <module>
    launch_train_with_config(config, trainer)
...
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
InternalError (see above for traceback): : Unable to connect to endpoint
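Since the failing op is ModelSaver/save/SaveV2, the error appears to occur while a checkpoint is being written. One possible workaround, if the cause really is a transient OBS connection problem, would be to retry the failing step with exponential backoff instead of letting the first error end the job. The following is only a minimal sketch under that assumption: `retry_with_backoff` and its parameters are hypothetical names introduced here for illustration, not part of MoXing or TensorFlow, and the step to retry (for example, uploading a locally saved checkpoint to OBS) has to be supplied by the caller.

```python
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep, retry_on=(Exception,)):
    """Call fn() and retry on failure with exponentially growing delays.

    Hypothetical helper for illustration only. The delay before retry k
    (counting from 0) is base_delay * 2**k seconds.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                # Out of attempts: re-raise the last error unchanged.
                raise
            sleep(base_delay * (2 ** attempt))


# Usage sketch (assumption: `upload_checkpoint` is a caller-defined function
# that copies a locally written checkpoint to OBS and raises on failure):
#
#     retry_with_backoff(upload_checkpoint, max_attempts=5, base_delay=2.0)
```

Narrowing `retry_on` to the specific connection error type would avoid retrying on unrelated failures such as a wrong path or missing permissions.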