
Training job suddenly fails and stops #38

@NomanDD

Basic information

  • Python version: (2.7 / 3.6)
  • MoXing version: (1.8.2)
  • Browser: Chrome

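Problem description / steps to reproduce

The program started normally and was training a ResNet50 model (model file of roughly 300 MB). After running for several epochs, the job suddenly printed the message shown in the log below and failed. The reported cause is "Unable to connect to endpoint", which may be due to an unstable OBS connection.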

(Briefly describe the problem; if it is a bug, describe the steps to reproduce.)

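Job information

  • Related job type:

  • Job ID: resnet-42586680-10

  • Engine type: (TensorFlow)

  • Run parameters: none

  • Number of compute nodes: 1

  • Compute node specification: single node, 8 GPUs

Related source code / output log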

Caused by op u'ModelSaver/save/SaveV2', defined at:
  File "resnet_cloud_multi/imagenet_resnet_cloud.py", line 218, in <module>
    launch_train_with_config(config, trainer)
...
  File "/usr/local/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): : Unable to connect to endpoint
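
The failing op in the log is the checkpoint save (ModelSaver/save/SaveV2), so if the cause really is a transient OBS connection drop, one possible mitigation is to retry the save instead of letting a single failed attempt end the job. The following is a minimal sketch of a generic retry helper with exponential backoff, not a confirmed fix. The name retry_with_backoff and all of its parameters are hypothetical, and the issue does not show where such a wrapper would hook into the training script. The same code runs on Python 2.7 and 3.6.

```python
import time


def retry_with_backoff(operation, max_attempts=5, initial_delay=1.0,
                       backoff=2.0, retriable=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Hypothetical helper, not part of MoXing or TensorFlow.
    `operation` is any zero-argument callable, e.g. a checkpoint save.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)      # wait before the next attempt
            delay *= backoff  # grow the wait each time
```

In a real job, `operation` would be whatever call writes the checkpoint to OBS, and `retriable` should be narrowed to the error type that actually signals a connection failure so that unrelated errors still fail fast.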
