Description
After training with the command:
python run.py train experiments/wikisql-glove-run.jsonnet
and getting through 39990 steps:
[2020-11-19T18:51:02] Step 39990: loss=0.8703
I tried the next step:
python run.py eval experiments/wikisql-glove-run.jsonnet
but I got the following error:
Loading model from logdir/glove_run/model_checkpoint-00030100
0%| | 0/8421 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 109, in <module>
    main()
  File "run.py", line 91, in main
    infer.main(infer_config)
  File "/app/ratsql/commands/infer.py", line 163, in main
    inferer.infer(model, output_path, args)
  File "/app/ratsql/commands/infer.py", line 71, in infer
    output, args.use_heuristic)
  File "/app/ratsql/commands/infer.py", line 86, in _inner_infer
    decoded = self._infer_one(model, orig_item, preproc_item, beam_size, output_history, use_heuristic)
  File "/app/ratsql/commands/infer.py", line 98, in _infer_one
    model, data_item, preproc_item, beam_size=beam_size, max_steps=1000, from_cond=False)
  File "/app/ratsql/models/spider/spider_beam_search.py", line 59, in beam_search_with_heuristics
    assert next_choices is not None
AssertionError
In logdir/glove_run I have:
drwxr-xr-x. 2 root root 54 Nov 20 10:34 ie_dirs
lrwxrwxrwx. 1 root root 25 Nov 19 18:51 model_checkpoint -> model_checkpoint-00040000
-rw-r--r--. 1 root root 142281149 Nov 19 18:51 model_checkpoint-00040000
-rw-r--r--. 1 root root 240073 Nov 19 18:51 log.txt
-rw-r--r--. 1 root root 142281149 Nov 19 18:10 model_checkpoint-00039100
-rw-r--r--. 1 root root 142281149 Nov 19 17:24 model_checkpoint-00038100
-rw-r--r--. 1 root root 142281149 Nov 19 16:38 model_checkpoint-00037100
and so on.
Please advise: what went wrong?
Previously, training threw an error because SIGKILL was not recognized. Following a fix I found online, I replaced it with SIGTERM and added a conditional that checks whether the object has the method.
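A minimal sketch of that kind of workaround, assuming the failing code sent signal.SIGKILL to a subprocess-like object. This is not the exact patch, and the helper names pick_termination_signal and stop_process are hypothetical, not part of the repository:

```python
import signal


def pick_termination_signal():
    """Return SIGKILL where the platform defines it, otherwise fall back to SIGTERM."""
    # signal.SIGKILL is missing on some platforms (e.g. Windows);
    # signal.SIGTERM is defined everywhere.
    return getattr(signal, "SIGKILL", signal.SIGTERM)


def stop_process(proc):
    """Signal `proc` only if it exposes send_signal (hypothetical helper).

    Returns True if a signal was sent, False if the object lacks the method.
    """
    if hasattr(proc, "send_signal"):
        proc.send_signal(pick_termination_signal())
        return True
    return False
```

The getattr fallback avoids an AttributeError at the point where the signal is looked up, and the hasattr check skips objects that cannot be signalled at all.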
Could