'Dataflow experiment', vocabulary file missing #140
Apologies! It looks like I must have missed the vocab directory. I've added it to the uploaded dataset. You can find the missing vocab folder in this archive.
Good catch, there's a typo in the docs (missing a double hyphen). The correct command is:
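(Reconstructing from the command in the original report: assuming the missing double hyphen is the "--" separator that bazel requires between its own options and the flags passed to the built binary, the corrected invocation would presumably be:)

bazel run //programl/task/dataflow:train_ggnn -- --analysis reachability --path=$HOME/programl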
Cheers,
Thank you so much @ChrisCummins. I am able to run the command in the background now. Currently I am rerunning from the beginning with all default values. Regards,
Hmm, it's hard to say without more context. Could you please post the command, the error message, and the contents of the log directory that you are restoring from? For reference, a logs directory should contain two files (build_info.json and flags.txt) alongside the checkpoints, epochs, and graph_loader directories.
Cheers,
The checkpoint files were present; it was my mistake that I looked into the wrong log directory, extremely sorry for that. Here is the list.
programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02
├── build_info.json
├── checkpoints
│ ├── 001.Checkpoint.pb
│ ├── 002.Checkpoint.pb
│ ├── 003.Checkpoint.pb
│ ├── 004.Checkpoint.pb
│ ├── 005.Checkpoint.pb
│ ├── 006.Checkpoint.pb
│ ├── 007.Checkpoint.pb
│ ├── 008.Checkpoint.pb
│ ├── 009.Checkpoint.pb
│ ├── 010.Checkpoint.pb
│ ├── 011.Checkpoint.pb
│ ├── 012.Checkpoint.pb
│ ├── 013.Checkpoint.pb
│ ├── 014.Checkpoint.pb
│ └── 015.Checkpoint.pb
├── epochs
│ ├── 001.EpochList.pbtxt
│ ├── 002.EpochList.pbtxt
│ ├── 003.EpochList.pbtxt
│ ├── 004.EpochList.pbtxt
│ ├── 005.EpochList.pbtxt
│ ├── 006.EpochList.pbtxt
│ ├── 007.EpochList.pbtxt
│ ├── 008.EpochList.pbtxt
│ ├── 009.EpochList.pbtxt
│ ├── 010.EpochList.pbtxt
│ ├── 011.EpochList.pbtxt
│ ├── 012.EpochList.pbtxt
│ ├── 013.EpochList.pbtxt
│ ├── 014.EpochList.pbtxt
│ └── 015.EpochList.pbtxt
├── flags.txt
└── graph_loader
├── train.txt
└── val.txt
3 directories, 34 files

I also reran with the most-recent checkpoint. The command and output were:

INFO: Elapsed time: 19.057s, Critical Path: 0.29s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/programl/task/dataflow/train_ggnn --analysis reachability --path /home/sarath/programl/dataset/dataflow --restore_from /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02/
INFO: Build completed successfully, 1 total action
I1117 21:16:07 vocabulary.py:52] Selected 2,230-element vocabulary achieving 98.33% node text coverage
I1117 21:16:07 gpu_scheduler.py:102] Acquired GPU 0 (GeForce GTX 1070)
I1117 21:16:09 dataflow.py:180] Resuming training from checkpoint 15 with val F1 score 0.957
I1117 21:16:09 ggnn.py:140] GGNN has 87,070 training params

When I checked, it has been stuck in a while loop inside batch_builder.Stop(). Is the training finished? Please let me know if you need more details. Regards,
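(For reference, reconstructing the resume invocation from the "Running command line" output above, the corresponding bazel run command would presumably be the following, with the "--" separator before the program flags; this is a sketch, not the documented interface.)

bazel run //programl/task/dataflow:train_ggnn -- \
    --analysis reachability \
    --path /home/sarath/programl/dataset/dataflow \
    --restore_from /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02/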
Hi Sarath, Thanks for following up. Those details are really helpful, and should give me everything I need to debug what's going on. Clearly, there is some error in the training loop that is causing the process to hang forever. I will take a look at this and post a fix asap! Cheers,
Read batches into a queue so that we can use the blocking Queue.get() to wait for a batch with a timeout. Using a timeout is useful for catching cases where a dead iterator will lead to data starvation and a non-terminating process. #140
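(A minimal sketch of the pattern that commit describes, not the actual ProGraML implementation: a producer thread pushes batches onto a queue, and the consumer uses a blocking Queue.get() with a timeout so that a dead iterator raises an error instead of hanging forever. Names such as consume_batches and BATCH_TIMEOUT_SECONDS are hypothetical.)

import queue
import threading

BATCH_TIMEOUT_SECONDS = 60  # hypothetical value; tune to the expected batch latency


def _producer(batch_iterator, out_queue):
    # Read batches from the (possibly slow or dead) iterator and enqueue them.
    for batch in batch_iterator:
        out_queue.put(batch)
    out_queue.put(None)  # sentinel: no more batches


def consume_batches(batch_iterator):
    out_queue = queue.Queue(maxsize=16)
    thread = threading.Thread(
        target=_producer, args=(batch_iterator, out_queue), daemon=True
    )
    thread.start()
    while True:
        try:
            # The blocking get with a timeout catches data starvation: if the
            # producer dies, we fail loudly instead of waiting forever.
            batch = out_queue.get(timeout=BATCH_TIMEOUT_SECONDS)
        except queue.Empty:
            raise RuntimeError(
                "Timed out waiting for a batch; the batch producer may have died."
            )
        if batch is None:
            return
        yield batch

Usage would be something like: for batch in consume_batches(graph_loader): run_training_step(batch).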
Hi @sarathsankar, Thanks again for the details and the pointer to the batch_builder.Stop() loop. The change referenced above should fix the hang.
Please re-open if the issue persists! Cheers,
Thanks @ChrisCummins, now I am able to rerun the training part. I tried to run it in three different ways:
INFO: Build completed successfully, 1 total action
I1126 11:17:47.562422 140579923326784 vocabulary.py:55] Selected 2,230-element vocabulary achieving 98.33% node text coverage
I1126 11:17:49.323891 140579923326784 dataflow.py:178] Resuming training from checkpoint 14 with val F1 score 0.947
I1126 11:17:49.328396 140579923326784 ggnn.py:140] GGNN has 87,070 training params
Train to 1,000,000 graphs: 92%|█████████▏| 92254/100000 [5:21:41<1:00:57, 2.12 graphs/s, f1=0.938, loss=0.0172, prec=0.961, rec=0.923]I1126 16:39:53.659049 140579923326784 ggnn.py:460] LR Scheduler step. New learning rate is 0.00015756235243115225 (was 0.00016585510782226554)
Train to 1,000,000 graphs: 100178 graphs [5:46:10, 4.82 graphs/s, f1=0.937, loss=0.0176, prec=0.961, rec=0.922]
Val at 1,000,000 graphs: 99%|█████████▉| 9905/10000 [13:51<00:07, 11.92 graphs/s, f1=0.967, loss=0.0096, prec=0.989, rec=0.947]
I1126 17:17:51.568361 140579923326784 ggnn.py:214] Wrote /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02/epochs/014.EpochList.pbtxt
I1126 17:17:54.853213 140579923326784 dataflow.py:146] Selected best checkpoint 14 with val F1 score 0.967
Test: 75441 graphs [3:13:02, 3.81 graphs/s, f1=0.973, loss=0.0109, prec=0.994, rec=0.953]

Please let me know if you need more details. Regards,
Hi @sarathsankar, Ouch. Looks like a significant performance regression. Just so I understand: after pulling the latest version, performance is now 2-3 graphs/sec for training and test, and ~12 graphs/sec for validation. Do you happen to remember roughly what performance was like before? Opened #147 to investigate further. Cheers,
Hi,
I was trying to run the dataflow experiment, and I couldn't find any CSV file for the vocabulary in any of the tar datasets provided.
I am also facing an "'analysis' key not found" error for the command below:
bazel run //programl/task/dataflow:train_ggnn --analysis reachability --path=$HOME/programl
I am new to bazel; please help me with this.