'Dataflow experiment', vocabulary file missing #140

Closed
sarathsankar opened this issue Nov 4, 2020 · 8 comments

@sarathsankar

Hi,

I was trying to run the dataflow experiment, but I couldn't find a CSV vocabulary file in any of the tar datasets provided.

I am also getting an "'analysis' key not found" error for the command below.

bazel run //programl/task/dataflow:train_ggnn --analysis reachability --path=$HOME/programl

I am new to bazel; please help me with this.

@ChrisCummins
Owner

I was trying to run the dataflow experiment, but I couldn't find a CSV vocabulary file in any of the tar datasets provided.

Apologies! It looks like I must have missed the vocab directory. I've added it to the uploaded dataset. You can find the missing vocab folder in this archive.

I am also getting an "'analysis' key not found" error for the command below.

Good catch, there's a typo in the docs: the command is missing the double hyphen that separates bazel's own options from the arguments passed to the binary. The correct command is:

bazel run //programl/task/dataflow:train_ggnn -- --analysis reachability --path=$HOME/programl

Cheers,
Chris

@sarathsankar
Author

Thank you so much, @ChrisCummins.

I was able to run the command in the background with batch_size set to 4096, but it paused automatically at 93%. As you mentioned in #144, I stopped the process (kill -9 PID) to rerun from the most recent checkpoint, but I couldn't find any EpochList.pbtxt or Checkpoint.pb files, and the rerun throws a file-not-found error. Is this due to the way I stopped the process, or to the batch_size value? Please let me know.

Currently I am rerunning from the beginning with all default values.

Regards,
Sarath

@ChrisCummins
Owner

Hmm, it's hard to say without more context. Could you please post the command, the error message, and the contents of the log directory that you are restoring from?

For reference, here is what a logs directory should contain: two files, build_info.json and flags.txt, and two directories containing epoch metadata and model checkpoints:

$ ls -lR ~/programl/dataflow/logs/programl/datadep/ddf_30/
/home/me/programl/dataflow/logs/programl/datadep/ddf_30/:
total 8
-rw-r--r--  1 cummins staff 218 May 28 23:49 build_info.json
drwxr-xr-x 17 cummins staff 544 Jun  9 15:59 checkpoints
drwxr-xr-x 18 cummins staff 576 Jun  9 15:59 epochs
-rw-r--r--  1 cummins staff 739 Jun  4 01:16 flags.txt

/home/me/programl/dataflow/logs/programl/datadep/ddf_30/checkpoints:
total 17580
-rw-r--r-- 1 cummins staff 1198621 May 28 14:09 001.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198621 May 28 14:12 002.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198621 May 28 14:15 003.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198621 May 28 14:18 004.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 14:53 005.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 15:19 006.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 15:46 007.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 16:12 008.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 16:40 009.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 17:05 010.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 17:31 011.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 17:55 012.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 18:20 013.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 18:46 014.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198583 May 28 22:02 015.Checkpoint.pb

/home/me/programl/dataflow/logs/programl/datadep/ddf_30/epochs:
total 64
-rw-r--r-- 1 cummins staff 1146 May 28 14:09 001.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1144 May 28 14:12 002.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1146 May 28 14:15 003.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1143 May 28 14:18 004.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1146 May 28 14:53 005.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1144 May 28 15:19 006.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1146 May 28 15:46 007.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1143 May 28 16:12 008.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1141 May 28 16:40 009.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1142 May 28 17:05 010.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1142 May 28 17:31 011.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1144 May 28 17:55 012.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1141 May 28 18:20 013.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1142 May 28 18:46 014.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1143 May 28 22:02 015.EpochList.pbtxt
-rw-r--r-- 1 cummins staff  561 May 29 01:49 TEST.EpochList.pbtxt

Cheers,
Chris

@sarathsankar
Author

sarathsankar commented Nov 18, 2020

The checkpoint files were present; it was my mistake, I was looking in the wrong log directory. Sorry about that. Here is the listing:

programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02
├── build_info.json
├── checkpoints
│   ├── 001.Checkpoint.pb
│   ├── 002.Checkpoint.pb
│   ├── 003.Checkpoint.pb
│   ├── 004.Checkpoint.pb
│   ├── 005.Checkpoint.pb
│   ├── 006.Checkpoint.pb
│   ├── 007.Checkpoint.pb
│   ├── 008.Checkpoint.pb
│   ├── 009.Checkpoint.pb
│   ├── 010.Checkpoint.pb
│   ├── 011.Checkpoint.pb
│   ├── 012.Checkpoint.pb
│   ├── 013.Checkpoint.pb
│   ├── 014.Checkpoint.pb
│   └── 015.Checkpoint.pb
├── epochs
│   ├── 001.EpochList.pbtxt
│   ├── 002.EpochList.pbtxt
│   ├── 003.EpochList.pbtxt
│   ├── 004.EpochList.pbtxt
│   ├── 005.EpochList.pbtxt
│   ├── 006.EpochList.pbtxt
│   ├── 007.EpochList.pbtxt
│   ├── 008.EpochList.pbtxt
│   ├── 009.EpochList.pbtxt
│   ├── 010.EpochList.pbtxt
│   ├── 011.EpochList.pbtxt
│   ├── 012.EpochList.pbtxt
│   ├── 013.EpochList.pbtxt
│   ├── 014.EpochList.pbtxt
│   └── 015.EpochList.pbtxt
├── flags.txt
└── graph_loader
    ├── train.txt
    └── val.txt

3 directories, 34 files

I also reran from the most recent checkpoint using the command:
bazel run //programl/task/dataflow:train_ggnn -- --analysis reachability --path /home/sarath/programl/dataset/dataflow --restore_from /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20\:11\:17T00\:14\:02/
The process has been stuck for hours (still running), and it uses only 501 MiB of GPU memory.

INFO: Elapsed time: 19.057s, Critical Path: 0.29s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/programl/task/dataflow/train_ggnn --analysis reachability --path /home/sarath/programl/dataset/dataflow --restore_from /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02/
INFO: Build completed successfully, 1 total action
I1117 21:16:07 vocabulary.py:52] Selected 2,230-element vocabulary achieving 98.33% node text coverage
I1117 21:16:07 gpu_scheduler.py:102] Acquired GPU 0 (GeForce GTX 1070)
I1117 21:16:09 dataflow.py:180] Resuming training from checkpoint 15 with val F1 score 0.957
I1117 21:16:09 ggnn.py:140] GGNN has 87,070 training params

When I checked, it had been stuck in a while loop inside batch_builder.Stop().

Does this mean the training part is over? Please let me know if you need more details.

Regards,
Sarath KS

@ChrisCummins
Owner

Hi Sarath,

Thanks for following up. Those details are really helpful, and should give me everything I need to debug what's going on. Clearly, there is some error in the training loop that is causing the process to hang forever. I will take a look at this and post a fix asap!

Cheers,
Chris

ChrisCummins added a commit that referenced this issue Nov 19, 2020
Read batches into a queue so that we can use the blocking Queue.get()
to wait for a batch with a timeout. Using a timeout is useful for
catching cases where a dead iterator will lead to data starvation and
a non-terminating process.

#140
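
For illustration, here is a minimal sketch of the queue-with-timeout approach described in the commit message above. It is not the actual ProGraML code; the names (BatchReader, example_batches) and the 60-second timeout are assumptions made for this example.

import queue
import threading

def example_batches():
    # Hypothetical stand-in for the real batch iterator / data loader.
    for i in range(100):
        yield i

class BatchReader(threading.Thread):
    # Read batches from an iterator into a bounded queue on a background thread.
    def __init__(self, batch_iterator, maxsize=16):
        super().__init__(daemon=True)
        self.batches = queue.Queue(maxsize=maxsize)
        self._iterator = batch_iterator

    def run(self):
        for batch in self._iterator:
            self.batches.put(batch)
        self.batches.put(None)  # Sentinel marking the end of the stream.

reader = BatchReader(example_batches())
reader.start()
while True:
    try:
        # Block until the next batch arrives, but give up after 60 seconds so
        # that a dead iterator surfaces as an error rather than a process that
        # never terminates.
        batch = reader.batches.get(timeout=60)
    except queue.Empty:
        raise RuntimeError("Data starvation: no batch received within 60 seconds")
    if batch is None:
        break
    # ... consume the batch here ...

The blocking Queue.get(timeout=...) call is what lets the consumer distinguish a slow producer from a dead one, which is the non-terminating case described in the commit message.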
@ChrisCummins
Owner

ChrisCummins commented Nov 25, 2020

Hi @sarathsankar,

Thanks again for the details and the pointer to the batch_builder.Stop() while loop. I was able to reproduce the issue locally and isolated the problem to a bug in the data loader. The fix is implemented in 733e748 and merged into the development branch. Note that the path to the target has now changed from //programl/task/dataflow:train_ggnn to //tasks/dataflow:train_ggnn:

$ git checkout development && git pull
$ bazel run -c opt //tasks/dataflow:train_ggnn -- --analysis reachability --path=$HOME/programl

Please re-open if the issue persists!

Cheers,
Chris

@sarathsankar
Author

sarathsankar commented Nov 26, 2020

Thanks @ChrisCummins, I am now able to rerun the training part.

I tried running it in three different ways:

  • Run training from the beginning (fresh run)

    • Too slow; stopped it manually.
  • Run from the last checkpoint (checkpoint 15)

    • Got a 'queue.Empty' error (screenshot: checkpoint_rerun_queue_issue1)
  • Run from the last checkpoint after deleting the last checkpoint files (deleted 015.EpochList.pbtxt and 015.Checkpoint.pb)

    • Completed the training phase and automatically moved on to the testing phase. The last iteration took around 5 hours for 100,000 files (log details added below). The test phase is still running.
    • Please correct me if I did this step wrongly!
    • Both the training and testing phases take much longer after the latest pull:
INFO: Build completed successfully, 1 total action
I1126 11:17:47.562422 140579923326784 vocabulary.py:55] Selected 2,230-element vocabulary achieving 98.33% node text coverage
I1126 11:17:49.323891 140579923326784 dataflow.py:178] Resuming training from checkpoint 14 with val F1 score 0.947
I1126 11:17:49.328396 140579923326784 ggnn.py:140] GGNN has 87,070 training params
Train to 1,000,000 graphs:  92%|█████████▏| 92254/100000 [5:21:41<1:00:57,  2.12 graphs/s, f1=0.938, loss=0.0172, prec=0.961, rec=0.923]I1126 16:39:53.659049 140579923326784 ggnn.py:460] LR Scheduler step. New learning rate is 0.00015756235243115225 (was 0.00016585510782226554)
Train to 1,000,000 graphs: 100178 graphs [5:46:10,  4.82 graphs/s, f1=0.937, loss=0.0176, prec=0.961, rec=0.922]
Val at 1,000,000 graphs:  99%|█████████▉| 9905/10000 [13:51<00:07, 11.92 graphs/s, f1=0.967, loss=0.0096, prec=0.989, rec=0.947]
I1126 17:17:51.568361 140579923326784 ggnn.py:214] Wrote /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02/epochs/014.EpochList.pbtxt
I1126 17:17:54.853213 140579923326784 dataflow.py:146] Selected best checkpoint 14 with val F1 score 0.967
Test: 75441 graphs [3:13:02,  3.81 graphs/s, f1=0.973, loss=0.0109, prec=0.994, rec=0.953]

Please let me know if you need more details.
Thank you.

Regards,
Sarath KS

@ChrisCummins
Owner

Hi @sarathsankar,

Ouch. Looks like a significant performance regression. Just so I understand: after pulling the latest version, performance is now 2-3 graphs/sec for training and testing, and ~12 graphs/sec for validation. Do you happen to remember roughly what performance was like before?

Opened #147 to investigate further.

Cheers,
Chris
