'Dataflow experiment', vocabulary file missing #140

Closed
sarathsankar opened this issue Nov 4, 2020 · 8 comments

@sarathsankar

Hi,

I was trying to run the dataflow experiment, but I couldn't find a CSV vocabulary file in any of the tar datasets provided.

I am also getting an "'analysis' key not found" error for the command below.

bazel run //programl/task/dataflow:train_ggnn --analysis reachability --path=$HOME/programl

I am new to bazel; please help me with this.

@ChrisCummins
Owner

I was trying to run the dataflow experiment, but I couldn't find a CSV vocabulary file in any of the tar datasets provided.

Apologies! It looks like I must have missed the vocab directory. I've added it to the uploaded dataset. You can find the missing vocab folder in this archive.

I am also getting an "'analysis' key not found" error for the command below.

Good catch, there's a typo in the docs: the command is missing the double hyphen that separates bazel's own options from the arguments passed to the binary. The correct command is:

bazel run //programl/task/dataflow:train_ggnn -- --analysis reachability --path=$HOME/programl

Cheers,
Chris

@sarathsankar
Author

Thank you so much, @ChrisCummins.

I was able to run the command in the background with batch_size set to 4096, but it paused automatically at 93%. As you mentioned in #144, I stopped the process (kill -9 PID) to rerun from the most recent checkpoint, but I couldn't find any EpochList.pbtxt or Checkpoint.pb files, and the rerun throws a file-not-found error. Is this due to the way I stopped the process, or to the batch_size value? Please let me know.

Currently I am rerunning from the beginning with all default values.

Regards,
Sarath

@ChrisCummins
Owner

Hmm, it's hard to say without more context. Could you please post the command, the error message, and the contents of the log directory that you are restoring from?

For reference, here is what a logs directory should contain: two files, build_info.json and flags.txt, and two directories containing epoch metadata and model checkpoints:

$ ls -lR ~/programl/dataflow/logs/programl/datadep/ddf_30/
/home/me/programl/dataflow/logs/programl/datadep/ddf_30/:
total 8
-rw-r--r--  1 cummins staff 218 May 28 23:49 build_info.json
drwxr-xr-x 17 cummins staff 544 Jun  9 15:59 checkpoints
drwxr-xr-x 18 cummins staff 576 Jun  9 15:59 epochs
-rw-r--r--  1 cummins staff 739 Jun  4 01:16 flags.txt

/home/me/programl/dataflow/logs/programl/datadep/ddf_30/checkpoints:
total 17580
-rw-r--r-- 1 cummins staff 1198621 May 28 14:09 001.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198621 May 28 14:12 002.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198621 May 28 14:15 003.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198621 May 28 14:18 004.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 14:53 005.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 15:19 006.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 15:46 007.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 16:12 008.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 16:40 009.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 17:05 010.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 17:31 011.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 17:55 012.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 18:20 013.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198597 May 28 18:46 014.Checkpoint.pb
-rw-r--r-- 1 cummins staff 1198583 May 28 22:02 015.Checkpoint.pb

/home/me/programl/dataflow/logs/programl/datadep/ddf_30/epochs:
total 64
-rw-r--r-- 1 cummins staff 1146 May 28 14:09 001.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1144 May 28 14:12 002.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1146 May 28 14:15 003.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1143 May 28 14:18 004.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1146 May 28 14:53 005.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1144 May 28 15:19 006.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1146 May 28 15:46 007.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1143 May 28 16:12 008.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1141 May 28 16:40 009.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1142 May 28 17:05 010.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1142 May 28 17:31 011.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1144 May 28 17:55 012.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1141 May 28 18:20 013.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1142 May 28 18:46 014.EpochList.pbtxt
-rw-r--r-- 1 cummins staff 1143 May 28 22:02 015.EpochList.pbtxt
-rw-r--r-- 1 cummins staff  561 May 29 01:49 TEST.EpochList.pbtxt

Cheers,
Chris

@sarathsankar
Author

sarathsankar commented Nov 18, 2020

The checkpoint files were present; it was my mistake, I was looking in the wrong log directory. Sorry about that. Here is the listing:

programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02
├── build_info.json
├── checkpoints
│   ├── 001.Checkpoint.pb
│   ├── 002.Checkpoint.pb
│   ├── 003.Checkpoint.pb
│   ├── 004.Checkpoint.pb
│   ├── 005.Checkpoint.pb
│   ├── 006.Checkpoint.pb
│   ├── 007.Checkpoint.pb
│   ├── 008.Checkpoint.pb
│   ├── 009.Checkpoint.pb
│   ├── 010.Checkpoint.pb
│   ├── 011.Checkpoint.pb
│   ├── 012.Checkpoint.pb
│   ├── 013.Checkpoint.pb
│   ├── 014.Checkpoint.pb
│   └── 015.Checkpoint.pb
├── epochs
│   ├── 001.EpochList.pbtxt
│   ├── 002.EpochList.pbtxt
│   ├── 003.EpochList.pbtxt
│   ├── 004.EpochList.pbtxt
│   ├── 005.EpochList.pbtxt
│   ├── 006.EpochList.pbtxt
│   ├── 007.EpochList.pbtxt
│   ├── 008.EpochList.pbtxt
│   ├── 009.EpochList.pbtxt
│   ├── 010.EpochList.pbtxt
│   ├── 011.EpochList.pbtxt
│   ├── 012.EpochList.pbtxt
│   ├── 013.EpochList.pbtxt
│   ├── 014.EpochList.pbtxt
│   └── 015.EpochList.pbtxt
├── flags.txt
└── graph_loader
    ├── train.txt
    └── val.txt

3 directories, 34 files

I also reran from the most recent checkpoint using the command:
bazel run //programl/task/dataflow:train_ggnn -- --analysis reachability --path /home/sarath/programl/dataset/dataflow --restore_from /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20\:11\:17T00\:14\:02/
The process has been stuck for hours (still running), and it uses only 501 MiB of GPU memory.

INFO: Elapsed time: 19.057s, Critical Path: 0.29s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/programl/task/dataflow/train_ggnn --analysis reachability --path /home/sarath/programl/dataset/dataflow --restore_from /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02/
INFO: Build completed successfully, 1 total action
I1117 21:16:07 vocabulary.py:52] Selected 2,230-element vocabulary achieving 98.33% node text coverage
I1117 21:16:07 gpu_scheduler.py:102] Acquired GPU 0 (GeForce GTX 1070)
I1117 21:16:09 dataflow.py:180] Resuming training from checkpoint 15 with val F1 score 0.957
I1117 21:16:09 ggnn.py:140] GGNN has 87,070 training params

When I checked, it had been stuck in a while loop inside batch_builder.Stop().

Does this mean the training part is over? Please let me know if you need more details.

Regards,
Sarath KS

@ChrisCummins
Owner

Hi Sarath,

Thanks for following up. Those details are really helpful, and should give me everything I need to debug what's going on. Clearly, there is some error in the training loop that is causing the process to hang forever. I will take a look at this and post a fix asap!

Cheers,
Chris

ChrisCummins added a commit that referenced this issue Nov 19, 2020
Read batches into a queue so that we can use the blocking Queue.get()
to wait for a batch with a timeout. Using a timeout is useful for
catching cases where a dead iterator will lead to data starvation and
a non-terminating process.

#140
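
For illustration, here is a minimal sketch of the queue-with-timeout approach described in the commit message above. It is not the actual ProGraML code; the names (BatchReader, example_batches) and the 60-second timeout are assumptions made for this example.

import queue
import threading

def example_batches():
    # Hypothetical stand-in for the real batch iterator / data loader.
    for i in range(100):
        yield i

class BatchReader(threading.Thread):
    # Read batches from an iterator into a bounded queue on a background thread.
    def __init__(self, batch_iterator, maxsize=16):
        super().__init__(daemon=True)
        self.batches = queue.Queue(maxsize=maxsize)
        self._iterator = batch_iterator

    def run(self):
        for batch in self._iterator:
            self.batches.put(batch)
        self.batches.put(None)  # Sentinel marking the end of the stream.

reader = BatchReader(example_batches())
reader.start()
while True:
    try:
        # Block until the next batch arrives, but give up after 60 seconds so
        # that a dead iterator surfaces as an error rather than a process that
        # never terminates.
        batch = reader.batches.get(timeout=60)
    except queue.Empty:
        raise RuntimeError("Data starvation: no batch received within 60 seconds")
    if batch is None:
        break
    # ... consume the batch here ...

The blocking Queue.get(timeout=...) call is what lets the consumer distinguish a slow producer from a dead one, which is the non-terminating case described in the commit message.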
@ChrisCummins
Owner

ChrisCummins commented Nov 25, 2020

Hi @sarathsankar,

Thanks again for the details and the pointer to the batch_builder.Stop() while loop. I was able to reproduce the issue locally and isolated the problem to a bug in the data loader. The fix is implemented in 733e748 and merged into the development branch. Note that the path to the target has now changed from //programl/task/dataflow:train_ggnn to //tasks/dataflow:train_ggnn:

$ git checkout development && git pull
$ bazel run -c opt //tasks/dataflow:train_ggnn -- --analysis reachability --path=$HOME/programl

Please re-open if the issue persists!

Cheers,
Chris

@sarathsankar
Author

sarathsankar commented Nov 26, 2020

Thanks @ChrisCummins, I am now able to rerun the training part.

I tried running it in three different ways:

  • Run training from the beginning (fresh run)

    • Too slow; stopped it manually.
  • Run from the last checkpoint (checkpoint 15)

    • Got a 'queue.Empty' error (screenshot: checkpoint_rerun_queue_issue1)
  • Run from the last checkpoint after deleting the last checkpoint files (deleted 015.EpochList.pbtxt and 015.Checkpoint.pb)

    • Completed the training phase and automatically moved on to the testing phase. The last iteration took around 5 hours for 100,000 files (log details added below). The test phase is still running.
    • Please correct me if I did this step wrongly!
    • Both the training and testing phases take much longer after the latest pull:
INFO: Build completed successfully, 1 total action
I1126 11:17:47.562422 140579923326784 vocabulary.py:55] Selected 2,230-element vocabulary achieving 98.33% node text coverage
I1126 11:17:49.323891 140579923326784 dataflow.py:178] Resuming training from checkpoint 14 with val F1 score 0.947
I1126 11:17:49.328396 140579923326784 ggnn.py:140] GGNN has 87,070 training params
Train to 1,000,000 graphs:  92%|█████████▏| 92254/100000 [5:21:41<1:00:57,  2.12 graphs/s, f1=0.938, loss=0.0172, prec=0.961, rec=0.923]I1126 16:39:53.659049 140579923326784 ggnn.py:460] LR Scheduler step. New learning rate is 0.00015756235243115225 (was 0.00016585510782226554)
Train to 1,000,000 graphs: 100178 graphs [5:46:10,  4.82 graphs/s, f1=0.937, loss=0.0176, prec=0.961, rec=0.922]
Val at 1,000,000 graphs:  99%|█████████▉| 9905/10000 [13:51<00:07, 11.92 graphs/s, f1=0.967, loss=0.0096, prec=0.989, rec=0.947]
I1126 17:17:51.568361 140579923326784 ggnn.py:214] Wrote /home/sarath/programl/dataset/dataflow/logs/programl/reachability/20:11:17T00:14:02/epochs/014.EpochList.pbtxt
I1126 17:17:54.853213 140579923326784 dataflow.py:146] Selected best checkpoint 14 with val F1 score 0.967
Test: 75441 graphs [3:13:02,  3.81 graphs/s, f1=0.973, loss=0.0109, prec=0.994, rec=0.953]

Please let me know if you need more details.
Thank you.

Regards,
Sarath KS

@ChrisCummins
Owner

Hi @sarathsankar,

Ouch. Looks like a significant performance regression. Just so I understand: after pulling the latest version, performance is now 2-3 graphs/sec for training and testing, and ~12 graphs/sec for validation. Do you happen to remember roughly what performance was like before?

Opened #147 to investigate further.

Cheers,
Chris
