Benchmarking #392
base: main
Conversation
```python
# Assign to a variable to prevent garbage collection before sync.
logits = model_tpu(input_ids).logits

torch_xla.sync()  # Wait for the computation to complete.
```
This doesn't actually wait for the computation to complete. It just launches the kernel on the TPU and proceeds. I think the right API is `wait_device_ops`.
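A minimal sketch of how the two calls combine for timing, reusing `model_tpu` and `input_ids` from the diff above (illustration only, not part of the PR):

```python
import time

import torch_xla
import torch_xla.core.xla_model as xm

start = time.perf_counter()
logits = model_tpu(input_ids).logits  # traced lazily; nothing has executed yet
torch_xla.sync()                      # cut the graph and dispatch it to the TPU
xm.wait_device_ops()                  # block until all pending device work finishes
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"end-to-end step time: {elapsed_ms:.1f} ms")
```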
I see, thank you for the info. Does this mean that even in eager mode we also need to call `wait_device_ops` each time we want to measure time, so that we wait until the computation completes?
I noticed that if I use `wait_device_ops` for the preheat timing, it becomes 35 ms, whereas `torch_xla.sync()` gives me 3000 ms. It seems that `wait_device_ops` does not capture the compilation time of the initial run. Since we also want to compare compilation time for the first run, I will use `torch_xla.sync()` for the preheat time.
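For reference, a hedged sketch of the two measurements being compared, again reusing `model_tpu` and `input_ids` from the diff; the helper name `timed_forward` is made up for illustration:

```python
import time

import torch_xla
import torch_xla.core.xla_model as xm

def timed_forward(wait_for_device: bool) -> float:
    """Run one forward pass and return the wall-clock time in milliseconds."""
    start = time.perf_counter()
    # Keep a reference so the output is not garbage collected before sync.
    logits = model_tpu(input_ids).logits
    torch_xla.sync()              # launch the (possibly freshly compiled) graph
    if wait_for_device:
        xm.wait_device_ops()      # additionally wait for device execution to finish
    return (time.perf_counter() - start) * 1000

preheat_ms = timed_forward(wait_for_device=False)  # first run: includes tracing/compilation
steady_ms = timed_forward(wait_for_device=True)    # later runs: device execution time
print(f"preheat: {preheat_ms:.1f} ms, steady-state: {steady_ms:.1f} ms")
```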
No description provided.