-
Couldn't load subscription status.
- Fork 20
Add celerity blockchain for task divergence checking #217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Check-perf-impact results: (5a19ced85f862a00d0114dd241122462) ❓ No new benchmark data submitted. ❓ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
ae68e51 to
af692d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
af692d0 to
2dd8a09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
2dd8a09 to
6c3f128
Compare
ef86cd1 to
9ae8356
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the overall design, I mostly like it, but one thing I noticed is that there is a lot of code that's purely for testing which is in the actual "main logic" part. So far, I believe we mostly try to keep testing-only code in testing helpers, which makes it easier to see at a glance what is actually happening in a production run vs. what is done only in testing. We might want to move the testing-related functionality out of the main code.
d5d2e90 to
b88b60f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, other than the minor documentation and formatting things discussed offline.
I particularly appreciate moving most of the testing logic and functionality outside the main code path.
bb94e68 to
4af1341
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This is becoming a valuable on-boarding debug feature.
I have not looked closely at the algorithm so far, but some more documentaiton / clarification on what the workflow around the state and the member functions of abstract_block_chain is would help a lot.
4af1341 to
252ad19
Compare
|
Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72) ❓ No new benchmark data submitted. ❓ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
252ad19 to
8a27177
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I did a first pass and added some comments, mostly about comprehensibility / naming.
388c399 to
b06ad9e
Compare
|
Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17) ❓ No new benchmark data submitted. ❓ |
fc46cfc to
ebf4f9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things are coming together! GitHub decided to post some of my comments early for some reason... Well, here's the rest :P
By the way, we have a section on diverging execution in ours docs / website, it would be great if you could add a sentence there mentioning this check and how it can be enabled!
| std::transform(div_test.begin(), div_test.end(), std::back_inserter(sizes), [](auto& div) { return div->m_task_records.size(); }); | ||
| auto [min, max] = std::minmax_element(sizes.begin(), sizes.end()); | ||
|
|
||
| std::vector<per_node_task_hashes> extended_lifetime_hashes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this is for..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extended_lifetime_hashes are needed, because of how the single node divergence_test_communicator works. The idea is following:
- each simulated node calls allgather sequentially. The send/receive buffer is saved internally as simple pointers.
- When all simulated nodes called the allgather the data is copied.
To prevent those send/receive buffers for per_node_task_hashes from going out of scope we need to "extend" their lifetime. For this, I included the extended_lifetime_hashes vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so I took a look at this again and this does not work: divergence_block_chain::collect_hashes returns a per_node_task_hashes by value, which itself contains a std::vector<task_hash>. So extended_lifetime_hashes[0].data() is no longer the same pointer as stored in divergence_test_communicator::m_gather_data, and you're effectively reading already free'd memory. If you run the tests with AddressSanitizier (and without mimalloc), you will get an error for it.
I think the only way the collective operations can be properly mocked (that is, to have them block until all "ranks" have called them), is to put each block chain into a separate thread. The upside is that you won't need any of the pre_check / post_check hackery. I would suggest to abstract all of that into a divergence_check_test_context (which would subsume the divergence_test_communicator_provider) that also wraps a number of divergence_block_chains, each in their own thread, and possibly even the task_test_context for each.
ebf4f9a to
3fe7ae2
Compare
3fe7ae2 to
3adceaa
Compare
3adceaa to
6eb38e4
Compare
6eb38e4 to
1a6e3ee
Compare
| std::unique_ptr<communicator> m_communicator; | ||
|
|
||
| void divergence_out(const divergence_map& check_map, const int task_num); | ||
| void reprot_divergence(const divergence_map& check_map, const int task_num); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| void reprot_divergence(const divergence_map& check_map, const int task_num); | |
| void report_divergence(const divergence_map& check_map, const int task_num); |
| MPI_Comm_dup(MPI_COMM_WORLD, &comm); | ||
| m_divergence_check = std::make_unique<divergence_checker>(*m_task_recorder, std::make_unique<mpi_communicator>(comm), m_test_mode); | ||
| #endif | ||
| // if (m_cfg->is_recording()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment
| // // Sychronize all nodes before reseting shuch that we don't get into a deadlock | ||
| // // With this barrier we can be shure that every node is fully finished and all mpi operations are done. (Sending ...) | ||
| // MPI_Barrier(MPI_COMM_WORLD); | ||
| // m_divergence_check.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, other than a few typos and commented out code.
| } | ||
|
|
||
| #if CELERITY_DIVERGENCE_CHECK | ||
| // Sychronize all nodes before reseting shuch that we don't get into a deadlock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few typos "Sychronize" "reseting" "shuch"
| CELERITY_FEATURE_UNNAMED_KERNELS=$<BOOL:${CELERITY_FEATURE_UNNAMED_KERNELS}> | ||
| CELERITY_DETAIL_HAS_NAMED_THREADS=$<BOOL:${CELERITY_DETAIL_HAS_NAMED_THREADS}> | ||
| CELERITY_ACCESSOR_BOUNDARY_CHECK=$<BOOL:${CELERITY_ACCESSOR_BOUNDARY_CHECK}> | ||
| CELERITY_DIVERGENCE_CHECK=$<BOOL:${CELERITY_DIVERGENCE_CHECK}> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also needs to be added to cmake/celerity-config.cmake.in!
| #if CELERITY_DIVERGENCE_CHECK | ||
| // divergence checker needs recording | ||
| m_recording = true; | ||
| #else | ||
| m_recording = parsed_and_validated_envs.get_or(env_recording, false); | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, maybe we should just get rid of it in a small follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments!
Also since the clang-tidy check for this seems broken in CI: Please go over all new function definitions and make sure parameters that can be const are const.
|
|
||
| private: | ||
| std::thread m_thread; | ||
| bool m_is_running = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m_is_running must be protected by a mutex!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just use an atomic.
| divergence_block_chain& operator=(const divergence_block_chain&) = delete; | ||
| divergence_block_chain& operator=(divergence_block_chain&&) = delete; | ||
|
|
||
| bool check_for_divergence(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a comment on what a true / false return value means. It reads like this would return true when there was divergence, but the function actually throws in that case!
| m_thread = std::thread(&divergence_checker::run, this); | ||
| m_is_running = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like there is a race between setting m_is_running = true here and the check for m_is_running in run(). I suggest you reverse the order to fix this.
| m_thread = std::thread(&divergence_checker::run, this); | |
| m_is_running = true; | |
| m_is_running = true; | |
| m_thread = std::thread(&divergence_checker::run, this); |
| #if CELERITY_DIVERGENCE_CHECK | ||
| // divergence checker needs recording | ||
| m_recording = true; | ||
| #else | ||
| m_recording = parsed_and_validated_envs.get_or(env_recording, false); | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded, they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.
| if(min_hash_count == 0) { | ||
| if(max_hash_count != 0 && m_local_nid == 0) { | ||
| check_for_deadlock(); | ||
| } else if(max_hash_count == 0) { | ||
| return true; | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: What is happening here?
| m_hashes_added = m_task_records.size(); | ||
| } | ||
|
|
||
| void divergence_block_chain::clear(const int min_progress) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming: This does not clear the chain (similar to what vector::clear would do. Maybe something along the lines of erase_front, prune_leading, or similar?
|
Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points:
|
This pull request adds a divergence checking mechanism for tasks.
It does so by periodically gathering hashes of all tasks from task_recording and comparing them. When a divergence is detected an error containing the diverged tasks and their full task record is printed like:
Additionally it also includes a rudimentary deadlock detection for nodes which are stuck by printing a warning after a given amount of time (eg 10 seconds):
All of this is automatically turned on by running the program with task recording enabled.