Add celerity blockchain for task divergence checking #217

GagaLP · 2023-10-02T15:43:42Z

This pull request adds a divergence checking mechanism for tasks.

It does so by periodically gathering hashes of all tasks from task_recording and comparing them. When a divergence is detected, an error containing the diverged tasks and their full task records is printed, for example:

[2023-10-02 17:31:07.784] [error] Divergence detected in task graph at index 1:

0x471b0f1db5e4b8e6 on nodes 1 
0xe9fbff654e3748e1 on nodes 0 

[2023-10-02 17:31:07.784] [error] Task record for hash 0x471b0f1db5e4b8e6:

id: 1, debug_name: task_b_4, type: device-compute, cgid: 0
geometry:
         dimensions: 2, global_size: [1,1,1], global_offset: [0,0,0], granularity: [1,1,1]
accesses: 
         bid: 0, buffer_name: , mode: R, req: {[64,0,0] - [128,1,1]}
dependencies: 
         node: 0, kind: true-dep, origin: last-epoch
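
The comparison step described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the names task_hash, node_id, hash_to_nodes, group_by_hash and has_diverged are hypothetical, and it assumes the hashes of all nodes for one task index have already been gathered into a single vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical names for illustration; not the PR's actual types.
using task_hash = std::uint64_t;
using node_id = std::size_t;
// Maps each distinct hash seen at one task index to the nodes that reported it.
using hash_to_nodes = std::map<task_hash, std::vector<node_id>>;

// Groups the hashes reported by all nodes for a single task index.
// hashes_per_node[n] is the hash that node n computed for that task.
inline hash_to_nodes group_by_hash(const std::vector<task_hash>& hashes_per_node) {
	assert(!hashes_per_node.empty()); // at least one node must have reported
	hash_to_nodes groups;
	for(node_id nid = 0; nid < hashes_per_node.size(); ++nid) {
		groups[hashes_per_node[nid]].push_back(nid);
	}
	return groups;
}

// All nodes agree on a task exactly when they reported one and the same hash.
inline bool has_diverged(const hash_to_nodes& groups) { return groups.size() > 1; }
```

With the two hashes from the log above, group_by_hash yields two groups (one per node), which is the kind of "hash on nodes ..." listing the error message prints.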

It also includes rudimentary deadlock detection: if nodes appear to be stuck, a warning is printed after a given amount of time (e.g. 10 seconds):

[warning] After 10 seconds of waiting nodes 1, did not move to the next task. The runtime might be stuck.
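
A timeout-based check of this kind might look like the sketch below. It is an assumption-laden illustration only: stall_watchdog and its members are hypothetical names, and the time is passed in explicitly so the logic can be exercised without actually waiting.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper for illustration; the names are not taken from the PR.
class stall_watchdog {
  public:
	using clock = std::chrono::steady_clock;

	stall_watchdog(const clock::duration timeout, const clock::time_point now) : m_timeout(timeout), m_last_progress(now) {}

	// Call periodically with the number of tasks processed so far.
	// Returns true if there was no progress for at least the configured timeout.
	bool update(const std::size_t tasks_done, const clock::time_point now) {
		assert(tasks_done >= m_tasks_done); // progress counters never decrease
		if(tasks_done > m_tasks_done) {
			m_tasks_done = tasks_done;
			m_last_progress = now;
			return false;
		}
		return now - m_last_progress >= m_timeout;
	}

  private:
	clock::duration m_timeout;
	clock::time_point m_last_progress;
	std::size_t m_tasks_done = 0;
};
```

A caller would print the warning whenever update returns true. Note that a long-running task is indistinguishable from a stuck node under this scheme, so false positives are possible.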

All of this is automatically turned on by running the program with task recording enabled.

github-actions · 2023-10-02T15:44:23Z

Check-perf-impact results: (5a19ced85f862a00d0114dd241122462)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions

clang-tidy made some suggestions

include/divergence_block_chain.h

include/recorders.h

include/task.h

src/divergence_block_chain.cc

src/runtime.cc

test/divergence_check_tests.cc

github-actions

clang-tidy made some suggestions

include/divergence_block_chain.h

github-actions

clang-tidy made some suggestions

include/divergence_block_chain.h

PeterTh

Regarding the overall design, I mostly like it, but one thing I noticed is that there is a lot of code that's purely for testing which is in the actual "main logic" part. So far, I believe we mostly try to keep testing-only code in testing helpers, which makes it easier to see at a glance what is actually happening in a production run vs. what is done only in testing. We might want to move the testing-related functionality out of the main code.

include/divergence_block_chain.h

include/grid.h

include/handler.h

include/print_utils.h

src/divergence_block_chain.cc

test/debug_naming_tests.cc

test/divergence_check_tests.cc

test/system/distr_tests.cc

github-actions

clang-tidy made some suggestions

include/divergence_block_chain.h

PeterTh

LGTM, other than the minor documentation and formatting things discussed offline.

I particularly appreciate moving most of the testing logic and functionality outside the main code path.

fknorr

Thanks! This is becoming a valuable on-boarding debug feature.

I have not looked closely at the algorithm so far, but some more documentation / clarification of the workflow around the state and the member functions of abstract_block_chain would help a lot.

test/divergence_check_test_utils.h

include/divergence_block_chain.h

include/utils.h

src/divergence_block_chain.cc

test/divergence_check_test_utils.h

include/divergence_block_chain.h

github-actions · 2023-11-27T16:08:17Z

Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions

clang-tidy made some suggestions

include/communicator.h

test/divergence_check_test_utils.h

psalz

Thanks! I did a first pass and added some comments, mostly about comprehensibility / naming.

include/divergence_block_chain.h

include/communicator.h

include/divergence_block_chain.h

include/communicator.h

include/divergence_block_chain.h

github-actions · 2023-12-06T14:59:54Z

Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions

clang-tidy made some suggestions

test/divergence_check_test_utils.h

test/divergence_check_tests.cc

test/system/distr_tests.cc

include/recorders.h

include/runtime.h

include/mpi_communicator.h

src/runtime.cc

include/ranges.h

include/grid.h

include/communicator.h

psalz

Things are coming together! GitHub decided to post some of my comments early for some reason... Well, here's the rest :P

By the way, we have a section on diverging execution in our docs / website; it would be great if you could add a sentence there mentioning this check and how it can be enabled!

src/runtime.cc

CHANGELOG.md

src/divergence_block_chain.cc

include/divergence_block_chain.h

test/divergence_check_tests.cc

psalz · 2023-12-12T19:13:43Z

test/divergence_check_test_utils.h

+		std::transform(div_test.begin(), div_test.end(), std::back_inserter(sizes), [](auto& div) { return div->m_task_records.size(); });
+		auto [min, max] = std::minmax_element(sizes.begin(), sizes.end());
+
+		std::vector<per_node_task_hashes> extended_lifetime_hashes;


I don't understand what this is for..?

The extended_lifetime_hashes are needed because of how the single-node divergence_test_communicator works. The idea is the following:

Each simulated node calls allgather sequentially. The send/receive buffers are saved internally as plain pointers.

Once all simulated nodes have called allgather, the data is copied.

To prevent the send/receive buffers for per_node_task_hashes from going out of scope, we need to "extend" their lifetime. For this, I included the extended_lifetime_hashes vector.

Okay so I took a look at this again and this does not work: divergence_block_chain::collect_hashes returns a per_node_task_hashes by value, which itself contains a std::vector<task_hash>. So extended_lifetime_hashes[0].data() is no longer the same pointer as stored in divergence_test_communicator::m_gather_data, and you're effectively reading already freed memory. If you run the tests with AddressSanitizer (and without mimalloc), you will get an error for it.

I think the only way the collective operations can be properly mocked (that is, to have them block until all "ranks" have called them), is to put each block chain into a separate thread. The upside is that you won't need any of the pre_check / post_check hackery. I would suggest to abstract all of that into a divergence_check_test_context (which would subsume the divergence_test_communicator_provider) that also wraps a number of divergence_block_chains, each in their own thread, and possibly even the task_test_context for each.
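
As a rough sketch of the thread-per-rank idea, a blocking allgather can be mocked in-process with a mutex and a condition variable. Everything here is hypothetical (mock_allgather, run_ranks, the use of int payloads) and only illustrates the blocking behaviour, not the suggested divergence_check_test_context itself.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical in-process stand-in for a blocking allgather across "ranks".
// Each rank runs in its own thread; no caller returns before all have contributed.
class mock_allgather {
  public:
	explicit mock_allgather(const std::size_t num_ranks) : m_slots(num_ranks) {}

	std::vector<int> operator()(const std::size_t rank, const int value) {
		std::unique_lock<std::mutex> lock(m_mutex);
		assert(rank < m_slots.size());
		m_slots[rank] = value; // stored by value, so no caller-owned buffer can dangle
		const std::size_t my_round = m_round;
		if(++m_arrived == m_slots.size()) {
			m_result = m_slots; // the last arrival publishes the gathered data
			m_arrived = 0;
			++m_round;
			m_cv.notify_all();
		} else {
			m_cv.wait(lock, [&] { return m_round != my_round; });
		}
		return m_result;
	}

  private:
	std::mutex m_mutex;
	std::condition_variable m_cv;
	std::vector<int> m_slots;
	std::vector<int> m_result;
	std::size_t m_arrived = 0;
	std::size_t m_round = 0;
};

// Runs one thread per simulated rank; every rank contributes rank * 10.
inline std::vector<std::vector<int>> run_ranks(const std::size_t num_ranks) {
	mock_allgather gather(num_ranks);
	std::vector<std::vector<int>> results(num_ranks);
	std::vector<std::thread> threads;
	for(std::size_t rank = 0; rank < num_ranks; ++rank) {
		threads.emplace_back([&gather, &results, rank] { results[rank] = gather(rank, static_cast<int>(rank) * 10); });
	}
	for(auto& t : threads) {
		t.join();
	}
	return results;
}
```

Because each rank blocks inside the call until the last one arrives, the mock owns all data for the duration of the exchange and no lifetime extension on the caller side is needed.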

src/divergence_block_chain.cc

test/divergence_check_tests.cc

psalz · 2023-12-19T15:29:41Z

include/divergence_checker.h

 	std::unique_ptr<communicator> m_communicator;

-	void divergence_out(const divergence_map& check_map, const int task_num);
+	void reprot_divergence(const divergence_map& check_map, const int task_num);


Suggested change

-	void reprot_divergence(const divergence_map& check_map, const int task_num);
+	void report_divergence(const divergence_map& check_map, const int task_num);

psalz · 2023-12-19T15:36:55Z

src/runtime.cc

+		MPI_Comm_dup(MPI_COMM_WORLD, &comm);
+		m_divergence_check = std::make_unique<divergence_checker>(*m_task_recorder, std::make_unique<mpi_communicator>(comm), m_test_mode);
+#endif
+		// if (m_cfg->is_recording()) {


psalz · 2023-12-19T15:37:07Z

src/runtime.cc

+			// // Sychronize all nodes before reseting shuch that we don't get into a deadlock
+			// // With this barrier we can be shure that every node is fully finished and all mpi operations are done. (Sending ...)
+			// MPI_Barrier(MPI_COMM_WORLD);
+			// m_divergence_check.reset();


PeterTh

LGTM, other than a few typos and commented out code.

PeterTh · 2023-12-19T15:38:44Z

src/runtime.cc

 		}

+#if CELERITY_DIVERGENCE_CHECK
+		// Sychronize all nodes before reseting shuch that we don't get into a deadlock


A few typos: "Sychronize", "reseting", "shuch".

psalz · 2023-12-19T15:49:47Z

CMakeLists.txt

  CELERITY_FEATURE_UNNAMED_KERNELS=$<BOOL:${CELERITY_FEATURE_UNNAMED_KERNELS}>
  CELERITY_DETAIL_HAS_NAMED_THREADS=$<BOOL:${CELERITY_DETAIL_HAS_NAMED_THREADS}>
  CELERITY_ACCESSOR_BOUNDARY_CHECK=$<BOOL:${CELERITY_ACCESSOR_BOUNDARY_CHECK}>
+  CELERITY_DIVERGENCE_CHECK=$<BOOL:${CELERITY_DIVERGENCE_CHECK}>


Also needs to be added to cmake/celerity-config.cmake.in!

psalz · 2023-12-19T15:53:39Z

src/config.cc

+#if CELERITY_DIVERGENCE_CHECK
+			// divergence checker needs recording
+			m_recording = true;
+#else
 			m_recording = parsed_and_validated_envs.get_or(env_recording, false);
+#endif


Should we print a warning that recording is being force-enabled here? What about the user explicitly setting CELERITY_RECORDING=0? cc @PeterTh @fknorr

I'm not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded; they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.

Agreed, maybe we should just get rid of it in a small follow-up PR.
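
One way to read this suggestion is that recording becomes a derived setting rather than a user-facing one. The following is only a sketch under that assumption; debug_features and needs_recording are hypothetical names and do not exist in the PR.

```cpp
// Hypothetical settings struct; the field names are illustrative only.
struct debug_features {
	bool divergence_check = false;
	bool print_graphs = false;
};

// Recording is treated as an implementation detail here:
// it is enabled if and only if some feature consumes the recorded tasks.
constexpr bool needs_recording(const debug_features& features) {
	return features.divergence_check || features.print_graphs;
}
```

With this shape there is no conflict to resolve between an explicit CELERITY_RECORDING=0 and a feature that requires recording, because users only ever select the features.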

fknorr

A few more comments!

Also since the clang-tidy check for this seems broken in CI: Please go over all new function definitions and make sure parameters that can be const are const.

include/communicator.h

fknorr · 2023-12-20T09:40:11Z

include/divergence_checker.h

+
+  private:
+	std::thread m_thread;
+	bool m_is_running = false;


m_is_running must be protected by a mutex!

Or just use an atomic.
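
A minimal sketch of the atomic variant, assuming a worker that polls the flag in its loop. The class background_worker and its members are hypothetical; only the handling of m_is_running mirrors the discussion.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical background worker; not the PR's divergence_checker.
class background_worker {
  public:
	background_worker() = default;
	background_worker(const background_worker&) = delete;
	background_worker& operator=(const background_worker&) = delete;
	~background_worker() { stop(); }

	void start() {
		assert(!m_thread.joinable()); // starting twice is not supported in this sketch
		// Set the flag before the thread exists so run() cannot observe a stale false.
		m_is_running.store(true);
		m_thread = std::thread(&background_worker::run, this);
	}

	void stop() {
		m_is_running.store(false);
		if(m_thread.joinable()) { m_thread.join(); }
	}

	int iterations() const { return m_iterations.load(); }

  private:
	void run() {
		while(m_is_running.load()) {
			m_iterations.fetch_add(1);
			std::this_thread::yield();
		}
	}

	std::atomic<bool> m_is_running{false};
	std::atomic<int> m_iterations{0};
	std::thread m_thread;
};
```

An atomic is sufficient here because the flag is the only shared state between the two threads; a mutex would be the better fit if the flag had to change together with other data.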

fknorr · 2023-12-20T09:45:07Z

include/divergence_checker.h

+	divergence_block_chain& operator=(const divergence_block_chain&) = delete;
+	divergence_block_chain& operator=(divergence_block_chain&&) = delete;
+
+	bool check_for_divergence();


Needs a comment on what a true / false return value means. It reads like this would return true when there was divergence, but the function actually throws in that case!

fknorr · 2023-12-20T09:46:22Z

include/divergence_checker.h

+		m_thread = std::thread(&divergence_checker::run, this);
+		m_is_running = true;


It feels like there is a race between setting m_is_running = true here and the check for m_is_running in run(). I suggest you reverse the order to fix this.

Suggested change

-		m_thread = std::thread(&divergence_checker::run, this);
-		m_is_running = true;
+		m_is_running = true;
+		m_thread = std::thread(&divergence_checker::run, this);

fknorr · 2023-12-20T09:50:56Z

src/config.cc

+#if CELERITY_DIVERGENCE_CHECK
+			// divergence checker needs recording
+			m_recording = true;
+#else
 			m_recording = parsed_and_validated_envs.get_or(env_recording, false);
+#endif


I'm not a big fan of CELERITY_RECORDING as it exists for that very reason. The user does not care about DAGs being recorded; they care about divergence checks or graph printing, from which we can decide whether recording needs to be active or not.

fknorr · 2023-12-20T09:53:23Z

src/divergence_checker.cc

+	if(min_hash_count == 0) {
+		if(max_hash_count != 0 && m_local_nid == 0) {
+			check_for_deadlock();
+		} else if(max_hash_count == 0) {
+			return true;
+		}
+		return false;
+	}


Comment: What is happening here?

fknorr · 2023-12-20T10:02:30Z

src/divergence_checker.cc

+	m_hashes_added = m_task_records.size();
+}
+
+void divergence_block_chain::clear(const int min_progress) {


Naming: This does not clear the chain (similar to what vector::clear would do). Maybe something along the lines of erase_front, prune_leading, or similar?
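
To illustrate the difference in meaning, here is a small sketch of what a function with one of the suggested names could do. It is a hypothetical free function over std::vector, not the member function from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: drops only the first `count` entries and keeps the rest,
// whereas std::vector::clear would remove everything.
template <typename T>
void prune_leading(std::vector<T>& records, const std::size_t count) {
	const std::size_t n = std::min(count, records.size());
	records.erase(records.begin(), records.begin() + static_cast<std::ptrdiff_t>(n));
}
```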

psalz · 2023-12-20T13:18:20Z

Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points:

- Deadlock detection as-is would produce too many false positive warnings; not sure yet how to proceed on this.
- Testing infrastructure invokes UB; needs multi-threading to properly mock blocking collective operations.
- We should have a test case (distr_test / integration test?) that exercises the case that one node submits a task while the other does not (the divergence then occurs between that task and the shutdown epoch).

github-actions bot reviewed Oct 2, 2023

GagaLP force-pushed the divergence-check branch from ae68e51 to af692d0 Compare October 5, 2023 12:23

github-actions bot reviewed Oct 5, 2023

GagaLP force-pushed the divergence-check branch from af692d0 to 2dd8a09 Compare October 5, 2023 14:18

github-actions bot reviewed Oct 5, 2023

GagaLP force-pushed the divergence-check branch from 2dd8a09 to 6c3f128 Compare October 9, 2023 08:46

GagaLP requested review from fknorr and psalz October 9, 2023 12:00

GagaLP force-pushed the divergence-check branch 4 times, most recently from ef86cd1 to 9ae8356 Compare October 9, 2023 15:40

PeterTh requested changes Oct 10, 2023

GagaLP force-pushed the divergence-check branch 2 times, most recently from d5d2e90 to b88b60f Compare October 12, 2023 11:45

github-actions bot reviewed Oct 12, 2023

PeterTh approved these changes Oct 12, 2023

GagaLP force-pushed the divergence-check branch 2 times, most recently from bb94e68 to 4af1341 Compare October 12, 2023 14:13

fknorr requested changes Nov 2, 2023

psalz added this to the 0.5.0 milestone Nov 15, 2023

psalz assigned GagaLP Nov 15, 2023

GagaLP force-pushed the divergence-check branch from 4af1341 to 252ad19 Compare November 27, 2023 16:07

github-actions bot reviewed Nov 27, 2023

GagaLP force-pushed the divergence-check branch from 252ad19 to 8a27177 Compare November 29, 2023 11:36

psalz requested changes Nov 29, 2023

GagaLP force-pushed the divergence-check branch 2 times, most recently from 388c399 to b06ad9e Compare December 6, 2023 14:59

GagaLP force-pushed the divergence-check branch 2 times, most recently from fc46cfc to ebf4f9a Compare December 6, 2023 15:32

github-actions bot reviewed Dec 6, 2023

psalz reviewed Dec 12, 2023

psalz requested changes Dec 12, 2023

GagaLP force-pushed the divergence-check branch from ebf4f9a to 3fe7ae2 Compare December 18, 2023 15:19

add celerity blockchain for task divergence checking

fac5661

GagaLP force-pushed the divergence-check branch from 3fe7ae2 to 3adceaa Compare December 18, 2023 15:58

GagaLP force-pushed the divergence-check branch from 3adceaa to 6eb38e4 Compare December 19, 2023 10:04

GagaLP changed the title ~~Added celerity blockchain for task divergence checking~~ Add celerity blockchain for task divergence checking Dec 19, 2023

[no ci] Revision: add celerity blockchain for task divergence checking

1a6e3ee

GagaLP force-pushed the divergence-check branch from 6eb38e4 to 1a6e3ee Compare December 19, 2023 11:49

psalz reviewed Dec 19, 2023

PeterTh approved these changes Dec 19, 2023

psalz reviewed Dec 19, 2023

fknorr requested changes Dec 20, 2023

Some minor refactorings here and there

7a9cfe0

psalz removed this from the 0.5.0 milestone Dec 20, 2023
