Optimize restart #350
Draft: jobordner wants to merge 22 commits into enzo-project:main from jobordner:optimize-restart
- Added MethodOrder to replace existing MethodOrderMorton
Cello
- Added cello::num_children(block) to return number of children for the given block
Charm
- Added MsgOrder Charm++ message type
Control
- bug fix [race condition] moved r_initialize_block_array() call to p_set_block_array() body instead of immediately after (asynchronous) call to p_set_block_array()
Mesh
- removed unused ip_source parameter to Block constructor
- added p_method_order_foo() entry methods for MethodOrder
Problem
- added "order" Method type
Parameters
- added Method:order:type parameter
Balance
- replace reading Scalars "order_morton:index" and "order_morton:count" with Block::get_order(index,count)
- moved call to reset Block::ip_next() to -1
- added call to clear sync_method_balance_

Cleaning
- addressing some compiler warning messages
Data
- cleaning Scalar data access: added cello::scalar<T>(block,i) methods
Parameters
- bug fix: logic error in assertion in Config::read checking active zones being even
Io
- Renamed [index|count]_order as order_[index|count]
- added order_next
Mesh
- renamed Block::[index|count]_order as order_[index|count]
- added Block::order_next
Method Order
- Removed unused "next" Scalar
- Added "type" parameter
- Bug fix: replace cello::num_children() with cello::num_children(block)
- Changed scalar access to use cello::scalar<T>
Method Check
- Update EnzoMethodCheck to use Block::order_[index|count] instead of block scalars

Problem Method
- removed MethodOrderMorton
- removed order parameters from MethodCheck: uses Block::order_index
- removed debug code from EnzoMethodCheck.cpp
- removed EnzoMethodCheck::order_ attribute
Parameters
- renamed method_order_type as method_order_ordering to fix conflict with method_type
- removed method_check_ordering parameter
Io
- removed ordering from IoEnzoWriter

This changeset only addresses a couple of compiler warnings; the remaining description below pertains to previous changeset 277f661, which is missing a description.
- Added MsgAdapt::count_ to maintain operation order despite out-of-order messages
- Added MsgCoarsen::face_level_count_ to maintain operation order despite out-of-order messages
Adapt
- removed debug code from control_adapt.cpp
- merged reset_bounds() and initialize_self() into single initialize_bounds() in Adapt class
- removed Adapt::LevelType class enum
- added Block child face level counts to maintain operation order
- added Adapt::count for local counter to maintain operation order
- update face level in Adapt::set_face_level_<foo> only if counts are not older
- added "min_level_" to Index is_sibling() / is_nibling() calls
- added Adapt face_level counts for maintaining operation order

Mesh
- Implementing Block ordering attributes
Io
- Updated IoBlock for Block ordering attributes

Labels
- bug (Something isn't working)
- io:checkpoint-restart (Issue/PR associated with checkpoint/restart)
- performance

This is follow-up work to texascale-fixes to address possible issues
for large-scale runs. The main test problem used is a 1024^3, 4-level
AMR run on 1024 nodes (50K cores) of TACC Frontera. Warning: this is a
big PR. Main updates include the following:
A bug was fixed in the FileHdf5 class, where attributes were
previously opened but never closed when reading HDF5 files. Fixing
this sped up restart by roughly 500x, with larger speedups for
larger files. Restart now takes seconds instead of an hour.
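The open-without-close pattern can be illustrated with a small sketch. This is not the FileHdf5 code and does not use the real HDF5 API: MockLibrary, ScopedAttribute, and both read functions are hypothetical names that only track which handles are still open. In real HDF5 the call that releases an attribute handle is H5Aclose; whether the actual fix calls it explicitly or through a guard object is not stated here.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for a handle-based C library such as HDF5.
// It only records which handles are currently open.
struct MockLibrary {
  std::set<int> open_handles;
  int next_id = 1;

  int open_attribute() {
    int id = next_id++;
    open_handles.insert(id);
    return id;
  }
  void close_attribute(int id) { open_handles.erase(id); }
};

// RAII guard (assumed design): opens an attribute on construction and
// closes it when the guard goes out of scope.
class ScopedAttribute {
public:
  explicit ScopedAttribute(MockLibrary& lib)
      : lib_(lib), id_(lib.open_attribute()) {}
  ~ScopedAttribute() { lib_.close_attribute(id_); }
  ScopedAttribute(const ScopedAttribute&) = delete;
  ScopedAttribute& operator=(const ScopedAttribute&) = delete;
  int id() const { return id_; }

private:
  MockLibrary& lib_;
  int id_;
};

// Leaky pattern: every read opens an attribute and never closes it,
// so the set of open handles grows with the number of reads.
inline void read_attributes_leaky(MockLibrary& lib, int n) {
  for (int i = 0; i < n; ++i) {
    int id = lib.open_attribute();
    (void)id;  // the attribute would be read here
  }
}

// Fixed pattern: each attribute is closed at the end of its iteration.
inline void read_attributes_scoped(MockLibrary& lib, int n) {
  for (int i = 0; i < n; ++i) {
    ScopedAttribute attr(lib);
    (void)attr.id();  // the attribute would be read here
  }
}
```

After 100 reads the leaky version leaves 100 handles open, while the scoped version leaves none.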
Previously, the "adapt" phase would hang if Charm++ was compiled
with the "randomized queues" setting, which indicated the
possibility of race conditions in regular use. This has been
addressed by including the ordering of messages within the messages
themselves (as a count) to ensure messages are applied in order,
even if they arrive out of order. Adapt now runs smoothly with
Charm++ randomized queues.
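A minimal sketch of the count idea, assuming the simplest form of the rule stated in the commit messages ("update face level ... only if counts are not older"). FaceLevel, LevelMsg, and apply_if_not_older are hypothetical names for illustration; the actual MsgAdapt::count_ and Adapt logic may differ.

```cpp
#include <cassert>

// Hypothetical record of a neighbor's refinement level on one face,
// tagged with the count of the update that produced it.
struct FaceLevel {
  int level = 0;
  int count = -1;  // -1 means no update has been applied yet
};

// Hypothetical message: the sender increments `count` for each update
// it issues, so a larger count means a more recent update.
struct LevelMsg {
  int level;
  int count;
};

// Apply the message only if its count is not older than the count of
// the update already stored. Returns true if the message was applied.
inline bool apply_if_not_older(FaceLevel& face, const LevelMsg& msg) {
  if (msg.count < face.count) {
    return false;  // stale: a more recent update has already been applied
  }
  face.level = msg.level;
  face.count = msg.count;
  return true;
}
```

If the update with count 2 arrives before the one with count 1, the late message is ignored and the stored level stays at the newer value, so the final state does not depend on delivery order.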
Previously the "order_morton" method had to be scheduled before both
"check" and "balance" methods with a schedule that accommodated
both. Now the "order" method must still be scheduled before each,
but it can be scheduled more than once, which simplifies scheduling.
The "ordering" parameters for "check" and "balance" were also removed
(they just use the last ordering called). The "order" method itself
was generalized for any ordering, though currently "morton" is still
the only one available. See the updated documentation for more
details and an example.
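For readers unfamiliar with Morton (Z-order) ordering, here is a generic sketch for a uniform 3D grid. It is not the MethodOrder implementation, which has to order Blocks across an AMR hierarchy; morton3 is a hypothetical helper that only shows the bit interleaving behind the name.

```cpp
#include <cassert>
#include <cstdint>

// Interleave the low 21 bits of x, y, and z into one 63-bit Morton code.
// Bit i of x lands at position 3*i, bit i of y at 3*i + 1, and
// bit i of z at 3*i + 2.
inline std::uint64_t morton3(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
  std::uint64_t code = 0;
  for (int bit = 0; bit < 21; ++bit) {
    code |= static_cast<std::uint64_t>((x >> bit) & 1u) << (3 * bit);
    code |= static_cast<std::uint64_t>((y >> bit) & 1u) << (3 * bit + 1);
    code |= static_cast<std::uint64_t>((z >> bit) & 1u) << (3 * bit + 2);
  }
  return code;
}
```

Sorting grid cells by this code visits them along a Z-shaped curve that tends to keep spatially nearby cells close together in the ordering, which is why such orderings are commonly used for load balancing and file layout.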
Some general cleaning was performed as well, including removing
debugging ifdefs to declutter the code. Commit messages include
more details about updates.