jobordner

This is follow-up work to texascale-fixes to address possible issues
for large-scale runs. The main test problem used is 1024^3 4-level AMR
on 1024 nodes (50K cores) of TACC Frontera. Warning: this is a big PR.
The main updates are the following:

    1. I/O performance bug-fix

    A bug was fixed in the FileHdf5 class, where attributes were
    previously opened but never closed when reading HDF5 files. Fixing
    this sped up restart by roughly 500x, with higher speedups for
    larger files. Restart now takes seconds instead of an hour.
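
The open-without-close pattern described above can be sketched as follows. This is a minimal illustration, not the FileHdf5 code: `open_attr()` and `close_attr()` are hypothetical stand-ins for an open/close pair such as `H5Aopen`/`H5Aclose`, and they only count live handles so the leak is observable without linking HDF5. `AttrGuard` is likewise an assumed name.

```cpp
#include <cassert>

// Hypothetical stand-ins for an open/close pair such as H5Aopen/H5Aclose.
// They only track how many handles are currently open.
static int g_open_handles = 0;
inline int  open_attr()     { return ++g_open_handles; }
inline void close_attr(int) { --g_open_handles; }

// RAII guard (assumed name): closes the handle when it goes out of scope,
// so every open is paired with exactly one close.
class AttrGuard {
public:
  AttrGuard() : id_(open_attr()) {}
  ~AttrGuard() { close_attr(id_); }
  AttrGuard(const AttrGuard&) = delete;
  AttrGuard& operator=(const AttrGuard&) = delete;
  int id() const { return id_; }
private:
  int id_;
};

// Leaky pattern: the handle is opened but never closed.
inline void read_attr_leaky() { int id = open_attr(); (void)id; }

// Fixed pattern: the guard closes the handle on scope exit.
inline void read_attr_fixed() { AttrGuard attr; (void)attr.id(); }

// Number of handles that are currently open.
inline int live_handles() { return g_open_handles; }
```

Calling `read_attr_leaky()` once per attribute leaves one more live handle each time, whereas `read_attr_fixed()` leaves the count unchanged.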

    2. Adapt robustification

    Previously the "adapt" phase would hang if Charm++ was compiled
    using the "randomized queues" setting, which indicated the
    possibility of race conditions in regular use. This has been
    addressed by including the ordering of messages within the messages
    themselves (as a count) to ensure messages are applied in order,
    even if they arrive out of order. Adapt now runs smoothly with
    Charm++ randomized queues.
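
The idea of carrying a count in each message can be sketched as below. This is a simplified illustration under stated assumptions, not the Adapt class: `CountedLevel` and its members are hypothetical names, and the rule shown is only "apply an update if its count is not older than the one already stored".

```cpp
#include <cassert>

// Hypothetical record: a face level together with the count carried by
// the message that last set it.
struct CountedLevel {
  int level = 0;
  int count = -1;  // -1 means no message has been applied yet

  // Apply the update only if its count is not older than the stored one.
  // Returns true if the update was applied, false if it was stale.
  bool update(int new_level, int msg_count) {
    if (msg_count < count) return false;  // stale message: ignore it
    level = new_level;
    count = msg_count;
    return true;
  }
};
```

If the message with count 1 arrives before the message with count 0, the late message is rejected and the stored level stays at the newer value, so the result does not depend on arrival order.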

    3. Order Method cleaning

    Previously the "order_morton" method had to be scheduled before both
    "check" and "balance" methods with a schedule that accommodated
    both. Now the "order" method must be scheduled before each, but it
    can be scheduled more than once, which simplifies scheduling. The
    "ordering" parameters for "check" and "balance" were also removed
    (they use the most recently computed ordering). The "order" method
    itself was generalized for any ordering, though currently "morton"
    is still the only one available. See the updated documentation for
    more details and an example.
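
For readers unfamiliar with Morton (Z-order) ordering, here is a generic sketch of the technique. It is not the MethodOrder implementation: it assumes blocks on a single uniform level identified by integer coordinates, and `morton3` and `order_indices` are hypothetical names. How the actual method orders blocks across AMR levels is not shown here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Interleave the low `bits` bits of (ix, iy, iz) into one Z-order key.
// Bit b of ix lands at position 3*b, iy at 3*b+1, iz at 3*b+2.
inline std::uint64_t morton3(std::uint32_t ix, std::uint32_t iy,
                             std::uint32_t iz, int bits = 21) {
  std::uint64_t key = 0;
  for (int b = 0; b < bits; ++b) {
    key |= std::uint64_t((ix >> b) & 1u) << (3 * b);
    key |= std::uint64_t((iy >> b) & 1u) << (3 * b + 1);
    key |= std::uint64_t((iz >> b) & 1u) << (3 * b + 2);
  }
  return key;
}

using Coord = std::array<std::uint32_t, 3>;

// For each input block, return its position (order index) in Morton order.
// The matching "count" is simply coords.size().
inline std::vector<std::size_t> order_indices(const std::vector<Coord>& coords) {
  std::vector<std::size_t> perm(coords.size());
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  std::sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
    return morton3(coords[a][0], coords[a][1], coords[a][2]) <
           morton3(coords[b][0], coords[b][1], coords[b][2]);
  });
  std::vector<std::size_t> index(coords.size());
  for (std::size_t k = 0; k < perm.size(); ++k) index[perm[k]] = k;
  return index;
}
```

An (index, count) pair of this kind is what a consumer such as a load balancer or checkpoint writer would need in order to place each block within the global ordering.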

Some general cleaning was performed as well, including removing
debugging ifdefs to declutter the code. Commit messages include
more details about updates.

jobordner added 16 commits July 11, 2023 15:33
  - Added MethodOrder to replace existing MethodOrderMorton

Cello
  - Added cello::num_children(block) to return number of children for the
    given block

Charm

  - Added MsgOrder Charm++ message type

Control

  - bug fix [race condition]: moved r_initialize_block_array() call into
    the p_set_block_array() body instead of immediately after the
    (asynchronous) call to p_set_block_array()
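
The race fixed above follows a common pattern with asynchronous calls, sketched here with a toy queue. This is not Charm++ code: `ToyScheduler` is a hypothetical stand-in that simply runs queued calls later, and the strings "set" and "initialize" stand for the two steps named in the commit message.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an asynchronous runtime: send() only queues the call,
// and run() executes the queued calls later.
struct ToyScheduler {
  std::queue<std::function<void()>> pending;
  void send(std::function<void()> f) { pending.push(std::move(f)); }
  void run() {
    while (!pending.empty()) {
      auto f = std::move(pending.front());
      pending.pop();
      f();
    }
  }
};

// Racy pattern: the dependent step runs right after the asynchronous send,
// which is before the queued "set" call has actually executed.
inline std::vector<std::string> racy_order() {
  ToyScheduler s;
  std::vector<std::string> log;
  s.send([&] { log.push_back("set"); });
  log.push_back("initialize");
  s.run();
  return log;
}

// Fixed pattern: the dependent step is moved into the body of the
// asynchronous call, so it always runs after "set".
inline std::vector<std::string> fixed_order() {
  ToyScheduler s;
  std::vector<std::string> log;
  s.send([&] { log.push_back("set"); log.push_back("initialize"); });
  s.run();
  return log;
}
```

In a real runtime the racy version may succeed or fail depending on scheduling. The toy queue always shows the failing order, which makes the dependency visible.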

Mesh

  - removed unused ip_source parameter to Block constructor
  - added p_method_order_foo() entry methods for MethodOrder

Problem

  - added "order" Method type

Parameters

  - added Method:order:type parameter

Balance

  - replace reading Scalars "order_morton:index" and
    "order_morton:count" with Block::get_order(index,count)
  - moved call to reset Block::ip_next() to -1
  - added call to clear sync_method_balance_

Cleaning

  - addressing some compiler warning messages

Data

  - cleaning Scalar data access: added cello::scalar<T>(block,i) methods

Parameters

  - bug fix: logic error in the Config::read assertion that checks whether active zones are even

Io

  - Renamed [index|count]_order as order_[index|count]
  - added order_next

Mesh

  - renamed Block::[index|count]_order as order_[index|count]
  - added Block::order_next

Method Order

  - Removed unused "next" Scalar
  - Added "type" parameter
  - Bug fix: replace cello::num_children() with cello::num_children(block)
  - Changed scalar access to use cello::scalar<T>

Method Check

  - Update EnzoMethodCheck to use Block::order_[index|count] instead of block scalars

Problem

Method

  - removed MethodOrderMorton
  - removed order parameters from MethodCheck: uses Block::order_index
  - removed debug code from EnzoMethodCheck.cpp
  - removed EnzoMethodCheck::order_ attribute

Parameters

  - renamed method_order_type as method_order_ordering to fix
    conflict with method_type
  - removed method_check_ordering parameter

Io

  - removed ordering from IoEnzoWriter

This changeset only addresses a couple of compiler warnings; the remaining
description below pertains to previous changeset 277f661, which is
missing a description.

  - Added MsgAdapt::count_ to maintain operation order despite
    out-of-order messages
  - Added MsgCoarsen::face_level_count_ to maintain operation order
    despite out-of-order messages

Adapt

  - removed debug code from control_adapt.cpp
  - merged reset_bounds() and initialize_self() into single initialize_bounds()
    in Adapt class
  - removed Adapt::LevelType class enum
  - added Block child face level counts to maintain operation order
  - added Adapt::count for local counter to maintain operation order
  - update face level in Adapt::set_face_level_<foo> only if counts are
    not older
  - added "min_level_" to Index is_sibling() / is_nibling() calls
  - added Adapt face_level counts for maintaining operation order
jobordner marked this pull request as draft August 19, 2023 19:25
jobordner added the bug, performance, and io:checkpoint-restart labels Aug 19, 2023
Mesh

  - Implementing Block ordering attributes

Io

  - Updated IoBlock for Block ordering attributes