Skip to content

Conversation

brittonsmith
Copy link
Member

@brittonsmith brittonsmith commented Jul 18, 2025

PR Summary

This allows users to define or override properties of on-disk fields, namely units, aliases, and display_name. This is particularly useful if a dataset contains extra fields not defined in a frontend for which the user would like to define units. The syntax is:

ds = yt.load(...)
ds.register_field(("enzo", "DinosaurDensity"), units="code_mass/code_length**3")

Happy to add docs and tests if this looks broadly acceptable.

PR Checklist

  • New features are documented, with docstrings and narrative docs
  • Adds a test for any bugs fixed. Adds tests for new features.

@brittonsmith brittonsmith added the enhancement Making something better label Jul 18, 2025
matthewturk
matthewturk previously approved these changes Jul 18, 2025
Copy link
Member

@matthewturk matthewturk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like it!

@neutrinoceros
Copy link
Member

would it be possible to make the api ds.fields.register instead ? I'm not sure what type ds.fields is, but that's where I'd look first as a user, and Dataset is already very busy as far as attributes and methods go.

@matthewturk
Copy link
Member

I think that might require that the index be created first? Which would get in the way. I do like it more, though.

@brittonsmith
Copy link
Member Author

I think that might require that the index be created first? Which would get in the way. I do like it more, though.

Yes, this is correct. You need to call this before the field_list is populated.

@neutrinoceros
Copy link
Member

so... maybe it should be a keyword argument to yt.load then ?

@matthewturk
Copy link
Member

I think both would be good, but I think for clarity ds.register_fields is probably a bit cleaner than adding more to load.

@brittonsmith
Copy link
Member Author

Unless I am mistaken, adding a keyword argument to load would be fairly invasive as I would have to make sure it is threaded through every frontend.

@neutrinoceros
Copy link
Member

not necessarily ! It is true that adding a keyword argument to yt.load would prevent the introduction of a similarly-named keyword argument in any frontend, and we should be careful about not shadowing any existing one, but as long as we choose careffuly there is a design space there that we can exploit.

@neutrinoceros
Copy link
Member

ds.register_fields is probably a bit cleaner

for completeness; I feel the opposite way: an API that can only be called between yt.load and the index being filled seems very brittle to me, and very likely to confuse users (especially in an interactive context, which we constantly encourage users towards)

@brittonsmith
Copy link
Member Author

@neutrinoceros, you might have to lay this out for me because I'm not sure I see how a keyword arg can be added to yt.load in a simple way. If I understand this function correctly, load is choosing the appropriate (i.e., frontend-specific) subclass of Dataset, instantiating an instance of that, and returning it. If I look at, for example, EnzoDataset, it does not accept **kwargs, so how would I pass say a registered_fields keyword without adding it there, the baseclass, and every other frontend. The only other thing I can see to do is to insert functionality into load itself, which feels even less elegant.

I take your point about this being a somewhat brittle operation, but there is error checking in place to prevent this from being called incorrectly as well as documentation, so to me it is acceptable, and with the above point, the simpler solution.

@neutrinoceros
Copy link
Member

neutrinoceros commented Jul 30, 2025

Yes, I'm specifically suggesting that the logic live in yt.load (see for instance how the hint keyword argument is implemented). If the logic takes too much space, it can be written as a private helper function.

What matters most in my opinion is that we avoid taking shortcuts in designing the interface, because that part will need to be maintained forever. In contrast,
elegance of implementation details seems irrelevant to me, as long as it can be refactored later, and I don't see it any reason why it couldn't be.

@brittonsmith
Copy link
Member Author

Ok, I understand now, thank you. To be honest, I'm not in favor of this solution. The hint keyword has an important place here as it aids in the process of figuring out the right frontend, i.e., one of the main functions of load. On the other hand, adding field registering here prevents the user from instantiating a frontend-specific dataset subclass directly. From an organization of duties perspective, I don't think this is where it belongs. From an implementation perspective, I think this also makes it more difficult to thread the logic into the field system. I'm not in favor of, for example, creating another global configuration object to query. We may need another perspective here.

@neutrinoceros
Copy link
Member

adding field registering here prevents the user from instantiating a frontend-specific dataset subclass directly.

that's precisely what the hint argument was added to avoid. Are there other reasons to work around yt.load this way that I'm unaware of ?

@matthewturk
Copy link
Member

I've been mulling this over, and I think I'm increasingly opposed (GASP, HORROR!) to adding more things to load. I don't think it's a bad place to do it, but I think we should take the approach that adding new things there that aren't meant for a single frontend can be challenging.

I genuinely think that adding it to the Dataset object may be the simplest way, and particularly because it mirrors the way we manage other fields. I also think that it might be possible to do this in a way that -- like adding gradient fields -- we override things appropriate. That should probably be a follow-up system.

In short, what I'm coming to think is that we are in a time when we're making some compromises about the existing field system that will stay within where it is (such as this) and those that seek to move it forward (such as #5226 ). I don't think this is such an invasive change that we should forestall it to a day when the field system is improved -- a day which I never thought before would come, but now which seems much more likely. So I think we should do this like it is, even if it's not where we are in agreement about an ideal.

@neutrinoceros
Copy link
Member

To be clear, I'm never going to block any of this:

  • I'm not actually holding any stakes
  • if my argument isn't compelling already I doubt I'll be able to convince anyone

However, for what it's worth, I would like to explain where my reluctance is coming from:
I've been frustrated for years by how much responsibilities and interfaces are bound to the Dataset class. I think its current design can only be justified by historical reasons, which is another way of saying that it grew organically, without a clear definition of its scope. The heavy use of inheritance makes it even harder to maintain, so I would rather we avoid cramming even more stuff into this already overcrowded space (as a general guideline). I understand it's the easy way to do things now, and in some cases the only one, but I genuinely believe that's it's a negative feedback loop, so naturally I tend to struggle against it. Anyway, it is also possible that my strong feelings in that area are clouding my judgement, and as I said, I don't consider myself a mn actual stakeholder, so do as you see fit. I just needed this out my chest.

@neutrinoceros
Copy link
Member

(In case it needs be said: this is in no way personal, and I genuinely respect everyone in this conversation. I'm sorry if it comes across any differently.)

@brittonsmith
Copy link
Member Author

@neutrinoceros, thank you, I've not taken this discussion personally. I appreciate fleshing out the pros and cons of my proposed solution and your alternative. I think I understand your points and where you are coming from in terms of the bloat to the Dataset class, but I'm still of the opinion that putting this in Dataset is the lesser evil for where the code is now. It seems like at least @matthewturk is on board. Can we move forward with this then? I don't think it sinks us significantly deeper into technical debt against future refactoring.

@matthewturk
Copy link
Member

I am inclined to thumbs up on this, but need another day to think.

@neutrinoceros
Copy link
Member

Overall I'm -0.1 on this: I'm leaning against it, but I'm not invested enough that my vote should be blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Making something better workshop-2025
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants