[Prototype] Generalize dynamic config classes #245
Conversation
looks great so far!
About the global registry thing: another approach would be to outsource the functionality to a mixin class that can be optionally added to a config class. That way we can have more control over where this feature is used, and it would also allow for having different registries for different config base classes.
It would be a bit of a hack, but I can make it work.
I considered several alternatives, but they all tend to run into trouble once we start dealing with more than just a few registries. The user has to think about what registry a class belongs to when setting the name, and make sure to pick a unique name for that one, so the benefit of having shorter names would come at a cost to developers. Also, there is the issue of classes potentially belonging to multiple registries, which can be solved only at the price of extra complexity. I tried to think of a developer-friendly solution for a while but ended up failing to, so I went for a single global registry instead. It's good enough because the validation mechanism ensures the selected class has the expected base class. The only downsides are the less user-friendly names and the impossibility to restrict which subclasses are allowed for a given field.
@tscholak I started working on more dynamic classes and realized user-friendly names are essential. So I implemented a simple solution where each class can have its own registry, and the class is selected by iterating through the registries in the MRO. So we get neat names without adding complexity on the developer side. The only real missing piece would be some way for users to find out the available types; we should add it to the automatic config documentation once we have it.
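For context, a minimal sketch of the per-class-registry idea described above. The names `Config`, `_registry`, and `get_subclass_by_name` are illustrative assumptions, not the actual Fast-LLM API: each base class that opts in gets its own registry, and lookup walks the MRO so the most specific registry wins.

```python
# Hypothetical sketch of per-class registries resolved through the MRO.
# `Config`, `_registry`, and `get_subclass_by_name` are illustrative names,
# not the actual Fast-LLM implementation.

class Config:
    @classmethod
    def get_subclass_by_name(cls, name: str) -> type:
        # Walk the MRO so a short name registered on a more specific base
        # class shadows names registered higher up the hierarchy.
        for base in cls.__mro__:
            registry = base.__dict__.get("_registry")
            if registry is not None and name in registry:
                return registry[name]
        raise KeyError(f"Unknown {cls.__name__} type: {name}")


class TrainerConfig(Config):
    _registry = {}  # this base class opts into dynamic lookup


class GPTTrainerConfig(TrainerConfig):
    pass


# Registration could be done by a decorator; shown explicitly here.
TrainerConfig._registry["gpt"] = GPTTrainerConfig
```

Because each base class owns its namespace, the same short name can map to different classes under different bases without clashing.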
yes, this is great!
fast_llm/cli.py
Outdated
```python
# TODO: Remove backward compatibility.
if len(args) >= 2 and args[0] == "train":
    if args[1] == "gpt":
        args = ["type=train_gpt"] + args[2:]
```
With regards to deprecating the `fast-llm train` command syntax: I'm not super confident that we ever will.
Found a way to preserve the syntax https://github.com/ServiceNow/Fast-LLM/pull/245/files#diff-1572906edaf7e94bf94dcf9dfdecc5b34757c2e56924cd3054a365b10db5ee31R26
fast_llm/data/dataset/gpt/config.py
Outdated
```python
GPTSampledDatasetConfig.register_subclass("concatenated_memmap", GPTConcatenatedMemmapConfig)
GPTSampledDatasetConfig.register_subclass("fim", GPTFimSampledDatasetConfig)
GPTSampledDatasetConfig.register_subclass("legacy", GPTLegacyDatasetConfig)
GPTSampledDatasetConfig.register_subclass("test_slow", GPTTestSlowDatasetConfig)
```
Hmm, so is this for registering user-friendly names, or the classes themselves? Can we make registration automatic? Didn't we want to use the `type` classvar for that?
`type` is for the instance field. We could add other classvars, but it wouldn't be ideal because we'd still need to distinguish which base class it belongs to (ex. one classvar for each base class), and we may want to register multiple user-friendly names. Also, it adds the magic of the classvar being converted to a dynamic type name. Overall the external register call seems better.
I am confused by this answer. I was expecting something like this: if a config class defines/overrides the `type` classvar field, it automatically becomes a registered dynamic class, with the value of the `type` field becoming its human-readable name, which will be used for serialization and deserialization.
Example:
```python
@config_class
class MainConfig:
    foo: FooConfig

@config_class
class FooConfig:
    type: typing.ClassVar[str]

class BarConfig(FooConfig):
    type = "bar"
    something: int

class BazConfig(FooConfig):
    type = "baz"
    something_else: int
```
Serialized version:
```yaml
foo:
  type: baz
  something_else: 5
```
I think unless we can show a concrete need for multiple human-readable names per class, or for registering the same subclass under multiple base classes (imho both extremely niche), the explicit `register_subclass(...)` pattern adds too much noise and duplication.
`type` is a field for instances https://github.com/ServiceNow/Fast-LLM/pull/245/files#diff-15aab2503b9a097809b7fa450b3f63ea0db5d700982d7265610fdf28b476788aR989. What you are referring to is an equivalent to the `type_` classvar we had before https://github.com/ServiceNow/Fast-LLM/pull/245/files#diff-94ac1cad97ec073f07a7d6ead2489610430a6fb40aa3e34d520b86bd88e6c9fcL102. These two can't be the same.
Adding back the same `type_` variable is no longer an option, because all base classes have a registry and we don't know which one is concerned. So we'd need some adjustments; I can think of a few options: a `registry_class` classvar, separate type variables (ex. `gpt_dataset_type`), or a more complex `type: typing.ClassVar[tuple[tuple[type[Config], str], ...]]`. None of these look better than `register_subclass`.
I think we do want to support multiple names and/or base classes. Multiple names may be needed for backward compatibility, or to simplify usage. (The final saved config uses the full class name either way, so that's just convenience.) And multiple base classes are useful for specialization: for example, `GPTTrainerConfig` should be registered as `train_gpt` as a `RunnableConfig`, but can take the shorter name `gpt` as a `TrainerConfig`. Concretely this allows the `fast-llm train gpt` syntax.
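A rough sketch of what registering one class under two names, scoped to two different base classes, could look like. The `register_subclass` helper and `_registry` attributes here are assumptions for illustration, not the actual implementation:

```python
# Illustrative sketch: one subclass registered under different names for
# different base classes. Names are assumptions, not the Fast-LLM API.

class RunnableConfig:
    _registry = {}


class TrainerConfig(RunnableConfig):
    _registry = {}  # its own namespace, separate from RunnableConfig's


class GPTTrainerConfig(TrainerConfig):
    pass


def register_subclass(base: type, name: str, subclass: type) -> None:
    # Each base keeps its own namespace, so short names stay unambiguous.
    assert issubclass(subclass, base)
    base._registry[name] = subclass


# Long, unambiguous name at the top level; short name in the narrower scope.
register_subclass(RunnableConfig, "train_gpt", GPTTrainerConfig)
register_subclass(TrainerConfig, "gpt", GPTTrainerConfig)
```

The same class resolves under either name, depending on which base class the lookup is scoped to.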
Having a decorator looks very good. However, why can't we use one global registry for all `Config` classes? Names should be unique per class hierarchy; conflicts would be raised on register (not implemented in the example below).
Below is an example using a global `ConfigRegistry`, which supports:
- Command runners
- Typed class hierarchies (models, trainers, evaluators, etc.)
- Instantiating proper subclasses from YAML
- Auto-dispatching based on `type:` keys in config

The system includes demo configs and commands at the end to show how it works in practice.
Features:
- Decorator for easy class registration: `@ConfigRegistry.register("name", implementation_class)`
- Automatic resolution of subclassed dataclasses from YAML with a `type` field
- Reusable base logic for any component: models, runners, evaluators, checkpoints, etc.
- Command line integration with flexible structure
🔧 Subclass Resolution Logic
In this design, the "base class" used to resolve registered subclasses is determined by context:
- ✅ In nested dataclass fields (e.g., `model: ModelConfig`), the base class is what's declared in the field's type annotation.
- ✅ In top-level commands or manual use, the base class is explicitly passed to `from_dict(base_cls, data)` or `ConfigRegistry.get(type_name, base_cls)`.

This allows the same global registry to work across unrelated class hierarchies (e.g., `ModelConfig`, `TrainerConfig`, `EvaluatorConfig`), because each lookup is scoped to its specific base class.
What do you think?
```python
from dataclasses import is_dataclass, fields, dataclass
from typing import get_args, get_origin, Any
import types

import yaml

# ---------------- CONFIG REGISTRY MACHINERY (SIMPLIFIED) ----------------


@dataclass
class Config:
    # Each Config can optionally define a "configurable" runtime class it instantiates
    configurable_class = None

    def get_configurable(self):
        # Returns an instance of the associated runtime class using self as config
        if self.configurable_class is None:
            return None
        return self.configurable_class(self)


class ConfigRegistry:
    # Global registry mapping friendly names (like "gpt", "train") to config classes
    _table: dict[str, set[type]] = {}

    @classmethod
    def register(cls, name: str | None = None, configurable_class: type | None = None):
        # Decorator to register a Config class under a name and link it to its runtime class
        def deco(target: type) -> type:
            key = name or target.__name__
            target.configurable_class = configurable_class
            cls._table.setdefault(key, set()).add(target)
            return target

        return deco

    @classmethod
    def get(cls, name: str, base: type) -> type:
        # Retrieve the config class registered under `name` that is a subclass of `base`
        for cand in cls._table.get(name, ()):
            if issubclass(cand, base):
                return cand
        raise KeyError(f"No registered class named '{name}' that is a subclass of {base.__name__}")


def from_dict(cls: type, raw: dict | list | Any) -> Any:
    """Recursively convert a dict (or list) into a dataclass instance."""
    if is_dataclass(cls) and isinstance(raw, dict):
        init_kwargs = {}
        for f in fields(cls):
            field_value = raw.get(f.name)
            if field_value is None:
                init_kwargs[f.name] = None
                continue
            init_kwargs[f.name] = _convert_value(field_value, f.type)
        return cls(**init_kwargs)
    return _convert_value(raw, cls)


def _convert_value(value: Any, target_type: type) -> Any:
    # Handles typing annotations like list[T], dict[K, V], unions, and dataclasses
    origin = get_origin(target_type)
    args = get_args(target_type)
    if origin is list and isinstance(value, list):
        return [_convert_value(v, args[0]) for v in value]
    elif origin is dict and isinstance(value, dict):
        return {_convert_value(k, args[0]): _convert_value(v, args[1]) for k, v in value.items()}
    elif is_dataclass(target_type) and isinstance(value, dict):
        # If a 'type' key is present, use it to find a registered subclass
        typ_name = value.pop("type", None)
        if typ_name:
            subclass = ConfigRegistry.get(typ_name, target_type)
            return from_dict(subclass, value)
        return from_dict(target_type, value)
    elif origin is types.UnionType:
        # Handle Optional[T] (which is T | None)
        non_none_args = [arg for arg in args if arg is not type(None)]
        if len(non_none_args) == 1:
            return _convert_value(value, non_none_args[0])
    return value


# ---------------- MODEL CLASSES ----------------


@dataclass
class CheckPointConfig(Config):
    path: str


@dataclass
class GPTCheckPointConfig(CheckPointConfig):
    pass


@dataclass
class SSMCheckPointConfig(CheckPointConfig):
    pass


@ConfigRegistry.register("llama")
@dataclass
class LlamaCheckPointConfig(GPTCheckPointConfig):
    pass


@ConfigRegistry.register("mamba")
@dataclass
class MambaCheckPointConfig(SSMCheckPointConfig):
    pass


# Model runtime classes (used during execution)
class Model: pass
class GPTModel(Model): pass
class SSMModel(Model): pass


@dataclass
class ModelConfig(Config):
    checkpoint: CheckPointConfig | None = None

    def get_model(self):
        return self.get_configurable()


@ConfigRegistry.register("gpt", GPTModel)
@dataclass
class GPTModelConfig(ModelConfig):
    pass


@ConfigRegistry.register("ssm", SSMModel)
@dataclass
class SSMModelConfig(ModelConfig):
    pass


# ---------------- TRAINER CLASSES ----------------


@dataclass
class TrainerConfig:
    train_iters: int


# ---------------- EVALUATOR CLASSES ----------------


class Evaluator: pass
class EvaluatorLoss(Evaluator): pass
class EvaluatorLmEval(Evaluator): pass


@dataclass
class EvaluatorConfig(Config):
    def get_evaluator(self):
        return self.get_configurable()


@ConfigRegistry.register("loss", EvaluatorLoss)
@dataclass
class EvaluatorLossConfig(EvaluatorConfig):
    pass


@ConfigRegistry.register("lm_eval", EvaluatorLmEval)
@dataclass
class EvaluatorLmEvalConfig(EvaluatorConfig):
    pass


# ---------------- RUNNER CLASSES ----------------


class Runner:
    def __init__(self, cfg):
        self.cfg = cfg

    def run(self):
        raise NotImplementedError


class EvaluateRunner(Runner):
    def run(self):
        print("Running evaluation for", self.cfg.model.__class__.__name__)
        print("Checkpoint is", self.cfg.model.checkpoint.__class__.__name__)
        for name, ev in self.cfg.evaluations.items():
            print(f"  • {name}: {ev.__class__.__name__}")


class TrainRunner(Runner):
    def run(self):
        print("Running training for", self.cfg.model.__class__.__name__)
        print("Checkpoint is", self.cfg.model.checkpoint.__class__.__name__)
        print("Iterations:", self.cfg.trainer.train_iters)


@dataclass
class RunnerConfig(Config):
    def get_runner(self):
        return self.get_configurable()


@ConfigRegistry.register("evaluate", EvaluateRunner)
@dataclass
class EvaluateConfig(RunnerConfig):
    model: ModelConfig
    evaluations: dict[str, EvaluatorConfig]


@ConfigRegistry.register("train", TrainRunner)
@dataclass
class TrainConfig(RunnerConfig):
    model: ModelConfig
    trainer: TrainerConfig


# ---------------- CLI ENTRYPOINT ----------------


class Command:
    @staticmethod
    def run(argv: list[str]):
        if len(argv) != 2:
            raise SystemExit("Usage: <command> <yaml-string>")
        cmd_name, yaml_blob = argv
        raw_cfg = yaml.safe_load(yaml_blob)
        # Look up the appropriate config class from the registry
        cmd_cfg_cls = ConfigRegistry.get(cmd_name, RunnerConfig)
        # Parse YAML to a typed dataclass
        cmd_cfg = from_dict(cmd_cfg_cls, raw_cfg)
        # Instantiate and run the corresponding Runner
        cmd_cfg.get_configurable().run()


# ---------------- DEMO ----------------

evaluate_command_args = """
model:
  type: gpt
  checkpoint:
    type: llama
    path: /path/to/model
evaluations:
  stack3b:
    type: lm_eval
  fineweb:
    type: loss
"""
args = ["evaluate", evaluate_command_args]
Command.run(args)

train_command_args = """
model:
  type: ssm
  checkpoint:
    type: mamba
    path: /path/to/model
trainer:
  train_iters: 1000
"""
args = ["train", train_command_args]
Command.run(args)
```
@bigximik thanks for the suggestion. It's actually very close to my initial implementation, without the implicit registration. But I moved away from it following discussions here and other concerns that arose when trying it out in existing dynamic classes:
- This is opt-in from the developer perspective, but always enabled for the user. I like that, but we got an explicit request to do that ([Prototype] Generalize dynamic config classes #245 (comment)). There is also an explicit request to have multiple registries.
- Name clashes are too much of a problem. We can have a "gpt" model, but also a "gpt" trainer, a "gpt" dataset, a "gpt" huggingface model, etc. With a single global registry, this is not possible, and we need longer names to properly disambiguate, at which point it's basically the same as the class name. So the only viable option is to use the class name itself.
- That means long and annoying names. We can't just write "default" or "llama3" for rotary embeddings; we have to provide redundant information, ex. "Llama3RotaryConfig" or an adjusted form like "llama3_rotary".
- It's also not great for backward compatibility, since we'd have to deprecate all existing human-friendly names.
- Concerning the CLI part, it goes against the request to preserve the syntax. Your suggestion solves the "train" part (it's basically the same as mine) but leaves out the "gpt" part, so we'd still need to dispatch the actual trainer dynamically.
@jlamypoirier Thanks for the comments!
Regarding the first four points:
Name conflicts aren't a concern from my point of view when using a global registry — the registry supports multiple classes registered under the same name. Real name conflicts can occur if more than one class is a subclass of the required base at resolution time. However, the same issue would occur even if separate registries were used in the decorator. These conflicts can still be detected and reported. Yes, conflict detection is deferred from decorator application to resolution time, but it's still caught early — e.g., during a dry run or config loading.
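The deferred conflict detection described above can be sketched in a few lines. This is a toy stand-in for the registry, not the code from this PR: a lookup only fails when a name is genuinely ambiguous for the requested base class.

```python
# Sketch of resolution-time conflict detection in a single global registry.
# `ConfigRegistry` here is a toy stand-in, not the actual implementation.

class ConfigRegistry:
    _table: dict[str, set] = {}

    @classmethod
    def register(cls, name: str, target: type) -> type:
        cls._table.setdefault(name, set()).add(target)
        return target

    @classmethod
    def get(cls, name: str, base: type) -> type:
        matches = [c for c in cls._table.get(name, ()) if issubclass(c, base)]
        if not matches:
            raise KeyError(f"No '{name}' registered under {base.__name__}")
        if len(matches) > 1:
            # Deferred conflict detection: only a real ambiguity errors out.
            raise KeyError(f"Ambiguous name '{name}' under {base.__name__}")
        return matches[0]


class ModelConfig: pass
class GPTModelConfig(ModelConfig): pass
class TrainerConfig: pass
class GPTTrainerConfig(TrainerConfig): pass

# "gpt" is reused across unrelated hierarchies without clashing.
ConfigRegistry.register("gpt", GPTModelConfig)
ConfigRegistry.register("gpt", GPTTrainerConfig)
```

Each lookup is scoped to a base class, so the shared name resolves unambiguously until two registered classes actually overlap under the same base.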
This approach ensures the base class (i.e., the separate registry) needs to be specified only once, and only in the place it actually matters — such as in the config field type — which improves maintainability.
To my understanding, this also aligns with an opt-in model.
Perhaps @tscholak can clarify whether this satisfies the multiple registry and opt-in requirement?
Regarding implicit `Configurable` registration: we can explicitly separate the two decorators, one for configurable registration and one for registering user-friendly names. Or we can keep it as it is now for configurable registration.
Regarding leaving the CLI interface intact: we can keep the CLI interface unchanged, so it remains `fast-llm train gpt --config ...`, still using a single global registry. Actually, we can have both, but I would still advocate for the new one, which is `fast-llm train --config ...`
```python
@dataclass
class RunnerConfig(Config):
    def get_runner(self):
        return self.get_configurable()


@ConfigRegistry.register("evaluate", EvaluateRunner)
@dataclass
class EvaluateConfig(RunnerConfig):
    model: ModelConfig
    evaluations: dict[str, EvaluatorConfig]


@ConfigRegistry.register("train", TrainRunner)
@dataclass
class TrainConfig(RunnerConfig):
    model: ModelConfig
    trainer: TrainerConfig


@ConfigRegistry.register("evaluate_gpt", EvaluateRunner)
@dataclass
class GPTEvaluateConfig(RunnerConfig):
    model: GPTModelConfig
    evaluations: dict[str, EvaluatorConfig]


@ConfigRegistry.register("train_ssm", TrainRunner)
@dataclass
class SSMTrainConfig(RunnerConfig):
    model: SSMModelConfig
    trainer: TrainerConfig


# ---------------- CLI ENTRYPOINT ----------------


class Command:
    @staticmethod
    def run(argv: list[str]):
        if len(argv) != 3 and len(argv) != 2:
            raise SystemExit("Usage: <command> <model> <yaml-string> or <command> <yaml-string>")
        if len(argv) == 3:
            cmd_name, model_name, yaml_blob = argv
        else:
            cmd_name, yaml_blob = argv
            model_name = None
        raw_cfg = yaml.safe_load(yaml_blob)
        if model_name is None:
            full_command_name = cmd_name
        else:
            full_command_name = f"{cmd_name}_{model_name}"
        # Look up the appropriate config class from the registry
        print(full_command_name)
        print(ConfigRegistry._table)
        cmd_cfg_cls = ConfigRegistry.get(full_command_name, RunnerConfig)
        # Parse YAML to a typed dataclass
        cmd_cfg = from_dict(cmd_cfg_cls, raw_cfg)
        # Instantiate and run the corresponding Runner
        cmd_cfg.get_configurable().run()


# ---------------- DEMO ----------------

evaluate_command_args = """
model:
  checkpoint:
    type: llama
    path: /path/to/model
evaluations:
  stack3b:
    type: lm_eval
  fineweb:
    type: loss
"""
args = ["evaluate", "gpt", evaluate_command_args]
Command.run(args)

train_command_args = """
model:
  checkpoint:
    type: mamba
    path: /path/to/model
trainer:
  train_iters: 1000
"""
args = ["train", "ssm", train_command_args]
Command.run(args)

train_command_args = """
model:
  type: ssm
  checkpoint:
    type: mamba
    path: /path/to/model
trainer:
  train_iters: 1000
"""
args = ["train", train_command_args]
Command.run(args)
```
Note: my example is just a demo to demonstrate the concept. It doesn't implement type-field injection during config serialization or other advanced features already handled by the config classes, nor is it autocompletion-friendly.
This is the same kind of opt-in that got rejected here: all base classes potentially participate in the dynamic dispatch, depending on what their derived classes may do. I don't have a problem with it, but @tscholak specifically requested enabling dynamic dispatch for specific base classes only.
I'm not a fan of having a single registry with duplicate names. It works, but looks potentially confusing. I prefer an explicit specification of what base class is allowed for a given registration; it's easier to reason with and document, e.g. we can make clear and explicit lists of allowed types for the base classes we care about for dynamic dispatch.
As for the CLI part, your implementation would be difficult to run in practice because the actual model/trainer class is specified through inheritance rather than composition, and changing that would be really difficult. Once we take that into account, we basically get the implementation I'm proposing (see #269), independently of the actual registration mechanism.
Anyway, I've spent enough time on this and would rather move on to other tasks. #268 is fully functional and satisfies all requirements as far as I know, so I'd rather not change it anymore unless there is a very good reason to do so.
makes sense @jlamypoirier
fast_llm/engine/training/config.py
Outdated
```diff
@@ -349,30 +338,23 @@ class TrainerConfig(PretrainedFastLLMModelConfig, ExperimentConfig):
     _abstract = True
     # TODO: Generalize data, schedule, logging, etc.
     training: TrainingConfig = Field(
         default_factory=TrainingConfig,
```
You mentioned in the other PR that this is necessary now, but I don't quite understand why. If you've got several classes of type `TrainingConfig`, you want a straightforward way to say which one is the default. That's pretty much it.
The problem was all sorts of potential issues happening if not going through `_from_dict`, not necessarily related to this PR (ex. dataclasses doing the wrong thing when calling it for nested configs). But now that I enforced going through it (with the metaclass `__call__`), I think I can safely bring it back.
Actually, this won't work without some hacking around. The problem is we don't just want to set a default; we want to set a default for the nested config's `type` field even if a value is provided. And we can't just set a default on `type` because it would propagate to subclasses and cause all sorts of trouble.
Ok, though we still need a clean, consistent way to specify default values for config fields. `field: SomeConfig = Field(default_factory=SomeConcreteConfig)` should just work. If `SomeConcreteConfig` is a subclass in a dynamic dispatch / tagged-union style setup, then instantiating it directly should still produce a valid config object. The fact that it has a type or tag field shouldn't block the default from being usable. If it's not currently reliable, that's a sign that we're over-complicating the dispatch mechanism rather than a reason to give up on defaults.
This would work perfectly fine, just not in a useful way, and it would be a ton of trouble. `default_factory` (and `default`) is for when a value is not provided. This means it is (and should be) ignored whenever anything is passed, including an empty dict. So it's not suitable for setting the default field type, because setting any value will undo it.
On top of that, the config mechanism doesn't distinguish whether nested fields are passed explicitly or just empty. So some defaults may lead to incorrect serialization paths, ex: `{} --(instantiate)-> Config({}) --(serialize)-> [MISSING] --(instantiate)-> Config([DEFAULT])`. (In fact this happens for all nested fields, not just configs, but configs are the biggest risk.)
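The "any value undoes the default" behaviour can be shown with a toy dict-merging loader. This uses plain dicts, not the actual config machinery, purely to illustrate the semantics:

```python
# Toy illustration: default_factory-style defaults only apply when the key
# is absent entirely, so an empty nested dict wipes the default type.

def load(data: dict, defaults: dict) -> dict:
    out = dict(defaults)
    out.update(data)  # any provided value, including {}, replaces the default
    return out

defaults = {"training": {"type": "default_training", "iters": 100}}

full = load({}, defaults)                 # field absent: default survives
wiped = load({"training": {}}, defaults)  # empty dict provided: default gone
```

Setting the default type of a nested config therefore needs a mechanism that fires even when a (partial) value is provided, which `default_factory` by design does not.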
I've spent many hours this week diving into this PR and its discussions to understand the broader impact. I want to be crystal-clear about where we stand:
Fast-LLM is a model training framework, not a configuration R&D project.
Yet the config subsystem keeps burning days of senior time. That's time we're not spending on training models, fixing bugs, or shipping features. We've now burned more than a week on a problem Pydantic solves out of the box. That's not sustainable.
Every extra layer of custom logic raises the bar for contributors, concentrates knowledge in one person, and stalls the real work: training models and shipping features. As the team lead, I have to call it what it is: it's an operational risk.
We need to reverse this pattern. The config system should get out of our way, not pull the whole team into its gravity well.
And no, this isn't about shutting down ideas. It's about protecting this team from complexity creep that slows everyone down.
"As I mentioned multiple times already, I do care a lot about it, clear and concise configs are essential to me for development, testing, debugging, understanding bug reports, etc."
This illustrates the underlying issue: the current design optimizes for what the framework maintainer finds convenient for himself, but not for what contributors actually need when developing models or features.
Now, let's look at what works and doesn't work in Pydantic:
```python
from pydantic import BaseModel, Field
from typing import Literal, Union


class Foo(BaseModel):
    type: Literal['foo'] = 'foo'
    x: int = 1


class Bar(BaseModel):
    type: Literal['bar'] = 'bar'
    y: int = 2


Item = Union[Foo, Bar]


class Wrapper(BaseModel):
    item: Item = Field(default_factory=Foo, discriminator='type')


print(Wrapper().model_dump())  # works, default_factory runs, uses Foo()
# print(Wrapper(item={}))  # doesn't work, validation fails because of missing discriminator
# print(Wrapper(item={'x': 42}))  # doesn't work, validation fails because of missing discriminator
print(Wrapper(item={'type': 'bar', 'y': 99}).model_dump())  # works, dispatches to Bar
```
This setup gets most things right:
- Dynamic dispatch is opt-in.
- The discriminator is explicit and user-controllable.
- Defaults work when the field is absent.
- There is no magic registry.
Does it handle every edge case? No. But it works well for the common case. And that's the bar Fast-LLM needs to meet.
Right now, we're debating edge cases that haven't materialized (e.g., sub-subclassing edge-cases should be rare, and if we're running into them, that's a sign to simplify the class hierarchy and not to complicate things further.) Meanwhile, the team is stuck. That's the wrong tradeoff.
In hindsight, we should have adopted Pydantic from the start. But since we're here now, and deadlines are real, we need to course-correct pragmatically.
We've seen the cost of keeping this overly flexible. It's time to simplify. Here are the principles we need to lock in now:
- Dynamic config behaviour must be opt-in, via a mixin or decorator. Most config classes will just be plain config dataclasses.
- Subclasses declare their own tag via something, a `ClassVar[str]`, or decorator param. It doesn't really matter. Your choice, @jlamypoirier. But it must be self-contained and discoverable.
- If a config field is missing, we use `default_factory`. If a dict is provided without a tag (it shouldn't), we either:
  - inject the default tag automatically (preferred), or
  - error clearly with a message explaining the missing tag (acceptable fallback).
- No global subclass registries tied to CLI routing. CLI dispatch should be its own, explicit layer. Config structure should not bend to command-line ergonomics.
These constraints exist to make Fast-LLM maintainable and usable by the whole team. The bar is not "can @jlamypoirier make this work." It's "can any contributor pick this up, understand it, and extend it easily."
That's what makes this framework sustainable.
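The preferred "inject the default tag" behaviour requested above could look roughly like this. The `resolve` helper, registry, and class names are hypothetical, not proposals for actual APIs:

```python
# Sketch of default-tag injection: if the incoming dict has no "type" key,
# fall back to a declared default instead of failing. Illustrative only.

class TrainingConfig: pass
class GPTTrainingConfig(TrainingConfig): pass
class SSMTrainingConfig(TrainingConfig): pass

REGISTRY = {"gpt": GPTTrainingConfig, "ssm": SSMTrainingConfig}

def resolve(data: dict, registry: dict, default_tag: str) -> type:
    tag = data.get("type", default_tag)  # inject the default tag if absent
    try:
        return registry[tag]
    except KeyError:
        # Acceptable fallback: a clear error naming the valid tags.
        raise ValueError(f"Unknown type {tag!r}; expected one of {sorted(registry)}")
```

An untagged dict then resolves to the declared default class, while an unknown tag produces an explicit error rather than a silent fallback.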
Let's be clear, the time we're taking is due to the overly detailed discussion we're having here. I had a working solution 2 weeks ago, with near-zero complexity on the user side.
We've seen the cost of keeping this overly flexible. It's time to simplify. Here are the principles we need to lock in now:
- Dynamic config behaviour must be opt-in, via a mixin or decorator. Most config classes will just be plain config dataclasses.
It's already opt-in, there is nothing to do when we don't want it, and nothing will happen if we don't use it. Everything else is implementation details.
- Subclasses declare their own tag via something, a `ClassVar[str]`, or decorator param. It doesn't really matter. Your choice, @jlamypoirier. But it must be self-contained and discoverable.
Already done, see the decorator implementation.
If a config field is missing, we use `default_factory`. If a dict is provided without a tag (it shouldn't), we either:
- inject the default tag automatically (preferred), or
- error clearly with a message explaining the missing tag (acceptable fallback).
That's already what we're doing.
- No global subclass registries tied to CLI routing. CLI dispatch should be its own, explicit layer. Config structure should not bend to command-line ergonomics.
It is and always has been the opposite: the CLI adapts to the config system. I updated the CLI system according to what the config system now supports, and to your request to preserve the syntax.
Thanks, @jlamypoirier. I want to respond clearly to your points:
"I had a working solution 2 weeks ago..."
The PR is still in draft, tests were failing non-stop until the last commit, and key review feedback -- including on default handling, tagging, and registration -- remains unresolved or blatantly ignored. There was no point at which this was merge-ready. It stalled until I pushed.
"There is nothing to do when we don't want [dynamic dispatch]..."
That's not what "opt-in" means. Right now, every config class implicitly enters a world with base-class registries, tag handling, and dynamic behaviours -- even if it doesn't use them. Every single config class pays this complexity tax. That's passive inclusion, not opt-in.
What I've asked for is explicit control: if a developer doesn't add a mixin or decorator, the config class behaves like a plain (config) dataclass. No registry, no tagging, no indirection. That's what "opt-in" means in a maintainable system. And I had described how this can work eons ago: #104 (comment)
"Already done..." / "That's already what we're doing."
If that were the case, this conversation wouldn't be happening. Are the default factories back? No? Well then.
"The CLI adapts to the config system..."
We shouldn't even be debating this. Why is there even coupling between the CLI commands and the config system? We have 3 or 4 commands right now. Give them one, hard-coded root config class each and call it a day.
We're deep in this because we didn't push for simplicity early. I'm asking for a course correction now, not to block you, but to unblock everyone else.
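The "one hard-coded root config class per command" idea suggested above could be as small as a dict lookup. This is a sketch under assumed names; the command set and config classes are illustrative, not the project's actual ones:

```python
# Hypothetical CLI dispatch with no coupling to the config system: each
# command maps to exactly one hard-coded root config class.
class TrainConfig: ...
class ConvertConfig: ...
class PrepareConfig: ...

COMMANDS: dict[str, type] = {
    "train": TrainConfig,
    "convert": ConvertConfig,
    "prepare": PrepareConfig,
}


def dispatch(argv: list[str]) -> type:
    """Pick the root config class from the first CLI argument."""
    if not argv or argv[0] not in COMMANDS:
        raise SystemExit(f"Usage: fast-llm {{{'|'.join(COMMANDS)}}} ...")
    return COMMANDS[argv[0]]
```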
That's not what "opt-in" means. Right now, every config class implicitly enters a world with base-class registries, tag handling, and dynamic behaviours -- even if it doesn't use them. Every single config class pays this complexity tax. That's passive inclusion, not opt-in.
What I've asked for is explicit control: if a developer doesn't add a mixin or decorator, the config class behaves like a plain (config) dataclass. No registry, no tagging, no indirection. That's what "opt-in" means in a maintainable system. And I had described how this can work eons ago: #104 (comment)
Let's not debate semantics. Nothing visible happens if the developer doesn't do anything; what happens in the background is irrelevant. Unless I'm missing something?
If that were the case, this conversation wouldn't be happening. Are the default factories back? No? Well then.
Default factories were never there for config classes. All I can do at this time is force them to be an exact match with the type hint, which is equivalent to them not being there. Anything more is not an option.
We shouldn't even be debating this. Why is there even coupling between the CLI commands and the config system? We have 3 or 4 commands right now. Give them one, hard-coded root config class each and call it a day.
I don't get what's wrong here. I can revert the changes if you want. But we have a perfect match between CLI commands and the new dynamic config system; it would be weird to keep and maintain a separate infrastructure just because we want to.
Thanks for the update, @jlamypoirier. I don't want to drag this out further, so we'll move forward for now.
That said, I want to clearly record the unresolved concerns:
Every config class now implicitly participates in dynamic dispatch and registry logic, even when that complexity isn't needed. The cognitive overhead of the config system remains too high, and this PR made it worse. The sheer review burden here shows how far the design diverges from what most developers expect in a config system. We've burned over a week of senior time debating mechanics, and that's time we could have spent on models, training, or features.
We continue to build and maintain a bespoke config layer, which distracts from model development. A better choice would be to lean on mature solutions like Pydantic, which are battle-tested and solve 95% of our use cases with far less friction.
If the complexity tax of the current system continues to stall progress, I will propose a full migration to Hydra + OmegaConf + Pydantic.
For now, I'm prioritizing unblocking the team and moving forward.
attention_factor: None | float = Field(
    default=None,
    desc="Attention factor for yarn-type scaling.",
is this because the desc is not informative or because something has changed so that descriptions don't work anymore?
`Attention factor` is just repeating the field name, and `for yarn-type scaling` is now redundant with the type/class name. So the existing descriptions are completely useless.
ok, redundancy, I get it. But rather than dropping the description entirely, should we consider replacing it with something more informative? e.g., when this field matters, what values are valid, what happens if it's None, etc. Otherwise we're losing a chance to document intent.
I agree, but it's outside the scope of this PR.
Removing the desc is touching the field, so scope isn’t the issue. If we're already changing it, let's replace the description with something useful rather than delete it and lose intent. Otherwise please revert this hunk and keep the existing description until we can document it properly.
These descriptions were already irrelevant, so whether I'm removing them or leaving them as a repetition of the field name doesn't matter. And more importantly, I'm not familiar enough with these classes to make a proper description, feel free to suggest some.
  layers: list[TransformerSubLayerName] = Field(
-     default=None,
+     default=(TransformerSubLayerName.query, TransformerSubLayerName.value_),
Suggested change:
-     default=(TransformerSubLayerName.query, TransformerSubLayerName.value_),
+     default_factory=lambda: [TransformerSubLayerName.query, TransformerSubLayerName.value_],
The tuple is fine, it gets transformed into a list automatically
I'm OK with the tuple if there's a documented guarantee that the config layer converts it automatically, with the conversion rules stated clearly.
Right now, the behaviour isn't obvious at all. The type of this field is annotated as a list but initialized as a tuple. That's a straightforward type mismatch. I wasn't aware that coercion rules existed beyond serialized data, and I doubt anyone else is either. Does it always convert tuple to list? What happens with `typing.Sequence` or other container types?
Explicit is always better than implicit. Let's document these conversions clearly, or just avoid them entirely by using explicit defaults.
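For reference, the kind of implicit coercion being debated can be expressed in a few lines. This is a hypothetical sketch of such a rule, not the project's actual implementation; the function name `coerce` is made up:

```python
# Hypothetical sketch of a config-layer coercion rule: a tuple default is
# converted to a list when the declared type hint is list[...]; anything
# else is left untouched.
import typing


def coerce(value, hint):
    """Normalize a default value against its declared type hint."""
    if typing.get_origin(hint) is list and isinstance(value, tuple):
        return list(value)
    return value
```

Whether such a rule also covers `typing.Sequence` or other containers is exactly the kind of detail the documentation would need to pin down.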
Yes, this will be part of the config documentation.
if expected_class is not None:
    # Should be handled in `from_dict`, but can fail if instantiating directly.
    Assert.is_(self.__class__, expected_class)
@jlamypoirier shouldn't this be something like `Assert.custom(issubclass, expected_class, self.__class__)` instead?
I.e. the dynamically inferred class (`expected_class`) should be a subclass of `self.__class__`; it should not necessarily match the class exactly?
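The difference between the two checks is easy to see in isolation. A minimal, self-contained illustration (the class names are made up):

```python
# An exact-class identity check rejects subclasses, while an issubclass
# check accepts them -- the distinction at the heart of this comment.
class BaseConfig: ...
class DerivedConfig(BaseConfig): ...

obj = DerivedConfig()
exact_match = obj.__class__ is BaseConfig            # False: identity fails for a subclass
subclass_ok = issubclass(obj.__class__, BaseConfig)  # True: subclass relationship holds
```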
The value is overridden in `from_dict`. Anyway, this PR isn't too relevant anymore; I broke it into pieces.
✨ Description
Fix: #126
Generalize the concept of dynamic config classes from the dataset mechanism to all config classes. I opted for a unique global registry of all config classes. This choice is motivated by simplicity (multiple registries would be hell to manage), though it means registry keys need to be globally unique, so they'll tend to be longer. Config classes are now specified through the `type` field, using the config class name or its shorthand without the `Config` suffix.
Included a proposal to simplify the CLI, so the runnable is selected through its `type` field, e.g. `fast-llm type=GPTTrainer ...` instead of `fast-llm train gpt`.
Added backward compatibility for the old dataset dynamic type names.
TODO:
The `type` field conflicts with normalization, rotary and peft types. These are good use cases for dynamic classes, so I'll implement them, but I'll wait until after Review the architecture split #237.
🔍 Type of change
Select all that apply: