Block interface: refactor #339
base: tp_mamba
Conversation
@jlamypoirier, thanks for pushing on this. Unifying GPT and HybridSSM is absolutely the right long-term goal, and this PR on block abstractions is valuable groundwork. Before we get too deep into this, I want to clarify expectations about the direction for the block architecture. Eventually, we need to land feature #242 exactly as scoped: declarative, per-block config, explicit architecture, full heterogeneity, and named block/weight sharing. Please also have a look at #277. That's the path forward we need. Anything that diverges from this or locks us into an alternative config approach (including any revival of the override machinery) is off the table and won't be accepted by me or the team. Your skills are exactly what we need to lay the groundwork for #242 quickly and cleanly. That doesn't mean you need to implement the whole feature yourself. What matters is that your work enables us to land #242 without obstacles. Thanks.
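For illustration only, here is a rough sketch of the kind of declarative, per-block config described for #242; every key, block name, and the weight-sharing mechanism below is hypothetical, not an agreed schema:

```python
# Hypothetical sketch of a declarative, heterogeneous, per-block model config
# in the spirit of #242. Keys, block names and the weight-sharing mechanism
# are illustrative only, not an agreed-upon schema.
model_config = {
    "blocks": {
        # Named block definitions: heterogeneity comes from mixing block types.
        "attn": {"mixer": {"type": "attention", "heads": 32}, "mlp": {"ffn_hidden_size": 4096}},
        "mamba": {"mixer": {"type": "mamba", "state_size": 16}, "mlp": {"ffn_hidden_size": 4096}},
    },
    # Explicit architecture: an ordered sequence of named blocks.
    "pattern": ["attn", "mamba", "mamba", "attn", "mamba", "mamba"],
    # Named weight sharing: blocks tagged with the same name would share parameters.
    "shared_weights": {"attn": "shared_attention_weights"},
}
```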
@tscholak I'm aiming quite close to that scope: https://github.com/ServiceNow/Fast-LLM/pull/342/files#diff-a600ef5264541f7a4a8b245115233343926edacce94e81249e31134e8760adf1R174. After having a look, I think it's safer that I implement it myself. There are very complex backward-compatibility implications to deal with, and it's also a good time to address some old technical debt and add much-needed improvements. I'll leave out weight sharing for now, though, since it's a brand-new feature and not directly related.
✨ Description
Decouple the block interface from the transformer. This will make the SSM interface cleaner, improve readability, prevent bugs, simplify the implementation of future mixers, etc. It's also a step towards merging SSMs into the GPT model and towards varying block configurations (#242). I also included some groundwork for these tasks, so that upcoming PRs are smaller and simpler.
This PR is a pure refactor; changes that would raise backward-compatibility concerns are left for future work.
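For context, a minimal sketch of what such a decoupled block could look like, assuming placeholder names (this is not the actual Fast-LLM interface): a generic block composes any mixer with an MLP, so neither piece depends on transformer-specific config.

```python
import torch

class Block(torch.nn.Module):
    """Hypothetical generic block: a mixer (attention, SSM, ...) plus an MLP.
    Names and structure are illustrative, not the actual Fast-LLM classes."""

    def __init__(self, mixer: torch.nn.Module, mlp: torch.nn.Module, hidden_size: int):
        super().__init__()
        self.norm_1 = torch.nn.LayerNorm(hidden_size)
        self.mixer = mixer  # e.g. an attention or Mamba module
        self.norm_2 = torch.nn.LayerNorm(hidden_size)
        self.mlp = mlp

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual structure; the block never inspects the mixer type.
        hidden_states = hidden_states + self.mixer(self.norm_1(hidden_states))
        return hidden_states + self.mlp(self.norm_2(hidden_states))
```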
- Create a `block` submodule and move everything non-specific to transformers there from the `transformer` directory.
- Extract `BlockConfig`, `MLPConfig` and `AttentionConfig` from `TransformerConfig` (see the sketch below). Inheritance is used for backward compatibility, but composition would be preferable.
- Extract `BlockDimNames` and `BlockKwargs` from their transformer counterparts for common variables, and use inheritance in their specializations to simplify usage (e.g. we can systematically use `SSMDimNames` and `SSMKwargs` when dealing with SSMs).
- Remove `per_layer_lr_scale` (which came from incorrect inheritance of `BlockConfig`) and `normalization` from `SSMConfig`, since they would cause unexpected behaviour if defined.
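A minimal sketch of the config extraction described above, with purely illustrative fields (inheritance keeps the existing `TransformerConfig` surface intact for now, even though composition would be the cleaner target):

```python
from dataclasses import dataclass, field

@dataclass
class MLPConfig:
    # Illustrative fields only; the real classes live in the new `block` submodule.
    ffn_hidden_size: int = 4096
    activation: str = "gelu"

@dataclass
class AttentionConfig:
    num_attention_heads: int = 32
    head_size: int = 64

@dataclass
class BlockConfig:
    hidden_size: int = 1024
    mlp: MLPConfig = field(default_factory=MLPConfig)

@dataclass
class TransformerConfig(BlockConfig):
    # Inheriting keeps existing transformer config fields working unchanged;
    # composing AttentionConfig inside a block config would be preferable long term.
    attention: AttentionConfig = field(default_factory=AttentionConfig)
```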
🔍 Type of change
Select all that apply: