Generate metrics from model's layers #63
base: main
Conversation
Force-pushed from 26f5a26 to f644864
Force-pushed from f644864 to f212b8a
scripts/generate_layers_metrics.py
Outdated
print("abs_diff list appended") | ||
print(len(absolute_differences)) | ||
|
||
prefix = get_default_validation_prefix(model_id, max_new_token, batch_size, 0, 'float16') |
model_id is not defined in this function; this looks like it might be a bug.
I propose changing model_id to model_path, since that is what is used throughout the code:
prefix = get_default_validation_prefix(model_path, max_new_token, batch_size, 0, 'float16')
scripts/generate_layers_metrics.py
Outdated
input_ids, padding_kwargs = pad_input_ids(prompt_list, min_pad_length=seq_length)
return input_ids, padding_kwargs

def __infer_layer(warmup, model, max_len, device,
Add with torch.no_grad(): inside __infer_layer to prevent unnecessary autograd graph construction.
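A minimal sketch of the suggested pattern, with a simplified signature (the argument names here are illustrative, not the script's actual ones):

```python
import torch

def __infer_layer(warmup, model, max_len, device, input_ids, padding_kwargs):
    # Disable autograd so the forward pass does not build a gradient graph;
    # this saves memory and time in an inference-only measurement run.
    with torch.no_grad():
        return model.forward(input_ids, **padding_kwargs)
```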
scripts/generate_layers_metrics.py
Outdated
# without ntk scaling, extending the seq length too far gives bogus results.
max_seq_len = model.config.max_expected_seq_len

result = generate(
I think it is better to use model.forward() instead of generate to:
- Trigger one full forward pass without sampling or token iteration.
- See all intermediate activations, since hooks will fire exactly once per layer.
- Avoid introducing noise from sampling, past key caching, etc.
Using generate alone may mask issues inside specific layers, because it is a high-level API that wraps many operations (forward pass, KV cache logic, sampling or greedy decoding, post-processing). It may skip certain branches inside model.forward() depending on the decoding logic (e.g., only the decoder path, or only the first token). It may also use optimized inference paths (e.g., with contiguous_cache=True) that bypass logic like residual addition, attention masking, or past key handling.
@jjhursey is this how you typically compare models at a macro level, relying only on generate?
Let us know what you think.
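A sketch of what a single hooked forward pass could look like, assuming forward hooks are registered on every module (the bookkeeping names are illustrative, and model, input_ids, and padding_kwargs are assumed to come from the surrounding script):

```python
import torch

captured = {}

def make_hook(name):
    # Record each layer's output; with a single forward pass each hook fires once.
    def hook(module, inputs, output):
        captured[name] = output
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in model.named_modules()
]

with torch.no_grad():
    # One full forward pass: no sampling, no KV-cache iteration.
    logits = model.forward(input_ids, **padding_kwargs)

for handle in handles:
    handle.remove()
```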
Thank you! Added the model.forward mode.
scripts/generate_layers_metrics.py
Outdated
if not warmup:
    for i in range(result.shape[0]):
        print(result[i])
In the script we have several print() statements with no control over log levels, formatting, or file redirection. It would be better to have a structured logging interface using dprint() or Python's built-in logging module.
What is the logging mechanism that FMS uses? @jjhursey @ani300
Example:
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
logger.info("Saving file...")
logger.debug(f"Layer: {layer}, Output: {output}")
logger.warning("Some layers were skipped due to missing output")
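The level could also be made configurable, for example via an environment variable (a sketch; the LOG_LEVEL variable name is an assumption, not something the script currently defines):

```python
import logging
import os

# e.g. LOG_LEVEL=DEBUG python3 generate_layers_metrics.py
log_level = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, log_level, logging.INFO),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)
```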
if "generate" in mode:
    with torch.no_grad():
        result = generate(
Is this saving all of the layers per iteration of generate?
The way I wrote it, it saves by layer name, so only the first completed iteration has its output measured and saved in CSV files. Do you think we should save by iteration as well?
Yes, for this I think we would want to save per iteration of generate; otherwise we are only saving prefill, which is essentially the same as just running the forward case.
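One way to keep the decode steps as well would be to append every firing of each layer's hook, so the list index corresponds to the generate step (a sketch under that assumption, not the PR's current code):

```python
from collections import defaultdict

# Every firing of a layer's hook is kept; index i in the list then
# corresponds to the i-th step of generate (prefill plus each decode step).
layer_outputs = defaultdict(list)

def make_hook(layer_name):
    def hook(module, inputs, output):
        layer_outputs[layer_name].append(output)
    return hook

# After generate() finishes, metrics can be written per (layer, step):
# for layer_name, outputs in layer_outputs.items():
#     for step, out in enumerate(outputs):
#         write_csv_row(layer_name, step, out)  # hypothetical helper
```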
@@ -166,3 +167,19 @@ def sample_squad_v2_qa_requests(
        prompt_length_max,
        seed,
    )

def prepare_inputs(batch_size, seq_length, tokenizer, sharegpt_path, seed=0):
Can we add documentation for this? Also, is it possible to make the sample requests configurable (as we have other sample-request methods)?
Sure! I made the changes for the other files to use the utils method in PR #77. I thought it would be better not to mix those changes in here; once this is merged I can rebase the other PR. Is this ok?
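For reference, a sketch of what the documented helper could look like (the docstring wording is only a suggestion, and the body is elided):

```python
def prepare_inputs(batch_size, seq_length, tokenizer, sharegpt_path, seed=0):
    """Build a padded batch of input ids from ShareGPT-style prompts.

    Args:
        batch_size: number of prompts to sample.
        seq_length: pad/truncate length for each prompt.
        tokenizer: tokenizer used to encode the sampled prompts.
        sharegpt_path: path to the ShareGPT dataset file.
        seed: RNG seed so the sampled prompts are reproducible.

    Returns:
        Tuple of (input_ids, padding_kwargs) ready to pass to the model.
    """
    ...
```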
This adds the piece for generating metrics by layer. We can leverage get_thresholds.py, with some modifications, to later use the mean diff in the pytests.


The idea is to run the prompts through the model with pre- and post-hooks added, and then compute the metrics for the outputs intercepted at each layer, as in the diagrams attached to this PR. Then we can have a CPU/GPU baseline for a failure threshold in AIU tests. It is the same idea as test_decoders.py, but for each layer. This way we can measure the discrepancies in the outputs and use the thresholds for detailed debugging of problems on AIU.
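At a high level, the flow could be sketched like this (the hook wiring and metric computation are simplified assumptions, not the script's exact code):

```python
import torch

def register_hooks(model, store):
    # Capture each layer's output so CPU and device runs can be compared layer by layer.
    handles = []
    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(
            lambda mod, inp, out, name=name: store.setdefault(name, []).append(out)
        ))
    return handles

cpu_out, dev_out = {}, {}
# ... run the same prompts on CPU and on the device, each with hooks registered ...

for name in cpu_out:
    a = cpu_out[name][0].float()
    b = dev_out[name][0].float().to("cpu")
    abs_diff = (a - b).abs().mean().item()
    cos_sim = torch.nn.functional.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()
    # These per-layer values are what get_thresholds.py aggregates into thresholds.
```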
How to run
Files should get created at the /tmp/output dir:
cd /aiu-fms-testing-utils/tests/resources
python3 get_thresholds.py --models ibm-granite/granite-3.2-8b-instruct --metrics abs_diff cos_sim --file_base /tmp/output --layer_io
This should print the metric of each layer. Also, a JSON file is saved to the same output dir; a sample file can be found at sample_layer_th.json.
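Downstream, the tests could then load the per-layer thresholds and compare them against the measured diffs, along these lines (the JSON schema and helper names are assumptions, not the actual sample_layer_th.json format):

```python
import json

def load_layer_thresholds(path):
    # Hypothetical structure: {"<layer name>": {"abs_diff": float, "cos_sim": float}, ...}
    with open(path) as f:
        return json.load(f)

def check_layer(layer_name, measured_abs_diff, thresholds):
    # Fail the layer if its mean absolute difference exceeds the stored baseline.
    assert measured_abs_diff <= thresholds[layer_name]["abs_diff"], (
        f"{layer_name}: abs_diff {measured_abs_diff} exceeds threshold"
    )
```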