
Generate metrics from model's layers #63


Open · wants to merge 61 commits into main from generate_metrics_layers
Conversation

flaviabeo
Contributor

@flaviabeo flaviabeo commented Jun 16, 2025

This adds the piece for generating metrics by layer. We can leverage get_thresholds.py with some modifications, so that the mean diff can later be used in the pytests.

[image: metrics]

The idea is to run the prompts through the model with pre- and post-hooks added, then compute metrics for the outputs intercepted at each layer, as in the diagram below. That gives us a CPU/GPU baseline from which to set failure thresholds for AIU tests. It is the same idea as test_decoders.py, but per layer, so we can measure output discrepancies and use the thresholds for detailed debugging of problems on AIU.

[image: metrics_fms_deepview_integration]
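For reference, a minimal sketch of the hook-and-compare idea (capture_layer_outputs and layer_metrics are illustrative names, not the actual generate_layers_metrics.py implementation):

import torch
import torch.nn.functional as F

def capture_layer_outputs(model):
    # Register forward hooks on leaf modules; collect outputs keyed by layer name
    outputs, handles = {}, []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:
            def make_hook(layer_name):
                def hook(_module, _inputs, output):
                    if isinstance(output, torch.Tensor):
                        outputs[layer_name] = output.detach().float().cpu()
                return hook
            handles.append(module.register_forward_hook(make_hook(name)))
    return outputs, handles

def layer_metrics(reference, candidate):
    # Mean absolute difference and cosine similarity per captured layer
    metrics = {}
    for name, ref in reference.items():
        ref, cand = ref.flatten(), candidate[name].flatten()
        metrics[name] = {
            "abs_diff": (ref - cand).abs().mean().item(),
            "cos_sim": F.cosine_similarity(ref, cand, dim=0).item(),
        }
    return metrics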

How to run

  • Generate the CSV metrics:

cd aiu-fms-testing-utils/tests/resources
mkdir sharegpt
cd sharegpt
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
mkdir /tmp/output
python3 aiu-fms-testing-utils/scripts/generate_layers_metrics.py --mode generate

Files should be created in the /tmp/output dir:

ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers7.ln.abs_diff.csv
ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers7.ln.cos_sim.csv
ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers8.attn.dense.abs_diff.csv
ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers8.attn.dense.cos_sim.csv
  • Get the thresholds (WIP):

cd /aiu-fms-testing-utils/tests/resources
python3 get_thresholds.py --models ibm-granite/granite-3.2-8b-instruct --metrics abs_diff cos_sim --file_base /tmp/output --layer_io

This should print the average metric for each layer:

Layer model.base_model.layers25.attn.in_proj.query avg abs_diff = 2.079996666484281
Layer model.base_model.layers25.attn.in_proj.key avg abs_diff = 1.2256532914682756
Layer model.base_model.layers25.attn.in_proj.value avg abs_diff = 0.8446561344670284
Layer model.base_model.layers25.attn.in_proj avg abs_diff = 0.0
Layer model.base_model.layers25.attn.dense avg abs_diff = 0.23142293885894077
Layer model.base_model.layers25.ff_ln avg abs_diff = 0.9550253005897409
Layer model.base_model.layers25.ff_sub_layer.wg avg abs_diff = 1.2256491705546648
Layer model.base_model.layers25.ff_sub_layer.a avg abs_diff = 0.5235781749861929
Layer model.base_model.layers25.ff_sub_layer.w1 avg abs_diff = 1.2707070667436549
Layer model.base_model.layers25.ff_sub_layer.w2 avg abs_diff = 0.5201997339672954
Layer model.base_model.layers25.ff_sub_layer avg abs_diff = 0.5201997339672954
Layer model.base_model.layers26.ln avg abs_diff = 0.04852477119171675
[...]
Layer model.base_model.layers39.attn.in_proj.query avg cos_sim = 0.999176025390625
Layer model.base_model.layers39.attn.in_proj.key avg cos_sim = 0.9991455078125
Layer model.base_model.layers39.attn.in_proj.value avg cos_sim = 0.9986572265625
Layer model.base_model.layers39.attn.in_proj avg cos_sim = 0.0
Layer model.base_model.layers39.attn.dense avg cos_sim = 0.9987258911132812

Also, a JSON file is saved to the same output dir. A sample file can be found at: sample_layer_th.json
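For context, the per-layer aggregation amounts to something like the sketch below (the CSV naming convention, the one-value-per-line layout, and the output JSON name are assumptions based on the file names above, not the actual get_thresholds.py code):

import glob
import json
import os
import numpy as np

def summarize_layer_metric(file_base, metric="abs_diff"):
    # Average each layer's metric across all matching CSV files
    thresholds = {}
    for path in glob.glob(os.path.join(file_base, f"*{metric}.csv")):
        # layer name assumed to sit between the last "--" and the metric suffix
        layer = os.path.basename(path).split("--")[-1].replace(f".{metric}.csv", "")
        values = np.loadtxt(path, delimiter=",", ndmin=1)
        thresholds[layer] = float(np.mean(values))
        print(f"Layer {layer} avg {metric} = {thresholds[layer]}")
    with open(os.path.join(file_base, f"layer_th_{metric}.json"), "w") as f:
        json.dump(thresholds, f, indent=2)
    return thresholds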

@flaviabeo flaviabeo force-pushed the generate_metrics_layers branch 4 times, most recently from 26f5a26 to f644864 on June 20, 2025 00:17
@flaviabeo flaviabeo force-pushed the generate_metrics_layers branch from f644864 to f212b8a on June 20, 2025 00:19
print("abs_diff list appended")
print(len(absolute_differences))

prefix = get_default_validation_prefix(model_id, max_new_token, batch_size, 0, 'float16')


model_id is not defined in the function. This looks like it might be a bug.
I propose changing model_id to model_path, since that is what you use throughout the code.

prefix = get_default_validation_prefix(model_path, max_new_token, batch_size, 0, 'float16')

input_ids, padding_kwargs = pad_input_ids(prompt_list, min_pad_length=seq_length)
return input_ids, padding_kwargs

def __infer_layer(warmup, model, max_len, device,


Add with torch.no_grad(): inside __infer_layer to prevent unnecessary autograd graph construction.
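Something along these lines (the surrounding signature is abbreviated in the diff, so the arguments shown here are placeholders):

import torch

def __infer_layer(warmup, model, max_len, device, input_ids, padding_kwargs):
    # no_grad avoids building an autograd graph for the activations the hooks capture
    with torch.no_grad():
        result = model(input_ids, **padding_kwargs)
    return result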

# without ntk scaling, extending the seq length too far gives bogus results.
max_seq_len = model.config.max_expected_seq_len

result = generate(


I think it is better to use model.forward() instead of generate to:

  • Trigger one full forward pass without sampling or token iteration.
  • See all intermediate activations, since hooks will fire exactly once per layer.
  • Avoid introducing noise from sampling, past key caching, etc.

Using generate alone may mask individual issues inside specific layers because it's a high-level API that wraps many operations (forward pass, KV cache logic, sampling or greedy decoding, post-processing). It may skip certain branches inside model.forward() depending on decoding logic (e.g., only the decoder path, or only the first token). It may also use optimized inference paths (e.g., with contiguous_cache=True) that bypass logic like residual addition, attention masking, or past key handling.

@jjhursey is this the way that you typically compare between models at a macro level, relying only on generate?
Let us know what you think.
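For illustration, the forward-only path could be as simple as this (a sketch; how the padding kwargs from pad_input_ids are passed to the FMS model is an assumption here):

import torch

def run_forward_only(model, input_ids, padding_kwargs):
    # A single forward pass per batch: every registered hook fires exactly once
    # per layer, with no sampling, KV caching, or per-token iteration involved
    model.eval()
    with torch.no_grad():
        logits = model(input_ids, **padding_kwargs)
    return logits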

Contributor Author


Thank you! Added the model.forward mode.


if not warmup:
    for i in range(result.shape[0]):
        print(result[i])


In the script, we have several print() statements with no control over log levels, formatting, or file redirection. It would be better to have a structured logging interface using dprint() or Python's built-in logging.
What logging mechanism does FMS use? @jjhursey @ani300
example:

import logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

logger.info("Saving file...")
logger.debug(f"Layer: {layer}, Output: {output}")
logger.warning("Some layers were skipped due to missing output")

flaviabeo added 14 commits June 30, 2025 17:55
@flaviabeo flaviabeo marked this pull request as ready for review July 2, 2025 18:55
flaviabeo added 2 commits July 2, 2025 16:29

if "generate" in mode:
with torch.no_grad():
result = generate(
Contributor


Is this saving all of the layers per iteration of generate?

Contributor Author


The way I wrote it, it saves by layer name, so the first completed iteration gets its output measured and saved in CSV files. Do you think we should save by iteration as well?

Contributor


Yes, for this I'd think we would want to save per iteration of generate; otherwise we are only saving prefill, which is essentially the same as just running the forward case.
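Something like this could work (a sketch; the hook wiring and the iteration counter are illustrative, not the current code):

from collections import defaultdict

import torch

captured = defaultdict(dict)   # {iteration: {layer_name: tensor}}
state = {"iteration": 0}

def make_hook(layer_name):
    def hook(_module, _inputs, output):
        if isinstance(output, torch.Tensor):
            captured[state["iteration"]][layer_name] = output.detach().cpu()
    return hook

# Inside the decode loop of generate, advance the counter after each new token:
#   state["iteration"] += 1
# so prefill lands in iteration 0 and every decode step is recorded separately.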

flaviabeo added 15 commits July 9, 2025 12:24
@@ -166,3 +167,19 @@ def sample_squad_v2_qa_requests(
prompt_length_max,
seed,
)

def prepare_inputs(batch_size, seq_length, tokenizer, sharegpt_path, seed=0):
Contributor


Can we add documentation for this? Also, is it possible to make the sample_requests configurable (as we have for the other sample requests methods)?
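For the documentation part, a docstring along these lines could work (a draft based on the surrounding code, not final wording):

def prepare_inputs(batch_size, seq_length, tokenizer, sharegpt_path, seed=0):
    """Sample ShareGPT prompts and pad them into model-ready inputs.

    Args:
        batch_size: number of prompts to sample from the dataset.
        seq_length: minimum padded sequence length passed to pad_input_ids.
        tokenizer: tokenizer used to encode the sampled prompts.
        sharegpt_path: path to the ShareGPT_V3_unfiltered_cleaned_split.json file.
        seed: RNG seed so the same prompts are sampled across runs.

    Returns:
        Tuple of (input_ids, padding_kwargs) ready for model.forward or generate.
    """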

Contributor Author

@flaviabeo flaviabeo Jul 11, 2025


Sure! I made the changes to the other files to use the utils' method in this other PR #77. I thought it would be better not to mix those changes in here; once this one is merged, I can rebase the other PR. Is this ok?
