Generate metrics from model's layers #63
Merged: JRosenkranz merged 62 commits into foundation-model-stack:main from flaviabeo:generate_metrics_layers on Jul 14, 2025.
Commits
All commits authored by flaviabeo:

- 134c3b8 Initial generate_layers_metrics version
- ce42e32 Initial version of inference with pre and post hooks
- c48e271 Checks output type
- f78909d Convert tensor method and save files
- 9b4853c Adds Cosine Similarity + prefix files
- f212b8a Adds dim=1 to Cosine sim
- c9c54a4 Removes extra space
- 315ff35 Adds layer IO mode to get_thresholds
- 09c76c2 Changes model_id to model_path
- 3afcac2 Fixes model_path assignment
- 3ce8e9c Save metrics to json
- 55b7811 Fix json results assignment
- 54017ac Adds python logger
- ace8dfe Fix logs
- 9934746 Adds env variable for LOG LEVEL
- 5e1f043 unsqueeze cosine similarity
- 02a01ce Fix same device for cosine similarity
- c7d5a40 Convert cos sim to list
- 3a96397 Test euclidean dist
- d68c52f Adds sample json output to layer th
- 7639b08 Merge branch 'main' into generate_metrics_layers
- eb0b866 Adds logging to th script
- dee632e Model forward mode
- 4576a3c Adds docs
- dc31192 Fix typos
- be1b8d8 Small detail changes
- 3ea4084 Prefix with sequence lenght on files' names
- ee32a6b Adds output path to the json th
- 45b6514 Catch StopIteration error
- a7732e9 Adds docstring to methods
- d35b521 Fix cosine similarity calculation
- d90b227 Fix print cpu output shape
- a6894ce Order result JSON for th
- e41bf20 Review fixes required
- 86b5fea Adds layer mode header
- 41b849a Includes head sub-tensors values
- db9b9fe Metric list shape
- d4aa817 Metric list shape
- b8d900c First part of review fixes requested
- 1c800e3 Help argsparse added
- f74cbbc Adds docs about the arg parse
- c8aed03 Modifies the th output json to all dicts
- fc249c5 Moves methods to utils
- 0f7f697 Small fix
- 12d8c9e Avg and mean for cosine similarity
- de8ee15 Fix avg and mean dict
- fbafe1d Fix avg and mean dict
- 96ed494 Fix find files with cos sim
- 5ae39a1 Fix layer names in json
- b10ed9d Updates sample result JSON
- ced6d31 Changes layer stack structure to dict
- a6f84bc Adds zero values handling
- 2fe9124 Merge branch 'main' into generate_metrics_layers
- 8b5a64f Fix infer method docstring
- b6bf42d Adds model path and saves all generate iteractions
- cc42d7d Save iters and read by layers
- 40e5924 Removes unused import
- 12946a0 Changes metric list to all generate iters
- 526c6d5 Improves layers th data structure
- 28c44f8 Fix th json
- d2e3d98 Add configurable sample requests to prepare inputs
- acde4fd Changes 0 values to small values (avoid nan)
# Layer Metrics Generation

Generate metrics by layer, to be used in tests and in model enablement debugging.

1. [Generate metrics by layer in GPU](./LAYERS.md#1-generate-metrics-by-layer)
2. [Get Thresholds](./LAYERS.md#2-get-thresholds)
3. [Apply metrics where needed](./LAYERS.md#3-apply-the-thresholds-where-needed)

The steps are shown in the diagram below:

*(diagram: layer metrics generation flow)*

To see the full integration with other debugging tools, check [item 3](./LAYERS.md#3-apply-the-thresholds-where-needed).
## 1. Generate Metrics by Layer

The idea is to run the prompts through the model with pre- and post-hooks added, then compute metrics for the outputs intercepted at each layer, as in the diagram below. This gives a CPU/GPU baseline that can serve as a failure threshold in AIU tests, the same idea as [test_decoders.py](https://github.com/foundation-model-stack/aiu-fms-testing-utils/blob/main/tests/models/test_decoders.py), but per layer. This way we can measure the discrepancies in the outputs and use the thresholds for detailed debugging of problems on AIU.

*(diagram: per-layer metric generation)*
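As a rough sketch of the interception idea (not the project's actual hook code; the toy model, layer names, and `captured` dict are illustrative assumptions), PyTorch forward hooks can record every sub-module's output during a single forward pass:

```python
import torch
import torch.nn as nn

# Capture each layer's output with forward hooks so that CPU and device runs
# can later be compared per layer. Purely illustrative toy model.
captured = {}

def make_hook(name):
    def hook(module, args, output):
        # store a detached copy so later comparisons don't touch the autograd graph
        captured[name] = output.detach()
    return hook

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for name, module in model.named_modules():
    if name:  # skip the root container itself
        module.register_forward_hook(make_hook(name))

out = model(torch.randn(1, 4))
# `captured` now holds one tensor per sub-module, keyed by module name
```

Running the same pass on CPU and on the device yields two such dictionaries whose entries can be compared metric by metric.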
The script [generate_layers_metrics.py](../scripts/generate_layers_metrics.py) requires the following environment variables:

```bash
export MODEL_PATHS=ibm-granite/granite-3.2-8b-instruct
export BATCH_SIZES=1
export SEQ_LENGTHS=64
export MAX_NEW_TOKENS=128
export OUTPUT_PATH=/tmp/output/granite
```

These variables support both single values and arrays.

The one required argument is `--mode`, the generation mode desired for the output. The choices are `generate` and `model-forward`.
- `generate` uses FMS [generate](../scripts/generate_layers_metrics.py#L118). It is a high-level API that wraps many operations: the forward pass, KV cache logic, sampling or greedy decoding, and post-processing.
```python
result = generate(
    model,
    ids,
    max_new_tokens=max_new_tokens,
    use_cache=use_cache,
    do_sample=do_sample,
    max_seq_len=max_seq_len,
    timing="e2e",
    eos_token_id=None,
    contiguous_cache=True,
    extra_kwargs={},
)
```
- `model-forward` calls [model.forward](../scripts/generate_layers_metrics.py#L135) directly, which avoids introducing noise from sampling, past key-value caching, etc.
```python
result = model.forward(
    ids,
    use_cache=use_cache,
)
```

### How to run

Once everything is set up, we can generate the CSV metrics:

```bash
cd aiu-fms-testing-utils/tests/resources

mkdir /tmp/output

python3 aiu-fms-testing-utils/scripts/generate_layers_metrics.py --mode generate
```

The files should be created in the `/tmp/output` dir:

```bash
ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers7.ln.abs_diff.csv
ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers7.ln.cos_sim.csv
ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers8.attn.dense.abs_diff.csv
ibm-granite--granite-3.2-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-0_dtype-float16--model.base_model.layers8.attn.dense.cos_sim.csv
```
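The `abs_diff` and `cos_sim` suffixes name the two metrics computed per layer. As a rough pure-Python illustration of what they measure (the example vectors are invented; the script itself operates on torch tensors):

```python
import math

def abs_diff(a, b):
    # element-wise absolute difference between two flattened layer outputs
    return [abs(x - y) for x, y in zip(a, b)]

def cos_sim(a, b, eps=1e-8):
    # cosine similarity between two flattened layer outputs;
    # eps guards against division by zero for all-zero outputs
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)

cpu_out = [0.10, 0.20, 0.30]  # hypothetical CPU layer output
gpu_out = [0.11, 0.19, 0.31]  # hypothetical GPU layer output
print(max(abs_diff(cpu_out, gpu_out)))  # worst element-wise deviation
print(cos_sim(cpu_out, gpu_out))
```

Identical outputs give a cosine similarity of 1.0 and an absolute difference of 0; the thresholds below quantify how far the device is allowed to drift from that.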

## 2. Get Thresholds

For the second step of the flow, to get the thresholds by layer, run:

```bash
cd /aiu-fms-testing-utils/tests/resources

python3 get_thresholds.py --models ibm-granite/granite-3.2-8b-instruct --metrics abs_diff cos_sim --file_base /tmp/output --layer_io
```
It should print the metrics for each layer:

```bash
Layer model.base_model.layers25.attn.in_proj.query avg abs_diff = 2.079996666484281
Layer model.base_model.layers25.attn.in_proj.key avg abs_diff = 1.2256532914682756
Layer model.base_model.layers25.attn.in_proj.value avg abs_diff = 0.8446561344670284
Layer model.base_model.layers25.attn.in_proj avg abs_diff = 0.0
Layer model.base_model.layers25.attn.dense avg abs_diff = 0.23142293885894077
Layer model.base_model.layers25.ff_ln avg abs_diff = 0.9550253005897409
Layer model.base_model.layers25.ff_sub_layer.wg avg abs_diff = 1.2256491705546648
Layer model.base_model.layers25.ff_sub_layer.a avg abs_diff = 0.5235781749861929
Layer model.base_model.layers25.ff_sub_layer.w1 avg abs_diff = 1.2707070667436549
Layer model.base_model.layers25.ff_sub_layer.w2 avg abs_diff = 0.5201997339672954
Layer model.base_model.layers25.ff_sub_layer avg abs_diff = 0.5201997339672954
Layer model.base_model.layers26.ln avg abs_diff = 0.04852477119171675
[...]
Layer model.base_model.layers39.attn.in_proj.query avg cos_sim = 0.999176025390625
Layer model.base_model.layers39.attn.in_proj.key avg cos_sim = 0.9991455078125
Layer model.base_model.layers39.attn.in_proj.value avg cos_sim = 0.9986572265625
Layer model.base_model.layers39.attn.in_proj avg cos_sim = 0.0
Layer model.base_model.layers39.attn.dense avg cos_sim = 0.9987258911132812
```

A JSON file is also saved to the same output dir. A sample file can be found at [sample_layer_th.json](https://github.com/flaviabeo/aiu-fms-testing-utils/blob/generate_metrics_layers/tests/resources/sample_layer_th.json).
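Conceptually, each printed average comes from pooling a layer's metric values across the generated files. A minimal pure-Python sketch of that aggregation (the rows and values are invented, not taken from real CSVs):

```python
from collections import defaultdict
from statistics import mean

# hypothetical per-iteration rows parsed from the CSV files:
# (layer name, metric, value)
rows = [
    ("model.base_model.layers25.attn.dense", "abs_diff", 0.20),
    ("model.base_model.layers25.attn.dense", "abs_diff", 0.26),
    ("model.base_model.layers26.ln", "abs_diff", 0.05),
]

grouped = defaultdict(list)
for layer, metric, value in rows:
    grouped[(layer, metric)].append(value)

# one threshold per (layer, metric): the mean over all collected values
thresholds = {key: mean(values) for key, values in grouped.items()}
for (layer, metric), avg in sorted(thresholds.items()):
    print(f"Layer {layer} avg {metric} = {avg}")
```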

## 3. Apply the thresholds where needed

In the AIU debugging tools, the thresholds are applied to compare AIU outputs with CPU outputs, asserting that the differences stay within the generated thresholds. Below is an architecture of the full integration:

*(diagram: full integration architecture)*

The box named `deepview layer debug` diagrams how the model layers' outputs are generated and compared against the CPU results. This matters so that the debug tools can catch operations and layers that have issues in their enablement for AIU hardware.
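A test harness consuming the generated thresholds might assert per-layer agreement along these lines (a hypothetical sketch; the function name and all values are invented for illustration):

```python
def within_threshold(aiu_out, cpu_out, max_abs_diff):
    # the worst element-wise deviation between AIU and CPU outputs
    # must stay under the layer's generated abs_diff threshold
    worst = max(abs(a - c) for a, c in zip(aiu_out, cpu_out))
    return worst <= max_abs_diff

# invented example: threshold resembling an avg abs_diff from step 2
layer_threshold = 0.2314
cpu_out = [0.50, -0.25, 0.10]
aiu_out = [0.48, -0.20, 0.12]
print(within_threshold(aiu_out, cpu_out, layer_threshold))
```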