
Conversation

wallashss

This PR fixes a reproducibility problem in inference.py when input prompts are loaded from files via --prompt_path.

The issue is that POSIX text files must be terminated with a newline character; when the script reads a prompt from a file it does not remove this trailing newline, which changes the tokenization of the prompt. The solution is simply to trim the prompt text on the right (with rstrip()).

The default behavior is to trim on the right, but users can disable this with --no_prompt_trim if they wish.
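
A minimal sketch of the trimming idea (the load_prompts helper and the trim_prompt flag below are illustrative names, not the PR's actual code):

from pathlib import Path

def load_prompts(prompt_dir: str, trim_prompt: bool = True) -> list[str]:
    """Read one prompt per file, optionally stripping trailing whitespace."""
    prompts = []
    for file_path in sorted(Path(prompt_dir).glob("*.txt")):
        text = file_path.read_text(encoding="utf-8")
        if trim_prompt:
            # POSIX text files end with a newline; stripping it keeps the
            # tokenization identical to passing the prompt string directly.
            text = text.rstrip()
        prompts.append(text)
    return prompts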

I also made a small change to the response format: the response contains the prompt as well, so I split the two apart, since they were somewhat difficult to read when concatenated.

Signed-off-by: Wallas Santos <wallashss@ibm.com>
parser.add_argument(
    '--no_prompt_trim',
    action="store_true",
    help="Disable rtrip() from input prompts defined in --prompt_path. "
Contributor

fix typo

if args.output_path != "":
    output_path = Path(args.output_path)
    output_path.mkdir(parents=True, exist_ok=True)
    if output_path.is_dir():
        file_path = output_path / f"{result_idx}.txt"
        with file_path.open("w", encoding="utf-8") as file:
            file.write(output_str + "\n")
dprint(output_str)
dprint(f"prompt #{result_idx}: \n'{input_str}'")
dprint(f"generation #{result_idx}: \n'{output_str}'")
Contributor

doesn't the output_str already contain the input_str?

Comment on lines +556 to +563
prompt_len = prompts_lens[result_idx]
if add_special_tokens:
    prompt_len -= 1
prompt = result[:prompt_len]
result = result[prompt_len:]
input_str = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(prompt)
)
Contributor

add comment to code why this needs to happen
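
One possible set of comments for this snippet, reflecting my reading of it (the rationale for the -1 adjustment is an assumption, not confirmed by the PR):

# result holds the prompt's token ids followed by the generated token ids,
# so split it to display the prompt and the generation separately.
prompt_len = prompts_lens[result_idx]
if add_special_tokens:
    # assumption: a special token (e.g. BOS) counted in prompts_lens has
    # already been trimmed from result, so the prompt slice is one shorter
    prompt_len -= 1
prompt = result[:prompt_len]
result = result[prompt_len:]
input_str = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(prompt)
)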

result = generation.trim_prefix(result, tokenizer.bos_token_id)

Contributor

remove extra space

if has_padding:
    result = generation.trim_prefix(result)

Contributor

remove extra space

@@ -460,11 +467,17 @@ def truncate_prompts_to_max_length(prompts, max_len, max_allowed_length):
len(prompt_file_paths) >= args.batch_size
), f"Not enough prompt files at {prompt_path} for a batch size of {args.batch_size}"

no_prompt_trim = args.no_prompt_trim
Contributor

why define variable here if it's only used once?

Contributor

@ani300 left a comment

asking for some changes to code style and formatting, as well as some clarifications

@tharapalanivel
Collaborator

Hi @wallashss, is this change still required and actively being worked on? Thank you

@wallashss
Author

Hey, these were just minor improvements I thought of the first time I used these scripts. I don't think they are needed anymore, and I don't have the bandwidth to look at them again. You can close this if you wish. Thank you!
