This repository contains the code for the paper "Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives" by Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati and Manuel Gomez-Rodriguez.
State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it—they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we introduce an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, to completely eliminate the financial incentive to strategize, we introduce a simple incentive-compatible token pricing mechanism. Under this mechanism, the price users pay for an output provided by a model depends on the number of characters of the output—they pay a fixed price per character. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Gemma and Mistral families, and input prompts from the LMSYS Chatbot Arena platform.
All the experiments were performed using Python 3.11.2. To create a virtual environment and install the project dependencies, run the following commands:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
├── data
│   └── LMSYS.txt
├── figures
│   ├── fixed_string
│   └── heur
├── notebooks
├── outputs
│   ├── cpt
│   ├── fixed_string
│   └── heuristic
├── scripts
│   ├── script_slurm_heur.sh
│   └── script_slurm_lmsys.sh
└── src
    ├── heuristic_misreporting.py
    ├── LMSYS_generation.py
    ├── tokenizations_fixed_plausible.py
    ├── tokenizations_fixed.py
    ├── tokenizations.py
    └── utils.py
- `data` contains the processed set of LMSYS prompts used in the experiments.
- `figures` contains all the figures presented in the paper.
- `notebooks` contains Python notebooks to generate all the figures included in the paper:
  - `plots_fixed.ipynb` plots Figure 1.
  - `plots_heur.ipynb` plots all LMSYS experiment figures.
  - `process_ds.ipynb` builds the LMSYS dataset.
  - `cpt.ipynb` returns the number of characters per token from the LMSYS generations.
  - `appendix_example.ipynb` generates the examples in Appendix C.2.
- `outputs` contains intermediate output files generated by the experiments' scripts and analyzed in the notebooks. They can be generated using the scripts in the `src` folder.
  - `cpt` contains answers generated to the LMSYS prompts, used to estimate the number of characters per token.
  - `fixed_string` contains the results of `tokenizations_fixed_plausible.py` used to generate Figure 1, that is, counts of plausible tokenizations for the strings `language models` and `causal inference`.
  - `heuristic` contains the results of running the heuristic algorithm `heuristic_misreporting.py`.
- `scripts` contains a set of scripts used to run all the experiments presented in the paper.
- `src` contains all the code necessary to reproduce the results in the paper. Specifically:
  - `heuristic_misreporting.py` is the main script used to create all figures (except Figure 1) in the paper. It implements the misreporting heuristic based on token indices, runs it on prompts (taken from the LMSYS dataset) for multiple iterations, determines the plausibility in the last step, and returns the number of plausible longer tokenizations found.
  - `tokenizations_fixed_plausible.py` is used to create the data for Figure 1 in the paper. It computes all tokenizations of an output string and all top-p/k plausible tokenizations, given a prompt.
  - `tokenizations_fixed.py` computes all tokenizations of an output string and determines whether the longest one is also the most likely, given a prompt.
  - `tokenizations.py` contains auxiliary functions for tokenization operations, including finding all possible tokenizations of a string, computing the cumulative autoregressive probability of a token sequence, and verifying whether a token sequence is top-p/k plausible.
  - `utils.py` contains additional auxiliary functions.
Our experiments use LLMs from the Llama, Gemma and Mistral families, which are "gated" models, that is, they require accepting a license agreement to use.
You can request access at: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct, https://huggingface.co/google/gemma-3-4b-it and https://huggingface.co/mistralai/Ministral-8B-Instruct-2410.
Once you have access, you can download any model in the Llama, Gemma and Mistral families.
Before running the scripts, you need to authenticate with your Hugging Face account by running huggingface-cli login in the terminal.
Each model should be downloaded to the models/ folder.
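For example, assuming a recent version of the Hugging Face CLI (huggingface-cli, shipped with the huggingface_hub package), a model can be fetched into the models/ folder as follows; the target directory name is only illustrative:

# Authenticate with a Hugging Face token that has access to the gated repositories,
# then download the model weights into the models/ folder.
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir models/Llama-3.2-3B-Instruct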
The script tokenizations_fixed_plausible.py generates the output needed to reproduce Figure 1 in the paper. It returns, for a given output string (and prompt), the number of top-p/k plausible tokenizations. To reproduce the figure, run the notebook plots_fixed.ipynb.
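For instance, assuming the script is run from the repository root with its default settings (any command-line options it accepts are defined in the script itself), its results end up in outputs/fixed_string, which the notebook then reads:

# Generate the data for Figure 1 (default settings assumed).
python src/tokenizations_fixed_plausible.py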
The script heuristic_misreporting.py generates the output needed to reproduce all figures (except Figure 1). You can run it in your local Python environment or submit it to a cluster using the Slurm script script_slurm_heur.sh, adapted to your particular machine specifications. Running it through script_slurm_heur.sh automatically uses the LMSYS prompts in the file LMSYS.txt. You can use the flag --model to set a specific model, such as meta-llama/Llama-3.2-1B-Instruct, --temperature to set the temperature, --p to set the top-p parameter, --prompts to pass a list of strings as prompts, and --splits to select how many iterations of the heuristic should be used; an example invocation is shown below.
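The following sketch shows what a local run might look like, using the flags listed above; the flag values and prompt strings are illustrative, and the exact way the script's argument parser expects the prompt list may differ slightly from this example.

# Example local run of the misreporting heuristic (illustrative values).
python src/heuristic_misreporting.py \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --temperature 0.7 \
    --p 0.9 \
    --prompts "What is causal inference?" "Explain tokenization." \
    --splits 10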
To reproduce all the figures, run the notebook plots_heur.ipynb.
If you have questions about the code, identify potential bugs, or would like us to include additional functionality, feel free to open an issue or contact Ander Artola Velasco.
If you use parts of the code in this repository for your own research, please consider citing:
@misc{velasco2025llmoverchargingyoutokenization,
title={Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives},
author={Ander Artola Velasco and Stratis Tsirtsis and Nastaran Okati and Manuel Gomez-Rodriguez},
year={2025},
eprint={2505.21627},
archivePrefix={arXiv},
primaryClass={cs.GT},
url={https://arxiv.org/abs/2505.21627},
}